1
1
Data Mining:
Concepts and Techniques
(3rd
ed.)
— Chapter 1 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
2
Chapter 1. Introduction
 Why Data Mining?
 What Is Data Mining?
 A Multi-Dimensional View of Data Mining
 What Kind of Data Can Be Mined?
 What Kinds of Patterns Can Be Mined?
 What Technologies Are Used?
 What Kinds of Applications Are Targeted?
 Major Issues in Data Mining
 A Brief History of Data Mining and Data Mining Society
 Summary
3
Why Data Mining?
 The Explosive Growth of Data: from terabytes to petabytes
 Data collection and data availability

Automated data collection tools, database systems, Web,
computerized society
 Major sources of abundant data

Business: Web, e-commerce, transactions, stocks, …

Science: Remote sensing, bioinformatics, scientific
simulation, …

Society and everyone: news, digital cameras, YouTube
 We are drowning in data, but starving for knowledge!
 “Necessity is the mother of invention”—Data mining—Automated
analysis of massive data sets
4
Evolution of Sciences
 Before 1600, empirical science
 1600-1950s, theoretical science

Each discipline has grown a theoretical component. Theoretical models often
motivate experiments and generalize our understanding.
 1950s-1990s, computational science
 Over the last 50 years, most disciplines have grown a third, computational
branch (e.g. empirical, theoretical, and computational ecology, or physics, or
linguistics.)
 Computational Science traditionally meant simulation. It grew out of our
inability to find closed-form solutions for complex mathematical models.
 1990-now, data science
 The flood of data from new scientific instruments and simulations
 The ability to economically store and manage petabytes of data online

The Internet and computing Grid that makes all these archives universally
accessible
 Scientific info. management, acquisition, organization, query, and visualization
tasks scale almost linearly with data volumes. Data mining is a major new
challenge!
 Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science,
5
Evolution of Database Technology
 1960s:
 Data collection, database creation, IMS and network DBMS
 1970s:
 Relational data model, relational DBMS implementation
 1980s:
 RDBMS, advanced data models (extended-relational, OO, deductive,
etc.)
 Application-oriented DBMS (spatial, scientific, engineering, etc.)
 1990s:
 Data mining, data warehousing, multimedia databases, and Web
databases
 2000s
 Stream data management and mining
 Data mining and its applications
 Web technology (XML, data integration) and global information
systems
6
Chapter 1. Introduction
 Why Data Mining?
 What Is Data Mining?
 A Multi-Dimensional View of Data Mining
 What Kind of Data Can Be Mined?
 What Kinds of Patterns Can Be Mined?
 What Technologies Are Used?
 What Kinds of Applications Are Targeted?
 Major Issues in Data Mining
 A Brief History of Data Mining and Data Mining Society
 Summary
7
What Is Data Mining?
 Data mining (knowledge discovery from data)
 Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge
from huge amount of data
 Data mining: a misnomer?
 Alternative names
 Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting, business
intelligence, etc.
 Watch out: Is everything “data mining”?
 Simple search and query processing
 (Deductive) expert systems
8
Knowledge Discovery (KDD) Process
 This is a view from typical
database systems and data
warehousing communities
 Data mining plays an essential
role in the knowledge discovery
process
Databases → Data Cleaning / Data Integration → Data Warehouse → Selection of
task-relevant data → Data Mining → Pattern Evaluation
9
Example: A Web Mining Framework
 Web mining usually involves
 Data cleaning
 Data integration from multiple sources
 Warehousing the data
 Data cube construction
 Data selection for data mining
 Data mining
 Presentation of the mining results
 Patterns and knowledge to be used or stored into
knowledge-base
10
Data Mining in Business Intelligence
(Pyramid, bottom to top, with increasing potential to support business decisions)
 Data Sources: paper, files, Web documents, scientific experiments, database systems
 Data Preprocessing/Integration, Data Warehouses — DBA
 Data Exploration: statistical summary, querying, and reporting — Data Analyst
 Data Mining: information discovery — Data Analyst
 Data Presentation: visualization techniques — Business Analyst
 Decision Making — End User
11
Example: Mining vs. Data Exploration
 Business intelligence view
 Warehouse, data cube, reporting but not much
mining
 Business objects vs. data mining tools
 Supply chain example: tools
 Data presentation
 Exploration
12
KDD Process: A Typical View from ML and
Statistics
 This is a view from typical machine learning and statistics communities
Input Data → Data Pre-Processing → Data Mining → Post-Processing
 Data pre-processing: data integration, normalization, feature selection, dimension reduction
 Data mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
 Post-processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
13
Example: Medical Data Mining
 Health care & medical data mining often adopt this view
from statistics and machine learning
 Preprocessing of the data (including feature
extraction and dimension reduction)
 Classification or/and clustering processes
 Post-processing for presentation
14
Chapter 1. Introduction
 Why Data Mining?
 What Is Data Mining?
 A Multi-Dimensional View of Data Mining
 What Kind of Data Can Be Mined?
 What Kinds of Patterns Can Be Mined?
 What Technologies Are Used?
 What Kinds of Applications Are Targeted?
 Major Issues in Data Mining
 A Brief History of Data Mining and Data Mining Society
 Summary
15
Multi-Dimensional View of Data Mining
 Data to be mined
 Database data (extended-relational, object-oriented,
heterogeneous, legacy), data warehouse, transactional data,
stream, spatiotemporal, time-series, sequence, text and web,
multi-media, graphs & social and information networks
 Knowledge to be mined (or: Data mining functions)
 Characterization, discrimination, association, classification,
clustering, trend/deviation, outlier analysis, etc.
 Descriptive vs. predictive data mining
 Multiple/integrated functions and mining at multiple levels
 Techniques utilized
 Data-intensive, data warehouse (OLAP), machine learning,
statistics, pattern recognition, visualization, high-performance,
etc.
 Applications adapted
 Retail, telecommunication, banking, fraud analysis, bio-data
mining, stock market analysis, text mining, Web mining, etc.
16
Chapter 1. Introduction
 Why Data Mining?
 What Is Data Mining?
 A Multi-Dimensional View of Data Mining
 What Kind of Data Can Be Mined?
 What Kinds of Patterns Can Be Mined?
 What Technologies Are Used?
 What Kinds of Applications Are Targeted?
 Major Issues in Data Mining
 A Brief History of Data Mining and Data Mining Society
 Summary
17
Data Mining: On What Kinds of Data?
 Database-oriented data sets and applications
 Relational database, data warehouse, transactional database
 Advanced data sets and advanced applications
 Data streams and sensor data
 Time-series data, temporal data, sequence data (incl. bio-sequences)
 Structured data, graphs, social networks and multi-linked data
 Object-relational databases
 Heterogeneous databases and legacy databases
 Spatial data and spatiotemporal data
 Multimedia database
 Text databases
 The World-Wide Web
18
Chapter 1. Introduction
 Why Data Mining?
 What Is Data Mining?
 A Multi-Dimensional View of Data Mining
 What Kind of Data Can Be Mined?
 What Kinds of Patterns Can Be Mined?
 What Technologies Are Used?
 What Kinds of Applications Are Targeted?
 Major Issues in Data Mining
 A Brief History of Data Mining and Data Mining Society
 Summary
19
Data Mining Function: (1) Generalization
 Information integration and data warehouse
construction
 Data cleaning, transformation, integration, and
multidimensional data model
 Data cube technology
 Scalable methods for computing (i.e., materializing)
multidimensional aggregates
 OLAP (online analytical processing)
 Multidimensional concept description:
Characterization and discrimination
 Generalize, summarize, and contrast data
characteristics, e.g., dry vs. wet region
20
Data Mining Function: (2) Association and
Correlation Analysis
 Frequent patterns (or frequent itemsets)
 What items are frequently purchased together in
your Walmart?
 Association, correlation vs. causality
 A typical association rule

Diaper → Beer [0.5%, 75%] (support, confidence)
 Are strongly associated items also strongly
correlated?
 How to mine such patterns and rules efficiently in
large datasets?
 How to use such patterns for classification, clustering, and other applications? (A rough support/confidence sketch follows below.)
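As a rough illustration (not from the slides; the toy transactions below are hypothetical), a minimal Python sketch of how support and confidence for a rule such as Diaper → Beer can be computed:

```python
# Minimal sketch: support and confidence of one candidate rule over
# a hypothetical set of market-basket transactions.

transactions = [
    {"diaper", "beer", "bread"},
    {"diaper", "beer", "milk"},
    {"diaper", "coke"},
    {"beer", "bread"},
    {"diaper", "beer", "milk", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    """support(lhs ∪ rhs) / support(lhs)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({"diaper", "beer"}, transactions))               # 0.6
print(confidence({"diaper"}, {"beer"}, transactions))          # 0.75
```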
21
Data Mining Function: (3) Classification
 Classification and label prediction
 Construct models (functions) based on some training
examples
 Describe and distinguish classes or concepts for future
prediction

E.g., classify countries based on (climate), or classify cars
based on (gas mileage)
 Predict some unknown class labels
 Typical methods
 Decision trees, naïve Bayesian classification, support vector
machines, neural networks, rule-based classification, pattern-
based classification, logistic regression, …
 Typical applications:
 Credit card fraud detection, direct marketing, classifying stars,
diseases, web-pages, …
22
Data Mining Function: (4) Cluster Analysis
 Unsupervised learning (i.e., Class label is unknown)
 Group data to form new categories (i.e., clusters), e.g.,
cluster houses to find distribution patterns
 Principle: Maximizing intra-class similarity & minimizing
interclass similarity
 Many methods and applications
23
Data Mining Function: (5) Outlier Analysis
 Outlier analysis
 Outlier: A data object that does not comply with the general
behavior of the data
 Noise or exception? ― One person’s garbage could be another
person’s treasure
 Methods: by-product of clustering or regression analysis, …
 Useful in fraud detection, rare events analysis
24
Time and Ordering: Sequential Pattern,
Trend and Evolution Analysis
 Sequence, trend and evolution analysis
 Trend, time-series, and deviation analysis: e.g.,
regression and value prediction
 Sequential pattern mining

e.g., first buy digital camera, then buy large SD
memory cards
 Periodicity analysis
 Motifs and biological sequence analysis

Approximate and consecutive motifs
 Similarity-based analysis
 Mining data streams
 Ordered, time-varying, potentially infinite data
streams
25
Structure and Network Analysis
 Graph mining
 Finding frequent subgraphs (e.g., chemical compounds), trees
(XML), substructures (web fragments)
 Information network analysis
 Social networks: actors (objects, nodes) and relationships
(edges)

e.g., author networks in CS, terrorist networks
 Multiple heterogeneous networks

A person could be in multiple information networks: friends,
family, classmates, …
 Links carry a lot of semantic information: Link mining
 Web mining
 Web is a big information network: from PageRank to Google
 Analysis of Web information networks

Web community discovery, opinion mining, usage mining,
…
26
Evaluation of Knowledge
 Are all mined knowledge interesting?
 One can mine a tremendous amount of “patterns” and
knowledge
 Some may fit only certain dimension space (time, location, …)
 Some may not be representative, may be transient, …
 Evaluation of mined knowledge → directly mine only
interesting knowledge?
 Descriptive vs. predictive
 Coverage
 Typicality vs. novelty
 Accuracy
 Timeliness
 …
27
Chapter 1. Introduction
 Why Data Mining?
 What Is Data Mining?
 A Multi-Dimensional View of Data Mining
 What Kind of Data Can Be Mined?
 What Kinds of Patterns Can Be Mined?
 What Technologies Are Used?
 What Kinds of Applications Are Targeted?
 Major Issues in Data Mining
 A Brief History of Data Mining and Data Mining Society
 Summary
28
Data Mining: Confluence of Multiple Disciplines
(Figure: data mining at the confluence of machine learning, statistics, pattern
recognition, database technology, visualization, high-performance computing,
algorithms, and applications)
29
Why Confluence of Multiple Disciplines?
 Tremendous amount of data
Algorithms must be highly scalable to handle terabytes
of data
 High-dimensionality of data
 Microarray data may have tens of thousands of dimensions
 High complexity of data
 Data streams and sensor data
 Time-series data, temporal data, sequence data
 Structured data, graphs, social networks and multi-linked data
 Heterogeneous databases and legacy databases
 Spatial, spatiotemporal, multimedia, text and Web data
 Software programs, scientific simulations
 New and sophisticated applications
30
Chapter 1. Introduction
 Why Data Mining?
 What Is Data Mining?
 A Multi-Dimensional View of Data Mining
 What Kind of Data Can Be Mined?
 What Kinds of Patterns Can Be Mined?
 What Technologies Are Used?
 What Kinds of Applications Are Targeted?
 Major Issues in Data Mining
 A Brief History of Data Mining and Data Mining Society
 Summary
31
Applications of Data Mining
 Web page analysis: from web page classification, clustering to
PageRank & HITS algorithms
 Collaborative analysis & recommender systems
 Basket data analysis to targeted marketing
 Biological and medical data analysis: classification, cluster analysis
(microarray data analysis), biological sequence analysis, biological
network analysis
 Data mining and software engineering (e.g., IEEE Computer, Aug.
2009 issue)
 From major dedicated data mining systems/tools (e.g., SAS, MS
SQL-Server Analysis Manager, Oracle Data Mining Tools) to
invisible data mining
32
Chapter 1. Introduction
 Why Data Mining?
 What Is Data Mining?
 A Multi-Dimensional View of Data Mining
 What Kind of Data Can Be Mined?
 What Kinds of Patterns Can Be Mined?
 What Technologies Are Used?
 What Kinds of Applications Are Targeted?
 Major Issues in Data Mining
 A Brief History of Data Mining and Data Mining Society
 Summary
33
Major Issues in Data Mining (1)
 Mining Methodology
 Mining various and new kinds of knowledge
 Mining knowledge in multi-dimensional space
 Data mining: An interdisciplinary effort
 Boosting the power of discovery in a networked environment
 Handling noise, uncertainty, and incompleteness of data
 Pattern evaluation and pattern- or constraint-guided mining
 User Interaction
 Interactive mining
 Incorporation of background knowledge
 Presentation and visualization of data mining results
34
Major Issues in Data Mining (2)
 Efficiency and Scalability
 Efficiency and scalability of data mining algorithms
 Parallel, distributed, stream, and incremental mining methods
 Diversity of data types
 Handling complex types of data
 Mining dynamic, networked, and global data repositories
 Data mining and society
 Social impacts of data mining
 Privacy-preserving data mining
 Invisible data mining
35
Chapter 1. Introduction
 Why Data Mining?
 What Is Data Mining?
 A Multi-Dimensional View of Data Mining
 What Kind of Data Can Be Mined?
 What Kinds of Patterns Can Be Mined?
 What Technologies Are Used?
 What Kinds of Applications Are Targeted?
 Major Issues in Data Mining
 A Brief History of Data Mining and Data Mining Society
 Summary
36
A Brief History of Data Mining Society
 1989 IJCAI Workshop on Knowledge Discovery in Databases
 Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W.
Frawley, 1991)
 1991-1994 Workshops on Knowledge Discovery in Databases
 Advances in Knowledge Discovery and Data Mining (U. Fayyad, G.
Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
 1995-1998 International Conferences on Knowledge Discovery in
Databases and Data Mining (KDD’95-98)
 Journal of Data Mining and Knowledge Discovery (1997)
 ACM SIGKDD conferences since 1998 and SIGKDD Explorations
 More conferences on data mining
 PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM
(2001), etc.
 ACM Transactions on KDD starting in 2007
37
Conferences and Journals on Data Mining
 KDD Conferences
 ACM SIGKDD Int. Conf. on
Knowledge Discovery in
Databases and Data Mining
(KDD)
 SIAM Data Mining Conf. (SDM)
 (IEEE) Int. Conf. on Data Mining
(ICDM)
 European Conf. on Machine
Learning and Principles and
practices of Knowledge
Discovery and Data Mining
(ECML-PKDD)
 Pacific-Asia Conf. on Knowledge
Discovery and Data Mining
(PAKDD)
 Int. Conf. on Web Search and
Data Mining (WSDM)
 Other related conferences
 DB conferences: ACM SIGMOD,
VLDB, ICDE, EDBT, ICDT, …
 Web and IR conferences: WWW,
SIGIR, WSDM
 ML conferences: ICML, NIPS
 PR conferences: CVPR,
 Journals
 Data Mining and Knowledge
Discovery (DAMI or DMKD)
 IEEE Trans. On Knowledge and
Data Eng. (TKDE)
 KDD Explorations
 ACM Trans. on KDD
38
Where to Find References? DBLP, CiteSeer, Google
 Data mining and KDD (SIGKDD: CDROM)
 Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
 Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
 Database systems (SIGMOD: ACM SIGMOD Anthology—CD ROM)
 Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
 Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
 AI & Machine Learning
 Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS,
etc.
 Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems,
IEEE-PAMI, etc.
 Web and IR
 Conferences: SIGIR, WWW, CIKM, etc.
 Journals: WWW: Internet and Web Information Systems,
 Statistics
 Conferences: Joint Stat. Meeting, etc.
 Journals: Annals of statistics, etc.
 Visualization
 Conference proceedings: CHI, ACM-SIGGraph, etc.
 Journals: IEEE Trans. visualization and computer graphics, etc.
39
Chapter 1. Introduction
 Why Data Mining?
 What Is Data Mining?
 A Multi-Dimensional View of Data Mining
 What Kind of Data Can Be Mined?
 What Kinds of Patterns Can Be Mined?
 What Technologies Are Used?
 What Kinds of Applications Are Targeted?
 Major Issues in Data Mining
 A Brief History of Data Mining and Data Mining Society
 Summary
40
Summary
 Data mining: Discovering interesting patterns and knowledge
from massive amount of data
 A natural evolution of database technology, in great demand, with
wide applications
 A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
 Mining can be performed on a variety of data
 Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis,
etc.
 Data mining technologies and applications
 Major issues in data mining
41
Recommended Reference Books
 S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan
Kaufmann, 2002
 R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-Interscience, 2000
 T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
 U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and
Data Mining. AAAI/MIT Press, 1996
 U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery,
Morgan Kaufmann, 2001
 J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd
ed., 2011
 D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001
 T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and
Prediction, 2nd
ed., Springer-Verlag, 2009
 B. Liu, Web Data Mining, Springer 2006.
 T. M. Mitchell, Machine Learning, McGraw Hill, 1997
 G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
 P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005
 S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998
 I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java
Implementations, Morgan Kaufmann, 2nd
ed. 2005
42
Data Mining:
Concepts and Techniques
— Chapter 2 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign
Simon Fraser University
©2011 Han, Kamber, and Pei. All rights
reserved.
43
Chapter 2: Getting to Know Your Data
 Data Objects and Attribute Types
 Basic Statistical Descriptions of Data
 Data Visualization
 Measuring Data Similarity and Dissimilarity
 Summary
44
Types of Data Sets
 Record
 Relational records
 Data matrix, e.g., numerical matrix,
crosstabs
 Document data: text documents: term-
frequency vector
 Transaction data
 Graph and network
 World Wide Web
 Social or information networks
 Molecular Structures
 Ordered
 Video data: sequence of images
 Temporal data: time-series
 Sequential Data: transaction sequences
 Genetic sequence data
 Spatial, image and multimedia:
 Spatial data: maps
 Image data:
 Video data:
Term-frequency vectors (documents × terms):

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1    3     0      5     0     2      6     0     2      0        2
Document 2    0     7      0     2     1      0     0     3      0        0
Document 3    0     1      0     0     1      2     2     0      3        0

Transaction data:

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
45
Important Characteristics of Structured Data
 Dimensionality
 Curse of dimensionality
 Sparsity
 Only presence counts
 Resolution

Patterns depend on the scale
 Distribution
 Centrality and dispersion
46
Data Objects
 Data sets are made up of data objects.
 A data object represents an entity.
 Examples:
 sales database: customers, store items, sales
 medical database: patients, treatments
 university database: students, professors, courses
 Also called samples , examples, instances, data points,
objects, tuples.
 Data objects are described by attributes.
 Database rows -> data objects; columns -> attributes.
47
Attributes
 Attribute (or dimensions, features, variables):
a data field, representing a characteristic or
feature of a data object.
 E.g., customer _ID, name, address
 Types:
 Nominal
 Binary
 Numeric: quantitative

Interval-scaled

Ratio-scaled
48
Attribute Types
 Nominal: categories, states, or “names of things”
 Hair_color = {auburn, black, blond, brown, grey, red, white}
 marital status, occupation, ID numbers, zip codes
 Binary
 Nominal attribute with only 2 states (0 and 1)
 Symmetric binary: both outcomes equally important

e.g., gender
 Asymmetric binary: outcomes not equally important.

e.g., medical test (positive vs. negative)

Convention: assign 1 to most important outcome (e.g.,
HIV positive)
 Ordinal
 Values have a meaningful order (ranking) but magnitude
between successive values is not known.
 Size = {small, medium, large}, grades, army rankings
49
Numeric Attribute Types
 Quantity (integer or real-valued)
 Interval

Measured on a scale of equal-sized units

Values have order
 E.g., temperature in C˚or F˚, calendar dates

No true zero-point
 Ratio

Inherent zero-point

We can speak of values as being an order of
magnitude larger than the unit of measurement
(10 K˚ is twice as high as 5 K˚).
 e.g., temperature in Kelvin, length, counts,
monetary quantities
50
Discrete vs. Continuous Attributes
 Discrete Attribute
 Has only a finite or countably infinite set of values

E.g., zip codes, profession, or the set of words in a
collection of documents
 Sometimes, represented as integer variables
 Note: Binary attributes are a special case of discrete
attributes
 Continuous Attribute
 Has real numbers as attribute values

E.g., temperature, height, or weight
 Practically, real values can only be measured and
represented using a finite number of digits
 Continuous attributes are typically represented as
floating-point variables
51
Chapter 2: Getting to Know Your Data
 Data Objects and Attribute Types
 Basic Statistical Descriptions of Data
 Data Visualization
 Measuring Data Similarity and Dissimilarity
 Summary
52
Basic Statistical Descriptions of Data
 Motivation
 To better understand the data: central tendency,
variation and spread
 Data dispersion characteristics
 median, max, min, quantiles, outliers, variance, etc.
 Numerical dimensions correspond to sorted intervals
 Data dispersion: analyzed with multiple
granularities of precision
 Boxplot or quantile analysis on sorted intervals
 Dispersion analysis on computed measures
 Folding measures into numerical dimensions
 Boxplot or quantile analysis on the transformed
cube
53
Measuring the Central Tendency
 Mean (algebraic measure) (sample vs. population):
   x̄ = (1/n) Σ_{i=1..n} x_i      μ = Σx / N
   Note: n is sample size and N is population size.
 Weighted arithmetic mean:
   x̄ = Σ_{i=1..n} w_i x_i / Σ_{i=1..n} w_i
 Trimmed mean: chopping extreme values
 Median:
 Middle value if odd number of values, or average of
the middle two values otherwise
 Estimated by interpolation (for grouped data):
   median ≈ L_1 + ( (n/2 − Σ freq_l) / freq_median ) × width
 Mode
 Value that occurs most frequently in the data
 Unimodal, bimodal, trimodal
 Empirical formula:
   mean − mode ≈ 3 × (mean − median)
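A minimal Python sketch of the measures above on a small hypothetical salary sample (values are illustrative only):

```python
# Hypothetical sample; illustrates mean, weighted mean, median, mode, trimmed mean.
from statistics import mean, median, mode

x = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
w = [1] * len(x)                                        # equal weights for the weighted mean

print(sum(x) / len(x))                                  # mean: 58.0
print(sum(wi * xi for wi, xi in zip(w, x)) / sum(w))    # weighted mean (same here, equal weights)
print(median(x))                                        # 54.0 (average of the two middle values)
print(mode(x))                                          # 52 (data is bimodal: 52 and 70; mode() returns the first)
print(mean(sorted(x)[1:-1]))                            # trimmed mean after chopping the two extremes: 55.6
```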
54
Symmetric vs. Skewed Data
 Median, mean and mode of
symmetric, positively and
negatively skewed data
(Figure: symmetric, positively skewed, and negatively skewed distributions)
55
Measuring the Dispersion of Data
 Quartiles, outliers and boxplots
 Quartiles: Q1 (25th
percentile), Q3 (75th
percentile)
 Inter-quartile range: IQR = Q3 –Q1
 Five number summary: min, Q1, median, Q3, max
 Boxplot: ends of the box are the quartiles; median is marked; add
whiskers, and plot outliers individually
 Outlier: usually, a value higher/lower than 1.5 x IQR
 Variance and standard deviation (sample: s, population: σ)
 Variance: (algebraic, scalable computation)
 Standard deviation s (or σ) is the square root of the variance s² (or σ²)
   Sample:     s² = (1/(n−1)) Σ_{i=1..n} (x_i − x̄)²  =  (1/(n−1)) [ Σ_{i=1..n} x_i² − (1/n)(Σ_{i=1..n} x_i)² ]
   Population: σ² = (1/N) Σ_{i=1..n} (x_i − μ)²  =  (1/N) Σ_{i=1..n} x_i² − μ²
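A minimal numpy sketch (hypothetical sample) of the dispersion measures above:

```python
# Quartiles, IQR, outlier fences, variance and standard deviation.
import numpy as np

x = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110], dtype=float)

q1, med, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1
print("five-number summary:", x.min(), q1, med, q3, x.max())
print("IQR =", iqr, "-> outlier fences:", q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print("sample variance  s^2    =", x.var(ddof=1))   # divides by n - 1
print("population var   sigma^2=", x.var(ddof=0))   # divides by N
print("sample std dev   s      =", x.std(ddof=1))
```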
56
Boxplot Analysis
 Five-number summary of a distribution
 Minimum, Q1, Median, Q3, Maximum
 Boxplot
 Data is represented with a box
 The ends of the box are at the first and
third quartiles, i.e., the height of the box is
IQR
 The median is marked by a line within the
box
 Whiskers: two lines outside the box
extended to Minimum and Maximum
 Outliers: points beyond a specified outlier
threshold, plotted individually
57
Visualization of Data Dispersion: 3-D Boxplots
58
Properties of Normal Distribution Curve
 The normal (distribution) curve
 From μ–σ to μ+σ: contains about 68% of the
measurements (μ: mean, σ: standard deviation)
 From μ–2σ to μ+2σ: contains about 95% of it
 From μ–3σ to μ+3σ: contains about 99.7% of it
59
Graphic Displays of Basic Statistical Descriptions
 Boxplot: graphic display of five-number summary
 Histogram: the x-axis shows values, the y-axis represents
frequencies
 Quantile plot: each value x_i is paired with f_i indicating
that approximately 100·f_i % of data are ≤ x_i
 Quantile-quantile (q-q) plot: graphs the quantiles of
one univariant distribution against the corresponding
quantiles of another
 Scatter plot: each pair of values is a pair of
coordinates and plotted as points in the plane
60
Histogram Analysis
 Histogram: Graph display of
tabulated frequencies, shown as
bars
 It shows what proportion of cases
fall into each of several categories
 Differs from a bar chart in that it is
the area of the bar that denotes
the value, not the height as in bar
charts, a crucial distinction when
the categories are not of uniform
width
 The categories are usually
specified as non-overlapping
intervals of some variable. The
categories (bars) must be adjacent
61
Histograms Often Tell More than Boxplots
 The two histograms
shown in the left may
have the same
boxplot
representation
 The same values
for: min, Q1,
median, Q3, max
 But they have rather
different data
distributions
62
Quantile Plot
 Displays all of the data (allowing the user to assess
both the overall behavior and unusual occurrences)
 Plots quantile information

For data x_i sorted in increasing order, f_i
indicates that approximately 100·f_i % of the data are
below or equal to the value x_i
63
Quantile-Quantile (Q-Q) Plot
 Graphs the quantiles of one univariate distribution against the
corresponding quantiles of another
 View: Is there a shift in going from one distribution to another?
 Example shows unit price of items sold at Branch 1 vs. Branch 2
for each quantile. Unit prices of items sold at Branch 1 tend to be
lower than those at Branch 2.
64
Scatter plot
 Provides a first look at bivariate data to see clusters of
points, outliers, etc
 Each pair of values is treated as a pair of coordinates
and plotted as points in the plane
65
Positively and Negatively Correlated Data
 The left half fragment is positively
correlated
 The right half is negatively
correlated
66
Uncorrelated Data
67
Chapter 2: Getting to Know Your Data
 Data Objects and Attribute Types
 Basic Statistical Descriptions of Data
 Data Visualization
 Measuring Data Similarity and Dissimilarity
 Summary
68
Data Visualization
 Why data visualization?
 Gain insight into an information space by mapping data onto graphical
primitives
 Provide qualitative overview of large data sets
 Search for patterns, trends, structure, irregularities, relationships
among data
 Help find interesting regions and suitable parameters for further
quantitative analysis
 Provide a visual proof of computer representations derived
 Categorization of visualization methods:
 Pixel-oriented visualization techniques
 Geometric projection visualization techniques
 Icon-based visualization techniques
 Hierarchical visualization techniques
 Visualizing complex data and relations
69
Pixel-Oriented Visualization Techniques
 For a data set of m dimensions, create m windows on the screen,
one for each dimension
 The m dimension values of a record are mapped to m pixels at the
corresponding positions in the windows
 The colors of the pixels reflect the corresponding values
(a) Income   (b) Credit limit   (c) Transaction volume   (d) Age
70
Laying Out Pixels in Circle Segments
 To save space and show the connections among multiple
dimensions, space filling is often done in a circle segment
(a) Representing a data record in a circle segment   (b) Laying out pixels in circle segments
71
Geometric Projection Visualization Techniques
 Visualization of geometric transformations and
projections of the data
 Methods
 Direct visualization
 Scatterplot and scatterplot matrices
 Landscapes
 Projection pursuit technique: Help users find
meaningful projections of multidimensional data
 Prosection views
 Hyperslice
 Parallel coordinates
72
Direct Data Visualization
(Figure: ribbons with twists based on vorticity)
73
Scatterplot Matrices
Matrix of pairwise scatterplots (x-y diagrams) of the k-dimensional data [a total of k(k−1)/2 distinct scatterplots]
(Used by permission of M. Ward, Worcester Polytechnic Institute)
74
Landscapes
 Visualization of the data as a perspective landscape
 The data needs to be transformed into a (possibly artificial) 2D
spatial representation which preserves the characteristics of the
data
(Figure: news articles visualized as a landscape; used by permission of B. Wright, Visible Decisions Inc.)
75
Parallel Coordinates
(Figure: parallel axes for Attr. 1, Attr. 2, Attr. 3, …, Attr. k)
 n equidistant axes which are parallel to one of the screen axes and
correspond to the attributes
 The axes are scaled to the [minimum, maximum]: range of the
corresponding attribute
 Every data item corresponds to a polygonal line which intersects
each of the axes at the point which corresponds to the value for
the attribute
76
Parallel Coordinates of a Data Set
77
Icon-Based Visualization Techniques
 Visualization of the data values as features of icons
 Typical visualization methods
 Chernoff Faces
 Stick Figures
 General techniques
 Shape coding: Use shape to represent certain
information encoding
 Color icons: Use color icons to encode more
information
 Tile bars: Use small icons to represent the relevant
feature vectors in document retrieval
78
Chernoff Faces
 A way to display variables on a two-dimensional surface, e.g., let x
be eyebrow slant, y be eye size, z be nose length, etc.
 The figure shows faces produced using 10 characteristics (head
eccentricity, eye size, eye spacing, eye eccentricity, pupil size,
eyebrow slant, nose size, mouth shape, mouth size, and mouth
opening): each is assigned one of 10 possible values, generated
using Mathematica (S. Dickson)
 REFERENCE: Gonick, L. and Smith, W.
The Cartoon Guide to Statistics. New York:
Harper Perennial, p. 212, 1993
 Weisstein, Eric W. "Chernoff Face." From
MathWorld--A Wolfram Web Resource.
mathworld.wolfram.com/ChernoffFace.html
79
Stick Figure
 A census data figure showing age, income, gender, education, etc.
(used by permission of G. Grinstein, University of Massachusetts at Lowell)
 A 5-piece stick figure (1 body and 4 limbs with different angle/length)
 Two attributes mapped to axes, remaining attributes mapped to angle or
length of limbs; look at the texture pattern
80
Hierarchical Visualization Techniques
 Visualization of the data using a hierarchical
partitioning into subspaces
 Methods
 Dimensional Stacking
 Worlds-within-Worlds
 Tree-Map
 Cone Trees
 InfoCube
81
Dimensional Stacking
(Figure: 2-D subspaces of attribute 1 … attribute 4 stacked into each other)
 Partitioning of the n-dimensional attribute space in 2-D
subspaces, which are ‘stacked’ into each other
 Partitioning of the attribute value ranges into classes.
The important attributes should be used on the outer
levels.
 Adequate for data with ordinal attributes of low
cardinality
 But, difficult to display more than nine dimensions
82
Dimensional Stacking
Visualization of oil mining data with longitude and latitude mapped to the
outer x-, y-axes and ore grade and depth mapped to the inner x-, y-axes
(used by permission of M. Ward, Worcester Polytechnic Institute)
83
Worlds-within-Worlds
 Assign the function and two most important parameters to
innermost world
 Fix all other parameters at constant values; draw other (1-, 2-, or
3-dimensional) worlds choosing these as the axes
 Software that uses this paradigm
 N–vision: Dynamic
interaction through
data glove and stereo
displays, including
rotation, scaling (inner)
and translation
(inner/outer)
 Auto Visual: Static
interaction by means of
queries
84
Tree-Map
 Screen-filling method which uses a hierarchical
partitioning of the screen into regions depending on the
attribute values
 The x- and y-dimension of the screen are partitioned
alternately according to the attribute values (classes)
(Figure: MSR Netscan tree-map image)
85
Tree-Map of a File System (Shneiderman)
86
InfoCube
 A 3-D visualization technique where hierarchical
information is displayed as nested semi-
transparent cubes
 The outermost cubes correspond to the top level
data, while the subnodes or the lower level data
are represented as smaller cubes inside the
outermost cubes, and so on
87
Three-D Cone Trees
 3D cone tree visualization technique
works well for up to a thousand nodes
or so
 First build a 2D circle tree that arranges
its nodes in concentric circles centered
on the root node
 Cannot avoid overlaps when projected
to 2D
 G. Robertson, J. Mackinlay, S. Card.
“Cone Trees: Animated 3D Visualizations
of Hierarchical Information”, ACM
SIGCHI'91
 Graph from Nadeau Software Consulting
website: Visualize a social network data
set that models the way an infection
spreads from one person to the next
Ack.: http://nadeausoftware.com/articles/visualization
Visualizing Complex Data and Relations
 Visualizing non-numerical data: text and social networks
 Tag cloud: visualizing user-generated tags
 The importance of a tag is represented
by font size/color
 Besides text data,
there are also
methods to visualize
relationships, such
as visualizing social
networks
(Figure: Newsmap visualization of Google News stories)
89
Chapter 2: Getting to Know Your Data
 Data Objects and Attribute Types
 Basic Statistical Descriptions of Data
 Data Visualization
 Measuring Data Similarity and Dissimilarity
 Summary
90
Similarity and Dissimilarity
 Similarity
 Numerical measure of how alike two data objects are
 Value is higher when objects are more alike
 Often falls in the range [0,1]
 Dissimilarity (e.g., distance)
 Numerical measure of how different two data
objects are
 Lower when objects are more alike
 Minimum dissimilarity is often 0
 Upper limit varies
 Proximity refers to a similarity or dissimilarity
91
Data Matrix and Dissimilarity Matrix
 Data matrix
 n data points with p
dimensions
 Two modes
 Dissimilarity matrix
 n data points, but
registers only the
distance
 A triangular matrix
 Single mode
Data matrix (n objects × p attributes):

    [ x_11  …  x_1f  …  x_1p ]
    [  ⋮        ⋮         ⋮  ]
    [ x_i1  …  x_if  …  x_ip ]
    [  ⋮        ⋮         ⋮  ]
    [ x_n1  …  x_nf  …  x_np ]

Dissimilarity matrix (n × n, lower triangular, single mode):

    [ 0                            ]
    [ d(2,1)   0                   ]
    [ d(3,1)   d(3,2)   0          ]
    [  ⋮         ⋮       ⋮         ]
    [ d(n,1)   d(n,2)   …       0  ]
92
Proximity Measure for Nominal Attributes
 Can take 2 or more states, e.g., red, yellow,
blue, green (generalization of a binary
attribute)
 Method 1: Simple matching
 m: # of matches, p: total # of variables
   d(i, j) = (p − m) / p
 Method 2: Use a large number of binary
attributes
 creating a new binary attribute for each of the
M nominal states
93
Proximity Measure for Binary Attributes
 A contingency table for binary data (object i vs. object j):
q = # of attributes where both i and j are 1, r = # where i is 1 and j is 0,
s = # where i is 0 and j is 1, t = # where both are 0
 Distance measure for symmetric
binary variables:
   d(i, j) = (r + s) / (q + r + s + t)
 Distance measure for asymmetric
binary variables:
   d(i, j) = (r + s) / (q + r + s)
 Jaccard coefficient (similarity
measure for asymmetric binary
variables):
   sim_Jaccard(i, j) = q / (q + r + s)
 Note: Jaccard coefficient is the same as “coherence”
94
Dissimilarity between Binary Variables
 Example
 Gender is a symmetric attribute
 The remaining attributes are asymmetric binary
 Let the values Y and P be 1, and the value N 0
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
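A minimal Python sketch reproducing the three distances above (asymmetric binary dissimilarity; the symmetric attribute gender is left out):

```python
# Asymmetric binary dissimilarity d = (r + s) / (q + r + s) for Jack/Mary/Jim.

patients = {
    # fever, cough, test-1, test-2, test-3, test-4  (Y/P -> 1, N -> 0)
    "jack": [1, 0, 1, 0, 0, 0],
    "mary": [1, 0, 1, 0, 1, 0],
    "jim":  [1, 1, 0, 0, 0, 0],
}

def d_asym(a, b):
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # both 1
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)  # 1 in a, 0 in b
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)  # 0 in a, 1 in b
    return (r + s) / (q + r + s)

print(round(d_asym(patients["jack"], patients["mary"]), 2))  # 0.33
print(round(d_asym(patients["jack"], patients["jim"]), 2))   # 0.67
print(round(d_asym(patients["jim"], patients["mary"]), 2))   # 0.75
```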
95
Standardizing Numeric Data
 Z-score:
 X: raw score to be standardized, μ: mean of the population, σ:
standard deviation
 the distance between the raw score and the population mean
in units of the standard deviation
 negative when the raw score is below the mean, “+” when
above
 An alternative way: Calculate the mean absolute deviation
where
 standardized measure (z-score):
 Using mean absolute deviation is more robust than using
standard deviation
   z = (x − μ) / σ
   Mean absolute deviation: s_f = (1/n) ( |x_1f − m_f| + |x_2f − m_f| + … + |x_nf − m_f| ),
   where m_f = (1/n) (x_1f + x_2f + … + x_nf)
   Standardized measure (z-score): z_if = (x_if − m_f) / s_f
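A minimal numpy sketch (hypothetical column with one extreme value) contrasting the standard-deviation-based and mean-absolute-deviation-based z-scores:

```python
import numpy as np

x = np.array([20.0, 30.0, 40.0, 50.0, 160.0])   # hypothetical attribute with one outlier

z_std = (x - x.mean()) / x.std()                 # classic z-score: (x - mean) / std
mad = np.mean(np.abs(x - x.mean()))              # mean absolute deviation s_f
z_mad = (x - x.mean()) / mad                     # z-score using s_f instead of sigma

print(np.round(z_std, 2))   # [-0.78 -0.59 -0.39 -0.2   1.96]
print(np.round(z_mad, 2))   # [-1.   -0.75 -0.5  -0.25  2.5 ]  -> the outlier stays more visible,
                            # since deviations are not squared when computing s_f
```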
96
Example:
Data Matrix and Dissimilarity Matrix
Data Matrix:

point  attribute1  attribute2
x1        1           2
x2        3           5
x3        2           0
x4        4           5

Dissimilarity Matrix (with Euclidean distance):

      x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.10  0
x4    4.24  1.00  5.39  0

(Figure: scatter plot of the four points)
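A minimal scipy sketch that reproduces the Euclidean dissimilarity matrix above:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1, 2],   # x1
              [3, 5],   # x2
              [2, 0],   # x3
              [4, 5]])  # x4

D = squareform(pdist(X, metric="euclidean"))   # pairwise distances as a square matrix
print(np.round(D, 2))
# [[0.   3.61 2.24 4.24]
#  [3.61 0.   5.1  1.  ]
#  [2.24 5.1  0.   5.39]
#  [4.24 1.   5.39 0.  ]]
```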
97
Distance on Numeric Data: Minkowski Distance
 Minkowski distance: A popular distance measure
   d(i, j) = ( |x_i1 − x_j1|^h + |x_i2 − x_j2|^h + … + |x_ip − x_jp|^h )^(1/h)
where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-
dimensional data objects, and h is the order (the
distance so defined is also called the L-h norm)
 Properties
 d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
 d(i, j) = d(j, i) (Symmetry)
 d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
 A distance that satisfies these properties is a metric
98
Special Cases of Minkowski Distance
 h = 1: Manhattan (city block, L1 norm) distance
   d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + … + |x_ip − x_jp|
 E.g., the Hamming distance: the number of bits that are
different between two binary vectors
 h = 2: (L2 norm) Euclidean distance
   d(i, j) = ( |x_i1 − x_j1|² + |x_i2 − x_j2|² + … + |x_ip − x_jp|² )^(1/2)
 h → ∞: “supremum” (L_max norm, L_∞ norm) distance
 This is the maximum difference between any component
(attribute) of the vectors
   d(i, j) = max_f |x_if − x_jf|
99
Example: Minkowski Distance
Dissimilarity Matrices

point  attribute 1  attribute 2
x1         1            2
x2         3            5
x3         2            0
x4         4            5

Manhattan (L1):
      x1   x2   x3   x4
x1    0
x2    5    0
x3    3    6    0
x4    6    1    7    0

Euclidean (L2):
      x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.10  0
x4    4.24  1.00  5.39  0

Supremum (L∞):
      x1   x2   x3   x4
x1    0
x2    3    0
x3    2    5    0
x4    3    1    5    0

(Figure: scatter plot of the four points)
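A minimal scipy sketch reproducing the three dissimilarity matrices above (the metric names cityblock, euclidean, and chebyshev correspond to L1, L2, and L∞):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1, 2], [3, 5], [2, 0], [4, 5]])  # x1 .. x4

for name, metric in [("L1", "cityblock"), ("L2", "euclidean"), ("Lmax", "chebyshev")]:
    print(name)
    print(np.round(squareform(pdist(X, metric=metric)), 2))
```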
100
Ordinal Variables
 An ordinal variable can be discrete or continuous
 Order is important, e.g., rank
 Can be treated like interval-scaled
 replace x_if by their rank r_if ∈ {1, …, M_f}
 map the range of each variable onto [0, 1] by
replacing the i-th object in the f-th variable by
   z_if = (r_if − 1) / (M_f − 1)
 compute the dissimilarity using methods for
interval-scaled variables
101
Attributes of Mixed Type
 A database may contain all attribute types
 Nominal, symmetric binary, asymmetric binary,
numeric, ordinal
 One may use a weighted formula to combine their
effects:
   d(i, j) = ( Σ_{f=1..p} δ_ij^(f) · d_ij^(f) ) / ( Σ_{f=1..p} δ_ij^(f) )
 f is binary or nominal:
   d_ij^(f) = 0 if x_if = x_jf, or d_ij^(f) = 1 otherwise
 f is numeric: use the normalized distance
 f is ordinal
 Compute ranks r_if and z_if = (r_if − 1) / (M_f − 1),
and treat z_if as interval-scaled
102
Cosine Similarity
 A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as keywords) or
phrase in the document.
 Other vector objects: gene features in micro-arrays, …
 Applications: information retrieval, biologic taxonomy, gene feature
mapping, ...
 Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency
vectors), then
   cos(d1, d2) = (d1 • d2) / (||d1|| × ||d2||),
where • indicates the vector dot product and ||d|| is the length of vector d
103
Example: Cosine Similarity
 cos(d1, d2) = (d1 • d2) / (||d1|| × ||d2||),
where • indicates the vector dot product and ||d|| is the length of vector d
 Ex: Find the similarity between documents 1 and 2.
   d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
   d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
   d1 • d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
   ||d1|| = (5×5 + 0×0 + 3×3 + 0×0 + 2×2 + 0×0 + 0×0 + 2×2 + 0×0 + 0×0)^0.5 = 42^0.5 = 6.481
   ||d2|| = (3×3 + 0×0 + 2×2 + 0×0 + 1×1 + 1×1 + 0×0 + 1×1 + 0×0 + 1×1)^0.5 = 17^0.5 = 4.123
   cos(d1, d2) = 25 / (6.481 × 4.123) ≈ 0.94
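A minimal numpy sketch of the same computation:

```python
import numpy as np

d1 = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0], dtype=float)
d2 = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1], dtype=float)

# cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)
cos = d1.dot(d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 2))   # 0.94
```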
104
Chapter 2: Getting to Know Your Data
 Data Objects and Attribute Types
 Basic Statistical Descriptions of Data
 Data Visualization
 Measuring Data Similarity and Dissimilarity
 Summary
Summary
 Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-
scaled
 Many types of data sets, e.g., numerical, text, graph, Web, image.
 Gain insight into the data by:
 Basic statistical data description: central tendency, dispersion,
graphical displays
 Data visualization: map data onto graphical primitives
 Measure data similarity
 Above steps are the beginning of data preprocessing.
 Many methods have been developed but still an active area of
research.
105
References
 W. Cleveland, Visualizing Data, Hobart Press, 1993
 T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley, 2003
 U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and
Knowledge Discovery, Morgan Kaufmann, 2001
 L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster
Analysis. John Wiley & Sons, 1990.
 H. V. Jagadish, et al., Special Issue on Data Reduction Techniques. Bulletin of the Tech.
Committee on Data Eng., 20(4), Dec. 1997
 D. A. Keim. Information visualization and visual data mining, IEEE trans. on Visualization
and Computer Graphics, 8(1), 2002
 D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999
 S. Santini and R. Jain, “Similarity measures”, IEEE Trans. on Pattern Analysis and
Machine Intelligence, 21(9), 1999
 E. R. Tufte. The Visual Display of Quantitative Information, 2nd ed., Graphics Press,
2001
 C. Yu , et al., Visual data mining of multimedia data for social and behavioral studies,
Information Visualization, 8(1), 2009
106
107
Data Mining:
Concepts and Techniques
(3rd
ed.)
— Chapter 3 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
108
108
Chapter 3: Data Preprocessing
 Data Preprocessing: An Overview
 Data Quality
 Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
 Summary
109
Data Quality: Why Preprocess the Data?
 Measures for data quality: A multidimensional view
 Accuracy: correct or wrong, accurate or not
 Completeness: not recorded, unavailable, …
 Consistency: some modified but some not,
dangling, …
 Timeliness: timely update?
 Believability: how much are the data trusted to be correct?
 Interpretability: how easily the data can be
understood?
110
Major Tasks in Data Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data reduction
 Dimensionality reduction
 Numerosity reduction
 Data compression
 Data transformation and data discretization
 Normalization
 Concept hierarchy generation
111
111
Chapter 3: Data Preprocessing
 Data Preprocessing: An Overview
 Data Quality
 Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
 Summary
112
Data Cleaning
 Data in the Real World Is Dirty: Lots of potentially incorrect data,
e.g., faulty instruments, human or computer error, transmission
error
 incomplete: lacking attribute values, lacking certain attributes
of interest, or containing only aggregate data

e.g., Occupation=“ ” (missing data)
 noisy: containing noise, errors, or outliers

e.g., Salary = “−10” (an error)
 inconsistent: containing discrepancies in codes or names, e.g.,

Age=“42”, Birthday=“03/07/2010”

Was rating “1, 2, 3”, now rating “A, B, C”

discrepancy between duplicate records
 Intentional (e.g., disguised missing data)

Jan. 1 as everyone’s birthday?
113
Incomplete (Missing) Data
 Data is not always available
 E.g., many tuples have no recorded value for
several attributes, such as customer income in
sales data
 Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus
deleted
 data not entered due to misunderstanding
 certain data may not be considered important at
the time of entry
 history or changes of the data were not registered
114
How to Handle Missing Data?
 Ignore the tuple: usually done when class label is
missing (when doing classification)—not effective when
the % of missing values per attribute varies
considerably
 Fill in the missing value manually: tedious + infeasible?
 Fill in it automatically with
 a global constant : e.g., “unknown”, a new class?!
 the attribute mean
 the attribute mean for all samples belonging to the
same class: smarter
 the most probable value: inference-based methods such as a Bayesian formula or a decision tree
115
Noisy Data
 Noise: random error or variance in a measured
variable
 Incorrect attribute values may be due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitation
 inconsistency in naming convention
 Other data problems which require data cleaning
 duplicate records
 incomplete data
 inconsistent data
116
How to Handle Noisy Data?
 Binning
 first sort data and partition into (equal-frequency)
bins
 then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
 Regression
 smooth by fitting the data into regression functions
 Clustering
 detect and remove outliers
 Combined computer and human inspection
 detect suspicious values and check by human (e.g.,
deal with possible outliers)
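A minimal Python sketch of equal-frequency binning with smoothing by bin means and by bin boundaries (the price list is a hypothetical example):

```python
from statistics import mean

prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
depth = 3
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]  # equal-frequency bins

# Smoothing by bin means: every value becomes its bin's mean.
by_means = [[round(mean(b), 1)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value becomes the closer of its bin's two edges.
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

print(by_means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```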
117
Data Cleaning as a Process
 Data discrepancy detection
 Use metadata (e.g., domain, range, dependency, distribution)
 Check field overloading
 Check uniqueness rule, consecutive rule and null rule
 Use commercial tools

Data scrubbing: use simple domain knowledge (e.g., postal
code, spell-check) to detect errors and make corrections

Data auditing: by analyzing data to discover rules and
relationship to detect violators (e.g., correlation and
clustering to find outliers)
 Data migration and integration
 Data migration tools: allow transformations to be specified
 ETL (Extraction/Transformation/Loading) tools: allow users to
specify transformations through a graphical user interface
 Integration of the two processes
 Iterative and interactive (e.g., Potter’s Wheel)
118
118
Chapter 3: Data Preprocessing
 Data Preprocessing: An Overview
 Data Quality
 Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
 Summary
119
119
Data Integration
 Data integration:
 Combines data from multiple sources into a coherent store
 Schema integration: e.g., A.cust-id ≡ B.cust-#
 Integrate metadata from different sources
 Entity identification problem:
 Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
 Detecting and resolving data value conflicts
 For the same real world entity, attribute values from different
sources are different
 Possible reasons: different representations, different scales,
e.g., metric vs. British units
120
120
Handling Redundancy in Data Integration
 Redundant data occur often when integrating multiple
databases
 Object identification: The same attribute or object
may have different names in different databases
 Derivable data: One attribute may be a “derived”
attribute in another table, e.g., annual revenue
 Redundant attributes may be able to be detected by
correlation analysis and covariance analysis
 Careful integration of the data from multiple sources
may help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality
121
Correlation Analysis (Nominal Data)
 Χ² (chi-square) test:
   χ² = Σ (Observed − Expected)² / Expected
 The larger the Χ² value, the more likely the variables
are related
 The cells that contribute the most to the Χ² value are
those whose actual count is very different from the
expected count
 Correlation does not imply causality
 # of hospitals and # of car thefts in a city are correlated
 Both are causally linked to the third variable: population
122
Chi-Square Calculation: An Example
 Χ² (chi-square) calculation (numbers in parentheses are
expected counts calculated based on the data
distribution in the two categories):

                            Play chess   Not play chess   Sum (row)
 Like science fiction        250 (90)       200 (360)        450
 Not like science fiction     50 (210)     1000 (840)       1050
 Sum (col.)                  300           1200             1500

   χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93

 It shows that like_science_fiction and play_chess are
correlated in the group
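A minimal Python sketch of this computation, both by hand and with scipy (chi2_contingency is called without the Yates correction so it matches the hand calculation):

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[250, 200],    # like science fiction:     play chess / not
                     [ 50, 1000]])  # not like science fiction: play chess / not

# Expected counts under independence: row_total * col_total / grand_total
row = observed.sum(axis=1, keepdims=True)
col = observed.sum(axis=0, keepdims=True)
expected = row @ col / observed.sum()

chi2_manual = ((observed - expected) ** 2 / expected).sum()
print(round(chi2_manual, 2))        # 507.93

chi2, p, dof, exp = chi2_contingency(observed, correction=False)
print(round(chi2, 2), dof)          # 507.93 with 1 degree of freedom
```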
123
Correlation Analysis (Numeric Data)
 Correlation coefficient (also called Pearson’s product-moment
coefficient):
   r_A,B = Σ_{i=1..n} (a_i − Ā)(b_i − B̄) / ((n − 1) σ_A σ_B)
         = ( Σ_{i=1..n} a_i b_i − n Ā B̄ ) / ((n − 1) σ_A σ_B)
where n is the number of tuples, Ā and B̄ are the respective
means of A and B, σ_A and σ_B are the respective standard
deviations of A and B, and Σ a_i b_i is the sum of the AB cross-
product.
 If r_A,B > 0, A and B are positively correlated (A’s values
increase as B’s do). The higher the value, the stronger the correlation.
 r_A,B = 0: independent; r_A,B < 0: negatively correlated
124
Visually Evaluating Correlation
Scatter plots
showing the
similarity from
–1 to 1.
125
Correlation (viewed as linear relationship)
 Correlation measures the linear relationship
between objects
 To compute correlation, we standardize data
objects, A and B, and then take their dot
product
   a′_k = (a_k − mean(A)) / std(A)
   b′_k = (b_k − mean(B)) / std(B)
   correlation(A, B) = A′ • B′
126
Covariance (Numeric Data)
 Covariance is similar to correlation:
   Cov(A, B) = E[(A − Ā)(B − B̄)] = (1/n) Σ_{i=1..n} (a_i − Ā)(b_i − B̄)
   Correlation coefficient: r_A,B = Cov(A, B) / (σ_A σ_B)
where n is the number of tuples, Ā and B̄ are the respective mean
or expected values of A and B, and σ_A and σ_B are the respective
standard deviations of A and B.
 Positive covariance: If Cov_A,B > 0, then A and B both tend to be larger
than their expected values.
 Negative covariance: If Cov_A,B < 0, then if A is larger than its expected
value, B is likely to be smaller than its expected value.
 Independence: if A and B are independent, then Cov_A,B = 0, but the converse is not true:
 Some pairs of random variables may have a covariance of 0 but are not
independent. Only under some additional assumptions (e.g., the data
follow multivariate normal distributions) does a covariance of 0 imply
independence.
Co-Variance: An Example
 It can be simplified in computation as Cov(A, B) = E(A·B) − Ā·B̄
 Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
 Question: If the stocks are affected by the same industry trends,
will their prices rise or fall together?
 E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
 E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
 Cov(A, B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 42.4 − 38.4 = 4
 Thus, A and B rise together since Cov(A, B) > 0.
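A minimal numpy sketch of the stock example, using the shortcut above and numpy's covariance for comparison:

```python
import numpy as np

A = np.array([2, 3, 5, 4, 6], dtype=float)
B = np.array([5, 8, 10, 11, 14], dtype=float)

cov_shortcut = (A * B).mean() - A.mean() * B.mean()   # E(A*B) - mean(A)*mean(B)
print(cov_shortcut)                                   # 4.0 -> positive, so A and B rise together

print(np.cov(A, B, bias=True)[0, 1])                  # same value; bias=True divides by n, not n-1
```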
128
128
Chapter 3: Data Preprocessing
 Data Preprocessing: An Overview
 Data Quality
 Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
 Summary
129
Data Reduction Strategies
 Data reduction: Obtain a reduced representation of the data set
that is much smaller in volume but yet produces the same (or
almost the same) analytical results
 Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time
to run on the complete data set.
 Data reduction strategies
 Dimensionality reduction, e.g., remove unimportant attributes

Wavelet transforms

Principal Components Analysis (PCA)

Feature subset selection, feature creation
 Numerosity reduction (some simply call it: Data Reduction)

Regression and Log-Linear Models

Histograms, clustering, sampling

Data cube aggregation
 Data compression
130
Data Reduction 1: Dimensionality Reduction
 Curse of dimensionality
 When dimensionality increases, data becomes increasingly sparse
 Density and distance between points, which is critical to clustering,
outlier analysis, becomes less meaningful
 The possible combinations of subspaces will grow exponentially
 Dimensionality reduction
 Avoid the curse of dimensionality
 Help eliminate irrelevant features and reduce noise
 Reduce time and space required in data mining
 Allow easier visualization
 Dimensionality reduction techniques
 Wavelet transforms
 Principal Component Analysis
 Supervised and nonlinear techniques (e.g., feature selection)
131
Mapping Data to a New Space
(Figure: two sine waves; two sine waves + noise; and their frequency-domain representation)
 Fourier transform
 Wavelet transform
132
What Is Wavelet Transform?
 Decomposes a signal into
different frequency
subbands
 Applicable to n-
dimensional signals
 Data are transformed to
preserve relative distance
between objects at different
levels of resolution
 Allow natural clusters to
become more
distinguishable
 Used for image compression
133
Wavelet Transformation
 Discrete wavelet transform (DWT) for linear signal
processing, multi-resolution analysis
 Compressed approximation: store only a small fraction
of the strongest of the wavelet coefficients
 Similar to discrete Fourier transform (DFT), but better
lossy compression, localized in space
 Method:
 Length, L, must be an integer power of 2 (padding with 0’s,
when necessary)
 Each transform has 2 functions: smoothing, difference
 Applies to pairs of data, resulting in two sets of data of length
L/2
 Applies the two functions recursively, until it reaches the desired length
(Figure: Haar-2 and Daubechies-4 wavelet basis functions)
134
Wavelet Decomposition
 Wavelets: A math tool for space-efficient hierarchical
decomposition of functions
 S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed to
S^ = [2.75, −1.25, 0.5, 0, 0, −1, −1, 0]
 Compression: many small detail coefficients can be
replaced by 0’s, and only the significant coefficients
are retained
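A minimal Python sketch of this averaging-and-differencing decomposition (a plain recursive Haar transform, not a production DWT implementation):

```python
def haar_decompose(s):
    """Return [overall average, detail coefficients ...] for a length-2^k list."""
    coeffs = []
    while len(s) > 1:
        averages = [(a + b) / 2 for a, b in zip(s[0::2], s[1::2])]  # smoothing
        details  = [(a - b) / 2 for a, b in zip(s[0::2], s[1::2])]  # difference
        coeffs = details + coeffs   # finer-level details go to the right
        s = averages
    return s + coeffs

print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```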
135
Haar Wavelet Coefficients
(Figure: hierarchical decomposition structure, a.k.a. “error tree”, for the
original frequency distribution 2 2 0 2 3 5 4 4, showing the Haar coefficients
2.75, −1.25, 0.5, 0, 0, −1, −1, 0 and the “supports” of each coefficient)
136
Why Wavelet Transform?
 Use hat-shape filters
 Emphasize region where points cluster
 Suppress weaker information in their boundaries
 Effective removal of outliers
 Insensitive to noise, insensitive to input order
 Multi-resolution
 Detect arbitrary shaped clusters at different scales
 Efficient
 Complexity O(N)
 Only applicable to low dimensional data
137
[Figure: data points in the (x1, x2) plane with the principal direction e of largest variation]
Principal Component Analysis (PCA)
 Find a projection that captures the largest amount of variation in
data
 The original data are projected onto a much smaller space,
resulting in dimensionality reduction. We find the eigenvectors of
the covariance matrix, and these eigenvectors define the new
space
138
 Given N data vectors from n-dimensions, find k ≤ n orthogonal
vectors (principal components) that can be best used to represent
data
 Normalize input data: Each attribute falls within the same
range
 Compute k orthonormal (unit) vectors, i.e., principal components
 Each input data (vector) is a linear combination of the k
principal component vectors
 The principal components are sorted in order of decreasing
“significance” or strength
 Since the components are sorted, the size of the data can be reduced by
eliminating the weak components, i.e., those with low variance (i.e., using
the strongest principal components, it is possible to reconstruct a good
approximation of the original data)
Principal Component Analysis (Steps)
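A minimal sketch of the PCA steps above using NumPy (eigenvectors of the covariance matrix); real applications would usually rely on a library implementation such as scikit-learn's PCA. The data here is random and only for illustration.

import numpy as np

def pca(X, k):
    """Project the N x n data matrix X onto its top-k principal components."""
    X_centered = X - X.mean(axis=0)            # normalize: center each attribute
    cov = np.cov(X_centered, rowvar=False)     # n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # orthonormal eigenvectors
    order = np.argsort(eigvals)[::-1]          # sort by decreasing "significance"
    components = eigvecs[:, order[:k]]         # keep the k strongest components
    return X_centered @ components             # reduced N x k representation

X = np.random.rand(100, 5)
print(pca(X, k=2).shape)   # (100, 2)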
139
Attribute Subset Selection
 Another way to reduce dimensionality of data
 Redundant attributes
 Duplicate much or all of the information contained
in one or more other attributes
 E.g., purchase price of a product and the amount of
sales tax paid
 Irrelevant attributes
 Contain no information that is useful for the data
mining task at hand
 E.g., students' ID is often irrelevant to the task of
predicting students' GPA
140
Heuristic Search in Attribute Selection
 There are 2^d possible attribute combinations of d attributes
 Typical heuristic attribute selection methods:
 Best single attribute under the attribute
independence assumption: choose by significance
tests
 Best step-wise feature selection:

The best single-attribute is picked first

Then the next best attribute conditioned on the first, ...
 Step-wise attribute elimination:

Repeatedly eliminate the worst attribute
 Best combined attribute selection and elimination
 Optimal branch and bound: use attribute elimination and backtracking
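A minimal sketch (plain Python) of the best step-wise (greedy forward) selection heuristic listed above; score is a placeholder for any evaluation function, e.g., cross-validated accuracy of a model trained on the chosen attributes. The toy score below is invented for illustration.

def forward_selection(all_features, score, max_features):
    selected = []
    while len(selected) < max_features:
        remaining = [f for f in all_features if f not in selected]
        # pick the attribute that helps most, conditioned on those already chosen
        best = max(remaining, key=lambda f: score(selected + [f]))
        if score(selected + [best]) <= score(selected):
            break                    # no further improvement: stop early
        selected.append(best)
    return selected

useful = {"income", "age"}            # pretend only these two attributes matter
def toy_score(feats):
    return len(useful & set(feats))

print(forward_selection(["income", "age", "student_id", "zip"], toy_score, 3))
# -> ['income', 'age']; 'student_id' and 'zip' add nothing and are never picked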
141
Attribute Creation (Feature Generation)
 Create new attributes (features) that can capture the
important information in a data set more effectively
than the original ones
 Three general methodologies
 Attribute extraction

Domain-specific
 Mapping data to new space (see: data reduction)

E.g., Fourier transformation, wavelet
transformation, manifold approaches (not
covered)
 Attribute construction

Combining features (see: discriminative frequent
patterns in Chapter 7)

142
Data Reduction 2: Numerosity Reduction
 Reduce data volume by choosing alternative, smaller
forms of data representation
 Parametric methods (e.g., regression)
 Assume the data fits some model, estimate model
parameters, store only the parameters, and
discard the data (except possible outliers)
 Ex.: Log-linear models—obtain value at a point in
m-D space as the product of appropriate marginal
subspaces
 Non-parametric methods
 Do not assume models
 Major families: histograms, clustering, sampling, …
143
Parametric Data Reduction: Regression
and Log-Linear Models
 Linear regression
 Data modeled to fit a straight line
 Often uses the least-square method to fit the line
 Multiple regression
 Allows a response variable Y to be modeled as a
linear function of multidimensional feature vector
 Log-linear model
 Approximates discrete multidimensional
probability distributions
144
Regression Analysis
 Regression analysis: A collective name
for techniques for the modeling and
analysis of numerical data consisting of
values of a dependent variable (also
called response variable or
measurement) and of one or more
independent variables (aka. explanatory
variables or predictors)
 The parameters are estimated so as to
give a "best fit" of the data
 Most commonly the best fit is evaluated
by using the least squares method, but
other criteria have also been used
 Used for prediction
(including forecasting of
time-series data),
inference, hypothesis
testing, and modeling of
causal relationships
[Figure: data points with the fitted regression line y = x + 1; for a given x value X1, the line's prediction Y1' is compared with the observed value Y1]
145
 Linear regression: Y = w X + b
 Two regression coefficients, w and b, specify the line and are to
be estimated by using the data at hand
 Using the least squares criterion on the known values of Y1, Y2, …,
X1, X2, ….
 Multiple regression: Y = b0 + b1 X1 + b2 X2
 Many nonlinear functions can be transformed into the above
 Log-linear models:
 Approximate discrete multidimensional probability
distributions
 Estimate the probability of each point (tuple) in a multi-
dimensional space for a set of discretized attributes, based on a
smaller subset of dimensional combinations
Regression Analysis and Log-Linear Models
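A minimal sketch (plain Python, made-up data) of estimating the coefficients w and b of Y = w X + b with the closed-form least-squares formulas; in practice a library such as NumPy or scikit-learn would be used.

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.9, 4.1, 6.0, 8.2, 9.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# w = sum((x_i - mean_x)(y_i - mean_y)) / sum((x_i - mean_x)^2), b = mean_y - w * mean_x
w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
b = mean_y - w * mean_x

print(w, b)    # roughly 2.0 and 0.0 for this made-up data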
146
Histogram Analysis
 Divide data into buckets and
store average (sum) for each
bucket
 Partitioning rules:
 Equal-width: equal bucket
range
 Equal-frequency (or
equal-depth)
[Figure: example equal-width histogram; the x-axis shows values from 10,000 to 100,000 and the y-axis shows bucket counts from 0 to 40]
147
Clustering
 Partition data set into clusters based on similarity,
and store cluster representation (e.g., centroid and
diameter) only
 Can be very effective if data is clustered but not if
data is “smeared”
 Can have hierarchical clustering and be stored in
multi-dimensional index tree structures
 There are many choices of clustering definitions and
clustering algorithms
 Cluster analysis will be studied in depth in Chapter 10
148
Sampling
 Sampling: obtaining a small sample s to represent the
whole data set N
 Allow a mining algorithm to run in complexity that is
potentially sub-linear to the size of the data
 Key principle: Choose a representative subset of the
data
 Simple random sampling may have very poor
performance in the presence of skew
 Develop adaptive sampling methods, e.g., stratified
sampling:
 Note: Sampling may not reduce database I/Os (page at a time)
149
Types of Sampling
 Simple random sampling
 There is an equal probability of selecting any
particular item
 Sampling without replacement
 Once an object is selected, it is removed from the
population
 Sampling with replacement
 A selected object is not removed from the population
 Stratified sampling:
 Partition the data set, and draw samples from each
partition (proportionally, i.e., approximately the
same percentage of the data)
 Used in conjunction with skewed data
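A minimal sketch (Python standard library only) of the sampling variants above; the customer records and the age_group stratum label are invented for illustration.

import random
from collections import defaultdict

def srswor(data, n):                 # simple random sample without replacement
    return random.sample(data, n)

def srswr(data, n):                  # simple random sample with replacement
    return [random.choice(data) for _ in range(n)]

def stratified_sample(records, stratum_of, fraction):
    """Draw approximately the same percentage of records from every stratum."""
    strata = defaultdict(list)
    for r in records:
        strata[stratum_of(r)].append(r)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(random.sample(group, k))
    return sample

customers = [{"id": i, "age_group": "young" if i % 3 else "senior"} for i in range(100)]
print(len(stratified_sample(customers, lambda r: r["age_group"], 0.1)))   # about 10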
150
Sampling: With or without Replacement
SRSWOR
(simple random
sample without
replacement)
SRSWR
Raw Data
151
Sampling: Cluster or Stratified Sampling
Raw Data Cluster/Stratified Sample
152
Data Cube Aggregation
 The lowest level of a data cube (base cuboid)
 The aggregated data for an individual entity of
interest
 E.g., a customer in a phone calling data warehouse
 Multiple levels of aggregation in data cubes
 Further reduce the size of data to deal with
 Reference appropriate levels
 Use the smallest representation which is enough to
solve the task
 Queries regarding aggregated information should be
answered using data cube, when possible
153
Data Reduction 3: Data Compression
 String compression
 There are extensive theories and well-tuned
algorithms
 Typically lossless, but only limited manipulation is
possible without expansion
 Audio/video compression
 Typically lossy compression, with progressive
refinement
 Sometimes small fragments of signal can be
reconstructed without reconstructing the whole
 Time sequence is not audio
 Typically short and varies slowly with time
 Dimensionality and numerosity reduction may also be considered as forms of data compression
154
Data Compression
[Figure: lossless compression maps the original data to compressed data and back without loss; lossy compression recovers only an approximation of the original data]
155
Chapter 3: Data Preprocessing
 Data Preprocessing: An Overview
 Data Quality
 Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
 Summary
156
Data Transformation
 A function that maps the entire set of values of a given attribute
to a new set of replacement values s.t. each old value can be
identified with one of the new values
 Methods
 Smoothing: Remove noise from data
 Attribute/feature construction

New attributes constructed from the given ones
 Aggregation: Summarization, data cube construction
 Normalization: Scaled to fall within a smaller, specified range

min-max normalization

z-score normalization

normalization by decimal scaling
 Discretization: Concept hierarchy climbing
157
Normalization
 Min-max normalization: to [new_min_A, new_max_A]
v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A
 Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to
(73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
 Z-score normalization (μ: mean, σ: standard deviation):
v' = (v − μ_A) / σ_A
 Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225
 Normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
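A minimal sketch (plain Python) of the three normalization methods above, reproducing the income example; the example values for decimal scaling are invented.

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

def decimal_scaling(values):
    j = 0                                        # smallest j such that max(|v'|) < 1
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(min_max(73600, 12000, 98000))      # 0.716...
print(z_score(73600, 54000, 16000))      # 1.225
print(decimal_scaling([-986, 917]))      # [-0.986, 0.917]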
158
Discretization
 Three types of attributes
 Nominal—values from an unordered set, e.g., color, profession
 Ordinal—values from an ordered set, e.g., military or academic
rank
 Numeric—numeric values, e.g., integers or real numbers
 Discretization: Divide the range of a continuous attribute into
intervals
 Interval labels can then be used to replace actual data values
 Reduce data size by discretization
 Supervised vs. unsupervised
 Split (top-down) vs. merge (bottom-up)
 Discretization can be performed recursively on an attribute
 Prepare for further analysis, e.g., classification
159
Data Discretization Methods
 Typical methods: All the methods can be applied
recursively
 Binning

Top-down split, unsupervised
 Histogram analysis

Top-down split, unsupervised
 Clustering analysis (unsupervised, top-down split or
bottom-up merge)
 Decision-tree analysis (supervised, top-down split)
 Correlation (e.g., 2
) analysis (unsupervised, bottom-
up merge)
160
Simple Discretization: Binning
 Equal-width (distance) partitioning
 Divides the range into N intervals of equal size: uniform grid
 if A and B are the lowest and highest values of the attribute,
the width of intervals will be: W = (B –A)/N.
 The most straightforward, but outliers may dominate
presentation
 Skewed data is not handled well
 Equal-depth (frequency) partitioning
 Divides the range into N intervals, each containing
approximately same number of samples
 Good data scaling
 Managing categorical attributes can be tricky
161
Binning Methods for Data Smoothing
 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
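A minimal sketch (plain Python) of equal-frequency binning with smoothing by bin means and by bin boundaries, reproducing the price example above.

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted
n_bins = 3
size = len(prices) // n_bins
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]

# Smoothing by bin means: every value becomes its bin's mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the closer of min/max
by_boundaries = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                 for b in bins]

print(by_means)        # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_boundaries)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]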
162
Discretization Without Using Class Labels
(Binning vs. Clustering)
[Figure: the same data discretized by equal-width binning, by equal-frequency binning, and by K-means clustering; clustering yields the most natural intervals]
163
Discretization by Classification &
Correlation Analysis
 Classification (e.g., decision tree analysis)
 Supervised: Given class labels, e.g., cancerous vs. benign
 Using entropy to determine split point (discretization point)
 Top-down, recursive split
 Details to be covered in Chapter 7
 Correlation analysis (e.g., Chi-merge: χ²-based discretization)
 Supervised: use class information
 Bottom-up merge: find the best neighboring intervals (those having
similar distributions of classes, i.e., low χ² values) to merge
 Merge performed recursively, until a predefined stopping condition is met
164
Concept Hierarchy Generation
 Concept hierarchy organizes concepts (i.e., attribute values)
hierarchically and is usually associated with each dimension in a
data warehouse
 Concept hierarchies facilitate drilling and rolling in data
warehouses to view data at multiple levels of granularity
 Concept hierarchy formation: Recursively reduce the data by
collecting and replacing low level concepts (such as numeric values
for age) by higher level concepts (such as youth, adult, or senior)
 Concept hierarchies can be explicitly specified by domain experts
and/or data warehouse designers
 Concept hierarchy can be automatically formed for both numeric
and nominal data. For numeric data, use discretization methods
shown.
165
Concept Hierarchy Generation
for Nominal Data
 Specification of a partial/total ordering of attributes
explicitly at the schema level by users or experts
 street < city < state < country
 Specification of a hierarchy for a set of values by
explicit data grouping
 {Urbana, Champaign, Chicago} < Illinois
 Specification of only a partial set of attributes
 E.g., only street < city, not others
 Automatic generation of hierarchies (or attribute
levels) by the analysis of the number of distinct values
 E.g., for a set of attributes: {street, city, state, country}
166
Automatic Concept Hierarchy Generation
 Some hierarchies can be automatically generated based on
the analysis of the number of distinct values per attribute in
the data set
 The attribute with the most distinct values is placed at
the lowest level of the hierarchy
 Exceptions, e.g., weekday, month, quarter, year
country: 15 distinct values
province_or_state: 365 distinct values
city: 3,567 distinct values
street: 674,339 distinct values
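A minimal sketch (plain Python) of the heuristic above: order the attributes by their number of distinct values, placing the most distinct at the lowest level; the counts are the ones shown on the slide.

distinct_counts = {
    "country": 15,
    "province_or_state": 365,
    "city": 3567,
    "street": 674339,
}

# most specific (most distinct values) first, most general last
hierarchy = sorted(distinct_counts, key=distinct_counts.get)
print(" < ".join(reversed(hierarchy)))
# street < city < province_or_state < country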
167
Chapter 3: Data Preprocessing
 Data Preprocessing: An Overview
 Data Quality
 Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
 Summary
168
Summary
 Data quality: accuracy, completeness, consistency, timeliness,
believability, interpretability
 Data cleaning: e.g. missing/noisy values, outliers
 Data integration from multiple sources:
 Entity identification problem
 Remove redundancies
 Detect inconsistencies
 Data reduction
 Dimensionality reduction
 Numerosity reduction
 Data compression
 Data transformation and data discretization
 Normalization
 Concept hierarchy generation
169
References
 D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Comm. of
ACM, 42:73-78, 1999
 A. Bruce, D. Donoho, and H.-Y. Gao. Wavelet analysis. IEEE Spectrum, Oct 1996
 T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley, 2003
 J. Devore and R. Peck. Statistics: The Exploration and Analysis of Data. Duxbury Press, 1997.
 H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-A. Saita. Declarative data cleaning:
Language, model, and algorithms. VLDB'01
 M. Hua and J. Pei. Cleaning disguised missing data: A heuristic approach. KDD'07
 H. V. Jagadish, et al., Special Issue on Data Reduction Techniques. Bulletin of the Technical
Committee on Data Engineering, 20(4), Dec. 1997
 H. Liu and H. Motoda (eds.). Feature Extraction, Construction, and Selection: A Data Mining
Perspective. Kluwer Academic, 1998
 J. E. Olson. Data Quality: The Accuracy Dimension. Morgan Kaufmann, 2003
 D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999
 V. Raman and J. Hellerstein. Potters Wheel: An Interactive Framework for Data Cleaning and
Transformation, VLDB’2001
 T. Redman. Data Quality: The Field Guide. Digital Press (Elsevier), 2001
 R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans.
Knowledge and Data Engineering, 7:623-640, 1995
170
170
Data Mining:
Concepts and Techniques
(3rd
ed.)
— Chapter 4 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
171
Chapter 4: Data Warehousing and On-line
Analytical Processing
 Data Warehouse: Basic Concepts
 Data Warehouse Modeling: Data Cube and
OLAP
 Data Warehouse Design and Usage
 Data Warehouse Implementation
 Data Generalization by Attribute-Oriented
Induction
 Summary
172
What is a Data Warehouse?
 Defined in many different ways, but not rigorously.
 A decision support database that is maintained separately
from the organization’s operational database
 Support information processing by providing a solid platform
of consolidated, historical data for analysis.
 “A data warehouse is a subject-oriented, integrated, time-variant,
and nonvolatile collection of data in support of management’s
decision-making process.”—W. H. Inmon
 Data warehousing:
 The process of constructing and using data warehouses
173
Data Warehouse—Subject-Oriented
 Organized around major subjects, such as customer,
product, sales
 Focusing on the modeling and analysis of data for
decision makers, not on daily operations or
transaction processing
 Provide a simple and concise view around particular
subject issues by excluding data that are not useful in
the decision support process
174
Data Warehouse—Integrated
 Constructed by integrating multiple, heterogeneous
data sources
 relational databases, flat files, on-line transaction
records
 Data cleaning and data integration techniques are
applied.
 Ensure consistency in naming conventions,
encoding structures, attribute measures, etc.
among different data sources

E.g., Hotel price: currency, tax, breakfast covered, etc.
 When data is moved to the warehouse, it is
converted.
175
Data Warehouse—Time Variant
 The time horizon for the data warehouse is
significantly longer than that of operational systems
 Operational database: current value data
 Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
 Every key structure in the data warehouse
 Contains an element of time, explicitly or implicitly
 But the key of operational data may or may not
contain “time element”
176
Data Warehouse—Nonvolatile
 A physically separate store of data transformed from
the operational environment
 Operational update of data does not occur in the
data warehouse environment
 Does not require transaction processing, recovery,
and concurrency control mechanisms
 Requires only two operations in data accessing:

initial loading of data and access of data
177
OLTP vs. OLAP
                    OLTP                                   OLAP
users               clerk, IT professional                 knowledge worker
function            day-to-day operations                  decision support
DB design           application-oriented                   subject-oriented
data                current, up-to-date, detailed,         historical, summarized, multidimensional,
                    flat relational, isolated              integrated, consolidated
usage               repetitive                             ad-hoc
access              read/write, index/hash on prim. key    lots of scans
unit of work        short, simple transaction              complex query
# records accessed  tens                                   millions
# users             thousands                              hundreds
DB size             100 MB–GB                              100 GB–TB
metric              transaction throughput                 query throughput, response time
178
Why a Separate Data Warehouse?
 High performance for both systems
 DBMS— tuned for OLTP: access methods, indexing,
concurrency control, recovery
 Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation
 Different functions and different data:
 missing data: Decision support requires historical data which
operational DBs do not typically maintain
 data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources
 data quality: different sources typically use inconsistent data
representations, codes and formats which have to be
reconciled
 Note: There are more and more systems which perform OLAP
analysis directly on relational databases
179
Data Warehouse: A Multi-Tiered Architecture
[Figure: multi-tiered architecture — data sources (operational DBs, other sources) feed an ETL layer (extract, transform, load, refresh) with a monitor & integrator and a metadata repository; the data storage tier holds the data warehouse and data marts, served by an OLAP server/engine to front-end tools for analysis, querying, reporting, and data mining]
180
Three Data Warehouse Models
 Enterprise warehouse
 collects all of the information about subjects
spanning the entire organization
 Data Mart
a subset of corporate-wide data that is of value to a
specific group of users. Its scope is confined to
specific, selected groups, such as marketing data
mart

Independent vs. dependent (directly from warehouse) data
mart
 Virtual warehouse
 A set of views over operational databases
 Only some of the possible summary views may be materialized
181
Extraction, Transformation, and Loading (ETL)
 Data extraction
 get data from multiple, heterogeneous, and external
sources
 Data cleaning
 detect errors in the data and rectify them when
possible
 Data transformation
 convert data from legacy or host format to
warehouse format
 Load
 sort, summarize, consolidate, compute views, check
integrity, and build indices and partitions
 Refresh
 propagate the updates from the data sources to the
warehouse
182
Metadata Repository
 Meta data is the data defining warehouse objects. It stores:
 Description of the structure of the data warehouse
 schema, view, dimensions, hierarchies, derived data defn, data
mart locations and contents
 Operational meta-data
 data lineage (history of migrated data and transformation
path), currency of data (active, archived, or purged), monitoring
information (warehouse usage statistics, error reports, audit
trails)
 The algorithms used for summarization
 The mapping from operational environment to the data
warehouse
 Data related to system performance
 warehouse schema, view and derived data definitions
 Business data
183
Chapter 4: Data Warehousing and On-line
Analytical Processing
 Data Warehouse: Basic Concepts
 Data Warehouse Modeling: Data Cube and
OLAP
 Data Warehouse Design and Usage
 Data Warehouse Implementation
 Data Generalization by Attribute-Oriented
Induction
 Summary
184
From Tables and Spreadsheets to
Data Cubes
 A data warehouse is based on a multidimensional data model
which views data in the form of a data cube
 A data cube, such as sales, allows data to be modeled and
viewed in multiple dimensions
 Dimension tables, such as item (item_name, brand, type), or
time(day, week, month, quarter, year)
 Fact table contains measures (such as dollars_sold) and keys
to each of the related dimension tables
 In data warehousing literature, an n-D base cube is called a base
cuboid. The top most 0-D cuboid, which holds the highest-level
of summarization, is called the apex cuboid. The lattice of
cuboids forms a data cube.
185
Cube: A Lattice of Cuboids
time,item
time,item,location
time, item, location, supplier
all
time item location supplier
time,location
time,supplier
item,location
item,supplier
location,supplier
time,item,supplier
time,location,supplier
item,location,supplier
0-D (apex) cuboid
1-D cuboids
2-D cuboids
3-D cuboids
4-D (base) cuboid
186
Conceptual Modeling of Data Warehouses
 Modeling data warehouses: dimensions & measures
 Star schema: A fact table in the middle connected to
a set of dimension tables
 Snowflake schema: A refinement of star schema
where some dimensional hierarchy is normalized
into a set of smaller dimension tables, forming a
shape similar to snowflake
 Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation
187
Example of Star Schema
time dimension: time_key, day, day_of_the_week, month, quarter, year
item dimension: item_key, item_name, brand, type, supplier_type
branch dimension: branch_key, branch_name, branch_type
location dimension: location_key, street, city, state_or_province, country
Sales fact table: time_key, item_key, branch_key, location_key; Measures: units_sold, dollars_sold, avg_sales
188
Example of Snowflake Schema
time dimension: time_key, day, day_of_the_week, month, quarter, year
item dimension: item_key, item_name, brand, type, supplier_key
supplier dimension: supplier_key, supplier_type
branch dimension: branch_key, branch_name, branch_type
location dimension: location_key, street, city_key
city dimension: city_key, city, state_or_province, country
Sales fact table: time_key, item_key, branch_key, location_key; Measures: units_sold, dollars_sold, avg_sales
189
Example of Fact Constellation
time dimension: time_key, day, day_of_the_week, month, quarter, year
item dimension: item_key, item_name, brand, type, supplier_type
branch dimension: branch_key, branch_name, branch_type
location dimension: location_key, street, city, province_or_state, country
shipper dimension: shipper_key, shipper_name, location_key, shipper_type
Sales fact table: time_key, item_key, branch_key, location_key; Measures: units_sold, dollars_sold, avg_sales
Shipping fact table: time_key, item_key, shipper_key, from_location, to_location; Measures: dollars_cost, units_shipped
190
A Concept Hierarchy:
Dimension (location)
[Figure: concept hierarchy for dimension location — all > region (Europe, North_America) > country (Germany, Spain, Canada, Mexico, ...) > city (Frankfurt, ..., Vancouver, Toronto, ...) > office (L. Chan, M. Wind, ...)]
191
Data Cube Measures: Three Categories
 Distributive: if the result derived by applying the
function to n aggregate values is the same as that
derived by applying the function on all the data without
partitioning

E.g., count(), sum(), min(), max()
 Algebraic: if it can be computed by an algebraic
function with M arguments (where M is a bounded
integer), each of which is obtained by applying a
distributive aggregate function

E.g., avg(), min_N(), standard_deviation()
 Holistic: if there is no constant bound on the storage
size needed to describe a subaggregate.

E.g., median(), mode(), rank()
192
View of Warehouses and Hierarchies
Specification of hierarchies
 Schema hierarchy
day < {month <
quarter; week} < year
 Set_grouping hierarchy
{1..10} < inexpensive
193
Multidimensional Data
 Sales volume as a function of product, month,
and region
[Figure: 3-D view of sales volume with axes Product, Region, and Month]
Dimensions: Product, Location, Time
Hierarchical summarization paths:
Product: Industry > Category > Product
Location: Region > Country > City > Office
Time: Year > Quarter > Month / Week > Day
194
A Sample Data Cube
[Figure: a sample 3-D data cube of sales with dimensions Date (1Qtr–4Qtr), Product (TV, VCR, PC), and Country (U.S.A., Canada, Mexico), plus sum totals along each dimension; e.g., one aggregate cell holds the total annual sales of TVs in U.S.A.]
195
Cuboids Corresponding to the Cube
all
product date country
product,date product,country date, country
product, date, country
0-D (apex) cuboid
1-D cuboids
2-D cuboids
3-D (base) cuboid
196
Typical OLAP Operations
 Roll up (drill-up): summarize data
 by climbing up hierarchy or by dimension reduction
 Drill down (roll down): reverse of roll-up
 from higher level summary to lower level summary or
detailed data, or introducing new dimensions

Slice and dice: project and select
 Pivot (rotate):
 reorient the cube, visualization, 3D to series of 2D planes
 Other operations
 drill across: involving (across) more than one fact table
 drill through: through the bottom level of the cube to its
back-end relational tables (using SQL)
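A minimal sketch of roll-up, slice, and dice on a toy sales table using the pandas library; the table and column names are invented for illustration and are not the book's data set.

import pandas as pd

sales = pd.DataFrame({
    "year":    [2003, 2003, 2004, 2004],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "city":    ["Chicago", "Chicago", "Toronto", "Toronto"],
    "dollars": [400, 350, 500, 450],
})

# Roll up: climb the time hierarchy from quarter to year
rollup = sales.groupby("year")["dollars"].sum()

# Slice: select on one dimension (quarter = "Q1")
q1_slice = sales[sales["quarter"] == "Q1"]

# Dice: select on two or more dimensions, then project
dice = sales[(sales["quarter"] == "Q1") & (sales["city"] == "Toronto")][["city", "dollars"]]

print(rollup, q1_slice, dice, sep="\n")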
197
Fig. 3.10 Typical
OLAP Operations
198
A Star-Net Query Model
[Figure: star-net query model with radial lines for Shipping Method (AIR-EXPRESS, TRUCK), Customer Orders (ORDER, CONTRACTS), Customer, Product (PRODUCT GROUP, PRODUCT LINE, PRODUCT ITEM), Organization (SALES PERSON, DISTRICT, DIVISION), Promotion, Location (CITY, COUNTRY, REGION), and Time (DAILY, QTRLY, ANNUALLY); each circle is called a footprint]
199
Browsing a Data Cube
 Visualization
 OLAP capabilities
 Interactive
manipulation
200
Chapter 4: Data Warehousing and On-line
Analytical Processing
 Data Warehouse: Basic Concepts
 Data Warehouse Modeling: Data Cube and
OLAP
 Data Warehouse Design and Usage
 Data Warehouse Implementation
 Data Generalization by Attribute-Oriented
Induction
 Summary
201
Design of Data Warehouse: A Business
Analysis Framework
 Four views regarding the design of a data warehouse
 Top-down view

allows selection of the relevant information necessary for
the data warehouse
 Data source view

exposes the information being captured, stored, and
managed by operational systems
 Data warehouse view

consists of fact tables and dimension tables
 Business query view

sees the perspectives of data in the warehouse from the
view of end-user
202
Data Warehouse Design Process
 Top-down, bottom-up approaches or a combination of both
 Top-down: Starts with overall design and planning (mature)
 Bottom-up: Starts with experiments and prototypes (rapid)
 From software engineering point of view
 Waterfall: structured and systematic analysis at each step
before proceeding to the next
 Spiral: rapid generation of increasingly functional systems,
short turn around time, quick turn around
 Typical data warehouse design process
 Choose a business process to model, e.g., orders, invoices, etc.
 Choose the grain (atomic level of data) of the business process
 Choose the dimensions that will apply to each fact table record
 Choose the measure that will populate each fact table record
203
Data Warehouse Development:
A Recommended Approach
[Figure: recommended approach — define a high-level corporate data model, refine it into data marts and an enterprise data warehouse in parallel, build distributed data marts, and combine them into a multi-tier data warehouse]
204
Data Warehouse Usage
 Three kinds of data warehouse applications
 Information processing

supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
 Analytical processing

multidimensional analysis of data warehouse data

supports basic OLAP operations, slice-dice, drilling,
pivoting
 Data mining

knowledge discovery from hidden patterns

supports associations, constructing analytical models,
performing classification and prediction, and presenting
the mining results using visualization tools
205
From On-Line Analytical Processing (OLAP)
to On Line Analytical Mining (OLAM)
 Why online analytical mining?
 High quality of data in data warehouses

DW contains integrated, consistent, cleaned data
 Available information processing structure
surrounding data warehouses

ODBC, OLEDB, Web accessing, service facilities,
reporting and OLAP tools
 OLAP-based exploratory data analysis

Mining with drilling, dicing, pivoting, etc.
 On-line selection of data mining functions

Integration and swapping of multiple mining
functions, algorithms, and tasks
206
Chapter 4: Data Warehousing and On-line
Analytical Processing
 Data Warehouse: Basic Concepts
 Data Warehouse Modeling: Data Cube and
OLAP
 Data Warehouse Design and Usage
 Data Warehouse Implementation
 Data Generalization by Attribute-Oriented
Induction
 Summary
207
Efficient Data Cube Computation
 Data cube can be viewed as a lattice of cuboids
 The bottom-most cuboid is the base cuboid
 The top-most cuboid (apex) contains only one cell
 How many cuboids in an n-dimensional cube with L
levels?
 Materialization of data cube
 Materialize every (cuboid) (full materialization),
none (no materialization), or some (partial
materialization)
 Selection of which cuboids to materialize

Based on size, sharing, access frequency, etc.
Total number of cuboids: T = ∏_{i=1}^{n} (L_i + 1), where L_i is the number of levels associated with dimension i
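A tiny worked example of the formula above (plain Python, made-up level counts): a cube with 3 dimensions whose hierarchies have 3, 4, and 2 levels.

levels = [3, 4, 2]
T = 1
for L_i in levels:
    T *= (L_i + 1)
print(T)   # 4 * 5 * 3 = 60 cuboids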
208
The “Compute Cube” Operator
 Cube definition and computation in DMQL
define cube sales [item, city, year]: sum (sales_in_dollars)
compute cube sales
 Transform it into a SQL-like language (with a new operator
cube by, introduced by Gray et al.’96)
SELECT item, city, year, SUM (amount)
FROM SALES
CUBE BY item, city, year
 Need to compute the following Group-Bys
(date, product, customer),
(date,product),(date, customer), (product, customer),
(date), (product), (customer)
()
(item)
(city)
()
(year)
(city, item) (city, year) (item, year)
(city, item, year)
209
Indexing OLAP Data: Bitmap Index
 Index on a particular column
 Each value in the column has a bit vector: bit-op is fast
 The length of the bit vector: # of records in the base table
 The i-th bit is set if the i-th row of the base table has the value for
the indexed column
 not suitable for high cardinality domains
 A recent bit compression technique, Word-Aligned Hybrid (WAH),
makes it work for high cardinality domain as well [Wu, et al.
TODS’06]
Base table:
Cust  Region   Type
C1    Asia     Retail
C2    Europe   Dealer
C3    Asia     Dealer
C4    America  Retail
C5    Europe   Dealer
Index on Region:
RecID  Asia  Europe  America
1      1     0       0
2      0     1       0
3      1     0       0
4      0     0       1
5      0     1       0
Index on Type:
RecID  Retail  Dealer
1      1       0
2      0       1
3      0       1
4      1       0
5      0       1
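A minimal sketch (plain Python) of a bitmap index over the Region column of the base table above; each distinct value gets one bit vector (stored here as a Python integer) whose i-th bit is set when the i-th row carries that value.

regions = ["Asia", "Europe", "Asia", "America", "Europe"]   # column "Region"

def build_bitmap_index(column):
    index = {}
    for i, value in enumerate(column):
        index.setdefault(value, 0)
        index[value] |= 1 << i            # set bit i for this value's bit vector
    return index

idx = build_bitmap_index(regions)
# Fast bit-op: rows where Region is Asia OR Europe
asia_or_europe = idx["Asia"] | idx["Europe"]
print([i + 1 for i in range(len(regions)) if asia_or_europe >> i & 1])   # [1, 2, 3, 5]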
210
Indexing OLAP Data: Join Indices
 Join index: JI(R-id, S-id) where R(R-id, …) ⋈ S(S-id, …)
 Traditional indices map the values to a list of
record ids
 It materializes relational join in JI file and
speeds up relational join
 In data warehouses, join index relates the
values of the dimensions of a star schema to
rows in the fact table.
 E.g. fact table: Sales and two dimensions
city and product

A join index on city maintains for each
distinct city a list of R-IDs of the tuples
recording the Sales in the city
 Join indices can span multiple dimensions
211
Efficient Processing OLAP Queries
 Determine which operations should be performed on the available
cuboids
 Transform drill, roll, etc. into corresponding SQL and/or OLAP
operations, e.g., dice = selection + projection
 Determine which materialized cuboid(s) should be selected for OLAP op.
 Let the query to be processed be on {brand, province_or_state} with the
condition “year = 2004”, and there are 4 materialized cuboids available:
1) {year, item_name, city}
2) {year, brand, country}
3) {year, brand, province_or_state}
4) {item_name, province_or_state} where year = 2004
Which should be selected to process the query?
 Explore indexing structures and compressed vs. dense array structures in MOLAP
212
OLAP Server Architectures
 Relational OLAP (ROLAP)
 Use relational or extended-relational DBMS to store and
manage warehouse data and OLAP middle ware
 Include optimization of DBMS backend, implementation of
aggregation navigation logic, and additional tools and
services
 Greater scalability
 Multidimensional OLAP (MOLAP)
 Sparse array-based multidimensional storage engine
 Fast indexing to pre-computed summarized data
 Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)
 Flexibility, e.g., low level: relational, high-level: array
 Specialized SQL servers (e.g., Redbricks)
 Specialized support for SQL queries over star/snowflake schemas
213
Chapter 4: Data Warehousing and On-line
Analytical Processing
 Data Warehouse: Basic Concepts
 Data Warehouse Modeling: Data Cube and
OLAP
 Data Warehouse Design and Usage
 Data Warehouse Implementation
 Data Generalization by Attribute-Oriented
Induction
 Summary
214
Attribute-Oriented Induction
 Proposed in 1989 (KDD ‘89 workshop)
 Not confined to categorical data nor particular
measures
 How is it done?
 Collect the task-relevant data (initial relation) using a
relational database query
 Perform generalization by attribute removal or
attribute generalization
 Apply aggregation by merging identical,
generalized tuples and accumulating their
respective counts
 Interaction with users for knowledge presentation
215
Attribute-Oriented Induction: An Example
Example: Describe general characteristics of graduate
students in the University database
 Step 1. Fetch relevant set of data using an SQL
statement, e.g.,
Select * (i.e., name, gender, major, birth_place,
birth_date, residence, phone#, gpa)
from student
where student_status in {“Msc”, “MBA”, “PhD” }
 Step 2. Perform attribute-oriented induction
 Step 3. Present results in generalized relation, cross-tab,
or rule forms
216
Class Characterization: An Example
Initial relation (task-relevant data):
Name            Gender  Major    Birth-Place            Birth_date  Residence                 Phone #   GPA
Jim Woodman     M       CS       Vancouver, BC, Canada  8-12-76     3511 Main St., Richmond   687-4598  3.67
Scott Lachance  M       CS       Montreal, Que, Canada  28-7-75     345 1st Ave., Richmond    253-9106  3.70
Laura Lee       F       Physics  Seattle, WA, USA       25-8-70     125 Austin Ave., Burnaby  420-5232  3.83
…               …       …        …                      …           …                         …         …
Generalization plan: Name removed; Gender retained; Major generalized to {Sci, Eng, Bus}; Birth-Place to Country; Birth_date to Age range; Residence to City; Phone # removed; GPA to {Excl, VG, …}
Prime generalized relation:
Gender  Major    Birth_region  Age_range  Residence  GPA        Count
M       Science  Canada        20-25      Richmond   Very-good  16
F       Science  Foreign       25-30      Burnaby    Excellent  22
…       …        …             …          …          …          …
Crosstab (Gender × Birth_Region):
        Canada  Foreign  Total
M       16      14       30
F       10      22       32
Total   26      36       62
217
Basic Principles of Attribute-Oriented Induction
 Data focusing: task-relevant data, including
dimensions, and the result is the initial relation
 Attribute-removal: remove attribute A if there is a large
set of distinct values for A but (1) there is no
generalization operator on A, or (2) A’s higher level
concepts are expressed in terms of other attributes
 Attribute-generalization: If there is a large set of
distinct values for A, and there exists a set of
generalization operators on A, then select an operator
and generalize A
 Attribute-threshold control: typical 2-8,
specified/default
218
Attribute-Oriented Induction: Basic
Algorithm
 InitialRel: Query processing of task-relevant data,
deriving the initial relation.
 PreGen: Based on the analysis of the number of distinct
values in each attribute, determine generalization plan
for each attribute: removal? or how high to generalize?
 PrimeGen: Based on the PreGen plan, perform
generalization to the right level to derive a “prime
generalized relation”, accumulating the counts.
 Presentation: User interaction: (1) adjust levels by
drilling, (2) pivoting, (3) mapping into rules, cross tabs,
visualization presentations.
219
Presentation of Generalized Results
 Generalized relation:
 Relations where some or all attributes are generalized, with
counts or other aggregation values accumulated.
 Cross tabulation:
 Mapping results into cross tabulation form (similar to
contingency tables).
 Visualization techniques:
 Pie charts, bar charts, curves, cubes, and other visual forms.
 Quantitative characteristic rules:
 Mapping generalized result into characteristic rules with
quantitative information associated with it, e.g.,
∀x, grad(x) ∧ male(x) ⇒ birth_region(x) = "Canada" [t: 53%] ∨ birth_region(x) = "foreign" [t: 47%]
220
Mining Class Comparisons
 Comparison: Comparing two or more classes
 Method:
 Partition the set of relevant data into the target class and the
contrasting class(es)
 Generalize both classes to the same high level concepts
 Compare tuples with the same high level descriptions
 Present for every tuple its description and two measures
 support - distribution within single class
 comparison - distribution between classes
 Highlight the tuples with strong discriminant features
 Relevance Analysis:
 Find attributes (features) which best distinguish different
classes
221
Concept Description vs. Cube-Based OLAP
 Similarity:
 Data generalization
 Presentation of data summarization at multiple levels
of abstraction
 Interactive drilling, pivoting, slicing and dicing
 Differences:
 OLAP has systematic preprocessing, query
independent, and can drill down to rather low level
 AOI has automated desired level allocation, and may
perform dimension relevance analysis/ranking when
there are many relevant dimensions
 AOI works on the data which are not in relational
forms
222
Chapter 4: Data Warehousing and On-line
Analytical Processing
 Data Warehouse: Basic Concepts
 Data Warehouse Modeling: Data Cube and
OLAP
 Data Warehouse Design and Usage
 Data Warehouse Implementation
 Data Generalization by Attribute-Oriented
Induction
 Summary
223
Summary
 Data warehousing: A multi-dimensional model of a data warehouse
 A data cube consists of dimensions & measures
 Star schema, snowflake schema, fact constellations
 OLAP operations: drilling, rolling, slicing, dicing and pivoting
 Data Warehouse Architecture, Design, and Usage
 Multi-tiered architecture
 Business analysis design framework
 Information processing, analytical processing, data mining, OLAM
(Online Analytical Mining)
 Implementation: Efficient computation of data cubes
 Partial vs. full vs. no materialization
 Indexing OLAP data: Bitmap index and join index
 OLAP query processing
 OLAP servers: ROLAP, MOLAP, HOLAP
 Data generalization: Attribute-oriented induction
224
References (I)
 S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S.
Sarawagi. On the computation of multidimensional aggregates. VLDB’96
 D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data
warehouses. SIGMOD’97
 R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. ICDE’97
 S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM
SIGMOD Record, 26:65-74, 1997
 E. F. Codd, S. B. Codd, and C. T. Salley. Beyond decision support. Computer World, 27, July 1993.
 J. Gray, et al. Data cube: A relational aggregation operator generalizing group-by, cross-tab and
sub-totals. Data Mining and Knowledge Discovery, 1:29-54, 1997.
 A. Gupta and I. S. Mumick. Materialized Views: Techniques, Implementations, and
Applications. MIT Press, 1999.
 J. Han. Towards on-line analytical mining in large databases. ACM SIGMOD Record, 27:97-107,
1998.
 V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently.
SIGMOD’96
 J. Hellerstein, P. Haas, and H. Wang. Online aggregation. SIGMOD'97
225
References (II)
 C. Imhoff, N. Galemmo, and J. G. Geiger. Mastering Data Warehouse Design: Relational and
Dimensional Techniques. John Wiley, 2003
 W. H. Inmon. Building the Data Warehouse. John Wiley, 1996
 R. Kimball and M. Ross. The Data Warehouse Toolkit: The Complete Guide to Dimensional
Modeling. 2ed. John Wiley, 2002
 P. O’Neil and G. Graefe. Multi-table joins through bitmapped join indices. SIGMOD Record, 24:8–
11, Sept. 1995.
 P. O'Neil and D. Quass. Improved query performance with variant indexes. SIGMOD'97
 Microsoft. OLEDB for OLAP programmer's reference version 1.0. In
http://www.microsoft.com/data/oledb/olap, 1998
 S. Sarawagi and M. Stonebraker. Efficient organization of large multidimensional arrays. ICDE'94
 A. Shoshani. OLAP and statistical databases: Similarities and differences. PODS’00.
 D. Srivastava, S. Dar, H. V. Jagadish, and A. V. Levy. Answering queries with aggregation using
views. VLDB'96
 P. Valduriez. Join indices. ACM Trans. Database Systems, 12:218-246, 1987.
 J. Widom. Research problems in data warehousing. CIKM’95
 K. Wu, E. Otoo, and A. Shoshani, Optimal Bitmap Indices with Efficient Compression, ACM Trans.
on Database Systems (TODS), 31(1): 1-38, 2006
226
Surplus Slides
227
Compression of Bitmap Indices
 Bitmap indexes must be compressed to reduce I/O
costs and minimize CPU usage—majority of the bits
are 0’s
 Two compression schemes:
 Byte-aligned Bitmap Code (BBC)
 Word-Aligned Hybrid (WAH) code
 Time and space required to operate on compressed
bitmap is proportional to the total size of the bitmap
 Optimal on attributes of low cardinality as well as
those of high cardinality.
 WAH outperforms BBC by about a factor of two
228
228
Data Mining:
Concepts and Techniques
(3rd
ed.)
— Chapter 5 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2010 Han, Kamber & Pei. All rights reserved.
229
229
Chapter 5: Data Cube Technology
 Data Cube Computation: Preliminary
Concepts
 Data Cube Computation Methods
 Processing Advanced Queries by Exploring
Data Cube Technology
 Multidimensional Data Analysis in Cube Space
 Summary
230
230
Data Cube: A Lattice of Cuboids
time,item
time,item,location
time, item, location, supplier
all
time item location supplier
time,location
time,supplier
item,location
item,supplier
location,supplier
time,item,supplier
time,location,supplier
item,location,supplier
0-D(apex) cuboid
1-D cuboids
2-D cuboids
3-D cuboids
4-D(base) cuboid
231
Data Cube: A Lattice of Cuboids
 Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child
cells
1. (9/15, milk, Urbana, Dairy_land)
2. (9/15, milk, Urbana, *)
3. (*, milk, Urbana, *)
4. (*, milk, Urbana, *)
5. (*, milk, Chicago, *)
6. (*, milk, *, *)
all
time,item
time,item,location
time, item, location, supplier
time item location supplier
time,location
time,supplier
item,location
item,supplier
location,supplier
time,item,supplier
time,location,supplier
item,location,supplier
0-D(apex) cuboid
1-D cuboids
2-D cuboids
3-D cuboids
4-D(base) cuboid
232
232
Cube Materialization:
Full Cube vs. Iceberg Cube
 Full cube vs. iceberg cube
compute cube sales iceberg as
select month, city, customer group, count(*)
from salesInfo
cube by month, city, customer group
having count(*) >= min support
 Computing only the cuboid cells whose measure satisfies the
iceberg condition
 Only a small portion of cells may be “above the water’’ in a
sparse cube
 Avoid explosive growth: A cube with 100 dimensions
 2 base cells: (a1, a2, …., a100), (b1, b2, …, b100)

How many aggregate cells if “having count >= 1”?

What about “having count >= 2”?
iceberg
condition
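A minimal sketch of the iceberg-cube idea above using the pandas library: compute every group-by of the three dimensions but keep only the cells whose count satisfies the iceberg condition. The toy table and column names are invented for illustration.

from itertools import combinations
import pandas as pd

salesInfo = pd.DataFrame({
    "month": ["Jan", "Jan", "Jan", "Feb"],
    "city": ["Chicago", "Chicago", "Urbana", "Chicago"],
    "customer_group": ["Edu", "Edu", "Edu", "Biz"],
})
min_sup = 2

dims = ["month", "city", "customer_group"]
iceberg = {}
for k in range(1, len(dims) + 1):
    for combo in combinations(dims, k):              # every (non-apex) cuboid
        counts = salesInfo.groupby(list(combo)).size()
        iceberg[combo] = counts[counts >= min_sup]   # keep cells above the water
print(iceberg[("month", "city")])
# (Jan, Chicago) is the only cell of that cuboid with count >= 2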
233
Iceberg Cube, Closed Cube & Cube Shell
 Is iceberg cube good enough?
 2 base cells: {(a1, a2, a3 . . . , a100):10, (a1, a2, b3, . . . , b100):10}
 How many cells will the iceberg cube have if having count(*) >=
10? Hint: A huge but tricky number!
 Closed cube:
 Closed cell c: if there exists no cell d, s.t. d is a descendant of c,
and d has the same measure value as c.
 Closed cube: a cube consisting of only closed cells
 What is the closed cube of the above base cuboid? Hint: only 3
cells
 Cube Shell
 Precompute only the cuboids involving a small # of
dimensions, e.g., 3
 More dimension combinations will need to be computed on
the fly
For (A1, A2, … A10), how many combinations to
compute?
234
234
Roadmap for Efficient Computation
 General cube computation heuristics (Agarwal et al.’96)
 Computing full/iceberg cubes: 3 methodologies
 Bottom-Up: Multi-Way array aggregation (Zhao, Deshpande &
Naughton, SIGMOD’97)
 Top-down:

BUC (Beyer & Ramakrishnan, SIGMOD’99)

H-cubing technique (Han, Pei, Dong & Wang: SIGMOD’01)
 Integrating Top-Down and Bottom-Up:

Star-cubing algorithm (Xin, Han, Li & Wah: VLDB’03)
 High-dimensional OLAP: A Minimal Cubing Approach (Li, et al.
VLDB’04)
 Computing alternative kinds of cubes:
 Partial cube, closed cube, approximate cube, etc.
235
235
General Heuristics (Agarwal et al. VLDB’96)
 Sorting, hashing, and grouping operations are applied to the
dimension attributes in order to reorder and cluster related tuples
 Aggregates may be computed from previously computed
aggregates, rather than from the base fact table
 Smallest-child: computing a cuboid from the smallest,
previously computed cuboid
 Cache-results: caching results of a cuboid from which other
cuboids are computed to reduce disk I/Os
 Amortize-scans: computing as many as possible cuboids at the
same time to amortize disk reads
 Share-sorts: sharing sorting costs cross multiple cuboids when
sort-based method is used
 Share-partitions: sharing the partitioning cost across multiple
cuboids when hash-based algorithms are used
236
236
Chapter 5: Data Cube Technology
 Data Cube Computation: Preliminary
Concepts
 Data Cube Computation Methods
 Processing Advanced Queries by Exploring
Data Cube Technology
 Multidimensional Data Analysis in Cube Space
 Summary
237
237
Data Cube Computation Methods
 Multi-Way Array Aggregation
 BUC
 Star-Cubing
 High-Dimensional OLAP
238
238
Multi-Way Array Aggregation
 Array-based “bottom-up” algorithm
 Using multi-dimensional chunks
 No direct tuple comparisons
 Simultaneous aggregation on
multiple dimensions
 Intermediate aggregate values are
re-used for computing ancestor
cuboids
 Cannot do Apriori pruning: No
iceberg optimization
ABC
AB
A
All
B
AC BC
C
239
239
Multi-way Array Aggregation for Cube
Computation (MOLAP)
 Partition arrays into chunks (a small subcube which fits in
memory).
 Compressed sparse array addressing: (chunk_id, offset)
 Compute aggregates in “multiway” by visiting cube cells in the
order which minimizes the # of times to visit each cell, and
reduces memory access and storage cost.
What is the best
traversing order
to do multi-way
aggregation?
[Figure: a 3-D array with dimensions A (a0–a3), B (b0–b3), and C (c0–c3), partitioned into 64 chunks numbered 1–64]
240
Multi-way Array Aggregation for Cube
Computation (3-D to 2-D)
[Figure: the cuboid lattice — ABC at the base; AB, AC, BC; A, B, C; all at the apex — annotated with the 3-D to 2-D aggregation step]
 The best order is the one that minimizes the memory requirement and reduces I/Os
241
Multi-way Array Aggregation for Cube
Computation (2-D to 1-D)
[Figure: the same cuboid lattice annotated with the 2-D to 1-D aggregation step]
242
242
Multi-Way Array Aggregation for Cube
Computation (Method Summary)
 Method: the planes should be sorted and computed
according to their size in ascending order
 Idea: keep the smallest plane in the main memory,
fetch and compute only one chunk at a time for the
largest plane
 Limitation of the method: computing well only for a
small number of dimensions
 If there are a large number of dimensions, “top-
down” computation and iceberg cube computation
methods can be explored
243
243
Data Cube Computation Methods
 Multi-Way Array Aggregation
 BUC
 Star-Cubing
 High-Dimensional OLAP
244
244
Bottom-Up Computation (BUC)
 BUC (Beyer & Ramakrishnan,
SIGMOD’99)
 Bottom-up cube computation
(Note: top-down in our view!)
 Divides dimensions into
partitions and facilitates iceberg
pruning
 If a partition does not satisfy
min_sup, its descendants can
be pruned
 If minsup = 1 ⇒ compute full CUBE!
 No simultaneous aggregation
[Figure: the cuboid lattice of (A, B, C, D) and the BUC processing tree, whose nodes are numbered 1–16 (all, A, AB, ABC, ABCD, ABD, AC, ACD, AD, B, BC, BCD, BD, C, CD, D) in the order BUC visits them]
245
245
BUC: Partitioning
 Usually, entire data set can’t
fit in main memory
 Sort distinct values
 partition into blocks that fit
 Continue processing
 Optimizations
 Partitioning

External Sorting, Hashing, Counting Sort
 Ordering dimensions to encourage pruning

Cardinality, Skew, Correlation
 Collapsing duplicates

Can’t do holistic aggregates anymore!
246
246
Data Cube Computation Methods
 Multi-Way Array Aggregation
 BUC
 Star-Cubing
 High-Dimensional OLAP
247
247
Star-Cubing: An Integrating Method
 D. Xin, J. Han, X. Li, B. W. Wah, Star-Cubing: Computing Iceberg
Cubes by Top-Down and Bottom-Up Integration, VLDB'03
 Explore shared dimensions
 E.g., dimension A is the shared dimension of ACD and AD
 ABD/AB means cuboid ABD has shared dimensions AB
 Allows for shared computations
 e.g., cuboid AB is computed simultaneously as ABD
[Figure: Star-Cubing lattice with shared dimensions — ABCD/all; ACD/A, ABD/AB, ABC/ABC, BCD; AD/A, BD/B, CD, AC/AC, BC/BC; C/C, D/D]
 Aggregate in a top-down
manner but with the bottom-
up sub-layer underneath
which will allow Apriori
pruning
 Shared dimensions grow in
bottom-up fashion
248
248
Iceberg Pruning in Shared Dimensions
 Anti-monotonic property of shared dimensions
 If the measure is anti-monotonic, and if the
aggregate value on a shared dimension does
not satisfy the iceberg condition, then all the
cells extended from this shared dimension
cannot satisfy the condition either
 Intuition: if we can compute the shared
dimensions before the actual cuboid, we can use
them to do Apriori pruning
 Problem: how to prune while still aggregate
simultaneously on multiple dimensions?
249
249
Cell Trees
 Use a tree structure similar
to H-tree to represent
cuboids
 Collapses common prefixes
to save memory
 Keep count at node
 Traverse the tree to
retrieve a particular tuple
250
250
Star Attributes and Star Nodes
 Intuition: If a single-dimensional
aggregate on an attribute value p
does not satisfy the iceberg
condition, it is useless to
distinguish them during the
iceberg computation
 E.g., b2, b3, b4, c1, c2, c4, d1, d2, d3
 Solution: Replace such attributes
by a *. Such attributes are star
attributes, and the corresponding
nodes in the cell tree are star
nodes
A B C D Count
a1 b1 c1 d1 1
a1 b1 c4 d3 1
a1 b2 c2 d2 1
a2 b3 c3 d4 1
a2 b4 c3 d4 1
251
251
Example: Star Reduction
 Suppose minsup = 2
 Perform one-dimensional
aggregation. Replace attribute
values whose count < 2 with *.
And collapse all *’s together
 Resulting table has all such
attributes replaced with the star-
attribute
 With regards to the iceberg
computation, this new table is a
lossless compression of the original
table
A B C D Count
a1 b1 * * 2
a1 * * * 1
a2 * c3 d4 2
A B C D Count
a1 b1 * * 1
a1 b1 * * 1
a1 * * * 1
a2 * c3 d4 1
a2 * c3 d4 1
252
252
Star Tree
 Given the new compressed
table, it is possible to
construct the
corresponding cell tree—
called star tree
 Keep a star table at the side
for easy lookup of star
attributes
 The star tree is a lossless
compression of the original
cell tree
A B C D Count
a1 b1 * * 2
a1 * * * 1
a2 * c3 d4 2
253
253
Star-Cubing Algorithm—DFS on Lattice Tree
[Figure: DFS on the lattice tree (ABCD/A with its ACD/A, ABD/AB, ABC/ABC, and BCD descendants) and the corresponding base star tree: root: 5 with children a1: 3 and a2: 2 and their b, c, d descendants; the BCD tree is shown alongside with its b*, b1, c*, c3, d*, d4 nodes and counts]
254
254
Multi-Way Aggregation
[Figure: multi-way aggregation across the ABCD, ACD/A, ABD/AB, ABC/ABC, and BCD trees]
255
255
Star-Cubing Algorithm—DFS on Star-Tree
[Figure: DFS on the star tree, aggregating simultaneously into the ABCD, ACD/A, ABD/AB, ABC/ABC, and BCD trees]
256
256
Multi-Way Star-Tree Aggregation
 Start depth-first search at the root of the base star tree
 At each new node in the DFS, create corresponding star tree that are descendants
of the current tree according to the integrated traversal ordering
 E.g., in the base tree, when DFS reaches a1, the ACD/A tree is created
 When DFS reaches b*, the ABD/AD tree is created
 The counts in the base tree are carried over to the new trees
 When DFS reaches a leaf node (e.g., d*), start backtracking
 On every backtracking branch, the count in the corresponding trees are output,
the tree is destroyed, and the node in the base tree is destroyed
 Example
 When traversing from d* back to c*, the a1b*c*/a1b*c* tree is output and
destroyed
 When traversing from c* back to b*, the a1b*D/a1b* tree is output and
destroyed
 When at b*, jump to b1 and repeat similar process
257
257
Data Cube Computation Methods
 Multi-Way Array Aggregation
 BUC
 Star-Cubing
 High-Dimensional OLAP
258
258
The Curse of Dimensionality
 None of the previous cubing method can handle high
dimensionality!
 A database of 600k tuples. Each dimension has
cardinality of 100 and zipf of 2.
259
259
Motivation of High-D OLAP
 X. Li, J. Han, and H. Gonzalez, High-Dimensional
OLAP: A Minimal Cubing Approach, VLDB'04
 Challenge to current cubing methods:
 The “curse of dimensionality’’ problem
 Iceberg cube and compressed cubes: only delay
the inevitable explosion
 Full materialization: still significant overhead in
accessing results on disk
 High-D OLAP is needed in applications
 Science and engineering analysis
 Bio-data analysis: thousands of genes
 Statistical surveys: hundreds of variables
260
260
Fast High-D OLAP with Minimal Cubing
 Observation: OLAP occurs only on a small subset of
dimensions at a time
 Semi-Online Computational Model
1. Partition the set of dimensions into shell
fragments
2. Compute data cubes for each shell fragment
while retaining inverted indices or value-list
indices
3. Given the pre-computed fragment cubes,
dynamically compute cube cells of the high-dimensional data cube online
261
261
Properties of Proposed Method
 Partitions the data vertically
 Reduces high-dimensional cube into a set of lower
dimensional cubes
 Online re-construction of original high-dimensional
space
 Lossless reduction
 Offers tradeoffs between the amount of pre-
processing and the speed of online computation
262
262
Example Computation
 Let the cube aggregation function be count
 Divide the 5 dimensions into 2 shell fragments:
 (A, B, C) and (D, E)
tid A B C D E
1 a1 b1 c1 d1 e1
2 a1 b2 c1 d2 e1
3 a1 b2 c1 d1 e2
4 a2 b1 c1 d1 e2
5 a2 b1 c1 d1 e3
263
263
1-D Inverted Indices
 Build a traditional inverted index or RID list
Attribute Value   TID List        List Size
a1                1, 2, 3         3
a2                4, 5            2
b1                1, 4, 5         3
b2                2, 3            2
c1                1, 2, 3, 4, 5   5
d1                1, 3, 4, 5      4
d2                2               1
e1                1, 2            2
e2                3, 4            2
e3                5               1
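A minimal sketch (plain Python) that builds the 1-D inverted (TID-list) indices for the 5-tuple example table above and then intersects them to obtain a 2-D cell of the ABC shell-fragment cube.

rows = [
    {"tid": 1, "A": "a1", "B": "b1", "C": "c1", "D": "d1", "E": "e1"},
    {"tid": 2, "A": "a1", "B": "b2", "C": "c1", "D": "d2", "E": "e1"},
    {"tid": 3, "A": "a1", "B": "b2", "C": "c1", "D": "d1", "E": "e2"},
    {"tid": 4, "A": "a2", "B": "b1", "C": "c1", "D": "d1", "E": "e2"},
    {"tid": 5, "A": "a2", "B": "b1", "C": "c1", "D": "d1", "E": "e3"},
]

inverted = {}                               # (attribute, value) -> set of TIDs
for r in rows:
    for attr in ("A", "B", "C", "D", "E"):
        inverted.setdefault((attr, r[attr]), set()).add(r["tid"])

print(sorted(inverted[("A", "a1")]))                            # [1, 2, 3]
# Cell (a1, b2) of fragment cube ABC = intersection of the 1-D TID lists
print(sorted(inverted[("A", "a1")] & inverted[("B", "b2")]))    # [2, 3]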
264
264
Shell Fragment Cubes: Ideas
 Generalize the 1-D inverted indices to multi-dimensional
ones in the data cube sense
 Compute all cuboids for data cubes ABC and DE while
retaining the inverted indices
 For example, shell
fragment cube ABC
contains 7 cuboids:
 A, B, C
 AB, AC, BC
 ABC
 This completes the offline
computation stage
Cell    Intersection             TID List  List Size
a1 b1   {1, 2, 3} ∩ {1, 4, 5}    {1}       1
a1 b2   {1, 2, 3} ∩ {2, 3}       {2, 3}    2
a2 b1   {4, 5} ∩ {1, 4, 5}       {4, 5}    2
a2 b2   {4, 5} ∩ {2, 3}          {}        0
265
265
Shell Fragment Cubes: Size and Design
 Given a database of T tuples, D dimensions, and F shell
fragment size, the fragment cubes’ space requirement is:
 For F < 5, the growth is sub-linear
 Shell fragments do not have to be disjoint
 Fragment groupings can be arbitrary to allow for
maximum online performance
 Known common combinations (e.g.,<city, state>)
should be grouped together.
 Shell fragment sizes can be adjusted for optimal balance
between offline and online computation
O( ⌈D/F⌉ × (2^F − 1) × T )
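A quick numeric reading of this bound (a sketch with illustrative parameters; T = 10^6 and D = 60 are assumptions, not values from the slides):

import math

def fragment_cube_space(T, D, F):
    # ceil(D/F) fragments, each materializing (2^F - 1) cuboids over T tuples
    return math.ceil(D / F) * (2**F - 1) * T

for F in (1, 2, 3, 4, 5):          # growth in F is sub-linear for small F
    print(F, fragment_cube_space(T=10**6, D=60, F=F))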
266
266
ID_Measure Table
 If measures other than count are present, store in
ID_measure table separate from the shell fragments
tid count sum
1 5 70
2 3 10
3 8 20
4 5 40
5 2 30
267
267
The Frag-Shells Algorithm
1. Partition set of dimension (A1,…,An) into a set of k fragments (P1,
…,Pk).
2. Scan base table once and do the following
3. insert <tid, measure> into ID_measure table.
4. for each attribute value ai of each dimension Ai
5. build inverted index entry <ai, tidlist>
6. For each fragment partition Pi
7. build local fragment cube Si by intersecting tid-lists in
bottom-up fashion.
268
268
Frag-Shells (2)
A B C D E F …
ABC
Cube
DEF
Cube
D Cuboid
EF Cuboid
DE Cuboid
Cell Tuple-ID List
d1 e1 {1, 3, 8, 9}
d1 e2 {2, 4, 6, 7}
d2 e1 {5, 10}
… …
Dimensions
269
269
Online Query Computation: Query
 A query has the general form ⟨a1, a2, …, an⟩ : M
 Each ai has 3 possible values
1. Instantiated value
2. Aggregate * function
3. Inquire ? function
 For example, ⟨3, ?, ?, *, 1⟩ : count returns a 2-D data
cube.
270
270
Online Query Computation: Method
 Given the fragment cubes, process a query as
follows
1. Divide the query into fragment, same as the
shell
2. Fetch the corresponding TID list for each
fragment from the fragment cube
3. Intersect the TID lists from each fragment to
construct instantiated base table
4. Compute the data cube using the base table
with any cubing algorithm
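A minimal sketch of steps 1-3 for the instantiated part of a query, reusing rows and inv from the earlier sketch (in the real method the TID lists come from the precomputed fragment cubes, and any inquired ? dimensions would then be cubed over the retrieved tuples):

# Toy online evaluation: intersect the TID lists of the instantiated values,
# then aggregate the measure over the surviving tuple ids.
def answer_query(instantiated, inv, id_measure):
    """instantiated: dict like {"D": "d1", "E": "e2"};
    inv: the 1-D inverted indices built above;
    id_measure: tid -> measure value (1 for count)."""
    tid_sets = [inv[(dim, val)] for dim, val in instantiated.items()]
    tids = set.intersection(*tid_sets) if tid_sets else set(id_measure)
    return sum(id_measure[t] for t in tids)

id_measure = {tid: 1 for tid in rows}                          # count measure
print(answer_query({"D": "d1", "E": "e2"}, inv, id_measure))   # -> 2 (tids 3 and 4)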
271
271
Online Query Computation: Sketch
A B C D E F G H I J K L M N …
Online
Cube
Instantiated
Base Table
272
272
Experiment: Size vs. Dimensionality (50
and 100 cardinality)
 (50-C): 10^6 tuples, 0 skew, 50 cardinality, fragment size 3.
 (100-C): 10^6 tuples, 2 skew, 100 cardinality, fragment size 2.
273
273
Experiments on Real World Data
 UCI Forest CoverType data set
 54 dimensions, 581K tuples
 Shell fragments of size 2 took 33 seconds and
325MB to compute
 3-D subquery with 1 instantiated dimension: 85 ms ~ 1.4 sec.
 Longitudinal Study of Vocational Rehab. Data
 24 dimensions, 8818 tuples
 Shell fragments of size 3 took 0.9 seconds and
60MB to compute
 5-D query with 0 instantiated dimensions: 227 ms ~ 2.6 sec.
274
274
Chapter 5: Data Cube Technology
 Data Cube Computation: Preliminary Concepts
 Data Cube Computation Methods
 Processing Advanced Queries by Exploring Data Cube
Technology
 Sampling Cube
 Ranking Cube
 Multidimensional Data Analysis in Cube Space
 Summary
275
275
Processing Advanced Queries by
Exploring Data Cube Technology
 Sampling Cube
 X. Li, J. Han, Z. Yin, J.-G. Lee, Y. Sun, “Sampling
Cube: A Framework for Statistical OLAP over
Sampling Data”, SIGMOD’08
 Ranking Cube
 D. Xin, J. Han, H. Cheng, and X. Li. Answering top-k
queries with multi-dimensional selections: The
ranking cube approach. VLDB’06
 Other advanced cubes for processing data and
queries
 Stream cube, spatial cube, multimedia cube, text
cube, RFID cube, etc. — to be studied in volume 2
276
276
Statistical Surveys and OLAP
 Statistical survey: A popular tool to collect information
about a population based on a sample
 Ex.: TV ratings, US Census, election polls
 A common tool in politics, health, market research,
science, and many more
 An efficient way of collecting information (Data
collection is expensive)
 Many statistical tools available, to determine validity
 Confidence intervals
 Hypothesis tests
 OLAP (multidimensional analysis) on survey data
 highly desirable but can it be done well?
277
277
Surveys: Sample vs. Whole Population
Age/Education High-school College Graduate
18
19
20
…
Data is only a sample of population
278
278
Problems for Drilling in Multidim. Space
Age/Education High-school College Graduate
18
19
20
…
Data is only a sample of population but samples could be small
when drilling to certain multidimensional space
279
279
OLAP on Survey (i.e., Sampling) Data
Age/Education High-school College Graduate
18
19
20
…
 Semantics of query is unchanged
 Input data has changed
280
280
Challenges for OLAP on Sampling Data
 Computing confidence intervals in OLAP
context
 No data?
 Not exactly. No data in subspaces in cube
 Sparse data
 Causes include sampling bias and query
selection bias
 Curse of dimensionality
 Survey data can be high dimensional
 Over 600 dimensions in real world
example
 Impossible to fully materialize
281
281
Example 1: Confidence Interval
Age/Education High-school College Graduate
18
19
20
…
What is the average income of 19-year-old high-school students?
Return not only query result but also confidence interval
282
282
Confidence Interval
 Confidence interval at confidence level (1 − α): x̄ ± t_c · σ_x̄
 x is a sample of the data set; x̄ is the mean of the sample
 t_c is the critical t-value, obtained by a table look-up
 σ_x̄ = s / √l is the estimated standard error of the mean
(s: sample standard deviation, l: sample size)
 Example: $50,000 ± $3,000 with 95% confidence
 Treat the points in a cube cell as a sample
 Compute the confidence interval as for a traditional sample set
 Return the answer in the form of a confidence interval
 Indicates the quality of the query answer
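A minimal Python sketch of the cell-level computation (the income values are made up for illustration; scipy's t distribution plays the role of the t-table look-up):

import math
from scipy import stats

def mean_confidence_interval(sample, alpha=0.05):
    """Mean and half-width of the (1 - alpha) confidence interval for a cell."""
    l = len(sample)
    mean = sum(sample) / l
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (l - 1))  # sample std dev
    se = s / math.sqrt(l)                        # estimated standard error of the mean
    t_c = stats.t.ppf(1 - alpha / 2, df=l - 1)   # critical t-value (table look-up)
    return mean, t_c * se

incomes = [46_000, 52_000, 49_500, 55_000, 47_500]   # made-up cell contents
m, half = mean_confidence_interval(incomes)
print(f"${m:,.0f} ± ${half:,.0f} with 95% confidence")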
283
283
Efficient Computing Confidence Interval Measures
 Efficient computation in all cells in data cube

Both mean and confidence interval are algebraic

Why is the confidence interval measure algebraic? Because
σ_x̄ = s / √l is algebraic, where both s (the standard deviation)
and l (the count) are algebraic
 Thus one can calculate cells efficiently at more general
cuboids without having to start at the base cuboid each
time
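A sketch of why this matters computationally: if each base cell keeps (count, sum, sum of squares), higher-level cells can be combined from their children and the confidence interval recomputed without touching the base data (the summary layout here is an assumption for illustration, not the exact storage of the paper):

import math
from scipy import stats

def combine(cells):
    # component-wise sum of (l, sum, sum of squares) summaries of child cells
    l = sum(c[0] for c in cells)
    s1 = sum(c[1] for c in cells)
    s2 = sum(c[2] for c in cells)
    return l, s1, s2

def ci_from_summary(l, s1, s2, alpha=0.05):
    mean = s1 / l
    var = (s2 - l * mean * mean) / (l - 1)       # sample variance from the sums
    t_c = stats.t.ppf(1 - alpha / 2, df=l - 1)
    return mean, t_c * math.sqrt(var / l)

children = [(3, 150_000.0, 7.6e9), (2, 98_000.0, 4.9e9)]   # illustrative summaries
print(ci_from_summary(*combine(children)))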
284
284
Example 2: Query Expansion
Age/Education High-school College Graduate
18
19
20
…
What is the average income of 19-year-old college students?
285
285
Boosting Confidence by Query Expansion
 From the example: The queried cell “19-year-old
college students” contains only 2 samples
 Confidence interval is large (i.e., low confidence). Why?
 Small sample size
 High standard deviation with samples
 Small sample sizes can occur at relatively low
dimensional selections
 Collect more data?― expensive!
 Use data in other cells? Maybe, but have to be
careful
286
286
Intra-Cuboid Expansion: Choice 1
Age/Education High-school College Graduate
18
19
20
…
Expand query to include 18 and 20 year olds?
287
287
Intra-Cuboid Expansion: Choice 2
Age/Education High-school College Graduate
18
19
20
…
Expand query to include high-school and graduate students?
288
288
Query Expansion
289
Intra-Cuboid Expansion
 Combine other cells’ data into the queried cell’s own to “boost”
confidence
 Only if the cells share semantic and cube-value similarity
 Use only if necessary
 Bigger sample size will decrease confidence
interval
 Cell segment similarity
 Some dimensions are clear: Age
 Some are fuzzy: Occupation
 May need domain knowledge
 Cell value similarity
 How to determine if two cells’ samples come
from the same population?
 Two-sample t-test (confidence-based)
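A sketch of the cell-value similarity test using a two-sample t-test (the sample values are illustrative, not from the slides):

from scipy import stats

def similar_enough(cell_a, cell_b, alpha=0.05):
    """Decide whether two cells' samples plausibly come from the same population.
    If the two-sample t-test does not reject equality of means, the candidate
    cell may be merged into the query cell to boost the sample size."""
    t_stat, p_value = stats.ttest_ind(cell_a, cell_b, equal_var=False)
    return p_value > alpha

query_cell = [31_000, 29_500]                    # the 2-sample "19, college" cell
candidate = [30_000, 28_000, 33_500, 29_000]     # e.g. the "20, college" cell
print(similar_enough(query_cell, candidate))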
290
290
Inter-Cuboid Expansion
 If a query dimension is

Not correlated with cube value

But is causing small sample size by drilling down
too much
 Remove dimension (i.e., generalize to *) and move to
a more general cuboid
 Can use two-sample t-test to determine similarity
between two cells across cuboids
 Can also use a different method to be shown later
291
291
Query Expansion Experiments
 Real world sample data: 600 dimensions and
750,000 tuples
 A 0.05% subset of the data is used to simulate the “sample”
(allows error checking against the full data)
292
292
Chapter 5: Data Cube Technology
 Data Cube Computation: Preliminary Concepts
 Data Cube Computation Methods
 Processing Advanced Queries by Exploring Data Cube
Technology
 Sampling Cube
 Ranking Cube
 Multidimensional Data Analysis in Cube Space
 Summary
293
Ranking Cubes – Efficient Computation of
Ranking queries
 Data cube helps not only OLAP but also ranked search
 (top-k) ranking query: only returns the best k results
according to a user-specified preference, consisting of
(1) a selection condition and (2) a ranking function
 Ex.: Search for apartments with expected price 1000
and expected square feet 800

Select top 1 from Apartment

where City = “LA” and Num_Bedroom = 2

order by [price – 1000]^2 + [sq feet - 800]^2 asc
 Efficiency question: Can we only search what we need?
 Build a ranking cube on both selection dimensions
and ranking dimensions
294
Sliced Partition
for city=“LA”
Sliced Partition
for BR=2
Ranking Cube: Partition Data on Both
Selection and Ranking Dimensions
One single data
partition as the template
Slice the data partition
by selection conditions
Partition for
all data
295
Materialize Ranking-Cube
tid City BR Price Sq feet Block ID
t1 SEA 1 500 600 5
t2 CLE 2 700 800 5
t3 SEA 1 800 900 2
t4 CLE 3 1000 1000 6
t5 LA 1 1100 200 15
t6 LA 2 1200 500 11
t7 LA 2 1200 560 11
t8 CLE 3 1350 1120 4
Step 1: Partition Data on
Ranking Dimensions
Step 2: Group data by
Selection Dimensions
(Selection-dimension cuboids: City, BR, and City & BR;
City ∈ {CLE, LA, SEA}, BR ∈ {1, 2, 3, 4})
Step 3: Compute Measures for each group
For the cell (LA)
(The ranking dimensions are partitioned into a 4 × 4 grid of
blocks with IDs 1–16.)
Block-level: {11, 15}
Data-level: {11: t6, t7; 15: t5}
296
Search with Ranking-Cube:
Simultaneously Push Selection and Ranking
Select top 1 from Apartment
where city = “LA”
order by [price – 1000]^2 + [sq feet - 800]^2 asc
(Figure: the 2-D price × sq-feet space with the query point at
price = 1000, sq feet = 800; blocks 11 and 15 hold the LA tuples.)
Without ranking-cube: start search from the whole data
With ranking-cube: start search from the LA blocks
Measure for LA: block-level {11, 15}; data-level {11: t6, t7; 15: t5}
Given the bin boundaries,
locate the block with top score
Bin boundary for price [500, 600, 800, 1100,1350]
Bin boundary for sq feet [200, 400, 600, 800, 1120]
297
Processing Ranking Query: Execution Trace
Select top 1 from Apartment
where city = “LA”
order by [price – 1000]^2 + [sq feet - 800]^2 asc
(Same figure: query point at price = 1000, sq feet = 800; with the
ranking cube, the search starts from the LA blocks 11 and 15.)
Measure for LA: block-level {11, 15}; data-level {11: t6, t7; 15: t5}
f=[price-1000]^2 + [sq feet – 800]^2
Bin boundary for price [500, 600, 800, 1100,1350]
Bin boundary for sq feet [200, 400, 600, 800, 1120]
Execution Trace:
1. Retrieve High-level measure for LA {11, 15}
2. Estimate lower bound score for block 11, 15
f(block 11) = 40,000, f(block 15) = 160,000
3. Retrieve block 11
4. Retrieve low-level measure for block 11
5. f(t6) = 130,000, f(t7) = 97,600
Output t7, done!
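A minimal sketch of the block-at-a-time search, using the LA measures and the block lower bounds quoted in the trace (the derivation of the block bounds from the bin boundaries is omitted):

import heapq

def f(price, sqft):                                  # ranking function of the query
    return (price - 1000) ** 2 + (sqft - 800) ** 2

# Block-level and data-level measures for the cell (LA), as on the slide;
# the block lower bounds are the values from step 2 of the trace.
la_blocks = {11: {"bound": 40_000, "tuples": [("t6", 1200, 500), ("t7", 1200, 560)]},
             15: {"bound": 160_000, "tuples": [("t5", 1100, 200)]}}

def top1(blocks):
    # Visit blocks in increasing order of their lower bound; stop once the best
    # tuple seen so far scores no worse than the next block's lower bound.
    heap = [(b["bound"], bid) for bid, b in blocks.items()]
    heapq.heapify(heap)
    best = None                                      # (score, tid)
    while heap:
        bound, bid = heapq.heappop(heap)
        if best is not None and best[0] <= bound:
            break
        for tid, price, sqft in blocks[bid]["tuples"]:
            score = f(price, sqft)
            if best is None or score < best[0]:
                best = (score, tid)
    return best

print(top1(la_blocks))    # (97600, 't7'), matching the trace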
298
Ranking Cube: Methodology and Extension
 Ranking cube methodology
 Push selection and ranking simultaneously
 It works for many sophisticated ranking functions
 How to support high-dimensional data?
 Materialize only those atomic cuboids that contain
single selection dimensions

Uses the idea similar to high-dimensional OLAP

Achieves low space overhead and high
performance in answering ranking queries with
a high number of selection dimensions
299
299
Chapter 5: Data Cube Technology
 Data Cube Computation: Preliminary
Concepts
 Data Cube Computation Methods
 Processing Advanced Queries by Exploring
Data Cube Technology
 Multidimensional Data Analysis in Cube Space
 Summary
300
300
Multidimensional Data Analysis in
Cube Space
 Prediction Cubes: Data Mining in Multi-
Dimensional Cube Space
 Multi-Feature Cubes: Complex Aggregation at
Multiple Granularities
 Discovery-Driven Exploration of Data Cubes
301
Data Mining in Cube Space
 Data cube greatly increases the analysis bandwidth
 Four ways in which OLAP-style analysis and data
mining can interact
 Using cube space to define data space for mining
 Using OLAP queries to generate features and targets
for mining, e.g., multi-feature cube
 Using data-mining models as building blocks in a
multi-step mining process, e.g., prediction cube
 Using data-cube computation techniques to speed
up repeated model construction

Cube-space data mining may require building a
model for each candidate data space

Sharing computation across model construction for
different candidates may lead to efficient processing
Prediction Cubes
 Prediction cube: A cube structure that stores
prediction models in multidimensional data space and
supports prediction in OLAP manner
 Prediction models are used as building blocks to
define the interestingness of subsets of data, i.e., to
answer which subsets of data indicate better
prediction
303
How to Determine the Prediction Power
of an Attribute?
 Ex. A customer table D:
 Two dimensions Z: Time (Month, Year ) and Location
(State, Country)
 Two features X: Gender and Salary
 One class-label attribute Y: Valued Customer
 Q: “Are there times and locations in which the value of
a customer depended greatly on the customer’s
gender (i.e., Gender: predictiveness attribute V)?”
 Idea:
 Compute the difference between the model built
using X to predict Y and the model built using X − V
to predict Y
 If the difference is large, V must play an important
role at predicting Y
304
Efficient Computation of Prediction Cubes
 Naïve method: Fully materialize the prediction
cube, i.e., exhaustively build models and
evaluate them for each cell and for each
granularity
 Better approach: Explore score function
decomposition that reduces prediction cube
computation to data cube computation
305
305
Multidimensional Data Analysis in
Cube Space
 Prediction Cubes: Data Mining in Multi-
Dimensional Cube Space
 Multi-Feature Cubes: Complex Aggregation at
Multiple Granularities
 Discovery-Driven Exploration of Data Cubes
306
306
Complex Aggregation at Multiple
Granularities: Multi-Feature Cubes
 Multi-feature cubes (Ross, et al. 1998): Compute complex
queries involving multiple dependent aggregates at multiple
granularities
 Ex. Grouping by all subsets of {item, region, month}, find the
maximum price in 2010 for each group, and the total sales
among all maximum price tuples
select item, region, month, max(price), sum(R.sales)
from purchases
where year = 2010
cube by item, region, month: R
such that R.price = max(price)
 Continuing the last example, among the max-price tuples, find
the min and max shelf life, and find the fraction of the total
sales due to tuples that have the min shelf life within the set of
all max-price tuples
307
307
Multidimensional Data Analysis in
Cube Space
 Prediction Cubes: Data Mining in Multi-
Dimensional Cube Space
 Multi-Feature Cubes: Complex Aggregation at
Multiple Granularities
 Discovery-Driven Exploration of Data Cubes
308
308
Discovery-Driven Exploration of Data Cubes
 Hypothesis-driven
 exploration by user, huge search space
 Discovery-driven (Sarawagi, et al.’98)
 Effective navigation of large OLAP data cubes
 pre-compute measures indicating exceptions, guide
user in the data analysis, at all levels of aggregation
 Exception: significantly different from the value
anticipated, based on a statistical model
 Visual cues such as background color are used to
reflect the degree of exception of each cell
309
309
Kinds of Exceptions and their Computation
 Parameters
 SelfExp: surprise of cell relative to other cells at
same level of aggregation
 InExp: surprise beneath the cell
 PathExp: surprise beneath cell for each drill-down
path
 Computation of the exception indicators (model fitting
and computing SelfExp, InExp, and PathExp values)
can be overlapped with cube construction
 Exceptions themselves can be stored, indexed, and
retrieved like precomputed aggregates
310
310
Examples: Discovery-Driven Data Cubes
311
311
Chapter 5: Data Cube Technology
 Data Cube Computation: Preliminary
Concepts
 Data Cube Computation Methods
 Processing Advanced Queries by Exploring
Data Cube Technology
 Multidimensional Data Analysis in Cube Space
 Summary
312
312
Data Cube Technology: Summary
 Data Cube Computation: Preliminary Concepts
 Data Cube Computation Methods
 MultiWay Array Aggregation
 BUC
 Star-Cubing
 High-Dimensional OLAP with Shell-Fragments
 Processing Advanced Queries by Exploring Data Cube
Technology
 Sampling Cubes
 Ranking Cubes
 Multidimensional Data Analysis in Cube Space
 Discovery-Driven Exploration of Data Cubes
 Multi-feature Cubes

313
313
Ref.(I) Data Cube Computation Methods
 S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the
computation of multidimensional aggregates. VLDB’96
 D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data warehouses. SIGMOD’97
 K. Beyer and R. Ramakrishnan. Bottom-Up Computation of Sparse and Iceberg CUBEs.. SIGMOD’99
 M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently.
VLDB’98
 J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube:
A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge
Discovery, 1:29–54, 1997.
 J. Han, J. Pei, G. Dong, K. Wang. Efficient Computation of Iceberg Cubes With Complex Measures. SIGMOD’01
 L. V. S. Lakshmanan, J. Pei, and J. Han, Quotient Cube: How to Summarize the Semantics of a Data Cube,
VLDB'02
 X. Li, J. Han, and H. Gonzalez, High-Dimensional OLAP: A Minimal Cubing Approach, VLDB'04
 Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional
aggregates. SIGMOD’97
 K. Ross and D. Srivastava. Fast computation of sparse datacubes. VLDB’97
 D. Xin, J. Han, X. Li, B. W. Wah, Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration,
VLDB'03
 D. Xin, J. Han, Z. Shao, H. Liu, C-Cubing: Efficient Computation of Closed Cubes by Aggregation-Based Checking,
ICDE'06
314
314
Ref. (II) Advanced Applications with Data Cubes
 D. Burdick, P. Deshpande, T. S. Jayram, R. Ramakrishnan, and S. Vaithyanathan. OLAP over
uncertain and imprecise data. VLDB’05
 X. Li, J. Han, Z. Yin, J.-G. Lee, Y. Sun, “Sampling Cube: A Framework for Statistical OLAP over
Sampling Data”, SIGMOD’08
 C. X. Lin, B. Ding, J. Han, F. Zhu, and B. Zhao. Text Cube: Computing IR measures for
multidimensional text database analysis. ICDM’08
 D. Papadias, P. Kalnis, J. Zhang, and Y. Tao. Efficient OLAP operations in spatial data warehouses.
SSTD’01
 N. Stefanovic, J. Han, and K. Koperski. Object-based selective materialization for efficient
implementation of spatial data cubes. IEEE Trans. Knowledge and Data Engineering, 12:938–958,
2000.
 T. Wu, D. Xin, Q. Mei, and J. Han. Promotion analysis in multidimensional space. VLDB’09
 T. Wu, D. Xin, and J. Han. ARCube: Supporting ranking aggregate queries in partially materialized
data cubes. SIGMOD’08
 D. Xin, J. Han, H. Cheng, and X. Li. Answering top-k queries with multi-dimensional selections:
The ranking cube approach. VLDB’06
 J. S. Vitter, M. Wang, and B. R. Iyer. Data cube approximation and histograms via wavelets.
CIKM’98
 D. Zhang, C. Zhai, and J. Han. Topic cube: Topic modeling for OLAP on multi-dimensional text
databases. SDM’09
315
Ref. (III) Knowledge Discovery with Data Cubes
 R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. ICDE’97
 B.-C. Chen, L. Chen, Y. Lin, and R. Ramakrishnan. Prediction cubes. VLDB’05
 B.-C. Chen, R. Ramakrishnan, J.W. Shavlik, and P. Tamma. Bellwether analysis: Predicting global
aggregates from local regions. VLDB’06
 Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang, Multi-Dimensional Regression Analysis of
Time-Series Data Streams, VLDB'02
 G. Dong, J. Han, J. Lam, J. Pei, K. Wang. Mining Multi-dimensional Constrained Gradients in Data
Cubes. VLDB’ 01
 R. Fagin, R. V. Guha, R. Kumar, J. Novak, D. Sivakumar, and A. Tomkins. Multi-structural
databases. PODS’05
 J. Han. Towards on-line analytical mining in large databases. SIGMOD Record, 27:97–107, 1998
 T. Imielinski, L. Khachiyan, and A. Abdulghani. Cubegrades: Generalizing association rules. Data
Mining & Knowledge Discovery, 6:219–258, 2002.
 R. Ramakrishnan and B.-C. Chen. Exploratory mining in cube space. Data Mining and Knowledge
Discovery, 15:29–54, 2007.
 K. A. Ross, D. Srivastava, and D. Chatziantoniou. Complex aggregation at multiple granularities.
EDBT'98
 S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data cubes.
EDBT'98
 G. Sathe and S. Sarawagi. Intelligent Rollups in Multidimensional OLAP Data. VLDB'01
Surplus Slides
316
317
317
Chapter 5: Data Cube Technology
 Efficient Methods for Data Cube Computation

Preliminary Concepts and General Strategies for Cube Computation
 Multiway Array Aggregation for Full Cube Computation
 BUC: Computing Iceberg Cubes from the Apex Cuboid Downward
 H-Cubing: Exploring an H-Tree Structure
 Star-cubing: Computing Iceberg Cubes Using a Dynamic Star-tree
Structure
 Precomputing Shell Fragments for Fast High-Dimensional OLAP
 Data Cubes for Advanced Applications
 Sampling Cubes: OLAP on Sampling Data
 Ranking Cubes: Efficient Computation of Ranking Queries
 Knowledge Discovery with Data Cubes
 Discovery-Driven Exploration of Data Cubes
 Complex Aggregation at Multiple Granularity: Multi-feature Cubes
 Prediction Cubes: Data Mining in Multi-Dimensional Cube Space
 Summary
318
318
H-Cubing: Using H-Tree Structure
 Bottom-up computation
 Exploring an H-tree
structure
 If the current
computation of an H-
tree cannot pass
min_sup, do not
proceed further
(pruning)
 No simultaneous
aggregation
(Cuboid lattice: all; A, B, C, D; AB, AC, AD, BC, BD, CD;
ABC, ABD, ACD, BCD; ABCD)
319
319
H-tree: A Prefix Hyper-tree
Month City Cust_grp Prod Cost Price
Jan Tor Edu Printer 500 485
Jan Tor Hhd TV 800 1200
Jan Tor Edu Camera 1160 1280
Feb Mon Bus Laptop 1500 2500
Mar Van Edu HD 540 520
… … … … … …
root
edu hhd bus
Jan Mar Jan Feb
Tor Van Tor Mon
Q.I.
Q.I. Q.I.
Quant-
Info
Sum: 1765
Cnt: 2
bins
Attr. Val. Quant-Info Side-link
Edu Sum:2285 …
Hhd …
Bus …
… …
Jan …
Feb …
… …
Tor …
Van …
Mon …
… …
Header
table
320
320
root
Edu. Hhd. Bus.
Jan. Mar. Jan. Feb.
Tor. Van. Tor. Mon.
Q.I.
Q.I. Q.I.
Quant-
Info
Sum: 1765
Cnt: 2
bins
Attr. Val. Quant-Info Side-link
Edu Sum:2285 …
Hhd …
Bus …
… …
Jan …
Feb …
… …
Tor …
Van …
Mon …
… …
Attr. Val. Q.I. Side-link
Edu …
Hhd …
Bus …
… …
Jan …
Feb …
… …
Header
Table
HTor
From (*, *, Tor) to (*, Jan, Tor)
Computing Cells Involving “City”
321
321
Computing Cells Involving Month But No City
root
Edu. Hhd. Bus.
Jan. Mar. Jan. Feb.
Tor. Van. Tor. Mont.
Q.I.
Q.I. Q.I.
Attr. Val. Quant-Info Side-link
Edu. Sum:2285 …
Hhd. …
Bus. …
… …
Jan. …
Feb. …
Mar. …
… …
Tor. …
Van. …
Mont. …
… …
1. Roll up quant-info
2. Compute cells
involving month but
no city
Q.I.
Top-k OK mark: if the Q.I. in a child passes the top-k avg
threshold, so do its parents. No binning is needed!
322
322
Computing Cells Involving Only Cust_grp
root
edu hhd bus
Jan Mar Jan Feb
Tor Van Tor Mon
Q.I.
Q.I. Q.I.
Attr. Val. Quant-Info Side-link
Edu Sum:2285 …
Hhd …
Bus …
… …
Jan …
Feb …
Mar …
… …
Tor …
Van …
Mon …
… …
Check header table
directly
Q.I.
323
323
Data Mining:
Concepts and Techniques
(3rd
ed.)
— Chapter 6 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
324
Chapter 5: Mining Frequent Patterns, Association
and Correlations: Basic Concepts and Methods
 Basic Concepts
 Frequent Itemset Mining Methods
 Which Patterns Are Interesting?—Pattern
Evaluation Methods
 Summary
325
What Is Frequent Pattern Analysis?
 Frequent pattern: a pattern (a set of items, subsequences,
substructures, etc.) that occurs frequently in a data set
 First proposed by Agrawal, Imielinski, and Swami [AIS93] in the
context of frequent itemsets and association rule mining
 Motivation: Finding inherent regularities in data
 What products were often purchased together?— Beer and
diapers?!
 What are the subsequent purchases after buying a PC?
 What kinds of DNA are sensitive to this new drug?
 Can we automatically classify web documents?
 Applications
 Basket data analysis, cross-marketing, catalog design, sale
campaign analysis, Web log (click stream) analysis, and DNA
sequence analysis
Why Is Freq. Pattern Mining Important?
 Freq. pattern: An intrinsic and important property of
datasets
 Foundation for many essential data mining tasks
 Association, correlation, and causality analysis
 Sequential, structural (e.g., sub-graph) patterns
 Pattern analysis in spatiotemporal, multimedia, time-
series, and stream data
 Classification: discriminative, frequent pattern
analysis
 Cluster analysis: frequent pattern-based clustering
 Data warehousing: iceberg cube and cube-gradient
 Semantic data compression: fascicles
 Broad applications
327
Basic Concepts: Frequent Patterns
 itemset: A set of one or more
items
 k-itemset X = {x1, …, xk}
 (absolute) support, or, support
count of X: Frequency or
occurrence of an itemset X
 (relative) support, s, is the
fraction of transactions that
contains X (i.e., the probability
that a transaction contains X)
 An itemset X is frequent if X’s
support is no less than a
minsup threshold
(Venn diagram: customers buying beer, customers buying
diaper, and customers buying both)
Tid Items bought
10 Beer, Nuts, Diaper
20 Beer, Coffee, Diaper
30 Beer, Diaper, Eggs
40 Nuts, Eggs, Milk
50 Nuts, Coffee, Diaper, Eggs, Milk
328
Basic Concepts: Association Rules
 Find all the rules X  Y with
minimum support and
confidence
 support, s, probability that a
transaction contains X  Y
 confidence, c, conditional
probability that a transaction
having X also contains Y
Let minsup = 50%, minconf = 50%
Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3,
{Beer, Diaper}:3
(Venn diagram: customers buying beer, diaper, or both)
Tid | Items bought
10 | Beer, Nuts, Diaper
20 | Beer, Coffee, Diaper
30 | Beer, Diaper, Eggs
40 | Nuts, Eggs, Milk
50 | Nuts, Coffee, Diaper, Eggs, Milk
 Association rules: (many more!)
 Beer  Diaper (60%, 100%)
 Diaper  Beer (60%, 75%)
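For the five transactions above, the support and confidence numbers can be checked with a few lines of Python (a sketch, not part of the slides):

transactions = {
    10: {"Beer", "Nuts", "Diaper"},
    20: {"Beer", "Coffee", "Diaper"},
    30: {"Beer", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}
n = len(transactions)

def support(itemset):
    # fraction of transactions containing the itemset
    return sum(itemset <= t for t in transactions.values()) / n

def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)

print(support({"Beer", "Diaper"}))        # 0.6  -> 60%
print(confidence({"Beer"}, {"Diaper"}))   # 1.0  -> 100%
print(confidence({"Diaper"}, {"Beer"}))   # 0.75 -> 75%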
329
Closed Patterns and Max-Patterns
 A long pattern contains a combinatorial number of
sub-patterns, e.g., {a1, …, a100} contains
C(100, 1) + C(100, 2) + … + C(100, 100) = 2^100 − 1 ≈ 1.27 × 10^30
sub-patterns!
 Solution: Mine closed patterns and max-patterns instead
 An itemset X is closed if X is frequent and there exists
no super-pattern Y ⊃ X with the same support as X
(proposed by Pasquier, et al. @ ICDT’99)
 An itemset X is a max-pattern if X is frequent and there
exists no frequent super-pattern Y ⊃ X (proposed by
Bayardo @ SIGMOD’98)
 Closed pattern is a lossless compression of freq.
patterns

330
Closed Patterns and Max-Patterns
 Exercise. DB = {<a1, …, a100>, < a1, …, a50>}
 Min_sup = 1.
 What is the set of closed itemsets?
 <a1, …, a100>: 1
 < a1, …, a50>: 2
 What is the set of max-patterns?
 <a1, …, a100>: 1
 What is the set of all patterns?
 !!
331
Computational Complexity of Frequent Itemset
Mining
 How many itemsets may potentially be generated in the worst case?
 The number of frequent itemsets to be generated is sensitive to the
minsup threshold
 When minsup is low, there exist potentially an exponential number
of frequent itemsets
 The worst case: M^N, where M: # distinct items, and N: max length of
transactions
 The worst-case complexity vs. the expected probability
 Ex. Suppose Walmart has 10^4 kinds of products
 The chance to pick up one product: 10^-4
 The chance to pick up a particular set of 10 products: ~10^-40
 What is the chance for this particular set of 10 products to be
frequent 10^3 times in 10^9 transactions?
332
Chapter 5: Mining Frequent Patterns, Association
and Correlations: Basic Concepts and Methods
 Basic Concepts
 Frequent Itemset Mining Methods
 Which Patterns Are Interesting?—Pattern
Evaluation Methods
 Summary
333
Scalable Frequent Itemset Mining Methods
 Apriori: A Candidate Generation-and-Test
Approach
 Improving the Efficiency of Apriori
 FPGrowth: A Frequent Pattern-Growth
Approach
 ECLAT: Frequent Pattern Mining with Vertical Data Format
334
The Downward Closure Property and Scalable
Mining Methods
 The downward closure property of frequent patterns
 Any subset of a frequent itemset must be frequent
 If {beer, diaper, nuts} is frequent, so is {beer,
diaper}
 i.e., every transaction having {beer, diaper, nuts} also
contains {beer, diaper}
 Scalable mining methods: Three major approaches
 Apriori (Agrawal & Srikant@VLDB’94)
 Freq. pattern growth (FPgrowth—Han, Pei & Yin
@SIGMOD’00)
 Vertical data format approach (Charm—Zaki & Hsiao
@SDM’02)
335
Apriori: A Candidate Generation & Test Approach
 Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested!
(Agrawal & Srikant @VLDB’94, Mannila, et al. @ KDD’ 94)
 Method:
 Initially, scan DB once to get frequent 1-itemset
 Generate length (k+1) candidate itemsets from length
k frequent itemsets
 Test the candidates against DB
 Terminate when no frequent or candidate set can be
generated
336
The Apriori Algorithm—An Example
Database TDB
1st
scan
C1
L1
L2
C2 C2
2nd
scan
C3 L3
3rd
scan
Tid Items
10 A, C, D
20 B, C, E
30 A, B, C, E
40 B, E
Itemset sup
{A} 2
{B} 3
{C} 3
{D} 1
{E} 3
Itemset sup
{A} 2
{B} 3
{C} 3
{E} 3
Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}
Itemset sup
{A, B} 1
{A, C} 2
{A, E} 1
{B, C} 2
{B, E} 3
{C, E} 2
Itemset sup
{A, C} 2
{B, C} 2
{B, E} 3
{C, E} 2
Itemset
{B, C, E}
Itemset sup
{B, C, E} 2
Supmin = 2
337
The Apriori Algorithm (Pseudo-Code)
Ck: Candidate itemset of size k
Lk : frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in database do
increment the count of all candidates in Ck+1 that
are contained in t
Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
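A direct, minimal Python transcription of this pseudo-code (candidate generation by self-join plus subset pruning, then support counting; run here on the four-transaction example of the previous slide):

from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    def count(cands):
        counts = {c: 0 for c in cands}
        for t in transactions:
            for c in cands:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_support}

    items = {frozenset([i]) for t in transactions for i in t}
    L = count(items)                      # L1: frequent 1-itemsets
    result = dict(L)
    k = 1
    while L:
        prev = set(L)
        # self-join Lk with itself, keeping only (k+1)-itemsets
        cands = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # prune: every k-subset of a candidate must be frequent
        cands = {c for c in cands
                 if all(frozenset(s) in prev for s in combinations(c, k))}
        L = count(cands)
        result.update(L)
        k += 1
    return result

tdb = [frozenset(t) for t in ({"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"})]
print(apriori(tdb, min_support=2))        # includes frozenset({'B','C','E'}): 2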
338
Implementation of Apriori
 How to generate candidates?
 Step 1: self-joining Lk
 Step 2: pruning
 Example of Candidate-generation
 L3={abc, abd, acd, ace, bcd}
 Self-joining: L3*L3

abcd from abc and abd

acde from acd and ace
 Pruning:
 acde is removed because ade is not in L3
 C4 = {abcd}
339
How to Count Supports of Candidates?
 Why counting supports of candidates a problem?
 The total number of candidates can be very huge
 One transaction may contain many candidates
 Method:
 Candidate itemsets are stored in a hash-tree
 Leaf node of hash-tree contains a list of itemsets
and counts
 Interior node contains a hash table
 Subset function: finds all the candidates contained
in a transaction
340
Counting Supports of Candidates Using Hash Tree
(Figure: the subset function hashes items into the branches
{1, 4, 7}, {2, 5, 8}, {3, 6, 9}; leaf nodes store candidate 3-itemsets
such as {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6},
{2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}.
Transaction 1 2 3 5 6 is split recursively as 1 + 2 3 5 6,
1 2 + 3 5 6, 1 3 + 5 6 to locate the candidates it contains.)
341
Candidate Generation: An SQL Implementation
 SQL Implementation of candidate generation
 Suppose the items in Lk-1 are listed in an order
 Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1 and … and p.itemk-2=q.itemk-2 and p.itemk-1 < q.itemk-1
 Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
 Use object-relational extensions like UDFs, BLOBs, and Table functions for
efficient implementation [See: S. Sarawagi, S. Thomas, and R. Agrawal.
Integrating association rule mining with relational database systems:
Alternatives and implications. SIGMOD’98]
342
Scalable Frequent Itemset Mining Methods
 Apriori: A Candidate Generation-and-Test Approach
 Improving the Efficiency of Apriori
 FPGrowth: A Frequent Pattern-Growth Approach
 ECLAT: Frequent Pattern Mining with Vertical Data
Format

343
Further Improvement of the Apriori Method
 Major computational challenges
 Multiple scans of transaction database
 Huge number of candidates
 Tedious workload of support counting for
candidates
 Improving Apriori: general ideas
 Reduce passes of transaction database scans
 Shrink number of candidates
 Facilitate support counting of candidates
Partition: Scan Database Only Twice
 Any itemset that is potentially frequent in DB must be
frequent in at least one of the partitions of DB
 Scan 1: partition database and find local frequent
patterns
 Scan 2: consolidate global frequent patterns
 A. Savasere, E. Omiecinski and S. Navathe, VLDB’95
(Figure: DB1 + DB2 + … + DBk = DB; if sup1(i) < σ·|DB1|,
sup2(i) < σ·|DB2|, …, supk(i) < σ·|DBk| in every partition, then
sup(i) < σ·|DB|, so i cannot be globally frequent.)
345
DHP: Reduce the Number of Candidates
 A k-itemset whose corresponding hashing bucket count is below
the threshold cannot be frequent
 Candidates: a, b, c, d, e
 Hash entries

{ab, ad, ae}

{bd, be, de}

…
 Frequent 1-itemset: a, b, d, e
 ab is not a candidate 2-itemset if the sum of count of {ab, ad,
ae} is below support threshold
 J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for
mining association rules. SIGMOD’95
(Hash table of bucket counts: e.g., the bucket holding
{ab, ad, ae} has count 35; other buckets, such as {yz, qs, wt}
and {bd, be, de}, hold counts 88, 102, ….)
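A minimal sketch of the bucket-count filter (the hash function and the number of buckets are arbitrary toy choices, not those of the paper):

from itertools import combinations
from collections import Counter

def dhp_prune_pairs(transactions, min_support, n_buckets=7):
    """DHP idea: while counting 1-itemsets, also hash every pair in each
    transaction into a small bucket-count table; a candidate pair can be
    frequent only if its bucket count reaches min_support."""
    item_counts, bucket_counts = Counter(), Counter()
    for t in transactions:
        item_counts.update(t)
        for pair in combinations(sorted(t), 2):
            bucket_counts[hash(pair) % n_buckets] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= min_support}
    candidates = []
    for pair in combinations(sorted(frequent_items), 2):
        if bucket_counts[hash(pair) % n_buckets] >= min_support:   # bucket filter
            candidates.append(pair)
    return candidates

tdb = [{"a", "b", "d"}, {"a", "d", "e"}, {"b", "d", "e"}, {"a", "b", "c", "d"}]
print(dhp_prune_pairs(tdb, min_support=2))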
346
Sampling for Frequent Patterns
 Select a sample of original database, mine frequent
patterns within sample using Apriori
 Scan database once to verify frequent itemsets found
in sample, only borders of closure of frequent patterns
are checked
 Example: check abcd instead of ab, ac, …, etc.
 Scan database again to find missed frequent patterns
 H. Toivonen. Sampling large databases for association
rules. In VLDB’96
347
DIC: Reduce Number of Scans
ABCD
ABC ABD ACD BCD
AB AC BC AD BD CD
A B C D
{}
Itemset lattice
 Once both A and D are determined
frequent, the counting of AD begins
 Once all length-2 subsets of BCD are
determined frequent, the counting of
BCD begins
(Figure: Apriori counts 1-itemsets, then 2-itemsets, … in separate
passes over the transactions, while DIC starts counting 2-itemsets
and 3-itemsets partway through the scan, as soon as their subsets
are known to be frequent.)
S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset
counting and implication rules for market basket data.
SIGMOD’97
348
Scalable Frequent Itemset Mining Methods
 Apriori: A Candidate Generation-and-Test Approach
 Improving the Efficiency of Apriori
 FPGrowth: A Frequent Pattern-Growth Approach
 ECLAT: Frequent Pattern Mining with Vertical Data
Format

349
Pattern-Growth Approach: Mining Frequent
Patterns Without Candidate Generation
 Bottlenecks of the Apriori approach
 Breadth-first (i.e., level-wise) search
 Candidate generation and test

Often generates a huge number of candidates
 The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD’ 00)
 Depth-first search
 Avoid explicit candidate generation
 Major philosophy: Grow long patterns from short ones using local
frequent items only
 “abc” is a frequent pattern
 Get all transactions having “abc”, i.e., project DB on abc: DB|abc
 “d” is a local frequent item in DB|abc  abcd is a frequent
pattern
350
Construct FP-tree from a Transaction Database
{}
├─ f:4
│   ├─ c:3 ── a:3
│   │          ├─ m:2 ── p:2
│   │          └─ b:1 ── m:1
│   └─ b:1
└─ c:1 ── b:1 ── p:1
Header Table
Item frequency head
f 4
c 4
a 3
b 3
m 3
p 3
min_support = 3
TID Items bought (ordered) frequent
items
100 {f, a, c, d, g, i, m, p} {f, c, a, m, p}
200 {a, b, c, f, l, m, o} {f, c, a, b, m}
300 {b, f, h, j, o, w} {f, b}
400 {b, c, k, s, p} {c, b, p}
500 {a, f, c, e, l, p, m, n} {f, c, a, m, p}
1. Scan DB once, find
frequent 1-itemset
(single item pattern)
2. Sort frequent items in
frequency descending
order, f-list
3. Scan DB again,
construct FP-tree
F-list = f-c-a-b-m-p
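A compact Python sketch of the two-scan construction (the node and header-table layout are simplified; tie order among equally frequent items may differ from the slide's f-list):

from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}

def build_fptree(transactions, min_support):
    """Scan 1: find frequent items and the f-list.
    Scan 2: insert each transaction, reordered by the f-list, into the tree."""
    freq = Counter(i for t in transactions for i in t)
    flist = [i for i, c in freq.most_common() if c >= min_support]
    order = {i: r for r, i in enumerate(flist)}
    root, header = Node(None, None), defaultdict(list)   # header: item -> node links
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in order), key=order.get):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header, flist

tdb = ["facdgimp", "abcflmo", "bfhjow", "bcksp", "facelpmn"]
root, header, flist = build_fptree([set(t) for t in tdb], min_support=3)
print(flist)     # ['f', 'c', 'a', 'b', 'm', 'p'] (order of ties may vary)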
351
Partition Patterns and Databases
 Frequent patterns can be partitioned into
subsets according to f-list
 F-list = f-c-a-b-m-p
 Patterns containing p
 Patterns having m but no p
 …
 Patterns having c but no a nor b, m, p
 Pattern f
 Completeness and non-redundency
352
Find Patterns Having P From P-conditional Database
 Starting at the frequent item header table in the FP-tree
 Traverse the FP-tree by following the link of each frequent
item p
 Accumulate all of transformed prefix paths of item p to form p’s
conditional pattern base
Conditional pattern bases
item cond. pattern base
c f:3
a fc:3
b fca:1, f:1, c:1
m fca:2, fcab:1
p fcam:2, cb:1
(Global FP-tree and header table as constructed on the previous
slides.)
353
From Conditional Pattern-bases to Conditional FP-trees
 For each pattern-base
 Accumulate the count for each item in the base
 Construct the FP-tree for the frequent items of
the pattern base
m-conditional pattern base:
fca:2, fcab:1
{}
f:3
c:3
a:3
m-conditional FP-tree
All frequent
patterns relate to m
m,
fm, cm, am,
fcm, fam, cam,
fcam


(Global FP-tree and header table as before.)
354
Recursion: Mining Each Conditional FP-tree
{}
f:3
c:3
a:3
m-conditional FP-tree
Cond. pattern base of “am”: (fc:3)
{}
f:3
c:3
am-conditional FP-tree
Cond. pattern base of “cm”: (f:3)
{}
f:3
cm-conditional FP-tree
Cond. pattern base of “cam”: (f:3)
{}
f:3
cam-conditional FP-tree
355
A Special Case: Single Prefix Path in FP-tree
 Suppose a (conditional) FP-tree T has a shared
single prefix-path P
 Mining can be decomposed into two parts
 Reduction of the single prefix path into one
node
 Concatenation of the mining results of the two
parts
(Figure: an FP-tree whose top is a single prefix path
{} → a1:n1 → a2:n2 → a3:n3, followed by a multipath part rooted at
r1 with branches b1:m1, C1:k1, C2:k2, C3:k3; the tree is decomposed
into the prefix-path part and the multipath part r1, and the mining
results of the two parts are concatenated.)
356
Benefits of the FP-tree Structure
 Completeness
 Preserve complete information for frequent pattern
mining
 Never break a long pattern of any transaction
 Compactness
 Reduce irrelevant info—infrequent items are gone
 Items in frequency descending order: the more
frequently occurring, the more likely to be shared
 Never be larger than the original database (not counting
node-links and the count fields)
357
The Frequent Pattern Growth Mining Method
 Idea: Frequent pattern growth
 Recursively grow frequent patterns by pattern and
database partition
 Method
 For each frequent item, construct its conditional
pattern-base, and then its conditional FP-tree
 Repeat the process on each newly created
conditional FP-tree
 Until the resulting FP-tree is empty, or it contains
only one path—single path will generate all the
combinations of its sub-paths, each of which is a
frequent pattern
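A compact recursive sketch of the same divide-and-conquer idea; for brevity it mines conditional pattern bases represented as (itemset, count) lists rather than physical FP-trees, so it is not the pointer-based implementation of the paper:

from collections import Counter

def fpgrowth(transactions, min_support, suffix=frozenset()):
    """For each frequent item (least frequent first), output suffix + item, then
    recursively mine its conditional database, which keeps only the items that
    come earlier in the local f-list (i.e., the conditional pattern base)."""
    counts = Counter()
    for items, cnt in transactions:
        for i in items:
            counts[i] += cnt
    flist = [i for i, c in counts.most_common() if c >= min_support]
    order = {i: r for r, i in enumerate(flist)}
    results = {}
    for item in reversed(flist):                      # least frequent first
        pattern = frozenset(suffix | {item})
        results[pattern] = counts[item]
        cond = [(frozenset(i for i in items if i in order and order[i] < order[item]), cnt)
                for items, cnt in transactions if item in items]
        results.update(fpgrowth(cond, min_support, pattern))
    return results

tdb = [(set("facdgimp"), 1), (set("abcflmo"), 1), (set("bfhjow"), 1),
       (set("bcksp"), 1), (set("facelpmn"), 1)]
found = fpgrowth(tdb, min_support=3)
print(sorted("".join(sorted(p)) for p in found))      # includes 'fcam', 'cp', ...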
358
Scaling FP-growth by Database Projection
 What about if FP-tree cannot fit in memory?
 DB projection
 First partition a database into a set of projected DBs
 Then construct and mine FP-tree for each projected DB
 Parallel projection vs. partition projection techniques
 Parallel projection

Project the DB in parallel for each frequent item

Parallel projection is space costly

All the partitions can be processed in parallel
 Partition projection

Partition the DB based on the ordered frequent items

Passing the unprocessed parts to the subsequent partitions
359
Partition-Based Projection
 Parallel projection needs a lot
of disk space
 Partition projection saves it
Tran. DB
fcamp
fcabm
fb
cbp
fcamp
p-proj DB
fcam
cb
fcam
m-proj DB
fcab
fca
fca
b-proj DB
f
cb
…
a-proj DB
fc
…
c-proj DB
f
…
f-proj DB
…
am-proj DB
fc
fc
fc
cm-proj DB
f
f
f
…
Performance of FPGrowth in Large Datasets
FP-Growth vs. Apriori
360
(Figure: run time (sec.) vs. support threshold (%) on data set
T25I20D10K: D1 FP-growth runtime vs. D1 Apriori runtime.)
FP-Growth vs. Tree-Projection
(Figure: runtime (sec.) vs. support threshold (%) on data set
T25I20D100K: D2 FP-growth vs. D2 TreeProjection.)
361
Advantages of the Pattern Growth Approach
 Divide-and-conquer:
 Decompose both the mining task and DB according to the
frequent patterns obtained so far
 Lead to focused search of smaller databases
 Other factors
 No candidate generation, no candidate test
 Compressed database: FP-tree structure
 No repeated scan of entire database
 Basic ops: counting local freq items and building sub FP-tree,
no pattern search and matching
 A good open-source implementation and refinement of
FPGrowth
 FPGrowth+ (Grahne and J. Zhu, FIMI'03)
362
Further Improvements of Mining Methods
 AFOPT (Liu, et al. @ KDD’03)
 A “push-right” method for mining condensed frequent pattern
(CFP) tree
 Carpenter (Pan, et al. @ KDD’03)
 Mine data sets with small rows but numerous columns
 Construct a row-enumeration tree for efficient mining
 FPgrowth+ (Grahne and Zhu, FIMI’03)
 Efficiently Using Prefix-Trees in Mining Frequent Itemsets,
Proc. ICDM'03 Int. Workshop on Frequent Itemset Mining
Implementations (FIMI'03), Melbourne, FL, Nov. 2003
 TD-Close (Liu, et al, SDM’06)
363
Extension of Pattern Growth Mining Methodology
 Mining closed frequent itemsets and max-patterns
 CLOSET (DMKD’00), FPclose, and FPMax (Grahne & Zhu, Fimi’03)
 Mining sequential patterns
 PrefixSpan (ICDE’01), CloSpan (SDM’03), BIDE (ICDE’04)
 Mining graph patterns
 gSpan (ICDM’02), CloseGraph (KDD’03)
 Constraint-based mining of frequent patterns
 Convertible constraints (ICDE’01), gPrune (PAKDD’03)
 Computing iceberg data cubes with complex measures
 H-tree, H-cubing, and Star-cubing (SIGMOD’01, VLDB’03)
 Pattern-growth-based Clustering
 MaPle (Pei, et al., ICDM’03)
 Pattern-Growth-Based Classification
 Mining frequent and discriminative patterns (Cheng, et al,
ICDE’07)
364
Scalable Frequent Itemset Mining Methods
 Apriori: A Candidate Generation-and-Test Approach
 Improving the Efficiency of Apriori
 FPGrowth: A Frequent Pattern-Growth Approach
 ECLAT: Frequent Pattern Mining with Vertical Data
Format

365
ECLAT: Mining by Exploring Vertical Data
Format
 Vertical format: t(AB) = {T11, T25, …}
 tid-list: list of trans.-ids containing an itemset
 Deriving frequent patterns based on vertical intersections
 t(X) = t(Y): X and Y always happen together
 t(X) ⊂ t(Y): transaction having X always has Y
 Using diffset to accelerate mining
 Only keep track of differences of tids
 t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
 Diffset (XY, X) = {T2}
 Eclat (Zaki et al. @KDD’97)
 Mining Closed patterns using vertical format: CHARM (Zaki &
Hsiao@SDM’02)
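A minimal sketch of the vertical representation, tid-list intersection, and diffsets (the four transactions are made up for illustration):

from collections import defaultdict

transactions = {"T1": {"A", "B", "C"}, "T2": {"A", "C"},
                "T3": {"A", "B"}, "T4": {"B", "D"}}

# Vertical format: item -> tid-list
tidlists = defaultdict(set)
for tid, items in transactions.items():
    for i in items:
        tidlists[i].add(tid)

# Support of an itemset = size of the intersection of its members' tid-lists
t_A, t_B = tidlists["A"], tidlists["B"]
t_AB = t_A & t_B
print(len(t_AB))               # support count of {A, B}

# Diffset: store only the tids that disappear when B is added to A
diffset_AB = t_A - t_AB
print(diffset_AB)              # {'T2'}; support({A,B}) = support({A}) - |diffset|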
366
Scalable Frequent Itemset Mining Methods
 Apriori: A Candidate Generation-and-Test Approach
 Improving the Efficiency of Apriori
 FPGrowth: A Frequent Pattern-Growth Approach
 ECLAT: Frequent Pattern Mining with Vertical Data
Format

Mining Frequent Closed Patterns: CLOSET
 Flist: list of all frequent items in support ascending order
 Flist: d-a-f-e-c
 Divide search space
 Patterns having d
 Patterns having d but no a, etc.
 Find frequent closed pattern recursively
 Every transaction having d also has cfa  cfad is a
frequent closed pattern
 J. Pei, J. Han & R. Mao. “CLOSET: An Efficient Algorithm for
Mining Frequent Closed Itemsets", DMKD'00.
TID Items
10 a, c, d, e, f
20 a, b, e
30 c, e, f
40 a, c, d, f
50 c, e, f
Min_sup=2
CLOSET+: Mining Closed Itemsets by Pattern-Growth
 Itemset merging: if Y appears in every occurrence of X, then
Y is merged with X
 Sub-itemset pruning: if Y ⊃ X, and sup(X) = sup(Y), X and all of
X’s descendants in the set enumeration tree can be pruned
 Hybrid tree projection
 Bottom-up physical tree-projection
 Top-down pseudo tree-projection
 Item skipping: if a local frequent item has the same support
in several header tables at different levels, one can prune it
from the header table at higher levels
 Efficient subset checking
MaxMiner: Mining Max-Patterns
 1st
scan: find frequent items
 A, B, C, D, E
 2nd
scan: find support for
 AB, AC, AD, AE, ABCDE
 BC, BD, BE, BCDE
 CD, CE, CDE, DE
 Since BCDE is a max-pattern, no need to check BCD,
BDE, CDE in later scan
 R. Bayardo. Efficiently mining long patterns from
databases. SIGMOD’98
Tid Items
10 A, B, C, D, E
20 B, C, D, E,
30 A, C, D, F
Potential
max-patterns
CHARM: Mining by Exploring Vertical Data
Format
 Vertical format: t(AB) = {T11, T25, …}
 tid-list: list of trans.-ids containing an itemset
 Deriving closed patterns based on vertical
intersections
 t(X) = t(Y): X and Y always happen together
 t(X) ⊂ t(Y): transaction having X always has Y
 Using diffset to accelerate mining
 Only keep track of differences of tids
 t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
 Diffset (XY, X) = {T2}
 Eclat/MaxEclat (Zaki et al. @KDD’97), VIPER (P. Shenoy et al. @SIGMOD’00)
371
Visualization of Association Rules: Plane
Graph
372
Visualization of Association Rules: Rule
Graph
373
Visualization of Association Rules
(SGI/MineSet 3.0)
374
Chapter 5: Mining Frequent Patterns, Association
and Correlations: Basic Concepts and Methods
 Basic Concepts
 Frequent Itemset Mining Methods
 Which Patterns Are Interesting?—Pattern
Evaluation Methods
 Summary
375
Interestingness Measure: Correlations (Lift)
 play basketball  eat cereal [40%, 66.7%] is misleading
 The overall % of students eating cereal is 75% > 66.7%.
 play basketball  not eat cereal [20%, 33.3%] is more accurate,
although with lower support and confidence
 Measure of dependent/correlated events: lift
lift = P(A ∪ B) / (P(A) × P(B))

| Basketball | Not basketball | Sum (row)
Cereal | 2000 | 1750 | 3750
Not cereal | 1000 | 250 | 1250
Sum (col.) | 3000 | 2000 | 5000

lift(B, C) = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
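The two lift values can be reproduced directly from the contingency counts (a small sketch, not part of the slides):

def lift(n_ab, n_a, n_b, n):
    """lift(A, B) = P(A and B) / (P(A) * P(B)), from contingency counts."""
    return (n_ab / n) / ((n_a / n) * (n_b / n))

n = 5000
print(round(lift(2000, 3000, 3750, n), 2))   # lift(basketball, cereal)     -> 0.89
print(round(lift(1000, 3000, 1250, n), 2))   # lift(basketball, not cereal) -> 1.33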
376
Are lift and χ² Good Measures of Correlation?
 “Buy walnuts  buy
milk [1%, 80%]” is
misleading if 85% of
customers buy milk
 Support and
confidence are not
good to indicate
correlations
 Over 20
interestingness
measures have been
proposed (see Tan,
Kumar, Srivastava
@KDD’02)
 Which are good ones?
377
Null-Invariant Measures
378
Comparison of Interestingness Measures
| Milk | No Milk | Sum (row)
Coffee | m, c | ~m, c | c
No Coffee | m, ~c | ~m, ~c | ~c
Sum (col.) | m | ~m | Σ
 Null-(transaction) invariance is crucial for correlation analysis
 Lift and χ² are not null-invariant
 5 null-invariant measures
Null-transactions
w.r.t. m and c Null-invariant
Subtle: They disagree
Kulczynski
measure (1927)
379
Analysis of DBLP Coauthor Relationships
Advisor-advisee relation: Kulc: high,
coherence: low, cosine: middle
Recent DB conferences, removing balanced associations, low sup, etc.
 Tianyi Wu, Yuguo Chen and Jiawei Han, “Association Mining in
Large Databases: A Re-Examination of Its Measures”, Proc. 2007
Int. Conf. Principles and Practice of Knowledge Discovery in
Databases (PKDD'07), Sept. 2007
Which Null-Invariant Measure Is Better?
 IR (Imbalance Ratio): measure the imbalance of two
itemsets A and B in rule implications
 Kulczynski and Imbalance Ratio (IR) together present a
clear picture for all the three datasets D4 through D6
 D4 is balanced & neutral
 D5 is imbalanced & neutral
 D6 is very imbalanced & neutral
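A sketch of the two measures; the milk/coffee counts are illustrative assumptions, not data from the slides (Kulc is null-invariant because it uses only the counts of A, B, and A-and-B):

def kulczynski(n_ab, n_a, n_b):
    # Kulc(A, B) = (P(B|A) + P(A|B)) / 2
    return (n_ab / n_a + n_ab / n_b) / 2

def imbalance_ratio(n_ab, n_a, n_b):
    # IR(A, B) = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A and B))
    return abs(n_a - n_b) / (n_a + n_b - n_ab)

mc, m, c = 10_000, 100_000, 10_500      # counts of (milk and coffee), milk, coffee
print(kulczynski(mc, m, c), imbalance_ratio(mc, m, c))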
381
Chapter 5: Mining Frequent Patterns, Association
and Correlations: Basic Concepts and Methods
 Basic Concepts
 Frequent Itemset Mining Methods
 Which Patterns Are Interesting?—Pattern
Evaluation Methods
 Summary
382
Summary
 Basic concepts: association rules, support-
confidence framework, closed and max-patterns
 Scalable frequent pattern mining methods
 Apriori (Candidate generation & test)
 Projection-based (FPgrowth, CLOSET+, ...)
 Vertical format approach (ECLAT, CHARM, ...)
 Which patterns are interesting?
 Pattern evaluation methods
383
Ref: Basic Concepts of Frequent Pattern Mining
 (Association Rules) R. Agrawal, T. Imielinski, and A. Swami. Mining
association rules between sets of items in large databases. SIGMOD'93
 (Max-pattern) R. J. Bayardo. Efficiently mining long patterns from
databases. SIGMOD'98
 (Closed-pattern) N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal.
Discovering frequent closed itemsets for association rules. ICDT'99
 (Sequential pattern) R. Agrawal and R. Srikant. Mining sequential patterns.
ICDE'95
384
Ref: Apriori and Its Improvements
 R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94
 H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering
association rules. KDD'94
 A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining
association rules in large databases. VLDB'95
 J. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based algorithm for
mining association rules. SIGMOD'95
 H. Toivonen. Sampling large databases for association rules. VLDB'96
 S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and
implication rules for market basket analysis. SIGMOD'97
 S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining
with relational database systems: Alternatives and implications. SIGMOD'98
385
Ref: Depth-First, Projection-Based FP Mining
 R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation
of frequent itemsets. J. Parallel and Distributed Computing, 2002.
 G. Grahne and J. Zhu, Efficiently Using Prefix-Trees in Mining Frequent Itemsets, Proc.
FIMI'03
 B. Goethals and M. Zaki. An introduction to workshop on frequent itemset mining
implementations. Proc. ICDM’03 Int. Workshop on Frequent Itemset Mining
Implementations (FIMI’03), Melbourne, FL, Nov. 2003
 J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation.
SIGMOD’ 00
 J. Liu, Y. Pan, K. Wang, and J. Han. Mining Frequent Item Sets by Opportunistic
Projection. KDD'02
 J. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining Top-K Frequent Closed Patterns without
Minimum Support. ICDM'02
 J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the Best Strategies for Mining
Frequent Closed Itemsets. KDD'03
386
Ref: Vertical Format and Row Enumeration Methods
 M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithm for
discovery of association rules. DAMI:97.
 M. J. Zaki and C. J. Hsiao. CHARM: An Efficient Algorithm for Closed Itemset
Mining, SDM'02.
 C. Bucila, J. Gehrke, D. Kifer, and W. White. DualMiner: A Dual-Pruning
Algorithm for Itemsets with Constraints. KDD’02.
 F. Pan, G. Cong, A. K. H. Tung, J. Yang, and M. Zaki , CARPENTER: Finding
Closed Patterns in Long Biological Datasets. KDD'03.
 H. Liu, J. Han, D. Xin, and Z. Shao, Mining Interesting Patterns from Very High
Dimensional Data: A Top-Down Row Enumeration Approach, SDM'06.
387
Ref: Mining Correlations and Interesting Rules
 S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing
association rules to correlations. SIGMOD'97.
 M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding
interesting rules from large sets of discovered association rules. CIKM'94.
 R. J. Hilderman and H. J. Hamilton. Knowledge Discovery and Measures of Interest.
Kluwer Academic, 2001.
 C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining
causal structures. VLDB'98.
 P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure
for Association Patterns. KDD'02.
 E. Omiecinski. Alternative Interest Measures for Mining Associations. TKDE’03.
 T. Wu, Y. Chen, and J. Han, “Re-Examination of Interestingness Measures in Pattern
Mining: A Unified Framework", Data Mining and Knowledge Discovery, 21(3):371-
397, 2010
388
388
Data Mining:
Concepts and Techniques
(3rd
ed.)
— Chapter 7 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2010 Han, Kamber & Pei. All rights reserved.
389
390
Chapter 7 : Advanced Frequent Pattern Mining
 Pattern Mining: A Road Map
 Pattern Mining in Multi-Level, Multi-Dimensional
Space
 Constraint-Based Frequent Pattern Mining
 Mining High-Dimensional Data and Colossal Patterns
 Mining Compressed or Approximate Patterns
 Pattern Exploration and Application
 Summary
Research on Pattern Mining: A Road Map
391
392
Chapter 7 : Advanced Frequent Pattern Mining
 Pattern Mining: A Road Map
 Pattern Mining in Multi-Level, Multi-Dimensional
Space
 Mining Multi-Level Association
 Mining Multi-Dimensional Association
 Mining Quantitative Association Rules
 Mining Rare Patterns and Negative Patterns
 Constraint-Based Frequent Pattern Mining
 Mining High-Dimensional Data and Colossal Patterns
 Mining Compressed or Approximate Patterns
 Pattern Exploration and Application
 Summary
393
Mining Multiple-Level Association Rules
 Items often form hierarchies
 Flexible support settings
 Items at the lower level are expected to have lower
support
 Exploration of shared multi-level mining (Agrawal &
Srikant@VLDB’95, Han & Fu@VLDB’95)
uniform
support
Milk
[support = 10%]
2% Milk
[support = 6%]
Skim Milk
[support = 4%]
Level 1
min_sup = 5%
Level 2
min_sup = 5%
Level 1
min_sup = 5%
Level 2
min_sup = 3%
reduced support
394
Multi-level Association: Flexible Support and
Redundancy filtering
 Flexible min-support thresholds: Some items are more valuable
but less frequent
 Use non-uniform, group-based min-support
 E.g., {diamond, watch, camera}: 0.05%; {bread, milk}: 5%; …
 Redundancy Filtering: Some rules may be redundant due to
“ancestor” relationships between items
 milk  wheat bread [support = 8%, confidence = 70%]
 2% milk  wheat bread [support = 2%, confidence = 72%]
The first rule is an ancestor of the second rule
 A rule is redundant if its support is close to the “expected” value,
based on the rule’s ancestor
395
Chapter 7 : Advanced Frequent Pattern Mining
 Pattern Mining: A Road Map
 Pattern Mining in Multi-Level, Multi-Dimensional
Space
 Mining Multi-Level Association
 Mining Multi-Dimensional Association
 Mining Quantitative Association Rules
 Mining Rare Patterns and Negative Patterns
 Constraint-Based Frequent Pattern Mining
 Mining High-Dimensional Data and Colossal Patterns
 Mining Compressed or Approximate Patterns
 Pattern Exploration and Application
 Summary
396
Mining Multi-Dimensional Association
 Single-dimensional rules:
buys(X, “milk”) ⇒ buys(X, “bread”)
 Multi-dimensional rules: ≥ 2 dimensions or predicates
 Inter-dimension assoc. rules (no repeated predicates)
age(X, ”19-25”) ∧ occupation(X, “student”) ⇒ buys(X, “coke”)
 hybrid-dimension assoc. rules (repeated predicates)
age(X, ”19-25”) ∧ buys(X, “popcorn”) ⇒ buys(X, “coke”)
 Categorical Attributes: finite number of possible values,
no ordering among values—data cube approach
 Quantitative Attributes: Numeric, implicit ordering
among values—discretization, clustering, and gradient
approaches
397
Chapter 7 : Advanced Frequent Pattern Mining
 Pattern Mining: A Road Map
 Pattern Mining in Multi-Level, Multi-Dimensional
Space
 Mining Multi-Level Association
 Mining Multi-Dimensional Association
 Mining Quantitative Association Rules
 Mining Rare Patterns and Negative Patterns
 Constraint-Based Frequent Pattern Mining
 Mining High-Dimensional Data and Colossal Patterns
 Mining Compressed or Approximate Patterns
 Pattern Exploration and Application
 Summary
398
Mining Quantitative Associations
Techniques can be categorized by how numerical
attributes, such as age or salary, are treated
1. Static discretization based on predefined concept
hierarchies (data cube methods)
2. Dynamic discretization based on data distribution
(quantitative rules, e.g., Agrawal &
Srikant@SIGMOD96)
3. Clustering: Distance-based association (e.g., Yang &
Miller@SIGMOD97)
 One dimensional clustering then association
4. Deviation: (such as Aumann and Lindell@KDD99)
Sex = female => Wage: mean=$7/hr (overall mean = $9)
399
Static Discretization of Quantitative Attributes
 Discretized prior to mining using concept hierarchy.
 Numeric values are replaced by ranges
 In relational database, finding all frequent k-predicate
sets will require k or k+1 table scans
 Data cube is well suited for mining
 The cells of an n-dimensional
cuboid correspond to the
predicate sets
 Mining from data cubes
can be much faster
(income)
(age)
()
(buys)
(age, income) (age,buys) (income,buys)
(age,income,buys)
400
Quantitative Association Rules Based on Statistical
Inference Theory [Aumann and Lindell@DMKD’03]
 Finding extraordinary and therefore interesting phenomena, e.g.,
(Sex = female) => Wage: mean=$7/hr (overall mean = $9)
 LHS: a subset of the population
 RHS: an extraordinary behavior of this subset
 The rule is accepted only if a statistical test (e.g., Z-test) confirms
the inference with high confidence
 Subrule: highlights the extraordinary behavior of a subset of the
pop. of the super rule
 E.g., (Sex = female) ^ (South = yes) => mean wage = $6.3/hr
 Two forms of rules
 Categorical => quantitative rules, or Quantitative => quantitative rules
 E.g., Education in [14-18] (yrs) => mean wage = $11.64/hr
 Open problem: Efficient methods for LHS containing two or more
quantitative attributes
401
Chapter 7 : Advanced Frequent Pattern Mining
 Pattern Mining: A Road Map
 Pattern Mining in Multi-Level, Multi-Dimensional
Space
 Mining Multi-Level Association
 Mining Multi-Dimensional Association
 Mining Quantitative Association Rules
 Mining Rare Patterns and Negative Patterns
 Constraint-Based Frequent Pattern Mining
 Mining High-Dimensional Data and Colossal Patterns
 Mining Compressed or Approximate Patterns
 Pattern Exploration and Application
 Summary
402
Negative and Rare Patterns
 Rare patterns: Very low support but interesting
 E.g., buying Rolex watches
 Mining: Setting individual-based or special group-
based support threshold for valuable items
 Negative patterns
 Since it is unlikely that one buys Ford Expedition (an
SUV car) and Toyota Prius (a hybrid car) together,
Ford Expedition and Toyota Prius are likely negatively
correlated patterns
 Negatively correlated patterns that are infrequent tend
to be more interesting than those that are frequent
403
Defining Negatively Correlated Patterns (I)
 Definition 1 (support-based)
 If itemsets X and Y are both frequent but rarely occur together,
i.e.,
sup(X U Y) < sup (X) * sup(Y)
 Then X and Y are negatively correlated
 Problem: A store sold two needle packages A and B; each was sold in 100
transactions, but only one transaction contained both A and B
 When there are in total 200 transactions, we have
s(A ∪ B) = 0.005, s(A) * s(B) = 0.25, so s(A ∪ B) < s(A) * s(B)
 When there are 10^5 transactions, we have
s(A ∪ B) = 1/10^5, s(A) * s(B) = 1/10^3 * 1/10^3 = 1/10^6, so s(A ∪ B) > s(A) * s(B)
 Where is the problem? —Null transactions, i.e., the support-
based definition is not null-invariant!
404
Defining Negatively Correlated Patterns (II)
 Definition 2 (negative itemset-based)
 X is a negative itemset if (1) X = Ā ∪ B, where B is a set of positive
items and Ā is a set of negative items, |Ā| ≥ 1, and (2) s(X) ≥ μ
 Itemset X is negatively correlated if its observed support is significantly
lower than its expected support under independence
 This definition suffers from a similar null-invariance problem
 Definition 3 (Kulczynski measure-based): If itemsets X and Y are
frequent, but (P(X|Y) + P(Y|X))/2 < є, where є is a negative pattern
threshold, then X and Y are negatively correlated
 Ex. For the same needle-package problem, no matter whether there
are 200 or 10^5 transactions, with є = 0.01 we have
(P(A|B) + P(B|A))/2 = (0.01 + 0.01)/2 ≤ є
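A minimal sketch of the Kulczynski-based check in Definition 3; the support counts reuse the needle-package example and the threshold value is illustrative.

# Minimal sketch of Definition 3: flag X and Y as negatively correlated when
# the Kulczynski measure (P(X|Y) + P(Y|X)) / 2 falls below a threshold epsilon.
def kulczynski(sup_x, sup_y, sup_xy):
    """Average of the two conditional supports P(X|Y) and P(Y|X)."""
    return 0.5 * (sup_xy / sup_x + sup_xy / sup_y)

def negatively_correlated(sup_x, sup_y, sup_xy, epsilon):
    return kulczynski(sup_x, sup_y, sup_xy) < epsilon

# Needle-package example: A and B each occur 100 times, together only once.
# The value 0.01 is the same whether the database has 200 or 100,000
# transactions, i.e., the measure is null-invariant.
print(kulczynski(sup_x=100, sup_y=100, sup_xy=1))            # 0.01
print(negatively_correlated(100, 100, 1, epsilon=0.02))      # True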
405
Chapter 7 : Advanced Frequent Pattern Mining
 Pattern Mining: A Road Map
 Pattern Mining in Multi-Level, Multi-Dimensional
Space
 Constraint-Based Frequent Pattern Mining
 Mining High-Dimensional Data and Colossal Patterns
 Mining Compressed or Approximate Patterns
 Pattern Exploration and Application
 Summary
406
Constraint-based (Query-Directed) Mining
 Finding all the patterns in a database autonomously? —
unrealistic!
 The patterns could be too many but not focused!
 Data mining should be an interactive process
 User directs what to be mined using a data mining query
language (or a graphical user interface)
 Constraint-based mining
 User flexibility: provides constraints on what to be mined
 Optimization: explores such constraints for efficient mining —
constraint pushing, similar to pushing selections first in DB
query processing
 Note: still find all the answers satisfying constraints, not
finding some answers in “heuristic search”
407
Constraints in Data Mining
 Knowledge type constraint:
 classification, association, etc.
 Data constraint — using SQL-like queries
 find product pairs sold together in stores in Chicago
this year
 Dimension/level constraint
 in relevance to region, price, brand, customer
category
 Rule (or pattern) constraint
 small sales (price < $10) triggers big sales (sum >
$200)
 Interestingness constraint
 strong rules: min_support ≥ 3%, min_confidence ≥ 60%
Meta-Rule Guided Mining
 Meta-rule can be in the rule form with partially instantiated
predicates and constants
P1(X, Y) ^ P2(X, W) => buys(X, “iPad”)
 The resulting rule derived can be
age(X, “15-25”) ^ profession(X, “student”) => buys(X, “iPad”)
 In general, it can be in the form of
P1 ^ P2 ^ … ^ Pl => Q1 ^ Q2 ^ … ^ Qr
 Method to find meta-rules
 Find frequent (l+r) predicates (based on min-support
threshold)
 Push constants deeply when possible into the mining process
(see the remaining discussions on constraint-push techniques)
 Use confidence, correlation, and other interestingness measures when possible
408
409
Constraint-Based Frequent Pattern Mining
 Pattern space pruning constraints
 Anti-monotonic: If constraint c is violated, its further mining
can be terminated
 Monotonic: If c is satisfied, no need to check c again
 Succinct: c must be satisfied, so one can start with the data
sets satisfying c
 Convertible: c is neither monotonic nor anti-monotonic, but it can
be converted into one of them if items in the transaction can be
properly ordered
 Data space pruning constraint
 Data succinct: Data space can be pruned at the initial pattern
mining process
 Data anti-monotonic: If a transaction t does not satisfy c, t can
be pruned from its further mining
410
Pattern Space Pruning with Anti-Monotonicity Constraints
 A constraint C is anti-monotone if, whenever a super-
pattern satisfies C, all of its sub-patterns do so
too
 In other words, anti-monotonicity: if an itemset
S violates the constraint, so does any of its
supersets
 Ex. 1. sum(S.price) ≤ v is anti-monotone
 Ex. 2. range(S.profit) ≤ 15 is anti-monotone
 Itemset ab violates C
 So does every superset of ab
 Ex. 3. sum(S.price) ≥ v is not anti-monotone
 Ex. 4. support count is anti-monotone: core
property used in Apriori
TID Transaction
10 a, b, c, d, f
20 b, c, d, f, g, h
30 a, c, d, e, f
40 c, e, f, g
TDB (min_sup=2)
Item Profit
a 40
b 0
c -20
d 10
e -30
f 30
g 20
h -10
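A small sketch of how anti-monotone pruning is used during mining (the item prices and budget v below are hypothetical): once sum(S.price) ≤ v is violated, the itemset and its entire superset branch can be skipped.

# Sketch of anti-monotone pruning: once an itemset violates sum(S.price) <= v,
# every superset also violates it, so the whole branch can be abandoned.
def violates_sum_budget(itemset, price, v):
    return sum(price[i] for i in itemset) > v

price = {"a": 40, "b": 0, "c": 10, "d": 25}   # hypothetical prices
v = 50

candidates = [("a",), ("a", "d"), ("a", "b")]
for itemset in candidates:
    if violates_sum_budget(itemset, price, v):
        # Anti-monotonicity: prune the itemset and never extend it.
        print("prune", itemset, "and all of its supersets")
    else:
        print("keep ", itemset)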
411
Pattern Space Pruning with Monotonicity Constraints
 A constraint C is monotone if, once a pattern
satisfies C, we do not need to check C in
subsequent mining
 Alternatively, monotonicity: if an itemset S
satisfies the constraint, so does any of its
supersets
 Ex. 1. sum(S.price) ≥ v is monotone
 Ex. 2. min(S.price) ≤ v is monotone
 Ex. 3. C: range(S.profit) ≥ 15
 Itemset ab satisfies C
 So does every superset of ab
TID Transaction
10 a, b, c, d, f
20 b, c, d, f, g, h
30 a, c, d, e, f
40 c, e, f, g
TDB (min_sup=2)
Item Profit
a 40
b 0
c -20
d 10
e -30
f 30
g 20
h -10
412
Data Space Pruning with Data Anti-monotonicity
 A constraint c is data anti-monotone if, when a
pattern p cannot satisfy a transaction t under c,
no superset of p can satisfy t under c either
 The key for data anti-monotonicity is recursive data
reduction
 Ex. 1. sum(S.price) ≥ v is data anti-monotone
 Ex. 2. min(S.price) ≤ v is data anti-monotone
 Ex. 3. C: range(S.profit) ≥ 25 is data anti-
monotone
 Itemset {b, c}’s projected DB:
T10’: {d, f, h}, T20’: {d, f, g, h}, T30’: {d, f, g}
 Since C cannot be satisfied by extending {b, c} within T10’,
T10’ can be pruned
TID Transaction
10 a, b, c, d, f, h
20 b, c, d, f, g, h
30 b, c, d, f, g
40 c, e, f, g
TDB (min_sup=2)
Item Profit
a 40
b 0
c -20
d -15
e -30
f -10
g 20
h -5
413
Pattern Space Pruning with Succinctness
 Succinctness:
 Given A1, the set of items satisfying a succinctness
constraint C, then any set S satisfying C is based
on A1 , i.e., S contains a subset belonging to A1
 Idea: Without looking at the transaction database,
whether an itemset S satisfies constraint C can be
determined based on the selection of items
 min(S.price) ≤ v is succinct
 sum(S.price) ≥ v is not succinct
 Optimization: If C is succinct, C is pre-counting
pushable
414
Naïve Algorithm: Apriori + Constraint
Database D (TID: Items):  100: 1 3 4;  200: 2 3 5;  300: 1 2 3 5;  400: 2 5
Scan D → C1 (itemset: sup):  {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3
L1:  {1}: 2, {2}: 3, {3}: 3, {5}: 3
C2 (candidates):  {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D → C2 (itemset: sup):  {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2
L2:  {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2
C3:  {2 3 5};  Scan D → {2 3 5}: 2;  L3:  {2 3 5}: 2
Constraint: Sum{S.price} < 5
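Beyond the naïve post-filtering sketched above, an anti-monotone constraint such as Sum{S.price} < 5 can be checked on every candidate during the level-wise search. The sketch below (assuming, purely for illustration, that each item's price equals its id) pushes the constraint into a toy Apriori loop over database D.

# Compact sketch of Apriori with an anti-monotone constraint checked on every
# candidate (item prices are hypothetical; here each item's price equals its id).
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
MIN_SUP, PRICE_BUDGET = 2, 5
price = {i: i for i in {1, 2, 3, 4, 5}}

def support(itemset):
    return sum(1 for t in D if itemset <= t)

def satisfies_constraint(itemset):           # sum(S.price) < 5, anti-monotone
    return sum(price[i] for i in itemset) < PRICE_BUDGET

# Level 1
items = sorted({i for t in D for i in t})
L = [frozenset([i]) for i in items
     if support(frozenset([i])) >= MIN_SUP and satisfies_constraint({i})]
level, all_frequent = 2, list(L)
while L:
    # Candidate generation by joining, then support counting and constraint pushing.
    candidates = {a | b for a in L for b in L if len(a | b) == level}
    L = [c for c in candidates
         if support(c) >= MIN_SUP and satisfies_constraint(c)]
    all_frequent.extend(L)
    level += 1
print(sorted(tuple(sorted(s)) for s in all_frequent))   # constrained frequent itemsets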
415
Constrained Apriori : Push a Succinct Constraint
Deep
Database D (TID: Items):  100: 1 3 4;  200: 2 3 5;  300: 1 2 3 5;  400: 2 5
Scan D → C1 (itemset: sup):  {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3
L1:  {1}: 2, {2}: 3, {3}: 3, {5}: 3
C2 (candidates):  {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D → C2 (itemset: sup):  {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2
L2:  {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2
C3:  {2 3 5};  Scan D → {2 3 5}: 2;  L3:  {2 3 5}: 2
Constraint: min{S.price} <= 1, not immediately to be used
416
Constrained FP-Growth: Push a Succinct
Constraint Deep
Constraint:
min{S.price } <= 1
D (TID: Items):  100: 1 3 4;  200: 2 3 5;  300: 1 2 3 5;  400: 2 5
Remove infrequent length-1 items → 100: 1 3;  200: 2 3 5;  300: 1 2 3 5;  400: 2 5 → FP-Tree
1-Projected DB:  100: 3 4;  300: 2 3 5
No need to project on 2, 3, or 5
417
Constrained FP-Growth: Push a Data
Anti-monotonic Constraint Deep
Constraint:
min{S.price } <= 1
D (TID: Items):  100: 1 3 4;  200: 2 3 5;  300: 1 2 3 5;  400: 2 5
Remove from data (data anti-monotonic pruning) → 100: 1 3;  300: 1 3 → FP-Tree
Single branch, we are done
418
Constrained FP-Growth: Push a Data
Anti-monotonic Constraint Deep
Constraint:
range{S.price } > 25
min_sup >= 2
TDB (TID: Transaction):  10: a, b, c, d, f, h;  20: b, c, d, f, g, h;  30: b, c, d, f, g;  40: a, c, e, f, g
Item profits:  a 40, b 0, c -20, d -15, e -30, f -10, g 20, h -5
Recursive data pruning → b-projected DB:  10: a, c, d, f, h;  20: c, d, f, g, h;  30: c, d, f, g
FP-Tree on the pruned data has a single branch:  bcdfg: 2
419
Convertible Constraints: Ordering Data in
Transactions
 Convert tough constraints into anti-
monotone or monotone by properly
ordering items
 Examine C: avg(S.profit) ≥ 25
 Order items in value-descending
order: <a, f, g, d, b, h, c, e>
 If an itemset afb violates C
 So does afbh, afb*
 It becomes anti-monotone!
TID Transaction
10 a, b, c, d, f
20 b, c, d, f, g, h
30 a, c, d, e, f
40 c, e, f, g
TDB (min_sup=2)
Item Profit
a 40
b 0
c -20
d 10
e -30
f 30
g 20
h -10
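A minimal sketch of the conversion: items are listed in profit-descending order, so once a prefix violates avg(S.profit) ≥ 25, all of its extensions can be pruned (the profits are the ones from the table above).

# Sketch of a convertible constraint: with items listed in value-descending order,
# avg(S.profit) >= 25 becomes anti-monotone, so a violating prefix prunes all of
# its extensions.
profit = {"a": 40, "b": 0, "c": -20, "d": 10, "e": -30, "f": 30, "g": 20, "h": -10}
R = sorted(profit, key=profit.get, reverse=True)    # <a, f, g, d, b, h, c, e>

def avg_profit(itemset):
    return sum(profit[i] for i in itemset) / len(itemset)

prefix = ["a", "f", "b"]        # avg = 23.3 < 25: violates the constraint
print(R)
print(avg_profit(prefix))
# Because items are appended in descending profit order, any extension of this
# prefix (afbh, afbc, ...) can only lower the average further, so prune it.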
420
Strongly Convertible Constraints
 avg(X) ≥ 25 is convertible anti-monotone
w.r.t. item-value descending order R: <a, f, g,
d, b, h, c, e>
 If an itemset af violates a constraint C, so
does every itemset with af as prefix, such
as afd
 avg(X) ≥ 25 is convertible monotone w.r.t.
item-value ascending order R^-1: <e, c, h, b, d,
g, f, a>
 If an itemset d satisfies a constraint C, so
do itemsets df and dfa, which have d
as a prefix
 Thus, avg(X) ≥ 25 is strongly convertible
Item Profit
a 40
b 0
c -20
d 10
e -30
f 30
g 20
h -10
421
Can Apriori Handle Convertible Constraints?
 A constraint that is convertible but neither
monotone, anti-monotone, nor succinct cannot be
pushed deep into an Apriori mining
algorithm
 Within the level-wise framework, no direct
pruning based on the constraint can be
made
 Itemset df violates constraint C: avg(X) >=
25
 Since adf satisfies C, Apriori needs df to
assemble adf, so df cannot be pruned
 But the constraint can be pushed into the frequent-pattern
growth framework
Item Value
a 40
b 0
c -20
d 10
e -30
f 30
g 20
h -10
422
Pattern Space Pruning w. Convertible Constraints
 C: avg(X) >= 25, min_sup=2
 List items in every transaction in value
descending order R: <a, f, g, d, b, h, c, e>
 C is convertible anti-monotone w.r.t. R
 Scan TDB once
 Remove infrequent items: item h is dropped
 Itemsets a and f are good, …
 Projection-based mining
 Imposing an appropriate order on item
projection
 Many tough constraints can be converted
into (anti)-monotone
TID Transaction
10 a, f, d, b, c
20 f, g, d, b, c
30 a, f, d, c, e
40 f, g, h, c, e
TDB (min_sup=2)
Item Value
a 40
f 30
g 20
d 10
b 0
h -10
c -20
e -30
423
Handling Multiple Constraints
 Different constraints may require different or even
conflicting item-ordering
 If there exists an order R s.t. both C1 and C2 are
convertible w.r.t. R, then there is no conflict between
the two convertible constraints
 If there is a conflict on the order of items
 Try to satisfy one constraint first
 Then use the order for the other constraint to
mine frequent itemsets in the corresponding
projected database
424
What Constraints Are Convertible?
Constraint                                          Convertible anti-monotone   Convertible monotone   Strongly convertible
avg(S) ≤ v, ≥ v                                     Yes                         Yes                    Yes
median(S) ≤ v, ≥ v                                  Yes                         Yes                    Yes
sum(S) ≤ v (items could be of any value, v ≥ 0)     Yes                         No                     No
sum(S) ≤ v (items could be of any value, v ≤ 0)     No                          Yes                    No
sum(S) ≥ v (items could be of any value, v ≥ 0)     No                          Yes                    No
sum(S) ≥ v (items could be of any value, v ≤ 0)     Yes                         No                     No
……
425
Constraint-Based Mining — A General Picture
Constraint                          Anti-monotone   Monotone      Succinct
v ∈ S                               no              yes           yes
S ⊇ V                               no              yes           yes
S ⊆ V                               yes             no            yes
min(S) ≤ v                          no              yes           yes
min(S) ≥ v                          yes             no            yes
max(S) ≤ v                          yes             no            yes
max(S) ≥ v                          no              yes           yes
count(S) ≤ v                        yes             no            weakly
count(S) ≥ v                        no              yes           weakly
sum(S) ≤ v (∀a ∈ S, a ≥ 0)          yes             no            no
sum(S) ≥ v (∀a ∈ S, a ≥ 0)          no              yes           no
range(S) ≤ v                        yes             no            no
range(S) ≥ v                        no              yes           no
avg(S) θ v, θ ∈ {=, ≤, ≥}           convertible     convertible   no
support(S) ≥ ξ                      yes             no            no
support(S) ≤ ξ                      no              yes           no
426
Chapter 7 : Advanced Frequent Pattern Mining
 Pattern Mining: A Road Map
 Pattern Mining in Multi-Level, Multi-Dimensional
Space
 Constraint-Based Frequent Pattern Mining
 Mining High-Dimensional Data and Colossal Patterns
 Mining Compressed or Approximate Patterns
 Pattern Exploration and Application
 Summary
427
Mining Colossal Frequent Patterns
 F. Zhu, X. Yan, J. Han, P. S. Yu, and H. Cheng, “Mining Colossal
Frequent Patterns by Core Pattern Fusion”, ICDE'07.
 We have many algorithms, but can we mine large (i.e., colossal)
patterns ― say, of size around 50 to 100? Unfortunately, no!
 Why not? ― the curse of the “downward closure” property of frequent patterns
 The “downward closure” property
 Any sub-pattern of a frequent pattern is frequent.
 Example. If (a1, a2, …, a100) is frequent, then a1, a2, …, a100, (a1,
a2), (a1, a3), …, (a1, a100), (a1, a2, a3), … are all frequent! There
are about 2^100 such frequent itemsets!
 No matter whether we use breadth-first search (e.g., Apriori) or depth-first
search (e.g., FP-growth), we have to examine a huge number of patterns
 Thus the downward closure property leads to an explosion!
428
Colossal Patterns: A Motivating Example
 Let’s make a set of 40 transactions: T1 = T2 = … = T40 = 1 2 3 4 … 39 40,
then delete the items on the diagonal (remove item i from Ti):
T1 = 2 3 4 … 39 40;  T2 = 1 3 4 … 39 40;  … ;  T40 = 1 2 3 4 … 39
 Closed/maximal patterns may partially alleviate the problem but not
really solve it: we often need to mine scattered large patterns!
 Let the minimum support threshold σ = 20
 There are C(40, 20) frequent patterns of size 20
 Each is closed and maximal
 # patterns = C(n, n/2), so the size of the answer set is
exponential in n
429
Colossal Pattern Set: Small but Interesting
 It is often the case that
only a small number of
patterns are colossal,
i.e., of large size
 Colossal patterns usually
carry greater importance
than patterns of small
size
430
Mining Colossal Patterns: Motivation and
Philosophy
 Motivation: Many real-world tasks need mining colossal patterns
 Micro-array analysis in bioinformatics (when support is low)
 Biological sequence patterns
 Biological/sociological/information graph pattern mining
 No hope for completeness
 If the mining of mid-sized patterns is explosive in size, there is
no hope of finding colossal patterns efficiently by insisting on the
“complete set” mining philosophy
 Jumping out of the swamp of mid-sized results
 What we may develop is a philosophy that jumps out of the
swamp of mid-sized results, which are explosive in size, and
reaches colossal patterns directly
 Striving for mining almost complete sets of colossal patterns
 The key is to develop a mechanism that can quickly reach
colossal patterns and discover most of them
431
Alas, A Show of Colossal Pattern Mining!
 Transactions:
T1 = 2 3 4 … 39 40;  T2 = 1 3 4 … 39 40;  … ;  T40 = 1 2 3 4 … 39
T41 = T42 = … = T60 = 41 42 43 … 79
 Let the min-support threshold σ = 20
 Then there are C(40, 20) closed/maximal
frequent patterns of size 20
 However, there is only one with size
greater than 20 (i.e., colossal):
α = {41, 42, …, 79} of size 39
 The existing fastest mining algorithms
(e.g., FPClose, LCM) fail to complete
running
 Our algorithm outputs this colossal
pattern in seconds
432
Methodology of Pattern-Fusion Strategy
 Pattern-Fusion traverses the tree in a bounded-breadth way
 Always pushes down a frontier of a bounded-size candidate
pool
 Only a fixed number of patterns in the current candidate pool
will be used as the starting nodes to go down in the pattern tree
― thus avoids the exponential search space
 Pattern-Fusion identifies “shortcuts” whenever possible
 Pattern growth is not performed by single-item addition but by
leaps and bounds: agglomeration of multiple patterns in the
pool
 These shortcuts will direct the search down the tree much more
rapidly towards the colossal patterns
433
Observation: Colossal Patterns and Core Patterns
[Figure: a colossal pattern α in transaction database D, with core subpatterns α1, α2, …, αk whose support sets Dα1, Dα2, …, Dαk cluster around Dα]
Subpatterns α1 to αk cluster tightly around the colossal pattern α by
sharing a similar support. We call such subpatterns core patterns of α
434
Robustness of Colossal Patterns
 Core Patterns
Intuitively, for a frequent pattern α, a subpattern β is a τ-core
pattern of α if β shares a similar support set with α, i.e.,
|Dα| / |Dβ| ≥ τ,  0 < τ ≤ 1,
where τ is called the core ratio
 Robustness of Colossal Patterns
A colossal pattern is robust in the sense that it tends to have many
more core patterns than small patterns
435
Example: Core Patterns
 A colossal pattern has far more core patterns than a small-sized
pattern
 A colossal pattern has far more core descendants of a smaller size c
 A random draw from the complete set of patterns of size c would be more
likely to pick a core descendant of a colossal pattern
 A colossal pattern can be generated by merging a set of core
patterns
Transaction (# of Ts)   Core Patterns (τ = 0.5)
(abe) (100)             (abe), (ab), (be), (ae), (e)
(bcf) (100)             (bcf), (bc), (bf)
(acf) (100)             (acf), (ac), (af)
(abcef) (100)           (ab), (ac), (af), (ae), (bc), (bf), (be), (ce), (fe), (e), (abc), (abf), (abe), (ace), (acf), (afe), (bcf), (bce), (bfe), (cfe), (abcf), (abce), (bcfe), (acfe), (abfe), (abcef)
437
Colossal Patterns Correspond to Dense Balls
 Due to their robustness,
colossal patterns correspond
to dense balls
 Ω( 2^d) in population
 A random draw in the pattern
space will hit somewhere in
the ball with high probability
438
Idea of Pattern-Fusion Algorithm
 Generate a complete set of frequent patterns up to a
small size
 Randomly pick a pattern β, and β has a high
probability to be a core-descendant of some colossal
pattern α
 Identify all α’s descendants in this complete set, and
merge all of them ― This would generate a much
larger core-descendant of α
 In the same fashion, we select K patterns. This set of
larger core-descendants will be the candidate pool for
the next iteration
439
Pattern-Fusion: The Algorithm
 Initialization (Initial pool): Use an existing algorithm to
mine all frequent patterns up to a small size, e.g., 3
 Iteration (Iterative Pattern Fusion):
 At each iteration, k seed patterns are randomly
picked from the current pattern pool
 For each seed pattern thus picked, we find all the
patterns within a bounding ball centered at the
seed pattern
 All these patterns found are fused together to
generate a set of super-patterns. All the super-
patterns thus generated form a new pool for the
next iteration
 Termination: when the current pool contains no more
than K patterns at the beginning of an iteration
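A highly simplified sketch of this loop (not the published algorithm: the bounding-ball test is approximated here by a core-ratio check on support sets, and the toy database and parameters are hypothetical).

# Simplified Pattern-Fusion sketch: fuse pool patterns whose support sets are
# tau-similar to a randomly chosen seed, shrinking the pool toward large patterns.
import random
from itertools import combinations

D = [frozenset({1, 2, 3, 4, 5}), frozenset({1, 2, 3, 4, 5}),
     frozenset({1, 2, 3, 4, 6}), frozenset({7, 8})]
MIN_SUP, TAU, K, SEEDS = 2, 0.5, 3, 2

def support_set(pattern):
    return frozenset(i for i, t in enumerate(D) if pattern <= t)

def initial_pool(max_size=2):
    """Mine all frequent patterns up to a small size with brute force."""
    items = {i for t in D for i in t}
    pool = set()
    for size in range(1, max_size + 1):
        for combo in combinations(sorted(items), size):
            p = frozenset(combo)
            if len(support_set(p)) >= MIN_SUP:
                pool.add(p)
    return pool

pool = initial_pool()
while len(pool) > K:
    seeds = random.sample(sorted(pool, key=sorted), min(SEEDS, len(pool)))
    new_pool = set()
    for seed in seeds:
        ds = support_set(seed)
        # Fuse all pool patterns whose support set is tau-similar to the seed's.
        ball = [p for p in pool if len(ds & support_set(p)) >= TAU * len(ds)]
        fused = frozenset().union(*ball)
        new_pool.add(fused if len(support_set(fused)) >= MIN_SUP else seed)
    pool = new_pool
print([sorted(p) for p in pool])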
440
Why Is Pattern-Fusion Efficient?
 A bounded-breadth pattern
tree traversal
 It avoids explosion in
mining mid-sized ones
 Randomness comes to
help to stay on the right
path
 Ability to identify “short-
cuts” and take “leaps”
 fuse small patterns
together in one step to
generate new patterns of
significant sizes
 Efficiency
441
Pattern-Fusion Leads to Good Approximation
 Gearing toward colossal patterns
 The larger the pattern, the greater the chance it
will be generated
 Catching outliers
 The more distinct the pattern, the greater the
chance it will be generated
442
Experimental Setting
 Synthetic data set
 Diag_n: an n × (n-1) table where the i-th row has the integers from 1 to n
except i. Each row is taken as an itemset. min_support is n/2.
 Real data sets
 Replace: A program trace data set collected from the “replace”
program, widely used in software engineering research
 ALL: A popular gene expression data set, clinical data on ALL-
AML leukemia (www.broad.mit.edu/tools/data.html).
 Each item is a column, representing the activity level of a
gene/protein in the sample
 Frequent patterns would reveal important correlations between
gene expression patterns and disease outcomes
443
Experiment Results on Diagn
 LCM run time increases
exponentially with pattern
size n
 Pattern-Fusion finishes
efficiently
 The approximation error of
Pattern-Fusion (with min-sup 20,
in comparison with the complete
set) is rather close to that of
uniform sampling (which randomly
picks K patterns from the
complete answer set)
444
Experimental Results on ALL
 ALL: A popular gene expression data set with 38
transactions, each with 866 columns
 There are 1736 items in total
 The table shows a high frequency threshold of 30
445
Experimental Results on REPLACE
 REPLACE
 A program trace data set, recording 4395
calls and transitions
 The data set contains 4395 transactions with
57 items in total
 With support threshold of 0.03, the largest
patterns are of size 44
 They are all discovered by Pattern-Fusion
with different settings of K and τ, when
started with an initial pool of 20948 patterns
of size <=3
446
Experimental Results on REPLACE
 Approximation error when
compared with the complete
mining result
 Example. Out of the total 98
patterns of size >=42, when
K=100, Pattern-Fusion returns
80 of them
 A good approximation to the
colossal patterns in the sense
that any pattern in the
complete set is on average at
most 0.17 items away from
one of these 80 patterns
447
Chapter 7 : Advanced Frequent Pattern Mining
 Pattern Mining: A Road Map
 Pattern Mining in Multi-Level, Multi-Dimensional
Space
 Constraint-Based Frequent Pattern Mining
 Mining High-Dimensional Data and Colossal Patterns
 Mining Compressed or Approximate Patterns
 Pattern Exploration and Application
 Summary
448
Mining Compressed Patterns: δ-clustering
 Why compressed patterns?
 too many, but less
meaningful
 Pattern distance measure
 δ-clustering: For each pattern P,
find all patterns which can be
expressed by P and their
distance to P are within δ (δ-
cover)
 All patterns in the cluster can
be represented by P
 Xin et al., “Mining Compressed Frequent-Pattern Sets”, VLDB’05
ID Item-Sets Support
P1 {38,16,18,12} 205227
P2 {38,16,18,12,17} 205211
P3 {39,38,16,18,12,17} 101758
P4 {39,16,18,12,17} 161563
P5 {39,16,18,12} 161576
 Closed frequent pattern
 Report P1, P2, P3, P4, P5
 Emphasize too much on
support
 no compression
 Max-pattern, P3: info loss
 A desirable output: P2, P3,
P4
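A small sketch of the building blocks of δ-clustering, assuming the Jaccard-style pattern distance of Xin et al.: a representative P δ-covers P' when P' can be expressed by P (P' ⊆ P) and their supporting-transaction sets are within distance δ. The transaction-id sets below are hypothetical stand-ins for the supports in the table above.

# Sketch of delta-clustering's building blocks (assuming the Jaccard-style
# distance of Xin et al.): a representative P delta-covers P' when P' is a
# subset of P and their supporting-transaction sets are close.
def pattern_distance(tids1, tids2):
    """1 - |T(P1) ∩ T(P2)| / |T(P1) ∪ T(P2)| over supporting transaction-id sets."""
    return 1.0 - len(tids1 & tids2) / len(tids1 | tids2)

def delta_covers(rep, rep_tids, p, p_tids, delta):
    return p <= rep and pattern_distance(rep_tids, p_tids) <= delta

# Hypothetical supporting-transaction ids for two of the example patterns.
P2, P2_tids = frozenset({38, 16, 18, 12, 17}), set(range(205_211))
P1, P1_tids = frozenset({38, 16, 18, 12}),     set(range(205_227))
print(delta_covers(P2, P2_tids, P1, P1_tids, delta=0.05))   # True: P2 can represent P1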
449
Redundancy-Aware Top-k Patterns
 Why redundancy-aware top-k patterns?
 Desired patterns: high
significance & low
redundancy
 Propose the MMS
(Maximal Marginal
Significance) for
measuring the
combined significance
of a pattern set
 Xin et al., Extracting
Redundancy-Aware
Top-K Patterns,
KDD’06
450
Chapter 7 : Advanced Frequent Pattern Mining
 Pattern Mining: A Road Map
 Pattern Mining in Multi-Level, Multi-Dimensional
Space
 Constraint-Based Frequent Pattern Mining
 Mining High-Dimensional Data and Colossal Patterns
 Mining Compressed or Approximate Patterns
 Pattern Exploration and Application
 Summary
How to Understand and Interpret Patterns?
 Not all frequent patterns are useful, only meaningful ones …
e.g., “diaper beer”, “female sterile (2) tekele”
 Do they all make sense?
 What do they mean?
 How are they useful?
 A Dictionary Analogy — Word: “pattern” (from Merriam-Webster)
 Non-semantic info.: morphological info. and simple statistics
 Semantic information: definitions indicating semantics, examples of usage,
synonyms, related words
 Annotate patterns with semantic information
Semantic Analysis with Context Models
 Task1: Model the context of a frequent pattern
Based on the Context Model…
 Task2: Extract strongest context indicators
 Task3: Extract representative transactions
 Task4: Extract semantically similar patterns
Annotating DBLP Co-authorship & Title Pattern
 Database: paper titles and authors, e.g., “Substructure Similarity Search
in Graph Databases” by X. Yan, P. Yu, J. Han; …
 Frequent patterns: P1: { x_yan, j_han } (frequent itemset); P2: “substructure search”
 Context units: < { p_yu, j_han }, { d_xin }, …, “graph pattern”, …, “substructure similarity”, … >
 Semantic annotations for pattern { x_yan, j_han }: sup = …;
CI: { p_yu }, graph pattern, …; Trans.: gSpan: graph-base…; SSPs: { j_wang }, { j_han, p_yu }, …
 Annotation results for Pattern = { xifeng_yan, jiawei_han }:
Context Indicator (CI): graph; { philip_yu }; mine close; graph pattern; sequential pattern; …
Representative Transactions (Trans): gSpan: graph-base substructure pattern mining;
mining close relational graph connect constraint; …
Semantically Similar Patterns (SSP): { jiawei_han, philip_yu }; { jian_pei, jiawei_han };
{ jiong_yang, philip_yu, wei_wang }; …
455
Chapter 7 : Advanced Frequent Pattern Mining
 Pattern Mining: A Road Map
 Pattern Mining in Multi-Level, Multi-Dimensional
Space
 Constraint-Based Frequent Pattern Mining
 Mining High-Dimensional Data and Colossal Patterns
 Mining Compressed or Approximate Patterns
 Pattern Exploration and Application
 Summary
456
Summary
 Roadmap: Many aspects & extensions on pattern
mining
 Mining patterns in multi-level, multi-dimensional
space
 Mining rare and negative patterns
 Constraint-based pattern mining
 Specialized methods for mining high-dimensional
data and colossal patterns
 Mining compressed or approximate patterns
 Pattern exploration and understanding: semantic
annotation of frequent patterns
457
Ref: Mining Multi-Level and Quantitative Rules
 Y. Aumann and Y. Lindell. A Statistical Theory for Quantitative Association
Rules, KDD'99
 T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using
two-dimensional optimized association rules: Scheme, algorithms, and
visualization. SIGMOD'96.
 J. Han and Y. Fu. Discovery of multiple-level association rules from large
databases. VLDB'95.
 R.J. Miller and Y. Yang. Association rules over interval data. SIGMOD'97.
 R. Srikant and R. Agrawal. Mining generalized association rules. VLDB'95.
 R. Srikant and R. Agrawal. Mining quantitative association rules in large
relational tables. SIGMOD'96.
 K. Wang, Y. He, and J. Han. Mining frequent itemsets using support
constraints. VLDB'00
 K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Computing
optimized rectilinear regions for association rules. KDD'97.
458
Ref: Mining Other Kinds of Rules
 F. Korn, A. Labrinidis, Y. Kotidis, and C. Faloutsos. Ratio rules: A new
paradigm for fast, quantifiable data mining. VLDB'98
 Y. Huhtala, J. Kärkkäinen, P. Porkka, H. Toivonen. Efficient Discovery of
Functional and Approximate Dependencies Using Partitions. ICDE’98.
 H. V. Jagadish, J. Madar, and R. Ng. Semantic Compression and Pattern
Extraction with Fascicles. VLDB'99
 B. Lent, A. Swami, and J. Widom. Clustering association rules. ICDE'97.
 R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining
association rules. VLDB'96.
 A. Savasere, E. Omiecinski, and S. Navathe. Mining for strong negative
associations in a large database of customer transactions. ICDE'98.
 D. Tsur, J. D. Ullman, S. Abitboul, C. Clifton, R. Motwani, and S. Nestorov.
Query flocks: A generalization of association-rule mining. SIGMOD'98.
459
Ref: Constraint-Based Pattern Mining
 R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item
constraints. KDD'97
 R. Ng, L.V.S. Lakshmanan, J. Han & A. Pang. Exploratory mining and pruning
optimizations of constrained association rules. SIGMOD’98
 G. Grahne, L. Lakshmanan, and X. Wang. Efficient mining of constrained
correlated sets. ICDE'00
 J. Pei, J. Han, and L. V. S. Lakshmanan. Mining Frequent Itemsets with
Convertible Constraints. ICDE'01
 J. Pei, J. Han, and W. Wang, Mining Sequential Patterns with Constraints in
Large Databases, CIKM'02
 F. Bonchi, F. Giannotti, A. Mazzanti, and D. Pedreschi. ExAnte: Anticipated
Data Reduction in Constrained Pattern Mining, PKDD'03
 F. Zhu, X. Yan, J. Han, and P. S. Yu, “gPrune: A Constraint Pushing Framework
for Graph Pattern Mining”, PAKDD'07
460
Ref: Mining Sequential Patterns
 X. Ji, J. Bailey, and G. Dong. Mining minimal distinguishing subsequence patterns with
gap constraints. ICDM'05
 H. Mannila, H Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event
sequences. DAMI:97.
 J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining Sequential
Patterns Efficiently by Prefix-Projected Pattern Growth. ICDE'01.
 R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and
performance improvements. EDBT’96.
 X. Yan, J. Han, and R. Afshar. CloSpan: Mining Closed Sequential Patterns in Large
Datasets. SDM'03.
 M. Zaki. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine
Learning:01.
Mining Graph and Structured Patterns
 A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for
mining frequent substructures from graph data. PKDD'00
 M. Kuramochi and G. Karypis. Frequent Subgraph Discovery. ICDM'01.
 X. Yan and J. Han. gSpan: Graph-based substructure pattern mining.
ICDM'02
 X. Yan and J. Han. CloseGraph: Mining Closed Frequent Graph Patterns.
KDD'03
 X. Yan, P. S. Yu, and J. Han. Graph indexing based on discriminative frequent
structure analysis. ACM TODS, 30:960–993, 2005
 X. Yan, F. Zhu, P. S. Yu, and J. Han. Feature-based substructure similarity
search. ACM Trans. Database Systems, 31:1418–1453, 2006
461
462
Ref: Mining Spatial, Spatiotemporal, Multimedia Data
 H. Cao, N. Mamoulis, and D. W. Cheung. Mining frequent spatiotemporal
sequential patterns. ICDM'05
 D. Gunopulos and I. Tsoukatos. Efficient Mining of Spatiotemporal Patterns.
SSTD'01
 K. Koperski and J. Han, Discovery of Spatial Association Rules in Geographic
Information Databases, SSD’95
 H. Xiong, S. Shekhar, Y. Huang, V. Kumar, X. Ma, and J. S. Yoo. A framework
for discovering co-location patterns in data sets with extended spatial
objects. SDM'04
 J. Yuan, Y. Wu, and M. Yang. Discovery of collocation patterns: From visual
words to visual phrases. CVPR'07
 O. R. Zaiane, J. Han, and H. Zhu, Mining Recurrent Items in Multimedia with
Progressive Resolution Refinement. ICDE'00
463
Ref: Mining Frequent Patterns in Time-Series Data
 B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98.
 J. Han, G. Dong and Y. Yin, Efficient Mining of Partial Periodic Patterns in Time Series
Database, ICDE'99.
 J. Shieh and E. Keogh. iSAX: Indexing and mining terabyte sized time series. KDD'08
 B.-K. Yi, N. Sidiropoulos, T. Johnson, H. V. Jagadish, C. Faloutsos, and A. Biliris. Online
Data Mining for Co-Evolving Time Sequences. ICDE'00.
 W. Wang, J. Yang, R. Muntz. TAR: Temporal Association Rules on Evolving Numerical
Attributes. ICDE’01.
 J. Yang, W. Wang, P. S. Yu. Mining Asynchronous Periodic Patterns in Time Series Data.
TKDE’03
 L. Ye and E. Keogh. Time series shapelets: A new primitive for data mining. KDD'09
464
Ref: FP for Classification and Clustering
 G. Dong and J. Li. Efficient mining of emerging patterns: Discovering
trends and differences. KDD'99.
 B. Liu, W. Hsu, Y. Ma. Integrating Classification and Association Rule
Mining. KDD’98.
 W. Li, J. Han, and J. Pei. CMAR: Accurate and Efficient Classification Based
on Multiple Class-Association Rules. ICDM'01.
 H. Wang, W. Wang, J. Yang, and P.S. Yu. Clustering by pattern similarity in
large data sets. SIGMOD’ 02.
 J. Yang and W. Wang. CLUSEQ: efficient and effective sequence clustering.
ICDE’03.
 X. Yin and J. Han. CPAR: Classification based on Predictive Association
Rules. SDM'03.
 H. Cheng, X. Yan, J. Han, and C.-W. Hsu, Discriminative Frequent Pattern
Analysis for Effective Classification”, ICDE'07
465
Ref: Privacy-Preserving FP Mining
 A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Privacy Preserving Mining
of Association Rules. KDD’02.
 A. Evfimievski, J. Gehrke, and R. Srikant. Limiting Privacy Breaches in
Privacy Preserving Data Mining. PODS’03
 J. Vaidya and C. Clifton. Privacy Preserving Association Rule Mining in
Vertically Partitioned Data. KDD’02
Mining Compressed Patterns
 D. Xin, H. Cheng, X. Yan, and J. Han. Extracting redundancy-
aware top-k patterns. KDD'06
 D. Xin, J. Han, X. Yan, and H. Cheng. Mining compressed
frequent-pattern sets. VLDB'05
 X. Yan, H. Cheng, J. Han, and D. Xin. Summarizing itemset
patterns: A profile-based approach. KDD'05
466
Mining Colossal Patterns
 F. Zhu, X. Yan, J. Han, P. S. Yu, and H. Cheng. Mining colossal
frequent patterns by core pattern fusion. ICDE'07
 F. Zhu, Q. Qu, D. Lo, X. Yan, J. Han. P. S. Yu, Mining Top-K Large
Structural Patterns in a Massive Network. VLDB’11
467
468
Ref: FP Mining from Data Streams
 Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-Dimensional
Regression Analysis of Time-Series Data Streams. VLDB'02.
 R. M. Karp, C. H. Papadimitriou, and S. Shenker. A simple algorithm for
finding frequent elements in streams and bags. TODS 2003.
 G. Manku and R. Motwani. Approximate Frequency Counts over Data
Streams. VLDB’02.
 A. Metwally, D. Agrawal, and A. El Abbadi. Efficient computation of frequent
and top-k elements in data streams. ICDT'05
469
Ref: Freq. Pattern Mining Applications
 T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining Database Structure; or
How to Build a Data Quality Browser. SIGMOD'02
 M. Khan, H. Le, H. Ahmadi, T. Abdelzaher, and J. Han. DustMiner: Troubleshooting
interactive complexity bugs in sensor networks., SenSys'08
 Z. Li, S. Lu, S. Myagmar, and Y. Zhou. CP-Miner: A tool for finding copy-paste and related
bugs in operating system code. In Proc. 2004 Symp. Operating Systems Design and
Implementation (OSDI'04)
 Z. Li and Y. Zhou. PR-Miner: Automatically extracting implicit programming rules and
detecting violations in large software code. FSE'05
 D. Lo, H. Cheng, J. Han, S. Khoo, and C. Sun. Classification of software behaviors for failure
detection: A discriminative pattern mining approach. KDD'09
 Q. Mei, D. Xin, H. Cheng, J. Han, and C. Zhai. Semantic annotation of frequent patterns.
ACM TKDD, 2007.
 K. Wang, S. Zhou, J. Han. Profit Mining: From Patterns to Actions. EDBT’02.
470
Data Mining:
Concepts and Techniques
(3rd ed.)
— Chapter 8 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
472
Chapter 8. Classification: Basic Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Summary
473
Supervised vs. Unsupervised Learning
 Supervised learning (classification)
 Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations
 New data is classified based on the training set
 Unsupervised learning (clustering)
 The class labels of training data are unknown
 Given a set of measurements, observations, etc. with the
aim of establishing the existence of classes or clusters in
the data
474
 Classification
 predicts categorical class labels (discrete or nominal)
 classifies data (constructs a model) based on the training
set and the values (class labels) in a classifying attribute
and uses it in classifying new data
 Numeric Prediction
 models continuous-valued functions, i.e., predicts unknown
or missing values
 Typical applications
 Credit/loan approval:
 Medical diagnosis: if a tumor is cancerous or benign
 Fraud detection: if a transaction is fraudulent
 Web page categorization: which category it is
Prediction Problems: Classification vs.
Numeric Prediction
475
Classification—A Two-Step Process
 Model construction: describing a set of predetermined classes
 Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
 The set of tuples used for model construction is training set
 The model is represented as classification rules, decision trees, or
mathematical formulae
 Model usage: for classifying future or unknown objects
 Estimate accuracy of the model

The known label of test sample is compared with the classified
result from the model

Accuracy rate is the percentage of test set samples that are
correctly classified by the model

Test set is independent of training set (otherwise overfitting)
 If the accuracy is acceptable, use the model to classify new data
 Note: If the test set is used to select models, it is called validation (test) set
476
Process (1): Model Construction
Training
Data
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
Classification
Algorithms
IF rank = ‘professor’
OR years > 6
THEN tenured = ‘yes’
Classifier
(Model)
477
Process (2): Using the Model in Prediction
Classifier
Testing
Data
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
Unseen Data
(Jeff, Professor, 4)
Tenured?
478
Chapter 8. Classification: Basic Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Summary
479
Decision Tree Induction: An Example
age?
  <=30   → student?        no → no;   yes → yes
  31..40 → yes
  >40    → credit rating?  excellent → no;   fair → yes
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
 Training data set: Buys_computer
 The data set follows an example of
Quinlan’s ID3 (Playing Tennis)
 Resulting tree:
480
Algorithm for Decision Tree Induction
 Basic algorithm (a greedy algorithm)
 Tree is constructed in a top-down recursive divide-and-conquer
manner
 At start, all the training examples are at the root
 Attributes are categorical (if continuous-valued, they are
discretized in advance)
 Examples are partitioned recursively based on selected
attributes
 Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
 Conditions for stopping partitioning
 All samples for a given node belong to the same class
 There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
 There are no samples left
Brief Review of Entropy
[Figure: entropy of a two-class (m = 2) distribution as a function of the class probability]
481
482
Attribute Selection Measure:
Information Gain (ID3/C4.5)
 Select the attribute with the highest information gain
 Let pi be the probability that an arbitrary tuple in D belongs to
class Ci, estimated by |Ci, D|/|D|
 Expected information (entropy) needed to classify a tuple in D:
Info(D) = - Σ_{i=1..m} p_i log2(p_i)
 Information needed (after using A to split D into v partitions) to
classify D:
Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × Info(D_j)
 Information gained by branching on attribute A:
Gain(A) = Info(D) - Info_A(D)
483
Attribute Selection: Information Gain
 Class P: buys_computer = “yes” (9 tuples);  Class N: buys_computer = “no” (5 tuples)
 Info(D) = I(9, 5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
 Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
(5/14) I(2,3) means “age <= 30” has 5 out of 14 samples, with 2 yes’es and 3 no’s
 Gain(age) = Info(D) - Info_age(D) = 0.246
 Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048

age     pi  ni  I(pi, ni)
<=30    2   3   0.971
31…40   4   0   0
>40     3   2   0.971

age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
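A short Python sketch that reproduces the numbers above from the buys_computer data (only the age attribute is encoded here).

# Sketch reproducing the slide's information-gain numbers for the buys_computer data.
from collections import Counter
from math import log2

def info(labels):
    """Entropy Info(D) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, attr, labels):
    """Gain(A) = Info(D) - Info_A(D) for a categorical attribute."""
    n = len(labels)
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(label)
    info_a = sum(len(part) / n * info(part) for part in by_value.values())
    return info(labels) - info_a

rows = [{"age": a} for a in ["<=30"] * 5 + ["31..40"] * 4 + [">40"] * 5]
labels = ["no", "no", "no", "yes", "yes",      # age <=30: 2 yes, 3 no
          "yes", "yes", "yes", "yes",          # age 31..40: 4 yes, 0 no
          "yes", "yes", "yes", "no", "no"]     # age >40: 3 yes, 2 no
print(round(info(labels), 3))            # 0.94  (Info(D) = 0.940)
print(round(gain(rows, "age", labels), 3))
# 0.247 (the slide reports 0.246 after rounding Info_age to 0.694)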
484
Computing Information-Gain for
Continuous-Valued Attributes
 Let attribute A be a continuous-valued attribute
 Must determine the best split point for A
 Sort the value A in increasing order
 Typically, the midpoint between each pair of adjacent values
is considered as a possible split point
 (ai+ai+1)/2 is the midpoint between the values of ai and ai+1
 The point with the minimum expected information
requirement for A is selected as the split-point for A
 Split:
 D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is
the set of tuples in D satisfying A > split-point
485
Gain Ratio for Attribute Selection (C4.5)
 Information gain measure is biased towards attributes with a
large number of values
 C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
 GainRatio(A) = Gain(A)/SplitInfo(A)
 Ex.
 gain_ratio(income) = 0.029/1.557 = 0.019
 The attribute with the maximum gain ratio is selected as the
splitting attribute
SplitInfo_A(D) = - Σ_{j=1..v} (|D_j| / |D|) × log2(|D_j| / |D|)
486
Gini Index (CART, IBM IntelligentMiner)
 If a data set D contains examples from n classes, the gini index gini(D) is
defined as
gini(D) = 1 - Σ_{j=1..n} p_j^2,
where p_j is the relative frequency of class j in D
 If a data set D is split on A into two subsets D1 and D2, the gini
index gini_A(D) is defined as
gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)
 Reduction in impurity:
Δgini(A) = gini(D) - gini_A(D)
 The attribute providing the smallest gini_split(D) (or the largest
reduction in impurity) is chosen to split the node (need to
enumerate all the possible splitting points for each attribute)
487
Computation of Gini Index
 Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”:
gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459
 Suppose the attribute income partitions D into 10 in D1: {low,
medium} and 4 in D2: {high}:
gini_income ∈ {low,medium}(D) = (10/14) Gini(D1) + (4/14) Gini(D2)
 Gini_{low,high} is 0.458; Gini_{medium,high} is 0.450. Thus, split on
{low, medium} (and {high}) since it has the lowest Gini index
 All attributes are assumed continuous-valued
 May need other tools, e.g., clustering, to get the possible split
values
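A short sketch reproducing the Gini computations above; the per-class counts for each income partition (7 yes / 3 no for {low, medium}, 2 yes / 2 no for {high}) are derived from the same 14-tuple training data.

# Sketch reproducing the slide's Gini computations for the buys_computer data.
def gini(counts):
    """gini(D) = 1 - sum(p_j^2) from per-class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(partitions):
    """Weighted Gini of a split; each partition is a per-class count list."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

print(round(gini([9, 5]), 3))                    # 0.459 for the full data set
# income in {low, medium}: 10 tuples (7 yes, 3 no) vs. {high}: 4 tuples (2 yes, 2 no)
print(round(gini_split([[7, 3], [2, 2]]), 3))    # 0.443, the lowest of the splits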
488
Comparing Attribute Selection Measures
 The three measures, in general, return good results but
 Information gain:

biased towards multivalued attributes
 Gain ratio:

tends to prefer unbalanced splits in which one partition is
much smaller than the others
 Gini index:

biased to multivalued attributes

has difficulty when # of classes is large

tends to favor tests that result in equal-sized partitions
and purity in both partitions
489
Other Attribute Selection Measures
 CHAID: a popular decision tree algorithm, measure based on χ2
test for
independence
 C-SEP: performs better than info. gain and gini index in certain cases
 G-statistic: has a close approximation to χ2
distribution
 MDL (Minimal Description Length) principle (i.e., the simplest solution is
preferred):
 The best tree as the one that requires the fewest # of bits to both (1)
encode the tree, and (2) encode the exceptions to the tree
 Multivariate splits (partition based on multiple variable combinations)
 CART: finds multivariate splits based on a linear comb. of attrs.
 Which attribute selection measure is the best?
 Most give good results; none is significantly superior to the others
490
Overfitting and Tree Pruning
 Overfitting: An induced tree may overfit the training data
 Too many branches, some may reflect anomalies due to
noise or outliers
 Poor accuracy for unseen samples
 Two approaches to avoid overfitting
 Prepruning: Halt tree construction early: do not split a node
if this would result in the goodness measure falling below a
threshold

Difficult to choose an appropriate threshold
 Postpruning: Remove branches from a “fully grown” tree—
get a sequence of progressively pruned trees

Use a set of data different from the training data to
decide which is the “best pruned tree”
491
Enhancements to Basic Decision Tree Induction
 Allow for continuous-valued attributes
 Dynamically define new discrete-valued attributes that
partition the continuous attribute value into a discrete set of
intervals
 Handle missing attribute values
 Assign the most common value of the attribute
 Assign probability to each of the possible values
 Attribute construction
 Create new attributes based on existing ones that are
sparsely represented
 This reduces fragmentation, repetition, and replication
492
Classification in Large Databases
 Classification—a classical problem extensively studied by
statisticians and machine learning researchers
 Scalability: Classifying data sets with millions of examples and
hundreds of attributes with reasonable speed
 Why is decision tree induction popular?

relatively faster learning speed (than other classification
methods)

convertible to simple and easy to understand classification
rules

can use SQL queries for accessing databases

comparable classification accuracy with other methods
 RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)

Builds an AVC-list (attribute, value, class label)
493
Scalability Framework for RainForest
 Separates the scalability aspects from the criteria that
determine the quality of the tree
 Builds an AVC-list: AVC (Attribute, Value, Class_label)
 AVC-set (of an attribute X )
 Projection of training dataset onto the attribute X and
class label where counts of individual class label are
aggregated
 AVC-group (of a node n )
 Set of AVC-sets of all predictor attributes at the node n
494
Rainforest: Training Set and Its AVC Sets
student Buy_Computer
yes no
yes 6 1
no 3 4
Age Buy_Computer
yes no
<=30 2 3
31..40 4 0
>40 3 2
Credit
rating
Buy_Computer
yes no
fair 6 2
excellent 3 3
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
AVC-set on income
AVC-set on Age
AVC-set on Student
Training Examples
income Buy_Computer
yes no
high 2 2
medium 4 2
low 3 1
AVC-set on
credit_rating
495
BOAT (Bootstrapped Optimistic
Algorithm for Tree Construction)
 Use a statistical technique called bootstrapping to create
several smaller samples (subsets), each fits in memory
 Each subset is used to create a tree, resulting in several
trees
 These trees are examined and used to construct a new
tree T’
 It turns out that T’ is very close to the tree that would
be generated using the whole data set together
 Adv: requires only two scans of DB, an incremental alg.
496
Presentation of Classification Results
497
Visualization of a Decision Tree in SGI/MineSet 3.0
498
Interactive Visual Mining by Perception-
Based Classification (PBC)
499
Chapter 8. Classification: Basic Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Summary
500
Bayesian Classification: Why?
 A statistical classifier: performs probabilistic prediction, i.e.,
predicts class membership probabilities
 Foundation: Based on Bayes’ Theorem.
 Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree and
selected neural network classifiers
 Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct —
prior knowledge can be combined with observed data
 Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured
501
Bayes’ Theorem: Basics
 Total probability theorem:  P(B) = Σ_{i=1..M} P(B | A_i) P(A_i)
 Bayes’ theorem:  P(H | X) = P(X | H) P(H) / P(X)
 Let X be a data sample (“evidence”): class label is unknown
 Let H be a hypothesis that X belongs to class C
 Classification is to determine P(H|X) (i.e., the posteriori probability): the
probability that the hypothesis holds given the observed data sample X
 P(H) (prior probability): the initial probability
 E.g., X will buy computer, regardless of age, income, …
 P(X): probability that sample data is observed
 P(X|H) (likelihood): the probability of observing the sample X, given that
the hypothesis holds
 E.g., given that X will buy computer, the prob. that X is 31..40,
medium income
502
Prediction Based on Bayes’ Theorem
 Given training data X, the posteriori probability of a hypothesis H,
P(H|X), follows Bayes’ theorem:
P(H | X) = P(X | H) P(H) / P(X)
 Informally, this can be viewed as
posteriori = likelihood × prior / evidence
 Predicts X belongs to Ci iff the probability P(Ci|X) is the highest
among all the P(Ck|X) for all the k classes
 Practical difficulty: It requires initial knowledge of many
probabilities, involving significant computational cost
503
Classification Is to Derive the Maximum Posteriori
 Let D be a training set of tuples and their associated class labels,
and each tuple is represented by an n-D attribute vector X = (x1,
x2, …, xn)
 Suppose there are m classes C1, C2, …, Cm.
 Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
 This can be derived from Bayes’ theorem:
P(Ci | X) = P(X | Ci) P(Ci) / P(X)
 Since P(X) is constant for all classes, only P(X | Ci) P(Ci)
needs to be maximized
504
Naïve Bayes Classifier
 A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between attributes):
P(X | Ci) = Π_{k=1..n} P(x_k | Ci) = P(x_1 | Ci) × P(x_2 | Ci) × … × P(x_n | Ci)
 This greatly reduces the computation cost: only counts the
class distribution
 If A_k is categorical, P(x_k|Ci) is the # of tuples in Ci having value x_k
for A_k divided by |Ci, D| (# of tuples of Ci in D)
 If A_k is continuous-valued, P(x_k|Ci) is usually computed based on a
Gaussian distribution with mean μ and standard deviation σ:
g(x, μ, σ) = (1 / (√(2π) σ)) exp( -(x - μ)^2 / (2σ^2) )
and P(x_k | Ci) = g(x_k, μ_Ci, σ_Ci)
505
Naïve Bayes Classifier: Training Dataset
Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’
Data to be classified:
X = (age <= 30, Income = medium, Student = yes, Credit_rating = Fair)
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
506
Naïve Bayes Classifier: An Example
 P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
 Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
 X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
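A compact sketch that reproduces this computation (categorical attributes only, no smoothing); the conditional probabilities are the ones listed above.

# Sketch of the naive Bayes computation on the slide, reproducing 0.028 vs. 0.007.
def naive_bayes_score(x, prior, cond_prob):
    """P(X|Ci) * P(Ci) under the conditional-independence assumption."""
    score = prior
    for attr, value in x.items():
        score *= cond_prob[(attr, value)]
    return score

x = {"age": "<=30", "income": "medium", "student": "yes", "credit_rating": "fair"}
yes = naive_bayes_score(x, prior=9/14, cond_prob={
    ("age", "<=30"): 2/9, ("income", "medium"): 4/9,
    ("student", "yes"): 6/9, ("credit_rating", "fair"): 6/9})
no = naive_bayes_score(x, prior=5/14, cond_prob={
    ("age", "<=30"): 3/5, ("income", "medium"): 2/5,
    ("student", "yes"): 1/5, ("credit_rating", "fair"): 2/5})
print(round(yes, 3), round(no, 3))    # 0.028 0.007 -> predict buys_computer = yes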
507
Avoiding the Zero-Probability Problem
 Naïve Bayesian prediction requires each conditional prob. to be
non-zero. Otherwise, the predicted prob. will be zero:
P(X | Ci) = Π_{k=1..n} P(x_k | Ci)
 Ex. Suppose a dataset with 1000 tuples, income = low (0),
income = medium (990), and income = high (10)
 Use Laplacian correction (or Laplacian estimator)
 Adding 1 to each case:
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
 The “corrected” prob. estimates are close to their
“uncorrected” counterparts
508
Naïve Bayes Classifier: Comments
 Advantages
 Easy to implement
 Good results obtained in most of the cases
 Disadvantages
 Assumption: class conditional independence, therefore loss of
accuracy
 Practically, dependencies exist among variables

E.g., in hospitals, patients have a profile (age, family history, etc.),
symptoms (fever, cough, etc.), and diseases (lung cancer,
diabetes, etc.)

Dependencies among these cannot be modeled by Naïve
Bayes Classifier
 How to deal with these dependencies? Bayesian Belief Networks
(Chapter 9)
509
Chapter 8. Classification: Basic Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Summary
510
Using IF-THEN Rules for Classification
 Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
 Rule antecedent/precondition vs. rule consequent
 Assessment of a rule: coverage and accuracy
 ncovers = # of tuples covered by R
 ncorrect = # of tuples correctly classified by R
coverage(R) = ncovers /|D| /* D: training data set */
accuracy(R) = ncorrect / ncovers
 If more than one rule is triggered, we need conflict resolution
 Size ordering: assign the highest priority to the triggering rule that has
the “toughest” requirement (i.e., with the most attribute tests)
 Class-based ordering: decreasing order of prevalence or misclassification
cost per class
 Rule-based ordering (decision list): rules are organized into one long
priority list, according to some measure of rule quality or by experts
511
age?
  <=30   → student?        no → no;   yes → yes
  31..40 → yes
  >40    → credit rating?  excellent → no;   fair → yes
 Example: Rule extraction from our buys_computer decision-tree
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes
Rule Extraction from a Decision Tree
 Rules are easier to understand than large
trees
 One rule is created for each path from the
root to a leaf
 Each attribute-value pair along a path forms a
conjunction: the leaf holds the class
prediction
 Rules are mutually exclusive and exhaustive
512
Rule Induction: Sequential Covering Method
 Sequential covering algorithm: Extracts rules directly from training
data
 Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
 Rules are learned sequentially, each for a given class Ci will cover
many tuples of Ci but none (or few) of the tuples of other classes
 Steps:
 Rules are learned one at a time
 Each time a rule is learned, the tuples covered by the rules are
removed
 Repeat the process on the remaining tuples until termination
condition, e.g., when no more training examples or when the
quality of a rule returned is below a user-specified threshold
 Comparison with decision-tree induction: decision trees learn a set of
rules simultaneously
513
Sequential Covering Algorithm
while (enough target tuples left)
generate a rule
remove positive target tuples satisfying this rule
[Figure: positive examples progressively covered by Rule 1, Rule 2, and Rule 3]
514
Rule Generation
 To generate a rule
while(true)
find the best predicate p
if foil-gain(p) > threshold then add p to current rule
else break
[Figure: positive and negative examples; the rule is specialized step by step: A3=1, then A3=1 && A1=2, then A3=1 && A1=2 && A8=5]
515
How to Learn-One-Rule?
 Start with the most general rule possible: condition = empty
 Adding new attributes by adopting a greedy depth-first strategy
 Picks the one that most improves the rule quality
 Rule-Quality measures: consider both coverage and accuracy
 Foil-gain (in FOIL & RIPPER): assesses info_gain by extending
condition

favors rules that have high accuracy and cover many positive tuples
 Rule pruning based on an independent set of test tuples
Pos/neg are # of positive/negative tuples covered by R.
If FOIL_Prune is higher for the pruned version of R, prune R
FOIL_Gain = pos' × ( log2( pos' / (pos' + neg') ) − log2( pos / (pos + neg) ) )
FOIL_Prune(R) = ( pos − neg ) / ( pos + neg )
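A small Python sketch of the two measures above (function names are ours, not taken from any FOIL/RIPPER implementation):

from math import log2

def foil_gain(pos, neg, pos_prime, neg_prime):
    # Gain from extending a rule: pos/neg are positives/negatives covered
    # before adding the new test, pos'/neg' after. Favors rules that are
    # accurate and still cover many positive tuples.
    return pos_prime * (log2(pos_prime / (pos_prime + neg_prime))
                        - log2(pos / (pos + neg)))

def foil_prune(pos, neg):
    # Pruning measure, evaluated on an independent prune set.
    return (pos - neg) / (pos + neg)

# e.g., a candidate test that narrows coverage from (100 pos, 80 neg) to (60 pos, 10 neg)
print(foil_gain(100, 80, 60, 10))   # positive gain -> the extra test helps
print(foil_prune(60, 10))           # 50/70 ≈ 0.714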
516
Chapter 8. Classification: Basic Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Summary
Model Evaluation and Selection
 Evaluation metrics: How can we measure accuracy? Other
metrics to consider?
 Use validation test set of class-labeled tuples instead of training
set when assessing accuracy
 Methods for estimating a classifier’s accuracy:
 Holdout method, random subsampling
 Cross-validation
 Bootstrap
 Comparing classifiers:
 Confidence intervals
 Cost-benefit analysis and ROC Curves
517
Classifier Evaluation Metrics: Confusion
Matrix
Confusion Matrix:

   Actual class \ Predicted class |  C1                    |  ¬C1
   C1                             |  True Positives (TP)   |  False Negatives (FN)
   ¬C1                            |  False Positives (FP)  |  True Negatives (TN)

 Given m classes, an entry CM_i,j in a confusion matrix indicates the # of tuples in class i that were labeled by the classifier as class j
 May have extra rows/columns to provide totals

Example of Confusion Matrix:

   Actual class \ Predicted class |  buy_computer = yes  |  buy_computer = no  |  Total
   buy_computer = yes             |  6954                |  46                 |  7000
   buy_computer = no              |  412                 |  2588               |  3000
   Total                          |  7366                |  2634               |  10000
518
Classifier Evaluation Metrics: Accuracy,
Error Rate, Sensitivity and Specificity
 Classifier Accuracy, or
recognition rate: percentage of
test set tuples that are correctly
classified
Accuracy = (TP + TN)/All
 Error rate: 1 – accuracy, or
Error rate = (FP + FN)/All
 Class Imbalance Problem:
 One class may be rare, e.g.
fraud, or HIV-positive
 Significant majority of the
negative class and minority of
the positive class
 Sensitivity: True Positive
recognition rate

Sensitivity = TP/P
 Specificity: True Negative
recognition rate

Specificity = TN/N
   Actual \ Predicted |  C   |  ¬C  |  Total
   C                  |  TP  |  FN  |  P
   ¬C                 |  FP  |  TN  |  N
   Total              |  P’  |  N’  |  All
519
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
 Precision: exactness – what % of tuples that the classifier
labeled as positive are actually positive
 Recall: completeness – what % of positive tuples did the
classifier label as positive?
 Perfect score is 1.0
 Inverse relationship between precision & recall

F measure (F1 or F-score): harmonic mean of precision and recall (see the formulas below)

Fβ: weighted measure of precision and recall

assigns β times as much weight to recall as to precision
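For reference, the standard formulas behind these bullets (P = precision, R = recall; the slide's equation images did not survive extraction):

   precision = TP / (TP + FP)
   recall    = TP / (TP + FN)
   F1 = 2 · P · R / (P + R)
   Fβ = (1 + β²) · P · R / (β² · P + R)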
520
Classifier Evaluation Metrics: Example
521
 Precision = 90/230 = 39.13% Recall = 90/300 = 30.00%
   Actual class \ Predicted class |  cancer = yes  |  cancer = no  |  Total  |  Recognition (%)
   cancer = yes                   |  90            |  210          |  300    |  30.00 (sensitivity)
   cancer = no                    |  140           |  9560         |  9700   |  98.56 (specificity)
   Total                          |  230           |  9770         |  10000  |  96.40 (accuracy)
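A quick check of the numbers above, assuming scikit-learn is available; we rebuild the 10,000 labels/predictions directly from the confusion-matrix counts:

from sklearn.metrics import precision_score, recall_score, f1_score

# Rebuild labels and predictions from the counts in the table above
y_true = ["yes"] * 300 + ["no"] * 9700
y_pred = (["yes"] * 90 + ["no"] * 210       # actual yes: 90 TP, 210 FN
          + ["yes"] * 140 + ["no"] * 9560)  # actual no: 140 FP, 9560 TN

print(precision_score(y_true, y_pred, pos_label="yes"))  # 0.3913  (90/230)
print(recall_score(y_true, y_pred, pos_label="yes"))     # 0.30    (90/300)
print(f1_score(y_true, y_pred, pos_label="yes"))         # ≈ 0.34  (harmonic mean)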
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
 Holdout method

Given data is randomly partitioned into two independent sets

Training set (e.g., 2/3) for model construction

Test set (e.g., 1/3) for accuracy estimation

Random subsampling: a variation of holdout

Repeat holdout k times, accuracy = avg. of the accuracies
obtained
 Cross-validation (k-fold, where k = 10 is most popular)

Randomly partition the data into k mutually exclusive subsets,
each approximately equal size

At i-th iteration, use Di as test set and others as training set

Leave-one-out: k folds where k = # of tuples, for small sized
data
 *Stratified cross-validation*: folds are stratified so that class
dist. in each fold is approx. the same as that in the initial data
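A minimal scikit-learn sketch of 10-fold stratified cross-validation (scikit-learn assumed available; the data set and classifier are only illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# 10 mutually exclusive folds; stratification keeps each fold's class
# distribution close to that of the full data set
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)   # accuracy on each held-out fold
print(scores.mean(), scores.std())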
522
Evaluating Classifier Accuracy: Bootstrap
 Bootstrap
 Works well with small data sets
 Samples the given training tuples uniformly with replacement

i.e., each time a tuple is selected, it is equally likely to be selected
again and re-added to the training set
 Several bootstrap methods, and a common one is the .632 bootstrap
 A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data end up in the bootstrap sample, and the remaining 36.8% form the test set (since (1 − 1/d)^d ≈ e^(−1) = 0.368)
 Repeat the sampling procedure k times; the overall accuracy of the model is Acc(M) = Σ_{i=1}^{k} ( 0.632 × Acc(Mi)_test_set + 0.368 × Acc(Mi)_train_set )
523
Estimating Confidence Intervals:
Classifier Models M1 vs. M2
 Suppose we have 2 classifiers, M1 and M2, which one is better?
 Use 10-fold cross-validation to obtain the mean error rates of M1 and M2
 These mean error rates are just estimates of error on the true
population of future data cases
 What if the difference between the 2 error rates is just
attributed to chance?
 Use a test of statistical significance
 Obtain confidence limits for our error estimates
524
Estimating Confidence Intervals:
Null Hypothesis
 Perform 10-fold cross-validation
 Assume samples follow a t distribution with k–1 degrees of
freedom (here, k=10)
 Use t-test (or Student’s t-test)
 Null Hypothesis: M1 & M2 are the same
 If we can reject null hypothesis, then
 we conclude that the difference between M1 & M2 is
statistically significant
 Choose the model with the lower error rate
525
Estimating Confidence Intervals: t-test
 If only 1 test set available: pairwise comparison
 For the i-th round of 10-fold cross-validation, the same cross partitioning is used to obtain err(M1)_i and err(M2)_i
 Average over 10 rounds to get the mean error rates, mean(err(M1)) and mean(err(M2))
 t-test computes the t-statistic with k−1 degrees of freedom:
t = ( mean(err(M1)) − mean(err(M2)) ) / sqrt( var(M1 − M2) / k ),
where var(M1 − M2) = (1/k) Σ_{i=1}^{k} [ err(M1)_i − err(M2)_i − ( mean(err(M1)) − mean(err(M2)) ) ]²
 If two test sets available: use the non-paired t-test
t = ( mean(err(M1)) − mean(err(M2)) ) / sqrt( var(M1)/k1 + var(M2)/k2 ),
where k1 & k2 are # of cross-validation samples used for M1 & M2, resp.
526
Estimating Confidence Intervals:
Table for t-distribution
 Symmetric
 Significance level,
e.g., sig = 0.05 or
5% means M1 & M2
are significantly
different for 95% of
population
 Confidence limit, z
= sig/2
527
Estimating Confidence Intervals:
Statistical Significance
 Are M1 & M2 significantly different?
 Compute t. Select significance level (e.g. sig = 5%)
 Consult table for t-distribution: Find t value corresponding to
k-1 degrees of freedom (here, 9)
 t-distribution is symmetric: typically upper % points of
distribution shown → look up value for confidence limit
z=sig/2 (here, 0.025)
 If t > z or t < -z, then t value lies in rejection region:
 Reject null hypothesis that mean error rates of M1 & M2
are same
 Conclude: statistically significant difference between M1
& M2
 Otherwise, conclude that any difference is chance
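A sketch of the paired comparison described above using SciPy (assumed available); the per-fold error rates are made-up numbers for illustration:

from scipy import stats
import numpy as np

# Hypothetical error rates of M1 and M2 on the same 10 cross-validation folds
err_m1 = np.array([0.12, 0.15, 0.11, 0.14, 0.13, 0.16, 0.12, 0.15, 0.14, 0.13])
err_m2 = np.array([0.10, 0.14, 0.12, 0.11, 0.12, 0.13, 0.11, 0.12, 0.13, 0.11])

t_stat, p_value = stats.ttest_rel(err_m1, err_m2)   # paired t-test, k-1 = 9 d.f.
print(t_stat, p_value)

# Reject the null hypothesis (M1 and M2 are the same) at sig = 0.05 if p < 0.05
if p_value < 0.05:
    print("statistically significant difference between M1 and M2")
else:
    print("any difference may be due to chance")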
528
Model Selection: ROC Curves
 ROC (Receiver Operating
Characteristics) curves: for visual
comparison of classification models
 Originated from signal detection theory
 Shows the trade-off between the true
positive rate and the false positive rate
 The area under the ROC curve is a
measure of the accuracy of the model
 Rank the test tuples in decreasing
order: the one that is most likely to
belong to the positive class appears at
the top of the list
 The closer to the diagonal line (i.e., the
closer the area is to 0.5), the less
accurate is the model
 Vertical axis
represents the true
positive rate
 Horizontal axis rep.
the false positive rate
 The plot also shows a
diagonal line
 A model with perfect
accuracy will have an
area of 1.0
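A minimal sketch of computing an ROC curve with scikit-learn (assumed available); the classifier only needs to rank the test tuples, e.g., via predicted probabilities:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]          # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_te, scores)  # false/true positive rates per threshold
print("area under the ROC curve:", auc(fpr, tpr))  # closer to 1.0 = better; 0.5 = diagonal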
529
Issues Affecting Model Selection
 Accuracy
 classifier accuracy: predicting class label
 Speed
 time to construct the model (training time)
 time to use the model (classification/prediction time)
 Robustness: handling noise and missing values
 Scalability: efficiency in disk-resident databases
 Interpretability
 understanding and insight provided by the model
 Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
530
531
Chapter 8. Classification: Basic Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Summary
Ensemble Methods: Increasing the Accuracy
 Ensemble methods
 Use a combination of models to increase accuracy
 Combine a series of k learned models, M1, M2, …, Mk, with
the aim of creating an improved model M*
 Popular ensemble methods
 Bagging: averaging the prediction over a collection of
classifiers
 Boosting: weighted vote with a collection of classifiers
 Ensemble: combining a set of heterogeneous classifiers
532
Bagging: Bootstrap Aggregation
 Analogy: Diagnosis based on multiple doctors’ majority vote
 Training
 Given a set D of d tuples, at each iteration i, a training set Di of d tuples is
sampled with replacement from D (i.e., bootstrap)
 A classifier model Mi is learned for each training set Di
 Classification: classify an unknown sample X
 Each classifier Mi returns its class prediction
 The bagged classifier M* counts the votes and assigns the class with the
most votes to X
 Prediction: can be applied to the prediction of continuous values by taking
the average value of each prediction for a given test tuple
 Accuracy
 Often significantly better than a single classifier derived from D
 For noisy data: not considerably worse, more robust
 Proven improved accuracy in prediction
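A scikit-learn sketch of bagging as described above (scikit-learn assumed available; the base classifier defaults to a decision tree):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# k = 10 bootstrap samples of size |D|, one tree per sample,
# majority vote among the k models at prediction time
bag = BaggingClassifier(n_estimators=10, bootstrap=True, random_state=0)
print(cross_val_score(bag, X, y, cv=10).mean())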
533
Boosting
 Analogy: Consult several doctors, based on a combination of
weighted diagnoses—weight assigned based on the previous
diagnosis accuracy
 How boosting works?
 Weights are assigned to each training tuple
 A series of k classifiers is iteratively learned

After a classifier Mi is learned, the weights are updated to
allow the subsequent classifier, Mi+1, to pay more attention to
the training tuples that were misclassified by Mi
 The final M* combines the votes of each individual classifier,
where the weight of each classifier's vote is a function of its
accuracy
 Boosting algorithm can be extended for numeric prediction
 Comparing with bagging: Boosting tends to have greater accuracy,
but it also risks overfitting the model to misclassified data
534
535
Adaboost (Freund and Schapire, 1997)
 Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
 Initially, all the weights of tuples are set the same (1/d)
 Generate k classifiers in k rounds. At round i,

Tuples from D are sampled (with replacement) to form a training set Di
of the same size
 Each tuple’s chance of being selected is based on its weight

A classification model Mi is derived from Di

Its error rate is calculated using Di as a test set
 If a tuple is misclassified, its weight is increased, o.w. it is decreased
 Error rate: err(Xj) is the misclassification error of tuple Xj (1 if misclassified, 0 otherwise). Classifier Mi's error rate is the sum of the weights of the misclassified tuples:
error(Mi) = Σ_{j=1}^{d} w_j × err(Xj)
 The weight of classifier Mi’s vote is
log( (1 − error(Mi)) / error(Mi) )
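A brief sketch using scikit-learn's AdaBoost implementation (assumed available); it follows the same reweighting idea, although its internal details differ from the pseudocode above:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# 50 weak learners (decision stumps by default); tuple weights are increased
# for misclassified tuples between rounds, and classifiers vote with
# accuracy-based weights
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
print(cross_val_score(boost, X, y, cv=10).mean())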
Random Forest (Breiman 2001)
 Random Forest:
 Each classifier in the ensemble is a decision tree classifier and is
generated using a random selection of attributes at each node to
determine the split
 During classification, each tree votes and the most popular class is
returned
 Two Methods to construct Random Forest:
 Forest-RI (random input selection): Randomly select, at each node, F
attributes as candidates for the split at the node. The CART methodology
is used to grow the trees to maximum size
 Forest-RC (random linear combinations): Creates new attributes (or
features) that are a linear combination of the existing attributes (reduces
the correlation between individual classifiers)
 Comparable in accuracy to Adaboost, but more robust to errors and outliers
 Insensitive to the number of attributes selected for consideration at each
split, and faster than bagging or boosting
536
Classification of Class-Imbalanced Data Sets
 Class-imbalance problem: Rare positive example but numerous
negative ones, e.g., medical diagnosis, fraud, oil-spill, fault, etc.
 Traditional methods assume a balanced distribution of classes
and equal error costs: not suitable for class-imbalanced data
 Typical methods for imbalanced data in 2-class classification:
 Oversampling: re-sampling of data from positive class
 Under-sampling: randomly eliminate tuples from negative
class
 Threshold-moving: moves the decision threshold, t, so that
the rare class tuples are easier to classify, and hence, less
chance of costly false negative errors
 Ensemble techniques: Ensemble multiple classifiers
introduced above
 Still difficult for class imbalance problem on multiclass tasks
537
538
Chapter 8. Classification: Basic Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Summary
Summary (I)
 Classification is a form of data analysis that extracts models
describing important data classes.
 Effective and scalable methods have been developed for decision
tree induction, Naive Bayesian classification, rule-based
classification, and many other classification methods.
 Evaluation metrics include: accuracy, sensitivity, specificity, precision, recall, F measure, and Fβ measure.
 Stratified k-fold cross-validation is recommended for accuracy
estimation. Bagging and boosting can be used to increase overall
accuracy by learning and combining a series of individual models.
539
Summary (II)
 Significance tests and ROC curves are useful for model selection.
 There have been numerous comparisons of the different
classification methods; the matter remains a research topic
 No single method has been found to be superior over all others
for all data sets
 Issues such as accuracy, training time, robustness, scalability,
and interpretability must be considered and can involve trade-
offs, further complicating the quest for an overall superior
method
540
References (1)
 C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future
Generation Computer Systems, 13, 1997
 C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press,
1995
 L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees.
Wadsworth International Group, 1984
 C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data
Mining and Knowledge Discovery, 2(2): 121-168, 1998
 P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data
for scaling machine learning. KDD'95
 H. Cheng, X. Yan, J. Han, and C.-W. Hsu,
Discriminative Frequent Pattern Analysis for Effective Classification, ICDE'07
 H. Cheng, X. Yan, J. Han, and P. S. Yu,
Direct Discriminative Pattern Mining for Effective Classification, ICDE'08
 W. Cohen. Fast effective rule induction. ICML'95
 G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu. Mining top-k covering rule groups for
gene expression data. SIGMOD'05
541
References (2)
 A. J. Dobson. An Introduction to Generalized Linear Models. Chapman & Hall, 1990.
 G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and
differences. KDD'99.
 R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2ed. John Wiley, 2001
 U. M. Fayyad. Branching on attribute values in decision tree generation. AAAI’94.
 Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and
an application to boosting. J. Computer and System Sciences, 1997.
 J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision tree
construction of large datasets. VLDB’98.
 J. Gehrke, V. Gant, R. Ramakrishnan, and W.-Y. Loh, BOAT -- Optimistic Decision Tree
Construction. SIGMOD'99.
 T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer-Verlag, 2001.
 D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The
combination of knowledge and statistical data. Machine Learning, 1995.
 W. Li, J. Han, and J. Pei, CMAR: Accurate and Efficient Classification Based on Multiple
Class-Association Rules, ICDM'01.
542
References (3)
 T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity,
and training time of thirty-three old and new classification algorithms. Machine
Learning, 2000.
 J. Magidson. The Chaid approach to segmentation modeling: Chi-squared automatic
interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing
Research, Blackwell Business, 1994.
 M. Mehta, R. Agrawal, and J. Rissanen. SLIQ : A fast scalable classifier for data mining.
EDBT'96.
 T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
 S. K. Murthy, Automatic Construction of Decision Trees from Data: A Multi-
Disciplinary Survey, Data Mining and Knowledge Discovery 2(4): 345-389, 1998
 J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
 J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. ECML’93.
 J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
 J. R. Quinlan. Bagging, boosting, and c4.5. AAAI'96.
543
References (4)
 R. Rastogi and K. Shim. Public: A decision tree classifier that integrates building and
pruning. VLDB’98.
 J. Shafer, R. Agrawal, and M. Mehta. SPRINT : A scalable parallel classifier for data
mining. VLDB’96.
 J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan Kaufmann,
1990.
 P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley,
2005.
 S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and
Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert
Systems. Morgan Kaufman, 1991.
 S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.
 I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and
Techniques, 2ed. Morgan Kaufmann, 2005.
 X. Yin and J. Han. CPAR: Classification based on predictive association rules. SDM'03
 H. Yu, J. Yang, and J. Han. Classifying large data sets using SVM with hierarchical
clusters. KDD'03.
544
CS412 Midterm Exam Statistics
 Opinion Question Answering:
 Like the style: 70.83%, dislike: 29.16%
 Exam is hard: 55.75%, easy: 0.6%, just right: 43.63%
 Time: plenty:3.03%, enough: 36.96%, not: 60%
 Score distribution: # of students (Total: 180)
 >=90: 24
 80-89: 54
 70-79: 46
 60-69: 37
 50-59: 15
 40-49: 2
 <40: 2
 Final grades are based on overall score accumulation and relative class distributions
546
547
Issues: Evaluating Classification Methods
 Accuracy
 classifier accuracy: predicting class label
 predictor accuracy: guessing value of predicted attributes
 Speed
 time to construct the model (training time)
 time to use the model (classification/prediction time)
 Robustness: handling noise and missing values
 Scalability: efficiency in disk-resident databases
 Interpretability
 understanding and insight provided by the model
 Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
548
Predictor Error Measures
 Measure predictor accuracy: measure how far off the predicted value is from
the actual known value
 Loss function: measures the error between yi and the predicted value yi’
 Absolute error: | yi – yi’|
 Squared error: (yi – yi’)2
 Test error (generalization error): the average loss over the test set
 Mean absolute error:      MAE = (1/d) Σ_{i=1}^{d} | yi − yi’ |
 Mean squared error:       MSE = (1/d) Σ_{i=1}^{d} ( yi − yi’ )²
 Relative absolute error:  RAE = Σ_{i=1}^{d} | yi − yi’ | / Σ_{i=1}^{d} | yi − ȳ |
 Relative squared error:   RSE = Σ_{i=1}^{d} ( yi − yi’ )² / Σ_{i=1}^{d} ( yi − ȳ )²
(ȳ is the mean of the actual values yi)
The mean squared error exaggerates the presence of outliers
Popularly used: the (square) root mean squared error and, similarly, the root relative squared error
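A NumPy sketch of the four measures above (the data values are made up; variable names are ours):

import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.5])   # actual values y_i
y_pred = np.array([2.5, 5.0, 4.0, 8.0, 4.0])   # predicted values y_i'
y_mean = y_true.mean()

mae = np.mean(np.abs(y_true - y_pred))                                    # mean absolute error
mse = np.mean((y_true - y_pred) ** 2)                                     # mean squared error
rae = np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true - y_mean))   # relative absolute error
rse = np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_mean) ** 2)     # relative squared error

print(mae, np.sqrt(mse), rae, rse)   # RMSE reported as the square root of MSE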
549
Scalable Decision Tree Induction Methods
 SLIQ (EDBT’96 — Mehta et al.)
 Builds an index for each attribute and only class list and the
current attribute list reside in memory
 SPRINT (VLDB’96 — J. Shafer et al.)
 Constructs an attribute list data structure
 PUBLIC (VLDB’98 — Rastogi & Shim)
 Integrates tree splitting and tree pruning: stop growing the
tree earlier
 RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
 Builds an AVC-list (attribute, value, class label)
 BOAT (PODS’99 — Gehrke, Ganti, Ramakrishnan & Loh)
 Uses bootstrapping to create several small samples
550
Data Cube-Based Decision-Tree Induction
 Integration of generalization with decision-tree induction
(Kamber et al.’97)
 Classification at primitive concept levels
 E.g., precise temperature, humidity, outlook, etc.
 Low-level concepts, scattered classes, bushy classification-
trees
 Semantic interpretation problems
 Cube-based multi-level classification
 Relevance analysis at multi-levels
 Information-gain analysis with dimension + level
551
Data Mining:
Concepts and Techniques
(3rd
ed.)
— Chapter 9 —
Classification: Advanced Methods
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
552
Chapter 9. Classification: Advanced Methods
 Bayesian Belief Networks
 Classification by Backpropagation
 Support Vector Machines
 Classification by Using Frequent Patterns
 Lazy Learners (or Learning from Your
Neighbors)
 Other Classification Methods
 Additional Topics Regarding Classification
 Summary
553
Bayesian Belief Networks
 Bayesian belief networks (also known as Bayesian
networks, probabilistic networks): allow class
conditional independencies between subsets of variables
 A (directed acyclic) graphical model of causal
relationships
 Represents dependency among the variables
 Gives a specification of joint probability distribution
[Example graph: nodes X, Y, Z, P with edges X → Z, Y → Z, Y → P]
 Nodes: random variables
 Links: dependency
 X and Y are the parents of Z, and Y is
the parent of P
 No dependency between Z and P
 Has no loops/cycles
554
Bayesian Belief Network: An Example
[Figure: a Bayesian belief network over six variables — FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, Dyspnea; FH and S are the parents of LC]

CPT: Conditional Probability Table for variable LungCancer, showing the conditional probability for each possible combination of the values of its parents (FH, S):

          (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
   LC       0.8       0.5        0.7        0.1
   ~LC      0.2       0.5        0.3        0.9

Derivation of the probability of a particular combination of values of X = (x1, ..., xn) from the CPTs:

   P(x1, ..., xn) = ∏_{i=1}^{n} P( xi | Parents(Yi) )
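A small Python sketch of the product rule above for part of this network; only the LungCancer CPT values come from the slide — the priors for FamilyHistory and Smoker are hypothetical numbers for illustration:

# P(LC | FH, S) from the CPT above; keys are (FH, S) truth values
cpt_lc = {(True, True): 0.8, (True, False): 0.5,
          (False, True): 0.7, (False, False): 0.1}

# Hypothetical priors (not given on the slide)
p_fh = {True: 0.1, False: 0.9}
p_s  = {True: 0.3, False: 0.7}

def joint(fh, s, lc):
    # P(FH, S, LC) = P(FH) * P(S) * P(LC | FH, S), since FH and S have no parents
    p_lc_given = cpt_lc[(fh, s)] if lc else 1.0 - cpt_lc[(fh, s)]
    return p_fh[fh] * p_s[s] * p_lc_given

print(joint(True, True, True))    # 0.1 * 0.3 * 0.8 = 0.024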
555
Training Bayesian Networks: Several
Scenarios
 Scenario 1: Given both the network structure and all variables
observable: compute only the CPT entries
 Scenario 2: Network structure known, some variables hidden:
gradient descent (greedy hill-climbing) method, i.e., search for a
solution along the steepest descent of a criterion function
 Weights are initialized to random probability values
 At each iteration, it moves towards what appears to be the best
solution at the moment, w.o. backtracking
 Weights are updated at each iteration & converge to local
optimum
 Scenario 3: Network structure unknown, all variables observable:
search through the model space to reconstruct network topology
 Scenario 4: Unknown structure, all hidden variables: No good
algorithms known for this purpose
 D. Heckerman. A Tutorial on Learning with Bayesian Networks. In
Learning in Graphical Models, M. Jordan, ed.. MIT Press, 1999.
556
Chapter 9. Classification: Advanced Methods
 Bayesian Belief Networks
 Classification by Backpropagation
 Support Vector Machines
 Classification by Using Frequent Patterns
 Lazy Learners (or Learning from Your
Neighbors)
 Other Classification Methods
 Additional Topics Regarding Classification
 Summary
557
Classification by Backpropagation
 Backpropagation: A neural network learning
algorithm
 Started by psychologists and neurobiologists to
develop and test computational analogues of neurons
 A neural network: A set of connected input/output
units where each connection has a weight associated
with it
 During the learning phase, the network learns by
adjusting the weights so as to be able to predict the
correct class label of the input tuples
 Also referred to as connectionist learning due to the connections between units
558
Neural Network as a Classifier
 Weakness
 Long training time
 Require a number of parameters typically best determined
empirically, e.g., the network topology or “structure.”
 Poor interpretability: Difficult to interpret the symbolic meaning
behind the learned weights and of “hidden units” in the
network
 Strength
 High tolerance to noisy data
 Ability to classify untrained patterns
 Well-suited for continuous-valued inputs and outputs
 Successful on an array of real-world data, e.g., hand-written
letters
 Algorithms are inherently parallel
 Techniques have recently been developed for the extraction of rules from trained neural networks
559
A Multi-Layer Feed-Forward Neural Network
[Figure: the input vector X feeds the input layer; weighted connections w_ij lead to a hidden layer and then to the output layer, which emits the output vector]

Weight update (learning rate λ):
   w_j^(k+1) = w_j^(k) + λ ( y_i − ŷ_i^(k) ) x_ij
560
How A Multi-Layer Neural Network Works
 The inputs to the network correspond to the attributes
measured for each training tuple
 Inputs are fed simultaneously into the units making up the input
layer
 They are then weighted and fed simultaneously to a hidden
layer
 The number of hidden layers is arbitrary, although usually only
one
 The weighted outputs of the last hidden layer are input to units
making up the output layer, which emits the network's
prediction
 The network is feed-forward: None of the weights cycles back to
an input unit or to an output unit of a previous layer
 From a statistical point of view, networks perform nonlinear regression
561
Defining a Network Topology
 Decide the network topology: Specify # of units in the
input layer, # of hidden layers (if > 1), # of units in each
hidden layer, and # of units in the output layer
 Normalize the input values for each attribute measured
in the training tuples to [0.0—1.0]
 One input unit per domain value, each initialized to 0
 Output, if for classification and more than two classes,
one output unit per class is used
 Once a network has been trained and its accuracy is
unacceptable, repeat the training process with a
different network topology or a different set of initial
weights
562
Backpropagation
 Iteratively process a set of training tuples & compare the network's
prediction with the actual known target value
 For each training tuple, the weights are modified to minimize the
mean squared error between the network's prediction and the
actual target value
 Modifications are made in the “backwards” direction: from the
output layer, through each hidden layer down to the first hidden
layer, hence “backpropagation”
 Steps
 Initialize weights to small random numbers, associated with
biases
 Propagate the inputs forward (by applying activation function)
 Backpropagate the error (by updating weights and biases)

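A compact NumPy sketch of these steps for a single-hidden-layer network with sigmoid units and one training tuple; the sizes, data, and learning rate are ours, and the error terms follow the standard sigmoid backpropagation rules:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([1.0, 0.0, 1.0])      # one training tuple (3 inputs)
t = np.array([1.0])                # its target output
lr = 0.9                           # learning rate

# Step 1: initialize weights and biases to small random numbers
W1, b1 = rng.uniform(-0.5, 0.5, (3, 2)), rng.uniform(-0.5, 0.5, 2)   # input -> hidden (2 units)
W2, b2 = rng.uniform(-0.5, 0.5, (2, 1)), rng.uniform(-0.5, 0.5, 1)   # hidden -> output (1 unit)

# Step 2: propagate the inputs forward
h = sigmoid(x @ W1 + b1)           # hidden-layer outputs
o = sigmoid(h @ W2 + b2)           # network prediction

# Step 3: backpropagate the error (output error, then hidden error)
err_o = o * (1 - o) * (t - o)              # output-layer error term
err_h = h * (1 - h) * (W2 @ err_o)         # hidden-layer error term

# Update weights and biases in the "backwards" direction: w_ij += lr * err_j * o_i
W2 += lr * np.outer(h, err_o);  b2 += lr * err_o
W1 += lr * np.outer(x, err_h);  b1 += lr * err_h
print("prediction before this update:", o)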
563
Neuron: A Hidden/Output Layer Unit
 An n-dimensional input vector x is mapped into variable y by means of the
scalar product and a nonlinear function mapping
 The inputs to unit are outputs from the previous layer. They are multiplied by
their corresponding weights to form a weighted sum, which is added to the
bias associated with unit. Then a nonlinear activation function is applied to it.
[Figure: inputs x0, x1, ..., xn with weights w0, w1, ..., wn feed a weighted sum Σ; the bias μk is added and a nonlinear activation function f produces the output y]

For example:  y = sign( Σ_{i=0}^{n} wi xi + μk )
564
Efficiency and Interpretability
 Efficiency of backpropagation: Each epoch (one iteration through
the training set) takes O(|D| * w), with |D| tuples and w weights,
but # of epochs can be exponential to n, the number of inputs, in
worst case
 For easier comprehension: Rule extraction by network pruning
 Simplify the network structure by removing weighted links that
have the least effect on the trained network
 Then perform link, unit, or activation value clustering
 The set of input and activation values are studied to derive
rules describing the relationship between the input and hidden
unit layers
 Sensitivity analysis: assess the impact that a given input variable
has on a network output. The knowledge gained from this analysis
can be represented in rules
565
Chapter 9. Classification: Advanced Methods
 Bayesian Belief Networks
 Classification by Backpropagation
 Support Vector Machines
 Classification by Using Frequent Patterns
 Lazy Learners (or Learning from Your
Neighbors)
 Other Classification Methods
 Additional Topics Regarding Classification
 Summary
566
Classification: A Mathematical Mapping
 Classification: predicts categorical class labels
 E.g., Personal homepage classification
 xi = (x1, x2, x3, …), yi = +1 or –1
 x1 : # of word “homepage”
 x2 : # of word “welcome”
 Mathematically, x ∈ X = ℝⁿ, y ∈ Y = {+1, –1}
 We want to derive a function f: X → Y
 Linear Classification
 Binary Classification problem
 Data above the red line belongs to class ‘x’
 Data below red line belongs to class ‘o’
 Examples: SVM, Perceptron, Probabilistic Classifiers
567
Discriminative Classifiers
 Advantages
 Prediction accuracy is generally high

As compared to Bayesian methods – in general
 Robust, works when training examples contain errors
 Fast evaluation of the learned target function

Bayesian networks are normally slow
 Criticism
 Long training time
 Difficult to understand the learned function (weights)

Bayesian networks can be used easily for pattern
discovery
 Not easy to incorporate domain knowledge

Easy in the form of priors on the data or
distributions
568
SVM—Support Vector Machines
 A relatively new classification method for both linear
and nonlinear data
 It uses a nonlinear mapping to transform the original
training data into a higher dimension
 With the new dimension, it searches for the linear
optimal separating hyperplane (i.e., “decision
boundary”)
 With an appropriate nonlinear mapping to a
sufficiently high dimension, data from two classes can
always be separated by a hyperplane
 SVM finds this hyperplane using support vectors
(“essential” training tuples) and margins (defined by
the support vectors)
569
SVM—History and Applications
 Vapnik and colleagues (1992)—groundwork from
Vapnik & Chervonenkis’ statistical learning theory in
1960s
 Features: training can be slow but accuracy is high
owing to their ability to model complex nonlinear
decision boundaries (margin maximization)
 Used for: classification and numeric prediction
 Applications:
 handwritten digit recognition, object recognition,
speaker identification, benchmarking time-series
prediction tests
570
SVM—General Philosophy
[Figure: two possible separating hyperplanes, one with a small margin and one with a large margin; the tuples lying on the margins are the support vectors]
571
SVM—Margins and Support Vectors
572
SVM—When Data Is Linearly Separable
Let data D be (X1, y1), …, (X|D|, y|D|), where Xi is the set of training tuples
associated with the class labels yi
There are infinite lines (hyperplanes) separating the two classes but we
want to find the best one (the one that minimizes classification error on
unseen data)
SVM searches for the hyperplane with the largest margin, i.e., maximum
marginal hyperplane (MMH)
573
SVM—Linearly Separable
 A separating hyperplane can be written as
W ● X + b = 0
where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)
 For 2-D it can be written as
w0 + w1 x1 + w2 x2 = 0
 The hyperplanes defining the sides of the margin:
H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 ≤ –1 for yi = –1
 Any training tuples that fall on hyperplanes H1 or H2 (i.e., the
sides defining the margin) are support vectors
 This becomes a constrained (convex) quadratic optimization
problem: Quadratic objective function and linear constraints 
Quadratic Programming (QP)  Lagrangian multipliers
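A scikit-learn sketch (assumed available) of training a linear-kernel SVM on toy 2-D data and inspecting the resulting hyperplane and support vectors:

import numpy as np
from sklearn.svm import SVC

# Two linearly separable 2-D classes (made-up data)
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]], dtype=float)
y = np.array([-1, -1, -1, +1, +1, +1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.support_vectors_)          # the "essential" training tuples on the margin
print(clf.coef_, clf.intercept_)     # W and b of the separating hyperplane W·X + b = 0
print(clf.predict([[2, 2], [6, 6]]))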
574
Why Is SVM Effective on High Dimensional Data?
 The complexity of trained classifier is characterized by the # of
support vectors rather than the dimensionality of the data
 The support vectors are the essential or critical training examples
—they lie closest to the decision boundary (MMH)
 If all other training examples are removed and the training is
repeated, the same separating hyperplane would be found
 The number of support vectors found can be used to compute an
(upper) bound on the expected error rate of the SVM classifier,
which is independent of the data dimensionality
 Thus, an SVM with a small number of support vectors can have
good generalization, even when the dimensionality of the data is
high
575
SVM—Linearly Inseparable
 Transform the original input data into a higher
dimensional space
 Search for a linear separating hyperplane in the new
space
[Figure: data in the original 2-D input space (A1, A2) that is not linearly separable]
576
SVM: Different Kernel functions
 Instead of computing the dot product on the transformed data, it is mathematically equivalent to applying a kernel function K(Xi, Xj) to the original data, i.e., K(Xi, Xj) = Φ(Xi) · Φ(Xj)
 Typical kernel functions: polynomial, Gaussian radial basis function (RBF), and sigmoid kernels
 SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional parameters)
577
Scaling SVM by Hierarchical Micro-Clustering
 SVM is not scalable to the number of data objects in terms of
training time and memory usage
 H. Yu, J. Yang, and J. Han, “
Classifying Large Data Sets Using SVM with Hierarchical Clusters”,
KDD'03)
 CB-SVM (Clustering-Based SVM)
 Given limited amount of system resources (e.g., memory),
maximize the SVM performance in terms of accuracy and the
training speed
 Use micro-clustering to effectively reduce the number of
points to be considered
 At deriving support vectors, de-cluster micro-clusters near
“candidate vector” to ensure high classification accuracy
578
CF-Tree: Hierarchical Micro-cluster
 Read the data set once, construct a statistical summary of the
data (i.e., hierarchical clusters) given a limited amount of
memory
 Micro-clustering: Hierarchical indexing structure
 provide finer samples closer to the boundary and coarser
samples farther from the boundary
579
Selective Declustering: Ensure High Accuracy
 CF tree is a suitable base structure for selective declustering
 De-cluster only the cluster Ei such that
 Di – Ri < Ds, where Di is the distance from the boundary to the
center point of Ei and Ri is the radius of Ei
 Decluster only the cluster whose subclusters have possibilities
to be the support cluster of the boundary

“Support cluster”: The cluster whose centroid is a support
vector
580
CB-SVM Algorithm: Outline
 Construct two CF-trees from positive and negative data
sets independently
 Need one scan of the data set
 Train an SVM from the centroids of the root entries
 De-cluster the entries near the boundary into the next
level
 The children entries de-clustered from the parent
entries are accumulated into the training set with
the non-declustered parent entries
 Train an SVM again from the centroids of the entries in
the training set
 Repeat until nothing is accumulated
581
Accuracy and Scalability on Synthetic Dataset
 Experiments on large synthetic data sets shows better
accuracy than random sampling approaches and far
more scalable than the original SVM algorithm
582
SVM vs. Neural Network
 SVM
 Deterministic
algorithm
 Nice generalization
properties
 Hard to learn –
learned in batch mode
using quadratic
programming
techniques
 Using kernels can learn very complex functions
 Neural Network
 Nondeterministic
algorithm
 Generalizes well but
doesn’t have strong
mathematical
foundation
 Can easily be learned in
incremental fashion
 To learn complex
functions—use
multilayer perceptron
583
SVM Related Links
 SVM Website: http://www.kernel-machines.org/
 Representative implementations
 LIBSVM: an efficient implementation of SVM, multi-
class classifications, nu-SVM, one-class SVM,
including also various interfaces with java, python,
etc.
 SVM-light: simpler but performance is not better than LIBSVM; supports only binary classification and only in C
 SVM-torch: another recent implementation, also written in C
584
Chapter 9. Classification: Advanced Methods
 Bayesian Belief Networks
 Classification by Backpropagation
 Support Vector Machines
 Classification by Using Frequent Patterns
 Lazy Learners (or Learning from Your
Neighbors)
 Other Classification Methods
 Additional Topics Regarding Classification
 Summary
585
Associative Classification
 Associative classification: Major steps
 Mine data to find strong associations between frequent patterns
(conjunctions of attribute-value pairs) and class labels
 Association rules are generated in the form of
p1 ∧ p2 ∧ … ∧ pl → “Aclass = C” (conf, sup)
 Organize the rules to form a rule-based classifier
 Why effective?
 It explores highly confident associations among multiple
attributes and may overcome some constraints introduced by
decision-tree induction, which considers only one attribute at a
time
 Associative classification has been found to be often more
accurate than some traditional classification methods, such as C4.5
586
Typical Associative Classification Methods
 CBA (Classification Based on Associations: Liu, Hsu & Ma, KDD’98)
 Mine possible association rules in the form of

Cond-set (a set of attribute-value pairs) → class label
 Build classifier: Organize rules according to decreasing
precedence based on confidence and then support
 CMAR (Classification based on Multiple Association Rules: Li, Han,
Pei, ICDM’01)
 Classification: Statistical analysis on multiple rules
 CPAR (Classification based on Predictive Association Rules: Yin & Han,
SDM’03)
 Generation of predictive rules (FOIL-like analysis) but allow
covered rules to retain with reduced weight
 Prediction using best k rules

587
Frequent Pattern-Based Classification
 H. Cheng, X. Yan, J. Han, and C.-W. Hsu, “Discriminative Frequent Pattern Analysis for Effective Classification”, ICDE'07
 Accuracy issue
 Increase the discriminative power
 Increase the expressive power of the feature space
 Scalability issue
 It is computationally infeasible to generate all
feature combinations and filter them with an
information gain threshold
 Efficient method (DDPMine: FP-tree pruning): H. Cheng, X. Yan, J. Han, and P. S. Yu, “Direct Discriminative Pattern Mining for Effective Classification”, ICDE'08
588
Frequent Pattern vs. Single Feature
[Fig. 1. Information Gain vs. Pattern Length — panels (a) Austral, (b) Cleve, (c) Sonar]
The discriminative power of some frequent patterns is
higher than that of single features.
589
Empirical Results
[Fig. 2. Information Gain vs. Pattern Frequency (support) — panels (a) Austral, (b) Breast, (c) Sonar; curves show InfoGain and IG_UpperBnd]
590
Feature Selection
 Given a set of frequent patterns, both non-
discriminative and redundant patterns exist, which can
cause overfitting
 We want to single out the discriminative patterns and
remove redundant ones
 The notion of Maximal Marginal Relevance (MMR) is
borrowed
 A document has high marginal relevance if it is both
relevant to the query and contains minimal marginal
similarity to previously selected documents
591
Experimental Results
591
592
Scalability Tests
593
DDPMine: Branch-and-Bound Search
Association between information
gain and frequency
a: constant, a parent node
b: variable, a descendant
sup(child) ≤ sup(parent),  i.e.,  sup(b) ≤ sup(a)
594
DDPMine Efficiency: Runtime
[Figure: runtime comparison of three algorithms — PatClass (the ICDE’07 pattern classification algorithm), Harmony, and DDPMine]
595
Chapter 9. Classification: Advanced Methods
 Bayesian Belief Networks
 Classification by Backpropagation
 Support Vector Machines
 Classification by Using Frequent Patterns
 Lazy Learners (or Learning from Your
Neighbors)
 Other Classification Methods
 Additional Topics Regarding Classification
 Summary
596
Lazy vs. Eager Learning
 Lazy vs. eager learning
 Lazy learning (e.g., instance-based learning): Simply
stores training data (or only minor processing) and
waits until it is given a test tuple
 Eager learning (the above discussed methods):
Given a set of training tuples, constructs a
classification model before receiving new (e.g., test)
data to classify
 Lazy: less time in training but more time in predicting
 Accuracy
 Lazy method effectively uses a richer hypothesis
space since it uses many local linear functions to
form an implicit global approximation to the target
function
 Eager: must commit to a single hypothesis that covers the entire instance space
597
Lazy Learner: Instance-Based Methods
 Instance-based learning:
 Store training examples and delay the processing
(“lazy evaluation”) until a new instance must be
classified
 Typical approaches
 k-nearest neighbor approach

Instances represented as points in a Euclidean
space.
 Locally weighted regression

Constructs local approximation
 Case-based reasoning

Uses symbolic representations and knowledge-
based inference
598
The k-Nearest Neighbor Algorithm
 All instances correspond to points in the n-D space
 The nearest neighbors are defined in terms of
Euclidean distance, dist(X1, X2)
 Target function could be discrete- or real- valued
 For discrete-valued, k-NN returns the most
common value among the k training examples
nearest to xq
 Voronoi diagram: the decision surface induced by
1-NN for a typical set of training examples
599
Discussion on the k-NN Algorithm
 k-NN for real-valued prediction for a given unknown
tuple
 Returns the mean values of the k nearest neighbors
 Distance-weighted nearest neighbor algorithm
 Weight the contribution of each of the k neighbors
according to their distance to the query xq

Give greater weight to closer neighbors
 Robust to noisy data by averaging k-nearest neighbors
 Curse of dimensionality: distance between neighbors
could be dominated by irrelevant attributes
 To overcome it, axes stretch or elimination of the
least relevant attributes
(the distance-based weight mentioned above:  w ≡ 1 / d(xq, xi)²)
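A scikit-learn sketch of k-NN with distance-based weighting (assumed API; note that scikit-learn's "distance" option weights by 1/d rather than the 1/d² shown above):

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# weights="distance" gives closer neighbors a larger say in the vote
knn = KNeighborsClassifier(n_neighbors=5, weights="distance", metric="euclidean")
print(cross_val_score(knn, X, y, cv=10).mean())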
600
Case-Based Reasoning (CBR)
 CBR: Uses a database of problem solutions to solve new problems
 Store symbolic description (tuples or cases)—not points in a
Euclidean space
 Applications: Customer-service (product-related diagnosis), legal
ruling
 Methodology
 Instances represented by rich symbolic descriptions (e.g.,
function graphs)
 Search for similar cases, multiple retrieved cases may be
combined
 Tight coupling between case retrieval, knowledge-based
reasoning, and problem solving
 Challenges
 Find a good similarity metric
 Indexing based on syntactic similarity measure, and when failure occurs, backtracking and adapting to additional cases
601
Chapter 9. Classification: Advanced Methods
 Bayesian Belief Networks
 Classification by Backpropagation
 Support Vector Machines
 Classification by Using Frequent Patterns
 Lazy Learners (or Learning from Your
Neighbors)
 Other Classification Methods
 Additional Topics Regarding Classification
 Summary
602
Genetic Algorithms (GA)
 Genetic Algorithm: based on an analogy to biological evolution
 An initial population is created consisting of randomly generated
rules
 Each rule is represented by a string of bits
 E.g., if A1 and ¬A2 then C2 can be encoded as 100
 If an attribute has k > 2 values, k bits can be used
 Based on the notion of survival of the fittest, a new population is
formed to consist of the fittest rules and their offspring
 The fitness of a rule is represented by its classification accuracy on a
set of training examples
 Offspring are generated by crossover and mutation
 The process continues until a population P evolves, where each rule in P satisfies a prespecified fitness threshold
 Slow but easily parallelizable
603
Rough Set Approach
 Rough sets are used to approximately or “roughly” define
equivalent classes
 A rough set for a given class C is approximated by two sets: a
lower approximation (certain to be in C) and an upper
approximation (cannot be described as not belonging to C)
 Finding the minimal subsets (reducts) of attributes for feature
reduction is NP-hard but a discernibility matrix (which stores
the differences between attribute values for each pair of data
tuples) is used to reduce the computation intensity
604
Fuzzy Set
Approaches
 Fuzzy logic uses truth values between 0.0 and 1.0 to represent
the degree of membership (such as in a fuzzy membership graph)
 Attribute values are converted to fuzzy values. Ex.:
 Income, x, is assigned a fuzzy membership value to each of
the discrete categories {low, medium, high}, e.g. $49K
belongs to “medium income” with fuzzy value 0.15 but
belongs to “high income” with fuzzy value 0.96
 Fuzzy membership values do not have to sum to 1.
 Each applicable rule contributes a vote for membership in the
categories
 Typically, the truth values for each predicted category are
summed, and these sums are combined
605
Chapter 9. Classification: Advanced Methods
 Bayesian Belief Networks
 Classification by Backpropagation
 Support Vector Machines
 Classification by Using Frequent Patterns
 Lazy Learners (or Learning from Your
Neighbors)
 Other Classification Methods
 Additional Topics Regarding Classification
 Summary
Multiclass Classification
 Classification involving more than two classes (i.e., > 2 Classes)
 Method 1. One-vs.-all (OVA): Learn a classifier one at a time
 Given m classes, train m classifiers: one for each class
 Classifier j: treat tuples in class j as positive & all others as
negative
 To classify a tuple X, the set of classifiers vote as an ensemble
 Method 2. All-vs.-all (AVA): Learn a classifier for each pair of classes
 Given m classes, construct m(m-1)/2 binary classifiers
 A classifier is trained using tuples of the two classes
 To classify a tuple X, each classifier votes. X is assigned to the
class with maximal vote
 Comparison
 All-vs.-all tends to be superior to one-vs.-all
 Problem: Binary classifiers are sensitive to errors, and errors affect the vote counting
606
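A scikit-learn sketch of the two schemes (assumed API); many classifiers handle multiclass natively, but the meta-estimators make the strategies explicit:

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)   # m = 3 classes

ova = OneVsRestClassifier(LinearSVC(max_iter=10000))   # trains m binary classifiers
ava = OneVsOneClassifier(LinearSVC(max_iter=10000))    # trains m(m-1)/2 binary classifiers

print(cross_val_score(ova, X, y, cv=5).mean())
print(cross_val_score(ava, X, y, cv=5).mean())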
Error-Correcting Codes for Multiclass Classification
 Originally designed to correct errors during
data transmission for communication tasks by
exploring data redundancy
 Example
 A 7-bit codeword associated with classes 1-4
607
   Class   Error-Corr. Codeword
   C1      1 1 1 1 1 1 1
   C2      0 0 0 0 1 1 1
   C3      0 0 1 1 0 0 1
   C4      0 1 0 1 0 1 0
 Given an unknown tuple X, the 7 trained classifiers output: 0001010
 Hamming distance: # of differing bits between two codewords
 H(X, C1) = 5, by checking # of differing bits between [1111111] & [0001010]
 H(X, C2) = 3, H(X, C3) = 3, H(X, C4) = 1, thus C4 is chosen as the label for X
 Error-correcting codes can correct up to ⌊(h−1)/2⌋ 1-bit errors, where h is the minimum Hamming distance between any two codewords
 If we use 1 bit per class, it is equivalent to the one-vs.-all approach, and the codes are insufficient to self-correct
 When selecting error-correcting codes, there should be good row-
wise and col.-wise separation between the codewords
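A short Python sketch reproducing the decoding step above (the codewords and the classifier output are taken from this example):

codewords = {
    "C1": "1111111",
    "C2": "0000111",
    "C3": "0011001",
    "C4": "0101010",
}
output = "0001010"   # the 7 classifiers' outputs for tuple X

def hamming(a, b):
    # number of bit positions where the two codewords differ
    return sum(x != y for x, y in zip(a, b))

distances = {c: hamming(cw, output) for c, cw in codewords.items()}
print(distances)                           # {'C1': 5, 'C2': 3, 'C3': 3, 'C4': 1}
print(min(distances, key=distances.get))   # C4: the class with the closest codeword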
Semi-Supervised Classification
 Semi-supervised: Uses labeled and unlabeled data to build a
classifier
 Self-training:
 Build a classifier using the labeled data
 Use it to label the unlabeled data, and those with the most
confident label prediction are added to the set of labeled data
 Repeat the above process
 Adv: easy to understand; disadv: may reinforce errors
 Co-training: Use two or more classifiers to teach each other
 Each learner uses a mutually independent set of features of each
tuple to train a good classifier, say f1
 Then f1 and f2 are used to predict the class label for unlabeled
data X
 Teach each other: The tuple having the most confident
prediction from f1 is added to the set of labeled data for f2, & vice
versa
608
Active Learning
 Class labels are expensive to obtain
 Active learner: query human (oracle) for labels
 Pool-based approach: Uses a pool of unlabeled data
 L: a small subset of D is labeled, U: a pool of unlabeled data in
D
 Use a query function to carefully select one or more tuples
from U and request labels from an oracle (a human annotator)
 The newly labeled samples are added to L, and learn a model
 Goal: Achieve high accuracy using as few labeled data as
possible
 Evaluated using learning curves: Accuracy as a function of the
number of instances queried (# of tuples to be queried should be
small)
 Research issue: How to choose the data tuples to be queried?
 Uncertainty sampling: choose the least certain ones
 Reduce version space, the subset of hypotheses consistent w.
the training data
 Reduce expected entropy over U: find the query expected to give the greatest reduction in entropy
609
Transfer Learning: Conceptual Framework
 Transfer learning: Extract knowledge from one or more source
tasks and apply the knowledge to a target task
 Traditional learning: Build a new classifier for each new task
 Transfer learning: Build new classifier by applying existing
knowledge learned from source tasks
[Figure: in the traditional learning framework, a separate learning system is built for each of the different tasks; in the transfer learning framework, knowledge extracted from the source tasks is fed to the learning system for the target task]
610
Transfer Learning: Methods and Applications
 Applications: Especially useful when data is outdated or distribution
changes, e.g., Web document classification, e-mail spam filtering
 Instance-based transfer learning: Reweight some of the data from
source tasks and use it to learn the target task
 TrAdaBoost (Transfer AdaBoost)
 Assume source and target data each described by the same set of
attributes (features) & class labels, but rather diff. distributions
 Require only labeling a small amount of target data
 Use source data in training: When a source tuple is misclassified,
reduce the weight of such tuples so that they will have less effect
on the subsequent classifier
 Research issues
 Negative transfer: When it performs worse than no transfer at all
 Heterogeneous transfer learning: Transfer knowledge from
different feature space or multiple source domains
 Large-scale transfer learning
611
612
Chapter 9. Classification: Advanced Methods
 Bayesian Belief Networks
 Classification by Backpropagation
 Support Vector Machines
 Classification by Using Frequent Patterns
 Lazy Learners (or Learning from Your
Neighbors)
 Other Classification Methods
 Additional Topics Regarding Classification
 Summary
613
Summary
 Effective and advanced classification methods
 Bayesian belief network (probabilistic networks)
 Backpropagation (Neural networks)
 Support Vector Machine (SVM)
 Pattern-based classification
 Other classification methods: lazy learners (KNN, case-based
reasoning), genetic algorithms, rough set and fuzzy set
approaches
 Additional Topics on Classification
 Multiclass classification
 Semi-supervised classification
 Active learning
 Transfer learning
614
References
 Please see the references of Chapter 8
Surplus Slides
616
What Is Prediction?
 (Numerical) prediction is similar to classification
 construct a model
 use model to predict continuous or ordered value for a given
input
 Prediction is different from classification
 Classification refers to predict categorical class label
 Prediction models continuous-valued functions
 Major method for prediction: regression
 model the relationship between one or more independent or
predictor variables and a dependent or response variable
 Regression analysis
 Linear and multiple regression
 Non-linear regression
 Other regression methods: generalized linear model, Poisson
regression, log-linear models, regression trees
617
Linear Regression
 Linear regression: involves a response variable y and a single
predictor variable x
y = w0 + w1 x
where w0 (y-intercept) and w1 (slope) are regression coefficients
 Method of least squares: estimates the best-fitting straight line
 Multiple linear regression: involves more than one predictor
variable
 Training data is of the form (X1, y1), (X2, y2),…, (X|D|, y|D|)
 Ex. For 2-D data, we may have: y = w0 + w1 x1+ w2 x2
 Solvable by extension of the least squares method or using SAS, S-Plus
Least squares estimates for y = w0 + w1 x (x̄, ȳ are the means of x and y over D):
   w1 = Σ_{i=1}^{|D|} ( xi − x̄ )( yi − ȳ ) / Σ_{i=1}^{|D|} ( xi − x̄ )²
   w0 = ȳ − w1 x̄
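A NumPy sketch of these estimates on made-up (x, y) data:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])   # roughly y = 1 + 1*x plus noise

x_bar, y_bar = x.mean(), y.mean()
w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)   # slope
w0 = y_bar - w1 * x_bar                                             # intercept
print(w0, w1)            # least squares fit of y = w0 + w1 * x
print(w0 + w1 * 6.0)     # predict y for a new x = 6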
618
 Some nonlinear models can be modeled by a polynomial
function
 A polynomial regression model can be transformed into a linear regression model. For example,
   y = w0 + w1 x + w2 x² + w3 x³
is convertible to linear form with new variables x2 = x², x3 = x³:
   y = w0 + w1 x + w2 x2 + w3 x3
 Other functions, such as power function, can also be
transformed to linear model
 Some models are intractable nonlinear (e.g., sum of
exponential terms)
 possible to obtain least square estimates through
extensive calculation on more complex formulae
Nonlinear Regression
619
 Generalized linear model:
 Foundation on which linear regression can be applied to
modeling categorical response variables
 Variance of y is a function of the mean value of y, not a constant
 Logistic regression: models the prob. of some event occurring
as a linear function of a set of predictor variables
 Poisson regression: models the data that exhibit a Poisson
distribution
 Log-linear models: (for categorical data)
 Approximate discrete multidimensional prob. distributions
 Also useful for data compression and smoothing
 Regression trees and model trees
 Trees to predict continuous values rather than class labels
Other Regression-Based Models
620
Regression Trees and Model Trees
 Regression tree: proposed in CART system (Breiman et al. 1984)
 CART: Classification And Regression Trees
 Each leaf stores a continuous-valued prediction
 It is the average value of the predicted attribute for the training
tuples that reach the leaf
 Model tree: proposed by Quinlan (1992)
 Each leaf holds a regression model—a multivariate linear
equation for the predicted attribute
 A more general case than regression tree
 Regression and model trees tend to be more accurate than linear
regression when the data are not represented well by a simple
linear model
621
 Predictive modeling: Predict data values or construct
generalized linear models based on the database data
 One can only predict value ranges or category
distributions
 Method outline:
 Minimal generalization
 Attribute relevance analysis
 Generalized linear model construction
 Prediction
 Determine the major factors which influence the
prediction
 Data relevance analysis: uncertainty measurement,
entropy analysis, expert judgement, etc.
Predictive Modeling in Multidimensional Databases
622
Prediction: Numerical Data
623
Prediction: Categorical Data
624
SVM—Introductory Literature
 “Statistical Learning Theory” by Vapnik: extremely hard to
understand, containing many errors too.
 C. J. C. Burges.
A Tutorial on Support Vector Machines for Pattern Recognition.
Knowledge Discovery and Data Mining, 2(2), 1998.
 Better than Vapnik’s book, but still too hard as an introduction, and the examples are not intuitive
 The book “An Introduction to Support Vector Machines” by N.
Cristianini and J. Shawe-Taylor
 Also hard as an introduction, but the explanation of Mercer’s theorem is better than in the above literature
 The neural network book by Haykins
 Contains one nice chapter of SVM introduction
625
Notes about SVM—
Introductory Literature
 “Statistical Learning Theory” by Vapnik: difficult to understand,
containing many errors.
 C. J. C. Burges.
A Tutorial on Support Vector Machines for Pattern Recognition.
Knowledge Discovery and Data Mining, 2(2), 1998.
 Easier than Vapnik’s book, but still not introductory level; the
examples are not so intuitive
 The book An Introduction to Support Vector Machines by
Cristianini and Shawe-Taylor
 Not introductory level, but the explanation about Mercer’s
Theorem is better than above literatures
 Neural Networks and Learning Machines by Haykin
 Contains a nice chapter on SVM introduction
626
Associative Classification Can Achieve High
Accuracy and Efficiency (Cong et al. SIGMOD05)
627
A Closer Look at CMAR
 CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM’01)
 Efficiency: Uses an enhanced FP-tree that maintains the distribution
of class labels among tuples satisfying each frequent itemset
 Rule pruning whenever a rule is inserted into the tree
 Given two rules, R1 and R2, if the antecedent of R1 is more general
than that of R2 and conf(R1) ≥ conf(R2), then prune R2
 Prunes rules for which the rule antecedent and class are not
positively correlated, based on a χ2
test of statistical significance
 Classification based on generated/pruned rules
 If only one rule satisfies tuple X, assign the class label of the rule
 If a rule set S satisfies X, CMAR

divides S into groups according to class labels

uses a weighted χ² measure to find the strongest group of
rules, based on the statistical correlation of rules within a
group

assigns X the class label of the strongest group
628
Perceptron & Winnow
• Notation: x, w denote vectors; x, y, w (non-bold) denote scalars
• Input: {(x1, y1), …}
• Output: a classification function f(x) with
f(xi) > 0 for yi = +1 and f(xi) < 0 for yi = -1
• Decision boundary: f(x) = w·x + b = 0, i.e., w1x1 + w2x2 + b = 0 in two dimensions (x1, x2)
• Perceptron: update w additively
• Winnow: update w multiplicatively
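A minimal Python sketch of the two update rules (the learning rate eta, promotion factor alpha, and threshold theta are illustrative assumptions, not from the original slide):

```python
import numpy as np

def perceptron_update(w, b, x, y, eta=1.0):
    """Additive update: on a mistake, shift w toward y * x."""
    if y * (np.dot(w, x) + b) <= 0:        # misclassified
        w = w + eta * y * x
        b = b + eta * y
    return w, b

def winnow_update(w, x, y, theta, alpha=2.0):
    """Multiplicative update for {0, 1} features: rescale weights of active features."""
    y_hat = 1 if np.dot(w, x) >= theta else -1
    if y_hat != y:                          # misclassified
        w = w * np.power(alpha, y * x)      # promote (y = +1) or demote (y = -1) active features
    return w
```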
Data Mining:
Concepts and Techniques
(3rd ed.)
— Chapter 10 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
629
630
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
 Cluster Analysis: Basic Concepts
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Evaluation of Clustering
 Summary
630
631
What is Cluster Analysis?
 Cluster: A collection of data objects
 similar (or related) to one another within the same group
 dissimilar (or unrelated) to the objects in other groups
 Cluster analysis (or clustering, data segmentation, …)
 Finding similarities between data according to the
characteristics found in the data and grouping similar
data objects into clusters
 Unsupervised learning: no predefined classes (i.e., learning
by observations vs. learning by examples: supervised)
 Typical applications
 As a stand-alone tool to get insight into data distribution
 As a preprocessing step for other algorithms
632
Clustering for Data Understanding and
Applications
 Biology: taxonomy of living things: kingdom, phylum, class, order,
family, genus and species
 Information retrieval: document clustering
 Land use: Identification of areas of similar land use in an earth
observation database
 Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
 City-planning: Identifying groups of houses according to their house
type, value, and geographical location
 Earthquake studies: Observed earthquake epicenters should be
clustered along continent faults
 Climate: understanding Earth's climate, finding patterns of atmospheric
and ocean behavior
 Economic science: market research
633
Clustering as a Preprocessing Tool (Utility)
 Summarization:
 Preprocessing for regression, PCA, classification, and
association analysis
 Compression:
 Image processing: vector quantization
 Finding K-nearest Neighbors
 Localizing search to one or a small number of clusters
 Outlier detection
 Outliers are often viewed as those “far away” from any
cluster
Quality: What Is Good Clustering?
 A good clustering method will produce high quality
clusters
 high intra-class similarity: cohesive within clusters
 low inter-class similarity: distinctive between clusters
 The quality of a clustering method depends on
 the similarity measure used by the method
 its implementation, and
 its ability to discover some or all of the hidden patterns
634
Measure the Quality of Clustering
 Dissimilarity/Similarity metric
 Similarity is expressed in terms of a distance function,
typically metric: d(i, j)
 The definitions of distance functions are usually rather
different for interval-scaled, boolean, categorical,
ordinal, ratio, and vector variables
 Weights should be associated with different variables
based on applications and data semantics
 Quality of clustering:
 There is usually a separate “quality” function that
measures the “goodness” of a cluster.
 It is hard to define “similar enough” or “good enough”

The answer is typically highly subjective
635
Considerations for Cluster Analysis
 Partitioning criteria
 Single level vs. hierarchical partitioning (often, multi-level
hierarchical partitioning is desirable)
 Separation of clusters
 Exclusive (e.g., one customer belongs to only one region) vs.
non-exclusive (e.g., one document may belong to more than one
class)
 Similarity measure
 Distance-based (e.g., Euclidean, road network, vector) vs.
connectivity-based (e.g., density or contiguity)
 Clustering space
 Full space (often when low dimensional) vs. subspaces (often in
high-dimensional clustering)
636
Requirements and Challenges
 Scalability
 Clustering all the data instead of only samples
 Ability to deal with different types of attributes
 Numerical, binary, categorical, ordinal, linked, and mixture of
these
 Constraint-based clustering
 User may give inputs on constraints
 Use domain knowledge to determine input parameters
 Interpretability and usability
 Others
 Discovery of clusters with arbitrary shape
 Ability to deal with noisy data
 Incremental clustering and insensitivity to input order
 High dimensionality
637
Major Clustering Approaches (I)
 Partitioning approach:
 Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of square errors
 Typical methods: k-means, k-medoids, CLARANS
 Hierarchical approach:
 Create a hierarchical decomposition of the set of data (or objects)
using some criterion
 Typical methods: Diana, Agnes, BIRCH, CHAMELEON
 Density-based approach:
 Based on connectivity and density functions
 Typical methods: DBSCAN, OPTICS, DenClue
 Grid-based approach:
 based on a multiple-level granularity structure
 Typical methods: STING, WaveCluster, CLIQUE
638
Major Clustering Approaches (II)
 Model-based:
 A model is hypothesized for each of the clusters; the method tries to
find the best fit of the data to the given model
 Typical methods: EM, SOM, COBWEB
 Frequent pattern-based:
 Based on the analysis of frequent patterns
 Typical methods: p-Cluster
 User-guided or constraint-based:
 Clustering by considering user-specified or application-specific
constraints
 Typical methods: COD (obstacles), constrained clustering
 Link-based clustering:
 Objects are often linked together in various ways
 Massive links can be used to cluster objects: SimRank, LinkClus
639
640
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
 Cluster Analysis: Basic Concepts
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Evaluation of Clustering
 Summary
640
Partitioning Algorithms: Basic Concept
 Partitioning method: Partitioning a database D of n objects into a set of
k clusters, such that the sum of squared distances is minimized (where
ci is the centroid or medoid of cluster Ci)
 Given k, find a partition of k clusters that optimizes the chosen
partitioning criterion
 Global optimal: exhaustively enumerate all partitions
 Heuristic methods: k-means and k-medoids algorithms
 k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented
by the center of the cluster
 k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects
in the cluster
$E = \sum_{i=1}^{k} \sum_{p \in C_i} (p - c_i)^2$
641
The K-Means Clustering Method
 Given k, the k-means algorithm is implemented in four
steps:
 Partition objects into k nonempty subsets
 Compute seed points as the centroids of the
clusters of the current partitioning (the centroid is
the center, i.e., mean point, of the cluster)
 Assign each object to the cluster with the nearest
seed point
 Go back to Step 2, stop when the assignment does
not change
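A minimal NumPy sketch of these four steps (Euclidean distance and a random initial partition are assumed; it ignores the empty-cluster corner case):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means for an (n, d) array X and k clusters."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))             # 1. arbitrary partition into k subsets
    for _ in range(max_iter):
        # 2. compute seed points as the centroids of the current partition
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 3. assign each object to the cluster with the nearest seed point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):            # 4. stop when assignments do not change
            break
        labels = new_labels
    return labels, centroids
```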
642
An Example of K-Means Clustering
K = 2
[Figure: K-means iterations on the initial data set. Arbitrarily partition the objects into k groups, update the cluster centroids, reassign the objects, update the centroids again, and loop if needed]
643
 Partition objects into k nonempty
subsets
 Repeat
 Compute centroid (i.e., mean
point) for each partition
 Assign each object to the
cluster of its nearest centroid
 Until no change
Comments on the K-Means Method
 Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is
# iterations. Normally, k, t << n.

Compare: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k))
 Comment: Often terminates at a local optimum.
 Weakness
 Applicable only to objects in a continuous n-dimensional space

Use the k-modes method for categorical data

In comparison, k-medoids can be applied to a wide range of data
 Need to specify k, the number of clusters, in advance (there are
ways to automatically determine the best k; see Hastie et al., 2009)
 Sensitive to noisy data and outliers
 Not suitable for discovering clusters with non-convex shapes
644
Variations of the K-Means Method
 Most variants of k-means differ in
 Selection of the initial k means
 Dissimilarity calculations
 Strategies to calculate cluster means
 Handling categorical data: k-modes
 Replacing means of clusters with modes
 Using new dissimilarity measures to deal with categorical objects
 Using a frequency-based method to update modes of clusters
 A mixture of categorical and numerical data: k-prototype method
645
What Is the Problem of the K-Means Method?
 The k-means algorithm is sensitive to outliers !
 Since an object with an extremely large value may substantially
distort the distribution of the data
 K-Medoids: Instead of taking the mean value of the objects in a cluster
as a reference point, a medoid can be used, which is the most
centrally located object in the cluster
646
647
PAM: A Typical K-Medoids Algorithm
[Figure: PAM with K = 2. Arbitrarily choose k objects as initial medoids; assign each remaining object to the nearest medoid (total cost = 20); randomly select a non-medoid object O_random and compute the total cost of swapping (total cost = 26); swap O and O_random only if the quality is improved; loop until no change]
The K-Medoid Clustering Method
 K-Medoids Clustering: Find representative objects (medoids) in clusters
 PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)

Starts from an initial set of medoids and iteratively replaces one
of the medoids by one of the non-medoids if it improves the total
distance of the resulting clustering

PAM works effectively for small data sets, but does not scale
well for large data sets (due to the computational complexity)
 Efficiency improvement on PAM
 CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
 CLARANS (Ng & Han, 1994): Randomized re-sampling
648
649
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
 Cluster Analysis: Basic Concepts
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Evaluation of Clustering
 Summary
649
Hierarchical Clustering
 Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input, but
needs a termination condition
[Figure: five objects a, b, c, d, e. Agglomerative clustering (AGNES) merges them step by step (Step 0 to Step 4) into {a, b}, {d, e}, {c, d, e}, and finally {a, b, c, d, e}; divisive clustering (DIANA) proceeds in the reverse direction (Step 4 to Step 0)]
650
AGNES (Agglomerative Nesting)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical packages, e.g., Splus
 Use the single-link method and the dissimilarity matrix
 Merge nodes that have the least dissimilarity
 Go on in a non-descending fashion
 Eventually all nodes belong to the same cluster
651
Dendrogram: Shows How Clusters are Merged
Decompose data objects into several levels of nested
partitioning (a tree of clusters), called a dendrogram
A clustering of the data objects is obtained by cutting
the dendrogram at the desired level; each
connected component then forms a cluster
652
DIANA (Divisive Analysis)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical analysis packages, e.g., Splus
 Inverse order of AGNES
 Eventually each node forms a cluster on its own
653
Distance between Clusters
 Single link: smallest distance between an element in one cluster
and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)
 Complete link: largest distance between an element in one cluster
and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)
 Average: avg distance between an element in one cluster and an
element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)
 Centroid: distance between the centroids of two clusters, i.e.,
dist(Ki, Kj) = dist(Ci, Cj)
 Medoid: distance between the medoids of two clusters, i.e., dist(Ki,
Kj) = dist(Mi, Mj)
 Medoid: a chosen, centrally located object in the cluster
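A small SciPy illustration of these linkage criteria (the toy data and the cut into two clusters are assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                     # (n - 1) x 4 merge table
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram into 2 clusters
    print(method, labels)
```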
654
Centroid, Radius and Diameter of a
Cluster (for numerical data sets)
 Centroid: the “middle” of a cluster
 Radius: square root of average distance from any point
of the cluster to its centroid
 Diameter: square root of average mean squared
distance between all pairs of points in the cluster
$C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$
$R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$
$D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N (N-1)}}$
655
Extensions to Hierarchical Clustering
 Major weakness of agglomerative clustering methods
 Can never undo what was done previously
 Do not scale well: time complexity of at least O(n²),
where n is the total number of objects
 Integration of hierarchical & distance-based clustering
 BIRCH (1996): uses CF-tree and incrementally adjusts
the quality of sub-clusters
 CHAMELEON (1999): hierarchical clustering using
dynamic modeling
656
BIRCH (Balanced Iterative Reducing and
Clustering Using Hierarchies)
 Zhang, Ramakrishnan & Livny, SIGMOD’96
 Incrementally construct a CF (Clustering Feature) tree, a hierarchical
data structure for multiphase clustering
 Phase 1: scan DB to build an initial in-memory CF tree (a multi-level
compression of the data that tries to preserve the inherent clustering
structure of the data)
 Phase 2: use an arbitrary clustering algorithm to cluster the leaf
nodes of the CF-tree
 Scales linearly: finds a good clustering with a single scan and improves
the quality with a few additional scans
 Weakness: handles only numeric data, and sensitive to the order of the
data record
657
Clustering Feature Vector in BIRCH
Clustering Feature (CF): CF = (N, LS, SS)
N: number of data points
LS: linear sum of the N points: $LS = \sum_{i=1}^{N} X_i$
SS: square sum of the N points: $SS = \sum_{i=1}^{N} X_i^2$
Example: the 5 points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16,30), (54,190))
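A short sketch of computing a CF triple and of the additivity that lets BIRCH merge subclusters incrementally; it reproduces the five-point example above (the per-dimension form of SS follows that example):

```python
import numpy as np

def clustering_feature(points):
    """CF = (N, LS, SS) for a set of d-dimensional points."""
    pts = np.asarray(points, dtype=float)
    return len(pts), pts.sum(axis=0), (pts ** 2).sum(axis=0)

def merge_cf(cf1, cf2):
    """CFs are additive, so two subclusters can be merged in O(d) time."""
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(clustering_feature(points))   # (5, array([16., 30.]), array([ 54., 190.]))
```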
658
CF-Tree in BIRCH
 Clustering feature:
 Summary of the statistics for a given subcluster: the 0-th, 1st,
and 2nd moments of the subcluster from the statistical point
of view
 Registers crucial measurements for computing cluster and
utilizes storage efficiently
A CF tree is a height-balanced tree that stores the clustering
features for a hierarchical clustering
 A nonleaf node in a tree has descendants or “children”
 The nonleaf nodes store sums of the CFs of their children
 A CF tree has two parameters
 Branching factor: max # of children
 Threshold: max diameter of sub-clusters stored at the leaf nodes
659
The CF Tree Structure
[Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root and non-leaf nodes hold entries CF_i with child_i pointers; leaf nodes hold CF entries and are chained by prev/next pointers]
660
The Birch Algorithm
 Cluster Diameter
 For each point in the input
 Find closest leaf entry
 Add point to leaf entry and update CF
 If entry diameter > max_diameter, then split leaf, and possibly
parents
 Algorithm is O(n)
 Concerns
 Sensitive to the insertion order of data points
 Since the size of leaf nodes is fixed, the resulting clusters may not
be very natural
 Clusters tend to be spherical given the radius and diameter
measures
Cluster diameter: $D = \sqrt{\frac{1}{n (n-1)} \sum_{i} \sum_{j} (x_i - x_j)^2}$
661
CHAMELEON: Hierarchical Clustering Using
Dynamic Modeling (1999)
 CHAMELEON: G. Karypis, E. H. Han, and V. Kumar, 1999
 Measures the similarity based on a dynamic model
 Two clusters are merged only if the interconnectivity
and closeness (proximity) between two clusters are
high relative to the internal interconnectivity of the
clusters and closeness of items within the clusters
 Graph-based, and a two-phase algorithm
1. Use a graph-partitioning algorithm: cluster objects into
a large number of relatively small sub-clusters
2. Use an agglomerative hierarchical clustering algorithm:
find the genuine clusters by repeatedly combining
these sub-clusters
662
Overall Framework of CHAMELEON
Construct a sparse k-NN graph from the data set (p and q are connected if q is among the top-k closest neighbors of p); partition the graph into many small sub-clusters; merge the partitions into the final clusters based on relative interconnectivity (connectivity of c1 and c2 over their internal connectivity) and relative closeness (closeness of c1 and c2 over their internal closeness)
663
664
CHAMELEON (Clustering Complex Objects)
Probabilistic Hierarchical Clustering
 Algorithmic hierarchical clustering
 Nontrivial to choose a good distance measure
 Hard to handle missing attribute values
 Optimization goal not clear: heuristic, local search
 Probabilistic hierarchical clustering
 Use probabilistic models to measure distances between clusters
 Generative model: Regard the set of data objects to be clustered
as a sample of the underlying data generation mechanism to be
analyzed
 Easy to understand, same efficiency as algorithmic agglomerative
clustering method, can handle partially observed data
 In practice, assume that the generative models adopt common distribution
functions, e.g., Gaussian distribution or Bernoulli distribution, governed
by parameters
665
Generative Model
 Given a set of 1-D points X = {x1, …, xn} for clustering
analysis & assuming they are generated by a
Gaussian distribution:
 The probability that a point xi ∈ X is generated by the
model
 The likelihood that X is generated by the model:
 The task of learning the generative model: find the
parameters μ and σ² such that the likelihood is
maximized
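Written out explicitly (standard 1-D Gaussian forms, consistent with the text above): $P(x_i \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}$, and the likelihood of the whole set is $L(X \mid \mu, \sigma^2) = \prod_{i=1}^{n} P(x_i \mid \mu, \sigma^2)$; learning the model means choosing $\mu$ and $\sigma^2$ to maximize this likelihood.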
666
A Probabilistic Hierarchical Clustering Algorithm
 For a set of objects partitioned into m clusters C1, . . . , Cm, the quality
can be measured by Q({C1, ..., Cm}) = Π_i P(Ci),
where P() is the maximum likelihood
 Distance between clusters C1 and C2:
dist(C1, C2) = -log ( P(C1 ∪ C2) / (P(C1) P(C2)) )
 Algorithm: Progressively merge points and clusters
Input: D = {o1, ..., on}: a data set containing n objects
Output: A hierarchy of clusters
Method
Create a cluster for each object Ci = {oi}, 1 ≤ i ≤ n;
For i = 1 to n {
Find the pair of clusters Ci and Cj such that
Ci, Cj = argmax_{i ≠ j} { log ( P(Ci ∪ Cj) / (P(Ci) P(Cj)) ) };
If log ( P(Ci ∪ Cj) / (P(Ci) P(Cj)) ) > 0 then merge Ci and Cj }
667
668
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
 Cluster Analysis: Basic Concepts
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Evaluation of Clustering
 Summary
668
Density-Based Clustering Methods
 Clustering based on density (local cluster criterion), such
as density-connected points

Major features:

Discover clusters of arbitrary shape

Handle noise

One scan

Need density parameters as termination condition
 Several interesting studies:
 DBSCAN: Ester, et al. (KDD’96)
 OPTICS: Ankerst, et al (SIGMOD’99).
 DENCLUE: Hinneburg & D. Keim (KDD’98)
 CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-
based)
669
Density-Based Clustering: Basic Concepts
 Two parameters:
 Eps: Maximum radius of the neighbourhood
 MinPts: Minimum number of points in an Eps-
neighbourhood of that point
 NEps(p): {q belongs to D | dist(p,q) ≤ Eps}
 Directly density-reachable: A point p is directly density-
reachable from a point q w.r.t. Eps, MinPts if

p belongs to NEps(q)
 core point condition:
|NEps (q)| ≥ MinPts
MinPts = 5
Eps = 1 cm
p
q
670
Density-Reachable and Density-Connected
 Density-reachable:
 A point p is density-reachable from
a point q w.r.t. Eps, MinPts if there
is a chain of points p1, …, pn, p1 =
q, pn = p such that pi+1 is directly
density-reachable from pi
 Density-connected
 A point p is density-connected to a
point q w.r.t. Eps, MinPts if there is
a point o such that both, p and q
are density-reachable from o w.r.t.
Eps and MinPts
671
DBSCAN: Density-Based Spatial Clustering of
Applications with Noise
 Relies on a density-based notion of cluster: A cluster is
defined as a maximal set of density-connected points
 Discovers clusters of arbitrary shape in spatial databases
with noise
Core
Border
Outlier
Eps = 1cm
MinPts = 5
672
DBSCAN: The Algorithm
 Arbitrarily select a point p
 Retrieve all points density-reachable from p w.r.t. Eps and
MinPts
 If p is a core point, a cluster is formed
 If p is a border point, no points are density-reachable
from p and DBSCAN visits the next point of the database
 Continue the process until all of the points have been
processed
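A scikit-learn illustration of the procedure (the data and the parameter values eps = 0.5, min_samples = 3 are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 1.1],     # one dense group
              [5.0, 5.0], [5.1, 4.9], [5.2, 5.1],     # another dense group
              [9.0, 0.5]])                            # an isolated point

db = DBSCAN(eps=0.5, min_samples=3).fit(X)            # eps ~ Eps, min_samples ~ MinPts
print(db.labels_)                                     # noise points are labeled -1
```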
673
DBSCAN: Sensitive to Parameters
674
OPTICS: A Cluster-Ordering Method (1999)
 OPTICS: Ordering Points To Identify the Clustering
Structure
 Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
 Produces a special order of the database wrt its
density-based clustering structure
 This cluster-ordering contains info equiv to the density-
based clusterings corresponding to a broad range of
parameter settings
 Good for both automatic and interactive cluster
analysis, including finding intrinsic clustering structure
 Can be represented graphically or using visualization
techniques
675
OPTICS: Some Extension from DBSCAN
 Index-based:

k = number of dimensions, N = 20, p = 75%, M = N(1 − p) = 5
 Complexity: O(N log N)
 Core distance of an object o: the minimum ε such that o is a core object
 Reachability distance of p from o: max(core-distance(o), d(o, p))

Example (MinPts = 5, ε = 3 cm): r(p1, o) = 2.8 cm, r(p2, o) = 4 cm
676
[Figure: OPTICS reachability plot, showing reachability-distance (including 'undefined' values) against the cluster order of the objects]
677
678
Density-Based Clustering: OPTICS & Its Applications
DENCLUE: Using Statistical Density Functions
 DENsity-based CLUstEring by Hinneburg & Keim (KDD’98)
 Using statistical density functions:
 Major features
 Solid mathematical foundation
 Good for data sets with large amounts of noise
 Allows a compact mathematical description of arbitrarily shaped
clusters in high-dimensional data sets
 Significantly faster than existing algorithms (e.g., DBSCAN)
 But needs a large number of parameters
Influence of y on x: $f_{Gaussian}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}}$
Total influence on x: $f_{Gaussian}^{D}(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$
Gradient of x in the direction of $x_i$: $\nabla f_{Gaussian}^{D}(x, x_i) = \sum_{i=1}^{N} (x_i - x) \cdot e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$
679
Denclue: Technical Essence
 Uses grid cells, but only keeps information about grid cells that do
actually contain data points, and manages these cells in a tree-based
access structure
 Influence function: describes the impact of a data point within its
neighborhood
 Overall density of the data space can be calculated as the sum of the
influence functions of all data points
 Clusters can be determined mathematically by identifying density
attractors
 Density attractors are local maxima of the overall density function
 Center-defined clusters: assign to each density attractor the points
density-attracted to it
 Arbitrarily shaped clusters: merge density attractors that are connected
through paths of high density (> threshold)
680
Density Attractor
681
Center-Defined and Arbitrary
682
683
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
 Cluster Analysis: Basic Concepts
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Evaluation of Clustering
 Summary
683
Grid-Based Clustering Method
 Using multi-resolution grid data structure
 Several interesting methods
 STING (a STatistical INformation Grid approach) by
Wang, Yang and Muntz (1997)
 WaveCluster by Sheikholeslami, Chatterjee, and
Zhang (VLDB’98)

A multi-resolution clustering approach using
wavelet method
 CLIQUE: Agrawal, et al. (SIGMOD’98)

Both grid-based and subspace clustering
684
STING: A Statistical Information Grid Approach
 Wang, Yang and Muntz (VLDB’97)
 The spatial area is divided into rectangular cells
 There are several levels of cells corresponding to different
levels of resolution
685
i-th layer
(i-1)st layer
1st layer
The STING Clustering Method
 Each cell at a high level is partitioned into a number of
smaller cells in the next lower level
 Statistical info of each cell is calculated and stored
beforehand and is used to answer queries
 Parameters of higher level cells can be easily calculated
from parameters of lower level cell
 count, mean, standard deviation (s), min, max
 type of distribution—normal, uniform, etc.
 Use a top-down approach to answer spatial data queries
 Start from a pre-selected layer—typically with a small
number of cells
 For each cell in the current level compute the confidence
interval
686
STING Algorithm and Its Analysis
 Remove the irrelevant cells from further consideration
 When finished examining the current layer, proceed to the
next lower level
 Repeat this process until the bottom layer is reached
 Advantages:
 Query-independent, easy to parallelize, incremental
update
 O(K), where K is the number of grid cells at the lowest
level
 Disadvantages:
 All the cluster boundaries are either horizontal or
vertical, and no diagonal boundary is detected
687
688
CLIQUE (Clustering In QUEst)
 Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)
 Automatically identifying subspaces of a high dimensional data space
that allow better clustering than original space
 CLIQUE can be considered as both density-based and grid-based
 It partitions each dimension into the same number of equal-length
intervals
 It partitions an m-dimensional data space into non-overlapping
rectangular units
 A unit is dense if the fraction of total data points contained in the unit
exceeds the input model parameter
 A cluster is a maximal set of connected dense units within a
subspace
689
CLIQUE: The Major Steps
 Partition the data space and find the number of points that
lie inside each cell of the partition.
 Identify the subspaces that contain clusters using the
Apriori principle
 Identify clusters
 Determine dense units in all subspaces of interest
 Determine connected dense units in all subspaces of
interest
 Generate minimal description for the clusters
 Determine maximal regions that cover a cluster of
connected dense units for each cluster
 Determination of minimal cover for each cluster
690
[Figure: CLIQUE example. Dense units are found in the (age, salary (×$10,000)) and (age, vacation (weeks)) subspaces for age 20 to 60, with density threshold 3; the intersection of the dense regions (roughly age 30 to 50) identifies a candidate cluster in the (age, vacation, salary) subspace]
691
Strength and Weakness of CLIQUE
 Strength
 automatically finds subspaces of the highest
dimensionality such that high density clusters exist in
those subspaces
 insensitive to the order of records in input and does not
presume some canonical data distribution
 scales linearly with the size of input and has good
scalability as the number of dimensions in the data
increases
 Weakness
 The accuracy of the clustering result may be degraded in
exchange for the simplicity of the method
692
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
 Cluster Analysis: Basic Concepts
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Evaluation of Clustering
 Summary
692
Assessing Clustering Tendency
 Assess if non-random structure exists in the data by measuring the
probability that the data is generated by a uniform data distribution
 Test spatial randomness by a statistical test: the Hopkins Statistic
 Given a dataset D regarded as a sample of a random variable o,
determine how far away o is from being uniformly distributed in
the data space
 Sample n points, p1, …, pn, uniformly from D. For each pi, find its
nearest neighbor in D: xi = min{dist (pi, v)} where v in D
 Sample n points, q1, …, qn, uniformly from D. For each qi, find its
nearest neighbor in D – {qi}: yi = min{dist (qi, v)} where v in D and
v ≠ qi
 Calculate the Hopkins Statistic:
 If D is uniformly distributed, ∑ xi and ∑ yi will be close to each
other, and H is close to 0.5. If D is highly skewed, H is close to 0
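A commonly used form of the statistic, consistent with the interpretation above (the exact form is an assumption):

$H = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i}$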
Determine the Number of Clusters
 Empirical method
 # of clusters ≈√n/2 for a dataset of n points
 Elbow method
 Use the turning point in the curve of sum of within cluster variance
w.r.t the # of clusters
 Cross validation method
 Divide a given data set into m parts
 Use m – 1 parts to obtain a clustering model
 Use the remaining part to test the quality of the clustering

E.g., For each point in the test set, find the closest centroid, and
use the sum of squared distance between all points in the test
set and the closest centroids to measure how well the model fits
the test set
 For any k > 0, repeat it m times, compare the overall quality measure
w.r.t. different k’s, and find # of clusters that fits the data the best
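A scikit-learn sketch of the elbow method described above (the synthetic data and the range of k are assumptions; inertia_ is the sum of squared distances to the closest centroid):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))     # replace with the real data set

sse = {}
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_                                # within-cluster sum of squared distances
# choose k at the "elbow": the point where adding clusters stops reducing SSE sharply
print(sse)
```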
694
Measuring Clustering Quality
 Two methods: extrinsic vs. intrinsic
 Extrinsic: supervised, i.e., the ground truth is available
 Compare a clustering against the ground truth using
certain clustering quality measure
 Ex. BCubed precision and recall metrics
 Intrinsic: unsupervised, i.e., the ground truth is unavailable
 Evaluate the goodness of a clustering by considering
how well the clusters are separated, and how compact
the clusters are
 Ex. Silhouette coefficient
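An intrinsic-evaluation sketch using the silhouette coefficient (the synthetic blobs and k = 4 are assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))   # values near 1 indicate compact, well-separated clusters
```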
695
Measuring Clustering Quality: Extrinsic Methods
 Clustering quality measure: Q(C, Cg), for a clustering C
given the ground truth Cg.
 Q is good if it satisfies the following 4 essential criteria
 Cluster homogeneity: the purer, the better
 Cluster completeness: should assign objects belonging to
the same category in the ground truth to the same
cluster
 Rag bag: putting a heterogeneous object into a pure
cluster should be penalized more than putting it into a
rag bag (i.e., “miscellaneous” or “other” category)
 Small cluster preservation: splitting a small category
into pieces is more harmful than splitting a large
category into pieces
696
697
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
 Cluster Analysis: Basic Concepts
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Evaluation of Clustering
 Summary
697
Summary
 Cluster analysis groups objects based on their similarity and has
wide applications
 Measure of similarity can be computed for various types of data
 Clustering algorithms can be categorized into partitioning methods,
hierarchical methods, density-based methods, grid-based methods,
and model-based methods
 K-means and K-medoids algorithms are popular partitioning-based
clustering algorithms
 Birch and Chameleon are interesting hierarchical clustering
algorithms, and there are also probabilistic hierarchical clustering
algorithms
 DBSCAN, OPTICS, and DENCLUE are interesting density-based
algorithms
 STING and CLIQUE are grid-based methods, where CLIQUE is also
a subspace clustering algorithm
 Quality of clustering results can be evaluated in various ways
698
699
CS512-Spring 2011: An Introduction
 Coverage
 Cluster Analysis: Chapter 11
 Outlier Detection: Chapter 12
 Mining Sequence Data: BK2: Chapter 8
 Mining Graph Data: BK2: Chapter 9
 Social and Information Network Analysis

BK2: Chapter 9

Partial coverage: Mark Newman: “Networks: An Introduction”, Oxford U., 2010

Scattered coverage: Easley and Kleinberg, “Networks, Crowds, and Markets:
Reasoning About a Highly Connected World”, Cambridge U., 2010

Recent research papers
 Mining Data Streams: BK2: Chapter 8
 Requirements
 One research project
 One class presentation (15 minutes)
 Two homeworks (no programming assignment)
 Two midterm exams (no final exam)
References (1)
 R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace
clustering of high dimensional data for data mining applications. SIGMOD'98
 M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
 M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points
to identify the clustering structure, SIGMOD’99.
 Beil F., Ester M., Xu X.: "Frequent Term-Based Text Clustering", KDD'02
 M. M. Breunig, H.-P. Kriegel, R. Ng, J. Sander. LOF: Identifying Density-Based
Local Outliers. SIGMOD 2000.
 M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for
discovering clusters in large spatial databases. KDD'96.
 M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial
databases: Focusing techniques for efficient class identification. SSD'95.
 D. Fisher. Knowledge acquisition via incremental conceptual clustering.
Machine Learning, 2:139-172, 1987.
 D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An
approach based on dynamic systems. VLDB’98.
 V. Ganti, J. Gehrke, R. Ramakrishan. CACTUS Clustering Categorical Data
Using Summaries. KDD'99.
700
References (2)
 D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An
approach based on dynamic systems. In Proc. VLDB’98.
 S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for
large databases. SIGMOD'98.
 S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for
categorical attributes. In ICDE'99, pp. 512-521, Sydney, Australia, March
1999.
 A. Hinneburg and D. A. Keim: An Efficient Approach to Clustering in Large
Multimedia Databases with Noise. KDD’98.
 A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall,
1988.
 G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A Hierarchical Clustering
Algorithm Using Dynamic Modeling. COMPUTER, 32(8): 68-75, 1999.
 L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to
Cluster Analysis. John Wiley & Sons, 1990.
 E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large
datasets. VLDB’98.
701
References (3)
 G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to
Clustering. John Wiley and Sons, 1988.
 R. Ng and J. Han. Efficient and effective clustering method for spatial data mining.
VLDB'94.
 L. Parsons, E. Haque and H. Liu, Subspace Clustering for High Dimensional Data: A
Review, SIGKDD Explorations, 6(1), June 2004
 E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large
data sets. Proc. 1996 Int. Conf. on Pattern Recognition
 G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution
clustering approach for very large spatial databases. VLDB’98.
 A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-Based Clustering
in Large Databases, ICDT'01.
 A. K. H. Tung, J. Hou, and J. Han. Spatial Clustering in the Presence of Obstacles,
ICDE'01
 H. Wang, W. Wang, J. Yang, and P.S. Yu. Clustering by pattern similarity in large data
sets, SIGMOD’02
 W. Wang, J. Yang, and R. Muntz. STING: A Statistical Information Grid Approach to Spatial
Data Mining, VLDB’97
 T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : An efficient data clustering method
for very large databases. SIGMOD'96
 X. Yin, J. Han, and P. S. Yu, “LinkClus: Efficient Clustering via Heterogeneous Semantic
Links”, VLDB'06
702
Slides unused in class
703
704
A Typical K-Medoids Algorithm (PAM)
[Figure: PAM with K = 2. Arbitrarily choose k objects as initial medoids; assign each remaining object to the nearest medoid (total cost = 20); randomly select a non-medoid object O_random and compute the total cost of swapping (total cost = 26); swap O and O_random only if the quality is improved; loop until no change]
705
PAM (Partitioning Around Medoids) (1987)
 PAM (Kaufman and Rousseeuw, 1987), built in Splus
 Use real objects to represent the clusters
 Select k representative objects arbitrarily
 For each pair of non-selected object h and selected
object i, calculate the total swapping cost TCih
 For each pair of i and h,
 If TCih < 0, i is replaced by h

Then assign each non-selected object to the most
similar representative object
 repeat steps 2-3 until there is no change
706
PAM Clustering: Finding the Best Cluster Center
 Case 1: p currently belongs to oj. If oj is replaced by orandom as a
representative object and p is closest to one of the other
representative objects oi, then p is reassigned to oi
707
What Is the Problem with PAM?
 PAM is more robust than k-means in the presence of
noise and outliers because a medoid is less influenced by
outliers or other extreme values than a mean
 PAM works efficiently for small data sets but does not
scale well for large data sets.
 O(k(n−k)²) for each iteration,
where n is # of data points and k is # of clusters
 Sampling-based method:
CLARA (Clustering LARge Applications)
708
CLARA (Clustering Large Applications)
(1990)
 CLARA (Kaufmann and Rousseeuw in 1990)
 Built in statistical analysis packages, such as SPlus
 It draws multiple samples of the data set, applies PAM
on each sample, and gives the best clustering as the
output
 Strength: deals with larger data sets than PAM
 Weakness:
 Efficiency depends on the sample size
 A good clustering based on samples will not
necessarily represent a good clustering of the whole
data set if the sample is biased
709
CLARANS (“Randomized” CLARA) (1994)
 CLARANS (A Clustering Algorithm based on Randomized
Search) (Ng and Han’94)
 Draws sample of neighbors dynamically
 The clustering process can be presented as searching a
graph where every node is a potential solution, that is, a
set of k medoids
 If the local optimum is found, it starts with new randomly
selected node in search for a new local optimum
 Advantages: More efficient and scalable than both PAM
and CLARA
 Further improvement: Focusing techniques and spatial
access structures (Ester et al.’95)
710
ROCK: Clustering Categorical Data
 ROCK: RObust Clustering using linKs
 S. Guha, R. Rastogi & K. Shim, ICDE’99
 Major ideas
 Use links to measure similarity/proximity
 Not distance-based
 Algorithm: sampling-based clustering
 Draw random sample
 Cluster with links
 Label data in disk
 Experiments
 Congressional voting, mushroom data
711
Similarity Measure in ROCK
 Traditional measures for categorical data may not work well, e.g.,
Jaccard coefficient
 Example: Two groups (clusters) of transactions
 C1. <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e},
{a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
 C2. <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
 Jaccard co-efficient may lead to wrong clustering result
 C1: 0.2 ({a, b, c}, {b, d, e}} to 0.5 ({a, b, c}, {a, b, d})
 C1 & C2: could be as high as 0.5 ({a, b, c}, {a, b, f})
 Jaccard co-efficient-based similarity function:
 Ex. Let T1 = {a, b, c}, T2 = {c, d, e}
$Sim(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$
$Sim(T_1, T_2) = \frac{|\{c\}|}{|\{a, b, c, d, e\}|} = \frac{1}{5} = 0.2$
712
Link Measure in ROCK
 Clusters

C1:<a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e},
{b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}

C2: <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
 Neighbors

Two transactions are neighbors if sim(T1,T2) > threshold
 Let T1 = {a, b, c}, T2 = {c, d, e}, T3 = {a, b, f}

T1 connected to: {a,b,d}, {a,b,e}, {a,c,d}, {a,c,e}, {b,c,d}, {b,c,e},
{a,b,f}, {a,b,g}

T2 connected to: {a,c,d}, {a,c,e}, {a,d,e}, {b,c,e}, {b,d,e}, {b,c,d}

T3 connected to: {a,b,c}, {a,b,d}, {a,b,e}, {a,b,g}, {a,f,g}, {b,f,g}
 Link Similarity

Link similarity between two transactions is the # of common neighbors
 link(T1, T2) = 4, since they have 4 common neighbors

{a, c, d}, {a, c, e}, {b, c, d}, {b, c, e}
 link(T1, T3) = 3, since they have 3 common neighbors

{a, b, d}, {a, b, e}, {a, b, g}
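A short Python sketch that reproduces the neighbor and link counts above (the neighbor threshold 0.4 is an assumption; any threshold strictly between 0.2 and 0.5 gives the same neighbor lists):

```python
def jaccard(t1, t2):
    return len(t1 & t2) / len(t1 | t2)

transactions = [frozenset(t) for t in (
    {'a','b','c'}, {'a','b','d'}, {'a','b','e'}, {'a','c','d'}, {'a','c','e'},
    {'a','d','e'}, {'b','c','d'}, {'b','c','e'}, {'b','d','e'}, {'c','d','e'},
    {'a','b','f'}, {'a','b','g'}, {'a','f','g'}, {'b','f','g'})]

theta = 0.4                                   # neighbor threshold (assumed)
neighbors = {t: {u for u in transactions if u != t and jaccard(t, u) > theta}
             for t in transactions}

def link(t1, t2):
    """link(T1, T2) = number of common neighbors."""
    return len(neighbors[t1] & neighbors[t2])

T1, T2, T3 = map(frozenset, ({'a','b','c'}, {'c','d','e'}, {'a','b','f'}))
print(link(T1, T2), link(T1, T3))             # 4 3
```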
Aggregation-Based Similarity Computation
[Figure: two SimTrees ST1 and ST2. Node a is associated with leaf nodes n10, n11, n12 (children of n4) and node b with leaf nodes n13, n14 (children of n5); the similarities 0.9, 1.0, 0.8 and 0.9, 1.0 are shown on the corresponding links, and s(n4, n5) = 0.2]
For each node nk ∈ {n10, n11, n12} and nl ∈ {n13, n14}, their path-based similarity is simp(nk, nl) = s(nk, n4) · s(n4, n5) · s(n5, nl).
$sim(a, b) = \frac{\sum_{k=10}^{12} s(n_k, n_4)}{3} \cdot s(n_4, n_5) \cdot \frac{\sum_{l=13}^{14} s(n_5, n_l)}{2} = 0.9 \times 0.2 \times 0.95 = 0.171$
After aggregation, we reduce the quadratic-time computation to
linear-time computation: the computation above takes O(3 + 2) time.
714
Computing Similarity with Aggregation
To compute sim(na,nb):
 Find all pairs of sibling nodes ni and nj, so that na linked with ni and nb
with nj.
 Calculate similarity (and weight) between na and nb w.r.t. ni and nj.
 Calculate weighted average similarity between na and nb w.r.t. all such
pairs.
sim(na, nb) = avg_sim(na,n4) x s(n4, n5) x avg_sim(nb,n5)
= 0.9 x 0.2 x 0.95 = 0.171
sim(na, nb) can be computed
from aggregated similarities
Average similarity and total weight: a: (0.9, 3), b: (0.95, 2); s(n4, n5) = 0.2 (the same configuration as in the previous figure)
716
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
 Cluster Analysis: Basic Concepts
 Overview of Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Summary
716
Link-Based Clustering: Calculate Similarities
Based On Links
Jeh & Widom, KDD’2002: SimRank
Two objects are similar if they are
linked with the same or similar
objects
 The similarity between two
objects x and y is defined as
the average similarity between
objects linked with x and those
with y:
 Issue: Expensive to compute:
 For a dataset of N objects
and M links, it takes O(N²)
space and O(M²) time to
compute all similarities.
[Figure: a linked structure of Authors (Tom, Mike, Cathy, John, Mary), Proceedings (sigmod03-05, vldb03-05, aaai04-05), and Conferences (sigmod, vldb, aaai)]
$sim(a, b) = \frac{C}{|I(a)| \, |I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} sim(I_i(a), I_j(b))$
717
Observation 1: Hierarchical Structures
 Hierarchical structures often exist naturally among objects
(e.g., taxonomy of animals)
All
electronics
grocery apparel
DVD camera
TV
A hierarchical structure of
products in Walmart
Articles
Words
Relationships between articles and
words (Chakrabarti, Papadimitriou,
Modha, Faloutsos, 2004)
718
Observation 2: Distribution of Similarity
 Power law distribution exists in similarities
 56% of similarity entries are in [0.005, 0.015]
 1.4% of similarity entries are larger than 0.1
 Can we design a data structure that stores the significant
similarities and compresses insignificant ones?
[Figure: distribution of SimRank similarities among DBLP authors, plotting the portion of entries against the similarity value]
719
A Novel Data Structure: SimTree
Each leaf node
represents an object
Each non-leaf node
represents a group
of similar lower-level
nodes
Similarities between
siblings are stored
Consumer
electronics
Apparels
Canon A40
digital camera
Sony V3 digital
camera
Digital
Cameras
TVs
720
Similarity Defined by SimTree
 Path-based node similarity
 simp(n7,n8) = s(n7, n4) x s(n4, n5) x s(n5, n8)
 Similarity between two nodes is the average similarity
between objects linked with them in other SimTrees
 Adjustment ratio for node x =
(average similarity between x and all other nodes) /
(average similarity between x's parent and all other nodes)
[Figure: a SimTree with non-leaf nodes n1 to n6 and leaf nodes n7, n8, n9, annotated with the similarities stored between sibling nodes (e.g., between n1 and n2) and with the adjustment ratio for node n7]
721
LinkClus: Efficient Clustering via
Heterogeneous Semantic Links
Method
 Initialize a SimTree for objects of each type
 Repeat until stable
 For each SimTree, update the similarities between its
nodes using similarities in other SimTrees

Similarity between two nodes x and y is the average
similarity between objects linked with them
 Adjust the structure of each SimTree

Assign each node to the parent node that it is most
similar to
For details: X. Yin, J. Han, and P. S. Yu, “LinkClus: Efficient
Clustering via Heterogeneous Semantic Links”, VLDB'06
722
Initialization of SimTrees
 Initializing a SimTree
 Repeatedly find groups of tightly related nodes, which
are merged into a higher-level node
 Tightness of a group of nodes
 For a group of nodes {n1, …, nk}, its tightness is
defined as the number of leaf nodes in other SimTrees
that are connected to all of {n1, …, nk}
[Figure: nodes n1 and n2 are linked to leaf nodes 1 to 5 in another SimTree; three of those leaf nodes are connected to both n1 and n2, so the tightness of {n1, n2} is 3]
723
Finding Tight Groups by Freq. Pattern Mining
 Finding tight groups: reduced to frequent pattern mining
 Procedure of initializing a tree
 Start from leaf nodes (level-0)
 At each level l, find non-overlapping groups of similar
nodes with frequent pattern mining
[Figure: the links from leaf nodes 1 to 9 in another SimTree to n1, n2, n3, n4 are reduced to transactions such as {n1}, {n1, n2}, {n2}, {n2, n3, n4}, {n3, n4}, {n4}; frequent patterns yield the groups g1 = {n1, n2} and g2 = {n3, n4}. The tightness of a group of nodes is the support of a frequent pattern]
724
Adjusting SimTree Structures
 After similarity changes, the tree structure also needs to be
changed
 If a node is more similar to its parent’s sibling, then move
it to be a child of that sibling
 Try to move each node to its parent’s sibling that it is most
similar to, under the constraint that each parent node can
have at most c children
[Figure: node n7 is more similar to its parent's sibling (similarity 0.9 vs. 0.8), so it is moved to become a child of that sibling]
725
Complexity
                           Time            Space
Updating similarities      O(M (log N)²)   O(M + N)
Adjusting tree structures  O(N)            O(N)
LinkClus                   O(M (log N)²)   O(M + N)
SimRank                    O(M²)           O(N²)
For two types of objects, N in each, and M linkages between them.
726
Experiment: Email Dataset
 F. Nielsen. Email dataset.
www.imm.dtu.dk/~rem/data/Email-1431.zip
 370 emails on conferences, 272 on jobs,
and 789 spam emails
 Accuracy: measured by manually labeled
data
 Accuracy of clustering: % of pairs of objects
in the same cluster that share common label
Approach Accuracy time (s)
LinkClus 0.8026 1579.6
SimRank 0.7965 39160
ReCom 0.5711 74.6
F-SimRank 0.3688 479.7
CLARANS 0.4768 8.55
 Approaches compared:
 SimRank (Jeh & Widom, KDD 2002): Computing pair-wise similarities
 SimRank with FingerPrints (F-SimRank): Fogaras & Rácz, WWW 2005

pre-computes a large sample of random paths from each object and uses
samples of two objects to estimate SimRank similarity
 ReCom (Wang et al. SIGIR 2003)

Iteratively clustering objects using cluster labels of linked objects
727
WaveCluster: Clustering by Wavelet Analysis (1998)
 Sheikholeslami, Chatterjee, and Zhang (VLDB’98)
 A multi-resolution clustering approach which applies wavelet transform
to the feature space; both grid-based and density-based
 Wavelet transform: A signal processing technique that decomposes a
signal into different frequency sub-band
 Data are transformed to preserve relative distance between objects
at different levels of resolution
 Allows natural clusters to become more distinguishable
728
The WaveCluster Algorithm
 How to apply wavelet transform to find clusters
 Summarizes the data by imposing a multidimensional grid
structure onto data space
 These multidimensional spatial data objects are represented in an
n-dimensional feature space
 Apply wavelet transform on feature space to find the dense
regions in the feature space
 Apply wavelet transform multiple times which result in clusters at
different scales from fine to coarse
 Major features:
 Complexity O(N)
 Detect arbitrary shaped clusters at different scales
 Not sensitive to noise, not sensitive to input order
 Only applicable to low dimensional data
729
730
Quantization
& Transformation
 Quantize data into m-D grid structure,
then wavelet transform
a) scale 1: high resolution
b) scale 2: medium resolution
c) scale 3: low resolution
731
Data Mining:
Concepts and Techniques
(3rd ed.)
— Chapter 11 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
731
732
Review: Basic Cluster Analysis Methods (Chap.
10)
 Cluster Analysis: Basic Concepts
 Group data so that object similarity is high within clusters but low
across clusters
 Partitioning Methods
 K-means and k-medoids algorithms and their refinements
 Hierarchical Methods
 Agglomerative and divisive methods, BIRCH, CHAMELEON
 Density-Based Methods
 DBSCAN, OPTICS, and DENCLUE
 Grid-Based Methods
 STING and CLIQUE (subspace clustering)
 Evaluation of Clustering
 Assess clustering tendency, determine # of clusters, and measure
clustering quality
732
K-Means Clustering
K = 2
[Figure: K-means iterations on the initial data set. Arbitrarily partition the objects into k groups, update the cluster centroids, reassign the objects, update the centroids again, and loop if needed]
733
 Partition objects into k nonempty
subsets
 Repeat
 Compute centroid (i.e., mean
point) for each partition
 Assign each object to the
cluster of its nearest centroid
 Until no change
Hierarchical Clustering
 Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input, but
needs a termination condition
[Figure: five objects a, b, c, d, e. Agglomerative clustering (AGNES) merges them step by step (Step 0 to Step 4) into {a, b}, {d, e}, {c, d, e}, and finally {a, b, c, d, e}; divisive clustering (DIANA) proceeds in the reverse direction (Step 4 to Step 0)]
734
Distance between Clusters
 Single link: smallest distance between an element in one cluster
and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)
 Complete link: largest distance between an element in one cluster
and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)
 Average: avg distance between an element in one cluster and an
element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)
 Centroid: distance between the centroids of two clusters, i.e.,
dist(Ki, Kj) = dist(Ci, Cj)
 Medoid: distance between the medoids of two clusters, i.e., dist(Ki,
Kj) = dist(Mi, Mj)
 Medoid: a chosen, centrally located object in the cluster
735
BIRCH and the Clustering Feature
(CF) Tree Structure
[Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root and non-leaf nodes hold entries CF_i with child_i pointers; leaf nodes hold CF entries and are chained by prev/next pointers]
736
Example: the 5 points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16,30), (54,190))
Overall Framework of CHAMELEON
Construct a sparse k-NN graph from the data set (p and q are connected if q is among the top-k closest neighbors of p); partition the graph into many small sub-clusters; merge the partitions into the final clusters based on relative interconnectivity (connectivity of c1 and c2 over their internal connectivity) and relative closeness (closeness of c1 and c2 over their internal closeness)
737
Density-Based Clustering: DBSCAN
 Two parameters:
 Eps: Maximum radius of the neighbourhood
 MinPts: Minimum number of points in an Eps-
neighbourhood of that point
 NEps(p): {q belongs to D | dist(p,q) ≤ Eps}
 Directly density-reachable: A point p is directly density-
reachable from a point q w.r.t. Eps, MinPts if

p belongs to NEps(q)
 core point condition:
|NEps (q)| ≥ MinPts
MinPts = 5
Eps = 1 cm
p
q
738
739
Density-Based Clustering: OPTICS & Its Applications
DENCLUE: Center-Defined and Arbitrary
740
STING: A Statistical Information Grid Approach
 Wang, Yang and Muntz (VLDB’97)
 The spatial area is divided into rectangular cells
 There are several levels of cells corresponding to different
levels of resolution
741
i-th layer
(i-1)st layer
1st layer
Evaluation of Clustering Quality
 Assessing Clustering Tendency
 Assess if non-random structure exists in the data by measuring
the probability that the data is generated by a uniform data
distribution
 Determine the Number of Clusters
 Empirical method: # of clusters ≈√n/2
 Elbow method: Use the turning point in the curve of sum of within
cluster variance w.r.t # of clusters
 Cross validation method
 Measuring Clustering Quality
 Extrinsic: supervised

Compare a clustering against the ground truth using certain
clustering quality measure
 Intrinsic: unsupervised

Evaluate the goodness of a clustering by considering how well
the clusters are separated, and how compact the clusters are
742
743
Outline of Advanced Clustering Analysis
 Probability Model-Based Clustering
 Each object may take a probability to belong to a cluster
 Clustering High-Dimensional Data
 Curse of dimensionality: Difficulty of distance measure in high-D
space
 Clustering Graphs and Network Data
 Similarity measurement and clustering methods for graph and
networks
 Clustering with Constraints
 Cluster analysis under different kinds of constraints, e.g., that raised
from background knowledge or spatial distribution of the objects
744
Chapter 11. Cluster Analysis: Advanced Methods
 Probability Model-Based Clustering
 Clustering High-Dimensional Data
 Clustering Graphs and Network Data
 Clustering with Constraints
 Summary
744
Fuzzy Set and Fuzzy Cluster
 Clustering methods discussed so far
 Every data object is assigned to exactly one cluster
 Some applications may need fuzzy or soft cluster assignment
 Ex. An e-game could belong to both entertainment and software
 Methods: fuzzy clusters and probabilistic model-based clusters
 Fuzzy cluster: A fuzzy set S: FS : X → [0, 1] (value between 0 and 1)
 Example: Popularity of cameras is defined as a fuzzy mapping
 Then, A(0.05), B(1), C(0.86), D(0.27)
745
Fuzzy (Soft) Clustering
 Example: Let cluster features be
 C1 :“digital camera” and “lens”
 C2: “computer“
 Fuzzy clustering
 k fuzzy clusters C1, …, Ck, represented as a partition matrix M = [wij]
 P1: for each object oi and cluster Cj, 0 ≤ wij ≤ 1 (fuzzy set)
 P2: for each object oi, its memberships sum to 1 (equal participation in the clustering)
 P3: for each cluster Cj, the total membership lies strictly between 0 and n (ensures there is no empty cluster)
 Let c1, …, ck be the centers of the k clusters
 For an object oi, the sum of the squared error (SSE) is defined with a parameter p (see the formulas below)
 For a cluster Ci, the SSE sums over its member weights (see below)
 Measure how well a clustering fits the data: the total SSE over all objects and clusters (see below)
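In standard fuzzy-clustering form (with fuzzifier p > 1; treat the exact expressions as an assumption consistent with the description above): for an object $o_i$, $SSE(o_i) = \sum_{j=1}^{k} w_{ij}^{p}\, dist(o_i, c_j)^2$; for a cluster $C_j$, $SSE(C_j) = \sum_{i=1}^{n} w_{ij}^{p}\, dist(o_i, c_j)^2$; for the whole clustering, $SSE(\mathcal{C}) = \sum_{i=1}^{n} \sum_{j=1}^{k} w_{ij}^{p}\, dist(o_i, c_j)^2$, with constraints $\sum_{j=1}^{k} w_{ij} = 1$ (P2) and $0 < \sum_{i=1}^{n} w_{ij} < n$ (P3).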
746
Probabilistic Model-Based Clustering
 Cluster analysis is to find hidden categories.
 A hidden category (i.e., probabilistic cluster) is a distribution over the
data space, which can be mathematically represented using a
probability density function (or distribution function).
 Ex. 2 categories for digital cameras
sold
 consumer line vs. professional line
 density functions f1, f2 for C1, C2
 obtained by probabilistic clustering
 A mixture model assumes that a set of observed objects is a mixture
of instances from multiple probabilistic clusters, and conceptually
each observed object is generated independently
 Our task: infer a set of k probabilistic clusters that is most likely to
generate D using the above data generation process
747
748
Model-Based Clustering
 A set C of k probabilistic clusters C1, …,Ck with probability density
functions f1, …, fk, respectively, and their probabilities ω1, …, ωk.
 Probability of an object o generated by cluster Cj is
 Probability of o generated by the set of cluster C is
 Since objects are assumed to be generated
independently, for a data set D = {o1, …, on}, we have,
 Task: Find a set C of k probabilistic clusters s.t. P(D|C) is maximized
 However, maximizing P(D|C) is often intractable since the probability
density function of a cluster can take an arbitrarily complicated form
 To make it computationally feasible (as a compromise), assume the
probability density functions being some parameterized distributions
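Written out in standard mixture-model form (consistent with the notation ωj and fj above): $P(o \mid C_j) = \omega_j f_j(o)$, $P(o \mid \mathcal{C}) = \sum_{j=1}^{k} \omega_j f_j(o)$, and $P(D \mid \mathcal{C}) = \prod_{i=1}^{n} \sum_{j=1}^{k} \omega_j f_j(o_i)$.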
749
Univariate Gaussian Mixture Model
 O = {o1, …, on} (n observed objects), Θ = {θ1, …, θk} (parameters of the
k distributions), and Pj(oi| θj) is the probability that oi is generated from
the j-th distribution using parameter θj, we have
 Univariate Gaussian mixture model
 Assume the probability density function of each cluster follows a 1-
d Gaussian distribution. Suppose that there are k clusters.
 The probability density function of each cluster is centered at μj
with standard deviation σj; with θj = (μj, σj), we have
The EM (Expectation Maximization) Algorithm
 The k-means algorithm has two steps at each iteration:
 Expectation Step (E-step): Given the current cluster centers, each
object is assigned to the cluster whose center is closest to the
object: An object is expected to belong to the closest cluster
 Maximization Step (M-step): Given the cluster assignment, for
each cluster, the algorithm adjusts the center so that the sum of
distance from the objects assigned to this cluster and the new
center is minimized
 The (EM) algorithm: A framework to approach maximum likelihood or
maximum a posteriori estimates of parameters in statistical models.
 E-step assigns objects to clusters according to the current fuzzy
clustering or parameters of probabilistic clusters
 M-step finds the new clustering or parameters that minimize the
sum of squared error (SSE) or maximize the expected likelihood
750
Fuzzy Clustering Using the EM Algorithm
 Initially, let c1 = a and c2 = b
 1st E-step: assign o to c1 with weight wt = …
 1st M-step: recalculate the centroids according to the partition matrix,
minimizing the sum of squared error (SSE)
 Iteratively calculate this until the cluster centers converge or the change
is small enough
752
753
Computing Mixture Models with EM
 Given n objects O = {o1, …, on}, we want to mine a set of parameters Θ
= {θ1, …, θk} s.t.,P(O|Θ) is maximized, where θj = (μj, σj) are the mean and
standard deviation of the j-th univariate Gaussian distribution
 We initially assign random values to the parameters θj, then iteratively
conduct the E- and M-steps until convergence or until the change is sufficiently small
 At the E-step, for each object oi, calculate the probability that oi belongs
to each distribution, i.e., P(Θj | oi, Θ) = P(oi | θj) / Σl=1..k P(oi | θl)
 At the M-step, adjust the parameters θj = (μj, σj) so that the expected
likelihood P(O|Θ) is maximized, i.e.,
μj = Σi=1..n oi · P(Θj | oi, Θ) / Σi=1..n P(Θj | oi, Θ)   and
σj = √( Σi=1..n P(Θj | oi, Θ)(oi − μj)² / Σi=1..n P(Θj | oi, Θ) )
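A compact NumPy sketch (mine, not from the book) of the E- and M-steps just described for a univariate Gaussian mixture; equal mixing weights are assumed for simplicity, as on the slide, and the function and variable names are illustrative:

import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def em_univariate_gmm(o, k=2, n_iter=100, seed=0):
    """EM for a 1-D Gaussian mixture with equal mixing weights (a sketch)."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(o, size=k, replace=False).astype(float)  # random initial means
    sigma = np.full(k, o.std() + 1e-6)                        # common initial spread
    for _ in range(n_iter):
        # E-step: responsibility of each distribution for each object
        dens = np.array([gaussian_pdf(o, mu[j], sigma[j]) for j in range(k)])  # (k, n)
        resp = dens / dens.sum(axis=0, keepdims=True)
        # M-step: re-estimate mu_j and sigma_j to maximize the expected likelihood
        weights = resp.sum(axis=1)
        mu = (resp * o).sum(axis=1) / weights
        sigma = np.sqrt((resp * (o - mu[:, None]) ** 2).sum(axis=1) / weights) + 1e-6
    return mu, sigma

o = np.concatenate([np.random.normal(5, 1, 100), np.random.normal(15, 2, 100)])
print(em_univariate_gmm(o, k=2))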
Advantages and Disadvantages of Mixture Models
 Strength
 Mixture models are more general than partitioning and fuzzy
clustering
 Clusters can be characterized by a small number of parameters
 The results may satisfy the statistical assumptions of the
generative models
 Weakness
 May converge to a local optimum (workaround: run multiple times with
random initialization)
 Computationally expensive if the number of distributions is large,
or the data set contains very few observed data points
 Need large data sets
 Hard to estimate the number of clusters
754
755
Chapter 11. Cluster Analysis: Advanced Methods
 Probability Model-Based Clustering
 Clustering High-Dimensional Data
 Clustering Graphs and Network Data
 Clustering with Constraints
 Summary
755
756
Clustering High-Dimensional Data
 Clustering high-dimensional data (How high is high-D in clustering?)
 Many applications: text documents, DNA micro-array data
 Major challenges:

Many irrelevant dimensions may mask clusters

Distance measure becomes meaningless—due to equi-distance

Clusters may exist only in some subspaces
 Methods
 Subspace-clustering: Search for clusters existing in subspaces of
the given high dimensional data space

CLIQUE, ProClus, and bi-clustering approaches
 Dimensionality reduction approaches: Construct a much lower
dimensional space and search for clusters there (may construct
new dimensions by combining some dimensions in the original
data)

Dimensionality reduction methods and spectral clustering
Traditional Distance Measures May Not
Be Effective on High-D Data
 Traditional distance measure could be dominated by noises in many
dimensions
 Ex. Which pairs of customers are more similar?
 By Euclidean distance over all dimensions, Ada and Cathy do not come out
as the closest pair, despite looking more similar on the relevant dimensions
 Clustering should consider not only all dimensions but also which attributes
(features) are relevant
 Feature transformation: effective if most dimensions are relevant
(PCA & SVD useful when features are highly correlated/redundant)
 Feature selection: useful to find a subspace where the data have
nice clusters
757
758
The Curse of Dimensionality
(graphs adapted from Parsons et al. KDD Explorations 2004)
 Data in only one dimension is relatively
packed
 Adding a dimension “stretches” the
points across that dimension, making
them farther apart
 Adding more dimensions makes the
points even farther apart—high-dimensional
data is extremely sparse
 Distance measure becomes
meaningless—due to equi-distance
759
Why Subspace Clustering?
(adapted from Parsons et al. SIGKDD Explorations 2004)
 Clusters may exist only in some subspaces
 Subspace-clustering: find clusters in all the subspaces
Subspace Clustering Methods
 Subspace search methods: Search various subspaces to
find clusters
 Bottom-up approaches
 Top-down approaches
 Correlation-based clustering methods
 E.g., PCA based approaches
 Bi-clustering methods
 Optimization-based methods
 Enumeration methods
Subspace Clustering Method (I):
Subspace Search Methods
 Search various subspaces to find clusters
 Bottom-up approaches
 Start from low-D subspaces and search higher-D subspaces only
when there may be clusters in such subspaces
 Various pruning techniques to reduce the number of higher-D
subspaces to be searched
 Ex. CLIQUE (Agrawal et al. 1998)
 Top-down approaches
 Start from full space and search smaller subspaces recursively
 Effective only if the locality assumption holds: the subspace of a
cluster can be determined from the local neighborhood
 Ex. PROCLUS (Aggarwal et al. 1999): a k-medoid-like method
761
762
CLIQUE: Subspace Clustering with Apriori Pruning
(Figure: grids over the (age, salary) and (age, vacation) subspaces, with age from 20 to 60 on the horizontal axis, salary in units of $10,000 and vacation in weeks on the vertical axes; the dense units found in these 2-D subspaces are intersected, Apriori-style, to obtain candidate dense units in the 3-D (age, salary, vacation) space; density threshold = 3)
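A toy Python sketch of the bottom-up idea behind CLIQUE (my own illustration, not the CLIQUE algorithm itself): count points per grid cell in 1-D subspaces, keep the dense cells, and only form candidate 2-D cells whose 1-D projections are dense (Apriori-style pruning); the grid width and density threshold are assumed parameters:

from collections import Counter
from itertools import combinations

def dense_units(points, width=10.0, threshold=3):
    """Return dense cells per 1-D subspace and candidate dense 2-D cells."""
    dims = range(len(points[0]))
    # 1-D pass: cell index -> count, per dimension
    one_d = {d: Counter(int(p[d] // width) for p in points) for d in dims}
    dense_1d = {d: {c for c, n in one_d[d].items() if n >= threshold} for d in dims}
    # 2-D pass: only count cells whose two 1-D projections are both dense
    dense_2d = {}
    for d1, d2 in combinations(dims, 2):
        counts = Counter(
            (int(p[d1] // width), int(p[d2] // width))
            for p in points
            if int(p[d1] // width) in dense_1d[d1] and int(p[d2] // width) in dense_1d[d2]
        )
        dense_2d[(d1, d2)] = {c for c, n in counts.items() if n >= threshold}
    return dense_1d, dense_2d

pts = [(25, 3.1), (27, 3.4), (29, 3.3), (45, 6.8), (46, 7.1), (47, 7.0), (60, 1.0)]
print(dense_units(pts, width=10.0, threshold=3))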
Subspace Clustering Method (II):
Correlation-Based Methods
 Subspace search method: similarity based on distance or
density
 Correlation-based method: based on advanced correlation
models
 Ex. PCA-based approach:
 Apply PCA (for Principal Component Analysis) to derive a
set of new, uncorrelated dimensions,
 then mine clusters in the new space or its subspaces
 Other space transformations:
 Hough transform
 Fractal dimensions
763
Subspace Clustering Method (III):
Bi-Clustering Methods
 Bi-clustering: Cluster both objects and attributes
simultaneously (treat objs and attrs in symmetric way)
 Four requirements:
 Only a small set of objects participate in a cluster
 A cluster only involves a small number of attributes
 An object may participate in multiple clusters, or
does not participate in any cluster at all
 An attribute may be involved in multiple clusters, or
is not involved in any cluster at all
764
 Ex 1. Gene expression or microarray data: a gene
sample/condition matrix.
 Each element in the matrix, a real number,
records the expression level of a gene under a
specific condition
 Ex. 2. Clustering customers and products
 Another bi-clustering problem
Types of Bi-clusters
 Let A = {a1, ..., an} be a set of genes, B = {b1, …, bn} a set of conditions
 A bi-cluster: A submatrix where genes and conditions follow some
consistent patterns
 4 types of bi-clusters (ideal cases)
 Bi-clusters with constant values:
 for any i in I and j in J, eij = c
 Bi-clusters with constant values on rows:
 eij = c + αi

Also, it can be constant values on columns
 Bi-clusters with coherent values (aka. pattern-based clusters)
 eij = c + αi + βj
 Bi-clusters with coherent evolutions on rows
 (ei1j1 − ei1j2)(ei2j1 − ei2j2) ≥ 0 for any i1, i2 in I and j1, j2 in J
 i.e., only interested in the up- or down-regulated changes across
genes or conditions without constraining the exact values
765
Bi-Clustering Methods
 Real-world data is noisy: Try to find approximate bi-clusters
 Methods: Optimization-based methods vs. enumeration methods
 Optimization-based methods
 Try to find a submatrix at a time that achieves the best significance
as a bi-cluster
 Due to the cost in computation, greedy search is employed to find
local optimal bi-clusters
 Ex. δ-Cluster Algorithm (Cheng and Church, ISMB’2000)
 Enumeration methods
 Use a tolerance threshold to specify the degree of noise allowed in
the bi-clusters to be mined
 Then try to enumerate all submatrices as bi-clusters that satisfy the
requirements
 Ex. δ-pCluster Algorithm (H. Wang et al.’ SIGMOD’2002, MaPle:
Pei et al., ICDM’2003)
766
767
Bi-Clustering for Micro-Array Data Analysis
 Left figure: Micro-array “raw” data shows 3 genes and their
values in a multi-D space: Difficult to find their patterns
 Right two: Some subsets of dimensions form nice shift and
scaling patterns
 No globally defined similarity/distance measure
 Clusters may not be exclusive
 An object can appear in multiple clusters
Bi-Clustering (I): δ-Bi-Cluster
 For a submatrix I x J, the mean of the i-th row: eiJ = (1/|J|) Σj∈J eij
 The mean of the j-th column: eIj = (1/|I|) Σi∈I eij
 The mean of all elements in the submatrix: eIJ = (1/(|I||J|)) Σi∈I, j∈J eij
 The quality of the submatrix as a bi-cluster can be measured by the mean
squared residue value H(I x J) = (1/(|I||J|)) Σi∈I, j∈J (eij − eiJ − eIj + eIJ)²
 A submatrix I x J is a δ-bi-cluster if H(I x J) ≤ δ, where δ ≥ 0 is a threshold.
When δ = 0, I x J is a perfect bi-cluster with coherent values. By setting δ > 0,
a user can specify the tolerance of average noise per element against a
perfect bi-cluster
 residue(eij) = eij − eiJ − eIj + eIJ
768
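A short NumPy sketch of the mean squared residue H(I × J) defined above (my own helper, assuming the submatrix is given as a dense array):

import numpy as np

def mean_squared_residue(E):
    """H(I x J) for a submatrix E (rows = genes in I, columns = conditions in J)."""
    row_mean = E.mean(axis=1, keepdims=True)   # e_iJ
    col_mean = E.mean(axis=0, keepdims=True)   # e_Ij
    all_mean = E.mean()                        # e_IJ
    residue = E - row_mean - col_mean + all_mean
    return float((residue ** 2).mean())

# A perfect bi-cluster with coherent values (e_ij = c + alpha_i + beta_j) has H = 0
E = np.array([[1.0, 2.0, 4.0],
              [3.0, 4.0, 6.0],
              [0.0, 1.0, 3.0]])
print(mean_squared_residue(E))   # ~0.0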
Bi-Clustering (I): The δ-Cluster Algorithm
 Maximal δ-bi-cluster is a δ-bi-cluster I x J such that there does not exist
another δ-bi-cluster I′ x J′ which contains I x J
 Computing is costly: Use heuristic greedy search to obtain local optimal clusters
 Two-phase computation: deletion phase and addition phase
 Deletion phase: Start from the whole matrix, iteratively remove rows and
columns while the mean squared residue of the matrix is over δ
 At each iteration, for each row/column, compute its mean squared residue, i.e., the average of residue(eij)² over that row or column
 Remove the row or column with the largest mean squared residue
 Addition phase:
 Expand iteratively the δ-bi-cluster I x J obtained in the deletion phase as
long as the δ-bi-cluster requirement is maintained
 Consider all the rows/columns not involved in the current bi-cluster I x J by
calculating their mean squared residues
 A row/column of the smallest mean squared residue is added into the current
δ-bi-cluster
 It finds only one δ-bi-cluster, thus needs to run multiple times: replacing the
elements in the output bi-cluster by random numbers 769
Bi-Clustering (II): δ-pCluster
 Enumerating all bi-clusters (δ-pClusters) [H. Wang, et al., Clustering by pattern
similarity in large data sets. SIGMOD’02]
 A submatrix I x J is a bi-cluster with (perfect) coherent values iff ei1j1 − ei2j1
= ei1j2 − ei2j2. For any 2 x 2 submatrix of I x J, define the p-score = |(ei1j1 − ei1j2) − (ei2j1 − ei2j2)|
 A submatrix I x J is a δ-pCluster (pattern-based cluster) if the p-score of every 2
x 2 submatrix of I x J is at most δ, where δ ≥ 0 is a threshold specifying a user's
tolerance of noise against a perfect bi-cluster
 The p-score controls the noise on every element in a bi-cluster, while the mean
squared residue captures the average noise
 Monotonicity: If I x J is a δ-pCluster, every x × y (x, y ≥ 2) submatrix of I x J is
also a δ-pCluster.
 A δ-pCluster is maximal if no more rows or columns can be added into the cluster
while it remains a δ-pCluster: we only need to compute all maximal δ-pClusters.
770
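A small sketch that checks the δ-pCluster condition on a submatrix by brute force over all 2 × 2 submatrices (fine for illustration; a real implementation would prune the search as MaPle does):

import numpy as np
from itertools import combinations

def p_score(e11, e12, e21, e22):
    return abs((e11 - e12) - (e21 - e22))

def is_delta_pcluster(E, delta):
    """True if every 2x2 submatrix of E has p-score <= delta."""
    rows, cols = E.shape
    for i1, i2 in combinations(range(rows), 2):
        for j1, j2 in combinations(range(cols), 2):
            if p_score(E[i1, j1], E[i1, j2], E[i2, j1], E[i2, j2]) > delta:
                return False
    return True

E = np.array([[1.0, 2.0, 4.0],
              [3.0, 4.1, 6.0]])
print(is_delta_pcluster(E, delta=0.2))   # True: a near-perfect shifting pattern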
MaPle: Efficient Enumeration of δ-pClusters
 Pei et al., MaPle: Efficient enumerating all maximal δ-
pClusters. ICDM'03
 Framework: Same as pattern-growth in frequent pattern
mining (based on the downward closure property)
 For each condition combination J, find the maximal subsets
of genes I such that I x J is a δ-pCluster
 If I x J is not a submatrix of another δ-pCluster,
 then I x J is a maximal δ-pCluster.
 Algorithm is very similar to mining frequent closed itemsets
 Additional advantages of δ-pClusters:
 Due to the averaging in the δ-cluster definition, a δ-cluster may contain
outliers and yet stay within the δ-threshold; the p-score avoids this
 To compute bi-clusters for scaling patterns, taking the logarithm on the
ratio (d_xa / d_ya) / (d_xb / d_yb) leads back to the p-score form
771
Dimensionality-Reduction Methods
 Dimensionality reduction: In some situations, it is
more effective to construct a new space instead
of using some subspaces of the original data
772
 Ex. To cluster the points in the right figure, neither of the original dimensions
X and Y (nor any subspace of them) helps, since all three clusters project onto
overlapping regions of the X and Y axes.
 If we construct a new dimension (the dashed line in the figure), the three
clusters become apparent when the points are projected onto it
 Dimensionality reduction methods
 Feature selection and extraction: But may not focus on clustering
structure finding
 Spectral clustering: Combining feature extraction and clustering (i.e.,
use the spectrum of the similarity matrix of the data to perform
dimensionality reduction for clustering in fewer dimensions)

Normalized Cuts (Shi and Malik, CVPR’97 or PAMI’2000)

The Ng-Jordan-Weiss algorithm (NIPS’01)
Spectral Clustering:
The Ng-Jordan-Weiss (NJW) Algorithm
 Given a set of objects o1, …, on, and the distance between each pair
of objects, dist(oi, oj), find the desired number k of clusters
 Calculate an affinity matrix W, e.g., Wij = exp(−dist(oi, oj)² / (2σ²)) for i ≠ j,
where σ is a scaling parameter that controls how fast the affinity Wij
decreases as dist(oi, oj) increases. In NJW, set Wii = 0
 Derive a matrix A = f(W). NJW defines a matrix D to be a diagonal
matrix s.t. Dii is the sum of the i-th row of W, i.e., Dii = Σj Wij.
Then, A is set to A = D^(−1/2) W D^(−1/2)
 A spectral clustering method finds the k leading eigenvectors of A
 A vector v is an eigenvector of matrix A if Av = λv, where λ is the
corresponding eigen-value
 Using the k leading eigenvectors, project the original data into the
new space defined by the k leading eigenvectors, and run a
clustering algorithm, such as k-means, to find k clusters
 Assign the original data points to clusters according to how the
transformed points are assigned in the clusters obtained
773
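A compact NumPy/scikit-learn sketch of the NJW pipeline described above (Gaussian affinity, symmetric normalization, k leading eigenvectors, then k-means); σ and k are assumed user parameters, and the row normalization of the eigenvector matrix follows the usual NJW recipe:

import numpy as np
from sklearn.cluster import KMeans

def njw_spectral_clustering(X, k, sigma=1.0, seed=0):
    # Affinity matrix with zero diagonal
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # A = D^{-1/2} W D^{-1/2}
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-12))
    A = D_inv_sqrt @ W @ D_inv_sqrt
    # k leading eigenvectors of the symmetric matrix A
    vals, vecs = np.linalg.eigh(A)
    U = vecs[:, -k:]                      # columns for the k largest eigenvalues
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)  # row-normalize
    # Cluster the projected points; labels map back to the original objects
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(U)

X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 5])
print(njw_spectral_clustering(X, k=2))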
Spectral Clustering: Illustration and Comments
 Spectral clustering: Effective in tasks like image processing
 Scalability challenge: Computing eigenvectors on a large matrix is costly
 Can be combined with other clustering methods, such as bi-clustering
774
775
Chapter 11. Cluster Analysis: Advanced Methods
 Probability Model-Based Clustering
 Clustering High-Dimensional Data
 Clustering Graphs and Network Data
 Clustering with Constraints
 Summary
775
Clustering Graphs and Network Data
 Applications
 Bi-partite graphs, e.g., customers and products,
authors and conferences
 Web search engines, e.g., click through graphs and
Web graphs
 Social networks, friendship/coauthor graphs
 Similarity measures
 Geodesic distances
 Distance based on random walk (SimRank)
 Graph clustering methods
 Minimum cuts: FastModularity (Clauset, Newman &
Moore, 2004)
 Density-based clustering: SCAN (Xu et al., KDD’2007)
776
Similarity Measure (I): Geodesic Distance
 Geodesic distance (A, B): length (i.e., # of edges) of the shortest path
between A and B (if not connected, defined as infinite)
 Eccentricity of v, eccen(v): The largest geodesic distance between v
and any other vertex u ∈ V − {v}
 E.g., eccen(a) = eccen(b) = 2; eccen(c) = eccen(d) = eccen(e) = 3
 Radius of graph G: The minimum eccentricity of all vertices, i.e., the
distance between the “most central point” and the “farthest border”
 r = min_{v ∈ V} eccen(v)
 E.g., radius(g) = 2
 Diameter of graph G: The maximum eccentricity of all vertices, i.e., the
largest distance between any pair of vertices in G
 d = max_{v ∈ V} eccen(v)
 E.g., diameter(g) = 3
 A peripheral vertex is a vertex that achieves the diameter.
 E.g., Vertices c, d, and e are peripheral vertices
777
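A plain-Python sketch of these definitions using BFS on an unweighted, undirected graph (adjacency given as a dict; the example graph is illustrative, not the one drawn on the slide):

from collections import deque

def eccentricities(adj):
    """Geodesic eccentricity of every vertex in an unweighted, connected graph."""
    ecc = {}
    for s in adj:
        dist = {s: 0}
        q = deque([s])
        while q:                      # BFS gives shortest-path lengths from s
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        ecc[s] = max(dist.values())
    return ecc

adj = {'a': ['b', 'c'], 'b': ['a', 'c', 'd'], 'c': ['a', 'b', 'e'],
       'd': ['b'], 'e': ['c']}
ecc = eccentricities(adj)
print(ecc, "radius =", min(ecc.values()), "diameter =", max(ecc.values()))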
SimRank: Similarity Based on Random
Walk and Structural Context
 SimRank: structural-context similarity, i.e., based on the similarity of its
neighbors
 In a directed graph G = (V, E),
 individual in-neighborhood of v: I(v) = {u | (u, v) ∈ E}
 individual out-neighborhood of v: O(v) = {w | (v, w) ∈ E}
 Similarity in SimRank: s(u, v) = (C / (|I(u)| |I(v)|)) Σ_{x ∈ I(u)} Σ_{y ∈ I(v)} s(x, y),
with s(u, u) = 1 and damping constant C ∈ (0, 1)
 Initialization: s0(u, v) = 1 if u = v, and 0 otherwise
 Then we can compute si+1 from si based on the definition
 Similarity based on random walk: in a strongly connected component
 Expected distance: d(u, v) = Σ_{t: u⇝v} P[t] · l(t), over all tours t from u to v of length l(t)
 Expected meeting distance: m(u, v) = Σ_{t: (u,v)⇝(x,x)} P[t] · l(t), over tours that bring u and v to a common vertex x
 Expected meeting probability: p(u, v) = Σ_{t: (u,v)⇝(x,x)} P[t] · C^{l(t)}
778
P[t] is the probability of the
tour
Graph Clustering: Sparsest Cut
 G = (V, E). The cut set of a cut is the set
of edges {(u, v) ∈ E | u ∈ S, v ∈ T},
where S and T are the two partitions
 Size of the cut: # of edges in the cut set
 Min-cut (e.g., C1) is not a good partition
 A better measure, sparsity: Φ = (size of the cut set) / min{|S|, |T|}
 A cut is sparsest if its sparsity is not greater than that of any other cut
 Ex. Cut C2 = ({a, b, c, d, e, f, l}, {g, h, i, j, k}) is the sparsest cut
 For k clusters, the modularity of a clustering assesses the quality of the
clustering: Q = Σ_{i=1..k} ( li/|E| − (di/(2|E|))² )
 The modularity of a clustering of a graph is the difference between the
fraction of all edges that fall into individual clusters and the fraction that
would do so if the graph vertices were randomly connected
 The optimal clustering of graphs maximizes the modularity
li: # edges between vertices in the i-th cluster
di: the sum of the degrees of the vertices in the i-th
cluster
779
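A small sketch computing the modularity Q = Σ_i (l_i/|E| − (d_i/(2|E|))²) of a given clustering of an undirected graph; the edge list and the partition are assumed inputs:

def modularity(edges, clusters):
    """edges: list of (u, v) pairs; clusters: list of sets of vertices."""
    m = len(edges)
    label = {v: i for i, c in enumerate(clusters) for v in c}
    l = [0] * len(clusters)   # edges fully inside cluster i
    d = [0] * len(clusters)   # sum of degrees of vertices in cluster i
    for u, v in edges:
        d[label[u]] += 1
        d[label[v]] += 1
        if label[u] == label[v]:
            l[label[u]] += 1
    return sum(l[i] / m - (d[i] / (2 * m)) ** 2 for i in range(len(clusters)))

edges = [('a', 'b'), ('b', 'c'), ('a', 'c'), ('c', 'd'), ('d', 'e'), ('e', 'f'), ('d', 'f')]
print(modularity(edges, [{'a', 'b', 'c'}, {'d', 'e', 'f'}]))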
Graph Clustering: Challenges of Finding Good Cuts
 High computational cost
 Many graph cut problems are computationally expensive
 The sparsest cut problem is NP-hard
 Need to tradeoff between efficiency/scalability and quality
 Sophisticated graphs
 May involve weights and/or cycles.
 High dimensionality
 A graph can have many vertices. In a similarity matrix, a vertex is
represented as a vector (a row in the matrix) whose
dimensionality is the number of vertices in the graph
 Sparsity
 A large graph is often sparse, meaning each vertex on average
connects to only a small number of other vertices
 A similarity matrix from a large sparse graph can also be sparse
780
Two Approaches for Graph Clustering
 Two approaches for clustering graph data
 Use generic clustering methods for high-dimensional data
 Designed specifically for clustering graphs
 Using clustering methods for high-dimensional data
 Extract a similarity matrix from a graph using a similarity measure
 A generic clustering method can then be applied on the similarity
matrix to discover clusters
 Ex. Spectral clustering: approximate optimal graph cut solutions
 Methods specific to graphs
 Search the graph to find well-connected components as clusters
 Ex. SCAN (Structural Clustering Algorithm for Networks)

X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, “SCAN: A
Structural Clustering Algorithm for Networks”, KDD'07
781
SCAN: Density-Based Clustering of
Networks
 How many clusters?
 What size should they be?
 What is the best partitioning?
 Should some points be
segregated?
782
An Example Network
 Application: Given only information about who associates with whom,
can one identify clusters of individuals with common interests or
special relationships (families, cliques, terrorist cells)?
A Social Network Model
 Cliques, hubs and outliers
 Individuals in a tight social group, or clique, know many of the
same people, regardless of the size of the group
 Individuals who are hubs know many people in different groups
but belong to no single group. Politicians, for example bridge
multiple groups
 Individuals who are outliers reside at the margins of society.
Hermits, for example, know few people and belong to no group
 The Neighborhood of a Vertex
783
 Define Γ(v) as the immediate neighborhood of a vertex v (i.e., the set
of people that an individual knows)
Structure Similarity
 The desired features tend to be captured by a measure
we call Structural Similarity
 Structural similarity is large for members of a clique
and small for hubs and outliers
σ(v, w) = |Γ(v) ∩ Γ(w)| / √(|Γ(v)| · |Γ(w)|)
784
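A minimal Python sketch of the structural similarity σ(v, w) above and of the ε/μ core test that SCAN uses (adjacency as a dict of sets; ε and μ are the user parameters that appear on the following slides; treating Γ(v) as the closed neighborhood, i.e., including v itself, is an assumption of this sketch):

import math

def sigma(adj, v, w):
    """Structural similarity; closed neighborhoods include the vertex itself."""
    gv, gw = adj[v] | {v}, adj[w] | {w}
    return len(gv & gw) / math.sqrt(len(gv) * len(gw))

def is_core(adj, v, eps=0.7, mu=2):
    """v is a core if its eps-neighborhood (within Gamma(v)) has size >= mu."""
    gamma_v = adj[v] | {v}
    eps_neighborhood = [w for w in gamma_v if sigma(adj, v, w) >= eps]
    return len(eps_neighborhood) >= mu

adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(sigma(adj, 0, 1), is_core(adj, 0))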
Structural Connectivity [1]
 ε-Neighborhood: Nε(v) = {w ∈ Γ(v) | σ(v, w) ≥ ε}
 Core: CORE_ε,μ(v) ⇔ |Nε(v)| ≥ μ
 Direct structure reachable: DirREACH_ε,μ(v, w) ⇔ CORE_ε,μ(v) ∧ w ∈ Nε(v)
 Structure reachable: REACH_ε,μ(v, w), the transitive closure of direct structure
reachability
 Structure connected: CONNECT_ε,μ(v, w) ⇔ ∃u ∈ V: REACH_ε,μ(u, v) ∧ REACH_ε,μ(u, w)
[1] M. Ester, H. P. Kriegel, J. Sander, and X. Xu (KDD'96), “A Density-Based
Algorithm for Discovering Clusters in Large Spatial Databases with Noise”
785
Structure-Connected Clusters
 Structure-connected cluster C
 Connectivity: ∀ v, w ∈ C: CONNECT_ε,μ(v, w)
 Maximality: ∀ v, w ∈ V: v ∈ C ∧ REACH_ε,μ(v, w) ⇒ w ∈ C
 Hubs:
 Do not belong to any cluster
 Bridge to many clusters
 Outliers:
 Do not belong to any cluster
 Connect to fewer clusters
(Figure: example network in which a hub vertex bridges two structure-connected clusters and an outlier vertex hangs off a single cluster)
Algorithm (illustration)
(Figure sequence, slides 786–799: SCAN run on a 14-vertex example network with μ = 2 and ε = 0.7; structural similarities such as 0.63, 0.75, 0.67, 0.82, 0.73, 0.51, and 0.68 are computed edge by edge, core vertices are identified, clusters are grown from the cores, and the remaining vertices are left as hubs or outliers)
Running Time
 Running time = O(|E|)
 For sparse networks = O(|V|)
[2] A. Clauset, M. E. J. Newman, & C. Moore, Phys. Rev. E 70, 066111 (2004).
800
Chapter 11. Cluster Analysis: Advanced Methods
 Probability Model-Based Clustering
 Clustering High-Dimensional Data
 Clustering Graphs and Network Data
 Clustering with Constraints
 Summary
801
802
Why Constraint-Based Cluster Analysis?
 Need user feedback: Users know their applications the best
 Fewer parameters but more user-desired constraints, e.g., an
ATM allocation problem: obstacles & desired clusters
803
Categorization of Constraints
 Constraints on instances: specifies how a pair or a set of instances
should be grouped in the cluster analysis
 Must-link vs. cannot link constraints

must-link(x, y): x and y should be grouped into one cluster
 Constraints can be defined using variables, e.g.,

cannot-link(x, y) if dist(x, y) > d
 Constraints on clusters: specifies a requirement on the clusters
 E.g., specify the min # of objects in a cluster, the max diameter of a
cluster, the shape of a cluster (e.g., a convex), # of clusters (e.g., k)
 Constraints on similarity measurements: specifies a requirement that
the similarity calculation must respect
 E.g., driving on roads, obstacles (e.g., rivers, lakes)
 Issues: Hard vs. soft constraints; conflicting or redundant constraints
804
Constraint-Based Clustering Methods (I):
Handling Hard Constraints
 Handling hard constraints: Strictly respect the constraints in cluster
assignments
 Example: The COP-k-means algorithm
 Generate super-instances for must-link constraints

Compute the transitive closure of the must-link constraints

To represent such a subset, replace all those objects in the
subset by the mean.

The super-instance also carries a weight, which is the number
of objects it represents
 Conduct modified k-means clustering to respect cannot-link
constraints

Modify the center-assignment process in k-means to a nearest
feasible center assignment

An object is assigned to the nearest center so that the
assignment respects all cannot-link constraints
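A sketch (assuming constraints are given as index pairs) of the first COP-k-means step described above: take the transitive closure of the must-link constraints with union–find and replace each group by its weighted mean super-instance; the cannot-link-aware assignment step is only hinted at in the final comment:

import numpy as np

def must_link_super_instances(X, must_links):
    """Collapse must-linked objects into weighted super-instances."""
    parent = list(range(len(X)))

    def find(i):                      # union-find gives the transitive closure
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i, j in must_links:
        union(i, j)

    groups = {}
    for i in range(len(X)):
        groups.setdefault(find(i), []).append(i)

    # Each super-instance: (mean of the group, weight = group size, member indices)
    return [(X[idx].mean(axis=0), len(idx), idx) for idx in groups.values()]

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [9.0, 0.0]])
for mean, weight, members in must_link_super_instances(X, must_links=[(0, 1), (2, 3)]):
    print(mean, weight, members)
# The modified k-means would then assign each weighted super-instance to the
# nearest center that does not violate any cannot-link constraint.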
Constraint-Based Clustering Methods (II):
Handling Soft Constraints
 Treated as an optimization problem: When a clustering violates a soft
constraint, a penalty is imposed on the clustering
 Overall objective: Optimizing the clustering quality, and minimizing the
constraint violation penalty
 Ex. CVQE (Constrained Vector Quantization Error) algorithm: Conduct
k-means clustering while enforcing constraint violation penalties
 Objective function: Sum of distance used in k-means, adjusted by the
constraint violation penalties
 Penalty of a must-link violation

If objects x and y must-be-linked but they are assigned to two
different centers, c1 and c2, dist(c1, c2) is added to the objective
function as the penalty
 Penalty of a cannot-link violation

If objects x and y cannot be linked but they are assigned to a
common center c, then dist(c, c′) is added to the objective function
as the penalty, where c′ is the closest cluster center to c that can
accommodate x or y
805
806
Speeding Up Constrained Clustering
 It is costly to compute some constrained
clustering
 Ex. Clustering with obstacle objects: Tung,
Hou, and Han. Spatial clustering in the
presence of obstacles, ICDE'01
 K-medoids is preferable, since k-means
may place the cluster center (e.g., an ATM)
in the middle of a lake
 Visibility graph and shortest path
 Triangulation and micro-clustering
 Two kinds of join indices (shortest-paths)
worth pre-computation
 VV index: indices for any pair of obstacle
vertices
 MV index: indices for any pair of micro-
cluster and obstacle vertex
807
An Example: Clustering With Obstacle Objects
Taking obstacles into account
Not Taking obstacles into account
808
User-Guided Clustering: A Special Kind of
Constraints
(Figure: multi-relational schema of a CS department database, with relations such as Professor, Student, Course, Open-course, Register, Advise, Work-In, Group, Publish, and Publication connected by join paths; the Student relation is the target of clustering, and the user hint is an attribute reachable from it)
 X. Yin, J. Han, P. S. Yu, “Cross-Relational Clustering with User's Guidance”,
KDD'05
 User usually has a goal of clustering, e.g., clustering students by research area
 User specifies his clustering goal to CrossClus
809
Comparing with Classification
 User-specified feature (in the form
of attribute) is used as a hint, not
class labels
 The attribute may contain too
many or too few distinct values,
e.g., a user may want to
cluster students into 20
clusters instead of 3
 Additional features need to be
included in cluster analysis
All tuples for clustering
User hint
810
Comparing with Semi-Supervised Clustering
 Semi-supervised clustering: User provides a training set
consisting of “similar” (“must-link”) and “dissimilar”
(“cannot-link”) pairs of objects
 User-guided clustering: User specifies an attribute as a
hint, and more relevant features are found for clustering
(Figure: semi-supervised clustering constrains pairs among all tuples for clustering; user-guided clustering uses a user-specified attribute as the hint)
811
Why Not Semi-Supervised Clustering?
 Much information (in multiple relations) is needed to judge
whether two tuples are similar
 A user may not be able to provide a good training set
 It is much easier for a user to specify an attribute as a hint,
such as a student’s research area
Tom Smith SC1211 TA
Jane Chang BI205 RA
Tuples to be compared
User hint
812
CrossClus: An Overview
 Measure similarity between features by how they group
objects into clusters
 Use a heuristic method to search for pertinent features
 Start from user-specified feature and gradually
expand search range
 Use tuple ID propagation to create feature values
 Features can be easily created during the expansion
of search range, by propagating IDs
 Explore three clustering algorithms: k-means, k-medoids,
and hierarchical clustering
813
Multi-Relational Features
 A multi-relational feature is defined by:
 A join path, e.g., Student → Register → OpenCourse → Course
 An attribute, e.g., Course.area
 (For numerical feature) an aggregation operator, e.g., sum or average
 Categorical feature f = [Student → Register → OpenCourse → Course,
Course.area, null]
Areas of courses of each student:
Tuple  DB  AI  TH
t1     5   5   0
t2     0   3   7
t3     1   5   4
t4     5   0   5
t5     3   3   4

Values of feature f:
Tuple  DB   AI   TH
t1     0.5  0.5  0
t2     0    0.3  0.7
t3     0.1  0.5  0.4
t4     0.5  0    0.5
t5     0.3  0.3  0.4

(Figure: stacked bars visualizing f(t1), …, f(t5) over the areas DB, AI, TH)
814
Representing Features
 Similarity between tuples t1 and t2 w.r.t. categorical feature f
 Cosine similarity between vectors f(t1) and f(t2)
 Most important information of a
feature f is how f groups tuples into
clusters
 f is represented by similarities
between every pair of tuples
indicated by f
 The horizontal axes are the tuple
indices, and the vertical axis is the
similarity
 This can be considered as a vector
of N x N dimensions
Similarity vector Vf
sim_f(t1, t2) = (Σ_{k=1..L} f(t1).p_k · f(t2).p_k) / ( √(Σ_{k=1..L} f(t1).p_k²) · √(Σ_{k=1..L} f(t2).p_k²) )
815
Similarity Between Features
Feature f (course) Feature g (group)
DB AI TH Info sys Cog sci Theory
t1 0.5 0.5 0 1 0 0
t2 0 0.3 0.7 0 0 1
t3 0.1 0.5 0.4 0 0.5 0.5
t4 0.5 0 0.5 0.5 0 0.5
t5 0.3 0.3 0.4 0.5 0.5 0
Values of Feature f and g
Similarity between two features –
cosine similarity of two vectors
sim(f, g) = (Vf · Vg) / (|Vf| · |Vg|)
816
Computing Feature Similarity
(Figure: mapping between the values of feature f (DB, AI, TH) and feature g (Info sys, Cog sci, Theory) over the tuples)
 Similarity between feature values w.r.t. the tuples:
sim(fk, gq) = Σ_{i=1..N} f(ti).p_k · g(ti).p_q   (e.g., between DB and Info sys)
 Key identity: Vf · Vg = Σ_{i=1..N} Σ_{j=1..N} sim_f(ti, tj) · sim_g(ti, tj) = Σ_{k=1..l} Σ_{q=1..m} sim(fk, gq)²
 Tuple similarities are hard to compute directly; feature value similarities are easy to compute
 Compute the similarity between each pair of feature values by one scan on the data
817
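A tiny NumPy sketch of the identity above, under the assumption that the per-tuple feature vectors are normalized so that tuple similarity is their inner product: with feature-value matrices F (N × l) and G (N × m), one pass gives sim(f_k, g_q) for all pairs at once, and V_f · V_g is the sum of their squares (matrix names are mine):

import numpy as np

# Values of feature f (course areas) and g (research groups) from the slides
F = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.3, 0.7],
              [0.1, 0.5, 0.4],
              [0.5, 0.0, 0.5],
              [0.3, 0.3, 0.4]])
G = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])

S = F.T @ G                 # S[k, q] = sim(f_k, g_q), computed in one scan over the tuples
vf_dot_vg = (S ** 2).sum()  # equals V_f . V_g under the stated normalization assumption
print(S)
print(vf_dot_vg)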
Searching for Pertinent Features
 Different features convey different aspects of information
 Features conveying same aspect of information usually
cluster tuples in more similar ways
 Research group areas vs. conferences of publications
 Given user specified feature
 Find pertinent features by computing feature similarity
Research group area
Advisor
Conferences of papers
Research area
GPA
Number of papers
GRE score
Academic Performances
Nationality
Permanent address
Demographic info
818
Heuristic Search for Pertinent Features
Overall procedure
1. Start from the user-
specified feature
2. Search in neighborhood
of existing pertinent
features
3. Expand search range
gradually
(Figure: the same database schema, with the search starting from the user-hint attribute in step 1 and gradually expanding along join paths to neighboring relations in step 2; the Student relation remains the target of clustering)
 Tuple ID propagation is used to create multi-relational features
 IDs of target tuples can be propagated along any join path, from
which we can find tuples joinable with each target tuple
819
Clustering with Multi-Relational Features
 Given a set of L pertinent features f1, …, fL, similarity
between two tuples
 Weight of a feature is determined in feature search by
its similarity with other pertinent features
 Clustering methods
 CLARANS [Ng & Han 94], a scalable clustering
algorithm for non-Euclidean space
 K-means
 Agglomerative hierarchical clustering
sim(t1, t2) = Σ_{i=1..L} sim_{fi}(t1, t2) · fi.weight
820
Experiments: Compare CrossClus with
 Baseline: Only use the user specified feature
 PROCLUS [Aggarwal, et al. 99]: a state-of-the-art
subspace clustering algorithm
 Use a subset of features for each cluster
 We convert relational database to a table by
propositionalization
 User-specified feature is forced to be used in every
cluster
 RDBC [Kirsten and Wrobel’00]
 A representative ILP clustering algorithm
 Use neighbor information of objects for clustering
 User-specified feature is forced to be used
821
Measure of Clustering Accuracy
 Accuracy
 Measured by manually labeled data

We manually assign tuples into clusters according
to their properties (e.g., professors in different
research areas)
 Accuracy of clustering: Percentage of pairs of tuples in
the same cluster that share common label

This measure favors many small clusters

We let each approach generate the same number of
clusters
822
DBLP Dataset
(Figure: clustering accuracy on the DBLP dataset for feature sets Conf, Word, Coauthor, Conf+Word, Conf+Coauthor, Word+Coauthor, and All three, comparing CrossClus K-Medoids, CrossClus K-Means, CrossClus Agglm, Baseline, PROCLUS, and RDBC; accuracy ranges from 0 to 1 on the vertical axis)
823
Chapter 11. Cluster Analysis: Advanced Methods
 Probability Model-Based Clustering
 Clustering High-Dimensional Data
 Clustering Graphs and Network Data
 Clustering with Constraints
 Summary
823
824
Summary
 Probability Model-Based Clustering
 Fuzzy clustering
 Probability-model-based clustering
 The EM algorithm
 Clustering High-Dimensional Data
 Subspace clustering: bi-clustering methods
 Dimensionality reduction: Spectral clustering
 Clustering Graphs and Network Data
 Graph clustering: min-cut vs. sparsest cut
 High-dimensional clustering methods
 Graph-specific clustering methods, e.g., SCAN
 Clustering with Constraints
 Constraints on instance objects, e.g., Must link vs. Cannot Link
 Constraint-based clustering algorithms
825
References (I)
 R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high
dimensional data for data mining applications. SIGMOD’98
 C. C. Aggarwal, C. Procopiuc, J. Wolf, P. S. Yu, and J.-S. Park. Fast algorithms for projected
clustering. SIGMOD’99
 S. Arora, S. Rao, and U. Vazirani. Expander flows, geometric embeddings and graph partitioning.
J. ACM, 56:5:1–5:37, 2009.
 J. C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press,
1981.
 K. S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is ”nearest neighbor”
meaningful? ICDT’99
 Y. Cheng and G. Church. Biclustering of expression data. ISMB’00
 I. Davidson and S. S. Ravi. Clustering with constraints: Feasibility issues and the k-means
algorithm. SDM’05
 I. Davidson, K. L. Wagstaff, and S. Basu. Measuring constraint-set utility for partitional clustering
algorithms. PKDD’06
 C. Fraley and A. E. Raftery. Model-based clustering, discriminant analysis, and density estimation.
J. American Stat. Assoc., 97:611–631, 2002.
 F. Höppner, F. Klawonn, R. Kruse, and T. Runkler. Fuzzy Cluster Analysis: Methods for
Classification, Data Analysis and Image Recognition. Wiley, 1999.
 G. Jeh and J. Widom. SimRank: a measure of structural-context similarity. KDD’02
 H.-P. Kriegel, P. Kroeger, and A. Zimek. Clustering high dimensional data: A survey on subspace
clustering, pattern-based clustering, and correlation clustering. ACM Trans. Knowledge Discovery
from Data (TKDD), 3, 2009.
 U. Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17:395–416, 2007
References (II)
 G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John
Wiley & Sons, 1988.
 B. Mirkin. Mathematical classification and clustering. J. of Global Optimization, 12:105–108, 1998.
 S. C. Madeira and A. L. Oliveira. Biclustering algorithms for biological data analysis: A survey.
IEEE/ACM Trans. Comput. Biol. Bioinformatics, 1, 2004.
 A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. NIPS’01
 J. Pei, X. Zhang, M. Cho, H. Wang, and P. S. Yu. Maple: A fast algorithm for maximal pattern-based
clustering. ICDM’03
 M. Radovanović, A. Nanopoulos, and M. Ivanović. Nearest neighbors in high-dimensional data: the
emergence and influence of hubs. ICML’09
 S. E. Schaeffer. Graph clustering. Computer Science Review, 1:27–64, 2007.
 A. K. H. Tung, J. Hou, and J. Han. Spatial clustering in the presence of obstacles. ICDE’01
 A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-based clustering in large
databases. ICDT’01
 A. Tanay, R. Sharan, and R. Shamir. Biclustering algorithms: A survey. In Handbook of Computational
Molecular Biology, Chapman & Hall, 2004.
 K. Wagstaff, C. Cardie, S. Rogers, and S. Schrödl. Constrained k-means clustering with background
knowledge. ICML’01
 H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by pattern similarity in large data sets.
SIGMOD’02
 X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger. SCAN: A structural clustering algorithm for networks.
KDD’07
 X. Yin, J. Han, and P.S. Yu, “Cross-Relational Clustering with User's Guidance”, KDD'05
Slides Not to Be Used in Class
827
828
Conceptual Clustering
 Conceptual clustering
 A form of clustering in machine learning
 Produces a classification scheme for a set of unlabeled
objects
 Finds characteristic description for each concept (class)
 COBWEB (Fisher’87)
 A popular and simple method of incremental conceptual
learning
 Creates a hierarchical clustering in the form of a
classification tree
 Each node refers to a concept and contains a
probabilistic description of that concept
829
COBWEB Clustering Method
A classification tree
830
More on Conceptual Clustering
 Limitations of COBWEB
 The assumption that the attributes are independent of each other is
often too strong because correlation may exist
 Not suitable for clustering large database data – skewed tree and
expensive probability distributions
 CLASSIT
 an extension of COBWEB for incremental clustering of continuous
data
 suffers similar problems as COBWEB
 AutoClass (Cheeseman and Stutz, 1996)
 Uses Bayesian statistical analysis to estimate the number of
clusters
 Popular in industry
831
Neural Network Approaches
 Neural network approaches
 Represent each cluster as an exemplar, acting as a
“prototype” of the cluster
 New objects are distributed to the cluster whose
exemplar is the most similar according to some
distance measure
 Typical methods
 SOM (Self-Organizing Feature Map)
 Competitive learning

Involves a hierarchical architecture of several units
(neurons)

Neurons compete in a “winner-takes-all” fashion for
the object currently being presented
832
Self-Organizing Feature Map (SOM)
 SOMs, also called topological ordered maps, or Kohonen Self-
Organizing Feature Map (KSOMs)
 It maps all the points in a high-dimensional source space into a 2 to 3-d
target space, s.t., the distance and proximity relationship (i.e., topology)
are preserved as much as possible
 Similar to k-means: cluster centers tend to lie in a low-dimensional
manifold in the feature space
 Clustering is performed by having several units competing for the
current object
 The unit whose weight vector is closest to the current object wins
 The winner and its neighbors learn by having their weights adjusted
 SOMs are believed to resemble processing that can occur in the brain
 Useful for visualizing high-dimensional data in 2- or 3-D space
833
Web Document Clustering Using SOM
 The result of
SOM clustering
of 12088 Web
articles
 The picture on
the right: drilling
down on the
keyword
“mining”
 Based on
websom.hut.fi
Web page
845
Data Mining:
Concepts and Techniques
(3rd
ed.)
— Chapter 12 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
846
Chapter 12. Outlier Analysis
 Outlier and Outlier Analysis
 Outlier Detection Methods
 Statistical Approaches
 Proximity-Based Approaches
 Clustering-Based Approaches
 Classification Approaches
 Mining Contextual and Collective Outliers
 Outlier Detection in High Dimensional Data
 Summary
847
What Are Outliers?
 Outlier: A data object that deviates significantly from the normal
objects as if it were generated by a different mechanism
 Ex.: Unusual credit card purchase, sports: Michael Jordan, Wayne
Gretzky, ...
 Outliers are different from the noise data
 Noise is random error or variance in a measured variable
 Noise should be removed before outlier detection
 Outliers are interesting: they violate the mechanism that generates the
normal data
 Outlier detection vs. novelty detection: early stage, outlier; but later
merged into the model
 Applications:
 Credit card fraud detection
 Telecom fraud detection
 Customer segmentation
 Medical analysis
848
Types of Outliers (I)
 Three kinds: global, contextual and collective outliers
 Global outlier (or point anomaly)
 Object is Og if it significantly deviates from the rest of the data set
 Ex. Intrusion detection in computer networks
 Issue: Find an appropriate measurement of deviation
 Contextual outlier (or conditional outlier)
 Object is Oc if it deviates significantly based on a selected context
 Ex. 80 °F in Urbana: an outlier? (depends on whether it is summer or winter)
 Attributes of data objects should be divided into two groups

Contextual attributes: defines the context, e.g., time & location

Behavioral attributes: characteristics of the object, used in outlier
evaluation, e.g., temperature
 Can be viewed as a generalization of local outliers—whose density
significantly deviates from its local area
 Issue: How to define or formulate meaningful context?
Global Outlier
849
Types of Outliers (II)
 Collective Outliers
 A subset of data objects collectively deviate
significantly from the whole data set, even if the
individual data objects may not be outliers
 Applications: E.g., intrusion detection:

When a number of computers keep sending
denial-of-service packets to each other
Collective Outlier
 Detection of collective outliers

Consider not only behavior of individual objects, but also that of
groups of objects

Need to have the background knowledge on the relationship
among data objects, such as a distance or similarity measure
on objects.
 A data set may have multiple types of outlier
 One object may belong to more than one type of outlier
850
Challenges of Outlier Detection
 Modeling normal objects and outliers properly
 Hard to enumerate all possible normal behaviors in an application
 The border between normal and outlier objects is often a gray area
 Application-specific outlier detection
 Choice of distance measure among objects and the model of
relationship among objects are often application-dependent
 E.g., in clinical data, a small deviation could be an outlier, while in
marketing analysis much larger fluctuations may still be normal
 Handling noise in outlier detection
 Noise may distort the normal objects and blur the distinction
between normal objects and outliers. It may help hide outliers and
reduce the effectiveness of outlier detection
 Understandability
 Understand why these are outliers: Justification of the detection
 Specify the degree of an outlier: the unlikelihood of the object being
generated by a normal mechanism
851
Chapter 12. Outlier Analysis
 Outlier and Outlier Analysis
 Outlier Detection Methods
 Statistical Approaches
 Proximity-Based Approaches
 Clustering-Based Approaches
 Classification Approaches
 Mining Contextual and Collective Outliers
 Outlier Detection in High Dimensional Data
 Summary
Outlier Detection I: Supervised Methods
 Two ways to categorize outlier detection methods:
 Based on whether user-labeled examples of outliers can be obtained:

Supervised, semi-supervised vs. unsupervised methods
 Based on assumptions about normal data and outliers:

Statistical, proximity-based, and clustering-based methods
 Outlier Detection I: Supervised Methods
 Modeling outlier detection as a classification problem

Samples examined by domain experts used for training & testing
 Methods for Learning a classifier for outlier detection effectively:

Model normal objects & report those not matching the model as
outliers, or

Model outliers and treat those not matching the model as normal
 Challenges

Imbalanced classes, i.e., outliers are rare: Boost the outlier class
and make up some artificial outliers

Catch as many outliers as possible, i.e., recall is more important
than accuracy (i.e., not mislabeling normal objects as outliers)
852
Outlier Detection II: Unsupervised Methods
 Assume the normal objects are somewhat “clustered” into multiple
groups, each having some distinct features
 An outlier is expected to be far away from any groups of normal objects
 Weakness: Cannot detect collective outlier effectively
 Normal objects may not share any strong patterns, but the collective
outliers may share high similarity in a small area
 Ex. In some intrusion or virus detection, normal activities are diverse
 Unsupervised methods may have a high false positive rate but still
miss many real outliers.
 Supervised methods can be more effective, e.g., at identifying
attacks on key resources
 Many clustering methods can be adapted for unsupervised methods
 Find clusters, then outliers: not belonging to any cluster
 Problem 1: Hard to distinguish noise from outliers
 Problem 2: Costly, since clustering is performed first, yet there are far
fewer outliers than normal objects

Newer methods: tackle outliers directly
853
Outlier Detection III: Semi-Supervised Methods
 Situation: In many applications, the number of labeled data is often
small: Labels could be on outliers only, normal objects only, or both
 Semi-supervised outlier detection: Regarded as applications of semi-
supervised learning
 If some labeled normal objects are available
 Use the labeled examples and the proximate unlabeled objects to
train a model for normal objects
 Those not fitting the model of normal objects are detected as outliers
 If only some labeled outliers are available, a small number of labeled
outliers may not cover the possible outliers well
 To improve the quality of outlier detection, one can get help from
models for normal objects learned from unsupervised methods
854
Outlier Detection (1): Statistical Methods
 Statistical methods (also known as model-based methods) assume
that the normal data follow some statistical model (a stochastic model)
 The data not following the model are outliers.
855
 Effectiveness of statistical methods: highly depends on whether the
assumption of statistical model holds in the real data
 There are rich alternatives to use various statistical models
 E.g., parametric vs. non-parametric
 Example (right figure): First use Gaussian distribution
to model the normal data
 For each object y in region R, estimate gD(y), the
probability that y fits the Gaussian distribution
 If gD(y) is very low, y is unlikely generated by the
Gaussian model, thus an outlier
Outlier Detection (2): Proximity-Based Methods
 An object is an outlier if its nearest neighbors are far
away, i.e., the proximity of the object deviates significantly from
the proximity of most of the other objects in the same data set
856
 The effectiveness of proximity-based methods highly relies on the
proximity measure.
 In some applications, proximity or distance measures cannot be
obtained easily.
 Often has difficulty finding a group of outliers that stay close to
each other
 Two major types of proximity-based outlier detection
 Distance-based vs. density-based
 Example (right figure): Model the proximity of an
object using its 3 nearest neighbors
 Objects in region R are substantially different
from other objects in the data set.
 Thus the objects in R are outliers
Outlier Detection (3): Clustering-Based Methods
 Normal data belong to large and dense clusters, whereas
outliers belong to small or sparse clusters, or do not belong
to any clusters
857
 Since there are many clustering methods, there are many
clustering-based outlier detection methods as well
 Clustering is expensive: a straightforward adaptation of a
clustering method for outlier detection can be costly and
does not scale up well for large data sets
 Example (right figure): two clusters
 All points not in R form a large cluster
 The two points in R form a tiny cluster,
thus are outliers
858
Chapter 12. Outlier Analysis
 Outlier and Outlier Analysis
 Outlier Detection Methods
 Statistical Approaches
 Proximity-Based Approaches
 Clustering-Based Approaches
 Classification Approaches
 Mining Contextual and Collective Outliers
 Outlier Detection in High Dimensional Data
 Summary
Statistical Approaches
 Statistical approaches assume that the objects in a data set are
generated by a stochastic process (a generative model)
 Idea: learn a generative model fitting the given data set, and then
identify the objects in low probability regions of the model as outliers
 Methods are divided into two categories: parametric vs. non-
parametric
 Parametric method
 Assumes that the normal data is generated by a parametric
distribution with parameter θ
 The probability density function of the parametric distribution f(x, θ)
gives the probability that object x is generated by the distribution
 The smaller this value, the more likely x is an outlier
 Non-parametric method
 Does not assume an a priori statistical model; instead, it determines
the model from the input data
 Not completely parameter-free, but the number and nature of the
parameters are flexible and not fixed in advance
 Examples: histogram and kernel density estimation
859
Parametric Methods I: Detecting Univariate
Outliers Based on the Normal Distribution
 Univariate data: A data set involving only one attribute or variable
 Often assume that data are generated from a normal distribution, learn
the parameters from the input data, and identify the points with low
probability as outliers
 Ex: Avg. temp.: {24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4}
 Use the maximum likelihood method to estimate μ and σ
860
 Taking derivatives with respect to μ and σ², we derive the maximum likelihood
estimates μ̂ = (1/n) Σ_{i=1..n} xi and σ̂² = (1/n) Σ_{i=1..n} (xi − μ̂)²
 For the above data with n = 10, we have μ̂ = 28.61 and σ̂ ≈ 1.5
 Then (24 − 28.61) / 1.51 = −3.04 < −3, so 24 is an outlier, since under the
normal distribution assumption the region μ ± 3σ contains 99.7% of the data
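A quick NumPy check of this example (same data as on the slide):

import numpy as np

temps = np.array([24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4])
mu = temps.mean()            # maximum likelihood estimate of the mean (28.61)
sigma = temps.std()          # ML estimate of the standard deviation (~1.5)
z = (temps - mu) / sigma
print(np.round(z, 2))
# The first entry (for 24.0) is about -3: roughly three standard deviations
# below the mean, so it is flagged as the outlier under the mu +/- 3*sigma rule.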
Parametric Methods I: The Grubb’s Test
 Univariate outlier detection: The Grubb's test (maximum normed
residual test) ─ another statistical method under normal distribution
 For each object x in a data set, compute its z-score z = |x − x̄| / s, where x̄ and s
are the sample mean and standard deviation; x is an outlier if
z ≥ ((N − 1) / √N) · √( t²_{α/(2N), N−2} / (N − 2 + t²_{α/(2N), N−2}) )
where t²_{α/(2N), N−2} is the value taken by a t-distribution at a
significance level of α/(2N), and N is the # of objects in the data
set
861
Parametric Methods II: Detection of
Multivariate Outliers
 Multivariate data: A data set involving two or more attributes or
variables
 Transform the multivariate outlier detection task into a univariate
outlier detection problem
 Method 1. Compute the Mahalanobis distance
 Let ō be the mean vector for a multivariate data set. The Mahalanobis
distance from an object o to ō is MDist(o, ō) = (o − ō)ᵀ S⁻¹ (o − ō),
where S is the covariance matrix
 Use the Grubb's test on this measure to detect outliers
 Method 2. Use the χ²-statistic: χ² = Σ_{i=1..n} (oi − Ei)² / Ei
 where Ei is the mean of the i-th dimension among all objects, and n is
the dimensionality
 If the χ²-statistic is large, then object o is an outlier
862
Parametric Methods III: Using Mixture of
Parametric Distributions
 Assuming that the data are generated by a single normal distribution
can sometimes be an oversimplification
 Example (right figure): The objects between the two
clusters cannot be captured as outliers since they
are close to the estimated mean
863
 To overcome this problem, assume the normal data is generated by two
normal distributions. For any object o in the data set, the probability that
o is generated by the mixture of the two distributions is given by
Pr(o | Θ1, Θ2) = fθ1(o) + fθ2(o),
where fθ1 and fθ2 are the probability density functions of θ1 and θ2
 Then use EM algorithm to learn the parameters μ1, σ1, μ2, σ2 from data
 An object o is an outlier if it does not belong to any cluster
Non-Parametric Methods: Detection Using Histogram
 The model of normal data is learned from the
input data without any a priori structure.
 Often makes fewer assumptions about the data,
and thus can be applicable in more scenarios
 Outlier detection using histogram:
864
 Figure shows the histogram of purchase amounts in transactions
 A transaction in the amount of $7,500 is an outlier, since only 0.2%
transactions have an amount higher than $5,000
 Problem: Hard to choose an appropriate bin size for histogram
 Too small bin size → normal objects in empty/rare bins, false positive
 Too big bin size → outliers in some frequent bins, false negative
 Solution: Adopt kernel density estimation to estimate the probability
density distribution of the data. If the estimated density function is high,
the object is likely normal. Otherwise, it is likely an outlier.
865
Chapter 12. Outlier Analysis
 Outlier and Outlier Analysis
 Outlier Detection Methods
 Statistical Approaches
 Proximity-Based Approaches
 Clustering-Based Approaches
 Classification Approaches
 Mining Contextual and Collective Outliers
 Outlier Detection in High Dimensional Data
 Summary
Proximity-Based Approaches: Distance-Based vs.
Density-Based Outlier Detection
 Intuition: Objects that are far away from the others are
outliers
 Assumption of proximity-based approach: The proximity of
an outlier deviates significantly from that of most of the
others in the data set
 Two types of proximity-based outlier detection methods
 Distance-based outlier detection: An object o is an
outlier if its neighborhood does not have enough other
points
 Density-based outlier detection: An object o is an outlier
if its density is relatively much lower than that of its
neighbors
866
Distance-Based Outlier Detection
 For each object o, examine the # of other objects in the r-
neighborhood of o, where r is a user-specified distance threshold
 An object o is an outlier if most (taking π as a fraction threshold) of
the objects in D are far away from o, i.e., not in the r-neighborhood of o
 An object o is a DB(r, π) outlier if  |{o′ | dist(o, o′) ≤ r}| / |D| ≤ π
 Equivalently, one can check the distance between o and its k-th
nearest neighbor ok, where k = ⌈π · |D|⌉; o is an outlier if dist(o, ok) > r
 Efficient computation: Nested loop algorithm
 For any object oi, calculate its distance from other objects, and
count the # of other objects in the r-neighborhood.
 If π∙n other objects are within r distance, terminate the inner loop
 Otherwise, oi is a DB(r, π) outlier
 Efficiency: In practice the CPU time is not O(n²) but roughly linear in the data
set size, since for most non-outlier objects the inner loop terminates early
867
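A direct Python sketch of the nested-loop algorithm above (r and π are the user parameters; the early termination mirrors the bullet points):

import numpy as np

def db_outliers(X, r, pi):
    """Return indices of DB(r, pi) outliers via the nested-loop algorithm."""
    n = len(X)
    threshold = pi * n            # stop counting once enough neighbors are found
    outliers = []
    for i in range(n):
        count = 0
        for j in range(n):
            if i != j and np.linalg.norm(X[i] - X[j]) <= r:
                count += 1
                if count >= threshold:       # early termination of the inner loop
                    break
        else:
            outliers.append(i)               # inner loop ran to completion: outlier
    return outliers

X = np.vstack([np.random.randn(50, 2), [[8.0, 8.0]]])
print(db_outliers(X, r=2.0, pi=0.1))   # index 50 (the isolated point) appears in the output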
Distance-Based Outlier Detection: A Grid-Based Method
 Why is efficiency still a concern? When the complete set of objects
cannot be held in main memory, there is I/O swapping cost
 The major cost: (1) each object tests against the whole data set, why
not only its close neighbor? (2) check objects one by one, why not
group by group?
 Grid-based method (CELL): Data space is partitioned into a multi-D
grid. Each cell is a hyper cube with diagonal length r/2
868

Pruning using the level-1 & level 2 cell properties:
 For any possible point x in cell C and any
possible point y in a level-1 cell, dist(x,y) ≤ r
 For any possible point x in cell C and any point y
such that dist(x,y) ≥ r, y is in a level-2 cell
 Thus we only need to check the objects that cannot be pruned, and
even for such an object o, only need to compute the distance between
o and the objects in the level-2 cells (since beyond level-2, the
distance from o is more than r)
Density-Based Outlier Detection
 Local outliers: Outliers comparing to their local
neighborhoods, instead of the global data
distribution
 In the figure, o1 and o2 are local outliers relative to C1, o3 is a
global outlier, but o4 is not an outlier. However, a distance-based
method using a global threshold cannot identify o1 and o2 as
outliers (e.g., compared with o4).
869
 Intuition (density-based outlier detection): The density around an outlier
object is significantly different from the density around its neighbors
 Method: Use the relative density of an object against its neighbors as
the indicator of the degree of the object being outliers
 k-distance of an object o, distk(o): distance between o and its k-th NN
 k-distance neighborhood of o, Nk(o) = {o’| o’ in D, dist(o, o’) ≤ distk(o)}
 Nk(o) could be bigger than k since multiple objects may have
identical distance to o
Local Outlier Factor: LOF
 Reachability distance from o′ to o: reachdist_k(o ← o′) = max{dist_k(o), dist(o, o′)},
where k is a user-specified parameter
 Local reachability density of o: lrd_k(o) = |N_k(o)| / Σ_{o′ ∈ N_k(o)} reachdist_k(o′ ← o)
870
 LOF (Local Outlier Factor) of an object o is the average ratio between the local
reachability densities of o’s k-nearest neighbors and that of o:
LOF_k(o) = (1/|N_k(o)|) Σ_{o′ ∈ N_k(o)} lrd_k(o′) / lrd_k(o)
 The lower the local reachability density of o, and the higher the local
reachability density of the kNN of o, the higher LOF
 This captures a local outlier whose local density is relatively low
comparing to the local densities of its kNN
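For a quick experiment, scikit-learn's LocalOutlierFactor implements this measure; a small usage sketch (k corresponds to the n_neighbors parameter, and the example data are illustrative):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.vstack([np.random.randn(100, 2),           # a loose cluster
               0.2 * np.random.randn(20, 2) + 5,  # a tight cluster
               [[5.8, 5.8], [2.5, 2.5]]])         # candidate local/global outliers

lof = LocalOutlierFactor(n_neighbors=10)
labels = lof.fit_predict(X)                 # -1 marks predicted outliers
scores = -lof.negative_outlier_factor_      # larger score = more outlying
print(labels[-2:], np.round(scores[-2:], 2))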
871
Chapter 12. Outlier Analysis
 Outlier and Outlier Analysis
 Outlier Detection Methods
 Statistical Approaches
 Proximity-Based Approaches
 Clustering-Based Approaches
 Classification Approaches
 Mining Contextual and Collective Outliers
 Outlier Detection in High Dimensional Data
 Summary
Clustering-Based Outlier Detection (1 & 2):
Not belong to any cluster, or far from the closest one
 An object is an outlier if (1) it does not belong to any cluster, (2) there is
a large distance between the object and its closest cluster , or (3) it
belongs to a small or sparse cluster
 Case I: Not belong to any cluster
 Identify animals not part of a flock: Using a density-
based clustering method such as DBSCAN
 Case 2: Far from its closest cluster
 Using k-means, partition the data points into clusters
 For each object o, assign an outlier score based on
its distance from its closest center
 If dist(o, co)/avg_dist(co) is large, likely an outlier
 Ex. Intrusion detection: Consider the similarity between
data points and the clusters in a training data set
 Use a training set to find patterns of “normal” data, e.g., frequent
itemsets in each segment, and cluster similar connections into groups
 Compare new data points with the clusters mined—Outliers are
possible attacks 872
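A minimal sketch of the Case-2 score dist(o, co)/avg_dist(co) above, using scikit-learn's KMeans; the library choice, the data, and the number of clusters are illustrative assumptions:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),
               rng.normal(6, 1, (200, 2)),
               [[15.0, 15.0]]])                       # a far-away point

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
centers = km.cluster_centers_[km.labels_]             # closest center c_o of each object
dist_to_center = np.linalg.norm(X - centers, axis=1)

# avg_dist(c_o): average distance of a cluster's members to its center
avg_dist = np.array([dist_to_center[km.labels_ == c].mean()
                     for c in range(km.n_clusters)])
score = dist_to_center / avg_dist[km.labels_]         # dist(o, c_o) / avg_dist(c_o)

print("most outlying index:", int(np.argmax(score)))  # expected: the far-away point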
 FindCBLOF: Detect outliers in small clusters
 Find clusters, and sort them in decreasing size
 To each data point, assign a cluster-based local
outlier factor (CBLOF):
 If object p belongs to a large cluster, CBLOF =
cluster size × similarity between p and its cluster
 If p belongs to a small one, CBLOF = cluster size
× similarity between p and the closest large cluster
873
Clustering-Based Outlier Detection (3):
Detecting Outliers in Small Clusters
 Ex. In the figure, o is an outlier since its closest large cluster is C1, but the
similarity between o and C1 is small. For any point in C3, its closest
large cluster is C2 but its similarity to C2 is low; in addition, |C3| = 3 is small
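A rough sketch in the spirit of CBLOF as summarized above; the use of k-means, the 90% coverage rule for deciding which clusters are "large", and the similarity 1/(1 + distance to center) are all simplifying assumptions of mine. Low scores suggest outliers:

import numpy as np
from sklearn.cluster import KMeans

def cblof_scores(X, n_clusters=5, large_fraction=0.9):
    # Cluster-based local outlier factors (low score = more outlying).
    # "Large" clusters are the biggest ones covering ~large_fraction of the data.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    sizes = np.bincount(km.labels_, minlength=n_clusters)
    order = np.argsort(sizes)[::-1]                      # clusters sorted by decreasing size
    covered = np.cumsum(sizes[order]) / len(X)
    large = set(order[:np.searchsorted(covered, large_fraction) + 1])

    scores = np.empty(len(X))
    for i, p in enumerate(X):
        c = km.labels_[i]
        if c in large:
            d = np.linalg.norm(p - km.cluster_centers_[c])
        else:                                            # small cluster: compare with closest large cluster
            d = min(np.linalg.norm(p - km.cluster_centers_[g]) for g in large)
        scores[i] = sizes[c] * 1.0 / (1.0 + d)           # cluster size x similarity
    return scores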
Clustering-Based Method: Strength and Weakness
 Strength
 Detect outliers without requiring any labeled data
 Work for many types of data
 Clusters can be regarded as summaries of the data
 Once the clusters are obtained, one only needs to compare an object
against the clusters to determine whether it is an outlier (fast)
 Weakness
 Effectiveness depends highly on the clustering method used—they
may not be optimized for outlier detection
 High computational cost: Need to first find clusters
 A method to reduce the cost: Fixed-width clustering

A point is assigned to a cluster if the center of the cluster is
within a pre-defined distance threshold from the point

If a point cannot be assigned to any existing cluster, a new
cluster is created and the distance threshold may be learned
from the training data under certain conditions
875
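A minimal sketch of fixed-width clustering as described above; the single-pass assignment, the fixed threshold w, and keeping the first point of a cluster as its center are simplifying assumptions:

import numpy as np

def fixed_width_clustering(X, w):
    # Assign each point to the first cluster whose center lies within w;
    # otherwise open a new cluster centered at the point (single pass).
    centers, counts, labels = [], [], []
    for p in X:
        for c, center in enumerate(centers):
            if np.linalg.norm(p - center) <= w:
                labels.append(c)
                counts[c] += 1
                break
        else:                                   # no existing cluster is close enough
            centers.append(np.asarray(p, dtype=float))
            counts.append(1)
            labels.append(len(centers) - 1)
    return np.array(labels), np.array(centers), np.array(counts)

# Points ending up in very small clusters are natural outlier candidates
labels, centers, counts = fixed_width_clustering(np.random.randn(500, 2), w=1.0)
print("cluster sizes:", counts)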
Chapter 12. Outlier Analysis
 Outlier and Outlier Analysis
 Outlier Detection Methods
 Statistical Approaches
 Proximity-Based Approaches
 Clustering-Based Approaches
 Classification Approaches
 Mining Contextual and Collective Outliers
 Outlier Detection in High Dimensional Data
 Summary
Classification-Based Method I: One-Class Model
 Idea: Train a classification model that can
distinguish “normal” data from outliers
 A brute-force approach: Consider a training set
that contains samples labeled as “normal” and
others labeled as “outlier”
 But, the training set is typically heavily
biased: # of “normal” samples likely far
exceeds # of outlier samples
 Cannot detect unseen anomalies
876
 One-class model: A classifier is built to describe only the normal class.
 Learn the decision boundary of the normal class using classification
methods such as SVM
 Any samples that do not belong to the normal class (not within the
decision boundary) are declared as outliers
 Adv: can detect new outliers that may not appear close to any outlier
objects in the training set
 Extension: Normal objects may belong to multiple classes
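A small usage sketch of the one-class idea with scikit-learn's OneClassSVM; the library and the nu/gamma settings are assumptions, not something the slides specify:

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.normal(0, 1, size=(500, 2))       # assumed "normal" training samples

clf = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)

X_new = np.array([[0.2, -0.1],                  # looks normal
                  [6.0, 6.0]])                  # far outside the learned boundary
print(clf.predict(X_new))                       # +1 = normal, -1 = outlier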
Classification-Based Method II: Semi-Supervised Learning
 Semi-supervised learning: Combining classification-
based and clustering-based methods
 Method
 Using a clustering-based approach, find a large
cluster, C, and a small cluster, C1
 Since some objects in C carry the label “normal”,
treat all objects in C as normal
 Use the one-class model of this cluster to identify
normal objects in outlier detection

Since some objects in cluster C1 carry the label
“outlier”, declare all objects in C1 as outliers
 Any object that does not fall into the model for C
(such as object a in the figure) is considered an outlier as well
877
 Comments on classification-based outlier detection methods
 Strength: Outlier detection is fast
 Bottleneck: Quality heavily depends on the availability and quality of
the training set; it is often difficult to obtain representative, high-
quality training data
878
Chapter 12. Outlier Analysis
 Outlier and Outlier Analysis
 Outlier Detection Methods
 Statistical Approaches
 Proximity-Based Approaches
 Clustering-Based Approaches
 Classification Approaches
 Mining Contextual and Collective Outliers
 Outlier Detection in High Dimensional Data
 Summary
Mining Contextual Outliers I: Transform into
Conventional Outlier Detection
 If the contexts can be clearly identified, transform the problem into
conventional outlier detection
1. Identify the context of the object using the contextual attributes
2. Calculate the outlier score for the object in the context using a
conventional outlier detection method
 Ex. Detect outlier customers in the context of customer groups
 Contextual attributes: age group, postal code
 Behavioral attributes: # of trans/yr, annual total trans. amount
 Steps: (1) locate c’s context, (2) compare c with the other customers in
the same group, and (3) use a conventional outlier detection method
 If the context contains very few customers, generalize contexts
 Ex. Learn a mixture model U on the contextual attributes, and
another mixture model V of the data on the behavior attributes

Learn a mapping p(Vi|Uj): the probability that a data object o
belonging to cluster Uj on the contextual attributes is generated by
cluster Vi on the behavior attributes
 Outlier score:
879
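A tiny sketch of the transform-to-conventional recipe above: group customers by their contextual attributes, then apply a conventional score (here a z-score) within each group. pandas, the z-score, and the column names are hypothetical illustrative choices:

import pandas as pd

# Hypothetical customer table: contextual vs. behavioral attributes
df = pd.DataFrame({
    "age_group":    ["20s", "20s", "20s", "60s", "60s", "60s"],
    "postal":       ["A", "A", "A", "B", "B", "B"],
    "trans_per_yr": [12, 15, 90, 3, 4, 5],
})

# Step 1: the context is the (age_group, postal) group
grp = df.groupby(["age_group", "postal"])["trans_per_yr"]

# Step 2: conventional score inside each context (here: a z-score)
df["ctx_score"] = (df["trans_per_yr"] - grp.transform("mean")) / grp.transform("std")
print(df.sort_values("ctx_score", ascending=False))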
Mining Contextual Outliers II: Modeling Normal
Behavior with Respect to Contexts
 In some applications, one cannot clearly partition the data into contexts
 Ex. if a customer suddenly purchased a product that is unrelated to
those she recently browsed, it is unclear how many products
browsed earlier should be considered as the context
 Model the “normal” behavior with respect to contexts
 Using a training data set, train a model that predicts the expected
behavior attribute values with respect to the contextual attribute
values
 An object is a contextual outlier if its behavior attribute values
significantly deviate from the values predicted by the model
 Using a prediction model that links the contexts and behavior, these
methods avoid the explicit identification of specific contexts
 Methods: A number of classification and prediction techniques can be
used to build such models, such as regression, Markov models, and
finite state automata
880
Mining Collective Outliers I: On the Set
of “Structured Objects”
 A group of objects is a collective outlier if the objects as a group
deviate significantly from the entire data set
 Need to examine the structure of the data set, i.e., the
relationships between multiple data objects
881
 Each of these structures is inherent to its respective type of data

For temporal data (such as time series and sequences), we explore
the structures formed by time, which occur in segments of the time
series or subsequences

For spatial data, explore local areas

For graph and network data, we explore subgraphs
 Difference from the contextual outlier detection: the structures are
often not explicitly defined, and have to be discovered as part of the
outlier detection process.
 Collective outlier detection methods: two categories

Reduce the problem to conventional outlier detection

Identify structure units, treat each structure unit (e.g.,
subsequence, time series segment, local area, or subgraph) as
a data object, and extract features

Then apply outlier detection to the set of “structured objects”
constructed in this way, using the extracted features
Mining Collective Outliers II: Direct Modeling of
the Expected Behavior of Structure Units
 Models the expected behavior of structure units directly
 Ex. 1. Detect collective outliers in online social network of customers
 Treat each possible subgraph of the network as a structure unit
 Collective outlier: An outlier subgraph in the social network

Small subgraphs that are of very low frequency

Large subgraphs that are surprisingly frequent
 Ex. 2. Detect collective outliers in temporal sequences
 Learn a Markov model from the sequences
 A subsequence can then be declared as a collective outlier if it
significantly deviates from the model
 Collective outlier detection is subtle due to the challenge of exploring
the structures in data
 The exploration typically uses heuristics, and thus may be
application dependent
 The computational cost is often high due to the sophisticated
mining process
882
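A minimal sketch of Ex. 2 above: estimate first-order Markov transition probabilities from training sequences, then score a subsequence by its average transition log-likelihood; the add-alpha smoothing and the scoring rule are my own simplifications:

import numpy as np
from collections import defaultdict

def train_markov(sequences, alpha=1.0):
    # First-order Markov model with add-alpha smoothing.
    counts = defaultdict(lambda: defaultdict(float))
    symbols = set()
    for seq in sequences:
        symbols.update(seq)
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    symbols = sorted(symbols)
    probs = {a: {b: (counts[a][b] + alpha) /
                    (sum(counts[a].values()) + alpha * len(symbols))
                 for b in symbols}
             for a in symbols}
    return probs

def avg_loglik(subseq, probs):
    # Average transition log-likelihood; very low values flag collective outliers.
    ll = [np.log(probs[a][b]) for a, b in zip(subseq, subseq[1:])]
    return float(np.mean(ll))

model = train_markov(["ababababab", "abababab", "aabababab"])
print(avg_loglik("ababab", model))   # typical subsequence: higher (less negative)
print(avg_loglik("bbbbbb", model))   # deviating subsequence: much lower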
883
Chapter 12. Outlier Analysis
 Outlier and Outlier Analysis
 Outlier Detection Methods
 Statistical Approaches
 Proximity-Base Approaches
 Clustering-Base Approaches
 Classification Approaches
 Mining Contextual and Collective Outliers
 Outlier Detection in High Dimensional Data
 Summary
Challenges for Outlier Detection in High-
Dimensional Data
 Interpretation of outliers
 Detecting outliers without saying why they are outliers is not very
useful in high-dimensional settings, because many features (or
dimensions) are involved in a high-dimensional data set
 E.g., report the subspaces that manifest the outliers, or provide an
assessment of the “outlier-ness” of the objects
 Data sparsity
 Data in high-D spaces are often sparse
 The distance between objects becomes heavily dominated by
noise as the dimensionality increases
 Data subspaces
 Adaptive to the subspaces signifying the outliers
 Capturing the local behavior of data
 Scalable with respect to dimensionality
 # of subspaces increases exponentially
884
Approach I: Extending Conventional Outlier
Detection
 Method 1: Detect outliers in the full space, e.g., HilOut Algorithm
 Find distance-based outliers, but use the ranks of distance instead of
the absolute distance in outlier detection
 For each object o, find its k-nearest neighbors: nn1(o), . . . , nnk(o)
 The weight of object o: w(o) = Σi=1..k dist(o, nni(o)), i.e., the sum of its
distances to its k nearest neighbors
 All objects are ranked in weight-descending order
 Top-l objects in weight are output as outliers (l: a user-specified parameter)
 Employ space-filling curves for approximation: scalable in both time
and space w.r.t. data size and dimensionality
 Method 2: Dimensionality reduction
 Works only when, in the lower-dimensional space, normal instances
can still be distinguished from outliers
 PCA: Heuristically, the principal components with low variance are
preferred because, on such dimensions, normal objects are likely
close to each other and outliers often deviate from the majority
885
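A sketch of the Method-1 ranking above (weight = sum of distances to the k nearest neighbors, report the top-l); scikit-learn is used for the kNN search, and the space-filling-curve approximation of HilOut is not reproduced here:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_weight_outliers(X, k=5, l=3):
    # Rank objects by the sum of distances to their k nearest neighbors
    # and return the indices of the top-l heaviest objects.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own 1-NN
    dists, _ = nn.kneighbors(X)
    weights = dists[:, 1:].sum(axis=1)                # drop the zero self-distance
    return np.argsort(weights)[::-1][:l], weights

X = np.vstack([np.random.randn(200, 10), 5 + np.random.randn(2, 10)])
top, w = knn_weight_outliers(X, k=10, l=2)
print("top-l candidates:", top)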
Approach II: Finding Outliers in Subspaces
 Extending conventional outlier detection: Hard for outlier interpretation
 Find outliers in much lower dimensional subspaces: easy to interpret
why and to what extent the object is an outlier
 E.g., find outlier customers in certain subspace: average transaction
amount >> avg. and purchase frequency << avg.
 Ex. A grid-based subspace outlier detection method
 Project data onto various subspaces to find an area whose density is
much lower than average
 Discretize the data into a grid with φ equi-depth regions per dimension
(equi-depth so that each region holds a fraction 1/φ of the data along that
dimension, which makes the expected cell count below meaningful)
 Search for regions that are significantly sparse

Consider a k-d cube: k ranges on k dimensions, with n objects

If objects are independently distributed, the expected number of
objects falling into a k-dimensional region is (1/φ)^k ∙ n = f^k ∙ n, and the
standard deviation is sqrt(f^k ∙ (1 − f^k) ∙ n)

The sparsity coefficient of cube C:
S(C) = (n(C) − f^k ∙ n) / sqrt(f^k ∙ (1 − f^k) ∙ n),
where n(C) is the number of objects falling into C
 If S(C) < 0, C contains fewer objects than expected

The more negative, the sparser C is and the more likely the
objects in C are outliers in the subspace
886
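A small sketch of the sparsity coefficient on a 2-attribute subspace, with equi-depth bins taken from quantiles; the function and variable names are mine:

import numpy as np

def sparsity_coefficients(X2, phi=5):
    # Sparsity coefficient S(C) for each cell of a phi x phi equi-depth grid
    # over a 2-attribute subspace X2; very negative S(C) marks suspicious cells.
    n, k = X2.shape                                   # here k = 2 dimensions
    f = 1.0 / phi
    # Equi-depth bin edges per attribute (quantiles), then cell ids per object
    edges = [np.quantile(X2[:, j], np.linspace(0, 1, phi + 1)[1:-1]) for j in range(k)]
    cell_ids = np.stack([np.searchsorted(edges[j], X2[:, j]) for j in range(k)], axis=1)

    expected = n * f**k
    std = np.sqrt(n * f**k * (1 - f**k))
    scores = {}
    for cid in {tuple(c) for c in cell_ids}:
        count = int(np.sum(np.all(cell_ids == cid, axis=1)))
        scores[cid] = (count - expected) / std        # S(C)
    return scores

X2 = np.random.rand(1000, 2)
S = sparsity_coefficients(X2, phi=5)
print(min(S.items(), key=lambda kv: kv[1]))           # sparsest cell and its S(C)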
Approach III: Modeling High-Dimensional Outliers
 Ex. Angle-based outliers: Kriegel, Schubert, and Zimek [KSZ08]
 For each point o, examine the angle ∠xoy for every pair of points x, y
 For a point in the center of the data (e.g., a), the angles formed differ widely
 For an outlier (e.g., c), the variance of the angles is substantially smaller
 Use the variance of angles at a point to determine whether it is an outlier
 Combine angles and distance to model outliers
 Use the distance-weighted angle variance as the outlier score
 Angle-based outlier factor (ABOF):
ABOF(o) = VAR over all pairs x, y of ⟨x − o, y − o⟩ / (‖x − o‖² ∙ ‖y − o‖²)
 An efficient approximation method has been developed
 It can be generalized to handle arbitrary types of data
887
 Develop new models for high-
dimensional outliers directly
 Avoid proximity measures and adopt
new heuristics that do not deteriorate
in high-dimensional data
(Figure: a set of points forming a cluster, except c, which is an outlier)
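A brute-force sketch of the ABOF idea recalled above; it is O(n^3) and meant only for tiny data sets, and the test data are illustrative:

import numpy as np
from itertools import combinations

def abof(X):
    # Angle-based outlier factor: variance, over all pairs (x, y), of the
    # distance-weighted term <x - o, y - o> / (|x - o|^2 * |y - o|^2).
    # Small ABOF suggests an outlier (the angles barely vary).
    n = len(X)
    scores = np.empty(n)
    for i in range(n):
        terms = []
        for j, k in combinations([m for m in range(n) if m != i], 2):
            a, b = X[j] - X[i], X[k] - X[i]
            terms.append(np.dot(a, b) / (np.dot(a, a) * np.dot(b, b)))
        scores[i] = np.var(terms)
    return scores

X = np.vstack([np.random.randn(30, 2), [[8.0, 8.0]]])   # last point is isolated
print(np.argmin(abof(X)))                               # expected: index 30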
888
Chapter 12. Outlier Analysis
 Outlier and Outlier Analysis
 Outlier Detection Methods
 Statistical Approaches
 Proximity-Base Approaches
 Clustering-Base Approaches
 Classification Approaches
 Mining Contextual and Collective Outliers
 Outlier Detection in High Dimensional Data
 Summary
Summary
 Types of outliers
 global, contextual & collective outliers
 Outlier detection
 supervised, semi-supervised, or unsupervised
 Statistical (or model-based) approaches
 Proximity-based approaches
 Clustering-based approaches
 Classification approaches
 Mining contextual and collective outliers
 Outlier detection in high dimensional data
889
References (1)
 B. Abraham and G.E.P. Box. Bayesian analysis of some outlier problems in time series. Biometrika, 66:229–248,
1979.
 M. Agyemang, K. Barker, and R. Alhajj. A comprehensive survey of numeric and symbolic outlier mining
techniques. Intell. Data Anal., 10:521–538, 2006.
 F. J. Anscombe and I. Guttman. Rejection of outliers. Technometrics, 2:123–147, 1960.
 D. Agarwal. Detecting anomalies in cross-classified streams: a bayesian approach. Knowl. Inf. Syst., 11:29–44,
2006.
 F. Angiulli and C. Pizzuti. Outlier mining in large high-dimensional data sets. TKDE, 2005.
 C. C. Aggarwal and P. S. Yu. Outlier detection for high dimensional data. SIGMOD’01
 R.J. Beckman and R.D. Cook. Outlier...s. Technometrics, 25:119–149, 1983.
 I. Ben-Gal. Outlier detection. In Maimon O. and Rockach L. (eds.) Data Mining and Knowledge Discovery
Handbook: A Complete Guide for Practitioners and Researchers, Kluwer Academic, 2005.
 M. M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander. LOF: Identifying density-based local outliers. SIGMOD’00
 D. Barbará, Y. Li, J. Couto, J.-L. Lin, and S. Jajodia. Bootstrapping a data mining intrusion detection system.
SAC’03
 Z. A. Bakar, R. Mohemad, A. Ahmad, and M. M. Deris. A comparative study for outlier detection techniques in
data mining. IEEE Conf. on Cybernetics and Intelligent Systems, 2006.
 S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a
simple pruning rule. KDD’03
 D. Barbara, N. Wu, and S. Jajodia. Detecting novel network intrusion using bayesian estimators. SDM’01
 V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Computing Surveys, 41:1–58, 2009.
 D. Dasgupta and N.S. Majumdar. Anomaly detection in multidimensional data using negative selection
algorithm. In CEC’02
References (2)
 E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo. A geometric framework for unsupervised anomaly
detection: Detecting intrusions in unlabeled data. In Proc. 2002 Int. Conf. of Data Mining for Security
Applications, 2002.
 E. Eskin. Anomaly detection over noisy data using learned probability distributions. ICML’00
 T. Fawcett and F. Provost. Adaptive fraud detection. Data Mining and Knowledge Discovery, 1:291–316, 1997.
 V. J. Hodge and J. Austin. A survey of outlier detection methodologies. Artif. Intell. Rev., 22:85–126, 2004.
 D. M. Hawkins. Identification of Outliers. Chapman and Hall, London, 1980.
 Z. He, X. Xu, and S. Deng. Discovering cluster-based local outliers. Pattern Recogn. Lett., 24, June, 2003.
 W. Jin, K. H. Tung, and J. Han. Mining top-n local outliers in large databases. KDD’01
 W. Jin, A. K. H. Tung, J. Han, and W. Wang. Ranking outliers using symmetric neighborhood relationship.
PAKDD’06
 E. Knorr and R. Ng. A unified notion of outliers: Properties and computation. KDD’97
 E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB’98
 E. M. Knorr, R. T. Ng, and V. Tucakov. Distance-based outliers: Algorithms and applications. VLDB J., 8:237–253,
2000.
 H.-P. Kriegel, M. Schubert, and A. Zimek. Angle-based outlier detection in high-dimensional data. KDD’08
 M. Markou and S. Singh. Novelty detection: A review—part 1: Statistical approaches. Signal Process., 83:2481–
2497, 2003.
 M. Markou and S. Singh. Novelty detection: A review—part 2: Neural network based approaches. Signal
Process., 83:2499–2521, 2003.
 C. C. Noble and D. J. Cook. Graph-based anomaly detection. KDD’03
References (3)
 S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos. Loci: Fast outlier detection using the local
correlation integral. ICDE’03
 A. Patcha and J.-M. Park. An overview of anomaly detection techniques: Existing solutions and latest
technological trends. Comput. Netw., 51, 2007.
 X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data
Eng., 19, 2007.
 Y. Tao, X. Xiao, and S. Zhou. Mining distance-based outliers from large databases in any metric space. KDD’06
 N. Ye and Q. Chen. An anomaly detection technique based on a chi-square statistic for detecting intrusions into
information systems. Quality and Reliability Engineering International, 17:105–112, 2001.
 B.-K. Yi, N. Sidiropoulos, T. Johnson, H. V. Jagadish, C. Faloutsos, and A. Biliris. Online data mining for co-
evolving time sequences. ICDE’00
Un-Used Slides
893
894
Outlier Discovery:
Statistical Approaches
Assume a model of the underlying distribution that generates the data
set (e.g., a normal distribution)
 Use discordancy tests depending on
 data distribution
 distribution parameter (e.g., mean, variance)
 number of expected outliers
 Drawbacks
 most tests are for a single attribute
 In many cases, data distribution may not be known
895
Outlier Discovery: Distance-Based Approach
 Introduced to counter the main limitations imposed by
statistical methods
 We need multi-dimensional analysis without knowing
data distribution
 Distance-based outlier: A DB(p, D)-outlier is an object O in
a dataset T such that at least a fraction p of the objects in T
lies at a distance greater than D from O
 Algorithms for mining distance-based outliers [Knorr & Ng,
VLDB’98]
 Index-based algorithm
 Nested-loop algorithm
 Cell-based algorithm
896
Density-Based Local
Outlier Detection
 M. M. Breunig, H.-P. Kriegel, R. Ng, J.
Sander. LOF: Identifying Density-Based
Local Outliers. SIGMOD 2000.
 Distance-based outlier detection is based
on global distance distribution
 It has difficulty identifying outliers
if the data are not uniformly distributed
 Ex. C1 contains 400 loosely distributed
points, C2 has 100 tightly condensed
points, 2 outlier points o1, o2
 Distance-based method cannot identify o2
as an outlier
 Need the concept of local
outlier
 Local outlier factor (LOF)
 Assume outlier is not
crisp
 Each point has a LOF
897
Outlier Discovery: Deviation-Based Approach
 Identifies outliers by examining the main characteristics
of objects in a group
 Objects that “deviate” from this description are
considered outliers
 Sequential exception technique
 simulates the way in which humans can distinguish
unusual objects from among a series of supposedly
like objects
 OLAP data cube technique
 uses data cubes to identify regions of anomalies in
large multidimensional data
898
References (1)
 B. Abraham and G.E.P. Box. Bayesian analysis of some outlier problems in time series. Biometrika,
1979.
 Malik Agyemang, Ken Barker, and Rada Alhajj. A comprehensive survey of numeric and symbolic
outlier mining techniques. Intell. Data Anal., 2006.
 Deepak Agarwal. Detecting anomalies in cross-classified streams: a bayesian approach. Knowl. Inf.
Syst., 2006.
 C. C. Aggarwal and P. S. Yu. Outlier detection for high dimensional data. SIGMOD'01.
 M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. Optics-of: Identifying local outliers. PKDD '99
 M. M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander. LOF: Identifying density-based local outliers.
SIGMOD'00.
 V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Comput. Surv., 2009.
 D. Dasgupta and N.S. Majumdar. Anomaly detection in multidimensional data using negative
selection algorithm. Computational Intelligence, 2002.
 E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo. A geometric framework for unsupervised
anomaly detection: Detecting intrusions in unlabeled data. In Proc. 2002 Int. Conf. of Data Mining
for Security Applications, 2002.
 E. Eskin. Anomaly detection over noisy data using learned probability distributions. ICML’00.
 T. Fawcett and F. Provost. Adaptive fraud detection. Data Mining and Knowledge Discovery, 1997.
 R. Fujimaki, T. Yairi, and K. Machida. An approach to spacecraft anomaly detection problem using
kernel feature space. KDD '05
 F. E. Grubbs. Procedures for detecting outlying observations in samples. Technometrics, 1969.
899
References (2)
 V. Hodge and J. Austin. A survey of outlier detection methodologies. Artif. Intell. Rev., 2004.
 Douglas M Hawkins. Identification of Outliers. Chapman and Hall, 1980.
 P. S. Horn, L. Feng, Y. Li, and A. J. Pesce. Effect of Outliers and Nonhealthy Individuals on Reference
Interval Estimation. Clin Chem, 2001.
 W. Jin, A. K. H. Tung, J. Han, and W. Wang. Ranking outliers using symmetric neighborhood
relationship. PAKDD'06
 E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB’98
 M. Markou and S. Singh. Novelty detection: a review—part 1: statistical approaches. Signal
Process., 83(12), 2003.
 M. Markou and S. Singh. Novelty detection: a review—part 2: neural network based approaches.
Signal Process., 83(12), 2003.
 S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos. Loci: Fast outlier detection using
the local correlation integral. ICDE'03.
 A. Patcha and J.-M. Park. An overview of anomaly detection techniques: Existing solutions and
latest technological trends. Comput. Netw., 51(12):3448–3470, 2007.
 W. Stefansky. Rejecting outliers in factorial designs. Technometrics, 14(2):469–479, 1972.
 X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl.
and Data Eng., 19(5):631–645, 2007.
 Y. Tao, X. Xiao, and S. Zhou. Mining distance-based outliers from large databases in any metric
space. KDD '06:
 N. Ye and Q. Chen. An anomaly detection technique based on a chi-square statistic for detecting
intrusions into information systems. Quality and Reliability Engineering International, 2001.
Data Mining:
Concepts and Techniques
(3rd
ed.)
— Chapter 13 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
902
Chapter 13: Data Mining Trends and
Research Frontiers
 Mining Complex Types of Data
 Other Methodologies of Data Mining
 Data Mining Applications
 Data Mining and Society
 Data Mining Trends
 Summary
903
Mining Complex Types of Data
 Mining Sequence Data
 Mining Time Series
 Mining Symbolic Sequences
 Mining Biological Sequences
 Mining Graphs and Networks
 Mining Other Kinds of Data
904
Mining Sequence Data
 Similarity Search in Time Series Data
 Subsequence match, dimensionality reduction, query-based
similarity search, motif-based similarity search
 Regression and Trend Analysis in Time-Series Data
 long term + cyclic + seasonal variation + random movements
 Sequential Pattern Mining in Symbolic Sequences
 GSP, PrefixSpan, constraint-based sequential pattern mining
 Sequence Classification
 Feature-based vs. sequence-distance-based vs. model-based
 Alignment of Biological Sequences
 Pair-wise vs. multi-sequence alignment, substitution matrices, BLAST
 Hidden Markov Model for Biological Sequence Analysis
 Markov chain vs. hidden Markov models, forward vs. Viterbi vs.
Baum-Welch algorithms
905
Mining Graphs and Networks
 Graph Pattern Mining
 Frequent subgraph patterns, closed graph patterns, gSpan vs.
CloseGraph
 Statistical Modeling of Networks
 Small world phenomenon, power law (long-tail) distribution,
densification
 Clustering and Classification of Graphs and Homogeneous Networks
 Clustering: Fast Modularity vs. SCAN
 Classification: model vs. pattern-based mining
 Clustering, Ranking and Classification of Heterogeneous Networks
 RankClus, RankClass, and meta path-based, user-guided methodology
 Role Discovery and Link Prediction in Information Networks
 PathPredict
 Similarity Search and OLAP in Information Networks: PathSim, GraphCube
 Evolution of Social and Information Networks: EvoNetClus
906
Mining Other Kinds of Data
 Mining Spatial Data
 Spatial frequent/co-located patterns, spatial clustering and
classification
 Mining Spatiotemporal and Moving Object Data
 Spatiotemporal data mining, trajectory mining, periodica, swarm, …
 Mining Cyber-Physical System Data
 Applications: healthcare, air-traffic control, flood simulation
 Mining Multimedia Data
 Social media data, geo-tagged spatial clustering, periodicity analysis, …
 Mining Text Data
 Topic modeling, i-topic model, integration with geo- and networked
data
 Mining Web Data
 Web content, web structure, and web usage mining
 Mining Data Streams

907
Chapter 13: Data Mining Trends and
Research Frontiers
 Mining Complex Types of Data
 Other Methodologies of Data Mining
 Data Mining Applications
 Data Mining and Society
 Data Mining Trends
 Summary
908
Other Methodologies of Data Mining
 Statistical Data Mining
 Views on Data Mining Foundations
 Visual and Audio Data Mining
909
Major Statistical Data Mining Methods
 Regression
 Generalized Linear Model
 Analysis of Variance
 Mixed-Effect Models
 Factor Analysis
 Discriminant Analysis
 Survival Analysis
910
Statistical Data Mining (1)
 There are many well-established statistical techniques for data
analysis, particularly for numeric data
 applied extensively to data from scientific experiments and
data from economics and the social sciences
 Regression
 predict the value of a response
(dependent) variable from one or
more predictor (independent)
variables where the variables are
numeric
 forms of regression: linear,
multiple, weighted, polynomial,
nonparametric, and robust
911
Scientific and Statistical Data Mining (2)
 Generalized linear models
 allow a categorical response variable
(or some transformation of it) to be
related to a set of predictor variables
 similar to the modeling of a numeric
response variable using linear
regression
 include logistic regression and Poisson
regression
 Mixed-effect models

For analyzing grouped data, i.e. data that can be classified
according to one or more grouping variables
 Typically describe relationships between a response variable
and some covariates in data grouped according to one or more
factors
912
Scientific and Statistical Data Mining (3)
 Regression trees
 Binary trees used for classification
and prediction
 Similar to decision trees: tests are
performed at the internal nodes
 In a regression tree the mean of
the objective attribute is computed
and used as the predicted value
 Analysis of variance
 Analyze experimental data for two
or more populations described by a
numeric response variable and one
or more categorical variables
(factors)
913
Statistical Data Mining (4)
 Factor analysis
 determine which variables are
combined to generate a given
factor
 e.g., for many psychiatric data,
one can indirectly measure other
quantities (such as test scores)
that reflect the factor of interest
 Discriminant analysis
 predict a categorical response
variable, commonly used in social
science
 Attempts to determine several
discriminant functions (linear
combinations of the independent
variables) that discriminate
among the groups defined by the
response variable www.spss.com/datamine/factor.htm
914
Statistical Data Mining (5)
 Time series: many methods such as autoregression,
ARIMA (Autoregressive integrated moving-average
modeling), long memory time-series modeling
 Quality control: displays group summary charts
 Survival analysis
 Predicts the
probability that a
patient undergoing a
medical treatment
would survive at least
to time t (life span
prediction)
915
Other Methodologies of Data Mining
 Statistical Data Mining
 Views on Data Mining Foundations
 Visual and Audio Data Mining
916
Views on Data Mining Foundations (I)
 Data reduction
 Basis of data mining: Reduce data representation
 Trades accuracy for speed in response
 Data compression
 Basis of data mining: Compress the given data by
encoding in terms of bits, association rules, decision
trees, clusters, etc.
 Probability and statistical theory
 Basis of data mining: Discover joint probability
distributions of random variables
917
 Microeconomic view
 A view of utility: Finding patterns that are interesting only to the
extent that they can be used in the decision-making process
of some enterprise
 Pattern Discovery and Inductive databases
 Basis of data mining: Discover patterns occurring in the
database, such as associations, classification models,
sequential patterns, etc.
 Data mining is the problem of performing inductive logic on
databases
 The task is to query the data and the theory (i.e., patterns) of
the database
 Popular among many researchers in database systems
Views on Data Mining Foundations (II)
918
Other Methodologies of Data Mining
 Statistical Data Mining
 Views on Data Mining Foundations
 Visual and Audio Data Mining
919
Visual Data Mining
 Visualization: Use of computer graphics to create visual
images which aid in the understanding of complex,
often massive representations of data
 Visual Data Mining: discovering implicit but useful
knowledge from large data sets using visualization
techniques
(Figure: Visual data mining lies at the confluence of computer graphics,
high-performance computing, pattern recognition, human-computer
interfaces, and multimedia systems)
920
Visualization
 Purpose of Visualization
 Gain insight into an information space by mapping
data onto graphical primitives
 Provide qualitative overview of large data sets
 Search for patterns, trends, structure, irregularities,
relationships among data.
 Help find interesting regions and suitable
parameters for further quantitative analysis.
 Provide a visual proof of computer representations
derived
921
Visual Data Mining & Data Visualization
 Integration of visualization and data mining
 data visualization
 data mining result visualization
 data mining process visualization
 interactive visual data mining
 Data visualization
 Data in a database or data warehouse can be
viewed

at different levels of abstraction

as different combinations of attributes or
dimensions
 Data can be presented in various visual forms
922
Data Mining Result Visualization
 Presentation of the results or knowledge obtained
from data mining in visual forms
 Examples
 Scatter plots and boxplots (obtained from
descriptive data mining)
 Decision trees
 Association rules
 Clusters
 Outliers
 Generalized rules
923
Boxplots from Statsoft: Multiple
Variable Combinations
924
Visualization of Data Mining Results in SAS
Enterprise Miner: Scatter Plots
925
Visualization of Association Rules in
SGI/MineSet 3.0
926
Visualization of a Decision Tree in
SGI/MineSet 3.0
927
Visualization of Cluster Grouping in IBM
Intelligent Miner
928
Data Mining Process Visualization
 Presentation of the various processes of data mining
in visual forms so that users can see
 Data extraction process
 Where the data is extracted
 How the data is cleaned, integrated,
preprocessed, and mined
 Method selected for data mining
 Where the results are stored
 How they may be viewed
929
Visualization of Data Mining Processes
by Clementine
Understand
variations with
visualized data
See your solution
discovery
process clearly
930
Interactive Visual Data Mining
 Using visualization tools in the data mining process to
help users make smart data mining decisions
 Example
 Display the data distribution in a set of attributes
using colored sectors or columns (depending on
whether the whole space is represented by either a
circle or a set of columns)
 Use the display to decide which sector should first be
selected for classification and where a good split
point for this sector may be
931
Interactive Visual Mining by
Perception-Based Classification (PBC)
932
Audio Data Mining
 Uses audio signals to indicate the patterns of data or
the features of data mining results
 An interesting alternative to visual mining
 The inverse task, mining audio (such as music)
databases, is to find patterns from audio data
 Visual data mining may disclose interesting patterns
using graphical displays, but requires users to
concentrate on watching patterns
 Instead, transform patterns into sound and music
and listen to pitches, rhythms, tune, and melody in
order to identify anything interesting or unusual
933
Chapter 13: Data Mining Trends and
Research Frontiers
 Mining Complex Types of Data
 Other Methodologies of Data Mining
 Data Mining Applications
 Data Mining and Society
 Data Mining Trends
 Summary
934
Data Mining Applications
 Data mining: A young discipline with broad and
diverse applications
 There still exists a nontrivial gap between generic
data mining methods and effective and scalable
data mining tools for domain-specific applications
 Some application domains (briefly discussed here)
 Data Mining for Financial data analysis
 Data Mining for Retail and Telecommunication
Industries
 Data Mining in Science and Engineering
 Data Mining for Intrusion Detection and Prevention
 Data Mining and Recommender Systems
935
Data Mining for Financial Data Analysis (I)
 Financial data collected in banks and financial
institutions are often relatively complete, reliable, and
of high quality
 Design and construction of data warehouses for
multidimensional data analysis and data mining
 View the debt and revenue changes by month, by
region, by sector, and by other factors
 Access statistical information such as max, min,
total, average, trend, etc.
 Loan payment prediction/consumer credit policy
analysis
 feature selection and attribute relevance ranking
 Loan payment performance
936
 Classification and clustering of customers for targeted
marketing
 multidimensional segmentation by nearest-
neighbor, classification, decision trees, etc. to
identify customer groups or associate a new
customer to an appropriate customer group
 Detection of money laundering and other financial
crimes
 integration of data from multiple DBs (e.g., bank
transactions, federal/state crime history DBs)
 Tools: data visualization, linkage analysis,
classification, clustering tools, outlier analysis, and
sequential pattern analysis tools (find unusual
access sequences)
Data Mining for Financial Data Analysis (II)
937
Data Mining for Retail & Telcomm. Industries (I)
 Retail industry: huge amounts of data on sales,
customer shopping history, e-commerce, etc.
 Applications of retail data mining
 Identify customer buying behaviors
 Discover customer shopping patterns and trends
 Improve the quality of customer service
 Achieve better customer retention and satisfaction
 Enhance goods consumption ratios
 Design more effective goods transportation and
distribution policies
 Telcomm. and many other industries: Share many
similar goals and expectations of retail data mining
938
Data Mining Practice for Retail Industry
 Design and construction of data warehouses
 Multidimensional analysis of sales, customers, products, time,
and region
 Analysis of the effectiveness of sales campaigns
 Customer retention: Analysis of customer loyalty
 Use customer loyalty card information to register sequences
of purchases of particular customers
 Use sequential pattern mining to investigate changes in
customer consumption or loyalty
 Suggest adjustments on the pricing and variety of goods
 Product recommendation and cross-reference of items
 Fraud analysis and the identification of unusual patterns
 Use of visualization tools in data analysis
939
Data Mining in Science and Engineering
 Data warehouses and data preprocessing
 Resolving inconsistencies or incompatible data collected in
diverse environments and different periods (e.g. eco-system
studies)
 Mining complex data types
 Spatiotemporal, biological, diverse semantics and
relationships
 Graph-based and network-based mining
 Links, relationships, data flow, etc.
 Visualization tools and domain-specific knowledge
 Other issues
 Data mining in social sciences and social studies: text and
social media
 Data mining in computer science: monitoring systems,
940
Data Mining for Intrusion Detection and
Prevention
 Majority of intrusion detection and prevention systems use
 Signature-based detection: use signatures, attack patterns that
are preconfigured and predetermined by domain experts
 Anomaly-based detection: build profiles (models of normal
behavior) and detect behavior that deviates substantially from
the profiles
 How data mining can help
 New data mining algorithms for intrusion detection
 Association, correlation, and discriminative pattern analysis
help select and build discriminative classifiers
 Analysis of stream data: outlier detection, clustering, model
shifting
 Distributed data mining
 Visualization and querying tools
941
Data Mining and Recommender Systems
 Recommender systems: Personalization, making product
recommendations that are likely to be of interest to a user
 Approaches: Content-based, collaborative, or their hybrid
 Content-based: Recommends items that are similar to items
the user preferred or queried in the past
 Collaborative filtering: Consider a user's social environment,
opinions of other customers who have similar tastes or
preferences
 Data mining and recommender systems
 Users C × items S: extrapolate from the known ratings to the unknown
ratings to predict user-item combinations
 Memory-based method often uses k-nearest neighbor
approach
 Model-based method uses a collection of ratings to learn a
model (e.g., probabilistic models, clustering, Bayesian
networks, etc.)
942
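A toy sketch of the memory-based (user-based k-nearest-neighbor) approach mentioned above: predict an unknown rating as the similarity-weighted average of the ratings given by the most similar users. The rating matrix is hypothetical, and plain cosine similarity over the raw rating vectors (zeros included) is a simplification:

import numpy as np

# Rows = users, columns = items; 0 means "not rated" (hypothetical data)
R = np.array([[5, 4, 0, 1],
              [4, 5, 4, 1],
              [1, 1, 0, 5],
              [1, 2, 1, 4]], dtype=float)

def predict(R, user, item, k=2):
    # Predict R[user, item] from the k most similar users who rated the item.
    mask = R[:, item] > 0                               # users who rated this item
    mask[user] = False
    candidates = np.where(mask)[0]
    sims = np.array([np.dot(R[user], R[u]) /
                     (np.linalg.norm(R[user]) * np.linalg.norm(R[u]) + 1e-9)
                     for u in candidates])
    top = candidates[np.argsort(sims)[::-1][:k]]
    top_sims = np.sort(sims)[::-1][:k]
    return float(np.dot(top_sims, R[top, item]) / (top_sims.sum() + 1e-9))

print(predict(R, user=0, item=2))    # user 0 never rated item 2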
Chapter 13: Data Mining Trends and
Research Frontiers
 Mining Complex Types of Data
 Other Methodologies of Data Mining
 Data Mining Applications
 Data Mining and Society
 Data Mining Trends
 Summary
943
Ubiquitous and Invisible Data Mining
 Ubiquitous Data Mining
 Data mining is used everywhere, e.g., online shopping
 Ex. Customer relationship management (CRM)
 Invisible Data Mining
 Invisible: Data mining functions are built into daily life
operations
 Ex. Google search: Users may be unaware that they are
examining results returned by data mining
 Invisible data mining is highly desirable
 Invisible mining needs to consider efficiency and scalability,
user interaction, incorporation of background knowledge and
visualization techniques, finding interesting patterns, real-
time, …
 Further work: Integration of data mining into existing
business and scientific technologies to provide domain-
944
Privacy, Security and Social Impacts of Data
Mining
 Many data mining applications do not touch personal data
 E.g., meteorology, astronomy, geography, geology, biology, and
other scientific and engineering data
 Many DM studies are on developing scalable algorithms to find
general or statistically significant patterns, not touching individuals
 The real privacy concern: unconstrained access of individual
records, especially privacy-sensitive information
 Method 1: Removing sensitive IDs associated with the data
 Method 2: Data security-enhancing methods
 Multi-level security model: permit access only to the
authorized level
 Encryption: e.g., blind signatures, biometric encryption, and
anonymous databases (personal information is encrypted and
stored at different locations)
 Method 3: Privacy-preserving data mining methods
945
Privacy-Preserving Data Mining
 Privacy-preserving (privacy-enhanced or privacy-sensitive)
mining:
 Obtaining valid mining results without disclosing the
underlying sensitive data values
 Often needs trade-off between information loss and privacy
 Privacy-preserving data mining methods:
 Randomization (e.g., perturbation): Add noise to the data in
order to mask some attribute values of records
 K-anonymity and l-diversity: Alter individual records so that
they cannot be uniquely identified

k-anonymity: Any given record maps onto at least k other records

l-diversity: enforcing intra-group diversity of sensitive values
 Distributed privacy preservation: Data partitioned and
distributed either horizontally, vertically, or a combination of
both
 Downgrading the effectiveness of data mining: The output of
data mining may violate privacy
946
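A tiny sketch of the randomization (perturbation) idea above: release values with additive noise so that individual records are masked while aggregate statistics stay usable; the noise scale is an arbitrary illustrative choice:

import numpy as np

rng = np.random.RandomState(7)
salaries = rng.normal(60_000, 10_000, size=10_000)     # sensitive attribute

noise = rng.normal(0, 15_000, size=salaries.shape)     # masking noise
released = salaries + noise                            # what the miner sees

# Individual values are heavily distorted, aggregates much less so
print("true mean:", round(salaries.mean()), " released mean:", round(released.mean()))
print("true value[0]:", round(salaries[0]), " released[0]:", round(released[0]))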
Chapter 13: Data Mining Trends and
Research Frontiers
 Mining Complex Types of Data
 Other Methodologies of Data Mining
 Data Mining Applications
 Data Mining and Society
 Data Mining Trends
 Summary
947
Trends of Data Mining
 Application exploration: Dealing with application-specific
problems
 Scalable and interactive data mining methods
 Integration of data mining with Web search engines, database
systems, data warehouse systems and cloud computing systems
 Mining social and information networks
 Mining spatiotemporal, moving objects and cyber-physical
systems
 Mining multimedia, text and web data
 Mining biological and biomedical data
 Data mining with software engineering and system engineering
 Visual and audio data mining
 Distributed data mining and real-time data stream mining
 Privacy protection and information security in data mining
948
Chapter 13: Data Mining Trends and
Research Frontiers
 Mining Complex Types of Data
 Other Methodologies of Data Mining
 Data Mining Applications
 Data Mining and Society
 Data Mining Trends
 Summary
949
Summary
 We present a high-level overview of mining complex data types
 Statistical data mining methods, such as regression, generalized
linear models, analysis of variance, etc., are popularly adopted
 Researchers also try to build theoretical foundations for data
mining
 Visual/audio data mining has been popular and effective
 Application-based mining integrates domain-specific knowledge
with data analysis techniques and provides mission-specific
solutions
 Ubiquitous data mining and invisible data mining are penetrating
our data lives
 Privacy and data security are important issues in data mining,
and privacy-preserving data mining has been developed recently
 Our discussion on trends in data mining shows that data mining is
950
References and Further Reading
 The book lists many references for further reading; here we only list a few books
 E. Alpaydin. Introduction to Machine Learning, 2nd
ed., MIT Press, 2011
 S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan
Kaufmann, 2002
 R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd ed., Wiley-Interscience, 2000
 D. Easley and J. Kleinberg. Networks, Crowds, and Markets: Reasoning about a Highly Connected
World. Cambridge University Press, 2010.
 U. Fayyad, G. Grinstein, and A. Wierse (eds.), Information Visualization in Data Mining and
Knowledge Discovery, Morgan Kaufmann, 2001
 J. Han, M. Kamber, J. Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd
ed. 2011
 T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference,
and Prediction, 2nd
ed., Springer-Verlag, 2009
 D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press,
2009.
 B. Liu. Web Data Mining, Springer 2006.
 T. M. Mitchell. Machine Learning, McGraw Hill, 1997
 M. Newman. Networks: An Introduction. Oxford University Press, 2010.
 P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005
 I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java
Implementations, Morgan Kaufmann, 2nd
ed. 2005
951

DWDM 3rd EDITION TEXT BOOK SLIDES24.pptx

  • 1.
    1 1 Data Mining: Concepts andTechniques (3rd ed.) — Chapter 1 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2011 Han, Kamber & Pei. All rights reserved.
  • 2.
    2 Chapter 1. Introduction Why Data Mining?  What Is Data Mining?  A Multi-Dimensional View of Data Mining  What Kind of Data Can Be Mined?  What Kinds of Patterns Can Be Mined?  What Technology Are Used?  What Kind of Applications Are Targeted?  Major Issues in Data Mining  A Brief History of Data Mining and Data Mining Society  Summary
  • 3.
    3 Why Data Mining? The Explosive Growth of Data: from terabytes to petabytes  Data collection and data availability  Automated data collection tools, database systems, Web, computerized society  Major sources of abundant data  Business: Web, e-commerce, transactions, stocks, …  Science: Remote sensing, bioinformatics, scientific simulation, …  Society and everyone: news, digital cameras, YouTube  We are drowning in data, but starving for knowledge!  “Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets
  • 4.
    4 Evolution of Sciences Before 1600, empirical science  1600-1950s, theoretical science  Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.  1950s-1990s, computational science  Over the last 50 years, most disciplines have grown a third, computational branch (e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.)  Computational Science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.  1990-now, data science  The flood of data from new scientific instruments and simulations  The ability to economically store and manage petabytes of data online  The Internet and computing Grid that makes all these archives universally accessible  Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge!  Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science,
  • 5.
    5 Evolution of DatabaseTechnology  1960s:  Data collection, database creation, IMS and network DBMS  1970s:  Relational data model, relational DBMS implementation  1980s:  RDBMS, advanced data models (extended-relational, OO, deductive, etc.)  Application-oriented DBMS (spatial, scientific, engineering, etc.)  1990s:  Data mining, data warehousing, multimedia databases, and Web databases  2000s  Stream data management and mining  Data mining and its applications  Web technology (XML, data integration) and global information systems
  • 6.
    6 Chapter 1. Introduction Why Data Mining?  What Is Data Mining?  A Multi-Dimensional View of Data Mining  What Kind of Data Can Be Mined?  What Kinds of Patterns Can Be Mined?  What Technology Are Used?  What Kind of Applications Are Targeted?  Major Issues in Data Mining  A Brief History of Data Mining and Data Mining Society  Summary
  • 7.
    7 What Is DataMining?  Data mining (knowledge discovery from data)  Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data  Data mining: a misnomer?  Alternative names  Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.  Watch out: Is everything “data mining”?  Simple search and query processing  (Deductive) expert systems
  • 8.
    8 Knowledge Discovery (KDD)Process  This is a view from typical database systems and data warehousing communities  Data mining plays an essential role in the knowledge discovery process Data Cleaning Data Integration Databases Data Warehouse Task-relevant Data Selection Data Mining Pattern Evaluation
  • 9.
    9 Example: A WebMining Framework  Web mining usually involves  Data cleaning  Data integration from multiple sources  Warehousing the data  Data cube construction  Data selection for data mining  Data mining  Presentation of the mining results  Patterns and knowledge to be used or stored into knowledge-base
  • 10.
    10 Data Mining inBusiness Intelligence Increasing potential to support business decisions End User Business Analyst Data Analyst DBA Decision Making Data Presentation Visualization Techniques Data Mining Information Discovery Data Exploration Statistical Summary, Querying, and Reporting Data Preprocessing/Integration, Data Warehouses Data Sources Paper, Files, Web documents, Scientific experiments, Database Systems
  • 11.
    11 Example: Mining vs.Data Exploration  Business intelligence view  Warehouse, data cube, reporting but not much mining  Business objects vs. data mining tools  Supply chain example: tools  Data presentation  Exploration
  • 12.
    12 KDD Process: ATypical View from ML and Statistics Input Data Data Mining Data Pre- Processing Post- Processing  This is a view from typical machine learning and statistics communities Data integration Normalization Feature selection Dimension reduction Pattern discovery Association & correlation Classification Clustering Outlier analysis … … … … Pattern evaluation Pattern selection Pattern interpretation Pattern visualization
  • 13.
    13 Example: Medical DataMining  Health care & medical data mining – often adopted such a view in statistics and machine learning  Preprocessing of the data (including feature extraction and dimension reduction)  Classification or/and clustering processes  Post-processing for presentation
  • 14.
    14 Chapter 1. Introduction Why Data Mining?  What Is Data Mining?  A Multi-Dimensional View of Data Mining  What Kind of Data Can Be Mined?  What Kinds of Patterns Can Be Mined?  What Technology Are Used?  What Kind of Applications Are Targeted?  Major Issues in Data Mining  A Brief History of Data Mining and Data Mining Society  Summary
  • 15.
    15 Multi-Dimensional View ofData Mining  Data to be mined  Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks  Knowledge to be mined (or: Data mining functions)  Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.  Descriptive vs. predictive data mining  Multiple/integrated functions and mining at multiple levels  Techniques utilized  Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.  Applications adapted  Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
  • 16.
    16 Chapter 1. Introduction Why Data Mining?  What Is Data Mining?  A Multi-Dimensional View of Data Mining  What Kind of Data Can Be Mined?  What Kinds of Patterns Can Be Mined?  What Technology Are Used?  What Kind of Applications Are Targeted?  Major Issues in Data Mining  A Brief History of Data Mining and Data Mining Society  Summary
  • 17.
    17 Data Mining: OnWhat Kinds of Data?  Database-oriented data sets and applications  Relational database, data warehouse, transactional database  Advanced data sets and advanced applications  Data streams and sensor data  Time-series data, temporal data, sequence data (incl. bio-sequences)  Structure data, graphs, social networks and multi-linked data  Object-relational databases  Heterogeneous databases and legacy databases  Spatial data and spatiotemporal data  Multimedia database  Text databases  The World-Wide Web
  • 18.
    18 Chapter 1. Introduction Why Data Mining?  What Is Data Mining?  A Multi-Dimensional View of Data Mining  What Kind of Data Can Be Mined?  What Kinds of Patterns Can Be Mined?  What Technology Are Used?  What Kind of Applications Are Targeted?  Major Issues in Data Mining  A Brief History of Data Mining and Data Mining Society  Summary
  • 19.
    19 Data Mining Function:(1) Generalization  Information integration and data warehouse construction  Data cleaning, transformation, integration, and multidimensional data model  Data cube technology  Scalable methods for computing (i.e., materializing) multidimensional aggregates  OLAP (online analytical processing)  Multidimensional concept description: Characterization and discrimination  Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet region
  • 20.
    20 Data Mining Function:(2) Association and Correlation Analysis  Frequent patterns (or frequent itemsets)  What items are frequently purchased together in your Walmart?  Association, correlation vs. causality  A typical association rule  Diaper  Beer [0.5%, 75%] (support, confidence)  Are strongly associated items also strongly correlated?  How to mine such patterns and rules efficiently in large datasets?  How to use such patterns for classification, clustering,
  • 21.
    21 Data Mining Function:(3) Classification  Classification and label prediction  Construct models (functions) based on some training examples  Describe and distinguish classes or concepts for future prediction  E.g., classify countries based on (climate), or classify cars based on (gas mileage)  Predict some unknown class labels  Typical methods  Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern- based classification, logistic regression, …  Typical applications:  Credit card fraud detection, direct marketing, classifying stars, diseases, web-pages, …
  • 22.
    22 Data Mining Function:(4) Cluster Analysis  Unsupervised learning (i.e., Class label is unknown)  Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns  Principle: Maximizing intra-class similarity & minimizing interclass similarity  Many methods and applications
  • 23.
    23 Data Mining Function:(5) Outlier Analysis  Outlier analysis  Outlier: A data object that does not comply with the general behavior of the data  Noise or exception? ― One person’s garbage could be another person’s treasure  Methods: by product of clustering or regression analysis, …  Useful in fraud detection, rare events analysis
  • 24.
    24 Time and Ordering:Sequential Pattern, Trend and Evolution Analysis  Sequence, trend and evolution analysis  Trend, time-series, and deviation analysis: e.g., regression and value prediction  Sequential pattern mining  e.g., first buy digital camera, then buy large SD memory cards  Periodicity analysis  Motifs and biological sequence analysis  Approximate and consecutive motifs  Similarity-based analysis  Mining data streams  Ordered, time-varying, potentially infinite, data streams
  • 25.
    25 Structure and NetworkAnalysis  Graph mining  Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (web fragments)  Information network analysis  Social networks: actors (objects, nodes) and relationships (edges)  e.g., author networks in CS, terrorist networks  Multiple heterogeneous networks  A person could be multiple information networks: friends, family, classmates, …  Links carry a lot of semantic information: Link mining  Web mining  Web is a big information network: from PageRank to Google  Analysis of Web information networks  Web community discovery, opinion mining, usage mining, …
  • 26.
    26 Evaluation of Knowledge Are all mined knowledge interesting?  One can mine tremendous amount of “patterns” and knowledge  Some may fit only certain dimension space (time, location, …)  Some may not be representative, may be transient, …  Evaluation of mined knowledge → directly mine only interesting knowledge?  Descriptive vs. predictive  Coverage  Typicality vs. novelty  Accuracy  Timeliness  …
  • 27.
    27 Chapter 1. Introduction Why Data Mining?  What Is Data Mining?  A Multi-Dimensional View of Data Mining  What Kind of Data Can Be Mined?  What Kinds of Patterns Can Be Mined?  What Technology Are Used?  What Kind of Applications Are Targeted?  Major Issues in Data Mining  A Brief History of Data Mining and Data Mining Society  Summary
  • 28.
    28 Data Mining: Confluenceof Multiple Disciplines Data Mining Machine Learning Statistics Applications Algorithm Pattern Recognition High-Performance Computing Visualization Database Technology
  • 29.
29 Why Confluence of Multiple Disciplines?  Tremendous amount of data  Algorithms must be highly scalable to handle terabytes of data  High-dimensionality of data  Micro-arrays may have tens of thousands of dimensions  High complexity of data  Data streams and sensor data  Time-series data, temporal data, sequence data  Structured data, graphs, social networks and multi-linked data  Heterogeneous databases and legacy databases  Spatial, spatiotemporal, multimedia, text and Web data  Software programs, scientific simulations  New and sophisticated applications
  • 30.
    30 Chapter 1. Introduction Why Data Mining?  What Is Data Mining?  A Multi-Dimensional View of Data Mining  What Kind of Data Can Be Mined?  What Kinds of Patterns Can Be Mined?  What Technology Are Used?  What Kind of Applications Are Targeted?  Major Issues in Data Mining  A Brief History of Data Mining and Data Mining Society  Summary
  • 31.
    31 Applications of DataMining  Web page analysis: from web page classification, clustering to PageRank & HITS algorithms  Collaborative analysis & recommender systems  Basket data analysis to targeted marketing  Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis  Data mining and software engineering (e.g., IEEE Computer, Aug. 2009 issue)  From major dedicated data mining systems/tools (e.g., SAS, MS SQL-Server Analysis Manager, Oracle Data Mining Tools) to invisible data mining
  • 32.
    32 Chapter 1. Introduction Why Data Mining?  What Is Data Mining?  A Multi-Dimensional View of Data Mining  What Kind of Data Can Be Mined?  What Kinds of Patterns Can Be Mined?  What Technology Are Used?  What Kind of Applications Are Targeted?  Major Issues in Data Mining  A Brief History of Data Mining and Data Mining Society  Summary
  • 33.
    33 Major Issues inData Mining (1)  Mining Methodology  Mining various and new kinds of knowledge  Mining knowledge in multi-dimensional space  Data mining: An interdisciplinary effort  Boosting the power of discovery in a networked environment  Handling noise, uncertainty, and incompleteness of data  Pattern evaluation and pattern- or constraint-guided mining  User Interaction  Interactive mining  Incorporation of background knowledge  Presentation and visualization of data mining results
  • 34.
    34 Major Issues inData Mining (2)  Efficiency and Scalability  Efficiency and scalability of data mining algorithms  Parallel, distributed, stream, and incremental mining methods  Diversity of data types  Handling complex types of data  Mining dynamic, networked, and global data repositories  Data mining and society  Social impacts of data mining  Privacy-preserving data mining  Invisible data mining
  • 35.
    35 Chapter 1. Introduction Why Data Mining?  What Is Data Mining?  A Multi-Dimensional View of Data Mining  What Kind of Data Can Be Mined?  What Kinds of Patterns Can Be Mined?  What Technology Are Used?  What Kind of Applications Are Targeted?  Major Issues in Data Mining  A Brief History of Data Mining and Data Mining Society  Summary
  • 36.
    36 A Brief Historyof Data Mining Society  1989 IJCAI Workshop on Knowledge Discovery in Databases  Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)  1991-1994 Workshops on Knowledge Discovery in Databases  Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)  1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)  Journal of Data Mining and Knowledge Discovery (1997)  ACM SIGKDD conferences since 1998 and SIGKDD Explorations  More conferences on data mining  PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.  ACM Transactions on KDD starting in 2007
  • 37.
    37 Conferences and Journalson Data Mining  KDD Conferences  ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)  SIAM Data Mining Conf. (SDM)  (IEEE) Int. Conf. on Data Mining (ICDM)  European Conf. on Machine Learning and Principles and practices of Knowledge Discovery and Data Mining (ECML-PKDD)  Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)  Int. Conf. on Web Search and Data Mining (WSDM)  Other related conferences  DB conferences: ACM SIGMOD, VLDB, ICDE, EDBT, ICDT, …  Web and IR conferences: WWW, SIGIR, WSDM  ML conferences: ICML, NIPS  PR conferences: CVPR,  Journals  Data Mining and Knowledge Discovery (DAMI or DMKD)  IEEE Trans. On Knowledge and Data Eng. (TKDE)  KDD Explorations  ACM Trans. on KDD
  • 38.
    38 Where to FindReferences? DBLP, CiteSeer, Google  Data mining and KDD (SIGKDD: CDROM)  Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.  Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD  Database systems (SIGMOD: ACM SIGMOD Anthology—CD ROM)  Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA  Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.  AI & Machine Learning  Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.  Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.  Web and IR  Conferences: SIGIR, WWW, CIKM, etc.  Journals: WWW: Internet and Web Information Systems,  Statistics  Conferences: Joint Stat. Meeting, etc.  Journals: Annals of statistics, etc.  Visualization  Conference proceedings: CHI, ACM-SIGGraph, etc.  Journals: IEEE Trans. visualization and computer graphics, etc.
  • 39.
    39 Chapter 1. Introduction Why Data Mining?  What Is Data Mining?  A Multi-Dimensional View of Data Mining  What Kind of Data Can Be Mined?  What Kinds of Patterns Can Be Mined?  What Technology Are Used?  What Kind of Applications Are Targeted?  Major Issues in Data Mining  A Brief History of Data Mining and Data Mining Society  Summary
  • 40.
40 Summary  Data mining: discovering interesting patterns and knowledge from massive amounts of data  A natural evolution of database technology, in great demand, with wide applications  A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation  Mining can be performed on a variety of data  Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.  Data mining technologies and applications  Major issues in data mining
  • 41.
    41 Recommended Reference Books S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertex and Semi-Structured Data. Morgan Kaufmann, 2002  R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-Interscience, 2000  T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003  U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996  U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2001  J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd ed., 2011  D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001  T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer-Verlag, 2009  B. Liu, Web Data Mining, Springer 2006.  T. M. Mitchell, Machine Learning, McGraw Hill, 1997  G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991  P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005  S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998  I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2nd ed. 2005
  • 42.
42 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2011 Han, Kamber, and Pei. All rights reserved.
  • 43.
    43 Chapter 2: Gettingto Know Your Data  Data Objects and Attribute Types  Basic Statistical Descriptions of Data  Data Visualization  Measuring Data Similarity and Dissimilarity  Summary
  • 44.
44 Types of Data Sets  Record  Relational records  Data matrix, e.g., numerical matrix, crosstabs  Document data: text documents: term-frequency vector  Transaction data  Graph and network  World Wide Web  Social or information networks  Molecular structures  Ordered  Video data: sequence of images  Temporal data: time-series  Sequential data: transaction sequences  Genetic sequence data  Spatial, image and multimedia:  Spatial data: maps  Image data  Video data

Example term-frequency vectors (terms: team, coach, play, ball, score, game, win, lost, timeout, season):
  Document 1: 3 0 5 0 2 6 0 2 0 2
  Document 2: 0 0 7 0 2 1 0 0 3 0
  Document 3: 0 1 0 0 1 2 2 0 3 0

Example transaction data:
  TID  Items
  1    Bread, Coke, Milk
  2    Beer, Bread
  3    Beer, Coke, Diaper, Milk
  4    Beer, Bread, Diaper, Milk
  5    Coke, Diaper, Milk
  • 45.
    45 Important Characteristics ofStructured Data  Dimensionality  Curse of dimensionality  Sparsity  Only presence counts  Resolution  Patterns depend on the scale  Distribution  Centrality and dispersion
  • 46.
    46 Data Objects  Datasets are made up of data objects.  A data object represents an entity.  Examples:  sales database: customers, store items, sales  medical database: patients, treatments  university database: students, professors, courses  Also called samples , examples, instances, data points, objects, tuples.  Data objects are described by attributes.  Database rows -> data objects; columns ->attributes.
  • 47.
    47 Attributes  Attribute (ordimensions, features, variables): a data field, representing a characteristic or feature of a data object.  E.g., customer _ID, name, address  Types:  Nominal  Binary  Numeric: quantitative  Interval-scaled  Ratio-scaled
  • 48.
    48 Attribute Types  Nominal:categories, states, or “names of things”  Hair_color = {auburn, black, blond, brown, grey, red, white}  marital status, occupation, ID numbers, zip codes  Binary  Nominal attribute with only 2 states (0 and 1)  Symmetric binary: both outcomes equally important  e.g., gender  Asymmetric binary: outcomes not equally important.  e.g., medical test (positive vs. negative)  Convention: assign 1 to most important outcome (e.g., HIV positive)  Ordinal  Values have a meaningful order (ranking) but magnitude between successive values is not known.  Size = {small, medium, large}, grades, army rankings
  • 49.
    49 Numeric Attribute Types Quantity (integer or real-valued)  Interval  Measured on a scale of equal-sized units  Values have order  E.g., temperature in C˚or F˚, calendar dates  No true zero-point  Ratio  Inherent zero-point  We can speak of values as being an order of magnitude larger than the unit of measurement (10 K˚ is twice as high as 5 K˚).  e.g., temperature in Kelvin, length, counts, monetary quantities
  • 50.
    50 Discrete vs. ContinuousAttributes  Discrete Attribute  Has only a finite or countably infinite set of values  E.g., zip codes, profession, or the set of words in a collection of documents  Sometimes, represented as integer variables  Note: Binary attributes are a special case of discrete attributes  Continuous Attribute  Has real numbers as attribute values  E.g., temperature, height, or weight  Practically, real values can only be measured and represented using a finite number of digits  Continuous attributes are typically represented as floating-point variables
  • 51.
    51 Chapter 2: Gettingto Know Your Data  Data Objects and Attribute Types  Basic Statistical Descriptions of Data  Data Visualization  Measuring Data Similarity and Dissimilarity  Summary
  • 52.
    52 Basic Statistical Descriptionsof Data  Motivation  To better understand the data: central tendency, variation and spread  Data dispersion characteristics  median, max, min, quantiles, outliers, variance, etc.  Numerical dimensions correspond to sorted intervals  Data dispersion: analyzed with multiple granularities of precision  Boxplot or quantile analysis on sorted intervals  Dispersion analysis on computed measures  Folding measures into numerical dimensions  Boxplot or quantile analysis on the transformed cube
  • 53.
53 Measuring the Central Tendency  Mean (algebraic measure) (sample vs. population): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, $\mu = \frac{\sum x}{N}$ (note: n is the sample size and N is the population size)  Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$  Trimmed mean: chopping extreme values  Median:  Middle value if odd number of values, or average of the middle two values otherwise  Estimated by interpolation (for grouped data): $median = L_1 + \left(\frac{n/2 - (\sum freq)_l}{freq_{median}}\right) \cdot width$  Mode  Value that occurs most frequently in the data  Unimodal, bimodal, trimodal  Empirical formula: $mean - mode \approx 3 \times (mean - median)$
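A minimal Python sketch (illustration only, not from the book; the salary figures are made up) showing how these central-tendency measures can be computed:

```python
# Central tendency on a small, made-up sample of salaries (in $1000s).
from statistics import mean, median, multimode

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(mean(salaries))       # arithmetic mean: 58
print(median(salaries))     # average of the two middle values: 54
print(multimode(salaries))  # most frequent values (bimodal here): [52, 70]

# Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)
weights = [1] * 11 + [0.5]  # e.g., down-weight the extreme value 110
print(sum(w * x for w, x in zip(weights, salaries)) / sum(weights))

# Trimmed mean: chop off the k smallest and k largest values first
k = 1
print(mean(sorted(salaries)[k:-k]))
```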
  • 54.
54 Symmetric vs. Skewed Data  Median, mean and mode of symmetric, positively skewed, and negatively skewed data  (Figure: symmetric, positively skewed, and negatively skewed distributions)
  • 55.
55 Measuring the Dispersion of Data  Quartiles, outliers and boxplots  Quartiles: Q1 (25th percentile), Q3 (75th percentile)  Inter-quartile range: IQR = Q3 − Q1  Five-number summary: min, Q1, median, Q3, max  Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually  Outlier: usually, a value higher/lower than 1.5 × IQR  Variance and standard deviation (sample: s, population: σ)  Variance (algebraic, scalable computation): $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$, $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$  Standard deviation s (or σ) is the square root of variance s² (or σ²)
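A small Python sketch (illustrative, made-up values) of these dispersion measures, using the standard-library statistics module:

```python
import statistics

data = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]

# Quartiles, inter-quartile range, and the five-number summary
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
print(min(data), q1, q2, q3, max(data), iqr)

# Common outlier rule: values more than 1.5 * IQR beyond the quartiles
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print([x for x in data if x < low or x > high])

print(statistics.variance(data))   # sample variance s^2 (divides by n - 1)
print(statistics.pvariance(data))  # population variance sigma^2 (divides by N)
print(statistics.stdev(data))      # sample standard deviation s
```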
  • 56.
    56 Boxplot Analysis  Five-numbersummary of a distribution  Minimum, Q1, Median, Q3, Maximum  Boxplot  Data is represented with a box  The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR  The median is marked by a line within the box  Whiskers: two lines outside the box extended to Minimum and Maximum  Outliers: points beyond a specified outlier threshold, plotted individually
  • 57.
57 Visualization of Data Dispersion: 3-D Boxplots
  • 58.
    58 Properties of NormalDistribution Curve  The normal (distribution) curve  From μ–σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)  From μ–2σ to μ+2σ: contains about 95% of it  From μ–3σ to μ+3σ: contains about 99.7% of it
  • 59.
    59 Graphic Displays ofBasic Statistical Descriptions  Boxplot: graphic display of five-number summary  Histogram: x-axis are values, y-axis repres. frequencies  Quantile plot: each value xi is paired with fi indicating that approximately 100 fi % of data are  xi  Quantile-quantile (q-q) plot: graphs the quantiles of one univariant distribution against the corresponding quantiles of another  Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane
  • 60.
60 Histogram Analysis  Histogram: graph display of tabulated frequencies, shown as bars  It shows what proportion of cases fall into each of several categories  Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts, a crucial distinction when the categories are not of uniform width  The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent  (Figure: example histogram)
  • 61.
    61 Histograms Often TellMore than Boxplots  The two histograms shown in the left may have the same boxplot representation  The same values for: min, Q1, median, Q3, max  But they have rather different data distributions
  • 62.
62 Quantile Plot  Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)  Plots quantile information  For data xi sorted in increasing order, fi indicates that approximately 100·fi% of the data are below or equal to the value xi
  • 63.
63 Quantile-Quantile (Q-Q) Plot  Graphs the quantiles of one univariate distribution against the corresponding quantiles of another  View: is there a shift in going from one distribution to another?  Example shows unit price of items sold at Branch 1 vs. Branch 2 for each quantile. Unit prices of items sold at Branch 1 tend to be lower than those at Branch 2.
  • 64.
    64 Scatter plot  Providesa first look at bivariate data to see clusters of points, outliers, etc  Each pair of values is treated as a pair of coordinates and plotted as points in the plane
  • 65.
65 Positively and Negatively Correlated Data  The left half fragment is positively correlated  The right half is negatively correlated
  • 66.
  • 67.
    67 Chapter 2: Gettingto Know Your Data  Data Objects and Attribute Types  Basic Statistical Descriptions of Data  Data Visualization  Measuring Data Similarity and Dissimilarity  Summary
  • 68.
    68 Data Visualization  Whydata visualization?  Gain insight into an information space by mapping data onto graphical primitives  Provide qualitative overview of large data sets  Search for patterns, trends, structure, irregularities, relationships among data  Help find interesting regions and suitable parameters for further quantitative analysis  Provide a visual proof of computer representations derived  Categorization of visualization methods:  Pixel-oriented visualization techniques  Geometric projection visualization techniques  Icon-based visualization techniques  Hierarchical visualization techniques  Visualizing complex data and relations
  • 69.
    69 Pixel-Oriented Visualization Techniques For a data set of m dimensions, create m windows on the screen, one for each dimension  The m dimension values of a record are mapped to m pixels at the corresponding positions in the windows  The colors of the pixels reflect the corresponding values (a) Income (b) Credit Limit (c) transaction volume (d) age
  • 70.
    70 Laying Out Pixelsin Circle Segments  To save space and show the connections among multiple dimensions, space filling is often done in a circle segment (a) Representing a data record in circle segment (b) Laying out pixels in circle segment
  • 71.
    71 Geometric Projection VisualizationTechniques  Visualization of geometric transformations and projections of the data  Methods  Direct visualization  Scatterplot and scatterplot matrices  Landscapes  Projection pursuit technique: Help users find meaningful projections of multidimensional data  Prosection views  Hyperslice  Parallel coordinates
  • 72.
72 Direct Data Visualization  (Figure: ribbons with twists based on vorticity)
  • 73.
73 Scatterplot Matrices  Matrix of scatterplots (x-y diagrams) of the k-dimensional data [total of (k² − k)/2 distinct pairwise scatterplots]  (Used by permission of M. Ward, Worcester Polytechnic Institute)
  • 74.
74 Landscapes  Visualization of the data as a perspective landscape  The data needs to be transformed into a (possibly artificial) 2D spatial representation which preserves the characteristics of the data  (Figure: news articles visualized as a landscape; used by permission of B. Wright, Visible Decisions Inc.)
  • 75.
75 Parallel Coordinates  n equidistant axes (Attr. 1, Attr. 2, Attr. 3, …, Attr. k) which are parallel to one of the screen axes and correspond to the attributes  The axes are scaled to the [minimum, maximum] range of the corresponding attribute  Every data item corresponds to a polygonal line which intersects each of the axes at the point which corresponds to the value for the attribute
  • 76.
  • 77.
    77 Icon-Based Visualization Techniques Visualization of the data values as features of icons  Typical visualization methods  Chernoff Faces  Stick Figures  General techniques  Shape coding: Use shape to represent certain information encoding  Color icons: Use color icons to encode more information  Tile bars: Use small icons to represent the relevant feature vectors in document retrieval
  • 78.
78 Chernoff Faces  A way to display variables on a two-dimensional surface, e.g., let x be eyebrow slant, y be eye size, z be nose length, etc.  The figure shows faces produced using 10 characteristics (head eccentricity, eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, mouth size, and mouth opening): each assigned one of 10 possible values, generated using Mathematica (S. Dickson)  REFERENCE: Gonick, L. and Smith, W. The Cartoon Guide to Statistics. New York: Harper Perennial, p. 212, 1993  Weisstein, Eric W. "Chernoff Face." From MathWorld--A Wolfram Web Resource. mathworld.wolfram.com/ChernoffFace.html
  • 79.
79 Stick Figure  A 5-piece stick figure (1 body and 4 limbs with different angle/length)  Two attributes mapped to axes, remaining attributes mapped to angle or length of limbs  Look at texture patterns  (Figure: a census data set showing age, income, gender, education, etc.; used by permission of G. Grinstein, University of Massachusetts at Lowell)
  • 80.
    80 Hierarchical Visualization Techniques Visualization of the data using a hierarchical partitioning into subspaces  Methods  Dimensional Stacking  Worlds-within-Worlds  Tree-Map  Cone Trees  InfoCube
  • 81.
81 Dimensional Stacking  Partitioning of the n-dimensional attribute space in 2-D subspaces, which are ‘stacked’ into each other  Partitioning of the attribute value ranges into classes. The important attributes should be used on the outer levels.  Adequate for data with ordinal attributes of low cardinality  But, difficult to display more than nine dimensions  (Figure: four attributes stacked pairwise)
  • 82.
82 Dimensional Stacking  Visualization of oil mining data with longitude and latitude mapped to the outer x-, y-axes and ore grade and depth mapped to the inner x-, y-axes  (Used by permission of M. Ward, Worcester Polytechnic Institute)
  • 83.
83 Worlds-within-Worlds  Assign the function and two most important parameters to the innermost world  Fix all other parameters at constant values and draw other 1-, 2-, or 3-dimensional worlds choosing these as the axes  Software that uses this paradigm  N-Vision: dynamic interaction through data glove and stereo displays, including rotation, scaling (inner) and translation (inner/outer)  Auto Visual: static interaction by means of queries
  • 84.
    84 Tree-Map  Screen-filling methodwhich uses a hierarchical partitioning of the screen into regions depending on the attribute values  The x- and y-dimension of the screen are partitioned alternately according to the attribute values (classes) MSR Netscan Image Ack.:
  • 85.
85 Tree-Map of a File System (Shneiderman)
  • 86.
    86 InfoCube  A 3-Dvisualization technique where hierarchical information is displayed as nested semi- transparent cubes  The outermost cubes correspond to the top level data, while the subnodes or the lower level data are represented as smaller cubes inside the outermost cubes, and so on
  • 87.
    87 Three-D Cone Trees 3D cone tree visualization technique works well for up to a thousand nodes or so  First build a 2D circle tree that arranges its nodes in concentric circles centered on the root node  Cannot avoid overlaps when projected to 2D  G. Robertson, J. Mackinlay, S. Card. “Cone Trees: Animated 3D Visualizations of Hierarchical Information”, ACM SIGCHI'91  Graph from Nadeau Software Consulting website: Visualize a social network data set that models the way an infection spreads from one person to the next Ack.: http://nadeausoftware.com/articles/visualization
  • 88.
    Visualizing Complex Dataand Relations  Visualizing non-numerical data: text and social networks  Tag cloud: visualizing user-generated tags  The importance of tag is represented by font size/color  Besides text data, there are also methods to visualize relationships, such as visualizing social networks Newsmap: Google News Stories in
  • 89.
    89 Chapter 2: Gettingto Know Your Data  Data Objects and Attribute Types  Basic Statistical Descriptions of Data  Data Visualization  Measuring Data Similarity and Dissimilarity  Summary
  • 90.
    90 Similarity and Dissimilarity Similarity  Numerical measure of how alike two data objects are  Value is higher when objects are more alike  Often falls in the range [0,1]  Dissimilarity (e.g., distance)  Numerical measure of how different two data objects are  Lower when objects are more alike  Minimum dissimilarity is often 0  Upper limit varies  Proximity refers to a similarity or dissimilarity
  • 91.
91 Data Matrix and Dissimilarity Matrix  Data matrix  n data points with p dimensions  Two modes  $\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$  Dissimilarity matrix  n data points, but registers only the distance  A triangular matrix  Single mode  $\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$
  • 92.
92 Proximity Measure for Nominal Attributes  Can take 2 or more states, e.g., red, yellow, blue, green (generalization of a binary attribute)  Method 1: simple matching  $d(i,j) = \frac{p - m}{p}$, where m is the # of matches and p is the total # of variables  Method 2: use a large number of binary attributes  Creating a new binary attribute for each of the M nominal states
  • 93.
93 Proximity Measure for Binary Attributes  A contingency table for binary data (for objects i and j: q = # of attributes equal to 1 for both, r = # equal to 1 for i but 0 for j, s = # equal to 0 for i but 1 for j, t = # equal to 0 for both)  Distance measure for symmetric binary variables: $d(i,j) = \frac{r + s}{q + r + s + t}$  Distance measure for asymmetric binary variables: $d(i,j) = \frac{r + s}{q + r + s}$  Jaccard coefficient (similarity measure for asymmetric binary variables): $sim_{Jaccard}(i,j) = \frac{q}{q + r + s}$  Note: the Jaccard coefficient is the same as “coherence”
  • 94.
94 Dissimilarity between Binary Variables  Example (table below)  Gender is a symmetric attribute  The remaining attributes are asymmetric binary  Let the values Y and P be 1, and the value N be 0

  Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  Jack  M       Y      N      P       N       N       N
  Mary  F       Y      N      P       N       P       N
  Jim   M       Y      P      N       N       N       N

  d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
  d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
  d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
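A short Python sketch (illustration only; the helper name is made up) that reproduces these asymmetric-binary distances, with Y/P mapped to 1, N to 0, and the symmetric attribute Gender left out:

```python
def asymmetric_binary_distance(x, y):
    """d(i, j) = (r + s) / (q + r + s); 0-0 matches (t) are ignored."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    return (r + s) / (q + r + s)

# Attribute order: Fever, Cough, Test-1, Test-2, Test-3, Test-4
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(asymmetric_binary_distance(jack, mary), 2))  # 0.33
print(round(asymmetric_binary_distance(jack, jim), 2))   # 0.67
print(round(asymmetric_binary_distance(jim, mary), 2))   # 0.75
```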
  • 95.
95 Standardizing Numeric Data  Z-score: $z = \frac{x - \mu}{\sigma}$  x: raw score to be standardized, μ: mean of the population, σ: standard deviation  The distance between the raw score and the population mean in units of the standard deviation  Negative when the raw score is below the mean, “+” when above  An alternative way: calculate the mean absolute deviation $s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$, where $m_f = \frac{1}{n}(x_{1f} + x_{2f} + \cdots + x_{nf})$  Standardized measure (z-score): $z_{if} = \frac{x_{if} - m_f}{s_f}$  Using mean absolute deviation is more robust than using standard deviation
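A brief Python sketch (illustrative, made-up values) contrasting z-scores computed with the standard deviation and with the mean absolute deviation:

```python
def z_scores_std(xs):
    m = sum(xs) / len(xs)
    sd = (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - m) / sd for x in xs]

def z_scores_mad(xs):
    m = sum(xs) / len(xs)
    s = sum(abs(x - m) for x in xs) / len(xs)   # mean absolute deviation s_f
    return [(x - m) / s for x in xs]

values = [20, 30, 40, 50, 1000]   # one extreme outlier
print(z_scores_std(values))
print(z_scores_mad(values))       # the outlier keeps a larger (more detectable) z-score
```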
  • 96.
96 Example: Data Matrix and Dissimilarity Matrix

  Data matrix:
  point  attribute1  attribute2
  x1     1           2
  x2     3           5
  x3     2           0
  x4     4           5

  Dissimilarity matrix (with Euclidean distance):
        x1    x2    x3    x4
  x1    0
  x2    3.61  0
  x3    2.24  5.1   0
  x4    4.24  1     5.39  0
  • 97.
97 Distance on Numeric Data: Minkowski Distance  Minkowski distance: a popular distance measure  $d(i,j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$  where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)  Properties  d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)  d(i, j) = d(j, i) (symmetry)  d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)  A distance that satisfies these properties is a metric
  • 98.
98 Special Cases of Minkowski Distance  h = 1: Manhattan (city block, L1 norm) distance  $d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$  E.g., the Hamming distance: the number of bits that are different between two binary vectors  h = 2: (L2 norm) Euclidean distance  $d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$  h → ∞: “supremum” (Lmax norm, L∞ norm) distance  $d(i,j) = \max_{f} |x_{if} - x_{jf}|$  This is the maximum difference between any component (attribute) of the vectors
  • 99.
99 Example: Minkowski Distance Dissimilarity Matrices

  point  attribute1  attribute2
  x1     1           2
  x2     3           5
  x3     2           0
  x4     4           5

  Manhattan (L1):
        x1  x2  x3  x4
  x1    0
  x2    5   0
  x3    3   6   0
  x4    6   1   7   0

  Euclidean (L2):
        x1    x2    x3    x4
  x1    0
  x2    3.61  0
  x3    2.24  5.1   0
  x4    4.24  1     5.39  0

  Supremum (L∞):
        x1  x2  x3  x4
  x1    0
  x2    3   0
  x3    2   5   0
  x4    3   1   5   0
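A compact Python sketch (illustration only) that recomputes the three dissimilarity matrices above for the four example points:

```python
def minkowski(p, q, h):
    return sum(abs(a - b) ** h for a, b in zip(p, q)) ** (1 / h)

def supremum(p, q):
    return max(abs(a - b) for a, b in zip(p, q))

points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}
names = list(points)

for i, a in enumerate(names):
    for b in names[:i]:
        p, q = points[a], points[b]
        print(a, b,
              minkowski(p, q, h=1),            # Manhattan (L1)
              round(minkowski(p, q, h=2), 2),  # Euclidean (L2)
              supremum(p, q))                  # Supremum (L-infinity)
```

For example, the first row printed is "x2 x1 5.0 3.61 3", matching the matrices above.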
  • 100.
100 Ordinal Variables  An ordinal variable can be discrete or continuous  Order is important, e.g., rank  Can be treated like interval-scaled  Replace xif by its rank $r_{if} \in \{1, \ldots, M_f\}$  Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by $z_{if} = \frac{r_{if} - 1}{M_f - 1}$  Compute the dissimilarity using methods for interval-scaled variables
  • 101.
101 Attributes of Mixed Type  A database may contain all attribute types  Nominal, symmetric binary, asymmetric binary, numeric, ordinal  One may use a weighted formula to combine their effects: $d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$  f is binary or nominal: dij(f) = 0 if xif = xjf, or dij(f) = 1 otherwise  f is numeric: use the normalized distance  f is ordinal  Compute ranks rif and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, and treat zif as interval-scaled
  • 102.
    102 Cosine Similarity  Adocument can be represented by thousands of attributes, each recording the frequency of a particular word (such as keywords) or phrase in the document.  Other vector objects: gene features in micro-arrays, …  Applications: information retrieval, biologic taxonomy, gene feature mapping, ...  Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then cos(d1 , d2 ) = (d1  d2 ) /||d1 || ||d2 || , where  indicates vector dot product, ||d||: the length of vector d
  • 103.
103 Example: Cosine Similarity  cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||), where • indicates the vector dot product and ||d|| is the length of vector d  Ex: find the similarity between documents 1 and 2.  d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)  d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)  d1 • d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25  ||d1|| = (5×5 + 0×0 + 3×3 + 0×0 + 2×2 + 0×0 + 0×0 + 2×2 + 0×0 + 0×0)^0.5 = (42)^0.5 = 6.481  ||d2|| = (3×3 + 0×0 + 2×2 + 0×0 + 1×1 + 1×1 + 0×0 + 1×1 + 0×0 + 1×1)^0.5 = (17)^0.5 = 4.12  cos(d1, d2) = 0.94
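A tiny Python sketch (illustration only) reproducing this calculation:

```python
import math

def cosine_similarity(d1, d2):
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(d1, d2), 2))   # 0.94
```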
  • 104.
    104 Chapter 2: Gettingto Know Your Data  Data Objects and Attribute Types  Basic Statistical Descriptions of Data  Data Visualization  Measuring Data Similarity and Dissimilarity  Summary
  • 105.
    Summary  Data attributetypes: nominal, binary, ordinal, interval-scaled, ratio- scaled  Many types of data sets, e.g., numerical, text, graph, Web, image.  Gain insight into the data by:  Basic statistical data description: central tendency, dispersion, graphical displays  Data visualization: map data onto graphical primitives  Measure data similarity  Above steps are the beginning of data preprocessing.  Many methods have been developed but still an active area of research. 105
  • 106.
    References  W. Cleveland,Visualizing Data, Hobart Press, 1993  T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley, 2003  U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2001  L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons, 1990.  H. V. Jagadish, et al., Special Issue on Data Reduction Techniques. Bulletin of the Tech. Committee on Data Eng., 20(4), Dec. 1997  D. A. Keim. Information visualization and visual data mining, IEEE trans. on Visualization and Computer Graphics, 8(1), 2002  D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999  S. Santini and R. Jain,” Similarity measures”, IEEE Trans. on Pattern Analysis and Machine Intelligence, 21(9), 1999  E. R. Tufte. The Visual Display of Quantitative Information, 2nd ed., Graphics Press, 2001  C. Yu , et al., Visual data mining of multimedia data for social and behavioral studies, Information Visualization, 8(1), 2009 106
  • 107.
107 Data Mining: Concepts and Techniques (3rd ed.) — Chapter 3 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2011 Han, Kamber & Pei. All rights reserved.
  • 108.
    108 108 Chapter 3: DataPreprocessing  Data Preprocessing: An Overview  Data Quality  Major Tasks in Data Preprocessing  Data Cleaning  Data Integration  Data Reduction  Data Transformation and Data Discretization  Summary
  • 109.
    109 Data Quality: WhyPreprocess the Data?  Measures for data quality: A multidimensional view  Accuracy: correct or wrong, accurate or not  Completeness: not recorded, unavailable, …  Consistency: some modified but some not, dangling, …  Timeliness: timely update?  Believability: how trustable the data are correct?  Interpretability: how easily the data can be understood?
  • 110.
    110 Major Tasks inData Preprocessing  Data cleaning  Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies  Data integration  Integration of multiple databases, data cubes, or files  Data reduction  Dimensionality reduction  Numerosity reduction  Data compression  Data transformation and data discretization  Normalization  Concept hierarchy generation
  • 111.
    111 111 Chapter 3: DataPreprocessing  Data Preprocessing: An Overview  Data Quality  Major Tasks in Data Preprocessing  Data Cleaning  Data Integration  Data Reduction  Data Transformation and Data Discretization  Summary
  • 112.
112 Data Cleaning  Data in the Real World Is Dirty: lots of potentially incorrect data, e.g., instrument faults, human or computer error, transmission error  Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data  e.g., Occupation = “ ” (missing data)  Noisy: containing noise, errors, or outliers  e.g., Salary = “−10” (an error)  Inconsistent: containing discrepancies in codes or names, e.g.,  Age = “42”, Birthday = “03/07/2010”  Was rating “1, 2, 3”, now rating “A, B, C”  Discrepancy between duplicate records  Intentional (e.g., disguised missing data)  Jan. 1 as everyone’s birthday?
  • 113.
    113 Incomplete (Missing) Data Data is not always available  E.g., many tuples have no recorded value for several attributes, such as customer income in sales data  Missing data may be due to  equipment malfunction  inconsistent with other recorded data and thus deleted  data not entered due to misunderstanding  certain data may not be considered important at the time of entry  not register history or changes of the data
  • 114.
114 How to Handle Missing Data?  Ignore the tuple: usually done when the class label is missing (when doing classification)—not effective when the % of missing values per attribute varies considerably  Fill in the missing value manually: tedious + infeasible?  Fill it in automatically with  a global constant: e.g., “unknown”, a new class?!  the attribute mean  the attribute mean for all samples belonging to the same class: smarter  the most probable value: inference-based, such as Bayesian formula or decision tree
  • 115.
    115 Noisy Data  Noise:random error or variance in a measured variable  Incorrect attribute values may be due to  faulty data collection instruments  data entry problems  data transmission problems  technology limitation  inconsistency in naming convention  Other data problems which require data cleaning  duplicate records  incomplete data  inconsistent data
  • 116.
    116 How to HandleNoisy Data?  Binning  first sort data and partition into (equal-frequency) bins  then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.  Regression  smooth by fitting the data into regression functions  Clustering  detect and remove outliers  Combined computer and human inspection  detect suspicious values and check by human (e.g., deal with possible outliers)
  • 117.
117 Data Cleaning as a Process  Data discrepancy detection  Use metadata (e.g., domain, range, dependency, distribution)  Check field overloading  Check uniqueness rule, consecutive rule and null rule  Use commercial tools  Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections  Data auditing: by analyzing data to discover rules and relationships to detect violators (e.g., correlation and clustering to find outliers)  Data migration and integration  Data migration tools: allow transformations to be specified  ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface  Integration of the two processes  Iterative and interactive (e.g., Potter’s Wheel)
  • 118.
    118 118 Chapter 3: DataPreprocessing  Data Preprocessing: An Overview  Data Quality  Major Tasks in Data Preprocessing  Data Cleaning  Data Integration  Data Reduction  Data Transformation and Data Discretization  Summary
  • 119.
119 Data Integration  Data integration:  Combines data from multiple sources into a coherent store  Schema integration: e.g., A.cust-id ≡ B.cust-#  Integrate metadata from different sources  Entity identification problem:  Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton  Detecting and resolving data value conflicts  For the same real world entity, attribute values from different sources are different  Possible reasons: different representations, different scales, e.g., metric vs. British units
  • 120.
120 Handling Redundancy in Data Integration  Redundant data occur often when integrating multiple databases  Object identification: the same attribute or object may have different names in different databases  Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue  Redundant attributes may be detected by correlation analysis and covariance analysis  Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
  • 121.
121 Correlation Analysis (Nominal Data)  Χ² (chi-square) test: $\chi^2 = \sum \frac{(Observed - Expected)^2}{Expected}$  The larger the Χ² value, the more likely the variables are related  The cells that contribute the most to the Χ² value are those whose actual count is very different from the expected count  Correlation does not imply causality  # of hospitals and # of car thefts in a city are correlated  Both are causally linked to a third variable: population
  • 122.
122 Chi-Square Calculation: An Example

                            Play chess  Not play chess  Sum (row)
  Like science fiction      250 (90)    200 (360)       450
  Not like science fiction  50 (210)    1000 (840)      1050
  Sum (col.)                300         1200            1500

  Χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories):
  $\chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} = 507.93$
  It shows that like_science_fiction and play_chess are correlated in the group
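A short Python sketch (illustration only, no external libraries) that verifies the χ² value from the observed counts, with expected counts computed as (row total × column total) / grand total:

```python
observed = [[250, 200],    # like science fiction:     plays chess / does not
            [50, 1000]]    # not like science fiction: plays chess / does not

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

print(round(chi2, 2))   # 507.93
```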
  • 123.
123 Correlation Analysis (Numeric Data)  Correlation coefficient (also called Pearson’s product moment coefficient): $r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$  where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ(ai bi) is the sum of the AB cross-product  If rA,B > 0, A and B are positively correlated (A’s values increase as B’s). The higher the value, the stronger the correlation.  rA,B = 0: independent; rA,B < 0: negatively correlated
  • 124.
    124 Visually Evaluating Correlation Scatterplots showing the similarity from –1 to 1.
  • 125.
125 Correlation (Viewed as a Linear Relationship)  Correlation measures the linear relationship between objects  To compute correlation, we standardize the data objects A and B, and then take their dot product:  $a'_k = \frac{a_k - mean(A)}{std(A)}$, $b'_k = \frac{b_k - mean(B)}{std(B)}$  $correlation(A, B) = A' \cdot B'$
  • 126.
126 Covariance (Numeric Data)  Covariance is similar to correlation: $Cov(A,B) = E[(A - \bar{A})(B - \bar{B})] = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n}$, with correlation coefficient $r_{A,B} = \frac{Cov(A,B)}{\sigma_A \sigma_B}$  where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective mean or expected values of A and B, and σA and σB are the respective standard deviations of A and B  Positive covariance: if CovA,B > 0, then A and B both tend to be larger than their expected values  Negative covariance: if CovA,B < 0, then if A is larger than its expected value, B is likely to be smaller than its expected value  Independence: if A and B are independent, CovA,B = 0, but the converse is not true:  Some pairs of random variables may have a covariance of 0 but are not independent. Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence
  • 127.
127 Covariance: An Example  It can be simplified in computation as $Cov(A,B) = E(A \cdot B) - \bar{A}\bar{B}$  Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14)  Question: if the stocks are affected by the same industry trends, will their prices rise or fall together?  E(A) = (2 + 3 + 5 + 4 + 6) / 5 = 20/5 = 4  E(B) = (5 + 8 + 10 + 11 + 14) / 5 = 48/5 = 9.6  Cov(A,B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 42.4 − 38.4 = 4  Thus, A and B rise together since Cov(A, B) > 0
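A minimal Python sketch (illustration only) verifying this computation:

```python
A = [2, 3, 5, 4, 6]      # stock A prices over the week
B = [5, 8, 10, 11, 14]   # stock B prices over the week
n = len(A)

mean_A = sum(A) / n      # 4.0
mean_B = sum(B) / n      # 9.6

# Cov(A, B) = E(A*B) - E(A)*E(B)
cov = sum(a * b for a, b in zip(A, B)) / n - mean_A * mean_B
print(round(cov, 2))     # 4.0 -> the two stocks tend to rise together
```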
  • 128.
    128 128 Chapter 3: DataPreprocessing  Data Preprocessing: An Overview  Data Quality  Major Tasks in Data Preprocessing  Data Cleaning  Data Integration  Data Reduction  Data Transformation and Data Discretization  Summary
  • 129.
    129 Data Reduction Strategies Data reduction: Obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results  Why data reduction? — A database/data warehouse may store terabytes of data. Complex data analysis may take a very long time to run on the complete data set.  Data reduction strategies  Dimensionality reduction, e.g., remove unimportant attributes  Wavelet transforms  Principal Components Analysis (PCA)  Feature subset selection, feature creation  Numerosity reduction (some simply call it: Data Reduction)  Regression and Log-Linear Models  Histograms, clustering, sampling  Data cube aggregation  Data compression
  • 130.
    130 Data Reduction 1:Dimensionality Reduction  Curse of dimensionality  When dimensionality increases, data becomes increasingly sparse  Density and distance between points, which is critical to clustering, outlier analysis, becomes less meaningful  The possible combinations of subspaces will grow exponentially  Dimensionality reduction  Avoid the curse of dimensionality  Help eliminate irrelevant features and reduce noise  Reduce time and space required in data mining  Allow easier visualization  Dimensionality reduction techniques  Wavelet transforms  Principal Component Analysis  Supervised and nonlinear techniques (e.g., feature selection)
  • 131.
131 Mapping Data to a New Space  Fourier transform  Wavelet transform  (Figures: two sine waves; two sine waves + noise; the frequency-domain representation)
  • 132.
132 What Is Wavelet Transform?  Decomposes a signal into different frequency subbands  Applicable to n-dimensional signals  Data are transformed to preserve relative distance between objects at different levels of resolution  Allow natural clusters to become more distinguishable  Used for image compression
  • 133.
133 Wavelet Transformation  Discrete wavelet transform (DWT) for linear signal processing, multi-resolution analysis  Compressed approximation: store only a small fraction of the strongest of the wavelet coefficients  Similar to discrete Fourier transform (DFT), but better lossy compression, localized in space  Method:  Length, L, must be an integer power of 2 (padding with 0’s, when necessary)  Each transform has 2 functions: smoothing, difference  Applies to pairs of data, resulting in two sets of data of length L/2  Applies the two functions recursively, until reaching the desired length  (Example wavelet families: Haar-2, Daubechies-4)
  • 134.
134 Wavelet Decomposition  Wavelets: a math tool for space-efficient hierarchical decomposition of functions  S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed to S^ = [2.75, −1.25, 0.5, 0, 0, −1, −1, 0]  Compression: many small detail coefficients can be replaced by 0’s, and only the significant coefficients are retained
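A small Python sketch (illustration only; an unnormalized averaging/differencing Haar transform) that reproduces the decomposition of S shown above:

```python
def haar_decompose(signal):
    """Return [overall average, detail coefficients from coarsest to finest]."""
    details = []
    s = list(signal)
    while len(s) > 1:
        averages = [(s[i] + s[i + 1]) / 2 for i in range(0, len(s), 2)]
        diffs    = [(s[i] - s[i + 1]) / 2 for i in range(0, len(s), 2)]
        details = diffs + details     # coarser-level details end up first
        s = averages
    return s + details

print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```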
  • 135.
135 Haar Wavelet Coefficients  (Figure: hierarchical decomposition structure, a.k.a. “error tree”, with coefficient “supports”, relating the original frequency distribution [2, 2, 0, 2, 3, 5, 4, 4] to the coefficients 2.75, −1.25, 0.5, 0, 0, −1, −1, 0)
  • 136.
    136 Why Wavelet Transform? Use hat-shape filters  Emphasize region where points cluster  Suppress weaker information in their boundaries  Effective removal of outliers  Insensitive to noise, insensitive to input order  Multi-resolution  Detect arbitrary shaped clusters at different scales  Efficient  Complexity O(N)  Only applicable to low dimensional data
  • 137.
137 Principal Component Analysis (PCA)  Find a projection that captures the largest amount of variation in the data  The original data are projected onto a much smaller space, resulting in dimensionality reduction. We find the eigenvectors of the covariance matrix, and these eigenvectors define the new space  (Figure: data points in the x1-x2 plane and their principal direction e)
  • 138.
138 Principal Component Analysis (Steps)  Given N data vectors from n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data  Normalize input data: each attribute falls within the same range  Compute k orthonormal (unit) vectors, i.e., principal components  Each input data vector is a linear combination of the k principal component vectors  The principal components are sorted in order of decreasing “significance” or strength  Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (i.e., using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
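A compact NumPy sketch (illustration only, not the book's code) of these steps, with a hypothetical helper pca():

```python
import numpy as np

def pca(X, k):
    # 1. Normalize: center each attribute (optionally also scale to unit variance)
    X_centered = X - X.mean(axis=0)
    # 2. The eigenvectors of the covariance matrix are the principal components
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 3. Sort components by decreasing "significance" (variance explained)
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:k]]
    # 4. Project the data onto the k strongest components
    return X_centered @ components

X = np.random.default_rng(0).normal(size=(100, 5))
print(pca(X, k=2).shape)   # (100, 2): reduced from 5 dimensions to 2
```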
  • 139.
    139 Attribute Subset Selection Another way to reduce dimensionality of data  Redundant attributes  Duplicate much or all of the information contained in one or more other attributes  E.g., purchase price of a product and the amount of sales tax paid  Irrelevant attributes  Contain no information that is useful for the data mining task at hand  E.g., students' ID is often irrelevant to the task of predicting students' GPA
  • 140.
140 Heuristic Search in Attribute Selection  There are 2^d possible attribute combinations of d attributes  Typical heuristic attribute selection methods:  Best single attribute under the attribute independence assumption: choose by significance tests  Best step-wise feature selection:  The best single attribute is picked first  Then the next best attribute conditioned on the first, ...  Step-wise attribute elimination:  Repeatedly eliminate the worst attribute  Best combined attribute selection and elimination  Optimal branch and bound:  Use feature elimination and backtracking
  • 141.
    141 Attribute Creation (FeatureGeneration)  Create new attributes (features) that can capture the important information in a data set more effectively than the original ones  Three general methodologies  Attribute extraction  Domain-specific  Mapping data to new space (see: data reduction)  E.g., Fourier transformation, wavelet transformation, manifold approaches (not covered)  Attribute construction  Combining features (see: discriminative frequent patterns in Chapter 7) 
  • 142.
    142 Data Reduction 2:Numerosity Reduction  Reduce data volume by choosing alternative, smaller forms of data representation  Parametric methods (e.g., regression)  Assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers)  Ex.: Log-linear models—obtain value at a point in m-D space as the product on appropriate marginal subspaces  Non-parametric methods  Do not assume models  Major families: histograms, clustering, sampling, …
  • 143.
    143 Parametric Data Reduction:Regression and Log-Linear Models  Linear regression  Data modeled to fit a straight line  Often uses the least-square method to fit the line  Multiple regression  Allows a response variable Y to be modeled as a linear function of multidimensional feature vector  Log-linear model  Approximates discrete multidimensional probability distributions
  • 144.
144 Regression Analysis  Regression analysis: a collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (also called response variable or measurement) and of one or more independent variables (a.k.a. explanatory variables or predictors)  The parameters are estimated so as to give a “best fit” of the data  Most commonly the best fit is evaluated by using the least squares method, but other criteria have also been used  Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships  (Figure: fitted line y = x + 1 with an observed point (X1, Y1) and its predicted value Y1′)
  • 145.
145 Regression Analysis and Log-Linear Models  Linear regression: Y = wX + b  Two regression coefficients, w and b, specify the line and are to be estimated by using the data at hand  Using the least squares criterion on the known values of Y1, Y2, …, X1, X2, …  Multiple regression: Y = b0 + b1X1 + b2X2  Many nonlinear functions can be transformed into the above  Log-linear models:  Approximate discrete multidimensional probability distributions  Estimate the probability of each point (tuple) in a multi-dimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations
  • 146.
146 Histogram Analysis  Divide data into buckets and store the average (sum) for each bucket  Partitioning rules:  Equal-width: equal bucket range  Equal-frequency (or equal-depth)  (Figure: example equal-width histogram)
  • 147.
    147 Clustering  Partition dataset into clusters based on similarity, and store cluster representation (e.g., centroid and diameter) only  Can be very effective if data is clustered but not if data is “smeared”  Can have hierarchical clustering and be stored in multi-dimensional index tree structures  There are many choices of clustering definitions and clustering algorithms  Cluster analysis will be studied in depth in Chapter 10
  • 148.
148 Sampling  Sampling: obtaining a small sample s to represent the whole data set N  Allows a mining algorithm to run in complexity that is potentially sub-linear to the size of the data  Key principle: choose a representative subset of the data  Simple random sampling may have very poor performance in the presence of skew  Develop adaptive sampling methods, e.g., stratified sampling  Note: sampling may not reduce database I/Os (page at a time)
  • 149.
    149 Types of Sampling Simple random sampling  There is an equal probability of selecting any particular item  Sampling without replacement  Once an object is selected, it is removed from the population  Sampling with replacement  A selected object is not removed from the population  Stratified sampling:  Partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data)  Used in conjunction with skewed data
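A small Python sketch (illustration only; the strata and sample sizes are made up) of these sampling schemes using only the standard library:

```python
import random
from collections import defaultdict

random.seed(42)
data = [("young", i) for i in range(30)] + [("senior", i) for i in range(6)]

# Simple random sampling without replacement (SRSWOR) and with replacement (SRSWR)
srswor = random.sample(data, k=6)
srswr = random.choices(data, k=6)      # an item may be drawn more than once

# Stratified sampling: draw roughly the same fraction from each stratum,
# so the small "senior" group is still represented
strata = defaultdict(list)
for stratum, value in data:
    strata[stratum].append((stratum, value))
fraction = 1 / 6
stratified = [item
              for group in strata.values()
              for item in random.sample(group, max(1, round(fraction * len(group))))]

print(len(srswor), len(srswr), len(stratified))   # 6 6 6
```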
  • 150.
150 Sampling: With or Without Replacement  (Figure: raw data sampled by SRSWOR, a simple random sample without replacement, and by SRSWR, a simple random sample with replacement)
  • 151.
151 Sampling: Cluster or Stratified Sampling  (Figure: raw data and the corresponding cluster/stratified sample)
  • 152.
    152 Data Cube Aggregation The lowest level of a data cube (base cuboid)  The aggregated data for an individual entity of interest  E.g., a customer in a phone calling data warehouse  Multiple levels of aggregation in data cubes  Further reduce the size of data to deal with  Reference appropriate levels  Use the smallest representation which is enough to solve the task  Queries regarding aggregated information should be answered using data cube, when possible
  • 153.
153 Data Reduction 3: Data Compression  String compression  There are extensive theories and well-tuned algorithms  Typically lossless, but only limited manipulation is possible without expansion  Audio/video compression  Typically lossy compression, with progressive refinement  Sometimes small fragments of signal can be reconstructed without reconstructing the whole  Time sequence is not audio  Typically short and varies slowly with time  Dimensionality and numerosity reduction may also be considered as forms of data compression
  • 154.
154 Data Compression  (Figure: lossless compression maps the original data to compressed data and back exactly; lossy compression recovers only an approximation of the original data)
  • 155.
    155 Chapter 3: DataPreprocessing  Data Preprocessing: An Overview  Data Quality  Major Tasks in Data Preprocessing  Data Cleaning  Data Integration  Data Reduction  Data Transformation and Data Discretization  Summary
  • 156.
    156 Data Transformation  Afunction that maps the entire set of values of a given attribute to a new set of replacement values s.t. each old value can be identified with one of the new values  Methods  Smoothing: Remove noise from data  Attribute/feature construction  New attributes constructed from the given ones  Aggregation: Summarization, data cube construction  Normalization: Scaled to fall within a smaller, specified range  min-max normalization  z-score normalization  normalization by decimal scaling  Discretization: Concept hierarchy climbing
  • 157.
157 Normalization  Min-max normalization: to [new_minA, new_maxA]  $v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$  Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to $\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$  Z-score normalization (μ: mean, σ: standard deviation): $v' = \frac{v - \mu_A}{\sigma_A}$  Ex. Let μ = 54,000, σ = 16,000. Then $\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$  Normalization by decimal scaling: $v' = \frac{v}{10^j}$, where j is the smallest integer such that max(|v′|) < 1
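A minimal Python sketch (illustration only) of the three normalization methods, reproducing the income example above:

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    return (v - mu) / sigma

def decimal_scaling(v, max_abs):
    j = len(str(int(abs(max_abs))))   # digit count gives the smallest j with max(|v'|) < 1 here
    return v / (10 ** j)

print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))   # 1.225
print(decimal_scaling(73_600, max_abs=98_000))     # 0.736
```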
  • 158.
    158 Discretization  Three typesof attributes  Nominal—values from an unordered set, e.g., color, profession  Ordinal—values from an ordered set, e.g., military or academic rank  Numeric—real numbers, e.g., integer or real numbers  Discretization: Divide the range of a continuous attribute into intervals  Interval labels can then be used to replace actual data values  Reduce data size by discretization  Supervised vs. unsupervised  Split (top-down) vs. merge (bottom-up)  Discretization can be performed recursively on an attribute  Prepare for further analysis, e.g., classification
  • 159.
159 Data Discretization Methods  Typical methods (all the methods can be applied recursively):  Binning  Top-down split, unsupervised  Histogram analysis  Top-down split, unsupervised  Clustering analysis (unsupervised, top-down split or bottom-up merge)  Decision-tree analysis (supervised, top-down split)  Correlation (e.g., χ²) analysis (unsupervised, bottom-up merge)
  • 160.
    160 Simple Discretization: Binning Equal-width (distance) partitioning  Divides the range into N intervals of equal size: uniform grid  if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B –A)/N.  The most straightforward, but outliers may dominate presentation  Skewed data is not handled well  Equal-depth (frequency) partitioning  Divides the range into N intervals, each containing approximately same number of samples  Good data scaling  Managing categorical attributes can be tricky
  • 161.
161 Binning Methods for Data Smoothing  Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
  * Partition into equal-frequency (equi-depth) bins:
    - Bin 1: 4, 8, 9, 15
    - Bin 2: 21, 21, 24, 25
    - Bin 3: 26, 28, 29, 34
  * Smoothing by bin means:
    - Bin 1: 9, 9, 9, 9
    - Bin 2: 23, 23, 23, 23
    - Bin 3: 29, 29, 29, 29
  * Smoothing by bin boundaries:
    - Bin 1: 4, 4, 4, 15
    - Bin 2: 21, 21, 25, 25
    - Bin 3: 26, 26, 26, 34
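A short Python sketch (illustration only) that reproduces the equal-frequency binning and the two smoothing variants above:

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

def equal_frequency_bins(values, n_bins):
    values = sorted(values)
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    return [[round(sum(b) / len(b)) for _ in b] for b in bins]

def smooth_by_boundaries(bins):
    # replace each value by whichever bin boundary (min or max) is closer
    return [[min(b) if x - min(b) <= max(b) - x else max(b) for x in b] for b in bins]

bins = equal_frequency_bins(prices, 3)
print(bins)                        # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```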
  • 162.
162 Discretization Without Using Class Labels (Binning vs. Clustering)  (Figure panels: the original data; equal interval width (binning); equal frequency (binning); K-means clustering, which leads to better results)
  • 163.
163 Discretization by Classification & Correlation Analysis  Classification (e.g., decision tree analysis)  Supervised: Given class labels, e.g., cancerous vs. benign  Using entropy to determine split point (discretization point)  Top-down, recursive split  Details to be covered in Chapter 7  Correlation analysis (e.g., Chi-merge: χ²-based discretization)  Supervised: use class information  Bottom-up merge: find the best neighboring intervals (those having similar distributions of classes, i.e., low χ² values) to merge  Merge performed recursively, until a predefined stopping condition is met
  • 164.
    164 Concept Hierarchy Generation Concept hierarchy organizes concepts (i.e., attribute values) hierarchically and is usually associated with each dimension in a data warehouse  Concept hierarchies facilitate drilling and rolling in data warehouses to view data in multiple granularity  Concept hierarchy formation: Recursively reduce the data by collecting and replacing low level concepts (such as numeric values for age) by higher level concepts (such as youth, adult, or senior)  Concept hierarchies can be explicitly specified by domain experts and/or data warehouse designers  Concept hierarchy can be automatically formed for both numeric and nominal data. For numeric data, use discretization methods shown.
  • 165.
    165 Concept Hierarchy Generation forNominal Data  Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts  street < city < state < country  Specification of a hierarchy for a set of values by explicit data grouping  {Urbana, Champaign, Chicago} < Illinois  Specification of only a partial set of attributes  E.g., only street < city, not others  Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values  E.g., for a set of attributes: {street, city, state, country}
  • 166.
166 Automatic Concept Hierarchy Generation  Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set  The attribute with the most distinct values is placed at the lowest level of the hierarchy  Exceptions, e.g., weekday, month, quarter, year  Example: street (674,339 distinct values) < city (3,567) < province_or_state (365) < country (15)
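A minimal sketch of this distinct-value heuristic (the helper name is ours; the counts are the ones above):

def auto_hierarchy(distinct_counts):
    # attributes ordered from fewest distinct values (top level) to most (bottom level)
    return sorted(distinct_counts, key=distinct_counts.get)

counts = {"street": 674_339, "city": 3_567, "province_or_state": 365, "country": 15}
print(" < ".join(reversed(auto_hierarchy(counts))))
# street < city < province_or_state < country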
  • 167.
    167 Chapter 3: DataPreprocessing  Data Preprocessing: An Overview  Data Quality  Major Tasks in Data Preprocessing  Data Cleaning  Data Integration  Data Reduction  Data Transformation and Data Discretization  Summary
  • 168.
    168 Summary  Data quality:accuracy, completeness, consistency, timeliness, believability, interpretability  Data cleaning: e.g. missing/noisy values, outliers  Data integration from multiple sources:  Entity identification problem  Remove redundancies  Detect inconsistencies  Data reduction  Dimensionality reduction  Numerosity reduction  Data compression  Data transformation and data discretization  Normalization  Concept hierarchy generation
  • 169.
    169 References  D. P.Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Comm. of ACM, 42:73-78, 1999  A. Bruce, D. Donoho, and H.-Y. Gao. Wavelet analysis. IEEE Spectrum, Oct 1996  T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley, 2003  J. Devore and R. Peck. Statistics: The Exploration and Analysis of Data. Duxbury Press, 1997.  H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-A. Saita. Declarative data cleaning: Language, model, and algorithms. VLDB'01  M. Hua and J. Pei. Cleaning disguised missing data: A heuristic approach. KDD'07  H. V. Jagadish, et al., Special Issue on Data Reduction Techniques. Bulletin of the Technical Committee on Data Engineering, 20(4), Dec. 1997  H. Liu and H. Motoda (eds.). Feature Extraction, Construction, and Selection: A Data Mining Perspective. Kluwer Academic, 1998  J. E. Olson. Data Quality: The Accuracy Dimension. Morgan Kaufmann, 2003  D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999  V. Raman and J. Hellerstein. Potters Wheel: An Interactive Framework for Data Cleaning and Transformation, VLDB’2001  T. Redman. Data Quality: The Field Guide. Digital Press (Elsevier), 2001  R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans. Knowledge and Data Engineering, 7:623-640, 1995
  • 170.
    170 170 Data Mining: Concepts andTechniques (3rd ed.) — Chapter 4 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2011 Han, Kamber & Pei. All rights reserved.
  • 171.
    171 Chapter 4: DataWarehousing and On-line Analytical Processing  Data Warehouse: Basic Concepts  Data Warehouse Modeling: Data Cube and OLAP  Data Warehouse Design and Usage  Data Warehouse Implementation  Data Generalization by Attribute-Oriented Induction  Summary
  • 172.
    172 What is aData Warehouse?  Defined in many different ways, but not rigorously.  A decision support database that is maintained separately from the organization’s operational database  Support information processing by providing a solid platform of consolidated, historical data for analysis.  “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”—W. H. Inmon  Data warehousing:  The process of constructing and using data warehouses
  • 173.
    173 Data Warehouse—Subject-Oriented  Organizedaround major subjects, such as customer, product, sales  Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing  Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
  • 174.
    174 Data Warehouse—Integrated  Constructedby integrating multiple, heterogeneous data sources  relational databases, flat files, on-line transaction records  Data cleaning and data integration techniques are applied.  Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources  E.g., Hotel price: currency, tax, breakfast covered, etc.  When data is moved to the warehouse, it is converted.
  • 175.
    175 Data Warehouse—Time Variant The time horizon for the data warehouse is significantly longer than that of operational systems  Operational database: current value data  Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)  Every key structure in the data warehouse  Contains an element of time, explicitly or implicitly  But the key of operational data may or may not contain “time element”
  • 176.
    176 Data Warehouse—Nonvolatile  Aphysically separate store of data transformed from the operational environment  Operational update of data does not occur in the data warehouse environment  Does not require transaction processing, recovery, and concurrency control mechanisms  Requires only two operations in data accessing:  initial loading of data and access of data
  • 177.
177 OLTP vs. OLAP (each row: OLTP value | OLAP value)
  users: clerk, IT professional | knowledge worker
  function: day-to-day operations | decision support
  DB design: application-oriented | subject-oriented
  data: current, up-to-date, detailed, flat relational, isolated | historical, summarized, multidimensional, integrated, consolidated
  usage: repetitive | ad-hoc
  access: read/write, index/hash on primary key | lots of scans
  unit of work: short, simple transaction | complex query
  # records accessed: tens | millions
  # users: thousands | hundreds
  DB size: 100 MB-GB | 100 GB-TB
  metric: transaction throughput | query throughput, response time
  • 178.
    178 Why a SeparateData Warehouse?  High performance for both systems  DBMS— tuned for OLTP: access methods, indexing, concurrency control, recovery  Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation  Different functions and different data:  missing data: Decision support requires historical data which operational DBs do not typically maintain  data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources  data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled  Note: There are more and more systems which perform OLAP analysis directly on relational databases
  • 179.
179 Data Warehouse: A Multi-Tiered Architecture  (Figure: operational DBs and other data sources feed, via extract/transform/load/refresh and a monitor & integrator, into the data storage tier (data warehouse, data marts, metadata repository), which an OLAP server/engine serves to front-end tools for analysis, query/reports, and data mining)
  • 180.
180 Three Data Warehouse Models  Enterprise warehouse  collects all of the information about subjects spanning the entire organization  Data Mart  a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart  Independent vs. dependent (directly from warehouse) data mart  Virtual warehouse  A set of views over operational databases  Only some of the possible summary views may be materialized
  • 181.
    181 Extraction, Transformation, andLoading (ETL)  Data extraction  get data from multiple, heterogeneous, and external sources  Data cleaning  detect errors in the data and rectify them when possible  Data transformation  convert data from legacy or host format to warehouse format  Load  sort, summarize, consolidate, compute views, check integrity, and build indicies and partitions  Refresh  propagate the updates from the data sources to the warehouse
  • 182.
    182 Metadata Repository  Metadata is the data defining warehouse objects. It stores:  Description of the structure of the data warehouse  schema, view, dimensions, hierarchies, derived data defn, data mart locations and contents  Operational meta-data  data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)  The algorithms used for summarization  The mapping from operational environment to the data warehouse  Data related to system performance  warehouse schema, view and derived data definitions  Business data
  • 183.
    183 Chapter 4: DataWarehousing and On-line Analytical Processing  Data Warehouse: Basic Concepts  Data Warehouse Modeling: Data Cube and OLAP  Data Warehouse Design and Usage  Data Warehouse Implementation  Data Generalization by Attribute-Oriented Induction  Summary
  • 184.
    184 From Tables andSpreadsheets to Data Cubes  A data warehouse is based on a multidimensional data model which views data in the form of a data cube  A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions  Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year)  Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables  In data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
  • 185.
    185 Cube: A Latticeof Cuboids time,item time,item,location time, item, location, supplier all time item location supplier time,location time,supplier item,location item,supplier location,supplier time,item,supplier time,location,supplier item,location,supplier 0-D (apex) cuboid 1-D cuboids 2-D cuboids 3-D cuboids 4-D (base) cuboid
  • 186.
    186 Conceptual Modeling ofData Warehouses  Modeling data warehouses: dimensions & measures  Star schema: A fact table in the middle connected to a set of dimension tables  Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake  Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
  • 187.
187 Example of Star Schema  (Figure) Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales  Dimension tables: time (time_key, day, day_of_the_week, month, quarter, year), item (item_key, item_name, brand, type, supplier_type), branch (branch_key, branch_name, branch_type), location (location_key, street, city, state_or_province, country)
  • 188.
188 Example of Snowflake Schema  (Figure) Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales  Dimension tables: time (time_key, day, day_of_the_week, month, quarter, year), item (item_key, item_name, brand, type, supplier_key) normalized with supplier (supplier_key, supplier_type), branch (branch_key, branch_name, branch_type), location (location_key, street, city_key) normalized with city (city_key, city, state_or_province, country)
  • 189.
189 Example of Fact Constellation  (Figure) Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales  Shipping fact table: time_key, item_key, shipper_key, from_location, to_location, dollars_cost, units_shipped  Shared dimension tables: time (time_key, day, day_of_the_week, month, quarter, year), item (item_key, item_name, brand, type, supplier_type), branch (branch_key, branch_name, branch_type), location (location_key, street, city, province_or_state, country), shipper (shipper_key, shipper_name, location_key, shipper_type)
  • 190.
    190 A Concept Hierarchy: Dimension(location) all Europe North_America Mexico Canada Spain Germany Vancouver M. Wind L. Chan ... ... ... ... ... ... all region office country Toronto Frankfurt city
  • 191.
    191 Data Cube Measures:Three Categories  Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning  E.g., count(), sum(), min(), max()  Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function  E.g., avg(), min_N(), standard_deviation()  Holistic: if there is no constant bound on the storage size needed to describe a subaggregate.  E.g., median(), mode(), rank()
  • 192.
    192 View of Warehousesand Hierarchies Specification of hierarchies  Schema hierarchy day < {month < quarter; week} < year  Set_grouping hierarchy {1..10} < inexpensive
  • 193.
193 Multidimensional Data  Sales volume as a function of product, month, and region  Dimensions: Product, Location, Time  Hierarchical summarization paths: Industry > Category > Product; Region > Country > City > Office; Year > Quarter > Month/Week > Day
  • 194.
194 A Sample Data Cube  (Figure: a 3-D data cube of sales with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr), Product (TV, VCR, PC), and Country (U.S.A., Canada, Mexico), together with sum cells; e.g., one aggregate cell holds the total annual sales of TVs in the U.S.A.)
  • 195.
    195 Cuboids Corresponding tothe Cube all product date country product,date product,country date, country product, date, country 0-D (apex) cuboid 1-D cuboids 2-D cuboids 3-D (base) cuboid
  • 196.
    196 Typical OLAP Operations Roll up (drill-up): summarize data  by climbing up hierarchy or by dimension reduction  Drill down (roll down): reverse of roll-up  from higher level summary to lower level summary or detailed data, or introducing new dimensions  Slice and dice: project and select  Pivot (rotate):  reorient the cube, visualization, 3D to series of 2D planes  Other operations  drill across: involving (across) more than one fact table  drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
  • 198.
    198 A Star-Net QueryModel Shipping Method AIR-EXPRESS TRUCK ORDER Customer Orders CONTRACTS Customer Product PRODUCT GROUP PRODUCT LINE PRODUCT ITEM SALES PERSON DISTRICT DIVISION Organization Promotion CITY COUNTRY REGION Location DAILY QTRLY ANNUALY Time Each circle is called a footprint
  • 199.
    199 Browsing a DataCube  Visualization  OLAP capabilities  Interactive manipulation
  • 200.
    200 Chapter 4: DataWarehousing and On-line Analytical Processing  Data Warehouse: Basic Concepts  Data Warehouse Modeling: Data Cube and OLAP  Data Warehouse Design and Usage  Data Warehouse Implementation  Data Generalization by Attribute-Oriented Induction  Summary
  • 201.
    201 Design of DataWarehouse: A Business Analysis Framework  Four views regarding the design of a data warehouse  Top-down view  allows selection of the relevant information necessary for the data warehouse  Data source view  exposes the information being captured, stored, and managed by operational systems  Data warehouse view  consists of fact tables and dimension tables  Business query view  sees the perspectives of data in the warehouse from the view of end-user
  • 202.
    202 Data Warehouse DesignProcess  Top-down, bottom-up approaches or a combination of both  Top-down: Starts with overall design and planning (mature)  Bottom-up: Starts with experiments and prototypes (rapid)  From software engineering point of view  Waterfall: structured and systematic analysis at each step before proceeding to the next  Spiral: rapid generation of increasingly functional systems, short turn around time, quick turn around  Typical data warehouse design process  Choose a business process to model, e.g., orders, invoices, etc.  Choose the grain (atomic level of data) of the business process  Choose the dimensions that will apply to each fact table record  Choose the measure that will populate each fact table record
  • 203.
    203 Data Warehouse Development: ARecommended Approach Define a high-level corporate data model Data Mart Data Mart Distributed Data Marts Multi-Tier Data Warehouse Enterprise Data Warehouse Model refinement Model refinement
  • 204.
    204 Data Warehouse Usage Three kinds of data warehouse applications  Information processing  supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs  Analytical processing  multidimensional analysis of data warehouse data  supports basic OLAP operations, slice-dice, drilling, pivoting  Data mining  knowledge discovery from hidden patterns  supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools
  • 205.
    205 From On-Line AnalyticalProcessing (OLAP) to On Line Analytical Mining (OLAM)  Why online analytical mining?  High quality of data in data warehouses  DW contains integrated, consistent, cleaned data  Available information processing structure surrounding data warehouses  ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools  OLAP-based exploratory data analysis  Mining with drilling, dicing, pivoting, etc.  On-line selection of data mining functions  Integration and swapping of multiple mining functions, algorithms, and tasks
  • 206.
    206 Chapter 4: DataWarehousing and On-line Analytical Processing  Data Warehouse: Basic Concepts  Data Warehouse Modeling: Data Cube and OLAP  Data Warehouse Design and Usage  Data Warehouse Implementation  Data Generalization by Attribute-Oriented Induction  Summary
  • 207.
207 Efficient Data Cube Computation  Data cube can be viewed as a lattice of cuboids  The bottom-most cuboid is the base cuboid  The top-most cuboid (apex) contains only one cell  How many cuboids are there in an n-dimensional cube with L levels? T = Π_{i=1..n} (L_i + 1), where L_i is the number of levels of dimension i  Materialization of data cube  Materialize every cuboid (full materialization), none (no materialization), or some (partial materialization)  Selection of which cuboids to materialize  Based on size, sharing, access frequency, etc.
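A quick numeric check of the formula above (the per-dimension level counts are illustrative, not from the text):

from math import prod

def num_cuboids(levels):
    # T = product over dimensions of (L_i + 1), the +1 accounting for the "all" level
    return prod(L + 1 for L in levels)

print(num_cuboids([4, 2, 3, 1]))   # (4+1)(2+1)(3+1)(1+1) = 120 cuboids
print(num_cuboids([1, 1, 1, 1]))   # no concept hierarchies: 2^4 = 16 cuboids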
  • 208.
208 The “Compute Cube” Operator  Cube definition and computation in DMQL: define cube sales [item, city, year]: sum (sales_in_dollars) compute cube sales  Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et al.’96): SELECT item, city, year, SUM (amount) FROM SALES CUBE BY item, city, year  Need to compute the following group-bys: (item, city, year), (item, city), (item, year), (city, year), (item), (city), (year), ()
  • 209.
209 Indexing OLAP Data: Bitmap Index  Index on a particular column  Each value in the column has a bit vector: bit-op is fast  The length of the bit vector: # of records in the base table  The i-th bit is set if the i-th row of the base table has the value for the indexed column  Not suitable for high cardinality domains  A recent bit compression technique, Word-Aligned Hybrid (WAH), makes it work for high cardinality domains as well [Wu, et al. TODS’06]
  Base table:              Index on Region:                Index on Type:
  Cust  Region   Type      RecID  Asia  Europe  America    RecID  Retail  Dealer
  C1    Asia     Retail    1      1     0       0          1      1       0
  C2    Europe   Dealer    2      0     1       0          2      0       1
  C3    Asia     Dealer    3      1     0       0          3      0       1
  C4    America  Retail    4      0     0       1          4      1       0
  C5    Europe   Dealer    5      0     1       0          5      0       1
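A minimal sketch of a bitmap index on the base table above, using Python integers as bit vectors (one bit per row; the AND query at the end is our own example):

base = [("C1", "Asia", "Retail"), ("C2", "Europe", "Dealer"),
        ("C3", "Asia", "Dealer"), ("C4", "America", "Retail"),
        ("C5", "Europe", "Dealer")]

def build_bitmap(rows, col):
    # one bit vector per distinct value; bit i corresponds to row i
    index = {}
    for i, row in enumerate(rows):
        index.setdefault(row[col], 0)
        index[row[col]] |= 1 << i
    return index

region = build_bitmap(base, 1)   # {'Asia': 0b00101, 'Europe': 0b10010, 'America': 0b01000}
rtype = build_bitmap(base, 2)    # {'Retail': 0b01001, 'Dealer': 0b10110}

# "Region = Asia AND Type = Retail" is a single fast bit operation
hits = region["Asia"] & rtype["Retail"]
print([base[i][0] for i in range(len(base)) if hits >> i & 1])   # ['C1']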
  • 210.
210 Indexing OLAP Data: Join Indices  Join index: JI(R-id, S-id) where R (R-id, …) ⋈ S (S-id, …)  Traditional indices map the values to a list of record ids  It materializes the relational join in the JI file and speeds up the relational join  In data warehouses, a join index relates the values of the dimensions of a star schema to rows in the fact table  E.g. fact table: Sales and two dimensions city and product  A join index on city maintains for each distinct city a list of R-IDs of the tuples recording the Sales in the city  Join indices can span multiple dimensions
  • 211.
211 Efficient Processing of OLAP Queries  Determine which operations should be performed on the available cuboids  Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection  Determine which materialized cuboid(s) should be selected for the OLAP operation  Let the query to be processed be on {brand, province_or_state} with the condition “year = 2004”, and there are 4 materialized cuboids available: 1) {year, item_name, city} 2) {year, brand, country} 3) {year, brand, province_or_state} 4) {item_name, province_or_state} where year = 2004 Which should be selected to process the query?  Explore indexing structures and compressed vs. dense array structures in MOLAP
  • 212.
212 OLAP Server Architectures  Relational OLAP (ROLAP)  Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middleware  Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services  Greater scalability  Multidimensional OLAP (MOLAP)  Sparse array-based multidimensional storage engine  Fast indexing to pre-computed summarized data  Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)  Flexibility, e.g., low level: relational, high level: array  Specialized SQL servers (e.g., Redbricks)  Specialized support for SQL queries over star/snowflake schemas
  • 213.
    213 Chapter 4: DataWarehousing and On-line Analytical Processing  Data Warehouse: Basic Concepts  Data Warehouse Modeling: Data Cube and OLAP  Data Warehouse Design and Usage  Data Warehouse Implementation  Data Generalization by Attribute-Oriented Induction  Summary
  • 214.
    214 Attribute-Oriented Induction  Proposedin 1989 (KDD ‘89 workshop)  Not confined to categorical data nor particular measures  How it is done?  Collect the task-relevant data (initial relation) using a relational database query  Perform generalization by attribute removal or attribute generalization  Apply aggregation by merging identical, generalized tuples and accumulating their respective counts  Interaction with users for knowledge presentation
  • 215.
    215 Attribute-Oriented Induction: AnExample Example: Describe general characteristics of graduate students in the University database  Step 1. Fetch relevant set of data using an SQL statement, e.g., Select * (i.e., name, gender, major, birth_place, birth_date, residence, phone#, gpa) from student where student_status in {“Msc”, “MBA”, “PhD” }  Step 2. Perform attribute-oriented induction  Step 3. Present results in generalized relation, cross-tab, or rule forms
  • 216.
216 Class Characterization: An Example
  Initial relation:
  Name            Gender  Major    Birth-Place            Birth_date  Residence                 Phone #   GPA
  Jim Woodman     M       CS       Vancouver, BC, Canada  8-12-76     3511 Main St., Richmond   687-4598  3.67
  Scott Lachance  M       CS       Montreal, Que, Canada  28-7-75     345 1st Ave., Richmond    253-9106  3.70
  Laura Lee       F       Physics  Seattle, WA, USA       25-8-70     125 Austin Ave., Burnaby  420-5232  3.83
  …
  Generalization plan: Name removed; Gender retained; Major generalized to Sci, Eng, Bus; Birth-Place generalized to Country; Birth_date generalized to Age range; Residence generalized to City; Phone # removed; GPA generalized to Excl, VG, …
  Prime generalized relation:
  Gender  Major    Birth_region  Age_range  Residence  GPA        Count
  M       Science  Canada        20-25      Richmond   Very-good  16
  F       Science  Foreign       25-30      Burnaby    Excellent  22
  …       …        …             …          …          …          …
  Crosstab of count by Gender and Birth_Region:
  Gender  Canada  Foreign  Total
  M       16      14       30
  F       10      22       32
  Total   26      36       62
  • 217.
    217 Basic Principles ofAttribute-Oriented Induction  Data focusing: task-relevant data, including dimensions, and the result is the initial relation  Attribute-removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A’s higher level concepts are expressed in terms of other attributes  Attribute-generalization: If there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A  Attribute-threshold control: typical 2-8, specified/default
  • 218.
    218 Attribute-Oriented Induction: Basic Algorithm InitialRel: Query processing of task-relevant data, deriving the initial relation.  PreGen: Based on the analysis of the number of distinct values in each attribute, determine generalization plan for each attribute: removal? or how high to generalize?  PrimeGen: Based on the PreGen plan, perform generalization to the right level to derive a “prime generalized relation”, accumulating the counts.  Presentation: User interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, visualization presentations.
  • 219.
219 Presentation of Generalized Results  Generalized relation:  Relations where some or all attributes are generalized, with counts or other aggregation values accumulated  Cross tabulation:  Mapping results into cross tabulation form (similar to contingency tables)  Visualization techniques:  Pie charts, bar charts, curves, cubes, and other visual forms  Quantitative characteristic rules:  Mapping generalized result into characteristic rules with quantitative information associated with it, e.g., grad(x) ∧ male(x) ⇒ birth_region(x) = “Canada” [t: 53%] ∨ birth_region(x) = “foreign” [t: 47%]
  • 220.
    220 Mining Class Comparisons Comparison: Comparing two or more classes  Method:  Partition the set of relevant data into the target class and the contrasting class(es)  Generalize both classes to the same high level concepts  Compare tuples with the same high level descriptions  Present for every tuple its description and two measures  support - distribution within single class  comparison - distribution between classes  Highlight the tuples with strong discriminant features  Relevance Analysis:  Find attributes (features) which best distinguish different classes
  • 221.
    221 Concept Description vs.Cube-Based OLAP  Similarity:  Data generalization  Presentation of data summarization at multiple levels of abstraction  Interactive drilling, pivoting, slicing and dicing  Differences:  OLAP has systematic preprocessing, query independent, and can drill down to rather low level  AOI has automated desired level allocation, and may perform dimension relevance analysis/ranking when there are many relevant dimensions  AOI works on the data which are not in relational forms
  • 222.
    222 Chapter 4: DataWarehousing and On-line Analytical Processing  Data Warehouse: Basic Concepts  Data Warehouse Modeling: Data Cube and OLAP  Data Warehouse Design and Usage  Data Warehouse Implementation  Data Generalization by Attribute-Oriented Induction  Summary
  • 223.
223 Summary  Data warehousing: A multi-dimensional model of a data warehouse  A data cube consists of dimensions & measures  Star schema, snowflake schema, fact constellations  OLAP operations: drilling, rolling, slicing, dicing and pivoting  Data Warehouse Architecture, Design, and Usage  Multi-tiered architecture  Business analysis design framework  Information processing, analytical processing, data mining, OLAM (Online Analytical Mining)  Implementation: Efficient computation of data cubes  Partial vs. full vs. no materialization  Indexing OLAP data: Bitmap index and join index  OLAP query processing  OLAP servers: ROLAP, MOLAP, HOLAP  Data generalization: Attribute-oriented induction
  • 224.
    224 References (I)  S.Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. VLDB’96  D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data warehouses. SIGMOD’97  R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. ICDE’97  S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM SIGMOD Record, 26:65-74, 1997  E. F. Codd, S. B. Codd, and C. T. Salley. Beyond decision support. Computer World, 27, July 1993.  J. Gray, et al. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29-54, 1997.  A. Gupta and I. S. Mumick. Materialized Views: Techniques, Implementations, and Applications. MIT Press, 1999.  J. Han. Towards on-line analytical mining in large databases. ACM SIGMOD Record, 27:97-107, 1998.  V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. SIGMOD’96  J. Hellerstein, P. Haas, and H. Wang. Online aggregation. SIGMOD'97
  • 225.
    225 References (II)  C.Imhoff, N. Galemmo, and J. G. Geiger. Mastering Data Warehouse Design: Relational and Dimensional Techniques. John Wiley, 2003  W. H. Inmon. Building the Data Warehouse. John Wiley, 1996  R. Kimball and M. Ross. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling. 2ed. John Wiley, 2002  P. O’Neil and G. Graefe. Multi-table joins through bitmapped join indices. SIGMOD Record, 24:8– 11, Sept. 1995.  P. O'Neil and D. Quass. Improved query performance with variant indexes. SIGMOD'97  Microsoft. OLEDB for OLAP programmer's reference version 1.0. In http://www.microsoft.com/data/oledb/olap, 1998  S. Sarawagi and M. Stonebraker. Efficient organization of large multidimensional arrays. ICDE'94  A. Shoshani. OLAP and statistical databases: Similarities and differences. PODS’00.  D. Srivastava, S. Dar, H. V. Jagadish, and A. V. Levy. Answering queries with aggregation using views. VLDB'96  P. Valduriez. Join indices. ACM Trans. Database Systems, 12:218-246, 1987.  J. Widom. Research problems in data warehousing. CIKM’95  K. Wu, E. Otoo, and A. Shoshani, Optimal Bitmap Indices with Efficient Compression, ACM Trans. on Database Systems (TODS), 31(1): 1-38, 2006
  • 227.
227 Compression of Bitmap Indices  Bitmap indexes must be compressed to reduce I/O costs and minimize CPU usage—the majority of the bits are 0’s  Two compression schemes:  Byte-aligned Bitmap Code (BBC)  Word-Aligned Hybrid (WAH) code  Time and space required to operate on a compressed bitmap is proportional to the total size of the bitmap  Optimal on attributes of low cardinality as well as those of high cardinality  WAH outperforms BBC by about a factor of two
  • 228.
228 Data Mining: Concepts and Techniques (3rd ed.) — Chapter 5 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2010 Han, Kamber & Pei. All rights reserved.
  • 229.
    229 229 Chapter 5: DataCube Technology  Data Cube Computation: Preliminary Concepts  Data Cube Computation Methods  Processing Advanced Queries by Exploring Data Cube Technology  Multidimensional Data Analysis in Cube Space  Summary
  • 230.
230 Data Cube: A Lattice of Cuboids  0-D (apex) cuboid: all  1-D cuboids: time, item, location, supplier  2-D cuboids: (time, item), (time, location), (time, supplier), (item, location), (item, supplier), (location, supplier)  3-D cuboids: (time, item, location), (time, item, supplier), (time, location, supplier), (item, location, supplier)  4-D (base) cuboid: (time, item, location, supplier)
  • 231.
    231 Data Cube: ALattice of Cuboids  Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells 1. (9/15, milk, Urbana, Dairy_land) 2. (9/15, milk, Urbana, *) 3. (*, milk, Urbana, *) 4. (*, milk, Urbana, *) 5. (*, milk, Chicago, *) 6. (*, milk, *, *) all time,item time,item,location time, item, location, supplier time item location supplier time,location time,supplier item,location item,supplier location,supplier time,item,supplier time,location,supplier item,location,supplier 0-D(apex) cuboid 1-D cuboids 2-D cuboids 3-D cuboids 4-D(base) cuboid
  • 232.
232 Cube Materialization: Full Cube vs. Iceberg Cube  Full cube vs. iceberg cube compute cube sales iceberg as select month, city, customer group, count(*) from salesInfo cube by month, city, customer group having count(*) >= min support  (the having clause is the iceberg condition)  Computing only the cuboid cells whose measure satisfies the iceberg condition  Only a small portion of cells may be “above the water’’ in a sparse cube  Avoid explosive growth: A cube with 100 dimensions  2 base cells: (a1, a2, …, a100), (b1, b2, …, b100)  How many aggregate cells if “having count >= 1”?  What about “having count >= 2”?
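A minimal brute-force sketch of iceberg-cube computation for a count measure (the tiny salesInfo-style data set below is ours, purely for illustration):

from collections import Counter
from itertools import combinations

sales = [("Jan", "Chicago", "retail"), ("Jan", "Chicago", "retail"),
         ("Jan", "Urbana", "wholesale"), ("Feb", "Chicago", "retail")]
dims = ("month", "city", "customer_group")
min_sup = 2

iceberg = {}
for k in range(len(dims) + 1):
    for subset in combinations(range(len(dims)), k):
        # aggregate this group-by: keep the chosen dimensions, star out the rest
        counts = Counter(tuple(t[i] if i in subset else "*" for i in range(len(dims)))
                         for t in sales)
        # keep only the cells whose count satisfies the iceberg condition
        iceberg.update({cell: c for cell, c in counts.items() if c >= min_sup})

for cell, c in sorted(iceberg.items()):
    print(cell, c)
# e.g. ('*', '*', '*') 4, ('*', 'Chicago', 'retail') 3, ('Jan', 'Chicago', 'retail') 2, ...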
  • 233.
233 Iceberg Cube, Closed Cube & Cube Shell  Is iceberg cube good enough?  2 base cells: {(a1, a2, a3 . . . , a100):10, (a1, a2, b3, . . . , b100):10}  How many cells will the iceberg cube have if having count(*) >= 10? Hint: A huge but tricky number!  Closed cube:  Closed cell c: if there exists no cell d, s.t. d is a descendant of c, and d has the same measure value as c  Closed cube: a cube consisting of only closed cells  What is the closed cube of the above base cuboid? Hint: only 3 cells  Cube Shell  Precompute only the cuboids involving a small # of dimensions, e.g., 3  More dimension combinations will need to be computed on the fly  For (A1, A2, … A10), how many combinations to compute?
  • 234.
    234 234 Roadmap for EfficientComputation  General cube computation heuristics (Agarwal et al.’96)  Computing full/iceberg cubes: 3 methodologies  Bottom-Up: Multi-Way array aggregation (Zhao, Deshpande & Naughton, SIGMOD’97)  Top-down:  BUC (Beyer & Ramarkrishnan, SIGMOD’99)  H-cubing technique (Han, Pei, Dong & Wang: SIGMOD’01)  Integrating Top-Down and Bottom-Up:  Star-cubing algorithm (Xin, Han, Li & Wah: VLDB’03)  High-dimensional OLAP: A Minimal Cubing Approach (Li, et al. VLDB’04)  Computing alternative kinds of cubes:  Partial cube, closed cube, approximate cube, etc.
  • 235.
    235 235 General Heuristics (Agarwalet al. VLDB’96)  Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples  Aggregates may be computed from previously computed aggregates, rather than from the base fact table  Smallest-child: computing a cuboid from the smallest, previously computed cuboid  Cache-results: caching results of a cuboid from which other cuboids are computed to reduce disk I/Os  Amortize-scans: computing as many as possible cuboids at the same time to amortize disk reads  Share-sorts: sharing sorting costs cross multiple cuboids when sort-based method is used  Share-partitions: sharing the partitioning cost across multiple cuboids when hash-based algorithms are used
  • 236.
    236 236 Chapter 5: DataCube Technology  Data Cube Computation: Preliminary Concepts  Data Cube Computation Methods  Processing Advanced Queries by Exploring Data Cube Technology  Multidimensional Data Analysis in Cube Space  Summary
  • 237.
    237 237 Data Cube ComputationMethods  Multi-Way Array Aggregation  BUC  Star-Cubing  High-Dimensional OLAP
  • 238.
238 Multi-Way Array Aggregation  Array-based “bottom-up” algorithm  Using multi-dimensional chunks  No direct tuple comparisons  Simultaneous aggregation on multiple dimensions  Intermediate aggregate values are re-used for computing ancestor cuboids  Cannot do Apriori pruning: no iceberg optimization  (Figure: the cuboid lattice All; A, B, C; AB, AC, BC; ABC)
  • 239.
239 Multi-way Array Aggregation for Cube Computation (MOLAP)  Partition arrays into chunks (a small subcube which fits in memory)  Compressed sparse array addressing: (chunk_id, offset)  Compute aggregates in “multiway” by visiting cube cells in the order which minimizes the # of times to visit each cell, and reduces memory access and storage cost  What is the best traversing order to do multi-way aggregation?  (Figure: a 3-D array over dimensions A (a0–a3), B (b0–b3), C (c0–c3) partitioned into 64 chunks numbered 1–64)
  • 240.
240 Multi-way Array Aggregation for Cube Computation (3-D to 2-D)  The best order is the one that minimizes the memory requirement and reduces I/Os  (Figure: aggregating the 3-D cuboid ABC down to the 2-D cuboids AB, AC, BC in the lattice All; A, B, C; AB, AC, BC; ABC)
  • 241.
241 Multi-way Array Aggregation for Cube Computation (2-D to 1-D)  (Figure: aggregating the 2-D cuboids AB, AC, BC down to the 1-D cuboids A, B, C and the apex All)
  • 242.
    242 242 Multi-Way Array Aggregationfor Cube Computation (Method Summary)  Method: the planes should be sorted and computed according to their size in ascending order  Idea: keep the smallest plane in the main memory, fetch and compute only one chunk at a time for the largest plane  Limitation of the method: computing well only for a small number of dimensions  If there are a large number of dimensions, “top- down” computation and iceberg cube computation methods can be explored
  • 243.
    243 243 Data Cube ComputationMethods  Multi-Way Array Aggregation  BUC  Star-Cubing  High-Dimensional OLAP
  • 244.
244 Bottom-Up Computation (BUC)  BUC (Beyer & Ramakrishnan, SIGMOD’99)  Bottom-up cube computation (Note: top-down in our view!)  Divides dimensions into partitions and facilitates iceberg pruning  If a partition does not satisfy min_sup, its descendants can be pruned  If minsup = 1, compute the full CUBE!  No simultaneous aggregation  (Figure: the cuboid lattice from all down to ABCD, with numbers 1–16 giving BUC’s processing order: all, A, AB, ABC, ABCD, ABD, AC, ACD, AD, B, BC, BCD, BD, C, CD, D)
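A minimal sketch of BUC-style recursive partitioning with iceberg pruning on a count measure (illustrative only, not the paper's optimized implementation; the toy relation is ours):

from collections import defaultdict

def buc(rows, dims, start, cell, min_sup, out):
    if len(rows) < min_sup:                  # prune: no descendant cell can satisfy min_sup
        return
    out.append((dict(cell), len(rows)))      # output the current aggregate cell
    for d in range(start, len(dims)):        # expand one more dimension at a time
        parts = defaultdict(list)
        for r in rows:
            parts[r[d]].append(r)            # partition on dimension d
        for value, part in parts.items():
            cell[dims[d]] = value
            buc(part, dims, d + 1, cell, min_sup, out)
            del cell[dims[d]]

rows = [("a1", "b1", "c1"), ("a1", "b1", "c2"), ("a1", "b2", "c1"), ("a2", "b1", "c1")]
cells = []
buc(rows, ("A", "B", "C"), 0, {}, 2, cells)
print(cells)   # e.g. ({}, 4), ({'A': 'a1'}, 3), ({'A': 'a1', 'B': 'b1'}, 2), ...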
  • 245.
245 BUC: Partitioning  Usually, the entire data set can’t fit in main memory  Sort distinct values, partition into blocks that fit  Continue processing  Optimizations  Partitioning: external sorting, hashing, counting sort  Ordering dimensions to encourage pruning: cardinality, skew, correlation  Collapsing duplicates  Can’t do holistic aggregates anymore!
  • 246.
    246 246 Data Cube ComputationMethods  Multi-Way Array Aggregation  BUC  Star-Cubing  High-Dimensional OLAP
  • 247.
    247 247 Star-Cubing: An IntegratingMethod  D. Xin, J. Han, X. Li, B. W. Wah, Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration, VLDB'03  Explore shared dimensions  E.g., dimension A is the shared dimension of ACD and AD  ABD/AB means cuboid ABD has shared dimensions AB  Allows for shared computations  e.g., cuboid AB is computed simultaneously as ABD C/C AC/A C BC/BC ABC/ABC ABD/AB ACD/A BCD AD/A BD/B CD D ABC D/all  Aggregate in a top-down manner but with the bottom- up sub-layer underneath which will allow Apriori pruning  Shared dimensions grow in bottom-up fashion
  • 248.
    248 248 Iceberg Pruning inShared Dimensions  Anti-monotonic property of shared dimensions  If the measure is anti-monotonic, and if the aggregate value on a shared dimension does not satisfy the iceberg condition, then all the cells extended from this shared dimension cannot satisfy the condition either  Intuition: if we can compute the shared dimensions before the actual cuboid, we can use them to do Apriori pruning  Problem: how to prune while still aggregate simultaneously on multiple dimensions?
  • 249.
    249 249 Cell Trees  Usea tree structure similar to H-tree to represent cuboids  Collapses common prefixes to save memory  Keep count at node  Traverse the tree to retrieve a particular tuple
  • 250.
250 Star Attributes and Star Nodes  Intuition: If a single-dimensional aggregate on an attribute value p does not satisfy the iceberg condition, it is useless to distinguish them during the iceberg computation  E.g., b2, b3, b4, c1, c2, c4, d1, d2, d3  Solution: Replace such attributes by a *. Such attributes are star attributes, and the corresponding nodes in the cell tree are star nodes
  A   B   C   D   Count
  a1  b1  c1  d1  1
  a1  b1  c4  d3  1
  a1  b2  c2  d2  1
  a2  b3  c3  d4  1
  a2  b4  c3  d4  1
  • 251.
251 Example: Star Reduction  Suppose minsup = 2  Perform one-dimensional aggregation. Replace attribute values whose count < 2 with *. And collapse all *’s together  Resulting table has all such attributes replaced with the star-attribute  With regards to the iceberg computation, this new table is a lossless compression of the original table
  After replacing infrequent values with *:
  A   B   C   D   Count
  a1  b1  *   *   1
  a1  b1  *   *   1
  a1  *   *   *   1
  a2  *   c3  d4  1
  a2  *   c3  d4  1
  After collapsing identical tuples:
  A   B   C   D   Count
  a1  b1  *   *   2
  a1  *   *   *   1
  a2  *   c3  d4  2
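A minimal sketch of this star-reduction step on the base table from the previous slide (minsup = 2):

from collections import Counter

rows = [("a1", "b1", "c1", "d1"), ("a1", "b1", "c4", "d3"), ("a1", "b2", "c2", "d2"),
        ("a2", "b3", "c3", "d4"), ("a2", "b4", "c3", "d4")]
min_sup = 2

# one-dimensional aggregation per column
col_counts = [Counter(r[i] for r in rows) for i in range(len(rows[0]))]

# replace infrequent attribute values with '*' (star attributes)
starred = [tuple(v if col_counts[i][v] >= min_sup else "*" for i, v in enumerate(r))
           for r in rows]

# collapse identical generalized tuples, accumulating counts
print(Counter(starred))
# Counter({('a1', 'b1', '*', '*'): 2, ('a2', '*', 'c3', 'd4'): 2, ('a1', '*', '*', '*'): 1})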
  • 252.
252 Star Tree  Given the new compressed table, it is possible to construct the corresponding cell tree—called a star tree  Keep a star table at the side for easy lookup of star attributes  The star tree is a lossless compression of the original cell tree
  A   B   C   D   Count
  a1  b1  *   *   2
  a1  *   *   *   1
  a2  *   c3  d4  2
  • 253.
253 Star-Cubing Algorithm—DFS on Lattice Tree  (Figure: the cuboid lattice from all down to ABCD/all, with each cuboid annotated by its shared dimensions, e.g., AB/AB, AC/AC, ABD/AB, ACD/A, BCD, and the base star tree rooted at root: 5 with children a1: 3 and a2: 2 and their descendant star nodes b*, b1, c*, c3, d*, d4 carrying counts)
  • 254.
254 Multi-Way Aggregation  (Figure: the child star trees ABC/ABC, ABD/AB, ACD/A, and BCD that are aggregated simultaneously while traversing the base ABCD tree)
  • 255.
255 Star-Cubing Algorithm—DFS on Star-Tree  (Figure: the DFS over the base ABCD star tree, during which the descendant trees ABC/ABC, ABD/AB, ACD/A, and BCD are created)
  • 256.
    256 256 Multi-Way Star-Tree Aggregation Start depth-first search at the root of the base star tree  At each new node in the DFS, create corresponding star tree that are descendants of the current tree according to the integrated traversal ordering  E.g., in the base tree, when DFS reaches a1, the ACD/A tree is created  When DFS reaches b*, the ABD/AD tree is created  The counts in the base tree are carried over to the new trees  When DFS reaches a leaf node (e.g., d*), start backtracking  On every backtracking branch, the count in the corresponding trees are output, the tree is destroyed, and the node in the base tree is destroyed  Example  When traversing from d* back to c*, the a1b*c*/a1b*c* tree is output and destroyed  When traversing from c* back to b*, the a1b*D/a1b* tree is output and destroyed  When at b*, jump to b1 and repeat similar process ABC /ABC ABD/AB ACD /A BCD ABCD
  • 257.
    257 257 Data Cube ComputationMethods  Multi-Way Array Aggregation  BUC  Star-Cubing  High-Dimensional OLAP
  • 258.
258 The Curse of Dimensionality  None of the previous cubing methods can handle high dimensionality!  Example: a database of 600k tuples, where each dimension has a cardinality of 100 and a Zipf skew factor of 2
  • 259.
    259 259 Motivation of High-DOLAP  X. Li, J. Han, and H. Gonzalez, High-Dimensional OLAP: A Minimal Cubing Approach, VLDB'04  Challenge to current cubing methods:  The “curse of dimensionality’’ problem  Iceberg cube and compressed cubes: only delay the inevitable explosion  Full materialization: still significant overhead in accessing results on disk  High-D OLAP is needed in applications  Science and engineering analysis  Bio-data analysis: thousands of genes  Statistical surveys: hundreds of variables
  • 260.
260 Fast High-D OLAP with Minimal Cubing  Observation: OLAP occurs only on a small subset of dimensions at a time  Semi-Online Computational Model 1. Partition the set of dimensions into shell fragments 2. Compute data cubes for each shell fragment while retaining inverted indices or value-list indices 3. Given the pre-computed fragment cubes, dynamically compute cube cells of the high-dimensional data cube online
  • 261.
    261 261 Properties of ProposedMethod  Partitions the data vertically  Reduces high-dimensional cube into a set of lower dimensional cubes  Online re-construction of original high-dimensional space  Lossless reduction  Offers tradeoffs between the amount of pre- processing and the speed of online computation
  • 262.
262 Example Computation  Let the cube aggregation function be count  Divide the 5 dimensions into 2 shell fragments: (A, B, C) and (D, E)
  tid  A   B   C   D   E
  1    a1  b1  c1  d1  e1
  2    a1  b2  c1  d2  e1
  3    a1  b2  c1  d1  e2
  4    a2  b1  c1  d1  e2
  5    a2  b1  c1  d1  e3
  • 263.
263 1-D Inverted Indices  Build a traditional inverted index or RID list
  Attribute Value  TID List       List Size
  a1               1, 2, 3        3
  a2               4, 5           2
  b1               1, 4, 5        3
  b2               2, 3           2
  c1               1, 2, 3, 4, 5  5
  d1               1, 3, 4, 5     4
  d2               2              1
  e1               1, 2           2
  e2               3, 4           2
  e3               5              1
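A minimal sketch of building these TID-list indices from the 5-tuple example table:

from collections import defaultdict

table = {1: ("a1", "b1", "c1", "d1", "e1"), 2: ("a1", "b2", "c1", "d2", "e1"),
         3: ("a1", "b2", "c1", "d1", "e2"), 4: ("a2", "b1", "c1", "d1", "e2"),
         5: ("a2", "b1", "c1", "d1", "e3")}

inverted = defaultdict(list)
for tid, values in table.items():
    for v in values:
        inverted[v].append(tid)          # TID list per attribute value

for value, tids in sorted(inverted.items()):
    print(value, tids, len(tids))        # e.g. a1 [1, 2, 3] 3, b1 [1, 4, 5] 3, ...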
  • 264.
264 Shell Fragment Cubes: Ideas  Generalize the 1-D inverted indices to multi-dimensional ones in the data cube sense  Compute all cuboids for data cubes ABC and DE while retaining the inverted indices  For example, shell fragment cube ABC contains 7 cuboids:  A, B, C  AB, AC, BC  ABC  This completes the offline computation stage  Example cells of the AB cuboid:
  Cell   Intersection         TID List  List Size
  a1 b1  {1,2,3} ∩ {1,4,5}    {1}       1
  a1 b2  {1,2,3} ∩ {2,3}      {2,3}     2
  a2 b1  {4,5} ∩ {1,4,5}      {4,5}     2
  a2 b2  {4,5} ∩ {2,3}        {}        0
  • 265.
265 Shell Fragment Cubes: Size and Design  Given a database of T tuples, D dimensions, and a shell fragment size F, the fragment cubes’ space requirement is O(T × ⌈D/F⌉ × (2^F − 1))  For F < 5, the growth is sub-linear  Shell fragments do not have to be disjoint  Fragment groupings can be arbitrary to allow for maximum online performance  Known common combinations (e.g., <city, state>) should be grouped together  Shell fragment sizes can be adjusted for an optimal balance between offline and online computation
  • 266.
266 ID_Measure Table  If measures other than count are present, store in an ID_measure table separate from the shell fragments
  tid  count  sum
  1    5      70
  2    3      10
  3    8      20
  4    5      40
  5    2      30
  • 267.
267 The Frag-Shells Algorithm
  1. Partition the set of dimensions (A1, …, An) into a set of k fragments (P1, …, Pk)
  2. Scan the base table once and do the following:
  3.   insert <tid, measure> into the ID_measure table
  4.   for each attribute value ai of each dimension Ai
  5.     build inverted index entry <ai, tidlist>
  6. For each fragment partition Pi
  7.   build local fragment cube Si by intersecting tid-lists in a bottom-up fashion
  • 268.
268 Frag-Shells (2)  (Figure: dimensions A, B, C, D, E, F, … are partitioned into shell fragments, e.g., an ABC cube and a DEF cube; the DE cuboid stores cells with tuple-ID lists, e.g., (d1, e1): {1, 3, 8, 9}, (d1, e2): {2, 4, 6, 7}, (d2, e1): {5, 10}, …)
  • 269.
269 Online Query Computation: Query  A query has the general form ⟨a1, a2, …, an⟩: M  Each ai has 3 possible values 1. Instantiated value 2. Aggregate * function 3. Inquire ? function  For example, ⟨3, ?, ?, *, 1⟩: count returns a 2-D data cube
  • 270.
270 Online Query Computation: Method  Given the fragment cubes, process a query as follows 1. Divide the query into fragments, the same as the shell partition 2. Fetch the corresponding TID list for each fragment from the fragment cube 3. Intersect the TID lists from each fragment to construct the instantiated base table 4. Compute the data cube using the base table with any cubing algorithm
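A minimal sketch of this online evaluation on the running example, with fragment cubes represented as dicts from (dimension, value) cells to TID sets; the particular query ⟨a2, ?, *, d1, *⟩ is our own illustration:

frag_ABC = {("A", "a1"): {1, 2, 3}, ("A", "a2"): {4, 5},
            ("B", "b1"): {1, 4, 5}, ("B", "b2"): {2, 3}}
frag_DE = {("D", "d1"): {1, 3, 4, 5}, ("D", "d2"): {2}}

# Steps 1-3: fetch the TID lists for the instantiated values and intersect them
tids = frag_ABC[("A", "a2")] & frag_DE[("D", "d1")]          # {4, 5}

# Step 4: compute the (here 1-D) cube over the inquired dimension B
counts = {v: len(tid_list & tids)
          for (dim, v), tid_list in frag_ABC.items() if dim == "B"}
print(counts)   # {'b1': 2, 'b2': 0}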
  • 271.
271 Online Query Computation: Sketch  (Figure: fragment cubes over dimensions A–N are probed, their TID lists are intersected into an instantiated base table, and the online cube is computed from it)
  • 272.
272 Experiment: Size vs. Dimensionality (50 and 100 cardinality)  (50-C): 10^6 tuples, 0 skew, cardinality 50, fragment size 3  (100-C): 10^6 tuples, skew 2, cardinality 100, fragment size 2
  • 273.
    273 273 Experiments on RealWorld Data  UCI Forest CoverType data set  54 dimensions, 581K tuples  Shell fragments of size 2 took 33 seconds and 325MB to compute  3-D subquery with 1 instantiate D: 85ms~1.4 sec.  Longitudinal Study of Vocational Rehab. Data  24 dimensions, 8818 tuples  Shell fragments of size 3 took 0.9 seconds and 60MB to compute  5-D query with 0 instantiated D: 227ms~2.6 sec.
  • 274.
    274 274 Chapter 5: DataCube Technology  Data Cube Computation: Preliminary Concepts  Data Cube Computation Methods  Processing Advanced Queries by Exploring Data Cube Technology  Sampling Cube  Ranking Cube  Multidimensional Data Analysis in Cube Space  Summary
  • 275.
    275 275 Processing Advanced Queriesby Exploring Data Cube Technology  Sampling Cube  X. Li, J. Han, Z. Yin, J.-G. Lee, Y. Sun, “Sampling Cube: A Framework for Statistical OLAP over Sampling Data”, SIGMOD’08  Ranking Cube  D. Xin, J. Han, H. Cheng, and X. Li. Answering top-k queries with multi-dimensional selections: The ranking cube approach. VLDB’06  Other advanced cubes for processing data and queries  Stream cube, spatial cube, multimedia cube, text cube, RFID cube, etc. — to be studied in volume 2
  • 276.
    276 276 Statistical Surveys andOLAP  Statistical survey: A popular tool to collect information about a population based on a sample  Ex.: TV ratings, US Census, election polls  A common tool in politics, health, market research, science, and many more  An efficient way of collecting information (Data collection is expensive)  Many statistical tools available, to determine validity  Confidence intervals  Hypothesis tests  OLAP (multidimensional analysis) on survey data  highly desirable but can it be done well?
  • 277.
277 Surveys: Sample vs. Whole Population  (Figure: an Age × Education grid with education levels High-school, College, Graduate and ages 18, 19, 20, …)  Data is only a sample of the population
  • 278.
278 Problems for Drilling in Multidim. Space  (Figure: the same Age × Education grid)  Data is only a sample of the population, but samples could be small when drilling to certain multidimensional subspaces
  • 279.
    279 279 OLAP on Survey(i.e., Sampling) Data Age/Education High-school College Graduate 18 19 20 …  Semantics of query is unchanged  Input data has changed
  • 280.
    280 280 Challenges for OLAPon Sampling Data  Computing confidence intervals in OLAP context  No data?  Not exactly. No data in subspaces in cube  Sparse data  Causes include sampling bias and query selection bias  Curse of dimensionality  Survey data can be high dimensional  Over 600 dimensions in real world example  Impossible to fully materialize
  • 281.
    281 281 Example 1: ConfidenceInterval Age/Education High-school College Graduate 18 19 20 … What is the average income of 19-year-old high-school students? Return not only query result but also confidence interval
  • 282.
282 Confidence Interval  Confidence interval at a given confidence level: x̄ ± t_c · σ̂_x̄  x is a sample of the data set; x̄ is the mean of the sample  t_c is the critical t-value, calculated by a look-up  σ̂_x̄ = s/√l is the estimated standard error of the mean  Example: $50,000 ± $3,000 with 95% confidence  Treat points in a cube cell as samples  Compute the confidence interval as for a traditional sample set  Return the answer in the form of a confidence interval  Indicates quality of query answer
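A minimal sketch of computing this cell measure (SciPy assumed for the t-value look-up; the income sample is hypothetical):

from statistics import mean, stdev
from math import sqrt
from scipy.stats import t      # critical t-value look-up

def confidence_interval(cell_values, confidence=0.95):
    l = len(cell_values)
    x_bar = mean(cell_values)
    se = stdev(cell_values) / sqrt(l)                 # estimated standard error of the mean
    t_c = t.ppf(1 - (1 - confidence) / 2, df=l - 1)   # two-sided critical value
    return x_bar, t_c * se

incomes = [47_000, 52_000, 55_000, 44_000, 51_000]    # hypothetical samples in one cube cell
m, half_width = confidence_interval(incomes)
print(f"${m:,.0f} +/- ${half_width:,.0f} with 95% confidence")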
  • 283.
283 Efficient Computation of Confidence Interval Measures  Efficient computation in all cells in the data cube  Both mean and confidence interval are algebraic  Why is the confidence interval measure algebraic? The interval is x̄ ± t_c · σ̂_x̄, and σ̂_x̄ = s/√l is algebraic since both s (standard deviation) and l (count) are algebraic  Thus one can calculate cells efficiently at more general cuboids without having to start at the base cuboid each time
  • 284.
    284 284 Example 2: QueryExpansion Age/Education High-school College Graduate 18 19 20 … What is the average income of 19-year-old college students?
  • 285.
    285 285 Boosting Confidence byQuery Expansion  From the example: The queried cell “19-year-old college students” contains only 2 samples  Confidence interval is large (i.e., low confidence). why?  Small sample size  High standard deviation with samples  Small sample sizes can occur at relatively low dimensional selections  Collect more data?― expensive!  Use data in other cells? Maybe, but have to be careful
  • 286.
    286 286 Intra-Cuboid Expansion: Choice1 Age/Education High-school College Graduate 18 19 20 … Expand query to include 18 and 20 year olds?
  • 287.
    287 287 Intra-Cuboid Expansion: Choice2 Age/Education High-school College Graduate 18 19 20 … Expand query to include high-school and graduate students?
  • 289.
289 Intra-Cuboid Expansion  Combine other cells’ data into one’s own to “boost” confidence  Only if they share semantic and cube similarity  Use only if necessary  A bigger sample size will decrease the confidence interval  Cell segment similarity  Some dimensions are clear: Age  Some are fuzzy: Occupation  May need domain knowledge  Cell value similarity  How to determine if two cells’ samples come from the same population?  Two-sample t-test (confidence-based)
  • 290.
    290 290 Inter-Cuboid Expansion  Ifa query dimension is  Not correlated with cube value  But is causing small sample size by drilling down too much  Remove dimension (i.e., generalize to *) and move to a more general cuboid  Can use two-sample t-test to determine similarity between two cells across cuboids  Can also use a different method to be shown later
  • 291.
291 Query Expansion Experiments  Real world sample data: 600 dimensions and 750,000 tuples  0.05% of the data used to simulate the “sample” (allows error checking)
  • 292.
    292 292 Chapter 5: DataCube Technology  Data Cube Computation: Preliminary Concepts  Data Cube Computation Methods  Processing Advanced Queries by Exploring Data Cube Technology  Sampling Cube  Ranking Cube  Multidimensional Data Analysis in Cube Space  Summary
  • 293.
    293 Ranking Cubes –Efficient Computation of Ranking queries  Data cube helps not only OLAP but also ranked search  (top-k) ranking query: only returns the best k results according to a user-specified preference, consisting of (1) a selection condition and (2) a ranking function  Ex.: Search for apartments with expected price 1000 and expected square feet 800  Select top 1 from Apartment  where City = “LA” and Num_Bedroom = 2  order by [price – 1000]^2 + [sq feet - 800]^2 asc  Efficiency question: Can we only search what we need?  Build a ranking cube on both selection dimensions and ranking dimensions
  • 294.
294 Ranking Cube: Partition Data on Both Selection and Ranking Dimensions  (Figure: one single data partition over the ranking dimensions serves as the template (the partition for all data); the data partition is then sliced by selection conditions, e.g., a sliced partition for city = “LA” and a sliced partition for BR = 2)
  • 295.
295 Materialize Ranking-Cube
  tid  City  BR  Price  Sq feet  Block ID
  t1   SEA   1   500    600      5
  t2   CLE   2   700    800      5
  t3   SEA   1   800    900      2
  t4   CLE   3   1000   1000     6
  t5   LA    1   1100   200      15
  t6   LA    2   1200   500      11
  t7   LA    2   1200   560      11
  t8   CLE   3   1350   1120     4
  Step 1: Partition data on the ranking dimensions (price, sq feet) into blocks numbered 1–16
  Step 2: Group data by the selection dimensions (City, BR, City & BR), e.g., CLE, LA, SEA and BR = 1, 2, 3, 4
  Step 3: Compute measures for each group, e.g., for the cell (LA): block-level measure {11, 15}; data-level measure {11: t6, t7; 15: t5}
  • 296.
296 Search with Ranking-Cube: Simultaneously Push Selection and Ranking  Select top 1 from Apartment where city = “LA” order by [price – 1000]^2 + [sq feet – 800]^2 asc  Without the ranking cube, the search starts from the whole data; with the ranking cube, it starts from the query point (price = 1000, sq feet = 800) using the measure for LA: {11, 15}, {11: t6, t7; 15: t5}  Given the bin boundaries, locate the block with the top score  Bin boundaries for price: [500, 600, 800, 1100, 1350]; bin boundaries for sq feet: [200, 400, 600, 800, 1120]
  • 297.
297 Processing Ranking Query: Execution Trace  Select top 1 from Apartment where city = “LA” order by [price – 1000]^2 + [sq feet – 800]^2 asc  f = [price – 1000]^2 + [sq feet – 800]^2  With the ranking cube, the search starts from the query point  Measure for LA: {11, 15}, {11: t6, t7; 15: t5}  Bin boundaries for price: [500, 600, 800, 1100, 1350]; bin boundaries for sq feet: [200, 400, 600, 800, 1120]  Execution trace: 1. Retrieve the high-level measure for LA: {11, 15} 2. Estimate lower-bound scores for blocks 11 and 15: f(block 11) = 40,000, f(block 15) = 160,000 3. Retrieve block 11 4. Retrieve the low-level measure for block 11 5. f(t6) = 130,000, f(t7) = 97,600; output t7, done!
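A minimal sketch of the block-level pruning idea: bound f from below over each candidate block using its bin boundaries, then retrieve the most promising block first. The block-to-bin ranges below are our assumption (the slide does not spell out its block numbering), so the bounds differ slightly from the slide's figures:

def interval_dist(q, lo, hi):
    # squared distance from query value q to the interval [lo, hi]
    return 0 if lo <= q <= hi else min((q - lo) ** 2, (q - hi) ** 2)

def block_lower_bound(block, q_price=1000, q_sqft=800):
    # lower bound of f over the block: sum of per-dimension squared distances
    (p_lo, p_hi), (s_lo, s_hi) = block
    return interval_dist(q_price, p_lo, p_hi) + interval_dist(q_sqft, s_lo, s_hi)

# assumed bin ranges for blocks 11 and 15 from the LA measure
block_11 = ((1100, 1350), (400, 600))
block_15 = ((1100, 1350), (200, 400))
print(block_lower_bound(block_11), block_lower_bound(block_15))   # 50000 170000 -> try block 11 first

def f(price, sqft):
    return (price - 1000) ** 2 + (sqft - 800) ** 2

print(f(1200, 500), f(1200, 560))   # tuples t6, t7 in block 11 -> 130000, 97600; t7 wins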
  • 298.
298 Ranking Cube: Methodology and Extension  Ranking cube methodology  Push selection and ranking simultaneously  It works for many sophisticated ranking functions  How to support high-dimensional data?  Materialize only those atomic cuboids that contain single selection dimensions  Uses an idea similar to high-dimensional OLAP  Achieves low space overhead and high performance in answering ranking queries with a high number of selection dimensions
  • 299.
    299 299 Chapter 5: DataCube Technology  Data Cube Computation: Preliminary Concepts  Data Cube Computation Methods  Processing Advanced Queries by Exploring Data Cube Technology  Multidimensional Data Analysis in Cube Space  Summary
  • 300.
    300 300 Multidimensional Data Analysisin Cube Space  Prediction Cubes: Data Mining in Multi- Dimensional Cube Space  Multi-Feature Cubes: Complex Aggregation at Multiple Granularities  Discovery-Driven Exploration of Data Cubes
  • 301.
301 Data Mining in Cube Space  Data cube greatly increases the analysis bandwidth  Four ways to combine OLAP-style analysis and data mining  Using cube space to define the data space for mining  Using OLAP queries to generate features and targets for mining, e.g., multi-feature cube  Using data-mining models as building blocks in a multi-step mining process, e.g., prediction cube  Using data-cube computation techniques to speed up repeated model construction  Cube-space data mining may require building a model for each candidate data space  Sharing computation across model construction for different candidates may lead to efficient mining
  • 302.
302 Prediction Cubes  Prediction cube: A cube structure that stores prediction models in multidimensional data space and supports prediction in an OLAP manner  Prediction models are used as building blocks to define the interestingness of subsets of data, i.e., to answer which subsets of data indicate better prediction
  • 303.
303 How to Determine the Prediction Power of an Attribute?  Ex. A customer table D:  Two dimensions Z: Time (Month, Year) and Location (State, Country)  Two features X: Gender and Salary  One class-label attribute Y: Valued Customer  Q: “Are there times and locations in which the value of a customer depended greatly on the customer's gender (i.e., Gender: predictiveness attribute V)?”  Idea:  Compute the difference between the model built using X to predict Y and the model built using X – V to predict Y  If the difference is large, V must play an important role in predicting Y
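A sketch of this idea on synthetic data for a single cell, using a decision tree from scikit-learn as one possible model choice (the slide does not prescribe a model); all names and data here are illustrative:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000
gender = rng.integers(0, 2, n)              # the predictiveness attribute V
salary = rng.normal(50_000, 10_000, n)      # the other feature
# In this synthetic cell the label depends strongly on gender
valued = (gender == 1) & (salary > 45_000)

X_full = np.column_stack([gender, salary])      # X
X_without_v = salary.reshape(-1, 1)             # X - V

acc_full = cross_val_score(DecisionTreeClassifier(max_depth=3), X_full, valued, cv=5).mean()
acc_wo_v = cross_val_score(DecisionTreeClassifier(max_depth=3), X_without_v, valued, cv=5).mean()
print(f"accuracy with V: {acc_full:.3f}, without V: {acc_wo_v:.3f}")
print(f"prediction power of V ~ {acc_full - acc_wo_v:.3f}")   # a large gap means V matters in this cell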
  • 304.
    304 Efficient Computation ofPrediction Cubes  Naïve method: Fully materialize the prediction cube, i.e., exhaustively build models and evaluate them for each cell and for each granularity  Better approach: Explore score function decomposition that reduces prediction cube computation to data cube computation
  • 305.
    305 305 Multidimensional Data Analysisin Cube Space  Prediction Cubes: Data Mining in Multi- Dimensional Cube Space  Multi-Feature Cubes: Complex Aggregation at Multiple Granularities  Discovery-Driven Exploration of Data Cubes
  • 306.
306 Complex Aggregation at Multiple Granularities: Multi-Feature Cubes  Multi-feature cubes (Ross, et al. 1998): Compute complex queries involving multiple dependent aggregates at multiple granularities  Ex. Grouping by all subsets of {item, region, month}, find the maximum price in 2010 for each group, and the total sales among all maximum-price tuples select item, region, month, max(price), sum(R.sales) from purchases where year = 2010 cube by item, region, month: R such that R.price = max(price)  Continuing the example: among the max-price tuples, find the min and max shelf life, and find the fraction of the total sales due to tuples that have min shelf life within the set of all max-price tuples
  • 307.
    307 307 Multidimensional Data Analysisin Cube Space  Prediction Cubes: Data Mining in Multi- Dimensional Cube Space  Multi-Feature Cubes: Complex Aggregation at Multiple Granularities  Discovery-Driven Exploration of Data Cubes
  • 308.
308 Discovery-Driven Exploration of Data Cubes  Hypothesis-driven  Exploration by user; huge search space  Discovery-driven (Sarawagi, et al.'98)  Effective navigation of large OLAP data cubes  Pre-compute measures indicating exceptions to guide the user in data analysis, at all levels of aggregation  Exception: significantly different from the value anticipated, based on a statistical model  Visual cues such as background color are used to reflect the degree of exception of each cell
  • 309.
309 Kinds of Exceptions and Their Computation  Parameters  SelfExp: surprise of a cell relative to other cells at the same level of aggregation  InExp: surprise beneath the cell  PathExp: surprise beneath the cell for each drill-down path  Computation of exception indicators (model fitting and computing SelfExp, InExp, and PathExp values) can be overlapped with cube construction  Exceptions themselves can be stored, indexed, and retrieved like precomputed aggregates
  • 311.
    311 311 Chapter 5: DataCube Technology  Data Cube Computation: Preliminary Concepts  Data Cube Computation Methods  Processing Advanced Queries by Exploring Data Cube Technology  Multidimensional Data Analysis in Cube Space  Summary
  • 312.
    312 312 Data Cube Technology:Summary  Data Cube Computation: Preliminary Concepts  Data Cube Computation Methods  MultiWay Array Aggregation  BUC  Star-Cubing  High-Dimensional OLAP with Shell-Fragments  Processing Advanced Queries by Exploring Data Cube Technology  Sampling Cubes  Ranking Cubes  Multidimensional Data Analysis in Cube Space  Discovery-Driven Exploration of Data Cubes  Multi-feature Cubes 
  • 313.
    313 313 Ref.(I) Data CubeComputation Methods  S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. VLDB’96  D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data warehouses. SIGMOD’97  K. Beyer and R. Ramakrishnan. Bottom-Up Computation of Sparse and Iceberg CUBEs.. SIGMOD’99  M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently. VLDB’98  J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29–54, 1997.  J. Han, J. Pei, G. Dong, K. Wang. Efficient Computation of Iceberg Cubes With Complex Measures. SIGMOD’01  L. V. S. Lakshmanan, J. Pei, and J. Han, Quotient Cube: How to Summarize the Semantics of a Data Cube, VLDB'02  X. Li, J. Han, and H. Gonzalez, High-Dimensional OLAP: A Minimal Cubing Approach, VLDB'04  Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. SIGMOD’97  K. Ross and D. Srivastava. Fast computation of sparse datacubes. VLDB’97  D. Xin, J. Han, X. Li, B. W. Wah, Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration, VLDB'03  D. Xin, J. Han, Z. Shao, H. Liu, C-Cubing: Efficient Computation of Closed Cubes by Aggregation-Based Checking, ICDE'06
  • 314.
    314 314 Ref. (II) AdvancedApplications with Data Cubes  D. Burdick, P. Deshpande, T. S. Jayram, R. Ramakrishnan, and S. Vaithyanathan. OLAP over uncertain and imprecise data. VLDB’05  X. Li, J. Han, Z. Yin, J.-G. Lee, Y. Sun, “Sampling Cube: A Framework for Statistical OLAP over Sampling Data”, SIGMOD’08  C. X. Lin, B. Ding, J. Han, F. Zhu, and B. Zhao. Text Cube: Computing IR measures for multidimensional text database analysis. ICDM’08  D. Papadias, P. Kalnis, J. Zhang, and Y. Tao. Efficient OLAP operations in spatial data warehouses. SSTD’01  N. Stefanovic, J. Han, and K. Koperski. Object-based selective materialization for efficient implementation of spatial data cubes. IEEE Trans. Knowledge and Data Engineering, 12:938–958, 2000.  T. Wu, D. Xin, Q. Mei, and J. Han. Promotion analysis in multidimensional space. VLDB’09  T. Wu, D. Xin, and J. Han. ARCube: Supporting ranking aggregate queries in partially materialized data cubes. SIGMOD’08  D. Xin, J. Han, H. Cheng, and X. Li. Answering top-k queries with multi-dimensional selections: The ranking cube approach. VLDB’06  J. S. Vitter, M. Wang, and B. R. Iyer. Data cube approximation and histograms via wavelets. CIKM’98  D. Zhang, C. Zhai, and J. Han. Topic cube: Topic modeling for OLAP on multi-dimensional text databases. SDM’09
  • 315.
    315 Ref. (III) KnowledgeDiscovery with Data Cubes  R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. ICDE’97  B.-C. Chen, L. Chen, Y. Lin, and R. Ramakrishnan. Prediction cubes. VLDB’05  B.-C. Chen, R. Ramakrishnan, J.W. Shavlik, and P. Tamma. Bellwether analysis: Predicting global aggregates from local regions. VLDB’06  Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang, Multi-Dimensional Regression Analysis of Time-Series Data Streams, VLDB'02  G. Dong, J. Han, J. Lam, J. Pei, K. Wang. Mining Multi-dimensional Constrained Gradients in Data Cubes. VLDB’ 01  R. Fagin, R. V. Guha, R. Kumar, J. Novak, D. Sivakumar, and A. Tomkins. Multi-structural databases. PODS’05  J. Han. Towards on-line analytical mining in large databases. SIGMOD Record, 27:97–107, 1998  T. Imielinski, L. Khachiyan, and A. Abdulghani. Cubegrades: Generalizing association rules. Data Mining & Knowledge Discovery, 6:219–258, 2002.  R. Ramakrishnan and B.-C. Chen. Exploratory mining in cube space. Data Mining and Knowledge Discovery, 15:29–54, 2007.  K. A. Ross, D. Srivastava, and D. Chatziantoniou. Complex aggregation at multiple granularities. EDBT'98  S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data cubes. EDBT'98  G. Sathe and S. Sarawagi. Intelligent Rollups in Multidimensional OLAP Data. VLDB'01
  • 316.
  • 317.
    317 317 Chapter 5: DataCube Technology  Efficient Methods for Data Cube Computation  Preliminary Concepts and General Strategies for Cube Computation  Multiway Array Aggregation for Full Cube Computation  BUC: Computing Iceberg Cubes from the Apex Cuboid Downward  H-Cubing: Exploring an H-Tree Structure  Star-cubing: Computing Iceberg Cubes Using a Dynamic Star-tree Structure  Precomputing Shell Fragments for Fast High-Dimensional OLAP  Data Cubes for Advanced Applications  Sampling Cubes: OLAP on Sampling Data  Ranking Cubes: Efficient Computation of Ranking Queries  Knowledge Discovery with Data Cubes  Discovery-Driven Exploration of Data Cubes  Complex Aggregation at Multiple Granularity: Multi-feature Cubes  Prediction Cubes: Data Mining in Multi-Dimensional Cube Space  Summary
  • 318.
318 H-Cubing: Using an H-Tree Structure  Bottom-up computation  Exploring an H-tree structure  If the current computation of an H-tree cannot pass min_sup, do not proceed further (pruning)  No simultaneous aggregation  (Figure: lattice of cuboids from the apex (all) down to ABCD)
  • 319.
319 H-tree: A Prefix Hyper-tree
Month | City | Cust_grp | Prod | Cost | Price
Jan | Tor | Edu | Printer | 500 | 485
Jan | Tor | Hhd | TV | 800 | 1200
Jan | Tor | Edu | Camera | 1160 | 1280
Feb | Mon | Bus | Laptop | 1500 | 2500
Mar | Van | Edu | HD | 540 | 520
… | … | … | … | … | …
(Figure: H-tree rooted at root, with branches edu/hhd/bus → Jan/Feb/Mar → Tor/Van/Mon, quant-info nodes such as Sum: 1765, Cnt: 2, bins, and a header table of attribute values with quant-info and side-links, e.g., Edu Sum: 2285)
  • 320.
    320 320 root Edu. Hhd. Bus. Jan.Mar. Jan. Feb. Tor. Van. Tor. Mon. Q.I. Q.I. Q.I. Quant- Info Sum: 1765 Cnt: 2 bins Attr. Val. Quant-Info Side-link Edu Sum:2285 … Hhd … Bus … … … Jan … Feb … … … Tor … Van … Mon … … … Attr. Val. Q.I. Side-link Edu … Hhd … Bus … … … Jan … Feb … … … Header Table HTor From (*, *, Tor) to (*, Jan, Tor) Computing Cells Involving “City”
  • 321.
    321 321 Computing Cells InvolvingMonth But No City root Edu. Hhd. Bus. Jan. Mar. Jan. Feb. Tor. Van. Tor. Mont. Q.I. Q.I. Q.I. Attr. Val. Quant-Info Side-link Edu. Sum:2285 … Hhd. … Bus. … … … Jan. … Feb. … Mar. … … … Tor. … Van. … Mont. … … … 1. Roll up quant-info 2. Compute cells involving month but no city Q.I. Top-k OK mark: if Q.I. in a child passes top-k avg threshold, so does its parents. No binning is needed!
  • 322.
    322 322 Computing Cells InvolvingOnly Cust_grp root edu hhd bus Jan Mar Jan Feb Tor Van Tor Mon Q.I. Q.I. Q.I. Attr. Val. Quant-Info Side-link Edu Sum:2285 … Hhd … Bus … … … Jan … Feb … Mar … … … Tor … Van … Mon … … … Check header table directly Q.I.
  • 323.
    323 323 Data Mining: Concepts andTechniques (3rd ed.) — Chapter 6 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2011 Han, Kamber & Pei. All rights reserved.
  • 324.
324 Chapter 6: Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods  Basic Concepts  Frequent Itemset Mining Methods  Which Patterns Are Interesting?—Pattern Evaluation Methods  Summary
  • 325.
325 What Is Frequent Pattern Analysis?  Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set  First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining  Motivation: Finding inherent regularities in data  What products were often purchased together?— Beer and diapers?!  What are the subsequent purchases after buying a PC?  What kinds of DNA are sensitive to this new drug?  Can we automatically classify web documents?  Applications  Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis
  • 326.
    326 Why Is Freq.Pattern Mining Important?  Freq. pattern: An intrinsic and important property of datasets  Foundation for many essential data mining tasks  Association, correlation, and causality analysis  Sequential, structural (e.g., sub-graph) patterns  Pattern analysis in spatiotemporal, multimedia, time- series, and stream data  Classification: discriminative, frequent pattern analysis  Cluster analysis: frequent pattern-based clustering  Data warehousing: iceberg cube and cube-gradient  Semantic data compression: fascicles  Broad applications
  • 327.
327 Basic Concepts: Frequent Patterns  Itemset: a set of one or more items  k-itemset X = {x1, …, xk}  (absolute) support, or support count, of X: frequency or number of occurrences of itemset X  (relative) support, s, is the fraction of transactions that contain X (i.e., the probability that a transaction contains X)  An itemset X is frequent if X's support is no less than a minsup threshold
Tid | Items bought
10 | Beer, Nuts, Diaper
20 | Beer, Coffee, Diaper
30 | Beer, Diaper, Eggs
40 | Nuts, Eggs, Milk
50 | Nuts, Coffee, Diaper, Eggs, Milk
(Figure: Venn diagram of customers buying beer, diapers, or both)
  • 328.
328 Basic Concepts: Association Rules  Find all the rules X ⇒ Y with minimum support and confidence  support, s: probability that a transaction contains X ∪ Y  confidence, c: conditional probability that a transaction having X also contains Y  Let minsup = 50%, minconf = 50%  Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
Tid | Items bought
10 | Beer, Nuts, Diaper
20 | Beer, Coffee, Diaper
30 | Beer, Diaper, Eggs
40 | Nuts, Eggs, Milk
50 | Nuts, Coffee, Diaper, Eggs, Milk
 Association rules (many more!)  Beer ⇒ Diaper (60%, 100%)  Diaper ⇒ Beer (60%, 75%)
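The numbers on this slide can be checked with a few lines of Python over the five transactions (a hedged illustration, not part of the original slide):

transactions = {
    10: {"Beer", "Nuts", "Diaper"},
    20: {"Beer", "Coffee", "Diaper"},
    30: {"Beer", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}

def support(itemset):
    # relative support: fraction of transactions containing the itemset
    return sum(itemset <= t for t in transactions.values()) / len(transactions)

def confidence(lhs, rhs):
    # conditional probability that a transaction with lhs also contains rhs
    return support(lhs | rhs) / support(lhs)

print(support({"Beer", "Diaper"}))            # 0.6  -> frequent at minsup 50%
print(confidence({"Beer"}, {"Diaper"}))       # 1.0  -> Beer => Diaper (60%, 100%)
print(confidence({"Diaper"}, {"Beer"}))       # 0.75 -> Diaper => Beer (60%, 75%)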
  • 329.
329 Closed Patterns and Max-Patterns  A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 – 1 ≈ 1.27×10^30 sub-patterns!  Solution: Mine closed patterns and max-patterns instead  An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT'99)  An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)  Closed patterns are a lossless compression of frequent patterns
  • 330.
330 Closed Patterns and Max-Patterns  Exercise. DB = {<a1, …, a100>, <a1, …, a50>}  Min_sup = 1  What is the set of closed itemsets?  <a1, …, a100>: 1  <a1, …, a50>: 2  What is the set of max-patterns?  <a1, …, a100>: 1  What is the set of all frequent patterns?  Far too many to list!
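The 100-item exercise cannot be enumerated directly, but a scaled-down version (a1…a4 and a1…a2 instead of a1…a100 and a1…a50) makes the definitions concrete; this brute-force sketch is illustrative only:

from itertools import combinations

db = [frozenset(f"a{i}" for i in range(1, 5)),    # <a1, ..., a4>
      frozenset(f"a{i}" for i in range(1, 3))]    # <a1, a2>
min_sup = 1
items = sorted(set().union(*db))

def sup(x):
    return sum(x <= t for t in db)

frequent = {frozenset(c): sup(frozenset(c))
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if sup(frozenset(c)) >= min_sup}

# closed: no proper superset with the same support; maximal: no frequent proper superset
closed = [x for x in frequent
          if not any(x < y and frequent[y] == frequent[x] for y in frequent)]
maximal = [x for x in frequent if not any(x < y for y in frequent)]

print(len(frequent))                 # 15 frequent itemsets (2^4 - 1)
print(sorted(map(sorted, closed)))   # [['a1','a2'], ['a1','a2','a3','a4']]
print(sorted(map(sorted, maximal)))  # [['a1','a2','a3','a4']]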
  • 331.
331 Computational Complexity of Frequent Itemset Mining  How many itemsets may potentially be generated in the worst case?  The number of frequent itemsets to be generated is sensitive to the minsup threshold  When minsup is low, there exist potentially an exponential number of frequent itemsets  The worst case: M^N, where M is the number of distinct items and N is the max transaction length  Worst-case complexity vs. expected probability  Ex. Suppose Walmart has 10^4 kinds of products  The chance of picking up one particular product: 10^-4  The chance of picking up a particular set of 10 products: ~10^-40  What is the chance that this particular set of 10 products is frequent 10^3 times in 10^9 transactions?
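A back-of-the-envelope check of these numbers (under the slide's simplistic independence assumption):

p_item = 10 ** -4            # chance a given product is picked
p_set = p_item ** 10         # chance a particular 10-product set is picked: 10^-40
n_transactions = 10 ** 9
expected_count = n_transactions * p_set
print(expected_count)        # 1e-31: expected number of occurrences
# By Markov's inequality, P(count >= 1000) <= expected_count / 1000 = 1e-34,
# so this particular 10-item set being frequent 10^3 times is essentially impossible.
print(expected_count / 1000)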
  • 332.
332 Chapter 6: Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods  Basic Concepts  Frequent Itemset Mining Methods  Which Patterns Are Interesting?—Pattern Evaluation Methods  Summary
  • 333.
333 Scalable Frequent Itemset Mining Methods  Apriori: A Candidate Generation-and-Test Approach  Improving the Efficiency of Apriori  FPGrowth: A Frequent Pattern-Growth Approach  ECLAT: Frequent Pattern Mining with Vertical Data Format
  • 334.
    334 The Downward ClosureProperty and Scalable Mining Methods  The downward closure property of frequent patterns  Any subset of a frequent itemset must be frequent  If {beer, diaper, nuts} is frequent, so is {beer, diaper}  i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}  Scalable mining methods: Three major approaches  Apriori (Agrawal & Srikant@VLDB’94)  Freq. pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00)  Vertical data format approach (Charm—Zaki & Hsiao @SDM’02)
  • 335.
    335 Apriori: A CandidateGeneration & Test Approach  Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested! (Agrawal & Srikant @VLDB’94, Mannila, et al. @ KDD’ 94)  Method:  Initially, scan DB once to get frequent 1-itemset  Generate length (k+1) candidate itemsets from length k frequent itemsets  Test the candidates against DB  Terminate when no frequent or candidate set can be generated
  • 336.
336 The Apriori Algorithm—An Example (min_sup = 2)
Database TDB: Tid | Items — 10 | A, C, D; 20 | B, C, E; 30 | A, B, C, E; 40 | B, E
1st scan, C1 → L1: {A}:2, {B}:3, {C}:3, {E}:3 ({D}:1 pruned)
C2: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan, counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
C3: {B,C,E}; 3rd scan, L3: {B,C,E}:2
  • 337.
337 The Apriori Algorithm (Pseudo-Code)
Ck: candidate itemsets of size k; Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
  Ck+1 = candidates generated from Lk;
  for each transaction t in the database do
    increment the count of all candidates in Ck+1 that are contained in t;
  Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
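A compact Python rendering of this pseudo-code, checked against the four-transaction example on the previous slide (min_sup = 2); the candidate join here uses set unions rather than the prefix-based self-join, which is an implementation shortcut:

from itertools import combinations

def apriori(db, min_sup):
    db = [frozenset(t) for t in db]
    items = {i for t in db for i in t}
    # L1: frequent 1-itemsets
    L = [{frozenset([i]) for i in items if sum(i in t for t in db) >= min_sup}]
    result = set(L[0])
    k = 1
    while L[-1]:
        # candidate generation: join Lk with itself, then Apriori-prune
        cands = {a | b for a in L[-1] for b in L[-1] if len(a | b) == k + 1}
        cands = {c for c in cands
                 if all(frozenset(s) in L[-1] for s in combinations(c, k))}
        # support counting against the database
        next_L = {c for c in cands if sum(c <= t for t in db) >= min_sup}
        L.append(next_L)
        result |= next_L
        k += 1
    return result

db = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}]
for itemset in sorted(apriori(db, 2), key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset))
# {A},{B},{C},{E},{A,C},{B,C},{B,E},{C,E},{B,C,E} -- matching the example slide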
  • 338.
338 Implementation of Apriori  How to generate candidates?  Step 1: self-join Lk  Step 2: pruning  Example of candidate generation  L3 = {abc, abd, acd, ace, bcd}  Self-joining: L3*L3  abcd from abc and abd  acde from acd and ace  Pruning:  acde is removed because ade is not in L3  C4 = {abcd}
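The L3 → C4 example can be reproduced with the classic prefix-based self-join plus pruning; a small self-contained sketch:

from itertools import combinations

L3 = {frozenset(s) for s in ["abc", "abd", "acd", "ace", "bcd"]}
k = 3

def self_join(Lk, k):
    # classic Apriori join: merge two k-itemsets that share their first k-1 items
    out = set()
    as_tuples = [tuple(sorted(s)) for s in Lk]
    for p in as_tuples:
        for q in as_tuples:
            if p[:k - 1] == q[:k - 1] and p[k - 1] < q[k - 1]:
                out.add(frozenset(p) | {q[k - 1]})
    return out

joined = self_join(L3, k)
C4 = {c for c in joined if all(frozenset(s) in L3 for s in combinations(c, k))}

print(sorted("".join(sorted(c)) for c in joined))  # ['abcd', 'acde']
print(sorted("".join(sorted(c)) for c in C4))      # ['abcd']  (acde pruned: ade not in L3)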
  • 339.
339 How to Count Supports of Candidates?  Why is counting supports of candidates a problem?  The total number of candidates can be very large  One transaction may contain many candidates  Method:  Candidate itemsets are stored in a hash tree  A leaf node of the hash tree contains a list of itemsets and counts  An interior node contains a hash table  Subset function: finds all the candidates contained in a transaction
  • 340.
340 Counting Supports of Candidates Using a Hash Tree  Subset function hashes on items: {1,4,7}, {2,5,8}, {3,6,9}  (Figure: hash tree of 3-itemset candidates; transaction 1 2 3 5 6 is decomposed as 1 + 2 3 5 6, 1 2 + 3 5 6, and 1 3 + 5 6 to locate the matching leaves)
  • 341.
341 Candidate Generation: An SQL Implementation  Suppose the items in Lk-1 are listed in an order  Step 1: self-join Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
 Step 2: pruning
forall itemsets c in Ck do
  forall (k-1)-subsets s of c do
    if (s is not in Lk-1) then delete c from Ck
 Use object-relational extensions like UDFs, BLOBs, and table functions for efficient implementation [See: S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD'98]
  • 342.
    342 Scalable Frequent ItemsetMining Methods  Apriori: A Candidate Generation-and-Test Approach  Improving the Efficiency of Apriori  FPGrowth: A Frequent Pattern-Growth Approach  ECLAT: Frequent Pattern Mining with Vertical Data Format 
  • 343.
    343 Further Improvement ofthe Apriori Method  Major computational challenges  Multiple scans of transaction database  Huge number of candidates  Tedious workload of support counting for candidates  Improving Apriori: general ideas  Reduce passes of transaction database scans  Shrink number of candidates  Facilitate support counting of candidates
  • 344.
Partition: Scan the Database Only Twice  Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB  Scan 1: partition the database and find local frequent patterns  Scan 2: consolidate global frequent patterns  A. Savasere, E. Omiecinski and S. Navathe, VLDB'95  (Figure: DB = DB1 + DB2 + … + DBk; if supj(i) < σ·|DBj| for every partition DBj, then sup(i) < σ·|DB|)
  • 345.
345 DHP: Reduce the Number of Candidates  A k-itemset whose corresponding hash-bucket count is below the threshold cannot be frequent  Candidates: a, b, c, d, e  Hash entries  {ab, ad, ae}  {bd, be, de}  …  Frequent 1-itemsets: a, b, d, e  ab is not a candidate 2-itemset if the sum of the counts of {ab, ad, ae} is below the support threshold  J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95  (Figure: hash table mapping bucket itemsets to counts, e.g., 35, 88, 102, …)
  • 346.
    346 Sampling for FrequentPatterns  Select a sample of original database, mine frequent patterns within sample using Apriori  Scan database once to verify frequent itemsets found in sample, only borders of closure of frequent patterns are checked  Example: check abcd instead of ab, ac, …, etc.  Scan database again to find missed frequent patterns  H. Toivonen. Sampling large databases for association rules. In VLDB’96
  • 347.
347 DIC: Reduce the Number of Scans  Once both A and D are determined frequent, the counting of AD begins  Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins  S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. SIGMOD'97  (Figure: itemset lattice from {} up to ABCD, plus a transaction timeline comparing when Apriori vs. DIC start counting 1-, 2-, and 3-itemsets)
  • 348.
    348 Scalable Frequent ItemsetMining Methods  Apriori: A Candidate Generation-and-Test Approach  Improving the Efficiency of Apriori  FPGrowth: A Frequent Pattern-Growth Approach  ECLAT: Frequent Pattern Mining with Vertical Data Format 
  • 349.
    349 Pattern-Growth Approach: MiningFrequent Patterns Without Candidate Generation  Bottlenecks of the Apriori approach  Breadth-first (i.e., level-wise) search  Candidate generation and test  Often generates a huge number of candidates  The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD’ 00)  Depth-first search  Avoid explicit candidate generation  Major philosophy: Grow long patterns from short ones using local frequent items only  “abc” is a frequent pattern  Get all transactions having “abc”, i.e., project DB on abc: DB|abc  “d” is a local frequent item in DB|abc  abcd is a frequent pattern
  • 350.
350 Construct an FP-tree from a Transaction Database (min_support = 3)
TID | Items bought | (ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o} | {f, c, a, b, m}
300 | {b, f, h, j, o, w} | {f, b}
400 | {b, c, k, s, p} | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}
1. Scan DB once, find frequent 1-itemsets (single-item patterns)
2. Sort frequent items in frequency-descending order: F-list = f-c-a-b-m-p
3. Scan DB again, construct the FP-tree
(Figure: FP-tree rooted at {} with paths f:4–c:3–a:3–m:2–p:2, a:3–b:1–m:1, f:4–b:1, and c:1–b:1–p:1, plus a header table f 4, c 4, a 3, b 3, m 3, p 3 with node links)
  • 351.
351 Partition Patterns and Databases  Frequent patterns can be partitioned into subsets according to the F-list  F-list = f-c-a-b-m-p  Patterns containing p  Patterns having m but no p  …  Patterns having c but none of a, b, m, p  Pattern f  Completeness and non-redundancy
  • 352.
352 Find Patterns Having p From p's Conditional Database  Start at the frequent-item header table of the FP-tree  Traverse the FP-tree by following the links of each frequent item p  Accumulate all the transformed prefix paths of item p to form p's conditional pattern base
Conditional pattern bases: item | cond. pattern base — c | f:3; a | fc:3; b | fca:1, f:1, c:1; m | fca:2, fcab:1; p | fcam:2, cb:1
  • 353.
353 From Conditional Pattern Bases to Conditional FP-trees  For each pattern base  Accumulate the count for each item in the base  Construct the FP-tree for the frequent items of the pattern base  m-conditional pattern base: fca:2, fcab:1  m-conditional FP-tree: {} – f:3 – c:3 – a:3  All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam
  • 354.
    354 Recursion: Mining EachConditional FP-tree {} f:3 c:3 a:3 m-conditional FP-tree Cond. pattern base of “am”: (fc:3) {} f:3 c:3 am-conditional FP-tree Cond. pattern base of “cm”: (f:3) {} f:3 cm-conditional FP-tree Cond. pattern base of “cam”: (f:3) {} f:3 cam-conditional FP-tree
  • 355.
355 A Special Case: Single Prefix Path in an FP-tree  Suppose a (conditional) FP-tree T has a shared single prefix path P  Mining can be decomposed into two parts  Reduction of the single prefix path into one node  Concatenation of the mining results of the two parts  (Figure: a tree with prefix path a1:n1–a2:n2–a3:n3 and branching part b1:m1, C1:k1, C2:k2, C3:k3, decomposed into the prefix r1 plus the branching subtree)
  • 356.
    356 Benefits of theFP-tree Structure  Completeness  Preserve complete information for frequent pattern mining  Never break a long pattern of any transaction  Compactness  Reduce irrelevant info—infrequent items are gone  Items in frequency descending order: the more frequently occurring, the more likely to be shared  Never be larger than the original database (not count node-links and the count field)
  • 357.
357 The Frequent Pattern Growth Mining Method  Idea: frequent pattern growth  Recursively grow frequent patterns by pattern and database partition  Method  For each frequent item, construct its conditional pattern base and then its conditional FP-tree  Repeat the process on each newly created conditional FP-tree  Until the resulting FP-tree is empty, or it contains only one path—a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern
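A minimal pattern-growth sketch for the running example (min_support = 3). For brevity it recurses on projected (conditional) databases kept as plain lists rather than on a compressed FP-tree, so it illustrates the divide-and-conquer idea, not the FP-tree data structure itself:

from collections import Counter

def pattern_growth(db, min_sup, prefix=frozenset()):
    counts = Counter(item for t in db for item in set(t))
    frequent = {}
    for item, cnt in sorted(counts.items()):
        if cnt < min_sup:
            continue
        pattern = prefix | {item}
        frequent[pattern] = cnt
        # conditional (projected) database for `pattern`: transactions containing
        # `item`, restricted to locally frequent items later in a fixed item order
        projected = [[i for i in t if i > item and counts[i] >= min_sup]
                     for t in db if item in t]
        frequent.update(pattern_growth(projected, min_sup, pattern))
    return frequent

db = [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]
for pat, cnt in sorted(pattern_growth(db, 3).items(),
                       key=lambda x: (len(x[0]), sorted(x[0]))):
    print("".join(sorted(pat)), cnt)
# e.g. fcam is reported with count 3, matching the conditional-pattern-base slides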
  • 358.
    358 Scaling FP-growth byDatabase Projection  What about if FP-tree cannot fit in memory?  DB projection  First partition a database into a set of projected DBs  Then construct and mine FP-tree for each projected DB  Parallel projection vs. partition projection techniques  Parallel projection  Project the DB in parallel for each frequent item  Parallel projection is space costly  All the partitions can be processed in parallel  Partition projection  Partition the DB based on the ordered frequent items  Passing the unprocessed parts to the subsequent partitions
  • 359.
359 Partition-Based Projection  Parallel projection needs a lot of disk space  Partition projection saves it  (Figure: the transaction DB {fcamp, fcabm, fb, cbp, fcamp} split into projected DBs, e.g., p-proj DB = {fcam, cb, fcam}, m-proj DB = {fcab, fca, fca}, b-proj DB = {f, cb, …}, and further into am-, cm-proj DBs, etc.)
  • 360.
Performance of FP-Growth in Large Datasets  (Figures: runtime vs. support threshold — D1: FP-growth vs. Apriori on data set T25I20D10K; D2: FP-growth vs. TreeProjection on data set T25I20D100K)
  • 361.
    361 Advantages of thePattern Growth Approach  Divide-and-conquer:  Decompose both the mining task and DB according to the frequent patterns obtained so far  Lead to focused search of smaller databases  Other factors  No candidate generation, no candidate test  Compressed database: FP-tree structure  No repeated scan of entire database  Basic ops: counting local freq items and building sub FP-tree, no pattern search and matching  A good open-source implementation and refinement of FPGrowth  FPGrowth+ (Grahne and J. Zhu, FIMI'03)
  • 362.
    362 Further Improvements ofMining Methods  AFOPT (Liu, et al. @ KDD’03)  A “push-right” method for mining condensed frequent pattern (CFP) tree  Carpenter (Pan, et al. @ KDD’03)  Mine data sets with small rows but numerous columns  Construct a row-enumeration tree for efficient mining  FPgrowth+ (Grahne and Zhu, FIMI’03)  Efficiently Using Prefix-Trees in Mining Frequent Itemsets, Proc. ICDM'03 Int. Workshop on Frequent Itemset Mining Implementations (FIMI'03), Melbourne, FL, Nov. 2003  TD-Close (Liu, et al, SDM’06)
  • 363.
    363 Extension of PatternGrowth Mining Methodology  Mining closed frequent itemsets and max-patterns  CLOSET (DMKD’00), FPclose, and FPMax (Grahne & Zhu, Fimi’03)  Mining sequential patterns  PrefixSpan (ICDE’01), CloSpan (SDM’03), BIDE (ICDE’04)  Mining graph patterns  gSpan (ICDM’02), CloseGraph (KDD’03)  Constraint-based mining of frequent patterns  Convertible constraints (ICDE’01), gPrune (PAKDD’03)  Computing iceberg data cubes with complex measures  H-tree, H-cubing, and Star-cubing (SIGMOD’01, VLDB’03)  Pattern-growth-based Clustering  MaPle (Pei, et al., ICDM’03)  Pattern-Growth-Based Classification  Mining frequent and discriminative patterns (Cheng, et al, ICDE’07)
  • 364.
    364 Scalable Frequent ItemsetMining Methods  Apriori: A Candidate Generation-and-Test Approach  Improving the Efficiency of Apriori  FPGrowth: A Frequent Pattern-Growth Approach  ECLAT: Frequent Pattern Mining with Vertical Data Format 
  • 365.
365 ECLAT: Mining by Exploring the Vertical Data Format  Vertical format: t(AB) = {T11, T25, …}  tid-list: list of transaction ids containing an itemset  Deriving frequent patterns based on vertical intersections  t(X) = t(Y): X and Y always happen together  t(X) ⊆ t(Y): a transaction having X always has Y  Using diffsets to accelerate mining  Only keep track of differences of tids  t(X) = {T1, T2, T3}, t(XY) = {T1, T3}  Diffset(XY, X) = {T2}  Eclat (Zaki et al. @KDD'97)  Mining closed patterns using the vertical format: CHARM (Zaki & Hsiao @SDM'02)
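A small sketch of the vertical representation on the earlier four-transaction example (tid-lists, intersection-based support, and the slide's diffset example); names are illustrative:

db = {10: {'a', 'c', 'd'}, 20: {'b', 'c', 'e'}, 30: {'a', 'b', 'c', 'e'}, 40: {'b', 'e'}}

# build tid-lists (vertical format)
tidlist = {}
for tid, items in db.items():
    for i in items:
        tidlist.setdefault(i, set()).add(tid)

# support of an itemset = size of the intersection of its members' tid-lists
def t(itemset):
    tids = set(db)
    for i in itemset:
        tids &= tidlist[i]
    return tids

min_sup = 2
print([i for i, tids in tidlist.items() if len(tids) >= min_sup])  # frequent items
print(sorted(t({'b', 'e'})))     # [20, 30, 40] -> support 3
print(sorted(t({'c', 'e'})))     # [20, 30]     -> support 2

# diffset: store only what is lost when extending X to XY
t_X, t_XY = {1, 2, 3}, {1, 3}    # the slide's t(X) = {T1,T2,T3}, t(XY) = {T1,T3}
print(t_X - t_XY)                # {2} == Diffset(XY, X)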
  • 366.
    366 Scalable Frequent ItemsetMining Methods  Apriori: A Candidate Generation-and-Test Approach  Improving the Efficiency of Apriori  FPGrowth: A Frequent Pattern-Growth Approach  ECLAT: Frequent Pattern Mining with Vertical Data Format 
  • 367.
    Mining Frequent ClosedPatterns: CLOSET  Flist: list of all frequent items in support ascending order  Flist: d-a-f-e-c  Divide search space  Patterns having d  Patterns having d but no a, etc.  Find frequent closed pattern recursively  Every transaction having d also has cfa  cfad is a frequent closed pattern  J. Pei, J. Han & R. Mao. “CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets", DMKD'00. TID Items 10 a, c, d, e, f 20 a, b, e 30 c, e, f 40 a, c, d, f 50 c, e, f Min_sup=2
  • 368.
CLOSET+: Mining Closed Itemsets by Pattern-Growth  Itemset merging: if Y appears in every occurrence of X, then Y is merged with X  Sub-itemset pruning: if Y ⊃ X and sup(X) = sup(Y), X and all of X's descendants in the set enumeration tree can be pruned  Hybrid tree projection  Bottom-up physical tree-projection  Top-down pseudo tree-projection  Item skipping: if a local frequent item has the same support in several header tables at different levels, one can prune it from the header tables at higher levels  Efficient subset checking
  • 369.
MaxMiner: Mining Max-Patterns  1st scan: find frequent items  A, B, C, D, E  2nd scan: find support for the potential max-patterns  AB, AC, AD, AE, ABCDE  BC, BD, BE, BCDE  CD, CE, CDE, DE  Since BCDE is a max-pattern, there is no need to check BCD, BDE, CDE in a later scan  R. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98  Tid | Items: 10 | A, B, C, D, E; 20 | B, C, D, E; 30 | A, C, D, F
  • 370.
CHARM: Mining by Exploring the Vertical Data Format  Vertical format: t(AB) = {T11, T25, …}  tid-list: list of transaction ids containing an itemset  Deriving closed patterns based on vertical intersections  t(X) = t(Y): X and Y always happen together  t(X) ⊆ t(Y): a transaction having X always has Y  Using diffsets to accelerate mining  Only keep track of differences of tids  t(X) = {T1, T2, T3}, t(XY) = {T1, T3}  Diffset(XY, X) = {T2}  Eclat/MaxEclat (Zaki et al. @KDD'97), VIPER (P. Shenoy et al.)
  • 373.
    373 Visualization of AssociationRules (SGI/MineSet 3.0)
  • 374.
374 Chapter 6: Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods  Basic Concepts  Frequent Itemset Mining Methods  Which Patterns Are Interesting?—Pattern Evaluation Methods  Summary
  • 375.
375 Interestingness Measure: Correlations (Lift)  play basketball ⇒ eat cereal [40%, 66.7%] is misleading  The overall % of students eating cereal is 75% > 66.7%  play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence  Measure of dependent/correlated events: lift  lift = P(A ∪ B) / (P(A) P(B))  lift(B, C) = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89  lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
| Basketball | Not basketball | Sum (row)
Cereal | 2000 | 1750 | 3750
Not cereal | 1000 | 250 | 1250
Sum (col.) | 3000 | 2000 | 5000
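The two lift values can be verified directly from the contingency table:

n = 5000
basketball, cereal, both = 3000, 3750, 2000
not_cereal, basketball_not_cereal = 1250, 1000

def lift(p_ab, p_a, p_b):
    # lift > 1: positively correlated, < 1: negatively correlated, = 1: independent
    return p_ab / (p_a * p_b)

print(round(lift(both / n, basketball / n, cereal / n), 2))                       # 0.89
print(round(lift(basketball_not_cereal / n, basketball / n, not_cereal / n), 2))  # 1.33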
  • 376.
376 Are lift and χ2 Good Measures of Correlation?  “Buy walnuts ⇒ buy milk [1%, 80%]” is misleading if 85% of customers buy milk  Support and confidence are not good indicators of correlation  Over 20 interestingness measures have been proposed (see Tan, Kumar, Srivastava @KDD'02)  Which are the good ones?
  • 378.
378 Comparison of Interestingness Measures  Null-(transaction) invariance is crucial for correlation analysis  Lift and χ2 are not null-invariant  5 null-invariant measures, including the Kulczynski measure (1927)  Null-transactions w.r.t. m and c  Subtle: the measures disagree
| Milk | No Milk | Sum (row)
Coffee | m, c | ~m, c | c
No Coffee | m, ~c | ~m, ~c | ~c
Sum (col.) | m | ~m |
  • 379.
379 Analysis of DBLP Coauthor Relationships  Advisor–advisee relation: Kulc high, coherence low, cosine middle  Recent DB conferences, removing balanced associations, low support, etc.  Tianyi Wu, Yuguo Chen and Jiawei Han, “Association Mining in Large Databases: A Re-Examination of Its Measures”, Proc. 2007 Int. Conf. Principles and Practice of Knowledge Discovery in Databases (PKDD'07), Sept. 2007
  • 380.
Which Null-Invariant Measure Is Better?  IR (Imbalance Ratio): measures the imbalance of two itemsets A and B in rule implications  Kulczynski and Imbalance Ratio (IR) together present a clear picture for all three datasets D4 through D6  D4 is balanced & neutral  D5 is imbalanced & neutral  D6 is very imbalanced & neutral
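A sketch of the two measures on raw counts, assuming the usual definitions Kulc(A, B) = (P(A|B) + P(B|A))/2 and IR(A, B) = |s(A) − s(B)| / (s(A) + s(B) − s(A ∪ B)); the example counts are illustrative, not the D4–D6 datasets themselves:

def kulc(n_a, n_b, n_ab):
    # (P(A|B) + P(B|A)) / 2 computed from raw occurrence counts
    return (n_ab / n_a + n_ab / n_b) / 2

def imbalance_ratio(n_a, n_b, n_ab):
    # 0 = perfectly balanced, close to 1 = very skewed
    return abs(n_a - n_b) / (n_a + n_b - n_ab)

# balanced and neutral: A and B each occur 1000 times, 500 times together
print(kulc(1000, 1000, 500), imbalance_ratio(1000, 1000, 500))   # 0.5, 0.0
# very imbalanced yet still neutral by Kulczynski: A rare, B very common
print(round(kulc(1010, 101000, 1000), 2),
      round(imbalance_ratio(1010, 101000, 1000), 2))             # ~0.5, ~0.99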
  • 381.
381 Chapter 6: Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods  Basic Concepts  Frequent Itemset Mining Methods  Which Patterns Are Interesting?—Pattern Evaluation Methods  Summary
  • 382.
382 Summary  Basic concepts: association rules, the support–confidence framework, closed and max-patterns  Scalable frequent pattern mining methods  Apriori (candidate generation & test)  Projection-based (FPgrowth, CLOSET+, ...)  Vertical format approach (ECLAT, CHARM, ...)  Which patterns are interesting?  Pattern evaluation methods
  • 383.
    383 Ref: Basic Conceptsof Frequent Pattern Mining  (Association Rules) R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD'93  (Max-pattern) R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98  (Closed-pattern) N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. ICDT'99  (Sequential pattern) R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95
  • 384.
    384 Ref: Apriori andIts Improvements  R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94  H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. KDD'94  A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. VLDB'95  J. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95  H. Toivonen. Sampling large databases for association rules. VLDB'96  S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket analysis. SIGMOD'97  S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD'98
  • 385.
    385 Ref: Depth-First, Projection-BasedFP Mining  R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent itemsets. J. Parallel and Distributed Computing, 2002.  G. Grahne and J. Zhu, Efficiently Using Prefix-Trees in Mining Frequent Itemsets, Proc. FIMI'03  B. Goethals and M. Zaki. An introduction to workshop on frequent itemset mining implementations. Proc. ICDM’03 Int. Workshop on Frequent Itemset Mining Implementations (FIMI’03), Melbourne, FL, Nov. 2003  J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD’ 00  J. Liu, Y. Pan, K. Wang, and J. Han. Mining Frequent Item Sets by Opportunistic Projection. KDD'02  J. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining Top-K Frequent Closed Patterns without Minimum Support. ICDM'02  J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets. KDD'03
  • 386.
    386 Ref: Vertical Formatand Row Enumeration Methods  M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithm for discovery of association rules. DAMI:97.  M. J. Zaki and C. J. Hsiao. CHARM: An Efficient Algorithm for Closed Itemset Mining, SDM'02.  C. Bucila, J. Gehrke, D. Kifer, and W. White. DualMiner: A Dual-Pruning Algorithm for Itemsets with Constraints. KDD’02.  F. Pan, G. Cong, A. K. H. Tung, J. Yang, and M. Zaki , CARPENTER: Finding Closed Patterns in Long Biological Datasets. KDD'03.  H. Liu, J. Han, D. Xin, and Z. Shao, Mining Interesting Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach, SDM'06.
  • 387.
    387 Ref: Mining Correlationsand Interesting Rules  S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to correlations. SIGMOD'97.  M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94.  R. J. Hilderman and H. J. Hamilton. Knowledge Discovery and Measures of Interest. Kluwer Academic, 2001.  C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining causal structures. VLDB'98.  P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure for Association Patterns. KDD'02.  E. Omiecinski. Alternative Interest Measures for Mining Associations. TKDE’03.  T. Wu, Y. Chen, and J. Han, “Re-Examination of Interestingness Measures in Pattern Mining: A Unified Framework", Data Mining and Knowledge Discovery, 21(3):371- 397, 2010
  • 388.
    388 388 Data Mining: Concepts andTechniques (3rd ed.) — Chapter 7 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2010 Han, Kamber & Pei. All rights reserved.
  • 389.
  • 390.
    390 Chapter 7 :Advanced Frequent Pattern Mining  Pattern Mining: A Road Map  Pattern Mining in Multi-Level, Multi-Dimensional Space  Constraint-Based Frequent Pattern Mining  Mining High-Dimensional Data and Colossal Patterns  Mining Compressed or Approximate Patterns  Pattern Exploration and Application  Summary
  • 392.
    392 Chapter 7 :Advanced Frequent Pattern Mining  Pattern Mining: A Road Map  Pattern Mining in Multi-Level, Multi-Dimensional Space  Mining Multi-Level Association  Mining Multi-Dimensional Association  Mining Quantitative Association Rules  Mining Rare Patterns and Negative Patterns  Constraint-Based Frequent Pattern Mining  Mining High-Dimensional Data and Colossal Patterns  Mining Compressed or Approximate Patterns  Pattern Exploration and Application  Summary
  • 393.
393 Mining Multiple-Level Association Rules  Items often form hierarchies  Flexible support settings  Items at lower levels are expected to have lower support  Exploration of shared multi-level mining (Agrawal & Srikant @VLDB'95, Han & Fu @VLDB'95)  Uniform support: Level 1 min_sup = 5%, Level 2 min_sup = 5%  Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3%  Ex.: Milk [support = 10%], 2% Milk [support = 6%], Skim Milk [support = 4%]
  • 394.
    394 Multi-level Association: FlexibleSupport and Redundancy filtering  Flexible min-support thresholds: Some items are more valuable but less frequent  Use non-uniform, group-based min-support  E.g., {diamond, watch, camera}: 0.05%; {bread, milk}: 5%; …  Redundancy Filtering: Some rules may be redundant due to “ancestor” relationships between items  milk  wheat bread [support = 8%, confidence = 70%]  2% milk  wheat bread [support = 2%, confidence = 72%] The first rule is an ancestor of the second rule  A rule is redundant if its support is close to the “expected” value, based on the rule’s ancestor
  • 395.
    395 Chapter 7 :Advanced Frequent Pattern Mining  Pattern Mining: A Road Map  Pattern Mining in Multi-Level, Multi-Dimensional Space  Mining Multi-Level Association  Mining Multi-Dimensional Association  Mining Quantitative Association Rules  Mining Rare Patterns and Negative Patterns  Constraint-Based Frequent Pattern Mining  Mining High-Dimensional Data and Colossal Patterns  Mining Compressed or Approximate Patterns  Pattern Exploration and Application  Summary
  • 396.
396 Mining Multi-Dimensional Associations  Single-dimensional rules: buys(X, “milk”) ⇒ buys(X, “bread”)  Multi-dimensional rules: ≥ 2 dimensions or predicates  Inter-dimension assoc. rules (no repeated predicates): age(X, “19-25”) ^ occupation(X, “student”) ⇒ buys(X, “coke”)  Hybrid-dimension assoc. rules (repeated predicates): age(X, “19-25”) ^ buys(X, “popcorn”) ⇒ buys(X, “coke”)  Categorical attributes: finite number of possible values, no ordering among values—data cube approach  Quantitative attributes: numeric, implicit ordering among values—discretization, clustering, and gradient approaches
  • 397.
    397 Chapter 7 :Advanced Frequent Pattern Mining  Pattern Mining: A Road Map  Pattern Mining in Multi-Level, Multi-Dimensional Space  Mining Multi-Level Association  Mining Multi-Dimensional Association  Mining Quantitative Association Rules  Mining Rare Patterns and Negative Patterns  Constraint-Based Frequent Pattern Mining  Mining High-Dimensional Data and Colossal Patterns  Mining Compressed or Approximate Patterns  Pattern Exploration and Application  Summary
  • 398.
    398 Mining Quantitative Associations Techniquescan be categorized by how numerical attributes, such as age or salary are treated 1. Static discretization based on predefined concept hierarchies (data cube methods) 2. Dynamic discretization based on data distribution (quantitative rules, e.g., Agrawal & Srikant@SIGMOD96) 3. Clustering: Distance-based association (e.g., Yang & Miller@SIGMOD97)  One dimensional clustering then association 4. Deviation: (such as Aumann and Lindell@KDD99) Sex = female => Wage: mean=$7/hr (overall mean = $9)
  • 399.
    399 Static Discretization ofQuantitative Attributes  Discretized prior to mining using concept hierarchy.  Numeric values are replaced by ranges  In relational database, finding all frequent k-predicate sets will require k or k+1 table scans  Data cube is well suited for mining  The cells of an n-dimensional cuboid correspond to the predicate sets  Mining from data cubes can be much faster (income) (age) () (buys) (age, income) (age,buys) (income,buys) (age,income,buys)
  • 400.
    400 Quantitative Association RulesBased on Statistical Inference Theory [Aumann and Lindell@DMKD’03]  Finding extraordinary and therefore interesting phenomena, e.g., (Sex = female) => Wage: mean=$7/hr (overall mean = $9)  LHS: a subset of the population  RHS: an extraordinary behavior of this subset  The rule is accepted only if a statistical test (e.g., Z-test) confirms the inference with high confidence  Subrule: highlights the extraordinary behavior of a subset of the pop. of the super rule  E.g., (Sex = female) ^ (South = yes) => mean wage = $6.3/hr  Two forms of rules  Categorical => quantitative rules, or Quantitative => quantitative rules  E.g., Education in [14-18] (yrs) => mean wage = $11.64/hr  Open problem: Efficient methods for LHS containing two or more quantitative attributes
  • 401.
    401 Chapter 7 :Advanced Frequent Pattern Mining  Pattern Mining: A Road Map  Pattern Mining in Multi-Level, Multi-Dimensional Space  Mining Multi-Level Association  Mining Multi-Dimensional Association  Mining Quantitative Association Rules  Mining Rare Patterns and Negative Patterns  Constraint-Based Frequent Pattern Mining  Mining High-Dimensional Data and Colossal Patterns  Mining Compressed or Approximate Patterns  Pattern Exploration and Application  Summary
  • 402.
    402 Negative and RarePatterns  Rare patterns: Very low support but interesting  E.g., buying Rolex watches  Mining: Setting individual-based or special group- based support threshold for valuable items  Negative patterns  Since it is unlikely that one buys Ford Expedition (an SUV car) and Toyota Prius (a hybrid car) together, Ford Expedition and Toyota Prius are likely negatively correlated patterns  Negatively correlated patterns that are infrequent tend to be more interesting than those that are frequent
  • 403.
403 Defining Negatively Correlated Patterns (I)  Definition 1 (support-based)  If itemsets X and Y are both frequent but rarely occur together, i.e., sup(X ∪ Y) < sup(X) × sup(Y)  Then X and Y are negatively correlated  Problem: A store sold needle packages A and B 100 times each, but only one transaction contained both A and B  When there are in total 200 transactions, we have s(A ∪ B) = 0.005, s(A) × s(B) = 0.25, so s(A ∪ B) < s(A) × s(B)  When there are 10^5 transactions, we have s(A ∪ B) = 1/10^5, s(A) × s(B) = 1/10^3 × 1/10^3, so s(A ∪ B) > s(A) × s(B)  Where is the problem? — Null transactions, i.e., the support-based definition is not null-invariant!
  • 404.
404 Defining Negatively Correlated Patterns (II)  Definition 2 (negative-itemset-based)  X is a negative itemset if (1) X = Ā ∪ B, where B is a set of positive items and Ā is a set of negative items, |Ā| ≥ 1, and (2) s(X) ≥ μ  Itemset X is negatively correlated if …  This definition suffers from a similar null-invariance problem  Definition 3 (Kulczynski-measure-based)  If itemsets X and Y are frequent but (P(X|Y) + P(Y|X))/2 < є, where є is a negative-pattern threshold, then X and Y are negatively correlated  Ex. For the same needle-package problem, whether there are 200 or 10^5 transactions, if є = 0.01 we have (P(A|B) + P(B|A))/2 = (0.01 + 0.01)/2 ≤ є
  • 405.
    405 Chapter 7 :Advanced Frequent Pattern Mining  Pattern Mining: A Road Map  Pattern Mining in Multi-Level, Multi-Dimensional Space  Constraint-Based Frequent Pattern Mining  Mining High-Dimensional Data and Colossal Patterns  Mining Compressed or Approximate Patterns  Pattern Exploration and Application  Summary
  • 406.
    406 Constraint-based (Query-Directed) Mining Finding all the patterns in a database autonomously? — unrealistic!  The patterns could be too many but not focused!  Data mining should be an interactive process  User directs what to be mined using a data mining query language (or a graphical user interface)  Constraint-based mining  User flexibility: provides constraints on what to be mined  Optimization: explores such constraints for efficient mining — constraint-based mining: constraint-pushing, similar to push selection first in DB query processing  Note: still find all the answers satisfying constraints, not finding some answers in “heuristic search”
  • 407.
407 Constraints in Data Mining  Knowledge type constraint:  classification, association, etc.  Data constraint — using SQL-like queries  find product pairs sold together in stores in Chicago this year  Dimension/level constraint  in relevance to region, price, brand, customer category  Rule (or pattern) constraint  small sales (price < $10) triggers big sales (sum > $200)  Interestingness constraint  strong rules: min_support ≥ 3%, min_confidence ≥ 60%
  • 408.
Meta-Rule Guided Mining  A meta-rule can be in rule form with partially instantiated predicates and constants: P1(X, Y) ^ P2(X, W) => buys(X, “iPad”)  The resulting derived rule can be: age(X, “15-25”) ^ profession(X, “student”) => buys(X, “iPad”)  In general, it can be in the form P1 ^ P2 ^ … ^ Pl => Q1 ^ Q2 ^ … ^ Qr  Method to find meta-rules  Find frequent (l+r) predicates (based on a min-support threshold)  Push constants deeply into the mining process when possible (see the remaining discussion of constraint-pushing techniques)  Use confidence, correlation, and other measures when …
  • 409.
409 Constraint-Based Frequent Pattern Mining  Pattern space pruning constraints  Anti-monotonic: if constraint c is violated, further mining can be terminated  Monotonic: if c is satisfied, no need to check c again  Succinct: c must be satisfied, so one can start with the data sets satisfying c  Convertible: c is neither monotonic nor anti-monotonic, but it can be converted into one if items in the transaction can be properly ordered  Data space pruning constraints  Data succinct: the data space can be pruned at the initial pattern mining process  Data anti-monotonic: if a transaction t does not satisfy c, t can be pruned from further mining
  • 410.
410 Pattern Space Pruning with Anti-Monotonicity Constraints  A constraint C is anti-monotone if, whenever an itemset S violates C, so does any of its supersets (equivalently: if a pattern satisfies C, so do all of its sub-patterns)  Ex. 1. sum(S.price) ≤ v is anti-monotone  Ex. 2. range(S.profit) ≤ 15 is anti-monotone  Itemset ab violates C  So does every superset of ab  Ex. 3. sum(S.price) ≥ v is not anti-monotone  Ex. 4. support count is anti-monotone: the core property used in Apriori
TDB (min_sup = 2): TID | Transaction — 10 | a, b, c, d, f; 20 | b, c, d, f, g, h; 30 | a, c, d, e, f; 40 | c, e, f, g
Item profits: a 40, b 0, c −20, d 10, e −30, f 30, g 20, h −10
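Ex. 2 can be checked directly on the profit table above; once {a, b} violates range(S.profit) ≤ 15, every superset can be skipped (an illustrative sketch):

profit = {'a': 40, 'b': 0, 'c': -20, 'd': 10, 'e': -30, 'f': 30, 'g': 20, 'h': -10}

def satisfies_range_le(itemset, v=15):
    # anti-monotone constraint: range of the item profits must stay <= v
    vals = [profit[i] for i in itemset]
    return max(vals) - min(vals) <= v

print(satisfies_range_le({'a', 'b'}))        # False: range = 40 - 0 = 40 > 15
print(satisfies_range_le({'a', 'b', 'c'}))   # False: a superset can only widen the range
print(satisfies_range_le({'b', 'd'}))        # True:  range = 10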
  • 411.
411 Pattern Space Pruning with Monotonicity Constraints  A constraint C is monotone if, once a pattern satisfies C, we do not need to check C in subsequent mining  Alternatively: if an itemset S satisfies the constraint, so does any of its supersets  Ex. 1. sum(S.price) ≥ v is monotone  Ex. 2. min(S.price) ≤ v is monotone  Ex. 3. C: range(S.profit) ≥ 15  Itemset ab satisfies C  So does every superset of ab  (Same TDB (min_sup = 2) and profit table as on the previous slide)
  • 412.
412 Data Space Pruning with Data Anti-Monotonicity  A constraint c is data anti-monotone if, whenever a pattern p cannot satisfy a transaction t under c, no superset of p can satisfy t under c either  The key to data anti-monotonicity is recursive data reduction  Ex. 1. sum(S.price) ≥ v is data anti-monotone  Ex. 2. min(S.price) ≤ v is data anti-monotone  Ex. 3. C: range(S.profit) ≥ 25 is data anti-monotone  Itemset {b, c}'s projected DB:  T10': {d, f, h}, T20': {d, f, g, h}, T30': {d, f, g}  Since C cannot be satisfied within T10', T10' can be pruned
TDB (min_sup = 2): TID | Transaction — 10 | a, b, c, d, f, h; 20 | b, c, d, f, g, h; 30 | b, c, d, f, g; 40 | c, e, f, g
Item profits: a 40, b 0, c −20, d −15, e −30, f −10, g 20, h −5
  • 413.
413 Pattern Space Pruning with Succinctness  Succinctness:  Given A1, the set of items satisfying a succinctness constraint C, any set S satisfying C is based on A1, i.e., S contains a subset belonging to A1  Idea: whether an itemset S satisfies constraint C can be determined from the selection of items alone, without looking at the transaction database  min(S.price) ≤ v is succinct  sum(S.price) ≥ v is not succinct  Optimization: if C is succinct, C is pre-counting pushable
  • 414.
    414 Naïve Algorithm: Apriori+ Constraint TID Items 100 1 3 4 200 2 3 5 300 1 2 3 5 400 2 5 Database D itemset sup. {1} 2 {2} 3 {3} 3 {4} 1 {5} 3 itemset sup. {1} 2 {2} 3 {3} 3 {5} 3 Scan D C1 L1 itemset {1 2} {1 3} {1 5} {2 3} {2 5} {3 5} itemset sup {1 2} 1 {1 3} 2 {1 5} 1 {2 3} 2 {2 5} 3 {3 5} 2 itemset sup {1 3} 2 {2 3} 2 {2 5} 3 {3 5} 2 L2 C2 C2 Scan D C3 L3 itemset {2 3 5} Scan D itemset sup {2 3 5} 2 Constraint: Sum{S.price} < 5
  • 415.
    415 Constrained Apriori :Push a Succinct Constraint Deep TID Items 100 1 3 4 200 2 3 5 300 1 2 3 5 400 2 5 Database D itemset sup. {1} 2 {2} 3 {3} 3 {4} 1 {5} 3 itemset sup. {1} 2 {2} 3 {3} 3 {5} 3 Scan D C1 L1 itemset {1 2} {1 3} {1 5} {2 3} {2 5} {3 5} itemset sup {1 2} 1 {1 3} 2 {1 5} 1 {2 3} 2 {2 5} 3 {3 5} 2 itemset sup {1 3} 2 {2 3} 2 {2 5} 3 {3 5} 2 L2 C2 C2 Scan D C3 L3 itemset {2 3 5} Scan D itemset sup {2 3 5} 2 Constraint: min{S.price } <= 1 not immediately to be used
  • 416.
    416 Constrained FP-Growth: Pusha Succinct Constraint Deep Constraint: min{S.price } <= 1 TID Items 100 1 3 4 200 2 3 5 300 1 2 3 5 400 2 5 TID Items 100 1 3 200 2 3 5 300 1 2 3 5 400 2 5 Remove infrequent length 1 FP-Tree TID Items 100 3 4 300 2 3 5 1-Projected DB No Need to project on 2, 3, or 5
  • 417.
    417 Constrained FP-Growth: Pusha Data Anti-monotonic Constraint Deep Constraint: min{S.price } <= 1 TID Items 100 1 3 4 200 2 3 5 300 1 2 3 5 400 2 5 TID Items 100 1 3 300 1 3 FP-Tree Single branch, we are done Remove from data
  • 418.
    418 Constrained FP-Growth: Pusha Data Anti-monotonic Constraint Deep Constraint: range{S.price } > 25 min_sup >= 2 FP-Tree TID Transaction 10 a, c, d, f, h 20 c, d, f, g, h 30 c, d, f, g B-Projected DB B FP-Tree TID Transaction 10 a, b, c, d, f, h 20 b, c, d, f, g, h 30 b, c, d, f, g 40 a, c, e, f, g TID Transaction 10 a, b, c, d, f, h 20 b, c, d, f, g, h 30 b, c, d, f, g 40 a, c, e, f, g Item Profit a 40 b 0 c -20 d -15 e -30 f -10 g 20 h -5 Recursive Data Pruning Single branch: bcdfg: 2
  • 419.
419 Convertible Constraints: Ordering Data in Transactions  Convert tough constraints into anti-monotone or monotone ones by properly ordering items  Examine C: avg(S.profit) ≥ 25  Order items in value-descending order  <a, f, g, d, b, h, c, e>  If an itemset afb violates C  So do afbh, afb*  It becomes anti-monotone!
TDB (min_sup = 2): TID | Transaction — 10 | a, b, c, d, f; 20 | b, c, d, f, g, h; 30 | a, c, d, e, f; 40 | c, e, f, g
Item profits: a 40, b 0, c −20, d 10, e −30, f 30, g 20, h −10
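The example can be checked numerically: under the value-descending order R, once {a, f, b} violates avg(S.profit) ≥ 25, extending it with items later in R can only lower the average (an illustrative sketch):

profit = {'a': 40, 'b': 0, 'c': -20, 'd': 10, 'e': -30, 'f': 30, 'g': 20, 'h': -10}
R = ['a', 'f', 'g', 'd', 'b', 'h', 'c', 'e']          # value-descending order

def avg_profit(itemset):
    return sum(profit[i] for i in itemset) / len(itemset)

afb = ['a', 'f', 'b']
print(round(avg_profit(afb), 1))                      # 23.3 < 25 -> afb violates C
# every item after 'b' in R has profit <= profit['b'], so extensions only lower the average
later = R[R.index('b') + 1:]
print(all(avg_profit(afb + [x]) < 25 for x in later))  # True -> anti-monotone under R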
  • 420.
420 Strongly Convertible Constraints
 avg(X) ≥ 25 is convertible anti-monotone w.r.t. the item-value-descending order R: <a, f, g, d, b, h, c, e>
 If an itemset af violates constraint C, so does every itemset with af as a prefix, such as afd
 avg(X) ≥ 25 is convertible monotone w.r.t. the item-value-ascending order R^-1: <e, c, h, b, d, g, f, a>
 If an itemset d satisfies constraint C, so do itemsets df and dfa, which have d as a prefix
 Thus, avg(X) ≥ 25 is strongly convertible
Item profits: a: 40, b: 0, c: -20, d: 10, e: -30, f: 30, g: 20, h: -10
  • 421.
421 Can Apriori Handle Convertible Constraints?
 A convertible constraint that is neither monotone, anti-monotone, nor succinct cannot be pushed deep into an Apriori mining algorithm
 Within the level-wise framework, no direct pruning based on the constraint can be made
 Itemset df violates constraint C: avg(X) ≥ 25
 Since adf satisfies C, Apriori needs df to assemble adf, so df cannot be pruned
 But the constraint can be pushed into frequent-pattern growth mining
Item values: a: 40, b: 0, c: -20, d: 10, e: -30, f: 30, g: 20, h: -10
  • 422.
422 Pattern Space Pruning with Convertible Constraints
 C: avg(X) ≥ 25, min_sup = 2
 List items in every transaction in value-descending order R: <a, f, g, d, b, h, c, e>
 C is convertible anti-monotone w.r.t. R
 Scan TDB once and remove infrequent items: item h is dropped; itemsets a and f are good, …
 Projection-based mining: impose an appropriate order on item projection
 Many tough constraints can be converted into (anti-)monotone constraints (see the sketch below)
TDB (min_sup = 2), TID: transaction: 10: a, f, d, b, c; 20: f, g, d, b, c; 30: a, f, d, c, e; 40: f, g, h, c, e
Item values: a: 40, f: 30, g: 20, d: 10, b: 0, h: -10, c: -20, e: -30
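A small sketch of why the value-descending order R makes avg(X) ≥ 25 behave anti-monotonically: along R, every later item has a value no larger than the items already chosen, so the running average can only drop; once a prefix violates the constraint, the whole branch can be pruned. The helper violates_avg is illustrative.

```python
# Sketch: avg(X) >= 25 becomes anti-monotone once items are explored in
# value-descending order R (values from the slide's table).
VALUE = {'a': 40, 'f': 30, 'g': 20, 'd': 10, 'b': 0, 'h': -10, 'c': -20, 'e': -30}
R = sorted(VALUE, key=VALUE.get, reverse=True)   # ['a', 'f', 'g', 'd', 'b', 'h', 'c', 'e']

def violates_avg(prefix, threshold=25):
    return sum(VALUE[i] for i in prefix) / len(prefix) < threshold

# Growing patterns only as prefixes of R: every item appended later has a
# value no larger than those already in the prefix, so the average can only
# drop. Once a prefix violates the constraint, all its extensions do too.
prefix = []
for item in R:
    prefix.append(item)
    if violates_avg(prefix):
        print('prune at prefix', prefix)   # prune at ['a', 'f', 'g', 'd', 'b']
        break
```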
  • 423.
423 Handling Multiple Constraints
 Different constraints may require different, or even conflicting, item orderings
 If there exists an order R such that both C1 and C2 are convertible w.r.t. R, then there is no conflict between the two convertible constraints
 If the item orders conflict:
 Try to satisfy one constraint first
 Then use the order of the other constraint to mine frequent itemsets in the corresponding projected database
  • 424.
424 What Constraints Are Convertible?
Constraint | Convertible anti-monotone | Convertible monotone | Strongly convertible
avg(S) ≤ v, ≥ v | Yes | Yes | Yes
median(S) ≤ v, ≥ v | Yes | Yes | Yes
sum(S) ≤ v (items could be of any value, v ≥ 0) | Yes | No | No
sum(S) ≤ v (items could be of any value, v ≤ 0) | No | Yes | No
sum(S) ≥ v (items could be of any value, v ≥ 0) | No | Yes | No
sum(S) ≥ v (items could be of any value, v ≤ 0) | Yes | No | No
……
  • 425.
425 Constraint-Based Mining: A General Picture
Constraint | Anti-monotone | Monotone | Succinct
v ∈ S | no | yes | yes
S ⊇ V | no | yes | yes
S ⊆ V | yes | no | yes
min(S) ≤ v | no | yes | yes
min(S) ≥ v | yes | no | yes
max(S) ≤ v | yes | no | yes
max(S) ≥ v | no | yes | yes
count(S) ≤ v | yes | no | weakly
count(S) ≥ v | no | yes | weakly
sum(S) ≤ v (∀a ∈ S, a ≥ 0) | yes | no | no
sum(S) ≥ v (∀a ∈ S, a ≥ 0) | no | yes | no
range(S) ≤ v | yes | no | no
range(S) ≥ v | no | yes | no
avg(S) θ v, θ ∈ {=, ≤, ≥} | convertible | convertible | no
support(S) ≥ ξ | yes | no | no
support(S) ≤ ξ | no | yes | no
  • 426.
    426 Chapter 7 :Advanced Frequent Pattern Mining  Pattern Mining: A Road Map  Pattern Mining in Multi-Level, Multi-Dimensional Space  Constraint-Based Frequent Pattern Mining  Mining High-Dimensional Data and Colossal Patterns  Mining Compressed or Approximate Patterns  Pattern Exploration and Application  Summary
  • 427.
427 Mining Colossal Frequent Patterns
 F. Zhu, X. Yan, J. Han, P. S. Yu, and H. Cheng, "Mining Colossal Frequent Patterns by Core Pattern Fusion", ICDE'07
 We have many algorithms, but can we mine large (i.e., colossal) patterns, say of size around 50 to 100? Unfortunately, not!
 Why not? The curse of the "downward closure" property of frequent patterns
 The "downward closure" property: any sub-pattern of a frequent pattern is frequent
 Example: if (a1, a2, …, a100) is frequent, then a1, a2, …, a100, (a1, a2), (a1, a3), …, (a1, a100), (a1, a2, a3), … are all frequent! There are about 2^100 such frequent itemsets!
 Whether we use breadth-first search (e.g., Apriori) or depth-first search (e.g., FPgrowth), we have to examine this many patterns
 Thus the downward closure property leads to explosion!
  • 428.
428 Colossal Patterns: A Motivating Example
 Let's make a set of 40 transactions, T1 = T2 = … = T40 = {1, 2, 3, 4, …, 39, 40}, then delete the items on the diagonal (remove item i from Ti), giving T1 = {2, 3, …, 40}, T2 = {1, 3, 4, …, 40}, …, T40 = {1, 2, …, 39}
 Let the minimum support threshold be σ = 20
 There are C(40, 20) frequent patterns of size 20, and each is closed and maximal
 In general, # patterns = C(n, n/2) ≈ 2^n / sqrt(πn/2), so the size of the answer set is exponential in n
 Closed/maximal patterns may partially alleviate the problem but do not really solve it: we often need to mine scattered large patterns!
  • 429.
    429 Colossal Pattern Set:Small but Interesting  It is often the case that only a small number of patterns are colossal, i.e., of large size  Colossal patterns are usually attached with greater importance than those of small pattern sizes
  • 430.
430 Mining Colossal Patterns: Motivation and Philosophy
 Motivation: many real-world tasks need mining colossal patterns
 Micro-array analysis in bioinformatics (when support is low)
 Biological sequence patterns
 Biological/sociological/information graph pattern mining
 No hope for completeness
 If the mining of mid-sized patterns is explosive in size, there is no hope of finding colossal patterns efficiently by insisting on the "complete set" mining philosophy
 Jumping out of the swamp of mid-sized results
 What we need is a philosophy that jumps out of the swamp of mid-sized results, which are explosive in size, and reaches colossal patterns directly
 Striving for mining almost complete colossal patterns
 The key is to develop a mechanism that may quickly reach colossal patterns and discover most of them
  • 431.
431 Alas, A Show of Colossal Pattern Mining!
 Transactions: T1 = {2, 3, 4, …, 39, 40}, T2 = {1, 3, 4, …, 39, 40}, …, T40 = {1, 2, 3, 4, …, 39} (the diagonal-deleted set from the earlier example), plus T41 = T42 = … = T60 = {41, 42, 43, …, 79}
 Let the min-support threshold be σ = 20
 Then there are C(40, 20) closed/maximal frequent patterns of size 20
 However, there is only one with size greater than 20 (i.e., colossal): α = {41, 42, …, 79}, of size 39
 The existing fastest mining algorithms (e.g., FPClose, LCM) fail to complete running
 Our algorithm outputs this colossal pattern in seconds
  • 432.
432 Methodology of the Pattern-Fusion Strategy
 Pattern-Fusion traverses the tree in a bounded-breadth way
 It always pushes down a frontier of a bounded-size candidate pool
 Only a fixed number of patterns in the current candidate pool are used as starting nodes to go down the pattern tree, thus avoiding the exponential search space
 Pattern-Fusion identifies "shortcuts" whenever possible
 Pattern growth is not performed by single-item addition but by leaps and bounds: agglomeration of multiple patterns in the pool
 These shortcuts direct the search down the tree much more rapidly towards the colossal patterns
  • 433.
433 Observation: Colossal Patterns and Core Patterns
(Figure: transaction database D with a colossal pattern α and its projected databases Dα, Dα1, …, Dαk for subpatterns α1, …, αk)
 Subpatterns α1 to αk cluster tightly around the colossal pattern α by sharing a similar support. We call such subpatterns core patterns of α
  • 434.
434 Robustness of Colossal Patterns
 Core patterns: intuitively, for a frequent pattern α, a subpattern β is a τ-core pattern of α if β shares a similar support set with α, i.e., |Dα| / |Dβ| ≥ τ, where 0 < τ ≤ 1 is called the core ratio
 Robustness of colossal patterns: a colossal pattern is robust in the sense that it tends to have many more core patterns than a small pattern does
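A hedged sketch of the τ-core-pattern test |Dα| / |Dβ| ≥ τ on a toy database; support_set and is_core_pattern are illustrative names, and the tiny database merely stands in for the slide's counts.

```python
# Sketch: tau-core-pattern test, |D_alpha| / |D_beta| >= tau (0 < tau <= 1).
def support_set(pattern, db):
    return {tid for tid, t in db.items() if set(pattern) <= set(t)}

def is_core_pattern(beta, alpha, db, tau=0.5):
    """beta is a tau-core pattern of alpha if it is a subpattern of alpha and
    keeps a similar support set (assumes beta occurs at least once)."""
    d_alpha, d_beta = support_set(alpha, db), support_set(beta, db)
    return set(beta) <= set(alpha) and len(d_alpha) / len(d_beta) >= tau

db = {1: 'abe', 2: 'abe', 3: 'bcf', 4: 'acf', 5: 'abcef'}
print(is_core_pattern('ab', 'abe', db))    # True:  |D_abe| = 3, |D_ab| = 3
print(is_core_pattern('c', 'abcef', db))   # False: |D_abcef| = 1, |D_c| = 3
```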
  • 435.
435 Example: Core Patterns
 A colossal pattern has far more core patterns than a small-sized pattern
 A colossal pattern has far more core descendants of a smaller size c
 A random draw from the complete set of patterns of size c is therefore more likely to pick a core descendant of a colossal pattern
 A colossal pattern can be generated by merging a set of its core patterns
Transaction (# of Ts) | Core Patterns (τ = 0.5)
(abe) (100) | (abe), (ab), (be), (ae), (e)
(bcf) (100) | (bcf), (bc), (bf)
(acf) (100) | (acf), (ac), (af)
(abcef) (100) | (ab), (ac), (af), (ae), (bc), (bf), (be), (ce), (fe), (e), (abc), (abf), (abe), (ace), (acf), (afe), (bcf), (bce), (bfe), (cfe), (abcf), (abce), (bcfe), (acfe), (abfe), (abcef)
  • 436.
    437 Colossal Patterns Correspondto Dense Balls  Due to their robustness, colossal patterns correspond to dense balls  Ω( 2^d) in population  A random draw in the pattern space will hit somewhere in the ball with high probability
  • 437.
    438 Idea of Pattern-FusionAlgorithm  Generate a complete set of frequent patterns up to a small size  Randomly pick a pattern β, and β has a high probability to be a core-descendant of some colossal pattern α  Identify all α’s descendants in this complete set, and merge all of them ― This would generate a much larger core-descendant of α  In the same fashion, we select K patterns. This set of larger core-descendants will be the candidate pool for the next iteration
  • 438.
    439 Pattern-Fusion: The Algorithm Initialization (Initial pool): Use an existing algorithm to mine all frequent patterns up to a small size, e.g., 3  Iteration (Iterative Pattern Fusion):  At each iteration, k seed patterns are randomly picked from the current pattern pool  For each seed pattern thus picked, we find all the patterns within a bounding ball centered at the seed pattern  All these patterns found are fused together to generate a set of super-patterns. All the super- patterns thus generated form a new pool for the next iteration  Termination: when the current pool contains no more than K patterns at the beginning of an iteration
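A schematic sketch of the Pattern-Fusion loop under simplifying assumptions: the "bounding ball" test below is a crude support-set overlap check, not the paper's exact definition, and all function names are illustrative.

```python
import random

def support_set(pattern, db):
    return frozenset(t for t, items in db.items() if pattern <= items)

def pattern_fusion(db, initial_pool, k=3, K=10, min_sup=2, rounds=5):
    """Sketch of the Pattern-Fusion iteration: pick k seed patterns at random,
    fuse each seed with the pool patterns whose support sets overlap it enough
    (a crude stand-in for the bounding ball), and keep the fused super-pattern
    if it is still frequent."""
    pool = list(initial_pool)
    for _ in range(rounds):
        if len(pool) <= K:               # termination: pool is small enough
            break
        seeds = random.sample(pool, min(k, len(pool)))
        new_pool = []
        for seed in seeds:
            ds = support_set(seed, db)
            ball = [p for p in pool if len(ds & support_set(p, db)) >= min_sup]
            fused = frozenset().union(*ball) if ball else seed
            if len(support_set(fused, db)) >= min_sup:
                new_pool.append(fused)   # a much larger core-descendant
            else:
                new_pool.append(seed)
        pool = new_pool
    return pool

db = {1: frozenset('abcd'), 2: frozenset('abce'), 3: frozenset('abcf'), 4: frozenset('xyz')}
pool = [frozenset(p) for p in ('a', 'b', 'c', 'ab', 'bc', 'ac')]
print(pattern_fusion(db, pool, k=2, K=2))   # two copies of the fused pattern {a, b, c}
```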
  • 439.
    440 Why Is Pattern-FusionEfficient?  A bounded-breadth pattern tree traversal  It avoids explosion in mining mid-sized ones  Randomness comes to help to stay on the right path  Ability to identify “short- cuts” and take “leaps”  fuse small patterns together in one step to generate new patterns of significant sizes  Efficiency
  • 440.
    441 Pattern-Fusion Leads toGood Approximation  Gearing toward colossal patterns  The larger the pattern, the greater the chance it will be generated  Catching outliers  The more distinct the pattern, the greater the chance it will be generated
  • 441.
442 Experimental Setting
 Synthetic data set
 Diag_n: an n x (n-1) table where the ith row has the integers from 1 to n except i; each row is taken as an itemset; min_support is n/2
 Real data sets
 Replace: a program trace data set collected from the "replace" program, widely used in software engineering research
 ALL: a popular gene expression data set, clinical data on ALL-AML leukemia (www.broad.mit.edu/tools/data.html); each item is a column, representing the activity level of a gene/protein in the same sample
 Frequent patterns would reveal important correlations between gene expression patterns and disease outcomes
  • 442.
443 Experiment Results on Diag_n
 LCM run time increases exponentially with pattern size n
 Pattern-Fusion finishes efficiently
 The approximation error of Pattern-Fusion (with min_sup 20) compared with the complete set is rather close to that of uniform sampling (which randomly picks K patterns from the complete answer set)
  • 443.
    444 Experimental Results onALL  ALL: A popular gene expression data set with 38 transactions, each with 866 columns  There are 1736 items in total  The table shows a high frequency threshold of 30
  • 444.
    445 Experimental Results onREPLACE  REPLACE  A program trace data set, recording 4395 calls and transitions  The data set contains 4395 transactions with 57 items in total  With support threshold of 0.03, the largest patterns are of size 44  They are all discovered by Pattern-Fusion with different settings of K and τ, when started with an initial pool of 20948 patterns of size <=3
  • 445.
    446 Experimental Results onREPLACE  Approximation error when compared with the complete mining result  Example. Out of the total 98 patterns of size >=42, when K=100, Pattern-Fusion returns 80 of them  A good approximation to the colossal patterns in the sense that any pattern in the complete set is on average at most 0.17 items away from one of these 80 patterns
  • 446.
    447 Chapter 7 :Advanced Frequent Pattern Mining  Pattern Mining: A Road Map  Pattern Mining in Multi-Level, Multi-Dimensional Space  Constraint-Based Frequent Pattern Mining  Mining High-Dimensional Data and Colossal Patterns  Mining Compressed or Approximate Patterns  Pattern Exploration and Application  Summary
  • 447.
448 Mining Compressed Patterns: δ-clustering
 Why compressed patterns? There are too many patterns, and many are not meaningful
 Pattern distance measure on supporting-transaction sets
 δ-clustering: for each pattern P, find all patterns that can be expressed by P and whose distance to P is within δ (δ-cover)
 All patterns in the cluster can be represented by P
 Xin et al., "Mining Compressed Frequent-Pattern Sets", VLDB'05
ID | Item-Sets | Support
P1 | {38, 16, 18, 12} | 205227
P2 | {38, 16, 18, 12, 17} | 205211
P3 | {39, 38, 16, 18, 12, 17} | 101758
P4 | {39, 16, 18, 12, 17} | 161563
P5 | {39, 16, 18, 12} | 161576
 Closed frequent patterns: report P1, P2, P3, P4, P5; emphasizes support too much, no compression
 Max-pattern: report only P3; information loss
 A desirable output: P2, P3, P4
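A minimal sketch of the pattern-distance / δ-cover test used by δ-clustering, Dist(P1, P2) = 1 - |T(P1) ∩ T(P2)| / |T(P1) ∪ T(P2)| on supporting-transaction sets. Since the table above gives only support counts, the integer ranges used as transaction-ID sets below are stand-ins.

```python
def pattern_distance(t1, t2):
    """Dist(P1, P2) = 1 - |T(P1) & T(P2)| / |T(P1) | T(P2)|: a Jaccard-style
    distance on supporting-transaction sets."""
    t1, t2 = set(t1), set(t2)
    return 1.0 - len(t1 & t2) / len(t1 | t2)

def delta_covers(rep_items, rep_tids, p_items, p_tids, delta=0.1):
    """Representative P delta-covers P' if P' is a sub-itemset of P (so P can
    'express' P') and their transaction sets are within delta of each other."""
    return set(p_items) <= set(rep_items) and \
           pattern_distance(rep_tids, p_tids) <= delta

# Toy illustration: P2 can represent P1 because their support sets nearly coincide.
T_P1 = set(range(0, 205227))   # stand-in transaction IDs supporting P1
T_P2 = set(range(0, 205211))   # stand-in transaction IDs supporting P2 (a subset of T_P1)
print(delta_covers({38, 16, 18, 12, 17}, T_P2, {38, 16, 18, 12}, T_P1, delta=0.01))  # True
```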
  • 448.
449 Redundancy-Aware Top-k Patterns
 Why redundancy-aware top-k patterns?
 Desired patterns: high significance & low redundancy
 Propose MMS (Maximal Marginal Significance) for measuring the combined significance of a pattern set
 Xin et al., Extracting Redundancy-Aware Top-K Patterns, KDD'06
  • 449.
    450 Chapter 7 :Advanced Frequent Pattern Mining  Pattern Mining: A Road Map  Pattern Mining in Multi-Level, Multi-Dimensional Space  Constraint-Based Frequent Pattern Mining  Mining High-Dimensional Data and Colossal Patterns  Mining Compressed or Approximate Patterns  Pattern Exploration and Application  Summary
  • 450.
How to Understand and Interpret Patterns?
 Not all frequent patterns are useful, only meaningful ones …
 Do they all make sense? What do they mean? How are they useful?
 Example patterns: (diaper, beer) vs. (female, sterile (2), tekele)
 Annotate patterns with semantic information, e.g., morphological info and simple statistics
  • 451.
A Dictionary Analogy: the word "pattern" from Merriam-Webster
 Non-semantic info.
 Definitions indicating semantics
 Examples of usage
 Synonyms
 Related words
  • 452.
Semantic Analysis with Context Models
 Task 1: Model the context of a frequent pattern
 Based on the context model …
 Task 2: Extract the strongest context indicators
 Task 3: Extract representative transactions
 Task 4: Extract semantically similar patterns
  • 453.
Annotating DBLP Co-authorship & Title Patterns
(Figure: database of paper titles and authors, e.g., "Substructure Similarity Search in Graph Databases" by X. Yan, P. Yu, J. Han)
 Frequent patterns: P1 = {x_yan, j_han} (frequent itemset), P2 = "substructure search" (sequential pattern)
 Context units: <{p_yu, j_han}, {d_xin}, …, "graph pattern", …, "substructure similarity", …>
 Semantic annotations for pattern = {xifeng_yan, jiawei_han}:
 Context indicators (CI): graph; {philip_yu}; mine close; graph pattern; sequential pattern; …
 Representative transactions (Trans): "gSpan: graph-based substructure pattern mining"; "mining close relational graph connect constraint"; …
 Semantically similar patterns (SSP): {jiawei_han, philip_yu}; {jian_pei, jiawei_han}; {jiong_yang, philip_yu, wei_wang}; …
  • 454.
    455 Chapter 7 :Advanced Frequent Pattern Mining  Pattern Mining: A Road Map  Pattern Mining in Multi-Level, Multi-Dimensional Space  Constraint-Based Frequent Pattern Mining  Mining High-Dimensional Data and Colossal Patterns  Mining Compressed or Approximate Patterns  Pattern Exploration and Application  Summary
  • 455.
456 Summary
 Roadmap: many aspects of and extensions to pattern mining
 Mining patterns in multi-level, multi-dimensional space
 Mining rare and negative patterns
 Constraint-based pattern mining
 Specialized methods for mining high-dimensional data and colossal patterns
 Mining compressed or approximate patterns
 Pattern exploration and understanding: semantic annotation of frequent patterns
  • 456.
    457 Ref: Mining Multi-Leveland Quantitative Rules  Y. Aumann and Y. Lindell. A Statistical Theory for Quantitative Association Rules, KDD'99  T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using two-dimensional optimized association rules: Scheme, algorithms, and visualization. SIGMOD'96.  J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. VLDB'95.  R.J. Miller and Y. Yang. Association rules over interval data. SIGMOD'97.  R. Srikant and R. Agrawal. Mining generalized association rules. VLDB'95.  R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. SIGMOD'96.  K. Wang, Y. He, and J. Han. Mining frequent itemsets using support constraints. VLDB'00  K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Computing optimized rectilinear regions for association rules. KDD'97.
  • 457.
    458 Ref: Mining OtherKinds of Rules  F. Korn, A. Labrinidis, Y. Kotidis, and C. Faloutsos. Ratio rules: A new paradigm for fast, quantifiable data mining. VLDB'98  Y. Huhtala, J. Kärkkäinen, P. Porkka, H. Toivonen. Efficient Discovery of Functional and Approximate Dependencies Using Partitions. ICDE’98.  H. V. Jagadish, J. Madar, and R. Ng. Semantic Compression and Pattern Extraction with Fascicles. VLDB'99  B. Lent, A. Swami, and J. Widom. Clustering association rules. ICDE'97.  R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. VLDB'96.  A. Savasere, E. Omiecinski, and S. Navathe. Mining for strong negative associations in a large database of customer transactions. ICDE'98.  D. Tsur, J. D. Ullman, S. Abitboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association-rule mining. SIGMOD'98.
  • 458.
    459 Ref: Constraint-Based PatternMining  R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. KDD'97  R. Ng, L.V.S. Lakshmanan, J. Han & A. Pang. Exploratory mining and pruning optimizations of constrained association rules. SIGMOD’98  G. Grahne, L. Lakshmanan, and X. Wang. Efficient mining of constrained correlated sets. ICDE'00  J. Pei, J. Han, and L. V. S. Lakshmanan. Mining Frequent Itemsets with Convertible Constraints. ICDE'01  J. Pei, J. Han, and W. Wang, Mining Sequential Patterns with Constraints in Large Databases, CIKM'02  F. Bonchi, F. Giannotti, A. Mazzanti, and D. Pedreschi. ExAnte: Anticipated Data Reduction in Constrained Pattern Mining, PKDD'03  F. Zhu, X. Yan, J. Han, and P. S. Yu, “gPrune: A Constraint Pushing Framework for Graph Pattern Mining”, PAKDD'07
  • 459.
    460 Ref: Mining SequentialPatterns  X. Ji, J. Bailey, and G. Dong. Mining minimal distinguishing subsequence patterns with gap constraints. ICDM'05  H. Mannila, H Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. DAMI:97.  J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. ICDE'01.  R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT’96.  X. Yan, J. Han, and R. Afshar. CloSpan: Mining Closed Sequential Patterns in Large Datasets. SDM'03.  M. Zaki. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning:01.
  • 460.
    Mining Graph andStructured Patterns  A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for mining frequent substructures from graph data. PKDD'00  M. Kuramochi and G. Karypis. Frequent Subgraph Discovery. ICDM'01.  X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. ICDM'02  X. Yan and J. Han. CloseGraph: Mining Closed Frequent Graph Patterns. KDD'03  X. Yan, P. S. Yu, and J. Han. Graph indexing based on discriminative frequent structure analysis. ACM TODS, 30:960–993, 2005  X. Yan, F. Zhu, P. S. Yu, and J. Han. Feature-based substructure similarity search. ACM Trans. Database Systems, 31:1418–1453, 2006 461
  • 461.
    462 Ref: Mining Spatial,Spatiotemporal, Multimedia Data  H. Cao, N. Mamoulis, and D. W. Cheung. Mining frequent spatiotemporal sequential patterns. ICDM'05  D. Gunopulos and I. Tsoukatos. Efficient Mining of Spatiotemporal Patterns. SSTD'01  K. Koperski and J. Han, Discovery of Spatial Association Rules in Geographic Information Databases, SSD’95  H. Xiong, S. Shekhar, Y. Huang, V. Kumar, X. Ma, and J. S. Yoo. A framework for discovering co-location patterns in data sets with extended spatial objects. SDM'04  J. Yuan, Y. Wu, and M. Yang. Discovery of collocation patterns: From visual words to visual phrases. CVPR'07  O. R. Zaiane, J. Han, and H. Zhu, Mining Recurrent Items in Multimedia with Progressive Resolution Refinement. ICDE'00
  • 462.
    463 Ref: Mining FrequentPatterns in Time-Series Data  B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98.  J. Han, G. Dong and Y. Yin, Efficient Mining of Partial Periodic Patterns in Time Series Database, ICDE'99.  J. Shieh and E. Keogh. iSAX: Indexing and mining terabyte sized time series. KDD'08  B.-K. Yi, N. Sidiropoulos, T. Johnson, H. V. Jagadish, C. Faloutsos, and A. Biliris. Online Data Mining for Co-Evolving Time Sequences. ICDE'00.  W. Wang, J. Yang, R. Muntz. TAR: Temporal Association Rules on Evolving Numerical Attributes. ICDE’01.  J. Yang, W. Wang, P. S. Yu. Mining Asynchronous Periodic Patterns in Time Series Data. TKDE’03  L. Ye and E. Keogh. Time series shapelets: A new primitive for data mining. KDD'09
  • 463.
    464 Ref: FP forClassification and Clustering  G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and differences. KDD'99.  B. Liu, W. Hsu, Y. Ma. Integrating Classification and Association Rule Mining. KDD’98.  W. Li, J. Han, and J. Pei. CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules. ICDM'01.  H. Wang, W. Wang, J. Yang, and P.S. Yu. Clustering by pattern similarity in large data sets. SIGMOD’ 02.  J. Yang and W. Wang. CLUSEQ: efficient and effective sequence clustering. ICDE’03.  X. Yin and J. Han. CPAR: Classification based on Predictive Association Rules. SDM'03.  H. Cheng, X. Yan, J. Han, and C.-W. Hsu, Discriminative Frequent Pattern Analysis for Effective Classification”, ICDE'07
  • 464.
    465 Ref: Privacy-Preserving FPMining  A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Privacy Preserving Mining of Association Rules. KDD’02.  A. Evfimievski, J. Gehrke, and R. Srikant. Limiting Privacy Breaches in Privacy Preserving Data Mining. PODS’03  J. Vaidya and C. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. KDD’02
  • 465.
    Mining Compressed Patterns D. Xin, H. Cheng, X. Yan, and J. Han. Extracting redundancy- aware top-k patterns. KDD'06  D. Xin, J. Han, X. Yan, and H. Cheng. Mining compressed frequent-pattern sets. VLDB'05  X. Yan, H. Cheng, J. Han, and D. Xin. Summarizing itemset patterns: A profile-based approach. KDD'05 466
  • 466.
    Mining Colossal Patterns F. Zhu, X. Yan, J. Han, P. S. Yu, and H. Cheng. Mining colossal frequent patterns by core pattern fusion. ICDE'07  F. Zhu, Q. Qu, D. Lo, X. Yan, J. Han. P. S. Yu, Mining Top-K Large Structural Patterns in a Massive Network. VLDB’11 467
  • 467.
    468 Ref: FP Miningfrom Data Streams  Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-Dimensional Regression Analysis of Time-Series Data Streams. VLDB'02.  R. M. Karp, C. H. Papadimitriou, and S. Shenker. A simple algorithm for finding frequent elements in streams and bags. TODS 2003.  G. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. VLDB’02.  A. Metwally, D. Agrawal, and A. El Abbadi. Efficient computation of frequent and top-k elements in data streams. ICDT'05
  • 468.
    469 Ref: Freq. PatternMining Applications  T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining Database Structure; or How to Build a Data Quality Browser. SIGMOD'02  M. Khan, H. Le, H. Ahmadi, T. Abdelzaher, and J. Han. DustMiner: Troubleshooting interactive complexity bugs in sensor networks., SenSys'08  Z. Li, S. Lu, S. Myagmar, and Y. Zhou. CP-Miner: A tool for finding copy-paste and related bugs in operating system code. In Proc. 2004 Symp. Operating Systems Design and Implementation (OSDI'04)  Z. Li and Y. Zhou. PR-Miner: Automatically extracting implicit programming rules and detecting violations in large software code. FSE'05  D. Lo, H. Cheng, J. Han, S. Khoo, and C. Sun. Classification of software behaviors for failure detection: A discriminative pattern mining approach. KDD'09  Q. Mei, D. Xin, H. Cheng, J. Han, and C. Zhai. Semantic annotation of frequent patterns. ACM TKDD, 2007.  K. Wang, S. Zhou, J. Han. Profit Mining: From Patterns to Actions. EDBT’02.
  • 469.
    470 Data Mining: Concepts andTechniques (3rd ed.) — Chapter 8 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2011 Han, Kamber & Pei. All rights reserved.
  • 471.
    472 Chapter 8. Classification:Basic Concepts  Classification: Basic Concepts  Decision Tree Induction  Bayes Classification Methods  Rule-Based Classification  Model Evaluation and Selection  Techniques to Improve Classification Accuracy: Ensemble Methods  Summary
  • 472.
473 Supervised vs. Unsupervised Learning
 Supervised learning (classification)
 Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
 New data is classified based on the training set
 Unsupervised learning (clustering)
 The class labels of the training data are unknown
 Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
  • 473.
474 Prediction Problems: Classification vs. Numeric Prediction
 Classification
 predicts categorical class labels (discrete or nominal)
 classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data
 Numeric prediction
 models continuous-valued functions, i.e., predicts unknown or missing values
 Typical applications
 Credit/loan approval
 Medical diagnosis: if a tumor is cancerous or benign
 Fraud detection: if a transaction is fraudulent
 Web page categorization: which category it is
  • 474.
    475 Classification—A Two-Step Process Model construction: describing a set of predetermined classes  Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute  The set of tuples used for model construction is training set  The model is represented as classification rules, decision trees, or mathematical formulae  Model usage: for classifying future or unknown objects  Estimate accuracy of the model  The known label of test sample is compared with the classified result from the model  Accuracy rate is the percentage of test set samples that are correctly classified by the model  Test set is independent of training set (otherwise overfitting)  If the accuracy is acceptable, use the model to classify new data  Note: If the test set is used to select models, it is called validation (test) set
  • 475.
476 Process (1): Model Construction
Training data:
NAME | RANK | YEARS | TENURED
Mike | Assistant Prof | 3 | no
Mary | Assistant Prof | 7 | yes
Bill | Professor | 2 | yes
Jim | Associate Prof | 7 | yes
Dave | Assistant Prof | 6 | no
Anne | Associate Prof | 3 | no
Classification algorithm produces the classifier (model):
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
  • 476.
477 Process (2): Using the Model in Prediction
Testing data:
NAME | RANK | YEARS | TENURED
Tom | Assistant Prof | 2 | no
Merlisa | Associate Prof | 7 | no
George | Professor | 5 | yes
Joseph | Assistant Prof | 7 | yes
Unseen data: (Jeff, Professor, 4) → Tenured?
  • 477.
    478 Chapter 8. Classification:Basic Concepts  Classification: Basic Concepts  Decision Tree Induction  Bayes Classification Methods  Rule-Based Classification  Model Evaluation and Selection  Techniques to Improve Classification Accuracy: Ensemble Methods  Summary
  • 478.
479 Decision Tree Induction: An Example
 Training data set: buys_computer (the data set follows Quinlan's ID3 "Playing Tennis" example)
 Resulting tree: the root tests age?
 age <= 30 → test student? (no → no, yes → yes)
 age 31..40 → yes
 age > 40 → test credit rating? (excellent → no, fair → yes)
age | income | student | credit_rating | buys_computer
<=30 | high | no | fair | no
<=30 | high | no | excellent | no
31…40 | high | no | fair | yes
>40 | medium | no | fair | yes
>40 | low | yes | fair | yes
>40 | low | yes | excellent | no
31…40 | low | yes | excellent | yes
<=30 | medium | no | fair | no
<=30 | low | yes | fair | yes
>40 | medium | yes | fair | yes
<=30 | medium | yes | excellent | yes
31…40 | medium | no | excellent | yes
31…40 | high | yes | fair | yes
>40 | medium | no | excellent | no
  • 479.
    480 Algorithm for DecisionTree Induction  Basic algorithm (a greedy algorithm)  Tree is constructed in a top-down recursive divide-and-conquer manner  At start, all the training examples are at the root  Attributes are categorical (if continuous-valued, they are discretized in advance)  Examples are partitioned recursively based on selected attributes  Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)  Conditions for stopping partitioning  All samples for a given node belong to the same class  There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf  There are no samples left
  • 480.
481 Brief Review of Entropy
(Figure: entropy of a two-class distribution, m = 2)
  • 481.
482 Attribute Selection Measure: Information Gain (ID3/C4.5)
 Select the attribute with the highest information gain
 Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D| / |D|
 Expected information (entropy) needed to classify a tuple in D:
   Info(D) = - Σ_{i=1}^{m} p_i log2(p_i)
 Information needed (after using A to split D into v partitions) to classify D:
   Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)
 Information gained by branching on attribute A:
   Gain(A) = Info(D) - Info_A(D)
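The three formulas translate directly into code; the small check below reproduces Info(D) ≈ 0.94 and Gain(age) ≈ 0.246 for the buys_computer data (0.247 before the slide's intermediate rounding).

```python
from math import log2

def info(counts):
    """Entropy Info(D) = -sum p_i * log2(p_i), from class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def info_after_split(partitions):
    """Info_A(D) = sum |D_j|/|D| * Info(D_j)."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(p) for p in partitions)

def gain(class_counts, partitions):
    return info(class_counts) - info_after_split(partitions)

# buys_computer data: 9 yes / 5 no overall; split on age -> (<=30, 31..40, >40)
print(round(info([9, 5]), 3))                              # 0.94
print(round(gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))    # 0.247 (slide: 0.246)
```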
  • 482.
483 Attribute Selection: Information Gain
 Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples)
 Info(D) = I(9, 5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
 Split on age:
age | p_i | n_i | I(p_i, n_i)
<=30 | 2 | 3 | 0.971
31…40 | 4 | 0 | 0
>40 | 3 | 2 | 0.971
 Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694, where (5/14) I(2,3) means "age <= 30" has 5 out of 14 samples, with 2 yes'es and 3 no's
 Hence Gain(age) = Info(D) - Info_age(D) = 0.246
 Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048
 (Training data: the buys_computer table from the earlier slide)
  • 483.
484 Computing Information Gain for Continuous-Valued Attributes
 Let attribute A be a continuous-valued attribute
 Must determine the best split point for A
 Sort the values of A in increasing order
 Typically, the midpoint between each pair of adjacent values is considered as a possible split point
 (a_i + a_{i+1}) / 2 is the midpoint between the values of a_i and a_{i+1}
 The point with the minimum expected information requirement for A is selected as the split point for A
 Split: D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
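A sketch of the midpoint-based split-point search, assuming a single numeric attribute and string class labels; best_split_point is an illustrative helper, not a library routine.

```python
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * log2(labels.count(c) / total)
                for c in set(labels))

def best_split_point(values, labels):
    """Try the midpoint between each pair of adjacent sorted values; keep the
    one with the minimum expected information requirement Info_A(D)."""
    pairs = sorted(zip(values, labels))
    best = (float('inf'), None)
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue
        mid = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [l for v, l in pairs if v <= mid]
        right = [l for v, l in pairs if v > mid]
        info_a = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        best = min(best, (info_a, mid))
    return best   # (expected information, split point)

# Toy ages with a clean class boundary: splits at 32.5 with zero remaining information
print(best_split_point([25, 28, 30, 35, 40, 45], ['no', 'no', 'no', 'yes', 'yes', 'yes']))
```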
  • 484.
485 Gain Ratio for Attribute Selection (C4.5)
 The information gain measure is biased towards attributes with a large number of values
 C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalization of information gain):
   SplitInfo_A(D) = - Σ_{j=1}^{v} (|D_j| / |D|) × log2(|D_j| / |D|)
 GainRatio(A) = Gain(A) / SplitInfo_A(D)
 Ex. gain_ratio(income) = 0.029 / 1.557 = 0.019
 The attribute with the maximum gain ratio is selected as the splitting attribute
  • 485.
486 Gini Index (CART, IBM IntelligentMiner)
 If a data set D contains examples from n classes, the gini index gini(D) is defined as
   gini(D) = 1 - Σ_{j=1}^{n} p_j^2
   where p_j is the relative frequency of class j in D
 If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as
   gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)
 Reduction in impurity:
   Δgini(A) = gini(D) - gini_A(D)
 The attribute that provides the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)
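A direct transcription of the Gini formulas; the check below reproduces gini(D) = 0.459 and the {low, medium} split value 0.443 from the next slide's class counts.

```python
def gini(counts):
    """gini(D) = 1 - sum p_j^2, from class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(part1, part2):
    """gini_A(D) = |D1|/|D| * gini(D1) + |D2|/|D| * gini(D2)."""
    n1, n2 = sum(part1), sum(part2)
    total = n1 + n2
    return n1 / total * gini(part1) + n2 / total * gini(part2)

# buys_computer: 9 yes / 5 no; income split D1 = {low, medium} (7 yes / 3 no) vs D2 = {high} (2 yes / 2 no)
print(round(gini([9, 5]), 3))                # 0.459
print(round(gini_split([7, 3], [2, 2]), 3))  # 0.443
```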
  • 486.
487 Computation of Gini Index
 Ex. D has 9 tuples in buys_computer = "yes" and 5 in "no":
   gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459
 Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}:
   gini_{income ∈ {low,medium}}(D) = (10/14) Gini(D1) + (4/14) Gini(D2) = 0.443
 Gini_{low,high} is 0.458 and Gini_{medium,high} is 0.450; thus, split on {low, medium} (and {high}) since it has the lowest Gini index
 All attributes are assumed continuous-valued
 May need other tools, e.g., clustering, to get the possible split values
  • 487.
    488 Comparing Attribute SelectionMeasures  The three measures, in general, return good results but  Information gain:  biased towards multivalued attributes  Gain ratio:  tends to prefer unbalanced splits in which one partition is much smaller than the others  Gini index:  biased to multivalued attributes  has difficulty when # of classes is large  tends to favor tests that result in equal-sized partitions and purity in both partitions
  • 488.
489 Other Attribute Selection Measures
 CHAID: a popular decision tree algorithm; measure based on the χ2 test for independence
 C-SEP: performs better than information gain and the gini index in certain cases
 G-statistic: has a close approximation to the χ2 distribution
 MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
 The best tree is the one that requires the fewest bits to both (1) encode the tree and (2) encode the exceptions to the tree
 Multivariate splits (partition based on multiple variable combinations)
 CART: finds multivariate splits based on a linear combination of attributes
 Which attribute selection measure is the best?
 Most give good results; none is significantly superior to the others
  • 489.
490 Overfitting and Tree Pruning
 Overfitting: an induced tree may overfit the training data
 Too many branches, some of which may reflect anomalies due to noise or outliers
 Poor accuracy for unseen samples
 Two approaches to avoid overfitting
 Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
 Difficult to choose an appropriate threshold
 Postpruning: remove branches from a "fully grown" tree, obtaining a sequence of progressively pruned trees
 Use a set of data different from the training data to decide which is the "best pruned tree"
  • 490.
    491 Enhancements to BasicDecision Tree Induction  Allow for continuous-valued attributes  Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals  Handle missing attribute values  Assign the most common value of the attribute  Assign probability to each of the possible values  Attribute construction  Create new attributes based on existing ones that are sparsely represented  This reduces fragmentation, repetition, and replication
  • 491.
    492 Classification in LargeDatabases  Classification—a classical problem extensively studied by statisticians and machine learning researchers  Scalability: Classifying data sets with millions of examples and hundreds of attributes with reasonable speed  Why is decision tree induction popular?  relatively faster learning speed (than other classification methods)  convertible to simple and easy to understand classification rules  can use SQL queries for accessing databases  comparable classification accuracy with other methods  RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)  Builds an AVC-list (attribute, value, class label)
  • 492.
    493 Scalability Framework forRainForest  Separates the scalability aspects from the criteria that determine the quality of the tree  Builds an AVC-list: AVC (Attribute, Value, Class_label)  AVC-set (of an attribute X )  Projection of training dataset onto the attribute X and class label where counts of individual class label are aggregated  AVC-group (of a node n )  Set of AVC-sets of all predictor attributes at the node n
  • 493.
494 RainForest: Training Set and Its AVC Sets
Training examples: the buys_computer table (age, income, student, credit_rating → buys_computer) from the earlier slide
AVC-set on Age: <=30: 2 yes / 3 no; 31..40: 4 yes / 0 no; >40: 3 yes / 2 no
AVC-set on income: high: 2 yes / 2 no; medium: 4 yes / 2 no; low: 3 yes / 1 no
AVC-set on student: yes: 6 yes / 1 no; no: 3 yes / 4 no
AVC-set on credit_rating: fair: 6 yes / 2 no; excellent: 3 yes / 3 no
  • 494.
    495 BOAT (Bootstrapped Optimistic Algorithmfor Tree Construction)  Use a statistical technique called bootstrapping to create several smaller samples (subsets), each fits in memory  Each subset is used to create a tree, resulting in several trees  These trees are examined and used to construct a new tree T’  It turns out that T’ is very close to the tree that would be generated using the whole data set together  Adv: requires only two scans of DB, an incremental alg.
  • 495.
496 Presentation of Classification Results
  • 496.
497 Visualization of a Decision Tree in SGI/MineSet 3.0
  • 497.
498 Interactive Visual Mining by Perception-Based Classification (PBC)
  • 498.
    499 Chapter 8. Classification:Basic Concepts  Classification: Basic Concepts  Decision Tree Induction  Bayes Classification Methods  Rule-Based Classification  Model Evaluation and Selection  Techniques to Improve Classification Accuracy: Ensemble Methods  Summary
  • 499.
    500 Bayesian Classification: Why? A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities  Foundation: Based on Bayes’ Theorem.  Performance: A simple Bayesian classifier, naïve Bayesian classifier, has comparable performance with decision tree and selected neural network classifiers  Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct — prior knowledge can be combined with observed data  Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
  • 500.
501 Bayes' Theorem: Basics
 Total probability theorem: P(B) = Σ_{i=1}^{M} P(B | A_i) P(A_i)
 Bayes' theorem: P(H | X) = P(X | H) P(H) / P(X)
 Let X be a data sample ("evidence"): class label is unknown
 Let H be a hypothesis that X belongs to class C
 Classification is to determine P(H|X) (i.e., the posteriori probability): the probability that the hypothesis holds given the observed data sample X
 P(H) (prior probability): the initial probability
 E.g., X will buy a computer, regardless of age, income, …
 P(X): the probability that the sample data is observed
 P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
 E.g., given that X will buy a computer, the probability that X is 31..40 with medium income
  • 501.
502 Prediction Based on Bayes' Theorem
 Given training data X, the posteriori probability of a hypothesis H, P(H|X), follows Bayes' theorem:
   P(H | X) = P(X | H) P(H) / P(X)
 Informally, this can be viewed as posteriori = likelihood × prior / evidence
 Predicts that X belongs to C_i iff the probability P(C_i|X) is the highest among all the P(C_k|X) for all the k classes
 Practical difficulty: it requires initial knowledge of many probabilities, involving significant computational cost
  • 502.
503 Classification Is to Derive the Maximum Posteriori
 Let D be a training set of tuples and their associated class labels, and each tuple is represented by an n-D attribute vector X = (x1, x2, …, xn)
 Suppose there are m classes C1, C2, …, Cm
 Classification is to derive the maximum posteriori, i.e., the maximal P(C_i|X)
 This can be derived from Bayes' theorem:
   P(C_i | X) = P(X | C_i) P(C_i) / P(X)
 Since P(X) is constant for all classes, only P(X | C_i) P(C_i) needs to be maximized
  • 503.
504 Naïve Bayes Classifier
 A simplifying assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):
   P(X | C_i) = Π_{k=1}^{n} P(x_k | C_i) = P(x_1 | C_i) × P(x_2 | C_i) × … × P(x_n | C_i)
 This greatly reduces the computation cost: only count the class distribution
 If A_k is categorical, P(x_k|C_i) is the # of tuples in C_i having value x_k for A_k divided by |C_i,D| (# of tuples of C_i in D)
 If A_k is continuous-valued, P(x_k|C_i) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:
   g(x, μ, σ) = (1 / (sqrt(2π) σ)) exp(-(x - μ)^2 / (2σ^2)), and P(x_k | C_i) = g(x_k, μ_{C_i}, σ_{C_i})
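A compact naive Bayes sketch for categorical attributes only (the Gaussian case is omitted), trained on the buys_computer table and reproducing the 0.028 vs. 0.007 scores of the worked example two slides ahead; the optional laplace argument sketches the Laplacian correction discussed later.

```python
from collections import Counter, defaultdict

# buys_computer training data: (age, income, student, credit_rating) -> class
DATA = [
    ('<=30','high','no','fair','no'),    ('<=30','high','no','excellent','no'),
    ('31..40','high','no','fair','yes'), ('>40','medium','no','fair','yes'),
    ('>40','low','yes','fair','yes'),    ('>40','low','yes','excellent','no'),
    ('31..40','low','yes','excellent','yes'),  ('<=30','medium','no','fair','no'),
    ('<=30','low','yes','fair','yes'),   ('>40','medium','yes','fair','yes'),
    ('<=30','medium','yes','excellent','yes'), ('31..40','medium','no','excellent','yes'),
    ('31..40','high','yes','fair','yes'),      ('>40','medium','no','excellent','no'),
]

def train(data):
    prior = Counter(row[-1] for row in data)       # class counts
    cond = defaultdict(Counter)                    # (attribute index, class) -> value counts
    for row in data:
        for i, v in enumerate(row[:-1]):
            cond[(i, row[-1])][v] += 1
    return prior, cond

def score(x, cls, prior, cond, laplace=0):
    """P(X|C_i) * P(C_i) under conditional independence, with optional Laplace add-1."""
    total = sum(prior.values())
    p = prior[cls] / total
    for i, v in enumerate(x):
        num = cond[(i, cls)][v] + laplace
        den = prior[cls] + laplace * len({row[i] for row in DATA})
        p *= num / den
    return p

prior, cond = train(DATA)
x = ('<=30', 'medium', 'yes', 'fair')
for cls in prior:                                  # yes: ~0.028, no: ~0.007 (as on the slide)
    print(cls, round(score(x, cls, prior, cond), 3))
```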
  • 504.
505 Naïve Bayes Classifier: Training Dataset
 Classes: C1: buys_computer = 'yes'; C2: buys_computer = 'no'
 Data to be classified: X = (age <= 30, income = medium, student = yes, credit_rating = fair)
 Training data: the buys_computer table (age, income, student, credit_rating → buys_computer) from the earlier slide
  • 505.
506 Naïve Bayes Classifier: An Example
 P(C_i): P(buys_computer = "yes") = 9/14 = 0.643; P(buys_computer = "no") = 5/14 = 0.357
 Compute P(X|C_i) for each class:
 P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222; P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
 P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444; P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
 P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667; P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
 P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667; P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
 X = (age <= 30, income = medium, student = yes, credit_rating = fair)
 P(X|C_i): P(X|buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044; P(X|buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
 P(X|C_i) × P(C_i): P(X|buys_computer = "yes") × P(buys_computer = "yes") = 0.028; P(X|buys_computer = "no") × P(buys_computer = "no") = 0.007
 Therefore, X belongs to class buys_computer = "yes"
  • 506.
507 Avoiding the Zero-Probability Problem
 Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise the predicted probability P(X | C_i) = Π_{k=1}^{n} P(x_k | C_i) will be zero
 Ex. Suppose a dataset with 1000 tuples: income = low (0), income = medium (990), and income = high (10)
 Use the Laplacian correction (or Laplacian estimator): add 1 to each case
 Prob(income = low) = 1/1003; Prob(income = medium) = 991/1003; Prob(income = high) = 11/1003
 The "corrected" probability estimates are close to their "uncorrected" counterparts
  • 507.
    508 Naïve Bayes Classifier:Comments  Advantages  Easy to implement  Good results obtained in most of the cases  Disadvantages  Assumption: class conditional independence, therefore loss of accuracy  Practically, dependencies exist among variables  E.g., hospitals: patients: Profile: age, family history, etc. Symptoms: fever, cough etc., Disease: lung cancer, diabetes, etc.  Dependencies among these cannot be modeled by Naïve Bayes Classifier  How to deal with these dependencies? Bayesian Belief Networks (Chapter 9)
  • 508.
    509 Chapter 8. Classification:Basic Concepts  Classification: Basic Concepts  Decision Tree Induction  Bayes Classification Methods  Rule-Based Classification  Model Evaluation and Selection  Techniques to Improve Classification Accuracy: Ensemble Methods  Summary
  • 509.
510 Using IF-THEN Rules for Classification
 Represent the knowledge in the form of IF-THEN rules
 R: IF age = youth AND student = yes THEN buys_computer = yes
 Rule antecedent/precondition vs. rule consequent
 Assessment of a rule: coverage and accuracy (see the sketch below)
 n_covers = # of tuples covered by R; n_correct = # of tuples correctly classified by R
 coverage(R) = n_covers / |D|   /* D: training data set */
 accuracy(R) = n_correct / n_covers
 If more than one rule is triggered, we need conflict resolution
 Size ordering: assign the highest priority to the triggering rule that has the "toughest" requirement (i.e., with the most attribute tests)
 Class-based ordering: decreasing order of prevalence or misclassification cost per class
 Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts
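A small helper for the two rule-assessment measures, with the rule represented as a dict of attribute tests; rule_metrics and the toy records are illustrative.

```python
def rule_metrics(rule_conditions, rule_class, data):
    """coverage(R) = n_covers / |D|, accuracy(R) = n_correct / n_covers.
    `rule_conditions` maps attribute name -> required value."""
    covers = [row for row in data
              if all(row[a] == v for a, v in rule_conditions.items())]
    if not covers:
        return 0.0, None
    correct = sum(1 for row in covers if row['class'] == rule_class)
    return len(covers) / len(data), correct / len(covers)

data = [
    {'age': 'youth', 'student': 'yes', 'class': 'yes'},
    {'age': 'youth', 'student': 'no',  'class': 'no'},
    {'age': 'senior', 'student': 'yes', 'class': 'yes'},
]
# R: IF age = youth AND student = yes THEN buys_computer = yes
print(rule_metrics({'age': 'youth', 'student': 'yes'}, 'yes', data))  # coverage 1/3, accuracy 1.0
```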
  • 510.
511 Rule Extraction from a Decision Tree
 Rules are easier to understand than large trees
 One rule is created for each path from the root to a leaf
 Each attribute-value pair along a path forms a conjunction: the leaf holds the class prediction
 Rules are mutually exclusive and exhaustive
 Example: rule extraction from our buys_computer decision tree (root age? with branches <=30, 31..40, >40; subtests student? and credit rating?):
 IF age = young AND student = no THEN buys_computer = no
 IF age = young AND student = yes THEN buys_computer = yes
 IF age = mid-age THEN buys_computer = yes
 IF age = old AND credit_rating = excellent THEN buys_computer = no
 IF age = old AND credit_rating = fair THEN buys_computer = yes
  • 511.
    512 Rule Induction: SequentialCovering Method  Sequential covering algorithm: Extracts rules directly from training data  Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER  Rules are learned sequentially, each for a given class Ci will cover many tuples of Ci but none (or few) of the tuples of other classes  Steps:  Rules are learned one at a time  Each time a rule is learned, the tuples covered by the rules are removed  Repeat the process on the remaining tuples until termination condition, e.g., when no more training examples or when the quality of a rule returned is below a user-specified threshold  Comp. w. decision-tree induction: learning a set of rules simultaneously
  • 512.
513 Sequential Covering Algorithm
 while (enough target tuples left)
   generate a rule
   remove positive target tuples satisfying this rule
(Figure: positive examples progressively covered by Rule 1, Rule 2, and Rule 3)
  • 513.
514 Rule Generation
 To generate a rule:
 while (true)
   find the best predicate p
   if foil-gain(p) > threshold then add p to the current rule
   else break
(Figure: positive vs. negative examples, with the rule growing as A3=1, then A3=1 && A1=2, then A3=1 && A1=2 && A8=5)
  • 514.
515 How to Learn-One-Rule?
 Start with the most general rule possible: condition = empty
 Add new attributes by adopting a greedy depth-first strategy
 Pick the one that most improves the rule quality
 Rule-quality measures: consider both coverage and accuracy
 Foil-gain (in FOIL & RIPPER): assesses the information gain of extending the condition; it favors rules that have high accuracy and cover many positive tuples:
   FOIL_Gain = pos' × (log2(pos' / (pos' + neg')) - log2(pos / (pos + neg)))
 Rule pruning based on an independent set of test tuples:
   FOIL_Prune(R) = (pos - neg) / (pos + neg)
   where pos/neg are the # of positive/negative tuples covered by R. If FOIL_Prune is higher for the pruned version of R, prune R
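FOIL_Gain and FOIL_Prune as written above, wrapped as functions; the pos/neg figures in the demo call are made-up counts for illustration.

```python
from math import log2

def foil_gain(pos, neg, pos_new, neg_new):
    """FOIL_Gain = pos' * (log2(pos'/(pos'+neg')) - log2(pos/(pos+neg)))."""
    if pos_new == 0:
        return 0.0
    return pos_new * (log2(pos_new / (pos_new + neg_new)) - log2(pos / (pos + neg)))

def foil_prune(pos, neg):
    """FOIL_Prune(R) = (pos - neg) / (pos + neg); prune R if the pruned
    version of R scores higher."""
    return (pos - neg) / (pos + neg)

# Current rule covers 100 pos / 100 neg; adding a predicate keeps 60 pos / 10 neg
print(round(foil_gain(100, 100, 60, 10), 2))   # 46.66
print(round(foil_prune(60, 10), 3))            # 0.714
```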
  • 515.
    516 Chapter 8. Classification:Basic Concepts  Classification: Basic Concepts  Decision Tree Induction  Bayes Classification Methods  Rule-Based Classification  Model Evaluation and Selection  Techniques to Improve Classification Accuracy: Ensemble Methods  Summary
  • 516.
    Model Evaluation andSelection  Evaluation metrics: How can we measure accuracy? Other metrics to consider?  Use validation test set of class-labeled tuples instead of training set when assessing accuracy  Methods for estimating a classifier’s accuracy:  Holdout method, random subsampling  Cross-validation  Bootstrap  Comparing classifiers:  Confidence intervals  Cost-benefit analysis and ROC Curves 517
  • 517.
518 Classifier Evaluation Metrics: Confusion Matrix
Confusion matrix (actual class vs. predicted class):
Actual \ Predicted | C1 | ¬C1
C1 | True Positives (TP) | False Negatives (FN)
¬C1 | False Positives (FP) | True Negatives (TN)
Example of a confusion matrix:
Actual \ Predicted | buy_computer = yes | buy_computer = no | Total
buy_computer = yes | 6954 | 46 | 7000
buy_computer = no | 412 | 2588 | 3000
Total | 7366 | 2634 | 10000
 Given m classes, an entry CM_{i,j} in a confusion matrix indicates the # of tuples in class i that were labeled by the classifier as class j
 May have extra rows/columns to provide totals
  • 518.
519 Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity
 Classifier accuracy, or recognition rate: the percentage of test set tuples that are correctly classified: Accuracy = (TP + TN) / All
 Error rate: 1 - accuracy, or Error rate = (FP + FN) / All
 Class imbalance problem:
 One class may be rare, e.g., fraud or HIV-positive
 Significant majority of the negative class and minority of the positive class
 Sensitivity: true positive recognition rate: Sensitivity = TP / P
 Specificity: true negative recognition rate: Specificity = TN / N
Actual \ Predicted | C | ¬C | Total
C | TP | FN | P
¬C | FP | TN | N
Total | P' | N' | All
  • 519.
520 Classifier Evaluation Metrics: Precision and Recall, and F-measures
 Precision: exactness, i.e., what % of tuples that the classifier labeled as positive are actually positive: Precision = TP / (TP + FP)
 Recall: completeness, i.e., what % of positive tuples did the classifier label as positive: Recall = TP / (TP + FN)
 A perfect score is 1.0
 Inverse relationship between precision and recall
 F measure (F1 or F-score): the harmonic mean of precision and recall: F = 2 × precision × recall / (precision + recall)
 F_β: a weighted measure of precision and recall that assigns β times as much weight to recall as to precision: F_β = (1 + β^2) × precision × recall / (β^2 × precision + recall)
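The precision, recall, and F-measure formulas as code; the demo reuses the TP/FP/FN counts of the cancer example on the next slide.

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_beta(p, r, beta=1.0):
    """F1 is the harmonic mean; F_beta weights recall beta times as much as precision."""
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

# cancer example from the following slide: TP = 90, FP = 140, FN = 210
p, r = precision(90, 140), recall(90, 210)
print(round(p, 4), round(r, 4), round(f_beta(p, r), 4))   # 0.3913 0.3 0.3396
```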
  • 520.
521 Classifier Evaluation Metrics: Example
Actual \ Predicted | cancer = yes | cancer = no | Total | Recognition (%)
cancer = yes | 90 | 210 | 300 | 30.00 (sensitivity)
cancer = no | 140 | 9560 | 9700 | 98.56 (specificity)
Total | 230 | 9770 | 10000 | 96.40 (accuracy)
 Precision = 90/230 = 39.13%; Recall = 90/300 = 30.00%
  • 521.
    Evaluating Classifier Accuracy: Holdout& Cross-Validation Methods  Holdout method  Given data is randomly partitioned into two independent sets  Training set (e.g., 2/3) for model construction  Test set (e.g., 1/3) for accuracy estimation  Random sampling: a variation of holdout  Repeat holdout k times, accuracy = avg. of the accuracies obtained  Cross-validation (k-fold, where k = 10 is most popular)  Randomly partition the data into k mutually exclusive subsets, each approximately equal size  At i-th iteration, use Di as test set and others as training set  Leave-one-out: k folds where k = # of tuples, for small sized data  *Stratified cross-validation*: folds are stratified so that class dist. in each fold is approx. the same as that in the initial data 522
  • 522.
523 Evaluating Classifier Accuracy: Bootstrap
 Bootstrap
 Works well with small data sets
 Samples the given training tuples uniformly with replacement
 i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
 There are several bootstrap methods; a common one is the .632 bootstrap
 A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data end up in the bootstrap sample, and the remaining 36.8% form the test set (since (1 - 1/d)^d ≈ e^{-1} = 0.368)
 Repeat the sampling procedure k times; the overall accuracy of the model combines the two accuracies over the k rounds:
   Acc(M) = (1/k) Σ_{i=1}^{k} (0.632 × Acc(M_i)_test_set + 0.368 × Acc(M_i)_train_set)
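A sketch of one bootstrap round: sample d tuples with replacement and let the never-drawn tuples form the test set; with a large d, roughly 63.2% of the distinct tuples end up in the training sample, matching the slide. The accuracy-combination step is only indicated in a comment, since it depends on the model being evaluated.

```python
import random

def bootstrap_split(data, rng=random):
    """One bootstrap round: sample |data| tuples with replacement for training;
    tuples never drawn form the test set (about 36.8% of them on average)."""
    d = len(data)
    train_idx = [rng.randrange(d) for _ in range(d)]
    test_idx = set(range(d)) - set(train_idx)
    return [data[i] for i in train_idx], [data[i] for i in test_idx]

random.seed(0)
data = list(range(10000))
train, test = bootstrap_split(data)
print(len(set(train)) / len(data))   # close to 0.632
print(len(test) / len(data))         # close to 0.368

# .632 bootstrap accuracy over k rounds (acc_test_i / acc_train_i from your model):
# Acc(M) = (1/k) * sum_i (0.632 * acc_test_i + 0.368 * acc_train_i)
```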
  • 523.
524 Estimating Confidence Intervals: Classifier Models M1 vs. M2
 Suppose we have two classifiers, M1 and M2; which one is better?
 Use 10-fold cross-validation to obtain the mean error rates err(M1) and err(M2)
 These mean error rates are just estimates of the error on the true population of future data cases
 What if the difference between the two error rates is just attributed to chance?
 Use a test of statistical significance
 Obtain confidence limits for our error estimates
  • 524.
    Estimating Confidence Intervals: NullHypothesis  Perform 10-fold cross-validation  Assume samples follow a t distribution with k–1 degrees of freedom (here, k=10)  Use t-test (or Student’s t-test)  Null Hypothesis: M1 & M2 are the same  If we can reject null hypothesis, then  we conclude that the difference between M1 & M2 is statistically significant  Chose model with lower error rate 525
  • 525.
526 Estimating Confidence Intervals: t-test
 If only one test set is available: pairwise comparison
 For the ith round of 10-fold cross-validation, the same cross partitioning is used to obtain err(M1)_i and err(M2)_i
 Average over the 10 rounds to get the mean error rates err(M1) and err(M2)
 The t-test computes the t-statistic with k - 1 degrees of freedom:
   t = (err(M1) - err(M2)) / sqrt(var(M1 - M2) / k)
   where var(M1 - M2) = (1/k) Σ_{i=1}^{k} [err(M1)_i - err(M2)_i - (err(M1) - err(M2))]^2
 If two test sets are available: use the non-paired t-test with
   var(M1 - M2) = var(M1)/k1 + var(M2)/k2
   where k1 and k2 are the # of cross-validation samples used for M1 and M2, respectively
  • 526.
    Estimating Confidence Intervals: Tablefor t-distribution  Symmetric  Significance level, e.g., sig = 0.05 or 5% means M1 & M2 are significantly different for 95% of population  Confidence limit, z = sig/2 527
  • 527.
    Estimating Confidence Intervals: StatisticalSignificance  Are M1 & M2 significantly different?  Compute t. Select significance level (e.g. sig = 5%)  Consult table for t-distribution: Find t value corresponding to k-1 degrees of freedom (here, 9)  t-distribution is symmetric: typically upper % points of distribution shown → look up value for confidence limit z=sig/2 (here, 0.025)  If t > z or t < -z, then t value lies in rejection region:  Reject null hypothesis that mean error rates of M1 & M2 are same  Conclude: statistically significant difference between M1 & M2  Otherwise, conclude that any difference is chance 528
  • 528.
    Model Selection: ROCCurves  ROC (Receiver Operating Characteristics) curves: for visual comparison of classification models  Originated from signal detection theory  Shows the trade-off between the true positive rate and the false positive rate  The area under the ROC curve is a measure of the accuracy of the model  Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list  The closer to the diagonal line (i.e., the closer the area is to 0.5), the less accurate is the model  Vertical axis represents the true positive rate  Horizontal axis rep. the false positive rate  The plot also shows a diagonal line  A model with perfect accuracy will have an area of 1.0 529
  • 529.
    Issues Affecting ModelSelection  Accuracy  classifier accuracy: predicting class label  Speed  time to construct the model (training time)  time to use the model (classification/prediction time)  Robustness: handling noise and missing values  Scalability: efficiency in disk-resident databases  Interpretability  understanding and insight provided by the model  Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules 530
  • 530.
    531 Chapter 8. Classification:Basic Concepts  Classification: Basic Concepts  Decision Tree Induction  Bayes Classification Methods  Rule-Based Classification  Model Evaluation and Selection  Techniques to Improve Classification Accuracy: Ensemble Methods  Summary
  • 531.
    Ensemble Methods: Increasingthe Accuracy  Ensemble methods  Use a combination of models to increase accuracy  Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an improved model M*  Popular ensemble methods  Bagging: averaging the prediction over a collection of classifiers  Boosting: weighted vote with a collection of classifiers  Ensemble: combining a set of heterogeneous classifiers 532
  • 532.
    Bagging: Bootstrap Aggregation  Analogy: Diagnosis based on multiple doctors’ majority vote  Training  Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., bootstrap)  A classifier model Mi is learned for each training set Di  Classification: classify an unknown sample X  Each classifier Mi returns its class prediction  The bagged classifier M* counts the votes and assigns the class with the most votes to X  Prediction: can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple  Accuracy  Often significantly better than a single classifier derived from D  For noisy data: not considerably worse, more robust  Proven improved accuracy in prediction 533
  • 533.
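As a sketch, bagging can be tried with scikit-learn's BaggingClassifier (the estimator keyword assumes scikit-learn 1.2+; older releases call it base_estimator); the breast-cancer data set is only an example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# k bootstrap samples of D, one decision tree per sample, majority vote at prediction time
bagged = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=25, random_state=0)
single = DecisionTreeClassifier(random_state=0)

print("single tree :", cross_val_score(single, X, y, cv=10).mean())
print("bagged trees:", cross_val_score(bagged, X, y, cv=10).mean())
```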
    Boosting  Analogy: Consultseveral doctors, based on a combination of weighted diagnoses—weight assigned based on the previous diagnosis accuracy  How boosting works?  Weights are assigned to each training tuple  A series of k classifiers is iteratively learned  After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to pay more attention to the training tuples that were misclassified by Mi  The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy  Boosting algorithm can be extended for numeric prediction  Comparing with bagging: Boosting tends to have greater accuracy, but it also risks overfitting the model to misclassified data 534
  • 534.
    535 Adaboost (Freund and Schapire, 1997)  Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)  Initially, all the weights of tuples are set the same (1/d)  Generate k classifiers in k rounds. At round i,  Tuples from D are sampled (with replacement) to form a training set Di of the same size  Each tuple’s chance of being selected is based on its weight  A classification model Mi is derived from Di  Its error rate is calculated using Di as a test set  If a tuple is misclassified, its weight is increased, o.w. it is decreased  Error rate: err(Xj) is the misclassification error of tuple Xj (1 if misclassified, 0 otherwise). Classifier Mi’s error rate is the sum of the weights of the misclassified tuples: error(Mi) = Σ_j w_j × err(Xj)  The weight of classifier Mi’s vote is log[(1 – error(Mi)) / error(Mi)]
  • 535.
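A minimal sketch of discrete AdaBoost in Python. It fits each round with tuple weights instead of resampling and uses the conventional ½·log((1 – err)/err) vote weight, so it differs in small details from the slide's formulation; decision stumps from scikit-learn are an arbitrary choice of base learner.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, k=10):
    """Discrete AdaBoost sketch for NumPy labels y in {-1, +1}: reweight tuples so
    that each new stump focuses on the tuples misclassified so far."""
    d = len(y)
    w = np.full(d, 1.0 / d)                       # start with uniform tuple weights
    stumps, alphas = [], []
    for _ in range(k):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)          # weighted fit instead of resampling
        pred = stump.predict(X)
        err = np.sum(w * (pred != y))             # weighted error rate (weights sum to 1)
        if err >= 0.5:                            # stop if no better than chance
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))  # classifier vote weight
        w *= np.exp(-alpha * y * pred)            # increase weights of misclassified tuples
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    votes = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(votes)
```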
    Random Forest (Breiman2001)  Random Forest:  Each classifier in the ensemble is a decision tree classifier and is generated using a random selection of attributes at each node to determine the split  During classification, each tree votes and the most popular class is returned  Two Methods to construct Random Forest:  Forest-RI (random input selection): Randomly select, at each node, F attributes as candidates for the split at the node. The CART methodology is used to grow the trees to maximum size  Forest-RC (random linear combinations): Creates new attributes (or features) that are a linear combination of the existing attributes (reduces the correlation between individual classifiers)  Comparable in accuracy to Adaboost, but more robust to errors and outliers  Insensitive to the number of attributes selected for consideration at each split, and faster than bagging or boosting 536
  • 536.
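A random forest in the Forest-RI spirit is a one-liner with scikit-learn; max_features="sqrt" corresponds to picking a random subset of attributes as split candidates at each node (the data set and parameter values are illustrative only).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Each tree sees a bootstrap sample and considers a random subset of
# sqrt(#attributes) candidate attributes at every split.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
print("10-fold accuracy:", cross_val_score(rf, X, y, cv=10).mean())
```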
    Classification of Class-ImbalancedData Sets  Class-imbalance problem: Rare positive example but numerous negative ones, e.g., medical diagnosis, fraud, oil-spill, fault, etc.  Traditional methods assume a balanced distribution of classes and equal error costs: not suitable for class-imbalanced data  Typical methods for imbalance data in 2-class classification:  Oversampling: re-sampling of data from positive class  Under-sampling: randomly eliminate tuples from negative class  Threshold-moving: moves the decision threshold, t, so that the rare class tuples are easier to classify, and hence, less chance of costly false negative errors  Ensemble techniques: Ensemble multiple classifiers introduced above  Still difficult for class imbalance problem on multiclass tasks 537
  • 537.
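A naive oversampling sketch (assuming X, y are NumPy arrays and the positive class is the minority); real projects would more likely reach for class weights or a library such as imbalanced-learn.

```python
import numpy as np

def random_oversample(X, y, positive_label=1, seed=0):
    """Naive oversampling: resample the rare positive class with replacement
    until both classes have the same number of tuples."""
    rng = np.random.default_rng(seed)
    pos = np.where(y == positive_label)[0]
    neg = np.where(y != positive_label)[0]
    extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)  # duplicates of positives
    idx = np.concatenate([neg, pos, extra])
    rng.shuffle(idx)
    return X[idx], y[idx]
```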
    538 Chapter 8. Classification:Basic Concepts  Classification: Basic Concepts  Decision Tree Induction  Bayes Classification Methods  Rule-Based Classification  Model Evaluation and Selection  Techniques to Improve Classification Accuracy: Ensemble Methods  Summary
  • 538.
    Summary (I)  Classificationis a form of data analysis that extracts models describing important data classes.  Effective and scalable methods have been developed for decision tree induction, Naive Bayesian classification, rule-based classification, and many other classification methods.  Evaluation metrics include: accuracy, sensitivity, specificity, precision, recall, F measure, and Fß measure.  Stratified k-fold cross-validation is recommended for accuracy estimation. Bagging and boosting can be used to increase overall accuracy by learning and combining a series of individual models. 539
  • 539.
    Summary (II)  Significancetests and ROC curves are useful for model selection.  There have been numerous comparisons of the different classification methods; the matter remains a research topic  No single method has been found to be superior over all others for all data sets  Issues such as accuracy, training time, robustness, scalability, and interpretability must be considered and can involve trade- offs, further complicating the quest for an overall superior method 540
  • 540.
    References (1)  C.Apte and S. Weiss. Data mining with decision trees and decision rules. Future Generation Computer Systems, 13, 1997  C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press, 1995  L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984  C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2): 121-168, 1998  P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data for scaling machine learning. KDD'95  H. Cheng, X. Yan, J. Han, and C.-W. Hsu, Discriminative Frequent Pattern Analysis for Effective Classification, ICDE'07  H. Cheng, X. Yan, J. Han, and P. S. Yu, Direct Discriminative Pattern Mining for Effective Classification, ICDE'08  W. Cohen. Fast effective rule induction. ICML'95  G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu. Mining top-k covering rule groups for gene expression data. SIGMOD'05 541
  • 541.
    References (2)  A.J. Dobson. An Introduction to Generalized Linear Models. Chapman & Hall, 1990.  G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and differences. KDD'99.  R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2ed. John Wiley, 2001  U. M. Fayyad. Branching on attribute values in decision tree generation. AAAI’94.  Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Computer and System Sciences, 1997.  J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision tree construction of large datasets. VLDB’98.  J. Gehrke, V. Gant, R. Ramakrishnan, and W.-Y. Loh, BOAT -- Optimistic Decision Tree Construction. SIGMOD'99.  T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001.  D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 1995.  W. Li, J. Han, and J. Pei, CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules, ICDM'01. 542
  • 542.
    References (3)  T.-S.Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 2000.  J. Magidson. The Chaid approach to segmentation modeling: Chi-squared automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research, Blackwell Business, 1994.  M. Mehta, R. Agrawal, and J. Rissanen. SLIQ : A fast scalable classifier for data mining. EDBT'96.  T. M. Mitchell. Machine Learning. McGraw Hill, 1997.  S. K. Murthy, Automatic Construction of Decision Trees from Data: A Multi- Disciplinary Survey, Data Mining and Knowledge Discovery 2(4): 345-389, 1998  J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.  J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. ECML’93.  J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.  J. R. Quinlan. Bagging, boosting, and c4.5. AAAI'96. 543
  • 543.
    References (4)  R.Rastogi and K. Shim. Public: A decision tree classifier that integrates building and pruning. VLDB’98.  J. Shafer, R. Agrawal, and M. Mehta. SPRINT : A scalable parallel classifier for data mining. VLDB’96.  J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan Kaufmann, 1990.  P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley, 2005.  S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufman, 1991.  S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.  I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2ed. Morgan Kaufmann, 2005.  X. Yin and J. Han. CPAR: Classification based on predictive association rules. SDM'03  H. Yu, J. Yang, and J. Han. Classifying large data sets using SVM with hierarchical clusters. KDD'03. 544
    CS412 Midterm Exam Statistics  Opinion Question Answering:  Like the style: 70.83%, dislike: 29.16%  Exam is hard: 55.75%, easy: 0.6%, just right: 43.63%  Time: plenty: 3.03%, enough: 36.96%, not enough: 60%  Score distribution: # of students (Total: 180)  >=90: 24  80-89: 54  70-79: 46  60-69: 37  50-59: 15  40-49: 2  <40: 2  Final grading is based on overall score accumulation and relative class distributions 546
  • 546.
    547 Issues: Evaluating ClassificationMethods  Accuracy  classifier accuracy: predicting class label  predictor accuracy: guessing value of predicted attributes  Speed  time to construct the model (training time)  time to use the model (classification/prediction time)  Robustness: handling noise and missing values  Scalability: efficiency in disk-resident databases  Interpretability  understanding and insight provided by the model  Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
  • 547.
    548 Predictor Error Measures  Measure predictor accuracy: measure how far off the predicted value is from the actual known value  Loss function: measures the error betw. yi and the predicted value yi’  Absolute error: |yi – yi’|  Squared error: (yi – yi’)²  Test error (generalization error): the average loss over the test set  Mean absolute error: Σ_{i=1..d} |yi – yi’| / d  Mean squared error: Σ_{i=1..d} (yi – yi’)² / d  Relative absolute error: Σ_{i=1..d} |yi – yi’| / Σ_{i=1..d} |yi – ȳ|  Relative squared error: Σ_{i=1..d} (yi – yi’)² / Σ_{i=1..d} (yi – ȳ)²  The mean squared error exaggerates the presence of outliers  Popularly used: the (square) root mean squared error and, similarly, the root relative squared error
  • 548.
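The measures above translate directly into NumPy (a small sketch; the sample values are made up).

```python
import numpy as np

def prediction_errors(y_true, y_pred):
    """Compute the predictor error measures listed above."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mean_y = y_true.mean()
    mae = np.mean(np.abs(y_true - y_pred))                                   # mean absolute error
    mse = np.mean((y_true - y_pred) ** 2)                                    # mean squared error
    rae = np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true - mean_y))  # relative absolute error
    rse = np.sum((y_true - y_pred) ** 2) / np.sum((y_true - mean_y) ** 2)    # relative squared error
    return {"MAE": mae, "RMSE": np.sqrt(mse), "RAE": rae, "RSE": rse}

print(prediction_errors([3.0, 5.0, 2.5, 7.0], [2.5, 5.0, 4.0, 8.0]))
```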
    549 Scalable Decision TreeInduction Methods  SLIQ (EDBT’96 — Mehta et al.)  Builds an index for each attribute and only class list and the current attribute list reside in memory  SPRINT (VLDB’96 — J. Shafer et al.)  Constructs an attribute list data structure  PUBLIC (VLDB’98 — Rastogi & Shim)  Integrates tree splitting and tree pruning: stop growing the tree earlier  RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)  Builds an AVC-list (attribute, value, class label)  BOAT (PODS’99 — Gehrke, Ganti, Ramakrishnan & Loh)  Uses bootstrapping to create several small samples
  • 549.
    550 Data Cube-Based Decision-TreeInduction  Integration of generalization with decision-tree induction (Kamber et al.’97)  Classification at primitive concept levels  E.g., precise temperature, humidity, outlook, etc.  Low-level concepts, scattered classes, bushy classification- trees  Semantic interpretation problems  Cube-based multi-level classification  Relevance analysis at multi-levels  Information-gain analysis with dimension + level
  • 550.
    551 Data Mining: Concepts andTechniques (3rd ed.) — Chapter 9 — Classification: Advanced Methods Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2011 Han, Kamber & Pei. All rights reserved.
  • 551.
    552 Chapter 9. Classification:Advanced Methods  Bayesian Belief Networks  Classification by Backpropagation  Support Vector Machines  Classification by Using Frequent Patterns  Lazy Learners (or Learning from Your Neighbors)  Other Classification Methods  Additional Topics Regarding Classification  Summary
  • 552.
    553 Bayesian Belief Networks Bayesian belief networks (also known as Bayesian networks, probabilistic networks): allow class conditional independencies between subsets of variables  A (directed acyclic) graphical model of causal relationships  Represents dependency among the variables  Gives a specification of joint probability distribution X Y Z P  Nodes: random variables  Links: dependency  X and Y are the parents of Z, and Y is the parent of P  No dependency between Z and P  Has no loops/cycles
  • 553.
    554 Bayesian Belief Network: An Example  Network (figure): FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, Dyspnea; FH and S are the parents of LC  CPT: Conditional Probability Table for variable LungCancer — shows the conditional probability for each possible combination of the values of its parents:
             (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
      LC       0.8       0.5        0.7        0.1
      ~LC      0.2       0.5        0.3        0.9
 Derivation of the probability of a particular combination of values of X, from the CPT: P(x1, …, xn) = Π_{i=1..n} P(xi | Parents(Yi))
  • 554.
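A tiny numeric illustration of the chain-rule derivation. Only the LungCancer CPT comes from the slide; the marginals for FamilyHistory and Smoker are made-up numbers, and independence of FH and S is assumed for simplicity.

```python
# CPTs written as plain dictionaries; only the LungCancer table comes from the
# slide, the marginals for FH and S are hypothetical.
p_fh = {True: 0.10, False: 0.90}                  # hypothetical P(FamilyHistory)
p_s = {True: 0.30, False: 0.70}                   # hypothetical P(Smoker)
p_lc_given = {                                    # P(LungCancer | FH, S) from the CPT above
    (True, True): 0.8, (True, False): 0.5,
    (False, True): 0.7, (False, False): 0.1,
}

def joint(fh, s, lc):
    """P(FH, S, LC) = P(FH) * P(S) * P(LC | FH, S), i.e. the product of each
    variable's probability given its parents."""
    p_lc = p_lc_given[(fh, s)] if lc else 1 - p_lc_given[(fh, s)]
    return p_fh[fh] * p_s[s] * p_lc

print(joint(fh=True, s=True, lc=True))   # 0.10 * 0.30 * 0.8 = 0.024
```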
    555 Training Bayesian Networks:Several Scenarios  Scenario 1: Given both the network structure and all variables observable: compute only the CPT entries  Scenario 2: Network structure known, some variables hidden: gradient descent (greedy hill-climbing) method, i.e., search for a solution along the steepest descent of a criterion function  Weights are initialized to random probability values  At each iteration, it moves towards what appears to be the best solution at the moment, w.o. backtracking  Weights are updated at each iteration & converge to local optimum  Scenario 3: Network structure unknown, all variables observable: search through the model space to reconstruct network topology  Scenario 4: Unknown structure, all hidden variables: No good algorithms known for this purpose  D. Heckerman. A Tutorial on Learning with Bayesian Networks. In Learning in Graphical Models, M. Jordan, ed.. MIT Press, 1999.
  • 555.
    556 Chapter 9. Classification:Advanced Methods  Bayesian Belief Networks  Classification by Backpropagation  Support Vector Machines  Classification by Using Frequent Patterns  Lazy Learners (or Learning from Your Neighbors)  Other Classification Methods  Additional Topics Regarding Classification  Summary
  • 556.
    557 Classification by Backpropagation Backpropagation: A neural network learning algorithm  Started by psychologists and neurobiologists to develop and test computational analogues of neurons  A neural network: A set of connected input/output units where each connection has a weight associated with it  During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples  Also referred to as connectionist learning due to the
  • 557.
    558 Neural Network asa Classifier  Weakness  Long training time  Require a number of parameters typically best determined empirically, e.g., the network topology or “structure.”  Poor interpretability: Difficult to interpret the symbolic meaning behind the learned weights and of “hidden units” in the network  Strength  High tolerance to noisy data  Ability to classify untrained patterns  Well-suited for continuous-valued inputs and outputs  Successful on an array of real-world data, e.g., hand-written letters  Algorithms are inherently parallel  Techniques have recently been developed for the extraction of
  • 558.
    559 A Multi-Layer Feed-Forward Neural Network  (Figure: an input vector X feeds the input layer, which is connected through weights wij to a hidden layer and then to the output layer, producing the output vector)  Weight update rule: wj^(k+1) = wj^(k) + λ (yi – ŷi^(k)) xij
  • 559.
    560 How A Multi-LayerNeural Network Works  The inputs to the network correspond to the attributes measured for each training tuple  Inputs are fed simultaneously into the units making up the input layer  They are then weighted and fed simultaneously to a hidden layer  The number of hidden layers is arbitrary, although usually only one  The weighted outputs of the last hidden layer are input to units making up the output layer, which emits the network's prediction  The network is feed-forward: None of the weights cycles back to an input unit or to an output unit of a previous layer  From a statistical point of view, networks perform nonlinear
  • 560.
    561 Defining a NetworkTopology  Decide the network topology: Specify # of units in the input layer, # of hidden layers (if > 1), # of units in each hidden layer, and # of units in the output layer  Normalize the input values for each attribute measured in the training tuples to [0.0—1.0]  One input unit per domain value, each initialized to 0  Output, if for classification and more than two classes, one output unit per class is used  Once a network has been trained and its accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights
  • 561.
    562 Backpropagation  Iteratively processa set of training tuples & compare the network's prediction with the actual known target value  For each training tuple, the weights are modified to minimize the mean squared error between the network's prediction and the actual target value  Modifications are made in the “backwards” direction: from the output layer, through each hidden layer down to the first hidden layer, hence “backpropagation”  Steps  Initialize weights to small random numbers, associated with biases  Propagate the inputs forward (by applying activation function)  Backpropagate the error (by updating weights and biases) 
  • 562.
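A compact NumPy sketch of the three steps on a toy XOR problem (one hidden layer, sigmoid units, squared-error loss); the layer sizes, learning rate, and epoch count are arbitrary, and convergence may need a different seed or more epochs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set (XOR), inputs already normalized to [0, 1]
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Small random initial weights and zero biases: 2 inputs -> 3 hidden units -> 1 output
W1, b1 = rng.normal(0, 0.5, (2, 3)), np.zeros(3)
W2, b2 = rng.normal(0, 0.5, (3, 1)), np.zeros(1)
lr = 0.5

for epoch in range(5000):
    # Propagate the inputs forward
    h = sigmoid(X @ W1 + b1)            # hidden layer activations
    out = sigmoid(h @ W2 + b2)          # network prediction

    # Backpropagate the error (squared-error loss; sigmoid derivative = out*(1-out))
    delta_out = (out - y) * out * (1 - out)
    delta_h = (delta_out @ W2.T) * h * (1 - h)

    # Update weights and biases in the backwards direction
    W2 -= lr * h.T @ delta_out;  b2 -= lr * delta_out.sum(axis=0)
    W1 -= lr * X.T @ delta_h;    b1 -= lr * delta_h.sum(axis=0)

print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2))  # should approach [0, 1, 1, 0]
```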
    563 Neuron: A Hidden/Output Layer Unit  An n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping  The inputs to the unit are outputs from the previous layer. They are multiplied by their corresponding weights to form a weighted sum, which is added to the bias associated with the unit. Then a nonlinear activation function f is applied to it.  (Figure: inputs x0, x1, …, xn with weights w0, w1, …, wn feed a weighted sum Σ, followed by the activation function f, producing output y; μk denotes the bias)  Example: y = sign(Σ_{i=0..n} wi xi – μk)
  • 563.
    564 Efficiency and Interpretability Efficiency of backpropagation: Each epoch (one iteration through the training set) takes O(|D| * w), with |D| tuples and w weights, but # of epochs can be exponential to n, the number of inputs, in worst case  For easier comprehension: Rule extraction by network pruning  Simplify the network structure by removing weighted links that have the least effect on the trained network  Then perform link, unit, or activation value clustering  The set of input and activation values are studied to derive rules describing the relationship between the input and hidden unit layers  Sensitivity analysis: assess the impact that a given input variable has on a network output. The knowledge gained from this analysis can be represented in rules
  • 564.
    565 Chapter 9. Classification:Advanced Methods  Bayesian Belief Networks  Classification by Backpropagation  Support Vector Machines  Classification by Using Frequent Patterns  Lazy Learners (or Learning from Your Neighbors)  Other Classification Methods  Additional Topics Regarding Classification  Summary
  • 565.
    566 Classification: A Mathematical Mapping  Classification: predicts categorical class labels  E.g., personal homepage classification  xi = (x1, x2, x3, …), yi = +1 or –1  x1: # of word “homepage”  x2: # of word “welcome”  Mathematically, x ∈ X = ℝ^n, y ∈ Y = {+1, –1}, and we want to derive a function f: X → Y  Linear Classification  Binary classification problem  Data above the separating line belongs to class ‘x’; data below it belongs to class ‘o’  Examples: SVM, Perceptron, Probabilistic Classifiers
  • 566.
    567 Discriminative Classifiers  Advantages Prediction accuracy is generally high  As compared to Bayesian methods – in general  Robust, works when training examples contain errors  Fast evaluation of the learned target function  Bayesian networks are normally slow  Criticism  Long training time  Difficult to understand the learned function (weights)  Bayesian networks can be used easily for pattern discovery  Not easy to incorporate domain knowledge  Easy in the form of priors on the data or distributions
  • 567.
    568 SVM—Support Vector Machines A relatively new classification method for both linear and nonlinear data  It uses a nonlinear mapping to transform the original training data into a higher dimension  With the new dimension, it searches for the linear optimal separating hyperplane (i.e., “decision boundary”)  With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane  SVM finds this hyperplane using support vectors (“essential” training tuples) and margins (defined by the support vectors)
  • 568.
    569 SVM—History and Applications Vapnik and colleagues (1992)—groundwork from Vapnik & Chervonenkis’ statistical learning theory in 1960s  Features: training can be slow but accuracy is high owing to their ability to model complex nonlinear decision boundaries (margin maximization)  Used for: classification and numeric prediction  Applications:  handwritten digit recognition, object recognition, speaker identification, benchmarking time-series prediction tests
  • 569.
    571 SVM—Margins and Support Vectors (figure)
  • 571.
    572 SVM—When Data IsLinearly Separable m Let data D be (X1, y1), …, (X|D|, y|D|), where Xi is the set of training tuples associated with the class labels yi There are infinite lines (hyperplanes) separating the two classes but we want to find the best one (the one that minimizes classification error on unseen data) SVM searches for the hyperplane with the largest margin, i.e., maximum marginal hyperplane (MMH)
  • 572.
    573 SVM—Linearly Separable  A separating hyperplane can be written as W ● X + b = 0, where W = {w1, w2, …, wn} is a weight vector and b a scalar (bias)  For 2-D it can be written as w0 + w1 x1 + w2 x2 = 0  The hyperplanes defining the sides of the margin: H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and H2: w0 + w1 x1 + w2 x2 ≤ –1 for yi = –1  Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors  This becomes a constrained (convex) quadratic optimization problem: quadratic objective function and linear constraints  Quadratic Programming (QP)  Lagrangian multipliers
  • 573.
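A minimal scikit-learn sketch on toy 2-D data: a linear-kernel SVC with a large C approximates the hard-margin case, and support_vectors_ exposes the tuples that define the margin.

```python
import numpy as np
from sklearn import svm

# Two linearly separable 2-D classes (toy data)
X = np.array([[1, 1], [2, 1], [1, 2], [2, 2], [6, 5], [7, 6], [6, 6], [7, 5]], dtype=float)
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

clf = svm.SVC(kernel="linear", C=1e6).fit(X, y)    # large C ~ hard margin

w, b = clf.coef_[0], clf.intercept_[0]
print("hyperplane: %.2f*x1 + %.2f*x2 + %.2f = 0" % (w[0], w[1], b))
print("support vectors:\n", clf.support_vectors_)  # the tuples that define the margin
```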
    574 Why Is SVMEffective on High Dimensional Data?  The complexity of trained classifier is characterized by the # of support vectors rather than the dimensionality of the data  The support vectors are the essential or critical training examples —they lie closest to the decision boundary (MMH)  If all other training examples are removed and the training is repeated, the same separating hyperplane would be found  The number of support vectors found can be used to compute an (upper) bound on the expected error rate of the SVM classifier, which is independent of the data dimensionality  Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high
  • 574.
    575 SVM—Linearly Inseparable  Transformthe original input data into a higher dimensional space  Search for a linear separating hyperplane in the new space A1 A2
  • 575.
    576 SVM: Different Kernelfunctions  Instead of computing the dot product on the transformed data, it is math. equivalent to applying a kernel function K(Xi, Xj) to the original data, i.e., K(Xi, Xj) = Φ(Xi) Φ(Xj)  Typical Kernel Functions  SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional
  • 576.
    577 Scaling SVM byHierarchical Micro-Clustering  SVM is not scalable to the number of data objects in terms of training time and memory usage  H. Yu, J. Yang, and J. Han, “ Classifying Large Data Sets Using SVM with Hierarchical Clusters”, KDD'03)  CB-SVM (Clustering-Based SVM)  Given limited amount of system resources (e.g., memory), maximize the SVM performance in terms of accuracy and the training speed  Use micro-clustering to effectively reduce the number of points to be considered  At deriving support vectors, de-cluster micro-clusters near “candidate vector” to ensure high classification accuracy
  • 577.
    578 CF-Tree: Hierarchical Micro-cluster Read the data set once, construct a statistical summary of the data (i.e., hierarchical clusters) given a limited amount of memory  Micro-clustering: Hierarchical indexing structure  provide finer samples closer to the boundary and coarser samples farther from the boundary
  • 578.
    579 Selective Declustering: EnsureHigh Accuracy  CF tree is a suitable base structure for selective declustering  De-cluster only the cluster Ei such that  Di – Ri < Ds, where Di is the distance from the boundary to the center point of Ei and Ri is the radius of Ei  Decluster only the cluster whose subclusters have possibilities to be the support cluster of the boundary  “Support cluster”: The cluster whose centroid is a support vector
  • 579.
    580 CB-SVM Algorithm: Outline Construct two CF-trees from positive and negative data sets independently  Need one scan of the data set  Train an SVM from the centroids of the root entries  De-cluster the entries near the boundary into the next level  The children entries de-clustered from the parent entries are accumulated into the training set with the non-declustered parent entries  Train an SVM again from the centroids of the entries in the training set  Repeat until nothing is accumulated
  • 580.
    581 Accuracy and Scalabilityon Synthetic Dataset  Experiments on large synthetic data sets shows better accuracy than random sampling approaches and far more scalable than the original SVM algorithm
  • 581.
    582 SVM vs. NeuralNetwork  SVM  Deterministic algorithm  Nice generalization properties  Hard to learn – learned in batch mode using quadratic programming techniques  Using kernels can  Neural Network  Nondeterministic algorithm  Generalizes well but doesn’t have strong mathematical foundation  Can easily be learned in incremental fashion  To learn complex functions—use multilayer perceptron
  • 582.
    583 SVM Related Links SVM Website: http://www.kernel-machines.org/  Representative implementations  LIBSVM: an efficient implementation of SVM, multi- class classifications, nu-SVM, one-class SVM, including also various interfaces with java, python, etc.  SVM-light: simpler but performance is not better than LIBSVM, support only binary classification and only in C  SVM-torch: another recent implementation also
  • 583.
    584 Chapter 9. Classification:Advanced Methods  Bayesian Belief Networks  Classification by Backpropagation  Support Vector Machines  Classification by Using Frequent Patterns  Lazy Learners (or Learning from Your Neighbors)  Other Classification Methods  Additional Topics Regarding Classification  Summary
  • 584.
    585 Associative Classification  Associativeclassification: Major steps  Mine data to find strong associations between frequent patterns (conjunctions of attribute-value pairs) and class labels  Association rules are generated in the form of P1 ^ p2 … ^ pl  “Aclass = C” (conf, sup)  Organize the rules to form a rule-based classifier  Why effective?  It explores highly confident associations among multiple attributes and may overcome some constraints introduced by decision-tree induction, which considers only one attribute at a time  Associative classification has been found to be often more accurate than some traditional classification methods, such as
  • 585.
    586 Typical Associative ClassificationMethods  CBA (Classification Based on Associations: Liu, Hsu & Ma, KDD’98)  Mine possible association rules in the form of  Cond-set (a set of attribute-value pairs)  class label  Build classifier: Organize rules according to decreasing precedence based on confidence and then support  CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM’01)  Classification: Statistical analysis on multiple rules  CPAR (Classification based on Predictive Association Rules: Yin & Han, SDM’03)  Generation of predictive rules (FOIL-like analysis) but allow covered rules to retain with reduced weight  Prediction using best k rules 
  • 586.
    587 Frequent Pattern-Based Classification H. Cheng, X. Yan, J. Han, and C.-W. Hsu, “ Discriminative Frequent Pattern Analysis for Effective Cl assification ”, ICDE'07  Accuracy issue  Increase the discriminative power  Increase the expressive power of the feature space  Scalability issue  It is computationally infeasible to generate all feature combinations and filter them with an information gain threshold  Efficient method (DDPMine: FPtree pruning): H. Cheng, X. Yan, J. Han, and P. S. Yu, " Direct Discriminative Pattern Mining for Effective Cla
  • 587.
    588 Frequent Pattern vs.Single Feature (a) Austral (c) Sonar (b) Cleve Fig. 1. Information Gain vs. Pattern Length The discriminative power of some frequent patterns is higher than that of single features.
  • 588.
    589 Empirical Results  Fig. 2. Information Gain vs. Pattern Frequency — panels (a) Austral, (b) Breast, (c) Sonar; the curves plot InfoGain and its upper bound (IG_UpperBnd) against pattern support
  • 589.
    590 Feature Selection  Givena set of frequent patterns, both non- discriminative and redundant patterns exist, which can cause overfitting  We want to single out the discriminative patterns and remove redundant ones  The notion of Maximal Marginal Relevance (MMR) is borrowed  A document has high marginal relevance if it is both relevant to the query and contains minimal marginal similarity to previously selected documents
  • 590.
    593 DDPMine: Branch-and-Bound Search  Association between information gain and frequency  a: constant, a parent node; b: variable, a descendant  sup(child) ≤ sup(parent), i.e., sup(b) ≤ sup(a)
  • 593.
    595 Chapter 9. Classification:Advanced Methods  Bayesian Belief Networks  Classification by Backpropagation  Support Vector Machines  Classification by Using Frequent Patterns  Lazy Learners (or Learning from Your Neighbors)  Other Classification Methods  Additional Topics Regarding Classification  Summary
  • 595.
    596 Lazy vs. EagerLearning  Lazy vs. eager learning  Lazy learning (e.g., instance-based learning): Simply stores training data (or only minor processing) and waits until it is given a test tuple  Eager learning (the above discussed methods): Given a set of training tuples, constructs a classification model before receiving new (e.g., test) data to classify  Lazy: less time in training but more time in predicting  Accuracy  Lazy method effectively uses a richer hypothesis space since it uses many local linear functions to form an implicit global approximation to the target function  Eager: must commit to a single hypothesis that
  • 596.
    597 Lazy Learner: Instance-BasedMethods  Instance-based learning:  Store training examples and delay the processing (“lazy evaluation”) until a new instance must be classified  Typical approaches  k-nearest neighbor approach  Instances represented as points in a Euclidean space.  Locally weighted regression  Constructs local approximation  Case-based reasoning  Uses symbolic representations and knowledge- based inference
  • 597.
    598 The k-Nearest Neighbor Algorithm  All instances correspond to points in the n-D space  The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)  The target function could be discrete- or real-valued  For discrete-valued functions, k-NN returns the most common value among the k training examples nearest to xq  Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples
  • 598.
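A self-contained k-NN sketch (Euclidean distance, majority vote); the toy training tuples are made up for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training tuples
    (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [3.0, 3.2], [3.1, 2.9]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([2.9, 3.0]), k=3))   # -> "B"
```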
    599 Discussion on the k-NN Algorithm  k-NN for real-valued prediction for a given unknown tuple  Returns the mean values of the k nearest neighbors  Distance-weighted nearest neighbor algorithm  Weight the contribution of each of the k neighbors according to their distance to the query xq, giving greater weight to closer neighbors: w ≡ 1 / d(xq, xi)²  Robust to noisy data by averaging k-nearest neighbors  Curse of dimensionality: distance between neighbors could be dominated by irrelevant attributes  To overcome it, stretch axes or eliminate the least relevant attributes
  • 599.
    600 Case-Based Reasoning (CBR) CBR: Uses a database of problem solutions to solve new problems  Store symbolic description (tuples or cases)—not points in a Euclidean space  Applications: Customer-service (product-related diagnosis), legal ruling  Methodology  Instances represented by rich symbolic descriptions (e.g., function graphs)  Search for similar cases, multiple retrieved cases may be combined  Tight coupling between case retrieval, knowledge-based reasoning, and problem solving  Challenges  Find a good similarity metric  Indexing based on syntactic similarity measure, and when
  • 600.
    601 Chapter 9. Classification:Advanced Methods  Bayesian Belief Networks  Classification by Backpropagation  Support Vector Machines  Classification by Using Frequent Patterns  Lazy Learners (or Learning from Your Neighbors)  Other Classification Methods  Additional Topics Regarding Classification  Summary
  • 601.
    602 Genetic Algorithms (GA) Genetic Algorithm: based on an analogy to biological evolution  An initial population is created consisting of randomly generated rules  Each rule is represented by a string of bits  E.g., if A1 and ¬A2 then C2 can be encoded as 100  If an attribute has k > 2 values, k bits can be used  Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring  The fitness of a rule is represented by its classification accuracy on a set of training examples  Offspring are generated by crossover and mutation  The process continues until a population P evolves when each rule in P satisfies a prespecified threshold  Slow but easily parallelizable
  • 602.
    603 Rough Set Approach Rough sets are used to approximately or “roughly” define equivalent classes  A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C)  Finding the minimal subsets (reducts) of attributes for feature reduction is NP-hard but a discernibility matrix (which stores the differences between attribute values for each pair of data tuples) is used to reduce the computation intensity
  • 603.
    604 Fuzzy Set Approaches  Fuzzylogic uses truth values between 0.0 and 1.0 to represent the degree of membership (such as in a fuzzy membership graph)  Attribute values are converted to fuzzy values. Ex.:  Income, x, is assigned a fuzzy membership value to each of the discrete categories {low, medium, high}, e.g. $49K belongs to “medium income” with fuzzy value 0.15 but belongs to “high income” with fuzzy value 0.96  Fuzzy membership values do not have to sum to 1.  Each applicable rule contributes a vote for membership in the categories  Typically, the truth values for each predicted category are summed, and these sums are combined
  • 604.
    605 Chapter 9. Classification:Advanced Methods  Bayesian Belief Networks  Classification by Backpropagation  Support Vector Machines  Classification by Using Frequent Patterns  Lazy Learners (or Learning from Your Neighbors)  Other Classification Methods  Additional Topics Regarding Classification  Summary
  • 605.
    Multiclass Classification  Classificationinvolving more than two classes (i.e., > 2 Classes)  Method 1. One-vs.-all (OVA): Learn a classifier one at a time  Given m classes, train m classifiers: one for each class  Classifier j: treat tuples in class j as positive & all others as negative  To classify a tuple X, the set of classifiers vote as an ensemble  Method 2. All-vs.-all (AVA): Learn a classifier for each pair of classes  Given m classes, construct m(m-1)/2 binary classifiers  A classifier is trained using tuples of the two classes  To classify a tuple X, each classifier votes. X is assigned to the class with maximal vote  Comparison  All-vs.-all tends to be superior to one-vs.-all  Problem: Binary classifier is sensitive to errors, and errors affect 606
  • 606.
    Error-Correcting Codes for Multiclass Classification  Originally designed to correct errors during data transmission for communication tasks by exploiting data redundancy  Example: a 7-bit codeword associated with classes 1–4 607
      Class    Error-Corr. Codeword
      C1       1 1 1 1 1 1 1
      C2       0 0 0 0 1 1 1
      C3       0 0 1 1 0 0 1
      C4       0 1 0 1 0 1 0
 Given an unknown tuple X, the 7 trained classifiers output: 0001010  Hamming distance: # of differing bits between two codewords  H(X, C1) = 5, by checking the # of differing bits between [1111111] & [0001010]  H(X, C2) = 3, H(X, C3) = 3, H(X, C4) = 1, thus C4 is chosen as the label for X  Error-correcting codes can correct up to ⌊(h – 1)/2⌋ 1-bit errors, where h is the minimum Hamming distance between any two codewords  If we use 1 bit per class, it is equivalent to the one-vs.-all approach, and the codes are insufficient to self-correct  When selecting error-correcting codes, there should be good row-wise and column-wise separation between the codewords
  • 607.
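A short sketch of the decoding step, using the codeword table above; the classifier outputs are the example vector 0001010 from the slide.

```python
import numpy as np

# 7-bit error-correcting codewords for classes C1-C4 (from the table above)
codewords = {
    "C1": [1, 1, 1, 1, 1, 1, 1],
    "C2": [0, 0, 0, 0, 1, 1, 1],
    "C3": [0, 0, 1, 1, 0, 0, 1],
    "C4": [0, 1, 0, 1, 0, 1, 0],
}

def decode(bits):
    """Assign the class whose codeword has the smallest Hamming distance
    to the 7 binary-classifier outputs."""
    bits = np.array(bits)
    dists = {c: int(np.sum(bits != np.array(cw))) for c, cw in codewords.items()}
    return min(dists, key=dists.get), dists

label, dists = decode([0, 0, 0, 1, 0, 1, 0])   # outputs of the 7 trained classifiers
print(label, dists)                            # C4, distances {C1: 5, C2: 3, C3: 3, C4: 1}
```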
    Semi-Supervised Classification  Semi-supervised:Uses labeled and unlabeled data to build a classifier  Self-training:  Build a classifier using the labeled data  Use it to label the unlabeled data, and those with the most confident label prediction are added to the set of labeled data  Repeat the above process  Adv: easy to understand; disadv: may reinforce errors  Co-training: Use two or more classifiers to teach each other  Each learner uses a mutually independent set of features of each tuple to train a good classifier, say f1  Then f1 and f2 are used to predict the class label for unlabeled data X  Teach each other: The tuple having the most confident prediction from f1 is added to the set of labeled data for f2, & vice versa 608
  • 608.
    Active Learning  Classlabels are expensive to obtain  Active learner: query human (oracle) for labels  Pool-based approach: Uses a pool of unlabeled data  L: a small subset of D is labeled, U: a pool of unlabeled data in D  Use a query function to carefully select one or more tuples from U and request labels from an oracle (a human annotator)  The newly labeled samples are added to L, and learn a model  Goal: Achieve high accuracy using as few labeled data as possible  Evaluated using learning curves: Accuracy as a function of the number of instances queried (# of tuples to be queried should be small)  Research issue: How to choose the data tuples to be queried?  Uncertainty sampling: choose the least certain ones  Reduce version space, the subset of hypotheses consistent w. the training data  Reduce expected entropy over U: Find the greatest reduction in 609
  • 609.
    Transfer Learning: ConceptualFramework  Transfer learning: Extract knowledge from one or more source tasks and apply the knowledge to a target task  Traditional learning: Build a new classifier for each new task  Transfer learning: Build new classifier by applying existing knowledge learned from source tasks Learning System Learning System Learning System Different Tasks 610 Traditional Learning Framework Transfer Learning Framework Knowledge Learning System Source Tasks Target Task
  • 610.
    Transfer Learning: Methodsand Applications  Applications: Especially useful when data is outdated or distribution changes, e.g., Web document classification, e-mail spam filtering  Instance-based transfer learning: Reweight some of the data from source tasks and use it to learn the target task  TrAdaBoost (Transfer AdaBoost)  Assume source and target data each described by the same set of attributes (features) & class labels, but rather diff. distributions  Require only labeling a small amount of target data  Use source data in training: When a source tuple is misclassified, reduce the weight of such tupels so that they will have less effect on the subsequent classifier  Research issues  Negative transfer: When it performs worse than no transfer at all  Heterogeneous transfer learning: Transfer knowledge from different feature space or multiple source domains  Large-scale transfer learning 611
  • 611.
    612 Chapter 9. Classification:Advanced Methods  Bayesian Belief Networks  Classification by Backpropagation  Support Vector Machines  Classification by Using Frequent Patterns  Lazy Learners (or Learning from Your Neighbors)  Other Classification Methods  Additional Topics Regarding Classification  Summary
  • 612.
    613 Summary  Effective andadvanced classification methods  Bayesian belief network (probabilistic networks)  Backpropagation (Neural networks)  Support Vector Machine (SVM)  Pattern-based classification  Other classification methods: lazy learners (KNN, case-based reasoning), genetic algorithms, rough set and fuzzy set approaches  Additional Topics on Classification  Multiclass classification  Semi-supervised classification  Active learning  Transfer learning
  • 613.
    614 References  Please seethe references of Chapter 8
  • 614.
    616 What Is Prediction? (Numerical) prediction is similar to classification  construct a model  use model to predict continuous or ordered value for a given input  Prediction is different from classification  Classification refers to predict categorical class label  Prediction models continuous-valued functions  Major method for prediction: regression  model the relationship between one or more independent or predictor variables and a dependent or response variable  Regression analysis  Linear and multiple regression  Non-linear regression  Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees
  • 616.
    617 Linear Regression  Linear regression: involves a response variable y and a single predictor variable x: y = w0 + w1 x, where w0 (y-intercept) and w1 (slope) are regression coefficients  Method of least squares: estimates the best-fitting straight line: w1 = Σ_{i=1..|D|} (xi – x̄)(yi – ȳ) / Σ_{i=1..|D|} (xi – x̄)², w0 = ȳ – w1 x̄  Multiple linear regression: involves more than one predictor variable  Training data is of the form (X1, y1), (X2, y2), …, (X|D|, y|D|)  Ex. For 2-D data, we may have: y = w0 + w1 x1 + w2 x2  Solvable by extension of the least squares method or using statistical software such as SAS or S-Plus
  • 617.
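The closed-form estimates translate directly into NumPy; the experience/salary numbers below are illustrative toy data.

```python
import numpy as np

def least_squares_line(x, y):
    """Closed-form least-squares estimates for y = w0 + w1*x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    x_bar, y_bar = x.mean(), y.mean()
    w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    w0 = y_bar - w1 * x_bar
    return w0, w1

# Toy data: x = years of experience, y = salary (in $1000s)
x = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
y = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]
w0, w1 = least_squares_line(x, y)
print(f"y = {w0:.1f} + {w1:.1f} x")
```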
    618 Nonlinear Regression  Some nonlinear models can be modeled by a polynomial function  A polynomial regression model can be transformed into a linear regression model. For example, y = w0 + w1 x + w2 x² + w3 x³ is convertible to linear form with the new variables x2 = x², x3 = x³: y = w0 + w1 x + w2 x2 + w3 x3  Other functions, such as the power function, can also be transformed to a linear model  Some models are intractably nonlinear (e.g., sum of exponential terms)  It is possible to obtain least squares estimates through extensive calculation on more complex formulae
  • 618.
    619  Generalized linearmodel:  Foundation on which linear regression can be applied to modeling categorical response variables  Variance of y is a function of the mean value of y, not a constant  Logistic regression: models the prob. of some event occurring as a linear function of a set of predictor variables  Poisson regression: models the data that exhibit a Poisson distribution  Log-linear models: (for categorical data)  Approximate discrete multidimensional prob. distributions  Also useful for data compression and smoothing  Regression trees and model trees  Trees to predict continuous values rather than class labels Other Regression-Based Models
  • 619.
    620 Regression Trees andModel Trees  Regression tree: proposed in CART system (Breiman et al. 1984)  CART: Classification And Regression Trees  Each leaf stores a continuous-valued prediction  It is the average value of the predicted attribute for the training tuples that reach the leaf  Model tree: proposed by Quinlan (1992)  Each leaf holds a regression model—a multivariate linear equation for the predicted attribute  A more general case than regression tree  Regression and model trees tend to be more accurate than linear regression when the data are not represented well by a simple linear model
  • 620.
    621  Predictive modeling:Predict data values or construct generalized linear models based on the database data  One can only predict value ranges or category distributions  Method outline:  Minimal generalization  Attribute relevance analysis  Generalized linear model construction  Prediction  Determine the major factors which influence the prediction  Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc. Predictive Modeling in Multidimensional Databases
  • 621.
    624 SVM—Introductory Literature  “StatisticalLearning Theory” by Vapnik: extremely hard to understand, containing many errors too.  C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998.  Better than the Vapnik’s book, but still written too hard for introduction, and the examples are so not-intuitive  The book “An Introduction to Support Vector Machines” by N. Cristianini and J. Shawe-Taylor  Also written hard for introduction, but the explanation about the mercer’s theorem is better than above literatures  The neural network book by Haykins  Contains one nice chapter of SVM introduction
  • 624.
    625 Notes about SVM— IntroductoryLiterature  “Statistical Learning Theory” by Vapnik: difficult to understand, containing many errors.  C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998.  Easier than Vapnik’s book, but still not introductory level; the examples are not so intuitive  The book An Introduction to Support Vector Machines by Cristianini and Shawe-Taylor  Not introductory level, but the explanation about Mercer’s Theorem is better than above literatures  Neural Networks and Learning Machines by Haykin  Contains a nice chapter on SVM introduction
  • 625.
    626 Associative Classification CanAchieve High Accuracy and Efficiency (Cong et al. SIGMOD05)
  • 626.
    627 A Closer Look at CMAR  CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM’01)  Efficiency: Uses an enhanced FP-tree that maintains the distribution of class labels among tuples satisfying each frequent itemset  Rule pruning whenever a rule is inserted into the tree  Given two rules, R1 and R2, if the antecedent of R1 is more general than that of R2 and conf(R1) ≥ conf(R2), then prune R2  Prunes rules for which the rule antecedent and class are not positively correlated, based on a χ² test of statistical significance  Classification based on generated/pruned rules  If only one rule satisfies tuple X, assign the class label of the rule  If a rule set S satisfies X, CMAR  divides S into groups according to class labels  uses a weighted χ² measure to find the strongest group of rules, based on the statistical correlation of rules within a group  assigns X the class label of the strongest group
  • 627.
    628 Perceptron & Winnow •Vector: x, w • Scalar: x, y, w Input: {(x1, y1), …} Output: classification function f(x) f(xi) > 0 for yi = +1 f(xi) < 0 for yi = -1 f(x) => wx + b = 0 or w1x1+w2x2+b = 0 x1 x2 • Perceptron: update W additively • Winnow: update W multiplicatively
  • 628.
    Data Mining: Concepts andTechniques (3rd ed.) — Chapter 10 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2011 Han, Kamber & Pei. All rights reserved. 629
  • 629.
    630 Chapter 10. ClusterAnalysis: Basic Concepts and Methods  Cluster Analysis: Basic Concepts  Partitioning Methods  Hierarchical Methods  Density-Based Methods  Grid-Based Methods  Evaluation of Clustering  Summary 630
  • 630.
    631 What is ClusterAnalysis?  Cluster: A collection of data objects  similar (or related) to one another within the same group  dissimilar (or unrelated) to the objects in other groups  Cluster analysis (or clustering, data segmentation, …)  Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters  Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised)  Typical applications  As a stand-alone tool to get insight into data distribution  As a preprocessing step for other algorithms
  • 631.
    632 Clustering for DataUnderstanding and Applications  Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species  Information retrieval: document clustering  Land use: Identification of areas of similar land use in an earth observation database  Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs  City-planning: Identifying groups of houses according to their house type, value, and geographical location  Earth-quake studies: Observed earth quake epicenters should be clustered along continent faults  Climate: understanding earth climate, find patterns of atmospheric and ocean  Economic Science: market resarch
  • 632.
    633 Clustering as aPreprocessing Tool (Utility)  Summarization:  Preprocessing for regression, PCA, classification, and association analysis  Compression:  Image processing: vector quantization  Finding K-nearest Neighbors  Localizing search to one or a small number of clusters  Outlier detection  Outliers are often viewed as those “far away” from any cluster
  • 633.
    Quality: What IsGood Clustering?  A good clustering method will produce high quality clusters  high intra-class similarity: cohesive within clusters  low inter-class similarity: distinctive between clusters  The quality of a clustering method depends on  the similarity measure used by the method  its implementation, and  Its ability to discover some or all of the hidden patterns 634
  • 634.
    Measure the Qualityof Clustering  Dissimilarity/Similarity metric  Similarity is expressed in terms of a distance function, typically metric: d(i, j)  The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical, ordinal ratio, and vector variables  Weights should be associated with different variables based on applications and data semantics  Quality of clustering:  There is usually a separate “quality” function that measures the “goodness” of a cluster.  It is hard to define “similar enough” or “good enough”  The answer is typically highly subjective 635
  • 635.
    Considerations for ClusterAnalysis  Partitioning criteria  Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)  Separation of clusters  Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)  Similarity measure  Distance-based (e.g., Euclidian, road network, vector) vs. connectivity-based (e.g., density or contiguity)  Clustering space  Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering) 636
  • 636.
    Requirements and Challenges Scalability  Clustering all the data instead of only on samples  Ability to deal with different types of attributes  Numerical, binary, categorical, ordinal, linked, and mixture of these  Constraint-based clustering  User may give inputs on constraints  Use domain knowledge to determine input parameters  Interpretability and usability  Others  Discovery of clusters with arbitrary shape  Ability to deal with noisy data  Incremental clustering and insensitivity to input order  High dimensionality 637
  • 637.
    Major Clustering Approaches (I)  Partitioning approach:  Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors  Typical methods: k-means, k-medoids, CLARANS  Hierarchical approach:  Create a hierarchical decomposition of the set of data (or objects) using some criterion  Typical methods: DIANA, AGNES, BIRCH, CHAMELEON  Density-based approach:  Based on connectivity and density functions  Typical methods: DBSCAN, OPTICS, DenClue  Grid-based approach:  Based on a multiple-level granularity structure  Typical methods: STING, WaveCluster, CLIQUE 638
  • 638.
    Major Clustering Approaches(II)  Model-based:  A model is hypothesized for each of the clusters and tries to find the best fit of that model to each other  Typical methods: EM, SOM, COBWEB  Frequent pattern-based:  Based on the analysis of frequent patterns  Typical methods: p-Cluster  User-guided or constraint-based:  Clustering by considering user-specified or application-specific constraints  Typical methods: COD (obstacles), constrained clustering  Link-based clustering:  Objects are often linked together in various ways  Massive links can be used to cluster objects: SimRank, LinkClus 639
  • 639.
    640 Chapter 10. ClusterAnalysis: Basic Concepts and Methods  Cluster Analysis: Basic Concepts  Partitioning Methods  Hierarchical Methods  Density-Based Methods  Grid-Based Methods  Evaluation of Clustering  Summary 640
  • 640.
    Partitioning Algorithms: Basic Concept  Partitioning method: Partitioning a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized: E = Σ_{i=1..k} Σ_{p∈Ci} (p – ci)², where ci is the centroid or medoid of cluster Ci  Given k, find a partition of k clusters that optimizes the chosen partitioning criterion  Global optimal: exhaustively enumerate all partitions  Heuristic methods: k-means and k-medoids algorithms  k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented by the center of the cluster  k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster 641
  • 641.
The K-Means Clustering Method  Given k, the k-means algorithm is implemented in four steps:  Partition objects into k nonempty subsets  Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)  Assign each object to the cluster with the nearest seed point  Go back to Step 2; stop when the assignment does not change 642
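To make the four steps concrete, here is a minimal k-means sketch in Python (NumPy as an assumed dependency); the toy data, k, and seed are hypothetical, and this is an illustrative re-implementation rather than the book's code.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch: X is an (n, d) array; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen data points as seeds
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each object goes to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean point of its cluster
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments (and centroids) no longer change
        centroids = new_centroids
    return centroids, labels

# Hypothetical toy usage
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5]])
centroids, labels = kmeans(X, k=2)
```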
An Example of K-Means Clustering (K = 2)  Starting from the initial data set, arbitrarily partition objects into k groups, update the cluster centroids, reassign objects, and loop if needed  In outline:  Partition objects into k nonempty subsets  Repeat  Compute the centroid (i.e., mean point) of each partition  Assign each object to the cluster of its nearest centroid  Until no change 643
Comments on the K-Means Method  Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.  Comparing: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k))  Comment: Often terminates at a local optimum.  Weakness  Applicable only to objects in a continuous n-dimensional space  Use the k-modes method for categorical data  In comparison, k-medoids can be applied to a wide range of data  Need to specify k, the number of clusters, in advance (there are ways to automatically determine the best k; see Hastie et al., 2009)  Sensitive to noisy data and outliers  Not suitable to discover clusters with non-convex shapes 644
    Variations of theK-Means Method  Most of the variants of the k-means which differ in  Selection of the initial k means  Dissimilarity calculations  Strategies to calculate cluster means  Handling categorical data: k-modes  Replacing means of clusters with modes  Using new dissimilarity measures to deal with categorical objects  Using a frequency-based method to update modes of clusters  A mixture of categorical and numerical data: k-prototype method 645
What Is the Problem of the K-Means Method?  The k-means algorithm is sensitive to outliers!  An object with an extremely large value may substantially distort the distribution of the data  K-medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster 646
647 PAM: A Typical K-Medoids Algorithm (K = 2)  Arbitrarily choose k objects as the initial medoids  Assign each remaining object to the nearest medoid (total cost = 20)  Randomly select a non-medoid object, Orandom, and compute the total cost of swapping (e.g., total cost = 26)  If swapping a medoid O with Orandom improves the quality, perform the swap  Loop until no change
The K-Medoids Clustering Method  K-medoids clustering: Find representative objects (medoids) in clusters  PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)  Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering  PAM works effectively for small data sets, but does not scale well for large data sets (due to the computational complexity)  Efficiency improvement on PAM  CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples  CLARANS (Ng & Han, 1994): Randomized re-sampling 648
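The swap-based search that PAM performs can be sketched as follows (Python/NumPy assumed); the toy data and the exhaustive swap loop are for illustration only and ignore PAM's incremental cost-update optimizations.

```python
import numpy as np
from itertools import product

def pam(X, k, seed=0):
    """Naive PAM sketch: try every (medoid, non-medoid) swap until no swap lowers the cost."""
    rng = np.random.default_rng(seed)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    medoids = list(rng.choice(n, size=k, replace=False))

    def total_cost(meds):
        # Each object contributes its distance to the nearest medoid
        return dist[:, meds].min(axis=1).sum()

    best = total_cost(medoids)
    improved = True
    while improved:
        improved = False
        for m_idx, h in product(range(k), range(n)):
            if h in medoids:
                continue
            candidate = medoids.copy()
            candidate[m_idx] = h            # swap a medoid with non-medoid h
            cost = total_cost(candidate)
            if cost < best:                 # keep the swap only if it improves quality
                best, medoids, improved = cost, candidate, True
    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5], [25.0, 25.0]])
medoids, labels = pam(X, k=2)
```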
649 Chapter 10. Cluster Analysis: Basic Concepts and Methods  Cluster Analysis: Basic Concepts  Partitioning Methods  Hierarchical Methods  Density-Based Methods  Grid-Based Methods  Evaluation of Clustering  Summary 649
Hierarchical Clustering  Use a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition  Agglomerative (AGNES): merge objects a, b, c, d, e step by step, e.g., {a, b}, {d, e}, {c, d, e}, then {a, b, c, d, e}  Divisive (DIANA): split in the inverse order, starting from the whole set 650
AGNES (Agglomerative Nesting)  Introduced in Kaufmann and Rousseeuw (1990)  Implemented in statistical packages, e.g., Splus  Use the single-link method and the dissimilarity matrix  Merge nodes that have the least dissimilarity  Go on in a non-descending fashion  Eventually all nodes belong to the same cluster 651
Dendrogram: Shows How Clusters Are Merged  Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram  A clustering of the data objects is obtained by cutting the dendrogram at the desired level; then each connected component forms a cluster 652
DIANA (Divisive Analysis)  Introduced in Kaufmann and Rousseeuw (1990)  Implemented in statistical analysis packages, e.g., Splus  Inverse order of AGNES  Eventually each node forms a cluster on its own 653
    Distance between Clusters Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)  Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)  Average: avg distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)  Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)  Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)  Medoid: a chosen, centrally located object in the cluster X X 654
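The linkage choices above correspond to standard options in common libraries; a small sketch, assuming SciPy is available, that builds and cuts an agglomerative dendrogram under single, complete, and average linkage on hypothetical 2-D data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data: two visually separated groups
X = np.array([[1, 1], [1.2, 1.1], [1.1, 0.9], [8, 8], [8.2, 8.1], [7.9, 8.3]])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                    # agglomerative merge tree (dendrogram encoding)
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters
    print(method, labels)
```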
Centroid, Radius and Diameter of a Cluster (for numerical data sets)  Centroid: the “middle” of a cluster, Cm = Σi=1..N tip / N  Radius: square root of the average distance from any point of the cluster to its centroid, Rm = sqrt( Σi=1..N (tip − cm)² / N )  Diameter: square root of the average mean squared distance between all pairs of points in the cluster, Dm = sqrt( Σi=1..N Σj=1..N (tip − tjq)² / (N(N−1)) ) 655
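A short sketch of the three measures above (Python/NumPy assumed), applied here to the five sample points that appear in the BIRCH example that follows:

```python
import numpy as np

def centroid_radius_diameter(points):
    """Compute the centroid Cm, radius Rm, and diameter Dm of a numerical cluster."""
    X = np.asarray(points, dtype=float)
    n = len(X)
    centroid = X.mean(axis=0)                                   # Cm
    radius = np.sqrt(((X - centroid) ** 2).sum(axis=1).mean())  # sqrt(avg squared distance to Cm)
    # Average squared distance over all pairs of distinct points
    diff = X[:, None, :] - X[None, :, :]
    sq = (diff ** 2).sum(axis=2)
    diameter = np.sqrt(sq.sum() / (n * (n - 1)))                # Dm
    return centroid, radius, diameter

centroid, radius, diameter = centroid_radius_diameter([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]])
```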
Extensions to Hierarchical Clustering  Major weaknesses of agglomerative clustering methods  Can never undo what was done previously  Do not scale well: time complexity of at least O(n²), where n is the number of total objects  Integration of hierarchical & distance-based clustering  BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters  CHAMELEON (1999): hierarchical clustering using dynamic modeling 656
BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies)  Zhang, Ramakrishnan & Livny, SIGMOD’96  Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering  Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)  Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree  Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans  Weakness: handles only numeric data, and is sensitive to the order of the data records 657
Clustering Feature Vector in BIRCH  Clustering Feature (CF): CF = (N, LS, SS)  N: number of data points  LS: linear sum of the N points, Σi=1..N Xi  SS: square sum of the N points, Σi=1..N Xi²  Example: for the points (3,4), (2,6), (4,5), (4,7), (3,8), CF = (5, (16,30), (54,190)) 658
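The additivity of clustering features is what lets BIRCH summarize sub-clusters incrementally; a minimal sketch (Python/NumPy assumed, with per-dimension LS and SS as in the example above):

```python
import numpy as np

class CF:
    """Clustering Feature (N, LS, SS); CFs are additive, which is what BIRCH exploits."""
    def __init__(self, n=0, ls=None, ss=None, dim=2):
        self.n = n
        self.ls = np.zeros(dim) if ls is None else np.asarray(ls, dtype=float)
        self.ss = np.zeros(dim) if ss is None else np.asarray(ss, dtype=float)

    def add_point(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.ls += x        # linear sum
        self.ss += x ** 2   # square sum (per dimension)

    def merge(self, other):
        """Merging two sub-clusters is just component-wise addition of their CFs."""
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss, dim=len(self.ls))

    def centroid(self):
        return self.ls / self.n

cf = CF(dim=2)
for p in [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]:
    cf.add_point(p)
# Reproduces the slide's example: cf.n == 5, cf.ls == [16, 30], cf.ss == [54, 190]
```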
    CF-Tree in BIRCH Clustering feature:  Summary of the statistics for a given subcluster: the 0-th, 1st, and 2nd moments of the subcluster from the statistical point of view  Registers crucial measurements for computing cluster and utilizes storage efficiently A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering  A nonleaf node in a tree has descendants or “children”  The nonleaf nodes store sums of the CFs of their children  A CF tree has two parameters  Branching factor: max # of children  Threshold: max diameter of sub-clusters stored at the leaf nodes 659
The CF Tree Structure  [Figure: a CF tree with branching factor B = 7 and maximum leaf entries L = 6; the root and non-leaf nodes store CF entries (CF1, CF2, …) with pointers to their children, while leaf nodes store CF entries and are chained together by prev/next pointers] 660
The BIRCH Algorithm  Cluster diameter: D = sqrt( Σi Σj (xi − xj)² / (n(n−1)) )  For each point in the input  Find the closest leaf entry  Add the point to the leaf entry and update the CF  If the entry diameter > max_diameter, then split the leaf, and possibly its parents  Algorithm is O(n)  Concerns  Sensitive to the insertion order of data points  Since the size of leaf nodes is fixed, clusters may not be so natural  Clusters tend to be spherical given the radius and diameter measures 661
    CHAMELEON: Hierarchical ClusteringUsing Dynamic Modeling (1999)  CHAMELEON: G. Karypis, E. H. Han, and V. Kumar, 1999  Measures the similarity based on a dynamic model  Two clusters are merged only if the interconnectivity and closeness (proximity) between two clusters are high relative to the internal interconnectivity of the clusters and closeness of items within the clusters  Graph-based, and a two-phase algorithm 1. Use a graph-partitioning algorithm: cluster objects into a large number of relatively small sub-clusters 2. Use an agglomerative hierarchical clustering algorithm: find the genuine clusters by repeatedly combining these sub-clusters 662
Overall Framework of CHAMELEON  Construct a sparse k-NN graph from the data set (p and q are connected if q is among the top-k closest neighbors of p)  Partition the graph into many relatively small sub-clusters  Merge partitions into the final clusters based on:  Relative interconnectivity: connectivity of c1 and c2 over their internal connectivity  Relative closeness: closeness of c1 and c2 over their internal closeness 663
    Probabilistic Hierarchical Clustering Algorithmic hierarchical clustering  Nontrivial to choose a good distance measure  Hard to handle missing attribute values  Optimization goal not clear: heuristic, local search  Probabilistic hierarchical clustering  Use probabilistic models to measure distances between clusters  Generative model: Regard the set of data objects to be clustered as a sample of the underlying data generation mechanism to be analyzed  Easy to understand, same efficiency as algorithmic agglomerative clustering method, can handle partially observed data  In practice, assume the generative models adopt common distributions functions, e.g., Gaussian distribution or Bernoulli distribution, governed by parameters 665
Generative Model  Given a set of 1-D points X = {x1, …, xn} for clustering analysis, assume they are generated by a Gaussian distribution N(μ, σ²)  The probability that a point xi ∈ X is generated by the model: P(xi | μ, σ²) = (1 / (√(2π) σ)) exp(−(xi − μ)² / (2σ²))  The likelihood that X is generated by the model: L(N(μ, σ²) : X) = Πi=1..n P(xi | μ, σ²)  The task of learning the generative model: find the parameters μ and σ² such that this likelihood is maximized 666
A Probabilistic Hierarchical Clustering Algorithm  For a set of objects partitioned into m clusters C1, …, Cm, the quality can be measured by Q({C1, …, Cm}) = Πi=1..m P(Ci), where P() is the maximum likelihood  Distance between clusters C1 and C2: dist(C1, C2) = −log (P(C1 ∪ C2) / (P(C1)P(C2)))  Algorithm: progressively merge points and clusters  Input: D = {o1, ..., on}: a data set containing n objects  Output: a hierarchy of clusters  Method:  Create a cluster for each object: Ci = {oi}, 1 ≤ i ≤ n  For i = 1 to n { find the pair of clusters Ci and Cj such that Ci, Cj = argmax i≠j {log (P(Ci ∪ Cj) / (P(Ci)P(Cj)))}; if log (P(Ci ∪ Cj) / (P(Ci)P(Cj))) > 0 then merge Ci and Cj } 667
668 Chapter 10. Cluster Analysis: Basic Concepts and Methods  Cluster Analysis: Basic Concepts  Partitioning Methods  Hierarchical Methods  Density-Based Methods  Grid-Based Methods  Evaluation of Clustering  Summary 668
    Density-Based Clustering Methods Clustering based on density (local cluster criterion), such as density-connected points  Major features:  Discover clusters of arbitrary shape  Handle noise  One scan  Need density parameters as termination condition  Several interesting studies:  DBSCAN: Ester, et al. (KDD’96)  OPTICS: Ankerst, et al (SIGMOD’99).  DENCLUE: Hinneburg & D. Keim (KDD’98)  CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid- based) 669
    Density-Based Clustering: BasicConcepts  Two parameters:  Eps: Maximum radius of the neighbourhood  MinPts: Minimum number of points in an Eps- neighbourhood of that point  NEps(p): {q belongs to D | dist(p,q) ≤ Eps}  Directly density-reachable: A point p is directly density- reachable from a point q w.r.t. Eps, MinPts if  p belongs to NEps(q)  core point condition: |NEps (q)| ≥ MinPts MinPts = 5 Eps = 1 cm p q 670
    Density-Reachable and Density-Connected Density-reachable:  A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is directly density-reachable from pi  Density-connected  A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both, p and q are density-reachable from o w.r.t. Eps and MinPts p q p1 p q o 671
    DBSCAN: Density-Based SpatialClustering of Applications with Noise  Relies on a density-based notion of cluster: A cluster is defined as a maximal set of density-connected points  Discovers clusters of arbitrary shape in spatial databases with noise Core Border Outlier Eps = 1cm MinPts = 5 672
DBSCAN: The Algorithm  Arbitrarily select a point p  Retrieve all points density-reachable from p w.r.t. Eps and MinPts  If p is a core point, a cluster is formed  If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database  Continue the process until all of the points have been processed 673
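A compact, unoptimized DBSCAN sketch in Python/NumPy (it recomputes all pairwise distances rather than using a spatial index; the toy data and parameter values are hypothetical):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch. Returns labels: -1 = noise, 0..k-1 = cluster ids."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.where(dist[i] <= eps)[0] for i in range(n)]  # Eps-neighborhoods
    labels = np.full(n, -1)          # -1 means noise / unassigned
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0
    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        if len(neighbors[p]) < min_pts:
            continue                 # p is not a core point; leave it as noise for now
        labels[p] = cluster_id       # grow a new cluster from core point p
        queue = list(neighbors[p])
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster_id          # border or core point joins the cluster
            if not visited[q]:
                visited[q] = True
                if len(neighbors[q]) >= min_pts:
                    queue.extend(neighbors[q])  # q is a core point: expand further
        cluster_id += 1
    return labels

X = [[1, 1], [1.1, 1.2], [0.9, 1.1], [5, 5], [5.1, 5.2], [5.2, 4.9], [9, 0]]
labels = dbscan(X, eps=0.5, min_pts=2)
```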
DBSCAN: Sensitive to Parameters 674
OPTICS: A Cluster-Ordering Method (1999)  OPTICS: Ordering Points To Identify the Clustering Structure  Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)  Produces a special order of the database w.r.t. its density-based clustering structure  This cluster-ordering contains information equivalent to the density-based clusterings corresponding to a broad range of parameter settings  Good for both automatic and interactive cluster analysis, including finding intrinsic clustering structure  Can be represented graphically or using visualization techniques 675
OPTICS: Some Extensions from DBSCAN  Index-based: k = number of dimensions, N = 20, p = 75%, M = N(1−p) = 5  Complexity: O(N log N)  Core distance of an object o: the minimum eps such that o is a core point  Reachability distance of p from o: max(core-distance(o), d(o, p))  Example (MinPts = 5, eps = 3 cm): r(p1, o) = 2.8 cm, r(p2, o) = 4 cm 676
DENCLUE: Using Statistical Density Functions  DENsity-based CLUstEring by Hinneburg & Keim (KDD’98)  Uses statistical density functions:  Influence of y on x: fGaussian(x, y) = e^(−d(x,y)²/(2σ²))  Total influence on x: fDGaussian(x) = Σi=1..N e^(−d(x,xi)²/(2σ²))  Gradient of x in the direction of xi: ∇fDGaussian(x, xi) = Σi=1..N (xi − x) e^(−d(x,xi)²/(2σ²))  Major features  Solid mathematical foundation  Good for data sets with large amounts of noise  Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets  Significantly faster than existing algorithms (e.g., DBSCAN)  But needs a large number of parameters 679
DENCLUE: Technical Essence  Uses grid cells, but only keeps information about grid cells that actually contain data points and manages these cells in a tree-based access structure  Influence function: describes the impact of a data point within its neighborhood  The overall density of the data space can be calculated as the sum of the influence functions of all data points  Clusters can be determined mathematically by identifying density attractors  Density attractors are local maxima of the overall density function  Center-defined clusters: assign to each density attractor the points density-attracted to it  Arbitrarily shaped clusters: merge density attractors that are connected through paths of high density (> threshold) 680
683 Chapter 10. Cluster Analysis: Basic Concepts and Methods  Cluster Analysis: Basic Concepts  Partitioning Methods  Hierarchical Methods  Density-Based Methods  Grid-Based Methods  Evaluation of Clustering  Summary 683
    Grid-Based Clustering Method Using multi-resolution grid data structure  Several interesting methods  STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997)  WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB’98)  A multi-resolution clustering approach using wavelet method  CLIQUE: Agrawal, et al. (SIGMOD’98)  Both grid-based and subspace clustering 684
STING: A Statistical Information Grid Approach  Wang, Yang and Muntz (VLDB’97)  The spatial area is divided into rectangular cells  There are several levels of cells corresponding to different levels of resolution (1st layer, …, (i−1)-st layer, i-th layer) 685
    The STING ClusteringMethod  Each cell at a high level is partitioned into a number of smaller cells in the next lower level  Statistical info of each cell is calculated and stored beforehand and is used to answer queries  Parameters of higher level cells can be easily calculated from parameters of lower level cell  count, mean, s, min, max  type of distribution—normal, uniform, etc.  Use a top-down approach to answer spatial data queries  Start from a pre-selected layer—typically with a small number of cells  For each cell in the current level compute the confidence interval 686
STING Algorithm and Its Analysis  Remove the irrelevant cells from further consideration  When finished examining the current layer, proceed to the next lower level  Repeat this process until the bottom layer is reached  Advantages:  Query-independent, easy to parallelize, incremental update  O(K), where K is the number of grid cells at the lowest level  Disadvantages:  All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected 687
688 CLIQUE (CLustering In QUEst)  Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)  Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space  CLIQUE can be considered as both density-based and grid-based  It partitions each dimension into the same number of equal-length intervals  It partitions an m-dimensional data space into non-overlapping rectangular units  A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter  A cluster is a maximal set of connected dense units within a subspace
    689 CLIQUE: The MajorSteps  Partition the data space and find the number of points that lie inside each cell of the partition.  Identify the subspaces that contain clusters using the Apriori principle  Identify clusters  Determine dense units in all subspaces of interests  Determine connected dense units in all subspaces of interests.  Generate minimal description for the clusters  Determine maximal regions that cover a cluster of connected dense units for each cluster  Determination of minimal cover for each cluster
690 [Figure: CLIQUE example — dense units found in the (age, salary) and (age, vacation) subspaces for a density threshold τ = 3, and their intersection in the 3-D (age, vacation, salary) space]
    691 Strength and Weaknessof CLIQUE  Strength  automatically finds subspaces of the highest dimensionality such that high density clusters exist in those subspaces  insensitive to the order of records in input and does not presume some canonical data distribution  scales linearly with the size of input and has good scalability as the number of dimensions in the data increases  Weakness  The accuracy of the clustering result may be degraded at the expense of simplicity of the method
692 Chapter 10. Cluster Analysis: Basic Concepts and Methods  Cluster Analysis: Basic Concepts  Partitioning Methods  Hierarchical Methods  Density-Based Methods  Grid-Based Methods  Evaluation of Clustering  Summary 692
Assessing Clustering Tendency  Assess if non-random structure exists in the data by measuring the probability that the data is generated by a uniform data distribution  Test spatial randomness by a statistical test: the Hopkins statistic  Given a dataset D regarded as a sample of a random variable o, determine how far away o is from being uniformly distributed in the data space  Sample n points, p1, …, pn, uniformly from D. For each pi, find its nearest neighbor in D: xi = min{dist(pi, v)} where v in D  Sample n points, q1, …, qn, uniformly from D. For each qi, find its nearest neighbor in D − {qi}: yi = min{dist(qi, v)} where v in D and v ≠ qi  Calculate the Hopkins statistic H from the two sums of nearest-neighbor distances  If D is uniformly distributed, Σxi and Σyi will be close to each other and H is close to 0.5. If D is highly skewed, H is close to 0 693
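The slide does not show the formula itself; the sketch below assumes the common form H = Σyi / (Σxi + Σyi), with the uniform sample drawn from the bounding box of D, which matches the stated behavior (H ≈ 0.5 for uniform data, H → 0 for highly skewed data). Python/NumPy assumed; sample sizes and data are illustrative.

```python
import numpy as np

def hopkins_statistic(D, n_samples=50, seed=0):
    """Hopkins-statistic sketch, assuming H = sum(y) / (sum(x) + sum(y))."""
    rng = np.random.default_rng(seed)
    D = np.asarray(D, dtype=float)
    n = min(n_samples, len(D) - 1)

    # x_i: nearest-neighbor distances of points sampled uniformly from the data space of D
    lo, hi = D.min(axis=0), D.max(axis=0)
    P = rng.uniform(lo, hi, size=(n, D.shape[1]))
    x = np.array([np.linalg.norm(D - p, axis=1).min() for p in P])

    # y_i: nearest-neighbor distances of points sampled from D itself (excluding the point)
    idx = rng.choice(len(D), size=n, replace=False)
    y = np.array([
        np.partition(np.linalg.norm(D - D[i], axis=1), 1)[1]  # skip the zero distance to itself
        for i in idx
    ])
    return y.sum() / (x.sum() + y.sum())

D = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 8])
H = hopkins_statistic(D)  # well below 0.5 for clustered (skewed) data
```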
Determine the Number of Clusters  Empirical method  # of clusters ≈ √(n/2) for a dataset of n points  Elbow method  Use the turning point in the curve of the sum of within-cluster variance w.r.t. the # of clusters  Cross-validation method  Divide a given data set into m parts  Use m − 1 parts to obtain a clustering model  Use the remaining part to test the quality of the clustering  E.g., for each point in the test set, find the closest centroid, and use the sum of squared distances between all points in the test set and their closest centroids to measure how well the model fits the test set  For any k > 0, repeat it m times, compare the overall quality measure w.r.t. different k’s, and find the # of clusters that fits the data best 694
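A quick elbow-method sketch (assuming scikit-learn's KMeans; the synthetic three-cluster data is hypothetical): the within-cluster sum of squares drops sharply until k reaches the true number of clusters and then flattens, and that turning point suggests k.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data with three groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])

# Elbow method: inspect (or plot) the within-cluster sum of squares as k grows
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))   # inertia_ = sum of squared distances to closest centroid
```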
    Measuring Clustering Quality Two methods: extrinsic vs. intrinsic  Extrinsic: supervised, i.e., the ground truth is available  Compare a clustering against the ground truth using certain clustering quality measure  Ex. BCubed precision and recall metrics  Intrinsic: unsupervised, i.e., the ground truth is unavailable  Evaluate the goodness of a clustering by considering how well the clusters are separated, and how compact the clusters are  Ex. Silhouette coefficient 695
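For the intrinsic case, a minimal silhouette-coefficient example (scikit-learn assumed; the synthetic two-cluster data is hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(60, 2)) for c in ([0, 0], [4, 4])])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Silhouette coefficient in [-1, 1]: values near 1 indicate compact, well-separated clusters
print(silhouette_score(X, labels))
```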
    Measuring Clustering Quality:Extrinsic Methods  Clustering quality measure: Q(C, Cg), for a clustering C given the ground truth Cg.  Q is good if it satisfies the following 4 essential criteria  Cluster homogeneity: the purer, the better  Cluster completeness: should assign objects belong to the same category in the ground truth to the same cluster  Rag bag: putting a heterogeneous object into a pure cluster should be penalized more than putting it into a rag bag (i.e., “miscellaneous” or “other” category)  Small cluster preservation: splitting a small category into pieces is more harmful than splitting a large category into pieces 696
697 Chapter 10. Cluster Analysis: Basic Concepts and Methods  Cluster Analysis: Basic Concepts  Partitioning Methods  Hierarchical Methods  Density-Based Methods  Grid-Based Methods  Evaluation of Clustering  Summary 697
    Summary  Cluster analysisgroups objects based on their similarity and has wide applications  Measure of similarity can be computed for various types of data  Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods  K-means and K-medoids algorithms are popular partitioning-based clustering algorithms  Birch and Chameleon are interesting hierarchical clustering algorithms, and there are also probabilistic hierarchical clustering algorithms  DBSCAN, OPTICS, and DENCLU are interesting density-based algorithms  STING and CLIQUE are grid-based methods, where CLIQUE is also a subspace clustering algorithm  Quality of clustering results can be evaluated in various ways 698
  • 698.
    699 CS512-Spring 2011: AnIntroduction  Coverage  Cluster Analysis: Chapter 11  Outlier Detection: Chapter 12  Mining Sequence Data: BK2: Chapter 8  Mining Graphs Data: BK2: Chapter 9  Social and Information Network Analysis  BK2: Chapter 9  Partial coverage: Mark Newman: “Networks: An Introduction”, Oxford U., 2010  Scattered coverage: Easley and Kleinberg, “Networks, Crowds, and Markets: Reasoning About a Highly Connected World”, Cambridge U., 2010  Recent research papers  Mining Data Streams: BK2: Chapter 8  Requirements  One research project  One class presentation (15 minutes)  Two homeworks (no programming assignment)  Two midterm exams (no final exam)
References (1)  R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD'98  M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973  M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. SIGMOD'99  F. Beil, M. Ester, and X. Xu. Frequent term-based text clustering. KDD'02  M. M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander. LOF: Identifying density-based local outliers. SIGMOD'00  M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96  M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. SSD'95  D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987  D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. VLDB'98  V. Ganti, J. Gehrke, and R. Ramakrishnan. CACTUS: Clustering categorical data using summaries. KDD'99 700
References (2)  D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. VLDB'98  S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. SIGMOD'98  S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. ICDE'99, pp. 512-521, Sydney, Australia, March 1999  A. Hinneburg and D. A. Keim. An efficient approach to clustering in large multimedia databases with noise. KDD'98  A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988  G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. COMPUTER, 32(8): 68-75, 1999  L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990  E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB'98 701
References (3)  G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley & Sons, 1988  R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. VLDB'94  L. Parsons, E. Haque, and H. Liu. Subspace clustering for high dimensional data: A review. SIGKDD Explorations, 6(1), June 2004  E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets. Proc. 1996 Int. Conf. on Pattern Recognition  G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. VLDB'98  A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-based clustering in large databases. ICDT'01  A. K. H. Tung, J. Hou, and J. Han. Spatial clustering in the presence of obstacles. ICDE'01  H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by pattern similarity in large data sets. SIGMOD'02  W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. VLDB'97  T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. SIGMOD'96  X. Yin, J. Han, and P. S. Yu. LinkClus: Efficient clustering via heterogeneous semantic links. VLDB'06 702
704 A Typical K-Medoids Algorithm (PAM) (K = 2)  Arbitrarily choose k objects as the initial medoids  Assign each remaining object to the nearest medoid (total cost = 20)  Randomly select a non-medoid object, Orandom, and compute the total cost of swapping (e.g., total cost = 26)  If swapping a medoid O with Orandom improves the quality, perform the swap  Loop until no change
    705 PAM (Partitioning AroundMedoids) (1987)  PAM (Kaufman and Rousseeuw, 1987), built in Splus  Use real object to represent the cluster  Select k representative objects arbitrarily  For each pair of non-selected object h and selected object i, calculate the total swapping cost TCih  For each pair of i and h,  If TCih < 0, i is replaced by h  Then assign each non-selected object to the most similar representative object  repeat steps 2-3 until there is no change
    706 PAM Clustering: Findingthe Best Cluster Center  Case 1: p currently belongs to oj. If oj is replaced by orandom as a representative object and p is the closest to one of the other representative object oi, then p is reassigned to oi
    707 What Is theProblem with PAM?  Pam is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean  Pam works efficiently for small data sets but does not scale well for large data sets.  O(k(n-k)2 ) for each iteration where n is # of data,k is # of clusters  Sampling-based method CLARA(Clustering LARge Applications)
    708 CLARA (Clustering LargeApplications) (1990)  CLARA (Kaufmann and Rousseeuw in 1990)  Built in statistical analysis packages, such as SPlus  It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output  Strength: deals with larger data sets than PAM  Weakness:  Efficiency depends on the sample size  A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
    709 CLARANS (“Randomized” CLARA)(1994)  CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han’94)  Draws sample of neighbors dynamically  The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids  If the local optimum is found, it starts with new randomly selected node in search for a new local optimum  Advantages: More efficient and scalable than both PAM and CLARA  Further improvement: Focusing techniques and spatial access structures (Ester et al.’95)
    710 ROCK: Clustering CategoricalData  ROCK: RObust Clustering using linKs  S. Guha, R. Rastogi & K. Shim, ICDE’99  Major ideas  Use links to measure similarity/proximity  Not distance-based  Algorithm: sampling-based clustering  Draw random sample  Cluster with links  Label data in disk  Experiments  Congressional voting, mushroom data
711 Similarity Measure in ROCK  Traditional measures for categorical data may not work well, e.g., the Jaccard coefficient  Example: Two groups (clusters) of transactions  C1. <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}  C2. <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}  The Jaccard coefficient may lead to a wrong clustering result  C1: 0.2 ({a, b, c}, {b, d, e}) to 0.5 ({a, b, c}, {a, b, d})  C1 & C2: could be as high as 0.5 ({a, b, c}, {a, b, f})  Jaccard coefficient-based similarity function: Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|  Ex. Let T1 = {a, b, c}, T2 = {c, d, e}: Sim(T1, T2) = |{c}| / |{a, b, c, d, e}| = 1/5 = 0.2
    712 Link Measure inROCK  Clusters  C1:<a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}  C2: <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}  Neighbors  Two transactions are neighbors if sim(T1,T2) > threshold  Let T1 = {a, b, c}, T2 = {c, d, e}, T3 = {a, b, f}  T1 connected to: {a,b,d}, {a,b,e}, {a,c,d}, {a,c,e}, {b,c,d}, {b,c,e}, {a,b,f}, {a,b,g}  T2 connected to: {a,c,d}, {a,c,e}, {a,d,e}, {b,c,e}, {b,d,e}, {b,c,d}  T3 connected to: {a,b,c}, {a,b,d}, {a,b,e}, {a,b,g}, {a,f,g}, {b,f,g}  Link Similarity  Link similarity between two transactions is the # of common neighbors  link(T1, T2) = 4, since they have 4 common neighbors  {a, c, d}, {a, c, e}, {b, c, d}, {b, c, e}  link(T1, T3) = 3, since they have 3 common neighbors  {a, b, d}, {a, b, e}, {a, b, g}
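A small sketch of the neighbor and link computations above (plain Python); the neighbor threshold is an assumed value chosen so that Jaccard = 0.5 pairs count as neighbors while 0.2 pairs do not, which reproduces link(T1, T2) = 4 and link(T1, T3) = 3 from the example.

```python
# Transactions from the two example groups
C1 = [frozenset(s) for s in ({'a','b','c'}, {'a','b','d'}, {'a','b','e'}, {'a','c','d'},
                             {'a','c','e'}, {'a','d','e'}, {'b','c','d'}, {'b','c','e'},
                             {'b','d','e'}, {'c','d','e'})]
C2 = [frozenset(s) for s in ({'a','b','f'}, {'a','b','g'}, {'a','f','g'}, {'b','f','g'})]
transactions = C1 + C2

def jaccard(t1, t2):
    return len(t1 & t2) / len(t1 | t2)

theta = 0.4  # assumed neighbor threshold: sim > theta
neighbors = {t: {u for u in transactions if u != t and jaccard(t, u) > theta}
             for t in transactions}

def link(t1, t2):
    """ROCK's link measure: the number of common neighbors of two transactions."""
    return len(neighbors[t1] & neighbors[t2])

T1, T2, T3 = frozenset('abc'), frozenset('cde'), frozenset('abf')
print(link(T1, T2), link(T1, T3))  # 4 3
```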
714 Aggregation-Based Similarity Computation  In two SimTrees ST1 and ST2, for each node nk ∈ {n10, n11, n12} and nl ∈ {n13, n14}, their path-based similarity is simp(nk, nl) = s(nk, n4) · s(n4, n5) · s(n5, nl)  Aggregating over the two sibling groups: sim(a, b) = (Σk=10..12 s(nk, n4) / 3) · s(n4, n5) · (Σl=13..14 s(n5, nl) / 2) = 0.171, which takes O(3+2) time  After aggregation, we reduce quadratic-time computation to linear-time computation
Computing Similarity with Aggregation  To compute sim(na, nb):  Find all pairs of sibling nodes ni and nj such that na is linked with ni and nb with nj  Calculate the similarity (and weight) between na and nb w.r.t. ni and nj  Calculate the weighted average similarity between na and nb w.r.t. all such pairs  Using the (average similarity, total weight) pairs from the figure, a: (0.9, 3) and b: (0.95, 2): sim(na, nb) = avg_sim(na, n4) × s(n4, n5) × avg_sim(nb, n5) = 0.9 × 0.2 × 0.95 = 0.171  sim(na, nb) can be computed from aggregated similarities 715
716 Chapter 10. Cluster Analysis: Basic Concepts and Methods  Cluster Analysis: Basic Concepts  Overview of Clustering Methods  Partitioning Methods  Hierarchical Methods  Density-Based Methods  Grid-Based Methods  Summary 716
Link-Based Clustering: Calculate Similarities Based on Links  Jeh & Widom, KDD’2002: SimRank  Two objects are similar if they are linked with the same or similar objects  The similarity between two objects x and y is defined as the average similarity between the objects linked with x and those linked with y: sim(a, b) = (C / (|I(a)| |I(b)|)) Σi=1..|I(a)| Σj=1..|I(b)| sim(Ii(a), Ij(b))  Example: authors (Tom, Mike, Cathy, John, Mary) linked to proceedings (sigmod03–05, vldb03–05, aaai04–05), which are linked to conferences (sigmod, vldb, aaai)  Issue: expensive to compute  For a dataset of N objects and M links, it takes O(N²) space and O(M²) time to compute all similarities 717
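A naive SimRank iteration sketch (plain Python, undirected links, decay constant C = 0.8 assumed; the tiny author–venue edge list is hypothetical), which also makes the quadratic all-pairs cost visible:

```python
from collections import defaultdict

def simrank(edges, C=0.8, iters=5):
    """Naive SimRank on an undirected graph given as (u, v) pairs.
    sim(a, b) = C / (|I(a)||I(b)|) * sum over neighbor pairs of sim(i, j)."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    nodes = list(nbrs)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iters):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                elif nbrs[a] and nbrs[b]:
                    s = sum(sim[(i, j)] for i in nbrs[a] for j in nbrs[b])
                    new[(a, b)] = C * s / (len(nbrs[a]) * len(nbrs[b]))
                else:
                    new[(a, b)] = 0.0
        sim = new
    return sim

# Hypothetical author-venue links in the spirit of the slide's example
edges = [("Tom", "sigmod03"), ("Mike", "sigmod03"), ("Cathy", "vldb03"), ("Tom", "vldb03")]
sim = simrank(edges)
print(sim[("Tom", "Mike")], sim[("Mike", "Cathy")])
```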
Observation 1: Hierarchical Structures  Hierarchical structures often exist naturally among objects (e.g., a taxonomy of animals)  Example: a hierarchical structure of products in Walmart (all → electronics / grocery / apparel → DVD, camera, TV, …)  Example: relationships between articles and words (Chakrabarti, Papadimitriou, Modha, Faloutsos, 2004) 718
Observation 2: Distribution of Similarity  A power-law distribution exists in similarities (distribution of SimRank similarities among DBLP authors)  56% of similarity entries are in [0.005, 0.015]  1.4% of similarity entries are larger than 0.1  Can we design a data structure that stores the significant similarities and compresses insignificant ones? 719
    A Novel DataStructure: SimTree Each leaf node represents an object Each non-leaf node represents a group of similar lower-level nodes Similarities between siblings are stored Consumer electronics Apparels Canon A40 digital camera Sony V3 digital camera Digital Cameras TVs 720
Similarity Defined by SimTree  Path-based node similarity: simp(n7, n8) = s(n7, n4) × s(n4, n5) × s(n5, n8)  Similarity between two nodes is the average similarity between objects linked with them in other SimTrees  Similarities between sibling nodes (e.g., s(n1, n2)) are stored in the tree  Adjustment ratio for a node x = (average similarity between x and all other nodes) / (average similarity between x’s parent and all other nodes) 721
    LinkClus: Efficient Clusteringvia Heterogeneous Semantic Links Method  Initialize a SimTree for objects of each type  Repeat until stable  For each SimTree, update the similarities between its nodes using similarities in other SimTrees  Similarity between two nodes x and y is the average similarity between objects linked with them  Adjust the structure of each SimTree  Assign each node to the parent node that it is most similar to For details: X. Yin, J. Han, and P. S. Yu, “LinkClus: Efficient Clustering via Heterogeneous Semantic Links”, VLDB'06 722
    Initialization of SimTrees Initializing a SimTree  Repeatedly find groups of tightly related nodes, which are merged into a higher-level node  Tightness of a group of nodes  For a group of nodes {n1, …, nk}, its tightness is defined as the number of leaf nodes in other SimTrees that are connected to all of {n1, …, nk} n1 1 2 3 4 5 n2 The tightness of {n1, n2} is 3 Nodes Leaf nodes in another SimTree 723
    Finding Tight Groupsby Freq. Pattern Mining  Finding tight groups Frequent pattern mining  Procedure of initializing a tree  Start from leaf nodes (level-0)  At each level l, find non-overlapping groups of similar nodes with frequent pattern mining Reduced to g1 g2 {n1} {n1, n2} {n2} {n1, n2} {n1, n2} {n2, n3, n4} {n4} {n3, n4} {n3, n4} Transactions n1 1 2 3 4 5 6 7 8 9 n2 n3 n4 The tightness of a group of nodes is the support of a frequent pattern 724
    Adjusting SimTree Structures After similarity changes, the tree structure also needs to be changed  If a node is more similar to its parent’s sibling, then move it to be a child of that sibling  Try to move each node to its parent’s sibling that it is most similar to, under the constraint that each parent node can have at most c children n1 n2 n4 n5 n6 n3 n7 n9 n8 0.8 0.9 n7 725
Complexity (for two types of objects, N of each, and M linkages between them)  Updating similarities: O(M (log N)²) time, O(M + N) space  Adjusting tree structures: O(N) time, O(N) space  LinkClus overall: O(M (log N)²) time, O(M + N) space  SimRank: O(M²) time, O(N²) space 726
Experiment: Email Dataset  F. Nielsen. Email dataset. www.imm.dtu.dk/~rem/data/Email-1431.zip  370 emails on conferences, 272 on jobs, and 789 spam emails  Accuracy: measured by manually labeled data (% of pairs of objects in the same cluster that share a common label)  Results (accuracy, time in seconds): LinkClus 0.8026, 1579.6; SimRank 0.7965, 39160; ReCom 0.5711, 74.6; F-SimRank 0.3688, 479.7; CLARANS 0.4768, 8.55  Approaches compared:  SimRank (Jeh & Widom, KDD 2002): computing pair-wise similarities  SimRank with fingerprints (F-SimRank): Fogaras & Rácz, WWW 2005; pre-computes a large sample of random paths from each object and uses the samples of two objects to estimate their SimRank similarity  ReCom (Wang et al., SIGIR 2003): iteratively clustering objects using the cluster labels of linked objects 727
    WaveCluster: Clustering byWavelet Analysis (1998)  Sheikholeslami, Chatterjee, and Zhang (VLDB’98)  A multi-resolution clustering approach which applies wavelet transform to the feature space; both grid-based and density-based  Wavelet transform: A signal processing technique that decomposes a signal into different frequency sub-band  Data are transformed to preserve relative distance between objects at different levels of resolution  Allows natural clusters to become more distinguishable 728
    The WaveCluster Algorithm How to apply wavelet transform to find clusters  Summarizes the data by imposing a multidimensional grid structure onto data space  These multidimensional spatial data objects are represented in a n-dimensional feature space  Apply wavelet transform on feature space to find the dense regions in the feature space  Apply wavelet transform multiple times which result in clusters at different scales from fine to coarse  Major features:  Complexity O(N)  Detect arbitrary shaped clusters at different scales  Not sensitive to noise, not sensitive to input order  Only applicable to low dimensional data 729
730 Quantization & Transformation  Quantize data into an m-D grid structure, then apply the wavelet transform  a) scale 1: high resolution  b) scale 2: medium resolution  c) scale 3: low resolution
    731 Data Mining: Concepts andTechniques (3rd ed.) — Chapter 11 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2011 Han, Kamber & Pei. All rights reserved. 731
732 Review: Basic Cluster Analysis Methods (Chap. 10)  Cluster Analysis: Basic Concepts  Group data so that object similarity is high within clusters but low across clusters  Partitioning Methods  K-means and k-medoids algorithms and their refinements  Hierarchical Methods  Agglomerative and divisive methods, BIRCH, CHAMELEON  Density-Based Methods  DBSCAN, OPTICS, and DenClue  Grid-Based Methods  STING and CLIQUE (subspace clustering)  Evaluation of Clustering  Assess clustering tendency, determine # of clusters, and measure clustering quality 732
    K-Means Clustering K=2 Arbitrarily partition objects into k groups Update thecluster centroids Update the cluster centroids Reassign objects Loop if needed 733 The initial data set  Partition objects into k nonempty subsets  Repeat  Compute centroid (i.e., mean point) for each partition  Assign each object to the cluster of its nearest centroid  Until no change
    Hierarchical Clustering  Usedistance matrix as clustering criteria. This method does not require the number of clusters k as an input, but needs a termination condition Step 0 Step 1 Step 2 Step 3 Step 4 b d c e a a b d e c d e a b c d e Step 4 Step 3 Step 2 Step 1 Step 0 agglomerative (AGNES) divisive (DIANA) 734
    Distance between Clusters Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)  Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)  Average: avg distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)  Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)  Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)  Medoid: a chosen, centrally located object in the cluster X X 735
BIRCH and the Clustering Feature (CF) Tree Structure  [Figure: a CF tree with branching factor B = 7 and maximum leaf entries L = 6; the root and non-leaf nodes hold CF entries with child pointers, and leaf nodes hold CF entries chained by prev/next pointers]  Example: for the points (3,4), (2,6), (4,5), (4,7), (3,8), CF = (5, (16,30), (54,190)) 736
    Overall Framework ofCHAMELEON Construct (K-NN) Sparse Graph Partition the Graph Merge Partition Final Clusters Data Set K-NN Graph P and q are connected if q is among the top k closest neighbors of p Relative interconnectivity: connectivity of c1 and c2 over internal connectivity Relative closeness: closeness of c1 and c2 over internal closeness 737
    Density-Based Clustering: DBSCAN Two parameters:  Eps: Maximum radius of the neighbourhood  MinPts: Minimum number of points in an Eps- neighbourhood of that point  NEps(p): {q belongs to D | dist(p,q) ≤ Eps}  Directly density-reachable: A point p is directly density- reachable from a point q w.r.t. Eps, MinPts if  p belongs to NEps(q)  core point condition: |NEps (q)| ≥ MinPts MinPts = 5 Eps = 1 cm p q 738
STING: A Statistical Information Grid Approach  Wang, Yang and Muntz (VLDB’97)  The spatial area is divided into rectangular cells  There are several levels of cells corresponding to different levels of resolution (1st layer, …, (i−1)-st layer, i-th layer) 741
    Evaluation of ClusteringQuality  Assessing Clustering Tendency  Assess if non-random structure exists in the data by measuring the probability that the data is generated by a uniform data distribution  Determine the Number of Clusters  Empirical method: # of clusters ≈√n/2  Elbow method: Use the turning point in the curve of sum of within cluster variance w.r.t # of clusters  Cross validation method  Measuring Clustering Quality  Extrinsic: supervised  Compare a clustering against the ground truth using certain clustering quality measure  Intrinsic: unsupervised  Evaluate the goodness of a clustering by considering how well the clusters are separated, and how compact the clusters are 742
    743 Outline of AdvancedClustering Analysis  Probability Model-Based Clustering  Each object may take a probability to belong to a cluster  Clustering High-Dimensional Data  Curse of dimensionality: Difficulty of distance measure in high-D space  Clustering Graphs and Network Data  Similarity measurement and clustering methods for graph and networks  Clustering with Constraints  Cluster analysis under different kinds of constraints, e.g., that raised from background knowledge or spatial distribution of the objects
744 Chapter 11. Cluster Analysis: Advanced Methods  Probability Model-Based Clustering  Clustering High-Dimensional Data  Clustering Graphs and Network Data  Clustering with Constraints  Summary 744
    Fuzzy Set andFuzzy Cluster  Clustering methods discussed so far  Every data object is assigned to exactly one cluster  Some applications may need for fuzzy or soft cluster assignment  Ex. An e-game could belong to both entertainment and software  Methods: fuzzy clusters and probabilistic model-based clusters  Fuzzy cluster: A fuzzy set S: FS : X → [0, 1] (value between 0 and 1)  Example: Popularity of cameras is defined as a fuzzy mapping  Then, A(0.05), B(1), C(0.86), D(0.27) 745
Fuzzy (Soft) Clustering  Example: let the cluster features be  C1: “digital camera” and “lens”  C2: “computer”  Fuzzy clustering  k fuzzy clusters C1, …, Ck, represented as a partition matrix M = [wij]  P1: for each object oi and cluster Cj, 0 ≤ wij ≤ 1 (fuzzy set)  P2: for each object oi, Σj=1..k wij = 1 (equal participation in the clustering)  P3: for each cluster Cj, 0 < Σi=1..n wij < n (ensures there is no empty cluster)  Let c1, …, ck be the centers of the k clusters  For an object oi, the sum of squared error (SSE), with parameter p: SSE(oi) = Σj=1..k wij^p dist(oi, cj)²  For a cluster Cj: SSE(Cj) = Σi=1..n wij^p dist(oi, cj)²  Measure how well a clustering fits the data: SSE(C) = Σi=1..n Σj=1..k wij^p dist(oi, cj)² 746
Probabilistic Model-Based Clustering  Cluster analysis is to find hidden categories  A hidden category (i.e., probabilistic cluster) is a distribution over the data space, which can be mathematically represented using a probability density function (or distribution function)  Ex. 2 categories for digital cameras sold  consumer line vs. professional line  density functions f1, f2 for C1, C2  obtained by probabilistic clustering  A mixture model assumes that a set of observed objects is a mixture of instances from multiple probabilistic clusters, and conceptually each observed object is generated independently  Our task: infer a set of k probabilistic clusters that is most likely to generate D using the above data generation process 747
    748 Model-Based Clustering  Aset C of k probabilistic clusters C1, …,Ck with probability density functions f1, …, fk, respectively, and their probabilities ω1, …, ωk.  Probability of an object o generated by cluster Cj is  Probability of o generated by the set of cluster C is  Since objects are assumed to be generated independently, for a data set D = {o1, …, on}, we have,  Task: Find a set C of k probabilistic clusters s.t. P(D|C) is maximized  However, maximizing P(D|C) is often intractable since the probability density function of a cluster can take an arbitrarily complicated form  To make it computationally feasible (as a compromise), assume the probability density functions being some parameterized distributions
    749 Univariate Gaussian MixtureModel  O = {o1, …, on} (n observed objects), Θ = {θ1, …, θk} (parameters of the k distributions), and Pj(oi| θj) is the probability that oi is generated from the j-th distribution using parameter θj, we have  Univariate Gaussian mixture model  Assume the probability density function of each cluster follows a 1- d Gaussian distribution. Suppose that there are k clusters.  The probability density function of each cluster are centered at μj with standard deviation σj, θj, = (μj, σj), we have
    The EM (ExpectationMaximization) Algorithm  The k-means algorithm has two steps at each iteration:  Expectation Step (E-step): Given the current cluster centers, each object is assigned to the cluster whose center is closest to the object: An object is expected to belong to the closest cluster  Maximization Step (M-step): Given the cluster assignment, for each cluster, the algorithm adjusts the center so that the sum of distance from the objects assigned to this cluster and the new center is minimized  The (EM) algorithm: A framework to approach maximum likelihood or maximum a posteriori estimates of parameters in statistical models.  E-step assigns objects to clusters according to the current fuzzy clustering or parameters of probabilistic clusters  M-step finds the new clustering or parameters that maximize the sum of squared error (SSE) or the expected likelihood 750
    Fuzzy Clustering Usingthe EM Algorithm  Initially, let c1 = a and c2 = b  1st E-step: assign o to c1,w. wt =   1st M-step: recalculate the centroids according to the partition matrix, minimizing the sum of squared error (SSE)  Iteratively calculate this until the cluster centers converge or the change is small enough
    752 Univariate Gaussian MixtureModel  O = {o1, …, on} (n observed objects), Θ = {θ1, …, θk} (parameters of the k distributions), and Pj(oi| θj) is the probability that oi is generated from the j-th distribution using parameter θj, we have  Univariate Gaussian mixture model  Assume the probability density function of each cluster follows a 1- d Gaussian distribution. Suppose that there are k clusters.  The probability density function of each cluster are centered at μj with standard deviation σj, θj, = (μj, σj), we have
753 Computing Mixture Models with EM  Given n objects O = {o1, …, on}, we want to mine a set of parameters Θ = {θ1, …, θk} s.t. P(O|Θ) is maximized, where θj = (μj, σj) are the mean and standard deviation of the j-th univariate Gaussian distribution  We initially assign random values to the parameters θj, then iteratively conduct the E- and M-steps until convergence or until the change is sufficiently small  At the E-step, for each object oi, calculate the probability that oi belongs to each distribution  At the M-step, adjust the parameters θj = (μj, σj) so that the expected likelihood P(O|Θ) is maximized
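A minimal EM sketch for a univariate Gaussian mixture (Python/NumPy assumed; equal mixing weights are assumed for brevity, and the synthetic data is hypothetical):

```python
import numpy as np

def em_gmm_1d(o, k=2, iters=100, seed=0):
    """EM sketch for a 1-D Gaussian mixture with equal cluster weights.
    Returns the means and standard deviations of the k components."""
    rng = np.random.default_rng(seed)
    o = np.asarray(o, dtype=float)
    mu = rng.choice(o, size=k, replace=False)   # random initial means
    sigma = np.full(k, o.std() + 1e-6)
    for _ in range(iters):
        # E-step: responsibility of each component j for each object oi
        dens = np.array([
            np.exp(-(o - mu[j]) ** 2 / (2 * sigma[j] ** 2)) / (np.sqrt(2 * np.pi) * sigma[j])
            for j in range(k)
        ])                                      # shape (k, n)
        resp = dens / dens.sum(axis=0, keepdims=True)
        # M-step: re-estimate each component's parameters from the weighted objects
        for j in range(k):
            w = resp[j]
            mu[j] = (w * o).sum() / w.sum()
            sigma[j] = np.sqrt((w * (o - mu[j]) ** 2).sum() / w.sum()) + 1e-6
    return mu, sigma

data = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(8, 1.5, 200)])
mu, sigma = em_gmm_1d(data, k=2)
```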
Advantages and Disadvantages of Mixture Models  Strength  Mixture models are more general than partitioning and fuzzy clustering  Clusters can be characterized by a small number of parameters  The results may satisfy the statistical assumptions of the generative models  Weakness  Converges to a local optimum (overcome: run multiple times with random initialization)  Computationally expensive if the number of distributions is large, or the data set contains very few observed data points  Needs large data sets  Hard to estimate the number of clusters 754
755 Chapter 11. Cluster Analysis: Advanced Methods  Probability Model-Based Clustering  Clustering High-Dimensional Data  Clustering Graphs and Network Data  Clustering with Constraints  Summary 755
    756 Clustering High-Dimensional Data Clustering high-dimensional data (How high is high-D in clustering?)  Many applications: text documents, DNA micro-array data  Major challenges:  Many irrelevant dimensions may mask clusters  Distance measure becomes meaningless—due to equi-distance  Clusters may exist only in some subspaces  Methods  Subspace-clustering: Search for clusters existing in subspaces of the given high dimensional data space  CLIQUE, ProClus, and bi-clustering approaches  Dimensionality reduction approaches: Construct a much lower dimensional space and search for clusters there (may construct new dimensions by combining some dimensions in the original data)  Dimensionality reduction methods and spectral clustering
    Traditional Distance MeasuresMay Not Be Effective on High-D Data  Traditional distance measure could be dominated by noises in many dimensions  Ex. Which pairs of customers are more similar?  By Euclidean distance, we get,  despite Ada and Cathy look more similar  Clustering should not only consider dimensions but also attributes (features)  Feature transformation: effective if most dimensions are relevant (PCA & SVD useful when features are highly correlated/redundant)  Feature selection: useful to find a subspace where the data have nice clusters 757
    758 The Curse ofDimensionality (graphs adapted from Parsons et al. KDD Explorations 2004)  Data in only one dimension is relatively packed  Adding a dimension “stretch” the points across that dimension, making them further apart  Adding more dimensions will make the points further apart—high dimensional data is extremely sparse  Distance measure becomes meaningless—due to equi-distance
    759 Why Subspace Clustering? (adaptedfrom Parsons et al. SIGKDD Explorations 2004)  Clusters may exist only in some subspaces  Subspace-clustering: find clusters in all the subspaces
    Subspace Clustering Methods Subspace search methods: Search various subspaces to find clusters  Bottom-up approaches  Top-down approaches  Correlation-based clustering methods  E.g., PCA based approaches  Bi-clustering methods  Optimization-based methods  Enumeration methods
    Subspace Clustering Method(I): Subspace Search Methods  Search various subspaces to find clusters  Bottom-up approaches  Start from low-D subspaces and search higher-D subspaces only when there may be clusters in such subspaces  Various pruning techniques to reduce the number of higher-D subspaces to be searched  Ex. CLIQUE (Agrawal et al. 1998)  Top-down approaches  Start from full space and search smaller subspaces recursively  Effective only if the locality assumption holds: restricts that the subspace of a cluster can be determined by the local neighborhood  Ex. PROCLUS (Aggarwal et al. 1999): a k-medoid-like method 761
762 CLIQUE: Subspace Clustering with Apriori Pruning  [Figure: dense units found in the (age, salary) and (age, vacation) subspaces for a density threshold τ = 3, and their intersection in the 3-D (age, vacation, salary) space]
    Subspace Clustering Method(II): Correlation-Based Methods  Subspace search method: similarity based on distance or density  Correlation-based method: based on advanced correlation models  Ex. PCA-based approach:  Apply PCA (for Principal Component Analysis) to derive a set of new, uncorrelated dimensions,  then mine clusters in the new space or its subspaces  Other space transformations:  Hough transform  Fractal dimensions 763
    Subspace Clustering Method(III): Bi-Clustering Methods  Bi-clustering: Cluster both objects and attributes simultaneously (treat objs and attrs in symmetric way)  Four requirements:  Only a small set of objects participate in a cluster  A cluster only involves a small number of attributes  An object may participate in multiple clusters, or does not participate in any cluster at all  An attribute may be involved in multiple clusters, or is not involved in any cluster at all 764  Ex 1. Gene expression or microarray data: a gene sample/condition matrix.  Each element in the matrix, a real number, records the expression level of a gene under a specific condition  Ex. 2. Clustering customers and products  Another bi-clustering problem
Types of Bi-clusters  Let A = {a1, ..., an} be a set of genes and B = {b1, …, bm} a set of conditions  A bi-cluster: a submatrix where genes and conditions follow consistent patterns  4 types of bi-clusters (ideal cases)  Bi-clusters with constant values:  for any i in I and j in J, eij = c  Bi-clusters with constant values on rows:  eij = c + αi  Similarly, there can be constant values on columns  Bi-clusters with coherent values (a.k.a. pattern-based clusters):  eij = c + αi + βj  Bi-clusters with coherent evolutions on rows:  (ei1j1 − ei1j2)(ei2j1 − ei2j2) ≥ 0 for any i1, i2 in I and j1, j2 in J  i.e., we are only interested in the up- or down-regulated changes across genes or conditions, without constraining the exact values 765
    Bi-Clustering Methods  Real-worlddata is noisy: Try to find approximate bi-clusters  Methods: Optimization-based methods vs. enumeration methods  Optimization-based methods  Try to find a submatrix at a time that achieves the best significance as a bi-cluster  Due to the cost in computation, greedy search is employed to find local optimal bi-clusters  Ex. δ-Cluster Algorithm (Cheng and Church, ISMB’2000)  Enumeration methods  Use a tolerance threshold to specify the degree of noise allowed in the bi-clusters to be mined  Then try to enumerate all submatrices as bi-clusters that satisfy the requirements  Ex. δ-pCluster Algorithm (H. Wang et al.’ SIGMOD’2002, MaPle: Pei et al., ICDM’2003) 766
    767 Bi-Clustering for Micro-ArrayData Analysis  Left figure: Micro-array “raw” data shows 3 genes and their values in a multi-D space: Difficult to find their patterns  Right two: Some subsets of dimensions form nice shift and scaling patterns  No globally defined similarity/distance measure  Clusters may not be exclusive  An object can appear in multiple clusters
Bi-Clustering (I): δ-Bi-Cluster  For a submatrix I x J, the mean of the i-th row: eiJ = (1/|J|) Σ_{j∈J} eij  The mean of the j-th column: eIj = (1/|I|) Σ_{i∈I} eij  The mean of all elements in the submatrix: eIJ = (1/(|I||J|)) Σ_{i∈I, j∈J} eij  The quality of the submatrix as a bi-cluster can be measured by the mean squared residue value H(I x J) = (1/(|I||J|)) Σ_{i∈I, j∈J} residue(eij)², where residue(eij) = eij − eiJ − eIj + eIJ  A submatrix I x J is a δ-bi-cluster if H(I x J) ≤ δ, where δ ≥ 0 is a threshold. When δ = 0, I x J is a perfect bi-cluster with coherent values. By setting δ > 0, a user can specify the tolerance of average noise per element against a perfect bi-cluster (see the sketch below) 768
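A minimal NumPy sketch of the mean squared residue computation above; the matrix E and the index lists I and J are illustrative.

import numpy as np

def mean_squared_residue(E, I, J):
    """Mean squared residue H(I x J) of the submatrix of E restricted to rows I and columns J."""
    sub = E[np.ix_(I, J)]                        # the submatrix I x J
    row_mean = sub.mean(axis=1, keepdims=True)   # e_iJ for each row i
    col_mean = sub.mean(axis=0, keepdims=True)   # e_Ij for each column j
    all_mean = sub.mean()                        # e_IJ
    residue = sub - row_mean - col_mean + all_mean
    return float((residue ** 2).mean())

# A submatrix is a delta-bi-cluster if H(I x J) <= delta
E = np.array([[1.0, 2.0, 3.0],
              [2.0, 3.0, 4.0],
              [4.0, 5.0, 6.0]])
print(mean_squared_residue(E, [0, 1, 2], [0, 1, 2]))   # 0.0: perfectly coherent values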
Bi-Clustering (I): The δ-Cluster Algorithm  A maximal δ-bi-cluster is a δ-bi-cluster I x J such that there does not exist another δ-bi-cluster I′ x J′ that contains I x J  Computing it exactly is costly: use heuristic greedy search to obtain locally optimal clusters  Two-phase computation: a deletion phase and an addition phase  Deletion phase: start from the whole matrix and iteratively remove rows and columns while the mean squared residue of the matrix is over δ  At each iteration, for each row/column, compute its mean squared residue, d(i) = (1/|J|) Σ_{j∈J} residue(eij)² for a row and d(j) = (1/|I|) Σ_{i∈I} residue(eij)² for a column  Remove the row or column with the largest mean squared residue  Addition phase:  Iteratively expand the δ-bi-cluster I x J obtained in the deletion phase as long as the δ-bi-cluster requirement is maintained  Consider all the rows/columns not involved in the current bi-cluster I x J by calculating their mean squared residues  The row/column with the smallest mean squared residue is added into the current δ-bi-cluster  The algorithm finds only one δ-bi-cluster per run, so it is run multiple times, replacing the elements of each output bi-cluster with random numbers before the next run 769
Bi-Clustering (II): δ-pCluster  Enumerating all bi-clusters (δ-pClusters) [H. Wang, et al., Clustering by pattern similarity in large data sets. SIGMOD'02]  A submatrix I x J is a bi-cluster with (perfect) coherent values iff ei1j1 − ei2j1 = ei1j2 − ei2j2 for every pair of rows and columns. For any 2 x 2 submatrix of I x J, define the p-score = |(ei1j1 − ei1j2) − (ei2j1 − ei2j2)|  A submatrix I x J is a δ-pCluster (pattern-based cluster) if the p-score of every 2 x 2 submatrix of I x J is at most δ, where δ ≥ 0 is a threshold specifying a user's tolerance of noise against a perfect bi-cluster (see the sketch below)  The p-score controls the noise on every element in a bi-cluster, while the mean squared residue captures the average noise  Monotonicity: if I x J is a δ-pCluster, every x x y (x, y ≥ 2) submatrix of I x J is also a δ-pCluster  A δ-pCluster is maximal if no more rows or columns can be added while retaining the δ-pCluster property: we only need to compute all maximal δ-pClusters 770
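A small brute-force sketch of the p-score test above; the function names and the example matrix are illustrative (a real miner such as MaPle enumerates clusters far more cleverly).

import numpy as np
from itertools import combinations

def p_score(block_2x2):
    """p-score of a 2 x 2 submatrix [[e11, e12], [e21, e22]]."""
    (e11, e12), (e21, e22) = block_2x2
    return abs((e11 - e12) - (e21 - e22))

def is_delta_pcluster(E, I, J, delta):
    """True if every 2 x 2 submatrix of E[I, J] has p-score <= delta (brute force)."""
    for i1, i2 in combinations(I, 2):
        for j1, j2 in combinations(J, 2):
            if p_score([[E[i1, j1], E[i1, j2]], [E[i2, j1], E[i2, j2]]]) > delta:
                return False
    return True

E = np.array([[1.0, 2.0, 3.0],
              [2.0, 3.0, 4.0],
              [4.0, 5.0, 6.1]])
print(is_delta_pcluster(E, [0, 1, 2], [0, 1, 2], delta=0.2))   # True: noise per 2x2 block is at most 0.1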
MaPle: Efficient Enumeration of δ-pClusters  Pei et al., MaPle: Efficiently enumerating all maximal δ-pClusters, ICDM'03  Framework: same as pattern growth in frequent pattern mining (based on the downward closure property)  For each condition combination J, find the maximal subsets of genes I such that I x J is a δ-pCluster  If I x J is not a submatrix of another δ-pCluster, then I x J is a maximal δ-pCluster  The algorithm is very similar to mining frequent closed itemsets  Additional advantages of δ-pClusters:  Due to the averaging in the δ-cluster model, a δ-cluster may contain outliers yet still stay within the δ-threshold  Computing bi-clusters for scaling patterns (dxa / dya = dxb / dyb): taking the logarithm on both sides leads to the p-score form 771
    Dimensionality-Reduction Methods  Dimensionalityreduction: In some situations, it is more effective to construct a new space instead of using some subspaces of the original data 772  Ex. To cluster the points in the right figure, any subspace of the original one, X and Y, cannot help, since all the three clusters will be projected into the overlapping areas in X and Y axes.  Construct a new dimension as the dashed one, the three clusters become apparent when the points projected into the new dimension  Dimensionality reduction methods  Feature selection and extraction: But may not focus on clustering structure finding  Spectral clustering: Combining feature extraction and clustering (i.e., use the spectrum of the similarity matrix of the data to perform dimensionality reduction for clustering in fewer dimensions)  Normalized Cuts (Shi and Malik, CVPR’97 or PAMI’2000)  The Ng-Jordan-Weiss algorithm (NIPS’01)
Spectral Clustering: The Ng-Jordan-Weiss (NJW) Algorithm  Given a set of objects o1, …, on and the distance between each pair of objects, dist(oi, oj), find the desired number k of clusters  Calculate an affinity matrix W, where Wij = exp(−dist(oi, oj)²/σ²) for i ≠ j, and σ is a scaling parameter that controls how fast the affinity Wij decreases as dist(oi, oj) increases. In NJW, set Wii = 0  Derive a matrix A = f(W). NJW defines D to be the diagonal matrix s.t. Dii is the sum of the i-th row of W, i.e., Dii = Σj Wij. Then, A is set to A = D^(−1/2) W D^(−1/2)  A spectral clustering method finds the k leading eigenvectors of A  A vector v is an eigenvector of matrix A if Av = λv, where λ is the corresponding eigenvalue  Using the k leading eigenvectors, project the original data into the new space they define, and run a clustering algorithm, such as k-means, to find k clusters (see the sketch below)  Assign the original data points to clusters according to how the transformed points are assigned 773
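A compact sketch of these steps with NumPy and scikit-learn's KMeans; the scaling parameter sigma, the choice of n_init, and the toy data are assumptions to be tuned.

import numpy as np
from sklearn.cluster import KMeans

def njw_spectral_clustering(X, k, sigma=1.0):
    # Affinity matrix: W_ij = exp(-dist(o_i, o_j)^2 / sigma^2), with W_ii = 0
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / sigma ** 2)
    np.fill_diagonal(W, 0.0)
    # A = D^(-1/2) W D^(-1/2)
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    A = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # k leading eigenvectors of A (eigh returns eigenvalues in ascending order)
    _, vecs = np.linalg.eigh(A)
    U = vecs[:, -k:]
    # Row-normalize and cluster the projected points with k-means
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)

# Example: two well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(3, 0.1, (20, 2))])
print(njw_spectral_clustering(X, k=2))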
    Spectral Clustering: Illustrationand Comments  Spectral clustering: Effective in tasks like image processing  Scalability challenge: Computing eigenvectors on a large matrix is costly  Can be combined with other clustering methods, such as bi-clustering 774
775 Chapter 11. Cluster Analysis: Advanced Methods  Probability Model-Based Clustering  Clustering High-Dimensional Data  Clustering Graphs and Network Data  Clustering with Constraints  Summary 775
    Clustering Graphs andNetwork Data  Applications  Bi-partite graphs, e.g., customers and products, authors and conferences  Web search engines, e.g., click through graphs and Web graphs  Social networks, friendship/coauthor graphs  Similarity measures  Geodesic distances  Distance based on random walk (SimRank)  Graph clustering methods  Minimum cuts: FastModularity (Clauset, Newman & Moore, 2004)  Density-based clustering: SCAN (Xu et al., KDD’2007) 776
Similarity Measure (I): Geodesic Distance  Geodesic distance(A, B): length (i.e., # of edges) of the shortest path between A and B (defined as infinite if they are not connected)  Eccentricity of v, eccen(v): the largest geodesic distance between v and any other vertex u ∈ V − {v}  E.g., eccen(a) = eccen(b) = 2; eccen(c) = eccen(d) = eccen(e) = 3  Radius of graph G: the minimum eccentricity of all vertices, i.e., the distance between the "most central point" and the "farthest border"  r = min_{v∈V} eccen(v)  E.g., radius(G) = 2  Diameter of graph G: the maximum eccentricity of all vertices, i.e., the largest distance between any pair of vertices in G  d = max_{v∈V} eccen(v)  E.g., diameter(G) = 3  A peripheral vertex is a vertex that achieves the diameter  E.g., vertices c, d, and e are peripheral vertices (see the sketch below) 777
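A plain-Python BFS sketch that computes these quantities for an adjacency-list graph; the example edge list is illustrative, not the slide's figure.

from collections import deque

def geodesic_distances(adj, source):
    """BFS shortest-path lengths (# of edges) from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def eccentricity(adj, v):
    d = geodesic_distances(adj, v)
    # infinite if the graph is not connected from v
    return max(d.values()) if len(d) == len(adj) else float("inf")

adj = {               # an illustrative undirected graph
    "a": ["b", "c"], "b": ["a", "c", "d"], "c": ["a", "b", "e"],
    "d": ["b"], "e": ["c"],
}
ecc = {v: eccentricity(adj, v) for v in adj}
radius = min(ecc.values())
diameter = max(ecc.values())
peripheral = [v for v, e in ecc.items() if e == diameter]
print(ecc, radius, diameter, peripheral)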
SimRank: Similarity Based on Random Walk and Structural Context  SimRank: structural-context similarity, i.e., two vertices are similar if their neighbors are similar  In a directed graph G = (V, E),  individual in-neighborhood of v: I(v) = {u | (u, v) ∈ E}  individual out-neighborhood of v: O(v) = {w | (v, w) ∈ E}  Similarity in SimRank: s(u, v) = (C / (|I(u)| |I(v)|)) Σ_{x∈I(u)} Σ_{y∈I(v)} s(x, y), where C ∈ (0, 1) is a decay constant, s(u, u) = 1, and s(u, v) = 0 if I(u) or I(v) is empty  Initialization: s0(u, v) = 1 if u = v, and 0 otherwise  Then we can compute si+1 from si based on the definition (see the sketch below)  Similarity based on random walk, in a strongly connected component, where P[t] is the probability of a tour t and l(t) its length:  Expected distance: d(u, v) = Σ P[t]·l(t) over all tours t from u to v  Expected meeting distance: m(u, v) = Σ P[t]·l(t) over all tours t that bring u and v to a common vertex  Expected meeting probability: p(u, v) = Σ P[t]·C^l(t) over the same tours 778
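A minimal sketch of the iterative SimRank computation on a small directed graph; the decay constant C, the iteration count, and the example edges are illustrative assumptions.

def simrank(nodes, edges, C=0.8, iters=10):
    """Iterative SimRank: s(u,v) = C/(|I(u)||I(v)|) * sum of s over pairs of in-neighbors."""
    in_nbrs = {v: [u for (u, w) in edges if w == v] for v in nodes}
    s = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}  # s_0
    for _ in range(iters):
        new_s = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new_s[(u, v)] = 1.0
                elif in_nbrs[u] and in_nbrs[v]:
                    total = sum(s[(x, y)] for x in in_nbrs[u] for y in in_nbrs[v])
                    new_s[(u, v)] = C * total / (len(in_nbrs[u]) * len(in_nbrs[v]))
                else:
                    new_s[(u, v)] = 0.0
        s = new_s
    return s

nodes = ["univ", "profA", "profB", "studentA", "studentB"]
edges = [("univ", "profA"), ("univ", "profB"),
         ("profA", "studentA"), ("profB", "studentB"),
         ("studentA", "univ"), ("studentB", "univ")]
s = simrank(nodes, edges)
print(round(s[("profA", "profB")], 3), round(s[("studentA", "studentB")], 3))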
Graph Clustering: Sparsest Cut  G = (V, E). For a cut C = (S, T) partitioning V into S and T, the cut set is the set of edges {(u, v) ∈ E | u ∈ S, v ∈ T}  Size of the cut: # of edges in the cut set  Min-cut (e.g., C1) is not a good partition  A better measure, sparsity: Φ(C) = (cut size) / min(|S|, |T|)  A cut is sparsest if its sparsity is not greater than that of any other cut  Ex. Cut C2 = ({a, b, c, d, e, f, l}, {g, h, i, j, k}) is the sparsest cut  For k clusters, the modularity of a clustering assesses its quality: Q = Σ_{i=1..k} (li/|E| − (di/(2|E|))²), where li is the # of edges between vertices in the i-th cluster and di is the sum of the degrees of the vertices in the i-th cluster  The modularity of a clustering of a graph is the difference between the fraction of all edges that fall within individual clusters and the fraction that would do so if the graph vertices were randomly connected  The optimal clustering of a graph maximizes the modularity (see the sketch below) 779
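A small sketch that evaluates a given partition with these two measures; the edge list and the two-way partition are illustrative.

def sparsity(edges, S, T):
    """Cut sparsity: (# of crossing edges) / min(|S|, |T|)."""
    cut = sum(1 for (u, v) in edges
              if (u in S and v in T) or (u in T and v in S))
    return cut / min(len(S), len(T))

def modularity(edges, clusters):
    """Q = sum_i ( l_i/|E| - (d_i / (2|E|))^2 ) for a list of vertex sets."""
    m = len(edges)
    Q = 0.0
    for c in clusters:
        l_i = sum(1 for (u, v) in edges if u in c and v in c)       # edges inside cluster i
        d_i = sum(1 for (u, v) in edges for x in (u, v) if x in c)  # degree sum of cluster i
        Q += l_i / m - (d_i / (2 * m)) ** 2
    return Q

edges = [("a", "b"), ("b", "c"), ("a", "c"),      # a triangle ...
         ("c", "d"),                              # ... weakly linked to ...
         ("d", "e"), ("e", "f"), ("d", "f")]      # ... another triangle
S, T = {"a", "b", "c"}, {"d", "e", "f"}
print(sparsity(edges, S, T))            # 1/3: a sparse cut
print(modularity(edges, [S, T]))        # fairly high: the two triangles form good clusters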
    Graph Clustering: Challengesof Finding Good Cuts  High computational cost  Many graph cut problems are computationally expensive  The sparsest cut problem is NP-hard  Need to tradeoff between efficiency/scalability and quality  Sophisticated graphs  May involve weights and/or cycles.  High dimensionality  A graph can have many vertices. In a similarity matrix, a vertex is represented as a vector (a row in the matrix) whose dimensionality is the number of vertices in the graph  Sparsity  A large graph is often sparse, meaning each vertex on average connects to only a small number of other vertices  A similarity matrix from a large sparse graph can also be sparse 780
    Two Approaches forGraph Clustering  Two approaches for clustering graph data  Use generic clustering methods for high-dimensional data  Designed specifically for clustering graphs  Using clustering methods for high-dimensional data  Extract a similarity matrix from a graph using a similarity measure  A generic clustering method can then be applied on the similarity matrix to discover clusters  Ex. Spectral clustering: approximate optimal graph cut solutions  Methods specific to graphs  Search the graph to find well-connected components as clusters  Ex. SCAN (Structural Clustering Algorithm for Networks)  X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, “SCAN: A Structural Clustering Algorithm for Networks”, KDD'07 781
    SCAN: Density-Based Clusteringof Networks  How many clusters?  What size should they be?  What is the best partitioning?  Should some points be segregated? 782 An Example Network  Application: Given simply information of who associates with whom, could one identify clusters of individuals with common interests or special relationships (families, cliques, terrorist cells)?
A Social Network Model  Cliques, hubs, and outliers  Individuals in a tight social group, or clique, know many of the same people, regardless of the size of the group  Individuals who are hubs know many people in different groups but belong to no single group. Politicians, for example, bridge multiple groups  Individuals who are outliers reside at the margins of society. Hermits, for example, know few people and belong to no group  The Neighborhood of a Vertex  Define Γ(v) as the immediate neighborhood of a vertex v (i.e., the set of people that an individual knows) 783
Structure Similarity  The desired features tend to be captured by a measure we call structural similarity: σ(v, w) = |Γ(v) ∩ Γ(w)| / √(|Γ(v)|·|Γ(w)|)  Structural similarity is large for members of a clique and small for hubs and outliers 784
Structural Connectivity [1]  ε-Neighborhood: Nε(v) = {w ∈ Γ(v) | σ(v, w) ≥ ε}  Core: COREε,μ(v) ⇔ |Nε(v)| ≥ μ  Direct structure reachable: DirREACHε,μ(v, w) ⇔ COREε,μ(v) ∧ w ∈ Nε(v)  Structure reachable: REACHε,μ(v, w), the transitive closure of direct structure reachability  Structure connected: CONNECTε,μ(v, w) ⇔ ∃u ∈ V: REACHε,μ(u, v) ∧ REACHε,μ(u, w)  [1] M. Ester, H.-P. Kriegel, J. Sander, & X. Xu (KDD'96), "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise" 785
Structure-Connected Clusters  A structure-connected cluster C satisfies  Connectivity: ∀v, w ∈ C: CONNECTε,μ(v, w)  Maximality: ∀v, w ∈ V: v ∈ C ∧ REACHε,μ(v, w) ⇒ w ∈ C  Hubs:  Do not belong to any cluster  Bridge many clusters  Outliers:  Do not belong to any cluster  Connect to fewer clusters  (Figure: an example network with a hub and an outlier marked; a similarity sketch follows) 786
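A short sketch of the structural similarity and the ε-neighborhood / core test that SCAN builds on; the toy graph, ε, and μ are illustrative assumptions (following the convention that Γ(v) includes v itself).

import math

def gamma(adj, v):
    """Immediate neighborhood of v, including v itself."""
    return set(adj[v]) | {v}

def sigma(adj, v, w):
    """Structural similarity: |Γ(v) ∩ Γ(w)| / sqrt(|Γ(v)| * |Γ(w)|)."""
    gv, gw = gamma(adj, v), gamma(adj, w)
    return len(gv & gw) / math.sqrt(len(gv) * len(gw))

def eps_neighborhood(adj, v, eps):
    return {w for w in gamma(adj, v) if sigma(adj, v, w) >= eps}

def is_core(adj, v, eps, mu):
    return len(eps_neighborhood(adj, v, eps)) >= mu

adj = {  # a small clique {a, b, c, d} plus a loosely attached vertex e
    "a": ["b", "c", "d"], "b": ["a", "c", "d"],
    "c": ["a", "b", "d"], "d": ["a", "b", "c", "e"], "e": ["d"],
}
print(round(sigma(adj, "a", "b"), 2))        # high: a and b share most neighbors
print(is_core(adj, "a", eps=0.7, mu=2))      # True inside the clique
print(is_core(adj, "e", eps=0.7, mu=2))      # False: e sits on the margin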
    Running Time  Runningtime = O(|E|)  For sparse networks = O(|V|) [2] A. Clauset, M. E. J. Newman, & C. Moore, Phys. Rev. E 70, 066111 (2004). 800
Chapter 11. Cluster Analysis: Advanced Methods  Probability Model-Based Clustering  Clustering High-Dimensional Data  Clustering Graphs and Network Data  Clustering with Constraints  Summary 801
802 Why Constraint-Based Cluster Analysis?  Need user feedback: users know their applications best  Fewer parameters but more user-desired constraints, e.g., an ATM allocation problem with obstacles and desired clusters
    803 Categorization of Constraints Constraints on instances: specifies how a pair or a set of instances should be grouped in the cluster analysis  Must-link vs. cannot link constraints  must-link(x, y): x and y should be grouped into one cluster  Constraints can be defined using variables, e.g.,  cannot-link(x, y) if dist(x, y) > d  Constraints on clusters: specifies a requirement on the clusters  E.g., specify the min # of objects in a cluster, the max diameter of a cluster, the shape of a cluster (e.g., a convex), # of clusters (e.g., k)  Constraints on similarity measurements: specifies a requirement that the similarity calculation must respect  E.g., driving on roads, obstacles (e.g., rivers, lakes)  Issues: Hard vs. soft constraints; conflicting or redundant constraints
804 Constraint-Based Clustering Methods (I): Handling Hard Constraints  Handling hard constraints: strictly respect the constraints in cluster assignments  Example: the COP-k-means algorithm  Generate super-instances for must-link constraints  Compute the transitive closure of the must-link constraints  To represent such a subset, replace all the objects in the subset by their mean  The super-instance also carries a weight, which is the number of objects it represents  Conduct modified k-means clustering to respect the cannot-link constraints  Modify the center-assignment process in k-means to a nearest feasible center assignment  An object is assigned to the nearest center such that the assignment respects all cannot-link constraints (see the sketch below)
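A sketch of the nearest-feasible-center assignment step described above; the cannot-link representation, tie handling, and the toy data are simplifying assumptions, not the full COP-k-means algorithm.

import numpy as np

def assign_nearest_feasible(points, centers, cannot_link):
    """Assign each point to its nearest center without violating cannot-link pairs.

    cannot_link: list of (i, j) index pairs that must not share a cluster.
    Returns None for a point if no feasible center exists (COP-k-means then fails).
    """
    n, assign = len(points), [None] * len(points)
    for i in range(n):
        # centers tried from nearest to farthest
        order = np.argsort(((centers - points[i]) ** 2).sum(axis=1))
        for c in order:
            conflict = any(assign[j] == c
                           for (a, b) in cannot_link
                           for j in ((b,) if a == i else (a,) if b == i else ()))
            if not conflict:
                assign[i] = int(c)
                break
    return assign

points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
# points 0 and 1 may not be clustered together
print(assign_nearest_feasible(points, centers, cannot_link=[(0, 1)]))   # [0, 1, 1]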
    Constraint-Based Clustering Methods(II): Handling Soft Constraints  Treated as an optimization problem: When a clustering violates a soft constraint, a penalty is imposed on the clustering  Overall objective: Optimizing the clustering quality, and minimizing the constraint violation penalty  Ex. CVQE (Constrained Vector Quantization Error) algorithm: Conduct k-means clustering while enforcing constraint violation penalties  Objective function: Sum of distance used in k-means, adjusted by the constraint violation penalties  Penalty of a must-link violation  If objects x and y must-be-linked but they are assigned to two different centers, c1 and c2, dist(c1, c2) is added to the objective function as the penalty  Penalty of a cannot-link violation  If objects x and y cannot-be-linked but they are assigned to a common center c, dist(c, c′), between c and c′ is added to the objective function as the penalty, where c′ is the closest cluster to c that can accommodate x or y 805
806 Speeding Up Constrained Clustering  Some constrained clusterings are costly to compute  Ex. clustering with obstacle objects: Tung, Hou, and Han, Spatial clustering in the presence of obstacles, ICDE'01  K-medoids is preferable, since k-means may locate an ATM center in the middle of a lake  Visibility graph and shortest path  Triangulation and micro-clustering  Two kinds of join indices (shortest paths) are worth pre-computing  VV index: indices for any pair of obstacle vertices  MV index: indices for any pair of micro-cluster and obstacle vertex
    807 An Example: ClusteringWith Obstacle Objects Taking obstacles into account Not Taking obstacles into account
808 User-Guided Clustering: A Special Kind of Constraints  (Figure: a multi-relational schema—Professor, Course, Open-course, Student, Register, Advise, Group, Work-In, Publication, Publish—with the target of clustering and the user-hint attribute marked)  X. Yin, J. Han, P. S. Yu, "Cross-Relational Clustering with User's Guidance", KDD'05  The user usually has a clustering goal, e.g., clustering students by research area  The user specifies this clustering goal to CrossClus
    809 Comparing with Classification User-specified feature (in the form of attribute) is used as a hint, not class labels  The attribute may contain too many or too few distinct values, e.g., a user may want to cluster students into 20 clusters instead of 3  Additional features need to be included in cluster analysis All tuples for clustering User hint
    810 Comparing with Semi-SupervisedClustering  Semi-supervised clustering: User provides a training set consisting of “similar” (“must-link) and “dissimilar” (“cannot link”) pairs of objects  User-guided clustering: User specifies an attribute as a hint, and more relevant features are found for clustering All tuples for clustering Semi-supervised clustering All tuples for clustering User-guided clustering x
811 Why Not Semi-Supervised Clustering?  Much information (in multiple relations) is needed to judge whether two tuples are similar  A user may not be able to provide a good training set  It is much easier for a user to specify an attribute as a hint, such as a student's research area  (Figure: two tuples to be compared, e.g., "Tom Smith, SC1211, TA" and "Jane Chang, BI205, RA", with the user-hint attribute marked)
    812 CrossClus: An Overview Measure similarity between features by how they group objects into clusters  Use a heuristic method to search for pertinent features  Start from user-specified feature and gradually expand search range  Use tuple ID propagation to create feature values  Features can be easily created during the expansion of search range, by propagating IDs  Explore three clustering algorithms: k-means, k-medoids, and hierarchical clustering
813 Multi-Relational Features  A multi-relational feature is defined by:  A join path, e.g., Student → Register → OpenCourse → Course  An attribute, e.g., Course.area  (For a numerical feature) an aggregation operator, e.g., sum or average  Categorical feature f = [Student → Register → OpenCourse → Course, Course.area, null]  Areas of the courses of each student (counts over DB, AI, TH): t1: (5, 5, 0); t2: (0, 3, 7); t3: (1, 5, 4); t4: (5, 0, 5); t5: (3, 3, 4)  Values of feature f (normalized, over DB, AI, TH): f(t1) = (0.5, 0.5, 0); f(t2) = (0, 0.3, 0.7); f(t3) = (0.1, 0.5, 0.4); f(t4) = (0.5, 0, 0.5); f(t5) = (0.3, 0.3, 0.4)
814 Representing Features  Similarity between tuples t1 and t2 w.r.t. a categorical feature f: the cosine similarity between vectors f(t1) and f(t2), sim_f(t1, t2) = Σ_{k=1..L} f(t1).pk·f(t2).pk / (√(Σ_{k=1..L} f(t1).pk²) · √(Σ_{k=1..L} f(t2).pk²))  The most important information of a feature f is how f groups tuples into clusters  f is represented by the similarities between every pair of tuples indicated by f  (In the figure, the horizontal axes are the tuple indices and the vertical axis is the similarity)  This can be considered as a similarity vector Vf of N x N dimensions
815 Similarity Between Features  Values of feature f (course: DB, AI, TH) and feature g (group: Info sys, Cog sci, Theory):  t1: f = (0.5, 0.5, 0), g = (1, 0, 0); t2: f = (0, 0.3, 0.7), g = (0, 0, 1); t3: f = (0.1, 0.5, 0.4), g = (0, 0.5, 0.5); t4: f = (0.5, 0, 0.5), g = (0.5, 0, 0.5); t5: f = (0.3, 0.3, 0.4), g = (0.5, 0.5, 0)  Similarity between two features: the cosine similarity of the two similarity vectors, sim(f, g) = Vf · Vg / (|Vf| |Vg|)
816 Computing Feature Similarity  Similarity between feature values w.r.t. the tuples: sim(fk, gq) = Σ_{i=1..N} f(ti).pk·g(ti).pq  Then Vf · Vg = Σ_{i=1..N} Σ_{j=1..N} sim_f(ti, tj)·sim_g(ti, tj) = Σ_{k=1..l} Σ_{q=1..m} sim(fk, gq)²  Tuple similarities are hard to compute directly; feature value similarities are easy to compute  Compute the similarity between each pair of feature values by one scan over the data (see the sketch below)
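A small NumPy sketch of this identity on the example feature values; here the tuple vectors are explicitly L2-normalized first (an assumption made so that the direct and the one-scan computations match exactly).

import numpy as np

# f(ti) over course areas (DB, AI, TH) and g(ti) over groups (Info sys, Cog sci, Theory)
F = np.array([[0.5, 0.5, 0.0], [0.0, 0.3, 0.7], [0.1, 0.5, 0.4],
              [0.5, 0.0, 0.5], [0.3, 0.3, 0.4]])
G = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]])

def tuple_similarity_matrix(F):
    """sim_f(ti, tj): cosine similarity between f(ti) and f(tj) for every tuple pair."""
    U = F / np.linalg.norm(F, axis=1, keepdims=True)
    return U @ U.T

# Direct (O(N^2)) computation of Vf . Vg from tuple similarities ...
Vf, Vg = tuple_similarity_matrix(F), tuple_similarity_matrix(G)
direct = float((Vf * Vg).sum())

# ... versus the one-scan form: sum over feature-value pairs of sim(fk, gq)^2,
# computed here on the L2-normalized tuple vectors
Fhat = F / np.linalg.norm(F, axis=1, keepdims=True)
Ghat = G / np.linalg.norm(G, axis=1, keepdims=True)
fast = float(((Fhat.T @ Ghat) ** 2).sum())

print(round(direct, 6), round(fast, 6))   # the two values agree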
    817 Searching for PertinentFeatures  Different features convey different aspects of information  Features conveying same aspect of information usually cluster tuples in more similar ways  Research group areas vs. conferences of publications  Given user specified feature  Find pertinent features by computing feature similarity Research group area Advisor Conferences of papers Research area GPA Number of papers GRE score Academic Performances Nationality Permanent address Demographic info
818 Heuristic Search for Pertinent Features  Overall procedure: 1. Start from the user-specified feature 2. Search in the neighborhood of existing pertinent features 3. Expand the search range gradually  (Figure: the multi-relational schema from before, with the target of clustering, the user hint, and the expanding search range (steps 1 and 2) marked)  Tuple ID propagation is used to create multi-relational features  IDs of target tuples can be propagated along any join path, from which we can find the tuples joinable with each target tuple
819 Clustering with Multi-Relational Features  Given a set of L pertinent features f1, …, fL, the similarity between two tuples is sim(t1, t2) = Σ_{i=1..L} sim_{fi}(t1, t2)·fi.weight  The weight of a feature is determined during feature search by its similarity with the other pertinent features  Clustering methods  CLARANS [Ng & Han 94], a scalable clustering algorithm for non-Euclidean space  K-means  Agglomerative hierarchical clustering
    820 Experiments: Compare CrossCluswith  Baseline: Only use the user specified feature  PROCLUS [Aggarwal, et al. 99]: a state-of-the-art subspace clustering algorithm  Use a subset of features for each cluster  We convert relational database to a table by propositionalization  User-specified feature is forced to be used in every cluster  RDBC [Kirsten and Wrobel’00]  A representative ILP clustering algorithm  Use neighbor information of objects for clustering  User-specified feature is forced to be used
    821 Measure of ClusteringAccuracy  Accuracy  Measured by manually labeled data  We manually assign tuples into clusters according to their properties (e.g., professors in different research areas)  Accuracy of clustering: Percentage of pairs of tuples in the same cluster that share common label  This measure favors many small clusters  We let each approach generate the same number of clusters
822 DBLP Dataset  (Figure: clustering accuracy on DBLP for the feature sets Conf, Word, Coauthor, Conf+Word, Conf+Coauthor, Word+Coauthor, and all three, comparing CrossClus K-Medoids, CrossClus K-Means, CrossClus Agglomerative, Baseline, PROCLUS, and RDBC; the accuracy axis runs from 0 to 1)
823 Chapter 11. Cluster Analysis: Advanced Methods  Probability Model-Based Clustering  Clustering High-Dimensional Data  Clustering Graphs and Network Data  Clustering with Constraints  Summary 823
824 Summary  Probability Model-Based Clustering  Fuzzy clustering  Probability-model-based clustering  The EM algorithm  Clustering High-Dimensional Data  Subspace clustering: bi-clustering methods  Dimensionality reduction: spectral clustering  Clustering Graphs and Network Data  Graph clustering: min-cut vs. sparsest cut  High-dimensional clustering methods  Graph-specific clustering methods, e.g., SCAN  Clustering with Constraints  Constraints on instance objects, e.g., must-link vs. cannot-link  Constraint-based clustering algorithms
825 References (I)  R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD'98  C. C. Aggarwal, C. Procopiuc, J. Wolf, P. S. Yu, and J.-S. Park. Fast algorithms for projected clustering. SIGMOD'99  S. Arora, S. Rao, and U. Vazirani. Expander flows, geometric embeddings and graph partitioning. J. ACM, 56:5:1–5:37, 2009  J. C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, 1981  K. S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is "nearest neighbor" meaningful? ICDT'99  Y. Cheng and G. Church. Biclustering of expression data. ISMB'00  I. Davidson and S. S. Ravi. Clustering with constraints: Feasibility issues and the k-means algorithm. SDM'05  I. Davidson, K. L. Wagstaff, and S. Basu. Measuring constraint-set utility for partitional clustering algorithms. PKDD'06  C. Fraley and A. E. Raftery. Model-based clustering, discriminant analysis, and density estimation. J. American Stat. Assoc., 97:611–631, 2002  F. Höppner, F. Klawonn, R. Kruse, and T. Runkler. Fuzzy Cluster Analysis: Methods for Classification, Data Analysis and Image Recognition. Wiley, 1999  G. Jeh and J. Widom. SimRank: A measure of structural-context similarity. KDD'02  H.-P. Kriegel, P. Kroeger, and A. Zimek. Clustering high dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans. Knowledge Discovery from Data (TKDD), 3, 2009  U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17:395–416, 2007
References (II)  G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley & Sons, 1988  B. Mirkin. Mathematical classification and clustering. J. of Global Optimization, 12:105–108, 1998  S. C. Madeira and A. L. Oliveira. Biclustering algorithms for biological data analysis: A survey. IEEE/ACM Trans. Comput. Biol. Bioinformatics, 1, 2004  A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. NIPS'01  J. Pei, X. Zhang, M. Cho, H. Wang, and P. S. Yu. MaPle: A fast algorithm for maximal pattern-based clustering. ICDM'03  M. Radovanović, A. Nanopoulos, and M. Ivanović. Nearest neighbors in high-dimensional data: The emergence and influence of hubs. ICML'09  S. E. Schaeffer. Graph clustering. Computer Science Review, 1:27–64, 2007  A. K. H. Tung, J. Hou, and J. Han. Spatial clustering in the presence of obstacles. ICDE'01  A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-based clustering in large databases. ICDT'01  A. Tanay, R. Sharan, and R. Shamir. Biclustering algorithms: A survey. In Handbook of Computational Molecular Biology, Chapman & Hall, 2004  K. Wagstaff, C. Cardie, S. Rogers, and S. Schrödl. Constrained k-means clustering with background knowledge. ICML'01  H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by pattern similarity in large data sets. SIGMOD'02  X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger. SCAN: A structural clustering algorithm for networks. KDD'07  X. Yin, J. Han, and P. S. Yu. Cross-relational clustering with user's guidance. KDD'05
    Slides Not toBe Used in Class 827
    828 Conceptual Clustering  Conceptualclustering  A form of clustering in machine learning  Produces a classification scheme for a set of unlabeled objects  Finds characteristic description for each concept (class)  COBWEB (Fisher’87)  A popular a simple method of incremental conceptual learning  Creates a hierarchical clustering in the form of a classification tree  Each node refers to a concept and contains a probabilistic description of that concept
    829 COBWEB Clustering Method Aclassification tree
    830 More on ConceptualClustering  Limitations of COBWEB  The assumption that the attributes are independent of each other is often too strong because correlation may exist  Not suitable for clustering large database data – skewed tree and expensive probability distributions  CLASSIT  an extension of COBWEB for incremental clustering of continuous data  suffers similar problems as COBWEB  AutoClass (Cheeseman and Stutz, 1996)  Uses Bayesian statistical analysis to estimate the number of clusters  Popular in industry
831 Neural Network Approaches  Neural network approaches  Represent each cluster as an exemplar, acting as a "prototype" of the cluster  New objects are assigned to the cluster whose exemplar is the most similar, according to some distance measure  Typical methods  SOM (Self-Organizing feature Map)  Competitive learning  Involves a hierarchical architecture of several units (neurons)  Neurons compete in a "winner-takes-all" fashion for the object currently being presented
    832 Self-Organizing Feature Map(SOM)  SOMs, also called topological ordered maps, or Kohonen Self- Organizing Feature Map (KSOMs)  It maps all the points in a high-dimensional source space into a 2 to 3-d target space, s.t., the distance and proximity relationship (i.e., topology) are preserved as much as possible  Similar to k-means: cluster centers tend to lie in a low-dimensional manifold in the feature space  Clustering is performed by having several units competing for the current object  The unit whose weight vector is closest to the current object wins  The winner and its neighbors learn by having their weights adjusted  SOMs are believed to resemble processing that can occur in the brain  Useful for visualizing high-dimensional data in 2- or 3-D space
    833 Web Document ClusteringUsing SOM  The result of SOM clustering of 12088 Web articles  The picture on the right: drilling down on the keyword “mining”  Based on websom.hut.fi Web page
    845 Data Mining: Concepts andTechniques (3rd ed.) — Chapter 12 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2011 Han, Kamber & Pei. All rights reserved.
846 Chapter 12. Outlier Analysis  Outlier and Outlier Analysis  Outlier Detection Methods  Statistical Approaches  Proximity-Based Approaches  Clustering-Based Approaches  Classification Approaches  Mining Contextual and Collective Outliers  Outlier Detection in High-Dimensional Data  Summary
847 What Are Outliers?  Outlier: a data object that deviates significantly from the normal objects, as if it were generated by a different mechanism  Ex.: unusual credit card purchases; in sports: Michael Jordan, Wayne Gretzky, ...  Outliers are different from noise data  Noise is random error or variance in a measured variable  Noise should be removed before outlier detection  Outliers are interesting: they violate the mechanism that generates the normal data  Outlier detection vs. novelty detection: a novel pattern may first appear as an outlier but is later merged into the model  Applications:  Credit card fraud detection  Telecom fraud detection  Customer segmentation  Medical analysis
    848 Types of Outliers(I)  Three kinds: global, contextual and collective outliers  Global outlier (or point anomaly)  Object is Og if it significantly deviates from the rest of the data set  Ex. Intrusion detection in computer networks  Issue: Find an appropriate measurement of deviation  Contextual outlier (or conditional outlier)  Object is Oc if it deviates significantly based on a selected context  Ex. 80o F in Urbana: outlier? (depending on summer or winter?)  Attributes of data objects should be divided into two groups  Contextual attributes: defines the context, e.g., time & location  Behavioral attributes: characteristics of the object, used in outlier evaluation, e.g., temperature  Can be viewed as a generalization of local outliers—whose density significantly deviates from its local area  Issue: How to define or formulate meaningful context? Global Outlier
    849 Types of Outliers(II)  Collective Outliers  A subset of data objects collectively deviate significantly from the whole data set, even if the individual data objects may not be outliers  Applications: E.g., intrusion detection:  When a number of computers keep sending denial-of-service packages to each other Collective Outlier  Detection of collective outliers  Consider not only behavior of individual objects, but also that of groups of objects  Need to have the background knowledge on the relationship among data objects, such as a distance or similarity measure on objects.  A data set may have multiple types of outlier  One object may belong to more than one type of outlier
    850 Challenges of OutlierDetection  Modeling normal objects and outliers properly  Hard to enumerate all possible normal behaviors in an application  The border between normal and outlier objects is often a gray area  Application-specific outlier detection  Choice of distance measure among objects and the model of relationship among objects are often application-dependent  E.g., clinic data: a small deviation could be an outlier; while in marketing analysis, larger fluctuations  Handling noise in outlier detection  Noise may distort the normal objects and blur the distinction between normal objects and outliers. It may help hide outliers and reduce the effectiveness of outlier detection  Understandability  Understand why these are outliers: Justification of the detection  Specify the degree of an outlier: the unlikelihood of the object being generated by a normal mechanism
851 Chapter 12. Outlier Analysis  Outlier and Outlier Analysis  Outlier Detection Methods  Statistical Approaches  Proximity-Based Approaches  Clustering-Based Approaches  Classification Approaches  Mining Contextual and Collective Outliers  Outlier Detection in High-Dimensional Data  Summary
    Outlier Detection I:Supervised Methods  Two ways to categorize outlier detection methods:  Based on whether user-labeled examples of outliers can be obtained:  Supervised, semi-supervised vs. unsupervised methods  Based on assumptions about normal data and outliers:  Statistical, proximity-based, and clustering-based methods  Outlier Detection I: Supervised Methods  Modeling outlier detection as a classification problem  Samples examined by domain experts used for training & testing  Methods for Learning a classifier for outlier detection effectively:  Model normal objects & report those not matching the model as outliers, or  Model outliers and treat those not matching the model as normal  Challenges  Imbalanced classes, i.e., outliers are rare: Boost the outlier class and make up some artificial outliers  Catch as many outliers as possible, i.e., recall is more important than accuracy (i.e., not mislabeling normal objects as outliers) 852
    Outlier Detection II:Unsupervised Methods  Assume the normal objects are somewhat ``clustered'‘ into multiple groups, each having some distinct features  An outlier is expected to be far away from any groups of normal objects  Weakness: Cannot detect collective outlier effectively  Normal objects may not share any strong patterns, but the collective outliers may share high similarity in a small area  Ex. In some intrusion or virus detection, normal activities are diverse  Unsupervised methods may have a high false positive rate but still miss many real outliers.  Supervised methods can be more effective, e.g., identify attacking some key resources  Many clustering methods can be adapted for unsupervised methods  Find clusters, then outliers: not belonging to any cluster  Problem 1: Hard to distinguish noise from outliers  Problem 2: Costly since first clustering: but far less outliers than normal objects  Newer methods: tackle outliers directly 853
Outlier Detection III: Semi-Supervised Methods  Situation: in many applications the amount of labeled data is small: labels could be on outliers only, normal objects only, or both  Semi-supervised outlier detection: regarded as an application of semi-supervised learning  If some labeled normal objects are available  Use the labeled examples and the proximate unlabeled objects to train a model for normal objects  Those not fitting the model of normal objects are detected as outliers  If only some labeled outliers are available, a small number of labeled outliers may not cover the possible outliers well  To improve the quality of outlier detection, one can get help from models for normal objects learned by unsupervised methods 854
    Outlier Detection (1):Statistical Methods  Statistical methods (also known as model-based methods) assume that the normal data follow some statistical model (a stochastic model)  The data not following the model are outliers. 855  Effectiveness of statistical methods: highly depends on whether the assumption of statistical model holds in the real data  There are rich alternatives to use various statistical models  E.g., parametric vs. non-parametric  Example (right figure): First use Gaussian distribution to model the normal data  For each object y in region R, estimate gD(y), the probability of y fits the Gaussian distribution  If gD(y) is very low, y is unlikely generated by the Gaussian model, thus an outlier
    Outlier Detection (2):Proximity-Based Methods  An object is an outlier if the nearest neighbors of the object are far away, i.e., the proximity of the object is significantly deviates from the proximity of most of the other objects in the same data set 856  The effectiveness of proximity-based methods highly relies on the proximity measure.  In some applications, proximity or distance measures cannot be obtained easily.  Often have a difficulty in finding a group of outliers which stay close to each other  Two major types of proximity-based outlier detection  Distance-based vs. density-based  Example (right figure): Model the proximity of an object using its 3 nearest neighbors  Objects in region R are substantially different from other objects in the data set.  Thus the objects in R are outliers
    Outlier Detection (3):Clustering-Based Methods  Normal data belong to large and dense clusters, whereas outliers belong to small or sparse clusters, or do not belong to any clusters 857  Since there are many clustering methods, there are many clustering-based outlier detection methods as well  Clustering is expensive: straightforward adaption of a clustering method for outlier detection can be costly and does not scale up well for large data sets  Example (right figure): two clusters  All points not in R form a large cluster  The two points in R form a tiny cluster, thus are outliers
858 Chapter 12. Outlier Analysis  Outlier and Outlier Analysis  Outlier Detection Methods  Statistical Approaches  Proximity-Based Approaches  Clustering-Based Approaches  Classification Approaches  Mining Contextual and Collective Outliers  Outlier Detection in High-Dimensional Data  Summary
    Statistical Approaches  Statisticalapproaches assume that the objects in a data set are generated by a stochastic process (a generative model)  Idea: learn a generative model fitting the given data set, and then identify the objects in low probability regions of the model as outliers  Methods are divided into two categories: parametric vs. non- parametric  Parametric method  Assumes that the normal data is generated by a parametric distribution with parameter θ  The probability density function of the parametric distribution f(x, θ) gives the probability that object x is generated by the distribution  The smaller this value, the more likely x is an outlier  Non-parametric method  Not assume an a-priori statistical model and determine the model from the input data  Not completely parameter free but consider the number and nature of the parameters are flexible and not fixed in advance  Examples: histogram and kernel density estimation 859
Parametric Methods I: Detecting Univariate Outliers Based on the Normal Distribution  Univariate data: a data set involving only one attribute or variable  Often assume that the data are generated from a normal distribution, learn the parameters from the input data, and identify the points with low probability as outliers  Ex. avg. temp.: {24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4}  Use the maximum likelihood method to estimate μ and σ 860  Taking derivatives of the log-likelihood with respect to μ and σ², we derive the maximum likelihood estimates μ̂ = (1/n) Σ xi and σ̂² = (1/n) Σ (xi − μ̂)²  For the above data with n = 10, we have μ̂ = 28.61 and σ̂ ≈ 1.51  Then (24 − 28.61)/1.51 = −3.04 < −3, so 24 is an outlier, since under a normal distribution the region μ ± 3σ contains 99.7% of the data (see the sketch below)
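A direct sketch of this example in plain Python. Note that the maximum likelihood σ̂ computed from the raw numbers comes out slightly above the slide's rounded 1.51, which puts the z-score of 24.0 just under −3; it is still by far the most deviating value.

import math

temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]

n = len(temps)
mu = sum(temps) / n                                        # maximum likelihood estimate of mu
sigma = math.sqrt(sum((x - mu) ** 2 for x in temps) / n)   # maximum likelihood estimate of sigma

z = {x: (x - mu) / sigma for x in temps}
print(round(mu, 2), round(sigma, 2))
print({x: round(v, 2) for x, v in z.items()})
# 24.0 sits roughly 3 standard deviations below the mean, while every other value
# is within 0.6 sigma, so 24.0 is singled out as the outlier.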
Parametric Methods I: Grubbs' Test  Univariate outlier detection: Grubbs' test (the maximum normed residual test)—another statistical method under the normal distribution assumption  For each object x in a data set, compute its z-score, z = |x − x̄| / s, where x̄ and s are the sample mean and standard deviation  x is an outlier if z ≥ ((N − 1)/√N)·√(t²α/(2N),N−2 / (N − 2 + t²α/(2N),N−2)), where tα/(2N),N−2 is the value taken by a t-distribution with N − 2 degrees of freedom at a significance level of α/(2N), and N is the # of objects in the data set 861
Parametric Methods II: Detection of Multivariate Outliers  Multivariate data: a data set involving two or more attributes or variables  Transform the multivariate outlier detection task into a univariate outlier detection problem  Method 1. Compute the Mahalanobis distance  Let ō be the mean vector of the multivariate data set. The Mahalanobis distance of an object o from ō is MDist(o, ō) = (o − ō)ᵀ S⁻¹ (o − ō), where S is the covariance matrix  Use Grubbs' test on this measure to detect outliers (see the sketch below)  Method 2. Use the χ²-statistic: χ² = Σ_{i=1..n} (oi − Ei)²/Ei, where Ei is the mean of the i-th dimension among all objects and n is the dimensionality  If the χ²-statistic is large, the object o is an outlier 862
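A NumPy sketch of Method 1; the generated data and the planted outlier are illustrative, and in practice Grubbs' test would then be applied to the resulting distances.

import numpy as np

def mahalanobis_distances(X):
    """Squared Mahalanobis distance of every row of X from the mean vector."""
    mean = X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mean
    return np.einsum("ij,jk,ik->i", diff, S_inv, diff)

rng = np.random.default_rng(1)
X = rng.normal(0, 1, (50, 2))
X[0] = [6.0, -6.0]                      # plant one obvious multivariate outlier
d2 = mahalanobis_distances(X)
print(int(np.argmax(d2)))               # index 0: the planted point has the largest distance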
    Parametric Methods III:Using Mixture of Parametric Distributions  Assuming data generated by a normal distribution could be sometimes overly simplified  Example (right figure): The objects between the two clusters cannot be captured as outliers since they are close to the estimated mean 863  To overcome this problem, assume the normal data is generated by two normal distributions. For any object o in the data set, the probability that o is generated by the mixture of the two distributions is given by where fθ1 and fθ2 are the probability density functions of θ1 and θ2  Then use EM algorithm to learn the parameters μ1, σ1, μ2, σ2 from data  An object o is an outlier if it does not belong to any cluster
    Non-Parametric Methods: DetectionUsing Histogram  The model of normal data is learned from the input data without any a priori structure.  Often makes fewer assumptions about the data, and thus can be applicable in more scenarios  Outlier detection using histogram: 864  Figure shows the histogram of purchase amounts in transactions  A transaction in the amount of $7,500 is an outlier, since only 0.2% transactions have an amount higher than $5,000  Problem: Hard to choose an appropriate bin size for histogram  Too small bin size → normal objects in empty/rare bins, false positive  Too big bin size → outliers in some frequent bins, false negative  Solution: Adopt kernel density estimation to estimate the probability density distribution of the data. If the estimated density function is high, the object is likely normal. Otherwise, it is likely an outlier.
865 Chapter 12. Outlier Analysis  Outlier and Outlier Analysis  Outlier Detection Methods  Statistical Approaches  Proximity-Based Approaches  Clustering-Based Approaches  Classification Approaches  Mining Contextual and Collective Outliers  Outlier Detection in High-Dimensional Data  Summary
    Proximity-Based Approaches: Distance-Basedvs. Density-Based Outlier Detection  Intuition: Objects that are far away from the others are outliers  Assumption of proximity-based approach: The proximity of an outlier deviates significantly from that of most of the others in the data set  Two types of proximity-based outlier detection methods  Distance-based outlier detection: An object o is an outlier if its neighborhood does not have enough other points  Density-based outlier detection: An object o is an outlier if its density is relatively much lower than that of its neighbors 866
    Distance-Based Outlier Detection For each object o, examine the # of other objects in the r- neighborhood of o, where r is a user-specified distance threshold  An object o is an outlier if most (taking π as a fraction threshold) of the objects in D are far away from o, i.e., not in the r-neighborhood of o  An object o is a DB(r, π) outlier if  Equivalently, one can check the distance between o and its k-th nearest neighbor ok, where . o is an outlier if dist(o, ok) > r  Efficient computation: Nested loop algorithm  For any object oi, calculate its distance from other objects, and count the # of other objects in the r-neighborhood.  If π∙n other objects are within r distance, terminate the inner loop  Otherwise, oi is a DB(r, π) outlier  Efficiency: Actually CPU time is not O(n2 ) but linear to the data set size since for most non-outlier objects, the inner loop terminates early 867
    Distance-Based Outlier Detection:A Grid-Based Method  Why efficiency is still a concern? When the complete set of objects cannot be held into main memory, cost I/O swapping  The major cost: (1) each object tests against the whole data set, why not only its close neighbor? (2) check objects one by one, why not group by group?  Grid-based method (CELL): Data space is partitioned into a multi-D grid. Each cell is a hyper cube with diagonal length r/2 868  Pruning using the level-1 & level 2 cell properties:  For any possible point x in cell C and any possible point y in a level-1 cell, dist(x,y) ≤ r  For any possible point x in cell C and any point y such that dist(x,y) ≥ r, y is in a level-2 cell  Thus we only need to check the objects that cannot be pruned, and even for such an object o, only need to compute the distance between o and the objects in the level-2 cells (since beyond level-2, the distance from o is more than r)
Density-Based Outlier Detection  Local outliers: outliers relative to their local neighborhoods, rather than to the global data distribution  In the figure, o1 and o2 are local outliers to C1, o3 is a global outlier, but o4 is not an outlier. However, a proximity-based method using the global distribution cannot identify o1 and o2 as outliers (e.g., when compared with o4). 869  Intuition (density-based outlier detection): the density around an outlier object is significantly different from the density around its neighbors  Method: use the relative density of an object against its neighbors as the indicator of the degree to which the object is an outlier  k-distance of an object o, distk(o): the distance between o and its k-th NN  k-distance neighborhood of o: Nk(o) = {o′ | o′ in D, dist(o, o′) ≤ distk(o)}  |Nk(o)| could be bigger than k since multiple objects may be at the same distance from o
Local Outlier Factor: LOF  Reachability distance from o′ to o: reachdist_k(o ← o′) = max{distk(o), dist(o, o′)}, where k is a user-specified parameter  Local reachability density of o: lrd_k(o) = |Nk(o)| / Σ_{o′∈Nk(o)} reachdist_k(o′ ← o) 870  LOF (local outlier factor) of an object o is the average of the ratios of the local reachability densities of o's k-nearest neighbors to that of o: LOF_k(o) = (1/|Nk(o)|) Σ_{o′∈Nk(o)} lrd_k(o′) / lrd_k(o)  The lower the local reachability density of o, and the higher the local reachability densities of the kNN of o, the higher the LOF  This captures a local outlier whose local density is relatively low compared to the local densities of its kNN (see the sketch below)
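LOF is available off the shelf; a short sketch using scikit-learn's LocalOutlierFactor (the parameter values and the generated data are illustrative).

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
dense = rng.normal(0.0, 0.2, (100, 2))         # a dense cluster
sparse = rng.normal(5.0, 1.5, (30, 2))         # a sparser cluster
locals_ = np.array([[1.2, 1.2], [1.5, -1.3]])  # near the dense cluster but outside it
X = np.vstack([dense, sparse, locals_])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                  # -1 marks predicted outliers
scores = -lof.negative_outlier_factor_       # higher score = more outlying
print(int((labels == -1).sum()), "points flagged")
print(np.argsort(scores)[-3:])               # indices of the three highest-LOF points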
871 Chapter 12. Outlier Analysis  Outlier and Outlier Analysis  Outlier Detection Methods  Statistical Approaches  Proximity-Based Approaches  Clustering-Based Approaches  Classification Approaches  Mining Contextual and Collective Outliers  Outlier Detection in High-Dimensional Data  Summary
    Clustering-Based Outlier Detection(1 & 2): Not belong to any cluster, or far from the closest one  An object is an outlier if (1) it does not belong to any cluster, (2) there is a large distance between the object and its closest cluster , or (3) it belongs to a small or sparse cluster  Case I: Not belong to any cluster  Identify animals not part of a flock: Using a density- based clustering method such as DBSCAN  Case 2: Far from its closest cluster  Using k-means, partition data points of into clusters  For each object o, assign an outlier score based on its distance from its closest center  If dist(o, co)/avg_dist(co) is large, likely an outlier  Ex. Intrusion detection: Consider the similarity between data points and the clusters in a training data set  Use a training set to find patterns of “normal” data, e.g., frequent itemsets in each segment, and cluster similar connections into groups  Compare new data points with the clusters mined—Outliers are possible attacks 872
     FindCBLOF: Detectoutliers in small clusters  Find clusters, and sort them in decreasing size  To each data point, assign a cluster-based local outlier factor (CBLOF):  If obj p belongs to a large cluster, CBLOF = cluster_size X similarity between p and cluster  If p belongs to a small one, CBLOF = cluster size X similarity betw. p and the closest large cluster 873 Clustering-Based Outlier Detection (3): Detecting Outliers in Small Clusters  Ex. In the figure, o is outlier since its closest large cluster is C1, but the similarity between o and C1 is small. For any point in C3, its closest large cluster is C2 but its similarity from C2 is low, plus |C3| = 3 is small
    Clustering-Based Method: Strengthand Weakness  Strength  Detect outliers without requiring any labeled data  Work for many types of data  Clusters can be regarded as summaries of the data  Once the cluster are obtained, need only compare any object against the clusters to determine whether it is an outlier (fast)  Weakness  Effectiveness depends highly on the clustering method used—they may not be optimized for outlier detection  High computational cost: Need to first find clusters  A method to reduce the cost: Fixed-width clustering  A point is assigned to a cluster if the center of the cluster is within a pre-defined distance threshold from the point  If a point cannot be assigned to any existing cluster, a new cluster is created and the distance threshold may be learned from the training data under certain conditions
875 Chapter 12. Outlier Analysis  Outlier and Outlier Analysis  Outlier Detection Methods  Statistical Approaches  Proximity-Based Approaches  Clustering-Based Approaches  Classification Approaches  Mining Contextual and Collective Outliers  Outlier Detection in High-Dimensional Data  Summary
    Classification-Based Method I:One-Class Model  Idea: Train a classification model that can distinguish “normal” data from outliers  A brute-force approach: Consider a training set that contains samples labeled as “normal” and others labeled as “outlier”  But, the training set is typically heavily biased: # of “normal” samples likely far exceeds # of outlier samples  Cannot detect unseen anomaly 876  One-class model: A classifier is built to describe only the normal class.  Learn the decision boundary of the normal class using classification methods such as SVM  Any samples that do not belong to the normal class (not within the decision boundary) are declared as outliers  Adv: can detect new outliers that may not appear close to any outlier objects in the training set  Extension: Normal objects may belong to multiple classes
    Classification-Based Method II:Semi-Supervised Learning  Semi-supervised learning: Combining classification- based and clustering-based methods  Method  Using a clustering-based approach, find a large cluster, C, and a small cluster, C1  Since some objects in C carry the label “normal”, treat all objects in C as normal  Use the one-class model of this cluster to identify normal objects in outlier detection  Since some objects in cluster C1 carry the label “outlier”, declare all objects in C1 as outliers  Any object that does not fall into the model for C (such as a) is considered an outlier as well 877  Comments on classification-based outlier detection methods  Strength: Outlier detection is fast  Bottleneck: Quality heavily depends on the availability and quality of the training set, but often difficult to obtain representative and high- quality training data
878 Chapter 12. Outlier Analysis  Outlier and Outlier Analysis  Outlier Detection Methods  Statistical Approaches  Proximity-Based Approaches  Clustering-Based Approaches  Classification Approaches  Mining Contextual and Collective Outliers  Outlier Detection in High-Dimensional Data  Summary
    Mining Contextual OutliersI: Transform into Conventional Outlier Detection  If the contexts can be clearly identified, transform it to conventional outlier detection 1. Identify the context of the object using the contextual attributes 2. Calculate the outlier score for the object in the context using a conventional outlier detection method  Ex. Detect outlier customers in the context of customer groups  Contextual attributes: age group, postal code  Behavioral attributes: # of trans/yr, annual total trans. amount  Steps: (1) locate c’s context, (2) compare c with the other customers in the same group, and (3) use a conventional outlier detection method  If the context contains very few customers, generalize contexts  Ex. Learn a mixture model U on the contextual attributes, and another mixture model V of the data on the behavior attributes  Learn a mapping p(Vi|Uj): the probability that a data object o belonging to cluster Uj on the contextual attributes is generated by cluster Vi on the behavior attributes  Outlier score: 879
    Mining Contextual OutliersII: Modeling Normal Behavior with Respect to Contexts  In some applications, one cannot clearly partition the data into contexts  Ex. if a customer suddenly purchased a product that is unrelated to those she recently browsed, it is unclear how many products browsed earlier should be considered as the context  Model the “normal” behavior with respect to contexts  Using a training data set, train a model that predicts the expected behavior attribute values with respect to the contextual attribute values  An object is a contextual outlier if its behavior attribute values significantly deviate from the values predicted by the model  Using a prediction model that links the contexts and behavior, these methods avoid the explicit identification of specific contexts  Methods: A number of classification and prediction techniques can be used to build such models, such as regression, Markov Models, and Finite State Automaton 880
  • 868.
Mining Collective Outliers I: On the Set of “Structured Objects”  A group of objects forms a collective outlier if the objects as a group deviate significantly from the entire data set  Need to examine the structure of the data set, i.e., the relationships between multiple data objects 881  Each of these structures is inherent to its respective type of data  For temporal data (such as time series and sequences), we explore the structures formed by time, which occur in segments of the time series or subsequences  For spatial data, we explore local areas  For graph and network data, we explore subgraphs  Difference from contextual outlier detection: the structures are often not explicitly defined and have to be discovered as part of the outlier detection process  Collective outlier detection methods: two categories  Reduce the problem to conventional outlier detection  Identify structure units, treat each structure unit (e.g., subsequence, time series segment, local area, or subgraph) as a data object, and extract features  Then run outlier detection on the set of “structured objects” constructed this way, using the extracted features
  • 869.
Mining Collective Outliers II: Direct Modeling of the Expected Behavior of Structure Units  Model the expected behavior of structure units directly  Ex. 1. Detect collective outliers in an online social network of customers  Treat each possible subgraph of the network as a structure unit  Collective outlier: an outlier subgraph in the social network  Small subgraphs that are of very low frequency  Large subgraphs that are surprisingly frequent  Ex. 2. Detect collective outliers in temporal sequences  Learn a Markov model from the sequences  A subsequence can then be declared a collective outlier if it significantly deviates from the model (see the sketch below)  Collective outlier detection is subtle due to the challenge of exploring the structures in data  The exploration typically uses heuristics and thus may be application dependent  The computational cost is often high due to the sophisticated mining process 882
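A minimal sketch of the Markov-model idea from Ex. 2, assuming a first-order model with Laplace smoothing; the alphabet, training sequences, and the use of average log-likelihood as the score are illustrative choices, not from the text.

```python
# Sketch: learn a first-order Markov model from "normal" sequences and score a
# subsequence by its average transition log-likelihood; very low scores suggest
# a collective outlier.  Alphabet, data, and scoring choice are assumptions.
import numpy as np

def train_markov(sequences, alphabet, alpha=1.0):
    idx = {s: i for i, s in enumerate(alphabet)}
    counts = np.full((len(alphabet), len(alphabet)), alpha)   # Laplace smoothing
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[idx[a], idx[b]] += 1
    return counts / counts.sum(axis=1, keepdims=True), idx    # row-normalized transition matrix

def avg_log_likelihood(seq, trans, idx):
    logps = [np.log(trans[idx[a], idx[b]]) for a, b in zip(seq, seq[1:])]
    return float(np.mean(logps))

alphabet = ["a", "b", "c"]
normal_seqs = [list("abcabcabcabc"), list("abcabcabc")]
trans, idx = train_markov(normal_seqs, alphabet)

print(avg_log_likelihood(list("abcabc"), trans, idx))   # high (typical) likelihood
print(avg_log_likelihood(list("cccccc"), trans, idx))   # much lower: candidate collective outlier
```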
  • 870.
883 Chapter 12. Outlier Analysis  Outlier and Outlier Analysis  Outlier Detection Methods  Statistical Approaches  Proximity-Based Approaches  Clustering-Based Approaches  Classification Approaches  Mining Contextual and Collective Outliers  Outlier Detection in High-Dimensional Data  Summary
  • 871.
Challenges for Outlier Detection in High-Dimensional Data  Interpretation of outliers  Detecting outliers without saying why they are outliers is not very useful in high-dimensional data, because many features (or dimensions) are involved  E.g., identify the subspaces that manifest the outliers, or provide an assessment of the “outlier-ness” of the objects  Data sparsity  Data in high-dimensional spaces are often sparse  The distance between objects becomes heavily dominated by noise as the dimensionality increases  Data subspaces  Methods should be adaptive to the subspaces signifying the outliers  Capture the local behavior of data  Scalability with respect to dimensionality  The # of subspaces increases exponentially 884
  • 872.
Approach I: Extending Conventional Outlier Detection  Method 1: Detect outliers in the full space, e.g., the HilOut algorithm  Find distance-based outliers, but use the ranks of distances instead of the absolute distances in outlier detection  For each object o, find its k-nearest neighbors: nn1(o), . . . , nnk(o)  The weight of object o is the sum of its distances to these k nearest neighbors (see the sketch below)  All objects are ranked in weight-descending order  The top-l objects in weight are output as outliers (l: user-specified parameter)  Employ space-filling curves for approximation: scalable in both time and space w.r.t. data size and dimensionality  Method 2: Dimensionality reduction  Works only when, in the lower-dimensional space, normal instances can still be distinguished from outliers  PCA: Heuristically, the principal components with low variance are preferred because, on such dimensions, normal objects are likely close to each other and outliers often deviate from the majority 885
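A minimal sketch of the weighting and ranking step, assuming the weight of o is the sum of its distances to its k nearest neighbors; it uses an exact k-NN search rather than HilOut's space-filling-curve approximation, and the data, k, and l values are illustrative.

```python
# Sketch of the weighting step: w(o) = sum of distances from o to its k nearest
# neighbors; the top-l objects by weight are reported as outliers.  Exact k-NN is
# used here instead of HilOut's space-filling-curve approximation.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(1)
X = np.vstack([rng.normal(0, 1, size=(200, 10)),   # bulk of the data
               rng.normal(8, 1, size=(3, 10))])    # a few far-away points

k, l = 5, 3
nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1 because each point is its own nearest neighbor
dists, _ = nbrs.kneighbors(X)
weights = dists[:, 1:].sum(axis=1)                 # drop the zero self-distance

top_l = np.argsort(weights)[::-1][:l]              # rank in weight-descending order
print(top_l)                                       # indices 200-202 should dominate
```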
  • 873.
Approach II: Finding Outliers in Subspaces  Extending conventional outlier detection: hard for outlier interpretation  Finding outliers in much lower dimensional subspaces: easy to interpret why and to what extent the object is an outlier  E.g., find outlier customers in a certain subspace: average transaction amount >> avg. and purchase frequency << avg.  Ex. A grid-based subspace outlier detection method  Project data onto various subspaces to find an area whose density is much lower than average  Discretize the data into a grid with φ equi-depth regions per dimension (why equi-depth?)  Search for regions that are significantly sparse  Consider a k-d cube: k ranges on k dimensions, with n objects  If objects are independently distributed, the expected number of objects falling into a k-dimensional region is (1/φ)^k · n = f^k · n, where f = 1/φ; the standard deviation and the sparsity coefficient S(C) of cube C are given below  If S(C) < 0, C contains fewer objects than expected  The more negative S(C), the sparser C is and the more likely the objects in C are outliers in the subspace 886
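A hedged reconstruction of the two missing formulas, following the grid-based method of Aggarwal and Yu (SIGMOD'01) cited in the references: under the independence assumption, the count n(C) of objects in a k-dimensional cube C is approximately binomial with success probability f^k, where f = 1/φ.

```latex
% Hedged reconstruction (assumption: Aggarwal & Yu's sparsity coefficient).
\[
  E[n(C)] = f^{k}\, n, \qquad
  \sigma = \sqrt{f^{k}\,(1 - f^{k})\, n}, \qquad
  S(C) = \frac{n(C) - f^{k}\, n}{\sqrt{f^{k}\,(1 - f^{k})\, n}}
\]
```

Here n(C) is the observed number of objects falling into cube C.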
  • 874.
Approach III: Modeling High-Dimensional Outliers  Develop new models for high-dimensional outliers directly  Avoid proximity measures and adopt new heuristics that do not deteriorate in high-dimensional data  Ex. Angle-based outliers: Kriegel, Schubert, and Zimek [KSZ08]  For each point o, examine the angle ∠xoy for every pair of points x, y  For a point in the center of the data (e.g., a), the angles formed differ widely  For an outlier (e.g., c), the variance of the angles is substantially smaller  Use the variance of angles at a point to determine whether it is an outlier  Combine angles and distance to model outliers  Use the distance-weighted angle variance as the outlier score  Angle-based outlier factor (ABOF): see the formula below  An efficient approximate computation method has been developed  The approach can be generalized to handle arbitrary types of data 887  [Figure: a set of points forms a cluster except c (an outlier)]
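A hedged reconstruction of the ABOF formula from [KSZ08]: the variance, over all pairs of other points, of the angle at o, with each pair down-weighted by the squared distances so that distant pairs contribute less; low ABOF values indicate likely outliers.

```latex
% Hedged reconstruction of the angle-based outlier factor of [KSZ08].
\[
  \mathrm{ABOF}(o) \;=\; \operatorname{VAR}_{x,\, y \in D}
  \left(
    \frac{\langle \overrightarrow{ox},\, \overrightarrow{oy} \rangle}
         {\lVert \overrightarrow{ox} \rVert^{2}\,\lVert \overrightarrow{oy} \rVert^{2}}
  \right)
\]
```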
  • 875.
888 Chapter 12. Outlier Analysis  Outlier and Outlier Analysis  Outlier Detection Methods  Statistical Approaches  Proximity-Based Approaches  Clustering-Based Approaches  Classification Approaches  Mining Contextual and Collective Outliers  Outlier Detection in High-Dimensional Data  Summary
  • 876.
    Summary  Types ofoutliers  global, contextual & collective outliers  Outlier detection  supervised, semi-supervised, or unsupervised  Statistical (or model-based) approaches  Proximity-base approaches  Clustering-base approaches  Classification approaches  Mining contextual and collective outliers  Outlier detection in high dimensional data 889
  • 877.
    References (I)  B.Abraham and G.E.P. Box. Bayesian analysis of some outlier problems in time series. Biometrika, 66:229–248, 1979.  M. Agyemang, K. Barker, and R. Alhajj. A comprehensive survey of numeric and symbolic outlier mining techniques. Intell. Data Anal., 10:521–538, 2006.  F. J. Anscombe and I. Guttman. Rejection of outliers. Technometrics, 2:123–147, 1960.  D. Agarwal. Detecting anomalies in cross-classified streams: a bayesian approach. Knowl. Inf. Syst., 11:29–44, 2006.  F. Angiulli and C. Pizzuti. Outlier mining in large high-dimensional data sets. TKDE, 2005.  C. C. Aggarwal and P. S. Yu. Outlier detection for high dimensional data. SIGMOD’01  R.J. Beckman and R.D. Cook. Outlier...s. Technometrics, 25:119–149, 1983.  I. Ben-Gal. Outlier detection. In Maimon O. and Rockach L. (eds.) Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers, Kluwer Academic, 2005.  M. M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander. LOF: Identifying density-based local outliers. SIGMOD’00  D. Barbar´a, Y. Li, J. Couto, J.-L. Lin, and S. Jajodia. Bootstrapping a data mining intrusion detection system. SAC’03  Z. A. Bakar, R. Mohemad, A. Ahmad, and M. M. Deris. A comparative study for outlier detection techniques in data mining. IEEE Conf. on Cybernetics and Intelligent Systems, 2006.  S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. KDD’03  D. Barbara, N. Wu, and S. Jajodia. Detecting novel network intrusion using bayesian estimators. SDM’01  V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Computing Surveys, 41:1–58, 2009.  D. Dasgupta and N.S. Majumdar. Anomaly detection in multidimensional data using negative selection algorithm. In CEC’02
  • 878.
    References (2)  E.Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo. A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data. In Proc. 2002 Int. Conf. of Data Mining for Security Applications, 2002.  E. Eskin. Anomaly detection over noisy data using learned probability distributions. ICML’00  T. Fawcett and F. Provost. Adaptive fraud detection. Data Mining and Knowledge Discovery, 1:291–316, 1997.  V. J. Hodge and J. Austin. A survey of outlier detection methdologies. Artif. Intell. Rev., 22:85–126, 2004.  D. M. Hawkins. Identification of Outliers. Chapman and Hall, London, 1980.  Z. He, X. Xu, and S. Deng. Discovering cluster-based local outliers. Pattern Recogn. Lett., 24, June, 2003.  W. Jin, K. H. Tung, and J. Han. Mining top-n local outliers in large databases. KDD’01  W. Jin, A. K. H. Tung, J. Han, and W. Wang. Ranking outliers using symmetric neighborhood relationship. PAKDD’06  E. Knorr and R. Ng. A unified notion of outliers: Properties and computation. KDD’97  E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB’98  E. M. Knorr, R. T. Ng, and V. Tucakov. Distance-based outliers: Algorithms and applications. VLDB J., 8:237–253, 2000.  H.-P. Kriegel, M. Schubert, and A. Zimek. Angle-based outlier detection in high-dimensional data. KDD’08  M. Markou and S. Singh. Novelty detection: A review—part 1: Statistical approaches. Signal Process., 83:2481– 2497, 2003.  M. Markou and S. Singh. Novelty detection: A review—part 2: Neural network based approaches. Signal Process., 83:2499–2521, 2003.  C. C. Noble and D. J. Cook. Graph-based anomaly detection. KDD’03
  • 879.
    References (3)  S.Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos. Loci: Fast outlier detection using the local correlation integral. ICDE’03  A. Patcha and J.-M. Park. An overview of anomaly detection techniques: Existing solutions and latest technological trends. Comput. Netw., 51, 2007.  X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19, 2007.  Y. Tao, X. Xiao, and S. Zhou. Mining distance-based outliers from large databases in any metric space. KDD’06  N. Ye and Q. Chen. An anomaly detection technique based on a chi-square statistic for detecting intrusions into information systems. Quality and Reliability Engineering International, 17:105–112, 2001.  B.-K. Yi, N. Sidiropoulos, T. Johnson, H. V. Jagadish, C. Faloutsos, and A. Biliris. Online data mining for co- evolving time sequences. ICDE’00
  • 880.
  • 881.
894 Outlier Discovery: Statistical Approaches  Assume a model of the underlying distribution that generates the data set (e.g., a normal distribution)  Use discordancy tests, which depend on  the data distribution  the distribution parameters (e.g., mean, variance)  the number of expected outliers  (see the sketch below)  Drawbacks  Most tests are for a single attribute  In many cases, the data distribution may not be known
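A minimal sketch of a discordancy check under an assumed normal model; the 3-sigma cutoff is an illustrative simplification of a proper discordancy test such as Grubbs' test, and the data are synthetic.

```python
# Sketch: flag values whose normed residual |x - mean| / std exceeds a cutoff,
# assuming a normal model.  A proper Grubbs test would use a t-based critical
# value instead of the fixed cutoff of 3 used here.
import numpy as np

rng = np.random.RandomState(7)
data = np.append(rng.normal(10.0, 0.5, size=200), 15.7)   # inject one gross error

z = np.abs(data - data.mean()) / data.std(ddof=1)
print(np.where(z > 3)[0])                                  # index 200 should be flagged as discordant
```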
  • 882.
895 Outlier Discovery: Distance-Based Approach  Introduced to counter the main limitations imposed by statistical methods  We need multi-dimensional analysis without knowing the data distribution  Distance-based outlier: A DB(p, D)-outlier is an object O in a dataset T such that at least a fraction p of the objects in T lie at a distance greater than D from O (see the sketch below)  Algorithms for mining distance-based outliers [Knorr & Ng, VLDB’98]  Index-based algorithm  Nested-loop algorithm  Cell-based algorithm
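A minimal sketch of the DB(p, D) test written in nested-loop style; the data set and the values of p and D are illustrative assumptions, and real implementations add pruning rather than scanning all pairs.

```python
# Sketch of the nested-loop test for DB(p, D)-outliers: O is an outlier if at
# least a fraction p of the objects lie farther than D from O.
import numpy as np

def db_outliers(X, p, D):
    n = len(X)
    flags = []
    for i in range(n):
        dists = np.linalg.norm(X - X[i], axis=1)   # distances from object i to all objects
        frac_far = np.sum(dists > D) / (n - 1)     # exclude the object itself
        flags.append(frac_far >= p)
    return np.where(flags)[0]

rng = np.random.RandomState(3)
X = np.vstack([rng.normal(0, 1, size=(300, 2)), [[10.0, 10.0]]])
print(db_outliers(X, p=0.95, D=4.0))               # the appended far point should be reported
```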
  • 883.
896 Density-Based Local Outlier Detection  M. M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander. LOF: Identifying Density-Based Local Outliers. SIGMOD 2000  Distance-based outlier detection is based on the global distance distribution  It encounters difficulties in identifying outliers if the data are not uniformly distributed  Ex. C1 contains 400 loosely distributed points, C2 has 100 tightly condensed points, and there are 2 outlier points o1, o2  A distance-based method cannot identify o2 as an outlier  Need the concept of a local outlier  Local outlier factor (LOF)  Does not treat “outlier” as a crisp (yes/no) property  Each point is assigned a LOF score (see the sketch below)
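A minimal sketch of the local-outlier idea using scikit-learn's LocalOutlierFactor; the two synthetic clusters only loosely mirror the C1/C2 example, and the cluster sizes and n_neighbors are illustrative assumptions.

```python
# Sketch of density-based local outlier detection with LocalOutlierFactor.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
C1 = rng.normal(0, 3.0, size=(400, 2))      # loose cluster
C2 = rng.normal(20, 0.3, size=(100, 2))     # tight cluster
o = np.array([[20.0, 23.0]])                # local outlier near C2 (a purely global method may miss it)
X = np.vstack([C1, C2, o])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                 # -1 marks outliers
print(np.where(labels == -1)[0])            # the last index (500) should appear
print(-lof.negative_outlier_factor_[-1])    # LOF score of o: noticeably greater than 1
```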
  • 884.
897 Outlier Discovery: Deviation-Based Approach  Identifies outliers by examining the main characteristics of objects in a group  Objects that “deviate” from this description are considered outliers  Sequential exception technique  simulates the way in which humans distinguish unusual objects from among a series of supposedly similar objects  OLAP data cube technique  uses data cubes to identify regions of anomalies in large multidimensional data
  • 885.
    898 References (1)  B.Abraham and G.E.P. Box. Bayesian analysis of some outlier problems in time series. Biometrika, 1979.  Malik Agyemang, Ken Barker, and Rada Alhajj. A comprehensive survey of numeric and symbolic outlier mining techniques. Intell. Data Anal., 2006.  Deepak Agarwal. Detecting anomalies in cross-classied streams: a bayesian approach. Knowl. Inf. Syst., 2006.  C. C. Aggarwal and P. S. Yu. Outlier detection for high dimensional data. SIGMOD'01.  M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. Optics-of: Identifying local outliers. PKDD '99  M. M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander. LOF: Identifying density-based local outliers. SIGMOD'00.  V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Comput. Surv., 2009.  D. Dasgupta and N.S. Majumdar. Anomaly detection in multidimensional data using negative selection algorithm. Computational Intelligence, 2002.  E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo. A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data. In Proc. 2002 Int. Conf. of Data Mining for Security Applications, 2002.  E. Eskin. Anomaly detection over noisy data using learned probability distributions. ICML’00.  T. Fawcett and F. Provost. Adaptive fraud detection. Data Mining and Knowledge Discovery, 1997.  R. Fujimaki, T. Yairi, and K. Machida. An approach to spacecraft anomaly detection problem using kernel feature space. KDD '05  F. E. Grubbs. Procedures for detecting outlying observations in samples. Technometrics, 1969.
  • 886.
    899 References (2)  V.Hodge and J. Austin. A survey of outlier detection methodologies. Artif. Intell. Rev., 2004.  Douglas M Hawkins. Identification of Outliers. Chapman and Hall, 1980.  P. S. Horn, L. Feng, Y. Li, and A. J. Pesce. Effect of Outliers and Nonhealthy Individuals on Reference Interval Estimation. Clin Chem, 2001.  W. Jin, A. K. H. Tung, J. Han, and W. Wang. Ranking outliers using symmetric neighborhood relationship. PAKDD'06  E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB’98  M. Markou and S. Singh.. Novelty detection: a review| part 1: statistical approaches. Signal Process., 83(12), 2003.  M. Markou and S. Singh. Novelty detection: a review| part 2: neural network based approaches. Signal Process., 83(12), 2003.  S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos. Loci: Fast outlier detection using the local correlation integral. ICDE'03.  A. Patcha and J.-M. Park. An overview of anomaly detection techniques: Existing solutions and latest technological trends. Comput. Netw., 51(12):3448{3470, 2007.  W. Stefansky. Rejecting outliers in factorial designs. Technometrics, 14(2):469{479, 1972.  X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19(5):631{645, 2007.  Y. Tao, X. Xiao, and S. Zhou. Mining distance-based outliers from large databases in any metric space. KDD '06:  N. Ye and Q. Chen. An anomaly detection technique based on a chi-square statistic for detecting intrusions into information systems. Quality and Reliability Engineering International, 2001.
  • 887.
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 13 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University ©2011 Han, Kamber & Pei. All rights reserved.
  • 889.
902 Chapter 13: Data Mining Trends and Research Frontiers  Mining Complex Types of Data  Other Methodologies of Data Mining  Data Mining Applications  Data Mining and Society  Data Mining Trends  Summary
  • 890.
903 Mining Complex Types of Data  Mining Sequence Data  Mining Time Series  Mining Symbolic Sequences  Mining Biological Sequences  Mining Graphs and Networks  Mining Other Kinds of Data
  • 891.
904 Mining Sequence Data  Similarity Search in Time Series Data  Subsequence matching, dimensionality reduction, query-based similarity search, motif-based similarity search  Regression and Trend Analysis in Time-Series Data  long-term + cyclic + seasonal variations + random movements  Sequential Pattern Mining in Symbolic Sequences  GSP, PrefixSpan, constraint-based sequential pattern mining  Sequence Classification  Feature-based vs. sequence-distance-based vs. model-based  Alignment of Biological Sequences  Pairwise vs. multi-sequence alignment, substitution matrices, BLAST  Hidden Markov Models for Biological Sequence Analysis  Markov chains vs. hidden Markov models; forward, Viterbi, and Baum-Welch algorithms
  • 892.
905 Mining Graphs and Networks  Graph Pattern Mining  Frequent subgraph patterns, closed graph patterns, gSpan vs. CloseGraph  Statistical Modeling of Networks  Small-world phenomenon, power-law (long-tail) distribution, densification  Clustering and Classification of Graphs and Homogeneous Networks  Clustering: Fast Modularity vs. SCAN  Classification: model-based vs. pattern-based mining  Clustering, Ranking and Classification of Heterogeneous Networks  RankClus, RankClass, and meta-path-based, user-guided methodology  Role Discovery and Link Prediction in Information Networks  PathPredict  Similarity Search and OLAP in Information Networks: PathSim, GraphCube  Evolution of Social and Information Networks: EvoNetClus
  • 893.
906 Mining Other Kinds of Data  Mining Spatial Data  Spatial frequent/co-located patterns, spatial clustering and classification  Mining Spatiotemporal and Moving-Object Data  Spatiotemporal data mining, trajectory mining, Periodica, Swarm, …  Mining Cyber-Physical System Data  Applications: healthcare, air-traffic control, flood simulation  Mining Multimedia Data  Social media data, geo-tagged spatial clustering, periodicity analysis, …  Mining Text Data  Topic modeling, i-topic model, integration with geo- and networked data  Mining Web Data  Web content, web structure, and web usage mining  Mining Data Streams
  • 894.
907 Chapter 13: Data Mining Trends and Research Frontiers  Mining Complex Types of Data  Other Methodologies of Data Mining  Data Mining Applications  Data Mining and Society  Data Mining Trends  Summary
  • 895.
908 Other Methodologies of Data Mining  Statistical Data Mining  Views on Data Mining Foundations  Visual and Audio Data Mining
  • 896.
909 Major Statistical Data Mining Methods  Regression  Generalized Linear Models  Analysis of Variance  Mixed-Effect Models  Factor Analysis  Discriminant Analysis  Survival Analysis
  • 897.
910 Statistical Data Mining (1)  There are many well-established statistical techniques for data analysis, particularly for numeric data  applied extensively to data from scientific experiments and data from economics and the social sciences  Regression  predict the value of a response (dependent) variable from one or more predictor (independent) variables, where the variables are numeric (see the sketch below)  forms of regression: linear, multiple, weighted, polynomial, nonparametric, and robust
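A minimal sketch of plain linear regression, the first technique listed above; the synthetic data and coefficients are illustrative assumptions.

```python
# Sketch: ordinary linear regression of a numeric response on numeric predictors.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 2))                            # two numeric predictors
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + 4.0 + rng.normal(0, 0.5, 200)

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)                             # roughly [3.0, -1.5] and 4.0
print(model.predict([[5.0, 2.0]]))                               # prediction for a new observation
```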
  • 898.
911 Scientific and Statistical Data Mining (2)  Generalized linear models  allow a categorical response variable (or some transformation of it) to be related to a set of predictor variables  similar to the modeling of a numeric response variable using linear regression  include logistic regression and Poisson regression  Mixed-effect models  for analyzing grouped data, i.e., data that can be classified according to one or more grouping variables  typically describe relationships between a response variable and some covariates in data grouped according to one or more factors
  • 899.
912 Scientific and Statistical Data Mining (3)  Regression trees  Binary trees used for classification and prediction  Similar to decision trees: tests are performed at the internal nodes  In a regression tree, the mean of the objective attribute is computed and used as the predicted value  Analysis of variance  Analyzes experimental data for two or more populations described by a numeric response variable and one or more categorical variables (factors)
  • 900.
913 Statistical Data Mining (4)  Factor analysis  determines which variables are combined to generate a given factor  e.g., for many psychiatric data, one can only measure quantities (such as test scores) that indirectly reflect the factor of interest  Discriminant analysis  predicts a categorical response variable; commonly used in the social sciences  attempts to determine several discriminant functions (linear combinations of the independent variables) that discriminate among the groups defined by the response variable  www.spss.com/datamine/factor.htm
  • 901.
914 Statistical Data Mining (5)  Time series analysis: many methods, such as autoregression, ARIMA (autoregressive integrated moving-average) modeling, and long-memory time-series modeling  Quality control: displays group summary charts  Survival analysis  predicts the probability that a patient undergoing a medical treatment will survive at least to time t (life span prediction)
  • 902.
915 Other Methodologies of Data Mining  Statistical Data Mining  Views on Data Mining Foundations  Visual and Audio Data Mining
  • 903.
916 Views on Data Mining Foundations (I)  Data reduction  Basis of data mining: reduce the data representation  Trades accuracy for speed in response  Data compression  Basis of data mining: compress the given data by encoding in terms of bits, association rules, decision trees, clusters, etc.  Probability and statistical theory  Basis of data mining: discover joint probability distributions of random variables
  • 904.
917 Views on Data Mining Foundations (II)  Microeconomic view  A view of utility: finding patterns that are interesting only to the extent that they can be used in the decision-making process of some enterprise  Pattern discovery and inductive databases  Basis of data mining: discover patterns occurring in the database, such as associations, classification models, sequential patterns, etc.  Data mining is the problem of performing inductive logic on databases  The task is to query the data and the theory (i.e., the patterns) of the database  Popular among many researchers in database systems
  • 905.
918 Other Methodologies of Data Mining  Statistical Data Mining  Views on Data Mining Foundations  Visual and Audio Data Mining
  • 906.
    919 Visual Data Mining Visualization: Use of computer graphics to create visual images which aid in the understanding of complex, often massive representations of data  Visual Data Mining: discovering implicit but useful knowledge from large data sets using visualization techniques Compute r Graphics High Performance Computing Pattern Recognitio n Human Compute r Interface s Multimedia Systems Visual Data Mining
  • 907.
    920 Visualization  Purpose ofVisualization  Gain insight into an information space by mapping data onto graphical primitives  Provide qualitative overview of large data sets  Search for patterns, trends, structure, irregularities, relationships among data.  Help find interesting regions and suitable parameters for further quantitative analysis.  Provide a visual proof of computer representations derived
  • 908.
921 Visual Data Mining & Data Visualization  Integration of visualization and data mining  data visualization  data mining result visualization  data mining process visualization  interactive visual data mining  Data visualization  Data in a database or data warehouse can be viewed  at different levels of abstraction  as different combinations of attributes or dimensions  Data can be presented in various visual forms
  • 909.
922 Data Mining Result Visualization  Presentation of the results or knowledge obtained from data mining in visual forms  Examples  Scatter plots and boxplots (obtained from descriptive data mining)  Decision trees  Association rules  Clusters  Outliers  Generalized rules
  • 910.
923 Boxplots from StatSoft: Multiple Variable Combinations
  • 911.
924 Visualization of Data Mining Results in SAS Enterprise Miner: Scatter Plots
  • 912.
925 Visualization of Association Rules in SGI/MineSet 3.0
  • 913.
926 Visualization of a Decision Tree in SGI/MineSet 3.0
  • 914.
927 Visualization of Cluster Groupings in IBM Intelligent Miner
  • 915.
928 Data Mining Process Visualization  Presentation of the various processes of data mining in visual forms so that users can see  the data extraction process  where the data is extracted  how the data is cleaned, integrated, preprocessed, and mined  the method selected for data mining  where the results are stored  how they may be viewed
  • 916.
929 Visualization of Data Mining Processes in Clementine  Understand variations with visualized data  See your solution discovery process clearly
  • 917.
930 Interactive Visual Data Mining  Using visualization tools in the data mining process to help users make smart data mining decisions  Example  Display the data distribution in a set of attributes using colored sectors or columns (depending on whether the whole space is represented by a circle or a set of columns)  Use the display to decide which sector should first be selected for classification and where a good split point for this sector may be
  • 918.
931 Interactive Visual Mining by Perception-Based Classification (PBC)
  • 919.
    932 Audio Data Mining Uses audio signals to indicate the patterns of data or the features of data mining results  An interesting alternative to visual mining  An inverse task of mining audio (such as music) databases which is to find patterns from audio data  Visual data mining may disclose interesting patterns using graphical displays, but requires users to concentrate on watching patterns  Instead, transform patterns into sound and music and listen to pitches, rhythms, tune, and melody in order to identify anything interesting or unusual
  • 920.
933 Chapter 13: Data Mining Trends and Research Frontiers  Mining Complex Types of Data  Other Methodologies of Data Mining  Data Mining Applications  Data Mining and Society  Data Mining Trends  Summary
  • 921.
934 Data Mining Applications  Data mining: a young discipline with broad and diverse applications  There still exists a nontrivial gap between generic data mining methods and effective, scalable data mining tools for domain-specific applications  Some application domains (briefly discussed here)  Data Mining for Financial Data Analysis  Data Mining for Retail and Telecommunication Industries  Data Mining in Science and Engineering  Data Mining for Intrusion Detection and Prevention  Data Mining and Recommender Systems
  • 922.
935 Data Mining for Financial Data Analysis (I)  Financial data collected in banks and financial institutions are often relatively complete, reliable, and of high quality  Design and construction of data warehouses for multidimensional data analysis and data mining  View the debt and revenue changes by month, by region, by sector, and by other factors  Access statistical information such as max, min, total, average, trend, etc.  Loan payment prediction / consumer credit policy analysis  feature selection and attribute relevance ranking  loan payment performance
  • 923.
936 Data Mining for Financial Data Analysis (II)  Classification and clustering of customers for targeted marketing  multidimensional segmentation by nearest-neighbor methods, classification, decision trees, etc. to identify customer groups or to associate a new customer with an appropriate customer group  Detection of money laundering and other financial crimes  integration of data from multiple DBs (e.g., bank transactions, federal/state crime history DBs)  tools: data visualization, linkage analysis, classification, clustering, outlier analysis, and sequential pattern analysis tools (find unusual access sequences)
  • 924.
937 Data Mining for Retail & Telecomm. Industries (I)  Retail industry: huge amounts of data on sales, customer shopping history, e-commerce, etc.  Applications of retail data mining  Identify customer buying behaviors  Discover customer shopping patterns and trends  Improve the quality of customer service  Achieve better customer retention and satisfaction  Enhance goods consumption ratios  Design more effective goods transportation and distribution policies  The telecommunication and many other industries share many similar goals and expectations of retail data mining
  • 925.
938 Data Mining Practice for the Retail Industry  Design and construction of data warehouses  Multidimensional analysis of sales, customers, products, time, and region  Analysis of the effectiveness of sales campaigns  Customer retention: analysis of customer loyalty  Use customer loyalty card information to register sequences of purchases of particular customers  Use sequential pattern mining to investigate changes in customer consumption or loyalty  Suggest adjustments on the pricing and variety of goods  Product recommendation and cross-referencing of items  Fraud analysis and the identification of unusual patterns  Use of visualization tools in data analysis
  • 926.
939 Data Mining in Science and Engineering  Data warehouses and data preprocessing  Resolving inconsistencies or incompatible data collected in diverse environments and different periods (e.g., ecosystem studies)  Mining complex data types  Spatiotemporal, biological, diverse semantics and relationships  Graph-based and network-based mining  Links, relationships, data flow, etc.  Visualization tools and domain-specific knowledge  Other issues  Data mining in the social sciences and social studies: text and social media  Data mining in computer science: monitoring systems, …
  • 927.
940 Data Mining for Intrusion Detection and Prevention  The majority of intrusion detection and prevention systems use  Signature-based detection: use signatures, i.e., attack patterns that are preconfigured and predetermined by domain experts  Anomaly-based detection: build profiles (models of normal behavior) and detect behavior that substantially deviates from the profiles  How data mining can help  New data mining algorithms for intrusion detection  Association, correlation, and discriminative pattern analyses help select and build discriminative classifiers  Analysis of stream data: outlier detection, clustering, model shifting  Distributed data mining  Visualization and querying tools
  • 928.
941 Data Mining and Recommender Systems  Recommender systems: personalization, making product recommendations that are likely to be of interest to a user  Approaches: content-based, collaborative, or their hybrid  Content-based: recommends items that are similar to items the user preferred or queried in the past  Collaborative filtering: considers a user's social environment, i.e., the opinions of other customers who have similar tastes or preferences  Data mining and recommender systems  Given users C × items S, use the known ratings to predict unknown user-item ratings  Memory-based methods often use a k-nearest-neighbor approach (see the sketch below)  Model-based methods use a collection of ratings to learn a model (e.g., probabilistic models, clustering, Bayesian networks, etc.)
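A minimal sketch of the memory-based (k-nearest-neighbor) approach: predict a missing rating as a similarity-weighted average of the ratings given by the most similar users. The tiny ratings matrix (0 = unknown), the use of cosine similarity, and k are illustrative assumptions.

```python
# Sketch of memory-based collaborative filtering with user-user cosine similarity.
import numpy as np

R = np.array([[5, 4, 0, 1],      # rows: users, columns: items, 0 = unknown rating
              [4, 5, 4, 1],
              [1, 1, 0, 5],
              [1, 2, 1, 4],
              [5, 5, 5, 2]], dtype=float)

def predict(R, user, item, k=2):
    mask = R[:, item] > 0                                  # users who rated this item
    mask[user] = False
    candidates = np.where(mask)[0]
    # cosine similarity between the target user and each candidate user
    sims = np.array([np.dot(R[user], R[v]) /
                     (np.linalg.norm(R[user]) * np.linalg.norm(R[v]) + 1e-9)
                     for v in candidates])
    order = np.argsort(sims)[::-1][:k]                     # k most similar users
    top, top_sims = candidates[order], sims[order]
    return float(np.dot(top_sims, R[top, item]) / (top_sims.sum() + 1e-9))

print(predict(R, user=0, item=2))   # user 0's unknown rating for item 2, predicted from similar users
```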
  • 929.
942 Chapter 13: Data Mining Trends and Research Frontiers  Mining Complex Types of Data  Other Methodologies of Data Mining  Data Mining Applications  Data Mining and Society  Data Mining Trends  Summary
  • 930.
943 Ubiquitous and Invisible Data Mining  Ubiquitous data mining  Data mining is used everywhere, e.g., online shopping  Ex. customer relationship management (CRM)  Invisible data mining  Invisible: data mining functions are built into daily-life operations  Ex. Google search: users may be unaware that they are examining results returned by data mining  Invisible data mining is highly desirable  Invisible mining needs to consider efficiency and scalability, user interaction, incorporation of background knowledge and visualization techniques, finding interesting patterns, real-time operation, …  Further work: integration of data mining into existing business and scientific technologies to provide domain-specific data mining solutions
  • 931.
944 Privacy, Security and Social Impacts of Data Mining  Many data mining applications do not touch personal data  E.g., meteorology, astronomy, geography, geology, biology, and other scientific and engineering data  Many DM studies are on developing scalable algorithms to find general or statistically significant patterns, not touching individuals  The real privacy concern: unconstrained access to individual records, especially privacy-sensitive information  Method 1: Removing sensitive IDs associated with the data  Method 2: Data security-enhancing methods  Multi-level security model: permits access only to the authorized level  Encryption: e.g., blind signatures, biometric encryption, and anonymous databases (personal information is encrypted and stored at different locations)  Method 3: Privacy-preserving data mining methods
  • 932.
945 Privacy-Preserving Data Mining  Privacy-preserving (privacy-enhanced or privacy-sensitive) mining: obtaining valid mining results without disclosing the underlying sensitive data values  Often needs a trade-off between information loss and privacy  Privacy-preserving data mining methods:  Randomization (e.g., perturbation): add noise to the data in order to mask some attribute values of records (see the sketch below)  k-anonymity and l-diversity: alter individual records so that they cannot be uniquely identified  k-anonymity: any given record is indistinguishable from at least k − 1 other records  l-diversity: enforces intra-group diversity of sensitive values  Distributed privacy preservation: data partitioned and distributed either horizontally, vertically, or both  Downgrading the effectiveness of data mining: when the output of data mining may violate privacy
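A minimal sketch of the randomization (perturbation) idea: release values with additive noise so that individual records are masked while aggregate statistics are roughly preserved. The salary data and the noise scale are illustrative assumptions.

```python
# Sketch: additive-noise perturbation of a sensitive attribute.
import numpy as np

rng = np.random.RandomState(42)
salaries = rng.normal(60_000, 15_000, size=10_000)   # sensitive attribute (synthetic)

noise = rng.normal(0, 10_000, size=salaries.shape)   # masking noise; its distribution can be public
released = salaries + noise

print(abs(released.mean() - salaries.mean()))        # the aggregate mean is nearly preserved
print(abs(released[0] - salaries[0]))                # an individual value can shift substantially
```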
  • 933.
946 Chapter 13: Data Mining Trends and Research Frontiers  Mining Complex Types of Data  Other Methodologies of Data Mining  Data Mining Applications  Data Mining and Society  Data Mining Trends  Summary
  • 934.
947 Trends of Data Mining  Application exploration: dealing with application-specific problems  Scalable and interactive data mining methods  Integration of data mining with web search engines, database systems, data warehouse systems, and cloud computing systems  Mining social and information networks  Mining spatiotemporal, moving-object, and cyber-physical system data  Mining multimedia, text, and web data  Mining biological and biomedical data  Data mining with software engineering and system engineering  Visual and audio data mining  Distributed data mining and real-time data stream mining  Privacy protection and information security in data mining
  • 935.
948 Chapter 13: Data Mining Trends and Research Frontiers  Mining Complex Types of Data  Other Methodologies of Data Mining  Data Mining Applications  Data Mining and Society  Data Mining Trends  Summary
  • 936.
    949 Summary  We presenta high-level overview of mining complex data types  Statistical data mining methods, such as regression, generalized linear models, analysis of variance, etc., are popularly adopted  Researchers also try to build theoretical foundations for data mining  Visual/audio data mining has been popular and effective  Application-based mining integrates domain-specific knowledge with data analysis techniques and provide mission-specific solutions  Ubiquitous data mining and invisible data mining are penetrating our data lives  Privacy and data security are importance issues in data mining, and privacy-preserving data mining has been developed recently  Our discussion on trends in data mining shows that data mining is
  • 937.
950 References and Further Reading  The book lists many references for further reading; here we list only a few books  E. Alpaydin. Introduction to Machine Learning, 2nd ed., MIT Press, 2011  S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002  R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd ed., Wiley-Interscience, 2000  D. Easley and J. Kleinberg. Networks, Crowds, and Markets: Reasoning about a Highly Connected World. Cambridge University Press, 2010  U. Fayyad, G. Grinstein, and A. Wierse (eds.). Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2001  J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques, 3rd ed., Morgan Kaufmann, 2011  T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer-Verlag, 2009  D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009  B. Liu. Web Data Mining, Springer, 2006  T. M. Mitchell. Machine Learning, McGraw Hill, 1997  M. Newman. Networks: An Introduction. Oxford University Press, 2010  P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining, Addison-Wesley, 2005  I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., Morgan Kaufmann, 2005
  • 938.

Editor's Notes

  • #5 Two slides should be added after this one 1. Evolution of machine learning 2. Evolution of statistics methods
  • #19 I BELIEVE WE MAY NEED TO DO IT IN MORE IN-DEPTH INTRODUCTION, USING SOME EXAMPLES. So it will take one slide for one function, i.e., one chapter we want to cover. Do we need to cover chapter 2: preprocessing and 3. Statistical methods?
  • #25 This chapter will not be in the new version, will it? BUT SHOULD WESTILL INTRODCE THEM SO THAT THEY WILL GET AN OVERALL PICTURE?
  • #29 Add a definition/description of “traditional data analysis”.
  • #63 Note: We need to label the dark plotted points as Q1, Median, Q3 – that would help in understanding this graph. Tell audience: There is a shift in distribution of branch 1 WRT branch 2 in that the unit prices of items sold at branch 1 tend to be lower than those at branch 2.
  • #72 http://books.elsevier.com/companions/1558606890/pictures/Chapter_01/fig1-6b.gif
  • #227 K. Wu, E. Otoo, and A. Shoshani, Bitmap Index Compression Optimality, VLDB’04 Need to digest and rewrite it using an example! See full abstract on slide towards end of file.
  • #232 2*2^{100}-1, 1
  • #389 Sacre Coeur in Montmartre
  • #482 I : the expected information needed to classify a given sample E (entropy) : expected information based on the partitioning into subsets by A
  • #559 MK: Note – different notation than used in book. Will have to standardize notation.
  • #593 Explore the bound in mining
  • #594 One fig
  • #628 MK: Do we want to keep this slide? It is not in the text and may confuse the students.
  • #723 We use this simple definition of tightness for efficiency concerns.
  • #815 But, how to compute the similarity efficiently? Computing inner product of two NxN matrices is too expensive.
  • #816 Very expensive to compute directly We convert it into another form
  • #912 Mixed-effects models provide a powerful and flexible tool for the analysis of balanced and unbalanced grouped data. These data arise in several areas of investigation and are characterized by the presence of correlation between observations within the same group. Some examples are repeated measures data, longitudinal studies, and nested designs. Classical modeling techniques which assume independence of the observations are not appropriate for grouped data.
  • #929 How the interactive Clementine knowledge discovery process works See your solution discovery process clearly The interactive stream approach to data mining is the key to Clementine's power. Using icons that represent steps in the data mining process, you mine your data by building a stream - a visual map of the process your data flows through. Start by simply dragging a source icon from the object palette onto the Clementine desktop to access your data flow. Then, explore your data visually with graphs. Apply several types of algorithms to build your model by simply placing the appropriate icons onto the desktop to form a stream. Discover knowledge interactively Data mining with Clementine is a "discovery-driven" process. Work toward a solution by applying your business expertise to select the next step in your stream, based on the discoveries made in the previous step. You can continually adapt or extend initial streams as you work through the solution to your business problem. Easily build and test models All of Clementine's advanced techniques work together to quickly give you the best answer to your business problems. You can build and test numerous models to immediately see which model produces the best result. Or you can even combine models by using the results of one model as input into another model. These "meta-models" consider the initial model's decisions and can improve results substantially. Understand variations in your business with visualized data Powerful data visualization techniques help you understand key relationships in your data and guide the way to the best results. Spot characteristics and patterns at a glance with Clementine's interactive graphs. Then "query by mouse" to explore these patterns by selecting subsets of data or deriving new variables on the fly from discoveries made within the graph. How Clementine scales to the size of the challenge The Clementine approach to scaling is unique in the way it aims to scale the complete data mining process to the size of large, challenging datasets. Clementine executes common operations used throughout the data mining process in the database through SQL queries. This process leverages the power of the database for faster processing, enabling you to get better results with large datasets.
  • #944 Buying patterns, targeted marketing