DISCRETIZATION AND CONCEPT HIERARCHY
GENERATION
Discretization:
Types of attributes:
Nominal: values from an unordered set, e.g.,
color, profession
Ordinal: values from an ordered set, e.g.,
military or academic rank
Continuous: numeric values, e.g., integer or real
numbers
Discretization:
Divide the range of a continuous attribute into
intervals
Reduce data size by discretization
Discretization and Concept Hierarchy:
Discretization
Reduce the number of values for a given
continuous attribute by dividing the range of the
attribute into intervals
Interval labels can then be used to replace actual
data values
Supervised vs. unsupervised
Split (top-down) vs. merge (bottom-up)
Discretization can be performed recursively on an
attribute
Concept hierarchy:
Concept hierarchy formation
Recursively reduce the data by collecting and
replacing low level concepts (such as numeric
values for age) by higher level concepts (such as
young, middle-aged, or senior)
Detail lost
More meaningful
Easier to interpret
Mining becomes easier
Several concept hierarchies can be defined for the
same attribute
Manual / Implicit
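As a minimal sketch of replacing a low-level concept by a higher-level one, the snippet below maps numeric ages to the labels young, middle-aged, and senior. The boundaries 30 and 60 and the names CUTS, LABELS, and age_concept are assumptions chosen for illustration; the notes do not fix them.

```python
from bisect import bisect_right

# Hypothetical age boundaries; any expert-defined cut points would work.
CUTS = [30, 60]
LABELS = ["young", "middle-aged", "senior"]

def age_concept(age):
    """Return the higher-level concept that covers a numeric age."""
    # bisect_right counts how many cut points are <= age,
    # which is exactly the index of the matching label.
    return LABELS[bisect_right(CUTS, age)]

ages = [23, 30, 47, 61]
concepts = [age_concept(a) for a in ages]
```

A second list of cut points and labels for the same attribute would give a second concept hierarchy, which is the sense in which several hierarchies can coexist.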
Discretization and Concept Hierarchy Generation for
Numeric Data:
Typical methods:
Binning
Histogram analysis
Clustering analysis
Entropy-based discretization
χ² merging
Segmentation by natural partitioning
All the methods can be applied recursively
Techniques:
Binning
Distribute values into bins
Replace by bin mean / median
Recursive application leads to concept
hierarchies
Unsupervised technique
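The binning steps above can be sketched as follows, assuming equal-depth (equal-frequency) bins and smoothing by bin mean. The function names and the sample prices are illustrative assumptions, not part of the notes.

```python
def equal_depth_bins(values, n_bins):
    """Distribute sorted values into n_bins bins of (nearly) equal size."""
    if n_bins < 1 or n_bins > len(values):
        raise ValueError("n_bins must be between 1 and len(values)")
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # the first `extra` bins take one additional value
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

def smooth_by_mean(bins):
    """Replace every value in a bin by that bin's mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_depth_bins(prices, 3)
smoothed = smooth_by_mean(bins)
```

Applying equal_depth_bins again to the bin means would give the next, coarser level of a concept hierarchy.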
Histogram Analysis
Data Distribution Partition
Equiwidth: (0, 100], (100, 200], ...
Equidepth
Recursive
Minimum Interval size
Unsupervised
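A minimal sketch of an equi-width partition, assuming intervals that are open on the left and closed on the right, as in the (0, 100], (100, 200] example. The function name, the start parameter, and the sample values are assumptions for illustration.

```python
import math

def equi_width_histogram(values, width, start=0):
    """Count values per interval (lo, lo + width], keyed by (lo, hi)."""
    counts = {}
    for v in values:
        # index k of the interval (start + k*width, start + (k+1)*width]
        k = math.ceil((v - start) / width) - 1
        lo = start + k * width
        key = (lo, lo + width)
        counts[key] = counts.get(key, 0) + 1
    return dict(sorted(counts.items()))

values = [50, 100, 101, 150, 200, 250]
hist = equi_width_histogram(values, 100)
```

An equidepth partition would instead choose the edges so that each interval holds roughly the same number of values.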
Cluster Analysis
Clusters form nodes of concept hierarchy
Can decompose / combine
Lower level / higher level of hierarchy
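The notes do not name a clustering algorithm, so the sketch below swaps in a simple one for one-dimensional data: cut the sorted values at the k - 1 largest gaps (equivalent to single-linkage clustering in 1-D). The function name and the sample ages are assumptions. Running it with a smaller k combines lower-level clusters into higher-level ones.

```python
def gap_clusters(values, k):
    """Split sorted 1-D values into k clusters at the k - 1 largest gaps."""
    ordered = sorted(values)
    # positions i where a cut falls between ordered[i] and ordered[i + 1]
    cuts = sorted(
        range(len(ordered) - 1),
        key=lambda i: ordered[i + 1] - ordered[i],
        reverse=True,
    )[: k - 1]
    clusters, start = [], 0
    for cut in sorted(cuts):
        clusters.append(ordered[start : cut + 1])
        start = cut + 1
    clusters.append(ordered[start:])
    return clusters

ages = [21, 23, 25, 41, 44, 46, 70, 72]
lower_level = gap_clusters(ages, 3)   # finer clusters
higher_level = gap_clusters(ages, 2)  # two of them combined
```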
Entropy-Based Discretization:
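Given a set of samples S, if S is partitioned into two
intervals S1 and S2 using boundary T, the expected
information requirement after partitioning is
$$I(S, T) = \frac{|S_1|}{|S|}\,\mathrm{Entropy}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Entropy}(S_2)$$
Entropy is calculated based on class distribution of
the samples in the set. Given m classes, the entropy
of S1 is
$$\mathrm{Entropy}(S_1) = -\sum_{i=1}^{m} p_i \log_2 p_i$$
where p_i is the probability of class i in S1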
The boundary that minimizes the expected
information requirement over all possible boundaries
is selected as a binary discretization
The process is recursively applied to partitions
obtained until some stopping criterion is met
Reduces data size
Class information is considered
Improves accuracy
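One binary split of entropy-based discretization can be sketched as below. Candidate boundaries are taken as midpoints between consecutive distinct values, which is an assumption; the function names and the sample data are also illustrative. The recursive step and the stopping criterion are left out.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_boundary(values, labels):
    """Return (T, info): the boundary minimizing the expected information requirement."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_info = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        # weighted entropy of the two intervals S1 and S2
        info = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if info < best_info:
            best_t, best_info = t, info
    return best_t, best_info

ages = [22, 25, 30, 41, 45, 52]
buys = ["no", "no", "no", "yes", "yes", "yes"]
t, info = best_boundary(ages, buys)
```

Here the classes separate cleanly, so the chosen boundary leaves both intervals pure and the expected information requirement drops to zero.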
Interval Merging by χ² Analysis:
ChiMerge
Bottom-up approach
Finds the best neighbouring intervals and
merges them to form larger intervals
Supervised
If two adjacent intervals have similar
distribution of classes they can be merged
Initially each value is in a separate interval
χ² tests are performed for adjacent intervals.
Those with the least χ² values are merged
Can be repeated
Stopping condition (Threshold, Number of
intervals)
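The ChiMerge loop can be sketched as follows, under the assumption that the stopping condition is a maximum number of intervals (the threshold variant is omitted). The function names and the sample data are illustrative, and this is a simplified sketch rather than a reference implementation.

```python
from collections import Counter

def chi2(a, b):
    """Pearson chi-square statistic for the class counts of two adjacent intervals."""
    classes = set(a) | set(b)
    total = sum(a.values()) + sum(b.values())
    stat = 0.0
    for row in (a, b):
        row_total = sum(row.values())
        for c in classes:
            expected = row_total * (a[c] + b[c]) / total
            stat += (row[c] - expected) ** 2 / expected
    return stat

def chimerge(values, labels, max_intervals):
    """Merge adjacent intervals bottom-up until max_intervals remain."""
    # start with one interval per distinct value: [low, high, class counts]
    intervals = []
    for v, lab in sorted(zip(values, labels)):
        if intervals and intervals[-1][0] == v:
            intervals[-1][2][lab] += 1
        else:
            intervals.append([v, v, Counter({lab: 1})])
    while len(intervals) > max_intervals:
        scores = [chi2(intervals[i][2], intervals[i + 1][2])
                  for i in range(len(intervals) - 1)]
        i = scores.index(min(scores))  # most similar adjacent pair
        lo, _, ca = intervals[i]
        _, hi, cb = intervals[i + 1]
        intervals[i:i + 2] = [[lo, hi, ca + cb]]
    return [(lo, hi) for lo, hi, _ in intervals]

values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
merged = chimerge(values, labels, 2)
```

Adjacent intervals with identical class distributions score 0 and are merged first, so the boundary between the two classes survives longest.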
Segmentation by Natural Partitioning:
A simple 3-4-5 rule can be used to segment numeric
data into relatively uniform, natural intervals.
If an interval covers 3, 6, 7 or 9 distinct values
at the most significant digit, partition the range
into 3 intervals (equi-width for 3, 6, and 9;
grouped 2-3-2 for 7)
If it covers 2, 4, or 8 distinct values at the most
significant digit, partition the range into 4
intervals
If it covers 1, 5, or 10 distinct values at the
most significant digit, partition the range into 5
intervals
Outliers could be present
Consider only the majority values
5th percentile to 95th percentile
Example of 3-4-5 Rule
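A sketch of one level of the 3-4-5 rule, assuming low and high are already the 5th and 95th percentile values, are not both zero, and that the most significant digit is taken from the larger of the two in absolute value. The function name and the input numbers are illustrative assumptions.

```python
import math

def three_four_five(low, high):
    """Return the interval edges for one level of the 3-4-5 rule."""
    # unit of the most significant digit, e.g. 1_000_000 for 1_838_761
    msd = 10 ** math.floor(math.log10(max(abs(low), abs(high))))
    lo = math.floor(low / msd) * msd   # round low down
    hi = math.ceil(high / msd) * msd   # round high up
    distinct = round((hi - lo) / msd)  # distinct values at the msd
    if distinct == 7:
        widths = [2, 3, 2]
    elif distinct in (3, 6, 9):
        widths = [distinct // 3] * 3
    elif distinct in (2, 4, 8):
        widths = [distinct / 4] * 4
    elif distinct in (1, 5, 10):
        widths = [distinct / 5] * 5
    else:
        raise ValueError("range not covered by the 3-4-5 rule")
    edges = [lo]
    for w in widths:
        edges.append(edges[-1] + w * msd)
    return edges

edges = three_four_five(-159876, 1838761)
```

With these inputs the range rounds to -1,000,000 .. 2,000,000, which covers 3 distinct values at the most significant digit, so it splits into 3 equi-width intervals. Each interval can then be partitioned again by the same rule.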
Concept Hierarchy Generation for Categorical Data:
Specification of a partial ordering of attributes
explicitly at the schema level by users or experts
User / Expert defines hierarchy
Street < city < state < country
Specification of a portion of a hierarchy by explicit
data grouping
Manual
Intermediate level information specified
Industrial, Agricultural, ...
Specification of a set of attributes but not their partial
ordering
Automatically inferring the hierarchy
Heuristic rule
High level concepts contain a smaller number
of values
Specification of only a partial set of attributes
Embedding data semantics
Attributes with tight semantic connections are
pinned together
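The heuristic for automatically inferring a hierarchy (higher-level concepts have fewer distinct values) can be sketched as below. The function name and the sample address rows are made up for illustration.

```python
def infer_hierarchy(rows, attributes):
    """Order attributes from highest level (fewest distinct values) to lowest."""
    distinct = {a: len({row[a] for row in rows}) for a in attributes}
    return sorted(attributes, key=lambda a: distinct[a])

# Hypothetical sample data.
rows = [
    {"street": "1 Main St", "city": "Springfield", "state": "IL", "country": "USA"},
    {"street": "9 Oak Ave", "city": "Chicago", "state": "IL", "country": "USA"},
    {"street": "5 Elm Rd", "city": "Austin", "state": "TX", "country": "USA"},
    {"street": "7 Pine Ln", "city": "Austin", "state": "TX", "country": "USA"},
]
order = infer_hierarchy(rows, ["street", "city", "state", "country"])
```

The result is only a heuristic ordering; an attribute such as weekday has fewer distinct values than month without sitting above it, so a user or expert may still need to adjust the outcome.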