Unit 2: Introduction to Data Mining, Data Exploration and Data Preprocessing
Learning Outcomes
• Define data preprocessing and its importance
• Describe data quality measures
• Explain data integration, data transformation, data reduction, and data discretization
• Define concept hierarchy
Outline
• Data Mining Task Primitives
• Measures of Data Quality
• Architecture, KDD process, Issues in Data Mining, Applications of
Data Mining
• Data Exploration: Types of Attributes, Statistical Description of Data
• Data Visualization
• Data cleaning
• Data Integration
• Data Transformation
• Data Discretization
• Concept Hierarchy
Data Mining Task Primitives
A data mining task can be specified in the form of a data mining query, which is input
to the data mining system.
A data mining query is defined in terms of data mining task primitives.
The data mining task primitives are:
1. Set of task relevant data to be mined. (relevant attributes / dimensions)
2. Kind of knowledge to be mined (kind of data mining functionality)
3. Background knowledge to be used in the discovery process. (knowledge base –
concept hierarchy, user beliefs)
4. Interestingness measures and thresholds for pattern evaluation (the interestingness
measures for association rules are ‘support’ and ‘confidence’)
5. Expected representation for visualizing the discovered patterns. (Tables,
charts, graphs, decision trees, cubes)
Why Data Preprocessing Is
Important?
• To know about your data
• Before mining we need to get data ready.
• No quality data, no quality mining results!
• Real-world data is noisy, enormous in volume, and comes from heterogeneous sources
• Data is inconsistent and incomplete.
• Preprocessing is one of the most critical steps in a data mining process
Measures of Data Quality
A well-accepted multidimensional view of data quality includes the following
properties:
• Accuracy
• Completeness
• Consistency
• Timeliness
• Uniqueness
• Validity
• Auditability
Why Is Data Dirty?
Data is dirty for the following reasons.
• Incomplete data may come from
• “Not applicable” data value when collected.
• Different considerations between the time when the data was collected and
when it is analyzed.
• Human / hardware / software problems
• Noisy data (incorrect values) may come from
• Faulty data collection instruments
• Human or computer error at data entry
• Errors in data transmission
• Inconsistent data may come from
• Different data sources
• Functional dependency violation (e.g., modify some linked data)
• Duplicate records also need data cleaning
Why Is Data Pre-processing Important?
Data preprocessing is important because:
• No quality data, no quality mining results!
• Quality decisions must be based on quality data
• A data warehouse needs consistent integration of quality data
Major Tasks in Data Pre-processing
• Data cleaning
• Data integration
• Data transformation
• Data reduction
• Data Discretization
Data Cleaning
Data cleaning tasks are:
• Filling in missing values
• Identifying outliers and smoothing out noisy data
• Correcting inconsistent data
• Resolving redundancy caused by data integration
• E.g., a missing customer income attribute in the sales data
Missing Data
Methods of handling missing values:
a) Ignore the tuple
b) Fill in the missing value manually
c) Use of a Global constant to fill in the missing value
d) Use the attribute mean to fill in the missing value
e) Use the attribute mean of all samples belonging to
the same class as that of the given tuple.
f) Use the most probable value to fill in the missing
value
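A minimal sketch of methods (c)-(f) using pandas; the DataFrame, its column names, and the sentinel value are illustrative assumptions, not part of the original example.

```python
# A minimal sketch of methods (c)-(f); the DataFrame, column names,
# and the sentinel value are illustrative assumptions.
import pandas as pd

df = pd.DataFrame({
    "age":    [23, 45, None, 31, None],
    "income": [30000, None, 52000, None, 41000],
    "class":  ["low", "high", "high", "low", "low"],
})

# (c) Fill with a global constant / sentinel value.
filled_const = df["income"].fillna(-1)

# (d) Fill with the attribute mean.
filled_mean = df["income"].fillna(df["income"].mean())

# (e) Fill with the mean of samples belonging to the same class as the tuple.
filled_class_mean = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)

# (f) Fill with the most probable value; the mode is used here as a simple
# proxy for a value predicted by regression or a decision tree.
filled_mode = df["age"].fillna(df["age"].mode().iloc[0])
```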
How to Handle Noisy
Data?
• Binning method:
• first sort data and partition into (equi-depth)
bins
• then one can smooth by bin means, smooth by
bin median, smooth by bin boundaries, etc.
• Clustering
• detect and remove outliers
• Combined computer and human inspection
• detect suspicious values and check by human
• Regression
• smooth by fitting the data to regression functions
Simple Discretization Methods: Binning
• Equal-width (distance) partitioning:
• It divides the range into N intervals of equal size:
uniform grid
• if A and B are the lowest and highest values of the
attribute, the width of intervals will be: W =
(B-A)/N.
• The most straightforward
• But outliers may dominate presentation
• Skewed data is not handled well.
• Equal-depth (frequency) partitioning:
• It divides the range into N intervals, each
containing approximately the same number of samples
• Good data scaling
• Managing categorical attributes can be tricky.
Binning Methods for Data
Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-width) bins:
- Bin 1 (4-14): 4, 8, 9
- Bin 2(15-24): 15, 21, 21, 24
- Bin 3(25-34): 25, 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 7, 7, 7
- Bin 2: 20, 20, 20, 20
- Bin 3: 28, 28, 28, 28, 28
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4
- Bin 2: 15, 24, 24, 24
- Bin 3: 25, 25, 25, 25, 34
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
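The equal-frequency example above can be reproduced with a short Python sketch; the function names and the rounding choice are ours, and the boundary smoothing here uses each bin's minimum and maximum value (the equi-width example earlier used the interval boundaries instead).

```python
# A small sketch of equal-frequency binning and smoothing on the price
# data from the example above.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

def equal_depth_bins(values, n_bins):
    """Split sorted values into n_bins bins of (almost) equal size."""
    values = sorted(values)
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace each value by the closer of the bin's min or max value."""
    out = []
    for b in bins:
        lo, hi = min(b), max(b)
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

bins = equal_depth_bins(prices, 3)
print(bins)                        # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```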
b) Regression
- Fits the data to a mathematical function (e.g., a straight line); values that
deviate strongly from the fitted function are treated as outliers
c) Clustering
- Groups similar values into clusters; values falling outside the clusters are
treated as outliers
Data Integration
• Data integration:
• Combines data from multiple sources into a single store
• Schema integration:
• Integrate metadata from different sources.
• Matching up real-world entities from multiple data sources is referred to as the entity
identification problem.
• Redundancy:
• An attribute may be redundant if it can be derived or obtained from another
attribute or set of attributes.
• Inconsistencies in attributes can also cause redundancies in the resulting data set.
• Some redundancies can be detected by correlation analysis
• Detecting and resolving data value conflicts
• For the same real world entity, attribute values from different sources are different.
• An attribute in one system may be recorded at a lower level of abstraction than
the “same” attribute in another.
Handling Redundancy in Data Integration
• Redundant data often occur when integrating
multiple databases
• Object identification: The same attribute or
object may have different names in different
databases
• Derivable data: One attribute may be a “derived”
attribute in another table, e.g., annual revenue
• Redundant attributes may be detected by
correlation analysis and covariance analysis
• Careful integration of the data from multiple sources
may help reduce/avoid redundancies and
inconsistencies.
If the correlation coefficient between attributes A and B is positive, then they are
positively correlated.
- That is, as A’s values increase, B’s values also increase.
- The larger the correlation coefficient, the stronger the correlation.
- If the correlation coefficient between A and B is zero, then the attributes are
independent.
- If the correlation coefficient is negative, then they are negatively correlated.
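A quick illustrative check of the (Pearson) correlation coefficient for two numeric attributes; the sample values below are made up.

```python
# Illustrative check of the correlation coefficient for two numeric
# attributes A and B (the sample values are made up).
import numpy as np

A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.1, 2.0, 2.9, 4.2, 4.8])   # grows with A -> positive correlation

# Pearson's r, the correlation coefficient described above.
r = np.corrcoef(A, B)[0, 1]
print(round(r, 3))   # close to +1: strongly positively correlated
```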
Correlation Analysis (Nominal Data)
• χ² (chi-square) test:
χ² = Σ (Observed − Expected)² / Expected
• The larger the χ² value, the more likely the
variables are related
• The cells that contribute the most to the χ² value
are those whose actual count is very different from
the expected count
• Correlation does not imply causality
• # of hospitals and # of car-theft in a city are correlated
• Both are causally linked to the third variable: population
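The χ² statistic above can be computed from a contingency table of two nominal attributes; the sketch below uses scipy and made-up counts in a 2×2 table.

```python
# A minimal sketch of the chi-square test for two nominal attributes,
# using made-up counts in a 2x2 contingency table.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: attribute A = {yes, no}; columns: attribute B = {yes, no}.
observed = np.array([[250,  200],
                     [ 50, 1000]])

chi2, p_value, dof, expected = chi2_contingency(observed)

print("expected counts:\n", expected)
print("chi-square:", round(chi2, 1), "p-value:", p_value)
# A large chi-square (small p-value) suggests A and B are not independent.
```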
Data Transformation
• Smoothing:- Removes noise from the data
• Aggregation:- Summarization, Data cube Construction
• Generalization:- Concept Hierarchy climbing
• Attribute / Feature Construction:- New attributes
constructed from the given ones
• Normalization:- Data scaled to fall within a specified range
- min-max normalization
- z-score normalization
- normalization by decimal scaling
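A small sketch of the three normalization methods; the sample values and the [0, 1] target range for min-max are illustrative choices.

```python
# Min-max, z-score, and decimal-scaling normalization on sample values.
import numpy as np

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to [new_min, new_max].
def min_max(v, new_min=0.0, new_max=1.0):
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization: (value - mean) / standard deviation.
def z_score(v):
    return (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j, the smallest power of 10 such that
# the maximum absolute normalized value is below 1.
def decimal_scaling(v):
    j = int(np.floor(np.log10(np.abs(v).max()))) + 1
    return v / (10 ** j)

print(min_max(values))          # [0.    0.125 0.25  0.5   1.   ]
print(z_score(values))
print(decimal_scaling(values))  # [0.02 0.03 0.04 0.06 0.1 ]
```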
Data Reduction
Why Data Reduction?
- A database or data warehouse may store terabytes of data
- Complex data analysis or mining will take a long time to run on the complete data set
What is Data Reduction?
- Obtaining a reduced representation of the complete dataset that produces the same
(or almost the same) mining / analytical results as the original
Data Reduction Strategies
1. Data cube Aggregation
2. Attribute (Subset) Selection
3. Dimensionality reduction – remove unwanted
attributes
4. Data Compression
5. Numerosity reduction – Fit data into mathematical
models
6. Discretization and Concept Hierarchy Generation
Data Cube Aggregation
• The lowest level of a data cube is called the base cuboid.
• Single-level aggregation - select a particular attribute and aggregate along it.
• E.g., aggregate along ‘Year’ in sales data.
• Multiple levels of aggregation - aggregate along multiple attributes, further reducing the size of
the data to analyze.
• When a query is posed by the user, use the appropriate level of aggregation (data cube) to answer
it.
• Queries regarding aggregated information should be answered using the data cube whenever
possible.
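A minimal sketch of single-level and multi-level aggregation of sales data using pandas; the DataFrame and its column names (year, quarter, branch, sales) are illustrative assumptions.

```python
# Aggregating sales data along one or more dimensions with pandas.
import pandas as pd

sales = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q3", "Q1", "Q2", "Q3"],
    "branch":  ["A", "A", "B", "A", "B", "B"],
    "sales":   [100, 120, 90, 130, 95, 110],
})

# Single-level aggregation: roll the quarterly figures up to 'year'.
per_year = sales.groupby("year")["sales"].sum()

# Multi-level aggregation: aggregate along 'year' and 'branch', further
# reducing the data to analyze for year/branch queries.
per_year_branch = sales.groupby(["year", "branch"])["sales"].sum()

print(per_year)
print(per_year_branch)
```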
Attribute Subset Selection
Feature Selection: (attribute subset selection)
- The goal of attribute subset selection is to find the minimum
set of attributes such that the resulting probability
distribution of data classes is as close as possible to the
original distribution obtained using all attributes.
Heuristic Methods
Because the number of possible attribute subsets is exponential, heuristic methods are used:
- Step wise forward selection
- Step wise backward elimination
- Combining forward selection and backward elimination
- Decision Tree induction
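A rough sketch of stepwise forward selection: greedily add the attribute that most improves cross-validated accuracy until no attribute helps. The iris dataset and the decision-tree scorer are illustrative assumptions, not the slides' example.

```python
# Greedy stepwise forward selection with a decision-tree scorer.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n_attrs = X.shape[1]

selected, remaining = [], list(range(n_attrs))
best_score = 0.0

while remaining:
    # Try adding each remaining attribute to the current subset.
    scores = {
        a: cross_val_score(DecisionTreeClassifier(random_state=0),
                           X[:, selected + [a]], y, cv=5).mean()
        for a in remaining
    }
    attr, score = max(scores.items(), key=lambda kv: kv[1])
    if score <= best_score:        # stop when no attribute improves the score
        break
    selected.append(attr)
    remaining.remove(attr)
    best_score = score

print("selected attribute indices:", selected, "score:", round(best_score, 3))
```

Backward elimination works the same way in reverse: start from all attributes and greedily drop the one whose removal hurts the score least.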
Data Compression
- Compressed representation of the original data.
- Two types
- Lossless Compression
- Lossy Compression
Numerosity Reduction
- Reduces the data volume by choosing smaller forms of data representation.
- Two types: parametric and non-parametric.
- Parametric: the data are fitted to a model; only the model parameters are
stored, not the actual data.
- Non-parametric: do not fit the data to models.
Regression and Log-Linear
Models
- Linear regression: the data are modeled to fit a straight line, i.e., the equation
y = ax + b
- Here y is called the “response variable” and x the “predictor variable”.
- a and b are called the regression coefficients: b is the y-intercept and a is the slope.
- The regression coefficients can be solved for using the method of least squares.
- Multiple regression: an extension of linear regression in which the response variable Y
is modeled as a linear function of a multidimensional feature vector.
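A small sketch of fitting y = ax + b by the method of least squares; the sample points are made up, and the closed-form estimates below are the standard least-squares formulas.

```python
# Least-squares fit of y = a*x + b; once fitted, the line (two parameters)
# can stand in for the raw values (numerosity reduction).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # predictor variable
y = np.array([2.1, 4.2, 5.8, 8.1, 9.9])      # response variable

# Closed-form least-squares estimates of the slope a and intercept b.
a = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b = y.mean() - a * x.mean()

print("a (slope):", round(a, 3), "b (intercept):", round(b, 3))
```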
Histograms
- Uses binning to distribute the data.
- Histogram for an attribute A:
- Partitions the data of A into disjoint subsets / buckets.
- Buckets are shown along the horizontal axis of the histogram.
- The vertical axis represents the frequency of values in each bucket.
- Singleton bucket: holds only one attribute value / frequency pair.
Clustering
- Considers data tuples as objects.
- Partition objects into clusters.
- Objects within a cluster are similar to one another and the
objects in different
clusters are dissimilar.
- The cluster representation of the data can be used to replace the
actual data
- Effectiveness depends on the nature of the data.
- Effective for data that can be categorized into distinct clusters.
- Can also have hierarchical clustering of data
- There are many choices of clustering definitions and algorithms available.
Sampling
- Selects a random sample or subset of the data.
- Suppose a large dataset D contains N tuples.
1. Simple Random Sample WithOut Replacement (SRSWOR) of
size n:
- Draw n tuples from the original N tuples in D, where n<N.
- The probability of drawing any tuple in D is 1/N. That is all tuples have
equal chance
2. Simple Random Sample With Replacement (SRSWR) of size n:
- Similar to SRSWOR, except that each time when a tuple is drawn from
D it is recorded and replaced.
- After a tuple is drawn it is placed back in D so that it can be drawn
again
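A minimal sketch of SRSWOR and SRSWR over a dataset D of N tuples; the tuples themselves and the sample size are illustrative.

```python
# Simple random sampling without and with replacement.
import random

D = [f"tuple_{i}" for i in range(1, 101)]   # N = 100 tuples
n = 10                                      # sample size

# SRSWOR: each tuple can be drawn at most once.
srswor = random.sample(D, n)

# SRSWR: a drawn tuple is "placed back", so it can be drawn again.
srswr = [random.choice(D) for _ in range(n)]

print(srswor)
print(srswr)
```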
Discretization
• Technique that is used to convert a
continuous attribute into discrete
attributes.
• Some classification algorithms only
accept discrete values.
• There are two ways to discretize attributes:
• Unsupervised, where the class labels are not used
• Supervised, e.g., entropy-based
Automated Discretization Methods:
o Binning
o Histogram analysis
o Entropy based Discretization Method
o χ² merging (ChiMerge)
o Cluster Analysis
o Discretization by intuitive partitioning
Entropy-Based Discretization
• Top-Down Discretization
• The goal of this algorithm is to find the split with
the maximum information gain.
• The boundary that minimizes the entropy over
all possible boundaries is selected
• The process is recursively applied to partitions
obtained until some stopping criterion is met
• Such a boundary may reduce data size and
improve classification accuracy
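A rough sketch of one split of entropy-based discretization: evaluate every candidate boundary and keep the one that minimizes the weighted class entropy (i.e., maximizes information gain). The values and class labels below are made up; the recursion to further splits is omitted.

```python
# One level of entropy-based (top-down) discretization.
import math

values = [1, 2, 3, 8, 9, 10, 11, 20]
labels = ["A", "A", "A", "B", "B", "B", "B", "B"]

def entropy(lbls):
    """Class entropy of a list of labels."""
    n = len(lbls)
    counts = {l: lbls.count(l) for l in set(lbls)}
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def best_boundary(vals, lbls):
    """Return the boundary with the lowest weighted entropy."""
    pairs = sorted(zip(vals, lbls))
    best = None
    for i in range(1, len(pairs)):
        boundary = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v <= boundary]
        right = [l for v, l in pairs if v > boundary]
        # Weighted entropy of the two partitions induced by this boundary.
        w_ent = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if best is None or w_ent < best[1]:
            best = (boundary, w_ent)
    return best

boundary, ent = best_boundary(values, labels)
print("best split at", boundary, "weighted entropy", round(ent, 3))
# Expected: the split falls between 3 and 8, cleanly separating the two classes.
```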
Discretization by Intuitive
Partitioning
- Users prefer numerical intervals that are uniform, easy to use, intuitive, and natural.
- Cluster analysis may produce intervals such as ($53,245.78, $62,311.78].
- Intervals such as ($50,000, $60,000] are more intuitive.
- Follows 3-4-5 Rule:
o Partitions the given data range into 3 or 4 or 5 equi-width
intervals
o Partitions recursively, level-by-level, based on value range at
most significant digit
Segmentation by natural
partitioning
• The 3-4-5 rule can be used to segment numeric data into
relatively uniform, “natural” intervals.
If an interval covers 3, 6, 7 or 9 distinct values at the most
significant digit, partition the range into 3 equi-width
intervals
If it covers 2, 4, or 8 distinct values at the most significant
digit, partition the range into 4 intervals
If it covers 1, 5, or 10 distinct values at the most significant
digit, partition the range into 5 intervals
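A rough Python sketch of one level of the rule as stated above; the function name is ours, it assumes low and high are already rounded at the most significant digit (msd), and the recursive subdivision into lower levels is omitted.

```python
# One level of the 3-4-5 rule: count distinct values at the most
# significant digit and split into 3, 4, or 5 equal-width intervals.
def three_four_five(low, high, msd):
    distinct = round((high - low) / msd)
    if distinct in (3, 6, 7, 9):
        n = 3
    elif distinct in (2, 4, 8):
        n = 4
    elif distinct in (1, 5, 10):
        n = 5
    else:
        raise ValueError("range not covered by the 3-4-5 rule")
    width = (high - low) / n
    return [(low + i * width, low + (i + 1) * width) for i in range(n)]

# As in the worked example that follows: LOW' = -$1,000,000 and
# HIGH' = $2,000,000 cover 3 distinct values at msd = $1,000,000,
# so the range is split into 3 equal-width intervals.
print(three_four_five(-1_000_000, 2_000_000, 1_000_000))
# [(-1000000.0, 0.0), (0.0, 1000000.0), (1000000.0, 2000000.0)]
```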
Worked example (continuing from the previous slides, with LOW’ = -$1,000,000 and
HIGH’ = $2,000,000 at the most significant digit $1,000,000):
o Hence the intervals are: (-$1,000,000, $0], ($0, $1,000,000], ($1,000,000, $2,000,000]
o Since LOW’ < MIN, adjust the left boundary to make the first interval smaller.
o The most significant digit of MIN is at $100,000, so rounding MIN gives MIN’ = -$400,000.
o Hence the first interval is reduced to (-$400,000, $0].
o Since HIGH’ < MAX, add a new interval ($2,000,000, $5,000,000].
o Hence the top-tier hierarchy intervals are:
o (-$400,000, $0], ($0, $1,000,000], ($1,000,000, $2,000,000], ($2,000,000, $5,000,000]
o These are further subdivided by the 3-4-5 rule to obtain the lower-level hierarchies:
o Interval (-$400,000, $0] is divided into 4 equi-width intervals
o Interval ($0, $1,000,000] is divided into 5 equi-width intervals
o Interval ($1,000,000, $2,000,000] is divided into 5 equi-width intervals
o Interval ($2,000,000, $5,000,000] is divided into 3 equi-width intervals
Concept Hierarchy Generation
• Concept hierarchy organizes concepts (i.e., attribute values)
hierarchically and is usually associated with each dimension in
a data warehouse
• Concept hierarchies facilitate drilling and rolling in data
warehouses to view data in multiple granularity
• Concept hierarchy formation: Recursively reduce the data by
collecting and replacing low level concepts (such as numeric
values for age) by higher level concepts (such as youth, adult,
or senior)
• Concept hierarchies can be explicitly specified by domain
experts and/or data warehouse designers
• Concept hierarchies can be automatically formed for both
numeric and nominal data. For numeric data, use the
discretization methods discussed above.
Concept Hierarchy Generation for Nominal Data
• Specification of a partial/total ordering of attributes
explicitly at the schema level by users or experts
• street < city < state < country
• Specification of a hierarchy for a set of values by
explicit data grouping
• {Urbana, Champaign, Chicago} < Illinois
• Specification of only a partial set of attributes
• E.g., only street < city, not others
• Automatic generation of hierarchies (or attribute
levels) by the analysis of the number of distinct
values
• E.g., for a set of attributes {street, city, state, country}
Automatic Concept Hierarchy
Generation
• Some hierarchies can be automatically generated based on
the analysis of the number of distinct values per attribute in
the data set
• The attribute with the most distinct values is placed at the
lowest level of the hierarchy
country: 15 distinct values
province_or_state: 365 distinct values
city: 3,567 distinct values
street: 674,339 distinct values
• Exceptions, e.g., weekday, month, quarter, year
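A tiny sketch of the heuristic: sort the attributes by their number of distinct values and place the one with the most distinct values at the lowest level. The counts are the ones shown above; the dictionary and variable names are ours.

```python
# Ordering attributes into a hierarchy by their number of distinct values.
distinct_counts = {
    "country": 15,
    "province_or_state": 365,
    "city": 3_567,
    "street": 674_339,
}

# Fewest distinct values -> highest level; most distinct -> lowest level.
hierarchy = sorted(distinct_counts, key=distinct_counts.get)
print(" < ".join(reversed(hierarchy)))
# street < city < province_or_state < country
```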