Data Warehousing & Data Mining
BSc. CSIT, 7th Semester
UNIT 3: DATA PREPROCESSING
Data Preprocessing
Data preprocessing is a data mining technique used to transform raw data into a usable and understandable format.
Data mining algorithms cannot work well with raw data, so data quality must be checked before the algorithms are applied.
Why is data preprocessing important?
Preprocessing is mainly used to check and improve data quality, which can be assessed along the following dimensions:
Accuracy: whether the data entered is correct.
Completeness: whether all required data has been recorded and is available.
Consistency: whether the same data stored in different places matches.
Timeliness: whether the data is kept up to date.
Believability: whether the data can be trusted.
Interpretability: how easily the data can be understood.
Major Tasks in Data Preprocessing
The four major tasks in data preprocessing are as follows:
1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation
Data Cleaning
Data cleaning is the process of removing incorrect, incomplete, and inaccurate data from datasets.
Data cleaning also replaces missing values.
There are several techniques for handling missing data and noisy data.
1. Missing data
This situation arises when some values are missing from the data. It can be handled in various ways:
A standard value such as "Not Available" or "NA" can be used to replace the missing values.
Missing values can also be filled in manually, but this is not recommended when the dataset is large.
The attribute's mean can be used to replace a missing value when the data is roughly normally distributed; for a non-normal distribution, the attribute's median is used instead.
Algorithms such as regression or decision trees can be used to replace a missing value with the most probable value (a small sketch follows this list).
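For illustration, here is a minimal Python sketch of mean and median imputation; the impute function and the ages data are hypothetical and not part of the course slides.

    # A minimal sketch of missing-value handling: replace None with the
    # attribute mean for roughly normal data, or the median when skewed.
    import statistics

    def impute(values, strategy="mean"):
        """Fill None entries with the mean or median of the observed values."""
        observed = [v for v in values if v is not None]
        fill = statistics.mean(observed) if strategy == "mean" else statistics.median(observed)
        return [fill if v is None else v for v in values]

    ages = [21, 25, None, 30, None, 27]          # hypothetical attribute with gaps
    print(impute(ages, "mean"))                  # [21, 25, 25.75, 30, 25.75, 27]
    print(impute(ages, "median"))                # [21, 25, 26.0, 30, 26.0, 27]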
2. Noisy data
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by faulty data collection, data entry errors, etc. It can be handled in the following ways:
a. Binning
This method is used to smooth noisy data.
First, the data is sorted, and then the sorted values are distributed into bins.
There are three methods for smoothing the data in a bin:
Smoothing by bin means: each value in the bin is replaced by the mean value of the bin.
Smoothing by bin medians: each value in the bin is replaced by the median value of the bin.
Smoothing by bin boundaries: the minimum and maximum values of the bin are taken as the bin boundaries, and each value is replaced by the closest boundary value.
Binning Example (figure): sorted price data partitioned into bins and smoothed by bin means and bin boundaries.
Binning Example Description:
In the example above, the price data is first sorted and then partitioned into equal-frequency bins of size 3, i.e. each bin contains 3 values.
In smoothing by bin means, each value in a bin is replaced by the mean value of the bin (e.g. in Bin 1, the mean of 4, 8 and 15 is (4 + 8 + 15) / 3 = 9).
In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries, and each bin value is replaced by the closest boundary value (e.g. in Bin 1 the minimum is 4 and the maximum is 15; since 8 is closer to 4 than to 15, it is replaced by 4).
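The same idea as a small Python sketch. Bin 1 (4, 8, 15) matches the description above; the remaining price values are illustrative and not taken from the slide's figure.

    # Equal-frequency binning with smoothing by bin means and by bin boundaries.
    prices = [15, 4, 24, 21, 8, 28, 34, 21, 25]

    sorted_prices = sorted(prices)               # [4, 8, 15, 21, 21, 24, 25, 28, 34]
    bins = [sorted_prices[i:i + 3] for i in range(0, len(sorted_prices), 3)]

    by_means = [[round(sum(b) / len(b), 2)] * len(b) for b in bins]
    by_boundaries = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

    print(bins)            # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
    print(by_means)        # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
    print(by_boundaries)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]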
b. Regression: Data smoothing can also be done by regression, a technique that conforms data values to a function.
Linear regression involves finding the best line to fit two attributes so that one attribute can be used to predict the other.
Multiple linear regression is an extension of linear regression in which more than two attributes are involved and the data are fit to a multidimensional surface.
c. Outlier analysis: Outliers may be detected by clustering, where similar values are organized into groups or clusters; values that fall outside the set of clusters may be considered outliers.
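A minimal sketch of regression-based smoothing, assuming NumPy is available; the x and y values are hypothetical.

    # Smoothing by linear regression: one attribute (x) predicts the other (y),
    # and the noisy y values are replaced by the values on the fitted line.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # hypothetical predictor attribute
    y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])       # noisy attribute to smooth

    slope, intercept = np.polyfit(x, y, deg=1)     # fit the best straight line
    y_smoothed = slope * x + intercept             # replace values with fitted ones

    print(np.round(y_smoothed, 2))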
Data Integration
Data integration is the process of combining data from multiple sources into a single dataset. It is one of the main components of data management. Several problems must be considered during data integration:
Schema integration: integrating metadata (a set of data that describes other data) from different sources.
Entity identification problem: identifying the same real-world entity across multiple databases. For example, the system or the user should know that student_id in one database and student_name in another database belong to the same entity.
Detecting and resolving data value conflicts: values taken from different databases may differ when merged. For example, an attribute's values in one database may differ from those in another, such as a date stored as "MM/DD/YYYY" in one source and "DD/MM/YYYY" in another.
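A minimal sketch of resolving such a date-format conflict during integration, using Python's standard datetime module; the example dates and the to_iso helper are hypothetical.

    # Convert dates from two sources with different formats into one
    # canonical format before merging the records.
    from datetime import datetime

    def to_iso(date_string, source_format):
        """Parse a date in the source's format and return it as YYYY-MM-DD."""
        return datetime.strptime(date_string, source_format).strftime("%Y-%m-%d")

    print(to_iso("07/25/2023", "%m/%d/%Y"))   # source A uses MM/DD/YYYY -> 2023-07-25
    print(to_iso("25/07/2023", "%d/%m/%Y"))   # source B uses DD/MM/YYYY -> 2023-07-25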
Data Reduction
This process reduces the volume of the data, which makes analysis easier while producing the same, or almost the same, result.
The reduction also saves storage space.
Some of the techniques used in data reduction are dimensionality reduction, numerosity reduction, and data compression.
Dimensionality reduction:
This process is necessary for real-world applications because the data size is large.
In this process, the number of random variables or attributes is reduced so that the dimensionality of the data set decreases.
Attributes of the data are combined and merged without losing their original characteristics.
This also reduces storage space and computation time.
When the data is highly dimensional, a problem called the "curse of dimensionality" occurs.
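As one common dimensionality reduction technique (not named on the slides), here is a minimal PCA sketch using NumPy's SVD on hypothetical data: four correlated attributes are projected onto two components.

    # PCA via SVD: centre the data, then keep the top principal components.
    import numpy as np

    rng = np.random.default_rng(0)
    base = rng.normal(size=(100, 2))
    X = np.hstack([base, base + 0.01 * rng.normal(size=(100, 2))])  # 4 correlated columns

    X_centered = X - X.mean(axis=0)                 # centre each attribute
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    X_reduced = X_centered @ Vt[:2].T               # keep the top 2 components

    print(X.shape, "->", X_reduced.shape)           # (100, 4) -> (100, 2)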
Numerosity reduction:
In this method, the representation of the data is made smaller by reducing its volume.
There is no loss of information in this reduction.
Data compression:
Representing the data in a compressed form is called data compression.
This compression can be lossless or lossy.
When no information is lost during compression, it is called lossless compression, whereas lossy compression reduces the information but removes only unnecessary information.
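A minimal sketch of lossless compression using Python's built-in zlib module on hypothetical repetitive data; the original bytes are fully recoverable after decompression.

    # Lossless data compression: the decompressed bytes equal the original.
    import zlib

    raw = ("price,quantity\n" + "19.99,3\n" * 1000).encode("utf-8")
    compressed = zlib.compress(raw)

    print(len(raw), "->", len(compressed))            # far fewer bytes stored
    print(zlib.decompress(compressed) == raw)         # True: no information lost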
Data Transformation
A change made to the format or the structure of the data is called data transformation. This step can be simple or complex depending on the requirements. Some data transformation methods are:
Smoothing: With the help of algorithms, we can remove noise from the dataset, which helps in identifying its important features. By smoothing we can detect even a small change that helps in prediction.
Aggregation: In this method, the data is stored and presented in summary form. Data from multiple sources is integrated into a description suitable for data analysis. This is an important step, since the accuracy of the results depends on the quantity and quality of the data; when both are good, the results are more relevant.
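A minimal sketch of aggregation: daily sales records (hypothetical) are summarised into monthly totals, so the data is stored and presented in summary form.

    # Aggregate daily amounts into monthly totals.
    from collections import defaultdict

    daily_sales = [("2023-01-05", 120), ("2023-01-20", 80),
                   ("2023-02-03", 150), ("2023-02-28", 60)]

    monthly_totals = defaultdict(int)
    for date, amount in daily_sales:
        month = date[:7]                      # "YYYY-MM"
        monthly_totals[month] += amount

    print(dict(monthly_totals))               # {'2023-01': 200, '2023-02': 210}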
Discretization: Continuous data is split into intervals. Discretization reduces the data size. For example, rather than specifying the exact class time, we can use an interval such as (3 pm-5 pm) or (6 pm-8 pm).
Normalization: The method of scaling the data so that it can be represented in a smaller range, for example from -1.0 to 1.0.
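A minimal sketch of min-max normalization to the range [-1.0, 1.0]; the income values and the min_max_normalize function are hypothetical.

    # Rescale an attribute so its minimum maps to -1.0 and its maximum to 1.0.
    def min_max_normalize(values, new_min=-1.0, new_max=1.0):
        old_min, old_max = min(values), max(values)
        scale = (new_max - new_min) / (old_max - old_min)
        return [new_min + (v - old_min) * scale for v in values]

    incomes = [12_000, 35_000, 58_000, 73_600, 98_000]
    print([round(v, 2) for v in min_max_normalize(incomes)])
    # [-1.0, -0.47, 0.07, 0.43, 1.0]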
Forms of Data Preprocessing
Data Discretization and Concept Hierarchies
Data Discretization
Dividing the range of a continuous attribute into intervals.
Interval labels can then be used to replace actual data values.
Reduces the number of values for a given continuous attribute.
Some classification algorithms only accept categorical attributes.
This leads to a concise, easy-to-use, knowledge-level representation of mining results.
Discretization techniques can be categorized based on whether or not they use class information:
Supervised discretization: the discretization process uses class information.
Unsupervised discretization: the discretization process does not use class information.
Discretization techniques can also be categorized based on the direction in which they proceed:
Top-down discretization:
The process starts by finding one or a few points, called split points or cut points, to split the entire attribute range, and then repeats this recursively on the resulting intervals.
Bottom-up discretization:
Starts by considering all of the continuous values as potential split points, removes some by merging neighbouring values to form intervals, and then applies this process recursively to the resulting intervals.
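A minimal sketch of unsupervised, equal-width discretization on hypothetical age values; the equal_width_discretize function is illustrative, not a method named on the slides.

    # Split the attribute range into n equal-width intervals and replace each
    # value with its interval label.
    def equal_width_discretize(values, n_bins=3):
        lo, hi = min(values), max(values)
        width = (hi - lo) / n_bins
        labels = []
        for v in values:
            index = min(int((v - lo) / width), n_bins - 1)   # clamp the maximum value
            labels.append(f"[{lo + index * width:.0f}-{lo + (index + 1) * width:.0f})")
        return labels

    ages = [18, 22, 25, 33, 41, 47, 52, 60]
    print(equal_width_discretize(ages))
    # ['[18-32)', '[18-32)', '[18-32)', '[32-46)', '[32-46)', '[46-60)', '[46-60)', '[46-60)']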
Concept Hierarchies
Discretization can be performed recursively on an attribute to provide a hierarchical partitioning of the attribute values, known as a concept hierarchy.
Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts with higher-level concepts.
In the multidimensional model, data are organized into multiple dimensions, and each dimension contains multiple levels of abstraction defined by concept hierarchies.
This organization gives users the flexibility to view data from different perspectives.
Data mining on a reduced data set requires fewer input/output operations and is more efficient than mining on a larger data set.
Because of these benefits, discretization techniques and concept hierarchies are typically applied before data mining rather than during mining.
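A minimal sketch of using a concept hierarchy to replace low-level values with higher-level concepts; the city-to-country mapping and the records are hypothetical.

    # Roll low-level values (city) up to a higher-level concept (country).
    city_to_country = {
        "Kathmandu": "Nepal",
        "Pokhara": "Nepal",
        "Delhi": "India",
        "Mumbai": "India",
    }

    records = ["Kathmandu", "Delhi", "Pokhara", "Mumbai", "Kathmandu"]
    rolled_up = [city_to_country[city] for city in records]

    print(rolled_up)   # ['Nepal', 'India', 'Nepal', 'India', 'Nepal']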
Data Mining Primitives OR Data Mining Task Primitives
END OF UNIT 3