2.3 Data Cleaning
Data cleansing, or data cleaning, is the process of detecting and correcting corrupt or inaccurate records from a record set, table, or database.
Importance
Data cleaning tasks
o Fill in missing values
o Identify outliers and smooth out noisy data
o Correct inconsistent data
o Resolve redundancy caused by data integration
2.3.1 Missing Data
Data is not always available
E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
Missing data may be due to
o equipment malfunction
o data inconsistent with other recorded data and thus deleted
o data not entered due to misunderstanding
o certain data not considered important at the time of entry
o history or changes of the data not registered
Missing data may need to be inferred.
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
Fill in the missing value manually: tedious and infeasible?
Fill it in automatically with
o a global constant: e.g., "unknown" (a new class?!)
o the attribute mean
o the attribute mean for all samples belonging to the same class: smarter
o the most probable value: inference-based, such as a Bayesian formula or a decision tree
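The first three automatic strategies above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: missing values are represented as None, and the function names and the small income sample are hypothetical, not taken from the notes. The inference-based strategy is left out because it needs a trained model.

```python
from collections import defaultdict
from statistics import mean


def fill_constant(values, constant="unknown"):
    """Replace every missing entry (None) with a global constant."""
    return [constant if v is None else v for v in values]


def fill_mean(values):
    """Replace missing entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]


def fill_class_mean(values, labels):
    """Replace missing entries with the mean of the observed values
    that share the same class label.

    Assumes every class has at least one observed value.
    """
    by_class = defaultdict(list)
    for v, c in zip(values, labels):
        if v is not None:
            by_class[c].append(v)
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    return [class_means[c] if v is None else v for v, c in zip(values, labels)]


# Hypothetical sample: customer income (in thousands) with two gaps.
income = [30, None, 50, 100, None, 120]
label = ["low", "low", "low", "high", "high", "high"]

print(fill_constant(income))
print(fill_mean(income))
print(fill_class_mean(income, label))
```

On this sample the overall mean fills both gaps with the same value, while the class-conditional mean fills each gap with a value typical of its own class, which is why the notes call it "smarter".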
2.3.2 Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
o faulty data collection instruments
o data entry problems
o data transmission problems
o technology limitation
o inconsistency in naming convention
Other data problems which require data cleaning
o duplicate records
o incomplete data
o inconsistent data
How to Handle Noisy Data?
1. Binning
o Binning methods smooth a sorted data value by consulting its neighborhood, that is, the values around it.
o first sort data and partition into (equal-frequency) bins
o then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
Simple Discretization Methods: Binning
Equal-width (distance) partitioning
o Divides the range into N intervals of equal size: uniform grid
o If A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B - A)/N.
o The most straightforward, but outliers may dominate presentation
o Skewed data is not handled well
Equal-depth (frequency) partitioning
o Divides the range into N intervals, each containing approximately the same number of samples
o Good data scaling
o Managing categorical attributes can be tricky
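The two partitioning schemes can be compared with a small Python sketch. The function names are hypothetical, the input is the price list from the smoothing example in these notes, and the equal-width version assumes the values are not all identical (so that W is non-zero).

```python
def equal_width_bins(values, n):
    """Split values into n intervals of equal width W = (B - A) / n."""
    a, b = min(values), max(values)
    width = (b - a) / n  # assumes b > a
    bins = [[] for _ in range(n)]
    for v in sorted(values):
        # The maximum value would map to index n, so clamp it into the last bin.
        idx = min(int((v - a) / width), n - 1)
        bins[idx].append(v)
    return bins


def equal_depth_bins(values, n):
    """Split sorted values into n bins holding roughly the same number of samples."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n)
    bins, start = [], 0
    for i in range(n):
        # The first `extra` bins take one additional sample each.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

print(equal_width_bins(prices, 3))
print(equal_depth_bins(prices, 3))
```

With three bins, the equal-width intervals are each 10 dollars wide and end up holding 3, 3 and 6 values, while the equal-depth bins hold 4 values each. The uneven counts show how the value distribution affects equal-width partitioning.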
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
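The worked example can be reproduced with a short Python sketch. The function names are hypothetical. Two details are assumptions, since the notes do not state them: bin means are rounded to the nearest integer (22.75 becomes 23 and 29.25 becomes 29), and a value exactly halfway between the two boundaries goes to the lower one.

```python
from statistics import median


def smooth_by_means(bins):
    """Replace every value in a bin with the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_medians(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        # Assumption: ties are resolved toward the lower boundary.
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


# Equal-frequency bins from the price example.
bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

print(smooth_by_means(bins))
print(smooth_by_medians(bins))
print(smooth_by_boundaries(bins))
```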
2. Regression
Data can be smoothed by fitting the data to a function, such as with regression.
Linear regression involves finding the best line to fit two attributes.
Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.
[Figure: Regression Analysis. Plot of y against x showing the line y = x + 1, with labels X1 and Y1.]
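As a rough illustration of smoothing by regression, the sketch below fits a least-squares line to two attributes and replaces each observed y with the value on the line. The function names and the five sample points are made up for illustration (they are scattered around y = x + 1) and do not come from the notes.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)  # assumes xs are not all equal
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_by_regression(xs, ys):
    """Replace each y with the fitted value at the same x."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Hypothetical noisy observations near the line y = x + 1.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
print(smooth_by_regression(xs, ys))
```

For these points the fitted slope is about 0.97 and the intercept about 1.09, so the smoothed values lie close to y = x + 1.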
3. Clustering
Outliers may be detected by clustering, where similar values are organized into groups, or clusters.
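The idea can be sketched for a single numeric attribute. The notes do not name a clustering algorithm, so the code below uses a deliberately simple stand-in: sorted values are grouped whenever neighbours are at most max_gap apart, and values in groups smaller than min_size are reported as outliers. The function names, both parameters and the sample readings are hypothetical.

```python
def cluster_1d(values, max_gap):
    """Group sorted values; a gap larger than max_gap starts a new cluster."""
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def find_outliers(values, max_gap, min_size=2):
    """Return values that fall in clusters with fewer than min_size members."""
    return [
        v
        for cluster in cluster_1d(values, max_gap)
        if len(cluster) < min_size
        for v in cluster
    ]


# Hypothetical readings: two dense groups and one isolated value.
readings = [10, 11, 12, 13, 50, 51, 52, 53, 200]

print(cluster_1d(readings, max_gap=5))
print(find_outliers(readings, max_gap=5))
```

The isolated reading 200 forms a cluster of its own and is the only value flagged.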
2.3.3 Data Cleaning as a Process
Data discrepancy detection
o Use metadata (e.g., domain, range, dependency, distribution)
o Check field overloading
o Check uniqueness rule, consecutive rule and null rule
o Use commercial tools
  - Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
  - Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)
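A few of these checks are easy to express directly. The sketch below covers a range check from metadata and the uniqueness, consecutive and null rules. It assumes the usual reading of those rules: a unique rule forbids repeated values, a consecutive rule additionally forbids gaps between the lowest and highest value, and a null rule lists the markers that stand for a missing value. The function names, the set of null markers and the sample data are hypothetical.

```python
NULL_MARKERS = {"", "?", "N/A"}  # assumed markers; a real null rule is domain-specific


def out_of_range(values, lo, hi):
    """Metadata range check: return values outside [lo, hi]."""
    return [v for v in values if not lo <= v <= hi]


def duplicate_values(values):
    """Uniqueness rule: return values that occur more than once."""
    seen, dups = set(), set()
    for v in values:
        if v in seen:
            dups.add(v)
        seen.add(v)
    return sorted(dups)


def missing_in_sequence(values):
    """Consecutive rule: return integers absent between min and max."""
    present = set(values)
    return [v for v in range(min(values), max(values) + 1) if v not in present]


def null_positions(values, markers=NULL_MARKERS):
    """Null rule: return positions holding None or a null marker."""
    return [i for i, v in enumerate(values) if v is None or v in markers]


# Hypothetical data for illustration.
ages = [25, -3, 140, 60]
check_numbers = [101, 102, 102, 104, 106]
postal_codes = ["90210", "?", None, ""]

print(out_of_range(ages, 0, 120))
print(duplicate_values(check_numbers))
print(missing_in_sequence(check_numbers))
print(null_positions(postal_codes))
```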
Data migration and integration
o Data migration tools: allow transformations to be specified
o ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
Integration of the two processes
Iterative and interactive (e.g., Potter's Wheel)