Data Preprocessing in Data Analytics.ppt
1
Data Preprocessing
 Data Preprocessing: An Overview
 Data Quality
 Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
 Summary
2
Data Cleaning
 Data in the real world is dirty: lots of potentially incorrect data,
e.g., from faulty instruments, human or computer error, or transmission error
 incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
   e.g., Occupation=“ ” (missing data)
 noisy: containing noise, errors, or outliers
   e.g., Salary=“−10” (an error)
 inconsistent: containing discrepancies in codes or names, e.g.,
   Age=“42”, Birthday=“03/07/2010”
   Was rating “1, 2, 3”, now rating “A, B, C”
   discrepancy between duplicate records
 intentional (e.g., disguised missing data)
   Jan. 1 as everyone’s birthday?
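The three kinds of dirty data listed on this slide can be illustrated with a few rule-based checks. This is only a minimal sketch: the record layout, the field names, and the reference date used to derive an age from a birthday are all assumptions made for the example.

```python
from datetime import date

# Hypothetical records; field names and values are illustrative only.
records = [
    {"occupation": "", "salary": 52000, "age": 42, "birthday": date(2010, 3, 7)},
    {"occupation": "engineer", "salary": -10, "age": 30, "birthday": date(1994, 1, 1)},
    {"occupation": "teacher", "salary": 48000, "age": 35, "birthday": date(1989, 6, 15)},
]

def age_on(birthday, today):
    """Whole years elapsed between birthday and today."""
    before_birthday = (today.month, today.day) < (birthday.month, birthday.day)
    return today.year - birthday.year - before_birthday

def find_problems(record, today):
    """Return data-quality labels (incomplete / noisy / inconsistent) for one record."""
    problems = []
    if not record["occupation"].strip():
        problems.append("incomplete: occupation missing")
    if record["salary"] < 0:
        problems.append("noisy: negative salary")
    if age_on(record["birthday"], today) != record["age"]:
        problems.append("inconsistent: age does not match birthday")
    return problems

today = date(2024, 6, 30)  # assumed reference date for the age check
report = [find_problems(r, today) for r in records]
```

Disguised missing data (such as a default Jan. 1 birthday) cannot be caught by a per-record rule like these; it usually shows up only as a suspicious spike when the attribute's distribution is inspected.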
3
Incomplete (Missing) Data
 Data is not always available
   E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
 Missing data may be due to
   equipment malfunction
   data that was inconsistent with other recorded data and thus deleted
   data not entered due to misunderstanding
   certain data not being considered important at the time of entry
   history or changes of the data not being recorded
 Missing data may need to be inferred
4
How to Handle Missing Data?
 Ignore the tuple: usually done when the class label is missing
(when doing classification); not effective when the % of
missing values per attribute varies considerably
 Fill in the missing value manually: tedious and often infeasible
 Fill it in automatically with
   a global constant, e.g., “unknown” (which may effectively become a new class)
   the attribute mean
   the attribute mean for all samples belonging to the same class: smarter
   the most probable value: inference-based, such as a Bayesian formula or decision tree
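The first three automatic fill-in strategies can be sketched in a few lines of plain Python. The sample data, the class labels, and the function names below are hypothetical; the inference-based strategy is not shown because it needs a trained model.

```python
from statistics import mean

# Hypothetical (class label, income) samples; None marks a missing value.
samples = [
    ("low_risk", 60.0),
    ("low_risk", None),
    ("low_risk", 80.0),
    ("high_risk", 20.0),
    ("high_risk", 40.0),
    ("high_risk", None),
]

def fill_constant(rows, constant="unknown"):
    """Replace every missing value with one global constant."""
    return [(label, constant if v is None else v) for label, v in rows]

def fill_attribute_mean(rows):
    """Replace missing values with the mean of all observed values."""
    overall = mean(v for _, v in rows if v is not None)
    return [(label, overall if v is None else v) for label, v in rows]

def fill_class_mean(rows):
    """Replace missing values with the mean of observed values in the same class.

    Assumes every class has at least one observed value.
    """
    by_class = {}
    for label, v in rows:
        if v is not None:
            by_class.setdefault(label, []).append(v)
    class_means = {label: mean(vals) for label, vals in by_class.items()}
    return [(label, class_means[label] if v is None else v) for label, v in rows]
```

On this toy data the attribute mean fills both gaps with 50.0, while the class-conditional mean fills them with 70.0 and 30.0, which is why the slide calls the latter "smarter".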
5
Noisy Data
 Noise: random error or variance in a measured variable
 Incorrect attribute values may be due to
   faulty data collection instruments
   data entry problems
   data transmission problems
   technology limitations
   inconsistency in naming conventions
 Other data problems that require data cleaning
   duplicate records
   incomplete data
   inconsistent data
6
How to Handle Noisy Data?
 Binning
   first sort the data and partition it into (equal-frequency) bins
   then smooth by bin means, bin medians, bin boundaries, etc.
 Regression
   smooth by fitting the data to regression functions
 Clustering
   detect and remove outliers
 Combined computer and human inspection
   detect suspicious values and have a human check them (e.g., deal with possible outliers)
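Binning is the most mechanical of these techniques, so a small sketch may help. The price values are illustrative, the function names are hypothetical, and the code assumes there are at least as many values as bins.

```python
from statistics import mean

def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative values
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)            # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
by_bounds = smooth_by_boundaries(bins)      # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```

Smoothing by bin medians follows the same pattern with `statistics.median` in place of `mean`.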
7
Data Cleaning as a Process
 Data discrepancy detection
   Use metadata (e.g., domain, range, dependency, distribution)
   Check field overloading
   Check the uniqueness rule, consecutive rule, and null rule
   Use commercial tools
     Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
     Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)
 Data migration and integration
   Data migration tools: allow transformations to be specified
   ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
 Integration of the two processes
   Iterative and interactive (e.g., Potter’s Wheel)
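The uniqueness, consecutive, and null rules mentioned under discrepancy detection can each be expressed as a simple check. The sketch below is one possible reading of those rules; the function names, the sample columns, and the set of null markers are assumptions, and the consecutive check assumes integer values.

```python
def unique_rule_violations(values):
    """Unique rule: every value must differ from all others. Returns repeated values."""
    seen, duplicates = set(), set()
    for v in values:
        if v in seen:
            duplicates.add(v)
        seen.add(v)
    return sorted(duplicates)

def consecutive_rule_violations(values):
    """Consecutive rule: no gaps between the lowest and highest value. Returns the gaps."""
    present = set(values)
    return [n for n in range(min(present), max(present) + 1) if n not in present]

def null_rule_violations(values, null_markers=("", "?", "N/A")):
    """Null rule: return positions holding a marker that stands for 'no value'."""
    return [i for i, v in enumerate(values)
            if v is None or str(v).strip() in null_markers]

check_numbers = [1001, 1002, 1002, 1005]   # hypothetical column
postal_codes = ["94105", "?", "10001", ""]  # hypothetical column

repeated = unique_rule_violations(check_numbers)      # [1002]
gaps = consecutive_rule_violations(check_numbers)     # [1003, 1004]
nulls = null_rule_violations(postal_codes)            # [1, 3]
```

In practice the null markers would come from the attribute's metadata rather than being hard-coded.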
8
Chapter 3: Data Preprocessing
 Data Preprocessing: An Overview
 Data Quality
 Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
 Summary
9
Data Integration
 Data integration:
   Combines data from multiple sources into a coherent store
 Schema integration: e.g., A.cust-id ≡ B.cust-#
   Integrate metadata from different sources
 Entity identification problem:
   Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
 Detecting and resolving data value conflicts
   For the same real-world entity, attribute values from different sources are different
   Possible reasons: different representations, different scales, e.g., metric vs. British units
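A toy example can tie the three integration problems together: mapping A.cust-id and B.cust-# onto one key (schema integration), normalizing a nickname (entity identification), and converting pounds to kilograms (value conflict). Everything here is a sketch under stated assumptions: the two source tables, the nickname lookup, the weight attribute, and the function names are all made up for illustration, and names are assumed to be non-empty.

```python
LB_PER_KG = 2.20462  # approximate number of pounds in one kilogram

# Hypothetical sources: A keys customers by "cust-id" and stores kilograms,
# B keys them by "cust-#" and stores pounds.
source_a = [{"cust-id": 1, "name": "William Clinton", "weight_kg": 80.0}]
source_b = [{"cust-#": 1, "name": "Bill Clinton", "weight_lb": 176.4}]

NICKNAMES = {"bill": "william"}  # toy lookup for entity identification

def canonical_name(name):
    """Lower-case the name and expand a known nickname in the first token."""
    first, *rest = name.lower().split()
    return " ".join([NICKNAMES.get(first, first), *rest])

def from_a(row):
    """Map a source-A row onto the unified schema."""
    return {"cust_id": row["cust-id"],
            "name": canonical_name(row["name"]),
            "weight_kg": row["weight_kg"]}

def from_b(row):
    """Map a source-B row onto the unified schema, converting pounds to kilograms."""
    return {"cust_id": row["cust-#"],
            "name": canonical_name(row["name"]),
            "weight_kg": round(row["weight_lb"] / LB_PER_KG, 1)}

unified = [from_a(r) for r in source_a] + [from_b(r) for r in source_b]
same_entity = unified[0] == unified[1]
```

After mapping, the two rows are identical, so they can be recognized as one real-world entity. Real entity resolution is rarely this clean and usually needs fuzzy matching.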
10
Handling Redundancy in Data Integration
 Redundant data often occur when multiple databases are integrated
   Object identification: the same attribute or object may have different names in different databases
   Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue
 Redundant attributes may be detected by correlation analysis and covariance analysis
 Careful integration of the data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality
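For numeric attributes, correlation and covariance analysis can be sketched directly from the definitions. The attribute names and values below are hypothetical, and the code assumes equally long sequences with non-zero variance.

```python
from math import sqrt

def covariance(xs, ys):
    """Population covariance of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def pearson(xs, ys):
    """Pearson correlation coefficient, a value in [-1, 1]."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical attributes from two integrated tables.
monthly_revenue = [10.0, 12.0, 9.0, 15.0]
annual_revenue = [12 * m for m in monthly_revenue]  # derivable from monthly_revenue
store_size = [3.0, 1.0, 4.0, 2.0]

r_derived = pearson(monthly_revenue, annual_revenue)  # approximately 1.0
r_other = pearson(monthly_revenue, store_size)        # clearly below 1 in magnitude
```

A coefficient close to +1 or -1, as for the derived annual revenue, suggests that one of the two attributes is redundant. A high correlation does not imply causation, so the decision to drop an attribute still needs domain judgment.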
