Data Mining:
Concepts and Techniques
               — Chapter 3 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
          Simon Fraser University
 ©2011 Han, Kamber & Pei. All rights reserved.
            Chapter 3: Data Preprocessing
   Data Preprocessing: An Overview
       Data Quality
       Major Tasks in Data Preprocessing
   Data Cleaning
   Data Reduction
   Data Transformation and Data Discretization
     Data Quality: Why Preprocess the Data?
   Measures for data quality: A multidimensional view
       Accuracy: accurate or noisy (containing errors, or values
        that deviate from the expected)
       Completeness: not recorded (lacking attribute values or
        certain attributes of interest …)
       Consistency: e.g. discrepancy in the department codes used
        to categorize items
       Timeliness: timely update?
       Believability: how much the data are trusted by users
       Interpretability: how easily the data can be understood?
            Major Tasks in Data Preprocessing
   Data cleaning
       Fill in missing values, smooth noisy data, identify or remove
        outliers, and resolve inconsistencies
   Data integration
       Integration of multiple databases, data cubes, or files
   Data reduction
       Dimensionality reduction
       Numerosity reduction (e.g. sampling)
   Data transformation and data discretization
       Normalization
       …
                           Data Cleaning
   Data in the Real World Is Dirty: Lots of potentially incorrect data,
    e.g., faulty instruments, human or computer error, transmission error
        incomplete: lacking feature values, lacking certain features of
         interest, or containing only aggregate data
             e.g., Occupation=“ ” (missing data)
        noisy: containing noise, errors, or outliers
             e.g., Salary=“−10” (an error)
        inconsistent: containing discrepancies in codes or names, e.g.,
             Age=“42”, Birthday=“03/07/2010”
             Was rating “1, 2, 3”, now rating “A, B, C”
             discrepancy between duplicate records
        Intentional (e.g., disguised missing data)
             Jan. 1 as everyone’s birthday?
            Incomplete (Missing) Data
   Data is not always available
       E.g., many tuples have no recorded value for several
        features, such as customer income in sales data
   Missing data may be due to
       equipment malfunction
       data inconsistent with other recorded data and thus deleted
       data not entered due to misunderstanding
       certain data not considered important at the
        time of entry
       history or changes of the data not registered
   Missing data may need to be inferred
          How to Handle Missing Data?
   Ignore the tuple: usually done when class label is missing
    (when doing classification)—not effective when the % of
    missing values per feature varies considerably
   Fill in the missing value manually: tedious + infeasible?
   Fill it in automatically with
       a global constant : e.g., “unknown”, a new class?!
       the feature mean
       the feature mean for all samples belonging to the same
        class: smarter
       the most probable value: inference-based such as
        Bayesian formula or decision tree
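
The two mean-based fill strategies listed above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the record layout (a list of dicts with `None` marking a missing value), the field names `income` and `risk`, and both helper functions are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def fill_with_feature_mean(rows, feature):
    """Replace None in `feature` with the mean of all observed values."""
    observed = [r[feature] for r in rows if r[feature] is not None]
    fill = mean(observed)
    return [dict(r, **{feature: fill if r[feature] is None else r[feature]})
            for r in rows]


def fill_with_class_mean(rows, feature, label):
    """Replace None in `feature` with the mean over rows of the same class.

    Assumes every class has at least one observed value for `feature`.
    """
    by_class = defaultdict(list)
    for r in rows:
        if r[feature] is not None:
            by_class[r[label]].append(r[feature])
    class_mean = {c: mean(vals) for c, vals in by_class.items()}
    return [dict(r, **{feature: class_mean[r[label]]
                       if r[feature] is None else r[feature]})
            for r in rows]


# Hypothetical sales records; None marks a missing income.
customers = [
    {"income": 30000, "risk": "low"},
    {"income": 50000, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 90000, "risk": "high"},
    {"income": None, "risk": "high"},
]
global_filled = fill_with_feature_mean(customers, "income")
class_filled = fill_with_class_mean(customers, "income", "risk")
```

In this toy data the global mean pulls both missing incomes toward the same value, while the class-conditional mean keeps the "low" and "high" groups apart, which is why the slide calls the second option smarter.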
                      Noisy Data
   Noise: random error or variance in a measured variable
   Incorrect feature values may be due to
      faulty data collection instruments
      data entry problems
      data transmission problems
      technology limitation
      inconsistency in naming convention
   Other data problems which require data cleaning
      duplicate records
      incomplete data
      inconsistent data
              How to Handle Noisy Data?
   Binning
       First sort data and partition into (equal-frequency) bins
       Then one can smooth by bin means, smooth by bin median,
        smooth by bin boundaries, etc.
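
A small sketch of the binning steps above: sort, partition into equal-frequency bins, then smooth by bin means or by bin boundaries. The function names and the price values are illustrative assumptions, and ties in boundary smoothing are resolved toward the lower boundary by choice.

```python
from statistics import mean


def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def smooth_by_means(bins):
    """Replace every value in a bin with the mean of that bin."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are min and max
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative data
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_bounds = smooth_by_boundaries(bins)
```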
       How to Handle Noisy Data (cont.)
   Regression
      smooth by fitting the data to regression functions
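
As one possible reading of regression smoothing, the sketch below fits a straight line by ordinary least squares and replaces each observed value with the fitted value. The choice of simple linear regression, the helper names, and the data are assumptions for illustration; other regression functions could be used the same way.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_by_regression(xs, ys):
    """Replace each observed y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


xs = [1, 2, 3, 4, 5]                 # illustrative predictor values
ys = [2.1, 3.9, 6.2, 7.8, 10.0]      # illustrative noisy measurements
slope, intercept = fit_line(xs, ys)
smoothed = smooth_by_regression(xs, ys)
```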
       How to Handle Noisy Data (cont.)
   Clustering
      detect and remove outliers
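
The slide does not fix a clustering method, so the following is only a sketch under stated assumptions: one-dimensional values are grouped whenever consecutive sorted values lie within `max_gap` of each other, and clusters with fewer than `min_size` members are treated as outliers. Both parameters and both function names are hypothetical.

```python
def gap_clusters(values, max_gap):
    """Group sorted 1-D values; a gap larger than max_gap starts a new cluster.

    Assumes `values` is non-empty.
    """
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def remove_outliers(values, max_gap, min_size):
    """Keep values from clusters with at least min_size members."""
    kept, outliers = [], []
    for cluster in gap_clusters(values, max_gap):
        (kept if len(cluster) >= min_size else outliers).extend(cluster)
    return kept, outliers


salaries = [41, 43, 44, 45, 47, 90, 92, 93, 95, 400]  # illustrative data
kept, outliers = remove_outliers(salaries, max_gap=5, min_size=2)
```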
              Data Cleaning as a Process
   Data discrepancy detection
      Use metadata (e.g., domain, range, dependency, distribution)
      Check field overloading
      Check uniqueness rule, consecutive rule and null rule
      Use commercial tools
          Data scrubbing: use simple domain knowledge (e.g., postal
           code, spell-check) to detect errors and make corrections
          Data auditing: analyze the data to discover rules and
           relationships and to detect violators (e.g., correlation and clustering
           to find outliers)
   Data migration and integration
      Data migration tools: allow transformations to be specified
      ETL (Extraction/Transformation/Loading) tools: allow users to
        specify transformations through a graphical user interface
   Integration of the two processes
      Iterative and interactive (e.g., Potter’s Wheel)
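
Some of the discrepancy checks named above (uniqueness rule, null rule, range metadata) can be sketched as simple record scans. The record layout, the field names, the set of null markers, and the salary range are assumptions for illustration and do not come from any particular tool.

```python
def check_unique(records, field):
    """Uniqueness rule: return the values of `field` that occur more than once."""
    seen, duplicates = set(), set()
    for r in records:
        value = r.get(field)
        if value in seen:
            duplicates.add(value)
        seen.add(value)
    return duplicates


def check_null(records, field, null_markers=(None, "", "?")):
    """Null rule: return indexes of records whose `field` is a null marker."""
    return [i for i, r in enumerate(records) if r.get(field) in null_markers]


def check_range(records, field, low, high):
    """Range metadata: return indexes whose `field` lies outside [low, high]."""
    return [i for i, r in enumerate(records)
            if r.get(field) is not None and not (low <= r[field] <= high)]


# Hypothetical records with a duplicate id, a blank occupation, a negative salary.
records = [
    {"id": 1, "occupation": "clerk", "salary": 3200},
    {"id": 2, "occupation": "", "salary": -10},
    {"id": 2, "occupation": "nurse", "salary": 4100},
]
duplicate_ids = check_unique(records, "id")
blank_occupations = check_null(records, "occupation")
bad_salaries = check_range(records, "salary", 0, 1_000_000)
```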
                  Feature Engineering
   Feature Extraction / Construction aims to reduce the number
    of features in a dataset by creating new features from the existing
    ones (and then discarding the original features).
      e.g. PCA
   Feature Selection: Instead of creating new features, Feature
    Selection focuses on choosing a subset of the existing features
    that contribute most significantly to the problem.
   This process eliminates irrelevant or redundant features while
    preserving the important ones.
      e.g. Feature Subset Selection
   Feature Creation / Generation: Create new features that can
    capture the important information in a data set more effectively
    than the original ones.
               Data Reduction Strategies
   Data reduction: Obtain a reduced representation of the data set that
    is much smaller in volume yet produces the same (or almost the
    same) analytical results
   Why data reduction? — A database/data warehouse may store
    terabytes of data. Complex data analysis may take a very long time to
    run on the complete data set.
   Data reduction strategies
      Dimensionality reduction, e.g., remove unimportant features
          Principal Components Analysis (PCA)
          Feature subset selection, feature creation
      Numerosity reduction (some simply call it: Data Reduction)
          Regression and Log-Linear Models
          Histograms, clustering, sampling
          Data cube aggregation
      Data compression
                Dimensionality Reduction
   Curse of dimensionality
       When dimensionality increases, data becomes increasingly sparse
       Density and distance between points, which are critical to clustering and
        outlier analysis, become less meaningful
       The possible combinations of subspaces will grow exponentially
   Dimensionality reduction
       Avoid the curse of dimensionality
       Help eliminate irrelevant features and reduce noise
       Reduce time and space required in data mining
       Allow easier visualization
   Dimensionality reduction techniques
       Principal Component Analysis
       Supervised and nonlinear techniques (e.g., feature selection)
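
To make the curse of dimensionality concrete, the sketch below measures how much the nearest and farthest neighbours of a random query point differ as the dimensionality grows, and counts the non-empty feature subsets of d features (2^d - 1). The uniform random data, the "relative contrast" measure, and all names are assumptions chosen for illustration.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / min over distances from a random query to random points.

    Points are drawn uniformly from the unit hypercube; a small value means
    the nearest and farthest points are almost equally far away.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


def count_feature_subsets(dim):
    """Number of non-empty subsets of `dim` features."""
    return 2 ** dim - 1


contrast_low = relative_contrast(dim=2)
contrast_high = relative_contrast(dim=500)
subsets_10 = count_feature_subsets(10)
```

With these settings the contrast in 500 dimensions should come out far smaller than in 2 dimensions, which is the sense in which distance "becomes less meaningful".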
               Feature Subset Selection
   Another way to reduce dimensionality of data
   Redundant features
       Duplicate much or all of the information contained in
        one or more other features
       E.g., purchase price of a product and the amount of
        sales tax paid
   Irrelevant features
       Contain no information that is useful for the data
        mining task at hand
       E.g., students' ID is often irrelevant to the task of
        predicting students' GPA
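
One simple way to flag redundant features, such as price and sales tax, is to look for pairs of columns with very high absolute Pearson correlation. This is a sketch under assumptions: the 0.95 threshold, the 8% tax rate, the column names, and the helper functions are all illustrative, and correlation only catches linear redundancy.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient (assumes neither column is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def redundant_pairs(columns, threshold=0.95):
    """Return feature-name pairs whose |correlation| reaches the threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(columns[a], columns[b])) >= threshold:
                pairs.append((a, b))
    return pairs


columns = {
    "price": [10.0, 25.0, 40.0, 55.0, 80.0],
    "sales_tax": [0.8, 2.0, 3.2, 4.4, 6.4],  # 8% of price (hypothetical rate)
    "units_sold": [7, 3, 9, 4, 6],
}
pairs = redundant_pairs(columns)
```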
                         Clustering
   Partition data set into clusters based on similarity, and
    store cluster representation (e.g., centroid and diameter)
    only
   Can be very effective if data is clustered but not if data
    is “smeared”
   Can have hierarchical clustering and be stored in multi-
    dimensional index tree structures
   There are many choices of clustering definitions and
    clustering algorithms
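
Storing only a cluster representation can be sketched as below, assuming the points have already been assigned to clusters by some algorithm. The summary keeps the centroid, the diameter (taken here as the largest pairwise distance within the cluster), and the member count; the function name and the sample points are hypothetical.

```python
import math
from collections import defaultdict


def summarize_clusters(points, labels):
    """Return {label: (centroid, diameter, size)} for pre-clustered points."""
    groups = defaultdict(list)
    for p, lab in zip(points, labels):
        groups[lab].append(p)
    summary = {}
    for lab, members in groups.items():
        dims = len(members[0])
        centroid = tuple(sum(m[d] for m in members) / len(members)
                         for d in range(dims))
        # Diameter: maximum distance between any two members of the cluster.
        diameter = max(math.dist(a, b) for a in members for b in members)
        summary[lab] = (centroid, diameter, len(members))
    return summary


points = [(0, 0), (0, 2), (2, 0), (2, 2), (10, 10), (10, 14)]
labels = ["a", "a", "a", "a", "b", "b"]  # assumed output of a clustering step
summary = summarize_clusters(points, labels)
```

Six points are reduced to two summaries here; the saving grows with cluster size, but only if the data really does form tight clusters.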
                         Sampling
   Sampling: obtaining a small sample s to represent the
    whole data set N
   Allow a mining algorithm to run in complexity that is
    potentially sub-linear in the size of the data
   Key principle: Choose a representative subset of the data
       Simple random sampling may have very poor
        performance in the presence of skew
       Develop adaptive sampling methods, e.g., stratified
        sampling
   Note: Sampling may not reduce database I/Os (page at a
    time)
                   Types of Sampling
   Simple random sampling
      There is an equal probability of selecting any particular
       item
   Sampling without replacement
      Once an object is selected, it is removed from the
       population
   Sampling with replacement
      A selected object is not removed from the population
   Stratified sampling:
      Partition the data set, and draw samples from each
       partition (proportionally, i.e., approximately the same
       percentage of the data)
      Used in conjunction with skewed data
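
The sampling types above map directly onto Python's standard `random` module: `sample` draws without replacement and `choices` draws with replacement. The stratified helper, the 10% fraction, the rule of taking at least one item per stratum, and the skewed customer data are assumptions for illustration.

```python
import random
from collections import defaultdict


def srs_without_replacement(data, n, rng):
    """Simple random sample; a selected item cannot be drawn again."""
    return rng.sample(data, n)


def srs_with_replacement(data, n, rng):
    """Simple random sample; the same item may be drawn more than once."""
    return rng.choices(data, k=n)


def stratified_sample(data, key, fraction, rng):
    """Draw roughly the same fraction from every stratum (at least one item)."""
    strata = defaultdict(list)
    for item in data:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample


rng = random.Random(42)  # fixed seed so the sketch is repeatable
# Skewed data: 90 "young" customers and only 10 "senior" customers.
customers = [("young", i) for i in range(90)] + [("senior", i) for i in range(10)]
swor = srs_without_replacement(customers, 10, rng)
swr = srs_with_replacement(customers, 10, rng)
strat = stratified_sample(customers, key=lambda c: c[0], fraction=0.1, rng=rng)
```

A simple random sample of 10 may contain no seniors at all, whereas the stratified sample is built to represent the small group.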
[Figure: Sampling with or without replacement (panel: Raw Data)]
[Figure: Cluster or stratified sampling (panels: Raw Data, Cluster/Stratified Sample)]
                     Data Transformation
   A function that maps the entire set of values of a given attribute to a
    new set of replacement values s.t. each old value can be identified
    with one of the new values
   Methods
        Smoothing: Remove noise from data
        Attribute/feature construction
             New attributes constructed from the given ones
        Aggregation: Summarization, data cube construction
        Normalization: Scaled to fall within a smaller, specified range
             min-max normalization
             z-score normalization
             normalization by decimal scaling
        Discretization: Concept hierarchy climbing
                           Normalization
   min-max normalization: to [new_min_A, new_max_A]
        \[ v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A \]
       Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0].
        Then $73,600 is mapped to
        \[ \frac{73600 - 12000}{98000 - 12000}\,(1.0 - 0) + 0 = 0.716 \]
   z-score normalization (μ: mean, σ: standard deviation of attribute A):
        \[ v' = \frac{v - \mu_A}{\sigma_A} \]
       Ex. Let μ = 54000, σ = 16000. Then
        \[ \frac{73600 - 54000}{16000} = 1.225 \]
   Normalization by decimal scaling:
        \[ v' = \frac{v}{10^{j}} \]
       where j is the smallest integer such that max(|v'|) < 1
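
The three normalization methods on this slide can be sketched as small functions and checked against the slide's income example. The function names are hypothetical, the decimal-scaling helper searches only non-negative j (an assumption that suits values with magnitude of at least 1), and the values passed to it are illustrative.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mean_a, std_a):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean_a) / std_a


def decimal_scaling(values):
    """Divide by 10^j, with j the smallest j >= 0 such that max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


income_min_max = min_max(73600, 12000, 98000)   # slide example, about 0.716
income_z = z_score(73600, 54000, 16000)         # slide example, 1.225
scaled, j = decimal_scaling([-986, 917])        # illustrative values
```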