Data Preprocessing
What is Data?
Collection of data objects and their attributes. The columns of the table are the attributes; the rows are the objects.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Types of Attributes
➢ There are different types of attributes
➢ Nominal
➢ Examples: ID numbers, eye color, zip codes
➢ Ordinal
➢ Examples: rankings (e.g., taste of potato chips on a scale
from 1-10), grades, height in {tall, medium, short}
➢ Interval
➢ Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
➢ Ratio
➢ Examples: temperature in Kelvin, length, time, counts
Properties of Attribute Values
➢ The type of an attribute depends on which of the
following properties it possesses:
➢ Distinctness: =, ≠
➢ Order: <, >
➢ Addition: +, −
➢ Multiplication: *, /
➢ Nominal attribute: distinctness
➢ Ordinal attribute: distinctness & order
➢ Interval attribute: distinctness, order & addition
➢ Ratio attribute: all 4 properties
Discrete, Continuous & Asymmetric Attributes
➢ Discrete Attribute
➢ Has only a finite or countably infinite set of values
➢ Ex: zip codes, counts, or the set of words in a collection of documents
➢ Often represented as integer variables (nominal, ordinal, and binary attributes)
➢ Continuous Attribute
➢ Has real numbers as attribute values
➢ Interval and ratio attributes
➢ Ex: temperature, height, or weight
➢ Asymmetric Attribute
➢ Only presence is regarded as important
➢ Ex: If students are compared on the basis of the courses they do not take,
then most students would seem very similar
Step 1: To describe the dataset
What do your records represent?
What does each attribute mean?
What type of attributes?
• Categorical
• Numerical
  • Discrete
  • Continuous
• Binary – Asymmetric
Step 2: To explore the dataset
➢ Preliminary investigation of the data to better
understand its specific characteristics
➢ It can help to answer some of the data mining questions
➢ To help in selecting pre-processing tools
➢ To help in selecting appropriate data mining algorithms
➢ Things to look at
➢ Class balance
➢ Dispersion of data attribute values
➢ Skewness, outliers, missing values
➢ Attributes that vary together
➢ Visualization tools (histograms, scatter plots) are important
Useful Statistics
➢ Discrete attributes
➢ Frequency of each value
➢ Mode = value with highest frequency
➢ Continuous attributes
➢ Range of values, i.e. min and max
➢ Mean (average)
➢ Sensitive to outliers
➢ Median
➢ Better indication of the "middle" of a set of values in a skewed distribution
➢ Skewed distribution
➢ mean and median are quite different (see the sketch below)
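As a quick illustration, here is a minimal sketch of these statistics using Python's standard library (the income values are invented for the example):

```python
from statistics import mean, median, mode

# Invented attribute values with one extreme outlier
income = [60, 70, 75, 85, 90, 90, 95, 100, 120, 125, 10000]

print(mode(income))    # 90     -> most frequent value (discrete attributes)
print(mean(income))    # ~991.8 -> the mean is pulled far up by the outlier
print(median(income))  # 90     -> robust "middle" of the skewed values
```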
Skewed Distributions of Attribute Values
Dispersion of Data
➢ How spread out are the values of an attribute?
➢ Variance
➢ Variance is sensitive to outliers
$$\mathrm{variance}(X) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
➢ What if the distribution of values is multimodal, i.e.
data has several bumps?
➢ Visualization tools are useful
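To make the outlier sensitivity concrete, here is a minimal NumPy sketch of the variance formula above (the values are invented):

```python
import numpy as np

x = np.array([4.0, 8.0, 9.0, 15.0, 21.0])

# Population variance: mean of the squared deviations from the mean
var = np.mean((x - x.mean()) ** 2)
print(var)        # 35.44
print(np.var(x))  # same value; np.var divides by n by default

# A single outlier inflates the variance dramatically
print(np.var(np.append(x, 200.0)))  # ~4970
```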
Attributes that Vary Together
➢ Correlation is a measure that describes how two attributes vary together
$$\mathrm{corr}(x, y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
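A minimal NumPy sketch of this formula (the x and y values are invented):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# Pearson correlation, computed exactly as in the formula above
num = np.sum((x - x.mean()) * (y - y.mean()))
den = np.sqrt(np.sum((x - x.mean()) ** 2)) * np.sqrt(np.sum((y - y.mean()) ** 2))
print(num / den)                 # ~0.85
print(np.corrcoef(x, y)[0, 1])   # same value from NumPy's built-in
```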
Data Quality
➢ Examples of data quality problems:
➢ Noise and outliers
➢ Missing values
➢ Duplicate data

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        10000K          Yes   ← a mistake or a millionaire?
6    No      NULL            60K             No    ← missing value
7    Yes     Divorced        220K            NULL  ← missing value
8    No      Single          85K             Yes
9    No      Married         90K             No
9    No      Single          90K             No    ← inconsistent duplicate entries
Missing Values
➢ Reasons for missing values
➢ Information is not collected
(e.g., people decline to give their age and weight)
➢ Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
➢ Handling missing values (see the sketch below)
➢ Eliminate Data Objects
➢ Estimate Missing Values
➢ Ignore the Missing Value During Analysis
➢ Replace with all possible values (weighted by their probabilities)
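A minimal pandas sketch of the first three handling strategies (the toy values loosely echo the table above):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "marital_status": ["Single", None, "Divorced"],
    "income": [125_000, 60_000, np.nan],
})

# Eliminate data objects that have any missing value
print(df.dropna())

# Estimate the missing value, e.g. with the attribute mean
df["income"] = df["income"].fillna(df["income"].mean())

# Ignore during analysis: pandas aggregations skip NaN by default
print(df["marital_status"].value_counts())
```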
Outliers
➢ Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set
➢ Can help to
➢ detect new phenomena
➢ discover unusual behavior in data
➢ detect problems
How to Handle Noisy Data?
➢ Binning method:
➢ first sort data and partition into (equi-depth) bins
➢ then smooth by bin means, smooth by bin median, smooth by
bin boundaries, etc.
➢ Clustering
➢ detect and remove outliers
➢ Combined computer and human inspection
➢ detect suspicious values and check by human
➢ Regression
➢ smooth by fitting the data to regression functions
Discretization
Divide the range of a continuous attribute into intervals
➢ Interval labels can be used to replace actual data
values.
➢ Reduce data size by discretization
➢ Some data mining algorithms only work with discrete attributes
➢ e.g., Apriori for association rule mining (ARM)
Binning (Equal-width)
➢ Equal-width (distance) partitioning
➢ Divide the attribute values x into k equally sized bins
➢ If xmin ≤ x ≤ xmax then the bin width δ is given by
$$\delta = \frac{x_{max} - x_{min}}{k}$$
➢ Disadvantages:
➢ outliers may dominate the presentation
➢ skewed data is not handled well
Binning (Equal-frequency)
➢ Equal-depth (frequency) partitioning:
➢ Divides the range into N intervals, each containing approximately the same number of samples
➢ Good data scaling
➢ Disadvantage:
➢ Many occurrences of the same continuous value could
cause the values to be assigned into different bins
➢ Managing categorical attributes can be tricky.
Binning Example
Attribute values (for one attribute e.g., age):
• 0, 4, 12, 16, 16, 18, 24, 26, 28
Equi-width binning, with a bin width of e.g. 10:
• Bin 1: 0, 4             [−∞, 10) bin
• Bin 2: 12, 16, 16, 18   [10, 20) bin
• Bin 3: 24, 26, 28       [20, +∞) bin

Equi-frequency binning, with a bin density of e.g. 3:
• Bin 1: 0, 4, 12    [−∞, 14) bin
• Bin 2: 16, 16, 18  [14, 21) bin
• Bin 3: 24, 26, 28  [21, +∞) bin
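A minimal Python sketch reproducing both binnings on the age values above:

```python
ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]

# Equi-width, width 10: bin index comes from integer division
for x in ages:
    print(x, "-> bin", x // 10 + 1)  # 0,4 -> bin 1; 12-18 -> bin 2; 24-28 -> bin 3

# Equi-frequency, 3 values per bin: slice the sorted values
k = 3
bins = [sorted(ages)[i:i + k] for i in range(0, len(ages), k)]
print(bins)  # [[0, 4, 12], [16, 16, 18], [24, 26, 28]]
```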
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equi-depth bins:
  Bin 1: 4, 8, 9, 15
  Bin 2: 21, 21, 24, 25
  Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  Bin 1: 9, 9, 9, 9
  Bin 2: 23, 23, 23, 23
  Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  Bin 1: 4, 4, 4, 15
  Bin 2: 21, 21, 25, 25
  Bin 3: 26, 26, 26, 34
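A minimal Python sketch of both smoothing methods, reproducing the numbers above:

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

# Equi-depth bins of four values each (the data is already sorted)
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]

# Smoothing by bin means: every value becomes its bin's (rounded) mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]

# Smoothing by bin boundaries: each value snaps to the nearer boundary
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```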
Data Normalization
Attribute values are scaled to fall within a small, specified range, such as 0.0 to 1.0
➢ Min-Max normalization
➢ performs a linear transformation on the original data.
$$v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$$
➢ Example: Let the min and max values for the attribute income be $12,000 and $98,000, respectively.
➢ Map income to the range [0.0, 1.0]: a value of $73,600 is transformed to ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0.0) + 0.0 ≈ 0.716
Data Normalization
➢ z-score normalization (or zero-mean normalization)
➢ The values of an attribute A are normalized based on the mean and standard deviation of A
$$v' = \frac{v - mean_A}{stand\_dev_A}$$
➢ Example: Let mean = $54,000 and standard deviation = $16,000 for the attribute income
➢ With z-score normalization, a value of $73,600 for income is transformed to (73,600 − 54,000) / 16,000 = 1.225
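A minimal sketch of both normalizations, reproducing the two worked income examples above (the function names are ours):

```python
def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    """Linearly map v from [vmin, vmax] onto [new_min, new_max]."""
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    """Number of standard deviations v lies from the mean."""
    return (v - mean) / std

print(round(min_max(73_600, 12_000, 98_000), 3))  # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))  # 1.225
```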
Continuous and Categorical Attributes
How to apply the association analysis formulation to non-asymmetric binary variables?

Session Id  Country    Session Length (sec)  Number of Web Pages Viewed  Gender  Browser Type  Buy
1           USA        982                   8                           Male    IE            No
2           China      811                   10                          Female  Netscape      No
3           USA        2125                  45                          Female  Mozilla       Yes
4           Germany    596                   4                           Male    IE            Yes
5           Australia  123                   9                           Male    Mozilla       No
…           …          …                     …                           …       …             …

Example of an association rule:
{Number of Pages ∈ [5,10), Browser = Mozilla} → {Buy = No}
Handling Categorical Attributes
➢ Categorical Attributes:
➢finite number of possible values,
➢no ordering among values
➢ Transform categorical attribute into asymmetric
binary variables
➢ Introduce a new “item” for each distinct
attribute-value pair
➢ Example: replace Browser Type attribute with
➢ Browser Type = Internet Explorer
➢ Browser Type = Mozilla
➢ Browser Type = Chrome
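A minimal pandas sketch of this transformation (the column name is ours):

```python
import pandas as pd

df = pd.DataFrame({"browser_type": ["IE", "Mozilla", "IE", "Chrome"]})

# One new binary "item" per distinct attribute-value pair:
# browser_type_Chrome, browser_type_IE, browser_type_Mozilla
items = pd.get_dummies(df["browser_type"], prefix="browser_type")
print(items)
```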
Handling Categorical Attributes
➢ Potential Issues
➢ What if an attribute has many possible values?
➢ Example: the attribute country has more than 200 possible values
➢ Many of the attribute values may have very low support
➢ Potential solution: aggregate the low-support attribute values
➢ Replace less frequent attribute values with a single category called "others" (see the sketch below)
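A minimal pandas sketch of this aggregation (the threshold and values are invented):

```python
import pandas as pd

country = pd.Series(["USA", "USA", "China", "Fiji", "Malta", "USA", "China"])

# Keep values that reach a minimum support; lump the rest into "others"
min_support = 2
counts = country.value_counts()
frequent = counts[counts >= min_support].index
aggregated = country.where(country.isin(frequent), other="others")
print(aggregated.value_counts())  # USA 3, China 2, others 2
```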
Handling Categorical Attributes
➢ Potential Issue: What if the distribution of attribute values is highly skewed?
➢ Example: In an online survey, we collected information regarding the attributes gender, education, state, computer at home, chat online, shop online and privacy concern.
➢ 85% of the participants have a computer at home
➢ {Computer at home = yes, Shop online = yes} → {Privacy concerns = yes}
➢ Better: {Shop online = yes} → {Privacy concerns = yes}, since the near-universal item adds little information
➢ Potential solution: drop the highly frequent items
Handling Continuous Attributes
➢ Different kinds of rules:
➢Age ∈ [21,35) ∧ Salary ∈ [70k,120k) → Buy
➢Salary ∈ [70k,120k) ∧ Buy → Age: μ = 28, σ = 4
➢ Different methods:
➢Discretization-based
➢Equal-width binning
➢Equal-depth binning
➢Clustering
➢Statistics-based
Similarity and Dissimilarity
Similarity and Dissimilarity
➢ Similarity
• Numerical measure of how alike two data objects are
• Higher when objects are more alike
• Often falls in the range [0,1]
➢ Dissimilarity
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
➢ Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data objects.
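The table this slide refers to typically lists the following standard definitions (after Tan, Steinbach & Kumar):

Attribute Type     Dissimilarity                          Similarity
Nominal            d = 0 if p = q, 1 if p ≠ q             s = 1 if p = q, 0 if p ≠ q
Ordinal            d = |p − q| / (n − 1), with the        s = 1 − d
                   n values mapped to integers 0 to n−1
Interval or Ratio  d = |p − q|                            s = −d, s = 1/(1 + d), or the
                                                          min-max-scaled 1 − (d − min_d)/(max_d − min_d)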
Euclidean Distance
➢ Euclidean Distance
$$\mathrm{dist}(p, q) = \sqrt{\sum_{k=1}^{n}(p_k - q_k)^2}$$
where n is the number of dimensions (attributes) and pk and qk are the kth attributes (components) of data objects p and q.
➢ Standardization is necessary, if scales differ.
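A minimal NumPy sketch of the Euclidean distance (the points are invented):

```python
import numpy as np

p = np.array([0.0, 2.0, 3.0])
q = np.array([4.0, 2.0, 0.0])

# Square root of the summed squared per-attribute differences
dist = np.sqrt(np.sum((p - q) ** 2))
print(dist)                   # 5.0
print(np.linalg.norm(p - q))  # same result via NumPy's norm
```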
General Approach for Combining Similarities
Sometimes attributes are of many different types, but an
overall similarity is needed.
Using Weights to Combine Similarities
➢ May not want to treat all attributes the same.
➢Use weights wk which are between 0 and 1 and
sum to 1.
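One common formulation (after Tan, Steinbach & Kumar), where s_k(x, y) is the similarity for attribute k and δ_k is 0 if attribute k is asymmetric with both values zero or a value missing, and 1 otherwise:

$$\mathrm{similarity}(x, y) = \frac{\sum_{k=1}^{n} w_k \delta_k s_k(x, y)}{\sum_{k=1}^{n} \delta_k}$$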
Example
➢ One categorical variable, test-1
➢ d(i, j) evaluates to 0 if objects i and j have the same value for test-1, and 1 otherwise
Example
➢ Ordinal variable, test-2
➢ d(i, j) = |ri − rj| / (n − 1), where the n ordered values are mapped to ranks 0 to n − 1
Example
➢ Ratio-scaled variable, test-3
➢ Normalize with min-max normalization (max = 64, min = 22)
➢ Then apply a distance measure (Manhattan or Euclidean distance)
Example
➢ Variables of mixed types
➢ We combine the dissimilarity matrices computed for the three variables (see the sketch below)
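A minimal Python sketch of the combination; only min = 22 and max = 64 for test-3 appear on the slides, so the object values below are assumed for illustration:

```python
# Assumed example objects: nominal test-1, ordinal test-2 (ranks 0..n-1),
# ratio-scaled test-3 (min 22 and max 64 match the slide)
test1 = ["code A", "code B", "code C", "code A"]
test2 = [2, 0, 1, 2]              # excellent=2, good=1, fair=0
test3 = [45.0, 22.0, 64.0, 28.0]

n_ord = 3                          # number of ordinal states
t3_min, t3_max = min(test3), max(test3)

def d_mixed(i, j):
    d1 = 0.0 if test1[i] == test1[j] else 1.0            # nominal: match or not
    d2 = abs(test2[i] - test2[j]) / (n_ord - 1)          # ordinal: rank distance
    d3 = abs(test3[i] - test3[j]) / (t3_max - t3_min)    # min-max scaled ratio
    return (d1 + d2 + d3) / 3                            # equal-weight average

print(round(d_mixed(0, 1), 2))  # dissimilarity between objects 1 and 2: 0.85
```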
Assignment 4
➢ Preprocessing and Clustering using K-means in
Pyspark
➢Due on 24th May
Project ???
➢ Replaced with
➢ more assignments
➢ a mini project
➢ a review of a recent research paper on Spark