CS 434
Data Analytics
Lecture 4
Data Understanding & Preparation
Dr. Firas Jabloun
1
CRISP-DM, Phases and Tasks
Today’s Course
Business Data Data
Modeling Evaluation Deployment
Understanding Understanding Preparation
•Determine •Collect Initial •Select Data •Select •Evaluate •Plan
Business Data •Clean Data Modeling Results Deployment
Objectives •Describe •Construct Technique •Review •Plan
•Assess Data Data •Generate Test Process Monitoring &
Situation •Explore Data •Integrate Design •Determine Maintenance
•Determine •Verify Data Data •Build Model Next Steps •Produce Final
Data Mining Quality •Format Data •Assess Model Report
Goals •Review
•Produce Project
Project Plan
2
Table of contents
1. Introduction
2. Data cleaning
3. Data reduction
4. Data transformation
3
Introduction
• Data understanding and data preparation (2nd and 3rd phases of CRISP-DM
standard process)
▪ Evaluate the quality of the data
▪ Clean the raw data and Deal with missing data
▪ Integrate different sources
▪ Reduce data where necessary
▪ Perform transformations on certain variables
4
Introduction: What is data?
Collection of data objects and their
attributes Attributes
An attribute is a property or Tid Refund Marital
Status
Taxable
Income
Cheat
characteristic of an object: 1 Yes Single 125K No
• Examples: eye color of a person, temperature, 2 No Married 100K No
etc. 3 No Single 70K No
• Attribute is also known as variable, field, 4 Yes Married 120K No
characteristic, or feature 5 No Divorce 95K Yes
d
Objects
A collection of attributes describe an 6 No Married 60K No
7 Yes Divorce 220K No
object d
8 No Single 85K Yes
• Object is also known as record, point, case, 9 No Married 75K No
sample, entity, or instance 10 No Single 90K Yes
10
5
Introduction: Attribute values
Attribute values are numbers or symbols assigned to an
attribute.
Distinction between attributes and attribute values
• Same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters
Different attributes can be mapped to the same set of
values
• Example: Attribute values for ID and age are integers
• But properties of attribute values can be different
• ID has no limit but age has a maximum and minimum value
6
Data quality
• Data have quality if they satisfy the requirements of the intended use
• Measures for data quality: A multidimensional view
▪ Accuracy: correct or wrong, accurate or not
▪ Completeness: not recorded, unavailable, …
▪ Consistency: some modified but some not, dangling, …
▪ Timeliness: timely update?
▪ Believability: how trustable is the data?
▪ Interpretability: how easily the data can be understood?
• Two different users may have very different assessments of the quality of a
given dataset
7
Data Preparation
• Much of the raw data contained in databases is unprocessed, incomplete, and noisy
▪ Poor quality data result in incorrect and unreliable data mining results
▪ In recent years we moved from “all signal” to “mostly noise” data!!
• Data preparation (or data preprocessing) means the manipulation of data into a form suitable for
further analysis and processing
▪ Objective: minimize garbage in, garbage out (GIGO). It improves the quality of data and
consequently helps improve the quality of data mining results
• Depending on the dataset, data preprocessing alone can account for 10-60% of all time and effort
of the data mining process (data preparation activities are routine, tedious, and time consuming)
8
Major tasks in Data Preparation
• Data cleaning: Fill in missing values, smooth noisy data, identify
or remove outliers, and resolve inconsistencies
• Data integration: Integration of multiple databases, data cubes,
or files
• Data reduction: Dimensionality reduction, numerosity reduction,
data compression
• Data transformation and data discretization: Normalization,
Aggregation, Binning…
9
Table of contents
1. Introduction
2. Data cleaning
3. Data reduction
4. Data transformation
10
Data cleaning
• Data cleaning (cleansing or scrubbing) routines attempt to fill in missing values, smooth
out noise while identifying outliers, and correct inconsistencies in the data
11
Missing data
• A dataset can have missing
(unavailable) values
because those values either
cannot be measured or are
just unknown.
• Different ways to represent
missing values in a data set:
blank, ?, null, NA, « »,etc
12
Missing data
• Missing data may be due to
▪ equipment malfunction
▪ inconsistent with other recorded data and thus deleted
▪ data not entered due to misunderstanding
▪ certain data may not be considered important at the time of entry
▪ not register history or changes of the data
• Missing data may need to be inferred
13
Handling Missing Data
➢ Two main ways to deal with missing or inconsistent values :
1. Replace with a proper value (also called imputation)
• Use a value that makes sense, such as the mean, median, mode
1. Remove the entire example/ Ignore the tuple: usually done when class label is missing
(when doing classification)
- Not effective when the % of missing values per attribute varies considerably
➢ Choosing between these 2 methods depends on the underlying meaning of
the values
➢ Also, sometimes we may leave the value as missing, because subsequent
modeling step can actually tolerate missing values (e.g. decision trees)
14
Handling missing data: Imputation
➢ Manual Imputation: tedious + infeasible?
➢ Automatic Imputation using:
▪ a global constant : e.g., “unknown”, a new class?!
▪ the attribute mean/the local mean/the moving average
▪ the attribute mean for all samples belonging to the same class (smarter)
▪ the most probable value: inference-based such as Bayesian formula or decision tree, or
regression
15
Noisy data
❖Noise: random error or variance in a measured variable, a value that is not
valid or meaningful, for example has wrong data type, wrong range, or
impossible value
➢ Incorrect attribute values may be due to
▪ faulty data collection instruments
▪ data entry problems
▪ data transmission problems
▪ technology limitation
▪ inconsistency in naming convention
16
How to handle noisy data (outliers)?
• Binning:
▪ first sort data and partition into (equal-frequency) bins
▪ then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
• Regression: smooth by fitting the data into regression functions
• Clustering: detect and remove outliers
• Combined computer and human inspection: detect suspicious
values and check by human (e.g., deal with possible outliers)
17
How would you handle this?
Nescafe % of Own Label % of Nescafe Kenco
Nescafe Share Nescafe Price Kenco Price Own Label Kenco % of All
Time Period all Stores All Stores Advertising Advertising
of Market (Pence) (Pence) Price (Pence) Stores Stocking
Stocking Stocking (£000s) (£000s)
1 25.7% 68.0 59.8 38.7 66 89 79 255
2 26.7% 69.4 65.4 39.0 65 88 79 252
3 27.8% 67.1 39.6 69 89 77 233
4 25.2% 74.9 66.8 38.8 67 89 78 210
5 25.7% 73.9 67.0 39.7 66 89 78 250
6 27.1% 74.8 67.2 39.2 67 88 79 306
7 64.4 77 319
8 35.3% 0 38.7 66 88 78
9 32.5% 0 65.9 40.2 65 90 77
10 78.2 66.8 38.9 0 90 77
11 23.1% 75.7 65.8 0 90 78
12 25.0% 73.2 65.9 0 90 48
13 21.1% 77.7 66.6 0 92 46 195
14 19.9% 74.2 64.6 66 95 44 209
15 17.5% 75.2 65.2 37.8 62 94 47 171
16 21.0% 72.0 65.1 44.6 64 94 45 182
17 27.0% 67.4 45.1 63 94 46 1.52
18 24.6% 72.6 69.0 45.8 67 95 48
19 23.1% 76.6 69.0 46.4 66 0 54
20 25.8% 72.3 69.0 47.1 64 0 57
21 29.5% 71.6 68.9 46.5 67 0 48
18
Identifying misclassifications
• A frequency distribution can be used to make sure that all labels of a categorical
variable are all valid and consistent
• Example (Larose & Larose, 2015, p. 26)
19
The normal distribution: Review
• Bell Shaped
• Symmetrical
f(x)
• Mean, Median and Mode are Equal
• Location is determined by the mean, μ
• Spread is determined by the standard deviation, σ σ
• The random variable has an infinite theoretical range: x
• + to − μ
Mean = Median = Mode
20
The Standard Normal Distribution: Review
• Also known as the “z” distribution
• Mean is defined to be 0
• Standard Deviation is 1
f(z)
• Values above the mean have positive
z-values
• Values below the mean have negative
z-values z
-1.96 0 +1.96
• Translate from x to the standard normal
(the “z” distribution) by subtracting the
mean of x and dividing by its standard • 95.0% of the scores fall between a Z of -1.96 to +1.96
deviation: • 99.9% of the scores fall between a Z of -3.30 to +3.30
x −μ
z=
σ
21
22
Graphical methods for identifying outliers
➢ Outliers are extreme values that go against the trend of the remaining data.
➢ Certain statistical methods are sensitive to the presence of outliers and may deliver
unreliable results
➢ Graphical methods for identifying outliers for numeric variables include:
▪ Histograms
▪ Box Plot
▪ Scatter diagram
23
Histogram
❖HISTOGRAM: A graph in which the classes are marked on the horizontal axis and the class
frequencies on the vertical axis. The class frequencies are represented by the heights of
the bars and the bars are drawn adjacent to each other.
Bin
Histogram: Discovering Outlier
Outlier
Box Plot: Percentiles & Quartiles
➢ The pth percentile of a data set is a value such that at least p percent of the items take on this value
or less and at least (100 - p) percent of the items take on this value or more.
▪ Arrange the data in ascending order.
▪ Compute index i, the position of the pth percentile.
i = (p/100)n
▪ If i is not an integer, round up. The p th percentile is the value in the i th position.
▪ If i is an integer, the p th percentile is the average of the values in positions i and i +1.
➢ Quartiles are specific percentiles
▪ First Quartile = 25th Percentile
▪ Second Quartile = 50th Percentile = Median
▪ Third Quartile = 75th Percentile
➢ IQR: Interquartile Range=Q3-Q1
➢A measure of the spread of the middle 50% of the data.
Box Plot
❖ A box plot is a graphical rendition of statistical data based on
the minimum, first quartile, median, third quartile, and
maximum.
❖ The term "box plot" comes from the fact that the graph looks
like a rectangle with lines extending from the top and bottom.
IQR=Interquartile range
27
Box Plot: Identifying outliers
• A robust measure of identifying outliers (univariate case) is defined as follows. A
data value is an outlier if
▪ it is located 1.5(IQR) or more below Q1, or
▪ It is located 1.5(IQR) or more above Q3.
• Example: Q1=70, Q3=80, identify the boundaries starting from which some
of the values are considered outliers:
• IQR=80-70=10
• A value is identified as an outlier if:
• It is lower than Q1-1.5*IQR=55, or
• It is higher than Q3+1.5*IQR=95
28
Box Plot: Z-Score Example
• Case item number 282
has a star, indicating that
it is an extreme score.
• We need to convert the
data to Z scores to
examine the Z for case
282.
29
The column
ZReactionTime
contains the Z scores
for ReactionTime.
The Z score for case 282 is
4.944. Since the value is
much greater than our
arbitrary cutoff of 3.
We will delete this data
point
Scatter Plot
• Scatter plots are similar to line graphs in that they use horizontal and vertical
axes to plot data points. However, they have a very specific purpose. Scatter
plots show how much one variable is affected by another. The relationship
between two variables is called their correlation .
• Scatter plots usually consist of a large body of data. The closer the data points
come when plotted to making a straight line, the higher the correlation between
the two variables, or the stronger the relationship.
32
Correlation vs. Possible Relationship Between Variables
❖ Direct cause and effect,
▪ e.g. water causes plant to grow
❖ Both cause and effect,
▪ E.g. coffee consumption causes nervousness as well nervous people have more coffee.
❖ Relationship caused by third variable;
▪ Death due to drowning and soft drink consumption during summer.
• Both variables are related to heat and humidity (third variable).
• This is dangerous (Why?)
❖ Coincidental relationship;
▪ Increase in the number of people exercising and increase in the number of people committing crimes.
• This is even more dangerous (Why?)
❖ Correlation measures association and not causation.
33
Other problems
❑ Other data problems which require data cleaning
▪ duplicate records
▪ incomplete data
▪ inconsistent data
34
Table of contents
1. Introduction
2. Data cleaning
3. Data reduction
4. Data transformation
36
Data Reduction
1) Sampling observations (examples)
• Select a small but representative subset of the observations
• Appropriate when number of observations in dataset is too large
• Usually done by some form of random sampling
2) Removing irrelevant attributes
• Because the dataset should only contain attributes that are relevant
(pertinent) to the pattern of interest
• This is a non-trivial task, especially with large number of attributes
• Typically achieved with a combination of domain knowledge, visual
exploration of the data, and statistical dimensionality reduction
techniques (such as Principal Components Analysis)
37
Data reduction: keep in mind the following tradeoffs
❖Data reduction helps reduce the processing time of the 2 subsequent phases
(modeling & evaluation)
▪ Ideally, should not throw away data unless we’re sure it is not essential
▪ The number of observations in the dataset should be at least 6 times the
number of attributes
• This is a so-called rule of thumb (not an exact rule)
• Need sufficient observations in order to obtain reliable modeling results
38
Table of contents
1. Introduction
2. Data cleaning
3. Data reduction
4. Data transformation
39
Data transformation
• Data are transformed or consolidated to forms appropriate for mining
• Strategies for data transformation include the following
▪ Smoothing
▪ Attribute construction
▪ Aggregation
▪ Normalization
▪ Discretization
▪ Concept hierarchy generation for nominal data
40
Normalization
• Variables tend to have ranges that vary greatly from each other
• The measurement unit used can affect the data analysis
• For some data mining algorithms, differences in ranges will lead to a tendency for the
variable with greater range to have undue influence on the results
• Data miners should normalize their numeric variables in order to standardize the scale
of effect each variable has on the results
41
Normalization
• Normalizing the data attempts to give all attributes an equal weight
• The terms standardize and normalize are used interchangeably in data preprocessing
• Algorithms that make use of distance measures, such as the k-nearest neighbors
algorithm benefit from normalization
• Notation:
▪ X : original field value
▪ X*: normalized field value
42
Min-max normalization
• Performs a linear transformation on the original data
• Min-max normalization works by seeing how much greater the field value is
than the minimum value, min(X), and scaling this difference by the range:
X − min( X ) X − min( X )
*
X mm = =
range( X ) max( X ) − min( X )
• Min-max normalization preserves the relationships among the original data
values
• Values range between 0 and 1
• Min-max normalization will encounter an « out-of-bounds » error if a future
input case for normalization falls outside of the original data range of X
43
Z-score normalization
• Also called zero-mean normalization
• Z-score standardization works by taking the difference between the field
value and the field mean value, and scaling this difference by the standard
deviation of the field values:
X − mean( X )
z − score =
SD( X )
• The z-score normalization is useful when the actual minimum and maximum
of an attribute X are unknown, or when there are outliers that dominate the
min-max normalization
• A variation of the z-score normalization replaces the standard deviation of
X by the mean absolute deviation of X
44
Decimal scaling
❖ Decimal scaling ensures that every normalized value lies
between -1 and 1
X
X decimal = d
*
10
• d represents the number of digits in the data value with the largest absolute value
45
Normalization-examples
• Suppose that the minimum, maximum, mean, and standard deviation of the
values for the attribute income are $12000, $98000, $54000 and $16000.
Transform a value of $73600 for income using:
▪ Min-max normalization: a value of $73600 for income is transformed to 0.716
▪ Z-score normalization: a value of $73600 for income is transformed to 1.225
• Suppose that the recorded values range from -986 to 917. To normalize by
decimal scaling, we therefore divide each value by 1000 = 103
▪ 986 normalizes to -0.986 and 917 normalizes to 0.917
46
Normalization — remarks
• Normalization can change the original data quite a bit,
especially when using the z-score normalization or decimal
scaling
• It is necessary to save the normalization parameters (e.g., the
mean and standard deviation if using z-score normalization) so
that future data can be normalized in a uniform manner
• The normalization parameters now become model parameters
and the same value should be used when the model is used on
new data (e.g. testing data)
47
Transformations to achieve normality
• Some data mining algorithms and statistical methods require that the variables be
normally distributed
• Z-score transformation does not achieve normality
Z-score
Transformation
48
Transformations to achieve normality
❖ The skewness of a distribution is measured by
3(mean − median)
sk =
std
▪ Much real-world data is right skewed, including most financial data
➢ right skewed : sk >0
▪ Left-skewed data is not as common (often occurs when the data is right-censored)
➢ Left skewed : sk <0
49
Transformations to achieve normality
• Common transformations to achieve normality (more normally distributed) by
eliminating skewness)
➢ ln(x)
➢ Sqrt(x)
➢ 1/x
➢ 1/sqrt(x)
50
Transformations to achieve normality
• Normal probability plot can be used to check whether the new variable is
normally distributed or not (normality ≠ symmetry).
• Algorithms requiring normality usually do fine when supplied with data that
is symmetric and unimodal.
• Don’t forget to « de-transform » the data when the algorithm is done with its
results:
➢ What does this mean?
51
Checking Normality
We can use the following to determine if a
distribution is approximately normal:
• Distribution should be similar in shape to the normal
curve.
• Q-Q Plot values should lie close to the 45 line.
• Skew & (excess) Kurtosis should be reasonably close to 0.
• Note: For a Normal distribution, the Kurtosis should be 3,
but usually there is a correction term of -3 leading to the
so called “excess” Kurtosis which should be 0.
52
Checking Normality: Histogram
• Looking at the histogram with
Normal curve can show
problems in skewness and
therefore departure from
normality
Checking Normality: Quantile-Quantile (Q-Q) Plot
➢The Normal probability plot, plots
the quantiles/percentiles of a
particular distribution against the
quantiles of the standard normal The bulk of the points
distribution should fall on the straight
➢The circles should fall on the 45 lines
degree line:
➢ For this data set the ends are
deviating from the line, again
suggesting a problem with
normality.
Checking Normality: Skewness & Kurtosis
• Skewness & Kurtosis should
be reasonably close to 0.
Dummy variables
Some analytical methods, such as regression, require
predictors to be numeric
A dummy variable (or flag variable, or indicator variable) is a
categorical variable taking only two values, 0 and 1
When a categorical predictor takes k ≥ 3 possible values, then
define k-1 dummy variables and use the unassigned category
as the reference category
56
Transforming categorical variables into numerical variables
• In most instances, the data analyst should
avoid transforming categorical variables
to numeric variables
• The exception is for categorical variables
that are clearly ordered
▪ One may bicker with the actual values assigned
▪ But still not very recommended (why?)
57
Discretization by binning numerical values
• Some algorithms prefer categorical rather than continuous predictors, in which case
we would need to partition the numerical predictors into bins or bands
• Common methods for binning numerical predictors:
▪ Equal width binning
▪ Equal frequency binning
▪ Binning by clustering
▪ Binning based on predictive value
58
Discretization by binning
▪ Binning does not use class information and is therefore an unsupervised discretization
technique
▪ Binning is sensitive to the user-specified number of bins, as well as the presence of
outliers
▪ Equal width binning is not recommended for most data mining applications (the width
of the categories can be greatly affected by the presence of outliers)
▪ Equal frequency binning assumes that each category is equally likely (an assumption
which is usually not warranted)
▪ Binning by clustering and binning based on predictive value are preferred
59
Discretization by binning --example
➢ Suppose we have the following tiny data set, which we would like to discretize into
k=3 categories: X = {1, 1, 1, 1, 1, 2, 2, 11, 11, 12, 12, 44}
▪ Equal width binning: [0, 15), [15, 30), [30, 45)
▪ Equal frequency binning: n/k = 12 / 3 = 4
▪ K-means clustering
Reclassifying categorical variables
▪ Reclassifying categorical variables is the equivalent of binning numerical variables
▪ Often, a categorical variable will contain too many easily analyzable field values (for example state)
▪ Data mining methods such as logistic regression perform sub -optimally when confronted with
predictors containing too many field values
▪ The data analyst should reclassify the field values (the reclassification should support the objectives of
the business problem or research question)
61