Data Science Slides

The document outlines the essential processes of data preparation in data science, specifically for marketing, including data cleaning, reduction, transformation, and integration. It emphasizes the importance of addressing issues such as noise, missing values, and outliers to improve data quality and mining results. Various techniques for handling these issues, such as normalization, imputation, and feature engineering, are discussed to enhance the efficiency of data analysis.

Uploaded by

Clarisse Gaiola

NOVA IMS
Information Management School

6 DATA PREPARATION

Data Science for Marketing


© 2021-2024 Nuno António
Accreditations and Certifications
Instituto Superior de Estatística e Gestão da Informação
Universidade Nova de Lisboa
Summary
1. Introduction
2. Data cleaning
3. Data reduction
4. Data transformation
5. Data integration

6.1 Introduction
Data preparation

Data preparation

[Figure: original dataset → modeling dataset, aka Analytical Base Table (ABT)]

Why data preparation
Due to their size and multiple, heterogeneous sources, real-world databases
commonly have:
§ “Noise” (random error or variance)
§ Missing data
§ Inconsistent data

Low quality data → Low quality mining results

For this reason, data must be preprocessed and prepared to improve the
efficiency and ease of the mining process.

Forms of data preprocessing and preparation
Han et al. (2012)

6.2 Data cleaning
Data preparation

Objective
Fix problems in the variables, such as:
§ Duplicates
§ Redundancy
§ Incorrect or miscoded values
§ Outliers
§ Missing values

Duplicates
It is common for real-world datasets to have duplicate instances, even when
they should not exist (e.g., having two instances of the same customer profile)
§ If instances are exact-match duplicates, with all columns having the same
values, most of the time all duplicated instances could be deleted (except
one, of course)

§ If some columns are not equal (e.g., if there are two customer instances with
the same name, telephone, and address but a different volume of purchases),
aggregations may be required. In this example, sales may need to be
summed, and one of the instances deleted afterwards
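
The two cases above can be sketched in plain Python. This is a minimal illustration, not the course's own code: the records, the field layout, and the choice of name, phone, and address as the customer key are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical customer records: (name, phone, address, purchases)
records = [
    ("Ana Silva", "910000001", "Rua A 1", 100.0),
    ("Ana Silva", "910000001", "Rua A 1", 100.0),  # exact-match duplicate
    ("Rui Costa", "910000002", "Rua B 2", 50.0),
    ("Rui Costa", "910000002", "Rua B 2", 70.0),   # same customer, different purchases
]

# Case 1: drop exact-match duplicates, keeping the first occurrence of each record.
unique_records = list(dict.fromkeys(records))

# Case 2: aggregate partial duplicates by summing purchases per customer key.
totals = defaultdict(float)
for name, phone, address, purchases in unique_records:
    totals[(name, phone, address)] += purchases

# One row per customer, with the summed purchases as the last field.
customers = [key + (total,) for key, total in totals.items()]
```

Whether an exact-match duplicate is an error or a legitimate repeated event depends on the dataset, so the first step should only be applied when duplicates are known not to be meaningful.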

Redundancy
When two attributes are redundant, one
of them should not be included in the
modeling dataset. Removing correlated
attributes:

§ Improves the model development speed
§ Decreases harmful bias
§ Increases interpretability
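
One possible way to spot redundant numeric attributes is to flag pairs with a high absolute Pearson correlation. The sketch below assumes a 0.9 cutoff and keeps the first attribute of each correlated pair; both choices are illustrative and not prescribed by the slides, and the function and column names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def redundant_columns(columns, threshold=0.9):
    """Return names of columns highly correlated with an earlier, kept column."""
    names = list(columns)
    to_drop = []
    for i, later in enumerate(names):
        for earlier in names[:i]:
            if earlier in to_drop:
                continue  # only compare against columns that are being kept
            if abs(pearson(columns[earlier], columns[later])) > threshold:
                to_drop.append(later)
                break
    return to_drop

# Hypothetical data: the same price stored in two units, plus an unrelated column.
data = {
    "price_eur": [10.0, 20.0, 30.0, 40.0, 50.0],
    "price_cents": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0],
    "visits": [5.0, 1.0, 4.0, 2.0, 3.0],
}
dropped = redundant_columns(data)
```

Pearson correlation only captures linear relationships, so this check is a starting point rather than a complete redundancy test.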

Incorrect or miscoded values

[Flowchart: decision process for incorrect or miscoded values in a column.
Decision nodes: Inconsistent (e.g., dates in multiple formats)? · Type? (Numerical / Categorical) · Outliers, unusual values, or spikes? · Incorrect categories? · Occur frequently? · Can an expert interpret and impute the correct value?
Outcomes: Correct the problem, recollect data correctly, or delete the column · Treat as outliers · Treat as missing · Correct the value · Leave as it is]

Approaches to handling outliers
Data cleaning

Approaches to handling outliers (1/6)
Remove from the modeling data
Appropriate when outliers distort the model more than they
help, as in numeric algorithms (e.g., K-means
clustering or principal component analysis)

Risk of removing outliers:
Model deployment can be compromised
when outliers appear in new data (the model produces unexpected
scores)
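
A minimal sketch of removal, assuming outliers are defined as values more than a chosen number of standard deviations from the mean. The cutoff of 3, the function name, and the data are illustrative assumptions.

```python
import statistics

def remove_outliers(values, max_z=3.0):
    """Keep only values within max_z sample standard deviations of the mean."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [v for v in values if abs(v - mean) <= max_z * std]

# Hypothetical customer spend: 20 typical values and one extreme value.
spend = [20, 22, 19, 21, 23, 20, 18, 22, 21, 19,
         20, 21, 22, 19, 20, 23, 18, 21, 20, 22, 500]
kept = remove_outliers(spend)
```

The mean and standard deviation are themselves pulled by the outliers, so with very small samples an extreme value may never exceed the cutoff.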

Approaches to handling outliers (2/6)
Separate the outliers and create models just for them
§ Relax the definition of outliers from two standard
deviations from the mean to three standard deviations
§ Create a separate model for outliers (e.g., linear
regression)

Some algorithms, such as decision tree-based
algorithms, already incorporate this approach in the
algorithm design itself

Approaches to handling outliers (3/6)
Transform the outliers so they
are no longer outliers
§ Apply skew transformation or
normalization techniques to
reduce the distance between
the outliers and the main body
of the distribution
§ Apply a MIN or MAX function
based on ”valid” minimum or
maximum values
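
Both options can be sketched in a few lines. The logarithm is used here as one example of a skew transformation, and the "valid" bounds are hypothetical values that a domain expert would normally supply.

```python
import math

def cap(values, valid_min, valid_max):
    """Clamp each value into [valid_min, valid_max] (the MIN/MAX approach)."""
    return [min(max(v, valid_min), valid_max) for v in values]

def log_transform(values):
    """Compress large values with log(1 + v); requires v > -1."""
    return [math.log1p(v) for v in values]

# Hypothetical monthly salaries with one extreme value.
salaries = [1200.0, 1500.0, 1800.0, 2500.0, 90000.0]
capped = cap(salaries, valid_min=0.0, valid_max=10000.0)
logged = log_transform(salaries)
```

After the log transformation the extreme salary is still the largest value, but its distance to the rest of the distribution is much smaller.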

Approaches to handling outliers (4/6)
Transform the outlier and
create an indicator column
Apply skew transformation or
normalization techniques as in
the previous approach, but,
additionally, create a dummy
column indicating if the
observation is an outlier (0: no;
1: yes)

Approaches to handling outliers (5/6)
Bin the data (discretize the data)
Because transformations may not tame very
extreme outliers, an alternative to transformations
is to convert the numeric variable into a categorical one
(e.g., instead of salary amount use low, medium,
high)

Common binning options:
§ Equal-Frequency: the number of observations
in each bin is similar
§ Equal-Width: pre-defined or range-based width
(dividing the range by the number of bins to
define size)
§ Clustering: dividing the data into discrete
groups or clusters
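
The first two binning options could look like this in plain Python. The function names, the three labels, and the salary values are assumptions for illustration; clustering-based binning is not shown.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index using bins of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum falls on the upper edge, so it is clamped into the last bin.
    return [min(int((v - low) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign bin indices so that each bin holds a similar number of observations."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

salaries = [900, 1000, 1100, 1200, 1500, 2000, 3000, 5000, 20000]
labels = ["low", "medium", "high"]
by_width = [labels[b] for b in equal_width_bins(salaries, 3)]
by_frequency = [labels[b] for b in equal_frequency_bins(salaries, 3)]
```

With the extreme salary present, equal-width binning places almost every observation in the first bin, while equal-frequency binning spreads them evenly. In this simple sketch, tied values can land in different equal-frequency bins.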

Approaches to handling outliers (6/6)
Leave in the data without modification
Use algorithms that are largely unaffected by outliers, such as decision
tree-based algorithms

Approaches to handling missing values
Data cleaning

Approaches to handling missing values (1/6)
Listwise and column deletion
§If a small percentage of observations
have columns with missing values, just
remove those observations
§If a specific column has many missing
values, consider removing it
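
A small sketch of both deletions, with missing values represented as None. The 50% cutoff for dropping a column and the example rows are assumptions, not rules from the slides.

```python
# Hypothetical rows; None marks a missing value.
rows = [
    {"age": 34, "income": 2100.0, "fax": None},
    {"age": 51, "income": None, "fax": None},
    {"age": 28, "income": 1800.0, "fax": None},
    {"age": 45, "income": 2600.0, "fax": "210000000"},
]

def missing_share(rows, column):
    """Fraction of rows in which the column is missing."""
    return sum(r[column] is None for r in rows) / len(rows)

# Column deletion: drop columns whose share of missing values exceeds a cutoff.
MAX_MISSING_SHARE = 0.5  # illustrative cutoff
kept_columns = [c for c in rows[0] if missing_share(rows, c) <= MAX_MISSING_SHARE]

# Listwise deletion: drop rows that still have a missing value in the kept columns.
complete_rows = [
    {c: r[c] for c in kept_columns}
    for r in rows
    if all(r[c] is not None for c in kept_columns)
]
```

Dropping the sparse column first avoids discarding rows only because of a column that would be removed anyway.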

Approaches to handling missing values (2/6)
Imputation with a constant
§ For categorical variables, this is as
simple as filling the missing values with a
value indicating that it is missing (e.g.,
“NULL”)
§ For numeric variables, if 0 (zero)
makes sense (e.g., bank balance), then fill
with 0 (zero). Otherwise, try another
approach, such as mean or median
imputation

Approaches to handling missing values (3/6)
Mean and median imputation (for
continuous variables)
One of the most common approaches for
continuous variables is to impute the
mean value. However, if the distribution is
skewed, the median could be better

If the number of observations is large, this
operation could be computationally
expensive.
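
As a sketch, mean and median imputation differ only in the statistic computed from the observed values. The function name and the income figures are hypothetical; the data are deliberately skewed to show the difference.

```python
import statistics

def impute(values, strategy="mean"):
    """Replace None with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    else:
        fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical incomes with one missing value and one very large value.
incomes = [1000.0, 1200.0, None, 1100.0, 9000.0]
mean_filled = impute(incomes, "mean")      # fills with 3075.0
median_filled = impute(incomes, "median")  # fills with 1150.0
```

The mean is pulled up by the single large income, while the median stays close to the typical values, which is why the median tends to be preferred for skewed distributions.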

Approaches to handling missing values (4/6)
Imputations with distributions
In numeric variables, when a large
percentage of values is missing,
mean/median imputation distorts the
summary statistics. In these
cases, each missing value should be
replaced with a random number drawn from a
known distribution (based on the
variable distribution)

Approaches to handling missing values (5/6)
Random imputation from own
distributions
For each missing value, this approach
randomly selects one of the
non-missing values in the
column.

The advantage of this approach
is that the distribution of
imputed values matches the
populated data.
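
A minimal sketch of this approach, assuming missing values are stored as None. The seed parameter is an addition for reproducibility and the ages are hypothetical.

```python
import random

def random_impute(values, seed=None):
    """Replace each None with a value drawn at random from the observed values."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]

ages = [25, None, 40, 31, None, 58]
filled = random_impute(ages, seed=42)
```

Because every imputed value is copied from an observed one, the filled column never contains values outside the range already present in the data.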

Approaches to handling missing values (6/6)
Impute value from a model
This is the most complex approach. It involves developing a model to impute
missing values.

This approach can take time

When deployed, missing values in new data must also be processed
by the imputation model

Additional consideration on missing values
Creation of dummy variables
In some cases, the existence of missing values can be informative for the
model. In those cases, besides implementing one of the previous approaches, a
dummy variable could be created to indicate if there is a missing value (0: no;
1: yes)
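
The indicator can be created in the same pass as the imputation. The sketch below uses constant imputation with zero on a hypothetical bank balance column; any of the previous approaches could supply the fill value instead.

```python
def add_missing_indicator(values, fill_value):
    """Return (filled values, indicator), where indicator is 1 if the value was missing."""
    indicator = [1 if v is None else 0 for v in values]
    filled = [fill_value if v is None else v for v in values]
    return filled, indicator

balance = [250.0, None, 0.0, None]
balance_filled, balance_missing = add_missing_indicator(balance, fill_value=0.0)
```

The indicator keeps a true zero balance distinguishable from a balance that was missing and filled with zero.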

6.3 Data reduction
Data preparation

Dimensionality reduction
Data reduction

The curse of dimensionality
As the number of candidate variables for modeling increases, the number of
observations must also increase (exponentially) to capture the
high-dimensional patterns. One way to address this problem is to reduce the
number of dimensions

source: http://www.turingfinance.com

Attribute subset selection
Datasets may contain hundreds of attributes, many of which may be
irrelevant to the mining task or redundant. For example, for segmenting
customers, telephone number may be irrelevant

“The goal of attribute subset selection is to find a minimum set of attributes
such that the resulting probability distribution of the data classes is as close as
possible to the original distribution obtained using all attributes”
Han et al. (2012)

Attribute subset selection types
§ Filter: uses statistical tests (Pearson correlation, Chi-squared, etc.)
§ Wrapper: uses ML to select features to use (forward selection,
backward selection, or other method)
§ Embedded: included in the algorithm (e.g., Decision trees)

These methods will be discussed in more detail in Machine
Learning, Marketing Engineering, and other courses as they tend to
be used more in Predictive Modeling

Techniques for dimensionality reduction
§ Principal Components Analysis (PCA): reduces dimensionality,
while retaining as much variance in data as possible (finds a new
set of variables that are a linear combination of the original
variables)
§ Kernel PCA (KPCA): nonlinear variation of PCA
§ Linear Discriminant Analysis (LDA): supervised learning
method that uses class labels to transform a set of features into a new set
§ Singular Value Decomposition (SVD): extracts important features
from data, while reconstructing the original dataset as a smaller
dataset (e.g., transform a 1 024 pixels image to 66 pixels)
§ Among others

Numerosity reduction
Data reduction

Numerosity reduction (1/2)
Methods:
§Aggregations: aggregate the data in a different unit of analysis
(e.g., weekly data, instead of daily data)
§Clustering: cluster representations of the data are used to replace
the actual data
§ Parametric data reduction: regression and log-linear models are
used to “predict” an output, based on a set of inputs (e.g., using
multivariate linear regression to transform a set of variables into only
one)
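
As an illustration of the aggregation method, daily figures can be rolled up to ISO weeks. The dates and amounts are hypothetical, and ISO week numbering is one of several possible definitions of a week.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily sales.
daily_sales = {
    date(2024, 1, 1): 100.0,  # Monday, ISO week 1
    date(2024, 1, 2): 150.0,
    date(2024, 1, 7): 50.0,   # Sunday, still ISO week 1
    date(2024, 1, 8): 200.0,  # Monday, ISO week 2
}

# Sum the daily amounts per (ISO year, ISO week).
weekly_sales = defaultdict(float)
for day, amount in daily_sales.items():
    iso_year, iso_week, _ = day.isocalendar()
    weekly_sales[(iso_year, iso_week)] += amount
```

Four daily observations become two weekly ones, which is the reduction in the number of instances that this method aims for.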

Numerosity reduction (2/2)
Methods (cont.):
§ Sampling: allows a dataset to be represented by a smaller subset.
Could be:
§ Random sampling (RS): selects a random percentage of instances
§ Random sampling with replacement (RSWR): similar to the previous, but the
same instance can be selected more than once
§ Stratified sample (SS): selects instances according to the relative
frequencies of the levels of a specified stratification feature. This selection
ensures that the sample presents a distribution similar to the population

Example (s = sample size):
Full dataset      RS (s=4)          RSWR (s=4)        SS (s=4)
1 youth           1 youth           1 youth           1 youth
2 youth           3 middle_aged     3 middle_aged     3 middle_aged
3 middle_aged     4 middle_aged     7 senior          4 middle_aged
4 middle_aged     6 middle_aged     7 senior          7 senior
5 middle_aged
6 middle_aged
7 senior
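
The three sampling schemes can be sketched with the standard `random` module on the seven-instance example. The helper `stratified_sample`, the seeds, and the per-stratum rounding rule are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=None):
    """Sample roughly `fraction` of each stratum, with at least one row per stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

dataset = [
    (1, "youth"), (2, "youth"),
    (3, "middle_aged"), (4, "middle_aged"), (5, "middle_aged"), (6, "middle_aged"),
    (7, "senior"),
]

rng = random.Random(0)
rs = rng.sample(dataset, 4)        # random sampling, without replacement
rswr = rng.choices(dataset, k=4)   # random sampling with replacement
ss = stratified_sample(dataset, key=lambda row: row[1], fraction=4 / 7, seed=0)
```

Because each stratum is rounded separately, the stratified sample size may differ slightly from the target in other datasets; here it yields one youth, two middle-aged, and one senior instance.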
Sampling consideration

Survivorship bias
Concentrating on the instances
that passed some selection or
sampling process and
overlooking those that did not
(one focuses on what can be seen and
ignores what cannot)

It can lead to excessively
optimistic conclusions because
multiple variables are ignored
source: https://www.wikipedia.com

6.4 Data transformation
Data preparation

Normalization
Data transformation

Normalization
§ Some algorithms, such as the K-MEANS algorithm, have
difficulty handling variables in very different ranges (e.g., age
in the range [15, 80] and salary in the range [30 000, 80 000])
§ Linear regression coefficients are also influenced
disproportionately by the large values of a skewed distribution
§ Normalization can make a continuous variable fall within a
specific range while maintaining the relative differences between
the values for the variable

Common normalization techniques

Method              Formula                                                               Range
Magnitude scaling   $x' = \frac{x}{\max(|x|)}$                                            [-1, 1]
Sigmoid             $x' = \frac{1}{1 + e^{-x}}$                                           [0, 1]
Min-max             $x' = \frac{x - x_{min}}{x_{max} - x_{min}}$                          [0, 1]
Z-score             $x' = \frac{x - \bar{x}}{\sigma_x}$                                   mostly [-3, 3]
Rank binning        $x' = \frac{100 \times \text{rank order}}{\#\,\text{observations}}$   [0, 100]
Robust scaling      $x' = \frac{x - \text{median}(x)}{\text{IQR}(x)}$                     mostly [-1, 1]
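
The techniques in the table can be written as short Python functions. This is a sketch under a few assumptions: the population standard deviation is used for the z-score, quartiles come from `statistics.quantiles` with its default method, and ties in the rank version are broken by position.

```python
import math
import statistics

def magnitude_scale(xs):
    largest = max(abs(x) for x in xs)
    return [x / largest for x in xs]

def sigmoid(xs):
    return [1 / (1 + math.exp(-x)) for x in xs]

def min_max(xs):
    low, high = min(xs), max(xs)
    return [(x - low) / (high - low) for x in xs]

def z_score(xs):
    mean, std = statistics.mean(xs), statistics.pstdev(xs)
    return [(x - mean) / std for x in xs]

def rank_scale(xs):
    """100 * rank / number of observations, with ranks starting at 1."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    scaled = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        scaled[i] = 100 * rank / len(xs)
    return scaled

def robust_scale(xs):
    q1, _, q3 = statistics.quantiles(xs, n=4)
    median = statistics.median(xs)
    return [(x - median) / (q3 - q1) for x in xs]

ages = [15, 30, 45, 60, 80]
ages_min_max = min_max(ages)
ages_z = z_score(ages)
ages_robust = robust_scale(ages)
```

All of these functions divide by a spread measure, so a constant column (zero range, zero standard deviation, or zero IQR) needs to be handled separately before scaling.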

Measures scaling
Normalization techniques are also used to scale measurements in different scales to
the same scale

Example
Tripadvisor’s reviews rating scale: [1, 5]
Booking.com’s reviews rating scale: [2.5, 10]

Min-max scale to convert an 8.1 rating in Booking.com to 0-10 scale

$\text{scale} = 10 \times x' = 10 \times \frac{x - x_{min}}{x_{max} - x_{min}} = 10 \times \frac{8.1 - 2.5}{10 - 2.5} = 10 \times \frac{5.6}{7.5} = 10 \times 0.7467 \approx 7.5$

Min-max scale to convert a 4 rating in Tripadvisor to 0-10 scale

$\text{scale} = 10 \times x' = 10 \times \frac{x - x_{min}}{x_{max} - x_{min}} = 10 \times \frac{4 - 1}{5 - 1} = 10 \times \frac{3}{4} = 10 \times 0.75 = 7.5$
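
The same arithmetic as a small function, for checking the two conversions. The function name and the parameter names are hypothetical.

```python
def rescale(rating, source_min, source_max, target_min=0.0, target_max=10.0):
    """Min-max scale a rating from [source_min, source_max] to [target_min, target_max]."""
    share = (rating - source_min) / (source_max - source_min)
    return target_min + share * (target_max - target_min)

booking = rescale(8.1, source_min=2.5, source_max=10.0)     # about 7.47
tripadvisor = rescale(4.0, source_min=1.0, source_max=5.0)  # 7.5
```

Rounded to one decimal place, both ratings land on 7.5 on the common 0-10 scale.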

Feature engineering
Data transformation

Feature engineering
The creation of new features (also known as “derived variables” or
“derived attributes”) adds more value to the quality of the
data than any other modeling step

Distributions and possible “corrections”

Abbott (2014)

Binning (discretizing) variables (1/2)

Abbott (2014)

www.towardsdatascience.com

Binning (discretizing) variables (2/2)

Other possible transformations

§ Reciprocal transformation: $\frac{1}{x}$
§ Square root transformation: $\sqrt{x}$
§ Exponential transformation: $e^{x}$
§ Box-Cox transformation: $\frac{x^{\gamma} - 1}{\gamma}$ for $\gamma \neq 0$; $\log x$ for $\gamma = 0$
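
A direct transcription of the Box-Cox definition, as a sketch. The function name is hypothetical and the parameter is called gamma to match the slide; the transformation is only defined for positive inputs.

```python
import math

def box_cox(x, gamma):
    """Box-Cox transformation of a positive value x with parameter gamma."""
    if x <= 0:
        raise ValueError("Box-Cox requires x > 0")
    if gamma == 0:
        return math.log(x)
    return (x ** gamma - 1) / gamma

spend = [1.0, 4.0, 9.0, 100.0]
spend_sqrt_like = [box_cox(v, 0.5) for v in spend]  # gamma = 0.5
spend_log = [box_cox(v, 0) for v in spend]          # gamma = 0
```

Choosing gamma is a separate step (it is usually estimated from the data), which this sketch leaves to the caller.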

Encode categorical variables – Label encoding (1/3)

Numerical algorithms such as Linear
Regression or K-MEANS require inputs to be
numerical.
One way to use categorical variables when
there is an inherent order to the different levels
is to assign a number to each level

Before:
CustomerID   Spent   Education
1            € 100   Bachelor
2            € 120   Master
3            € 110   Doctorate
4            € 140   Master

After:
CustomerID   Spent   Education
1            € 100   1
2            € 120   2
3            € 110   3
4            € 140   2
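
The table above can be reproduced with an explicit mapping from level to number. The dictionary name and the record layout are assumptions; the key point is that the order is stated by the analyst, not inferred from the data.

```python
# Explicit order of the education levels (lowest to highest).
EDUCATION_ORDER = {"Bachelor": 1, "Master": 2, "Doctorate": 3}

customers = [
    {"customer_id": 1, "spent": 100, "education": "Bachelor"},
    {"customer_id": 2, "spent": 120, "education": "Master"},
    {"customer_id": 3, "spent": 110, "education": "Doctorate"},
    {"customer_id": 4, "spent": 140, "education": "Master"},
]

# Replace each level with its number, leaving the other fields unchanged.
encoded = [{**c, "education": EDUCATION_ORDER[c["education"]]} for c in customers]
```

An unknown level raises a KeyError here, which makes unexpected categories visible instead of silently encoding them.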

Encode categorical variables – One-hot encoding (2/3)

When there is no inherent order
in the levels, the most common
approach is to create dummy
variables

Before:
CustomerID   Spent   Segment
1            € 100   Corporate
2            € 120   SME
3            € 110   Individual

After:
CustomerID   Spent   Corporate   SME   Individual
1            € 100   1           0     0
2            € 120   0           1     0
3            € 110   0           0     1
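
A minimal sketch of dummy-variable creation for the Segment column. The function name is hypothetical, and levels are ordered by first appearance.

```python
def one_hot(values):
    """Return one dict per value with a 0/1 entry for every level."""
    levels = list(dict.fromkeys(values))  # unique levels, in order of first appearance
    return [{level: int(v == level) for level in levels} for v in values]

segments = ["Corporate", "SME", "Individual"]
dummies = one_hot(segments)
```

Each row has exactly one dummy set to 1, so the dummies of a row always sum to 1.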

Encode categorical variables (3/3)
Approach to handling high cardinality:
§ Encode categorical variables using an encoder that does not
generate a column for each value/level of the categorical
variable (e.g., the count or probability of observations that have
that value/level)
§ If there is a hierarchy, consider using higher levels only. For
example, if you have street, city, and region, consider using only
city and region, or even just region
§ For values/levels present in more than a predetermined
threshold of observations (e.g., 2%) create dummy variables

Example with a 30% threshold:

Before:
CustomerID   Spent   Segment
1            € 100   Corporate
2            € 120   SME
3            € 110   Individual
4            € 105   Corporate

After:
CustomerID   Spent   Segment   Corporate
1            € 100   2/4       1
2            € 120   1/4       0
3            € 110   1/4       0
4            € 105   2/4       1
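
The example table can be reproduced with a frequency encoder plus a threshold rule for dummies. The function names are hypothetical, and the 30% threshold is taken from the example rather than being a general recommendation.

```python
from collections import Counter

def frequency_encode(values):
    """Replace each level with the share of observations that have that level."""
    counts = Counter(values)
    return [counts[v] / len(values) for v in values]

def frequent_levels(values, threshold):
    """Levels present in more than `threshold` (a share) of the observations."""
    counts = Counter(values)
    return [level for level, n in counts.items() if n / len(values) > threshold]

segments = ["Corporate", "SME", "Individual", "Corporate"]
segment_freq = frequency_encode(segments)
dummy_levels = frequent_levels(segments, threshold=0.30)

# One dummy column per frequent level; rare levels get no column of their own.
dummy_columns = {
    level: [int(s == level) for s in segments] for level in dummy_levels
}
```

Only Corporate (50% of the observations) passes the threshold, so it is the only level that receives a dummy column.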

Date/time variables
Datasets are two-dimensional, so, when models require the
introduction of time, transformations are necessary to include a
third dimension (time).

Usually, date/time variables are converted to numeric units related
to the outcome to be analyzed. For example:
§ The date the customer was offered a quotation for a loan could be
converted to the number of days before the mortgage was signed
or the number of days since a certain date (e.g., 2000-01-01)
§ The date of mortgage signature could be converted to the day
in the year or the week number
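
These conversions map directly onto the standard `datetime` module. The two dates below are hypothetical, and the week number follows the ISO definition.

```python
from datetime import date

quotation_date = date(2024, 3, 1)    # hypothetical quotation date
signature_date = date(2024, 4, 15)   # hypothetical mortgage signature date
REFERENCE_DATE = date(2000, 1, 1)

# Days between the quotation and the signature.
days_before_signature = (signature_date - quotation_date).days

# Days elapsed since a fixed reference date.
days_since_reference = (quotation_date - REFERENCE_DATE).days

# Position of the signature within its year.
day_of_year = signature_date.timetuple().tm_yday
week_number = signature_date.isocalendar()[1]
```

Each derived value is a plain integer, so it can be used directly by numerical algorithms.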

Multidimensional features
The most powerful of features. The two most common examples
are:
§Interactions: multiplication of variables
§Ratios: division of variables
Usually, domain expertise is required to understand which
interactions, and above all, which ratios may have modeling value.

Multidimensional features - ratios
Ratios are important because they are difficult for most algorithms to
uncover. Ratios can:
§ Provide a normalized version of a variable. For example, a
percentage (e.g., a customer website purchase ratio =
$\frac{\text{number of purchases}}{\text{customer website visits}}$)
§ Incorporate complex ideas. For example, the claims received
to premiums paid in an insurance company =
$\frac{\text{claims received}}{\text{premiums paid}}$
§ Make models “live” longer. For example, a model for real
estate property value, instead of using each property price, due to the
increasing trend in prices, could use =
$\frac{\text{property price (m²)}}{\text{average property price (m²)}}$
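
A short sketch of a ratio feature and, for comparison, an interaction feature. The field names and the choice to return None for an undefined ratio are assumptions.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None

# Hypothetical customers with website activity.
customers = [
    {"purchases": 3, "visits": 12},
    {"purchases": 0, "visits": 0},
]

for c in customers:
    c["purchase_ratio"] = safe_ratio(c["purchases"], c["visits"])  # ratio
    c["purchases_x_visits"] = c["purchases"] * c["visits"]         # interaction
```

A customer with no visits has no meaningful purchase ratio, so the result is left missing and can be handled with the missing-value approaches described earlier.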

6.5 Data integration
Data preparation

Merge data
Joining data that comes from two or more
databases about the unit of analysis under study

[Figure: stocks history, social reputation, currencies, and national official statistics are merged to produce a stocks forecast]
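
A merge on a shared key can be sketched with dictionaries. The two sources, the field names, and the use of the date as the unit of analysis are illustrative assumptions.

```python
# Two hypothetical sources keyed by date.
stock_history = {"2024-01-02": 101.5, "2024-01-03": 102.1}
exchange_rates = {"2024-01-02": 1.09, "2024-01-04": 1.10}

# Inner join: keep only dates present in both sources.
inner_joined = [
    {"date": d, "close": stock_history[d], "eur_usd": exchange_rates[d]}
    for d in sorted(stock_history.keys() & exchange_rates.keys())
]

# Left join: keep every stock date; the rate is None when it is unavailable.
left_joined = [
    {"date": d, "close": close, "eur_usd": exchange_rates.get(d)}
    for d, close in stock_history.items()
]
```

The left join preserves all instances of the main source but introduces missing values, which then need one of the missing-value treatments from the data cleaning step.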

Reformat data
Apply syntactic modifications that do not change data meaning,
but are required for modeling, for example:
§Remove commas from text fields if the dataset is supposed to be
saved as comma separated values
§Remove any ordering that might exist in the observations
§Trim some variables (e.g., text variables) to a certain maximum size

Data Science for Marketing
© 2021-2024 Nuno António (Rev. 2024-08-28)