IntroToDataMining: Key Components and Process
Data Mining Overview
Core Ideas in Data Mining
 Classification – classifying the outcome of a
categorical variable based on historical data
 Prediction – predicting the value of a numerical
output (target) variable based on other variables or
features
 Association Rules & Recommenders
 Data & Dimension Reduction
 Data Exploration
 Visualization
Supervised Learning
 Predictive modeling describes supervised learning:
the training data we feed the algorithm include the
target variable, so we are “supervising” the training
process
 Training data: records where the target value is known
 Scoring data: records where the target value is unknown
and must be predicted
 Methods: Classification and Prediction
Supervised: Classification
 The objective of classification is to predict a categorical
variable
 Examples: Purchase/no purchase, fraud/no fraud,
creditworthy/not creditworthy…
 Target variable is often binary (yes/no)
Supervised: Prediction
 Goal: Predict numerical target (outcome) variable
 Examples: sales, revenue, performance
 As in classification:
 Each row is a case (customer, tax return, applicant)
 Each column is a variable
 Taken together, classification and prediction
constitute “predictive analytics”
Unsupervised Learning
 Performs an analysis without a target variable
 Goal: Segment data into meaningful groups;
detect patterns
 There is no target (outcome) variable to predict or
classify
 Example: Clustering – segments the observations
into similar groups based on the variables
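The clustering idea above can be sketched with base R's kmeans(); the iris data and the choice of k = 3 are illustrative assumptions, not part of the slides:

```r
# Sketch: k-means segments observations into similar groups.
# iris and k = 3 are arbitrary illustrative choices.
set.seed(123)
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)
table(km$cluster)  # size of each segment
```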
Unsupervised: Association Rules
 Goal: Produce rules that define “what goes with what”
 Example: “If X was purchased, Y was also purchased”
 Rows are transactions
 Used in recommender systems – “Our records show
you bought X, you may also like Y”
 Also called “affinity analysis”
Unsupervised: Data Reduction
 Distillation of complex/large data into
simpler/smaller data
 Reducing the number of variables/columns (e.g.,
principal components)
 Reducing the number of records/rows (e.g.,
clustering)
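As a minimal sketch of column reduction via principal components, using base R's prcomp() on the iris measurements (an illustrative dataset choice):

```r
# Sketch: reduce four numeric columns to two principal components.
pca <- prcomp(iris[, 1:4], scale. = TRUE)  # center and scale first
summary(pca)             # proportion of variance explained per component
reduced <- pca$x[, 1:2]  # keep only the first two components
dim(reduced)             # 150 rows, 2 columns
```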
Unsupervised: Data Visualization
 Graphs and plots of data
 Histograms, boxplots, bar charts, scatterplots
 Especially useful to examine relationships between
pairs of variables
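A few of these plots in base R graphics, with iris as an illustrative stand-in dataset:

```r
# Sketch: basic exploratory plots (iris is an illustrative choice).
hist(iris$Sepal.Length)                       # distribution of one variable
boxplot(Sepal.Length ~ Species, data = iris)  # numeric vs. categorical
pairs(iris[, 1:4])                            # pairwise scatterplots
```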
Data Exploration
 Data sets are typically large, complex & messy
 Need to review the data to help refine the task
 Use techniques of Reduction and Visualization
The Process of Data Mining
Steps in Data Mining
1. Define/understand purpose
2. Obtain data (may involve random sampling)
3. Explore, clean, pre-process data
4. Reduce the data; if supervised DM, partition it
5. Specify task (classification, clustering, etc.)
6. Choose the techniques (regression, neural
networks, etc.)
7. Iterative implementation and “tuning”
8. Assess results – compare models
9. Deploy best model
Test Partition
 When a model is developed on training
data, it can overfit the training data
(hence need to assess on validation)
 Assessing multiple models on same
validation data can overfit validation data
 Some methods use the validation data to
choose a parameter. This too can lead to
overfitting the validation data
 Solution: final selected model is applied
to a test partition to give unbiased
estimate of its performance on new data
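The three partitions can be sketched in base R; iris and the 60/20/20 proportions are arbitrary illustrative choices:

```r
# Sketch: split data into train / validation / test partitions (60/20/20).
set.seed(123)
n <- nrow(iris)
labels <- sample(rep(c("train", "valid", "test"),
                     times = round(n * c(0.6, 0.2, 0.2))))
trainData <- iris[labels == "train", ]
validData <- iris[labels == "valid", ]
testData  <- iris[labels == "test", ]
c(train = nrow(trainData), valid = nrow(validData), test = nrow(testData))
```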
Data Splitting
 Data are split into training and test sets.
 Training data is used to train the model
 Testing data is used to estimate an unbiased assessment of the
model’s performance
# Randomly split the iris data into 70% train / 30% test sets
set.seed(123)  # for reproducibility
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
trainData <- iris[ind == 1, ]
testData  <- iris[ind == 2, ]
Simple Random Sampling
 All observations have an equal chance of selection
# Using base R (ames: Ames housing data, e.g., from AmesHousing::make_ames())
set.seed(123)  # for reproducibility
index_1 <- sample(1:nrow(ames), round(nrow(ames) * 0.7))
train_1 <- ames[index_1, ]
test_1  <- ames[-index_1, ]
Rare Event Oversampling
 Often the event of interest is rare
 Examples: response to mailing, fraud in taxes, …
 Sampling may yield too few “interesting” cases to
effectively train a model
 A popular solution: oversample the rare cases to
obtain a more balanced training set
 Later, need to adjust results for the oversampling
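A minimal base-R sketch of oversampling, assuming a made-up binary outcome y with 5% rare cases; the data frame and the 50/50 target ratio are illustrative assumptions:

```r
# Sketch: oversample the rare class to balance the training set.
set.seed(123)
df <- data.frame(y = factor(c(rep("no", 95), rep("yes", 5))), x = rnorm(100))
common <- df[df$y == "no", ]
rare   <- df[df$y == "yes", ]
rare_over <- rare[sample(nrow(rare), nrow(common), replace = TRUE), ]
balanced  <- rbind(common, rare_over)
table(balanced$y)  # classes are now balanced
```

Because performance is later assessed on unmodified data, results (e.g., predicted probabilities) must be adjusted for the oversampling, as the slide notes.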
Stratified Sampling
 Common in classification problems where the response
variable may be severely imbalanced
 Example: 90% of the observations have a “yes” response and
10% have a “no” response.
 Solution: sample separately within each response class (for a
continuous response, segment into quantiles and sample from each)
# initial_split() is from the rsample package; the attrition data
# are available via the modeldata package
library(rsample)
set.seed(123)
split_strat <- initial_split(attrition, prop = 0.7,
                             strata = "Attrition")
train_strat <- training(split_strat)
test_strat  <- testing(split_strat)