IntroToDataMining: Key Components and Process
Data Mining Overview
Core Ideas in Data Mining
 Classification – classifying the outcome of a
categorical variable based on historical data
 Prediction – predicting the value of a numerical
output (target) variable based on other variables or
features
 Association Rules & Recommenders
 Data & Dimension Reduction
 Data Exploration
 Visualization
Supervised Learning
 Predictive modeling describes supervised learning:
the training data we feed the algorithm include the
target variable, so we are “supervising” the training
process
 Training data: records where the target value is known
 Scoring data: records where the target value is unknown
and must be predicted
 Methods: Classification and Prediction
Supervised: Classification
 The objective of classification is to predict a categorical
variable
 Examples: Purchase/no purchase, fraud/no fraud,
creditworthy/not creditworthy…
 Target variable is often binary (yes/no)
Supervised: Prediction
 Goal: Predict numerical target (outcome) variable
 Examples: sales, revenue, performance
 As in classification:
 Each row is a case (customer, tax return, applicant)
 Each column is a variable
 Taken together, classification and prediction
constitute “predictive analytics”
Unsupervised Learning
 Performs an analysis without a target variable
 Goal: Segment data into meaningful groups;
detect patterns
 There is no target (outcome) variable to predict or
classify
 Example: Clustering – segments the observations
into similar groups based on the variables
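The clustering idea above can be sketched with base R's kmeans(); the iris data and the choice of k = 3 are illustrative assumptions, not part of the slides:

```r
# Sketch: k-means segments observations into similar groups.
# iris and k = 3 are arbitrary illustrative choices.
set.seed(123)
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)
table(km$cluster)  # size of each segment
```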
Unsupervised: Association Rules
 Goal: Produce rules that define “what goes with what”
 Example: “If X was purchased, Y was also purchased”
 Rows are transactions
 Used in recommender systems – “Our records show
you bought X, you may also like Y”
 Also called “affinity analysis”
Unsupervised: Data Reduction
 Distillation of complex/large data into
simpler/smaller data
 Reducing the number of variables/columns (e.g.,
principal components)
 Reducing the number of records/rows (e.g.,
clustering)
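As a minimal sketch of column reduction via principal components, using base R's prcomp() on the iris measurements (an illustrative dataset choice):

```r
# Sketch: reduce four numeric columns to two principal components.
pca <- prcomp(iris[, 1:4], scale. = TRUE)  # center and scale first
summary(pca)             # proportion of variance explained per component
reduced <- pca$x[, 1:2]  # keep only the first two components
dim(reduced)             # 150 rows, 2 columns
```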
Unsupervised: Data Visualization
 Graphs and plots of data
 Histograms, boxplots, bar charts, scatterplots
 Especially useful to examine relationships between
pairs of variables
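A few of these plots in base R graphics, with iris as an illustrative stand-in dataset:

```r
# Sketch: basic exploratory plots (iris is an illustrative choice).
hist(iris$Sepal.Length)                       # distribution of one variable
boxplot(Sepal.Length ~ Species, data = iris)  # numeric vs. categorical
pairs(iris[, 1:4])                            # pairwise scatterplots
```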
Data Exploration
 Data sets are typically large, complex & messy
 Need to review the data to help refine the task
 Use techniques of Reduction and Visualization
The Process of Data Mining
Steps in Data Mining
1. Define/understand purpose
2. Obtain data (may involve random sampling)
3. Explore, clean, pre-process data
4. Reduce the data; if supervised DM, partition it
5. Specify task (classification, clustering, etc.)
6. Choose the techniques (regression, neural
networks, etc.)
7. Iterative implementation and “tuning”
8. Assess results – compare models
9. Deploy best model
Test Partition
 When a model is developed on training
data, it can overfit the training data
(hence need to assess on validation)
 Assessing multiple models on same
validation data can overfit validation data
 Some methods use the validation data to
choose a parameter. This too can lead to
overfitting the validation data
 Solution: final selected model is applied
to a test partition to give unbiased
estimate of its performance on new data
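The three partitions can be sketched in base R; iris and the 60/20/20 proportions are arbitrary illustrative choices:

```r
# Sketch: split data into train / validation / test partitions (60/20/20).
set.seed(123)
n <- nrow(iris)
labels <- sample(rep(c("train", "valid", "test"),
                     times = round(n * c(0.6, 0.2, 0.2))))
trainData <- iris[labels == "train", ]
validData <- iris[labels == "valid", ]
testData  <- iris[labels == "test", ]
c(train = nrow(trainData), valid = nrow(validData), test = nrow(testData))
```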
Data Splitting
 Data are split into training and test sets.
 Training data is used to train the model
 Testing data is used to estimate an unbiased assessment of the
model’s performance
# Randomly split the iris data into 70% train / 30% test sets
set.seed(123)  # for reproducibility
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
trainData <- iris[ind == 1, ]
testData  <- iris[ind == 2, ]
Simple Random Sampling
 All observations have an equal chance of selection
# Using base R (ames: Ames housing data, e.g., from AmesHousing::make_ames())
set.seed(123)  # for reproducibility
index_1 <- sample(1:nrow(ames), round(nrow(ames) * 0.7))
train_1 <- ames[index_1, ]
test_1  <- ames[-index_1, ]
Rare Event Oversampling
 Often the event of interest is rare
 Examples: response to mailing, fraud in taxes, …
 Sampling may yield too few “interesting” cases to
effectively train a model
 A popular solution: oversample the rare cases to
obtain a more balanced training set
 Later, need to adjust results for the oversampling
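A minimal base-R sketch of oversampling, assuming a made-up binary outcome y with 5% rare cases; the data frame and the 50/50 target ratio are illustrative assumptions:

```r
# Sketch: oversample the rare class to balance the training set.
set.seed(123)
df <- data.frame(y = factor(c(rep("no", 95), rep("yes", 5))), x = rnorm(100))
common <- df[df$y == "no", ]
rare   <- df[df$y == "yes", ]
rare_over <- rare[sample(nrow(rare), nrow(common), replace = TRUE), ]
balanced  <- rbind(common, rare_over)
table(balanced$y)  # classes are now balanced
```

Because performance is later assessed on unmodified data, results (e.g., predicted probabilities) must be adjusted for the oversampling, as the slide notes.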
Stratified Sampling
 Common in classification problems where the response
variable may be severely imbalanced
 Example: 90% of the observations have a “yes” response and
10% have a “no” response.
 Solution: sample separately within each response class (for a
continuous response, segment into quantiles and sample from each)
# initial_split() is from the rsample package; the attrition data
# are available via the modeldata package
library(rsample)
set.seed(123)
split_strat <- initial_split(attrition, prop = 0.7,
                             strata = "Attrition")
train_strat <- training(split_strat)
test_strat  <- testing(split_strat)