STATISTICAL LEARNING
STÉPHANIE VAN DEN BERG
MARYAM AMIR HAERI
RECAP OF THIS COURSE
• Linear regression
• Logistic regression
• Data pre-processing
• Feature extraction and selection
• Next step: machine learning
BASICS OF STATISTICAL/MACHINE LEARNING
AN EXAMPLE
▪ IBM Watson
▪ Watson is a question-answering computer system capable of answering questions posed in natural language, developed in IBM's DeepQA project. It can handle spoken language.
▪ Jeopardy
PROCESS OF TEXT MINING AND STATISTICAL LEARNING
▪ Text preprocessing
▪ Feature generation
▪ Feature selection
▪ Mining
▪ Analysis of results
[Pipeline: Text preprocessing → Feature generation → Feature selection → Mining → Analysis of results]
26-11-2020
MACHINE LEARNING METHODS
Source: https://www.diegocalvo.es/en/machine-learning-supervised-unsupervised/
▪ Supervised Learning: Classification, Regression
▪ Unsupervised Learning: Clustering
https://datasolut.com/wiki/unsupervised-learning/
SUPERVISED VS UNSUPERVISED
In supervised learning, a function 𝐹 is sought that
predicts the outcome 𝑦 based on the input 𝑥.
𝑦 = 𝐹(𝑥)
The function 𝑭 is trained on known cases.
Unsupervised learning is more about recognizing
groups of observations 𝑥 that have more in common
with each other than with the other observations.
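The contrast above can be sketched in a few lines of plain Python. All data values below are made up for illustration: the supervised part fits a least-squares line as 𝐹, the unsupervised part performs one assignment step of 2-means with assumed initial centres.

```python
# Supervised: outcomes y are known for the training inputs x, and we
# learn a function F with y = F(x).  Here F is a least-squares line
# y = w * x fitted through the origin.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]                      # known outcomes (labels)
w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def predict(x):
    return w * x                               # F(x) for a new, unseen x

# Unsupervised: only inputs, no outcomes.  We look for groups of
# observations that are more similar to each other than to the rest;
# here, one assignment step of 2-means with assumed centres c0 and c1.
points = [0.2, 0.4, 0.3, 5.1, 5.3, 4.9]
c0, c1 = 0.0, 5.0
clusters = [0 if abs(p - c0) < abs(p - c1) else 1 for p in points]

print(round(predict(5.0), 2))   # -> 9.95
print(clusters)                 # -> [0, 0, 0, 1, 1, 1]
```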
CLASSIFICATION
[Figure: the training phase builds a model from labeled data; the prediction phase applies it to new data]
THIS COURSE
▪ Supervised learning
▪ Focus on Classification and regression trees
▪ KNN
▪ Evaluation of classification
▪ Accuracy, sensitivity, specificity, PPV, NPV, F-score
CLASSIFICATION TASK
▪ Training data (X1, X2, …, Xn, Y) → Learning Algorithm → Learn Model → Model
▪ New data (X1, X2, …, Xn) → Apply Model → predicted Y

Training data:
Years of Experience | Behavior test results | University degree | Past work performance score | Qualified for the job
10 | 85 | Master | 80 | Yes
4 | 55 | PhD | 65 | No
7 | 80 | Master | 90 | Yes
3 | 40 | PhD | 50 | No
CLASSIFICATION TASK
Training data (Learning Algorithm → Learn Model → Model):
Years of Experience | Behavior test results | University degree | Past work performance score | Qualified for the job
10 | 85 | Master | 80 | Yes
4 | 55 | PhD | 65 | No
7 | 80 | Master | 90 | Yes
3 | 40 | PhD | 50 | No

New, unlabeled data (Model → deduction → Apply Model):
Years of Experience | Behavior test results | University degree | Past work performance score
5 | 35 | Master | 60
3 | 55 | Master | 85
9 | 87 | PhD | 75
8 | 90 | Master | 30
CLASSIFICATION TASK
Training data (Learning Algorithm → Learn Model → Model):
Years of Experience | Behavior test results | University degree | Past work performance score | Qualified for the job
10 | 85 | Master | 80 | Yes
4 | 55 | PhD | 65 | No
7 | 80 | Master | 90 | Yes
3 | 40 | PhD | 50 | No

Applying the model to the new data (deduction) yields the predicted output:
Years of Experience | Behavior test results | University degree | Past work performance score | Predicted output: accept for job?
5 | 35 | Master | 60 | No
3 | 55 | Master | 85 | Yes
9 | 87 | PhD | 75 | Yes
8 | 90 | Master | 30 | No
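The learn-then-apply workflow in these slides can be imitated with a deliberately tiny learner. The "decision stump" below is a made-up stand-in that uses only the behavior-test-result column, so its predictions need not match the model implied by the slides:

```python
# Learn-then-apply with a one-rule "decision stump" (illustrative only).
train = [(85, "Yes"), (55, "No"), (80, "Yes"), (40, "No")]

# Learn: find the threshold t that best classifies the training data
# with the rule "behavior test score >= t -> Yes".
def training_accuracy(t):
    return sum((s >= t) == (label == "Yes") for s, label in train)

threshold = max({s for s, _ in train}, key=training_accuracy)

# Apply: score the unlabeled applicants with the learned rule.
new_scores = [35, 55, 87, 90]
preds = ["Yes" if s >= threshold else "No" for s in new_scores]
print(threshold, preds)   # -> 80 ['No', 'No', 'Yes', 'Yes']
```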
OUTPUT OF A CLASSIFIER
▪ Label: Input → Classifier → 𝑌, with 𝑌 ∈ {0, 1};
  i.e. the output can be a label, for example class zero or class one.
▪ Score (rank): Input → Classifier → 𝑌, where 𝑌 can be any number
  in a given range, for example [0, 1].
EXAMPLE OF DECISION TREE CLASSIFIER
Years of experience?
├─ ≥ 5 → Past work performance score?
│        ├─ ≥ 70 → Behavior test results?
│        │         ├─ ≥ 60 → Yes
│        │         └─ < 60 → No
│        └─ < 70 → No
└─ < 5 → Behavior test results?
         ├─ ≥ 60 → University degree?
         │         ├─ PhD    → Yes
         │         └─ Master → No
         └─ < 60 → No
▪ Applying the tree to a new applicant:

Years of Experience | Behavior test results | University degree | Past work performance score | Qualified for the job
10 | 85 | Master | 80 | ?

▪ Years of experience 10 ≥ 5, past work performance score 80 ≥ 70, and behavior test results 85 ≥ 60, so the tree predicts: Yes.
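The decision tree above can be transcribed directly as nested rules, with the applicant from the slide as a check:

```python
def qualified(years, test_score, degree, performance):
    """The decision tree above, transcribed as nested if/else rules."""
    if years >= 5:
        if performance >= 70:
            return "Yes" if test_score >= 60 else "No"
        return "No"                       # performance score below 70
    if test_score >= 60:
        return "Yes" if degree == "PhD" else "No"
    return "No"                           # behavior test result below 60

# The applicant from the slide: 10 years, test 85, Master, performance 80.
print(qualified(10, 85, "Master", 80))    # -> Yes
```

The same function also reproduces the four labeled rows of the training table from the earlier slides.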
LOGISTIC REGRESSION CLASSIFIER
▪ Provides a score as the output
▪ The score can be transformed into a label (e.g. Yes/No)
https://www.javatpoint.com/logistic-regression-in-machine-learning
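A sketch of how a score becomes a label. The weights below are made-up for illustration, and the threshold of 0.5 is the usual default rather than something fixed by the slide:

```python
import math

w, b = 0.8, -4.0                    # hypothetical fitted weight and bias

def score(x):
    """Logistic-regression score: a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def label(x, threshold=0.5):
    """Transform the score into a Yes(1)/No(0) label."""
    return 1 if score(x) >= threshold else 0

print(label(7.0))   # score(7.0) is above 0.5, so class 1
print(label(1.0))   # score(1.0) is below 0.5, so class 0
```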
CLASSIFICATION AND REGRESSION TREES
DECISION TREE
https://www.aunalytics.com/decision-trees-an-overview/
REGRESSION TREE
[Figure: regression tree predicting home value, with leaf values 98k, 150k, 250k, and 300k]
https://www.aunalytics.com/decision-trees-an-overview/
INTERPRETATION
▪ Can you describe the rules that come with this tree?
▪ If Income ≤ 50k and Age ≤ 50, then the average home value is 98k
▪ If Income ≤ 50k and Age > 50, then the average home value is 150k
▪ If Income > 50k and Age ≤ 50, then the average home value is 250k
▪ If Income > 50k and Age > 50, then the average home value is 300k
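The four rules translate directly into a small function (values in thousands, as in the tree):

```python
def home_value(income, age):
    """Regression-tree rules above: average home value in thousands."""
    if income <= 50:
        return 98 if age <= 50 else 150
    return 250 if age <= 50 else 300

print(home_value(40, 35))   # Income <= 50k, Age <= 50 -> 98
print(home_value(80, 60))   # Income > 50k,  Age > 50  -> 300
```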
EXERCISE: SHOULD I PLAY BASEBALL?
▪ Please describe the set of rules for the following classification tree:
K NEAREST NEIGHBOR
Source: https://www.datacamp.com/community/tutorials/k-nearest-neighbor-classification-scikit-learn
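The idea behind the linked tutorial can be shown without any library: classify a query point by the majority label among its k closest training points. The 2-D toy data below is made up for illustration:

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """k-nearest-neighbour classification by squared Euclidean distance."""
    nearest = sorted(
        train,
        key=lambda p: (p[0][0] - query[0]) ** 2 + (p[0][1] - query[1]) ** 2,
    )[:k]
    # Majority vote among the k closest training points.
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((6, 6), "B"), ((6, 7), "B"), ((7, 6), "B")]
print(knn_predict(train, (2, 2)))   # near the "A" cluster -> A
print(knn_predict(train, (6, 5)))   # near the "B" cluster -> B
```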
TWO ISSUES IN MACHINE LEARNING
• CURSE OF DIMENSIONALITY
• OVER-FITTING YOUR MODEL
ISSUE 1: CURSE OF DIMENSIONALITY
▪ To train more complex models, exponentially more data is needed.
▪ Let’s go back to the example of the regression trees.
HOW MUCH DATA WOULD YOU NEED?
▪ Imagine a data set with 1000 individuals for a binary classification:
▪ 1 variable with two categories: 2 cells with roughly 500 observations each
▪ 2 variables with two categories each: 4 cells with roughly 250 observations each
▪ 3 variables with two categories each: 8 cells with roughly 125 observations each
▪ 4 variables with two categories each: 16 cells with roughly 60 observations each
HOW MUCH DATA WOULD YOU NEED?
▪ With 4 times more variables, you need almost 10 times more cases.
▪ As a consequence, for large models you will have very sparse areas without many observations.
▪ Imagine what would happen with more categories per variable, and even more than 4 variables.
HOW MUCH DATA WOULD YOU NEED?
▪ Rule of thumb: 50 respondents per combination of categories
▪ With 3 variables: 3 × 2 × 2 = 12 combinations, therefore 12 × 50 = 600 observations
▪ With 4 variables: 3 × 3 × 2 × 2 = 36 combinations, therefore 36 × 50 = 1800 observations
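The rule of thumb is a one-line calculation: multiply the number of categories per variable and take 50 respondents per resulting cell. A quick check of the two examples above:

```python
from math import prod

def needed_observations(categories_per_variable, per_cell=50):
    """Observations required: 50 per combination of categories."""
    return prod(categories_per_variable) * per_cell

print(needed_observations([3, 2, 2]))      # 12 combinations -> 600
print(needed_observations([3, 3, 2, 2]))   # 36 combinations -> 1800
```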
ISSUE 2: OVER-FITTING YOUR MODEL
▪ A model might perfectly describe the data you train it on; however, for future unseen cases it might be pretty far off.
▪ Technically, if the number of variables in your model is as large as the number of observations, you can achieve a perfect fit on the training data.
OVERFITTING
Source: https://datascience.foundation/sciencewhitepaper/underfitting-and-overfitting-in-machine-learning
EXAMPLE OF K-NEAREST NEIGHBOR
ISSUE OF OVER-FITTING
▪ When k-nearest neighbour is applied as the machine learning algorithm, the optimal value for k has to be chosen.
▪ K = 1: over-fits the data
▪ K = optimum value
▪ K = very high: under-fits the data
▪ Because of this, various settings of the algorithm have to be tried. The difference in fit between the training set and the test set is an indication of over-fitting.
▪ Usually the k-value with the best fit on the test set is chosen as optimal.
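The effect of k can be seen on a tiny 1-D data set: a noise point makes k = 1 over-fit, while a k as large as the whole training set under-fits by always predicting the majority class. All values below are invented for illustration:

```python
from collections import Counter

def knn_predict(train, x, k):
    """1-D k-nearest-neighbour: majority label of the k closest points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [(1, "A"), (2, "A"), (3, "A"), (2.5, "B"),   # 2.5 is a noise point
         (7, "B"), (8, "B"), (9, "B")]
test_set = [(1.5, "A"), (2.6, "A"), (7.5, "B"), (8.5, "B")]

for k in (1, 3, 7):
    acc = sum(knn_predict(train, x, k) == y for x, y in test_set) / len(test_set)
    print(k, acc)   # k=1 over-fits the noise point, k=7 under-fits
```

On this toy split, k = 3 gives the best test accuracy, illustrating why the k with the best test-set fit is chosen.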
STEP 5: INTERPRETATION AND EVALUATION
HOW TO TEST THE QUALITY OF THE MACHINE LEARNING MODEL?
▪ With linear regression, you check R-squared.
▪ For classification algorithms, we instead use a matrix of (in)correct decisions: the confusion matrix. In case of two possible outcomes:
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL    Class=Yes   TP          FN
CLASS     Class=No    FP          TN
METRICS FOR PERFORMANCE EVALUATION
Confusion Matrix:
                        PREDICTED CLASS (R)
                        Class=1                 Class=0
ACTUAL     Class=1      TP (true positive)      FN (false negative)
CLASS (Y)  Class=0      FP (false positive)     TN (true negative)
EXAMPLE OF CONFUSION MATRIX
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL    Class=Yes   TP=400      FN=25
CLASS     Class=No    FP=50       TN=100
ACCURACY
▪ How many predictions were correct predictions (proportion)
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL    Class=Yes   TP=400      FN=25
CLASS     Class=No    FP=50       TN=100

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Error rate = 1 − Accuracy = (FP + FN) / (TP + TN + FP + FN)
ACCURACY
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL    Class=Yes   TP=400      FN=25
CLASS     Class=No    FP=50       TN=100

Accuracy = (TP + TN) / (TP + TN + FP + FN) = 500 / 575 = 0.869
Error rate = 1 − Accuracy = (FP + FN) / (TP + TN + FP + FN) = 75 / 575 = 0.131
SENSITIVITY (RECALL)
✓ Sensitivity focuses on the proportion of existing positive instances that are predicted as positive.
✓ Other names for sensitivity are True Positive Rate (TPR) and recall.
✓ How many of the YES-es were correctly identified (proportion)?

Sensitivity = TP / (TP + FN)

                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL    Class=Yes   TP          FN
CLASS     Class=No    FP          TN
SENSITIVITY (RECALL)
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL    Class=Yes   TP=400      FN=25
CLASS     Class=No    FP=50       TN=100

Sensitivity = TP / (TP + FN) = 400 / 425 = 0.941
SPECIFICITY
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL    Class=Yes   TP          FN
CLASS     Class=No    FP          TN

Specificity = TNR = TN / (TN + FP)

▪ Specificity measures the proportion of actual negative cases that are predicted correctly.
SPECIFICITY
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL    Class=Yes   TP=400      FN=25
CLASS     Class=No    FP=50       TN=100

Specificity = TNR = TN / (TN + FP) = 100 / 150 = 0.667
POSITIVE PREDICTIVE VALUE (PPV OR PRECISION)
✓ Positive predictive value (PPV), or precision, focuses on the proportion of instances predicted as positive that are actually positive.

                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL    Class=Yes   TP          FN
CLASS     Class=No    FP          TN

PPV = Precision = TP / (TP + FP)
POSITIVE PREDICTIVE VALUE (PPV OR PRECISION)
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL    Class=Yes   TP=400      FN=25
CLASS     Class=No    FP=50       TN=100

PPV = Precision = TP / (TP + FP) = 400 / 450 = 0.889
NEGATIVE PREDICTIVE VALUE (NPV)
▪ The proportion of predicted negatives that are real negatives. It reflects the probability that a predicted negative is a true negative.

                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL    Class=Yes   TP          FN
CLASS     Class=No    FP          TN

NPV = TN / (TN + FN)
NEGATIVE PREDICTIVE VALUE (NPV)
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL    Class=Yes   TP=400      FN=25
CLASS     Class=No    FP=50       TN=100

NPV = TN / (TN + FN) = 100 / 125 = 0.8
F-SCORE
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL    Class=Yes   TP          FN
CLASS     Class=No    FP          TN

F1-Score = 2 × (PPV × Sensitivity) / (PPV + Sensitivity) = 2TP / (2TP + FP + FN)

❑ PPV considers true positives and false positives.
❑ Sensitivity, however, considers true positives and false negatives.
❑ The F-measure considers everything except true negatives.
❑ The F-measure is high only if both precision and recall are high.
F-SCORE
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL    Class=Yes   TP=400      FN=25
CLASS     Class=No    FP=50       TN=100

F1-Score = 2 × (PPV × Sensitivity) / (PPV + Sensitivity) = 2TP / (2TP + FP + FN) = 800 / 875 = 0.91
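As a cross-check, all the metrics above can be computed directly from the example confusion matrix (TP=400, FN=25, FP=50, TN=100) in a few lines of plain Python:

```python
TP, FN, FP, TN = 400, 25, 50, 100

accuracy    = (TP + TN) / (TP + TN + FP + FN)   # 500 / 575 ≈ 0.869
error_rate  = 1 - accuracy                      # 75 / 575 ≈ 0.131
sensitivity = TP / (TP + FN)                    # recall, 400 / 425 ≈ 0.941
specificity = TN / (TN + FP)                    # TNR, 100 / 150 ≈ 0.667
precision   = TP / (TP + FP)                    # PPV, 400 / 450 ≈ 0.889
npv         = TN / (TN + FN)                    # 100 / 125 = 0.8
f1          = 2 * TP / (2 * TP + FP + FN)       # 800 / 875 ≈ 0.91

print(round(accuracy, 3), round(sensitivity, 3), round(specificity, 3))
print(round(precision, 3), round(npv, 3), round(f1, 3))
```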
EVALUATION OF SUPERVISED MODELS
▪ Split the dataset into a Train part and a Test part, and compare the metrics on both:
Train                | Test
Training Accuracy    | Test Accuracy
Training PPV         | Test PPV
Training Sensitivity | Test Sensitivity
…                    | …
K-FOLD CROSS-VALIDATION
Source: https://bradleyboehmke.github.io/HOML/process.html
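A minimal sketch of the k-fold idea in plain Python: split the data into k folds, hold each fold out once, fit on the rest, and average the k held-out scores. The data and the 1-nearest-neighbour "model" below are made up for illustration, not taken from the linked figure:

```python
# Toy data: x in 0..9, labeled "A" below 5 and "B" from 5 upward.
data = [(x, "A" if x < 5 else "B") for x in range(10)]
k = 5

def predict(train, x):
    """Label of the nearest training point (a stand-in model)."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

scores = []
for i in range(k):
    test_fold = data[i::k]                       # every k-th item held out
    train_fold = [d for d in data if d not in test_fold]
    acc = sum(predict(train_fold, x) == y for x, y in test_fold) / len(test_fold)
    scores.append(acc)

print(scores)              # per-fold accuracies
print(sum(scores) / k)     # cross-validated performance estimate
```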