STATISTICAL LEARNING
STÉPHANIE VAN DEN BERG
MARYAM AMIR HAERI
RECAP OF THIS COURSE
• Linear regression
• Logistic regression
• Data pre-processing
• Feature extraction and selection
• Next step: machine learning
BASICS OF STATISTICAL/MACHINE LEARNING
AN EXAMPLE
▪ IBM Watson
▪ Watson is a question-answering computer system capable of answering questions posed in natural language, developed in IBM's DeepQA project. It can handle spoken language.
▪ Jeopardy
PROCESS OF TEXT MINING AND STATISTICAL LEARNING
▪ Text preprocessing
▪ Feature generation
▪ Feature selection
▪ Mining
▪ Analysis of results
[Pipeline: Text preprocessing → Feature generation → Feature selection → Mining → Analysis of results]
26-11-2020
MACHINE LEARNING METHODS
Source: https://www.diegocalvo.es/en/machine-learning-supervised-unsupervised/
▪ Supervised Learning: Classification, Regression
▪ Unsupervised Learning: Clustering
https://datasolut.com/wiki/unsupervised-learning/
SUPERVISED VS UNSUPERVISED
In supervised learning, a function 𝐹 is sought that
predicts the outcome 𝑦 based on the input 𝑥.
𝑦 = 𝐹(𝑥)
The function 𝑭 is trained on known cases.
Unsupervised learning is more about recognizing
groups of observations 𝑥 that have more in common
with each other than with the other observations.
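The contrast above can be sketched in a few lines of plain Python. All data values below are made up for illustration: the supervised part fits a least-squares line as 𝐹, the unsupervised part performs one assignment step of 2-means with assumed initial centres.

```python
# Supervised: outcomes y are known for the training inputs x, and we
# learn a function F with y = F(x).  Here F is a least-squares line
# y = w * x fitted through the origin.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]                      # known outcomes (labels)
w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def predict(x):
    return w * x                               # F(x) for a new, unseen x

# Unsupervised: only inputs, no outcomes.  We look for groups of
# observations that are more similar to each other than to the rest;
# here, one assignment step of 2-means with assumed centres c0 and c1.
points = [0.2, 0.4, 0.3, 5.1, 5.3, 4.9]
c0, c1 = 0.0, 5.0
clusters = [0 if abs(p - c0) < abs(p - c1) else 1 for p in points]

print(round(predict(5.0), 2))   # -> 9.95
print(clusters)                 # -> [0, 0, 0, 1, 1, 1]
```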
CLASSIFICATION
[Figure: the training phase builds a model from labeled data; the prediction phase applies it to new data]
THIS COURSE
▪ Supervised learning
▪ Focus on Classification and regression trees
▪ KNN
▪ Evaluation of classification
▪ Accuracy, sensitivity, specificity, PPV, NPV, F-score
CLASSIFICATION TASK
▪ Training data (X1, X2, …, Xn, Y) → Learning Algorithm → Learn Model → Model
▪ New data (X1, X2, …, Xn) → Apply Model → predicted Y

Training data:
Years of Experience | Behavior test results | University degree | Past work performance score | Qualified for the job
10 | 85 | Master | 80 | Yes
4 | 55 | PhD | 65 | No
7 | 80 | Master | 90 | Yes
3 | 40 | PhD | 50 | No
CLASSIFICATION TASK
Training data (Learning Algorithm → Learn Model → Model):
Years of Experience | Behavior test results | University degree | Past work performance score | Qualified for the job
10 | 85 | Master | 80 | Yes
4 | 55 | PhD | 65 | No
7 | 80 | Master | 90 | Yes
3 | 40 | PhD | 50 | No

New, unlabeled data (Model → deduction → Apply Model):
Years of Experience | Behavior test results | University degree | Past work performance score
5 | 35 | Master | 60
3 | 55 | Master | 85
9 | 87 | PhD | 75
8 | 90 | Master | 30
CLASSIFICATION TASK
Training data (Learning Algorithm → Learn Model → Model):
Years of Experience | Behavior test results | University degree | Past work performance score | Qualified for the job
10 | 85 | Master | 80 | Yes
4 | 55 | PhD | 65 | No
7 | 80 | Master | 90 | Yes
3 | 40 | PhD | 50 | No

Applying the model to the new data (deduction) yields the predicted output:
Years of Experience | Behavior test results | University degree | Past work performance score | Predicted output: accept for job?
5 | 35 | Master | 60 | No
3 | 55 | Master | 85 | Yes
9 | 87 | PhD | 75 | Yes
8 | 90 | Master | 30 | No
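The learn-then-apply workflow in these slides can be imitated with a deliberately tiny learner. The "decision stump" below is a made-up stand-in that uses only the behavior-test-result column, so its predictions need not match the model implied by the slides:

```python
# Learn-then-apply with a one-rule "decision stump" (illustrative only).
train = [(85, "Yes"), (55, "No"), (80, "Yes"), (40, "No")]

# Learn: find the threshold t that best classifies the training data
# with the rule "behavior test score >= t -> Yes".
def training_accuracy(t):
    return sum((s >= t) == (label == "Yes") for s, label in train)

threshold = max({s for s, _ in train}, key=training_accuracy)

# Apply: score the unlabeled applicants with the learned rule.
new_scores = [35, 55, 87, 90]
preds = ["Yes" if s >= threshold else "No" for s in new_scores]
print(threshold, preds)   # -> 80 ['No', 'No', 'Yes', 'Yes']
```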
OUTPUT OF A CLASSIFIER
▪ Label: Input → Classifier → 𝑌, with 𝑌 ∈ {0, 1};
  i.e. the output can be a label, for example class zero or class one.
▪ Score (rank): Input → Classifier → 𝑌, where 𝑌 can be any number
  in a given range, for example [0, 1].
EXAMPLE OF DECISION TREE CLASSIFIER
Years of experience?
├─ ≥ 5 → Past work performance score?
│        ├─ ≥ 70 → Behavior test results?
│        │         ├─ ≥ 60 → Yes
│        │         └─ < 60 → No
│        └─ < 70 → No
└─ < 5 → Behavior test results?
         ├─ ≥ 60 → University degree?
         │         ├─ PhD    → Yes
         │         └─ Master → No
         └─ < 60 → No
▪ Applying the tree to a new applicant:

Years of Experience | Behavior test results | University degree | Past work performance score | Qualified for the job
10 | 85 | Master | 80 | ?

▪ Years of experience 10 ≥ 5, past work performance score 80 ≥ 70, and behavior test results 85 ≥ 60, so the tree predicts: Yes.
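The decision tree above can be transcribed directly as nested rules, with the applicant from the slide as a check:

```python
def qualified(years, test_score, degree, performance):
    """The decision tree above, transcribed as nested if/else rules."""
    if years >= 5:
        if performance >= 70:
            return "Yes" if test_score >= 60 else "No"
        return "No"                       # performance score below 70
    if test_score >= 60:
        return "Yes" if degree == "PhD" else "No"
    return "No"                           # behavior test result below 60

# The applicant from the slide: 10 years, test 85, Master, performance 80.
print(qualified(10, 85, "Master", 80))    # -> Yes
```

The same function also reproduces the four labeled rows of the training table from the earlier slides.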
LOGISTIC REGRESSION CLASSIFIER
▪ Provides a score as the output
▪ The score can be transformed into a label (e.g. Yes/No)
https://www.javatpoint.com/logistic-regression-in-machine-learning
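A sketch of how a score becomes a label. The weights below are made-up for illustration, and the threshold of 0.5 is the usual default rather than something fixed by the slide:

```python
import math

w, b = 0.8, -4.0                    # hypothetical fitted weight and bias

def score(x):
    """Logistic-regression score: a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def label(x, threshold=0.5):
    """Transform the score into a Yes(1)/No(0) label."""
    return 1 if score(x) >= threshold else 0

print(label(7.0))   # score(7.0) is above 0.5, so class 1
print(label(1.0))   # score(1.0) is below 0.5, so class 0
```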
CLASSIFICATION AND REGRESSION TREES
DECISION TREE
https://www.aunalytics.com/decision-trees-an-overview/
REGRESSION TREE
[Figure: regression tree predicting home value, with leaf values 98k, 150k, 250k, and 300k]
https://www.aunalytics.com/decision-trees-an-overview/
INTERPRETATION
▪ Can you describe the rules that come with this tree?
▪ If Income ≤ 50k and Age ≤ 50, then the average home value is 98k
▪ If Income ≤ 50k and Age > 50, then the average home value is 150k
▪ If Income > 50k and Age ≤ 50, then the average home value is 250k
▪ If Income > 50k and Age > 50, then the average home value is 300k
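The four rules translate directly into a small function (values in thousands, as in the tree):

```python
def home_value(income, age):
    """Regression-tree rules above: average home value in thousands."""
    if income <= 50:
        return 98 if age <= 50 else 150
    return 250 if age <= 50 else 300

print(home_value(40, 35))   # Income <= 50k, Age <= 50 -> 98
print(home_value(80, 60))   # Income > 50k,  Age > 50  -> 300
```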
EXERCISE: SHOULD I PLAY BASEBALL?
▪ Please describe the set of rules for the following classification tree:
K NEAREST NEIGHBOR
Source: https://www.datacamp.com/community/tutorials/k-nearest-neighbor-classification-scikit-learn
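The idea behind the linked tutorial can be shown without any library: classify a query point by the majority label among its k closest training points. The 2-D toy data below is made up for illustration:

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """k-nearest-neighbour classification by squared Euclidean distance."""
    nearest = sorted(
        train,
        key=lambda p: (p[0][0] - query[0]) ** 2 + (p[0][1] - query[1]) ** 2,
    )[:k]
    # Majority vote among the k closest training points.
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((6, 6), "B"), ((6, 7), "B"), ((7, 6), "B")]
print(knn_predict(train, (2, 2)))   # near the "A" cluster -> A
print(knn_predict(train, (6, 5)))   # near the "B" cluster -> B
```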
TWO ISSUES IN MACHINE LEARNING
• CURSE OF DIMENSIONALITY
• OVER-FITTING YOUR MODEL
ISSUE 1: CURSE OF DIMENSIONALITY
▪ To train more complex models, exponentially more data is needed.
▪ Let’s go back to the example of the regression trees.
HOW MUCH DATA WOULD YOU NEED?
▪ Imagine a data set with 1000 individuals for a binary classification:
▪ 1 variable with two categories: 2 cells with roughly 500 observations each
▪ 2 variables with two categories each: 4 cells with roughly 250 observations each
▪ 3 variables with two categories each: 8 cells with roughly 125 observations each
▪ 4 variables with two categories each: 16 cells with roughly 60 observations each
HOW MUCH DATA WOULD YOU NEED?
▪ With 4 times more variables, you need almost 10 times more cases.
▪ As a consequence, for large models you will have very sparse areas without many observations.
▪ Imagine what would happen with more categories per variable, and even more than 4 variables.
HOW MUCH DATA WOULD YOU NEED?
▪ Rule of thumb: 50 respondents per combination of categories
▪ With 3 variables: 3 × 2 × 2 = 12 combinations, therefore 12 × 50 = 600 observations
▪ With 4 variables: 3 × 3 × 2 × 2 = 36 combinations, therefore 36 × 50 = 1800 observations
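The rule of thumb is a one-line calculation: multiply the number of categories per variable and take 50 respondents per resulting cell. A quick check of the two examples above:

```python
from math import prod

def needed_observations(categories_per_variable, per_cell=50):
    """Observations required: 50 per combination of categories."""
    return prod(categories_per_variable) * per_cell

print(needed_observations([3, 2, 2]))      # 12 combinations -> 600
print(needed_observations([3, 3, 2, 2]))   # 36 combinations -> 1800
```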
ISSUE 2: OVER-FITTING YOUR MODEL
▪ A model might perfectly describe the data you train it on; however, for future unseen cases it might be pretty far off.
▪ Technically, if the number of variables in your model is as large as the number of observations, you can achieve a perfect fit on the training data.
OVERFITTING
Source: https://datascience.foundation/sciencewhitepaper/underfitting-and-overfitting-in-machine-learning
EXAMPLE OF K-NEAREST NEIGHBOR
ISSUE OF OVER-FITTING
▪ When k-nearest neighbour is applied as the machine learning algorithm, the optimal value for k has to be chosen.
▪ K = 1: over-fits the data
▪ K = optimum value
▪ K = very high: under-fits the data
▪ Because of this, various settings of the algorithm have to be tried. The difference in fit between the training set and the test set is an indication of over-fitting.
▪ Usually the k-value with the best fit on the test set is chosen as optimal.
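The effect of k can be seen on a tiny 1-D data set: a noise point makes k = 1 over-fit, while a k as large as the whole training set under-fits by always predicting the majority class. All values below are invented for illustration:

```python
from collections import Counter

def knn_predict(train, x, k):
    """1-D k-nearest-neighbour: majority label of the k closest points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [(1, "A"), (2, "A"), (3, "A"), (2.5, "B"),   # 2.5 is a noise point
         (7, "B"), (8, "B"), (9, "B")]
test_set = [(1.5, "A"), (2.6, "A"), (7.5, "B"), (8.5, "B")]

for k in (1, 3, 7):
    acc = sum(knn_predict(train, x, k) == y for x, y in test_set) / len(test_set)
    print(k, acc)   # k=1 over-fits the noise point, k=7 under-fits
```

On this toy split, k = 3 gives the best test accuracy, illustrating why the k with the best test-set fit is chosen.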
STEP 5: INTERPRETATION AND EVALUATION
HOW TO TEST THE QUALITY OF THE MACHINE LEARNING MODEL?
▪ With linear regression, you check R-squared.
▪ For classification algorithms, we instead use a matrix of (in)correct decisions: the confusion matrix. In case of two possible outcomes:
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL    Class=Yes   TP          FN
CLASS     Class=No    FP          TN
METRICS FOR PERFORMANCE EVALUATION
Confusion Matrix:
                        PREDICTED CLASS (R)
                        Class=1                 Class=0
ACTUAL     Class=1      TP (true positive)      FN (false negative)
CLASS (Y)  Class=0      FP (false positive)     TN (true negative)
EXAMPLE OF CONFUSION MATRIX
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL    Class=Yes   TP=400      FN=25
CLASS     Class=No    FP=50       TN=100
ACCURACY
▪ How many predictions were correct predictions (proportion)
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL    Class=Yes   TP=400      FN=25
CLASS     Class=No    FP=50       TN=100

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Error rate = 1 − Accuracy = (FP + FN) / (TP + TN + FP + FN)
ACCURACY
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL    Class=Yes   TP=400      FN=25
CLASS     Class=No    FP=50       TN=100

Accuracy = (TP + TN) / (TP + TN + FP + FN) = 500 / 575 = 0.869
Error rate = 1 − Accuracy = (FP + FN) / (TP + TN + FP + FN) = 75 / 575 = 0.131
SENSITIVITY (RECALL)
✓ Sensitivity focuses on the proportion of existing positive instances that are predicted as positive.
✓ Other names for sensitivity are True Positive Rate (TPR) and recall.
✓ How many of the YES-es were correctly identified (proportion)?

Sensitivity = TP / (TP + FN)

                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL    Class=Yes   TP          FN
CLASS     Class=No    FP          TN
SENSITIVITY (RECALL)
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL    Class=Yes   TP=400      FN=25
CLASS     Class=No    FP=50       TN=100

Sensitivity = TP / (TP + FN) = 400 / 425 = 0.941
SPECIFICITY
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL    Class=Yes   TP          FN
CLASS     Class=No    FP          TN

Specificity = TNR = TN / (TN + FP)

▪ Specificity measures the proportion of actual negative cases that are predicted correctly.
SPECIFICITY
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL    Class=Yes   TP=400      FN=25
CLASS     Class=No    FP=50       TN=100

Specificity = TNR = TN / (TN + FP) = 100 / 150 = 0.667
POSITIVE PREDICTIVE VALUE (PPV OR PRECISION)
✓ Positive predictive value (PPV), or precision, focuses on the proportion of instances predicted as positive that are actually positive.

                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL    Class=Yes   TP          FN
CLASS     Class=No    FP          TN

PPV = Precision = TP / (TP + FP)
POSITIVE PREDICTIVE VALUE (PPV OR PRECISION)
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL    Class=Yes   TP=400      FN=25
CLASS     Class=No    FP=50       TN=100

PPV = Precision = TP / (TP + FP) = 400 / 450 = 0.889
NEGATIVE PREDICTIVE VALUE (NPV)
▪ The proportion of predicted negatives that are real negatives. It reflects the probability that a predicted negative is a true negative.

                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL    Class=Yes   TP          FN
CLASS     Class=No    FP          TN

NPV = TN / (TN + FN)
NEGATIVE PREDICTIVE VALUE (NPV)
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL    Class=Yes   TP=400      FN=25
CLASS     Class=No    FP=50       TN=100

NPV = TN / (TN + FN) = 100 / 125 = 0.8
F-SCORE
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL    Class=Yes   TP          FN
CLASS     Class=No    FP          TN

F1-Score = 2 × (PPV × Sensitivity) / (PPV + Sensitivity) = 2TP / (2TP + FP + FN)

❑ PPV considers true positives and false positives.
❑ Sensitivity, however, considers true positives and false negatives.
❑ The F-measure considers everything except true negatives.
❑ The F-measure is high only if both precision and recall are high.
F-SCORE
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL    Class=Yes   TP=400      FN=25
CLASS     Class=No    FP=50       TN=100

F1-Score = 2 × (PPV × Sensitivity) / (PPV + Sensitivity) = 2TP / (2TP + FP + FN) = 800 / 875 = 0.91
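As a cross-check, all the metrics above can be computed directly from the example confusion matrix (TP=400, FN=25, FP=50, TN=100) in a few lines of plain Python:

```python
TP, FN, FP, TN = 400, 25, 50, 100

accuracy    = (TP + TN) / (TP + TN + FP + FN)   # 500 / 575 ≈ 0.869
error_rate  = 1 - accuracy                      # 75 / 575 ≈ 0.131
sensitivity = TP / (TP + FN)                    # recall, 400 / 425 ≈ 0.941
specificity = TN / (TN + FP)                    # TNR, 100 / 150 ≈ 0.667
precision   = TP / (TP + FP)                    # PPV, 400 / 450 ≈ 0.889
npv         = TN / (TN + FN)                    # 100 / 125 = 0.8
f1          = 2 * TP / (2 * TP + FP + FN)       # 800 / 875 ≈ 0.91

print(round(accuracy, 3), round(sensitivity, 3), round(specificity, 3))
print(round(precision, 3), round(npv, 3), round(f1, 3))
```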
EVALUATION OF SUPERVISED MODELS
▪ Split the dataset into a Train part and a Test part, and compare the metrics on both:
Train                | Test
Training Accuracy    | Test Accuracy
Training PPV         | Test PPV
Training Sensitivity | Test Sensitivity
…                    | …
K-FOLD CROSS-VALIDATION
Source: https://bradleyboehmke.github.io/HOML/process.html
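A minimal sketch of the k-fold idea in plain Python: split the data into k folds, hold each fold out once, fit on the rest, and average the k held-out scores. The data and the 1-nearest-neighbour "model" below are made up for illustration, not taken from the linked figure:

```python
# Toy data: x in 0..9, labeled "A" below 5 and "B" from 5 upward.
data = [(x, "A" if x < 5 else "B") for x in range(10)]
k = 5

def predict(train, x):
    """Label of the nearest training point (a stand-in model)."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

scores = []
for i in range(k):
    test_fold = data[i::k]                       # every k-th item held out
    train_fold = [d for d in data if d not in test_fold]
    acc = sum(predict(train_fold, x) == y for x, y in test_fold) / len(test_fold)
    scores.append(acc)

print(scores)              # per-fold accuracies
print(sum(scores) / k)     # cross-validated performance estimate
```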