DS-05 Introduction To Machine Learning

Uploaded by

Bojana Jovanceva
Introduction to Machine Learning

Introduction to Data Science


References

Parts of this lecture are based on:


1. Slides from the Harvard course CS109A Introduction to Data Science
by Pavlos Protopapas, Kevin Rader and Chris Tanner, available at
https://github.com/Harvard-IACS/2019-CS109A
2. James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani.
An Introduction to Statistical Learning. Vol. 112. New York: Springer,
2013.
3. Cielen, Davy, Arno Meysman, and Mohamed Ali. Introducing Data
Science: Big Data, Machine Learning, and More, Using Python Tools.
Manning Publications Co., 2016.

Agenda

• Introduction to ML
– Types of ML
– Using ML to make predictions
• Regression
– Defining the problem
– Linear regression models
– Regression model evaluation
• Classification
– Logistic regression
– Classification metrics
Paradigm shift

Traditional Programming

Data + Program → Computer → Output

Machine Learning

Data + Output → Computer → Program
Understanding how machines learn
What is machine learning?

• A branch of artificial intelligence, concerned with the
design and development of algorithms that allow
computers to evolve behaviors based on empirical data.
• As intelligence requires knowledge, it is necessary for the
computers to acquire knowledge
• Automating automation
• Getting computers to program themselves
– Let the data do the work instead!
Machine Learning Definition
• The Formal one:
– “A computer program is said to learn from experience E
with respect to some class of tasks T and performance
measure P if its performance at tasks in T, as measured
by P, improves with experience E.”
• A Practical Example:
– Look at data. Try something. Get the right answer? No?
Look at the data. Do something different. Better? Yes?
Do that again. (Repeat)

• Human learning is still far more sophisticated than even
the most advanced machine-learning algorithms, but
computers have the advantage of greater capacity to
memorize, recall, and process data.
Types of Learning
• Supervised (inductive) learning
– Labels provided
– Training data includes desired outputs
– Predicting the future
– Learn from known past examples to predict the future

• Unsupervised learning
– Labels not provided
– Training data does not include desired outputs
– Making sense of data
– Understanding the past
– Learning the structure of data

• Reinforcement learning
– Rewards and/or punishments from sequence of actions
Algorithms

Supervised learning Unsupervised learning

Semi-supervised learning
Machine Learning Capabilities

• How much/many (Regression)
• Which category (Classification)
• Which group (Clustering, Recommender)
• Is it odd (Anomaly)
• Which action (Reinforcement Learning)
ML in a Nutshell

• Tens of thousands of machine learning algorithms
• Hundreds of new ones every year
• Every machine learning algorithm has three components:
– Representation
– Evaluation
– Optimization
Representation

• Decision trees
• Sets of rules / Logic programs
• Instances
• Graphical models (Bayes/Markov nets)
• Neural networks
• Support vector machines
• Model ensembles
• Etc.
Evaluation

• Accuracy
• Precision and recall
• Squared error
• Likelihood
• Posterior probability
• Cost / Utility
• Margin
• Entropy
• K-L divergence
• Etc.
Optimization

• Combinatorial optimization
– E.g.: Greedy search
• Convex optimization
– E.g.: Gradient descent
• Constrained optimization
– E.g.: Linear programming
ML in Practice

• Understanding domain, prior knowledge, and goals
• Data integration, selection, cleaning, pre-processing, etc.
• Learning models
• Interpreting results
• Consolidating and deploying discovered knowledge
Using data to make decisions

• Traditional approach to the loan-approval process in microlending
• Human analysts look at the documents
Using data to make decisions
• You need to manually analyze incoming applications
• Apply filtering rules and intuition to approve or
disapprove the application
The machine-learning approach
• ML learns the decisions directly from the data, without having
to hardcode arbitrary decision rules.
• This graduation from rules-based to ML-based decision-making means that
your decisions can become more accurate and can improve over time as more
loans are made.
• Data provides the foundation for deriving insights about the problem at
hand.
• The input data consists of a set of features: numerical or categorical
metrics that capture the relevant aspects of each application, such as the
applicant’s credit score, gender, and occupation.
EXAMPLE: logistic regression to model the loan approval process
• Machine learning comes in many flavors, ranging from simple
statistical models to more sophisticated Deep Learning approaches
• In logistic regression, the logarithm of the odds that each loan is
repaid is modeled as a linear function of the input features:
– the applicant’s credit line,
– education level,
– age
• The optimal values of each coefficient of the equation are
learned from the training data examples.
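
As a sketch, the loan model described above can be written as the equation below. Here $p$ is the probability that a loan is repaid, and the coefficient names $\beta_0, \dots, \beta_3$ are notation assumed for illustration, not taken from the slides:

```latex
\log\frac{p}{1 - p}
  = \beta_0
  + \beta_1 \cdot \text{credit line}
  + \beta_2 \cdot \text{education level}
  + \beta_3 \cdot \text{age}
```

The coefficients are the quantities learned from the training examples.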
Linear vs Non-Linear ML Algorithms

• When the relationship between the inputs and the response is
complicated, linear models such as logistic regression can be too limited,
and we need to use more flexible (non-linear) models.
Predicting a Variable

• Let's imagine a scenario where we'd like to predict one variable using another
(or a set of other) variables.
• Thus, we'd like to define two categories of variables:
• variables whose value we want to predict
• variables whose values we use to make our prediction

• Examples:

• Predicting the number of views a YouTube video will get next week based
on video length, the date it was posted, previous number of views, etc.
• Predicting which movies a Netflix user will rate highly, based on their
previous movie ratings, demographic data, etc.
Translating Between Statistics and Machine Learning

other terms:
residual, loss (statistics) = error (machine learning)

Data
The Advertising data set consists of the sales of a product in 200
different markets, along with the advertising budgets for the product in
each of those markets for three different media: TV, radio, and
newspaper. Everything is given in units of $1000.

TV      radio   newspaper   sales
230.1   37.8    69.2        22.1
44.5    39.3    45.1        10.4
17.2    45.9    69.3        9.3
151.5   41.3    58.5        18.5
180.8   10.8    58.4        12.9

Some of the figures in this presentation are taken from "An Introduction to Statistical Learning, with
applications in R" (Springer, 2013)
Response vs. Predictor Variables

X: predictors, features, covariates (the p predictor columns of the table: TV, radio, newspaper)
Y: outcome, response variable, dependent variable (the sales column)
Each of the n rows of the table is one observation.
Response vs. Predictor Variables
$X = X_1, \dots, X_p$, with $X_j = x_{1j}, \dots, x_{ij}, \dots, x_{nj}$: predictors, features, covariates
$Y = y_1, \dots, y_n$: outcome, response variable, dependent variable
There are n observations (rows) and p predictors (columns).
True vs. Statistical Model
• We will assume that the response variable, $Y$, relates to the
predictors, $X$, through some unknown function expressed
generally as:

• $Y = f(X) + \varepsilon$

• Here, $f$ is the unknown function expressing an underlying rule for
relating $Y$ to $X$, and $\varepsilon$ is the random amount (unrelated to $X$) by which $Y$
differs from the rule $f(X)$.

• A statistical model is any algorithm that estimates $f$. We denote
the estimated function as $\hat{f}$.
Statistical Model

How do we find $\hat{f}(x)$?

What is the value of y at this $x$?

or this one?

A simple idea is to take the mean of all y’s: $\hat{f}(x) = \frac{1}{n}\sum_{i=1}^{n} y_i$

Prediction vs. Estimation

• For some problems, what's important is obtaining $\hat{f}$, our estimate of $f$. These are called
inference problems.

• When we use a set of measurements, $(x_{i,1}, \dots, x_{i,p})$, to predict a value for the response
variable, we denote the predicted value by:

• $\hat{y}_i = \hat{f}(x_{i,1}, \dots, x_{i,p})$.

• For some problems, we don't care about the specific form of $\hat{f}$; we just want to make
our predictions $\hat{y}$’s as close to the observed values $y$’s as possible. These are called
prediction problems.

Simple Prediction Model

What is $\hat{y}_q$ at some $x_q$?

Find distances to all other points, $D(x_q, x_i)$

Predict $\hat{y}_q = y_p$, where $x_p$ is the point nearest to $x_q$

Simple Prediction Model
Do the same for “all” $x$’s

Extend the Prediction Model

What is $\hat{y}_q$ at some $x_q$?

Find distances to all other points, $D(x_q, x_i)$

Find the k-nearest neighbors, $x_{q_1}, \dots, x_{q_k}$

Predict $\hat{y}_q = \frac{1}{k}\sum_{i=1}^{k} y_{q_i}$

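
The k-nearest-neighbors prediction above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `knn_predict` and the use of the absolute difference as the distance $D$ are assumptions, and the data are the five Advertising rows shown earlier (TV budget and sales).

```python
# Minimal k-nearest-neighbors regression for a single predictor.
# Assumptions for this sketch: the name knn_predict and the choice of
# the absolute difference |x - x_query| as the distance D.

def knn_predict(x_train, y_train, x_query, k=1):
    """Predict y at x_query as the mean y of the k nearest training points."""
    # Distance from the query point to every training point.
    distances = [(abs(x - x_query), y) for x, y in zip(x_train, y_train)]
    # Keep the k closest points (equal distances are ordered by y, which is arbitrary).
    nearest = sorted(distances)[:k]
    return sum(y for _, y in nearest) / k

# First five rows of the Advertising table: TV budget and sales.
tv = [230.1, 44.5, 17.2, 151.5, 180.8]
sales = [22.1, 10.4, 9.3, 18.5, 12.9]

pred_k1 = knn_predict(tv, sales, x_query=150.0, k=1)  # uses only the closest market
pred_k3 = knn_predict(tv, sales, x_query=150.0, k=3)  # averages the three closest markets
print(pred_k1, pred_k3)
```

With k=1 the prediction copies the sales value of the single closest market; with k=3 it averages three markets, which gives a smoother but less local estimate.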
Simple Prediction Models

We can try different k-models on more data

Error Evaluation
Start with some data.

Hide some of the data from the model. This is called a train-test split.

We use the train set to estimate $\hat{y}$ and the test set to evaluate the model.

Estimate $\hat{y}$ for k=1.

Now, we look at the data we have not used, the test data (red crosses).

Calculate the residuals $(y_i - \hat{y}_i)$.

Do the same for k=3.

Error Evaluation

• In order to quantify how well a model performs, we define a loss or error function.

• A common loss function for quantitative outcomes is the Mean Squared Error
(MSE):

$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

• The quantity $y_i - \hat{y}_i$ is called a residual and measures the error at the i-th
prediction.

• Note: The square root of the Mean Squared Error (RMSE) is also
commonly used:

$\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$
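
A minimal sketch of both loss functions in plain Python follows. The observed and predicted values are hypothetical numbers chosen for illustration only. The last lines show that RMSE carries the units of the response: rescaling the response by 100 rescales the RMSE by 100.

```python
import math

def mse(y_true, y_pred):
    """Mean of the squared residuals (y_i - y_hat_i)."""
    residuals = [y - y_hat for y, y_hat in zip(y_true, y_pred)]
    return sum(r ** 2 for r in residuals) / len(residuals)

def rmse(y_true, y_pred):
    """Square root of the MSE, in the same units as y."""
    return math.sqrt(mse(y_true, y_pred))

# Hypothetical observed and predicted values, for illustration only.
y_obs = [22.1, 10.4, 9.3, 18.5]
y_hat = [20.1, 11.4, 9.3, 16.5]

print(mse(y_obs, y_hat), rmse(y_obs, y_hat))

# Changing the units of the response changes the RMSE by the same factor.
y_obs_scaled = [100 * y for y in y_obs]
y_hat_scaled = [100 * y for y in y_hat]
print(rmse(y_obs_scaled, y_hat_scaled))
```

Because the number depends on the units, an RMSE value is only meaningful relative to a reference, which is the motivation for comparing against a baseline model.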
Model Comparison
Do the same for all k’s and compare the RMSEs. k=3 seems to be the best model.

Model fitness
For a subset of the data, calculate the RMSE for k=3. Is RMSE=5.0 good enough?

What if we measure the Sales in cents instead of dollars?

RMSE is now 5004.93. Is that good?

It is better if we compare it to something.

We will use the simplest model: $\hat{y} = \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$

R-squared - The Coefficient of Determination

$R^2 = 1 - \frac{\sum_i (\hat{y}_i - y_i)^2}{\sum_i (\bar{y} - y_i)^2}$

• If our model is as good as the mean value, $\bar{y}$, then $R^2 = 0$

• If our model is perfect then $R^2 = 1$

• $R^2$ can be negative if the model is worse than the average. This can happen
when we evaluate the model on the test set.

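
The three cases listed for $R^2$ can be checked with a short sketch. The function name `r_squared` and the toy data are assumptions made for illustration.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - sum((y_hat_i - y_i)^2) / sum((y_bar - y_i)^2)."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((y_hat - y) ** 2 for y, y_hat in zip(y_true, y_pred))
    ss_tot = sum((y_bar - y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot

# Toy response values, assumed for illustration.
y = [1.0, 2.0, 3.0, 4.0]

r2_perfect = r_squared(y, y)                       # perfect predictions
r2_mean = r_squared(y, [2.5, 2.5, 2.5, 2.5])       # always predict the mean
r2_worse = r_squared(y, [4.0, 3.0, 2.0, 1.0])      # worse than the mean
print(r2_perfect, r2_mean, r2_worse)
```

The last case predicts the values in reverse order, so its squared error is larger than that of the mean model and $R^2$ drops below zero.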
Linear Models
• Note that in building our k-NN model for prediction, we did not
compute a closed form for $\hat{f}$.

• What if we ask the question:

– “How much more sales do we expect if we double the TV advertising budget?”

• Alternatively, we can build a model by first assuming a simple form of $f$:

$Y = f(X) + \varepsilon = \beta_1 X + \beta_0 + \varepsilon$

• … then it follows that our estimate is:

$\hat{Y} = \hat{f}(X) = \hat{\beta}_1 X + \hat{\beta}_0$

• where $\hat{\beta}_1$ and $\hat{\beta}_0$ are estimates of $\beta_1$ and $\beta_0$ respectively, that we compute using
observations.
Estimate of the regression coefficients (cont)
Is this line good?

Maybe this one?

Or this one?

Question: Which line is the best?
First calculate the residuals

Estimate of the regression coefficients (cont)

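• Again we use MSE as our loss function,

$L(\beta_0, \beta_1) = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n}\sum_{i=1}^{n} [y_i - (\beta_1 x_i + \beta_0)]^2$

• We choose $\hat{\beta}_1$ and $\hat{\beta}_0$ in order to minimize the predictive errors made by our
model, i.e. minimize our loss function.

• Then the optimal values for $\hat{\beta}_0$ and $\hat{\beta}_1$ should be:

$\hat{\beta}_0, \hat{\beta}_1 = \operatorname{argmin}_{\beta_0, \beta_1} L(\beta_0, \beta_1)$
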
Estimate of the regression coefficients: brute force

• A way to estimate $\operatorname{argmin}_{\beta_0, \beta_1} L$ is to calculate the loss function for every
possible $\beta_0$ and $\beta_1$. Then select the $\beta_0$ and $\beta_1$ where the loss function is
minimum.

• E.g. the loss function for different $\beta_1$ when $\beta_0$ is fixed to be 6:

Estimate of the regression coefficients: exact method
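Take the partial derivatives of $L$ with respect to $\beta_0$ and $\beta_1$, set them to zero, and find the
solution to those equations. This procedure will give us explicit formulae for $\hat{\beta}_0$ and $\hat{\beta}_1$:

$\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$

$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

where $\bar{y}$ and $\bar{x}$ are sample means.

The line $\hat{Y} = \hat{\beta}_1 X + \hat{\beta}_0$
is called the regression line.
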
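
The explicit formulae can be applied directly, as in the sketch below. The function name `fit_simple_ols` is an assumption, and the data are the five Advertising rows shown earlier, so the fitted numbers illustrate the mechanics only and are not the coefficients for the full 200-market data set.

```python
def fit_simple_ols(x, y):
    """Return (beta0_hat, beta1_hat) from the closed-form least-squares solution."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope formula.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1_hat = s_xy / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat

# First five rows of the Advertising table: TV budget and sales.
tv = [230.1, 44.5, 17.2, 151.5, 180.8]
sales = [22.1, 10.4, 9.3, 18.5, 12.9]

beta0_hat, beta1_hat = fit_simple_ols(tv, sales)
residuals = [yi - (beta0_hat + beta1_hat * xi) for xi, yi in zip(tv, sales)]
print(beta0_hat, beta1_hat)
```

A useful sanity check is that the residuals of the fitted line sum to zero (up to rounding), which follows from setting the derivative with respect to $\beta_0$ to zero.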
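Proof:

$L(\beta_0, \beta_1) = \frac{1}{n}\sum_i [y_i - (\beta_0 + \beta_1 x_i)]^2$

Setting $\frac{\partial L(\beta_0, \beta_1)}{\partial \beta_0} = 0$:

$\Rightarrow \frac{2}{n}\sum_i (y_i - \beta_0 - \beta_1 x_i)(-1) = 0$

$\Rightarrow \frac{1}{n}\sum_i y_i - \beta_0 - \beta_1 \frac{1}{n}\sum_i x_i = 0$

$\Rightarrow \beta_0 = \bar{y} - \beta_1 \bar{x}$

Setting $\frac{\partial L(\beta_0, \beta_1)}{\partial \beta_1} = 0$:

$\Rightarrow \frac{2}{n}\sum_i (y_i - \beta_0 - \beta_1 x_i)(-x_i) = 0$

$\Rightarrow -\sum_i x_i y_i + \beta_0 \sum_i x_i + \beta_1 \sum_i x_i^2 = 0$

$\Rightarrow -\sum_i x_i y_i + (\bar{y} - \beta_1 \bar{x}) \sum_i x_i + \beta_1 \sum_i x_i^2 = 0$

$\Rightarrow \beta_1 \left( \sum_i x_i^2 - n\bar{x}^2 \right) = \sum_i x_i y_i - n\bar{x}\bar{y}$

$\Rightarrow \beta_1 = \frac{\sum_i x_i y_i - n\bar{x}\bar{y}}{\sum_i x_i^2 - n\bar{x}^2}$

$\Rightarrow \beta_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$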
Estimate of the regression coefficients: gradient descent

A more flexible method is gradient descent:

• Start from a random point
1. Determine which direction to go to reduce the loss (left or right)
2. Compute the slope of the function at this point, and step to the
right if the slope is negative or to the left if the slope is positive
3. Go to step 1
Estimate of the regression coefficients: Gradient Descent
• We know that we want to go in the opposite direction of the derivative, and we
know we want to take a step proportional to the derivative.
Notation: $w = [\beta_0, \beta_1]$
• Making a step means:
$w^{new} = w^{old} + step$
• Opposite direction of the derivative and proportional to the derivative means:
$w^{new} = w^{old} - \lambda \frac{d\mathcal{L}}{dw}$

• Change to more conventional notation, where $\lambda$ is the learning rate:

$w^{(i+1)} = w^{(i)} - \lambda \frac{d\mathcal{L}}{dw}$
Estimate of the regression coefficients: gradient descent

Summary of Gradient Descent

• A first-order optimization algorithm for finding a minimum of a function.
• It is an iterative method: $w^{(i+1)} = w^{(i)} - \lambda \frac{d\mathcal{L}}{dw}$
• $\mathcal{L}$ is decreasing in the direction of the
negative derivative.
• The size of each step is controlled by the
magnitude of the learning rate $\lambda$.

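
The update rule can be sketched for the MSE loss of a simple linear model. Everything specific here is an assumption for illustration: the toy data (which lie exactly on the line y = 1 + 2x), the learning rate of 0.05, the number of iterations, and the starting point (0, 0), which replaces the random start so that the run is reproducible.

```python
def mse_gradient(beta0, beta1, x, y):
    """Gradient of L = (1/n) * sum((y_i - beta0 - beta1*x_i)^2) w.r.t. (beta0, beta1)."""
    n = len(x)
    residuals = [yi - beta0 - beta1 * xi for xi, yi in zip(x, y)]
    d_beta0 = -2.0 / n * sum(residuals)
    d_beta1 = -2.0 / n * sum(r * xi for r, xi in zip(residuals, x))
    return d_beta0, d_beta1

def gradient_descent(x, y, learning_rate=0.05, n_steps=5000):
    """Repeat w <- w - learning_rate * dL/dw, starting from w = (0, 0)."""
    beta0, beta1 = 0.0, 0.0
    for _ in range(n_steps):
        d_beta0, d_beta1 = mse_gradient(beta0, beta1, x, y)
        beta0 -= learning_rate * d_beta0
        beta1 -= learning_rate * d_beta1
    return beta0, beta1

# Toy data on the exact line y = 1 + 2x, assumed for illustration.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]

beta0_gd, beta1_gd = gradient_descent(x, y)
print(beta0_gd, beta1_gd)
```

For this loss the learning rate has to be small enough for the iteration to converge; with a much larger value the updates overshoot and the loss grows instead of shrinking.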
Interpretation of Predictors

• Question: What do you think a predictor coefficient means?

• Sales = 7.5 + 0.04 TV

• What does 7.5 mean and what does 0.04 mean?

• If we increase the TV budget by $1000, what would you expect the increase in sales to be?

• What if Sales = 7.5 + 1.01 TV?

The interpretation of the predictors depends on the values, but decisions depend on
how much we trust these values.

Confidence intervals for the predictors estimates

• We interpret the $\varepsilon$ term in our observation
$y = f(x) + \varepsilon$
to be noise introduced by random variations in natural systems or imprecisions
of our scientific instruments.
• If we knew the exact form of $f(x)$, for example, $f(x) = \beta_0 + \beta_1 x$, and there
was no $\varepsilon$, then estimating the $\hat{\beta}$’s would have been exact.
• However, three things happen, which result in mistrust of the values of the $\hat{\beta}$’s:
• $\varepsilon$ is always there
• we do not know the exact form of $f(x)$
• limited sample size

Confidence intervals for the predictors estimates (cont)
• But due to error, every time we measure the response Y for a fixed value of X,
we will obtain a different observation.
• We have 3 sets of measurements for several different values of X (the “realizations” in the
picture).

Confidence intervals for the predictors estimates (cont)

• For each one of those “realizations”, we could fit a model and estimate $\hat{\beta}_0$ and
$\hat{\beta}_1$.

Bootstrapping for Estimating Sampling Error

Definition
• Bootstrapping is the practice of estimating properties of an
estimator by measuring those properties by, for example, sampling
from the observed data.

• For example, we can compute $\hat{\beta}_0$ and $\hat{\beta}_1$ multiple times by
randomly sampling from our data set. We then use the variance of
our multiple estimates to approximate the true variance of $\hat{\beta}_0$ and
$\hat{\beta}_1$.
Idealized Sampling

1. Idealized original population (through an oracle)
2. Take samples
3. Apply test statistic (e.g. mean)
4. Histogram of statistic values
5. Compare test statistic on the given data, compute p
Bootstrap Sampling
1. Original population
2. Given data (sample)
3. Bootstrap samples, drawn with replacement
4. Apply test statistic (e.g. mean)
5. Histogram of statistic values

The region containing 95% of the samples is a 95% confidence interval (CI)
Confidence intervals for the predictors estimates (cont)
We sample multiple times and calculate $\hat{\beta}_0$ and $\hat{\beta}_1$ for each sample: one sample, another sample, and another sample.

Repeat this 100 times.

We can now estimate the mean and standard deviation of all the estimates $\hat{\beta}_1$.
The standard deviations of $\hat{\beta}_0$ and $\hat{\beta}_1$ are also called their standard errors, $SE(\hat{\beta}_0)$, $SE(\hat{\beta}_1)$.

Confidence intervals for the predictors estimates (cont)
Finally we can calculate the confidence intervals, which are the ranges of values
such that the true value of $\beta_1$ is contained in this interval with n percent
probability (for example, 68% or 95%).

Confidence intervals for the predictors estimates: Standard Errors

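• We can empirically estimate the standard errors, $SE(\hat{\beta}_0)$ and $SE(\hat{\beta}_1)$, of the estimates of $\beta_0$
and $\beta_1$ through bootstrapping.

• If for each bootstrapped sample $i$ the estimated betas are $\hat{\beta}_{0,i}$, $\hat{\beta}_{1,i}$, then

• $SE(\hat{\beta}_0) = \sqrt{\mathrm{var}(\hat{\beta}_0)}$

• $SE(\hat{\beta}_1) = \sqrt{\mathrm{var}(\hat{\beta}_1)}$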
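
The bootstrap procedure for the slope can be sketched as follows. The data are synthetic and assumed for illustration (y = 2 + 3x plus Gaussian noise, with a fixed random seed), the helper `fit_simple_ols` is a hypothetical name, and 1000 bootstrap samples are used instead of the 100 mentioned in the slides. The 95% confidence interval is taken from the percentiles of the bootstrap estimates.

```python
import random
import statistics

def fit_simple_ols(x, y):
    """Closed-form least-squares estimates (beta0_hat, beta1_hat)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = s_xy / s_xx
    return y_bar - beta1 * x_bar, beta1

random.seed(0)  # fixed seed so the sketch is reproducible

# Synthetic data, assumed for illustration: y = 2 + 3x + noise.
x = [i / 2 for i in range(30)]
y = [2 + 3 * xi + random.gauss(0, 1) for xi in x]

n = len(x)
n_boot = 1000
slopes = []
for _ in range(n_boot):
    # Draw n observations with replacement and refit the model.
    idx = [random.randrange(n) for _ in range(n)]
    _, b1 = fit_simple_ols([x[i] for i in idx], [y[i] for i in idx])
    slopes.append(b1)

# Standard error: standard deviation of the bootstrap estimates.
se_b1 = statistics.stdev(slopes)

# 95% percentile interval: drop the lowest and highest 2.5% of estimates.
slopes.sort()
ci_low = slopes[int(0.025 * n_boot)]
ci_high = slopes[int(0.975 * n_boot) - 1]
print(se_b1, ci_low, ci_high)
```

The same loop can collect the intercept estimates to obtain the standard error and interval for the intercept.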
Confidence intervals for the predictors estimates: Standard Errors

• Alternatively:

• If we know the variance $\sigma_\epsilon^2$ of the noise $\epsilon$, we can compute $SE(\hat{\beta}_0)$, $SE(\hat{\beta}_1)$
analytically using the formulae below (no need to bootstrap):

Standard Errors

• However, if we make the following assumptions,

• the errors $\epsilon_i = y_i - \hat{y}_i$ and $\epsilon_j = y_j - \hat{y}_j$ are uncorrelated, for $i \neq j$,

• each $\epsilon_i$ has a mean 0 and variance $\sigma_\epsilon^2$,

• then we can empirically estimate $\sigma^2$ from the data and our regression line.

Remember:
$y_i = f(x_i) + \epsilon_i \Rightarrow \epsilon_i = y_i - f(x_i)$
Standard Errors
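$SE(\hat{\beta}_0) = \sigma \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_i (x_i - \bar{x})^2}}$

$SE(\hat{\beta}_1) = \frac{\sigma}{\sqrt{\sum_i (x_i - \bar{x})^2}}$

$\sigma \approx \sqrt{\frac{\sum_i (\hat{f}(x_i) - y_i)^2}{n - 2}}$

• More data: $n \uparrow$ and $\sum_i (x_i - \bar{x})^2 \uparrow \Rightarrow SE \downarrow$
• Larger coverage: $\mathrm{var}(x)$ or $\sum_i (x_i - \bar{x})^2 \uparrow \Rightarrow SE \downarrow$
• Better data: $\sigma^2 \downarrow \Rightarrow SE \downarrow$
• Better model: $(\hat{f} - y_i) \downarrow \Rightarrow \sigma \downarrow \Rightarrow SE \downarrow$
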
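
A minimal sketch of the analytic standard errors follows, assuming the simple linear model and the estimate of $\sigma$ from the residuals with $n - 2$ in the denominator. The function name `standard_errors` is an assumption, and the data are the five Advertising rows shown earlier, so the numbers illustrate the mechanics only.

```python
import math

def standard_errors(x, y):
    """Return (se_beta0, se_beta1, sigma) for a simple least-squares fit."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    # Estimate the noise level from the residuals of the fitted line.
    ss_res = sum((beta0 + beta1 * xi - yi) ** 2 for xi, yi in zip(x, y))
    sigma = math.sqrt(ss_res / (n - 2))
    se_beta0 = sigma * math.sqrt(1 / n + x_bar ** 2 / s_xx)
    se_beta1 = sigma / math.sqrt(s_xx)
    return se_beta0, se_beta1, sigma

# First five rows of the Advertising table: TV budget and sales.
tv = [230.1, 44.5, 17.2, 151.5, 180.8]
sales = [22.1, 10.4, 9.3, 18.5, 12.9]

se_b0, se_b1, sigma_hat = standard_errors(tv, sales)
print(se_b0, se_b1, sigma_hat)
```

When the data lie exactly on a line the residuals vanish, so the estimated $\sigma$ and both standard errors are zero, which matches the "better model" and "better data" remarks above.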
Classification

• Up to this point, the methods we have seen have centered around modeling and
the prediction of a quantitative response variable (e.g. number of taxi pickups,
number of bike rentals, etc.). Linear regression performs well in these
situations.

• When the response variable is categorical, the problem is no longer called a
regression problem, but is instead called a classification problem.

• The goal is to attempt to classify each observation into a category, also known as
a class or cluster, defined by Y, based on a set of predictor variables X.
Example

• We are interested in predicting whether an individual will default on his or her
credit card payment, on the basis of annual income and monthly credit card
balance.
• The individuals who defaulted on their credit card payments are shown in
orange, and those who did not are shown in blue.

Why not Linear Regression?

• We model default = 1 and no default = 0
• Applying linear regression, we get a fitted line whose predictions are not restricted to the range between 0 and 1.

Logistic Regression
• Rather than modelling this response Y directly, logistic regression
models the probability that Y belongs to a particular category
• For the Default data, logistic regression models the probability of
default.
– The probability of default given balance can be written as
Pr(default = Yes|balance), which we abbreviate as p(balance)
– The values p(balance), will range between 0 and 1.
– Then for any given value of balance, a prediction can be made for default.
– For example, one might predict default = Yes for any individual for whom
p(balance) > 0.5.
– Alternatively, if a company wishes to be conservative in predicting
individuals who are at risk for default, then they may choose to use a lower
threshold, such as p(balance) > 0.1.
The Logistic Model

• We need to model the relationship between
p(X) = Pr(Y = 1|X) and X
• We can try to apply the linear regression model p(X) = β0 + β1X, but we
will get the model shown two slides before
• To avoid this problem, we must model p(X) using a function that
gives outputs between 0 and 1 for all values of X.
• Many functions meet this description. In logistic regression, we use
the logistic function,

$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$

The Logistic Model

• After a bit of manipulation of the original formula, we find that

$\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X}$

• The quantity p(X)/[1−p(X)] is called the odds, and can take on any
value between 0 and ∞.
• Values of the odds close to 0 and ∞ indicate very low and very high
probabilities of default, respectively
• For example,
– on average 1 in 5 people with an odds of 1/4 will default, since p(X)=0.2
implies an odds of 0.2/(1−0.2) = 1/4.
– Likewise on average nine out of every ten people with an odds of 9 will
default, since p(X)=0.9 implies an odds of 0.9/(1−0.9) = 9.
The Logistic Model
• By taking the logarithm of both sides of the odds equation, we arrive at

$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X$

• The left-hand side is called the log-odds or logit
• The coefficients β0 and β1 need to be estimated based on the available training data
• The maximum likelihood method is used to fit a logistic regression model, and can
be interpreted as follows:
– we seek estimates for β0 and β1 such that the predicted probability $\hat{p}(x_i)$ of
default for each individual corresponds as closely as possible to the individual’s
observed default status.
• A mathematical equation called a likelihood function is used to estimate β0 and β1:

$\ell(\beta_0, \beta_1) = \prod_{i: y_i = 1} p(x_i) \prod_{i': y_{i'} = 0} (1 - p(x_{i'}))$

Making Predictions

• Solving the model, we get β0 = −10.6513 and β1 = 0.0055
• For example, the model predicts that the default probability for an
individual with a balance of $1,000 is

$\hat{p}(X) = \frac{e^{-10.6513 + 0.0055 \times 1000}}{1 + e^{-10.6513 + 0.0055 \times 1000}} = 0.00576$

which is below 1%.
• In contrast, the predicted probability of
default for an individual with a balance
of $2,000 is much higher, and
equals 0.586 or 58.6%.
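
The two predictions can be reproduced with a few lines of Python. The coefficients are the ones quoted above; the function name `default_probability` is an assumption for this sketch.

```python
import math

# Fitted coefficients quoted in the slide.
BETA0 = -10.6513
BETA1 = 0.0055

def default_probability(balance):
    """Logistic function applied to the linear predictor beta0 + beta1 * balance."""
    z = BETA0 + BETA1 * balance
    return 1.0 / (1.0 + math.exp(-z))

p_1000 = default_probability(1000)
p_2000 = default_probability(2000)
print(round(p_1000, 5), round(p_2000, 3))

# The odds p / (1 - p) equal exp(beta0 + beta1 * balance).
odds_2000 = p_2000 / (1 - p_2000)
print(odds_2000, math.exp(BETA0 + BETA1 * 2000))
```

The form 1 / (1 + exp(-z)) used here is algebraically the same as exp(z) / (1 + exp(z)).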
Multiple Logistic Regression

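• We now consider the problem of predicting a binary response
using multiple predictors $X_1, \dots, X_p$:

$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$
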
Important measures for classification and diagnostic testing

• Accuracy
• Precision
• Recall
• Sensitivity
• Specificity

Basic quantities for performance evaluation

Two of the most important performance measures

• Precision: By this we mean the percentage of true positives, NTP, among all
examples that the classifier has labeled as positive, NTP + NFP. The value is
obtained by the following formula:

Precision = NTP / (NTP + NFP)

• Recall: By this we mean the probability that a positive example will be
correctly recognized by the classifier. The value is obtained by dividing the
number of true positives, NTP, by the number of positives in the given set,
NTP + NFN:

Recall = NTP / (NTP + NFN)

Why not to use “accuracy” directly

• The simplest measure of performance would be the fraction of
items that are correctly classified, or the “accuracy”, which is:

Accuracy = (NTP + NTN) / (NTP + NTN + NFP + NFN)

• (NTP = true positive, NFN = false negative etc.).
• But this measure is dominated by the larger set (of positives or
negatives) and favors trivial classifiers.
• e.g. if 5% of items are truly positive, then a classifier that always
says “negative” is 95% accurate.
Other performance measures

• Sensitivity is recall measured on positive examples:

Sensitivity = NTP / (NTP + NFN)

• Specificity is recall measured on negative examples:

Specificity = NTN / (NTN + NFP)

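
All of these measures follow from the four basic counts, as the sketch below shows. The function name and both sets of counts are hypothetical; the first set reproduces the "always negative" classifier on 1000 items with 5% positives. A ratio whose denominator is zero is reported as `None`, which is a choice made for this sketch.

```python
def classification_metrics(n_tp, n_fp, n_tn, n_fn):
    """Compute accuracy, precision, recall (sensitivity) and specificity."""
    def ratio(num, den):
        # Undefined ratios (zero denominator) are reported as None.
        return num / den if den else None

    return {
        "accuracy": ratio(n_tp + n_tn, n_tp + n_tn + n_fp + n_fn),
        "precision": ratio(n_tp, n_tp + n_fp),
        "recall": ratio(n_tp, n_tp + n_fn),        # same as sensitivity
        "specificity": ratio(n_tn, n_tn + n_fp),
    }

# Hypothetical classifier that always says "negative": 50 positives, 950 negatives.
always_negative = classification_metrics(n_tp=0, n_fp=0, n_tn=950, n_fn=50)

# Hypothetical classifier that finds most positives with a few false alarms.
reasonable = classification_metrics(n_tp=40, n_fp=10, n_tn=940, n_fn=10)

print(always_negative)
print(reasonable)
```

The trivial classifier reaches 95% accuracy while its recall is zero, which is why accuracy alone is not a reliable measure on imbalanced data.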
Confusion matrix

• An error matrix that allows visualization of the
performance of an algorithm.
Each row of the matrix represents the instances in a
predicted class while each column represents the
instances in an actual class (or vice versa).
ROC plots
• ROC is Receiver-Operating Characteristic.
• Y-axis: true positive rate = NTP/(NTP + NFN), same as recall
• X-axis: false positive rate = NFP/(NFP + NTN) = 1 - specificity
ROC Curve

An ROC curve that rises at 45° indicates a poor model. It represents a random allocation of cases to the classes and is the
ROC curve for the baseline model.
ROC AUC

• ROC AUC is the “Area Under the Curve”: a single
number that captures the overall quality of the
classifier. It should be between 0.5 (random
classifier, i.e. random ordering) and 1.0 (perfect).
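
The ROC curve and its area can be computed from labels and classifier scores without any library, as in the following sketch. The function names and the tiny data set are assumptions for illustration. Each distinct score is used as a threshold, and the area is computed with the trapezoidal rule.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points of the ROC curve."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Sort by decreasing score, so lowering the threshold adds items one group at a time.
    pairs = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Items with the same score are added together.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under a piecewise-linear curve, by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Tiny hypothetical example: 1 = positive, 0 = negative.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

roc_auc = auc(roc_points(labels, scores))
print(roc_auc)
```

The AUC also equals the fraction of (positive, negative) pairs in which the positive item receives the higher score; in this example three of the four pairs are ordered correctly.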
Lift Plot

• A variation of the ROC plot is the lift plot, which
compares the performance of the actual
classifier/search engine against random ordering, or
sometimes against another classifier.
• In the figure, lift is the ratio of the two marked lengths.
Lift Plot

• Lift plots emphasize initial precision (typically what you care
about), and performance in a problem-independent way.
• Note: The lift plot points should be computed at regular spacing,
e.g. 1/100 or 1/1000. Otherwise the initial lift value can be
excessively high, and unstable.
Linear Regression Example with sklearn
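
A minimal sketch of a linear regression workflow with scikit-learn follows. It assumes synthetic data generated from Sales = 7.5 + 0.04 TV plus Gaussian noise (the data and the random seeds are assumptions for illustration, not the Advertising data set), and it combines the train-test split, RMSE and $R^2$ discussed earlier.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data, assumed for illustration: 200 markets, one predictor.
rng = np.random.default_rng(0)
tv = rng.uniform(0, 300, size=200).reshape(-1, 1)   # shape (n_samples, n_features)
sales = 7.5 + 0.04 * tv[:, 0] + rng.normal(0, 1.0, size=200)

# Hide 20% of the data from the model (train-test split).
X_train, X_test, y_train, y_test = train_test_split(
    tv, sales, test_size=0.2, random_state=0
)

# Fit on the train set, evaluate on the test set.
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(model.intercept_, model.coef_[0], rmse, r2)
```

`model.intercept_` and `model.coef_[0]` play the roles of $\hat{\beta}_0$ and $\hat{\beta}_1$; on this synthetic data they should land close to the values used to generate it.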
