Logistic Regression & Practice
Truong Phuoc Long, Ph.D.
Logistic regression
When studying linear regression, we tried to estimate a
population regression equation
In linear regression, we fit a model of the form:
$y = \beta_0 + \beta_1 x_1 + \dots + \beta_q x_q$
The outcome y is a continuous variable, assumed to
follow a normal distribution.
In many situations, y is a binary variable (disease or not)
with 2 values: 0 (no) and 1 (yes).
The mean of y is the proportion of times that it takes
the value 1: p = Pr( y = 1)
Categorical Response Variables
Examples:
• Whether or not a person smokes: Y = non-smoker or smoker (binary response)
• Success of a medical treatment: Y = survives or dies (binary response)
• Opinion poll responses: Y = agree, neutral, or disagree (ordinal response)
Example:
Age and signs of coronary heart disease (CD)
Age CD Age CD Age CD
22 0 40 0 54 0
23 0 41 1 55 1
24 0 46 0 58 1
27 0 47 0 60 1
28 0 48 0 60 0
30 0 49 1 62 1
30 0 49 0 65 1
32 0 50 1 67 1
33 0 51 0 71 1
35 1 51 1 77 1
38 0 52 0 81 1
How can we analyse these data?
Compare the mean age of diseased and non-diseased
women
Non-diseased: 38.6 years
Diseased: 58.7 years (p<0.0001)
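As a quick check, the two group means can be recomputed from the table. This is a minimal sketch that assumes the (age, CD) pairs below were transcribed correctly from the slide; the names `data`, `mean_non_diseased`, and `mean_diseased` are illustrative only. It does not reproduce the p-value.

```python
from statistics import mean

# (age, CD) pairs transcribed from the table; CD = 1 means signs of disease.
data = [
    (22, 0), (23, 0), (24, 0), (27, 0), (28, 0), (30, 0), (30, 0), (32, 0),
    (33, 0), (35, 1), (38, 0), (40, 0), (41, 1), (46, 0), (47, 0), (48, 0),
    (49, 1), (49, 0), (50, 1), (51, 0), (51, 1), (52, 0), (54, 0), (55, 1),
    (58, 1), (60, 1), (60, 0), (62, 1), (65, 1), (67, 1), (71, 1), (77, 1),
    (81, 1),
]

# Mean age within each outcome group.
mean_non_diseased = mean(age for age, cd in data if cd == 0)
mean_diseased = mean(age for age, cd in data if cd == 1)

print(f"Non-diseased: {mean_non_diseased:.1f} years")
print(f"Diseased:     {mean_diseased:.1f} years")
```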
Linear regression?
Can we apply ordinary least squares (OLS) regression?
The scatter plot
[Figure: scatter plot of signs of coronary disease (No = 0, Yes = 1) against age in years (0 to 100)]
The relationship between age and signs of coronary disease cannot be linear.
Linear regression for Binary variable
In the OLS regression:
$Y = \beta_0 + \beta_1 X + \varepsilon$, where $Y \in \{0, 1\}$
• The error terms are heteroskedastic
• $\varepsilon$ is not normally distributed because Y takes on only two values
We cannot apply ordinary linear regression because the outcome is neither
continuous nor normally distributed.
Let’s try another way
Table: Prevalence (%) of signs of CD according to age group
Age group    # in group    # diseased    % diseased
20 - 29      5             0             0
30 - 39      6             1             17
40 - 49      7             2             29
50 - 59      7             4             57
60 - 69      5             4             80
70 - 79      2             2             100
80 - 89      1             1             100
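The grouped table can be rebuilt from the individual records. The sketch below assumes the same transcribed (age, CD) pairs and groups ages by decade; `counts` and `prevalence` are illustrative names.

```python
from collections import defaultdict

# (age, CD) pairs transcribed from the age and coronary disease table.
data = [
    (22, 0), (23, 0), (24, 0), (27, 0), (28, 0), (30, 0), (30, 0), (32, 0),
    (33, 0), (35, 1), (38, 0), (40, 0), (41, 1), (46, 0), (47, 0), (48, 0),
    (49, 1), (49, 0), (50, 1), (51, 0), (51, 1), (52, 0), (54, 0), (55, 1),
    (58, 1), (60, 1), (60, 0), (62, 1), (65, 1), (67, 1), (71, 1), (77, 1),
    (81, 1),
]

# decade start -> [group size, number diseased]
counts = defaultdict(lambda: [0, 0])
for age, cd in data:
    decade = (age // 10) * 10
    counts[decade][0] += 1
    counts[decade][1] += cd

# Percentage diseased per age group, rounded to whole percent as in the table.
prevalence = {}
for decade in sorted(counts):
    n, diseased = counts[decade]
    prevalence[decade] = round(100 * diseased / n)
    print(f"{decade}-{decade + 9}: n={n}, diseased={diseased}, {prevalence[decade]}%")
```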
The scatter plot now
[Figure: percentage diseased (0 to 100%) plotted against age group]
It looks like the S-shape of a sigmoid (logistic) curve.
Sigmoid curve and logistic function
• A sigmoid function is a mathematical function having a characteristic
"S"-shaped curve or sigmoid curve.
• A common example of a sigmoid function is the logistic
function, defined by the formula:
$S(x) = \frac{1}{1 + e^{-x}} = \frac{e^{x}}{1 + e^{x}}$
• Sigmoid functions most often return a value (y axis) in the
range 0 to 1.
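To make the S-shape concrete, here is a small sketch of the standard logistic function 1 / (1 + e^(-x)). The function name `sigmoid` and the two-branch form, which avoids overflow for large negative x, are choices made for this illustration.

```python
import math

def sigmoid(x: float) -> float:
    """Standard logistic function 1 / (1 + e^(-x)), safe for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)          # x < 0, so z is in (0, 1) and cannot overflow
    return z / (1.0 + z)

# Values rise from near 0 to near 1 and pass through 0.5 at x = 0.
for x in (-6, -2, 0, 2, 6):
    print(f"{x:>3} -> {sigmoid(x):.4f}")
```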
The Logistic Function
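[Figure: the logistic curve, with probability of disease (0.0 to 1.0) on the vertical axis and x on the horizontal axis]
So we fit the model of the form
$P(y \mid x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$
This is called the logistic function.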
Logistic regression
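The probability of having the disease:
$p = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$
The odds of having the disease:
$\frac{p}{1 - p} = e^{\beta_0 + \beta_1 x}$
Taking the natural logarithm of each side:
$\ln\left[\frac{p}{1 - p}\right] = \beta_0 + \beta_1 x$
We fit a linear regression model between x and the log of the
odds of having the disease, assuming that the relationship
between ln[p/(1-p)] and x is linear.
This technique is known as logistic regression.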
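The three scales used above (probability, odds, log odds) can be converted into one another. The helper names `odds`, `logit`, and `inv_logit` below are illustrative and not taken from the slides.

```python
import math

def odds(p: float) -> float:
    """Odds p / (1 - p) for a probability 0 < p < 1."""
    return p / (1.0 - p)

def logit(p: float) -> float:
    """Log odds ln[p / (1 - p)]."""
    return math.log(odds(p))

def inv_logit(z: float) -> float:
    """Inverse of the logit: maps a log odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

for p in (0.1, 0.3, 0.5, 0.8):
    print(f"p={p:.1f}  odds={odds(p):.3f}  log odds={logit(p):+.3f}")
```

A probability of 0.5 corresponds to odds of 1 and log odds of 0, which is why the sign of the log odds tells you which outcome is more likely.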
The Logistic Regression Model
p is the probability that the event Y occurs, Pr(Y=1).
p/(1-p) is the “odds”.
ln[p/(1-p)] is the log odds.
The logistic model assumes a linear relationship
between the predictors and log(odds).
The Logistic Regression Model
$\ln\left[\frac{p}{1 - p}\right] = \beta_0 + \beta_1 x$
$\beta_0$ = log odds of disease in the unexposed
$\beta_1$ = log odds ratio associated with being exposed
$e^{\beta_1}$ = odds ratio
Fitting equation to the data
• Linear regression: Least squares
• Logistic regression: Maximum likelihood
• In statistics, maximum likelihood estimation is a method
of estimating the parameters of an assumed probability distribution,
given some observed data.
• This is achieved by maximizing a likelihood function so that, under
the assumed statistical model, the observed data is most probable.
• Likelihood function
- Estimates parameters $\beta_0$ and $\beta_1$
- Practically easier to work with log-likelihood
Maximum likelihood
• Iterative computing
- Choice of an arbitrary value for the coefficients (usually 0)
- Computing of log-likelihood
- Variation of coefficients’ values
- Reiteration until maximisation (plateau)
• Results
- Maximum Likelihood Estimates (MLE) for $\beta_0$ and $\beta_1$
- Estimates of P(y) for a given value of x
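The iterative procedure can be sketched on the age and coronary disease data. This is one possible illustration, assuming Newton-Raphson updates (statistical packages may use a different but related algorithm) and the transcribed (age, CD) pairs; all names are hypothetical.

```python
import math

# (age, CD) pairs transcribed from the age and coronary disease table.
DATA = [
    (22, 0), (23, 0), (24, 0), (27, 0), (28, 0), (30, 0), (30, 0), (32, 0),
    (33, 0), (35, 1), (38, 0), (40, 0), (41, 1), (46, 0), (47, 0), (48, 0),
    (49, 1), (49, 0), (50, 1), (51, 0), (51, 1), (52, 0), (54, 0), (55, 1),
    (58, 1), (60, 1), (60, 0), (62, 1), (65, 1), (67, 1), (71, 1), (77, 1),
    (81, 1),
]

def log_likelihood(b0, b1):
    """Bernoulli log-likelihood of the model logit(p) = b0 + b1 * age."""
    total = 0.0
    for age, cd in DATA:
        eta = b0 + b1 * age
        total += cd * eta - math.log1p(math.exp(eta))
    return total

def score_and_information(b0, b1):
    """Gradient of the log-likelihood and the 2x2 information matrix entries."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for age, cd in DATA:
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * age)))
        w = p * (1.0 - p)
        g0 += cd - p
        g1 += (cd - p) * age
        h00 += w
        h01 += w * age
        h11 += w * age * age
    return (g0, g1), (h00, h01, h11)

b0, b1 = 0.0, 0.0                       # arbitrary starting values
history = [log_likelihood(b0, b1)]      # log-likelihood after each iteration
for _ in range(50):
    (g0, g1), (h00, h01, h11) = score_and_information(b0, b1)
    det = h00 * h11 - h01 * h01
    b0 += (h11 * g0 - h01 * g1) / det   # Newton-Raphson update of the intercept
    b1 += (h00 * g1 - h01 * g0) / det   # Newton-Raphson update of the slope
    history.append(log_likelihood(b0, b1))
    if abs(history[-1] - history[-2]) < 1e-12:   # plateau reached
        break

print(f"b0 = {b0:.3f}, b1 = {b1:.4f}, log-likelihood = {history[-1]:.3f}")
```

Printing `history` shows the log-likelihood climbing and then flattening, which is the plateau the slide refers to.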
Multiple logistic regression
• More than one independent variable
- Predictor variables may be of any data level (categorical,
ordinal, or continuous).
$\ln\left[\frac{P}{1 - P}\right] = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$
• Interpretation of $\beta_i$
- Increase in log-odds for a one-unit increase in $x_i$, with all
the other $x_i$s held constant.
- Measures association between $x_i$ and log-odds, adjusted for
all other $x_i$
Statistical testing
• Question
- Does a model that includes a given independent variable provide
more information about the dependent variable than a model
without this variable?
• Three tests
- Likelihood ratio statistic (LRS)
- Wald test
- Score test
Likelihood ratio statistic
• Compares two nested models
Log(odds) = $\alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$ (model 1)
Log(odds) = $\alpha + \beta_1 x_1 + \beta_2 x_2$ (model 2)
• Likelihood ratio statistic (LRS)
LRS = -2 log (likelihood model 2 / likelihood model 1) =
[-2 log (likelihood model 2)] minus [-2 log (likelihood model 1)]
The LR statistic follows a $\chi^2$ distribution with DF = number of extra parameters in the larger model.
- The null hypothesis of the test states that the smaller model provides as
good a fit for the data as the larger model.
- If the null hypothesis is rejected, we accept the alternative hypothesis that the larger
model provides a significant improvement over the smaller model.
Interpretation: Example
Using a child’s birth weight to predict the likelihood that he
or she will develop the chronic lung disease, we fit the
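model.
$\ln\left[\frac{\hat{p}}{1 - \hat{p}}\right] = \hat{\beta}_0 + \hat{\beta}_1 x$
From the sample, the estimated logistic regression equation is
$\ln\left[\frac{\hat{p}}{1 - \hat{p}}\right] = \hat{\beta}_0 - 0.0042x$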
The coefficient of weight implies that for each one-gram
increase in birth weight, the log odds that the infant
develops the disease decrease by 0.0042 on average.
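On the odds scale, a change of -0.0042 in the log odds per gram is a multiplicative effect. The sketch below uses only the slope quoted above; the 100-gram comparison is added here for illustration and is not in the original slide.

```python
import math

slope_per_gram = -0.0042   # change in log odds per one-gram increase in birth weight

# Exponentiating a change in log odds gives the factor by which the odds change.
odds_multiplier_1g = math.exp(slope_per_gram)
odds_multiplier_100g = math.exp(100 * slope_per_gram)

print(f"Odds multiplier per 1 g:   {odds_multiplier_1g:.4f}")
print(f"Odds multiplier per 100 g: {odds_multiplier_100g:.3f}")
```

So each extra 100 grams of birth weight multiplies the estimated odds of disease by roughly two thirds.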
Interpretation
If an infant weighs 750 grams at birth, what is the
probability that he develops the disease?
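The logit: $\ln\left[\frac{\hat{p}}{1 - \hat{p}}\right] = \hat{\beta}_0 - 0.0042 \times 750$
The predicted probability is then $\hat{p} = \frac{e^{\text{logit}}}{1 + e^{\text{logit}}}$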
Notes
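Inference on the coefficient:
We test the null hypothesis $H_0: \beta_1 = 0$ (there is no relationship between p and x)
against the alternative $H_A: \beta_1 \neq 0$.
We need to know the standard error of the estimator $\hat{\beta}_1$. We can then
calculate the z score $z = \hat{\beta}_1 / \mathrm{se}(\hat{\beta}_1)$ as the test statistic.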
Logistic regression example
The research was done by Wuensch and Poteat and
published in the Journal of Social Behavior and
Personality, 1998, 13, 139-150.
College students (N = 315) were asked to pretend that
they were serving on a university research committee
deciding whether or not to withdraw a faculty member’s
authorization to conduct animal research.
Use the data file logistic.sav on Blackboard Data.
Logistic regression example
Which variables are binary (dichotomous)?
Logistic regression example
Let’s explore how “gender” predicts the “decisions”
on whether to stop or continue the research.
What will be the dependent variable Y and
independent variable X (predictor)?
Logistic regression example
Our regression model will be predicting the logit, that
is, the natural log of the odds, of having made one or
the other decision.
In statistics, the logit function, or log-odds, is
the logarithm of the odds p/(1-p), where p is a probability.
What are $\hat{y}$ and $1 - \hat{y}$?
$\hat{y}$ is the predicted probability of the event that is
coded as 1 (continue the research).
$1 - \hat{y}$ is the predicted probability of the other decision
(stop the research).
Logistic regression example
Predicting that all subjects will decide to stop is correct 59.4% of the time.
In the intercept-only model, ln(odds) = -0.379, so
the predicted odds of deciding to continue the
research are Exp(B) = 0.684.
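The intercept-only figures follow directly from the outcome counts. The sketch assumes the counts of 128 subjects who decided to continue and 187 who decided to stop, which are the totals reported with the classification results later in these slides; variable names are illustrative.

```python
import math

n_continue = 128   # observed decisions to continue the research
n_stop = 187       # observed decisions to stop the research

odds_continue = n_continue / n_stop                    # Exp(B) of the intercept
intercept = math.log(odds_continue)                    # B, the log odds
baseline_accuracy = n_stop / (n_continue + n_stop)     # always predicting "stop"

print(f"odds = {odds_continue:.3f}")
print(f"ln(odds) = {intercept:.3f}")
print(f"correct when predicting 'stop' for everyone: {baseline_accuracy:.1%}")
```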
• The omnibus test is used to check
that the new model is an improvement
over the baseline model.
• It uses chi-square tests to see whether there is
a significant difference between the
log-likelihoods (-2LL) of the baseline model
and the new model.
• If the new model has a significantly
reduced -2LL compared to the baseline,
it suggests that the new model
explains more of the variance in the
outcome and is an improvement.
The -2LL statistic measures how poorly the model predicts the decisions: the smaller
the statistic, the better the model. It is used to compare nested (reduced) models.
Based on this result, what is the regression equation?
Use this model to predict the odds that a subject of a given
gender will decide to continue the research:
What are the odds that a woman will decide to continue
the research?
What is the probability that women will decide to
continue the research?
If the subject is a woman (gender = 0), then
odds = $e^{-0.847 + 1.217(0)} = e^{-0.847} = 0.429$
A woman is 0.429 times as likely to decide to continue the
research as to decide to stop the research.
Convert odds to probabilities:
$\hat{p} = \frac{\text{odds}}{1 + \text{odds}} = \frac{0.429}{1.429} = 0.30$
Our model predicts that 30% of women will decide to
continue the research.
What are the odds that a man will decide to continue the
research?
What is the probability that men will decide to
continue the research?
If the subject is a man (gender = 1), then
odds = $e^{-0.847 + 1.217(1)} = e^{0.370} = 1.448$
A man is 1.448 times as likely to decide to continue the
research as to decide to stop the research.
Convert odds to probabilities:
$\hat{p} = \frac{\text{odds}}{1 + \text{odds}} = \frac{1.448}{2.448} = 0.59$
Our model predicts that 59% of men will decide to continue the
research.
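The odds and probabilities for women and men can be reproduced from the two coefficients used above (-0.847 and 1.217). The function names in this sketch are illustrative.

```python
import math

B0 = -0.847   # intercept: log odds for women (gender = 0)
B1 = 1.217    # coefficient of gender (1 = man)

def predicted_odds(gender: int) -> float:
    """Odds of deciding to continue the research for gender 0 or 1."""
    return math.exp(B0 + B1 * gender)

def odds_to_probability(odds: float) -> float:
    """Convert odds to a probability: p = odds / (1 + odds)."""
    return odds / (1.0 + odds)

for label, gender in (("women", 0), ("men", 1)):
    o = predicted_odds(gender)
    print(f"{label}: odds = {o:.3f}, probability = {odds_to_probability(o):.2f}")
```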
The probability that men will decide to continue the research = 0.59
The probability that women will decide to continue the research = 0.30
With a cut value of 0.50:
If the probability ≥ 0.50, the subject is classified as “Continue the research”:
all 115 male subjects are predicted to continue the research.
If the probability < 0.50, the subject is classified as “Stop the research”:
all 200 female subjects are predicted to stop the research.
This rule correctly classifies 68/128 = 53.1% of the subjects for whom
the predicted event (deciding to continue the research) was observed.
This is the sensitivity of prediction, P(correct | event did occur) = 53.1%.
The rule correctly classifies 140/187 = 74.9% of the subjects for whom
the predicted event was not observed. This is the specificity of
prediction, P(correct | event did not occur) = 74.9%.
Overall, our predictions were correct 208 out of 315 times, for an overall
success rate of 66%.
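The classification figures can be checked with a small confusion-matrix calculation. The four cell counts below are derived from the totals quoted above (115 men, 200 women, 68 correctly predicted to continue, 140 correctly predicted to stop) and are not printed in the slides; names are illustrative.

```python
# gender -> (observed "continue", observed "stop"); derived from the slide totals
counts = {"men": (68, 47), "women": (60, 140)}

# With a cut value of 0.50, men (p = 0.59) are predicted to continue
# and women (p = 0.30) are predicted to stop.
predicted_continue = {"men": True, "women": False}

tp = fp = tn = fn = 0
for group, (n_continue, n_stop) in counts.items():
    if predicted_continue[group]:
        tp += n_continue   # predicted continue, did continue
        fp += n_stop       # predicted continue, actually stopped
    else:
        fn += n_continue   # predicted stop, actually continued
        tn += n_stop       # predicted stop, did stop

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + fp + tn + fn)

print(f"sensitivity = {sensitivity:.1%}")
print(f"specificity = {specificity:.1%}")
print(f"overall     = {accuracy:.1%}")
```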
Exp(B) is the odds ratio predicted by the model.
The model predicts that the odds of deciding to continue
the research are 3.376 times as high for men as they
are for women.
For the men, the odds are 1.448, and for the women they
are 0.429. The odds ratio is
$\frac{1.448}{0.429} \approx 3.38$, which agrees with the reported Exp(B) = 3.376 up to rounding.
95% CI for the predicted OR (odds ratio)
Interpretation?
We are 95% confident that, in the population, the odds of deciding
to continue the research are between 2.09 and 5.45 times as high
for men as they are for women.
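The odds ratio and its interval can be approximated from the 2x2 table of gender by decision. This sketch assumes the cell counts derived from the totals in these slides and uses the standard error of the log odds ratio from the 2x2 table (Woolf's method); the software output on the slide may have been computed slightly differently, but the numbers agree to two decimals.

```python
import math

a, b = 68, 47     # men: continue, stop (derived from the slide totals)
c, d = 60, 140    # women: continue, stop

odds_ratio = (a / b) / (c / d)
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # SE of ln(OR)

z = 1.96          # normal quantile for a 95% interval
ci_lower = math.exp(math.log(odds_ratio) - z * se_log_or)
ci_upper = math.exp(math.log(odds_ratio) + z * se_log_or)

print(f"OR = {odds_ratio:.3f}, 95% CI = {ci_lower:.2f} to {ci_upper:.2f}")
```

Because the interval is built on the log scale and then exponentiated, it is not symmetric around the odds ratio.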
Exercise
Use the low birth weight data set (lowbwt.sav).
Let’s consider the age of the mother as the independent variable
to predict low birth weight of her infant (low).
Model 1: Perform a simple logistic regression to derive an
equation for the probability of a low birth weight
infant from the age of the mother.
Is the effect of the mother’s age on low birth weight
significant?
What is the predicted odds ratio? Interpret it.
What is the 95% CI of the odds ratio? Interpret it.
What is the predicted probability of having a low birth
weight infant for a 35-year-old woman?
Result
• Model 1:
• As the mother’s age increases by 1 year, the logit of having a low birth weight
infant decreases by 0.051. The p-value of $\hat{\beta}$ is 0.105 > 0.05, so
the effect of age is not significant.
• $\widehat{OR}$ = 0.95: as the mother’s age increases by 1 year, the odds of having
a low birth weight infant are multiplied by 0.95 (a decrease of about 5%).
• 95% CI of OR = 0.893 to 1.011: we are 95% confident that, in the population,
the OR of having an LBW infant for each 1-year increase in the mother’s age
is between 0.89 and 1.01. The 95% CI of the OR includes 1, which
means the effect of the mother’s age on LBW is not significant.
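The link between the coefficient, the odds ratio, the interval, and the p-value can be checked approximately. The standard error is not shown in the slides, so this sketch back-calculates it from the reported 95% CI, which is an assumption; because the inputs are rounded, the resulting p-value only approximately matches the reported 0.105.

```python
import math

beta_age = -0.051                    # reported coefficient of mother's age
ci_lower, ci_upper = 0.893, 1.011    # reported 95% CI of the odds ratio

odds_ratio = math.exp(beta_age)

# Assumption: the CI is exp(beta +/- 1.96 * SE), so SE can be recovered from its width.
se = (math.log(ci_upper) - math.log(ci_lower)) / (2 * 1.96)

z = beta_age / se                                   # Wald z statistic
p_value = math.erfc(abs(z) / math.sqrt(2))          # two-sided normal p-value

print(f"OR = {odds_ratio:.2f}, SE = {se:.4f}, z = {z:.2f}, p = {p_value:.3f}")
```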
Exercise
Model 2: Let’s add smoking as another
independent variable.
Interpret the result of each predictor: age
and smoking (significance, OR).
What is the predicted probability of having a low
birth weight infant for a 35-year-old woman who
smokes?
Compare model 2 with model 1 using the chi-square
test for the difference in -2 log likelihood.
Result
Model 2:
As the mother’s age increases by 1 year, the logit of having an LBW infant decreases
by 0.05, holding smoking status constant. The p-value of $\hat{\beta}$ is 0.119 > 0.05, so the
effect of age is not significant.
When the mother smokes, the logit of having an LBW infant increases by 0.692,
holding the mother’s age constant. The p-value of $\hat{\beta}$ is 0.032 < 0.05, so the effect of
smoking is significant.
$\widehat{OR}$(smoke) = 1.997
When the mother smokes, the odds of having an LBW infant are
1.997 times as high, holding the mother’s age constant.
95% CI of OR (smoke) = 1.063 to 3.753: we are 95% confident that, in the population,
the OR of having an LBW infant when the mother smokes is between 1.063
and 3.753. The 95% CI of the OR does not include 1, which means the effect
of the mother’s smoking on LBW is significant.
Chi-square test of -2LLs
• Model 1: -2LL = 231.912
• Model 2: -2LL = 227.276
• $\chi^2$ = 231.912 - 227.276 = 4.636 > 3.84 (the critical $\chi^2$ value at the 0.05 level with df = 1,
the difference in the number of predictors between the two models)
• p < 0.05: reject the null hypothesis and conclude that adding the
smoking variable has significantly improved the ability
to predict infant low birth weight.
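The comparison of the two models can be reproduced from the two -2LL values. The sketch relies on the fact that, for one degree of freedom, the upper tail of the chi-square distribution equals erfc(sqrt(x / 2)); the variable names are illustrative.

```python
import math

neg2ll_model1 = 231.912   # -2LL, age only
neg2ll_model2 = 227.276   # -2LL, age and smoking

# Likelihood ratio statistic: drop in -2LL when smoking is added (df = 1).
lr_statistic = neg2ll_model1 - neg2ll_model2

# Upper-tail probability of a chi-square variable with 1 degree of freedom.
p_value = math.erfc(math.sqrt(lr_statistic / 2))

print(f"LR statistic = {lr_statistic:.3f}, p = {p_value:.4f}")
```

This shortcut only holds for df = 1; comparing models that differ by more than one parameter needs the general chi-square distribution.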
http://bis.net.vn/forums/t/484.aspx