
Introduction to Logistic Regression

by
Dr. Chindo Ibrahim Bisallah
MB.BS, MPH, MPA, PhD
Department of Community Medicine, Faculty of Health Sciences, IBBU Lapai
Introduction
• Logistic regression is a statistical method used for binary classification: predicting one of two possible outcomes (e.g., yes/no, 0/1, spam/not spam)
• Unlike linear regression, which predicts continuous values, logistic regression
predicts the probability that a given input belongs to a particular class
• It uses the logistic (sigmoid) function to map any real-valued number into a
value between 0 and 1:
• σ(z) = 1 / (1 + e^(−z))
• where:
• σ(z) is the sigmoid function.
• e is the base of the natural logarithm (approximately 2.718).
• z is the linear predictor (a real number):
• z = β₀ + β₁X₁ + ⋯ + βₙXₙ
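As a quick illustration (a minimal sketch, not part of the original slides), the sigmoid can be written as a short Python function:

import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))   # 0.5 (z = 0 corresponds to even odds)
print(sigmoid(2.0))   # ~0.88
print(sigmoid(-2.0))  # ~0.12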

Types of Logistic Regression

1. Binary Logistic Regression


Used when the outcome has two categories (e.g., 0 or 1, yes or no).
Example: Predicting whether a student will pass or fail an exam.
2. Multinomial Logistic Regression
Used when the outcome has more than two categories that are not ordered.
Example: Predicting the type of cuisine a restaurant serves (e.g., Italian, Chinese, Indian).
3. Ordinal Logistic Regression
Used when the outcome has more than two ordered categories.
Example: Predicting a customer satisfaction rating (e.g., poor, fair, good,
excellent).
Simple logistic regression
This involves a single independent variable to predict the probability of a
binary outcome.
It models the relationship between the independent variable and the log-
odds of the dependent event occurring.
Formula:
log(p / (1 − p)) = β₀ + β₁X
Where:
p is the probability of the event occurring (e.g., success).
β₀ is the intercept of the model.
β₁ is the coefficient representing the effect of the independent variable X on the log-odds of the outcome.
Example:
Predicting whether a student passes an exam (pass/fail) based on the number of hours studied. Here, "hours studied" is the single independent variable.
Multiple logistic regression
• Multiple logistic regression extends simple logistic regression by
incorporating two or more independent variables to predict the
probability of a binary outcome.
• This allows for the assessment of the effect of each predictor while
controlling for the others
• Formula:
• log(p / (1 − p)) = β₀ + β₁X₁ + β₂X₂ + ⋯ + βₖXₖ
X₁, X₂, …, Xₖ are the independent variables.
β₁, β₂, …, βₖ are the coefficients representing the effect of each independent variable on the log-odds of the outcome.

Example:
Predicting whether a patient has a disease (yes/no) based on multiple factors such as age, blood pressure, cholesterol level, and smoking status. Each of these factors is an independent variable in the model.
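As an illustrative sketch (not from the slides), such a model could be fit in Python with statsmodels; the data below are simulated and the variable names are hypothetical:

import numpy as np
import statsmodels.api as sm

# Simulated patient data: age, systolic BP, cholesterol, smoking (1/0) -> disease (1/0)
rng = np.random.default_rng(42)
n = 500
age = rng.uniform(30, 70, n)
bp = rng.normal(120, 15, n)
chol = rng.normal(200, 30, n)
smoker = rng.integers(0, 2, n)
true_logit = -18 + 0.15 * age + 0.05 * bp + 0.02 * chol + 0.8 * smoker
disease = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

X = sm.add_constant(np.column_stack([age, bp, chol, smoker]))
fit = sm.Logit(disease, X).fit()
print(fit.summary())       # each coefficient is an effect on the log-odds
print(np.exp(fit.params))  # Exp(B): the odds ratio for each predictor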
Advantages:
Allows for the assessment of the impact of multiple factors simultaneously.
Can control for confounding variables, providing a clearer understanding of each
predictor's effect.
Assumptions of Logistic Regression
1. Binary Dependent Variable
The dependent variable should be binary (e.g., success/failure,
yes/no, 1/0)
2. Independence of Observations
Observations should be independent of each other
3. No Multicollinearity
Independent variables should not be highly correlated with each
other.
Use the Variance Inflation Factor (VIF) to check for multicollinearity (see the diagnostics sketch after this list).
4. Linearity of the Logit
The relationship between the independent variables and the log odds
of the dependent variable should be linear.
This doesn’t mean a linear relationship between predictors and the
outcome itself, but between predictors and the logit (log-odds).
This can be tested with the Box-Tidwell test.
5. Large Sample Size
Logistic regression requires a relatively large sample, especially when
the outcome is rare.
A common rule of thumb is at least 10 events per predictor variable
6. Random Sample
The data should be collected using random sampling methods.
7. No Extreme Outliers
Logistic regression can be sensitive to outliers or influential points,
particularly in small datasets.
Use diagnostics such as Cook's distance or leverage values; a high Cook's distance flags an influential observation (see the sketch after this list).
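Both checks can be sketched in Python with statsmodels; here X (a DataFrame of predictors) and y (the binary outcome) are hypothetical names, not data from the slides:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def check_vif(X: pd.DataFrame) -> pd.Series:
    # VIF per predictor; values above roughly 5-10 suggest multicollinearity
    Xc = sm.add_constant(X)
    return pd.Series(
        [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])],
        index=X.columns,
    )

def cooks_distances(X: pd.DataFrame, y) -> pd.Series:
    # Cook's distance from a binomial GLM; large values flag influential points
    res = sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial()).fit()
    return pd.Series(res.get_influence().cooks_distance[0], index=X.index)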
• Data set for a simple logistic regression analysis for predicting whether a student passes or fails based on the number of hours studied:

• Hours studied: 1, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7
• Exam result: 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1
• Note: pass = 1, fail = 0
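The slides show SPSS-style output for these data; as a hedged sketch, the same fit can be attempted in Python. All fails have 4 or fewer hours and all passes 4.5 or more, so the classes are completely separated and unpenalized maximum likelihood diverges (which explains the huge coefficients in the output below); scikit-learn's default L2 penalty keeps the estimates finite:

import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([1, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7]).reshape(-1, 1)
result = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

clf = LogisticRegression().fit(hours, result)  # penalized, so it converges
print(clf.intercept_, clf.coef_)               # stabilized estimates
print(clf.predict_proba([[4.0], [5.0]]))       # probabilities around the cut point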
• Simple logistic regression analysis output:
Omnibus Tests of Model Coefficients
                 Chi-square   df   Sig.
Step 1   Step        16.636    1   .000
         Block       16.636    1   .000
         Model       16.636    1   .000

Model Summary
Step   −2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      .000(a)             .750                   1.000
a. Estimation terminated at iteration number 20 because maximum iterations has been reached. Final solution cannot be found.

Classification Table(a)
                                Predicted examresult      Percentage
Observed                        .00         1.00          Correct
Step 1   examresult    .00       6           0            100.0
                       1.00      0           6            100.0
         Overall Percentage                               100.0

Variables in the Equation
                          B           S.E.        Wald   df   Sig.   Exp(B)
Step 1(a)  hoursstudied     66.161    11046.351   .000    1   .995   ≈5.4 × 10^28
           Constant       −281.185    47027.986   .000    1   .995   .000
a. Variable(s) entered on step 1: hoursstudied.
1. Omnibus test of model coefficients

Omnibus Tests of Model Coefficients
                 Chi-square   df   Sig.
Step 1   Step        16.636    1   .000
         Block       16.636    1   .000
         Model       16.636    1   .000

• Interpretation: The model as a whole is statistically significant (p < .001), meaning the independent variable (hoursstudied) contributes to predicting the dependent variable (examresult).
2. Model Summary
Interpretation:
The Nagelkerke R² = 1.000 suggests a perfect model fit, which is implausibly high and unlikely to be genuine.
3. The Nagelkerke R² (also called the Nagelkerke pseudo-R²)

Purpose: It gives a measure of how well the model explains the variation in
the dependent (outcome) variable.

Range: Values range from 0 to 1.


0 means the model explains none of the variability.
1 means the model explains all of the variability (perfect fit).
Higher values indicate a better fit of the model to the data.
4. Classification Table
The model predicted exam results with 100% accuracy (both pass and
fail groups).
Again, this perfection is suspicious and is often a symptom of data issues (e.g., a small sample or complete separation).
5. Variables in the Equation
• Interpretation:
• The coefficient for hoursstudied is very large, but not statistically
significant (p = .995).
• The extremely large standard errors and Exp(B) value (i.e., an astronomically high odds ratio) confirm model instability, likely due to the small sample size and the complete separation in the data.
6. The logistic regression equation for the above analysis is:
log(p / (1 − p)) = −281.185 + 66.161 × hoursstudied
Example 2.
A logistic regression was conducted to examine whether the number of
hours studied predicted whether students passed or failed a final
examination. The dependent variable was exam result (1 = Pass, 0 = Fail),
and the independent variable was hours studied.
Results of the analysis are presented below:
• Sample size: 50 students
• Model Chi-Square (Omnibus Test): χ²(1) = 12.45, p < .001
• –2 Log Likelihood: 45.23
• Nagelkerke R²: 0.36
• Classification Accuracy: 80%
• Hosmer-Lemeshow Test: p = 0.62 (model fits well)
Logistic Regression Coefficients:
Predictor       B       S.E.   Wald    Sig.   Exp(B)
Hours Studied   0.85    0.27   9.89    .002   2.34
Constant        −4.20   1.25   11.26   .001   —
Reporting format for Logistic regression analysis
1. The logistic regression model was statistically significant, χ²(1) = 12.45, p < .001,
indicating that hours studied is a significant predictor of exam success.
2. The model explained approximately 36% of the variance in exam results
(Nagelkerke R² = 0.36) and correctly classified 80% of the students.
3. The regression coefficient for hours studied was 0.85, meaning each additional
hour studied increases the log odds of passing by 0.85.
4. The odds ratio (Exp(B)) is 2.34, which indicates that each extra hour of study
multiplies the odds of passing the exam by 2.34.
5. In other words, students who studied more were significantly more likely to
pass the exam.
Logistic Regression Equation
log(p / (1 − p)) = −4.20 + 0.85 × (Hours Studied)
where,
p = probability of passing the exam
Hours Studied = number of hours the student studied
To Find the Probability p for a Given Number of Hours Studied:
Plug in the number of hours studied.
Compute the log-odds.
Convert log-odds to odds using exponentiation.

• To convert log-odds to odds, you simply exponentiate the log-odds value using the exponential function: odds = e^(log-odds)
• where e is the base of the natural logarithm (approximately 2.71828).
Converting log-odds to a probability
• step 1: Compute the Log-Odds
• Given a specific number of hours studied, plug that value into the
equation to calculate the log-odds.

• Example: For 3 hours studied:


• Log-odds = −4.20 + 0.85 × 3 = −1.65
• Step 2: Convert Log-Odds to Odds
• Exponentiate the log-odds to obtain the odds
• Odds = e^(log-odds) = e^(−1.65) ≈ 0.192
• Step 3: Convert Odds to Probability
• Finally, convert the odds to probability using the formula:
• p = Odds / (1 + Odds) = 0.192 / 1.192 ≈ 0.161
• So, the probability of passing the exam after studying for 3 hours is
approximately 16.1%.
• Example: 6 Hours Studied
• log(p / (1 − p)) = −4.20 + 0.85 × 6 = −4.20 + 5.10 = 0.90
• p / (1 − p) = e^(0.90) ≈ 2.46
• p = 2.46 / (1 + 2.46) = 2.46 / 3.46 ≈ 0.71
• So, the probability of passing after studying for 6 hours is about 71%.
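The same three steps as a small Python sketch, using the Example 2 coefficients:

import math

def pass_probability(hours_studied, b0=-4.20, b1=0.85):
    log_odds = b0 + b1 * hours_studied  # step 1: compute the log-odds
    odds = math.exp(log_odds)           # step 2: odds = e^(log-odds)
    return odds / (1 + odds)            # step 3: p = odds / (1 + odds)

print(round(pass_probability(3), 3))  # 0.161 -> about 16.1%
print(round(pass_probability(6), 3))  # 0.711 -> about 71%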

Multiple/Multivariable Logistic
Regression
• To estimate the relationship between ONE binary outcome and more than ONE independent variable
• ŷ = β₀ + β₁x → simple regression
• ŷ = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + ⋯ + βₙxₙ → multiple regression
• Logit transformation
• Log odds
• Odds ratio
Example
• To investigate the factors associated with the probability of having a disease.
[Diagram: the outcome is Disease (Yes/No); the predictors are Treatment A (Yes/No), an intervention program (Yes/No), sedentary activity (low, medium, high), and soft drink intake (continuous).]
Assumptions
 1. Random sample

 2. Independent samples: the error terms should be independent of one another.

 3. Linearity: there is a linear relationship between each continuous independent variable and the log-odds of the outcome (not the outcome itself).

 4. Multicollinearity: independent variables are not highly correlated with each other.
Steps in analysis
• 1. Data exploration-descriptive statistics
• 2. Simple logistic regression
• Preliminary final model
• 3. Multiple logistic regression
• 4. Checking multicollinearity and interaction
• Preliminary final model
• 5. Checking assumptions
• Final model
• 6. Interpretation, conclusion and presentation.
Steps in analysis
• 1. Data exploration-descriptive statistics
• Diabetes vs intervention (Yes/No)
• Diabetes vs treatment A (Yes/No)
• Diabetes vs sedentary activity (low, moderate, high)
• Diabetes vs soft drink intake (continuous)
• 2. Simple logistic regression
• Diabetes vs intervention (Yes/No)
• Diabetes vs treatment A (Yes/No)
• Diabetes vs sedentary activity (low, moderate, high)
• Diabetes vs soft drink intake (continuous)
Example
• Research objective: To investigate whether diabetes is affected by
sedentary activity, treatment A, soft drink intake and intervention
program.

Dependent variable: Diabetes (first visit)
Independent variables: sedentary activity, soft drink intake, intervention, and treatment A
• Hypothetical Multiple Logistic Regression Model
• The logistic regression equation can be represented as:
• log(p / (1 - p)) = β₀ + β₁ × Sedentary Activity + β₂ × Treatment A + β₃ ×
Soft Drink Intake + β₄ × Intervention Program
• Where:
• - p: Probability of developing diabetes
• - β₀: Intercept
• - β₁ to β₄: Coefficients representing the effect size of each predictor
variable
The table below presents hypothetical coefficients for the logistic regression model:

Variable               Coefficient (β)   Odds Ratio (e^β)   Interpretation
Intercept              −3.5              0.03               Baseline log-odds of diabetes
Sedentary Activity      0.9              2.459              Higher sedentary activity increases diabetes risk
Treatment A            −1.2              0.301              Treatment A reduces diabetes risk
Soft Drink Intake       0.75             2.117              Increased soft drink intake raises risk
Intervention Program   −1.0              0.368              Participation reduces diabetes risk
Multiple Logistic Regression equation for the analysis
log(p / (1 - p)) = β₀ + β₁ × Sedentary Activity + β₂ × Treatment A + β₃ ×
Soft Drink Intake + β₄ × Intervention Program
log(p / (1 − p)) = −3.5 + 0.9 × (Sedentary Activity) − 1.2 × (Treatment A) + 0.75 × (Soft Drink Intake) − 1.0 × (Intervention Program)
Interpretation of the Equation
 Intercept (-3.5): Baseline log-odds of diabetes when all predictors = 0.
Sedentary Activity (0.9): Each unit increase in sedentary activity increases
the log-odds of diabetes by 0.9, or odds by a factor of e^0.9 ≈ 2.46.
 Treatment A (-1.2): Being in Treatment A reduces the log-odds by 1.2, or
odds by a factor of e^-1.2 ≈ 0.30.
 Soft Drink Intake (0.75): Each unit increase raises the log-odds of
diabetes by 0.75, or odds by a factor of e^0.75 ≈ 2.12.
 Intervention Program (-1.0): Participation reduces the log-odds of
diabetes by 1.0, or odds by a factor of e^-1.0 ≈ 0.37.
Converting each odds ratio to a probability (treating the odds ratio as the odds, i.e., assuming all other terms in the model are zero):
Sedentary activity: p = 2.46 / (1 + 2.46) = 0.711 = 71%
Treatment A: p = 0.30 / (1 + 0.30) = 0.231 = 23%
Soft drink intake: p = 2.12 / (1 + 2.12) = 0.679 = 68%
Intervention program: p = 0.37 / (1 + 0.37) = 0.270 = 27%
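A more direct way to get a probability is to plug a full covariate profile into the model equation itself; here is a sketch using the hypothetical coefficients above (the example profile values are assumptions):

import math

# Hypothetical coefficients from the table above
B0, B_SED, B_TRT, B_SOFT, B_INT = -3.5, 0.9, -1.2, 0.75, -1.0

def diabetes_probability(sedentary, treatment_a, soft_drink, intervention):
    # Predicted probability of diabetes for one covariate profile
    log_odds = (B0 + B_SED * sedentary + B_TRT * treatment_a
                + B_SOFT * soft_drink + B_INT * intervention)
    return 1 / (1 + math.exp(-log_odds))

# e.g. sedentary level 2, no Treatment A, 3 soft drinks, no intervention
print(round(diabetes_probability(2, 0, 3, 0), 3))  # ~0.634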
• Interpretation of Coefficients
• Sedentary Activity (β₁): A positive coefficient suggests that increased sedentary behavior
is associated with higher odds of developing diabetes.
• Treatment A (β₂): A negative coefficient indicates that Treatment A may have a protective
effect against diabetes.
• Soft Drink Intake (β₃): A positive coefficient implies that higher consumption of soft drinks
is linked to increased diabetes risk. For instance, studies have shown that consuming five
or more servings of soft drinks per week is associated with nearly double the risk of type
2 diabetes compared to consuming less than one serving per week.
• Intervention Program (β₄): A negative coefficient suggests that participation in the
intervention program reduces the odds of developing diabetes. Evidence indicates that
physical activity promotion programs can significantly decrease fasting blood glucose
levels among patients with type 2 diabetes.
• Statistical Significance and Model Fit
• Each coefficient's statistical significance is typically assessed using p-
values:
• p-value < 0.05: Statistically significant effect
• p-value ≥ 0.05: Not statistically significant
• The overall model fit can be evaluated using metrics such as the
Akaike Information Criterion (AIC) or the Hosmer-Lemeshow test.
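For illustration, a minimal sketch of the Hosmer-Lemeshow test (decile-of-risk version), assuming y holds the observed 0/1 outcomes and p the model's predicted probabilities:

import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p, groups=10):
    # Sort cases by predicted risk and split them into equal-size groups
    order = np.argsort(p)
    y, p = np.asarray(y)[order], np.asarray(p)[order]
    stat = 0.0
    for idx in np.array_split(np.arange(len(p)), groups):
        observed = y[idx].sum()   # observed events in the group
        expected = p[idx].sum()   # expected events = sum of predicted probabilities
        n_g = len(idx)
        stat += (observed - expected) ** 2 / (expected * (1 - expected / n_g))
    # p-value > 0.05 suggests the model fits adequately
    return stat, chi2.sf(stat, groups - 2)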
• This multiple logistic regression analysis allows for the assessment of
how sedentary activity, Treatment A, soft drink intake, and
participation in an intervention program collectively influence the risk
of developing diabetes.
• The model can identify significant predictors and quantify their
impact, aiding in the development of targeted prevention strategies
Thank you
