What is logistic regression?
Logistic regression estimates the probability of an event occurring, such as
voted or didn’t vote, based on a given data set of independent variables.
This type of statistical model (also known as logit model) is often used for
classification and predictive analytics. Since the outcome is a probability, the
dependent variable is bounded between 0 and 1. In logistic regression, a logit
transformation is applied on the odds—that is, the probability of success
divided by the probability of failure. This is also commonly known as the log
odds, or the natural logarithm of odds, and this logistic function is represented
by the following formulas:
Logit(pi) = 1/(1+ exp(-pi))
ln(pi/(1-pi)) = Beta_0 + Beta_1*X_1 + … + B_k*K_k
In this logistic regression equation, logit(pi) is the dependent or response
variable and x is the independent variable. The beta parameter, or coefficient,
in this model is commonly estimated via maximum likelihood estimation (MLE).
This method tests different values of beta through multiple iterations to
optimize for the best fit of log odds. All of these iterations produce the log
likelihood function, and logistic regression seeks to maximize this function to
find the best parameter estimate. Once the optimal coefficient (or coefficients
if there is more than one independent variable) is found, the conditional
probabilities for each observation can be calculated, logged, and summed
together to yield a predicted probability. For binary classification, a probability
less than .5 will predict 0 while a probability greater than 0 will predict 1. After
the model has been computed, it’s best practice to evaluate the how well the
model predicts the dependent variable, which is called goodness of fit. The
Hosmer–Leme show test is a popular method to assess model fit
Interpreting logistic regression
Log odds can be difficult to make sense of within a logistic regression data
analysis. As a result, exponentiating the beta estimates is common to transform
the results into an odds ratio (OR), easing the interpretation of results. The OR
represents the odds that an outcome will occur given a particular event,
compared to the odds of the outcome occurring in the absence of that event. If
the OR is greater than 1, then the event is associated with a higher odds of
generating a specific outcome. Conversely, if the OR is less than 1, then the
event is associated with a lower odds of that outcome occurring. Based on the
equation from above, the interpretation of an odds ratio can be denoted as the
following: the odds of a success changes by exp(cB_1) times for every c-unit
increase in x. To use an example, let’s say that we were to estimate the odds of
survival on the Titanic given that the person was male, and the odds ratio for
males was .0810. We’d interpret the odds ratio as the odds of survival of males
decreased by a factor of .0810 when compared to females, holding all other
variables constant.
Linear regression vs logistic regression
Both linear and logistic regression are among the most popular models within
data science, and open-source tools, like Python and R, make the computation
for them quick and easy.
Linear regression models are used to identify the relationship between a
continuous dependent variable and one or more independent variables. When
there is only one independent variable and one dependent variable, it is known
as simple linear regression, but as the number of independent variables
increases, it is referred to as multiple linear regression. For each type of linear
regression, it seeks to plot a line of best fit through a set of data points, which
is typically calculated using the least squares method.
Similar to linear regression, logistic regression is also used to estimate the
relationship between a dependent variable and one or more independent
variables, but it is used to make a prediction about a categorical variable versus
a continuous one. A categorical variable can be true or false, yes or no, 1 or 0,
et cetera. The unit of measure also differs from linear regression as it produces
a probability, but the logit function transforms the S-curve into straight line.
While both models are used in regression analysis to make predictions about
future outcomes, linear regression is typically easier to understand. Linear
regression also does not require as large of a sample size as logistic regression
needs an adequate sample to represent values across all the response
categories. Without a larger, representative sample, the model may not have
sufficient statistical power to detect a significant effect
Types of logistic regression
There are three types of logistic regression models, which are defined based on
categorical response.
Binary logistic regression: In this approach, the response or dependent
variable is dichotomous in nature—i.e. it has only two possible outcomes
(e.g. 0 or 1). Some popular examples of its use include predicting if an e-
mail is spam or not spam or if a tumor is malignant or not malignant.
Within logistic regression, this is the most commonly used approach, and
more generally, it is one of the most common classifiers for binary
classification.
Multinomial logistic regression: In this type of logistic regression model,
the dependent variable has three or more possible outcomes; however,
these values have no specified order. For example, movie studios want
to predict what genre of film a moviegoer is likely to see to market films
more effectively. A multinomial logistic regression model can help the
studio to determine the strength of influence a person's age, gender, and
dating status may have on the type of film that they prefer. The studio
can then orient an advertising campaign of a specific movie toward a
group of people likely to go see it.
Ordinal logistic regression: This type of logistic regression model is
leveraged when the response variable has three or more possible
outcome, but in this case, these values do have a defined order.
Examples of ordinal responses include grading scales from A to F or
rating scales from 1 to 5.
A glimpse inside the mind of a data scientist - This link downloads a pdf
Logistic regression and machine learning
Within machine learning, logistic regression belongs to the family of supervised
machine learning models. It is also considered a discriminative model, which
means that it attempts to distinguish between classes (or categories). Unlike a
generative algorithm, such as naïve bayes, it cannot, as the name implies,
generate information, such as an image, of the class that it is trying to predict
(e.g. a picture of a cat).
Previously, we mentioned how logistic regression maximizes the log likelihood
function to determine the beta coefficients of the model. This changes slightly
under the context of machine learning. Within machine learning, the negative
log likelihood used as the loss function, using the process of gradient
descent to find the global maximum. This is just another way to arrive at the
same estimations discussed above.
Logistic regression can also be prone to overfitting, particularly when there is a
high number of predictor variables within the model. Regularization is typically
used to penalize parameters large coefficients when the model suffers from
high dimensionality.
Scikit-learn (link resides outside ibm.com) provides valuable documentation to
learn more about the logistic regression machine learning model.
Use cases of logistic regression
Logistic regression is commonly used for prediction and classification problems.
Some of these use cases include:
Fraud detection: Logistic regression models can help teams identify data
anomalies, which are predictive of fraud. Certain behaviors or
characteristics may have a higher association with fraudulent activities,
which is particularly helpful to banking and other financial institutions in
protecting their clients. SaaS-based companies have also started to
adopt these practices to eliminate fake user accounts from their datasets
when conducting data analysis around business performance.
Disease prediction: In medicine, this analytics approach can be used to
predict the likelihood of disease or illness for a given population.
Healthcare organizations can set up preventative care for individuals that
show higher propensity for specific illnesses.
Churn prediction: Specific behaviors may be indicative of churn in
different functions of an organization. For example, human resources
and management teams may want to know if there are high performers
within the company who are at risk of leaving the organization; this type
of insight can prompt conversations to understand problem areas within
the company, such as culture or compensation. Alternatively, the sales
organization may want to learn which of their clients are at risk of taking
their business elsewhere. This can prompt teams to set up a retention
strategy to avoid lost revenue.
Logistic Regression in Machine Learning
Logistic regression is a supervised machine learning algorithm used
for classification tasks where the goal is to predict the probability that an
instance belongs to a given class or not. Logistic regression is a statistical
algorithm which analyze the relationship between two data factors. The article
explores the fundamentals of logistic regression, it’s types and
implementations.
Table of Content
What is Logistic Regression?
Logistic Function – Sigmoid Function
Types of Logistic Regression
Assumptions of Logistic Regression
How does Logistic Regression work?
Code Implementation for Logistic Regression
Precision-Recall Tradeoff in Logistic Regression Threshold Setting
How to Evaluate Logistic Regression Model?
Differences Between Linear and Logistic Regression
What is Logistic Regression?
Logistic regression is used for binary classification where we use sigmoid
function, that takes input as independent variables and produces a probability
value between 0 and 1.
For example, we have two classes Class 0 and Class 1 if the value of the logistic
function for an input is greater than 0.5 (threshold value) then it belongs to
Class 1 otherwise it belongs to Class 0. It’s referred to as regression because it
is the extension of linear regression but is mainly used for classification
problems.
Key Points:
Logistic regression predicts the output of a categorical dependent
variable. Therefore, the outcome must be a categorical or discrete value.
It can be either Yes or No, 0 or 1, true or False, etc. but instead of giving
the exact value as 0 and 1, it gives the probabilistic values which lie
between 0 and 1.
In Logistic regression, instead of fitting a regression line, we fit an “S”
shaped logistic function, which predicts two maximum values (0 or 1).
Logistic Function – Sigmoid Function
The sigmoid function is a mathematical function used to map the
predicted values to probabilities.
It maps any real value into another value within a range of 0 and 1. The
value of the logistic regression must be between 0 and 1, which cannot
go beyond this limit, so it forms a curve like the “S” form.
The S-form curve is called the Sigmoid function or the logistic function.
In logistic regression, we use the concept of the threshold value, which
defines the probability of either 0 or 1. Such as values above the
threshold value tends to 1, and a value below the threshold values tends
to 0.
Types of Logistic Regression
On the basis of the categories, Logistic Regression can be classified into three
types:
1. Binomial: In binomial Logistic regression, there can be only two possible
types of the dependent variables, such as 0 or 1, Pass or Fail, etc.
2. Multinomial: In multinomial Logistic regression, there can be 3 or more
possible unordered types of the dependent variable, such as “cat”,
“dogs”, or “sheep”
3. Ordinal: In ordinal Logistic regression, there can be 3 or more possible
ordered types of dependent variables, such as “low”, “Medium”, or
“High”.
Assumptions of Logistic Regression
We will explore the assumptions of logistic regression as understanding these
assumptions is important to ensure that we are using appropriate application
of the model. The assumption include:
1. Independent observations: Each observation is independent of the other.
meaning there is no correlation between any input variables.
2. Binary dependent variables: It takes the assumption that the dependent
variable must be binary or dichotomous, meaning it can take only two
values. For more than two categories SoftMax functions are used.
3. Linearity relationship between independent variables and log odds: The
relationship between the independent variables and the log odds of the
dependent variable should be linear.
4. No outliers: There should be no outliers in the dataset.
5. Large sample size: The sample size is sufficiently large
Terminologies involved in Logistic Regression
Here are some common terms involved in logistic regression:
Independent variables: The input characteristics or predictor factors
applied to the dependent variable’s predictions.
Dependent variable: The target variable in a logistic regression model,
which we are trying to predict.
Logistic function: The formula used to represent how the independent
and dependent variables relate to one another. The logistic function
transforms the input variables into a probability value between 0 and 1,
which represents the likelihood of the dependent variable being 1 or 0.
Odds: It is the ratio of something occurring to something not occurring.
it is different from probability as the probability is the ratio of something
occurring to everything that could possibly occur.
Log-odds: The log-odds, also known as the logit function, is the natural
logarithm of the odds. In logistic regression, the log odds of the
dependent variable are modeled as a linear combination of the
independent variables and the intercept.
Coefficient: The logistic regression model’s estimated parameters, show
how the independent and dependent variables relate to one another.
Intercept: A constant term in the logistic regression model, which
represents the log odds when all independent variables are equal to
zero.
Maximum likelihood estimation: The method used to estimate the
coefficients of the logistic regression model, which maximizes the
likelihood of observing the data given the model.
How does Logistic Regression work?
The logistic regression model transforms the linear regression function
continuous value output into categorical value output using a sigmoid function,
which maps any real-valued set of independent variables input into a value
between 0 and 1. This function is known as the logistic function.
Let the independent input features be:
X=[x11 …x1mx21 …x2m ⋮⋱ ⋮ xn1 …xnm]X=x11 x21 ⋮xn1 ……⋱ …x1mx2m⋮ xnm
and the dependent variable is Y having only binary value i.e. 0 or 1.