Slide 6

This document covers Chapter 6 of a course on Web Data Analysis, focusing on supervised learning and regression analysis. It includes topics such as linear regression, multiple linear regression, multi-collinearity, dummy variables, and logistic regression, along with methods for evaluating model performance. The chapter also discusses the implementation of logistic regression using scikit-learn.


FACULTY OF INFORMATION SYSTEMS

Course:
Web Data Analysis
(3 credits)

Lecturer: Nguyen Thon Da Ph.D.



Chapter 6
Supervised Learning:
Regression Analysis

Web Data Analysis :: Thon-Da Nguyen Ph.D. 1


MAIN CONTENTS
6.1 Linear regression
6.2 Multiple linear regression
6.3 Understanding multi-collinearity
6.4 Removing multi-collinearity
6.5 Dummy variables
6.6 Developing a linear regression model
6.7 Evaluating regression model performance
6.8 Fitting polynomial regression
6.9 Regression models for classification
6.10 Logistic regression
6.11 Implementing logistic regression using scikit-learn



6.1 Linear regression
 Linear regression:
 a kind of curve-fitting and prediction algorithm.
 is used to discover the linear association between a dependent (or target) variable and one or more independent variables (or predictor variables).


6.1 Linear regression
 The equation for the simple linear regression model: y = β0 + β1x + ε, where β0 is the intercept, β1 the slope, and ε the error term.
 The parameters of linear regression are estimated using OLS (Ordinary Least Squares).
 OLS is a widely used method for estimating the regression intercept and coefficients.
 OLS minimizes the sum of squared residuals (or errors), the differences between the predicted and actual values.
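The estimation described above can be sketched in a few lines of numpy. The data points below are made up for illustration (roughly following y = 2 + 3x), and the closed-form formulas are the standard OLS estimates for simple regression:

```python
import numpy as np

# Toy data, roughly y = 2 + 3x with a little noise (values are illustrative)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 7.9, 11.2, 13.8, 17.0])

# OLS closed form for simple linear regression:
#   slope = cov(x, y) / var(x),  intercept = mean(y) - slope * mean(x)
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# The residuals are what OLS minimizes (as a sum of squares)
residuals = y - (intercept + slope * x)
```

On this toy data the estimates land close to the generating values (slope near 3, intercept near 2), as expected when the noise is small.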
6.2 Multiple linear regression
 Multiple linear regression:
o A generalized form of simple linear regression.
o Is used to predict a continuous target variable based on multiple features or explanatory variables.
o The main objective of MLR is to estimate the linear relationship between the multiple features and the target variable.


6.3 Understanding multi-collinearity
 Multi-collinearity
 occurs when there is strong correlation among independent variables, making it difficult to determine the individual effect of each variable.
 can lead to unstable regression coefficients and create uncertainty in assessing variable importance.
 typically arises from high correlations between independent variables.
 to address multi-collinearity: eliminate unimportant variables, apply principal component analysis (PCA), or use techniques such as Ridge and Lasso regression to stabilize the model.
6.4 Removing multi-collinearity
 Removing multi-collinearity
o The correlation matrix between independent variables: multi-collinearity can be detected by checking the magnitude of the correlation coefficients.
o Variance Inflation Factor (VIF): quantifies the degree of multi-collinearity, indicating how much a variable's variance is inflated by its correlations with the other variables in a multiple regression model.
o Eigenvalues: used to assess the degree of multi-collinearity and identify which independent variables contribute significantly to it.
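The VIF for column j is 1 / (1 - R²_j), where R²_j comes from regressing column j on the remaining columns. A minimal sketch with plain numpy on synthetic data (the column names and the 0.9 coefficient are made up; statsmodels also provides a ready-made `variance_inflation_factor`):

```python
import numpy as np

def vif(X, j):
    """VIF for column j: regress X[:, j] on the other columns (with an
    intercept) and return 1 / (1 - R^2)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid.dot(resid) / ((y - y.mean()) ** 2).sum()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.1, size=200)  # nearly collinear with x1
x3 = rng.normal(size=200)                        # independent of the others
X = np.column_stack([x1, x2, x3])

vifs = [vif(X, j) for j in range(X.shape[1])]
```

A VIF above roughly 5-10 is commonly taken as a sign of problematic multi-collinearity; here x1 and x2 should score far above that threshold while x3 stays near 1.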


6.5 Dummy variables
 Dummy variables:
 encode categorical independent variables for use in regression analysis.
 also called Boolean, indicator, qualitative, categorical, or binary variables.
 convert a categorical variable with N distinct values into N-1 dummy variables.
 take only the binary values 1 and 0, corresponding to presence and absence.
 pandas offers the get_dummies() function to generate the dummy variables.
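A short sketch of the get_dummies() approach on made-up city data (the column name and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"city": ["Hanoi", "Hue", "Da Nang", "Hanoi"]})

# drop_first=True keeps N-1 dummies for N categories, which avoids the
# "dummy variable trap" (perfect multi-collinearity with the intercept)
dummies = pd.get_dummies(df["city"], drop_first=True)
```

With three distinct cities this yields two dummy columns; the dropped category ("Da Nang", the first alphabetically) becomes the baseline encoded as all zeros.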


6.6 Developing a linear regression model
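The original slides for this section are code screenshots that are not reproduced here. A minimal sketch of the typical workflow with scikit-learn, on synthetic data (the generating coefficients 4, 2, and -3, the split sizes, and the random seeds are all made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the course dataset:
# target = 4 + 2*f1 - 3*f2 + small noise
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 4 + 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Hold out a test set so performance is measured on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

model = LinearRegression().fit(X_train, y_train)
r2 = model.score(X_test, y_test)  # R-squared on the held-out data
```

Because the noise is small, the fitted intercept and coefficients should land close to the generating values and R-squared should be near 1.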


6.7 Evaluating regression model performance
 R-squared (or coefficient of determination):
 a statistical evaluation measure that assesses the goodness of fit of a regression model.
 helps data analysts explain model performance relative to a baseline model that always predicts the mean.
 its value typically lies between 0 and 1.
 a negative value means the model performs worse than the mean baseline model.


6.7 Evaluating regression model performance
 R-squared (or coefficient of determination):
 Sum of Squares Regression (SSR): the difference between the forecasted values and the mean of the data.
 Sum of Squared Errors (SSE): the difference between the actual (or observed) values and the forecasted values.
 Total Sum of Squares (SST): the difference between the actual (or observed) values and the mean of the data.
 For OLS with an intercept, SST = SSR + SSE, so R-squared = SSR / SST = 1 - SSE / SST.
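The three sums of squares can be computed directly. The actual and predicted values below are made up for illustration (note the SST = SSR + SSE identity only holds exactly for predictions from a fitted OLS model, not for arbitrary predictions like these):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # actual values (illustrative)
y_pred = np.array([2.8, 5.3, 6.9, 9.2])   # predicted values (illustrative)
y_mean = y_true.mean()

sse = np.sum((y_true - y_pred) ** 2)  # errors: actual vs. predicted
ssr = np.sum((y_pred - y_mean) ** 2)  # regression: predicted vs. mean
sst = np.sum((y_true - y_mean) ** 2)  # total variation around the mean

r_squared = 1 - sse / sst
```

Here the predictions track the actual values closely, so SSE is tiny relative to SST and R-squared comes out just below 1.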
6.7 Evaluating regression model performance
 MSE (Mean Squared Error): the average of the squared differences between the actual and predicted values.
 MAE (Mean Absolute Error): the average of the absolute differences between the actual and predicted values.
 RMSE (Root Mean Squared Error): the square root of the MSE, expressed in the same units as the target variable.
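These metrics are available in scikit-learn; a small sketch on made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 7.0, 9.0])  # actual values (illustrative)
y_pred = np.array([2.5, 5.5, 7.0, 8.0])  # predicted values (illustrative)

mse = mean_squared_error(y_true, y_pred)   # average squared error
mae = mean_absolute_error(y_true, y_pred)  # average absolute error
rmse = np.sqrt(mse)                        # back in the target's units
```

MSE penalizes large errors more heavily than MAE (note the error of 1 contributes 1 to the squared sum but only 1 to the absolute sum, while the 0.5 errors contribute 0.25 each), which is why the two can rank models differently.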




6.8 Fitting polynomial regression
 Polynomial regression
 is used to model nonlinear relationships between the dependent and independent variables.
 the relationship is modeled as an nth-degree polynomial of the predictor.
 is often used to capture the growth rate of various phenomena.
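A common way to fit this is to expand the predictor into polynomial features and then apply an ordinary linear model. A sketch on noise-free quadratic data (the curve y = 1 + 2x + 3x² is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Quadratic data: y = 1 + 2x + 3x^2 (illustrative, noise-free)
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 1 + 2 * x.ravel() + 3 * x.ravel() ** 2

# Expand x into the columns [x, x^2], then fit a plain linear model on them
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(x)

model = LinearRegression().fit(X_poly, y)
```

The model is still linear in its parameters; only the features are nonlinear, so the fitted intercept and coefficients recover 1, 2, and 3 exactly on this noise-free data.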



6.9 Regression models for classification
 Classification involves dividing up objects so that each is assigned to one of a number of mutually exclusive and exhaustive categories known as classes.
 The term 'mutually exclusive and exhaustive' simply means that each object must be assigned to precisely one class.
 Many practical decision-making tasks can be formulated as classification problems.
 Logistic regression is one of the classification methods, although its name ends with "regression". It is a commonly used binary classification method and a basic ML algorithm for many kinds of classification problems. It finds the association between the dependent (or target) variable and a set of independent variables (or features).

6.10 Logistic regression
 Logistic regression
 a kind of supervised machine learning algorithm used to forecast a binary outcome and classify observations.
 its dependent variable is a binary variable with two classes: 0 or 1.
 models the log of the odds of the target variable, which represents the probability of occurrence of an event.
 extends linear regression to the case where the dependent (or target) variable is categorical.
 applies the sigmoid function to the prediction of the underlying linear model.
 for multiple-class problems: multinomial logistic regression.
 Multinomial logistic regression is a modification of logistic regression; it uses the softmax function instead of the sigmoid activation function.
6.10 Logistic regression
The sigmoid function is also known as the logistic function or the S-shaped curve.
It maps any input value into the range (0, 1), which represents the
probability of occurrence of an event.
As the input approaches +∞, the output approaches 1.
As the input approaches -∞, the output approaches 0.
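The behavior described above is easy to check numerically; a minimal sketch of the sigmoid σ(z) = 1 / (1 + e^(-z)):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Large negative inputs approach 0, zero maps to exactly 0.5,
# and large positive inputs approach 1
values = sigmoid(np.array([-10.0, 0.0, 10.0]))
```

The midpoint σ(0) = 0.5 is what makes 0.5 the natural default classification threshold on the probability scale.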


6.10 Logistic regression
The following formula shows the logistic regression equation:

log(p / (1 - p)) = β0 + β1x1 + β2x2 + … + βnxn

The term inside the log() function is called the odds (or odds ratio): the ratio of the probability that an event occurs to the probability that it does not occur.



6.10 Logistic regression
 Characteristics of the logistic regression model
 The dependent or target variable should be binary in nature.
 There should be no multi-collinearity among the independent variables.
 Coefficients are estimated using maximum likelihood.
 Logistic regression assumes the target follows a Bernoulli distribution.
 There is no conventional R-squared for model evaluation.
 Model performance is instead evaluated using measures such as concordance and the KS statistic.


6.10 Logistic regression
 Types of logistic regression algorithms
 Binary logistic regression model: the dependent or target column has only two values.
 Multinomial logistic regression model: the dependent or target column has three or more values.
 Ordinal logistic regression model: the dependent variable has ordered (ordinal) classes.


6.10 Logistic regression
 Advantages of logistic regression
 not only provides a prediction (0 or 1) but also gives the probabilities of outcomes, which helps us understand the confidence of a prediction.
 easy to implement, easy to understand, and interpretable.
 Disadvantages of logistic regression
 adding many independent variables inflates the amount of variance explained, which can lead to overfitting.
 cannot capture non-linear relationships.
 does not perform well with highly correlated feature variables (or independent variables).
6.11 Implementing logistic regression using scikit-learn
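The code shown on the original slides is a screenshot and is not reproduced here. A minimal sketch of the scikit-learn workflow on synthetic binary-classification data (the decision rule f1 + f2 > 0, the split size, and the seeds are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the course dataset:
# class 1 when f1 + f2 is positive, class 0 otherwise
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

clf = LogisticRegression().fit(X_train, y_train)

pred = clf.predict(X_test)            # hard 0/1 labels
proba = clf.predict_proba(X_test)     # per-class probabilities (rows sum to 1)
acc = accuracy_score(y_test, pred)
```

As the earlier advantages slide noted, predict_proba exposes the model's confidence, not just the label; on this linearly separable data accuracy should come out well above 90%.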
