FACULTY OF INFORMATION SYSTEMS
Course:
Web Data Analysis
(3 credits)
Lecturer: Nguyen Thon Da Ph.D.
Chapter 6
Supervised Learning:
Regression Analysis
Web Data Analysis :: Thon-Da Nguyen Ph.D. 1
MAIN CONTENTS
6.1 Linear regression
6.2 Multiple linear regression
6.3 Understanding multi-collinearity
6.4 Removing multi-collinearity
6.5 Dummy variables
6.6 Developing a linear regression model
6.7 Evaluating regression model performance
6.8 Fitting polynomial regression
6.9 Regression models for classification
6.10 Logistic regression
6.11 Implementing logistic regression using scikit-learn
6.1 Linear regression
Linear regression:
o A curve-fitting and prediction algorithm.
o Used to discover the linear association between a dependent (or target)
variable and one or more independent variables (or predictor variables).
6.1 Linear regression
The equation for the regression model:
y = β₀ + β₁x + ε
The parameters of linear regression are estimated using Ordinary Least
Squares (OLS).
OLS is a method that is widely used to estimate the regression intercept
and coefficients.
OLS minimizes the sum of squared residuals (errors), i.e., the differences
between the predicted and actual values.
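The OLS estimates can be computed directly from the closed-form formulas. Below is a minimal sketch with hypothetical toy data (values chosen so that y roughly follows 2x + 1):

```python
import numpy as np

# Hypothetical toy data, roughly following y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

# OLS closed-form estimates: slope = cov(x, y) / var(x), intercept from the means.
x_mean, y_mean = x.mean(), y.mean()
slope = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
intercept = y_mean - slope * x_mean

print(slope, intercept)  # 2.01 1.03
```

The estimated slope and intercept are the values that minimize the sum of squared residuals over this sample.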
6.2 Multiple linear regression
Multiple linear regression:
o A generalized form of simple linear regression.
o Used to predict a continuous target variable based on multiple
features or explanatory variables.
o The main objective of MLR is to estimate the linear relationship
between the multiple features and the target variable.
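A minimal MLR sketch with scikit-learn, using hypothetical noise-free data generated from y = 3x₁ + 2x₂ + 5, so the fitted coefficients can be checked against the known values:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generated from y = 3*x1 + 2*x2 + 5 (no noise).
X = np.array([[1, 2], [2, 1], [3, 3], [4, 2], [5, 4]], dtype=float)
y = 3 * X[:, 0] + 2 * X[:, 1] + 5

model = LinearRegression().fit(X, y)
print(model.coef_)       # recovers [3, 2]
print(model.intercept_)  # recovers 5
```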
6.3 Understanding multi-collinearity
Multi-collinearity
o occurs when independent variables are strongly correlated, making it
challenging to determine the individual effect of each variable.
o can lead to unstable regression coefficients and create uncertainty in
assessing variable importance.
o can be addressed by eliminating unimportant variables, employing
principal component analysis (PCA), or utilizing techniques such as
Ridge and Lasso regression to stabilize the model.
6.4 Removing multi-collinearity
Removing multi-collinearity
o The correlation coefficient (or correlation matrix) between independent
variables: detect multi-collinearity by checking the magnitude of the
correlation coefficients.
o Variance Inflation Factor (VIF): quantifies the degree of multi-collinearity,
indicating how much a variable's variance is inflated due to correlations
with other variables in a multiple regression model.
o Eigenvalues: used to assess the degree of multi-collinearity and identify
which independent variables contribute significantly to this phenomenon.
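The VIF can be computed by hand from its definition, VIF_j = 1 / (1 − R²_j), where R²_j is the R-squared of regressing column j on the remaining columns. A sketch using synthetic data with one deliberately collinear pair (variable names and data are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    """VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing
    column j on the remaining columns."""
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        vifs.append(1.0 / (1.0 - r2))
    return vifs

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)  # nearly collinear with x1
x3 = rng.normal(size=100)                  # independent
X = np.column_stack([x1, x2, x3])

print(vif(X))  # first two VIFs are large, third is near 1
```

A common rule of thumb is that a VIF above 5 or 10 signals problematic multi-collinearity; here the collinear pair far exceeds that threshold while the independent variable does not.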
6.5 Dummy variables
Dummy variables:
o categorical independent variables used in regression analysis.
o also called Boolean, indicator, qualitative, or binary variables.
o a categorical variable with N distinct values is converted into N–1
dummy variables.
o take only the binary values 1 and 0, which correspond to presence
and absence.
o pandas offers the get_dummies() function to generate the dummy
variables.
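A minimal get_dummies() sketch with a hypothetical DataFrame; drop_first=True keeps N–1 dummies for N categories, avoiding the dummy-variable trap:

```python
import pandas as pd

# Hypothetical data with one categorical column.
df = pd.DataFrame({"city": ["Hanoi", "Hue", "Saigon", "Hue"],
                   "sales": [10, 7, 12, 8]})

# drop_first=True drops the first category (here "Hanoi"),
# so N categories yield N-1 dummy columns.
dummies = pd.get_dummies(df, columns=["city"], drop_first=True)
print(dummies)
```

The resulting frame keeps the numeric column and adds indicator columns such as city_Hue and city_Saigon; the dropped category is represented by all dummies being 0.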
6.6 Developing a linear regression model
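The original slides' code for this section was not preserved; the following is a sketch of a typical scikit-learn workflow (synthetic data, hypothetical coefficients): split the data, fit a LinearRegression model, and score it on the held-out set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): target depends linearly on two features.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 4 * X[:, 0] - 2 * X[:, 1] + 1 + rng.normal(scale=0.5, size=200)

# Hold out 30% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
print("R² on test set:", model.score(X_test, y_test))
```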
6.7 Evaluating regression model performance
R-squared (or coefficient of determination):
o a statistical model evaluation measure that assesses the goodness of
fit of a regression model.
o helps data analysts explain model performance compared to the base
model (predicting the mean).
o its value usually lies between 0 and 1.
o a negative value means the model performs worse than the mean-based
baseline model.
6.7 Evaluating regression model performance
R-squared (or coefficient of determination):
o Sum of Squares Regression (SSR): the difference between the forecasted
values and the mean of the data.
o Sum of Squared Errors (SSE): the difference between the original or
genuine values and the forecasted values.
o Total Sum of Squares (SST): the difference between the original or
genuine values and the mean of the data.
R² = SSR / SST = 1 − SSE / SST
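R² computed by hand from SSE and SST matches scikit-learn's r2_score; a short sketch with hypothetical true/predicted values:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.2])

sse = ((y_true - y_pred) ** 2).sum()          # sum of squared errors
sst = ((y_true - y_true.mean()) ** 2).sum()   # total sum of squares
r2_manual = 1 - sse / sst

print(r2_manual, r2_score(y_true, y_pred))    # the two values agree
```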
6.7 Evaluating regression model performance
Other common error metrics:
o MSE (Mean Squared Error): the average of the squared differences
between the actual and predicted values.
o MAE (Mean Absolute Error): the average of the absolute differences
between the actual and predicted values.
o RMSE (Root Mean Squared Error): the square root of the MSE; it is
expressed in the same units as the target variable.
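The three error metrics are available in scikit-learn (RMSE as the square root of MSE); a sketch with the same hypothetical values as above:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actual and predicted values.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.2])

mse = mean_squared_error(y_true, y_pred)   # average squared error
mae = mean_absolute_error(y_true, y_pred)  # average absolute error
rmse = np.sqrt(mse)                        # same units as the target

print(mse, mae, rmse)  # 0.045 0.2 0.2121...
```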
6.8 Fitting polynomial regression
Polynomial regression
o is used to model nonlinear relationships between dependent and
independent variables.
o the relationship is modeled as an nth-degree polynomial.
o helps recognize the growth rate of various phenomena.
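Polynomial regression can be sketched in scikit-learn by combining PolynomialFeatures with LinearRegression in a pipeline (synthetic quadratic data, hypothetical noise level):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical quadratic relation: y = x² + noise.
rng = np.random.default_rng(1)
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(scale=0.2, size=50)

# Degree-2 polynomial regression: expand features, then fit OLS.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print("R²:", model.score(X, y))
```

A plain linear fit would perform poorly on this data; the degree-2 expansion lets the linear model capture the curvature.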
6.9 Regression models for classification
Classification involves dividing up objects so that each is assigned to
one of a number of mutually exhaustive and exclusive categories
known as classes.
The term ‘mutually exhaustive and exclusive’ simply means that
each object must be assigned to precisely one class.
Many practical decision-making tasks can be formulated as
classification problems.
Logistic regression is one of the classification methods, although its
name ends with "regression". It is a commonly used binary
classification method and a basic ML algorithm for all kinds of
classification problems. It finds the association between a dependent
(or target) variable and a set of independent variables (or features).
6.10 Logistic regression
Logistic regression
o a kind of supervised machine learning algorithm that is used to
forecast a binary outcome and classify observations.
o its dependent variable is a binary variable with 2 classes: 0 or 1.
o computes the log of the odds ratio of the target variable, which
represents the probability of occurrence of an event.
o can be seen as an extension of linear regression for a categorical
target: it applies the sigmoid function to the output of linear regression.
o multiple-class problems are handled by multinomial logistic regression,
a modification of logistic regression that uses the softmax function
instead of the sigmoid activation function.
6.10 Logistic regression
The sigmoid function is also known as the logistic function or the S-shaped curve.
It maps input values into the range between 0 and 1, which represents the
probability of occurrence of an event:
σ(z) = 1 / (1 + e⁻ᶻ)
o As the input approaches +∞, the output approaches 1.
o As the input approaches −∞, the output approaches 0.
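The sigmoid's behavior at the extremes can be checked directly; a minimal sketch:

```python
import numpy as np

def sigmoid(z):
    """Logistic (S-shaped) function: maps any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))    # 0.5 — the decision boundary
print(sigmoid(10))   # close to 1
print(sigmoid(-10))  # close to 0
```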
6.10 Logistic regression
The following formula shows the logistic regression equation:
log(p / (1 − p)) = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ
The term inside the log() function is the odds ratio, or "odds": the ratio of
the probability that an event occurs to the probability that it does not
occur.
6.10 Logistic regression
Characteristics of the logistic regression model
o The dependent or target variable should be binary in nature.
o There should be no multicollinearity among the independent
variables.
o Coefficients are estimated using maximum likelihood.
o The target variable follows a Bernoulli distribution.
o There is no R-squared for model evaluation; the model is typically
evaluated using concordance and KS statistics instead.
6.10 Logistic regression
Types of logistic regression algorithms
Binary logistic regression model:
The dependent or target column has only two values.
Multinomial logistic regression model:
A dependent or target column has three or more than
three values.
Ordinal logistic regression model:
A dependent variable will have ordinal or sequence classes.
6.10 Logistic regression
Advantages of logistic regression
o not only provides a prediction (0 or 1) but also gives the probabilities
of outcomes, which helps us understand the confidence of a
prediction.
o easy to implement, easy to understand, and interpretable.
Disadvantages of logistic regression
o adding many independent variables increases the amount of variance
explained, which can lead to overfitting.
o cannot capture non-linear relationships.
o does not perform well with highly correlated feature variables
(or independent variables).
6.11 Implementing logistic regression using scikit-learn
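The original slides' code for this section was not preserved; the following sketch shows a typical scikit-learn logistic regression workflow on a built-in binary dataset (the dataset choice and max_iter value are assumptions, not from the slides):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Binary classification on a built-in dataset (hypothetical choice).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# max_iter is raised because the unscaled features slow convergence.
clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

# predict_proba returns the class probabilities, not just the labels.
print("Class probabilities for first test sample:", clf.predict_proba(X_test[:1]))
```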