Why Generalized Linear Models?
Welcome to the course on Advanced Regression Models.
In the previous course on Regression, you have seen how to predict a dependent
variable from one or more independent variables.
The dependent variable was numeric and the errors were assumed to be normally distributed.
What if your dependent variable is discrete (0 or 1) or count data, and the
error terms do not follow a normal distribution?
Do you think your prediction would still be good with a Linear Model?
Limitations of Linear Models
Linear Models are the most widely used models in Statistics.
But they come with their own limitations.
Not proficient in handling Binary Data
Not accurate when count data (number of footfalls, number of pages visited, etc.)
is involved.
Some variables have a constraint of being strictly positive.
To fix some of these problems, we can go for transformations.
In some scenarios, transformation reduces interpretability, so we have to look for
other alternatives.
GLM
To overcome some of the limitations of Linear Models, we can go for Generalized
Linear Models (GLMs).
In GLMs, the modeling is done on the scale in which the data was recorded.
GLMs honor the known assumptions of the data.
GLM Components
GLMs comprise 3 components:
Random Component: the probability distribution of the data that describes the
randomness / errors.
Systematic Component: the linear predictor built from the covariates and their
coefficients.
Link Function: connects the mean of the response to the linear predictor.
GLM Representation
The first equation describes the Random Component; here it is the Gaussian
distribution.
The second equation is the Systematic Component, which contains the covariates and
the coefficients. This is the linear predictor.
The third equation is the Link Function, which connects the mean of the random
component to the linear predictor.
This set of equations is a generic representation of the Generalized Linear Model.
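Written out, the three components can be sketched generically as follows (a sketch assuming a Gaussian random component, as in the card above):
Y ∼ N(μ, σ²)                        (Random Component)
η = β0 + β1x1 + ... + βkxk          (Systematic Component / linear predictor)
g(μ) = η                            (Link Function connecting the mean to the linear predictor)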
Types of Generalized Models
Logistic Regression: used for predicting binary outcomes.
Poisson Regression: used for predicting count data (# of footfalls, # of hits on a website).
Next in GLM ...
In the following topics, you will understand GLMs in detail.
You will understand Logistic and Poisson Regression with some applications.
You will first learn the statistical aspects and later understand them from a Machine
Learning perspective.
You can try out some examples in Python.
Understanding the Logistic Function
Positive values of the coefficients favor class 1.
Positive coefficient values increase the linear predictor, thereby increasing
the probability of y = 1.
Negative values of the coefficients favor class 0.
Negative coefficient values decrease the linear predictor, thereby decreasing
the probability of y = 1 (see the sketch below).
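A minimal sketch of the logistic (sigmoid) function, with hypothetical coefficient values, shows how the sign of a coefficient moves P(y = 1):
import numpy as np

def logistic(z):
    # maps the linear predictor z to a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

x = 2.0                     # an illustrative feature value
beta0, beta1 = 0.5, 1.2     # hypothetical intercept and coefficient
print(logistic(beta0 + beta1 * x))   # positive coefficient pushes P(y=1) towards 1
print(logistic(beta0 - beta1 * x))   # negative coefficient pushes P(y=1) towards 0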
Regression Coefficients
The coefficients β0, β1 and β2 are selected in such a way that they:
Predict a high probability when the actual outcome is y = 1
Predict a low probability when the actual outcome is y = 0
Odds Ratio
Odds = p(y=1) / p(y=0)
Odds > 1 if y = 1 is more likely
Odds < 1 if y = 0 is more likely
Odds = 1 if both outcomes are equally likely
The Logit
Odds = e^(β0 + β1x1 + ... + βkxk)
log(Odds) = β0 + β1x1 + ... + βkxk
This is called the logit, and it looks like linear regression.
The bigger the logit, the bigger P(y = 1).
Positive beta values increase the logit, increasing the odds.
Negative beta values decrease the logit, decreasing the odds.
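The relationship between the logit, the odds and the probability can be checked with a small sketch (the coefficient values are hypothetical):
import numpy as np

beta0, beta1 = 0.5, 1.2     # hypothetical coefficients
x1 = 2.0                    # an illustrative feature value

logit = beta0 + beta1 * x1  # log(Odds)
odds = np.exp(logit)        # Odds = e^(beta0 + beta1*x1)
p = odds / (1 + odds)       # P(y = 1)
print(logit, odds, p)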
Baseline Method
For Logistic Regression / Binary Classification the baseline method is to predict
the most frequent outcome.
The output of logistic regression is a probability value. To separate the 1s and 0s you have to identify a
threshold value.
Values above the threshold will be marked 1 and values below will be marked 0.
Choosing the right threshold value is important (see the sketch below).
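For example, assuming an array of predicted probabilities, a threshold of 0.5 (an illustrative choice) can be applied like this:
import numpy as np

probs = np.array([0.10, 0.45, 0.60, 0.85])   # hypothetical predicted probabilities
threshold = 0.5
labels = (probs >= threshold).astype(int)    # values above the threshold become 1
print(labels)                                # [0 0 1 1]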
ROC Curve
ROC stands for Receiver Operating Characteristic.
It is a graphical way of assessing how well the model fits.
The true positive rate is plotted on the vertical axis.
The false positive rate is plotted on the horizontal axis.
ROC Curve Properties
The ROC Curve captures all the threshold values.
It also helps to ...
Choose the best threshold for the best trade-off
Assess the cost of failing to detect positives (false negatives)
Assess the cost of raising false alarms (false positives)
A sketch of plotting the curve follows.
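A minimal sketch of drawing the curve with scikit-learn, assuming a fitted binary classifier clf and held-out data X_test, y_test:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

probs = clf.predict_proba(X_test)[:, 1]          # probability of class 1
fpr, tpr, thresholds = roc_curve(y_test, probs)  # one point per threshold
plt.plot(fpr, tpr, label='AUC = %.2f' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], linestyle='--')         # reference line for a random model
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()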
Confusion Matrix is a tabular way of representing the model performance. It has:
Actual outcomes along the rows
Predicted outcomes along the columns
Many Metrics can be derived from the values in a confusion matrix
Sensitivity is the true positive (TP) rate.
sensitivity = TP / (TP + FN)
Specificity is the true negative (TN) rate.
specificity = TN / (TN + FP)
The Story So far ...
You have learnt what Generalized Linear Models are.
You have learnt how the Random Component is linked to the linear predictor using the Link
Function.
What the logit is, and how the parameters affect it.
How to measure the accuracy of a logistic regression model.
You will now learn how to implement these ideas using Python.
Data and Code
The code below is a simple demonstration of how GLMs are implemented in Python.
A dataset is created with the scores a team got and whether it won or lost the respective game.
This is to illustrate how the score helps us predict the binary outcome
win/lose.
import pandas as pd
import statsmodels.api as sm

# (Score, Win) pairs: the score a team got and whether it won (1) or lost (0)
Scores = [(200,1),(100,0),(150,1),(320,1),(270,1),(134,0),(322,1),(140,0),(210,0),(199,0)]
Labels = ['Score','Win']
df = pd.DataFrame.from_records(Scores, columns=Labels)

# add_constant adds an intercept term; GLM does not add one automatically
glm_binom = sm.GLM(df.Win, sm.add_constant(df.Score), family=sm.families.Binomial())
res = glm_binom.fit()
print(res.summary())
Sample Output
The value of the Score coefficient tells us to what extent the score is able to
predict the likelihood of winning a game.
The rest of the values are a standard outcome of a regression equation.
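To read the coefficient on the odds scale, the estimate can be exponentiated; a small sketch assuming the res object from the fit above:
import numpy as np

# exp(coef) is the multiplicative change in the odds of winning
# for a one-unit increase in Score
print(np.exp(res.params))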
Sample Data
In the previous example you have seen how to fit a GLM using the statsmodels package.
In this example you will learn how to fit a Logistic Regression using scikit-learn.
The previous example was a statistical perspective. The current example will give a
machine learning perspective.
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=2,
                           n_informative=2, n_redundant=0,
                           n_clusters_per_class=1,
                           class_sep=2.0, random_state=101)
The above code creates a sample dataset for a binary classification problem.
Two informative features are created, and each observation is assigned to one of the two classes.
Plotting the Data
%matplotlib inline
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], marker='o', c=y,
            linewidth=0, edgecolor=None)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
The above code plots the input data, colored by class.
Splitting the Data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y.astype(float), test_size=0.33, random_state=101)
In this code we are splitting the data into training and test sets.
Logistic Regression Model Fit
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

clf = LogisticRegression()
clf.fit(X_train, y_train.astype(int))
y_clf = clf.predict(X_test)
print(classification_report(y_test, y_clf))
The code above explains how to fit a logistic regression model and view the
classification report.
Results of Model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
clf = LogisticRegression()
clf.fit(X_train, y_train.astype(int))
y_clf = clf.predict(X_test)
print(classification_report(y_test, y_clf))
             precision    recall  f1-score   support

        0.0       1.00      0.93      0.97        15
        1.0       0.95      1.00      0.97        18

avg / total       0.97      0.97      0.97        33
Precision = TP / (TP + FP)
Precision looks only at the observations predicted as positive and measures how many of them are actually positive.
Recall = TP / (TP + FN)
Recall looks only at the actual positive observations and measures how many of them were correctly identified.
Confusion Matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_clf)
array([[14, 1],
[ 0, 18]])
Accuracy for this model is (14+18) / (14+1+0+18) = 0.96969
Sensitivity and Specificity
Sensitivity for the model is 100% (18/18).
Specificity for the model is about 93% (14/15).
Based on these numbers, we can interpret that the model is able to clearly separate
the data into the 2 classes.
The model also correctly labels almost all of the observations that do not belong to
the positive class as negative.
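These numbers can be verified directly from the confusion matrix above; a minimal sketch assuming the y_test and y_clf arrays from the earlier cards:
from sklearn.metrics import confusion_matrix

# ravel() returns the counts in the order tn, fp, fn, tp for binary labels
tn, fp, fn, tp = confusion_matrix(y_test, y_clf).ravel()
sensitivity = tp / float(tp + fn)    # 18 / (18 + 0)  = 1.00
specificity = tn / float(tn + fp)    # 14 / (14 + 1) ≈ 0.93
print(sensitivity, specificity)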
Data Prep for Hands On
from sklearn import datasets
iris = datasets.load_iris()
iris_X = iris.data
iris_y = iris.target
print(iris_X)
print(iris_y)
Use the above commands to load the dataset and the required variables.
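As a starting point for the hands-on, a minimal sketch of splitting this data and fitting a logistic regression (the parameter choices are illustrative):
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(
    iris_X, iris_y, test_size=0.33, random_state=101)

clf = LogisticRegression()
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))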
Why Poisson Regression ?
One of the underlying assumptions of Linear Regression is that the error terms
follow a normal distribution.
When the error terms do not follow a normal distribution, we go for other types of
regression.
When we try to model count data (number of footfalls, traffic on a website), we go
for Poisson Regression.
Where is Poisson Regression Used ?
Poisson Regression is used to model count data
Number of footfalls, number of call drops, etc.
In mathematical terms, Poisson Regression models the logarithm of the expected
count.
Variables in Poisson Regression Equation
The dependent variable Y represents a count; sometimes Y/t is used, signifying a rate.
The independent variables are categorical or continuous, depending on the
dataset.
GLM
Link function: g(μ) = β0 + β1x1 + β2x2 + … + βkxk = xᵢᵀβ
Random component: the response Y has a Poisson distribution, that is yᵢ ∼ Poisson(μᵢ) for
i = 1, ..., N, where the expected count of yᵢ is E(yᵢ) = μᵢ.
Systematic component: any set of independent variables X = (X1, X2, …, Xk).
Link Function
Identity link: μ = β0 + β1x1
On some occasions the identity link function is used in Poisson regression; here
the random component is still the Poisson distribution.
Natural log link: log(μ) = β0 + β1x1
The Poisson regression model for counts is occasionally referred to as a "Poisson
loglinear model".
For simplicity, with a single independent variable, we can write: log(μ) = α + βx. This
is equivalent to: μ = exp(α + βx) = exp(α)exp(βx)
Interpreting Parameters
Interpreting the estimated parameters:
exp(α) = the mean of Y, that is μ, when X = 0
exp(β) = with every unit increase in X, the predictor variable has a multiplicative
effect of exp(β) on the mean of Y, that is μ
If β = 0, then exp(β) = 1, the expected count is μ = E(y) = exp(α), and Y and X
are not related.
If β > 0, then exp(β) > 1, and the expected count μ = E(y) is exp(β) times larger
than when X = 0
If β < 0, then exp(β) < 1, and the expected count μ = E(y) is smaller (multiplied by
exp(β)) than when X = 0
A numeric sketch follows.
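A quick numeric illustration (the values of α and β below are hypothetical):
import numpy as np

alpha, beta = 1.5, 0.05     # hypothetical estimates from a fitted model
print(np.exp(alpha))        # about 4.48: the expected count when X = 0
print(np.exp(beta))         # about 1.05: each unit increase in X multiplies
                            # the expected count by ~1.05 (a ~5% increase)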
Poisson Regression for Rate
The set of equations in the above cards is also applicable to rate data, Y/t.
Y is the count and t is the time (or exposure) over which the count was observed;
a sketch of fitting such a model follows.
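In statsmodels, the exposure can be passed to the Poisson GLM; with the log link this is equivalent to adding log(t) as an offset. A minimal sketch, assuming counts y, covariates X (with a constant column) and an exposure array t:
import statsmodels.api as sm

# the exposure t makes the model describe the rate y / t
rate_glm = sm.GLM(y, X, family=sm.families.Poisson(), exposure=t)
rate_res = rate_glm.fit()
print(rate_res.summary())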
Invoking the required Libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
For the following code we will need the pandas, numpy and statsmodels libraries.
Create Data Frame
dataset = pd.DataFrame({'A': np.random.rand(100) * 1000,
                        'B': np.random.rand(100) * 100,
                        'C': np.random.rand(100) * 10,
                        'target': np.random.randint(0, 5, 100)})
We are creating a sample dataframe; the variables are random numbers.
The dependent variable signifies count data.
The independent variables are random numbers.
This exercise is just to familiarize you with Poisson Regression.
Split the Variables
X = dataset[['A','B','C']].copy()
X['constant'] = 1    # intercept column; GLM does not add one automatically
y = dataset['target']
The code splits the dependent and the independent variables and adds a constant
(intercept) column.
Model Fitting
fam = sm.families.Poisson()
pois_glm = sm.GLM(y,X, family=fam)
pois_res = pois_glm.fit()
pois_res.summary()
Using the above code, we fit the model and view the results.
Interpreting the results
On viewing the results and the coefficient values, we can say to what extent each
coefficient explains the log of the count data, i.e. the dependent variable.
The rest of the values are what a standard regression output shows.
Advanced Models
In this topic you will learn some advanced regression models.
You will understand when to and when not to apply a specific regression model.
Bayesian Vs Linear Regression
Bayesian Regression is similar to Linear Regression in many ways.
In Linear Regression the output is a single number / value.
In Bayesian Regression the output is also a value, but the model additionally returns
an entire probability distribution.
How is the Probability Distribution constructed?
Here, both the predicted value and a variance value are returned.
With the predicted value as the mean and the square root of the variance as the
standard deviation, the probability distribution can be constructed.
Bayesian Regression in Python
from sklearn import linear_model

regr = linear_model.BayesianRidge()
regr.fit(X, y)
Out:
BayesianRidge(alpha_1=1e-06, alpha_2=1e-06, compute_score=False,
copy_X=True, fit_intercept=True, lambda_1=1e-06,
lambda_2=1e-06, n_iter=300, normalize=False,
tol=0.001, verbose=False)
The above code is a sample that shows how to fit a Bayesian Regression model using Python.
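Building on the card above, the fitted model can also return the standard deviation of its predictions; a sketch assuming the fitted regr and some new data X_new:
# return_std=True asks BayesianRidge for the standard deviation of the
# predictive distribution along with the mean prediction
y_mean, y_std = regr.predict(X_new, return_std=True)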
Pros and Cons of BR
Pros
It is robust to Gaussian noise.
Works well if the number of features and observations in the dataset are comparable.
Cons
Training can be time-consuming.
CART Algorithm
Classification and Regression Trees are a set of non-linear learning algorithms
which can be used for numerical as well as categorical features.
Here the tree has a set of nodes that split a branch into children.
In turn, each branch can lead to another node or terminate in a leaf that holds
the forecasted value or the predicted class.
Why Trees ?
Performing the prediction task is quick.
The principal task is traversing the tree from the root node to the leaf
nodes, checking at each point whether the respective feature is above or below a
threshold.
The concept of variance reduction is used in this algorithm.
At each node, a search is performed over all the features and over all candidate
thresholds for each feature.
The combination that gives the best variance reduction is selected as the split (see the sketch below).
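As an illustration of variance reduction (a simplified sketch with made-up numbers, not the exact scikit-learn implementation):
import numpy as np

def variance_reduction(x, y, threshold):
    # Split the target values by whether the feature is at or below the threshold
    left, right = y[x <= threshold], y[x > threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    # Weighted variance of the children compared with the parent's variance
    child_var = (len(left) * left.var() + len(right) * right.var()) / len(y)
    return y.var() - child_var

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 6.0, 5.5, 20.0, 21.0, 19.0])
print(variance_reduction(x, y, 3.0))   # a good split: large variance reduction
print(variance_reduction(x, y, 1.0))   # a weaker split: smaller reduction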
Regression Trees with Python
In:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

regr = DecisionTreeRegressor(random_state=101)
regr.fit(X_train, y_train)
mean_absolute_error(y_test, regr.predict(X_test))
The syntax is similar to applying any regression model using scikit learn.
Pros and Cons of Trees
Pros ...
Trees are the go-to algorithms for modeling non-linear behavior.
They can be used for both categorical and numeric datatypes without performing any
kind of normalization.
Training and prediction times are fast.
They leave a very small memory footprint.
Cons
The algorithm is greedy: it does not optimize the entire solution, it only optimizes
each individual split.
If there is a significant number of features, it does not perform well.
The leaf nodes can become very specific, sometimes leading to overfitting. In that case
those nodes can be pruned.
Bagging and Boosting
Bagging and Boosting are techniques that are used for combining multiple models to
improve overall accuracy.
The final combination is a non-linear model built from a set of simpler base models.
Bootstrap Aggregation is abbreviated as Bagging.
The main objective of this technique is to reduce the overall variance by
aggregating the models.
How is Bagging Done ?
Each model is trained on a bootstrap sample of the data, drawn with replacement
(optionally with a subset of the features).
During prediction, each of the models makes its own prediction; the results are then
averaged to produce the ensemble prediction.
Bagging Tip
The training and the prediction happen at the individual model level. This gives the
flexibility to parallelize the operation across multiple CPUs.
Bagging in Python
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_absolute_error

bagging = BaggingRegressor(SGDRegressor(), n_jobs=-1,
                           n_estimators=1000, random_state=101,
                           max_features=0.8)
bagging.fit(X_train, y_train)
mean_absolute_error(y_test, bagging.predict(X_test))

from sklearn.ensemble import RandomForestRegressor

regr = RandomForestRegressor(n_estimators=100,
                             n_jobs=-1, random_state=101)
regr.fit(X_train, y_train)
mean_absolute_error(y_test, regr.predict(X_test))
The sample code above shows how to implement bagging using Python: first by bagging SGD regressors, and then with a Random Forest, which bags decision trees.
Boosting
Boosting is another way of combining multiple learning models
The objective of boosting is to reduce the prediction bias
In boosting the models form a sequence, cascaded with each other; the output of
one stage feeds into the next.
Boosting Algorithm
During training, the current model produces its predictions.
The error (residual) is calculated against the actual values.
A new model is trained on those errors and appended as the final stage of the
cascade.
The prediction after each stage is the prediction so far plus the learning rate
times the prediction of the current stage (see the sketch below).
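To make the cascade concrete, here is a simplified two-stage sketch (not the exact scikit-learn implementation), assuming the X_train, y_train and X_test arrays from the earlier cards:
from sklearn.tree import DecisionTreeRegressor

learning_rate = 0.1

# Stage 1: fit a first model and compute its residual errors
model1 = DecisionTreeRegressor(max_depth=2).fit(X_train, y_train)
residuals = y_train - model1.predict(X_train)

# Stage 2: fit the next model on the residuals of the previous stage
model2 = DecisionTreeRegressor(max_depth=2).fit(X_train, residuals)

# Ensemble prediction: previous stage plus learning_rate times the new stage
y_pred = model1.predict(X_test) + learning_rate * model2.predict(X_test)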
Boosting Sample Code
from sklearn.ensemble import GradientBoostingRegressor

regr = GradientBoostingRegressor(n_estimators=500,
                                 learning_rate=0.01,
                                 random_state=101)
regr.fit(X_train, y_train)
Pros and Cons
Pros
We can build very good and robust models by combining weak models.
They support stochastic learning.
The robustness of the solution comes from the stochastic or random nature of the
model.
Cons
The time taken for training is very high, and there is a high memory footprint.
The steps in model building can be tricky because of the stochastic nature.
Application Areas
In this course so far you have seen different types of regression models; now you
will learn in what kinds of scenarios these models are applied.
Some of the application areas include
Prediction Problems
Binary and Multi Class Classification
Time Series Analysis
Ranking Problems
A Regression Problem
Consider a dataset from the Music Industry.
The descriptors of each song are given, along with the year the song was produced.
Can this data be modeled as a regression problem to predict the year from the
descriptors?
Problem Approach
For the question raised in the previous card, the answer is yes, we can predict the
year of production based on the descriptors.
The features should be identified based on their relevance to the context.
Once the features are extracted, a model can be trained with the features as inputs and
the year of production as the output.
The model can be evaluated using Mean Absolute Error between actual and predicted
values
The ultimate objective would be to minimize the error
Classification Problem
The previous problem can also be modeled as a Multi Class Classification problem.
The features and the descriptors still remain the same.
The output will belong to one of the classes from the range of years provided.
Mean absolute error can be used for validating the accuracy of the prediction.
Ranking Problem
Consider a dataset with some features related to a car along with a price.
Insurance companies would want to assess, on a given scale, how risky the car is to
sell / buy.
How do you think you will design this problem ?
Ranking Problem Approach
The above problem can be modeled as a regression problem where we are predicting
the risk on a scale.
The methodology to assess the prediction will be different.
In this scenario, we can go for label ranking loss, a metric that indicates the
strength of the ranking.
Mean Absolute Error and Mean Squared Error are not applicable in this scenario.
Another way to measure the prediction accuracy is by Label Ranking Average
Precision (see the sketch below).
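Both metrics are available in scikit-learn for multilabel ranking problems; a minimal sketch with illustrative toy arrays:
import numpy as np
from sklearn.metrics import label_ranking_loss, label_ranking_average_precision_score

# Each row is one sample; y_true marks the relevant labels,
# y_score holds the model's ranking scores for those labels
y_true = np.array([[1, 0, 0], [0, 0, 1]])
y_score = np.array([[0.75, 0.5, 1.0], [1.0, 0.2, 0.1]])

print(label_ranking_loss(y_true, y_score))
print(label_ranking_average_precision_score(y_true, y_score))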
Time Series Problem
So far you have seen problems where the features and the target variables are
different.
In scenarios where you analyze stock prices, currency fluctuations or
support ticket trends over a period of time, the same variables can serve as both the
features and the targets.
These problems fall under Time Series Analysis.
In time series analysis, the data at time t+k can be the target and the data at time t
can be the feature. The concept of autoregression is applied in these scenarios.
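A minimal sketch of building such lagged features with pandas (the series below is illustrative):
import pandas as pd

prices = pd.Series([100, 102, 101, 105, 107, 110])   # an illustrative series

# The feature is the value at time t; the target is the value at time t+1
frame = pd.DataFrame({'feature_t': prices, 'target_t_plus_1': prices.shift(-1)})
frame = frame.dropna()    # the last row has no future value to predict
print(frame)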
Summary of the Course
In this course,
You have understood the limitations of Linear Models and how they can be overcome
with Generalized Linear Models
How to represent Generalized Linear Models
Logistic Regression from a GLM and Machine Learning Perspective
Poisson Regression, its representation and how to apply it in a real-world
application
Advanced Regression Models
Real world applications of Regression Analysis
Hope you had fun taking this course.