What & Why
What is Regression?
Formulation of a functional relationship between a set of Independent or
Explanatory variables (X's) and a Dependent or Response variable (Y).
Y = f(X)
Why Regression?
Knowledge of Y is crucial for decision making.
• Will he/she buy or not?
• Shall I offer him/her the loan or not?
• ………
X is available at the time of decision making and is related to Y, which
makes it possible to predict Y.
Types of Regression
• Continuous outcome (e.g., sales volume, claim amount, % of sales growth) → Ordinary Least Squares (OLS) Regression
• Binary (0/1) outcome (e.g., buy/no-buy, survive/not-survive, win/loss) → Logistic Regression
Intro to Regression Analysis
• Regression analysis is used to:
• Predict the value of a dependent variable based on the value of at least one independent variable
• Explain the impact of changes in an independent variable on the dependent variable
• Dependent variable: the variable we wish to explain, usually denoted by Y.
• Independent variable: the variable used to explain the dependent variable, usually denoted by X.
Regression Example
Predict the fitness of a person based on one or more parameters.
Simple Linear Regression Model
• Only one independent variable, x
• Relationship between x and y is described by a linear function
• Changes in y are assumed to be caused by changes in x
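As a concrete sketch (not part of the slides), the least squares estimates for this one-variable model can be computed in closed form as b1 = Sxy / Sxx and b0 = ȳ − b1·x̄; the data points below are invented purely for illustration.

```python
# Sketch: fit y = b0 + b1*x by ordinary least squares using the
# closed-form estimates b1 = Sxy / Sxx and b0 = y_bar - b1 * x_bar.
# The data points are made up purely for illustration.

def fit_simple_ols(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sxy: sum of cross-deviations; Sxx: sum of squared x-deviations
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx           # slope estimate
    b0 = y_bar - b1 * x_bar  # intercept estimate
    return b0, b1

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # roughly y = 2x
b0, b1 = fit_simple_ols(x, y)
print(b0, b1)
```

With noise-free data the estimates recover the true line exactly; with noisy data they give the best linear fit in the least-squares sense.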
Assumptions for Simple Linear Regression
E(ε) = 0 — the expected value of the error term is zero.
Assumptions for Multiple Regression
• Var(εi) = σ² for all i (constant error variance)
• E(εi εj) = 0, j ≠ i (error terms are uncorrelated)
Equations for Regression
Simple Linear Regression Model
Yi = β0 + β1Xi + εi
Beta Zero
β0 is the intercept: the expected value of Y when X = 0.
Beta One
β1 is the slope: the expected change in Y for a 1 unit change in X.
Error Term / Residual
εi is the part of Yi not explained by the linear relationship with Xi.
Regression Line Equation
ŷ = b0 + b1x
The Multiple Linear Regression Model
Yi = b0 + b1X1i + b2X2i + ... + bkXki + εi
Types of Regression Relationships
• Positive Linear Relationship
• Negative Linear Relationship
• No Relationship
• Relationship NOT Linear
Population & Sample Regression Models
The population relationship Yi = β0 + β1Xi + εi is unknown; we observe only a random sample of points drawn from it.
Population Linear Regression
Y = β0 + β1X + u
β0 is the intercept and β1 the slope; for a given xi, the line gives the predicted value of Y, and ui is the random error separating an individual observation (e.g., one person's marks) from the line.
Population Regression Function
Y = β0 + β1X + u
Y: dependent variable; β0: population intercept; β1: population slope coefficient; X: independent variable; u: random error term (residual).
β0 + β1X is the linear component; u is the random error component.
But can we actually get this equation? If yes, what information will we need?
Sample Regression Function
y = b0 + b1x + e
b0 is the estimated intercept and b1 the estimated slope; for a given xi, ei is the residual, the gap between the observed value of y and the predicted value of y.
Sample Regression Function
yi = b0 + b1xi + ei
b0: estimate of the regression intercept; b1: estimate of the regression slope; xi: independent variable; ei: error term.
Notice the similarity with the Population Regression Function.
Can we do something about the error term?
The Error Term (Residual)
• Represents the influence of all the variables which we have not accounted for in the equation
• It is the difference between the actual y values and the y values predicted by the Sample Regression Line
• Wouldn't it be good if we were able to reduce this error term?
• By the way, what are we trying to achieve by Sample Regression?
How Well A Model Fits the Data
Comparing the Regression Model to a Baseline Model
OLS Regression Properties
• The sum of the residuals from the least squares regression line is zero: ∑(y − ŷ) = 0
• The sum of the squared residuals, ∑(y − ŷ)², is a minimum.
• The simple regression line always passes through the mean of the y variable and the mean of the x variable.
• The least squares coefficients are unbiased estimates of β0 and β1.
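The first and third properties can be checked numerically. This is a small sketch with made-up data (not from the slides): fit a line, then confirm that the residuals sum to zero and that the line passes through (x̄, ȳ).

```python
# Numeric check of two OLS properties on invented data:
# 1) the residuals sum to zero, 2) the fitted line passes through (x_bar, y_bar).

def fit_simple_ols(x, y):
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.8, 8.2, 9.9]
b0, b1 = fit_simple_ols(x, y)

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)

print(abs(sum(residuals)) < 1e-9)             # True: residuals sum to zero
print(abs((b0 + b1 * x_bar) - y_bar) < 1e-9)  # True: line passes through the means
```

Both properties hold algebraically for any least squares fit, so the checks succeed up to floating-point rounding.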
Limitations of Regression Analysis
• Parameter Instability - This happens in situations where
correlations change over a period of time. This is very
common in financial markets where economic, tax,
regulatory, and political factors change frequently.
• Public knowledge of a specific regression relation may
cause a large number of people to react in a similar fashion
towards the variables, negating its future usefulness.
• If any of the regression assumptions are violated, predicted values of
the dependent variable and hypothesis tests will not be valid.
General Multiple Linear Regression Model
• In simple linear regression, the dependent variable was assumed to depend
on only one (independent) variable.
• In the General Multiple Linear Regression model, the dependent variable derives its
value from two or more variables.
• The General Multiple Linear Regression model takes the following form:
Yi = b0 + b1X1i + b2X2i + ... + bkXki + εi
where:
Yi = ith observation of dependent variable Y
Xki = ith observation of kth independent variable X
b0 = intercept term
bk = slope coefficient of kth independent variable
εi = error term of ith observation
n = number of observations
k = total number of independent variables
Estimated Regression Equation
• As we calculated the intercept and the slope coefficient in simple linear
regression by minimizing the sum of squared errors, we estimate the
intercept and slope coefficients in multiple linear regression in the same way.
• The sum of squared errors, ∑ εi² (i = 1, ..., n), is minimized and the slope
coefficients are estimated.
• The resultant estimated equation becomes:
Ŷi = b̂0 + b̂1X1i + b̂2X2i + ... + b̂kXki
• Now the error in the ith observation can be written as:
ε̂i = Yi − Ŷi = Yi − (b̂0 + b̂1X1i + b̂2X2i + ... + b̂kXki)
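Minimizing the sum of squared errors leads to the normal equations (XᵀX)b = Xᵀy. The sketch below (data and helper names are invented for illustration, and the solver is a bare-bones Gaussian elimination, not a production routine) estimates b0...bk this way in pure Python.

```python
# Sketch: estimate b0..bk by minimizing the sum of squared errors,
# via the normal equations (X'X) b = X'y. Invented data for illustration.

def matmul(A, B):
    # Matrix product of A (m x n) and B (n x p)
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def solve(A, b):
    # Gaussian elimination with partial pivoting on the augmented matrix
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_multiple_ols(X, y):
    Xa = [[1.0] + list(row) for row in X]   # prepend intercept column
    Xt = [list(col) for col in zip(*Xa)]    # X'
    XtX = matmul(Xt, Xa)
    Xty = [sum(xi * yi for xi, yi in zip(col, y)) for col in Xt]
    return solve(XtX, Xty)                  # [b0, b1, ..., bk]

X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [1, 3, 4, 6, 8]   # exactly y = 1 + 2*x1 + 3*x2
b = fit_multiple_ols(X, y)
print(b)
```

Since the example data satisfy the model exactly, the estimated coefficients recover b0 = 1, b1 = 2, b2 = 3 up to rounding, and every residual ε̂i = Yi − Ŷi is (numerically) zero.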
Assumptions of Multiple Regression Model
• There exists a linear relationship between the dependent and
independent variables.
• The expected value of the error term, conditional on the
independent variables is zero.
• The error terms are homoskedastic, i.e. the variance of the
error terms is constant for all the observations.
• The expected value of the product of error terms is always
zero, which implies that the error terms are uncorrelated with
each other.
• The error term is normally distributed.
• The independent variables do not have any exact linear relationships
among one another (no multicollinearity).
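The last assumption can be screened informally by inspecting pairwise correlations between the independent variables. This is a rough sketch with invented data (a full diagnosis would use variance inflation factors, which the slides do not cover):

```python
# Sketch: screen for multicollinearity via pairwise Pearson correlation.
# Data is invented for illustration.

def pearson(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

x1 = [1, 2, 3, 4, 5]
x2 = [2, 4, 6, 8, 10]  # exact linear function of x1: violates the assumption
x3 = [5, 3, 8, 1, 9]   # no exact linear relationship with x1

print(round(pearson(x1, x2), 6))  # 1.0 -> perfect collinearity
print(round(pearson(x1, x3), 6))  # a moderate correlation is acceptable
```

A correlation of exactly ±1 means one predictor is a linear function of another, so XᵀX is singular and the OLS coefficients cannot be estimated.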
Thank you!