Simple Linear Regression - Brief Introduction
DENYS GREKOV
REGRESSION ANALYSIS
denys.grekov@imt-atlantique.fr
INTRODUCTION
Statistics links a sample (experimental data describing a real phenomenon) to the population it is drawn from:
• Descriptive statistics - summarizing the experimental data: tables, charts, characteristic values (mean parameters, spread parameters).
• Data analysis - principal component analysis, correspondence analysis, automatic classification, …
• Probability calculations - construction of a probability model to describe random phenomena, and checking its adequacy to the data.
• Estimation - extending the observations on the sample to the population.
• Test - validating or invalidating a hypothesis on a parameter or on the shape of a law.
• Regression - modeling and prediction.
CONTENT OF THE MODULE
Goal of regression analysis: quantitative description and
prediction of the interdependence between two or more variables.
• Definition of correlation
• Specification of a simple linear regression model
• Least squares estimators: construction and properties
• Verification of the statistical significance of the regression model
CORRELATION ANALYSIS
X: independent (explanatory, exogenous) variable.
Y: dependent (response, endogenous) variable.

[Scatter plot of Y vs X: the mean lines $\bar{X}$ and $\bar{Y}$ (mean height and weight within the sample) split the plane into quadrants marked + and -; points N1 and N2 lie in the + quadrants.]

The product of deviations from the mean is positive for most points when X and Y are positively correlated:

$$(X_1 - \bar{X}) \cdot (Y_1 - \bar{Y}) > 0, \qquad (X_2 - \bar{X}) \cdot (Y_2 - \bar{Y}) > 0, \qquad \ldots$$

Covariance quantitatively describes the strength and the sign of the correlation between the variables X and Y:

$$\mathrm{cov} = \frac{\sum_i (X_i - \bar{X}) \cdot (Y_i - \bar{Y})}{N - 1}$$

The correlation coefficient (Pearson's coefficient) has the same meaning as the covariance but is independent of the magnitude of the raw data. Range: [-1; +1]:

$$r_{XY} = \frac{\mathrm{cov}}{\sigma_X \cdot \sigma_Y}$$

The coefficient of determination $r_{XY}^2$ (shared variance) determines how far the variance of one variable is explained by the other. Range: [0; 1].
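To make the formulas concrete, here is a minimal NumPy sketch on hypothetical height/weight data (the numbers are invented for illustration):

```python
import numpy as np

# Hypothetical sample: heights in cm (X) and weights in kg (Y)
X = np.array([160.0, 165.0, 170.0, 175.0, 180.0, 185.0])
Y = np.array([55.0, 60.0, 63.0, 70.0, 72.0, 80.0])

N = len(X)
# Covariance: sum of products of deviations, with the N - 1 denominator
cov = np.sum((X - X.mean()) * (Y - Y.mean())) / (N - 1)

# Pearson's coefficient: covariance rescaled by both standard deviations
r_xy = cov / (X.std(ddof=1) * Y.std(ddof=1))

print(f"cov = {cov:.2f}, r_XY = {r_xy:.3f}, r_XY^2 = {r_xy**2:.3f}")
```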
[Row of scatter plots illustrating different strengths of correlation, with $R^2$ values 1, 0.64, 0.16, 0, 0.16, 0.64, 1 for the corresponding $r_{XY}$ values.]

The correlation coefficient makes sense only when:
• the interdependence between the variables is linear and monotonic (no non-linearity of the data);
• there are no significant outliers.

Correlation may signify a causal relationship between two variables, but a priori it does not! It may be caused by a third, frequently hidden, variable.
REGRESSION ANALYSIS
Regression analysis is used to examine how one variable determines/describes/predicts another variable.

[Scatter plot: Price (dependent variable) vs Quality (independent variable, the predictor), with a fitted straight line.]

For the sample: $\hat{y} = b_0 + b_1 \cdot x + \text{error}$, where $b_0$ is the intercept and $b_1$ the slope.
For the population: $\hat{y} = \beta_0 + \beta_1 \cdot x$.

Notation: $y$ - actual value; $\bar{y}$ - average value; $\hat{y}$ - predicted value.
LEAST SQUARES METHOD
The method of least squares finds the optimal parameters of the linear regression by minimizing the residual sum of squares. A residual is the difference between the actual and the predicted value; in the Price vs Quality plot, point N1 lies above the line ($e_1 = y_1 - \hat{y}_1 > 0$) and point N2 below it ($e_2 = y_2 - \hat{y}_2 < 0$). The criterion is

$$\sum_{i=1}^{n} e_i^2 \to \min$$

The squares are needed to prevent the positive and negative residuals from cancelling each other out.
With $\hat{y} = b_0 + b_1 \cdot x$, the criterion becomes a function of the two parameters:

$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - b_1 x_i - b_0)^2 = f(b_0, b_1) \to \min$$

When the function is minimal, its partial derivatives with respect to each parameter equal 0:

$$\frac{\partial f}{\partial b_1} = -2 \sum_{i=1}^{n} (y_i - b_1 x_i - b_0) \cdot x_i = 0$$

$$\frac{\partial f}{\partial b_0} = -2 \sum_{i=1}^{n} (y_i - b_1 x_i - b_0) = 0$$

Rearranging these for simple linear regression, we get the normal equations:

$$b_1 \sum_{i=1}^{n} x_i^2 + b_0 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} x_i y_i$$

$$b_1 \sum_{i=1}^{n} x_i + b_0 \cdot n = \sum_{i=1}^{n} y_i$$

Finding the regression parameters for nonlinear or multivariable regressions is conceptually the same, e.g.

$$y = b_0 + b_1 \cdot x + b_2 \cdot z + \ldots \qquad y = b_0 + b_1 \cdot x + b_2 \cdot x^2 + \ldots$$
IN PRACTICE
1 The easiest way: directly using Excel.
2 More complicated: analytical solution of the system of linear equations (the normal equations):

$$b_1 \sum_{i=1}^{n} x_i^2 + b_0 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} x_i y_i \qquad b_1 \sum_{i=1}^{n} x_i + b_0 \cdot n = \sum_{i=1}^{n} y_i$$

• Matrix method
• Gauss method
• Cramer's rule
• Online applications

Example, with the summations $\sum x_i^2 = 10436$, $\sum x_i = 426$, $\sum x_i y_i = 24022$, $n = 22$ and $\sum y_i = 1070$ computed from the data:

$$b_1 \cdot 10436 + b_0 \cdot 426 = 24022$$
$$b_1 \cdot 426 + b_0 \cdot 22 = 1070$$

which gives $b_1 = 1.51$ and $b_0 = 19.4$ for the model $y = b_0 + b_1 \cdot x$.
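The same 2×2 system can be handed to any linear solver; a minimal sketch with NumPy, reusing the summations from the example above:

```python
import numpy as np

# Normal equations in matrix form A @ b = c, with unknowns b = (b1, b0):
#   b1 * 10436 + b0 * 426 = 24022
#   b1 * 426   + b0 * 22  = 1070
A = np.array([[10436.0, 426.0],
              [426.0,    22.0]])
c = np.array([24022.0, 1070.0])

b1, b0 = np.linalg.solve(A, c)
print(f"b1 = {b1:.2f}, b0 = {b0:.1f}")  # b1 = 1.51, b0 = 19.4
```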
3 More complicated: using statistics tools.

In terms of statistics:

$$b_1 = \frac{s_y}{s_x} \cdot r_{xy} \qquad b_0 = \bar{y} - b_1 \cdot \bar{x}$$

In terms of algebra:

$$b_1 \sum_{i=1}^{n} x_i^2 + b_0 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} x_i y_i \qquad b_1 \sum_{i=1}^{n} x_i + b_0 \cdot n = \sum_{i=1}^{n} y_i$$
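A minimal sketch of the statistical route, assuming x and y are NumPy arrays (the sample below reuses the factory data from Tutorial 2 at the end of this module):

```python
import numpy as np

def slr_coefficients(x, y):
    """Slope and intercept of a simple linear regression from sample statistics."""
    r_xy = np.corrcoef(x, y)[0, 1]                  # Pearson coefficient
    b1 = (y.std(ddof=1) / x.std(ddof=1)) * r_xy     # b1 = (s_y / s_x) * r_xy
    b0 = y.mean() - b1 * x.mean()                   # b0 = y_bar - b1 * x_bar
    return b0, b1

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # assets
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])       # annual profits
b0, b1 = slr_coefficients(x, y)
print(f"y = {b0:.2f} + {b1:.3f} * x")          # y = 0.60 + 0.080 * x
```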
QUANTITATIVE DESCRIPTION OF SLR
The total variation of the response decomposes as SST = SSR + SSM:

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$

where the left-hand side is the total sum of squares of the response variable y (SST), the first term on the right is the sum of squares of residuals (error sum of squares, SSR), and the second is the sum of squares explained by the model (regression sum of squares, SSM).
ROLE OF R²
The coefficient of determination (R²) determines how far the variance of one variable is explained by the other, i.e. how well it can be predicted by the regression model:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

[Two Price vs Quality scatter plots: one with R² = 0.8877, where the price is mainly determined by the product's quality, and one with R² = 0.0073, where only a small fraction of the price variance is determined by the quality.]

The higher the R², the better the regression model predicts the behavior of the dependent variable.
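A sketch computing R² straight from this definition (reusing the coefficients fitted in the previous sketch):

```python
import numpy as np

def r_squared(x, y, b0, b1):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    y_hat = b0 + b1 * x
    ss_res = np.sum((y - y_hat) ** 2)       # residual (error) sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares
    return 1.0 - ss_res / ss_tot

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
print(f"R^2 = {r_squared(x, y, b0=0.6, b1=0.08):.2f}")  # 0.64, i.e. r_xy^2
```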
DIAGNOSTICS FOR SLR
• Linear relationship between the variables X and Y.
• Exogeneity - the residuals follow a normal distribution centered on 0 with a standard deviation σ.
• Homoscedasticity - constant variance of the residuals over the whole range of the variables.
• Independence - no relationship (autocorrelation) between the residuals.
Interactive examples of these diagnostics: http://www2.stat.duke.edu/~mc301/shinyed/
VALIDATION OF REGRESSION PARAMETERS
The regression line for a sample may differ from the regression line for the total population. According to the central limit theorem, we can assess the regression parameters of the entire population relying on the data from the sample.

$$\hat{y} = b_0 + b_1 \cdot x \quad \text{(regression line for the sample)} \qquad y = \beta_0 + \beta_1 \cdot x \quad \text{(regression line for the population)}$$
CONFIDENCE INTERVALS
The variances of the parameters $b_0$ and $b_1$ can be defined as follows:

$$s_{b_1}^2 = \frac{(1 - r^2) \cdot s_y^2}{(n - 2) \cdot s_x^2} \qquad s_{b_0}^2 = \frac{(1 - r^2) \cdot s_y^2}{(n - 2) \cdot s_x^2} \cdot \left( s_x^2 + \bar{x}^2 \right)$$

A confidence interval for the regression coefficients (at a confidence level P, typically 70-95%) can then be found using Student's distribution:

$$b_0 - t_{\alpha/2}^{n-2} \cdot s_{b_0} < \beta_0 < b_0 + t_{\alpha/2}^{n-2} \cdot s_{b_0}$$

$$b_1 - t_{\alpha/2}^{n-2} \cdot s_{b_1} < \beta_1 < b_1 + t_{\alpha/2}^{n-2} \cdot s_{b_1}$$
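A sketch of these intervals on the same hypothetical data; the variance formulas are implemented exactly as given above, and the Student quantile comes from scipy.stats:

```python
import numpy as np
from scipy import stats

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
n = len(x)
r = np.corrcoef(x, y)[0, 1]
sx2, sy2 = x.var(ddof=1), y.var(ddof=1)
b1 = np.sqrt(sy2 / sx2) * r
b0 = y.mean() - b1 * x.mean()

# Standard errors of the coefficients, using the variance formulas above
s_b1 = np.sqrt((1 - r**2) * sy2 / ((n - 2) * sx2))
s_b0 = np.sqrt((1 - r**2) * sy2 / ((n - 2) * sx2) * (sx2 + x.mean()**2))

alpha = 0.05                                    # 95% confidence level
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)   # Student quantile
print(f"{b1 - t_crit*s_b1:.3f} < beta1 < {b1 + t_crit*s_b1:.3f}")
print(f"{b0 - t_crit*s_b0:.3f} < beta0 < {b0 + t_crit*s_b0:.3f}")
```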
PREDICTION WITH THE CONFIDENCE INTERVAL
The uncertainty of the coefficients is a source of prediction error, so a prediction at a new point $x_0$ should be given with a confidence interval:

$$y_0 = b_0 + b_1 \cdot x_0 \pm t_{\alpha/2}^{n-2} \cdot s_{y_0}$$

where $s_{y_0}$ is a term which combines the uncertainties of the regression parameters:

$$s_{y_0}^2 = \frac{(1 - r^2) \cdot s_y^2}{(n - 2) \cdot s_x^2} \cdot \left( (n + 1) \cdot s_x^2 + (x_0 - \bar{x})^2 \right)$$

with $s_{b_0}^2$ and $s_{b_1}^2$ as defined above. As before, $\hat{y} = b_0 + b_1 \cdot x$ is the regression line for the sample and $y = \beta_0 + \beta_1 \cdot x$ the regression line for the population.
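Continuing the sketch, the prediction with its confidence interval at a new point x0:

```python
import numpy as np
from scipy import stats

def predict_with_ci(x, y, x0, alpha=0.05):
    """Point prediction at x0 and its confidence half-width (slide formulas)."""
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    sx2, sy2 = x.var(ddof=1), y.var(ddof=1)
    b1 = np.sqrt(sy2 / sx2) * r
    b0 = y.mean() - b1 * x.mean()
    # s_y0 combines the uncertainties of both regression parameters
    s_y0 = np.sqrt((1 - r**2) * sy2 / ((n - 2) * sx2)
                   * ((n + 1) * sx2 + (x0 - x.mean())**2))
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return b0 + b1 * x0, t_crit * s_y0

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
y0, half = predict_with_ci(x, y, x0=25.0)
print(f"y(25) = {y0:.2f} +/- {half:.2f}")
```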
DIAGNOSTICS FOR SLR – T-TEST
The t-test allows one to determine whether the regression model is statistically significant.

t-test steps:
1) Specify the hypothesis: H0: β1 = 0; H1: β1 ≠ 0, usually at the 5% significance level (α).
2) Determine t (the numerator reduces to $b_1$, since $\beta_1 = 0$ when testing H0):

$$t = \frac{b_1^{\,\text{sample}} - \beta_1^{\,\text{population}}}{s_{b_1}} = \frac{b_1}{s_{b_1}}, \qquad s_{b_1}^2 = \frac{(1 - r^2) \cdot s_y^2}{(n - 2) \cdot s_x^2}$$

3) Quantify the evidence of the test: compare the computed t value to the critical values $\pm t_{1-\alpha/2}^{n-2}$. If the computed t falls in a rejection region (smaller than $-t_{1-\alpha/2}^{n-2}$ or greater than $+t_{1-\alpha/2}^{n-2}$), H0 can be rejected; otherwise it lies in the "H0 acceptance region".
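A sketch of the three steps on the same data; the two-sided p-value via the t distribution's survival function is an addition not shown on the slide:

```python
import numpy as np
from scipy import stats

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
n = len(x)

# 1) H0: beta1 = 0 vs H1: beta1 != 0, at the 5% significance level
alpha = 0.05

# 2) t = b1 / s_b1
r = np.corrcoef(x, y)[0, 1]
sx2, sy2 = x.var(ddof=1), y.var(ddof=1)
b1 = np.sqrt(sy2 / sx2) * r
s_b1 = np.sqrt((1 - r**2) * sy2 / ((n - 2) * sx2))
t = b1 / s_b1

# 3) Compare to the critical values +/- t_{1-alpha/2}^{n-2}
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
p_value = 2 * stats.t.sf(abs(t), df=n - 2)   # two-sided p-value
print(f"t = {t:.2f}, critical = +/-{t_crit:.2f}, p = {p_value:.3f}")
print("Reject H0" if abs(t) > t_crit else "Cannot reject H0")
```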
SIMPLE NON-LINEAR REGRESSION
Nonlinear regression is called for when:
• the 2D data features a nonlinear pattern;
• there is an obvious pattern in the residual plot.

Segmentation: X can be split up into classes or segments, and a linear regression can be performed per segment.
For example, the exponential model:

$$\hat{y} = b_0 \cdot e^{b_1 \cdot x}$$
LEAST SQUARES METHOD
For the exponential model the criterion is

$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left( y_i - b_0 \cdot e^{b_1 x_i} \right)^2 = f(b_0, b_1) \to \min$$

When the function is minimal, the partial derivatives with respect to each parameter equal 0:

$$\frac{\partial f}{\partial b_0} = 2 \sum_{i=1}^{n} \left( y_i - b_0 \cdot e^{b_1 x_i} \right) \cdot \left( -e^{b_1 x_i} \right) = 0$$

$$\frac{\partial f}{\partial b_1} = 2 \sum_{i=1}^{n} \left( y_i - b_0 \cdot e^{b_1 x_i} \right) \cdot \left( -b_0 \cdot x_i \cdot e^{b_1 x_i} \right) = 0$$

which leads to the system

$$-\sum_{i=1}^{n} y_i \cdot e^{b_1 x_i} + b_0 \sum_{i=1}^{n} e^{2 b_1 x_i} = 0$$

$$\sum_{i=1}^{n} y_i \cdot x_i \cdot e^{b_1 x_i} - b_0 \sum_{i=1}^{n} x_i \cdot e^{2 b_1 x_i} = 0$$

Unlike the linear case, this system has no closed-form solution: a numerical solution (iterative method) must be used.
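In practice the iteration is rarely hand-rolled; a minimal sketch with scipy.optimize.curve_fit on synthetic data (true parameters b0 = 2, b1 = 0.3 chosen for illustration):

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, b0, b1):
    """Exponential regression model y = b0 * exp(b1 * x)."""
    return b0 * np.exp(b1 * x)

# Synthetic noisy data generated from b0 = 2, b1 = 0.3
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 30)
y = model(x, 2.0, 0.3) + rng.normal(0.0, 0.2, x.size)

# Iterative least squares, started from an initial guess p0
(b0, b1), _ = curve_fit(model, x, y, p0=(1.0, 0.1))
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
```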
LINEARIZATION PROCEDURE
Linearization: several nonlinear regression functions can be moved to a linear domain and solved as SLR. For the exponential model:

$$y = b_0 \cdot e^{b_1 \cdot x}$$
$$\ln y = \ln b_0 + \ln \left( e^{b_1 \cdot x} \right)$$
$$\ln y = \ln b_0 + b_1 \cdot x$$

With $Y = \ln y$ and $B_0 = \ln b_0$, this is the linear model $Y = B_0 + b_1 \cdot x$, and the original parameter is recovered as $b_0 = e^{B_0}$.
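The same fit via linearization: an ordinary SLR of ln y on x, then a back-transform of the intercept (sketch on the same kind of synthetic data):

```python
import numpy as np

# Synthetic positive data from y = b0 * exp(b1 * x), b0 = 2, b1 = 0.3
rng = np.random.default_rng(1)
x = np.linspace(0.0, 5.0, 30)
y = 2.0 * np.exp(0.3 * x) * np.exp(rng.normal(0.0, 0.05, x.size))

# Linear domain: ln y = ln b0 + b1 * x, an ordinary SLR on (x, ln y)
b1, B0 = np.polyfit(x, np.log(y), deg=1)   # polyfit returns the slope first
b0 = np.exp(B0)                            # back-transform the intercept
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
```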
POLYNOMIAL FUNCTION
$$y = b_0 + b_1 \cdot x + b_2 \cdot x^2 + b_3 \cdot x^3 + \ldots + b_n \cdot x^n$$

[Scatter plot of Y (0-400) vs X (0-500) showing a nonlinear trend.]
[The same Y vs X data fitted with a straight line (R² = 0.8019) and with a 2nd-order polynomial (R² = 0.9075), each shown with its plot of residuals vs X.]
Fits of the same data with polynomials of increasing order:

Order:  linear   2nd order   3rd order   4th order   5th order   6th order
R²:     0.8019   0.9075      0.9153      0.9281      0.9283      0.9336
OVERFITTING WITH POLYNOMIAL FUNCTION
An excellent model that fits all points individually is like an individually tailored dress or suit: non-universal.

$$y = b_0 + b_1 \cdot x + b_2 \cdot x^2 + b_3 \cdot x^3 + \ldots + b_n \cdot x^n$$
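A sketch of the order-vs-fit trade-off with numpy.polyfit on synthetic data (values chosen for illustration): R² keeps creeping upward with the order even once the extra terms only chase noise, which is the overfitting warned about above.

```python
import numpy as np

# Synthetic data: a smooth trend plus noise
rng = np.random.default_rng(2)
x = np.linspace(0.0, 5.0, 25)
y = 50.0 * x + 20.0 * np.sin(2.0 * x) + rng.normal(0.0, 15.0, x.size)

for order in range(1, 7):
    coeffs = np.polyfit(x, y, deg=order)    # least squares polynomial fit
    y_hat = np.polyval(coeffs, x)
    r2 = 1.0 - np.sum((y - y_hat)**2) / np.sum((y - y.mean())**2)
    print(f"order {order}: R^2 = {r2:.4f}")
```

On the training data this R² sequence can only increase as orders are added, so on its own it cannot flag overfitting.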
PRACTICE
► Tutorial 1
- The ionic product of a solvent (pKs) is linked to its dielectric constant ε via the relation:

$$pK_s = \alpha + \frac{\beta}{\varepsilon}$$

- We know the following results:

Solvent        pKs    ε
Water          14     78.5
Ethanol        19.1   24.3
Iso-propanol   20.8   18.3
Methanol       16.7   32.6
► Tutorial 1
- Check graphically the validity of the given formula.
- Give the values of α and β:
• as point estimates;
• with a confidence interval of 95%.
- For n-propanol, ε = 20.1. Give its pKs:
• as a point estimate;
• with a confidence interval at a 5% risk.
► Tutorial 2
For 5 factories, we have the values of their assets and their annual profits (M$):

Factory i           1    2    3    4    5
Assets xi          10   20   30   40   50
Annual profits yi   1    3    2    5    4

1) Are these two variables linked? Verify the hypothesis using statistical tools.
2) Calculate the ranges of the regression model parameters for confidence intervals of 80, 90 and 95%.
3) Verify whether the proposed model describes the following data. For which confidence interval? Is the model good enough?

Factory i           1    2    3    4    5
Assets xi          15   20   35   45   50
Annual profits yi   3    1    1    4    2

Factory             1
Assets             25
Annual profits      5