Simple Linear Regression - Brief Introduction
DENYS GREKOV
REGRESSION ANALYSIS
denys.grekov@imt-atlantique.fr
INTRODUCTION
Statistics links a sample (experimental data describing a real phenomenon) to the population it is drawn from:
• Descriptive statistics - summarizing the experimental data: tables, charts, characteristic values (mean parameters, spread parameters).
• Data analysis - principal component analysis, correspondence analysis, automatic classification, …
• Probability calculations - construction of a probability model to describe random phenomena, and checking its adequacy to the data.
• Estimation - extending the observations on the sample to the population.
• Test - validating or invalidating a hypothesis on a parameter or on the shape of a law.
• Regression - modeling and prediction.
CONTENT OF THE MODULE
Goal of regression analysis: quantitative description and
prediction of the interdependence between two or more variables.
• Definition of correlation
• Specification of a simple linear regression model
• Least squares estimators: construction and properties
• Verification of the statistical significance of the regression model
CORRELATION ANALYSIS
X: independent (explanatory, exogenous) variable.
Y: dependent (response, endogenous) variable.

[Scatter plot of Y vs X: the mean lines $\bar{X}$ and $\bar{Y}$ (mean height and weight within the sample) split the plane into quadrants marked + and -; points N1 and N2 lie in the + quadrants.]

The product of deviations from the mean is positive for most points when X and Y are positively correlated:

$$(X_1 - \bar{X}) \cdot (Y_1 - \bar{Y}) > 0, \qquad (X_2 - \bar{X}) \cdot (Y_2 - \bar{Y}) > 0, \qquad \ldots$$

Covariance quantitatively describes the strength and the sign of the correlation between the variables X and Y:

$$\mathrm{cov} = \frac{\sum_i (X_i - \bar{X}) \cdot (Y_i - \bar{Y})}{N - 1}$$

The correlation coefficient (Pearson's coefficient) has the same meaning as the covariance but is independent of the magnitude of the raw data. Range: [-1; +1]:

$$r_{XY} = \frac{\mathrm{cov}}{\sigma_X \cdot \sigma_Y}$$

The coefficient of determination $r_{XY}^2$ (shared variance) determines how far the variance of one variable is explained by the other. Range: [0; 1].
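To make the formulas concrete, here is a minimal NumPy sketch on hypothetical height/weight data (the numbers are invented for illustration):

```python
import numpy as np

# Hypothetical sample: heights in cm (X) and weights in kg (Y)
X = np.array([160.0, 165.0, 170.0, 175.0, 180.0, 185.0])
Y = np.array([55.0, 60.0, 63.0, 70.0, 72.0, 80.0])

N = len(X)
# Covariance: sum of products of deviations, with the N - 1 denominator
cov = np.sum((X - X.mean()) * (Y - Y.mean())) / (N - 1)

# Pearson's coefficient: covariance rescaled by both standard deviations
r_xy = cov / (X.std(ddof=1) * Y.std(ddof=1))

print(f"cov = {cov:.2f}, r_XY = {r_xy:.3f}, r_XY^2 = {r_xy**2:.3f}")
```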
[Row of scatter plots illustrating different strengths of correlation, with $R^2$ values 1, 0.64, 0.16, 0, 0.16, 0.64, 1 for the corresponding $r_{XY}$ values.]

The correlation coefficient makes sense only when:
• the interdependence between the variables is linear and monotonic (no non-linearity of the data);
• there are no significant outliers.

Correlation may signify a causal relationship between two variables, but a priori it does not! It may be caused by a third, frequently hidden, variable.
REGRESSION ANALYSIS
Regression analysis is used to examine how one variable determines/describes/predicts another variable.

[Scatter plot: Price (dependent variable) vs Quality (independent variable, the predictor), with a fitted straight line.]

For the sample: $\hat{y} = b_0 + b_1 \cdot x + \text{error}$, where $b_0$ is the intercept and $b_1$ the slope.
For the population: $\hat{y} = \beta_0 + \beta_1 \cdot x$.

Notation: $y$ - actual value; $\bar{y}$ - average value; $\hat{y}$ - predicted value.
LEAST SQUARES METHOD
The method of least squares finds the optimal parameters of the linear regression by minimizing the residual sum of squares. A residual is the difference between the actual and the predicted value; in the Price vs Quality plot, point N1 lies above the line ($e_1 = y_1 - \hat{y}_1 > 0$) and point N2 below it ($e_2 = y_2 - \hat{y}_2 < 0$). The criterion is

$$\sum_{i=1}^{n} e_i^2 \to \min$$

The squares are needed to prevent the positive and negative residuals from cancelling each other out.
With $\hat{y} = b_0 + b_1 \cdot x$, the criterion becomes a function of the two parameters:

$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - b_1 x_i - b_0)^2 = f(b_0, b_1) \to \min$$

When the function is minimal, its partial derivatives with respect to each parameter equal 0:

$$\frac{\partial f}{\partial b_1} = -2 \sum_{i=1}^{n} (y_i - b_1 x_i - b_0) \cdot x_i = 0$$

$$\frac{\partial f}{\partial b_0} = -2 \sum_{i=1}^{n} (y_i - b_1 x_i - b_0) = 0$$

Rearranging these for simple linear regression, we get the normal equations:

$$b_1 \sum_{i=1}^{n} x_i^2 + b_0 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} x_i y_i$$

$$b_1 \sum_{i=1}^{n} x_i + b_0 \cdot n = \sum_{i=1}^{n} y_i$$

Finding the regression parameters for nonlinear or multivariable regressions is conceptually the same, e.g.

$$y = b_0 + b_1 \cdot x + b_2 \cdot z + \ldots \qquad y = b_0 + b_1 \cdot x + b_2 \cdot x^2 + \ldots$$
IN PRACTICE
1 The easiest way: directly using Excel.
2 More complicated: analytical solution of the system of linear equations (the normal equations):

$$b_1 \sum_{i=1}^{n} x_i^2 + b_0 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} x_i y_i \qquad b_1 \sum_{i=1}^{n} x_i + b_0 \cdot n = \sum_{i=1}^{n} y_i$$

• Matrix method
• Gauss method
• Cramer's rule
• Online applications

Example, with the summations $\sum x_i^2 = 10436$, $\sum x_i = 426$, $\sum x_i y_i = 24022$, $n = 22$ and $\sum y_i = 1070$ computed from the data:

$$b_1 \cdot 10436 + b_0 \cdot 426 = 24022$$
$$b_1 \cdot 426 + b_0 \cdot 22 = 1070$$

which gives $b_1 = 1.51$ and $b_0 = 19.4$ for the model $y = b_0 + b_1 \cdot x$.
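The same 2×2 system can be handed to any linear solver; a minimal sketch with NumPy, reusing the summations from the example above:

```python
import numpy as np

# Normal equations in matrix form A @ b = c, with unknowns b = (b1, b0):
#   b1 * 10436 + b0 * 426 = 24022
#   b1 * 426   + b0 * 22  = 1070
A = np.array([[10436.0, 426.0],
              [426.0,    22.0]])
c = np.array([24022.0, 1070.0])

b1, b0 = np.linalg.solve(A, c)
print(f"b1 = {b1:.2f}, b0 = {b0:.1f}")  # b1 = 1.51, b0 = 19.4
```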
3 More complicated: using statistics tools.

In terms of statistics:

$$b_1 = \frac{s_y}{s_x} \cdot r_{xy} \qquad b_0 = \bar{y} - b_1 \cdot \bar{x}$$

In terms of algebra:

$$b_1 \sum_{i=1}^{n} x_i^2 + b_0 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} x_i y_i \qquad b_1 \sum_{i=1}^{n} x_i + b_0 \cdot n = \sum_{i=1}^{n} y_i$$
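A minimal sketch of the statistical route, assuming x and y are NumPy arrays (the sample below reuses the factory data from Tutorial 2 at the end of this module):

```python
import numpy as np

def slr_coefficients(x, y):
    """Slope and intercept of a simple linear regression from sample statistics."""
    r_xy = np.corrcoef(x, y)[0, 1]                  # Pearson coefficient
    b1 = (y.std(ddof=1) / x.std(ddof=1)) * r_xy     # b1 = (s_y / s_x) * r_xy
    b0 = y.mean() - b1 * x.mean()                   # b0 = y_bar - b1 * x_bar
    return b0, b1

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # assets
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])       # annual profits
b0, b1 = slr_coefficients(x, y)
print(f"y = {b0:.2f} + {b1:.3f} * x")          # y = 0.60 + 0.080 * x
```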
QUANTITATIVE DESCRIPTION OF SLR
The total variation of the response decomposes as SST = SSR + SSM:

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$

where the left-hand side is the total sum of squares of the response variable y (SST), the first term on the right is the sum of squares of residuals (error sum of squares, SSR), and the second is the sum of squares explained by the model (regression sum of squares, SSM).
ROLE OF R²
The coefficient of determination (R²) determines how far the variance of one variable is explained by the other, i.e. how well it can be predicted by the regression model:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

[Two Price vs Quality scatter plots: one with R² = 0.8877, where the price is mainly determined by the product's quality, and one with R² = 0.0073, where only a small fraction of the price variance is determined by the quality.]

The higher the R², the better the regression model predicts the behavior of the dependent variable.
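A sketch computing R² straight from this definition (reusing the coefficients fitted in the previous sketch):

```python
import numpy as np

def r_squared(x, y, b0, b1):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    y_hat = b0 + b1 * x
    ss_res = np.sum((y - y_hat) ** 2)       # residual (error) sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares
    return 1.0 - ss_res / ss_tot

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
print(f"R^2 = {r_squared(x, y, b0=0.6, b1=0.08):.2f}")  # 0.64, i.e. r_xy^2
```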
DIAGNOSTICS FOR SLR
• Linear relationship between the variables X and Y.
• Exogeneity - the residuals follow a normal distribution centered on 0 with a standard deviation σ.
• Homoscedasticity - constant variance of the residuals over the whole range of the variables.
• Independence - no relationship (autocorrelation) between the residuals.
Interactive examples of these diagnostics: http://www2.stat.duke.edu/~mc301/shinyed/
VALIDATION OF REGRESSION PARAMETERS
The regression line for a sample may differ from the regression line for the total population. According to the central limit theorem, we can assess the regression parameters of the entire population relying on the data from the sample.

$$\hat{y} = b_0 + b_1 \cdot x \quad \text{(regression line for the sample)} \qquad y = \beta_0 + \beta_1 \cdot x \quad \text{(regression line for the population)}$$
CONFIDENCE INTERVALS
The variances of the parameters $b_0$ and $b_1$ can be defined as follows:

$$s_{b_1}^2 = \frac{(1 - r^2) \cdot s_y^2}{(n - 2) \cdot s_x^2} \qquad s_{b_0}^2 = \frac{(1 - r^2) \cdot s_y^2}{(n - 2) \cdot s_x^2} \cdot \left( s_x^2 + \bar{x}^2 \right)$$

A confidence interval for the regression coefficients (at a confidence level P, typically 70-95%) can then be found using Student's distribution:

$$b_0 - t_{\alpha/2}^{n-2} \cdot s_{b_0} < \beta_0 < b_0 + t_{\alpha/2}^{n-2} \cdot s_{b_0}$$

$$b_1 - t_{\alpha/2}^{n-2} \cdot s_{b_1} < \beta_1 < b_1 + t_{\alpha/2}^{n-2} \cdot s_{b_1}$$
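A sketch of these intervals on the same hypothetical data; the variance formulas are implemented exactly as given above, and the Student quantile comes from scipy.stats:

```python
import numpy as np
from scipy import stats

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
n = len(x)
r = np.corrcoef(x, y)[0, 1]
sx2, sy2 = x.var(ddof=1), y.var(ddof=1)
b1 = np.sqrt(sy2 / sx2) * r
b0 = y.mean() - b1 * x.mean()

# Standard errors of the coefficients, using the variance formulas above
s_b1 = np.sqrt((1 - r**2) * sy2 / ((n - 2) * sx2))
s_b0 = np.sqrt((1 - r**2) * sy2 / ((n - 2) * sx2) * (sx2 + x.mean()**2))

alpha = 0.05                                    # 95% confidence level
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)   # Student quantile
print(f"{b1 - t_crit*s_b1:.3f} < beta1 < {b1 + t_crit*s_b1:.3f}")
print(f"{b0 - t_crit*s_b0:.3f} < beta0 < {b0 + t_crit*s_b0:.3f}")
```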
PREDICTION WITH THE CONFIDENCE INTERVAL
The uncertainty of the coefficients is a source of prediction error, so a prediction at a new point $x_0$ should be given with a confidence interval:

$$y_0 = b_0 + b_1 \cdot x_0 \pm t_{\alpha/2}^{n-2} \cdot s_{y_0}$$

where $s_{y_0}$ is a term which combines the uncertainties of the regression parameters:

$$s_{y_0}^2 = \frac{(1 - r^2) \cdot s_y^2}{(n - 2) \cdot s_x^2} \cdot \left( (n + 1) \cdot s_x^2 + (x_0 - \bar{x})^2 \right)$$

with $s_{b_0}^2$ and $s_{b_1}^2$ as defined above. As before, $\hat{y} = b_0 + b_1 \cdot x$ is the regression line for the sample and $y = \beta_0 + \beta_1 \cdot x$ the regression line for the population.
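Continuing the sketch, the prediction with its confidence interval at a new point x0:

```python
import numpy as np
from scipy import stats

def predict_with_ci(x, y, x0, alpha=0.05):
    """Point prediction at x0 and its confidence half-width (slide formulas)."""
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    sx2, sy2 = x.var(ddof=1), y.var(ddof=1)
    b1 = np.sqrt(sy2 / sx2) * r
    b0 = y.mean() - b1 * x.mean()
    # s_y0 combines the uncertainties of both regression parameters
    s_y0 = np.sqrt((1 - r**2) * sy2 / ((n - 2) * sx2)
                   * ((n + 1) * sx2 + (x0 - x.mean())**2))
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return b0 + b1 * x0, t_crit * s_y0

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
y0, half = predict_with_ci(x, y, x0=25.0)
print(f"y(25) = {y0:.2f} +/- {half:.2f}")
```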
DIAGNOSTICS FOR SLR – T-TEST
The t-test allows one to determine whether the regression model is statistically significant.

t-test steps:
1) Specify the hypothesis: H0: β1 = 0; H1: β1 ≠ 0, usually at the 5% significance level (α).
2) Determine t (the numerator reduces to $b_1$, since $\beta_1 = 0$ when testing H0):

$$t = \frac{b_1^{\,\text{sample}} - \beta_1^{\,\text{population}}}{s_{b_1}} = \frac{b_1}{s_{b_1}}, \qquad s_{b_1}^2 = \frac{(1 - r^2) \cdot s_y^2}{(n - 2) \cdot s_x^2}$$

3) Quantify the evidence of the test: compare the computed t value to the critical values $\pm t_{1-\alpha/2}^{n-2}$. If the computed t falls in a rejection region (smaller than $-t_{1-\alpha/2}^{n-2}$ or greater than $+t_{1-\alpha/2}^{n-2}$), H0 can be rejected; otherwise it lies in the "H0 acceptance region".
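A sketch of the three steps on the same data; the two-sided p-value via the t distribution's survival function is an addition not shown on the slide:

```python
import numpy as np
from scipy import stats

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
n = len(x)

# 1) H0: beta1 = 0 vs H1: beta1 != 0, at the 5% significance level
alpha = 0.05

# 2) t = b1 / s_b1
r = np.corrcoef(x, y)[0, 1]
sx2, sy2 = x.var(ddof=1), y.var(ddof=1)
b1 = np.sqrt(sy2 / sx2) * r
s_b1 = np.sqrt((1 - r**2) * sy2 / ((n - 2) * sx2))
t = b1 / s_b1

# 3) Compare to the critical values +/- t_{1-alpha/2}^{n-2}
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
p_value = 2 * stats.t.sf(abs(t), df=n - 2)   # two-sided p-value
print(f"t = {t:.2f}, critical = +/-{t_crit:.2f}, p = {p_value:.3f}")
print("Reject H0" if abs(t) > t_crit else "Cannot reject H0")
```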
SIMPLE NON-LINEAR REGRESSION
Nonlinear regression is called for when:
• the 2D data features a nonlinear pattern;
• there is an obvious pattern in the residual plot.

Segmentation: X can be split up into classes or segments, and a linear regression can be performed per segment.
For example, the exponential model:

$$\hat{y} = b_0 \cdot e^{b_1 \cdot x}$$
LEAST SQUARES METHOD
For the exponential model the criterion is

$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left( y_i - b_0 \cdot e^{b_1 x_i} \right)^2 = f(b_0, b_1) \to \min$$

When the function is minimal, the partial derivatives with respect to each parameter equal 0:

$$\frac{\partial f}{\partial b_0} = 2 \sum_{i=1}^{n} \left( y_i - b_0 \cdot e^{b_1 x_i} \right) \cdot \left( -e^{b_1 x_i} \right) = 0$$

$$\frac{\partial f}{\partial b_1} = 2 \sum_{i=1}^{n} \left( y_i - b_0 \cdot e^{b_1 x_i} \right) \cdot \left( -b_0 \cdot x_i \cdot e^{b_1 x_i} \right) = 0$$

which leads to the system

$$-\sum_{i=1}^{n} y_i \cdot e^{b_1 x_i} + b_0 \sum_{i=1}^{n} e^{2 b_1 x_i} = 0$$

$$\sum_{i=1}^{n} y_i \cdot x_i \cdot e^{b_1 x_i} - b_0 \sum_{i=1}^{n} x_i \cdot e^{2 b_1 x_i} = 0$$

Unlike the linear case, this system has no closed-form solution: a numerical solution (iterative method) must be used.
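In practice the iteration is rarely hand-rolled; a minimal sketch with scipy.optimize.curve_fit on synthetic data (true parameters b0 = 2, b1 = 0.3 chosen for illustration):

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, b0, b1):
    """Exponential regression model y = b0 * exp(b1 * x)."""
    return b0 * np.exp(b1 * x)

# Synthetic noisy data generated from b0 = 2, b1 = 0.3
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 30)
y = model(x, 2.0, 0.3) + rng.normal(0.0, 0.2, x.size)

# Iterative least squares, started from an initial guess p0
(b0, b1), _ = curve_fit(model, x, y, p0=(1.0, 0.1))
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
```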
LINEARIZATION PROCEDURE
Linearization: several nonlinear regression functions can be moved to a linear domain and solved as SLR. For the exponential model:

$$y = b_0 \cdot e^{b_1 \cdot x}$$
$$\ln y = \ln b_0 + \ln \left( e^{b_1 \cdot x} \right)$$
$$\ln y = \ln b_0 + b_1 \cdot x$$

With $Y = \ln y$ and $B_0 = \ln b_0$, this is the linear model $Y = B_0 + b_1 \cdot x$, and the original parameter is recovered as $b_0 = e^{B_0}$.
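The same fit via linearization: an ordinary SLR of ln y on x, then a back-transform of the intercept (sketch on the same kind of synthetic data):

```python
import numpy as np

# Synthetic positive data from y = b0 * exp(b1 * x), b0 = 2, b1 = 0.3
rng = np.random.default_rng(1)
x = np.linspace(0.0, 5.0, 30)
y = 2.0 * np.exp(0.3 * x) * np.exp(rng.normal(0.0, 0.05, x.size))

# Linear domain: ln y = ln b0 + b1 * x, an ordinary SLR on (x, ln y)
b1, B0 = np.polyfit(x, np.log(y), deg=1)   # polyfit returns the slope first
b0 = np.exp(B0)                            # back-transform the intercept
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
```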
POLYNOMIAL FUNCTION
$$y = b_0 + b_1 \cdot x + b_2 \cdot x^2 + b_3 \cdot x^3 + \ldots + b_n \cdot x^n$$

[Scatter plot of Y (0-400) vs X (0-500) showing a nonlinear trend.]
[The same Y vs X data fitted with a straight line (R² = 0.8019) and with a 2nd-order polynomial (R² = 0.9075), each shown with its plot of residuals vs X.]
Fits of the same data with polynomials of increasing order:

Order:  linear   2nd order   3rd order   4th order   5th order   6th order
R²:     0.8019   0.9075      0.9153      0.9281      0.9283      0.9336
OVERFITTING WITH POLYNOMIAL FUNCTION
An excellent model that fits all points individually is like an individually tailored dress or suit: non-universal.

$$y = b_0 + b_1 \cdot x + b_2 \cdot x^2 + b_3 \cdot x^3 + \ldots + b_n \cdot x^n$$
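A sketch of the order-vs-fit trade-off with numpy.polyfit on synthetic data (values chosen for illustration): R² keeps creeping upward with the order even once the extra terms only chase noise, which is the overfitting warned about above.

```python
import numpy as np

# Synthetic data: a smooth trend plus noise
rng = np.random.default_rng(2)
x = np.linspace(0.0, 5.0, 25)
y = 50.0 * x + 20.0 * np.sin(2.0 * x) + rng.normal(0.0, 15.0, x.size)

for order in range(1, 7):
    coeffs = np.polyfit(x, y, deg=order)    # least squares polynomial fit
    y_hat = np.polyval(coeffs, x)
    r2 = 1.0 - np.sum((y - y_hat)**2) / np.sum((y - y.mean())**2)
    print(f"order {order}: R^2 = {r2:.4f}")
```

On the training data this R² sequence can only increase as orders are added, so on its own it cannot flag overfitting.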
PRACTICE
► Tutorial 1
- The ionic product of a solvent (pKs) is linked to its dielectric constant ε via the relation:

$$pK_s = \alpha + \frac{\beta}{\varepsilon}$$

- We know the following results:

Solvent        pKs    ε
Water          14     78.5
Ethanol        19.1   24.3
Iso-propanol   20.8   18.3
Methanol       16.7   32.6
► Tutorial 1
- Check graphically the validity of the given formula.
- Give the values of α and β:
• as point estimates;
• with a confidence interval of 95%.
- For n-propanol, ε = 20.1. Give its pKs:
• as a point estimate;
• with a confidence interval at a 5% risk.
► Tutorial 2
For 5 factories, we have the values of their assets and their annual profits (M$):

Factory i           1    2    3    4    5
Assets xi          10   20   30   40   50
Annual profits yi   1    3    2    5    4

1) Are these two variables linked? Verify the hypothesis using statistical tools.
2) Calculate the ranges of the regression model parameters for confidence intervals of 80, 90 and 95%.
3) Verify whether the proposed model describes the following data. For which confidence interval? Is the model good enough?

Factory i           1    2    3    4    5
Assets xi          15   20   35   45   50
Annual profits yi   3    1    1    4    2

Factory             1
Assets             25
Annual profits      5