Lecture7b Session

The lecture discusses correlation and regression analysis, emphasizing their importance in understanding relationships within vast data sets. It explains the differences between correlation, which summarizes direct relationships, and regression, which predicts behavior and can demonstrate cause and effect. The document also covers mathematical foundations, covariance, correlation coefficients, and the interpretation of regression equations.

B123 – Statistics

Lecture 7

Dr. Christian Dombrowski



Correlation and Regression – Why?
• Vast data sets with many (dependent) input factors
• Overview lost
• Relationships between input factors unclear
• Is there a relationship? What is it?
• What is future behavior?

• Correlation Analysis
• Describes degree of relationship
• Positive vs. negative vs. none
• Strong vs. weak
• Linear, non-linear, etc.

• Regression Analysis
• Describes impact of one RV on the other
• Linear, log, etc.

[Scatter plot of a sample data set; both axes run from 0 to 1]

GISMA Business School – Potsdam – B123


Correlation vs. Regression

|                               | Correlation                   | Regression                  |
|-------------------------------|-------------------------------|-----------------------------|
| When to use?                  | Summarize direct relationship | Explain or predict behavior |
| Quantifies direction?         | Yes                           | Yes                         |
| Quantifies strength?          | Yes                           | Yes                         |
| Can demonstrate cause&effect? | No                            | Yes                         |
| Can predict and optimize?     | No                            | Yes                         |

Neither proves a causal relationship!
In most cases, they do not even imply it.


Mathematical Foundation – Covariance
• Covariance of RVs 𝑋 and 𝑌
• Assumption: the expectation values 𝜇𝑋, 𝜇𝑌 exist
• $\operatorname{Cov}(X, Y) = \sigma_{XY} = \mathrm{E}[(X - \mu_X) \cdot (Y - \mu_Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x - \mu_X) \cdot (y - \mu_Y) \cdot f_{XY}(x, y) \,\mathrm{d}x \,\mathrm{d}y$
• Discrete version: $\operatorname{Cov}(X, Y) = \sum_{i=1}^{\infty} \sum_{j=1}^{\infty} (x_i - \mu_X) \cdot (y_j - \mu_Y) \cdot \mathrm{P}(x_i, y_j)$
• Can be positive, negative, zero, or infinite

• Important rules
• Covariance with itself: $\operatorname{Cov}(X, X) = \operatorname{Var}(X) = \sigma_X^2$
• Alternative calculation: $\operatorname{Cov}(X, Y) = \mathrm{E}[X \cdot Y] - \mathrm{E}[X] \cdot \mathrm{E}[Y]$
• Bilinear operation: $\operatorname{Cov}(a + b \cdot X, c + d \cdot Y) = b \cdot d \cdot \operatorname{Cov}(X, Y)$
• Symmetry: $\operatorname{Cov}(X, Y) = \operatorname{Cov}(Y, X)$
• Distributivity: $\operatorname{Cov}(X, Y + Z) = \operatorname{Cov}(X, Y) + \operatorname{Cov}(X, Z)$

• Extra: Multiple RVs possible; Representation using vectors & matrices
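The rules above are easy to sanity-check numerically. A minimal sketch, assuming NumPy is available; the sample data is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(size=1000)  # y partially depends on x

def cov(a, b):
    """Sample covariance E[(A - mu_A)(B - mu_B)], uncorrected (1/N)."""
    return np.mean((a - a.mean()) * (b - b.mean()))

# Covariance with itself: Cov(X, X) = Var(X)
assert np.isclose(cov(x, x), np.var(x))
# Alternative calculation: Cov(X, Y) = E[XY] - E[X]E[Y]
assert np.isclose(cov(x, y), np.mean(x * y) - x.mean() * y.mean())
# Bilinearity: Cov(a + bX, c + dY) = b * d * Cov(X, Y)
assert np.isclose(cov(3 + 2 * x, 1 - 4 * y), 2 * (-4) * cov(x, y))
# Symmetry: Cov(X, Y) = Cov(Y, X)
assert np.isclose(cov(x, y), cov(y, x))
```

Each assertion mirrors one rule from the slide; the identities hold exactly for sample moments as well, not just in expectation.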


Exercise
• What is Cov(3𝑋, 5 + 𝑋)?

• If 𝑋1 , 𝑋2 are independent with Var 𝑋𝑖 = 𝜎 2 , what is Cov 𝑋1 + 𝑋2 , 𝑋1 − 𝑋2 ?



Correlation Coefficient
• Covariance allows us to interpret the type of relationship (positive/negative/zero)
• Correlation coefficients allow us to interpret its strength (strong/weak)

• Pearson correlation coefficient
• Linear relationship, interval data
• Assuming 𝜎𝑋, 𝜎𝑌 are neither zero nor infinite and 𝜇𝑋, 𝜇𝑌 exist
• $\rho(X, Y) = \dfrac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X) \cdot \operatorname{Var}(Y)}} = \dfrac{\sigma_{XY}}{\sigma_X \cdot \sigma_Y} = \dfrac{\mathrm{E}[X \cdot Y] - \mathrm{E}[X] \cdot \mathrm{E}[Y]}{\sqrt{\left(\mathrm{E}[X^2] - \mathrm{E}[X]^2\right) \cdot \left(\mathrm{E}[Y^2] - \mathrm{E}[Y]^2\right)}}$
• Range: $[-1, 1]$

• Extra: other corr. coeff. exist (e.g. non-linear corr., or ordinal data, …)
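The moment form of Pearson's ρ can be evaluated directly and compared against NumPy's built-in. A sketch, assuming NumPy; the toy data is invented:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])  # roughly y = 2x, plus small noise

# Moment form: (E[XY] - E[X]E[Y]) / sqrt((E[X^2]-E[X]^2)(E[Y^2]-E[Y]^2))
num = np.mean(x * y) - np.mean(x) * np.mean(y)
den = np.sqrt((np.mean(x**2) - np.mean(x)**2) * (np.mean(y**2) - np.mean(y)**2))
rho = num / den

print(round(rho, 4))  # close to +1: strong positive linear relationship
print(np.isclose(rho, np.corrcoef(x, y)[0, 1]))  # agrees with NumPy's estimator
```

The 1/N vs. 1/(N−1) convention cancels in the ratio, which is why the moment form and `np.corrcoef` agree.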



Estimators for Covariance and Corr.-Coeff.
• Sample covariance $s_{XY}$
• $s_{XY} = \dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x}) \cdot (y_i - \bar{y})$
• Unbiased*, highly susceptible to outliers, uses whole data set

• Sample correlation coefficient $r_{XY}$
• $r_{XY} = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x}) \cdot (y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \cdot \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} = \dfrac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n} \sum_{i=1}^{n} x_i \cdot \sum_{i=1}^{n} y_i}{\sqrt{\sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left(\sum_{i=1}^{n} x_i\right)^2} \cdot \sqrt{\sum_{i=1}^{n} y_i^2 - \frac{1}{n} \left(\sum_{i=1}^{n} y_i\right)^2}}$
• Range: $[-1, 1]$
• Biased**, highly susceptible to outliers, uses whole data set
• Rule-of-thumb to check if a correlation exists: $|r| \geq \dfrac{2}{\sqrt{n}}$

• Sampling distribution (highly depends on 𝜌, often not normally distributed!)

• Extra: Z-transformation of 𝑟 (which is then normally distributed)
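The two forms of $r_{XY}$ above are algebraically identical; a small numerical check, plus the Z-transformation mentioned as extra (a sketch, assuming NumPy; the data is invented):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
n = len(x)

# Centered form
r1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean())**2) * np.sum((y - y.mean())**2))

# Raw-sums form (handy for hand calculation with a table of sums)
r2 = (np.sum(x * y) - np.sum(x) * np.sum(y) / n) / np.sqrt(
    (np.sum(x**2) - np.sum(x)**2 / n) * (np.sum(y**2) - np.sum(y)**2 / n))

assert np.isclose(r1, r2)

# Fisher Z-transformation: z = artanh(r) is approximately normally distributed
z = np.arctanh(r1)
```

Both forms give the same value; the raw-sums form only rearranges the centered one, so either can be used depending on which intermediate sums are at hand.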
*: Not that easy; in most cases it is unbiased (or at least asymptotically unbiased)
**: Even harder; in most cases it is biased
Exercise
• Draw the following data set, using an appropriate chart
|   | x₁ | x₂ | x₃ | x₄ | x₅ | x₆ | x₇ | x₈ | x₉ | x₁₀ |
|---|----|----|----|----|----|----|----|----|----|-----|
| A | 14 | 4  | 6  | 11 | 8  | 12 | 12 | 10 | 2  | 9   |
| B | 7  | 5  | 3  | 4  | 0  | 3  | 4  | 1  | 7  | 0   |

• Based on the graph, predict 𝜌𝐴𝐵



Exercise
• Calculate 𝑟𝐴𝐵
|   | x₁ | x₂ | x₃ | x₄ | x₅ | x₆ | x₇ | x₈ | x₉ | x₁₀ |
|---|----|----|----|----|----|----|----|----|----|-----|
| A | 14 | 4  | 6  | 11 | 8  | 12 | 12 | 10 | 2  | 9   |
| B | 7  | 5  | 3  | 4  | 0  | 3  | 4  | 1  | 7  | 0   |

• Are 𝐴 and 𝐵 uncorrelated?

• Are 𝐴 and 𝐵 (likely) independent?
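A hand calculation of $r_{AB}$ can be checked with the raw-sums formula (a sketch, assuming NumPy):

```python
import numpy as np

A = np.array([14, 4, 6, 11, 8, 12, 12, 10, 2, 9], dtype=float)
B = np.array([7, 5, 3, 4, 0, 3, 4, 1, 7, 0], dtype=float)
n = len(A)

num = n * np.sum(A * B) - np.sum(A) * np.sum(B)
den = np.sqrt((n * np.sum(A**2) - np.sum(A)**2) * (n * np.sum(B**2) - np.sum(B)**2))
r_AB = num / den

print(round(r_AB, 3))  # a weak negative sample correlation
# Rule-of-thumb from the slides (assumed here as |r| >= 2/sqrt(n))
print(abs(r_AB) >= 2 / np.sqrt(n))
```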


Correlation and Dependence
• Definition of correlation
• Two RVs 𝑋 and 𝑌 are correlated iff the correlation coefficient 𝜌 is not zero: 𝜌(𝑋, 𝑌) ≠ 0
• Depends on the correlation coefficient used!

• We use Pearson's correlation coefficient → correlation describes a linear relationship
• Two RVs are uncorrelated if 𝜌(𝑋, 𝑌) = 0
• Why did we define in Lecture 5:
  Two RVs are uncorrelated iff E[𝑋 ⋅ 𝑌] = E[𝑋] ⋅ E[𝑌]
• Is this a different definition?

• Dependence is stronger/broader than correlation

• Independence is stronger than uncorrelated-ness
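The classic illustration of the last two bullets: take 𝑋 symmetric around 0 and 𝑌 = 𝑋². Then 𝑌 is fully determined by 𝑋 (maximal dependence), yet their Pearson correlation is zero up to sampling noise. A sketch, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)  # symmetric around 0, so E[X^3] = 0
y = x**2                      # completely dependent on x, but non-linearly

r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))  # near 0: uncorrelated, despite full dependence
```

This works because Cov(X, X²) = E[X³] − E[X]·E[X²] = 0 for a symmetric 𝑋; Pearson's coefficient only detects linear relationships.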
Regression
• Mathematical formulation of the impact of one RV on another
• Mostly an approximation, valid only within the range of the sample set
• Focus: linear relationship
• Described using polynomial functions of degree 1
• 𝑓(𝑥) = 𝑎₀ + 𝑎₁ ⋅ 𝑥 (in school sometimes 𝑦 = 𝑚 ⋅ 𝑥 + 𝑐)
• Exercise: Draw 𝑓(𝑥) = 4 − 0.5𝑥 and 𝑔(𝑥) = 2𝑥 − 1



Interpretation of Regression Equation
• 𝑓 𝑥 = 𝑎0 + 𝑎1 ⋅ 𝑥
• Think of an example in your future business life!

• What is 𝑥?

• What is 𝑎0 ?

• What is 𝑎1 ?



Derivation of the Regression Equation
• $f(x) = a_0 + a_1 \cdot x$
• Estimators for the parameters $a_0$ and $a_1$ needed
• For $\hat{a}_1$ ("sensitivity"):
$$\hat{a}_1 = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \cdot \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x}) \cdot (y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = r_{XY} \cdot \frac{s_Y^*}{s_X^*}$$

• The sample std. dev. $s_X^*, s_Y^*$ use the uncorrected term $1/N$ instead of $1/(N-1)$
• For $\hat{a}_0$ ("offset", "blank value"):
$$\hat{a}_0 = \frac{1}{n} \cdot \left( \sum_{i=1}^{n} y_i - \hat{a}_1 \cdot \sum_{i=1}^{n} x_i \right) = \bar{y} - \hat{a}_1 \cdot \bar{x}$$

• Estimators are unbiased, use whole data set

• Formulas assume Least Squares Error (LSE) minimization for fitting
• Mean Squared Error (MSE) is the metric to evaluate the quality of fit (Nugget 6!)
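The estimator formulas above take only a few lines and can be cross-checked against NumPy's least-squares fit. A sketch, assuming NumPy; the sample data is invented:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 3.9, 6.1, 8.0, 9.9])
n = len(x)

# a1_hat: sensitivity (slope), raw-sums form from the slide
a1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
# a0_hat: offset (intercept)
a0 = np.mean(y) - a1 * np.mean(x)

# Equivalent form: a1 = r_XY * s_Y / s_X (np.std defaults to the uncorrected 1/N)
r = np.corrcoef(x, y)[0, 1]
assert np.isclose(a1, r * np.std(y) / np.std(x))

# Cross-check against NumPy's least-squares polynomial fit of degree 1
slope, intercept = np.polyfit(x, y, 1)
assert np.isclose(a1, slope) and np.isclose(a0, intercept)
```

Note that the $r_{XY} \cdot s_Y^*/s_X^*$ form gives the same slope whether the corrected or uncorrected standard deviation is used, since the factor cancels in the ratio.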
Exercise
• Based on the cost structure below, what are our costs if our prime customer orders 7.5k units in July? What about 12k units in August?

| Month | Units (in thsd) | Cost (in k€) | 𝒙⋅𝒚 | 𝒙² | 𝒚² |
|-------|-----------------|--------------|-----|----|----|
| Jan   | 7               | 12           |     |    |    |
| Feb   | 6               | 8            |     |    |    |
| Mar   | 8               | 12           |     |    |    |
| Apr   | 5               | 10           |     |    |    |
| May   | 6               | 11           |     |    |    |
| Jun   | 9               | 13           |     |    |    |
| Total |                 |              |     |    |    |
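The procedure, sketched in code (assuming NumPy; `predict` is an illustrative helper, not part of the slides):

```python
import numpy as np

units = np.array([7.0, 6.0, 8.0, 5.0, 6.0, 9.0])     # x, in thousand units
cost = np.array([12.0, 8.0, 12.0, 10.0, 11.0, 13.0])  # y, in k euro

a1, a0 = np.polyfit(units, cost, 1)  # slope (cost per thousand units), intercept

def predict(x):
    return a0 + a1 * x

print(round(predict(7.5), 2))   # July: within the sample range (5-9 thsd), fine
print(round(predict(12.0), 2))  # August: extrapolation beyond the sample range - treat with caution
```

The August prediction illustrates the earlier caveat that a regression is valid only within the range of the sample set.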



Prediction Precision
• Rule-of-thumb metrics
• Visual inspection
• If points are close to regression line
• Correlation coefficient and gof value

• Precise metric: residual analysis

• Goal: $f(x) = a_0 + a_1 \cdot x$
• What are $a_0$ and $a_1$?
• Measurable: $y_i = a_0 + a_1 \cdot x_i + e_i$
• Least-squares minimization was shown to be optimal
• Requirement: all 𝑥ᵢ need to be mutually independent
• Minimization problem: $\min_{a_0, a_1} \sum_{i=1}^{n} e_i^2$, which is $\min_{a_0, a_1} \sum_{i=1}^{n} \left(y_i - a_0 - a_1 \cdot x_i\right)^2$
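Residual analysis in code: fit the line, compute $e_i = y_i - (a_0 + a_1 x_i)$, and verify two standard properties of a least-squares fit with intercept, namely that the residuals sum to zero and are orthogonal to the regressor. A sketch, assuming NumPy; the data is invented:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 2.3, 2.8, 4.2, 4.9, 6.3])

a1, a0 = np.polyfit(x, y, 1)
fitted = a0 + a1 * x
residuals = y - fitted  # e_i = y_i - a0 - a1 * x_i

# For a least-squares fit with intercept, residuals sum to ~0 ...
assert np.isclose(residuals.sum(), 0.0, atol=1e-9)
# ... and are uncorrelated with the regressor
assert np.isclose(np.sum(residuals * x), 0.0, atol=1e-8)

print(np.round(residuals, 3))
```

Large or systematically patterned residuals (e.g. all positive at the ends) signal that the linear model is a poor fit.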



Exercise – Cont'd
• Draw the data points and the regression line in the chart below!
• Calculate the fitted values and the residuals (estimation errors) for all observations!



Goodness Of Fit
• Goodness of Fit measure
• Description of how well the regression line fits the sample set
• gof = 𝑟² ⋅ 100 % (typically given in percent)
• Variation in the independent variable accounts for gof % of the variation in the dependent variable
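The computation is a one-liner; a sketch (`gof_percent` is a hypothetical helper name):

```python
def gof_percent(r: float) -> float:
    """Goodness of fit: gof = r^2 * 100 %."""
    return r**2 * 100

print(round(gof_percent(0.5), 1))   # 25.0 - r = 0.5 explains 25 % of the variation
print(round(gof_percent(-0.5), 1))  # 25.0 - the sign of r does not matter for gof
```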

• Exercise
• You calculated 𝑟 = −0.126. What is the gof?

• You calculated 𝑟 = 0.891. What is the gof?

• What is gof for the previous production example?



Correlation and Regression Analysis
• Steps to analyze a sample set
1. Think about the problem and identify potential impact factors
2. Pre-process the sample set
3. Recommended: Draw the data (or have it drawn)
4. Determine the correlation coefficient 𝑟𝑋𝑌
5. Test 𝑟𝑋𝑌 for significance (see Lecture "Hypothesis Tests")
6. If not significant, choose a better (non-linear) correlation coeff. (jump to 3.)
7. Derive the regression equation (RE)
8. Recommended: Draw the RE
9. Calculate the gof
10. Interpret the coefficients w.r.t. the impact factors from 1., also considering the gof
11. Recommended: Check if the regression still makes sense for extreme values of the data set
12. Residual analysis
13. Confidence interval calculation
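Steps 4, 5, 7, and 9 of the recipe above can be sketched end-to-end (assuming NumPy; the t-statistic $t = r\sqrt{n-2}/\sqrt{1-r^2}$ is the standard significance test for $r$, covered in the Hypothesis Tests lecture; the data is invented):

```python
import numpy as np

x = np.array([7.0, 6.0, 8.0, 5.0, 6.0, 9.0, 4.0, 10.0])
y = np.array([12.0, 8.0, 12.0, 10.0, 11.0, 13.0, 9.0, 14.0])
n = len(x)

# Step 4: correlation coefficient
r = np.corrcoef(x, y)[0, 1]

# Step 5: significance via the t-statistic, compared against a
# t-distribution with n - 2 degrees of freedom
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)

# Step 7: regression equation f(x) = a0 + a1 * x
a1, a0 = np.polyfit(x, y, 1)

# Step 9: goodness of fit
gof = r**2 * 100

print(f"r = {r:.3f}, t = {t:.2f}, f(x) = {a0:.2f} + {a1:.2f}x, gof = {gof:.1f} %")
```

Steps 10-13 (interpretation, extreme-value check, residual analysis, confidence intervals) then build on these quantities.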

