Wachemo University
School of Public Health
Abriham S. Areba (Assistant Professor)
Simple Linear Regression and Correlation
October, 2024
Hossana, Ethiopia
Variable: a characteristic that takes on different values for different individuals or units.
Types of Variables
Quantitative (Numerical) variables
Example: Number of children in a family, Weight, height,
BP,...
Qualitative (Categorical) variables
Example: Marital status, religion, Education status, patient
satisfaction, …
Variables can again be classified into two broad categories:
Dependent Variable
o Also called the response/regressand/endogenous/outcome/effect/explained variable
o It is the focus of the research
o Affected by other (independent) variables
Independent Variables
o Also called explanatory/regressor/exogenous/predictor/covariate/causal variables
o Affects the outcome variable
Simple Linear Regression and Correlation
o Regression analysis is concerned with describing and evaluating
the relationship between a given variable (often called the
dependent variable) and one or more variables which are assumed
to influence the given variable (often called explanatory variables).
o Predict the value of a dependent variable based on the value of at
least one independent variable.
o Explain the impact of changes in an independent variable on the
dependent variable.
Linear Regression Model
o When we observe pairs (X, Y), we would like to write a statistical relation with uniformly small error.
o Since we do not know Y exactly for every X, we often approximate the relation between X and Y.
o The relationship between X and Y is described by a linear function.
o Changes in Y are assumed to be caused by changes in X.
Simple Linear Regression Model
o The goal is to determine how the average value of the continuous outcome y varies with the value of a single predictor x.
o The model is linear in the parameters: no parameter appears as an exponent or is multiplied or divided by another parameter.
Consider the following two models:
Model 1: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
Model 2: $Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \varepsilon_i$
Models 1 and 2 are both linear in the parameters, and can thus both be considered linear models.
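As an aside, "linear in the parameters" means both models can be fitted by ordinary least squares once the appropriate columns (1, x, and x squared) are placed in the design matrix. A minimal sketch in Python, using numpy only; the data values are made up purely for illustration:

import numpy as np

# hypothetical example data (for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Model 1: columns for the intercept and x
X1 = np.column_stack([np.ones_like(x), x])
b1, *_ = np.linalg.lstsq(X1, y, rcond=None)    # beta0_hat, beta1_hat

# Model 2: adding an x^2 column; the model is still linear in the parameters
X2 = np.column_stack([np.ones_like(x), x, x ** 2])
b2, *_ = np.linalg.lstsq(X2, y, rcond=None)    # beta0_hat, beta1_hat, beta2_hat

print(b1)
print(b2)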
Error term (ε)
In this context, error does not mean mistake; it is a statistical term representing random fluctuations, measurement error, or the effect of factors outside of our control.
$\varepsilon \sim N(0, \sigma^2)$
The true model cannot be observed since β0 and β1 are not
known. We must estimate them from the data.
This gives the estimated or fitted regression line:
$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
Where: $\hat{\beta}_0$ is the estimate of $\beta_0$,
$\hat{\beta}_1$ is the estimate of $\beta_1$, and
$\hat{y}_i$ is the estimated (fitted) value of $y_i$.
Assumptions of Simple Linear Regression Model
1. Linearity
2. Normality
3. Homogeneity of variance
4. Independence of error
Linearity: the relationship between the predictor and the outcome variable should be linear
Normality: the errors should be normally distributed
Homogeneity of Variance: the error variance should be constant
Independence: the errors associated with one observation are
not correlated with the errors of any other observation.
Assumption 3: the variance of y is the same for any x; that is, the spread of the values of y at each level of x remains approximately constant.
o The magnitude of the residuals is the vertical distance
between the actual observed points and the estimating line.
o The estimating line will have a ‘good fit’ if it minimizes the
error between the estimated points on the line and the actual
observed points that were used to draw it.
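One way the assumptions above might be checked in practice is by examining these residuals. The sketch below is illustrative only (Python with numpy and scipy; the data are hypothetical, not from the lecture):

import numpy as np
from scipy import stats

# hypothetical data (not from the lecture)
x = np.array([62.0, 63.0, 64.0, 65.0, 67.0, 68.0])
y = np.array([66.0, 66.0, 65.0, 68.0, 68.0, 69.0])

# fit the simple linear regression line
slope, intercept, r, p, se = stats.linregress(x, y)
fitted = intercept + slope * x
residuals = y - fitted               # vertical distances from the fitted line

# Normality of the errors: Shapiro-Wilk test on the residuals
w_stat, w_p = stats.shapiro(residuals)
print("Shapiro-Wilk p-value:", w_p)

# Linearity and constant variance: plot the residuals against the fitted
# values and look for the absence of any pattern (plotting code not shown)
print("Residuals:", residuals)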
The parameters $\hat{\beta}_0$ and $\hat{\beta}_1$ are calculated as:
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
$\hat{\beta}_1 = \dfrac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$
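A small Python helper implementing exactly these two formulas (numpy only; x and y are assumed to be numeric arrays of equal length):

import numpy as np

def ols_estimates(x, y):
    """Least-squares estimates for the simple linear regression y = b0 + b1*x."""
    n = len(x)
    # slope: (n*sum(x*y) - sum(x)*sum(y)) / (n*sum(x^2) - (sum(x))^2)
    beta1_hat = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (
        n * np.sum(x ** 2) - np.sum(x) ** 2)
    # intercept: y_bar - beta1_hat * x_bar
    beta0_hat = np.mean(y) - beta1_hat * np.mean(x)
    return beta0_hat, beta1_hat

# example: b0, b1 = ols_estimates(np.array([1., 2., 3.]), np.array([2., 4., 5.]))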
o The regression model assumes that, for each value of the independent variable x, the corresponding dependent variable y is normally distributed with:
❖ Mean, $\beta_0 + \beta_1 x$
and
❖ Variance, $\sigma^2$
o If $\sigma^2$ were 0, then every point would fall exactly on the regression line, whereas
o the larger $\sigma^2$ is, the more scatter occurs about the regression line.
$\beta_0$ and $\beta_1$ are not known, so we must estimate them.
This gives the estimated or fitted regression line:
$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$
$\hat{\beta}_0$ is the estimated mean response when X = 0.
$\hat{\beta}_1$ is the estimated change in the mean response for a unit increase in X.
$\hat{\beta}_1 > 0$ indicates a direct (positive) linear relationship between x and y.
$\hat{\beta}_1 < 0$ indicates an inverse (negative) linear relationship between x and y.
$\hat{\beta}_1 = 0$ indicates no linear relationship between x and y.
Tests of Significance of Regression Coefficients
The null hypothesis, that there is no linear relationship between X and Y, is expressed as:
Ho : β1 = 0
The alternative hypothesis is that there is a significant relationship
between X and Y, that is,
HA : β1 ≠ 0
In order to reject or fail to reject H0, we calculate the test statistic:
$t = \dfrac{\hat{\beta}_1 - 0}{Se(\hat{\beta}_1)} = \dfrac{\hat{\beta}_1}{Se(\hat{\beta}_1)}$
and compare it with the Student's t distribution with (n − 2) df for a given significance level α.
Decision rule:
If $|t| > t_{\alpha/2}(n-2)$,
then we reject the null hypothesis and conclude that there is a significant relationship between X and Y.
A $(1-\alpha)100\%$ CI for $\beta_1$ is given by:
$\hat{\beta}_1 \pm t_{\alpha/2}(n-2)\,Se(\hat{\beta}_1)$
(and analogously for $\beta_0$, using $Se(\hat{\beta}_0)$).
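A sketch of this test and confidence interval in Python (scipy's linregress reports the estimated slope and its standard error; the data below are hypothetical):

import numpy as np
from scipy import stats

# hypothetical data
x = np.array([62.0, 63.0, 64.0, 65.0, 67.0, 68.0])
y = np.array([66.0, 66.0, 65.0, 68.0, 68.0, 69.0])
n = len(x)
alpha = 0.05

res = stats.linregress(x, y)                      # fits y = b0 + b1*x
t_stat = res.slope / res.stderr                   # t = beta1_hat / Se(beta1_hat)
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided p-value

# (1 - alpha)100% confidence interval for the slope
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
ci_low = res.slope - t_crit * res.stderr
ci_high = res.slope + t_crit * res.stderr

print(t_stat, p_value, (ci_low, ci_high))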
Correlation
Correlation analysis: deals with measuring the closeness of the relationship described by the regression equation.
❖ Correlation: measures the relative strength of the linear
relationship between two variables
❖ Unit-less
❖ Ranges between –1 and 1
❖ r = -1 implies perfect negative linear correlation between the
variables under consideration
❖ r = +1 implies perfect positive linear correlation between the
variables under consideration
❖ The closer to –1, the stronger the negative linear relationship
❖ The closer to 1, the stronger the positive linear relationship
❖ The closer to 0, the weaker any linear relationship
The correlation coefficient, r, between x and y is given by:
$r = \dfrac{Cov(x, y)}{\sqrt{Var(x)\,Var(y)}}$
$r = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$
$r = \dfrac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\left[n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right]\left[n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right]}}$
$r = \dfrac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}}$
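The computational formula above can be coded directly; the following Python sketch (numpy only) should agree with np.corrcoef(x, y)[0, 1] whenever x and y have non-zero variance:

import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient via the computational formula."""
    n = len(x)
    ss_xy = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    ss_xx = n * np.sum(x ** 2) - np.sum(x) ** 2
    ss_yy = n * np.sum(y ** 2) - np.sum(y) ** 2
    return ss_xy / np.sqrt(ss_xx * ss_yy)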
[Figure: scatter plots of data with various correlation coefficients]
Hypothesis Testing for Correlation
H0: ρ = 0 (no correlation between two variables)
HA: ρ ≠ 0 (correlation exists between two variables)
Test statistic: $t = r\sqrt{\dfrac{n-2}{1-r^2}}$
has a t distribution with (n − 2) degrees of freedom.
Conclusion: if p < 0.05, reject H0 and conclude that there is evidence of a correlation between the two variables.
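A sketch of this test in Python; scipy.stats.pearsonr already returns the two-sided p-value, and the manual t statistic is computed alongside for comparison (the data are hypothetical):

import numpy as np
from scipy import stats

# hypothetical data
x = np.array([62.0, 63.0, 64.0, 65.0, 67.0, 68.0])
y = np.array([66.0, 66.0, 65.0, 68.0, 68.0, 69.0])
n = len(x)

r, p = stats.pearsonr(x, y)                     # r and its two-sided p-value

# manual version: t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 df
t_stat = r * np.sqrt((n - 2) / (1 - r ** 2))
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(r, p, p_manual)    # p and p_manual agree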
Coefficient of Determination (R2)
o The coefficient of determination is the portion of the total
variation in the dependent variable that is explained by
variation in the independent variable
o It is an indicator of how well the model fits the data.
o Adding more predictors to a model never decreases R2, so a higher R2 alone does not mean a better model.
$R^2 = \dfrac{SSR}{SST} = 1 - \dfrac{SSE}{SST}, \qquad 0 \le R^2 \le 1$
The proportion of the total variation in the dependent variable (y) that is explained by changes in the independent variable (x), i.e. by the regression line, is equal to $R^2 \times 100\%$.
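A minimal Python sketch of R² from the sums of squares (numpy only; y holds the observed values and y_hat the fitted values from the regression line):

import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: R^2 = 1 - SSE/SST."""
    sse = np.sum((y - y_hat) ** 2)          # error (residual) sum of squares
    sst = np.sum((y - np.mean(y)) ** 2)     # total sum of squares
    return 1 - sse / sst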
Example: A researcher wants to find out if there is any relationship between the heights of sons and their fathers. He took a random sample of 6 fathers and their sons and recorded the heights in inches; the summary statistics are:
$\sum X_i = 392, \quad \sum X_i^2 = 25628, \quad \sum X_i Y_i = 26476, \quad \sum Y_i = 405, \quad \sum Y_i^2 = 27355$
A. Estimate the parameters $\beta_0$ and $\beta_1$
B. Fit a simple linear regression line and interpret the estimates
C. What would be the height of the son if his father's height is 70 inches?
D. Calculate the coefficient of correlation and interpret it
E. Calculate the coefficient of determination
A. $\hat{\beta}_1 = \dfrac{n\sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2} = \dfrac{6 \times 26476 - 392 \times 405}{6 \times 25628 - 392^2} = 0.92$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \dfrac{405}{6} - 0.92 \times \dfrac{392}{6} = 7.2$
B. The fitted regression line of Y on X is:
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 7.2 + 0.92x$
$\hat{\beta}_0 = 7.2$ is the estimated value of Y (son's height) when X = 0.
$\hat{\beta}_1 = 0.92$ indicates that for every one-inch increase in the father's height, the mean height of the son increases by 0.92 inches.
$\hat{\beta}_1 > 0$: there is a direct (positive) relationship between the father's height and the son's height.
C. $\hat{y} = 7.2 + 0.92x$; using the unrounded estimates, $\hat{y} = 7.19 + 0.923(70) \approx 71.8$, so the predicted height of the son is about 71.8 inches.
D. $r = \dfrac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\left[n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right]\left[n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right]}}$
$r = \dfrac{6 \times 26476 - 392 \times 405}{\sqrt{(6 \times 25628 - 392^2)(6 \times 27355 - 405^2)}} = 0.92$
There is a strong positive correlation between the height of the father and the height of the son.
E.
$r^2 = 0.92^2 \approx 0.846 = 84.6\%$
About 84.6% of the variation in the height of the son is explained by variation in the height of the father.
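As a check, the whole worked example can be reproduced from the given summary statistics alone; a short Python verification (numpy only):

import numpy as np

n = 6
sum_x, sum_x2 = 392, 25628
sum_y, sum_y2 = 405, 27355
sum_xy = 26476

beta1_hat = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)   # about 0.92
beta0_hat = sum_y / n - beta1_hat * (sum_x / n)                        # about 7.2

y_at_70 = beta0_hat + beta1_hat * 70                                   # about 71.8

r = (n * sum_xy - sum_x * sum_y) / np.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))             # about 0.92
r2 = r ** 2                                                            # about 0.84

print(beta1_hat, beta0_hat, y_at_70, r, r2)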