UNIVERSITY OF DODOMA
COLLEGE OF HEALTH AND ALLIED SCIENCES
SCHOOL OF PUBLIC HEALTH
Topic 5: Correlation and Regression Analysis
Instructor: C. Mbotwa
Relationship Between two Variables
Introduction
So far we have studied problems relating to one variable only. In practice we come across a large
number of problems involving two or more variables. If two quantities vary in such a way that
movements in one are accompanied by movements in the other, the quantities are said to be correlated.
The study of this relationship is called BIVARIATE ANALYSIS. More formally, if for every
measurement of a variable X we know a corresponding value of a second variable Y, the resulting
set of pairs of values is called a BIVARIATE POPULATION and the data used are called
BIVARIATE DATA. In other words, data which involve two variables are referred to
as BIVARIATE DATA.
Examples,
i. In health studies of populations, it is common to obtain variables such as height and weight.
ii. Economic studies may be interested in, among other things, personal income and years of
education, personal income and private consumption.
iii. Most university admissions committees ask for an applicant’s high school grade point
average and standardized admission test scores.
Scatter Diagrams
The simplest device for determining a relationship between two variables is a special type of dot
chart called a scatter diagram. The method is so called because it indicates the scatter of the
various points. When this method is used, the given bivariate data are plotted on a graph in the form
of dots, i.e., for each pair of X and Y values we put a dot, and we thus obtain as many points as we
have observations. To plot the data, the dependent variable Y is always plotted on the Y-axis
(vertical axis) and the independent variable X on the X-axis (horizontal axis).
By looking at the scatter of the various points we can form an idea as to whether the variables are
related or not. The more the plotted points “scatter” over a chart the less relationship there is
between the two variables. The more nearly the points come to falling on a line, the higher the
degree of a linear relationship. If all points lie on the straight line rising from the lower left-hand
corner to the upper right-hand corner the relationship is said to be PERFECTLY POSITIVE (Figure
5.2). On the other hand, if all points are lying on a straight line falling from the upper left-hand
corner to the lower right-hand corner of the diagram, the relationship is said to be PERFECTLY
NEGATIVE (Figure 5.4).
If the plotted points fall in a narrow band there would be a high degree of relationship between the
variables. The relationship shall be positive if the points show a RISING TENDENCY from the
lower left-hand corner to the upper right-hand corner (Figure 5.1). Conversely, the relationship will
be negative if the points show a DECLINING TENDENCY from the upper left-hand corner to the
lower right-hand corner of the diagram (Figure 5.3).
Correlation Analysis
Correlation refers to the extent of a linear relationship between two or more variables. If there is a
close linear relationship between the two variables, the variables are said to be highly correlated; if
there is no linear relationship between the two variables, the variables are said to be uncorrelated.
Thus, correlation analysis refers to the technique used in measuring the closeness of the linear
relationship between variables.
Correlation Coefficient
The strength of the linear relationship between two variables is popularly measured by what is
known as the PEARSON PRODUCT-MOMENT COEFFICIENT OF CORRELATION. The sample
correlation coefficient is computed as:

r = SSxy / √(SSxx · SSyy)

where SSxy = ΣXY - (ΣX)(ΣY)/n, SSxx = ΣX² - (ΣX)²/n and SSyy = ΣY² - (ΣY)²/n.

The value of the coefficient of correlation obtained by the above formula lies between -1 and +1.
When r = +1, it means there is a PERFECT POSITIVE linear relationship between the variables.
When r = -1, it means there is a PERFECT NEGATIVE linear relationship between the variables.
When r = 0, it means there is NO LINEAR RELATIONSHIP between the variables. However, in
practice, such values of r as ±1 and 0 are rare. We normally get non-zero values which lie
between -1 and +1.
The coefficient of correlation describes not only the magnitude of correlation but also its
direction. Thus, r = +0.95 and r = -0.95 have the same magnitude of 0.95, but the first coefficient
describes a positive correlation whereas the second describes a negative correlation.
Properties of the Coefficient of Correlation
The following are the important properties of the correlation coefficient, r.
1. The coefficient of correlation lies between -1 and +1. In symbols we write -1 ≤ r ≤ +1, or
|r| ≤ 1.
2. The coefficient of correlation is independent of change of the scale and origin of the
variables X and Y.
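The correlation coefficient can be computed directly from the sums ΣX, ΣY, ΣXY, ΣX² and ΣY². Below is a minimal sketch in plain Python; the helper name pearson_r is illustrative, not part of any particular library:

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient, computed from
    the shortcut sums SSxy, SSxx and SSyy."""
    n = len(x)
    ssxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    ssxx = sum(a * a for a in x) - sum(x) ** 2 / n
    ssyy = sum(b * b for b in y) - sum(y) ** 2 / n
    return ssxy / math.sqrt(ssxx * ssyy)

# A perfectly positive linear relationship gives r = +1,
# and a perfectly negative one gives r = -1.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0
print(pearson_r([1, 2, 3], [3, 2, 1]))         # -1.0
```

Note that r is undefined (division by zero) when either variable is constant, since then SSxx or SSyy is zero.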
[Figures omitted: scatter diagrams illustrating the relationships described above.]
Figure 5.1: Positive linear relationship (r = 0.98)
Figure 5.2: Perfectly positive linear relationship (r = 1)
Figure 5.3: Negative linear relationship
Figure 5.4: Perfectly negative linear relationship (r = -1)
Figure 5.5: No linear relationship (r = 0.02)
Regression Analysis
The purpose of linear regression is to develop a mathematical relationship (model) between the
variables that can be used to estimate the value of one variable when the value of the other variable
is known. The relationship that is developed has the form of a straight line, and that is why it is
called linear regression. Linear regression is further classified into two types, i.e., simple linear
regression and multiple linear regression.
Simple Linear Regression
In simple linear regression, we develop a linear relationship between one dependent variable and
one independent (explanatory) variable. The relationship we need to establish in simple linear
regression has the form:

Y = β0 + β1X + ε

Where:
Y is the dependent variable.
β0 is the intercept on the Y-axis (constant).
β1 is the gradient (slope) of the relationship, or the coefficient of the independent variable.
X is the independent (explanatory or predictor) variable.
ε is the random error in Y.
Since we cannot fit the true line exactly, as is the case in inferential statistics, we estimate the
relationship by:

Ŷ = a + bX

Where:
a is the estimate of β0
b is the estimate of β1
In order to establish the relationship, we need to find the values of a and b. By using the method of
least squares, the values of a and b can be shown to be:

b = SSxy / SSxx = (ΣXY - (ΣX)(ΣY)/n) / (ΣX² - (ΣX)²/n)
a = ȳ - b·x̄
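The least squares estimates can be turned into a short computation. A minimal sketch in plain Python (the helper name least_squares and the sample data are illustrative only):

```python
def least_squares(x, y):
    """Least squares estimates for the line Y-hat = a + b*X:
    b = SSxy / SSxx and a = y-bar - b * x-bar."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    ssxy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar
    ssxx = sum(xi * xi for xi in x) - n * xbar ** 2
    b = ssxy / ssxx
    a = ybar - b * xbar
    return a, b

# Points lying exactly on Y = 2X give a = 0 and b = 2.
print(least_squares([1, 2, 3], [2, 4, 6]))   # (0.0, 2.0)
```

The fitted line always passes through the point (x̄, ȳ), which is exactly what the formula a = ȳ - b·x̄ enforces.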
Interpretation of the Regression Coefficients
“a” and “b” are called regression coefficients and have the following interpretation:
a (Y-intercept) shows the value of the dependent variable Y when X = 0, i.e., the value Y takes
without any impact of X. If the slope b is positive, this is the minimum value Y can take; if the
slope is negative, the intercept shows the maximum value that Y can attain when there is no impact of X.
On the other hand, b (slope) has two interpretations. First, it shows the direction of the relationship.
If its value is positive, then we say that there is a positive relationship between the regressed
variables. Conversely, if the value of the slope is negative, then we understand that the two variables
are negatively related. The second interpretation is that it shows the amount by which Y will change
when X increases by one unit.
Fitting the Least Squares Line or Line of Best Fit
Consider a set of n data points identified by corresponding values of X and Y, say
(x1, y1), (x2, y2), ..., (xn, yn).
The straight-line model for the response Y in terms of X is Y = β0 + β1X + ε.
The line that gives the mean (or expected) value of Y for a given value of X is E(Y) = β0 + β1X, and
the fitted line, which we hope to find, is represented as Ŷ = a + bX,
where Ŷ is the estimator of the mean value of Y and a predictor of some future value of Y, and
a and b are estimators of β0 and β1 respectively.
For a given data point, say (xi, yi), the observed value of Y is yi and the predicted value of Y would
be obtained by substituting xi into the prediction equation: ŷi = a + b·xi.
The deviation of the ith value of Y from its predicted value is (yi - ŷi).
Then the sum of squares of the deviations of the Y values about their predicted values for all of the
data points is:

SSE = Σ(yi - ŷi)²

The quantities a and b that make the SSE a minimum are called the least squares estimates of the
population parameters β0 and β1, and the prediction equation Ŷ = a + bX is called the least squares line.
Definition: The least squares line is the one that has a smaller SSE than any other straight-line model.
The values of a and b that minimize the SSE are derived by taking partial derivatives of SSE with
respect to a and b, whereby we get two equations known as the normal equations.
Example 5.1
Suppose an experiment involving five subjects is conducted to determine the relationship between
the percentage of a certain drug in the bloodstream and the length of time it takes to react to a
stimulus. The results are shown below:
Subject Amount of Drug X (%) Reaction time Y (Minutes)
1 1 1
2 2 1
3 3 2
4 4 2
5 5 4
a) Plot these points in an X-Y plane (Scatter diagram).
b) Determine the Pearson Correlation coefficient of amount of drug and reaction time and
interpret it.
c) Determine the least square line and interpret the slope.
d) Determine the value of SSE
e) Estimate the time it takes to react to a stimulus if 4.7% of the drug is used.
Solution
a) The scatter diagram of the data is shown below.

[Scatter plot omitted: reaction time (min) against amount of drug (%).]
For the next parts, preliminary computations are done as shown below.

X        Y        X²       Y²       XY
1        1        1        1        1
2        1        4        1        2
3        2        9        4        6
4        2        16       4        8
5        4        25       16       20
Totals:  15       10       55       26       37

x̄ = 15/5 = 3 and ȳ = 10/5 = 2
b) The correlation coefficient is

r = SSxy / √(SSxx · SSyy) = 7 / √(10 × 6) = 0.90

where SSxy = 37 - (15)(10)/5 = 7, SSxx = 55 - 15²/5 = 10 and SSyy = 26 - 10²/5 = 6.

Therefore, the Pearson correlation coefficient of amount of drug and reaction time is 0.90. This
shows that amount of drug and reaction time have a strong positive linear relationship.
c) The least squares line
The slope of the least squares line is

b = SSxy / SSxx = 7 / 10 = 0.7

And the y-intercept is

a = ȳ - b·x̄ = 2 - 0.7 × 3 = -0.1

Thus, the least squares line is

Ŷ = -0.1 + 0.7X, or Ŷ = 0.7X - 0.1
Interpretation of the Slope
The slope shows that there is a positive linear relationship between reaction time and amount of
drug. That is as amount of drug increases by 1%, reaction time will increase by 0.7 minutes (42
seconds).
d) Determination of the SSE
The observed and predicted values of Y, the deviations of the Y values about their predicted values,
and the squares of these deviations are shown below:

X        Y        Ŷ        Y - Ŷ       (Y - Ŷ)²
1        1        0.6      0.4         0.16
2        1        1.3      -0.3        0.09
3        2        2.0      0           0.00
4        2        2.7      -0.7        0.49
5        4        3.4      0.6         0.36
                           Σ = 0       SSE = 1.10
e) Estimate the time it takes to react to a stimulus if 4.7% of the drug is used.
This is given by Ŷ = -0.1 + 0.7 × 4.7 = 3.19 minutes.
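The hand computations in Example 5.1 can be checked with a few lines of Python. This is a sketch; the variable names are illustrative:

```python
import math

x = [1, 2, 3, 4, 5]   # amount of drug (%)
y = [1, 1, 2, 2, 4]   # reaction time (minutes)
n = len(x)

# Shortcut sums of squares
ssxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n   # 7
ssxx = sum(a * a for a in x) - sum(x) ** 2 / n                  # 10
ssyy = sum(b * b for b in y) - sum(y) ** 2 / n                  # 6

r = ssxy / math.sqrt(ssxx * ssyy)     # about 0.90
b = ssxy / ssxx                       # 0.7
a = sum(y) / n - b * sum(x) / n       # -0.1
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))     # about 1.10
pred = a + b * 4.7                    # 3.19 minutes
```

Every quantity agrees with the worked solution above, which is a useful sanity check when doing these computations by hand.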
Model Assumptions
Regression analysis requires us to specify the probability distribution of the random error ε. We will
make four basic assumptions about the general form of this probability distribution.
Assumption 1
The mean of the probability distribution of ε is zero, i.e., E(ε) = 0. This assumption implies that the
mean value of Y for a given value of X is E(Y) = β0 + β1X.
Assumption 2
The variance of the probability distribution of ε is constant for all settings of the independent
variable X, i.e., Var(ε) = σ² for all values of X.
Assumption 3
The probability distribution of ε is normal, i.e., ε ~ N(0, σ²).
Assumption 4
The errors associated with any two different observations are independent, i.e., Cov(εi, εj) = 0 for i ≠ j.
That is, the errors associated with one value of Y have no effect on errors associated with other Y values.
Differences between Correlation and Regression Analysis
1. Whereas correlation coefficient is the measure of the strength of the linear relationship, the objective of
regression analysis is to study the nature of the relationship between the variables so that we may be
able to predict the value of one variable on the basis of another. Conventionally, the variable which is
the basis of prediction is called the independent or explanatory variable and the variable that is to be
predicted is referred to as the dependent variable. The choice of dependent and independent variables is
a crucial one in regression analysis.
2. The cause and effect relationship is more clearly indicated through regression analysis than by
correlation. Correlation is merely a tool for ascertaining the degree of relationship between two
variables, and therefore we cannot say that one variable is the cause and the other the effect.
Multiple Linear Regression
Multiple linear regression (MLR) is an extension of simple linear regression. In Simple Linear
Regression we considered a single dependent variable, Y, and a single independent variable X.
MLR is used when there are two or more independent variables, where the model using population
information is:

Y = β0 + β1X1 + β2X2 + ... + βkXk + ε
Examples:
i. The height of a child can depend on the height of the mother, the height of the father,
nutrition, and environmental factors.
ii. The level of blood pressure may be influenced by the weight of an individual, the amount of salt
in the diet, alcohol consumption, age, stress, etc.
A multiple linear regression model with k predictor variables X1, X2, ..., Xk and a response Y can
be written as:

Y = β0 + β1X1 + β2X2 + ... + βkXk + ε
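The coefficients β0, β1, ..., βk can be estimated by least squares, i.e., by solving the normal equations (XᵀX)b = Xᵀy. Below is a minimal pure-Python sketch (the helper name fit_mlr and the sample data are hypothetical, used only to illustrate the technique):

```python
def fit_mlr(rows, y):
    """Least squares fit of y = b0 + b1*x1 + ... + bk*xk.
    Builds the normal equations (X'X) b = X'y and solves them with
    Gaussian elimination and partial pivoting."""
    n = len(rows)
    X = [[1.0] + list(r) for r in rows]        # prepend the intercept column
    p = len(X[0])
    A = [[sum(X[i][a] * X[i][c] for i in range(n)) for c in range(p)]
         for a in range(p)]                                        # X'X
    v = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]  # X'y
    for col in range(p):                       # forward elimination
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        v[col], v[piv] = v[piv], v[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= f * A[col][c]
            v[r] -= f * v[col]
    b = [0.0] * p                              # back substitution
    for r in range(p - 1, -1, -1):
        b[r] = (v[r] - sum(A[r][c] * b[c] for c in range(r + 1, p))) / A[r][r]
    return b

# Data generated exactly from y = 1 + 2*x1 + 3*x2, so the fit recovers it.
coefs = fit_mlr([(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)], [1, 3, 4, 6, 8])
print([round(c, 6) for c in coefs])   # [1.0, 2.0, 3.0]
```

In practice one would use a statistical package rather than solving the normal equations by hand, but this makes the least squares machinery behind MLR explicit.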
Classical Linear Regression Assumptions
i. There should be no multicollinearity problem.
That means the correlations among the independent variables should be low (say rij < 0.3).
ii. The variance of the error terms should be equal and constant (homoscedasticity).
iii. There should be no autocorrelation.
The error terms should not be correlated with each other, i.e., Cov(εi, εj) = 0 for i ≠ j.
iv. Each of the error terms should be normally distributed. That is, εi ~ N(0, σ²).
Example 5.2:
Consider the following regression of baby weight (grams) on maternal height (cm), maternal weight
(kg) and maternal age at birth (years), whose fitted slope coefficients are:

Ŷ = β̂0 - 96.82X1 + 9.05X2 + 68.39X3 (the estimated intercept β̂0 is not reported)

Where:
X1 : Maternal height
X2 : Maternal weight
X3 : Age of mother
Y : Baby weight
Interpretations
- The above results show that baby weight is negatively related to maternal height and
positively related to the other explanatory variables (maternal weight and age of the mother).
- If height of the mother increases by one centimetre, the weight of the baby will decrease by
96.82 grams.
- If the weight of the mother increases by one kilogram, the weight of the baby will increase by
9.05 grams.
- If the age of the mother increases by one year, the weight of the baby will increase by 68.39
grams
Coefficient of Multiple Determination (R2)
R² is the proportion of the variation in the dependent variable Y that is "explained" by the k
independent variables. The adjusted R² is used to compare models with different sets of independent
variables in terms of predictive capability; it penalizes models with unnecessary or redundant
predictors.
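Both quantities can be computed from the fitted values. Below is a plain-Python sketch (the helper name r_squared is illustrative), applied to the fitted values from Example 5.1, where SSE = 1.10 and SST = 6:

```python
def r_squared(y, yhat, k):
    """R^2 = 1 - SSE/SST, and adjusted R^2, which penalizes extra
    predictors: 1 - (1 - R^2)(n - 1)/(n - k - 1), with k predictors."""
    n = len(y)
    ybar = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, yhat))
    sst = sum((yi - ybar) ** 2 for yi in y)
    r2 = 1 - sse / sst
    adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return r2, adj

# Observed and fitted values from Example 5.1 (one predictor, k = 1).
r2, adj = r_squared([1, 1, 2, 2, 4], [0.6, 1.3, 2.0, 2.7, 3.4], k=1)
print(round(r2, 4), round(adj, 4))   # 0.8167 0.7556
```

Note that in simple linear regression R² equals the square of the Pearson correlation coefficient: 0.90² ≈ 0.82, consistent with Example 5.1.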
REVIEW QUESTIONS
1. The table below shows the average rate of growth of GDP, g, and employment, e, for 25
countries for the period of 1988-1997.
Country   Employment (e)   GDP (g)   Country   Employment (e)   GDP (g)
1 1.68 3.04 14 2.57 7.73
2 0.65 2.55 15 3.02 5.64
3 0.34 2.16 16 1.88 2.86
4 1.17 2.03 17 0.91 2.01
5 0.02 2.02 18 0.36 2.98
6 -1.06 1.78 19 0.33 2.79
7 0.28 2.08 20 0.89 2.60
8 0.08 2.71 21 -0.94 1.17
9 0.87 2.08 22 0.79 1.15
10 -0.13 1.54 23 2.02 4.18
11 2.16 6.40 24 0.66 1.97
12 -0.30 1.68 25 1.53 2.46
13 1.06 2.81
i. Find the Pearson correlation coefficient of Employment and GDP and interpret it.
ii. Estimate the linear regression of “g” on “e”, and interpret the coefficients.
iii. What will be the value of GDP when employment rate is 2.00?
2. The linear regression equation obtained after regressing the weight (kg) of a respondent against
her height (cm) and the hours spent on physical exercise per week is given below. Interpret the
coefficients.
.
3. The table below shows years of schooling, S, and hourly earnings in 1994, in dollars, Y, for a
subset of 20 respondents from the United States National Longitudinal Survey of Youth.
Observation S Y Observation S Y
1 15 17.24 11 17 15.38
2 16 15.00 12 12 12.70
3 8 14.91 13 12 26.00
4 6 4.50 14 9 7.50
5 15 18.00 15 15 5.00
6 12 6.29 16 12 21.63
7 12 19.23 17 16 12.10
8 18 18.69 18 12 5.55
9 12 7.21 19 12 7.50
10 20 42.06 20 14 8.00
a) Find the correlation coefficient.
b) Fit the regression line of Y on S and interpret the coefficients.
4. A public health scientist exploring the relationship between family size and food expenditure
randomly selected six women on a street. Each selected woman was asked how many children
under the age of 18 years lived with her, and the average number of litres of milk consumed
weekly by her household. The data resulting from this inquiry are shown below.
Number of children under 18 years Weekly milk expenditure (litres)
2 14
4 20
2 9
6 25
3 16
1 14
i. Construct a scatter plot for the given data.
ii. Determine the least square line that exists between the two variables.
iii. Compute and interpret the coefficient of correlation between the variables
5. Given below is level of education (X) attained and number of children born (Y) by a sample
of six women:
X 0 4 8 12 14 17
Y 8 7 5 4 3 2
i. Compute a sample correlation coefficient of X and Y
ii. Estimate a linear regression of Y on X.
iii. Estimate the expected number of children in a family of a mother who spent 16 years
at school.
6. A study carried out in Iringa last year revealed the following data concerning family size and
years spent at school by the family head for ten respondents. The data are provided in the table
below:
Family Size   Years at School
10 14
5 12
8 8
25 0
22 2
19 7
16 16
14 10
9 20
6 6
Fit the linear regression of family size on years at school.