566 Chapter 10 Correlation and Regression
Extending the Concepts
28. One of the formulas for computing r is

\[ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)(s_x)(s_y)} \]

Using the data in Exercise 27, compute r with this formula. Compare the results.

29. Compute r for the data set shown. Explain the reason for this value of r. Now, interchange
the values of x and y and compute r again. Compare this value with the previous one.
Explain the results of the comparison.

x    1    2    3    4    5
y    3    5    7    9    11

30. Compute r for the following data and test the hypothesis H0: ρ = 0. Draw the scatter plot;
then explain the results.

x    −3    −2    −1    0    1    2    3
y     9     4     1    0    1    4    9
10–2 Regression
OBJECTIVE 4
Compute the equation of the regression line.

In studying relationships between two variables, collect the data and then construct a scatter
plot. The purpose of the scatter plot, as indicated previously, is to determine the nature
of the relationship between the variables. The possibilities include a positive linear
relationship, a negative linear relationship, a curvilinear relationship, or no discernible
relationship. After the scatter plot is drawn and a linear relationship is determined, the next
steps are to compute the value of the correlation coefficient and to test the significance of
the relationship. If the value of the correlation coefficient is significant, the next step is to
determine the equation of the regression line, which is the data's line of best fit. (Note:
Determining the regression line when r is not significant and then making predictions
using the regression line are meaningless.) The purpose of the regression line is to enable
the researcher to see the trend and make predictions on the basis of the data.
Line of Best Fit
Figure 10–11 shows a scatter plot for the data of two variables. It shows that several lines
can be drawn on the graph near the points. Given a scatter plot, you must be able to draw
the line of best fit. Best fit means that the sum of the squares of the vertical distances from
each point to the line is at a minimum.
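The idea of best fit can be illustrated with a short computation. The following Python sketch compares three candidate lines by the sum of the squares of the vertical distances. The data values and the three candidate lines are made up for illustration only; they do not come from the figures in this section.

```python
def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared vertical distances from each point to y' = intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, made up for this illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Three hypothetical candidate lines, given as (intercept, slope).
candidates = {
    "line A": (2.2, 0.6),
    "line B": (1.0, 1.0),
    "line C": (3.0, 0.3),
}

# The line with the smallest sum of squares fits these points best.
sse = {name: sum_squared_residuals(xs, ys, a, b) for name, (a, b) in candidates.items()}
best = min(sse, key=sse.get)

for name, value in sse.items():
    print(f"{name}: sum of squared distances = {value:.2f}")
print("Best of the three candidates:", best)
```

Among these three candidates, the one with the smallest sum is preferred; the line of best fit is the line that gives the smallest such sum among all possible lines.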
FIGURE 10–11 Scatter Plot with Three Lines Fit to the Data

FIGURE 10–12 Line of Best Fit for a Set of Data Points (the distances d1 through d7 are the
vertical distances between each observed value and the predicted value on the line)
The difference between the actual value y and the predicted value y′ (that is, the vertical
distance) is called a residual or a predicted error. Residuals are used to determine
the line that best describes the relationship between the two variables.
The method used for making the residuals as small as possible is called the method of
least squares. As a result of this method, the regression line is also called the least squares
regression line.
The reason you need a line of best fit is that the values of y will be predicted from the
values of x; hence, the closer the points are to the line, the better the fit and the prediction
will be. See Figure 10–12. When r is positive, the line slopes upward and to the right.
When r is negative, the line slopes downward from left to right.

Historical Notes
Francis Galton drew the line of best fit visually. An assistant of Karl Pearson's named G. Yule
devised the mathematical solution using the least-squares method, employing a mathematical
technique developed by Adrien-Marie Legendre about 100 years earlier.

Determination of the Regression Line Equation
In algebra, the equation of a line is usually given as y = mx + b, where m is the slope of
the line and b is the y intercept. (Students who need an algebraic review of the properties
of a line should refer to the online resources before studying this section.) In statistics,
the equation of the regression line is written as y′ = a + bx, where a is the y intercept
and b is the slope of the line. See Figure 10–13.
There are several methods for finding the equation of the regression line. Two formu-
las are given here. These formulas use the same values that are used in computing the
value of the correlation coefficient. The mathematical development of these formulas is
beyond the scope of this book.
FIGURE 10–13 A Line as Represented in Algebra and in Statistics
(a) Algebra of a line: y = mx + b. Example: y = 0.5x + 5, with y intercept 5 and
slope m = Δy/Δx = 2/4 = 0.5.
(b) Statistical notation for a regression line: y′ = a + bx. Example: y′ = 5 + 0.5x, with
y′ intercept 5 and slope b = Δy′/Δx = 2/4 = 0.5.
Formulas for the Regression Line y′ = a + bx

\[ a = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n(\sum x^2) - (\sum x)^2} \]

\[ b = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2} \]

where a is the y intercept and b is the slope of the line.
Rounding Rule for the Intercept and Slope: Round the values of a and b to three
decimal places.
The steps for finding the regression line equation are summarized in this Procedure
Table:
Procedure Table
Finding the Regression Line Equation
Step 1 Make a table, as shown in step 2.
Step 2 Find the values of xy, x², and y². Place them in the appropriate columns and sum
each column.

x      y      xy      x²      y²
.      .      .       .       .
.      .      .       .       .
Σx     Σy     Σxy     Σx²     Σy²

Step 3 When r is significant, substitute in the formulas to find the values of a and b for the
regression line equation y′ = a + bx.

\[ a = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n(\sum x^2) - (\sum x)^2} \qquad b = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2} \]
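The three steps of the Procedure Table can be sketched in Python as follows. This is only an illustration: it uses the small data set from Exercise 29 (x = 1, 2, 3, 4, 5 and y = 3, 5, 7, 9, 11), the variable names are arbitrary, and it assumes the significance of r has already been established, since the sketch does not test it.

```python
# Data from Exercise 29; any paired data could be substituted.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

# Steps 1 and 2: one row per data pair with columns x, y, xy, x^2, y^2,
# followed by the sum of each column.
rows = [(x, y, x * y, x ** 2, y ** 2) for x, y in zip(xs, ys)]
sum_x, sum_y, sum_xy, sum_x2, sum_y2 = (sum(column) for column in zip(*rows))
n = len(rows)

print(f"{'x':>5}{'y':>5}{'xy':>6}{'x^2':>6}{'y^2':>6}")
for row in rows:
    print(f"{row[0]:>5}{row[1]:>5}{row[2]:>6}{row[3]:>6}{row[4]:>6}")
print(f"{sum_x:>5}{sum_y:>5}{sum_xy:>6}{sum_x2:>6}{sum_y2:>6}   (column sums)")

# Step 3: substitute the sums in the formulas for a and b.
denominator = n * sum_x2 - sum_x ** 2
a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
b = (n * sum_xy - sum_x * sum_y) / denominator

print(f"y' = {a:.3f} + {b:.3f}x")
```

The y² column is not needed for a and b; it is kept in the table because the same sums are used in computing r.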
EXAMPLE 10–9 Car Rental Companies
Find the equation of the regression line for the data in Example 10–4, and graph the line
on the scatter plot of the data.
SOLUTION
The values needed for the equation are n = 6, Σx = 153.8, Σy = 18.7, Σxy = 682.77,
and Σx² = 5859.26. Substituting in the formulas, you get

\[ a = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n(\sum x^2) - (\sum x)^2} = \frac{(18.7)(5859.26) - (153.8)(682.77)}{(6)(5859.26) - (153.8)^2} = 0.396 \]

\[ b = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2} = \frac{(6)(682.77) - (153.8)(18.7)}{(6)(5859.26) - (153.8)^2} = 0.106 \]

Hence, the equation of the regression line y′ = a + bx is
y′ = 0.396 + 0.106x
To graph the line, select any two points for x and find the corresponding values for y′.
Use any x values between 10 and 60. For example, let x = 15. Substitute in the equation
and find the corresponding y′ value.
y′ = 0.396 + 0.106x
   = 0.396 + 0.106(15)
   = 1.986
Let x = 40; then
y′ = 0.396 + 0.106x
   = 0.396 + 0.106(40)
   = 4.636
Then plot the two points (15, 1.986) and (40, 4.636) and draw a line connecting the two
points. See Figure 10–14.
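The arithmetic in Example 10–9 can be checked with a few lines of Python. The sketch below starts from the sums given in the solution (the raw data of Example 10–4 are not repeated here), applies the rounding rule, and then finds the two points used to graph the line. The function name predict is an arbitrary choice for this illustration.

```python
# Sums given in the solution to Example 10-9.
n = 6
sum_x, sum_y, sum_xy, sum_x2 = 153.8, 18.7, 682.77, 5859.26

# Both formulas share the same denominator.
denominator = n * sum_x2 - sum_x ** 2

# Rounding rule: round a and b to three decimal places.
a = round((sum_y * sum_x2 - sum_x * sum_xy) / denominator, 3)
b = round((n * sum_xy - sum_x * sum_y) / denominator, 3)


def predict(x):
    """Predicted value y' = a + bx, using the rounded coefficients."""
    return a + b * x


# Two points for graphing the line, chosen as in the example.
points = [(x, round(predict(x), 3)) for x in (15, 40)]

print(f"y' = {a} + {b}x")
print("Points to plot:", points)
```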
FIGURE 10–14 Regression Line for Example 10–9 (x axis: Cars (in 10,000s); y axis: Revenue
(billions); line shown: y′ = 0.396 + 0.106x)
Note: When you draw the regression line, it is sometimes necessary to truncate the
graph (see Chapter 2). This is done when the distance between the origin and the first
labeled coordinate on the x axis is not the same as the distance between the rest of the
labeled x coordinates or the distance between the origin and the first labeled y coordi-
nate is not the same as the distance between the other labeled y coordinates. When the x
axis or the y axis has been truncated, do not use the y intercept value to graph the line.
When you graph the regression line, always select x values between the smallest x data
value and the largest x data value.
Historical Note
In 1795, Adrien-Marie Legendre (1752–1833) measured the meridian arc on the earth's surface
from Barcelona, Spain, to Dunkirk, France. This measure was used as the basis for the measure
of the meter. Legendre developed the least-squares method around the year 1805.

EXAMPLE 10–10 Absences and Final Grades
Find the equation of the regression line for the data in Example 10–5, and graph the line
on the scatter plot.
SOLUTION
The values needed for the equation are n = 7, Σx = 57, Σy = 511, Σxy = 3745, and
Σx² = 579. Substituting in the formulas, you get

\[ a = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n(\sum x^2) - (\sum x)^2} = \frac{(511)(579) - (57)(3745)}{(7)(579) - (57)^2} = 102.493 \]

\[ b = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2} = \frac{(7)(3745) - (57)(511)}{(7)(579) - (57)^2} = -3.622 \]

Hence, the equation of the regression line y′ = a + bx is
y′ = 102.493 − 3.622x
The graph of the line is shown in Figure 10–15.
FIGURE 10–15 Regression Line for Example 10–10 (x axis: Number of absences; y axis: Final
grade; line shown: y′ = 102.493 − 3.622x)
The sign of the correlation coefficient and the sign of the slope of the regression line
will always be the same. That is, if r is positive, then b will be positive; if r is negative,
then b will be negative. The reason is that the numerators of the formulas are the same and
determine the signs of r and b, and the denominators are always positive. The regression
line will always pass through the point whose x coordinate is the mean of the x values and
whose y coordinate is the mean of the y values, that is, (x̄, ȳ).
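Both properties can be checked numerically. The following Python sketch uses the sums from Example 10–10 and keeps a and b unrounded, so that the check at the mean of the x values is not disturbed by the rounding rule. The variable names are arbitrary choices for this illustration.

```python
# Sums given in the solution to Example 10-10.
n = 7
sum_x, sum_y, sum_xy, sum_x2 = 57, 511, 3745, 579

denominator = n * sum_x2 - sum_x ** 2          # 804, a positive quantity
slope_numerator = n * sum_xy - sum_x * sum_y   # -2912, the numerator shared with r

a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
b = slope_numerator / denominator

# Means of the x values and the y values.
x_bar = sum_x / n
y_bar = sum_y / n

# The value predicted at x_bar should equal y_bar.
predicted_at_x_bar = a + b * x_bar

print(f"slope b = {b:.3f}")
print(f"mean of y = {y_bar:.3f}, predicted value at mean of x = {predicted_at_x_bar:.3f}")
```

Because the denominator is positive, the sign of b is the sign of the shared numerator, which is negative here, in agreement with the negative relationship between absences and final grades.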
The regression line can be used to make predictions for the dependent variable.
You should use these guidelines when you are making predictions.
1. The points of the scatter plot fit the linear regression line reasonably well.
2. The value of r is significant.
3. The value of a specific x is not much beyond the observed values (x values) in the
original data.
4. If r is not significant, then the best predicted value for a specific x value is the mean
of the y values in the original data.
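These guidelines can be expressed as a small decision routine. The Python sketch below is one possible reading, not a prescribed method: the function name, its parameters, and the allowed_margin setting are all hypothetical, and the result of the significance test is passed in as a true or false value rather than computed. Guideline 1 (the visual fit of the points) still has to be judged from the scatter plot.

```python
def predict_y(x, a, b, r_is_significant, x_min, x_max, y_mean, allowed_margin=0.0):
    """Point prediction that follows guidelines 2 through 4.

    x_min and x_max are the smallest and largest observed x values.
    allowed_margin is a hypothetical setting for how far beyond the
    observed x values a prediction is still accepted.
    """
    # Guideline 4: without a significant r, use the mean of the y values.
    if not r_is_significant:
        return y_mean
    # Guideline 3: refuse x values far outside the observed range.
    if x < x_min - allowed_margin or x > x_max + allowed_margin:
        raise ValueError("x is too far beyond the observed x values")
    # Guideline 2 is met, so use the regression line y' = a + bx.
    return a + b * x


# Illustration with the line y' = 1 + 2x, observed x values from 1 to 5,
# and a mean y value of 7 (the data of Exercise 29).
print(predict_y(4, 1.0, 2.0, True, 1, 5, 7.0))    # uses the regression line
print(predict_y(4, 1.0, 2.0, False, 1, 5, 7.0))   # falls back to the mean of y
```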
Assumptions for Valid Predictions in Regression
1. The sample is a random sample.
2. For any specific value of the independent variable x, the value of the dependent variable
y must be normally distributed about the regression line. See Figure 10–16(a).
3. The standard deviation of each of the dependent variables must be the same for each
value of the independent variable. See Figure 10–16(b).
FIGURE 10–16 Assumptions for Predictions
(a) Dependent variable y normally distributed about the regression line y′ = a + bx
(b) σ1 = σ2 = . . . = σn
In this book, the assumptions will be stated in the exercises; however, when encountering
statistics in other situations, you must check to see that these assumptions have been met
before proceeding.
The method for making predictions is shown in Example 10–11.
EXAMPLE 10–11 Car Rental Companies
Use the equation of the regression line to predict the revenue of a car rental agency that
has 200,000 automobiles.
SOLUTION
Since the x values are in 10,000s, divide 200,000 by 10,000 to get 20, and then substitute
20 for x in the equation.
y′ = 0.396 + 0.106x
   = 0.396 + 0.106(20)
   = 2.516
Hence, when a rental agency has 200,000 automobiles, its revenue will be approximately
$2.516 billion.
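The unit conversion is the step most easily overlooked in this kind of prediction. A minimal Python sketch of Example 10–11 follows; the constant and function names are arbitrary choices for this illustration, and the coefficients are the rounded values from Example 10–9.

```python
CARS_PER_UNIT = 10_000   # the x values are recorded in 10,000s of cars
A, B = 0.396, 0.106      # intercept and slope from Example 10-9


def predict_revenue_billions(number_of_cars):
    """Predicted revenue in billions of dollars for a given number of cars."""
    x = number_of_cars / CARS_PER_UNIT   # convert cars to the units of x
    return A + B * x


prediction = predict_revenue_billions(200_000)
print(f"Predicted revenue: ${prediction:.3f} billion")
```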
The value obtained in Example 10–11 is a point prediction, and with point predic-
tions, no degree of accuracy or confidence can be determined. More information on
prediction is given in Section 10–3.
The magnitude of the change in one variable when the other variable changes exactly
1 unit is called a marginal change. The value of slope b of the regression line equation
represents the marginal change. For example, in Example 10–9 the slope of the regres-
sion line is 0.106, which means for each additional increase of 10,000 cars, the value of y
changes 0.106 unit ($106 million) on average.
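The marginal change can be seen by predicting at two x values that are exactly 1 unit apart. The short Python sketch below does this for the regression line of Example 10–9; the choice of x = 20 and x = 21 is arbitrary, since any pair of x values 1 unit apart gives the same difference.

```python
A, B = 0.396, 0.106   # intercept and slope from Example 10-9


def predict(x):
    """Predicted revenue in billions of dollars; x is in 10,000s of cars."""
    return A + B * x


# Increase x by exactly 1 unit (10,000 cars) and compare the predictions.
change = predict(21) - predict(20)
change_in_millions = change * 1000   # 1 billion = 1000 million

print(f"Marginal change: {change:.3f} billion, or about ${change_in_millions:.0f} million")
```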
Extrapolation, or making predictions beyond the bounds of the data, must be interpreted
cautiously. For example, in 1979, some experts predicted that the United States
would run out of oil by the year 2003. This prediction was based on the current consumption
and on known oil reserves at that time. However, since then, the automobile industry
has produced many new fuel-efficient vehicles. Also, there are many as yet undiscovered
oil fields. Finally, science may someday discover a way to run a car on something as
unlikely but as common as peanut oil. In addition, the price of a gallon of gasoline was
predicted to reach $10 a few years later. Fortunately this has not come to pass. Remember
that when predictions are made, they are based on present conditions or on the premise
that present trends will continue. This assumption may or may not prove true in the future.

Interesting Fact
It is estimated that wearing a motorcycle helmet reduces the risk of a fatal accident by 30%.
A scatter plot should be checked for outliers. An outlier is a point that seems out of
place when compared with the other points (see Chapter 3). Some of these points can affect
the equation of the regression line. When this happens, the points are called
influential points or influential observations.
When a point on the scatter plot appears to be an outlier, it should be
checked to see if it is an influential point. An influential point tends to
“pull” the regression line toward the point itself. To check for an influen-
tial point, the regression line should be graphed with the point included in
the data set. Then a second regression line should be graphed that excludes
the point from the data set. If the position of the second line is changed
considerably, the point is said to be an influential point. Points that are out-
liers in the x direction tend to be influential points.
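The check described above amounts to fitting the line twice. The Python sketch below does so with the formulas for a and b from this section. The five base points are the data of Exercise 29, and the suspect point (10, 5) is made up for this illustration; the function name fit_line is also an arbitrary choice.

```python
def fit_line(points):
    """Return (a, b) for the least squares line y' = a + bx."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x ** 2 for x, _ in points)
    denominator = n * sum_x2 - sum_x ** 2
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
    b = (n * sum_xy - sum_x * sum_y) / denominator
    return a, b


base_points = [(1, 3), (2, 5), (3, 7), (4, 9), (5, 11)]
suspect_point = (10, 5)   # hypothetical outlier in the x direction

# First line includes the suspect point; second line excludes it.
with_point = fit_line(base_points + [suspect_point])
without_point = fit_line(base_points)

print(f"With the point:    y' = {with_point[0]:.3f} + {with_point[1]:.3f}x")
print(f"Without the point: y' = {without_point[0]:.3f} + {without_point[1]:.3f}x")
```

Here the slope and intercept change a great deal when the suspect point is left out, so under the check described above the point would be regarded as influential. How large a change counts as considerable remains a matter of judgment.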
Researchers should use their judgment as to whether to include influential
observations in the final analysis of the data. If the researcher feels that the
observation is not necessary, then it should be excluded so that it does not
influence the results of the study. However, if the researcher feels that it is
necessary, then he or she may want to obtain additional data values whose x values
are near the x value of the influential point and then include them in the study.