Linear Regression II

Linear regression finds the linear relationship between two variables by estimating the intercept and slope of the line that best fits the data. The intercept is the value of Y when X is 0, and the slope is the change in Y given a one-unit change in X. Residuals are the differences between actual Y values and predicted Y values from the regression line. The variance of the estimate quantifies the error in predictions as the average squared residual, with smaller variance indicating better fit. The coefficient of determination (r²) shows the proportion of variance in Y explained by the regression line.

Linear Regression II

Linear regression
• An estimate of the linear relationship
between two variables (X & Y) in
terms of the actual scale
– We find the equation for the line that
best fits the data
– This involves
• Finding the intercept– e.g., the value of Y
when X = 0
• Finding the slope– the change in Y given a
one point change in X
Equation of a line
• Y’ = a + bX
• The intercept (a) is the point at
which the line crosses the Y axis
– The value of Y when X = 0
• The slope (b) is the amount of
increase in Y given an increase in
one point of X
• Y’ means “predicted Y value”
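The prediction equation above can be sketched as a tiny function. The intercept and slope values in the usage line are made-up numbers for illustration, not from the lecture's example:

```python
def predict(a, b, x):
    """Predicted Y value (Y') on the line Y' = a + bX."""
    return a + b * x

# Hypothetical line with intercept a = 2.0 and slope b = 0.5:
print(predict(2.0, 0.5, 10))  # -> 7.0
```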
Making predictions
• We use the regression equation to
predict what Y will be given some value
of X
– E.g., how tall is someone who weighs 121
lbs?
• Last time, we focused on “perfect
predictions”
– Weight “perfectly predicted” height
because the correlation was one…
– That’s usually not the case in real life…
Making predictions
• One way to think about the
regression line is in terms of
“conditional” averages (means)
– Given some condition of X, what is the
mean of Y?
– So, given a GMAT score of 640, what is
the average income?
The line that “best fits”
• The method behind linear regression
involves finding the line that “best fits”
the data
– We won’t get into how this is computed in
this class
• Involves matrix algebra
– But conceptually, the goal is to find a line
that minimizes the total distance from all
the points
• Often called “Ordinary Least Squares” (OLS)
regression because you square the vertical
distance from each point to the line and find
the line with the smallest possible sum of
those squares– i.e., the “least” amount of
summed squares
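As a sketch of what "least squares" means in practice, here is the closed-form computation of the best-fitting slope and intercept; the toy data are made up for illustration:

```python
def ols_fit(xs, ys):
    """Fit Y' = a + bX by minimizing the sum of squared vertical distances."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope: co-deviation of X and Y over the squared deviation of X
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    # Intercept: the least-squares line passes through the point of means
    a = my - b * mx
    return a, b

# Perfectly linear toy data, Y = 2X, so we expect a = 0 and b = 2:
a, b = ols_fit([1, 2, 3], [2, 4, 6])
print(a, b)  # -> 0.0 2.0
```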
Residuals
• The regression line is what we would
predict for Y given some X…
– Regression equation gives us the straight
line that minimizes the error involved in
making predictions
• Residuals are what we call error
– Residuals are the differences between an
actual Y value and the predicted Y value
– The residual is Y – Y’
• The actual Y value minus the predicted Y value
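The residual definition Y − Y′ can be sketched directly; the line and data here are hypothetical:

```python
def residuals(a, b, xs, ys):
    """Residual Y - Y' for each data point: actual minus predicted."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Hypothetical line Y' = 1 + 2X; predictions are 1, 3, 5:
print(residuals(1, 2, [0, 1, 2], [1, 4, 4]))  # -> [0, 1, -1]
```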
Variance of the estimate
• We can quantify the amount of error
in the prediction by finding the
average of all of the squared
residuals
– This is the “variance of the estimate”
– E.g., how much do the points vary
around the line?

σ²estY = Σ(Y − Y′)² / N

The closer the points are to the line, the
smaller the variance of the estimate will be
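The variance of the estimate, the average squared residual, can be sketched as follows (the line and points are made-up illustrations):

```python
def variance_of_estimate(a, b, xs, ys):
    """Average squared residual around the line: sum((Y - Y')**2) / N."""
    n = len(xs)
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / n

# Hypothetical line Y' = 1 + 2X with residuals 0, 1, -1; mean square = 2/3:
print(round(variance_of_estimate(1, 2, [0, 1, 2], [1, 4, 4]), 4))  # -> 0.6667
```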
Variance of the Estimate
• When r=0 (no correlation), this
means the best fitting line is a
horizontal one…
– Same predicted Y for all values of X
• The line is doing nothing for us..
– The variance of the estimate is largest
in this case
• The variance of the predictions around the
regression line is just the variance of Y

When r = 0, Y′ is the mean of Y, so:

σ²estY = Σ(Y − Y′)² / N = Σ(Y − Ȳ)² / N = σ²Y
Variance of the estimate
• For a sample, we use N-2 in the
denominator to get an unbiased
estimate
– We lose two degrees of freedom (one
for the slope, one for the intercept)


s²estY = Σ(Y − Y′)² / (N − 2)
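The sample version of the formula, dividing by N − 2 rather than N, can be sketched with hypothetical points around the line Y′ = 1 + 2X:

```python
def variance_of_estimate_sample(a, b, xs, ys):
    """Unbiased sample estimate: summed squared residuals divided by N - 2."""
    n = len(xs)
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2)

# Four hypothetical points; residuals are 0, 1, -1, 1, so 3 / (4 - 2) = 1.5:
print(variance_of_estimate_sample(1, 2, [0, 1, 2, 3], [1, 4, 4, 8]))  # -> 1.5
```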
Explained vs. unexplained
variance
• The difference between the total
amount of variance in Y and the
variance of the estimate is the
amount of variance explained by the
regression line
• Explained variance = total variance −
unexplained variance
– Equivalently, total variance = unexplained
variance + explained variance
Coefficient of determination
• This is the “proportion of the total
variance that is explained (or
determined) by the predictor
variable”
• It is the (explained variance)/(total
variance)
– This equals r²
– It is the proportion of the variance in Y
that is accounted for by X
Coefficient of non-
determination
• This is simply the reverse—the
amount of variance in Y that X does
not account for
– An estimate of how much the points
don’t fall on the line
• It is the (unexplained
variance)/(total variance), or (1 − r²)
The variance of the
estimate
• Remember that the variance of the
estimate is the unexplained variance
• An easier way to compute the
variance of the estimate is to use the
coefficient of non-determination
σ²estY / σ²Y = 1 − r²

Becomes…

σ²estY = σ²Y (1 − r²)
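This identity can be checked numerically on made-up data: fit the least-squares line, then compare the variance of the estimate with σ²Y(1 − r²). All data values below are arbitrary illustrations:

```python
import math

def fit_and_check(xs, ys):
    """Return (variance of estimate, var_Y * (1 - r**2)); the two should match."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    var_y = sum((y - my) ** 2 for y in ys) / n
    var_est = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(var_y)
    r = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n * sx * sy)
    return var_est, var_y * (1 - r ** 2)

lhs, rhs = fit_and_check([1, 2, 3, 4], [2, 3, 5, 4])
print(abs(lhs - rhs) < 1e-9)  # -> True
```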
Example
• Relationship between age and verbal
comprehension
• We want to use age (in months) to
predict test scores on a verbal
comprehension test
Example
• In our sample of 100 kids from
grades 1-6, we have
• Mean age of 98.14 months (s = 21.0)
• Mean test score of 30.35 items correct
out of 50 (s = 7.25)
Why use regression?
• Our independent variable is age—
a continuous measure…
• We don’t have 2 groups to
compare, so we can’t use a t-test
• We want to look at how increases
in age relate to increases (or
decreases) in scores
Example
• In our sample of 100 kids from grades
1-6, we have
• Mean age of 98.14 months (s = 21.0)
• Mean test score of 30.35 items correct
out of 50 (s = 7.25)
• We find that the correlation between age
and test score in our sample is r = .72
• How can we make predictions for
verbal comprehension given an age?
1. Find the slope of the line
• X is age (the independent variable)
– Mx = 98.14, sx = 21.0
• Y is test score (the dependent
variable)
– My = 30.35, sy = 7.25
• r = .72

bYX = r (sY / sX) = (.72)(7.25 / 21.0) = .249
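The slope computation from the sample statistics above can be checked in a couple of lines:

```python
# Slope b = r * (s_Y / s_X), using the example's sample statistics
r, s_y, s_x = 0.72, 7.25, 21.0
b = r * (s_y / s_x)
print(round(b, 3))  # -> 0.249
```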
2. Find the intercept of the line
• X is age (the independent variable)
– Mx = 98.14, sx = 21.0
• Y is test score (the dependent
variable)
– My = 30.35, sy = 7.25
• b = .249

aYX = Ȳ − bYX X̄ = 30.35 − .249(98.14) = 5.91

For an age of 0 months (X = 0), we predict a
score of 5.91 on the test
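The intercept computation follows directly from the means and the slope found above:

```python
# Intercept a = mean(Y) - b * mean(X), using the example's values
b = 0.249
a = 30.35 - b * 98.14
print(round(a, 2))  # -> 5.91
```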
Making a prediction
• Y’ = a + bX
– a = 5.91
– b = .249
• Y’ means “predicted Y”
• A child is 10 years old (120 months)
– His predicted test result will be:
Y’ = 5.91 + .249(120) = 35.8 items
correct
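The prediction for the 120-month-old child can be verified by plugging into the equation:

```python
# Predicted score for a 120-month-old child: Y' = a + bX
a, b = 5.91, 0.249
print(round(a + b * 120, 1))  # -> 35.8
```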
Example
• In our sample of 100 kids from
grades 1-6, we have
• Mean age of 98.14 months (s = 21.0)
• Mean test score of 30.35 items correct
out of 50 (s = 7.25)
We predict a child at 120
months will get 35.8 items
correct
This child is older than the
average child in our sample,
so he does better than
average on the test
Interpreting: r vs. b
• b (the slope of the line) is the change
(amount of points) we predict in Y
based on a one point change in X
– For each month increase in age, test scores
go up .249 points
• r (the correlation) is the change (in
terms of standard deviations) we
predict in Y based on a one standard
deviation change in X
– For every one standard deviation increase
in age, test scores will increase by .72 of a
standard deviation
The residual
• Our equation is:
Test score = 5.91 + .249(age in months)
• We have a child who is 92 months old, and she
gets 40 questions correct
• We’d predict she would get
Y’ = 5.91 + .249(92) = 28.82 questions correct
• The residual is 40 − 28.82 = 11.18
– Positive because she did better than our
predicted value
• If another 92-month-old got 27 questions
correct, the residual would be 27 − 28.82 = −1.82
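The two residuals above can be reproduced with a small helper built from the example's fitted equation:

```python
def residual(actual_score, age_months):
    """Residual = actual Y minus predicted Y' = 5.91 + .249(age)."""
    return actual_score - (5.91 + 0.249 * age_months)

# The two 92-month-old children from the example:
print(round(residual(40, 92), 2))  # -> 11.18
print(round(residual(27, 92), 2))  # -> -1.82
```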
Example: Variance explained
• In our sample of 100 kids from
grades 1-6, we have
• Mean age of 98.14 months (s = 21.0)
• Mean test score of 30.35 items correct
out of 50 (s = 7.25)
• The total variance in test
scores is s² = 7.25² = 52.56
• How much is explained by
the regression line?
Unexplained variance
• If we went through each of our 100
data points, we could calculate the
residual– the value of Y we actually
got minus the value of Y we
predicted from the equation
– The sum of those squared deviations is
everything we didn’t explain

Σ(Y − Y′)²
Age (X)   Score (Y)   Predicted Score Y’ = 5.91 + .249(X)   Residual Y − Y’
92        25          5.91 + .249(92) = 28.82               25 − 28.82 = −3.82
100       30          5.91 + .249(100) = 30.81              30 − 30.81 = −.81
84        29          5.91 + .249(84) = 26.83               29 − 26.83 = 2.17
73        25          5.91 + .249(73) = 24.09               25 − 24.09 = .91
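The prediction-and-residual table can be regenerated from the fitted equation:

```python
# Reproduce the table: prediction and residual for each of the four children
rows = [(92, 25), (100, 30), (84, 29), (73, 25)]
for age, score in rows:
    pred = 5.91 + 0.249 * age
    print(age, score, round(pred, 2), round(score - pred, 2))
# -> 92 25 28.82 -3.82
# -> 100 30 30.81 -0.81
# -> 84 29 26.83 2.17
# -> 73 25 24.09 0.91
```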
Unexplained variance
– The average of those squared
deviations is the variance of the
estimate

σ²estY = Σ(Y − Y′)² / N
Explained variance
• The total variance is the variance of Y
• The unexplained variance is the
average squared deviation score
• Total variance = explained variance +
unexplained variance
– So all that’s left is what we explained by
the regression line
– Explained variance = total variance −
unexplained variance
Coefficient of determination
• We know from our example that the
correlation between age & test score
was .72
– We can compute the coefficient of
determination by squaring it
– r² = .72² = .52
• Age accounts for 52% of the
variance in test scores
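The squaring step is a one-liner:

```python
# Coefficient of determination from the example's correlation
r = 0.72
print(round(r ** 2, 2))  # -> 0.52
```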
Coefficient of non-
determination
• This is simply the reverse—the amount of
variance in Y that X does not account for
– An estimate of how much the points don’t fall
on the line
• It is the (unexplained
variance)/(total variance), or (1 − r²)
– So 1 − .72² = 1 − .52 = .48
• 48% of the variance in test scores is not
accounted for by age
– We cannot account for 48% of the variance in
test scores
• Next time: More regression & quiz
review
• Happy Spring!
