CORRELATION &
REGRESSION
                      Dr. Md Razib Alam
The Coefficient of Correlation, r
The Coefficient of Correlation (r) is a measure of the strength of the linear relationship between two variables.
• Also called Pearson's r and Pearson's product moment correlation coefficient.
• It requires interval or ratio-scaled data.
• It can range from -1.00 to 1.00.
• Values of -1.00 or 1.00 indicate perfect correlation.
• Values close to 0.0 indicate weak correlation.
• Negative values indicate an inverse relationship and positive values indicate a direct relationship.
Product Moment Correlation
Partial Correlation
Coefficient of Determination
The coefficient of determination (r^2) is the proportion of the total variation in the dependent variable (Y) that is explained or accounted for by the variation in the independent variable (X).
• It is the square of the coefficient of correlation.
• It ranges from 0 to 1.
• It does not give any information on the direction of the relationship between the variables.
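To make the link between r and r^2 concrete, here is a minimal Python sketch. The data values and the helper name `pearson_r` are hypothetical and chosen only for illustration; they do not come from the slides.

```python
import math

def pearson_r(x, y):
    """Pearson's product moment correlation coefficient for two equal-length samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of the deviations from the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Sums of squared deviations for each variable
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data in which Y falls as X rises (an inverse relationship)
x = [1, 2, 3, 4, 5]
y = [10, 8, 7, 4, 1]

r = pearson_r(x, y)
r_squared = r ** 2  # coefficient of determination

print(f"r = {r:.4f}, r^2 = {r_squared:.4f}")
```

In this example r is negative while r^2 is positive, which illustrates why r^2 alone says nothing about the direction of the relationship.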
Regression Analysis
Bivariate Regression
Formulate the Bivariate Regression Model
Multiple Regression
Multiple regression involves a single dependent variable and two or
more independent variables.
Examples:
• Can variation in sales be explained in terms of variation in advertising
  expenditures, prices, and level of distribution?
• Can variation in market shares be accounted for by the size of the sales
  force, advertising expenditures, and sales promotion budgets?
• Are consumers’ perceptions of quality determined by their perceptions
  of prices, brand image, and brand attributes?
The general multiple regression with k independent variables is given by:
         Y' = a + b1*X1 + b2*X2 + ... + bk*Xk
• a is the Y-intercept.
• X1 to Xk are the independent variables.
• Greek letters are used for a (α) and b (β) when denoting population parameters.
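As an illustration of how a and b1 to bk can be estimated, the following Python sketch fits the equation by ordinary least squares using the normal equations. The data are hypothetical and generated from known coefficients, and the helper names `solve_linear_system` and `fit_ols` are assumptions made for this example. In practice a statistics package would do this step.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [a | b]
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ols(xs, y):
    """Least-squares estimates [a, b1, ..., bk] for Y' = a + b1*X1 + ... + bk*Xk."""
    rows = [[1.0] + list(row) for row in xs]  # the leading 1 carries the intercept a
    p = len(rows[0])
    # Normal equations: (X'X) coef = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)

# Hypothetical data generated exactly from Y = 2 + 3*X1 - 1*X2
xs = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8), (6, 4)]
y = [2 + 3 * x1 - 1 * x2 for x1, x2 in xs]

a, b1, b2 = fit_ols(xs, y)
print(f"a = {a:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}")
```

Because the hypothetical Y values contain no noise, the fit recovers the generating coefficients. With real data the estimates would come with standard errors.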
List of Variables for Our Example
Output variable (Y): csat
Predictor variables (X): expense, percent, income, high, college
where,
csat = Mean composite SAT score
expense = Per pupil expenditures
percent = % of HS graduates taking the SAT
income = Median household income
high = % of adults with a HS diploma
college = % of adults with a college degree
. regress csat expense percent income high college
      Source         SS            df         MS         Number of obs       =           51
                                                         F(5, 45)            =        42.23
       Model    184663.309          5    36932.6617      Prob > F            =       0.0000
    Residual    39351.2012         45    874.471137      R-squared           =       0.8243
                                                         Adj R-squared       =       0.8048
       Total     224014.51         50     4480.2902      Root MSE            =       29.571
        csat        Coef.    Std. Err.        t       P>|t|      [95% Conf. Interval]
     expense     .0033528    .0044709       0.75      0.457      -.005652          .0123576
     percent    -2.618177    .2538491     -10.31      0.000     -3.129455         -2.106898
      income     .1055853    1.166094       0.09      0.928     -2.243048          2.454218
        high     1.630841     .992247       1.64      0.107      -.367647          3.629329
     college     2.030894    1.660118       1.22      0.228     -1.312756          5.374544
       _cons     851.5649    59.29228      14.36      0.000      732.1441          970.9857
ANOVA TABLE
• Source – This is the source of variance: Model, Residual, and Total. The Total variance is partitioned into the variance that can be explained by the independent variables (Model) and the variance that is not explained by the independent variables (Residual, sometimes called Error). The Sums of Squares for the Model and Residual add up to the Total Sum of Squares, reflecting this partition.
SS – These are the Sums of Squares associated with the three sources of variance: Total, Model, and Residual. They can be computed in many ways. Conceptually, they can be expressed as:
• SSTotal: the total variability around the mean, Σ(Y – Ybar)^2.
• SSResidual: the sum of squared errors in prediction, Σ(Y – Ypredicted)^2.
• SSModel: the improvement in prediction from using the predicted value of Y rather than just the mean of Y, that is, the sum of squared differences between the predicted value of Y and the mean of Y, Σ(Ypredicted – Ybar)^2. Equivalently, SSModel = SSTotal – SSResidual.
Note that SSTotal = SSModel + SSResidual; here, 184663.309 + 39351.2012 = 224014.51.
Note that SSModel / SSTotal = 184663.309 / 224014.51 = .8243, the value of R-squared. This is because R-squared is the proportion of the variance explained by the independent variables.
df – These are the degrees of freedom associated with the sources of variance. The total variance has N – 1 degrees of freedom. In this case there were N = 51 observations, so the DF for Total is 50. The Model degrees of freedom equal the number of estimated coefficients minus 1. The intercept is automatically included in the model (unless you explicitly omit it), so 6 coefficients are estimated: the intercept plus the 5 independent variables (expense, percent, income, high, and college). The Model therefore has 6 – 1 = 5 degrees of freedom, which is the same as the number of independent variables. The Residual degrees of freedom is the DF Total minus the DF Model: 50 – 5 = 45.
MS – These are the Mean Squares, the Sums of Squares divided by their respective DF. For the Model, 184663.309 / 5 = 36932.6617. For the Residual, 39351.2012 / 45 = 874.471137. They are used to compute the F ratio, the Mean Square Model divided by the Mean Square Residual, which tests the joint significance of the predictors in the model.
F = 36932.6617 / 874.471137 = 42.23
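The arithmetic behind the ANOVA table can be retraced in a few lines of Python. This is only a sketch that starts from the Sums of Squares and degrees of freedom printed in the output above. The variable names are chosen for this example and are not part of Stata.

```python
import math

# Sums of squares and degrees of freedom copied from the output above
ss_model, df_model = 184663.309, 5
ss_residual, df_residual = 39351.2012, 45

ss_total = ss_model + ss_residual        # SSTotal = SSModel + SSResidual
df_total = df_model + df_residual        # N - 1 = 50

ms_model = ss_model / df_model           # Mean Square Model
ms_residual = ss_residual / df_residual  # Mean Square Residual
f_ratio = ms_model / ms_residual         # F(5, 45)
r_squared = ss_model / ss_total          # proportion of variance explained
root_mse = math.sqrt(ms_residual)        # standard deviation of the residuals

print(f"F({df_model}, {df_residual}) = {f_ratio:.2f}")
print(f"R-squared = {r_squared:.4f}")
print(f"Root MSE  = {root_mse:.3f}")
```

Up to rounding, these values agree with the F, R-squared, and Root MSE entries reported in the output.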
   Overall Model Fit
• Number of obs – This is the number of observations used in the regression analysis (51).
• F – The F-value is the Mean Square Model (36932.6617) divided by the Mean Square Residual (874.471137), yielding F = 42.23.
• Prob > F – This is the p-value for the F test of the model as a whole, that is, of the null hypothesis that all the slope coefficients are 0. Usually we need a p-value lower than 0.05 to conclude that the predictors, taken together, have a statistically significant relationship with Y. Here it is reported as 0.0000.
• R-squared – This is the proportion of the variance of Y explained by the X variables. In this case the model explains 82.43% of the variance in SAT scores.
• Root MSE – The root mean squared error is the standard deviation of the residuals, the square root of the Mean Square Residual (here 29.571). The closer it is to zero, the better the fit.
• Adj R-squared – Adjusted R-squared. As predictors are added to the model, each predictor will explain some of the variance in the dependent variable simply due to chance.
• One could keep adding predictors and R-squared would keep rising, although some of that increase would simply reflect chance variation in that particular sample.
• The adjusted R-squared attempts to yield a more honest estimate of R-squared for the population. Here R-squared is 0.8243, while adjusted R-squared is 0.8048.
• Adjusted R-squared is computed using the formula 1 – (1 – Rsq)(N – 1)/(N – k – 1), where N is the number of observations and k is the number of predictors. When the number of observations is small and the number of predictors is large, there will be a much greater difference between R-squared and adjusted R-squared (because the ratio (N – 1)/(N – k – 1) will be much greater than 1). By contrast, when the number of observations is very large compared to the number of predictors, R-squared and adjusted R-squared will be much closer because the ratio (N – 1)/(N – k – 1) will approach 1.
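The adjustment can be checked with a short Python sketch. The function name `adjusted_r_squared` is an assumption made for this example, and the N = 5000 case is hypothetical, included only to show the ratio approaching 1.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R-squared for n observations and k independent variables."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Values from the output above: R-squared = 0.8243, N = 51, k = 5
adj = adjusted_r_squared(0.8243, n=51, k=5)
print(f"Adj R-squared = {adj:.4f}")

# Hypothetical comparison: same R-squared and k, but far more observations
adj_large_n = adjusted_r_squared(0.8243, n=5000, k=5)
print(f"With N = 5000: {adj_large_n:.4f}")
```

With N = 51 the penalty is visible (0.8243 falls to about 0.8048), while with the hypothetical N = 5000 the adjusted value stays very close to the unadjusted one.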
   Parameter Estimates
           csat       Coef.   Std. Err.       t    P>|t|      [95% Conf. Interval]
        expense    .0033528   .0044709      0.75   0.457      -.005652     .0123576
        percent   -2.618177   .2538491    -10.31   0.000     -3.129455    -2.106898
         income    .1055853   1.166094      0.09   0.928     -2.243048     2.454218
           high    1.630841    .992247      1.64   0.107      -.367647     3.629329
        college    2.030894   1.660118      1.22   0.228     -1.312756     5.374544
          _cons    851.5649   59.29228     14.36   0.000      732.1441     970.9857
csat – This column shows the dependent variable at the top (csat) with the predictor variables below it (expense, percent, income, high, college, and _cons). The last entry (_cons) represents the constant, also referred to in textbooks as the Y-intercept, the height of the regression line where it crosses the Y axis. In other words, this is the predicted value of csat when all the predictor variables are 0.
Coef. – These are the values for the regression equation for predicting the dependent variable from the independent variables. The regression equation is:
     predicted csat = 851.56 + 0.003*expense – 2.62*percent + 0.11*income + 1.63*high + 2.03*college
These estimates tell you about the relationship between the independent variables and the dependent variable. Each coefficient is the change in csat that would be predicted for a 1 unit increase in that predictor, holding the other predictors constant. Note: the coefficients of independent variables that are not significant are not significantly different from 0, which should be taken into account when interpreting them.
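The following Python sketch shows how the regression equation turns predictor values into a predicted csat, and what "a 1 unit increase, holding the other predictors constant" means in practice. The coefficients are copied from the Coef. column. The observation in `case` is hypothetical and not taken from the data set, and `predict_csat` is a name assumed for this example.

```python
# Coefficients copied from the Coef. column of the output above
coefficients = {
    "expense": 0.0033528,
    "percent": -2.618177,
    "income": 0.1055853,
    "high": 1.630841,
    "college": 2.030894,
}
intercept = 851.5649  # _cons

def predict_csat(values):
    """Predicted csat for a dict mapping each predictor name to its value."""
    return intercept + sum(coefficients[name] * values[name] for name in coefficients)

# Hypothetical observation; these values are illustrative only
case = {"expense": 5000, "percent": 30, "income": 35, "high": 75, "college": 20}
baseline = predict_csat(case)

# Raise percent by 1 unit while holding the other predictors constant
bumped = predict_csat({**case, "percent": case["percent"] + 1})

print(f"predicted csat = {baseline:.2f}")
print(f"change for a 1 unit increase in percent = {bumped - baseline:.4f}")
```

The change in the prediction equals the percent coefficient, and setting every predictor to 0 returns the constant.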
Std. Err. – These are the standard errors associated with the coefficients. The standard error is used for testing whether the parameter is significantly different from 0: dividing the parameter estimate by its standard error gives the t-value (see the columns with t-values and p-values). The standard errors can also be used to form a confidence interval for the parameter, as shown in the last two columns of the table.
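As a quick check of the relationship between the columns, this Python sketch divides each coefficient by its standard error. The numbers are copied from the table. The dictionary layout is simply a convenience for this example.

```python
# (coefficient, standard error) pairs copied from the output above
estimates = {
    "expense": (0.0033528, 0.0044709),
    "percent": (-2.618177, 0.2538491),
    "income": (0.1055853, 1.166094),
    "high": (1.630841, 0.992247),
    "college": (2.030894, 1.660118),
    "_cons": (851.5649, 59.29228),
}

# t-value = coefficient divided by its standard error
t_values = {name: coef / se for name, (coef, se) in estimates.items()}

for name, t in t_values.items():
    print(f"{name:>8}  t = {t:7.2f}")
```

Rounded to two decimals, the results reproduce the t column of the table.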
t – The t-values test the null hypothesis that the coefficient is equal to 0. To reject it at the 0.05 significance level, you need a t-value whose absolute value exceeds the critical value, roughly 1.96 in large samples (about 2.01 with the 45 residual degrees of freedom here). You can get the t-values by dividing the coefficient by its standard error. The size of the t-value is also sometimes read as a rough guide to the importance of a variable in the model. In this case, percent has by far the largest absolute t-value.
P>|t| – The two-tailed p-values test the null hypothesis that each coefficient is equal to 0. To reject it, the p-value has to be lower than 0.05 (you could also choose an alpha of 0.10). In this case, expense, income, and college are not statistically significant in explaining SAT; high is almost significant at 0.10. Percent is the only variable with a statistically significant effect on SAT (its coefficient is different from 0).
If you use a 1-tailed test (i.e., you hypothesize that the parameter will go in a particular direction), then you can divide the p-value by 2 before comparing it to your pre-selected alpha level, provided the estimate has the hypothesized sign.
The coefficient for expense (.0033528) is not statistically significantly different from 0 because its p-value is larger than 0.05.
The coefficient for percent (-2.618177) is significantly different from 0 using alpha of 0.05 because its p-value is 0.000,
which is smaller than 0.05.
The coefficient for income (.1055853) is not statistically significantly different from 0 because its p-value is larger than 0.05.
The coefficient for high (1.630841) is not statistically significantly different from 0 because its p-value is larger than 0.05.
The coefficient for college (2.030894) is not statistically significantly different from 0 because its p-value is larger than 0.05.
The constant (_cons) is significantly different from 0 at the 0.05 alpha level. However, having a significant intercept is seldom
interesting.
[95% Conf. Interval] – This shows a 95% confidence interval for the coefficient. This is very useful as it helps you understand how high and how low the actual population value of the parameter might be. The confidence intervals are related to the p-values: a coefficient will not be statistically significant at the 0.05 level if its 95% confidence interval includes 0.
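The intervals can be rebuilt from the coefficients and standard errors as coefficient ± critical t × standard error. The Python sketch below hard-codes the critical value as an assumption: 2.0141 is the approximate two-tailed 5% critical value of the t distribution with 45 degrees of freedom, which a statistics library would normally supply.

```python
# Assumed critical value: t(0.975, df = 45) is approximately 2.0141
T_CRIT = 2.0141

# (coefficient, standard error) pairs copied from the output above
estimates = {
    "expense": (0.0033528, 0.0044709),
    "percent": (-2.618177, 0.2538491),
    "income": (0.1055853, 1.166094),
    "high": (1.630841, 0.992247),
    "college": (2.030894, 1.660118),
    "_cons": (851.5649, 59.29228),
}

# 95% confidence interval: coefficient plus or minus T_CRIT standard errors
intervals = {
    name: (coef - T_CRIT * se, coef + T_CRIT * se)
    for name, (coef, se) in estimates.items()
}

# An interval that contains 0 corresponds to a non-significant coefficient
contains_zero = {name: lo <= 0 <= hi for name, (lo, hi) in intervals.items()}

for name, (lo, hi) in intervals.items():
    note = "includes 0" if contains_zero[name] else "excludes 0"
    print(f"{name:>8}  [{lo:10.4f}, {hi:10.4f}]  {note}")
```

Only percent and _cons have intervals that exclude 0, matching the two p-values below 0.05 in the table.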