Regression and Correlation Guide
         Later on we will study correlation analysis to determine the degree to which the variables are
related. It tells us how well the estimating equation actually describes the relationship.
         We look for a CAUSAL relationship between variables,
i.e. how the independent variable causes the dependent variable to change.
Deterministic and Probabilistic Relations or Models
          A formula that relates quantities in the real world is called a model. Recall that in physics we
have studied that if a body moves with an initial velocity ‘u’ and uniform
acceleration ‘a’, the velocity after time ‘t’ is given by:
                  v = u + at
          This is a model for uniformly accelerated motion. This model has the property that when a value of ‘t’ is
substituted in the above equation, the value of v is determined without any error. Such models are called
deterministic models. An important example of a deterministic model is the relationship between the
Celsius and Fahrenheit scales in the form F = 32 + (9/5)C. Other examples of such models are Boyle’s
law, Newton’s law of gravitation, Ohm’s law etc.
          Consider another example, investigating the relationship between the yield of potatoes y and the
level of fertilizer application x. An investigator divides the field into eight plots of equal size with equal
fertility and applies varying amounts of fertilizer to each. The yield of potatoes (in kg) and the fertilizer
application (in kg) are recorded for each plot. The data are given below:
                   Fertilizer x (kg)    1     1.5   2     2.5   3     3.5   4     4.5
                   Yield y (kg)         25    31    27    28    36    35    32    34
         Suppose the investigator believes that the relation between y and x is exactly given by:
                   y = 22 + 2.5 x
         If this is true we must obtain the exact value of yield y for a given value of x. Thus when x = 1,
the yield must be:
                   Y = 22 + 2.5 (1) = 24.5
         But the observed yield is 25. There is an error of 24.5 – 25.0 = - 0.5. Hence no deterministic model can be
constructed to represent this experiment. A model that allows for this type of error is known as a probabilistic model. The
deterministic relation in such cases is modified to include both a deterministic component and a
random error component, given as
                   $Y_i = a + bX_i + \varepsilon_i$, where the $\varepsilon_i$ are the unknown random errors.
Regression Model
         There are many statistical investigations in which the main objective is to determine whether a
relationship exists between two or more variables. If such a relationship can be expressed by a
mathematical formula, we will then be able to use it for the purpose of making predictions. The reliability
of any prediction will, of course, depend on the strength of the relationship between the variables included
in the formula.
         A mathematical equation that allows us to predict values of one dependent variable from known
values of one or more independent variables is called a regression equation. Today the term regression is
applied to all types of prediction problems and does not necessarily imply a regression towards the
population mean.
Linear Regression
        We consider here the problem of estimating or predicting the value of a dependent variable Y on
the basis of a known measurement of an independent and frequently controlled variable X. The variable
to be estimated or predicted is termed the dependent variable, and the variable on the basis of
which the dependent variable is estimated is called the independent variable.
        For example, if we want to estimate the heights of children on the basis of their ages, the heights would
be the dependent variable and the ages would be the independent variable. In estimating the yield of a
crop on the basis of the amount of fertilizer used, the yield would be the dependent variable and the
amount of fertilizer would be the independent variable.
Scatter Diagram
        Let us consider the distribution of chemistry grades corresponding to intelligence test scores of
50, 55, 65 and 70. The chemistry grades for a sample of 12 freshmen having these intelligence test scores
are presented in the following table
                   Test score (X i)     65   50   55   65    55   70   65   70   55   70   50    55
                 Chemistry grade (Yi)   85   74   76   90    85   87   94   98   81   91   76    74
         The data in the table have been plotted in the figure to give a scatter
diagram. In the scatter diagram, the points follow a straight line
closely, indicating that the two variables are to some extent
linearly related. Once a reasonable linear relationship is observed,
we usually try to express it mathematically by a straight-line
equation         Y = a + bX, called the linear regression line, where
the constants a and b represent the y-intercept and slope
respectively. Such a regression line has been drawn in the
figure. This linear regression line can be used to predict
the value of Y corresponding to any given value of X.
         Using the regression line in the figure, we can predict a chemistry grade of 88 for a student whose
intelligence test score is 65. However, we would be extremely fortunate if a student with an intelligence
test score of 65 made a chemistry grade of exactly 88. In fact, the original data in the table show that the three
students with this intelligence test score received grades of 85, 90 and 94. We must, therefore, interpret the
predicted chemistry grade of 88 as an average or expected value for all students taking the course who
have an intelligence test score of 65.
         Many possible regression lines could be fitted to the sample data, but we choose that particular
line which best fits that data. The best regression line is obtained by estimating the regression parameters
by the most commonly used method of least squares.
Estimation of a Straight Line using the Method of Least Squares
        The basic linear relationship between the dependent variable $Y_i$ and the value $X_i$ is
                 $Y_i = a + bX_i + \varepsilon_i$
        where a and b are called the unknown population parameters (b is also called the coefficient of
regression), $Y_i$ are the observed values and $\varepsilon_i$ are the error components.
        The estimated regression is written as
                 $\hat{Y}_i = a + bX_i$
        The method of least squares determines the values of the unknown parameters that minimize the
sum of squares of the errors where errors are defined as the difference between observed values and the
corresponding predicted or estimated values. It is denoted by
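    $$S(a,b) = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - a - bX_i)^2$$
To minimize S(a,b), we set its first partial derivatives with respect to a and b
equal to zero. Therefore
        $$\frac{\partial S(a,b)}{\partial a} = 2\sum_{i=1}^{n} (Y_i - a - bX_i)(-1) = 0$$
        $$\frac{\partial S(a,b)}{\partial b} = 2\sum_{i=1}^{n} (Y_i - a - bX_i)(-X_i) = 0$$
        Simplifying, we obtain the normal equations
        $$\sum Y_i = na + b\sum X_i$$
        $$\sum X_iY_i = a\sum X_i + b\sum X_i^2$$
        Solving these simultaneously, we have
        $$a = \frac{(\sum Y_i)(\sum X_i^2) - (\sum X_i)(\sum X_iY_i)}{n\sum X_i^2 - (\sum X_i)^2}, \quad b = \frac{n\sum X_iY_i - (\sum X_i)(\sum Y_i)}{n\sum X_i^2 - (\sum X_i)^2}$$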
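        The solution of the normal equations can be sketched in a few lines of Python. This is only an illustrative sketch: the function name fit_line is our own, and the data are the test scores and chemistry grades from the scatter diagram section.

```python
def fit_line(xs, ys):
    """Least squares estimates (a, b) for Y = a + bX, from the normal equations."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:
        raise ValueError("all X values are equal; the slope is undefined")
    b = (n * sum_xy - sum_x * sum_y) / denom          # slope
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denom     # intercept
    return a, b


# Intelligence test scores (X) and chemistry grades (Y) of the 12 freshmen
test_scores = [65, 50, 55, 65, 55, 70, 65, 70, 55, 70, 50, 55]
chem_grades = [85, 74, 76, 90, 85, 87, 94, 98, 81, 91, 76, 74]

a, b = fit_line(test_scores, chem_grades)
print(f"Y = {a:.2f} + {b:.3f} X")
print(f"predicted grade at X = 65: {a + b * 65:.1f}")
```

        For these data the fitted line is roughly Y = 30.04 + 0.897X, which gives a predicted grade of about 88 at a test score of 65.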
If the variable X is taken as the dependent variable, then the least squares line is given by
                          X = c + dY
and the normal equations are
         $\sum X = nc + d\sum Y$
         $\sum XY = c\sum Y + d\sum Y^2$
By solving simultaneously, we have the values of c and d
         $$c = \frac{(\sum X)(\sum Y^2) - (\sum Y)(\sum XY)}{n\sum Y^2 - (\sum Y)^2}, \quad d = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum Y^2 - (\sum Y)^2}$$
Example (1)
        Fit a least squares line of regression to the following data taking (i) X as the independent variable (ii)
Y as the independent variable.
                                        X      1   3   5    6    7   9   10   13
                                        Y      1   2   5    5    6   7   7    8
Solution
         (i)       The equation of the least squares line is
                           Y = a + bX
                   and the normal equations are
                   $\sum Y = na + b\sum X$
                   $\sum XY = a\sum X + b\sum X^2$
                   From given data we have n = 8
                             X               Y             XY              $X^2$           $Y^2$
                             1               1              1               1              1
                             3               2              6               9              4
                             5               5             25              25              25
                             6               5             30              36              25
                             7               6             42              49              36
                             9               7             63              81              49
                             10              7             70              100             49
                             13              8             104             169             64
                          $\sum X = 54$   $\sum Y = 41$   $\sum XY = 341$   $\sum X^2 = 470$   $\sum Y^2 = 253$
                   Substituting the values of n, $\sum X$, $\sum Y$, $\sum XY$ and $\sum X^2$ in the normal equations, we have
                           41 = 8a + 54b
                           341 = 54a + 470b
                   By solving simultaneously, we get
                           a = 1.01, b = 0.609
                   We can also find the values of a and b using the formulas
                   $$a = \frac{(\sum Y)(\sum X^2) - (\sum X)(\sum XY)}{n\sum X^2 - (\sum X)^2}, \quad b = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2}$$
                   Thus the equation of the fitted least squares line becomes
                            Y = 1.01 + 0.609X
         (ii)      When X is taken as the dependent variable and Y as the independent variable, the equation of
                   the least squares line of regression is
                            X = c + dY
                   with the normal equations
                            $\sum X = nc + d\sum Y$
                            $\sum XY = c\sum Y + d\sum Y^2$
                   Substituting the values from Table 7.2, we have
                            54 = 8c + 41d
                            341 = 41c + 253d
                   By solving simultaneously we have
                            c = - 0.93, d = 1.499
                   Thus the equation of the fitted least squares line becomes
                            X = - 0.93 + 1.499Y
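        Both parts of Example (1) can be checked numerically. The sketch below is only illustrative (fit_line is a helper name of our own): it fits Y on X, then swaps the roles of the variables to fit X on Y.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line of ys on xs."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    intercept = (sy - slope * sx) / n
    return intercept, slope


X = [1, 3, 5, 6, 7, 9, 10, 13]
Y = [1, 2, 5, 5, 6, 7, 7, 8]

a, b = fit_line(X, Y)   # (i)  Y on X
c, d = fit_line(Y, X)   # (ii) X on Y: the variables exchange roles

print(f"Y = {a:.2f} + {b:.3f} X")
print(f"X = {c:.2f} + {d:.3f} Y")
```

        Note that the intercept c of the second line comes out negative, and that the two lines are not algebraic rearrangements of each other.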
Example (2)
        The following data give, for each of 10 randomly selected areas, the number of oil stoves and
the annual consumption of oil in barrels. Fit a regression equation of annual oil consumption on
number of stoves.
                No. of stoves (x):                   27    32    38    42    48    54    60    67    73    79
                Annual consumption of oil (y):       142   170   200   194   224   256   261   270   304   349
Solution
       Necessary calculations are given below
                                 x      y      xy        $x^2$
                                 27     142    3834      729
                                 32     170    5440      1024
                                 38     200    7600      1444
                                 42     194    8148      1764
                                 48     224    10752     2304
                                 54     256    13824     2916
                                 60     261    15660     3600
                                 67     270    18090     4489
                                 73     304    22192     5329
                                 79     349    27571     6241
                        Total    520    2370   133111    29840
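        The regression of oil consumption on number of stoves follows directly from the column totals. A minimal sketch, assuming the totals in the 10-row table above are the ones belonging to this example:

```python
# Column totals from the calculation table (10 areas)
n = 10
sum_x, sum_y = 520, 2370
sum_xy, sum_x2 = 133111, 29840

# Slope and intercept from the solved normal equations
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

print(f"y = {a:.2f} + {b:.3f} x")
```

        This gives roughly y = 53.68 + 3.525x, i.e. an estimated increase of about 3.5 barrels in annual consumption for each additional stove.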
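Example (3)
        For the potato yield data (x = fertilizer applied in kg, y = yield in kg), the necessary
calculations are given below, where $\hat{y}$ is the fitted value and $e = y - \hat{y}$ is the residual
                                 x      y      xy        $x^2$    $y^2$    $\hat{y}$   e
                                 1      25     25.0      1.00     625      26.83    -1.83
                                 1.5    31     46.5      2.25     961      28.02    2.98
                                 2      27     54.0      4.00     729      29.21    -2.21
                                 2.5    28     70.0      6.25     784      30.40    -2.40
                                 3      36     108.0     9.00     1296     31.59    4.41
                                 3.5    35     122.5     12.25    1225     32.79    2.21
                                 4      32     128.0     16.00    1024     33.98    -1.98
                                 4.5    34     153.0     20.25    1156     35.17    -1.17
                        Total    22     248    707.0     71.00    7800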
         (iv)       If we assume the model to extend to x = 0, a = 24.452 is the estimated expected yield
                    when no fertilizer is used.
Plot of Residuals
        A plot of the residuals against the values of x often gives an idea of how good the fit is.
            a) If the points in the plot are close to the x-axis and scattered in a random way, the model
                appears to provide a good fit.
            b) If the points are distributed in a systematic manner we should try some other model.
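        Before plotting, the residuals themselves have to be computed. The following sketch does this for the eight potato yield pairs (x = 1, 1.5, …, 4.5); the variable names are our own, and the printed values should agree with the tabulated fitted values and residuals up to rounding.

```python
x = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5]
y = [25, 31, 27, 28, 36, 35, 32, 34]

n = len(x)
sx, sy = sum(x), sum(y)
sxy = sum(xi * yi for xi, yi in zip(x, y))
sxx = sum(xi * xi for xi in x)

b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)   # slope
a = (sy - b * sx) / n                            # intercept

fitted = [a + b * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

print(f"fitted line: y = {a:.3f} + {b:.3f} x")
for xi, yi, fi, ei in zip(x, y, fitted, residuals):
    print(f"{xi:4.1f} {yi:4d} {fi:7.2f} {ei:6.2f}")

# With an intercept in the model, least squares residuals sum to zero
print(f"sum of residuals = {sum(residuals):.6f}")
```

        Here the residuals change sign irregularly from plot to plot, which is the kind of random scatter described in (a).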
Example (4)
        For the data in Example (3) relating to potato yields, remove the first pair (x = 1, y = 25) and fit
a line of regression to the remaining seven pairs. Is the line the same as the one already determined for the eight
pairs? Do the same by removing the second pair (x = 1.5, y = 31) instead of the first. Are the three lines
different?
        You will observe that a change in the data leads to a different line. We say that a least squares line
has zero breakdown point: changing even a single data point can move the line. There are robust methods in
which the fitted line cannot be moved arbitrarily far unless about 50% of the data points are changed.
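        The three fits asked for in Example (4) can be compared with a short sketch (fit_line is a helper name of our own, using the solved normal equations):

```python
def fit_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a = (sy - b * sx) / n
    return a, b


x = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5]
y = [25, 31, 27, 28, 36, 35, 32, 34]

full = fit_line(x, y)
without_first = fit_line(x[1:], y[1:])                     # drop (1, 25)
without_second = fit_line(x[:1] + x[2:], y[:1] + y[2:])    # drop (1.5, 31)

for label, (a, b) in [("all eight pairs", full),
                      ("first pair removed", without_first),
                      ("second pair removed", without_second)]:
    print(f"{label:20s}: y = {a:.3f} + {b:.3f} x")
```

        The slopes come out at roughly 2.38, 1.86 and 2.87, so removing a single observation visibly changes the fitted line.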
Exercise
         (1)       Given below are the data relating to the thermal energy generated in Pakistan during 1981-94. The
                   energy generation is in billion kWh.
                   Year                    1981   1982      1983   1984     1985     1986     1987
                   Energy Generated        4.2    5.2       5.1    5.2      6.5      7.3      8.4
                   Year                    1988   1989      1990   1991     1992     1993     1994
                   Energy Generated        10.8   11.9      14.5   16.1     19.4     19.7     23.0
         Fit a straight line to the data. Find the residuals. Plot the residuals and comment on your result.
         (2)      Following is the annual installation of computers in labs in UET. Fit a linear regression
                  equation of computers on years and give their annual rate of installation.
Note
         In each situation where the independent variable is a time factor, the years
         2001, 2002, 2003, … may be coded as 1, 2, 3, …
Example (6)
                                                                               
            Fit a least squares line for 20 pairs of observations having $\bar{X} = 2$, $\bar{Y} = 8$, $\sum X^2 = 180$ and $\sum XY = 404$.
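        When only summary figures are given, the sums needed in the normal equations can be recovered from the means. The sketch below assumes that the values 2 and 8 in this example are the means of X and Y (the bars appear to have been lost in the text), so that the sum of X is n times 2 and the sum of Y is n times 8.

```python
n = 20
mean_x, mean_y = 2, 8          # assumed to be the means of X and Y
sum_x2, sum_xy = 180, 404

sum_x = n * mean_x             # 40
sum_y = n * mean_y             # 160

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = mean_y - b * mean_x        # the fitted line passes through the point of means

print(f"Y = {a:.2f} + {b:.2f} X")
```

        Under that assumption the fitted line is Y = 6.32 + 0.84X.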
Example (7)
        For 5 pairs of observations, it is given that the arithmetic mean of X is 2 and the arithmetic mean of Y is 15. It is also known
that $\sum X^2 = 30$, $\sum X^3 = 100$, $\sum X^4 = 354$, $\sum XY = 242$ and $\sum X^2Y = 850$. Fit a second degree parabola taking X as
the independent variable.
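        For a second degree parabola Y = a + bX + cX^2, least squares leads to three normal equations in a, b and c. These equations are not stated in the text above; the sketch below assumes their standard form and solves them exactly by Cramer's rule. The helper names det3 and solve3 are our own.

```python
from fractions import Fraction


def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def solve3(A, rhs):
    """Solve a 3x3 linear system exactly by Cramer's rule."""
    D = det3(A)
    if D == 0:
        raise ValueError("normal equations are singular")
    solution = []
    for col in range(3):
        M = [row[:] for row in A]
        for r in range(3):
            M[r][col] = rhs[r]          # replace one column by the right-hand side
        solution.append(Fraction(det3(M), D))
    return solution


n = 5
sum_x, sum_y = n * 2, n * 15            # from the given arithmetic means
sum_x2, sum_x3, sum_x4 = 30, 100, 354
sum_xy, sum_x2y = 242, 850

# Assumed standard normal equations for Y = a + bX + cX^2:
#   sum Y    = n a      + b sum X   + c sum X^2
#   sum XY   = a sum X  + b sum X^2 + c sum X^3
#   sum X^2Y = a sum X^2 + b sum X^3 + c sum X^4
A = [[n, sum_x, sum_x2],
     [sum_x, sum_x2, sum_x3],
     [sum_x2, sum_x3, sum_x4]]
rhs = [sum_y, sum_xy, sum_x2y]

a, b, c = solve3(A, rhs)
print(f"Y = {float(a):.4f} + {float(b):.4f} X + {float(c):.4f} X^2")
```

        Exact fractions are used so that the solution can be substituted back into the three equations without rounding error.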
Coefficient of Determination
        The variability among the values of the dependent variable Y, called the total variation, is given
by $\sum (Y - \bar{Y})^2$. This is composed of two parts:
        (i)    one which is explained by (associated with) the regression line, i.e. $\sum (\hat{Y} - \bar{Y})^2$
        (ii)   the other which is not explained by (not associated with) the regression line, i.e. $\sum (Y - \hat{Y})^2$.
Symbolically,
               $$\sum (Y - \bar{Y})^2 = \sum (Y - \hat{Y})^2 + \sum (\hat{Y} - \bar{Y})^2$$
               Total Variation = Unexplained Variation + Explained Variation
See the diagram;
       The coefficient of determination, which measures the proportion of variability in the values of the
dependent variable (Y) associated with its linear relation with the independent variable (X), is defined by:
               $$r^2 = \frac{\text{Explained Variation}}{\text{Total Variation}} = \frac{\sum (\hat{Y} - \bar{Y})^2}{\sum (Y - \bar{Y})^2} = 1 - \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2}$$
Alternate formula for Coefficient of Determination:
               $$r^2 = \frac{a\sum Y + b\sum XY - n\bar{Y}^2}{\sum Y^2 - n\bar{Y}^2}$$
Example (12)
        Calculate the coefficient of determination using both the formulas.
                             Year          R&D expenses (X)     Annual profit (Y)
                             1st                  5                   31
                             2nd                 11                   40
                             3rd                  4                   30
                             4th                  5                   34
                             5th                  3                   25
                             6th                  2                   20
                             Total          $\sum X = 30$        $\sum Y = 180$
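        As a check on hand calculation, the two formulas for the coefficient of determination can be evaluated side by side. This is a sketch with variable names of our own; it first fits the least squares line of Y on X and then computes r^2 both ways.

```python
X = [5, 11, 4, 5, 3, 2]          # R&D expenses
Y = [31, 40, 30, 34, 25, 20]     # annual profit

n = len(X)
sum_x, sum_y = sum(X), sum(Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
sum_x2 = sum(x * x for x in X)
sum_y2 = sum(y * y for y in Y)

# Least squares line Y = a + bX
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

mean_y = sum_y / n
fitted = [a + b * x for x in X]

# Definition: explained variation / total variation
explained = sum((f - mean_y) ** 2 for f in fitted)
total = sum((y - mean_y) ** 2 for y in Y)
r2_definition = explained / total

# Alternate formula
r2_alternate = (a * sum_y + b * sum_xy - n * mean_y ** 2) / (sum_y2 - n * mean_y ** 2)

print(f"Y = {a:.2f} + {b:.2f} X")
print(f"r^2 (definition) = {r2_definition:.4f}")
print(f"r^2 (alternate)  = {r2_alternate:.4f}")
```

        Both formulas give the same value, about 0.826, so roughly 83% of the variation in profit is associated with its linear relation to R&D expenses.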
Correlation
         Two variables are said to be correlated if they tend to vary simultaneously in some direction. If
both variables tend to increase (or decrease) together, the correlation is said to be direct or positive,
e.g. the length of an iron bar increases as the temperature increases. If one variable tends to increase as the
other variable decreases, the correlation is said to be negative or inverse, e.g. the volume of a gas
decreases as the pressure increases.
         Correlation measures the strength of the relationship, that is, the interdependence between the two
variables; no distinction is made between dependent and independent variables. In regression, by
contrast, we are interested in determining the dependence of one variable upon the other variable.
         The numerical measure of the strength of the linear relationship between any two variables is called
the correlation coefficient. It is usually denoted by r and is defined by
                  $$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}$$
         It assumes values that range from +1 for perfect positive linear relationship, to – 1, for perfect
negative linear relationship and r = 0 indicates no linear relationship between X and Y.
         It is important to note that r = 0 does not mean that there is no relationship at all. e.g. if all the
observed values lie exactly on a circle, there is a perfect non-linear relationship between the variables.
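         The definition of r, and the remark about points on a circle, can be illustrated with a short sketch. The function name pearson_r and the four circle points are our own choices; the first data set is the one used in Example (1).

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product moment correlation coefficient from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Data of Example (1): a strong positive linear relationship
X = [1, 3, 5, 6, 7, 9, 10, 13]
Y = [1, 2, 5, 5, 6, 7, 7, 8]
r_line = pearson_r(X, Y)

# Four points lying exactly on the unit circle: related, but not linearly
circle_x = [1, 0, -1, 0]
circle_y = [0, 1, 0, -1]
r_circle = pearson_r(circle_x, circle_y)

print(f"r for Example (1) data = {r_line:.3f}")
print(f"r for points on a circle = {r_circle:.3f}")
```

         The first value is close to +1 (about 0.955), while the circle points give r = 0 even though they satisfy an exact relationship.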
Analysis of Correlation and Regression
(1)      Correlation measures the STRENGTH of the linear association between paired variables, say X and Y. On
         the other hand, regression gives the FORM of the linear association that best predicts Y from the values
         of X.
(2)      Correlation is calculated whenever:
           o       both X and Y are measured on each subject and we wish to quantify how strongly they are linearly associated.
           o       in particular, Pearson's product moment correlation coefficient is used when the assumption that
                   both X and Y are sampled from normally distributed populations is satisfied,
           o       or Spearman's rank order correlation coefficient is used if the assumption of normality is
                   not satisfied.
           o       correlation is not used when the variables are manipulated, for example, in experiments.
(3)      Linear regression is used whenever:
           o       at least one of the independent variables (Xi's) is used to predict the dependent variable Y. Note: Some
                   of the Xi's may be dummy variables, i.e. Xi = 0 or 1, which are used to code nominal variables.
           o       one manipulates the X variable, e.g. in an experiment.
(4)      Linear regression is not symmetric in X and Y. That is, interchanging X and Y gives a
         different regression model (X in terms of Y) from the original one (Y in terms of X). On the other hand, if
         you interchange X and Y in the calculation of the correlation coefficient, you get the same value
         of the correlation coefficient.
(5)      The "best" linear regression model is obtained by selecting the variables (X's) that have at least a strong
         correlation with Y, i.e. r >= 0.80 or r <= -0.80.
(6)      The same underlying distribution is assumed for all variables in linear regression. Thus, linear regression
         will underestimate the correlation between the independent and dependent variables when they (the X's and Y) come from
         different underlying distributions.
Rank Correlation
          Sometimes the actual measurements of individuals or objects are either not available or an accurate
assessment is not possible. They are then arranged in order according to some characteristic of interest.
Such an ordered arrangement is called a ranking, and the order given to an individual or object is called its
rank. The correlation between two such sets of rankings is called Rank Correlation.
          Suppose we have n pairs of ranks assigned with respect to some characteristic, say (x1, y1),
(x2, y2), (x3, y3), … , (xn, yn). Since both the xi and the yi are the first n natural numbers in some order, we have
          $$\sum x_i = 1 + 2 + \dots + n = \frac{n(n+1)}{2}$$
          $$\sum x_i^2 = \sum y_i^2 = 1^2 + 2^2 + \dots + n^2 = \frac{n(n+1)(2n+1)}{6}$$
          $$\sum (x_i - \bar{x})^2 = \sum (y_i - \bar{y})^2 = \sum y_i^2 - \frac{(\sum y_i)^2}{n} = \frac{n(n+1)(2n+1)}{6} - \frac{n(n+1)^2}{4} = \frac{n(n^2-1)}{12}$$
Let $d_i = x_i - y_i$
Then
          $$\sum d_i^2 = \sum (x_i - y_i)^2 = \sum x_i^2 + \sum y_i^2 - 2\sum x_iy_i$$
          $$= \frac{n(n+1)(2n+1)}{6} + \frac{n(n+1)(2n+1)}{6} - 2\sum x_iy_i$$
          so that
          $$\sum x_iy_i = \frac{n(n+1)(2n+1)}{6} - \frac{1}{2}\sum d_i^2$$
The product moment coefficient of correlation is:
          $$r = \frac{\sum XY - (\sum X)(\sum Y)/n}{\sqrt{[\sum X^2 - (\sum X)^2/n][\sum Y^2 - (\sum Y)^2/n]}}$$
          By substitution we have
          $$r = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}$$
          This coefficient also ranges from – 1 to + 1.
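          The substitution step can be written out as follows. This is a worked sketch using only the sums derived above, with the ranks x and y in place of X and Y in the product moment formula.

```latex
\begin{aligned}
\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}
  &= \frac{n(n+1)(2n+1)}{6} - \frac{1}{2}\sum d_i^2 - \frac{n(n+1)^2}{4}
   = \frac{n(n^2-1)}{12} - \frac{1}{2}\sum d_i^2, \\
\sqrt{\Big[\sum x_i^2 - \frac{(\sum x_i)^2}{n}\Big]\Big[\sum y_i^2 - \frac{(\sum y_i)^2}{n}\Big]}
  &= \frac{n(n^2-1)}{12}, \\
r &= \frac{\dfrac{n(n^2-1)}{12} - \dfrac{1}{2}\sum d_i^2}{\dfrac{n(n^2-1)}{12}}
   = 1 - \frac{6\sum d_i^2}{n(n^2-1)}.
\end{aligned}
```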
Note
        If two objects or observations are tied (having the same value), let us say for the fourth and fifth places, then they
are both given the mean of the ranks 4 and 5, i.e. 4.5.
        This situation is illustrated in the following example.
Example (13)
        The following table shows the number of hours studied (X) by a random sample of ten students
and their grades in examination (Y):
                  X:                8    5    11   13    10    5      18     15    2      8
                  Y:                56   44   79   72    70    54     94     85    33     65
         Calculate Spearman’s rank correlation coefficient.
Solution
         We rank the X values by giving rank 1 to the highest value 18, rank 2 to 15, rank 3 to 13, rank 4
to 11, rank 5 to 10, rank 6.5 (mean of rank 6 and 7) to both 8, rank 8.5 (mean of rank 8 and 9) to both 5
and rank 10 to 2. Similarly we rank the values of Y by giving 1 to the highest value 94, rank 2 to 85, rank
3 to 79, …, and rank 10 to 33 which is the smallest.
         Table given below:
                            X       Y    Rank of X        Rank of Y    $d_i$    $d_i^2$
                            8       56      6.5              7         - 0.5    0.25
                            5       44      8.5              9         - 0.5    0.25
                            11      79       4               3          1.0     1
                            13      72       3               4         - 1.0    1
                            10      70       5               5          0.0     0
                            5       54      8.5              8          0.5     0.25
                            18      94       1               1          0.0     0
                            15      85       2               2          0.0     0
                            2       33      10               10         0.0     0
                            8       65      6.5              6          0.5     0.25
                                                                        $\sum d_i^2 = 3$
The value of n is 10.
Hence
                  $$r = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)} = 1 - \frac{6(3)}{10(10^2 - 1)} = 0.98$$
Compare this value with the correlation coefficient for the original values.
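        The ranking with ties and the comparison with the ordinary correlation coefficient can both be sketched in Python. The helper names average_ranks and pearson_r are our own; ranks are assigned from the highest value downwards, and tied values share the mean of the ranks they occupy, as in the solution above.

```python
from math import sqrt


def average_ranks(values):
    """Rank from highest (rank 1) to lowest; tied values share the mean rank."""
    order = sorted(values, reverse=True)
    ranks = []
    for v in values:
        first = order.index(v) + 1       # first rank position occupied by v
        count = order.count(v)           # number of observations tied at v
        ranks.append(first + (count - 1) / 2)
    return ranks


def pearson_r(xs, ys):
    """Product moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


X = [8, 5, 11, 13, 10, 5, 18, 15, 2, 8]           # hours studied
Y = [56, 44, 79, 72, 70, 54, 94, 85, 33, 65]      # examination grades

rank_x = average_ranks(X)
rank_y = average_ranks(Y)
n = len(X)
sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))

r_rank = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
r_original = pearson_r(X, Y)

print(f"sum of d^2 = {sum_d2}")
print(f"rank correlation = {r_rank:.2f}")
print(f"correlation of original values = {r_original:.2f}")
```

        For these data the rank correlation (about 0.98) is slightly higher than the correlation of the original values (about 0.96).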
Example (14)
         Ten competitors in a beauty contest are ranked by three judges in the following order
                  1st Judge         1    6   5       10     3    2      4      9     7      8
                  2nd Judge         3    5   8       4      7    10     2      1     6      9
                  3rd Judge         6    4   9       8      1    2      3      10    5      7
      Use the rank correlation coefficient to discuss which pair of judges have the nearest approach to
common tastes in beauty.
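        Since there are no ties here, the rank correlation formula can be applied directly to each pair of judges. A minimal sketch follows; the function name rank_correlation and the dictionary labels are our own.

```python
from itertools import combinations


def rank_correlation(ranks_1, ranks_2):
    """Rank correlation r = 1 - 6 * sum(d^2) / (n (n^2 - 1)) for untied ranks."""
    n = len(ranks_1)
    sum_d2 = sum((u - v) ** 2 for u, v in zip(ranks_1, ranks_2))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


judges = {
    "1st": [1, 6, 5, 10, 3, 2, 4, 9, 7, 8],
    "2nd": [3, 5, 8, 4, 7, 10, 2, 1, 6, 9],
    "3rd": [6, 4, 9, 8, 1, 2, 3, 10, 5, 7],
}

# Rank correlation for every pair of judges
results = {
    (p, q): rank_correlation(judges[p], judges[q])
    for p, q in combinations(judges, 2)
}

for (p, q), r in results.items():
    print(f"{p} and {q} judge: r = {r:.3f}")

closest = max(results, key=results.get)
print(f"nearest approach to common tastes: {closest[0]} and {closest[1]} judge")
```

        The pair with the largest coefficient has the nearest approach to common tastes; here that is the 1st and 3rd judges, while the other two pairs have negative coefficients.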
X1:        41     31      26      43       21     33       41    31     46     31     36    32      38     27    35     40
X2:        62     52      50      56       51     52       63    50     47     40     56    54      60     57    57     58
X3:        41     33      33      38       35     36       43    33     37     33     33    31      30     37    35     31
(d) The correlation coefficient between X1 and X2 when the effect of X3 is held constant