13 Simple Linear Regression
                 Correlation
Correlation is a statistical
 technique used to determine the
 degree to which two variables are
 related
 12 March 2025                         2
   Scatter diagram
          • Rectangular coordinate
          • Two quantitative variables
          • One variable is called independent (X)
                and the second is called dependent (Y)
          • Points are not joined
          • No frequency table
      Example
   Wt. (kg)      67   69   85   83   74   81   97   92  114   85
   SBP (mmHg)   120  125  140  160  130  180  150  140  200  130
[Figure: scatter plot of SBP (mmHg) against Wt (kg) for the ten subjects above]
                Positive relationship
[Figure: scatter plot with "Height in CM" on the vertical axis]
[Figure: scatter plot of Reliability against Age of Car]
                No relation
                Correlation Coefficient
           Simple Correlation coefficient (r)
 The sign of r denotes the nature
 of association
 If the sign is positive, the relation is direct: an increase in one
  variable is associated with an increase in the other variable, and a
  decrease in one variable is associated with a decrease in the other
  variable.
How to compute the simple correlation coefficient (r)

$$ r = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)\left(\sum y^2 - \frac{(\sum y)^2}{n}\right)}} $$
                           Example:
   Serial No   Age (years)   Weight (Kg)
       1            7             12
       2            6              8
       3            8             12
       4            5             10
       5            6             11
       6            9             13
These two variables are both quantitative. Age is the independent
  variable, denoted X, and weight is the dependent variable, denoted Y.
  To find the relation between age and weight, compute the simple
  correlation coefficient using the following formula:

$$ r = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)\left(\sum y^2 - \frac{(\sum y)^2}{n}\right)}} $$
  Serial n   Age (years) (x)   Weight (Kg) (y)     xy     X2     Y2
     1              7                12             84     49    144
     2              6                 8             48     36     64
     3              8                12             96     64    144
     4              5                10             50     25    100
     5              6                11             66     36    121
     6              9                13            117     81    169
   Total           41                66            461    291    742

$$ r = \frac{461 - \frac{41 \times 66}{6}}{\sqrt{\left(291 - \frac{(41)^2}{6}\right)\left(742 - \frac{(66)^2}{6}\right)}} $$

r ≈ 0.76
strong direct correlation
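The arithmetic of the age and weight example can be checked with a minimal Python sketch. The function name `pearson_r` and the variable names are illustrative choices, not part of the slides; the code simply applies the computational formula for r given earlier to the six data pairs.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation via the computational (sums) formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return numerator / denominator

# Age (x, years) and weight (y, kg) from the example table
age = [7, 6, 8, 5, 6, 9]
weight = [12, 8, 12, 10, 11, 13]

r = pearson_r(age, weight)
print(f"r = {r:.4f}")
```

The intermediate sums are 41, 66, 461, 291 and 742, so the numerator is 10 and the result is roughly 0.76.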
            EXAMPLE: Relationship between Anxiety and Test Scores
 Anxiety (X)   Test score (Y)     X2      Y2      XY
     10              2           100       4      20
      8              3            64       9      24
      2              9             4      81      18
      1              7             1      49       7
      5              6            25      36      30
      6              5            36      25      30
  ∑X = 32         ∑Y = 32     ∑X2 = 230  ∑Y2 = 204  ∑XY = 129

                     Calculating Correlation Coefficient
r = - 0.94
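As a hedged check of the anxiety and test score result, substituting the column totals (∑X = 32, ∑Y = 32, ∑X2 = 230, ∑Y2 = 204, ∑XY = 129, n = 6) into the computational formula for r gives the following; the intermediate values are rounded to two decimals.

```latex
r = \frac{129 - \frac{(32)(32)}{6}}
         {\sqrt{\left(230 - \frac{(32)^2}{6}\right)\left(204 - \frac{(32)^2}{6}\right)}}
  = \frac{-41.67}{\sqrt{(59.33)(33.33)}}
  \approx \frac{-41.67}{44.47}
  \approx -0.94
```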
Spearman Rank Correlation Coefficient (rs)
                 Procedure:
5. Apply the following formula

$$ r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} $$
                       Example
In a study of the relationship between level of injury
and income, the following data were obtained. Find
the relationship between them and comment.
   Sample      Level of injury      Income
                    (X)               (Y)
     A            Moderate             25
     B            Mild                 10
     C            Fatal                 8
     D            Severe               10
     E            Severe               15
     F            Normal               50
     G            Fatal                60
                              Answer:
  Sample   Level of injury (X)   Income (Y)   Rank X   Rank Y     di     di2
    A          Moderate              25          5        3         2      4
    B          Mild                  10          6        5.5       0.5    0.25
    C          Fatal                  8          1.5      7        -5.5   30.25
    D          Severe                10          3.5      5.5      -2      4
    E          Severe                15          3.5      4        -0.5    0.25
    F          Normal                50          7        2         5     25
    G          Fatal                 60          1.5      1         0.5    0.25
∑ di2 = 64

$$ r_s = 1 - \frac{6 \times 64}{7(48)} \approx -0.14 $$

Comment:
There is an indirect weak correlation
 between level of injury and income.
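The ranking steps and the formula can be sketched in Python as follows. The helper `average_ranks` and the numeric `severity` coding are assumptions made for illustration (the slides only show the resulting ranks); the coding is chosen so that "fatal" receives the first rank and "normal" the last, matching the answer table.

```python
def average_ranks(values, descending=True):
    """Rank values 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + 1 + end + 1) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks

# Assumed numeric coding of injury level (hypothetical, for ranking only)
severity = {"normal": 0, "mild": 1, "moderate": 2, "severe": 3, "fatal": 4}
injury = ["moderate", "mild", "fatal", "severe", "severe", "normal", "fatal"]
income = [25, 10, 8, 10, 15, 50, 60]

rank_x = average_ranks([severity[level] for level in injury])
rank_y = average_ranks(income)
d_squared = [(rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y)]

n = len(income)
r_s = 1 - 6 * sum(d_squared) / (n * (n ** 2 - 1))
print(sum(d_squared), round(r_s, 2))
```

With these data the squared rank differences sum to 64, giving a weak negative coefficient of about -0.14.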
                exercise
        What is regression analysis?
• An extension of correlation
• A way of measuring the relationship
  between two or more variables.
• Used to calculate the extent to which one
  variable changes (DV) when other
  variable(s) change (IV(s)).
• Used to help understand possible causal
  effects of one variable on another.
    What is linear regression (LR)?
• Involves:
     – one predictor (IV) and
     – one outcome (DV)
• Explains a relationship using a straight line
  fit to the data.
                Least squares criterion
                Least-Squares Regression
The most common method for fitting a
 regression line is the method of least-
 squares.
    This method calculates the best-fitting line for
     the observed data by minimizing the sum of the
     squares of the vertical deviations from each
     data point to the line (if a point lies on the fitted
     line exactly, then its vertical deviation is 0).
    Because the deviations are first squared, then
     summed, there are no cancellations between
     positive and negative values.
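The criterion can be illustrated numerically. The sketch below is only an illustration under stated assumptions: it reuses the weight and SBP data from the earlier scatter diagram example, uses the closed-form slope and intercept that these notes introduce later, and the function name `sum_squared_errors` is hypothetical. It shows that moving the line away from the least-squares fit increases the sum of squared vertical deviations.

```python
# Weight (kg) and SBP (mmHg) from the earlier example in these notes
weight = [67, 69, 85, 83, 74, 81, 97, 92, 114, 85]
sbp = [120, 125, 140, 160, 130, 180, 150, 140, 200, 130]

def sum_squared_errors(intercept, slope, x, y):
    """Sum of squared vertical deviations of the points from the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

n = len(weight)
mean_x = sum(weight) / n
mean_y = sum(sbp) / n

# Closed-form least-squares slope and intercept
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(weight, sbp))
         / sum((xi - mean_x) ** 2 for xi in weight))
intercept = mean_y - slope * mean_x
best = sum_squared_errors(intercept, slope, weight, sbp)

# Nudge the intercept and slope away from the least-squares values;
# every nudged line has a larger sum of squared deviations.
nudged = [
    sum_squared_errors(intercept + da, slope + db, weight, sbp)
    for da in (-5.0, 0.0, 5.0)
    for db in (-0.2, 0.0, 0.2)
    if (da, db) != (0.0, 0.0)
]
print(round(best, 1), round(min(nudged), 1))
```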
             Linear Regression - Model
Population:  $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
Sample:      $Y_i = b_0 + b_1 X_i + e_i$
Fitted line: $\hat{Y}_i = b_0 + b_1 X_i$
[Figure: fitted line Ŷ = b0 + b1X, showing the actual value of Yi and the residual ei]
        Simple Linear Regression Model
• The population simple linear regression model:
     $y = a + bx + \varepsilon$   or   $\mu_{y|x} = a + bx$
  Here $a + bx$ is the nonrandom or systematic component and
  $\varepsilon$ is the random component.
• Where
      • y is the dependent (response) variable, the variable we wish to explain or
         predict;
      • x is the independent (explanatory) variable, also called the predictor variable;
         and
      • ε is the error term, the only random component in the model, and thus the only
         source of randomness in y.
                    Cont…
• $\mu_{y|x}$ is the mean of y when x is specified,
  also called the conditional mean of Y.
   Picturing the Simple Linear Regression Model
• The simple linear regression model posits an exact linear
  relationship between the expected or average value of Y, the
  dependent variable, and X, the independent or predictor variable:
     $\mu_{y|x} = a + bx$
• Actual observed values of Y differ from the expected value
  ($\mu_{y|x}$) by an unexplained or random error (ε):
     $y = \mu_{y|x} + \varepsilon = a + bx + \varepsilon$
[Figure: regression plot of the line μy|x = a + bx, with the intercept a and the error for one observation marked]
                 Assumptions of the Simple Linear
                       Regression Model
• The relationship between X and Y is a straight-line (linear)
    relationship.
• The values of the independent variable X are assumed fixed (not
    random); the only randomness in the values of Y comes from the
    error term ε.
• The errors ε are uncorrelated (i.e. independent) in successive
    observations. The errors ε are normally distributed with mean
    0 and variance σ² (equal variance). That is: ε ~ N(0, σ²)
[Figure: LINE assumptions of the simple linear regression model: identical normal distributions of errors, N(μy|x, σy|x²), all centered on the regression line μy|x = a + bx]
                Fitting a Regression Line
[Figure: scatter of the data, and three errors from the least squares regression line]
$\hat{y} = a + bx$ : the fitted regression line
$\hat{y}$ : the predicted value of Y for x
Error: $e_i = y_i - \hat{y}_i$
 Sums of Squares, Cross Products, and Least
            Squares Estimators
      Sums of Squares and Cross Products:

$$ l_{xx} = \sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n} $$
$$ l_{yy} = \sum (y - \bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n} $$
$$ l_{xy} = \sum (x - \bar{x})(y - \bar{y}) = \sum xy - \frac{(\sum x)(\sum y)}{n} $$

       Least-squares regression estimators:

$$ b = \frac{l_{xy}}{l_{xx}}, \qquad a = \bar{y} - b\bar{x}, \qquad \hat{y} = a + bx $$
                                                       Example
Patient       x        y         x2          y2         x·y
1           22.4    134.0      501.76    17956.0     3001.60
4           25.1     80.2      630.01     6432.0     2013.02
8           32.4     97.2     1049.76     9447.8     3149.28
2           51.6    167.0     2662.56    27889.0     8617.20
3           58.1    132.3     3375.61    17503.3     7686.63
5           65.9    100.0     4342.81    10000.0     6590.00
7           75.3    187.2     5670.09    35043.8    14096.16
6           79.7    139.1     6352.09    19348.8    11086.27
10          85.7    199.4     7344.49    39760.4    17088.58
9           96.4    192.3     9292.96    36979.3    18537.72
Total      592.6   1428.7    41222.14   220360.5    91866.46

$$ l_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 41222.14 - \frac{592.6^2}{10} = 6104.66 $$
$$ l_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} = 220360.47 - \frac{1428.70^2}{10} = 16242.10 $$
$$ l_{xy} = \sum xy - \frac{\sum x \sum y}{n} = 91866.46 - \frac{592.6 \times 1428.70}{10} = 7201.70 $$
$$ b = \frac{l_{xy}}{l_{xx}} = \frac{7201.70}{6104.66} = 1.18 $$
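The sums in this example can be reproduced with a short Python sketch. The variable names are illustrative; the intercept `a` is computed from the estimator a = ȳ - b x̄ stated earlier, although the slide itself only reports the slope.

```python
# x and y for the ten patients, in the order listed in the table
x = [22.4, 25.1, 32.4, 51.6, 58.1, 65.9, 75.3, 79.7, 85.7, 96.4]
y = [134.0, 80.2, 97.2, 167.0, 132.3, 100.0, 187.2, 139.1, 199.4, 192.3]
n = len(x)

# Sums of squares and cross products (computational forms)
l_xx = sum(v * v for v in x) - sum(x) ** 2 / n
l_yy = sum(v * v for v in y) - sum(y) ** 2 / n
l_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

# Least-squares estimators
b = l_xy / l_xx                    # slope
a = sum(y) / n - b * sum(x) / n    # intercept

print(round(l_xx, 2), round(l_yy, 2), round(l_xy, 2))
print(f"fitted line: y-hat = {a:.2f} + {b:.2f} x")
```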
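        Linear Regression - Variation
SST = SSR + SSE
SSR: variation due to regression.
SSE: random/unexplained variation.

$$ SST = \sum (Y_i - \bar{Y})^2, \qquad SSR = \sum (\hat{Y}_i - \bar{Y})^2, \qquad SSE = \sum (Y_i - \hat{Y}_i)^2 $$

[Figure: the three deviations at a point Xi, drawn relative to the fitted line and the mean Ȳ]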
     Contents of correlation and linear
                regression
• Correlation
• Introduction to simple linear regression
• Least-squares estimation of the parameters
                  Introduction
• Correlation and regression – for quantitative
  variables
   – Correlation: assessing the association between
     quantitative variables
   – Simple linear regression: description and prediction of
     one quantitative variable from another
• Only considering linear relationships
• When considering correlation or carrying out a
  regression analysis between two variables always
  plot the data on a scatter plot first
Scatter plot
Pearson Correlation Coefficient
Correlation – Linear Relationship
 Correlation Does Not Imply Causation
• Correlation does not mean causation
• If we observe high correlation between two variables, this does
  not necessarily imply that because one variable has a high
  value it causes the other to have a high value
• There may be a third variable causing a simultaneous change in
  both variables.
• Example:
   – Suppose we measured children’s shoe size and reading skills
   – There would be a high correlation between these two variables: as
     shoe size increases, so too do the child’s reading abilities
   – But one does not cause the other; the underlying variable is age
   – As age increases, so too do shoe size and reading ability
Example: Percentage of children immunized against
DPT and under-five mortality rate for 20 countries,
1992
Nation        Percentage   Mortality   Nation         Percentage   Mortality
              immunized    rate per                   immunized    rate per
                           1000 live                               1000 live
                           births                                  births
Bolivia       77           118         Greece         54           9
Brazil        69           65          India          89           124
Cambodia      32           184         Italy          95           10
Canada        85           8           Japan          87           6
China         94           43          Mexico         91           33
Czech Repu.   99           12          Poland         98           16
Egypt         89           55          Russian Fed. 73             32
Ethiopia      13           208         Senegal        47           145
Finland       95           7           Turkey         76           87
France        95           9           United King.   90           9
(xi = percentage immunized, yi = mortality rate per 1000 live births)
Nation        xi    yi    xi - x̄   yi - ȳ    Nation          xi    yi    xi - x̄   yi - ȳ
Bolivia       77    118   -0.4     59        Greece          54    9
Brazil        69    65                       India           89    124
Cambodia      32    184                      Italy           95    10
Canada        85    8                        Japan           87    6
China         94    43                       Mexico          91    33
Czech Repu.   99    12                       Poland          98    16
Egypt         89    55                       Russian Fed.    73    32
Ethiopia      13    208                      Senegal         47    145
Finland       95    7                        Turkey          76    87
France        95    9                        United King.    90    9
Mean (all 20 nations): x̄ = 77.4, ȳ = 59
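The deviation columns of this table lead directly to Pearson's r in its definitional form, the sum of products of deviations divided by the square root of the product of the sums of squared deviations. The following Python sketch fills in those columns for all 20 nations; the variable names are illustrative and the result is not quoted in the surviving slide text.

```python
from math import sqrt

# Percentage immunized (x) and under-five mortality per 1000 live births (y),
# left-hand column of nations first, then the right-hand column
immunized = [77, 69, 32, 85, 94, 99, 89, 13, 95, 95,
             54, 89, 95, 87, 91, 98, 73, 47, 76, 90]
mortality = [118, 65, 184, 8, 43, 12, 55, 208, 7, 9,
             9, 124, 10, 6, 33, 16, 32, 145, 87, 9]

n = len(immunized)
mean_x = sum(immunized) / n
mean_y = sum(mortality) / n

# Deviations from the means (the xi - x-bar and yi - y-bar columns)
dx = [xi - mean_x for xi in immunized]
dy = [yi - mean_y for yi in mortality]

r = sum(a * b for a, b in zip(dx, dy)) / sqrt(
    sum(a * a for a in dx) * sum(b * b for b in dy)
)
print(mean_x, mean_y, round(r, 2))
```

The means come out as 77.4 and 59, matching the table, and the coefficient is negative: nations with higher immunization percentages tend to have lower under-five mortality.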
      Non-Parametric Correlation
• Rank correlation may be used whatever type of pattern
  is seen in the scatter diagram; it does not specifically
  assess linear association but a more general association
• Spearman’s rank correlation rho
   – Non-parametric measure of correlation – doesn’t make any
     assumptions about the particular nature of the relationship
     between the variables, doesn’t assume a linear relationship
   – rho is a special case of Pearson’s r in which the two sets of
     data are converted to rankings
   – can test null hypothesis that the correlation is zero and
     calculate confidence intervals
Formula
              Linear regression
• Is used to explore the nature of the relationship
  between two “continuous” normally distributed
  random variables.
• Enables us to investigate the change in response
  variable, which corresponds to a given change in
  the explanatory variable.
• The ultimate objective of regression analysis is to
  predict or estimate the value of the response that
  is associated with a fixed value of the explanatory
  variable
                  Example:
     Cigarettes & coronary heart disease
                        Scatterplot
Cigarettes   CHD   Cigarettes   CHD
11           26    5            4
9            21    5            18
9            24    5            12
9            21    5            3
8            19    4            11
8            13    4            15
8            19    4            6
6            11    3            13
6            23    3            4
5            15    3            14
5            13
Scatterplot with Line of Best Fit
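A line of best fit for these 21 observations can be sketched in Python as follows. This is a hedged illustration: it uses the standard least-squares slope and intercept (sum of cross products over sum of squares of x, then ȳ minus slope times x̄), and the function name `predict` is hypothetical. The fitted coefficients are not stated in the surviving slide text.

```python
# Cigarette consumption (x) and CHD mortality (y): left-hand column pairs
# of the table first, then the right-hand column pairs
cigarettes = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5,
              5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13,
       4, 18, 12, 3, 11, 15, 6, 13, 4, 14]

n = len(cigarettes)
mean_x = sum(cigarettes) / n
mean_y = sum(chd) / n

# Least-squares slope and intercept
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(cigarettes, chd))
s_xx = sum((xi - mean_x) ** 2 for xi in cigarettes)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

def predict(x_value):
    """Predicted CHD value on the fitted line for a given x."""
    return intercept + slope * x_value

print(f"y-hat = {intercept:.2f} + {slope:.2f} x")
```

The slope is about 2.04 and the intercept about 2.37, so each additional unit of cigarette consumption is associated with roughly two more units of CHD mortality in these data.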
         Simple linear regression
• It is a model with a single regressor x that has a
  linear relationship with a response y.
• The simple linear regression model is
     $Y = \beta_0 + \beta_1 X + \varepsilon$
• Where:
   – Y = response variable
   – X = regressor variable
   – β0 = intercept
   – β1 = slope
   – ε = random error component
• X is
   – a controlled variable, not a random variable
   – a deterministic or mathematical variable
• Y
   – is a random variable and cannot be controlled
   – depends on the regressor variable
    Basic assumptions on the model
       $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$,  i = 1 to n
1. εi is a random variable with zero mean and variance σ²
   (unknown), i.e. E(εi) = 0 and V(εi) = σ²
2. εi and εj are uncorrelated for i ≠ j, so Cov(εi, εj) = 0
3. εi is a normally distributed random variable, with
   mean zero and variance σ²
        εi ~ N(0, σ²)
  Without the error term, β0 + β1xi is not a random variable; it is the true
  population mean of y at xi. Adding the error to it gives the actual or observed
  value:
     mean of y given xi:  E(y | xi) = β0 + β1xi   (because E(ε) = 0)
     observed value:      yi = β0 + β1xi + εi
  β0 + β1xi is not a random variable because it contains no error.
 Estimation: The Method of Least Squares
Estimation of a simple linear regression relationship involves finding estimated or
predicted values of the intercept and slope of the linear regression line.
     $\hat{y} = a + bx$
where $\hat{y}$ (y-hat) is the value of Y lying on the fitted regression line for a given
value of X.
    Fitting a Regression Line
The parameters β0 & β1 are unknown and must be estimated using sample data:
(x1,y1), (x2,y2), …, (xn,yn).
[Figure: scatter of the data, and three errors from the least squares regression line]
[Figure: CHD Mortality per 10,000 with the fitted prediction line; the residual for observation 9, ε9 = y9 - ŷ9, is measured from the predicted point (x9, ŷ9)]
Error: $e_i = y_i - \hat{y}_i$
                 Least square estimation

$$ SS_{residuals} = \sum_{i=1}^{n} e_i^2 \ \text{is minimum} $$

Least square estimation chooses the parameters (β0 & β1) so that the
sum of squares of all the differences between the observations yi
and the fitted line is minimum.
[Figure: observed yi and fitted ŷi at xi, with the vertical difference between them marked]
• The least squares estimators of β0 & β1 must satisfy two equations,
  the normal equations
• We have two normal equations and two unknowns, and the equations are
  independent; therefore we can uniquely fit β0 & β1
• So the estimators of β0 & β1 are the solution of the normal equations
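As a sketch in standard textbook notation (assumed here, not taken from the slides: b0 and b1 denote the estimates of β0 and β1, and x̄, ȳ are the sample means), setting the partial derivatives of the residual sum of squares to zero gives the two normal equations and their solution:

```latex
\begin{aligned}
\sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right) &= 0 \\
\sum_{i=1}^{n} x_i \left( y_i - b_0 - b_1 x_i \right) &= 0
\end{aligned}
\qquad \Longrightarrow \qquad
b_1 = \frac{S_{xy}}{S_{xx}}, \qquad b_0 = \bar{y} - b_1 \bar{x}

% with
S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), \qquad
S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2
```

The solution is unique as long as S_xx > 0, that is, as long as the x values are not all identical.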
Regression Statistics

$$SST = \sum (Y - \bar{Y})^2$$
$$SSR = \sum (\hat{Y} - \bar{Y})^2$$
$$SSE = \sum (Y - \hat{Y})^2$$

[Figure: Venn diagram of Y and X1]
• Variance to be explained by predictors (SST)
• Variance explained by X1 (SSR)
• Variance NOT explained by X1 (SSE)
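The decomposition SST = SSR + SSE can be checked numerically. The sketch below uses the weight and SBP values from the example table earlier in these slides; the function and variable names are illustrative assumptions, not part of the original material.

```python
# Weight (kg) and systolic blood pressure (mmHg) from the example table.
wt = [67, 69, 85, 83, 74, 81, 97, 92, 114, 85]
sbp = [120, 125, 140, 160, 130, 180, 150, 140, 200, 130]


def sums_of_squares(x, y):
    """Fit y = b0 + b1*x by least squares and return (SST, SSR, SSE)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx           # slope
    b0 = y_bar - b1 * x_bar    # intercept
    y_hat = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)                # total
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained
    return sst, ssr, sse


sst, ssr, sse = sums_of_squares(wt, sbp)
print(f"SST={sst:.1f}  SSR={ssr:.1f}  SSE={sse:.1f}  SSR+SSE={ssr + sse:.1f}")
```

Up to floating-point rounding, the printed SST and SSR+SSE agree, which is the identity the Venn diagram illustrates.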
Regression Statistics

$$R^2 = \frac{SSR}{SST}$$

     Coefficient of Determination
     to judge the adequacy of the regression model
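Regression Statistics

$$R = \sqrt{R^2} \quad \text{(with the sign of } S_{xy}\text{)}$$
$$R = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$

  Correlation
  measures the strength of the linear association between two variables.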
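A minimal sketch of the formula R = S_xy / sqrt(S_xx S_yy), again using the weight and SBP example values from these slides. The helper name `correlation` is an assumption made for illustration.

```python
import math

# Weight (kg) and systolic blood pressure (mmHg) from the example table.
wt = [67, 69, 85, 83, 74, 81, 97, 92, 114, 85]
sbp = [120, 125, 140, 160, 130, 180, 150, 140, 200, 130]


def correlation(x, y):
    """Return R = S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return s_xy / math.sqrt(s_xx * s_yy)


r = correlation(wt, sbp)
print(f"R = {r:.3f}, R^2 = {r * r:.3f}")
```

For these data R is positive (about 0.74), consistent with the positive relationship seen in the scatter diagram.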
Regression Statistics
      Standard Error for the regression model

$$S_e = \sqrt{S_e^2} = \hat{\sigma}$$
$$S_e^2 = \frac{SSE}{n - 2}, \qquad SSE = \sum (Y - \hat{Y})^2$$
$$S_e^2 = MSE$$
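As an illustration, the standard error can be computed directly from the residuals. This sketch assumes the weight and SBP example values from these slides; `standard_error` is a hypothetical helper name.

```python
import math

# Weight (kg) and systolic blood pressure (mmHg) from the example table.
wt = [67, 69, 85, 83, 74, 81, 97, 92, 114, 85]
sbp = [120, 125, 140, 160, 130, 180, 150, 140, 200, 130]


def standard_error(x, y):
    """Return S_e = sqrt(SSE / (n - 2)) for the least squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)  # S_e^2: two parameters estimated, so n - 2 df
    return math.sqrt(mse)


s_e = standard_error(wt, sbp)
print(f"S_e = {s_e:.2f} mmHg")
```

S_e is in the units of Y, so it can be read as a typical size of the vertical scatter around the fitted line.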
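        ANOVA

$$H_0: \beta_1 = 0$$
$$H_A: \beta_1 \neq 0$$

ANOVA table columns: df | SS | MS | Fcal | P-value

$$H_0: \beta_i = 0$$
$$H_1: \beta_i \neq 0$$
$$t_{(n-k-1)} = \frac{b_i - \beta_i}{S_{b_i}}$$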
Hypotheses Tests for Regression Coefficients

$$H_0: \beta_1 = 0$$
$$H_A: \beta_1 \neq 0$$
$$t_{(n-k-1)} = \frac{b_1 - \beta_1}{S_e(b_1)} = \frac{b_1 - \beta_1}{\sqrt{S_e^2 / S_{xx}}}$$
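A short sketch of the slope test, assuming the weight and SBP example values from these slides and one predictor (k = 1, so the degrees of freedom are n − 2). The function name and its `beta1_null` parameter are illustrative assumptions.

```python
import math

# Weight (kg) and systolic blood pressure (mmHg) from the example table.
wt = [67, 69, 85, 83, 74, 81, 97, 92, 114, 85]
sbp = [120, 125, 140, 160, 130, 180, 150, 140, 200, 130]


def slope_t_statistic(x, y, beta1_null=0.0):
    """Return (t, df) for testing H0: beta1 = beta1_null."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)              # S_e^2
    se_b1 = math.sqrt(mse / s_xx)    # S_e(b1)
    return (b1 - beta1_null) / se_b1, n - 2


t, df = slope_t_statistic(wt, sbp)
print(f"t = {t:.2f} on {df} degrees of freedom")
```

The statistic is then compared with the tabulated critical value; for a two-sided test at α = 0.05 with 8 degrees of freedom that value is 2.306.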
Confidence Interval on Regression Coefficients of linear model

$$b_1 - t_{\alpha/2,(n-k-1)} \sqrt{\frac{S_e^2}{S_{xx}}} \le \beta_1 \le b_1 + t_{\alpha/2,(n-k-1)} \sqrt{\frac{S_e^2}{S_{xx}}}$$
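The interval for the slope can be sketched as follows, under the assumption that the critical value t_crit is looked up from a t table and passed in by the caller (the Python standard library has no t quantile function). Data are the weight and SBP example values from these slides; the function name is hypothetical.

```python
import math

# Weight (kg) and systolic blood pressure (mmHg) from the example table.
wt = [67, 69, 85, 83, 74, 81, 97, 92, 114, 85]
sbp = [120, 125, 140, 160, 130, 180, 150, 140, 200, 130]


def slope_confidence_interval(x, y, t_crit):
    """Return (lower, upper) = b1 -/+ t_crit * sqrt(S_e^2 / S_xx)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    half_width = t_crit * math.sqrt((sse / (n - 2)) / s_xx)
    return b1 - half_width, b1 + half_width


# t_crit = 2.306 is the tabulated value for alpha = 0.05 and n - 2 = 8 df.
lower, upper = slope_confidence_interval(wt, sbp, t_crit=2.306)
print(f"95% CI for the slope: ({lower:.2f}, {upper:.2f}) mmHg per kg")
```

If the interval excludes zero, the test of H0: β1 = 0 at the same α rejects the null hypothesis.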
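$$H_0: \beta_0 = 0$$
$$H_A: \beta_0 \neq 0$$
$$t_{(n-k-1)} = \frac{b_0 - \beta_0}{S_e(b_0)} = \frac{b_0 - \beta_0}{\sqrt{S_e^2 \left( \dfrac{1}{n} + \dfrac{\bar{X}^2}{S_{xx}} \right)}}$$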
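 Confidence Interval on Regression Coefficients

$$b_0 - t_{\alpha/2,(n-k-1)} \sqrt{S_e^2 \left( \frac{1}{n} + \frac{\bar{X}^2}{S_{xx}} \right)} \le \beta_0 \le b_0 + t_{\alpha/2,(n-k-1)} \sqrt{S_e^2 \left( \frac{1}{n} + \frac{\bar{X}^2}{S_{xx}} \right)}$$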
$$T_0 = \frac{R \sqrt{n - 2}}{\sqrt{1 - R^2}}$$
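A small sketch of the statistic T0, assuming it is used to test whether the correlation is zero, with n − 2 degrees of freedom. It reuses the weight and SBP example values from these slides; the helper names are illustrative.

```python
import math

# Weight (kg) and systolic blood pressure (mmHg) from the example table.
wt = [67, 69, 85, 83, 74, 81, 97, 92, 114, 85]
sbp = [120, 125, 140, 160, 130, 180, 150, 140, 200, 130]


def correlation(x, y):
    """Return R = S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return s_xy / math.sqrt(s_xx * s_yy)


def correlation_t_statistic(r, n):
    """Return T0 = R * sqrt(n - 2) / sqrt(1 - R^2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


r = correlation(wt, sbp)
t0 = correlation_t_statistic(r, len(wt))
print(f"R = {r:.3f}, T0 = {t0:.2f}")
```

In simple linear regression T0 coincides with the t statistic for H0: β1 = 0, so the two tests lead to the same conclusion.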
     Diagnostic Tests For Regressions
[Figures: residual plots, residuals against Yi]
• Residuals for a non-linear fit
• Residuals for a quadratic function or polynomial
• Residuals are not homogeneous (increasing in variance)
Regression – important points
• Ensure that the distribution of predictor values is approximately
  uniform within the sampled range.
[Figures: scatter plots of Y against X]
                     Readings
Cigarettes xi   CHD yi   xi − x̄   yi − ȳ     Cigarettes xi   CHD yi   xi − x̄   yi − ȳ
11              26       5.05      11.48      5               4        -0.95     -10.52
9               21       3.05      6.48       5               18       -0.95     3.48
9               24       3.05      9.48       5               12       -0.95     -2.52
9               21       3.05      6.48       5               3        -0.95     -11.52
8               19       2.05      4.48       4               11       -1.95     -3.52
8               13       2.05      -1.52      4               15       -1.95     0.48
8               19       2.05      4.48       4               6        -1.95     -8.52
6               11       0.05      -3.52      3               13       -2.95     -1.52
6               23       0.05      8.48       3               4        -2.95     -10.52
5               15       -0.95     0.48       3               14       -2.95     -0.52
5               13       -0.95     -1.52
Mean            x̄ = 5.95   ȳ = 14.52
          Making a prediction
• Assume that we want to predict CHD
  mortality when cigarette consumption is 6.
[Figure: CHD mortality per 10,000 plotted against cigarette consumption, with the fitted line; the prediction and a residual are marked]
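The prediction can be worked through with the cigarette and CHD values from the table in these slides. This is a minimal sketch: `fit_line` and `predict` are hypothetical helper names, and the fit is the ordinary least squares line.

```python
# Cigarette consumption (x) and CHD mortality per 10,000 (y) for the 21 rows
# of the table, left half first, then right half.
cigarettes = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5,
              5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13,
       4, 18, 12, 3, 11, 15, 6, 13, 4, 14]


def fit_line(x, y):
    """Return (b0, b1) of the least squares line y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def predict(b0, b1, x_new):
    """Return the fitted value at x_new."""
    return b0 + b1 * x_new


b0, b1 = fit_line(cigarettes, chd)
y_hat = predict(b0, b1, 6)
print(f"y-hat = {b0:.2f} + {b1:.2f} x; prediction at x = 6: {y_hat:.2f}")
```

With these data the fitted line is roughly ŷ = 2.37 + 2.04x, so the predicted CHD mortality at a consumption of 6 is about 14.6 per 10,000.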
         Regression Coefficient
• Regression coefficient:
  – this is the slope of the regression line
  – indicates the strength of the relationship between
    the two variables
  – interpreted as the expected change in y for a one-
    unit change in x
  – can calculate a standard error for the regression
    coefficient
  – can calculate a confidence interval for the coefficient
  – can test the hypothesis that b = 0, i.e., that there is
    no relationship between the two variables
                   Intercept
• Intercept:
  – the estimated intercept a gives the value of y that
    is expected when x = 0
  – often not very useful as in many situations it may
    not be realistic or relevant to consider x = 0
  – it is possible to get a confidence interval and to
    test the null hypothesis that the intercept is zero
    and most statistical packages will report these
 Coefficient of Determination, R-Squared
• The coefficient of determination, or R-squared, is the proportion of
  variability in the data set that is explained by the statistical model
• Used as a measure of how good predictions from the model will be
• In simple linear regression, R-squared is the square of the correlation coefficient
• The regression analysis can be displayed as an ANOVA table; many
  statistical packages present the regression analysis in this format
          Adjusted R-Squared
• Adjusted R-squared
  – Sometimes an adjusted R-squared will be
    presented in the output as well as the R-squared
  – Adjusted R-squared is a modification to the R-
    squared to compensate for the number of
    explanatory or predictor variables in the model
    (more relevant when considering multiple
    regression)
  – The adjusted R-squared will only increase if the
    addition of the new predictor improves the
    model more than would be expected by chance
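The adjustment described above is commonly written as 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the number of observations and k the number of predictors. The sketch below assumes that standard formula (the slides do not state it) and uses a hypothetical function name.

```python
def adjusted_r_squared(r_squared, n, k):
    """Return adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1).

    n is the number of observations, k the number of predictors.
    """
    if n - k - 1 <= 0:
        raise ValueError("need more observations than parameters (n > k + 1)")
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - k - 1)


# Illustrative values only: R^2 = 0.50 from 11 observations.
print(adjusted_r_squared(0.50, n=11, k=1))  # one predictor
print(adjusted_r_squared(0.50, n=11, k=3))  # same R^2, three predictors
```

For a fixed R-squared, the adjusted value falls as k grows, which is how the penalty for extra predictors shows up.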
  Interpolation and Extrapolation
• Interpolation
  – Making a prediction for Y within the range of values of
    the predictor X in the sample used in the analysis
  – Generally this is fine
• Extrapolation
  – Making a prediction for Y outside the range of values
    of the predictor X in the sample used in the analysis
  – There is no way to check linearity outside the range of values
    sampled, so it is not a good idea to predict outside this range
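One way to make the distinction operational is to flag any prediction whose X lies outside the sampled range. The sketch below assumes the cigarette and CHD values from the table in these slides; `predict_with_range_check` is a hypothetical helper, not a standard library function.

```python
# Cigarette consumption (x) and CHD mortality per 10,000 (y) from the table.
cigarettes = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5,
              5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13,
       4, 18, 12, 3, 11, 15, 6, 13, 4, 14]


def predict_with_range_check(x, y, x_new):
    """Return (y_hat, is_extrapolation) for the least squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    # Extrapolation: x_new lies outside the range of sampled predictor values.
    is_extrapolation = x_new < min(x) or x_new > max(x)
    return b0 + b1 * x_new, is_extrapolation


for x_new in (6, 15):
    y_hat, outside = predict_with_range_check(cigarettes, chd, x_new)
    label = "extrapolation" if outside else "interpolation"
    print(f"x = {x_new}: y-hat = {y_hat:.1f} ({label})")
```

Here x = 6 lies inside the sampled range of 3 to 11, while x = 15 lies outside it, so the second prediction should be treated with caution.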