Linear Regression
(Cross-sectional Data)
Types of data used for analysis
• There are 3 types of data which econometricians might use for analysis:
1. Time series data
• Data on one or more variables of a single entity (e.g. an individual, company, or country) over multiple time periods
• Ex: Sales data of the ACI company for the last 10 years
2. Cross-sectional data
• Data on one or more variables of multiple entities at one point in time
• Ex: Sales data of all pharmaceutical companies for the year 2019
3. Panel/Pooled/Longitudinal data
• Combination of time series and cross-sectional data
• Data on one or more variables of multiple entities over multiple time periods
• Ex: Sales data of all pharmaceutical companies for the last 10 years
• This lecture focuses on cross-sectional data
• These two measures, Covariance and Correlation, are known as Measures of Association.
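• For reference, the standard sample definitions are:
$\mathrm{Cov}(X,Y) = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})$ and $r_{XY} = \frac{\mathrm{Cov}(X,Y)}{s_X s_Y}$
• The correlation $r_{XY}$ always lies between -1 and 1.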
How to draw Scatter diagram in Stata?
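A minimal command for the plain scatter plot, following the convention used on the next slide (and assuming the same testscr and str variables):
Stata Command: scatter testscr str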
Scatter diagram with fitted line
[Figure: scatter plot of testscr (600-700) against str (15-25), shown alone and with the fitted values line]
Stata Command: twoway (scatter testscr str) (lfit testscr str)
Correlation and Covariance in Stata
• Covariance: shows just a negative relation
• Correlation: shows the magnitude of the negative relation - weak
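These figures can be reproduced in Stata (a minimal sketch, assuming the testscr and str variables are loaded):
quietly correlate testscr str, covariance   // covariance matrix: sign of the relation
correlate testscr str                       // correlation matrix: sign and magnitude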
What is regression analysis?
• In statistical modeling, regression analysis is a set of statistical
processes for estimating the relationships between a dependent
variable and one or more independent variables.
• Regression analysis is a form of predictive modelling technique which investigates the relationship between a dependent variable (target/regressand/explained) and independent variable(s) (predictor/regressor/explanatory)
• One independent variable - Simple linear regression
• More than one independent variable - Multiple linear regression
• Regression analysis is used for:
• forecasting,
• time series modelling, and
• finding causal relationships between variables
How to run Simple Linear Regression in Stata?
Stata Command: regress testscr str
Simple Linear Regression Output in Stata
• We can also include some other variables, such as expense per student, district average income, etc.
Having an initial idea about the dataset:
Descriptive/Summary Statistics
Initial idea about the dataset:
Descriptive/Summary Statistics
• How to explain this table in your research paper?
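A minimal sketch of the Stata commands behind such a table (assuming the same dataset):
summarize testscr str            // mean, sd, min, max
summarize testscr str, detail    // adds percentiles, skewness, kurtosis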
• We want to minimise this distance between each observed $Y_i$ and the fitted line:
How OLS works
• The estimate of the residual is given by: $\hat{u}_i = Y_i - \hat{Y}_i$
• The estimate of the residual is the difference between the observed value of $Y_i$ and its fitted value $\hat{Y}_i$
• In other words, it is the "error" we have made
• Thus: $\hat{u}_i = Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i$
• Here we say that $\hat{\beta}_1$ is the "least squares" estimator of $\beta_1$; it solves:
$\min_{b_0, b_1} \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2$
• What we are actually doing is minimising the sum of squared residuals
• The fitted value is $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$
• Remember, $\hat{u}_i = Y_i - \hat{Y}_i$
• So we get: $\min \sum_{i=1}^{n} \hat{u}_i^2 = \min \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2$
• This is OLS
• The OLS estimator minimizes the average squared difference between the actual values of $Y_i$ and the predicted values based on the estimated line
Why use OLS and not something else?
• OLS is a generalisation of the sample average:
• If the "line" is just an intercept (no X), then the OLS estimator is just the sample average of $Y_1, \ldots, Y_n$ (i.e. $\bar{Y}$)
• The OLS estimator has some desirable properties:
• Under certain assumptions, it is unbiased (that is, $E(\hat{\beta}_1) = \beta_1$), and
• It has a tighter sampling distribution than some other candidate estimators of $\beta_1$ (more on this later)
The OLS β
• $\hat{\beta}_1$, found by solving the minimisation problem above, is given by:
$\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$
• Or alternatively, written as sample moments: $\hat{\beta}_1 = \dfrac{s_{XY}}{s_X^2}$
• The intercept is given by: $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$
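These sample moments map directly onto Stata's stored results; a minimal sketch of computing the slope and intercept by hand (assuming testscr and str are in memory, and using correlate's stored results r(cov_12) and r(Var_2)):
quietly correlate testscr str, covariance
scalar b1 = r(cov_12)/r(Var_2)      // slope = s_XY / s_X^2 (str is the regressor)
quietly summarize str
scalar xbar = r(mean)
quietly summarize testscr
scalar ybar = r(mean)
scalar b0 = ybar - b1*xbar          // intercept = Ybar - b1*Xbar
display "slope = " b1 "  intercept = " b0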
Back to our example
• Our model: $TestScore_i = \beta_0 + \beta_1 STR_i + u_i$
• With our data we use EViews to estimate our model:
$\hat{\beta}_0 = 698.9$
$\hat{\beta}_1 = -2.28$
On our graph:
• Estimated slope: $\hat{\beta}_1 = -2.28$
• Estimated intercept: $\hat{\beta}_0 = 698.9$
• Estimated regression line: $\widehat{TestScore} = 698.9 - 2.28 \times STR$
Interpretation of β
• So we have: $\widehat{TestScore} = 698.9 - 2.28 \times STR$
• Given that $\hat{\beta}_1 = -2.28$, we can say the following:
• "An increase of one student per teacher will decrease TestScore by 2.28 points, holding all other variables constant"
• This is the answer to our question and the slope of our line, $\hat{\beta}_1 = -2.28$
Can we interpret the intercept?
• Here the intercept is $\hat{\beta}_0 = 698.9$
• In this context it means that districts with zero students per teacher are predicted to get an average score of 698.9
• Odd!
• In this context it doesn't have any meaningful economic interpretation
• Look at the graph:
• The line is extrapolated outside of our data range!
We can predict!
• Given: $\widehat{TestScore} = 698.9 - 2.28 \times STR$
• One of the districts in the dataset has an STR = 19.33 and a TestScore = 657.8
• Predicted TestScore: $698.9 - 2.28 \times 19.33 = 654.8$
• Our residual (error): $657.8 - 654.8 = 3.0$
• OK, so not 100% correct
• Other factors?
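Fitted values and residuals can also be obtained in Stata with the standard post-estimation command predict (a minimal sketch):
regress testscr str
predict yhat             // fitted values: 698.9 - 2.28*str
predict uhat, residuals  // residuals: testscr - yhat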
Why do we include a Disturbance term?
• The disturbance term can capture a number of features:
• We always omit some determinants of Y
• Model misspecification
• Functional misspecification
• There may be errors in the measurement of Y that cannot be modelled
• Random outside influences on Y which we cannot model
Beware of our estimate of β
• If we do make “errors” the resulting bias will manifest itself in our
estimates
• What does this mean?
• It could be that we underestimate or overestimate a statistical
relationship
• The OLS regression line is an estimate, computed using our sample of data; a different sample would have given a different value of $\hat{\beta}_1$
• How can we:
• Conduct hypothesis testing?
• Make inferences (what do the numbers mean?)
• Etc.
• There are key assumptions which underlie our analysis
• These need to hold for us to do this analysis with confidence
Assumptions underlying OLS
• To be able to make any statistical inference from our results we need
to make a number of assumptions
• These assumptions are needed to make OLS work
• However, as we will see later, these assumptions don't always hold
• There are five main assumptions which we will look at first
Assumption 1
• $E(u_i) = 0$
• The errors have mean zero
• The error is just a random process
• Meaning if we took a large number of samples, the mean disturbance would be zero
Assumption 2
• $\mathrm{Var}(u_i) = \sigma^2 < \infty$
• The variance of the errors is constant and finite
• This imposes that all disturbance terms have the same variance
• Also called Homoskedasticity
Assumption 3
• $\mathrm{Cov}(u_i, u_j) = 0$, where $i \neq j$
• The errors are linearly independent of each other
• We require that the disturbance terms are independently distributed
• In other words, not correlated with each other
• Also called Serial independence
Assumption 4
• $\mathrm{Cov}(u_i, X_i) = 0$
• No relationship between the error term and the independent variable
• No correlation between X and the error
• We are saying that X is not random, i.e. not determined by chance
Assumption 5
• $u_i \sim N(0, \sigma^2)$
• $u_i$ is normally distributed with mean zero and variance $\sigma^2$
• This assumption is what is needed to be able to make valid inferences about the population parameters from the sample parameters
However the assumptions don’t always hold!
• Failure of…
• A1, leads to: A biased intercept
• A2, leads to: Heteroskedasticity
• A3, leads to: Serial Correlation
• A4, leads to: Endogeneity (biased estimates)
• A5, leads to: Outliers
• We will look at some of these issues later
Additional Assumptions
• No (perfect) multicollinearity
• Linearity
• The equation is linear in parameters (not necessarily in variables)
• X has some variability
• We need variability in X to be able to estimate $\beta_1$
• n > 2
• We need more observations than the number of parameters we estimate
OLS is BLUE!
• Under some of these assumptions, we can say that OLS is BLUE:
• Best (smallest variance)
• Linear (linear in parameters)
• Unbiased ($E(\hat{\beta}) = \beta$)
• Estimator ($\hat{\beta}_0$ and $\hat{\beta}_1$ are estimators of $\beta_0$ and $\beta_1$)
• This is proven by the Gauss-Markov theorem*
Moments of $\hat{\beta}_1$
• Mean: $E(\hat{\beta}_1) = \beta_1$
• Meaning that it is "unbiased"
• Variance: $\mathrm{Var}(\hat{\beta}_1) = \dfrac{\sigma^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$
• Remember, we don't know $\sigma^2$, so we use its estimate $\hat{\sigma}^2$; however, we invoke the law of large numbers (if n is large), so we get (approximately) the same answer
Hypothesis Testing:
Inference from our sample
• Once again, back to our example
• Remember what stage we got to?
• With the benefit of computers, hypothesis testing is easy
• Our EViews output already had most of what we need!
$t = \dfrac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)} = \dfrac{-2.28}{0.52} = -4.38$
• $|t| = 4.38 > 1.96$, so we reject the null that $\beta_1 = 0$
• Typically here we use the phrase: $\hat{\beta}_1$ is statistically different from zero
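The same t-statistic can be recovered in Stata from the stored coefficient and standard error (a minimal sketch):
quietly regress testscr str
display "t = " _b[str]/_se[str]    // approximately -4.38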
P-Value
• We could have saved even more time if we just used the P-Value
• The p-value shows how significant a variable really is
• Before we assumed a 5% significance level, here we don’t have to
• We can say that the coefficient is statistically different from zero at the 5%
level if the p-value is less than or equal to 0.05
• Here we can see that it is significant even beyond the 1% level!
Confidence intervals
• In general, if the sampling distribution of an estimator is normal for large n, then a 95% confidence interval can be constructed as:
• [estimator ± t-crit × standard error], where t-crit ≈ 1.96 for 95%
• So a 95% confidence interval for $\beta_1$ is: $-2.28 \pm 1.96 \times 0.52 = (-3.30, -1.26)$
• We are "95% confident that $\beta_1$ is within this interval"
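The regression output reports this interval directly, but it can also be computed from the stored results (a sketch using the 1.96 normal critical value):
quietly regress testscr str
display "[" _b[str] - 1.96*_se[str] ", " _b[str] + 1.96*_se[str] "]"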
More tips
• A convention for reporting estimated regressions:
• Put standard errors in parentheses below the estimates
Test score = 698.9 – 2.28 × STR
(10.4) (0.52)
• This is how most journals will present their results
• This is also how you should present your results
Goodness of fit
• How well does the regression line fit or explain the data?
• There are two regression statistics that provide complementary measures of the quality of fit:
• The $R^2$ measures the fraction of the variance of Y that is explained by X
• The standard error of the regression (SER) measures the fit in the units of Y
$R^2$
• It tells us "what percentage of the variation in Y can be explained by X"
• In other words, how much can our model explain the variations in Y?
• $R^2$ is between 0 and 1
• Usually given by $R^2 = \dfrac{ESS}{TSS}$, or equivalently $R^2 = 1 - \dfrac{RSS}{TSS}$
In our example
• We have: $R^2 = 0.05$
• So our model can only explain 5% of the variation in Y
• Not very good, it seems
• Typically we use the Adjusted $R^2$ ($\bar{R}^2$), as the normal $R^2$ has some problems
• As we add more variables to the equation, the $R^2$ will never decrease
• To account for the loss in degrees of freedom we use $\bar{R}^2$
• It penalises for adding variables if their added effect isn't beneficial
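Both statistics are reported in the regression output and stored by Stata as e(r2) and e(r2_a) (a minimal sketch):
quietly regress testscr str
display "R2 = " e(r2) "   Adjusted R2 = " e(r2_a)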
Be careful
• An $R^2$ (or $\bar{R}^2$) of 1 doesn't mean you've solved everything
• The regression could be spurious
• Omitted variable (it might be Z, not X)
• Doesn't imply causality
• Can't compare $R^2$'s of different models
• A low $R^2$ doesn't mean a wrong choice of X
More tips: Interpretation
• $y = \beta_0 + \beta_1 x$; both y and x are not logged.
• "If x changes by 1 unit, y changes by $\beta_1$ units"
• $y = \beta_0 + \beta_1 \ln(x)$; only x is logged.
• "If x changes by 1%, y changes by $(\beta_1 \times 0.01)$ units"
• $\ln(y) = \beta_0 + \beta_1 x$; only y is logged.
• "If x changes by 1 unit, y changes by $(100 \times \beta_1)$%"
• $\ln(y) = \beta_0 + \beta_1 \ln(x)$; both are logged.
• "If x changes by 1%, y changes by $\beta_1$%". This is analogous to the elasticity interpretation (e.g. price elasticity of demand).
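A sketch of how such specifications are estimated in Stata (the variable names ln_y and ln_x are illustrative):
gen ln_y = ln(testscr)
gen ln_x = ln(str)
regress ln_y ln_x      // log-log: the slope is an elasticity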
Test score = 698.9 – 2.28 × STR,  $R^2$ = .05, SER = 18.6
(10.4) (0.52)
The slope coefficient is statistically significant and large in a policy sense, even though STR explains only a small fraction of the variation in test scores.
Before we continue
• We have been looking at the simple case of one X variable
• Typically we get economic relationships involving more than one explanatory variable
• We can form a general equation:
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + u_i$
• Typically we summarise this using matrix algebra
Let's expand our original example
• How about we add a third variable: $TestScore_i = \beta_0 + \beta_1 STR_i + \beta_2 AVGINC_i + u_i$
• Where AVGINC is the district average income (in £thousands)
• It could possibly affect student performance
• Let's estimate a model using EViews
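The Stata equivalent is a one-line extension of the simple regression (a sketch, assuming the income variable is named avginc):
regress testscr str avginc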
Interpretation
• In our multivariate case our interpretation differs slightly
• Interpreting $\hat{\beta}_1$ we now say:
• "An increase of one student per teacher will decrease TestScore by 0.64, holding AVGINC constant"
• When we interpret $\hat{\beta}_j$, we are looking at the effect of a change in a particular $X_j$, holding all others ($X_k$, where $k \neq j$) constant
• So in general we say:
• "For a one unit increase in $X_j$, Y increases by $\beta_j$ units, holding all other variables constant"
• Sometimes you may read the phrase ceteris paribus
• This is Latin for "all other things being equal or held constant"
• We can check each individual variable separately for significance
• We create our null hypothesis $H_0: \beta_j = 0$, against an alternative $H_1: \beta_j \neq 0$, for each of our coefficients
• We can conduct t-tests again
• We can look at the P-value
• AVGINC is significant at 1%!
• STR is significant at 10%
Overall significance
• We could test the overall significance of the regression
• We can perform an F-test, given by the formula:
$F = \dfrac{(RSS_R - RSS_U)/m}{RSS_U/(n-k)}$
• The subscript R denotes the "restricted" model and U the "unrestricted" one; m is the number of restrictions
• k = number of variables (including the constant)
• n = number of observations
Example
• Say we wanted to test the joint significance of STR and AVGINC
• As we only have 2 explanatory variables, this is a test of overall significance
• We want to ask: are our variables jointly different from zero?
• We set our hypotheses:
• $H_0: \beta_1 = \beta_2 = 0$ (that the restrictions are valid)
• $H_1$: $H_0$ isn't true (that our model is statistically significant)
• Actually, here it only takes one coefficient to be significantly different from zero for us to reject our null
• The next step is to estimate both the restricted and the unrestricted model and get the RSS from both
• Then calculate the F-statistic using the formula
• Then find the F-critical value with degrees of freedom $(m, n-k)$
• If F-stat > F-crit, we reject the null hypothesis
• EViews does this for us!
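In Stata the same joint test is one post-estimation command (a minimal sketch; test performs a Wald test of the stated restrictions):
quietly regress testscr str avginc
test str avginc        // H0: coefficients on str and avginc are both zero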
• As we are testing the overall significance, EViews automatically generates the F-stat for us
• It also gives us a P-value for the F-stat, so we don't have to look up critical values
• Here we can reject our null hypothesis at the 1% level!
• We can say that our variables are jointly statistically different from zero
What happens when assumptions break?
• We get problems, that's what!
Heteroskedasticity
• Recall our second assumption: $\mathrm{Var}(u_i) = \sigma^2$
• The important thing we need is that the disturbances should have a constant variance, independent of i
• However, sometimes we instead get this: $\mathrm{Var}(u_i) = \sigma_i^2$
• This means that the variance can change for every different observation
95
Homoskedasticity
Hetero Example 1
Hetero Example 2
• OLS is still unbiased and consistent!
• No explanatory variable is correlated with the error term
• So we still get a good estimate of $\beta$ on average
• However, it is no longer efficient!
• The variance increases, so the OLS estimators are inefficient
• Look at an example of the distribution of $\hat{\beta}_1$:
• So, with our variances affected, our standard errors have been affected as well
• These are essential to our statistical tests!
• When we have heteroskedasticity, OLS underestimates the variances (and SEs)
• This causes inflated t-stats and F-stats
• So our hypothesis testing is unreliable
• We will end up rejecting the null too often
Detecting Heteroskedasticity
• Use our eyes!
• Graphing our data is very important
• This is more of an informal method which can only really be done in the univariate case
• More formally, there are many tests built into econometric packages, two of which are:
• The Breusch-Pagan test
• White's test
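Both tests are available in Stata as post-estimation commands after regress (a minimal sketch):
quietly regress testscr str avginc
estat hettest          // Breusch-Pagan test
estat imtest, white    // White's test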
The remedy?
• Most often, econometricians will use "robust errors", or more precisely, standard errors robust to heteroskedasticity
• This means that the standard errors are computed in a way that corrects for heteroskedasticity
• Most common solution:
• "White's corrected standard error estimates"
• This fixes the estimation of the variances and covariances
• The mathematical nature of it is not necessary to know here
• Once we have detected and diagnosed heteroskedasticity using EViews, we can ask EViews to cure it!
• White's corrected standard error estimates are an option that we can include when conducting our regression
• Although these corrections to the standard errors help and make our statistical tests more reliable, our coefficient estimates will remain possibly slightly inefficient
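The Stata analogue of this option is the vce(robust) request for heteroskedasticity-robust standard errors (a minimal sketch):
regress testscr str avginc, vce(robust)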
Multicollinearity
• We had assumed that there is no exact linear relationship among the explanatory variables
• More importantly, that there is no "perfect" multicollinearity
• For example, given two variables $X_2$ and $X_3$, suppose it turns out that $X_3 = 2X_2$ (an illustrative relationship)
• Here we can see that $X_3$ is a linear function of $X_2$
• So we have variables that move in the exact same way
Consequences of imperfect multicollinearity
• As mentioned before, some collinearity is unavoidable
• As long as it isn't too severe, we can just ignore it
• OLS continues to give BLUE estimates
• However, the variances can sometimes be hugely affected
• Typically it causes the variances to increase
• This means that our test statistics become unreliable again
Consequences of imperfect multicollinearity
• Here, as the variances (and SEs) are typically larger than normal, this will reduce our test statistics
• This means that we are more likely to fail to reject the null when we shouldn't have
• Meaning we may end up saying that a variable is insignificant when it actually is significant!
• It can also lead to unexpected signs of coefficients and inflated coefficient estimates
Detection
• There are a few (admittedly unclear) methods
• We could use Variance Inflation Factors (VIF)
• This detects inflated variances, suggesting multicollinearity
• Rule of thumb: a VIF > 10 could mean we have a problem
• A simpler way of looking for collinearity is to look at the correlation coefficients
• EViews can produce a correlation matrix
• Remember, correlation is between -1 and 1
• Rule of thumb: at correlations of around 0.8 we should start to worry; at 0.9, really worry!
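Both checks are straightforward in Stata (a minimal sketch):
quietly regress testscr str avginc
estat vif              // rule of thumb: worry if VIF > 10
correlate str avginc   // pairwise correlations of the regressors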
• If we suspect we have a problem with two variables, estimate the model with each one separately
• If we see:
• Changes in the sign of the coefficients
• Changes in the size of the standard errors
• These could be signs that we have a problem
A remedy?
• As unclear as the detection is, the solutions are also unclear
• We could choose to "drop" a problematic variable
• But what if, theoretically or logically, it should be included?
• We could re-formulate the model
• But what if our current specification is correct?
• The problem is, we could end up doing more damage by trying to correct for things
• If the problem is significant, then something should be attempted
• If not, sometimes it's best to acknowledge you have a problem and move on (do nothing)
• Just be cautious about the problems you may have, though
A note on serial correlation
• If you are looking at cross-sectional data then this really isn't an issue
• The main issue there is heteroskedasticity
• If you are using time series data then this is a big problem!
• It means that OLS fails and that we need a different model to deal with this issue
• This moves us on to modern time series analysis
End