RATS Programming Manual

2nd Edition

Walter Enders
Department of Economics, Finance & Legal Studies
University of Alabama
Tuscaloosa, AL 35487
wenders@cba.ua.edu
and
Thomas Doan
Estima
Evanston, IL 60201
tomd@estima.com

Draft
April 3, 2014

© 2014 by Walter Enders and Thomas Doan


Copyright

This book is distributed free of charge, and is intended for personal, non-
commercial use only. You may view, print, or copy this document for your own
personal use. You may not modify, redistribute, republish, sell, or translate
this material without the express permission of the copyright holders.
Contents

Preface

1 Introduction
1.1 What Are Your Options?
1.2 Which Should You Use?
1.3 Three Words of Advice
1.4 General Stylistic Tips
1.5 About This E-Book

2 Regression and ARIMA Models
2.1 The Data Set
2.2 Linear Regression and Hypothesis Testing
2.2.1 Examples using RESTRICT
2.3 The LINREG Options
2.4 Using LINREG and Related Instructions
2.5 ARMA(p,q) Models
2.6 Estimation of an ARMA(p,q) process with RATS
2.6.1 Identification
2.6.2 Estimation
2.6.3 Diagnostic Checking
2.7 An Example of the Price of Finished Goods
2.8 Automating the Process
2.8.1 Introduction to DO Loops
2.9 An Example with Seasonality
2.10 Forecasts and Diagnostic Checks
2.11 Examining the Forecast Errors
2.12 Coefficient Stability
2.13 Tips and Tricks
2.13.1 Preparing a graph for publication
2.13.2 Preparing a table for publication
Example 2.1 Introduction to basic instructions
Example 2.2 Engle-Granger test with lag length selection
Example 2.3 Estimation and diagnostics on ARMA models
Example 2.4 Automated Box-Jenkins model selection
Example 2.5 Seasonal Box-Jenkins Model
Example 2.6 Out-of-sample forecasts with ARIMA model
Example 2.7 Comparison of Forecasts
Example 2.8 Stability Analysis

3 Non-linear Least Squares
3.1 Nonlinear Least Squares
3.2 Using NLLS
3.3 Restrictions: Testing and Imposing
3.4 Convergence and Convergence Criteria
3.5 ESTAR and LSTAR Models
3.6 Estimating a STAR Model with NLLS
3.7 Smooth Transition Regression
3.8 An LSTAR Model for Inflation
3.9 Functions with Recursive Definitions
3.10 Tips and Tricks
3.10.1 Understanding Computer Arithmetic
3.10.2 The instruction NLPAR
3.10.3 The instruction SEED
Example 3.1 Simple nonlinear regressions
Example 3.2 Sample STAR Transition Functions
Example 3.3 STAR Model with Generated Data
Example 3.4 Smooth Transition Break
Example 3.5 LSTAR Model for Inflation
Example 3.6 Bilinear Model

4 Maximum Likelihood Estimation
4.1 The MAXIMIZE instruction
4.2 ARCH and GARCH Models
4.3 Using FRMLs from Linear Equations
4.4 Tips and Tricks
4.4.1 The Simplex Algorithm
4.4.2 BFGS and Hill-Climbing Methods
4.4.3 The CDF instruction and Standard Distribution Functions
Example 4.1 Likelihood maximization
Example 4.2 ARCH Model, Estimated with MAXIMIZE
Example 4.3 GARCH Model with Flexible Mean Model

5 Standard Programming Structures
5.1 Interpreters and Compilers
5.2 DO Loops
5.3 IF and ELSE Blocks
5.4 WHILE and UNTIL Loops
5.5 Estimating a Threshold Autoregression
5.5.1 Estimating the Threshold
5.5.2 Improving the Program
5.6 Tips and Tricks
Example 5.1 Illustration of DO loop
Example 5.2 Illustration of IF/ELSE
Example 5.3 Illustration of WHILE and UNTIL
Example 5.4 Threshold Autoregression, Brute Force
Example 5.5 Threshold Autoregression, More Flexible Coding

6 SERIES and Dates
6.1 SERIES and the workspace
6.2 SERIES and their integer handles
6.3 Series Names and Series Labels
6.4 Dates as Integers
6.5 Tips and Tricks
Example 6.1 Series and Workspace Length
Example 6.2 Series handles and DOFOR
Example 6.3 Date calculations and functions

7 Nonstationary Variables
7.1 The Dickey-Fuller Test
7.1.1 Dickey-Fuller testing procedures
7.1.2 DOFOR loops and the REPORT instruction
7.2 Other Tests
7.3 Tests with Breaks
7.4 Two Univariate Decompositions
7.4.1 Hodrick-Prescott Filter
7.4.2 The Beveridge and Nelson Decomposition
7.5 Cointegration
7.5.1 The Engle-Granger Methodology
7.5.2 The Johansen Procedure
Example 7.1 Dickey-Fuller Tests
Example 7.2 Unit Root Tests in a Loop
Example 7.3 Other Unit Root Tests
Example 7.4 Unit Root Test with Break: Simulated Data
Example 7.5 Unit Root Tests with Breaks
Example 7.6 Trend Decompositions
Example 7.7 Cointegration

A Probability Distributions
A.1 Univariate Normal
A.2 Univariate Student (t)
A.3 Chi-Squared Distribution
A.4 Gamma Distribution
A.5 Multivariate Normal

B Quasi-Maximum Likelihood Estimations (QMLE)

C Delta method

D Central Limit Theorems with Dependent Data

Bibliography

Index

Preface

This is an update of the RATS Programming Manual written in 2003 by Walter
Enders. That was, and this is, a free e-book designed to help you learn
better how to use the more advanced features of RATS. Much has changed
with RATS over the intervening ten years. It has new data types, more flexible
graphics and report-building capabilities, many new and improved procedures,
and countless new example files.

The practice of econometrics has changed as well. It's much more common
(and almost expected) to use more computationally intensive methods,
such as simulations and bootstrapping, sample stability analysis, etc. These
techniques will often require use of programming beyond the pre-packaged
instructions and procedures, and that's what this e-book is here to explain.

The econometrics used in the illustrations is drawn from Enders (2010), but
there is no direct connection between the content of this e-book and the
textbook. If you have questions about the underlying statistical methods, that
book would be your best reference.

Because the goal is to help you understand how to put together usable
programs, we've included the full text of each example in this book. The
running examples are also available as separate files.

Chapter 1

Introduction

This book is not for you if you are just getting familiar with RATS. Instead, it is
designed to be helpful if you want to simplify the repetitive tasks you perform
in most of your RATS sessions. Performing lag length tests, finding the best
fitting ARMA model, finding the most appropriate set of regressors, and setting
up and estimating a VAR can all be automated using the RATS programming
language. As such, you will not find a complete discussion of the RATS
instruction set. It is assumed that you know how to enter your data into the
program and how to make the standard data transformations. If you are
interested in learning about any particular instruction, you can use the RATS
Help menu or refer to the Reference Manual and User's Guide.

The emphasis here is on what we call the RATS programming language. These
are the instructions and options that enable you to write your own advanced
programs and procedures and to work with vectors and matrices. The book
is intended for applied econometricians conducting the type of research that
is suitable for the professional journals. To do state-of-the-art research, it is
often necessary to go "off the menu". By the time a procedure is on the menu
of an econometric software package, it's not new. This book is especially for
those of you who want to start the process of going off the menu. With the
power of modern computers, it's increasingly expected that researchers justify
their choices of things like lag lengths, test robustness through bootstrapping,
check their model for sample breaks, etc. While some of these have been
standardized and are incorporated into existing instructions and procedures,
many have not, and in some cases cannot be because they are too specific to an
application. Sometimes, the programming required is as simple as throwing a
loop around the key instruction. But often it will require more than that.

Of course, it will be impossible to illustrate even a small portion of the vast
number of potential programs you can write. Our intent is to give you the tools
to write your own programs. Towards that end, we will discuss a number of the
key instructions and options in the programming language and illustrate their
use in some straightforward programs. We hope that the examples provided
here will enable you to improve your programming technique. This book is
definitely not an econometrics text. If you are like us, you find it too difficult
to learn econometrics and the programming tools at the same time. As such, we
will try not to introduce any sophisticated econometric methods or techniques.
Moreover, all of the examples will use a single data set. All examples are
compatible with RATS version 8.0 or later.


1.1 What Are Your Options?


If you need some calculation which can't be done by just reading the data,
doing some transformations, running basic statistical instructions (such as
least squares regressions) and reporting the results, then you have three
general platforms in which to do your work:

General Purpose Programming Language


Forty years ago, almost all statistical programming was done in Fortran, which
is still heavily used in science and engineering. (If you ever see an astronomer
or particle physicist scroll through a program, it will probably be Fortran.)
Some economists still use Fortran or C++, and most of the high-level
statistical packages are written in them. (RATS uses C++.) These are compiled
languages, which means that they look at the entire program, optimize it and
produce an executable program for a specific machine. This makes them harder
to write and even harder to debug (you have to make a change and regenerate
the executable in order to test something), but as a result of being compiled
they are very fast. The great disadvantage is that they, by and large,
manipulate individual numbers and pairs of numbers, rather than matrices. While
you can obtain and use packages of subroutines that do specific matrix
calculations, you don't use matrices in "formula translation" form. (The phrase
FORmula TRANslation is the source of the Fortran name.) Since matrix
calculations form the backbone of most work in econometrics, this isn't
convenient.

Math Packages
The two most prominent of these in econometrics are Matlab® and Gauss™,
but there are others. These are primarily designed as matrix programming
languages. Since matrix manipulations are so important in statistics in general
and econometrics in particular, this makes them much simpler to use than
general purpose languages.

These are designed to do the relatively small number of their built-in functions
very well. Some "hot spot" calculations are specially optimized so they are
faster than the same thing done using a compiled language.

This does come at a cost. Particularly in time series work, a "data as matrix"
view doesn't work well, because the data matrix for one operation with one
set of lags is different from the one needed with a different set. If you look
at a program for, for instance, the estimation of a Vector Autoregression, very
little of it will be matrix calculations like B=(X'X)^-1*(X'Y); most will be
moving information around to create the X and Y matrices.

High Level Statistical Packages


RATS is an example of this. These have a collection of built-in instructions for
specific statistical calculations. All of them, for instance, will have something
similar to the RATS LINREG instruction, which takes as input a dependent
variable and a collection of explanatory variables and performs a multiple
linear regression, producing output that typically includes summary statistics
on residuals and fit, and standard errors and significance levels for the
coefficients. While the calculations for a simple LINREG are straightforward,
what RATS and similar programs are doing for you simplifies considerably what
would be required to do the same in a math package:

1. You can adjust the model by just changing the list of variables, instead of
having to re-arrange the input matrices.
2. You don't have to figure out yourself how to display the output in a usable
form. That can be a considerable amount of work if you want your output
to make sense by itself, without having to refer back to the program.
3. While most of the summary statistics are simple matrix functions of the
data and residuals, some common ones (such as the Durbin-Watson and
Ljung-Box Q) aren't quite so simple, and thus are often omitted by people
using math packages.

In addition to these, RATS also takes care of adjusting the sample range to allow
for lags of the explanatory variables and for any other type of missing values.
RATS also allows you to use lagged values without actually making a shifted
copy of the data, which is a necessary step both in a math package and in a
statistical package which isn't designed primarily for time series work.
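
As a minimal sketch of what this looks like in practice, the lines below read the data set described in Chapter 2 and run a regression with lags written directly on the regressor list. The particular regression (GDP growth on its own lags and lags of the change in the 3-month rate) is only a hypothetical illustration and is not one of the book's examples.

```rats
* Assumes QUARTERLY(2012).XLS (described in Chapter 2) is available.
cal(q) 1960:1
all 2012:4
open data quarterly(2012).xls
data(org=obs,format=xls)
*
set dlrgdp = log(rgdp) - log(rgdp{1})
set drs    = tb3mo - tb3mo{1}
*
* Hypothetical regression: GDP growth on a constant, four of its own
* lags and two lags of the change in the 3-month rate. The lags are
* written with {...} on the regressor list, so no shifted copies of
* the series are created, and LINREG skips the early observations
* that are lost to differencing and lagging.
linreg dlrgdp
# constant dlrgdp{1 to 4} drs{1 to 2}
```

Changing the lag structure only requires editing the list on the supplementary card; the estimation range adjusts automatically.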
Which high-level commands are built into a particular statistical package
depends upon the intended market and, to a certain extent, the philosophy of
the developers. The "TS" in RATS stands for "Time Series", so RATS makes it easy
to work with lags, and includes instructions for handling vector autoregressions
and GARCH models, computing forecasts and impulse responses, and estimating
ARIMA models, state-space models and spectral methods. While it also handles
cross-sectional techniques such as probit and tobit models, and has quite a bit
of support for working with panel data, you are probably using RATS because
you're working with some type of dynamic model.

Most high-level statistical packages include some level of programmability,
with looping, conditional execution, matrix manipulations, and often some type
of procedure language. RATS has these and also user-definable menus and
dialogs for very sophisticated types of programs. CATS is the best example of
this: it is entirely written in the RATS programming language. We will cover
all of these topics as part of this book.

1.2 Which Should You Use?


Your goal in doing empirical work should be to get the work done correctly
while requiring as little human time as possible. Thirty years ago, computing
time was often quite expensive. Estimating a single probit model on 10000
data points might cost $100, so it was much more important to choose the
right platform and check and double-check what you were doing before you
even submitted a job for computation. Computer time was more valuable than
human time: a good research assistant would keep the computer center bills
down. With modern computers, computing time for almost anything you might
want to do is almost costless. No matter how fast computers get, there will
always be someone figuring out some type of analysis which will take several
days to run on the fastest computers available, but that's not typical. Even
complicated simulations with a large number of replications can now be done
in under an hour.
As a general rule, the best way to achieve this goal is to use a programmable
high-level statistical package. By doing that, you start out with what will be in
most cases large pieces of what you need to do already written, already tested,
already debugged. All software (including RATS) has some bugs in it, but bugs
in mass-marketed software generally get discovered fairly quickly because of
the sheer number of users. The software is also vetted by doing comparisons
against calculations done using other software. By contrast, a function written
in a math package for use for one paper is never actually checked by anyone,
other than the writer. (And, unfortunately, sometimes not even by the writer,
as we have discovered in doing paper replications).
What's important, though, is that you make use, as much as possible, of the
features available in RATS: the existing instructions, procedures and functions.
If you don't, you're throwing away the single greatest advantage of the
statistical package.

1.3 Three Words of Advice

Get to Know Your Data!


The best way to waste time is to plunge ahead with a complicated piece of
analysis, only to discover that your data aren't right when the results
make no sense. (Even worse, of course, is to do the work and write an entire
paper, only to have a referee or thesis advisor tell you that the results make
no sense.) In the interest of brevity, most published papers omit graphs of the
data, tables of basic statistics, and simple regression models to help
understand the behavior of the data, but even if they don't make it into your
final product, they should be a vital part of your analysis.

Don't Reinvent the Wheel!


Use the built-in instructions wherever possible: they're the single greatest
advantage of using a high-level statistical package. Use the RATS procedure
library. Understand how to use the procedures which already exist. We'll
discuss how to write your own procedures, and that same information can be used
to modify existing ones where necessary, but before you do either, see if the
existing ones can be used as is. RATS comes with over 1000 textbook and paper
replication examples. See if something similar to what you want has already
been done and, if so, use it as the base for your work.

Realize That Not All Models Work!


You may need to be flexible. The RATS LINREG instruction will take just about
any set of data that you throw at it (it will handle collinear data by, in
effect, dropping the redundant variables), so as long as you have at least one
usable observation, it will give you some result. However, with non-linear
estimation instructions like GARCH, MAXIMIZE and DLM, there's no guarantee that
they can handle a model with a given set of data. Some models have multiple
modes, some have boundary problems, some have parameter scaling issues. Many
have structural changes and so don't fit properly over a whole sample. If you
read a published paper, you're generally looking at the models which worked,
not the ones which didn't. And often there are many in the latter category. So
be prepared to have to drop a country or industry, or to change the data
range if you need to.

1.4 General Stylistic Tips

Commenting
Your first goal is always to get the calculations correct. However, if you have a
program (or part of one) that you expect that you might need again, it's a good
idea to add comments. Don't overdo it: the following comment wastes the time
it took to write and is also distracting:

*
* Take first difference of log M2
*
set dlm2 = log(m2) - log(m2{1})

Note the difference between this and


*
* Use first difference of log M2 as in Smith and Jones(2010)
*
set dlm2 = log(m2) - log(m2{1})

where the comment will help you in writing up the results later. And if you
have a part of your program which you're not sure is correct, commenting it can
often help you spot errors. (If you can't explain why you're doing something,
that might be a good sign that you're doing it wrong.)

Prettifying
The word "prettify" is used in programming circles to describe making the
program easier to read by changing spacing, line lengths, etc. It doesn't change
how it works, just how it reads. Well-chosen spaces and line breaks can make it
easier to read a program, and will go a long way towards helping you get your
calculations correct. Even minor changes can help you do that. Compare

set dlrgdp = log(rgdp)-log(rgdp{1})
set dlm2 = log(m2)-log(m2{1})
set drs = tb3mo-tb3mo{1}
set dr1 = tb1yr-tb1yr{1}
set dlp = log(deflator)-log(deflator{1})
set dlppi = log(ppi)-log(ppi{1})

with
set dlrgdp = log(rgdp)     - log(rgdp{1})
set dlm2   = log(m2)       - log(m2{1})
set drs    = tb3mo         - tb3mo{1}
set dr1    = tb1yr         - tb1yr{1}
set dlp    = log(deflator) - log(deflator{1})
set dlppi  = log(ppi)      - log(ppi{1})

The only difference is a handful of spaces in each line, but it's much clearer in
the second case that these are parallel transformations, and it would be much
easier to spot a typo in any of those lines.

At a minimum, you should get into the habit of indenting loops and the like.
This makes it much easier to follow the flow of the program, and also makes it
easier to skip from one part of the calculation to the next.

Two operations on the Edit menu can be helpful with this. Indent Lines adds
one level of indentation at the left; Unindent Lines removes one level. The
number of spaces per level is set in the preferences in the Editor tab. All the
programs in this book are done with 3 space indenting, which seems to work
well for the way that RATS is structured. In the following, if you select the five
lines in the body of the loop and do Edit-Indent Lines

do i = 1,8
linreg(noprint) dresids 1962:2 *
# resids{1} dresids{1 to i}
com aic = %nobs*log(%rss) + 2*(%nreg)
com sbc = %nobs*log(%rss) + (%nreg)*log(%nobs)
dis "Lags: " i "T-stat" %tstats(1) aic sbc
end do i

you'll get

do i = 1,8
   linreg(noprint) dresids 1962:2 *
   # resids{1} dresids{1 to i}
   com aic = %nobs*log(%rss) + 2*(%nreg)
   com sbc = %nobs*log(%rss) + (%nreg)*log(%nobs)
   dis "Lags: " i "T-stat" %tstats(1) aic sbc
end do i

1.5 About This E-Book

Examples
The full running examples are included in the text and are also distributed
as separate files with the e-book. The names for these files are RPMn_m.RPF,
where RPM is "RATS Programming Manual", n is the chapter number and m
the example number. We would suggest that you use the separate example
files rather than trying to copy and paste whole programs out of the PDF: if
you do the latter, you can often end up with extra information from the page
layout.

Typefaces
To help you understand how RATS works and, in particular, what is happening
in the sample programs, we will use several conventions for typefaces.

Elements of RATS programs (instructions and options) are in Courier font.
Within text, they will be in upper case to stand out, with instruction names in
bold face and options or variable names in regular face:

   We want to suppress the usual LINREG output from the regressions with
   different lags, so we'll use NOPRINT. Then we'll use DISPLAY to show
   the test statistic and the two criteria.

However, stand-alone examples of code will be in lower case for readability:

do q=0,3
   do p=0,3
      boxjenk(constant,ar=p,ma=q) y
   end do p
end do q

Standard procedures which are distributed with RATS will be shown (in bold
upper case Courier) with the standard @ prefix, the way that you would use
them in practice: @DFUNIT, @REGCRITS. Since almost all procedures are on a
file named procedurename.src (that is, dfunit.src for @DFUNIT), we won't
talk about where you might find the code for a procedure unless it's on a file
other than the expected one.

Output taken straight out of a RATS output window will be in smaller fixed font
(to keep the information aligned) with a box around it:
Null Hypothesis : The Following Coefficients Are Zero
DRS Lag(s) 5 to 7
F(3,196)= 9.44427 with Significance Level 0.00000739

Wizards
We won't talk much about the use of RATS wizards. While some of these
remain useful even to very experienced RATS programmers (the Data (Other
Formats) and Standard Functions wizards in particular), they're generally
designed to help with one-off calculations for less-experienced users, and not
for the calculations with variables for ranges and the looping calculations
that we'll be doing here.

Tips and Tricks


If there is a subject which you might find interesting, but which would inter-
rupt the flow of a chapter, we will shift that into a Tips and Tricks section at
the end of the chapter.

Exercises
The point of this book is to help you learn better how to accomplish more
advanced tasks using RATS. To that end, we will occasionally insert exercises,
which ask you to think about how to do a slightly different example or how
to recode what we've already presented.
Chapter 2

Regression and ARIMA Models

This chapter begins with a quick overview of some of the basic RATS instruc-
tions used in estimating linear regression and ARMA models. This book is defi-
nitely not an econometrics text; instead, the aim is to refresh your memory and
to introduce you to some basic RATS tools. Towards that end, a number of key
RATS instructions are illustrated in some straightforward programs.

2.1 The Data Set


The file labeled QUARTERLY(2012).XLS contains quarterly values for the
3-month and 1-year treasury bill rates, real GDP, potential real GDP, the GDP
deflator, the seasonally adjusted money supply (M2), the producer price index
of finished goods (PPI), and currency in circulation for the 1960:1 to 2012:4
period. The data were obtained from the website of the Federal Reserve Bank of
St. Louis (www.stls.frb.org/index.html) and saved in Excel format. If
you open the file, you will see that the first eight observations are:

DATE    Tb3mo  Tb1yr    RGDP  Potent  Deflator     M2   PPI  Curr
1960Q1   3.87   4.57  2845.3  2824.2    18.521  298.7  33.2  31.8
1960Q2   2.99   3.87  2832.0  2851.2    18.579  301.1  33.4  31.9
1960Q3   2.36   3.07  2836.6  2878.7    18.648  306.5  33.4  32.2
1960Q4   2.31   2.99  2800.2  2906.7    18.700  310.9  33.7  32.6
1961Q1   2.35   2.87  2816.9  2934.8    18.743  316.3  33.6  32.1
1961Q2   2.30   2.94  2869.6  2962.9    18.785  322.1  33.3  32.1
1961Q3   2.30   3.01  2915.9  2991.3    18.843  327.6  33.3  32.7
1961Q4   2.46   3.10  2975.3  3019.9    18.908  333.3  33.4  33.4

If you open up Example 2.1 (file RPM2_1.RPF), you'll see the following lines,
which read in the entire data set:
cal(q) 1960:1
all 2012:4
open data quarterly(2012).xls
data(org=obs,format=xls)

Note that only the first three letters of the CALENDAR and ALLOCATE
instructions have been used; in fact, any RATS instruction can be called using
only the first three letters of its name. If you use the TABLE instruction and
limit the output to only two decimal places, your output should be:


table(picture="*.##")

Series    Obs     Mean  Std Error  Minimum   Maximum
TB3MO     212     5.03       2.99     0.01     15.05
TB1YR     212     5.58       3.18     0.11     16.32
RGDP      212  7664.75    3390.65  2800.20  13665.40
POTENT    212  7764.87    3511.54  2824.20  14505.40
DEFLATOR  212    61.53      31.59    18.52    116.09
M2        212  3136.84    2648.84   298.70  10317.70
PPI       212    99.97      49.13    33.20    196.20
CURR      212   327.91     309.02    31.83   1147.62
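
To make the three-letter rule concrete, the following sketch writes the same data-reading steps with the CALENDAR, ALLOCATE and TABLE instruction names spelled out in full. It assumes the same data file as above and should behave the same as the abbreviated version.

```rats
* Same steps as the abbreviated version, with full instruction names.
calendar(q) 1960:1
allocate 2012:4
open data quarterly(2012).xls
data(org=obs,format=xls)
*
* Summary statistics, limited to two decimal places.
table(picture="*.##")
```

Which form you use is a matter of taste; the abbreviated names are quicker to type, while the full names can be easier to read in a program you expect to share.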

Many of the examples presented will use the growth rates of M2 and real GDP,
the first differences of the 3-month and 1-year T-bill rates, and the rate of infla-
tion (as measured by the growth rate of the GDP deflator or the PPI). You can
create these six variables using:

set dlrgdp = log(rgdp)     - log(rgdp{1})
set dlm2   = log(m2)       - log(m2{1})
set drs    = tb3mo         - tb3mo{1}
set dr1    = tb1yr         - tb1yr{1}
set dlp    = log(deflator) - log(deflator{1})
set dlppi  = log(ppi)      - log(ppi{1})

Notice that we've chosen to denote the change in a variable with the prefix d
and the growth rate of a variable with the prefix dl. The suffixes s and l refer
to the short-term and long-term interest rates, and the logarithmic change in
price (called dlp) is the quarterly inflation rate.
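
A quick way to confirm that a transformation does what you expect is to print the first few observations next to the original series. The sketch below does this for drs; the choice of the first eight quarters as the range to print is ours. Because tb3mo{1} does not exist for 1960:1, the first value of drs is missing, and the 1960:2 value can be checked by hand against the data listing above.

```rats
* Assumes QUARTERLY(2012).XLS is available, as in Example 2.1.
cal(q) 1960:1
all 2012:4
open data quarterly(2012).xls
data(org=obs,format=xls)
*
set drs = tb3mo - tb3mo{1}
*
* drs is missing at 1960:1 because there is no lagged value there.
* From the data listing, drs at 1960:2 should be 2.99 - 3.87 = -0.88.
print 1960:1 1961:4 tb3mo drs
```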
We can create graphs of the series (Figure 2.1) using:1

spgraph(footer="Graphs of the Series",hfields=2,vfields=2)
graph(header="Panel 1: The Interest Rates",key=below,nokbox) 2
# tb3mo
# tb1yr
graph(header="Panel 2: Real and Potential GDP",key=upleft) 2
# rgdp
# potent
graph(header="Panel 3: Time path of money growth",noaxis) 1
# dlm2
graph(header="Panel 4: Time path of Inflation",noaxis) 1
# dlp
spgraph(done)

Recall that the typical syntax of the GRAPH instruction is:

GRAPH( options ) number
# series start end
1 The growth rates of the PPI and CURR are not shown here; both are considered
in more detail later in the chapter.

[Four-panel graph. Panel 1: The Interest Rates (TB3MO, TB1YR). Panel 2: Real
and Potential GDP (RGDP, POTENT). Panel 3: Time path of money growth. Panel 4:
Time path of Inflation. Horizontal axes run from 1960 to 2010.]

Figure 2.1: Graphs of the series

number The number of series to graph. The names of the series are
listed on the supplementary cards (one card for each series).
series The name of the series to graph
start end Range to plot. If omitted, RATS uses the current sample range.

The graphs shown in Figure 2.1 illustrate only a few of the options available in
RATS. The commonly used options are (brackets [ ] indicate the default choice):

HEADER=header string (centered above graph)


FOOTER=footer string (left-justified below graph)
KEY=the location of the key
Some of the choices you can use are [NONE], UPLEFT, LORIGHT, ABOVE,
BELOW, RIGHT. Some (such as UPLEFT and LORIGHT) are inside the graph
box, others (such as ABOVE and RIGHT) are outside.
STYLE=graph style
Some of the choices include: [LINE], POLYGON, BAR, STACKEDBAR.
DATES/NODATES
RATS will label the horizontal axis with dates (rather than entry numbers)
unless the NODATES option is specified.

The program also illustrates the use of the SPGRAPH instruction to place
multiple graphs on a single page. The first time SPGRAPH is encountered, RATS is
told to expect a total of four graphs. The layout is such that there are two
fields horizontally (HFIELDS=2) and two vertically (VFIELDS=2). The option
FOOTER produces "Graphs of the Series" as the footer for the full page. The
headers on the four GRAPH instructions produce the headers on the individual
panels. Nothing is actually shown until the SPGRAPH(DONE).

2.2 Linear Regression and Hypothesis Testing


The LINREG instruction is the backbone of RATS and it is necessary to review
its use. As such, suppose you want to estimate the first difference of the 3-
month t-bill rate (i.e., drs) as the autoregressive process:
$$ drs_t = \alpha_0 + \sum_{i=1}^{7} \alpha_i \, drs_{t-i} + \varepsilon_t \tag{2.1} $$

The next two lines of the program (this is still RPM2_1.RPF) estimate the model
over the entire sample period (less the seven usable observations lost as a
result of the lags and the additional usable observation lost as a result of
differencing) and save the residuals in a series called resids.

linreg drs / resids
# constant drs{1 to 7}

Linear Regression - Estimation by Least Squares


Dependent Variable DRS
Quarterly Data From 1962:01 To 2012:04
Usable Observations 204
Degrees of Freedom 196
Centered R^2 0.2841953
R-Bar^2 0.2586309
Uncentered R^2 0.2843637
Mean of Dependent Variable -0.011617647
Std Error of Dependent Variable 0.759163288
Standard Error of Estimate 0.653660810
Sum of Squared Residuals 83.745401006
Regression F(7,196) 11.1168
Significance Level of F 0.0000000
Log Likelihood -198.6489
Durbin-Watson Statistic 1.9709

Variable Coeff Std Error T-Stat Signif


************************************************************************************
1. Constant -0.011903358 0.045799634 -0.25990 0.79521316
2. DRS{1} 0.390010248 0.069644459 5.60002 0.00000007
3. DRS{2} -0.380186642 0.074718282 -5.08827 0.00000084
4. DRS{3} 0.406843358 0.078304236 5.19567 0.00000051
5. DRS{4} -0.159123423 0.082740231 -1.92317 0.05590809
6. DRS{5} 0.193334248 0.078290297 2.46945 0.01438724
7. DRS{6} -0.089946745 0.074692035 -1.20423 0.22995107
8. DRS{7} -0.220768119 0.069358921 -3.18298 0.00169542

Almost every piece of information in this output can be retrieved for future
calculations; in the descriptions below, the variable name (if it exists) is
shown in parentheses. These are all saved to full precision, not just to the
number of decimal places shown in the output. You need to be careful since
these are replaced every time you estimate a regression, and some may be
recomputed by other instructions as well. %NOBS, for instance, is replaced by
almost any statistical instruction.
Usable Observations (%NOBS)
This doesn't count observations lost to differencing and lags. If there were
missing values within the sample, it wouldn't count those either.

Degrees of Freedom (%NDF)


Number of observations less number of (effective) regressors. If you hap-
pened to run a regression with collinear regressors (too many dummies, for
instance), the number of effective regressors might be less than the number
you listed.
Centered R^2 (%RSQUARED)
The standard regression R^2.
R-Bar^2 (%RBARSQ)
R^2 corrected for degrees of freedom.
Uncentered R^2
R^2 comparing sum of squared residuals to sum of squares of the dependent
variable, without subtracting the mean out of the latter.
Mean of Dependent Variable (%MEAN)
The mean of the dependent variable computed only for the data points used
in the regression. Thus it will be different from what you would get if you
just did a TABLE or STATISTICS instruction on the dependent variable.
Std Error of Dependent Variable (%VARIANCE)
This is the standard error of the dependent variable computed only for the
data points used in the regression. %VARIANCE is its square.
Standard Error of Estimate (%SEESQ)
The standard degrees-of-freedom corrected estimator for the regression
standard error. Its square (that is, the estimated variance) is in the %SEESQ
variable.
Sum of Squared Residuals (%RSS)
Regression F (%FSTAT)
This is the F test for the hypothesis that all coefficients in the regression
(other than the constant) are zero. Here, the sample value of F for the joint
test $\alpha_1 = \alpha_2 = \alpha_3 = \ldots = \alpha_7 = 0$ is 11.1168. The
output also shows the numerator and denominator degrees of freedom of the test.
Significance Level of F (%FSIGNIF)
This is the significance level of the regression F , which here is highly signif-
icant.
Log Likelihood (%LOGL)
This is the log likelihood assuming Normal residuals. Note that RATS in-
cludes all constants of integration in any log likelihood calculations.
Durbin-Watson Statistic (%DURBIN)
The Durbin-Watson test for first-order serial correlation in the residuals.
This is computed and displayed even though the standard small sample limits
don't apply to a regression like this with lagged dependent variables; it's
mainly used as an informal indicator of serial correlation if it differs
substantially from the theoretical value of 2.0 for serially uncorrelated
residuals.
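Several of the summary statistics are simple functions of a few others, which makes it easy to see how the descriptions above fit together. The Python sketch below (not RATS code; it assumes the textbook formulas for these statistics and the Normal log likelihood with all constants included) starts from the printed values of %NOBS, %RSS and the centered R^2 and recomputes the rest. Agreement is limited by the number of digits printed in the output.

```python
import math

# Values copied from the LINREG output for the seven-lag model of DRS.
nobs = 204            # Usable Observations
nreg = 8              # constant plus seven lags
rss = 83.745401006    # Sum of Squared Residuals
rsq = 0.2841953       # Centered R^2

ndf = nobs - nreg                                  # Degrees of Freedom
rbarsq = 1.0 - (1.0 - rsq) * (nobs - 1) / ndf      # R-Bar^2
seesq = rss / ndf                                  # d.o.f.-corrected variance
see = math.sqrt(seesq)                             # Standard Error of Estimate
fstat = (rsq / (nreg - 1)) / ((1.0 - rsq) / ndf)   # Regression F(7,196)
sigmasq = rss / nobs                               # ML variance (no d.o.f. correction)
logl = -0.5 * nobs * (math.log(2.0 * math.pi) + math.log(sigmasq) + 1.0)

print(ndf, round(rbarsq, 7), round(see, 9), round(fstat, 4), round(logl, 4))
```

The recomputed values line up with the Degrees of Freedom, R-Bar^2, Standard Error of Estimate, Regression F and Log Likelihood lines of the output.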

In the regressor table at the bottom, we have the coefficient estimate (Coeff),
the standard error of the estimated coefficient (Std Error), the t-statistic for
the null hypothesis that the coefficient equals zero (T-Stat), and the marginal
significance level of the t-test (Signif). The fetchable information here is
saved in VECTORs, each of which would have 8 elements in this case. The
coefficients are in %BETA, the standard errors in %STDERRS and the t-statistics
in %TSTATS. (The significance levels aren't saved.) Thus %BETA(2) is the
coefficient on the first lag of DRS (roughly .3900), %STDERRS(5) is the standard
error on the estimate of DRS{4} (.0827), and %TSTATS(8) is the t-statistic on
DRS{7} (-3.183).
There are several other variables defined by LINREG which don't show directly
on the output:

%NREG        Number of regressors
%NMISS       Number of skipped data points (between the start and end)
%TRSQUARED   Number of observations times the R^2. Some LM tests use this as
             the test statistic.
%TRSQ        Number of observations times the uncentered R^2. Also sometimes
             used in LM tests.
%SIGMASQ     Maximum likelihood estimate (that is, not corrected for degrees
             of freedom) of the residual variance.
%XX          The $(X'X)^{-1}$ matrix, or (in some cases) the estimated
             covariance matrix of the coefficients. This is a $k \times k$
             SYMMETRIC matrix where k is the number of regressors. %XX(i,j) is
             its (i, j) element.

For a time series regression, it is always important to determine whether there
is any serial correlation in the regression residuals. The CORRELATE
instruction calculates the autocorrelations (and the partial autocorrelations)
of a specified series. The syntax is:

CORRELATE( options ) series start end corrs

where

series The series for which you want to compute correlations.


start end The range of entries to use. The default is the entire series.

corrs Series used to save the autocorrelations (Optional).

The principal options are:

NUMBER=number of autocorrelations to compute


The default is the integer value of T /4
RESULTS=series used to save the correlations
PARTIAL=series for the partial autocorrelations
If you omit this option, the PACF will not be calculated.
QSTATS
Produces the Ljung-Box Q-statistics
SPAN=interval width for Q-statistics
Use with QSTATS to set the width of the intervals. For example, SPAN=4
produces Q(4), Q(8), Q(12), and so forth.

In the example at hand, we can obtain the first 24 autocorrelations, partial
autocorrelations, and the associated Q-statistics of the residuals with:
corr(number=24,partial=partial,qstats,span=4,pic="##.###") resids

The options also include a degrees of freedom correction. Here, you could in-
clude the option DFC=7 since the residuals are generated from a model with
seven autoregressive coefficients.
Correlations of Series RESIDS
Quarterly Data From 1962:01 To 2012:04

Autocorrelations
1 2 3 4 5 6 7 8 9 10
0.015 0.002 -0.019 -0.021 -0.043 0.044 -0.069 0.105 -0.096 0.061
11 12 13 14 15 16 17 18 19 20
-0.137 0.022 -0.067 0.006 -0.131 -0.047 -0.092 0.076 -0.025 0.001
21 22 23 24
0.041 0.028 -0.045 0.049

Partial Autocorrelations
1 2 3 4 5 6 7 8 9 10
0.015 0.002 -0.019 -0.020 -0.043 0.045 -0.071 0.107 -0.103 0.067
11 12 13 14 15 16 17 18 19 20
-0.145 0.033 -0.069 -0.004 -0.123 -0.077 -0.063 0.032 0.004 -0.058
21 22 23 24
0.083 -0.035 0.005 -0.002

Ljung-Box Q-Statistics
Lags Statistic Signif Lvl
4 0.209 0.994898
8 4.390 0.820339
12 11.371 0.497450
16 16.661 0.407869
20 20.035 0.455719
24 21.637 0.600931

All of the autocorrelations and partial autocorrelations are small, and none
of the Ljung-Box Q-statistics is statistically significant. Other diagnostic
checks include plotting the residuals using (for instance)
graph 1
# resids
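The Q-statistics in the table above can be approximately reproduced from the printed autocorrelations. The Python sketch below (not RATS code) assumes the standard Ljung-Box formula, $Q = T(T+2)\sum_{k=1}^{m} r_k^2/(T-k)$, and no degrees of freedom correction, so the statistic for m lags is compared with a chi-squared distribution with m degrees of freedom. The closed-form tail probability used here only works for an even number of degrees of freedom, which is enough for SPAN=4. Because the autocorrelations are printed to three decimals, the results match the RATS output only to about two decimals.

```python
import math

def ljung_box(autocorrs, nobs):
    """Ljung-Box Q for the first len(autocorrs) autocorrelations."""
    return nobs * (nobs + 2) * sum(
        r * r / (nobs - k) for k, r in enumerate(autocorrs, start=1))

def chisq_signif_even(q, df):
    """Upper-tail chi-squared probability; this closed form needs even df."""
    if df % 2 != 0:
        raise ValueError("closed form used here requires even df")
    half = q / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j          # builds (q/2)^j / j!
        total += term
    return math.exp(-half) * total

# First 12 residual autocorrelations as printed (three decimals).
acf = [0.015, 0.002, -0.019, -0.021, -0.043, 0.044,
       -0.069, 0.105, -0.096, 0.061, -0.137, 0.022]

for lags in (4, 8, 12):
    q = ljung_box(acf[:lags], nobs=204)
    print(lags, round(q, 3), round(chisq_signif_even(q, lags), 4))
```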

A concern is that the model is over-parameterized since it contains a total
of eight coefficients. While the t-statistics allow you to determine the
significance levels of individual coefficients, the EXCLUDE, SUMMARIZE, TEST,
and RESTRICT instructions allow you to perform hypothesis tests on several
coefficients at once. EXCLUDE is followed by a supplementary card listing the
variables to exclude from the most recently estimated regression. RATS
produces the F-statistic and the significance level for the null hypothesis that
the coefficients of all excluded variables equal zero. The following does a
joint test on the final three lags:
exclude
# drs{5 to 7}

Null Hypothesis : The Following Coefficients Are Zero


DRS Lag(s) 5 to 7
F(3,196)= 9.44427 with Significance Level 0.00000739

This can be rejected at conventional significance levels.


With EXCLUDE (and similar instructions) you can suppress the output with the
NOPRINT option, or you can improve the output using the TITLE option to
give a clearer description. Whether or not you print the output, these
instructions define the variables

%CDSTAT The test statistic


%SIGNIF The significance level
%NDFTEST The (numerator) degrees of freedom. (The denominator degrees
of freedom on a F will be the %NDF from the previous regression.)

SUMMARIZE has the same syntax as EXCLUDE but is used to test the null
hypothesis that the sum of the listed coefficients is equal to zero. In the
following example, the value of t for the null hypothesis
$\alpha_5 + \alpha_6 + \alpha_7 = 0$ is -1.11460. As such, we do not reject the
null hypothesis that the sum is zero.

summarize
# drs{5 to 7}

Summary of Linear Combination of Coefficients

DRS Lag(s) 5 to 7
Value -0.1173806 t-Statistic -1.11460
Standard Error 0.1053116 Signif Level 0.2663855
In addition to %CDSTAT and %SIGNIF, SUMMARIZE defines %SUMLC and %VARLC
as the sum of the coefficients and its estimated variance.
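The numbers in the SUMMARIZE output can be tied back to the regression table. In the Python sketch below (not RATS code), the sum is formed from the three printed coefficients, and the variance is simply the square of the printed standard error; recovering the standard error itself would require the coefficient covariance matrix, which is not printed.

```python
import math

# Coefficients on DRS{5}, DRS{6}, DRS{7} from the LINREG output.
coeffs = [0.193334248, -0.089946745, -0.220768119]
std_error = 0.1053116          # Standard Error printed by SUMMARIZE

sumlc = sum(coeffs)            # corresponds to %SUMLC
varlc = std_error ** 2         # corresponds to %VARLC (square of the std error)
tstat = sumlc / math.sqrt(varlc)

print(round(sumlc, 7), round(tstat, 5))
```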
EXCLUDE can only test whether a group of coefficients is jointly equal to zero.
The TEST instruction has a great deal more flexibility; it is able to test joint
restrictions on particular values of the coefficients. Suppose you have
estimated a model and want to perform a significance test of the joint
hypothesis restricting the values of coefficients $\alpha_i$, $\alpha_j$, ... and
$\alpha_k$ to equal the values $r_i$, $r_j$, ... and $r_k$, respectively.
Formally, suppose you want to test the restrictions
$$ \alpha_i = r_i, \quad \alpha_j = r_j, \quad \ldots, \quad \text{and} \quad \alpha_k = r_k $$
To perform the test, you first type TEST followed by two supplementary cards.
The first supplementary card lists the coefficients (by their number in the
LINREG output list) that you want to restrict and the second lists the restricted
value of each. Suppose you want to restrict the coefficients of the last three
lags of DRS to all be 0.1 (i.e., $\alpha_5 = 0.1$, $\alpha_6 = 0.1$, and
$\alpha_7 = 0.1$). To test this restriction, use:
test
# 6 7 8
# 0.1 0.1 0.1

F(3,196)= 15.77411 with Significance Level 0.00000000

RATS displays the F-value and the significance level of the joint test. If the
restriction is binding, the value of F should be high and the significance level
should be low. Hence, we can be quite confident in rejecting the restriction
that each of the three coefficients equals 0.1. To test the restriction that the
constant equals zero (i.e., $\alpha_0 = 0$) and that $\alpha_1 = 0.4$,
$\alpha_2 = -0.1$, $\alpha_3 = 0.4$, use:

test
# 1 2 3 4
# 0. 0.4 -0.1 0.4

F(4,196)= 4.90219 with Significance Level 0.00086693

RESTRICT is the most powerful of the hypothesis testing instructions. It can
test multiple linear restrictions on the coefficients and estimate the restricted
model. Although RESTRICT is a bit difficult to use, it can perform the tasks of
SUMMARIZE, EXCLUDE, and TEST. Each restriction is entered in the form:
$$ \beta_i \alpha_i + \beta_j \alpha_j + \ldots + \beta_k \alpha_k = r $$
where the $\alpha_i$ are the coefficients of the estimated model (i.e., each
coefficient is referred to by its assigned number), the $\beta_i$ are weights you
assign to each coefficient, and $r$ represents the restricted value of the sum
(which may be zero).
To implement the test, you type RESTRICT followed by the number of
restrictions you want to impose. Each restriction requires the use of two
supplementary cards. The first lists the coefficients to be restricted (by their
number) and the second lists the values of the $\beta_i$ and $r$.

2.2.1 Examples using RESTRICT

1. To test the restriction that the constant equals zero (which could be done
with EXCLUDE or TEST) use:

restrict 1
# 1
# 1 0

The first line instructs RATS to prepare for one restriction. The second line is
the supplementary card indicating that coefficient 1 (i.e., the constant) is to be
restricted. The third line imposes the restriction $1.0 \times \alpha_0 = 0$.
2. To test the restriction that $\alpha_1 = \alpha_2$, we rearrange that to
$\alpha_1 - \alpha_2 = 0$ and use

restrict 1
# 2 3
# 1 -1 0

Again, the first line instructs RATS to prepare for one restriction. The second
line is the supplementary card indicating that coefficients 2 and 3 are to be
restricted. The third line imposes the restriction
$1.0 \times \alpha_1 - 1.0 \times \alpha_2 = 0$.
3. If you reexamine the regression output, it seems as if
$\alpha_1 + \alpha_2 = 0$. We'll also include several other restrictions which
aren't quite as clear: $\alpha_3 + \alpha_4 = 0$ and $\alpha_4 + \alpha_5 = 0$.
To test the combination of these three restrictions use:

restrict(create) 3 resids
# 2 3
# 1. 1. 0.
# 4 5
# 1. 1. 0.
# 5 6
# 1. 1. 0.

Note that RESTRICT can be used with the CREATE option to test and estimate
the restricted form of the regression. Whenever CREATE is used, you can save
the new regression residuals simply by providing RATS with the name of the
series in which to store the residuals (here RESIDS). (%RESIDS is also
redefined.) The test is shown above the new regression output.

F(3,196)= 3.74590 with Significance Level 0.01197151

Linear Model - Estimation by Restricted Regression


Dependent Variable DRS
Quarterly Data From 1962:01 To 2012:04
Usable Observations 204
Degrees of Freedom 199
Mean of Dependent Variable -0.011617647
Std Error of Dependent Variable 0.759163288
Standard Error of Estimate 0.667052922
Sum of Squared Residuals 88.546960661
Durbin-Watson Statistic 1.9307

Variable Coeff Std Error T-Stat Signif


************************************************************************************
1. Constant -0.013441955 0.046725758 -0.28768 0.77389289
2. DRS{1} 0.378316399 0.060581719 6.24473 0.00000000
3. DRS{2} -0.378316399 0.060581719 -6.24473 0.00000000
4. DRS{3} 0.266709604 0.065399412 4.07817 0.00006562
5. DRS{4} -0.266709604 0.065399412 -4.07817 0.00006562
6. DRS{5} 0.266709604 0.065399412 4.07817 0.00006562
7. DRS{6} -0.112088351 0.075915937 -1.47648 0.14139579
8. DRS{7} -0.175829465 0.069193109 -2.54114 0.01181155

Here the F-statistic (with three degrees of freedom in the numerator and 196
in the denominator) is 3.746 with a significance level of 0.01197. Hence, we
would reject the null hypothesis at the 5% significance level and conclude that
the restrictions are binding. At the 1% significance level, we would (just
barely) fail to reject the null hypothesis.
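Because both the unrestricted and the restricted sums of squared residuals are printed, the F-statistic can be checked by hand. The Python sketch below (not RATS code) assumes the textbook F formula for linear restrictions in a least squares regression; RATS may compute the test in a different but numerically equivalent way.

```python
rss_unrestricted = 83.745401006   # seven-lag LINREG output
rss_restricted = 88.546960661     # RESTRICT(CREATE) output
num_restrictions = 3
ndf_unrestricted = 196            # 204 observations less 8 regressors

# Increase in RSS per restriction, relative to the unrestricted variance estimate.
fstat = ((rss_restricted - rss_unrestricted) / num_restrictions) / (
    rss_unrestricted / ndf_unrestricted)

# Each restriction frees up one degree of freedom in the restricted model.
ndf_restricted = ndf_unrestricted + num_restrictions

print(round(fstat, 5), ndf_restricted)
```

The restricted regression reports 199 degrees of freedom, which is the 196 of the original model plus the three restrictions.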
Note that when you do RESTRICT(CREATE), the t-statistics in the new output
(and any other further tests that you do) take the set of restrictions used as
given. Thus the t-statistic on DRS{1} tests whether the coefficient on the first
lag is zero, given that the first two lags sum to zero, which means that it ac-
tually is restricting both coefficients to zero (hence the matching (up to sign) t
statistics).

2.3 The LINREG Options


LINREG has many options that will be illustrated in the following chapters.
The usual syntax of LINREG is:

LINREG( options ) depvar start end residuals
# list

depvar      The dependent variable.
start end   The range to use in the regression. The default is the largest
            common range of all variables in the regression.
residuals   Series name for the residuals. Omit if you do not want to
            save the residuals in a separate series. RATS always saves
            the residuals in a series called %RESIDS. You can use this
            series just as if you named the series. However, be aware
            that %RESIDS is overwritten each time a new LINREG
            instruction (or similar instruction) is performed.
list        The list of explanatory variables.

The most useful options for our purposes are:


DEFINE=name of EQUATION to define
[PRINT]/NOPRINT
LINREG also has options for correcting standard errors and t-statistics for
hypothesis testing. ROBUSTERRORS/[NOROBUSTERRORS] computes a consistent
estimate of the covariance matrix that corrects for heteroskedasticity as in
White (1980). ROBUSTERRORS combined with LAGS= produces various types of
Newey-West estimates of the covariance matrix. You can use SPREAD for weighted
least squares and INSTRUMENTS for instrumental variables. The appropriate
use of these options is described in Chapter 2 of the RATS User's Guide.

2.4 Using LINREG and Related Instructions


To illustrate working with the LINREG and related instructions, it is useful to
consider the two interest rate series shown in Panel 1 of Figure 2.1.2 Economic
theory suggests that long-term and short-term rates have a long-term equilib-
rium relationship. Although the two series appear to be nonstationary, they
also seem to bear a strong relationship to each other. We can estimate this
relationship using:

linreg tb1yr / resids
# constant tb3mo

Linear Regression - Estimation by Least Squares


Dependent Variable TB1YR
Quarterly Data From 1960:01 To 2012:04
Usable Observations 212
Degrees of Freedom 210
Centered R^2 0.9868383
R-Bar^2 0.9867756
Uncentered R^2 0.9967863
Mean of Dependent Variable 5.5787735849
Std Error of Dependent Variable 3.1783132737
Standard Error of Estimate 0.3654972852
Sum of Squared Residuals 28.053535759
Regression F(1,210) 15745.3945
Significance Level of F 0.0000000
Log Likelihood -86.4330
Durbin-Watson Statistic 0.5766

Variable Coeff Std Error T-Stat Signif


************************************************************************************
1. Constant 0.2706392926 0.0491897079 5.50195 0.00000011
2. TB3MO 1.0547609616 0.0084057656 125.48065 0.00000000

2 The analysis from this section is in Example 2.2, file RPM2_2.RPF.

[Bar graph of CORS and PARTIAL against lags 0 to 8; vertical axis from -0.2
to 1.0.]

Figure 2.2: Correlations from interest rate regression

An important issue concerns the nature of the residuals. We can obtain the
first 8 residual autocorrelations using:

corr(num=8,results=cors,partial=partial,picture="##.###",qstats) resids

Correlations of Series RESIDS


Quarterly Data From 1960:01 To 2012:04

Autocorrelations
1 2 3 4 5 6 7 8
0.711 0.410 0.260 0.133 -0.042 -0.141 -0.069 -0.013

Partial Autocorrelations
1 2 3 4 5 6 7 8
0.711 -0.193 0.103 -0.109 -0.176 0.003 0.179 -0.039

Ljung-Box Q-Statistics
Lags Statistic Signif Lvl
8 169.465 0.000000

As expected, the residual autocorrelations seem to decay reasonably rapidly.
Notice that we used the QSTATS option; this option produces the Ljung-Box
Q-statistic for the null hypothesis that all 8 autocorrelations are zero.
Clearly, this null is rejected at any conventional significance level. If we
wanted additional Q-statistics, we could have also used the SPAN= option. For
example, if we wanted to produce the Q-statistics for lags 4, 8, and 12, we
could use:

corr(num=12,results=cors,partial=partial,span=4,qstats) resids

Since we are quite sure that the autocorrelations differ from zero, we won't use
that here. We can graph the ACF and PACF (Figure 2.2) using:
graph(nodates,number=0,style=bar,key=below,footer="ACF and PACF") 2
# cors
# partial

Notice that we used the NODATES and NUMBER= options. We want the x-axis
to be labeled with integers ranging from 0 to 8 instead of calendar dates since
these aren't data, but a sequence of statistics. Since it is clear that the
residual autocorrelations decay over time, we can estimate the dynamic process.
Take the first difference of the resids and call the result DRESIDS:
diff resids / dresids

Now estimate the dynamic adjustment process as:

$$ dresids_t = \alpha_0 \, resids_{t-1} + \sum_{i=1}^{p} \alpha_i \, dresids_{t-i} + \varepsilon_t $$

If we can conclude that $\alpha_0$ is less than zero, we can conclude that the
{resids} sequence is a convergent process. However, it is not straightforward to
estimate the regression equation and then test the null hypothesis
$\alpha_0 = 0$. One problem is that under the null hypothesis of no equilibrium
long-run relationship (that is, under the null of no cointegration between the
two rates), we cannot use the usual t-distributions; this is the Engle-Granger
test from Engle and Granger (1987). And to apply this, we need to choose p to
eliminate the serial correlation in the residuals. p is clearly not zero, so we
must come up with some method to choose it.3
The ACF suggests that we can look at relatively short lag lengths, although
the partial autocorrelation coefficient at lag 6 appears to be significant. We
could do the test allowing for two full years' worth of lags (that is, 8) with:

diff resids / dresids
linreg dresids
# resids{1} dresids{1 to 8}

3 This is a heavily-used test, and RATS provides procedures for doing this, as
will be discussed below. But for now, we'll look at how to do it ourselves.

Linear Regression - Estimation by Least Squares


Dependent Variable DRESIDS
Quarterly Data From 1962:02 To 2012:04
Usable Observations 203
Degrees of Freedom 194
Centered R^2 0.2552704
R-Bar^2 0.2245599
Uncentered R^2 0.2552866
Mean of Dependent Variable -0.001310240
Std Error of Dependent Variable 0.281667715
Standard Error of Estimate 0.248033987
Sum of Squared Residuals 11.935046594
Log Likelihood -0.4213
Durbin-Watson Statistic 2.0048

Variable Coeff Std Error T-Stat Signif


************************************************************************************
1. RESIDS{1} -0.379234416 0.084147774 -4.50677 0.00001136
2. DRESIDS{1} 0.241016426 0.092645733 2.60148 0.00999787
3. DRESIDS{2} 0.005897591 0.091245574 0.06463 0.94853175
4. DRESIDS{3} 0.125011741 0.083397849 1.49898 0.13550420
5. DRESIDS{4} 0.129833220 0.079674704 1.62954 0.10482111
6. DRESIDS{5} 0.008635668 0.078080089 0.11060 0.91204778
7. DRESIDS{6} -0.142481467 0.075908766 -1.87701 0.06201793
8. DRESIDS{7} 0.050886186 0.071685366 0.70985 0.47864666
9. DRESIDS{8} 0.085033097 0.071088272 1.19616 0.23309320

where we can read off the E-G test statistic as the t-stat on the lagged
residual (-4.50677). That wouldn't be an unreasonable procedure, but then at
least those last two lags and perhaps all but the first lag on DRESIDS look like
they may be unnecessary. Since each added lag costs a usable data point, and
unneeded coefficients tend to make the small-sample behavior of tests worse, it
would be useful to see if we can justify using fewer.
There are several possible ways to automate lag selection in a situation like
this. Here, we'll demonstrate use of Information Criteria. Using the variables
defined by a LINREG, values for the Akaike Information Criterion (AIC) and
the Schwarz Bayesian Criterion (SBC) (often called the Bayesian Information
Criterion or BIC) can be computed using:

com aic = -2.0*%logl + %nreg*2
com sbc = -2.0*%logl + %nreg*log(%nobs)

We want to suppress the usual LINREG output from the regressions with
different lags, so we'll use NOPRINT. Then we'll use DISPLAY to show the test
statistic and the two criteria.
When you use Information Criteria to choose lag length, it's important to make
sure that you use the same sample range for each regression; if you don't, the
sample log likelihoods won't be comparable. We can pick up the range from the
8 lag regression using

compute egstart=%regstart()

and use that as the start period on the other regressions:



do i = 0,8
   linreg(noprint) dresids egstart *
   # resids{1} dresids{1 to i}
   com aic = -2.0*%logl + %nreg*2
   com sbc = -2.0*%logl + %nreg*log(%nobs)
   dis "Lags: " i "T-stat" %tstats(1) "The aic = " aic " and sbc = " sbc
end do i

Lags: 0 T-stat -5.87183 The aic = 30.68736 and sbc = 34.00057


Lags: 1 T-stat -6.60326 The aic = 24.73007 and sbc = 31.35648
Lags: 2 T-stat -5.38733 The aic = 24.58260 and sbc = 34.52222
Lags: 3 T-stat -5.63637 The aic = 23.88187 and sbc = 37.13469
Lags: 4 T-stat -6.24253 The aic = 19.43505 and sbc = 36.00108
Lags: 5 T-stat -5.67614 The aic = 21.43312 and sbc = 41.31236
Lags: 6 T-stat -4.37831 The aic = 16.65477 and sbc = 39.84721
Lags: 7 T-stat -4.34237 The aic = 18.33419 and sbc = 44.83984
Lags: 8 T-stat -4.50677 The aic = 18.84250 and sbc = 48.66135

Whether we use the 6-lag model selected by the minimum AIC or the 1-lag
model selected by the SBC, the t-statistic is sufficiently negative that we reject
the null hypothesis that $\alpha_0$ equals zero. As such, we conclude that the
two interest rates are cointegrated.4
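The selection step can be checked from the printed numbers alone. The Python sketch below (not RATS code; the function name is illustrative) recomputes the AIC and SBC for the eight-lag regression from its printed log likelihood, and then picks the lag length that minimizes each criterion in the DISPLAY output. The recomputed values agree only to the precision of the printed log likelihood.

```python
import math

def info_criteria(logl, nreg, nobs):
    """AIC and SBC in the -2 log L form used in the COMPUTE instructions."""
    aic = -2.0 * logl + 2.0 * nreg
    sbc = -2.0 * logl + nreg * math.log(nobs)
    return aic, sbc

# Eight-lag regression: log likelihood -0.4213, 9 regressors, 203 observations.
aic8, sbc8 = info_criteria(-0.4213, nreg=9, nobs=203)

# (lags, AIC, SBC) copied from the DISPLAY output of the loop.
table = [(0, 30.68736, 34.00057), (1, 24.73007, 31.35648),
         (2, 24.58260, 34.52222), (3, 23.88187, 37.13469),
         (4, 19.43505, 36.00108), (5, 21.43312, 41.31236),
         (6, 16.65477, 39.84721), (7, 18.33419, 44.83984),
         (8, 18.84250, 48.66135)]

best_aic = min(table, key=lambda row: row[1])[0]
best_sbc = min(table, key=lambda row: row[2])[0]
print(round(aic8, 3), round(sbc8, 3), best_aic, best_sbc)
```

Since both criteria share the same $-2 \log L$ term, the gap between SBC and AIC in each row is just the number of regressors times $(\log 203 - 2)$, which is why the SBC penalizes the longer lag lengths so much more heavily here.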
It is important to note that there are many equivalent ways to report the AIC
and SBC for linear regressions, which is fine as long as you use the same
formula in comparing models. The following eliminate from the $-2 \log L$ term
the additive terms that depend only upon the number of observations:

$$ AIC = T \log(RSS) + 2k $$
$$ SBC = T \log(RSS) + k \log T $$

and can be computed with

com aic = %nobs*log(%rss)+%nreg*2
com bic = %nobs*log(%rss)+%nreg*log(%nobs)

You can also divide through the formulas by the number of observations. Since
the number of observations should match in models that you are comparing
with the information criteria, neither of these changes will affect the rank or-
derings of models.
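To see why the two forms rank models identically, it helps to write out the Normal log likelihood of a least squares regression evaluated at the maximum likelihood variance estimate $RSS/T$. This is the standard textbook expression and is offered here as a sketch of the reasoning rather than a statement of how RATS organizes the calculation:

```latex
\begin{aligned}
\log L &= -\frac{T}{2}\left[\log(2\pi) + \log\!\left(\frac{RSS}{T}\right) + 1\right] \\
-2\log L &= T\log(RSS) + T\left[\log(2\pi) + 1 - \log T\right]
\end{aligned}
```

The second term on the last line depends only on $T$, so it is the same for every model estimated over a common sample and drops out of any comparison.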
We can now re-run the regression with the 6 lags picked using AIC:

linreg dresids
# resids{1} dresids{1 to 6}

Note that the t-statistic on the lagged residual is slightly different here from
what it was for six lags when we did the loop (-4.44300 vs -4.37831). This
is because the earlier regression used the sample range that allowed for eight
lags, while this one has re-estimated using the longer range allowed by only six
lags. It's a fairly common practice in this type of analysis to pick the number of
lags based upon a common range (which is necessary for using the information
criteria), then re-estimate with the chosen lag length using as much data as
possible.
4 With 203 usable observations, the 5% critical value is -3.368.
Two other standard procedures can be helpful in avoiding some of the
programming shown above. @REGCRITS produces four model selection criteria
(including the AIC and SBC). Note that it uses a per-observation
likelihood-based version of the criteria:
$$ AIC = -2 \log(L)/T + 2k/T $$
$$ SBC = -2 \log(L)/T + k \log(T)/T $$

Again, this is fine as long as you use the same formula for each model that you
are comparing.
@REGCORRS produces an analysis of the residuals; with the options shown
below, it creates both a graph of the correlations and a table of running Q
statistics.
@regcrits
@regcorrs(number=24,qstats,report)

As mentioned earlier, the Engle-Granger test is important enough that there
is a separate procedure written to do the calculation above. That's @EGTEST.
To choose from up to 8 lags using AIC, you would do the following:
@egtest(lags=8,method=aic)
# tb1yr tb3mo

Note that this matches what we did, and gives the test statistic from the re-
estimated model:
Engle-Granger Cointegration Test
Null is no cointegration (residual has unit root)
Regression Run From 1961:04 to 2012:04
Observations 206
With 6 lags chosen from 8 by AIC
Constant in cointegrating vector
Critical Values from MacKinnon for 2 Variables

Test Statistic -4.44300**


1%(**) -3.95194
5%(*) -3.36688
10% -3.06609

2.5 ARMA(p,q) Models


Instead of the pure autoregressive process represented by equation (2.1), most
time-series models are based on the stochastic difference equation with p
autoregressive terms and q moving average terms. Consider
$$ y_t = a_0 + a_1 y_{t-1} + a_2 y_{t-2} + \ldots + a_p y_{t-p} + \varepsilon_t + \beta_1 \varepsilon_{t-1} + \ldots + \beta_q \varepsilon_{t-q} $$
where $y_t$ is the value of the variable of interest in time period t, the
values of $a_0$ through $a_p$ and $\beta_1$ through $\beta_q$ are coefficients,
and $\varepsilon_t$ is a white-noise stochastic disturbance with variance
$\sigma^2$.
As a practical matter, the order of the ARMA process is unknown to the
researcher and needs to be estimated. The typical tools used to identify p and q
are the autocorrelation function (ACF) and the partial autocorrelation function
(PACF). The autocorrelation function is the set of correlations between $y_t$
and $y_{t-i}$ for each value of i. Thus, the ACF is formed using
$$ \rho_i = \gamma_i / \gamma_0 $$
where $\gamma_i$ is the covariance between $y_t$ and $y_{t-i}$. As discussed in
Enders (2010), some of the key properties of the ACF are:

1. White-noise (i.e., $a_i = 0$ and $\beta_i = 0$): All autocorrelations are zero.
2. AR(1): For $a_1 > 0$, the values of $\rho_i$ decay geometrically with $\rho_i = a_1^i$. For
negative values of $a_1$, the decay is oscillatory.
3. MA(q): The autocorrelations cut to zero after lag q.
4. AR(2): The decay pattern can contain trigonometric components.
5. ARMA(1, q): If $a_1 > 0$ and q = 1, geometric decay after lag 1; if $a_1 < 0$ there
is oscillating geometric decay after lag 1.
6. ARMA(p, q): The ACF will begin to decay at lag q. The decay pattern can
contain trigonometric components.
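
The AR(1) property in item 2 can be checked numerically. The following is a minimal Python sketch (not RATS code) that simulates an AR(1) series and compares the sample autocorrelations with the geometric pattern described above. The function names, the sample size, the seed and the coefficient 0.5 are all illustrative assumptions.

```python
import random

def sample_acf(series, max_lag):
    """Return [rho_0, ..., rho_max_lag], where rho_i = gamma_i / gamma_0."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    gamma0 = sum(d * d for d in dev) / n
    acf = []
    for lag in range(max_lag + 1):
        gamma = sum(dev[t] * dev[t - lag] for t in range(lag, n)) / n
        acf.append(gamma / gamma0)
    return acf

def simulate_ar1(a1, n, seed=0):
    """Simulate y_t = a1*y_{t-1} + eps_t with standard Normal eps_t (hypothetical helper)."""
    rng = random.Random(seed)
    y = [0.0]
    for _ in range(n - 1):
        y.append(a1 * y[-1] + rng.gauss(0.0, 1.0))
    return y

y = simulate_ar1(0.5, 20000)
acf = sample_acf(y, 4)
for lag, rho in enumerate(acf):
    # second number: sample autocorrelation; third: theoretical value a1**lag
    print(lag, round(rho, 3), round(0.5 ** lag, 3))
```

With a long simulated series the two columns should be close, though they will never agree exactly because of sampling error.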

In contrast to the autocorrelation $\rho_i$, the partial autocorrelation between $y_t$ and
$y_{t-i}$ holds constant the effects of the intervening values $y_{t-1}$ through $y_{t-i+1}$.
A simple way to understand the partial autocorrelation function is to suppose
that the $y_t$ series is an ARMA(p, q) process that has been demeaned. Now consider
the series of regression equations

$$y_t = \phi_{11} y_{t-1} + e_t$$
$$y_t = \phi_{21} y_{t-1} + \phi_{22} y_{t-2} + e_t$$
$$y_t = \phi_{31} y_{t-1} + \phi_{32} y_{t-2} + \phi_{33} y_{t-3} + e_t$$

where $e_t$ is an error term (that may not be white-noise).
The partial autocorrelation function (PACF) is given by the values $\phi_{11}$, $\phi_{22}$, $\phi_{33}$,
etc., that is, the coefficient on the final lag of each regression. For a pure AR(p)
process, $\phi_{p+1,p+1}$ is necessarily zero. Hence, the PACF of an AR(p) process will cut
to zero after lag p. In contrast, the PACF of a pure MA process will decay.
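
The contrast between the two cases can be illustrated with a short Python sketch (not RATS code). Rather than running the successive regressions, it uses the Durbin-Levinson recursion, which produces the same final-lag coefficients directly from a set of autocorrelations. This is a swapped-in technique chosen for brevity; the function name and the two example ACFs (an AR(1) with coefficient 0.5 and an MA(1) with coefficient 0.5) are assumptions for illustration.

```python
def pacf_from_acf(rho):
    """Given rho = [1, rho_1, ..., rho_K], return [phi_11, ..., phi_KK]
    using the Durbin-Levinson recursion."""
    max_lag = len(rho) - 1
    pacf = []
    prev = []  # coefficients phi_{k-1,1}, ..., phi_{k-1,k-1} from the previous step
    for k in range(1, max_lag + 1):
        if k == 1:
            phi_kk = rho[1]
            cur = [phi_kk]
        else:
            num = rho[k] - sum(prev[j] * rho[k - 1 - j] for j in range(k - 1))
            den = 1.0 - sum(prev[j] * rho[j + 1] for j in range(k - 1))
            phi_kk = num / den
            cur = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)]
            cur.append(phi_kk)
        pacf.append(phi_kk)
        prev = cur
    return pacf

# Theoretical ACF of an AR(1) with a1 = 0.5: rho_i = 0.5**i
ar1_pacf = pacf_from_acf([0.5 ** i for i in range(5)])

# Theoretical ACF of an MA(1) with beta1 = 0.5: rho_1 = beta1/(1 + beta1**2), zero afterwards
ma1_pacf = pacf_from_acf([1.0, 0.5 / 1.25, 0.0, 0.0, 0.0])

print([round(v, 4) for v in ar1_pacf])  # cuts to zero after lag 1
print([round(v, 4) for v in ma1_pacf])  # decays with alternating signs
```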

2.6 Estimation of an ARMA(p,q) process with RATS


The Box-Jenkins methodology is a three-step procedure: Identification, Esti-
mation, and Diagnostic Checking. Each is described below.

2.6.1 Identification

The first step in the Box-Jenkins methodology is to identify several plausible


models. A time-series plot of the series and a careful examination of the ACF
and PACF of the series can be especially helpful. Be sure to check for outliers
and missing values. If there is no clear choice for p and q, entertain a number of
reasonable models. If the series has a pronounced trend or meanders without
showing a strong tendency to revert to a constant long-run value, you should
consider differencing the variables. As discussed in later chapters, the current
practices in such circumstances involve testing for unit roots and/or structural
breaks.
Although you can compute and plot the correlations and partial correlations using
CORRELATE and GRAPH, it's much quicker to use the @BJIDENT procedure:

@BJIDENT( options ) series start end

series Series used to compute the correlations.


start end Range of entries to use. The default is the entire series.

The principal options are:

DIFF=number of regular differences [0]


SDIFFS=number of seasonal differences [0]
TRANS=[NONE]/LOG/ROOT
Chooses the preliminary transformation (if any). ROOT means the square
root.
NUMBER=number of correlations to compute
The default is the integer value of T/4.

2.6.2 Estimation

Although it is straightforward to estimate an AR(p) process using LINREG, the
situation is more complicated when MA terms are involved. Since the values
of $\varepsilon_t$, $\varepsilon_{t-1}$, . . . are not observed, it isn't possible to let the lagged values of these
error terms be regressors in an OLS estimation. Instead, models with MA terms
are generally estimated using maximum likelihood techniques. The form of the
RATS instruction used to estimate an ARMA model is:

BOXJENK( options ) series start end residuals

For our purposes, the important options are:

AR=number of autoregressive parameters [0]


MA=number of moving average parameters [0]
DIFFS=number of regular differences [0]
CONSTANT/[NOCONSTANT]
Note: by default, a constant is not included in the model.
SAR=number of seasonal autoregressive parameters [0]
SMA=number of seasonal moving average parameters [0]
DEFINE=name of the EQUATION to define from this
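
To see why MA terms rule out OLS, note that the errors can only be recovered recursively once values for the coefficients have been chosen. The Python sketch below (not RATS code, and not a description of how BOXJENK works internally) does this for an MA(1) with a known mean, under the assumption that the error just before the sample is zero. The function names and the data are hypothetical.

```python
def ma1_residuals(y, mu, beta, presample_eps=0.0):
    """Recover eps_t from y_t = mu + eps_t + beta*eps_{t-1}.
    The pre-sample error is an assumption (zero by default)."""
    eps = []
    prev = presample_eps
    for value in y:
        current = value - mu - beta * prev
        eps.append(current)
        prev = current
    return eps

def sum_of_squares(y, mu, beta):
    """Sum of squared recovered errors for a trial value of beta."""
    return sum(e * e for e in ma1_residuals(y, mu, beta))

# Build a short MA(1) series from known errors (mu = 1.0, beta = 0.5, pre-sample error 0)
true_eps = [0.3, -1.1, 0.8, 0.2, -0.5, 1.4, -0.9]
y = []
prev = 0.0
for e in true_eps:
    y.append(1.0 + e + 0.5 * prev)
    prev = e

recovered = ma1_residuals(y, 1.0, 0.5)
for trial in (0.0, 0.5, 0.9):
    print(trial, round(sum_of_squares(y, 1.0, trial), 4))
```

Because the recovered errors change with every trial value of the coefficient, the "regressor" is itself a function of the parameters, which is why an iterative nonlinear method is needed.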

As with the LINREG instruction, BOXJENK creates a number of internal vari-


ables that you can use in subsequent computations. A partial list of these
variables includes the coefficient vector %BETA, the vector of the t-statistics
%TSTATS, the degrees of freedom %NDF, the number of observations %NOBS, the
number of regressors %NREG, and the residual sum of squares %RSS.
BOXJENK also creates the variable %CONVERGED. %CONVERGED = 1 if the
estimation converged and %CONVERGED = 0 if it didn't.
It is important to remember that the default is to not include an intercept
term in the model. Moreover, the reported value of CONSTANT is actually the
estimate of the mean (not the estimate of $a_0$).[5] The relationship between the
mean, $\mu$, and the intercept, $a_0$, is

$$\mu = a_0 / (1 - a_1 - a_2 - \ldots - a_p)$$
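
The relationship between the mean and the intercept follows from taking expectations of the autoregressive part of the model. A short derivation, assuming the process is stationary so that $E(y_t) = \mu$ for every t and $E(\varepsilon_t) = 0$:

```latex
\begin{aligned}
E(y_t) &= a_0 + a_1 E(y_{t-1}) + a_2 E(y_{t-2}) + \ldots + a_p E(y_{t-p}) \\
\mu    &= a_0 + (a_1 + a_2 + \ldots + a_p)\,\mu \\
a_0    &= \mu\,(1 - a_1 - a_2 - \ldots - a_p)
\end{aligned}
```

The MA terms drop out because each has expectation zero, so the same conversion applies to a general ARMA(p,q) model.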
After the candidate set of models has been estimated, they should be compared
using a number of criteria including:

Parsimony
Select models with low values of p and q. Large values of p and q will neces-
sarily increase the fit, but will reduce the number of degrees of freedom. Poorly
estimated coefficients have large standard errors and will generally result in
poor forecasts. Moreover, high order ARMA(p, q) models can usually be well-
approximated by low-order models.

Goodness of Fit
The most popular goodness-of-fit measures are the Akaike Information
Criterion (AIC) and the Schwarz Bayesian Criterion (SBC). You can construct these
measures using the same code as for the LINREG instruction.
[5] This parameterization makes it simpler to do more general regressions with ARIMA errors.

2.6.3 Diagnostic Checking

It is important to check the residuals for any remaining serial correlation. The
GRAPH, STATISTICS, and CORRELATE instructions applied to the residuals can
help you determine whether or not the residuals are well-described as a
white-noise process. Any pattern in the residuals means that your equation is
misspecified. As described below, you can use recursive estimations to check for
coefficient stability.

2.7 An Example of the Price of Finished Goods


The ideas in the previous section can be illustrated by considering an extended
example of the producer price index. This is Example 2.3, file RPM2_3.RPF.
We'll again read in the data with

cal(q) 1960:1
all 2012:4
*
open data quarterly(2012).xls
data(org=obs,format=xls)

If you look at a time series graph of PPI,[6] it should be clear that it is not
stationary and needs to be differenced. It turns out that it is best to work with the
logarithmic difference. This variable can be created using:

log ppi / ly
dif ly / dly

We could do this in one step, but we'll also have use for the LY series itself.
Now graph the (log differenced) series and obtain the ACF and PACF using

spgraph(footer="Price of Finished Goods",hfield=2,vfield=1)


graph(header="Panel a: Quarterly Growth Rate") 1
# dly
@bjident(separate,number=12) dly
spgraph(done)

This produces Figure 2.3. Notice that we wrapped an SPGRAPH around our own
GRAPH, and the graphs produced by @BJIDENT. @BJIDENT also uses SPGRAPH
to split a graph space vertically between the ACF on top and the PACF on the
bottom. These end up splitting the right pane in the SPGRAPH that we define.
This is how nested SPGRAPHs work.
The plot of the series, shown in Panel a of Figure 2.3 indicates that there was a
run-up of prices in the early 1970s and a sharp downward spike in 2008:4.
[6] The quickest way to do that is by doing the menu operation View-Series Window, click on
the PPI series and the Time Series Graph toolbar icon.

[Figure 2.3 graphic. Left panel: "Panel a: Quarterly Growth Rate" (time-series plot of DLY).
Right panels, headed "0 Differences of DLY": Correlations (top) and Partial Correlations
(bottom) for lags 0 to 12.]

Figure 2.3: Price of Finished Goods

However, for our purposes, the series seems reasonably well-behaved. Al-
though the ACF (shown in the upper right-hand portion of the figure) decays,
the decay does not seem to be geometric. Notice that the PACF has significant
spikes at lags 1 and 3. As such, at this point in the analysis, there are several
possible candidates:

1. AR(3): The ACF does not exhibit simple geometric decay so an AR(1) is
likely to be inappropriate. Moreover, the PACF does not display a simple
decay pattern; instead, the partial autocorrelations at lags 1 and 3 are significant.

2. Low-order ARMA(p, q): Neither the ACF nor the PACF display simple de-
cay patterns. As such, the process may be mixed in that it contains AR
and MA terms.

Estimating the series as an AR(3) can be done using LINREG or BOXJENK. To


illustrate the use of BOXJENK, we have

boxjenk(constant,ar=3) dly

The coefficient block of the output is:


Variable Coeff Std Error T-Stat Signif
************************************************************************************
1. CONSTANT 0.008431166 0.002141902 3.93630 0.00011349
2. AR{1} 0.478711560 0.068156413 7.02372 0.00000000
3. AR{2} -0.008559549 0.076086058 -0.11250 0.91053898
4. AR{3} 0.228904929 0.068504836 3.34144 0.00099121

If we reestimate the model without the insignificant AR(2) term, we obtain:

boxjenk(constant,ar=||1,3||) dly

Variable Coeff Std Error T-Stat Signif


************************************************************************************
1. CONSTANT 0.0084324694 0.0021480622 3.92562 0.00011810
2. AR{1} 0.4752570939 0.0607005314 7.82954 0.00000000
3. AR{3} 0.2253908909 0.0608218529 3.70576 0.00027113

Checking the residuals for serial correlation, it should be clear that the model
performs well. Note the use of the option DFC=%NARMA. BOXJENK sets the in-
ternal variable %NARMA with the number of AR + MA coefficients, which is
required to correct the degrees of freedom for the Q-statistics when they are
computed for the residuals from an ARMA estimation. Here, %NARMA is equal
to 2.
corr(number=8,qstats,span=4,dfc=%narma,picture=".#.###") %resids

Correlations of Series %RESIDS


Quarterly Data From 1961:01 To 2012:04

Autocorrelations
1 2 3 4 5 6 7 8
0.022 -0.037 -0.003 -0.123 0.067 0.150 -0.035 -0.067

Ljung-Box Q-Statistics
Lags Statistic Signif Lvl
4 3.610 0.164446
8 10.678 0.098844
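
The Q-statistics and the effect of the degrees-of-freedom correction can be approximated by hand. The Python sketch below (not RATS code) applies the standard Ljung-Box formula to the rounded autocorrelations printed above, with T = 208 observations (1961:1 to 2012:4), and computes significance levels using lags minus %NARMA degrees of freedom. The function names are hypothetical, and the closed-form tail probability used here is valid only for an even number of degrees of freedom.

```python
import math

def ljung_box_q(autocorrs, nobs):
    """Q = T(T+2) * sum over k of rho_k**2 / (T - k), for k = 1..len(autocorrs)."""
    return nobs * (nobs + 2) * sum(
        r * r / (nobs - k) for k, r in enumerate(autocorrs, start=1)
    )

def chi2_sf_even_df(x, df):
    """Upper-tail chi-squared probability; this closed form holds only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total

rho = [0.022, -0.037, -0.003, -0.123, 0.067, 0.150, -0.035, -0.067]  # printed values
nobs, narma = 208, 2

q4 = ljung_box_q(rho[:4], nobs)
q8 = ljung_box_q(rho, nobs)
print(round(q4, 3), round(q8, 3))  # near the printed 3.610 and 10.678, up to rounding of rho

# Significance levels from the printed statistics, with df = lags - narma
p4 = chi2_sf_even_df(3.610, 4 - narma)
p8 = chi2_sf_even_df(10.678, 8 - narma)
print(round(p4, 4), round(p8, 4))
```

The Q values differ slightly from the RATS output because the autocorrelations are only available to three decimals here.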

The fit of the model can be obtained using:

com aic = -2.0*%logl + %nreg*2


com sbc = -2.0*%logl + %nreg*log(%nobs)
display "aic = " aic "bic = " sbc

aic = -1353.86042 bic = -1343.84781
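
The arithmetic behind the two COMPUTE instructions is simple enough to check outside RATS. In the Python sketch below (not RATS code), the log likelihood of 679.93021 is not printed in the text; it is an assumption backed out from the reported AIC with 3 regressors, and 208 is the number of quarters from 1961:1 to 2012:4.

```python
import math

def information_criteria(logl, nreg, nobs):
    """AIC and SBC in the same form as the COMPUTE instructions above."""
    aic = -2.0 * logl + 2.0 * nreg
    sbc = -2.0 * logl + nreg * math.log(nobs)
    return aic, sbc

logl = 679.93021  # assumption: backed out as (2*3 + 1353.86042) / 2
aic, sbc = information_criteria(logl, nreg=3, nobs=208)
print(round(aic, 5), round(sbc, 5))
```

The SBC penalty per parameter, log(208), is about 5.34 versus 2 for the AIC, which is why the SBC tends to pick smaller models.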

Next, estimate the series as an ARMA(1, 1) process. Since the two models are
to be compared head-to-head, they need to be estimated over the same sample
period. The estimation for the ARMA(1,1) is constrained to begin on 1961:1
(the first usable observation for the AR model with three lags).

boxjenk(constant,ar=1,ma=1) dly 1961:1 *


com aic = -2.0*%logl + %nreg*2
com sbc = -2.0*%logl + %nreg*log(%nobs)
display "aic = " aic "bic = " sbc

Variable Coeff Std Error T-Stat Signif


************************************************************************************
1. CONSTANT 0.008516961 0.002099853 4.05598 0.00007093
2. AR{1} 0.810358229 0.066987586 12.09714 0.00000000
3. MA{1} -0.393822229 0.105012809 -3.75023 0.00022985

aic = -1346.07120 bic = -1336.05858



[Figure 2.4 graphic: residual autocorrelations for lags 1 to 8, footer "ARMA(1,1) Model".
Annotations on the graph: AIC= -6.471 SBC= -6.423, Q= 16.65 P-value 0.01066.]

Figure 2.4: Residual analysis for ARMA(1,1)

The AR({1,3}) model is the clear favorite of the information criteria.[7]

In addition, the residual correlations for the ARMA model are unsatisfactory:

corr(number=8,qstats,span=4,dfc=%narma,picture=".#.###") %resids

Correlations of Series %RESIDS


Quarterly Data From 1961:01 To 2012:04

Autocorrelations
1 2 3 4 5 6 7 8
0.045 -0.140 0.092 -0.105 0.040 0.176 -0.040 -0.070

Ljung-Box Q-Statistics
Lags Statistic Signif Lvl
4 8.751 0.012583
8 17.174 0.008665

Note that you can also use @REGCORRS to do the analysis of the correlations,
displaying the autocorrelations (Figure 2.4), the Q statistic, and the AIC and SBC:[8]

@regcorrs(number=8,qstats,dfc=%narma,footer="ARMA(1,1) Model")

Exercise 2.1 Experiment with the following:


[7] Since the number of estimated parameters in the two models is the same, the two criteria
must agree on the ordering.
[8] The AIC and SBC on the output from @REGCORRS have been divided by the number of
observations, which doesn't change the ordering, and makes them easier to display.

1. @bjident(report,qstats,number=8) resids
2. box(ar=5,constant) dly / resids
versus
box(ar=||1,3||,constant) dly / resids
3. box(ar=2,ma=1,constant) dly / resids
dis %converged

2.8 Automating the Process


The Box-Jenkins methodology provides a useful framework to arrive at a
carefully constructed model that can be used for forecasting. However, to some, the
method is not scientific in that two equally skilled researchers might come up
with slightly different models. Moreover, it may not be practical to carefully
examine each series if there are many series to process. An alternative is to
use a completely data-driven way to select p and q. The method is to estimate
every possible ARMA(p, q) model and to select the one that provides the best fit.
Although it might seem time-intensive to estimate a model for every possible
p and q, RATS can do it almost instantly. This is done in Example 2.4, file
RPM2_4.RPF.
The procedure @BJAUTOFIT allows you to specify the maximum values for p
and q and estimates all of the varied ARMA models.[9] You can select whether to
use the AIC or the SBC as the criterion for model selection.

@BJAUTOFIT( options ) series start end

The important options are

PMAX=maximum value of p
QMAX=maximum value of q
CRIT=[AIC]/SBC
DIFFS=number of regular differencings [0]
SDIFFS=number of seasonal differencings [0]
CONSTANT/[NOCONSTANT]

It is instructive to understand the programming methods used within the pro-


cedure.
[9] It does not allow for zeroing out intermediate lags. An AR(3) is considered but not an
AR({1,3}). There are simply too many possibilities if lags are allowed to be skipped.

2.8.1 Introduction to DO Loops

The DO loop is the simplest and most heavily used of the programming features
of RATS (and any other statistical programming language). It's a very simple
way to automate many of your repetitive programming tasks. The usual
structure of a DO loop is:

do index=n1,n2,increment
program statements
end do index

where N1, N2 and INCREMENT are integer numbers or variables. If INCREMENT
is omitted, the default value of 1 is used. The index is a name that you assign
to be the counter through the loop. Typically, this is a (very) short name, with I,
J and K being the most common ones. I and J are, in fact, defined variables in
RATS precisely because they are such common loop indexes (which is why you
can't use the I name for an "investment" series). There is, of course, nothing
preventing you from using something more descriptive like LAG or DRAW.
The DO loop is really a counter. The first time that the DO instruction is ex-
ecuted, the index variable takes on the value N1 and the block of program
statements is executed. On reaching the end of the loop, the index is increased
by INCREMENT (that is, index is now N1+INCREMENT). If the index is less than
or equal to N2, the block of program statements is executed again. On reaching
the end of the loop, the value of the index is again increased by INCREMENT
and again compared to N2. This process is repeated until the value of the index
exceeds N2. At that point, RATS exits the loop and subsequent instructions can
be performed.
There are two differences among DO loop implementations in different lan-
guages that you need to keep in mind:

1. In RATS, if N1>N2, the loop is not executed at all (that is, the test is at the
top of the loop). In a few languages, the loop is always executed once (the
test is at the bottom).
2. In RATS, the loop index has the value from the last executed pass through
the statements, not the value that would force the termination of the loop.
This can be quite different from language to language (in some, it's not
even necessarily defined outside the loop).
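
The counting rules described above can be mimicked in a few lines. The following Python sketch (not RATS code) emulates the behavior as the text describes it: the test is made at the top of the loop, and the value reported afterwards is the one from the last executed pass. The function name is hypothetical, only positive increments are handled, and returning None when the loop never runs is an assumption of this sketch rather than a statement about RATS.

```python
def run_do_loop(n1, n2, increment=1):
    """Emulate 'do index=n1,n2,increment' for a positive increment.
    Returns the list of index values visited and the index after the loop."""
    if increment <= 0:
        raise ValueError("this sketch only handles positive increments")
    visited = []
    last_executed = None  # assumption: left undefined if the loop never runs
    counter = n1
    while counter <= n2:  # test at the top: no pass at all when n1 > n2
        visited.append(counter)
        last_executed = counter
        counter += increment
    return visited, last_executed

print(run_do_loop(1, 10, 3))  # ([1, 4, 7, 10], 10)
print(run_do_loop(0, 3))      # ([0, 1, 2, 3], 3)
print(run_do_loop(5, 3))      # ([], None)
```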

DO loops are particularly useful because they can be nested. Consider the fol-
lowing section of code:

do q=0,3
do p=0,3
boxjenk(constant,ar=p,ma=q) y
end do p
end do q

The key point to note is that the AR and MA options of the BOXJENK instruction
do not need to be set equal to specific numbers. Instead, they are set equal to
the counters P and Q. The first time through the two loops, P and Q are both
zero, so BOXJENK estimates an ARMA(0,0) model (that is, just the mean).
Since the P loop is the inner loop, the value of P is incremented by 1 but Q
remains at 0. Hence, RATS next estimates an AR(1) model. Once the AR(3)
model is estimated, control falls out of the DO P loop, and Q is incremented.
The DO P loop is started again with P=0, but now with Q=1. In the end, all
16 combinations of ARMA(p,q) models with $0 \le p \le 3$ and $0 \le q \le 3$ will be
estimated.
The output produced by this small set of instructions can be overwhelming. In
practice, it is desirable to suppress most of the output except for the essential
details. A simple way to do this is to use the NOPRINT option of BOXJENK.
The following estimates the 16 models over a common period (1961:1 is the
earliest that can handle 3 lags on the difference), displays the AIC, SBC, and
also shows the value of %CONVERGED. The results from an estimation that has
not converged are clearly garbage, though it's very rare that a model where the
estimation would fail to converge would be selected anyway, since, for an ARMA
model, a failure to converge is usually due to overparameterization, which is
precisely what the AIC and SBC are trying to penalize.
do q=0,3
do p=0,3
boxjenk(noprint,constant,ar=p,ma=q) dly 1961:1 *
com aic=-2*%logl+%nreg*2
com sbc=-2*%logl+%nreg*log(%nobs)
disp "Order("+p+","+q+")" "AIC=" aic "SBC=" sbc "OK" %converged
end do p
end do q

Order(0,0) AIC= -1265.14664 SBC= -1261.80910 OK 1


Order(1,0) AIC= -1342.37368 SBC= -1335.69861 OK 1
Order(2,0) AIC= -1342.78978 SBC= -1332.77717 OK 1
Order(3,0) AIC= -1351.87333 SBC= -1338.52317 OK 1
Order(0,1) AIC= -1326.95443 SBC= -1320.27935 OK 1
Order(1,1) AIC= -1346.07120 SBC= -1336.05858 OK 1
Order(2,1) AIC= -1346.06399 SBC= -1332.71383 OK 1
Order(3,1) AIC= -1352.68025 SBC= -1335.99256 OK 1
Order(0,2) AIC= -1327.74892 SBC= -1317.73630 OK 1
Order(1,2) AIC= -1349.46788 SBC= -1336.11773 OK 1
Order(2,2) AIC= -1353.41486 SBC= -1336.72717 OK 1
Order(3,2) AIC= -1351.42556 SBC= -1331.40034 OK 1
Order(0,3) AIC= -1345.26392 SBC= -1331.91377 OK 1
Order(1,3) AIC= -1350.81865 SBC= -1334.13096 OK 1
Order(2,3) AIC= -1351.41686 SBC= -1331.39164 OK 1
Order(3,3) AIC= -1353.87773 SBC= -1330.51496 OK 1

The minimum AIC is (3,3) (just barely better than the (2,2)), while SBC shows a
clear choice at (3,0). If we want to use AIC as the criterion, we should probably
try a higher limit since, as it is, it's being minimized at the upper bound in both
parameters.
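
Picking the winners out of a listing like this by eye is error-prone, so it can help to do the comparison mechanically. The Python sketch below (not RATS code) stores the 16 AIC and SBC values from the output above, keyed by (p, q), and finds the order that minimizes each criterion. The variable names are hypothetical.

```python
# (p, q): (AIC, SBC), copied from the DISPLAY output above
results = {
    (0, 0): (-1265.14664, -1261.80910), (1, 0): (-1342.37368, -1335.69861),
    (2, 0): (-1342.78978, -1332.77717), (3, 0): (-1351.87333, -1338.52317),
    (0, 1): (-1326.95443, -1320.27935), (1, 1): (-1346.07120, -1336.05858),
    (2, 1): (-1346.06399, -1332.71383), (3, 1): (-1352.68025, -1335.99256),
    (0, 2): (-1327.74892, -1317.73630), (1, 2): (-1349.46788, -1336.11773),
    (2, 2): (-1353.41486, -1336.72717), (3, 2): (-1351.42556, -1331.40034),
    (0, 3): (-1345.26392, -1331.91377), (1, 3): (-1350.81865, -1334.13096),
    (2, 3): (-1351.41686, -1331.39164), (3, 3): (-1353.87773, -1330.51496),
}

best_aic = min(results, key=lambda order: results[order][0])
best_sbc = min(results, key=lambda order: results[order][1])
print("AIC picks", best_aic, "SBC picks", best_sbc)

# Flag a choice that sits on the boundary of the search grid
pmax = max(p for p, _ in results)
qmax = max(q for _, q in results)
aic_on_bound = best_aic[0] == pmax or best_aic[1] == qmax
print("AIC choice on upper bound:", aic_on_bound)
```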

A similar analysis can be done with @BJAUTOFIT:

@bjautofit(constant,pmax=3,qmax=3,crit=aic) dly

AIC analysis of models for series DLY


MA
AR 0 1 2 3
0 -1287.8512 -1349.7689 -1350.5439 -1368.6927
1 -1365.5246 -1369.9612 -1373.2313 -1374.6395
2 -1365.9782 -1371.0569 -1377.5783* -1375.7936
3 -1375.0658 -1375.5717 -1375.7714 -1377.5110

You'll note that the values (and the decision) are somewhat different. That's
because @BJAUTOFIT uses maximum likelihood (rather than conditional least
squares), and, because it uses the (more complicated) maximum likelihood
estimator, it can use the full data range for DLY from 1960:2 on. You'll also note
that it's presented in a more convenient table rather than the line-at-a-time
listing that we got using DISPLAY. @BJAUTOFIT (and most other standard RATS
procedures) uses the REPORT instructions to format its output. We'll learn
more about that later in the book.

Exercise 2.2 You can see how well the method works using the following code:

set eps = %ran(1)


set(first=%ran(1)) y1 = 0.5*y1{1} + eps
@bjautofit(pmax=3,qmax=3,constant) y1

Repeat using

set y2 = 0.
set y2 3 * = 0.5*y2{1} + 0.25*y2{2} + eps
@bjautofit(pmax=3,qmax=3,constant) y2
set y3 = 0.5*eps{1} + eps
@bjautofit(pmax=3,qmax=3,constant) y3
set(first=%ran(1)) y4 = 0.5*y4{1} + eps + 0.5*eps{1}
@bjautofit(pmax=3,qmax=3,constant) y4

Be aware that the model with the best in-sample fit may not be the one that
provides the best forecasts. It is always necessary to perform the appropriate
diagnostic checks on the selected model. Autofitting techniques are designed
to be a tool to help you select the most appropriate model.

2.9 An Example with Seasonality


Many series have a distinct seasonal pattern such that their magnitudes are
predictably higher during some parts of the year than in others. In order to
properly model such series, it is necessary to capture both the seasonal and

[Figure 2.5 graphic: two side-by-side panels, "0 Differences of DLY" (left) and
"0 Differences of M" (right), each showing CORRS and PARTIALS for lags 0 to 25.]

Figure 2.5: ACF/PACF of Differences of Log Currency

nonseasonal dependence. Since the Federal Reserve injects currency into the
financial system during the winter quarter, we would expect the stock of
currency in the current winter quarter to aid in predicting the stock for the next
winter. Fortunately, the autocorrelations and partial autocorrelations often
reflect the pattern of the seasonality. In Example 2.5 (file RPM2_5.RPF), we again
pull in the data set QUARTERLY(2012).XLS:

cal(q) 1960:1
all 2012:4
open data quarterly(2012).xls
data(org=obs,format=xls)

We then compute the first difference of log currency, and the combined regular
and seasonal difference of log currency, with:

set ly = log(curr)
dif ly / dly
dif(sdiffs=1,dif=1) ly / m

In terms of lag operators, $m_t = (1 - L^4)(1 - L)\,ly_t$. The following graphs (Figure
2.5) the ACF and PACF of the two types of differenced data:

spgraph(footer="ACF and PACF of dly and m",hfields=2,vfields=1)


@bjident dly
@bjident m
spgraph(done)

The ACF and PACF of the dly series (shown in the left-hand side panel) indicate
that $\rho_1$ is significant at the 5% level. However, the pronounced features of
the figure are the spikes in the autocorrelations at lags 4, 8, 12, 16, 20
and 24. Obviously, these correlations correspond to the seasonal frequency of
the data. If we focus only on the autocorrelations at the seasonal frequency, it
should be clear that there is little tendency for these autocorrelations to decay.
In such circumstances, it is likely that the data contains a seasonal unit-root.
Hence, we can transform the data by forming the seasonal difference of $dly_t$ as:

$$m_t = dly_t - dly_{t-4}.$$
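
The two ways of writing m (seasonally differencing dly, or applying $(1 - L^4)(1 - L)$ to ly directly) are the same calculation. The Python sketch below (not RATS code) confirms this on made-up data and shows how many observations the combined differencing costs. The function name and the data values are hypothetical.

```python
def difference(series, lag=1):
    """Return the lag-difference x_t - x_{t-lag}; the first `lag` entries are lost."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Made-up "log currency" values: a trend plus a repeating quarterly pattern
ly = [0.02 * t + [0.00, 0.01, -0.02, 0.03][t % 4] + 0.001 * (t % 7) for t in range(24)]

dly = difference(ly, 1)        # regular difference
m_two_step = difference(dly, 4)  # seasonal difference of dly

# Expanding (1 - L^4)(1 - L) = 1 - L - L^4 + L^5
m_direct = [ly[t] - ly[t - 1] - ly[t - 4] + ly[t - 5] for t in range(5, len(ly))]

lost = len(ly) - len(m_two_step)
print(lost)  # 5 observations are lost: 1 from the regular and 4 from the seasonal difference
```

With quarterly data beginning in 1960:1, losing five observations means m is first available in 1961:2.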
If you are unsure as to the number of differences to use, the procedure @BJDIFF
can be helpful. The following reports the Schwarz criteria allowing for a
maximum of one regular difference and one seasonal difference, with and without a
constant:
BJDiff Table, Series CURR
Reg Diff Seas Diff Intercept Crit
0 0 No -0.104162
0 0 Yes -2.558801
0 1 No -8.222783
0 1 Yes -8.994149
1 0 No -8.524110
1 0 Yes -9.098987
1 1 No -9.477942*
1 1 Yes -9.453089

As indicated by the asterisk (*), the log transformation using one seasonal
difference, one regular difference, and no intercept seems to be the most
appropriate.[10]
Because there are now four choices that need to be made (regular AR and MA,
seasonal AR and MA) rather than just two, selecting a model for a seasonal
ARMA can be quite tricky. In general, it's a good idea to start with simple
models, estimate and see if there's residual autocorrelation remaining. Here,
we'll start with four: (1,1,0)×(0,1,1), (0,1,1)×(0,1,1) (sometimes known as
the "airline model"), (1,1,0)×(1,1,0) and (0,1,1)×(1,1,0). This covers all four
cases with exactly one parameter in the regular polynomial (either one AR or
one MA) and one in the seasonal (one SAR or one SMA). We'll use @REGCORRS to
compute the Q, and the information criteria for each. For comparison purposes,
we need to estimate the models over the same sample period. Note that the
model with one AR term and one seasonal AR term can begin no earlier than
1962:3, with losses due to both the differencing and the lags for the AR and
SAR parameters. We'll use NOPRINT while we're just doing a crude check of
various models.
[10] You can formally test for a unit root and a seasonal unit root using the Hylleberg, Engle,
Granger, and Yoo (1990) test. The procedure @HEGY will perform the test using quarterly data.

boxjenk(noprint,constant,ar=1,sma=1) m 1962:3 *
@regcorrs(title="(1,1,0)x(0,1,1)",dfc=%narma)
display "aic = " %aic "bic = " %sbc "Q(signif)" *.### %qsignif
*
boxjenk(noprint,constant,ma=1,sma=1) m 1962:3 *
@regcorrs(title="(0,1,1)x(0,1,1)",dfc=%narma)
display "aic = " %aic "bic = " %sbc "Q(signif)" *.### %qsignif
*
boxjenk(noprint,constant,ar=1,sar=1) m 1962:3 *
@regcorrs(title="(1,1,0)x(1,1,0)",dfc=%narma)
display "aic = " %aic "bic = " %sbc "Q(signif)" *.### %qsignif
*
boxjenk(noprint,constant,ma=1,sar=1) m 1962:3 *
@regcorrs(title="(0,1,1)x(1,1,0)",dfc=%narma)
display "aic = " %aic "bic = " %sbc "Q(signif)" *.### %qsignif

aic = -7.11001 bic = -7.06088 Q(signif) 0.566


aic = -7.10238 bic = -7.05325 Q(signif) 0.442
aic = -6.93147 bic = -6.88234 Q(signif) 0.002
aic = -6.92417 bic = -6.87504 Q(signif) 0.001

The first of the four models seems to be slightly preferred over the second, with
the other two being inadequate. We can now take the NO off the NOPRINT to
look at it more carefully:

boxjenk(print,constant,ar=1,sma=1) m 1962:3 *
@regcorrs(title="(1,1,0)x(0,1,1)",dfc=%narma)
display "aic = " %aic "bic = " %sbc "Q(signif)" *.### %qsignif

Box-Jenkins - Estimation by LS Gauss-Newton


Convergence in 10 Iterations. Final criterion was 0.0000032 <= 0.0000100
Dependent Variable M
Quarterly Data From 1962:03 To 2012:04
Usable Observations 202
Degrees of Freedom 199
Centered R2 0.5102749
R-Bar2 0.5053530
Uncentered R2 0.5105079
Mean of Dependent Variable 0.0002124435
Std Error of Dependent Variable 0.0097610085
Standard Error of Estimate 0.0068650289
Sum of Squared Residuals 0.0093785957
Log Likelihood 721.1113
Durbin-Watson Statistic 1.9192
Q(36-2) 38.9946
Significance Level of Q 0.2551573

Variable Coeff Std Error T-Stat Signif


************************************************************************************
1. CONSTANT 0.000071690 0.000209506 0.34219 0.73257113
2. AR{1} 0.432725378 0.063065071 6.86157 0.00000000
3. SMA{4} -0.767040488 0.047004350 -16.31850 0.00000000

The Q statistic is fine[11] and there are no obvious signs of problems in the
individual autocorrelations on the graph (not shown). The only change to the
model that seems likely to make a difference would be to take the constant out.
@BJDIFF chose a model with no constant, and this would confirm that that
seemed to be the correct recommendation.[12]
The procedure for automating selection of a seasonal ARMA model (given the
choice of differencing) is @GMAUTOFIT. This puts an upper bound, but not lower
bound, on the number of parameters in each polynomial, so it includes models
with no parameters. Even with a limit of 1 on each, there are quite a few
candidate models. And, fortunately, it agrees with our choice:

@gmautofit(regular=1,seasonal=1,noconst,full,report) m

AR MA AR(s) MA(s) LogL BIC


0 0 0 0 666.1212169 -1332.24243
0 0 0 1 713.3733549 -1421.41399
0 0 1 0 700.2558277 -1395.17894
0 0 1 1 714.0155643 -1417.36569
0 1 0 0 689.8974268 -1374.46213
0 1 0 1 736.2116913 -1461.75795
0 1 1 0 720.3065378 -1429.94764
0 1 1 1 736.3675337 -1456.73691
1 0 0 0 689.7674212 -1374.20212
1 0 0 1 737.3251643 -1463.98489*
1 0 1 0 721.2150088 -1431.76458
1 0 1 1 737.4792222 -1458.96029
1 1 0 0 690.6894892 -1370.71354
1 1 0 1 737.9898838 -1459.98161
1 1 1 0 721.7517229 -1427.50529
1 1 1 1 738.1317848 -1454.93269

2.10 Forecasts and Diagnostic Checks


Perhaps the primary use of the Box-Jenkins methodology is to develop a fore-
casting model. The selection of the proper AR and MA coefficients is important
because any model specification errors will be projected into the future. If your
model is too small, it will not capture all of the dynamics present in the series.
As such, you want to forecast with a model that does not contain any remain-
ing correlation in the residuals. Clearly, one way to eliminate serial correlation
in the residuals is to add one or more AR or MA coefficients. However, you
need to be cautious about simply increasing the number of estimated param-
eters. All parameters estimated are subject to sampling error. If your model
includes a parameter with a large standard error, this estimation error will be
projected into the future. To take a specific example, suppose that $y_t$ is
properly estimated as an AR(1) process. However, suppose that someone estimates
the series as $y_t = a_0 + a_1 y_{t-1} + a_2 y_{t-2} + \varepsilon_t$. Now, if $a_2$ turned out to be exactly
zero, there would not be a problem forecasting with this model. In point of
fact, the probability that the estimated value of $a_2$ is exactly zero is
essentially zero. On average, forecasts from the model will contain
the error $a_2 y_{t-2}$.

[11] The Q in the output above and the Q in the @REGCORRS output use different numbers of
autocorrelations, but they lead to the same conclusion.
[12] A model with two differences (in any combination) plus a constant would have a quadratic
drift, which isn't likely.
Most diagnostic checks begin with a careful examination of the residuals. Plot
the residuals and their ACF to ensure that they behave as white noise. This is
most easily done using the @REGCORRS procedure discussed earlier. For now, we
consider the issue of how to use a properly estimated ARIMA model to forecast.

Out of Sample Forecasts


Probably the most important use of time-series models is to provide reliable
forecasts of the series of interest. After all, once the essential properties of
the dynamic process governing the evolution of $y_t$ have been estimated, it is
possible to project these properties into the future to obtain forecasts. To take
a simple example, suppose that the evolution of $y_t$ has been estimated to be the
ARMA(1,1) process:

$$y_t = a_0 + a_1 y_{t-1} + \beta_1 \varepsilon_{t-1} + \varepsilon_t$$

so that

$$y_{t+1} = a_0 + a_1 y_t + \beta_1 \varepsilon_t + \varepsilon_{t+1}$$

Given the estimates of $a_0$, $a_1$, and $\beta_1$ along with the estimated residual series,
the conditional expectation of $y_{t+1}$ is

$$E_t y_{t+1} = E_t[a_0 + a_1 y_t + \beta_1 \varepsilon_t + \varepsilon_{t+1}] = a_0 + a_1 y_t + \beta_1 \varepsilon_t$$

and the conditional expectation of $y_{t+2}$ is

$$E_t y_{t+2} = a_0 + a_0 a_1 + a_1 \beta_1 \varepsilon_t + a_1^2 y_t$$
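
The pattern behind these expressions is a recursion: the one-step forecast uses the last observed value and residual, and every later step feeds the previous forecast back in, with future errors replaced by their expected value of zero. A minimal Python sketch (not RATS code) with hypothetical coefficient and data values:

```python
def arma11_forecasts(a0, a1, beta1, y_t, eps_t, steps):
    """Conditional expectations E_t y_{t+1}, ..., E_t y_{t+steps} for an ARMA(1,1).
    Requires steps >= 1."""
    forecasts = [a0 + a1 * y_t + beta1 * eps_t]   # one step ahead uses eps_t
    for _ in range(steps - 1):
        forecasts.append(a0 + a1 * forecasts[-1])  # later errors have expectation zero
    return forecasts

# Hypothetical values, chosen only for illustration
a0, a1, beta1 = 0.2, 0.8, -0.4
y_t, eps_t = 1.5, 0.3

f = arma11_forecasts(a0, a1, beta1, y_t, eps_t, steps=4)
two_step_closed_form = a0 + a0 * a1 + a1 * beta1 * eps_t + a1 ** 2 * y_t
print([round(v, 6) for v in f], round(two_step_closed_form, 6))
```

The second element of the list agrees with the closed-form two-step expression above, and as the horizon grows the forecasts approach $a_0/(1 - a_1)$ whenever $|a_1| < 1$.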
The arithmetic gets quite messy with larger models and for larger forecasting
horizons. However, the important point is that RATS can readily perform all
of the required calculations using the UFORECAST and FORECAST instructions.
Although the FORECAST instruction is the more flexible of the two, UFORECAST
is very easy to use for single-equation models. The syntax of UFORECAST is:

UFORECAST( options ) series start end

series This is where the forecasts will be saved.


start end The range to forecast. You can use this, or the FROM, TO and
STEPS options to set this; choose whichever is simplest in a
given situation.

The most important options are

EQUATION=name of the equation to use for forecasting


FROM=starting period of the forecasts
TO=ending period of the forecasts
STEPS=number of forecast steps to compute
ERRORS=series containing the forecast errors
STDERRS=series of standard errors of the forecasts
PRINT/[NOPRINT]
Note that the default is to not print the forecasts.

Notice that UFORECAST requires the name of a previously estimated
equation to use for forecasting. Whenever you estimate a linear regression using
LINREG or an equation using BOXJENK, use the option to DEFINE the model.
UFORECAST uses this previously defined model for forecasting purposes. If you
omit the EQUATION= option, UFORECAST will use the most recently estimated
equation. In Example 2.6 (file RPM2_6.RPF), we'll take the model chosen in
Section 2.7 and use it to forecast the logarithmic change of the PPI two years
beyond the end of the sample (2012:4).
We need to define an EQUATION usable for forecasting when we estimate the
model, so we add the DEFINE option to the BOXJENK.

boxjenk(constant,ar=||1,3||,define=ar1_3) dly

In order to forecast, the equation is defined as AR1_3. Now UFORECAST can be
used to instruct RATS to produce the 1- through 8-step ahead forecasts using

uforecast(equation=ar1_3,print) forecasts 2013:1 2014:4

Entry DLY
2013:01 0.003903092
2013:02 0.007167626
2013:03 0.007082460
2013:04 0.006769988
2014:01 0.007357279
2014:02 0.007617198
2014:03 0.007670298
2014:04 0.007827904

You can get the same results with

ufore(equation=ar1_3,print,steps=8) forecasts

as UFORECAST by default starts the forecasts one period beyond the (most re-
cent) estimation range, so this requests 8 forecast steps. You pick the combi-
nation of STEPS, FROM and TO options, or the start and end parameters that
you find most convenient.
Although the FORECAST instruction is typically used for multiequation fore-
casts, it can also forecast using a single equation. Subsequent chapters will
consider FORECAST in more detail. For now, it suffices to indicate that the
identical output can be obtained using

forecast(print,steps=8,from=2013:1) 1
# ar1_3 forecasts

If you are going to forecast, there is an important difference between the fol-
lowing two instructions:

boxjenk(constant,ar=||1,3||,define=ar1_3) dly

and
boxjenk(define=ar_alt,constant,ar=||1,3||,dif=1) ly

The coefficients are identical, the fits are identical, but in the second case, the
original dependent variable is LY, not its difference DLY, and the equation it
produces is one that has LY as the dependent variable, so the forecasts are of
LY, not of DLY. You can use DISPLAY to look at the two equations:

disp ar1_3 ar_alt

Dependent Variable DLY


Variable Coeff
***********************************
1. Constant 0.0025242767
2. DLY{1} 0.4752570939
3. DLY{3} 0.2253908909

Dependent Variable LY
Variable Coeff
***********************************
1. Constant 0.002524277
2. LY{1} 1.475257094
3. LY{2} -0.475257094
4. LY{3} 0.225390891
5. LY{4} -0.225390891
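The LY equation is the DLY equation multiplied through by the differencing operator (1 - L). As a rough check outside RATS, the following Python sketch multiplies the two lag polynomials using the AR coefficients displayed for AR1_3; the helper name is hypothetical.

```python
def multiply_lag_polynomials(p, q):
    """Multiply two lag polynomials given as coefficient lists in powers of L."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


phi1, phi3 = 0.4752570939, 0.2253908909   # AR coefficients shown for AR1_3
ar_poly = [1.0, -phi1, 0.0, -phi3]        # 1 - phi1*L - phi3*L^3
diff_poly = [1.0, -1.0]                   # 1 - L

level_poly = multiply_lag_polynomials(ar_poly, diff_poly)
# Moving the lags to the right-hand side flips the sign of every term after the first.
level_coeffs = [-c for c in level_poly[1:]]
print(level_coeffs)   # coefficients on LY{1}, LY{2}, LY{3}, LY{4}
```

Up to rounding, these are the four lag coefficients shown in the LY equation, while the constant is unchanged.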

You can verify that these are identical if you substitute out $DLY_t = LY_t - LY_{t-1}$. We
can forecast LY itself and graph the forecasts (with the final four years of actual
data, Figure 2.6) using:

ufore(equation=ar_alt,print) forecasts 2013:1 2014:4


graph(footer="Actual and Forecasted Values of the log PPI") 2
# ly 2009:1 *
# forecasts

Figure 2.6: Actual and Forecasted Values of log PPI

2.11 Examining the Forecast Errors

Given that the forecasts will contain some error, it would be desirable to have a
model that produces unbiased forecasts with the smallest possible mean square
error. Although the errors of forecasts beyond the end of the sample cannot be known
in advance, it is possible to put alternative models to a head-to-head test. With the
PPI series, for example, you could compare the AR({1,3}) and ARMA(1,1) specifications
by holding back 50 observations (about 1/4 of the data set) and performing the
following steps:

1. Construct two series to contain the forecast errors. For simplicity, we will
let error1 contain the forecast errors from the AR({1,3}) and error2
contain the forecast errors from the ARMA(1,1).

2. For each time period between 2000:2 and 2012:3, estimate the two mod-
els from the start of the data set through that time. Construct one-step
forecasts and save the errors.

Analyzing these two series can give you insight into which of the two models
generates forecast errors with the most desirable properties. For example, if
the mean of the forecast errors from the AR({1,3}) model is closer to zero than
that of the ARMA(1,1), you might prefer to use the AR model to forecast beyond
the actual end of the data set (i.e., 2012:4).
It's a good idea to avoid using hard-coded dates on the individual instructions
if at all possible. If we wrote all this with 2000:2 and 2012:3, for instance, then,
if we (or a referee) decided that we really should start in 1998:2 instead, we
would have to change several lines and hope that we caught all the places that
mattered. So in Example 2.7 (file RPM2_7.RPF), after reading the data and
creating DLY as we've done before, we'll start with:

compute dataend=2012:4
compute baseend=dataend-50

which define DATAEND as the end of the observed data and BASEEND as 50 en-
tries earlier. You can even avoid the hard-coded 2012:4 by using the %ALLOCEND()
function, which gives the entry number of the end of the standard range of the
workspace.
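Dates in these programs behave like integer entry numbers, which is why DATAEND-50 is a valid expression. As an illustration only (the Python helpers below are hypothetical and assume a quarterly calendar whose first entry is 1960:1, as in the examples), the arithmetic works out as follows.

```python
def entry(year, quarter, base_year=1960, base_quarter=1):
    """Entry number of year:quarter when base_year:base_quarter is entry 1."""
    return (year - base_year) * 4 + (quarter - base_quarter) + 1


def label(n, base_year=1960, base_quarter=1):
    """Inverse of entry(): turn an entry number back into 'YYYY:Q'."""
    k = n - 1 + (base_quarter - 1)
    return f"{base_year + k // 4}:{k % 4 + 1}"


dataend = entry(2012, 4)      # last observation
baseend = dataend - 50        # end of the first estimation sample
print(dataend, baseend, label(baseend), label(baseend + 1))
```

So the first estimation sample ends at 2000:2 and the first forecast error is for 2000:3, which gives 50 one-step forecast errors through 2012:4.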
The error series are initialized with
set error1 baseend+1 * = 0.
set error2 baseend+1 * = 0.

This is necessary when you are filling entries of a series one-at-a-time using
COMPUTE instructions, which is what we will be doing here with the forecast
errors: RATS needs to set aside space for the information.
The working instructions are:

do end=baseend,dataend-1
   boxjenk(constant,ar=||1,3||,define=ar1_3) dly * end
   boxjenk(constant,ar=1,ma=1,define=arma) dly * end
   ufore(equation=ar1_3,steps=1) f1
   ufore(equation=arma,steps=1) f2
   compute error1(end+1)=dly(end+1)-f1(end+1)
   compute error2(end+1)=dly(end+1)-f2(end+1)
end do end

This runs a loop over the end of the estimation sample, with the variable END
used to represent that. Why is it through DATAEND-1 rather than DATAEND?
If we estimate through DATAEND, there is no data to which to compare a
one-step-ahead forecast. There wouldn't be any real harm in doing it, since the
forecast errors would just be missing values, but it's better to write the program
describing what you actually need, rather than relying on the RATS missing
value handlers to fix the mistake.
By default, the UFORECAST instructions forecast from one period beyond the
end of the previous regression range, which is what we want, so we only need
to indicate the number of steps. The forecast errors are computed using the
actual data DLY and the forecasts (F1 and F2) for the period END+1.
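The structure of this loop (estimate through END, forecast END+1, store the error) is not specific to RATS. The following Python sketch repeats it for a simulated AR(1) estimated by OLS; the data, the model, and all names are assumptions made for illustration, and Python lists are indexed from 0 rather than 1.

```python
import random


def ols_ar1(y):
    """OLS estimates (c, phi) of y[t] = c + phi*y[t-1] + e[t]."""
    x, z = y[:-1], y[1:]
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxz = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    phi = sxz / sxx
    return mz - phi * mx, phi


# Simulated data standing in for DLY (hypothetical parameter values).
random.seed(0)
y = [0.0]
for _ in range(199):
    y.append(0.2 + 0.5 * y[-1] + random.gauss(0.0, 1.0))

holdback = 50
dataend = len(y) - 1
baseend = dataend - holdback

errors = []
for end in range(baseend, dataend):      # end = baseend, ..., dataend-1
    c, phi = ols_ar1(y[: end + 1])       # estimate using data through `end` only
    forecast = c + phi * y[end]          # one-step forecast of y[end+1]
    errors.append(y[end + 1] - forecast)

mean_error = sum(errors) / len(errors)
print(len(errors), mean_error)
```

Stopping at dataend-1 yields exactly `holdback` errors, each of which has an actual value to compare against.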

Exercise 2.3 The loop above could just as easily have been written with the
loop index running over the start of the forecast period rather than the end of
the estimation period. What changes would be necessary to do that?

table / error1 error2

Series Obs Mean Std Error Minimum Maximum


ERROR1 50 -0.0004350166 0.0136450332 -0.0698340471 0.0214472245
ERROR2 50 -0.0004141885 0.0137877256 -0.0709608893 0.0186385371

Notice that the forecast errors from the AR({1,3}) model have a larger mean
(in absolute value) than those of the ARMA(1,1) model but have a smaller stan-
dard error. A simple way to test whether the mean of the forecast errors is
significantly different from zero is to regress the actual values on the predicted
values. Consider:

linreg dly
# constant f1
test(title="Test of Unbiasedness of AR(3) forecasts")
# 1 2
# 0 1

If the forecasts are unbiased, the intercept should be equal to zero and the
slope coefficient should be equal to unity, which is what the TEST instruction is
testing. The results are:
Test of Unbiasedness of AR(3) forecasts
F(2,48)= 3.32730 with Significance Level 0.04433528

Repeating for the ARMA forecasts:

linreg dly
# constant f2
test(title="Test of Unbiasedness of ARMA forecasts")
# 1 2
# 0 1

gives
Test of Unbiasedness of ARMA forecasts
F(2,48)= 3.55490 with Significance Level 0.03633255

Although both models show some bias, there is a slight preference for the
AR({1,3}).
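For a linear regression, a joint test of intercept = 0 and slope = 1 can be written as a comparison of restricted and unrestricted sums of squares, where the restricted residual is simply actual minus forecast. The Python sketch below computes that F statistic for made-up data; the function name and the numbers are hypothetical, and it is meant only to show what kind of quantity the F(2,48) values above are.

```python
def unbiasedness_f(actual, forecast):
    """F statistic for H0: a = 0 and b = 1 in actual = a + b*forecast + u."""
    n = len(actual)
    mf, ma = sum(forecast) / n, sum(actual) / n
    sff = sum((f - mf) ** 2 for f in forecast)
    b = sum((f - mf) * (y - ma) for f, y in zip(forecast, actual)) / sff
    a = ma - b * mf
    ssr_u = sum((y - a - b * f) ** 2 for f, y in zip(forecast, actual))
    ssr_r = sum((y - f) ** 2 for f, y in zip(forecast, actual))  # imposes a=0, b=1
    # Two restrictions in the numerator, n-2 degrees of freedom in the denominator.
    return ((ssr_r - ssr_u) / 2) / (ssr_u / (n - 2))


# Hypothetical data: errors that average to zero, then the same data shifted by 0.2.
forecast = [0.1 * i for i in range(20)]
noise = [0.05 if i % 2 == 0 else -0.05 for i in range(20)]
actual_ok = [f + e for f, e in zip(forecast, noise)]
actual_biased = [y + 0.2 for y in actual_ok]

f_ok = unbiasedness_f(actual_ok, forecast)
f_biased = unbiasedness_f(actual_biased, forecast)
print(f_ok, f_biased)
```

A constant shift in the actual values leaves the unrestricted fit unchanged but inflates the restricted sum of squares, so the statistic rises sharply for the biased forecasts.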

Mean Square Forecast Errors


Not only are we concerned about the bias of the forecast errors, most researchers
would also want the dispersion of the errors to be as small as possible. Al-
though the standard error of the forecasts from the AR({1,3}) model is smaller
than that from the ARMA(1,1) model, we might want to know if it is possible
to conclude that there is a statistical difference between the two (that is, reject
the null hypothesis that the variance of the error1 series is equal to that of
the error2 series).
The Granger-Newbold test (Granger and Newbold (1973)) can be used to test
for a difference between the sums of squared forecast errors:
$$d = (SSR_1 - SSR_2)/N$$
where $SSR_1$ and $SSR_2$ are the sums of squared forecast errors from models 1
and 2, respectively, and $N$ is the number of forecast errors. If the two models
forecast equally well, this difference should be zero. Under the assumptions
that

1. the forecast errors have the same means and are normally distributed
and
2. the errors are serially uncorrelated

Granger and Newbold show that the following has a t-distribution with $N-1$
degrees of freedom:
$$\frac{r\sqrt{N-1}}{\sqrt{1-r^2}}$$
where $r$ is the correlation coefficient between $x_t$ and $z_t$, defined as $x_t = e_{1t} + e_{2t}$ and
$z_t = e_{1t} - e_{2t}$, and $e_{1t}$ and $e_{2t}$ are the forecast errors from the alternative models. If
you reject the null hypothesis $r = 0$, conclude that the model with the smallest
residual sum of squares has the smallest mean square forecast errors. Instead
of programming the test yourself, it is simpler to use the procedure @GNEWBOLD.
In this case, the instruction is

@gnewbold dly f1 f2

giving us
Granger-Newbold Forecast Comparison Test
Forecasts of DLY over 2000:03 to 2012:04

Forecast Test Stat P(GN>x)


F1 -0.3295 0.62842
F2 0.3295 0.37158

Hence, we do not reject the null hypothesis that the two mean square forecast
errors are equal: there is no evidence that the forecast errors from the AR({1,3})
model differ in dispersion from those of the ARMA(1,1). The first set (the AR model)
is slightly better since it's showing the negative test value, but that isn't
statistically significant.
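The statistic described in this section is short enough to compute by hand. The Python sketch below does so for two made-up error series; it uses the ordinary (mean-adjusted) sample correlation, which is an assumption on our part and not necessarily the exact convention inside @GNEWBOLD.

```python
import math


def correlation(x, z):
    """Ordinary sample correlation coefficient."""
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    sxz = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    sxx = sum((a - mx) ** 2 for a in x)
    szz = sum((b - mz) ** 2 for b in z)
    return sxz / math.sqrt(sxx * szz)


def granger_newbold(e1, e2):
    """r*sqrt(N-1)/sqrt(1-r^2), with r the correlation of (e1+e2) and (e1-e2)."""
    n = len(e1)
    x = [a + b for a, b in zip(e1, e2)]
    z = [a - b for a, b in zip(e1, e2)]
    r = correlation(x, z)
    return r * math.sqrt(n - 1) / math.sqrt(1.0 - r * r)


# Hypothetical forecast errors; the first series is the less dispersed one.
e1 = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.3]
e2 = [0.5, -0.3, 0.2, -0.6, 0.1, 0.1, -0.3, 0.4]

stat_12 = granger_newbold(e1, e2)
stat_21 = granger_newbold(e2, e1)
print(stat_12, stat_21)
```

Swapping the two series only changes the sign of the statistic, which is why the two rows of the procedure's output mirror each other, and the less dispersed series gets the negative value.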
Diebold and Mariano (1995) have shown how to modify the Granger-Newbold
test for the case in which the forecast errors are serially correlated. Although
one-step-ahead forecast errors like these ideally shouldn't be serially correlated, when
we check them with:
corr(qstats,span=4) error1
corr(qstats,span=4) error2

we find that that doesn't seem to be the case here (note particularly the 4th
lag):

Correlations of Series ERROR1


Quarterly Data From 2000:03 To 2012:04

Autocorrelations
1 2 3 4 5 6 7 8 9 10
0.0884 -0.1558 -0.2303 -0.3086 0.0422 0.1078 -0.1068 -0.0497 -0.0260 -0.0212
11 12
0.2272 0.0192

Ljung-Box Q-Statistics
Lags Statistic Signif Lvl
4 10.046 0.039667
8 11.678 0.166162
12 15.215 0.229873
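The four-lag Q statistic in this table can be recomputed from the printed autocorrelations using the usual Ljung-Box formula, Q = N(N+2) times the sum of r_k^2/(N-k). The Python sketch below does this with N = 50; the function names are hypothetical, and the chi-squared tail formula used is the closed form that holds only for even degrees of freedom.

```python
import math


def ljung_box(autocorrs, n_obs):
    """Ljung-Box Q statistic from autocorrelations at lags 1..len(autocorrs)."""
    return n_obs * (n_obs + 2) * sum(
        r * r / (n_obs - k) for k, r in enumerate(autocorrs, start=1)
    )


def chi2_sf_even_df(x, df):
    """Upper-tail chi-squared probability; closed form valid only for even df."""
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


rho = [0.0884, -0.1558, -0.2303, -0.3086]   # first four autocorrelations of ERROR1
q4 = ljung_box(rho, n_obs=50)
p4 = chi2_sf_even_df(q4, df=4)
print(round(q4, 3), round(p4, 4))
```

Up to the rounding of the printed autocorrelations, this matches the statistic and significance level reported for 4 lags, with the fourth lag contributing more than half of the total.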

The Diebold-Mariano test is done with the @DMARIANO procedure, which has
similar syntax to @GNEWBOLD but requires a LAGS option. The procedure does
a linear regression with a HAC covariance matrix; we recommend that you
also include the option LWINDOW=NEWEY, as the default truncated lag window
(which was recommended in the original paper) doesn't guarantee a
positive-definite covariance matrix estimator.
@dmariano(lags=4,lwindow=newey) dly f1 f2

Diebold-Mariano Forecast Comparison Test


Forecasts of DLY over 2000:03 to 2012:04
Test Statistics Corrected for Serial Correlation of 4 lags
Forecast MSE Test Stat P(DM>x)
F1 0.00018265 -0.5723 0.71645
F2 0.00018647 0.5723 0.28355

Again, the mean square forecast errors are so similar that the null hypothesis of
no significant difference is not rejected.
Note that the asymptotics of the Diebold-Mariano test break down if the two
models are nested (one is a special case of the other). That's not the case here.
A full AR(3) vs the AR({1,3}) would be a pair of nested models and couldn't be
compared using a straight Diebold-Mariano test.
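To make the role of the lag window concrete, here is a minimal Python sketch of a Diebold-Mariano style statistic for squared-error loss: the mean of the loss differential divided by a standard error built from a Bartlett (Newey-West) weighted long-run variance. It is not the source of @DMARIANO, the error series are made up, and small-sample conventions in the procedure may differ.

```python
import math


def diebold_mariano(e1, e2, lags):
    """Mean of d_t = e1_t^2 - e2_t^2 over its Newey-West standard error."""
    d = [a * a - b * b for a, b in zip(e1, e2)]
    n = len(d)
    dbar = sum(d) / n
    dev = [v - dbar for v in d]

    def gamma(k):
        # Sample autocovariance of the loss differential at lag k.
        return sum(dev[t] * dev[t - k] for t in range(k, n)) / n

    # Bartlett weights decline linearly, which keeps the variance estimate non-negative.
    lrv = gamma(0) + 2.0 * sum(
        (1.0 - k / (lags + 1.0)) * gamma(k) for k in range(1, lags + 1)
    )
    return dbar / math.sqrt(lrv / n)


# Hypothetical forecast errors; the first series has the smaller mean square error.
e1 = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.3]
e2 = [0.5, -0.3, 0.2, -0.6, 0.1, 0.1, -0.3, 0.4]

dm_12 = diebold_mariano(e1, e2, lags=2)
dm_21 = diebold_mariano(e2, e1, lags=2)
print(dm_12, dm_21)
```

With a truncated (equal-weight) window the bracketed variance term can turn negative, in which case the square root fails; the declining weights are what rule that out.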

2.12 Coefficient Stability


If the model adequately reflects the data generating process, the coefficients
should be stable over time. In other words, the coefficients should not change
dramatically when we estimate the model over different sample periods. A simple
way to check for coefficient stability is to use recursive estimates. That's
what we'll do in Example 2.8 (file RPM2_8.RPF). Consider the following segment
of code for the transformed PPI series. We first set up the target series:
two for the recursive estimates of the coefficients (INTERCEPT for the constant
term and AR1 for the first autoregressive parameter), and two for their estimated
standard errors.

set intercept = 0.
set ar1 = 0.
set sd0 = 0.
set sd1 = 0.

Next, loop over each entry from 1980:1 to 2012:4, estimating the model through
that period and saving the four items required. Note again that we add a NOPRINT
option to the BOXJENK so we don't produce pages of unnecessary output. If,
however, you find that something seems to be wrong, don't hesitate to re-activate
the PRINT to see what's happening. Because BOXJENK uses an iterative
estimation process, the loop also checks whether there is any problem with
convergence; that's unlikely for such a simple model, but it doesn't hurt to
be safe.
do end = 1980:1,2012:4
   boxjenk(constant,ar=||1,3||,noprint) dly * end
   com intercept(end) = %beta(1), ar1(end) = %beta(2)
   com sd0(end) = %stderrs(1) , sd1(end) = %stderrs(2)
   if %converged<>1
      dis "###DID NOT CONVERGE for " %datelabel(end)
end do end

When we run this, we don't get any messages about non-convergence. We can
construct confidence intervals around the estimated coefficients by adding and
subtracting 1.64 standard deviations [13] to the estimated coefficient values. Consider:

set plus0 = intercept + 1.64*sd0


set minus0 = intercept - 1.64*sd0
set plus1 = ar1 + 1.64*sd1
set minus1 = ar1 - 1.64*sd1
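The multiplier 1.64 is the rounded 95th percentile of the standard normal distribution: a two-sided 90% band leaves 5% in each tail. A quick check in Python (the coefficient and standard error below are made-up numbers) shows how little the rounding matters.

```python
from statistics import NormalDist

z_90 = NormalDist().inv_cdf(0.95)   # 95th percentile of the standard normal

coef, se = 0.475, 0.060             # hypothetical estimate and standard error
band_rounded = (coef - 1.64 * se, coef + 1.64 * se)
band_exact = (coef - z_90 * se, coef + z_90 * se)
print(round(z_90, 4), band_rounded, band_exact)
```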

Now, we can graph the coefficients and the confidence intervals using:

spgraph(hfield=2,vfields=1,$
footer="Coefficient Estimates with 90% Confidence Intervals")
graph(header="The Estimated Mean") 3
# intercept 1980:1 *
# plus0 1980:1 * 2
# minus0 1980:1 * 2
graph(header="The AR(1) Coefficient") 3
# ar1 1980:1 *
# plus1 1980:1 * 2
# minus1 1980:1 * 2
spgraph(done)

[13] If you want greater accuracy, you can use %INVNORMAL(.95) in place of 1.64.

Figure 2.7: Coefficient Estimates with 90% Confidence Intervals
(two panels: The Estimated Mean; The AR(1) Coefficient)

This produces Figure 2.7. The recursive estimates for both coefficients are quite
stable over time. Since the earliest estimates use a small number of observations,
it is not surprising that the confidence intervals are widest for these early
periods. Sometimes it is preferable to estimate recursive regressions using a
rolling window instead of an expanding window. With an expanding window
the number of observations increases as you approach the end of the sample.
With a fixed, or rolling window, you use the same number of observations in
each regression. The way to modify the code to have a rolling window is
to allow the start date to increase by 1 and the end date to increase by 1 every
time through the loop. To modify the code to have a rolling window with, say,
75 observations use:

compute width=75
do end = 1980:1,2012:4
   boxjenk(constant,ar=||1,3||,noprint) dly end-width+1 end
   com intercept(end) = %beta(1), ar1(end) = %beta(2)
   com sd0(end) = %stderrs(1) , sd1(end) = %stderrs(2)
   if %converged<>1
      dis "DID NOT CONVERGE for t = " %datelabel(end)
end do end

The first time through the loop, the estimation uses the 75 observations from
1961:3 (74 periods before 1980:1) to 1980:1, the second time through the
observations from 1961:4 to 1980:2, and so on.
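The difference between the two schemes is only in how the start of the range moves. The Python sketch below lists the (start, end) entry pairs for each; the function names are hypothetical, and the entry numbers assume a quarterly calendar with 1960:1 as entry 1, so that 1980:1 is entry 81 and 2012:4 is entry 212.

```python
def rolling_windows(first_end, last_end, width):
    """Yield (start, end) entry pairs for a fixed-width window, both ends inclusive."""
    for end in range(first_end, last_end + 1):
        yield end - width + 1, end


def expanding_windows(start, first_end, last_end):
    """Yield (start, end) pairs where only the end point advances."""
    for end in range(first_end, last_end + 1):
        yield start, end


rolling = list(rolling_windows(81, 212, width=75))
expanding = list(expanding_windows(1, 81, 212))
print(rolling[0], rolling[1], expanding[0], expanding[-1])
```

The first rolling pair is (7, 81); entry 7 is 1961:3, in line with the dates given in the text.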
For models which are, in fact, linear in the parameters (like this one), another
way to obtain the recursive estimates with an expanding window is to use the
RATS instruction RLS (for Recursive Least Squares). The syntax is:

RLS( options ) series start end resids
# list of explanatory variables in regression format

The principal options are:

EQUATION=equation to estimate
COHISTORY=VECTOR[SERIES] of coefficient estimates
SEHISTORY=VECTOR[SERIES] of coefficient standard errors

Hence, similar output to that reported above can be obtained using:

rls(cohist=coeffs,sehist=serrs,equation=ar1_3) dly
set plus0 = coeffs(1) + 1.64*serrs(1)
set minus0 = coeffs(1) - 1.64*serrs(1)
set plus1 = coeffs(2) + 1.64*serrs(2)
set minus1 = coeffs(2) - 1.64*serrs(2)

There are several differences between RLS and what you get by the DO loop
method:

1. The residuals in the first method (either %RESIDS or the series saved
by the resids parameter) are recomputed for the full sample each time
through. RLS produces recursive residuals where the time t residual is the
(standardized) predictive error for period t given the previous estimates,
and the residuals for t-1 and earlier are left alone at time t. Recursive residuals have
many nice properties for analyzing model stability.
2. Because BOXJENK uses a different parameterization for the intercept/mean
than a linear regression, those won't be directly comparable. The autoregressive
coefficients will be the same, however.
3. RLS cannot be used with MA terms, so we cannot estimate the ARMA(1,1)
model using RLS.
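For reference, the usual textbook definition of a recursive residual for the linear regression $y_t = x_t'\beta + u_t$ is shown below, where $\hat{\beta}_{t-1}$ and $X_{t-1}$ use only the observations through $t-1$. This is offered as background under the assumption that it corresponds to the "(standardized) predictive error" in the first item of the list; the exact scaling applied by RLS is not spelled out in this section.

```latex
w_t = \frac{y_t - x_t'\hat{\beta}_{t-1}}
           {\sqrt{1 + x_t'\,(X_{t-1}'X_{t-1})^{-1}\,x_t}}
```

The numerator is the one-step prediction error, and the denominator scales it so that, under a correctly specified model with constant coefficients, every $w_t$ has the same variance as $u_t$.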

2.13 Tips and Tricks


2.13.1 Preparing a graph for publication

You may have noticed that all of our graphs used a FOOTER option on the GRAPH
(if it was stand-alone) or on the outer SPGRAPH. This is used in preference to
a HEADER, which is generally only used in inner graphs for SPGRAPH setups.
You may also have noticed that in almost all cases, that footer was gone in
the version that we included in the book. Most publications will put their own
caption on a graphic, so you probably don't want something similar to show
up in the graph itself. Since the graphic labeling is usually below, the footer,
rather than header, comes closest to the final appearance. The footer also uses
a smaller font, more similar to what will be used.
It wouldn't be a good idea to strip the footer (or header) out of the graph while
you're still doing the empirical work. The footer/header is used in the title bar
of the graph window, and in the Window menu to identify graphs for a closer
look. What we do is to use the GSAVE(NOFOOTER) instruction which was added
with RATS 8.2. This strips the outer footer out of a graph, but only when it is
exported in some way (either by being exported to a file or to the clipboard
by Edit-Copy).

2.13.2 Preparing a table for publication

If you check the Window-Report Windows menu after running one of the example
programs, you'll see anywhere from 1 to 20 reports queued up. These are
generated by instructions like LINREG, BOXJENK or FORECAST or procedures
like @BJDIFF or @REGCRITS. The ones at the top of the list will be the last
ones created. Note that these are only created if you PRINT the output; if you
do NOPRINT, RATS saves time by not formatting up the reports. If you select
one of the reports, it will load it into a window. However, unlike the standard
output that goes into the text-based output window, this is organized into a
spreadsheet-like table of rows and columns. And, even though you can't see it,
this has the full precision at which the calculations were done.
We will show later in the course how to use the RATS REPORT instruction to
generate a table with the specific information that you want, but in many
cases, you may be able to get by using just the re-loaded standard formatted
report. Any text-based copy and paste operation (to TeX or a format like
comma-delimited) will copy the numbers with the rounding shown in the window.
(Excel and similar formats will get full precision.) You can reformat any
contiguous block of cells to show whatever numerical format you want. Select
the cells you want to reformat, and choose Edit-Change Layout, or Reformat
on the right-click menu. Then select the cells you want to export and
copy-and-paste or export to a file. Note that Copy-TeX copies a TeX table into the
clipboard, so you can paste into a TeX document.

Example 2.1 Introduction to basic instructions


cal(q) 1960:1
all 2012:4
*
open data quarterly(2012).xls
data(org=obs,format=xls)
table(picture="*.##")

set dlrgdp = log(rgdp) - log(rgdp{1})


set dlm2 = log(m2) - log(m2{1})
set drs = tb3mo - tb3mo{1}
set dr1 = tb1yr - tb1yr{1}
set dlp = log(deflator) - log(deflator{1})
set dlppi = log(ppi) - log(ppi{1})

spgraph(footer="Graphs of the Series",hfields=2,vfields=2)


graph(header="Panel 1: The Interest Rates",key=below,nokbox) 2
# tb3mo
# tb1yr
graph(header="Panel 2: Real and Potential GDP",key=upleft) 2
# rgdp
# potent
graph(header="Panel 3: Time path of money growth",noaxis) 1
# dlm2
graph(header="Panel 4: Time path of Inflation",noaxis) 1
# dlp
spgraph(done)
*
linreg drs / resids
# constant drs{1 to 7}
*
corr(number=24,partial=partial,qstats,span=4,pic="##.###") resids
graph 1
# resids

exclude
# drs{5 to 7}

summarize
# drs{5 to 7}

test
# 6 7 8
# 0.1 0.1 0.1

test
# 1 2 3 4
# 0. 0.4 -0.1 0.4

restrict(create) 3 resids
# 2 3
# 1. 1. 0.
# 4 5
# 1. 1. 0.
# 5 6
# 1. 1. 0.

Example 2.2 Engle-Granger test with lag length selection


cal(q) 1960:1
all 2012:4
*
open data quarterly(2012).xls
data(org=obs,format=xls)
*
set dlrgdp = log(rgdp) - log(rgdp{1})
set dlm2 = log(m2) - log(m2{1})
set drs = tb3mo - tb3mo{1}
set dr1 = tb1yr - tb1yr{1}
set dlp = log(deflator) - log(deflator{1})
set dlppi = log(ppi) - log(ppi{1})
*
* Estimate "spurious regression"
*
linreg tb1yr / resids
# constant tb3mo

corr(num=8,results=cors,partial=partial,picture="##.###",qstats) resids

graph(nodates,number=0,style=bar,key=below,footer="ACF and PACF") 2
# cors
# partial
*
* Do E-G test with fixed lags
*
diff resids / dresids
linreg dresids
# resids{1} dresids{1 to 8}
*
* Do E-G test with different lag lengths
*
compute egstart=%regstart()
do i = 0,8
linreg(noprint) dresids egstart *
# resids{1} dresids{1 to i}
com aic = -2.0*%logl + %nreg*2
com sbc = -2.0*%logl + %nreg*log(%nobs)
dis "Lags: " i "T-stat" %tstats(1) "The aic = " aic " and sbc = " sbc
end do i

linreg dresids
# resids{1} dresids{1 to 6}
@regcrits
@regcorrs(number=24,qstats,report)
*
@egtest(lags=8,method=aic)
# tb1yr tb3mo

Example 2.3 Estimation and diagnostics on ARMA models


cal(q) 1960:1
all 2012:4
*
open data quarterly(2012).xls
data(org=obs,format=xls)
*
log ppi / ly
dif ly / dly
*
spgraph(footer="Price of Finished Goods",hfield=2,vfield=1)
graph(header="Panel a: Quarterly Growth Rate") 1
# dly
@bjident(separate,number=12) dly
spgraph(done)
*
boxjenk(constant,ar=3) dly
boxjenk(constant,ar=||1,3||) dly
corr(number=8,qstats,span=4,dfc=%narma,picture=".#.###") %resids
@regcorrs
com aic = -2.0*%logl + %nreg*2
com sbc = -2.0*%logl + %nreg*log(%nobs)
display "aic = " aic "bic = " sbc

boxjenk(constant,ar=1,ma=1) dly 1961:1 *


com aic = -2.0*%logl + %nreg*2
com sbc = -2.0*%logl + %nreg*log(%nobs)
display "aic = " aic "bic = " sbc

corr(number=8,qstats,span=4,dfc=%narma,picture=".#.###") %resids
@regcorrs(number=8,qstats,dfc=%narma,footer="ARMA(1,1) Model")

Example 2.4 Automated Box-Jenkins model selection


cal(q) 1960:1
all 2012:4
*
open data quarterly(2012).xls
data(org=obs,format=xls)

log ppi / ly
dif ly / dly

do q=0,3
do p=0,3
boxjenk(noprint,constant,ar=p,ma=q) dly 1961:1 *
com aic=-2*%logl+%nreg*2
com sbc=-2*%logl+%nreg*log(%nobs)
disp "Order("+p+","+q+")" "AIC=" aic "SBC=" sbc "OK" %converged
end do p
end do q
*
@bjautofit(constant,pmax=3,qmax=3,crit=aic) dly

Example 2.5 Seasonal Box-Jenkins Model


cal(q) 1960:1
all 2012:4
open data quarterly(2012).xls
data(org=obs,format=xls)

set ly = log(curr)
dif ly / dly
dif(sdiffs=1,dif=1) ly / m

spgraph(footer="ACF and PACF of dly and m",hfields=2,vfields=1)


@bjident dly
@bjident m
spgraph(done)

@bjdiff(diff=1,sdiffs=1,trans=log) curr

boxjenk(noprint,constant,ar=1,sma=1) m 1962:3 *
@regcorrs(title="(1,1,0)x(0,1,1)",dfc=%narma)
display "aic = " %aic "bic = " %sbc "Q(signif)" *.### %qsignif
*
boxjenk(noprint,constant,ma=1,sma=1) m 1962:3 *
@regcorrs(title="(0,1,1)x(0,1,1)",dfc=%narma)
display "aic = " %aic "bic = " %sbc "Q(signif)" *.### %qsignif
*
boxjenk(noprint,constant,ar=1,sar=1) m 1962:3 *
@regcorrs(title="(1,1,0)x(1,1,0)",dfc=%narma)
display "aic = " %aic "bic = " %sbc "Q(signif)" *.### %qsignif
*
boxjenk(noprint,constant,ma=1,sar=1) m 1962:3 *
@regcorrs(title="(0,1,1)x(1,1,0)",dfc=%narma)
display "aic = " %aic "bic = " %sbc "Q(signif)" *.### %qsignif
*
* Do estimation with output for the preferred model
*
boxjenk(print,constant,ar=1,sma=1) m 1962:3 *
@regcorrs(title="(1,1,0)x(0,1,1)",dfc=%narma)
display "aic = " %aic "bic = " %sbc "Q(signif)" *.### %qsignif
*
@gmautofit(regular=1,seasonal=1,noconst,full,report) m

Example 2.6 Out-of-sample forecasts with ARIMA model


cal(q) 1960:1
all 2012:4
*
open data quarterly(2012).xls
data(org=obs,format=xls)

log ppi / ly
dif ly / dly

boxjenk(constant,ar=||1,3||,define=ar1_3) dly
ufore(equation=ar1_3,print) forecasts 2013:1 2014:4
ufore(equation=ar1_3,print,steps=8) forecasts
*
forecast(print,steps=8,from=2013:1) 1
# ar1_3 forecasts
*
* Forecasts of log PPI (not differences)
*
boxjenk(define=ar_alt,constant,ar=||1,3||,dif=1) ly
*
disp ar1_3 ar_alt
*
ufore(equation=ar_alt,print) forecasts 2013:1 2014:4
graph(footer="Actual and Forecasted Values of the log PPI") 2
# ly 2009:1 *
# forecasts

Example 2.7 Comparison of Forecasts


cal(q) 1960:1
all 2012:4
*
open data quarterly(2012).xls
data(org=obs,format=xls)
*
set ly = log(ppi)
set dly = ly-ly{1}
*
compute dataend=2012:4
compute baseend=dataend-50
*
set error1 baseend+1 * = 0.
set error2 baseend+1 * = 0.
do end=baseend,dataend-1
   boxjenk(constant,ar=||1,3||,define=ar1_3) dly * end
   boxjenk(constant,ar=1,ma=1,define=arma) dly * end
   ufore(equation=ar1_3,steps=1) f1
   ufore(equation=arma,steps=1) f2
   compute error1(end+1)=dly(end+1)-f1(end+1)
   compute error2(end+1)=dly(end+1)-f2(end+1)
end do end
table / error1 error2
*
linreg dly
# constant f1
test(title="Test of Unbiasedness of AR(3) forecasts")
# 1 2
# 0 1
*
linreg dly
# constant f2
test(title="Test of Unbiasedness of ARMA forecasts")
# 1 2
# 0 1
*
* Granger-Newbold and Diebold-Mariano tests
*
@gnewbold dly f1 f2
*
corr(qstats,span=4) error1
corr(qstats,span=4) error2
*
@dmariano(lags=4,lwindow=newey) dly f1 f2

Example 2.8 Stability Analysis


cal(q) 1960:1
all 2012:4
*
open data quarterly(2012).xls
data(org=obs,format=xls)
*
set ly = log(ppi)
set dly = ly-ly{1}
*
set intercept = 0.
set ar1 = 0.
set sd0 = 0.
set sd1 = 0.
*
do end = 1980:1,2012:4
   boxjenk(constant,ar=||1,3||,noprint) dly * end
   com intercept(end) = %beta(1), ar1(end) = %beta(2)
   com sd0(end) = %stderrs(1) , sd1(end) = %stderrs(2)
   if %converged<>1
      dis "###DID NOT CONVERGE for " %datelabel(end)
end do end
*
set plus0 = intercept + 1.64*sd0
set minus0 = intercept - 1.64*sd0
set plus1 = ar1 + 1.64*sd1
set minus1 = ar1 - 1.64*sd1
*
spgraph(hfield=2,vfields=1,$
footer="Coefficient Estimates with 90% Confidence Intervals")
graph(header="The Estimated Mean") 3
# intercept 1980:1 *
# plus0 1980:1 * 2
# minus0 1980:1 * 2
graph(header="The AR(1) Coefficient") 3
# ar1 1980:1 *
# plus1 1980:1 * 2
# minus1 1980:1 * 2
spgraph(done)
*
* Alternatively, use a rolling window with 75 observations
*
compute width=75
do end = 1980:1,2012:4
   boxjenk(constant,ar=||1,3||,noprint) dly end-width+1 end
   com intercept(end) = %beta(1), ar1(end) = %beta(2)
   com sd0(end) = %stderrs(1) , sd1(end) = %stderrs(2)
   if %converged<>1
      dis "DID NOT CONVERGE for t = " %datelabel(end)
end do end
*
equation ar1_3 dly
# constant dly{1 3}
*
rls(cohist=coeffs,sehist=serrs,equation=ar1_3) dly / resids
set plus0 = coeffs(1) + 1.64*serrs(1)
set minus0 = coeffs(1) - 1.64*serrs(1)
set plus1 = coeffs(2) + 1.64*serrs(2)
set minus1 = coeffs(2) - 1.64*serrs(2)
Chapter 3

Non-linear Least Squares

It is well-known that many economic variables display asymmetric adjustment
over the course of the business cycle. The recent financial crisis underlines the
point that economic downturns can be far sharper than recoveries. Yet, the
standard ARMA(p, q) model requires that all adjustment be symmetric, as it is
linear in all lagged values of $\{y_t\}$ and $\{\epsilon_t\}$. For example, in the AR(1) model
$y_t = 0.5y_{t-1} + \epsilon_t$, a one-unit shock to $\epsilon_t$ will induce $y_t$ to increase by one unit,
$y_{t+1}$ to increase by 0.5 units, $y_{t+2}$ to increase by 0.25 units, and so on. And a
one-unit decrease in $\epsilon_t$ will induce $y_t$ to decrease by one unit, $y_{t+1}$ to decrease
by 0.5 units, and $y_{t+2}$ to decrease by 0.25 units. Doubling the magnitude of the
shock doubles the magnitude of the change in $y_t$ and in all subsequent values of
the sequence. The point is that in a linear specification, such as the ARMA(p, q)
model, it isn't possible to capture the types of asymmetries displayed by many
time-series variables. As such, there is a large and growing literature on non-linear
alternatives to the standard ARMA model. RATS allows you to estimate
dynamic nonlinear models in a number of different ways including non-linear
least squares, which is the subject of this chapter.
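The symmetry and proportionality described in this paragraph can be seen directly in the impulse responses of the AR(1) example. The following Python sketch (the function name is ours) traces the effect of a shock of +1, -1, and +2 on the current and next two values.

```python
def ar1_response(phi, shock, horizon):
    """Change in y_t, y_{t+1}, ... caused by a one-time shock in an AR(1) with coefficient phi."""
    return [shock * phi ** h for h in range(horizon)]


up = ar1_response(0.5, 1.0, 3)       # one-unit positive shock
down = ar1_response(0.5, -1.0, 3)    # one-unit negative shock
double = ar1_response(0.5, 2.0, 3)   # shock of twice the size
print(up, down, double)
```

The negative response is the exact mirror image of the positive one, and doubling the shock doubles every element, so no choice of the coefficient can make downturns sharper than recoveries.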
In general, non-linear estimation requires more work from you and more careful
attention to detail than does linear regression. You really have to try very
hard to construct an example where the LINREG instruction fails to get the
correct answer, in the sense that it gives a result other than the coefficients
which minimize the sum of squares to any reasonable level of precision; for
linear regressions, most statistical packages agree to at least eight significant
digits even on fairly difficult data. This doesn't mean that the results make
economic sense, but you at least get results.
However, there are various pathologies which can affect non-linear models
that never occur in linear ones.

Boundary Issues
Probabilities have to be in [0, 1]. Variances have to be non-negative. These are
just two examples of possible non-linear parameters where the optimum might
be at a boundary. If the optimum is at the boundary, the partial derivative
doesn't have to be zero. Since the most straightforward method of optimizing a
continuous function is to find a zero of the gradient, that won't work properly
with the bounded spaces. In addition, the derivative may not even exist at the
optimum if the function isn't definable in one direction or the other.
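A toy example makes the boundary problem concrete. In the Python sketch below, the objective (a made-up quadratic, not taken from any model in this book) has its unconstrained minimum at 1.3, but the parameter is a probability and so must stay in [0, 1].

```python
def sse(p):
    """Hypothetical objective whose unconstrained minimum (p = 1.3) lies outside [0, 1]."""
    return (p - 1.3) ** 2


def gradient(p):
    """Derivative of sse with respect to p."""
    return 2.0 * (p - 1.3)


grid = [i / 1000 for i in range(1001)]   # admissible values of p in [0, 1]
p_best = min(grid, key=sse)
print(p_best, gradient(p_best))
```

The best admissible value is the boundary point p = 1, where the derivative is about -0.6 rather than zero, so a method that searches for a zero of the gradient would never stop there.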


Unbounded parameter space


If the X matrix is full-rank, the sum of squares surface for linear least squares
is globally convex and goes to infinity in every direction as the coefficients get
very large. However, a non-linear function like $\exp(-x_t\beta)$ is bounded below by
zero no matter how large $\beta$ gets. It's possible for an optimum to be at $\beta = \infty$.
This is often related to the boundary issue, as bounded parameters are
sometimes mapped into an unbounded parameter space. For instance, $\sigma^2$ can
be replaced with $\exp(\theta)$ where the boundary $\sigma^2 = 0$ is now mapped to $\theta = -\infty$.

Convergence issues
Assuming that X is full rank, least squares can be solved exactly with a single
matrix calculation. Even if a non-linear least squares problem avoids the previous
two issues, the minimizer can rarely be computed analytically with a finite
number of calculations. Instead, the solution has to be approximated, and at
some point we have to decide when we're done. Because of its importance,
we're devoting a full section to it (Section 3.4).

Lack of identification
A linear model can have identification issues (the dummy variable trap is a
good example) but they are usually the result of an error in specifying the
set of regressors. Generally, you can test for additional coefficients by overfitting
a model (adding additional regressors) without any major computational
problems. By contrast, there are whole classes of non-linear models where entire
sets of parameters can, under certain circumstances, fail to be identified.
In particular, various switching and threshold models can fail if you try to
overfit by adding an extra (and unnecessary) regime. This will come up quite
often in this chapter.

3.1 Nonlinear Least Squares


Suppose that you want to estimate the following model using nonlinear least
squares:
$$y_t = \beta x_t^{\gamma} + \varepsilon_t \qquad (3.1)$$
Since the disturbance term is additive, you cannot simply take the log of each
side and estimate the equation using OLS.[1] However, nonlinear least squares
allows you to estimate $\beta$ and $\gamma$ directly, finding the values of the parameters
which minimize
$$\sum_{t=1}^{T} \left( y_t - \beta x_t^{\gamma} \right)^2 \qquad (3.2)$$

[1] If the model had the form $y_t = \beta x_t^{\gamma} \varepsilon_t$ where $\{\varepsilon_t\}$ was log-normal, it would be
appropriate to estimate the regression in logs using LINREG.

The instruction which does the minimization is NLLS (for NonLinear Least
Squares). However, before we use it, we must go through several preliminary
steps. Instead of a linear function of explanatory variables, we need to allow
for a general function of the data on the right-side of the equation as in (3.1).
This is done using the instruction FRML.
However, before we can even define the explanatory FRML, we need to let RATS
know the variable names that we are going to use in defining that, translat-
ing the math equation into a usable expression. That is done with the NONLIN
instruction, which both defines the variables to RATS and also creates the pa-
rameter set to be used in estimation. And there is one more step before we
can use NLLS: we need to give guess values to those parameters. That's not
necessary with a linear regression, where a single matrix calculation solves the
minimization problem. Non-linear least squares requires a sequence of steps,
each of which brings us closer to the minimizers, but we have to start that
sequence somewhere. Sometimes the estimation process is almost completely
unaffected by the guess values; in other cases, you will get nowhere without a
very good starting point.
The obvious choices for the names of the two parameters would be BETA and
GAMMA. So the first instruction would be
nonlin beta gamma

The following defines a FRML named F1 with dependent variable Y and ex-
planatory function BETA*X^GAMMA.

frml f1 y = beta*x^gamma

A FRML is a function of the (implied) entry variable T, so, for instance, F1(100)
evaluates to BETA*X(100)^GAMMA. Note that if we had not done the NONLIN
instruction first, the FRML instruction would have no idea what BETA and
GAMMA were supposed to be. You would get the message:
## SX11. Identifier BETA is Not Recognizable. Incorrect Option Field or Parameter Order?

If you get a message like that while setting up a non-linear estimation, you
probably either have a typo in the name, or you failed to create all the param-
eters before defining the FRML.
By default, RATS will use 0.0 for any non-linear parameter which isn't other-
wise initialized. Often, that's OK; here, not so. If $\beta = 0$, $\gamma$ has no effect on
the function value. In this case, it turns out that RATS can fight through that,[2]
but it's not a good strategy to ignore the need for guess values and hope that it
works. Since we won't be estimating this model, we'll just give $\beta$ an arbitrary
value, and start $\gamma$ at 1.

[2] $\beta$ changes on the first iteration while $\gamma$ doesn't, after which, with a non-zero $\beta$, estimation
proceeds normally. However, quite a few statistical programs would quit on the first iteration.

compute beta=0.5,gamma=1.0

Finally the parameters are estimated with

nlls(frml=f1) y

How does NLLS work? The first thing it does is to see which entries can be used
in estimation by evaluating the input FRML at each entry and seeing if it gets
a (legal) value. NLLS has the standard start and end parameters and SMPL
option to allow you to control the estimating range, but it also needs to test for
problems itself. This is one place where bad guess values, or a generally bad
setup might give a bad outcome. While it wouldn't happen here, it's possible to
get the message:
## SR10. Missing Values And/Or SMPL Options Leave No Usable Data Points

which is telling you that the explanatory function has no entries at which it
could be computed, generally due to either missing values working their way
through the data set, or something like log or square root of a negative number
being part of the formula at the guess values.
The next step (under the default method of estimation) is to take a first order
Taylor series expansion of (3.1) with respect to the parameters.

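$$y_t - \beta x_t^{\gamma} \approx x_t^{\gamma} (\beta^* - \beta) + (\beta x_t^{\gamma} \log x_t)(\gamma^* - \gamma) + \varepsilon_t$$

If we treat the unstarred $\beta$ and $\gamma$ as fixed, then this is in the form of a linear
regression of the current residuals $y_t - \beta x_t^{\gamma}$ on the two derivative series $x_t^{\gamma}$ and
$\beta x_t^{\gamma} \log x_t$ to get $(\beta^* - \beta)$ and $(\gamma^* - \gamma)$. Going from $(\beta, \gamma)$ to $(\beta^*, \gamma^*)$ is called
taking a Gauss-Newton step. This is repeated with the new expansion point,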
and the process continues until the change is small enough that the process is
considered to be converged.
However, NLLS doesn't always take a full Gauss-Newton step. It's quite pos-
sible that on the first few Gauss-Newton steps, the sum of squares function
actually increases on a full step. This is because the first order expansion may
not yet be very accurate. While the G-N algorithm may actually work (and
work well) despite taking steps that are too large, NLLS instead will take a
shorter step in the same direction as the full step so that the sum of squares
decreases, adopting a slower but steadier approach to optimization.
Note that this first G-N step is where the $\beta = 0$ guess creates a problem since
it zeroes out the derivative with respect to $\gamma$. What NLLS does with that is to
not try to solve for $\gamma^*$ (since there's no useful information for that) and just
solve for $\beta^*$. On the second step, $\beta$ is no longer zero, so it's possible to move
both parameters.
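To make the mechanics concrete, here is a minimal sketch of a Gauss-Newton iteration with step shortening for the model in (3.1). It is written in Python, not RATS, and it is not the NLLS implementation: the function names, the convergence rule, the tolerance used to detect an unusable derivative, and the noise-free test data are all assumptions made for this illustration.

```python
import math

def sum_sq(beta, gamma, xs, ys):
    """Sum of squared residuals; numeric overflow counts as 'infinitely bad'."""
    try:
        return sum((y - beta * x ** gamma) ** 2 for x, y in zip(xs, ys))
    except OverflowError:
        return math.inf

def gauss_newton(xs, ys, beta, gamma, max_iter=100, tol=1e-5):
    """Minimize sum (y - beta*x**gamma)**2 by Gauss-Newton with step halving."""
    for iteration in range(1, max_iter + 1):
        resid = [y - beta * x ** gamma for x, y in zip(xs, ys)]
        d_beta = [x ** gamma for x in xs]                        # df/dbeta
        d_gamma = [beta * x ** gamma * math.log(x) for x in xs]  # df/dgamma
        # Normal equations for regressing resid on the two derivative series.
        a11 = sum(u * u for u in d_beta)
        a12 = sum(u * v for u, v in zip(d_beta, d_gamma))
        a22 = sum(v * v for v in d_gamma)
        b1 = sum(u * r for u, r in zip(d_beta, resid))
        b2 = sum(v * r for v, r in zip(d_gamma, resid))
        det = a11 * a22 - a12 * a12
        if det > 1e-12 * a11 * a22:
            step_b = (a22 * b1 - a12 * b2) / det
            step_g = (a11 * b2 - a12 * b1) / det
        else:
            # No usable information on gamma (e.g. beta = 0): move beta only.
            step_b, step_g = b1 / a11, 0.0
        # Shorten the step until the sum of squares does not increase.
        base, lam = sum_sq(beta, gamma, xs, ys), 1.0
        while lam > 1e-10 and sum_sq(beta + lam * step_b,
                                     gamma + lam * step_g, xs, ys) > base:
            lam *= 0.5
        beta, gamma = beta + lam * step_b, gamma + lam * step_g
        change = max(abs(lam * step_b) / max(abs(beta), 1e-6),
                     abs(lam * step_g) / max(abs(gamma), 1e-6))
        if change < tol:
            return beta, gamma, iteration, True
    return beta, gamma, max_iter, False

# Hypothetical noise-free data generated from beta = 2, gamma = 1.5.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x ** 1.5 for x in xs]
print(gauss_newton(xs, ys, beta=0.5, gamma=1.0))
print(gauss_newton(xs, ys, beta=0.0, gamma=1.0))  # zero guess: beta moves first
```

With the second call, the derivative with respect to gamma is identically zero on the first pass, so only beta is updated, which mirrors the behavior described above.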

3.2 Using NLLS


We'll use the data set to do several examples of non-linear least squares using
something similar to the simple model from the previous section. (These are
both in Example 3.1.) The first of these will be a nonsense regression, to show
how having a (theoretically) unidentified parameter can affect your estimation.
This first example will be a regression of inflation on its lag plus a power func-
tion on the lag of real GDP:

$$\pi_t = b_0 + b_1 \pi_{t-1} + b_2 y_{t-1}^{\gamma} + \varepsilon_t \qquad (3.3)$$
There is no particular reason to believe that the level of GDP has any effect
on inflation (the growth rate of GDP would be a different story), so we really
wouldn't expect $b_2$ to be non-zero. But if $b_2$ is zero, $\gamma$ isn't identified. However,
this is only a theoretical lack of identification: in sample, $b_2$ won't be exactly
zero, so we will be able to estimate $\gamma$, if not very well.
The second example (which will be discussed in detail in Section 3.3) will use
the two interest rates, estimating an exponential rather than linear relation-
ship:
$$LR_t = a_0 + a_1 LR_{t-1} + a_2 SR_{t-1}^{\delta} + \varepsilon_t$$
where LR is the long rate (one year) and SR the short (three month).

Step 1-Define parameter set


Specify the parameter set to be estimated using the NONLIN instruction. The
syntax for this is:

NONLIN parameter list

In most instances, the parameter list will be a simple list of the coefficients
to be estimated, separated by spaces. For our first model (3.3), this (with the
obvious choices for names) would be

nonlin b0 b1 b2 gamma

One NONLIN controls the parameter set until another NONLIN is executed, so
we don't want to define the parameter set for the second problem until we're done
with the first. By the way, there is no reason that we couldn't have used the
same set of parameter names in the second problem as the first; we're using
different ones here for illustration, since the two models have nothing to do
with each other. In practice, where the different models are probably much
more closely related, you would largely keep the same parameter set and make
adjustments to it from one estimation to the next.
We'll see later that you can define PARMSET variables that save a parameter
list like this so you can easily switch between sets of parameters.

Step 2-Define FRML


This defines the explanatory formula. The syntax for FRML (as it is generally
used in non-linear least squares) is:

frml(options) formula name depvar = function(t)

where:

formula name    The name you choose to give to the formula
depvar          Dependent variable
function(t)     The explanatory function

For the first example, this would be

frml pif pi = b0+b1*pi{1}+b2*y{1}^gamma

where the PI and Y series are assumed to have already been defined.

Step 3-Set Guess Values


An obvious set of guess values would be to take B0, B1 and B2 from a linear
regression with Y{1} as the third variable, which gives the sum of squares
minimizers for the case where GAMMA is 1. Could we get by with something less
accurate? For this model, probably yes. For a more complicated model, pos-
sibly not. For a (non-linear) model to be useful in practice, it's important that
there be a reasonable way to get guess values for the parameters using more
basic models applied to the data plus prior knowledge from other datasets or
similar models. A model which can only be estimated if fed the best from the
results of dozens of attempts at guess values is unlikely to be useful in practice,
since it can't be applied easily to different data sets.
For the first example, we would do this with:

linreg pi
# constant pi{1} y{1}
compute b0=%beta(1),b1=%beta(2),b2=%beta(3),gamma=1.0

Note that %BETA gets redefined by NLLS, so you need to get those as soon as
possible after the LINREG.

Step 4-Estimate using NLLS


The syntax for NLLS is

NLLS(frml=formula name,...) depvar start end residuals

depvar       Dependent variable used on the FRML instruction.
start end    Range to estimate.
residuals    Series to store the residuals. This is optional. %RESIDS is
             always defined by NLLS.

The principal options are:

METHOD=[GAUSS]/SIMPLEX/GENETIC
GAUSS is the (modified) Gauss-Newton algorithm described in Section 3.1. SIMPLEX
and GENETIC are slower optimizers which don't use the special structure of
the non-linear least squares problem, but are more robust to bad guess val-
ues. Gauss-Newton tends to work fine, but these are available if you get
convergence problems.
ITERATIONS=maximum number of iterations to make [100]
ROBUSTERRORS/[NOROBUSTERRORS]
As with LINREG, this option calculates a consistent estimate of the covari-
ance matrix in the presence of heteroscedasticity.

NLLS defines most of the same internal variables as LINREG including %RSS,
%BETA, %TSTATS and %NOBS. It also defines the internal variable %CONVERGED
which is 1 if the estimation converged and otherwise is 0.
With the variable definitions (which need to be done at some point before the
FRML instruction)

set pi = 100.0*log(ppi/ppi{1})
set y = .001*rgdp

we can estimate the non-linear least squares model with

nlls(frml=pif) pi

which gives us

Nonlinear Least Squares - Estimation by Gauss-Newton


Convergence in 61 Iterations. Final criterion was 0.0000067 <= 0.0000100
Dependent Variable PI
Quarterly Data From 1960:02 To 2012:04
Usable Observations 210
Degrees of Freedom 206
Skipped/Missing (from 211) 1
Centered R^2 0.3183981
R-Bar^2 0.3084719
Uncentered R^2 0.5576315
Mean of Dependent Variable 0.8431326891
Std Error of Dependent Variable 1.1492476234
Standard Error of Estimate 0.9556932859
Sum of Squared Residuals 188.15002928
Regression F(3,206) 32.0764
Significance Level of F 0.0000000
Log Likelihood -286.4410
Durbin-Watson Statistic 2.1142

    Variable          Coeff         Std Error       T-Stat      Signif
************************************************************************************
1.  B0                0.446055502   0.243426525     1.83240   0.06833506
2.  B1                0.556482217   0.057868679     9.61629   0.00000000
3.  B2               -0.000876438   0.017920471    -0.04891   0.96104072
4.  GAMMA             2.072568472   7.640418224     0.27126   0.78645982

A few things to note about this:

1. This shows 1 Skipped/Missing Observation. If you look at the output for
   the LINREG used for the guess values (not shown here), you'll see that
   it gets the same 210 usable observations, but doesn't show any as being
   skipped. This is due to the difference in how the two instructions work
   if you let the program choose the estimation range. LINREG scans all the
   series involved in the regression to figure out the maximum usable range.
   NLLS initially restricts the range based upon the one input series that it
   knows about (the dependent variable), then tries to evaluate the FRML at
   the points in that range, knocking out of the sample any at which it can't
   evaluate it. It ends up actually using the range from 1960:3 to 2012:4,
   but accounts for it differently.
2. The second line in the output shows the actual iteration count. 61 is quite
   a few iterations for such a small model. That's mainly due to the problem
   figuring out the poorly-estimated power term.

The GAMMA is showing the signs of a parameter which isn't really identified.
Even a one standard deviation range runs from roughly -5.6 to 9.7, most of which
are rather nonsensical values.
You might ask why Y was defined as

set y = .001*rgdp

rather than just RGDP alone. For a linear regression, a re-scaling like that
has no real effect on the calculation of the estimates; in effect, the data get
standardized as part of the process of inverting the $X'X$ matrix. The only effect is
on how the estimates look when displayed: if we use RGDP alone, the linear
regression would give
Linear Regression - Estimation by Least Squares
Dependent Variable PI
Quarterly Data From 1960:03 To 2012:04
Usable Observations 210
Degrees of Freedom 207
Centered R^2 0.3179162
R-Bar^2 0.3113260
Uncentered R^2 0.5573187
Mean of Dependent Variable 0.8431326891
Std Error of Dependent Variable 1.1492476234
Standard Error of Estimate 0.9537190610
Sum of Squared Residuals 188.28306978
Regression F(2,207) 48.2409
Significance Level of F 0.0000000
Log Likelihood -286.5152
Durbin-Watson Statistic 2.1147

    Variable          Coeff         Std Error       T-Stat      Signif
************************************************************************************
1.  Constant          0.496236135   0.175785276     2.82297   0.00522240
2.  PI{1}             0.557330554   0.057684919     9.66163   0.00000000
3.  Y{1}             -0.000016091   0.000019699    -0.81684   0.41495922

A good practitioner would try to avoid reporting a regression like this which re-
quires either scientific notation or a large number of digits in order to show the
coefficient on Y{1}. However, everything about the regression matches exactly
with and without the rescaling except for the display of that last coefficient and
its standard error.
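The claim that rescaling a regressor changes only how its coefficient is displayed can be checked with a small sketch. This is Python, not RATS; `simple_ols` is a helper written for this illustration, and the data are hypothetical numbers, not the PPI and RGDP series used in the text.

```python
def simple_ols(xs, ys):
    """Intercept, slope and sum of squared residuals for y = a + b*x."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = ybar - b * xbar
    rss = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    return a, b, rss

# Hypothetical data: a regressor measured in large units.
raw = [3100.0, 4200.0, 5600.0, 7300.0, 9900.0, 12800.0]
ys = [1.9, 1.2, 0.8, 1.1, 0.4, 0.6]
scaled = [0.001 * x for x in raw]

a_raw, b_raw, rss_raw = simple_ols(raw, ys)
a_scl, b_scl, rss_scl = simple_ols(scaled, ys)
print(b_raw, b_scl)      # slopes differ by the factor of 1000
print(rss_raw, rss_scl)  # the fit itself is unchanged (up to rounding)
```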
Proper scaling of the data makes a much greater difference in non-linear es-
timation for more reasons than just how the estimates look. In many cases,
the only way to get proper behavior is to rescale the data (or sometimes re-
parameterize the whole model). Theoretically, if you could do all calculations
to infinite precision, this wouldn't be an issue. However, the standard at this
point in statistical calculations is double precision (64-bit representation with
about 15 significant digits) typically with intermediate calculations done to a
somewhat higher precision by the microprocessor.[3] If we look at the results
from NLLS with the rescaled RGDP series, and think about what would happen
if we used it without rescaling, $y_{t-1}^{\gamma}$ would be higher by a factor of more than $10^6$,
and $b_2$ would correspondingly have to be divided by more than $10^6$, making it on
the order of $10^{-9}$. In the Gauss-Newton algorithm (or any iterative procedure),
there's always a question of when to stop: when should we consider that we've
done the best that we reasonably can? When you have a parameter with a tiny
scale like that, it's hard to tell whether it's small because it's naturally small
and small changes in it may still produce observable changes in the function
value, or it's small because it's really (machine-)zero and small changes won't
have an effect. It's not the scale of the data itself that's the problem, but the
scale of the parameters that result from the scale of the data.

[3] See this chapter's Tips and Tricks (page 101) for more on computer arithmetic.

In this case, the NLLS on the data without scaling down RGDP doesn't converge
at 100 iterations, but does if given more (ITERS=200 on the NLLS is enough),
and does give roughly the same results for everything other than B2. In other
cases, you may never be able to get convergence without reworking the data a
bit.

3.3 Restrictions: Testing and Imposing


We'll now do the second example of non-linear least squares:
$$LR_t = a_0 + a_1 LR_{t-1} + a_2 SR_{t-1}^{\delta} + \varepsilon_t \qquad (3.4)$$
The setup here is basically the same as before with renaming of variables:

nonlin a0 a1 a2 delta
linreg tb1yr
# constant tb1yr{1} tb3mo{1}
frml ratef tb1yr = a0+a1*tb1yr{1}+a2*(tb3mo{1})^delta
compute a0=%beta(1),a1=%beta(2),a2=%beta(3),delta=1.0
nlls(frml=ratef) tb1yr

Nonlinear Least Squares - Estimation by Gauss-Newton


Convergence in 11 Iterations. Final criterion was 0.0000089 <= 0.0000100
Dependent Variable TB1YR
Quarterly Data From 1960:01 To 2012:04
Usable Observations 211
Degrees of Freedom 207
Skipped/Missing (from 212) 1
Centered R^2 0.9445648
R-Bar^2 0.9437614
Uncentered R^2 0.9864386
Mean of Dependent Variable 5.5835545024
Std Error of Dependent Variable 3.1851074843
Standard Error of Estimate 0.7553379512
Sum of Squared Residuals 118.10083205
Regression F(3,207) 1175.6969
Significance Level of F 0.0000000
Log Likelihood -238.1723
Durbin-Watson Statistic 1.5634

    Variable          Coeff         Std Error       T-Stat      Signif
************************************************************************************
1.  A0               -0.083102618   0.233285762    -0.35623   0.72203360
2.  A1                0.845299348   0.142798909     5.91951   0.00000001
3.  A2                0.302138458   0.203001283     1.48836   0.13817851
4.  DELTA             0.719742244   0.319257061     2.25443   0.02521634

The significant t-statistic on DELTA is somewhat misleading because it's a
test for $\delta = 0$ and that's not a particularly interesting hypothesis. Of the hy-
pothesis testing instructions from Section 2.2, you can't use EXCLUDE, since it
uses regressor lists to input the variables to be tested, and NLLS works with
parameter sets instead. TEST and RESTRICT, which use coefficient positions,
are available, and there is a form of SUMMARIZE which can be used as well.

In this case, the most interesting hypothesis regarding $\delta$ would be whether it's
equal to 1. We can use TEST for that: DELTA is coefficient 4, so the test would
be done with
test(title="Test of linearity")
# 4
# 1.0

Test of linearity
t(207)= -0.877844 or F(1,207)= 0.770609 with Significance Level 0.38104636

so we would conclude that the extra work for doing the power term doesn't
seem to have helped much.
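The test statistic reported by TEST can be reproduced by hand from the NLLS output, which is a useful check on what hypothesis is actually being tested. The following is plain arithmetic in Python (not RATS), using only the DELTA estimate and standard error shown in the regression output; the significance level is not recomputed here.

```python
# Reported NLLS estimate and standard error for DELTA.
delta_hat = 0.719742244
delta_se = 0.319257061

t_stat = (delta_hat - 1.0) / delta_se   # test of delta = 1, not delta = 0
f_stat = t_stat ** 2                    # one restriction: F is t squared
print(round(t_stat, 6), round(f_stat, 6))
```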
SUMMARIZE can also be applied to the results from non-linear least squares. It
can be used to test non-linear functions of the parameters, and can also use
the delta method (Appendix C) to compute asymptotic variances for non-linear
functions. One potentially interesting question about the relationship between
the short and long rates is whether a permanent increase in the short rate
would lead to the same increase in the long rate. If $\delta$ were 1, the long-run effect
would be
$$\frac{a_2}{1 - a_1}$$
We can use SUMMARIZE after the LINREG to estimate that and its standard
error (note: this is at the end of the example program):
linreg tb1yr
# constant tb1yr{1} tb3mo{1}
summarize(title="Long-run effect using linear regression") $
%beta(3)/(1-%beta(2))

Long-run effect using linear regression

Value          0.88137621 t-Statistic     3.63694
Standard Error 0.24234021 Signif Level  0.0003479

It's not significantly different from one, but the standard error is quite large.
With the non-linear model, the effect of a change in the short rate on the long
rate is no longer independent of the value of TB3MO. The analogous calculation
for the long run effect would now be:
$$\frac{a_2 \, \delta \, SR^{\delta - 1}}{1 - a_1}$$
Since this depends upon the short rate, we can't come up with a single value,
but instead will have a function of test values for the short rate. We can
compute this function, together with upper and lower 2 standard error bounds
using the following:

Figure 3.1: Long-run effect using non-linear regression

* Grid of test values for the short rate: 0.1, 0.2, ..., 10.0
set testsr 1 100 = .1*t
set lreffect 1 100 = 0.0
set lower 1 100 = 0.0
set upper 1 100 = 0.0
*
* Estimate and 2 standard error band at each grid point
do t=1,100
   summarize(noprint) $
     %beta(3)*%beta(4)*testsr(t)^(%beta(4)-1)/(1-%beta(2))
   compute lreffect(t)=%sumlc
   compute lower(t)=%sumlc-2.00*sqrt(%varlc)
   compute upper(t)=%sumlc+2.00*sqrt(%varlc)
end do t

TESTSR is a series of values from 0.1 to 10. The three other series are initialized
to zero over the grid range so they can be filled in entry by entry inside the
loop. SUMMARIZE computes %SUMLC as the estimate of the non-linear function
and %VARLC as the estimated variance.
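For readers who want to see what a delta-method calculation of this kind involves, here is a rough Python sketch (not RATS, and not how SUMMARIZE is implemented). The point estimates and standard errors are the ones from the NLLS output; the covariance matrix is an assumption, since the text does not report the covariances between coefficients, so the resulting bands are illustrative and will not match Figure 3.1. The helper names `long_run_effect` and `delta_method` are invented for this example.

```python
import math

# Point estimates and standard errors reported by NLLS for A1, A2, DELTA.
a1, a2, delta = 0.845299348, 0.302138458, 0.719742244
se = [0.142798909, 0.203001283, 0.319257061]

# ASSUMPTION: covariances are not reported in the text, so this sketch
# uses a diagonal covariance matrix. The bands are illustrative only.
cov = [[se[i] ** 2 if i == j else 0.0 for j in range(3)] for i in range(3)]

def long_run_effect(params, sr):
    """a2 * delta * SR**(delta-1) / (1 - a1) at a test value of the short rate."""
    p_a1, p_a2, p_delta = params
    return p_a2 * p_delta * sr ** (p_delta - 1.0) / (1.0 - p_a1)

def delta_method(func, params, cov, h=1e-6):
    """Value and approximate variance of func(params) via a numerical gradient."""
    grad = []
    for i in range(len(params)):
        up = list(params)
        dn = list(params)
        up[i] += h
        dn[i] -= h
        grad.append((func(up) - func(dn)) / (2.0 * h))
    var = sum(grad[i] * cov[i][j] * grad[j]
              for i in range(len(params)) for j in range(len(params)))
    return func(params), var

for sr in (2.0, 4.0, 6.0):
    value, var = delta_method(lambda p: long_run_effect(p, sr),
                              [a1, a2, delta], cov)
    print(sr, value, value - 2.0 * math.sqrt(var), value + 2.0 * math.sqrt(var))
```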
The following graphs the function (Figure 3.1). The range is limited to values
of SR between 2 and 6, as the function, particularly for values under 2, grows
rapidly, dominating the range of the graph.

scatter(smpl=testsr>=2.0.and.testsr<=6.0,style=lines,vgrid=1.0,$
footer="Long-run effect using non-linear regression") 3
# testsr lreffect
# testsr lower / 2
# testsr upper / 2

Figure 3.2: Sums of Squares Examples (Panel a and Panel b; horizontal axis: Beta, vertical axis: Residual Sum of Squares)

3.4 Convergence and Convergence Criteria


Numerical optimization algorithms use iteration routines that cannot guar-
antee precise solutions for the estimated coefficients. Various types of hill-
climbing methods (Gauss-Newton is an example, though it climbs only if
you switch the sign of the objective function) are used to find the parameter
values that maximize a function or minimize the sum of squared residuals. If
the partial derivatives of the function are near zero for a wide range of param-
eter values, RATS may not be able to converge to the optimum point.
To explain, suppose that the sum of squared residuals for various values of
$\beta$ can be depicted by Panel a of Figure 3.2. Obviously, a value of $\beta$ of about 0.71
minimizes the sum of squared residuals. However, there is a local minimum
at $\beta = 0.23$. If we started with a guess value less than .4 (where the function
has the local maximum), it is likely that Gauss-Newton will find that local
minimum instead. How can we know if we have the global rather than a local
minimum? In general, we can't, unless we have strong knowledge about the
(global) behavior of the function.[4] For instance, even in this case, notice that
the function is starting to turn down at the right edge of Panel a; there's no
way to be sure that the function doesn't get even smaller out beyond $\beta = 1.0$.
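The dependence on the guess value is easy to reproduce with any downhill method. The sketch below uses plain gradient descent (not Gauss-Newton, and Python rather than RATS) on a hypothetical one-parameter surface built to have minima at 0.23 and 0.71 and a local maximum at 0.40; it is not the actual function plotted in Figure 3.2, and the step size is an arbitrary choice.

```python
def slope(beta):
    """Derivative of a hypothetical surface with minima at 0.23 and 0.71
    and a local maximum at 0.40."""
    return (beta - 0.23) * (beta - 0.40) * (beta - 0.71)

def descend(beta, rate=5.0, steps=500):
    """Plain gradient descent: always moves downhill from the guess value."""
    for _ in range(steps):
        beta -= rate * slope(beta)
    return beta

for guess in (0.05, 0.30, 0.50, 0.90):
    print(guess, round(descend(guess), 4))
```

Guesses on either side of the local maximum end up in different minima, and nothing in the iterations themselves reveals which one is the global minimum.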
The sum of squares surface shown in the figure is rather convex (has a strongly
positive second derivative) so it is clear where the local minima occur. However,
suppose that the surface was quite flat. As shown by the smooth line in Panel
b, the sum of squared residuals is almost invariant to the value of $\beta$ selected.
In such circumstances, it might take many iterations for RATS to select a value
of $\beta$ within the default tolerance of 0.00001. There's a practical limit to how
accurately you can estimate the coefficients. Suppose you're trying to maximize
$f(\theta)$ with respect to $\theta$, and assume for simplicity that $\theta$ has just one element. If
$f$ has enough derivatives, we can approximate $f$ with a two term Taylor series
approximation as
$$f(\theta) \approx f(\theta_0) + f'(\theta_0)(\theta - \theta_0) + \frac{1}{2} f''(\theta_0)(\theta - \theta_0)^2$$
If we're very close to the optimum, $f'(\theta_0)$ will be very close to zero. So if
$$\frac{f''(\theta_0)(\theta - \theta_0)^2}{2 f(\theta_0)} \qquad (3.5)$$
is less than $10^{-15}$ (which is machine-zero), then on a standard computer, we
can't tell the difference between $f(\theta)$ and $f(\theta_0)$. Since the difference in the $\theta$s
comes in as a square, in practice, we're limited to about 7 significant digits at
the most, and there is rarely any need to try to push below the 5 significant
digits that are the default: you are unlikely to ever report more digits than
that, and the extra work won't really change the results in any meaningful
way.

[4] For many functions, it's possible to prove that the sum of squares surface is (quasi-)convex,
which would mean that it can have just one local minimum.
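The "square" effect is easy to see numerically. In the Python sketch below (not RATS), `f` is a hypothetical objective with its optimum at 0.71 and a value of 1 there; it has a minimum rather than a maximum, which makes no difference to the argument. A change in the ninth digit of the parameter leaves the computed function value bit-for-bit identical, while a change in the sixth digit does not.

```python
import sys

def f(theta, theta0=0.71):
    """Hypothetical objective with its optimum at theta0 and f(theta0) = 1."""
    return 1.0 + (theta - theta0) ** 2

print(sys.float_info.epsilon)       # spacing of double precision numbers near 1.0
print(f(0.71 + 1e-9) == f(0.71))    # a change in the 9th digit is invisible
print(f(0.71 + 1e-6) == f(0.71))    # a change in the 6th digit is detectable
```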
The two main controls for the Gauss-Newton algorithm are options on NLLS:
the ITERATIONS option, which you can use to increase the number of iter-
ations, and the CVCRIT option, which can be used to tighten or loosen the
convergence criterion. The default on ITERATIONS is 100, which is usually
enough for well-behaved problems, but you might, on occasion, need to increase
it. The default on CVCRIT is .00001, which, as we noted above, is a reasonable
value in practice: you're unlikely to report even five significant digits, and
it's unlikely that you could get much of a better result if you put in a
smaller value. There's also a PMETHOD option for using a different method at
first (PMETHOD means Preliminary METHOD) before switching to Gauss-Newton,
but it is rarely needed for NLLS. It will be important for models estimated with
maximum likelihood.
Some secondary controls for non-linear estimation routines are included in the
separate NLPAR instruction which is covered in this chapter's Tips and Tricks
(page 102).
The non-linear least squares algorithm is based upon an assumption that the
sum of squares surface is (at least locally) well-behaved. If you have the (for all
practical purposes) non-differentiable function shown by the lower curve in
Panel b, NLLS is very unlikely to give good results. Even a brute-force grid
search might fail to find the minimum unless it uses a very fine grid. This
again shows the importance of having some idea of how the sum of squares
surface looks.

3.5 ESTAR and LSTAR Models


The Logistic Smooth Transition Autoregressive (LSTAR) and Exponential
Smooth Transition Autoregressive (ESTAR) models generalize the standard au-
toregressive model to allow for a varying degree of autoregressive decay, thus
allowing for different dynamics for the up and down parts of cycles. The LSTAR
model can be represented by:

$$y_t = \alpha_0 + \sum_{i=1}^{p} \alpha_i y_{t-i} + \theta \left[ \beta_0 + \sum_{i=1}^{p} \beta_i y_{t-i} \right] + \varepsilon_t \qquad (3.6)$$

where $\theta = \left[ 1 + \exp(-\gamma (y_{t-1} - c)) \right]^{-1}$ and $\gamma > 0$ is a scale parameter.

In the limit, as $\gamma \to 0$, the LSTAR model becomes an AR(p) model since $\theta$ is
actually constant. For $0 < \gamma < \infty$, the value of $\theta$ changes with the value of
$y_{t-1}$. Hence, $\theta$ acts as a weighting function so the degree of autoregressive
decay depends on the value of $y_{t-1}$. As the value of $y_{t-1} \to -\infty$, $\theta \to 0$ so the
behavior of $y_t$ is given by $\alpha_0 + \alpha_1 y_{t-1} + \ldots + \alpha_p y_{t-p} + \varepsilon_t$ (which we'll call the
first branch). And, as $y_{t-1} \to +\infty$, $\theta \to 1$ so that the behavior of $y_t$ is given
by $(\alpha_0 + \beta_0) + (\alpha_1 + \beta_1) y_{t-1} + \ldots + (\alpha_p + \beta_p) y_{t-p} + \varepsilon_t$ (which we'll call the second
branch). The two branches can have different means and different dynamics,
sometimes very different. For an LSTAR model, as $y_{t-1}$ ranges from very small
to very large values, $\theta$ goes from zero to unity, and you get a blend of the two
branches. In particular, when $y_{t-1}$ equals the centrality parameter c, the value
of $\theta = 0.5$, and you get an average of the coefficients.
The ESTAR model is similar to the LSTAR model except $\theta$ has the form:
$$\theta = 1 - \exp(-\gamma (y_{t-1} - c)^2); \quad \gamma > 0$$

For the ESTAR model, $\theta = 0$ when $y_{t-1} = c$ and approaches unity as $y_{t-1}$ ap-
proaches $\pm\infty$. The shape of $\theta$ is somewhat like an inverted bell: effectively it's
a Normal density flipped upside down.
You can get a good sense of the nature of the LSTAR and ESTAR models by
experimenting with the following code (Example 3.2). The first line sets up y
to range from roughly -0.5 to +0.5. In the second, c is given the value zero, with a $\gamma$ of
10. We then compute the two shape functions:

set y 1 201 = (t-100)/201.
compute c=0.0,gamma=10.0
set lstar = ((1 + exp(-gamma*(y-c))))^-1
set estar = 1 - exp(-gamma*(y-c)^2)

The following graphs (Figure 3.3) the two transition functions:



Figure 3.3: Shapes of STAR Transitions (left panel: LSTAR Model; right panel: ESTAR Model; vertical axis: Theta)

spgraph(footer="Shapes of STAR Transitions",vfields=1,hfields=2)
 scatter(header="LSTAR Model",style=line,vlabels="Theta")
 # y lstar
 scatter(header="ESTAR Model",style=line,vlabels="Theta")
 # y estar
spgraph(done)

You should experiment by rerunning the program with different values of c and
$\gamma$. You will find that increasing $\gamma$ makes the transitions shorter and steeper;
in the limit, as $\gamma \to \infty$, the LSTAR converges to a step function and the ESTAR
to a (downwards) spike. Changing the value of the centrality parameter, c,
changes the transition point for the LSTAR and the point of symmetry for the
ESTAR.
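The same experiment can be tried outside RATS. The following Python sketch evaluates the two transition functions on a small grid for several values of the scale parameter; the function names `lstar` and `estar` and the grid are choices made for this illustration. The direct formula used here is fine for these arguments but can overflow for extreme ones.

```python
import math

def lstar(y, c, gamma):
    """Logistic transition: 0 far below c, 0.5 at c, 1 far above c."""
    return 1.0 / (1.0 + math.exp(-gamma * (y - c)))

def estar(y, c, gamma):
    """Exponential transition: 0 at c, approaching 1 away from c."""
    return 1.0 - math.exp(-gamma * (y - c) ** 2)

grid = [-0.5, -0.1, 0.0, 0.1, 0.5]
for gamma in (1.0, 10.0, 1000.0):
    print(gamma,
          [round(lstar(y, 0.0, gamma), 3) for y in grid],
          [round(estar(y, 0.0, gamma), 3) for y in grid])
```

With the largest scale value the logistic column is essentially a 0/1 step and the exponential column is 1 everywhere except at the center, which is the limiting behavior described above.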

We'll first use generated data to illustrate the process of estimating a STAR
model. We will do this because STAR models are a classic example of the 3rd
word of advice: Realize That Not All Models Work! Suppose that there really
is no second branch in (3.6), that is, the data are generated by simply:
$$y_t = \alpha_0 + \sum_{i=1}^{p} \alpha_i y_{t-i} + \varepsilon_t$$

What would happen if we tried to estimate a STAR (in particular, an LSTAR)?

If $\beta_i = 0$ for all i, then the transition parameters don't matter. If c is bigger
than any data value for y, then a large value of $\gamma$ will make $\theta$ effectively zero
through the data set, so the $\beta$ coefficients don't matter. If c is smaller than any
data value for y, a large value of $\gamma$ will now make $\theta$ effectively one through the
data set, so only $\alpha_i + \beta_i$ matters and not the individual values for $\alpha$ or $\beta$. So we
have three completely different ways to get equivalent fits to the model.

However, what is likely to happen in practice? Suppose the residual for the
entry with the highest value of $y_{t-1}$ in the data set is non-zero. Then, we can
reduce the sum of squares by adding a dummy variable for that entry. Now we
would never actually pick a data point and dummy it out without a good reason,
but we have a non-linear model which can generate a dummy by particular
choices for $\gamma$ and c. If c is some value larger than the second highest value of
$y_{t-1}$, and $\gamma$ is very large, then, in this data set, $\theta$ will be a dummy for that one
observation with the highest value of the threshold variable, and this will thus reduce
the sum of squares. You can do something similar at the other end of the data
set, isolating the smallest value for $y_{t-1}$. Thus the sum of squares function is
likely to have two local modes with c on either end of the data set, and there
will be no particularly good way to move between them. Neither generates an
interesting transition model, and it will often be hard to get non-linear least
squares to converge to either since it requires a very large value of $\gamma$.
Note, by the way, that this problem is much worse for an ESTAR model, where
the transition function can (in effect) dummy any data point. You can't just
take a set of data and fit a STAR model to it and hope it will give reasonable
results.
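The way a logistic transition can turn itself into a one-observation dummy is easy to demonstrate. In this Python sketch (not RATS), the five values of the threshold variable are hypothetical, c is placed between the two highest of them, and the scale parameter is simply chosen large enough to make the point.

```python
import math

# Hypothetical values of the lagged series used as the threshold variable.
lagged_y = [-1.2, 0.4, 2.5, 3.1, 4.8]
c, gamma = 4.0, 50.0   # c sits between the two highest values; gamma is large

theta = [1.0 / (1.0 + math.exp(-gamma * (y - c))) for y in lagged_y]
print([round(t, 6) for t in theta])
```

The transition weight is indistinguishable from zero everywhere except at the single highest observation, so the "second branch" terms act as dummies for that one entry.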

3.6 Estimating a STAR Model with NLLS


If c and $\gamma$ were known (or at least treated as fixed), (3.6) would be linear: we
would just have to construct the variables for the $\beta$ terms by multiplying the
(time-varying) $\theta$ by the lagged y. However, in practice, the transition param-
eters aren't known, which makes this a non-linear least squares problem. To
illustrate how to estimate this type of model, we first need to generate data
with a STAR effect.
The following example is from Section 7.9 of Enders (2010). The first part of
the program (Example 3.3) generates a simple LSTAR process containing 250
observations. The first three lines of code set the default series length to 250,
seed the random number generator, and draw 350 pseudo-random numbers[5]
from a normal distribution with standard deviation one (and mean zero):
all 250
seed 2003
set eps 1 350 = %ran(1)

Next, we'll create an LSTAR process with 350 observations. This uses the RATS
%LOGISTIC function, which computes the transition function without
any chance of an overflow on the exp function.⁶ This uses c = 5 and $\gamma$ = 10:
5
The reason for 350 will be described shortly. See this chapter's Tips and Tricks (page 105)
for more on random number generation.
6
exp(z) will overflow when z is 710 or greater. Given the way that expressions are evaluated
in RATS, 1.0/(1 + exp(z)) will be evaluated as NA for such a value of z. The %LOGISTIC function
knows the behavior of the overall function, and so returns 0 for a case like that.
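
The overflow point in footnote 6 can be checked directly. This is a minimal sketch rather than part of the original program: the variable name Z and the value 800 are chosen here purely for illustration, and the second argument of %LOGISTIC is kept at 1.0 as in the text.

```rats
* Illustrative check only: Z and its value are not from the text.
compute z=800.0
* Explicit formula: exp(z) overflows for z this large
disp "Explicit formula" 1.0/(1.0+exp(z))
* Same quantity written through %LOGISTIC, as in the simulation code
disp "%LOGISTIC" %logistic(-z,1.0)
```

According to the footnote, the first line should show a missing value and the second should show zero.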
Figure 3.4: Simulated STAR Process

set(first=1.0) x 1 350 = $
1.0+.9*x{1}+(-3.0-1.7*x{1})*%logistic(10.0*(x{1}-5.0),1.0)+eps

It is not obvious how to set the initial value of x. In such circumstances, a
common practice is to generate a series longer than necessary and then discard
the extra. Here we use 100 extra points; these are known as the burn-in period.
There are several ways to handle this. You could start all the analysis at entry
101, but here we'll simply copy the data down to the desired entries with

set y 1 250 = x(t+100)

The time series graph of the series (Figure 3.4) is created using:

graph(footer="The Simulated LSTAR Process")
# y

The first branch of the LSTAR is $y_t = 1.0 + 0.9 y_{t-1} + \varepsilon_t$. This
has mean 10 and is strongly positively correlated. The second branch is created
by adding the two processes. It is what you would get for values of $y_{t-1}$
above 5,⁷ so it's $y_t = -2.0 - 0.8 y_{t-1} + \varepsilon_t$. This has a mean of
$-2/1.8 \approx -1.11$ and is strongly negatively correlated. As a result, the
process generally moves steadily up under the control of the first branch until
the value of y is greater than 5. In the next time period, it is likely to drop
very sharply under control of the second branch, which will drive it back onto
the first branch. So the process going up is slower than the one going down.
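
The two branch means quoted above follow from the usual AR(1) formula, intercept divided by one minus the autoregressive coefficient. A short worked version, using the coefficients from the SET instruction that generates x:

```latex
\begin{aligned}
\text{first branch:}\quad  & y_t = 1.0 + 0.9\,y_{t-1} + \varepsilon_t,
  & E[y_t] &= \frac{1.0}{1-0.9} = 10 \\
\text{second branch:}\quad & y_t = (1.0-3.0) + (0.9-1.7)\,y_{t-1} + \varepsilon_t
  = -2.0 - 0.8\,y_{t-1} + \varepsilon_t,
  & E[y_t] &= \frac{-2.0}{1-(-0.8)} \approx -1.11
\end{aligned}
```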
The NONLIN instruction and FRML definition are:
7
Since $\gamma$ is large, the transition is very short.

nonlin a0 a1 b0 b1 gamma c
frml lstar y = (a0+a1*y{1})+$
(b0+b1*y{1})*%logistic(gamma*(y{1}-c),1.0)

Guess Values, Method 1: Based on OLS


One way to get initial guesses is to estimate a linear model and use the
coefficient estimates as the initial values. This starts as if we have only the
first branch, zeroing out the change. For illustration, we'll start with c = 0
(roughly the middle of the data) and $\gamma$ = 5.

linreg y
# constant y{1}
compute a0=%beta(1),a1=%beta(2),b0=0.0,b1=0.0
compute c=0.0,gamma=5.0
*
nlls(frml=lstar) y 2 250

Nonlinear Least Squares - Estimation by Gauss-Newton


NO CONVERGENCE IN 100 ITERATIONS
LAST CRITERION WAS 0.0185721
Dependent Variable Y
Usable Observations 249
Degrees of Freedom 243
Centered R^2 0.6130672
R-Bar^2 0.6051056
Uncentered R^2 0.6234211
Mean of Dependent Variable 0.5876231693
Std Error of Dependent Variable 3.5509791503
Standard Error of Estimate 2.2314573881
Sum of Squared Residuals 1209.9947042
Log Likelihood -550.1400
Durbin-Watson Statistic 1.7369

Variable Coeff Std Error T-Stat Signif


************************************************************************************
1. A0 2.74882 0.50706 5.42105 0.00000014
2. A1 1.04721 0.11905 8.79664 0.00000000
3. B0 -273.09372 11487.80317 -0.02377 0.98105359
4. B1 28.61805 1312.03609 0.02181 0.98261587
5. GAMMA 0.75187 0.43441 1.73079 0.08475907
6. C 8.92116 51.91215 0.17185 0.86369759

The results don't look at all sensible, but more important, when you look at the
second and third lines of the output, it's clear that we don't have results in
the first place. If we haven't gotten convergence, one question to ask is whether
it's worth just increasing the iteration limit in hope that that will fix things.
When you have a model like this that can have multiple modes, you probably
want to stop and look at whether you will likely just be converging to a bad
mode. Here, we have a value for c which is well above the maximum value for
y in the data set (nearly 9 vs a maximum value of just above 7). With these
values, the largest value that $\theta$ will take is about .2, so only a fraction
of those (rather absurd-looking) values for B0 and B1 will apply. Despite this
not being converged, and having very strange dynamics, the sum of squares is
(much) lower with these estimates than it is for the simple least squares model:
1209 here vs 2218 for OLS. Note that the estimated first branch is somewhat
similar to the one used in the actual DGP, and the effect of $\theta$ times the
second branch will be sharply negative for the values closest to c, which is the
behavior we need.
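
The "about .2" figure can be reproduced from the reported estimates. The sketch below plugs in the GAMMA and C values from the output and, as an assumption, rounds the largest value of y to 7 (the text only says it is "just above 7"):

```latex
\theta_{\max} \approx \left[1+\exp\left(-0.75187\,(7-8.92116)\right)\right]^{-1}
= \left[1+\exp(1.444)\right]^{-1}
\approx \frac{1}{1+4.24} \approx 0.19
```

So even at the most extreme data point, less than a fifth of the second-branch coefficients is actually applied.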
In this case, the main problem with the guesses was the value of $\gamma$. Even
though the guess of 5 is smaller than the true value, when combined with the
wrong value of c, it works poorly because the function is almost
non-differentiable with respect to c due to the sharp cutoff. If you go back and
try with GAMMA=1.0, you'll see that you get convergence to a reasonable set of
estimates. Note, however, that you need to re-execute the LINREG and COMPUTE
instructions to make that work: %BETA has been re-defined by NLLS so the
COMPUTE instructions for A0 and A1 will no longer use the OLS values unless
you re-do the LINREG.
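
Putting that advice together, a re-run with the flatter transition might look like the following sketch. It simply repeats the instructions shown earlier with GAMMA changed to 1.0; it assumes the LSTAR formula and NONLIN from above are still in effect.

```rats
* Re-do the LINREG so that %BETA again holds the OLS estimates
linreg y
# constant y{1}
compute a0=%beta(1),a1=%beta(2),b0=0.0,b1=0.0
* Same guess for C as before, but a much flatter transition
compute c=0.0,gamma=1.0
*
nlls(frml=lstar) y 2 250
```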
The biggest problem in fitting STAR models is finding the threshold value c. As
we mentioned, given $\gamma$ and c, the model is linear in the other parameters,
and given c, $\gamma$ usually isn't hard to estimate.

Guess Values, Method 2: Data-Determined


The initial values for c and $\gamma$ above were basically just wild guesses. A
more straightforward alternative is to use STATISTICS on the threshold and get
guess values for c and $\gamma$ from that, for instance:

stats y
compute c=%mean,gamma=1.0/sqrt(%variance)

Using the reciprocal of the sample standard error⁸ starts with a rather flat
transition function (most of the observed data will be a blend of the two
branches rather than one or the other), which makes it easier for Gauss-
Newton to find the optimal c. The following then treats C and GAMMA as fixed to
get the corresponding coefficients of the two branches (in the first NLLS) then
estimates all the parameters together:

nonlin a0 a1 b0 b1
nlls(frml=lstar) y 2 250
nonlin a0 a1 b0 b1 gamma c
nlls(frml=lstar) y 2 250

The first NLLS is actually linear, since GAMMA and C are fixed (not included in
the NONLIN). If you look at the output from it (not shown), you'll see that it
converged in 2 iterations: the first moves to the minimizer, and the second
tries to improve but can't. The output from the NLLS with the full parameter
set is:
8
For an ESTAR, you would use 1.0/%variance instead since the exponent depends upon the
square of the data.
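
As a sketch of what footnote 8 describes, an ESTAR counterpart could be set up as below. The formula name ESTAR and the standard exponential transition $1-\exp(-\gamma(y_{t-1}-c)^2)$ are assumptions made here for illustration; the text does not show an ESTAR formula.

```rats
* Hypothetical ESTAR version of the LSTAR formula (not from the text)
nonlin a0 a1 b0 b1 gamma c
frml estar y = (a0+a1*y{1})+$
   (b0+b1*y{1})*(1.0-exp(-gamma*(y{1}-c)^2))
*
* Data-determined guesses: note 1.0/%variance rather than 1.0/sqrt(%variance)
stats y
compute c=%mean,gamma=1.0/%variance
```

The two-step NLLS estimation shown above for the LSTAR would then be applied with FRML=ESTAR.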

Nonlinear Least Squares - Estimation by Gauss-Newton


Convergence in 57 Iterations. Final criterion was 0.0000018 <= 0.0000100
Dependent Variable Y
Usable Observations 249
Degrees of Freedom 243
Centered R^2 0.9201029
R-Bar^2 0.9184590
Uncentered R^2 0.9222409
Mean of Dependent Variable 0.5876231693
Std Error of Dependent Variable 3.5509791503
Standard Error of Estimate 1.0139960256
Sum of Squared Residuals 249.84966939
Regression F(5,243) 559.6826
Significance Level of F 0.0000000
Log Likelihood -353.7398
Durbin-Watson Statistic 2.0482

Variable Coeff Std Error T-Stat Signif


************************************************************************************
1. A0 1.017585644 0.068354840 14.88681 0.00000000
2. A1 0.917712225 0.020712881 44.30635 0.00000000
3. B0 -4.467787643 3.502956803 -1.27543 0.20337389
4. B1 -1.438183472 0.591133075 -2.43293 0.01569975
5. GAMMA 9.957050393 1.551367386 6.41824 0.00000000
6. C 5.002927739 0.019240934 260.01481 0.00000000

Note that this has quite accurately estimated A0, A1, GAMMA and C, but not so
much B0 and B1. That's not that surprising since there are only about 20 data
points out of the 250 where B0 and B1 even matter.
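
The number of entries governed by the second branch can be counted directly. This is a sketch only: the variable name NABOVE is made up here, and the cutoff of 5.0 is the true value of c used in the simulation.

```rats
* Count the entries whose threshold variable y{1} is above 5
sstats 2 250 (y{1}>5.0)>>nabove
disp "Entries with lagged y above 5:" nabove
```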

Guess Values, Method 3: Grid Search


A grid search can sometimes be helpful if it's hard to fit a model from more
generic guess values and there is a single parameter that is the main problem.
A grid search over a full six-dimensional space (as we have here) is not
really feasible, and here isn't even necessary since the model is linear given
$\gamma$ and c. We have three possible approaches that might make sense:
1. Fix $\gamma$ and grid search over c.
2. Grid search over c, estimating $\gamma$ by non-linear least squares for each.
3. Jointly search over a grid in c and $\gamma$.
The first would involve the least calculation since each test model is linear
($\gamma$ is fixed in advance, and c is fixed for a given evaluation). The second
would likely require the most calculation since it would be fully estimating a
non-linear least squares model for each c. The third would be the hardest to set
up since it requires a two-dimensional grid.
The simplest and most general way to handle the control of a grid search (for
minimization) is the following pseudo-code (descriptive rather than actual):
compute bestvalue=%na
do over grid
calculate value(thisgrid) to thisvalue
if .not.%valid(bestvalue).or.bestvalue>thisvalue
compute bestvalue=thisvalue,bestgrid=thisgrid
end grid

At the end of this, the grid point at which the calculation is smallest will be
in bestgrid,⁹ and the value of the function there will be bestvalue. The IF
statement will give a true condition if either the current bestvalue is %NA,¹⁰
or if thisvalue is smaller than the current bestvalue.
Although the grid can be an actual equally-spaced set of values, that
isn't required. For our purposes, the quickest way to create the
grid is with the %SEQA function (meaning "additive sequence"), where
%SEQA(start,increment,n) returns the VECTOR with n values start,
start+increment, ..., start+increment*(n-1). For the grids for c, we'll use

stats(fractiles) y
compute ygrid=%seqa(%fract05,(%fract95-%fract05)/19,20)

which will do a 20 point grid (n = 20) over the range from the 5%-ile to the
95%-ile of the data.¹¹
The loop over the grid values is done with

dofor c = ygrid
...
end dofor c

DOFOR is a more general looping instruction than DO (section 2.8.1). While DO
only loops under the control of counter (INTEGER) variables, DOFOR can loop
over a list of anything. Usually (as here), the list is a VECTOR of some data
type but it can be a list of items separated by spaces:

dofor c = 1.0 2.0 4.0 8.0 16.0
...
end dofor c

Each time through the loop, DOFOR just pulls the next value off the list or out
of the VECTOR, sets the index variable (here C) equal to it, and re-executes the
content of the loop. There are other ways to set this up; in general, there can
be many equivalent ways to code a calculation like this. For instance, we could
have done
stats(fractiles) y
do i=1,20
compute c=%fract05+(i-1)*(%fract95-%fract05)/19
...
end do i

9
If you want to maximize instead, simply reverse the inequality on the IF.
10
.not.%valid(bestvalue) will be true if and only if bestvalue is missing, which will
happen the first time through the loop.
11
STATS(FRACTILES) computes a standard set of quantiles of the data (the 1, 5, 10, 25, 50,
75, 90, 95 and 99 percentiles), which are fetchable as variables named %FRACTnn.

There are two advantages of using the DOFOR setup:

1. It's clearer what the loop is doing.


2. The controls of the loop (start, end, number of values) are included in
just a single instruction (the COMPUTE with the %SEQA), instead of two
(the limit on the DO and the COMPUTE C).

The working code for the grid search for a fixed value of $\gamma$ is:

stats(fractiles) y
compute gamma=2.0/sqrt(%variance)
compute ygrid=%seqa(%fract05,(%fract95-%fract05)/19,20)
nonlin a0 a1 b0 b1
compute bestrss=%na
dofor c = ygrid
nlls(noprint,frml=lstar) y 2 250
if .not.%valid(bestrss).or.%rss<bestrss
compute bestrss=%rss,bestc=c
end dofor c

This uses a NONLIN which doesn't include C or GAMMA, since those are being
fixed on each evaluation. The following restores to C the best of the grid
values and estimates all parameters of the model:

disp "Guess Value used" bestc
*
compute c=bestc
nonlin a0 a1 b0 b1 gamma c
nlls(frml=lstar) y 2 250

The setup for the grid search across C with GAMMA being estimated given C is
similar. We'll include all the instructions, even though many are the same as
before:
stats(fractiles) y
compute gamma0=2.0/sqrt(%variance)
compute ygrid=%seqa(%fract05,(%fract95-%fract05)/19,20)
nonlin a0 a1 b0 b1 gamma
compute bestrss=%na
dofor c = ygrid
compute gamma=gamma0
nlls(noprint,frml=lstar) y 2 250
if .not.%valid(bestrss).or.%rss<bestrss
compute bestrss=%rss,bestc=c,bestgamma=gamma
end dofor c

This now includes GAMMA in the parameter set on the NONLIN and saves the
value of GAMMA along with C when we find an improvement. Note that GAMMA
is restored to its original guess value each time through the loop. This avoids
problems if (for instance) the first values of C have an optimal GAMMA which
is large. If GAMMA isn't re-initialized, it will use the value it got as part of
the previous estimation, which might be a problem.
We now have to restore the best values for both C and GAMMA:

disp "Guess values used" bestc "and" bestgamma
compute c=bestc,gamma=bestgamma
nonlin a0 a1 b0 b1 gamma c
nlls(frml=lstar) y 2 250

The bivariate grid search requires that we also set up a grid for $\gamma$. We'll
use the same grid for c. Since $\gamma$ has to be positive, we'll use the %EXP
function with %SEQA to generate a geometric sequence. %EXP takes the
element-by-element exp of a matrix; here it will give us multiples ranging from
.25 to 25 of the reciprocal of the inter-quartile range of the data.¹²

stats(fractiles) y
compute ygrid=%seqa(%fract05,(%fract95-%fract05)/19,20)
compute ggrid=%exp(%seqa(log(.25),.1*log(100),11))/(%fract75-%fract25)
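
Written out, the GGRID calculation above produces the following eleven values. Here $Q_{25}$ and $Q_{75}$ are shorthand, introduced only for this sketch, for %FRACT25 and %FRACT75:

```latex
\gamma_k = \frac{\exp\left(\log 0.25 + k\,(0.1\log 100)\right)}{Q_{75}-Q_{25}}
         = \frac{0.25\times 100^{k/10}}{Q_{75}-Q_{25}},
\qquad k = 0,1,\ldots,10
```

Each grid value is therefore $100^{0.1}\approx 1.585$ times the previous one, running from $0.25/(Q_{75}-Q_{25})$ at $k=0$ to $25/(Q_{75}-Q_{25})$ at $k=10$.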

As with the first grid, we leave both GAMMA and C out of the parameter set. We
nest the two DOFOR loops (the order here doesn't matter), and do the NLLS and
the test for improvement inside the inner loop.

nonlin a0 a1 b0 b1
compute bestrss=%na
dofor c = ygrid
dofor gamma = ggrid
nlls(noprint,frml=lstar) y 2 250
if .not.%valid(bestrss).or.%rss<bestrss
compute bestrss=%rss,bestc=c,bestgamma=gamma
end dofor gamma
end dofor c

As before, we restore the best values for both, and estimate the full set of
parameters:

disp "Guess values used" bestc "and" bestgamma
compute c=bestc,gamma=bestgamma
nonlin a0 a1 b0 b1 gamma c
nlls(frml=lstar) y 2 250

12
The inter-quartile range is the distance between the 25%-ile and 75%-ile of a series, and
is a (robust) alternative to the standard deviation for measuring dispersion since it isn't as
sensitive to outliers.

Not surprisingly (since it's with constructed data), all three grid search
methods end up with the same optimum as we got originally. That may not be the
case with actual data, as we'll see in Section 3.8.

3.7 Smooth Transition Regression


In the models examined above, the threshold variable was a lag of the dependent
variable. It's also possible to use a lag of the difference (known as a
momentum TAR model), or possibly some other linear combination of lags (several
period average for instance). Because the threshold in any of these cases is
endogenous, the dynamics of the generated process can be quite complicated.
It's also possible to apply the same type of non-linear model to a situation where
the threshold is exogenous. Such models are called Smooth Transition Regression
(or STR) rather than STAR. One obvious case would be where the threshold
variable is time, so the model has a structural break at some point but
smoothly moves from one regime to the other, perhaps due to a gradual rollout
or slow adoption of new technologies.
The following program simulates a series with an LSTR break such that:
$y_t = 1 + 3/[1 + \exp(-0.075(t-100))] + 0.5 y_{t-1} + \varepsilon_t$ (3.7)
Note that the centrality parameter is 100 and that $\gamma$ = 0.075. Here, the
break affects only the intercept term as the autoregressive parameter is always
0.5. Since the value of $\theta$ ranges from 0 to 1, the intercept is 1 for small
values of t and is 4 for large values of t. With an autoregressive parameter of
0.5, the mean of the series is about 2 for small values of t and is about 8 for
very large values of t. Because the LSTR break is smooth, the average value of
the series starts to slowly shift upward beginning around t = 75 and continues
the upward shift until the process levels off at roughly t = 125.
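
The numbers in this paragraph can be checked from (3.7). The sketch below writes the transition function as $\theta_t$ (matching the THETA series in the program) and evaluates it at a few values of t:

```latex
\begin{aligned}
\theta_t &= \left[1+\exp\left(-0.075\,(t-100)\right)\right]^{-1}, \\
\theta_t \to 0:&\quad E[y_t] \approx \frac{1}{1-0.5} = 2,
\qquad
\theta_t \to 1:\quad E[y_t] \approx \frac{1+3}{1-0.5} = 8, \\
\theta_{75} &= \left[1+e^{1.875}\right]^{-1} \approx 0.13,
\qquad \theta_{100} = 0.5,
\qquad \theta_{125} = \left[1+e^{-1.875}\right]^{-1} \approx 0.87
\end{aligned}
```

So roughly three quarters of the shift in the intercept takes place between t = 75 and t = 125.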
The first part of the program (Example 3.4) sets the default length of a se-
ries to be 250 observations, seeds the random number generator and creates
the eps series containing 250 pseudo-random numbers drawn from a normal
distribution with a standard deviation of unity.

all 250
seed 2003
set eps = %ran(1)

The next two lines create the series for $\theta$ and $y_t$. The resulting series
is graphed, producing Figure 3.5:

set theta = 1/(1+exp(-.075*(t-100.)))
set(first=2.) y = 1 + 3*theta + 0.5*y{1} + eps
graph(footer="A Simulated LSTR Break") 1
# y
Figure 3.5: A Simulated LSTR Break

Note that it's easier to simulate this because the $\theta$ function can be
generated separately from the y.
This type of model is much easier to handle than a STAR because the break,
while not necessarily sharp, has an easily visible effect: here the series seems
to have a clearly higher level at the end than at the beginning.
One way to proceed might be to estimate the series as a linear process and
create guess values based upon the estimated coefficients. Since the intercept
appears to be lower near the start of the data set, this makes the first branch
somewhat lower than the linear estimate, and the second somewhat higher.¹³

linreg(noprint) y
# constant y{1}
compute a1=%beta(2),a0=%beta(1)-%stderrs(1),b0=2*%stderrs(1)
compute c=75.0,gamma=.25

The guess value for c appears at least reasonable given the graph, as 75 seems
to be roughly the point where the data starts changing. This value of $\gamma$
may be a bit too high (at $\gamma$ = .25 about 90% of the transition will be over
the range of $[c-12, c+12]$) but it appears to work in this case:

nlls(frml=lstr,iterations=200) y
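
The NONLIN and FRML for the LSTR model are not shown in this section. A definition consistent with (3.7), with the five parameters in the output below, and with the formula name LSTR used in the grid search later in this section might look like the following sketch. Treat it as an assumption rather than the program's actual code.

```rats
* Assumed definition: only the intercept shifts, and the threshold is time (T)
nonlin a0 a1 b0 gamma c
frml lstr y = a0+a1*y{1}+b0*%logistic(gamma*(t-c),1.0)
```

These two instructions would need to be executed before the NLLS.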

13
Since the second branch intercept is the sum of A0 and B0, B0 is initialized to the guess at
the difference between the two intercepts.

Nonlinear Least Squares - Estimation by Gauss-Newton


Convergence in 9 Iterations. Final criterion was 0.0000029 <= 0.0000100
Dependent Variable Y
Usable Observations 249
Degrees of Freedom 244
Skipped/Missing (from 250) 1
Centered R^2 0.8871302
R-Bar^2 0.8852799
Uncentered R^2 0.9725995
Mean of Dependent Variable 5.4629432995
Std Error of Dependent Variable 3.0993810584
Standard Error of Estimate 1.0497715136
Sum of Squared Residuals 268.89293630
Regression F(4,244) 479.4457
Significance Level of F 0.0000000
Log Likelihood -362.8848
Durbin-Watson Statistic 1.9852

Variable Coeff Std Error T-Stat Signif


************************************************************************************
1. A0 0.723962117 0.181964155 3.97860 0.00009144
2. A1 0.433037455 0.057838156 7.48705 0.00000000
3. B0 3.876373910 0.448189508 8.64896 0.00000000
4. GAMMA 0.065350537 0.012699962 5.14573 0.00000055
5. C 97.482532947 3.385817497 28.79143 0.00000000

The coefficient estimates are reasonably close to the actual values in the data
generating process. As an exercise, you might want to experiment with differ-
ent initial guesses. It turns out that, for this model, the results are quite robust
to the choice of the initial conditions. Also, use the AIC and BIC to compare the
fit of this model to that of a linear model and to a model estimated with a sharp
structural break.
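
One way to start on the suggested AIC and BIC comparison is sketched below for the linear and LSTR models. The variable names are made up here, and the criteria are computed as T*log(RSS) plus a penalty, which is one common convention and an assumption on our part; any version works as long as the same one is applied to every model over the same sample. It also assumes a formula named LSTR has been defined and that NLLS sets %NREG to the number of estimated parameters.

```rats
* Linear AR(1) benchmark over the same entries as the LSTR model
linreg(noprint) y 2 250
# constant y{1}
compute aiclin=%nobs*log(%rss)+2.0*%nreg
compute sbclin=%nobs*log(%rss)+%nreg*log(%nobs)
*
* LSTR model, starting from whatever values A0, A1, B0, GAMMA and C now hold
nonlin a0 a1 b0 gamma c
nlls(frml=lstr,iterations=200,noprint) y 2 250
compute aicstr=%nobs*log(%rss)+2.0*%nreg
compute sbcstr=%nobs*log(%rss)+%nreg*log(%nobs)
*
disp "Linear AIC and SBC" aiclin sbclin
disp "LSTR AIC and SBC" aicstr sbcstr
```

The model with the smaller value of a criterion is the one that criterion prefers.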
A grid search for a threshold based upon time is simpler than it is for a con-
tinuous variable, as you can just loop over the entries. The following does a
preliminary grid search over the center 70% of the data (leaving out 15% at
either end). It also saves the values of the sum of squares into the series RSS
so we can graph it (Figure 3.6):

nonlin a0 a1 b0 gamma
set rss = %na
do time=38,213
compute gamma=.25
compute c=time
nlls(frml=lstr,iterations=200,noprint) y
if %converged==1
compute rss(time)=%rss
end do time

This is doing the second form of grid search, where $\gamma$ is estimated for each
test value of c. We don't need to save the best value for c because we can simply
use EXTREMUM to find the best value, and the entry at which it's achieved
(%MINENT is defined by EXTREMUM):
Figure 3.6: Sum of Squares for LSTR Break

extremum rss
compute c=%minent
compute gamma=.25
nonlin a0 a1 b0
nlls(frml=lstr,iterations=200) y
nonlin a0 a1 b0 gamma c
nlls(frml=lstr,iterations=200) y

This gives us the same results as we got with the empirical guess values. The
graph of the sums of squares (as a function of time) is produced with:

graph(footer="Sum of Squares for LSTR Break") 1
# rss 38 213

Note that the model doesn't always converge for larger values, which is why
the graph has gaps. (Note the test for %CONVERGED in the grid search loop
above).
Figure 3.7: Annualized Inflation Rate (Measured by PPI)

3.8 An LSTAR Model for Inflation


Our previous examples used simulated data, so we knew what the true model
was. We will now try to fit an LSTAR model to the U.S. inflation rate. The full
program is Example 3.5. We can compute and graph (Figure 3.7) the (annual-
ized) inflation rate with:

set pi = 400.0*log(ppi/ppi{1})
graph(footer="Annualized Inflation Rate (Measured by PPI)",$
grid=(t==1983:1))
# pi

It's fairly clear from looking at this that there is a difference in the process
in the period from 1973 to 1980 compared with after that. During the 1970s,
inflation was often over 5% quarter after quarter; more recently, almost any
time inflation exceeds 5% in a quarter, it is followed by a quick drop. While
U.S. monetary policy has never formally used inflation targeting, it certainly
seems possible that the inflation series since the 1980s might be well-described
by a non-linear process similar to the one generated in Section 3.6.
We'll focus on the sample period from 1983:1 on. In the simulated example, we
knew that the process was an LSTAR with AR(1) branches. Here we don't know
the form, or even know whether an LSTAR will work at all. We can start by
seeing if we can identify an AR model from the autocorrelation function
(restricting the calculation to the desired sample, Figure 3.8):

@bjident pi 1983:1 *

Since we're picking a pure autoregression, the partial autocorrelations are the
main statistic, and they would indicate either 1 or 4 lags. Since these statistics
are being computed assuming a single model applies to the whole sample, it
makes sense to work with the more general model to start, possibly reducing
it if it appears to be necessary. Note that there is no reason the two branches
must have the same form.

Figure 3.8: Correlations of Inflation Rate (1983-2012), showing CORRS and PARTIALS
Now, we can estimate the base AR(4) model and look at the autocorrelations of
the residuals (Figure 3.9):

linreg pi 1983:1 *
# constant pi{1 to 4}
@regcorrs(qstats,footer="Residuals from AR(4) Model")

Linear Regression - Estimation by Least Squares


Dependent Variable PI
Quarterly Data From 1983:01 To 2012:04
Usable Observations 120
Degrees of Freedom 115
Centered R^2 0.1453374
R-Bar^2 0.1156100
Uncentered R^2 0.3440071
Mean of Dependent Variable 2.2002048532
Std Error of Dependent Variable 4.0148003022
Standard Error of Estimate 3.7755989989
Sum of Squared Residuals 1639.3419971
Regression F(4,115) 4.8890
Significance Level of F 0.0011203
Log Likelihood -327.1461
Durbin-Watson Statistic 1.9443

Variable Coeff Std Error T-Stat Signif


************************************************************************************
1. Constant 2.009014916 0.474728227 4.23193 0.00004681
2. PI{1} 0.345155837 0.090901340 3.79704 0.00023547
3. PI{2} -0.180008354 0.095737234 -1.88023 0.06260632
4. PI{3} 0.151718031 0.096356361 1.57455 0.11810720
5. PI{4} -0.227600168 0.091450487 -2.48878 0.01424990
Figure 3.9: Residuals from AR(4) Model (AIC = 5.536, SBC = 5.652, Q = 10.00, P-value 0.97889)

Everything seems fine. Maybe we could look at re-estimating dropping the 3rd
lag, but otherwise this looks fine. However, there are some other tests that
we can apply that could pick up more subtle failures in the model. One of
these, which looks specifically at an autoregression and tests for the
possibility of STAR behavior, is STARTEST. The testing procedure is from
Teräsvirta (1994). A similar test for more general smooth transition behavior
is provided by REGSTRTEST.
The two main options on STARTEST are P=number of lags and D=delay (lag) in
the threshold. It's recommended that you try several values for D; here, we'll
do 1, 2 and 3:

@startest(p=4,d=1) pi 1983:1 *
@startest(p=4,d=2) pi 1983:1 *
@startest(p=4,d=3) pi 1983:1 *

The results for D=2 are the most significant, which is the method used to choose
the best delay:
Test for STAR in series PI
AR length 4
Delay 2

Test F-stat Signif


Linearity 3.6535512 0.0001
H01 5.3764713 0.0005
H02 1.5748128 0.1863
H03 3.2926785 0.0139
H12 3.5313261 0.0012

These provide a series of LM tests for the correlation of the standard AR
residuals with various non-linear functions on the data (products of the
regressors with powers of the threshold variable). You should be very careful in
interpreting the results of this (and any other) LM diagnostic test. If the data
show STAR behavior, we would expect that this test would pick it up; you might
want to check this with the data from Example 3.3. However, a significant test
result doesn't necessarily mean that STAR behavior is present. It indicates that
there is evidence of some type of non-linearity not captured by the simple AR.
It is also possible for this test to be fooled by outliers using the
mechanism described on page 79.
The different test statistics are for different sets of powers (from 1 to 3) of
the threshold in the interaction terms. The Linearity statistic is a joint test
of all of them. The combination of results points to an LSTAR rather than an
ESTAR as the more likely model. Some type of STAR behavior is possible given the
rejection of linearity, but if H03 (the 3rd power) were insignificant, it would
point towards an ESTAR which, because of symmetry, wouldn't have 3rd power
contributions.
Because the model is now bigger, and may be subject to change, we'll introduce
a more flexible way to handle it:

linreg pi 1983:1 *
# constant pi{1 to 4}
frml(lastreg,vector=b1) phi1f
frml(lastreg,vector=b2) phi2f

This estimates the standard AR(4) model, then defines two FRMLs, called PHI1F
and PHI2F, each with the form of that last regression (the LASTREG option).
The PHI1F formula will use the VECTOR B1 for its coefficients and PHI2F will
use B2. Thus, given values for the B1 coefficients, PHI1F(T) will evaluate
b1(1)+b1(2)*pi(t-1)+b1(3)*pi(t-2)+b1(4)*pi(t-3)+b1(5)*pi(t-4)
The advantage of this is clear: we can change the entire setup of the model by
changing that LINREG.
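
Because each FRML picks up the form of whichever regression was run last, the two branches don't have to share a specification. As a sketch of that idea (not something done in the text), the following would give the first branch four lags and the second branch only one, simply by running a different LINREG before each FRML(LASTREG):

```rats
* First branch: AR(4), coefficients in B1 (5 elements)
linreg(noprint) pi 1983:1 *
# constant pi{1 to 4}
frml(lastreg,vector=b1) phi1f
*
* Second branch (illustrative alternative): AR(1), coefficients in B2 (2 elements)
linreg(noprint) pi 1983:1 *
# constant pi{1}
frml(lastreg,vector=b2) phi2f
```

The rest of the setup (the PARMSETs and the STAR formula) would not need to change.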
We'll now split the parameter set into two parts: STARPARMS, which will be
just $\gamma$ and c, and REGPARMS, which will have the two B VECTORS. This is
done using NONLIN with the PARMSET option.

nonlin(parmset=starparms) gamma c
nonlin(parmset=regparms) b1 b2

A PARMSET is a convenient way to organize a set (or subset) of non-linear
parameters. If you add PARMSETs with (for instance) STARPARMS+REGPARMS,
you combine them into a larger set. Any of the estimation instructions which
allow for general non-linear parameters (such as NLLS) have a PARMSET option
so you can put in a PARMSET that you've created.
We can put together the final calculation for the LSTAR explanatory model with

frml glstar = %logistic(gamma*(pi{2}-c),1.0)
frml star pi = g=glstar,phi1f+g*phi2f

This does the same general type of calculation as the single FRML used in
Example 3.3, but has broken it up into more manageable pieces.
The following will do the estimation using the data-determined method of
guess values (page 82):

stats pi 1983:1 *
compute c=%mean,gamma=1.0/sqrt(%variance)
*
nlls(parmset=regparms,frml=star,noprint) pi 1983:1 *
nlls(parmset=regparms+starparms,frml=star,print) pi 1983:1 *

This uses the PARMSET option on NLLS and the ability to add PARMSETs to
simplify the two-step estimation process; the first NLLS holds c and $\gamma$
fixed (since they aren't in REGPARMS), estimating only the regression
coefficients, while the second NLLS does the whole model.
Unfortunately, the results are not promising. The model doesn't converge and
the coefficients look like:
1. B1(1) 0.85 1.50 0.56531 0.57303574
2. B1(2) 0.47 0.13 3.48749 0.00070638
3. B1(3) -0.24 0.18 -1.31836 0.19017384
4. B1(4) 0.19 0.12 1.66112 0.09958973
5. B1(5) -0.03 0.19 -0.13401 0.89364362
6. B2(1) 16687.40 37452754.62 4.45559e-004 0.99964532
7. B2(2) -503.54 1125697.16 -4.47310e-004 0.99964392
8. B2(3) -250.03 559047.27 -4.47245e-004 0.99964398
9. B2(4) -1326.00 2958199.04 -4.48246e-004 0.99964318
10. B2(5) -1459.13 3260045.06 -4.47579e-004 0.99964371
11. GAMMA 0.38 0.47 0.81112 0.41908326
12. C 27.94 5961.68 0.00469 0.99626886

The values for B2 are nonsensical, as is C, which is much higher than the
largest observed value in the data range. How is it possible for this to give
a lower sum of squares than a non-threshold model (which it does: even not
converged, it's 1309 vs 1639 for the simple AR(4))? With the large value of
c, the $\theta$ function is only barely larger than 0 for any of the data points;
the largest is roughly .001 at 2008:2. Clearly, however, that tiny fraction
applied to those very large B2 values improves the fit (quite a bit) at some of
the data points.
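
The size of the transition function at the unconverged estimates can be inspected by evaluating the GLSTAR formula entry by entry. This is a sketch; the series name THETAHAT is made up here, and it assumes GAMMA and C still hold the values from the NLLS just shown.

```rats
* Evaluate the transition function at the current GAMMA and C
set thetahat 1983:1 * = glstar(t)
* EXTREMUM reports the largest value and the entry where it occurs
extremum thetahat 1983:1 *
disp "Largest transition value" %maximum "at entry" %maxent
```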
We can see if the more complicated preliminary grid search helps, though this
seems unlikely given the results above (where c drifted off outside the data
range). To use the combined PARMSETs, we need to define a new one (called
GAMMAONLY) which leaves C out, since we want to peg that at each pass through
the grid:

stats(fractiles) pi 1983:1 *
*
nonlin(parmset=gammaonly) gamma
*
compute bestrss=%na
dofor c = %seqa(%fract10,(%fract90-%fract10)/19,20)
compute gamma=1.0/sqrt(%variance)
nlls(parmset=regparms,frml=star,noprint) pi 1983:1 *
nlls(parmset=regparms+gammaonly,frml=star,noprint) pi 1983:1 *
if .not.%valid(bestrss).or.%rss<bestrss
compute bestrss=%rss,bestc=c,bestgamma=gamma
end dofor
*
disp "Grid choices" bestc bestgamma

This picks the following which (as might be expected) has c at the upper bound
of the grid.

Grid choices 7.28375 0.33748

As before, the following would estimate the model given those guess values.
Since we now want to estimate both c and $\gamma$, we use STARPARMS rather than
GAMMAONLY.
compute c=bestc,gamma=bestgamma
nlls(parmset=regparms,frml=star,noprint) pi 1983:1 *
nlls(parmset=regparms+starparms,frml=star,print) pi 1983:1 *

The outcome, however, is basically the same as before.


When a model fails like this, it's helpful to know what conditions might cause
it. With any model, you need to check for major structural breaks, but STAR
models are particularly susceptible to outliers so that's a first check. We can
take a look at the residuals from that last (unconverged) non-linear model and
the original linear regression (Figure 3.10):

set nllsresids = %resids
linreg pi 1983:1 *
# constant pi{1 to 4}
set olsresids = %resids
graph(footer="Comparison of STAR and OLS Residuals",$
key=upleft,klabels=||"STAR","OLS"||) 2
# nllsresids
# olsresids

The 2008:4 residual on the linear model is about 7 standard deviations. Since
2008:4 is preceded by three relatively high values for $\pi$, the STAR model is able
to combine that with a trigger using the second period lag in the threshold to
dramatically reduce the residual at that point, at the cost of worse fits following
a few of the other large values. The reduction in the sum of squares from the
better fit at that one point for the STAR more than covers the difference between
the sums of squares over the whole 120 data points, so the fit is actually,
on net, worse on the other 119 entries.

[Figure: time-series plot of the STAR and OLS residuals, 1983 to 2010]
Figure 3.10: Comparison of STAR and OLS Residuals

If we revisit the test for STAR, but apply it to the data set only through 2007:4,
we get a very different result:

@startest(p=4,d=2) pi 1983:1 2007:4

Test for STAR in series PI


AR length 4
Delay 2

Test F-stat Signif


Linearity 1.1955332 0.3001
H01 2.4000773 0.0557
H02 0.4187428 0.7947
H03 0.8483300 0.4986
H12 1.3787493 0.2171

Our conclusion is that a STAR process doesn't seem to be required to explain the
behavior of the inflation rate over that period. Since the autoregressive representation
has relatively little persistence, it's quite possible that the apparent
sharp drops are simply the result of the natural behavior of an AR process
with low serial correlation. If there is some systematic non-linearity, it isn't
explained by a STAR model.

3.9 Functions with Recursive Definitions


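An alternative model which shows non-linear adjustment is the bilinear model.
A simple case of this is
$$y_t = \alpha y_{t-1} + \beta \varepsilon_{t-1} + \gamma y_{t-1}\varepsilon_{t-1} + \varepsilon_t \qquad (3.8)$$
If $\gamma$ were zero, this would be a standard ARMA(1,1) model. The bilinear part
of this is the $\gamma y_{t-1}\varepsilon_{t-1}$ term, which is linear in $y_{t-1}$ given $\varepsilon_{t-1}$ and linear in
$\varepsilon_{t-1}$ given $y_{t-1}$. For this model
$$\frac{\partial y_t}{\partial \varepsilon_{t-1}} = \alpha + \beta + \gamma y_{t-1} + \gamma \varepsilon_{t-1}$$
which takes into account the fact that $y_{t-1}$ moves with $\varepsilon_{t-1}$. Note that when $\gamma$
is zero, the derivative doesn't depend upon the past value(s) of y or the size of
$\varepsilon_{t-1}$; that's what the process being linear means. With $\gamma$ non-zero, it depends
upon both.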
What complicates this is that two of the regressors depend upon the unobservable
$\varepsilon_{t-1}$, which has a recursive definition: we can't compute the residual $\varepsilon_t$
without $\varepsilon_{t-1}$ (and the values for $\alpha$, $\beta$ and $\gamma$) and we can't compute $\varepsilon_{t-1}$
without $\varepsilon_{t-2}$, etc. This is also true for ARMA models with MA terms, but because
that's such a standard type of model, the recursion is handled internally by the
BOXJENK instruction. For a non-standard recursive model, you'll have to write
the FRML carefully to make sure that it is calculated the way you want.
One obvious problem is that the recursion has to start somewhere. If we begin
at t = 2 (which is the first data point where $y_{t-1}$ is available), what can we use
for $\varepsilon_{t-1}$? Since it's a residual, the most obvious choice is 0. The following gets
us started:

dec series eps


clear(zeros) eps

The series EPS will be used for the generated time series of residuals.
In Example 3.6, we'll again use the inflation rate over the period from 1983 on
as the dependent variable:

set pi = 400.0*log(ppi/ppi{1})

To the three parameters in (3.8), we'll add an intercept to allow for a non-zero
mean.14 We'll call that C0. So the NONLIN instruction to declare the non-linear
parameters is

nonlin c0 alpha beta gamma

14
Note, however, that the bilinear term $y_{t-1}\varepsilon_{t-1}$ has a non-zero expectation. A bilinear term
where the y is dated before the $\varepsilon$ will have zero expected value.

The following is probably the simplest way to do the FRML:

frml bilinear pi = z=c0+alpha*pi{1}+beta*eps{1}+gamma*pi{1}*eps{1},$


eps=pi-z,z

In the end, the FRML needs to provide the explanatory part of the equation, but
we need that to compute the current residual for use at the next data point.15
Thus the three-step calculation: generating into the (REAL) variable Z the systematic
part, computing the current value of EPS using the dependent variable
and the just-computed value of Z, then bringing back the Z to use as the return
value of the FRML. If we didn't add that ,Z to the end, the return value would be
the last thing computed, which would be EPS.
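
The same three-step recursion can be sketched outside RATS. The following is a minimal Python illustration, not the RATS implementation: the function name `bilinear_residuals` and the toy data are assumptions made for the example, and the pre-sample residual is pegged at zero as in the text.

```python
def bilinear_residuals(y, c0, alpha, beta, gamma):
    """Return (fitted, eps) for the bilinear recursion, starting eps at 0.

    Hypothetical helper for illustration only.
    """
    n = len(y)
    eps = [0.0] * n       # residual at the first point is pegged at zero
    fitted = [None] * n   # no fitted value at the first point (needs a lag)
    for t in range(1, n):
        # step 1: systematic part, using last period's residual
        z = (c0 + alpha * y[t - 1] + beta * eps[t - 1]
             + gamma * y[t - 1] * eps[t - 1])
        # step 2: current residual, needed at the next data point
        eps[t] = y[t] - z
        # step 3: the systematic part is what the formula "returns"
        fitted[t] = z
    return fitted, eps


y = [1.0, 2.0, 1.5, 3.0]  # toy data, not the inflation series
fitted, eps = bilinear_residuals(y, c0=0.5, alpha=0.5, beta=0.2, gamma=0.1)
print(fitted)
print(eps)
```

The order matters: the residual at t cannot be filled in until the systematic part at t is known, and the systematic part at t+1 cannot be computed until the residual at t is stored.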
Obvious guess values here would be either the results from an AR(1) model
($\beta$ and $\gamma$ both zero) or from an ARMA(1,1) model (only $\gamma$ zero). We'll use the
ARMA. To do this, all we have to do is peg GAMMA to 0 and let NLLS handle it:

nonlin c0 alpha beta gamma=0.0


nlls(frml=bilinear) pi 1983:1 *

We then relax the restrictions to get our model:

nonlin c0 alpha beta gamma


nlls(frml=bilinear,iters=500) pi 1983:1 *

The before and after estimates (cut down to the key values) are
Sum of Squared Residuals 1676.4146107

Variable Coeff Std Error T-Stat Signif


************************************************************************************
1. C0 3.063121504 0.713047291 4.29582 0.00003615
2. ALPHA -0.410666169 0.171062737 -2.40068 0.01794100
3. BETA 0.755033543 0.125563348 6.01317 0.00000002

Sum of Squared Residuals 1622.0602194

Variable Coeff Std Error T-Stat Signif


************************************************************************************
1. C0 3.170708712 0.695829116 4.55673 0.00001293
2. ALPHA -0.561753920 0.102134138 -5.50016 0.00000023
3. BETA 0.926950151 0.043157032 21.47854 0.00000000
4. GAMMA 0.007104857 0.003221804 2.20524 0.02940923

The bilinear model fits somewhat better than the ARMA.16 From the t-statistic,
the bilinear term is significant at standard levels, but just barely. However,
the sign on $\gamma$ would appear to be wrong to explain non-linearity due to some
form of inflation targeting: a positive value of $\gamma$ means that large residuals
(of either sign) tend to produce higher values for $\pi$ in the next period. If we
re-estimate ending at 2008:2 (just before the big negative outlier), we get
Variable Coeff Std Error T-Stat Signif
************************************************************************************
1. C0 2.957252096 0.735882030 4.01865 0.00011494
2. ALPHA -0.213433697 0.264220271 -0.80779 0.42116963
3. BETA 0.597679361 0.217233787 2.75132 0.00707134
4. GAMMA -0.030825660 0.021977365 -1.40261 0.16389334

so the positive $\gamma$ seems to have been created by that outlier. You might ask
why we didn't just SMPL the outlier out of the data set (with the option
SMPL=T<>2008:3). With a recursively-defined function, you can't really do
that. If you try to use SMPL this way, it will cut the estimation off at the end of
2008:2 anyway: since EPS doesn't get defined for 2008:3, it's not available to
calculate the model at 2008:4, so EPS can't be computed for 2008:4, and, extending
this, the remainder of the sample can't be computed. You could "dummy
out" the data point by adding the dummy variable created with

set dummy = t==2008:4

to the model.17 What that does, however, is forcibly make EPS equal to zero
at 2008:3, which might cause problems explaining the data in the periods immediately
after this. Since it's near the end of the data set, simply cutting the
sample before the big outlier is probably the best approach, and the results
would show that, again, we haven't come up with a good way to describe the
inflation rate with a nonlinear time series model.

15
%RESIDS doesn't get defined until the end of the calculations, so you can't use it here.
16
If you use BOXJENK to estimate the ARMA model, you'll get effectively the same coefficients
for the AR and MA, but a different constant because the constant in the model used by
BOXJENK is the process mean, not the intercept in a reduced form equation.
17
As an exercise, you might want to try to incorporate that.

3.10 Tips and Tricks


3.10.1 Understanding Computer Arithmetic

The standard representation for real numbers in statistical computations for
roughly the past 40 years has been double precision. This uses 64 bits to represent
the value. The older single precision is 32 bits. Before the introduction
of specialized floating point processors, both the single and double precision
floating point calculations had to be done using blocks of shorter integers, similar
to the way that children are taught to multiply two four-digit numbers. If
you compare the amount of work required to multiply two two-digit numbers
by hand vs the amount required to do two four-digit numbers, you can see that
single precision calculations were quite a bit faster, and were sometimes chosen
for that reason. However, particularly when used with time series data,
they were often inadequate: even a relatively short univariate autoregression
on (say) log GDP wouldn't be computed accurately at single precision.
There are three principal problems which are produced due to the way computers
do arithmetic: overflow, underflow and loss of precision. The standard
now for double precision numbers is called IEEE 754. This uses 1 sign bit, 11
exponent bits and 53 significand bits, with a lead 1 bit assumed, so 52 are
actually needed.18 The value is represented in scientific notation as (in binary,
though we'll use standard notation)

$$(\text{sign})\; 1.xxxxxxxxxxxxxxx \times 10^{\text{power}}$$

The range of the power is from -308 to +308. An overflow is a calculation which
makes the power larger than +308, so there's no valid double precision representation.
When this happens, RATS and most other software will treat the
result as infinity.19 An underflow is a calculation which makes the power less
than -308. Here the standard behavior is to call the result zero.
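
These limits are easy to inspect directly. The sketch below uses Python (not RATS) and assumes an interpreter whose `float` is an IEEE 754 double, which is the case on essentially all current platforms; it only illustrates the ranges quoted above.

```python
import sys

# Significand bits, counting the implied lead 1 bit
print(sys.float_info.mant_dig)
# Largest power of 10 that is still representable
print(sys.float_info.max_10_exp)

# Overflow: the product has no valid double precision representation
big = 1e308 * 10.0
# Underflow: the product is too small to represent and becomes zero
tiny = 1e-308 * 1e-308

print(big, tiny)
```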
Most statistical calculations are able to steer clear of over- and underflow, typically
by using logs rather than multiplying: the main cause of these conditions
is multiplying many large or small numbers together, so adding logs avoids
that. However, there are situations (particularly in Bayesian methods) where
the actual probabilities or density functions are needed, and you may have to
exercise some care in doing calculations in those cases.
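
As a small illustration of why logs are preferred, the Python sketch below multiplies 1000 made-up "probabilities" of 1e-5 each. The values are invented for the example; the point is only that the running product underflows while the sum of logs stays well inside the representable range.

```python
import math

probs = [1e-5] * 1000   # hypothetical per-observation likelihoods

product = 1.0
for p in probs:
    product *= p        # underflows to 0.0 long before the loop ends

log_sum = sum(math.log(p) for p in probs)   # stays finite

print(product)
print(log_sum)
```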
A more serious problem is loss of precision. What's the value of
$$1.00000000000000001 - 1.0$$
If you're a computer doing double-precision arithmetic, it's 0 because it doesn't
have the 17 significant digits available to give different representations to the
two values on either side of the $-$. When you subtract two large and almost
equal numbers, you may lose almost all the precision in the result. Take the
following situation: this is (to 10 digits) the cross product matrix of two lags of
GDP from the data set. If we want to do a linear regression with those two lags
as the regressors, we would need to invert this matrix.

14872.33134 14782.32952
14782.32952 14693.68392
You can compute the determinant of this20 using all ten digits, and the same
rounding to just five with

disp 14872.33134*14693.68392-14782.32952^2
disp 14872.*14694.-14782.^2

The results are very different:

12069.82561
21644.00000

Even though the input values in the second case were accurate to five digits,
the end result isn't even correct to a single digit. And this is just a $2 \times 2$ case.

18
The number 0 is represented by all zero bits.
19
The IEEE standard has special codings for infinity and other special values such as NaN (not
a number) for a result such as $\sqrt{-1}$.
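
The same determinant comparison can be reproduced in any double precision environment. Here is a Python version of the two calculations (Python is used only for illustration; the numbers are those of the cross product matrix in the text):

```python
# Determinant using all ten digits of each element
full = 14872.33134 * 14693.68392 - 14782.32952 ** 2

# Same calculation with each element rounded to five digits
rounded = 14872.0 * 14694.0 - 14782.0 ** 2

print(full)
print(rounded)
```

Each rounded input is off by less than one part in 30,000, yet the difference of the two products, each of which is around 2.2e8, is off by almost a factor of two.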
Linear regressions in RATS and most other software are set up to avoid these
types of problems where possible. For instance, specialized inversion routines
are used. While theoretically, the sum of squared residuals could be computed
as

$$y'y - y'X\hat{\beta}$$

that's subtracting two big numbers to create a small one, which can cause precision
problems. Instead, RATS computes the individual residuals and uses those
to compute the sum of squares.
It's much easier to work around potential problems when the calculations are
as well-structured as a linear regression. It's much harder to do this with non-linear
ones since it's not always obvious where the difficulties may lie. This is
why it's important to understand the usefulness of minor changes to the model
discussed in Section 3.4.

3.10.2 The instruction NLPAR

The most important controls for non-linear estimation are the ITERS, CVCRIT
and METHOD options on the estimation instructions. There are, however, quite
a few other controls which can be used for particularly troublesome estimation
problems. Rather than being added to each estimation instruction, these
(rarely used) tweaks are handled by the separate instruction NLPAR.

20
RATS uses a different method for doing the inversion.

One option on NLPAR that changes fundamentally how convergence is deter-
mined is

CRITERION=[COEFFICIENTS]/VALUE
Convergence occurs if the change in the COEFFICIENTS or VALUE is less
than the number specified by the CVCRIT on the estimation instruction.

In most cases, we want the coefficients to be well-estimated, as that's usually
the main interest. The default is thus CRITERION=COEFFICIENTS, so the process
doesn't converge until the change in each individual coefficient is small.
However, in some cases, you don't need that; you just need to be reasonably
close to the optimum. Using NLPAR(CRIT=VALUE) changes the convergence
test to look only at the function value, and not the coefficients, and is usually
met more easily. When might that be reasonable? Perhaps you're only doing
some type of grid search and need only a reasonable approximation to the optimum.
In the examples we did in this chapter, that wouldn't be necessary, but
a grid search with a much more complicated function might take hours. Cutting
that by a third by using a looser fit might be noticeable. In some cases, you
may have a function where an optimum at a boundary as described on page 64
might be possible. In that case, changes of a parameter on the order of 1000s
might have no noticeable effect on the function value, so convergence on coefficient
values would never be met. However, a better approach in a case like
that is to either re-parameterize the model so the (new) parameter is bounded
rather than unbounded, or to simply peg it at a large value of the proper sign
(since it should be clear the optimum is at one of the infinities) and estimate
the rest of the model.
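
To make the distinction between the two criteria concrete, here is a hypothetical Python sketch. It is not the test RATS performs internally: the function `converged` and the relative-change formulas in it are assumptions chosen for illustration. It shows the boundary case described above, where a coefficient moves by thousands but the function value barely changes.

```python
def converged(old_coeffs, new_coeffs, old_value, new_value,
              cvcrit=1e-5, criterion="coefficients"):
    """Hypothetical convergence test; the formulas are assumptions."""
    if criterion == "value":
        # look only at the change in the function value
        return abs(new_value - old_value) <= cvcrit * max(1.0, abs(old_value))
    # otherwise require every coefficient to have settled down
    return all(abs(b - a) <= cvcrit * max(1.0, abs(a))
               for a, b in zip(old_coeffs, new_coeffs))


old, new = [1.0, 2000.0], [1.0, 3000.0]   # second coefficient is drifting off
by_value = converged(old, new, 10.0, 10.0000001, criterion="value")
by_coeff = converged(old, new, 10.0, 10.0000001, criterion="coefficients")
print(by_value, by_coeff)
```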
Some other options control the sub-iteration process used by NLLS and most
other non-linear estimation instructions. As we describe on page 66, NLLS
does not always take a full Gauss-Newton step, since a full step might actually
increase the sum of squares. Instead, it moves in the same direction as the G-N
step, often taking a full step, but sometimes a shorter one. It's looking for a
point where the new sum of squares function is lower than the current one, and
where certain other criteria are met. The process of choosing how far to go in
the selected direction is known as sub-iteration. Gauss-Newton is usually well
enough behaved that adjustments to this aren't necessary; there are other
types of optimization algorithms for which this isn't true. The main option in
this category is EXACTLINESEARCH. By default, the subiteration process is generally
to start with a full step; if that doesn't work (sum of squares increases),
take a half-step, test that, take a quarter step, etc. until a point is found at
which the sum of squares is better than it was. With EXACTLINESEARCH, instead
of this simple process of searching for a place where the sum of squares
is better, it searches along the direction for the place where it's best. That
sounds like a good idea, but in fact, is usually just a waste of calculations:
since it's not finding the optimum of the function itself (just the optimum in
one direction from a certain point), the added time required rarely pays off. It
can sometimes be helpful in big models with dozens of parameters, but almost
never makes sense with small ones.
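
The default sub-iteration scheme (full step, then half, then quarter, and so on) can be sketched in a few lines. This Python example is an illustration under stated assumptions, not the RATS algorithm: it fits the one-parameter model y = exp(b*x) to made-up noise-free data, checks only that the sum of squares does not increase (RATS applies additional criteria), and the names `rss` and `gauss_newton` are invented for the example.

```python
import math

def rss(b, xs, ys):
    """Sum of squared residuals for the model y = exp(b*x)."""
    return sum((y - math.exp(b * x)) ** 2 for x, y in zip(xs, ys))

def gauss_newton(b, xs, ys, iters=50, cvcrit=1e-10):
    for _ in range(iters):
        resid = [y - math.exp(b * x) for x, y in zip(xs, ys)]
        grad = [x * math.exp(b * x) for x in xs]   # derivative of fit wrt b
        # full Gauss-Newton step: regress residuals on the gradient
        step = sum(g * r for g, r in zip(grad, resid)) / sum(g * g for g in grad)
        lam, current = 1.0, rss(b, xs, ys)
        # sub-iteration: shrink the step until the sum of squares stops rising
        while rss(b + lam * step, xs, ys) > current and lam > 1e-10:
            lam *= 0.5
        b_new = b + lam * step
        if abs(b_new - b) <= cvcrit * max(1.0, abs(b)):
            return b_new
        b = b_new
    return b

xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [math.exp(0.7 * x) for x in xs]   # generated with b = 0.7
b_hat = gauss_newton(0.0, xs, ys)
print(b_hat)
```

Starting from b = 0, the full first step overshoots badly and raises the sum of squares, so the half step is taken instead; after that, full steps are accepted.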
One option which was added with RATS version 8.1 is the DERIVES option:

DERIVES=[FIRST]/SECOND/FOURTH

NLLS and most other FRML-based estimation instructions use numerical methods
when they need derivatives. While some functions have analytical derivatives,
many don't, or the analytical derivatives are too complicated to be calculated
feasibly. 21 An approximation to the derivatives of $\varepsilon_t$ with respect to $\theta$ at
$\theta_0$ requires that we compute $\varepsilon_t$ at $\theta_0$ and at nearby points. These are often quite
accurate, but sometimes might not be. The proper amount by which to perturb
$\theta$ isn't known, and smaller isn't necessarily better: too small a change
and the calculation might run into the loss of precision problem from Section
3.10.1. The default is DERIVES=FIRST, which does the simple arc-derivative
calculation:
$$f'(\theta) \approx \frac{f(\theta + h) - f(\theta)}{h}$$
The h is chosen differently for each parameter based upon the information
known about it at the time. This requires one extra full function evaluation
per parameter to get the partial derivatives. With DERIVES=SECOND, the calculation
is done using
$$f'(\theta) \approx \frac{f(\theta + h) - f(\theta - h)}{2h}$$
This requires two extra function evaluations per parameter, thus doubling the
time required for computing the derivatives. However, it's more accurate as it
eliminates a second order term. It thus can be done with a slightly larger
value of h, which makes it less likely that precision issues will come up.
DERIVES=FOURTH is a four-term approximation which is still more accurate at
the cost of four times the calculation versus the simple numerical derivatives.
Again, in most cases, changing this won't help, but it's available for problems
where the accuracy of the derivatives appears to be an issue.
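
The two formulas are easy to compare on a function whose derivative is known. The Python sketch below applies both to exp at 1.0; the function names and the step sizes are assumptions for the example and say nothing about how RATS chooses h.

```python
import math

def forward_diff(f, theta, h):
    """One-sided (arc) derivative: one extra function evaluation."""
    return (f(theta + h) - f(theta)) / h

def central_diff(f, theta, h):
    """Two-sided derivative: two extra evaluations, second order term cancels."""
    return (f(theta + h) - f(theta - h)) / (2.0 * h)

exact = math.exp(1.0)   # the derivative of exp at 1.0 is exp(1.0)
h = 1e-3
err_forward = abs(forward_diff(math.exp, 1.0, h) - exact)
err_central = abs(central_diff(math.exp, 1.0, h) - exact)
print(err_forward, err_central)

# With a very small h, rounding error in the subtraction typically dominates
print(abs(forward_diff(math.exp, 1.0, 1e-13) - exact))
```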
21
For instance, the derivatives of $\varepsilon_t$ in the bilinear model are a function of the derivatives at
all preceding time periods.

3.10.3 The instruction SEED

The purpose of seeding the random number generator is to ensure that you generate
the same data each time you run a simulation. After all, computers are not
capable of generating truly random numbers; any sequence generated is actually
a deterministic sequence. If you are aware of the algorithm used to generate
the sequence, all values of the sequence can be calculated by the outside
observer. What computers generate are known as pseudo-random numbers,
in the sense that the sequence is deterministic if you know the algorithm but
is otherwise indistinguishable from one obtained from independent draws
from a prespecified probability distribution.
In this manual, we use SEED in all the instructions which generate data so
you will get exactly the same data that we do. RATS uses a portable random
number generator which is designed to generate identical sequences on any
computer given the seed. If you don't use SEED, the seed for the random number
generator is initialized using the date and time when the program is executed,
so it will be different each time and thus will generate a completely different
set of data.
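
The effect of seeding is the same in any language with a pseudo-random generator. As an illustration only (this is Python's generator, not the one in RATS, so the numbers will not match RATS output), the sketch below draws Normal deviates twice with the same seed and once with a different seed. The helper name `simulate` is made up for the example.

```python
import random

def simulate(seed, n=5):
    rng = random.Random(seed)   # private generator with an explicit seed
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

first = simulate(2003)
second = simulate(2003)    # same seed: identical draws
other = simulate(2004)     # different seed: a different sequence

print(first == second, first == other)
```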

Example 3.1 Simple nonlinear regressions


These are the simple non-linear least squares regressions from Sections 3.2 and
3.3.

cal(q) 1960:1
all 2012:4
open data quarterly(2012).xls
data(org=obs,format=xls)
*
set pi = 100.0*log(ppi/ppi{1})
set y = .001*rgdp
*
* Power function with GDP in inflation equation
*
nonlin b0 b1 b2 gamma
frml pif pi = b0+b1*pi{1}+b2*y{1}^gamma
linreg pi
# constant pi{1} y{1}
compute b0=%beta(1),b1=%beta(2),b2=%beta(3),gamma=1.0
nlls(frml=pif) pi
*
* Power function with interest rates
*
nonlin a0 a1 a2 delta
linreg tb1yr
# constant tb1yr{1} tb3mo{1}
frml ratef tb1yr = a0+a1*tb1yr{1}+a2*(tb3mo{1})^delta
compute a0=%beta(1),a1=%beta(2),a2=%beta(3),delta=1.0
nlls(frml=ratef) tb1yr
*
test(title="Test of linearity")
# 4
# 1.0
*
set testsr 1 100 = .1*t
set lreffect 1 100 = 0.0
set lower 1 100 = 0.0
set upper 1 100 = 0.0
*
do t=1,100
summarize(noprint) $
%beta(3)*%beta(4)*testsr(t)^(%beta(4)-1)/(1-%beta(2))
compute lreffect(t)=%sumlc
compute lower(t)=%sumlc-2.00*sqrt(%varlc)
compute upper(t)=%sumlc+2.00*sqrt(%varlc)
end do t
*
scatter(smpl=testsr>=2.0.and.testsr<=6.0,style=lines,vgrid=1.0,$
footer="Long-run effect using non-linear regression") 3
# testsr lreffect
# testsr lower / 2
# testsr upper / 2

linreg tb1yr
# constant tb1yr{1} tb3mo{1}
summarize(title="Long-run effect using linear regression") $
%beta(3)/(1-%beta(2))

Example 3.2 Sample STAR Transition Functions


These generate and graph the sample transition functions described in Section
3.5.

set y 1 201 = (t-100)/201.


compute c=0.0,gamma=10.0
set lstar = ((1 + exp(-gamma*(y-c))))^-1
set estar = 1 - exp(-gamma*(y-c)^2)
spgraph(footer="Shapes of STAR Transitions",vfields=1,hfields=2)
scatter(header="LSTAR Model",style=line,vlabels="Theta")
# y lstar
scatter(header="ESTAR Model",style=line,vlabels="Theta")
# y estar
spgraph(done)

Example 3.3 STAR Model with Generated Data


This estimates the LSTAR model with generated data from Section 3.6.

all 250
seed 2003
set eps 1 350 = %ran(1)
set(first=1.0) x 1 350 = $
1.0+.9*x{1}+(-3.0-1.7*x{1})*%logistic(10.0*(x{1}-5.0),1.0)+eps
*
* Shift final 250 observations down
*
set y 1 250 = x(t+100)
*
graph(footer="The Simulated LSTAR Process")
# y
*
nonlin a0 a1 b0 b1 gamma c
frml lstar y = (a0+a1*y{1})+$
(b0+b1*y{1})*%logistic(gamma*(y{1}-c),1.0)
*
* Guess values based upon linear regression
*
linreg y
# constant y{1}
compute a0=%beta(1),a1=%beta(2),b0=0.0,b1=0.0
compute c=0.0,gamma=5.0
*
nlls(frml=lstar) y 2 250
*
* Guess values for C and GAMMA from sample statistics with regressions
* from NLLS with those fixed.
*
stats y
compute c=%mean,gamma=1.0/sqrt(%variance)
nonlin a0 a1 b0 b1
nlls(frml=lstar) y 2 250
nonlin a0 a1 b0 b1 gamma c
nlls(frml=lstar) y 2 250
*
* Guess value for C from grid search with fixed value of GAMMA
*
stats(fractiles) y
compute gamma=2.0/sqrt(%variance)
compute ygrid=%seqa(%fract05,(%fract95-%fract05)/19,20)
nonlin a0 a1 b0 b1
compute bestrss=%na
dofor c = ygrid
nlls(noprint,frml=lstar) y 2 250
if .not.%valid(bestrss).or.%rss<bestrss
compute bestrss=%rss,bestc=c
end dofor c
*
disp "Guess Value used" bestc


*
compute c=bestc
nonlin a0 a1 b0 b1 gamma c
nlls(frml=lstar) y 2 250
*
* Guess value for C from grid search with GAMMA estimated separately
*
stats(fractiles) y
compute gamma0=2.0/sqrt(%variance)
compute ygrid=%seqa(%fract05,(%fract95-%fract05)/19,20)
nonlin a0 a1 b0 b1 gamma
compute bestrss=%na
dofor c = ygrid
compute gamma=gamma0
nlls(noprint,frml=lstar) y 2 250
if .not.%valid(bestrss).or.%rss<bestrss
compute bestrss=%rss,bestc=c,bestgamma=gamma
end dofor c
*
disp "Guess values used" bestc "and" bestgamma
*
compute c=bestc,gamma=bestgamma
nonlin a0 a1 b0 b1 gamma c
nlls(frml=lstar) y 2 250
*
* Guess values for C and GAMMA from bivariate grid search
*
stats(fractiles) y
compute ygrid=%seqa(%fract05,(%fract95-%fract05)/19,20)
compute ggrid=%exp(%seqa(log(.25),.1*log(100),11))/(%fract75-%fract25)
nonlin a0 a1 b0 b1
compute bestrss=%na
dofor c = ygrid
dofor gamma = ggrid
nlls(noprint,frml=lstar) y 2 250
if .not.%valid(bestrss).or.%rss<bestrss
compute bestrss=%rss,bestc=c,bestgamma=gamma
end dofor gamma
end dofor c
*
disp "Guess values used" bestc "and" bestgamma
compute c=bestc,gamma=bestgamma
nonlin a0 a1 b0 b1 gamma c
nlls(frml=lstar) y 2 250

Example 3.4 Smooth Transition Break


This is an example of an AR model with a smooth transition break in the mean
from Section 3.7.

cal(q) 1960:1
all 2012:4
open data quarterly(2012).xls
data(org=obs,format=xls)
*
set pi = 400.0*log(ppi/ppi{1})
graph(footer="Annualized Inflation Rate (Measured by PPI)",$
grid=(t==1983:1))
# pi
*
@bjident pi 1983:1 *
*
linreg pi 1983:1 *
# constant pi{1 to 4}
*
@regcorrs(qstats,footer="Residuals from AR(4) Model")
*
@startest(p=4,d=1) pi 1983:1 *
@startest(p=4,d=2) pi 1983:1 *
@startest(p=4,d=3) pi 1983:1 *
*
linreg pi 1983:1 *
# constant pi{1 to 4}
frml(lastreg,vector=b1) phi1f
frml(lastreg,vector=b2) phi2f
*
nonlin(parmset=starparms) gamma c
nonlin(parmset=regparms) b1 b2
*
frml glstar = %logistic(gamma*(pi{2}-c),1.0)
frml star pi = g=glstar,phi1f+g*phi2f
*
stats pi 1983:1 *
compute c=%mean,gamma=1.0/sqrt(%variance)
*
nlls(parmset=regparms,frml=star,noprint) pi 1983:1 *
nlls(parmset=regparms+starparms,frml=star,print) pi 1983:1 *
*
stats(fractiles) pi 1983:1 *
*
nonlin(parmset=gammaonly) gamma
*
compute bestrss=%na
dofor c = %seqa(%fract10,(%fract90-%fract10)/19,20)
compute gamma=1.0/sqrt(%variance)
nlls(parmset=regparms,frml=star,noprint) pi 1983:1 *
nlls(parmset=regparms+gammaonly,frml=star,noprint) pi 1983:1 *
if .not.%valid(bestrss).or.%rss<bestrss
compute bestrss=%rss,bestc=c,bestgamma=gamma
end dofor
*
disp "Grid choices" bestc bestgamma
*
compute c=bestc,gamma=bestgamma
nlls(parmset=regparms,frml=star,noprint) pi 1983:1 *
nlls(parmset=regparms+starparms,frml=star,print) pi 1983:1 *
*
* Compare residuals
*
set nllsresids = %resids
linreg pi 1983:1 *
# constant pi{1 to 4}
set olsresids = %resids
graph(footer="Comparison of STAR and OLS Residuals",$
key=upleft,klabels=||"STAR","OLS"||) 2
# nllsresids
# olsresids
*
* Test just with the data through 2007:4
*
@startest(p=4,d=2) pi 1983:1 2007:4

Example 3.5 LSTAR Model for Inflation


This attempts to fit an LSTAR model to the U.S. inflation rate. This is described
in detail in section 3.8.

cal(q) 1960:1
all 2012:4
open data quarterly(2012).xls
data(org=obs,format=xls)
*
set pi = 400.0*log(ppi/ppi{1})
graph(footer="Annualized Inflation Rate (Measured by PPI)",$
grid=(t==1983:1))
# pi
*
@bjident pi 1983:1 *
*
linreg pi 1983:1 *
# constant pi{1 to 4}
*
@regcorrs(qstats,footer="Residuals from AR(4) Model")
*
@startest(p=4,d=1) pi 1983:1 *
@startest(p=4,d=2) pi 1983:1 *
@startest(p=4,d=3) pi 1983:1 *
*
linreg pi 1983:1 *
# constant pi{1 to 4}
frml(lastreg,vector=b1) phi1f
frml(lastreg,vector=b2) phi2f
*
nonlin(parmset=starparms) gamma c
nonlin(parmset=regparms) b1 b2
*
frml glstar = %logistic(gamma*(pi{2}-c),1.0)
frml star pi = g=glstar,phi1f+g*phi2f
*
stats pi 1983:1 *
compute c=%mean,gamma=1.0/sqrt(%variance)
*
nlls(parmset=regparms,frml=star,noprint) pi 1983:1 *
nlls(parmset=regparms+starparms,frml=star,print) pi 1983:1 *
*
stats(fractiles) pi 1983:1 *
*
nonlin(parmset=gammaonly) gamma
*
compute bestrss=%na
dofor c = %seqa(%fract10,(%fract90-%fract10)/19,20)
compute gamma=1.0/sqrt(%variance)
nlls(parmset=regparms,frml=star,noprint) pi 1983:1 *
nlls(parmset=regparms+gammaonly,frml=star,noprint) pi 1983:1 *
if .not.%valid(bestrss).or.%rss<bestrss
compute bestrss=%rss,bestc=c,bestgamma=gamma
end dofor
*
disp "Grid choices" bestc bestgamma
*
compute c=bestc,gamma=bestgamma
nlls(parmset=regparms,frml=star,noprint) pi 1983:1 *
nlls(parmset=regparms+starparms,frml=star,print) pi 1983:1 *
*
* Compare residuals
*
set nllsresids = %resids
linreg pi 1983:1 *
# constant pi{1 to 4}
set olsresids = %resids
graph(footer="Comparison of STAR and OLS Residuals",$
key=upleft,klabels=||"STAR","OLS"||) 2
# nllsresids
# olsresids
*
* Test just with the data through 2007:4
*
@startest(p=4,d=2) pi 1983:1 2007:4
*
set threshvar = pi{1}+pi{2}
linreg pi 1983:1 *
# constant pi{1 to 4}
@regstrtest(threshold=threshvar) 1983:1 *
frml glstar = %logistic(gamma*(threshvar-c),1.0)
*
stats threshvar 1983:1 *
compute c=%mean,gamma=1.0/sqrt(%variance)
*
nlls(parmset=regparms,frml=star,noprint) pi 1983:1 *
nlls(parmset=regparms+starparms,frml=star,print) pi 1983:1 *

Example 3.6 Bilinear Model


This is an example of a bilinear model from Section 3.9.

cal(q) 1960:1
all 2012:4
open data quarterly(2012).xls
data(org=obs,format=xls)
*
set pi = 400.0*log(ppi/ppi{1})
*
dec series eps
clear(zeros) eps
*
nonlin c0 alpha beta gamma
frml bilinear pi = z=c0+alpha*pi{1}+beta*eps{1}+gamma*pi{1}*eps{1},$
eps=pi-z,z
*
* Estimate ARMA model
*
nonlin c0 alpha beta gamma=0.0
nlls(frml=bilinear) pi 1983:1 *
*
* Estimate bilinear model
*
nonlin c0 alpha beta gamma
nlls(frml=bilinear) pi 1983:1 *
*
* Estimate with truncated sample
*
nlls(frml=bilinear) pi 1983:1 2008:2
Chapter 4

Maximum Likelihood Estimation

Suppose you wanted to estimate parameters ($\beta$ and $\sigma^2$) in the process:

$$y_t = X_t\beta + \varepsilon_t; \quad \varepsilon_t \sim N(0, \sigma^2) \qquad (4.1)$$
The obvious way to do this is least squares, which can be done using the LINREG
instruction. An alternative approach to parameter estimation is maximum
likelihood. The following derivation can be found in any elementary econometrics
text: the likelihood for entry t is:
$$\left(2\pi\sigma^2\right)^{-1/2} \exp\left(-\frac{1}{2\sigma^2}(y_t - X_t\beta)^2\right)$$
If the entries are independent, then the log likelihood for the full sample is:
$$\sum_t \left[-\frac{1}{2}\log(2\pi) - \frac{1}{2}\log\sigma^2 - \frac{1}{2\sigma^2}(y_t - X_t\beta)^2\right] = -\frac{T}{2}\log(2\pi) - \frac{T}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_t (y_t - X_t\beta)^2$$
By inspection, the maximizer of this for $\beta$ is the minimizer of the sum of
squares:
$$\sum_t (y_t - X_t\beta)^2$$
regardless of the value of $\sigma^2$. The first order condition for maximization over $\sigma^2$
is
$$-\frac{T}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_t (y_t - X_t\beta)^2 = 0 \;\Rightarrow\; \hat{\sigma}^2 = \frac{1}{T}\sum_t (y_t - X_t\hat{\beta})^2$$

Thus, for model (4.1), maximum likelihood and least squares give identical
estimates. This is specific to the assumption that the residuals are Normal.
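
The equivalence can be checked numerically. The Python sketch below is an illustration with made-up data and a single regressor without intercept (both assumptions for the example): it computes the least squares coefficient and the variance estimate from the first order condition, and confirms that moving either one away from those values lowers the Normal log likelihood.

```python
import math

def normal_loglik(beta, sigma2, xs, ys):
    """Full-sample Normal log likelihood for y_t = x_t*beta + eps_t."""
    T = len(ys)
    ssr = sum((y - x * beta) ** 2 for x, y in zip(xs, ys))
    return (-0.5 * T * math.log(2 * math.pi)
            - 0.5 * T * math.log(sigma2)
            - ssr / (2 * sigma2))

xs = [1.0, 2.0, 3.0, 4.0, 5.0]      # toy data for illustration
ys = [1.1, 1.9, 3.2, 3.8, 5.1]

# Least squares coefficient and the ML variance estimate (divide by T)
beta_ls = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
sigma2_ml = sum((y - x * beta_ls) ** 2 for x, y in zip(xs, ys)) / len(ys)

best = normal_loglik(beta_ls, sigma2_ml, xs, ys)
for db in (-0.01, 0.01):
    for scale in (0.9, 1.1):
        # every perturbed combination has a lower log likelihood
        print(normal_loglik(beta_ls + db, sigma2_ml * scale, xs, ys) < best)
```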
Suppose, instead, that the residuals were assumed to be t with $\nu$ degrees of
freedom to allow for heavier tails in the error process. The likelihood for $y_t$ is
now:
$$K\left(\sigma^2(\nu-2)/\nu\right)^{-1/2}\left(1 + (y_t - X_t\beta)^2/\left(\sigma^2(\nu-2)\right)\right)^{-(\nu+1)/2} \qquad (4.2)$$
where the (rather complicated) integrating constant K depends upon $\nu$. This
is parameterized with $\sigma^2$ as the variance of the process. The only term in the
full sample log likelihood which depends upon $\beta$ is:
$$-\frac{(\nu+1)}{2}\sum_t \log\left(1 + (y_t - X_t\beta)^2/\left(\sigma^2(\nu-2)\right)\right)$$



Unlike the Normal, you can't optimize $\beta$ separately from $\sigma^2$, and there are no
sufficient statistics for the mean and variance, so the log likelihood for a $\{\beta, \sigma^2\}$
combination can only be computed by summing over the full data set.
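
A log likelihood of this kind is computed entry by entry and then summed. The Python sketch below writes out the log of (4.2) for one residual. Note the assumption: the text does not give K, so the code uses the standard Student-t normalizing constant, $K = \Gamma((\nu+1)/2) / (\Gamma(\nu/2)\sqrt{\nu\pi})$; the function names are also invented for the example.

```python
import math

def t_loglik_element(u, sigma2, nu):
    """Log of (4.2) for residual u = y_t - X_t*beta; requires nu > 2.

    K is assumed to be the usual Student-t integrating constant.
    """
    log_k = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi))
    return (log_k
            - 0.5 * math.log(sigma2 * (nu - 2) / nu)
            - 0.5 * (nu + 1) * math.log(1 + u * u / (sigma2 * (nu - 2))))

def t_loglik(beta, sigma2, nu, xs, ys):
    """Full-sample log likelihood: a plain sum over the data set."""
    return sum(t_loglik_element(y - x * beta, sigma2, nu)
               for x, y in zip(xs, ys))

# With very many degrees of freedom the t is close to the Normal
normal_value = -0.5 * math.log(2 * math.pi) - 0.5 * 0.5 ** 2
print(t_loglik_element(0.5, 1.0, 1e6), normal_value)
```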
There are many other types of models for which the log likelihood can't be
simplified beyond the sum of the log likelihood elements, even if the error process
is basically Normal: to simplify to least squares or weighted least
squares, the variance has to be constant or at least a function that depends
upon exogenous variables but not free parameters.
There are some specialized instructions for estimating particular models by
maximum likelihood, such as DDV for probit, LDV for tobit and related models,
BOXJENK for ARIMA and related models, GARCH for specific types of ARCH and
GARCH models. However, the instruction which is most flexible and thus most
commonly used is MAXIMIZE.

4.1 The MAXIMIZE instruction


MAXIMIZE requires that you define a FRML which evaluates at entry t the function
$f_t(y_t, x_t, \theta)$ where the log likelihood takes the form
$$F(\theta) \equiv \log L(Y|X,\theta) = \sum_{t=1}^{T} f_t(y_t, x_t, \theta) \qquad (4.3)$$

where $x_t$ and $y_t$ can be vectors (and $x_t$ can represent a lagged value of $y_t$).

Now MAXIMIZE doesn't know whether the FRML that you provided is, in fact,
the log likelihood element for t. It will try to maximize (4.3) no matter what you
are actually calculating. And, in most cases, it will give the proper optimized
values for $\theta$. F being the actual log likelihood matters to the interpretation
of the output: what are displayed as the standard errors and t-statistics
are that only if F is the log likelihood, or at least the log likelihood up to an
additive constant which doesn't depend upon $\theta$. However, the ROBUSTERRORS
option can be used to compute an asymptotically valid covariance matrix if F
isn't the true log likelihood.

MAXIMIZE(options) frml start end

where

frml A previously defined formula


start end The range of the series to use in the estimation

The key options for our purposes are:



METHOD=BHHH/[BFGS]/SIMPLEX/GENETIC/EVALUATE
PMETHOD=BHHH/BFGS/[SIMPLEX]/GENETIC
ITERATIONS=limit on iterations
PITERATIONS=limit on preliminary iterations
CVCRIT=convergence limit [0.00001]
ROBUSTERRORS/[NOROBUSTERRORS]

The set-up of a maximum likelihood estimation is very similar to that of NLLS.


The essential difference is that the FRML instruction defines the log likelihood
that you want to maximize. Otherwise the steps to perform maximum
likelihood estimation are just like those of nonlinear least squares:

1. Define the parameters to be estimated using the NONLIN instruction.


2. Define the log likelihood for observation t using a FRML instruction.
3. Set the initial values of the parameters using the COMPUTE command.
4. Use the MAXIMIZE instruction to maximize the sum across time of the
formula in Step 2.
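A minimal sketch of the four steps, assuming two hypothetical series Y and X and a Normal likelihood. The names B0, B1, SIGSQ and LOGL are illustrative and are not taken from the examples in this chapter, and the MAXIMIZE here relies on the default estimation range:

```rats
* Step 1: parameters to be estimated (illustrative names)
nonlin b0 b1 sigsq
* Step 2: log likelihood at entry t; %LOGDENSITY takes the variance first, then the residual
frml logl = %logdensity(sigsq,y-b0-b1*x)
* Step 3: guess values (SIGSQ has to start at a positive value)
compute b0=0.0,b1=0.0,sigsq=1.0
* Step 4: maximize the sum of LOGL across entries
maximize logl
```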

How does MAXIMIZE solve the optimization of (4.3)? At a minimum, F must be
a continuous function of $\theta$. The SIMPLEX and GENETIC choices for the METHOD
(and PMETHOD) options require nothing more than that.
Of course, the same types of issues that arise with NLLS (page 63) can occur with
MAXIMIZE. Be sure to use good initial guesses and ensure that you are finding
the global maximum.
The two main algorithms used for optimization are BFGS and SIMPLEX. These
are described in greater detail in this chapter's Tips and Tricks (Section 4.4).
These are often used together, with SIMPLEX used for preliminary iterations
(PMETHOD=SIMPLEX, PITERS=number of preliminary iterations) and BFGS (the
default choice for METHOD) used to actually estimate the model. SIMPLEX is
slower, and, because it doesn't assume differentiability, can't provide estimates
of the standard errors, and thus is rarely used as the main estimation method.
However, it is much less sensitive to bad guess values, and so is handy for
getting the estimation process started.
If you have a program written for an older version of RATS, you might
see two otherwise identical MAXIMIZE instructions at some point, one with
METHOD=SIMPLEX, the second with METHOD=BFGS. Since version 6, MAXIMIZE
has had the PMETHOD option to allow the single instruction to use the two meth-
ods sequentially, so you should take advantage of that. However, note that
many optimization problems don't need preliminary simplex iterations.
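As a sketch of that change, with LOGL, the range 2 * and the count of 5 simplex iterations all used as placeholders rather than taken from a particular program:

```rats
* Older style: two otherwise identical instructions run in sequence
maximize(method=simplex,iters=5) logl 2 *
maximize(method=bfgs) logl 2 *
* Single-instruction replacement: preliminary simplex iterations, then BFGS
maximize(pmethod=simplex,piters=5,method=bfgs) logl 2 *
```

The single instruction keeps the whole estimation in one place, so a change to the formula name or the range only has to be made once.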
As a simple example (Example 4.1), we'll do the same power function on the
interest rates as in Section 3.3, but we'll allow for t distributed rather than
Normal errors. It's helpful to start out with the Normal model, which can be
estimated with NLLS as before:

nonlin a0 a1 a2 delta
linreg tb1yr
# constant tb1yr{1} tb3mo{1}
frml ratef tb1yr = a0+a1*tb1yr{1}+a2*(tb3mo{1})^delta
compute a0=%beta(1),a1=%beta(2),a2=%beta(3),delta=1.0
nlls(frml=ratef) tb1yr 2 *

How do we extend this to allow for t errors? First, as we saw above, we can't
simply concentrate out the variance. So we need to add two parameters, $\sigma^2$
and $\nu$. For convenience, we'll separate the original and new parameters into
two PARMSETs.

nonlin(parmset=baseparms) a0 a1 a2 delta
nonlin(parmset=tparms) sigsq nu

We can take the guess value for SIGSQ from the non-linear least squares and
start with NU at a fairly high value, so that we will be approximately at the
same location as we would have with Normal residuals:
compute sigsq=%seesq,nu=20.0

Now, we need to define a FRML which evaluates the log likelihood for the t
rather than simply the right-hand-side of an equation. Although one could
write out the density for the t as in (4.2), it's simpler (and faster) to use the
existing %LOGTDENSITY function. This takes (in order) the variance, residual
and degrees of freedom as its arguments. Since we already have the existing
RATEF formula to evaluate the right-side expression, this is fairly simple:

frml logl = %logtdensity(sigsq,tb1yr-ratef,nu)
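For reference, a t log density written in terms of its variance (rather than a scale parameter) has the form below. This is a sketch that assumes the variance parameterization described above, with $u$ the residual, $\sigma^2$ the variance and $\nu > 2$ the degrees of freedom; equation (4.2) may arrange the terms differently.

```latex
\log f(u \mid \sigma^2, \nu) =
  \log\Gamma\!\left(\frac{\nu+1}{2}\right) - \log\Gamma\!\left(\frac{\nu}{2}\right)
  - \frac{1}{2}\log\!\left(\pi(\nu-2)\sigma^2\right)
  - \frac{\nu+1}{2}\log\!\left(1 + \frac{u^2}{(\nu-2)\sigma^2}\right)
```

The first three terms are the integrating constants; only the last depends upon the residual.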

Note that all RATS built-in density and log density functions like
%LOGTDENSITY and %LOGDENSITY include all integrating constants, not just
the ones that depend upon the parameters. What RATS reports as the log like-
lihood is indeed the full log likelihood given the model and parameters. If you
try to replicate published results, you may find that the reported log likelihoods
are quite different than you get from RATS even if the coefficients are effectively
identical. If that's the case, it's usually because the authors of the original work
left out some constants that had no effect on the results otherwise.
The optimization can be done with:

maximize(parmset=baseparms+tparms,iters=300) logl 2 *

We ended up increasing the number of iterations, as the model hadn't
converged in 100. The original output (with the default 100 iterations) was

MAXIMIZE - Estimation by BFGS


NO CONVERGENCE IN 100 ITERATIONS
LAST CRITERION WAS 0.0183646
Quarterly Data From 1960:02 To 2012:04
Usable Observations 211
Function Value -211.5020

Variable Coeff Std Error T-Stat Signif


************************************************************************************
1. A0 -0.387364457 2.750515178 -0.14083 0.88800154
2. A1 0.986460486 0.028992266 34.02495 0.00000000
3. A2 0.440294321 2.735910770 0.16093 0.87214732
4. DELTA 0.067177518 0.545349075 0.12318 0.90196250
5. SIGSQ 0.983706526 0.801170498 1.22784 0.21950828
6. NU 2.451540091 0.514078793 4.76880 0.00000185

A common error that users make is to ignore those warnings about lack of
convergence. They're in all upper case for a reason. This is an easy one to fix by
increasing the iteration count.
One thing to note about MAXIMIZE and the BFGS algorithm is that if the model
fails to converge in 100 iterations, and you simply select the MAXIMIZE again
and re-execute, you will get roughly the same parameter estimates as if you
allowed for 200 (or more) in the first go, but not the same standard errors. The
BFGS estimate of the (inverse) Hessian is dependent upon the path taken to
reach the optimum. If you re-execute the MAXIMIZE, the BFGS Hessian is re-
initialized as a diagonal matrix. If the optimization was interrupted by the
iteration limit when nearly converged, the information to update the Hessian
is rather weak; the changes in both the gradient and the parameter vectors are
small. In this case, doing 100 iterations, then another 100 gives:
MAXIMIZE - Estimation by BFGS
Convergence in 11 Iterations. Final criterion was 0.0000095 <= 0.0000100
Quarterly Data From 1960:02 To 2012:04
Usable Observations 211
Function Value -211.5020

Variable Coeff Std Error T-Stat Signif


************************************************************************************
1. A0 -0.387252416 0.050604638 -7.65251 0.00000000
2. A1 0.986378426 0.025989119 37.95351 0.00000000
3. A2 0.440550069 0.061171096 7.20193 0.00000000
4. DELTA 0.067414052 0.100235707 0.67256 0.50123027
5. SIGSQ 0.984391447 0.275353108 3.57501 0.00035021
6. NU 2.450853777 0.179995863 13.61617 0.00000000

If we do the same estimation starting with enough iterations to converge we
get:

MAXIMIZE - Estimation by BFGS


Convergence in 130 Iterations. Final criterion was 0.0000000 <= 0.0000100
Quarterly Data From 1960:02 To 2012:04
Usable Observations 211
Function Value -211.5019

Variable Coeff Std Error T-Stat Signif


************************************************************************************
1. A0 -0.433719170 1.787260243 -0.24267 0.80825900
2. A1 0.986521104 0.031013227 31.80969 0.00000000
3. A2 0.486938369 1.756506306 0.27722 0.78161132
4. DELTA 0.060438285 0.279017340 0.21661 0.82851134
5. SIGSQ 0.984881664 0.785715851 1.25348 0.21002989
6. NU 2.450541777 0.503477442 4.86723 0.00000113

The estimation with the two sequential MAXIMIZEs wasn't really even able
to move off the results from the first successfully because the initial diagonal
Hessian has the shape of the function completely wrong. Note that even though
the coefficients appear to be quite different in the last two outputs, the function
value itself, what we're trying to maximize, is almost identical, so the likelihood
surface is very flat. That's reflected in the high standard errors in the second
estimator which had enough iterations.
Of the various algorithms used by RATS for non-linear estimation, BFGS is the
only one with the property that the covariance matrix is dependent upon the
path used by the optimization algorithm. Since it is heavily used, not just
by RATS, but by other software, this is a characteristic of the algorithm about
which you need to be careful.
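The path dependence can be seen in the textbook form of the BFGS update of the inverse Hessian estimate $H_k$. This is a sketch of the standard formula for maximizing $F$, with $g_k$ the gradient of $F$ at $\theta_k$; the symbols are ours and the details of the RATS implementation may differ.

```latex
\begin{align*}
s_k &= \theta_{k+1} - \theta_k, \qquad y_k = g_k - g_{k+1}, \qquad \rho_k = \frac{1}{y_k' s_k} \\
H_{k+1} &= \left(I - \rho_k s_k y_k'\right) H_k \left(I - \rho_k y_k s_k'\right) + \rho_k s_k s_k'
\end{align*}
```

Each update changes $H_k$ only in the direction of the latest step, so the final matrix depends upon the starting $H_0$ and on every step taken. When $s_k$ and $y_k$ are both tiny, as they are near convergence, a freshly initialized diagonal $H_0$ gets very little correction.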
MAXIMIZE defines both %LOGL and %FUNCVAL as the value of the function at
the final set of parameters. Assuming that you've set up the FRML to include
all integrating constants (as mentioned above, %LOGTDENSITY does this), and
you've estimated the model over the same range, then the log likelihood for
MAXIMIZE and a simpler NLLS are comparable. If we want to compare the results
from the t with the Normal, we need to be careful about the parameter
count. While it would appear that the t has six free parameters, and the Normal
has only four, that's not including the variance in the latter. The variance
is estimated by NLLS, but in a second step given the other parameters rather
than being included directly. So the t model adds just 1. The Normal is a special
case of the t with $\nu = \infty$. If we ignore the fact that the restriction is on the
boundary[1] we can get a likelihood ratio statistic for the Normal vs the t with
$2(-211.5019 - (-238.1723)) = 53.3408$, which is clearly way out in the tails for one
degree of freedom.
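A sketch of how that statistic could be computed rather than typed in by hand. LOGLNORM is an illustrative name, and using the PARMSET option on NLLS to tie it to the base parameters is an assumption about how one might arrange the program, not a step taken from Example 4.1:

```rats
* Normal model by NLLS over the same range; save its log likelihood
nlls(parmset=baseparms,frml=ratef) tb1yr 2 *
compute loglnorm=%logl
* Guess values for the added parameters, then the t model
compute sigsq=%seesq,nu=20.0
maximize(parmset=baseparms+tparms,iters=300) logl 2 *
* LR statistic: twice the gain in log likelihood, one degree of freedom
cdf(title="LR Test of Normal vs t") chisqr 2*(%logl-loglnorm) 1
```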
We can test the linearity hypothesis $\delta = 1$ with a Wald test either by

test(title="Test for linearity")
# 4
# 1.0

[1] which violates one of the assumptions governing the most straightforward proof of the
asymptotics of the likelihood ratio test.

since DELTA is the fourth parameter in the combined parameter set, or, with
RATS 8.2 or later

summarize(parmset=baseparms+tparms,$
title="Test for linearity") delta-1

These produce identical results (with the TEST output in squared form):


Test for linearity
Chi-Squared(1)= 11.339352 with Significance Level 0.00075882

Test for linearity

Value -0.9395617 t-Statistic -3.36740


Standard Error 0.2790173 Signif Level 0.0007588

We can also do a likelihood ratio test for $\delta = 1$ relatively easily: we just add a
third PARMSET that pegs $\delta$ to the hypothesized value. We save the original log
likelihood and re-estimate the model with the restriction:

nonlin(parmset=pegs) delta=1.0
compute loglunr=%logl
*
maximize(parmset=baseparms+tparms+pegs) logl 2 *
cdf(title="LR Test for delta=1") chisqr 2*(loglunr-%logl) 1

This gives us the rather remarkable result:


LR Test for delta=1
Chi-Squared(1)= 0.576804 with Significance Level 0.44756761

which conflicts with the Wald test. Now unlike linear restrictions on linear
models, there is no theorem which says that the Wald and Likelihood Ratio
tests should be identical, but these aren't even close. Apparently, the likelihood
surface is even flatter than the rather wide standard errors in the
MAXIMIZE output suggest. Now, we knew from the beginning that this was
likely a flawed model, and the results here would confirm that it's not reliable
for any real inference. Again, not all models work. The fact that you get
converged estimates doesn't help if the model itself isn't good.

4.2 ARCH and GARCH Models


RATS includes a GARCH instruction that is capable of estimating many standard
types of univariate and multivariate GARCH models. However, there are
even more forms of GARCH that aren't covered by the built-in GARCH, and these
require MAXIMIZE. Thus, it makes sense to examine the process of estimating
a simple GARCH model using the more general instruction, since extensions
usually start with the more basic forms.
Figure 4.1: Standardized Residuals from Linear Regression

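Suppose you want to estimate a simple regression model with an ARCH(1)
error process:

$$y_t = x_t\beta + \varepsilon_t \tag{4.4}$$
$$\varepsilon_t = v_t\sqrt{\alpha_0 + \alpha_1\varepsilon_{t-1}^2} \tag{4.5}$$

where $v_t$ is a zero-mean normal i.i.d. variable with unit variance. Under these assumptions,

$$E_{t-1}\varepsilon_t = 0 \tag{4.6}$$
$$E_{t-1}\varepsilon_t^2 \equiv h_t = \alpha_0 + \alpha_1\varepsilon_{t-1}^2 \tag{4.7}$$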

We'll look at the regression model

$$LR_t = \beta_0 + \beta_1 LR_{t-1} + \beta_2 SR_{t-1} + \varepsilon_t$$

which is the model from Section 3.3 without the power on the SR term. In
Example 4.2, we start with OLS estimates:

linreg tb1yr
# constant tb1yr{1} tb3mo{1}

We can compute and graph (Figure 4.1) the standardized residuals:

set stdu = %resids/sqrt(%seesq)


graph(footer="Standardized Residuals from Linear Regression")
# stdu

We see that we have a clear problem that there are several very large residuals
(greater than 4 standard errors), with more that are larger than 2 than would
be expected in just 200 data points. That, by itself, could be an indication that
a fatter tailed distribution than the Normal might be appropriate, but what
suggests the need for something more complicated is that most of the largest
residuals (of various signs) are grouped together in a relatively short range of
data. While the residuals might not be showing serial correlation (which is a
measure of linear association), they don't appear to be independent; instead,
the squares of the residuals appear to be serially correlated. A Lagrange multiplier
(LM) test for ARCH disturbances was proposed by Engle (1982). After
you have estimated the most appropriate model for $y_t$, save the residuals, then
create the square of the residuals and regress these squared residuals on a constant
and on m lagged values of the squared residuals. In our case, if m = 4:

set u = %resids
set usq = u^2
linreg usq
# constant usq{1 to 4}
cdf(title="Test for ARCH") chisqr %trsquared 4

If there are no ARCH or GARCH effects, this regression will have little explanatory
power so the coefficient of determination (the usual $R^2$) will be quite low.
With a sample of T residuals, under the null hypothesis of no ARCH errors, the
test statistic $TR^2$ converges to a $\chi^2$ distribution with m degrees of freedom. We
can use the variable %TRSQUARED computed by LINREG as the test statistic.
This turns out to be very significant:
Test for ARCH
Chi-Squared(4)= 55.352434 with Significance Level 0.00000000

so we would conclude that the residuals aren't (conditionally) homoscedastic
(homoscedasticity being the null), which strongly suggests the presence of ARCH
or GARCH effects.[2] Since this is a heavily-used test, there's a standard
@ARCHTEST procedure to do it:
@archtest(lags=4,form=lm) u

which produces the same result as before. Note that the input to @ARCHTEST is
the residual itself, not its square. Note also that we saved the original residuals
from the LINREG into a separate series since the standard %RESIDS will be
overwritten by the auxiliary regression.
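The auxiliary regression behind the LM test can be written out as follows. This is a sketch in which $\hat\varepsilon_t$ denotes the saved residuals; the coefficients $\gamma_i$ and the error $\eta_t$ are our notation rather than the text's.

```latex
\begin{align*}
\hat\varepsilon_t^2 &= \gamma_0 + \sum_{i=1}^{m} \gamma_i\,\hat\varepsilon_{t-i}^2 + \eta_t \\
H_0 &: \gamma_1 = \gamma_2 = \cdots = \gamma_m = 0 \\
T R^2 &\xrightarrow{d} \chi^2_m \quad \text{under } H_0
\end{align*}
```

With $m = 4$ this is the LINREG of USQ on a constant and USQ{1 to 4} shown above.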
[2] Again, it's important to understand what a test like this means. We haven't determined
that there are, in fact, ARCH or GARCH effects, or (more particularly) that any specific ARCH
or GARCH model is appropriate, just that the large residuals are clustering in a way that isn't
compatible with a simple model (the null hypothesis) where the sizes of residuals are independent
across time.

Since our test seems to show the residuals show ARCH behavior, how do we
adjust our estimation to allow for that? We can write $f(y_t|y_{t-1},\ldots,y_1)$ using

$$y_t \sim N(x_t\beta, h_t)$$
$$h_t = a_0 + a_1\varepsilon_{t-1}^2$$
$$\varepsilon_t = y_t - x_t\beta$$

since $\varepsilon_{t-1}$ (and thus $h_t$) is a function of data only through $t-1$. Thus, we can get
the full sample likelihood by using the standard trick of writing $f(y_1,\ldots,y_T)$
as $f(y_1)f(y_2|y_1)\cdots f(y_T|y_{T-1},\ldots,y_1)$. In logs, this converts to a sum so

$$\log f(y_1,\ldots,y_T) = \sum_{t=1}^{T} \log f_N(\varepsilon_t|h_t)$$

where $f_N(x|\sigma^2)$ is the Normal density function at x with mean 0 and variance
$\sigma^2$. The simplest way to compute $\log f_N(\varepsilon|h)$ in RATS is with the function
%LOGDENSITY(h,eps) (note the parameter order).
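Written out, the Normal log density with mean 0 and variance $h$, including the integrating constant, is:

```latex
\log f_N(\varepsilon \mid h) = -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log h - \frac{\varepsilon^2}{2h}
```

The $\log h$ term is the reason $h$ has to be strictly positive for the likelihood to be defined.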
We could directly write out the log likelihood at t by substituting everything
out and getting

$$\log f(y_t|y_{t-1},\ldots) = \log f_N\left(y_t - x_t\beta \,\middle|\, a_0 + a_1(y_{t-1} - x_{t-1}\beta)^2\right)$$


However, aside from the fact that writing a complicated formula like that accurately
isn't easy, it also evaluates $\varepsilon_t$ twice, once at t, when it's the residual,
and once at t + 1 when it would be the lagged residual needed for computing
$h_{t+1}$. The extra time required for computing the same value twice isn't a major
problem here, though it could be in other cases; the biggest problem is the lack
of flexibility: if we want to change the mean function for the process, we'll
have to make the same change twice. It would be better if we could change it
just once.
We'll start by using NONLIN instructions to declare the five parameters: the
three parameters in the regression model and two in the ARCH process. By
splitting these up, we make it easier to change one part of the model separately
from the other.

nonlin(parmset=meanparms) b0 b1 b2
nonlin(parmset=archparms) a0 a1

We'll define the log likelihood in three parts:

frml efrml = tb1yr-(b0+b1*tb1yr{1}+b2*tb3mo{1})
frml hfrml = a0+a1*efrml{1}^2
frml logl = %logdensity(hfrml,efrml)

The mean model appears only in the EFRML, the variance model only in the
HFRML, and the LOGL FRML really doesn't need to know how either of those is
computed. This is good programming practice: if different parts of a model
can be changed independently of each other, try to build the model that way
from the start.
We now need guess values for the parameters. We can't simply allow the default
0's, because if A0 and A1 are zero, HFRML is zero, so the function value is
undefined. If the function value is undefined at the guess values, there's really
no good way off of it. The first thing that MAXIMIZE does is to compute the formula
to see which data points can be used. Without better guesses (for A0 and
A1), the answer to that is none of them. You'll get the message:
## SR10. Missing Values And/Or SMPL Options Leave No Usable Data Points

In this case, there are fairly obvious guess values for the regression parameters
as the coefficients from the LINREG. One obvious choice for A0 and A1 is the
residual variance from the LINREG and 0, respectively, which means that the
model starts out from the linear regression with fixed variance. Since 0 is a
boundary value, however, it might be better to start with a positive value for
A1 and adjust A0 to match the sample variance, thus:

linreg tb1yr
# constant tb1yr{1} tb3mo{1}
*
compute b0=%beta(1),b1=%beta(2),b2=%beta(3)
compute a0=%seesq/(1-.5),a1=.5
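The idea of matching the sample variance can be made explicit with the unconditional variance of an ARCH(1) process. This is a worked derivation using the $a_0$, $a_1$ notation above, with $s^2$ (our notation) standing for the LINREG residual variance %SEESQ:

```latex
\begin{align*}
E[h_t] &= a_0 + a_1 E[\varepsilon_{t-1}^2] \\
\sigma^2 &= a_0 + a_1\sigma^2 \quad\Rightarrow\quad \sigma^2 = \frac{a_0}{1-a_1}, \qquad 0 \le a_1 < 1 \\
a_0 &= s^2(1-a_1) \quad \text{when } \sigma^2 \text{ is set equal to } s^2
\end{align*}
```

With $a_1 = 0.5$ an exact match would be $a_0 = 0.5s^2$. The COMPUTE above divides by $(1-a_1)$ instead, which starts A0 at $2s^2$. Either is only a guess value, and the estimates reported below come from the instruction as printed.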

The instruction for maximizing the log likelihood is:

maximize(parmset=meanparms+archparms) logl 3 *

Note that we lose two observations: one for the lag in the mean model so $\varepsilon_t$ isn't
defined until 2 and one additional observation due to the lag $\varepsilon_{t-1}$ in the ARCH
specification, so we start at entry 3.
MAXIMIZE - Estimation by BFGS
Convergence in 19 Iterations. Final criterion was 0.0000078 <= 0.0000100
Quarterly Data From 1960:03 To 2012:04
Usable Observations 210
Function Value -226.8267

Variable Coeff Std Error T-Stat Signif


************************************************************************************
1. B0 0.091504296 0.092871014 0.98528 0.32448474
2. B1 1.010540358 0.173973445 5.80859 0.00000001
3. B2 -0.033379484 0.184926188 -0.18050 0.85675875
4. A0 0.386146972 0.047680337 8.09866 0.00000000
5. A1 0.350387587 0.122202087 2.86728 0.00414017

If we want to compare the ARCH model with the linear regression, we need to
use the same sample range. The AIC for the ARCH model can be computed with

compute aicarch=-2.0*%logl+2.0*%nreg

while the same for the OLS is done with:

linreg tb1yr %regstart() %regend()
# constant tb1yr{1} tb3mo{1}
compute aicols=-2.0*%logl+2.0*(%nreg+1)

using %REGSTART() and %REGEND() to ensure we use the same range on the
LINREG as we did on the MAXIMIZE. As with the earlier example of the t vs
Normal, we need to count the variance as a separate parameter for OLS since
the ARCH is explicitly modeling the variance. A comparison of the AIC values
shows a clear edge to the ARCH:

disp "AIC-ARCH" @15 *.### aicarch


disp "AIC-OLS" @15 *.### aicols

AIC-ARCH 463.653
AIC-OLS 483.993
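The ARCH figure can be checked by hand from the MAXIMIZE output above, using the function value $-226.8267$ and the $k = 5$ estimated parameters:

```latex
\begin{align*}
\mathrm{AIC} &= -2\log L + 2k \\
\mathrm{AIC}_{\mathrm{ARCH}} &= -2(-226.8267) + 2(5) = 453.6534 + 10 = 463.6534
\end{align*}
```

For OLS the count is $k = 3 + 1 = 4$, the extra one being the variance.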

The same model can be estimated using the built-in GARCH instruction with:

garch(reg,q=1) / tb1yr
# constant tb1yr{1} tb3mo{1}

which is obviously much simpler than going through the setup for using
MAXIMIZE. In general, where a built-in instruction is available for a model,
it's a good idea to use it.

4.3 Using FRMLs from Linear Equations


Many users concentrate so much on condition (4.7) that they forget about
(4.6): that the residuals are also supposed to be serially uncorrelated. Although
Engle's original ARCH paper used the inflation rate in the U.K. as its
example, for years most empirical work using the more flexible GARCH model
(Bollerslev (1986)) applied it to returns on financial assets where lack of serial
correlation could almost be taken as a given. However, if you're applying an
ARCH technique to some other type of data (such as macroeconomic data as we
are here), it's important to also try to get the mean model correct.
One problem with the program above is that it keeps repeating the same linear
equation specification. The first one is for illustration only, but there's a
LINREG for guess values, a LINREG for the AIC comparison, the regression re-
lation is coded into the EFRML formula, and the regressor list is repeated again
on the GARCH instruction. We would be much better off if we could define the
mean model once and have that used all the way through.
We saw this idea in Chapter 3, page 94. Here, we'll define both an EQUATION
and a FRML based upon the single linear specification. Right up at the top of
Example 4.3, we'll do:

linreg tb1yr
# constant tb1yr{1} tb3mo{1}
equation(lastreg) meaneq
frml(lastreg,vector=beta) meanfrml
compute rstart=%regstart(),rend=%regend()

This

1. Defines the EQUATION MEANEQ with the form (and coefficients) taken from
the regression.
2. Defines the FRML MEANFRML with the form taken from the regression,
using BETA(1), BETA(2) and BETA(3) for the three coefficients, with
those three given the values from the regression.
3. Defines RSTART and REND as the estimation range of the regression.

The third of these is useful because if we add lags to the model, the estimation
range will change, and RSTART will change automatically. We can then use
RSTART to get the proper start entry on the MAXIMIZE instruction.
In addition to using a more flexible setup for the mean model, we'll use a GARCH
model rather than the simpler ARCH. In a GARCH(1,1) model, the variance
evolves as

$$h_t = c + a\varepsilon_{t-1}^2 + bh_{t-1} \tag{4.8}$$
This gives a smoother evolution to the variance than is possible with the simpler
ARCH, and, in practice, works so much better that the ARCH is now rarely
used. One major difference in programming is that, unlike the ARCH, the variance
isn't computable using only the data and parameters: how do we compute
$h_1$ (or more specifically h at the first entry in the estimation range)? Unlike the
case of a moving average model, zero isn't an obvious pre-sample value, since
the expected value of a variance isn't zero. The model can be solved to get a
long-run value for the variance of

$$h = c/(1 - a - b)$$

except that won't exist if $a + b \geq 1$. And unlike an ARMA model, there is no
stationary distribution for a GARCH process which can be used for doing full
information maximum likelihood. In short, there is no single obvious log likelihood
value given the data and the parameters: different programs will come
up with somewhat different results given different choices for the pre-sample
values for h. With a large data set, the differences are generally quite minor,
but with a shorter one (and the roughly 200 data points in our data set is quite
short for a GARCH model), they could be more substantial.
The RATS GARCH instruction uses the common choice of the estimate from a
fixed variance model (that is, the linear regression) and that's what we'll show
here. In order to avoid dropping data points to handle the $\varepsilon_{t-1}^2$ term, we will
also use the same pre-sample value for that.
Because h now requires a lagged value of itself, we can't simply write out a
FRML for (4.8). Instead, we have to create a separate series for the variances,
and have the formulas use that for lags and reset that as it's computed. The
following will get us started: these create three series, one for the variances,
one for the residuals and one for the squared residuals.

set h = %seesq
set u = %resids
set uu = %seesq

H and UU are initialized to the fixed variance from the LINREG; however, the
only entries for which that matters are the pre-sample ones, as all others will
be rewritten as part of the function evaluation.
The two parameter sets can be defined with

nonlin(parmset=meanparms) beta
nonlin(parmset=garchparms) c a b

and the log likelihood formula is again defined in three parts:

frml efrml = tb1yr-meanfrml
frml hfrml = c+a*uu{1}+b*h{1}
frml logl = h=hfrml,u=efrml,uu=u^2,%logdensity(h,u)

Unlike Example 4.2, this now has the more flexible method of handling the
mean model, so if we change the initial LINREG, the whole model will change.
If we look at the LOGL formula piece by piece we see that it does the following
(in order) when evaluating entry T:

1. Evaluates and saves H(T) using HFRML. This uses UU(T-1) and
H(T-1). When T is the first entry in the estimation range, those will
be the values we put into the H and UU series by the SET instructions
earlier.
2. Evaluates and saves (into U(T)) the residual using EFRML.
3. Saves the square of U(T) into UU(T).
4. Finally, evaluates the log likelihood for T.

Why do we create a series for UU rather than simply squaring U when needed?
It's all for that one pre-sample value: there is no residual available to be
squared to get U(T-1)^2.
Why do we evaluate H first rather than U? In this model, it doesn't matter, but
if you allow for an "M" effect (see Engle, Lilien, and Robins (1987)) you must
compute current H first so it will be available for computing the residual. It's
never wrong to compute H before the residual.
We now need guess values. The mean model is already done: the
FRML(LASTREG) copies the LINREG coefficients into BETA. The following is a
reasonable set of start values for the GARCH parameters:

compute a=.1,b=.6,c=%seesq/(1-a-b)

In practice, if there is a GARCH effect, the coefficient on the lagged variance
(what we're calling B) tends to be larger than the one on the lagged squared
residual (A).
We can estimate the model with

maximize(parmset=meanparms+garchparms,$
pmethod=simplex,piters=10) logl rstart rend
compute aicgarch=-2.0*%logl+2.0*%nreg

This model doesn't converge properly without the simplex iterations. If you
want to test this, change it to PITERS=0 and re-do the program starting at the
beginning (so you get the original guess values).
The results from estimation with the combination of simplex and BFGS are
MAXIMIZE - Estimation by BFGS
Convergence in 27 Iterations. Final criterion was 0.0000009 <= 0.0000100
Quarterly Data From 1960:02 To 2012:04
Usable Observations 211
Function Value -187.4168

Variable Coeff Std Error T-Stat Signif


************************************************************************************
1. BETA(1) 0.0368121628 0.0486053825 0.75737 0.44882940
2. BETA(2) 0.9346725646 0.1594455856 5.86202 0.00000000
3. BETA(3) 0.0584863698 0.1776252153 0.32927 0.74195283
4. C 0.0147378717 0.0105093858 1.40235 0.16080978
5. A 0.5084325110 0.1922289141 2.64493 0.00817072
6. B 0.5779414855 0.0930391861 6.21181 0.00000000

This is much better than the earlier ARCH model. This uses an extra data
point (211 rather than 210) and has one extra parameter, but the gap in log
likelihood between the ARCH and GARCH is huge. As before, we can do an AIC
comparison with OLS using

linreg(equation=meaneq) * %regstart() %regend()


compute aicols=-2.0*%logl+2.0*(%nreg+1)
*
disp "AIC-GARCH" @15 *.### aicgarch
disp "AIC-OLS" @15 *.### aicols

This leaves no doubt as to which model the data prefer:

AIC-GARCH 384.730
AIC-OLS 483.993
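For comparison, the same GARCH(1,1) specification could presumably be handed to the built-in instruction. This is a sketch only: the P=1 and HSERIES options are not shown elsewhere in this section and are assumed here, and HGARCH is an illustrative series name.

```rats
* P=1 adds the lagged variance term to the ARCH(1) set-up used in Section 4.2
* HSERIES saves the estimated variances into the (illustrative) series HGARCH
garch(reg,p=1,q=1,hseries=hgarch) / tb1yr
# constant tb1yr{1} tb3mo{1}
```

Since the built-in instruction uses the same pre-sample convention described above, its estimates should be close to those from MAXIMIZE, though the two need not match to every digit.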

We can do tests for the adequacy of the model using the standardized residuals
and the squared standardized residuals:[3]

[3] The DFC options are 1 for the residuals because we have one lag of the dependent variable
in the mean model, and 2 for the squared residuals because we have two lagged coefficients in
the GARCH variance model.
Figure 4.2: Standardized Squared Residuals (autocorrelations at lags 1 to 10; Q= 4.67, P-value 0.79259)

set stdu = u/sqrt(h)
set stdusq = stdu^2
*
@regcorrs(number=10,dfc=1,nocrits,qstat,$
title="Standardized Residuals") stdu
@regcorrs(number=10,dfc=2,nocrits,qstat,$
title="Standardized Squared Residuals") stdusq

The second of these (Figure 4.2) shows what we would like to see.
This is a test of any remaining ARCH or GARCH. If we applied this to the ARCH
model, we wouldn't get such a comforting result: a significant Q suggests that
the variance model isn't adequate, and the simple ARCH isn't.
The first test, on the standardized residuals themselves, gives us Figure 4.3
which shows a problem with serial correlation, that is, that (4.6) doesn't appear
to be true. Note that we can't really test the non-standardized residuals as we
did with ARMA models because those calculations will be strongly influenced
by a relatively small number of data points where the variance is high. After
standardizing by the GARCH estimate of the standard deviation (Figure 4.4),
we get a first lag autocorrelation which is very significant and a Q with a very
significant p-value.
This suggests that we didn't allow enough lags in the mean model. If you go
back and change the model to

linreg tb1yr
# constant tb1yr{1 2} tb3mo{1 2}

and re-run, you'll find that the standardized residuals are much closer to white
noise.
Figure 4.3: Standardized Residuals from GARCH (autocorrelations at lags 1 to 10; Q= 48.18, P-value 0.00000)

The GARCH estimates of the standard deviations (from the model with just one
lag) can be created with

set hstddev = sqrt(h)


graph(footer="Standard Deviations from GARCH Model")
# hstddev

It usually works better to graph standard deviations rather than variances
because of scale: the larger variances are so much higher than the low ones
that you get very little detail except for the zone where the variance is high.
Figure 4.4: Standard Deviations from GARCH Model

4.4 Tips and Tricks


4.4.1 The Simplex Algorithm

The simplex algorithm can be applied to an optimization problem assuming
nothing more than continuity of the objective function. In order to operate
with so little restriction on the behavior of the function, the simplex algorithm
is more a collection of rules that seem to work in practice than a formal algorithm.
Instead of trying to climb the hill directly, it crawls upward primarily
by determining which directions aren't correct. In a K-dimensional space, it
uses a set of K + 1 vertices, thus, for instance, the vertices of a triangle in 2-space.
At a pass through the algorithm, a replacement is sought for the worst
of the vertices. An obvious guess (if we're still trying to move up) is that the
function will be better if we go through the face opposite the worst point. The
test is whether that new point is better than the one we're trying to replace, not
that it's better than all the other points. If we have an improvement over the
worst point, that old one is removed from the simplex collection and the new
one added. If the new point is worse, then it seems likely that the optimum may
already be surrounded, so a test point is chosen in the interior. This process
of replacing the worst of the vertices continues until the entire simplex has
shrunk down so the difference between the best and worst vertices satisfies
the convergence criterion.
As an example, suppose that we are trying to maximize $f(x, y) = -(x^2 + 4y^2)$.
The optimum is (0, 0) by inspection, but suppose that we only have a black
box which returns the value given an (x, y) combination. If our guess value is
(1, 2), we need to construct a triangle with that as one of the vertices. Suppose
that our three initial points are (1, 2), (2, 2) and (1, 4).⁴ The function values at
the three points are -17, -20 and -65 respectively, so the vertex to be replaced
is (1, 4). The test point is the reflection of (1, 4) through the midpoint of the
side formed by the two other points, which is (2, 0).⁵ The value there is -4, so
we accept it and eliminate (1, 4). Now the worst of the three is (2, 2). The new
test point is (1 + 2 - 2, 2 + 0 - 2) or (1, 0) where the function value is -1, again
better than the point being replaced. Our three vertices are now (1, 0), (2, 0)
and (1, 2). The test replacement for (1, 2) is (2, -2) where the function is -20,
thus worse than at (1, 2). So instead, we try the interior point halfway between
the vertex being replaced and the center of the side opposite: (1.25, 1.00). The
function there is -5.5625, thus better than the -17 at (1, 2), though not better
than the best one so far. Figure 4.5 shows the contours of the function, and the
points evaluated. Note that it's replaced all three of the original vertices.

Figure 4.5: Simplex Algorithm in Action (contours of the function with the
evaluated points (1.0,4.0), (1.0,2.0), (2.0,2.0), (2.0,0.0), (1.0,0.0),
(2.0,-2.0) and (1.25,1.0))
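The reflection arithmetic in this walk-through can be checked with a few
COMPUTE instructions. What follows is only a sketch of the first two moves in
the two-dimensional example; the names XTEST, YTEST and FVAL are made up for
illustration, and this is not a description of how RATS implements the simplex
method internally.

```rats
* First move: keep (1,2) and (2,2), reject (1,4)
compute xtest=1.0+2.0-1.0,ytest=2.0+2.0-4.0
compute fval=-(xtest^2+4.0*ytest^2)
disp "Test point" xtest ytest "function value" fval
* Second move: keep (1,2) and (2,0), reject (2,2)
compute xtest=1.0+2.0-2.0,ytest=2.0+0.0-2.0
compute fval=-(xtest^2+4.0*ytest^2)
disp "Test point" xtest ytest "function value" fval
```

The two test points should come out as (2, 0) with a function value of -4 and
(1, 0) with a function value of -1, matching the values quoted in the text.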
Unlike more direct climbing methods, this has to drag all K + 1 vertices up the
hill, rather than a single point, so it's less efficient for well-behaved problems.
However, even with functions that are twice-continuously differentiable, the
simplex algorithm can be useful as a preliminary optimization method, as it
can work even if the likelihood surface has the wrong curvature at the guess
values (that is, it's not concave). In effect, the preliminary simplex iterations
help to refine the guess values. To get preliminary simplex iterations, use the
combination of the options PMETHOD=SIMPLEX and PITERS=number of
preliminary iterations. What counts as an iteration for this is 2K simplex moves,
which roughly equalizes the number of function evaluations in an actual
iteration with other methods.
⁴ RATS will actually start with a much tighter cluster.
⁵ In two dimensions, the coordinates are found by summing the two kept points and
subtracting the coordinates of the one being rejected, thus the test has
x = 1 + 2 - 1 and y = 2 + 2 - 4.

We often see users overdo the number of preliminary iterations. Usually
5 to 10 is enough, and rarely will you need more than 20.
PMETHOD=SIMPLEX,PITERS=200 does quite a bit of calculation and won't
really get you much farther along than the same with PITERS=20.
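As a concrete illustration of a modest setting, the MAXIMIZE call from
Example 4.3 can be run with only five preliminary simplex iterations. This
sketch assumes that the FRML LOGL, the parameter sets MEANPARMS and
GARCHPARMS, the guess values and the range variables RSTART and REND have
already been set up exactly as in that example; only the PITERS value is
changed.

```rats
* Five preliminary simplex iterations, then the default climbing method.
* Everything except PITERS=5 is taken from Example 4.3.
maximize(parmset=meanparms+garchparms,$
   pmethod=simplex,piters=5) logl rstart+1 rend
```

If the estimates come out the same as with a larger PITERS, the extra simplex
iterations were not buying anything.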

4.4.2 BFGS and Hill-Climbing Methods

The BFGS algorithm⁶ is the workhorse optimization algorithm for general
maximum likelihood, not just in RATS but in many other statistical packages, as it
works quite well in a broad range of applications. BFGS requires that the
function being optimized be twice-continuously differentiable, which will be the case
for most log likelihoods that you will encounter. The function will have a second
order Taylor expansion around $\theta_0$:
$$f(\theta) \approx f(\theta_0) + f'(\theta_0) \cdot (\theta - \theta_0) + \frac{1}{2} (\theta - \theta_0)' f''(\theta_0) (\theta - \theta_0)$$
If $f$ were quadratic, then $f''(\theta_0)$ would be a constant (negative definite) matrix
$Q$, and the optimum could be found directly by solving
$$\theta = \theta_0 - Q^{-1} f'(\theta_0)$$
no matter what start values we have for $\theta_0$. If we start near the optimum,
then the same calculation should be at least close to finding the top even if the
function is only locally quadratic. There are two main problems in practice:

1. For a non-quadratic function, what happens if we're not near the
   optimum?
2. It may not be easy to compute $f''(\theta_0)$.

The gradient can usually be computed fairly accurately by numerical methods.
Numerical second derivatives are much less accurate and require quite a bit of
extra calculation: you can compute the gradient in K space with K + 1 function
evaluations (a base plus a slightly perturbed value in each direction), but the
second derivative requires an extra K(K + 1)/2 to fill up a K × K symmetrical
array.
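The K + 1 count can be seen in a small forward-difference sketch for the
two-parameter function $-(x^2 + 4y^2)$ at the point (1, 2). The perturbation size
and all variable names here (PERT, F0, FX, FY, GX, GY) are illustrative
assumptions; this is not the scheme RATS itself uses.

```rats
* Base point and a small perturbation (illustrative size)
compute x0=1.0,y0=2.0,pert=.000001
* K+1 = 3 function evaluations for K = 2: the base, then one per direction
compute f0=-(x0^2+4.0*y0^2)
compute fx=-((x0+pert)^2+4.0*y0^2)
compute fy=-(x0^2+4.0*(y0+pert)^2)
* Forward-difference estimates of the two partial derivatives
compute gx=(fx-f0)/pert,gy=(fy-f0)/pert
disp "Numerical gradient" gx gy
```

The two values should be close to the exact gradient (-2, -16), though not
identical to it.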
The key result for the BFGS algorithm is that you can get an increasingly
accurate estimate of the Hessian ($f''$) without ever computing it directly by seeing
how the gradient changes from one iteration to the next. The precise result is
that if the function is actually quadratic and you do K iterations of BFGS with
exact line searches at each stage, then at the end, you will have built Q exactly
(and thus will have also found the optimum).
In practice, the function isn't globally quadratic, and we generally don't (for
efficiency) do exact line searches, so the algorithm will not converge exactly
in K iterations and the final $f''$ will only be approximate. However, with rare
exceptions, if the function is, indeed, twice continuously differentiable and has a
negative definite Hessian (no singularities) at the local maximum on the
starting hill, then BFGS will find its way to the top of that hill.

⁶ For Broyden, Fletcher, Goldfarb and Shanno, the creators of the algorithm.
To illustrate how BFGS works, we'll use the same example as simplex,
maximizing $-(x^2 + 4y^2)$ starting at (1, 2) where the function value is -17. BFGS builds
an approximation to $G \equiv -[f''(\theta_0)]^{-1}$ which (ideally) will be a positive
definite matrix. For illustration, we'll start with $G = I$. The gradient at (1, 2) is
$g = (-2, -16)$. The direction of search on the first iteration is $d = Gg$, so the
line being searched is $\theta(\lambda) = \theta_0 + \lambda d$. The directional derivative along this line
is $d \cdot g = g'Gg$, which must be positive since G is positive definite, so there
must be some positive value of $\lambda$ (possibly very small) at which $f(\theta(\lambda)) > f(\theta_0)$.
We need to find such a value. Because G isn't giving us a good estimate of
the curvature (yet), a full step ($\lambda = 1$) doesn't work well: that would take us
to (-1, -14) where the function value is -785. The RATS hill-climbing procedure
then tries $\lambda = .5$ (still doesn't help), then $\lambda = .25$, which puts us at (.5, -2.0)
where the function value is -16.25. This is better than -17, so it would appear
that we are done. However, the directional derivative is 260. That means that
with $\lambda = .25$, we would expect to see the function increase by much more than
simply -17 to -16.25. The small arc-derivative to the new point fairly strongly
indicates that $\lambda = .25$ is still too long a step, so it's on the other side of the
maximum in the chosen direction. So the next test is with $\lambda = .125$ or (.75, 0.0) where
the function value is -.5625. Now, we're improving by 16.4375 over a distance
of .125; the arc-derivative is 131.5, which is a (very) good ratio to 260.⁷ So
our first iteration takes us to (.75, 0.0). This used 4 subiterations (the function
evaluations along the chosen line). It's very common for early iterations of BFGS
to require this many, or sometimes more, subiterations since the early G isn't
very precise. With well-behaved problems, the number of subiterations usually
drops fairly quickly to 1 or perhaps 2 on each iteration.
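The four subiterations can be traced with a short loop. This is a sketch only:
it uses the exact gradient of $-(x^2 + 4y^2)$ at (1, 2), takes G = I so that the
search direction equals the gradient, and simply halves the step each time. It
does not reproduce the acceptance test RATS applies to the arc-derivative, and
the names DX, DY, LAMBDA, XNEW, YNEW and FVAL are illustrative.

```rats
compute x0=1.0,y0=2.0
* Exact gradient at the start point; with G=I this is also the direction
compute dx=-2.0*x0,dy=-8.0*y0
compute lambda=1.0
do sub=1,4
   compute xnew=x0+lambda*dx,ynew=y0+lambda*dy
   compute fval=-(xnew^2+4.0*ynew^2)
   disp "lambda" lambda "point" xnew ynew "function" fval
   compute lambda=.5*lambda
end do sub
```

The four lines of output correspond to step sizes 1, .5, .25 and .125, with
function values -785, -144, -16.25 and -.5625.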
BFGS then updates the estimate of G using the actual and predicted values for
the gradient at the new position. In this case, the result is
$$G = \begin{bmatrix} .5 & 0 \\ 0 & .125 \end{bmatrix}$$
which is exactly correct. Ordinarily, the first update wouldn't hit exactly, but
because the Hessian is diagonal and we started with a diagonal matrix, it
works here. Given that we have the correct G, the next iteration finds the
optimum.⁸
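To see why the next iteration lands on the optimum, the step can be worked out
by hand. This is a sketch that assumes exact derivatives and a full step of
length 1 along the search direction; RATS computes the gradient numerically, so
its result is only approximately this. At the point (.75, 0) the gradient of
$-(x^2 + 4y^2)$ is $g = (-2x, -8y) = (-1.5, 0)$, so

$$\begin{bmatrix} .75 \\ 0 \end{bmatrix} + G g = \begin{bmatrix} .75 \\ 0 \end{bmatrix} + \begin{bmatrix} .5 & 0 \\ 0 & .125 \end{bmatrix} \begin{bmatrix} -1.5 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

which is the maximizer found by inspection in the simplex example.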
You can see why preliminary simplex iterations are often handy: even with
a truly quadratic function, the first iteration on the hill-climbing algorithm
tries some very wrong parameter vectors. In Figure 4.6, we can't even show
the first two subiterations on the graph; the first one would be three pages
down. In this case, the wildly incorrect vectors won't be a problem, but with
some functions which aren't globally defined (because of the presence of logs or
square roots as part of the calculations, or because of explosive behavior for an
iterative calculation), it may be necessary to check for conditions that would
invalidate the value. A function evaluation which requires log or square root of
a negative number will naturally result in a missing value the way that RATS
does evaluations; it's the explosive recursions, which might occur in GARCH or
ARMA models, that may require special care.

Figure 4.6: BFGS in Action (the start point (1.0,2.0) and the evaluated points
(.5,-2.0) and (.75,0.0))

⁷ A ratio of .5 is what we would get if we found the exact maximum for $\lambda$ given
that this is a true quadratic.
⁸ Actually, almost. Because the gradient is computed numerically, none of the
calculations done above are exact.

4.4.3 The CDF instruction and Standard Distribution Functions

We used the CDF instruction earlier to compute and display the significance
level of the test for ARCH. It supports the four most commonly used test
distributions: the Normal, t, F and $\chi^2$. The syntax is:

CDF(option) distribution statistic degree1 degree2

where

distribution   Choose the desired distribution: FTEST for F, TTEST for t,
               CHISQ for $\chi^2$ or NORMAL for (standard) normal.
statistic      The value of the test statistic
degree1        Degrees of freedom for TTEST and CHISQ, or numerator
               degrees of freedom for FTEST
degree2        Denominator degrees of freedom for FTEST

The (main) option is TITLE=descriptive title. One of the examples was:

cdf(title="LR Test for delta=1") chisqr 2*(loglunr-%logl) 1

which produces
LR Test for delta=1
Chi-Squared(1)= 0.576804 with Significance Level 0.44756761

CDF sets the variables %CDSTAT as the test statistic and %SIGNIF as the
significance level.
CDF is very handy if output formatted as above is fine. If you need something
else (for instance, to insert the information onto a graph or into a report), you
can use a built-in function to compute the same significance level:

Normal     %ZTEST(z) returns the two-tailed significance level of z as a
           Normal(0,1).
t          %TTEST(t,nu) returns the two-tailed significance level of t as a
           t with nu degrees of freedom.
F          %FTEST(F,num,den) returns the significance level of F as an
           F with num numerator degrees of freedom and den denominator
           degrees of freedom.
$\chi^2$   %CHISQR(x,nu) returns the significance level of x as a $\chi^2$ with
           nu degrees of freedom.

An alternative to using CDF would be something like:

compute deltatest=2*(loglunr-%logl)
compute deltasignif=%chisqr(deltatest,1)
disp "Test delta=1" *.### deltatest "with signif" *.### deltasignif
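The four functions can be tried side by side with a short sketch. The test
statistics and degrees of freedom below are made-up illustrative values, except
for the last line, which reuses the chi-squared statistic from the CDF output
shown earlier; the variable names are also just for illustration.

```rats
* Illustrative statistics and degrees of freedom
compute zsig=%ztest(1.96)
compute tsig=%ttest(2.0,30)
compute fsig=%ftest(3.0,2,50)
* Same statistic as in the LR test output above
compute csig=%chisqr(0.576804,1)
disp "Normal" @20 *.### zsig
disp "t(30)" @20 *.### tsig
disp "F(2,50)" @20 *.### fsig
disp "Chi-squared(1)" @20 *.### csig
```

The last line should reproduce, to three decimals, the significance level of
0.448 reported by CDF for the same statistic.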

Example 4.1 Likelihood maximization


cal(q) 1960:1
all 2012:4
open data quarterly(2012).xls
data(org=obs,format=xls)
*
set pi = 100.0*log(ppi/ppi{1})
set y = .001*rgdp
*
* Power function with interest rates
*
* Non-linear least squares (maximum likelihood assuming Normal residuals)
*
nonlin a0 a1 a2 delta
linreg tb1yr
# constant tb1yr{1} tb3mo{1}
frml ratef tb1yr = a0+a1*tb1yr{1}+a2*(tb3mo{1})^delta
compute a0=%beta(1),a1=%beta(2),a2=%beta(3),delta=1.0
nlls(frml=ratef) tb1yr 2 *
*
* Maximum likelihood assuming t residuals. This requires adding the
* variance and the degrees of freedom to the parameter set.
*
nonlin(parmset=baseparms) a0 a1 a2 delta
nonlin(parmset=tparms) sigsq nu
*
* This starts with sigsq as the estimate from NLLS with a relatively
* high value of nu.
*
compute sigsq=%seesq,nu=20.0
frml logl = %logtdensity(sigsq,tb1yr-ratef,nu)
maximize(parmset=baseparms+tparms,iters=300) logl 2 *
*
* Test delta=1
*
test(title="Test for linearity")
# 4
# 1.0
*
* This is available in RATS 8.2 or later
*
summarize(parmset=baseparms+tparms,title="Test for linearity") delta-1
*
* Add a constraint that delta is 1.
*
nonlin(parmset=pegs) delta=1.0
*
* Save the unrestricted log likelihood
*
compute loglunr=%logl
*
maximize(parmset=baseparms+tparms+pegs) logl 2 *
cdf(title="LR Test for delta=1") chisqr 2*(loglunr-%logl) 1

Example 4.2 ARCH Model, Estimated with MAXIMIZE


cal(q) 1960:1
all 2012:4
open data quarterly(2012).xls
data(org=obs,format=xls)
*
linreg tb1yr
# constant tb1yr{1} tb3mo{1}
*
* Graph standardized residuals
*
set stdu = %resids/sqrt(%seesq)
graph(footer="Standardized Residuals from Linear Regression")
# stdu
*
* Test for ARCH using auxiliary regression
*
set u = %resids
set usq = u^2
linreg usq
# constant usq{1 to 4}
cdf(title="Test for ARCH") chisqr %trsquared 4
*
* Test for ARCH using @ARCHTEST
*
@archtest(lags=4,form=lm) u
*
* Define the parameters
*
nonlin(parmset=meanparms) b0 b1 b2
nonlin(parmset=archparms) a0 a1
*
* Define log likelihood FRML in three parts:
*
frml efrml = tb1yr-(b0+b1*tb1yr{1}+b2*tb3mo{1})
frml hfrml = a0+a1*efrml{1}^2
frml logl = %logdensity(hfrml,efrml)
*
* Guess values based upon linear regression
*
linreg tb1yr
# constant tb1yr{1} tb3mo{1}
compute b0=%beta(1),b1=%beta(2),b2=%beta(3)
compute a0=%seesq/(1-.5),a1=.5
*
* Estimate the ARCH model
*
maximize logl 3 *
compute aicarch=-2.0*%logl+2.0*%nreg
*
* Estimate the linear regression over the same range
*
linreg tb1yr 3 *
# constant tb1yr{1} tb3mo{1}
compute aicols=-2.0*%logl+2.0*(%nreg+1)
*
* Compare AICs with proper counting of parameters
*
disp "AIC-ARCH" @15 *.### aicarch
disp "AIC-OLS" @15 *.### aicols
*
* Same model done with GARCH
*
garch(reg,q=1) / tb1yr
# constant tb1yr{1} tb3mo{1}

Example 4.3 GARCH Model with Flexible Mean Model


cal(q) 1960:1
all 2012:4
open data quarterly(2012).xls
data(org=obs,format=xls)
*
linreg tb1yr
# constant tb1yr{1} tb3mo{1}
equation(lastreg) meaneq
frml(lastreg,vector=beta) meanfrml
compute rstart=%regstart(),rend=%regend()
*
set h = %seesq
set u = %resids
set uu = %seesq
*
* Define the parameters
*
nonlin(parmset=meanparms) beta
nonlin(parmset=garchparms) c a b
*
* Define log likelihood FRML in three parts:
*
frml efrml = tb1yr-meanfrml
frml hfrml = c+a*uu{1}+b*h{1}
frml logl = h=hfrml,u=efrml,uu=u^2,%logdensity(h,u)
*
* The BETAs are already done as part of FRML(LASTREG)
*
compute a=.1,b=.6,c=%seesq/(1-a-b)
*
* Estimate the GARCH model
*
maximize(parmset=meanparms+garchparms,$
pmethod=simplex,piters=10) logl rstart+1 rend
compute aicgarch=-2.0*%logl+2.0*%nreg
*
linreg(equation=meaneq) * %regstart() %regend()
compute aicols=-2.0*%logl+2.0*(%nreg+1)
*
* Compare AICs with proper counting of parameters
*
disp "AIC-GARCH" @15 *.### aicgarch
disp "AIC-OLS" @15 *.### aicols
*
* Test for serial correlation in the standardized residuals
*
set stdu = u/sqrt(h)
set stdusq = stdu^2
*
@regcorrs(number=10,dfc=1,nocrits,qstat,$
title="Standardized Residuals") stdu
@regcorrs(number=10,dfc=2,nocrits,qstat,$
title="Standardized Squared Residuals") stdusq
*
set hstddev = sqrt(h)
graph(footer="Standard Deviations from GARCH Model")
# hstddev
Chapter 5

Standard Programming Structures

We've already seen some (relatively) simple examples of using the
programming features of RATS using the DO and DOFOR loops. In this chapter, we'll look
in greater detail at the program control structures in RATS, emphasizing the
ones that tend to be common, in some form, to most programming languages.
These are the DO loop, IF and ELSE blocks, and WHILE and UNTIL loops. We'll
cover DOFOR, which is very useful but not as standard, in the next chapter.

5.1 Interpreters and Compilers


Except for the examples with loops, most of what we've seen has been RATS
as an interpreted language, which means that it executes each instruction
immediately after it is processed. This is often very handy as it allows you to
experiment with different ways of handling a model and you get immediate
feedback.
However, let's take a line out of a DO loop in an earlier program:

compute lreffect(t)=%sumlc

This really doesn't do much: it takes the real value %SUMLC and puts it into an
entry of the series LREFFECT. What does the interpreter have to do before this
happens? The main steps are that it takes the first three characters on the line
("com"), converts them to upper case, looks that up in a table of instructions and
determines that it's a COMPUTE instruction. It then has to isolate the token
LREFFECT, look that up in a symbol table, recognize it's a SERIES, look up T,
recognize that it's an INTEGER variable, look up %SUMLC, recognize that it's a
REAL, and determine that it can put all those together into a sensible
instruction. At that point, it can actually do the assignment of the value of %SUMLC to
LREFFECT(T). If you got the impression that it takes a lot more time to turn
the character string "compute lreffect(t)=%sumlc" into something usable than
it does to actually do what it requests, you're right. Now neither takes very
long in an absolute sense (the total time required might be 50 microseconds).
However, if you had to do that millions of times as part of a calculation, it could
matter.
A pure interpreted language (which RATS is not) has to go through something
like that process each time an instruction is executed. That has certain
advantages as you can quite literally alter instructions right up to the time that
they are executed. There's a cost in time, however, so it's most useful when
the instructions, when executed, tend to do a lot of calculation. For instance,
if instead of a simple assignment above, we were doing a LINREG, the amount
of time the interpreter requires might go up by a factor of three, while the
amount of work done by the instruction would likely go up by many thousands,
so interpretation wouldn't be as significant a part of the calculation time.
Instructions in complex RATS programs are often a mix of simple (such as
COMPUTE on scalars) and complicated (LINREG, MAXIMIZE) instructions. For
efficiency, when you do a loop or some other programming structure RATS uses
a compiler, which does the interpretation once and saves the generated code so
it can be executed with relatively little time added to what is needed to
execute the requested calculations. This requires advanced planning on your part
as RATS doesn't actually do anything (other than parse the instructions) until
you're done with all the instructions for the loop. As a general rule, if you are
doing any type of loop, you are best off entering the instructions "off-line". If
you type the following into your input window while it's in "ready" or "on-line"
mode:

do i=1,10
dsp i

(the last line is a typo, it should have been disp i), you'll get an error message
that it expected an instruction. At this point, the attempt to compile the loop
is aborted; you can't just correct the spelling and continue on.
Instead, the better approach is to make the window "local" or "off-line" before
you even start putting in the loop code. Click the icon or type <Control>+L,
then type (or paste) in the following (with the mistake on the second line):

do i=1,10
dsp i
end do i

Now click on the icon or type <Control>+L to put the window back into "ready"
mode, select the three lines and hit the <Enter> key or click the icon.¹
You'll get the "Expected Instruction" error. Now, however, you can just fix the
second line to read disp i, select the three lines again, hit <Enter> and you
get the expected

1
2
3
4
5
6
7
8
9
10

¹ If you select the three lines and hit <Enter> without switching back to ready mode, you'll
delete the three lines since in local mode the <Enter> is just a standard editing keystroke.
If you do that by accident, just Undo and make sure you switch to ready.

Now let's look at

disp "Before loop"


do i=1,10
disp i
end do i
disp "After loop"

As we've written this, the first and last DISPLAY instructions are done in
interpreted mode, while the middle three lines are done in compiler mode.
Interpreted mode is the natural state for RATS, so it needs one of a few special
instructions to put it into compiler mode. DO is one of those. Once it's in
compiler mode, RATS needs another signal that it should leave compiler mode and
execute the code that it has just generated. How that is done will be different
depending upon what instruction put it into compiler mode. In the case of DO,
it's a matching END. (The DO I after END is actually treated as a comment, so
that's for information only.) If you have nested DO loops such as

do i=1,4
do j=1,3
disp i j
end do j
end do i

the outer DO puts RATS into compiler mode. RATS then keeps track of the level
of the compiler structures, so the second DO raises that to level two. The
matching END for the DO J drops the level to one so it's still in compiler mode. The level
doesn't drop to zero until after the matching END for the outer DO I, at which
point RATS exits compiler mode and executes the double loop. This may seem
obvious from the indenting, but the indenting is only to make it easier to read
for humans: RATS ignores the leading blanks when processing the instructions. If
you have a long or complicated program segment that either is switching to
interpreter mode before you expect or not leaving compiler mode when you think
it should, you probably have some problem with the structure levels. It's much
easier to find those types of errors if you try to keep the instructions properly
indented (see page 6) to show the levels.
5.2 DO Loops
As illustrated in Section 2.8.1, the DO loop is a very simple way to automate
many of your repetitive programming tasks. It's by far the most common
program control structure. The most common DO loop will look like:
DO i=1,n
instructions to execute
end do i
There are slight differences in how DO loops function in different languages, so
we'll go through this step by step and point out where you need to be careful if
you are trying to translate a program from a different language.

1. The variable I is given the start value (here 1).
2. RATS determines how many passes through the loop are required to run
   the index from the start value (here 1) to the end value (here n). If the end
   value is less than the start value, the number of passes is zero, and
   control passes immediately to the instruction after the END DO I. In some
   languages (though not many), the loop instructions are always executed
   at least once.
3. The instructions are executed with the current value of I.
4. If the number of passes computed in step 2 has been reached, the loop is
   exited and control passes to the instruction after the END DO I. This is
   where different languages can differ quite a bit, as we'll point out in more
   detail.
5. I is incremented by 1 and the pass count is incremented by 1.
6. Repeat steps 3, 4 and 5 until the pass count is reached in step 4.

Some programming languages test for the loop exit differently: at the end of
the loop, they first increment I, then test it against N. If i > n, the loop is
exited. As a result, once the loop is done, I will be equal to n + 1 (assuming n
was bigger than 0 in the first place). The way RATS handles this, at loop exit
I will be equal to the value on the final trip through the loop. This is a subtle
difference and we'll see that it can matter.
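A quick way to see the exit value for yourself is to display the index after a
short loop. This is a minimal sketch; TOTAL is just an illustrative accumulator
so the loop has something to do.

```rats
compute total=0
do i=1,5
   compute total=total+i
end do i
* According to the rule described above, I should still be 5 here, not 6
disp "After the loop I is" i "and the total is" total
```

In a language that increments and then tests, the same loop would leave the
index at 6, which matters if later code uses the index as "the last entry
processed".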
What happens if you change the value of I inside the loop? While this is legal
in RATS (in some programming languages it isn't), it isn't a good idea, as the
results will probably not be what you expect. If you execute:

do i=1,10
compute i=i+3
disp i
end do i

the output is
4
8
12
16
20
24
28
32
36
40

and on loop exit, I will be equal to 40. As it says in steps 2 and 4, the DO loop
operates by determining how many passes are required right up front, and
then does that number, regardless of what happens to the value of the index. If
you need a loop where the increment can change from pass to pass, or the end
value might change, use a WHILE or UNTIL loop instead (section 5.4).
More generally, the DO loop has the form:
DO integer variable=n1,n2,increment
instructions to execute
end do integer variable
The variables I and J are pre-defined in RATS as INTEGER variables and are
by far the most common loop index variables. You can introduce a new variable
name for the integer variable and it will be defined as an INTEGER type.

do k=p,1,-1
...
end do k

will loop p times through the controlled instructions (as long as p is 1 or larger)
with K taking the value p to start and being decreased by 1 each pass through
the loop.
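If the goal is simply to step the index by something other than 1, the
increment belongs in the third position of the DO rather than in a COMPUTE
inside the loop. The following sketch shows an upward and a downward loop with
a step of 3; the choice of limits is arbitrary.

```rats
* Counts up in steps of 3
do i=1,10,3
   disp i
end do i
* Counts down in steps of 3
do k=10,1,-3
   disp k
end do k
```

The first loop displays 1, 4, 7 and 10, and the second displays 10, 7, 4 and 1,
since in each case four passes are needed to run from the start value to the
end value.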
In Example 5.1, we'll use the DO loop to analyze the possibility that log GDP
(Figure 5.1) has a broken trend.
To the eye, it has an approximately linear trend, but with substantial
deviations. If we regress on just constant and trend:

set trend = t
*
linreg loggdp
# constant trend

we get a Durbin-Watson of .035. The residuals are very strongly serially
correlated and any attempt to model the trend will have to take that into account. A
broken trend could take several forms, but what we'll look at here is a "joined"
trend, where the linear trend rate has one value through some point in time,
and another after, but the level of the process doesn't change at the join point.
That can be handled by a linear function which looks like
$$\alpha + \beta t + \gamma \max(t - t_0, 0) \qquad (5.1)$$

Figure 5.1: Log U.S. Real GDP

This will grow at the rate $\beta$ up to time $t_0$, then at $\beta + \gamma$ after $t_0$.
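To see what the break regressor looks like, it can be built for a single trial
date and printed around the join point. This sketch assumes the quarterly
CALENDAR from the examples is in effect; the break date 1973:1 is an arbitrary
choice for illustration.

```rats
* Illustrative break date; any date inside the sample would do
compute t0=1973:1
set btrend = %max(t-t0,0)
* Show a few entries on either side of the join point
print 1972:3 1973:4 btrend
```

BTREND should be 0 through 1973:1 and then 1, 2 and 3 for the next three
quarters, so the level is unchanged at the join point and only the slope
shifts.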


There are two basic ways to model this while allowing for serial correlation.
The simpler way to do this is to add lags of the dependent variable to (5.1):
$$y_t = \alpha + \beta t + \gamma \max(t - t_0, 0) + \varphi_1 y_{t-1} + \ldots + \varphi_p y_{t-p} + u_t \qquad (5.2)$$
This makes the break what is known as an innovational outlier. In this setup,
$\beta$ and $\gamma$ are not structural parameters describing the trend rate of the process:
the trend rate of y prior to $t_0$ can be solved out as
$$\frac{\beta}{1 - \varphi_1 - \ldots - \varphi_p}$$
If you look at what happens at $t_0 + 1$, none of the lagged y terms have yet been
affected by the trend change, so the first period after the break, the process
goes up by an extra $\gamma$ compared to the process without the break. At $t_0 + 2$, the
$y_{t-1}$ has increased by an extra $\gamma$, and the $\gamma$ term itself will now be $2\gamma$, so the
overall effect from including the break term is $2\gamma + \varphi_1 \gamma$ and so on. Notice
how the break works itself into the system gradually as the lag terms reach
the break location.
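The pre-break trend rate can be derived in two lines. As a sketch, write the
trend coefficient in (5.2) as $\beta$ and the lag coefficients as
$\varphi_1, \ldots, \varphi_p$, set $u_t = 0$, and suppose that before $t_0$ the
series follows a straight line $y_t = a + g t$; the symbols $a$ and $g$ are
introduced here only for this derivation. Substituting into (5.2) without the
break term gives

$$a + g t = \alpha + \beta t + \sum_{j=1}^{p} \varphi_j \left( a + g (t - j) \right)$$

and matching the coefficients on $t$ on the two sides gives

$$g = \beta + g \sum_{j=1}^{p} \varphi_j \quad \Longrightarrow \quad g = \frac{\beta}{1 - \varphi_1 - \ldots - \varphi_p}$$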
If we allow for two autoregressive lags, we can compute the sums of squares
for different break locations using the following loop:

set rssio = %na


do t0=1965:1,2007:4
set btrend = %max(t-t0,0)
linreg(noprint) loggdp
# constant trend btrend loggdp{1 2}
compute rssio(t0)=%rss
end do t0
Figure 5.2: RSS for Broken Trend, Innovational Outlier

The LINREG is just a straightforward translation of the formula (5.2) given the
value of T0. All we had to do was throw a loop around this. Why is this
running only from 1965:1 to 2007:4? With any analysis where you're looking
for some type of break in a model, you want to exclude breaks near the ends.
Because we are searching for breaks in the trend, it makes sense to require
enough data points in each branch to properly determine a trend rather than
just a cycle. Here we make sure there are at least five years of data both before
and after the change date.
The series RSSIO is initialized to %NA since the series doesn't exist outside the
range of the DO loop. Inside the loop, we do the regression and save the sum of
squared residuals into the T0 entry in RSSIO.
We can graph (Figure 5.2) the sum of squares with

graph(footer="RSS for Broken Trend, Innovational Outlier")


# rssio

We can find where the minimum was attained using

ext(noprint) rssio
disp "Minimum at" %datelabel(%minent) %minimum

Minimum at 2003:04 0.01247

The more complicated type of model has
$$\begin{aligned} y_t &= \alpha + \beta t + \gamma \max(t - t_0, 0) + z_t \\ z_t &= \varphi_1 z_{t-1} + \ldots + \varphi_p z_{t-p} + u_t \end{aligned} \qquad (5.3)$$
that is, y is described as a broken trend plus an AR(p) noise term. If it were not
for the broken trend terms, (5.2) and (5.3) would be equivalent models (with the
coefficients on the deterministics mapping to each other) if you work through
the expansions. However, with the broken trend, they aren't. (5.3) has what is
known as an additive outlier. Here $\alpha$, $\beta$ and $\gamma$ are structural parameters, with
$\beta + \gamma$ being the trend rate of the y process starting immediately at $t_0 + 1$.

Figure 5.3: RSS for Broken Trend, Additive Outlier

You can't estimate (5.3) using LINREG: a mean + AR or ARMA noise is done
using BOXJENK with the REGRESSORS option. The loop is almost identical other
than the substitution of the main instruction:
set rssao = %na
do t0=1965:1,2007:4
set btrend = %max(t-t0,0)
boxjenk(regressors,ar=2,noprint) loggdp
# constant trend btrend
compute rssao(t0)=%rss
end do t0
graph(footer="RSS for Broken Trend, Additive Outlier")
# rssao
*
ext(noprint) rssao
disp "Minimum at" %datelabel(%minent) %minimum

Perhaps not too surprisingly, the sum of squares (Figure 5.3) is quite a bit
more volatile than it is for the innovational model since changes to the trend rate
hit immediately.
It's important to note that the F statistic for either model in comparison with
a non-breaking trend model has a non-standard distribution if you search for
the best break point. It's beyond the scope of this book to deal with the theory
behind that.

5.3 IF and ELSE Blocks


There are many instances in which we want to perform a set of instructions
only if a particular condition is met. The most common way to do this is to
use an IF or IF-ELSE block. We already saw a very simple example of this on
page 49 where we checked for whether a BOXJENK estimation converged and
displayed a message when it failed.
The basic structure of an IF block is:
IF condition {
block of statements executed if condition is true
}
while an IF-ELSE block is:
IF condition {
block of statements executed if condition is true
}
ELSE {
block of statements executed if condition is false
}
What form does the condition take? It can be any expression that is non-zero
when you want "true" and zero when you want "false". This is usually built
using the following standard relational operators for comparing expressions A
and B. (Each has two equivalent representations.)

A==B or A.EQ.B    Equality
A<>B or A.NE.B    Inequality
A>B  or A.GT.B    Greater than
A<B  or A.LT.B    Less than
A>=B or A.GE.B    Greater than or equal to
A<=B or A.LE.B    Less than or equal to

Note well that the test for equality is done with == (two =), not just a single =.
A=B assigns the value of B to the variable or array element A.
You can create compound conditions using "and", "or" and "not" with
condition 1.AND.condition 2
condition 1.OR.condition 2
.NOT.condition
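As a small illustration of a compound condition, the following sketch draws a
few standard Normal values and reports whether each falls in the band from -2
to 2. The parentheses around each comparison are not claimed to be required;
they are there only to make the grouping explicit. The loop count and the
variable names are arbitrary.

```rats
do rep=1,10
   compute r=%ran(1.0)
   * True only when both comparisons hold
   if (r>=-2.0).and.(r<2.0)
      disp "Inside the band" r
   else
      disp "Outside the band" r
end do rep
```

The same test could be written with .OR. by reversing the two comparisons and
swapping the branches.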
It's important to note that some programming languages have constructions
like this that are used in transforming data. That is not done in RATS: use
SET with %IF or a relational operator instead. For instance, this is how not to
create a dummy variable in RATS that's 1 when real GDP is above potential:

* This is not RATS code


if rgdp>potent
set boom = 1
else
set boom = 0

Instead, you would use simply

set boom = rgdp>potent

SET has an implied loop over the entries; the IF-ELSE does not. Something
like the IF-ELSE code could work by inserting it inside a loop. Here it would
be a bad idea since seven lines can be replaced with one, but there are other
situations where it would be superior to using a SET if the two branch
calculations were sufficiently complicated. In this case, the code would be something
like:
set boom = 0.0
do t=1,%allocend()
if rgdp(t)>potent(t)
compute boom(t)=1
else
compute boom(t)=0
end do t

You may have noticed that we didn't use { and } around the instructions
controlled by the IF and the ELSE in this last example. By default, IF and ELSE
control just one line. If you need to execute more than one line, you need to
enclose the controlled lines in braces. It never hurts to add the braces, but they
aren't necessary in the simplest case.
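For instance, if the "true" branch needs two instructions, that branch
requires braces while a single-line ELSE does not. This is a sketch which
extends the loop above; the counter NBOOM is a name introduced here purely for
illustration.

```rats
set boom = 0.0
compute nboom=0
do t=1,%allocend()
   if rgdp(t)>potent(t) {
      * Two controlled instructions, so the braces are required
      compute boom(t)=1
      compute nboom=nboom+1
   }
   else
      compute boom(t)=0
end do t
disp "Entries above potential:" nboom
```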
As we've seen before on page 49, you can have an IF without an ELSE. IF-ELSE
can be used if there are two alternative calculations. If you have three or more
(mutually exclusive) cases, you can string together a set of IFs and ELSEs. For
illustration:
do rep=1,100
compute r=%ran(1.0)
if r<-2.0
disp "Big Negative value" r
else
if r<2.0
disp "Between -2 and 2" r
else
disp "Big Positive value" r
end do rep

Each pass through the loop, this draws a $N(0,1)$ random number. If $r < -2$,
the first condition is met, and the message about the "big negative value" is
displayed. If $r \ge -2$, since the first IF condition fails, we do the first ELSE.
That immediately goes into a second IF. If $r < 2$, the second IF condition is
met, and the "between -2 and 2" message gets displayed. We know that $r \ge -2$
as well since we're in the ELSE from the first IF. Finally, if $r \ge 2$, we get down
to the final ELSE clause and display the "big positive value" message.
Note that you can do the same type of multiple branch calculation within a SET
instruction using nested %IF functions. For instance

set u = %ran(1.0)
set range = %if(u<-2,-1,%if(u<2,0,1))

will create RANGE as a series which has $range_t = -1$ if $u_t < -2$, $range_t = 0$ if
$u_t \ge -2$ and $u_t < 2$, and $range_t = +1$ if $u_t \ge 2$.
As a more concrete example, we'll look at lag length selection again in Example
5.2. We'll find the best AIC autoregression on the change in log real GDP. As
we mentioned in Chapter 2 (page 23), when you use an information criterion,
it's important to run the regressions over the same range or you'll bias the
results in one direction or the other. Before, we used the range parameters
on LINREG to enforce that. We'll show an alternative which both makes sure
that the range is the same and is more efficient computationally. The
CMOMENT instruction generates a cross product matrix of a set of data. After
that, the LINREG instruction with a CMOMENT option will run a regression using
the cross product information (and range) from the CMOMENT instruction.
The CMOMENT instruction needs to include both the explanatory variables and
the dependent variable(s) from all the regressions that will be run using it.
In this case, that means lags from 0 (for the dependent variable) to 12 of
DLRGDP plus the CONSTANT.

set dlrgdp = log(rgdp)-log(rgdp{1})
*
cmom
# dlrgdp{0 to 12} constant
The following does the regressions and picks out the minimum AIC lag length:

do lags=0,12
if lags==0 {
linreg(noprint,cmom) dlrgdp
# constant
compute aic = -2.0*%logl + %nreg*2
compute bestlag=lags,bestaic=aic
}
else {
linreg(noprint,cmom) dlrgdp
# constant dlrgdp{1 to lags}
compute aic = -2.0*%logl + %nreg*2
if (aic < bestaic)
compute bestlag=lags,bestaic=aic
}
end do lags

The instruction block controlled by the IF runs the regression on the CONSTANT
only and saves the AIC into the BESTAIC variable. These lines are executed
only when LAGS is equal to 0. Since that's the first pass through the loop, we
know that the model is the best that we have seen to that point. If LAGS is
non-zero, we execute the instruction block controlled by the ELSE. This estimates
the model with the current number of LAGS, computes the AIC and compares
it to whatever is now in BESTAIC. If the new AIC is smaller, we replace
BESTLAG with the current number of lags and BESTAIC with the current value
for AIC. Note that we used a second (rather simple) IF inside the instruction
block controlled by the main ELSE; structures can be nested as deeply as you
need, though once you get above five levels it can be very hard to keep track of
which is controlling what. More advanced structures called PROCEDURES and
FUNCTIONS are often handy for moving parts of a very involved calculation
into a separate subblock of code to make the program flow easier to follow and
the whole program easier to maintain.
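The same structure works for other information criteria. As a hedged sketch
(not part of Example 5.2), the loop below minimizes the Schwarz (Bayesian)
criterion, computed here as -2 log L + k log T, instead of the AIC. It assumes
that the CMOM instruction above has already been executed and that %NOBS holds
the number of observations after each LINREG; BIC and BESTBIC are names
introduced only for this illustration.

```rats
compute bestbic=%na,bestlag=0
do lags=0,12
   if lags==0 {
      linreg(noprint,cmom) dlrgdp
      # constant
   }
   else {
      linreg(noprint,cmom) dlrgdp
      # constant dlrgdp{1 to lags}
   }
   compute bic = -2.0*%logl + %nreg*log(%nobs)
   * The %VALID test makes the first pass always count as the best so far
   if .not.%valid(bestbic).or.bic<bestbic
      compute bestlag=lags,bestbic=bic
end do lags
disp "Minimum BIC lag length =" bestlag
```

Starting BESTBIC at %NA and testing it with %VALID means the comparison can sit
outside the IF-ELSE, so the criterion is computed in only one place.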

5.4 WHILE and UNTIL Loops


The DO loop is appropriate if you know exactly how many passes you want to
make. However, there are circumstances in which the number of repetitions
is unclear. For example, a common way to select the lag length in an AR(p)
model is to estimate the autoregression using the largest value of p deemed
reasonable. If the t-statistic on the coefficient for lag p is insignificant at some
pre-specified level, estimate an AR(p-1) and repeat the process until the last lag
is statistically significant. This is known as a general-to-specific model selection
process. You can do this with the help of a WHILE or UNTIL instruction.
The syntax for a WHILE block is:
WHILE condition {
block of statements executed as long as condition is true
}
The syntax for an UNTIL block is:
UNTIL condition {
block of statements executed until condition is true
}
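As a minimal sketch of a loop whose number of passes isn't known in advance
(the names N and TOTAL are ours, and the example is unrelated to the data set
used in this chapter), the following keeps adding uniform random draws until
their sum reaches 10:

```rats
compute n=0,total=0.0
while total<10.0 {
   compute n=n+1
   compute total=total+%uniform(0.0,1.0)
}
end while
disp "Number of uniform draws needed to reach 10:" n
```

Because each draw is below 1, the loop needs more than ten passes, but the
exact count differs from run to run, which is why a DO loop with fixed limits
would not fit here.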
As part of Example 5.3, we'll first use WHILE to do the lag selection as described
above, picking a lag length no larger than 12 for the growth rate in the deflator
(which we'll call DLDEFLATOR). A possible way to write this is:

set dldeflator = log(deflator)-log(deflator{1})
*
compute lags=13,signif=1.00
while signif>.05 {
   compute lags=lags-1
   linreg(noprint) dldeflator
   # constant dldeflator{1 to lags}
   compute signif=%ttest(%tstats(%nreg),%ndf)
   disp "Significance of lag" lags "=" signif
}
end while

which will give us

Significance of lag 12 = 0.14249
Significance of lag 11 = 0.73725
Significance of lag 10 = 0.48749
Significance of lag 9 = 0.78494
Significance of lag 8 = 0.43323
Significance of lag 7 = 0.28264
Significance of lag 6 = 0.98133
Significance of lag 5 = 0.04629

The first time through the loop, the variable SIGNIF is compared to 0.05. Since
SIGNIF was initialized to be larger than 0.05, all of the instructions within
the block are executed. So LAGS is decreased from 13 to 12 and a 12 lag AR
is estimated on DLDEFLATOR. %TSTATS(%NREG) is the t-statistic on the final
coefficient in the regression; we compute its two-tailed significance level with
%TTEST using as the degrees of freedom for the t the variable %NDF that's set by
the LINREG. Note that both %NREG and %NDF will change (automatically) with
the number of lags in the regression; you don't have to figure them out
yourself. For illustration, this now displays the number of lags and the significance
level. In a working program, you probably wouldn't do that, but it's a good idea
to put something like that in until you're sure you have the loop correct.
We're now at the end of the block so control loops up to the WHILE check at the
top. With LAGS=12 on the first pass, the significance is .14249 so the WHILE
condition is still true. Thus we start a second pass through, decreasing LAGS
to 11 and redoing the calculation. This repeats until LAGS is 5. This gives
SIGNIF=.04629 so when we loop to the top, the WHILE condition finally fails.
Control passes to the first instruction after the controlled block of instructions,
which means we're done with the whole compiled subprogram (see footnote 2).
You may have noticed a problem with the WHILE loop: what happens if none of
the final coefficients is ever significant? That's certainly possible if the series
is white noise. Most loops of this nature need a safeguard against running
forever in case the condition isn't met. Here, the loop won't run forever, but it
will get to the point where the regression uses lags from 1 to 0. RATS actually
interprets that the way it would be intended here, which is to use no lags at
all. However, the t-test would then be on the CONSTANT (since it would be the
last and only coefficient), rather than a lag.
One possibility would be to change the condition to

while signif>.05.and.lags>1

which will prevent it from running the regression with no lags. However, it
won't give us the right answer for the number of lags, because we can't tell
(based upon this condition alone) whether LAGS is 1 when we drop out of the
loop because lag 1 was significant, or whether it was because lag 1 wasn't
significant, and we triggered the second clause. If it's the latter, we want the
report to be LAGS=0.
In most such cases, the secondary condition to break out of the loop is best done
with a separate BREAK instruction, controlled by an IF. BREAK does exactly
what it sounds like it would do: it breaks out of the current (innermost) loop.
Here, we insert the test right after LAGS is reduced. You can check that this
does get the result correct. If we get to lag 1 and it's significant, we break the
loop based upon the WHILE condition while LAGS is still 1. If we get to lag 1
and it's not significant, we break the loop when LAGS is reduced to zero, which
is what we want.
compute lags=13,signif=1.00
while signif>.05 {
compute lags=lags-1
if lags==0
break
linreg(noprint) dldeflator
# constant dldeflator{1 to lags}
compute signif=%ttest(%tstats(%nreg),%ndf)
disp "Significance of lag" lags "=" signif
}
end while
Footnote 2: The END WHILE is the signal that you want to exit compile mode. It's
only needed if the WHILE instruction isn't already inside some other compiled
structure.
Not surprisingly, there is more than one way to do this. BREAK also can be
applied to DO and DOFOR loops. Instead of using WHILE, we could use a DO loop
which counts backwards through the lags and break out of it when we get a
significant coefficient. A first go at this would be something like:
do lags=12,1,-1
linreg(noprint) dldeflator
# constant dldeflator{1 to lags}
compute signif=%ttest(%tstats(%nreg),%ndf)
disp "Significance of lag" lags "=" signif
if signif<.05
break
end do lags

This will give exactly the same results as before. The only problem is again
with the case where none of the lags is significant. It's not that the loop runs
forever, since it will quit after the pass where LAGS is 1 as we planned. It's just
that, because of the way that the DO loop runs (section 5.2), the value of LAGS
will be 1 if the only significant lag is 1, and will also be 1 if none of the lags
are significant; on the normal exit from the loop, the value of the index is the
value from the last pass through.
An alternative which gets all cases correct is:
compute p=0
do lags=12,1,-1
linreg(noprint) dldeflator
# constant dldeflator{1 to lags}
compute signif=%ttest(%tstats(%nreg),%ndf)
disp "Significance of lag" lags "=" signif
if signif<.05 {
compute p=lags
break
}
end do lags
disp "Number of lags chosen =" p

Instead of using the loop index LAGS to represent the chosen number of lags,
this uses the separate variable P. This is originally set to 0 and is reset only
if we hit a significant lag.
WHILE and UNTIL are similar, but there are two differences, one minor and one
more important:

1. The condition for WHILE is true if the loop is to continue, while for UNTIL
it is true if the loop is to terminate.
2. The body of the UNTIL loop is always executed at least once, as the test
is done at the end of a pass; WHILE tests at the top and so could drop out
without ever executing the body.
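The second difference is easy to see in a small sketch (the variable K is a
name chosen here for illustration). Both loops start with a value for which
the loop should do no work, but only the WHILE actually skips its body:

```rats
compute k=5
* The condition k<5 is false at the outset, so this body is never executed
while k<5 {
   disp "Inside WHILE, k =" k
   compute k=k+1
}
end while
* The body runs once before k>=5 is tested, so one line is displayed
until k>=5 {
   disp "Inside UNTIL, k =" k
   compute k=k+1
}
end until
```

If a calculation must be skipped entirely in some cases, WHILE (or an IF
wrapped around the UNTIL) is the safer choice.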
The same analysis done using an UNTIL loop is:

compute lags=13,signif=1.00
until signif<.05 {
compute lags=lags-1
if lags==0
break
linreg(noprint) dldeflator
# constant dldeflator{1 to lags}
compute signif=%ttest(%tstats(%nreg),%ndf)
disp "Significance of lag" lags "=" signif
}
end until

Which of these is best to use? All of them get the job done correctly and are
roughly the same number of lines so it's largely a matter of taste. Automatic
lag selection is extremely common in modern econometrics, particularly in unit
root and cointegration testing, so this shows up in quite a few RATS procedures.
We generally end up using the DO loop for two reasons: the coding is a bit
clearer, and the added step of saving the chosen lags into a separate variable
isn't (in practice) really an added step, since that will almost always be done
anyway so the selected number of lags can be used later.
After the UNTIL or WHILE examples, we can estimate the chosen regression
with:
compute p=lags
linreg(title="Least Squares with Automatic Lag Selection") $
dldeflator
# constant dldeflator{1 to p}

We would do the same after the DO, but without the compute p=lags.
Linear Regression - Estimation by Least Squares with Automatic Lag Selection
Dependent Variable DLDEFLATOR
Quarterly Data From 1961:03 To 2012:04
Usable Observations                       206
Degrees of Freedom                        200
Centered R^2                        0.7980032
R-Bar^2                             0.7929533
Uncentered R^2                      0.9376799
Mean of Dependent Variable       0.0088412547
Std Error of Dependent Variable  0.0059200109
Standard Error of Estimate       0.0026937464
Sum of Squared Residuals         0.0014512539
Regression F(5,200)                  158.0229
Significance Level of F             0.0000000
Log Likelihood                       929.6086
Durbin-Watson Statistic                1.9964

    Variable              Coeff       Std Error     T-Stat     Signif
************************************************************************************
1.  Constant           0.000685461  0.000353780   1.93754  0.05408801
2.  DLDEFLATOR{1}      0.576388906  0.070224827   8.20777  0.00000000
3.  DLDEFLATOR{2}      0.166608793  0.080443169   2.07114  0.03962957
4.  DLDEFLATOR{3}      0.112942556  0.080816573   1.39752  0.16380637
5.  DLDEFLATOR{4}      0.210004031  0.080784979   2.59954  0.01003079
6.  DLDEFLATOR{5}     -0.142165754  0.070898975  -2.00519  0.04629106

Note that, while the final lag in this regression is significant at .05, lag 3 isn't.
It's possible to do a more involved lag pruning to get rid of any other apparently
insignificant lags using, for instance, stepwise regression with STWISE. However,
that's almost never done in practice; you use an automatic procedure to
select just the length, not the full set of lags.

5.5 Estimating a Threshold Autoregression


To provide another example of the topics in this chapter, we will estimate a
threshold autoregression. The threshold autoregressive (TAR) model has become
popular as it allows for different degrees of autoregressive decay. Consider
a two-regime version of the TAR model developed by Tong (1983):

$$y_t = I_t\left[\alpha_0 + \sum_{i=1}^{p}\alpha_i y_{t-i}\right] + (1-I_t)\left[\beta_0 + \sum_{i=1}^{p}\beta_i y_{t-i}\right] + \varepsilon_t \tag{5.4}$$

where

$$I_t = \begin{cases} 1 & \text{if } y_{t-1} \ge \tau \\ 0 & \text{if } y_{t-1} < \tau \end{cases} \tag{5.5}$$

$y_t$ is the series of interest, the $\alpha_i$ and $\beta_i$ are coefficients to be estimated, $\tau$ is the
value of the threshold, $p$ is the order of the TAR model and $I_t$ is the Heaviside
indicator function.
How is this different from the STAR models in Section 3.5? The TAR is the limit
as $\gamma \to \infty$ in the LSTAR model. It seems like there might not be much point
to the TAR when it's a special case of the STAR, but it's a special case that (as
we saw) isn't well-handled by non-linear least squares because the objective
function isn't differentiable (at the limit) and, in fact, isn't even continuous.
The nice thing about the STAR is that, if it is a good explanation of the data
with a finite value of $\gamma$, it can be estimated successfully by NLLS; however, if it
requires an infinite value of $\gamma$, or if there is no threshold effect, the estimation
fails completely. On the other hand, the sum of squares (or log likelihood)
of the TAR model is easily computed given $\tau$ (just two standard least squares
regressions over the two branches), but is discontinuous in $\tau$ itself. The only
way to estimate $\tau$ is with a grid search over the observed values of $y_{t-1}$.
Example 5.4 illustrates the estimation of a TAR model for the growth rate of
the money supply. The first part of the program reads in the data set and
constructs the variable gm2 using:

set gm2 = log(m2) - log(m2{1})

The next line in the program estimates the gm2 series as an AR({1,3}) process.

linreg gm2
# constant gm2{1 3}

If you experiment a bit, you will see that the AR({1,3}) specification is quite
reasonable. If you are going to estimate a TAR model, it is standard to start
with a parsimonious linear specification. First, suppose that we want the value
of the threshold to equal the sample mean (0.016788). This might be the case
if we were certain that greater than average money growth behaved differently
from below average growth. Also, suppose you knew the delay factor used to
set the Heaviside indicator was 2 (see footnote 3).
We can create the indicator It (called PLUS) using

stats gm2
compute tau=%mean
set plus = gm2{2}>=tau

We cannot use the symbol I to represent the indicator (since I, along with J
and T, is a reserved integer variable), so we use the label PLUS. For each possible
entry in the data set, the SET instruction compares $gm2_{t-2}$ to the value in TAU.
If $gm2_{t-2}$ is greater than or equal to TAU, the value of $plus_t$ is equal to 1;
otherwise it's zero. Next, we create $(1 - I_t)$ as the series MINUS using:

set minus = 1 - plus

Footnote 3: We actually experimented to find the best delay.

There are two ways to estimate the model: you can do two separate estimations
with LINREG using the SMPL=PLUS option for one and SMPL=MINUS for
the other, adding the two sums of squared residuals to get the full model sum
of squares, or you can generate "dummied-out" versions of the regressors for
the two periods and do a single LINREG. We'll first show a "brute force"
implementation of the second of the two by creating the variables $I_t\,gm2_{t-1}$,
$I_t\,gm2_{t-3}$, $(1 - I_t)\,gm2_{t-1}$ and $(1 - I_t)\,gm2_{t-3}$:

set y1_plus = plus*gm2{1}
set y3_plus = plus*gm2{3}
set y1_minus = minus*gm2{1}
set y3_minus = minus*gm2{3}

Now we can estimate the regression using:

linreg gm2
# plus y1_plus y3_plus minus y1_minus y3_minus

Linear Regression - Estimation by Least Squares
Dependent Variable GM2
Quarterly Data From 1961:01 To 2012:04
Usable Observations                       208
Degrees of Freedom                        202
Centered R^2                        0.4665048
R-Bar^2                             0.4532994
Uncentered R^2                      0.8914579
Mean of Dependent Variable       0.0168372348
Std Error of Dependent Variable  0.0085299382
Standard Error of Estimate       0.0063069683
Sum of Squared Residuals         0.0080351254
Regression F(5,202)                   35.3270
Significance Level of F             0.0000000
Log Likelihood                       761.6537
Durbin-Watson Statistic                1.9522

    Variable         Coeff         Std Error      T-Stat     Signif
************************************************************************************
1.  PLUS          0.0049868654  0.0023541720   2.11831  0.03537110
2.  Y1_PLUS       0.5062254697  0.0799787393   6.32950  0.00000000
3.  Y3_PLUS       0.1542573375  0.0790118150   1.95233  0.05228060
4.  MINUS         0.0017519286  0.0014967659   1.17048  0.24318846
5.  Y1_MINUS      0.8237612657  0.0958011189   8.59866  0.00000000
6.  Y3_MINUS      0.2164646725  0.1009068421   2.14519  0.03313113

At this point, you might want to perform the standard diagnostic checks and
perhaps eliminate the MINUS coefficient since its t-statistic is quite low. However,
our goal here is to illustrate programming techniques, not to obtain the best fitting
TAR model for money growth.
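Written out for this example, the regression just estimated has the form shown
below. The symbols for the two sets of coefficients, and the matching of those
symbols to the regressor names, are our own notation, given as an aid to
reading the output rather than taken from the original program.

```latex
\[
gm2_t = I_t\left[\alpha_0 + \alpha_1\,gm2_{t-1} + \alpha_3\,gm2_{t-3}\right]
      + (1-I_t)\left[\beta_0 + \beta_1\,gm2_{t-1} + \beta_3\,gm2_{t-3}\right]
      + \varepsilon_t,
\qquad
I_t = \begin{cases} 1 & \text{if } gm2_{t-2} \ge \tau \\ 0 & \text{if } gm2_{t-2} < \tau \end{cases}
\]
```

Here PLUS, Y1_PLUS and Y3_PLUS carry the intercept and the lag 1 and lag 3
coefficients for the branch with the indicator equal to 1, while MINUS, Y1_MINUS
and Y3_MINUS carry the corresponding coefficients for the other branch. The
threshold is fixed at the sample mean of GM2.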

5.5.1 Estimating the Threshold

One problem with the above model is that the threshold may not be known.
When $\tau$ is unknown, Chan (1993) shows how to obtain a super-consistent
estimate of the threshold parameter. For a TAR model, the procedure is to order
the observations from smallest to largest such that:

$$y^1 < y^2 < y^3 < \dots < y^T \tag{5.6}$$

For each value of $y^j$, let $\tau = y^j$, set the Heaviside indicator according to this
potential threshold and estimate a TAR model. The regression equation with
the smallest residual sum of squares contains the consistent estimate of the
threshold. In practice, the highest and lowest 15% of the $y^j$ values are excluded
from the grid search to ensure an adequate number of observations on each side
of the threshold.
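In symbols, the search just described can be summarized as follows. This is a
sketch in our own notation: the two terms in brackets are the sums of squared
residuals from the regressions over the observations with the indicator equal
to 1 and to 0 respectively, and the subscripts "lo" and "hi" mark the first
and last positions kept after trimming.

```latex
\[
\hat{\tau} \;=\; \arg\min_{\tau \,\in\, \{\, y^{j} \,:\, j_{lo} \le j \le j_{hi} \,\}}
\Bigl[\, RSS_{+}(\tau) + RSS_{-}(\tau) \,\Bigr]
\]
```

Because the indicator changes only when the threshold crosses one of the
observed values, checking each ordered value in the trimmed range covers every
distinct value that the objective can take there.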
Note that this is quite a different form of grid search than we saw in Chapter 3
(page 83). Because the objective function there was continuous, each different
grid value likely would produce a different value of the objective; we can only
hope that the grid isn't so coarse that it misses the minimum. Here, however, the
objective function is discontinuous and we know exactly at which points it can
change. Thus, the grid search that we're conducting here, over the observed
values of the threshold, is guaranteed to find the minimum. It is, however, a
bit harder to set up. The following two lines copy the threshold series into a
new series called TAUS and sort it (in increasing order, which is the default for
the ORDER instruction).

set taus = gm2{2}
order taus

We now need to figure out which entries of TAUS we can use, given that we want
to eliminate 15% at each end. The quickest and most flexible way to do that is
to use the INQUIRE instruction to figure out what the defined range of TAUS is.
INQUIRE is described in greater detail in this chapter's Tips and Tricks section
(page 169).

inquire(series=taus) tstart tend
compute tlow=tstart+fix(%nobs*.15),thigh=tend-fix(%nobs*.15)

TSTART will be the first defined entry of TAUS (here 4 because GM2 starts at 2
and the threshold has a delay of 2), so TLOW will be 15% of the way into the data
set from the lowest value and THIGH 15% of the way in from the highest. Note
that this is not 15% of the gap in the values between the highest and lowest,
but 15% of the entry count. If there are many values at (for instance) the low
end, we could be starting at a value not much above the minimum, but that's
OK since we are excluding these largely so that we don't run regressions with
almost no data. The FIX function is needed because the entry numbers are
integer-valued and %NOBS*.15 is real; FIX(x) converts x to an integer by
dropping the fractional part, which for the positive values here rounds down.
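When setting up a search like this, it can help to display the limits before
running the loop. The lines below are a sketch that assumes TSTART, TEND, TLOW
and THIGH were set by the instructions above; NTRIM is a name introduced here
for illustration. If %NOBS is 208, as in the regression output shown earlier,
FIX(%NOBS*.15) is FIX(31.2), which is 31.

```rats
compute ntrim=fix(%nobs*.15)
disp "Defined range of TAUS runs from" tstart "to" tend
disp "Entries trimmed at each end =" ntrim
disp "Grid runs over sorted entries" tlow "to" thigh
disp "Number of candidate thresholds =" thigh-tlow+1
```

Note that %NOBS comes from the most recent estimation instruction, so these
lines (like the COMPUTE above) need to follow the LINREG on GM2.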
The search can be done with:
compute rssbest=%na
do itau=tlow,thigh
compute tau=taus(itau)
set plus = gm2{2}>=tau
set minus = 1 - plus
*
set y1_plus = plus*gm2{1}
set y3_plus = plus*gm2{3}
set y1_minus = minus*gm2{1}
set y3_minus = minus*gm2{3}
linreg(noprint) gm2
# plus y1_plus y3_plus minus y1_minus y3_minus
if .not.%valid(rssbest).or.%rss<rssbest
compute rssbest=%rss,taubest=tau
end do itau

Once the program exits the loop, we can display the consistent estimate of the
threshold with

disp "We have found the attractor"
disp "Threshold=" taubest

We have found the attractor
Threshold= 0.01660

Finally, we can estimate the TAR model with the consistent estimate of the
threshold using

compute tau=taubest
set plus = gm2{2}>=tau
set minus = 1 - plus
*
set y1_plus = plus*gm2{1}
set y3_plus = plus*gm2{3}
set y1_minus = minus*gm2{1}
set y3_minus = minus*gm2{3}
linreg(title="Threshold autoregression") gm2
# plus y1_plus y3_plus minus y1_minus y3_minus
Linear Regression - Estimation by Threshold autoregression
Dependent Variable GM2
Quarterly Data From 1961:01 To 2012:04
Usable Observations                       208
Degrees of Freedom                        202
Centered R^2                        0.4687938
R-Bar^2                             0.4556451
Uncentered R^2                      0.8919236
Mean of Dependent Variable       0.0168372348
Std Error of Dependent Variable  0.0085299382
Standard Error of Estimate       0.0062934232
Sum of Squared Residuals         0.0080006495
Regression F(5,202)                   35.6533
Significance Level of F             0.0000000
Log Likelihood                       762.1009
Durbin-Watson Statistic                1.9391

    Variable         Coeff         Std Error      T-Stat     Signif
************************************************************************************
1.  PLUS          0.0058913925  0.0023302839   2.52819  0.01223032
2.  Y1_PLUS       0.4745687514  0.0791112032   5.99876  0.00000001
3.  Y3_PLUS       0.1502105166  0.0787585904   1.90723  0.05790997
4.  MINUS         0.0015219242  0.0015023522   1.01303  0.31225885
5.  Y1_MINUS      0.8625034171  0.0969960051   8.89215  0.00000000
6.  Y3_MINUS      0.1885968485  0.1012876103   1.86199  0.06405639

5.5.2 Improving the Program

The program described in Section 5.5.1 is rather crude. It works, but far too
much of it is hard-coded for a specific example. For instance, it uses the variable
GM2 almost 20 times and the threshold delay of 2 is repeated four times. We can
also enhance the program by creating a series of the sums of squared residuals
for different values of $\tau$, so we can see how sensitive the objective is to the
threshold value. The revised program is Example 5.5.
One question you might have is whether we should have planned ahead for
this when we originally wrote the program. How you handle it will generally
depend upon how comfortable you are with the more flexible coding that we
will be doing. One problem with trying to start with the "improved" program
is that the most complicated part of this isn't making the specification more
flexible; it's getting the coding for finding the optimal threshold correct. If you
try to do two things at once:

1. work out and debug the optimal threshold code
2. write a program easily adapted to other data

you might have a hard time getting either one correct. Again, that will depend
upon your skill level with RATS programming. However, the graph of the sums
of squares is definitely something that good programming practice would tell
you to wait on; that's easy to add once everything else is done.
The first thing we will do differently is to add a DEFINE option to the initial
LINREG:
linreg(define=baseeq) gm2
# constant gm2{1 3}
compute rssols=%rss

This defines BASEEQ as an EQUATION data type, which keeps track of (among
other things) the form of the equation and the dependent variable. The last
line above also saves the sum of squared residuals from the least squares
estimation.
To allow for greater flexibility in setting the threshold variable and delay, we
can do the following:

set threshvar = gm2
compute d=2

From this point on, if we use THRESHVAR{D} whenever we need the threshold
expression, then we can change the threshold by changing just these two lines.
As we described earlier, there are two ways to estimate the threshold
regression. The method from the previous section was to create dummied-out
regressors and do a combined LINREG. The alternative is to run two LINREGs
over the "plus" and "minus" samples. While we'll show later how to create the
dummies more flexibly, it's much easier to do the two sample regression.
We'll skip over the estimation with $\tau$ at the mean and jump straight into the
code for finding the optimal threshold. You'll see two differences with the
setup code:

clear taus
set taus = threshvar{d}
order taus
inquire(series=taus) tstart tend
*
compute tlow=tstart+fix(%nobs*.15),thigh=tend-fix(%nobs*.15)

First, we added a CLEAR instruction for the TAUS series. That will allow us to
change the threshold variable or delay without having to worry whether TAUS
still has left-over values from the previous analysis (see footnote 4). Second,
the SET TAUS now uses THRESHVAR{D} rather than hard-coded values from the example.
We also need to add one more instruction to initialize a series for the sums of
squares as they are generated:

set rsstau = %na

Footnote 4: You could also avoid any problems like this by using the File-Clear
Memory menu item or by clicking on the corresponding toolbar button before
re-running the program with any changes, but the CLEAR instruction will work
whether or not you do that.
Note that CLEAR RSSTAU would also work fine. This sets up the series RSSTAU
and sets all values to NA; the only data points which will have non-missing
values will be the ones where we estimate a threshold regression. The simplified
loop for finding the attractor is:

compute rssbest=rssols
do itau=tlow,thigh
compute tau=taus(itau)
set plus = threshvar{d}>=tau
linreg(noprint,equation=baseeq,smpl=plus)
compute rssplus=%rss
linreg(noprint,equation=baseeq,smpl=.not.plus)
compute %rss=%rss+rssplus
compute rsstau(itau)=%rss
if %rss<rssbest
compute rssbest=%rss,taubest=tau
end do itau

What's different here? First, the test for whether a new sum of squares is
the best that we've seen is simplified a bit by starting with RSSBEST equal to
RSSOLS. Since all models with a break have to be at least as good as the same
model with no breaks, we know that this will be replaced right away. In the
previous coding, we started with RSSBEST=%NA, which then required testing
RSSBEST for %VALID. Since we have available a value which we're computing
anyway that we know is finite but bigger than the optimal value, we might as
well use it.
Second, we're using THRESHVAR{D} rather than the specific GM2{2}. Third, the
sum of squares for the threshold regression is computed using:

linreg(noprint,equation=baseeq,smpl=plus)
compute rssplus=%rss
linreg(noprint,equation=baseeq,smpl=.not.plus)
compute %rss=%rss+rssplus

The first LINREG runs the regression over the sample where PLUS is non-zero
(in this case, "non-zero" always means "one"), and the second runs it over the
remainder of the sample (.NOT.PLUS is true wherever PLUS is zero).
After the second COMPUTE, %RSS will be equal to the sum of the %RSS values
from the two regressions. Finally

compute rsstau(itau)=%rss

saves the value of %RSS into the entry of RSSTAU that corresponds to the
current value of TAU being examined. Note that since TAUS is a sorted copy, these
don't represent the original time period of the data, but we're doing a SCATTER
plot, so all that matters is that RSSTAU matches up with TAUS.
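Once the loop has finished, a few DISPLAY instructions give a quick check on
what the search found. This is a sketch that assumes the loop above has just
been run; PCTGAIN is a name introduced here, and the percentage reduction is
only a descriptive summary, not a formal test for a threshold effect.

```rats
compute pctgain=100.0*(rssols-rssbest)/rssols
disp "Estimated threshold =" taubest
disp "RSS, linear autoregression =" rssols
disp "RSS, threshold autoregression =" rssbest
disp "Percent reduction in RSS =" pctgain
```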
[Figure 5.4: Threshold Values vs Sums of Squares. Horizontal axis: threshold
values (ticks from 0.0075 to 0.0225); vertical axis: sum of squared residuals
(ticks from 0.0080 to 0.0087).]

Not surprisingly, this produces the same result as the cruder coding. We add
the graph (Figure 5.4) using:

scatter(footer="Threshold Values vs Sums of Squares",style=step)
# taus rsstau

Note that this uses STYLE=STEP rather than STYLE=LINE. STYLE=STEP gives
a graph of the function as it should look, which is a step function between the
observed values for the threshold.
The process for generating the final regression with the dummied-out break
variables uses some more advanced programming features which will be
covered in the next chapter. This will give you a taste of some of the special
capabilities that RATS has, particularly for dealing with time series.
As before, we need to create a PLUS series which is the dummy for the "above"
branch:
compute tau=taubest
set plus = threshvar{d}>=tau

The number of dummied-out series that we need is $2 \times$ the number of
regressors in the base model. That is most conveniently done by creating a
RECTANGULAR matrix of SERIES with the dimensions we need. You'll notice
that this next code segment uses a specialized set of functions for pulling
information out of the saved equation. See this chapter's Tips and Tricks
section (page 170) for a more complete description of those.
dec rect[series] expand(%eqnsize(baseeq),2)
do i=1,%eqnsize(baseeq)
   set expand(i,1) = %eqnxvector(baseeq,t)(i)*plus
   set expand(i,2) = %eqnxvector(baseeq,t)(i)*(1-plus)
   labels expand(i,1) expand(i,2)
   # "PLUS_"+%eqnreglabels(baseeq)(i) $
     "MINUS_"+%eqnreglabels(baseeq)(i)
end do i

EXPAND(I,1) is the "plus" branch and EXPAND(I,2) is the "minus" branch
for each of the regressors. The LABELS instruction is then used to give
more informative output labels to those two series. The + operator, when
applied to strings, does concatenation, so this will create labels which are
PLUS_Constant and MINUS_Constant when the regressor's standard label is
Constant, PLUS_GM2{1} and MINUS_GM2{1} when the regressor's standard label
is GM2{1}, etc.
We run the final regression with the optimal threshold using:

linreg(title="Threshold Regression") %eqndepvar(baseeq)
# expand

which gives us
Linear Regression - Estimation by Threshold Regression
Dependent Variable GM2
Quarterly Data From 1961:01 To 2012:04
Usable Observations 208
Degrees of Freedom 202
Centered R2 0.4687938
R-Bar2 0.4556451
Uncentered R2 0.8919236
Mean of Dependent Variable 0.0168372348
Std Error of Dependent Variable 0.0085299382
Standard Error of Estimate 0.0062934232
Sum of Squared Residuals 0.0080006495
Regression F(5,202) 35.6533
Significance Level of F 0.0000000
Log Likelihood 762.1009
Durbin-Watson Statistic 1.9391

    Variable           Coeff          Std Error      T-Stat     Signif
************************************************************************************
1.  PLUS_Constant      0.0058913925   0.0023302839   2.52819    0.01223032
2.  PLUS_GM2{1}        0.4745687514   0.0791112032   5.99876    0.00000001
3.  PLUS_GM2{3}        0.1502105166   0.0787585904   1.90723    0.05790997
4.  MINUS_Constant     0.0015219242   0.0015023522   1.01303    0.31225885
5.  MINUS_GM2{1}       0.8625034171   0.0969960051   8.89215    0.00000000
6.  MINUS_GM2{3}       0.1885968485   0.1012876103   1.86199    0.06405639

You can experiment with different sets of lags in the AR and different delay
values and see how this is able to adapt to them.
As we said in the Preface, you should try not to reinvent the wheel. We've
shown a program to estimate a threshold autoregression, but there already
exist several procedures which may be able to do what you need. The
@THRESHTEST procedure both estimates a general threshold regression (not
just an autoregression) and can compute bootstrapped significance levels. @TAR
estimates a threshold autoregression including a test for the best threshold
delay. For specific applications, there are @EndersGranger and @EndersSiklos,
which do threshold unit root and cointegration tests respectively.

5.6 Tips and Tricks

The Instruction INQUIRE


If you look at the code for many of the popular RATS procedures (such as
DFUNIT.SRC), you'll see that one of the first executable instructions is an
INQUIRE. If you do an instruction like LINREG or STATISTICS, RATS will
automatically determine the maximum range given the series involved. However,
if, for instance, you need to run a DO loop over a range of entries, you need
to find out in advance the specific range that's available. That's what
INQUIRE is designed to do.

INQUIRE(options) value1<<p1 value2<<p2
# list of variables in regression format (only with REGLIST)

The <<p1 and <<p2 are only used in procedures, so we'll discuss them later.
Thus, we're looking at the basic instruction being

INQUIRE(options) value1 value2
# list of variables in regression format (only with REGLIST)

In the example in this chapter, we used:

inquire(series=taus) tstart tend

which makes TSTART equal to the first entry of TAUS which isn't an NA, and
TEND equal to the last valid entry.
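
As a rough illustration of that search, here is a minimal Python sketch (not RATS) that finds the first and last defined entries of a series, with None standing in for NA. The function name defined_range and the sample values are assumptions made for this example.

```python
# Sketch only: what INQUIRE(SERIES=...) reports, using 1-based entry numbers.
def defined_range(series):
    """Return (start, end) of the first and last non-missing entries, or None."""
    valid = [i for i, v in enumerate(series, start=1) if v is not None]
    if not valid:
        return None  # nothing is defined
    return valid[0], valid[-1]


taus = [None, None, 0.004, 0.009, 0.013, None]  # made-up values, None = NA
tstart, tend = defined_range(taus)
print(tstart, tend)
```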
If you need the limit of a set of series, use the REGRESSORLIST option (which
we usually shorten to REGLIST) and list the variables, which can include
lag/lead fields, on a supplementary line. For instance, to determine the
largest estimation range for the model used in Example 5.4, we would do

inquire(reglist) rstart rend
# gm2 constant gm2{1 3}

Note that you need to include the dependent variable as well; if you don't,
REND will actually be one entry past the end of the data, since entry T+1 is
valid for the lags.
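
The arithmetic behind that warning can be sketched in a few lines of Python (not RATS). The sketch assumes GM2 is defined over entries 2 to 212, since the growth rate loses entry 1; the helper names lagged_range and common_range are hypothetical.

```python
# Sketch only: the range of a lagged series and the intersection of several ranges.
def lagged_range(start, end, lag):
    """Entries over which series{lag} is defined when the series covers start..end."""
    return start + lag, end + lag


def common_range(ranges):
    """Intersection of several (start, end) ranges."""
    return max(s for s, _ in ranges), min(e for _, e in ranges)


gm2 = (2, 212)  # assumed defined range of GM2
lags_only = [lagged_range(*gm2, 1), lagged_range(*gm2, 3)]
with_dependent = [gm2] + lags_only

print(common_range(lags_only))       # the end runs one entry past the data
print(common_range(with_dependent))  # the end is capped by the dependent variable
```

With the dependent variable included, the common range has 208 entries, which agrees with the usable observations reported for this model.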
There's also an EQUATION option which can be used to determine the maximum
range permitted by the variables (both dependent and explanatory) in an
EQUATION. For instance,

inquire(equation=baseeq) estart eend

If it's important to identify missing values within the data range, you can add
the VALID option to any of those. For instance,

inquire(valid=esmpl,equation=baseeq) estart eend

would define ESTART and EEND as the outer common limits of the variables in
BASEEQ with ESMPL created as a dummy variable with 1's in the entries which
are valid across all those variables and 0's in the entries which aren't. In this
example, since there are no missing values inside a series, ESMPL would just
be all 1's between ESTART and EEND.

EQUATION functions
We defined an EQUATION early in Example 5.5 to save the base specification
that we extended with breaks. There is a whole set of functions which can
be used to take information out of (or, less often, put information into) an
EQUATION. All of these have names starting with %EQN. Here, we used the
rather simple %EQNSIZE(eqn) which returns the size (number of explanatory
variables) of the equation. The two more important functions used in the
example are %EQNXVECTOR and %EQNREGLABELS.
%EQNXVECTOR(eqn,t) returns the VECTOR of explanatory variables for
equation eqn at entry t. An eqn of 0 can be used to mean the last regression run.
The instructions

set expand(i,1) = %eqnxvector(baseeq,t)(i)*plus
set expand(i,2) = %eqnxvector(baseeq,t)(i)*(1-plus)

are inside a loop over the time subscript T. The %EQNXVECTOR(baseeq,t) pulls
out the vector of explanatory variables for BASEEQ at T, which (in this case)
means $[1, gm2_{t-1}, gm2_{t-3}]$. The further (I) subscript takes one of those
three elements out.
%EQNREGLABELS(eqn) returns a VECTOR of STRINGS which are the regressor
labels used in standard regression output, combining the variable
name and (if used) lag number, such as GM2{1} and GM2{3} for the lags of
GM2. Again, we use subscript I applied to the result of that to pull out the
string that we need.
There are several related functions which can also be handy. In all cases, eqn is
either an equation name, or 0 for the last estimated (linear) regression. These
also evaluate at a specific entry T.
%EQNPRJ(eqn,t) evaluates the fitted value $X_t\beta$ for the current set of
coefficients for the equation.
%EQNVALUE(eqn,t,beta) evaluates $X_t\beta$ for an input set of coefficients.
%EQNRESID(eqn,t) evaluates the residual $y_t - X_t\beta$ for the current set of
coefficients, where $y_t$ is the dependent variable of the equation.
%EQNRVALUE(eqn,t,beta) evaluates the residual $y_t - X_t\beta$ for an input set
of coefficients.
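
The two calculations behind these functions are just an inner product and a subtraction. A minimal Python sketch (not RATS) is given below; the function names eqn_value and eqn_residual and all of the numbers are made up for illustration.

```python
# Sketch only: fitted value X_t*beta and residual y_t - X_t*beta at one entry.
def eqn_value(x_t, beta):
    """Inner product of the regressor vector at entry t with a coefficient vector."""
    return sum(x * b for x, b in zip(x_t, beta))


def eqn_residual(y_t, x_t, beta):
    """Dependent variable at entry t minus the fitted value."""
    return y_t - eqn_value(x_t, beta)


x_t = [1.0, 0.02, 0.01]   # hypothetical [1, gm2(t-1), gm2(t-3)]
beta = [0.005, 0.5, 0.25] # hypothetical coefficients
fitted = eqn_value(x_t, beta)
resid = eqn_residual(0.02, x_t, beta)
print(fitted, resid)
```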

Example 5.1 Illustration of DO loop


open data quarterly(2012).xls
cal(q) 1960:1
allocate 2012:4
data(org=obs,format=xls)
*
set loggdp = log(rgdp)
*
graph(footer="U.S. Real GDP")
# loggdp
*
set trend = t
*
linreg loggdp
# constant trend
*
set rssio = %na
do t0=1965:1,2007:4
set btrend = %max(t-t0,0)
linreg(noprint) loggdp
# constant trend btrend loggdp{1 2}
compute rssio(t0)=%rss
end do t0
*
graph(footer="RSS for Broken Trend, Innovational Outlier")
# rssio
ext(noprint) rssio
disp "Minimum at" %datelabel(%minent) %minimum
*
set rssao = %na
do t0=1965:1,2007:4
set btrend = %max(t-t0,0)
boxjenk(regressors,ar=2,noprint) loggdp
# constant trend btrend
compute rssao(t0)=%rss
end do t0
graph(footer="RSS for Broken Trend, Additive Outlier")
# rssao
*
ext(noprint) rssao
disp "Minimum at" %datelabel(%minent) %minimum

Example 5.2 Illustration of IF/ELSE


open data quarterly(2012).xls
cal(q) 1960:1
allocate 2012:4
data(org=obs,format=xls)
*
set dlrgdp = log(rgdp)-log(rgdp{1})
*
cmom
# dlrgdp{0 to 12} constant
*
do lags=0,12
if lags==0 {
linreg(noprint,cmom) dlrgdp
# constant
compute aic = -2.0*%logl + %nreg*2
compute bestlag=lags,bestaic=aic
}
else {
linreg(noprint,cmom) dlrgdp
# constant dlrgdp{1 to lags}
compute aic = -2.0*%logl + %nreg*2
if (aic < bestaic)
compute bestlag=lags,bestaic=aic
}
end do lags
*
disp "Minimum AIC lag" bestlag

Example 5.3 Illustration of WHILE and UNTIL


open data quarterly(2012).xls
cal(q) 1960:1
allocate 2012:4
data(org=obs,format=xls)
*
set dldeflator = log(deflator)-log(deflator{1})
*
* Cut lags until the last one is significant
*
compute lags=13,signif=1.00
while signif>.05 {
compute lags=lags-1
linreg(noprint) dldeflator
# constant dldeflator{1 to lags}
compute signif=%ttest(%tstats(%nreg),%ndf)
disp "Significance of lag" lags "=" signif
}
end while
*
disp "Chosen number of lags" lags
*
* Same thing with safeguard for the number of lags
*
compute lags=13,signif=1.00
while signif>.05 {
compute lags=lags-1
if lags==0
break
linreg(noprint) dldeflator
# constant dldeflator{1 to lags}
compute signif=%ttest(%tstats(%nreg),%ndf)
disp "Significance of lag" lags "=" signif
}
end while
*
* Same thing done using a DO loop
*
compute p=0
do lags=12,1,-1
linreg(noprint) dldeflator
# constant dldeflator{1 to lags}
compute signif=%ttest(%tstats(%nreg),%ndf)
disp "Significance of lag" lags "=" signif
if signif<.05 {
compute p=lags
break
}
end do lags
disp "Number of lags chosen =" p
*
* Same thing done using an UNTIL loop
*
compute lags=13,signif=1.00
until signif<.05 {
compute lags=lags-1
if lags==0
break
linreg(noprint) dldeflator
# constant dldeflator{1 to lags}
compute signif=%ttest(%tstats(%nreg),%ndf)
disp "Significance of lag" lags "=" signif
}
end until
compute p=lags
*
* Redo regression with chosen number of lags
*
linreg(title="Least Squares with Automatic Lag Selection") dldeflator
# constant dldeflator{1 to p}

Example 5.4 Threshold Autoregression, Brute Force


open data quarterly(2012).xls
cal(q) 1960:1
allocate 2012:4
data(org=obs,format=xls)
*
set gm2 = log(m2) - log(m2{1})
*
linreg gm2
# constant gm2{1 3}
*
stats gm2
compute tau=%mean
set plus = gm2{2}>=tau
set minus = 1 - plus
*
set y1_plus = plus*gm2{1}
set y3_plus = plus*gm2{3}
set y1_minus = minus*gm2{1}
set y3_minus = minus*gm2{3}
*
linreg gm2
# plus y1_plus y3_plus minus y1_minus y3_minus
*
* Create the empirical grid for the threshold values
*
set taus = gm2{2}
order taus
inquire(series=taus) tstart tend
*
* These are the lowest and highest entry numbers in <<taus>> that we
* will try, discarding 15% at either end.
*
compute tlow=tstart+fix(%nobs*.15),thigh=tend-fix(%nobs*.15)
*
compute rssbest=%na
do itau=tlow,thigh
compute tau=taus(itau)
set plus = gm2{2}>=tau
set minus = 1 - plus
*
set y1_plus = plus*gm2{1}
set y3_plus = plus*gm2{3}
set y1_minus = minus*gm2{1}
set y3_minus = minus*gm2{3}
linreg(noprint) gm2
# plus y1_plus y3_plus minus y1_minus y3_minus
if .not.%valid(rssbest).or.%rss<rssbest
compute rssbest=%rss,taubest=tau
end do itau
disp "We have found the attractor"
disp "Threshold=" taubest
*
* Re-estimate the model at the best values
*
compute tau=taubest
set plus = gm2{2}>=tau
set minus = 1 - plus
*
set y1_plus = plus*gm2{1}
set y3_plus = plus*gm2{3}
set y1_minus = minus*gm2{1}
set y3_minus = minus*gm2{3}
linreg(title="Threshold autoregression") gm2
# plus y1_plus y3_plus minus y1_minus y3_minus

Example 5.5 Threshold Autoregression, More Flexible Coding


open data quarterly(2012).xls
cal(q) 1960:1
allocate 2012:4
data(org=obs,format=xls)
*
set gm2 = log(m2) - log(m2{1})
*
linreg(define=baseeq) gm2
# constant gm2{1 3}
compute rssols=%rss
*
set threshvar = gm2
compute d=2
*
* Create the empirical grid for the threshold values
*
clear taus
set taus = threshvar{d}
order taus
inquire(series=taus) tstart tend
*
* These are the lowest and highest entry numbers in <<taus>> that we
* will try, discarding 15% at either end.
*
compute tlow=tstart+fix(%nobs*.15),thigh=tend-fix(%nobs*.15)
*
set rsstau = %na
*
compute rssbest=rssols
do itau=tlow,thigh
compute tau=taus(itau)
set plus = threshvar{d}>=tau
linreg(noprint,equation=baseeq,smpl=plus)
compute rssplus=%rss
linreg(noprint,equation=baseeq,smpl=.not.plus)
compute %rss=%rss+rssplus
compute rsstau(itau)=%rss
if %rss<rssbest
compute rssbest=%rss,taubest=tau
end do itau
disp "We have found the attractor"
disp "Threshold=" taubest
*
scatter(footer="Threshold Values vs Sums of Squares",style=step)
# taus rsstau
*
* Re-estimate the model at the best values
*
compute tau=taubest
set plus = threshvar{d}>=tau
*
dec rect[series] expand(%eqnsize(baseeq),2)
do i=1,%eqnsize(baseeq)
set expand(i,1) = %eqnxvector(baseeq,t)(i)*plus
set expand(i,2) = %eqnxvector(baseeq,t)(i)*(1-plus)
labels expand(i,1) expand(i,2)
# "PLUS_"+%eqnreglabels(baseeq)(i) "MINUS_"+%eqnreglabels(baseeq)(i)
end do i
*
linreg(title="Threshold Regression") %eqndepvar(baseeq)
# expand
Chapter 6

SERIES and Dates

6.1 SERIES and the workspace


The following is the top of the data file that we're using:
DATE    Tb3mo  Tb1yr  RGDP    Potent  Deflator  M2     PPI   Curr
1960Q1  3.87   4.57   2845.3  2824.2  18.521    298.7  33.2  31.8
1960Q2  2.99   3.87   2832.0  2851.2  18.579    301.1  33.4  31.9
1960Q3  2.36   3.07   2836.6  2878.7  18.648    306.5  33.4  32.2
1960Q4  2.31   2.99   2800.2  2906.7  18.700    310.9  33.7  32.6

If we do the following:

open data quarterly(2012).xls
cal(q) 1960:1
allocate 2012:4
data(org=obs,format=xls)

we create a series workspace with a standard length of 212 entries, which is
2012:4 given the quarterly calendar starting in 1960:1. At this point, it has
eight series, in order, TB3MO, TB1YR, RGDP, POTENT, DEFLATOR, M2, PPI and
CURR.
What does it mean for the workspace to have a standard length of 212 entries?
If we do the following

set sims = %ran(1.0)
stats sims

you'll see that SIMS is defined as 212 data points (the other statistics will differ
because of randomness):
Statistics on Series SIMS
Quarterly Data From 1960:01 To 2012:04
Observations 212
Sample Mean 0.069595 Variance 0.825391
Standard Error 0.908510 SE of Sample Mean 0.062397
t-Statistic (Mean=0) 1.115358 Signif Level (Mean=0) 0.265966
Skewness -0.189381 Signif Level (Sk=0) 0.263675
Kurtosis (excess) 0.559017 Signif Level (Ku=0) 0.102232
Jarque-Bera 4.027656 Signif Level (JB=0) 0.133477

However, if you do

set sims 1 10000 = %ran(1.0)
stats sims

you'll get something like
Statistics on Series SIMS
Quarterly Data From 1960:01 To 4459:04
Observations 10000
Sample Mean -0.006149 Variance 0.980794
Standard Error 0.990350 SE of Sample Mean 0.009904
t-Statistic (Mean=0) -0.620923 Signif Level (Mean=0) 0.534664
Skewness -0.008913 Signif Level (Sk=0) 0.716008
Kurtosis (excess) 0.004250 Signif Level (Ku=0) 0.930898
Jarque-Bera 0.139914 Signif Level (JB=0) 0.932434

so SIMS now has 10000 data points. Thus the workspace length isn't a limit;
it simply sets the standard length which is used if no other information is
available. In general, there are only a few situations where this comes into
play, typically on SET instructions. Because the expression on the right side of
a SET could be quite complicated, RATS doesn't try to work out the range over
which it could be computed, so, if there is no end parameter on the SET, it uses
the standard length.
Note that you can lengthen a series easily, as we did here, changing SIMS from
212 to 10000 data points. A new SET on a series doesn't destroy the information
that's already there. For instance, if you now repeat

set sims = %ran(1.0)
stats sims

you will replace the first 212 data points (default length), leaving everything
from 213 to 10000 as it was.
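
The behavior described so far can be mimicked with a small Python sketch (not RATS): a series is a list that grows on demand, a SET without a range fills only the standard length, and a CLEAR blanks out everything. STANDARD_LENGTH, set_series and clear_series are hypothetical names used only in this illustration, and constants stand in for the random draws so the counts are easy to check.

```python
# Sketch only: a toy model of the standard length, SET and CLEAR.
STANDARD_LENGTH = 212  # plays the role of the ALLOCATE length


def set_series(series, fn, start=1, end=None):
    """Fill entries start..end (1-based, inclusive) with fn(t), extending if needed."""
    if end is None:
        end = STANDARD_LENGTH          # no range given: use the standard length
    if len(series) < end:
        series.extend([None] * (end - len(series)))
    for t in range(start, end + 1):
        series[t - 1] = fn(t)
    return series


def clear_series(series):
    """Replace every stored value with a missing value (None stands in for NA)."""
    for i in range(len(series)):
        series[i] = None


sims = []
set_series(sims, lambda t: 1.0)             # standard range: entries 1..212
first_length = len(sims)
set_series(sims, lambda t: 2.0, 1, 10000)   # explicit range extends the series
set_series(sims, lambda t: 3.0)             # only entries 1..212 are overwritten
replaced = sum(1 for v in sims if v == 3.0)
kept = sum(1 for v in sims if v == 2.0)
clear_series(sims)
set_series(sims, lambda t: 4.0)
defined_after_clear = sum(1 for v in sims if v is not None)
print(first_length, replaced, kept, defined_after_clear)
```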
What if you want to erase the old information in a series? You can do a CLEAR
instruction. That replaces the current content of the series (as many as you list
on the instruction) with NAs. If you now do

clear sims
set sims = %ran(1.0)
stats sims

you'll again see just 212 entries in the statistics.


What happens when you do a SET instruction involving lags?

set pi = 100.0*log(ppi/ppi{1})

Again, the target range for the SET is the standard 1 to 212. However, because
PPI{1} isn't defined when T=1, the result for PI is an NA for entry 1. There
is no effective difference between a series created from 1 to 212 with an NA
in entry 1 and another which is defined only from 2 to 212, which is why we
suggest that you not try to adjust the ranges on SET to allow for lags; just let
RATS handle it automatically.
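
A short Python sketch (not RATS) shows why nothing needs adjusting: a lag that reaches outside the data simply yields a missing value, and any expression using it is missing too. The helper lag is hypothetical; the PPI values are the first four rows of the data file shown earlier.

```python
# Sketch only: a lag outside the data gives a missing value (None stands in for NA).
import math


def lag(series, k, t):
    """Value of series{k} at 1-based entry t, or None when t-k is outside the data."""
    j = t - k
    if j < 1 or j > len(series):
        return None
    return series[j - 1]


ppi = [33.2, 33.4, 33.4, 33.7]
pi = []
for t in range(1, len(ppi) + 1):
    prev = lag(ppi, 1, t)
    cur = ppi[t - 1]
    # the result is missing whenever any input is missing
    pi.append(None if prev is None else 100.0 * math.log(cur / prev))
print(pi)
```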

As an example of the use of a series which is intentionally longer than the
standard length, we'll generate draws of series of $N(0,1)$ variates and see how
well the Jarque-Bera test statistic compares to its asymptotic chi-squared
distribution. We'll tack this onto the end of Example 6.1. First, we set the
number of replications of the experiments and the length of the sampled series:

compute ndraws=10000
compute nobs =500

Now, we'll clear out the series of simulations so we don't get any unwanted
data from before. We also zero out the series which will get the J-B statistics,
making sure we extend it to the NDRAWS entries that we will need.

clear sims
set jbstats 1 ndraws = 0.0

The simulations and calculations of the Jarque-Bera statistics are done with:

do try=1,ndraws
set sims 1 nobs = %ran(1.0)
stats(noprint) sims
compute jbstats(try)=%jbstat
end do try

At this point, we have 10000 (NDRAWS actually, since we wrote this to change
that easily) samples of Jarque-Bera test statistics from independent standard
Normal samples of length 500. The J-B statistic asymptotically has a $\chi^2_2$
distribution. The following uses the SSTATS instruction (page 187) to evaluate
the fraction of the draws that exceed the 5% and 1% critical values for the
$\chi^2_2$. We would hope this would be close to .05 and .01 respectively.

compute crit05=%invchisqr(.05,2),crit01=%invchisqr(.01,2)
sstats(mean) 1 ndraws (jbstats>crit05)>>sim05 $
(jbstats>crit01)>>sim01
disp "JB Statistic"
disp "Rejections at .05" sim05 "at .01" sim01

The results will change because of the randomness, but they tend to be fairly
similar to:
JB Statistic
Rejections at .05 0.05510 at .01 0.01840
which would indicate that the empirical distribution is close, but, in practice,
the tails are a bit thicker than the $\chi^2_2$.
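
For readers who want to check the logic outside RATS, the experiment can be sketched in plain Python. This is an illustration under stated assumptions rather than a replica of the RATS calculation: it uses the textbook formula JB = n/6 (S^2 + K^2/4) with population-style moments (RATS may apply small-sample corrections), it uses fewer and shorter draws to keep the run time small, and it relies on the fact that the chi-squared distribution with 2 degrees of freedom has tail probability exp(-x/2), so its critical values are -2 log(p). All function names are hypothetical.

```python
# Sketch only: Monte Carlo rejection rates of the Jarque-Bera test under normality.
import math
import random


def jarque_bera(x):
    """Textbook JB statistic from sample skewness and excess kurtosis."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0
    return n / 6.0 * (skew ** 2 + excess_kurt ** 2 / 4.0)


def chi2_2df_critical(tail_prob):
    """Critical value of chi-squared(2): solve exp(-x/2) = tail_prob for x."""
    return -2.0 * math.log(tail_prob)


def rejection_rates(ndraws, nobs, seed=12345):
    """Fraction of simulated JB statistics above the .05 and .01 critical values."""
    rng = random.Random(seed)
    crit05 = chi2_2df_critical(0.05)
    crit01 = chi2_2df_critical(0.01)
    n05 = n01 = 0
    for _ in range(ndraws):
        jb = jarque_bera([rng.gauss(0.0, 1.0) for _ in range(nobs)])
        n05 += jb > crit05
        n01 += jb > crit01
    return n05 / ndraws, n01 / ndraws


sim05, sim01 = rejection_rates(ndraws=1000, nobs=200)
print("Rejections at .05", sim05, "at .01", sim01)
```

The exact fractions depend on the seed and on the (smaller) sample sizes chosen here, so they should not be expected to match the RATS output above.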

6.2 SERIES and their integer handles


If you do the following

print 1970:1 1972:4 2 4 5

you'll get
ENTRY TB1YR POTENT DEFLATOR
1970:01 7.55 4215.3 23.915
1970:02 7.45 4254.2 24.247
1970:03 6.94 4292.7 24.438
1970:04 5.65 4330.7 24.752
1971:01 4.05 4368.0 25.126
1971:02 4.99 4404.8 25.455
1971:03 5.75 4441.6 25.711
1971:04 4.73 4478.6 25.918
1972:01 4.41 4516.4 26.319
1972:02 4.84 4554.5 26.475
1972:03 5.15 4593.4 26.731
1972:04 5.44 4633.2 27.083

This is because each series created has an integer handle which is assigned in
the order in which they are created. So TB3MO is number 1, TB1YR is number 2,
etc. Handles are especially useful together with the DOFOR instruction. We first
saw the DOFOR loop instruction on page 84 to handle a loop over a set of real
values. In practice, it's probably more commonly used to loop over data series.
The following, for instance, does a custom set of basic statistics on the raw
data [1]:
report(action=define)
report(atrow=1,atcol=1,align=center) "Series" "Mean" "Std Dev" $
"Skew" "Kurtosis" "LB(Q)"
dofor s = tb3mo to curr
stats(noprint) s
corr(noprint,qstats,number=12) s
report(row=new,atcol=1) %l(s) %mean sqrt(%variance) %skewness $
%kurtosis %qstat
end dofor s
report(action=format,width=10)
report(action=show)

[1] This is for illustration; it would make little sense to offer the (excess)
kurtosis and Q statistic on raw data like this.

The first pass through the DOFOR loop, S (which is an INTEGER) is 1. The
STATISTICS and CORRELATE instructions, and the %L function (which returns
the label of a series), all understand that when they see an INTEGER where
they expect a SERIES they should interpret it as the handle to a SERIES. The
output that we get is:
Series Mean Std Dev Skew Kurtosis LB(Q)
TB3MO 5.0325 2.9934 0.7154 1.0936 1448.0391
TB1YR 5.5788 3.1783 0.6662 0.8233 1595.7395
RGDP 7664.7505 3390.6523 0.3447 -1.2172 2602.4349
POTENT 7764.8722 3511.5367 0.3922 -1.0878 2609.8333
DEFLATOR 61.5296 31.5948 0.0578 -1.3610 2610.9023
M2 3136.8420 2648.8435 0.9134 -0.1237 2565.5939
PPI 99.9745 49.1331 0.0554 -1.1648 2587.5536
CURR 327.9118 309.0194 0.9423 -0.3364 2570.6722

The one place where it's harder for RATS to distinguish between an INTEGER
as an integer and an INTEGER as a series handle is in a SET or FRML expression.
Here, the six series other than the two interest rates would often be studied
in growth rates. With just six series, the easiest way to create their associated
growth rates might be to just do six SET instructions. However, in other cases,
you might have many more series than this which need transformation. The
following shows how to do that:

dofor s = rgdp to curr
set %s("gr_"+%l(s)) = 100.0*log(s{0}/s{1})
end dofor s

%S maps a string expression to a SERIES (which can be new or existing). In
this case, it will be a new series, which will be named GR_RGDP when S is
representing RGDP, GR_POTENT when S is representing POTENT, etc. In the SET
expression, you will note that we use S{0} and S{1} to represent the current
and lagged value for S. The S{1} isn't a surprise, since that's how you
represent a lag in a formula. If S were the name of an actual SERIES, rather
than an INTEGER handle for a SERIES, we could just use S by itself rather than
S{0} to select the current value. However, since it's an INTEGER, S by itself
means just the value of the handle (for instance, 3 for RGDP, 8 for CURR). Thus,
the use of the S{0} notation, which means the current (0 lag) value for the
series represented by S.
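
The distinction between a handle and the data it points to can be sketched in Python (not RATS). The Workspace class below is purely hypothetical and is only meant to show that the integer is an index into a table of series, so a separate lookup is needed to get at the values; the numbers are the first four rows of the data file shown earlier.

```python
# Sketch only: integer handles as 1-based positions in a table of series.
class Workspace:
    def __init__(self):
        self.names = []  # handle h is stored at index h-1
        self.data = {}

    def add(self, name, values):
        """Store a series and return its 1-based handle (creation order)."""
        self.names.append(name)
        self.data[name] = list(values)
        return len(self.names)

    def label(self, handle):
        return self.names[handle - 1]

    def value(self, handle, t, lag=0):
        """Entry t-lag of the series with this handle (None if out of range)."""
        series = self.data[self.names[handle - 1]]
        j = t - lag
        return series[j - 1] if 1 <= j <= len(series) else None


ws = Workspace()
ws.add("TB3MO", [3.87, 2.99, 2.36, 2.31])
ws.add("TB1YR", [4.57, 3.87, 3.07, 2.99])
s = ws.add("RGDP", [2845.3, 2832.0, 2836.6, 2800.2])

print(s)                       # the handle itself: just an integer
print(ws.label(s))             # the label it maps to
print(ws.value(s, 2))          # current value at entry 2, like S{0}
print(ws.value(s, 2, lag=1))   # lagged value at entry 2, like S{1}
```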
After we've generated the growth rates, we can make a (this time standard)
statistical table of those series using:

table(picture="*.####") / %slike("gr_*")

which produces
Series Obs Mean Std Error Minimum Maximum
GR_RGDP 211 0.7437 0.8625 -2.3276 3.8589
GR_POTENT 211 0.7755 0.1751 0.3491 1.1109
GR_DEFLATOR 211 0.8699 0.5921 -0.1956 2.9479
GR_M2 211 1.6788 0.8493 -0.2955 5.3601
GR_PPI 211 0.8420 1.1466 -5.1092 4.9597
GR_CURR 211 1.6991 1.0640 -1.7034 6.4793

%SLIKE(string exp) returns a VECTOR of INTEGER series handles for the
series whose labels match the string expression, where you can use * (match
any number of characters) and ? (match any one character) for wild cards. In
this case, it will return a list of all the series whose labels start with GR_.
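
The wildcard matching can be imitated in Python (not RATS) with the standard fnmatch module, which uses the same meaning for * and ?. This is only an analogy: fnmatch also gives special meaning to square brackets, and the case-insensitive comparison below is an assumption made for this sketch. The function name slike is hypothetical.

```python
# Sketch only: return the 1-based handles of the labels matching a wildcard pattern.
from fnmatch import fnmatchcase


def slike(labels, pattern):
    """Handles (1-based positions) of labels matching a pattern with * and ?."""
    return [h for h, lab in enumerate(labels, start=1)
            if fnmatchcase(lab.upper(), pattern.upper())]


labels = ["TB3MO", "TB1YR", "RGDP", "GR_RGDP", "GR_M2"]
print(slike(labels, "gr_*"))   # every label starting with GR_
print(slike(labels, "TB?MO"))  # ? matches exactly one character
```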
A somewhat more complicated example of the use of DOFOR with series lists
is the following, which does a regression of GDP growth on its own lags plus
lags of the real rate of interest, where we try all four possible combinations
of price indexes and interest rates for the real rate. This uses nested DOFOR
loops, the outer one over the price index and the inner over the interest rate.
Note again, how RATE{0} and PRICE{0} are used in the SET instruction to get
the current value of the RATE and PRICE series.
set gdpgrow = 400.0*log(rgdp/rgdp{1})
dofor price = deflator ppi
dofor rate = tb3mo tb1yr
set realrate = rate{0}-400.0*log(price{0}/price{1})
disp "Real Rate using" %l(rate) "and" %l(price)
linreg gdpgrow
# constant gdpgrow{1 to 4} realrate{1 to 4}
exclude
# realrate{1 to 4}
end dofor rate
end dofor price

6.3 Series Names and Series Labels


The names of the eight original data series came off the data file. Suppose you
don't like those; perhaps POTENT and CURR aren't descriptive enough, or you
would prefer to use a common naming convention for (for instance) the GDP,
price and money series. For a spreadsheet data file like we have here (similarly
for a labeled text file), you can change the names on the way into RATS by
suppressing the file labels. This is done with the combination of the TOP=2 and
NOLABELS options. TOP=2 tells DATA to start processing information beginning
with row 2 (thus skipping the top row with the labels) and NOLABELS tells DATA
that there are no usable labels on the file, and that it is to use the labels off the
DATA instruction. You, of course, have to be very careful with the order that
you list the names on DATA, since that will be the only source of identification
of the series. An example would be:

data(format=xls,org=columns,top=2,nolabels) / r3mo r1yr y yp $
pdef m2 ppi mcurr

This method of re-labeling is possible for only certain types of data files. For
others (RATS, FRED, Haver, FAME and others that do random access for
series), you have to request a series by the name under which it is stored in
the database. Relabeling the series after the fact can be done using either the
LABELS or the EQV (short for EQuiValence) instruction.
LABELS re-defines the output label of a series. This can be any string up to 16
characters. The output label is used for a bit more than just output, as the %L
and %SLIKE functions both work off the labels. An example of LABELS (using
the original program with the original data file names) is:

labels rgdp potent
# "Y" "YP"

If we now do
stats rgdp

the output will read


Statistics on Series Y
Quarterly Data From 1960:01 To 2012:04
Observations 212
Sample Mean 7664.750472 Variance 11496522.844881
Standard Error 3390.652274 SE of Sample Mean 232.870954
t-Statistic (Mean=0) 32.914154 Signif Level (Mean=0) 0.000000
Skewness 0.344714 Signif Level (Sk=0) 0.041897
Kurtosis (excess) -1.217166 Signif Level (Ku=0) 0.000374
Jarque-Bera 17.285112 Signif Level (JB=0) 0.000176

Note that (for output) RGDP is now re-labeled as Y. However, we still had to use
RGDP on the STATISTICS instruction.
EQV goes farther than this by both defining a new output label and a new name
for the series which you can use directly in instructions. This has a slightly
different (simpler) instruction syntax than LABELS because EQV can only take
legal variable names, not general string expressions. The following shows a
use of EQV:

eqv tb3mo tb1yr
r3mo r1yr
set spread = r1yr-r3mo

Note that after the EQV instruction, we can use R1YR and R3MO in expressions
to refer to the two series. (You can also still use TB1YR and TB3MO if you want).

6.4 Dates as Integers


When you write a date expression like 2012:4, you are actually using an
operator which takes the pair of numbers (2012 and 4) and uses the current
CALENDAR scheme to convert that to an entry number (in this case 212). The
numbers could be replaced with variables or expressions:

compute endyear=2012,endqtr=4
compute end=endyear:endqtr

Print out the four values of RGDP from 1970:1 through 1970:4 using

print 1970:1 1970:4 rgdp

ENTRY RGDP
1970:01 4252.9
1970:02 4260.7
1970:03 4298.6
1970:04 4253.0

Now try using:

print(nodates) 1970:1 1970:4 rgdp

ENTRY RGDP
41 4252.9
42 4260.7
43 4298.6
44 4253.0

As RGDP is stored, it has 212 entries which are numbered from 1 to 212. The
association of the entry number 41 with 1970:1 is based upon the current
CALENDAR setting. If you change the CALENDAR [2], the data don't move; all that
changes is the association of a data point with a particular date. For
illustration, if we now do

cal(m) 1980:1
print 41 44 rgdp

ENTRY RGDP
1983:05 4252.9
1983:06 4260.7
1983:07 4298.6
1983:08 4253.0

Note that the data in entries 41 through 44 haven't changed, but the dates now
associated with those entries have. The time to change data frequencies is when
you read the data in the first place, not later on.
When you operate with the date functions, note that the calculation wraps
the way you would expect. With the CALENDAR reset to quarterly data starting
in 1960:1, let's try

disp %datelabel(2000:5)

This will display


2001:01
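
The mapping between dates and entry numbers is simple integer arithmetic. The following Python sketch (not RATS) reproduces the numbers quoted in this section; the function names entry_number and date_label are hypothetical, and the calendar is described by its starting year, starting period and number of periods per year.

```python
# Sketch only: convert year:period to an entry number and back for a given calendar.
def entry_number(year, period, start_year, start_period, per_year):
    """1-based entry for year:period; periods beyond per_year wrap into later years."""
    return (year - start_year) * per_year + (period - start_period) + 1


def date_label(entry, start_year, start_period, per_year):
    """Label of a 1-based entry under the given calendar."""
    offset = (start_period - 1) + (entry - 1)
    year = start_year + offset // per_year
    period = offset % per_year + 1
    return f"{year}:{period:02d}"


# Quarterly calendar starting in 1960:1
print(entry_number(2012, 4, 1960, 1, 4))
print(entry_number(1970, 1, 1960, 1, 4))
# The same entry under a monthly calendar starting in 1980:1
print(date_label(41, 1980, 1, 12))
# A period past the end of the year wraps into the next year
print(date_label(entry_number(2000, 5, 1960, 1, 4), 1960, 1, 4))
```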

[2] Which you should only do if you understand exactly what is happening.

RATS has quite a large collection of date-related functions. Some of these are
relatively straightforward functions that let you decompose the dates of
entries. We've already used %DATELABEL to display the standard label of an
entry. Others are %YEAR and %PERIOD. For instance,

do time=1970:3,1971:4
disp %datelabel(time) %year(time) %period(time)
end do time

will generate
1970:03 1970 3
1970:04 1970 4
1971:01 1971 1
1971:02 1971 2
1971:03 1971 3
1971:04 1971 4

If (for some reason) we needed a dummy variable for quarter 4 for years 1980
to 1989, we could generate that with

set d80_89q4 = %year(t)>=1980.and.%year(t)<=1989.and.%period(t)==4
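
As a cross-check of that condition, a minimal Python sketch (not RATS) builds the same dummy over the 212 quarterly entries. The helper year_period is hypothetical and assumes the quarterly calendar starting in 1960:1.

```python
# Sketch only: a 0/1 dummy for quarter 4 of the years 1980 to 1989.
def year_period(entry, start_year=1960, start_period=1, per_year=4):
    """Year and period of a 1-based entry under the assumed calendar."""
    offset = (start_period - 1) + (entry - 1)
    return start_year + offset // per_year, offset % per_year + 1


d80_89q4 = []
for t in range(1, 213):
    year, period = year_period(t)
    d80_89q4.append(int(1980 <= year <= 1989 and period == 4))

print(sum(d80_89q4))  # one fourth quarter per year over ten years
```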

There are quite a few others which are based upon a perpetual calendar; these
tend to be more interesting with monthly data and particularly with sector or
firm level data. But,

do time=1970:3,1971:4
disp %datelabel(time) %daycount(time) %tradeday(time,5)
end do time

displays the number of days in the quarter, and the number of Fridays
(weekday number 5, as the date functions number them):
1970:03 92 13
1970:04 92 13
1971:01 90 13
1971:02 91 13
1971:03 92 13
1971:04 92 14
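
These counts can be verified independently with Python's standard datetime module (not RATS). The sketch assumes the same weekday numbering, Monday = 1 through Sunday = 7, which is what isoweekday() uses; the function names are hypothetical.

```python
# Sketch only: days in a quarter and the number of times a given weekday occurs.
from datetime import date, timedelta


def quarter_days(year, quarter):
    """All calendar dates in the given quarter."""
    start = date(year, 3 * (quarter - 1) + 1, 1)
    if quarter == 4:
        end = date(year + 1, 1, 1)
    else:
        end = date(year, 3 * quarter + 1, 1)
    return [start + timedelta(days=i) for i in range((end - start).days)]


def day_and_weekday_count(year, quarter, weekday):
    """(number of days, number of days falling on weekday), Monday=1 ... Sunday=7."""
    days = quarter_days(year, quarter)
    return len(days), sum(1 for d in days if d.isoweekday() == weekday)


for year, quarter in [(1970, 3), (1970, 4), (1971, 1), (1971, 2), (1971, 3), (1971, 4)]:
    print(f"{year}:{quarter:02d}", *day_and_weekday_count(year, quarter, 5))
```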

If you do calculations that include a date field, the : operator takes precedence
over other arithmetic operations (+ and - being the only ones likely to ever be
used). Thus,

disp 1980:1+1 1+1980:1

maps both expressions to entry 82, which is 1 period after 1980:1. However,
even though that's how they are interpreted, it would be easier to read if you
add parentheses, so (1980:1)+1 and 1+(1980:1) would be preferred.
On page 148, we checked for a broken trend in the GDP series. There, we ran
the loop over possible break points from 1965:1 to 2007:4 to exclude breaks in
the 20 observations at either end. We could also have let RATS figure out the
limits for the loop using:

do t0=(1960:1)+20,(2012:4)-20
...

Of course, a better way to handle this would be

compute nobreak=20
do t0=(1960:1)+nobreak,(2012:4)-nobreak

so it would be more obvious what the point of the 20 is, and also to make it
easier to change if we need to.

6.5 Tips and Tricks

The Instruction SSTATS


SSTATS is a handy instruction which can be used to compute the sum (or mean
or maximum, etc.) of one or more general expressions. Since it accepts a
formula, you don't have to take the extra step of generating a separate series
with the needed values.
It can be used to answer some (apparently) quite complicated questions. For
instance,
sstats(min,smpl=peak+trough) startl endl t>>tp0

gives tp0 as the smallest entry for which either the series peak or trough
(both dummies) is true.
sstats 1 nobs p1*y>>p1ys p1>>p1s p2*y>>p2ys p2>>p2s

computes four parallel sums. Without the SSTATS, this would require about
eight separate instructions.
sstats / date<>date{1}>>daycount

computes the number of days in a data set with intra-day data. date<>date{1}
is 1 when the value of date(t) is different from date(t-1) and 0 if it's the
same. So the SSTATS is summing the number of changes in the date series.
In Example 6.1, we use the following:

sstats(mean) 1 ndraws (jbstats>crit05)>>sim05 (jbstats>crit01)>>sim01

JBSTATS>CRIT05 is 1 if JBSTATS(t) is bigger than the .05 critical value and
0 if it isn't. When we request the mean for this calculation, we are getting the
fraction of draws which exceed CRIT05. Similarly, the parallel calculation of
JBSTATS>CRIT01 is computing the fraction that exceed the .01 critical value.
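
The idea of summing or averaging an expression without first storing it as a series carries over directly to other languages. A minimal Python sketch (not RATS) of three of the calculations above follows; all data values are made up, and how RATS treats the first entry of the date comparison (where the lag is missing) is not modelled here.

```python
# Sketch only: sums and means of expressions computed on the fly.
jbstats = [1.2, 7.5, 0.4, 10.3, 3.3]  # made-up test statistics
crit05, crit01 = 5.99, 9.21           # approximate chi-squared(2) critical values

# The mean of a 0/1 indicator is the fraction of entries where it is true
sim05 = sum(jb > crit05 for jb in jbstats) / len(jbstats)
sim01 = sum(jb > crit01 for jb in jbstats) / len(jbstats)

# Parallel sums over the same range
y = [2.0, 4.0, 6.0]
p1 = [1, 0, 1]
p1ys = sum(p * v for p, v in zip(p1, y))
p1s = sum(p1)

# Counting changes in a date stamp across intra-day records
dates = [20140102, 20140102, 20140103, 20140103, 20140106]
changes = sum(1 for prev, cur in zip(dates, dates[1:]) if cur != prev)

print(sim05, sim01, p1ys, p1s, changes)
```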

Example 6.1 Series and Workspace Length


This demonstrates the effect of the standard workspace length versus extended
lengths.

open data quarterly(2012).xls
cal(q) 1960:1
allocate 2012:4
data(org=obs,format=xls)
*
* Generate random data over the standard range
*
set sims = %ran(1.0)
stats sims
*
* Generate random data over longer range
*
set sims 1 10000 = %ran(1.0)
stats sims
*
* Replace data over standard range
*
set sims = %ran(1.0)
stats sims
*
* Clear information and generate data only over standard range
*
clear sims
set sims = %ran(1.0)
stats sims
*
* Example of use of "overlong" series
*
compute ndraws=10000
compute nobs =500
*
clear sims
set jbstats 1 ndraws = 0.0
do try=1,ndraws
set sims 1 nobs = %ran(1.0)
stats(noprint) sims
compute jbstats(try)=%jbstat
end do try
*
compute crit05=%invchisqr(.05,2),crit01=%invchisqr(.01,2)
sstats(mean) 1 ndraws (jbstats>crit05)>>sim05 $
(jbstats>crit01)>>sim01
disp "JB Statistic"
disp "Rejections at .05" sim05 "at .01" sim01

Example 6.2 Series handles and DOFOR


This demonstrates the use of the DOFOR instruction with lists of series.

open data quarterly(2012).xls
cal(q) 1960:1
allocate 2012:4
data(org=obs,format=xls)
*
* Create custom table of statistics
*
report(action=define)
report(atrow=1,atcol=1,align=center) "Series" "Mean" "Std Dev" $
"Skew" "Kurtosis" "LB(Q)"
dofor s = tb3mo to curr
stats(noprint) s
corr(noprint,qstats,number=12) s
report(row=new,atcol=1) %l(s) %mean sqrt(%variance) $
%skewness %kurtosis %qstat
end dofor s
report(action=format,picture="*.##")
report(action=show)
*
* Create growth rates for non-interest rates
*
dofor s = rgdp to curr
set %s("gr_"+%l(s)) = 100.0*log(s{0}/s{1})
end dofor s
*
table(picture="*.####") / %slike("gr_*")
*
* Do regression with generated real rates of interest
*
set gdpgrow = 400.0*log(rgdp/rgdp{1})
dofor price = deflator ppi
dofor rate = tb3mo tb1yr
set realrate = rate{0}-400.0*log(price{0}/price{1})
disp "Real Rate using" %l(rate) "and" %l(price)
linreg gdpgrow
# constant gdpgrow{1 to 4} realrate{1 to 4}
exclude
# realrate{1 to 4}
end dofor rate
end dofor price
*
* Relabel series
*
labels rgdp potent
# "Y" "YP"
*
stats rgdp
*
eqv tb3mo tb1yr
r3mo r1yr
set spread = r1yr-r3mo

Example 6.3 Date calculations and functions


open data quarterly(2012).xls
cal(q) 1960:1
*
* Using expression for date
*
compute endyear=2012,endqtr=4
compute end=endyear:endqtr
*
allocate end
data(org=obs,format=xls)
*
print 1970:1 1970:4 rgdp
print(nodates) 1970:1 1970:4 rgdp
*
* For illustration (this is dangerous!)
*
cal(m) 1980:1
print 41 44 rgdp
*
* Reset the calendar
*
cal(q) 1960:1
*
disp %datelabel(2000:5)
*
do time=1970:3,1971:4
disp %datelabel(time) %year(time) %period(time)
end do time
*
* Create quarter 4 dummy for 1980-1989
*
set d80_89q4 = %year(t)>=1980.and.%year(t)<=1989.and.%period(t)==4
*
do time=1970:3,1971:4
disp %datelabel(time) %daycount(time) %tradeday(time,5)
end do time
*
disp 1980:1+1 1+1980:1
Chapter 7

Nonstationary Variables

A crucial issue in time-series modeling is to determine whether or not the
variables in question are stationary. Even if a series contains a clear trend, the
trend itself can contain both stochastic and deterministic components. It is
inappropriate to difference a series with a purely deterministic trend or to de-
trend by regression a series with a stochastic trend. Unfortunately, it is not
always straightforward to distinguish between stationary and nonstationary
series. The autocorrelations of persistent stationary processes and of non-
stationary I(1) processes both decay slowly. A slowly decaying ACF can indicate
a unit root or a near-unit root process. RATS has a number of procedures that
allow you to test for unit roots and for cointegration. In addition, this chap-
ter illustrates several ways to decompose an I(1) series into its stationary and
trend components.
While unit root tests are a fairly common feature in published work, there
remains quite a bit of confusion about their importance in particular
situations. For instance, in ARIMA modeling (Chapter 2), the decision about
whether or not to difference the series is almost never left to unit-root tests.
Since the point of the ARIMA model is forecasting rather than hypothesis
testing, you make the decision based upon what is likely to forecast best. If there
is any question whether the process should be differenced, you're generally
better off differencing since it creates a more parsimonious model. It's also a
common misconception that you can't run a regression with non-stationary
data. It is true that you can't run a static model with non-stationary data
without running into the "spurious regression" problem demonstrated by Granger
and Newbold (1974).1 However, even in 1974, it was already considered to be
bad form to run a static regression (that is, current y on current X only, with no
lags) leaving very highly correlated errors. Granger and Newbold showed that,
aside from being bad practice, it could lead to incorrect conclusions about
statistical significance. With proper treatment of the dynamics (using lags of the
dependent and explanatory variables), you can, in fact, run regressions with
non-stationary data: that's what techniques such as Vector Autoregressions
(VAR), Vector Error Correction Models (VECM) and Autoregressive Distributed
Lag models (ARDL) are designed to do.
1
They showed by Monte Carlo experiments that if you ran regressions of one random walk
process on a completely independent one, with enough data you would conclude that the two
were correlated.
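
The Granger-Newbold experiment is easy to re-create. The following is a minimal sketch in Python (used here only so that it can be run without RATS or the data set). The function names, sample size, number of draws and seed are illustrative assumptions, not details of the original study. It regresses one simulated random walk on an independent one in a static regression and counts how often the slope's t-statistic exceeds 1.96 in absolute value.

```python
import math
import random

def random_walk(n, rng):
    """Cumulative sum of n independent N(0,1) shocks."""
    level, path = 0.0, []
    for _ in range(n):
        level += rng.gauss(0.0, 1.0)
        path.append(level)
    return path

def slope_tstat(y, x):
    """t-statistic on the slope in a static OLS regression of y on a constant and x."""
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    return b / math.sqrt(rss / (n - 2) / sxx)

def rejection_rate(n_obs=100, n_draws=500, seed=12345):
    """Share of draws in which two independent random walks look 'significantly' related."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        y = random_walk(n_obs, rng)
        x = random_walk(n_obs, rng)
        if abs(slope_tstat(y, x)) > 1.96:
            hits += 1
    return hits / n_draws

if __name__ == "__main__":
    print("Share of |t| > 1.96:", rejection_rate())
```

If the usual Normal approximation were valid, the share would be close to 0.05. With independent random walks it is typically far higher.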


7.1 The Dickey-Fuller Test


The data-generating process of a covariance stationary series has a finite
time-independent mean and variance. In addition, all autocovariances are
time-independent: $E\,x_t x_{t-k}$ depends only upon the gap k and not the time period t.
A series is non-stationary if the means, variances or covariances do somehow
change over time. However, in economics, "non-stationary" is generally
shorthand for "having a unit root", which makes the series' variance increase over
time.
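
To see why a unit root makes the variance grow, consider the simplest case: a pure random walk with i.i.d. shocks of variance $\sigma^2$ and a fixed starting value $y_0$. This is an illustrative special case chosen here, not a model taken from the text.

```latex
\begin{align*}
y_t &= y_{t-1} + \varepsilon_t = y_0 + \sum_{i=1}^{t} \varepsilon_i \\
\operatorname{Var}(y_t \mid y_0) &= t\,\sigma^2 \\
\operatorname{Cov}(y_t,\, y_{t-k} \mid y_0) &= (t-k)\,\sigma^2, \qquad 0 \le k \le t
\end{align*}
```

Both moments depend on t, so neither requirement for covariance stationarity holds.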
Consider the time plots of real and potential U.S. GDP (in logs) shown in Figure
7.1. It's obvious from looking at the graph that the series can't be stationary,
since it's a steadily increasing function of time. The issue is whether we can
produce a stationary series by removing a linear time trend (in which case, the
series is called trend-stationary) or whether we need to difference to accomplish
that.2
log rgdp / ly
log potent / lpot
*
* Construct the graph including the labels for the series
*
com l$ = ||"Real GDP","Potential"||
graph(klabels=l$,footer="Real and Potential GDP", $
key=upleft,vlabel="logarithms") 2
# ly
# lpot

We'll show here how the calculations are done, but in practice you would use a
procedure like @DFUNIT (Section 7.1.1).
We get the following if we regress the log of real GDP on time:

set trend = t
linreg ly
# constant trend

Linear Regression - Estimation by Least Squares


Dependent Variable LY
Quarterly Data From 1960:01 To 2012:04
Usable Observations 212
Degrees of Freedom 210
Centered R2 0.9904127
R-Bar2 0.9903670
(some lines removed)
Variable Coeff Std Error T-Stat Signif
************************************************************************************
1. Constant 8.0276898730 0.0063575729 1262.69727 0.00000000
2. TREND 0.0076234562 0.0000517586 147.28862 0.00000000

2
Note that it's possible that neither method will produce a stationary series, so we have to
be careful about that.

[Figure: time plot of the series Real GDP and Potential; vertical axis "logarithms" (about 7.8 to 9.6), horizontal axis 1960 to 2010]

Figure 7.1: Real and Potential GDP

Although it might appear that a linear trend provides a good fit for the real
GDP series (note the high $R^2$), this is misleading. The deviations from the trend
(%RESIDS) exhibit no obvious tendency to revert back to the trend line. If you
construct the residual autocorrelations, you will find that the ACF is:

corr(number=8,picture="##.##") %resids

Correlations of Series %RESIDS


Quarterly Data From 1960:01 To 2012:04

Autocorrelations
1 2 3 4 5 6 7 8
0.98 0.95 0.91 0.87 0.82 0.77 0.72 0.68
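
For comparison, the slow decay is easy to reproduce on simulated data. The sketch below is written in Python so that it runs without RATS. The `autocorrelations` helper, the seed and the use of a plain random walk (rather than the detrended GDP residuals) are all assumptions made for illustration, and the formula is the textbook sample autocorrelation, which may differ in small details from what CORR computes.

```python
import random

def autocorrelations(series, max_lag):
    """Sample autocorrelations at lags 1..max_lag, using deviations from the full-sample mean."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    gamma0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / gamma0
            for k in range(1, max_lag + 1)]

# Simulated data: 212 white-noise shocks and the random walk built from them.
rng = random.Random(2014)
noise = [rng.gauss(0.0, 1.0) for _ in range(212)]
walk, level = [], 0.0
for shock in noise:
    level += shock
    walk.append(level)

print("white noise:", [round(r, 2) for r in autocorrelations(noise, 8)])
print("random walk:", [round(r, 2) for r in autocorrelations(walk, 8)])
```

The white-noise autocorrelations hover around zero, while those of the random walk start near one and fall off only gradually.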

To formally test the null hypothesis of a unit root against the alternative of
trend stationarity, we can perform an augmented Dickey-Fuller (ADF) test using

$$dly_t = c_0 + c_1 t + \gamma\, ly_{t-1} + \sum_{i=1}^{p} \beta_i\, dly_{t-i} + \varepsilon_t \qquad (7.1)$$

where $dly_t$ is the first difference of the $ly_t$ series.

If the series is trend-stationary, the value of $\gamma$ will be negative, forcing the series
to revert to trend from any deviation. The added lags of $dly_{t-i}$ are to eliminate
the serial correlation in the residuals.3
We can estimate (7.1) using
3
"Augmented" in the description of the test refers to the use of the added lags of $\Delta y$ in the
regression. The original Dickey-Fuller test didn't include those. The added lags have no effect
on the asymptotic distribution and it's now common to simply use "Dickey-Fuller" or DF rather
than ADF to refer to the test with added lags.

Table 7.1: Dickey-Fuller critical values


Sample size      1%      5%     10%
With Constant + Time Trend
     50       -4.15   -3.50   -3.18
    100       -4.04   -3.45   -3.15
    250       -3.99   -3.43   -3.13
With Constant but no Time Trend
     50       -3.58   -2.93   -2.60
    100       -3.51   -2.89   -2.58
    250       -3.46   -2.88   -2.57

diff ly / dly
linreg dly
# constant trend ly{1} dly{1 2}

Variable Coeff Std Error T-Stat Signif


************************************************************************************
1. Constant 0.222564020 0.099957773 2.22658 0.02706890
2. TREND 0.000187860 0.000095954 1.95780 0.05161661
3. LY{1} -0.026996297 0.012480224 -2.16313 0.03169432
4. DLY{1} 0.274240120 0.068466856 4.00544 0.00008670
5. DLY{2} 0.185270940 0.068932447 2.68772 0.00778821

The coefficient on $ly_{t-1}$ is -0.027 and the t-statistic is -2.16313. Note that you
can't read the significance level from the Signif column: that's for a
two-tailed test with an asymptotically Normal distribution. Under the null
hypothesis that the series is nonstationary ($\gamma = 0$), it's necessary to use the
Dickey-Fuller critical values, which are for a one-tailed test (the explosive process
where $\gamma > 0$ isn't an interesting alternative) with a particular non-standard
distribution. For an equation containing a constant and trend, the 5% critical
value of the t statistic for a sample size of 250 is -3.43 (Table 7.1), and we reject
the null only if the t is more negative than that. Clearly, we cannot reject the
null hypothesis of a unit root and thus conclude that the log of the real GDP
should be differenced if we want to create a stationary series.
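
The mechanics of the regression and the one-sided decision rule can be sketched outside RATS. The Python below is a minimal illustration under stated assumptions: the helper names (`ols`, `adf_tstat`, `simulate`), the simulated series, the drift of 0.01 and the seed are all hypothetical, and the single critical value is the tabulated -3.43 rather than an interpolated one. It is not how @DFUNIT is implemented.

```python
import math
import random

def ols(y, X):
    """Least squares of y on the columns of X (a list of rows).
    Returns (coefficients, t-statistics)."""
    n, k = len(X), len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    # Invert X'X by Gauss-Jordan elimination with partial pivoting.
    aug = [xtx[i] + [float(i == j) for j in range(k)] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(k):
            if r != c:
                f = aug[r][c]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[c])]
    inv = [row[k:] for row in aug]
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(X, y)]
    s2 = sum(e * e for e in resid) / (n - k)
    return beta, [beta[i] / math.sqrt(s2 * inv[i][i]) for i in range(k)]

def adf_tstat(y, lags):
    """t-statistic on the lagged level in a regression of the form (7.1):
    change on constant, trend, lagged level and `lags` lagged changes."""
    dy = [y[t] - y[t - 1] for t in range(1, len(y))]   # dy[t-1] is the change at time t
    lhs, rows = [], []
    for t in range(lags + 1, len(y)):
        lagged_changes = [dy[t - 1 - i] for i in range(1, lags + 1)]
        rows.append([1.0, float(t), y[t - 1]] + lagged_changes)
        lhs.append(dy[t - 1])
    return ols(lhs, rows)[1][2]

def simulate(n, rho, seed):
    """Trend plus AR(1) noise: y_t = 0.01*t + z_t with z_t = rho*z_{t-1} + e_t."""
    rng = random.Random(seed)
    z, out = 0.0, []
    for t in range(n):
        z = rho * z + rng.gauss(0.0, 1.0)
        out.append(0.01 * t + z)
    return out

CRIT_5PCT_TREND = -3.43   # Table 7.1: constant + time trend, 250 observations

for label, rho in (("trend-stationary", 0.5), ("unit root", 1.0)):
    t_stat = adf_tstat(simulate(250, rho, seed=42), lags=2)
    verdict = "reject" if t_stat < CRIT_5PCT_TREND else "cannot reject"
    print(f"{label}: t = {t_stat:.3f}, {verdict} the unit-root null")
```

Only values more negative than the critical value count against the null, which is why the comparison is a simple one-sided inequality.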
To test the joint restriction that $\gamma = c_1 = 0$ (so under the null
hypothesis, the series is a drifting random walk), use

exclude
# ly{1} trend

Null Hypothesis : The Following Coefficients Are Zero


LY Lag(s) 1
TREND
F(2,204)= 4.37333 with Significance Level 0.01381264

Again, it is not appropriate to use the significance level reported by the
EXCLUDE instruction because it involves a restriction on a nonstationary
variable ($ly_{t-1}$ is nonstationary under the null hypothesis). The critical value for the
Dickey-Fuller $\phi_3$ statistic is 6.34 at the 5% significance level. Clearly, we do
not reject this null hypothesis, so it is possible to accept that the
series contains a unit root with a non-zero drift. To test whether $\gamma = c_0 = c_1 = 0$
(under the null, the series is a non-drifting random walk, which is not really a
serious option for a series with such an obvious trend), use

exclude
# ly{1} constant trend

Null Hypothesis : The Following Coefficients Are Zero


LY Lag(s) 1
Constant
TREND
F(3,204)= 11.64685 with Significance Level 0.00000045

The 5% critical value for the $\phi_2$ statistic is 4.75 so this (admittedly unlikely
null) is rather clearly rejected. In practice, with a series like this, you wouldn't
even bother with this last test.
It is important to ensure that the lag length used in the DF test is correct. You
need enough lags to remove any (obvious) serial correlation in the residuals,
but not more than needed, since each extra lag costs two degrees of freedom in
the regression (one lost data point, plus one extra estimated parameter).
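
The cost of an extra lag can be checked against the output above. As a worked count, assuming the regression (7.1) with constant, trend, lagged level and p lagged differences, run on the T = 212 quarterly observations:

```latex
\begin{align*}
\text{usable observations}  &= T - 1 - p \\
\text{estimated parameters} &= 3 + p \\
\text{degrees of freedom}   &= (T - 1 - p) - (3 + p) = T - 4 - 2p
\end{align*}
% With T = 212: p = 2 gives 209 - 5 = 204, the denominator degrees of
% freedom in the F(2,204) statistic reported by EXCLUDE above;
% p = 3 gives 208 - 6 = 202, two fewer.
```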
The AIC, BIC and General-to-Specific (GTOS) methods are the most common
ways used to select the lag length. It is straightforward to construct a loop to
select the lag length using the GTOS method. If the maximum lag length is 4
and the minimum is 0, you could use

compute p=0
do lags=4,1,-1
   linreg(noprint) dly
   # constant trend ly{1} dly{1 to lags}
   if %ttest(%tstats(%nreg),%ndf)<.05 {
      compute p=lags
      break
   }
end do lags

We can then display the chosen lag length (2, as was done above) and estimate
the Dickey-Fuller regression with that number of lags using:

disp "Chosen lag length" p


linreg(print) dly
# constant trend ly{1} dly{1 to p}
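
The logic of the loop is independent of the regression itself, so it can be sketched on its own. In the Python below, `gtos_lag` is a hypothetical helper and the p-values for lags 4, 3 and 1 are made up for illustration. Only the value for lag 2 (0.00778821) comes from the regression output shown earlier.

```python
def gtos_lag(pvalue_of_last_lag, max_lag, min_lag=0, signif=0.05):
    """General-to-specific selection: walk down from max_lag and stop at the
    first lag length whose final lag is significant; otherwise return min_lag."""
    for lags in range(max_lag, min_lag, -1):
        if pvalue_of_last_lag(lags) < signif:
            return lags
    return min_lag

# Two-tailed significance of the last augmenting lag at each lag length.
# Lags 4, 3 and 1 are hypothetical; lag 2 is the Signif value for DLY{2} above.
pvalues = {4: 0.41, 3: 0.27, 2: 0.00778821, 1: 0.0001}

print("Chosen lag length", gtos_lag(pvalues.get, max_lag=4))
```

Because the search stops at the first significant lag, the outcome depends on the maximum lag and the significance level, which is why both are options on the procedures described in the next section.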

As another example, it seems likely that the gap between (the logs of) real and
potential GDP is stationary: it is hard to see how real GDP can drift infinitely
far from its potential. To formally test whether the output gap is stationary,
form the variable CYCLE as the difference between the logs of the two variables:

set cycle = lpot - ly

Now run the Dickey-Fuller test. Since the alternative doesn't involve a trend,
we don't include one. You can verify that two lags are sufficient for the test.

diff cycle / dcycle


linreg dcycle
# constant cycle{1} dcycle{1 to 2}

Variable Coeff Std Error T-Stat Signif


************************************************************************************
1. Constant 0.000588264 0.000557422 1.05533 0.29251703
2. CYCLE{1} -0.059869225 0.018735466 -3.19550 0.00161648
3. DCYCLE{1} 0.283165322 0.067698566 4.18274 0.00004269
4. DCYCLE{2} 0.195582125 0.068644449 2.84921 0.00482998

The t-statistic for the null hypothesis of a unit root is -3.1955. With 209
observations, to be conservative we can use the critical values for 100, which are
-2.89, -3.17 and -3.51 at the 5%, 2.5%, and 1% levels, respectively. Thus we
can reject the null at the 5% and 2.5%, but not the 1% significance level.

7.1.1 Dickey-Fuller testing procedures

RATS contains a number of procedures that make programming variants of the
Dickey-Fuller test quite simple. @DFUNIT can readily estimate a model in the
form of (7.1). The proper syntax is

@DFUNIT( options ) series start end

The most important options are


DET=NONE/[CONSTANT]/TREND
LAGS=number of augmenting lags [0]
MAXLAGS=maximum number of augmented lags [T/4]
METHOD=AIC/BIC/GTOS
SIGNIF=significance level for the t-test in GTOS

You can reproduce the unit root test on $ly_t$ with two lags using

@dfunit(det=trend,lags=2) ly

Dickey-Fuller Unit Root Test, Series LY


Regression Run From 1960:04 to 2012:04
Observations 210
With intercept and trend
Using fixed lags 2

Sig Level Crit Value


1%(**) -4.00465
5%(*) -3.43224
10% -3.13959

T-Statistic -2.16313

This shows the critical values created using the algorithm in MacKinnon
(1991), which does a (very) sophisticated interpolation to get approximate crit-
ical values for each sample size. As before, we accept the null.
The following will choose the lag length and then re-run the regression with
the selected number of lags.

@dfunit(det=trend,method=gtos,maxlags=4,signif=0.05) ly
@dfunit(det=trend,method=bic,maxlags=4) ly
@dfunit(det=trend,method=aic,maxlags=4) ly

All of these select two lags. The output from the procedure is altered slightly,
with a line like
With 2 lags chosen from 4 by GTOS/t-tests(0.050)

replacing the "Using fixed lags" line.


Since the results of a Dickey-Fuller test depend upon the number of lags and
since it's possible that different methods of selecting the lag length can produce
different choices for the lags, you can also use the @ADFAUTOSELECT procedure,
which shows a sensitivity table listing four different lag length selection
criteria (AIC and BIC, plus Hannan-Quinn and the Modified AIC of Ng and Perron
(2001)). The syntax for this in our case is:

@adfautoselect(det=trend,maxlags=4,print) ly

The PRINT option is needed because it's designed to, by default, silently choose
the lag length.
Information Criteria for ADF Lag Lengths, Series LY
Lags AIC BIC HQ MAIC ADF
0 -9.578 -9.529 -9.558 -9.579 -1.682
1 -9.676 -9.612 -9.650 -9.660* -2.140
2 -9.697* -9.616* -9.664* -9.659 -2.548
3 -9.687 -9.591 -9.648 -9.652 -2.447
4 -9.681 -9.568 -9.635 -9.637 -2.548

As we saw above, two seems to be the consensus choice, though the MAIC is very
slightly better at 1. One thing to note is that the ADF statistic doesn't match the
results shown before with the chosen number of lags. This is because the ADF
statistics in this table are all generated with regressions run over the range
allowed with the maximum 4 lags, while @DFUNIT does the displayed test using
the maximum range allowed by the chosen number of lags.
The Dickey-Fuller critical values depend upon which deterministic regressors
are included in the estimating equation. If you are unsure as to which deter-
ministic regressors to include in the regression, you can test for the presence
of a trend and/or intercept. The procedure @URAUTO performs a Dickey-Fuller
test for a unit root while trying to pare down the deterministic regressors using
a series of t-tests and F-tests, starting with constant and trend and (if
necessary) using just a constant, or finally none at all. To a large extent, the method
uses the scheme developed in Appendix 4.2 of Enders (2015). The syntax is

@URAUTO( options ) series start end

where the most typically used options are

SIZE=ONE/TWO5/[FIVE]/TEN
LAGS=number of augmenting lags [0]
[PARAM]/NOPARAM
[TRACE]/NOTRACE

The option NOPARAM requests non-parametric (Phillips-Perron type) tests
(Section 7.2) rather than Dickey-Fuller. TRACE requests that all regression and
test statistics be printed rather than just summaries.
In our case, we can use

@urauto(lags=2) ly

The test output starts with this, which is for a Dickey-Fuller regression with
constant and trend:
Regressions with constant,trend

t-tau statistic for rho=1 -2.16313 with critical value -3.41000


Cannot reject a unit root t-statistic
Next is joint test of trend=0 and root=1
psi3 = 4.37333 with critical value 6.25000
psi3 cannot reject unit root and no linear trend

The first step is testing for a unit root in the presence of both deterministics.
Note how large (negative) the critical value is for this. If we were to reject
the unit root at this point, we would be done. However, we accept. The next
step is to check the joint hypothesis that the trend coefficient is zero and there
is a unit root. This is an F test with a non-standard distribution. We accept
that, so it appears we do not need the trend in the test. So we move on to the
Dickey-Fuller regression with just the constant:
Regressions with constant,no trend
t-mu statistic for rho=1 -2.20152 with critical value -2.86000
Cannot reject a unit root with t-mu
Next is joint test of constant=0 and root=1
psi1 = 15.34177 with critical value 4.59000
psi1 significant
Testing constant=0 under the unit root
constant=0 test = 5.03619 with normal distribution
Constant significant under the unit root

Again, we start out by accepting the unit root in the presence of the constant.
The joint test for a zero intercept and the unit root is now rejected (rather
strongly). The next part of the test is to check to see if the constant is zero
given the unit root. That's rejected. So we need the constant. Thus we get:
Conclusion: Series contains a unit root with drift

This type of automatic analysis can be a helpful tool, but you still have to pay
attention to the details, as some of those early tests might just barely accept or
reject.
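
The sequence of decisions just described can be written out as a short function. This Python sketch covers only the path walked through above and is not the procedure's actual code. The function name `urauto_sketch` is hypothetical, the two-sided 1.96 cutoff for the final Normal test is an assumption, and branches the text does not discuss are reported as "not covered" rather than guessed at.

```python
def urauto_sketch(t_tau, cv_tau, psi3, cv_psi3, t_mu, cv_mu,
                  psi1, cv_psi1, const_stat, cv_normal=1.96):
    """Follow the sequence of tests described in the text and return a conclusion.
    Branches the text does not walk through are reported as 'not covered'."""
    # Step 1: unit-root t-test with constant and trend (one-sided, left tail).
    if t_tau < cv_tau:
        return "no unit root (constant and trend)"
    # Step 2: joint test of zero trend and unit root.
    if psi3 > cv_psi3:
        return "not covered: trend appears to matter"
    # Step 3: trend dropped, unit-root t-test with constant only.
    if t_mu < cv_mu:
        return "no unit root (constant only)"
    # Step 4: joint test of zero constant and unit root.
    if psi1 <= cv_psi1:
        return "not covered: constant may not be needed"
    # Step 5: constant alone, given the unit root (Normal distribution).
    if abs(const_stat) > cv_normal:
        return "unit root with drift"
    return "not covered: constant insignificant under the unit root"

# Statistics and critical values copied from the @URAUTO output above.
print(urauto_sketch(-2.16313, -3.41, 4.37333, 6.25,
                    -2.20152, -2.86, 15.34177, 4.59, 5.03619))
```

Writing the steps this way makes it easy to see which comparisons are close calls: here the step 1 and step 3 statistics are well short of their critical values, while the step 4 statistic exceeds its critical value by a wide margin.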

7.1.2 DOFOR loops and the REPORT instruction

If you have a number of series that are potentially nonstationary, it is efficient
to nest a unit-root testing procedure within a DOFOR loop. Consider the
example from Enders (2015) using quarterly data through 2013:1. The issue is to
analyze the time-series properties of the effective real exchange rates of Aus-
tralia, Canada, France, Germany, Japan, the Netherlands, the United King-
dom, and the United States. The theory of Purchasing Power Parity (PPP) sug-
gests that (log) real exchange rates should be stationary processes. If it is not
possible to reject the null hypothesis of a unit-root in the real exchange rates,
PPP fails. Read in the data set from the file PANEL(2013).XLS using

open data "panel(2013).xls"


calendar(q) 1980:1
data(format=xls,org=columns) 1980:01 2013:01 australia canada $
france germany japan netherlands uk us

A quick look at any of the series would rule out the need to allow for trends
in the unit root test, and, of course, it's completely unreasonable for a real
exchange rate to have a predictable trend. So for each series, we want to do a
unit root test allowing only for a constant in the deterministic part. If we accept
the unit root, we would conclude that PPP doesn't hold.4 We want to do this for
each of the eight countries. The simplest way to organize this is to use a DOFOR
loop. Since the data are real exchange rates, we have to transform each to logs
and do the test. We have no prior knowledge of how many augmenting lags will
be necessary for each country, so we'll let @DFUNIT pick those automatically (in
this case, using GTOS from a maximum of 12 lags).
This is easy to set up with:

dofor x = australia to us
set lx = log(x{0})
@dfunit(title="Dickey-Fuller Test for "+%l(x),$
method=gtos,maxlags=12) lx
end dofor x
4
Or at least PPP doesn't hold given the choice for conversion of nominal to real exchange
rates; a bad choice for the price indexes could be the cause.

Each pass through, LX is set to the log of the value of the current X series.5
Because (by default) @DFUNIT labels the output using the name of the series
passed through to it (which would be LX for each country as we have this writ-
ten), we use the TITLE option to put in our own description using the label of
the original X series. The output consists of eight blocks of something like
Dickey-Fuller Test for AUSTRALIA
Regression Run From 1983:01 to 2013:01
Observations 122
With intercept
With 11 lags chosen from 12 by GTOS/t-tests(0.100)

Sig Level Crit Value


1%(**) -3.48464
5%(*) -2.88510
10% -2.57919

T-Statistic -1.63443

While it's possible to read through this and pick out the information that we
want, this isn't at all useful if we want to include this in a paper. For that,
we don't need all the information (most of it the same from country to country); a
table with country name, test statistic and (perhaps) the number of lags chosen
will be enough. You could copy, paste and reformat the information into a
spreadsheet or table in a target document, but that's time-consuming, it's easy
to make errors, and it would have to be repeated completely if you ended up
deciding upon a different set of countries, or a different time period.
Instead, we'll introduce the REPORT instruction, which is designed to insert
information into a table within RATS. As a first go, we'll just insert three pieces of
information: the series name, the test statistic and the number of lags chosen.
@DFUNIT defines %CDSTAT as the Dickey-Fuller statistic and %%AUTOP as the
number of augmenting lags.6

report(action=define,title="Unit Root Tests for PPP")


dofor x = australia to us
set lx = log(x{0})
@dfunit(noprint,method=gtos,maxlags=12) lx
report(row=new,atcol=1) %l(x) %cdstat %%autop
end dofor x
report(action=show)

This produces:
5
See page 182 for a discussion of the x{0} notation.
6
As a general rule, variables with a single % are standard variables which are defined by
many RATS instructions and procedures, while those with %% are variables defined only by
procedures.

AUSTRALIA -1.634429 11
CANADA -1.896289 7
FRANCE -2.998498 1
GERMANY -2.140409 11
JAPAN -2.367424 12
NETHERLANDS -3.625032 3
UK -2.677756 12
US -1.865584 3

which is fine for a start and certainly is more useful than the original output.
The construction of the REPORT is started using the REPORT instruction with
the option ACTION=DEFINE. You only use this once in a given report (use it
again, and you'll erase the contents). If you run this, and check the
Window-Report Windows menu (which is a pull-right menu with a list of recent
reports), you'll see where the TITLE option on the first REPORT instruction is
used: it's the title in the report list and the title of the window if you select
that report from the list.
The instruction which adds content is
report(row=new,atcol=1) %l(x) %cdstat %%autop

This creates a new row in the table, and inserts three items beginning in col-
umn 1 (ATCOL=1 option). The first is the label of the current X series, the second
is the test statistic and the third the number of lags. Note that the content can
be a mix of character information (the label), real-valued statistics (%CDSTAT)
and integer-valued information (%%AUTOP). The REPORT instruction tries hard
to create a reasonable-looking table from the different types of information.

report(action=show)

then displays the information. As in most programming situations, it's usually a
good idea to start with a fairly basic layout, and then improve it. Here, it
would be nice to add headers to the columns, and trim some decimal places off
the statistics. The first can be done by adding

report(row=new,atcol=1,align=center) "Country" "ADF" "Lags"

before we start the loop. By default, numerical data aligns right and character
data aligns left, so you need to add the ALIGN=CENTER option to get the strings
centered on the columns.
The easiest way to reformat the numbers is to add

report(action=format,picture="*.###")

after the loop and before the ACTION=SHOW. This will reduce the display to
three decimal places. We now get the more useful:

Country ADF Lags


AUSTRALIA -1.634 11
CANADA -1.896 7
FRANCE -2.998 1
GERMANY -2.140 11
JAPAN -2.367 12
NETHERLANDS -3.625 3
UK -2.678 12
US -1.866 3

If we want to get even fancier output, we can use *s to tag the significant
statistics. We can use * for significant at 10%, ** for 5% and *** for 1%. This
is a major step up in complexity as we have to figure out which level of *s (if
any) that we need for each country and then tell REPORT about it. After the
@DFUNIT inside the loop, we can do the following:

@mackinnoncv(det=constant) dfcv
compute stars=fix($
(%cdstat<dfcv(1))+(%cdstat<dfcv(2))+(%cdstat<dfcv(3)))

The first of these computes the critical values from MacKinnon (1991) for a
series of the length that was just analyzed. This returns DFCV as a 3-vector
with the 1%, 5% and 10% critical values in that order. %CDSTAT<DFCV(1)
will be 1 if the test statistic is smaller than the 1% value and 0 otherwise,
similarly for the other two. If the test statistic is, in fact, less than the 1%
critical value, then all three of those comparisons will return a 1 so STARS will
be 3. If %CDSTAT is larger than even the 10% critical value, all the comparisons
will return 0, so STARS will be 0. You should be able to convince yourself that
this line will give the number of stars that we want.
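
The counting trick is easy to check by hand or in a few lines of code. The Python sketch below mirrors the sum of comparisons. The helper name `star_count` is hypothetical, and for simplicity it applies the critical values printed for Australia to every country, whereas the procedure recomputes them for each sample length.

```python
def star_count(stat, critical_values):
    """Number of critical values that the statistic falls below.
    Each comparison is True (1) or False (0), so the sum is the star count."""
    return sum(stat < cv for cv in critical_values)

# 1%, 5% and 10% critical values from the AUSTRALIA block shown earlier,
# reused for all countries here purely for illustration.
dfcv = (-3.48464, -2.88510, -2.57919)

for country, stat in (("AUSTRALIA", -1.634429), ("FRANCE", -2.998498),
                      ("NETHERLANDS", -3.625032), ("UK", -2.677756)):
    print(country, stat, "*" * star_count(stat, dfcv))
```

Because the critical values are ordered, a statistic below the 1% value is automatically below the other two, so the sum can only take the values 0 through 3.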
The tagging is done by adding the SPECIAL option when we insert the value.
This has choices

SPECIAL=[NONE]/ONESTAR/TWOSTARS/THREESTARS/PARENS/BRACKETS

The first four are of interest here. Because of the way options like this work, we
can just use SPECIAL=1+STARS, which translates into the first choice (NONE)
when STARS is zero, the second choice (ONESTAR) when STARS is one, etc. We
only want to add the stars to the test statistic, so we need to use two REPORT
instructions to insert the information:
report(row=new,atcol=1) %l(x) "" %%autop
report(row=current,atcol=2,special=1+stars) %cdstat

The first inserts the unadorned columns (with an empty string in the middle),
while the second fills in the test statistic in the second column with the desired
tagging. When we put this all together, we have what's almost what we want:

Country ADF Lags


AUSTRALIA -1.634 11
CANADA -1.896 7
FRANCE -2.998** 1
GERMANY -2.140 11
JAPAN -2.367 12
NETHERLANDS -3.625*** 3
UK -2.678* 12
US -1.866 3

Unfortunately, numeric fields default to align right. To get our final product,
we need to align the second column on the decimal place. If we replace the
formatting instruction with

report(action=format,atcol=2,tocol=2,atrow=2,$
picture="*.###",align=decimal)

we will get the desired


Country ADF Lags
AUSTRALIA -1.634 11
CANADA -1.896 7
FRANCE -2.998** 1
GERMANY -2.140 11
JAPAN -2.367 12
NETHERLANDS -3.625*** 3
UK -2.678* 12
US -1.866 3

The ATCOL, TOCOL and ATROW options (there is also a TOROW option, but we
don't need that since the default for that is the end of the table) restrict the
range that the formatting affects. We only want this to apply to the test
statistic column, and only from row 2 down.
Although this is much easier to read than the original information, it still isn't
quite ready to go straight into a publication because this is formatted for
viewing in a fixed-width font. If you copy and paste this from the output window
into a word processor, you'll either have to switch to a fixed-width font like
Courier, or reformat using tabs. Instead, if you go to the Window-Report
Windows menu, you can open the report (the good one will be the top entry in the
menu) as a spreadsheet-style window. From that, you can do Copy or Export
operations which will let you better control what you do with the result. In our
case, we did a Copy to TeX which creates a TeX tabular environment. With a
couple of minor edits (wrapping a table environment around it), we get Table 7.2.
At this point, we've spent quite a bit of time formatting the results and none
actually looking at them carefully. What we can see is that we reject PPP (accept
the unit root) fairly easily for the big, geographically isolated countries:
Australia, Canada and the U.S. We reject the unit root for three of the EU countries:
France, Netherlands and the U.K. Japan and Germany are in an intermediate
range: we wouldn't reject at standard levels of significance, but they're just a
bit below that.

Table 7.2: Unit Root Tests for PPP


Country        ADF        Lags
AUSTRALIA     -1.634       11
CANADA        -1.896        7
FRANCE        -2.998**      1
GERMANY       -2.140       11
JAPAN         -2.367       12
NETHERLANDS   -3.625***     3
UK            -2.678*      12
US            -1.866        3

7.2 Other Tests


There are two obvious problems with the standard Dickey-Fuller tests:

1. The test depends upon the nuisance parameter of the extra lags to
remove serial correlation.

2. The deterministics change their meanings as the model moves between
the null (unit root) and the alternative (stationary). For instance, under
the unit root, the constant is a drift rate, while it determines the mean
for a stationary process.

To deal with this, quite a few alternative unit root testing procedures have been
proposed. We'll explain some of them here. Note that all of these are included
in the Time Series-Unit Root Tests wizard.
As mentioned earlier, the original Dickey-Fuller test did not include the
augmenting lags. Suppose we write

$$\Delta y_t = c_0 + \gamma\, y_{t-1} + u_t \qquad (7.2)$$

where $u_t$ is (possibly) serially correlated. The augmented Dickey-Fuller test
uses a parametric autoregressive filter (in effect) to eliminate the serial
correlation. Is it possible to test $\gamma = 0$ in (7.2) without doing this? This was answered
"yes" by Phillips and Perron (1988). They showed how to adjust the test
statistics to allow for a serially correlated residual process, using some of the same
types of calculations that go into HAC standard errors in linear regressions.
This is known as the Phillips-Perron test.
The procedure for doing this is @PPUNIT:

@PPUNIT( options ) series start end

The most important options are

DET=[CONSTANT]/TREND
LAGS=number of lags in spectral estimation window [4]


TABLE/[NOTABLE]

The LAGS option here chooses the number of lags used in the estimate for the
long-run variance of the $u_t$ residual process. While this is still a nuisance
parameter (the test statistic depends upon it), changing it doesn't require
re-running the entire regression and it doesn't cost data points. However, there
is also no relatively simple way to decide if you have chosen the right value.
The TABLE option offers the ability to do a sensitivity table, which shows the
test statistic for different numbers of lags in the window; the statistics have a
tendency to settle down at some point. The instruction

@ppunit(det=trend,lags=8,table) ly

generates
Phillips-Perron Test for a Unit Root for LY
Regression Run From 1960:02 to 2012:04
Observations 211
With intercept and trend

Sig Level Crit Value


1%(**) -4.00426
5%(*) -3.43205
10% -3.13947

Lags Statistic
0 -0.62128
1 -0.92143
2 -1.13143
3 -1.24882
4 -1.32996
5 -1.36302
6 -1.37924
7 -1.37640
8 -1.36148

Once you get to 4 or 5 lags, the results settle down, with acceptance of the null.
The Phillips-Perron test is rarely used, for two main reasons. One is that the
types of serial correlation that it handles better than the ADF test are rarely
observed and the ones that it handles poorly are common. The other is that it's
much easier to justify a lag choice in the ADF using a well-known information
criterion than it is in the PP test.
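
The role of the LAGS window can be illustrated with the standard Bartlett-weighted (Newey-West type) long-run variance estimate. The Python below is a minimal sketch under stated assumptions: the function name is hypothetical, the input is taken to be a mean-zero residual series, and whether @PPUNIT uses exactly this weighting and normalization is an assumption rather than something the text confirms.

```python
def bartlett_long_run_variance(u, lags):
    """Long-run variance estimate for a mean-zero series u:
    gamma(0) + 2 * sum over j = 1..lags of (1 - j/(lags+1)) * gamma(j),
    where gamma(j) is the j-th sample autocovariance (divisor n)."""
    n = len(u)

    def gamma(j):
        return sum(u[t] * u[t - j] for t in range(j, n)) / n

    return gamma(0) + 2.0 * sum((1.0 - j / (lags + 1.0)) * gamma(j)
                                for j in range(1, lags + 1))

# A tiny, strongly negatively autocorrelated example.
u = [1.0, -1.0, 1.0, -1.0]
for window in (0, 1):
    print(window, bartlett_long_run_variance(u, window))
```

With a window of zero the estimate is just the ordinary variance. Widening the window only changes the weights on autocovariances that have already been computed, which is why changing LAGS does not require re-running the regression.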
The second problem can be approached by replacing the Dickey-Fuller Wald
test with a Lagrange multiplier (LM) test. This is done in Schmidt and Phillips
(1992). Suppose we write the model as the decomposition into trend plus noise:
yt = c0 + c1 t + Zt (7.3)
Zt = Zt1 + ut (7.4)
Nonstationary Variables 206

Under the null, $\rho = 1$. The Dickey-Fuller test arises if we quasi-difference
(7.3) using (7.4) to eliminate $Z$:
$$ y_t = \rho y_{t-1} + c_0 (1 - \rho) + c_1 (t - \rho (t - 1)) + u_t $$
$$ \Delta y_t = (\rho - 1) y_{t-1} + \{ c_0 (1 - \rho) + c_1 \rho \} + \{ c_1 (1 - \rho) \} t + u_t \tag{7.5} $$
We estimate an unconstrained version of (7.5) and test whether $\rho - 1 = 0$.
Instead, if we impose the null that $\rho = 1$, (7.5) simplifies to
$$ \Delta y_t = c_1 + u_t $$
so the trend rate is easy to estimate. A limit argument can show that the
maximum likelihood estimate for $c_0$ is simply $y_1 - c_1$. A Lagrange multiplier
test for the unit root is formed by regressing the difference of the data on a
constant and the first lag of the detrended data (called $S$) and testing that
lag for zero. The test statistic is (as expected) non-standard, but is different
from the DF statistic.
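The detrending step under the null can be sketched in a few lines of Python. This is a hedged illustration of the formulas above (trend rate from the mean of the first differences, intercept from $y_1 - c_1$), not the code used inside @SPUNIT; the function name and the sample data are made up.

```python
def schmidt_phillips_detrend(y):
    """Detrend y under the unit root null: returns (c0_hat, c1_hat, S)."""
    n = len(y)
    # Under rho = 1 the trend rate is the mean of the first differences,
    # which telescopes to (y_T - y_1) / (T - 1).
    c1_hat = (y[-1] - y[0]) / (n - 1)
    # The restricted estimate of the intercept is y_1 - c1_hat.
    c0_hat = y[0] - c1_hat
    # Detrended series S_t = y_t - c0_hat - c1_hat * t for t = 1, ..., T.
    s = [y[t - 1] - c0_hat - c1_hat * t for t in range(1, n + 1)]
    return c0_hat, c1_hat, s


y = [2.0, 2.5, 2.7, 3.6, 4.0]  # hypothetical data
c0_hat, c1_hat, s = schmidt_phillips_detrend(y)
print(c0_hat, c1_hat, s)
```

By construction the detrended series is zero at the first and last observations. The LM test then regresses the first difference of the data on a constant and the lag of `s`.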
The @SPUNIT procedure has a somewhat different syntax from the other unit
root tests because the authors allowed for polynomial degrees higher than
one (though there is probably little need for those). So instead of the option
DET=TREND, you use P=1.

@spunit(p=1) ly

This generates output which includes a sensitivity table to the number of lags
in the non-parametric window for handling remaining serial correlation (we've
cut part of this out). All of these are well within the acceptance values for the
unit root.
Schmidt-Phillips Test (TAU) for a Unit Root for LY
Regression Run From 1960:02 to 2012:04
Observations 211
Signif. Level Critical Value
1%(**) -3.610000
2.5% -3.300000
5%(*) -3.040000
10% -2.760000

Variable Coefficient T-Stat


SBAR{1} -0.016837 -1.332059
Constant 0.009155 6.449585

Semiparametric Corrections for TAU


Bartlett Window Estimates of sigma2
Schwert value of L4 =4, L12=14

Lags Sigma2 tau


0 0.000073 -1.33206
1 0.000099 -1.54861
2 0.000121 -1.71336
3 0.000137 -1.81642
4 0.000148 -1.89393
...
11 0.000169 -2.02324
12 0.000170 -2.02498
13 0.000169 -2.02239
14 0.000168 -2.01671

A similar idea is to improve the estimate of the trend using GLS. Probably
the most popular form is the test developed in Elliott, Rothenberg, and Stock
(1996). Since first differencing the data is inappropriate if the data are
trend-stationary, they quasi-difference the data using a filter which is local
to unity (close to a unit root, but not quite), using $1 - \alpha L$ rather than
$1 - L$, where $\alpha = 1 - c$ and $c$ is close to 0. This is applying (7.5)
with an assigned value of $\rho$ (namely $\alpha$) and only for the purpose of
estimating $c_0$ and $c_1$. ERS recommend a value of $c = 13.5/T$ if there is a
time trend in the deterministic process and $c = 7/T$ if an intercept only is
used. The detrended filtered data are then subjected to a Dickey-Fuller
test (without any deterministics).
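As a rough illustration of the local-to-unity filter, the following Python sketch quasi-differences a series using the recommended values of $c$. The function name is hypothetical, and keeping the first observation unchanged is only one possible convention; it is an assumption here, not a description of what @ERSTEST does.

```python
def ers_quasi_difference(y, trend=True):
    """Quasi-difference y with alpha = 1 - c, c = 13.5/T (trend) or 7/T (constant)."""
    n = len(y)
    c = (13.5 if trend else 7.0) / n
    alpha = 1.0 - c
    # Assumption: the first observation is kept as is; the remaining
    # observations are y_t - alpha * y_{t-1}.
    filtered = [y[0]] + [y[t] - alpha * y[t - 1] for t in range(1, n)]
    return alpha, filtered


series = [float(t) for t in range(100)]  # hypothetical data with T = 100
alpha, filtered = ers_quasi_difference(series, trend=True)
print(alpha, filtered[:3])
```

The same filter would be applied to the deterministic regressors (the constant and the trend) before estimating $c_0$ and $c_1$ by least squares on the filtered data.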
The ERS test can be done using the procedure @ERSTEST:

@erstest(det=trend,lags=2) ly

This uses the recommended filter coefficient of c = 13.5/T for the model with
DET=TREND. @ERSTEST has a CBAR option which can be used to override the
default numerator value of 13.5 for DET=TREND and 7.0 for DET=CONSTANT.
The output includes several variants which differ in the treatment of the first
observation.
DF-GLS Tests, Dependent Variable LY
From 1960:01 to 2012:04
Using 2 lags
Detrend = constant and linear time trend, z(t)=(1,t)
Tests for a unit root null. All tests reject null in lower tail
Critical values (asymptotic)
Elliott et al (1996 Econometrica)
1%(**) 2.5% 5%(*) 10%
PT 17.411 3.96 4.78 5.62 6.89
DFGLS -1.279 -3.48 -3.15 -2.89 -2.57
Elliott (IER 1999)
QT 11.901 2.05 2.44 3.15 3.44
DFGLSu -2.274 -3.71 -3.41 -3.17 -2.91

Note that there is a separate @GLSDETREND procedure which can be used to just
do the GLS detrending of the data. This allows for the constant and constant
and trend deterministics, and also for a trend with a break in it.
All of the tests described so far have had the unit root as the null. This makes
sense since it's the simple hypothesis while the alternative of stationarity is
composite. However, it is possible to construct a test with a null of stationarity;
this is shown in Kwiatkowski, Phillips, Schmidt, and Shin (1992). The variance
of the deviations from trend for a stationary process is bounded, while
it is unbounded for a non-stationary process, so if the process wanders too far
to be compatible with stationarity, we conclude that it's non-stationary. This is
known as the KPSS test and can be done using the @KPSS procedure. As with
the Phillips-Perron and Schmidt-Phillips tests, this requires a lag window
estimator. Use the LMAX option to get a sensitivity table (again, we removed part
of this). The test statistics are all significant which, since the hypothesis is
reversed, is consistent with the previous results.

@kpss(det=trend,lmax=12) ly

KPSS Test for Stationarity about Trend, Series LY


From 1960:01 to 2012:04
Observations 212

Sig Level Crit Value


1%(**) 0.216000
2.5% 0.176000
5%(*) 0.146000
10% 0.119000

Lags TestStat
0 1.985801**
1 1.013917**
2 0.689268**
...
9 0.245777**
10 0.229726**
11 0.216616**
12 0.205793*
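The structure of the KPSS statistic (squared partial sums of the residuals, scaled by a lag-window estimate of the long-run variance) can be sketched as follows. This hedged Python example covers only the simpler level-stationary case, where the residuals are deviations from the sample mean; the example above uses DET=TREND, which would use residuals from a regression on a constant and trend instead. The function name and data are hypothetical.

```python
def kpss_level_statistic(y, lags):
    """KPSS statistic for stationarity around a level, Bartlett window with `lags` lags."""
    n = len(y)
    mean = sum(y) / n
    resid = [v - mean for v in y]
    # Partial sums of the residuals.
    partial, running = [], 0.0
    for e in resid:
        running += e
        partial.append(running)
    # Bartlett-window estimate of the long-run variance of the residuals.
    lrv = sum(e * e for e in resid) / n
    for j in range(1, lags + 1):
        gamma_j = sum(resid[t] * resid[t - j] for t in range(j, n)) / n
        lrv += 2.0 * (1.0 - j / (lags + 1.0)) * gamma_j
    # Large values indicate partial sums that wander too far for stationarity.
    return sum(p * p for p in partial) / (n * n * lrv)


print(kpss_level_statistic([1.0, -1.0, 1.0, -1.0], 0))
```

Because the long-run variance appears in the denominator, the statistic typically falls as more lags are added when the residuals are positively autocorrelated, which is the pattern in the table above.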

7.3 Tests with Breaks


If there is a structural break in an otherwise stationary series, the
Dickey-Fuller test is biased towards falsely accepting the null hypothesis of a
unit root. This is the result from Perron (1989) and has been studied
extensively since then. The intuition is that unit root behavior is more
persistent than stationary behavior, and that's what the DF and other tests are
detecting. However, a change to the data process partway through the range is
also persistent, in a different way, but enough to produce false acceptances.
Section 4.8 of Enders (2010) simulates 100 observations representing the
breaking process
$$ y_t = 0.5 y_{t-1} + \varepsilon_t + D_L $$
where $D_L$ is a level-shift dummy variable such that $D_L = 0$ for
$t = 1, \ldots, 50$ and $D_L = 1$ thereafter. The data (on the file BREAK.XLS)
look like Figure 7.2. Notice that the series fluctuates around a mean of zero
for the first 50 realizations and around a mean of 2 thereafter.
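The two means follow directly from the AR(1) structure. Assuming the shock has mean zero, taking expectations of the process within each regime (where the mean $\mu$ is constant) gives:

```latex
\mu = 0.5\,\mu + D_L
\quad\Longrightarrow\quad
\mu = \frac{D_L}{1 - 0.5} = 2 D_L ,
\qquad
\mu = 0 \ \text{for } D_L = 0, \quad \mu = 2 \ \text{for } D_L = 1 .
```

So a unit shift in the intercept produces a shift of 2 in the long-run mean, reached gradually over the periods following the break.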
If we don't take into account the break (which is here quite obvious, but in
practice might not be) and compute the autocorrelations we get:
corr(number=6,picture="*.##") y1

Correlations of Series Y1

Autocorrelations
1 2 3 4 5 6
0.94 0.89 0.86 0.83 0.80 0.77

which are much higher than they should be for an AR model with a .5
parameter. If we split the sample, we get the following for the first 50 data
points and the last 50, both in line with what we would expect:
Figure 7.2: Simulated Data with Broken Mean

Autocorrelations
1 2 3 4 5 6
0.49 0.11 0.11 0.06 -0.18 -0.27

Autocorrelations
1 2 3 4 5 6
0.51 0.27 0.04 -0.00 -0.06 -0.12

The high value for the autocorrelation is because the full sample mean is
roughly 1, which is too high for the first half of the data, and too low for the
second. So deviations from the mean are all negative for the first half and all
positive for the second. There's only one data point out of 100 (at $t = 51$)
where $(x_t - \bar{x})(x_{t-1} - \bar{x})$ is negative. If we run a standard
Dickey-Fuller test:

@dfunit(max=6,det=constant) y1

we would (incorrectly) accept the unit root:


Dickey-Fuller Unit Root Test, Series Y1
Regression Run From 8 to 100
Observations 94
With intercept
Using fixed lags 6

Sig Level Crit Value


1%(**) -3.50063
5%(*) -2.89217
10% -2.58290

T-Statistic -1.64806
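The effect of the unmodeled mean shift on the autocorrelations can be reproduced with a small simulation. The Python sketch below is illustrative only: it does not use the BREAK.XLS data, and the shock standard deviation (0.25) and the seed are arbitrary assumptions, so the numbers will differ from those shown above.

```python
import random


def lag1_autocorrelation(x):
    """First-order sample autocorrelation using deviations from the sample mean."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    num = sum(dev[t] * dev[t - 1] for t in range(1, n))
    den = sum(d * d for d in dev)
    return num / den


def simulate_break(n=100, break_after=50, sigma=0.25, seed=12345):
    """Simulate y_t = 0.5*y_{t-1} + eps_t + D_L with D_L = 1 after `break_after`."""
    rng = random.Random(seed)
    y, prev = [], 0.0
    for t in range(1, n + 1):
        level = 1.0 if t > break_after else 0.0
        prev = 0.5 * prev + rng.gauss(0.0, sigma) + level
        y.append(prev)
    return y


y = simulate_break()
full = lag1_autocorrelation(y)
first = lag1_autocorrelation(y[:50])
second = lag1_autocorrelation(y[50:])
print(round(full, 2), round(first, 2), round(second, 2))
```

The full-sample autocorrelation is dominated by the difference between the two regime means, while each half-sample, measured around its own mean, reflects the AR(1) coefficient.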

Perron's original paper dealt with the (relatively) simple case of a break at a
known location; however, because the target series were trending U.S.
macroeconomic data, his analysis was based upon various ways that a trending
series could break. His idea, applied to a non-trending model, is to first
regress on an appropriate function for the mean and take the residuals. Since
the residuals should now be mean zero, you run a Dickey-Fuller test on them
using no deterministics. In our case, the setup would be [7]:

set du = t>50
linreg y1
# constant du
set ytilde = %resids

To do an ADF test on YTILDE, we can use:

@adfautoselect(print,det=none,max=6) ytilde

Information Criteria for ADF Lag Lengths, Series YTILDE


Lags AIC BIC HQ MAIC ADF
0 -2.612* -2.584* -2.601* -1.848* -6.009
1 -2.590 -2.536 -2.568 -1.826 -5.235
2 -2.573 -2.492 -2.540 -1.700 -4.943
3 -2.556 -2.447 -2.512 -1.579 -4.719
4 -2.543 -2.407 -2.488 -1.394 -4.682
5 -2.532 -2.369 -2.466 -1.166 -4.708
6 -2.514 -2.324 -2.437 -0.989 -4.496

Not surprisingly (since the actual data generating process is an AR(1)), all the
criteria agree that we don't need any augmenting lags; the one that's always
included is enough. The Dickey-Fuller statistic is -6.009, which would suggest a
strong rejection of the unit root. However, we can't use standard Dickey-Fuller
tables. In fact, a major complication is that the critical values depend upon

1. The base deterministic model (trending or not)
2. The type of break
3. The location of the break within the data set

Perron's paper did not include a non-trending model, but the test statistic is
well beyond the critical values.
The issue is more complicated when the break date is unknown since it needs
to be estimated along with the other parameters of the model. Papers which
followed Perron's original work almost always allowed for the unknown break.
While they differ in certain parts of the calculation, the standard procedure
is to search over the interior of the sample (leaving out some fraction of the
sample at each end, typically 15%) for the break point which is least favorable
to the unit root hypothesis, that is, the one with the most negative test
statistic. Note that this can involve a great deal of calculation since it
requires running a sequence of unit root tests where each one typically requires
several regressions in order to set the lag length.
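The search just described has a simple generic shape, sketched below in Python. The names `candidate_breaks` and `search_break` are hypothetical, the exact trimming rule is an assumption, and the `toy_stat` function merely stands in for a real unit root test computed at each candidate break.

```python
import math


def candidate_breaks(n_obs, trim=0.15):
    """Entries eligible as break points after trimming a fraction at each end."""
    skip = math.floor(trim * n_obs)
    return list(range(skip + 1, n_obs - skip + 1))


def search_break(n_obs, unit_root_stat, trim=0.15):
    """Return the break with the most negative statistic, and that statistic."""
    best_break, best_stat = None, float("inf")
    for b in candidate_breaks(n_obs, trim):
        stat = unit_root_stat(b)  # one full unit root test per candidate
        if stat < best_stat:
            best_break, best_stat = b, stat
    return best_break, best_stat


# Stand-in for a real test statistic: most negative at entry 40.
def toy_stat(b):
    return -4.0 + ((b - 40) / 20.0) ** 2


print(search_break(100, toy_stat))
```

Because the reported statistic is the minimum over many candidates, it cannot be compared with standard Dickey-Fuller critical values; the procedures below use critical values derived for the minimum.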
[7] By convention, the date of the break is typically the period before the change.

We can illustrate the Zivot and Andrews (1992) test that allows for a single
break (at an unknown date) in the intercept, in the trend, or in both
deterministic regressors. This generalizes the Dickey-Fuller procedure. The
syntax for the procedure @ZIVOT is

@ZIVOT( options ) series start end

The most important options are

BREAK=[INTERCEPT]/TREND/BOTH
CRIT=[INPUT]/AIC/BIC/TTEST
PI=trimming fraction [.15]
LAGS=number of augmenting lags [0]
GRAPH/[NOGRAPH]
SIGNIF=significance level for the t-test in GTOS

This is also based upon the assumption that the series is trending. The three
choices for the BREAK option allow:

INTERCEPT Discrete change in level but no change in trend rate


TREND Change in trend rate, but no immediate change in level
BOTH Simultaneous change to both

To determine whether real U.S. GDP is stationary around a broken trend, we
can use the following. Although the default trimming fraction is 0.15, we can
use PI=.10 to better capture the possibility of a break at the time of the 2008
financial crisis.
@zivot(break=trend,crit=aic,pi=0.1,graph) ly

Zivot-Andrews Unit Root Test, Series LY


Allowing for Break in Trend Only
Breaks Tested for 1965:04 to 2007:04
Including 2 Lags of Difference
Selected by AIC

Sig Level Crit Value


1%(**) -4.93000
5%(*) -4.42000

Breakpoint TestStat
2004:04 -4.01972

The break date yielding the minimum t-statistic for the lag coefficient is
2004:04, where the t-statistic is -4.01972. The procedure defines %%BREAKPOINT
as the chosen break entry [8]. Given the critical values, it is not possible to
reject the null hypothesis of a unit root at the 5% level.
[8] The @ZIVOT procedure defines the break point as the first point affected by
the break rather than the last point before it, in keeping with the convention
used by the authors.
Figure 7.3: Zivot-Andrews Unit Root Tests for log GDP

To replicate the regression, create the trend shift dummy (TS) and add it to the
regression:

set ts = %max(t-(%%breakpoint-1),0.0)
linreg dly
# ly{1} constant trend dly{1 to 2} ts
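The trend-shift dummy built by the SET instruction above is zero up to the entry before the break point and then counts 1, 2, 3, and so on. A tiny Python sketch of the same formula, with a hypothetical break at entry 5 of a 7-observation sample, makes the timing explicit:

```python
def trend_shift(n_obs, breakpoint):
    """Trend-shift dummy: max(t - (breakpoint - 1), 0) for t = 1, ..., n_obs."""
    return [max(t - (breakpoint - 1), 0) for t in range(1, n_obs + 1)]


print(trend_shift(7, 5))  # [0, 0, 0, 0, 1, 2, 3]
```

The dummy first becomes nonzero at the break entry itself, which matches the convention that the break point is the first observation affected by the break.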

Similar results hold if we allow for multiple endogenous breaks. Lee and
Strazicich (2003) develop an LM test allowing for two endogenous breaks. First
estimate the Crash model (so that there is an abrupt change in the level of the
series):

@lsunit(model=crash,breaks=2,lags=2) ly

Lee-Strazicich Unit Root Test, Series LY


Regression Run From 1960:04 to 2012:04
Observations 209
Crash Model with 2 breaks
Estimated with fixed lags 2

Variable Coefficient T-Stat


S{1} -0.0351 -2.8921
Constant 0.0106 8.3275
D(1970:04) 0.0250 3.2399
D(1980:03) 0.0204 2.6187

The test statistic is the t on the lagged (detrended) level S{1}. Since the 5%
critical value reported in Lee and Strazicich is -3.842, we cannot reject the
null hypothesis of a unit root with a breaking trend. The estimated break dates
are 1970:4 and 1980:3. The more general BREAK model, which allows both the
intercept and slope of the trend to change, can be obtained with

@lsunit(model=break,breaks=2,lags=2) ly

Lee-Strazicich Unit Root Test, Series LY


Regression Run From 1960:04 to 2012:04
Observations 209
Trend Break Model with 2 breaks
Estimated with fixed lags 2

Variable Coefficient T-Stat


S{1} -0.1280 -4.7369
Constant 0.0086 4.6473
D(1965:03) 0.0120 1.5946
DT(1965:03) 0.0002 0.0976
D(2007:03) 0.0058 0.7571
DT(2007:03) -0.0107 -5.3391

This is still insignificant, as the 5% critical value is -5.286.


Two-break tests can require a very substantial amount of computation. If you
have $T$ data points, the number of unit root tests that need to be run is on the
order of $T(T + 1)/2$ since every pair of break points [9] needs to be examined.
For $T = 2000$, this is about two million. In this case, we used a fixed number
of lags, but the recommendation is to search for the lag length for each pair of
breaks (using the METHOD=GTOS option), which increases the computational burden.
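The size of the search is easy to check. The Python sketch below computes the upper bound $T(T+1)/2$ and, for comparison, counts the pairs that survive trimming. The trimming and minimum-gap rule used in `admissible_pairs` is a hypothetical stand-in, not the rule implemented by @LSUNIT.

```python
import math


def full_pair_count(n_obs):
    """Upper bound on the number of break pairs: T(T+1)/2."""
    return n_obs * (n_obs + 1) // 2


def admissible_pairs(n_obs, trim=0.10):
    """Count pairs (b1, b2) inside the trimmed sample that are at least
    floor(trim * T) entries apart (an assumed rule, for illustration only)."""
    skip = math.floor(trim * n_obs)
    candidates = range(skip + 1, n_obs - skip + 1)
    count = 0
    for b1 in candidates:
        for b2 in candidates:
            if b2 - b1 >= skip:
                count += 1
    return count


print(full_pair_count(2000))   # 2001000
print(admissible_pairs(100))   # 2485
```

Each counted pair requires at least one regression, and several if the lag length is chosen separately for every pair.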
The point of unit root tests with breaks (particularly multiple breaks) is
often misunderstood. These should be applied as a specification test on more
standard unit root tests if those have accepted the unit root. If you've already
rejected the unit root, there is no particular reason to apply a test with a break
since, as Perron shows, the break biases the basic test in favor of the unit root.
There are several other procedures for doing unit root tests with breaks.
@PERRONBREAKS is quite general and allows for additional types of breaks.
@LPUNIT does the Lumsdaine-Papell test which generalizes Zivot-Andrews to
two breaks.

7.4 Two Univariate Decompositions


The trend included in (7.3) is obviously rather crude, and if the process has
a unit root in the $Z$, the data can wander arbitrarily far from it. For many
purposes, it's useful to have a way to extract a trend which leaves a stationary
residual. This section describes two such calculations.

7.4.1 Hodrick-Prescott Filter

Hodrick and Prescott (1997) develop a procedure to extract a time-varying
trend from a nonstationary series. To use the Hodrick and Prescott (HP) filter,
suppose that you want to decompose $\{y_t\}$ into a trend component, $\mu_t$,
and a stationary component $y_t - \mu_t = s_t$. Consider the sum of squares
$$ \frac{1}{T} \sum_{t=1}^{T} (y_t - \mu_t)^2 + \frac{\lambda}{T} \sum_{t=2}^{T-1} \left[ (\mu_{t+1} - \mu_t) - (\mu_t - \mu_{t-1}) \right]^2 $$
where $\lambda$ is a constant and $T$ is the number of observations.

[9] Breaks near the end are excluded, as are pairs of breaks that are too close
together. Both exclusions are controlled by the PI option.
The problem is to select the $\{\mu_t\}$ sequence so as to minimize this sum of
squares. In the minimization problem, $\lambda$ reflects the cost or penalty of
incorporating fluctuations into the trend. For quarterly data, Hodrick and
Prescott suggest $\lambda = 1600$, and for monthly data $\lambda = 14400$.
Increasing the value of $\lambda$ acts to smooth out the trend. For example, if
$\lambda = 0$, the sum of squares is minimized when $y_t = \mu_t$, and as
$\lambda \to \infty$, the sum of squares is minimized when
$(\mu_{t+1} - \mu_t) = (\mu_t - \mu_{t-1})$. Since the change in the trend is
then constant, the HP trend degenerates into a linear time trend.
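To see what the minimization involves, note that the first-order conditions are linear: the trend solves $(I + \lambda K'K)\mu = y$, where $K$ is the second-difference matrix. The following pure-Python sketch solves that system directly by Gaussian elimination. It is meant only to illustrate the calculation on short series (the dense solve is slow for long ones); it is not how the FILTER instruction is implemented, and the function name is hypothetical.

```python
def hp_filter(y, lam=1600.0):
    """Return the HP trend of y by solving (I + lam * K'K) * trend = y."""
    n = len(y)
    # Build A = I + lam * K'K, where each row of K is (1, -2, 1).
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        a[i][i] = 1.0
    for r in range(n - 2):
        coeffs = {r: 1.0, r + 1: -2.0, r + 2: 1.0}
        for i, ci in coeffs.items():
            for j, cj in coeffs.items():
                a[i][j] += lam * ci * cj
    # Gaussian elimination with partial pivoting.
    b = list(y)
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            if factor != 0.0:
                for c in range(col, n):
                    a[r][c] -= factor * a[col][c]
                b[r] -= factor * b[col]
    # Back substitution.
    trend = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(a[r][c] * trend[c] for c in range(r + 1, n))
        trend[r] = (b[r] - tail) / a[r][r]
    return trend


data = [1.0, 1.8, 2.1, 3.3, 3.9, 5.2, 5.8, 7.1]  # hypothetical series
trend = hp_filter(data, lam=1600.0)
cycle = [data[t] - trend[t] for t in range(len(data))]
print([round(v, 3) for v in trend])
```

Two properties from the text are easy to confirm with this sketch: with `lam=0` the trend equals the data, and a series that is already an exact linear trend is returned unchanged for any `lam`.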
The HP filter can be computed in RATS using the FILTER instruction with the
option TYPE=HP. The value of $\lambda$ is input using the additional option
TUNING=value. However, you generally don't need the TUNING option, since the
default adjusts based upon the CALENDAR frequency [10].

filter(type=hp) ly / hp_trend
set cycle = ly - hp_trend

This filters the LY series and returns the trend in the series named HP_TREND.
The next instruction constructs the CYCLE as the difference between the two.
The code below constructs three graphs. The first, using LY and HP_TREND, is
not shown below because it is difficult to discern the difference between the two
series in a small graph. If you examine the graph of CYCLE, it does generally
resemble the NBER business cycles. However, it is hard to believe that the
post-2010 period has been one in which the economy operated above its trend.
In the third graph, the difference between the HP_TREND and potential GDP is
shown to be small except for the most recent period [11].
graph(header="GDP and Trend") 2
# hp_trend
# ly
spgraph(hfields=2,header="HP Filter for GDP")
graph(header="HP Cycle")
# cycle
graph(header="HP Trend and Potential GDP",key=below,nokbox) 2
# hp_trend
# lpot
spgraph(done)

[10] Some older RATS programs might use the @HPFILTER procedure. We would
recommend using the built-in option as part of FILTER.
[11] The HP filter is a two-sided filter; as such, its estimates would not be
expected to be as precise near the two ends of the data, where there is
information only in one direction.
Figure 7.4: HP Filter for GDP (left panel: HP Cycle; right panel: HP Trend and
Potential GDP, showing HP_TREND and LPOT)

7.4.2 The Beveridge and Nelson Decomposition

Beveridge and Nelson (1981) use an alternative decomposition method that
forces the trend to be the random walk with drift:
$$ \mu_t = a_0 + \mu_{t-1} + \varepsilon_t $$
so that the s-step-ahead conditional forecast of the trend is
$$ E_T \mu_{T+s} = \mu_T + a_0 s = y_T - cycle_T + a_0 s $$
where $cycle_T = y_T - \mu_T$.
As described in Enders (2010), in order to use the Beveridge and Nelson (BN)
decomposition, you need to

1. Estimate the first difference of $y_t$ as an ARMA(p, q) process.
2. For each period $t = 1, \ldots, T$, find the one- through s-step-ahead
   forecasts (i.e., find $E_t \Delta y_{t+s}$ for every value of $t$ and $s$).
   In practice, $s$ is usually set equal to 100. For each $t$, use the estimated
   ARMA(p, q) model to construct the long-run forecast
   $\mu_t = E_t [\Delta y_{100+t} + \Delta y_{100+t-1} + \ldots + \Delta y_{t+1}] + y_t$.
   Hence, the mean at $t$ is the current value of $y_t$ plus the sum of the
   forecasted changes.
3. Form the variable $cycle_t$ by subtracting $\mu_t$ from $y_t$.
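The steps above can be sketched for the simplest case, where the first difference follows an AR(1) around its mean. In the Python example below, `phi` and `drift` are assumed to be already estimated (step 1), and the forecast changes are measured in deviations from the drift, which is the usual way of keeping the deterministic growth out of the cycle; that treatment is an assumption about how the steps are meant to be applied. The function name and data are hypothetical.

```python
def bn_trend_ar1(y, phi, drift, horizon=100):
    """BN trend when (dy_t - drift) = phi * (dy_{t-1} - drift) + e_t.

    Returns the trend for entries 2, ..., T (the first entry has no difference).
    """
    trend = []
    for t in range(1, len(y)):
        dev = (y[t] - y[t - 1]) - drift
        # The j-step-ahead forecast of the change, net of drift, is phi**j * dev.
        forecast_sum = sum(phi ** j * dev for j in range(1, horizon + 1))
        # Trend = current level plus the sum of forecasted (excess) changes.
        trend.append(y[t] + forecast_sum)
    return trend


y = [0.0, 1.0, 3.0]  # hypothetical data
trend = bn_trend_ar1(y, phi=0.5, drift=1.0)
cycle = [y[t + 1] - trend[t] for t in range(len(trend))]
print(trend, cycle)
```

With `phi = 0` the changes are unforecastable, the trend equals the series itself, and the cycle is identically zero. A positive `phi` puts the trend above the series whenever growth is currently above average, so the cycle is then negative.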

The process is simple if you use the procedure @BNDECOMP. The usual syntax is

@BNDECOMP( options ) y start end bntrend


Figure 7.5: BN Decomposition for GDP (left panel: BN Cycle; right panel: The BN
Trend and Potential GDP, showing BN_TREND and LPOT)

where y is the input series and bntrend is the estimated trend.

AR=number of AR lags [1]


MA=number of MA lags [0]

We can decompose the log of real GDP with the BN decomposition using

@bndecomp(ar=2) ly / bn_trend

For comparison purposes, we can construct the BN cycle and also compare the
BN trend to potential GDP using the code

* The BN cycle is the series minus the BN trend
set bn_cycle = ly - bn_trend
spgraph(hfields=2,vfields=1,header="BN Decomposition for GDP")
graph(header="BN Cycle") 1
# bn_cycle 3 *
graph(header="The BN Trend and Potential GDP",key=upleft) 2
# bn_trend
# lpot
spgraph(done)

As illustrated in the left-hand panel, it is typical for the BN cycle to be quite
jagged. Nevertheless, the graph of BN_CYCLE does seem to do reasonably well
with the post-financial crisis data. Moreover, the BN trend does seem to fall
sharply at the time of the financial crisis.

7.5 Cointegration
Cointegration, in the usual sense of the term, occurs when there is a linear
combination of nonstationary I(1) variables that is stationary. Engle and Granger
(1987) show that the existence of such a stationary relationship means that
the dynamic paths of the nonstationary variables must be linked. Let $x_t$ be
the $n \times 1$ vector consisting of the I(1) variables
$(x_{1t}, x_{2t}, \ldots, x_{nt})'$ and let $\beta$ be the $1 \times n$ vector of
parameters $(\beta_1, \beta_2, \ldots, \beta_n)$. The system is said to be in
long-run equilibrium when
$$ \beta_1 x_{1t} + \beta_2 x_{2t} + \ldots + \beta_n x_{nt} = 0 $$
Denote the deviation from long-run equilibrium as $e_t$, so that we can write
$$ e_t = \beta x_t \tag{7.6} $$
For the equilibrium to be meaningful, the deviations from equilibrium must
converge toward zero. As such, it must be the case that
$$ \Delta e_t = a_1 e_{t-1} + \sum_{i=1}^{p} a_{1+i} \Delta e_{t-i} + v_t \tag{7.7} $$
where $-2 < a_1 < 0$ and $v_t$ is an i.i.d. error term.
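The restriction on $a_1$ is just the stability condition for the deviation. In the stripped-down case with no lagged changes and no shocks, (7.7) reduces to $e_t = (1 + a_1) e_{t-1}$, which the following Python sketch iterates for a few values of $a_1$. The function name and starting value are hypothetical.

```python
def simulate_deviation(a1, e0=1.0, periods=20):
    """Iterate e_t = (1 + a1) * e_{t-1}, the shock-free version of (7.7) with p = 0."""
    path = [e0]
    for _ in range(periods):
        path.append((1.0 + a1) * path[-1])
    return path


for a1 in (-0.5, 0.0, -2.5):
    print(a1, simulate_deviation(a1)[-1])
```

Only values strictly between -2 and 0 make $|1 + a_1| < 1$, so that an initial deviation dies out. With $a_1 = 0$ the deviation never closes (the unit root case), and beyond -2 it oscillates explosively.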


Since $e_t$ is a linear combination of the various $x_{it}$, (7.7) indicates that
the time paths of the cointegrated variables are influenced by the last period's
deviation from long-run equilibrium. With some manipulation, it can be shown that
the dynamic adjustment mechanism has the form
$$ \Delta x_t = \Pi x_{t-1} + A(L) \Delta x_{t-1} + \varepsilon_t \tag{7.8} $$
where $\Pi$ is an $n \times n$ matrix with elements $\pi_{ij}$ and $A(L)$ is an
$n \times n$ matrix with elements that are polynomials in the lag operator $L$.
For example, consider the simple first-order VAR
$$ x_t = A x_{t-1} + \varepsilon_t $$
Subtracting $x_{t-1}$ from each side of the equation yields
$$ \Delta x_t = -(I - A) x_{t-1} + \varepsilon_t $$
Defining $\Pi = -(I - A)$, it immediately follows that
$\Delta x_t = \Pi x_{t-1} + \varepsilon_t$.
Since $\Delta x_t$ is stationary, each expression on the right-hand side of (7.8)
must be stationary as well. For $\Pi x_{t-1}$ to be stationary, either all
elements of $\Pi$ must be zero or each row of $\Pi$ must be a cointegrating
vector of $x_t$. Hence, a key feature of (7.8) is the rank of $\Pi$.
If the rank of $\Pi$ is 0, every element of $\Pi$ is zero, so (7.8) becomes
nothing more than a vector autoregression (VAR) in first differences. This is
clearly inconsistent with the notion that the variables are cointegrated. In such
circumstances, there is no error correction since $\Delta x_t$ does not respond to
the previous period's deviation from long-run equilibrium. Alternatively, if
rank($\Pi$) $= n$, the variables cannot be I(1). If $\Pi$ is of full rank, the
long-run solution to (7.8) is $\Pi x_{t-1} = 0$. As such, there are $n$ distinct
equations that can be used to solve for the $n$ long-run values of the $x_{it}$.
For cointegration to occur, it is necessary that $0 <$ rank($\Pi$) $= r < n$.
Hence, rank($\Pi$) is equal to the number of independent cointegrating vectors.
Since $\Pi x_{t-1}$ does not vanish, one or more of the $x_{it}$ must respond to
the previous period's deviation from long-run equilibrium. It is important to
note that cointegration implies that estimating a VAR entirely in first
differences is inappropriate: the model in (7.8) without the expression
$\Pi x_{t-1}$ (that is, the model in first differences) is misspecified.
To take a specific example, suppose rank($\Pi$) $= 1$ and that we can ignore the
term $A(L) \Delta x_{t-1}$ in (7.8). The $i$th row of (7.8) can be written in
error-correction form
$$ \Delta x_{it} = \pi_{i1} x_{1t-1} + \pi_{i2} x_{2t-1} + \ldots + \pi_{in} x_{nt-1} + \varepsilon_{it} $$
If we factor out
$\alpha_i = \pi_{i1}/\beta_1 = \pi_{i2}/\beta_2 = \ldots = \pi_{in}/\beta_n$,
we can write
$$ \Delta x_{it} = \alpha_i [\beta_1 x_{1t-1} + \beta_2 x_{2t-1} + \ldots + \beta_n x_{nt-1}] + \varepsilon_{it} = \alpha_i e_{t-1} + \varepsilon_{it} \tag{7.9} $$
You can see that each $x_{it}$ adjusts in the constant proportion $\alpha_i$ of
the previous period's deviation from long-run equilibrium. In (7.9) the value of
$\alpha_i$ is the factor loading or speed of adjustment term. The larger is
$\alpha_i$, the larger the response of $x_{it}$ to last period's deviation from
long-run equilibrium. If $\alpha_i = 0$, the variable $x_{it}$ is said to be
weakly exogenous.
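A small numerical case may help fix the rank-one factorization. The Python sketch below uses a made-up first-order VAR coefficient matrix, forms $\Pi = A - I$ (the same thing as $-(I - A)$), confirms that its determinant is zero (rank one), and recovers the loadings $\alpha_i$ for the cointegrating vector $\beta = (1, -1)$. All numbers are hypothetical.

```python
def pi_from_var1(a):
    """Pi = A - I for a first-order VAR x_t = A x_{t-1} + e_t."""
    n = len(a)
    return [[a[i][j] - (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]


a = [[0.8, 0.2],
     [0.3, 0.7]]
pi = pi_from_var1(a)

# A zero determinant with a nonzero matrix means rank(Pi) = 1.
det = pi[0][0] * pi[1][1] - pi[0][1] * pi[1][0]

# Normalize the cointegrating vector on the first variable: beta = (1, -1).
beta = [1.0, -1.0]
alpha = [pi[i][0] / beta[0] for i in range(2)]

print(pi, det, alpha)
```

Here each row of $\Pi$ is $\alpha_i \beta$, so both variables respond to the same lagged deviation $x_{1,t-1} - x_{2,t-1}$: the first falls by 0.2 of the gap and the second rises by 0.3 of it.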

7.5.1 The Engle-Granger Methodology

The Engle-Granger cointegration test entails estimating (7.6) by OLS and saving
the residuals. Then, use the saved residuals to estimate an equation in the
form of (7.7). If you can reject the null hypothesis that $a_1 = 0$, you can
conclude that the deviations from the long-run equilibrium converge toward zero.
As such, the $x_{it}$ series are cointegrated.
To illustrate the Engle-Granger procedure, we will analyze the relationship
between the 3-month and 1-year interest rates. It is anticipated that both of
the series will be I(1) and that they are cointegrated. After all, the theory of
the term structure implies that the two rates cannot drift too far apart. After
reading the data, the first step (which is sometimes incorrectly skipped) is
to ensure that the interest rates are I(1); they can't be COintegrated if they
aren't integrated in the first place. This is readily accomplished using a
standard Dickey-Fuller test. Consider

@dfunit(maxlags=8,method=gtos,signif=0.05) tb3mo
@dfunit(maxlags=8,method=gtos,signif=0.05) tb1yr

Although the output is not shown here, you should find that the general-to-
specific method selects a lag-length of 7 for each variable. More importantly,
for each series, we cannot reject the null of a unit root. Respectively, the t-
statistics from the two tests are -1.61304 and -1.39320.

Given that both variables are I(1), we can use OLS to estimate the long-run
relationship and save the residuals. Since we will refer to this equation below,
we define it as rshort.

linreg(define=rshort) tb3mo
# constant tb1yr
set u = %resids

The most important part of the output is


Variable Coeff Std Error T-Stat Signif
************************************************************************************
1. Constant -0.186974297 0.047845507 -3.90788 0.00012559
2. TB1YR 0.935603746 0.007456159 125.48065 0.00000000

The choice of name RSHORT refers to the fact that the short-term interest rate
is taken to be the dependent variable. We could also have switched the roles to
get

linreg(define=rlong) tb1yr
# constant tb3mo

which will result in a similar but not identical equation (if you renormalize).
To estimate an equation in the form of (7.7), we can use the general-to-specific
method to determine the lag length. This selects a lag length of 6. Hence, you
can conduct the test using

diff u / du
linreg du
# u{1} du{1 to 6}

Variable Coeff Std Error T-Stat Signif


************************************************************************************
1. U{1} -0.371800057 0.077712165 -4.78432 0.00000335
2. DU{1} 0.227025861 0.081701928 2.77871 0.00598294
3. DU{2} -0.029765103 0.079218675 -0.37573 0.70751733
4. DU{3} 0.112752100 0.077105454 1.46231 0.14524160
5. DU{4} 0.120913953 0.075398989 1.60365 0.11038410
6. DU{5} -0.001189432 0.070357067 -0.01691 0.98652891
7. DU{6} -0.155839468 0.069639909 -2.23779 0.02634888

Since, by construction, the residuals series has a mean of zero, there is no need
to include an intercept term. The key point is that the coefficient on U{1} is
-0.3718 with a t-statistic of -4.784. Although it might seem appropriate to use a
Dickey-Fuller table to test the null hypothesis $a_1 = 0$, we are applying it to
the residuals from a regression equation; we do not know the actual $e_t$
sequence, only the estimated deviations from equilibrium. An ordinary
Dickey-Fuller table would be appropriate only if the true (or hypothesized)
values of the $\beta_i$ were used to construct the $e_t$ sequence. With two
variables and about 200 observations, the 5% critical value for the
Engle-Granger test is -3.368. As such, we can reject the null hypothesis of no
cointegration. Note that you can get the identical output using the procedure
@EGTEST.

@egtest(lags=6,det=constant)
# tb3mo tb1yr

Engle-Granger Cointegration Test


Null is no cointegration (residual has unit root)
Regression Run From 1961:04 to 2012:04
Observations 206
Using fixed lags 6
Constant in cointegrating vector
Critical Values from MacKinnon for 2 Variables

Test Statistic -4.78432**


1%(**) -3.95194
5%(*) -3.36688
10% -3.06609

or after saving the residuals from the long-run equilibrium relationship, you
can use
@egtestresids(det=constant,nvar=2,maxlags=8,method=gtos) u

where the NVAR option gives the number of variables in the cointegrating
system. This doesn't change the calculation, but does change the critical
values.
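For readers who want to see the two steps outside RATS, here is a hedged Python sketch of the Engle-Granger calculation on simulated data. It uses no augmenting lags (unlike the 6 used above), and the data generating process, seed, and function names are all made-up assumptions. The statistic is compared with the approximate 5% critical value of -3.368 quoted in the text for two variables.

```python
import random


def ols_line(y, x):
    """OLS of y on a constant and x; returns (intercept, slope, residuals)."""
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((v - xbar) ** 2 for v in x)
    sxy = sum((x[i] - xbar) * (y[i] - ybar) for i in range(n))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    resid = [y[i] - intercept - slope * x[i] for i in range(n)]
    return intercept, slope, resid


def residual_df_tstat(u):
    """t-statistic on u{1} when regressing du on u{1} (no constant, no lags)."""
    du = [u[t] - u[t - 1] for t in range(1, len(u))]
    lag = u[:-1]
    sxx = sum(v * v for v in lag)
    coef = sum(lag[i] * du[i] for i in range(len(du))) / sxx
    ssr = sum((du[i] - coef * lag[i]) ** 2 for i in range(len(du)))
    sigma2 = ssr / (len(du) - 1)
    return coef / (sigma2 / sxx) ** 0.5


# Simulated cointegrated pair: x is a random walk, y = 1 + 0.9 x + stationary noise.
rng = random.Random(2014)
x, y, walk, noise = [], [], 0.0, 0.0
for _ in range(200):
    walk += rng.gauss(0.0, 1.0)
    noise = 0.5 * noise + rng.gauss(0.0, 0.3)
    x.append(walk)
    y.append(1.0 + 0.9 * walk + noise)

_, slope, resid = ols_line(y, x)   # step 1: long-run relationship
tstat = residual_df_tstat(resid)   # step 2: unit root test on the residuals
print(slope, tstat, tstat < -3.368)
```

Because the residuals come from an estimated regression, the statistic is compared with Engle-Granger critical values rather than with an ordinary Dickey-Fuller table, exactly as described above.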
The next step is to estimate the error-correction model given by (7.8) and to
obtain the impulse responses and variance decompositions. One simple way
to estimate the lag length is to use the procedure @VARLAGSELECT. Note that
the procedure determines the lag length of a standard VAR written entirely in
levels though the result should be similar to what we would get if we went to
the (considerable) work involved in estimating a series of models with cointe-
gration. Consider the program statement

@varlagselect(crit=gtos,signif=0.05,lags=8)
# tb3mo tb1yr

If you enter the code as shown, you should obtain a lag length of 7. As such,
there are 6 augmented changes of each variable in (7.8). The estimation of an
error correction model is similar to that of estimating any VAR. The important
difference is the need to include the error correction term in the model. Use
the following steps:
Step 1: Estimate the long-run equilibrium relationship, using the DEFINE op-
tion on the LINREG instruction. This step allows you to pass the estimated
coefficients from LINREG to the VAR system. Thus, in the interest rate exam-
ple, we used

linreg(define=rshort) tb3mo
# constant tb1yr

Step 2: Set up the VAR system using the MODEL option on the SYSTEM instruc-
tion. Include in the system definitions the instruction

ect rshort

ECT is for Error Correction Term and is a description of the equilibrium condi-
tion.
For the example at hand, the Vector Error Correction Model (VECM) is set up
with:
system(model=rates)
variables tb3mo tb1yr
lags 1 to 7
ect rshort
end(system)

Note that there is no DETERMINISTIC instruction. The usual DET CONSTANT
in this case would be appropriate if we had data with a drift, but for the
(non-trending) interest rates, we have restricted the constant to be only in the
cointegrating relationship (RSHORT).
Note also that you set up the model in levels, not in first differences, and with
the full number of lags in the expanded VAR. The ESTIMATE instruction will
take care of the transformations necessary to estimate the model properly.
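The link between the lag length in levels and the number of lagged differences is the standard reparameterization of a VAR, written here with coefficient matrices $A_i$ and $\Gamma_i$ that are notation introduced only for this sketch:

```latex
x_t = A_1 x_{t-1} + A_2 x_{t-2} + \cdots + A_p x_{t-p} + \varepsilon_t
\quad\Longleftrightarrow\quad
\Delta x_t = \Pi x_{t-1} + \sum_{i=1}^{p-1} \Gamma_i \, \Delta x_{t-i} + \varepsilon_t ,
\qquad
\Pi = -\Bigl( I - \sum_{i=1}^{p} A_i \Bigr), \quad
\Gamma_i = -\sum_{j=i+1}^{p} A_j .
```

With $p = 7$ lags in levels, the error-correction form therefore contains 6 lagged changes of each variable, which is why the output below shows lags 1 to 6 of D_TB3MO and D_TB1YR.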
Step 3: Enter the appropriate ESTIMATE instruction. For the interest rate
example, we can use

estimate(noftests)

Variable Coeff Std Error T-Stat Signif


************************************************************************************
1. D_TB3MO{1} 0.888179056 0.232947909 3.81278 0.00018506
2. D_TB3MO{2} 0.219968337 0.234426212 0.93833 0.34925550
...
6. D_TB3MO{6} 0.027826363 0.203517804 0.13673 0.89138991
7. D_TB1YR{1} -0.461705290 0.230978920 -1.99891 0.04702895
8. D_TB1YR{2} -0.625887274 0.221697163 -2.82316 0.00525647
...
12. D_TB1YR{6} -0.148310245 0.194009611 -0.76445 0.44553885
13. EC1{1} 0.728038833 0.227517963 3.19992 0.00160855

The output shown above is obviously abbreviated. The important point is that
the error-correction term in the first equation is highly significant and that
the adjustment toward the long-run equilibrium relationship is quite rapid. In
response to a one unit discrepancy from long-run equilibrium, the short-term
rate adjusts by about 72.8 percent of the gap. The error correction term is not
significant in the TB1YR equation, implying that the long-term rate is weakly
exogenous.
Step 4: As in a standard VAR, you can obtain the impulse response functions
and variance decompositions.
Figure 7.6: VECM Responses to the T-Bill Rate (series: T-Bill and 1-Year)

The variance decompositions can be obtained using

errors(model=rates,results=errors,steps=24)

To graph the impulse responses you can use

compute implabels=||"T-Bill","1-Year"||
impulse(model=rates,results=impulses,steps=24)
graph(nodates,footer="Responses to the T-Bill Rate",$
key=below,klabels=implabels) 2
# impulses(1,1)
# impulses(2,1)

Alternatively, we can obtain all of the impulse responses with 95% confidence
intervals using the two procedures

@MCVARDoDraws(model=rates,draws=2000,steps=24)
@mcgraphirf(model=rates,shocks=||"to 3-month","to 1-year"||,$
varlabels=||"3-month","1-year"||,footer="Impulse Responses",$
center=median,percent=||.025,.975||)

7.5.2 The Johansen Procedure

Figure 7.7: VECM Responses with Error Bands (rows: responses of 3-month and
1-year; columns: shocks to 3-month and to 1-year)

Unlike the Engle-Granger test, the Johansen procedure seeks to determine the
rank of $\Pi$. This has three distinct advantages. The first is that there is no
need to treat one of the variables as the dependent variable in Step 1. After
all, it would have been possible to estimate the long-run equilibrium
relationship by reversing the role of the two interest rates. Specifically, we
could have used the estimates from

linreg(define=rlong) tb1yr
# constant tb3mo

A second problem with the Engle-Granger procedure is that it is a two-step
procedure. Instead, it is preferable to obtain the long-run relationship and the
short-run dynamics using a full information maximum likelihood estimator.
The third disadvantage of the Engle-Granger procedure is that it does not allow
you to determine the number of cointegrating vectors in the system.
The basic Johansen techniques can be implemented using the @JOHMLE proce-
dure. Its syntax is

@JOHMLE( options ) start end
# list of endogenous variables

The most important options are

LAGS=number of lags in the VAR
DETERM=NONE/[CONSTANT]/TREND/RC/RTREND
SEASONAL/[NOSEASONAL]
LOADINGS=the matrix of factor loadings
VECTORS=coefficients of the matrix

The deterministic variables in the VAR can include a CONSTANT, a constant and
a TREND, a constant restricted (RC) to the cointegrating vector (which is what
we'll use here), and a trend restricted to the cointegrating vector (RTREND).
CONSTANT (which allows for drift) and RC (which doesn't) are by far the two
most common choices.
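As a rough illustration of the two most common choices, the calls below differ only in the DETERM option. This is a sketch that reuses the tb3mo and tb1yr series and the lag length of 7 from this section, and it assumes the data have already been read in.

```rats
* Unrestricted constant (allows for drift in the levels)
@JohMLE(lags=7,determ=constant)
# tb3mo tb1yr
* Constant restricted to the cointegrating vector (no drift)
@JohMLE(lags=7,determ=rc)
# tb3mo tb1yr
```

Comparing the two sets of output side by side is one way to see how sensitive the rank tests are to the treatment of the constant.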

To estimate the relationship between the 1-year and 3-month interest rates
using the Johansen procedure, we can use
@JohMLE(lags=7,determ=rc,vectors=vectors)
# tb3mo tb1yr

Likelihood Based Analysis of Cointegration


Variables: TB3MO TB1YR
Estimated from 1961:04 to 2012:04
Data Points 205 Lags 7 with Constant restricted to Cointegrating Vector

Unrestricted eigenvalues and -T log(1-lambda)


Rank EigVal Lambda-max Trace Trace-95% LogL
0 -193.1365
1 0.1523 33.8836 35.4921 20.1600 -176.1947
2 0.0078 1.6085 1.6085 9.1400 -175.3904

Cointegrating Vector for Largest Eigenvalue


TB3MO TB1YR Constant
4.789823 -4.342510 0.051374

Notice that the two estimated eigenvalues, or characteristic roots, of the
$\pi$ matrix are 0.1523 and 0.0078. Recall that the rank of a matrix is equal to
the number of non-zero characteristic roots. The question is whether these two
roots (called $\lambda_1$, $\lambda_2$) are statistically different from zero.
Johansen shows that it is possible to construct the $\lambda_{trace}(r)$ test as

$$\lambda_{trace}(r) = -T \sum_{i=r+1}^{n} \ln(1 - \hat{\lambda}_i) \qquad (7.10)$$

where the $\hat{\lambda}_i$ are the estimated characteristic roots ordered from largest to
smallest and $T$ is the number of usable observations.
For a given value of $r$, the $\lambda_{trace}(r)$ statistic can be used to test the null hypothesis
of $r$ cointegrating vectors against the general alternative hypothesis that
the number of cointegrating vectors is greater than $r$. If there is no cointegration,
the estimated characteristic roots should be small so that the value of
$\lambda_{trace}$ should be small as well. In our example, we can test the null hypothesis
of zero cointegrating vectors (i.e., $r = 0$) against the alternative of one or two
cointegrating vectors. If we use equation (7.10) we obtain

$$\lambda_{trace}(0) = -205 \left[ \ln(1 - 0.1523) + \ln(1 - 0.0078) \right] = 35.48$$

Since 35.48 (notice the slight difference from the procedure's output because
of rounding errors) exceeds the 95% critical value of 20.1600, we can reject
the null hypothesis of no cointegration. To test the null of $r = 1$ (i.e., one
cointegrating vector) against the alternative of two cointegrating vectors, we
can form

$$\lambda_{trace}(1) = -205 \ln(1 - 0.0078) = 1.61$$

Since 1.61 (note the rounding error) is smaller than the 95% critical value of
9.14, we do not reject the null hypothesis and conclude that there is exactly
one cointegrating vector.
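The arithmetic behind the two trace statistics can be reproduced directly. The following is a minimal sketch that types in the rounded eigenvalues and the number of data points from the output above; the variable names (nobs, lambda1, lambda2, trace0, trace1) are chosen here for illustration only.

```rats
* Rounded values copied from the @JohMLE output
compute nobs    = 205.0
compute lambda1 = 0.1523
compute lambda2 = 0.0078
* Trace statistics for the nulls r=0 and r=1, as in equation (7.10)
compute trace0 = -nobs*(log(1.0-lambda1)+log(1.0-lambda2))
compute trace1 = -nobs*log(1.0-lambda2)
display "trace(0) =" trace0 "trace(1) =" trace1
```

Because the eigenvalues are rounded to four digits, the results match the Trace column of the procedure's output only to about two decimal places.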

An alternative test considered by Johansen is the $\lambda_{max}(r, r+1)$ test

$$\lambda_{max}(r, r+1) = -T \ln(1 - \hat{\lambda}_{r+1})$$

As opposed to (7.10), here the null is that there are exactly $r$ cointegrating
vectors against the alternative of $r+1$ cointegrating vectors. The test for exactly
one cointegrating vector can be constructed as

$$\lambda_{max}(0, 1) = -T \ln(1 - \hat{\lambda}_1) = -205 \ln(1 - 0.1523) = 33.87$$

Comparing this value to the 95% critical value of 19.96 reported in
Osterwald-Lenum (1992), we reject the null hypothesis of no cointegration and accept
the alternative hypothesis of exactly one cointegrating vector. The estimated
cointegrating vector, including the constant, is such that

$$4.789823\,TB3MO - 4.342510\,TB1YR + 0.051374 = 0$$
If we normalize with respect to the 3-month rate, we need to divide each term
by 4.789823. Thus,

com c1 = -vectors(3,1)/vectors(1,1)
com c2 = -vectors(2,1)/vectors(1,1)
dis c1 c2

As such, the normalized cointegrating vector is

$$TB3MO = -0.01073 + 0.90661\,TB1YR$$

In this case, it turns out that the maximum likelihood estimates are quite similar
to those obtained by OLS estimation of the long-run equilibrium relationship.
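The normalization can be checked by hand. Starting from the estimated vector reported by the procedure and solving for TB3MO:

```latex
\begin{aligned}
4.789823\,TB3MO - 4.342510\,TB1YR + 0.051374 &= 0 \\
TB3MO &= \frac{4.342510}{4.789823}\,TB1YR - \frac{0.051374}{4.789823} \\
      &\approx 0.90661\,TB1YR - 0.01073
\end{aligned}
```

These are the two values that the c2 and c1 computations display, which is why both are formed with a leading minus sign and a division by vectors(1,1).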

Example 7.1 Dickey-Fuller Tests


cal(q) 1960:1
all 2012:4
open data quarterly(2012).xls
data(org=obs,format=xls)
log rgdp / ly
log potent / lpot
*
* Construct the graph including the labels for the series
*
com l$ = ||"Real GDP","Potential"||
graph(klabels=l$,footer="Real and Potential GDP", $
key=upleft,vlabel="logarithms") 2
# ly
# lpot
*
set trend = t
linreg ly
# constant trend
*
corr(number=8,picture="##.##") %resids
*
diff ly / dly
linreg dly
# constant trend ly{1} dly{1 2}
*
exclude
# ly{1} trend
*
exclude
# ly{1} constant trend
*
compute p=0
do lags=4,1,-1
linreg(noprint) dly
# constant trend ly{1} dly{1 to lags}
if %ttest(%tstats(%nreg),%ndf)<.05 {
compute p=lags
break
}
end do lags
*
disp "Chosen lag length" p
linreg(print) dly
# constant trend ly{1} dly{1 to p}
*
set cycle = lpot - ly
diff cycle / dcycle
linreg dcycle
# constant cycle{1} dcycle{1 to 2}
*
* Using procedures
*

@dfunit(det=trend,lags=2) ly
*
@dfunit(det=trend,method=gtos,maxlags=4,signif=0.05) ly
@dfunit(det=trend,method=bic,maxlags=4) ly
@dfunit(det=trend,method=aic,maxlags=4) ly
*
@adfautoselect(det=trend,maxlags=4,print) ly
*
@urauto(lags=2) ly

Example 7.2 Unit Root Tests in a Loop


open data "panel(2013).xls"
calendar(q) 1980:1
data(format=xls,org=columns) 1980:01 2013:01 australia canada $
france germany japan netherlands uk us
*
dofor x = australia to us
set lx = log(x{0})
@dfunit(title="Dickey-Fuller Test for "+%l(x),$
method=gtos,maxlags=12) lx
end dofor x
*
* First try at creating a REPORT
*
report(action=define,title="Unit Root Tests for PPP")
dofor x = australia to us
set lx = log(x{0})
@dfunit(noprint,method=gtos,maxlags=12) lx
report(row=new,atcol=1) %l(x) %cdstat %%autop
end dofor x
report(action=show)
*
* Improved
*
report(action=define,title="Unit Root Tests for PPP")
report(row=new,atcol=1,align=center) "Country" "ADF" "Lags"
dofor x = australia to us
set lx = log(x{0})
@dfunit(noprint,method=gtos,maxlags=12) lx
report(row=new,atcol=1) %l(x) %cdstat %%autop
end dofor x
report(action=format,picture="*.###")
report(action=show)
*
* With tags for significance
*
report(action=define,title="Unit Root Tests for PPP")
report(row=new,atcol=1,align=center) "Country" "ADF" "Lags"
dofor x = australia to us
set lx = log(x{0})
@dfunit(noprint,method=gtos,maxlags=12) lx

@mackinnoncv(det=constant) dfcv
compute stars=fix($
(%cdstat<dfcv(1))+(%cdstat<dfcv(2))+(%cdstat<dfcv(3)))
report(row=new,atcol=1) %l(x) "" %%autop
report(row=current,atcol=2,special=1+stars) %cdstat
end dofor x
report(action=format,atcol=2,tocol=2,atrow=2,$
picture="*.###",align=decimal)
report(action=show)

Example 7.3 Other Unit Root Tests


cal(q) 1960 1
all 2012:4
open data quarterly(2012).xls
data(org=obs,format=xls)
*
set ly = log(rgdp)
*
* Phillips-Perron test
*
@ppunit(det=trend,lags=8,table) ly
*
* Schmidt-Phillips (LM) test
*
@spunit(p=1) ly
*
* Elliott, Rothenberg, Stock test
*
@erstest(det=trend,lags=2) ly
*
* KPSS test
*
@kpss(det=trend,lmax=12) ly

Example 7.4 Unit Root Test with Break: Simulated Data


open data "break.xls"
data(format=xls,org=columns) 1 100 epsilon y1 y2
*
graph(footer="Simulated Data with Broken Mean")
# y1
corr(number=6,picture="*.##") y1
corr(number=6,picture="*.##") y1 1 50
corr(number=6,picture="*.##") y1 51 100
*
@dfunit(max=6,det=constant) y1
*
set du = t>50
linreg y1

# constant du
set ytilde = %resids
*
@adfautoselect(print,det=none,max=6) ytilde

Example 7.5 Unit Root Tests with Breaks


cal(q) 1960 1
all 2012:4
open data quarterly(2012).xls
data(org=obs,format=xls)
*
set ly = log(rgdp)
set dly = ly-ly{1}
set trend = t
*
* Zivot-Andrews test
*
@zivot(break=trend,crit=aic,pi=0.1,graph) ly
*
set ts = %max(t-(%%breakpoint-1),0.0)
lin dly
# ly{1} constant trend dly{1 to 2} ts
*
* Lee-Strazicich test
*
@lsunit(model=crash,breaks=2,lags=2) ly
@lsunit(model=break,breaks=2,lags=2) ly

Example 7.6 Trend Decompositions


cal(q) 1960 1
all 2012:4
open data quarterly(2012).xls
data(org=obs,format=xls)
*
set ly = log(rgdp)
set lpot = log(potent)
*
filter(type=hp) ly / hp_trend
set cycle = ly - hp_trend
*
graph(header="GDP and Trend") 2
# hp_trend
# ly
*
spgraph(hfields=2,header="HP Filter for GDP")
graph(header="HP Cycle")
# cycle
graph(header="HP Trend and Potential GDP",key=upleft) 2

# hp_trend
# lpot
spgraph(done)
*
@bndecomp(ar=2,print) ly / bn_trend
set bn_cycle = ly - bn_trend
*
spgraph(hfields=2,vfields=1,header="BN Decomposition for GDP")
graph(header="BN Cycle") 1
# bn_cycle 3 *
graph(header="The BN Trend and Potential GDP",key=upleft) 2
# bn_trend
# lpot
spgraph(done)

Example 7.7 Cointegration


cal(q) 1960 1
all 2012:4
open data quarterly(2012).xls
data(org=obs,format=xls)
*
@dfunit(maxlags=8,method=gtos,signif=0.05) tb3mo
@dfunit(maxlags=8,method=gtos,signif=0.05) tb1yr
*
linreg(define=rshort) tb3mo
# constant tb1yr
set u = %resids
*
diff u / du
linreg du
# u{1} du{1 to 6}
*
@egtest(lags=6,det=constant)
# tb3mo tb1yr
*
@egtestresids(det=constant,nvar=2,maxlags=8,method=gtos) u
*
@varlagselect(crit=gtos,signif=0.05,lags=8)
# tb3mo tb1yr
*
system(model=rates)
variables tb3mo tb1yr
lags 1 to 7
ect rshort
end(system)
*
estimate(noftests)
*
errors(model=rates,results=errors,steps=24)
*
compute implabels=||"T-Bill","1-Year"||

impulse(model=rates,results=impulses,steps=24)
graph(nodates,footer="Responses to the T-Bill Rate",$
key=below,klabels=implabels) 2
# impulses(1,1)
# impulses(2,1)
*
@MCVARDoDraws(model=rates,draws=2000,steps=24)
@mcgraphirf(model=rates,shocks=||"to 3-month","to 1-year"||,$
varlabels=||"3-month","1-year"||,footer="Impulse Responses",$
center=median,percent=||.025,.975||)
*
@JohMLE(lags=7,determ=rc,vectors=vectors)
# tb3mo tb1yr
*
com c1 = -vectors(3,1)/vectors(1,1)
com c2 = -vectors(2,1)/vectors(1,1)
dis c1 c2
Appendix A

Probability Distributions

A.1 Univariate Normal


Parameters        Mean ($\mu$), Variance ($\sigma^2$)

Kernel            $\sigma^{-1} \exp\left( -\dfrac{(x-\mu)^2}{2\sigma^2} \right)$

Support           $(-\infty, \infty)$

Mean              $\mu$

Variance          $\sigma^2$

Main uses         Prior, exact and approximate posteriors for parameters with unlimited ranges.

Density Function  %DENSITY(x) is the non-logged standard Normal density. More generally,
                  %LOGDENSITY(variance,u). Use %LOGDENSITY(sigmasq,x-mu) to compute
                  $\log f(x|\mu,\sigma^2)$.

CDF               %CDF(x) is the standard Normal CDF. To get $F(x|\mu,\sigma^2)$, use
                  %CDF((x-mu)/sigma)

Draws             %RAN(s) draws one or more (depending upon the target) independent $N(0,s^2)$.
                  %RANMAT(m,n) draws a matrix of independent $N(0,1)$.
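A minimal sketch of how these functions might be combined is shown below. The numerical values and the names mu, sigmasq, x, logf, cdfx and xdraw are assumptions made up for the illustration.

```rats
* Illustrative values only
compute mu      = 1.0
compute sigmasq = 4.0
compute x       = 2.5
* Log density and CDF of N(mu,sigmasq) evaluated at x
compute logf = %logdensity(sigmasq,x-mu)
compute cdfx = %cdf((x-mu)/sqrt(sigmasq))
* A single draw from N(mu,sigmasq); %RAN takes the standard deviation
compute xdraw = mu+%ran(sqrt(sigmasq))
display "log f(x) =" logf "F(x) =" cdfx "draw =" xdraw
```

Note that %LOGDENSITY takes the variance while %RAN takes the standard deviation, which is an easy place to make a mistake.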


A.2 Univariate Student (t )


Parameters        Mean ($\mu$), Variance of underlying Normal ($\sigma^2$) or of
                  the distribution itself ($s^2$), Degrees of freedom ($\nu$)

Kernel            $\left( 1 + (x-\mu)^2 / (\sigma^2 \nu) \right)^{-(\nu+1)/2}$ or
                  $\left( 1 + (x-\mu)^2 / (s^2 (\nu-2)) \right)^{-(\nu+1)/2}$

Support           $(-\infty, \infty)$

Mean              $\mu$

Variance          $\sigma^2 \nu / (\nu - 2)$ or $s^2$

Main uses         Prior, exact and approximate posteriors for parameters with unlimited ranges.

Density Function  %TDENSITY(x,nu) is the (non-logged) density function for a standard
                  ($\mu = 0$, $\sigma^2 = 1$) t.
                  %LOGTDENSITY(ssquared,u,nu) is the log density based upon the $s^2$
                  parameterization.
                  Use %LOGTDENSITY(ssquared,x-mu,nu) to compute $\log f(x|\mu,s^2,\nu)$, and
                  %LOGTDENSITYSTD(sigmasq,x-mu,nu) to compute $\log f(x|\mu,\sigma^2,\nu)$. [1]

CDF               %TCDF(x,nu) is the CDF for a standard t.

Draws             %RANT(nu) draws one or more (depending upon the target) standard t's with
                  independent numerators and a common denominator. To get a draw from a
                  t density with variance ssquared and nu degrees of freedom, use
                  %RANT(nu)*sqrt(ssquared*(nu-2.)/nu).

Notes             With $\nu = 1$, this is a Cauchy (no mean or variance); with $\nu \le 2$, the
                  variance doesn't exist. As $\nu \to \infty$, the distribution tends towards a Normal.

[1] %LOGTDENSITYSTD and %TCDF were added with RATS 7.3. Before that, use
%LOGTDENSITY(sigmasq*nu/(nu-2),x-mu,nu) and %TCDFNC(x,nu,0.0).
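The draw formula above can be sketched as follows. The values of nu and ssquared and the names tdraw and logf are assumptions for the illustration; only the functions listed in this section are used.

```rats
* Illustrative values only
compute nu       = 8.0
compute ssquared = 2.5
* Standard t draw rescaled so that the variance of the draw is ssquared
compute tdraw = %rant(nu)*sqrt(ssquared*(nu-2.)/nu)
* Log density of that draw under the s-squared parameterization, mean zero
compute logf = %logtdensity(ssquared,tdraw,nu)
display "t draw =" tdraw "log density =" logf
```

The rescaling is needed because a standard t with nu degrees of freedom has variance nu/(nu-2) rather than one.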

A.3 Chi-Squared Distribution


Parameters        Degrees of freedom ($\nu$).

Kernel            $x^{(\nu-2)/2} \exp(-x/2)$

Range             $[0, \infty)$

Mean              $\nu$

Variance          $2\nu$

Main uses         Prior, exact and approximate posterior for the precision (reciprocal of
                  variance) of residuals or other shocks in a model

Density function  %CHISQRDENSITY(x,nu)

Tail Probability  %CHISQR(x,nu)

Random Draws      %RANCHISQR(nu) draws one or more (depending upon the target) independent
                  chi-squareds with NU degrees of freedom.

A.4 Gamma Distribution


Parameters        shape ($a$) and scale ($b$), alternatively, degrees of freedom ($\nu$) and
                  mean ($\mu$). The RATS functions use the first of these. The relationship
                  between them is $a = \nu/2$ and $b = 2\mu/\nu$. The chi-squared distribution
                  with $\nu$ degrees of freedom is a special case with $\mu = \nu$.

Kernel            $x^{a-1} \exp\left( -\dfrac{x}{b} \right)$ or
                  $x^{(\nu/2)-1} \exp\left( -\dfrac{x \nu}{2 \mu} \right)$

Range             $[0, \infty)$

Mean              $ba$ or $\mu$

Variance          $b^2 a$ or $2\mu^2 / \nu$

Main uses         Prior, exact and approximate posterior for the precision (reciprocal of
                  variance) of residuals or other shocks in a model

Density function  %LOGGAMMADENSITY(x,a,b). Built-in with RATS 7.2. Available as procedure
                  otherwise. For the $\{\nu, \mu\}$ parameterization, use
                  %LOGGAMMADENSITY(x,.5*nu,2.0*mu/nu)

Random Draws      %RANGAMMA(a) draws one or more (depending upon the target) independent
                  Gammas with unit scale factor. Use b*%RANGAMMA(a) to get a draw from
                  Gamma($a$, $b$). If you are using the $\{\nu, \mu\}$ parameterization, use
                  2.0*mu*%RANGAMMA(.5*nu)/nu.
                  You can also use mu*%RANCHISQR(nu)/nu.

Moment Matching   %GammaParms(mean,sd) (external function) returns the 2-vector of
                  parameters ($(a, b)$ parameterization) for a gamma with the given mean and
                  standard deviation.
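The mapping between the two parameterizations can be sketched as below. The values of nu and mu and all variable names are assumptions for the illustration.

```rats
* Illustrative values only: degrees of freedom and mean
compute nu = 5.0
compute mu = 2.0
* Convert to shape and scale
compute a = 0.5*nu
compute b = 2.0*mu/nu
* Two equivalent ways to draw from the same gamma distribution
compute g1 = b*%rangamma(a)
compute g2 = mu*%ranchisqr(nu)/nu
* Implied mean (b*a) and variance (b*b*a)
compute gmean = b*a
compute gvar  = b*b*a
display "draws:" g1 g2 "mean:" gmean "variance:" gvar
```

With these values the implied mean is 2.0 and the implied variance is 1.6, which agrees with $\mu$ and $2\mu^2/\nu$.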

A.5 Multivariate Normal


Parameters        Mean ($\mu$), Covariance matrix ($\Sigma$) or precision ($H$)

Kernel            $|\Sigma|^{-1/2} \exp\left( -\dfrac{1}{2} (x-\mu)' \Sigma^{-1} (x-\mu) \right)$ or
                  $|H|^{1/2} \exp\left( -\dfrac{1}{2} (x-\mu)' H (x-\mu) \right)$

Support           $R^n$

Mean              $\mu$

Variance          $\Sigma$ or $H^{-1}$

Main uses         Prior, exact and approximate posteriors for a collection of parameters with
                  unlimited ranges.

Density Function  %LOGDENSITY(sigma,u). To compute $\log f(x|\mu,\Sigma)$ use
                  %LOGDENSITY(sigma,x-mu). (The same function works for univariate and
                  multivariate Normals).

Draws             %RANMAT(m,n) draws a matrix of independent $N(0,1)$.
                  %RANMVNORMAL(F) draws an n-vector from a $N(0, FF')$, where F is any
                  factor of the covariance matrix. This setup is used (rather than taking the
                  covariance matrix itself as the input) so you can do the factor just once if
                  it's fixed across a set of draws. To get a single draw from a $N(\mu, \Sigma)$, use
                  MU+%RANMVNORMAL(%DECOMP(SIGMA))
                  %RANMVPOST, %RANMVPOSTCMOM, %RANMVKRON and %RANMVKRONCMOM are
                  specialized functions which draw multivariate Normals with calculations of
                  the mean and covariance matrix from other matrices.
Appendix B

Quasi-Maximum Likelihood Estimations (QMLE)

The main source for results on QMLE is White (1994). Unfortunately, the book
is so technical as to be almost unreadable. We'll try to translate the main
results as best we can.
Suppose that $\{x_t\}$, $t = 1, \ldots, \infty$ is a stochastic process and suppose that we have
observed a finite piece of this $\{x_1, \ldots, x_T\}$ and that the true (unknown) log joint
density of this can be written

$$\sum_{t=1}^{T} \log g_t(x_t, \ldots, x_1)$$

This is generally no problem for either cross section data (where independence
may be a reasonable assumption) or time series models where the data can be
thought of as being generated sequentially. Some panel data likelihoods will
not, however, be representable in this form.
A (log) quasi likelihood for the data is a collection of density functions indexed
by a set of parameters $\theta$ of the form

$$\sum_{t=1}^{T} \log f_t(x_t, \ldots, x_1; \theta)$$

which it is hoped will include a reasonable approximation to the true density.
In practice, this will be the log likelihood for a mathematically convenient
representation of the data such as joint Normal. The QMLE is the (or more
technically, a, since there might be non-uniqueness) $\hat{\theta}$ which maximizes the log
quasi-likelihood.
Under the standard types of assumptions which would be used for actual maximum
likelihood estimation, $\hat{\theta}$ proves to be consistent and asymptotically Normal,
where the asymptotic distribution is given by
$\sqrt{T}(\hat{\theta} - \theta_0) \xrightarrow{d} N(0, A^{-1} B A^{-1})$,
where $A$ is approximated by

$$A_T = \frac{1}{T} \sum_{t=1}^{T} \frac{\partial^2 \log f_t}{\partial \theta \, \partial \theta'}$$

and $B$ by (if there is no serial correlation in the gradients)

$$B_T = \frac{1}{T} \sum_{t=1}^{T} \left( \frac{\partial \log f_t}{\partial \theta} \right)' \left( \frac{\partial \log f_t}{\partial \theta} \right) \qquad (B.1)$$


with the derivatives evaluated at $\hat{\theta}$. [1] Serial correlation in the gradients is
handled by a Newey-West type calculation in (B.1). This is the standard "sandwich"
estimator for the covariance matrix. For instance, if $\log f_t = -(x_t - z_t \theta)^2$ (with
$z_t$ treated as exogenous), then

$$\frac{\partial \log f_t}{\partial \theta} = 2 (x_t - z_t \theta) z_t'$$

and

$$\frac{\partial^2 \log f_t}{\partial \theta \, \partial \theta'} = -2 z_t' z_t$$

and the asymptotic covariance matrix of $\hat{\theta}$ is

$$\left( \sum z_t' z_t \right)^{-1} \left( \sum z_t' u_t^2 z_t \right) \left( \sum z_t' z_t \right)^{-1}$$

the standard Eicker-White robust covariance matrix for least squares. Notice
that, when you compute the covariance matrix this way, you can be somewhat
sloppy with the constant multipliers in the log quasi likelihood: if this were
the actual likelihood for a Normal, $\log f_t$ would have a $\frac{1}{2\sigma^2}$ multiplier, but that
would just cancel out of the calculation since it gets squared in the center factor
and inverted in the two ends.
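For least squares, this sandwich calculation does not have to be coded by hand. A minimal sketch, assuming the ROBUSTERRORS, LAGS and LWINDOW options of LINREG and reusing the tb3mo and tb1yr series from Chapter 7 purely as placeholders (the lag length of 4 is an arbitrary choice for the illustration):

```rats
* Eicker-White covariance matrix: no serial correlation in the gradients
linreg(robusterrors) tb3mo
# constant tb1yr
* Newey-West type correction for serially correlated gradients
linreg(robusterrors,lags=4,lwindow=neweywest) tb3mo
# constant tb1yr
```

The coefficient estimates are the same in both regressions; only the reported standard errors and test statistics change.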
This is very nice, but what is the $\theta_0$ to which this is converging? After all,
nothing above actually required that the $f_t$ even approximate $g_t$ well, much
less include it as a member. It turns out that this is the value which minimizes
the Kullback-Leibler Information Criterion (KLIC) discrepancy between $f$ and
$g$ which is (suppressing various subscripts) the expected value (over the density
$g$) of $\log(g/f)$. The KLIC has the properties that it's non-negative and is equal to
zero only if $f = g$ (almost everywhere), so the QMLE will at least asymptotically
come up with the member of the family which is closest (in the KLIC sense) to
the truth.
Again, "closest" might not be close. However, in practice, we're typically less
interested in the complete density function of the data than in some aspects of it,
particularly moments. A general result is that if $f$ is an appropriate selection
from the linear exponential family, then the QMLE will provide asymptotically
valid estimates of the parameters in a conditional expectation. The linear
exponential family are those for which the density takes the form

$$\log f(x; \theta) = a(\theta) + b(x) + \theta' t(x) \qquad (B.2)$$

This is a very convenient family because the interaction between the parameters
and the data is severely limited. [2] This family includes the Normal, gamma
(chi-squared and exponential are special cases), Weibull and beta distributions
among continuous distributions and binomial, Poisson and geometric among
discrete ones. It does not include the logistic, t, F, Cauchy and uniform.

[1] The formal statement of this requires pre-multiplying the left side by a matrix square root
of $AB^{-1}A$ and having the target covariance matrix be the identity.
[2] The exponential family in general has $d(\theta)$ entering into that final term, though if $d$ is
invertible, it's possible to reparameterize to convert a general exponential to the linear form.
For example, suppose that we have count data, that is, the observable data
are nonnegative integers (number of patents, number of children, number of
job offers, etc.). Suppose that we posit that the expected value takes the form
$E(y_t | w_t) = \exp(w_t \theta)$. The Poisson is a density in the exponential family which
has the correct support for the underlying process (that is, it has a positive
density only for the non-negative integers). Its probability distribution (as a
function of its single parameter $\lambda$) is defined by
$P(x; \lambda) = \dfrac{\exp(-\lambda) \lambda^x}{x!}$. If we define
$\omega = \log(\lambda)$, this is linear exponential family with $a(\omega) = -\exp(\omega)$,
$b(x) = -\log x!$, $t(x) = x$. There's a very good chance that the Poisson will not be the
correct distribution for the data because the Poisson has the property that both its
mean and its variance are $\lambda$. Despite that, the Poisson QMLE, which maximizes
$\sum \left[ -\exp(w_t \theta) + x_t (w_t \theta) \right]$, will give consistent, asymptotically Normal estimates
of $\theta$.
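To see why the Poisson fits the linear exponential form in (B.2), it helps to write out its log probability. The symbol $\omega$ for the natural parameter is a notational choice made for this illustration.

```latex
\begin{aligned}
\log P(x;\lambda) &= -\lambda + x \log \lambda - \log x! \\
                  &= \underbrace{-\exp(\omega)}_{a(\omega)}
                     \; + \; \underbrace{\left(-\log x!\right)}_{b(x)}
                     \; + \; \omega \, \underbrace{x}_{t(x)},
                     \qquad \omega = \log \lambda
\end{aligned}
```

Substituting $\omega = w_t \theta$ for each observation and dropping the $-\log x_t!$ term, which does not involve the parameters, gives the sum that the Poisson QMLE maximizes.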
It can also be shown that, under reasonably general conditions, if the model
provides a set of moment conditions (depending upon some parameters) that
match up with QMLE first order conditions from a linear exponential family,
then the QMLE provides consistent estimates of the parameters in the moment
conditions.
Appendix C

Delta method

The delta method is used to estimate the variance of a non-linear function of
a set of already estimated parameters. The basic result is that if $\theta$ are the
parameters and we have

$$\sqrt{T} (\hat{\theta} - \theta) \xrightarrow{d} N(0, \Sigma) \qquad (C.1)$$

and if $f(\theta)$ is continuously differentiable, then, by using a first order Taylor
expansion

$$f(\hat{\theta}) - f(\theta) \approx f'(\theta) (\hat{\theta} - \theta)$$

Reintroducing the $\sqrt{T}$ scale factors and taking limits gives

$$\sqrt{T} \left( f(\hat{\theta}) - f(\theta) \right) \xrightarrow{d} N\left( 0, f'(\theta) \Sigma f'(\theta)' \right)$$

In practice, this means that if we have

$$\hat{\theta} \approx N(\theta, A) \qquad (C.2)$$

then

$$f(\hat{\theta}) \approx N\left( f(\theta), f'(\hat{\theta}) A f'(\hat{\theta})' \right) \qquad (C.3)$$

(C.1) is the type of formal statement required, since the $A$ in (C.2) collapses to
zero as $T \to \infty$. It's also key that (C.1) implies that $\hat{\theta} \xrightarrow{p} \theta$, so
$f'(\hat{\theta}) \xrightarrow{p} f'(\theta)$,
allowing us to replace the unobservable $f'(\theta)$ with the estimated form in (C.3).
So the point estimate of the function is the function of the point estimate, at
least as the center of the asymptotic distribution. If $\hat{\theta}$ is unbiased for $\theta$, then it's
almost certain that $f(\hat{\theta})$ will not be unbiased for $f(\theta)$ so this is not a statement
about expected values.
To compute the asymptotic distribution, it's necessary to compute the partial
derivatives of $f$. For scalar functions of the parameters estimated using a
RATS instruction, that can usually be most easily done using the instruction
SUMMARIZE.
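As a hedged sketch of that last point, the fragment below estimates a regression with a lagged dependent variable and then asks for a long-run multiplier, which is a ratio of coefficients. It assumes the form of SUMMARIZE that accepts an expression in the %BETA coefficient vector; the regression itself, with the tb3mo and tb1yr series, is only a placeholder.

```rats
* %BETA(1)=constant, %BETA(2)=tb1yr, %BETA(3)=tb3mo{1}
linreg tb3mo
# constant tb1yr tb3mo{1}
* Long-run effect of tb1yr: a non-linear function of the coefficients,
* with its standard error computed by the delta method
summarize %beta(2)/(1-%beta(3))
```

The function here is $f(\theta) = \theta_2/(1-\theta_3)$, so the derivatives that would otherwise have to be worked out by hand are $1/(1-\theta_3)$ and $\theta_2/(1-\theta_3)^2$.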

Appendix D

Central Limit Theorems with Dependent Data

The simplest form of Central Limit Theorem (CLT) assumes a sequence of i.i.d.
random variables with finite variance. Under those conditions, regardless of
the shape of the distributions (anything from 0-1 Bernoullis to fat-tailed
variables with infinite fourth moments),

$$\sqrt{T} (\bar{x} - \mu) \xrightarrow{d} N(0, \sigma^2) \qquad (D.1)$$

Those were extended to allow independent, but non-identically distributed,
random variables as long as there was some control on the tail behavior and
the relative variances to prevent a small percentage of the summands from
dominating the result. The assumption of independence serves two purposes:

1. It makes it much easier to prove the result, since it's relatively easy to
work with characteristic functions of independent random variables.
2. Independence helps to restrict the influence of each element.

In time series analysis, independence is too strong an assumption. However,
it's still possible to construct CLTs with weaker assumptions as long as the
influence of any small number of elements is properly controlled.
One type of useful weakening of independence is to assume a sequence is a
martingale difference sequence (m.d.s.). $\{u_t\}$ is an m.d.s. if

$$E(u_t | u_{t-1}, u_{t-2}, \ldots) = 0$$

It's called this because a martingale is a sequence which satisfies

$$E(x_t | x_{t-1}, x_{t-2}, \ldots) = x_{t-1}$$

so, by the Law of Iterated Expectations (conditioning first on a superset)

$$E(x_t - x_{t-1} | x_{t-1} - x_{t-2}, x_{t-2} - x_{t-3}, \ldots) = E\left( E(x_t - x_{t-1} | x_{t-1}, x_{t-2}, \ldots) \,|\, x_{t-1} - x_{t-2}, x_{t-2} - x_{t-3}, \ldots \right) = 0$$

thus the first difference of a martingale is an m.d.s. An i.i.d. mean zero process
is trivially an m.d.s. A non-trivial example is $u_t = \varepsilon_t \varepsilon_{t-1}$, where $\varepsilon_t$ is an i.i.d.
mean zero process. $u_t$ isn't independent of $u_{t-1}$ because they share an $\varepsilon_{t-1}$ factor;
as a result, the variances of $u_t$ and $u_{t-1}$ will tend to move together.
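The two claims about this example can be verified in a couple of lines, writing $\sigma^2$ for the variance of $\varepsilon_t$ and conditioning on the history of the $\varepsilon$ process (which contains the history of $u$):

```latex
\begin{aligned}
E(u_t \mid \varepsilon_{t-1}, \varepsilon_{t-2}, \ldots)
  &= \varepsilon_{t-1}\, E(\varepsilon_t) = 0 \\
E(u_t^2 \mid \varepsilon_{t-1}, \varepsilon_{t-2}, \ldots)
  &= \varepsilon_{t-1}^2\, E(\varepsilon_t^2) = \sigma^2 \varepsilon_{t-1}^2
\end{aligned}
```

The first line, together with the Law of Iterated Expectations, gives the m.d.s. property. The second shows that the conditional variance of $u_t$ depends on $\varepsilon_{t-1}$, which also enters $u_{t-1}$, so the process is serially uncorrelated without being serially independent.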


The ergodic martingale CLT states that if $u_t$ is a stationary ergodic m.d.s. and
$E u_t^2 = \sigma^2$, then

$$\frac{1}{\sqrt{T}} \sum_{t=1}^{T} u_t \xrightarrow{d} N(0, \sigma^2)$$

We can write this (somewhat informally) as

$$\frac{1}{\sqrt{T}} \sum_{t=1}^{T} u_t \xrightarrow{d} N(0, E u_t^2)$$

and very informally, this is used as

$$\sum_t u_t \approx N\left( 0, \sum_t u_t^2 \right) \qquad (D.2)$$

This is the form that is useful when we have serially uncorrelated (though not
necessarily serially independent) summands. However, it won't handle serial
correlation. A basic CLT which can be applied more generally is the following:
if

$$x_t = \sum_{s=0}^{q} c_s \varepsilon_{t-s} \qquad (D.3)$$

where $\varepsilon_t$ satisfies assumptions which generate a standard $N(0, \sigma^2)$ limit, then

$$\frac{1}{\sqrt{T}} \sum_{t=1}^{T} x_t \xrightarrow{d} N\left( 0, \left( \sum c_s \right)^2 \sigma^2 \right) \qquad (D.4)$$

If we write $x_t = C(L) \varepsilon_t$, then we can write $C(1) = \sum_{s=0}^{q} c_s$, so the variance of
the limiting distribution can be written $C(1)^2 \sigma^2$. This is known as the long-run variance of $x$:
if $\varepsilon_t$ were subject to a permanent shift generated by a random variable with
variance $\sigma^2$, the variance that would produce in $x_t$ is $C(1)^2 \sigma^2$.
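A quick check of the long-run variance formula is the first-order moving average case, $q = 1$ with $c_0 = 1$, where it can be compared with the sum of the autocovariances $\gamma_j$ of $x_t$:

```latex
\begin{aligned}
x_t &= \varepsilon_t + c_1 \varepsilon_{t-1} \\
\gamma_0 &= (1 + c_1^2)\,\sigma^2, \qquad \gamma_1 = \gamma_{-1} = c_1 \sigma^2 \\
\gamma_{-1} + \gamma_0 + \gamma_1 &= (1 + 2 c_1 + c_1^2)\,\sigma^2 = (1 + c_1)^2 \sigma^2 = C(1)^2 \sigma^2
\end{aligned}
```

The long-run variance is therefore the sum of all the autocovariances, which is the quantity that window-based estimators such as Newey-West approximate using a finite number of weighted sample autocovariances.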
The somewhat informal restatement of this is

$$\frac{1}{\sqrt{T}} \sum_{t=1}^{T} x_t \xrightarrow{d} N(0, \mathrm{lvar}(x))$$

where lvar(x) is the long-run variance of the $x$ process, and in practice we use

$$\sum_t x_t \approx N\left( 0, \sum_t \sum_{l=-L}^{L} w_l \, x_t x_{t-l}' \right) \qquad (D.5)$$

where the variance in the target distribution uses some feasible estimator for
the long-run variance (such as Newey-West).
The approximating covariance matrix in (D.2) can be computed using the
instruction CMOMENT (applied to u), or with MCOV without any LAG options, and
that in (D.5) can be computed using MCOV using LAG and LWINDOW options.
Note that both these are written using sums (not means) on both sides. That
tends to be the most convenient form in practice: when you try to translate a
result from the literature, you need to make sure that you get the factors of $T$
correct.
Bibliography

Beveridge, S., and C. R. Nelson (1981): "A new approach to decomposition of economic time series into permanent and transitory components with particular attention to measurement of the business cycle," Journal of Monetary Economics, 7, 151-174.

Bollerslev, T. (1986): "Generalized Autoregressive Conditional Heteroskedasticity," Journal of Econometrics, 31(3), 307-327.

Chan, K. (1993): "Consistency and Limiting Distribution of the Least Squares Estimator of a Threshold Autoregressive Model," Annals of Statistics, 21, 520-533.

Diebold, F. X., and R. S. Mariano (1995): "Comparing Predictive Accuracy," Journal of Business and Economic Statistics, 13, 253-263.

Elliott, G., T. Rothenberg, and J. Stock (1996): "Efficient Tests for an Autoregressive Unit Root," Econometrica, 64(4), 813-836.

Enders, W. (2010): Applied Econometric Time Series. Wiley, 3rd edn.

Enders, W. (2015): Applied Econometric Time Series. Wiley, 4th edn.

Engle, R. F. (1982): "Autoregressive Conditional Heteroscedasticity with Estimates of the Variance of United Kingdom Inflation," Econometrica, 50(4), 987-1007.

Engle, R. F., and C. W. J. Granger (1987): "Co-integration and Error Correction: Representation, Estimation, and Testing," Econometrica, 55, 251-276.

Engle, R. F., D. M. Lilien, and R. P. Robins (1987): "Estimating Time Varying Risk Premia in the Term Structure: The Arch-M Model," Econometrica, 55(2), 391-407.

Granger, C. W. J., and P. Newbold (1973): "Some Comments on the Evaluation of Economic Forecasts," Applied Economics, 5, 35-47.

Granger, C. W. J., and P. Newbold (1974): "Spurious Regressions in Econometrics," Journal of Econometrics, 2, 111-120.

Hodrick, R. J., and E. C. Prescott (1997): "Postwar U.S. Business Cycles: An Empirical Investigation," Journal of Money, Credit and Banking, 29(1), 1-16.

Hylleberg, S., R. F. Engle, C. W. J. Granger, and B. S. Yoo (1990): "Seasonal Integration and Cointegration," Journal of Econometrics, 44, 215-238.

Kwiatkowski, D., P. Phillips, P. Schmidt, and Y. Shin (1992): "Testing the Null Hypothesis of Stationarity against the Alternative of a Unit Root," Journal of Econometrics, 54(1-3), 159-178.

Lee, J., and M. Strazicich (2003): "Minimum LM Unit Root Test with Two Structural Breaks," Review of Economics and Statistics, 85(4), 1082-1089.

MacKinnon, J. (1991): "Critical Values for Cointegration Tests," in Long-Run Economic Relationships, ed. by R. F. Engle and C. W. J. Granger, chap. 13, pp. 267-276. Oxford: Oxford University Press.

Ng, S., and P. Perron (2001): "Lag Length Selection and the Construction of Unit Root Tests with Good Size and Power," Econometrica, 69, 1519-1554.

Perron, P. (1989): "The Great Crash, the Oil Price Shock, and the Unit Root Hypothesis," Econometrica, 57(6), 1361-1401.

Phillips, P., and P. Perron (1988): "Testing for a Unit Root in Time Series Regressions," Biometrika, 75, 335-346.

Schmidt, P., and P. Phillips (1992): "LM Test for a Unit Root in the Presence of Deterministic Trends," Oxford Bulletin of Economics and Statistics, 54, 257-287.

Terasvirta, T. (1994): "Specification, Estimation and Evaluation of Smooth Transition Autoregressive Models," Journal of the American Statistical Association, 89(425), 208-218.

Tong, H. (1983): Threshold Models in Nonlinear Time Series Analysis. New York: Springer Verlag.

White, H. (1980): "A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity," Econometrica, 48, 817-838.

White, H. (1994): Estimation, Inference and Specification Analysis. Cambridge: Cambridge University Press.

Zivot, E., and D. W. K. Andrews (1992): "Further Evidence on the Great Crash, the Oil-Price Shock, and the Unit-Root Hypothesis," Journal of Business and Economic Statistics, 10, 251-270.
Index

Additive outlier, 150 Delta method, 240


ADF test, 193 @DFUNIT procedure, 192, 196
@ADFAUTOSELECT procedure, 197 Dickey-Fuller test, 192
AIC , 23 Diebold, F., 47
Akaike Information Criteria, 23 @DMARIANO procedure, 48
%ALLOCEND function, 45 DO instruction, 34
Andrews, D., 211 DOFOR instruction, 84
@ARCHTEST procedure, 124 Double precision, 101
%%AUTOP variable, 200
@EGTEST procedure, 25, 220
Bayesian Information Criterion, 23 Elliott, G., 207
Beveridge, S., 215 Enders, W., 26, 79, 198
Beveridge-Nelson decomposition, 215 @EndersGranger procedure, 169
BFGS algorithm, 135 @EndersSiklos procedure, 169
BIC , 23 Engle, R., 22, 38, 124, 217
Bilinear model, 98 %EQNPRJ function, 171
@BJAUTOFIT procedure, 33, 36 %EQNREGLABELS function, 170
@BJDIFF procedure, 38 %EQNRESID function, 171
@BJIDENT procedure, 27 %EQNRVALUE function, 171
@BNDECOMP procedure, 215 %EQNSIZE function, 170
BOXJENK instruction, 27 %EQNVALUE function, 171
DEFINE option, 42 %EQNXVECTOR function, 170
BREAK instruction, 156 EQV instruction, 183
BREAK.XLS data file, 208 ERS test, 207
Burn-in, 80 @ERSTEST procedure, 207
ESTAR model, 77
CDF instruction, 137 EXCLUDE instruction, 16
%CDSTAT variable, 16, 138 %EXP function, 86
Chan, K., 161 EXTREMUM instruction, 89
Chi-squared distribution, 234
%CHISQR function, 138 FILTER instruction
CLEAR instruction, 179 TYPE=HP option, 214
CMOMENT instruction, 153 FIX function, 162
Compiled language, 144 FORECAST instruction, 42
COMPUTE instruction %FRACTnn variables, 84
for series elements, 45 FRML instruction, 65, 68
%CONVERGED variable, 28, 69 LASTREG option, 94
CORRELATE instruction, 14 %FTEST function, 138
%FUNCVAL variable, 121
DATA instruction
NOLABELS option, 183 Gamma distribution, 235
TOP option, 183 Gauss-Newton algorithm, 66


@GLSDETREND procedure, 207
@GMAUTOFIT procedure, 40
@GNEWBOLD procedure, 47
Granger, C.W.J., 22, 38, 46, 191, 217
GRAPH instruction, 10
    NODATES option, 22
    NUMBER option, 22
Grid search, 83

@HEGY procedure, 38
Hylleberg, S., 38

Identification
    lack of, 64
Innovational outlier, 148
INQUIRE instruction, 169
Inter-quartile range, 86
Interpreted language, 143
%INVNORMAL function, 49

@JOHMLE procedure, 223

@KPSS procedure, 207
KPSS test, 207

LABELS instruction, 183
lag length tests, 1
Least squares
    nonlinear, 64
Lee, J., 212
Lee-Strazicich test, 212
LINREG instruction, 12
    DEFINE option, 164
Ljung-Box Q statistic, 15
%LOGDENSITY function, 119, 236
%LOGISTIC function, 79
%LOGL variable, 121
%LOGTDENSITY function, 119, 233
%LOGTDENSITYSTD function, 233
Loss of precision, 101
LSTAR model, 77
@LSUNIT procedure, 212
Lumsdaine, R., 213
Lumsdaine-Papell test, 213

MacKinnon, J., 197, 202
@MACKINNONCV procedure, 202
Mariano, R., 47
Martingale difference sequence, 241
MAXIMIZE instruction, 117
%MINENT variable, 89
Multivariate Normal distribution, 236

%NARMA variable, 31
%NDFTEST variable, 16
Nelson, C., 215
Newbold, P., 46, 191
NLLS instruction, 65, 69
    PARMSET option, 95
NLPAR instruction, 102
NONLIN instruction, 65, 67
Nonlinear least squares, 64
Normal distribution, 232
Numerical derivatives, 104

ORDER instruction, 162
Outlier
    additive, 150
    innovational, 148
Overflow, 101

Papell, D., 213
PARMSET data type, 94
Perron, P., 204, 208, 213
@PERRONBREAKS procedure, 213
Phillips, P.C.B., 204, 205
Phillips-Perron test, 204
PITERS option, 134
PMETHOD option, 134
@PPUNIT procedure, 204
Precision
    double, 101
    loss of, 101
Probability distributions
    chi-squared, 234
    gamma, 235
    multivariate normal, 236
    normal, 232
    t, 233

Q statistic, 15

%RANMAT function, 236

%RANMVNORMAL function, 236
%RANT function, 233
Recursive formula, 98
Recursive residuals, 50
@REGCORRS procedure, 32
@REGCRITS procedure, 25
%REGSTART function, 23
@REGSTRTEST procedure, 93
REPORT instruction, 200
Report Windows menu item, 201
%RESIDS series, 20, 69
RESTRICT instruction, 17
RLS instruction, 50
Rothenberg, T., 207

%S function, 182
SBC, 23
Schmidt, P., 205
Schwarz Bayesian Criterion, 23
SEED instruction, 105
%SEQA function, 84
Series
    output label of, 184
%SIGNIF variable, 16, 138
Simplex algorithm, 133
%SLIKE function, 182
Smooth Transition Regression, 87
SPGRAPH instruction, 11
@SPUNIT procedure, 206
Spurious regression, 191
SSTATS instruction, 187
STAR model, 77
    problems with outliers, 96
@STARTEST procedure, 93
STATISTICS instruction
    FRACTILES option, 84
Stock, J., 207
STR model, 87
Strazicich, M., 212
%SUMLC variable, 17
SUMMARIZE instruction, 16

t distribution, 233
TAR model, 159
@TAR procedure, 169
%TCDF function, 233
%TDENSITY function, 233
Terasvirta, T., 93
TEST instruction, 17, 73
Threshold autoregression, 159
@THRESHTEST procedure, 169
Tong, H., 159
Trend-stationary, 192
%TTEST function, 138

UFORECAST instruction, 41
Underflow, 101
UNTIL instruction, 157
@URAUTO procedure, 197

VAR, 1
@VARLAGSELECT procedure, 220
%VARLC variable, 17
Vector autoregression, 1

White, H., 20, 237

Yoo, B.S., 38

@ZIVOT procedure, 211
Zivot, E., 211
Zivot-Andrews test, 211
%ZTEST function, 138
