R Socialscience
R Socialscience
Mark R. Montgomery∗
∗ Departmentof Economics, Stony Brook University and Population Council, New York, email: mark.
montgomery@stonybrook.edu
1
Contents
Contents 2
I The Basics 4
1 Introducing R 5
1.1 Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Getting help . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2
CONTENTS 3
Bibliography 62
Part I
The Basics
4
One
Introducing R
The speed with which R is evolving has tended to leave its documentation lagging behind.
Luckily, a lot can be learned from a handful of web sites and manuals:
The main web site At http://www.r-project.org you will find links to manuals, the R
wiki, and much more.
Quick R Robert Kabacoff’s elegant web site provides a wide-ranging view of R’s essential
features. I often take a look here first when I have what seems like a simple question
whose answer is somehow not obvious: http://www.statmethods.net/index.html
Wiki book on R programming A more in-depth look at a range of topics is: http://en.
wikibooks.org/wiki/R_Programming
R for S TATA users I would highly recommend Muenchen and Hilbe (2010) for users who
already know S TATA and would like to put that knowledge to use in understanding
how R works. Another version of the book is aimed at S AS and S PSS users.
Two other books are also strongly recommended: Adler (2010) and Teetor (2011).
5
CHAPTER 1. INTRODUCING R 6
1.1 Installation
It is important to bear in mind from the outset that R is case-sensitive, which means that
packages, functions, and R objects—more on them in a moment—must be capitalized in
exactly the right way to avoid errors.
R itself can be downloaded from http://www.r-project.org, which provides a simple
point-and-click installer for Windows machines. R’s default graphical user interface is
adequate, but more elegant interfaces are available that can actually help you work more
productively. I myself use R-S TUDIO, which can be installed from rstudio.org. This
interface is available for the Mac and Linux operating systems as well as for Windows.
You will want to install a number of packages that extend R’s capabilities for dealing with
data. Some indispensable packages come with base-R, for instance, the foreign package
which provides facilities for reading and writing S TATA, S AS and similar files. Once you
have base-R on your computer, go to the console window or the R-S TUDIO equivalent and
type library()—this is an R function that will list and briefly describe all of your installed
packages.
A comprehensive list of R packages is available at http://cran.r-project.org/.
(C RAN is the Comprehensive R Archive Network.) This is the place to go when you want
to look at the details of what a package does before you install it. To enhance the tool-set
for econometrics and social science analysis more generally, you should install: AER, whose
name appeals to economists (although it stands for Applied Econometrics with R), a request
for which also downloads the car package having a number of useful data-mainpulation
routines; stringr for a well-organized set of functions for string (i.e., character) data, and—
if you deal with very large datasets—the remarkable data.table package. Those who use
LaTeX will certainly want the xtable package as well.
Package installation on Windows I use an R “script” file to manage the installation, and
then check periodically for package updates using R-S TUDIO’s updating tool. Putting all of
the installation commands in one editable file, whose commands can be executed one-by-
one as needed, substantially eases the task of re-installing all of your favorite packages each
time a new version of R is released. (Recently, new versions have been rolled out about
every six months.) Storing commands in files is the way you’ll work with R in any case, so
it makes sense to gather all of your installation commands in one file as well.
In the sample program of Figure 1.1, the first important step is to issue the command
chooseCRANmirror(), which will bring up a numbered list of sites from which packages
can be downloaded. You type in one of these numbers to instruct R to use that site. Having
done so, to install a specific package—the gmm package, for example—you would run the
line:
install . packages ( " gmm " , dependencies = NA )
Note that the package name is placed between quotes. The dependencies=NA option
specifies that if the gmm package depends for its operation on other packages, these should
be installed as well (if they have not already been installed). Setting dependencies=TRUE
would go even further, by installing all packages that depend on the gmm package—in my
experience, however, this brings in more packages than I find useful.
CHAPTER 1. INTRODUCING R 7
# NOTE on dependencies :
# Setting dependencies = TRUE installs not only all packages on which
# the package in question depends , but also all packages
# that it would enhance or which are " suggested ". That can
# be quite a lot of packages in total .
# If you would rather not have all of them , I would advise
# setting dependencies = NA at first , as this default setting
# covers just the packages absolutely needed for the one you are
# installing .
# To remove a package , put its name between the quotes and run the
# following line
remove . packages ( " " )
Installation on Macs The installation procedure is essentially the same on Macs, but
extra care is needed—not all sites maintain binaries for the Mac and there are Mac-specific
technical issues that arise in installing complex packages. Given this, it is probably a good
idea for Mac users to take the console approach no matter what. For instance, to install
ggplot2, the must-have tool for graphics that we’ll discuss later, a Mac user would issue
these two commands
setRepositories ( ind =1:2)
install . packages ( " ggplot2 " , dependencies = NA )
The command setRepositories(ind=1:2) directs the installer to C RAN and C RAN EXTRAS,
the repositories where O SX binaries for R are maintained by Brian Ripley. If I understand
correctly, these are for Intel architectures only.
or, alternatively,
library ( help = " ggplot2 " )
In the help() function, package is an argument whose value is specified via package =
"ggplot2". Many but not all packages come with a vignette, which is an extended and often
highly illuminating example of usage, typically written in a less formal and more accessible
style than the package’s help file. To see which of your installed packages has a vignette,
type vignette(), and to read one of these vignettes, say that coming with the sp package,
type vignette("sp"). Note that the name of the vignette is not always the same as that of
the package.
How do we get help on the usage and syntax of an R function? This is not as easy as the
beginner might hope—the help files that come with R seem to be aimed mainly at highly
sophisticated users, the sort of people who would hardly need the help. Still, if you have
an idea of what the tool you’re after might be named—something to do with polygons, for
instance—then
apropos ( " poly " )
will show commands with the fragment poly in their names. (Typing ??poly would do
much the same thing.) You will find that this list includes polygon(), and proceeding to
type help(polygon) or simply ?polygon will bring up a help screen that explains how this
command is used to plot polygons. Typing example(polygon) would bring up an example
of its use. To search more widely,
CHAPTER 1. INTRODUCING R 9
will range over R’s help manuals and archived mailing lists.
If you have a keyword in mind, then
help . search ( keyword = " character " , package = " base " )
Using built-in data Base-R comes with the datasets package already installed. By typing
data() you can see the names of the datasets that are supplied. Some packages provide
additional data that put the package’s capabilities on display. To see what data are available,
type data(package="PackageName"). Having spotted an interesting-looking dataset, you
would load it via
load ( " DataSetName " , package = " PackageName " )
Package conflicts Because R’s package authors operate more-or-less independently, they
sometimes assign a name to a function using a name that has already been claimed by
another package. (When the two package authors become aware of the situation, they often
collaborate in rewriting their packages so that both functions can be used.) When both of
these packages are loaded in an R session, R must find a way to resolve the ambiguity, and
does so by “masking” the function from the package that was loaded first. For instance,
both the memisc and the car packages (the latter is downloaded with the AER package) have
a function that goes by the name of recode. If the version of recode in the memisc package
is masked, you can nevertheless get at it by invoking the memisc version of recode in this
way,
memisc :: recode ()
prefacing the function name with the name of its package and two colons.
Two
Linear regression models OLS linear models are estimated with lm() (from stats) and
summary() and anova() provide the standard tests assuming normally distributed, ho-
moskedastic disturbances that are independent of the covariates. Asymptotic tests (Z
and χ2 instead of t and F tests), with facilities for substituting alternative covariance
matrices, are available in the coeftest() and waldtest() functions of the lmtest
package. General linear hypothesis tests can be conducted via linear.hypothesis()
in car. Both HC and HAC covariance matrices can be inserted in these functions
via the sandwich package. Diagnostic checking: The packages car and lmtest supply
many regression diagnostics. The packages VGAM, rms and Hmisc provide several tools
for generalized linear regression (GLS) models. Fitting linear regression models with AR
error terms via OLS is possible using gls() from nlme. Instrumental variables methods
are addressed in ivreg() in the AER package; another implementation is tsls() in
package sem.
Models for discrete outcomes Binary and count data: The glm() function of stats allows
the estimation of logit, probit, and Poisson models. Marginal effects for these models
can be displayed using effects. The package bayesm implements a Bayesian ap-
proach to binary, count, and other discrete dependent variables models. Negative
binomial models are implemented in the glm.nb() function of the MASS package,
and also by the aod package, which contains other models for over-dispersed data.
Zero-inflated and hurdle count models are implemented in the pscl package. Ordered
responses: Proportional-odds regression for ordered responses is available through the
polr() function of MASS. The package ordinal provides cumulative link models for
ordered data. Bayesian ordered probit models are provided by bayesm. Multinomial
responses: Multinomial models with individual-specific covariates only are available
10
CHAPTER 2. PACKAGES FOR ECONOMETRICS 11
Further regression models Nonlinear least squares models can be estimated by nls() in
the stats package. Quantile regression: quantreg covers linear, nonlinear, censored,
locally polynomial and additive quantile regressions. Linear models for panel data:
plm is the main package, providing within, between, and random-effects methods
(among others) along with options for corrected standard errors and the like. For
panel-corrected standard errors in OLS and GEE models, see the geepack and pcse
packages. Estimation of linear models with multiple group fixed effects can be done
with lfe. For the generalized method of moments (GMM) and generalized empirical
likelihood (GEL), use the gmm package. Spatial econometric models: The Spatial Task
View addresses spatial data and models. Spatial regression models can be estimated
with spdep and sphet (the latter using a GMM approach). (A package for spatial
panel models, splm, is under development on R-Forge.) Linear structural equation
models: sem, which includes two-stage least squares as mentioned above. Simulta-
neous equation estimation: systemfit. Nonparametric kernel methods: np. Truncated
(Gaussian) regression: truncreg. Nonlinear mixed-effect models: nlme and lme4.
Generalized additive models (GAMs): mgcv, gam, gamlss and VGAM.
Time series functions and classes See the Time Series Task View for much more detailed
information. The class ts defined in package stats is R’s standard class for regularly-
spaced time series, with strong support for annual, quarterly, and monthly data. Time
series in the ts class can be converted to and from the class of objects defined in the
zoo package, which provides support for both regularly and irregularly-spaced time
series (the latter via the class zoo) where the time information can be of arbitrary
class. This includes daily series (typically with ”Date” time index) or intra-day series
(e.g., with ”POSIXct” time index). Several other implementations of irregular time
series building on the ”POSIXct” time-date class are available in its tseries and
timeSeries, which are aimed particularly at finance applications.
Time series modeling Classical time series modeling tools are contained in the stats
package and include arima() for ARIMA modeling and models of the Box-Jenkins
type. Structural time series models are provided by StructTS() in stats. Filtering
CHAPTER 2. PACKAGES FOR ECONOMETRICS 12
Data sets Packages AER and Ecdat contain a comprehensive collections of data sets from
various standard econometric textbooks as well as several data sets from the Journal
of Applied Econometrics and the Journal of Business & Economic Statistics data archives.
AER additionally provides an extensive set of examples reproducing analyses from
textbooks and papers, illustrating various econometric methods. FinTS is the R com-
panion to Tsay’s Analysis of Financial Time Series (2nd ed., 2005, Wiley) containing data
sets, functions, and R scripts for some of the examples. The pwt package provides
several releases of the Penn World Tables. The packages expsmooth, fma, and Mcomp
are data packages with time-series data from the books Forecasting with Exponential
Smoothing: The State Space Approach (Hyndman, Koehler, Ord, Snyder, 2008, Springer)
and Forecasting: Methods and Applications (Makridakis, Wheelwright, Hyndman, 3rd
ed., 1998, Wiley) and the M-competitions, respectively. Package erer contains func-
tions and datasets for the book of Empirical Research in Economics: Growing up with R
(Sun, forthcoming).
There is perhaps more detail in this chapter than any first-time reader would care to see.
One way to approach what’s here is to jump immediately to Section 3.2 on dataframes,
which are R’s version of datasets, or go directly to Chapter 4. You can leaf back through this
chapter when you come across unfamiliar ideas.
The language that confronts the user of R is an interpreted rather than a compiled language;
in this it is much like the front-ends of S TATA, M ATLAB, P YTHON, and other popular
software. It is also object-oriented, a design choice that sometimes baffles R novices but which
can be exploited to great effect by an experienced user.1 R has been designed to work with
objects stored in memory rather than on disk, and this means that the amount of memory
your computer has will affect what you can do. Obviously, the more memory, the better!
into the language over a period of years. Significant strides toward full object-orientation have been made
in the past few years in what are termed S4 classes of objects, but not all of R’s functions and packages have
completed this transition.
13
CHAPTER 3. LANGUAGE ELEMENTS AND CONCEPTS 14
(The objects() command also lists objects.) The object.size() function will tell you
approximately how much memory an object occupies—that can be important if your
computer is memory-challenged. To remove just one object from memory, use rm() with
the object’s name in the parentheses; and if you are in the middle of an R session and want
to clear away all of the R objects that have already been created, wiping the board clean, so
to speak, then issue this compound command:
rm ( list = ls ())
In this bit of code, the ls() command lists all current objects and passes that list to the list
argument of the rm() function, which then promptly removes them from memory. This is
an example of how functions can be chained together.
Perhaps the most elementary object in R is the vector, which can be composed of integers,
real numbers (all such numbers are held in double precision, which means 8 bytes of
storage—the real number printed to your console is generally an abbreviated version of
what is in memory), complex numbers, character data, or logicals (TRUE, FALSE). The entries
of a vector must all be of the same type. Scalars are simply vectors with one entry; in R
there is no scalar data type as such. The generic notation used in R for missing is NA, which
can be used in all of these types of vectors.
Basic mathematical operations are carried out using the conventional symbols. Ordinary
addition, subtraction, multiplication, division and exponentiation use +, -, *, / and ^;
when these operations are applied to vectors, they act on an element-wise basis. Logical
comparisons yield TRUE or FALSE when the things being compared involve no NA values.
When testing for equality, you would use == for “equal to” (take note of the double equals
sign) and != for “not equal to”. Likewise, use <, <=, >=, > for inequalities. Please
remember that tests of equality between two integers are exact in computer arithmetic, but
testing whether two real numbers are equal is problematic owing to the imprecisions in
the representation and storage of real numbers.2 We’ll have more to say about vectors and
mathematical operations in a moment.
2 If X and Y are two real numbers, you can define them as being essentially equal if they differ by less than
some pre-defined tolerance. The result of abs(X-y) < .Machine$double.eps is TRUE if the two numbers differ
in absolute value by an amount that is smaller than the tiny positive real number represented by R’s built-in
.Machine$double.eps object.
CHAPTER 3. LANGUAGE ELEMENTS AND CONCEPTS 15
You will get an error if this directory does not exist. Always employ forward slashes (/) in
specifying your paths: despite what you might have thought, this works even on Windows.
Having set the working directory in this way, issue the command list.files() to see
what files live there. If you get lost, use getwd() to find out exactly where you are in the
directory structure.
Pasting strings together Paths and file names often get very long, and I therefore tend to
store them in character variables. A very useful function for cobbling together long strings
is paste(), which can be used as follows:
PathIn <- " C : / SpatialTutorial / Examples "
File <- paste ( PathIn , " Ghana . RData " , sep = " / " )
load ( File )
In the sequence of commands shown above, we store in PathIn the full path to the directory
in which the Ghana.RData file sits, and then, via paste(), pre-pend that path to the file
name with the appropriate “/” separator, so that in the next step we can load the file into
memory using the load() command. This code snippet illustrates how <- is used assign
things on its right-hand side to the object on the left. Within functions, such as paste(), the
equals sign (=) is used to set the value of a function argument. In this case the sep argument
is used to specify the separator between the two strings that are to be joined together. (To
have no such separator, use sep="".) If the first two arguments of paste() are not of the
character type, R silently coerces them to character using the as.character() function.
Interrupting commands There will be times when you realize that you have hopelessly
garbled a command while typing at the console but have already pressed E NTER and asked
R to parse and somehow execute the mess. To stop R from trying to do the impossible, enter
E SC, which works on both Windows and Macs. (In Linux, use C NTRL-C instead.) Your
graphical user interface may display a small S TOP sign, which when clicked will do the
same thing.
Getting out To exit from an R session, type q() at the console or click the File menu.
Control structures
As with most programming languages, R has if-else structures and loops. The essence of
the if-else form is
if (.. logical expression ..) {
...
}
else {
...
}
CHAPTER 3. LANGUAGE ELEMENTS AND CONCEPTS 16
in which the logical expression in parentheses is evaluated, and if it is TRUE the first set of
commands within the braces is executed, whereas if it is FALSE the commands within the
else braces are run.3
Loops are specified in a form such as
for ( i in 1:10) {
...
if (...) { next }
...
if (...) { break }
...
}
In this example, 1:10 is the range over which the loop index i will iterate. It is a sequence of
integers, but R will also allow lists (see below) to appear in this context. The next command
causes the program to ignore the remaining code in the body of the loop and advance to the
next iteration, whereas break causes the program to leap out of the loop altogether.
Visitors to the R user discussion groups are often advised in the strongest terms to avoid
loops. This is not because loops are somehow bad programming practice, nor because they
are in general an inefficient mechanism for getting computations accomplished. Indeed,
in the compiled programming languages from which R’s language is constructed—C code
currently accounts for about half of R’s code and Fortran about one-quarter—loops can be
extremely efficient. The problem is that their implementation in R is extremely inefficient.
The reasons for this are complicated, but stem from the fact that R has no scalar data type
and R’s interpreter is quite slow at processing long loops.4 Even so, short loops execute
quickly and there is no real need for the R programmer to avoid them.
Here A is a vector of integers, B a character vector, and P a logical vector.5 There are three
types of numeric vectors: integer (as above—these are stored in 4 bytes per entry), double
3 The help file advises placing the else{ on the same line as the closing brace of the if, which aids R in
recognizing and parsing the full structure.
4 Over the coming year, as blocks of base-R code are compiled into byte code, we can expect to see substantial
improvements in efficiency. But an interpreted language such as R is unlikely ever to achieve the efficiencies of
a formal programming language with an optimizing compiler able to look across program units and arrange
operations and memory use in a forward-looking manner.
5 It is a very good idea to fully spell out TRUE and FALSE in your R programming, rather than abbreviate as
T and F, because T and F are not reserved keywords. They can stand for, and thus be confused with, the names of
objects that you have created elsewhere in your program.
CHAPTER 3. LANGUAGE ELEMENTS AND CONCEPTS 17
precision (i.e., real numbers, each stored in 8 bytes), and complex.6 If handed entries that
are of different types, the c() function will coerce them to one type (often that means to
character type).
Using square brackets, we can isolate the entries of a vector, as in
A [2] # 7
P [2:3] # FALSE and TRUE
The length of a vector is the number of its entries, which can be determined by the length()
function as in length(P), which gives 3 as a result. Note, however, that the length of the
character vector
Cscalar <- " Henrietta "
is 1 and not (as you might have expected) the number of characters in “Henrietta”. To
repeat: The length of a vector is the number of its entries, and Cscalar is a 1-entry vector. If
the number of characters is what you actually wanted to know,
nchar ( Cscalar )
which can be abbreviated as Z <- seq(10,45,5), leaving off the named arguments. The
advantages of named arguments are twofold: they serve as a short form of documentation
when you go back to reacquaint yourself with an old piece of code, and they do not have to
be issued in the order listed in the function’s help file, whereas un-named arguments do
have to appear in exactly that order.
To determine whether a vector has NA entries (“missing” or “not available”), use the
is.na() function:
Z <- c (1 ,7 , NA ,24 ,33)
is . na ( Z )
which will return FALSE FALSE TRUE FALSE FALSE. If you simply want to know whether
any entry is NA, summarize with
any ( is . na ( Z ))
6 Unfortunately,no provision has been made in base-R for short integers which can be stored in only 1 byte.
However, short integers can be written to and read from binary files.
CHAPTER 3. LANGUAGE ELEMENTS AND CONCEPTS 18
which in this case yields TRUE. Use sum(is.na(Z)) to find out how many NA entries there
are.7 Many functions operating upon vectors will return NA if there are any NA elements in
the vector: for the Z vector we’ve just defined, min(Z), max(Z), mean(Z) would all return
NA. You can have these functions remove all NA entries before undertaking calculations by
saying
mean (Z , na . rm = TRUE )
There are similar functions and tests for NULL (an empty object), NaN (not a number), and
Inf, -Inf, which represent positive and negative infinity, respectively.
Suppose that the Z we used above has been redefined to exclude NA values, by way of
this command:
Z <- na . omit ( Z )
(Interestingly, R records that NAs have been omitted by setting an attribute of Z accordingly.)
With the new Z, the result of the logical comparison
Z < 30
in which the Z vector is on the left of the inequality and the scalar 30 is on the right, is a
logical vector with entries TRUE, TRUE, TRUE, FALSE. As we’ll discuss below, in R nothing
in particular is required of you to shorten Z by one element, whereas you would have to be
more explicit about this step in a compiled language.
To prepare for working with real data in R, we will need two types of tools for accessing
the elements of vectors and arrays. Arrays differ from vectors in having their entries
arranged in dimensions. A matrix is a two-dimensional rows-and-columns array; an array
can have more than two dimensions. If M is a matrix with 3 rows and 2 columns, it will have
a dim attribute that indicates the number of its rows and columns. It is often convenient to
construct a matrix from a vector by specifying a number of rows and columns into which the
vector’s entries should be organized. (As does Fortran, which supplies a substantial amount
of R’s core code, R fills in a matrix on a column-by-column basis, filling the entries of the
first column before moving on to the next.) These dimensions then become an attribute
of the matrix, as the following code will illustrate.
> V <- seq ( from =1 , to =40 , by =2)
> attributes ( V )
NULL
> M <- matrix (V , nrow =4 , ncol =5)
> M
[ ,1] [ ,2] [ ,3] [ ,4] [ ,5]
[1 ,] 1 9 17 25 33
[2 ,] 3 11 19 27 35
[3 ,] 5 13 21 29 37
[4 ,] 7 15 23 31 39
> attributes ( M )
$ dim
7 This is an example of the power of R and also of the possibilities for confusion. What would it mean,
mathematically speaking, to sum up logical values? On its face, this does not appear to be a well-defined
mathematical operation. In this case, R carries out a sum by coercing TRUE values to 1 and FALSE values to 0.
CHAPTER 3. LANGUAGE ELEMENTS AND CONCEPTS 19
[1] 4 5
The matrix() function also has a byrow option that allows you to fill in the matrix on a
row-by-row basis.
R provides a great number of pre-coded operations for matrices. Among these are
matrix transposition, t(X), creation of an identity matrix of size n, via diag(n), matrix
inverse, solve(X), and matrix multiplication, X %*% Y. These are only a few of a long list
of operations.
Returning to vectors, consider a variable named Z (if you like, envision it as a column
vector) that has 100 entries. We can extract bits of its content in several ways. If we wish to
see the values of the 3rd, 7th, and 25th entries, we can use the row-index method,
r <- c (3 ,7 ,25)
Z[r]
whereby we create a vector of positive integers to indicate the entries of Z that we want. We
can shorten this to
Z [ c (3 ,7 ,25)]
and can collect adjacent entries and discrete entries via code such as
Z [ c (3 ,7 ,21:25)]
Alternatively, we could use the “repeat” function rep() to create a logical vector rl of the
same length as Z, and having set its 3rd, 7th, and 25th entries to TRUE, use this vector to pick
out the elements of Z that we are after:
rl <- rep ( FALSE , length ( Z ))
rl [ r ] <- TRUE
Z [ rl ]
Although the second method would seem to be more work for the same result, there are
times when it offers the easiest method of selection.
Sometimes we want to work with a subset of a vector’s entries that satisfy some logical
criterion, for instance, the entries that are less than 17,
SubSet <- which ( Z < 17)
The which() function returns a vector of indices—in this case, only those i entries (row
indices, so to speak) for which Z[i] < 17 is TRUE. Then Z[SubSet] would hold only these
particular Z values. The which function is especially helpful when Z has NA entries—it
collects only the indices of entries that are TRUE, passing over any that are NA or FALSE. By
contrast, the logical vector W <- Z < 17 would contain NA values in just the positions where
Z itself contains them.
The following sorting functions are commonly applied to vectors: sort(Z) returns a
vector whose entries are those of Z sorted in ascending order, and if we specify sort(Z,
decreasing=TRUE) we sort them in descending order—alternatively, use rev(sort(Z)).
The order(Z) function does not rearrange the entries of Z but instead returns a vector
whose entries indicate the ranking of the corresponding entry of Z. If we want to sort Z,
therefore, we can say Z <- Z[order(Z)]. This might seem redundant given the existence
CHAPTER 3. LANGUAGE ELEMENTS AND CONCEPTS 20
of the sort function, but when we consider dataframes we will see how useful order can
be in re-arranging dataframe rows.
A related and often very helpful function is match(Z1,Z2). It considers in succession
each entry of the vector Z1, reporting the position (that is, the index value) in the Z2 vector
where the first entry that matches it is found, returning NA if no such match can be located.
If you simply want to know whether any match is found, use %in%. To illustrate,
Z1 <- seq (1 ,5 ,1)
Z2 <- seq (1 ,4 ,1)
match ( Z1 , Z2 )
>[1] 1 2 3 4 NA
Z1 % in % Z2
[1] TRUE TRUE TRUE TRUE FALSE
As is the case with other R objects, vectors are things whose entries can be named. To
put this more formally, although by default a vector has no attributes, it can be assigned
a names attribute upon creation, as in
x <- c ( a =1 , b =5 , c =10 , d =7)
or the names can be assigned later, after the vector is created, as in this example
names ( x ) <- letters (1: length ( x ))
V [1 : (n − lag)]
which is a vector of length n-lag. Base-R makes the diff(x, lag=) function available to
efficiently compute such differences; also see the lag() function, although note that its
output is not a generic vector but rather a time-series object.
As the examples above suggest, R’s vectors can be re-sized seemingly at will. If a numeric
vector V has had six entries, say, then the assignment V[7] <- 5 would add a seventh entry.
Automatic extensions and deletions are a common feature of interpreted languages such as
R, whereas in compiled languages these changes require explicit commands to de-allocate
and re-allocate memory for the object. Of course, memory allocation must take place in
R as well, but it is carried out silently in the background, and is usually all but invisible
to the user. On occasion, however, reallocations can introduce gross inefficiencies that are
easily noticed. It is wasteful of time and memory, for instance, to construct a long vector by
extending it one entry at a time, or to add many rows to a matrix by doing so one row at a
time.
8 The letters() function (pre-defined in base R) creates a vector from as many of the 26 lower-case letters
as you request. LETTERS() is the counterpart for upper case.
CHAPTER 3. LANGUAGE ELEMENTS AND CONCEPTS 21
Factors Factors in R are simple and yet somehow elusive structures that often confound
new users. When given a character vector as input, the factor() function stores it as a
vector of integers running from 1 to k in which k is the number of unique values taken on by
the original character vector’s entries. R then establishes a short vector of characters—these
are the distinct entries among the original values, now termed levels—whose entries are
mapped to these k integers.
As a user, you have no control over the integer values 1, . . . , k as such: in an R factor, they
are always coded as 1, 2, . . . , k. You can, however, exert control over the order of the levels
if you want a different order than what R would establish by default. You refer to factor
values by way of the levels, not by the internally-stored integer values that these levels
are linked to.9 The levels(F) function will show what the levels of the factor F are and
how they are ordered—levels is actually an attribute of a factor variable. By default,
the levels of factors are defined according to the type of vector from which the factor
was originally derived. If the original vector was of character mode, the levels order is
alphabetical by default; if the original was of numeric mode, the factor levels derived from
it are ordered in ascending numerical order.
Factors are mainly used for categorical variables, that is, variables whose values have
no strictly numerical interpretation—some social science disciplines term them “nominal”
variables. You can also set up ordered categorical factors—for so-called “ordinal” variables—
either by using the ordered() function or by setting ordered=TRUE in the factor() function.
When storing data in a dataframe (see below), R will by default convert character vectors
into factors. A number of R’s graphics and statistical routines treat factors and ordered
factors as distinct from numeric vectors.
An example may clarify some of these issues. Consider a vector indicating socio-
economic status whose original values in the data are SES <- c("low", "middle", "low",
"low", "high"). If we convert SES to a factor using SES F1 <- factor(SES), allowing R
to decide the mapping, the levels will be arranged in the order "high", "low", "medium"
because that is the alphabetical order of the original character entries. In this way, "high"
is mapped to the integer 1, "low" to 2, and "medium" to 3. If we were then to create a bar
graph of socioeconomic status using SES F1, the high group would be placed first, then the
low group, and finally the medium group. This is probably not what we would want. To
establish a more sensible ordering, we have to intervene by being explicit about ordering
via the levels argument:
SES _ F2 <- factor ( SES , levels = c ( " low " , " middle " , " high " ))
Now "low" is mapped to the integer 1, "medium" to 2, and "high" to 3. Several functions
are available to help in ordering factor levels. The relevel() allows you to specify which
level is placed first (and thereby mapped to the integer 1), which is useful in setting the
omitted category of a categorical variable in a statistical model. The reorder() function
allows you to set the order of a factor’s levels according to the value taken on by another
variable. If at some point, you need to work with the underlying integer codes of the factor,
you can create a new vector with these integer values using unclass():
9 Although there are ways of getting at these integers via unclass(), as we’ll discuss, it is a good idea not
to organize your code around that approach.
CHAPTER 3. LANGUAGE ELEMENTS AND CONCEPTS 22
To create a factor from a continuous numeric variable, use cut() as in the following
snippet of code:
X _ factor = cut ( Data $X , breaks = c (0 ,5.5 ,10.5 ,15.5 , Inf ))
Note the use of the keyword Inf. In addition to its breaks and labels arguments, the
cut() function has further helpful arguments: ordered results = TRUE creates an ordered
factor; right = TRUE indicates that the intervals should be closed on the right and open
on the left, that is, of the form (, ] (otherwise [, ) is the default); and dig.lab allows you to
control the number of digits in the label if the label is not already specified.
Although you might expect factor levels to behave as if they were variable values, bear
in mind that they are instead attributes of the variable. Hence, as this example shows,
selecting a subset of observations based on a factor’s levels does not keep only the selected
levels:
R > x = factor ( c ( " A " ," B " ," C " ," A " ," B " ," C " ," C " ))
R> x
[1] A B C A B C C
Levels : A B C
Note that the x observations are correctly selected but "C" is retained among the levels. As
this suggests, it is possible to assign levels to a factor that do not correspond to any observa-
tion in the data.10 For instance, to construct a table with entries of 0 for x=4, x=5, and
x=8, given x <- c(1, 2, 3, 6, 7, 7, 9, 10, 10, 11), we can say table(factor(x,
levels=1:11)).
Dates and times There are specialized functions and objects in R for handling dates and
times. The as.Date function converts a string of the form yyyy-mm-dd to an object of class
Date:
MyDate <- as . Date ( " 2010 -11 -25 " )
and a format argument can be specified to handle other representations of dates. For
instance, given a date string ”11/25/2010”,
MyDate <- as . Date ( " 11 / 25 / 2010 " , format = " % m / % d / % Y " )
does the conversion. The ISOdate() and as.POSIXlt() functions generate a list with
detailed year, day, and time components.
10 In this way, factors in R are like value labels in S TATA , where codes that do not appear in the data can be
assigned labels.
CHAPTER 3. LANGUAGE ELEMENTS AND CONCEPTS 23
Lists
Lists are another fundamental element of R: they are ordered collections of objects that need
not be of the same type or length. For instance, the list W
W <- list (A , P )
contains a numeric vector A of length 4 and a logical vector P of length 3, the result of
our earlier assignments A <- c(1,7,13,25) and P <- c(FALSE,FALSE,TRUE). A list can
contain another list.
You can always get access to a list’s components by referring to the component you
want by its position. In the double-bracket or [[ method,
W [[2]]
extracts the second component from the list W, which originated as the logical vector P, and
W [[2]][3]
shows its third entry, which is TRUE. (We wrote “originated” in the preceding sentence
because the list W was constructed from the objects A and P, but the original A and P continue
to exist outside W in their original form unless they have been deliberately altered or
removed, and the contents of list W’s second component can be changed without affecting
P.) This double-bracket notation is decidedly peculiar and more than a little obscure, but
you will eventually get used to it. Note that having constructed the list W, if you simply type
W in your console, the contents of this list are displayed to you using the double-bracket
notation.
It is very often convenient to assign a name to each list component, much as we saw
above with simple vectors, via
W <- list ( score =A , pass = P )
If you do this and then type W in the console, the list is displayed differently than before,
with the first component now given as $score rather than W[[1]]. The $ method gives us a
way of accessing the components of such named lists. With this method,
W $ pass
gives us the component of W that is named pass. (Unfortunately, this is not an option if you
use the $ approach.) The names of list components are “internal” to the list itself; once you
extract a named component from a list, the name by which it is known within the list no
CHAPTER 3. LANGUAGE ELEMENTS AND CONCEPTS 24
longer applies.11 As with other types of objects, if a list object has named components, the
names() function will reveal them.
As is true in general with R, a list can be automatically extended by assigning a new
component. Our list W has had only two components up to now, but if we have already
defined a matrix MM and then say
W [3] <- list ( Matrix = MM )
this would add a third component to the list that goes by the name Matrix. A list component
can be removed (nullified) by assigning the keyword NULL to it,
W $ Matrix <- NULL
Functions
It is really easy to define functions in R, and having defined one, to use it within another
function. Here is a realistic illustrative example that puts some of the basics on display:
MyFunc <- function (x , u =12){
T <- 100.0 * c ( x [1:( u -1)] , sum ( x [ u : length ( x )]) ) / sum ( x )
names ( T ) <- c ( as . character ( seq (1:( u -1)) , as . character ( u ))
return ( T )
}
The basic idea of this function is to aggregate into the last entry of T all values of a longer
vector x that appear in its u-th entry and beyond. MyFunc therefore depends on two
arguments: x, a vector whose length must be greater than or equal to u, the second argument,
which by default is set to 12. In this function, we keep the first u − 1 = 11 elements of x
intact and sum up all the remaining elements from u to the end of the vector. The result is
placed in T, a new vector whose length is u, which is then converted to percentages. The T
vector is returned adorned with names for its entries.
We call the function as in the following example, in which V is a vector of length 25
which we want to truncate to length 10 and convert as above. If we say
V <- seq (1 ,25 ,1)
MyFunc (V ,10)
its values would be stored in the W vector. Note that by calling MyFunc with a second
argument of 10, we have over-ridden the default value of u=12. Had we mistakenly supplied a
second argument that was larger than the length of V, no error as such would be signalled
(in this case) but the T vector would be silently extended with NA values. Extra code (using
any(is.na())) would be needed to flag such errors.
Functions in R can return different types of objects: vectors, matrices, lists, and so
on. (Above we used the return() function, which is good programming practice, but by
11 One way around this is to use yet another bracket extraction method, which uses single brackets. The
code W[2] creates a one-item list with the item name retained.
CHAPTER 3. LANGUAGE ELEMENTS AND CONCEPTS 25
default the last object mentioned in an R function will be returned if there is no explicit
return statement.) As you would expect, variables defined within a function are local to
that function—they cannot be accessed from outside the function and cease to exist after
the function is exited, unless they are explicitly returned.
As we’ve mentioned, both user-defined and built-in functions in R have named arguments,
and if these arguments are referenced by name, they need not be called in the same sequence
as they appear in the function definition. If the arguments are not named in the call, however,
they must appear in exactly the order given in the definition. Default values for arguments
can be specified in the function definition, as we’ve just seen. If a default value is supplied
for an argument, then that argument need not be mentioned when the function is invoked.
In R, function definitions often include a final optional argument, denoted by ..., through
which additional information can be passed into the function.
Another interesting example of functions is given on the Q UICK R website:
# Get measures of central tendency and spread for a
# numeric vector x .
In this example, npar = TRUE means that by default the median, a non-parametric measure
of central location, will be calculated instead of the mean, although if the function is called
with npar = FALSE, this will over-ride the default setting so that the mean is returned. We
also see that print = TRUE by default, although again this can be over-ridden. Note that
the cat() function is used to send the results to the screen; it needs to end with the newline
character.
# Invoke the function accepting its default values
set . seed (1234)
x <- rpois (500 , 4)
y <- mysummary ( x )
We often come across anonymous functions that used on a one-time basis inside other
functions, where they are written as in this simple example that returns its argument x plus
10,
function ( x ) { x + 10}
CHAPTER 3. LANGUAGE ELEMENTS AND CONCEPTS 26
with the argument x assigned by the function that encloses this one.
What about cases in which the function we want is of the general form f i = h( xi , yi )
with xi being the i-th entry of a vector X and likewise for yi ? The mapply() function can be
used as follows, assuming that the function h has already been defined with two arguments:
f <- mapply ( FUN =h , X , Y )
by which the arguments of h are passed as vectors. (There are more possibilities; see the
mapply help page for the options.) In effect, we are asking for the following vector to be
created:
f1 h ( x1 , y1 )
f 2 h ( x2 , y2 )
.. = .
..
. .
fn h( xn , yn )
Some but not all R functions do this sort of thing naturally. For instance X + Y automatically
produces a vector whose entries are the sums of the respective entries of the input vectors.
So mapply is actually needed only for functions that are not already vectorized in this sense.
Finding the source code of functions It can be difficult to track down the code that
defines R’s functions. If you put the name of the function whose code you’d like to see in
quotes, then getAnywhere("func") gets you started by at least identifying the package (or
packages) in which that function is defined. I have found the showMethods() function to be
even more helpful. Here is an example showing how code for the plot3D() function of the
rasterVis package can be displayed:
library ( rasterVis )
showMethods ( " plot3D " , includeDefs = TRUE )
3.2 Dataframes
In R, a dataframe is a collection of variables—if you like, think of them as named column
vectors, each of which has the same number of entries—that can be of different types for
different columns: logical, character, integer, double precision, complex.12 It is what social
scientists would normally term a “dataset.” In formal terms, R dataframes are special cases
of lists with named components, and some of the ways we access and manipulate dataframes
stem from this fundamental fact. For instance, the length() of a dataframe is not the
number of its rows, as you might have thought, but rather the number of components of
the list, that is, the number of variables in the dataframe.
To get a visual sense of an R dataframe, the edit() and head() functions are very
helpful. If the dataframe is named X, then to see its rows and columns in a spreadsheet-like
layout, we simply say edit(X).13 A more basic display of the first few rows is produced by
12 Indeed, each entry of a dataframe column can be a list, which is very helpful in some programming
contexts.
13 At present RS TUDIO, the otherwise-excellent graphical front end for R, does not support the edit function
head(X), and you can control the number of rows shown by setting the function’s argument
n= to the desired number.
would extract the values of the 3rd, 7th and 25th observations for all variables in the X
dataframe, much as if the dataframe X were a matrix whose 3rd, 7th and 25th rows were
needed. We can also apply operators that exploit the fact that dataframes are lists with
named components. If ID is one of the variables (i.e., named components) in the dataframe,
the $ operator can be used,
X $ ID
which would extract the full ID vector. This operation can also be performed using the
“double bracket” operator we introduced earlier,
X [[ " ID " ]]
In many cases, the helpful subset function allows us to express clearly what we want,
and it is designed so that variable names do not need to be given in quotes. (At the end
of this chapter, we will return for a more closer look at the meaning and context of object
names in R.) Its generic form is
subset (X , select = ID , subset = rlogic )
with the first argument being the name of the dataframe, and with ID being the variable in
the dataframe that we want to keep and rlogic being a logical vector that determines the
rows (observations) to be extracted. For example,
subset ( Cars , select = Model , subset =( MPG > median ( MPG )))
would select the variable Model from the dataframe Cars, taking only the rows of this vector
for which the miles per gallon variable MPG has a value in excess of the median. Note that
the subset argument of subset() is optional; to keep only variables Var1 and Var4 but
retain all dataframe rows, use:
View ( X )
CHAPTER 3. LANGUAGE ELEMENTS AND CONCEPTS 28
The setdiff() function is one of a small family of R functions that implement set-theoretic
operations.
Note that when we sub-set a dataframe by any of these methods, thereby eliminating
some rows, all the factor levels are carried along without modification even if there is
no longer any observation left in the data to which they apply. (We saw an example of this
behavior earlier, outside the context of dataframes.) That can be annoying when tabulating
the new dataframe—but the fix is easy. Assume that a dataframe X has been derived by
sub-setting observations from some larger dataframe. The command
X <- droplevels ( X )
will strip away the levels that have no counterpart in any factor of the X dataframe.
Transforming variables
R offers the with(), within(), and transform() functions to allow us to operate on vari-
ables in a dataframe without having to repeat the dataframe name each time a variable is
mentioned. The with() function is used for simple operations, such as
X $ NewVar <- with (X , Var1 - Var2 )
which adds NewVar to the X dataframe, setting it equal to the difference between X$Var1
and X$Var2. When multiple or more complex operations are needed, the within() function
comes into play:
X <- within (X ,
{ NewVar1 <- Var1 - Var2
NewVar2 <- log ( Var1 )
})
memory for the dataframe and then fill it in. If we know that N rows are eventually going to
be needed (or we make an educated guess, choosing N to be on the high side), then we can
initialize memory by saying
mydata <- data . frame ( Var1 = numeric ( N ) , Var2 = character ( N ))
If, ultimately, we want the second variable to be a factor, we can make that conversion after
filling in the dataframe. Also, if we have set N to be larger than the number of rows that
turn out to have been needed, we can delete those extra rows.
to collect vectors V1, V2, and V3 in a dataframe, whereas if you begin with a list having
named components that are vectors, use as.data.frame():
List <- list ( v1 = V1 , v2 = V2 , v3 = V3 )
mydata <- as . data . frame ( List )
The new vector will by default be named values and the factor named ind; its levels will
be v1 and v2. The unstack function does the reverse operation.
and give X, an object you’ve created, the name you want it to have from the SomeNames
vector. Of course, you could simply make a copy of X under the new name,
Name2 <- X
or you could use an alternative method by which you assign "Name2", the second entry of
the SomeNames vector, to X via
assign ( SomeNames [2] , X )
This indirect approach is needed if you do not want to type out Name2 explicitly in your
code. If, for instance, you are writing a loop within which X is created many times (each
time having some distinctly different feature), you might greatly benefit from the indirect
naming allowed by assign() so that at each iteration, you save the version of X that is
specific to that iteration under a distinct name.14
The reverse operation is carried out by the get() function, which takes a name in the
form of a string of characters, and locates the object that goes by that name. As with
assign(), there would be little practical need for this function unless you do not want to
type explicitly the name of the object.
There will be times when you would like to construct an R expression from a series of
characters. We can put whatever we like in a series of characters, but when we ask for that
series to be interpreted by R as an expression, R must parse the characters and ensure that
they form a coherent whole, describing a meaningful operation. Here is an example, which
assumes that D is a vector with named elements:
D <- seq (1 ,3)
names ( D ) <- c ( " Var1 " ," Var2 " ," Var3 " )
x <- c (2 , 3)
addCols <- paste ( names ( D )[ x ] , collapse = " + " ) # gives " Var2 + Var3 "
e <- paste ( " ( " , addCols , " ) / " , length ( x ) , sep = " " ) # "( Var2 + Var3 ) / 2"
To this point, e is merely text, a series of characters. We now ask R to take a closer look at it
and convert the text—assuming all is well—into an expression:
E <- parse ( text = e )
Although they may seem similar, E and e are fundamentally different objects. To get a sense
of the difference, write e in a form that is not coherent and try to parse() it:
e <- " ( Var1 + ) / 2 "
parse ( text = e )
You get an error, exactly as if you had mis-typed a command at the console and tried to
execute it. Indeed, just as in all computer languages, R’s first step is to parse the text you’ve
typed to ensure that it is meaningful. When it isn’t, you are told that you’ve made a mistake
and R waits for you to correct it.15
14 S TATA programmers often use macros for this purpose, and they are apt to find R’s approach a bit
roundabout.
15 On occasion you will find a need for the deparse() function, which takes an R expression and turns it
back into a series of characters, such as would be needed to label the axes of a graph.
CHAPTER 3. LANGUAGE ELEMENTS AND CONCEPTS 31
When you give R input from the console in the form of a series of characters, and the
characters pass the parse() test—that is, they make sense as an expression that could be
executed—then R’s next step is to proceed to evaluate the expression, that is, to give it a value.
An evaluation could fail if, for instance, a perfectly legitimate expression makes reference to
an as-yet undefined input variable and therefore cannot be evaluated. At the console, errors
from illegitimate expressions and expressions that can’t be evaluated appear in a flash, so
that the logical differences between these two types of errors are blurred. In programming
with scripts, however, you can separate the parse and evaluation steps, holding on to a
successfully parsed E expression until the point at which it should be evaluated. In such
cases, we would use the eval(E) function to carry out the evaluation.
This discussion may well seem like an excursion into the arcane reaches of computer
science, but try to bear in mind the distinctions between text, expressions, and evaluations,
and think about how every computer language, when handed text as input, much first assess
whether—according to the rules of the language—the text makes sense as an expression in
that language, and if it does, then attempts to evaluate it. The differences come into play
when we engage in real programming with R, especially when we make use of functions
that we have defined. We’ll see examples of these concepts—involving the parse(), eval(),
quote(), and substitute() functions—when we discuss the data.table package, which
provides remarkable facilities for dealing with large datasets.
Four
Those of us who got our econometrics training in S TATA or S AS are apt to find the base
installation of R strangely lacking in many of the data-manipulation tools we would have
expected to be immediately at hand. For example, base-R offers no obvious facility for
recoding variables and assigning labels to the various values of categorical variables. For-
tunately, the car package supplies a recode function that is at once simple and powerful,
closely resembling a similar function in S TATA and no doubt other software. Renaming
variables is so easy that it requires no add-on package. And there are several approaches in
R that provide something akin to value labels. With a little experience the R user comes to
understand how such familiar data-manipulation tools can be re-created in R and made
even more powerful, in some cases, than their counterparts in other statistical software.
In the previous chapter, we went over many features of R’s dataframes, but there are
two that deserve re-emphasis here as we ready ourselves to look more deeply into these
structures. When you want to refer to a variable in a dataframe, remember that you have
two main ways of doing so. Suppose the dataframe is named MyData and let it contain the
variables UrbanPercentage, Population, and Region. You can refer to Population using
the $ method, such as in
fivenum ( MyData $ Population )
when you’d like to know the median, quartiles and other summary statistics for the variable,
which is what the fivenum() function supplies. Alternatively, use what some term the [[
method,
fivenum ( MyData [[ " Population " ]])
32
CHAPTER 4. EXPLORING SOCIAL SCIENCE DATA 33
whereas fivenum(MyData$X) would cause an error, since there is no variable literally named
X in the dataframe. Even more usefully, you can employ indirect references to variable
names when you construct loops for repetitive tasks:
X <- c ( " UrbanPercentage " ," Population " )
for ( x in X ){
fivenum ( MyData [[ x ]])
}
there are several ways of telling R that the variable names you are using refer to names
within a particular dataframe. The with() function is one way of doing it. Given two
population variables that are measured 5 years apart, we could find the (continuous) rate of
population growth over the period using:
MyData $ GrowthRate <- with ( MyData ,
( log ( Population2 ) - log ( Population1 ) ) / 5.0 )
The program first clears all R objects from memory using the rm(list=ls()) command,
loads the foreign package, and establishes some paths. It then reads in the S TATA file via
the read.dta() function and stores its contents in the R dataframe DataList. We ask that
in the course of this operation, all character vectors be left just as they are (giving the same
treatment to integer-valued variables with S TATA value labels), and not converted to R
factors.
rm ( list = ls ())
library ( foreign )
StataDataFile <- " N : / PRD / Data / DHS / Stata _ Data / DHS _ DataList . dta "
PathIn <- " N : / PRD / Data / DHS / Shapefiles / HIV _ Spatial _ Data "
PathOut <- " C : / Research / Temp2 "
Log <- " C : / Research / Logs / New _ HIV _ DBF . txt "
This code also illustrates the use of the capture.output command, whose arguments are,
first, the output that is to be sent to a log file, which can be as simple as a single string of
characters; second, the name of that log file; and third, whether the output should over-write
the file contents or be appended.
Next, the program loops through all the rows of DataList and examines the content of
the HIV_Shapefile variable. If a valid file name is present, the script codes the full path
to it in File. In this example, the files are zipped files, but it is no problem for R to unzip
the each such file and place the contents (that is, the unzipped files) in another directory
(here C:/Research/Temp2). One of these files is a DB ASE .dbf file—the foreign package
also handles this (rather primitive) form of dataset—which the program then reads into R
dataframe named X. Note in particular how the glob2rx() function is used to pick out the
.dbf file—this function translates standard wildcard expressions, such as *.dbf, to regular
expressions. The names of all of the file’s variables are listed and two of them are tabulated.
(We’ll discuss the table() function in more detail in a moment.) The program finishes up
by removing all files from the temporary working directory, so that the next iteration of the
loop will execute cleanly.
for ( i in 1: nrow ( DataList ))
if ( DataList [i , " HIV _ Shapefile " ] ! = " . " ) {
File <- paste ( PathIn , DataList [i , " HIV _ Shapefile " ] , sep = " / " )
capture . output ( File , file = Log , append = TRUE )
unzip ( zipfile = File , files = NULL , overwrite = TRUE ,
junkpaths = TRUE , exdir = PathOut )
DBFile <- list . files ( path = PathOut , pattern = glob2rx ( " * . dbf " ) ,
ignore . case = TRUE )
DBFile <- paste ( PathOut , DBFile , sep = " / " )
X <- read . dbf ( DBFile , as . is = TRUE )
At the clean-up step, the list.files() function is invoked with the full.names=TRUE
option, which retains the full path in the file name.
R’s default behavior is not to print results produced in the course of a loop unless the
print() function is applied, which has the effect of sending the output to the screen.2
When the output is extensive, however, sending it to a log file makes more sense. There
are other methods available for directing output to a file, some of which can be even more
convenient than the one we’ve shown. One of the easiest approaches is to establish a “write”
connection to an external file and then use the cat() or capture.output() function to send
the output there:
Connection <- file ( " MyLogFile . txt " , " w " )
cat ( results , file = Connection )
close ( Connection )
places them in a character vector. We will shortly discuss how comment() can be used
to create the R equivalent of variable labels. You can store both the variable names and
variable labels as two columns in a character-valued matrix, using
VariableLabels <- cbind ( names ( X ) , attributes ( X ) $ var . labels )
In the latter case, the attributes(X)$val.labels holds the names of these value labels,
and in attributes(X)$label.table, which is itself a list, each value-labelled variable is
a component, holding in a named vector the integer codes as they were defined in S TATA
along with the associated character labels. So all of the essential information from the S TATA
dataset is retained, if not in a place where the new R user might think to look.
will show all the variables in the mydata dataframe for the first 25 observations. The
edit(mydata) function is a nice alternative; it arranges the data in a spreadsheet-like
format with a scroll bar, allowing you the somewhat dangerous luxury of being able to edit
the data directly. In R STUDIO you would use View(mydata) instead of edit().
in which the first argument ... is where you’d insert the name of the factor (or a pair of
them, or a vector, or a matrix) to be summarized. (You are given some control over the
handling of NA and NaN entries via the useNA argument and the set of factor levels to be
excluded.) The output of table is an object that holds the tabulated counts—it is actually
of the table class—which is akin to a vector or matrix with named rows and columns.
For instance, if a variable Data$X is integer-valued, then F <- table(Data$X) creates a
factor-like vector F whose levels are the distinct values of the variable and whose entries
are the total number of cases having each such value.
The xtabs() function creates contingency tables, that is, cross-tabs. Its syntax is
xtabs ( formula = ~ . , data = , subset , na . action ,
exclude = c ( NA , NaN ) , drop . unused . levels = FALSE )
in which the formula indicates the factors whose levels are to be cross-classified, and the
name of the dataframe can be specified in data.
The xtable package deserves special mention here. It easily translates the results of
many of R’s descriptive and statistical functions into LaTeX tables. It thus fills an important
gap between R’s minimalist output as sent to the console, and the professionally formatted
output needed for inclusion in scientific articles and presentations. The package has options
that can be used to control the number of digits appearing after the decimal point in a
CHAPTER 4. EXPLORING SOCIAL SCIENCE DATA 37
table’s entries, and which allow for rotated and long tables to be produced. It is a great
time-saver.
Alternatively,
comment ( mydata [[ " New " ]]) <-
" New is the renamed version of the Old variable "
Note the double square brackets in the second approach. To see the comment, simply say
comment(mydata$New).
in which index vector is a vector of the indices i of x for which we want x[i] <- value.
The recode function from the car package lets you define each element of a new vector
y using the values of the corresponding element of the old x vector. Multiple recoding
“rules” can be defined, provided that they are separated by semicolons. For example,
x <- c (1 ,2 ,3 ,2 ,1 ,3 ,1 ,1 ,3)
y <- recode (x , " c (1 ,2)= ’ A ’;
else = ’B ’ " )
# y is : " A " " A " " B " " A " " A " " B " " A " " A " " B "
# Alternatively ,
y <- recode (x , " 1:2= ’ A ’; 3= ’B ’ " )
# y is : " A " " A " " B " " A " " A " " B " " A " " A " " B "
4.7 Sorting
We introduced the order() function in our earlier discussion of vectors, but will briefly
review it here because the need to sort and order dataframe rows (observations) is so
common. The order function is both simple in syntax and powerful in execution. It does
not itself rearrange the order of rows, but rather establishes a vector of row-index integers,
whose length is the same as the number of dataframe rows, and which indicates the order
in which these rows would be sorted according to the criteria set out in the function.
For example, given a dataframe mydata, to determine the order of its rows according to
one dataframe variable, Var1, and add this information to the dataframe, we would say
mydata$Order <- order(mydata$Var1), on the assumption that ascending order is what is
desired. If Var1 is numeric, the indexes will indicate order from smallest to largest; if it is a
character variable, the ordering would be increasing in the lexicographic sense. To proceed
to actually sort the rows of the dataframe on this basis, we would then go on to say mydata
<- mydata[mydata$Order,].
More generally, mydata$Order <- order(-mydata$Var1) uses the - sign to request
descending order, and
mydata $ Order <- order ( mydata $ Var1 , mydata $ Var2 )
establishes the order of rows first with respect to Var1 and, for each value of this variable,
then by Var2. Once mydata$Order has been derived, the dataframe is sorted using mydata
<- mydata[mydata$Order,] just as before.
will place TRUE in mydata$Dup if the corresponding row of mydata fully duplicates any
earlier row. To look for partial duplicates on variables Var1 and Var2 of the dataframe, and
to add that information to the dataframe, you’d say
mydata $ Dup <- duplicated ( mydata [ c ( " Var1 " ," Var2 " )])
To eliminate all fully duplicated observations in one go—not a good idea until you have
inspected these duplicates and confirmed that they should be tossed out—simply say
myUniqueData <- unique ( mydata )
4.9 Merging
Suppose that you have two dataframes, named Frame1 and Frame2, which contain identifier
variables named ID1 and ID2 respectively. Although the names of the identifier variables dif-
fer, their content is (mostly) the same. ID1 may have a few values that do not appear in ID2
CHAPTER 4. EXPLORING SOCIAL SCIENCE DATA 39
and vice-versa, but the vast majority of values appear in both. Hence, the two dataframes
can be merged on the basis of the common values in ID1 and ID2, provided that we then
take care to inspect non-merged cases to verify that the merger has been done correctly.
You can get an idea of how many such cases there will be by using match(Frame1$ID1,
Frame2$ID2) and then match(Frame2$ID2, Frame1$ID1), paying close attention to the
NAs.
Once you are satisfied that you have a legitimate basis for merging the two dataframes,
the merge() function handles the task with ease. You would say something like this:
Merged <- merge ( Frame1 , Frame2 , by . x = " ID1 " , by . y = " ID2 " , all . x = TRUE )
In this example the by.x argument refers to the first dataframe and the by.y argument to
the second. In asking for all.x=TRUE, we are requesting that all rows of Frame1 be retained
in Merged, and for any rows for which no ID2 value could be found that matched the ID1
values, NAs will appear in the columns of Merged for the variables of Frame2.
When the dataframes to be merged are very large, the merge() function can be slow. In
such cases, the data.table package has “join” functions that execute very quickly.
When it is imported into R, the S TATA variable MyVariable will be handled according to
the arguments of read.dta that pertain to labelled categorical data. By default, that is,
with the convert.factors=TRUE option, MyVariable would be converted to an R factor.
Now, as you will recall, factors in R represent categorical data using (internally) a set of
consecutive integers starting at 1, which the user interacts with via character-valued levels.
By default, therefore, the character portions of S TATA value labels are examined and placed
in alphabetical order, and (in this case, which has only three distinct categories) the integer
1 would be internally linked to “Category numbered 7 is labelled this way”, whereas 2
would be linked to “Label for category 1” and 3 would be linked to “The label for the third
category”. If you have no intention of returning to S TATA, this would probably be fine. (You
might, perhaps, wish to reorder the levels, but that is up to you.) If, however, you were to
write the R version of MyVariable back to S TATA, it would arrive there as a variable whose
CHAPTER 4. EXPLORING SOCIAL SCIENCE DATA 40
integer values are the consecutive integers 1, 2, 3, each of which is associated with what
had been the corresponding level in the R factor. That is not at all what the original S TATA
coding looked like! And in S TATA, your programs will probably be coded in terms of the
integer values of MyVariable, rather than the labels attached to them.
The scope of this problem is wider than might be initially appreciated. Long-time S TATA
users will know that many inherently numeric variables can have value labels assigned
to certain specific values. Not uncommonly, for instance, the numeric variable age would
have an integer value 96 that has a label such as “96 years and older” applied to it. When R
tries to import such a variable, it will take it to be categorical and convert it to a factor. Upon
export back to S TATA, therefore, the age variable would have integer values that could
depart in many unforeseen ways from the S TATA original—there is potential for serious
confusion. Clearly vigilance is in order!
Five
Although R’s capabilities extend well beyond the analysis of dataframes, it is in this area
where most of us will spend most of our time. Considerations of computational efficiency
are typically of little practical importance when your dataset is small—by which I mean no
more than a few tens of thousands of observations—but can become a first-order concern in
larger problems. As mentioned earlier, flaws in R’s design cause loops—especially those
with many iterations—to operate with appalling inefficiency. As a result, much effort has
been expended to develop substitutes for loops that are more efficient given the constraints
of the language.
The base-R approaches to be discussed in this chapter are all variations on the following
stylized set-up:
OutPut <- LoopSubstitute ( List , Function ( x ) , OtherArguments )
Because a dataframe is a list, dataframes often appear as the List argument, but we also
encounter vectors and matrices in this role (whose columns or rows become the x1 , x2 , . . .
components) and sometimes similar structures appear here. The efficiency gains come
from cases in which xi is a vector to which whole-vector operations can be applied via
the function in question. To clarify further: Although ”x”, without any subscript, would
seem to be the argument of Function(x), in reality "x" is simply a name of convenience
for whatever the xi component being processed is actually named. The function itself can
be a named function (built-in or supplied by the researcher) or an “anonymous” function
that appears as function(x){ ...}.
An alternative route to improving efficiency is exemplified by the data.table package,
which supplies a redesigned version of dataframes that can be processed with high efficiency.
Especially for researchers who deal with what is coming to be called “big data,” the
data.table is rapidly emerging as the storage unit of choice, especially for problems in
41
CHAPTER 5. APPLYING FUNCTIONS TO DATAFRAMES 42
which functions need to be applied to subsets of the data (that is, collections of dataset
rows). We will give a brief introduction to data.tables later in the chapter.
apply(): Perhaps the simplest function of this type in base-R is apply, which works on
whole columns or rows of arrays (in typical use, matrices). It can work on dataframes, too,
but only if the dataframe happens to have columns that are all of the same type. If you plan
to use a dataframe that is heterogeneous in its data types, remember to feed apply only the
columns that are of the same type!
The apply function is invoked as
apply (X , MARGIN , FUN ,...)
in which X specifies the name of the matrix to be operated upon, FUN= specifies the name of
the function—it can be a built-in function or a user-defined function customized for the task
at hand—and MARGIN is 1 if the function is to be applied across each row of the matrix and
2 if the function is to be applied to each column. (For arrays, higher-level margins can be
indicated.) The ... argument provides a place where additional arguments can be passed
to the function named in FUN.
In the typical use of apply, the FUN function operates on what amounts to a vector
argument (whose length is the length of the row or column depending on whether MARGIN
is set to 1 or 2) but it returns a scalar result. But if FUN = sort, for example, apply will
sort each of the (row or column) vectors, so that both input and output are vectors. An
interesting possibility is offered by setting MARGIN=c(1,2), which for matrix X has the effect
of applying the function to each individual (scalar) element of the matrix.
As examples of usage, the last of which employs an anonymous function,2
# create a matrix of 10 rows x 2 columns
m <- matrix ( c (1:10 , 11:20) , nrow = 10 , ncol = 2)
# mean of the rows
apply (m , 1 , mean )
# mean of the columns
apply (m , 2 , mean )
# divide all values by 2
1 Hadley Wickham’s plyr package—not yet written up in this document—gathers together many of the
approaches we will describe below, and organizes them using a consistent, well-organized naming scheme and
a common syntax which is likely to be easier to remember. He has also added options for parallel processing
which should eventually bring further efficiency gains.
2 Taken from http://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/.
CHAPTER 5. APPLYING FUNCTIONS TO DATAFRAMES 43
Faster versions of the most popular apply functions are available—they are faster
because they have been compiled in C. The rowSums and colSums functions do what their
names suggest: they sum across rows and columns of a matrix or an object that can be
transformed to a matrix. Similarly, rowMeans and colMeans calculate means across rows
and columns.
lappy() and sapply(): The lapply() function is applied to lists, and it returns its result as
a list as well. (Recall that a dataframe is one kind of list, as is a single variable Var from a
dataframe D if specified in the single-bracket D["Var"] form.) When L is a list,
lapply (L , FUNC ,...)
applies the function named by FUNC= to each list component in succession. For instance,
# create a list with 2 components
L <- list ( a = 1:10 , b = 11:20)
# the sum of the values in each component
Lout <- lapply (L , FUNC = sum )
# Note the nature of the output :
> class ( Lout )
[1] " list "
> Lout
$a
[1] 55
$b
[1] 155
If you find that having a list as output is not really what you want, the sapply function
offers one alternative. It is also given a list as input, but if the output can be simplified to
vector form (or matrix form), sapply will do so:
Sout <- sapply (L , FUNC = sum )
> class ( Sout )
[1] " integer "
> Sout
a b
55 155
This is an integer vector with the original list component names a and b used to name the
entries of the vector. A nice feature is that the output of the function used in sapply need
not be a scalar; if we make use of the built-in function fivenum(), for instance, we obtain
sapply (L , fivenum )
a b
[1 ,] 1.0 11.0
[2 ,] 3.0 13.0
[3 ,] 5.5 15.5
[4 ,] 8.0 18.0
[5 ,] 10.0 20.0
CHAPTER 5. APPLYING FUNCTIONS TO DATAFRAMES 44
which gives for each input list component a and b, the minimum, first quartile, median,
third quartile, and maximum values. A variant of sapply named vapply can also be used
in such cases; it allows names to be specified for each element of the output.
One helpful application of sapply is to determine the class of each column of a
dataframe, here named X:
sapply (X , FUNC = class )
does this:
rbind ( mylist [[1]] , mylist [[2]] , mylist [[3]])
In other words, the rbind function is called only once, but is supplied with 3 arguments
(the three component dataframes) in that call. (Obviously the function in question has to be
prepared to accept multiple arguments!) In this case, as it happens, rbind is designed so
that it will create one dataframe from the three component dataframes.
By contrast, D <- lapply(mylist, FUNC=rbind) would call rbind three times, each
time passing it one argument, which is equivalent to the sequence
rbind ( mylist [[1]])
rbind ( mylist [[2]])
rbind ( mylist [[3]])
Because of the way rbind is designed, this sequence does not do what we want.
In short, if L is a list with n components and f is a function, then do.call("f", L) calls
f just once, giving it n arguments (if that is allowed), but lapply(L, FUNC=f) calls f a total
of n times, each time giving it a single argument.
use data.table, taken from an analysis of a census dataframe named M that has 14 million
observations:
MT <- as . data . table ( M )
setkey ( MT , Urban , Sex , AgeGroup )
In this code, we first create a data.table object MT from the ordinary dataframe M. We
then ask for MT to be sorted according to a key that is defined by combinations of the
Urban, Sex, and AgeGroup variables. Making use of the quote() function, we then define
an expression that, once it is evaluated, becomes a function that creates the mean of the
Major variable weighted by the SamplingWeight variable. (Major takes values of 1 and 0,
and we want the resulting mean to be expressed in percentages rather than proportions,
which is the reason for introducing the 100.0 here). The first line that follows, MT[,eval(q)],
evaluates the quoted expression and displays on-screen the weighted mean of Major across
all observations. The next line does the same calculation for each sub-set of the data defined
by the combination of Urban, Sex, and AgeGroup codes. The result is stored in Z, which is a
new data.table.
This bit of code should give you at least an initial sense of how to go about specifying
complex operations using data.table. We now describe in a bit more detail some of the
similarities and differences between dataframes and datatables. The first thing to know is
that a data.table object is a type of data.frame, so that functions designed to work with
dataframes should also work on datatables. (No doubt there are some exceptions.) But
beneath the surface there are significant differences in the structure of these objects, and
even on the surface the user sees significant differences in syntax. In general, as mentioned
above, a data.table object is designed to work efficiently with large datasets, and great care
has been taken to minimize data copying so that this time- and memory-using operation
is carried out as infrequently as possible. To use the language of computer programmers,
data.table operations are meant to employ copy by reference methods rather than copy by
value methods whenever the latter can be avoided.
The concept of a key is, yes, key to the operation of data.table. A key is a set of one
or more data.table columns that define the order in which the rows of the datatable are
sorted. (In recent versions of data.table, character columns as well as integers are allowed
in keys and are preferred to factors, although factors can be used.) Once the key is set and
the datatable correspondingly sorted, we then select sub-sets of rows by referring to their
key values. Using a key organizes the data in such a way that the cases for each distinct
data group are kept contiguous in memory. To illustrate, suppose that we have only a single
key variable named Key1 in the datatable. To order the datatable, you would say
setkey ( DT , Key1 )
Then, supposing "b" to be one of the values that Key1 can take on,
DT [ " b " ]
CHAPTER 5. APPLYING FUNCTIONS TO DATAFRAMES 46
selects all rows for which Key1=="b". Because it exploits the fact that the data have already
been sorted, and the observations whose key is b lie next to each other in memory, the
algorithm that selects these rows is far faster than the equivalent for an ordinary dataframe.
Indeed, for the new user of data.table, the difference in speed can be startling.
To verify that a data.table has a key, use the haskey() or key() functions, or examine
the concise summary produced by tables().
If multiple variables are needed to define the key, use something like this for a case
where the first of the keys is character-valued and the second integer-valued:
setkey ( DT , Key1 , Key2 )
DT [ J ( " b " ,7)]
the first line defines the key and the second selects the rows for which Key1=="b" and
Key2==7. The J() function is used often in data.tables.
A convenience function mult can be employed to pick out the first or last row in a set of
rows identified by their key. To get the last such row, use
DT [ " b " , mult = " last " ]
for the case of a single key variable having "b" as one value; and specify mult="first" to
obtain the first row in the set.
Let’s denote a data.table object by DT and a data.frame object by DF. Both of these
objects can be accessed via row and column indexes and identifiers, as in DT[i,j], but
data.tables allow different meanings for "i" and "j" than what we are accustomed to in
the context of dataframes. There are fundamental differences in how we refer to named
columns—the "j" dimension. In working with data.tables, all variable names are un-
derstood (by default) to refer to names within the datatable and no quotation marks are
needed. Hence,
DT [ , VarName ]
returns a vector with the contents of the VarName variable. If you want a data.table as the
output, put the variable name inside a list():
DT [ , list ( VarName )]
where in the second version, we tell data.table to look outside itself (rather than within
itself) to find what mycol contains (in this case, x).
Data.tables DT[i,j] are designed so that the j argument can be one or more expressions
that depend on variables in the data.table. As illustrated just above, use list() to enclose
multiple "j" expressions. Hence, use
CHAPTER 5. APPLYING FUNCTIONS TO DATAFRAMES 47
DT [ , sum ( VarName1 )]
will produce the results. (The number of rows of each by-group is stored in a special
variable called .N that you can access.) The code executes fastest if VarName2 is also the key,
but it is very fast even if by= does not refer to the key. With two or more variables in the
by-statement, you can use
DT [ , sum ( VarName1 ) , by = " VarName2 , VarName3 " ]
More complicated cases, in which the by categories are functions of the data.table variables,
are handled with lists. Assuming that one of the data.table variables is named dateCol,
representing a date, and that the month() function extracts the month from that date, we’d
write something akin to this:
DT [ , sum ( VarName1 ) , by = list ( month ( dateCol ) , VarName3 )]
As we saw at the outset of this section, it is often convenient to define the function in a
quoted expression specified outside the data.table, and then evaluate it inside. The quote()
and eval() functions are what we need to make this approach work. A simple example
will illustrate. Let q <- quote(x), with x being the name of one column in the datatable;
then
DT [ , eval ( q )]
returns the vector x. If we wished instead to have a 1-column datatable, we would use
q <- quote ( list ( x ) )
DT [ , eval ( q )]
This provides another mechanism for referring indirectly to the names of variables in the
data.table.
For a more complicated example, suppose that we want to construct a 1-row, 2-column
datatable holding the mean and standard deviation of the x variable:
q <- quote ( list ( mean ( x ) , sd ( x )) )
DT [ , eval ( q )]
To elaborate this approach so that the mean and standard deviation are calculated on a
group-specific basis, with the variable Group defining the distinct groups, use the by=Group
option:
CHAPTER 5. APPLYING FUNCTIONS TO DATAFRAMES 48
As mentioned earlier, the sorting algorithm of data.table is so fast that even if Group was
not already defined as the key variable, this executes very quickly.
To process the data subsets using a function such as quantile() that returns a (named)
vector of values,
DT [ , as . list ( quantile ( x )) , by = Group ]
will create new datatable columns that are named with the names of the vector’s entries.
You should now be able to understand the example with which we began this discussion
of data.table. The by operations are intelligently designed, and certain variables that you
are likely to want are calculated for you automatically. This code snippet shows how to
process the data in “chunks”, which in this example are generally 4 rows, but with the last
chunk only covering 2 rows:
set . seed (0)
DT = data . table ( a = rnorm (10))
DT
a
[1 ,] 1.262954285
[2 ,] -0.326233361
[3 ,] 1.329799263
[4 ,] 1.272429321
[5 ,] 0.414641434
[6 ,] -1.539950042
[7 ,] -0.928567035
[8 ,] -0.294720447
[9 ,] -0.005767173
[10 ,] 2.404653389
Merging data.tables There are a number of changes afoot in the way that data.table
handles merges. The simplest case requires that the keys of the two data.tables have the
same name. This illustrates:
setkey ( dt1 , id )
setkey ( dt2 , id )
# Merge on the basis of id :
> dt3 <- dt1 [ dt2 ]
CHAPTER 5. APPLYING FUNCTIONS TO DATAFRAMES 49
That is, the split function produces a list of dataframes, one for each level of the factor
M$Sex, each dataframe containing the observations on variables Major and SamplingWeight
that correspond to that factor level. When combined with one of R’s list processing tools,
split() enables us to write compact expressions such as:
sapply ( split ( M $ Major , f = list ( M $ AgeGroup , M $ Sex )) , mean , na . rm = TRUE )
The aggregate, ave, tapply, and by functions in R are very similar in spirit to S TATA’s
egen tool: they apply a function to a vector or dataframe on a group-by-group basis, with
groups being defined by a list of one or more factors.
The aggregate() function takes the form
aggregate (X , by = , FUN =)
in which X is a dataframe and by is an argument that is set to a list of grouping factors. (It
can be specified in the form by=list(F1,F2) for two factors, or, since a dataframe is itself a
list, by=data.frame(F1,F2) would do.) The output of aggregate is a dataframe. In the past,
the aggregate accepted only functions returning scalar values. Recently, however, it has
been revised to accept functions that return vectors, but it stores the output in a list in the
dataframe.
One very nice application of aggregate() is to single out the first or last observation in
groups of observations using the head or tail functions, helpfully storing the results in a
dataframe:
myFirstList <- aggregate ( mydata , by = myByVariables , FUN = head , n =1)
myLastList <- aggregate ( mydata , by = myByVariables , FUN = tail , n =1)
In this example, n=1 is the additional argument passed to the head and tail functions.
Unfortunately, because FUN can only reference functions that generate scalar output, you
cannot use this method to collect the first or last handfuls of observations.
The ave() function works not on dataframes but on subsets of numeric objects (vectors)
defined by factors. Its default is to construct averages of the subsets but other functions can
be supplied. The syntax is
Z <- ave (X , g1 , FUN = mean )
in which g1 is a factor of the same length as X and the output Z is a vector that is also of the
same length as X. The ave() function inserts in each entry Z[i] the mean of all X values
CHAPTER 5. APPLYING FUNCTIONS TO DATAFRAMES 50
with the same factor level as g1[i]. It can accept multiple factors as well, as in ave(X,
g1,g2,g3, FUN=mean) in which case the FUN= function is applied to all cases with the same
combination of levels.
The syntax for tapply is
tapply (X , INDEX = , FUN = ,... , simplify = FALSE )
in which X is a vector, and INDEX is a list of one or more factors that define the groups over
which the function named in FUN is to be applied. (The function can only have 1 argument,
which is a vector of the entries in X that are members of each group defined by INDEX.) The
grouping factors should of course be of the same length as X. (Again, since dataframes are
lists, the INDEX to tapply can be a dataframe.) Importantly, the function in question can
return either a scalar as output or more general forms of output. Hence, tapply is decidedly
more general than aggregate. By default, the output of tapply is a list. This may or may
not be convenient for the next step of data processing. Note, however, that if FUN returns a
scalar and if the simplify argument is set to TRUE, then tapply will return an array of the
same mode as that scalar.
The by() function operates similarly, but applies to dataframes rather than to vectors. Its
syntax is
by (X , INDICES = , FUN = ,...)
in which X is a dataframe and you will note that the second argument—again a list of factors
defining groups—is named INDICES rather than INDEX. The function can take only one
argument, but that argument is a dataframe. Via the ... argument you can pass additional
arguments to this function. For example
by ( DrugTrials , INDICES = DrugTrials [ " Sex " ] , FUN = summary )
In essence—we will need a little more information about lm(), the linear models function,
to understand this example—each sex sub-group is formed into its own smaller dataframe
(df), and the lm function is then applied to the sub-group, running a linear regression of
post on pre. The results are stored as a list in the Models object.
The plm package provides standard econometric models for panel data, and also contains
several useful data manipulation functions. In what follows we describe other tools for
common tasks in longitudinal settings.
in which the -length(x) bit drops the T-th (that is, the last) entry of x. We can apply the
Lag() function in one of the base-R tools described above for working with data subsets.
Instead, we take this opportunity to introduce the well-designed plyr package. To
illustrate its use, suppose that we have a Population variable in a dataframe df with N
countries identified by id and a year of observation variable year. Then a lagged population
variable can be created via
df <- df [ order ( df $ id , df $ year ) , ]
library ( plyr )
df <- ddply ( df , .( id ) , transform , LagPop = Lag ( Population ))
To create a variable that holds the mean values for each country (or whatever i-specific
unit is being examined),
ddply ( df , .( id ) , transform , grp . mean . values = mean ( value ))
Six
Extended, book-length treatments are needed to do justice to R’s full suite of graphics
capabilities, and indeed, a number of such books have been written. Part of the appeal
as well as the difficulty of R graphics is that there are no fewer than three main graphics
systems in common use: in vintage order, these are the base-R system, lattice graphics
(this is the main system used in displaying spatial data), and the ggplot2 system developed
by Wickham (2009), the last of which is increasingly coming into favor. All three systems
are extraordinarily powerful, and each has its own syntax and defaults. Here we can do
little more than indicate some of the basic uses to which these remarkable tools can be put.
Both Mittal (2011) and Murrell (2006) provide insights into R graphics, the former
focusing mainly on base-R routines and the latter going deeper into the foundations of grid
and lattice graphics, which provide a well-structured alternative to the base-R approach.
Adler (2010) is also much worth reading on both of these graphics systems, giving a
remarkably clear account of lattice graphics in particular.
All three systems have this in common: When the variable to be graphed is categorical
(even if the categories are ordered) it is important to have defined the variable as a factor.
When you want several graphs to be assembled within one overall graph image, each
representing a different data sub-group, these sub-groups will typically be identified by
the levels of a factor. The factor levels are also the means by which colors are linked to
the categories of categorical data—until this principle is grasped, novice users expecting
to link ordinary integer-valued variables to colors often find themselves utterly baffled.
(By contrast, color assignment for continuous variables is far easier to understand and
manipulate.) As we’ll discuss below, you can manually set the colors by constructing a
vector of color names that is the same length as the number of factor levels. You may first
want to double-check the ordering of the levels so that you can match up colors to levels in
just the way you’d like.
52
CHAPTER 6. R’S NON-SPATIAL GRAPHICS 53
good portion of what is here will have become outdated. But there should be enough in
this discussion to get you started. In my view, the best way to learn ggplot2 is to sit down
with a dataset and try out its capabilities.
Several web sites have sprung up to help users navigate the intricacies of ggplot2:
These resources and Wickham (2009) go much deeper than I can in the following notes,
which mainly aim to explain ggplot2 by way of examples.
To underscore a few basic points that can hardly be repeated often enough: Unlike
base-R graphics, the ggplot2 graphical machinery works only with dataframes and if there are
data sub-groups involved in a graph, the dataframe must be organized in the “long” form,
with sub-groups identified by one factor. It is easy to work with pre-processed, already
summarized data, so long as the processed version is stored in a little dataframe and you
tell ggplot2 that it does not need to carry out further processing. We will see how to do
that in what follows. Also, bear in mind that ggplot2 creates graphical objects, but does
not automatically print them to the screen. You will need to apply the print() function
to the graphical object to get the graph image to appear. The ggsave() function is an easy
way of saving the image to a file.
Here is an introductory example in which graphs are made from a dataframe named M
with country-level information on various economic indicators, region of the world, and so
on:
\ library { ggplot2 }
\ library { scales }
.
. ( steps omitted )
.
# This step loads dataframe M into a ggplot2 object named P0
P0 <- ggplot ( M )
# Summarize the ggplot object
summary ( P0 )
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -#
# Create graphs from P0 #
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -#
print ( p )
Note that several important functions were used: geom point() is used to specify points in
a scatterplot, geom bar() for bar and geom box() for box plots, and geom smooth for using
a regression method to summarize a scatterplot. In the first part of each plot definition, the
aesthetics function aes() was invoked to tell ggplot2 about what goes on the horizontal
and vertical axes, although these specifications can be adjusted later on in the individual
geom *() functions.
Adjusting bar width You can use rescale(), from the scales package so that the within-
category width of the bar is representative of some other variable (in this example, the
variable is named value.x).
library ( scales )
ggplot ( dd , aes ( x = variable , y = value .y , fill = Date )) +
geom _ bar ( aes ( width = rescale ( value .x , c (0.5 ,1))) ,
stat = " identity " , position = " stack " ) ’ +
coord _ flip ()
It is possible to supply geom boxplot and similar functions with pre-calculated values,
as in the following example code, which produces two boxes for groups A and B. The
important point to note is the stat = "identity" option:
DF <- data . frame ( x = c ( " A " ," B " ) ,
min = c (1 ,2) , low = c (2 ,3) , mid = c (3 ,4) , top = c (4 ,5) , max = c (5 ,6))
ggplot ( DF ,
aes ( x =x , ymin = min , lower = low , middle = mid ,
upper = top , ymax = max )) +
geom _ boxplot ( stat = " identity " )
Choosing colors The colors of lines and points can be set directly using colour="red",
replacing ”red” with any color name. The colors of filled objects, such as bars, can be
set using fill="red", and for these objects colour refers not to the inside of the object
but to the object’s border. If you want colors to vary with the values of variables (factors)
and also want to depart from ggplot2 default color choices, you can control the colors
mapped to your variables, using scale colour manual() for a discrete set of colors and
scale colour gradient() for a gradient. Here are some examples. If you have a factor
with three levels that you’d like to see displayed in green, blue, and orange, try this:
last _ plot () +
scale _ colour _ manual ( values = c ( ’ green ’ , ’ blue ’ , ’ orange ’ ))
+ scale _ colour _ gradient ( low = " #000000 " , high = " # C77CFF " )
Other properties can be changed via scale fill hue() and scale colour hue() to change
the luminance (via their l= argument) or chromaticity (c=):
# Use luminance =40 , instead of default 65
ggplot ( df , aes ( x = cond , y = yval , fill = cond )) + geom _ bar () +
scale _ fill _ hue ( l =40)
Lines and symbols For lines, use geom line() function. The basic line types are named
solid, dashed, dotted, dotdash, longdash, and twodash. If you want multiple lines to
appear in the graph, you will need to tell ggplot the name of the identifier variable so that
it knows which points to connect in any given line. This is done with a group = specifier, as
will be shown below. With two such lines, you would set the line type using code such as
the following:
scale _ linetype _ manual ( values = c ( " dotdash " , " dotted " ))
For plotting symbols, we’ve already seen that geom point() is the function to use. The
shape of these symbols can either be set or allowed to vary by mapping it to a factor inside
an aes() specification such as:
geom _ point ( aes ( shape = let ) , size =6)
with the variable let in this example being (of course) a factor. Note that by default, no
more than 6 shapes are allowed. The way around this (given a factor with, say, 7 levels),
is to add scale shape manual(values=1:7). This means that shape 1 will be mapped to
level 1 of the factor (let, in this example), shape 2 to level 2 and so on. There are additional
options:
• A breaks = argument can be used to create a factor on the fly from a continuous
variable, by setting ranges that correspond to what will be the factor levels;
• A labels = argument gives the labels to associate with the breaks; by default, they
are the breaks themselves.
these shapes, which have a transparent fill, are numbered 1 (circle), 0 (square), 5 (diamond),
2 (upward triangle), and 6 (downward triangle). If you want to use hollow shapes, without
manually declaring each shape, you can use scale shape(solid=FALSE). Note that when
a hollow shape is plotted on top of a line, the line shows through. To avoid this, you can
use shapes 21–25 and specify a white fill.:
# Hollow shapes
ggplot ( df , aes ( x = xval , y = yval , group = cond )) +
geom _ line ( aes ( linetype = cond ) , # Line type depends on cond
size = 1.5) + # Thicker line
geom _ point ( aes ( shape = cond ) , # Shape depends on cond
size = 4) + # Large points
scale _ shape ( solid = FALSE )
Adding text To add just a bit of text to a plot, use annotate() to give the text, followed
by a coordinate to indicate where the text should do, as in this example:
annotate ( " text " , x =41 , y =4.5 ,
label = paste ( " Average = " , round ( mean ( age $ age ) ,1)) , size =12)
The geom text() function is used to add text that is defined in a variable in the dataframe.
One common application is to add labels (or numbers) above the bars of a bar graph. The
following example shows how to do that when the bars are “dodged”.
p <- ggplot ( diamonds , aes ( clarity , fill = cut )) +
geom _ bar ( position = " dodge " )
lab _ dat <- unique ( diamonds [ , c ( " cut " ," clarity " )])
lab _ dat $ y <- 4000
lab _ dat $ lab <- with ( lab _ dat , paste ( cut , clarity , sep = " -" ))
With some extra effort, the vertical position of the text can be adjusted to the bar height.
The following puts the count (the bar height) just above the bar as a label:
+ geom _ text ( aes ( label = count , x = type , y = count ) ,
position = position _ dodge ( width = 0.8) , vjust = -.6)
CHAPTER 6. R’S NON-SPATIAL GRAPHICS 58
Legends When the legend doesn’t quite work out as you expected, the safety valve is to
use a function of the form scale aesthetic type().
Adjusting angle of axis labels, turning off axis titles or tick marks:
p <- p + opts ( axis . text . x = theme _ text ( angle =45) ,
axis . ticks = theme _ blank () , axis . title . y = theme _ blank ())
Faceting Another feature to bear in mind has to do with ggplot’s handling of subsets of
the data for which you may want group-specific graphs (arranged, for instance, in what are
termed facets). The groups are identified by factor levels, and it is assumed that in the
dataframe, one factor contains all levels needed to uniquely identify each of the groups.
You can usually get around this requirement by using the interaction() function, which
essentially creates such a factor on the fly, but there are times when that approach fails and
you need to construct the single identifying factor on your own. The following code uses
the interaction() function to define a grouping (based on the factors Urban and Sex) for
a plot with several lines. The facet grid function applies labels to the separate graphs.
p <- ggplot ( data =Z ,
aes ( x = AgeGroup , y = V1 , group = interaction ( Urban , Sex )))
p1 <- p + geom _ line ( aes ( colour = Sex ) , size =1.5) +
facet _ grid ( Urban ~ .)
p1 <- p1 + labs ( y = " Percentage Migrating in Last 5 Years " ,
x = " Age at Census " )
p1 <- p1 + theme _ bw ()
Saving graph images There are two methods for saving images to files. The first of them,
f0 <- paste ( f0 , " png " , sep = " . " )
png ( file = paste ( Figures , f0 , sep = " / " ) , width =850 , height =600)
print ( p1 )
dev . off ()
is identical to what would be done in base-R graphics. To write graphical output to a file,
you decide upon the output format and invoke the graphics driver to prepare to write in
that format to the file you’ve specified. The graphics commands to create the graph then
follow, and you complete the process with dev.off() to close the connection to the file. For
png format (Portable Network Graphics), the series of commands looks like
png ( file = , res = , height = , width =)
...[ graph commands ]
dev . off ()
in which the file name is (typically) a character vector composed of a path and a file name
(with extension .png), res is the resolution in dots per inch, and the height and width are
specified in pixels. Hence, if you set res=600, asking for 600 dots per inch, and then set
height=2400, the implied height is 4 inches. When you change the resolution, you must
therefore adjust the height and width parameters accordingly. For pdf output the setup
CHAPTER 6. R’S NON-SPATIAL GRAPHICS 59
is similar, except that height and width are expressed in inches and there is no resolution
parameter.
An easier method to remember is provided by ggplot’s ggsave() function:
ggsave (p , file = ’ no _ ticks . png ’ , width =6 , height =4)
If you pass palette a vector of color names, it will use those instead. To restore the default
values of the function, use palette("default"). There are a number of additional built-in
palettes, some of the most useful being defined as functions to which you supply an integer
argument that determines how many colors are generated. The heat.colors() function is
one popular example: via heat.colors(5) you request a sequence of 5 colors in a series
running from red to white, which you could then assign to depict (say) an ordered factor
with 5 levels. Other functions of this general type include rainbow(), terrain.colors(),
and topo.colors().
There are sophisticated R functions that allow you to exert further influence over the
sequence of colors that is chosen. For instance, the code
my . colors <- colorRampPalette ( c ( " white " , " orange " , " red " ) ,
space = " Lab " )
will use the colorRampPalette() function to create a new function named my.colors that
defines a series of colors ranging from white through orange and then on to red. This would
be the kind of function to use with numeric or ordered categorical data. We invoke the
my.colors() function as follows
my . colors (20)
In http://research.stowers-institute.org/efg/Report/UsingColorInR.pdf, Earl
Glynn gives a comprehensive account of how colors are defined and used in R graphics.
CHAPTER 6. R’S NON-SPATIAL GRAPHICS 60
Without this background, it would be quite difficult to understand the R code that generates
some of the more complex graphs you are likely to come across. We highly recommend
his presentation. There are several guides on the web that list R’s built-in colors, but this
one accompanies Earl Glynn’s presentation: http://research.stowers-institute.org/
efg/R/Color/Chart/. Those of you who would never choose to use Blue if DodgerBlue is
available, will find satisfaction here. The site also explains how to mix your own custom
colors in the equivalent of an electronic paint store.
Bar plots Bar plots are generally used to display the relative frequencies of the categories
of a vector, which usually originates as a factor. In general, each bar is labelled with the
name of the corresponding entry of the vector, which is usually the level of the original
factor. Hence, if you want the bars to appear in a particular order, you must ensure that
the factor levels are specified in that order. In the base-R implementation of bar plots,
which uses the function barplot(), you must supply the function with a vector (or matrix)
that summarizes the data in terms of counts, proportions, or percentages. Often these
summaries are produced by the table() function with the factor as its argument, or by one
of the *apply functions. The barplot() function can also be handed a matrix.
There is no numeric x-axis as such in a bar plot, since the bars represent nominal
categories (or at most, ordered categories) of a variable and the distance between bars has
no substantive meaning.1 Even so, there are times when you would like a labelled x-axis to
appear. To override R’s wishes and both draw and label an x-axis, what we must do is to
save the result of the barplot call, which contains the centers of the bars.
For instance, to add text over each bar that indicates the value that the bar represents,
we can mimic the following excerpt from a program that analyzes city population data with
T being a vector of the percentages of cities of different types in the dataset.
B <- barplot (T , col = " cornflowerblue " , border = " blue " ,
ylim = c (0 ,50) ,
xlab = " Boundary Concepts in City - Specific Time Series " ,
ylab = " Percentage of Cities " )
text (B , T +2 , labels = as . character ( round (T , digits =1)) , col = " blue " )
In B the horizontal mid-points of each bar are stored as a vector. The text() function
then adds text (the values of T, rounded to 1 digit after the decimal and converted to
character) with the horizontal location being the bar mid-point and the vertical location
being the entry of T plus 2 units (percentage points, in this case) to set the text just above
the bar. The example also illustrates several other features: we extend the y-axis to 50,
add labels for both the y-axis and what would be the x-axis, fill each bar with one of R’s
1 S TATA takes the same view of bar charts, and like R resists applying numeric labels to the x-axis.
CHAPTER 6. R’S NON-SPATIAL GRAPHICS 61
built-in colors (col="cornflowerblue") and set the outline of each bar to a different shade
(border="blue").
Another illustration uses a sequence of 8 colored bars whose colors are those of the
default 8-color palette.
x <- rep (1 ,8) # Produces 1 1 1 ... 1
# Color the bars using the 8 colors of the default palette
B <- barplot (x , axes = FALSE , col =1:8) # the bars are all of height 1
# B contains the x - value of the center of each bar in the plot
axis (1 , at =B , labels = as . character (1:8)) # Add a labelled x - axis
The axes=FALSE argument in barplot asks that the axes not be drawn; the next line in the
code draws them. Had we instead used axis(1,at=B,labels=palette()) in the last line,
we would have labelled the bars with the names of the colors.
Here is another artificial example in which 200 bars express the result of 200 independent
draws from the distribution of a Poisson random variable. In this case barplot arranges
the 200 bars according to the order in which they were drawn, the height of each bar being
the value of the draw. We can label the x-axis wherever we choose, illustrating here with
ticks at 50, 100, and 150 to denote the 50th, 100th, and 150th draws.
x <- rpois (200 , 20)
centres <- barplot ( x )
axis (1 , at = centres [ c (50 , 100 , 150)] , labels = c (50 , 100 , 150))
The barplot function will happily draw whatever vector you supply to it. To depict the
frequency of values in an integer or character vector X, you would usually define XPlot <-
table(X) and pass XPlot.
requests (in Windows) a bold Times New Roman font. For a pdf graph,
pdf ( family = " Times " )
[1] Joseph Adler. R in a Nutshell. Sebastopol, CA: O’Reilly Media, Inc., 2010.
[2] Hrishi V. Mittal. R Graphs Cookbook. Birmingham, UK: Packt Publishing, 2011.
[3] Robert A. Muenchen and Joseph Hilbe. R for Stata Users. Springer, 2010.
[4] Paul Murrell. R Graphics. Boca Raton, FL: Chapman and Hall/CRC, 2006.
[5] Paul Teetor. R Cookbook. Sebastopol, CA: O’Reilly Media, Inc., 2011.
[6] Hadley Wickham. ggplot2: Elegant Graphics for Data Analysis. Use R! Baltimore, MD:
Springer, 2009.
62