Textbook
ECO 329, Fall 2024
Jason Abrevaya
Contents
1 The basics of R
1.1 Installing R
1.2 Arithmetic operations and mathematical functions
1.3 Variables and data types
1.4 Vectors
1.5 Output
1.6 Programming
1.7 Writing functions
1.8 Data frames and file input
1.9 Missing values
1.10 R packages
Exercises
1 The basics of R
1.1 Installing R
R is a statistical programming language for data analysis and visualization that is widely used by economists and data
scientists. This book uses R to illustrate statistical concepts and to implement analytical methods. For the best
experience working with R, readers should also install the software package RStudio. While RStudio is, strictly
speaking, not required to run R, it provides a user-friendly graphical interface that makes R much easier to use.
RStudio provides an advanced editor with features that include syntax highlighting, code completion, and debugging.
It also has several tools to streamline data analysis, including a workspace viewer and a plotting window.
Both R and RStudio are available for standard operating systems (Windows, macOS, Linux) and can be downloaded
for free at https://cran.r-project.org and https://posit.co/downloads, respectively. Be sure to
download and install R prior to downloading and installing RStudio.
RStudio has four main “panes” as part of its interface, as seen in Figure 1.1:
1. Source (top left): This pane allows the user to create, edit, and save R scripts. Datasets can also be browsed in this
pane. This pane only appears when a script or dataset has been opened.
2. Console (bottom left): This pane allows the user to enter commands directly in R and see the output.
3. Environment/History (top right): This pane displays the current workspace, including datasets, variables, and
functions that have been introduced. This pane also provides access to previous commands and output.
4. Plots/Help/Packages/Files (bottom right): This pane displays plots, provides access to R documentation, allows the
installation and management of R packages, and allows direct access to file directories.
Within RStudio, there are two main options for writing R commands:
• The console: This “command line” interface appears when RStudio is opened. Commands can be typed directly
into the console, and R executes them immediately. Using the console is a good option for simple calculations or
experimentation with R functions.
• R scripts: An R script is a text file containing a series of R commands. A script file can be edited and saved, making
it useful for more complex data analysis and programming tasks. With RStudio, it is straightforward to run an entire
script or a selected section of a script.
The two options have their advantages and disadvantages. The console is quick and easy to use, but it can be difficult
to keep track of all the commands that have been issued. An R script, on the other hand, makes it easier to organize
code, make code more readable, and save code. But, since many commands from an R script are executed at once, it
may take more time to figure out why code is not performing as expected.
Throughout the book, we use the console option for R commands and output to demonstrate how commands work.
However, all R code is accessible as script files on the companion website https://www.probstats4econ.com,
organized by chapter and section. These files enable readers to run large chunks of code conveniently. Moreover,
readers can modify scripts to see the impact of code edits on output, add additional analysis, or cut and paste from
script file(s) to create new scripts.
Figure 1.1: The four panes in RStudio

1.2 Arithmetic operations and mathematical functions
R can serve as a calculator, with the standard arithmetic operators + (addition), - (subtraction), * (multiplication), / (division), and ^ (exponentiation).
# examples of arithmetic operations
5+3
## [1] 8
5*3
## [1] 15
5/3
## [1] 1.666667
5^3
## [1] 125
In this block of code, the first line is a “comment” that is not executed by R. When writing R code, we can start any
line with the number-sign character (#) to indicate that the line is a descriptive comment rather than a command to
be evaluated. The first command that gets evaluated is 5+3, and R returns the answer 8. Any printed R output in this
book is on a line starting with two number-sign characters (##). How about the [1] that appears right after ##? In
cases where a single value is returned, the [1] appears before the output. Later in this chapter, we consider cases in
which multiple values are returned simultaneously (as part of a “vector”), and when there are multiple lines of output,
the number within the [] brackets for a particular line indicates which element is being shown first.
R follows the standard mathematical order of operations, which from highest to lowest priority is:
1. Parentheses
2. Exponentiation
3. Multiplication and division (performed from left to right)
4. Addition and subtraction (performed from left to right)
(3+4)^2
## [1] 49
3+4^2
## [1] 19
3+2*4^2
## [1] 35
3+2*4+2
## [1] 13
(3+2)*(4+2)
## [1] 30
For the first expression, the addition within the parentheses occurs before the exponentiation. For the second
expression, the exponentiation occurs before the addition. For the third expression, the exponentiation occurs first,
followed by the multiplication and then the addition. For the fourth expression, the multiplication occurs before the
two additions. For the fifth expression, the expressions within both parentheses are evaluated before being multiplied together.
In addition to arithmetic operators, R has many mathematical functions that facilitate calculations and data analysis.
Here are some examples of commonly used mathematical functions:
• abs(x): Calculates the absolute value |x|.
• sqrt(x): Calculates the square root √x.
• exp(x): Calculates the exponential value e^x. exp(1) returns Euler’s constant e ≈ 2.718282 since e^1 = e.
• log(x): Calculates the natural logarithm ln(x).
• factorial(x): Calculates the factorial x! = x(x – 1)(x – 2) · · · (3)(2)(1).
The following output shows examples using abs(x) and sqrt(x).
5*abs(-3)
## [1] 15
sqrt(17)/2
## [1] 2.061553
R functions, like the mathematical functions above, can have a required argument or multiple required arguments,
along with optional arguments that can be specified during function calls. To determine these arguments for any
function, R documentation or “help” can be requested by typing a question mark (?) followed by the function name. For
example, the command ?sqrt requests the documentation for the sqrt function, which then appears in the bottom-
right window of the RStudio interface. The documentation indicates that the “usage” for the function is sqrt(x),
meaning it has a single required argument x and no optional arguments. In the example above, the argument is a single
number (17), though the documentation specifies that the argument x can be a “numeric or complex vector or array.”
The concept of vectors is described in Section 1.4.
The log function has an optional argument, which can be confirmed by requesting documentation with ?log.
The documentation indicates the “usage” for the function is log(x, base = exp(1)), meaning it has a required
argument x and an optional argument base. If the optional argument is omitted, then its default value is exp(1),
resulting in the natural logarithm (base e). To calculate a base 10 logarithm, there are two ways to call the log function
with the optional argument.
log(100,base=10)
## [1] 2
log(100,10)
## [1] 2
log(100)
## [1] 4.60517
The first and second commands are equivalent, with the first explicitly using the name of the base argument and the
second relying on the “usage” having the base as the second argument. The former approach is generally preferred, as
a function may have multiple optional arguments, and specifying the name of the optional argument avoids confusion
and mistakes. The third command uses no optional argument, so the natural logarithm (base e) of 100 is calculated.
Two additional examples of mathematical functions with optional arguments are round and signif:
• round(x, digits = 0): Rounds a number x to a specified number of digits. If the argument digits is
not specified, the default value is 0, in which case the function rounds to the nearest integer.
• signif(x, digits = 6): Rounds a number x to a specified number of significant digits, given by the
argument digits (whose default value is 6).
round(50/3)
## [1] 17
round(exp(1),digits=4)
## [1] 2.7183
signif(50/3,digits=5)
## [1] 16.667
1.3 Variables and data types
A variable stores a value that can be referenced and updated in later commands. The assignment operator <- assigns a value to a variable, as the following examples illustrate.
x <- 8
x
## [1] 8
x+5
## [1] 13
x <- 2*x
x
## [1] 16
frog
## Error in eval(expr, envir, enclos): object ’frog’ not found
The first command assigns the value 8 to the variable x. This command has no output associated with it. The next
command, simply x, does provide output, corresponding to the value 8 stored in the variable x. The variable can then
be used in other expressions. The third command outputs the value of x+5, which is 13. Importantly, this command
does not change the value of the variable x, which is still 8. The fourth command does change the value of the variable
x with another assignment operator <-. Specifically, the new value assigned to x is two times the old value of x, or
16. The last command shows that an error message is returned when we refer to a variable name, in this case frog,
that has not been assigned a value.
Variables can store different types of data. The variable x above has a numeric value, but variables can also contain
text strings, logical values (indicating true or false), and other types of data. The data type of a variable can be
determined by the class() function.
y <- 3.4
class(y)
## [1] "numeric"
str <- "Economic statistics"
class(str)
## [1] "character"
R has several basic data types, including the following:
• numeric: a data type for numbers (e.g., 3.4 or 8)
• character: a data type for text strings (e.g., "Economic statistics")
• logical: a data type for the logical values TRUE and FALSE
• factor: a data type for a categorical variable with a fixed set of possible values (e.g., a variable for eye color
with the three possible values Blue, Brown, and Other or a variable for restaurant ratings with the four possible
values Excellent, Good, Fair, and Poor)
If a variable is no longer needed, we can delete it with the rm function. For example, if the variable x has been assigned
a value, the command rm(x) removes the variable from the R working environment. The command rm(list =
ls()) removes all variables from the R working environment, though it should be used with caution since R does
not ask for confirmation when the rm function is called.
Logical values arise from comparisons. R’s comparison operators are == (equal to), != (not equal to), < (less than), <= (less than or equal to), > (greater than), and >= (greater than or equal to). Note that equality is tested with two equal signs (==), not one.
8 == 3+5
## [1] TRUE
x <- 16
x > 12
## [1] TRUE
xsmall <- (x<=9)
xsmall
## [1] FALSE
The first command returns TRUE since 8 is exactly equal to 3+5. The second command assigns the value 16 to the
variable x, and the third command returns TRUE since x is strictly greater than 12. The fourth command assigns the
value of x<=9, which is FALSE since x is not less than or equal to 9, to the variable xsmall. As a result, the variable
xsmall has a logical data type, and the last command outputs the value of xsmall.
The logical operators “and” and “or” are represented by the symbols & and |, respectively, and can be used to
combine multiple logical values and return a new logical value based on the combination. The “and” operator (&)
returns TRUE if both of the logical values being considered are TRUE and FALSE otherwise. The “or” operator (|)
returns TRUE if at least one of the logical values being considered is TRUE and FALSE otherwise.
x <- 16
(x>12) & (x<=9)
## [1] FALSE
(x>12) | (x<=9)
## [1] TRUE
The “not” operator (!) can be applied to a logical value to return the opposite value. If v is a logical variable, the
expression !v is FALSE if v is TRUE and TRUE if v is FALSE.
x <- 16
(x>12)
## [1] TRUE
!(x>12)
## [1] FALSE
Parentheses can be used for more complex logical expressions and to control the order of operations.
x <- 16
y <- 30
((y>2*x)|(y<3*x)) & (abs(x-y)!=10)
## [1] TRUE
In this example, ((y>2*x)|(y<3*x)) is TRUE since (y<3*x) is TRUE. The absolute value of the difference
between x and y is not equal to 10, so the expression (abs(x-y)!=10) is also TRUE, meaning the overall
expression obtained by applying the “and” (&) operator is also TRUE.
A useful feature of logical values is that we can perform mathematical operations on them. When a logical value is
included in a mathematical expression in R, a TRUE value is treated as a one and a FALSE value is treated as a zero.
The following R code provides a few simple examples.
TRUE+FALSE
## [1] 1
x <- 4.3
y <- 8.7
1*(x<4)
## [1] 0
(x>4)*(x<7)
## [1] 1
(y>4)*(y<7)
## [1] 0
The first expression, TRUE+FALSE, evaluates as 1+0. After setting the values of the x and y variables, the
1*(x<4) expression returns 0 since (x<4) is false, (x>4)*(x<7) returns 1 since (x>4) and (x<7) are both
true, and (y>4)*(y<7) returns 0 since (y>4) is true and (y<7) is false.
As will be seen in Section 1.6, logical data types also provide a convenient way to control the flow of a program or
to make decisions based on whether certain conditions hold.
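Turning to strings, a character value is assigned to a variable in the same way as a numeric value. The short example below matches the description that follows.
str <- "intro to statistics"
str
## [1] "intro to statistics"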
The first command assigns the (string) value "intro to statistics" to a variable named str, and the
second command outputs the value of the variable str.
There are several useful functions for manipulating strings, including the following:
• nchar(x): Returns the number of characters in the string x.
• toupper(x) and tolower(x): Convert all the characters in the string x to uppercase or lowercase,
respectively.
• substr(x, start, stop): Extracts a portion of the string x, called a “substring,” that starts at the character
indicated by the start parameter and ends at the character indicated by the stop parameter.
• paste(..., sep = " "): Takes one or more values (strings, numbers, etc.) as arguments and pastes them
together as a single string. The optional parameter sep, whose default value is a space (" "), is inserted between
each of the strings being pasted together.
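The following examples apply these string functions; the specific values assigned to x and y here are illustrative.
str <- "intro to statistics"
nchar(str)
## [1] 19
toupper(str)
## [1] "INTRO TO STATISTICS"
substr(str,3,11)
## [1] "tro to st"
x <- 3
y <- 7
paste("x is",x,"and y is",y)
## [1] "x is 3 and y is 7"
paste("x",x,"y",y,sep="")
## [1] "x3y7"
str2 <- paste(str,str)
str2
## [1] "intro to statistics intro to statistics"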
This example creates the str variable as a string, as before with a character data type. The outputs of the nchar
and toupper functions are self-explanatory. For the substr(str,3,11) command, the output is the substring
that starts at the third character of str, which is the first t, and ends at the 11th character of str, which is the
first t in statistics. The remaining commands illustrate how the paste function can be used. After the x and y
variables are assigned values, the first paste function results in a string that is output with the default space (" ")
separator, and the second paste function results in a string that is output with no separator between the arguments.
The command str2 <- paste(str,str) pastes two copies of str together, with the default space separator.
Sometimes it is useful to know whether or not a particular substring is contained within a string. The grepl
function provides one way to do this:
• grepl(pattern, x): Returns TRUE if the string pattern is contained within the string x and FALSE
otherwise. The function grepl has several optional arguments, including for instance ignore.case, which
indicates whether the case of the letters (upper versus lower) should be ignored and whose default value is FALSE.
The interested reader can use the R documentation, by typing ?grepl, to get more information about grepl and
related functions.
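The following example illustrates grepl; the pattern in the last command, which uses the regular-expression "or" operator |, is one natural way to check for three straight responses of any answer.
answers <- "ACBBAD"
grepl("AA",answers)
## [1] FALSE
grepl("BB",answers)
## [1] TRUE
grepl("AAA",answers)
## [1] FALSE
grepl("AAA|BBB|CCC|DDD",answers)
## [1] FALSE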
In this example, answers is assigned the value "ACBBAD", corresponding to a student’s answers to six multiple-
choice exam questions, each of which has the possible answers A, B, C, and D. The first grepl command asks whether
the substring "AA" (two straight A responses) is within answers. The second grepl command asks whether two
straight B responses are within answers. The third grepl command asks whether three straight A responses are
within answers. And the fourth grepl command asks whether there is a sequence of three straight responses of
any of A, B, C, or D.
1.4 Vectors
A vector is a collection of elements of the same data type. For example, a vector can be a collection of numerical
values, a collection of logical values, or a collection of strings. Vectors can be created and manipulated in various
ways.
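The c (combine) function is the most basic way to create a vector. In the sketch below, the specific element values are illustrative.
numvec <- c(8,12,5,10,3)
numvec
## [1]  8 12  5 10  3
c(numvec,18)
## [1]  8 12  5 10  3 18
answers <- c("A","C","B","B","A","D")
tfvec <- c(TRUE,FALSE,TRUE,TRUE)
c(8,12,"A")
## [1] "8"  "12" "A"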
The first command creates a vector with five numeric elements and assigns it to the variable numvec. In the third
command, the c(numvec,18) constructs a vector consisting of the original numvec vector with an additional
numeric value (18) added to the end of the vector. The assignment of the answers variable shows that a vector can
consist of strings, each of which happen to be one-character strings here. The assignment of the tfvec variable shows
that a vector can consist of logical values (TRUE or FALSE). The last command, c(8,12,"A") illustrates that the
elements of a vector must all be of the same type. Notice that, unlike numvec, the 8 and 12 values are contained
within quotation marks, meaning "8" and "12" are strings. When R sees that the string "A" is part of the vector
being created, it forces the other values (which were numeric in this case) to be strings.
There are other useful functions for creating vectors. For instance, we can create a numeric vector containing a
sequence of numbers using either the seq function or the : operator:
• seq(from, to, by = 1): Returns a numeric vector consisting of a sequence of numbers that starts at the
value of the argument from, with each successive element of the sequence obtained by adding the increment by
(an optional argument whose default value is 1) and the argument to indicating when the sequence should end. For
an increasing sequence of numbers (positive by value), the last number in the sequence will be less than or equal to
the argument to. For a decreasing sequence of numbers (negative by value), the last number in the sequence will
be greater than or equal to the argument to. (When from is larger than to, the default value of by is -1.)
• The : operator: Returns a numeric vector consisting of a sequence of numbers between the two operands, where
the increment is equal to one in absolute value. Specifically, the command from:to is shorthand for seq(from, to, by =
1) if from is less than to and shorthand for seq(from, to, by = -1) if from is greater than to.
The following examples illustrate the use of the seq function and the : operator.
seq(1,10)
## [1] 1 2 3 4 5 6 7 8 9 10
seq(1,10,2)
## [1] 1 3 5 7 9
seq(5,4,-0.1)
## [1] 5.0 4.9 4.8 4.7 4.6 4.5 4.4 4.3 4.2 4.1 4.0
1:6
## [1] 1 2 3 4 5 6
1.2:6
## [1] 1.2 2.2 3.2 4.2 5.2
6:1
## [1] 6 5 4 3 2 1
vec <- c(1:6,seq(10,20,2))
vec
## [1] 1 2 3 4 5 6 10 12 14 16 18 20
For the seq(1,10,2) command, the increment for the sequence is 2, and the last element of the sequence is
9 since the next element of the sequence (11) would be larger than the end of the sequence (10) that was specified.
Similarly, for the 1.2:6 command, where the increment is automatically equal to one, the last element of the sequence
is 5.2 since the next element would be larger than the end of the sequence (6) specified. The creation of the vec
variable illustrates that sequences can be stored in a vector variable and can also be combined through the use of the
c function. In this case, the vec variable has been assigned to a vector that consists of the integers between 1 and 6
(inclusive) and then the even integers between 10 and 20 (inclusive).
The rep function is a convenient way to create or “initialize” a vector that, unlike the seq function, can be used
for any type of vector, including numeric, logical, and string.
• rep(x, times = 1): Returns a vector created by repeating the value x a certain number of times, specified
by the optional argument times (whose default value is 1).
rep(0,5)
## [1] 0 0 0 0 0
rep(1,5)
## [1] 1 1 1 1 1
rep(FALSE,8)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
An individual element of a vector can be accessed by specifying its position, or “index,” within square brackets after
the vector’s name (e.g., x[3] refers to the third element of the vector x). To access multiple elements of the vector,
we specify a vector of “indices” within the square brackets, as illustrated below.
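In the examples below, numvec is a ten-element numeric vector whose values are chosen to be consistent with the output shown later in this section, and the command for the last three elements is one natural possibility.
numvec <- c(8,12,5,10,3,18,7,10,2,8)
numvec[3]
## [1] 5
numvec[2:5]
## [1] 12  5 10  3
numvec[(length(numvec)-2):length(numvec)]
## [1] 10  2  8
numvec[c(4,1,9)]
## [1] 10  8  2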
In this example, the variable numvec is assigned to be a vector containing ten numeric values. The numvec[3]
command returns the third element of numvec. The numvec[2:5] command returns a vector consisting of the
second through fifth elements (inclusive) of numvec. The next command uses the length function to assist in
returning a vector with the last three elements of numvec. The numvec[c(4,1,9)] command illustrates that the
chosen elements need not be consecutive, as this command returns a vector with the fourth element, first element, and
ninth element of numvec.
There are several useful functions that take a vector as their argument, including the following:
• min(x) and max(x): Return the smallest and largest elements, respectively, of the vector x.
• sort(x): Returns a vector with the elements of the vector x sorted in increasing order.
• unique(x): Returns a vector containing the distinct elements of the vector x.
• sum(x): Returns the sum of the elements of the vector x. (If x is a logical vector, TRUE and FALSE are treated
as the values 1 and 0, respectively.)
• mean(x): Returns the arithmetic mean of the elements of the vector x, equal to the sum of the elements divided
by the number of elements. (If x is a logical vector, TRUE and FALSE are treated as the values 1 and 0, respectively.)
• cumsum(x): Returns a vector containing the cumulative or “running” sum of the elements of the vector x. The
first element is first element of x, the second element is the sum of the first two elements of x, the third element
is the sum of the first three elements of x, and so on. (If x is a logical vector, TRUE and FALSE are treated as the
values 1 and 0, respectively.)
The functions min, max, sort, and unique can be used for numeric, logical, and string vectors, whereas the
functions sum, mean, and cumsum are not meant for string vectors.
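Applying these functions to the numvec vector defined above gives the following results.
min(numvec)
## [1] 2
max(numvec)
## [1] 18
sort(numvec)
## [1]  2  3  5  7  8  8 10 10 12 18
unique(numvec)
## [1]  8 12  5 10  3 18  7  2
sum(numvec)
## [1] 83
mean(numvec)
## [1] 8.3
cumsum(numvec)
## [1]  8 20 25 35 38 56 63 73 75 83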
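Arithmetic operations, mathematical functions, and logical comparisons are applied to vectors on an element-by-element basis, as the following examples illustrate.
numvec + 1
## [1]  9 13  6 11  4 19  8 11  3  9
numvec + 1:10
## [1]  9 14  8 14  8 24 14 18 11 18
sqrt(numvec)
## [1] 2.828427 3.464102 2.236068 3.162278 1.732051 4.242641 2.645751 3.162278
## [9] 1.414214 2.828427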
sqrt(numvec)[2:4]
## [1] 3.464102 2.236068 3.162278
numvec >= 9
## [1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE
prices <- c(1.24,3.12,0.78,2.22,4.57,2.89,4.08,1.83,3.78,2.66)
numvec*prices
## [1] 9.92 37.44 3.90 22.20 13.71 52.02 28.56 18.30 7.56 21.28
revenue <- sum(numvec*prices)
revenue
## [1] 214.89
The numvec variable is a vector with ten numerical values that was used in a previous example. The result of the
command numvec + 1 is a vector which takes each element of numvec and adds the value 1 to it. In contrast,
for the command numvec + 1:10, where both operands are ten-element vectors, the vector that is returned is an
element-by-element sum of the two operands. The first element is the sum of the first element of numvec (8) and
the first element of 1:10 (1), the second element is the sum of the second element of numvec (12) and the second
element of 1:10 (2), and so on. A mathematical function like sqrt returns a vector with the function applied to each
element of the vector argument. In this case, we can confirm that the resulting vector has the square root of 8 as its first
element, the square root of 12 as its second element, and so on. The command sqrt(numvec)[2:4] demonstrates
the generality of indexing vectors; since sqrt(numvec) is itself a vector, with the same length as the original
numvec, the [2:4] indexing returns the second through fourth elements of the square root applied to the numvec
vector. The numvec >= 9 command illustrates how logical operators are also applied on an element-by-element
basis. The resulting vector is a logical vector, where the first element indicates whether the first element of numvec
(8) is greater than 9, the second element indicates whether the second element of numvec (12) is greater than 9, and
so on. The last set of commands, involving the prices and revenue variables, illustrates how vector operations can
be used in a simple economic example. If the variable numvec represents the quantities of ten different goods that
are purchased at a certain store and the variable prices represents the corresponding prices of these ten goods, the
command numvec*prices returns a vector with ten elements, each corresponding to the revenue associated with
a given good. Then, the variable revenue, obtained with the sum function, provides the total revenue for all of the
goods.
Here are some examples of element-by-element operations and functions for string and logical vectors.
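In the sketch below, the specific vectors strvec, logicvec1, and logicvec2 are illustrative.
strvec <- c("ab","cd","ab")
toupper(strvec)
## [1] "AB" "CD" "AB"
paste(strvec,".",sep="")
## [1] "ab." "cd." "ab."
strvec=="ab"
## [1]  TRUE FALSE  TRUE
logicvec1 <- c(TRUE,TRUE,FALSE,FALSE)
logicvec2 <- c(TRUE,FALSE,TRUE,FALSE)
logicvec1|logicvec2
## [1]  TRUE  TRUE  TRUE FALSE
logicvec1&logicvec2
## [1]  TRUE FALSE FALSE FALSE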
For the string vector strvec, toupper(strvec) capitalizes each element, paste(strvec,".",sep="")
appends a period (.) to the end of each element, and strvec=="ab" does a logical comparison of each element
with the string "ab". For the logical vectors, logicvec1 and logicvec2 are created as logical vectors with four
elements. The commands logicvec1|logicvec2 and logicvec1&logicvec2 apply the “or” (|) and the
“and” (&) operators, respectively, to the two vectors on an element-by-element basis.
• all(x): Returns TRUE if all of the values in the logical vector x are TRUE, and FALSE otherwise.
• any(x): Returns TRUE if any of the values in the logical vector x are TRUE, and FALSE otherwise.
• which(x): Returns a vector of indices corresponding to TRUE elements of the logical vector x.
• ifelse(test, yes, no): Returns a vector based upon a logical condition, given by the argument test,
where the value given by the yes argument is used if the condition is TRUE and the value given by the no argument
is used if the condition is FALSE.
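The following examples apply these functions to the numvec and prices vectors.
all(numvec >= 9)
## [1] FALSE
any(numvec >= 9)
## [1] TRUE
ifelse(prices > 3.50,"high price","low price")
## [1] "low price"  "low price"  "low price"  "low price"  "high price"
## [6] "low price"  "high price" "low price"  "high price" "low price"
ifelse(prices > 3.50,3.50,prices)
## [1] 1.24 3.12 0.78 2.22 3.50 2.89 3.50 1.83 3.50 2.66
which(prices > 3.50)
## [1] 5 7 9
high_prices <- prices[which(prices > 3.50)]
high_prices
## [1] 4.57 4.08 3.78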
Assume again that numvec represents the quantities of ten different goods purchased at a store and the variable
prices represents the corresponding prices of these ten goods. The all(numvec >= 9) command returns
FALSE since the quantities are not all greater than or equal to 9, and the any(numvec >= 9) command returns
TRUE since at least one of the quantities is greater than or equal to 9. The first ifelse command returns a string
vector, with the value "high price" corresponding to any element of prices greater than 3.50 and the value
"low price" corresponding to any element of prices that is not greater than 3.50. The second ifelse
does something slightly different, creating a numeric vector where each element is either exactly 3.50 (when the
corresponding element of prices is greater than 3.50) or equal to the corresponding element of prices (when
this element is not greater than 3.50). The which(prices > 3.50) command returns a vector of the indices
corresponding to the elements of prices that are greater than 3.50. There are three such elements, corresponding
to the indices 5, 7, and 9 of the prices vector. The usefulness of the which function is illustrated in
the subsequent command, where which(prices > 3.50) is itself used within the square brackets [] to select
elements of the prices vector. Specifically, the new variable high_prices is a vector that consists of only the
elements of prices for which the condition prices > 3.50 is true.
A particularly useful function to use for a logical vector is the sum function. When x is a logical vector, sum(x)
treats TRUE values as ones and FALSE values as zeros, which means that sum(x) returns the total number of TRUE
values that the vector x contains. As an example, using the numvec and prices vectors defined above, the following
code uses the sum function to count how many of the numvec elements are greater than or equal to 9 and how many
of the prices elements are greater than 3.50.
sum(numvec >= 9)
## [1] 4
sum(prices > 3.50)
## [1] 3
For a logical vector x, mean(x) returns the number of TRUE elements in x divided by the total number of
elements of x, which is the proportion or fraction of elements of x that are TRUE.
mean(numvec >= 9)
## [1] 0.4
mean(prices > 3.50)
## [1] 0.3
Throughout the book, we will apply the sum or mean functions to logical vectors as part of computer simulations
involving random numbers. To preview this type of calculation, we briefly introduce the function runif. When called
with a single argument n, the function runif(n) returns a vector of n random (real) numbers between zero and one.
This “uniform” random variable will be discussed in Chapter 10, but for now we can think of any real number between
zero and one as being equally likely to be chosen.
set.seed(1234)
x <- runif(1000)
x[1:10]
## [1] 0.113703411 0.622299405 0.609274733 0.623379442 0.860915384 0.640310605
## [7] 0.009495756 0.232550506 0.666083758 0.514251141
sum(x<0.3)
## [1] 291
mean(x<0.3)
## [1] 0.291
mean((x>0.6)*(x<0.8))
## [1] 0.198
Ignore the first command for now, as the set.seed function will be discussed in Chapter 2. The second command
creates a vector x containing 1,000 random numbers between zero and one, and the third command outputs the first
ten values of x. The sum(x<0.3) command outputs the number of random numbers, out of 1,000, that are less
than 0.3, and the mean(x<0.3) command outputs the sum divided by the length of x (1,000). The final command,
mean((x>0.6)*(x<0.8)), returns the fraction of the 1,000 random numbers that are between 0.6 and 0.8 since
(x>0.6)*(x<0.8) evaluates to 1 when both x>0.6 and x<0.8 are TRUE and 0 otherwise. For this simulation,
29.1% of the 1,000 random numbers are less than 0.3, and 19.8% are between 0.6 and 0.8.
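A string vector can be converted to a categorical variable with the factor function. In the sketch below, the specific True/False answers are illustrative.
answers <- c("True","False","False","True","True")
answers
## [1] "True"  "False" "False" "True"  "True"
tfvec <- factor(answers)
tfvec
## [1] True  False False True  True
## Levels: False True
levels(tfvec)
## [1] "False" "True"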
The first command assigns answers to be a string vector with the student’s answers, and the second command
outputs this vector. The third command creates tfvec by applying the factor function to the original vector
answers. The factor function converts the original string vector into a vector that has the factor data type. As
the output shows, tfvec has the same values as answers, but R has automatically detected that the Levels of the
factor variable are False and True. The levels function can be used to return the levels associated with a factor
variable. By default, R orders the levels of a factor variable alphabetically unless specified otherwise.
If a categorical variable has a natural ordering, we can also explicitly specify the factor levels and do so in the correct
order. For example, suppose we have customer ratings for two different restaurants, contained in the vectors vec1 and
vec2. There are four possible ratings, which are (in ascending order) Poor, Fair, Good, and Excellent.
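The setup commands below re-create the ratings vectors and specify the ordered levels; they are consistent with the output that follows.
vec1 <- c("Good","Excellent","Good","Poor","Good")
vec2 <- c("Fair","Excellent","Excellent","Poor","Good","Excellent")
rating_levels <- c("Poor","Fair","Good","Excellent")
ratings1 <- factor(vec1, levels = rating_levels)
ratings2 <- factor(vec2, levels = rating_levels)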
ratings1
## [1] Good Excellent Good Poor Good
## Levels: Poor Fair Good Excellent
ratings2
## [1] Fair Excellent Excellent Poor Good Excellent
## Levels: Poor Fair Good Excellent
table(ratings1)
## ratings1
## Poor Fair Good Excellent
## 1 0 3 1
table(ratings2)
## ratings2
## Poor Fair Good Excellent
## 1 1 1 3
The assignment of rating_levels to the four ratings categories tells R the order of the categories. Then,
ratings1 and ratings2 are created as vectors of factor variables, based upon the original vec1 and vec2
string vectors and using the levels specified by rating_levels. When ratings1 is output, the levels are shown
in the order that we specified, and moreover the output shows the level Fair even though there are no Fair ratings
in the ratings1 vector. Another way to output and view the data within the factor-variable vector is with the table
command, which provides a tabulation (count) of the number of vector elements that are within each category. For the
table of ratings1, all four categories are shown, with a 0 indicating no elements with a Fair rating.
1.5 Output
In the examples considered thus far, R supplies output automatically for most expressions, with the exception of
variable assignments.
x <- 5
x
## [1] 5
x/3
## [1] 1.666667
In this example, the variable assignment x <- 5 does not result in output, but the commands x and x/3 both
result in output, corresponding to the values of the two expressions.
The print function provides an alternative method for providing output. In many cases, the command print(x)
leads to the same output as the command x, but the print function is sometimes preferred by R users since (i) it has
optional arguments that can be useful and (ii) it can provide output that is better formatted for tables, regressions, etc.
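For example, the digits argument of print controls how many significant digits are displayed:
print(5/3)
## [1] 1.666667
print(5/3, digits = 3)
## [1] 1.67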
1.6 Programming
1.6.1 Conditional (if-else) execution
The simplest version of conditional execution involves a single if statement, where a command or series of commands
gets executed if a certain logical condition holds.
# if (logical condition) {
# ... commands ...
# }
The syntax for the if statement has a logical condition, which evaluates to TRUE or FALSE, within parentheses
after the if keyword. If the logical condition evaluates to TRUE, the sequence of commands within the curly braces {
and } are executed.
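The following example, in which the string value and the printed messages are illustrative, shows an if statement in action.
str <- "intro to statistics"
if (nchar(str) > 10) {
  print("The string is long.")
  print(nchar(str))
}
## [1] "The string is long."
## [1] 19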
The if statement checks whether the string str has more than ten characters. Since it does, the commands within {
and } are executed, resulting in two lines of output. Had str been a string with ten or fewer characters, the commands
within { and } would have been skipped, and no output would have resulted.
What if we want to execute different commands when the logical condition within the if statement does not hold?
In this case, we use an if-else statement that executes one set of commands if the logical condition holds and a
different set of commands if it does not.
# if (logical condition) {
# ... commands executed if condition is TRUE...
# } else {
# ... commands executed if condition is FALSE...
# }
This syntax has the else keyword appearing after the right curly brace } of the original if statement and the
“else” commands contained within a second set of curly braces { and } after the else keyword.
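The following example, in which the initial price value is an illustrative number greater than 20, shows an if-else statement in action.
price <- 25   # illustrative value above 20
if (price > 20) {
  print("The price is too high.")
  price <- 0.90*price
} else {
  print("The price is not too high.")
}
## [1] "The price is too high."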
The if statement checks if price is greater than 20, which is the case here. As a result, the output "The price
is too high." is provided, and the variable price is set to 90% of its original value. Had price been less than or
equal to 20, the output "The price is not too high." would have been provided, and the variable price
would have been left unchanged.
We can check additional logical conditions within an if-else statement by using the keywords else if rather
than else, as illustrated in the following example.
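Here the initial value of price is an illustrative number below 10.
price <- 8   # illustrative value below 10
if (price > 20) {
  print("The price is too high.")
  price <- 0.90*price
} else if (price < 10) {
  print("The price is too low.")
  price <- 1.10*price
} else {
  print("The price is not too high or too low.")
}
## [1] "The price is too low."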
Like the previous example, the first logical condition (price > 20) is checked by the initial if statement,
with the two commands below it executed if this condition is TRUE. In this example, however, if the (price >
20) condition is FALSE, the subsequent else if checks whether the logical condition (price < 10). If this
condition is TRUE, the output "The price is too low." is provided, and the variable price is set to 110%
of its original value. If this condition is FALSE, in which case price is between 10 and 20, the output "The price
is not too high or too low." is provided.
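1.6.2 for loops
A for loop repeatedly executes a set of commands, once for each value in a sequence:
# for (var in sequence) {
# ... commands ...
# }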
In this syntax, var is a variable used to store values during the loop, and sequence is a vector specifying the
sequence of values over which the for loop will iterate. For each value in sequence, the for loop executes the
commands specified within the curly braces { and }. For instance, if we want to conduct 10,000 simulations of a
certain process, a for loop can loop over the simulation number (from 1 to 10,000), and the commands within the
loop will be repeatedly executed for each simulation.
As an example, we consider how a for loop can be used to calculate the Fibonacci sequence, which is a sequence
of numbers in which each number is the sum of the two preceding numbers. The sequence starts with the numbers 1
and 1, so that the Fibonacci sequence is 1, 1, 2, 3, 5, 8, 13, 21, ….
sequence_length <- 10
fib_sequence <- rep(0,sequence_length)
fib_sequence[1] <- 1
fib_sequence[2] <- 1
for (i in 3:sequence_length) {
fib_sequence[i] <- fib_sequence[i-2]+fib_sequence[i-1]
}
print(fib_sequence)
## [1] 1 1 2 3 5 8 13 21 34 55
The variable sequence_length is the desired length of the Fibonacci sequence, set to 10 here. Then,
the variable fib_sequence is initialized as a numeric vector having all zeros and with length equal to the
specified sequence_length. The next two commands assign the value 1 to both the first and second elements
of fib_sequence, which corresponds to 1 and 1 being the first two numbers in the Fibonacci sequence. The
for loop does the rest of the work in determining the Fibonacci sequence. The variable i loops from 3 to
sequence_length. With sequence_length being 10, the variable i has the value 3 on the first iteration
of the loop, 4 on the second iteration of the loop, and so on, through 10 on the eighth iteration of the loop.
On the first iteration of the loop, fib_sequence[3] is assigned to the sum of fib_sequence[1] and
fib_sequence[2], which is 2. On the second iteration of the loop, fib_sequence[4] is assigned to the sum
of fib_sequence[2] and fib_sequence[3], which is 3. This process continues until the final iteration of the
loop, where fib_sequence[10] is assigned to the sum of fib_sequence[8] and fib_sequence[9]. The
last print command provides the output with the Fibonacci sequence of length sequence_length.
The following example involves the use of strings and illustrates how the sequence vector can be something other
than a simple sequence of numeric values.
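In this sketch, the last two strings in strvec are illustrative; the output below pins down only their first letters.
strvec <- c("cow","horse","pig","cat","goat")   # "cat" and "goat" are illustrative
first_letters <- ""
for (str in strvec) {
  first_letters <- paste(first_letters,substr(str,1,1),sep="")
}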
print(first_letters)
## [1] "chpcg"
The for loop builds up a string first_letters that consists of the first letters from each string in the string
vector strvec. The variable first_letters is initialized to be an empty string "". In the first iteration of the
for loop, the variable str is equal to "cow", the first element of strvec. The expression substr(str,1,1)
yields the first character of str, which is "c", and the paste function pastes it at the end of first_letters, so
that the value of first_letters after the first iteration is "c". In the second iteration, the variable str is equal
to "horse", and the character "h" gets added to the end of first_letters, so that its value is "ch" after the
second iteration. This process continues, with a total of five iterations, and the print command outputs the final value
of first_letters.
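1.6.3 while loops
A while loop repeatedly executes a set of commands as long as a logical condition remains TRUE:
# while (logical condition) {
# ... commands ...
# }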
The syntax for the while loop has a logical condition, which evaluates to TRUE or FALSE, within parentheses after
the while keyword. If the logical condition is TRUE, the commands within the curly braces { and } are executed.
After the commands are executed, the computer goes back to the start of the while loop to check the logical condition
again. If the logical condition is still TRUE, the commands within the curly braces { and } are executed again. This
process continues until the while logical condition is FALSE, at which point the computer stops the loop (and, if
there are commands after the loop (i.e., after the right curly brace }), jumps to those commands).
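In the sketch below, the first two daily sales values (38 and 52) match the discussion that follows, while the remaining values and the exact form of the final print command are assumptions.
sales <- c(38,52,41,49,42,35,20)   # values after day 2 are illustrative
total_sales <- 0
idx <- 0
while (total_sales < 200) {
  idx <- idx + 1
  total_sales <- total_sales + sales[idx]
}
print(c(idx,total_sales))
## [1]   5 222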
The numeric vector sales contains data on daily sales for a certain company, and a while loop determines how
many days it takes for the company’s total (cumulative) sales to be at least 200 units. The total_sales variable,
used to keep track of cumulative daily sales, is initialized to 0. The idx variable, used to keep track of the number of
the day (i.e., the index of the sales vector), is also initialized to 0. When the while loop is first reached, the logical
condition is TRUE since total_sales is equal to 0. The commands within the loop increment idx, so it now has a
value of 1, and add sales[1] to total_sales, which now has a value of 38. The logical condition remains true,
so the commands within the loop are executed again, leading to an idx value of 2 and a total_sales of 90. This
process continues until idx is equal to 5, and total_sales is equal to 222, at which point the logical condition
for the while loop is FALSE, so the loop is ended and the print command outputs the information.
1.7 Writing functions
In addition to using R’s built-in functions, we can write our own functions. Writing a function involves the following steps:
• Name the function and specify its arguments: The function is given a name, and its arguments are enclosed in
parentheses after the function name. Some arguments may be required, and some arguments may be optional, as we
have seen for built-in functions like round and log.
• Write the code: The code for the function, which is contained within curly braces { and }, performs the desired
calculations.
• Return the output: The return function specifies the output of the function, which could be any type of object,
like a number, a string, a logical value, or a vector.
The following example shows how to write a function that calculates the area of a triangle based upon the values of
its base and its height.
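triangle_area <- function(base, height) {
  area <- (1/2)*base*height
  return(area)
}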
triangle_area(10,5)
## [1] 25
area
## Error in eval(expr, envir, enclos): object ’area’ not found
The name of the function is triangle_area, and it is “assigned” to be a function with two required arguments,
base and height. Within the function, the variable area calculates the area of the triangle, which is 1/2 times
base times height. Then, the return function returns the function’s output, which is the value of area. The
triangle_area(10,5) command makes a call to the function triangle_area with argument values 10 (for
base) and 5 (for height), and the resulting area of 25 is printed. The final command, area, results in an error
message. Even though the variable area is defined within the function triangle_area, it is not recognized by R
once the function evaluation is complete; we say that the variable area is “local” to the function. The same is true
for any variables created within a function. It is also true for the arguments of the function; the variables base and
height would not be recognized after the function evaluation is complete. When a variable or argument is defined
inside a function, it becomes a “local” variable by default. As such, the variable or argument name can be used freely
within the function without affecting any variables outside of the function. For example, if height were a variable
that already existed before the call to triangle_area, the use of the argument name height would not affect the
value of the variable height. This idea is illustrated below.
height <- 8
triangle_area(10,5)
## [1] 25
height
## [1] 8
The variable height is assigned the value 8. Then, even though the triangle_area function uses an argument
with the same name (height) and assigns the value 5 to that argument, the value of the original variable
remains 8 after the function evaluation is complete.
Without any changes, the triangle_area function can actually take vectors as its two arguments, with base
being a vector of triangle bases and height being a vector of triangle heights. The first elements of base and
height correspond to the first triangle, the second elements correspond to the second triangle, and so on. Then,
since R automatically performs arithmetic operations on an element-by-element basis, the function triangle_area
returns a vector of the areas for each of the triangles.
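For instance, with illustrative vectors of bases and heights:
triangle_area(c(10,6,4), c(5,3,10))
## [1] 25  9 20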
The following example shows how to define optional arguments in a function. We define the function distance
to calculate the distance between any two points (x1, y1) and (x2, y2), given by √((x1 – x2)² + (y1 – y2)²).
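distance <- function(x1, y1, x2 = 0, y2 = 0) {
  return(sqrt((x1-x2)^2 + (y1-y2)^2))
}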
distance(3,4)
## [1] 5
distance(3,4,5,-1)
## [1] 5.385165
x1 and y1 are required arguments for distance, while x2 and y2 are optional arguments. For the optional
arguments, an equal sign (=) appears after the argument and is followed by the default value. The default values for
x2 and y2 are both equal to 0, corresponding to (x2 , y2 ) = (0, 0) being the origin if the function is called without the
optional arguments. Thus, the distance(3,4) command returns the distance from (3, 4) to (0, 0), which is equal to 5,
and the distance(3,4,5,-1) command returns the distance from (3, 4) to (5, –1), which is √29, or approximately
5.385165.
1.8 Data frames and file input
A data frame is a two-dimensional data structure in which each column corresponds to a variable and each row corresponds to an observation. The data.frame function can be used to create a data frame directly:
df <- data.frame(name = c("Amy", "Blake", "Chloe"), age = c(28, 41, 32), employed = c("Yes", "No", "Yes"))
print(df)
## name age employed
## 1 Amy 28 Yes
## 2 Blake 41 No
## 3 Chloe 32 Yes
mean(df$age)
## [1] 33.66667
sapply(df,class)
## name age employed
## "character" "numeric" "character"
df$employed <- factor(df$employed)
sapply(df,class)
## name age employed
## "character" "numeric" "factor"
The first command, using the data.frame function, assigns the data frame to the variable df. This data frame
is specified to have three variables (columns): name, age, and employed. The data frame has three rows, or
three observations for each variable, as can be seen most clearly from the output of the print(df) command.
The $ syntax refers to an individual variable within the data frame, so that df$age refers to the age variable
(column). The command mean(df$age) outputs the arithmetic mean of the three values of age. The next command,
sapply(df,class), indicates the data type associated with each of the three variables in df. The function
sapply applies the function specified by its second argument (in this case class) to each variable in the data
frame specified in its first argument (in this case df). The resulting output indicates that name and employed are
strings and age is numeric. Since it is more appropriate to treat employed as a factor variable, with possible values
"Yes" and "No", the factor function in the next command re-assigns the employed variable so that its values
are factors rather than strings. The output of the second sapply command confirms that the data type of employed
has changed.
R provides various ways to access data frame elements, including rows, columns, and other selected data cells. For
the data frame df created above, the following examples show some different possibilities. Just as square bracket
syntax ([ and ]) references elements of vectors, square brackets can also be used to reference elements of data frames.
The difference is that data frames are two-dimensional objects whereas vectors are one-dimensional objects.
df$name
## [1] "Amy" "Blake" "Chloe"
df$age[1]
## [1] 28
df[1,2]
## [1] 28
df[,2]
## [1] 28 41 32
df[,c("name","employed")]
## name employed
## 1 Amy Yes
## 2 Blake No
## 3 Chloe Yes
df[1:2,]
## name age employed
## 1 Amy 28 Yes
## 2 Blake 41 No
df[df$age>30,]
## name age employed
## 2 Blake 41 No
## 3 Chloe 32 Yes
The expression df$name refers to the name variable and outputs a vector with all of the values for that variable in
the data frame. The expression df$age[1] refers to the age variable, with df$age being a vector of all the age
values and df$age[1] being the first element of that vector, which is 28. An alternative method to access the same
age value from the data frame is to directly reference the appropriate element of the data frame, which is what the
df[1,2] expression does. Note that df[1,2] has two indices specified within the square brackets and separated by
a comma. For this syntax, the first index (or indices) refers to the row(s) of the data frame, and the second index (or
indices) refers to the column(s) or variable(s) of the data frame. Therefore, df[1,2] is the first element (row 1) of the
second variable (column 2, which is the age variable), which has the same value 28 as the df$age[1] expression.
The expression df[,2] omits the first index within the square brackets, which tells R to provide all the rows for the
specified column(s). Therefore, the value of df[,2] is the vector of all of the values for the second (age) variable. In
the following expression, df[,c("name","employed")], the variables are specified by name rather than index
number, and all rows for name and employed are returned since the first index is again omitted. The result is itself
a data frame with two columns, in contrast to the vector that was returned by df[,2] when only a single column
was specified. The expression df[1:2,] omits the second index within the square brackets, which tells R to provide
all the columns for the specified row(s). As such, the result of df[1:2,] is itself a data frame which contains the
first and second rows, as specified by 1:2, of the original data frame df. This notation provides a convenient way
to select certain rows from the data frame, which can be extremely useful for data analysis. The final expression,
df[df$age>30,], provides a simple example of selecting a subset of data based upon a logical condition. This
expression returns a data frame with all rows for which the age variable is greater than 30.
While the data.frame function can be used to directly create a data frame, it is much more common to create
a data frame based upon an existing file that contains data. R has the ability to read many different types of files,
including text files and spreadsheet (e.g., Excel) files. In the interest of space, we focus on two specific file formats in
this section, the csv format and the RData format.
A csv (comma-separated values) file is a text file in which each line contains the values for one observation, separated by commas. For example, the file exams.csv contains two exam scores for each student in a class, and its first few lines look like this:
exam1,exam2
63,71
86,84
68,63
94,82
...
...
To load a csv file as a data frame, here are the necessary steps:
1. Set your “working directory”: R uses the working directory when trying to load a file. The current working
directory can be checked by typing getwd(). The working directory can be set by either (i) changing it
in the “Preferences...” settings in RStudio or (ii) using the setwd function to specify the directory (e.g.,
setwd("C:/stat-course/data/")).
2. Load the csv file: With the working directory set, the read.csv function loads the csv file. The first argument
of the read.csv function is the file name, provided within double quotes. By default, read.csv assumes that the
first line of the file contains the variable names. (If the first line of the file contains the first row of data rather than
variable names, the optional argument header = FALSE should be specified. Then, R reads in the data and creates
the variable names as V1, V2, V3, etc.) There are several optional arguments that can be specified, with details available
from the ?read.csv documentation. Two commonly used options can help to make sure that the data types of the
variables are appropriate:
• stringsAsFactors: Specifying stringsAsFactors = TRUE ensures that string data are read in as factor
variables. For example, if a variable has only the values "Yes" and "No", the default is for read.csv to read
the variable in as a string variable, but stringsAsFactors = TRUE instead reads it in as a factor variable
with two possible values.
• colClasses: Specifying the colClasses argument directly indicates the desired data types for each of the
variables. For example, if the file zipcode.csv contains just a single variable with five-digit U.S. zip codes in it,
those zip codes can be read in as strings rather than numbers with the expression read.csv("zipcode.csv",
colClasses = c("character")). With multiple variables, the argument colClasses is a vector with
length equal to the number of variables and data types specified, in order, corresponding to the variables. For
example, colClasses = c("character","numeric","factor") would read in three variables, the
first as a string variable, the second as a numeric variable, and the third as a factor variable.
The following example loads the file exams.csv into R. Since exam1 and exam2 should be treated as numeric
variables and R automatically reads them in as such, no additional arguments are specified for the read.csv
function. (The form of the two print commands below is one natural possibility.)
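exams <- read.csv("exams.csv")
exams[5:9,]
##   exam1 exam2
## 5    60    59
## 6    75    80
## 7    79    68
## 8    72    67
## 9    90    79
print(max(exams$exam1))
## [1] 99
print(max(exams$exam2))
## [1] 96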
The variable exams is assigned to be a data frame associated with the data in exams.csv. Then, any of the usual
commands can be used to access or manipulate the data in exams. The exams[5:9,] command returns a data
frame consisting of the data for the fifth through ninth students. The two print commands output the maximum
scores for the exam1 and exam2 variables.
Once a data frame has been created from a file, there are several useful R functions that can be used to examine and
summarize the data frame, including the following:
• View(df): Displays df as a spreadsheet in the RStudio script window.
• head(df, n = 6): Returns the first six rows of df. The optional argument n, whose default value is 6, can be
changed to return a different number of rows.
• tail(df, n = 6): Returns the last six rows of df. The optional argument n, whose default value is 6, can be
changed to return a different number of rows.
• str(df): Displays the structure of df, including the number of observations,
the variable data types, and some sample values for each variable.
• summary(df): Provides a summary of the variables in df. The information provided for a variable depends
upon its data type. For numeric variables, descriptive measures (including minimum value, maximum value, and
arithmetic mean) are provided. For factor variables, observation counts for some or all categories are provided.
• nrow(df) and ncol(df): Return the number of rows and columns in df, respectively.
• names(df): Returns a string vector containing the names of the variables in df.
head(exams)
## exam1 exam2
## 1 63 71
## 2 86 84
## 3 68 63
## 4 94 82
## 5 60 59
## 6 75 80
str(exams)
## 'data.frame': 77 obs. of 2 variables:
## $ exam1: int 63 86 68 94 60 75 79 72 90 79 ...
## $ exam2: int 71 84 63 82 59 80 68 67 79 85 ...
summary(exams)
## exam1 exam2
## Min. :33.00 Min. :20.0
## 1st Qu.:68.00 1st Qu.:64.0
## Median :79.00 Median :73.0
## Mean :77.31 Mean :70.9
## 3rd Qu.:89.00 3rd Qu.:81.0
## Max. :99.00 Max. :96.0
names(exams)
## [1] "exam1" "exam2"
We can add variables to an existing data frame. For example, to add the variable avg (the average of the two exam
scores) to the exams data frame, we do an assignment command for exams$avg:
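exams$avg <- (exams$exam1+exams$exam2)/2
str(exams)
## 'data.frame': 77 obs. of 3 variables:
## $ exam1: int 63 86 68 94 60 75 79 72 90 79 ...
## $ exam2: int 71 84 63 82 59 80 68 67 79 85 ...
## $ avg : num 67 85 65.5 88 59.5 77.5 73.5 69.5 84.5 82 ...
write.csv(exams,file="exams-edited.csv")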
The output from str(exams) confirms that a third variable, named avg, has been added to the exams data frame.
Had we instead done the assignment as avg <- (exams$exam1+exams$exam2)/2, without the exams$
prefix, the vector avg would have been created correctly, but it would simply be another variable in R and not part of
the exams data frame. Since the dataset has been changed, we might want to save the new data frame to a file. The
write.csv command saves the data frame exams to a new csv file called exams-edited.csv; alternatively,
we could over-write the original file with the command write.csv(exams,file="exams.csv").
As with other types of variables, a data frame can be removed from the R working environment using the rm
function. For example, if we are done using the exams data frame, rm(exams) removes it from the R working
environment.
The RData format is R’s own file format for saving objects. An RData file is loaded with the load function:
load("exams.RData")
str(exams)
## 'data.frame': 77 obs. of 2 variables:
## $ exam1: int 63 86 68 94 60 75 79 72 90 79 ...
## $ exam2: int 71 84 63 82 59 80 68 67 79 85 ...
The file exams.RData contains a data frame already stored within the exams variable. As such, there is no
assignment necessary in the first command, in contrast to the assignment that was necessary when using read.csv
for a csv file. The command str(exams) confirms that the data frame is loaded correctly. The save function is
used to save a data frame (and/or other R objects) to an RData file. In the code below, we add the variable avg to the
exams data frame and then save the exams data frame to a different RData file:
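exams$avg <- (exams$exam1+exams$exam2)/2
save(exams, file = "exams-edited.RData")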
Even though the file name is different, the name of the data frame (exams) remains the same, and the new file
exams-edited.RData contains the exams data frame.
The following code shows a simple example where we save more than one object to an RData file:
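instructor <- "Jones"   # illustrative value; the original string is not shown in the source
save(exams, instructor, file = "exams-edited.RData")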
We first create a string variable instructor. The save command now has two arguments before the file
argument is specified, causing both the exams data frame and the instructor string variable to be saved in
exams-edited.RData. Now, when this file gets loaded by the load command, both objects are created in the
R environment. To save more objects in an RData file, this basic idea can be extended to include many objects as
arguments before the file argument is specified for the save function.
1.9 Missing values
Real-world datasets often contain missing values, which R represents with the special value NA. The is.na function indicates which elements are missing, and the na.omit function drops any rows of a data frame that contain missing values.
hourly_wage <- c(12.50, NA, 10.75, 11.00, NA, NA, 14.80, 13.25)
age <- c(24, 42, 31, 61, 55, 26, 34, 59)
is.na(hourly_wage)
## [1] FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
hourly_wage[!is.na(hourly_wage)]
## [1] 12.50 10.75 11.00 14.80 13.25
sum(is.na(hourly_wage))
## [1] 3
worker_df <- data.frame(wagehr = hourly_wage, age = age)
na.omit(worker_df)
## wagehr age
## 1 12.50 24
## 3 10.75 31
## 4 11.00 61
## 7 14.80 34
## 8 13.25 59
The hourly_wage vector has three missing values and five non-missing values. The expression
is.na(hourly_wage) returns a logical vector having TRUE values corresponding to the missing elements of
hourly_wage and FALSE values for the other elements. The expression sum(is.na(hourly_wage)) returns
the total number of missing values in hourly_wage since TRUE and FALSE are treated as 1 and 0, respectively,
by the sum function. The hourly_wage[!is.na(hourly_wage)] expression returns a vector with all non-
missing elements of hourly_wage. The data.frame function creates the data frame worker_df consisting of
the hourly wage and age variables. Then, the na.omit(worker_df) command returns a data frame consisting of
only those rows with non-missing hourly wage values.
Several R functions that take vectors as arguments do not work correctly in the presence of missing (NA) values.
For instance, to calculate the average hourly wage for workers, the expression mean(hourly_wage) would seem
appropriate. Unfortunately, that expression doesn’t perform as desired, and the optional argument na.rm = TRUE
needs to be specified.
mean(hourly_wage)
## [1] NA
mean(hourly_wage, na.rm = TRUE)
## [1] 12.46
max(hourly_wage, na.rm = TRUE)
## [1] 14.8
The expression mean(hourly_wage) returns NA, indicating that R is unable to calculate the mathematical
average when the vector has missing values. Adding the optional argument na.rm = TRUE fixes the problem,
with 12.46 being reported as the average of the five non-missing values. Similarly, using na.rm = TRUE for
the max function allows R to calculate the largest value (14.8) among the non-missing values. If the expression
max(hourly_wage) had been used without na.rm = TRUE, the expression would also have a result of NA.
1.10 R packages
An R package is a collection of functions, data, and/or documentation that are bundled together so that they can be
easily installed and used. Such packages provide add-on capabilities to the standard R statistical software, and they
are a convenient way to share code with others and to organize and re-use your own code. Before an R package can
be used, it first needs to be installed using the install.packages function. Once the package has been installed,
it can be used in an R session with the library function, and the description for the package can be accessed using
the optional help argument for the library function.
As an example, we consider the installation and use of stringr, an R package that provides several useful
functions for string manipulation. First, the stringr package must be installed:
install.packages("stringr")
The package name stringr is enclosed in double quotes when used as an argument of the install.packages
function. With the stringr package installed, its contents can be used in R after the command
library(stringr) is executed.
library(stringr)
library(help = stringr)
str_trim(" testing this out ")
## [1] "testing this out"
The double quotes are unnecessary for stringr when used as an argument for the library function. The second
command, library(help = stringr), opens a window in the RStudio script pane with information about the
stringr library and its included functions. For example, the function str_trim removes any extra spaces from
the start and end of a string; in the example above, there are two spaces at the start of the string that are removed and
one at the end of the string that is removed. In most cases, we can access detailed documentation for new functions that
are added through packages. The command ?str_trim, for instance, provides documentation for the str_trim
function.
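The datasets used throughout this book are available in the companion probstats4econ package, which is installed and loaded in the same way: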
install.packages("probstats4econ")
library(probstats4econ)
The data frames in the probstats4econ package are “lazily loaded” into R, which means they are available
to the user even though they will not initially appear in the (upper-right) Environment/History pane. As an example,
after loading the package with the install.packages and library functions, we can immediately use the
exams data frame within the console. Even after using the dataset, it will still not appear in the Environment/History
pane unless a change (e.g., adding a new variable) is made to the data frame. (R itself has several datasets lazily
loaded when RStudio is launched. A complete list of datasets can be seen by using the command library(help =
datasets).)
Aside from the probstats4econ package, this book primarily considers the use of R without installation of
additional packages, which is known as “base R.” That said, there are several R packages that are very useful for
data analysis. Examples include ggplot2, a package for creating high-quality graphics and data visualizations, and
dplyr, a package for data manipulation. Both ggplot2 and dplyr are part of the tidyverse, a larger collection of packages that
is popular among data scientists.
Notes
1 The RStudio console window does not display the ## at the beginning of output lines. This book puts the ## at the beginning of output so that
a full block of commands and output can be copied and run: the commands are executed, while the output lines (now "comments") are ignored.
2 R has an integer data type, which provides a more efficient way to store integer-valued variables in memory.
3 R has a time data type for storing time values (e.g., 11:49:20).
4 It is also possible for a function to have multiple return values, which can be done by returning a list object. An R list is a collection of different
objects, which need not be of the same data type. The code for wald_test in Chapter 14 is an example having a list of two objects returned.
5 There are point-and-click alternatives in RStudio to read data files as well, including (1) click on “Import Dataset” in the (upper-right)
Environment/History pane, (2) click on the “Files” tab in the (lower-right) Plots/Help/Packages/Files pane and select the desired directory and
file, and (3) click on the “Import Dataset” submenu from the main “File” menu.
Exercises
1. For each of the following sections of R code, indicate (i) what output, if any, is provided by R and (ii) the final value
of the variable x. Answer without using R.
(a) Code section #1:
x <- 16
sqrt(x)
x <- 16
x <- sqrt(x)
x <- c(x,x)
x <- c(79,16,53,44)
x <- sort(x, decreasing = TRUE)
length(x)
y <- c(6.7,-3.3,4.2)
x <- (1:3)*y
min(x)
x <- (x > 0)
sum(x)
2. A company has the following daily sales over the course of nine days: 38, 52, 24, 61, 47, 18, 29, 44, 41.
(a) Create a numerical vector sales that contains the daily sales.
(b) Provide a single command to calculate the average daily sales.
(c) Provide a single command to calculate the number of days for which sales are (strictly) between 40 units and
60 units.
(d) Provide a single command to calculate the proportion of days for which sales are (strictly) between 40 units
and 60 units.
3. Provide a single command, involving the sqrt function, to output a vector of all “perfect squares” that are less than
1,000 and in ascending order. The "perfect squares," in ascending order, are 1², 2², 3², 4², 5², … or 1, 4, 9, 16, 25, ….
4.
(a) Create a vector mult_two consisting of all multiples of two (i.e., even numbers) between 1 and 200
(inclusive).
(b) Create a vector mult_three consisting of all multiples of three between 1 and 200 (inclusive).
(c) Provide a single command to calculate how much longer mult_two is than mult_three.
(d) Using mult_two and mult_three, create a new vector mult_vec consisting of all numbers between 1
and 200 (inclusive) that are a multiple of two, a multiple of three, or both. The vector mult_vec should have
(i) the vector elements in increasing order and (ii) no duplicate elements (e.g., the number 6 should only appear
once even though it’s in both mult_two and mult_three). How many elements does mult_vec have?
5. Write a function rectangle_area that calculates the area of a rectangle based upon two arguments. The first
argument is base, the length of the rectangle base. The second argument is height, the rectangle height. Make the
second argument optional, with the default value specified by height = base, corresponding to a square. Confirm
that rectangle_area(3) returns the area of a 3 × 3 square and rectangle_area(4,5) returns the area of a
4 × 5 rectangle.
6. Write a function even_product that takes a single numerical (integer) argument x and returns the product of the
first x even integers. For example, even_product(4) should return the product of 2, 4, 6, and 8, which is 384.
7. Refer to the R code (Section 1.6.2) that calculates the first ten numbers in the Fibonacci sequence.
(a) Modify the code to create a function fibonacci that takes sequence_length as an argument and
returns the Fibonacci sequence as a vector. Confirm that fibonacci(10) returns the correct sequence of
ten numbers.
(b) Does your fibonacci function work when called with 1 or 2 as its argument? If not, modify the code to
handle these two cases.
(c) The Tribonacci sequence is similar to the Fibonacci sequence, except that each element of the Tribonacci
sequence is the sum of the previous three numbers in the sequence. The beginning of the Tribonacci sequence
is 1, 1, 2, 4, 7, 13, 24, 44, .... Write a function tribonacci that takes a single (positive-integer) argument and
returns the Tribonacci sequence of that length.
8. Use a while loop to determine the smallest positive integer n for which ln(n) + ln(2n) is greater than 7.
9. Write a function longest_trend, with a numerical vector x as its single argument, that returns the
number associated with the longest subvector of strictly increasing values in x. For instance, for the vector
c(3,-1,1,0,-2), the function would return 2 since the sequence (-1,1) is the longest subvector of strictly
increasing values. Similarly, for the vector c(78,43,21,-5), the function would return 1 since the elements
are decreasing; for the vector c(-10,5,0,3,21,56,8), the function would return 4 since the sequence
(0,3,21,56) is the longest subvector of strictly increasing values.
(a) Write the function longest_trend using either a for loop or a while loop. Confirm that the function
returns the correct values for the three example vectors specified in the question.
(b) Modify the function longest_trend to include a second optional argument decrease with a default value
of FALSE. When decrease is FALSE, the function should work as above. When decrease is TRUE, the
function should return the number associated with the longest subvector of strictly decreasing values in x. In
the three example vectors given in the question, the function should return 3, 4, and 2, respectively, when
decrease=TRUE.
10. Using the sales vector created in Question 2, answer the following questions.
(a) Create a vector sales_yesterday that contains the daily sales that occurred yesterday, which should
be c(NA,38,52,24,61,47,18,29,44), where the first element is missing (NA) since there is no
observation before the first day. Try to create this new vector by using the original sales vector (rather
than the brute-force method of assigning the new vector to the list of values specified).
(b) Use a single command that, ignoring the first day, calculates the number of days for which sales are strictly
greater than the previous day’s sales.
(c) Using the vectors sales and sales_yesterday, calculate the average daily sales on days for which the
previous day’s sales were strictly less than 30 units.
11. Create a vector of 5,000 random numbers between zero and one with the following two commands:
set.seed(1234) followed by x <- runif(5000).
(a) Use a single sum command to calculate the sum of the 5,000 numbers.
(b) Use a single mean command to calculate the proportion of random numbers between 0.15 and 0.40.
(c) Thinking about the vector as an ordered sequence of random numbers, determine the proportion of times that
an element of the sequence is within 0.1 of the previous element of the sequence. The first element has no
previous element, so the proportion should be calculated for the remaining 4,999 elements of the sequence.
12. Suppose the 20-character string returns, defined below, indicates whether the stock price of a certain company
goes up, indicated by U, or down, indicated by D, over the course of 20 days of trading on the stock market.
(a) Using a for loop to loop over the characters of returns, determine how many days the stock goes up.
(b) Using another for loop, determine how many times a D (stock-price drop) is followed immediately by a U
(stock-price increase).
(c) Use the grepl command to determine whether there is a streak of four consecutive days of stock-price
increases. Use the grepl command to determine whether there is a streak of four consecutive days of
stock-price drops.
(d) Write a function strtovec that takes a string string as a single argument and returns a vector consisting
of the single characters that comprise string. For example, strtovec("abc") should return the vector
consisting of the elements "a", "b", and "c".
(e) Using the strtovec function from the previous part, provide a single command to determine the number of
days the stock goes up (based upon returns).
13. Use the exams dataset, read into the data frame exams, for this question.
(a) Create a vector scorediff that contains the difference between the exam2 score and the exam1 score for
each student. What is the average of scorediff? What is the maximum value of scorediff and which
student (i.e., which row number) has this maximum value? What is the minimum value of scorediff and
which student has this minimum value?
(b) The professor would like to reward students who show improvement on the second exam. Specifically, she
will place 70% weight on the second exam if a student’s second exam score is at least 5% higher than
the student’s first exam score. Otherwise, she will just place 50% weight on both exams. Create a vector
composite_score that calculates a composite score based upon these grading guidelines. For a student
who gets 50% weight on both exams, the composite score should be the sum of the two exam scores; for a
student who gets 70% weight on the second exam, the composite score should be 0.6 times the first exam score
plus 1.4 times the second exam score. What is the average of composite_score? Which student benefits
the most from the 70% weighting rule?
14. Use the cigdata dataset, read into the data frame cigdata, for this question. The dataset contains information on
cigarette taxes, prices, and sales in 2019 for each state (plus the District of Columbia) in the United States.
(a) Use the nrow function to confirm the number of observations.
(b) The variables cigprice and cigtax are the average price of a pack of cigarettes and the tax per pack of
cigarettes, respectively, for each state. Which states have the highest and lowest values for these two variables?
(The states can be identified by either of the string variables state or statename.)
(c) What is the average tax per pack in the data?
(d) Write a command that gives a vector of the five highest values of the tax per pack.
(e) Create a variable cigdata$statetax_pct equal to the state-tax percentage, defined as the tax per pack
divided by the average pack price for each state. What are the minimum, maximum, and average of this
variable?
2 Probability
Probability theory underlies all of statistics, providing a mathematical framework for quantifying uncertainty and
randomness in real-world phenomena. It allows practitioners to model and analyze the likelihood of different outcomes
in a given situation. Key statistical concepts, like estimation, confidence intervals, and hypothesis tests, are all based
upon probability theory. Therefore, to properly understand statistics and apply statistical inference, it is important to
have a solid foundation in the basic concepts of probability theory. This chapter provides an introduction to probability
theory by discussing the meaning of probability and introducing the fundamental properties of probabilities. Chapter 3
builds upon the material of this chapter and introduces the concepts of conditional probability and independence.
Before jumping into terminology and definitions, we first consider a motivating example:
Example 2.1 (Widget website) The website widgets.com has 3,000 total registered users and wants to test the
effectiveness of two possible e-mail campaigns. E-mail A is sent to 300 users at random, e-mail B is sent to 300 users
at random, and the other 2,400 users receive no e-mail. For each registered user, widgets.com has the following
information at some point (say, one week) after the e-mail campaigns:
campaign = A if user receives e-mail A
           B if user receives e-mail B
           None if user receives no e-mail
purchase = Y if user has made a purchase in the last week
           N if user has not made a purchase in the last week
Despite being a relatively simple example, there are many probability concepts associated with this experiment. First,
what is meant by “at random” when we say that an e-mail is sent to users at random? Intuitively, we mean that any
user has the same chance of receiving the e-mail as any other user. One way to choose the users at random for the two
e-mail campaigns is as follows:
• E-mail A: The first recipient is chosen randomly from the 3,000 users, so that everyone has a 1/3000 chance of
being chosen. The second recipient is chosen randomly from the 2,999 remaining users, so that each remaining user
has a 1/2999 chance of being chosen. This process continues through the 300th recipient, with each of the 2,701
remaining users having a 1/2701 chance of being chosen.
• E-mail B: The first recipient is chosen randomly from the 2,700 remaining users, so that each remaining user has
a 1/2700 chance of being chosen. The second recipient is chosen randomly from the 2,699 remaining users, so that
each remaining user has a 1/2699 chance of being chosen. This process continues through the 300th recipient, with
each of the 2,401 remaining users having a 1/2401 chance of being chosen.
• The 2,400 remaining users who did not receive e-mail A or e-mail B are the users for which the value of campaign
is None. We can think of this group of users as a “control group” to which we can compare the e-mail A recipients
and/or the e-mail B recipients.
This method of selecting the e-mail recipients is called sampling without replacement. Starting with the full sample
of 3,000 users, the 300 e-mail A recipients are chosen one-by-one. Once a user is randomly selected to receive e-
mail A, that user can not be randomly selected again. The user is not replaced back into the sample, from which the
terminology “without replacement” comes. Instead, the user is removed from the sample that is used to randomly
select the subsequent e-mail recipients.
The sample function in R can be used to implement the random selection of e-mail A and e-mail B recipients. Here
is a description of the usage of sample:
• sample(x, size, replace = FALSE, prob = NULL): Returns a random sample of size size from
the elements in the vector x. The optional argument replace indicates whether sampling should be done without
replacement (which is the default, replace = FALSE) or with replacement (replace = TRUE). The optional
argument prob specifies how likely each element of x is to be sampled, with the default that each element is equally
likely to be chosen.
If the 3,000 registered users are uniquely identified by a user number, ranging from 1 to 3000, the following code
randomly selects the e-mail recipients:
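temp <- sample(1:3000, 600) # randomly pick 600 of the 3,000 user numbers
emailA_recipient <- temp[1:300] # first 300 sampled users receive e-mail A
emailB_recipient <- temp[301:600] # last 300 sampled users receive e-mail B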
The vector 1:3000 represents the full set of users. The vector temp is created with the sample function by
randomly picking 600 elements from 1:3000, where each of the elements is equally likely to be chosen. Then, the
emailA_recipient vector is assigned to be the first 300 elements of temp, and the emailB_recipient vector
is assigned to be the last 300 elements of temp. The non-recipients are the elements of 1:3000 that are not in either
emailA_recipient or emailB_recipient.
Before implementation of this e-mail experiment, there are some known probabilities and some unknown
probabilities:
• For any given user, the chance or probability that she receives e-mail A is known. Since each user has the same
chance of being chosen to receive e-mail A, this known probability is 300/3000 = 1/10 or 10%. Similarly, the known
probability of receiving e-mail B is 1/10 or 10%, and the known probability of receiving no e-mail is 2400/3000 = 4/5
or 80%.
• For any user who receives e-mail A, the chance or probability that she makes a purchase is an unknown probability.
The same is true of the probability that an e-mail B recipient makes a purchase and the probability that a user who
received neither e-mail makes a purchase.
Ultimately, the goal of the e-mail experiment by widgets.com is to determine how effective the e-mail campaigns
are. Specifically, the following questions are of interest:
• Is campaign A more (or less) effective than campaign B? In terms of probabilities, is the probability of purchase
by an e-mail A recipient higher (or lower) than the probability of purchase by an e-mail B recipient?
• Is campaign A more (or less) effective than no campaign?
Suppose widgets.com finds that 60 e-mail A recipients (20% of the 300) made a purchase, 66 e-mail B recipients
(22% of the 300) made a purchase, and 360 of the non-recipients (15% of the 2400) made a purchase. For these
specific users, the results indicate that the e-mail B campaign is slightly more successful than the e-mail A campaign
(22% versus 20%) in leading to purchases and even more so when compared to the non-recipients (22% versus 15%).
But is there enough evidence to conclude that the e-mail B campaign is truly better than the e-mail A campaign (or
no e-mail at all)? Should widgets.com use e-mail B for its future campaigns instead of e-mail A? Perhaps the
higher purchase rate (22%) for e-mail B recipients happened by chance, and it might not be the case that we would
see the same result for a new batch of users receiving the two e-mail campaigns. As seen later in the book, the power
of statistical inference is the ability to take outcomes such as these (i.e., the purchasing outcomes for the three groups)
and provide an analysis of what might happen for new users who are presented with e-mail A, e-mail B, or neither.
Before leaving this example, it is worth noting that widgets.com might not be simply interested in whether or
not a purchase is made but also the revenue from such purchases. For example, the following information, in addition
to campaign and purchase above, could be collected:
amount = dollar amount of purchases by the user (equal to zero if no purchases) in the last week
Whereas purchase is a binary outcome, with the two possibilities purchase (Y) or no purchase (N), amount can
potentially take on many different values, depending upon the prices and quantities of the widgets available for
purchase. Based upon the results of their e-mail campaign experiment, widgets.com might like to determine which
campaign is likely to be more successful with respect to revenues or profits for new users.
Definition 2.2 The sample space, denoted S, is the set of all possible outcomes for an experiment.
The mathematical concept of a set is used in Definition 2.2. To review, a set is a collection of distinct objects that is
specified by listing its elements within curly braces { and }. The ordering of the elements within a set does not matter,
and a set does not have duplicate elements.
The simplest sample space has two possible outcomes. If there is only one outcome, there is no uncertainty, and
therefore it does not constitute an experiment. A classic example of a two-outcome sample space is a coin toss, where
the possible outcomes are heads (denoted H) and tails (denoted T). The sample space for a coin toss is S = {H, T}.
Since the order of outcomes doesn’t matter, S = {T, H} is equivalent to S = {H, T}.
Here are some additional examples of two-outcome sample spaces:
• asset return for the year: S = {U, D}, where U denotes a positive return (“up”) and D denotes a negative return
(“down”)
• website visitor purchase behavior: S = {Y, N}, where Y indicates purchase and N indicates no purchase
• student exam result: S = {P, F}, where P indicates pass and F indicates fail
• worker’s union status: S = {U, NU}, where U indicates a union worker and NU indicates a non-union worker
Of course, sample spaces can have more than two outcomes. Some examples include:
• roll of a six-sided die: S = {1, 2, 3, 4, 5, 6}
• day of the week of a baby’s birth: S = {Mon, Tue, Wed, Thu, Fri, Sat, Sun}
• number of days during January for which a city’s temperature is above freezing: S = {0, 1, 2, ..., 31}, where the
“...” shorthand indicates that the sample space contains 32 outcomes, consisting of all integers between 0 and 31
(inclusive)
Oftentimes, an experiment involves multiple “trials” of the same underlying process:
• tossing two coins: S = {HH, HT, TH, TT}
• purchase behavior of the first three visitors to a website on a given day:
S = {YYY, YYN, YNY, YNN, NYY, NYN, NNY, NNN}
In these last two examples, we’ve implicitly assumed that the order of the outcomes matters, with the outcome HT
(heads then tails) being different from the outcome TH (tails then heads). The number of possible outcomes for the
two coin tosses is four, which is the number of possible outcomes for the first toss (two) times the number of possible
outcomes for the second toss (two). The number of possible outcomes for the purchase behavior of the three website
visitors is eight, which is the number of possible outcomes for the first visitor (two) times the number of possible
outcomes for the second visitor (two) times the number of possible outcomes for the third visitor (two). Using this
same logic, the size of the sample space for an experiment involving the tossing of five coins is 2 · 2 · 2 · 2 · 2 = 2⁵ = 32,
and the size of the sample space for an experiment involving the tossing of ten coins is 2¹⁰ = 1024.
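This multiplication logic is easy to check in R. The following sketch (not from the text; the names outcomes and S are chosen for illustration) enumerates the sample space for the three website visitors with the expand.grid function and counts its outcomes:
# all Y/N combinations for three visitors: 2 * 2 * 2 = 8 outcomes
outcomes <- expand.grid(first = c("Y","N"), second = c("Y","N"), third = c("Y","N"))
S <- paste0(outcomes$first, outcomes$second, outcomes$third)
length(S)
## [1] 8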
For the coin toss and website purchase examples, if we only care about the total number of heads or purchases, the
sample spaces would be:
• number of heads from two coin tosses: S = {0, 1, 2}
• number of purchases from the first three visitors to a website on a given day: S = {0, 1, 2, 3}
As compared with the sample spaces for the sequences of coin tosses or website purchases, these sample spaces have
fewer possible outcomes. Why? Here, the order of the outcomes doesn’t matter since only the total number of heads
or purchases matters. For the coin tosses, the outcome of 1 total head arises when the sequence HT or the sequence
TH occurs. For the website purchases, the outcome of 1 total purchase arises when YNN, NYN, or NNY occurs.
In R, we can use a vector to represent a finite sample space. For example, the following code creates vectors for the
sample spaces for a coin toss, a roll of a die, and a sequence of two coin tosses:
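# (vector names chosen for illustration; the original code is not shown)
coin_space <- c("H","T")
die_space <- 1:6
twocoins_space <- c("HH","HT","TH","TT")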
If there is concern that a vector vec contains duplicate elements, unique(vec) returns a vector (or set) of the
distinct elements. The sort function can be used to order the elements, if desired.
Example 2.2 (Two car dealerships) Suppose you own two car dealerships, dealership A and dealership B, where
dealership A has four salespeople and dealership B has three salespeople. Consider the experiment associated with
the number of salespeople at dealership A and the number of salespeople at dealership B that sell a car on a given
day. The following table enumerates all of the possible outcomes associated with the sample space for this experiment.
               Dealership B
                 0       1       2       3
Dealership A
        0      (0, 0)  (0, 1)  (0, 2)  (0, 3)
        1      (1, 0)  (1, 1)  (1, 2)  (1, 3)
        2      (2, 0)  (2, 1)  (2, 2)  (2, 3)
        3      (3, 0)  (3, 1)  (3, 2)  (3, 3)
        4      (4, 0)  (4, 1)  (4, 2)  (4, 3)
For instance, the outcome (2, 1) occurs when two of the salespeople at dealership A and one of the salespeople at
dealership B sell a car on a given day, and the outcome (0, 0) occurs when no one sells a car on a given day. The
sample space S is the set of the 20 possible outcomes in the table. The number of possible outcomes is equal to the
five possible outcomes for dealership A times the four possible outcomes for dealership B.
Although the examples considered so far have had a finite number of outcomes, Definition 2.2 does not restrict the
sample space to be a finite set. For example, the sample space corresponding to the number of patents that a company
receives in a given year can be written as S = {0, 1, 2, 3, …}, the set of all non-negative integers. It may seem strange
to specify this sample space to be infinite, as it is seemingly impossible for a company to receive 1 million patents in a
given year. But there’s no obvious way to specify what the maximum number of patents should be (1,000? 10,000?), so
it makes more sense to allow arbitrarily large values even though the larger values are extremely unlikely. The sample
space S = {0, 1, 2, 3, …} is useful for many other situations where it is not obvious how we would arbitrarily set a
maximum possible value, for instance the number of children in a given family, the number of times that an individual
is arrested, the number of initial public offerings (IPOs) on the New York Stock Exchange in a given year, etc.
Infinite sample spaces can arise in other ways. Thinking about website purchases again, if we are interested in the
sequence of website visitors on a given day until a purchase is made, the sample space is
S = {Y, NY, NNY, NNNY, NNNNY, …}.
An equivalent approach is to consider the sample space corresponding to the number of visitors observed until a
purchase is made, which is
S = {1, 2, 3, 4, 5, …}.
These two approaches are equivalent since there is a direct one-to-one relationship between the outcomes in the two
sample spaces. Y corresponds to 1, NY corresponds to 2, NNY corresponds to 3, and so on.
An infinite sample space also naturally arises when the underlying experiment involves a quantity that can be a real
number. Some examples include the following:
• the income of a given individual: S = [0, ∞), where [0, ∞) denotes all non-negative real numbers
– The outcome 0 corresponds to an individual with no income. The specification of S here allows for an
arbitrarily large income value, as there’s no way to reasonably set a maximum.
• the fraction of income that a given employed individual saves in a given year: S = [0, 1], where [0, 1] denotes all
real numbers between 0 and 1 (inclusive)
– The outcomes 0 and 1 correspond to the individual saving none of their income or all of their income,
respectively. An outcome of 0.2 corresponds to the individual saving 20% of their income. Allowing for
all real numbers between 0 and 1 is convenient here since it places no restrictions on what the savings
(numerator) or the income (denominator) can be. Even in cases where the observed data are rounded (e.g.,
to two decimal places), it is still useful to specify the sample space in terms of real numbers for conceptual
purposes.
• the return of an asset in a given year: S = [–1, ∞), where [–1, ∞) denotes all real numbers greater than or equal to
–1
– If the asset price at the beginning of the year is p0 and at the end of the year is p1, the asset return is (p1 – p0)/p0.
Assuming the asset price can not be negative, the lowest possible asset return is –1, which occurs when
p1 = 0. There is no upper bound on the sample space S since p1 can be arbitrarily large.
2.2 Events
As defined above, the sample space is the set of the possible outcomes for a given experiment. To be able to consider
the possible combinations of an experiment’s outcomes, we define an event and some specific types of events:
Definition 2.3 An event is any subset of outcomes within the sample space S. A simple event is exactly one outcome
from S, whereas a composite event consists of more than one outcome from S.
If the sample space S is finite, the number of simple events is the size of the sample space S. The sample space S
is itself an event since S is a subset of itself and, moreover, is a composite event since S has more than one outcome.
Definition 2.4 For any event E with a finite number of outcomes, let |E| denote the number of outcomes in event E.
Using this notation, a simple event E has |E| = 1, and a composite event E has |E| > 1.
Example 2.3 (Three website visitors) For the first three visitors to a website on a given day, the sample space for their
purchase behavior is
S = {YYY, YYN, YNY, YNN, NYY, NYN, NNY, NNN}.
Each element of S is a simple event. For example, the event E = {YYY} is a subset of S corresponding to three purchases
in a row. There are eight different simple events since |S| = 8. How about the event A that the first two visitors make a
purchase? A = {YYY, YYN} is a composite event, with two possible outcomes from S. How about the event B that two
total purchases are made? B = {YYN, YNY, NYY} is also a composite event, with three possible outcomes from S.
Example 2.4 (Two car dealerships) In Example 2.2, the sample space S has 20 outcomes, so that |S| = 20. There are
20 simple events associated with S. One example is the event E = {(2, 2)} that two salespeople at each dealership
sell a car on a given day. How about the event, denoted E′, that the total number of salespeople who sell a car is
equal to two? This composite event is E′ = {(0, 2), (1, 1), (2, 0)}. How about the event, denoted E′′, that the number of
salespeople at dealership B who sell a car is greater than the number of salespeople at dealership A who sell a car?
This composite event is E′′ = {(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)}.
For any event, the result of the experiment is that either the event occurs or the event does not occur. If the event
does not occur, we say that the complement of the event occurs.
Definition 2.5 The complement of an event A, denoted Ac , is the set of all outcomes in the sample space S that are
not in A.
Example 2.5 (Three website visitors) From Example 2.3, the event B = {YYN, YNY, NYY} corresponds to two total
purchases. Its complement is Bc = {YYY, YNN, NYN, NNY, NNN}, which contains all outcomes for which there were
not two total purchases made, or equivalently Bc contains all outcomes for which either zero, one, or three total
purchases were made. The size of the event Bc (5 outcomes) is the size of the sample space S (9 outcomes) minus the
size of the event B (4 outcomes).
Since any outcome in the sample space S must be either in A or its complement Ac , we have the following
proposition:
Proposition 2.1. For a sample space S with a finite number of outcomes, |A| + |Ac | = |S| for any event A.
Since many possible events may be associated with a sample space S, it is useful to have ways to think about
multiple events at once. The definitions below introduce the concepts of the union of two events and the intersection of
two events, which correspond to either of two events occurring (the union) and both events occurring (the intersection).
We start with the union.
Definition 2.6 The union of events A and B, denoted A ∪ B, is the set of all outcomes in event A or event B or in both.
The union is sometimes read as “A or B.”
Example 2.6 (Three website visitors) Recall that event A = {YYY, YYN} corresponds to the first two visitors
making a purchase and event B = {YYN, YNY, NYY} to a total of two purchases being made. The event A ∪ B =
{YYY, YYN, YNY, NYY} corresponds to the first two visitors making a purchase or a total of two purchases being
made. The outcome YYN is in both event A and event B, but it appears only once in the event A ∪ B since A ∪ B is a set.
If A and B are events with a finite number of outcomes, |A ∪ B| must always be less than or equal to the sum of |A|
and |B|. In Example 2.6, |A| = 2, |B| = 3, and |A ∪ B| = 4. The only case in which |A ∪ B| = |A| + |B| is when A and B have
no outcomes in common; in such a case, A and B are said to be disjoint events.
Definition 2.7 Events A and B are disjoint events (or mutually exclusive events) if they have no outcomes in common;
that is, there is no outcome in the sample space S that is an element of both event A and event B.
In Example 2.6, A and B are not disjoint events since they share the outcome YYN. Since disjoint events have no
outcomes in common, they can be thought of as events that can not possibly occur at the same time. In the example
of tossing two coins, where S = {HH, HT, TH, TT}, the event A that an equal number of heads and tails appear (A =
{HT, TH}) and the event B that two heads appear (B = {HH}) are disjoint events since they have no outcomes in
common and can not possibly occur at the same time.
Proposition 2.2. If A and B are events with a finite number of outcomes, |A ∪ B| ≤ |A| + |B|. Moreover, |A ∪ B| = |A| + |B|
if A and B are disjoint events, and |A ∪ B| < |A| + |B| if A and B are not disjoint events.
We now move to the intersection of two events:
Definition 2.8 The intersection of events A and B, denoted A ∩ B, is the set of all outcomes in both event A and
event B. The intersection is sometimes read as “A and B.”
Example 2.7 (Three website visitors) Recall that event A = {YYY, YYN} corresponds to the first two visitors making a
purchase and event B = {YYN, YNY, NYY} to a total of two purchases being made. The event A ∩ B corresponds to the
first two visitors making a purchase and a total of two purchases being made. In this case, A ∩ B = {YYN} since YYN
is the only outcome in both events.
If A and B are events with a finite number of outcomes, |A ∩ B| can not possibly be larger than either |A| or |B|:
Proposition 2.3. If A and B are events with a finite number of outcomes, |A ∩ B| ≤ |A| and |A ∩ B| ≤ |B|. The only case
in which |A ∩ B| = |A| is when event A is a subset of event B (i.e., all outcomes in A are also outcomes in B). Similarly,
the only case in which |A ∩ B| = |B| is when event B is a subset of event A.
Again focusing on events with a finite number of outcomes, there is an interesting relationship between the size of
the union A ∪ B and the size of the intersection A ∩ B. We know that all of the outcomes in event A are in A ∪ B, and all
of the outcomes in event B are in A ∪ B. However, the size of A ∪ B is not necessarily equal to |A| + |B| since Proposition
2.2 states that |A ∪ B| < |A| + |B| when A and B are not disjoint (i.e., they have outcomes in common). Note that A ∪ B
contains three types of outcomes: (i) outcomes in both A and B, of which there are |A ∩ B|; (ii) outcomes in A but not B,
of which there are |A| – |A ∩ B|; and (iii) outcomes in B but not A, of which there are |B| – |A ∩ B|. Therefore, the number
of outcomes in A ∪ B is |A ∩ B| + |A| – |A ∩ B| + |B| – |A ∩ B|, or |A| + |B| – |A ∩ B|. In this last expression, the subtraction
of |A ∩ B| effectively eliminates the double-counting in |A| + |B| for the outcomes that appear in both A and B. This
result is formally stated in the following proposition:
Proposition 2.4. If A and B are events with a finite number of outcomes, |A ∪ B| = |A| + |B| – |A ∩ B|.
Previously, Proposition 2.2 stated that |A ∪ B| = |A| + |B| only in the case that A and B are disjoint. When A and B are
disjoint, there are no outcomes in A ∩ B, so that |A ∩ B| = 0, which agrees with the result in Proposition 2.4.
An event that has no outcomes is called the null event and is formally defined as follows:
Definition 2.9 The null event consists of no outcomes and is denoted ∅. The size of the null event is |∅| = 0.
Disjoint events A and B have A ∩ B = ∅.
Several R functions facilitate the use and manipulation of (finite) events or sets, including the following:
• elt %in% x: Returns TRUE if the element elt is in the set x and FALSE otherwise.
• union(x, y): Returns the union of the sets x and y.
• intersect(x, y): Returns the intersection of the sets x and y.
• setdiff(x, y): Returns the elements of the set x that are not in the set y. If x is the sample space, then the
expression setdiff(x, y) returns the complement of the event y.
Also, since a vector represents an event or set in R, the number of outcomes in the event is the size of the vector
and can be determined with the length function. The following code illustrates the use of these functions for the
example above where S is the sample space for the purchase behavior of three customers, A is the event that the first
two customers make a purchase, and B is the event that a total of two customers make a purchase:
S <- c("YYY","YYN","YNY","YNN","NYY","NYN","NNY","NNN")
A <- c("YYY","YYN")
B <- c("YYN","YNY","NYY")
length(S)
## [1] 8
"YYY" %in% S
## [1] TRUE
union(A,B)
## [1] "YYY" "YYN" "YNY" "NYY"
intersect(A,B)
## [1] "YYN"
setdiff(S,B)
## [1] "YYY" "YNN" "NYN" "NNY" "NNN"
no_purchases <- c("NNN")
intersect(no_purchases,A)
## character(0)
length(intersect(no_purchases,A))
## [1] 0
The total number of outcomes, returned by length(S), is 8. The union and intersect functions properly
return A ∪ B and A ∩ B, respectively. The setdiff(S,B) expression returns the elements of S not in B, which
is the complement Bc . The last two expressions consider the intersection of the event of no purchases, {NNN},
with the event A. The intersection {NNN} ∩ A = ∅, which is indicated by the output of character(0) from the
intersect(no_purchases,A) expression. The size of {NNN} ∩ A = ∅ is zero, as indicated by the length
function.
For another example of the union and the intersection of events, we return to the car dealership example:
Example 2.8 (Two car dealerships) Recall from Example 2.4 that E′ = {(0, 2), (1, 1), (2, 0)} is the event that the total
number of salespeople selling a car is 2 and E′′ = {(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)} is the event that the number
of salespeople at dealership B selling a car is greater than the number of salespeople at dealership A selling a car. The
events E′ and E′′ are not disjoint since they share the outcome (0, 2). The intersection E′ ∩ E′′ is the event containing
this one outcome, E′ ∩ E′′ = {(0, 2)}. The union E′ ∪ E′′, which contains all outcomes for which the total number of
salespeople selling a car is 2 or for which more dealership B salespeople sell a car than dealership A salespeople, is
E′ ∪ E′′ = {(0, 1), (0, 2), (0, 3), (1, 1), (1, 2), (1, 3), (2, 0), (2, 3)}.
We have |E′| = 3, |E′′| = 6, |E′ ∩ E′′| = 1, and |E′ ∪ E′′| = 3 + 6 – 1 = 8, where the –1 eliminates the double-counting of the
(0, 2) outcome in the union.
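These counts can be checked in R by representing each outcome as a string (a quick sketch; the vector names are illustrative):
Eprime <- c("(0,2)","(1,1)","(2,0)")
Edblprime <- c("(0,1)","(0,2)","(0,3)","(1,2)","(1,3)","(2,3)")
length(union(Eprime, Edblprime))
## [1] 8
length(intersect(Eprime, Edblprime))
## [1] 1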
The notion of unions and intersections can be extended to more than two events, as follows:
Definition 2.10 Suppose there are k events, denoted A1 , A2 , …, Ak . The union of these k events, denoted A1 ∪ A2 ∪
· · · ∪ Ak , is the set of all outcomes that are in any of the events A1 , A2 , …, Ak .
Definition 2.11 Suppose there are k events, denoted A1 , A2 , …, Ak . The intersection of these k events, denoted A1 ∩
A2 ∩ · · · ∩ Ak , is the set of all outcomes that are in all of the events A1 , A2 , …, Ak .
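Consider simulating ten tosses of a fair coin with the sample function:
sample(c("H","T"), 10, replace = TRUE)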
The sample space {H, T}, represented by the vector c("H","T"), is the first argument for the sample function.
The second argument (10) indicates the number of coin tosses to be simulated. The argument replace = TRUE
specifies sampling with replacement; that is, for each coin toss, it is always possible to get heads or tails regardless of
what has occurred previously. Finally, there is no need to specify the optional prob argument since the two outcomes
are equally likely. If we had an “unfair” coin that was more likely to come up heads, say with a 60% chance, then the
command sample(c("H","T"), 10, replace = TRUE, prob = c(0.6,0.4)) would be used, with
probabilities 0.6 and 0.4 associated with "H" and "T", respectively.
Using the sample function, we can conduct many random simulations and summarize the results with very few
lines of code, sometimes even just one line of code. For example, to simulate 10,000 coin tosses, we change the second
argument of the sample function. And, to check how often heads occurs, we use the mean function to calculate the
proportion of the vector elements equal to H:
set.seed(1234)
cointosses <- sample(c("H","T"), 10000, replace = TRUE)
mean(cointosses=="H")
## [1] 0.4963
The set.seed(1234) command sets the random seed in the computer. The argument 1234 has been arbitrarily
chosen, but the purpose of the set.seed function is to ensure that we get the same simulation results if we run it
at a different time or on a different computer. Setting a random seed is good practice, as it makes it easier for you
and others to replicate results from simulations involving randomness. Throughout the book, the set.seed(1234)
command is used before any simulations involving randomness. For each of the 10,000 tosses, the computer randomly
chooses between heads (H) and tails (T). The vector cointosses is assigned to the full vector of 10,000 coin tosses,
and the mean(cointosses=="H") command calculates the proportion of heads over the 10,000 tosses. The result
is a proportion of 49.63%, or 4,963 heads out of 10,000 tosses.
To see how the calculated frequency of heads changes as the number of simulations increases, Figure 2.1 plots
calculated frequencies against the simulation number. After each of the 10,000 tosses, we calculate the fraction of
heads that have occurred so far. The x-axis has the number of the coin toss, and the y-axis has the fraction of heads
after each toss. A dotted line is drawn at the value of 0.5. When very few tosses have occurred, which corresponds to
the leftmost part of the x-axis, the fraction of heads can be quite far away from 0.5, with the points appearing to be
more “noisy.” As the number of tosses increases, moving to the right along the x-axis, the points get less “noisy” with
the fraction of heads stabilizing around 0.5. The complete R code is as follows:
Figure 2.1
Head frequency for 10,000 simulated coin tosses (x-axis: simulation number; y-axis: cumulative frequency of heads, with a dotted line at 0.5)
set.seed(1234)
cointosses <- sample(c("H","T"), 10000, replace = TRUE)
cumul_heads <- cumsum(cointosses=="H") # cumulative count of heads after each toss
freq_heads <- cumul_heads/(1:10000) # cumulative fraction of heads after each toss
# plot the freq_heads vector (when plot has only a single variable (vector) argument,
# it plots the variable versus 1 through the length of the vector)
plot(freq_heads,xlab="Simulation number",ylab="Cumulative frequency of heads",cex=0.5)
abline(h=0.5,lty=3)
The function cumsum calculates the cumulative number of heads through any number of tosses, which is stored in
the vector cumul_heads. Figure 2.1 is created by the plot and abline functions, with the plot command
plotting the vector freq_heads against the number of the simulation and the abline command drawing a
horizontal line at the value 0.5 (h=0.5 argument) that is dotted (lty=3 argument). To highlight the dotted line,
we could also change its color; for example, adding col="blue" as an argument to the abline command would
make it blue.
How many coin tosses are required for the fraction of heads to be close to 0.5 and stabilize near that value? While
we don’t have the tools to answer that question yet, the takeaway from Figure 2.1 is that a higher number of tosses
has a realized fraction of heads that is more likely to be closer to the probability P(A) = 0.5. Additional examples of
computer-simulated experiments are provided below, but at this point we can provide a more formal description of
what is meant by P(A), the probability of an event A associated with an experiment and sample space.
Imagine being able to repeat an experiment a large number of times, say n times. Think really large — a million, a
billion, a trillion, or more! For each experiment, record whether the event A has occurred. Let nA be the total number of
times that event A occurs over the n experiments. The fraction, or frequency, of experiments in which event A occurs
is equal to nA/n. Then, P(A) is the number that nA/n approaches as n gets arbitrarily large. (We are implicitly assuming
that there is a number (a “limit”) that nA/n approaches as n gets arbitrarily large. This idea, known as the Law of Large
Numbers, is discussed in Chapter 13.)
Thinking about probability in terms of a large number of repeated experiments is known as the frequentist
interpretation of probability. The word “frequentist” is used since the probability of event A is viewed as the long-run
frequency of A occurring in a large number of repeated experiments.
Before discussing the properties of probabilities, a few more examples are considered to illustrate the frequentist
interpretation of probability.
Example 2.10 (Tossing two coins) Consider the experiment of tossing two fair coins. The sample space S =
{HH, HT, TH, TT} has four possible outcomes. If A = {HH} is the event corresponding to two heads, what is P(A)? For
fair coins, it’s perhaps not surprising that each of the four outcomes in S are going to be equally likely, so we would
expect P(A) = 0.25. (In Chapter 3, we formalize why this probability is equal to 0.25.) Figure 2.2 shows the results from
10,000 simulations, with the computer randomly tossing two coins for each simulation. The x-axis is the number of the
simulation, and the y-axis is the calculated frequency of HH occurring through that number of simulations. Similar to
the simple coin toss example, the frequencies appear to stabilize when the number of experiments gets larger, but here
the stabilization occurs at a level close to the P(A) = 0.25 value, indicated by the horizontal dotted line.
set.seed(1234)
coin1tosses <- sample(c("H","T"), 10000, replace = TRUE)
coin2tosses <- sample(c("H","T"), 10000, replace = TRUE)
cumul_twoheads <- cumsum(coin1tosses=="H" & coin2tosses=="H") # running count of two-heads
The vectors coin1tosses and coin2tosses each contain the outcomes of 10,000 coin tosses. The occurrence
of two heads happens when the corresponding elements of these two vectors are both "H". For example, the fifth
simulation of two coin tosses results in two heads when both coin1tosses[5] and coin2tosses[5] are "H".
The vector cumul_twoheads contains the cumulative number of times that the two-heads event has occurred.
If we are only interested in approximating the probability of getting two heads in two tosses, rather than tracking
the cumulative frequencies, the following code suffices:
Figure 2.2
Double-head frequency for 10,000 simulations of two coin tosses (x-axis: simulation number; y-axis: cumulative frequency of two-heads, with a dotted line at 0.25)
set.seed(1234)
coin1tosses <- sample(c("H","T"), 10000, replace=TRUE)
coin2tosses <- sample(c("H","T"), 10000, replace=TRUE)
mean(coin1tosses=="H" & coin2tosses=="H")
## [1] 0.2513
Example 2.11 (Six-sided die) Consider the experiment of rolling a fair six-sided die. The sample space S =
{1, 2, 3, 4, 5, 6} has six possible outcomes. Let A = {6} be the simple event that a 6 is rolled. The die being “fair”
means that each of the six outcomes is equally likely, so that P(A) = 1/6. Figure 2.3 shows the results from 10,000
simulations of this experiment, with the computer randomly rolling a six-sided die for each of the 10,000 simulations.
The frequency of a 6 being rolled stabilizes as the number of experiments gets larger, and it appears to stabilize at
around 1/6, which is the level of the horizontal dotted line.
Figure 2.3
Six frequency for 10,000 simulations of a die roll (x-axis: simulation number; y-axis: cumulative frequency of sixes, with a dotted line at 1/6)
set.seed(1234)
dierolls <- sample(1:6, 10000, replace = TRUE)
The sample space is represented by the vector 1:6, consisting of the integers between 1 and 6 (inclusive), and
this argument is the only difference from the coin-toss example with sample space c("H","T"). The optional prob
argument is not specified since each outcome is equally likely.
If we are only interested in approximating the probability of rolling a 6, the following code suffices:
set.seed(1234)
dierolls <- sample(1:6, 10000, replace = TRUE)
mean(dierolls==6)
## [1] 0.1652
Definition 2.12 A1 , A2 , …, Ak are a collection of disjoint events if there is no pair of events within the collection for
which there is a shared outcome. In terms of mathematical notation, A1 , A2 , …, Ak are a collection of disjoint events if
Ai ∩ Aj = ∅ for any i, j ∈ {1, …, k} with i ≠ j.
Intuitively, A1 , A2 , …, Ak being disjoint means that any given event, say Aj , can not possibly happen if any of the
other events happens. Axiom 3 states that the probability that any of the disjoint events occurs is equal to the sum of
the probabilities of each of the individual events. A simple case of disjoint A1 , A2 , …, Ak arises when each Aj contains
a single outcome (i.e., Aj is a simple event) that is different from the other (simple) events.
Example 2.12 (Three website visitors) For the three website visitors, let Aj denote the event that exactly j total
purchases are made, and let B be the event that at least one purchase is made, so that B = A1 ∪ A2 ∪ A3. Since A1, A2,
and A3 are disjoint, Axiom 3 implies that P(B) = P(A1) + P(A2) + P(A3): the probability of at least one purchase being
made is the probability of exactly one purchase being made plus the probability of exactly two purchases being made
plus the probability of exactly three purchases being made.
At this point, we can’t say anything more about the value of this probability since the purchase probability is unknown.
The probability axioms lead to several other interesting properties of probabilities, some of which are stated in the
following proposition:
Proposition 2.5. (Properties of probabilities) Let A and B be any two events. The following properties are implied by
the Axioms of Probability:
(i) (Probability of a complement) P(Ac ) = 1 – P(A).
(ii) (Probability of the null event) P(∅) = 0.
(iii) (Partitioning an event) P(A) = P(A ∩ B) + P(A ∩ Bc ).
(iv) (Probability of the union of events) P(A ∪ B) = P(A) + P(B) – P(A ∩ B).
(v) If A and B are disjoint events, P(A ∩ B) = 0 and P(A ∪ B) = P(A) + P(B).
(vi) (De Morgan’s laws for probabilities) P((A ∪ B)c ) = P(Ac ∩ Bc ) and P((A ∩ B)c ) = P(Ac ∪ Bc ).
(vii) (Equally likely outcomes) Suppose S is a finite sample space with k possible outcomes. If every outcome in S
is equally likely to occur,
P(A1 ) = P(A2 ) = · · · = P(Ak ) = 1/k
for each of the simple events A1 , A2 , …, Ak in S.
Property (i) follows from the facts that Ac and A are disjoint, Ac ∪ A = S, and P(S) = 1, which taken together imply
P(Ac ) + P(A) = 1 or, equivalently, P(Ac ) = 1 – P(A).
Example 2.13 (Three website visitors) In Example 2.12, we considered the probability of the event B that at
least one purchase is made. An alternative approach to finding that probability is to consider Bc , which is the
event A0 that no purchases are made, Bc = A0 = {NNN}. Applying property (i), we have P(B) = 1 – P(A0 ). Since
we know that P(A0 ∪ A1 ∪ A2 ∪ A3 ) = P(A0 ) + P(A1 ) + P(A2 ) + P(A3 ) = 1, note that P(B) = 1 – P(A0 ) is equivalent to
P(B) = P(A1 ) + P(A2 ) + P(A3 ).
Property (ii) says that the probability of the null event is equal to zero, which intuitively makes sense since it is
impossible for the experiment to have no outcome. This property follows from ∅ = S c , to which we apply property (i):
P(∅) = 1 – P(S) = 1 – 1 = 0.
Property (iii) involves partitioning event A into two disjoint events, one event with outcomes also in the event B,
which is A ∩ B, and one event with outcomes not in event B, which is A ∩ Bc . The two events A ∩ B and A ∩ Bc are
disjoint since no outcome can be in both B and Bc . The union of A ∩ B and A ∩ Bc is the event A since every outcome
in A is either in A ∩ B or A ∩ Bc . Applying Axiom 3 then implies that P(A) = P(A ∩ B) + P(A ∩ Bc ).
Example 2.14 (Six-sided die) Let A = {3, 4, 5, 6} be the event of rolling at least a 3. Let B = {2, 4, 6} be the event of
rolling an even number. A can be partitioned into its even numbers, A ∩ B = {4, 6}, and its odd numbers, A ∩ Bc = {3, 5}.
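This partition can be verified with the set functions introduced earlier (a quick sketch):
A <- 3:6
B <- c(2, 4, 6)
intersect(A, B) # the even numbers of A, which is A ∩ B
## [1] 4 6
setdiff(A, B) # the odd numbers of A, which is A ∩ Bc
## [1] 3 5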
Property (iv), which states P(A ∪ B) = P(A) + P(B) – P(A ∩ B) for events A and B, provides the relationship between
the probability of a union of events and the probability of an intersection of events. We already know from Axiom
3 that for disjoint events A and B, we have P(A ∪ B) = P(A) + P(B). But property (iv) is more general, as it covers
events which are disjoint, when A ∩ B = ∅, and events that are not disjoint, when A ∩ B ≠ ∅. As we saw before
in the context of counting the number of outcomes in A ∪ B (recall Proposition 2.4, where |A ∪ B| = |A| + |B| – |A ∩ B|),
subtracting the probability P(A ∩ B) avoids the double-counting that occurs from the P(A) and P(B) terms for outcomes
that are in both A and B.
One way to visualize property (iv) is through the use of a Venn diagram, as shown in Figure 2.4. This figure shows
two different scenarios, one in which the intersection A ∩ B contains one or more outcomes (the Venn diagram on the
left) and one in which the intersection A ∩ B contains no outcomes (the Venn diagram on the right, having A ∩ B = ∅).
For both Venn diagrams, the light-gray circle corresponds to the event A and the dark-gray circle corresponds to the
event B. In the left Venn diagram, the intersection A ∩ B is the region where the two circles overlap. If the area of each
Figure 2.4
Venn diagrams and P(A ∩ B) (left: events A and B with a nonempty intersection A ∩ B; right: disjoint events A and B with A ∩ B = ∅)
circle is equal to the probability of the corresponding event, with the area of the overlap A ∩ B being equal to P(A ∩ B),
property (iv) follows intuitively. The quantity P(A) + P(B) is the sum of the areas of the two circles, but this sum double
counts the region A ∩ B. Therefore, to get P(A ∪ B), which is the area of the full shaded region that only counts the
region A ∩ B once, we must subtract off P(A ∩ B). This leads to P(A ∪ B) = P(A) + P(B) – P(A ∩ B). In the right Venn
diagram, the events A and B have no outcomes in common, so that A ∩ B = ∅ and, by property (v), P(A ∩ B) = 0. In this
case, property (iv) simplifies to P(A ∪ B) = P(A) + P(B) since there is no issue with double counting.
For the Venn diagram on the left, the event A is partitioned into two parts, A ∩ B (outcomes in A that are also in B)
and A ∩ Bc (outcomes in A that are not in B). This partitioning corresponds to property (iii), P(A) = P(A ∩ B) + P(A ∩ Bc).
Similarly, the event B is partitioned into two parts, A ∩ B (outcomes in B that are also in A) and Ac ∩ B (outcomes in B
that are not in A), so that P(B) = P(A ∩ B) + P(Ac ∩ B).
Example 2.15 (Six-sided die) Let A = {3, 4, 5, 6} be the event of rolling at least a 3. Let B = {2, 4, 6} be the event
of rolling an even number. Then, A ∪ B = {2, 3, 4, 5, 6} and A ∩ B = {4, 6}. Note that P(A) + P(B) double-counts the
probability of rolling a 4 or rolling a 6, and the subtraction of P(A ∩ B) corrects this double-counting in the formula
P(A ∪ B) = P(A) + P(B) – P(A ∩ B). With respect to the Venn diagram in Figure 2.4, the light-gray circle would represent
{3, 4, 5, 6}, the dark-gray circle would represent {2, 4, 6}, and the overlap would represent {4, 6}.
Property (v) states that it is impossible for two disjoint events to occur at the same time, which follows directly from
property (ii) since A ∩ B = ∅ for disjoint A and B.
Property (vi) states the probability properties associated with famous set-theory properties known as De Morgan’s
Laws. The two properties for sets are that
(A ∪ B)c = Ac ∩ Bc and (A ∩ B)c = Ac ∪ Bc .
The first set property is that the complement of a union of two events is the intersection of the complement of the two
events, which holds since any outcome which is not in either A or B must be in both Ac and Bc (and vice versa). Both
(A ∪ B)c and Ac ∩ Bc consist of outcomes that belong to neither A nor B. Similarly, the second set property is that the
complement of the intersection of two events is the union of the complement of the two events, which holds since any
outcome which is not in both A and B must be in either Ac or Bc (and vice versa). Both (A ∩ B)c and Ac ∪ Bc consist
of outcomes that are not in both A and B. Property (vi) follows immediately once we know (A ∪ B)c = Ac ∩ Bc and
(A ∩ B)c = Ac ∪ Bc .
Property (vii) provides the basis for calculating event probabilities when the outcomes of an experiment are equally
likely, as with the toss of a fair coin or the roll of a fair six-sided die. The result itself is a direct implication of
Axioms 2 and 3. The simple events A1 , …, Ak are disjoint, so that A1 ∪ · · · ∪ Ak = S and, applying Axioms 2 and 3, we
have P(A1) + P(A2) + · · · + P(Ak) = P(S) = 1. Since each of the k events is equally likely, each event probability is 1/k.
Example 2.16 (Six-sided die) Let A = {3, 4, 5, 6} be the event of rolling at least a 3. If all outcomes of S =
{1, 2, 3, 4, 5, 6} are equally likely, as for a fair die, the probability of any outcome is 1/6. The event A can be thought of
as the union of the four disjoint simple events {3}, {4}, {5}, and {6}, so that P(A) is the sum of the probability of these
four simple events, which is 4/6 or 2/3. Similarly, for the event B = {2, 4, 6} that an even number is rolled, P(B) = 3/6 = 1/2.
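For experiments with equally likely outcomes, event probabilities like these are easy to verify in R by representing the sample space and events as vectors. Here is a minimal sketch for the fair-die example (the vector names are our own):
> S <- 1:6
> A <- c(3, 4, 5, 6)
> B <- c(2, 4, 6)
> length(A)/length(S)
[1] 0.6666667
> length(B)/length(S)
[1] 0.5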
Property (iv) from Proposition 2.5, which is the union-of-events property, can be extended to more than two events,
as stated in the following proposition:
Proposition 2.6. (Generalization of the union-of-events property)
For two events A and B,
P(A ∪ B) = P(A) + P(B) – P(A ∩ B).
For three events A, B, and C,
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) – P(A ∩ B) – P(A ∩ C) – P(B ∩ C) + P(A ∩ B ∩ C).
For four events A, B, C, and D,
P(A ∪ B ∪ C ∪ D) = P(A) + P(B) + P(C) + P(D)
– P(A ∩ B) – P(A ∩ C) – P(A ∩ D) – P(B ∩ C) – P(B ∩ D) – P(C ∩ D)
+ P(A ∩ B ∩ C) + P(A ∩ B ∩ D) + P(A ∩ C ∩ D) + P(B ∩ C ∩ D)
– P(A ∩ B ∩ C ∩ D).
And, so on, for larger numbers of events.
For the case of three events in Proposition 2.6, where
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) – P(A ∩ B) – P(A ∩ C) – P(B ∩ C) + P(A ∩ B ∩ C),
Figure 2.5 provides a Venn diagram that illustrates how this property works. This figure is similar to the left diagram
in Figure 2.4 except that the third event C has been added. Again, the area of each circle should be thought of as the
probability of the associated event, as should any of the areas of the overlaps of the circles. The intersection events
A ∩ B, A ∩ C, B ∩ C, and A ∩ B ∩ C are indicated in the figure. To be clear, the region for intersection A ∩ B includes
the region for A ∩ B ∩ C; similarly, the regions for A ∩ C and B ∩ C also include the region for A ∩ B ∩ C. When
the areas of the three circles are added up, giving P(A) + P(B) + P(C), the sum double counts the areas associated
with A ∩ B, A ∩ C, and B ∩ C. As before, subtracting the quantities P(A ∩ B), P(A ∩ C), and P(B ∩ C) accounts for
the double counting. Unfortunately, when these three probabilities are subtracted, the area associated with A ∩ B ∩ C,
which is P(A ∩ B ∩ C), is subtracted one too many times. The region corresponding to A ∩ B ∩ C gets counted three
times in P(A) + P(B) + P(C) and then subtracted three times when P(A ∩ B), P(A ∩ C), and P(B ∩ C) are subtracted, so
the quantity P(A ∩ B ∩ C) is added back to get the correct P(A ∪ B ∪ C), as in the formula above.
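The inclusion-exclusion formula can be verified numerically in R for equally likely outcomes by computing both sides with set operations. A minimal sketch, using three arbitrarily chosen events on S = {1, …, 20} (the events and names are our own):
> S <- 1:20
> A <- S[S %% 2 == 0]; B <- S[S %% 3 == 0]; C <- 1:5
> p <- function(E) length(E)/length(S)   # probability under equally likely outcomes
> p(union(union(A, B), C))
[1] 0.75
> p(A) + p(B) + p(C) - p(intersect(A, B)) - p(intersect(A, C)) -
+   p(intersect(B, C)) + p(intersect(intersect(A, B), C))
[1] 0.75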
Notes
6. A more general version of the third axiom adds the following to allow for an infinite number of disjoint events: If A1, …, Ak, … is a countably
infinite collection of disjoint events,
P(A1 ∪ A2 ∪ A3 ∪ · · ·) = P(A1) + P(A2) + P(A3) + · · ·.
“Countably infinite” means that there is a natural way to count the events or, put another way, to label the events 1, 2, 3, and so on.
[Figure 2.5: Venn diagram and P(A ∩ B ∩ C). Three overlapping circles for events A, B, and C within the sample space S, with the intersections A ∩ B, A ∩ C, B ∩ C, and A ∩ B ∩ C labeled.]
7. To show that property (iv) holds, consider the two events B and A ∩ Bc. These two events are disjoint since any outcome in B can’t possibly be
in Bc or, therefore, A ∩ Bc . We also know that (A ∩ Bc ) ∪ B = A ∪ B, so that
P(A ∪ B) = P(A ∩ Bc ) + P(B) = P(A) – P(A ∩ B) + P(B) = P(A) + P(B) – P(A ∩ B)
from applying property (iii), plugging in P(A) – P(A ∩ B) for P(A ∩ Bc ).
Exercises
1. You have five songs on your playlist, with songs 1 and 2 by Beyoncé and songs 3, 4, and 5 by Pink. You listen to the
playlist in random order, but without repeats (e.g. once song 1 is played, it doesn’t get played again). You continue to
listen until a song by Pink is played. For example, 214 is one possible song sequence (outcome).
(a) What is the sample space S?
(b) What is the event A that song 5 is played?
(c) What is the event B that song 2 is not played?
2. For two events A and B, suppose every outcome in A is also in B. State whether each of the following statements is
true or false, and explain why.
(a) The size of B, denoted |B|, is strictly less than the size of A, denoted |A|.
(b) The union of A and B is B.
(c) Every outcome in Bc is in Ac .
3. Consider randomly selecting a college student. Let V be the event that the student has a paid video-streaming
account, and let M be the event that the student has a paid music-streaming account. Suppose P(V) = 0.7 and P(M) =
0.5.
(a) Is it possible that P(V ∩ M) = 0.6? Why or why not?
(b) Suppose the probability that the student has both a video-streaming account and a music-streaming account is
0.35.
i. What is the probability that the student has at least one of the two types of accounts?
ii. What is the probability that the student has neither type of account?
iii. In terms of V and M, what is the event that the student has a video-streaming account but no music-
streaming account? What is the probability of this event?
4. Consider the experiment of picking a number randomly from the sample space S = {1, 2, 3, …, 999, 1000}. Let A be
the event that the number is a multiple of three, let B = {500, 501, …, 699, 700} be the event that the number is between
500 and 700 (inclusive), and let C be the event that the number is a perfect square (i.e., C = {1, 4, 9, 16, …, 900, 961}).
For this question, use R to create sets (vectors) and perform the necessary set operations.
(a) Create three vectors eventA, eventB, and eventC for the events A, B, and C and a vector samplespace
for S.
(b) Create a vector containing A ∪ B and calculate |A ∪ B|.
(c) Create a vector containing A ∩ C and calculate |A ∩ C|.
(d) Create a vector containing Cc and calculate |Cc |.
(e) Create a vector containing A ∩ B ∩ C and calculate |A ∩ B ∩ C|.
(f) Create a vector containing (A ∩ Bc ) ∪ C and calculate |(A ∩ Bc ) ∪ C|.
5. Suppose that, on any given weekday, 70% of college students eat breakfast, 60% do homework, and 85% do at least
one of these two things.
(a) What is the probability that a randomly selected student eats breakfast and does homework?
(b) What is the probability that a randomly selected student does neither activity?
6. The probability of the union, P(A ∪ B) = P(A) + P(B) – P(A ∩ B), includes outcomes that are in both A and B. Provide
a formula for the probability that exactly one of the events A and B occurs (but not both) in terms of P(A), P(B), and
P(A ∪ B).
7. A company has two research projects R1 and R2 , each of which either results in a patent or not. The probability that
project R1 results in a patent is 0.45, the probability that project R2 results in a patent is 0.15, and the probability that
both projects result in a patent is 0.05.
(a) What is the probability that at least one of the two projects results in a patent?
(b) What is the probability that neither of the two projects results in a patent?
(c) What is the probability that exactly one of the two projects results in a patent?
8. The sample space for the price of a company’s stock on a given day is S = [0, ∞), which consists of all non-negative
real numbers (including zero). In this case, S has an infinite number of possible outcomes. Consider the following
events: A = [80, 100], B = (60, 90], and C = (95, ∞). A square bracket [ or ] indicates that an interval is inclusive for
that endpoint, and a parenthesis ( or ) indicates that an interval is exclusive for that endpoint. Therefore, A is the set
of prices p such that 80 ≤ p ≤ 100, B is the set of prices p such that 60 < p ≤ 90, and C is the set of prices p such that
p > 95. Using the bracket and parenthesis notation and (when necessary) the union operator, what are the following
events?
(a) Cc
(b) Bc
(c) A∩B
(d) A∪C
(e) Ac ∪ B
(f) A ∩ B ∩ Cc
9. At a given company, worker salaries for the following year have just been decided. A worker’s salary can either
increase (I), decrease (D), or remain the same (R). Consider observing whether it increases, decreases, or stays the
same for three different employees.
(a) What is the sample space S?
(b) What is the event A that all three workers have different outcomes?
(c) What is the event B that exactly one of the three workers has a salary increase?
(d) What is the event C that exactly two of the three workers have the same outcome?
(e) Determine the following events: Cc and B ∩ C.
(f) Are A and C disjoint events?
(g) Are A and C collectively exhaustive?
10. A small information systems firm has the resources to respond to two invitations to submit a proposal for a contract.
When a proposal is submitted, it may be accepted outright, rejected outright, or a revision of the proposal may be
requested. If a revision is requested, submission of the revision leads to acceptance or rejection. You may assume that
the firm always submits a revision if it is requested.
(a) What is the sample space? Develop your own notation for this part.
(b) Define the event A to be “both proposals are eventually accepted,” the event B to be “both proposals are
eventually rejected,” and event C to be “a revision is submitted.” Which outcomes belong to each of these
events?
(c) Are A and B disjoint events? Are A and B collectively exhaustive?
(d) Are A and C disjoint events? Are A and C collectively exhaustive?
11. Burger Barn and Patty Palace will both open one new restaurant in Texas next year. Burger Barn is choosing among
four cities: A(ustin), D(allas), H(ouston), and S(an Antonio). Patty Palace is only choosing among three cities (A, D,
H) since it already has too many locations in San Antonio. Here is the probability table associated with their decisions:
                        Burger Barn
                    A      D      H      S
              A   0.07   0.12   0.16   0.12
Patty Palace  D   0.03   0.02   0.15   0.15
              H   0.06   0.04   0.03   0.05
(a) What is the probability that Burger Barn locates in Dallas and Patty Palace locates in Houston?
(b) What is the probability that Burger Barn locates in Dallas?
(c) What is the probability that Patty Palace locates in Houston?
(d) What is the probability that Burger Barn and Patty Palace locate in the same city?
(e) What is the probability that Burger Barn and Patty Palace locate in different cities?
12. An investor has three investment opportunities (A, B, and C) and is asked to rank them in preference order with the
most preferred listed first. Since she is indifferent between the three investments, she ranks them randomly.
(a) What are the outcomes in S, and what are the probabilities for each of the outcomes?
(b) What is the probability that investment C is ranked first?
(c) What is the probability that investment C is ranked first and investment A is ranked last?
13. On a particular day, Steve’s Sneaker Shop is beginning to stock pairs of a highly anticipated new sneaker. Hundreds
of customers are expected at the store. The owner (Steve) decides that one of the first 50 customers will receive a free
pair of the sneakers, and he randomly picks a number between 1 and 50 (inclusive), each being equally likely, before
the store opens.
(a) What is the probability that the fifth customer gets the free pair of sneakers?
(b) What is the probability that the free pair of sneakers is given away before the 20th customer?
(c) Conduct 10,000 simulations in R to approximate the probabilities in (a) and (b), where each simulation involves
a random number being chosen from the vector 1:50.
14.
(a) Modify the R code for 10,000 coin-toss simulations, used to create Figure 2.1 in Section 2.3, to replace the fair
coin with an unfair coin. Specifically, consider an unfair coin that comes up heads with 60% probability. In
addition to changing the random tosses, change where the dotted line appears.
(b) Repeat (a) but with only 100 simulations. How does the figure compare to the one in (a)?
15. At a large technology company, 80% of employees work full time (at least 40 hours per week), 70% have a flexible
work arrangement (that allows them to work partially from home), and 50% have a company-issued laptop computer.
Refer to these three employee characteristics as A, B, and C, respectively, so that P(A) = 0.8, P(B) = 0.7, and P(C) =
0.5. The following things are also true: 60% of employees work full time and have a flexible work arrangement,
38% of employees work full time and have a company-issued laptop computer, 42% of employees have a flexible
work arrangement and have a company-issued laptop computer, and 35% work full time and have a flexible work
arrangement and have a company-issued laptop computer.
(a) What is the probability that an employee has at least one of the three characteristics?
(b) What is the probability that an employee has exactly one of the three characteristics? (For this part, you might
find it easiest to use a Venn diagram.)
(c) What is the probability that an employee has at least two of the three characteristics?
This chapter builds upon the basic probability properties from Chapter 2 to say more about what happens when
there are multiple events. What can be said about the probability that two events both occur? What can be said
about the probability of one event occurring if it is known that the other event occurs? Various types of probabilities
are discussed, including joint probabilities, conditional probabilities, and marginal probabilities, and the important
concept of independence is introduced.
Definition 3.1 P(A), the probability of an event A, is also called the unconditional probability or marginal
probability of A.
We say P(A) is an “unconditional probability” since we do not condition on any other event occurring, in contrast to
the conditional probability introduced below. And, as seen later in this chapter, the terminology “marginal probability”
relates to the idea that this unconditional probability will sometimes appear in the “margin” of a probability table.
Each outcome in the sample space has an unconditional or marginal probability associated with it, and the collection
of these probabilities is the probability distribution:
Definition 3.2 A probability distribution is a complete description of the probabilities associated with every outcome
in the sample space S.
Example 3.1 (Six-sided die) For a fair die and S = {1, 2, 3, 4, 5, 6}, the probability distribution is
P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6.
For any two events, the joint probability is defined as the probability that both events occur:
Definition 3.3 The joint probability of events A and B is the probability P(A ∩ B) that both events occur.
Examples of joint probabilities were seen in Chapter 2. For instance, Example 2.15 considered the joint probability
of rolling at least a 3 (event A) and rolling an even number (event B).
We can now formally define the conditional probability:
Definition 3.4 The conditional probability P(A|B), which is the probability of event A given that event B has
occurred, is
P(A|B) = P(A ∩ B)/P(B)
if P(B) > 0. P(A|B) is often read as “the probability of A given B” or “the probability of A conditional on B.”
The P(B) > 0 condition ensures no division by zero. Having P(B) > 0 is not restrictive since it just means the event
being conditioned upon can actually happen. When we condition on the event B, we focus only on outcomes in B and
ignore any outcomes in the rest of the sample space S (given by Bc ).
The conditional probability is quite different from the joint probability. For the six-sided die example, P(A|B) is the
probability the die roll is at least a 3 if we know that the die roll is an even number. Knowing B has occurred means
that the possible outcomes are {2, 4, 6}, so now what is the probability that the die roll is at least 3? Flipping things
around, P(B|A) is the probability the die roll is an even number if we know that the die roll is at least 3. Knowing A
has occurred tells us that the possible outcomes are {3, 4, 5, 6}, so now what is the probability that the die roll is even?
We can calculate both of these conditional probabilities using Definition 3.4.
Example 3.2 (Six-sided die) The probabilities of A = {3, 4, 5, 6}, a roll of at least 3, and B = {2, 4, 6}, an even roll, are
P(A) = 2/3 and P(B) = 1/2. The joint probability is P(A ∩ B) = 1/3 since A ∩ B = {4, 6}. The conditional probability that the
die roll is at least a 3 given that the die roll is even is
P(A|B) = P(A ∩ B)/P(B) = (1/3)/(1/2) = 2/3,
which makes sense since A ∩ B = {4, 6} has two of the three outcomes in B = {2, 4, 6}. Reversing the roles of A and B,
the conditional probability that the die roll is even given that the die roll is at least a 3 is
P(B|A) = P(B ∩ A)/P(A) = (1/3)/(2/3) = 1/2.
We have P(B ∩ A) = P(A ∩ B) since B ∩ A = A ∩ B. Again, the answer is intuitive since A ∩ B = {4, 6} has two of the four
outcomes in A = {3, 4, 5, 6}.
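Conditional probabilities with equally likely outcomes can also be computed in R by dividing the size of the intersection by the size of the conditioning event. A minimal sketch (the vector names are our own):
> S <- 1:6
> A <- c(3, 4, 5, 6); B <- c(2, 4, 6)
> length(intersect(A, B))/length(B)   # P(A|B)
[1] 0.6666667
> length(intersect(A, B))/length(A)   # P(B|A)
[1] 0.5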
The conditional probability is itself a probability, and its main properties follow directly from the Axioms of
Probability introduced in Chapter 2:
Proposition 3.1. (Properties of conditional probabilities) If B is an event with P(B) > 0, the following properties hold:
(i) 0 ≤ P(A|B) ≤ 1 for any event A.
(ii) If A1, A2, …, Ak is a collection of disjoint events, P(A1 ∪ A2 ∪ · · · ∪ Ak|B) = P(A1|B) + P(A2|B) + · · · + P(Ak|B).
(iii) If A1 , A2 , …, Ak is a collection of disjoint and exhaustive events, P(A1 ∪ A2 ∪ · · · ∪ Ak |B) = 1.
(iv) P(Ac |B) = 1 – P(A|B) for any event A.
(v) P(A1 |B) = P(A1 ∩ A2 |B) + P(A1 ∩ Ac2 |B) for any events A1 and A2 .
(vi) P(A1 ∪ A2 |B) = P(A1 |B) + P(A2 |B) – P(A1 ∩ A2 |B) for any events A1 and A2 .
(vii) If A1 and A2 are disjoint events, P(A1 ∩ A2 |B) = 0.
The properties look very similar to the first and third probability axioms and the properties from Proposition 2.5,
with the difference being that all of the probabilities in Proposition 3.1 are conditional probabilities, conditioning on
the event B.
Proposition 3.2. (Multiplication rule) For any two events A and B,
P(A ∩ B) = P(A|B)P(B) if P(B) > 0
and
P(A ∩ B) = P(B|A)P(A) if P(A) > 0.
Proposition 3.2 follows directly from Definition 3.4 since P(A|B) = P(A ∩ B)/P(B) and P(B|A) = P(A ∩ B)/P(A), respectively. It can
be useful to have the two alternative equations since in some cases you might know P(B) and P(A|B) but not P(A) and
P(B|A), or vice versa.
Example 3.3 (Product returns) Suppose widgets.com sells three types of widgets, each with different purchase
probabilities and return rates, as follows:
• Widget 1: The probability that a widget purchase is Widget 1 is 60%. The probability that a Widget 1 purchase is
returned is 15%.
• Widget 2: The probability that a widget purchase is Widget 2 is 30%. The probability that a Widget 2 purchase is
returned is 25%.
• Widget 3: The probability that a widget purchase is Widget 3 is 10%. The probability that a Widget 3 purchase is
returned is 35%.
Let the events A1 , A2 , and A3 correspond to a purchase being Widget 1, Widget 2, and Widget 3, respectively. We know
P(A1 ) = 0.6, P(A2 ) = 0.3, and P(A3 ) = 0.1. Let R correspond to the event that a widget purchase is returned. We know
the conditional probabilities P(R|A1 ) = 0.15, P(R|A2 ) = 0.25, and P(R|A3 ) = 0.35. Given this information, what is the
joint probability that a widget purchase is Widget 1 and the purchase is returned? We have
P(A1 ∩ R) = P(R|A1 )P(A1 ) = (0.15)(0.6) = 0.09.
(Why not use the alternative formula P(A1 ∩ R) = P(A1 |R)P(R) here?)
How about the joint probability that a widget purchase is Widget 2 and the purchase is not returned? The event of
the purchase not being returned is Rc , so that
P(A2 ∩ Rc ) = P(Rc |A2 )P(A2 ) = (1 – 0.25)(0.3) = 0.225.
We can do even more with conditional probabilities after introducing two important results, the Law of Total
Probability and Bayes’ Theorem. We start with the Law of Total Probability, which is based upon the concept of
partitioning introduced in Chapter 2:
Proposition 3.3. (Law of Total Probability) If A1 , A2 , …, Ak are disjoint events and also exhaustive events (that is,
P(A1 ) + P(A2 ) + · · · + P(Ak ) = 1), then for any event B,
P(B) = P(B|A1)P(A1) + P(B|A2)P(A2) + · · · + P(B|Ak)P(Ak).
Using the disjoint and exhaustive events A1 , …, Ak , the event B can itself be partitioned into k events. These k events
are B ∩ A1 , B ∩ A2 , …, B ∩ Ak . Since A1 , …, Ak are exhaustive events, the union of these k events is equal to B since any
outcome in B must be in one of the k partitions. Applying Axiom 3, then, yields
P(B) = P(B ∩ A1 ) + P(B ∩ A2 ) + · · · + P(B ∩ Ak ).
The Venn diagram in Figure 3.1 illustrates the idea of partitioning. The Venn diagram depicts a partition of the sample
space S into five disjoint and exhaustive events A1 , A2 , A3 , A4 , and A5 . The event B is represented by the gray circle,
and B is itself partitioned into five events: B ∩ A1, B ∩ A2, B ∩ A3, B ∩ A4, and B ∩ A5. (In Figure 3.1, each of the B ∩ Aj
events is non-empty, though it is certainly possible in other examples to have B ∩ Aj = ∅.) Then, the overall probability
of event B is the sum of the five probabilities P(B ∩ A1 ), P(B ∩ A2 ), P(B ∩ A3 ), P(B ∩ A4 ), and P(B ∩ A5 ).
Having P(B) = P(B ∩ A1 ) + P(B ∩ A2 ) + · · · + P(B ∩ Ak ) for disjoint and exhaustive events A1 , …, Ak leads directly
to the Law of Total Probability result. Specifically, from the multiplication rule (Proposition 3.2), P(B ∩ A1 ) =
P(B|A1)P(A1) and similarly for the other partitions, so that:
P(B) = P(B|A1 )P(A1 ) + P(B|A2 )P(A2 ) + · · · + P(B|Ak )P(Ak ).
[Figure 3.1: Venn diagram for event partitions. The sample space S is partitioned into five disjoint and exhaustive events A1, A2, A3, A4, and A5; the circle for event B is split into B ∩ A1, B ∩ A2, B ∩ A3, B ∩ A4, and B ∩ A5.]
Example 3.4 (Product returns) Continuing Example 3.3, the following question can now be answered: For a widget
purchase, what is the probability that the widget is returned? In other words, what is the unconditional probability
P(R)? A direct application of the Law of Total Probability gives
P(R) = P(R|A1 )P(A1 ) + P(R|A2 )P(A2 ) + P(R|A3 )P(A3 )
= (0.15)(0.6) + (0.25)(0.3) + (0.35)(0.1) = 0.2.
The unconditional probability of a return is 0.2 or 20%. From the equation above, this unconditional probability is
a weighted average of the three conditional return probabilities for the different types of widgets, where the weights
are the probabilities of the partitions (here, the probabilities of the three types of widgets). With the unconditional
probability P(R), we can also determine the conditional probability of each widget type given that a widget is returned.
If a widgets.com employee receives a return in a sealed package, what is the probability that this return is Widget
1 or Widget 2 or Widget 3? These conditional probabilities can be determined as follows:
P(A1|R) = P(A1 ∩ R)/P(R) = P(R|A1)P(A1)/P(R) = (0.15)(0.6)/0.2 = 0.45
P(A2|R) = P(A2 ∩ R)/P(R) = P(R|A2)P(A2)/P(R) = (0.25)(0.3)/0.2 = 0.375
P(A3|R) = P(A3 ∩ R)/P(R) = P(R|A3)P(A3)/P(R) = (0.35)(0.1)/0.2 = 0.175
The conditioning information changes the probabilities. Whereas the unconditional probability of A1 is 0.6, the
probability of A1 given that it is a returned widget is 0.45. For A2 , the unconditional probability is 0.3, and the
conditional probability given R is 0.375. For A3 , the unconditional probability is 0.1, and the conditional probability
given R is 0.175.
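Calculations like these are easy to script. A minimal sketch in R for the widget example, storing the purchase probabilities and conditional return rates as vectors (the object names are our own):
> prior <- c(0.6, 0.3, 0.1)    # P(A1), P(A2), P(A3)
> ret <- c(0.15, 0.25, 0.35)   # P(R|A1), P(R|A2), P(R|A3)
> pR <- sum(ret * prior)       # Law of Total Probability
> pR
[1] 0.2
> ret * prior / pR             # conditional probabilities P(A1|R), P(A2|R), P(A3|R)
[1] 0.450 0.375 0.175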
As seen in Example 3.4, probabilities can be updated once new information is incorporated. We move from an
unconditional probability, with no information given, to a conditional probability, where we condition upon new
information. This approach to “updating” probabilities is embodied in a famous result known as Bayes’ Theorem:
Proposition 3.4. (Bayes’ Theorem) If A1 , A2 , …, Ak are disjoint and exhaustive events, then for any event B with
P(B) > 0,
P(Ai|B) = P(B|Ai)P(Ai) / [P(B|A1)P(A1) + P(B|A2)P(A2) + · · · + P(B|Ak)P(Ak)].
To see why Bayes’ Theorem holds, recall that P(Ai|B) = P(Ai ∩ B)/P(B) by the definition of a conditional probability. For the
numerator, P(Ai ∩ B) = P(B|Ai)P(Ai) by the multiplication rule. For the denominator, P(B) = P(B|A1)P(A1) + · · · + P(B|Ak)P(Ak) by the
Law of Total Probability.
In the context of Bayes’ Theorem, the unconditional probability P(Ai ) is sometimes referred to as the prior
probability of event Ai since it is the probability prior to considering any additional information. The conditional
probability P(Ai |B) is sometimes referred to as the posterior probability of event Ai since the probability has been
updated with the information that event B has occurred. Incorporation of conditioning information and moving from a
prior probability to a posterior probability is called Bayesian updating.
Example 3.5 (Vaccination and infection) Suppose the probability that an individual is vaccinated against a certain
disease is 70%. Let V be the event that an individual is vaccinated, with P(V) = 0.7, and let NV = Vc be the event that an individual is
unvaccinated, with P(NV) = 0.3. Let D be the event that an individual is infected with the underlying disease. Suppose
the probability of infection is 5% for vaccinated individuals (P(D|V) = 0.05) and 50% for unvaccinated individuals
(P(D|NV) = 0.5). Then, what is the unconditional probability of infection P(D)? Since V and NV are disjoint and
exhaustive, the Law of Total Probability yields
P(D) = P(D|V)P(V) + P(D|NV)P(NV) = (0.05)(0.7) + (0.5)(0.3) = 0.185.
The probability of infection is a weighted average of the conditional probabilities of infection. With P(D) determined,
we can use Bayes’ Theorem to determine the conditional probability that an individual is vaccinated given that the
individual is infected:
P(V|D) = P(D|V)P(V) / [P(D|V)P(V) + P(D|NV)P(NV)] = (0.05)(0.7) / [(0.05)(0.7) + (0.5)(0.3)] ≈ 0.189.
The conditional probability of vaccination given infection is approximately 18.9%. We could also determine P(NV|D)
using Bayes’ Theorem, but once we know P(V|D) ≈ 0.189, we immediately have P(NV|D) = 1 – P(V|D) ≈ 0.811 or
81.1%.
What would happen if the vaccination rate were much higher, say 90%, so that P(V) = 0.9 and P(NV) = 0.1? With
this higher vaccination rate, the conditional probability of vaccination given infection is
P(V|D) = P(D|V)P(V) / [P(D|V)P(V) + P(D|NV)P(NV)] = (0.05)(0.9) / [(0.05)(0.9) + (0.5)(0.1)] ≈ 0.474.
We get a much higher conditional probability of vaccination given infection in this case. Even though the infection
rate for vaccinated individuals is quite low (5%), the very high unconditional probability of vaccination leads to the
vaccinated making up a very large proportion of the infected individuals, much larger than with a 70% vaccination
rate. In the partition terminology, the vaccinated “partition” of the infected event gets larger when the vaccinated
probability goes up. The interested reader can try this exercise with even higher vaccination rates, like 95% and 99%,
and determine P(V|D).
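This exercise is easy to carry out in R. A minimal sketch, wrapping Bayes’ Theorem in a function of the vaccination rate (the function name and default arguments are our own):
> pVgivenD <- function(pV, pDV = 0.05, pDNV = 0.5) {
+   pDV * pV / (pDV * pV + pDNV * (1 - pV))   # Bayes' Theorem with two partitions
+ }
> pVgivenD(c(0.7, 0.9, 0.95, 0.99))
[1] 0.1891892 0.4736842 0.6551724 0.9082569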
Example 3.6 (Product returns) From Example 3.3, recall that the three types of widgets made up 60%, 30%, and 10%
of purchases with corresponding return rates of 15%, 25%, and 35%. Here is a probability table that corresponds to
these probabilities:
                              Returned
                         Yes (R)   No (Rc)   Total
          Widget 1 (A1)   0.090     0.510    0.600
Purchase  Widget 2 (A2)   0.075     0.225    0.300
          Widget 3 (A3)   0.035     0.065    0.100
          Total           0.200     0.800
The events A1 , A2 , and A3 are disjoint and exhaustive, and the events R and Rc are disjoint and exhaustive. Each of
the bold numbers in the table corresponds to a joint probability. For instance, the 0.090 value is the joint probability
of a Widget 1 purchase and a return: P(A1 ∩ R) = 0.090. The 0.225 value is the joint probability of a Widget 2 purchase
and no return: P(A2 ∩ Rc) = 0.225. Since both the rows and columns have collections of disjoint and exhaustive events,
the joint probabilities in the table must sum to one. In this case, 0.090 + 0.510 + 0.075 + 0.225 + 0.035 + 0.065 = 1.
Both a row and a column labeled “Total” have been included in this probability table. The “Total” column provides
the totals for the joint probabilities in each row, and the “Total” row provides the totals for the joint probabilities in
each column. These “marginal” probabilities are the unconditional probabilities corresponding to the given row or
column that is being summed. For the “Widget 1” row, the 0.600 value is the sum of P(A1 ∩ R) = 0.090 and P(A1 ∩ Rc ) =
0.510, so that P(A1 ) = 0.6. For the “Yes” column, the 0.200 value is the sum of P(A1 ∩ R) = 0.090, P(A2 ∩ R) = 0.075,
and P(A3 ∩ R) = 0.035, so that P(R) = 0.200.
When the joint probabilities are completely specified, as in this case, conditional probabilities can be calculated
quite easily. For instance, the conditional probability of a Widget 1 purchase (A1 ) given that the purchase is returned
(R) is
P(A1|R) = P(A1 ∩ R)/P(R) = 0.090/0.200 = 0.45,
where the numerator is the joint probability 0.090 from the table and the denominator is the marginal probability
P(R) = 0.200 in the “Total” row. Similarly, the conditional probability of a return (R) given a Widget 1 purchase (A1 )
is
P(R|A1) = P(R ∩ A1)/P(A1) = 0.090/0.600 = 0.15,
where the numerator is the joint probability 0.090 from the table and the denominator is the marginal probability
P(A1 ) = 0.600 in the “Total” column.
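Probability tables can be represented in R as matrices, with the marginal probabilities recovered by summing rows and columns. A minimal sketch for this table (the object names are our own):
> tab <- matrix(c(0.090, 0.510, 0.075, 0.225, 0.035, 0.065),
+               nrow = 3, byrow = TRUE,
+               dimnames = list(c("A1", "A2", "A3"), c("R", "Rc")))
> rowSums(tab)   # marginal probabilities P(A1), P(A2), P(A3)
 A1  A2  A3 
0.6 0.3 0.1 
> colSums(tab)   # marginal probabilities P(R), P(Rc)
  R  Rc 
0.2 0.8 
> tab["A1", "R"] / colSums(tab)["R"]   # P(A1|R)
   R 
0.45 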
Example 3.7 (Vaccination and infection) From Example 3.5, recall that the vaccination probability was P(V) = 0.7,
the probability of infection given vaccination was P(D|V) = 0.05, and the probability of infection given non-vaccination
was P(D|NV) = 0.5. The following probability table corresponds to these vaccination/infection probabilities:
                           Infected
                      Yes (D)   No (Dc)   Total
Vaccinated  Yes (V)    0.035     0.665    0.700
            No (NV)    0.150     0.150    0.300
            Total      0.185     0.815
Each of the four joint probabilities comes directly from the information on the vaccination probability and the
conditional infection probabilities. For instance, the 0.035 value is P(V ∩ D) = P(D|V)P(V) = (0.05)(0.7) = 0.035.
From the probability table, the conditional probability of vaccination given infection is
P(V|D) = P(V ∩ D)/P(D) = 0.035/0.185 ≈ 0.189,
which matches the Bayes’ Theorem calculation in Example 3.5.
= (0.06 + 0.02 + 0.01)/(0.01 + 0.02 + 0.06 + 0.02 + 0.01) = 0.75.
Another example is the conditional probability of two dealership A salespeople selling a car (A2 ) given that the
number of salespeople selling a car at the two dealerships are equal to each other. Let E be the event that the number
of salespeople selling a car at the two dealerships are equal to each other, which is
E = (A0 ∩ B0) ∪ (A1 ∩ B1) ∪ (A2 ∩ B2) ∪ (A3 ∩ B3),
the union of “both 0” and “both 1” and “both 2” and “both 3.” The conditional probability is
P(A2|E) = P(A2 ∩ E)/P(E) = 0.11/(0.02 + 0.08 + 0.11 + 0.02) = 11/23.
3.4 Independence
In this section, we consider the concept of independence of events, which essentially means that knowing that
one event occurs does not provide any additional information about the other event. In situations where events are
independent of each other, the calculation of joint probabilities is greatly simplified. We start with the definition of
independent events.
Definition 3.5 Events A and B are independent if P(A|B) = P(A). Events A and B are dependent if P(A|B) ≠ P(A).
The P(A|B) = P(A) condition says that knowing B occurs does not affect the probability of A occurring. The
conditional probability of A given B is the same as the unconditional probability of A. Even though Definition 3.5
specifies that only P(A|B) = P(A) is required for independence, P(A|B) = P(A) immediately implies that P(B|A) = P(B)
(that is, knowing A does not affect the probability of B):
P(B|A) = P(B ∩ A)/P(A) = P(A|B)P(B)/P(A) = P(B).
Therefore, independence of A and B can be established by checking either P(A|B) = P(A) or P(B|A) = P(B). It is not
necessary to check both. Likewise, to show dependence of A and B, either P(A|B) ≠ P(A) or P(B|A) ≠ P(B) can be
checked.
If events A and B are independent, knowing that B occurs doesn’t affect the probability of A. It is also the case that
knowing that B does not occur doesn’t affect the probability of A:
Proposition 3.5. If the events A and B are independent, P(A|Bc ) = P(A).
Example 3.9 (Product returns) Based upon the probability table in Example 3.6, are the events A1 (Widget 1 purchase)
and R (return) independent? Note that P(A1 |R) = 0.09/0.20 = 0.45, which is not equal to the unconditional probability
P(A1 ) = 0.6. Alternatively, note that P(R|A1 ) = 0.15, which is not equal to the unconditional probability P(R) = 0.2.
Intuitively, it makes sense that A1 and R are dependent since the return probability depends upon the type of widget
purchased; that is, knowing that Widget 1 is purchased provides additional information about the return probability.
Example 3.10 (Vaccination and infection) From the probability table in Example 3.7, the vaccination event (V) and
infection event (D) are dependent since P(D|V) = 0.05 and P(D) = 0.185. The dependence of these events arises since
the infection probability changes depending upon whether an individual is vaccinated or non-vaccinated.
Example 3.11 (Two coin tosses) Recall the experiment where two coins are tossed, with sample space S =
{HH, HT, TH, TT}. Let the event H1 correspond to the first toss being heads and event H2 correspond to the second
toss being heads. If the two tosses have nothing to do with each other (that is, the outcome of one toss has no
effect on the outcome of the other toss), then H1 and H2 are independent events with P(H1 |H2 ) = P(H1 ) = 0.5 and
P(H2 |H1 ) = P(H2 ) = 0.5. Due to independence, each of the probabilities for the outcomes in S is equal to 0.25:
P(HH) = P(H1 |H2 )P(H2 ) = P(H1 )P(H2 ) = (0.5)(0.5) = 0.25
P(HT) = P(H1 |H2c )P(H2c ) = P(H1 )P(H2c ) = (0.5)(0.5) = 0.25
P(TH) = P(H1c |H2 )P(H2 ) = P(H1c )P(H2 ) = (0.5)(0.5) = 0.25
P(TT) = P(H1c |H2c )P(H2c ) = P(H1c )P(H2c ) = (0.5)(0.5) = 0.25
In Example 3.11, the joint probabilities for the two coin tosses each simplify to the product of two unconditional
probabilities. This simplification is a general property of independent events:
Proposition 3.6. Events A and B are independent if and only if P(A ∩ B) = P(A)P(B).
This proposition provides an alternative method to check independence without conditional probabilities.
Example 3.12 (Product returns) A1 and R are dependent since P(A1 ∩ R) = 0.09 and P(A1 )P(R) = (0.6)(0.2) = 0.12.
Example 3.13 (Vaccination and infection) V and D are dependent since P(V ∩ D) = 0.035 and P(V)P(D) =
(0.7)(0.185) = 0.1295.
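Proposition 3.6’s product check takes one line in R once the joint and marginal probabilities are stored. A minimal sketch for the widget example (the names are our own):
> pA1_and_R <- 0.090      # joint probability P(A1 ∩ R) from the table
> pA1 <- 0.6; pR <- 0.2   # marginal probabilities
> pA1 * pR                # P(A1)P(R), which differs from 0.090, so A1 and R are dependent
[1] 0.12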
The concept of independence can be extended beyond two events, as follows:
Definition 3.6 (Independence of more than two events) Events A1 , A2 , …, Ak are mutually independent if, for any
subset of the events, the joint probability is equal to the product of the individual probabilities. Equivalently, the
conditional probability of any event Ai given any subset of other events is equal to the unconditional probability of Ai .
For the case of two events, this definition corresponds exactly to the notion of independence in Proposition 3.6. For
three events (A1 , A2 , A3 ), the events are mutually independent if
P(A1 ∩ A2 ) = P(A1 )P(A2 ), P(A1 ∩ A3 ) = P(A1 )P(A3 ), P(A2 ∩ A3 ) = P(A2 )P(A3 )
and
P(A1 ∩ A2 ∩ A3 ) = P(A1 )P(A2 )P(A3 ).
Or, in terms of conditional probabilities, the three events are mutually independent if
P(A1 |A2 ) = P(A1 |A3 ) = P(A1 |A2 ∩ A3 ) = P(A1 ),
P(A2 |A1 ) = P(A2 |A3 ) = P(A2 |A1 ∩ A3 ) = P(A2 ),
and
P(A3 |A1 ) = P(A3 |A2 ) = P(A3 |A1 ∩ A2 ) = P(A3 ).
At this point, it is useful to introduce more concise notation for the intersection (∩) of events. Specifically, we use a
comma in between events to indicate intersection. For example, we write
P(A1 , A2 ) for P(A1 ∩ A2 ),
P(A1 , A2 , A3 ) for P(A1 ∩ A2 ∩ A3 ),
P(A1 |A2 , A3 ) for P(A1 |A2 ∩ A3 ),
P(A1 , A2 |A3 ) for P(A1 ∩ A2 |A3 ),
and so on.
When there are many events, the number of possible products involved in the definition of mutual independence
can be quite large. Consider the case of 50 events (A1 , A2 , …, A50 ), for which mutual independence is equivalent to the
following:
P(A1 , A2 ) = P(A1 )P(A2 ) and similarly for any two events in {A1 , …, A50 },
P(A1 , A2 , A3 ) = P(A1 )P(A2 )P(A3 ) and similarly for any three events in {A1 , …, A50 },
⋮
P(A1 , A2 , …, A49 ) = P(A1 )P(A2 ) · · · P(A49 ) and similarly for any 49 events in {A1 , …, A50 },
and P(A1 , A2 , …, A50 ) = P(A1 )P(A2 ) · · · P(A50 ).
Thankfully, in practice, we rarely check all the probability-product possibilities for such a large number of events.
Instead, the more common situation is that we assume mutual independence of a collection of events and then use
the implied probability-product equalities. For instance, for 50 tosses of a coin, we might assume that the events
corresponding to heads on each toss (denoted H1 , H2 , …, H50 ) are mutually independent since the result of one coin
toss should not depend on the result of any other coin toss(es). With independence of the 50 coin tosses, it is then easy
to calculate joint probabilities involving outcomes on any of the individual tosses. As one example, the probability of
heads on tosses 10, 20, and 30 is P(H10, H20 , H30 ) = P(H10 )P(H20 )P(H30 ) = (0.5)(0.5)(0.5) = 0.125. In fact, since heads
and tails have the same probability 0.5 for every toss, the probability of any joint outcome of three tosses is going to
be equal to 0.125.
Example 3.14 (Three website visitors) Recall Example 2.3, which considered the purchase behavior of the first three
visitors to a website on a given day, with sample space
S = {YYY, YYN, YNY, YNN, NYY, NYN, NNY, NNN}.
Let A1, A2, and A3 denote the events that the first, second, and third visitor makes a purchase, respectively:
A1 = {YYY, YYN, YNY, YNN}
A2 = {YYY, YYN, NYY, NYN}
A3 = {YYY, YNY, NYY, NNY}
Assume that A1 , A2 , and A3 are mutually independent, which is sensible if the purchase behavior of one website
visitor is not affected by the purchase behavior of other website visitors. Moreover, assume that the unconditional
purchase probability of any given website visitor is 20%, so that P(A1 ) = P(A2 ) = P(A3 ) = 0.2. The unconditional
probability of non-purchase by any website visitor is 80%, so that P(Ac1 ) = P(Ac2 ) = P(Ac3 ) = 0.8. With the mutual
independence assumption, we can determine the probability for any outcome in S. For instance, the probability of
YNY (first and third visitors make a purchase, second visitor does not) is P(A1 )P(Ac2 )P(A3 ) = (0.2)(0.8)(0.2) = 0.032.
With the ability to calculate these joint probabilities, the probabilities associated with the total number of purchases
can also be determined. Let B0 , B1 , B2 , and B3 denote the events that zero, one, two, and three total purchases are
made, respectively:
B0 = {NNN}, B1 = {YNN, NYN, NNY}, B2 = {YYN, YNY, NYY}, B3 = {YYY}.
Their probabilities are:
P(B0) = (0.8)(0.8)(0.8) = 0.512                                      (outcome NNN)
P(B1) = (0.2)(0.8)(0.8) + (0.8)(0.2)(0.8) + (0.8)(0.8)(0.2) = 0.384  (outcomes YNN, NYN, NNY)
P(B2) = (0.2)(0.2)(0.8) + (0.2)(0.8)(0.2) + (0.8)(0.2)(0.2) = 0.096  (outcomes YYN, YNY, NYY)
P(B3) = (0.2)(0.2)(0.2) = 0.008                                      (outcome YYY)
These four probabilities sum to one since B0 , B1 , B2 , and B3 are disjoint and exhaustive events.
Does the total number of purchases convey any useful information about whether a given individual makes a
purchase? Let’s focus on the first website visitor and consider the conditional probabilities P(A1 |B0 ), P(A1 |B1 ),
P(A1 |B2 ), and P(A1 |B3 ):
P(A1|B0) = P(A1, B0)/P(B0) = 0/0.512 = 0
P(A1|B1) = P(A1, B1)/P(B1) = (0.2)(0.8)(0.8)/0.384 = 1/3
P(A1|B2) = P(A1, B2)/P(B2) = [(0.2)(0.2)(0.8) + (0.2)(0.8)(0.2)]/0.096 = 2/3
P(A1|B3) = P(A1, B3)/P(B3) = (0.2)(0.2)(0.2)/0.008 = 1
These four conditional probabilities make intuitive sense. If no purchases occur (B0 ), the first visitor can’t possibly
make a purchase, so P(A1 |B0 ) = 0. If three purchases occur (B3 ), the first visitor must make a purchase, so P(A1 |B3 ) = 1.
When one total purchase is made (B1 ), the probability that the purchaser is the first visitor is 1/3 since there’s nothing
inherently different about the three visitors, meaning their chance of being the purchaser should be the same (1/3 each).
When two total purchases are made (B2 ), the probability that the purchaser is the first visitor is 2/3 since the chance
of any of the three visitors being the non-purchaser should be the same (1/3 each). So, it’s certainly the case that
information about the total number of purchases is useful for updating the probability that the first visitor makes a
purchase (A1 ). One can check that A1 and B0 are dependent, as are A1 and B1 , A1 and B2 , and A1 and B3 .
Likewise, whether or not a given individual makes a purchase is also informative about the number of total purchases
that are made. For instance, if the first visitor makes a purchase (A1 ), then
P(B0|A1) = P(A1, B0)/P(A1) = 0/0.2 = 0
P(B1|A1) = P(A1, B1)/P(A1) = (0.2)(0.8)(0.8)/0.2 = 0.64
P(B2|A1) = P(A1, B2)/P(A1) = [(0.2)(0.2)(0.8) + (0.2)(0.8)(0.2)]/0.2 = 0.32
P(B3|A1) = P(A1, B3)/P(A1) = (0.2)(0.2)(0.2)/0.2 = 0.04
These conditional probabilities differ from the unconditional probabilities of B0 , B1 , B2 , and B3 .
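A short simulation in R can confirm these conditional probabilities. A minimal sketch, simulating many triples of independent visitors (the names and the 10,000 replication count are our own choices):
> set.seed(1)
> n <- 10000
> v1 <- runif(n) < 0.2   # TRUE when visitor 1 purchases (probability 0.2)
> v2 <- runif(n) < 0.2
> v3 <- runif(n) < 0.2
> total <- v1 + v2 + v3
> mean(v1[total == 1])   # should be close to P(A1|B1) = 1/3
> mean(v1[total == 2])   # should be close to P(A1|B2) = 2/3
The two sample proportions should land near 1/3 and 2/3, with the simulation error shrinking as n grows.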
When a collection of events is not mutually independent, it can be more complicated to determine joint
probabilities. The usual approach is to consider successive conditioning. This idea has been seen in the context of
two events, where
P(A1 , A2 ) = P(A2 |A1 )P(A1 ) or P(A1 , A2 ) = P(A1 |A2 )P(A2 ).
For three events, this idea can be extended to
P(A1 , A2 , A3 ) = P(A3 |A1 , A2 )P(A2 |A1 )P(A1 ).
The joint probability of A1 ∩ A2 ∩ A3 , which is P(A1 , A2 , A3 ), is the probability that A1 occurs times the probability A2
occurs given A1 occurring times the probability A3 occurs given both A1 and A2 occurring. The order of events can be
switched around however we would like. Depending upon the problem, it might be easier to find the joint probability
with one ordering rather than another. For three events, the following successive conditioning equations are also true:
P(A1 , A2 , A3 ) = P(A2 |A1 , A3 )P(A3 |A1 )P(A1 ),
P(A1 , A2 , A3 ) = P(A3 |A1 , A2 )P(A1 |A2 )P(A2 ),
P(A1 , A2 , A3 ) = P(A1 |A2 , A3 )P(A3 |A2 )P(A2 ),
P(A1, A2, A3) = P(A2|A1, A3)P(A1|A3)P(A3), and
P(A1, A2, A3) = P(A1|A2, A3)P(A2|A3)P(A3).
The probability that the first heads appears on the 10th toss is P(A10) = 1/2^10 = 1/1024 ≈ 0.00098, or approximately 0.098%.
While this probability of A10 is quite low, it’s not equal to zero, so it is possible that A10 occurs. This is true even for
much larger k. While the probability of Ak gets smaller and smaller, it never reaches zero.
In the repeated coin toss example, there are an infinite number of possible events A1 , A2 , A3 , … corresponding to
when the first heads is observed. These events are disjoint. For their probabilities to constitute a probability distribution,
it must be the case that they sum to one. To show that P(A1) + P(A2) + P(A3) + · · · = 1/2 + 1/2² + 1/2³ + · · · is equal to
one, we introduce a general fact about the sum of an infinite geometric series:
Proposition 3.7. For real numbers a and r, with |r| < 1, the sum of an infinite geometric series is
a + ar + ar² + ar³ + · · · = a/(1 – r).
The series is “geometric” since each successive term in the series is multiplied by the same constant r and “infinite”
since the terms in the series continue forever. The condition |r| < 1 ensures that the sum of the series has a well-defined
value rather than diverging to infinity or negative infinity.
Example 3.17 (Coin tosses until a head) Proposition 3.7 can be applied with a = 1/2 and r = 1/2 to yield
P(A1) + P(A2) + P(A3) + · · · = (1/2)/(1 – 1/2) = 1,
confirming that the probabilities constitute a probability distribution over the sample space S = {1, 2, 3, …}. We can
use Proposition 3.7 to determine other probabilities for this experiment. For example, the probability that it takes at
least four tosses to see heads is
P(A4) + P(A5) + P(A6) + · · · = 1/16 + 1/32 + 1/64 + · · · = (1/16)/(1 – 1/2) = 1/8.
Alternatively, to determine this probability, we can think about the complement of the event that the heads occurs
during the first three tosses, whose probability is
1 – (P(A1) + P(A2) + P(A3)) = 1 – (1/2 + 1/4 + 1/8) = 1/8.
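Both routes to the answer are easy to check numerically in R by truncating the infinite series at a large number of terms. A minimal sketch:
> sum(0.5^(1:50))   # P(A1) + P(A2) + ..., truncated at 50 terms
[1] 1
> sum(0.5^(4:50))   # P(A4) + P(A5) + ..., which equals 1/8
[1] 0.125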
Example 3.18 (Website visitors until a purchase) The example of purchase behavior for three website visitors can be
extended to allow for an infinite stream of website visitors. Assume independence of each visitor’s purchase behavior;
that is, for any subset of visitors, the individual purchase events are mutually independent. Assume that the probability
of a purchase by any given visitor is 20%. Similar to the coin toss example, let Ak denote the event corresponding to
the first purchase being made by the k-th website visitor. The probabilities of A1 , A2 , … are
P(A1) = 0.2                             (outcome Y)
P(A2) = (0.8)(0.2) = 0.16               (outcome NY)
P(A3) = (0.8)(0.8)(0.2) = 0.128         (outcome NNY)
⋮
P(Ak) = (0.8)^(k–1)(0.2)
⋮
For Ak to occur, the first k – 1 visitors do not make a purchase and then the k-th visitor makes a purchase, which
corresponds to P(Ak) = (0.8)^(k–1)(0.2). These probabilities sum to one since Proposition 3.7 can be applied with a = 0.2
and r = 0.8. What is the probability that it takes at least ten website visitors before a purchase is observed? Applying
Proposition 3.7 yields
P(A10) + P(A11) + · · · = (0.8)^9(0.2) + (0.8)^10(0.2) + · · · = (0.8)^9(0.2)/(1 – 0.8) = (0.8)^9 ≈ 0.134.
We get (0.8)^9 as the probability here, which is equal to the probability that the first nine visitors are non-purchasers,
an event that is equivalent to seeing at least ten visitors before a purchase.
Example 3.19 (Number of patents) Let the sample space S = {0, 1, 2, 3, …} contain the possible outcomes for the
number of patents that a firm is awarded in a given year. Let Ak denote the event that the firm is awarded k patents
in a given year. Assume that the probability of being awarded no patents is P(A0 ) = 0.5 and the probability of being
awarded one patent is P(A1 ) = 0.3. If the probabilities for Ak for k ≥ 2 (events associated with two or more patents) are
P(Ak) = 0.3c^(k–1) for some constant c, what must the value of c be? Note that P(A2) = 0.3c, P(A3) = 0.3c², P(A4) = 0.3c³,
and so on. Since the events A0, A1, A2, … are disjoint and exhaustive, their probabilities must sum to one:
P(A0) + P(A1) + P(A2) + P(A3) + · · · = 0.5 + 0.3 + 0.3c + 0.3c² + · · · = 0.5 + 0.3/(1 – c),
so that c = 0.4. For this value of c, P(A2 ) = 0.12, P(A3 ) = 0.048, P(A4 ) = 0.0192, and so on.
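As a numerical sanity check, the probabilities can be summed in R after truncating the series at a large number of terms (a minimal sketch; the name c0 avoids masking R’s built-in c function):
> c0 <- 0.4
> 0.5 + sum(0.3 * c0^(0:100))   # P(A0) + P(A1) + P(A2) + ...
[1] 1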
Exercises
1. A survey of college students with cell phones finds that 50% have an iPhone (I), 30% have a Samsung phone (S), and
20% have another phone (A). Moreover, 70% of iPhone owners have unlimited data plans, 60% of Samsung owners
have unlimited data plans, and 45% of other phone owners have unlimited data plans. Let U denote the event for
having an unlimited data plan.
(a) What is the probability that a surveyed student has a Samsung phone and an unlimited data plan?
(b) What is the probability that a surveyed student has an unlimited data plan?
(c) For a surveyed student with an unlimited data plan, what is the probability that they own an iPhone?
2. The percentages of blood types in the United States (ignoring positive and negative sub-types) are as follows: A
(40%), B (11%), AB (4%), and O (45%). Suppose two individuals are drawn randomly from the population and that
their blood types are independent of each other.
(a) What is the probability that both individuals have blood type A?
(b) What is the probability that the blood types of the two individuals are the same?
(c) What is the probability that the blood types of the two individuals are different?
3. You have three different e-mail accounts, with 70% of e-mails coming into account A, 20% coming into account B,
and 10% coming into account C. The likelihood that an e-mail is spam is 1%, 2%, and 5% for the three accounts,
respectively.
(a) What is the probability that a randomly selected e-mail is spam?
(b) If you know that a received e-mail is spam, what is the probability that it is in account A?
4. An inventor submits two patent applications. Let P1 be the event that the first patent application is successful, and
let P2 be the event that the second patent application is successful. Assume that P1 and P2 are independent, with
P(P1 ) = 0.6 and P(P2 ) = 0.3.
(a) If the first patent application is unsuccessful, what is the probability that the second application is successful?
(b) What is the probability that at least one of the two patent applications is successful?
(c) If you know that at least one of the patent applications is successful, what is the probability that only the second
application is successful?
5. One of two possible suspects (Bert or Ernie) commits a crime. Bert has more of a checkered past than Ernie, so
suppose the prior probability of Bert committing the crime is 70% and the prior probability of Ernie committing
the crime is 30%. There was an eyewitness to the crime who says that she saw Ernie do it. The probability that the
eyewitness is correct is equal to 90%. That is, the probability of the eyewitness saying she saw Ernie given that Ernie
committed the crime is 0.9. Similarly, the probability of the eyewitness saying she saw Bert given that Bert committed
the crime is 0.9. What is the posterior probability that Ernie committed the crime given the eyewitness testimony?
6. The probability that an individual in the United States has a certain disease is 5%. There is a diagnostic test that
correctly detects the disease 97% of the time (3% false negative) and correctly detects absence of the disease 99% of
the time (1% false positive). A randomly selected individual has a diagnostic test, and the result is positive (detects the
disease). What is the posterior probability that the individual has the disease?
7. At a certain large company, 70% of the economists and 50% of the data scientists have a graduate degree.
Furthermore, 80% of the employees at the company are economists and 20% are data scientists. If a
randomly chosen employee has a graduate degree, what is the probability that the employee is an economist?
8. For a certain election, the probability of a registered voter being a “young voter” (under 40) is 35%, and the
probability of a registered voter being an “older voter” (at least 40) is 65%. Registered young voters are less likely to
actually vote (55% probability) in the election than registered older voters (75% probability). If a randomly selected
registered voter did not vote in the election, what is the probability that it is a young voter?
9. The use of generative artificial intelligence (AI) tools to assist with writing has become prevalent. It is sometimes
possible to detect the use of an AI tool based upon words that are more frequently used by an AI tool as compared to a
human. This question considers a simple example. Suppose that, prior to the introduction of AI tools, the probability
that a student uses the word “delve” in a submitted paper is 3%, whereas after the introduction of AI tools, the
probability jumps to 15%. Let g denote the probability that a student uses an AI tool (after its introduction). Assume
that the probability that a student uses the word “delve” remains at 3% conditional on the student not using the AI tool
(after its introduction).
(a) If g = 1 (all students use the AI tool), what is the probability that a student uses “delve” in a submitted paper
after the introduction of AI tools?
(b) For this part, assume that the probability that a student uses the AI tool is equal to 40% (that is, g = 0.4).
i. What is the probability that a student uses “delve” in a submitted paper if the student uses an AI tool?
ii. If “delve” is used in a submitted paper, what is the probability that the student used an AI tool?
(c) *For this part, assume that g is unknown. What is the lowest possible value of g?
10. Two companies, Acme Manufacturing and Bolts Emporium, have shares of stock that trade on an exchange. On any
given day, the following probability table describes the performance of the two stocks, where each stock can go up in
price, go down in price, or stay the same.
                                  Bolts Emporium
                                Down   No Change    Up
                    Down        0.18      0.02     0.22
Acme Manufacturing  No Change   0.03      0.02     0.04
                    Up          0.14      0.05     0.30
(a) What is the probability that Acme Manufacturing’s stock goes up on a given day?
(b) What is the probability that Bolts Emporium’s stock has some change (“Down” or “Up”) on a given day?
(c) What is the probability that at least one of the two stocks has no change on a given day?
(d) Are the events “Acme Manufacturing’s stock goes up” and “Bolts Emporium’s stock goes up” independent?
(e) What is the probability that Acme Manufacturing’s stock goes up on a day that Bolts Emporium’s stock goes
up?
(f) What is the probability that Acme Manufacturing’s stock goes up on a day that Bolts Emporium’s stock doesn’t
go down?
(g) If exactly one of the two stocks goes up on a given day, what is the probability that it is Acme Manufacturing?
11. The table below gives the joint counts of age and rank of the faculty of a major university in a recent year. There are
a total of 1164 faculty members in the table.
                        Rank
               Assistant   Associate   Full
     Under 30      57           3        2
     30-39        163         170       54
Age  40-49         61         125      160
     50-59         36          68      155
     60 and over    3          15       92
Interpret the relative frequencies as true probabilities for this question. For example, the joint probability that a
randomly chosen professor is under 30 and an assistant professor is 57/1164.
(a) What is the probability that a randomly chosen faculty member is a full professor?
(b) What is the probability that a randomly chosen faculty member is 50+ years old?
(c) What is the probability that a randomly chosen faculty member is 50+ years old and a full professor?
(d) What is the probability that a randomly chosen faculty member is 50+ years old or a full professor?
(e) What is the probability that a professor who is 50+ years old is a full professor?
(f) What is the probability that a full professor is 50+ years old?
12. Suppose that, on any given weekday, 70% of college students eat breakfast, 60% do homework, and 85% do at least
one of these two things.
(a) Are the events “student eats breakfast” and “student does homework” independent?
(b) If a randomly selected student eats breakfast, what is the probability they do homework?
(c) Suppose two students are selected at random and their behaviors are independent of each other. What is the
probability that exactly one of the two students does homework?
(d) Suppose 80% of students that eat breakfast on a given day also eat lunch that day, while 90% of students that
don’t eat breakfast on a given day eat lunch that day. For a student who eats lunch on a given day, what is the
probability they eat breakfast on that day?
13. Suppose A and B are independent events.
(a) Show that A and Bc are independent events.
(b) Show that Ac and Bc are independent events.
14. Consider a group of 10 stocks, with 20% (two stocks) corresponding to “tech” companies and 80% (eight stocks)
corresponding to “non-tech” companies. Suppose you randomly pick two (different) stocks, with T1 denoting that the
first stock is “tech” and T2 denoting that the second stock is “tech.”
(a) Determine P(T1 ), P(T2 ), and P(T1 ∪ T2 ). Are T1 and T2 independent?
(b) Repeat (a), but assume that there are 100 stocks with 20% (20 stocks) corresponding to “tech” companies.
(c) Repeat (a), but assume that there are 1,000 stocks with 20% (200 stocks) corresponding to “tech” companies.
(d) In thinking about the independence checks in (a)-(c), what can you say about independence becoming
“approximately” correct as the number of stocks increases?
15. Four employees at a given company individually go out for lunch. Suppose they each randomly and independently
choose a restaurant from among seven choices.
(a) What is the probability that each employee goes to a different restaurant?
(b) What is the probability that all four employees go to the same restaurant? (The restaurant can be any one of the
seven possibilities.)
(c) Conduct 10,000 computer simulations in R to confirm your answers to (a) and (b).
16. A space technology company is attempting to launch four satellites into orbit, using four separate rockets. Assume
that the probability of success for each launch is 90% and that the launches are independent of each other.
(a) What is the probability distribution of the total number of successful launches?
(b) Conditional on knowing that at least three launches were successful, what is the probability that exactly three
launches were successful?
17. You have three quarters, two dimes, three nickels, and one penny in your pocket. When removing a coin from your
pocket, you may assume that there is an equal chance of picking any of the coins that are in your pocket.
(a) If you randomly remove one coin, what is the probability that a quarter is not chosen?
(b) If you randomly remove one coin at a time, without putting coins back into your pocket, what is the probability
that the first quarter is not chosen until the third coin or later?
(c) Rather than being interested in the sequence of coins, suppose you are interested in the total amount of money
that you take out of your pocket before the first quarter is taken out. What are the outcomes in the sample space
S for this experiment?
(d) Conduct 10,000 computer simulations in R to approximate the probability that exactly $0.15 is taken out of
your pocket before the first quarter is taken out.
18. Suppose you repeatedly and independently roll a fair six-sided die. Let S1 denote the event that the first six appears
on the first roll, S2 denote the event that the first six appears on the second roll, and so on.
(a) What is P(S1 )?
(b) What is P(Sk ) for an arbitrary k ∈ {1, 2, 3, …}?
(c) Show that the P(Sk ) probabilities sum to one.
(d) What is the probability that it takes at least ten rolls for a six to appear?
(e) What is the probability that the first six appears on an even roll (i.e., for k ∈ {2, 4, 6, 8, …})?
(f) What is the probability that the first six appears on the second roll given that the first six appears on an even
roll?
(g) Conduct 10,000 computer simulations in R to confirm your answer in (d). For each simulation, use a while
loop to repeatedly simulate a die roll until a six appears.
(h) Modifying your code from (g) as necessary, what is the average number of rolls that it takes for a six to appear
over the 10,000 simulations?
19. Two individuals A and B play the following dice game: the players alternate rolling a fair six-sided die, and a player
wins if she rolls the same number as the previous player did. The players continue alternating rolls until someone
wins. Assume that player A starts and that they can’t win on the first roll (since there was no previous roll). What is
the probability that player A wins?
20. An electronics store (Electric City) sells laptop computers. On any given day, the probability that Electric City
doesn’t sell a laptop computer is 0.35, and the probability that it sells k laptop computers (for k > 0) is 0.25c^(k–1).
(a) What is the value of c?
(b) What is the probability that the store sells at least three laptop computers?
(c) Another store in the same city (Computer King) also sells laptop computers. On any given day, this store has
probabilities 0.20, 0.50, and 0.30 of selling zero, one, and two laptop computers, respectively. If sales at the
two stores are independent of each other, what is the probability that Electric City sells more laptop computers
than Computer King on any given day?
(d) Using the same information provided in (c), on any given day, what is the probability that Electric City sells
exactly two laptop computers given that Electric City sells more laptop computers than Computer King?
21. *A gambler is playing a casino game where they win $100 with probability p and lose $100 with probability 1 – p,
with p < 0.5 (so that the casino has an advantage).
(a) Suppose the gambler starts with $200. If the gambler repeatedly plays the game until they lose all their money
or until they have $400, what is the probability (in terms of p) that they lose all their money?
(b) Conduct 10,000 simulations in R to confirm your answer in (a) for the values p = 0.48 (small casino advantage)
and p = 0.45 (large casino advantage).
22. *In college volleyball, the first team to score 25 points wins a set. But, if the teams are tied at 24-24, the set continues
until one team has two points more than the other team (that is, 26-24 or 27-25 or 28-26 and so on). Suppose two teams,
call them team A and team B, have reached a 24-24 score. Team A is the better team, and the probability that team A
wins any future point is 60%. Assume that all future points are mutually independent.
(a) What is the probability that team A wins the set by a 26-24 score?
(b) What is the probability that the set becomes tied at 25-25?
(c) What is the probability that team A wins the set?
(d) Conditional on team A winning the set, what is the probability that team A wins the set by a 26-24 score?
(e) Conduct 10,000 simulations in R to confirm your answer in (c). For each simulation, use a while loop to
repeatedly simulate points until one of the two teams wins.
(f) Modify your code from (e) to record, for each simulated game, the number of points scored by the winning
team. What is the average winning score over the 10,000 simulations?
Combinatorics is a topic in mathematics that involves counting methods. In many interesting situations, our ability to
determine probabilities requires that we count and/or enumerate the possible outcomes. Chapters 2 and 3 considered
some simple examples (e.g., sequences of coin-toss outcomes and sequences of customer-purchase outcomes) of
counting different types of outcomes. As another example, consider a lottery where the winning numbers are four
distinct numbers that the lottery authority chooses from the set {1, 2, …, 30}. How many different ways can the four
numbers be drawn? If you were to play the lottery at random (i.e., randomly picking four numbers from the 30 possible
numbers), what is your chance of winning the lottery?
is (8)(12)(4) = 384. If you’re restricted to picking just one of the funds, the number of choices from the sum rule is
8 + 12 + 4 = 24.
Definition 4.1 An ordered subset of distinct choices is called a permutation, and Pn,k denotes the number of
permutations of size k that can be formed from n objects (for k ≤ n).
Proposition 4.5. (Number of permutations) The number of permutations of size k that can be formed from n objects
(for k ≤ n), denoted Pn,k , is
Pn,k = n!/(n – k)!,
where j! = (1)(2) · · · (j), read as “j factorial,” is the product of all positive integers up through j. An equivalent formula
is
Pn,k = (n)(n – 1) · · · (n – k + 1).
Example 4.8 (Board of directors with titled positions) The number of permutations for the three titled board positions
in Example 4.6 is P20,3 = 20!/17! = (20)(19)(18) = 6,840.
factorial(20)/factorial(17)
## [1] 6840
20*19*18
## [1] 6840
Example 4.9 (Stock portfolio with different weights) Suppose there are 100 possible stocks, and you want to form
a 30%/25%/20%/15%/10% weighted portfolio of five stocks. How many possible portfolios are there? The order of
the five stocks matters here since a different weight is being placed on each of the five choices. Then, the number of
portfolio choices is P100,5 = 100!/95! = (100)(99)(98)(97)(96) = 9,034,502,400 (over 9 billion!).
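This count is easy to verify in R; using prod on the vector 100:96 avoids computing very large factorials:
prod(100:96)
## [1] 9034502400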
Definition 4.2 An unordered subset of choices is called a combination, and Cn,k denotes the number of combinations
of size k that can be formed from n objects (for k ≤ n).
Proposition 4.6. (Number of combinations) The number of combinations of size k that can be formed from n objects
(for k ≤ n), denoted Cn,k , is
Cn,k = Pn,k/k! = n!/(k!(n – k)!).
Sometimes the binomial-coefficient notation, read as “n choose k” and written here as (n choose k), is used as an
alternative to the Cn,k notation, so that
(n choose k) = Cn,k = n!/(k!(n – k)!).
Example 4.10 (Board of directors with no titled positions) The number of combinations for the three (untitled)
board positions in Example 4.7 is C20,3 = (20 choose 3) = 20!/(3!17!) = ((20)(19)(18))/((3)(2)(1)) = 1,140.
Proposition 4.6 provides an exact relationship between C20,3 and P20,3, specifically that C20,3 = P20,3/3! = P20,3/6.
As discussed above, any ordered choice of three members is equivalent to five other ordered choices of those three
members. Why is that the case? For a group of three members, there are six different orderings: three choices for the
first member, then two choices for the second member, and then one choice for the third member, which is a total of
(3)(2)(1) = 3! = 6. The division of P20,3 by 3! in the C20,3 formula accounts for the fact that the six possible orderings
of any three members are equivalent to just one unordered group of the three members. The R function choose(n,k)
calculates (n choose k) = Cn,k.
choose(20,3)
## [1] 1140
Example 4.11 (Stock portfolio with equal weights) Suppose there are 100 possible stocks. You want to invest equally
(20% each) in five different stocks. How many possible portfolios are there? The order of the five stocks does
not matter since it’s an equally weighted portfolio. Then, the number of portfolio choices is C100,5 = (100 choose 5) =
100!/(5!95!) = ((100)(99)(98)(97)(96))/((5)(4)(3)(2)(1)) = 75,287,520, still a large number but smaller than P100,5 by
a factor of 5! = 120.
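The choose function provides a quick check of this count:
choose(100,5)
## [1] 75287520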
Example 4.12 (Choice of bonus stocks) We return to Example 4.5, where we considered an on-line trading app
bonus in which the user is allowed to choose three shares from the set of ten possible stocks {S1 , S2 , …, S10 }. The
important difference from Example 4.11 is that the user is not restricted to picking three distinct stocks. There are
three possibilities: (i) the user’s choice of shares involves three distinct stocks, (ii) the user’s choice of shares involves
two distinct stocks, with two shares of one stock and one share of the other stock, and (iii) the user’s choice of shares
involves one stock, with all three shares being that stock. If we can determine the number of possible choices associated
with each of these three possibilities, the sum rule can be used to determine the total number of possible choices. For
case (i), where the choice of shares involves three distinct stocks, the number of possible choices is C10,3 = (10 choose 3) = 120
since the order of the three stocks doesn’t matter (each has one share). For case (ii), where the choice of shares involves
two distinct stocks, the number of possible choices is P10,2 = (10)(9) = 90 since the order of the two stocks matters (one
has two shares, the other has one share). For case (iii), where the choice of shares involves one stock, the number of
possible choices is just 10. Thus, by the sum rule, the total number of possible choices is 120 + 90 + 10 = 220.
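A one-line check of this sum-rule calculation in R:
choose(10,3) + 10*9 + 10
## [1] 220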
What if the department also needs to specify a chair for each of the three committees? We can think of there being
m = 7 subsets now, rather than m = 4 subsets, as each of the three committees is effectively split into two subsets, one
for the chair member (1 faculty member) and one for the non-chair members (11 for admissions, 7 for curriculum, 3
for alumni relations). The number of possible ways to form the committees, with the chairs specified, is considerably
larger than the answer above and is equal to
(26 choose 11, 1, 7, 1, 3, 1, 2) = 26!/(11!1!7!1!3!1!2!) = 167,051,941,056,000.
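A quick numerical check in R (the 1! terms equal one and are omitted):
factorial(26)/(factorial(11)*factorial(7)*factorial(3)*factorial(2))
## [1] 1.670519e+14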
Example 4.15 (Stock portfolio with sectors) Consider again the problem of forming a five-stock equally weighted
portfolio, with 20% weight on each stock, but now suppose 20 of the 100 stocks are classified as “tech” stocks and 80
of the 100 stocks are classified as “non-tech” stocks. How many equally weighted five-stock portfolios have exactly
two “tech” stocks and three “non-tech” stocks? There are C20,2 = (20 choose 2) ways of choosing the “tech” stocks
(the ordering of the two stocks doesn’t matter) and C80,3 = (80 choose 3) ways of choosing the “non-tech” stocks, so
that, by the product rule, the total number of five-stock portfolios with two “tech” stocks and three “non-tech” stocks
is (20 choose 2)(80 choose 3).
If we form a five-stock equally weighted portfolio by picking five stocks from the 100 total at random, what is
the probability the portfolio has two “tech” stocks and three “non-tech” stocks? We’ve already found the relevant
numerator, which is (20 choose 2)(80 choose 3). The relevant denominator is the total number of possible five-stock
portfolios, which is C100,5 = (100 choose 5). Thus, the probability is
(20 choose 2)(80 choose 3)/(100 choose 5) = [((20)(19))/((2)(1))] × [((80)(79)(78))/((3)(2)(1))] / [((100)(99)(98)(97)(96))/((5)(4)(3)(2)(1))] ≈ 0.207 or 20.7%.
How about the probability that a randomly chosen five-stock portfolio has at most two “tech” stocks? Using similar
reasoning, the number of five-stock portfolios with zero “tech” stocks is (20 choose 0)(80 choose 5) and the number
with one “tech” stock is (20 choose 1)(80 choose 4). Thus, the probability of having at most two “tech” stocks in a
randomly chosen five-stock portfolio is
[(20 choose 0)(80 choose 5) + (20 choose 1)(80 choose 4) + (20 choose 2)(80 choose 3)]/(100 choose 5) ≈ 0.947 or 94.7%.
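Both probabilities are straightforward to verify with the choose function:
choose(20,2)*choose(80,3)/choose(100,5)
## [1] 0.2073438
(choose(20,0)*choose(80,5) + choose(20,1)*choose(80,4) + choose(20,2)*choose(80,3))/choose(100,5)
## [1] 0.9467972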
Although we have used combinatorics to answer interesting probability questions in the examples above, some
problems are too complicated to solve analytically in this way (or at least complicated enough that you would rather
not try!). Such an example is considered below and illustrates how computer simulation offers an alternative method
for calculating probabilities.
Example 4.16 (The likelihood of “streaks”) If a coin is tossed 100 times, what is the probability of observing a
streak of at least five consecutive heads during the 100 tosses? (Try guessing before reading any further.) With two
possibilities for each toss, there are 2^100 possible 100-coin sequences by the product rule. Moreover, if the coin tosses
are independent of each other, each of these 2^100 sequences must be equally likely. Therefore, the probability of
observing a streak of at least five consecutive heads during the 100 tosses is equal to S/2^100, where S is the number of
100-coin sequences with a streak of at least five consecutive heads. Unfortunately, it is extremely difficult to analytically
determine the value of S. An alternative approach is to use computer simulation, as follows:
• Step 1: Simulate the experiment of flipping 100 coins, with probability 1/2 of heads and probability 1/2 of tails for
each toss.
• Step 2: For the simulated sequence of 100 coin tosses, check and record whether or not there is a streak of at least
five heads.
• Repeat Steps 1 and 2 many times, and then determine the frequency or proportion of simulated 100-coin sequences
that contain a streak of at least five consecutive heads.
set.seed(1234)
We use a for loop to repeatedly conduct the 100-toss experiment. The loop is executed 100,000 times. During
each iteration of the loop, (i) 100 coin tosses are simulated, (ii) a string of the results is created and stored in
toss_string, using the collapse = "" option for the paste function, (iii) the variable streak_counter
is incremented by one if the string "HHHHH" occurs within toss_string, and (iv) the cumulative streak count is set
to the value of streak_counter. After the loop, the vector of cumulative frequencies freq_streaks is created
and plotted using the plot function.
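The full script is available on the companion website; a minimal sketch consistent with the description above is the
following, where cum_streaks is an assumed name for the vector of cumulative streak counts (the other object names
are taken from the text):
num_simulations <- 100000
streak_counter <- 0
cum_streaks <- numeric(num_simulations)  # assumed name: cumulative streak counts
for (i in 1:num_simulations) {
  # (i) simulate 100 coin tosses
  tosses <- sample(c("H", "T"), 100, replace = TRUE)
  # (ii) create a single string of the results
  toss_string <- paste(tosses, collapse = "")
  # (iii) increment the counter if a five-head streak occurs in the string
  if (grepl("HHHHH", toss_string)) streak_counter <- streak_counter + 1
  # (iv) record the cumulative streak count
  cum_streaks[i] <- streak_counter
}
# vector of cumulative frequencies, plotted as in Figure 4.1
freq_streaks <- cum_streaks / (1:num_simulations)
plot(freq_streaks, type = "l", xlab = "Simulation number",
     ylab = "Cumulative frequency of five-head streak occurrence")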
The simulated frequencies level off at around 0.81 or 81%. The actual calculated frequency after the 100,000
simulations, resulting from the command freq_streaks[num_simulations], is 0.81156. How can we
determine the accuracy of the computer-simulated frequency? That is, how close is the computer’s answer to the
“true” probability of the event, which is the probability of seeing a five-head streak in a 100-coin sequence? While a
meaningful answer can’t be provided yet, the statistical tools developed in this book will allow simulation accuracy to
be quantified. There is simulation error whenever a computer simulates probabilities, and we can quantify how large
this simulation error is likely to be.
Exercises
1. Your phone has 30 different songs available: 10 by artist A, 8 by artist B, and 12 by artist C.
(a) How many different ways can you play first a song by artist A and then a song by artist B?
(b) How many different ways can you play first a song by artist A and then a song by artist B and then a song by
artist C?
(c) If you no longer restrict the order of the artists, how many different ways can you play three songs that each
have a different artist?
Figure 4.1 Simulated frequency of a five-head streak among 100 coin tosses (cumulative frequency of five-head
streak occurrence plotted against simulation number)
2. A 12-person jury is being selected from a pool of 50 potential jurors. Among the potential jurors, there are 20 men
and 30 women.
(a) How many possible ways are there to select the jury?
(b) If the judge insists on there being an equal number of men and women on the jury, how many possible ways
are there to select the jury?
(c) The jury needs a foreperson, who is one of the 12 jurors. Ignoring the gender restriction for now, how many
possible ways are there to select the jury and foreperson?
(d) Now assume that the judge insists that there are an equal number of men and women on the jury and that the
foreperson is a woman. How many possible ways are there to select the jury and foreperson?
3. A company visits a college campus to interview students. The company has seven Economics majors, six Finance
majors, and five Accounting majors from which to choose. Unfortunately, the company has lost everyone’s résumés,
so they randomly pick three students to interview.
(a) What is the probability that all three interviewees are Economics students?
(b) What is the probability that all three interviewees are from the same major?
(c) What is the probability that the set of three interviewees has either no Economics students or no Finance
students?
(d) What is the probability that at least one of the majors has no students interviewed?
4. At the “Pick One” restaurant, you have to choose one dish option from each of the following four dinner courses:
Appetizer (Vegetarian, Chicken, Pork), Salad (Vegetarian, Chicken), Small Plate (Vegetarian, Beef, Seafood), Large
Plate (Vegetarian, Beef, Chicken, Pork, Seafood). So, in order, CVBS is one possible choice.
(c) Following the logic from the previous part, if the size of the class is 20 students, what is the probability that at
least two of them share a birthday? Try to simplify the formula using factorial functions, but don’t worry about
calculating the exact answer.
(d) Using this approach for any class size N, one can show that the probability that at least two of the students
share a birthday is equal to
1 – (N! × (365 choose N))/365^N.
Apply this formula in R with N = 20 to calculate the probability for (c).
(e) Write a function birthday that takes an integer n ≥ 2 as its single argument and returns the probability from
the formula in (d).
(f) Using the birthday function from (e), how many students would need to be in a class for the probability of
two students sharing a birthday to be greater than 90%?
(g) Using the birthday function from (e), plot the probabilities of at least two students sharing a birthday against
class size N for all values of N between 20 and 60 (inclusive).
9. (Matching Problem) At the beginning of an exam, every student is required to leave their phone on the professor’s
desk. After the exam is over, the professor randomly gives back the phones to students when they turn in their exams.
(a) If there are only three students in the class, what is the probability that at least one student gets their own phone
back after the exam?
(b) If there are only four students in the class, what is the probability that at least one student gets their own phone
back after the exam? (If you’re having trouble with this part, list all P4,4 = 4! = 24 possible permutations for the
order of the phones being returned.)
(c) While it is possible to mathematically generalize the results from (a) and (b) to an arbitrary class size N, an
alternative approach is to use computer simulation to approximate the probability. Conduct 10,000 computer
simulations in R to approximate the probability that at least one student gets their own phone back in a class of
20 students.
(d) Generalize your code from (c) by writing a function matching that takes an integer n ≥ 2 as its single
argument and returns the approximate probability (based on 10,000 simulations) that at least one student gets
their own phone back in a class of n students.
10. A business magazine rates S&P 500 mutual fund managers based upon how often their fund beats the return on the
S&P index. Any given manager has a 50% chance of beating the S&P index in a given year (and a 50% chance of not beating it), and
each year’s performance is mutually independent.
(a) The magazine gives its “Gold Star” rating to any manager who has beaten the S&P index in at least five of the
last six years. What is the probability of any given manager getting the “Gold Star” rating?
(b) The magazine gives its “Silver Star” rating to any manager who has beaten the S&P index in exactly four of
the last six years. What is the probability of any given manager getting the “Silver Star” rating?
(c) If you know that a manager has beaten the S&P index in at least four of the last six years, what is the probability
that the manager gets the “Gold Star” rating? the “Silver Star” rating?
11. A publisher has six economics textbooks in its catalog. The publisher is deciding on its advertising strategy, and
each textbook can be advertised or not.
(a) How many distinct outcomes are possible? (Order does not matter.)
(b) How many outcomes have four or more of the economics textbooks advertised?
(c) A marketing specialist is hired to rank the six textbooks according to projected sales over the next two years.
How many distinct rankings are possible?
(d) The publisher decides to commission new editions of three of the six textbooks, one in each of the next three
years. The marketing specialist is asked to choose three textbooks for new editions, specifying the textbook to
be updated in the first year, the textbook to be updated in the second year, and the textbook to be updated in the
third year. How many distinct choices of the three textbooks are possible? (Order matters.)
(e) The publisher decides to start an annual mailing to economics professors of an advertising pamphlet that
focuses on three of its six textbooks. The company does not want to focus on the same three textbooks more
than once. It does not mind repeating one or two of them, but it does not want all three to be the same as in a
previous mailing. How many years can the publisher stick to this policy before it is forced to change it?
(f) How would your answer to (e) change if one of the authors just won a Nobel Prize, and it is decided that this
author’s textbook must be included every year?
(g) Suppose that, of the five remaining textbooks (other than the Nobel Prize author’s book), three are advanced
and two are introductory. How does your answer change if, in addition to the text by the Nobel Prize winner,
one text must be advanced and the other must be introductory?
(h) If the three textbooks are chosen randomly, what is the probability that the textbook by the Nobel Prize winner
is included?
(i) If the three textbooks are chosen randomly, what is the probability that at least one introductory textbook is
included?
(j) If three textbooks are chosen randomly for four years in a row, what is the probability that the textbook by the
Nobel Prize winner is included in all four years?
12. Modify the R code from Example 4.16 to approximate the probability of having a streak of at least six consecutive
heads in 100 tosses. Change the limits of the y-axis, using the ylim = c(0, 1) option. What is the simulated frequency
of streaks? Now do the same exercise for a streak of at least seven consecutive heads in 100 tosses.
13. Refer to Example 4.14 for this question.
(a) Conduct 100,000 simulations in R to approximate the two probabilities found analytically in Example 4.14:
(i) the probability that a randomly chosen three-letter website name has only alphabetic characters, and (ii) the
probability that a randomly chosen three-letter website name has three distinct alphabetic characters. (Hint:
Sample from the vector 0:35, where 0 through 9 correspond to the numerical characters, and 10 through 35
correspond to the alphabetic characters.)
(b) Conduct 100,000 simulations in R to approximate the probability that the sum of any numerical characters in a
randomly chosen three-letter website name is greater than 10. (If there are no numerical characters in a website
name, the sum should be treated as zero.)
This chapter discusses several types of economic data that are typically encountered in practice. Also, this chapter
formalizes the concept of sampling, providing a framework to explain how data are generated and observed.
Definition 5.2 Time-series data consist of observations on the same unit that are measured at different points in
time. When the time series consists of a single variable, the data are univariate time-series data. When the time series
consists of more than one variable, the data are multivariate time-series data.
Again, the idea of a “unit” is general and can be many things in practice (an individual, a firm, a country, etc). Some
examples of time-series datasets include the following:
Example 5.4 (Macroeconomic indicators for the United States) To track and analyze the overall macroeconomy of the
United States, annual data for the unemployment rate, inflation, GDP growth, and the budget deficit can be collected.
Example 5.5 (Asset returns) There is a wealth of time-series data available for financial markets, which can be
collected at almost any time frequency — annually, monthly, daily, hourly, and even by the minute. An example of a
time-series dataset would be the daily returns for a given stock, like Apple, or for some other financial asset.
Whereas the units, like individuals or firms, in cross-sectional data can generally be treated as unrelated to each
other, it is often the case that the observations in a time-series dataset are related to each other. For example, the
price of Apple’s stock on a given day is going to be closely related to the price of Apple’s stock on the previous day.
Likewise, the inflation rate in the United States for a given year is likely to be related to the inflation rate for the
previous year.
Definition 5.3 Panel (or longitudinal) data have both a cross-sectional dimension and a time-series dimension, with
information about the same cross-sectional units being observed at different points in time. The number of times that
each cross-sectional unit is observed is at least two.
Here are some examples:
Example 5.6 (State-level panel data on cigarettes) Building upon Example 5.3, annual data can be collected for all
50 states on cigarette taxes (tax per pack), cigarette prices (average price per pack), and smoking rates among certain
age groups. Whereas the cross-sectional dataset only allows analysis of data collected at one point in time, the panel
dataset allows analysis of how cigarette taxes and smoking rates have changed over time and perhaps whether there
is a relationship between tax changes and smoking-rate changes.
Example 5.7 (Influenza data) To analyze trends in influenza infections and vaccinations in the United States, monthly
data over the course of several years can be collected for all 50 states on influenza vaccination rates, deaths caused
by influenza, and hospitalizations caused by influenza.
Example 5.8 (Global macroeconomic panel data) Example 5.4 considered time-series data of macroeconomic
variables for the United States. To build a panel dataset for the global economy, time-series data with the same
variables, at the same annual frequency, could be collected for other countries. The resulting panel dataset would
have annual data on the unemployment rate, inflation rate, and GDP growth rate for many countries over a period of
several years.
The three types of data discussed here (cross-sectional, time-series, and panel) do not cover all possibilities of
interest. For example, in cases where we have different cross sections that are observed at different points in time, the
data are known as a repeated cross section. Whereas each cross-sectional unit in a panel dataset is observed more than
once, each unit is observed only once in the repeated cross section (i.e., in one of the time periods but not the others).
Definition 5.6 A discrete variable (or discrete numerical variable) is a variable where the number of possible values
can be counted, even if the number of possible values is infinite.
The number of children in a given household and the number of patents awarded to a given firm in a given year are
both examples of discrete variables. Even though these variables will usually have pretty low values in any observed
data, we can still think of these variables as having an infinite number of possible values, with any value from the
set {0, 1, 2, …} possible. In other cases, a discrete variable may be inherently finite. Some examples of finite discrete
variables include a student’s score on an Advanced Placement (AP) exam, which is in {1, 2, 3, 4, 5}, the number of
states that have a Republican governor in a given year, which is in {0, 1, 2, …, 50}, or the number of months for which
a given stock has a positive return in a given year, which is in {0, 1, 2, …, 12}. Although these examples all involve
integer values, there is nothing in the definition of discrete variables that restricts values to be integers. For example,
shoe size is a discrete variable that can take non-integer values for “half” sizes.
In contrast to discrete variables, a continuous variable has values along some portion, or all, of the real line, so that
the number of possible values is not countable.
Definition 5.7 A continuous variable (or continuous numerical variable) is a variable that can take on any value
on some interval or intervals of the real line, including perhaps the entire real line.
Examples of continuous variables include the monthly rainfall in a given city (measured in inches, but not rounded),
the daily stock return for a given stock, the fraction of monthly income that a given employed individual saves in a
given month, and the annual GDP of a given country. The possible values are different in these examples, with monthly
rainfall and annual GDP being non-negative real numbers in [0, ∞), the daily stock return being any real number in
[–1, ∞), and the savings fraction being any real number in [0, 1].
As seen in future chapters, the probability models used to model discrete variables and continuous variables are
fundamentally different. Think about the comparison between one of our discrete variable examples (the number
of children in a household) and one of our continuous variable examples (the amount of monthly rainfall in a city).
Suppose we are interested in knowing the probability that the value of the variable is between one and three (inclusive).
For the number of children variable, we would want to add up three probabilities: the probability of one child, the
probability of two children, and the probability of three children. For the monthly rainfall variable, that approach is
not appropriate since the variable can take any non-integer value between one and three. Instead, we use calculus to
“add up” (integrate) the probabilities over all the possible values between one and three.
Since probability models for continuous variables are often easier to work with than those for discrete variables, it
is often the case that a discrete variable will be modeled as continuous when it is reasonable to do so. The concept of
an “approximately continuous” variable covers this case:
Definition 5.8 An approximately continuous variable (or approximately continuous numerical variable) is a
discrete variable that can be treated and modeled as continuous, which is the case when two conditions are met:
(i) the unit of measurement is small compared to typical values of the variable, and (ii) the number of values in the
dataset is large, with relatively few repeats.
Some examples of approximately continuous variables are the weekly earnings of a given individual, the number of
employees in a given firm, and the credit score for a given individual. Thinking about the weekly earnings variable,
the values would generally be reported in dollars (perhaps as an integer or perhaps reported to two decimal places),
so formally speaking the variable is discrete. But the number of possible values is extremely large, and the unit of
measurement (dollars) is very small relative to the typical weekly wage, which can be in the hundreds or thousands
of dollars. Likewise, the number of employees in a given firm can take on many possible values and can be in the
hundreds or even thousands, whereas the unit of measurement (one worker) is relatively small.
Definition 5.10 A categorical variable is ordered if there is a natural ordering to the choices. A categorical variable
is unordered if there is no natural ordering to the choices.
In practice, it is often useful to use numerical values to indicate that a categorical variable has a certain value.
Specifically, a discrete zero-one variable can be used to indicate if a categorical variable has a given value. A one
indicates that it has that value, and a zero indicates that it does not have that value. This zero-one variable can be called
an indicator variable, a dummy variable, or a binary variable. Here the binary variable is numerical, in contrast to
the binary categorical variable in Definition 5.9.
To see how this works, first consider one of the binary categorical variable examples. For the home ownership
variable, which has possible values “yes” and “no,” we can define the variable owner to be
owner = 1 if the individual owns a home (“yes”), and owner = 0 if not (“no”).
While we could also define the variable nonowner to be 1 if the individual does not own a home and 0 if the individual
does, there’s no need to do so since the value of nonowner is known if the value of owner is known.
Now consider the labor force status variable, which is a categorical variable with more than two categories. In this
case, three different indicator variables employed, unemployed, and notinlf can be defined:
employed = 1 if the individual is employed, and 0 if not (unemployed or not in labor force)
unemployed = 1 if the individual is unemployed, and 0 if not (employed or not in labor force)
notinlf = 1 if the individual is not in the labor force, and 0 if not (employed or unemployed)
Using three indicator variables here is overkill, as once the values of two indicator variables are known, the value of
the third variable is known. Since an individual is in one and only one of the three categories, we have
employed + unemployed + notinlf = 1.
Therefore, if employed and unemployed are already defined, it follows that notinlf = 1 – employed – unemployed, so
that it’s unnecessary to specify notinlf as a third variable. For three categories, only two indicator variables (for two
of the categories) are needed to completely characterize the categorical variable. It doesn’t matter which of the three
categories is the “omitted category.”
This basic idea generalizes to categorical variables with more categories. If a categorical variable has C different
categories that are disjoint and exhaustive, using the same terminology introduced for events in Section 2.2, C – 1
indicator variables can be used to completely describe the categorical variable, where it doesn’t really matter which
category is omitted. Of course, it is important to have disjoint and exhaustive categories for this to be true, as that
implies that one and only one category may be the actual value for the categorical variable.
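In R, indicator variables can be created by converting logical comparisons to numeric values. A small illustration,
using a hypothetical vector of labor force statuses for four individuals:
lfstatus <- c("Employed", "Unemployed", "Not in LF", "Employed")
employed <- as.numeric(lfstatus == "Employed")
unemployed <- as.numeric(lfstatus == "Unemployed")
# notinlf is redundant given the other two indicators
notinlf <- 1 - employed - unemployed
employed
## [1] 1 0 0 1
notinlf
## [1] 0 0 1 0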
Definition 5.11 The population is the entire group for which conclusions are to be made from statistical analysis.
Definition 5.12 The sample consists of the specific units from the population for which data are collected and
observed.
Definition 5.13 The sample size, usually denoted n, is the number of units collected and observed.
The collected and observed data in the sample, consisting of n units, are taken or “drawn” from the underlying
population of interest. In many cases, the population is very large, and it is impossible to collect data on the entire
population due to cost, logistics, and/or other factors.
Example 5.9 (Election polling) We are interested in predicting the outcome of a political election with two candidates,
candidate A and candidate B. We conduct a pre-election poll, where the population of interest consists of all likely
voters. The collected sample is a subset of the population of likely voters who are surveyed to ask if they intend to vote
for candidate A or candidate B.
Example 5.10 (Salary expectations) We are interested in determining what students at Capita University think about
their post-graduation jobs and salaries. The population consists of all Capita University students. The sample is
a subset of Capita University students who are surveyed about their predicted post-graduation salary and job and
maybe additional items.
In both of these examples, it’s impractical to survey the entire population of interest, so drawing a representative
sample makes sense. In other cases, we can collect data on all of the units available, as would be the case, for example,
if we want to collect data for all 50 states in the United States.
Example 5.11 (State-level cigarette data) We are interested in collecting data on cigarette taxes/prices and smoking
rates for individual states in a given year. The population of interest consists of the states in the United States. Since
it’s feasible to collect data on all 50 states, the sample consists of the cigarette taxes/prices and smoking rates for all
50 states, with sample size n = 50.
Definition 5.14 A sample of size n is called a simple random sample if each element of the population is equally likely
to be sampled or, equivalently, if any possible sample of size n is equally likely to be chosen from the population.
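In R, the sample function can be used to draw a simple random sample. For instance, to draw n = 10 units at random
from a population of 100 units (indexed 1 through 100):
# each of the choose(100,10) possible samples is equally likely;
# the particular draw varies from run to run
sample(1:100, 10)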
While a thorough discussion of non-random sampling procedures is beyond the scope of this book, it is important
to understand why a sample might not be a simple random sample and how a non-random sample may affect the
analysis being conducted. When the sample is not a simple random sample, it may introduce sampling bias (or
sample selection bias) if the sampling method is related to the variable(s) of interest.
Example 5.12 (Election polling) Returning to Example 5.9, would a simple random sample be drawn from the
population of likely voters if a landline phone survey is used? The answer depends on whether or not someone
having a landline (i.e., a non-cellular phone in one’s residence) is related to whether they would favor candidate A
or candidate B. If it’s known that older voters are much more likely than younger voters to both support candidate A
and have a landline, the landline survey is not going to yield a simple random sample. Actually, surveying through
landlines by itself would violate the definition of a simple random sample in Definition 5.14 since older voters are
more likely to have landlines and therefore more likely than younger voters to be sampled in the first place. However,
that fact by itself would not mean that the sampling approach is a poor one if it weren’t also the case that use of
a landline is related to candidate preference. In this case, since the goal is to infer the proportion of likely voters
who favor each candidate, the landline survey is problematic because landline use is related to candidate preference.
Sampling bias or sample selection bias is introduced since the sample is more likely to have candidate A supporters
than a simple random sample would have.
Example 5.13 (Salary expectations) Returning to Example 5.10, let’s assume that every Capita University student is
assigned a random student ID number when they first arrive on campus. Consider two alternative sampling approaches
for gathering data about job/salary expectations: method 1, which involves surveying all Capita University students
whose student ID number ends in a “5,” and method 2, which involves surveying all Capita University students who
are in an advanced economics course. Method 1 does not lead to sampling bias since the last number of the student ID
number is chosen at random and, thus, should not be systematically related to any of the variables (salary expectation,
job expectation, etc) in which we might be interested.8 Method 2 is likely to lead to sampling bias since students in
the advanced economics course are not representative of the population of all Capita University students, except in
the unlikely case that the advanced economics course is required for all students. For instance, if advanced economics
students are more likely to end up in high-paying jobs than other students, surveying only those students will most
likely lead to higher salary expectations than we would expect from representative students in the population.
Definition 5.15 A stratified random sample is created by splitting the population into defined subpopulations
or strata and drawing a simple random sample from each subpopulation or stratum. Such a sample is called a
proportionate stratified random sample if the sizes of the strata samples are proportional to the true probabilities
of the strata in the population and a disproportionate stratified random sample if they are not.
The use of a proportionate stratified random sample ensures that the obtained sample is representative of the
population with respect to the defined strata.
Example 5.14 (Political polling) Suppose a political poll of 100 individuals is being conducted in a particular area,
where each individual is asked whether they prefer the candidate from political party A or the candidate from political
party B. Within the area, 60% of the population are registered with political party A and 40% are registered with
political party B. A proportionate stratified random sample results from drawing a simple random sample of 60 party
A individuals (from the subpopulation of party A individuals) and a simple random sample of 40 party B individuals
(from the subpopulation of party B individuals). Without imposing the proportionate strata, a simple random sample
of 100 individuals might over-represent party A individuals or party B individuals. For instance, if there were 65 party
A individuals in the simple random sample, we’d intuitively get an over-estimate of the preference for the party A
candidate from the poll (assuming that party A individuals are more likely than party B individuals to support the
party A candidate).
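A minimal sketch of this proportionate stratified draw in R, assuming a hypothetical population data frame with a
party variable:
# hypothetical population: 10,000 registered individuals (60% party A, 40% party B)
population <- data.frame(id = 1:10000,
                         party = rep(c("A", "B"), times = c(6000, 4000)))
# simple random sample within each stratum: 60 from party A, 40 from party B
rows_A <- sample(which(population$party == "A"), 60)
rows_B <- sample(which(population$party == "B"), 40)
stratified_sample <- population[c(rows_A, rows_B), ]
table(stratified_sample$party)
##
##  A  B
## 60 40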
While the use of proportionate stratified random sampling is appealing in certain circumstances, there are also
sometimes reasons for using disproportionate sampling instead. With disproportionate sampling, one or more strata
will be oversampled relative to the true proportion(s) in the population, and one or more strata will be undersampled
relative to the true proportion(s) in the population.
Example 5.15 (Rare disease) In a certain population, suppose the probability that an individual has a particular
genetic disease is 0.05% (1-in-2,000 chance). If a researcher is interested in studying the characteristics of individuals
with the disease or making comparisons of those characteristics to individuals without the disease, she needs to have
a considerable number of individuals with the disease in her sample. Thinking about a simple random sample, even a
sample of size 10,000 would have very few individuals (five, on average) with the disease. Instead, the researcher would
want to oversample the subpopulation or stratum of individuals with the disease. For example, she might construct
a disproportionate stratified random sample with 20% of the individuals having the disease and 80% not having
the disease. Since the subpopulation of individuals with the disease is oversampled, the subpopulation of individuals
without the disease is undersampled.
Example 5.16 (Union and non-union wages) In the population of employed individuals in the United States, the
proportion that are union members is 10%, and it is well-known that union workers earn more, on average, than
non-union workers. If a researcher is interested in estimating the average earnings in the overall population of workers,
it may be sensible to construct a proportionate stratified random sample based upon the true proportions of union
workers (10%) and non-union workers (90%) in the population. On the other hand, if a researcher wants to compare
the average earnings between union workers and non-union workers, it might make more sense to oversample union
workers so that the size of the union-worker and non-union-worker samples are comparable.
Notes
8 Whether method 1 yields a simple random sample depends upon how you think about the population. If the population consists of students
before they arrive on campus, method 1 yields a simple random sample since the assignment of the student ID hasn’t occurred yet, and any student
is equally likely to have a “5” as the last digit. If the population consists of students after they arrive on campus, method 1 won’t pick any students
who don’t have a “5” as the last digit, so in that sense the units of the population are not equally likely to be chosen. This distinction is unimportant
here since, as the discussion has highlighted, there is no sampling bias introduced by method 1.
Exercises
1. For each of the following examples, indicate whether the data are cross-sectional, time-series, or panel data.
(a) An avid runner records the number of miles that she runs every day for 100 straight days.
(b) A random sample of 100 older adults, aged 65 and older, were surveyed in January 2023 about their health
status and medical expenditures.
(c) A financial analyst randomly picks 20 companies that are listed on the New York Stock Exchange and gathers
data on their profits and sales for the year 2022.
(d) A Canadian economist gathers annual data on each of Canada’s 13 political territories, including the population
and the unemployment rate for each territory, for each year between 2000 and 2020.
(e) A random sample of 100 college students is asked whether or not they have received an influenza vaccine in
the last year.
(f) A random sample of 100 college seniors is asked for their semester GPA for each of their first six semesters at
the university.
2. A survey is taken of Economics majors at a particular university. For each of the following variables, indicate
whether the variable is (i) categorical, (ii) discrete but not approximately continuous, (iii) discrete and approximately
continuous, or (iv) continuous.
(a) Favorite economics professor.
(b) Number of economics courses taken prior to taking econometrics (a required course for majors).
(c) Born in the United States or not.
(d) Cumulative GPA prior to the current semester (not rounded).
3. For each of the following variables, indicate whether the variable is (i) categorical, (ii) discrete but not approximately
continuous, (iii) discrete and approximately continuous, or (iv) continuous.
(a) The daily number of deliveries that a major on-line retailer makes to U.S residential addresses.
(b) Whether or not a person has a flu vaccination in a given year.
(c) The time that it takes a personal shopper at a grocery store to complete a customer’s order.
(d) The number of cellphones that an individual has owned in their lifetime.
4. Consider drawing a simple random sample of n = 10 observations from a population consisting of 100 units.
(a) How many possible ways are there to draw the simple random sample?
(b) What is the probability that a given observation from the population is in the simple random sample that is
drawn?
(c) What is the probability that any two given observations from the population are in the simple random sample
that is drawn?
5. A major city has a complete census of all restaurants within its city limits and is interested in the percentage that
have public-health violations. It doesn’t have the resources to conduct inspections at all restaurants, so it must draw
a sample from the full census (population). For each of the following sampling possibilities, explain whether there
would be concern about sample selection bias and explain why.
(a) The city has six different zip codes and conducts inspections at all restaurants in one of the six zip codes.
(b) The city conducts inspections at 15% of the restaurants chosen completely at random.
(c) The city conducts inspections at all restaurants having a street address number ending in 3.
(d) The city conducts inspections at the 30 largest restaurants in the census.
6. For each of the following examples, discuss whether there is a potential for sample selection bias and explain why.
For a given example, there may be multiple reasons for sample selection bias.
(a) A pharmaceutical company wants to test the efficacy of a new medication for treating a disease. The company
is interested in the population of all individuals having the disease, and they recruit participants through
advertisements posted in medical facilities and in online forums.
(b) A firm wants to determine the effectiveness of an employee training program, i.e. how effective the training is
for a representative employee at the firm. The program is voluntary, so the firm measures the effectiveness (the
difference between pre-program productivity and post-program productivity) for employees who enroll in the
program.
(c) The Social Security Administration (SSA) has comprehensive earnings-history data for every United States
citizen. A researcher with access to SSA data is interested in the average earnings of 30-year-old citizens in
2020. She randomly draws 5,000 individuals from the SSA data who were 30 years old in 2020 and gathers
their earnings data for that year.
(d) A magazine reports the average salaries of graduates from law schools. To gather their data, the magazine
conducts an on-line survey of law-school graduates, who are asked to voluntarily share their own salary
information.
(e) An economist would like to analyze the economic impact of recent tax cuts on small businesses. She collects
data from businesses that voluntarily express willingness to respond to a survey.
7. The following table summarizes the political party affiliation (A or B) and age group (“Under 30,” “30-50,” and
“Over 50”) for the population of registered voters in a particular voting district:
                  Party
                 A      B
      Under 30  15%     5%
Age   30-50     25%    15%
      Over 50   20%    20%
For example, 15% of the registered voters are affiliated with party A and under 30 years old.
(a) If a sample of 1,000 voters is stratified on the basis of party affiliation alone, how many voters in the sample
are affiliated with party A versus party B?
(b) If a sample of 1,000 voters is stratified on the basis of both party affiliation and age group, how many strata are
there and how many voters are in each stratum?
(c) A sample of 1,000 voters has exactly 380 voters in the “Over 50” age group. Which one (or more) of the
following are possible:
i. the sample is a simple random sample
ii. the sample is a proportionate stratified random sample, stratified by age group
iii. the sample is a proportionate stratified random sample, stratified by party affiliation
Chapter 5 introduced different types of variables that might be observed in a dataset. This chapter considers actual
dataset examples and introduces common descriptive statistics and visual devices that can be used to summarize
variables. This chapter focuses on descriptive statistics and visuals for univariate data, considering summary measures
of a single variable, and Chapter 7 focuses on descriptive statistics and visuals that summarize the relationship between
two or more variables.
Recall that n denotes the sample size. For a single variable, generically denoted x, the following notation denotes a
sample of observations:
{x1, x2, …, xn},
or, more concisely, {xi} for i = 1, …, n. The term statistic is defined to be any numerical measure that is based upon the sample:
Definition 6.1 A statistic is a function of the observed sample data. For univariate data, a statistic has the form
s(x1 , x2 , …, xn ) for some function s(·).
A descriptive statistic is a statistic whose purpose is to describe data in some way. Descriptive statistics can be used
to describe how likely certain values are, where the “center” of the sample observations is, how “noisy” the sample
observations are, etc. This chapter considers descriptive statistics for categorical data and numerical data separately.
While it’s not possible to cover the full universe of descriptive statistics used in real-world applications, several of the
most prevalent ones are considered.
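For example, the sample mean is a statistic; written as an R function of the sample, it might look like the following:
# a statistic is a function of the observed sample; e.g., the sample mean
s <- function(x) sum(x) / length(x)
s(c(2, 4, 6, 8))
## [1] 5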
• Categorical variables:
– statefips: 51 possible two-character codes (“AL”, “AK”, “AZ”, ..., “WI”, “WV”, “WY”) for the 50 U.S.
states plus Washington, DC
– gender: two values (“Female”, “Male”)
– metro: two values (“Metro”, “Non-metro”), indicating whether individual lives in a metropolitan area or
not
– race: three values (“Black”, “White”, “Other”)
– hispanic: two values (“Hispanic”, “Non-hispanic”)
– marstatus: four values (“Married”, “Divorced”, “Widowed”, “Never married”)
– lfstatus: three values (“Employed”, “Unemployed”, “Not in LF”)
– ottipcomm: two values (“Yes”, “No”), indicating whether earnings include overtime, tips, and/or
commissions; missing if lfstatus is “Unemployed” or “Not in LF”
– hourly: two values (“Hourly”, “Non-hourly”); missing if lfstatus is “Unemployed” or “Not in LF”
– unionstatus: two values (“Union”, “Non-union”); missing if lfstatus is “Unemployed” or “Not in LF”
• Numerical variables:
– age (in years): age of the individual; values provided as integers
– hrslastwk (in hours): hours worked by the individual last week; values provided as integers; missing if
lfstatus is “Unemployed” or “Not in LF”
– unempwks (in weeks): number of weeks that an individual has been unemployed; values provided as
integers; missing if lfstatus is “Employed” or “Not in LF”
– wagehr (in dollars): hourly wage for the individual; values provided to two decimal places (cents); missing
if hourly is “Non-hourly” or if lfstatus is “Employed” or “Not in LF”
– earnwk (in dollars): earnings for the individual last week; values provided as integers; missing if lfstatus is
“Unemployed” or “Not in LF”
– ownchild: number of children in the individual’s household
– educ (in years): highest level of education attained by the individual
Having loaded the cps dataset into a data frame called cps, the following R code uses the functions str and
summary to show the structure of the data frame and to summarize its variables, respectively:
str(cps)
## 'data.frame': 4013 obs. of 17 variables:
## $ statefips : Factor w/ 51 levels "AK","AL","AR",..: 5 1 27 35 41 43 2 6 21 44 ...
## $ age : int 50 34 50 30 40 35 56 42 55 58 ...
## $ hrslastwk : int 40 40 NA 44 NA 30 40 25 40 46 ...
## $ unempwks : int NA NA NA NA NA NA NA NA NA NA ...
## $ wagehr : num 12 NA NA NA NA NA 25 8 NA NA ...
## $ earnwk : num 577 3049 NA 2500 NA ...
## $ ownchild : int 0 1 4 0 0 0 0 0 0 0 ...
## $ educ : num 14 18 16 18 12 12 12 7.5 12 13 ...
## $ gender : Factor w/ 2 levels "Female","Male": 2 1 2 1 1 2 2 1 1 1 ...
## $ metro : Factor w/ 2 levels "Metro","Non-metro": 1 1 1 1 1 1 1 1 1 1 ...
## $ race : Factor w/ 3 levels "Black","Other",..: 1 3 3 3 1 3 3 1 2 3 ...
## $ hispanic : Factor w/ 2 levels "Hispanic","Non-hispanic": 2 2 1 2 2 2 2 2 2 1 ...
## $ marstatus : Factor w/ 4 levels "Divorced","Married",..: 3 2 2 3 3 1 2 4 3 2 ...
## $ lfstatus : Factor w/ 3 levels "Employed","Not in LF",..: 1 1 2 1 2 1 1 1 1 1 ...
## $ ottipcomm : Factor w/ 2 levels "No","Yes": 1 1 NA 1 NA 1 1 1 1 1 ...
## $ hourly : Factor w/ 2 levels "Hourly","Non-hourly": 1 2 NA 2 NA 2 1 1 2 2 ...
## $ unionstatus: Factor w/ 2 levels "Non-union","Union": 1 1 NA 1 NA 1 1 1 1 1 ...
summary(cps)
## statefips age hrslastwk unempwks
## CA : 341 Min. :30.00 Min. : 1.00 Min. : 1.00
## TX : 250 1st Qu.:37.00 1st Qu.:40.00 1st Qu.: 4.00
## FL : 196 Median :45.00 Median :40.00 Median : 8.50
## NY : 162 Mean :45.02 Mean :40.34 Mean : 17.75
## OH : 109 3rd Qu.:53.00 3rd Qu.:42.00 3rd Qu.: 20.00
## GA : 108 Max. :59.00 Max. :99.00 Max. :119.00
## (Other):2847 NA's :1204 NA's :3907
## wagehr earnwk ownchild educ
## Min. : 1.01 Min. : 12.0 Min. :0.0000 Min. : 0.00
## 1st Qu.:12.78 1st Qu.: 520.0 1st Qu.:0.0000 1st Qu.:12.00
## Median :16.41 Median : 770.0 Median :0.0000 Median :12.00
## Mean :18.60 Mean : 971.2 Mean :0.7478 Mean :12.57
## 3rd Qu.:22.00 3rd Qu.:1193.6 3rd Qu.:1.0000 3rd Qu.:14.00
## Max. :90.00 Max. :8779.7 Max. :7.0000 Max. :18.00
## NA's :2174 NA's :1204
## gender metro race hispanic
## Female:2093 Metro :3175 Black: 476 Hispanic : 745
## Male :1920 Non-metro: 838 Other: 349 Non-hispanic:3268
## White:3188
##
##
##
##
## marstatus lfstatus ottipcomm hourly
## Divorced : 707 Employed :2809 No :2341 Hourly :1839
## Married :2377 Not in LF :1098 Yes : 468 Non-hourly: 970
## Never married: 853 Unemployed: 106 NA's:1204 NA's :1204
## Widowed : 76
##
##
##
## unionstatus
## Non-union:2533
## Union : 276
## NA's :1204
##
##
##
##
The str(cps) command shows the structure of the cps data frame, indicating the number of observations and
variables and, for each variable, indicating the variable type and the first few observed values from the data. The
summary(cps) command shows more detailed information about each variable, with category counts shown for
categorical (factor) variables and descriptive statistics shown for numerical variables; for variables with missing
values, the number of missing values is indicated in the row labeled NA’s. For example, the unionstatus variable
has 2,533 observations in the Non-union category, 276 observations in the Union category, and 1,204 with a
missing (NA) value.
We can also get the summary statistics for a single variable by specifying that variable, rather than the whole data
frame, as the argument for the summary function.
summary(cps$earnwk)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 12.0 520.0 770.0 971.2 1193.6 8779.7 1204
Example 6.2 (Monthly stock returns) The sp500 dataset consists of 364 monthly observations (January 1991 through
April 2021) for a set of 266 individual stocks, each of which is part of the S&P 500 stock market index. Each variable
in the dataset corresponds to a single company, with the variable name corresponding to the company’s stock ticker.
For example, the variable names for AT&T and Bank of America are T and BAC, respectively. For any given stock, the
observations constitute a time series with a sample size of n = 364. The data are all numerical, with each observation
representing a monthly return for a given stock. The monthly return for month m is defined as
$$\text{return}_m = \frac{\text{price}_m - \text{price}_{m-1}}{\text{price}_{m-1}},$$

where price_m is the price at the end of month m and price_{m-1} is the price at the end of the previous month m − 1. This variable is continuous with possible values in [−1, ∞), where the −1 corresponds to price_m being 0. The variable values are unitless here. Even though stock prices price_m and price_{m-1} have monetary units (dollars), the formula above indicates that the monetary units in the numerator cancel those in the denominator, leaving no units for return_m.
After loading the sp500 dataset in R, we can use the head function to display the first few observations for T and
BAC, along with the dates:
head(sp500[,c("Date","T","BAC")])
## Date T BAC
## 1 1991-02-01 0.041747550 0.03999942
## 2 1991-03-01 0.044186174 0.18376090
## 3 1991-04-01 -0.048997714 0.07811400
## 4 1991-05-01 -0.017966258 0.14237314
## 5 1991-06-01 0.033815331 -0.15133543
## 6 1991-07-01 0.009346888 -0.01935816
The monthly return for AT&T in January 1991 was approximately 4.17%, and the monthly return for Bank of
America was approximately 4.00%. During the first six months, AT&T and Bank of America both had four positive
monthly returns and two negative monthly returns, though the timing of the negative-return months was different for
the two companies.
In the rest of this chapter, we discuss several different descriptive statistics and data visualization options for different
types of data, including the following:
• categorical data: sample proportion, bar charts
• discrete or continuous numerical data: histograms, measures of location (sample mean, sample median, sample
quantiles), box plots, measures of dispersion (interquartile range, sample variance, sample standard deviation)
For a categorical variable x with C categories, the most basic descriptive statistics are the number and the fraction of observations within each category c ∈ {1, 2, …, C}. These quantities are known as the sample counts and sample proportions, respectively.
Definition 6.2 For a categorical variable x, the sample count associated with category c is the number of observations in category c:

$$\text{sample count for category } c = \sum_{i=1}^{n} 1(x_i = c).$$
In this definition, the function 1(xi = c) has the value 1 when xi = c (the i-th observation is in category c) and the value
0 when xi ≠ c (the i-th observation is not in category c). This function is an example of an indicator function, which is
used elsewhere in the book. More generally, an indicator function 1(E) is equal to 1 if the event E is true and 0 if the
event E is not true.
Definition 6.3 For a categorical variable x, the sample proportion associated with category c is the fraction or percentage of observations in category c:

$$\text{sample proportion for category } c = \frac{\sum_{i=1}^{n} 1(x_i = c)}{n} = \frac{1}{n}\sum_{i=1}^{n} 1(x_i = c).$$
Since the C categories of x are disjoint and exhaustive, every xi is in one and only one category, which immediately
implies the following:
Proposition 6.1. For any sample {x1 , x2 , …, xn } of a categorical variable x, the sum of the sample counts is n, and the
sum of the sample proportions is 1.
Similar to the discussion of categorical variables in Section 5.2.2, the sample proportion of any category c can be
inferred if the sample proportions of the other C – 1 categories are known since the proportions sum to one. In the
case of a binary categorical variable, the sample proportion of one of the two categories completely summarizes the
observed data.
Example 6.3 (Labor force data) Let’s focus on the labor force status (lfstatus) variable from the cps data. The sample
counts for the lfstatus categories were already seen in Example 6.1, as part of the output from the summary(cps)
command. Another method to directly tabulate the sample counts for a categorical variable, which works even when
there are more categories than will fit in the summary output, is to use the table function.
table(cps$lfstatus)
##
## Employed Not in LF Unemployed
## 2809 1098 106
table(cps$lfstatus)/nrow(cps)
##
## Employed Not in LF Unemployed
## 0.69997508 0.27361077 0.02641415
The first table command provides the sample counts. The second command, which divides by the sample size (i.e., the number of rows in the dataset, given by nrow(cps)), provides the sample proportions. The sample counts and sample proportions for the three categories are provided more neatly in the following table:

Category      Sample count   Sample proportion
Employed          2809            0.700
Not in LF         1098            0.274
Unemployed         106            0.026
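The bar charts in Figure 6.1 are produced with the barplot function. A minimal sketch consistent with the description below (the chart titles and the ylim for the proportions chart are assumptions):

# display the two bar charts stacked vertically (2 rows, 1 column)
par(mfrow = c(2,1))
# bar chart of sample counts
barplot(table(cps$lfstatus), ylim=c(0,3000), main="Sample counts")
# bar chart of sample proportions
barplot(table(cps$lfstatus)/nrow(cps), ylim=c(0,1), main="Sample proportions")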
For the first barplot command, the first argument is table(cps$lfstatus) (the table of sample counts),
the second argument (ylim=c(0,3000)) specifies the lower and upper limit to be used for the y-axis, and the third
argument (main) provides a title for the chart. The second barplot command is similar, except that the table of
sample proportions is the first argument and the values of the ylim and main are different. The ylim argument is
optional, and R uses its default upper/lower limits for the y-axis if it is omitted. The formatting command par(mfrow
= c(2,1)) is used before the bar charts are produced. This command provides a convenient way to display multiple
graphs simultaneously. The c(2,1) can be changed to accommodate a different number of rows and columns to be
displayed. Here, there are 2 rows and 1 column, leading to the first bar chart being displayed on top of the second bar
chart. If there were six different graphs to be displayed in three rows of two graphs each, the appropriate command would be par(mfrow = c(3,2)). The graphs are displayed from left to right, starting with the first row and continuing
left-to-right for subsequent rows.
Figure 6.1
Bar charts of labor-force status (CPS data)
A histogram partitions the range of a numerical variable into intervals (“bins”) and draws a rectangle over each bin, where the height of the rectangle indicates how many of the variable’s values are within that bin. To construct the histogram,
the bins need to be specified, which involves specifying the bin width and the starting/ending values for the bins. To
illustrate how this works, we start with a simple example involving a discrete numerical variable.
Example 6.5 (Labor force data) Consider the age variable from the cps data, which is a discrete numerical variable
that can take on the integer values 30, 31, …, 59. Figure 6.2 shows four different histograms for the age variable.
The top two histograms look identical, except for the y-axis, with the one on the left having “frequency” (counts) of
observations and the one on the right having the “density” of observations. Both of these histograms have 30 bins,
each with a bin width of 1 year. The bins themselves are (29.5, 30.5], (30.5, 31.5], (31.5, 32.5], and so on through
(58.5, 59.5]. Since age is only integer-valued, the (29.5, 30.5] bin contains observations with age = 30, the (30.5, 31.5]
bin contains observations with age = 31, and so on through the (58.5, 59.5] bin which contains observations with
age = 59. Looking at these two histograms, the two most observed age values are age = 56 and age = 59, and the two
least observed age values are age = 38 and age = 44. For the “frequency” histogram on the top-left, the height of each
rectangle is the count of observations within the corresponding bin or, equivalently, the count of observations with
that specific age value. For the “density” histogram on the top-right, the height of each rectangle turns out to be the
proportion of observations in the associated bin or, equivalently, the proportion of observations with that specific age
value; as we’ll see, these heights are proportions for this histogram since the bin widths are exactly equal to one.
The bottom two histograms use a bin width of 2 years, with the 15 bins defined as (29.5, 31.5], (31.5, 33.5],
(33.5, 35.5], and so on through (57.5, 59.5]. The first bin contains the observations with age = 30 or age = 31, the
second bin contains the observations with age = 32 or age = 33, and so on through the last bin which contains the
observations with age = 58 or age = 59. For the bottom-left “frequency” histogram, the height of each rectangle still
corresponds to a count of the observations within the associated bin. Since the wider 2-year bins naturally contain
more observations than the 1-year bins, the scale of the y-axis is considerably larger. For instance, the height of the
first rectangle for the 2-year bin width is equal to the sum of the first two rectangles for the 1-year bin width histogram
above it. For the bottom-right “density” histogram, the scale of the y-axis is quite similar to the “density” histogram
with 1-year bins (top-right). The “density” values on the y-axis make the area of each rectangle equal to the proportion or fraction of observations within the associated bin. For example, the height (“density”) of the first bin is
approximately 0.036, so that the area of the rectangle is approximately (0.036)(2) = 0.072, with roughly 7.2% of the
observations having age = 30 or age = 31.
The R code to create Figure 6.2 uses the function hist:
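A sketch consistent with the description below (the 2×2 panel layout and blank titles are assumptions):

par(mfrow = c(2,2))
# top row: 1-year bins, counts (left) and densities (right)
hist(cps$age, breaks=seq(29.5,59.5,1), main="", xlab="Age")
hist(cps$age, breaks=seq(29.5,59.5,1), freq=FALSE, main="", xlab="Age")
# bottom row: 2-year bins, counts (left) and densities (right)
hist(cps$age, breaks=seq(29.5,59.5,2), main="", xlab="Age")
hist(cps$age, breaks=seq(29.5,59.5,2), freq=FALSE, main="", xlab="Age")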
The first argument for the hist function is the variable of interest (here cps$age). The optional breaks
argument allows the user to directly specify the starting/ending values for the bins. For the histograms in the top row,
the one-year bins are specified using the vector seq(29.5,59.5,1); for the histograms in the bottom row, the
two-year bins are specified using the vector seq(29.5,59.5,2). The optional argument freq indicates whether
the y-axis of the histogram should display counts (when freq is TRUE, which is the default) or densities (when freq
is FALSE). The main argument provides a title for the histogram, and the xlab argument specifies the text to be
shown on the x-axis (here specified as Age rather than the default value, which is cps$age).
The properties of the histograms described in Example 6.5 are stated in the following general propositions:
Proposition 6.2. For a frequency or count histogram, the height of any rectangle is the number of observations within
the associated bin. The sum of the heights of all of the rectangles is equal to the sample size n.
Proposition 6.3. For a density histogram, the area of any rectangle (height times bin width) is the proportion or
fraction of observations within the associated bin. The sum of the areas of all of the rectangles is equal to 1 or 100%.
From this point forward, we focus on density histograms since they have a direct relationship with the probability
distributions introduced in Chapters 8 and 10.
While Example 6.5 considers a discrete numerical variable, the next example considers a continuous numerical
variable and discusses the choice of the bin width or the number of bins in more detail.
Example 6.6 (Labor force data) Consider the weekly earnings (earnwk) variable from the cps data, which is a
continuous, or at least an approximately continuous, numerical variable. The earnwk variable has non-missing values
for the 2809 employed individuals in the sample. Figure 6.3 shows six different density histograms for earnwk, with
the number of bins specified as 10, 20, 50, 100, 200, and 500. These six histograms illustrate an inherent tradeoff
when choosing the number of bins. If the number of bins is chosen to be too small (or, equivalently, the bin width to be
too large), the histogram will be less “noisy” but might miss key aspects of the shape of the distribution of the data.
The top-left histogram with 10 bins completely misses the “hump” in the distribution of the observed weekly earnings.
This hump, located just below $1000 per week, first becomes evident in the histogram with 20 bins and even more so
in the histograms with 50 bins and 100 bins. As the number of bins gets even larger, however, the histograms display
Figure 6.2
Histograms of age (CPS data)
less smoothness and more noise. The two histograms with 200 bins and 500 bins have rectangle heights that jump up
and down. This lack of smoothness can be exacerbated by variables where data may be bunched at round numbers,
as is the case for the earnwk variable where round numbers like $1,000 or $2,000 are more likely to be reported than,
say, $1,019 or $1,992. When the number of bins is very large (or the bin width very small), the histogram is unable to
smooth out this type of bunching.
Here is the R code used to create Figure 6.3:
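A sketch consistent with the description that follows (the 3×2 layout and blank titles are assumptions):

# weekly earnings are observed only for employed individuals
cpsemployed <- cps[cps$lfstatus=="Employed",]
nrow(cpsemployed)
par(mfrow = c(3,2))
hist(cpsemployed$earnwk, breaks=10, freq=FALSE, main="", xlab="Weekly earnings")
hist(cpsemployed$earnwk, breaks=20, freq=FALSE, main="", xlab="Weekly earnings")
hist(cpsemployed$earnwk, breaks=50, freq=FALSE, main="", xlab="Weekly earnings")
hist(cpsemployed$earnwk, breaks=100, freq=FALSE, main="", xlab="Weekly earnings")
hist(cpsemployed$earnwk, breaks=200, freq=FALSE, main="", xlab="Weekly earnings")
hist(cpsemployed$earnwk, breaks=500, freq=FALSE, main="", xlab="Weekly earnings")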
Since weekly earnings are observed only for employed individuals, the relevant rows of the data frame cps are first
selected. Specifically, cpsemployed is created as a new data frame which consists of the rows of cps for which
cps$lfstatus=="Employed" is TRUE. The nrow command confirms that the number of employed individuals
is 2809. The six hist commands each use the breaks argument, but here a single number is provided for breaks, which specifies the number of bins for the histogram. (When breaks is a single number, R treats it as a suggestion and may adjust the break points slightly.) This use of breaks is in contrast to Example 6.5, where a vector was provided for breaks, corresponding to the starting/ending values for the bins rather than the number of bins.
How should the number of bins (equivalently, the bin width) be chosen? A popular data-driven choice is the Freedman-Diaconis rule, which sets

$$\text{bin width} = \frac{2\,\text{IQR}_x}{n^{1/3}},$$

where IQRx is the interquartile range introduced in Section 6.5.1, and then uses enough bins of that width to cover the range from xmin to xmax, where xmax and xmin are the maximum and minimum x values in the sample, respectively. For the earnwk variable in Example 6.6, the Freedman-Diaconis rule yields a choice of 92 bins, based upon a calculated bin width of 95.48, since the maximum and minimum weekly earnings are 8779.73 and 12. Of the histograms displayed in Figure 6.3, this choice yields a histogram
similar to the one with 100 bins. Figure 6.4 shows the earnwk histogram with the 92 bins from the Freedman-Diaconis
rule. In addition, the figure includes a density curve that overlays the histogram. Most statistical packages, including
R, offer the option to draw this type of smooth density curve either on its own or along with a histogram. In Figure 6.4,
the density curve roughly passes through the tops of the histogram rectangles, but it does so in a smooth fashion rather
Figure 6.3
Histograms of weekly earnings (CPS data)
than jumping from flat step to flat step as the histogram does. While a formal discussion of density curves is beyond
the scope of this book, a density curve (i) can be quite useful as a descriptive visual for a numerical variable and (ii) has
a direct relationship to the probability distributions introduced in Chapters 8 and 10.
Here is the R code used to create Figure 6.4:
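A sketch consistent with the description below and with the code shown later for Figure 6.5:

hist(cpsemployed$earnwk, breaks=92, freq=FALSE, main="", xlab="Weekly earnings")
lines(density(cpsemployed$earnwk), lwd=2)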
The hist command is similar to those seen in Example 6.6, here with breaks=92 specified. The lines
command, with density(cpsemployed$earnwk) specified as its first argument, draws the density curve on
the same graph as the original histogram. The optional “line width” argument lwd=2 specifies a slightly thicker line
for the density curve, as compared to the default value of lwd=1.
Figure 6.4
Histogram of weekly earnings with Freedman-Diaconis bin width and density curve
Definition 6.4 The sample mean or sample average of observations x1 , x2 , …, xn , denoted x̄, is

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$
The sample mean depends on the value of each and every observation in the sample. As such, the sample mean can
be sensitive to unusually small or unusually large values of x, sometimes known as outliers.
The sample median is a descriptive statistic used to describe the center of the sample. The sample median depends
only on the relative order of the observations and, specifically, the observation value(s) directly “in the middle” of the
sample. The sample median is not affected by the values of outliers.
Definition 6.5 The sample median of observations x1 , x2 , …, xn , denoted x̃1/2 or x̃0.5 , is the value for which half of
the observations are below (≤) x̃1/2 and half of the observations are above (≥) x̃1/2 .
Unlike the sample mean, the sample median does not have a closed-form formula.
To determine the sample median x̃1/2 , the following procedure can be used:
• Sort the observations from lowest to highest. There may be some repeated values (“ties”).
• If the sample size n is even, x̃1/2 is the average of the (n/2)-th and (n/2 + 1)-th values in the sorted sample.
• If the sample size n is odd, x̃1/2 is the ((n + 1)/2)-th value in the sorted sample.
Both the sample mean and the sample median have the same units as the underlying x variable. For instance, if the x
variable is measured in dollars, the sample mean x̄ and sample median x̃1/2 are also measured in dollars.
Example 6.9 (Labor force data) Examples 6.5 and 6.6 considered histograms for the age and earnwk variables
from the cps data. The sample average and the sample median can be calculated in R using the mean and median
functions, respectively.
mean(cps$age)
## [1] 45.0167
median(cps$age)
## [1] 45
mean(cps$earnwk, na.rm = TRUE)
## [1] 971.1785
median(cps$earnwk, na.rm = TRUE)
## [1] 770
Since earnwk has missing values for non-employed individuals, we specify the optional argument na.rm =
TRUE for the mean and median functions to ignore the missing values. Alternatively, the cpsemployed data
frame from Example 6.6 could be used; for instance, the command mean(cpsemployed$earnwk) would give the
same result as mean(cps$earnwk, na.rm = TRUE).
For the age variable, the sample mean (45.02 years) and the sample median (45 years) are quite close to each other,
which is expected from the fairly symmetric histograms in Example 6.5. For the earnwk variable, it’s a very different
story, with the sample mean of weekly earnings ($971.18) much larger than the sample median ($770). Figure 6.5
shows the same histogram as Figure 6.4, but now with the sample mean and sample median indicated on the graph.
This histogram exhibits a long right tail, with some very large weekly earnings values observed in the right tail. There
is no long left tail in the histogram since earnwk must be positive. The right-tail values cause the sample mean to be
larger than the sample median. Since the sample mean depends on all observations, the large weekly earnings values
in the right tail effectively pull the sample mean to the right. On the other hand, the sample median does not increase
due to the very large right-tail values. Even if all the weekly earnings values in the right tail were instead equal to
2000, the sample median would be unchanged, as there would still be 50% of observations below 770 and 50% of
observations above 770.
Figure 6.5
Right skewness of the distribution of weekly earnings
# histogram of weekly earnings with 92 bins (Freedman-Diaconis), with estimated density overlaid
hist(cpsemployed$earnwk, breaks=92, freq=FALSE, main="", xlab="Weekly earnings")
lines(density(cpsemployed$earnwk), lwd=2)
abline(v=mean(cpsemployed$earnwk), lwd=2, lty=3)
abline(v=median(cpsemployed$earnwk), lwd=2, lty=2)
legend("topright", legend=c("Sample median","Sample mean"), lty=c(2,3), lwd=c(2,2))
The hist and lines commands are identical to those used for Figure 6.4. The two abline commands add
vertical lines, due to the inclusion of the optional argument v, at the sample mean and the sample median of earnwk.
The vertical lines for the sample mean and the sample median are dotted and dashed, respectively. Finally, the legend
command shows how a legend can be added to a graph. The first argument indicates where the legend appears, and the
remaining arguments specify the legend text (legend), line type (lty), and line width (lwd). Refer to the legend
documentation in R for more details on the available options.
When a histogram is characterized by a long right tail, like weekly earnings in Example 6.9, the variable is said to
have a right-skewed distribution.
Definition 6.6 A variable has a right-skewed distribution if there is a longer tail on the right side of its distribution
than on the left side of its distribution, as exhibited by a histogram or density curve.
A variable with a right-skewed sample distribution usually has its sample mean greater than its sample median. Many
economic variables naturally have right-skewed distributions, including earnings or wealth for a sample of individuals,
sales or profits for a sample of firms, gross domestic product (GDP) for a sample of countries, etc. While left-skewed
distributions are less common in economics, we provide a formal definition in the interest of completeness.
Definition 6.7 A variable has a left-skewed distribution if there is a longer tail on the left side of its distribution than
on the right side of its distribution, as exhibited by a histogram or density curve.
A variable with a left-skewed sample distribution usually has its sample mean less than its sample median.
Unlike the weekly earnings variable in Example 6.9, the age variable exhibits neither a right-skewed distribution
nor a left-skewed distribution. In fact, the histogram for the age variable looks approximately flat on both sides of the
center of its distribution. This type of distribution is said to be an approximately symmetric distribution.
Definition 6.8 A variable has an approximately symmetric distribution if the shape of the distribution to the left of
the sample median is approximately a mirror image of the shape of the distribution to the right of the sample median.
If a variable does not have an approximately symmetric distribution, the variable is said to have an asymmetric
distribution.
A variable with an approximately symmetric sample distribution has a sample mean that is close to the sample
median, meaning that either provides a good measure of the center of the sample distribution. A sample with either a
right-skewed distribution or a left-skewed distribution must be an asymmetric distribution, as the presence of a longer
tail on either the right or the left means that the distribution cannot possibly appear to have mirror images on the two
sides of the sample median.
Example 6.10 (Monthly stock returns) Example 6.2 introduced the monthly stock return dataset sp500. Focusing
again on the monthly returns for AT&T (T) and Bank of America (BAC), the sample means and sample medians can
be calculated with the mean and median functions in R. Alternatively, the sample mean and sample median are part
of the output provided by the summary function applied to a numerical variable.
mean(sp500$T)
## [1] 0.008239883
median(sp500$T)
## [1] 0.007072945
mean(sp500$BAC)
## [1] 0.01294691
median(sp500$BAC)
## [1] 0.014761
summary(sp500[,c("T","BAC")])
## T BAC
## Min. :-0.191214 Min. :-0.52203
## 1st Qu.:-0.022352 1st Qu.:-0.03675
## Median : 0.007073 Median : 0.01476
## Mean : 0.008240 Mean : 0.01295
## 3rd Qu.: 0.047903 3rd Qu.: 0.06571
## Max. : 0.276617 Max. : 0.72658
The mean monthly return for AT&T is 0.824%, and the median monthly return is 0.707%. Both measures are larger for Bank of America, whose monthly returns have a sample mean of 1.295% and a sample median of 1.476%.
Figure 6.6 shows histograms and density curves for the T and BAC variables. Both distributions exhibit a “bell curve”
shape and look approximately symmetric. That said, the histograms do not look identical to each other, as the BAC
distribution has many more monthly returns that are larger in magnitude (i.e., either large positive returns or large
negative returns). This feature is discussed further when the dispersion of distributions is introduced in Section 6.5.
Figure 6.6
Histograms of monthly stock returns
The use of the hist and lines functions to draw the histograms and density curves is similar to the
examples above. For both histograms, the vector for the breaks argument is specified to be the same
(seq(-0.8,0.8,0.02)) for ease of comparison.
To generalize the idea of a sample median to other parts of the sample distribution, sample quantiles are defined as
follows:
Definition 6.9 For any q where 0 < q < 1, the sample quantile x̃q is a value for which (100q)% of the observations
are below (≤) x̃q and (100 – 100q)% of the observations are above (≥) x̃q . The sample median x̃1/2 is a special case
of the sample quantile where q = 1/2. Other commonly used sample quantiles are sample quartiles, corresponding to
q ∈ {0.25, 0.50, 0.75}, and sample deciles, corresponding to q ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}.
As an example, for q = 0.9, the sample quantile x̃0.9 is the value for which 90% of the observations are less than or
equal to x̃0.9 and 10% of the observations are greater than or equal to x̃0.9 . We call x̃0.9 the sample 90% quantile. For
the sample quartiles, the 25% quantile is the first quartile or lower quartile, and the 75% quantile is the third quartile
or upper quartile. For the sample deciles, the 10% quantile is the first decile, the 20% quantile is the second decile,
and so on. Like the sample mean and sample median, any sample quantile x̃q has the same units as the underlying x
variable.
Most statistical packages have functions to calculate sample quantiles, and in practice we rely upon the statistical
package to calculate these quantiles. There are several alternative algorithms to calculate sample quantiles, meaning
that one statistical package might give a slightly different answer than another statistical package. In reasonably sized
samples, these small differences will not be practically meaningful.
Here is a procedure that generalizes the procedure used previously for calculating a sample median and can be used
to manually compute the sample quantile for any value q between 0 and 1:
• Sort the observations from lowest to highest. There may be some repeated values (“ties”), which is fine.
• If nq is an integer, then x̃q is the average of the nq-th value and the (nq + 1)-th value in the sorted sample.
• If nq is not an integer, then x̃q is the ⌈nq⌉-th value in the sorted sample, where ⌈nq⌉ denotes the smallest integer larger than nq.
To see how this works, consider a sample with sample size n = 50. First, the 50 observations are sorted in ascending
order, from lowest to highest. For the 10% quantile (q = 0.1), nq = 5 is an integer, so x̃0.1 is the average of the 5-th and
6-th values of the sorted sample. For the 25% quantile (q = 0.25), nq = 12.5 is not an integer, so x̃0.25 is equal to the
13-th value of the sorted sample since ⌈nq⌉ = ⌈12.5⌉ = 13.
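For concreteness, here is an illustrative implementation of this procedure (manual_quantile is a hypothetical helper, not part of base R, and the integer check ignores floating-point edge cases):

manual_quantile <- function(x, q) {
  xs <- sort(x)              # sort the observations from lowest to highest
  n <- length(xs)
  nq <- n*q
  if (nq == floor(nq)) {     # nq is an integer: average the nq-th and (nq+1)-th sorted values
    (xs[nq] + xs[nq+1])/2
  } else {                   # nq is not an integer: take the ceiling(nq)-th sorted value
    xs[ceiling(nq)]
  }
}
manual_quantile(c(4,3,8,12,0,10,5), 0.5)   # returns 5, the middle value of the sorted sample

Because of the alternative algorithms mentioned above, this function will not always exactly match the values returned by a statistical package.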
The R function quantile calculates sample quantiles:
• quantile(x, probs = ...): Returns sample quantiles of the vector x, where the quantiles returned are specified by the probs argument. For example, quantile(x, probs = c(0.25,0.75)) returns the sample 25% and 75% quantiles.
Example 6.11 (Labor force data) Sample quantiles provide a more complete description of the distribution of weekly earnings (earnwk) from the cps data. The sample quartiles and sample deciles of earnwk can be computed with the quantile function:
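A sketch of the commands, using the cpsemployed data frame from Example 6.6:

quantile(cpsemployed$earnwk, probs = c(0.25,0.5,0.75))   # sample quartiles
quantile(cpsemployed$earnwk, probs = seq(0.1,0.9,0.1))   # sample deciles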
Interpreting the values for the sample quantiles is straightforward. For instance, for the sample 70% quantile (x̃0.7 =
1080), approximately 70% of the sample observations are below 1080 and approximately 30% are above 1080.
Figure 6.7
Sample quantiles of the weekly earnings distribution
In Figure 6.7, five different quantiles (q = 0.1, 0.25, 0.5, 0.75, 0.9) are shown on the histogram. This figure helps to
visualize where the sample quantiles lie along the distribution. Due to the right skewness of earnwk, the 75% and 90%
quantile values are pulled to the right. The distance between the 75% quantile and the sample median (50% quantile)
is larger than the distance between the 25% quantile and the sample median, and the distance between the 90%
quantile and the sample median is much larger than the distance between the 10% quantile and the sample median.
As this example illustrates, the skewness of a sample distribution has implications for the sample quantile values.
For a right-skewed distribution, the distance between the 75% quantile and the sample median (x̃0.75 – x̃0.5 ) would be
expected to be larger than the distance between the 25% quantile and the sample median (x̃0.5 – x̃0.25 ) and similarly
for higher quantiles like the 90% quantile (x̃0.9 – x̃0.5 larger than x̃0.5 – x̃0.1 ) and the 95% quantile (x̃0.95 – x̃0.5 larger
than x̃0.5 – x̃0.05 ). In contrast, for a sample distribution that is approximately symmetric, the distance between the 75%
quantile and the sample median would be expected to be similar to the distance between the 25% quantile and the
sample median and similarly for higher quantiles like the 90% quantile (x̃0.9 – x̃0.5 similar to x̃0.5 – x̃0.1 ) and the 95%
quantile (x̃0.95 – x̃0.5 similar to x̃0.5 – x̃0.05 ). As an example, the age variable from the labor-force data is approximately
symmetric and has a sample median of 45. The 25% and 75% sample quantiles are 37 and 53, respectively, which are
equidistant (8 years) from the sample median. The 10% and 90% sample quantiles are 32 and 57, respectively, which
are nearly equidistant (13 years and 12 years, respectively) from the sample median.
Figure 6.8
Location and dispersion of variable distributions
Let’s consider some simple examples of how variables can differ in terms of their location and/or dispersion. In the
four graphs of Figure 6.8, hypothetical sample distributions for two different variables are depicted as a solid bell-
shaped density curve and a dotted bell-shaped density curve. In the top-left graph, the two variables have the same
central location (sample median), but the dotted distribution has longer (and thicker) left and right tails, exhibiting
more dispersion than the solid distribution. The solid distribution has more observations closer to the center and fewer
observations in the tails. In the top-right graph, the two variables have the same dispersion since their shapes are
identical, but their locations are different, with the solid distribution having a higher sample median than the dotted
distribution. In the bottom-left graph, the solid distribution has a higher sample median and less dispersion than the
dotted distribution. And, finally, in the bottom-right graph, the solid distribution has a lower sample median and less
dispersion than the dotted distribution.
While histograms and density curves provide a way to visually compare the dispersion of different variables, it
is also important to have numerical descriptive statistics to characterize the dispersion of variables. Section 6.5.1
introduces the interquartile range and a descriptive visual, known as a box plot, that is based in part on the interquartile
range. Section 6.5.2 introduces the sample standard deviation and sample variance descriptive statistics.
(Recall that the Freedman-Diaconis bin width introduced earlier is proportional to IQRx . Therefore, for the same sample size n, if one variable has an IQR that is twice as large as another variable, that
variable would have a bin width that is twice as large as the other variable. The relationship between the bin width and
the IQR seems sensible, as it ensures that different variables have comparable numbers of observations within their
histogram bins.)
Example 6.12 (Labor force data) For the weekly earnings (earnwk) variable from the cps data, the first quartile and
third quartile of the sample are 520 and 1194, respectively, so approximately half of the sample has earnwk values
between 520 and 1194. The IQR for earnwk is 1194 – 520 = 674 dollars. For the age variable, the IQR is 53 – 37 = 16
years. It does not make sense to directly compare the IQR values for earnwk and age since they are in different
units. That is, it would not be appropriate to say that the earnwk variable exhibits more dispersion than age. In other
situations, when two variables have the same units, it can make sense to directly compare the IQR values.
The IQR can be calculated in R directly using the IQR function or, alternatively, as the difference between the 75%
and 25% sample quantiles.
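A sketch of both approaches, using the employed subsample:

IQR(cpsemployed$earnwk)                                    # direct calculation
diff(quantile(cpsemployed$earnwk, probs = c(0.25,0.75)))   # difference of sample quantiles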
A useful descriptive visual based upon the IQR is a box plot. While there are several variants of the box plot, two
alternative versions of the box plot are considered here:
• Box plot with whiskers at minimum and maximum: The “box” extends from the sample 25% quantile (first quartile)
to the sample 75% quantile (third quartile), with the sample median indicated by a line within the box. The
“whiskers” are indicated by lines at the minimum value and the maximum value in the sample.
• Box plot with whiskers and outliers: The “box” extends from the sample 25% quantile (first quartile) to the sample
75% quantile (third quartile), with the sample median indicated by a line within the box. The “upper whisker” is
indicated by a line at the minimum of the following two values: xmax and x̃0.75 + 1.5IQRx . The “lower whisker” is
indicated by a line at the maximum of the following two values: xmin and x̃0.25 – 1.5IQRx . The “outliers,” observations
that are either above the upper whisker or below the lower whisker, are indicated by dots or circles.
The second version (box plot with whiskers and outliers) is usually preferred by practitioners since the first version, like
the range descriptive statistic, is too sensitive to the minimum and maximum values. The best way to fully understand
how these box plots are constructed is to consider an example.
Example 6.13 (Labor force data) For the weekly earnings (earnwk) variable from the cps data, Figure 6.9 shows the
two versions of the box plot described above. The one on the left is the box plot with whiskers and outliers, and the one
on the right is the box plot with whiskers at minimum and maximum. Both box plots have the same “box,” extending
from the first quartile (520) to the third quartile (1194) with a line indicating the sample median (770). The height of
each box is the IQR value of 674. The box plot on the right has the lower whisker at the minimum value (12) and the
upper whisker at the maximum value (8780). For the box plot on the left, the lower whisker is indicated by a line at
max(xmin , x̃0.25 – 1.5IQRx ) = max(12, 520 – (1.5)(674)) = max(12, –491) = 12,
and the upper whisker is indicated by a line at
min(xmax , x̃0.75 + 1.5IQRx ) = min(8780, 1194 + (1.5)(674)) = min(8780, 2205) = 2205.
Finally, the box plot on the left has outliers represented by circles. In this case, all of the outliers are above the upper
whisker, as there can be no observations below the lower whisker, which is located at the minimum value of the sample.
This box plot offers another visual confirmation of the right skewness of the earnwk variable. The distance from the
sample median to the third quartile is larger than the distance from the first quartile to the sample median, and a long
right tail of outliers appears above the upper whisker with no such left tail of outliers below the lower whisker.
The R code to create Figure 6.9 uses the function boxplot:
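A sketch consistent with the description below (the side-by-side layout and axis label are assumptions):

par(mfrow = c(1,2))
boxplot(cpsemployed$earnwk, ylab="Weekly earnings")            # whiskers and outliers (default)
boxplot(cpsemployed$earnwk, range=0, ylab="Weekly earnings")   # whiskers at minimum and maximum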
Figure 6.9
Box plots of weekly earnings (CPS data)
The default for the boxplot function is to have whiskers and outliers displayed. The first boxplot command
creates this box plot for the cpsemployed$earnwk variable. The second boxplot command creates a box plot
with whiskers at minimum and maximum by specifying the optional argument range=0.
Example 6.14 (Monthly stock returns) As in Example 6.10, we focus on the monthly returns for AT&T (T) and Bank
of America (BAC) from the sp500 dataset. Figure 6.10 shows the box plots, with whiskers and outliers, for T and
BAC, drawn with the same y-axis (extending from –0.6 to 0.8) for ease of comparison. Here is the R code to create
Figure 6.10:
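A sketch, with the titles and axis labels matching those shown in Figure 6.10 (the side-by-side layout is an assumption):

par(mfrow = c(1,2))
boxplot(sp500$T, ylim=c(-0.6,0.8), ylab="Monthly return (T)", main="Box plot (whiskers and outliers)")
boxplot(sp500$BAC, ylim=c(-0.6,0.8), ylab="Monthly return (BAC)", main="Box plot (whiskers and outliers)")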
Figure 6.10
Box plots of monthly stock returns
Neither box plot shows strong evidence of right or left skewness, with the first and third quartiles and the lower and upper whiskers fairly equidistant from the sample medians. In this example, a direct comparison of the distributions of T
and BAC is possible since they are both unitless. As already seen in the histograms from Example 6.10, the distribution
of BAC (Bank of America’s monthly stock returns) exhibits more dispersion than the distribution of T (AT&T’s monthly
stock returns). The box extends farther in both directions for BAC as compared to T, corresponding to a larger IQR
value (0.1025) for BAC than the IQR value (0.0703) for T. BAC also has more extreme outliers in both directions
than T does. In fact, BAC has six observed returns greater than 0.3 in magnitude (either below –0.3 or above +0.3),
whereas T has no such observed returns greater than 0.3 in magnitude.
Definition 6.12 The deviation from mean for the i-th observation of the x variable is xi – x̄.
Example 6.15 Suppose n = 7, and the sample for x is {4, 3, 8, 12, 0, 10, 5}. The sample mean is x̄ = 6. The deviations from mean for each of the observations are as follows:

i        1    2    3    4    5    6    7
xi       4    3    8   12    0   10    5
xi − x̄  −2   −3    2    6   −6    4   −1
The sum of the deviations from mean in Example 6.15 is exactly zero, which is a general result for any sample:

Proposition 6.4. The sum of the deviations from mean, $\sum_{i=1}^{n}(x_i - \bar{x})$, is equal to zero. The average of the deviations from mean, $\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})$, is also equal to zero.
Definition 6.13 The sample mean absolute deviation of observations x1 , x2 , …, xn , denoted MADx , is

$$\text{MAD}_x = \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|.$$
The MADx descriptive statistic is interpreted as the average distance that sample observations are from the sample
mean. MADx is always non-negative, and it’s strictly positive unless all xi values are the same. The units of MADx are
the same as the units of x, making it convenient for interpretation.
Example 6.16 Continuing Example 6.15, the absolute deviation values are added to the table:

i         1    2    3    4    5    6    7
xi        4    3    8   12    0   10    5
xi − x̄   −2   −3    2    6   −6    4   −1
|xi − x̄|  2    3    2    6    6    4    1

Then, MADx = (1/7)(2 + 3 + 2 + 6 + 6 + 4 + 1) = 24/7 ≈ 3.43. The average distance that an observation is from the sample mean is 24/7.
Example 6.17 (Monthly stock returns) The MADx values for the AT&T monthly stock returns (T) and the Bank of
America monthly stock returns (BAC) can be calculated in R:
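Base R's mad function computes a different statistic (a scaled median absolute deviation), so a direct calculation from Definition 6.13 is sketched here; the command for BAC, whose output follows, is shown, and the T value is computed analogously:

mean(abs(sp500$BAC - mean(sp500$BAC)))   # average absolute deviation from the sample mean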
## [1] 0.07205011
Since the two variables are unitless, the MADx statistics are also unitless and can be directly compared. On average,
the BAC monthly returns are farther from their sample mean than the T returns. The average distance of the AT&T
monthly returns to their sample mean is 0.04709 or 4.709%, and the average distance of the Bank of America monthly
returns to their sample mean is 0.07205 or 7.205%. The sample dispersion is much larger for BAC, with its MADx
value 53% larger than the MADx value for T.
An alternative dispersion measure based upon the deviations from mean is the sample variance, where the squared distance (xi − x̄)² is used instead of the absolute distance |xi − x̄|:

Definition 6.14 The sample variance of observations x1 , x2 , …, xn , denoted s²x , is

$$s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2.$$

Definition 6.15 The sample standard deviation of observations x1 , x2 , …, xn , denoted sx , is the square root of the sample variance, $s_x = \sqrt{s_x^2}$.
While the sample standard deviation sx is in the units of the original x variable, it does not have a simple
interpretation in the same way that MADx does. Recall that the MADx can be interpreted as the average distance
of sample observations from their mean. Unfortunately, due to the presence of the square root in the definition of sx ,
the sample standard deviation is not an average of some interesting underlying quantity. As seen later in the book,
the meaning of the sample standard deviation depends upon the specific underlying distribution of the variable; for
instance, the sample standard deviation has a particularly interesting interpretation in the case of a variable that has
a “normal distribution.” For now, the sample standard deviation should be thought of as an alternative dispersion
measure.
Example 6.18 Continuing Example 6.16, the squared deviation values are added to the table:

i          1    2    3    4    5    6    7
xi         4    3    8   12    0   10    5
xi − x̄    −2   −3    2    6   −6    4   −1
|xi − x̄|   2    3    2    6    6    4    1
(xi − x̄)²  4    9    4   36   36   16    1
The sample variance is

$$s_x^2 = \frac{1}{7-1}\left(4 + 9 + 4 + 36 + 36 + 16 + 1\right) = \frac{106}{6} = \frac{53}{3},$$

and the sample standard deviation is $s_x = \sqrt{s_x^2} = \sqrt{53/3} \approx 4.20$.
Example 6.19 (Monthly stock returns) Continuing Example 6.17, the sample variances and sample standard
deviations for the AT&T monthly stock returns (T) and the Bank of America monthly stock returns (BAC) can be
calculated in R using the var and sd functions, respectively:
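A sketch of the commands:

var(sp500$T); sd(sp500$T)       # sample variance and standard deviation for AT&T
var(sp500$BAC); sd(sp500$BAC)   # sample variance and standard deviation for Bank of America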
As with MADx , the sample standard deviation sx indicates that BAC has more dispersion than T. The standard
deviation of BAC (0.10530) is approximately 67% larger than the standard deviation of T (0.06307).
Example 6.20 (Union versus non-union wages) Suppose we are interested in a comparison of weekly earnings for
union workers versus non-union workers. Using the cps data, we construct two subsamples based upon the unionstatus
variable, one which consists of the 276 union workers and one which consists of the 2,533 non-union workers. Focusing
on weekly earnings (x = earnwk), the following table provides descriptive statistics for the union and non-union
subsamples:
Sample               n      x̄       MADx      s²x       sx
Union workers         276  1197.7   532.4   518378.8   720.0
Non-union workers    2533   946.5   488.8   562120.3   749.7
The sample mean of weekly earnings is roughly $250 higher for union workers ($1,198) than for non-union workers
($947). The dispersion measures provide mixed evidence on the relative dispersion of the distribution of union weekly
earnings versus the distribution of non-union weekly earnings. The MADx measure suggests slightly more dispersion
for union workers, whereas the sx measure suggests slightly more dispersion for non-union workers. To see whether
the histograms for the two subsamples provide any further evidence of their relative dispersion, Figure 6.11 plots the
two histograms and density curves, using the same x-axis for ease of comparison. It’s clear why union workers have
higher average weekly earnings, as there is a much higher proportion of observations with earnwk > 1000 in the top
histogram compared to the bottom histogram. But, consistent with the descriptive statistics, it’s unclear from these
histograms whether earnings are more dispersed in one versus the other.
Example 6.21 (Male versus female wages) Suppose we are instead interested in a comparison of weekly earnings for
male workers versus female workers. The approach is similar to Example 6.20, except we construct two subsamples
based on gender, one consisting of the 1,501 male workers and one consisting of the 1,308 female workers. Again
focusing on weekly earnings (x = earnwk), the following table provides the descriptive statistics for the two subsamples:
Sample            n      x̄       MADx      s²x       sx
Male workers      1501  1117.3   529.9   610217.8   781.2
Female workers    1308   803.5   415.1   457066.4   676.1
The average weekly earnings for male workers ($1,117) is over $300 higher than the average weekly earnings for
female workers ($804). The MADx and sx statistics both provide evidence that the distribution of male weekly earnings
Figure 6.11
Histograms of weekly earnings for union and non-union subsamples
is more dispersed than the distribution of female weekly earnings. Figure 6.12, which shows the histograms and density
curves for the two subsamples, indicates the higher dispersion for male weekly earnings distribution is a result of its
thicker and longer right tail as compared to the female weekly earnings distribution.
When x ∈ {0, 1} is a binary or indicator variable, the sample variance and sample standard deviation have a particularly simple form. For a binary x variable, the sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the sample proportion of observations with xi = 1 (since the xi = 0 observations do not contribute to the summation). As an example, let's say that x is an indicator of whether a worker is in a union, with 1 indicating a union worker and 0 indicating a non-union worker. For the sample considered in Example 6.20, we have x̄ = 276/2809 ≈ 0.098, or approximately 9.8% of workers being in a union. The following proposition indicates that the sample variance of a binary x variable depends only on x̄:
Proposition 6.5. If x ∈ {0, 1} is a binary variable, the sample variance of x is

$$s_x^2 = \frac{n}{n-1}\,\bar{x}(1 - \bar{x}),$$

and the sample standard deviation of x is

$$s_x = \sqrt{s_x^2} = \sqrt{\frac{n}{n-1}\,\bar{x}(1 - \bar{x})}.$$
The proof of this proposition is left as an exercise (Exercise 6.10). For the union example,

$$s_x^2 = \frac{2809}{2808}\cdot\frac{276}{2809}\cdot\frac{2533}{2809} \approx 0.0886
\qquad\text{and}\qquad
s_x = \sqrt{\frac{2809}{2808}\cdot\frac{276}{2809}\cdot\frac{2533}{2809}} \approx 0.2977.$$
Figure 6.12
Histograms of weekly earnings for male and female subsamples
In a sense, the sample variance doesn’t provide any additional information (beyond x̄) about the binary variable x since
it’s directly a function of x̄. Once the sample proportion of ones is known, it completely determines both the sample
mean and the sample variance of the binary variable.
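The proposition is easy to verify numerically; a sketch using the union indicator for the employed subsample (assuming the cpsemployed data frame from Example 6.6):

x <- as.numeric(cpsemployed$unionstatus == "Union")   # binary union indicator
n <- length(x)
xbar <- mean(x)           # sample proportion of union workers (approximately 0.098)
var(x)                    # sample variance computed directly
(n/(n-1))*xbar*(1-xbar)   # Proposition 6.5 formula; the two values agree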
Definition 6.16 The sample mode or modal outcome of observations x1 , x2 , …, xn is the value that occurs most often.
It is possible that there is more than one sample mode or modal outcome, which happens when two or more outcomes are tied for occurring most often in the sample.
When x is a categorical variable, the sample mode is the category that occurs most often in the sample. For numerical
variables, the sample mode is generally most useful for discrete variables and perhaps for continuous variables which
have “focal” responses/values. For a continuous numerical variable, even if the sample mode is not useful (e.g., if
most or all of the values are distinct), we may refer to a distribution as a unimodal distribution if its histogram or
density curve exhibits only one “hump,” a bimodal distribution if its histogram or density curve exhibits only two
“humps,” and so on. For example, the various distributions of weekly earnings from the cps data have been unimodal
distributions with a single hump.
Example 6.22 (Labor force data) Consider the distribution of the hrslastwk variable (hours worked last week) for the
sample of 2,809 employed individuals from the cps data. Figure 6.13 provides three different ways of looking at the
Figure 6.13
Histograms and box plots of weekly hours worked (CPS data)
distribution of hrslastwk: a histogram with one-hour bin widths, a histogram with ten-hour bin widths, and a box plot
with whiskers and outliers. We use the following R code, specifying the breaks argument to be seq(0.5,99.5,1)
for the one-hour bin-width histogram and seq(-5,105,10) for the ten-hour bin-width histogram:
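A sketch consistent with that description (the 1×3 layout and axis labels are assumptions):

par(mfrow = c(1,3))
hist(cpsemployed$hrslastwk, breaks=seq(0.5,99.5,1), freq=FALSE, main="", xlab="Hours worked last week")
hist(cpsemployed$hrslastwk, breaks=seq(-5,105,10), freq=FALSE, main="", xlab="Hours worked last week")
boxplot(cpsemployed$hrslastwk, ylab="Hours worked last week")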
Since hrslastwk is integer-valued, the histogram with one-hour bin widths has a rectangle corresponding to every
possible value of hrslastwk. A huge spike at hrslastwk = 40 is evident. In the sample, 40 hours is the modal outcome with
1408 individuals reporting 40 hours of work, representing just over half of the sample. Some other less pronounced
spikes can be seen, with the next most common values being 50 hours (158 individuals), 45 hours (106 individuals),
60 hours (96 individuals), and 35 hours (93 individuals). Perhaps not surprisingly, these common outcomes are all at
“round” numbers, which can be a feature of survey variables when individuals are asked for their recollection of a
particular activity. The second histogram, with ten-hour bin widths, smooths away each of the spikes seen in the first
histogram. This histogram still makes it clear that the vast majority of the individuals work around 40 hours per week.
Finally, note the somewhat strange box in the box plot on the right. The line indicating the sample median is right at
the bottom of the box, which also represents the lower quartile. The lower quartile and the median of the sample are
both equal to 40 hours here because there are so many individuals with that value in the dataset.
Definition 6.17 If a and b are known constants, the variable y = a + bx is a linear transformation of the x variable.
Moreover, the values yi = a + bxi (for i = 1, …, n) are linear transformations of the sample observations {x1 , …, xn }.
In some cases, a linear transformation changes the units of a variable, as in the first two examples considered below.
But linear transformations are much more general and can be used to create new variables of interest, as seen in the
third and fourth examples below.
Example 6.23 (Height) If x is height in inches, the variable y that measures height in feet is a linear transformation of x with a = 0 and b = 1/12:

x = height in inches
y = height in feet = (1/12)x
Example 6.24 (Earnings) If x is weekly earnings, the variable y that measures annualized earnings is a linear
transformation of x with a = 0 and b = 52:
x = weekly earnings
y = annualized earnings = 52x
The units of x are dollars per week, and the units of y are dollars per year.
Example 6.25 (Non-working hours) If x is the number of hours worked last week, the variable y that measures the
number of non-working hours last week is a linear transformation of x with a = 168 and b = –1:
x = hours worked last week
y = non-working hours last week = (24)(7) – x = 168 – x
Example 6.26 (Website profits) Suppose x is the number of widgets purchased at a website on a given day. If the price
of a widget is p, the daily fixed cost for the website is f , and the marginal cost of a widget is c, the variable y that
measures the website’s daily profit is a linear transformation of x with a = –f and b = p – c:
x = daily purchases of widgets
y = daily profit = –f + (p – c)x
The units of y are dollars if the units of f are dollars and the units of p and c are dollars per widget.
For a linear transformation y of a variable x, the location and/or dispersion of the distribution of the new yi
observations may differ from the distribution of the original xi observations. Figure 6.14 considers four different
examples to illustrate this point. For each of the four graphs shown, the same density curve is used for the hypothetical
sample associated with the x variable. The distribution for x is bell-shaped, centered around 2, and has nearly all of
its observations between 0 and 4. The top-left graph shows the density curve for y = 2x, a linear transformation with
a = 0 and b = 2. The center of the y density curve is shifted to the right. The center appears to be located at 4, which is
equal to b (2) times the center of the x density curve (2). The dispersion is also greater for the y density curve, which
occurs since the b value is greater than one and therefore increases the scale of the x observations. Moving next to the
top-right graph, we have a density curve for y = 0.5x, a linear transformation with a = 0 and b = 0.5. Again, the central
location of the y density curve differs from that of the x density curve, but this time it is smaller and is located around
1, which is b (0.5) times the center of the x density curve (2). The dispersion of the y density curve is now less than that
of the x density curve, which happens since the b value is less than one and has the effect of decreasing the scale of
the observations. For the bottom-left graph, we have the density curve for y = 3 + x, a linear transformation with a = 3
and b = 1. In this case, the shape of the y density curve looks identical to that of the x density curve, but it is shifted
over to the right by 3 units, corresponding to the value of a. As the shapes of the two density curves are identical,
their dispersions are also the same, which occurs since b is exactly equal to 1. Finally, for the bottom-right graph, we
have the density curve for y = 3 + 0.5x, a linear transformation with a = 3 and b = 0.5. There is less dispersion for the y
density curve, again due to b being less than one. The shape of the y density curve in this case looks identical to the
graph above it (where y = 0.5x) but shifted to the right by a = 3 units. The central location of the y density is affected
by both the a value (a shift of 3 to the right) and the b value (a shift of 1 to the left due to scaling the x variable by 0.5),
resulting in an overall shift of 2 units (4 as compared to the central location of the x density at 2).
To formalize some of the properties exhibited in Figure 6.14, the following proposition states how the various
descriptive statistics for location and dispersion are affected by a linear transformation:
Proposition 6.6. If a and b are known constants and y = a + bx is a linear transformation of x, the descriptive
statistics for the sample {y1 , y2 , …, yn } have the following relationships to the descriptive statistics for the sample
{x1 , x2 , …, xn }:
(i) (sample mean) ȳ = a + bx̄
(ii) (sample variance) s²y = b²s²x
(iii) (sample standard deviation) sy = |b|sx
(iv) (sample quantiles)¹⁴ ỹq = a + bx̃q if b ≥ 0
(v) (sample IQR)¹⁵ IQRy = |b|IQRx if b ≥ 0
(vi) (sample MAD) MADy = |b|MADx
It is useful to show each of these properties. For the sample mean in (i),
ȳ = (1/n) Σᵢ₌₁ⁿ yi = (1/n) Σᵢ₌₁ⁿ (a + bxi) = (1/n)(Σᵢ₌₁ⁿ a + Σᵢ₌₁ⁿ bxi) = (1/n)(na + bnx̄) = a + bx̄.
ȳ is a linear function of x̄, with the same a and b constants as the original linear transformation.
For the sample variance in (ii),
s²y = (1/(n–1)) Σᵢ₌₁ⁿ (yi – ȳ)² = (1/(n–1)) Σᵢ₌₁ⁿ (a + bxi – (a + bx̄))² = (1/(n–1)) Σᵢ₌₁ⁿ (bxi – bx̄)² = b² · (1/(n–1)) Σᵢ₌₁ⁿ (xi – x̄)² = b²s²x.
Whereas the x variable is scaled by b in the linear transformation y, the sample variance of y gets scaled by b². The
additive constant a drops out of the s²y expression, which makes intuitive sense since a shifts the location of the
original x distribution but does not change its shape or dispersion. The sign of b doesn't matter here, so that, for
example, the sample variance of y = 1 + 2x is the same as the sample variance of y = 1 – 2x, both being four times the
sample variance of x. The magnitude of b dictates whether or not the
sample variance of y is smaller or larger than the sample variance of x. From the sample variance result (s²y = b²s²x),
s²y > s²x if |b| > 1, s²y < s²x if |b| < 1, and s²y = s²x if |b| = 1.

Figure 6.14
Linear transformations of a variable (four panels: a = 0, b = 2; a = 0, b = 0.5; a = 3, b = 1; a = 3, b = 0.5)
For the sample standard deviation in (iii),
sy = √(s²y) = √(b²s²x) = |b|sx,
where √(b²) = |b| is used for the last equality. As with the sample variance, the relative sizes of the sample standard
deviations of x and y depend upon the magnitude of b: sy > sx when |b| > 1, sy < sx when |b| < 1, and sy = sx when |b| = 1.
For the sample quantiles in (iv), showing that ỹq = a + bx̃q when b ≥ 0 is straightforward. Let's say that the sample of
x values has been sorted, from lowest to highest, to calculate a sample quantile x̃q. When b is positive, if the sample of
y values is sorted, from lowest to highest, the sorted y values will be in the same exact ordering as the sorted x values.
Therefore, when the algorithm from Section 6.4.2 for calculating a sample quantile is applied, the sample quantile of
y will be ỹq = a + bx̃q . As a special case of this result, the sample medians are related by the equation ỹ0.5 = a + bx̃0.5 .
For the sample IQR in (v), when b ≥ 0, the result follows from the result for sample quantiles. Since ỹ0.25 = a + bx̃0.25
and ỹ0.75 = a + bx̃0.75 ,
IQRy = ỹ0.75 – ỹ0.25 = a + bx̃0.75 – (a + bx̃0.25 ) = b(x̃0.75 – x̃0.25 ) = bIQRx .
As with the sample variance and sample standard deviation, the additive constant a has no effect on the IQR dispersion
measure. While a may shift the location of the distribution, it has no effect on the difference between the quantiles
within the distribution. Instead, the sample IQR of y is just a scaled version of the sample IQR of x, with the same
scaling (b) as for the sample standard deviation.
For the sample MAD in (vi),
MADy = (1/n) Σᵢ₌₁ⁿ |yi – ȳ| = (1/n) Σᵢ₌₁ⁿ |a + bxi – (a + bx̄)| = (1/n) Σᵢ₌₁ⁿ |b(xi – x̄)| = |b| · (1/n) Σᵢ₌₁ⁿ |xi – x̄| = |b|MADx.
The additive constant a does not affect MADy , and the scaling constant b affects MADy in the same way as seen for
the sample standard deviation and the sample IQR.
Example 6.27 (Height) Example 6.23 had x = height in inches and y = (1/12)x = height in feet. For a sample of heights,
Proposition 6.6 implies
ȳ = (1/12)x̄, ỹ0.5 = (1/12)x̃0.5, s²y = (1/144)s²x, and sy = (1/12)sx.
Example 6.28 (Website profits) In Example 6.26, the website had daily widget sales x and daily profits y = –f + (p – c)x,
where f is the fixed daily cost, p is the widget price, and c is the marginal cost of producing each widget. For a sample
of daily sales x, from which daily profits y are derived,
ȳ = –f + (p – c)x̄, ỹ0.5 = –f + (p – c)x̃0.5 (if p > c), s²y = (p – c)²s²x, and sy = |p – c|sx.
Example 6.29 (Earnings) For Example 6.24, with x being weekly earnings and y = 52x being annualized earnings,
let’s consider the descriptive statistics associated with the actual weekly earnings (earnwk) variable from the cps data:
             x̄       s²x        sx      x̃0.5    IQRx    MADx
earnwk (x)   971.2    563227.1   750.48   770     673.6   497.8
Figure 6.15
Histograms of weekly hours worked and non-working hours
The histogram of non-working hours (y = 168 – x) is the mirror image of the histogram of hours worked, shifted by
the large additive constant a = 168. Taking the mirror image of the distribution doesn't affect dispersion, as implied by
the results in Proposition 6.6 for |b| = 1. For instance, the sample standard deviations sx and sy are both approximately
11.28, and the sample MAD statistics MADx and MADy are both approximately 6.53. Since x + y = 168, the sample
mean of x (x̄ ≈ 40.3) and the sample mean of y (ȳ ≈ 127.7) sum to 168.
Interestingly, regardless of the original sample {x1, x2, …, xn}, the sample of standardized values {y1, y2, …, yn},
where yi = (xi – x̄)/sx, always has sample mean ȳ = 0, since (applying Proposition 6.6 with a = –x̄/sx and b = 1/sx)
ȳ = –x̄/sx + (1/sx)x̄ = 0,
and sample standard deviation sy = 1, since
sy = (1/sx)sx = 1.
Example 6.31 (Monthly stock returns) Returning to the AT&T monthly stock returns (T) and the Bank of America
monthly stock returns (BAC) from the sp500 dataset, the following R code standardizes variables for both T and BAC:
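(A sketch of the standardization step, which subtracts each variable's sample mean and divides by its sample
standard deviation; R's scale function performs the same calculation.)

# standardized monthly returns: mean zero, standard deviation one
T_std <- (sp500$T - mean(sp500$T)) / sd(sp500$T)
BAC_std <- (sp500$BAC - mean(sp500$BAC)) / sd(sp500$BAC)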
summary(T_std)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.1626 -0.4851 -0.0185 0.0000 0.6289 4.2555
sd(T_std)
## [1] 1
summary(BAC_std)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -5.08030 -0.47193 0.01723 0.00000 0.50103 6.77693
sd(BAC_std)
## [1] 1
The summary statistics for the two standardized variables confirm that the sample means are equal to zero and the
sample standard deviations are equal to one. Figure 6.16 shows the histograms and density curves for the standardized
variables. As compared to the distributions of the original variables T and BAC in Figure 6.6, where BAC has visibly
more dispersion than T, the dispersion of the two standardized variables in Figure 6.16 is quite similar due to the
division by their respective standard deviations. The distributions are also centered around zero due to the de-meaning.
And, for both standardized variables, a very large proportion of the observations are between –2 (two sample standard
deviations below the sample mean) and +2 (two sample standard deviations above the sample mean). As we’ll see in
Chapter 11, this property is to be expected for variables with distributions that are approximately bell-shaped.
The optional argument type="l" provides a line plot rather than a scatter plot. With only a single variable
specified in the first argument to plot, R automatically plots the monthly return data against the row number or,
as specified on the x-axis label, the “Index.”
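A command along the following lines produces such a plot (a sketch; the y-axis label is taken from Figure 6.17):

# line plot of AT&T monthly returns against observation number
plot(sp500$T, type="l", ylab="Monthly return (T)")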
Figure 6.16
Histograms of standardized monthly stock returns
To directly incorporate information about actual dates or times associated with the observations, an alternative
approach in R is to create a time-series object and then draw the time-series plot:
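(A sketch; the object names ts_T and ts_BAC match those used later in this section.)

# monthly time-series objects starting in January 1991
ts_T <- ts(sp500$T, start=c(1991,1), frequency=12)
ts_BAC <- ts(sp500$BAC, start=c(1991,1), frequency=12)
plot(ts_T, ylab="Monthly return (T)")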
The ts function creates the time-series object based upon the variable specified by the first argument sp500$T.
The optional argument start=c(1991,1) specifies that the time series begins in the first month of the year 1991,
and the optional argument frequency=12 specifies that the observations are monthly (i.e., at a frequency of 12
per year). The resulting time-series plot, shown in Figure 6.18, looks identical to Figure 6.17, as it should. The only
difference is that the date values appear on the x-axis, now labeled “Time.” The dispersion of the AT&T monthly
returns is particularly high between 1998 and 2004. This feature of the data is missed by either the histogram or box
plot since neither of those visualizations incorporates the time dimension.
We can also graph multiple time-series plots at once. Two approaches are considered, one in which the time-series
plots are shown on the same graph and one in which the time-series plots are vertically stacked. The following R
code yields Figure 6.19, with the time-series plots for both AT&T monthly returns (T) and Bank of America monthly
returns (BAC) shown on the same graph:
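(In sketch form; the y-axis limits are a guess chosen so that both series are fully visible.)

# AT&T returns as a solid line, Bank of America returns as a dotted line
plot(ts_T, lty=1, ylim=c(-0.6,0.6), ylab="Monthly returns")
lines(ts_BAC, lty=3)
legend("topright", legend=c("T","BAC"), lty=c(1,3))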
Figure 6.17
Time-series plot of monthly stock returns
The plot function draws the AT&T time-series plot with a solid line. The lines function draws the Bank of
America time-series plot as a dotted line and does so on the same graph since lines is used instead of plot. The
legend function creates the legend in the top-right corner of the graph. The greater dispersion of Bank of America's
returns is evident in this figure, with the dotted time-series plot having more pronounced spikes. The largest of these
spikes occurred during the financial crisis in the years before 2010.
Including multiple time-series plots on the same graph can make it difficult to distinguish the behavior of the
variables. The following R code instead separately graphs the two time-series plots for T and BAC and stacks them
vertically, as shown in Figure 6.20:
plot(cbind(ts_T,ts_BAC), main="")
The cbind function combines the two time-series objects ts_T and ts_BAC into a single multivariate time-series
object, and plot vertically stacks the time-series plots for the variables in the resulting object. This approach
generalizes to more
Figure 6.18
Time-series plot of monthly stock returns
time-series variables, with additional variables included in the cbind function. Alternatively, the first argument can
be supplied directly as a data frame that consists of time-series objects.
Notes
9 The S&P 500, which is an abbreviation for the Standard and Poor’s 500, is a stock market index that tracks the stock performance of 500 of
the largest companies listed on United States stock exchanges, including the New York Stock Exchange and the Nasdaq. There are fewer than 500
stocks in the sp500 dataset since we have only included companies that were part of the S&P 500 index for the full time period between January
1991 and April 2021.
10 Negative stock prices are not possible. There are other investments, like “short sales” of stocks, for which returns can be negative and arbitrarily
large in magnitude.
11 A pie chart is an alternative descriptive visual that can be used, also using either count values or proportion values, where the sizes of the pie
wedges correspond to the sample proportions. An advantage of the bar chart is that it’s easier to visually compare bar heights rather than the size of
pie wedges.
12 Also unlike the sample mean, the sample median may not be uniquely defined when n is even. In the algorithm for calculating the sample
median for even n, choosing any value in between the two middle points would satisfy the formal definition.
13 An alternative measure is the sample median absolute deviation, (1/n) Σᵢ₌₁ⁿ |xi – x̃0.5|, calculated with the mad function in R.
14 The result for negative b is more subtle. For b < 0, we have ỹq ≈ a + bx̃1–q, where we use an approximation sign “≈” since we don’t get exact
equality based upon the procedure given in Section 6.4.2 for calculating sample quantiles. When b is negative, if we sort the sample of y values
from lowest to highest, the sorted values will be in exactly the reverse ordering of the sorted x values. As an example, if we had n = 101 and y = –x,
the sample 25% quantile of y is the 26th value from the sorted y values, which corresponds to the 75th value from the sorted x values (which is x̃0.74);
so we have ỹ0.25 = a + bx̃0.74 ≈ a + bx̃0.75 = a + bx̃1–0.25.
15 The result for negative b (b < 0) is approximate, IQRy ≈ |b|IQRx, again due to the procedure given in Section 6.4.2.
Exercises
1. Use the cps dataset for this question.
(a) Provide a table of the marital status (marstatus) categorical variable.
Figure 6.19
Time-series plots of monthly stock returns
(b) For a bar chart of marstatus with sample proportions on the y-axis, what is the height of the “Divorced” bar?
(c) Conditional on someone being not “Married,” what is the sample proportion that are “Divorced”?
(d) Consider the indicator variable notmarried equal to 0 if the variable value is “Married” and 1 otherwise. What
is the sample mean of notmarried?
2. For a sample with five observations (n = 5), the deviations-from-mean for the first four observations are –1.1, 2.3,
–0.8, and 0.2.
(a) What is the deviation from mean for the fifth observation?
(b) What is the sample variance?
3. For the cps dataset, there are 4,013 observations on adults aged 30 to 59. One of the variables is the number of
children (ownchild), with the following table describing how many observations take on each of the possible values
0, 1, 2, 3, 4, 5, 6, 7:
ownchild (# of children) 0 1 2 3 4 5 6 7
# of observations 2432 654 584 233 81 21 5 3
For this question, use only the numbers in the table and not the actual cps dataset; you can use R for mathematical
calculations, but don’t use the built-in R functions for descriptive statistics.
(a) What is the sample average of ownchild?
(b) What is the sample median of ownchild?
(c) What is the sample 90% quantile of ownchild?
(d) Is the distribution of ownchild symmetric, left-skewed, or right-skewed?
(e) What is the sample mean absolute deviation of ownchild?
Figure 6.20
Time-series plots of monthly stock returns
1.89, 1.99, 2.14, 2.51, 5.03, 3.81, 1.97, 2.31, 2.91, 3.97, 2.68, 2.44.
(a) What is the sample mean absolute deviation (MADx) of monthly rainfall?
(b) What is the sample variance (s²x) of monthly rainfall?
(c) What is the sample standard deviation (sx) of monthly rainfall?
(d) If y is monthly rainfall measured in feet, what are MADy, s²y, and sy?
6. For the cps dataset, here are summary statistics of the hourly wage last week (wagehr). The variable wagehr is
missing for the 2,174 individuals who are not hourly employees, so that it has numeric values for the 1,839 employed
individuals paid on an hourly basis.
summary(cps$wagehr)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.01 12.78 16.41 18.60 22.00 90.00 2174
(e) Create a standardized version of gre_quant called gre_quant_std, using the sample mean and sample standard
deviation of the full sample. Confirm that gre_quant_std has sample mean zero and sample standard deviation
one.
(f) What are the sample mean, sample standard deviation, and sample IQR of gre_quant_std for the subsamples
of domestic students and non-domestic students?
9. Logan’s Lemonade sells lemonade for $6 per cup each Sunday at a local farmer’s market. The number of cups sold
each Sunday is represented by the variable x. Over 16 weeks, the descriptive statistics for x are x̄ = 120 and sx = 20.
(a) What are the sample mean and sample standard deviation of Logan’s Lemonade’s daily revenues (price times
quantity sold)?
(b) The fee to have a booth at the farmer’s market is $100 each Sunday. If it also costs Logan’s Lemonade $1 per
cup to make the lemonade, what are the sample mean and sample standard deviation of Logan’s Lemonade’s
daily profits (revenues minus costs)?
(c) If you knew the value of the sample median x̃0.5 = x∗ , how would the sample median of daily revenues be related
to x∗ ? How about the sample median of daily profits?
10. Suppose x ∈ {0, 1} is a binary variable.
(a) *Show that the sample variance of x is
s²x = (n/(n–1)) x̄(1 – x̄).
(Hint: Use the following facts: (i) xi² = xi if xi ∈ {0, 1} and (ii) Σᵢ₌₁ⁿ xi = nx̄.)
(b) In a sample of 200 individuals, 13 had an emergency-room visit in the last year. If the variable x is an indicator
of an emergency-room visit (equal to 1 if visit occurred, 0 if not), what is the sample standard deviation of x?
(c) *Provide a formula for the sample mean absolute deviation of x (MADx ) in terms of x̄. The formula should not
contain any individual xi values.
11. The sample skewness gx is a descriptive statistic that measures the skewness of a sample distribution, defined as
gx = [(1/n) Σᵢ₌₁ⁿ (xi – x̄)³] / (sx)³.
Positive values of gx are associated with right-skewed sample distributions, with gx > 1 considered highly right-skewed.
Negative values of gx are associated with left-skewed sample distributions, with gx < –1 considered highly left-skewed.
Values of gx closer to zero are associated with sample distributions that are neither left-skewed nor right-skewed.
(a) What are the units of gx ?
(b) Write an R function skewness that takes a vector x, containing the sample, as its only argument and returns
the sample skewness.
(c) For the cps dataset, use the skewness function to calculate the sample skewness of the earnwk and age
variables for the subsample of employed individuals.
(d) How would the sample skewness of earnwk change if weekly earnings were measured in thousands of dollars
rather than dollars?
12. Use the bitcoin dataset for this question. This dataset consists of daily prices and returns for the Bitcoin
cryptocurrency between January 1, 2020 and December 31, 2021. There are 731 observations on the three price
variables (high = daily high price, low = daily low price, close = end-of-day price) and 730 observations on the daily
return (return).
(a) Draw a time-series plot of the closing daily price for the full time series.
(b) For the first 60 observations (through February 29, 2020), draw a time-series plot with the three price variables
(high, low, close) on the same plot. Draw high and low as solid lines and close as a dotted line.
(c) Draw a time-series plot of the daily returns for the full time series.
(d) The time-series plot in (c) should indicate periods of low variance and periods of high variance. Eyeball the
graph and pick one low-variance range and one high-variance range, and then confirm what you see visually
by calculating the sample standard deviations for the two ranges you’ve identified.
13. Use the inflation dataset for this question. This panel dataset consists of annual inflation rates for 45 countries over
the ten-year period between 2010 and 2019. The variables are country (a categorical (factor) variable with a three-
character abbreviation), year (values between 2010 and 2019), and inflation (in percentage points; e.g., inflation = 3
means 3% annual inflation).
(a) How many observations are associated with deflation (a negative inflation rate)?
(b) How many countries experience deflation at some point during the 10-year period?
(c) Which countries have the lowest and highest average inflation over the 10-year period?
(d) Which countries have the lowest and highest standard deviation of inflation over the 10-year period?
(e) Draw a time-series plot for the inflation rate of the United States (USA).
(f) Draw a time-series plot that has the inflation rates for the United States (USA), Canada (CAN), and Mexico
(MEX) on the same plot. Make the three lines a different style and/or color to differentiate them, and make sure
that the y-axis range allows all three lines to be completely visible.
Chapter 6 introduced descriptive statistics and visuals for univariate data. In this chapter, the focus shifts to the case of
data on two variables, known as bivariate data, with the introduction of descriptive statistics and visuals that describe
the relationship between the two variables. Rather than having a single variable for each observational unit, consider a
situation where two variables x and y are observed as a collection of pairs
{(x1 , y1 ), (x2 , y2 ), …, (xn , yn )}
or, more concisely, {(xi, yi)} for i = 1, …, n. As before, n denotes the sample size.
Example 7.1 (Education and earnings) If we are interested in the relationship between earnings and educational
attainment, data can be collected on a sample of workers with x being the years of educational attainment and y being
weekly (or annual) earnings. A positive association is expected between these two variables.
Example 7.2 (Monthly stock returns) Chapter 6 considered several examples using the monthly stock return data
from sp500, specifically for the case of AT&T (stock ticker T) and Bank of America (stock ticker BAC). Suppose we
are instead interested in looking at the relationship between monthly stock returns for two companies that operate
in the same industry, in this case the home improvement industry. In the dataset, there are monthly stock returns for
both Home Depot (stock ticker HD) and Lowe’s (stock ticker LOW), so let x be the monthly stock return for HD and
let y be the monthly stock return for LOW. For these two companies, there are different factors that might affect the
relationship of their stock returns. On one hand, since they are competitors in the industry, it might be expected that
one company does worse when the other company does better. On the other hand, since both companies are affected
by the same macroeconomic conditions (e.g., home construction levels), it might be expected that both companies do
better than usual (or worse than usual) at the same time.
Examples 7.1 and 7.2 involve numerical data for the variables being considered. While this chapter focuses primarily
on descriptive statistics and visualization for numerical variables, Section 7.1 briefly considers categorical variables
before the case of numerical variables is covered in Section 7.2.
observed sample must be in one and only one of these Cx × Cy categories. As such, the sum of the joint sample counts
over all the categories is equal to the sample size n, and the sum of the joint sample proportions is equal to one.
Example 7.3 (Race and labor-force status) We consider the categorical variables race and lfstatus from the cps
dataset, which allow us to examine whether there is a relationship between race and labor-force status. race has
three categories (“Black”, “White”, “Other”) and lfstatus has three categories (“Employed”, “Unemployed”, “Not
in LF”), so the total number of joint categories is nine. To completely describe the joint sample distribution, a table of
joint sample counts and a table of joint sample proportions can be created in R:
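(A sketch consistent with the commands described below.)

# joint sample counts, with row and column totals added
addmargins(table(cps$lfstatus, cps$race))
# joint sample proportions: divide each count by the sample size
addmargins(table(cps$lfstatus, cps$race)/nrow(cps))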
The command table(cps$lfstatus, cps$race) provides the joint sample counts, while passing
table(cps$lfstatus, cps$race) as an argument to the function addmargins provides row totals and
column totals (labeled by Sum). The table of joint sample proportions, created by the second addmargins command,
divides each count by the sample size (nrow(cps), which is 4,013).
Looking at the top-left element of each table, there are 324 individuals who are black and employed, representing
0.0807 or 8.07% of the sample. The row and column totals make it easy to say something about either of the
individual variables lfstatus and race. For instance, looking at the row labeled Sum, there are 3,188 white individuals,
representing 79.45% of the sample.
Unfortunately, these tables do not make it easy to compare labor-force status of the different racial groups since
there are different numbers of individuals in the three racial categories. Similarly, these two tables do not make it easy
to compare the racial breakdown of the different labor-force statuses since there are different numbers of individuals
in the three labor-force status categories. To facilitate such comparisons, we introduce a different table containing
conditional sample proportions. The idea is to calculate the sample proportions of one categorical variable given the
value of the other categorical variable. As an example, the function prop.table can be used in R to create a table
of the conditional sample proportions of labor-force status, where we condition upon the value of the racial category:
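(A sketch of that command.)

# proportions of labor-force status conditional on race; each column sums to one
prop.table(table(cps$lfstatus, cps$race), margin=2)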
The original table, given by table(cps$lfstatus, cps$race), is passed as the first argument to the
prop.table function. The margin argument indicates which variable should be conditioned on, with margin=2
specifying that the second variable (cps$race) is the one being conditioned on here. It can be verified that the
sum of the proportions in each column is equal to one. The values in this table could have been calculated directly
from the information in the joint sample count table. For instance, the “Black” and “Employed” value, which is the
proportion of black individuals who are employed, is obtained by dividing the joint sample count of 324 by the total
for the “Black” column (476), which yields 324/476 ≈ 0.6807. From this table, we see that the sample proportion of black
individuals who are employed (0.6807) is less than the sample proportion of white individuals who are employed
(0.7039). There is also a lower sample proportion of white individuals who are not in the labor force (0.2707), as
compared to either black individuals (0.2857) or other-race individuals (0.2837).
By changing the margin argument for the prop.table function, we can create a table of conditional sample
proportions of the racial categories, where we condition on labor-force status:
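(A sketch, changing only the margin value.)

# proportions of race conditional on labor-force status; each row sums to one
prop.table(table(cps$lfstatus, cps$race), margin=1)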
The margin=1 argument specifies that we condition on the first variable (cps$lfstatus). It can be verified that
the sum of the proportions in each row is equal to one. The values in this table could have been calculated directly from
the joint sample count table, by dividing any of the joint sample counts by its associated row total. For instance, the
“Black” and “Employed” value, which is the proportion of employed individuals who are black, is obtained by dividing
the joint sample count of 324 by the total for the “Employed” row (2,809), which yields 324/2809 ≈ 0.1153. The proportion
of white individuals among employed individuals (0.7989) is larger than the proportion of white individuals among
unemployed workers (0.7642), whereas the proportion of black individuals among employed individuals (0.1153) is
smaller than the proportion of black individuals among unemployed individuals (0.1509).
Building upon the idea of using conditional sample proportions as a descriptive tool, as in Example 7.3, we can do
something similar graphically by providing a descriptive visual (e.g., a bar chart) of one variable conditioned on the
categorical value of the other variable.
Example 7.4 (Race and labor-force status) To visually assess the association between race and labor-force status,
bar chart versions of the two conditional sample proportions tables from Example 7.3 can be created. Figure 7.1 shows
bar charts of labor-force status given race. Consistent with Example 7.3, the bar chart indicates that the proportion of
employed individuals is highest among white individuals and lowest among black individuals, whereas the proportion
of not-in-labor-force individuals is lowest among white individuals. Here is the R code used to create Figure 7.1:
Figure 7.1
Labor-force status proportions by race
# create sample count table for race and labor-force status variables
tbl_racelf <- table(cps$race, cps$lfstatus)
# barplot command --- categories on x-axis are based upon columns (lfstatus) of the table
barplot(prop.table(tbl_racelf, margin=1), ylim=c(0,0.8), col=c("gray30","gray50","gray70"),
legend.text=rownames(tbl_racelf), beside=TRUE, main="")
When a table is provided as the first argument to the barplot function, the bars are grouped into categories on
the x-axis corresponding to the columns of the table. The table tbl_racelf has been specified to have labor-force
status as the columns. The additional argument beside=TRUE causes the bars for each of the racial categories to
be displayed side-by-side, and the legend.text argument specifies the racial categories, which are given by the
vector returned by rownames(tbl_racelf).
Figure 7.2 shows bar charts of the racial categories given labor-force status. Consistent with Example 7.3, the bar
charts indicate that the proportion of white individuals is highest within the group of employed individuals and lowest
within the group of unemployed individuals, whereas the reverse is true for black individuals. Here is the R code used
to create Figure 7.2:
Figure 7.2
Race proportions by labor-force status
# create sample count table for labor-force status and race variables
tbl_lfrace <- table(cps$lfstatus, cps$race)
# barplot command --- categories on x-axis are based upon columns (race) of the table
barplot(prop.table(tbl_lfrace, margin=1), ylim=c(0,0.8), col=c("gray30","gray50","gray70"),
legend.text=rownames(tbl_lfrace), beside=TRUE,
args.legend = list(x="topleft",inset=0.01), main="")
The primary difference from the previous code is that the rows and columns of the created table (tbl_lfrace)
are now labor-force status and race, respectively. The position of the legend is also specified since the default position
would interfere with the displayed bars.
7.1.2 Bivariate data with one categorical variable and one numerical variable
To assess the relationship between a categorical variable and a numerical variable, we can examine how the descriptive
statistics and distribution of the numerical variable vary over different categories of the categorical variable. With
descriptive statistics, the easiest approach is to report descriptive statistics of the numerical variable for each possible
value of the categorical variable. This approach is equivalent to breaking the full sample into different subsamples,
each of which corresponds to one of the categories for the categorical variable.
Example 7.5 (Race and earnings) Let’s examine the relationship between weekly earnings (earnwk) and racial
category (race) from the cps dataset. The full sample of employed individuals has n = 2809, which is broken into
three subsamples for race = “Black” (324 observations), race = “White” (2,244 observations), and race = “Other”
(241 observations). For instance, if cpsemployed is a data frame in R for the 2,809 employed individuals, the
subsample descriptive statistics can be calculated with the tapply function:
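(A sketch consistent with the discussion below.)

# weekly-earnings statistics computed separately for each racial category
tapply(cpsemployed$earnwk, cpsemployed$race, mean)
tapply(cpsemployed$earnwk, cpsemployed$race, sd)
iqrvec <- tapply(cpsemployed$earnwk, cpsemployed$race, IQR)
iqrvec["White"]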
Had we used the original data frame cps, which contains missing values for the weekly earnings variable,
rather than the cpsemployed data frame, the optional argument na.rm=TRUE would have been required for
the last two uses of the tapply function above. For example, the sample standard deviation command would be
tapply(cps$earnwk, cps$race, sd, na.rm=TRUE). For the IQR calculations, the results are stored in
the vector iqrvec, and the specific IQR value for the white subsample of workers is obtained by referring to the
"White" index within square brackets.
Figure 7.3
Box plots of weekly earnings by race
In the same way that descriptive statistics can be applied to different subsamples based upon the categorical values,
descriptive visuals can also be applied to these subsamples. For example, histograms, density curves, and/or box plots
can be drawn and compared for the different subsamples.
Example 7.6 (Race and earnings) Figure 7.3 shows a box plot of weekly earnings (earnwk) for each of the three
subsamples. From left to right, the box plots correspond to black workers, other-race workers, and white workers.
Comparing the box plots for black and white workers, there is a higher sample median for white workers, a larger IQR
(height of the box) for white workers, and a more pronounced right skew for white workers. Figure 7.3 is particularly
easy to create in R:
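(A sketch; the axis labels are taken from the figure.)

# box plots of weekly earnings, split by the race variable
boxplot(cps$earnwk ~ cps$race, xlab="Race", ylab="Weekly earnings")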
The syntax here for the first argument of the boxplot function has a tilde character (~) between two variables,
where the first variable (cps$earnwk) is the variable of interest for the box plot and the second variable
(cps$race) is a categorical variable used to split the sample into subsamples. For a similar figure to be drawn
based upon gender rather than race, cps$earnwk~cps$gender would be the first argument (and xlab should
be appropriately changed).
Histograms and density curves can be used as alternatives to box plots. As an example, Figure 7.4 shows the density
curves of the earnwk variable for the three racial categories. To easily compare the distributions, the density curves
are drawn on the same graph, with the same x-axis and y-axis. This figure tells much the same story as Figure 7.3. All
three earnings distributions have a unimodal shape, with the earnings distribution for white workers exhibiting a much
thicker right tail than the earnings distribution for black workers. This thicker right tail explains the higher dispersion
statistics seen in Example 7.5. The hump for the black-worker earnings distribution is considerably higher and peaks
at a lower earnings level. A comparison of the curves indicates that a much larger proportion of black workers have
weekly earnings below $1000 than either white workers or other-race workers.

Figure 7.4
Density of weekly earnings by race
Figure 7.4 can be created along the following lines in R:
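(A sketch: the subsample vector names, axis limits, and legend placement are assumptions.)

# earnings vectors for the three racial subsamples
earnwk_black <- cpsemployed$earnwk[cpsemployed$race=="Black"]
earnwk_other <- cpsemployed$earnwk[cpsemployed$race=="Other"]
earnwk_white <- cpsemployed$earnwk[cpsemployed$race=="White"]
# density curves on a common graph, distinguished by line type
plot(density(earnwk_black), lty=1, main="", xlab="Weekly earnings", xlim=c(0,8000), ylim=c(0,0.0013))
lines(density(earnwk_other), lty=3)
lines(density(earnwk_white), lty=2)
legend("topright", legend=c("Black","Other","White"), lty=c(1,3,2))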
First, vectors containing the earnwk values for each of the three subsamples are created. Then, each of the three
density curves are plotted on the same graph, with each having a different line type (lty=1 (the default) being a solid
line for black individuals, lty=3 being a dotted line for other-race individuals, and lty=2 being a dashed line for
white individuals).
In addition to a standard scatter plot for two variables, R can draw multiple scatter plots simultaneously for a set
of three or more variables. This expanded scatter plot provides a scatter plot for every possible pair of variables from
among the set of variables specified. To see how the expanded scatter plot works in practice, let’s add an additional
variable (hours worked last week (hrslastwk)) to the two variables (educ and earnwk) already specified. Figure 7.6
shows the expanded scatter plot for these three variables. There are six different plots since there are P3,2 = (3)(2) = 6
ways to choose an ordered pair of variables from a set of three variables. (In general, for a set of k variables, an expanded
scatter plot has Pk,2 = (k)(k – 1) scatter plots.) The scatter plot in the middle of the top row is a plot of educ versus
earnwk, whereas the scatter plot on the left of the second row is a plot of earnwk versus educ, essentially reversing
the roles of the x and y variables. Similarly, the scatter plot on the right of the second row is a plot of earnwk versus
hrslastwk, whereas the scatter plot in the middle of the bottom row is a plot of hrslastwk versus earnwk. These two
earnings and hours worked plots indicate a positive relationship between earnings and hours worked, which is perhaps
unsurprising. In addition to earnings tending to be higher when hours worked are larger, the dispersion of earnings
also appears to increase at higher levels of hours worked.
The R code to create the expanded scatter plot in Figure 7.6 is rather simple:
Figure 7.5
Scatter plot of weekly earnings versus years of education
# expanded scatter plot, with weekly earnings, years of education, and weekly hours worked
plot(cps[,c("educ","earnwk","hrslastwk")])
Here, the function plot takes a data frame as its first and only argument, and in this case the three columns for the
variables of interest are selected.
Example 7.8 (Monthly stock returns) Continuing Example 7.2, let’s look at the relationship between the monthly stock
returns for Home Depot (HD) and Lowe’s (LOW). Figure 7.7 shows a scatter plot of LOW (y-axis) versus HD (x-axis).
The plot indicates a positive association between HD and LOW, as the cloud of points are roughly contained within
an oval stretching from the bottom left to the upper right of the plot. It is more likely to see low values of LOW when HD
is low and high values of LOW when HD is high. This positive association supports the idea that the two companies’
monthly returns tend to move in the same direction due to common macroeconomic conditions affecting their industry.
An expanded scatter plot can include more variables, so let’s add two additional stocks, Bank of America (stock
ticker BAC) and Wells Fargo (stock ticker WFC), both in the banking industry. Figure 7.8 shows the expanded scatter
plot with four stocks (HD, LOW, BAC, WFC).
Figure 7.6
Expanded scatter plot of weekly earnings, education, and weekly hours worked
The two BAC-WFC plots, located at the bottom right of the figure, indicate a positive association similar to the one
for HD and LOW. BAC tends to be higher when WFC is higher, and BAC tends to be lower when WFC is lower. For
some of the other plots, specifically the ones across industries (e.g., one stock from home improvement (HD or LOW)
versus one stock from banking (BAC or WFC)), there are positive relationships but not as strong as either the HD-
LOW relationship or the BAC-WFC relationship. For example, in the plot of Lowe’s (LOW) versus Bank of America
(BAC), located in the third box of the second row, the cloud of points has a roughly oval shape which is slightly tilted
to the right from vertical. The slight tilt indicates a positive relationship, but the tilt is not as strong as that seen in, for
example, the plot of Lowe’s (LOW) versus Home Depot (HD), located in the first box of the second row.
Figure 7.7
Scatter plot of Lowe’s monthly returns versus Home Depot’s monthly returns
Definition 7.1 For observations (x1, y1), (x2, y2), …, (xn, yn), the sample covariance between x and y, denoted sxy, is
sxy = (1/(n–1)) Σᵢ₌₁ⁿ (xi – x̄)(yi – ȳ).
The 1/(n–1) scaling for the sample covariance is the same as the scaling for the sample variance s²x. The units of the
sample covariance sxy are the units of x times the units of y. For example, if x is education (in years) and y is weekly
earnings (in dollars), the units of sxy are years × dollars. The sample variance is a special case of the sample covariance,
obtained by taking the covariance of a variable x with itself:
sxx = (1/(n–1)) Σᵢ₌₁ⁿ (xi – x̄)(xi – x̄) = (1/(n–1)) Σᵢ₌₁ⁿ (xi – x̄)² = s²x.
Also, the ordering of the variables x and y does not matter: sxy = syx since (xi – x̄)(yi – ȳ) = (yi – ȳ)(xi – x̄) for each i.
There are four types of contributions to the summation Σᵢ₌₁ⁿ (xi – x̄)(yi – ȳ) in the sample covariance sxy:
Figure 7.8
Expanded scatter plot of monthly stock returns
Figure 7.9
Scatter plot of seven-observation sample
(i) Observation (xi, yi) with xi above its mean and yi above its mean:
xi > x̄ and yi > ȳ implies (xi – x̄)(yi – ȳ) > 0
(ii) Observation (xi, yi) with xi below its mean and yi below its mean:
xi < x̄ and yi < ȳ implies (xi – x̄)(yi – ȳ) > 0
(iii) Observation (xi, yi) with xi above its mean and yi below its mean:
xi > x̄ and yi < ȳ implies (xi – x̄)(yi – ȳ) < 0
(iv) Observation (xi, yi) with xi below its mean and yi above its mean:
xi < x̄ and yi > ȳ implies (xi – x̄)(yi – ȳ) < 0
The first two types of observations, with xi and yi either both above their means or both below their means, lead
to positive contributions to the sample covariance. The second two types of observations, with xi and yi on opposite
sides of their means, lead to negative contributions to the sample covariance. The overall sample covariance measure
involves a combination of all four types of observations, so its sign depends upon the relative magnitudes of the
(xi – x̄)(yi – ȳ) > 0 contributions from (i) and (ii) versus the magnitudes of the (xi – x̄)(yi – ȳ) < 0 contributions from (iii)
and (iv).
Generally speaking, there is a positive sample covariance when larger values of x tend to be associated with larger
values of y and smaller values of x tend to be associated with smaller values of y. In terms of the sample averages, the
sample covariance is positive when the x and y values are more likely to be on the same side of x̄ and ȳ, respectively,
than they are to be on opposite sides. On the other hand, there is a negative sample covariance when larger values of
x tend to be associated with smaller values of y and smaller values of x tend to be associated with larger values of y.
In terms of the sample averages, the sample covariance is negative when the x and y values are more likely to be on
opposite sides of x̄ and ȳ, respectively, than they are to be on the same side.
Example 7.9 Consider the following bivariate data with seven observations (n = 7):
{(xi, yi)} = {(4, 8), (3, 6), (8, 10), (12, 1), (0, 15), (10, 3), (5, 6)}.
Figure 7.9 shows a scatter plot of these data, indicating a clear negative relationship between x and y. The following
table provides a detailed calculation of (xi – x̄)(yi – ȳ) for each observation:
i 1 2 3 4 5 6 7
xi 4 3 8 12 0 10 5
yi 8 6 10 1 15 3 6
xi – x̄ –2 –3 2 6 –6 4 –1
yi – ȳ 1 –1 3 –6 8 –4 –1
(xi – x̄)(yi – ȳ) –2 +3 +6 –36 –48 –16 +1
There are three positive contributions and four negative contributions to the sample covariance, with the negative
contributions considerably larger in magnitude. In particular, the (12, 1) point and the (0, 15) point have contributions
of –36 and –48, respectively, since their x and y values are very far away from their respective means. The (12, 1) point
is a type (iii) observation with x above its mean and y below its mean, while the (0, 15) point is a type (iv) observation
with x below its mean and y above its mean. On the scatter plot, these two points are the ones in the lower-right corner
and the upper-left corner, respectively. Summing the values in the table’s bottom row and dividing by n – 1, the sample
covariance is
sxy = (1/(7–1))(–2 + 3 + 6 – 36 – 48 – 16 + 1) = (1/6)(–92) = –46/3 ≈ –15.33.
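(The calculation can be confirmed in R, either directly from the definition or with the built-in cov function.)

x <- c(4, 3, 8, 12, 0, 10, 5)
y <- c(8, 6, 10, 1, 15, 3, 6)
# sample covariance computed from its definition
sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)
## [1] -15.33333
cov(x, y)
## [1] -15.33333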
Example 7.10 (Monthly stock returns) In Example 7.8, the positive association between the monthly stock returns of
Home Depot (HD) and Lowe’s (LOW) is evident from the scatter plot (Figure 7.7). The sample covariance is positive,
with sHD,LOW = 0.004378 calculated in R using the cov function with the two variables as arguments:
cov(sp500$HD, sp500$LOW)
## [1] 0.004378172
To understand why the sample covariance is positive, Figure 7.10 re-draws the scatter plot of LOW versus HD with
a horizontal line drawn at the sample mean of LOW (0.02029) and a vertical line drawn at the sample mean of HD
(0.01668).
The positive contributions to the sample covariance are made by the observations in the upper-right quadrant
(type (i) observations) and the lower-left quadrant (type (ii) observations). The negative contributions to the sample
covariance are made by the observations in the lower-right quadrant (type (iii) observations) and the upper-left
quadrant (type (iv) observations). There are many more observations in the upper-right and lower-left quadrants
(positive contributions) than there are in the lower-right and upper-left quadrants (negative contributions). To be
more precise, there are 139 points of type (i), 141 points of type (ii), 43 points of type (iii), and 41 points of type (iv).
The sum of the (xi – x̄)(yi – ȳ) terms for the four types of observations are 0.933703 (type (i)), 0.842564 (type (ii)),
–0.075304 (type (iii)), and –0.111687 (type (iv)). Adding these four values together and dividing by n – 1 = 363 yields
the sample covariance sHD,LOW = 0.004378.
Example 7.11 (Education and earnings) In Example 7.7, the positive relationship between weekly earnings and
educational attainment was evident in the scatter plot of earnwk versus educ (Figure 7.5). As in Example 7.10,
the scatter plot can be re-drawn with lines at the sample means of the two variables. This scatter plot is shown
Figure 7.10
Scatter plot of LOW versus HD, with lines at sample means
in Figure 7.11, with a horizontal line at the sample mean of earnwk (971.18 dollars) and a vertical line at the
sample mean of educ (12.82 years). For these data, there are 600 upper-right quadrant (type (i)) observations, 1,113
lower-left quadrant (type (ii)) observations, 702 lower-right quadrant (type (iii)) observations, and 394 upper-left
quadrant (type (iv)) observations, with overall contributions to the sum of (xi – x̄)(yi – ȳ) given by 1388551, 811985, –335102, and –218892,
respectively. In this example, it is not just the number of upper-right quadrant (type (i)) observations that influences the
sample covariance but also the large magnitude of the (xi – x̄)(yi – ȳ) contributions caused by the considerable number
of points that have earnwk (y) values far above the sample mean of earnwk (ȳ). The resulting sample covariance is
seduc,earnwk = 586.375.
cov(cpsemployed$educ, cpsemployed$earnwk)
## [1] 586.3751
The units of the sample covariance seduc,earnwk are years × dollars or, equivalently, dollars × years. These strange
units make it very difficult to interpret what the numerical value 586.375 means. While the positive covariance does
reflect the positive relationship between earnwk and educ, the numerical value of the sample covariance may not be
an ideal statistic for quantifying this relationship.
As seen in Example 7.11, the sample covariance value can be difficult to interpret, even if the sign of the sample
covariance indicates whether there is a positive or negative relationship between two variables. The problem involves
the units of the sample covariance, which are the units of x times the units of y. It would be more useful if we had
a descriptive statistic that could be compared for different pairs of bivariate data. For example, is the relationship
Figure 7.11
Scatter plot of weekly earnings versus education, with lines at sample means
between educ and earnwk stronger than the relationship between hrslastwk and earnwk? The sample covariance is not
useful for this purpose since the units of the first sample covariance are years × dollars and the units of the second
sample covariance are hours × dollars. Even after the sample covariances seduc,earnwk and shrslastwk,earnwk are calculated,
their values are not comparable since they are measured in different units.
To address this undesirable feature of the sample covariance, we introduce the sample correlation, a descriptive
statistic that also measures the linear association between two variables but in a way that is comparable across different
pairs of variables.
Definition 7.2 For observations (x1, y1), (x2, y2), …, (xn, yn), the sample correlation between x and y, denoted rxy, is
rxy = sxy/(sx sy),
where sxy = (1/(n–1)) Σᵢ₌₁ⁿ (xi – x̄)(yi – ȳ), sx = √((1/(n–1)) Σᵢ₌₁ⁿ (xi – x̄)²), and sy = √((1/(n–1)) Σᵢ₌₁ⁿ (yi – ȳ)²).
The sample correlation between x and y is the sample covariance between x and y divided by the product of the
standard deviations of x and y. Importantly, since the units of sx are the units of x and the units of sy are the units of
y, the units of the numerator (units of x times units of y) cancel the units of the denominator, leading to the sample
correlation rxy being unitless. This fact and additional properties about the sample correlation are stated in the following
proposition:
Proposition 7.1. (Properties of the sample correlation) The sample correlation rxy has the following properties:
(i) rxy is unitless;
(ii) the sign of the sample correlation is the same as the sign of the sample covariance,
sign(rxy ) = sign(sxy );
(iii) rxx = 1;
(iv) –1 ≤ rxy ≤ 1.
Property (i) has already been discussed. For property (ii), note that both sx > 0 and sy > 0 since standard deviations
are positive. Therefore, the denominator in sxy/(sx sy) is also positive, meaning the sign of rxy must be the same as the
sign of the numerator sxy. Property (iii) says that the correlation of a variable x with itself is exactly equal to one. For
rxx = sxx/(sx sx), the numerator and denominator are both equal to the sample variance s²x, yielding rxx = 1. Finally, property (iv) says that
The proof of this property is beyond the scope of this book, but the property makes sense if two extreme cases are
considered. Intuitively, the strongest positive correlation should be between a variable x and itself, in which case the
sample correlation is equal to 1 from property (iii). Likewise, the strongest negative correlation should be between a
variable x and its negative (–x), in which case the sample correlation is equal to –1.¹⁷ As property (iv) states, any other
sample correlation has a value between the two extremes of –1 and 1.
To see how sample correlation values relate to the bivariate association in scatter plots, Figure 7.12 shows a set of
six different scatter plots, each with a different rxy value. The top row of three scatter plots has rxy = 0.4, rxy = 0.8, and
rxy = 1, and the bottom row of three scatter plots has rxy = –0.4, rxy = –0.8, and rxy = 0. The rxy = 1 scatter plot indicates
a perfect linear and positive relationship between x and y. In this case, a positive-sloped line can be drawn through
all of the points. For rxy = 0.4 and rxy = 0.8, both plots reveal a positive relationship between x and y, but the larger
correlation (rxy = 0.8) appears to describe a stronger positive relationship in the sense that the cloud of points is closer
to the extreme of a linear relationship as compared with the smaller correlation (rxy = 0.4). The comparison between
the rxy = –0.4 and rxy = –0.8 scatter plots is similar. Both indicate a negative relationship between x and y, with the
rxy = –0.8 plot indicating a stronger negative relationship as the cloud of points is closer to the extreme of a negative
linear relationship as compared with rxy = –0.4. Finally, the rxy = 0 scatter plot is a case where there is no evident
relationship between x and y. The cloud of points in this case does not show a tendency to be either upward sloping or
downward sloping.
Before considering some empirical examples involving the sample correlation, it’s important to discuss one potential
pitfall associated with the sample correlation measure. As stated previously, the sample correlation is only meant to
measure the linear association between two variables. As such, a sample correlation of zero means that there is no
linear relationship between two variables, but it does not necessarily mean that there is no relationship whatsoever
between the two variables. Figure 7.13 pictures a rather extreme example, with a scatter plot of points that lie exactly
along a parabola. For these bivariate data, there is a perfect non-linear relationship between x and y. However, it turns
out that the sample correlation between the two variables is exactly equal to zero. The scatter plot on the right, with
horizontal and vertical lines drawn at the sample means, indicates why this is the case. The positive contributions to
the sample covariance from the upper-right quadrant are cancelled out exactly by the negative contributions from the
upper-left quadrant, and similarly the positive contributions from the lower-left quadrant are cancelled out exactly by
the negative contributions from the lower-right quadrant. The resulting sample covariance is zero, meaning the sample
correlation is also zero.
There’s no way to avoid this feature of the sample correlation, as the sample correlation is specifically designed
to measure the linear association between two variables. However, this non-linear scatter plot example highlights the
importance of using both descriptive visuals, like scatter plots, and descriptive statistics, like sample correlations, when
examining the association between variables.
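The parabola example can be reproduced numerically as well; here is a minimal sketch using a made-up symmetric grid of points (not the actual data behind Figure 7.13):
x <- seq(-2, 2, by = 0.5)
y <- x^2       # a perfect non-linear (parabolic) relationship
cor(x, y)      # zero, up to floating-point error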
With the sample correlation measure now in our descriptive statistics toolkit, we re-visit the previous examples.
Example 7.12 (Education and earnings) In Example 7.11, the sample covariance of educational attainment and
weekly earnings was seduc,earnwk = 586.375. Dividing by the product of the sample standard deviations of educ and
earnwk (with seduc = 2.4030) gives the sample correlation reduc,earnwk = seduc,earnwk/(seduc searnwk) ≈ 0.325, as the R
output below confirms.
Figure 7.12
Scatter plots for different sample correlations
Figure 7.13
Scatter plot for a sample with zero sample correlation
cor(cpsemployed$educ, cpsemployed$earnwk)
## [1] 0.3251519
Since the sample correlation is unitless, it doesn’t matter which units are used for educ or earnwk. For instance, if
earnings were measured in thousands of dollars rather than dollars (i.e., earnwkthous = earnwk/1000),
reduc,earnwkthous = seduc,earnwkthous/(seduc searnwkthous) = (seduc,earnwk · (1/1000))/(seduc · (searnwk · (1/1000))) = reduc,earnwk.
Similarly, if education were measured in months rather than years (i.e., educmonths = 12educ),
reducmonths,earnwk = seducmonths,earnwk/(seducmonths searnwk) = (seduc,earnwk · 12)/((seduc · 12) searnwk) = reduc,earnwk.
With the sample correlation being unitless, the sample correlation reduc,earnwk can also be compared to the sample
correlation between earnwk and some other variable. For instance, for the relationship between weekly earnings and
hours worked, the sample correlation is rhrslastwk,earnwk ≈ 0.368.
cor(cpsemployed$hrslastwk, cpsemployed$earnwk)
## [1] 0.3681961
A comparison of this value with the reduc,earnwk value (0.325) indicates that the two sample correlations are quite
similar, with some evidence that the positive relationship is slightly larger for earnwk and hrslastwk than it is for
earnwk and educ.
Example 7.13 (Monthly stock returns) Example 7.10 considered the relationship between monthly stock returns for
Home Depot (HD) and Lowe’s (LOW) by drawing a scatter plot and providing the sample covariance sHD,LOW . The
sample correlation between HD and LOW is
rHD,LOW = sHD,LOW/(sHD sLOW) = 0.004378/((0.073706)(0.091598)) ≈ 0.648.
cor(sp500$HD, sp500$LOW)
## [1] 0.6484954
A sample correlation of 0.648 seems pretty high, but how does it compare to the sample correlations for other pairs
of stocks? Since the sample correlations for other pairs of stocks can be directly compared to the rHD,LOW value, we
can look at the sample correlations among a larger group of stocks. Previously, Bank of America (BAC) and Wells
Fargo (WFC) were considered in the expanded scatter plot of Example 7.8. Let’s add two more companies to the mix,
specifically Marathon Oil (stock ticker MRO) and ConocoPhillips (stock ticker COP), both of which are in the oil
industry.
A correlation matrix can be created in R using the cor function and a single argument that contains a data frame
with the desired variables:
cor(sp500[,c("HD","LOW","BAC","WFC","MRO","COP")])
## HD LOW BAC WFC MRO COP
## HD 1.0000000 0.6484954 0.3311974 0.2803797 0.1887287 0.2145689
## LOW 0.6484954 1.0000000 0.3566068 0.2619187 0.1811412 0.2560374
## BAC 0.3311974 0.3566068 1.0000000 0.6919604 0.3313437 0.3392213
## WFC 0.2803797 0.2619187 0.6919604 1.0000000 0.3793915 0.3960973
## MRO 0.1887287 0.1811412 0.3313437 0.3793915 1.0000000 0.7709914
## COP 0.2145689 0.2560374 0.3392213 0.3960973 0.7709914 1.0000000
This correlation matrix provides all possible pairwise sample correlations of the monthly returns for a set of
six stocks (HD, LOW, BAC, WFC, MRO, COP). Interestingly, all of the sample correlations are positive, which is
consistent with the idea that all six companies are affected by overall macroeconomic conditions, especially since the
time frame here (over 20 years of returns data) is quite long. There are a lot of duplicate values in the correlation
matrix since the sample correlation of x and y (rxy ) is the same as the sample correlation of y and x (ryx ). For instance,
rBAC,LOW = rLOW,BAC = 0.357 appears both as the third value in the second row and the second value in the third row.
To make the correlation matrix easier to read, it is common to report only the upper-right triangle of sample
correlations (above the diagonal) or the lower-left triangle (below the diagonal). Here is a version of the correlation
matrix that reports the upper-right triangle of sample correlations:
      HD     LOW    BAC    WFC    MRO    COP
HD    1.000  0.648  0.331  0.280  0.189  0.215
LOW          1.000  0.357  0.262  0.181  0.256
BAC                 1.000  0.692  0.331  0.339
WFC                        1.000  0.379  0.396
MRO                               1.000  0.771
COP                                      1.000
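A display like this can be produced in R; here is a minimal sketch using base functions (the object name cm is ours):
cm <- round(cor(sp500[,c("HD","LOW","BAC","WFC","MRO","COP")]), 3)
cm[lower.tri(cm)] <- NA      # blank out the duplicated lower triangle
print(cm, na.print = "")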
The three highest values in the correlation matrix are rMRO,COP = 0.771, rBAC,WFC = 0.692, and rHD,LOW = 0.648. Not
coincidentally, these three stock pairs correspond to the three pairs of companies that are in the same industry, with
MRO and COP in the oil industry, BAC and WFC in the banking industry, and HD and LOW in the home improvement
industry. Looking at COP, for instance, its sample correlation with MRO (0.771) is much larger than its sample
correlations with the other four companies, which range from 0.215 to 0.396. Similarly, for HD, its sample correlation
with LOW (0.648) is much larger than its sample correlations with the other four companies, which range from 0.189
to 0.331. The four lowest values in the correlation matrix are rLOW,MRO = 0.181, rHD,MRO = 0.189, rHD,COP = 0.215, and
rLOW,COP = 0.256, suggesting that the shocks affecting the home improvement and oil industries are less related to each
other than the shocks affecting other pairs of the three industries.
# sample mean of weekly earnings for non-union workers
mean(cpsemployed$earnwk[cpsemployed$unionstatus != "Union"])
## [1] 946.5013
# sample correlation between union indicator and weekly earnings
union_var <- ifelse(cpsemployed$unionstatus=="Union",1,0)
cor(union_var, cpsemployed$earnwk)
## [1] 0.0996309
Among the 276 union workers with x = 1, the sample mean of weekly earnings is $1,198 (ȳ1 = 1198). Among the
2,533 non-union workers with x = 0, the sample mean of weekly earnings is $947 (ȳ0 = 947). Since ȳ1 > ȳ0 , the sample
correlation between x and y is positive, with rxy ≈ 0.100.
The results from Proposition 7.2 hold if y is also a binary variable. In that case, ȳ1 is the proportion of the x = 1
subsample with y = 1, and ȳ0 is the proportion of the x = 0 subsample with y = 1. The sample correlation between x and
y is positive if the proportion of y = 1 observations is higher in the x = 1 subsample than it is in the x = 0 subsample,
and negative otherwise.
svw = (1/(n–1)) Σ_{i=1}^n (vi – v̄)(wi – w̄)
= (1/(n–1)) Σ_{i=1}^n (a + bxi – (a + bx̄))(c + dyi – (c + dȳ))
= (1/(n–1)) Σ_{i=1}^n (b(xi – x̄))(d(yi – ȳ))
= bd · (1/(n–1)) Σ_{i=1}^n (xi – x̄)(yi – ȳ)
= bd sxy,
where the second equality follows from v̄ = a + bx̄ and w̄ = c + dȳ (Proposition 6.6).
Then, the sample correlation between v and w is
rvw = svw/(sv sw) = (bd sxy)/((|b|sx)(|d|sy)) = (bd/(|b||d|)) · (sxy/(sx sy)) = (bd/(|b||d|)) rxy.
The expression bd/(|b||d|) is equal to 1 if b and d have the same sign, which occurs when b and d are both positive or both
negative, and –1 if b and d have opposite signs. Therefore, rvw = rxy when b and d have the same sign, and rvw = –rxy
when b and d have opposite signs. The additive constants a and c do not affect either svw or rvw.
These results for svw and rvw are summarized in the following proposition:
Proposition 7.3. Suppose a, b, c, and d are known constants, v = a + bx is a linear transformation of x,
and w = c + dy is a linear transformation of y. The sample covariance and sample correlation for the sample
{(v1 , w1 ), (v2 , w2 ), …, (vn , wn )} have the following relationships to the sample covariance and sample correlation for
the sample {(x1 , y1 ), (x2 , y2 ), …, (xn , yn )}:
(i) (sample covariance) svw = bd sxy. The sign of svw is the same as the sign of sxy when b and d are both positive or
both negative, and the sign of svw is opposite the sign of sxy when b and d have opposite signs.
(ii) (sample correlation) rvw = rxy if b and d are both positive or both negative, and rvw = –rxy if b and d have
opposite signs.
(iii) (transforming one variable and not the other) svy = b sxy, and rvy = rxy if b ≥ 0 and rvy = –rxy if b < 0.
Property (iii), where one variable (v = a + bx) is transformed but the other is not, is just a special case of transforming
both variables with c = 0 and d = 1.
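Proposition 7.3 is straightforward to verify numerically; the sketch below uses made-up data and constants (all names and values are ours):
set.seed(1)
x <- rnorm(100)
y <- 0.5*x + rnorm(100)
v <- 3 + 2*x           # v = a + bx with a = 3, b = 2
w <- -1 - 4*y          # w = c + dy with c = -1, d = -4
all.equal(cov(v, w), 2*(-4)*cov(x, y))   # property (i): svw = bd*sxy
all.equal(cor(v, w), -cor(x, y))         # property (ii): b > 0, d < 0, so rvw = -rxy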
In a situation where v and w are constructed by simply changing the units of x and y, in which case b and d would
be positive numbers, the sign of svw is the same as the sign of sxy , but the magnitude of svw is scaled by bd. In this case,
rvw = rxy , meaning the sample correlation has the nice property that it is not affected by the units of the two variables.
Example 7.15 (Earnings and height) Suppose x denotes weekly earnings and y denotes height in inches, with sample
covariance sxy and sample correlation rxy. If v is annualized earnings (v = 52x) and w is height in feet (w = (1/12)y), the
sample covariance svw and the sample correlation rvw are
svw = (52)(1/12) sxy = (13/3) sxy and rvw = rxy.
Example 7.16 (Earnings and education) Suppose x denotes weekly earnings and y denotes educational attainment
in years. If x is transformed into annualized earnings, so that v = 52x, the sample covariance between annualized
earnings and education (in years) is svy = 52sxy , and the sample correlation is rvy = rxy . For the cps dataset, the sample
covariance and sample correlation are sxy = 586.3751 and rxy = 0.325, so that svy = 30491.51 and rvy = 0.325.
Example 7.17 (Website profits) Recall the website profit example considered in Example 6.26, where x is the daily
widget purchases, p is the price per widget, f is the website’s daily fixed cost, and c is the marginal cost per widget.
Let’s assume that p > c, which means that the price for a widget is greater than the marginal cost of a widget. The daily
profit is y = –f + (p – c)x. If we have data on both x and y, the sample covariance is sxy = (p – c)sxx = (p – c)s²x, and the
sample correlation is 1 since (p – c) > 0. There is a perfect linear relationship between x and y here since y is defined
to be a linear function of x, based on the three constants (f , p, c).
Suppose we also have data on a variable u measuring the daily number of unique visitors to the website. A positive
sample correlation is expected between u and x (more likely to have more purchases on days with more visitors and
fewer purchases on days with fewer visitors), but it would not be a perfect linear relationship. In thinking about the
relationship between daily visitors and daily profits, how would the sample covariance suy and the sample correlation
ruy compare to the sample covariance sux and the sample correlation rux ? Since suy = (p – c)sux and ruy = rux (since
(p – c) > 0), the sample correlation between visitors and profits is the same as the sample correlation between visitors
and purchases. The equality ruy = rux arises since profits have a perfect linear relationship with purchases.
Figure 7.14
Variation in x + y for different sample correlations
From the formula s²v = s²x + s²y + 2sxy, a negative correlation (sxy < 0) leads to s²v < s²x + s²y. The variance of the
sum of the variables is less than the sum of the variances of the variables in this case. The reduction in variance arises
precisely because of the negative relationship between x and y, where the tendency of the variables to be on opposite
sides of their means also leads to a tendency for x + y to be closer to its mean.
Next, we discuss the other result in property (ii), which involves the sample variance of the difference of two
variables. When v = x – y, the resulting sample variance is s²v = s²x + s²y – 2sxy. In the case of no correlation between x
and y (sxy = 0), we have s²v = s²x + s²y, so that the sample variance of the difference of two variables is the sum of the
i i
i i
i i
“"ps4e (ECO 329 Fall 2024)"” — 2024/8/20 — 7:04 — page 169 — #176
i i
variances of the two variables. When there is correlation, the reasoning for the presence of the –2sxy is similar to that
from above. Let's start with the positive correlation (sxy > 0) case. When x and y are positively correlated, their values
tend to be on the same side of their respective means, so that the difference x – y tends to be smaller in magnitude than
it would be in the case of zero correlation. The positive relationship between x and y leads to counteracting effects for
x – y, in contrast to the “exaggerating” effects for x + y, and causes the sample variance s²v to be lower, due to the –2sxy
term, than it would be in the zero correlation case. On the other hand, when x and y are negatively correlated, they tend
to be on opposite sides of their respective means, which means the difference x – y tends to be larger in magnitude as
compared to the case of zero correlation. The negative relationship between x and y increases the dispersion of x – y,
with the term –2sxy being positive since sxy < 0.
Example 7.18 (Earnings of siblings) Suppose we have data on the earnings of a sample of adult siblings, where x is
weekly earnings for the older sibling and y is weekly earnings for the younger sibling. For these variables, a positive
correlation (rxy > 0) is expected due to common factors like similar parenting, similar educational background, similar
socioeconomic background, etc. Thus, the variance of their combined weekly earnings, v = x + y, is larger than the sum
of the variances of their individual weekly earnings: s²v = s²x+y > s²x + s²y. The positive correlation between the siblings'
earnings leads to a greater dispersion in the sum of their earnings. As compared to two randomly chosen individuals
from the population (i.e., non-siblings), in which case there is zero correlation, there is a larger variance for the sum
of the earnings of adult siblings.
How about the average of the siblings' weekly earnings? For v = (x + y)/2 = (1/2)x + (1/2)y, the sample variance of v is
s²v = s²(1/2)x+(1/2)y = (1/4)s²x + (1/4)s²y + 2(1/2)(1/2)sxy = (1/4)s²x + (1/4)s²y + (1/2)sxy.
How about the difference between the siblings’ weekly earnings, say v = x – y (older sibling’s wages minus younger
sibling’s wages)? For v = x – y, the sample variance of v is
s²v = s²x–y = s²x + s²y – 2sxy,
with the positive correlation in their wages leading to a decreased dispersion in the difference, as compared to the
zero correlation case.
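These variance relationships can be checked numerically; below is a minimal sketch with made-up, positively correlated data standing in for the siblings' earnings (all names and values are ours):
set.seed(7)
x <- rnorm(50, mean = 1000, sd = 200)          # older sibling's weekly earnings
y <- 0.6*x + rnorm(50, mean = 400, sd = 150)   # positively correlated with x
var(x + y)
var(x) + var(y) + 2*cov(x, y)   # equals var(x + y)
var(x - y)
var(x) + var(y) - 2*cov(x, y)   # equals var(x - y), and is smaller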
Example 7.19 (Two-stock portfolio) Suppose we have data on the returns for two different stocks, with x denoting
the return for stock A and y denoting the return for stock B. These could be monthly returns or annual returns or
returns for some other time interval. We won’t specify the time interval for now, in the interest of keeping things
general. By applying linear combinations, these data can be used to see how a particular two-stock portfolio would
have performed over the same time period. Specifically, consider the two-stock portfolio where a fraction a is invested
in stock A and the remainder, (1 – a), is invested in stock B, with a being a constant between 0 and 1. Then, the return
on the two-stock portfolio is the linear combination
v = ax + (1 – a)y.
Applying Proposition 7.4, the sample mean and variance of the return on the two-stock portfolio are
v̄ = ax̄ + (1 – a)ȳ
and
s²v = a²s²x + (1 – a)²s²y + 2a(1 – a)sxy.
The sample mean of the portfolio return is a weighted average of the sample means for the two stocks, with weight
a placed on stock A's sample mean and weight 1 – a placed on stock B's sample mean. The sample variance s²v is
a measure of the risk associated with the portfolio since it tells us how much variability there is in the portfolio’s
observed returns. Part of this sample variance comes from the sample variances of the individual stocks, reflected by
the a²s²x and (1 – a)²s²y terms in the s²v formula. But there is also a third term, 2a(1 – a)sxy. If the stocks' returns are
positively correlated, the 2a(1 – a)sxy term is positive since sxy > 0 (a and 1 – a are also positive since 0 < a < 1). The stock
returns move together in this case, with the returns tending to be on the same side of their respective means, leading
to an increased variance or risk of the two-stock portfolio, relative to the case of no correlation. On the other hand, if
the stocks' returns are negatively correlated, the 2a(1 – a)sxy term is negative and leads to a decreased variance or risk
of the two-stock portfolio, relative to the case of no correlation.
Interestingly, even in the case of zero correlation between the stocks' returns (sxy = 0), there is a “diversification”
effect that reduces the variance or risk of the two-stock portfolio. Using the fact that k² < k for any positive constant
k < 1, we can show this effect as follows:
s²v = a²s²x + (1 – a)²s²y < a s²x + (1 – a)s²y ≤ a max(s²x, s²y) + (1 – a) max(s²x, s²y) = max(s²x, s²y),
which means that s²v is less than the maximum of the two sample variances s²x and s²y. For this case of zero correlation,
the same must then also be true for the standard deviations: sv < max(sx, sy).
Let's consider real-world examples of two-stock portfolios from the sp500 dataset. The following table shows sample
means and standard deviations for three stocks (Bank of America (BAC), Wells Fargo (WFC), and ConocoPhillips
(COP)) and also two different two-stock portfolios, one equally weighted (a = 1 – a = 0.5) between BAC and WFC
and one equally weighted (a = 1 – a = 0.5) between COP and WFC:

                       x̄        sx
BAC                    0.01295  0.10530
WFC                    0.01351  0.08157
COP                    0.01093  0.08173
(1/2)BAC + (1/2)WFC    0.01323  0.08607
(1/2)COP + (1/2)WFC    0.01222  0.06822
For the BAC-WFC portfolio, we can confirm that, for the sample means, 0.01323 = (0.5)(0.01295) + (0.5)(0.01351).
For the sample standard deviation, using the sample correlation rBAC,WFC = 0.692 (Example 7.13) and the fact that
sBAC,WFC = rBAC,WFC sBAC sWFC , we have
s²(1/2)BAC+(1/2)WFC = 0.25s²BAC + 0.25s²WFC + 0.5sBAC,WFC
= 0.25(0.10530)² + 0.25(0.08157)² + 0.5(0.692)(0.10530)(0.08157) ≈ 0.00741,
so that the portfolio standard deviation is √0.00741 ≈ 0.08607, matching the value reported in the table.
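The portfolio return series itself can be constructed directly from the data and summarized; a minimal sketch (the variable name port is ours):
port <- 0.5*sp500$BAC + 0.5*sp500$WFC   # equally weighted BAC-WFC portfolio
mean(port)
sd(port)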
Figure 7.15
Means and standard deviations for weighted two-stock portfolios
For a linear combination of three variables, v = k + ax + by + cz, the sample mean is v̄ = k + ax̄ + bȳ + cz̄,
analogous to property (i) of Proposition 7.4. For the sample variance s²v, the expression is more complicated:
s²v = a²s²x + b²s²y + c²s²z + 2ab sxy + 2ac sxz + 2bc syz.
This formula says that the sample variance of v involves the variances of the three variables individually, through the
a²s²x + b²s²y + c²s²z component, but it also involves each of the possible pairwise covariances between the three variables.
With three variables, there are (3 choose 2) = 3 different pairs of variables and, thus, three different covariances.
These results can be generalized to an even larger number of variables in the linear combination. The following
proposition provides the general results for the case of m ≥ 2 variables:
Proposition 7.5. Suppose k and a1, a2, …, am are known constants, and
v = k + a1x1 + a2x2 + ··· + amxm = k + Σ_{j=1}^m aj xj
is a linear combination of the m variables x1 , x2 , …, xm . The descriptive statistics for the sample {v1 , v2 , …, vn } have
the following relationships to the descriptive statistics for the sample of observations for the variables x1 , x2 , …, xm :
(i) (sample mean) v̄ = k + a1 x̄1 + a2 x̄2 + ··· + am x̄m = k + Σ_{j=1}^m aj x̄j
(ii) (sample variance) s²v = Σ_{j=1}^m aj² s²xj + 2 Σ_{j=1}^{m–1} Σ_{ℓ=j+1}^m aj aℓ sxjxℓ
(iii) (sample standard deviation) sv = √s²v = √( Σ_{j=1}^m aj² s²xj + 2 Σ_{j=1}^{m–1} Σ_{ℓ=j+1}^m aj aℓ sxjxℓ )
Property (i) remains simple, with the sample mean of v having the same linear relationship with the sample means
of the variables as in the original linear combination. As seen above for the case of three variables (m = 3), property (ii)
says that the sample variance of the linear combination v involves each of the sample variances of the m variables,
through the Σ_{j=1}^m aj² s²xj summation term, but also all possible pairwise covariances among the m variables, through the
double summation term. There are (m choose 2) = m(m – 1)/2 terms in the double summation, corresponding to the possible variable
pairs among the m variables. Finally, property (iii) says that the sample standard deviation, as always, is equal to the
square root of the sample variance.
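Property (ii) can also be checked numerically, using the fact that the double-sum expression equals the quadratic form a′Sa, where S is the sample covariance matrix of the m variables; here is a minimal sketch with made-up data (all names are ours):
set.seed(42)
X <- matrix(rnorm(300), ncol = 3)   # 100 observations of x1, x2, x3
a <- c(2, -1, 0.5)
v <- 10 + X %*% a                   # linear combination with k = 10
all.equal(as.numeric(var(v)), as.numeric(t(a) %*% cov(X) %*% a))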
to be able to choose whether to receive the treatment, as that choice itself could be related to many outside factors
whereas the randomization, by construction, is not.
Notes
16 Alternatively, a single argument can be used for plot if it consists of a data frame with the x variable as the first column and the y variable as
the second column. For Figure 7.5, the appropriate argument would be cps[,c("educ","wagehr")].
17 The sample correlation between x and –x is equal to –1 since rx,–x = sx,–x/(sx s–x) = –sx,x/(sx sx) = –s²x/s²x = –1, using the facts that
sx,–x = –sx,x and s–x = sx.
Exercises
1. Use the cps dataset for this question, focusing on the subsample of 2,809 employed individuals.
(a) Provide a table of the joint sample counts for hourly (rows) and race (columns).
(b) Provide a table of the joint sample proportions for hourly (rows) and race (columns).
(c) Same as (b), but condition on race so that the column values sum to one. What does the table say about the
relationship between race and being paid hourly?
(d) Draw a figure similar to Figure 7.1 with a bar plot of hourly-wage status by race. As compared to Figure 7.1,
this figure will have only two categories (“Hourly” and “Non-hourly”) on the x-axis.
(e) Draw box plots of weekly earnings (earnwk) by hourly-wage status.
(f) Draw the two densities of weekly earnings (earnwk), one for hourly workers and one for non-hourly workers,
on the same graph. Use different line styles or colors to differentiate the two densities.
(g) Based on (e) and (f), how does the sample mean of weekly earnings for hourly workers compare to that of
non-hourly workers? How about the sample standard deviation? How about the sample 90% quantile?
2. Use the sp500 dataset for this question. The variable IDX represents the monthly return for the overall stock market
(as measured by the S&P 500 index).
(a) Use the command sp500$mkt <- ifelse(sp500$IDX>=0,"Up","Down") to create a categorical
variable mkt whose value is Up when the market has a positive monthly return and Down when the market has
a negative monthly return.
(b) Provide a table of mkt. What proportion of observations have mkt equal to Up?
(c) Use the summary and sd functions to provide summary statistics of Apple’s monthly returns (AAPL).
(d) Repeat (c) on the two subsamples of observations corresponding to mkt = Up and mkt = Down, using the
tapply function. How do the sample means and sample standard deviations compare to each other for the
two subsamples?
(e) Draw two densities of AAPL, one for the mkt = Up subsample and one for the mkt = Down subsample, on the
same graph. Use different line styles or colors to differentiate the two densities.
(f) Defining the binary variable mktup to be 1 if mkt = Up and 0 if mkt = Down, use Proposition 7.2 to calculate the
sample correlation between AAPL and mktup based upon the sample means and sample standard deviations of
AAPL and mktup. The necessary information is available from the answers to (b), (c), and (d).
3. For a sample size of six observations and two variables x and y, draw a possible scatter plot of data for which
(xi – x̄)(yi – ȳ) is positive for every i = 1, …, 6 but x and y are not perfectly correlated.
4. Use the dataset auctions for this question. The dataset consists of 684 eBay auctions for Apple iPod Mini devices in
June and July 2006. The binary variables new, used, and refurb indicate the condition of the device (e.g., a new device
has new = 1 and used = refurb = 0). For this question, focus only on the subsample of 624 auctions of used and new
devices, so drop those with refurb = 1.
(a) Draw side-by-side box plots of auction sales prices (finalprice, in dollars) for used devices and new devices.
(b) What is the sample correlation between finalprice and new?
(c) Draw a scatter plot of finalprice versus the number of bidders (bidders).
(d) What is the sample correlation between finalprice and bidders?
(e) For the plot in (c), does the variability of finalprice change for larger values of bidders?
(f) For the plot in (c), does the average finalprice appear to always increase as the value of bidders increases?
Explain.
5. For a sample of firms, there are data on x = electricity purchased (in kilowatt-hours, or kwh) and y = firm revenues
(in dollars). What are the units of the following?
(a) Sample average of electricity purchased
(b) Variance of firm revenues
(c) Covariance between electricity purchased and firm revenues
(d) Correlation between electricity purchased and firm revenues
6. Last week, widgets.com had daily sales of 24, 13, 19, 21, 12, 28, and 18 widgets.
(a) What is the sample median of daily widget sales?
(b) What is the sample mean of daily widget sales?
(c) What is the sample standard deviation of daily widget sales?
(d) If daily fixed cost is 100 dollars and profit margin per widget is 20 dollars, daily profits are given by –100 +
20 widgets. Using the results for linear transformations of univariate data, what are the sample median, sample
mean, and sample standard deviation of daily profits?
(e) What is the sample correlation between daily widget sales and daily profits?
(f) What is the sample covariance between daily widget sales and daily profits?
7. A survey of CEOs collected data on x = salary (in thousands of dollars) and y = education (in years) for each CEO.
The sample has x̄ = 200, sx = 30, ȳ = 17, and sy = 3. The sample covariance between x and y is sxy = 50.
(a) What is the sample correlation between salary and education?
(b) What is the sample covariance between salary in dollars (not thousands of dollars) and education?
(c) What is the sample correlation between salary in dollars and education?
8. You have data on the monthly returns of two stocks A and B, given respectively by the variables x and y. The sample
variance for stock A is 0.006 (s²x = 0.006) and the sample variance for stock B is 0.008 (s²y = 0.008).
(a) What must be true about the correlation between x and y for the average return (1/2)(x + y) to have a sample
variance less than or equal to 0.007?
(b) What must be true about the correlation between x and y for the average return (1/2)(x + y) to have a sample
variance less than or equal to 0.006?
(c) What must be true about the correlation between x and y for the difference in returns (x – y) to have a sample
variance less than 0.012?
9. You have data for 1,230 individuals, including their education (in years, denoted educ) as well as the education, in
years, for each individual’s mother (motheduc) and father (fatheduc). The sample correlation matrix is:
           educ   motheduc  fatheduc
educ       1.000  0.452     0.440
motheduc          1.000     0.599
fatheduc                    1.000
The sample variance of educ is 5.543, the sample variance of motheduc is 5.190, and the sample variance of fatheduc
is 10.653. The sample covariance between motheduc and fatheduc is 4.454.
(a) What is the sample covariance between educ and motheduc? What are the units?
(b) What is the sample variance of the sum of motheduc and fatheduc?
(c) What is the sample variance of the average of motheduc and fatheduc?
(d) What is the sample variance of the difference of motheduc and fatheduc?
(e) Explain why the sample variance in (b) is higher than the sample variance in (d).
(f) What additional information is needed to calculate the sample variance of the average of all three education
variables (educ, motheduc, fatheduc)?
10. Use the exams dataset for this question.
(a) Draw a scatter plot of the second exam score (exam2) versus the first exam score (exam1), with lines drawn
through the sample means of the two variables. Specify the range of both axes to be between 0 and 100.
(b) Using the appropriate R commands (rather than counting points in the scatter plot), how many points are in
each of the four quadrants of the scatter plot in (a)?
(c) What is the sample correlation between exam1 and exam2?
(d) Based on (c), which of the following would have a higher sample standard deviation: the sum of the two exam
scores or the difference of the two exam scores? Answer without R.
(e) If you standardize both exam1 and exam2 (by de-meaning and dividing by the sample standard deviation), what
would be the sample correlation between the two standardized exam scores? Answer without R.
(f) If you standardize both exam1 and exam2 as in (e), what would be the variance of the sum of the two
standardized exam scores? Answer without R.
(g) Create two new variables exam1_std and exam2_std with the standardized exam scores. Suppose the instructor
calculates a composite score (score) as the sum of 0.75 times the higher standardized exam score and 0.25
times the lower standardized exam score. Create the new variable score. What are the sample mean and sample
standard deviation of the composite scores? If the instructor would like the top 20 students in the class to get
an A, what would be the appropriate cutoff for the composite scores?
11. (a) *For two variables x and y, show that the sample covariance of x and y is
sxy = (n/(n – 1)) ((xy)‾ – x̄ȳ),
where (xy)‾ = (1/n) Σ_{i=1}^n xi yi is the sample average of the xi yi values.
(b) Suppose that x ∈ {0, 1} and y ∈ {0, 1} are binary variables. Using the result in (a) and the fact that a binary x
has sample variance s²x = (n/(n – 1)) x̄(1 – x̄), provide a formula for rxy in terms of x̄, ȳ, and (xy)‾.
(c) A city has two newspapers, the Daily Bugle and the Daily Planet. The binary variable x indicates whether a
local company advertises in the Daily Bugle in a given year (1 means yes, 0 means no), and the binary variable
y indicates whether a local company advertises in the Daily Planet in a given year (1 means yes, 0 means no).
The following table describes the advertising behavior of a sample of 80 local companies in a given year:
             y
             0     1
x     0      24    28
      1      18    10
Using the result from (b), what is the sample correlation between x and y?
12. Use the sp500 dataset for this question. If the data are not already visible in the top-left window of RStudio, use the
command View(sp500).
(a) First, focus on the first 20 stocks that appear in the spreadsheet. Ignore IDX, which is in the second column, so
the first 20 stocks are given by the stock tickers AAPL through APA. Output the descriptive statistics for these
20 stocks, using the command summary(sp500[,3:22]). What is the sample mean for AMD?
(b) Use the command sapply(sp500[,3:22], sd) to calculate sample standard deviations for the stocks
in (a). The sapply function “applies” the sd function to each of the columns of the first argument
sp500[,3:22]. Which stock has the highest sample standard deviation? Which stock has the lowest sample
standard deviation?
(c) Create a new variable that contains the monthly returns for a portfolio with equal (1/2) weights on the first two
stocks (AAPL, ABMD). What are the sample mean and sample standard deviation for this two-stock portfolio?
(d) Create a new variable that contains the monthly returns for a portfolio with equal (1/3) weights on the first
three stocks (AAPL, ABMD, ABT). What are the sample mean and sample standard deviation for this three-stock
portfolio?
(e) *Continue this process to get an equally weighted 4-stock portfolio with the first 4 stocks, an equally weighted
5-stock portfolio with the first 5 stocks, and so on, through an equally weighted 20-stock portfolio with the
first 20 stocks. For each portfolio, calculate the sample mean and sample standard deviation. Then, make
two plots: (i) sample mean versus the number of stocks (ranging from 2 to 20) in the portfolio and (ii)
sample standard deviation versus the number of stocks (ranging from 2 to 20) in the portfolio. (Hint: A useful
function is rowMeans, which creates a vector that averages across columns of a data frame. For example, the
command portfolio <- rowMeans(sp500[,3:8]) creates a portfolio variable corresponding
to an equally weighted portfolio consisting of the first six stocks, in columns 3 through 8.)
(f) *Rather than using the first 20 stocks as in (e), instead choose 20 stocks randomly without replacement from
the full set of stocks, contained in columns 3 through 268 of the data frame. Calculate the sample average
and sample standard deviation for the equally weighted 20-stock portfolio. Use a loop to do this 1,000 times,
randomly picking 20 stocks each time, and store the sample means and sample standard deviations along
the way. Plot the histogram and/or smoothed density of the 1,000 sample means. Plot the histogram and/or
smoothed density of the 1,000 sample standard deviations.
8 Discrete random variables
Chapter 5 introduced the concept of numerical variables, including both discrete numerical variables and continuous
numerical variables. Then, Chapter 6 introduced several descriptive statistics for summarizing numerical variables.
At this point, we take a step back and, applying the concepts of probability theory from Chapter 2, formally model
the process by which a numerical variable arises and is observed. This chapter provides the theoretical framework for
discrete numerical variables, and Chapter 10 introduces the theoretical framework for continuous numerical variables.
While the approaches for the two types of variables have similarities, the mathematics are sufficiently different that it
is useful to consider them separately. For instance, while discrete summations are used for much of the analysis in the
discrete case, integration is necessary for much of the analysis in the continuous case. Although categorical variables
are not covered explicitly in this chapter, we have already seen that categorical variables can be represented by one or
more discrete (binary) numerical variables, so the framework discussed in this chapter is applicable.
Definition 8.1 A random variable X is a function that maps each outcome of the sample space S to a number.
While this definition indicates that X is a function, it is standard to suppress the argument when referring to the
random variable X, in the interest of brevity. If e is a simple event (outcome) from S, it is more accurate to say that
X(e) is some numerical value. Usually, however, the argument e is suppressed, with X instead of X(e) used to denote
the random variable. The standard convention is to use a capital letter, like X rather than x, to denote a random variable.
For modeling discrete variables, a specific type of random variable known as a discrete random variable is used:
Definition 8.2 A discrete random variable is a random variable with a finite or countable set of possible values.
The notation for the set of possible values, referred to in Definition 8.2, is the same as introduced in Section 8.1,
with x1∗ , x2∗ , …, xK∗ for finite K and x1∗ , x2∗ , …, xk∗ , … for infinite K.
Here are some examples of discrete random variables:
• Coin toss: S = {H, T}, X = 1 if H, X = 0 if T
• Roll of a die: S = {1, 2, 3, 4, 5, 6}, X is equal to the outcome
• Three website visitors, purchase (Y) or not (N):
S = {YYY, YYN, YNY, YNN, NYY, NYN, NNY, NNN}
– Total number of purchases X = the number of purchases in the three-visitor sequence
– Indicator of at least two purchases T = 1 if the number of purchases is ≥ 2 and 0 otherwise
• Website visitors until first purchase: S = {Y, NY, NNY, NNNY, …}, with corresponding X values 1, 2, 3, 4, …
• Number of firm patents: S = {0, 1, 2, 3, …}, X is equal to the outcome
• Income threshold: S contains all possible annual income values, X = 1 if outcome > $100,000 and 0 otherwise
Definition 8.3 The probability mass function (pmf) of a discrete random variable X, denoted pX (·), gives the
probability of each possible value of X:
pX (xk∗ ) = P(X = xk∗ ).
For each possible value xk∗, the probability pX(xk∗) can be determined as follows:
• Step 1: Find all of the outcomes in S for which X = xk∗:
A = {e : X(e) = xk∗}.
• Step 2: Calculate the probability of the event A from Step 1, so that pX(xk∗) = P(A).
Let's determine the pmf for four of the examples listed above:
• Coin toss: For a fair coin, the pmf is pX(0) = pX(1) = 0.5.
• Roll of a die: For a fair die, the pmf is pX(xk∗) = 1/6 for each of the six possible outcomes.
• Three website visitors, with X = total number of purchases: Suppose that the purchase probability is 20% (0.2)
for each visitor and that the purchase behavior by each visitor is independent of other visitors. Then, the pmf is
pX (0) = P(NNN) = 0.8³ = 0.512,
pX (1) = P(YNN ∪ NYN ∪ NNY) = (0.2)(0.8)(0.8) + (0.8)(0.2)(0.8) + (0.8)(0.8)(0.2) = 0.384,
pX (2) = P(YYN ∪ YNY ∪ NYY) = (0.2)(0.2)(0.8) + (0.2)(0.8)(0.2) + (0.8)(0.2)(0.2) = 0.096,
and
pX (3) = P(YYY) = 0.2³ = 0.008.
• X = website visitors until first purchase, with the same assumptions as the previous example: The pmf is
pX (1) = 0.2,
pX (2) = (0.8)(0.2) = 0.16,
pX (3) = (0.8)(0.8)(0.2) = 0.128,
and, so on, with the general formula
pX (xk∗ ) = (0.8)^(k–1) (0.2).
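These last two pmf's can be double-checked against R's built-in binomial and geometric pmf functions (dbinom and dgeom; note that dgeom counts the number of non-purchasing visitors before the first purchase, so its argument is k – 1):
dbinom(0:3, size = 3, prob = 0.2)   # number of purchases among three visitors
## [1] 0.512 0.384 0.096 0.008
dgeom(0:2, prob = 0.2)              # visitors until first purchase, k = 1, 2, 3
## [1] 0.200 0.160 0.128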
For these four examples, Figure 8.1 provides a graph of each of the four pmf’s. Each graph has the probability values
on the y-axis and possible xk∗ values on the x-axis, with each vertical line indicating the probability associated with any
given possible value xk∗ . For instance, the pmf for the coin toss, in the top-left graph, has two vertical lines drawn at x = 0
and x = 1, each with height 0.5, corresponding to the equal probabilities of tails (X = 0) and heads (X = 1). Similarly,
the pmf for the die roll, in the top-right graph, has vertical lines with heights 1/6 at each of the six possible values. In
the lower-right graph for the fourth example (website visitors until first purchase), note that the x-axis only extends to
a maximum of xk∗ = 20 even though K is infinite. Although the x-axis could be extended more, the probability values
become very close to zero as xk∗ gets larger; for example, at xk∗ = 20, the probability is pX (20) ≈ 0.00288 = 0.288%.
Figure 8.1
Probability mass functions for four examples
Definition 8.4 The cumulative distribution function (cdf) of a discrete random variable X, denoted FX (·), gives the
probability that X is less than or equal to any argument x0 of FX (·):
FX (x0 ) = P(X ≤ x0 ) = Σ_{xk∗ ≤ x0} pX (xk∗ ).
The generic argument x0 can take on any possible value on the real line. Though x0 can be equal to a possible value
xk∗ , the definition also allows for x0 values that are between possible values of the random variable. The cdf has the
following properties:
0 ≤ FX (x0 ) ≤ 1 for every x0
and
x0 < x1 =⇒ FX (x0 ) ≤ FX (x1 ).
The first property follows directly from the fact that FX (x0 ) = P(X ≤ x0 ) is a probability and must therefore be
between 0 and 1 (inclusive). The second property says that the cdf is a weakly increasing function. For x0 < x1 , this
property holds since
FX (x1 ) = P(X ≤ x1 ) = P(X ≤ x0 or x0 < X ≤ x1 ) ≥ P(X ≤ x0 ) = FX (x0 ).
Let’s consider the four examples from Section 8.2.2 to illustrate the concept of cdf’s.
Example 8.2 (Coin toss) For a fair coin, the pmf and the cdf at 0 and 1 are given in the following table:
xk∗ pX (xk∗ ) FX (xk∗ )
0 0.5 0.5
1 0.5 1
How about the cdf FX (x0 ) for other values of x0 ? For x0 = –0.4, FX (–0.4) = P(X ≤ –0.4) = 0 since there are no possible
values of X below –0.4. The same is true of any negative x0 value, so that FX (x0 ) = 0 when x0 < 0. This means that the
FX (·) function jumps from 0 to 0.5 exactly at the point x0 = 0. For x0 = 0.6 (a point between 0 and 1), FX (0.6) = P(X ≤
0.6) = P(X = 0) = 0.5. The same is true of any x0 that is strictly between 0 and 1, meaning the FX (·) function jumps from
0.5 to 1 exactly at the point x0 = 1. Finally, for any x0 value greater than 1, FX (x0 ) = P(X ≤ x0 ) = P(X ≤ 1) = 1.
Taking these results together, the cdf is a step function, as shown in Figure 8.2. The graph has been drawn with
the x-axis extending from –1 to 2, but it should be understood that the cdf extends to the left forever (with a value of
0) and to the right forever (with a value of 1). The solid lines indicate the cdf value at any given x0 value. There is
a “closed dot” and an “open dot” at the x0 values where the function jumps up. The closed dot is the cdf value at
the corresponding point. For instance, at x0 = 0, the cdf value is FX (0) = 0.5, represented by the closed dot and not the
open dot. Similarly, at x0 = 1, the cdf value is FX (1) = 1, represented by the closed dot and not the open dot.
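The step function in Figure 8.2 can be reproduced with the base R function stepfun; a minimal sketch (the object name Fcoin is ours):
Fcoin <- stepfun(c(0, 1), c(0, 0.5, 1), right = FALSE)   # cdf jumps at 0 and 1
Fcoin(-0.4); Fcoin(0.6); Fcoin(1.5)                      # 0, 0.5, 1
plot(Fcoin, verticals = FALSE, main = "cdf for fair coin toss")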
Example 8.3 (Six-sided die) For a fair die, the pmf and the cdf for the possible outcome values are given in the table
below:
xk∗   pX(xk∗)   FX(xk∗)
1     1/6       1/6
2     1/6       2/6
3     1/6       3/6
4     1/6       4/6
5     1/6       5/6
6     1/6       1
For any number x0 in between these outcomes, the cdf FX (x0 ) is the FX (xk∗ ) for the largest xk∗ that is less than or
equal to x0. For instance, if x0 = 3.7, the cdf is FX (3.7) = P(X ≤ 3.7) = P(X ≤ 3) = 1/2. Also, FX (x0 ) = 0 for x0 < 1, and FX (x0 ) = 1
for x0 > 6. Figure 8.3 shows the cdf for the fair die roll. Again, the closed dots indicate the cdf values at the x0 points
where the cdf jumps up, and the cdf extends forever to the left (with value 0) and to the right (with value 1).
Example 8.4 (Three website visitors) For the example with three website visitors and independent purchase
probabilities of 0.2, the pmf and cdf of X = number of purchases are given by the following table:
xk∗   pX(xk∗)   FX(xk∗)
0     0.512     0.512
1     0.384     0.896
2     0.096     0.992
3     0.008     1
Figure 8.2
Cumulative distribution function for fair coin toss
Figure 8.3
Cumulative distribution function for fair die roll
For the example of website visitors until first purchase, the cdf values for other x0 values can be determined in
the same way as the last example. For example, for x0 = 2.8 (between 2 and 3), the cdf is FX (2.8) = FX (2) = 0.36. As
k increases, the cdf value gets closer and closer to one. Mathematically, FX (xk∗ ) = 1 – (0.8)^k, and as k → ∞,
(0.8)^k → 0 so that FX (xk∗ ) → 1. The cdf never reaches one, but its value gets closer and closer to one for larger k.
Figure 8.4
Cumulative distribution function for number of purchases
Definition 8.5 The population mean (or population average or expected value) of a discrete random variable X,
denoted µX or E(X), is
µX = E(X) = Σ_k xk∗ pX (xk∗ ).
The population mean µX or expected value E(X) is a weighted average of the possible xk∗ , where the weights are the
pmf probabilities pX (xk∗ ) = P(X = xk∗ ). Notice the similarity between Σ_k xk∗ pk (for the sample mean) and Σ_k xk∗ pX (xk∗ )
(for the population mean), both of which are weighted averages of the xk∗ values, with the weights being the sample
proportions for the sample mean and the true outcome probabilities for the population mean. Recall the frequentist
interpretation of true probabilities discussed in Section 2.3, where we viewed the probability of an outcome as the
long-run frequency or proportion of the outcome being observed over repeated experiments. Here, for the thought
experiment of taking many repeated draws from the population, the probability pX (xk∗ ) is the number that the sample
proportion pk approaches as the sample size n gets arbitrarily large. Since this relationship holds for each probability
pX (xk∗ ) and each pk , the sample mean x̄ should also approach the population mean µX as the sample size n gets arbitrarily
large. This result, known as the Law of Large Numbers, is formalized in Chapter 13, but it is worthwhile to introduce
the intuition here. To summarize, for any given sample, generally it is the case that pk ≠ pX (xk∗ ) for each of the possible
outcomes and x̄ ≠ µX , with equality occurring only by chance; but, as the sample size increases, the sample proportion
pk gets closer to the true probability pX (xk∗ ) for each possible outcome and x̄ gets closer to µX .
Example 8.7 (Union status) Let X be a binary random variable representing union status, equal to 1 for a union
worker and 0 for a non-union worker. The population consists of the union status x ∈ {0, 1} for all possible workers.
The population mean is
µX = E(X) = 0 × pX (0) + 1 × pX (1) = pX (1).
The population mean is equal to the probability of union status in the population, pX (1) = P(X = 1). As there is nothing
special about union status in this example, µX = P(X = 1) holds for any binary random variable X ∈ {0, 1}.
Example 8.8 (Six-sided die) For X ∈ {1, 2, 3, 4, 5, 6} being the outcome of a fair die roll, the pmf probabilities are
pX (xk∗ ) = 1/6 for each outcome. The population mean or expected value of X is
µX = E(X) = 1 × (1/6) + 2 × (1/6) + 3 × (1/6) + 4 × (1/6) + 5 × (1/6) + 6 × (1/6) = 21/6 = 3.5.
Example 8.9 (Three website visitors) For the example with three website visitors and independent purchase
probabilities of 0.2, Example 8.4 provided the pmf of X = number of purchases. The table below lists the pmf values
pX (xk∗ ) for xk∗ ∈ {0, 1, 2, 3} and calculates the xk∗ pX (xk∗ ) terms in the summation for µX .
xk∗   pX(xk∗)   xk∗ pX(xk∗)
0     0.512     0
1     0.384     0.384
2     0.096     0.192
3     0.008     0.024
Summing the xk∗ pX(xk∗) column gives the population mean µX = E(X) = 0.384 + 0.192 + 0.024 = 0.6.
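As a quick check in R (using the binomial pmf function dbinom for the three-visitor example; the variable names are ours):
xk <- 0:3
pk <- dbinom(xk, size = 3, prob = 0.2)   # 0.512, 0.384, 0.096, 0.008
sum(xk * pk)                             # population mean E(X)
## [1] 0.6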
While we have been able to calculate the population mean in the examples above, there is no guarantee that the
population mean or expected value is well-defined (i.e., a finite number), as the following example illustrates:
Example 8.10 (Infinite expected value) Suppose pX (x) = 1/x for any x that is a power of two — that is, x ∈
{2, 4, 8, …, 2^k, …}. These outcomes and probabilities constitute a valid pmf since Σ_{k=1}^∞ (1/2^k) = (1/2)/(1 – 1/2) = 1, but the
expected value E(X) = Σ_{k=1}^∞ 2^k (1/2^k) = Σ_{k=1}^∞ 1 is infinite.
The sample variance is (approximately) a weighted average of the (xk∗ – x̄)² values, where the weights are the associated
sample proportions of the outcomes. For the population variance, there are two main differences from this sample
variance formula. First, as with the population mean, the weights are the true probabilities of the possible outcomes
rather than the sample proportions. Second, the sample mean, which appears in each of the (xk∗ – x̄)² expressions, is
replaced by the population mean µX . The formal definition of the population variance is given below:
Definition 8.6 The population variance of a discrete random variable X, denoted σ²X or Var(X), is
σ²X = Var(X) = E[(X – µX )²] = Σ_k (xk∗ – µX )² pX (xk∗ ).
The sample variance is a weighted average of the squared difference of possible outcomes from the sample mean,
and the population variance is a weighted average of the squared difference of possible outcomes from the population
mean. The weights for the sample variance are the sample proportions, and the weights for the population variance
are the true probabilities pX (xk∗ ) = P(X = xk∗ ). For a given sample, the sample proportions generally differ from the pmf
probabilities (pk ≠ pX (xk∗ )), leading to a sample mean that differs from the population mean (x̄ ≠ µX ) and a sample
variance that differs from the population variance (s²x ≠ σ²X ). That said, if we again conduct the thought experiment
of taking many repeated draws from the population, it is expected that the sample variance s²x becomes closer to the
population variance σ²X as the sample size n gets larger and larger. The reason here is that, as the sample size n becomes
very large, (i) the sample proportions pk get closer to the true pmf probabilities pX (xk∗ ) and (ii) the sample mean x̄ gets
closer to the population mean µX , which taken together imply that each (xk∗ – x̄)² pk term in s²x gets closer to each
(xk∗ – µX )² pX (xk∗ ) term in σ²X .
We also define the population standard deviation associated with the random variable X:
Definition 8.7 The population standard deviation of a discrete random variable X, denoted σX or sd(X), is
σX = sd(X) = √σ²X = √( Σ_k (xk∗ – µX )² pX (xk∗ ) ).
The population standard deviation σX is different from the sample standard deviation sx in the same way that the
population variance σ²X is different from the sample variance s²x . Similar to the variance measures, the sample standard
deviation sx is expected to get closer to the population standard deviation σX as the sample size n gets larger and larger.
Example 8.11 (Union status) In Example 8.7, where X ∈ {0, 1} is a binary random variable representing union status
(1 for union, 0 for non-union), the population mean was shown to be µX = pX (1). Using this result and the fact that
pX (0) = 1 – pX (1), the population variance is
σ²X = Var(X) = (0 – µX )² pX (0) + (1 – µX )² pX (1)
= (0 – pX (1))² pX (0) + (1 – pX (1))² pX (1)
= (0 – pX (1))² (1 – pX (1)) + (1 – pX (1))² pX (1)
= (pX (1)² + (1 – pX (1)) pX (1)) (1 – pX (1))
= pX (1)(1 – pX (1)).
For this binary X, the population variance is pX (1)(1 – pX (1)), and the population standard deviation is
σX = sd(X) = √( pX (1)(1 – pX (1)) ).
As in Example 8.7, there’s nothing special about union status here, so these are general results for a binary random
variable X ∈ {0, 1}. These results will be re-visited when binary random variables are discussed further in Chapter 9.
Example 8.12 (Six-sided die) In Example 8.8, the population mean was shown to be µX = 3.5 for the random variable
X ∈ {1, 2, 3, 4, 5, 6} being the outcome of a fair die roll. Since the probability of each outcome is 1/6, we have
σ²X = Var(X) = Σ_{k=1}^6 (xk∗ – 3.5)² pX (xk∗ )
= Σ_{k=1}^6 (xk∗ – 3.5)² (1/6)
= ((–2.5)² + (–1.5)² + (–0.5)² + (0.5)² + (1.5)² + (2.5)²) (1/6)
= (17.5)(1/6) = 35/12 ≈ 2.9167.
The population standard deviation is σX = √(35/12) ≈ 1.7078. To see how the sample descriptive statistics relate to the
population descriptive statistics as the sample size n increases, Figure 8.5 shows the results from a computer simulation
of rolling a fair die 5,000 times. After each die roll, the following four quantities are updated: (i) sample proportion of
6 being the outcome of a roll (top-left graph), (ii) sample mean of the die rolls (top-right graph), (iii) sample variance
of the die rolls (bottom-left graph), and (iv) sample standard deviation of the die rolls (bottom-right graph). The figure
provides plots of these four quantities as the number of tosses n (along the x-axis) increases to 5,000. For comparison
purposes, the corresponding population statistics are drawn as horizontal dotted lines on each plot. These values are
the probability of a 6 (pX (6) = 1/6), the population mean (µX = 3.5), the population variance (σ²X = 35/12), and the
population standard deviation (σX = √(35/12)). As evident in the four graphs, the sample descriptive statistic in each case gets very
close to its corresponding population statistic as n increases toward 5,000.
Figure 8.5 is created in R by simulating the 5,000 die rolls and updating the four running statistics after each roll.
The code begins by setting a random seed so that the simulation is reproducible; a minimal sketch of the remaining
steps follows (with our variable names, and with only the running-mean plot shown):
set.seed(1234)
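# (sketch continues; the variable names below are ours, not necessarily the textbook's)
n <- 5000
rolls <- sample(1:6, n, replace = TRUE)             # simulate 5,000 fair die rolls
prop6 <- cumsum(rolls == 6) / (1:n)                 # running proportion of 6's
runmean <- cumsum(rolls) / (1:n)                    # running sample mean
runvar <- sapply(1:n, function(i) var(rolls[1:i]))  # running sample variance (NA at i = 1)
runsd <- sqrt(runvar)                               # running sample standard deviation
plot(1:n, runmean, type = "l", xlab = "n", ylab = "Sample mean of rolls")
abline(h = 3.5, lty = 3)                            # population mean, for comparison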
8.4.1 Joint probability mass function and joint cumulative distribution function
Let K and L denote the number of possible outcomes for X and Y, respectively, where K and/or L may be infinite.
The possible outcomes for X are {x1∗ , x2∗ , …, xK∗ } if K is finite and {x1∗ , x2∗ , …, xk∗ , …} if K is infinite. The possible
outcomes for Y are {y1∗ , y2∗ , …, yL∗ } if L is finite and {y1∗ , y2∗ , …, yℓ∗ , …} if L is infinite. The concept of the probability
mass function can be extended to a joint probability mass function as follows:
Figure 8.5
Descriptive statistics for 5,000 simulated die rolls
Definition 8.8 The joint probability mass function (joint pmf) of two discrete random variables X and Y, denoted
pXY (·, ·), gives the joint probability
pXY (xk∗ , yℓ∗ ) = P(X = xk∗ ∩ Y = yℓ∗ ) = P(X = xk∗ , Y = yℓ∗ )
for every possible outcome xk∗ for X and every possible outcome yℓ∗ for Y.
The collection of possible (xk∗ , yℓ∗ ) values is a set of disjoint and exhaustive outcomes, meaning the joint pmf satisfies
the following properties:
0 ≤ pXY (xk∗ , yℓ∗ ) ≤ 1 for any possible outcome pair (xk∗ , yℓ∗ )
and
Σ_{(k,ℓ)} pXY (xk∗ , yℓ∗ ) = Σ_k Σ_ℓ pXY (xk∗ , yℓ∗ ) = 1.
The joint pmf probabilities are each between zero and one (inclusive) and sum to one. The summations in the
expression above are written in two equivalent ways: (i) Σ_{(k,ℓ)} is a summation over all possible pairs (k, ℓ), and
(ii) Σ_k Σ_ℓ is a “double summation” with the inner summation taken over possible ℓ values and the outer summation
taken over possible k values.
If X and Y are both finite discrete random variables, the number of possible joint outcomes is KL. Following the
approach that was taken for probability tables in Section 3.3, the joint pmf can be represented by a table as follows:
       y1∗               y2∗               ···   yL∗
x1∗    pXY (x1∗ , y1∗ )   pXY (x1∗ , y2∗ )   ···   pXY (x1∗ , yL∗ )
x2∗    pXY (x2∗ , y1∗ )   pXY (x2∗ , y2∗ )   ···   pXY (x2∗ , yL∗ )
⋮      ⋮                 ⋮                 ⋱     ⋮
xK∗    pXY (xK∗ , y1∗ )   pXY (xK∗ , y2∗ )   ···   pXY (xK∗ , yL∗ )
Example 8.15 (Phone and computer ownership) Continuing Example 8.13, where X = number of phones owned and
Y = number of computers owned, let's assume that no individual in the population owns more than two phones or
more than two computers, so that the possible outcomes are {0, 1, 2} for each variable (K = 3 and L = 3). Suppose the
joint pmf is given by the following probability table:
                      Y (# computers)
                      0      1      2
              0       0.06   0.03   0.01
X (# phones)  1       0.22   0.48   0.05
              2       0.02   0.04   0.09
The joint pmf values in this table are the probabilities associated with the population and not sample proportions
based upon an observed sample. For example, pXY (0, 1) = 0.03 indicates there is a 3% probability that an individual
drawn from the population doesn’t own a phone and owns one computer. Similarly, pXY (1, 2) = 0.05 indicates that there
is a 5% probability that an individual drawn from the population owns one phone and two computers.
The concept of the cdf can also be extended to the case of two discrete random variables, as follows:
Definition 8.9 The joint cumulative distribution function (joint cdf) of two discrete random variables, denoted
FXY (·, ·), gives the probability that both X and Y are less than or equal to their corresponding arguments:
FXY (x0 , y0 ) = P(X ≤ x0 ∩ Y ≤ y0 ) = P(X ≤ x0 , Y ≤ y0 ).
The joint cdf has the following properties:
0 ≤ FXY (x0 , y0 ) ≤ 1 for every x0 and y0
and
x0 < x1 =⇒ FXY (x0 , y0 ) ≤ FXY (x1 , y0 ) and y0 < y1 =⇒ FXY (x0 , y0 ) ≤ FXY (x0 , y1 ).
The first property follows from the fact that FXY (x0 , y0 ) is a probability. The first part of the second property says that the
joint cdf is weakly increasing in its first argument; that is, if y0 is held fixed, the value of the joint cdf weakly increases
as x0 is increased. Similarly, the second part says that the joint cdf is weakly increasing in its second argument, so
holding x0 fixed, the value of the joint cdf weakly increases as y0 is increased.
Example 8.16 (Phone and computer ownership) Using the probability table in Example 8.15, the joint cdf can be
calculated at any specified arguments. For example, with x0 = 2 and y0 = 1, the joint cdf FXY (2, 1) = P(X ≤ 2, Y ≤ 1)
is the sum of all the probabilities from the first two columns, which is 0.85. If y0 is held fixed at y0 = 1, focusing
on the subpopulation of individuals who own one computer, the joint cdf values for the possible values of X are
FXY (0, 1) = 0.09, FXY (1, 1) = 0.79, and FXY (2, 1) = 0.85. These joint cdf values increase as x0 increases with y0 fixed.
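Continuing with the jp matrix from the previous code chunk, these joint cdf values can be verified by summing the appropriate block of the table (the helper function Fxy is our own):

Fxy <- function(x0, y0) sum(jp[1:(x0+1), 1:(y0+1)])  # sum cells with X <= x0 and Y <= y0
Fxy(2, 1)
## [1] 0.85
Fxy(0, 1)
## [1] 0.09
Fxy(1, 1)
## [1] 0.79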
Definition 8.10 The conditional probability mass function (conditional pmf) of X given Y, denoted pX|Y(·|·), is

pX|Y(xk*|yℓ*) = P(X = xk*|Y = yℓ*) = pXY(xk*, yℓ*) / pY(yℓ*).

Similarly, the conditional pmf of Y given X, denoted pY|X(·|·), is

pY|X(yℓ*|xk*) = P(Y = yℓ*|X = xk*) = pXY(xk*, yℓ*) / pX(xk*).
For the probability table when K and L are finite, the conditional probability pX|Y(xk*|yℓ*) is focused on the joint probabilities in the column corresponding to Y = yℓ*, and the conditional probability pY|X(yℓ*|xk*) is focused on the joint probabilities in the row corresponding to X = xk*. The conditional pmf is itself a pmf, having the properties that its probabilities are nonnegative and add up to one. Therefore, a conditional cdf can be defined based upon the conditional pmf's probabilities:
Definition 8.11 The conditional cumulative distribution function (conditional cdf) of X given Y, denoted FX|Y(·|·), gives the probability that X is less than or equal to any argument x0 conditional on Y = yℓ*:

FX|Y(x0|yℓ*) = P(X ≤ x0|Y = yℓ*) = Σ_{xk* ≤ x0} pX|Y(xk*|yℓ*).

Similarly, the conditional cdf of Y given X, denoted FY|X(·|·), gives the probability that Y is less than or equal to any argument y0 conditional on X = xk*:

FY|X(y0|xk*) = P(Y ≤ y0|X = xk*) = Σ_{yℓ* ≤ y0} pY|X(yℓ*|xk*).
Example 8.17 (Phone and computer ownership) The joint probability table from Example 8.15 is replicated below, with the marginal pmf's for X and Y now included.

                      Y (# computers)
                      0      1      2      pX(x)
               0      0.06   0.03   0.01   0.10
X (# phones)   1      0.22   0.48   0.05   0.75
               2      0.02   0.04   0.09   0.15
        pY(y)         0.30   0.55   0.15
Dividing each column of joint probabilities by the corresponding marginal probability of Y gives the conditional pmf of X given Y; for instance, conditioning on Y = 1 gives pX|Y(0|1) = 0.03/0.55 = 3/55, pX|Y(1|1) = 0.48/0.55 = 48/55, and pX|Y(2|1) = 0.04/0.55 = 4/55. Likewise, conditioning on X = 0 gives pY|X(0|0) = 0.6, pY|X(1|0) = 0.3, and pY|X(2|0) = 0.1. In both cases, the conditional pmf values sum to one (3/55 + 48/55 + 4/55 = 1 conditioning on Y = 1, and 0.6 + 0.3 + 0.1 = 1 conditioning on X = 0), as expected for pmf's.
Since a conditional pmf is just a special case of a pmf, it is natural to introduce conditional versions of the population
descriptive statistics, including the mean, variance, and standard deviation:
Definition 8.12 The population conditional mean or conditional expectation of X given Y = yℓ*, denoted µX|Y=yℓ*, is

µX|Y=yℓ* = E(X|Y = yℓ*) = Σ_k xk* pX|Y(xk*|yℓ*).
Definition 8.13 The population conditional variance of X given Y = yℓ*, denoted σ²X|Y=yℓ*, is

σ²X|Y=yℓ* = Var(X|Y = yℓ*) = Σ_k (xk* – µX|Y=yℓ*)² pX|Y(xk*|yℓ*).
Definition 8.14 The population conditional standard deviation of X given Y = yℓ*, denoted σX|Y=yℓ*, is

σX|Y=yℓ* = sd(X|Y = yℓ*) = √(σ²X|Y=yℓ*) = √( Σ_k (xk* – µX|Y=yℓ*)² pX|Y(xk*|yℓ*) ).
These definitions are stated for the conditional distribution of X given Y. For the conditional distribution of Y given
X, the same definitions can be used with the roles of X and Y reversed. For example, the population conditional mean
or conditional expectation of Y given X = xk* is

µY|X=xk* = E(Y|X = xk*) = Σ_ℓ yℓ* pY|X(yℓ*|xk*).
Example 8.18 (Phone and computer ownership) In Example 8.17, the conditional pmf associated with phone ownership (X) given that an individual owned one computer (Y = 1) is: pX|Y(0|1) = 3/55, pX|Y(1|1) = 48/55, and pX|Y(2|1) = 4/55. Then, using the definitions above, the population conditional mean of X given Y = 1 is

µX|Y=1 = E(X|Y = 1) = 0 × (3/55) + 1 × (48/55) + 2 × (4/55) = 56/55 ≈ 1.018,
the population conditional variance of X given Y = 1 is

σ²X|Y=1 = Var(X|Y = 1) = (0 – 56/55)² × (3/55) + (1 – 56/55)² × (48/55) + (2 – 56/55)² × (4/55) ≈ 0.1269,
and the population conditional standard deviation of X given Y = 1 is

σX|Y=1 = sd(X|Y = 1) ≈ √0.1269 ≈ 0.3563.
These population descriptive statistics describe the number of phones owned (X) in the subpopulation of individuals
who own one computer. The unconditional population mean of X is µX = (0)(0.10) + (1)(0.75) + (2)(0.15) = 1.05, and the unconditional population variance of X is σ²X = (0 – 1.05)²(0.10) + (1 – 1.05)²(0.75) + (2 – 1.05)²(0.15) = 0.2475.
Therefore, the conditional distribution of X given Y = 1 has a slightly lower population mean and a much lower
population variance, indicating that knowing Y = 1 provides useful information about the distribution of X.
Similarly, we can find the population descriptive statistics associated with the conditional pmf of Y given X = 0 found in Example 8.17, which is pY|X(0|0) = 0.6, pY|X(1|0) = 0.3, and pY|X(2|0) = 0.1. The population conditional mean of Y given X = 0 is

µY|X=0 = E(Y|X = 0) = 0 × 0.6 + 1 × 0.3 + 2 × 0.1 = 0.5,
the population conditional variance of Y given X = 0 is

σ²Y|X=0 = Var(Y|X = 0) = (0 – 0.5)² × 0.60 + (1 – 0.5)² × 0.30 + (2 – 0.5)² × 0.10 = 0.45,

and the population conditional standard deviation of Y given X = 0 is σY|X=0 = √0.45 ≈ 0.6708.
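The conditional calculations in Example 8.18 can be verified with a few lines of R; this sketch again reuses the jp matrix from Example 8.15 (the object names are our own):

pX.given.Y1 <- jp[,"1"]/sum(jp[,"1"])   # conditional pmf of X given Y = 1
x <- 0:2
mu <- sum(x*pX.given.Y1)                # conditional mean (56/55)
sigma2 <- sum((x - mu)^2*pX.given.Y1)   # conditional variance
round(c(mu, sigma2, sqrt(sigma2)), 4)
## [1] 1.0182 0.1269 0.3563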
Definition 8.15 The population covariance between two discrete random variables X and Y, denoted σXY, is

σXY = Cov(X, Y) = Σ_(k,ℓ) (xk* – µX)(yℓ* – µY) pXY(xk*, yℓ*).
If we conduct the thought experiment of taking many repeated draws from the population, the sample covariance sxy should get close to the population covariance σXY as the sample size n gets larger and larger. We expect this to occur since (i) the joint probability pXY(xk*, yℓ*) is the true long-run frequency that the sample proportion pkℓ approaches for large n and (ii) the sample means x̄ and ȳ approach the population means µX and µY for large n.
It is important to stress the difference between the sample covariance and the population covariance. While both are measures of linear association, the sample covariance measures the association between x and y in the observed sample, and the population covariance measures the association between X and Y in the population. For the population covariance, the terms (xk* – µX)(yℓ* – µY) pXY(xk*, yℓ*) in the summation make positive contributions for possible outcomes xk* and yℓ* on the same side of their respective population means, µX and µY, and negative contributions for possible outcomes xk* and yℓ* on opposite sides of their respective population means. The overall population covariance depends upon the magnitude and the weighting of each term, where the weights are the joint probabilities pXY(xk*, yℓ*). The sign of the population covariance indicates, for a joint draw of X and Y from the population, whether the X and Y draws tend to be on the same side of their population means (a positive covariance) or on opposite sides of their population means (a negative covariance).
The units of the population covariance σXY are the units of X times the units of Y. Like the sample covariance,
these units make the population covariance difficult to interpret. As before, it is useful to have a unitless measure of
linear association, this time in the population, that is easier to interpret. With the population covariance defined, the
population correlation is defined analogously to the sample correlation:
Definition 8.16 The population correlation between two discrete random variables X and Y, denoted ρXY, is

ρXY = Corr(X, Y) = σXY / (σX σY).
The population covariance and correlation have properties analogous to those seen for the sample covariance and
correlation (Proposition 7.1).
Proposition 8.4. For random variables X and Y, the population covariance σXY and the population correlation ρXY
satisfy the following properties:
(i) ρXY is unitless;
(ii) the sign of the population correlation is the same as the sign of the population covariance,
sign(ρXY ) = sign(σXY );
(iii) ρXX = 1;
(iv) –1 ≤ ρXY ≤ 1.
Proposition 8.4 does not assume that X and Y are discrete random variables. These properties hold more generally
for any type of numerical random variables X and Y, so there is no reason to restrict things to the case of discrete
random variables.
Example 8.19 (Phone and computer ownership) With X = number of phones owned and Y = number of computers
owned, recall the joint pmf from Example 8.15:
Y (# computers)
0 1 2
0 0.06 0.03 0.01
X (# phones) 1 0.22 0.48 0.05
2 0.02 0.04 0.09
We have already seen that the population means are µX = 1.05 and µY = 0.85. Thus, the population covariance is
σXY = (0 – 1.05)(0 – 0.85)(0.06) + (0 – 1.05)(1 – 0.85)(0.03) + (0 – 1.05)(2 – 0.85)(0.01)
+(1 – 1.05)(0 – 0.85)(0.22) + (1 – 1.05)(1 – 0.85)(0.48) + (1 – 1.05)(2 – 0.85)(0.05)
+(2 – 1.05)(0 – 0.85)(0.02) + (2 – 1.05)(1 – 0.85)(0.04) + (2 – 1.05)(2 – 0.85)(0.09)
= 0.1275.
The population correlation is

ρXY = σXY / (σX σY) = 0.1275 / (√0.2475 × √0.4275) ≈ 0.392,
where the values for σX and σY were calculated in Example 8.17. The population correlation ρXY ≈ 0.392 indicates a
positive relationship between the random variables X and Y.
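Here is a sketch of the covariance and correlation calculation in R, once more reusing the jp matrix (the helper names are our own). The same steps applied to the probability table of Example 8.20 below reproduce σXY = 0.12 and ρXY ≈ 0.492:

x <- 0:2; y <- 0:2
px <- rowSums(jp); py <- colSums(jp)
muX <- sum(x*px); muY <- sum(y*py)
sigmaXY <- sum(outer(x - muX, y - muY)*jp)   # weighted sum over all (x,y) cells
sigmaX <- sqrt(sum((x - muX)^2*px))
sigmaY <- sqrt(sum((y - muY)^2*py))
round(c(cov=sigmaXY, corr=sigmaXY/(sigmaX*sigmaY)), 4)
##    cov   corr
## 0.1275 0.3920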
Example 8.20 (Stock price up or down?) Suppose X and Y are defined as binary random variables based upon the monthly returns of two underlying stocks, stock A and stock B:

X = 1 if stock A's price goes up during the month (positive return), and X = 0 if stock A's price goes down or stays the same during the month (non-positive return);

Y = 1 if stock B's price goes up during the month (positive return), and Y = 0 if stock B's price goes down or stays the same during the month (non-positive return).

We want to know whether there is a relationship between X and Y in the population. Let's assume that the joint pmf of X and Y is given by the following probability table:
           Y
           0      1
     0     0.30   0.15
X    1     0.10   0.45
Since both X and Y are binary variables, their population means and variances are

µX = pX(1) = 0.55,   σ²X = pX(1)(1 – pX(1)) = 0.2475,

and

µY = pY(1) = 0.60,   σ²Y = pY(1)(1 – pY(1)) = 0.24.
The population covariance between X and Y is
σXY = (0 – 0.55)(0 – 0.60)(0.30) + (0 – 0.55)(1 – 0.60)(0.15)
+(1 – 0.55)(0 – 0.60)(0.10) + (1 – 0.55)(1 – 0.60)(0.45) = 0.12,
and the population correlation between X and Y is

ρXY = σXY / (σX σY) = 0.12 / (√0.2475 × √0.24) ≈ 0.492.
There is a positive relationship between X and Y, as it is much more likely for X and Y to both be above their population means or both be below their population means (75%) as opposed to being on opposite sides of their population means (25%). When X = 1, it is much more likely that Y = 1 than in the overall population; the conditional probability pY|X(1|1) = 0.45/0.55 ≈ 0.818 as compared to the unconditional probability pY(1) = 0.60. Similarly, when X = 0, it is much more likely that Y = 0 than in the overall population; the conditional probability pY|X(0|0) = 0.30/0.45 ≈ 0.667 as compared to the unconditional probability pY(0) = 0.40.
Definition 8.17 Two discrete random variables X and Y are independent if and only if

pXY(xk*, yℓ*) = pX(xk*) pY(yℓ*)   for every possible outcome pair (xk*, yℓ*)

or, equivalently,

FXY(xk*, yℓ*) = FX(xk*) FY(yℓ*)   for every possible outcome pair (xk*, yℓ*).

If this equality fails for any (xk*, yℓ*), then the discrete random variables X and Y are dependent.
Independent discrete random variables X and Y have joint probabilities pXY(xk*, yℓ*) equal to the product of the marginal probabilities pX(xk*) and pY(yℓ*) for every possible outcome pair (xk*, yℓ*). This concept of independence is
closely related to the independence of events discussed in Chapter 3, specifically that two events are independent if and
only if their joint probability is equal to the product of the marginal probabilities of the two events (Proposition 3.6).
Here, any given outcome of X being observed can be thought of as an event and, likewise, any given outcome of Y
being observed can be thought of as an event. Framed in this way, Definition 8.17 is equivalent to any possible outcome
(event) associated with X being independent of any possible outcome (event) associated with Y, in the sense discussed
in Chapter 3.
To show that two discrete random variables are dependent, it is only necessary to show that there is one outcome pair (xk*, yℓ*) for which pXY(xk*, yℓ*) ≠ pX(xk*) pY(yℓ*).
Example 8.21 (Phone and computer ownership) Continuing Example 8.19, are X (number of phones owned) and Y (number of computers owned) independent? The answer is no since, for instance, pXY(0, 0) = 0.06 and pX(0)pY(0) = (0.10)(0.30) = 0.03 are not equal to each other. Other joint probabilities could be checked, but it's sufficient to find just one case where pXY(xk*, yℓ*) ≠ pX(xk*) pY(yℓ*) to show that X and Y are dependent random variables.
Example 8.22 (Stock price up or down?) Continuing Example 8.20, are X (the binary variable indicating a positive return for stock A) and Y (the binary variable indicating a positive return for stock B) independent? The answer is no since pXY(0, 0) = 0.30 and pX(0)pY(0) = (0.45)(0.40) = 0.18 are not equal to each other.
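In R, this check amounts to comparing the joint pmf matrix to the outer product of the marginal pmf's; a sketch using the jp matrix from Example 8.15:

indep <- outer(rowSums(jp), colSums(jp))   # joint pmf that independence would imply
round(indep, 4)
##        0      1      2
## 0 0.0300 0.0550 0.0150
## 1 0.2250 0.4125 0.1125
## 2 0.0450 0.0825 0.0225
any(abs(jp - indep) > 1e-12)   # TRUE: some cell differs, so X and Y are dependent
## [1] TRUE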
At times throughout the book, we have made the implicit assumption that certain random variables are independent
of each other. For example, for two coin tosses that have nothing to do with each other, the tosses can be thought of as
independent random variables. To be more precise, if X = 1 if the first coin is heads and 0 if the first coin is tails and
Y = 1 if the second coin is heads and 0 if the second coin is tails, the following joint probability table represents the
case when X and Y are independent:
           Y
           0      1
     0     0.25   0.25
X    1     0.25   0.25
The joint probabilities are products of the respective marginal probabilities since each marginal probability (of heads or tails for either toss) is equal to 0.5. Similarly, for two fair die rolls X ∈ {1, 2, 3, 4, 5, 6} and Y ∈ {1, 2, 3, 4, 5, 6} that are independent, the joint probability of any of the 36 possible outcomes (x, y) is 1/36 since each marginal probability for X is equal to 1/6 and each marginal probability for Y is equal to 1/6. Here is another example of two independent discrete random variables:
Example 8.23 (Website purchases) For two visitors to a website, let X = 1 if the first visitor makes a purchase and 0
otherwise and Y = 1 if the second visitor makes a purchase and 0 otherwise. If the marginal probability of purchase is
0.2 for both individuals, the probability table in the case of independent X and Y (i.e., their purchase behaviors are
not related) is:
           Y
           0                    1
     0     (0.8)(0.8) = 0.64    (0.8)(0.2) = 0.16
X    1     (0.2)(0.8) = 0.16    (0.2)(0.2) = 0.04
An alternative way to show that two discrete random variables are dependent is to consider the population covariance
or population correlation. If the population covariance/correlation is non-zero, there is some linear association between
the two random variables, meaning they must be dependent. The following proposition formally states this result:
Proposition 8.5. If the discrete random variables X and Y are independent, the population covariance σXY and the
population correlation ρXY are equal to zero (σXY = ρXY = 0). Equivalently, if X and Y have a non-zero population
covariance or correlation, X and Y are dependent.
The reverse is not necessarily true: having a population covariance/correlation between X and Y equal to zero does
not necessarily imply independence. As with sample covariance/correlation, these population statistics only measure
the linear relationship between two random variables, so it’s possible that there is some non-linear dependence between
the two random variables even when the covariance/correlation is equal to zero. The following example shows such a
situation, where X and Y are dependent even though the population covariance is zero:
Example 8.24 Consider the following probability table for X ∈ {0, 1} and Y ∈ {–1, 0, 1}:
            Y
            –1     0      1      pX(x)
      0     0      0.5    0      0.5
X     1     0.25   0      0.25   0.5

Here, µX = 0.5 and µY = (–1)(0.25) + (0)(0.50) + (1)(0.25) = 0, so the population covariance is

σXY = (0 – 0.5)(0 – 0)(0.5) + (1 – 0.5)(–1 – 0)(0.25) + (1 – 0.5)(1 – 0)(0.25) = 0.

Yet X and Y are dependent: for instance, pXY(0, 0) = 0.5 while pX(0)pY(0) = (0.5)(0.5) = 0.25, so the joint probability is not the product of the marginal probabilities.
Example 8.26 Consider the following probability table for X ∈ {0, 1} and Y ∈ {–1, 0, 1}, with the marginal
probabilities also calculated:
            Y
            –1      0      1      pX(x)
      0     7/36    2/36   3/36   12/36
X     1     14/36   4/36   6/36   24/36
The conditional distributions of Y|X are the same as the marginal distribution of Y:

pY|X(–1|0) = 7/12,  pY|X(0|0) = 2/12,  pY|X(1|0) = 3/12   (conditional on X = 0)

pY|X(–1|1) = 14/24,  pY|X(0|1) = 4/24,  pY|X(1|1) = 6/24   (conditional on X = 1)
The probabilities in the X = 1 row are all proportional to the corresponding probabilities in the X = 0 row; specifically, each one is two times the corresponding probability in the X = 0 row. This proportionality is what leads the conditional pmf's to be unchanged and, therefore, equal to the marginal pmf of Y. Similarly, the conditional distributions of X|Y are the same as the marginal distribution of X, which can be verified by the reader. Therefore, in this case, the random variables X and Y are independent.
The concept of independence can be generalized to additional random variables, with the intuition remaining
the same. When random variables have no relationship with each other, they are independent; otherwise, they are
dependent. Definition 8.17, which considered the case of two discrete random variables, can be extended to the general
case of multiple discrete random variables:
Definition 8.18 The m discrete random variables X1, X2, …, Xm, where m ≥ 2, are independent if and only if

P(X1 = xk1*, X2 = xk2*, …, Xm = xkm*) = P(X1 = xk1*) P(X2 = xk2*) ··· P(Xm = xkm*)

for any possible joint outcome (xk1*, xk2*, …, xkm*) of (X1, X2, …, Xm). If this equality fails for any joint outcome (xk1*, xk2*, …, xkm*), then the discrete random variables X1, X2, …, Xm are dependent.
Using the notation pX1X2···Xm(xk1*, xk2*, …, xkm*) to denote the joint pmf of (X1, X2, …, Xm), the definition can be re-stated as the discrete random variables X1, X2, …, Xm being independent if and only if

pX1X2···Xm(xk1*, xk2*, …, xkm*) = pX1(xk1*) pX2(xk2*) ··· pXm(xkm*)

for any possible joint outcome (xk1*, xk2*, …, xkm*) of (X1, X2, …, Xm). Equivalently, the definition could be stated in terms of cdf's rather than pmf's, with the discrete random variables X1, X2, …, Xm being independent if and only if

FX1X2···Xm(xk1*, xk2*, …, xkm*) = FX1(xk1*) FX2(xk2*) ··· FXm(xkm*)

for any possible joint outcome (xk1*, xk2*, …, xkm*) of (X1, X2, …, Xm).
As with two discrete random variables, the independence of multiple random variables is characterized by the joint
probability of an outcome being equal to the product of the marginal probabilities. It becomes more difficult to check
independence when the number of random variables increases, as there are more joint probabilities to consider. That
said, the most important use of this concept is to apply the fact that the joint probability is equal to the product of the
marginal probabilities when it is either known or assumed that a collection of random variables are independent.
Consider the linear transformations V = a + bX and W = c + dY, where a, b, c, and d are known constants. The following table provides the population covariance and correlation of V and W in terms of the population covariance and correlation of X and Y, along with a side-by-side comparison to the results for the analogous sample descriptive statistics from Section 6.7:
                         Sample                       Population
Linear transformations   v = a + bx, w = c + dy       V = a + bX, W = c + dY
Covariance               svw = bd sxy                 σVW = bd σXY
Correlation              rvw = rxy if bd > 0          ρVW = ρXY if bd > 0
                         rvw = –rxy if bd < 0         ρVW = –ρXY if bd < 0
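The sign-flip result in this table can be illustrated by simulation; the sketch below uses arbitrary constants of our own choosing (a = 3, b = 2, c = 5, d = –4, so bd < 0):

set.seed(1)
x <- sample(0:2, 10000, replace=TRUE)
y <- x + sample(0:1, 10000, replace=TRUE)   # y constructed to be correlated with x
v <- 3 + 2*x    # V = a + bX with b = 2
w <- 5 - 4*y    # W = c + dY with d = -4
all.equal(cor(v, w), -cor(x, y))   # bd < 0 flips the sign; magnitude unchanged
## [1] TRUE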
Example 8.29 (Placing two bets) Example 8.28 considered a situation in which a bet was placed on the outcome of an event. For this example, let's consider two separate bets on two different events, where X = 1 if the first event occurs and 0 otherwise and Y = 1 if the second event occurs and 0 otherwise. For the first bet, the winnings are w1 if X = 1 and the losses are ℓ1 if X = 0. For the second bet, the winnings are w2 if Y = 1 and the losses are ℓ2 if Y = 0. The net gains on the two bets, V and W respectively, are given by the following linear transformations:

V = –ℓ1 + (w1 + ℓ1)X   and   W = –ℓ2 + (w2 + ℓ2)Y.

The population covariance between V and W is

σVW = (w1 + ℓ1)(w2 + ℓ2)σXY,

and the population correlation between V and W is ρVW = ρXY since w1 > 0, ℓ1 > 0, w2 > 0, and ℓ2 > 0. Perhaps not
surprisingly, the sign of the population correlation between the net gains on the two bets is the same as the sign of the
population correlation between the two events. If Y = 1 is more likely when X = 1 as compared to X = 0, this correlation
is positive. If Y = 1 is less likely when X = 1 as compared to X = 0, this correlation is negative. The equality ρVW = ρXY
also tells us that, in addition to the signs being equal, the magnitude of the two population correlations is the same.
As seen in the table above, this last finding is a general one, with linear transformations of two random variables
having a population correlation with the same magnitude as the population correlation of the underlying two random
variables.
For two random variables X and Y, the population variances of the sum and the difference are

V = X + Y =⇒ σ²V = σ²X + σ²Y + 2σXY

and

V = X – Y =⇒ σ²V = σ²X + σ²Y – 2σXY.
Example 8.30 (Correlated purchases) Suppose a website sells two types of widgets (widget A and widget B), and a
visitor to the website is allowed to purchase one of each type of widget. For a given visitor to the website, let X = 1 if
they buy a widget of type A and X = 0 if not, and let Y = 1 if they buy a widget of type B and Y = 0 if not. The joint pmf
for X and Y is given in the following probability table:
           Y
           0      1
     0     0.70   0.05
X    1     0.05   0.20
X and Y are positively correlated here, with

σXY = (0 – 0.25)(0 – 0.25)(0.70) + (0 – 0.25)(1 – 0.25)(0.05) + (1 – 0.25)(0 – 0.25)(0.05) + (1 – 0.25)(1 – 0.25)(0.20) = 0.1375.
Proposition 8.7. Suppose the random variable

V = k + a1X1 + a2X2 + · · · + amXm,

where k, a1, a2, …, am are constants, is a linear combination of the m random variables X1, X2, …, Xm. The population statistics for the random variable V have the following relationships to the population statistics for the random variables X1, X2, …, Xm:

(i) (population mean) µV = k + a1µX1 + a2µX2 + · · · + amµXm = k + Σ_{j=1}^{m} aj µXj

(ii) (population variance) σ²V = Σ_{j=1}^{m} aj² σ²Xj + 2 Σ_{j=1}^{m–1} Σ_{ℓ=j+1}^{m} aj aℓ σXjXℓ

(iii) (population standard deviation) σV = √(σ²V) = √( Σ_{j=1}^{m} aj² σ²Xj + 2 Σ_{j=1}^{m–1} Σ_{ℓ=j+1}^{m} aj aℓ σXjXℓ )
Part (ii) of Proposition 8.7 implies that, for random variables that are not independent of each other, the population variance of the linear combination depends upon the covariances that exist between any pair of random variables. For a linear combination of m = 3 random variables (V = k + a1X1 + a2X2 + a3X3), the population variance is

σ²V = a1² σ²X1 + a2² σ²X2 + a3² σ²X3 + 2a1a2 σX1X2 + 2a1a3 σX1X3 + 2a2a3 σX2X3,

which has C(3, 2) = 3 covariance terms. More generally, the number of covariance terms is C(m, 2) for a linear combination of m random variables. As seen in the next section, the population variance σ²V simplifies considerably when X1, X2, …, Xm are independent since all the covariance terms are zero, leaving just the sum of the scaled underlying variances.
Two specific cases of linear combinations that are particularly useful are (i) the sum of independent random variables and (ii) the average of independent random variables:19

(i) Sum of independent random variables: V = X1 + X2 + · · · + Xm

µV = µX1 + µX2 + · · · + µXm = Σ_{j=1}^{m} µXj

σ²V = σ²X1 + σ²X2 + · · · + σ²Xm = Σ_{j=1}^{m} σ²Xj

(ii) Average of independent random variables: V = (1/m)(X1 + X2 + · · · + Xm)

µV = (1/m)µX1 + (1/m)µX2 + · · · + (1/m)µXm = (1/m) Σ_{j=1}^{m} µXj

σ²V = (1/m²)σ²X1 + (1/m²)σ²X2 + · · · + (1/m²)σ²Xm = (1/m²) Σ_{j=1}^{m} σ²Xj
Example 8.31 (Three rolls of a die) Suppose a fair die is rolled three times, where X1, X2, and X3 are the random variables associated with the possible outcomes of the three rolls. Assume that the three random variables are independent, meaning the outcome of each roll has nothing to do with the outcomes of the other rolls. In that case, the random variable representing the sum of the three outcomes, X1 + X2 + X3, has population mean

µX1 + µX2 + µX3 = 3.5 + 3.5 + 3.5 = 10.5

and population variance

σ²X1 + σ²X2 + σ²X3 = 35/12 + 35/12 + 35/12 = 35/4.

Since each roll has the same population mean (3.5) and population variance (35/12), the expected value of the sum is equal to three times the expected value of a single roll, and the population variance of the sum is equal to three times the population variance of a single roll.
The random variable representing the average of the three outcomes, (1/3)(X1 + X2 + X3), has population mean

(1/3)µX1 + (1/3)µX2 + (1/3)µX3 = (1/3)(3.5 + 3.5 + 3.5) = 3.5

and population variance

(1/9)σ²X1 + (1/9)σ²X2 + (1/9)σ²X3 = (1/9)(35/12 + 35/12 + 35/12) = 35/36.

The expected value for the average of three die rolls is the same as the expected value of a single roll, and the population variance of the average of three die rolls is 1/3 times the population variance of a single roll.
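A quick Monte Carlo check of Example 8.31 (a sketch; the sample statistics should be close to, but not exactly equal to, the population values):

set.seed(1234)
sums <- replicate(100000, sum(sample(1:6, 3, replace=TRUE)))
c(mean(sums), var(sums))      # population values: 10.5 and 35/4 = 8.75
c(mean(sums/3), var(sums/3))  # population values: 3.5 and 35/36 (about 0.972)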
Example 8.32 (100 coin tosses) Suppose a fair coin is tossed 100 times, where X1 , X2 , …, X100 are the random
variables associated with the possible outcomes of the tosses. Assume that the coin tosses are all independent of
each other. For any given coin toss Xj , the population mean is 0.5, and the population variance is (0.5)(1 – 0.5) = 0.25.
The random variable representing the sum of the 100 outcomes has population mean

µX1 + µX2 + · · · + µX100 = (100)(0.5) = 50

and population variance

σ²X1 + σ²X2 + · · · + σ²X100 = (100)(0.25) = 25.

For 100 coin tosses, the population mean or expected value is 50 heads. The population variance is 100 times the population variance of a single coin toss.
The random variable representing the average of the 100 outcomes has population mean

(1/100)(µX1 + µX2 + · · · + µX100) = (1/100)(100)(0.5) = 0.5

and population variance

(1/100²)σ²X1 + (1/100²)σ²X2 + · · · + (1/100²)σ²X100 = (1/100²)(100)(0.25) = 0.0025.

The expected value for the average of 100 tosses is the same as the expected value of a single toss, both equal to 0.5, and the population variance of the average of 100 tosses is 1/100 times the population variance of a single toss.
For Examples 8.31 and 8.32, the underlying i.i.d. random variables X1, X2, …, Xm share a common population mean µX, equal to 3.5 for the die rolls and 0.5 for the coin tosses, and a common population variance σ²X, equal to 35/12 for the die rolls and 0.25 for the coin tosses. When i.i.d. variables X1, X2, …, Xm share a common mean µX and a common variance σ²X, it follows that the population mean, population variance, and population standard deviation of the sum of the random variables are mµX, mσ²X, and √m σX, respectively, and the population mean, population variance, and population standard deviation of the average of the random variables are µX, σ²X/m, and σX/√m, respectively.
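These i.i.d. formulas are easy to check by simulation; a sketch for the coin-toss case with m = 100:

set.seed(5678)
one.avg <- function() mean(sample(0:1, 100, replace=TRUE))  # average of 100 tosses
avgs <- replicate(20000, one.avg())
c(mean(avgs), sd(avgs))   # population values: 0.5 and 0.5/sqrt(100) = 0.05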
As Example 8.33 illustrates, there is a two-step approach to calculating the expected value of a function of a discrete
random variable. First, we apply the function to each possible outcome xk∗ to get the full set of possible outcomes.
Second, we use the original probabilities pX (xk∗ ) to weight the new set of possible outcomes. This approach is stated as
a general result in the following proposition:
Proposition 8.8. For any function g(x) and a discrete random variable X with pmf pX(·), the population mean or expected value of g(X) is

µg(X) = E(g(X)) = Σ_k g(xk*) pX(xk*).
A special case of Proposition 8.8 is the formula for the population variance (Definition 8.6), where g(X) = (X – µX )2 .
Example 8.34 (Fair-die gamble) Consider the following gamble based upon the roll of a fair die. A player pays $2, rolls a fair die whose outcome is the random variable X ∈ {1, 2, 3, 4, 5, 6}, and receives a payout of √X dollars. Applying Proposition 8.8 with g(X) = √X – 2 yields an expected value of net winnings of

Σ_{k=1}^{6} (√k – 2) × (1/6) = (1/6)(√1 + √2 + √3 + √4 + √5 + √6) – 2 ≈ –0.1947.
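Since each outcome of the die has probability 1/6, the expected net winnings can be verified in one line of R:

mean(sqrt(1:6) - 2)
## [1] -0.1946963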
Proposition 8.8 can be generalized to multiple discrete random variables. The following proposition considers the
case of two discrete random variables, where the function of interest now has two arguments (random variables X and
Y) and the joint pmf pXY (·, ·) is used:
Proposition 8.9. For any function g(x, y) and discrete random variables X and Y with joint pmf pXY(·, ·), the population mean or expected value of g(X, Y) is

µg(X,Y) = E(g(X, Y)) = Σ_(k,ℓ) g(xk*, yℓ*) pXY(xk*, yℓ*).
A special case of Proposition 8.9 is the formula for the population covariance (Definition 8.15), where g(X, Y) =
(X – µX )(Y – µY ).
Example 8.35 (Two projects) A firm is undertaking two projects A and B, each of which may succeed or fail. The joint
probabilities of success and failure are pAB (0, 0) (both fail), pAB (0, 1) (A fails, B succeeds), pAB (1, 0) (A succeeds, B
fails), and pAB (1, 1) (both succeed). The firm realizes a profit of KA if project A succeeds, a profit of KB if B succeeds,
and an additional profit Kextra if both succeed, where KA , KB , and Kextra are constants. Then, the firm’s profit is
g(A, B) = KA A + KB B + Kextra AB.
The additional profit Kextra is realized only when AB = 1 or A = B = 1. Applying Proposition 8.9, the expected profit is
E(g(A, B)) = (0)pAB (0, 0) + (KB )pAB (0, 1) + (KA )pAB (1, 0) + (KA + KB + Kextra )pAB (1, 1)
or
E(g(A, B)) = KA (pAB (1, 0) + pAB (1, 1)) + KB (pAB (0, 1) + pAB (1, 1)) + Kextra pAB (1, 1).
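A small sketch of this calculation in R, using hypothetical joint probabilities and profit constants chosen purely for illustration:

pAB <- matrix(c(0.2, 0.2,
                0.2, 0.4), nrow=2, byrow=TRUE, dimnames=list(A=0:1, B=0:1))
KA <- 10; KB <- 8; Kextra <- 5          # hypothetical profit constants
profit <- function(a, b) KA*a + KB*b + Kextra*a*b
# expected profit: weight each profit outcome by its joint probability
sum(profit(0,0)*pAB[1,1], profit(0,1)*pAB[1,2],
    profit(1,0)*pAB[2,1], profit(1,1)*pAB[2,2])
## [1] 12.8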
Notes
18 The result for FX(xk*) simplifies using the formula for the sum of a finite geometric series: Σ_{j=1}^{k} a r^(j–1) = a(1 – r^k)/(1 – r).
19 The results for the expectation µV do not require independence, as µV does not depend upon the covariances.
Exercises
1. The pmf of the number of machines X at a given factory that break down in a day is:
# of breakdowns (xk*)      0     1     2      3
probability (pX(xk*))      0.2   0.5   0.25   0.05
(c) Thinking about Option 1, what probability p on the $10,000 profit (and 1 – p on the $1,000 loss) would yield
an expected profit equal to that of Option 3?
(d) If the investor randomly chooses one of the three options (each with probability 1/3), what is the probability that
the investor realizes a profit?
(e) Let X2 denote the random variable associated with the profit or loss from Option 2. Draw the cdf associated with X2.
6. The joint pmf of two random variables X and Y is given by the following table:
Y
1 2 3
1 0.24 0.12 0.04
X 2 0.12 0.06 0.02
3 0.24 0.12 0.04
(a) Are X and Y independent?
(b) What is the expected value of X + Y?
(c) What is the population variance of X + Y?
(d) What is the expected value of XY?
(e) What is the probability of X = 1 conditional on Y = X + 1?
7. For two positive numbers p and q with p + q < 1, the joint pmf of random variables X and Y is given by the following
table:
        Y
        y1*    y2*
   x1*  p      ???
X  x2*  q      ???
If X and Y are independent, what are the values of the two joint probabilities (in terms of p and q) in the second
column?
8. A bank has two car lanes with ATM machines. The following table provides the joint pmf for the number of cars in
each of the two lanes at a given time:
Cars in Lane 2 (Y)
0 1 2 3
0 0.05 0.15 0.02 0
Cars in Lane 1 (X) 1 0.15 0.20 0.10 0
2 0.02 0.10 0.10 0.03
3 0 0 0.03 0.05
(a) What is the probability that exactly one car is in Lane 1?
(b) What is the probability that exactly one car is in either lane?
(c) What is the probability that exactly one car is in both lanes?
(d) What is the probability that there are the same number of cars in the two lanes?
(e) What is the probability that exactly one car is in Lane 1 if two cars are in Lane 2?
(f) Are X and Y independent?
(g) Calculate the population covariance of X and Y.
(h) Consider the total number of cars in the two lanes, given by X + Y. What is the pmf of X + Y? What is E(X + Y)?
What is Var(X + Y)?
(i) Consider the difference in the number of cars in the two lanes, given by X – Y. What is the pmf of X – Y? What
is E(X – Y)? What is Var(X – Y)?
(j) Does the difference between Var(X + Y) and Var(X – Y) found in (h) and (i) make sense given the population
covariance found in (g)?
9. Consider the population of families in the United States having at least two children. The following table describes
the probability that a randomly chosen family has a third child, broken down by the number of boys that the family
has among its first two children:
# of boys among first two children 0 1 2
Probability of having a third child 0.30 0.25 0.30
Let B ∈ {0, 1, 2} be the random variable representing the number of boys among the first two children. Let T be the
random variable indicating whether the family has a third child, with T = 1 for families having a third child and T = 0
for families not having a third child. Assume the following pmf for B: pB (0) = 0.25, pB (1) = 0.50, pB (2) = 0.25.
(a) Explain why the probabilities in the table do not add up to one.
(b) In terms of B and T, what does the 0.25 value in the table represent?
(c) What is the unconditional probability that a family with two children has a third child?
(d) Show the joint pmf of B and T in a table.
(e) Are B and T independent?
(f) What is the population covariance between B and T?
10. You are faced with two possible investment choices, whose annual returns are represented by the random variables
X and Y. The possible returns for the first investment (X) are 1%, 2%, 3%, and 4%, and the possible returns for the
second investment (Y) are 2%, 3%, and 4%. For simplicity, we omit the “%” sign below. The joint pmf is:
X
1 2 3 4
2 0.10 0.05 0.05 0
Y 3 0.05 0.15 0.05 0.05
4 0.05 0.05 0.15 0.25
(a) What is the joint cdf evaluated at X = 2 and Y = 3 (FXY (2, 3))?
(b) What are the marginal pmf’s of X and Y?
(c) What is the conditional pmf of X given Y = 3?
(d) What is the expected value of X?
(e) What is the expected value of Y?
(f) What is the population variance of X?
(g) What is the population variance of Y?
(h) What is the population covariance between X and Y?
(i) You put $1,000 into investment X and $2,000 into investment Y. The trades cost a total of $20 to execute.
Write an expression for the random variable G, which represents your net gain (in dollars) over the next year.
(Remember that the returns are in percentages, so for instance X = 1 corresponds to a 1% return on $1000,
which is (0.01)(1000) = 10 dollars.)
(j) What is the expected value of G?
(k) What is the population variance of G?
11. Companies A and B are in an R&D race to develop a new technology. Each company may develop the technology
in one, two, or three years. For each company, the probability that the technology is developed in one year is 20%, the
probability that the technology is developed in two years is 50%, and the probability that the technology is developed
in three years is 30%. Assume that the companies’ efforts are independent of each other; the time that it takes Company
B to develop the technology has nothing to do with how long it takes Company A to do so.
Regardless of who develops the technology first, both companies begin commercial production in five years. If one
company develops the technology first, its market share is 75% and its competitor’s market share is 25%. If the two
companies take the same number of years to develop the technology, each has a market share of 50%.
Define the random variables A and B as the number of years that it takes Company A and Company B to develop
the technology, respectively.
(a) Find the joint pmf of A and B.
(b) Let M denote the eventual market share for Company A. What is the pmf of M?
(c) Each year of R&D costs the company $5 million. The total market is worth $100 million in net revenues
(sales minus costs of production), which the two firms divide according to their market shares. Let the random
variable R be the profits of Company A, equal to net revenues minus the R&D costs.
i. Write R as a function of A and M.
ii. What is the expected value of R (in millions of dollars)?
12. Suppose a fair die is rolled 200 times, with each roll being independent. The rolls are recorded as a 200-character
sequence of integers between 1 and 6 (inclusive).
(a) What is the probability that the sequence 354 occurs in the first three rolls?
(b) *What is the expected value of the number of times that the sequence 354 shows up in the full 200-roll
sequence? (Hint: Think about the random variables Xj defined as 1 if 354 shows up starting with the j-th
roll and 0 otherwise. Then, use the result in Proposition 8.7 for the expected value of a linear combination of
random variables.)
(c) Conduct 10,000 simulations in R to confirm your answer to (b). Each simulation involves rolling a fair die 200
times and counting the number of times that the sequence 354 occurs.
13. The annual income X (in thousands of dollars) in a population of workers has expected value 70 and population
standard deviation 50. Assume all workers’ incomes are independent of each other. For this question, you may assume
X is a discrete random variable.
(a) What is the expected value and population standard deviation for the average of two workers’ annual incomes?
(b) What is the expected value and population standard deviation for the difference between two workers’ annual
incomes?
(c) What is the expected value and population standard deviation for the average of 10 workers’ annual incomes?
(d) What is the expected value and population standard deviation for the total of 10 workers’ annual incomes?
(e) Assume that every worker in the population works 2,080 hours (52 weeks times 40 hours per week). Now,
answer (d) using hourly incomes rather than annual incomes.
14. For the population of graduating Economics students, the probability distribution (pmf) associated with X = the
number of job offers a student receives before graduation is as follows:
x 0 1 2 3 4
pX (x) 0.15 0.40 0.30 0.10 0.05
(a) Use R to determine the expected value of X. (Hint: Create two vectors, one containing the possible outcomes
for X and one containing their associated probabilities, and then do the appropriate calculation based upon
these vectors.)
(b) Use R to determine the population variance of X.
(c) Use R to determine the population standard deviation of X.
(d) Use the sample function to create a vector with 10,000 random draws of X.
i. How do the sample mean, sample variance, and sample standard deviation compare to their population
counterparts?
ii. What proportion of the random draws are less than or equal to 2? How does that compare to FX (2)?
Chapter 8 introduced the concept of a discrete random variable, defined some population descriptive statistics for
discrete random variables, and considered the properties of linear transformations and linear combinations involving
discrete random variables. This chapter introduces some commonly used models for discrete random variables. Each
of these models gives rise to a pmf for the underlying discrete random variable, where the pmf depends upon the
parameter(s) of the specific model being used. For instance, for the binary random variable X associated with a
website purchase (X = 1 if purchase, X = 0 if not), we can formalize a model that assumes there is some (possibly
unknown) probability π of a purchase; here, π is the “parameter” of the model, which gives rise to the pmf for X, with
pX (1) = π and pX (0) = 1 – π.
Definition 9.1 A random variable X is a Bernoulli random variable with parameter π if X ∈ {0, 1} and π = pX (1) =
P(X = 1). We write X ∼ Bernoulli(π), where “∼” is read “is distributed as.”
Examples 8.7 and 8.11 considered a Bernoulli random variable for union status, where X = 1 indicated a union
worker and X = 0 a non-union worker. The results from those examples for the union-status random variable hold for
any Bernoulli random variable, as summarized in the proposition below:
Proposition 9.1. If X ∼ Bernoulli(π),

µX = E(X) = π,   σ²X = π(1 – π),   and   σX = √(π(1 – π)).
The population mean µX is the average over the zero and one values in the population. Since the zeros contribute nothing to the average, µX is equal to the true proportion of ones in the population, which is pX(1) or π. For instance, for π = 0.4, there is a 40% chance of drawing a 1 from the population in a single experiment associated with X. For this π value, the population variance and population standard deviation are σ²X = (0.4)(0.6) = 0.24 and σX = √0.24, respectively.
Figure 9.1
Bernoulli variance as a function of π

The population variance σ²X = π(1 – π) is a quadratic function of π, as shown in Figure 9.1. The largest possible variance occurs when π is exactly equal to 0.5 (or 1/2), which is the case for a fair coin toss. Having π = 0.5 maximizes the variance or "noise" associated with the experiment of drawing a value of the random variable X from the population. The population variance, as a function of π, decreases as we move away from the 0.5 value, either to the left toward 0 or to the right toward 1. For π = 0.8, for instance, there is less uncertainty about X, as compared to the π = 0.5 case, since it is much more likely to see a 1 value than a 0 value. For even larger values, say π = 0.9 and π = 0.95, this uncertainty decreases even more. At the extremes of π = 0 and π = 1, there is no uncertainty at all, so that the variance is equal to 0, indicating that X is not even random in these extreme cases. There is a symmetry in the σ²X = π(1 – π) formula, so with respect to the variance, there is no difference between π = 0.8 and π = 0.2. For these two π values, the uncertainty or variance of X is the same; the only difference is which of the two outcomes has the 80% probability of being drawn. Of course, the population mean for these two π values would be different since µX = π.
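Figure 9.1 can be generated with code along the following lines (a sketch; not necessarily the script used for the published figure):

pp <- seq(0, 1, by=0.001)
plot(pp, pp*(1 - pp), type="l", xlab=expression(pi), ylab="Variance")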
Whether or not the parameter π is known depends upon the application. For a coin toss, it makes sense to say that
π = 0.5 is known. For a political poll, π is presumably unknown, and polls are conducted to figure out what π is. In
a case where π is unknown, like a political poll, Chapter 14 describes how to estimate π and how to measure the
uncertainty of that estimate.
Definition 9.2 The discrete random variables X1 , X2 , …, Xn are independent and identically distributed (i.i.d.) if
(i) X1 , X2 , …, Xn are independent and (ii) each Xj has the same probability mass function.
When X1 , X2 , …, Xn are independent and identically distributed (i.i.d.) Bernoulli(π) random variables, the random
variable associated with the sum of the Xj ’s is a binomial random variable:
Definition 9.3 A random variable X is a binomial random variable with parameters n and π, written X ∼
Binomial(n, π), if
X = X1 + X2 + · · · + Xn ,
where X1 , X2 , …, Xn are independent random variables with each Xj ∼ Bernoulli(π).
The possible outcomes for X are {0, 1, …, n}, with 0 corresponding to no successes among the n Bernoulli trials
and n corresponding to all successes. Since X is a linear combination of independent discrete random variables,
Proposition 8.7 implies that the population mean of the binomial random variable is

µX = µX1 + µX2 + · · · + µXn = nπ,

and the population variance of the binomial random variable is

σ²X = σ²X1 + σ²X2 + · · · + σ²Xn = nπ(1 – π).
Example 9.1 (Website purchases) Each day, a website records whether each of the first ten visitors makes a purchase
(1) or not (0). Assume (i) each visitor’s behavior is independent of every other visitor and (ii) the probability of
purchase is 20% for each visitor. Then, X ∼ Binomial(10, 0.2) is the total number of the first 10 visitors on a given day that make a purchase. The possible values of X are {0, 1, 2, …, 10}, and its population mean and variance are
µX = nπ = (10)(0.2) = 2

and

σ²X = nπ(1 – π) = (10)(0.2)(0.8) = 1.6.
The population mean and variance do not directly tell us the probabilities of the possible X outcomes. Let’s consider
the probability of three purchases being made by the first 10 visitors, which is P(X = 3). To get X = 3, it must be the case
that exactly 3 visitors make a purchase and exactly 7 visitors do not make a purchase. For a sequence of 10 visitors,
the number of possible ways to get 3 visitors purchasing (Bernoulli Xj = 1) and 7 visitors not purchasing (Bernoulli Xj = 0) is the binomial coefficient C(10, 3) = 120. One of these sequences is

(1, 1, 1, 0, 0, 0, 0, 0, 0, 0),

which occurs with probability

(0.2)(0.2)(0.2)(0.8)(0.8)(0.8)(0.8)(0.8)(0.8)(0.8) = (0.2)^3 (0.8)^7.

Similarly, any other sequence with three 1's and seven 0's also occurs with probability (0.2)^3 (0.8)^7, so that the total probability of three purchases is

P(X = 3) = C(10, 3) (0.2)^3 (0.8)^7 ≈ 0.201.
How about P(X = 4)? There are C(10, 4) = 210 sequences corresponding to 4 purchases and 6 non-purchases, each with probability (0.2)^4 (0.8)^6, so that

P(X = 4) = C(10, 4) (0.2)^4 (0.8)^6 ≈ 0.088.

This idea extends to all possible outcomes in {0, 1, 2, …, 10}, with

P(X = k) = C(10, k) (0.2)^k (0.8)^(10–k)   for any k ∈ {0, 1, 2, …, 10}.

The quantity C(10, k) is the number of sequences with k successes and 10 – k failures out of 10 trials, where the probability of any individual sequence is equal to (0.2)^k (0.8)^(10–k). For P(X = 0), note that C(10, 0) = 1, corresponding to the one possible sequence (all zeros) for no purchases. And, for P(X = 10), note that C(10, 10) = 1, corresponding to the one possible sequence of all ones (all 10 individuals making a purchase).
The results derived in Example 9.1 can be generalized to any binomial random variable:
Proposition 9.2. If X ∼ Binomial(n, π), then the probability mass function of X is

P(X = k) = C(n, k) π^k (1 – π)^(n–k)   for k ∈ {0, 1, …, n}.

The following R functions are useful for working with a binomial random variable:
• dbinom(x, size, prob): Returns the pmf of a Binomial(size, prob) random variable evaluated at the argument x, which may be a single number or a vector.
• pbinom(x, size, prob): Returns the cdf of a Binomial(size, prob) random variable evaluated at the argument x, which may be a single number or a vector.
• rbinom(n, size, prob): Returns a vector of n random draws of a Binomial(size, prob) random variable.
R has functions like these for other random variables, always using the convention that a function with d at the
beginning (e.g., dbinom) returns a density or pmf, a function with p at the beginning (e.g., pbinom) returns a cdf,
and a function with r at the beginning (e.g., rbinom) creates random draws of the random variable.
The following R code calculates P(X = 3) and P(X = 4) from Example 9.1, where X ∼ Binomial(10, 0.2):
dbinom(3,10,0.2)
## [1] 0.2013266
dbinom(4,10,0.2)
## [1] 0.08808038
Using the vector 0:10, consisting of all integers between 0 and 10 (inclusive), the complete pmf and cdf can be
calculated:
dbinom(0:10,10,0.2)
## [1] 0.1073741824 0.2684354560 0.3019898880 0.2013265920 0.0880803840
## [6] 0.0264241152 0.0055050240 0.0007864320 0.0000737280 0.0000040960
## [11] 0.0000001024
pbinom(0:10,10,0.2)
## [1] 0.1073742 0.3758096 0.6777995 0.8791261 0.9672065 0.9936306 0.9991356
## [8] 0.9999221 0.9999958 0.9999999 1.0000000
And, the following code simulates 50 i.i.d. draws from the X ∼ Binomial(10, 0.2) distribution of Example 9.1:
set.seed(1234)
rbinom(50,10,0.2)
## [1] 1 2 2 2 3 2 0 1 2 2 3 2 1 4 1 3 1 1 1 1 1 1 1 0 1 3 2 4 3 0 2 1 1 2 1 3 1 1
## [39] 5 3 2 2 1 2 1 2 2 2 1 3
Figure 9.2
Probability mass function for stock-increase example
Example 9.2 (Days of stock price increases) Suppose that whether a stock goes up on a given day is described by a Bernoulli(0.6) random variable, where "success" (1) indicates the stock goes up and "failure" (0) indicates the stock doesn't go up. Moreover, assume that the Bernoulli random variables associated with each day are independent of each other. Under these assumptions, for a sequence of 20 days, what is the probability that the stock goes up on exactly 12 of the 20 days? In this case, X ∼ Binomial(20, 0.6), implying

P(X = 12) = C(20, 12) (0.6)^12 (0.4)^8 ≈ 0.180.
The probabilities P(X = x) for other x values can be similarly calculated. The R code below creates Figure 9.2, which
graphs the pmf of X. The type="h" optional argument for the plot function is used to draw a vertical line from the
x-axis to each pmf value.
plot(0:20, dbinom(0:20,20,0.6), type="h", axes=FALSE, main="pmf for # of days that stock goes up",
xlab="", ylab="Probability", xlim=c(0,20), ylim=c(0,0.2))
axis(1, at=0:20)
axis(2, at=seq(0,0.2,0.02))
X = 12 is the most likely (modal) outcome here, and the probabilities decrease either to the left or to the right of 12.
The pmf looks somewhat symmetric, but it is not exactly symmetric around 12; for instance, P(X = 13) is slightly larger
than P(X = 11).
Interestingly, the pmf in Example 9.2 has an approximate bell shape. This particular shape arises due to the
parameters of the binomial distribution (n = 20, π = 0.6) being considered. To examine other possible shapes of
binomial distributions, Figure 9.3 shows pmf’s for four different binomial random variables: (i) Binomial(10, 0.1)
in the upper-left graph, (ii) Binomial(10, 0.2) in the upper-right graph, (iii) Binomial(10, 0.5) in the lower-left graph,
and (iv) Binomial(100, 0.1) in the lower-right graph. For the Binomial(10, 0.1) random variable, the success probability
is so low (10%) that the most likely outcomes are one success and zero successes, with the other outcomes (two or
more successes) having declining probabilities. There is no evident bell shape in this graph. For the Binomial(10, 0.2)
random variable, the bell shape begins to emerge, although there isn’t really a left tail since the lowest possible outcome
is zero successes. For the Binomial(10, 0.5) random variable, the pmf has an approximate bell shape and, moreover,
a perfectly symmetric one. With a success probability of 0.5, the likelihood of having four successes out of ten trials
is the same as having six successes out of ten trials, the likelihood of having three successes is the same as having
seven successes, and so on. Finally, for the Binomial(100, 0.1) random variable, the pmf has a bell shape even though
the success probability is low (10%) and, in fact, the same as we had with the Binomial(10, 0.1) random variable. The
difference here is that the number of trials is much larger (100) than the number of trials (10) for the Binomial(10, 0.1)
random variable.
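Figure 9.3
Probability mass functions for binomial random variables

Code along these lines could produce the four panels of Figure 9.3 (a sketch; the axis details of the published figure may differ):

par(mfrow=c(2,2))
plot(0:10, dbinom(0:10,10,0.1), type="h", xlab="", ylab="Probability")
plot(0:10, dbinom(0:10,10,0.2), type="h", xlab="", ylab="Probability")
plot(0:10, dbinom(0:10,10,0.5), type="h", xlab="", ylab="Probability")
plot(0:100, dbinom(0:100,100,0.1), type="h", xlab="", ylab="Probability")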
1-pbinom(100,200,0.52)
## [1] 0.6900274
Even though the true probability (0.52) of favoring candidate A is greater than 1/2, there is approximately a 31.0% chance that the observed proportion in the 200-voter poll is less than or equal to 1/2. What if the poll has more participants? If 1,000 voters were polled instead of 200 voters, so that X ∼ Binomial(1000, 0.52), the probability that the proportion is strictly greater than 1/2 is equal to

P(P > 0.5) = P(X > 500) = Σ_{k=501}^{1000} C(1000, k) (0.52)^k (0.48)^(1000–k) ≈ 0.891.
Figure 9.4
Probability mass functions for voter-poll example
1-pbinom(500,1000,0.52)
## [1] 0.8914189
In this larger poll, there is a considerably lower chance, approximately 10.9%, that the observed proportion in the poll is less than or equal to 1/2. If we continue to increase the size of the poll, the probability P(P > 0.5) gets closer and closer to one. Figure 9.4 shows the pmf's for the 200-voter and 1,000-voter polls. For each pmf, a dashed vertical line is drawn at 0.5, and a solid vertical line is drawn at 0.52. For both pmf's, the solid vertical line is at the center of the distribution. The probability P(P > 0.5) corresponds to the sum of the probability values to the right of the dashed vertical line. In comparing the two pmf's, a much larger portion of the distribution is to the right of 0.50 for the 1,000-voter poll, corresponding to the 0.891 value above, as compared to the 200-voter poll, corresponding to the 0.690 value above. The dispersion for the 200-voter poll is also clearly much larger than the dispersion for the 1,000-voter poll since σ²P = π(1 – π)/n = (0.52)(0.48)/n.
Definition 9.4 A geometric random variable X with parameter π, written X ∼ Geo(π), is the number of failures
observed before a success for a sequence of X1 , X2 , … independent random variables with each Xj ∼ Bernoulli(π).
The possible outcomes of X ∼ Geo(π) are {0, 1, 2, …}. If the first trial is a success, then X = 0 since there were no
failures before the success. If the second trial is a success, then X = 1 since there was one failure before the success.
And so on. Since successes are more likely for higher values of π, we should tend to get low values of X if π is high
and high values of X if π is low.
Example 9.4 (Visitors before a purchase) On a given day, the widgets.com website has a sequence of visitors,
each of whom has a 20% probability of purchasing a widget during their visit. The purchase behavior for the sequence
of visitors is described by a sequence of i.i.d. Bernoulli(0.2) random variables, with Xj = 1 for purchase and Xj = 0 for
no purchase. Then, the pmf of X ∼ Geo(0.2), the number of non-purchasing individuals that visit the website prior to
a purchase being made, is
P(X = 0) = P(X1 = 1) = 0.2
P(X = 1) = P(X1 = 0, X2 = 1) = (0.8)(0.2) = 0.16
P(X = 2) = P(X1 = 0, X2 = 0, X3 = 1) = (0.8)²(0.2) = 0.128
⋮
P(X = k) = P(X1 = 0, X2 = 0, …, Xk = 0, Xk+1 = 1) = (0.8)^k (0.2)
⋮
The term “geometric” is used for this random variable since each successive probability is equal to the previous
probability multiplied by the same constant (0.8 here). The interested reader can use Proposition 3.7 to confirm that
these probabilities add up to one.
The following proposition provides the general form of the pmf of a geometric random variable:
Proposition 9.3. If X ∼ Geo(π), then the probability mass function of X is

P(X = k) = (1 – π)^k π   for k ∈ {0, 1, 2, …}.

The population mean of X is

µX = (1 – π)/π,

the population variance of X is

σ²X = (1 – π)/π²,

and the population standard deviation of X is

σX = √(1 – π) / π.
The following R functions are useful for working with a geometric random variable:
• dgeom(x, prob): Returns the pmf of a Geo(prob) random variable evaluated at the argument x, which may
be a single number or a vector.
• pgeom(x, prob): Returns the cdf of a Geo(prob) random variable evaluated at the argument x, which may
be a single number or a vector.
• rgeom(n, prob): Creates a vector of n i.i.d. random draws of a Geo(prob) random variable.
For the X ∼ Geo(0.2) random variable of Example 9.4, the pmf and cdf values for k ∈ {0, 1, · · · , 10} are calculated as
follows:
dgeom(0:10,0.2)
## [1] 0.20000000 0.16000000 0.12800000 0.10240000 0.08192000 0.06553600
## [7] 0.05242880 0.04194304 0.03355443 0.02684355 0.02147484
pgeom(0:10,0.2)
## [1] 0.2000000 0.3600000 0.4880000 0.5904000 0.6723200 0.7378560 0.7902848
## [8] 0.8322278 0.8657823 0.8926258 0.9141007
Regardless of the value of π, P(X = 0) is the largest probability in the pmf. The successive probabilities are strictly
decreasing as k gets higher since each one is equal to the previous probability times (1 – π), a value less than one.
Figure 9.5 shows the pmf’s for four different geometric random variables: (i) X ∼ Geo(0.1) in the upper-left graph,
(ii) X ∼ Geo(0.2) in the upper-right graph, (iii) X ∼ Geo(0.5) in the lower-left graph, and (iv) X ∼ Geo(0.8) in the
lower-right graph. Each graph has the same y-axis, extending from 0 to 0.8, for ease of comparison.
For X ∼ Geo(0.1), the pmf starts at 0.1 and slowly declines with a long right tail, so that even quite high values for
X (i.e., many failures before a success) are possible. For X ∼ Geo(0.2), considered in Example 9.4, the pmf starts at
0.2 and declines more quickly than the π = 0.1 case. The decline in probabilities becomes even more pronounced for
X ∼ Geo(0.5) and X ∼ Geo(0.8). X ∼ Geo(0.5) is the random variable describing the number of tails observed before a
head is tossed for a sequence of fair coin tosses. When X ∼ Geo(0.8), it is very likely that the random variable X has a
low value; in this case, P(X = 0) = 0.8 and P(X ≤ 2) = 0.8 + (0.2)(0.8) + (0.2)^2 (0.8) = 0.992, so that there is only a 0.8%
chance of an X value of 3 or higher.
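This cdf value is confirmed by pgeom:
pgeom(2,0.8)
## [1] 0.992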
While we will not prove the results for the population descriptive statistics given in Proposition 9.3, we can apply
the formulas for the population mean, population variance, and the population standard deviation. For the website
purchase example (Example 9.4), where X ∼ Geo(0.2), the population mean is µX = 0.8/0.2 = 4, and the population
variance is σX^2 = 0.8/0.2^2 = 20. For the much larger π parameter (π = 0.8) in the lower-right graph of Figure 9.5,
the population mean is µX = 0.2/0.8 = 0.25, and the population variance is σX^2 = 0.2/0.8^2 = 0.3125. Consistent
with Figure 9.5, the population mean and variance are both much lower for π = 0.8 than they are for π = 0.2.
Definition 9.5 A negative binomial random variable X with parameters π and r ≥ 1, written X ∼ NegBin(r, π), is
the number of failures observed before r successes for a sequence of X1 , X2 , … independent random variables with
each Xj ∼ Bernoulli(π).
When r = 1, a negative binomial random variable is a geometric random variable. That is, X ∼ NegBin(1, π) is
equivalent to X ∼ Geo(π). The following R functions are useful for working with a negative binomial random variable:
Figure 9.5
Probability mass functions for geometric random variables
• dnbinom(x, size, prob): Returns the pmf of a NegBin(size, prob) random variable evaluated at the
argument x, which may be a single number or a vector.
• pnbinom(x, size, prob): Returns the cdf of a NegBin(size, prob) random variable evaluated at the
argument x, which may be a single number or a vector.
• rnbinom(n, size, prob): Creates a vector of n i.i.d. random draws of a NegBin(size, prob) random
variable.
Since the geometric random variable is a special case of the negative binomial random variable,
dnbinom(x,1,prob) is equivalent to dgeom(x,prob), pnbinom(x,1,prob) is equivalent to pgeom(x,prob),
and rnbinom(x,1,prob) is equivalent to rgeom(x,prob).
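For example, a quick check of the pmf equivalence:
dnbinom(0:5,1,0.2)
## [1] 0.200000 0.160000 0.128000 0.102400 0.081920 0.065536
dgeom(0:5,0.2)
## [1] 0.200000 0.160000 0.128000 0.102400 0.081920 0.065536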
Example 9.5 (Visitors before three purchases) Consider the same setup as Example 9.4, except that we are now
interested in the number of non-purchasing individuals that visit the website prior to three purchases being made,
which is the random variable X ∼ NegBin(3, 0.2). To have X = 0, the first three visitors must make a purchase:
P(X = 0) = (0.2)^3.
For X = 1, the fourth visitor makes a purchase, with the other two purchases occurring among the first three visitors.
In other words, the possible sequences associated with X = 1 are (0, 1, 1, 1), (1, 0, 1, 1), and (1, 1, 0, 1), so that:

P(X = 1) = (3)(0.8)(0.2)^3 = 0.0192.

For X = 2, the fifth visitor makes a purchase, with the other two purchases occurring among the first four visitors.
How many sequences have two purchases and two non-purchases among the first four visitors? There are \binom{4}{2} = 6 such
sequences, so that:

P(X = 2) = \binom{4}{2} (0.8)^2 (0.2)^3 = 0.03072.
dnbinom(0:2,3,0.2)
## [1] 0.00800 0.01920 0.03072
For the general case, for X = k, the (k + 3)-th visitor makes a purchase, with the other two purchases occurring
among the first k + 2 visitors. There are \binom{k+2}{2} such sequences, so that:

P(X = k) = \binom{k+2}{2} (0.8)^k (0.2)^3.
Figure 9.6 shows the pmf of X ∼ NegBin(3, 0.2) in the top graph. The most likely outcomes are roughly between 5 and
10. As a comparison, two other pmf’s are shown in the figure. The first is the pmf for a lower purchase probability (10%
rather than 20%) in the middle graph, X ∼ NegBin(3, 0.1). With the lower success probability, the lower outcomes for
X become less likely and the distribution gets stretched out with a very long and thick right tail. The second is the
pmf for a larger number of purchases (4 rather than 3) in the bottom graph, X ∼ NegBin(4, 0.2). Since more purchases
are required, the distribution appears to shift a little to the right and also is a bit lower at its peak relative to the
NegBin(3, 0.2) pmf.
Here is the R code to create Figure 9.6:
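A minimal version (plotting options may differ from the script on the companion website):
par(mfrow=c(3,1))
plot(0:50, dnbinom(0:50,3,0.2), type="h", main="NegBin(3,0.2) pmf", xlab="", ylab="Probability")
plot(0:50, dnbinom(0:50,3,0.1), type="h", main="NegBin(3,0.1) pmf", xlab="", ylab="Probability")
plot(0:50, dnbinom(0:50,4,0.2), type="h", main="NegBin(4,0.2) pmf", xlab="", ylab="Probability")
par(mfrow=c(1,1))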
Let's generalize the approach from Example 9.5 to a general X ∼ NegBin(r, π), for any success probability π and
any number of successes r. For X = k, it must be the case that (i) there are a total of k + r trials with k failures and r
successes and (ii) the last trial is a success. Combining these two facts, there are k failures and r − 1 successes among the
first k + r − 1 trials. There are \binom{k+r-1}{r-1} possible sequences for which this is true. And, the probability of any individual
sequence of the k + r trials is (1 − π)^k π^r, corresponding to the k failures and r successes. The following proposition
provides the general form of the pmf of a negative binomial random variable:
provides the general form of the pmf of a negative binomial random variable:
Proposition 9.4. If X ∼ NegBin(r, π), then the probability mass function of X is

P(X = k) = \binom{k+r-1}{r-1} (1 − π)^k π^r for k ∈ {0, 1, 2, …}.
Figure 9.6
Probability mass functions for negative binomial random variables
The final discrete model of this chapter, the Poisson model, describes the number of times that an event occurs within
a fixed time interval when events occur at a constant average rate, independent
of when the last event occurred. The parameter for the Poisson model, denoted λ, represents the expected value
(population average) of the number of events that occur within the fixed time interval.
Definition 9.6 A random variable X is a Poisson random variable with parameter λ > 0, written X ∼ Poisson(λ),
if X represents the number of times that an event occurs within a fixed time interval and the following are true:
(i) λ = µX = E(X), (ii) the average rate at which events occur is constant (depending on λ) and does not depend on
whether previous events have occurred, (iii) the occurrence of one event does not affect the likelihood that a future
event occurs, and (iv) two events cannot occur at exactly the same instant.
The possible outcomes for X ∼ Poisson(λ) are {0, 1, 2, …}, with no upper limit imposed on the value of X. Here are
two examples of Poisson models:
Example 9.6 (Coffee shop customers) Consider a random variable X given by the number of customers that arrive at
a coffee shop between 10am and 11am on a given weekday. The expected value or population mean of X is 20. So, on
average, a customer arrives every three minutes during the 10am-to-11am hour. For the assumptions in Definition 9.6
to be true, it must be the case that the likelihood of one customer arriving has nothing to do with other customers arriving
and that the likelihood of arrival is constant throughout the hour. Then, X ∼ Poisson(20).
Example 9.7 (R&D and patents) A firm has an R&D department that does research throughout the year. Every so
often, the department makes a discovery that leads to a patent application. Let the random variable X be the number
of discoveries (or patent applications) by the firm in a given year. If the expected value of discoveries in a given year
is equal to two and the R&D process satisfies the assumptions from Definition 9.6, then X ∼ Poisson(2).
The following proposition provides the pmf of a Poisson random variable:
Proposition 9.5. If X ∼ Poisson(λ), then the probability mass function of X is

P(X = k) = \frac{e^{-λ} λ^k}{k!} for k ∈ {0, 1, 2, 3, …}.

The population mean of X is

µX = λ,

the population variance of X is

σX^2 = λ,

and the population standard deviation of X is

σX = \sqrt{λ}.
Deriving the probabilities associated with the pmf, P(X = k) = e^{-λ} λ^k / k!, is beyond the scope of this book, but we can
still use the results from Proposition 9.5 for examples involving Poisson random variables. A Poisson random variable
has the interesting property that its population mean is the same as its population variance, with both equal to the
parameter λ.
The following R functions are useful for working with a Poisson random variable:
• dpois(x, lambda): Returns the pmf of a Poisson(lambda) random variable evaluated at the argument x,
which may be a single number or a vector.
• ppois(x, lambda): Returns the cdf of a Poisson(lambda) random variable evaluated at the argument x,
which may be a single number or a vector.
• rpois(n, lambda): Creates a vector of n i.i.d. random draws of a Poisson(lambda) random variable.
Example 9.8 (R&D and patents) Let’s continue Example 9.7, where the number of discoveries X in a given year is
a Poisson random variable, X ∼ Poisson(2). The population mean µX and population variance σX2 are both equal to
λ = 2, and the population standard deviation is σX = \sqrt{2}. The pmf is

P(X = 0) = \frac{e^{-2} 2^0}{0!} = e^{-2} ≈ 0.135

P(X = 1) = \frac{e^{-2} 2^1}{1!} = 2e^{-2} ≈ 0.271

P(X = 2) = \frac{e^{-2} 2^2}{2!} = 2e^{-2} ≈ 0.271

P(X = 3) = \frac{e^{-2} 2^3}{3!} = \frac{4}{3} e^{-2} ≈ 0.180

P(X = 4) = \frac{e^{-2} 2^4}{4!} = \frac{2}{3} e^{-2} ≈ 0.090

⋮

P(X = k) = \frac{e^{-2} 2^k}{k!}

⋮
The pmf of X ∼ Poisson(2) is shown in the top graph in Figure 9.7. The most likely outcomes are X = 1 and X = 2,
both with probability 0.271 from above. The pmf indicates that large values of X are not likely. For example, the
probability of having more than 5 discoveries, P(X > 5), is approximately 0.0166 or 1.66%. What would happen if the
firm had a more productive R&D department? If the Poisson parameter is increased to 4, corresponding to an expected
value of four discoveries in a given year, the pmf is shown in the bottom graph of Figure 9.7. In this case, the most
likely outcomes are X = 3 and X = 4, both with probability approximately equal to 0.195. As compared to λ = 2, larger
outcomes for X are much more likely with λ = 4; for instance, P(X > 5) is approximately 0.215 or 21.5%, and even the
probability of 8 discoveries, P(X = 8), is approximately 2.98%.
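These probabilities can be verified in R:
1-ppois(5,2)
## [1] 0.01656361
1-ppois(5,4)
## [1] 0.2148696
dpois(8,4)
## [1] 0.02977018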
Here is the R code to create Figure 9.7:
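A minimal version (plotting options may differ from the script on the companion website):
par(mfrow=c(2,1))
plot(0:10, dpois(0:10,2), type="h", main="Poisson(2) pmf", xlab="", ylab="Probability")
plot(0:10, dpois(0:10,4), type="h", main="Poisson(4) pmf", xlab="", ylab="Probability")
par(mfrow=c(1,1))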
Example 9.9 (Coffee shop customers) For the coffee shop example (Example 9.6), X ∼ Poisson(20) is the number of
customers that arrive at a coffee shop between 10am and 11am on a given weekday. Figure 9.8 shows the pmf of
X ∼ Poisson(20).
plot(0:40, dpois(0:40,20), type="h", main="", xlab="Customers between 10am and 11am", ylab="Probability")
For this λ value, the pmf has a bell shape that peaks around X = 20; in fact, the most likely outcomes here are
X = 19 and X = 20, both of which occur with a probability of approximately 0.089 or 8.9%. The bell shape of this pmf
is quite different from the shape of the pmf for the lower λ value of λ = 2 in Figure 9.7. In fact, when λ is very large,
alternative approaches can be used to model the random variable (e.g., using a normal distribution, which is discussed
in Chapter 11).
Figure 9.7
Probability mass functions for Poisson patent example
What if we are interested in the number of customers that arrive during a one-minute interval (still at some point
between 10am and 11am on a weekday) rather than during a one-hour interval? If the expected number of customers
during the one-hour interval is 20, the expected number of customers during a one-minute interval must be 20/60 = 1/3
since there is a constant arrival rate. Figure 9.9 shows the pmf for the associated Poisson random variable, which
is a Poisson(1/3) random variable. The arrival of zero customers is by far the most likely outcome (71.7%), followed
by one customer (23.9%), and two customers (4.0%), with these three outcomes accounting for a total of 99.5% in
probability. Despite the same underlying process as the one-hour interval, this pmf has a quite different shape due to
the much lower λ parameter value. With λ = 1/3, the arrival of a customer is a more unusual event since it must occur
during a one-minute interval, as compared to the arrival of a customer during the full one-hour interval.
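These three probabilities can be verified with dpois:
dpois(0:2,1/3)
## [1] 0.71653131 0.23884377 0.03980729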
Notes
20 Let Y1 be the number of failures before the first success, Y2 be the number of failures after the first success and before the second success, and
so on through Yr being the number of failures after the (r − 1)-th success and before the r-th success. With these definitions, Y1, Y2, …, Yr are all
geometric random variables Geo(π). Moreover, since the underlying Bernoulli trials are independent, it is also the case that Y1, Y2, …, Yr are
independent random variables. For X ∼ NegBin(r, π), we have X = Y1 + Y2 + · · · + Yr. Then, using the results for a linear combination of
independent random variables (Proposition 10.13),

µX = µY1 + µY2 + · · · + µYr = r \frac{1 − π}{π}

and

σX^2 = σY1^2 + σY2^2 + · · · + σYr^2 = r \frac{1 − π}{π^2}.
Exercises
1. Let X = 1 if a fair die roll results in 5 or 6 and X = 0 otherwise.
Figure 9.8
Probability mass function for Poisson coffee shop example
Figure 9.9
Probability mass function for Poisson coffee shop example
(d) Plot the pmf associated with the random variable X = the total number of macroeconomists (out of five) that
support the proposal.
5. A recent study found that 13% of individuals in the United States are left-handed and 87% are right-handed.
(a) For a group of n ≥ 2 independent individuals, what is the probability that at least one individual is left-handed?
The expression should be a function of n. Evaluate the expression for n = 3, n = 4, and n = 5.
(b) For a group of n ≥ 2 independent individuals, what is the probability that exactly one individual is left-handed
if at least one individual is left-handed? The expression should be a function of n. Evaluate the expression for
n = 3, n = 4, and n = 5.
(c) For a group of n ≥ 2 independent individuals, what is the probability that exactly two individuals are left-handed
if at least one individual is left-handed? The expression should be a function of n. Evaluate the expression for
n = 3, n = 4, and n = 5.
6. You manage a factory and, due to the complexity of the manufacturing process, there is a 10% probability that any
one of your manufactured products has a defect. You may assume that each product is independent; that is, whether one
product is defective is not associated with another product being defective. On any given day, your factory produces
2000 products. Let the random variable D = the total number of defective products on a given day.
(a) Plot the pmf associated with D.
(b) What is the most likely value of D?
(c) What is P(200 < D ≤ 300)?
(d) What is P(200 ≤ D ≤ 300)?
(e) Find the smallest integer d for which P(200 – d ≤ D ≤ 200 + d) is greater than 90%. For this value of d, what is
P(200 – d ≤ D ≤ 200 + d)? (Hint: Start at d = 0 and repeatedly increase d until the probability is above 90%.)
7. A university is experiencing increased enrollment, so it needs to accommodate more students in its classes. The
university schedules a 100-student class in a room that seats only 90 students.
(a) On a given day, suppose each student attends class with an 85% probability and all students can be considered
independent of each other. What is the probability that all students who attend the class will have a seat?
(b) *Now suppose the attendance probability π of an individual student is unknown. Limiting the possible values to
the set π ∈ {0.01, 0.02, …, 0.99, 1.00}, what is the largest value of π for which there is at least a 99% probability
that all students who attend the class will have a seat?
8. An entrepreneur is intent on starting a profitable business. The probability that she succeeds in any given year is
15%, and her success in any given year is independent of the outcomes in any previous year. The entrepreneur starts
trying in year 1. Let Y ∈ {1, 2, 3, · · · } be the random variable indicating the year in which the entrepreneur is first
successful in starting a profitable business.
(a) Explain why Y is not itself a geometric random variable.
(b) Write Y in terms of a geometric random variable.
(c) Determine the probability P(4 ≤ Y ≤ 10) using the R function dgeom.
(d) What is the expected value of Y?
(e) What is the population standard deviation of Y?
9. A family really wants to have a daughter, so they decide to keep having children until they have a daughter. Assume
that the gender of any given child is independent of the genders of all other children. The probability of having a
daughter for any given birth is 48.8% (yes, slightly lower than 50%).
(a) What is the probability that the family has at least two sons before having a daughter?
(b) What is the expected number of sons that the family has before having a daughter?
(c) If the family wants to have two daughters, what is the probability that the family has at least two sons before
having two daughters? What is the expected number of sons that the family has before having two daughters?
10. A political activist collects petition signatures by going door-to-door in a neighborhood. Suppose the probability
of successfully getting a signature at any given house is 5% and the random variables associated with success at all
houses are i.i.d. Let H ∈ {20, 21, 22, 23, …} be the random variable indicating the number of houses visited to obtain
20 signatures.
(a) Explain why H is not itself a negative binomial random variable.
(b) Write H in terms of a negative binomial random variable.
(c) What is the expected value of H?
(d) What is the population standard deviation of H?
(e) Determine the probability P(300 ≤ H ≤ 400) using the R function pnbinom.
(f) A more convincing individual has a success probability of 7%. Conduct 10,000 simulations in R to approximate
the probability that this individual (7% success rate) visits strictly fewer houses than the other individual (5%
success rate) to obtain 20 signatures.
11. *A multinomial distribution provides a generalization of the binomial distribution that allows for more than two
outcomes. Specifically, the multinomial distribution is based upon n i.i.d. draws of a discrete random variable with
m ≥ 2 possible outcomes (denoted, without loss of generality, {1, 2, …, m}) and probabilities p1, p2, …, pm (with
\sum_{j=1}^{m} p_j = 1). The discrete random variables X1, X2, …, Xm correspond to the total number of times that outcome 1
occurs, outcome 2 occurs, and so on through outcome m. The joint pmf is

p_{X1 X2 ··· Xm}(x1, x2, …, xm) = P(X1 = x1, X2 = x2, …, Xm = xm) = \frac{n!}{x1! x2! · · · xm!} p1^{x1} p2^{x2} · · · pm^{xm},

where x1 + x2 + · · · + xm = n.
There are three brands of a certain product (brands 1, 2, and 3). In the population, 15% of consumers prefer brand 1,
25% of consumers prefer brand 2, and 60% of consumers prefer brand 3.
(a) If ten consumers are chosen at random, what is the probability that two prefer brand 1, three prefer brand 2,
and five prefer brand 3?
(b) If ten consumers are chosen at random, what is the probability that five prefer brand 3 and more consumers
prefer brand 2 than brand 1?
(c) If ten consumers are chosen at random, what is the expected value of the number of consumers that prefer
brand 3? (Hint: Don’t use the joint pmf. Instead, consider the binomial distribution based upon brand 3 being
chosen or not chosen.)
(d) Make 10,000 simulated draws of (X1 , X2 , X3 ) in R for ten randomly chosen consumers. That is, for each
simulation, consider ten hypothetical consumers, assign them to a preferred brand based upon the probabilities
(15%, 25%, 60%), and let (X1 , X2 , X3 ) be the overall counts for the three brands.
(e) Using the simulated draws from (d), confirm your answer to (b).
(f) Using the simulated draws from (d), what is the approximate population correlation between X1 and X2 ? Does
the sign of this correlation make sense?
12. A financial company releases a market update e-mail every Friday afternoon. The expected number of typos in a
given e-mail is 0.8, and the number of typos follows a Poisson distribution.
(a) What is the probability that a given e-mail has no typos?
(b) What is the probability that a given e-mail has two or more typos?
(c) *Assume that each weekly e-mail can be considered independent. This part considers the number of typos in
two consecutive weekly e-mails, denoted X1 and X2 .
i. Use a while loop in R to calculate the probability that the two e-mails have the same number of typos,
given by
P(X1 = X2 ) = P(X1 = X2 = 0) + P(X1 = X2 = 1) + P(X1 = X2 = 2) + · · ·
Since this probability involves an infinite sum, continue looping until the probability P(X1 = X2 = j)
falls below a very small value, say 0.000001, to get an accurate approximation of the probability.
ii. Conduct 10,000 simulations in R, taking independent draws of X1 and X2 in each simulation, to confirm
your answer to (c)(i).
13. Consider the number of customers X that arrive at a coffee shop during one minute (within the 10am-11am hour).
Suppose the store manager knows that the average arrival rate of customers is two per minute. Based upon this
information, we can model X as a Poisson random variable, X ∼ Poisson(2), with pmf

pX(x) = \frac{e^{-2} 2^x}{x!} for x = 0, 1, 2, 3, …
(a) Using the pmf formula directly, what is the probability that no customers arrive during a given minute?
(b) Confirm your answer to (a) by using the R function dpois.
(c) Plot the pmf of X for all values less than or equal to 10.
(d) What is the probability that four or more customers arrive during a given minute?
(e) Simulate 10,000 draws of X ∼ Poisson(2) in R. What proportion of the 10,000 draws are greater than or equal
to four? Is your answer similar, aside from simulation noise, to the probability found in (d)?
(f) A more popular coffee shop on the other side of town has an average of three customers per minute during the
10am-11am hour, with the number of customers arriving during a given minute (denoted Y) distributed as a
Poisson(3) random variable. Assume that X and Y are independent of each other.
i. What is the joint probability that, during a given minute, there is one customer who arrives at the
first store (the λ = 2 one) and one customer who arrives at the second store (the λ = 3 one)?
ii. Use simulated draws of X and Y to approximate the probability that more customers arrive at the first
store (the λ = 2 one) than the second store (the λ = 3 one) during a given minute. Specifically, simulate
10,000 draws of both X ∼ Poisson(2) and Y ∼ Poisson(3), and calculate the proportion of times that the
X draw is strictly greater than the Y draw.
iii. What is the expected value of the total number of customers arriving at the two stores during a given
minute?
iv. What is the population standard deviation of the total number of customers arriving at the two stores
during a given minute?
v. Use simulated draws of X and Y (again 10,000 for each) to approximate the probability that the total
number of customers arriving at the two stores is greater than or equal to six during a given minute.
14. *Suppose X1 ∼ Poisson(λ1) and X2 ∼ Poisson(λ2) are independent random variables. Show that X1 + X2 is a
Poisson(λ1 + λ2) random variable by verifying that

P(X1 + X2 = k) = \frac{e^{-(λ1+λ2)} (λ1 + λ2)^k}{k!} for k ∈ {0, 1, 2, 3, …}.

(Hint: Use the “binomial theorem,” which is

(a + b)^m = \sum_{j=0}^{m} \binom{m}{j} a^j b^{m-j} = \sum_{j=0}^{m} \frac{m!}{j!(m-j)!} a^j b^{m-j}.)
10 Continuous random variables
Chapter 8 formalized the concept of a discrete random variable, introducing the probability mass function (pmf) as
a complete characterization of the random variable and discussing population quantities like the population mean
and population variance. Chapter 9 considered some examples of models that are used in specific situations with
discrete random variables. This chapter departs from the case of discrete random variables and formalizes the concept
of a continuous random variable, where the underlying quantity being measured is a continuous or approximately
continuous variable.
Definition 10.1 A continuous random variable is a random variable that can take on any value on some interval or
intervals of the real line, including perhaps the entire real line, and for which the probability of any specific outcome
x∗ occurring is equal to zero.
Before discussing the last part of this definition (“the probability of any specific outcome x∗ occurring is equal to
zero”), let’s first consider a few examples of continuous random variables:
• Weekly earnings: S contains all possible (positive) weekly earnings values for employed individuals, and X is
equal to the outcome.
• Monthly stock return:21 S contains all possible monthly returns for a given stock, and X is equal to the outcome.
• Unemployment rate: S contains all possible unemployment rates, ranging from 0% to 100% (or 0 to 1), and X is
equal to the outcome.
To see why any specific outcome has zero probability, suppose X is equally likely to take on any value in the interval
[0, 1]. If we had P(X = x∗ ) = c for some positive number c > 0, then it must also be the case that any other outcome on
the [0, 1] interval must also have probability c of occurring since every value is equally likely. But, there are an infinite
number of possible outcomes on the interval [0, 1], meaning the probabilities of all of these outcomes would sum to
infinity, violating the Axioms of Probability.
The fact that P(X = x∗ ) = 0 for any specific outcome x∗ is a key distinguishing feature of a continuous random
variable, in contrast to a discrete random variable which has positive probabilities for at least two discrete outcomes.
For a continuous random variable, rather than analyzing probabilities of specific outcomes, which are all equal to zero,
the probabilities associated with intervals of outcomes are considered.
Definition 10.2 The probability density function (pdf) of a continuous random variable X is a function fX (·) such
that for any two numbers a and b with a ≤ b,
P(a ≤ X ≤ b) = \int_a^b fX(x) dx.
This definition places no restriction on the shape of the pdf fX (·). The fact that any outcome x∗ has a zero probability
follows directly from Definition 10.2 since
P(X = x∗) = P(x∗ ≤ X ≤ x∗) = \int_{x∗}^{x∗} fX(x) dx = 0.
As special cases, Definition 10.2 also implies

P(X ≤ b) = \int_{-∞}^{b} fX(x) dx and P(X ≥ a) = \int_{a}^{∞} fX(x) dx.
Figure 10.1
An example of a probability density function
There is nothing special about the endpoints 0 and 1 in the U(0, 1) random variable, and a uniform random variable
can be specified with different endpoints. For example, if X is a uniform random variable between 5 and 10, written
X ∼ U(5, 10), the pdf is a constant value within the [5, 10] interval and zero outside the [5, 10] interval. The constant
value can’t be equal to 1, as it is for U(0, 1), since the area of the rectangle would be 1 × (10 – 5) = 5. To have a
rectangle area equal to one, the constant value must be 1/5 or 0.2, implying the pdf is

fX(x) = \begin{cases} 0.2 & \text{if } 5 ≤ x ≤ 10 \\ 0 & \text{otherwise} \end{cases}
Figure 10.3 shows the pdf curve for X ∼ U(5, 10). Probabilities of intervals can be easily determined here as well. For
instance, P(7 ≤ X ≤ 9) = (0.2)(9 – 7) = 0.4.
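This probability can also be calculated with the R function punif, introduced below:
punif(9,5,10)-punif(7,5,10)
## [1] 0.4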
More generally, the pdf for a uniform random variable X ∼ U(a, b), where a < b, is

fX(x) = \begin{cases} \frac{1}{b-a} & \text{if } a ≤ x ≤ b \\ 0 & \text{otherwise} \end{cases}

The height of the pdf, equal to \frac{1}{b-a}, is the value that ensures that the area of the rectangle, which has width
b − a, is equal to one.
The following R functions are useful for working with a uniform random variable:
• dunif(x, min=0, max=1): Returns the pdf of a U(min, max) random variable evaluated at the argument x,
which may be a single number or a vector. The optional arguments min and max have default values of 0 and 1,
respectively.
Figure 10.2
Probability density function of a U(0, 1) random variable
• punif(x, min=0, max=1): Returns the cdf of a U(min, max) random variable evaluated at the argument x,
which may be a single number or a vector. The optional arguments min and max have default values of 0 and 1,
respectively.
• runif(n, min=0, max=1): Creates a vector of n i.i.d. random draws of a U(min, max) random variable.
The optional arguments min and max have default values of 0 and 1, respectively.
The punif function returns the cdf, discussed below in Section 10.3. The following code shows examples of dunif
and runif for a U(5, 10) random variable:
dunif(4,5,10)
## [1] 0
dunif(6,5,10)
## [1] 0.2
dunif(8,5,10)
## [1] 0.2
set.seed(1234)
runif(20,5,10)
Figure 10.3
Probability density function of a U(5, 10) random variable
Example 10.2 (Triangular distribution) Let X be a “triangular” random variable, with pdf

fX(x) = \begin{cases} x & \text{if } 0 ≤ x ≤ 1 \\ 2 − x & \text{if } 1 < x ≤ 2 \\ 0 & \text{otherwise} \end{cases}

Figure 10.4 shows the pdf curve for X, which has a triangular shape. To confirm that the area under the pdf curve is
equal to one, note that the area of a triangle is 1/2 times the length of the triangle base times the height of the triangle,
which here is (1/2)(2 − 0)(1) = 1.
Probabilities of intervals can be determined from the pdf's shape in Figure 10.4. Alternatively, the integral formula
can be used, so for example, the probability that X is less than 0.5 is

P(X < 0.5) = \int_{-∞}^{0.5} fX(x) dx = \int_0^{0.5} x dx = \left.\frac{x^2}{2}\right|_0^{0.5} = 0.125.

As another (more difficult) example, the probability that X is between 0.5 and 1.3 is

P(0.5 ≤ X ≤ 1.3) = \int_{0.5}^{1.3} fX(x) dx = \int_{0.5}^{1} x dx + \int_{1}^{1.3} (2 − x) dx = \left.\frac{x^2}{2}\right|_{0.5}^{1} + \left.\frac{-(2-x)^2}{2}\right|_{1}^{1.3} = \left(\frac{1}{2} − \frac{1}{8}\right) + \left(\frac{-0.49}{2} + \frac{1}{2}\right) = 0.63.
Figure 10.4
Probability density function of a triangular random variable
A new function tri_pdf is defined since R does not have a built-in function for the pdf of a triangular distribution.
The function tri_pdf returns x when x is between 0 and 1, 2-x when x is between 1 and 2, and 0 otherwise.
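A version of tri_pdf consistent with that description (a sketch; the companion-website script may differ in styling) is:
tri_pdf <- function(x) {
  ifelse(x >= 0 & x <= 1, x, ifelse(x > 1 & x <= 2, 2 - x, 0))
}
x <- seq(-1, 3, by=0.01)
plot(x, tri_pdf(x), type="l", xlab="x", ylab=expression(f[X](x)))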
Definition 10.3 The cumulative distribution function (cdf) of a continuous random variable X, denoted FX (·), gives
the probability that X is less than or equal to any argument x0 of FX (·):
FX(x0) = P(X ≤ x0) = \int_{-∞}^{x_0} fX(x) dx.
Like the cdf for a discrete random variable (Definition 8.4), the cdf for a continuous random variable is equal to
P(X ≤ x0 ). Rather than the discrete summation used for a discrete random variable, the cdf of a continuous random
variable involves integration of the pdf fX (·) from –∞ to the argument x0 . The properties of the cdf of a continuous
random variable are given in the following proposition:
Proposition 10.2. The cumulative distribution function FX(·) of a continuous random variable X has the following
properties:
(i) 0 ≤ FX(x0) ≤ 1 for every x0
(ii) x0 < x1 ⟹ FX(x0) ≤ FX(x1)
(iii) At every x0 for which the derivative F′X(x0) exists, fX(x0) = F′X(x0).
Property (i) follows from the fact that FX (x0 ) = P(X ≤ x0 ) is a probability and, therefore, must be between zero and
one (inclusive). Property (ii) says that the cdf FX (·) is a weakly increasing function. While it may stay level on certain
intervals of the real line, the cdf can never decrease as x increases. Property (iii), which is a re-statement of the “first
fundamental theorem of calculus,” provides an approach to derive the pdf fX (·) if the cdf FX (·) is known.
If the cdf FX (·) is completely known, determining the probabilities of intervals is considerably simplified as
compared to the case when only the pdf fX (·) is known. When only fX (·) is known, we generally must calculate the
integral associated with an interval, whereas integration is unnecessary if the cdf FX (·) is known. Using the cdf, to
determine the probability that X is less than a value a,
P(X ≤ a) = P(X < a) = FX (a).
To determine the probability that X is greater than a value a,
P(X ≥ a) = P(X > a) = 1 – FX (a).
And to determine the probability that X is between two values a and b, where a < b,
P(a ≤ X ≤ b) = P(a < X ≤ b) = P(a ≤ X < b) = P(a < X < b) = FX (b) – FX (a).
Figure 10.5 shows an example of a pdf fX (·) and its associated cdf FX (·). The top graph shows the pdf along with
shading corresponding to the area representing P(X ≤ a). The middle graph shows the same pdf along with shading
corresponding to the area representing P(X ≤ b). The bottom graph shows the cdf associated with the pdf in the two
graphs above. The cdf starts at zero and increases, eventually approaching one. The probability P(a ≤ X ≤ b) can be
determined by the difference FX (b) – FX (a), where the two cdf values are read off the y-axis. In terms of the two pdf’s,
the probability P(a ≤ X ≤ b) is the difference between the area to the left of b (middle graph) and the area to the left of
a (top graph) or, equivalently, the area under the pdf curve between a and b.
Example 10.3 (Uniform distribution) Recall from Example 10.1 that the pdf of the uniform random variable X ∼
U(0, 1) is

fX(x) = \begin{cases} 1 & \text{if } 0 ≤ x ≤ 1 \\ 0 & \text{otherwise} \end{cases}

To determine the cdf FX(·) from the pdf fX(·), the integral FX(x0) = P(X ≤ x0) = \int_{-∞}^{x_0} fX(x) dx must be evaluated for all
possible values of x0. There are three cases that can be treated separately (x0 ≤ 0, 0 < x0 ≤ 1, and x0 > 1):
x0 ≤ 0: FX(x0) = \int_{-∞}^{x_0} fX(x) dx = \int_{-∞}^{x_0} 0 dx = 0
Figure 10.5
Example of a cumulative distribution function
0 < x0 ≤ 1: FX(x0) = \int_{-∞}^{x_0} fX(x) dx = \int_{0}^{x_0} 1 dx = x\big|_0^{x_0} = x0

x0 > 1: FX(x0) = \int_{-∞}^{x_0} fX(x) dx = \int_{0}^{1} 1 dx = x\big|_0^{1} = 1
Putting these results together, the cdf of X is

FX(x) = \begin{cases} 0 & \text{if } x ≤ 0 \\ x & \text{if } 0 < x ≤ 1 \\ 1 & \text{if } x > 1 \end{cases}
Figure 10.6 shows this cdf for X ∼ U(0, 1). The pdf function is a constant on the (0, 1) interval for the uniform random
variable U(0, 1), which yields a cdf function that is linear on the (0, 1) interval due to the integration of the pdf.
In Example 10.1, P(0.2 ≤ X ≤ 0.5) = 0.3 was determined by integrating pdf fX (·) from 0.2 to 0.5. Using the cdf FX (·),
P(0.2 ≤ X ≤ 0.5) = FX (0.5) – FX (0.2) = 0.5 – 0.2 = 0.3. The answer can also be verified in R:
punif(0.5)-punif(0.2)
## [1] 0.3
Since the default for punif is the U(0, 1) distribution, we do not specify the endpoints of the uniform distribution.
Figure 10.6
Cumulative distribution function for a U(0, 1) random variable
Example 10.4 (Triangular distribution) From Example 10.2, the pdf of the triangular random variable X is

fX(x) = \begin{cases} x & \text{if } 0 ≤ x ≤ 1 \\ 2 − x & \text{if } 1 < x ≤ 2 \\ 0 & \text{otherwise} \end{cases}

To determine the cdf, the same approach as Example 10.3 can be used, except that there are now four different cases
to consider (x0 < 0, 0 ≤ x0 ≤ 1, 1 < x0 ≤ 2, and x0 > 2):
x0 < 0: FX(x0) = \int_{-∞}^{x_0} fX(x) dx = \int_{-∞}^{x_0} 0 dx = 0

0 ≤ x0 ≤ 1: FX(x0) = \int_{-∞}^{x_0} fX(x) dx = \int_{0}^{x_0} x dx = \left.\frac{x^2}{2}\right|_0^{x_0} = \frac{x_0^2}{2}

1 < x0 ≤ 2: FX(x0) = \int_{-∞}^{x_0} fX(x) dx = \int_{0}^{1} x dx + \int_{1}^{x_0} (2 − x) dx = \left.\frac{x^2}{2}\right|_0^{1} + \left.\frac{-(2-x)^2}{2}\right|_1^{x_0} = 1 − \frac{(2 - x_0)^2}{2}

x0 > 2: FX(x0) = \int_{-∞}^{x_0} fX(x) dx = \int_{0}^{1} x dx + \int_{1}^{2} (2 − x) dx = \left.\frac{x^2}{2}\right|_0^{1} + \left.\frac{-(2-x)^2}{2}\right|_1^{2} = 1
Figure 10.7
Cumulative distribution function for a triangular random variable
Figure 10.7 shows the cdf for this triangular distribution. The pdf function is linear on the (0, 1) and (1, 2) intervals
for this random variable, which yields a cdf function that is quadratic on the (0, 1) and (1, 2) intervals due to the
integration of the pdf. Here is the R code to create Figure 10.7, with a new function tri_cdf defined for the cdf of
the triangular distribution:
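A minimal sketch consistent with the cdf derived above (the companion-website script may differ in styling):
tri_cdf <- function(x) {
  ifelse(x < 0, 0, ifelse(x <= 1, x^2/2, ifelse(x <= 2, 1 - (2-x)^2/2, 1)))
}
x <- seq(-1, 3, by=0.01)
plot(x, tri_cdf(x), type="l", xlab="x", ylab=expression(F[X](x)))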
In Example 10.2, the probability P(0.5 ≤ X ≤ 1.3) was determined by integrating the pdf fX(·) from 0.5 to 1.3.
Alternatively, using the complete cdf FX(·) specified above,

P(0.5 ≤ X ≤ 1.3) = FX(1.3) − FX(0.5) = \left(1 − \frac{(2-1.3)^2}{2}\right) − \frac{0.5^2}{2} = \frac{1.51}{2} − \frac{0.25}{2} = 0.63.
tri_cdf(1.3)-tri_cdf(0.5)
## [1] 0.63
integrate(dunif,0.2,0.8)
## 0.6 with absolute error < 0.0000000000000067
integrate(dunif,-Inf,0.8)
## 0.7999995 with absolute error < 0.0000013
integrate(tri_pdf,0.5,1.3)
## 0.63 with absolute error < 0.000000000000007
These examples illustrate that the integrate function can be used with either a built-in R function (like dunif)
or a user-defined function (like tri_pdf). The first integrate command integrates the pdf of X ∼ U(0, 1) random
variable (dunif) from 0.2 to 0.8, which is equivalent to FX (0.8) – FX (0.2). The second integrate command
uses a lower limit of –∞, so it is equivalent to the cdf FX (0.8). The third integrate command provides an
alternative method of determining P(0.5 ≤ X ≤ 1.3) when X has a triangular distribution. Rather than determining
the cdf function and coding it, as we did with tri_cdf above, the pdf tri_pdf is integrated directly here. For
each use of integrate, R returns the result of the numerical integration along with the phrase “with absolute
error < ...” Since it’s performing numerical integration, the function integrate is only approximating the
value of the integral, but the very small absolute errors indicate the approximations are very accurate. (The desired
accuracy can be specified as an argument of integrate. For this and other options, refer to the R documentation for
integrate.)
As seen in the R output above, the integrate function does more than simply return the numeric value of the
integral. If the value of the integral needs to be stored in a variable and/or used in subsequent calculations, this value
can be accessed directly by appending $value after the integrate function.
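For instance, to store the probability P(0.5 ≤ X ≤ 1.3) for the triangular distribution in a variable:
p <- integrate(tri_pdf,0.5,1.3)$value
p
## [1] 0.63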
Definition 10.4 For any q, with 0 < q < 1, the population quantile τX,q of a continuous random variable X is the value
for which
P(X ≤ τX,q ) = FX (τX,q ) = q.
A special case is the population median (τX,0.5 or τX,1/2 ) of a continuous random variable X, with
P(X ≤ τX,0.5 ) = FX (τX,0.5 ) = 0.5.
Figure 10.8
Cumulative distribution function for a mixture random variable
Let’s say that we are interested in the population 80% quantile τX,0.8 of X, where X is a continuous random variable.
τX,0.8 is the value for which there is a probability of 80% that X is less than or equal to τX,0.8 : P(X ≤ τX,0.8 ) = FX (τX,0.8 ) =
0.8. If we had a graph of the cdf FX (·), finding this population 80% quantile would involve drawing a horizontal line
at 0.8 and finding the x value where this line hits the cdf function. The top graph of Figure 10.9 shows an example
of a cdf, where the population 80% quantile (τX,0.8 ) and population median (τX,0.5 ) are shown as the values where the
horizontal lines at 0.8 and 0.5, respectively, cross the cdf curve. The corresponding pdf is shown in the bottom graph
of Figure 10.9, with the population 80% quantile τX,0.8 and the population median τX,0.5 indicated. For the population median τX,0.5,
the area under the pdf to the left of τX,0.5 is equal to 0.5. For the population 80% quantile τX,0.8 , the area under the pdf
to the left of τX,0.8 is equal to 0.8.
Just as the sample interquartile range IQRx = x̃0.75 – x̃0.25 is defined as the difference between the sample 75% quantile
and sample 25% quantile, the population interquartile range is defined analogously as the difference between the
75% population quantile and 25% population quantile:
Definition 10.5 The population interquartile range (population IQR) of a continuous random variable X, denoted
IQRX , is
IQRX = τX,0.75 – τX,0.25 .
The following example illustrates how population quantiles can be determined analytically when the cdf FX (·) is
known. Since it is easier to find population quantiles from the cdf, it is recommended to first find the cdf if only the
pdf is specified.
Figure 10.9
Example of population quantiles for a continuous random variable
Example 10.6 (Triangular distribution) Continuing Example 10.4, recall that the cdf of the triangular random
variable X is

FX(x) = \begin{cases} 0 & \text{if } x < 0 \\ \frac{x^2}{2} & \text{if } 0 ≤ x ≤ 1 \\ 1 − \frac{(2-x)^2}{2} & \text{if } 1 < x ≤ 2 \\ 1 & \text{if } x > 2 \end{cases}

Suppose we are interested in determining the population 40% quantile. We want to find the value τX,0.4 for which
FX(τX,0.4) = 0.4. We know we don't get a cdf value equal to 0.4 for x ≤ 0, so we move to the 0 < x ≤ 1 interval. Is there
an x ∈ (0, 1] such that \frac{x^2}{2} = 0.4? The answer is yes, with x = \sqrt{0.8}, so that τX,0.4 = \sqrt{0.8} ≈ 0.894.

Suppose we are interested in determining the population 70% quantile. We want to find the value τX,0.7 for which
FX(τX,0.7) = 0.7. We know that we need to move past the 0 < x ≤ 1 interval since the maximum cdf value, which occurs
at x = 1, is only 0.5. Moving to the 1 < x ≤ 2 interval, is there an x ∈ (1, 2] such that 1 − \frac{(2-x)^2}{2} = 0.7? The answer is yes,
with x = 2 − \sqrt{0.6}, so that τX,0.7 = 2 − \sqrt{0.6} ≈ 1.225.
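These quantiles can also be found numerically with uniroot, using the tri_cdf sketch from Example 10.4 (an illustrative approach rather than the book's script):
uniroot(function(x) tri_cdf(x)-0.4, c(0,2))$root   # approximately 0.894
uniroot(function(x) tri_cdf(x)-0.7, c(0,2))$root   # approximately 1.225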
For commonly used continuous distributions, like the uniform distribution, R provides functions to calculate
population quantiles.
• qunif(p, min=0, max=1): Returns the population quantiles of a U(min, max) random variable specified
by the argument p, which may be a single number or a vector. The optional arguments min and max have default
values of 0 and 1, respectively.
R uses the convention of having q as the first letter for functions that return population quantiles. The following code
returns the population 80% quantile of a U(0, 1) random variable and the population 50% and 75% quantiles of a
U(5, 10) random variable:
qunif(0.8)
## [1] 0.8
qunif(c(0.50,0.75),5,10)
## [1] 7.50 8.75
Definition 10.6 The population mean (or population average or expected value) of a continuous random variable
X, denoted µX or E(X), is

µX = E(X) = \int_{-∞}^{∞} x fX(x) dx.
The integral in Definition 10.6 can be thought of as a summation over very small intervals of the x values.
Specifically, consider a “slice” of the pdf function right at the value x, with height fX (x) and width dx. The area of
this slice is fX (x)dx and can be thought of as the probability that the outcome of X is within the slice. The integral
essentially sums over all of these little slices, and the expected value provides a weighted average of the x values with
weights given by the “probabilities” fX (x)dx.
Example 10.7 (Uniform distribution) The pdf of the uniform random variable X ∼ U(0, 1) is

fX(x) = \begin{cases} 1 & \text{if } 0 ≤ x ≤ 1 \\ 0 & \text{otherwise} \end{cases}

The population mean is

µX = \int_{-∞}^{∞} x fX(x) dx = \int_0^1 x dx = \left.\frac{x^2}{2}\right|_0^1 = 0.5,

as expected, since X ∼ U(0, 1) is symmetric around the center of the (0, 1) interval.
How about the general uniform distribution X ∼ U(a, b) for a < b? The pdf is

fX(x) = \begin{cases} \frac{1}{b-a} & \text{if } a ≤ x ≤ b \\ 0 & \text{otherwise} \end{cases}
Again, due to symmetry around the middle of the (a, b) interval, the population mean is the midpoint \frac{a+b}{2}, which can
be confirmed as follows:

µX = \int_{-∞}^{∞} x fX(x) dx = \int_a^b \frac{x}{b-a} dx = \left.\frac{x^2}{2(b-a)}\right|_a^b = \frac{b^2 - a^2}{2(b-a)} = \frac{a+b}{2}.
In Example 10.7, we referred to the symmetry of the uniform distribution, a concept which we formally define:
Definition 10.7 A continuous random variable X is said to have a symmetric distribution or a symmetric pdf if
there is a midpoint, call it x∗ , for which the pdf on one side of x∗ is a mirror image of the pdf on the other side of x∗ .
Mathematically, with the midpoint x∗ , a symmetric distribution is symmetric about x∗ and has
fX (x∗ – v) = fX (x∗ + v) for all v ≥ 0
or, equivalently,
FX (x∗ – v) = 1 – FX (x∗ + v) for all v ≥ 0.
Symmetric distributions have some nice properties:
Proposition 10.3. A continuous random variable X with a symmetric distribution, for which the pdf is symmetric
around the midpoint x∗ , has the following properties:
(i) the population mean and the population median are equal to each other and to x∗ :
µX = τX,0.5 = x∗
(ii) for any q, with 0 < q < 0.5, the population quantiles τX,q and τX,1–q are equidistant from the midpoint x∗ :
|τX,q – x∗ | = |τX,1–q – x∗ | or, equivalently, |τX,q – τX,0.5 | = |τX,1–q – τX,0.5 |
Both properties should make intuitive sense since the part of the symmetric distribution to the left of its midpoint x∗
is a mirror image of the part of the distribution to the right of its midpoint x∗. For the uniform distributions in Example
10.7, X ∼ U(0, 1) has a midpoint x∗ = 0.5 and X ∼ U(a, b) has a midpoint x∗ = \frac{a+b}{2}. For X ∼ U(0, 1), the equidistant
quantile property (property (ii)) clearly holds, as for instance the population 20% and 80% quantiles, τX,0.2 = 0.2 and
τX,0.8 = 0.8, are the same distance from the midpoint x∗ = τX,0.5 = 0.5.
As was the case for discrete random variables, it is possible that the population mean or expected value of a
continuous random variable is not well-defined, as illustrated by the following example:
Example 10.8 (Infinite expected value) Consider the following pdf for X:

fX(x) = \begin{cases} \frac{1}{x^2} & \text{if } x > 1 \\ 0 & \text{otherwise} \end{cases}

This pdf is valid since \int_1^{∞} \frac{1}{x^2} dx = \left.-\frac{1}{x}\right|_1^{∞} = 0 − (−1) = 1, but the expected value of X is infinite since

µX = \int_1^{∞} x \cdot \frac{1}{x^2} dx = \int_1^{∞} \frac{1}{x} dx = \ln(x)\big|_1^{∞} = \ln(∞) − 0 = ∞.
10.4.3 Population variance and population standard deviation

Section 8.3.2 introduced the population variance for a discrete random variable X, with possible outcomes {xk∗}_{k=1}^{K} for
finite or (countably) infinite K:

σX^2 = Var(X) = E\left((X − µX)^2\right) = \sum_k (xk∗ − µX)^2 pX(xk∗).

For a discrete random variable X, the population variance is a weighted average of the (xk∗ − µX)^2 values, for all possible
outcomes xk∗, with weights given by the true probabilities of each xk∗ in the population. As with the population mean,
this definition won't work for a continuous random variable due to the number of outcomes being uncountable and each
outcome having probability zero. Instead, an analogous definition for a continuous random variable is introduced,
replacing the summation with an integral and replacing the pmf values with pdf values:
Definition 10.8 The population variance of a continuous random variable X, denoted σX^2 or Var(X), is

σX^2 = Var(X) = E\left((X − µX)^2\right) = \int_{-∞}^{∞} (x − µX)^2 fX(x) dx.

The population standard deviation is defined as the square root of the population variance:

Definition 10.9 The population standard deviation of a continuous random variable X, denoted σX or sd(X), is

σX = sd(X) = \sqrt{σX^2} = \sqrt{\int_{-∞}^{∞} (x − µX)^2 fX(x) dx}.
Example 10.9 (Uniform distribution) In Example 10.7, the population mean of X ∼ U(0, 1) was determined to be
µX = 0.5 and, in the general case, the population mean of X ∼ U(a, b) was determined to be µX = \frac{a+b}{2}. How about
their respective population variances and population standard deviations? Starting with X ∼ U(0, 1), the population
variance is

σX^2 = \int_{-∞}^{∞} (x − µX)^2 fX(x) dx = \int_0^1 (x − 0.5)^2 (1) dx = \left.\frac{(x-0.5)^3}{3}\right|_0^1 = \frac{1}{24} − \left(-\frac{1}{24}\right) = \frac{1}{12},

and the population standard deviation is

σX = \sqrt{σX^2} = \sqrt{\frac{1}{12}} = \frac{1}{\sqrt{12}}.

For the general case, X ∼ U(a, b), the population variance is

σX^2 = \int_{-∞}^{∞} (x − µX)^2 fX(x) dx = \int_a^b \left(x − \frac{a+b}{2}\right)^2 \frac{1}{b-a} dx = \frac{1}{b-a} \left.\frac{\left(x - \frac{a+b}{2}\right)^3}{3}\right|_a^b = \frac{1}{b-a}\left(\frac{(b-a)^3}{24} − \frac{(a-b)^3}{24}\right) = \frac{(b-a)^2}{12},

and the population standard deviation is

σX = \sqrt{σX^2} = \sqrt{\frac{(b-a)^2}{12}} = \frac{b-a}{\sqrt{12}}.
Example 10.10 (Triangular distribution) The triangular distribution (Example 10.2) has pdf

fX(x) = \begin{cases} x & \text{if } 0 ≤ x ≤ 1 \\ 2 − x & \text{if } 1 < x ≤ 2 \\ 0 & \text{otherwise} \end{cases}
From the graph of the pdf fX (·) in Figure 10.4, the triangular distribution appears to be symmetric with midpoint x∗ = 1.
We can confirm that it’s a symmetric distribution by checking that fX (x∗ – v) = fX (x∗ + v) for all v ≥ 0. First, for any v
such that 0 ≤ v ≤ 1, fX (1 – v) = 1 – v since 0 ≤ 1 – v ≤ 1 and fX (1 + v) = 2 – (1 + v) = 1 – v since 1 ≤ 1 + v ≤ 2. Second, for
any v > 1, fX (1 – v) = fX (1 + v) = 0 since 1 – v < 0 and 1 + v > 2. Therefore, X has a symmetric distribution, meaning its
population mean is equal to the value of the midpoint, µX = x∗ = 1.
The population variance of X is

σX^2 = \int_{-∞}^{∞} (x − µX)^2 fX(x) dx = \int_0^1 (x − 1)^2 x dx + \int_1^2 (x − 1)^2 (2 − x) dx = \frac{1}{12} + \frac{1}{12} = \frac{1}{6},

and the population standard deviation of X is

σX = \sqrt{σX^2} = \sqrt{\frac{1}{6}} = \frac{1}{\sqrt{6}}.
10.4.4 Using integration in R to calculate population statistics

Section 10.3.1 discussed numerical integration in R, using the integrate function. Since population means,
variances, and standard deviations are each defined in terms of an integral, we can use numerical integration to calculate
their values if the pdf fX(·) is known, providing an alternative to analytic derivation of the integral. For example, the
population mean is equal to \int_{-∞}^{∞} x fX(x) dx, which can be evaluated by providing a function that calculates x fX(x) as the
first argument of integrate and −∞ and ∞ as the second and third arguments. Here is an example for a U(0, 1)
random variable:
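(A minimal sketch consistent with the description above; the companion-website script may differ.)
# Standard call: reports the value along with its accuracy
integrate(function(x) x*dunif(x), -Inf, Inf)
# Appending $value keeps just the numerical value of the integral
mu <- integrate(function(x) x*dunif(x), -Inf, Inf)$value
mu
## [1] 0.5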
The numerical integration gives the correct answer µX = 0.5 for X ∼ U(0, 1). Two slightly different versions of the
integrate command are used. The first, a standard call to integrate, reports the value along with the accuracy.
The second, which appends the syntax $value, provides just the numerical value of the integral. This
version is useful when we need to use the numerical value of the integral in further calculations, as seen below for the
population variance.

The population variance is equal to \int_{-∞}^{∞} (x − µX)^2 fX(x) dx, so numerical integration requires a function that returns
(x − µX)^2 fX(x) to be used as the first argument of integrate. Here is an example for a U(0, 1) random variable:
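(Again a minimal sketch, reusing the mu value computed above.)
mu <- integrate(function(x) x*dunif(x), -Inf, Inf)$value
integrate(function(x) (x-mu)^2*dunif(x), -Inf, Inf)   # approximately 0.08333333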
The numerical integration provides the correct answer σX^2 = 1/12 ≈ 0.083333 for X ∼ U(0, 1).
There’s nothing inherently special about the uniform distribution in these R code examples. To calculate the
population mean and population variance for the triangular distribution, for example, the density function tri_pdf
that was defined in Example 10.2 can be substituted in the code wherever dunif had appeared.
(iv) (population quantiles) τY,q = a + bτX,q if b ≥ 0 and X is a continuous random variable; τY,q = a + bτX,1–q if b < 0
and X is a continuous random variable
(v) (population IQR) IQRY = |b|IQRX if X is a continuous random variable
Parts (i) through (iii) are identical to the results from Section 8.5.1 for discrete random variables. The additive
constant a affects the population mean but not the population variance or standard deviation, whereas the scaling
constant b affects all three quantities. Part (iv) states that, when b is positive, the population quantiles for the linear
combination Y (τY,q ) are the same linear function of the population quantiles of X (τX,q ); when b is negative, in which
case we can think of the distribution being flipped around before being scaled, the linear function is applied to τX,1–q
rather than τX,q . Part (v) is analogous to the result for the sample IQR, where we had IQRy = |b|IQRx for the linear
transformation y = a + bx of the variable x.
Example 10.11 (Annualized earnings) If X represents the weekly earnings for an employed individual from the
population, the linear transformation Y = 52X is a random variable that represents the annualized earnings for an
employed individual from the population. Similar to the results for descriptive statistics in Example 6.29, Proposition
10.4 can be applied here to get µY = 52µX , σY2 = 522 σX2 = 2704σX2 , σY = 52σX , τY,q = 52τX,q for any q ∈ (0, 1), and
IQRY = 52IQRX .
Example 10.12 (Uniform distribution) Let X ∼ U(0, 1) be a uniform random variable on the interval (0, 1). For
constants a and b, where a < b, if the random variable Y is defined as the linear transformation

Y = a + (b − a)X,

the population mean of Y is

µY = a + (b − a)µX = a + \frac{b-a}{2} = \frac{a+b}{2},

the population variance of Y is

σY^2 = (b − a)^2 σX^2 = \frac{(b-a)^2}{12},

and the population standard deviation of Y is

σY = |b − a|σX = \frac{b-a}{\sqrt{12}}.
Recall from Examples 10.7 and 10.9 that these quantities are the same population mean, variance, and standard
deviation derived for the U(a, b) distribution. In fact, Y ∼ U(a, b) can be shown as follows:
FY(y) = P(Y ≤ y) = P(a + (b − a)X ≤ y) = P\left(X ≤ \frac{y-a}{b-a}\right) = \frac{y-a}{b-a} for y ∈ (a, b),

which implies that

fY(y) = F′Y(y) = \frac{1}{b-a} for y ∈ (a, b),

by applying part (iii) of Proposition 10.2. Thus, the pdf of Y is exactly the same as the pdf of a U(a, b) random variable.
This example shows that the linear transformation of a uniform U(0, 1) random variable is also a uniform random
variable. Using this fact to derive the population mean, variance, and standard deviation is much easier than the
work required in Example 10.9 to derive those quantities directly. And, any uniform random variable can be written as
some linear transformation of the U(0, 1) random variable. For instance, to get Y ∼ U(–2, 5), the linear transformation
Y = –2 + 7X can be used for X ∼ U(0, 1).
A specific linear transformation that is often quite useful is the transformation that “standardizes” a random variable
X to have population mean equal to zero and population variance (and population standard deviation) equal to one.
This standardized random variable is formally defined as follows, where we do not restrict the type of random
variable:
Definition 10.10 For a random variable X, a standardized random variable Z is constructed by “de-meaning” the random variable X and then dividing by its standard deviation:
Z = (X – µX)/σX.
From the definition, Z is a linear transformation of X with additive constant a = –µX/σX and scaling constant b = 1/σX. Using the results from Proposition 10.4,
µZ = a + bµX = –µX/σX + (1/σX)µX = 0,
σZ² = b²σX² = (1/σX)²σX² = 1,
and
σZ = |b|σX = (1/σX)σX = 1.
Therefore, the standardized random variable Z has population mean equal to zero and population variance and standard
deviation both equal to one. Moreover, Z is unitless since both its numerator X – µX and its denominator σX are in the
units of X. The random variable Z can be interpreted as the number of population standard deviations that X is from
its population mean µX , with negative Z corresponding to X being below µX and positive Z corresponding to X being
above µX . For instance, Z = –1.5 indicates that X is 1.5 standard deviations below µX (that is, X = µX – 1.5σX ), and
Z = 2.7 indicates that X is 2.7 standard deviations above µX (that is, X = µX + 2.7σX ).
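As a quick numerical illustration (a sketch, with our own choice of X), the following code standardizes draws of X ∼ U(60, 100), which has µX = 80 and σX = 40/√12, and confirms that the resulting draws of Z have sample mean near zero and sample standard deviation near one:
set.seed(1234)
x <- runif(100000, 60, 100)          # X ~ U(60,100)
z <- (x - 80)/(40/sqrt(12))          # Z = (X - mu_X)/sigma_X
mean(z)                              # approximately 0
sd(z)                                # approximately 1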
Definition 10.11 The joint probability density function (joint pdf) of continuous random variables X and Y is a
function fXY (·, ·) such that, for any two numbers a and b with a ≤ b and any two numbers c and d with c ≤ d,
P(a ≤ X ≤ b, c ≤ Y ≤ d) = ∫_c^d ∫_a^b fXY(x, y) dx dy.
Whereas the probability of an interval for a single continuous random variable is a single integral, the joint
probability of X and Y being in their respective intervals is a double integral. The inner integral integrates over the
values x between a and b, while the outer integral integrates over the values y between c and d. Since X and Y are
continuous random variables, the joint probability of any specific outcome is equal to zero; that is, P(X = x, Y = y) = 0 for
any (x, y). Like the marginal pdf, the joint pdf will be non-negative for all (x, y) and integrate to one. These properties,
along with the relationship of the joint pdf to the marginal pdf’s, are stated in the following proposition:
Proposition 10.5. For continuous random variables X and Y, the joint probability density function fXY (·, ·) has the
following properties:
(i) fXY(x, y) ≥ 0 for all (x, y)
(ii) ∫_{–∞}^{∞} ∫_{–∞}^{∞} fXY(x, y) dx dy = 1
(iii) fX(x) = ∫_{–∞}^{∞} fXY(x, y) dy for all x
(iv) fY(y) = ∫_{–∞}^{∞} fXY(x, y) dx for all y
Property (iii) states that the marginal pdf of X, evaluated at x, is obtained by fixing x and integrating the joint pdf
over all values of y. This relationship is analogous to the case of discrete random variables, where the marginal pmf of
X, evaluated at a possible outcome xk*, is obtained by fixing xk* and summing the joint pmf over all possible yℓ* values.
Similarly, property (iv) states that the marginal pdf of Y, evaluated at y, is obtained by fixing y and integrating the joint
pdf over all values of x.
Example 10.13 (Unrelated uniform random variables) Suppose X and Y are continuous random variables with joint
pdf
fXY(x, y) =
2 cases: 1 if 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, and 0 otherwise.
It can be confirmed that the joint pdf integrates to one:
∫_{–∞}^{∞} ∫_{–∞}^{∞} fXY(x, y) dx dy = ∫_0^1 ∫_0^1 1 dx dy = ∫_0^1 1 dy = 1.
The double integral ∫_0^1 ∫_0^1 1 dx dy can be interpreted as the volume of a rectangular solid, with height 1 (the constant fXY(·, ·) value) and rectangular sides of 1 (for the range of x) and 1 (for the range of y).
For the marginal pdf of X, when 0 ≤ x ≤ 1,
fX(x) = ∫_{–∞}^{∞} fXY(x, y) dy = ∫_0^1 1 dy = 1,
and otherwise (when x < 0 or x > 1),
fX(x) = ∫_{–∞}^{∞} fXY(x, y) dy = ∫_{–∞}^{∞} 0 dy = 0.
Therefore, X is a uniform random variable, X ∼ U(0, 1). Similarly, it can be shown that Y ∼ U(0, 1).
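The claim that the joint pdf integrates to one can also be checked numerically. The following sketch (our own code, not from the companion website) approximates the double integral with a Riemann sum over a fine grid:
f <- function(x, y) ifelse(x >= 0 & x <= 1 & y >= 0 & y <= 1, 1, 0)
grid <- seq(0.0005, 0.9995, by=0.001)   # midpoints of 1000x1000 cells
sum(outer(grid, grid, f)) * 0.001^2     # approximately 1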
Example 10.14 (Related uniform random variables) Suppose X and Y are continuous random variables with joint pdf
fXY(x, y) =
2 if 0 ≤ x ≤ 0.5 and 0 ≤ y ≤ 0.5
2 if 0.5 < x ≤ 1 and 0.5 < y ≤ 1
0 otherwise
In this example, the values of X and Y are related to each other, which is not the case in Example 10.13. Here, X can only be in the range (0, 0.5) when Y is in the range (0, 0.5)
and vice versa, and X can only be in the range (0.5, 1) when Y is in the range (0.5, 1) and vice versa.
The following definition generalizes the definition of the joint cdf, introduced in Chapter 8 for discrete random
variables (Definition 8.9), to all types of random variables.
Definition 10.12 The joint cumulative distribution function (joint cdf) of two random variables X and Y, denoted
FXY (·, ·), gives the probability that both X and Y are less than or equal to their corresponding arguments:
FXY (x0 , y0 ) = P(X ≤ x0 , Y ≤ y0 ).
When X and Y are both continuous random variables, the joint cdf can be written in terms of the joint pdf:
FXY(x0, y0) = ∫_{–∞}^{y0} ∫_{–∞}^{x0} fXY(x, y) dx dy.
Regardless of the types of the two random variables X and Y, the joint cdf has the same properties that were stated
in Chapter 8.
For the related uniform random variables of Example 10.14, the joint cdf can be determined by calculating the double integral from Definition 10.12 for several different regions based upon the (x, y) values. The joint cdf, whose derivation is left as an exercise (Exercise 10.13), is
FXY(x, y) =
0 if x < 0 or y < 0
2xy if 0 ≤ x ≤ 0.5 and 0 ≤ y ≤ 0.5
0.5 + 2(x – 0.5)(y – 0.5) if 0.5 < x ≤ 1 and 0.5 < y ≤ 1
x if 0 ≤ x ≤ 0.5 and y > 0.5
y if x > 0.5 and 0 ≤ y ≤ 0.5
x if 0.5 < x ≤ 1 and y > 1
y if x > 1 and 0.5 < y ≤ 1
1 if x > 1 and y > 1
Definition 10.13 For a continuous random variable X and a random variable Y, the conditional probability density function (conditional pdf) of X given Y is denoted fX|Y(·|·). The function fX|Y(·|y) is the pdf associated with X when Y = y; when X and Y are both continuous random variables, it is given by
fX|Y(x|y) = fXY(x, y)/fY(y) for all (x, y) such that fY(y) > 0.
Definition 10.14 For random variables X and Y, the conditional cumulative distribution function (conditional cdf) of X given Y, denoted FX|Y(·|·), is
FX|Y(x|y) = P(X ≤ x|Y = y) for all (x, y).
If X and Y are both continuous random variables,
FX|Y(x|y) = P(X ≤ x|Y = y) = (∫_{–∞}^x fXY(v, y) dv)/fY(y) for all (x, y) such that fY(y) > 0.
We consider three examples, the first having a discrete X and continuous Y and the second and third having X and
Y both continuous.
Example 10.16 (Data analyst salaries) A large firm employs many data analysts, some of whom have graduate
degrees. Let X ∈ {0, 1} be an indicator of whether a data analyst has a graduate degree (X = 1 for graduate degree,
X = 0 for no graduate degree). Suppose the salaries (in thousands of dollars) of data analysts without a graduate
degree can be modeled as a U(60, 100) random variable and salaries of data analysts with a graduate degree can be
modeled as a U(90, 210) random variable. In this case, X is discrete and has only two possible values, leading to two
conditional pdf's of interest:
fY|X(y|0) = 1/40 if 60 ≤ y ≤ 100, and 0 otherwise,
and
fY|X(y|1) = 1/120 if 90 ≤ y ≤ 210, and 0 otherwise.
The associated conditional cdf's are
FY|X(y|0) =
0 if y < 60
(y – 60)/40 if 60 ≤ y ≤ 100
1 if y > 100
and
FY|X(y|1) =
0 if y < 90
(y – 90)/120 if 90 ≤ y ≤ 210
1 if y > 210
Example 10.17 (Unrelated uniform random variables) Continuing Example 10.13, the conditional pdf of X given Y
is defined for y such that 0 ≤ y ≤ 1, for which
fX|Y(x|y) = fXY(x, y)/fY(y) = fXY(x, y)/1 = 1 if 0 ≤ x ≤ 1, and 0 otherwise.
The second equality follows from the marginal distribution Y ∼ U(0, 1), for which fY (y) = 1 when 0 ≤ y ≤ 1. Thus,
the conditional distribution of X given Y = y, for 0 ≤ y ≤ 1, is the uniform distribution U(0, 1), which is the same as the
marginal distribution of X.
Similarly, the conditional pdf of Y given X is defined for x such that 0 ≤ x ≤ 1, for which
fY|X(y|x) = fXY(x, y)/fX(x) = fXY(x, y)/1 = 1 if 0 ≤ y ≤ 1, and 0 otherwise.
Thus, the conditional distribution of Y given X = x, for 0 ≤ x ≤ 1, is the uniform distribution U(0, 1), which is the same
as the marginal distribution of Y.
Example 10.18 (Related uniform random variables) Continuing Example 10.14, the conditional pdf of X given Y is
defined for y such that 0 ≤ y ≤ 1. Recall that the marginal distribution of Y is the U(0, 1) distribution, with fY (y) = 1 for
0 ≤ y ≤ 1. For y values such that 0 ≤ y ≤ 0.5, the conditional pdf of X given Y is
fX|Y(x|y) = fXY(x, y)/fY(y) = 2 if 0 ≤ x ≤ 0.5, and 0 otherwise,
and for y values such that 0.5 < y ≤ 1, the conditional pdf of X given Y is
fX|Y(x|y) = fXY(x, y)/fY(y) = 2 if 0.5 < x ≤ 1, and 0 otherwise.
Thus, the conditional distribution of X given Y is U(0, 0.5) when 0 ≤ y ≤ 0.5 and U(0.5, 1) when 0.5 < y ≤ 1. Similarly,
it can be shown that the conditional distribution of Y given X is U(0, 0.5) when 0 ≤ x ≤ 0.5 and U(0.5, 1) when 0.5 <
x ≤ 1.
Since a conditional pdf is itself a pdf, it is natural to introduce conditional versions of the population descriptive
statistics, including the mean, variance, and standard deviation:
Definition 10.15 The population conditional mean or conditional expectation of a continuous random variable X
given Y = y, denoted µX|Y=y, is
µX|Y=y = E(X|Y = y) = ∫_{–∞}^{∞} x fX|Y(x|y) dx.
Definition 10.16 The population conditional variance of a continuous random variable X given Y = y, denoted σ²X|Y=y, is
σ²X|Y=y = Var(X|Y = y) = ∫_{–∞}^{∞} (x – µX|Y=y)² fX|Y(x|y) dx.
Definition 10.17 The population conditional standard deviation of a continuous random variable X given Y = y, denoted σX|Y=y, is
σX|Y=y = sd(X|Y = y) = √(σ²X|Y=y).
As with the definitions of the conditional pdf and the conditional cdf, the roles of Y and X can be reversed in
these definitions (when Y is a continuous random variable).
Example 10.19 (Data analyst salaries) Continuing Example 10.16, where the conditional distribution of salaries for
data analysts with no graduate degree (Y|X = 0) is U(60, 100) and the conditional distribution of salaries for data
analysts with a graduate degree (Y|X = 1) is U(90, 210), the conditional expectations and conditional variances are
E(Y|X = 0) = (60 + 100)/2 = 80, Var(Y|X = 0) = (100 – 60)²/12 ≈ 133.33,
E(Y|X = 1) = (90 + 210)/2 = 150, Var(Y|X = 1) = (210 – 90)²/12 = 1200.
Thus, as expected from a comparison of the two conditional distributions, the expected value and variance of salaries
for non-graduate-degree data analysts are both lower than they are for graduate-degree data analysts.
Example 10.20 (Related uniform random variables) Continuing Example 10.14, the conditional expectation of X
given Y = y is
E(X|Y = y) = ∫_0^0.5 (x)(2) dx = 0.25
when 0 ≤ y ≤ 0.5, and
E(X|Y = y) = ∫_0.5^1 (x)(2) dx = 0.75
when 0.5 < y ≤ 1. The conditional variance of X given Y = y is
Var(X|Y = y) = ∫_0^0.5 (x – 0.25)²(2) dx = [(2/3)(x – 0.25)³]_0^0.5 = 1/48
when 0 ≤ y ≤ 0.5, and
Var(X|Y = y) = ∫_0.5^1 (x – 0.75)²(2) dx = [(2/3)(x – 0.75)³]_0.5^1 = 1/48.
In fact, in this simple example, it would have been possible to determine the conditional expectations and conditional
variances without calculating the integrals. Since we already knew that the conditional distribution X|Y = y is U(0, 0.5)
when 0 ≤ y ≤ 0.5 and U(0.5, 1) when 0.5 < y ≤ 1, the conditional expectations are just the midpoints of the uniform
distributions (0.25 and 0.75, respectively) and the conditional variances are both equal to 0.5²/12 = 1/48 from the general variance formula (b – a)²/12 for a U(a, b) distribution.
When information is available about the conditional distributions of a random variable, and therefore its conditional
expectations, we can determine the (unconditional) expected value of that random variable. This idea is analogous
to the Law of Total Probability from Chapter 3, where the unconditional probability is equal to a weighted sum
of conditional probabilities. The following proposition provides several results for expressing the (unconditional)
expected value of a random variable in terms of conditional expectations:
Proposition 10.6. (Expected value in terms of conditional expectations) Let Y be a random variable.
(i) If A1, A2, …, Ak are disjoint and exhaustive events, then²²
E(Y) = Σ_{j=1}^k E(Y|Aj)P(Aj).
Example 10.21 (Data analyst salaries) For Example 10.16, where the conditional distribution of salaries for data
analysts with no graduate degree (Y|X = 0) is U(60, 100) and the conditional distribution of salaries for data analysts
with a graduate degree (Y|X = 1) is U(90, 210), let π = P(X = 1) denote the probability that a data analyst has a
graduate degree. Then, the expected value of salary is related to the two conditional expectations of salaries as follows:
E(Y) = E(Y|X = 0) · P(X = 0) + E(Y|X = 1) · P(X = 1)
= ((60 + 100)/2)(1 – π) + ((90 + 210)/2)π = 80 + 70π.
For instance, if π = 0.2 or 20%, the expected value of salary is E(Y) = 94 or $94,000.
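A simulation sketch of this calculation for π = 0.2 (our own code; variable names are ours):
set.seed(1234)
n <- 100000
grad <- runif(n) < 0.2                            # X = 1 with probability 0.2
salary <- ifelse(grad, runif(n, 90, 210), runif(n, 60, 100))
mean(salary)                                      # theory: 80 + (70)(0.2) = 94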
Definition 10.18 The population covariance of continuous random variables X and Y, denoted σXY or Cov(X, Y), is
σXY = Cov(X, Y) = E[(X – µX)(Y – µY)] = ∫_{–∞}^{∞} ∫_{–∞}^{∞} (x – µX)(y – µY) fXY(x, y) dx dy.
Definition 10.19 The population correlation of continuous random variables X and Y, denoted ρXY or Corr(X, Y), is
ρXY = Corr(X, Y) = σXY/(σXσY).
The properties for the population covariance and population correlation were provided earlier in Proposition 8.4. The
population correlation ρXY is unitless, with –1 ≤ ρXY ≤ 1, and its sign is always the same as the sign of the population
covariance (sign(ρXY ) = sign(σXY )).
Example 10.22 (Related uniform random variables) We can determine the population covariance and population
correlation for the random variables X and Y introduced in Example 10.14, where the joint pdf was
fXY(x, y) =
2 if 0 ≤ x ≤ 0.5 and 0 ≤ y ≤ 0.5
2 if 0.5 < x ≤ 1 and 0.5 < y ≤ 1
0 otherwise
We have used µX = µY = 0.5 and σX = σY = 1/√12 since X and Y are U(0, 1) random variables. The high positive
correlation indicates a strong positive association, which is expected given the specification of the joint pdf.
It should be noted that the population covariance σXY = Cov(X, Y) = E [(X – µX )(Y – µY )] is a well-defined population
concept more generally than the specific cases considered in Chapter 8 (X and Y both discrete) and this section (X and Y
both continuous). Regardless of the form of X and Y, we can think of taking many draws of (xi, yi) from the population, and the population covariance σXY is the number to which the sample covariance sxy = (1/(n – 1)) Σ_{i=1}^n (xi – x̄)(yi – ȳ) eventually converges (i.e., for very large n).²³ One particular case of interest that has not been considered thus far is the covariance between a discrete X and a continuous Y. For simplicity, let's consider a binary (Bernoulli) X ∈ {0, 1}, with X ∼ Bernoulli(π), in which case²⁴
σXY = E [(X – µX )(Y – µY )]
= E [(X – µX )(Y – µY )|X = 0] P(X = 0) + E [(X – µX )(Y – µY )|X = 1] P(X = 1)
= E [(0 – π)(Y – µY )|X = 0] (1 – π) + E [(1 – π)(Y – µY )|X = 1] π
= π(1 – π) (E [Y – µY |X = 1] – E [Y – µY |X = 0])
= π(1 – π) (E [Y|X = 1] – E [Y|X = 0]) .
The second equality follows from application of Proposition 10.6. For the third equality, we plug in the value 0 for X
when conditioning on X = 0 and the value 1 when conditioning on X = 1, and we use the fact that P(X = 1) = µX = π.
For the fourth equality, we pull the constants (–π and π, respectively) outside the conditional expectations and then simplify. Finally, for the fifth equality, the additive constants (–µY for both) are pulled out of the two conditional expectations and cancel out. Since π(1 – π) is always positive, the population covariance σXY and, thus, the population correlation ρXY are positive when E [Y|X = 1] > E [Y|X = 0], negative when E [Y|X = 1] < E [Y|X = 0], and zero when
E [Y|X = 1] = E [Y|X = 0]. This result is analogous to the discussion in Section 7.2.3, where the sample correlation
between a discrete x and a continuous y was considered.
Example 10.23 (Data analyst salaries) Continuing Example 10.16, where the conditional distribution of salaries for
data analysts with no graduate degree (Y|X = 0) is U(60, 100) and the conditional distribution of salaries for data
analysts with a graduate degree (Y|X = 1) is U(90, 210), the population covariance is
σXY = π(1 – π) (E [Y|X = 1] – E [Y|X = 0]) = π(1 – π)(150 – 80) = 70π(1 – π),
where π = P(X = 1) is the probability that a data analyst at the firm has a graduate degree. There is a positive
covariance and, therefore, a positive correlation between the graduate-degree indicator X and the salary Y since
the population (conditional) mean of salaries for graduate-degree data analysts is higher than the population
(conditional) mean of non-graduate-degree data analysts. For instance, if π = 0.20, the population covariance is
σXY = (70)(0.2)(0.8) = 11.2.
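The covariance can likewise be approximated by simulation. A sketch for π = 0.2 (our own code):
set.seed(1234)
n <- 100000
x <- 1*(runif(n) < 0.2)                           # graduate-degree indicator
y <- ifelse(x == 1, runif(n, 90, 210), runif(n, 60, 100))
cov(x, y)                                         # theory: (70)(0.2)(0.8) = 11.2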
Definition 10.20 The continuous random variables X and Y are independent if and only if
fXY (x, y) = fX (x)fY (y) for every x and y
or, equivalently,
FXY (x, y) = FX (x)FY (y) for every possible outcome pair (x, y),
where FXY (x, y) = P(X ≤ x, Y ≤ y). If this equality fails for any (x, y), then the continuous random variables X and Y
are dependent.
Independence based upon the cdf (that is, FXY (x, y) = FX (x)FY (y) for all (x, y)) is the same condition for independence
seen for discrete random variables in Definition 8.17. In fact, since the cdf is a unifying concept for all random
variables, including those that might be a mixture of both discrete and continuous outcomes, the following definition
is a general definition of independence that applies to any type of random variable.
Definition 10.21 The random variables X and Y are independent if and only if
FXY (x, y) = FX (x)FY (y) for every possible outcome pair (x, y),
where FXY (x, y) = P(X ≤ x, Y ≤ y). If this equality fails for any (x, y), then the random variables X and Y are dependent.
This definition is general in the sense that X can be discrete, continuous, or some mixture of discrete and continuous,
as can Y. For instance, X could be a binary (discrete) random variable, while Y is a uniform (continuous) random
variable.
Proposition 10.7. If the random variables X and Y are independent, the population covariance σXY and population
correlation ρXY are equal to zero (σXY = ρXY = 0). Equivalently, if the random variables X and Y have a non-zero
population covariance or correlation, X and Y are dependent.
As shown in Example 8.24, it is possible for discrete random variables to be dependent and have population
covariance/correlation equal to zero. The same is true of continuous random variables: dependent continuous random variables can have population covariance/correlation equal to zero.²⁵
A general characterization of independence can also be given in terms of conditional distributions:
Proposition 10.8. The random variables X and Y are independent if and only if:
• For any possible value y, the conditional distribution of X given Y = y is the same as the marginal distribution of X.
Mathematically, FX|Y (x|y) = FX (x) for any x and any possible value y.
• For any possible value x, the conditional distribution of Y given X = x is the same as the marginal distribution of Y.
Mathematically, FY|X (y|x) = FY (y) for any y and any possible value x.
It is sufficient to show just one of the two equivalences between the conditional distributions and the marginal distributions. For instance,
if we show that FX|Y (x|y) = FX (x) for any x and any possible value y, it is unnecessary to also show FY|X (y|x) = FY (y) for
any y and any possible value x.
As with Definition 10.21, Proposition 10.8 does not restrict the type of random variables since the results are stated
in terms of the conditional cdf. X can be discrete, continuous, or a mixture of the two, as can Y. If X and Y are both
discrete, having FX|Y (x|y) = FX (x) for any x and y is equivalent to having pX|Y (x|y) = pX (x) for all possible outcomes
(x, y). If X and Y are both continuous, having FX|Y (x|y) = FX (x) for any x and y is equivalent to having fX|Y (x|y) = fX (x)
for all possible (x, y).
Example 10.24 (Unrelated uniform random variables) In Example 10.17, for the joint pdf
fXY(x, y) = 1 if 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, and 0 otherwise,
we showed that the conditional pdf of X given Y = y (for 0 ≤ y ≤ 1) is the same as the marginal pdf of X. Also, we
showed that the conditional pdf of Y given X = x (for 0 ≤ x ≤ 1) is the same as the marginal pdf of Y. Then, from
Proposition 10.8, X and Y are independent. Since they are independent, we also know σXY = ρXY = 0 without having to calculate σXY = ∫_{–∞}^{∞} ∫_{–∞}^{∞} (x – µX)(y – µY) fXY(x, y) dx dy.
If two random variables are independent, it becomes much simpler to determine joint probabilities. For two discrete
random variables, the definition of independence itself (Definition 8.17) stated that the joint probability of an outcome
(x, y) is equal to the product of the marginal probabilities of x and y. For continuous variables, the probability of a
specific joint outcome (x, y) is zero, so instead we consider the joint probability of X being in some interval and Y
being in some other interval. The following proposition says that this joint probability is equal to the product of the
marginal probabilities of the two random variables being in their respective intervals:
Proposition 10.9. If X and Y are independent random variables, then for a ≤ b and c ≤ d,
P(a ≤ X ≤ b, c ≤ Y ≤ d) = P(a ≤ X ≤ b) · P(c ≤ Y ≤ d).
The statement of this proposition is not restricted to continuous X and Y and holds for any types of random variables.
Example 10.25 (Producing a movie) The cost of producing a movie depends upon the number of days of filming (X)
and the number of days of editing (Y), with filming costing $1 million per day and editing costing $500,000 or $0.5
million per day. Assume that X and Y are both uniformly distributed and independent, as follows:
X ∼ U(60, 80) and Y ∼ U(40, 50).
The assumption of uniform distributions allows for partial days. The marginal pdf's are
fX(x) = 0.05 if 60 ≤ x ≤ 80, and 0 otherwise, and fY(y) = 0.1 if 40 ≤ y ≤ 50, and 0 otherwise.
Since X and Y are independent, the joint pdf satisfies fXY(x, y) = fX(x)fY(y) for all (x, y), yielding
fXY(x, y) = 0.005 if 60 ≤ x ≤ 80 and 40 ≤ y ≤ 50, and 0 otherwise.
For any y such that 40 ≤ y ≤ 50, the conditional pdf of X is the same as the marginal pdf of X: fX|Y(x|y) = fX(x), which is 0.05 if 60 ≤ x ≤ 80 and 0 otherwise. Similarly, for any x such that 60 ≤ x ≤ 80, the conditional pdf of Y is the same as the marginal pdf of Y: fY|X(y|x) = fY(y), which is 0.1 if 40 ≤ y ≤ 50 and 0 otherwise. The independence also simplifies
the calculation of joint cdf probabilities. For instance, the probability that X (days of filming) is less than or equal to
70 and Y (days of editing) is less than or equal to 45 is
P(X ≤ 70, Y ≤ 45) = P(X ≤ 70)P(Y ≤ 45) = ((0.05)(10)) × ((0.1)(5)) = (0.5)(0.5) = 0.25.
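This joint probability is easy to confirm by simulation. A sketch with 100,000 draws (our own code):
set.seed(1234)
x <- runif(100000, 60, 80)                        # days of filming
y <- runif(100000, 40, 50)                        # days of editing
mean(x <= 70 & y <= 45)                           # theory: (0.5)(0.5) = 0.25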
Proposition 10.9 generalizes to more than two random variables. First, we provide a definition of independence for
multiple random variables:
Definition 10.22 The random variables X1 , X2 , …, Xm , for m ≥ 2, are independent if and only if
FX1X2···Xm(x1, x2, …, xm) = FX1(x1)FX2(x2) ··· FXm(xm) = ∏_{j=1}^m FXj(xj) for all (x1, x2, …, xm).
Definition 10.23 The random variables X1 , X2 , …, Xn are independent and identically distributed (i.i.d.) if
(i) X1 , X2 , …, Xn are independent and (ii) each Xj has the same cdf (FXj (x) = FX (x) for all j ∈ {1, 2, …, n} and all x).
This definition includes discrete random variables (Definition 9.2) as a special case but also applies to continuous
random variables and random variables that are a mixture of discrete and continuous outcomes.
Example 10.27 (Producing a movie) Continuing Example 10.25, where X ∼ U(60, 80), the number of days of filming,
and Y ∼ U(40, 50), the number of days of editing, are independent random variables. The production cost, denoted C,
is $1 million per day of filming and $0.5 million per day of editing:
C = X + 0.5Y (in millions of dollars).
The population mean of C, or the expected cost, is
µC = µX + 0.5µY = 70 + (0.5)(45) = 92.5,
or $92.5 million. The population variance of C is
σC² = σX² + (0.5)²σY² = (80 – 60)²/12 + (0.5)²(50 – 40)²/12 = 425/12,
and the population standard deviation of C is
σC = √(425/12) ≈ 5.95,
or approximately $5.95 million.
Suppose $90 million has been budgeted for the movie. What is the probability that the film goes over budget? The
probability of going over budget is
P(C > 90) = P(X + 0.5Y > 90),
so we need to find the region of possible (x, y) values for which x + 0.5y > 90. Figure 10.10 helps to visualize the
problem. The rectangle represents the range of the possible values for X and Y, with x between 60 and 80 and y
between 40 and 50. The diagonal line is the y = 2(90 – x) = 180 – 2x line, such that any (x, y) value above this line has
y > 180 – 2x or x + 0.5y > 90. Then, P(X + 0.5Y > 90) is obtained by integrating the joint pdf fXY (x, y) = 0.005 over this
region. Since the pdf is constant within the region, the problem simplifies to determining the area of the gray region in
Figure 10.10 and multiplying it by 0.005, the height of the rectangular solid (given by the pdf). The area of the gray
trapezoid is ((10 + 15)/2) · 10 = 125, so that the probability of going over budget is P(C > 90) = (125)(0.005) = 0.625 or 62.5%.
An alternative approach to calculate P(C > 90) = P(X + 0.5Y > 90) is with computer simulation. Rather than
analytically deriving the probability of the gray region in Figure 10.10, the computer simulation approximates the
probability by repeatedly making random draws of X and Y and seeing how often the linear combination X + 0.5Y is
larger than 90. The following R code approximates the over-budget probability with 100,000 simulated draws of X
and Y:
set.seed(1234)
x <- runif(100000, min=60, max=80)   # 100,000 draws of X ~ U(60,80)
y <- runif(100000, min=40, max=50)   # 100,000 draws of Y ~ U(40,50)
mean(x + 0.5*y > 90)                 # fraction of simulated costs above 90
# (the last three lines are a reconstruction of the simulation described
# in the text; the original script is available on the companion website)
The approximated over-budget probability is 62.62%, which is very close to the true probability of 62.5%.
Figure 10.10
Over-budget region for the movie example (the boundary line y = 180 – 2x crosses the rectangle of possible (x, y) values at (65, 50) and (70, 40))
For the sum V = X1 + X2 + ··· + Xm of independent random variables X1, X2, …, Xm, the population mean and population variance are
µV = µX1 + µX2 + ··· + µXm = Σ_{j=1}^m µXj
and
σV² = σ²X1 + σ²X2 + ··· + σ²Xm = Σ_{j=1}^m σ²Xj.
When V = (1/m)(X1 + X2 + ··· + Xm) is the average of independent random variables,
µV = (1/m)µX1 + (1/m)µX2 + ··· + (1/m)µXm = (1/m) Σ_{j=1}^m µXj
and
σV² = (1/m²)σ²X1 + (1/m²)σ²X2 + ··· + (1/m²)σ²Xm = (1/m²) Σ_{j=1}^m σ²Xj.
Example 10.28 (Sum of independent uniform random variables) Suppose X1 ∼ U(0, 1) and X2 ∼ U(0, 1) are
independent uniform random variables. The sum V = X1 + X2 has population mean
µV = µX1 + µX2 = 0.5 + 0.5 = 1,
population variance
σV² = σ²X1 + σ²X2 = 1/12 + 1/12 = 1/6,
and population standard deviation
σV = √(1/6) = 1/√6.
Interestingly, V = X1 + X2 is actually the triangular distribution from Example 10.2, which the interested reader can
verify (by finding FV (·) and checking that it’s the cdf of the triangular distribution). The population mean, variance,
and standard deviation are the same as those derived for the triangular distribution in Example 10.10. Knowing that
the triangular distribution V is the sum of two independent U(0, 1) random variables greatly simplifies the calculation
of these quantities, as compared to the brute-force method using the population variance formula in Example 10.10.
How about the sum of three independent U(0, 1) random variables? In this case, V = X1 + X2 + X3 , with population
mean
µV = µX1 + µX2 + µX3 = (3)(0.5) = 1.5,
population variance
σV² = σ²X1 + σ²X2 + σ²X3 = (3)(1/12) = 1/4,
and population standard deviation
σV = √(3/12) = 1/2.
This idea extends to the sum of m independent U(0, 1) random variables, V = X1 + X2 + ··· + Xm, with population mean
µV = µX1 + µX2 + ··· + µXm = (m)(0.5) = 0.5m,
population variance
σV² = σ²X1 + σ²X2 + ··· + σ²Xm = (m)(1/12) = m/12,
and population standard deviation
σV = √(m/12).
For the sum of independent U(0, 1) random variables, the population mean is m times the common population mean
of the Xj ’s, and the population variance is m times the common population variance of the Xj ’s.
If we are instead interested in the average of m independent U(0, 1) random variables, V = (1/m)(X1 + X2 + ··· + Xm), the population mean is
µV = (1/m)µX1 + (1/m)µX2 + ··· + (1/m)µXm = (1/m)(m)(0.5) = 0.5,
the population variance is
σV² = (1/m²)σ²X1 + (1/m²)σ²X2 + ··· + (1/m²)σ²Xm = (1/m²)(m)(1/12) = 1/(12m),
and the population standard deviation is σV = 1/√(12m). Figure 10.11 shows the pdf of the average V for several values of m; as m increases, the distribution concentrates around 0.5 and begins to take on a bell shape.
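A simulation sketch (our own code) for the average of m = 10 independent U(0, 1) draws:
set.seed(1234)
m <- 10
v <- replicate(10000, mean(runif(m)))   # 10,000 draws of the average
mean(v)                                 # theory: 0.5
var(v)                                  # theory: 1/(12*10) ~ 0.00833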
Figure 10.11
Probability density functions for the average of m i.i.d. U(0, 1) random variables (panels for m = 1, m = 2, m = 3, and m = 10)
Proposition 10.14. Suppose X1 , X2 , …, Xm are i.i.d. random variables with common population mean µX and
population variance σX2 .
(i) For the sum of the i.i.d. random variables, V = X1 + X2 + ··· + Xm,
µV = mµX, σV² = mσX², and σV = √m σX.
(ii) For the average of the i.i.d. random variables, V = (1/m)(X1 + X2 + ··· + Xm),
µV = µX, σV² = σX²/m, and σV = σX/√m.
In addition to the generalized results in Proposition 10.14, the tendency for the sum V = X1 + X2 + · · · + Xm or
the average V = m1 (X1 + X2 + · · · + Xm ) to look like a bell-shaped distribution as m gets larger is also a general
phenomenon that occurs for any i.i.d. random variables. This phenomenon was previously encountered in Chapter 9
for binomial random variables since a Binomial(n, π) random variable is the sum of n i.i.d. Bernoulli variables.
For example, Figure 9.3 showed a Binomial(100, 0.1) random variable with a distribution that appeared bell-shaped and approximately symmetric. Even though the pmf of a Bernoulli(0.1) is certainly not bell-shaped, the sum of m Bernoulli(0.1) random variables takes on a symmetric, bell-shaped distribution for the large m = 100 value.
The following example illustrates this same phenomenon for an asymmetric continuous distribution that is quite
different from the uniform distribution.
Example 10.29 (Asymmetric distribution) We consider a random variable with an asymmetric distribution, known as
an exponential random variable (discussed in more detail in Chapter 11). The top-left graph in Figure 10.12 shows the pdf of a random variable X that takes only positive values (x > 0). In this example, the population mean is
µX = 2. Assume that X1 , X2 , …, Xm are i.i.d. random variables with this distribution. The remaining three graphs in
Figure 10.12 show the distribution of the average V = m1 (X1 + X2 + · · · + Xm ) for three different values of m (m = 3 in
the top-right graph, m = 10 in the lower-left graph, and m = 20 in the lower-right graph). While each Xj is asymmetric,
the distribution of V becomes closer and closer to a symmetric distribution as m increases. Even with m = 3, the
distribution begins to look like a bell-shaped distribution, though it is more asymmetric than when m increases to 10
or 20. For the higher m values, the nearly symmetric distributions are approximately centered around µX = 2 and the
dispersion decreases as m is increased from 10 to 20.
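A sketch along the following lines (our own code, using rexp with rate 1/2 so that the population mean is 2) reproduces the flavor of Figure 10.12 for m = 10:
set.seed(1234)
m <- 10
v <- replicate(10000, mean(rexp(m, rate=0.5)))   # 10,000 draws of the average
mean(v)                                          # approximately 2
hist(v, breaks=50)                               # roughly bell-shaped around 2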
Figure 10.12
Probability density functions for the average of m i.i.d. exponential random variables (panels for m = 1, m = 3, m = 10, and m = 20)
Figure 10.13
Financial advisor investment fee schedule (g(X), the investment fee, plotted against X, the portfolio return)
For a customer whose annual portfolio return is represented by the random variable X, the financial advisor charges an investment fee equal to g(X) times the overall
portfolio value, where
g(X) = 0.002 + 0.013/(1 + e^(–40X)).
The function g(·) is increasing, with an asymptote of 0.002 (or 0.2%) to the left and an asymptote of 0.015 (or 1.5%)
to the right, as illustrated in Figure 10.13. Assuming that the annual portfolio return for the customer is uniformly
distributed, with X ∼ U(–0.10, 0.25), what is the cdf of the investment fee g(X)? We first determine the relevant sample
space associated with g(X). The minimum possible value and maximum possible value are
g(–0.10) = 0.002 + 0.013/(1 + e^(–40(–0.10))) and g(0.25) = 0.002 + 0.013/(1 + e^(–40(0.25))),
respectively. Then, the cdf of g(X) is
Fg(X)(v) = P(0.002 + 0.013/(1 + e^(–40X)) ≤ v) = P(1/(1 + e^(–40X)) ≤ (v – 0.002)/0.013) for g(–0.10) < v < g(0.25).
We require additional simplification to get an inequality in terms of X. For any v between g(–0.10) and g(0.25),
Fg(X)(v) = P(0.013/(v – 0.002) ≤ 1 + e^(–40X)) = P((0.015 – v)/(v – 0.002) ≤ e^(–40X)) = P(ln((0.015 – v)/(v – 0.002)) ≤ –40X),
which yields
Fg(X)(v) = P(X ≤ (1/40) ln((v – 0.002)/(0.015 – v))) = (1/0.35)((1/40) ln((v – 0.002)/(0.015 – v)) – (–0.10)).
The last equality follows from X ∼ U(–0.10, 0.25), which implies a pdf of fX(x) = 1/(0.25 – (–0.10)) = 1/0.35 for x ∈ [–0.10, 0.25].
For a continuous random variable X, Proposition 10.17 implies a particularly convenient relationship between the
population quantiles of g(X) and the population quantiles of X. The result is given in the following proposition:
Proposition 10.18. If the function g(·) is strictly increasing on the sample space associated with a continuous random
variable X, the population q-th quantile of the random variable g(X) is
τg(X),q = g(τX,q ) for any q ∈ (0, 1).
To get the q-th population quantile of g(X), we just apply the g(·) function to the q-th population quantile of X. Why
does this relationship hold? Recall that τX,q is the value for which P(X ≤ τX,q ) is equal to q. Therefore, P(g(X) ≤ g(τX,q ))
is also equal to q since g(·) is a strictly increasing function, meaning that the population q-th quantile of g(X) is g(τX,q).
For Example 10.32, where X ∼ U(0, 1), the population q-th quantile (for q ∈ (0, 1)) of √X is √q since the population q-th quantile of X is q; similarly, the population q-th quantile of X² is q² for q ∈ (0, 1).
For Example 10.33, what is the population median of the investment fee g(X)? According to Proposition 10.18,
since the population median of X is the middle of the (–0.10, 0.25) interval, which is 0.075,
τg(X),0.5 = g(τX,0.5) = g(0.075) = 0.002 + 0.013/(1 + e^(–40(0.075))) ≈ 0.01438.
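A simulation sketch confirming this median (the function name gfee is ours):
set.seed(1234)
gfee <- function(x) 0.002 + 0.013/(1 + exp(-40*x))
x <- runif(100000, -0.10, 0.25)                  # X ~ U(-0.10, 0.25)
median(gfee(x))                                  # theory: g(0.075) ~ 0.01438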
10.10 Random variables with discrete and continuous outcomes
This chapter has already alluded to random variables that are a mixture or hybrid of discrete and continuous random
variables, with the idea first introduced in Section 10.1.1 and then again in Section 10.3.2 when we discussed the
generality of the cdf. Unfortunately, while the cdf applies generally to discrete random variables, continuous random
variables, and mixtures of the two, our definitions of many population quantities have been quite different for discrete
and continuous random variables. For example, the population mean and the population variance are both defined as
summations for discrete random variables and as integrals for continuous random variables.
Although a completely unified framework that handles arbitrary mixtures of discrete and continuous random
variables is beyond the scope of this book, we briefly discuss how to determine population quantities for such mixtures.
The basic approach, which works in most cases of practical interest, is to apply summations for the discrete outcomes,
to apply integrals for the continuous outcomes, and then sum the results. This idea is best illustrated through examples.
Example 10.34 (Uniform distribution with point masses) As in Example 10.5, suppose the cdf of X is
FX(x) =
0 if x < 0
0.3 + 0.5x if 0 ≤ x < 1
1 if x ≥ 1
This cdf corresponds to probabilities P(X = 0) = 0.3 and P(X = 1) = 0.2 for the two possible discrete outcomes, which leaves a probability of 0.5 for the (uniform) continuous outcomes x ∈ (0, 1). The population mean of X is
µX = E(X) = (0)(0.3) + (1)(0.2) + (∫_0^1 x(1) dx)(0.5) = 0.45,
where the first two terms account for the two discrete outcomes and the integral accounts for the continuous outcomes.
This calculation partitions the expected value E(X) into a weighted average of conditional expectations, with
E(X) = E(X|X = 0) · P(X = 0) + E(X|X = 1) · P(X = 1) + E(X|0 < X < 1) · P(0 < X < 1),
which simplifies to
E(X) = (0)(0.3) + (1)(0.2) + (0.5)(0.5) = 0.45.
Similarly, the population variance of X is
σX² = (0 – 0.45)²(0.3) + (1 – 0.45)²(0.2) + (∫_0^1 (x – 0.45)²(1) dx)(0.5) ≈ 0.164.
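These values can be checked with a simulation sketch (our own code) that draws X = 0 with probability 0.3, X = 1 with probability 0.2, and X ∼ U(0, 1) otherwise:
set.seed(1234)
n <- 100000
u <- runif(n)
x <- ifelse(u < 0.3, 0, ifelse(u < 0.5, 1, runif(n)))
mean(x)                                          # theory: 0.45
var(x)                                           # theory: ~ 0.164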
Example 10.35 (Average hourly wages, with zeros) Suppose we are interested in the average hourly wage in a
population of teenagers, but we want to include teenagers who are not working (i.e., zero hourly wage). If 40% of
teenagers do not work and the remaining 60% have wages (in dollars) drawn from a U(8, 16) random variable, the
population mean of wages is
(0.4)(0) + (0.6) ∫_8^16 (x/8) dx = (0.6) [x²/16]_8^16 = (0.6)((256 – 64)/16) = 7.2,
or $7.20. The population variance of wages is
(0.4)(0 – 7.2)² + (0.6) ∫_8^16 ((x – 7.2)²/8) dx = (0.4)(0 – 7.2)² + (0.6) [(x – 7.2)³/24]_8^16 = 37.76,
meaning the population standard deviation of wages is √37.76, or approximately $6.14.
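A simulation sketch for this example (variable names are ours):
set.seed(1234)
n <- 100000
working <- runif(n) < 0.6                        # 60% of teenagers work
wage <- ifelse(working, runif(n, 8, 16), 0)      # zero wage for non-workers
mean(wage)                                       # theory: 7.2
var(wage)                                        # theory: 37.76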
Notes
21 Strictly speaking, the outcome X = –1 (which corresponds to the stock price dropping to zero) may not have zero probability, but we assume
that it does for ease of exposition.
22 This result also holds for an infinite set of disjoint and exhaustive events, in which case the summation becomes an infinite summation.
23 In the terminology of Chapter 14, the sample covariance is said to be a “consistent estimator” of the population covariance σXY.
24 For general discrete X, the same basic idea applies except that there are more (possibly infinite) possible outcomes xk* for X, resulting in a summation of terms that are probabilities pX(xk*) times conditional expectations given X = xk*.
25 In Chapter 11, however, we will see that for X and Y with normal distributions, independence of X and Y holds if and only if the population covariance/correlation is equal to zero.
26 The inverse function g–1 (·) is well-defined since g(·) is strictly increasing.
Exercises
1. Consider the probability density function (pdf)
fX(x) = 3x² for 0 ≤ x ≤ 1, and 0 otherwise.
(a) Show that fX (·) is a valid pdf.
(b) What is the population mean of X?
(c) What is the population variance of X?
(d) What is the value of the cdf of X evaluated at 0.4?
(e) What is the population median of X?
(f) Confirm your answers to (a), (b), and (d) in R by defining functions corresponding to fX (x) and xfX (x) and using
the integrate function.
2. In the United States, a FICO Score is a measure of an individual’s credit worthiness that companies (lenders, credit-
card issuers, etc) use to determine whether to extend credit. An individual’s FICO Score is an integer between 300 and
850, with low scores indicating low credit worthiness and high scores indicating high credit worthiness. Assume that
we can consider FICO Scores to be an approximately continuous variable, even though they are integers. According
to fico.com, the population probabilities associated with different intervals in April 2023 were as follows:
Interval 300-499 500-549 550-599 600-649 650-699 700-749 750-799 800-850
Probability 0.028 0.057 0.068 0.086 0.120 0.164 0.236 0.241
Let X denote the random variable associated with an individual’s FICO Score.
(a) What is FX (799)?
(b) What are the smallest and largest possible values of FX (520)?
(c) In which interval is the population median of X?
(d) What are the smallest and largest possible values of the population mean of X?
3. Consider the following pdf for a random variable X defined on the interval [0, 1]:
fX(x) =
0.8 for x ∈ [0, 0.5]
K for x ∈ (0.5, 1]
0 otherwise
(a) What is K? Graph the pdf of X.
(b) What is the cdf of X? Graph the cdf of X.
(c) What is the expected value of X?
(d) What is the population variance of X?
(e) What is the population 10% quantile of X?
(f) What is the population 90% quantile of X?
(g) Write an R function rsplitunif that takes a single argument n and returns a vector with n draws of the
random variable X. (Hint: Think of X as being generated in the following way: with probability 0.4, X is drawn
from a U(0, 0.5) distribution, and with probability 0.6, X is drawn from a U(0.5, 1) distribution.)
(h) Using the function rsplitunif, conduct 10,000 simulations in R to confirm the answers to (c) and (d).
(i) Suppose Y ∼ U(0, 1) and that X and Y are independent. Using the functions runif and rsplitunif,
conduct 10,000 simulations in R to approximate P(X > Y).
(j) Suppose Y ∼ U(0, 1) and that X and Y are independent. Using the functions runif and rsplitunif,
conduct 10,000 simulations in R to approximate P(|X – Y| ≤ 0.1), the probability that X and Y are within 0.1 of
each other.
4. Let X be a random variable with pdf
fX(x) =
0.5x if 0 ≤ x ≤ 1
1 – 0.5x if 1 < x ≤ 2
0.25 if 2 < x ≤ 4
0 otherwise
(i) *The following R function rparabola takes a single argument n and returns n random draws of the random
variable X:
i. Copy and paste the code into R to define the rparabola function. Using the rparabola function,
draw a histogram of 100,000 simulated draws of X. How does the histogram compare to the pdf graph
from (c)?
ii. Consider two cities, with associated market shares X1 and X2 for Office Plus, where X1 and X2 are i.i.d. random variables with pdf fX(·). Using the rparabola function, draw a histogram of 100,000 simulated draws of (X1 + X2)/2, the average market share for Office Plus in the two cities.
iii. Consider ten cities, with associated market shares X1, …, X10 for Office Plus, where X1, …, X10 are i.i.d. random variables with pdf fX(·). Using the rparabola function, draw a histogram of 100,000 simulated draws of (X1 + X2 + ··· + X10)/10, the average market share for Office Plus in the ten cities.
6. *Consider a car insurance policy that has a deductible of $500 and maximum coverage of $15,000. If the policy
holder gets into an accident and submits a claim, the policy holder must pay the first $500 of the claim and the
insurance company pays the rest of the claim up to the maximum of $15,000. Therefore, the maximum the insurance
company pays on a claim is $14,500, which happens when the claim is $15,000 or more. Suppose the pdf of the
claim X is
fX(x) = (6/20000³)(20000 – x)x for 0 ≤ x ≤ 20000, and 0 otherwise.
Let Y be the amount that the insurance company pays on the claim. Y has discrete and continuous outcomes, with 0
and 14500 being discrete outcomes with positive probabilities and y ∈ (0, 14500) being continuous outcomes.
(a) What is the probability, P(Y = 0), that the insurance company pays nothing?
(b) What is the probability, P(Y = 14500), that the insurance company pays the maximum amount?
(c) Determine the cdf of Y.
(d) What is the probability that the insurance company pays between $10,000 and $12,000 on the claim?
7. The lifetime of a new restaurant, in years, is described by the random variable X with pdf
fX(x) = 1/(x + 1)² for x > 0, and 0 otherwise.
For example, if X = 2, the restaurant is in business for exactly two years and then closes.
(a) Determine the cdf of X.
(b) What is the population median of X?
(c) What is the probability that the restaurant stays in business for at least two years?
(d) If a company opens three restaurants, each of which has a lifetime that is an independent draw of X, what is the
probability that none of the restaurants lasts two years?
8. *Let X1, X2, …, Xn be i.i.d. continuous random variables with population median τX,0.5. Let minX = min(X1, X2, …, Xn) and maxX = max(X1, X2, …, Xn).
(a) Show that P(minX ≤ τX,0.5 ≤ maxX) = 1 – (1/2)^(n–1). (Hint: Consider the probability of the complement.)
(b) For n = 4, think about having many different four-observation samples {x1 , x2 , x3 , x4 }, each of which is a
realization of the {X1 , X2 , X3 , X4 } i.i.d. draws. For what fraction of these samples is the population median
τX,0.5 between min(x1, x2, x3, x4) and max(x1, x2, x3, x4)?
(c) What is the smallest value of n for which the probability in (a) is at least 95%?
(d) What is the smallest value of n for which the probability in (a) is at least 99%?
9. Each of the following R commands makes 10,000 random draws that are stored in the vector x. For each command,
(i) describe the distribution that the random draws are being drawn from and (ii) provide your best guess at what the
value of mean(x) would be.
(a) x <- 10*runif(10000)-7
(b) x <- 1*(runif(10000)<0.7)
(c) x <- 2*(runif(10000)<0.4)-1
(d) x <- runif(10000,1,3)+runif(10000,5,8)
10. A store's sales depend on whether it is a weekday (Monday through Friday) or a weekend day (Saturday or Sunday).
Specifically, sales (in thousands of dollars) are distributed as a uniform random variable U(1, 3) on a weekday and as a
uniform random variable U(2, 5) on a weekend day. Let X denote the random variable associated with the store’s sales
on a randomly chosen day of the week (i.e., the probability associated with each day is 1/7). We say that X is a mixture
of uniform random variables.
(a) Determine the cdf of X by applying the Law of Total Probability
P(X ≤ x) = P(X ≤ x|weekday)P(weekday) + P(X ≤ x|weekend)P(weekend).
(b) Determine the pdf of X.
(c) Conduct 10,000 simulations in R, making a random draw of X in each simulation, and plot a histogram of the
draws to confirm your answer to (b).
(d) What is the population mean of X? Either calculate the integral or use the fact that
E(X) = E(X|weekday)P(weekday) + E(X|weekend)P(weekend).
(e) What is the population median of X?
11. An investor is considering two different real estate properties, Property A and Property B. The annual return from
Property A, call it RA , follows a uniform distribution between 4% and 9%, and the annual return from Property B, call
it RB , follows a uniform distribution between 2% and 12%. Assume that RA and RB are independent.
(a) What are the pdf’s of RA and RB ?
(b) Which property has a higher probability of an annual return greater than 7%?
(c) What is the probability that both RA and RB are greater than 7%?
(d) What are the expected values and population standard deviations of RA and RB ?
(e) What is P(RB > RA )?
(f) Conduct 10,000 simulations in R, making random draws of RA and RB for each simulation, to confirm your
answer to (e).
12. Two buses arrive randomly and independently at a station between 8:00am and 8:20am. The arrival time of each bus
can be modeled as a uniform distribution.
(a) If each bus stays at the station for two minutes after its arrival, what is the probability that the two buses are
simultaneously at the station at some point?
(b) Conduct 10,000 simulations in R to confirm your answer to (a).
(c) For this part, vary the amount of time t (in minutes) that the two buses remain at the station. For each integer
value t ∈ {1, 2, · · · , 9, 10}, conduct 10,000 simulations in R to approximate the probability that the two buses
are simultaneously at the station at some point. Plot the approximated probabilities versus t.
(b) Suppose the first two attempted hires are successful (H1 = 1, H2 = 1), where it is assumed that H1 and H2 are
independent. Determine the (posterior) distribution of PH conditional on H1 = 1 and H2 = 1. That is, what is
fPH (p|H1 = 1, H2 = 1) as a function of p for p ∈ [0, 1]? (Hint: Replace “H1 = 1” by “H1 = 1, H2 = 1” in the formula
from (a) and use the independence assumption.)
(c) If the first hire is successful (H1 = 1) and the second hire is unsuccessful (H2 = 0), how would your answer to
(b) change?
(d) Rather than PH ∼ U(0, 1), suppose the prior distribution is instead described by the pdf
fPH(p) = 6p(1 – p) for p ∈ [0, 1], and 0 otherwise.
i. Confirm that fPH (p) is a valid pdf.
ii. Without calculating integrals by hand, provide a graph in R that shows the prior distribution of PH and
the posterior distribution of PH conditional on the first hire being successful (H1 = 1). (Hint: Use the
formula from (a) and the integrate function to determine fPH (p|H1 = 1) for possible values of p.)
18. Conduct 10,000 simulations to approximate the expected value and the standard deviation of the investment after
two years in Example 10.31.
19. The winner of a raffle receives a payout of X 2 dollars, where X is drawn from a U(20, 40) random variable.
(a) What is the sample space associated with the payout X 2 ?
(b) What is the cdf of the payout X 2 ?
(c) What is the pdf of the payout X 2 ?
(d) What is the expected value of the payout X 2 ?
(e) *For this part, assume that the payout is X 2 – 10X instead of X 2 .
i. Confirm that the payout is an increasing function of X for the relevant values of X (20 ≤ X ≤ 40).
ii. What is the sample space associated with the payout X 2 – 10X?
iii. What is the cdf of the payout X 2 – 10X? (Hint: Use the quadratic formula to determine P(X 2 – 10X ≤ v).)
20. A cryptocurrency miner owns a computer server that “mines” cryptocurrency for 20 hours during the day (4:00am
to midnight). Let X ∈ [0, 20] denote a random variable that indicates, on a given day, how many hours the server runs
without crashing. If X < 20, the server crashes before midnight; if X = 20, the server does not crash and runs until
midnight. Suppose the cdf of X is
FX(x) =
0 for x ≤ 0
0.005x + 0.00025x² for 0 < x < 20
1 for x ≥ 20
(a) What is P(X = 20)? (Think about the jump in FX (x) that occurs at x = 20.)
(b) If the cryptocurrency miner makes the equivalent of $1,000 for each hour that the server runs, what is the
probability that the miner makes more than $16,000 on a given day?
(c) What is the pdf associated with the conditional distribution of X given 0 < X < 20?
(d) If the cryptocurrency miner makes the equivalent of $1,000 for each hour that the server runs, what is the
expected value of the amount of money made on a given day?
Definition 11.1 A normal random variable X with mean parameter µ and variance parameter σ², denoted X ∼ N(µ, σ²), has the pdf
fX(x) = (1/(σ√(2π))) e^(–(1/2)((x – µ)/σ)²) for –∞ < x < ∞.
Figure 11.1 shows the shape of the pdf fX (·) for a normal random variable X ∼ N(µ, σ 2 ). The pdf is bell-shaped and
symmetric, with a peak at the center of the distribution where x = µ. The pdf is positive for all values along the real
line since e^v is always positive, even for negative v. The pdf decreases as x moves either to the left or to the right of the
x = µ value.
The following proposition states some of the important properties of a normal random variable:
Proposition 11.1. If X ∼ N(µ, σ²) is a normal random variable, then
(i) µX = E(X) = µ
(ii) σX² = Var(X) = σ²
(iii) σX = sd(X) = σ
(iv) X is symmetric around µ, with population median τX,0.5 = µ
(v) The maximum value of the pdf fX(x) occurs at x = µ, with the pdf strictly increasing to the left of x = µ (that is, f(x1) < f(x2) if x1 < x2 < µ) and strictly decreasing to the right of x = µ (that is, f(x1) > f(x2) if µ < x1 < x2).
Properties (i) and (ii) can be verified by using the definitions of the population mean and the population variance, but the proofs are complicated due to the form of the normal pdf.²⁷ Property (iv) holds since, for any v > 0,
fX(µ + v) = (1/(σ√(2π))) e^(–(1/2)(v/σ)²) = (1/(σ√(2π))) e^(–(1/2)(–v/σ)²) = fX(µ – v).
Property (v) holds since fX(x) = (1/(σ√(2π))) e^(–(1/2)((x – µ)/σ)²) is maximized when the exponent –(1/2)((x – µ)/σ)² is equal to zero, which happens when x = µ.²⁸
Since µ and σ 2 are the population mean and the population variance, respectively, the parameters of X ∼ N(µ, σ 2 )
describe the location and dispersion of X. The top graph of Figure 11.2 shows two normal random variables with
the same population variance, where one distribution is centered at µ and the other (dotted) distribution is shifted
to the right, with the same shape and centered at a larger population mean. The bottom graph of Figure 11.2 shows
two normal random variables centered around the same population mean, where the dotted distribution has a larger
Figure 11.1
Probability density function for a normal distribution
population variance than the solid distribution. As the population variance increases, with the population mean fixed,
more probability gets shifted into the left and right tails, increasing the spread of the distribution.
The cdf FX(x) of a normal random variable X ∼ N(µ, σ²) is
FX(x0) = P(X ≤ x0) = ∫_{–∞}^{x0} (1/(σ√(2π))) e^(–(1/2)((x – µ)/σ)²) dx.
Unfortunately, there’s no simple closed-form formula for the integral needed to calculate FX (x0 ). That said, as
discussed below, R has built-in functions to calculate the pdf fX (x0 ) and the cdf FX (x0 ) for any x0 . Since the random
variable X ∼ N(µ, σ 2 ) is symmetric around µ, the following two properties of the normal cdf hold:
FX (µ) = 0.5
and
FX (µ – v) = 1 – FX (µ + v) for any v > 0.
Figure 11.3 shows the normal cdf curve (bottom graph) along with the associated normal pdf curve (top graph). The
cdf is an “S-shaped” curve. As the argument of FX (·) gets more and more negative, the value of FX (·) gets arbitrarily
close to zero but is always strictly greater than zero. Similarly, as the argument of FX (·) gets more and more positive,
the value of FX (·) gets arbitrarily close to one but is always strictly less than one.
The following R functions are useful for working with a normal random variable:
• dnorm(x, mean=0, sd=1): Returns the pdf of a normal random variable, with mean mean and standard
deviation sd, evaluated at the argument x, which may be a single number or a vector. The optional arguments
mean and sd have default values of 0 and 1, respectively.
Figure 11.2
Location and variance of normal random variables (top panel: a shift in location; bottom panel: an increase in variance)
• pnorm(x, mean=0, sd=1): Returns the cdf of a normal random variable, with mean mean and standard
deviation sd, evaluated at the argument x, which may be a single number or a vector. The optional arguments
mean and sd have default values of 0 and 1, respectively.
• rnorm(n, mean=0, sd=1): Creates a vector of n i.i.d. random draws of a normal random variable with mean mean and standard deviation sd. The optional arguments mean and sd have default values of 0 and 1, respectively.
• qnorm(p, mean=0, sd=1): Returns the population quantiles of a normal random variable, with mean mean
and standard deviation sd, specified by the argument p, which may be a single number or a vector. The optional
arguments mean and sd have default values of 0 and 1, respectively.
The ability to easily calculate the pdf and cdf of a normal random variable is particularly appealing due to the
complicated nature of the pdf and the lack of a closed-form expression for the cdf.
Figure 11.3
Cumulative distribution function for a normal random variable (shown below the associated pdf; the S-shaped cdf rises from 0 to 1, with FX(µ) = 0.5)
set.seed(1234)
dnorm(0)
## [1] 0.3989423
dnorm(1)
## [1] 0.2419707
pnorm(1)
## [1] 0.8413447
rnorm(10)
## [1] 0.1329808
dnorm(1,mean=0,sd=3)
## [1] 0.1257944
pnorm(1,mean=0,sd=3)
## [1] 0.6305587
rnorm(10,mean=0,sd=3)
## [1] -1.4315781 -2.9951593 -2.3287617 0.1933765 2.8784822 -0.3308565
## [7] -1.5330285 -2.7335862 -2.5115150 7.2475055
After the random seed is set, the next four commands in the code use the default values mean=0 and sd=1,
corresponding to X ∼ N(0, 1). In order, they return the pdf values fX (0) and fX (1), the cdf value FX (1), and ten
i.i.d. draws for X ∼ N(0, 1). The last four commands use mean=0 and sd=3, returning (in order) the pdf values
fX (0) and fX (1), the cdf value FX (1), and ten i.i.d. draws for X ∼ N(0, 9).
For a normal random variable X ∼ N(µ, σ²), three rule-of-thumb probability intervals are commonly used:
• P(µ – σ ≤ X ≤ µ + σ) ≈ 0.6827, meaning the probability that X is within one standard deviation (σ) of its mean
µ is approximately 68%.
• P(µ – 2σ ≤ X ≤ µ + 2σ) ≈ 0.9545, meaning the probability that X is within two standard deviations (2σ) of its mean
µ is approximately 95%.
• P(µ – 3σ ≤ X ≤ µ + 3σ) ≈ 0.9973, meaning the probability that X is within three standard deviations (3σ) of its
mean µ is nearly 100%, as there is only a 0.27% probability that X is more than 3σ away from µ.
Figure 11.4 shows the pdf of a normal random variable X ∼ N(µ, σ 2 ), with the gray region in the top graph indicating
P(µ – σ ≤ X ≤ µ + σ) and the gray region in the bottom graph indicating P(µ – 2σ ≤ X ≤ µ + 2σ). The probability P(µ –
σ ≤ X ≤ µ + σ) ≈ 0.6827 is equal to the area under the pdf curve between µ – σ and µ + σ, and the probability P(µ –
2σ ≤ X ≤ µ + 2σ) ≈ 0.9545 is equal to the area under the pdf curve between µ – 2σ and µ + 2σ.
A more exact 95% probability interval, commonly used for statistical inference, follows from the fact that
P(µ – 1.96σ ≤ X ≤ µ + 1.96σ) ≈ 0.9500.
There is a 95% probability that X is within 1.96 standard deviations, rather than two standard deviations, of its mean µ.
An exact 90% probability interval, also commonly used, follows from the fact that
P(µ – 1.645σ ≤ X ≤ µ + 1.645σ) ≈ 0.9000.
There is a 90% probability that X is within 1.645 standard deviations of its mean µ.
Example 11.1 (Asset returns) Suppose the annual return on an asset X is normally distributed, with X ∼
N(0.07, (0.08)2 ). The annual return has population mean µX = 0.07 (or 7%) and population standard deviation
σX = 0.08 (or 8%). Then, there is approximately a 68% probability that the annual return is between 0.07 – 0.08 = –0.01
and 0.07 + 0.08 = 0.15. Also, there is a 95% probability that the annual return is between 0.07 – (1.96)(0.08) = –0.0868
and 0.07 + (1.96)(0.08) = 0.2268.
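These endpoints can be computed directly in R; a brief sketch, passing the mean and standard deviation to qnorm:
0.07 + c(-1, 1)*0.08                         # approximate 68% interval
qnorm(c(0.025, 0.975), mean=0.07, sd=0.08)   # 95% interval, approx (-0.0868, 0.2268)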
Definition 11.2 A standard normal random variable, often denoted Z, is a normal random variable with µ = 0,
σ² = 1, and σ = 1. That is, Z ∼ N(0, 1).
Plugging µ = 0 and σ = 1 into the pdf formula from Definition 11.1, the standard normal Z ∼ N(0, 1) has pdf
fZ(z) = (1/√(2π)) e^(–z²/2) for –∞ < z < ∞.
The standard normal distribution Z ∼ N(0, 1) is used so often that its pdf and cdf are often represented by special
notation. The standard normal pdf is denoted φ(·), with
φ(z) = fZ (z),
and the standard normal cdf is denoted Φ(·), with
Φ(z) = FZ (z).
Figure 11.4
Probability intervals for a normal random variable
Figure 11.5 shows the pdf curve for a standard normal random variable Z ∼ N(0, 1). The distribution is symmetric
around zero, with φ(–v) = φ(v) and Φ(–v) = 1 – Φ(v) for all v > 0. The peak of the standard normal distribution occurs
at z = 0, with φ(0) ≈ 0.3989.
For the standard normal random variable, a 95% probability interval is
P(–1.96 ≤ Z ≤ 1.96) ≈ 0.9500,
and a 90% probability interval is
P(–1.645 ≤ Z ≤ 1.645) ≈ 0.9000.
Where do the values 1.96 and 1.645 come from? The qnorm function can be used to confirm that –1.96 and 1.96 are,
respectively, the 2.5% and 97.5% quantiles of the N(0, 1) distribution and that –1.645 and 1.645 are, respectively, the
5% and 95% quantiles of the N(0, 1) distribution.
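For instance:
qnorm(c(0.025, 0.975))
## [1] -1.959964  1.959964
qnorm(c(0.05, 0.95))
## [1] -1.644854  1.644854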
Figure 11.5
Probability density function for a N(0, 1) random variable
pnorm(1.96)
## [1] 0.9750021
pnorm(1.96)-pnorm(-1.96)
## [1] 0.9500042
pnorm(-1.645)
## [1] 0.04998491
pnorm(1.645)
## [1] 0.9500151
pnorm(1.645)-pnorm(-1.645)
## [1] 0.9000302
Due to the symmetry of the normal distribution, there is a 2.5% probability that Z < –1.96 and a 2.5% probability that
Z > 1.96. Therefore, the value 1.96 is the population 97.5% quantile of Z,
Φ(1.96) = 0.975,
leaving 2.5% probability in the tail to the right of 1.96. Similarly, for the 90% probability interval (–1.645, 1.645),
there is a 5% probability that Z < –1.645 and a 5% probability that Z > 1.645. The value 1.645 is the population 95%
quantile of Z,
Φ(1.645) = 0.95,
leaving 5% probability in the tail to the right of 1.645.
This approach can be used for any symmetric probability interval of the standard normal random variable. For
instance, for a 70% probability interval, we need to find the value c such that the (–c, c) interval is associated with a
15% probability that Z < –c and a 15% probability that Z > c. The appropriate value of c is the population 85% quantile
of Z, which can be found in R:
qnorm(0.85)
## [1] 1.036433
The population 85% quantile is approximately 1.036, so that Φ(1.036) ≈ 0.85, Φ(–1.036) ≈ 0.15, and
P(–1.036 ≤ Z ≤ 1.036) ≈ 0.70.
Similarly, for a 80% probability interval, we find the population 90% quantile of Z, leaving 10% in the right tail:
qnorm(0.90)
## [1] 1.281552
The population 90% quantile is approximately 1.282, so that Φ(1.282) ≈ 0.90, Φ(–1.282) ≈ 0.10, and
P(–1.282 ≤ Z ≤ 1.282) ≈ 0.80.
and
P(a ≤ X ≤ b) = P((a – µ)/σ ≤ Z ≤ (b – µ)/σ) = Φ((b – µ)/σ) – Φ((a – µ)/σ).
Example 11.2 (Asset returns) Example 11.1 considered annual asset returns given by the random variable X ∼
N(0.07, (0.08)2 ). The normal random variable X can be standardized by de-meaning it and dividing by its standard
deviation:
Z = (X – 0.07)/0.08 ∼ N(0, 1).
If Z = 1.5, X is 1.5 standard deviations above its mean and X = 0.19. If Z = –2.5, X is 2.5 standard deviations below its
mean and X = –0.13. The probability of a positive annual return is
P(X > 0) = P((X – 0.07)/0.08 > (0 – 0.07)/0.08) = P(Z > –0.875) = 1 – Φ(–0.875).
For the first equality, we subtract 0.07 and then divide by 0.08 for both X (on the left of the >) and 0 (on the right of
the >) so that the probability remains the same. Then, we calculate 1 – Φ(–0.875) ≈ 0.8092:
1-pnorm(-0.875)
## [1] 0.809213
Alternatively, since 1 – Φ(–0.875) = Φ(0.875) by symmetry of Z, pnorm(0.875) would give the same answer.
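Another equivalent route is to leave the standardization to R's optional arguments:
1-pnorm(0, mean=0.07, sd=0.08)
## [1] 0.809213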
Using the results on probability intervals for Z ∼ N(0, 1) in Section 11.1.2, probability intervals for general X ∼
N(µ, σ 2 ), including the rule-of-thumb intervals from Section 11.1.1, can be constructed. These intervals are based
upon X = µ + σZ being a linear transformation of Z. For example, for a 95% probability interval,
0.95 = P(–1.96 ≤ Z ≤ 1.96) = P(µ – 1.96σ ≤ µ + σZ ≤ µ + 1.96σ) = P(µ – 1.96σ ≤ X ≤ µ + 1.96σ),
where the second equality is obtained by multiplying each of the three terms within the probability by σ and then
adding µ to each term. Similarly, the 90% probability interval for X is
0.90 = P(–1.645 ≤ Z ≤ 1.645) = P(µ – 1.645σ ≤ X ≤ µ + 1.645σ).
For other symmetric probability intervals of X, centered around µ, the constant that provides the desired probability
can be determined. As an example, suppose we want an 85% probability interval for X, leaving 7.5% probability less
than the lower end of the interval and 7.5% probability greater than the upper end of the interval. For the standard
normal, the 92.5% quantile is equal to 1.440, so that Φ(1.440) = 0.925 and Φ(–1.440) = 0.075.
qnorm(0.925)
## [1] 1.439531
Therefore,
P(–1.440 ≤ Z ≤ 1.440) = 0.85,
and, equivalently,
P(µ – 1.440σ ≤ X ≤ µ + 1.440σ) = 0.85.
Using the parameters from Example 11.2 (µ = 0.07, σ = 0.08), there is an 85% probability that X (annual asset return)
is between 0.07 – (1.440)(0.08) = –0.0452 and 0.07 + (1.440)(0.08) = 0.1852.
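The same endpoints can be obtained in one step by passing the mean and standard deviation to qnorm (a small sketch using the parameters above):
qnorm(c(0.075, 0.925), mean=0.07, sd=0.08)   # approximately -0.0452 and 0.1852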
1-pnorm(-2.5/sqrt(17))
## [1] 0.7278552
Example 11.4 (Two-asset portfolio) Consider two assets whose annual returns X and Y are described by the following
two normal distributions:
X ∼ N(0.07, (0.08)2 ) and Y ∼ N(0.03, (0.01)2 ).
Think of asset X as a riskier asset, like a stock mutual fund, with a higher average return but more risk due to the
higher variance. Think of asset Y as a less risky asset, like a bond mutual fund, with a lower average return but less
risk. Suppose a portfolio is constructed by investing half of the money in asset X and half of the money in asset Y, so
that the returns on the two-asset portfolio are
V = 0.5X + 0.5Y.
If the asset returns are uncorrelated (ρXY = 0), then
µV = 0.5µX + 0.5µY = (0.5)(0.07) + (0.5)(0.03) = 0.05,
σV² = (0.5)²σX² + (0.5)²σY² = (0.5)²(0.08)² + (0.5)²(0.01)² = 0.001625,
and
σV = √0.001625 ≈ 0.0403.
Thus, V ∼ N(0.05, 0.001625) when ρXY = 0. The average return for the two-asset portfolio V is exactly at the
midpoint of the averages of the two returns since the portfolio is equally weighted. The standard deviation and variance
of V indicate that V is less risky than asset X and more risky than asset Y.
What if the asset returns X and Y are positively correlated instead, say with ρXY = 0.2? The population mean of
V = 0.5X + 0.5Y is unchanged, and the population variance of V becomes
σV² = (0.5)²σX² + (0.5)²σY² + 2(0.5)(0.5)σXY = (0.5)²(0.08)² + (0.5)²(0.01)² + (2)(0.5)(0.5)(0.2)(0.08)(0.01) = 0.001705,
using σXY = ρXY σX σY. The population standard deviation of V is σV = √0.001705 ≈ 0.0413, which is 2.5% higher than
the standard deviation of 0.0403 for the case of no asset correlation. The positive correlation between X and Y leads to
a tendency for the asset returns to move together, resulting in higher variance than when X and Y are uncorrelated.
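As a quick check, both portfolio standard deviations can be computed in R (a small sketch using the parameter values above):
w <- 0.5
sqrt(w^2*(0.08)^2 + (1-w)^2*(0.01)^2)                              # rho = 0, approx 0.0403
sqrt(w^2*(0.08)^2 + (1-w)^2*(0.01)^2 + 2*w*(1-w)*0.2*0.08*0.01)    # rho = 0.2, approx 0.0413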
Even though the average return for X is greater than the average return for Y, it is possible that the realized Y
value is larger than the realized X value due to the variances of the returns. The probability P(Y > X) can be written as
P(Y > X) = P(Y – X > 0), which is a probability in terms of a linear combination. The difference Y – X has population
mean 0.03 – 0.07 = –0.04 and population variance (0.01)² + (0.08)² – (2)(0.2)(0.08)(0.01) = 0.00618, so that
P(Y > X) = P(Y – X > 0) = P((Y – X – (–0.04))/√0.00618 > (0 – (–0.04))/√0.00618) = P(Z > 0.04/√0.00618),
which is approximately 0.3054 or 30.54%.
1-pnorm(0.04/sqrt(0.00618))
## [1] 0.3054386
The population mean and population standard deviation can also be determined for a two-asset portfolio with
unequal weights. Figure 11.6 summarizes the population mean and standard deviation for V = wX + (1 – w)Y, with
weights ranging from w = 0 to w = 1. There are black dots shown for portfolios with w = 0.2, w = 0.5, and w = 0.8. As
[Curve tracing the population mean (x-axis) against the population standard deviation of portfolio return (y-axis) as w moves from w = 0 to w = 1, with points marked at w = 0.2, w = 0.5, and w = 0.8.]
Figure 11.6
Population mean and standard deviation for weighted two-asset portfolios
w moves from 0 to 1, the mean and standard deviation move from the values associated with the Y asset to the values
associated with the X asset.
Proposition 11.4 generalizes to more than two normal random variables. Specifically, the linear combination of any
number of normal random variables is also a normal random variable. For example, when V = X1 + X2 + X3 for normal
random variables X1 , X2 , and X3 , Proposition 11.4 implies that the linear combination X1 + X2 is normally distributed
and, then, also that the linear combination of X1 + X2 and X3 is also normally distributed. This type of reasoning can
be extended to additional random variables.
Proposition 11.5. If X1, X2, …, Xm are normal random variables with population means µ1, µ2, …, µm and
population variances σ1², σ2², …, σm², respectively, then
V = k + a1X1 + a2X2 + · · · + amXm
is a normal random variable with population mean µV = k + a1µ1 + a2µ2 + · · · + amµm.
The variance expression for the general linear combination is more complicated, as it depends upon the (possibly
non-zero) covariances between the random variables (see Proposition 8.7). For the case of independent normal random
variables, the variance expression for the linear combination simplifies considerably, and the following proposition
provides a complete specification of the distribution of the linear combination:
Proposition 11.6. If X1, X2, …, Xm are independent normal random variables with population means µ1, µ2, …, µm
and population variances σ1², σ2², …, σm², respectively, then
V = k + a1X1 + a2X2 + · · · + amXm
is a normal random variable, with
V ∼ N(k + a1µ1 + a2µ2 + · · · + amµm, a1²σ1² + a2²σ2² + · · · + am²σm²).
(i) For a sum of i.i.d. normal random variables, V = X1 + X2 + · · · + Xm, with each Xj ∼ N(µ, σ²),
V ∼ N(mµ, mσ²) and σV = √m · σ.
(ii) For an average of i.i.d. normal random variables, V = (1/m)(X1 + X2 + · · · + Xm), with each Xj ∼ N(µ, σ²),
V ∼ N(µ, σ²/m) and σV = σ/√m.
Example 11.5 (Monthly sales) Assume that the monthly sales M at Hayden’s Hardware, measured in thousands of
dollars, are i.i.d. and normally distributed, M ∼ N(10, 4). The total sales in a given year are Y = M1 + M2 + · · · + M12,
where each Mj ∼ N(10, 4) is i.i.d. The total sales in a given year are normally distributed with
Y ∼ N(120, 48) and σY = √48 ≈ 6.928.
There is a 90% probability that Y is in the 120 ± (1.645)(6.928) interval, which is (108.6, 131.4). There is a 95%
probability that Y is in the 120 ± (1.96)(6.928) interval, which is (106.4, 133.6).
The average monthly sales in a given year, A = (1/12)(M1 + M2 + · · · + M12), is normally distributed with
A ∼ N(10, 4/12) and σA = 1/√3 ≈ 0.577.
There is a 90% probability that A is in the 10 ± (1.645)(0.577) interval, which is (9.05, 10.95). There is a 95%
probability that A is in the 10 ± (1.96)(0.577) interval, which is (8.87, 11.13). The standard deviation of the average
monthly sales in a given year (0.577) is considerably lower than the standard deviation of monthly sales in any given
month. Averaging over 12 months leads to a lower variance and less dispersion. If the average is taken over more
months, the dispersion would continue to decrease. With an average taken over 24 months, the standard deviation is
1/√6 ≈ 0.408. With an average taken over 36 months, the standard deviation is 1/3 ≈ 0.333.
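The 90% and 95% intervals above can be verified with qnorm; a brief sketch:
120 + qnorm(c(0.05, 0.95))*sqrt(48)      # 90% interval for Y, approx (108.6, 131.4)
10 + qnorm(c(0.025, 0.975))*sqrt(4/12)   # 95% interval for A, approx (8.87, 11.13)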
Definition 11.3 A positive-valued random variable X is a log-normal random variable, with parameters µ and σ 2 ,
if ln(X) ∼ N(µ, σ 2 ).
The function ln(·) is the natural log function, satisfying e^ln(x) = x for all x > 0. The possible outcomes for X are all
positive real numbers since e^ln(x) > 0 for any possible value of ln(x). Figure 11.7 shows an example of a pdf curve for
a log-normal random variable. The pdf curve is always positive for x > 0 and is clearly asymmetric and right-skewed.
(The pdf is equal to zero for x ≤ 0.) The population median of X is equal to e^µ since the population median of ln(X) is
Figure 11.7
Probability density function for a log-normal random variable
µ. Since ln(·) is an increasing function of its argument, this fact follows from
P(X < e^µ) = P(ln(X) < µ) = 0.5.
Due to the right-skewness of the log-normal random variable, the population mean of X is greater than the population
median e^µ.
Example 11.6 (Weekly earnings) Consider the earnwk variable for employed individuals from the cps dataset.
Example 6.9 showed the right-skewed sample distribution of weekly earnings. The top graph in Figure 11.8 is the
histogram of weekly earnings, this time with a normal distribution (solid curve) drawn over the histogram. The graphed
normal distribution has population mean equal to the sample mean of earnwk and population variance equal to the
sample variance of earnwk, as that distribution would match the histogram pretty closely if the earnwk values were
truly drawn from a normal distribution. It’s evident that the normal distribution does not fit the histogram of earnwk
well, as the histogram is too right-skewed to be matched by the normal distribution. The bottom graph in Figure 11.8
shows the histogram of ln(earnwk), with a normal distribution (solid curve) drawn over the histogram. In this case,
the population mean and variance are chosen to be the sample mean and variance of ln(earnwk) rather than earnwk.
The histogram of ln(earnwk) looks much more symmetric than the histogram of earnwk, with the right tail no longer
evident. The normal distribution seems to provide a pretty good fit to the histogram, certainly much better than in the
top graph. Taken together, these graphs suggest that the normal distribution might provide a good model for ln(earnwk)
but not earnwk, meaning a log-normal model for earnwk is more sensible than a normal model for earnwk.
The full script that creates Figure 11.8 is available on the companion website; a minimal sketch of the approach, assuming a data frame cps (restricted to employed individuals with non-missing earnwk) has been loaded, is the following:
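# top graph: histogram of weekly earnings with a fitted normal curve overlaid
hist(cps$earnwk, freq=FALSE, xlab="weekly earnings", main="")
curve(dnorm(x, mean=mean(cps$earnwk), sd=sd(cps$earnwk)), add=TRUE)
# bottom graph: histogram of log weekly earnings with a fitted normal curve overlaid
hist(log(cps$earnwk), freq=FALSE, xlab="ln(weekly earnings)", main="")
curve(dnorm(x, mean=mean(log(cps$earnwk)), sd=sd(log(cps$earnwk))), add=TRUE)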
Figure 11.8
Distributions of weekly earnings and log weekly earnings
For a log-normal random variable X with ln(X) ∼ N(µ, σ²), the cdf at any x0 > 0 is
FX(x0) = P(X ≤ x0) = P(ln(X) ≤ ln(x0)) = Fln(X)(ln(x0)).
The second equality holds since ln(·) is a strictly increasing function. Since ln(X) ∼ N(µ, σ²) is a normal random
variable, Fln(X)(ln(x0)) is the cdf of a N(µ, σ²) random variable evaluated at ln(x0).
Then, the probability that X is in the interval [a, b], for 0 < a < b, can also be written in terms of the normal cdf:
P(a ≤ X ≤ b) = P(ln(a) ≤ ln(X) ≤ ln(b)) = Fln(X) (ln(b)) – Fln(X) (ln(a)).
Using this relationship, 90% and 95% probability intervals for X can be constructed:
P(µ – 1.645σ ≤ ln(X) ≤ µ + 1.645σ) ≈ 0.90 =⇒ P(e^(µ–1.645σ) ≤ X ≤ e^(µ+1.645σ)) ≈ 0.90.
There is a 90% probability that X is between e^(µ–1.645σ) and e^(µ+1.645σ), with a 5% probability that X is below e^(µ–1.645σ)
and a 5% probability that X is above e^(µ+1.645σ). There is a 95% probability that X is between e^(µ–1.96σ) and e^(µ+1.96σ),
with a 2.5% probability that X is below e^(µ–1.96σ) and a 2.5% probability that X is above e^(µ+1.96σ). Unlike the probability
intervals for a normal random variable, these probability intervals are asymmetric in the sense that the endpoints are not
equidistant from the population median e^µ. For example, the difference e^µ – e^(µ–1.96σ) is not the same as e^(µ+1.96σ) – e^µ.
This approach can be used for any probability interval. For instance, for a 70% probability interval, with 15%
probabilities each in the left tail and the right tail, Φ(1.036) ≈ 0.85 implies
P(µ – 1.036σ ≤ ln(X) ≤ µ + 1.036σ) ≈ 0.70 =⇒ P(e^(µ–1.036σ) ≤ X ≤ e^(µ+1.036σ)) ≈ 0.70.
Example 11.7 (Weekly earnings) Assume that weekly earnings X are log-normally distributed with ln(X) ∼
N(6.5, (0.7)2 ). These parameters roughly correspond to the log-normal distribution shown in Figure 11.8. Then, a
95% probability interval for X is
P(e^(6.5–1.96(0.7)) ≤ X ≤ e^(6.5+1.96(0.7))) = P(168.68 ≤ X ≤ 2622.81) = 0.95.
There is a 95% probability that weekly earnings are between $168.68 and $2,622.81, a 2.5% probability that weekly
earnings are less than $168.68, and a 2.5% probability that weekly earnings are greater than $2,622.81.
The following R functions are useful for working with a log-normal random variable:
• dlnorm(x, meanlog=0, sdlog=1): Returns the pdf of a log-normal random variable evaluated at the
argument x, which may be a single number or a vector. The optional arguments meanlog and sdlog have default
values of 0 and 1, respectively, and represent the mean and standard deviation of the natural log of the random
variable.
• plnorm(x, meanlog=0, sdlog=1): Returns the cdf of a log-normal random variable evaluated at the
argument x, which may be a single number or a vector. The optional arguments meanlog and sdlog have default
values of 0 and 1, respectively, and represent the mean and standard deviation of the natural log of the random
variable.
• rlnorm(n, meanlog=0, sdlog=1): Creates a vector of n i.i.d. random draws of a log-normal random
variable. The optional arguments meanlog and sdlog have default values of 0 and 1, respectively, and represent
the mean and standard deviation of the natural log of the random variable.
• qlnorm(p, meanlog=0, sdlog=1): Returns the population quantiles of a log-normal random variable
specified by the argument p, which may be a single number or a vector. The optional arguments meanlog and
sdlog have default values of 0 and 1, respectively, and represent the mean and standard deviation of the natural
log of the random variable.
For instance, although the probability intervals in Example 11.7 were determined through the use of the normal
distribution, they can also be calculated directly in R based upon the quantiles of the log-normal distribution:
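# a sketch of the calculation, using meanlog = 6.5 and sdlog = 0.7 from Example 11.7
qlnorm(c(0.05, 0.95), meanlog=6.5, sdlog=0.7)    # 90% interval
qlnorm(c(0.025, 0.975), meanlog=6.5, sdlog=0.7)  # 95% interval, approx (168.68, 2622.81)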
For the same distribution ln(X) ∼ N(6.5, (0.7)²), the following code calculates the population mean of X, which is
e^(µ+σ²/2) for a log-normal random variable, and the population standard deviation of X, which is e^(µ+σ²/2)√(e^(σ²) – 1):
exp(6.5+0.5*(0.7)^2)
## [1] 849.7991
exp(6.5+0.5*(0.7)^2)*sqrt(exp(0.7^2)-1)
## [1] 675.7459
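The probability of weekly earnings above $1,000 and i.i.d. draws from this distribution can be obtained with plnorm and rlnorm; a brief sketch:
1 - plnorm(1000, meanlog=6.5, sdlog=0.7)   # P(X > 1000), approximately 0.28
rlnorm(10, meanlog=6.5, sdlog=0.7)         # ten i.i.d. draws of X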
Definition 11.4 A random variable X is a chi-square random variable, denoted X ∼ χ² or X ∼ χ²₁, if X = Z² and
Z ∼ N(0, 1).
Figure 11.9
Probability density functions for chi-square random variables
All possible outcomes of X ∼ χ²₁ are non-negative since X = Z². This random variable is a special case of a more
general chi-square random variable, which involves the sum of squared independent standard normals. Specifically,
if Z1, Z2, …, Zm are i.i.d. N(0, 1) random variables, Z1² + Z2² + · · · + Zm² is said to have a chi-square distribution with m
degrees of freedom.
Definition 11.5 A random variable X is a chi-square random variable with m degrees of freedom, denoted X ∼ χ²ₘ,
if X = Z1² + Z2² + · · · + Zm² and Z1, Z2, …, Zm are i.i.d. N(0, 1).
Figure 11.9 shows the distributions for four different chi-square random variables, corresponding to 4 degrees of
freedom (χ²₄), 6 degrees of freedom (χ²₆), 8 degrees of freedom (χ²₈), and 10 degrees of freedom (χ²₁₀). The x-axis has
been arbitrarily cut off at 20, but each of the four distributions has a right tail that extends forever. The four distributions
are all right-skewed. And, as the value for the degrees of freedom increases, both the mean and the dispersion of the
distributions increase. This last feature should not be surprising since every time we increase the degrees of freedom
we are adding additional Zj² terms to the random variable. In fact, it turns out that, for X ∼ χ²ₘ, the population mean
and variance are µX = m and σX² = 2m, respectively.
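These two population values can be checked by simulation; a small sketch for the χ²₆ case:
set.seed(1234)
draws <- rchisq(100000, df=6)   # 100,000 i.i.d. chi-square draws with 6 df
mean(draws)                     # close to m = 6
var(draws)                      # close to 2m = 12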
The following R functions are useful for working with chi-square random variables:
• dchisq(x, df): Returns the pdf of a chi-square random variable with df degrees of freedom evaluated at the
argument x, which may be a single number or a vector.
• pchisq(x, df): Returns the cdf of a chi-square random variable with df degrees of freedom evaluated at the
argument x, which may be a single number or a vector.
• rchisq(n, df): Creates a vector of n i.i.d. random draws of a chi-square random variable with df degrees of
freedom.
• qchisq(p, df): Returns the population quantiles of a chi-square random variable with df degrees of freedom
specified by the argument p, which may be a single number or a vector.
qchisq(0.90,4)
## [1] 7.77944
qchisq(0.90,6)
## [1] 10.64464
qchisq(0.90,8)
## [1] 13.36157
qchisq(0.90,10)
## [1] 15.98718
pchisq((1.96)^2,1)
## [1] 0.9500042
pchisq((1.645)^2,1)
## [1] 0.9000302
The qchisq commands calculate the population 90% quantiles associated with the four chi-square distributions
shown in Figure 11.9. As expected, the population 90% quantiles increase as the degrees of freedom increase, reflecting
a higher likelihood to have values farther out in the right tail. The pchisq commands illustrate a connection between
the cdf of a χ²₁ random variable and the cdf of a N(0, 1) random variable. Whereas 1.96 is the population 97.5%
quantile of N(0, 1), the pchisq((1.96)^2,1) command confirms that the value (1.96)² is the population 95%
quantile of χ²₁. This result arises since, if X ∼ χ²₁ and Z ∼ N(0, 1),
P(X < (1.96)²) = P(Z² < (1.96)²) = P(–1.96 < Z < 1.96) ≈ 0.95,
where the second equality follows from the fact that Z² < (1.96)² can only happen when |Z| < 1.96. Similarly, with
1.645 being the population 95% quantile of N(0, 1), the pchisq((1.645)^2,1) command confirms that the value
(1.645)² is the population 90% quantile of χ²₁, which follows from
P(X < (1.645)²) = P(Z² < (1.645)²) = P(–1.645 < Z < 1.645) ≈ 0.90.
Since the exponential model usually concerns time-related events, it is an example of a duration model. In the
examples above, the exponential model can be thought of as modeling the duration of the website visit, the duration of
the worker strike, or the duration of the customer service phone call. The formal definition of an exponential random
variable is the following:
Definition 11.6 An exponential random variable X with parameter θ > 0, written X ∼ Exp(θ), is a positive-valued
random variable with pdf
fX (x) = θe–θx for x > 0.
The parameter θ describes how quickly the underlying event is expected to occur. Larger values of θ correspond
to events that are expected to occur more quickly (i.e., shorter durations), whereas smaller values of θ correspond to
events that are not expected to occur quickly (i.e., longer durations). Since it turns out that the population mean of X
is µX = E(X) = 1/θ, the value 1/θ can be thought of as the expected time until the event occurs. For example, for θ = 1, the
expected time until the event occurs is 1 time unit, whereas for θ = 0.2, the expected time until the event occurs is 5
time units.
The cdf FX(·) for an exponential random variable X is obtained by integrating the pdf. For any value x0 > 0,
FX(x0) = ∫_0^{x0} θe^(–θx) dx = [–e^(–θx)]_0^{x0} = 1 – e^(–θx0).
Figure 11.10 graphs three different pdf curves corresponding to three different values of the θ parameter, with θ = 1
in the left graph, θ = 0.5 in the middle graph, and θ = 0.2 in the right graph. Each of these three pdf’s appears to
be a strictly decreasing function of x, which is a property of any exponential random variable since the derivative
f′X(x) = –θ²e^(–θx) < 0 for all x > 0. Regardless of the value of the parameter θ, it is always more likely for an exponential
random variable to have smaller values than larger values; for example, the probability that X is in (0, 1) is larger than
the probability that X is in (1, 2). As the value of θ decreases, moving from left to right in Figure 11.10, the height of
the pdf near zero decreases and the right tail becomes thicker.
The population descriptive statistics for an exponential random variable are given in the following proposition.³⁰
Proposition 11.8. If X ∼ Exp(θ), the population mean of X is µX = 1/θ, the population variance of X is σX² = 1/θ²,
and the population standard deviation of X is σX = 1/θ.
Example 11.8 (Duration of website visit) For a particular website, assume that the number of minutes that any given
visitor spends on the website, before leaving, is an exponential random variable X ∼ Exp(0.5). The expected duration
of the website visit, µX, is 1/0.5 = 2 minutes. The standard deviation of the duration of the website visit is also 2
minutes. The cdf, derived above as FX(x) = 1 – e^(–θx), can be used to calculate probabilities of intervals. For example,
the probability that the duration of the website visit is between 1 and 2 minutes is
FX(2) – FX(1) = (1 – e^(–(2)(0.5))) – (1 – e^(–(1)(0.5))) = e^(–0.5) – e^(–1) ≈ 0.239.
The largest probability for a one-minute interval is for a duration between 0 and 1, with probability e^(–0) – e^(–0.5) ≈ 0.393.
The following R functions are useful for working with exponential random variables:
Figure 11.10
Probability density functions for exponential random variables
• dexp(x, rate=1): Returns the pdf of an exponential random variable with rate θ equal to rate evaluated at
the argument x, which may be a single number or a vector.
• pexp(x, rate=1): Returns the cdf of an exponential random variable with rate θ equal to rate evaluated at
the argument x, which may be a single number or a vector.
• rexp(n, rate=1): Creates a vector of n i.i.d. random draws of an exponential random variable with rate θ
equal to rate.
• qexp(p, rate=1): Returns the population quantiles of an exponential random variable with rate θ equal to
rate, specified by the argument p, which may be a single number or a vector.
pexp(1,rate=0.5)-pexp(0,rate=0.5)
## [1] 0.3934693
pexp(2,rate=0.5)-pexp(1,rate=0.5)
## [1] 0.2386512
pexp(3,rate=0.5)-pexp(2,rate=0.5)
## [1] 0.1447493
set.seed(1234)
rexp(10,rate=0.5)
## [1] 5.00351721 0.49351777 0.01316391 3.48549218 0.77436517 0.17989934
## [7] 1.64816303 0.40523580 1.67608064 1.52086060
temp <- rexp(100000,rate=0.5)
mean(temp)
## [1] 1.998902
sd(temp)
## [1] 1.983554
The first rexp command simulates ten i.i.d. draws from an exponential random variable with θ = 0.5. The second
use of rexp, in the assignment of the variable temp, simulates 100,000 draws, and the values of mean(temp) and
sd(temp) are both very close to the population mean and standard deviation of 1/θ = 2.
One potential drawback of the exponential model is that the exponential pdf is a strictly decreasing function of x,
regardless of the parameter value θ. This feature means that the assumption of an exponential random variable implies
that shorter durations are always more likely than longer durations. To relax this restriction, a more flexible duration
model would be needed. An example of such a model is the Weibull model, which is a two-parameter model that
generalizes the exponential model. Specifically, a Weibull random variable X has the pdf
fX(x) = αθ(θx)^(α–1) e^(–(θx)^α) for x > 0,
where the two parameters α and θ are both positive.³¹ The exponential pdf is a special case, corresponding to α = 1.
Other values of α lead to pdf shapes different from those associated with the exponential model. For example, for
certain θ and α values, the Weibull pdf can increase until reaching a peak and then decrease afterwards, which is a
more appropriate model if the most likely durations are not close to zero but rather at some other value.
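As a quick check of the special case, the Weibull pdf with α = 1 matches the exponential pdf; note that R’s dweibull uses a scale parameter equal to 1/θ (here θ = 0.5, so scale = 2):
dexp(2, rate=0.5)
## [1] 0.1839397
dweibull(2, shape=1, scale=1/0.5)
## [1] 0.1839397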
For a sequence of i.i.d. exponential random variables X1, X2, X3, … representing the times between successive events,
the number of events that have occurred by an elapsed time T is determined as follows: if T is less than X1, the event
count is 0; if T is between X1 and X1 + X2, the event count is 1; if T is between X1 + X2 and X1 + X2 + X3, the event
count is 2; and so on.³²
Figure 11.11
Sequence of exponential random variables
To illustrate the connection between exponential random variables and Poisson random variables, we re-visit the
example of customers visiting a coffee shop (Example 9.6). In that example, the situation was modeled in terms of the
number or count of customers, assumed to be a Poisson random variable. Here, we instead model the customer arrival
times as exponential random variables and, using Proposition 11.9, infer that the number of customers is a Poisson
random variable. The advantage of this approach is that we can say something about the distributions of both the arrival
times between customers (exponential random variables) and the count of customers (a Poisson random variable).
Example 11.9 (Coffee shop customers) Suppose the arrival time, in hours, of a new customer at a coffee shop (since
the last customer arrived) is an Exp(20) random variable, so that the average arrival time is 1/20 = 0.05 hours or 3
minutes. If it is assumed that the arrival time of each successive customer is also an i.i.d. Exp(20) random variable,
Proposition 11.9 implies that the number of customers that arrives over the course of an hour is a Poisson(20) random
variable. Likewise, the number of customers that arrive over the course of two hours is a Poisson(40) random variable.
If we are interested in the arrival times themselves, rather than the count, the distribution associated with the Exp(θ)
random variable can be used for any single arrival time (e.g., between customers in Example 11.9). How about the
time that it takes two customers to arrive? This time would be a draw from the random variable given by the sum of
two i.i.d. Exp(θ) random variables. For example, the time for the first two customers to arrive is a draw from X1 + X2 .
Unfortunately, X1 + X2 is a more complicated distribution (not an exponential), but we can use computer simulations
to approximate this distribution and its properties (mean, standard deviation, etc). Similarly, the time for the first
three customers to arrive is a draw from X1 + X2 + X3 , whose distribution can again be approximated via computer
simulations. Perhaps not surprisingly, as seen in Example 10.29 and Figure 10.12 (for the average of exponential
random variables), the shape of the distribution of the sum of exponential random variables begins to look bell-shaped
as the number of random variables in the sum increases.³³
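A minimal sketch of such a simulation, using the coffee-shop arrival rate θ = 20 from Example 11.9:
set.seed(1234)
# time for the first two customers to arrive: draws from X1 + X2
twosum <- rexp(100000, rate=20) + rexp(100000, rate=20)
mean(twosum)   # population mean is 2/theta = 0.1
sd(twosum)     # population sd is sqrt(2)/theta, approximately 0.0707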
Definition 11.7 A mixture distribution is the probability distribution associated with a random variable X that is
based upon a collection of underlying random variables Y1, Y2, …, Ym using the following two-step process: (i) a
random variable is selected at random from the collection Y1, Y2, …, Ym according to the probabilities π1, π2, …, πm,
where Σ_{j=1}^{m} πj = 1, and (ii) the realized value of X is the realized value of the selected random variable.
To focus on mixtures of normal random variables, we provide the following definition as a special case of
Definition 11.7:
Definition 11.8 A mixture of normal random variables is a random variable with a mixture distribution based upon
normal random variables Y1 , Y2 , …, Ym .
In the following example, we re-visit the data-analyst salary example (Example 10.16), first noting that the original
example involved a mixture of uniform random variables and then considering a different mixture distribution based
upon normal random variables instead of uniform random variables:
Example 11.10 (Data analyst salaries) Example 10.16 considered the salaries of data analysts at a large firm, where
the salaries for non-graduate-degree data analysts and graduate-degree data analysts were modeled as different
uniform random variables. Using the notation from Definition 11.7, let Y1 ∼ U(60, 100) and Y2 ∼ U(90, 210) denote
these two random variables, respectively. If the probability that a data analyst at the firm has a graduate degree is
20%, the random variable for data-analyst salaries is a mixture of Y1 and Y2 with probabilities π1 = 0.8 and π2 = 0.2.
Now, suppose the two salary distributions are modeled as normal random variables rather than uniform random
variables, specifically with
Y1 ∼ N(80, 10²) and Y2 ∼ N(150, 30²),
still with π1 = 0.8 and π2 = 0.2. To visualize the mixture distribution X, based upon Y1 and Y2 with probabilities 0.8
and 0.2, respectively, the following R code simulates 1,000,000 draws of X:
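# a sketch of the setup assumed by the construction below (values from Example 11.10)
y1 <- rnorm(1000000, mean=80, sd=10)    # 1,000,000 draws of Y1 ~ N(80, 10^2)
y2 <- rnorm(1000000, mean=150, sd=30)   # 1,000,000 draws of Y2 ~ N(150, 30^2)
temp <- runif(1000000)                  # U(0,1) draws used to select Y1 or Y2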
# construct the mixture random variable, with probs 0.8 and 0.2
salary <- (temp<=0.8)*y1 + (temp>0.8)*y2
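plot(density(salary))          # smoothed density plot of the simulated mixture draws
abline(v=c(80,150), lty=3)     # dotted lines at the means of Y1 and Y2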
The temp vector consists of U(0, 1) draws that are used to determine whether Y1 or Y2 is the chosen random
variable for a given draw of X. Y1 is chosen with probability 0.8, or equivalently when the corresponding element
of temp is less than or equal to 0.8, whereas Y2 is chosen when the corresponding element of temp is greater than
0.8. The salary assignment command stores the full vector of simulated X draws. Figure 11.12 shows the smoothed
density plot output by the R code, with dotted lines drawn at the values of 80 and 150, corresponding to the means of Y1
and Y2 , respectively. The resulting distribution has two humps or modes at approximately 80 and 150. The hump near
80 is much higher than the one near 150 since the probability (80%) that X is drawn from the Y1 random variable is
so much higher than the probability (20%) that X is drawn from the Y2 random variable. To see the effect of changing
the probabilities of the two underlying distributions, the interested reader can alter the code above by replacing the
π1 value (0.8) with other values.
Notes
27 Alternatively, the population mean result (µX = µ) follows directly from property (iv).
28 The “strictly increasing” and “strictly decreasing” results can be shown by taking the derivative of fX (x) with respect to x and verifying that the
derivative is positive for x < µ and negative for x > µ.
29 That is, e^E(ln(X)) < E(e^ln(X)) = E(X), which is a special case of a result in statistics known as Jensen’s inequality.
30 The interested reader can confirm these properties by working out the appropriate integrals. For example, µX = 1/θ can be shown by
evaluating the integral ∫_0^∞ x θe^(–θx) dx.
31 The interested reader can look at the documentation for the Weibull-related R functions: dweibull, pweibull, rweibull, and
qweibull.
32 Proposition 11.9 allows for an infinite number of possible events so that the time T can be any value.
Figure 11.12
Mixture of two normal random variables
33 The exact distribution of a sum of i.i.d. exponential random variables is a special case of a distribution known as the gamma distribution.
Exercises
1. An airline knows that the duration of the flight from Austin to Nashville is uniformly distributed between 110
minutes and 130 minutes. The flight departs at 1:00pm.
(a) If the airline wants the probability of a late arrival to be 20%, what time should they state as the arrival time?
(b) When the flight lands, it may have to wait until an arrival gate is available for passengers to deplane. The time
that the arrival gate becomes available is uniformly distributed between 2:50pm and 3:00pm. If the flight time
and the time that the arrival gate becomes available are independent, what is the probability that the flight will
have to wait for its arrival gate when it lands?
(c) Same as (b), but now assume flight time and arrival-gate availability are independent normal random variables.
Assume that the two random variables have the same means as the uniform random variables described above,
and that the standard deviation of flight time is 5 minutes and the standard deviation of arrival-gate availability
is 2.5 minutes.
2. A credit card company knows that the monthly balance X of a representative customer is normally distributed:
X ∼ N(300, 2500).
Assume that the monthly balances of customers are independent draws from X.
(a) If X1 and X2 are the monthly balances for two randomly chosen customers, what is the probability that X1 + X2
is greater than $700?
(b) If X1 and X2 are the monthly balances for two randomly chosen customers, let Y = X2/X1 denote the ratio of the two
customers’ balances. Here, Y is a non-linear combination of X1 and X2, and Y is itself not a normal random
variable. Conduct 100,000 simulations in R to approximate the following quantities: (i) the mean of Y, (ii) the
standard deviation of Y, (iii) the median of Y, and (iv) the probability that Y > 1.5.
3. Maternal smoking during pregnancy has a negative association with birthweight. Suppose the distribution of a
newborn child’s birthweight (in grams) is BWS ∼ N(3050, 590²) if the mother smokes during pregnancy, while the
distribution is BWNS ∼ N(3260, 530²) if the mother does not smoke during pregnancy.
(a) Which of the following is larger: the pdf of BWS evaluated at 3050 or the pdf of BWNS evaluated at 3260?
Explain why.
(b) Plot the two pdf’s on the same graph, over the range between 2000 grams and 4500 grams.
(c) A baby that weighs less than 2500 grams is classified as “low birthweight.” What is the probability of a low-
birthweight baby if the mother smokes during pregnancy? What is the probability of a low-birthweight baby if
the mother does not smoke during pregnancy?
(d) What is the probability that the birthweight of a baby born to a smoking mother is greater than the birthweight
of a baby born to a non-smoking mother? (Treat the two births as independent.)
(e) Conduct 10,000 simulations in R to confirm your answer to (d).
(f) Now consider two births associated with smoking mothers and two births associated with non-smoking
mothers, where the four birthweights are independent random variables. What is the probability that the average
of the two birthweights for the smoking mothers is greater than the average of the two birthweights for the
non-smoking mothers?
(g) Conduct 10,000 simulations in R to confirm your answer to (f).
(h) There are approximately 453.6 grams in a pound. What is the normal distribution associated with birthweight
in pounds for a baby born to a mother who smokes during pregnancy?
4. The annual returns for two stocks, for the companies Widgetville and Planet Widget, are given by normal
distributions
Widgetville: X ∼ N(0.10, 0.0064) and Planet Widget: Y ∼ N(0.06, 0.0049),
with positive correlation ρXY = 0.3.
(a) What is the probability that Widgetville’s return is greater than Planet Widget’s return in a given year?
(b) If you buy $100 of Widgetville stock and $200 of Planet Widget stock, what is the distribution of the net
gain/loss (in dollars) on your portfolio after one year?
(c) For the portfolio in (b), what is the probability that the net gain is greater than $30 after one year?
(d) *Now suppose you can only invest in Widgetville stock. Write a function widgetgain(amt, yrs, numsim)
with three arguments: amt is the amount invested in Widgetville stock, yrs is the number of years that
the money is invested, and numsim is the number of simulations. The function should return a vector of
length numsim, where each element of the vector is a simulation of the net gain/loss over yrs years from an
investment of amt in Widgetville stock. Assume that the annual return in each year is an independent draw
from the random variable X, but make sure to allow for compounding. For instance, for yrs = 2, if $100
is invested with a 0.10 return in year 1 and a 0.05 return in year 2, you would have $100(1 + 0.10) = $110
after one year and $110(1 + 0.05) = $115.50 after two years, yielding $15.50 net profit. Draw histograms
of widgetgain(100, 10, 10000) and widgetgain(100, 20, 10000), and calculate their respective
simulated means and standard deviations.
5. A clothing store has to decide whether to spend money on advertising for the upcoming month. If the store does not
advertise, the distribution of monthly revenue (in thousands of dollars) is NA ∼ N(90, 36). If the store does advertise,
the distribution of monthly revenue (in thousands of dollars) is A ∼ N(95, 25). The cost of advertising is $4,000.
The correlation between NA and A is ρNA,A . (This correlation is likely positive since there are factors that affect monthly
revenue whether or not the store advertises.) The profitability X associated with advertising is a random variable, with
X = A – NA – 4.
(a) What is the expected value of X?
(b) What is the population standard deviation of X, in terms of ρNA,A ? Calculate σX for ρNA,A = 0.6 and ρNA,A = 0.8.
(c) Using the fact that X is also normally distributed, plot the probability of positive profitability, P(X > 0), versus
ρNA,A for ρNA,A = {0.1, 0.2, ..., 0.8, 0.9}.
6. Suppose X1 ∼ N(0, 4), X2 ∼ N(1, 1), and X3 ∼ N(2, 9) are independent random variables.
(a) What is the expected value of the average of draws from X1 , X2 , and X3 ?
(b) What is the population variance of the average of draws from X1 , X2 , and X3 ?
(c) What is the probability that X2 is larger than the average (X1 + X3)/2?
(d) What is the probability that X1 is larger than the average (X2 + X3)/2?
(e) Conduct 10,000 simulations in R to approximate the probability that X2 is closer to X1 than it is to X3 , which is
P(|X2 – X1 | < |X2 – X3 |).
7. A worker has three projects (A, B, and C) that she needs to complete, but there is uncertainty about how long each
project will take to complete. The completion times, in hours, are draws from normal random variables:
TA ∼ N(12, 9), TB ∼ N(24, 16), and TC ∼ N(8, 4).
Assume that TA , TB , and TC are independent.
(a) What is the distribution of the total completion time for the three projects?
(b) Suppose the worker does project A and then B and then C. Conduct 100,000 simulations in R to approximate
the pmf of X = the number of projects completed within 40 hours (a week of work).
(c) Suppose the worker does project C and then A and then B. Conduct 100,000 simulations in R to approximate
the pmf of X = the number of projects completed within 40 hours (a week of work).
(d) *For this part, drop the independence assumption and assume that the correlation between any two completion
times is equal to 0.4 (that is, ρTA TB = ρTA TC = ρTB TC = 0.4). What is the distribution of the total completion time
for the three projects? How does this compare to your answer to (a)?
8. A truncated normal random variable is restricted to a range (a, b) and, within that range, has a pdf proportional to
the pdf of a normal random variable. Specifically, the pdf of a truncated normal random variable based upon a N(µ, σ 2 )
random variable and restricted to the range (a, b) is
fX(x) = (1/σ) · φ((x – µ)/σ) / [Φ((b – µ)/σ) – Φ((a – µ)/σ)] if a < x < b, and fX(x) = 0 otherwise.
(a) Explain why the variance of X is less than σ 2 (without actually determining σX2 ).
(b) Write an R function dtruncnorm(x,mean,sd,a,b) that returns the pdf of a truncated normal, based upon
a normal with mean mean and standard deviation sd and restricted to the range between a and b, evaluated at
each of the elements of the vector x.
(c) Write an R function rtruncnorm(n,mean,sd,a,b) that returns a vector of n i.i.d. random draws of a
truncated normal, based upon a normal with mean mean and standard deviation sd and restricted to the range
between a and b. (Hint: Continually make i.i.d. draws from the normal distribution until there are n values
between a and b.)
(d) In a population of credit-card holders, each individual has some probability L of making a late payment in
a given month. Suppose the distribution of these probabilities follows a truncated normal distribution on the
range (0, 1) based upon a N(0.1, 0.3²) random variable.
i. Using the dtruncnorm function, draw the density function of L for values ranging between –0.5 and
1.5 on the x-axis.
ii. Using the rtruncnorm function to create 100,000 simulated draws of L, what are the approximate
values of E(L), sd(L), τL,0.5 , and P(L < 0.1)?
9. Suppose the wealth (in dollars) of 70-year-olds in the United States is described by a log-normal random variable:
ln(W) ∼ N(10, 1).
(a) Using only the normal distribution, provide a 90% probability interval for the wealth of a randomly chosen
70-year-old.
(b) Using only the normal distribution, what is the probability that a randomly chosen 70-year-old has wealth
between $10,000 and $30,000?
(c) Conduct 100,000 simulations in R, using the rlnorm function, to confirm your answer to (b).
10. WebNet, a large technology company, owns many different websites. The monthly traffic T for any given website
(i.e., the number of unique visitors to the website) is a log-normal random variable with ln(T) ∼ N(10, 4).
(a) What is the population median of T?
(b) What is the probability that a given website has monthly traffic greater than 20,000?
(c) Assume that WebNet earns two cents for every unique visitor to any of its websites. Fill in the blank in the
following sentence: “There is a 90% probability that WebNet’s monthly earnings for a given website is greater
than dollars.”
11. A homeowner has an outdoor light that is always kept on. After replacing the light bulb, suppose the new bulb’s life
X (in years) is drawn from an exponential random variable with mean 0.8.
(a) What is the probability that the bulb lasts at least one year?
(b) What is the probability that the bulb lasts between one year and two years?
(c) *Suppose the homeowner immediately replaces a broken bulb with a new bulb, whose life is a new i.i.d. draw
of X. Conduct 100,000 simulations in R, using exponential random variables, to approximate the pmf of the
total number of bulbs B needed to keep the light illuminated for at least two years.
i. What are the approximate values of P(B = 1) and P(B = 2)?
ii. What is the approximate value of E(B)?
iii. Based upon Proposition 11.9, how is B related to a Poisson random variable?
12. A company’s customer service department takes calls throughout the day. After a given customer calls, the time
(in minutes) before the next customer calls is an i.i.d. exponential random variable with mean 4. The length of any
given call (in minutes) is also an exponential random variable, but having mean 3, and is independent of the length of
other calls and the arrival times of all calls. Suppose customer A calls at exactly 3:00pm, and the next two calls are by
customer B and customer C (in that order).
(a) What is the probability that customer B calls before 3:05pm?
(b) What is the probability that customer A’s call ends after 3:05pm?
(c) Conduct 100,000 simulations in R to approximate the following probabilities:
i. the probability that customer B calls before customer A’s call ends
ii. the probability that both customer B and customer C call before customer A’s call ends
iii. the probability that customer B’s call is still ongoing when both customer A’s and customer C’s calls
end
13. *A consumer wants to purchase a product, and she knows that the product costs p1 dollars at a major internet retailer.
She has to decide whether or not to spend time searching the internet for a lower price. Suppose there is a lower price
p2 < p1 available at another internet retailer, but that it takes some time T (in minutes) to find that retailer and price.
Assume that T is an exponential random variable, with T ∼ Exp(θ), and that the consumer’s opportunity cost of time
is c dollars per minute.
(a) What is the consumer’s net gain associated with finding the lower price (in terms of p1 , p2 , T, and c)?
(b) The consumer only wants to search for the lower price if the expected net gain is positive. What must be true
about c for the consumer to search? Your answer should be an inequality in terms of p1 , p2 , and θ.
(c) Now suppose the consumer never searches for more than m minutes (since she knows the exponential random
variable has a long right tail). So, if she finds the lower price p2 within m minutes, she faces that price;
otherwise, she faces the original price p1. What must be true about c for the consumer to search? Your answer
should be an inequality in terms of p1, p2, θ, and m. (Hint: Use the fact that ∫_a^b xθe^(–θx) dx = [–xe^(–θx) – e^(–θx)/θ]_a^b.)
14. Suppose X is an exponential random variable, with X ∼ Exp(θ).
(a) What is the population median of X?
(b) Provide a 95% probability interval (L, U) for X, where P(X < L) = P(X > U) = 0.025.
15. This question is a modified version of Exercise 10.10, now using normal random variables instead of uniform
random variables. A store’s sales depend upon whether it is a weekday (Monday through Friday) or a weekend
day (Saturday or Sunday). Specifically, sales (in thousands of dollars) are distributed as a normal random variable
N(2, 0.5²) on a weekday and as a normal random variable N(3.5, 0.75²) on a weekend day. Let X denote the random
variable associated with the store’s sales on a randomly chosen day of the week (i.e., the probability associated with
each day is 1/7), so that X is a mixture of normal random variables.
(a) What is E(X)?
(b) Determine P(X ≥ 3) analytically, in terms of the standard normal cdf Φ(·). Use the pnorm function in R to
calculate the probability.
(c) Create a vector with 1,000,000 simulated draws of X in R. When simulating these draws, store the vector with
the indicator of whether it is a weekday or not, as it will be needed for (e) below.
i. Confirm your answers to (a) and (b) based on the simulated draws.
ii. What is the approximate population median of X based upon the simulated draws?
iii. Find the approximate population 2.5% and 97.5% quantiles to construct an approximate 95%
probability interval for X.
iv. Draw the smoothed density associated with the simulated draws. Does the density curve appear to be
normal?
(d) Suppose you know that a given day’s sales are between $2,000 and $3,000. Use Bayes’ Theorem to determine
the probability that it is a weekday. (Hint: Let A denote the event that it is a weekday, let B denote the event
that 2 ≤ X ≤ 3, and determine P(A|B).)
(e) Use the simulated draws from (c) to confirm your answer to (d).
16. Let X ∈ {0, 1} be an indicator of whether an individual is female, with X = 1 for women and X = 0 for men. Height
Y (in inches) in a certain population is distributed as a N(64.5, 6.25) random variable for women and as a N(70, 9)
random variable for men. Assume P(X = 1) = 0.5.
(a) What is E(Y)?
(b) Create a vector with 1,000,000 simulated draws of Y in R.
i. Confirm your answer to (a) based on the simulated draws.
ii. What is the approximate population variance of Y?
iii. What is the approximate population IQR of Y?
iv. What is the approximate probability that Y is between 70 and 75? How does this probability compare
to the conditional probabilities P(70 ≤ Y ≤ 75|X = 0) and P(70 ≤ Y ≤ 75|X = 1)? (For each conditional
probability, use the appropriate normal distribution to calculate the actual probability rather than using
the simulations.)
(c) *Find σY² analytically, using the fact that
σY² = E((Y – µY)²) = E((Y – µY)²|X = 0)P(X = 0) + E((Y – µY)²|X = 1)P(X = 1).
A central goal of statistics is to use an observed sample to say something about how the data are truly generated,
sometimes called the data-generating process (DGP). In other words, the observed sample is used to characterize the
population from which the sample was drawn. To simplify matters, this chapter focuses on the case of a simple
random sample. Recall from Chapter 5 that a sample is a simple random sample if each element of the population
is equally likely to be sampled. To formalize matters, let’s assume that the observed random sample {x1 , x2 , …, xn }
are the realizations of a collection of n i.i.d. random variables {X1 , X2 , …, Xn }. That is, x1 is the realized outcome of
X1 , x2 is the realized outcome of X2 , and so on. Since the random variables {X1 , X2 , …, Xn } are i.i.d., they share a
common cdf FX (·). The goal is to use the sample {x1 , x2 , …, xn } to characterize the distribution FX (·) associated with
the population.
Example 12.1 (Math SAT scores) Let X be the random variable associated with math SAT score in a population of
students. With an observed random sample, what can be said about the population mean µX of math SAT scores?
Suppose a random sample of 10 students is collected, with sample mean x̄ = (1/10) Σ_{i=1}^{10} xi = 618. Intuitively, it seems like
x̄ = 618 should be a good “guess” for the population mean µX, but how good is it? How close is the sample mean
x̄ = 618 to µX? The precision of x̄ = 618 as a guess for µX depends upon the variability of the sample mean itself. If
a different set of 10 students had been collected from the population, would their sample mean also be close to 618?
How about for yet another randomly chosen set of 10 students? After all, the observed {x1, x2, …, x10} sample is just
one possible realization of the random variables {X1, X2, …, X10}, and therefore the descriptive statistic x̄ = (1/10) Σ_{i=1}^{10} xi
is just one possible realization of the sample mean X̄ = (1/10) Σ_{i=1}^{10} Xi.
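To make this idea concrete, the variability of x̄ across samples can be simulated; a small sketch, assuming (purely for illustration) a hypothetical N(600, 80²) population of math SAT scores:
set.seed(1234)
# 1,000 random samples of size 10, storing each sample's mean
xbars <- replicate(1000, mean(rnorm(10, mean=600, sd=80)))
head(xbars)    # the first few realized sample means
sd(xbars)      # variability of the sample mean across samples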
Figure 12.1 provides a visual representation of how we can think about drawing our random sample of 10 students.
There are many ways of drawing a random sample of n observations, and only one of those samples, represented by
the gray shading, is observed. For the sample {x1 , x2 , …, xn }, descriptive statistics like the sample mean x̄ and the
sample variance s²x can be calculated. Had one of the other samples been observed, a different sample mean and
sample variance would have been obtained. We are interested in characterizing the distribution of the realizations of x̄
(or s²x) over the possible random samples.
In Definition 6.1, a statistic s(x1 , x2 , …, xn ) was defined as a function of the observed sample data. The sample
mean x̄ and the sample variance s²x are examples of statistics. For considering the distribution of a statistic over the
possible random samples that can be drawn, the random variable s(X1 , X2 , …, Xn ) is introduced. The random variable
s(X1 , X2 , …, Xn ) involves the same function s(·) applied to the random variables X1 , X2 , …, Xn rather than the observed
variables x1 , x2 , …, xn . In the case of the sample mean,
s(x1, x2, …, xn) = x̄ = (1/n) ∑ᵢ₌₁ⁿ xi
is the sample mean for a sample {x1, x2 , …, xn }, and
s(X1, X2, …, Xn) = X̄ = (1/n) ∑ᵢ₌₁ⁿ Xi
Figure 12.1 (schematic: a sequence of possible samples {x1, x2, …, x10}, with "Your sample" shaded gray)
Random sampling and sampling distributions
is the random variable associated with the sample mean. Before observing the data, X̄ is itself a random variable since it
depends on the random variables X1 , X2 , …, Xn . After observing the data, x1 is the realization of X1 , x2 is the realization
of X2 , and so on through xn , and x̄ is the realization of the random variable X̄.
Similarly, in the case of the sample variance,
s(x1, x2, …, xn) = s²x = (1/(n–1)) ∑ᵢ₌₁ⁿ (xi – x̄)²
and
s(X1, X2, …, Xn) = s²X = (1/(n–1)) ∑ᵢ₌₁ⁿ (Xi – X̄)².
Before observing the data, s²X is itself a random variable. The subscript X, and not x, indicates that s²X is a random
variable. After observing the data, the sample variance s²x is the realization of the random variable s²X.
The distribution of a statistic over the possible random samples is known as the sampling distribution and is
formally defined as follows:
Definition 12.1 The sampling distribution of a statistic s(X1 , X2 , …, Xn ) is the probability distribution of the statistic
over all possible random samples of size n from the population.
For a given sample size n, the sampling distribution of a statistic is also sometimes called the exact sampling
distribution or the finite-sample distribution. This chapter focuses on some examples where the exact sampling
distribution of a statistic can be determined based upon the specific form of the underlying random variables
X1 , X2 , …, Xn . For example, in the case of i.i.d. Bernoulli random variables or i.i.d. normal random variables, the
exact sampling distribution of the sample mean for a random sample of size n can be determined. For i.i.d. normal
random variables, the exact sampling distribution of the sample variance for a random sample of size n can also be
determined. In more general cases, however, it can be difficult to characterize the exact sampling distribution of a
statistic, even for a simple statistic like the sample mean. As it turns out, much more can be said when the sample size
n is large, as more general results are available to provide an approximate sampling distribution, rather than an exact
sampling distribution, for a wide range of statistics. This idea is the focus of Chapter 13, which considers large-sample
or asymptotic distributions for statistics, like the sample mean and many others, for large sample sizes.
The results of Example 12.2 can be generalized to any sample size n, and the following proposition summarizes the
results for the sampling distribution of X̄ for n i.i.d. Bernoulli random variables:
Proposition 12.1. If X1, X2, …, Xn are i.i.d. Bernoulli(π) random variables, the sampling distribution of X̄ is
P(X̄ = j/n) = C(n, j) π^j (1 – π)^(n–j) for j ∈ {0, 1, 2, …, n},
where C(n, j) = n!/(j!(n – j)!) is the binomial coefficient. The population mean of X̄ is µX̄ = E(X̄) = π, and the
population variance of X̄ is σ²X̄ = Var(X̄) = π(1 – π)/n.
X̄ ∼ N(µ, σ²/10) and σX̄ = σ/√10.
Thus, the observed sample mean x̄, the average of sales over the 10 years of data, can be thought of as a single draw
from a normal random variable with mean µ and standard deviation σ/√10. From the properties of normal random
variables, the observed sample mean x̄ is within 1.96σX̄ of the population mean µ with probability 95% and within
1.645σX̄ of the population mean µ with probability 90%.
The results of Example 12.3 can be generalized to any sample size n, and the following proposition summarizes the
results for the sampling distribution of X̄ for n i.i.d. normal random variables:
Proposition 12.2. If X1, X2, …, Xn are i.i.d. N(µ, σ²) random variables, the sampling distribution of X̄ is
X̄ ∼ N(µ, σ²/n).
The population mean of X̄ is µX̄ = E(X̄) = µ, the population variance of X̄ is σ²X̄ = Var(X̄) = σ²/n, and the
population standard deviation of X̄ is σX̄ = σ/√n.
Figure 12.2 (four panels: n = 2, n = 4, n = 10, n = 20; each plots pX̄(v) against v)
Sampling distributions of the sample mean of i.i.d. Bernoulli(0.2) random variables
Example 12.5 (Log-normal random variables) Suppose X1 , X2 , …, Xn are i.i.d. log-normal random variables, with
ln(Xi) ∼ N(µ, σ²) for each i ∈ {1, 2, …, n}. The sum of log-normal random variables does not have a log-normal
distribution, and therefore the average of log-normal random variables does not have a log-normal distribution.³⁵
While there are no general results regarding the exact sampling distribution of X̄ for i.i.d. log-normal random
variables, computer simulation can be used to approximate the sampling distribution for specific values of n, µ, and
σ². For a log-normal distribution with ln(X) ∼ N(0, 1) (µ = 0, σ² = σ = 1), Figure 12.4 shows the simulated sampling
distributions for four different sample sizes (n = 2, n = 5, n = 10, and n = 20). The assumed data-generating process
is used to make simulated draws. For instance, for n = 2, X1 and X2 are drawn randomly and independently from
a log-normal distribution with ln(X) ∼ N(0, 1) and the average of the two draws is calculated, and this process is
repeated many times. For the graphs shown in the figure, 100,000 simulations are used. The top-left graph shows
the smoothed density associated with the 100,000 draws of X̄ = (1/2)(X1 + X2) for n = 2. A similar process is used for the
other sample sizes. For n = 20, X1, X2, …, X20 are randomly and independently drawn from a log-normal distribution
with ln(X) ∼ N(0, 1), which gives simulated draws of X̄ = (1/20)(X1 + X2 + ⋯ + X20), and the lower-right graph shows
the smoothed density associated with 100,000 draws of X̄. Comparing across sample sizes, the dispersion in the
Figure 12.3 (four panels: n = 2, n = 3, n = 5, n = 10; each plots fX̄(v) against v)
Sampling distributions of the sample mean of i.i.d. U(0, 1) random variables
distributions decreases as the sample size gets larger, as expected. Also, the right skewness characteristic of the log-
normal distribution is evident at the smaller sample sizes (n = 2 and n = 5), but the right skewness is less dramatic
for n = 10 and has nearly disappeared for n = 20. In fact, the distribution of X̄ for n = 20 looks more like a normal
distribution than it does a log-normal distribution.
The following R code approximates the sampling distributions for Figure 12.4:
set.seed(1234)
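# A sketch of the replicate calls described below; the first object name
# (xmean_2) comes from the text, the other three are assumed to be analogous:
xmean_2 <- replicate(100000, mean(rlnorm(2, meanlog=0, sdlog=1)))
xmean_5 <- replicate(100000, mean(rlnorm(5, meanlog=0, sdlog=1)))
xmean_10 <- replicate(100000, mean(rlnorm(10, meanlog=0, sdlog=1)))
xmean_20 <- replicate(100000, mean(rlnorm(20, meanlog=0, sdlog=1)))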
The code uses the function replicate, which is useful for simulations involving random variables:
• replicate(n, expr): Returns a vector of length n containing the results of evaluating the expression expr
a total of n times.
The first use of replicate in the code creates the vector xmean_2, with 100,000 values for the 100,000 evaluations
of the expression mean(rlnorm(2, meanlog=0, sdlog=1)), which calculates the sample mean over 2
draws of the specified log-normal distribution. The other three replicate expressions are similar, with different
sample sizes specified.
The simulation approach in Example 12.5 is extremely general and can be used to approximate many different
sampling distributions. As long as (i) the distribution of the underlying i.i.d. random variables is known and (ii) a
computer can be used to simulate random draws from that distribution, simulation can always approximate the
sampling distribution of X̄ for a given sample size n. Moreover, since there is nothing special about the statistic X̄,
this simulation approach can also be used to approximate the sampling distribution of other statistics. For example, to
approximate the sampling distribution of the sample variance s²X, the only step we need to change is how the statistic
is calculated after the draws of X1, X2, …, Xn. For X̄, the value of the draw of X̄ = (1/n) ∑ᵢ₌₁ⁿ Xi is used, whereas for s²X,
the value of the draw of s²X = (1/(n–1)) ∑ᵢ₌₁ⁿ (Xi – X̄)² is used. The approach can also be used for other statistics, like sample
quantiles (including the sample median), sample IQR, and others, as the sketch below illustrates.
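For instance, a minimal sketch, assuming the same log-normal data-generating process and n = 10 (the object names here are illustrative, not from the book's script files):
set.seed(1234)
# draws of the sample variance and sample IQR for n = 10 log-normal observations
svar_10 <- replicate(100000, var(rlnorm(10, meanlog=0, sdlog=1)))
siqr_10 <- replicate(100000, IQR(rlnorm(10, meanlog=0, sdlog=1)))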
Figure 12.4 (four panels: n = 2, n = 5, n = 10, n = 20; each plots fX̄(v) against v)
Sampling distributions of the sample mean of log-normal random variables
The sample variance and the sample standard deviation, viewed as random variables, are, respectively,
s²X = (1/(n–1)) ∑ᵢ₌₁ⁿ (Xi – X̄)² and sX = √(s²X) = √((1/(n–1)) ∑ᵢ₌₁ⁿ (Xi – X̄)²).
qchisq(c(0.025,0.975),9)
## [1] 2.700389 19.022768
The 2.5% and 97.5% quantiles of the χ²₉ distribution are 2.700 and 19.023, respectively, so there is a 95% probability
of a χ²₉ random variable being in the interval (2.700, 19.023). Since 2500s²X ∼ χ²₉, the probability of s²X being in the
interval (2.700/2500, 19.023/2500) ≈ (0.0011, 0.0076) is also equal to 95%. There is a 2.5% probability that s²X is
less than 0.0011 and a 2.5% probability that s²X is greater than 0.0076. With a probability interval for the sample
variance s²X, we can construct a probability interval for the sample standard deviation sX since sX = √(s²X) and √· is
an increasing function. Specifically, taking the square root of each endpoint, the 95% probability interval for sX is
(√0.0011, √0.0076) ≈ (0.033, 0.087).
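The endpoints for sX can be computed directly from the chi-squared quantiles; a quick sketch in R:
sqrt(qchisq(c(0.025,0.975),9)/2500)  # approximately 0.033 and 0.087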
We can also calculate the probability that the sample variance or the sample standard deviation is in some pre-
specified interval. For instance, what is the probability that the sample standard deviation is less than 0.05? When a
sample standard deviation is calculated, it will either be less than 0.05 or greater than 0.05 (with certainty), but what
is the probability before the sample is observed? This probability is
P(sX < 0.05) = P(s²X < (0.05)²) = P(2500s²X < 2500(0.05)²) = P(2500s²X < 6.25),
and since 2500s²X ∼ χ²₉, this probability is the cdf of the χ²₉ distribution evaluated at 6.25:
pchisq(6.25,9)
## [1] 0.2853401
Therefore, P(sX < 0.05) ≈ 0.285, meaning there is a 28.5% chance that the observed sample standard deviation will
be less than 0.05.
set.seed(1234)
num_simulations <- 100000
# create a vector of sample variances for n=2, n=3, n=5, and n=10
varunif_2 <- replicate(num_simulations, var(runif(2)))
varunif_3 <- replicate(num_simulations, var(runif(3)))
varunif_5 <- replicate(num_simulations, var(runif(5)))
varunif_10 <- replicate(num_simulations, var(runif(10)))
The replicate function conducts the 100,000 simulations with a single command for each of the four sample
sizes. Using an almost identical approach, we can simulate sampling distributions for the sample standard deviation
sX. Rather than calculating the sample variance in each of the 100,000 simulations, we calculate the sample standard
deviation (replacing the var function with the sd function in the code above) and graph the smoothed densities over the
100,000 simulated standard deviations for each sample size. Figure 12.6 shows these simulated distributions, using the
same four sample sizes as above. The y-axis label is fsX(v), corresponding to the pdf of the random variable sX. Again,
the dispersion of the distributions decreases for the larger sample sizes, and the sampling distribution is approximately
symmetric when n = 10.
Example 12.8 (Log-normal random variables) Suppose X1 , X2 , …, Xn are i.i.d. log-normal random variables, with
ln(Xi ) ∼ N(0, 1) for each i ∈ {1, 2, …, n}. Figure 12.7 shows the simulated sampling distributions of the standard
deviation sX for n = 10 and n = 30, using 100,000 simulations for each sample size. Even with a sample size of n = 30,
the sampling distribution of sX is distinctly right-skewed. In fact, at first glance, the sampling distributions for n = 10
and n = 30 look quite similar. A closer look, especially for the lower values on the x-axis, reveals that the pdf for
n = 30 starts increasing at slightly larger values than the pdf for n = 10, and the peak occurs at a slightly larger value for
n = 30. It may not be obvious that there is reduced dispersion in the distribution when moving from n = 10 to n = 30,
but we know from Proposition 12.3 that the variance of the n = 30 distribution must be lower than the variance of the
n = 10 distribution. Although neither sample size is large enough to result in a symmetric and bell-shaped sampling
distribution, we would get such a symmetric and bell-shaped sampling distribution eventually if we continue to increase
the sample size n. This general idea is the focus of Chapter 13, where the concept of asymptotic or large-sample
sampling distributions is discussed.
Here is the R code to create Figure 12.7, again using the replicate function:
Figure 12.5 (four panels: n = 2, n = 3, n = 5, n = 10; each plots fs²X(v) against v)
Sampling distributions of the sample variance for i.i.d. U(0, 1) random variables
set.seed(1234)
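# A sketch of the simulations Example 12.8 describes (the object names are
# assumed, not taken from the book's script files):
sdlnorm_10 <- replicate(100000, sd(rlnorm(10, meanlog=0, sdlog=1)))
sdlnorm_30 <- replicate(100000, sd(rlnorm(30, meanlog=0, sdlog=1)))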
Figure 12.6 (four panels: n = 2, n = 3, n = 5, n = 10; each plots fsX(v) against v)
Sampling distributions of the sample standard deviation for i.i.d. U(0, 1) random variables
Figure 12.7 (two panels: n = 10 and n = 30; each plots fsX(v) against v)
Sampling distributions of the sample standard deviation for i.i.d. log-normal random variables
Figure 12.8 (three panels: n = 5, n = 10, n = 30; each plots fmaxX(v) against v)
Sampling distributions of the sample maximum for i.i.d. N(0, 1) random variables
random variable X̃0.5 . To fix ideas, consider a random sample of n = 3 observations. If the observations are sorted
in order, the sample median is the middle value of the three observations. What would its sampling distribution be?
Since the sample median is always the middle of the three observations, the sampling distribution should be less
dispersed than the original N(µ, σ²) distribution. The middle observation is less likely to fall in either tail, since that
would require at least one other observation even further out in the tail. Using similar reasoning,
the sampling distribution of the sample median should become tighter around the center µ of the original distribution
as the sample size grows. Also, since the original random variables are all symmetric, it would be surprising if the
sampling distribution of the sample median was not also symmetric. While it’s difficult to analytically determine the
sampling distribution in this case, unlike the sample maximum in Example 12.9, computer simulations can approximate
the sampling distributions of the sample median for different sample sizes. Figure 12.9 shows the simulated sampling
distributions of the sample median for three sample sizes (n = 3, n = 10, and n = 20), where the standard normal N(0, 1)
distribution is assumed. For each sample size, we use 100,000 simulations for the sampling distribution, where in each
simulation a random sample is drawn and the sample median is calculated. As predicted, the distributions become
tighter around the center, at zero, as the sample size becomes larger. The distributions all look symmetric. As a
comparison, for the n = 20 graph at the bottom, the dotted line shows the sampling distribution of the sample mean X̄,
which is X̄ ∼ N(0, 1/20) since n = 20. So, the sampling distributions of X̃0.5 and X̄ are both centered around zero, but the
sampling distribution of X̄ appears to exhibit less dispersion (lower variance) than the sampling distribution of X̃0.5 .
Put another way, it is somewhat more likely that the sample mean will be closer to the true mean/median of zero than
the sample median.
Figure 12.9 (three panels: n = 3, n = 10, n = 20; each plots fX̃0.5(v) against v; the n = 20 panel also shows the N(0, 1/20) density of X̄ as a dotted curve)
Sampling distributions of the sample median for i.i.d. N(0, 1) random variables
set.seed(1234)
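# A sketch of the simulations Example 12.10 describes (the object names are
# assumed, not taken from the book's script files):
mednorm_3 <- replicate(100000, median(rnorm(3)))
mednorm_10 <- replicate(100000, median(rnorm(10)))
mednorm_20 <- replicate(100000, median(rnorm(20)))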
Even when the sampling distribution is difficult or impossible to obtain analytically, the simulation approach
illustrated in Example 12.10 can be used to simulate the sampling distribution for a statistic. Example 12.10 didn’t
require any special properties of the normal distribution, so the same approach could be used for any underlying
distribution of the i.i.d. random variables in the case of the sample median. For other statistics, like other sample
quantiles or the sample IQR, the only step in the simulation process that needs to be changed is calculating the
appropriate statistic after each simulated random sample is drawn. This approach even works for bivariate statistics,
like the sample covariance or the sample correlation, if the underlying joint distribution of the bivariate random
variables is fully known.
Notes
34 The exact sampling distribution of X̄ is not easy to derive, but it is known as the Irwin–Hall distribution.
35 In contrast, the product of log-normal random variables is log-normal. For instance, if ln(X1) and ln(X2) are i.i.d. N(µ, σ²) random variables, then ln(X1X2) = ln(X1) + ln(X2), which has a N(2µ, 2σ²) distribution, meaning X1X2 is log-normal.
36 Although the proof of Proposition 12.4 is complicated, some intuition can be developed for the result by noting that (n – 1)s²X = ∑ᵢ₌₁ⁿ (Xi – X̄)², which implies (n – 1)s²X/σ² = ∑ᵢ₌₁ⁿ ((Xi – X̄)/σ)². If this expression had the population mean µ rather than X̄, it would be ∑ᵢ₌₁ⁿ ((Xi – µ)/σ)², which is the sum of n squared i.i.d. N(0, 1) random variables. That would imply a χ²ₙ distribution rather than the χ²ₙ₋₁ distribution in the proposition. Thus, the fact that X̄ appears in the expression for s²X, rather than the population mean µ, means that we "lose a degree of freedom" when using X̄ in place of µ. The interested reader can try working this out explicitly for the case of n = 2, where the result simplifies to s²X/σ² ∼ χ²₁.
37 One exception is the case of i.i.d. Bernoulli random variables, but that's arguably not very interesting since the sample variance is s²X = X̄(1 – X̄), so the exact sampling distribution of X̄ (a scaled binomial distribution) can be used directly to construct the sampling distribution for s²X.
Exercises
1. Let X ∈ {1, 2, 3} be a discrete random variable with pmf
P(X = 1) = 0.2, P(X = 2) = 0.3, P(X = 3) = 0.5.
(a) Consider a random sample of two observations, where X1 and X2 are i.i.d. random variables with the pmf above.
What is the sampling distribution of X̄?
(b) Consider a random sample of three observations, where X1 , X2 , and X3 are i.i.d. random variables with the pmf
above. What is the sampling distribution of X̄?
(c) Consider a random sample of three observations, where X1 , X2 , and X3 are i.i.d. random variables with the pmf
above. What is the sampling distribution of the sample median?
2. Suppose the probability of a recession (R = 1) in the United States in any given year is 10% and that the realizations
of the Bernoulli random variable R ∼ Bernoulli(0.1) in different years are independent. Consider a period of 10
consecutive years (n = 10) with realizations R1 , R2 , …, R10 .
(a) What is the sampling distribution of R̄, the sample proportion of 10 years in which there is a recession? What
are the mean and the standard deviation of the sampling distribution?
(b) What is the sampling distribution of T = ∑ᵢ₌₁¹⁰ Ri, the total number of recession years? What are the mean and
the standard deviation of the sampling distribution? What is P(T ≥ 2)?
3. Let X ∈ {1, 2, 3, 4, 5, 6} be the random variable associated with the roll of a fair die.
(a) What is the sampling distribution associated with the total of two independent rolls of a fair die?
(b) What is the sampling distribution associated with the maximum of two independent rolls of a fair die?
(c) Conduct 100,000 simulations in R and draw the histogram that approximates the sampling distribution
associated with the total of five independent rolls of a fair die.
(d) Conduct 100,000 simulations in R and draw the histogram that approximates the sampling distribution
associated with the maximum of five independent rolls of a fair die.
(e) What are the mean and variance of the random variable associated with the sum of 100 independent rolls of a
fair die?
(f) What are the mean and variance of the random variable associated with the average of 100 independent rolls
of a fair die?
4. A credit card company knows that the monthly balance X of a representative customer is normally distributed:
X ∼ N(300, 2500).
Assume that the monthly balances of customers are independent draws from X.
(a) Let X1 , X2 , …, X100 denote the monthly balances for 100 randomly chosen customers.
i. What is the distribution of X̄, the average monthly balance of the 100 customers?
ii. Determine a 99% probability interval for X̄.
(b) The credit card company considers a customer to be a “low-balance customer” if she has a monthly balance
below $200. Let L be an indicator variable equal to 1 for a low-balance customer and 0 otherwise.
i. What is the distribution of L?
ii. For 20 randomly chosen customers, what is the probability of at least one low-balance customer?
iii. If L1, L2, …, L100 denote the "low-balance" random variables for 100 randomly chosen customers,
what is the distribution of the sample proportion L̄ = (1/100) ∑ᵢ₌₁¹⁰⁰ Li?
5. Three firms have annual profits (in millions of dollars), denoted X1 , X2 , and X3 , that are log-normally distributed.
Assume that
ln X1 ∼ N(0, 1), ln X2 ∼ N(0.5, 0.25), and ln X3 ∼ N(0.75, 0.16)
are independent of each other. Based on 100,000 simulations in R, draw a histogram (with 100 bins) and a smoothed
density corresponding to the distribution of X1 + X2 + X3 , the sum of annual profits. How does the average of the
simulated draws of X1 + X2 + X3 compare to the population mean of X1 + X2 + X3 ?
6. Consider i.i.d. random variables X1 , X2 , …, Xn drawn from a N(µ, 4) distribution.
(a) For n = 10, what is P(|X̄ – µ| < 0.1) (i.e., the probability that X̄ is within 0.1 of µ)?
(b) What is the smallest value of n that guarantees P(|X̄ – µ| < 0.1) > 0.95?
(c) If you instead want P(|X̄ – µ| < 0.1) > 0.90, would you require a larger or smaller n as compared to (b)?
(d) If the random variables were instead drawn from a N(µ, 9) distribution and you want P(|X̄ – µ| < 0.1) > 0.95,
would you require a larger or smaller n as compared to (b)?
7. In the population, IQ (“intelligence quotient”) scores are normally distributed with a mean of 100 and a standard
deviation of 15. You intend to obtain IQ scores for a random sample of 20 individuals, from which you will calculate
the sample average and the sample standard deviation.
(a) For a single individual, what is the probability that the IQ score is greater than 105?
(b) For a sample of 20 individuals, what is the probability that the sample average is greater than 105?
(c) For a sample of 20 individuals, provide an 80% probability interval for the sample standard deviation. That is,
what are the values a and b satisfying P(sX < a) = 0.1 and P(sX > b) = 0.1, so that P(a ≤ sX ≤ b) = 0.8?
(d) Conduct simulations in R using the assumed N(100, 15²) distribution and n = 20. Specifically, use 100,000
simulations, where for each simulation you draw an i.i.d. sample of size n = 20 from the assumed distribution.
i. Draw a (density) histogram of the 100,000 sample averages.
ii. What is the standard deviation of the 100,000 sample averages?
iii. Draw a (density) histogram of the 100,000 sample standard deviations.
iv. What is the standard deviation of the 100,000 sample standard deviations?
v. What is the proportion of the 100,000 simulations for which the sample average is greater than 105?
Does this proportion approximate the exact probability from (b)?
For this question, use the following fact about independent Poisson random variables: If X1 , X2 , …, Xk are independent
Poisson random variables with Poisson parameters λ1 , λ2 , …, λk , respectively, the sum X1 + X2 + · · · + Xk is a
Poisson(λ1 + λ2 + · · · + λk ) random variable.
(a) What is the sampling distribution of the total number (over all ten companies) of new drug discoveries in a
given year?
(b) What is the sampling distribution of the average number (per company) of new drug discoveries in a given
year?
(c) Conduct 10,000 simulations in R to approximate the sampling distribution of the sample median of the total
number of new drug discoveries in a given year. Draw the histogram and calculate the mean and standard
deviation of the sampling distribution.
10. Consider a sample of n = 100 observations, where the underlying random variables X1 , X2 , …, X100 are i.i.d. uniform
U(0, 1) random variables.
(a) Conduct 10,000 simulations in R to approximate the sampling distribution of the sample interquartile range
IQRX . Draw the density and report the mean and standard deviation of the sampling distribution. Is the mean
of the sampling distribution close to what you expected? (Think about the population interquartile range.)
(b) *This part considers the trimmed mean of a sample, a descriptive statistic which is defined as the sample mean
calculated on the sample after dropping the most extreme observations. For example, the 5% trimmed mean
for n = 100 is the sample mean on the 90 observations that remain after dropping the 5 largest observations
and the 5 smallest observations. Similarly, the 2% trimmed mean for n = 100 is the sample mean on the 96
observations that remain after dropping the 2 largest observations and the 2 smallest observations. Conduct
10,000 simulations in R to approximate the sampling distributions of the sample mean, the 5% trimmed mean,
and the 2% trimmed mean. Draw their densities on the same graph to compare the distributions. Calculate
the mean and standard deviation for each of the three sampling distributions, and comment on how the values
compare to each other.
11. *In an English auction, the price of an object increases as bidders continue to bid on the object. A well-known
theoretical prediction in economics is that the bidder with the highest valuation for the object wins the auction, and
the highest bid is equal to the second-highest valuation among the bidders. For instance, among a group of bidders, if
the highest and second-highest valuations of an object are $92 and $88, respectively, the prediction is that the bidder
with the highest valuation bids $88, wins the auction, and realizes a “surplus” equal to the valuation minus the bid, or
$92 – $88 = $4.
For this question, assume that a seller holds an auction for an item she values at $85, and assume the theoretical
prediction described above holds in practice. There is a group of B bidders for the object, with valuations V1 , V2 , …, VB
for the object, where each valuation Vi is an i.i.d. draw of a U(80, 100) random variable.
(a) First, consider the case of two bidders (B = 2).
i. What is the sampling distribution of the winning bid?
ii. What is the expected value of the seller’s surplus (equal to the winning bid minus $85)?
iii. What is the probability that the seller’s surplus is negative?
iv. What is the expected value of the winner’s surplus (equal to the winner’s valuation minus the winning
bid)?
(b) For B = 3, what is the sampling distribution of the winning bid?
(c) For each B value in {5, 10, 15}, conduct 100,000 simulated auctions in R in which you determine the winning
bid and the winner’s valuation for each simulated auction. Based on the results of the simulated auctions,
(i) graph the estimated distributions (pdf’s) of both the seller’s surplus and the winner’s surplus and (ii) calculate
the expected values of both the seller’s surplus and the winner’s surplus.
12. Let maxX = max(X1 , X2 , …, Xn ) be the maximum of n random variables X1 , X2 , …, Xn , as in Section 12.3.
(a) If X1 , X2 , …, Xn are i.i.d. draws from a U(0, 1) random variable, determine the probability that maxX is greater
than 0.98 as a function of n. What is the smallest n for which this probability is at least 95%?
(b) If X1 , X2 , …, Xn are i.i.d. draws from a N(0, 1) random variable, determine the probability that maxX is greater
than 3 as a function of n. What is the smallest n for which this probability is at least 95%?
(c) If X1 , X2 , …, Xn are i.i.d. rolls of a fair die, with Xi ∈ {1, 2, 3, 4, 5, 6}, what are the cdf and pmf of maxX ?
(d) If X1 , X2 , …, Xn are i.i.d. draws from a N(0, 1) random variable, what is the cdf of the minimum value minX =
min(X1 , X2 , …, Xn ) as a function of n?
(e) If X1 , X2 , …, Xn are i.i.d. draws from a U(0, 1) random variable, what is the cdf of the second-largest value as
a function of n?
Chapter 12 explored sampling distributions of statistics when the sample size n is fixed and the complete distribution
of the underlying i.i.d. random variables is known. Analytical characterizations were provided for the exact sampling
distribution in some cases (e.g., the sample mean of Bernoulli random variables, the sample mean of normal random
variables, the sample variance of normal random variables), while simulations were used for others (e.g., the standard
deviation of a uniform distribution, the sample median of a normal distribution). In this chapter, we shift focus to the
situation where the sample size n “grows large” and characterize the sampling distribution of a statistic in this context.
Two main motivations drive the consideration of large-sample or “asymptotic” sampling distributions. First,
most real-world datasets used by economists and other practitioners tend to be large, which could mean hundreds
or thousands of observations or even millions of observations. For instance, the stock-return dataset sp500 used
throughout the book has 364 observations, while the labor-force dataset cps has several thousand. These sample
sizes far exceed the “small n” examples in Chapter 12. Second, with large n, remarkable statistical results enable
the characterization of the sampling distribution of a statistic. These results typically indicate that the asymptotic
sampling distribution is a normal distribution for most statistics discussed in this book. This contrasts with the exact
sampling distributions in Chapter 12, where normality holds only in very specific cases (e.g., the sample mean of
i.i.d. normal random variables) but more generally the sampling distribution’s shape depends on the specific sample
size, the specific statistic, and the specific underlying distribution of the random variables. As such, for large samples,
we will later (in Chapter 14) apply properties of the normal distribution to further analyze the variability of a given
statistic and, when viewed as a guess or estimator of an underlying parameter, its precision.
While the mean and variance of X̄ are already known to be µX and σ²X/n, respectively, the CLT provides the much
stronger result that the asymptotic sampling distribution of X̄ is a normal distribution with those mean and variance
parameters. The CLT is a remarkable result since the underlying distribution of the random variables X1 , X2 , …, Xn can
be anything. The CLT holds whether the underlying distribution is discrete or continuous (or some combination of the
two) and whether the underlying distribution is symmetric or asymmetric.
Since a similar result for the specific case of i.i.d. normal random variables has been seen previously (Section 12.1.2),
it is important to understand the difference between that result and the general CLT result. For i.i.d. normal random
variables, with X ∼ N(µ, σ²), the exact sampling distribution is X̄ ∼ N(µ, σ²/n). The use of "∼" rather than "∼ᵃ"
indicates an exact sampling distribution rather than an asymptotic distribution. The X̄ ∼ N(µ, σ²/n) distribution holds
for any sample size, even for very small n, while the CLT result X̄ ∼ᵃ N(µX, σ²X/n) is an approximate sampling
distribution that requires large n.
The CLT requires that the sample size n is “sufficiently large,” but what does “sufficiently large” mean in practice?
Many textbooks give simple rules of thumb, like saying that n > 30 (a sample with more than 30 observations) is
sufficient to have a large sample. Unfortunately, in reality, the number of observations required for the CLT to hold (i.e.,
for the normal approximation to be accurate) depends upon the distribution of the underlying random variables. For
instance, a heavily right-skewed distribution of the random variables usually requires larger n than a nicely symmetric
distribution of the random variables. In fact, in some of the distribution graphs seen in Chapter 12, there were cases
where the distribution of X̄ looked normal for very small sample sizes (e.g., the uniform distribution) and other cases
where it did not (e.g., the log-normal distribution). For the specific case of Bernoulli random variables, which arises
frequently in practice and is discussed in more detail below, an oft-used and effective rule of thumb is that the normal
approximation works well when both nπ > 10 and n(1 – π) > 10. This Bernoulli rule of thumb requires a larger sample
size when the success probability π is closer to 0 or 1, which is the case where the Bernoulli distribution is more
asymmetric. With π = 0.5, the rule of thumb suggests having n > 20 is sufficient to use the CLT. In contrast, the rule of
thumb would suggest n > 50 for the CLT approximation when π = 0.2 or π = 0.8 and n > 100 when π = 0.1 or π = 0.9.
Example 13.1 (Restaurant franchises) Suppose a fast-food mogul owns 60 franchises of Burger Depot. The monthly
revenue, in thousands of dollars, for each of the 60 franchises is an i.i.d. random variable drawn from a distribution
with population mean µX = 20 and population standard deviation σX = 4. In a given month, the approximate
distribution of the sample average of monthly revenues at the 60 franchises, based upon the CLT, is
X̄ ∼ᵃ N(20, 4²/60).
Then, an approximate 95% probability interval for the sample average of monthly revenues at the 60 franchises is
(20 – 1.96·(4/√60), 20 + 1.96·(4/√60)) ≈ (18.99, 21.01),
and an approximate 90% probability interval for the sample average of monthly revenues at the 60 franchises is
(20 – 1.645·(4/√60), 20 + 1.645·(4/√60)) ≈ (19.15, 20.85).
If the average monthly revenue falls below $18,500 in a given month, the fast-food mogul will need a bank loan. The
probability of that happening in a given month is
P(X̄ < 18.5) = P((X̄ – 20)/(4/√60) < (18.5 – 20)/(4/√60)) = Φ((18.5 – 20)/(4/√60)) ≈ 0.00184, or 0.184%.
pnorm((18.5-20)/(4/sqrt(60)))
## [1] 0.001837806
The CLT (Proposition 13.2) provides the asymptotic distribution for the average X̄ of i.i.d. random variables. Since
the sum X1 + X2 + · · · + Xn is equal to nX̄, an immediate corollary of the CLT is the following proposition for the
asymptotic distribution of the sum of i.i.d. random variables:
Proposition 13.3. (Asymptotic distribution of the sum of i.i.d. random variables) If X1 , X2 , …, Xn are i.i.d. random
variables with finite population mean µX and finite population variance σ²X, then for sufficiently large n, the sum
S = ∑ᵢ₌₁ⁿ Xi = nX̄ is approximately normally distributed, with
S ∼ᵃ N(nµX, nσ²X).
X̄ has an approximate normal distribution by the CLT, which means that nX̄ also is approximately normal since it’s
a scaled version of X̄. From Proposition 10.14, the mean and variance of the normal distribution that approximates the
sum S = ∑ᵢ₌₁ⁿ Xi = nX̄ are n and n² times the mean and variance, respectively, of the sample mean X̄.
Example 13.2 (Restaurant franchises) Continuing Example 13.1, the approximate distribution for the total monthly
revenues at the 60 franchises is
S = 60X̄ ∼ᵃ N(60·20, 60·4²), or S ∼ᵃ N(1200, 960).
Then, an approximate 95% probability interval for the total monthly revenues at the 60 franchises is
(1200 – 1.96√960, 1200 + 1.96√960) ≈ (1139, 1261).
This interval can also be obtained by multiplying the endpoints of the interval (18.99, 21.01) for X̄, calculated in
Example 13.1, by 60.
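As a check, the endpoints of this 95% probability interval can be computed in R; a quick sketch:
1200+c(-1.96,1.96)*sqrt(960)
## [1] 1139.272 1260.728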
13.1.3 Normal approximation for the sample proportion or binomial random variable
In Section 12.1.1, the exact sampling distribution of X̄ was determined when X1 , X2 , …, Xn are i.i.d. Bernoulli(π)
random variables. Recall that X̄, the sample mean or sample proportion of successes, has an exact sampling distribution
equivalent to a Binomial(n, π) random variable scaled by 1/n, which is true for any sample size n and success
probability π. This result followed from the fact that X1 + X2 + ⋯ + Xn is, by definition, a Binomial(n, π) random
variable, so that X̄ = (1/n)(X1 + X2 + ⋯ + Xn) is a binomial random variable scaled by 1/n. The population statistics for X̄ are
µX̄ = π, σ²X̄ = π(1 – π)/n, and σX̄ = √(π(1 – π)/n).
Applying the CLT, the asymptotic distribution of X̄ is
X̄ ∼ᵃ N(π, π(1 – π)/n) or, equivalently, (X̄ – π)/√(π(1 – π)/n) ∼ᵃ N(0, 1).
For the binomial random variable Y = X1 + X2 + · · · + Xn , where by definition Y ∼ Binomial(n, π), the population
mean is µY = nπ and the population variance is σ²Y = nπ(1 – π). Since Y = nX̄, the CLT result for the sample proportion
implies
Y ∼ᵃ N(nπ, nπ(1 – π)) or, equivalently, (Y – nπ)/√(nπ(1 – π)) ∼ᵃ N(0, 1).
The rule of thumb discussed above stated that the CLT normal approximation works well when both nπ >
10 and n(1 – π) > 10. To illustrate this phenomenon, Figure 13.1 considers four different sample sizes (n = 10,
n = 20, n = 50, and n = 100) when the success probability is π = 0.2. For π = 0.2, the rule of thumb suggests the
Figure 13.1 (four panels: n = 10, n = 20, n = 50, n = 100; vertical bars show P(X̄ = v), with the normal approximation as a dotted curve)
Sampling distributions of the sample mean for i.i.d. Bernoulli(0.2) random variables
asymptotic approximation works well when n > 50. For each graph in Figure 13.1, the exact-distribution pmf for the
sample proportion X̄ is shown with vertical bars, and the dotted curve associated with the asymptotic distribution
N(0.2, 0.2(1 – 0.2)/n) is shown for comparison. For the smallest sample size (n = 10), the problem with the normal
approximation appears in the left tail, as it has positive probabilities associated with negative values for X̄. This problem
largely disappears for n = 20, and even for that sample size (which is lower than the rule-of-thumb suggestion), the
normal approximation looks pretty good. For the larger sample sizes (n = 50 and n = 100), the normal approximation
matches the pmf almost exactly.
How do we calculate probabilities or probability intervals using the normal approximation? Let’s start with a
binomial random variable. As an example, let’s say that n = 50 and π = 0.2, so that Y ∼ Binomial(50, 0.2). The true
probability of exactly 10 successes out of 50 trials is
P(Y = 10) = C(50, 10) (0.2)^10 (0.8)^40 ≈ 0.1398.
dbinom(10,50,0.2)
## [1] 0.139819
Since nπ = 10 and nπ(1 – π) = 8, the approximate sampling distribution of Y is N(10, 8). The practical issue here is
that only integer outcomes between 0 and 50 are possible, so any of the non-integer values are not actually possible.
To determine the probability P(Y = 10), then, we don’t want to evaluate the pdf of N(10, 8) at the value 10. Instead,
we assume that the continuous interval (9.5, 10.5) corresponds to the discrete outcome 10, and similarly for any other
possible outcome; (10.5, 11.5) corresponds to the outcome 11, (35.5, 36.5) corresponds to the outcome 36, and so
on. Then, if W ∼ N(10, 8) denotes the normal approximation, the probability of 10 successes based upon the normal
approximation is
P(9.5 < W < 10.5) = Φ((10.5 – 10)/√8) – Φ((9.5 – 10)/√8) ≈ 0.1403,
which is quite close to the true probability of 0.1398.
pnorm((10.5-10)/sqrt(8))-pnorm((9.5-10)/sqrt(8))
## [1] 0.1403162
This method of looking at the probability of a continuous bin to approximate the discrete-outcome probability
is known as a continuity correction. To illustrate how the continuity correction can be used for different types of
probability intervals, where the inequalities may be strict or weak, the following table considers some additional
examples based upon the Y ∼ Binomial(50, 0.2) distribution.
Event probability                           P(Y = 10)           P(6 < Y < 12)       P(6 < Y ≤ 12)       P(6 ≤ Y ≤ 12)
Probability based upon exact pmf            0.1398              0.6073              0.7105              0.7659
Normal approx. with continuity correction   P(9.5 < W < 10.5)   P(6.5 < W < 11.5)   P(6.5 < W < 12.5)   P(5.5 < W < 12.5)
Approx. probability based upon normal       0.1403              0.5941              0.7037              0.7558
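As an illustration, the P(6 < Y < 12) column can be reproduced in R; a quick sketch:
pbinom(11,50,0.2)-pbinom(6,50,0.2)                # exact pmf: about 0.6073
pnorm((11.5-10)/sqrt(8))-pnorm((6.5-10)/sqrt(8))  # normal approx.: about 0.5941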
The continuity correction can be used in a similar way for calculating probabilities and probability intervals
associated with the sample proportion X̄ for i.i.d. Bernoulli random variables. The only difference is the scale of the
outcomes associated with X̄ as compared to the binomial Y. For example, if X̄ is the sample proportion of successes
when n = 50 and π = 0.2, the possible outcomes for X̄ are {0, 1/50, 2/50, …, 49/50, 1} or {0, 0.02, 0.04, …, 0.98, 1}. For
the outcome X̄ = 0.20, the continuity correction would use the interval (0.19, 0.21) since 0.19 is midway between
0.20 and the outcome below it (0.18) and 0.21 is midway between 0.20 and the outcome above it (0.22). For the
probability P(0.20 ≤ X̄ ≤ 0.26), the continuity correction would entail using the
interval (0.19,0.27) for calculation of
q
(0.2)(0.8)
an approximate probability based upon the asymptotic normal distribution N 0.2, 50 .
Example 13.3 (College-educated adults) In the United States, the probability that an adult aged 25 to 34 has at least
a bachelor’s degree is 40%. Suppose a random sample of 100 adults aged 25 to 34 is drawn from the population.
The sample proportion X̄ with at least a bachelor’s degree, among the n = 100 individuals, has
µX̄ = 0.40, σ²X̄ = (0.4)(0.6)/100 = 0.0024, and σX̄ = √0.0024 ≈ 0.0490.
The normal approximation can be used to calculate 90% and 95% probability intervals for X̄. Without a continuity
correction, an approximate 90% probability interval for X̄ is
(µX̄ – 1.645σX̄, µX̄ + 1.645σX̄) ≈ (0.40 – 1.645(0.0490), 0.40 + 1.645(0.0490)) ≈ (0.3194, 0.4806).
For example, the exact probability that X̄ is between 0.35 and 0.45, computed from the underlying Binomial(100, 0.40) distribution, is:
pbinom(45,100,0.40)-pbinom(34,100,0.40)
## [1] 0.738573
The approximate probability, based upon the normal approximation and using a continuity correction, is
P(0.35 ≤ X̄ ≤ 0.45) ≈ P(0.345 < W < 0.455) for W ∼ N(0.4, 0.0024)
= Φ((0.455 – 0.4)/0.0490) – Φ((0.345 – 0.4)/0.0490) ≈ 0.7384.
pnorm((0.455-0.4)/0.0490)-pnorm((0.345-0.4)/0.0490)
## [1] 0.7383284
Now, consider a situation in which half of the sample (50 observations) is female and half (50 observations) is male.
If the probability that a female adult aged 25 to 34 has at least a bachelor’s degree is 44% and the probability for
a male adult aged 25 to 34 is 36%, what is the probability that the observed sample proportion of females with at
least a bachelor’s degree is greater than the observed sample proportion of males with at least a bachelor’s degree?
Letting X̄f denote the sample proportion among 50 females and X̄m denote the sample proportion among 50 males,
this probability is
P(X̄f > X̄m ) = P(X̄f – X̄m > 0).
The random variable X̄f – X̄m is a linear combination of two sample proportions, specifically the difference between
two sample proportions. While we don’t have specific results about the distribution of the difference of (scaled)
binomial random variables, we do have such results about the distribution of the difference of normal random
variables. Therefore, a normal approximation can be used for both X̄f and X̄m to significantly simplify the calculation
of P(X̄f – X̄m > 0).⁴⁰ Using the CLT, the asymptotic distributions of X̄f and X̄m are
X̄f ∼ᵃ N(0.44, (0.44)(0.56)/50) and X̄m ∼ᵃ N(0.36, (0.36)(0.64)/50).
Moreover, X̄f and X̄m are independent random variables since they are both based upon i.i.d. random variables from the
population. Using results for linear combinations of normal random variables, X̄f – X̄m is also approximately normal.
To characterize the normal distribution, the population mean and variance of X̄f – X̄m are determined as follows:
E(X̄f – X̄m ) = E(X̄f ) – E(X̄m ) = 0.44 – 0.36 = 0.08
and
Var(X̄f – X̄m) = Var(X̄f) + Var(X̄m) = ((0.44)(0.56) + (0.36)(0.64))/50 = 0.009536.
Thus,
X̄f – X̄m ∼ᵃ N(0.08, 0.009536),
which implies
P(X̄f – X̄m > 0) ≈ 1 – Φ(–0.08/√0.009536) ≈ 0.7937.
1-pnorm(-0.08/sqrt(0.009536))
## [1] 0.7936729
Example 13.4 (Political polling) Suppose a political poll is conducted, where a random sample of voters is asked
whether they intend to vote for candidate A or candidate B. Let π denote the true probability that a randomly chosen
voter from the population intends to vote for candidate A. How many voters must be polled so that the width of the
(approximate) 95% probability interval for X̄ is no greater than six percentage points wide? π is unknown here, which
is why the poll is being conducted. The approximate 95% probability interval for X̄ is
(π – 1.96√(π(1 – π)/n), π + 1.96√(π(1 – π)/n)),
so that the width of the interval is
(π + 1.96√(π(1 – π)/n)) – (π – 1.96√(π(1 – π)/n)) = 3.92√(π(1 – π)/n).
To ensure that the width of the interval is less than or equal to six percentage points, we require
3.92√(π(1 – π)/n) ≤ 0.06 or, equivalently, n ≥ (3.92/0.06)² π(1 – π).
This inequality must hold for any possible value for π. Since π(1 – π) is maximized when π = 0.5, we require
n ≥ (3.92/0.06)² (0.5)(1 – 0.5) ≈ 1067.1 or, equivalently, n ≥ 1068.
Therefore, at least 1,068 voters must be polled to get a 95% probability interval for X̄ that is no greater than six
percentage points wide.
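This sample-size calculation can be done directly in R; a quick sketch:
ceiling((3.92/0.06)^2*(0.5)*(1-0.5))
## [1] 1068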
Example 13.5 (Simulation error) In Chapter 2, computer simulations were used to illustrate the idea that a probability
is the limit to which a long-run frequency converges. For example, Figure 2.1 showed the cumulative frequency of heads
after 10,000 simulations. Since the parameter of the underlying Bernoulli(π) random variable is known to be π = 0.5,
the asymptotic sampling distribution can be used to provide information about the simulation error that would be
expected with this many simulations. Specifically, the 95% probability interval is
(0.5 – 1.96√(0.5(1 – 0.5)/10000), 0.5 + 1.96√(0.5(1 – 0.5)/10000)) ≈ (0.490, 0.510),
meaning there is a 95% probability that the observed heads frequency for 10,000 coin tosses is between 49.0%
and 51.0%.
If the number of simulations is 100,000 rather than 10,000, the 95% probability interval for the observed heads
frequency is
(0.5 – 1.96√(0.5(1 – 0.5)/100000), 0.5 + 1.96√(0.5(1 – 0.5)/100000)) ≈ (0.497, 0.503),
and, utilizing the fact that the 99.5% quantile of a N(0, 1) random variable is τZ,0.995 ≈ 2.576, a 99% probability
interval for the observed heads frequency is
(0.5 – 2.576√(0.5(1 – 0.5)/100000), 0.5 + 2.576√(0.5(1 – 0.5)/100000)) ≈ (0.496, 0.504).
This example illustrates how increasing the number of simulations leads to reduced simulation error. In fact, using the
approach of the previous example (Example 13.4), we can determine the number of simulations that are needed to get
a desired width of the probability interval. In the case of the 99% confidence interval, to get an interval that has a
width less than 0.002 (very narrow!), we need
2 × 2.576√(0.5(1 – 0.5)/n) < 0.002 or, equivalently, n > (5.152/0.002)² (0.5)(1 – 0.5) = 1,658,944.
With this many simulated coin tosses, there is a 99% probability that the observed heads frequency is within 0.001, or
0.1%, of the true heads probability of 0.5, or 50%.
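Again, the required number of simulations can be computed in R; a quick sketch:
(2*2.576*sqrt(0.5*(1-0.5))/0.002)^2
## [1] 1658944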
denominator. This shrinking variance for large n is consistent with the first property that the sample variance s²X gets
arbitrarily close to σ²X, which is the mean and center of the asymptotic normal distribution. Again, the asymptotic normal
sampling distribution given by this proposition is remarkable, as it holds regardless of the shape of the underlying
distribution of the i.i.d. random variables.
Example 13.6 (Normal random variables) When X1, X2, …, Xn are i.i.d. normal random variables, there is an exact sampling
distribution for s²X, described by (n – 1)s²X/σ² ∼ χ²ₙ₋₁, that holds for any sample size n, including very small sample sizes.
For larger samples, there is also an asymptotic distribution given by Proposition 13.4. Figure 13.2 shows both sampling
distributions (the exact distribution and the asymptotic distribution) of s²X when X1, X2, …, Xn are i.i.d. N(0, 1) random
variables for four different sample sizes (n = 10, n = 20, n = 50, and n = 100), with the exact distribution given by the
solid curve and the asymptotic distribution given by the dotted curve. There is some difference between the exact
distribution and asymptotic distribution for the smallest sample size (n = 10), with the exact distribution peaking at a
value less than one and exhibiting right skewness. At n = 20, the two distributions are much closer, with the peak for
the exact distribution just slightly lower than one. The two distributions are extremely close to each other at n = 50
and virtually identical at n = 100, suggesting the asymptotic normal approximation works well for sample sizes of 50
and larger. That said, even at n = 20, the normal approximation does pretty well in approximating the exact sampling
distribution of s²X.
Example 13.7 (Uniform random variables) In Example 12.7, simulation methods approximated the sampling
distribution of s²X for i.i.d. U(0, 1) random variables for some small sample sizes, and even with just ten observations
(n = 10) the sampling distribution appeared symmetric and bell-shaped. For i.i.d. U(0, 1) random variables, the
appropriate asymptotic distribution can be derived by determining the mean and variance of the normal distribution
given in part (ii) of Proposition 13.4. First, the mean of the distribution is σ²X, which is 1/12 for X ∼ U(0, 1). Second, the
variance of the distribution is (E((X – µX)⁴) – (σ²X)²)/n, which can be evaluated by determining E((X – µX)⁴):
E((X – µX)⁴) = ∫₀¹ (x – 0.5)⁴ dx = [(x – 0.5)⁵/5]₀¹ = 0.00625 – (–0.00625) = 0.0125 = 1/80.
Plugging into the variance expression yields
(E((X – µX)⁴) – (σ²X)²)/n = (1/80 – (1/12)²)/n = 1/(180n),
so that the asymptotic distribution is
s²X ∼ᵃ N(1/12, 1/(180n)).
As an example, for n = 50, the distribution is s²X ∼ᵃ N(1/12, 1/9000), so that an approximate 95% probability interval for
s²X is (1/12 – 1.96/√9000, 1/12 + 1.96/√9000) ≈ (0.0627, 0.1040). There is approximately a 95% probability that the sample
variance will be between 0.0627 and 0.1040.
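This asymptotic approximation can be checked by simulation; a minimal sketch for n = 50 (the object name is illustrative):
set.seed(1234)
varunif_50 <- replicate(100000, var(runif(50)))
mean(varunif_50)  # should be close to 1/12 ≈ 0.0833
sd(varunif_50)    # should be close to sqrt(1/9000) ≈ 0.0105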
Proposition 13.5. If X1, X2, …, Xn are i.i.d. random variables with finite population mean µX, finite population
variance σ²X, and E((X – µX)⁴) < ∞, then
(i) sX gets arbitrarily close to σX as n → ∞
Figure 13.2 (four panels: n = 10, n = 20, n = 50, n = 100; each plots fs²X(v) against v, with the exact distribution solid and the asymptotic distribution dotted)
Sampling distributions of the sample variance for i.i.d. N(0, 1) random variables
quantify this difference, probability intervals for the two statistics can be formed for a chosen sample size. With
100 observations (n = 100), the asymptotic standard deviation of X̄ is 1/√100 = 0.1, so that a 95% probability interval
for X̄ is (–0.196, 0.196); the asymptotic standard deviation of X̃0.5 is √(1.5711/100) ≈ 0.1253, so that a 95% probability
interval for X̃0.5 is (–1.96(0.1253), 1.96(0.1253)) ≈ (–0.246, 0.246). In the thought experiment where many different
100-observation samples are drawn from the population, this result says that the realizations of the sample mean (the
x̄ values) tend to be slightly closer to the true center of the distribution (at zero) than the realizations of the sample
median (the x̃0.5 values). Over many 100-observation samples, the sample mean is between –0.196 and 0.196 for 95%
of the samples, whereas the sample median is between –0.246 and 0.246 for 95% of the samples. So, in the case of the
N(0, 1) distribution, the sample mean provides a more precise measure of the center of the distribution. On the other
hand, recall that the sample median is a more robust measure of the center of the distribution since it is less affected
by outliers than the sample mean. As a result, there is a precision-robustness tradeoff here, with the sample mean
being more precise and less robust and the sample median being less precise and more robust. The interested reader
can show that this idea generalizes to other normal random variables (that is, X1, X2, …, Xn i.i.d. N(µ, σ²) random
variables), with the asymptotic variance of the sample mean being larger than the asymptotic variance of the sample
median.
Since other quantiles of a distribution may be of interest, we generalize Proposition 13.6 to other sample quantiles.
For i.i.d. random variables, any sample quantile gets arbitrarily close to its corresponding population quantile in large
samples and has an asymptotic distribution that is normal:
Proposition 13.7. If X1 , X2 , …, Xn are i.i.d. continuous random variables with pdf fX (·) and population quantiles τX,q ,
then for any q ∈ (0, 1),
(i) X̃q gets arbitrarily close to τX,q as n → ∞
(ii) for sufficiently large n, X̃q is approximately normally distributed:
X̃q ∼ᵃ N( τX,q , q(1 – q)/(n fX(τX,q)²) ).
The sample median (q = 0.5) is a special case of Proposition 13.7, with q(1 – q) = 1/4 leading to the asymptotic variance
in Proposition 13.6. Beyond the sample size n, which enters in the usual 1/n form, both q and fX (τX,q ) affect the value of
the asymptotic variance, so that the asymptotic variance generally varies as q varies. The following example illustrates
how the asymptotic distributions and probability intervals can be determined at different quantiles.
Example 13.10 (Normal random variables) Continuing Example 13.9, suppose X1 , X2 , …, Xn are i.i.d. N(0, 1) random
variables, and again consider a sample of 100 observations (n = 100). Using the asymptotic variance formula from
Proposition 13.7, the asymptotic standard deviation for any quantile q is
√(q(1 – q)) / (√100 · φ(τX,q)).
The 95% probability interval for X̃q can be constructed as the true quantile τX,q plus or minus 1.96 times the asymptotic
standard deviation. The following table shows the asymptotic standard deviations and 95% probability intervals for
five different quantiles (q = 0.1, q = 0.25, q = 0.5, q = 0.75, and q = 0.9)
q      τX,q      φ(τX,q)   √(q(1–q))/(√100·φ(τX,q))   95% interval for X̃q
0.10   –1.2816   0.1755    0.1709                     (–1.617, –0.947)
0.25   –0.6745   0.3178    0.1363                     (–0.942, –0.407)
0.50    0.0000   0.3989    0.1253                     (–0.246, 0.246)
0.75    0.6745   0.3178    0.1363                     (0.407, 0.942)
0.90    1.2816   0.1755    0.1709                     (0.947, 1.617)
For q = 0.5, the values in the table correspond to those in Example 13.9. While q(1 – q) is largest at q = 0.5, the
asymptotic standard deviation for q = 0.5 is actually the smallest among those shown, which arises since the value of
the pdf φ(0) is much larger than the pdf φ(·) evaluated at the other quantiles. The asymptotic standard deviations are
largest at the extreme quantiles (q = 0.1 and q = 0.9), so the probability intervals for X̃0.1 and X̃0.9 are also the widest.
Due to the symmetry of the N(0, 1) distribution, the asymptotic standard deviation for q = 0.1 is the same as that for
q = 0.9, and similarly the asymptotic standard deviation for q = 0.25 is the same as that for q = 0.75. As a result, the
widths of the probability intervals for X̃0.1 and X̃0.9 are the same, as are the widths of the probability intervals for X̃0.25
and X̃0.75 .
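The entries in the table can be reproduced in R using the qnorm and dnorm functions; the following sketch (not from the companion scripts) computes the asymptotic standard deviations and interval endpoints for all five quantiles at once:
q <- c(0.1, 0.25, 0.5, 0.75, 0.9)
tau <- qnorm(q)                               # population quantiles of N(0,1)
se <- sqrt(q*(1-q))/(sqrt(100)*dnorm(tau))    # asymptotic standard deviations
cbind(q, tau, lower = tau - 1.96*se, upper = tau + 1.96*se)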
How about the interquartile range? Recall that the sample statistic IQRx is the difference between the 75% sample
quantile x̃0.75 and the 25% sample quantile x̃0.25 . The associated random variable is X̃0.75 – X̃0.25 for underlying
i.i.d. random variables X1 , X2 , …, Xn , with the IQRx statistic being a realization of X̃0.75 – X̃0.25 for the sample that
happens to be observed. Then, X̃0.75 – X̃0.25 has an asymptotic normal distribution since, from Proposition 13.7, both
X̃0.25 and X̃0.75 have asymptotic normal distributions. The following proposition gives the specific asymptotic normal
distribution associated with X̃0.75 – X̃0.25 :41
Proposition 13.8. If X1 , X2 , …, Xn are i.i.d. continuous random variables with pdf fX (·) and population quantiles τX,q ,
then
(i) X̃0.75 – X̃0.25 gets arbitrarily close to τX,0.75 – τX,0.25 as n → ∞
(ii) for sufficiently large n, X̃0.75 – X̃0.25 is approximately normally distributed:
X̃0.75 – X̃0.25 ∼ᵃ N( τX,0.75 – τX,0.25 , (1/(16n)) · [ 3/fX(τX,0.25)² + 3/fX(τX,0.75)² – 2/(fX(τX,0.25)·fX(τX,0.75)) ] ).
Example 13.11 (Uniform random variables) Suppose X1 , X2 , …, Xn are i.i.d. U(0, 1) random variables, so that
τX,0.25 = 0.25 and τX,0.75 = 0.75. Then, since fX (v) = 1 for all v ∈ (0, 1), the asymptotic distribution of X̃0.75 – X̃0.25 is
X̃0.75 – X̃0.25 ∼ᵃ N(0.5, 1/(4n)).

For a sample of 100 observations (n = 100), the asymptotic variance is 1/400 and the asymptotic standard deviation is √(1/400) = 0.05, so that a 95% probability interval for X̃0.75 – X̃0.25 is (0.5 – 1.96(0.05), 0.5 + 1.96(0.05)) = (0.402, 0.598).
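As a rough check, the following simulation sketch (not from the companion scripts) draws repeated U(0, 1) samples with n = 100 and examines the realized interquartile ranges; R's default quantile function uses a slightly different sample-quantile definition, so small discrepancies are expected:
set.seed(13)
iqr <- replicate(100000, {
  x <- runif(100)
  quantile(x, 0.75) - quantile(x, 0.25)   # realized sample IQR
})
c(mean(iqr), sd(iqr))                     # close to 0.5 and 0.05
mean(iqr > 0.402 & iqr < 0.598)           # coverage close to 0.95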
For sample size n, let {(X1 , Y1 ), (X2 , Y2 ), …, (Xn , Yn )} denote the random variables associated with the n draws of
bivariate data from the population. For any realized sample, the sample correlation rxy can be calculated. Denote the
associated random variable as rXY , where rxy is the realization of rXY that arises from the particular sample that happens
to be observed. The following proposition formally states that the sample correlation rXY gets arbitrarily close to the
population correlation ρXY in large samples and has an asymptotic normal distribution:
Proposition 13.9. If (X1 , Y1 ), (X2 , Y2 ), …, (Xn , Yn ) are i.i.d. bivariate random variables with population
correlation ρXY , then
(i) rXY gets arbitrarily close to ρXY as n → ∞
(ii) for sufficiently large n, rXY is approximately normally distributed:
rXY ∼ᵃ N( ρXY , (1 – ρ²XY)²/n ).
n
Interestingly, the asymptotic distribution of the sample correlation depends only on the sample size n and the
population correlation ρXY and not on any other feature of the joint distribution of X and Y. When the population
correlation is zero (ρXY = 0), the asymptotic distribution of the sample correlation simplifies to rXY ∼ᵃ N(0, 1/n), in which case a 95% probability interval for rXY is (–1.96/√n, 1.96/√n). Thus, even though the true correlation is zero, the observed
sample correlation will not be exactly zero, except by some rare coincidence, but a probability interval for rXY can
be quantified. For n = 100, the 95% probability interval for rXY is (–0.196, 0.196) when ρXY = 0; for n = 400, the 95%
probability interval for rXY is (–0.098, 0.098) when ρXY = 0; and so on.
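A quick simulation sketch (not from the companion scripts) illustrates this property for independent standard normal pairs with n = 100:
set.seed(7)
r <- replicate(10000, cor(rnorm(100), rnorm(100)))  # rho_XY = 0 by construction
mean(abs(r) < 1.96/sqrt(100))                       # close to 0.95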
Example 13.12 (Bivariate normal random variables) This example considers two random variables X and Y that
have a bivariate normal distribution, which is the case when aX + bY is normal for any values a and b. For bivariate
normal random variables X and Y, the marginal distributions of X and Y are both normal (plugging in a = 1, b = 0
for the former and a = 0, b = 1 for the latter). Denoting the marginal distributions as X ∼ N(µX , σX2 ) and Y ∼ N(µY , σY2 )
and the population correlation between X and Y as ρXY , computer simulations can be used to determine the exact
sampling distribution of rXY since most statistical packages provide the ability to draw (x, y) randomly from bivariate
normal random variables X and Y.42 For simplicity, consider the case where the marginal distributions of X and Y are
both standard normal, so that µX = µY = 0 and σX = σY = 1. Figure 13.3 compares the simulation-based exact sampling
distribution of rXY with the asymptotic sampling distribution of rXY for two different sample sizes (n = 50 and n = 100)
and three different population correlation values (ρXY = 0, ρXY = 0.4, and ρXY = 0.8). For the simulation-based exact
sampling distributions, 100,000 simulations are used, and the solid black lines in the graphs show the density plot
of the realized rxy values.43 As a comparison, the dotted curves show the asymptotic normal distribution given in Proposition 13.9, which is rXY ∼ᵃ N(0, 1/n) for ρXY = 0, rXY ∼ᵃ N(0.4, (0.84)²/n) for ρXY = 0.4, and rXY ∼ᵃ N(0.8, (0.36)²/n) for
ρXY = 0.8. Even with a sample size of n = 50, the asymptotic distributions appear to be quite close to the exact sampling
distributions, although there are some small differences for ρXY = 0.4 and slightly bigger differences for ρXY = 0.8.
Statisticians have previously documented that it takes larger samples for the asymptotic distribution to provide a good
approximation when the population correlation ρXY is very large, which is consistent with the evidence from the n = 50
graphs in the top row. At the larger sample size of n = 100, there is still a slight discrepancy between the exact sampling
distribution and the asymptotic distribution for ρXY = 0.8.
For this example, knowing the specific type of distribution (i.e., the bivariate normal) is only important for being
able to simulate the exact sampling distributions of rXY . Even without knowing the form of the joint distribution
of X and Y, Proposition 13.9 provides the large-sample sampling distribution from just the population correlation
ρXY. For example, in the case of a sample with 100 observations (n = 100) and ρXY = 0.4, the asymptotic distribution of rXY is rXY ∼ᵃ N(0.4, (0.84)²/100), meaning the asymptotic standard deviation is √((0.84)²/100) = 0.084 and a 95% probability interval for rXY is (0.4 – (1.96)(0.084), 0.4 + (1.96)(0.084)) ≈ (0.235, 0.565). Thus, over the possible 100-observation
i.i.d. samples that can be drawn from the population (with ρXY = 0.4), there is a 95% probability that the realized
sample correlation rxy is between 0.235 and 0.565. For n = 100 and ρXY = 0.8, the asymptotic distribution of rXY is
rXY ∼ᵃ N(0.8, (0.36)²/100), meaning the asymptotic standard deviation is √((0.36)²/100) = 0.036 and a 95% probability interval for rXY is (0.8 – (1.96)(0.036), 0.8 + (1.96)(0.036)) ≈ (0.729, 0.871).

Figure 13.3
Sampling distributions of the sample correlation for bivariate normal random variables (densities frXY(v) against v for n = 50 (top row) and n = 100 (bottom row), with ρXY = 0, 0.4, and 0.8 across the columns)
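For readers who want to replicate the simulation-based distributions, a minimal sketch (assuming the MASS package is available for its mvrnorm function) is:
library(MASS)
set.seed(1)
rho <- 0.4
Sigma <- matrix(c(1, rho, rho, 1), 2, 2)   # bivariate standard normal
r <- replicate(10000, {
  xy <- mvrnorm(100, mu = c(0, 0), Sigma = Sigma)
  cor(xy[,1], xy[,2])
})
sd(r)                          # close to (1-rho^2)/sqrt(100) = 0.084
quantile(r, c(0.025, 0.975))   # close to (0.235, 0.565)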
Figure 13.4
Sampling distributions of the sample maximum for i.i.d. N(0, 1) random variables (densities fmaxX(v) for n = 5000 and n = 20000)
and n = 30). But what happens as the sample size n gets large? Does the asymptotic distribution of maxX look normal?
Figure 13.4 shows the exact sampling distributions, based upon the formulas from Example 12.9, for the much larger
sample sizes of n = 5000 and n = 20000. Even with these very large sample sizes, the sampling distributions of maxX
are asymmetric and right-skewed, suggesting that the asymptotic sampling distribution is non-normal. In fact, it is
known that maxX has an asymptotic distribution known as a Gumbel distribution rather than a normal distribution.
Notes
38 A more formal statement of the LLN requires a concept known as convergence in probability. The mathematical condition corresponding to the phrase “gets arbitrarily close to µX” is that, for any ε > 0,

lim n→∞ P(|X̄ – µX| < ε) = 1.

No matter how small ε is, there is always a large enough sample such that there is a probability arbitrarily close to 1 that the distance between X̄ and µX is less than ε.
39 As with the LLN, a formal statement of the CLT requires more advanced statistical concepts. Specifically, a commonly used version of the CLT uses the concept of convergence in distribution and states that the random variable √n(X̄ – µX) converges in distribution to the normal distribution N(0, σX²).
40 Alternatively, to use the exact distributions of X̄f and X̄m (i.e., the scaled binomial random variables), simulation methods can be used to approximate P(X̄f > X̄m).
41 The mean of the asymptotic distribution is E(X̃0.75) – E(X̃0.25) = τX,0.75 – τX,0.25. The random variables X̃0.25 and X̃0.75 are not independent. The asymptotic variance is the sum of the asymptotic variances of X̃0.25 and X̃0.75, each of which is obtained from Proposition 13.7, minus two times the asymptotic covariance. The asymptotic covariance of X̃0.25 and X̃0.75 is 1/(16n·fX(τX,0.25)·fX(τX,0.75)).
42 For bivariate normal random variables X and Y, where the marginal distributions are X ∼ N(µX, σX²) and Y ∼ N(µY, σY²), the joint pdf is

fXY(x, y) = [1 / (2πσXσY√(1 – ρ²XY))] · exp{ –[1/(2(1 – ρ²XY))] · [ ((x – µX)/σX)² + ((y – µY)/σY)² – 2ρXY(x – µX)(y – µY)/(σXσY) ] }.
Exercises
1. 50 individuals are randomly selected from a population in which the probability of owning a dog is 25%.
(a) Based on the binomial distribution, what is the probability that the sample proportion of dog owners is strictly
greater than 30%?
(b) Based on the binomial distribution, what is the probability that the sample proportion of dog owners is between
25% and 35%?
(c) Based on the normal-distribution approximation to the binomial distribution, what is the probability that the
sample proportion of dog owners is between 25% and 35%? Use the continuity correction.
2. Major League Baseball teams play 162 games during the regular season. A team’s final winning percentage is equal
to their total number of wins divided by 162. Assume that any given team has a “true” win probability and that their
wins/losses are 162 i.i.d. draws from a Bernoulli random variable with this win probability.
(a) If team A has win probability πA = 0.60, what is the asymptotic distribution of team A’s winning percentage
WA ?
(b) If team A has win probability πA = 0.60 and team B has win probability πB = 0.55, what is the asymptotic
distribution of the difference in winning percentages, WA – WB , for the two teams? (Assume that WA and WB
are independent of each other.)
(c) Based upon your answer to (b), what is P(WB > WA )?
(d) If you look at winning percentages halfway through the season (after 81 games) rather than at the end of the
season, what is the probability that team B’s winning percentage is higher than team A’s winning percentage?
(e) Conduct 100,000 simulations in R to approximate the probabilities in (c) and (d) using the exact sampling
distributions (based on the binomial) of the winning percentage. Use a strict inequality in the simulations so
that equal winning percentages are not counted.
(f) Returning to asymptotic distributions, how many games in a season would be required to ensure that P(WA >
WB ) is at least 99%?
3. Consider a two-candidate election between candidates A and B. The probability that an older registered voter (aged
65 or over) favors candidate A is 70%. The probability that a younger registered voter (under age 65) favors candidate B
is 56%. 200 older voters and 1,000 younger voters turn out for the election. Assume that the voters that turn out for
the election are randomly drawn from the subpopulations of registered voters.
(a) What is the asymptotic distribution associated with the sample proportion of votes for candidate A?
(b) What is the approximate 95% probability interval for the sample proportion of votes for candidate A?
(c) What is the approximate probability that candidate A wins the election?
4. A worker at a data-entry company enters numbers into a spreadsheet and makes errors at a 0.1% rate; that is, the
probability that the worker makes an error for any given spreadsheet cell is π = 0.1% or π = 0.001. Consider the number
of errors, given by the random variable X, that the worker makes when entering 50,000 cells of data. Assume that the
underlying Bernoulli(π) trials are i.i.d.
(a) What is the exact sampling distribution of X?
(b) Using the exact sampling distribution, what is P(45 ≤ X ≤ 55)?
(c) What is the asymptotic distribution of X?
(d) Using the asymptotic distribution and the continuity correction, what is P(45 ≤ X ≤ 55)?
(e) Using the asymptotic distribution, provide a 95% probability interval for X.
(f) Another worker at the company makes errors at a 0.11% rate, slightly higher than the worker described above.
Again assume that the underlying Bernoulli trials are i.i.d.
i. Using the asymptotic distributions, what is the probability that this worker makes more errors than the
other worker if both enter 50,000 cells of data?
ii. Using the asymptotic distributions, what is a 95% probability interval for the total number of errors for
the two workers?
5. Frank’s Factory produces computer chips on two separate production lines. The output of each production
line is 10,000 computer chips per day, with a 2% probability that any given chip is defective. Assume that the
quality/defectiveness of each computer chip is independent.
(a) What is the asymptotic distribution of the sample proportion of defects among the 20,000 chips produced on a
given day?
(b) What is the asymptotic distribution of the difference between the total number of defects on one production
line and the total number of defects on the other production line?
(c) What is the approximate probability that the magnitude of the difference in (b) is greater than 10? (Do not
worry about a continuity correction.)
(d) How would your answer to (c) change if one production line has a defect probability of 2.1% instead of 2%?
6. For a random sample of 400 unemployed workers drawn from the population, the duration of unemployment (in
weeks) for each worker is an i.i.d. draw of a random variable X. Given the large sample size, you can assume that the
CLT implies that X̄ has an approximately normal distribution.
(a) If P(X̄ > 21) = 0.5, what is E(X)?
(b) If P(X̄ > 21) = 0.4, what can be said about E(X)?
(c) If E(X) = 20.7 and σX = 11.2, what is P(X̄ > 21)?
(d) How would the answer to (c) change if n = 1600 rather than n = 400?
7. The Air Quality Index (AQI) is used by the U.S. Environmental Protection Agency (EPA) as an overall measure of
air quality. The AQI has a scale of 0 to 500, with lower values for better air quality. For instance, the range 0 to 50 is
considered “good,” the range 51 to 100 is considered “moderate” (acceptable air quality, minimal risk), the range 101
to 150 is considered “unhealthy for sensitive groups,” and higher values indicate even more unhealthy air quality.
Suppose 52 weekly AQI measures are taken during the year in both Augusta, Maine and Los Angeles, California.
Assume that all AQI measures are independent of each other, with Augusta’s measures i.i.d. draws from the QA
random variable with expected value 25 and standard deviation 15 and Los Angeles’s measures i.i.d. draws from the
QL random variable with expected value 45 and standard deviation 25.
(a) What is the asymptotic distribution of Q̄A , the sample average of AQI for Augusta over 52 weeks?
(b) What is the asymptotic distribution of Q̄L , the sample average of AQI for Los Angeles over 52 weeks?
(c) What is the asymptotic distribution of Q̄L – Q̄A ?
(d) Are you able to say anything about the probability P(QA > QL ) in a given week? Explain why or why not.
8. Cindy’s Cereals sells cereal in 20-ounce boxes. Its manufacturing process leads to actual weights (in ounces) that are
i.i.d. draws from the random variable X ∼ N(20, 1/900). Suppose it repackages any box weighing less than 19.9 ounces.
(a) What is the probability that any given box is repackaged?
(b) For a manufacturing run of 20,000 boxes, what is the approximate (normal) distribution of the number of boxes
that are repackaged?
(c) The profits per box are P1 for boxes weighing at least 19.9 ounces and P2 for boxes weighing less than 19.9
ounces, where P1 > P2 due to the repackaging required for the latter. For a manufacturing run of 20,000 boxes,
what is the approximate (normal) distribution of total profits in terms of P1 and P2 ?
9. *Allison’s Apparel, a women’s clothing store, is open eight hours each day. On any given day, it is known that the
arrival time for the next customer has an expected value of 5 minutes and a population standard deviation of 3 minutes.
(Arrival time is measured since store opening for the first customer and since the last customer’s arrival for each
subsequent customer.) Assume that all arrival times are independent of each other. Let T denote the random variable
associated with the total number of customers that shop at Allison’s Apparel on a given day.
(a) Ignore for a moment that the store eventually closes. What is the asymptotic distribution of the average arrival
time (in minutes) for n customers? What is the asymptotic distribution of the total amount of time (in minutes)
that it takes n customers to arrive?
(b) What is the approximate probability that the total time it takes 100 customers to arrive is less than 480 minutes?
(c) Explain why the probability in (b) is equal to P(T ≥ 100).
(d) Using the same reasoning as in (b) and (c), approximate P(T ≥ 101) and P(T = 100) = P(T ≥ 100) – P(T ≥ 101).
(e) The pmf of T, evaluated at 100, was determined in (d). Using the same reasoning, plot the pmf of T over the
range of values {80, 81, …, 119, 120} in R.
10. The number of visitors X to a popular website during the one minute between 10:00am and 10:01am is a Poisson
random variable with λ = 120.
(a) Thinking of X as the sum of 60 i.i.d. random variables Y1 , Y2 , …, Y60 ∼ Poisson(2), where each Yi is the number
of visitors in a given second, what is the asymptotic normal distribution associated with X?
(b) Using the asymptotic normal distribution, provide an approximate 90% probability interval for X.
(c) Calculate the exact probability, based on the Poisson(120) distribution, that X is within the interval from (b).
11. Use Proposition 13.6 and Proposition 13.7 for this question.
(a) What is the asymptotic distribution of the sample median X̃0.5 if X ∼ U(0, 1) and n = 400?
(b) What is the asymptotic distribution of the sample median X̃0.5 if X ∼ U(a, b) and n = 400?
(c) What is the asymptotic distribution of the sample 75% quantile X̃0.75 if X ∼ U(0, 1) and n = 400? Provide an
approximate 90% probability interval for X̃0.75 .
12. Use Proposition 13.8 to determine a 95% probability interval for X̃0.75 – X̃0.25 when X ∼ N(0, 1) and n = 100.
13. *As mentioned in Section 13.3.2, there are some concerns with the approximation provided by the asymptotic
distribution of the sample correlation,
rXY ∼ᵃ N( ρXY , (1 – ρ²XY)²/n ),
especially when the magnitude of the population correlation, |ρXY |, is large. In particular, when n is not large, the
actual sampling distribution of rXY may be quite asymmetric for large |ρXY |. An alternative approach, proposed by
statistician R. A. Fisher in a 1915 paper in Biometrika and known as the Fisher transformation, can provide more
accurate confidence intervals for rXY . The idea is to consider the asymptotic distribution of the following (increasing)
function of rXY ,
(1/2)·ln( (1 + rXY)/(1 – rXY) ),
rather than rXY itself. The asymptotic distribution of the Fisher transformation of rXY is
(1/2)·ln( (1 + rXY)/(1 – rXY) ) ∼ᵃ N( (1/2)·ln( (1 + ρXY)/(1 – ρXY) ), 1/n ).
(a) Provide a 95% probability interval for (1/2)·ln((1 + rXY)/(1 – rXY)) in terms of n if ρXY = 0.
(b) Suppose a probability interval for (1/2)·ln((1 + rXY)/(1 – rXY)) has been calculated to be (L, U). That is,

P( L ≤ (1/2)·ln((1 + rXY)/(1 – rXY)) ≤ U ) = p
for some probability p. Using this probability expression, construct a probability interval (rL , rU ) for rXY , with
P(rL ≤ rXY ≤ rU ) = p. (Hint: Exponentiate the three quantities within the probability.)
(c) How does the 95% probability interval based on the Fisher transformation compare to the 95% probability
interval based on the original rXY distribution when ρXY = 0 and n = 100?
(d) How does the 95% probability interval based on the Fisher transformation compare to the 95% probability
interval based on the original rXY distribution when ρXY = 0.85 and n = 40?
14 Estimation and confidence intervals

Building upon the concept of sampling distributions from Chapters 12 and 13, this chapter introduces estimation,
which involves the use of a statistic as a guess or estimate of an underlying quantity of interest. As an example, the
sample mean might be used to estimate the true population mean in the usual situation where the population mean is
unknown. The sample mean has been introduced as a statistic based upon an observed sample, and it can be viewed
as playing two different roles, first as a descriptive statistic for the observed sample and second as an estimate of
the unknown population mean. This idea is formalized below and generalized to other statistics that may be used for
estimation purposes. In addition, to quantify the precision associated with a given estimate of an underlying quantity of
interest, this chapter also introduces confidence intervals, which provide a range of plausible values for the quantity of
interest based upon the estimation procedure and some pre-specified confidence level. In the case of the sample mean,
for instance, methods to construct confidence intervals for the unknown population mean, based upon the realization
of the sample mean, are introduced.
estimator gives the right answer (the estimand). More precisely, over all of the possible samples of size n that can be
drawn from the population, the expected value of the sample mean X̄ is equal to the population mean µX . Similarly,
over all of the possible samples of size n that can be drawn from the population, the expected value of the sample
variance s²X is equal to the population variance σX². The 1/(n – 1) scaling for s²X is required to make it an unbiased estimator, as it would be biased if the scaling were 1/n instead.
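A brief simulation sketch (not from the companion scripts) illustrates the difference between the two scalings, using i.i.d. N(0, 4) draws so that the true variance is 4:
set.seed(123)
n <- 10
s2 <- replicate(100000, {
  x <- rnorm(n, mean = 0, sd = 2)
  c(var(x), sum((x - mean(x))^2)/n)   # 1/(n-1) scaling versus 1/n scaling
})
rowMeans(s2)   # close to 4 (unbiased) versus 4*(n-1)/n = 3.6 (biased)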
The unbiasedness of the sample mean in part (i) of Proposition 14.1 has several interesting applications to previously
discussed random variable models:
• For i.i.d. Bernoulli(π) random variables, the sample mean X̄ is an unbiased estimator of the success probability π,
which also means that nX̄ is an unbiased estimator of the population mean nπ of a Binomial(n, π) random variable.
• For i.i.d. Poisson(λ) random variables, the sample mean X̄ is an unbiased estimator of λ since µX = λ.
• For i.i.d. N(µ, σ²) random variables, the sample mean X̄ is an unbiased estimator of µ.
• For i.i.d. Exp(θ) random variables, the sample mean X̄ is an unbiased estimator of 1/θ since µX = 1/θ.
While the sample mean and sample variance are unbiased, many other estimators are not unbiased. For instance, the
sample standard deviation and the sample correlation are generally biased estimators, with E(sX) ≠ σX and E(rXY) ≠ ρXY.
In fact, among the estimators listed in the table in Section 14.1.1, only the sample mean and sample variance are
guaranteed to be unbiased estimators. While having a biased estimator might seem problematic, the bias of estimators
like sX or rXY is generally only an issue in very small samples. Rather than concerning ourselves with what happens
in small samples, a much more important property of an estimator θ̂X is that it gets close to the “right answer” (the
estimand θ) for large sample sizes. This property is known as consistency of an estimator, and the formal definition of
a consistent estimator is provided below:
Definition 14.3 An estimator θ̂X = s(X1 , X2 , …, Xn ) is a consistent estimator of θ if θ̂X gets arbitrarily close to θ as
n → ∞.
Consistency is generally considered the minimal requirement for a statistical estimator to be useful in practice. The
consistency of many estimators, including all of those listed in the table in Section 14.1.1, has already been stated
in the propositions of Chapter 13, when the large-sample or asymptotic sampling distributions of various descriptive
statistics were discussed. For example, the Law of Large Numbers (Proposition 13.1) states that the sample mean X̄, when viewed as an estimator, is a consistent estimator of the population mean µX since X̄ = (1/n)·Σⁿᵢ₌₁ Xᵢ gets arbitrarily close to µX as n → ∞. As another example, for the sample correlation, part (i) of Proposition 13.9 states that rXY is a consistent estimator of the population correlation ρXY. Thus, even though rXY may be a biased estimator of ρXY, rXY still gets arbitrarily close to ρXY as n → ∞, so that for a large sample the sample correlation is an appropriate estimator for the population correlation.
In addition to the consistency properties provided in Chapter 13 for several descriptive statistics, as estimators of
their associated population quantities, the propositions in Chapter 13 also stated that each of these descriptive statistics
has an asymptotic sampling distribution that is normally distributed. When an estimator has an asymptotic sampling
distribution that is normally distributed, the estimator is said to be an asymptotically normal estimator:
Definition 14.4 An estimator θ̂X = s(X1, X2, …, Xn) is said to be a √n-consistent and asymptotically normal estimator (or, more concisely, an asymptotically normal estimator) if

θ̂X ∼ᵃ N( θ , V/n )

for some V that does not depend on n, or equivalently

√n( θ̂X – θ ) ∼ᵃ N(0, V).
For example, from the Central Limit Theorem (Proposition 13.2), the sample mean X̄ is an asymptotically normal estimator of the population mean µX, with

X̄ ∼ᵃ N( µX , σX²/n )  or, equivalently,  √n( X̄ – µX ) ∼ᵃ N(0, σX²).
The “√n-consistent” phrase used in Definition 14.4 refers to the rate at which the estimator X̄ approaches the estimand µX. To see why that’s the case, note that the asymptotic standard deviation of X̄ is σX/√n, so that the width of any probability interval for X̄ is proportional to 1/√n. The proportionality factor 1/√n is not unique to the sample mean, but rather is a general feature of any √n-consistent and asymptotically normal estimator. From Definition 14.4, if the asymptotic variance of an estimator θ̂X is equal to V/n, where V does not depend upon n, the asymptotic standard deviation of θ̂X is equal to √V/√n, which is proportional to 1/√n.
As another example, part (ii) of Proposition 13.9 provides the asymptotic distribution of the sample correlation rXY:

rXY ∼ᵃ N( ρXY , (1 – ρ²XY)²/n )  or, equivalently,  √n( rXY – ρXY ) ∼ᵃ N(0, (1 – ρ²XY)²).

The asymptotic standard deviation of the estimator rXY is (1 – ρ²XY)/√n. As the sample size n gets large, the standard deviation of rXY shrinks toward zero, corresponding to the consistency of the sample correlation rXY, which says that rXY gets arbitrarily close to ρXY as n → ∞.
Consider, as an example, two estimators of the population mean µX: the full-sample sample mean θ̂Xa, based upon all n observations, and the half-sample sample mean θ̂Xb, based upon only the first n/2 observations. Both estimators are consistent and asymptotically normal estimators of µX, but it seems like θ̂Xa should be the preferred estimator since it is based upon more information, using all n observations as compared to the n/2 observations used for θ̂Xb. The way to quantify that θ̂Xa is a more precise estimator than θ̂Xb is to compare the asymptotic variances or the asymptotic standard deviations of the two estimators. The asymptotic variances of θ̂Xa and θ̂Xb are, respectively,

σX²/n  and  σX²/(n/2) = 2σX²/n,

and the asymptotic standard deviations of θ̂Xa and θ̂Xb are, respectively,

σX/√n  and  √2·σX/√n.
Therefore, the asymptotic standard deviation of the half-sample sample mean θ̂Xb is √2 times larger than the asymptotic standard deviation of the full-sample sample mean θ̂Xa. On this basis, the full-sample sample mean θ̂Xa should be the preferred estimator. In statistical terminology, it is said that the estimator θ̂Xa is more efficient than the estimator θ̂Xb, an idea that is stated more generally in the following definition:
Definition 14.5 If there are two asymptotically normal estimators θ̂Xa and θ̂Xb of the parameter θ, the estimator θ̂Xa
is (asymptotically) more efficient than the estimator θ̂Xb if the asymptotic variance of θ̂Xa is less than the asymptotic
variance of estimator θ̂Xb . The estimator θ̂X is asymptotically efficient among all asymptotically normal estimators if it
is more efficient than any other asymptotically normal estimator.
For the example above, the full-sample sample mean θ̂Xa is more efficient than the half-sample sample mean θ̂Xb ,
but that doesn’t necessarily mean there isn’t some other estimator that is more efficient than θ̂Xa . That is, we have not
shown that the sample mean X̄ is the asymptotically efficient estimator since that would require showing that it has a
lower asymptotic variance than any other asymptotically normal estimator of µX .
Example 14.1 (Normal random variables, mean versus median) For a normal random variable N(µ, σ 2 ), the
population mean and population median are both equal to µ. The sample mean and sample median are two possible
estimators of the parameter µ. Example 13.6 considered the standard normal distribution, where X1 , X2 , …, Xn are
i.i.d. N(0, 1) random variables and µ = 0. In that example, it was shown that the asymptotic variance of the sample
mean X̄ was smaller than the asymptotic variance for the sample median X̃0.5 since
X̄ ∼ᵃ N(0, 1/n)

and

X̃0.5 ∼ᵃ N( 0 , 1/(4nφ(0)²) ),

with 1/(4nφ(0)²) ≈ 1/(4n(0.3989)²) ≈ 1.5711/n > 1/n. Therefore, when the underlying random variables have a standard normal
distribution, the sample mean X̄ is a more efficient estimator of µ = 0 than the sample median X̃0.5 . The interested
reader can show that this result holds for the general case of i.i.d. X1 , X2 , …, Xn ∼ N(µ, σ 2 ) random variables.
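The efficiency comparison can be seen directly in a short simulation sketch (values assumed for illustration, not from the companion scripts):
set.seed(42)
means <- replicate(50000, mean(rnorm(100)))
medians <- replicate(50000, median(rnorm(100)))
var(means)     # close to 1/n = 0.01
var(medians)   # close to 1.5711/n = 0.0157, so the mean is more efficient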
14.2 Finite-sample confidence intervals: population mean of i.i.d. normal random variables
This chapter considers how confidence intervals for parameters (estimands) can be constructed for a wide range of
estimators. For an unknown parameter θ, the goal is to provide an interval of plausible values for θ based upon the
estimator θ̂X . For example, a 95% confidence interval for θ is an interval for which, before observing the sample,
there is a 95% probability that the parameter θ is in the interval created by the estimation procedure. The first case
considered, in this section, is a confidence interval for the population mean µ associated with normally distributed
i.i.d. random variables X1 , X2 , …, Xn ∼ N(µ, σ 2 ). For this specific case, the exact sampling distribution results for the
sample mean X̄ from Section 12.1.2 are used to construct a confidence interval for µ. The resulting confidence interval is an
example of a finite-sample or exact confidence interval, as it is valid for any sample size n, even very small n. While
this specific case is interesting and sometimes useful, the resulting confidence interval does not generalize to other
settings. Therefore, to provide a more general method of constructing confidence intervals, subsequent sections of this
chapter consider the appropriate confidence interval based upon any asymptotically normal estimator. The resulting
confidence interval is an asymptotic confidence interval valid in large samples.
This section considers an observed sample {x1 , x2 , …, xn }, where the underlying i.i.d. random variables X1 , X2 , …, Xn
have a normal distribution N(µ, σ 2 ). The goal is to construct a confidence interval for the parameter µ based upon the
sample mean estimator X̄. From Section 12.1.2, the exact sampling distribution for the sample mean X̄ is
X̄ ∼ N( µ , σ²/n )  or, equivalently,  (X̄ – µ)/(σ/√n) ∼ N(0, 1).
Based upon this exact sampling distribution, we can construct a probability interval for X̄ when µ and σ 2 are assumed
to be known. For instance, a 95% probability interval for X̄ is
( µ – 1.96·σ/√n , µ + 1.96·σ/√n ),
meaning that, over all possible samples of size n that can be drawn from the population, the probability that X̄ is in the (µ – 1.96·σ/√n, µ + 1.96·σ/√n) interval is equal to 0.95:

P( X̄ ∈ ( µ – 1.96·σ/√n , µ + 1.96·σ/√n ) ) = 0.95.
In most cases of interest, however, the parameters µ and σ 2 of the normal distribution are not known to the researcher.
As a result, we would like to essentially flip things around here, forming a probability interval for the parameter µ based
upon the estimator rather than a probability for the estimator based upon the parameters µ and σ 2 . To do so, the first
step is to re-write the exact sampling distribution in terms of the difference X̄ – µ, as follows
X̄ – µ ∼ N( 0 , σ²/n )  or, equivalently,  µ – X̄ ∼ N( 0 , σ²/n ).

Both X̄ – µ and µ – X̄ have the same N(0, σ²/n) distribution due to the symmetry of the normal distribution. The latter distribution says that µ – X̄ is normally distributed with mean zero and standard deviation σ/√n. Therefore, over all possible samples of size n from the population, there is a 95% probability that µ – X̄ is between –1.96·σ/√n and 1.96·σ/√n:

P( µ – X̄ ∈ ( –1.96·σ/√n , 1.96·σ/√n ) ) = 0.95,
or, equivalently, by adding X̄ to the µ – X̄ term and to the interval endpoints,
P( µ ∈ ( X̄ – 1.96·σ/√n , X̄ + 1.96·σ/√n ) ) = 0.95.
If σ were known, this probability interval would be useful as a confidence interval for µ. Unfortunately, since σ is
unknown, this probability interval is not directly applicable since the endpoints can’t be calculated.
Rather than using the unknown population standard deviation σ, a sensible alternative is to use the sample standard
deviation sX, which is an estimator of σ. The complication introduced by using the estimator sX in place of the parameter σ is that, while (X̄ – µ)/(σ/√n) ∼ N(0, 1), the same is not true for the exact sampling distribution of (X̄ – µ)/(sX/√n). In fact, (X̄ – µ)/(sX/√n) does not have a normal distribution, but rather a different distribution known as a t-distribution:
Proposition 14.2. If X1 , X2 , …, Xn are i.i.d. N(µ, σ 2 ) random variables,
(X̄ – µ)/(sX/√n) ∼ tn–1,
where X̄ is the sample mean, sX is the sample standard deviation, and tn–1 is a t-distribution with n – 1 degrees of
freedom.
The exact form of the t-distribution, as given by the pdf or cdf, is somewhat complicated and therefore not explicitly
shown here. The following R functions are useful for working with random variables that follow a t-distribution:
• dt(x, df): Returns the pdf of a t-distributed random variable with df degrees of freedom evaluated at the argument x, which may be a single number or a vector.
• pt(x, df): Returns the cdf of a t-distributed random variable with df degrees of freedom evaluated at the argument x, which may be a single number or a vector.
• qt(p, df): Returns the population quantiles of a t-distributed random variable with df degrees of freedom at the probabilities given by the argument p, which may be a single number or a vector.
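For example, with 10 degrees of freedom, these functions can be used as follows:
dt(0, 10)       # pdf at 0: about 0.389, slightly below dnorm(0) = 0.399
pt(2, 10)       # cdf at 2: about 0.963
qt(0.975, 10)   # 97.5% quantile: about 2.228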
Figure 14.1
t-distributions for different degrees of freedom (densities ft3(v), ft5(v), ft10(v), and ft30(v), each compared with the N(0, 1) pdf)
In each graph, the t-distribution pdf is shown as a solid curve, and the standard normal pdf is shown as a dotted curve for comparison. The thicker tails, also associated with
a lower peak at zero, are quite evident in the top two graphs, for degrees of freedom equal to 3 and 5. For degrees of
freedom equal to 10, the t-distribution is visibly closer to the N(0, 1) distribution, with less thick tails than the graphs
in the top row. For degrees of freedom equal to 30, the t-distribution is nearly indistinguishable from the N(0, 1)
distribution.
The intuition is that using the estimator sX in place of the true σ introduces uncertainty into the ratio (X̄ – µ)/(sX/√n), and the
thicker tails of the tn–1 distribution, as compared to the N(0, 1) distribution, account for this additional uncertainty. The
uncertainty is larger for smaller sample sizes since sX is a less precise estimator of σ when n is small, reflected by the
thicker tails associated with lower degrees of freedom in the t-distribution. For larger sample sizes, sX becomes more
precise as an estimator of σ, corresponding to thinner tails for the tn–1 distribution. Since sX gets arbitrarily close to σ
as n increases, it should not be surprising that the limit of the tn–1 distribution is the N(0, 1) distribution, as (X̄ – µ)/(sX/√n) ≈ (X̄ – µ)/(σ/√n) for large n.
To quantify the difference that will be seen in probability intervals associated with a t-distribution, rather than a
N(0, 1) distribution, the following table provides the 97.5% quantile (used to construct a 95% probability interval) and
the 95% quantile (used to construct a 90% probability interval) for several different values of the degrees of freedom.
Sample size n Distribution (tn–1 ) 97.5% quantile (tn–1,0.025 ) 95% quantile (tn–1,0.05 )
4 t3 3.182 2.353
6 t5 2.571 2.015
11 t10 2.228 1.812
31 t30 2.042 1.697
101 t100 1.984 1.660
501 t500 1.965 1.648
“large” N(0, 1) 1.960 1.645
The table shows some new notation corresponding to the quantiles of the tn–1 distribution, specifically the tn–1,0.025
notation in the 97.5% quantile column and the tn–1,0.05 notation in the 95% quantile column. This notation is formally
defined as follows:
Definition 14.6 The critical value tn–1,q denotes the (1 – q) quantile of the tn–1 distribution. For example, tn–1,0.025 is
the 97.5% quantile of the tn–1 distribution, and tn–1,0.05 is the 95% quantile of the tn–1 distribution.
For the t30 distribution, corresponding to n = 31, the 97.5% quantile t30,0.025 is 2.042, which is approximately 4%
larger than the 97.5% quantile (1.960) of the N(0, 1) distribution, meaning a 95% probability interval based on the
t30 distribution has a width that is approximately 4% larger than the interval based on the N(0, 1) distribution. As n
increases, the quantile values for the tn–1 distribution approach those of the N(0, 1) distribution. For n = 101 and the
t100 distribution, the quantile values in the table are approximately 1% larger than the quantile values for the N(0, 1)
distribution. The difference becomes almost entirely negligible for n = 501 and the t500 distribution, with the quantiles
only about 0.2% larger than the N(0, 1) quantiles.
The following R code shows how the critical values in the table above are calculated with the qt function:
qt(c(0.975,0.95),3)
## [1] 3.182446 2.353363
qt(c(0.975,0.95),5)
## [1] 2.570582 2.015048
qt(c(0.975,0.95),10)
## [1] 2.228139 1.812461
qt(c(0.975,0.95),30)
## [1] 2.042272 1.697261
qt(c(0.975,0.95),100)
## [1] 1.983972 1.660234
qt(c(0.975,0.95),500)
## [1] 1.964720 1.647907
Figure 14.2
Critical values for a t-distribution (densities ft5(v) and ft30(v), with shaded 2.5% and 5% tail areas)

Figure 14.2 provides some graphical examples of critical values. The top two graphs show the 2.5% critical values for the t5 and t30 distributions. In the top-left graph, the gray area to the right of the t5,0.025 ≈ 2.571 critical value has probability 2.5%, and due to symmetry, the gray area to the left of –t5,0.025 has probability 2.5%. Therefore, the probability for the interval between –t5,0.025 and t5,0.025 is equal to 95%. The top-right graph, for the t30 distribution
is similar, except that the critical value t30,0.025 ≈ 2.042 is not as large as the t5,0.025 ≈ 2.571 critical value. Since the
t30 distribution has less thick tails than the t5 distribution, the critical value must be further to the left to still have the
probability to the right of t30,0.025 , again represented by the gray area, equal to 2.5%. The bottom two graphs show the
5% critical values for the t5 and t30 distributions. For each of these two graphs, the gray area in the right tail has 5%
probability, as does the gray area in the left tail, meaning there is 90% probability of being between –t5,0.05 and t5,0.05
for the t5 distribution and between –t30,0.05 and t30,0.05 for the t30 distribution.
qt(0.90,19)
## [1] 1.327728
To make these probability intervals useful in practice, the only remaining step is to use the realized estimates x̄ and sx, associated with the observed sample, in place of the estimators X̄ and sX. The estimated standard deviation of the sampling distribution of X̄ is equal to sX/√n, which is replaced by sx/√n. This latter quantity is known as the standard error of the sample mean estimator.
Definition 14.7 The standard error is the estimated standard deviation of the sampling distribution of the
estimator θ̂X .
Then, the general result is that, for a value α, the (1 – α) confidence interval for µ is

( x̄ – tn–1,α/2·sx/√n , x̄ + tn–1,α/2·sx/√n ).

When α = 0.05, the 95% confidence interval for µ is

( x̄ – tn–1,0.025·sx/√n , x̄ + tn–1,0.025·sx/√n ).
Figure 14.3
Monte Carlo simulations of confidence intervals (100 simulated intervals plotted against the simulation number)
there is a 95% probability that the quantity (µ – X̄)/(sX/√n) is above –tn–1,0.05. The critical value here uses the 5% probability, rather than the 2.5% probability used for a two-sided confidence interval, since the full 5% probability must be to the left of –tn–1,0.05. This probability can be written as

P( (µ – X̄)/(sX/√n) > –tn–1,0.05 ) = 0.95

or, equivalently, as

P( µ > X̄ – tn–1,0.05·sX/√n ) = 0.95.

This probability can also be written in terms of a one-sided interval, as

P( µ ∈ ( X̄ – tn–1,0.05·sX/√n , ∞ ) ) = 0.95.

Using similar reasoning, since there is a 95% probability that the quantity (µ – X̄)/(sX/√n) is below tn–1,0.05, it follows that

P( (µ – X̄)/(sX/√n) < tn–1,0.05 ) = 0.95,

which implies

P( µ < X̄ + tn–1,0.05·sX/√n ) = 0.95

or

P( µ ∈ ( –∞ , X̄ + tn–1,0.05·sX/√n ) ) = 0.95.
As with a two-sided confidence interval, a one-sided confidence interval is implemented in practice by replacing X̄
with the sample mean x̄ and sX with the sample standard deviation sx . These one-sided confidence intervals can also
be generalized to other probabilities, as stated in the following proposition.
Proposition 14.4. (One-sided confidence intervals for the sample mean of i.i.d. normal random variables) If X1, X2, …, Xn are i.i.d. N(µ, σ²) random variables, the probability that

µ > X̄ – tn–1,α·sX/√n

is equal to 1 – α, and the probability that

µ < X̄ + tn–1,α·sX/√n

is equal to 1 – α. The associated 1 – α confidence intervals for µ are

( x̄ – tn–1,α·sx/√n , ∞ )

and

( –∞ , x̄ + tn–1,α·sx/√n ),

where x̄ is the sample mean and sx/√n is the standard error based upon the sample standard deviation sx.
Example 14.4 (Food truck) Continuing Example 14.3, one-sided confidence intervals can be constructed based upon
Proposition 14.4. For instance, the owner of the food truck may be most concerned about the downside risk. As such,
the owner would be more interested in having 95% confidence, or some other confidence level, that weekly profits are
above a certain value. A one-sided 95% confidence interval for µ is
( x̄ – t6–1,0.05·sx/√n , ∞ ) ≈ (1200 – (2.015)(81.65), ∞) ≈ (1035, ∞).
The owner can be 95% confident that the true average of weekly profits is greater than $1035. Note that $1035 is the
same value as the lower end of the two-sided 90% confidence interval found in Example 14.3, as the same critical
value t5,0.05 is used for both calculations.
What if the owner would like even more certainty, say 99% confidence rather than 95% confidence? The critical
value t6–1,0.05 ≈ 2.015 is replaced with t6–1,0.01 ≈ 3.365, yielding the one-sided 99% confidence interval
( x̄ – t6–1,0.01·sx/√n , ∞ ) ≈ (1200 – (3.365)(81.65), ∞) ≈ (925, ∞).
Since the owner wants to be more certain here (99% probability), the lower end of the one-sided confidence interval
($925) is considerably lower than the lower end of the 95% one-sided confidence interval ($1035).
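These two intervals can be verified with qt, using the x̄ = 1200 and standard error 81.65 values carried over from Example 14.3 (a sketch, not from the companion scripts):
xbar <- 1200; se <- 81.65; n <- 6
xbar - qt(0.95, n - 1)*se   # one-sided 95% lower bound: about 1035
xbar - qt(0.99, n - 1)*se   # one-sided 99% lower bound: about 925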
Definition 14.8 The critical value zq denotes the (1 – q) quantile of the N(0, 1) distribution. For example, z0.025 ≈ 1.96
is the 97.5% quantile of the N(0, 1) distribution, and z0.05 ≈ 1.645 is the 95% quantile of the N(0, 1) distribution.
Following the same reasoning from Section 14.2, but now using z critical values rather than t critical values,

P( (µX – X̄)/(sX/√n) ∈ (–z0.025, z0.025) ) = 0.95

and

P( µX ∈ ( X̄ – z0.025·sX/√n , X̄ + z0.025·sX/√n ) ) = 0.95.

For the case of i.i.d. normal random variables, this 95% probability interval is the natural large-sample version of the finite-sample interval P( µX ∈ ( X̄ – tn–1,0.025·sX/√n , X̄ + tn–1,0.025·sX/√n ) ) = 0.95 since the limit of the tn–1 distribution, as n gets larger, is the N(0, 1) distribution, meaning the limit of the tn–1,0.025 critical value, as n gets larger, is the z0.025 critical value. Importantly, however, the 95% asymptotic probability interval also applies for i.i.d. random variables that are non-normal.
The 95% probability interval can be generalized to other probability intervals, with the (1 – α) probability interval

P( µX ∈ ( X̄ – zα/2·sX/√n , X̄ + zα/2·sX/√n ) ) = 1 – α.
The critical value zα/2 is the value for which there is probability of α/2 that a N(0, 1) random variable is larger than
zα/2 , and by symmetry –zα/2 is the value for which there is probability of α/2 that a N(0, 1) random variable is less than
–zα/2 . The following proposition formally states the results for a two-sided confidence interval based upon this general
probability interval.
Proposition 14.5. (Two-sided confidence interval for the sample mean of i.i.d. random variables) If X1, X2, …, Xn are i.i.d. random variables with population mean µX and large n, the probability that µX is in the interval

( X̄ – zα/2·sX/√n , X̄ + zα/2·sX/√n )

is approximately equal to 1 – α. The associated asymptotic 1 – α confidence interval for µX is

( x̄ – zα/2·sx/√n , x̄ + zα/2·sx/√n ),

where x̄ is the sample mean and sx/√n is the standard error based upon the sample standard deviation sx.
The standard error sx/√n is sometimes called an asymptotic standard error since it is based upon the asymptotic sampling distribution, but we often simply call it a “standard error” so that the same terminology can be used when referring to a finite-sample standard error or an asymptotic standard error. The following R code defines a function se_meanx that calculates the standard error of x̄, given by sx/√n, for any vector of data:
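# A sketch of the se_meanx function described in the text: the standard
# error of the sample mean, sd(x)/sqrt(n), with optional removal of NAs
se_meanx <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  sd(x)/sqrt(length(x))
}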
The function se_meanx is defined with an optional argument na.rm so that vectors with missing (NA) values can
be handled, similar to the optional argument na.rm available for built-in functions like mean and sd.
For the case of i.i.d. Bernoulli(π) random variables, the sample mean X̄ is an estimator of the population mean or true success probability π. Since the population variance is σX² = π(1 – π), the asymptotic standard deviation of X̄ is

σX/√n = √( π(1 – π)/n ),

which suggests two possible approaches to calculating the standard error: (i) plug in the sample standard deviation sx for σX, so that the standard error is sx/√n, or (ii) plug in the sample mean x̄ for π, so that the standard error is √( x̄(1 – x̄)/n ). Approach (ii) is valid since the sample mean is a consistent estimator of π, so that √(x̄(1 – x̄)) will be arbitrarily close to √(π(1 – π)). Since x̄ is just the sample proportion of successes, or the fraction of ones observed, this approach is appealing since it doesn’t require the extra step of calculating sx. Approach (i) and approach (ii) give nearly identical standard errors, as it can be shown that sx = √(n/(n – 1))·√(x̄(1 – x̄)) and the scaling factor √(n/(n – 1)) ≈ 1 for large n.46
Example 14.5 (Widget website) In Example 2.1, widgets.com conducted an e-mail campaign experiment in which
300 users received e-mail A, 300 users received e-mail B, and 2400 users received no e-mail. The resulting purchase
probabilities were 20% (60 out of 300) for e-mail A recipients, 22% (66 out of 300) for e-mail B recipients, and 15%
(360 out of 2400) for non-recipients. Using Proposition 14.5, confidence intervals for the true purchase probabilities
of the three groups can be calculated. Let πA denote the purchase probability of an e-mail A recipient, πB denote the
purchase probability of an e-mail B recipient, and πC denote the purchase probability of a non-recipient. Starting with
the sample of e-mail A recipients, assume that the random variables X1 , X2 , …, X300 are i.i.d. Bernoulli(πA ), with a
one (success) associated with a purchase. The observed proportion of successes (purchases) is pA = 0.20, so that the
sample mean is
x̄ = pA = 0.20,
and the standard error (of the sample mean) is

sx/√n = √( pA(1 – pA)/n ) = √( (0.2)(0.8)/300 ) ≈ 0.0231.
Then, the asymptotic 95% confidence interval for πA is
( x̄ – z0.025·sx/√n , x̄ + z0.025·sx/√n ) ≈ (0.20 – (1.96)(0.0231), 0.20 + (1.96)(0.0231)) ≈ (0.155, 0.245).
Similar calculations yield an asymptotic 95% confidence interval for πB ,
( x̄ – z0.025·sx/√n , x̄ + z0.025·sx/√n ) ≈ ( 0.22 – 1.96·√((0.22)(0.78)/300) , 0.22 + 1.96·√((0.22)(0.78)/300) ) ≈ (0.173, 0.267),
and an asymptotic 95% confidence interval for πC ,
( x̄ – z0.025·sx/√n , x̄ + z0.025·sx/√n ) ≈ ( 0.15 – 1.96·√((0.15)(0.85)/2400) , 0.15 + 1.96·√((0.15)(0.85)/2400) ) ≈ (0.136, 0.164).
The standard errors for the three confidence intervals are 0.023, 0.024, and 0.007, respectively. The first two standard errors are similar since the sample size n = 300 is the same and the sample standard deviations differ only slightly (√((0.20)(0.80)) versus √((0.22)(0.78))). The third standard error is much lower due to the larger sample size n = 2400.
Figure 14.4 provides a graphical representation of the 95% confidence intervals for the purchase probabilities
of the three groups. The confidence intervals for πA and πB have very similar widths since they have similar standard errors. The confidence interval for πC (no e-mail) is much narrower since the sample size n = 2400 is
much larger, corresponding to the fact that pC = 0.15 is a more precise estimate of πC than the estimates for the
smaller samples. Comparing the confidence intervals for πA and πB , it seems that there’s not strong evidence that the
purchase probability of e-mail B recipients, πB , is greater than the purchase probability of e-mail A recipients, πA .
Even though the observed purchase probability is larger for e-mail B recipients, there is a lot of overlap between the
confidence intervals of the two purchase probabilities πA and πB . (Later in this chapter, a more formal method for
directly looking at the difference πA – πB is considered.) In contrast, the narrow confidence interval for πC leads to no
overlap with the πB confidence interval and only small overlap with the πA confidence interval, offering some evidence
that the purchase probability πC is lower than either πA or πB . (Again, a more formal examination of the differences
πA – πC and πB – πC is provided later.)
Example 14.6 (Labor force data) The confidence interval for a sample mean estimator can be applied to variables
from a cross-sectional dataset, at least in cases where the observations can plausibly be considered to be realizations
of i.i.d. random variables. This idea is illustrated by calculating standard errors and confidence intervals for several
variables from the cps dataset. The following table shows the sample mean x̄, the standard error sx/√n, the asymptotic
95% confidence interval for µX , and the asymptotic 90% confidence interval for µX for five different variables: age
(age in years), educ (education in years), ownchild (number of children in household), earnwk (weekly earnings), and
union (1 if union member, 0 if not). For the first three variables, the quantities are calculated based upon the full
sample size n = 4013. For the last two variables, the quantities are calculated based upon the employed sample size
n = 2809.
Figure 14.4
Asymptotic 95% confidence intervals for true widget purchase probabilities (e-mail A (πA), e-mail B (πB), and no e-mail (πC))
Variable   n     x̄       se = sx/√n   95% CI for µX       90% CI for µX
age        4013  45.02    0.142        (44.74, 45.30)      (44.78, 45.25)
educ       4013  12.57    0.039        (12.50, 12.65)      (12.51, 12.64)
ownchild   4013  0.748    0.0176       (0.713, 0.782)      (0.719, 0.777)
earnwk     2809  971.18   14.16        (943.42, 998.93)    (947.89, 994.47)
union      2809  0.098    0.0056       (0.087, 0.109)      (0.089, 0.107)
The abbreviation “se” is used for the standard error sx/√n. For the education variable educ, the asymptotic 95%
confidence interval for the population mean is (12.50, 12.65). Since this confidence interval is quite narrow, the sample
mean 12.57 is providing a precise estimate of the population mean, which occurs since the sample size n = 4013 is
so large here. For the weekly earnings variable earnwk, the sample mean of 971.18 dollars is the estimate of the
population mean of weekly earnings for the population of employed individuals, with the associated 95% asymptotic
confidence interval being between 943.42 dollars and 998.93 dollars. The union variable union is an indicator
variable, so its observations can be viewed as realizations of Bernoulli random variables. The estimate of the true
probability that an employed individual from the population is in a union is 9.8%, with an asymptotic 95% confidence
interval for the true probability of union membership being between 8.7% and 10.9%.
Here is the R code used to calculate the n, x̄, and se = sx/√n columns in the table above:
# loop over the five variables of interest
for (varname in c("age","educ","ownchild","earnwk","union")) {
  # effective sample size: number of non-missing observations
  nobs <- sum(!is.na(cps[,varname]))
  mean_var <- mean(cps[,varname], na.rm=TRUE)
  se_mean <- sd(cps[,varname], na.rm=TRUE)/sqrt(nobs)
  # output results
  print(paste(varname,":","n",nobs,"Mean",signif(mean_var,digits=5),"SE(mean)",signif(se_mean,digits=5)))
}
## [1] "age : n 4013 Mean 45.017 SE(mean) 0.14205"
## [1] "educ : n 4013 Mean 12.573 SE(mean) 0.038863"
## [1] "ownchild : n 4013 Mean 0.74782 SE(mean) 0.017606"
## [1] "earnwk : n 2809 Mean 971.18 SE(mean) 14.16"
## [1] "union : n 2809 Mean 0.098256 SE(mean) 0.0056172"
The code loops through the five variable names of interest. For each variable given by varname, the expression
sum(!is.na(cps[,varname])) returns the number of non-missing observations, which is the effective sample
size n. To limit the number of significant digits reported by R, the function signif is used so that five significant
digits, as specified by the second argument digits=5, are reported for the values of mean_var and se_mean.
Example 14.7 (Simulation error: the likelihood of “streaks”) In Example 4.16, computer simulations estimated the
probability that a streak of at least five heads occurs during a sequence of 100 coin tosses. Over 100,000 simulations,
the calculated frequency of streaks was 0.81156, corresponding to 81,156 of the 100,000 simulations. The associated
standard error is

√(0.81156(1 – 0.81156)/100000) ≈ 0.001237,
so that an asymptotic 99% confidence interval for the true probability (of observing a sequence of at least five heads
in 100 coin tosses) is
(0.81156 – z0.005 (0.001237), 0.81156 + z0.005 (0.001237)) ≈ (0.8084, 0.8147),
using z0.005 ≈ 2.576. Thus, it is very likely that the true probability is close to 81% with this many simulations.
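These calculations can be verified directly in R; the following is a minimal sketch using the values from the example (the object names are illustrative):
p_hat <- 81156/100000                    # observed frequency of streaks
se_p  <- sqrt(p_hat*(1 - p_hat)/100000)  # standard error, approx 0.001237
z     <- qnorm(0.995)                    # z_0.005 critical value, approx 2.576
c(p_hat - z*se_p, p_hat + z*se_p)        # asymptotic 99% CI, approx (0.8084, 0.8147)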
For a general asymptotically normal estimator θ̂X of a parameter θ, with asymptotic variance V/n, the standardized estimator satisfies

(θ̂X – θ)/√(V/n) ∼a N(0, 1).

Applying the z0.025 = 1.96 critical value of the standard normal distribution gives

P(–1.96 < (θ̂X – θ)/√(V/n) < 1.96) = 0.95
or, equivalently,
P(θ̂X – 1.96·√(V/n) < θ < θ̂X + 1.96·√(V/n)) = 0.95.
To calculate an asymptotic 95% confidence interval for θ, the realized estimate θ̂x is used in place of the estimator θ̂X, and the standard error associated with θ̂x is used in place of √(V/n). For the standard error, a consistent estimate of V is required, call it V̂, so that the standard error is √(V̂/n). For example, in the case of the sample mean X̄ as an estimator of the population mean µX, the results from Section 14.3 use V̂ = sx², so that the standard error is √(V̂/n) = sx/√n. As another example, in the case of the sample correlation rXY as an estimator of the population correlation ρXY, the asymptotic variance shown above has V = (1 – ρXY²)², which can be consistently estimated by plugging in rxy for ρXY, so that V̂ = (1 – rxy²)² and the standard error is √(V̂/n) = (1 – rxy²)/√n.
The 95% probability statement above can be generalized to other probability levels by changing the z0.025 = 1.96
critical value to the appropriate critical value needed for the chosen level of probability. To get a probability 1 – α, the
appropriate critical value is zα/2 :
P(θ̂X – zα/2·√(V/n) < θ < θ̂X + zα/2·√(V/n)) = 1 – α.
The general form of Proposition 14.5, which holds for any asymptotically normal estimator θ̂X , is given by the
following proposition:
Proposition 14.7. (Two-sided confidence intervals based upon an asymptotically normal estimator) If θ̂X is an asymptotically normal estimator of θ with θ̂X ∼a N(θ, V/n), the probability that θ is in the interval

(θ̂X – zα/2·√(V/n), θ̂X + zα/2·√(V/n))

is approximately equal to 1 – α for large n. The associated asymptotic 1 – α confidence interval for θ is

(θ̂x – zα/2·√(V̂/n), θ̂x + zα/2·√(V̂/n)),

where θ̂x is the realized estimate of θ and √(V̂/n) is the standard error based upon a consistent estimate V̂ of V.
A convenient notation is se(θ̂x), denoting the standard error for the estimate θ̂x, with

se(θ̂x) = √(V̂/n),

and the asymptotic 1 – α confidence interval for θ, from Proposition 14.7, is

(θ̂x – zα/2·se(θ̂x), θ̂x + zα/2·se(θ̂x)).
Proposition 14.7 provides a very powerful result, as it covers all of the asymptotically normal estimators introduced
in this book. More generally, beyond the estimators we’ve covered, if there is an asymptotically normal estimator θ̂X
of θ for which a computer is able to produce both the estimate θ̂x and the standard error se(θ̂x ), the asymptotic 1 – α
confidence interval is always (θ̂x – zα/2 se(θ̂x ), θ̂x + zα/2 se(θ̂x )).
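Proposition 14.7 translates into a one-line computation. The following sketch wraps it in a small helper function (the name ci_asym is illustrative, not from earlier chapters):
# asymptotic (1 - alpha) confidence interval from an estimate and its std error
ci_asym <- function(est, se, alpha = 0.05) {
  z <- qnorm(1 - alpha/2)
  c(est - z*se, est + z*se)
}
ci_asym(0.325, 0.0169)   # e.g., 95% CI for a correlation estimate, approx (0.292, 0.358)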
In Section 14.3, the asymptotic confidence interval was given for the population mean, based upon the sample mean
of an i.i.d. sample, with
asymptotic standard deviation = √(V/n) = σX/√n and standard error se(x̄) = √(V̂/n) = sx/√n.
For the specific case of underlying i.i.d. Bernoulli(π) random variables, where the sample mean is an estimator of the
true probability π = µX ,
asymptotic standard deviation = √(V/n) = √(π(1 – π)/n) and standard error se(x̄) = √(V̂/n) = √(x̄(1 – x̄)/n).
For the remainder of this section, several other examples of using the asymptotic confidence interval are considered.
Before those examples, however, we state the general result for one-sided confidence intervals:
Proposition 14.8. (One-sided confidence intervals based upon an asymptotically normal estimator) If θ̂X is an asymptotically normal estimator of θ with θ̂X ∼a N(θ, V/n), the probability that

θ > θ̂X – zα·√(V/n)

is approximately equal to 1 – α for large n, and the probability that

θ < θ̂X + zα·√(V/n)

is approximately equal to 1 – α for large n. The associated asymptotic 1 – α confidence intervals for θ are

(θ̂x – zα·√(V̂/n), ∞) and (–∞, θ̂x + zα·√(V̂/n)),

where θ̂x is the realized estimate of θ and √(V̂/n) is the standard error based upon a consistent estimate V̂ of V.
From Proposition 14.8, the one-sided asymptotic (1 – α) confidence intervals for θ are
(θ̂x – zα se(θ̂x ), ∞) and (–∞, θ̂x + zα se(θ̂x )),
where θ̂x is the estimate of θ and se(θ̂x ) is its standard error.
Consider next the sample standard deviation sX as an estimator of the population standard deviation σX, whose asymptotic variance is V = (E((X – µX)⁴) – (σX²)²)/(4σX²). To consistently estimate V, which is necessary to derive a formula for the standard error se(sx), the realized sample standard deviation sx is plugged in for σX and the summation

(1/n) Σᵢ₌₁ⁿ (xi – x̄)⁴

is plugged in for E((X – µX)⁴), leading to

V̂ = ((1/n) Σᵢ₌₁ⁿ (xi – x̄)⁴ – (sx²)²)/(4sx²) and se(sx) = √(V̂/n) = √((1/n) Σᵢ₌₁ⁿ (xi – x̄)⁴ – (sx²)²)/(2√n·sx).

With this formula for se(sx), the asymptotic 1 – α confidence interval for the population standard deviation is

(sx – zα/2·se(sx), sx + zα/2·se(sx)).
The following R code defines a function se_sx that calculates the standard error se(sx ) for any vector of data:
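One implementation consistent with the formula above (the version used for the book’s calculations is available on the companion website) is:
# standard error of the sample standard deviation, se(s_x)
se_sx <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  n  <- length(x)
  sx <- sd(x)
  m4 <- mean((x - mean(x))^4)      # (1/n) * sum of (x_i - xbar)^4
  sqrt(m4 - sx^4)/(2*sqrt(n)*sx)
}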
Example 14.9 (Standard deviation of monthly stock returns) Considering the monthly stock return data from sp500,
confidence intervals can be formed for the population standard deviation of monthly returns for any given company.
The following table shows the sample standard deviation sx , its standard error, and the associated 95% confidence
interval (in the column labeled “95% CI for σX ”) for six different stocks (HD, LOW, BAC, WFC, MRO, COP).
Company sx se(sx ) 95% CI for σX
HD 0.0737 0.00312 (0.0676, 0.0798)
LOW 0.0916 0.00361 (0.0845, 0.0987)
BAC 0.1053 0.00882 (0.0880, 0.1226)
WFC 0.0816 0.00538 (0.0710, 0.0921)
MRO 0.1210 0.01064 (0.1001, 0.1418)
COP 0.0817 0.00502 (0.0719, 0.0916)
For example, the sample standard deviation sx of Home Depot (HD) monthly returns is 0.0737, and we can say with
95% confidence that the population standard deviation σX is between 0.0676 and 0.0798. The standard errors in the
“se(sx )” column vary a lot, with the largest value of 0.01064 for MRO, indicating that its standard deviation estimate
is the least precise or, equivalently, that the associated confidence interval is the widest of the six stocks.
The following R code calculates the sx and se(sx ) values for the table above:
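A sketch of that calculation, assuming the six return series are columns of sp500 named as in the table and that se_sx is defined as above:
# sample stdev and its standard error for each stock's monthly returns
for (s in c("HD","LOW","BAC","WFC","MRO","COP")) {
  x <- sp500[, s]
  print(paste(s, ": sx", signif(sd(x), 4), "se(sx)", signif(se_sx(x), 4)))
}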
Example 14.10 (Education and earnings) For the sample of n = 2809 employed individuals from the cps dataset,
the sample correlation between education (x = educ) and weekly earnings (y = earnwk) is rxy ≈ 0.325. The associated
standard error is

se(rxy) = (1 – rxy²)/√n ≈ (1 – (0.325)²)/√2809 ≈ 0.0169.
The asymptotic 95% confidence interval for the population correlation ρXY between education and earnings is
(rxy – z0.025·se(rxy), rxy + z0.025·se(rxy)) ≈ (0.325 – (1.96)(0.0169), 0.325 + (1.96)(0.0169)) ≈ (0.292, 0.358),
and the asymptotic 99% confidence interval for the population correlation ρXY is
(rxy – z0.005·se(rxy), rxy + z0.005·se(rxy)) ≈ (0.325 – (2.576)(0.0169), 0.325 + (2.576)(0.0169)) ≈ (0.281, 0.369).
It can be said that, with 99% confidence, the population correlation ρXY is between 0.281 and 0.369. This 99%
confidence interval provides strong evidence that the true correlation between education and earnings is positive
since a correlation value of zero is not in the interval and, in fact, is far below the lower endpoint. Since the standard error is 0.0169, the number of standard errors by which zero falls below the lower endpoint 0.281 is 0.281/0.0169 ≈ 16.6.
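These calculations can be sketched in R as follows, using the educ and earnwk variables from cps:
# correlation between education and earnings, its std error, and CIs
x <- cps[,"educ"]; y <- cps[,"earnwk"]
n <- sum(!is.na(x) & !is.na(y))                  # employed sample size (2809)
r <- cor(x, y, use="complete.obs")
se_r <- (1 - r^2)/sqrt(n)
c(r - qnorm(0.975)*se_r, r + qnorm(0.975)*se_r)  # 95% CI, approx (0.292, 0.358)
c(r - qnorm(0.995)*se_r, r + qnorm(0.995)*se_r)  # 99% CI, approx (0.281, 0.369)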
Example 14.11 (Monthly stock returns) Example 7.13 provided a correlation matrix for six stocks (HD, LOW, BAC, WFC, MRO, COP) from the sp500 dataset. Applying the se(rxy) = (1 – rxy²)/√n formula, the correlation matrix can be
augmented to provide standard errors, reported in parentheses, alongside the sample correlation values:
HD LOW BAC WFC MRO COP
HD 1.000 0.648 (0.030) 0.331 (0.047) 0.280 (0.048) 0.189 (0.051) 0.215 (0.050)
LOW 1.000 0.357 (0.046) 0.262 (0.049) 0.181 (0.051) 0.256 (0.049)
BAC 1.000 0.692 (0.027) 0.331 (0.047) 0.339 (0.046)
WFC 1.000 0.379 (0.045) 0.396 (0.044)
MRO 1.000 0.771 (0.021)
COP 1.000
Since each of the stock pairs has the same sample size n = 364, the standard error se(rxy) = (1 – rxy²)/√n is a decreasing function of rxy, so the larger correlation values are associated with lower standard errors. The largest sample correlation of
0.771, between MRO and COP, has a se(rxy ) value of 0.021, whereas the lowest sample correlation of 0.181, between
LOW and MRO, has a se(rxy ) value of 0.051.
The confidence intervals for different correlations can be compared to each other. For instance, the asymptotic
95% confidence interval for ρHD,LOW is (0.589, 0.707), whereas the asymptotic 95% confidence interval for ρHD,BAC is
(0.239, 0.423). The sample correlation rHD,LOW = 0.648 is considerably higher than the sample correlation rHD,BAC = 0.331, and these two confidence intervals provide strong evidence that ρHD,LOW > ρHD,BAC, rather than the observed difference having arisen by chance. The two 95% confidence intervals for ρHD,LOW and ρHD,BAC have no overlap at all and, in terms of the standard error magnitudes, are separated by a large distance.
In general, it is very useful to report the standard error of an estimate alongside the estimate itself, as the reader can
then form any asymptotic confidence interval based upon those two numbers. For instance, for the sample correlation
rxy = 0.648 and the standard error se(rHD,LOW ) = 0.030 for the HD and LOW monthly stock returns, a reader could
“ballpark” an asymptotic 95% confidence interval in their head by using a 0.648 plus-or-minus two standard error
(0.030) interval. Alternatively, to more formally calculate a confidence interval, the appropriate critical value can be used; for example, an asymptotic 90% confidence interval for ρXY is (0.648 – (1.645)(0.030), 0.648 + (1.645)(0.030)), using z0.05 ≈ 1.645.
Here is the R code to create the table above:
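The full script is available on the companion website; a minimal sketch that produces the same ingredients, assuming the six return columns of sp500 are named as in the table, is:
# correlation matrix and matrix of standard errors (1 - r^2)/sqrt(n)
stocks <- c("HD","LOW","BAC","WFC","MRO","COP")
R  <- cor(sp500[, stocks])
n  <- nrow(sp500)              # 364 monthly observations per pair
SE <- (1 - R^2)/sqrt(n)
round(R, 3)                    # sample correlations
round(SE, 3)                   # associated standard errors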
Moving to the general case, Proposition 13.7 provided the asymptotic distribution for the sample quantile X̃q , which is
an estimator of the population quantile τX,q :
X̃q ∼a N(τX,q, q(1 – q)/(n·fX(τX,q)²)).
The estimate is the realized sample quantile x̃q . The asymptotic standard deviation is
√(V/n) = √(q(1 – q))/(√n·fX(τX,q)),
which again contains the problematic density fX (τX,q ). At this point, given the difficulty associated with estimating
this density, let’s assume that the sample quantile x̃q and its associated standard error se(x̃q ) can be calculated by a
statistical package like R. Then, an asymptotic 1 – α confidence interval for the population quantile τX,q is calculated
in the usual way:
(x̃q – zα/2 se(x̃q ), x̃q + zα/2 se(x̃q )).
Example 14.12 (Labor force data) In Example 6.11, various sample quantiles, including all sample deciles and
sample quartiles, were calculated for the weekly earnings (earnwk) variable from the cps dataset. The following table
augments the table from Example 6.11 by also including standard errors and asymptotic 95% confidence intervals.
Where do the standard errors se(x̃q ) come from? They are calculated using the bootstrap, covered in Chapter 15, and
the R code/results are specifically provided in Example 15.7.
q x̃q se(x̃q ) 95% CI for τX,q
0.1 355 7.0 (341, 368)
0.2 480 7.4 (465, 495)
0.25 520 11.7 (497, 543)
0.3 576 7.4 (562, 590)
0.4 670 10.4 (650, 690)
0.5 770 11.1 (748, 792)
0.6 900 16.2 (868, 932)
0.7 1080 21.9 (1037, 1123)
0.75 1194 23.7 (1147, 1240)
0.8 1346 23.9 (1299, 1393)
0.9 1750 52.0 (1648, 1852)
For the sample median, the estimate is 770 dollars (per week), with a standard error of 11.1 dollars (per week) and an asymptotic 95% confidence interval (748, 792) for the population median τX,0.5 of weekly earnings. From the asymptotic standard deviation formula, √(q(1 – q))/(√n·fX(τX,q)), there are two factors that affect the size of the standard error. The
term q(1 – q) is maximized at q = 0.5, so this term leads to smaller standard errors for q values closer to 0 or 1. On
the other hand, the term fX (τX,q ) leads to smaller standard errors at quantiles that have higher associated pdf fX (·)
values. Looking at Figure 6.7, the density of weekly earnings peaks just above the 0.25 quantile of the distribution. In
the right tail, the sparsity of the wage data, associated with very low pdf values, leads to much higher standard errors
even though q(1 – q) is relatively small there. As a result, the estimate of the 90% population quantile is considerably less
precise than the estimates of the other quantiles, whereas the most precise estimates are at the 10%, 20%, and 30%
population quantiles.
Next, consider a confidence interval for the difference in population means, µX – µY, in settings where both x and y are observed for each cross-sectional unit. Here are a few examples where such a confidence interval would be of interest:
interest:
• Exam score data: An instructor gives two exams, where X is the random variable associated with scores on the first
exam and Y is the random variable associated with scores on the second exam. If both exam scores are available for
a sample of students, a confidence interval for µX – µY provides information about the difference in the true average
scores for the two exams.
• Asset return data: X and Y are random variables associated with the returns on two different assets. If both
asset returns are available for observations over the same time period, a confidence interval for µX – µY provides
information about the difference in the true average returns of the two assets.
• Website user activity: X and Y are binary variables associated with two different actions by a website user, where a
value of 1 indicates the action is taken and 0 indicates the action is not taken. For example, X and Y could correspond
to whether or not the user clicks on two different links on the website homepage, or X and Y could correspond to
whether or not the user purchases two different products from the website. If both binary variables are observed for
a sample of website visitors, a confidence interval for µX – µY provides information about the difference in the true
probabilities of the two actions.
Assume that a sample {(x1 , y1 ), (x2 , y2 ), …, (xn , yn )} is observed, with underlying i.i.d. bivariate random variables
{(X1 , Y1 ), (X2 , Y2 ), …, (Xn , Yn )}. The random variables may be discrete or continuous, and we make no assumptions
about their distributions. In practice, as indicated by the examples above, X and Y would be expected to have the same
units, so that the difference µX – µY is meaningful, and generally would have the same type of distribution (e.g., both
binary in the website example, both continuous in the asset return example, both approximately continuous in the
exam score example).
The logical estimator for µX – µY is the difference in sample means X̄ – Ȳ. Since both X̄ and Ȳ have asymptotically
normal sampling distributions, the linear combination X̄ – Ȳ also has an asymptotically normal sampling distribution.
Let W = X – Y denote the linear combination (difference) of the random variables X and Y, and let wi = xi – yi denote
the corresponding linear combination (difference) of the observed variables. Applying the result for the asymptotic
sampling distribution of the sample mean to the random variable W yields
W̄ ∼a N(µW, σW²/n)

or, equivalently, since W̄ = X̄ – Ȳ and µW = µX–Y = µX – µY,

X̄ – Ȳ ∼a N(µX – µY, σW²/n).
The asymptotic variance σW² has not been simplified since it depends upon the covariance between X and Y. The appropriate standard error is sw/√n, as
sw = √((1/(n – 1)) Σᵢ₌₁ⁿ (wi – w̄)²) = √((1/(n – 1)) Σᵢ₌₁ⁿ (xi – yi – (x̄ – ȳ))²)
is a consistent estimate of σW .
Then, the two-sided asymptotic 1 – α confidence interval for µX – µY , the difference in population means, is
((x̄ – ȳ) – zα/2·sw/√n, (x̄ – ȳ) + zα/2·sw/√n) = ((x̄ – ȳ) – zα/2·sx–y/√n, (x̄ – ȳ) + zα/2·sx–y/√n).
One-sided asymptotic confidence intervals can also be constructed based upon the estimate x̄ – ȳ, the standard error sw/√n, and appropriate critical values.
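The following sketch implements this interval for generic paired data vectors x and y (the function name is illustrative):
# asymptotic (1 - alpha) CI for mu_X - mu_Y from paired observations
ci_meandiff <- function(x, y, alpha = 0.05) {
  w  <- x - y                    # per-unit differences w_i = x_i - y_i
  se <- sd(w)/sqrt(length(w))    # s_w / sqrt(n)
  z  <- qnorm(1 - alpha/2)
  c(mean(w) - z*se, mean(w) + z*se)
}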
Example 14.13 (Exam score data) The dataset exams contains the scores on two different exams, out of 100 points,
for a sample of 77 students. There are two variables, exam1 and exam2, indicating the scores on the first exam and the second exam, respectively.
linkA = linkB = 1. The following R code shows how the standard deviation of the difference linkA – linkB can be calculated by the two methods:
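A sketch of the two methods, assuming linkA and linkB are 0/1 columns of a data frame named visits (a placeholder name):
# method 1: form the per-user difference directly and take its sample stdev
sd(visits$linkA - visits$linkB)
# method 2: use var(X - Y) = var(X) + var(Y) - 2*cov(X, Y)
sqrt(var(visits$linkA) + var(visits$linkB) - 2*cov(visits$linkA, visits$linkB))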
For example, when the cps sample is split into two subsamples based upon union status, with one subsample being union employees and the other subsample being non-union employees, interest may lie in the difference between the population mean of weekly earnings for the two corresponding subpopulations.
Similarly, the difference in population variance or population standard deviation of weekly earnings could be
examined, or the difference in the population correlation between education and earnings for the two subpopulations
could be examined.
To formalize the two-sample setting, assume that one sample is labeled as sample A and the other sample is labeled as
sample B, where θA and θB are the underlying parameters of interest for the two corresponding populations. The sample
sizes of the two samples are nA and nB , respectively. Assume that θ̂XA and θ̂XB are asymptotically normal estimators of
θA and θB , with
θ̂XA ∼a N(θA, VA/nA) and θ̂XB ∼a N(θB, VB/nB),

where the realized parameter estimates are denoted θ̂xA and θ̂xB. The associated standard errors are se(θ̂xA) = √(V̂A/nA) and se(θ̂xB) = √(V̂B/nB), where V̂A and V̂B are consistent estimates of VA and VB.
A key property of the two-sample setting is that the two estimators θ̂XA and θ̂XB are independent of each other since
they are based upon different random samples. Therefore, when considering the asymptotic variance of the difference
θ̂XA – θ̂XB , it is not necessary to consider the covariance between the two estimators since their covariance is equal
to zero. The difference θ̂XA – θ̂XB , as a linear combination of θ̂XA and θ̂XB , is asymptotically normal. The mean of the
asymptotic sampling distribution is θA – θB, and the variance of the asymptotic sampling distribution is VA/nA + VB/nB, so that

θ̂XA – θ̂XB ∼a N(θA – θB, VA/nA + VB/nB).
To get the standard error for this estimator, the consistent estimates V̂A and V̂B are plugged in for VA and VB and a
square root is taken, leading to
se(θ̂xA – θ̂xB) = √(V̂A/nA + V̂B/nB) = √(se(θ̂xA)² + se(θ̂xB)²).
Thus, the two-sided asymptotic 1 – α confidence interval for the parameter difference θA – θB is
((θ̂xA – θ̂xB) – zα/2·√(se(θ̂xA)² + se(θ̂xB)²), (θ̂xA – θ̂xB) + zα/2·√(se(θ̂xA)² + se(θ̂xB)²)).
Conveniently, only the estimate θ̂xA and standard error se(θ̂xA ) from sample A and the estimate θ̂xB and standard error
se(θ̂xB ) from sample B are needed to calculate this confidence interval.
Example 14.15 (Widget website) For the widgets.com e-mail experiment, Example 14.5 provided confidence
intervals for the purchase probabilities πA , πB , and πC associated with the subpopulations of e-mail A recipients, e-mail
B recipients, and non-recipients. The estimates for these three parameters are the observed purchase frequencies
pA = 60/300 = 0.20, pB = 66/300 = 0.22, and pC = 360/2400 = 0.15,

with associated standard errors

se(pA) = √(pA(1 – pA)/300) = √((0.20)(0.80)/300) ≈ 0.0231,

se(pB) = √(pB(1 – pB)/300) = √((0.22)(0.78)/300) ≈ 0.0239,

and

se(pC) = √(pC(1 – pC)/2400) = √((0.15)(0.85)/2400) ≈ 0.0073.
In this example, confidence intervals are calculated for the difference in purchase probabilities for two of the
subpopulations: πA – πB , πA – πC , and πB – πC . (It’s unnecessary to separately consider the differences πB – πA or
πC – πB since their confidence intervals can be inferred directly from the confidence intervals for πA – πB and πB – πC ,
respectively.) The standard error of pA – pB , as an estimator for πA – πB , is
se(pA – pB) = √((0.20)(0.80)/300 + (0.22)(0.78)/300) ≈ 0.0332.
The asymptotic 95% confidence interval for πA – πB is
((0.20 – 0.22) – (1.96)(0.0332), (0.20 – 0.22) + (1.96)(0.0332)) ≈ (–0.085, 0.045).
This interval is quite wide, with 95% confidence that the difference in purchase probabilities πA – πB is between –8.5%
and 4.5%. The value of zero, corresponding to no difference (πA = πB ), is within this interval and therefore is plausible.
The asymptotic 95% confidence intervals for πA – πC and πB – πC can be constructed similarly. The associated
standard errors are

se(pA – pC) = √((0.20)(0.80)/300 + (0.15)(0.85)/2400) ≈ 0.0242

and

se(pB – pC) = √((0.22)(0.78)/300 + (0.15)(0.85)/2400) ≈ 0.0250.
The asymptotic 95% confidence interval for πA – πC is

((0.20 – 0.15) – (1.96)(0.0242), (0.20 – 0.15) + (1.96)(0.0242)) ≈ (0.003, 0.097),

and the 95% asymptotic confidence interval for πB – πC is

((0.22 – 0.15) – (1.96)(0.0250), (0.22 – 0.15) + (1.96)(0.0250)) ≈ (0.021, 0.119).

Unlike the confidence interval for πA – πB, these two confidence intervals provide statistical evidence of differences in the purchase probabilities, with the first confidence interval indicating 95% confidence that the difference πA – πC is between 0.3% and 9.7% and the second confidence interval indicating 95% confidence that the difference πB – πC is between 2.1% and 11.9%. The value of zero, corresponding to no difference, is not contained within either of the two confidence intervals.
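A compact R sketch of these difference calculations, using the counts from the example:
# estimates, std errors, and 95% CIs for differences in purchase probabilities
pA <- 60/300; pB <- 66/300; pC <- 360/2400
se_p <- function(p, n) sqrt(p*(1 - p)/n)     # std error of a sample proportion
z <- qnorm(0.975)
se_AB <- sqrt(se_p(pA,300)^2 + se_p(pB,300)^2)
se_AC <- sqrt(se_p(pA,300)^2 + se_p(pC,2400)^2)
se_BC <- sqrt(se_p(pB,300)^2 + se_p(pC,2400)^2)
c((pA-pB) - z*se_AB, (pA-pB) + z*se_AB)      # approx (-0.085, 0.045)
c((pA-pC) - z*se_AC, (pA-pC) + z*se_AC)      # approx (0.003, 0.097)
c((pB-pC) - z*se_BC, (pB-pC) + z*se_BC)      # approx (0.021, 0.119)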
Example 14.16 (Union workers versus non-union workers) Example 6.20 provided descriptive statistics for weekly
earnings of union workers and non-union workers, based upon the cps data. Part of the table from Example 6.20 is
reproduced below:
Sample n x̄ sx
Union workers 276 1197.7 720.0
Non-union workers 2533 946.5 749.7
Using U and NU subscripts to distinguish union and non-union workers, the standard errors associated with the sample means x̄U = 1197.7 and x̄NU = 946.5 are

se(x̄U) = 720.0/√276 ≈ 43.34 and se(x̄NU) = 749.7/√2533 ≈ 14.90,
so that the standard error of x̄U – x̄NU , as an estimator of µX,U – µX,NU , is
se(x̄U – x̄NU) = √(720.0²/276 + 749.7²/2533) ≈ 45.83.
The following R code uses the function se_meanx, defined in Section 14.3, to calculate these quantities:
# calculate std errors of the sample averages of union and non-union wages
se_union <- se_meanx(cps[cps$unionstatus=="Union","earnwk"], na.rm=TRUE)
se_nonunion <- se_meanx(cps[cps$unionstatus=="Non-union","earnwk"], na.rm=TRUE)
se_union
## [1] 43.33802
se_nonunion
## [1] 14.89694
# calculate std error of the difference in sample averages
sqrt(se_union^2 + se_nonunion^2)
## [1] 45.82688
The asymptotic 95% confidence interval for the difference in the population means of weekly earnings, µX,U – µX,NU ,
is
((1197.7 – 946.5) – (1.96)(45.83), (1197.7 – 946.5) + (1.96)(45.83)) ≈ (161, 341).
It can be said with 95% confidence that the difference between the population mean weekly earnings of union workers and non-union workers is between 161 and 341 dollars. This confidence interval provides fairly strong evidence that the difference is indeed positive, though the width of the interval indicates that the estimated difference of 251 dollars is not very precise.
How about the difference in the population standard deviation of weekly earnings for union versus non-union
workers? That is, what can be said about the difference in the variation of wage distributions for the two
subpopulations? The sample standard deviations sx,U = 720.0 and sx,NU = 749.7 are estimates of the population
standard deviations σX,U and σX,NU . Based upon the formula from Section 14.4.1 for the standard error of the sample
standard deviation, the associated standard errors are
se(sx,U ) = 55.18 and se(sx,NU ) = 34.56.
Then, the standard error of sx,U – sx,NU , as an estimator of σX,U – σX,NU , is
se(sx,U – sx,NU) = √(se(sx,U)² + se(sx,NU)²) ≈ 65.1.
The following R code uses the function se_sx, defined in Section 14.4.1, to calculate these quantities:
# calculate std errors of the sample stdevs of union and non-union earnings
# (assuming se_sx accepts na.rm, like se_meanx above)
se_sx_union <- se_sx(cps[cps$unionstatus=="Union","earnwk"], na.rm=TRUE)
se_sx_nonunion <- se_sx(cps[cps$unionstatus=="Non-union","earnwk"], na.rm=TRUE)
se_sx_union
## [1] 55.18488
se_sx_nonunion
## [1] 34.56397
# calculate std error of difference in stdevs of earnings
sqrt(se_sx_union^2 + se_sx_nonunion^2)
## [1] 65.11558
The asymptotic 95% confidence interval for the difference in population standard deviations of weekly earnings,
σX,U – σX,NU , is
((720.0 – 749.7) – (1.96)(65.1), (720.0 – 749.7) + (1.96)(65.1)) ≈ (–157, 98),
indicating no statistical evidence of a difference between the two population standard deviations.
To illustrate the generality of this two-sample approach, we consider looking at a difference in correlations for the
two subpopulations. Specifically, what can be said about the difference in the population correlation between weekly
earnings and education for union versus non-union workers? Does the confidence interval for the difference provide
any statistical evidence that the earnings-education relationship is different for union and non-union workers? For notation, let
rxy,U and rxy,NU denote the sample correlations between earnwk and educ for union workers and non-union workers,
respectively, and let ρXY,U and ρXY,NU denote the corresponding population correlations. The sample correlations
between earnwk and educ for the two subsamples are
rxy,U ≈ 0.253 and rxy,NU ≈ 0.329.
cor(cps[cps$unionstatus=="Union","earnwk"],cps[cps$unionstatus=="Union","educ"],use="complete.obs")
## [1] 0.2529239
cor(cps[cps$unionstatus=="Non-union","earnwk"],cps[cps$unionstatus=="Non-union","educ"],use="complete.obs")
## [1] 0.3287519
The optional argument use="complete.obs" tells the cor function to use only those observations for which
all variables have non-missing (non-NA) values. This argument is similar to the na.rm=TRUE optional argument for
functions like mean and sd.
Using the formula for the standard error of a sample correlation (Section 14.4.2), the standard errors are
se(rxy,U) = (1 – 0.253²)/√276 ≈ 0.056 and se(rxy,NU) = (1 – 0.329²)/√2533 ≈ 0.018.
Then, the standard error of rxy,U – rxy,NU , as an estimator of ρXY,U – ρXY,NU , is
se(rxy,U – rxy,NU) = √(se(rxy,U)² + se(rxy,NU)²) ≈ 0.059.
The following R code uses the function se_rxy, defined in Section 14.4.2, to calculate these quantities:
# calculate std errors of the two sample correlations
# (assuming se_rxy takes the two data vectors, as defined in Section 14.4.2)
se_rxy_union <- se_rxy(cps[cps$unionstatus=="Union","earnwk"],
                       cps[cps$unionstatus=="Union","educ"])
se_rxy_nonunion <- se_rxy(cps[cps$unionstatus=="Non-union","earnwk"],
                          cps[cps$unionstatus=="Non-union","educ"])
se_rxy_union
## [1] 0.05634235
se_rxy_nonunion
## [1] 0.01772186
sqrt(se_rxy_union^2 + se_rxy_nonunion^2)
## [1] 0.05906374
As an example, suppose we are interested in the quantity IQRX/σX, which is the population IQR of the random variable X in terms of standard deviations (e.g., a value of 4 would indicate that the IQR is four standard deviations wide). For i.i.d. random variables X1, X2, …, Xn, consistent estimators of IQRX and σX are X̃0.75 – X̃0.25 and sX, respectively, and Proposition 14.10 implies that

(X̃0.75 – X̃0.25)/sX

is a consistent estimator of IQRX/σX.
As another example, suppose we are interested in comparing two population correlations ρX1 X2 and ρX3 X4 based
upon data from a single dataset (like the sp500 dataset). Consistent estimators are the sample correlations rX1 X2
and rX3 X4 , respectively. Two alternative ways to compare correlations are to look at differences or to look at ratios.
Proposition 14.10 implies that the difference rX1X2 – rX3X4 is a consistent estimator of the difference ρX1X2 – ρX3X4 and also that the ratio rX1X2/rX3X4 is a consistent estimator of the ratio ρX1X2/ρX3X4 (if ρX3X4 ≠ 0).
How about asymptotic normality? The good news is that, due to a result known as the delta method, functions of
estimators will generally be asymptotically normal if the underlying estimators are themselves asymptotically normal.
For the single-parameter case, the only additional assumption needed is that the function f (·) is differentiable at the
true parameter θ. The following proposition provides the asymptotic-variance formula for the single-parameter case:
Proposition 14.11. (Delta method) If θ̂X is an asymptotically normal estimator of θ, with √n(θ̂X – θ) ∼a N(0, V), and f(·) is a continuous function that is differentiable at θ, then f(θ̂X) is an asymptotically normal estimator of f(θ), with

√n(f(θ̂X) – f(θ)) ∼a N(0, f′(θ)²V)

or, equivalently,

f(θ̂X) ∼a N(f(θ), f′(θ)²V/n).
In the one-parameter case with an estimator f(θ̂X), the mean of the asymptotic distribution is f(θ) since f(θ̂X) is a consistent estimator of f(θ); the variance of the asymptotic distribution is f′(θ)²·V/n, where V/n is the asymptotic variance of the original estimator. For example, for the odds estimator f(X̄) = X̄/(1 – X̄) discussed above, f′(π) = 1/(1 – π)² is obtained by taking the derivative of π/(1 – π) with respect to π. Therefore, the asymptotic variance of f(X̄) = X̄/(1 – X̄), as an estimator of f(π) = π/(1 – π), is

(1/(1 – π)²)² · (π(1 – π)/n) = π/(n(1 – π)³)

since π(1 – π)/n is the asymptotic variance of X̄ as an estimator of π. Based upon this asymptotic variance, the asymptotic standard deviation is √(π/(n(1 – π)³)), leading to the following standard error for the odds estimator:

se(x̄/(1 – x̄)) = √(x̄/(n(1 – x̄)³)).
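A sketch of the odds estimate and its delta-method standard error for a generic 0/1 data vector (the function name is illustrative):
# odds estimate xbar/(1-xbar) and its std error sqrt(xbar/(n*(1-xbar)^3))
odds_with_se <- function(x) {
  n <- length(x); xbar <- mean(x)
  c(odds = xbar/(1 - xbar), se = sqrt(xbar/(n*(1 - xbar)^3)))
}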
In the two-parameter case with an estimator f (θXa , θXb ), the mean of the asymptotic normal distribution is f (θa , θb );
the variance of the asymptotic distribution is more complicated than the one-parameter case since it involves partial
derivatives and the covariance between the two estimators.
Definition 14.9 An asymptotic (1 – α) predictive interval for X is (θ̂α/2 , θ̂1–α/2 ), where θ̂α/2 is a consistent estimate of
the population α/2’th quantile τX,α/2 and θ̂1–α/2 is a consistent estimate of the population (1 – α/2)’th quantile τX,1–α/2 .
There are two types of asymptotic predictive intervals, model-free intervals and model-based intervals. A model-free
interval uses the sample quantiles as consistent estimates of the population quantiles, so that the model-free asymptotic
(1 – α) predictive interval is
(θ̂α/2 , θ̂1–α/2 ) = (x̃α/2 , x̃1–α/2 ).
An advantage of this interval is that it doesn’t require any knowledge or assumptions about the distribution of X. For
the case of α = 0.05, the asymptotic 95% predictive interval has lower endpoint x̃0.025 and upper endpoint x̃0.975 . For
a large sample, there is approximately a 95% probability that a new draw of X from the population is between x̃0.025
and x̃0.975 . Since x̃0.025 and x̃0.975 are only estimates of the population quantiles τX,0.025 and τX,0.975 , it is a good idea
to calculate the standard errors for both x̃0.025 and x̃0.975 as a way of assessing how close the estimated endpoints are
likely to be to the true endpoints.
A model-based interval uses the model and estimates of the model’s parameters rather than the sample quantiles.
As an example, for a normal random variable X ∼ N(µ, σ 2 ), Section 11.1 provided the (1 – α) probability interval
(τX,α/2 , τX,1–α/2 ) = (µ – zα/2 σ, µ + zα/2 σ),
which suggests the model-based asymptotic (1 – α) predictive interval
(θ̂α/2 , θ̂1–α/2 ) = (x̄ – zα/2 sx , x̄ + zα/2 sx ).
Since X̄ is a consistent estimator of µ and sX is a consistent estimator of σ, X̄ – zα/2 sX and X̄ + zα/2 sX are consistent
estimators of µ – zα/2 σ and µ + zα/2 σ, respectively. The realized endpoints x̄ – zα/2 sx and x̄ + zα/2 sx become arbitrarily
close to the true endpoints as the sample size gets larger. The model-based interval for a normal random variable has
its endpoints equidistant from x̄, whereas a model-free interval would not necessarily have this property. Also, if the
model is true, it’s likely that a model-based interval provides more precise estimates of the endpoints than a model-
free interval does. The intuition is that the model-based interval uses additional information, specifically the model
being assumed for X, as compared to the model-free interval. That said, for very large samples, the model-free and
model-based intervals should look quite similar since both are based upon consistent estimates of the endpoints.
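The two intervals are easy to compare in R; a sketch for a generic data vector (the function name is illustrative):
# model-free vs model-based asymptotic 95% predictive intervals
pred_intervals <- function(x, alpha = 0.05) {
  z <- qnorm(1 - alpha/2)
  list(model_free  = quantile(x, c(alpha/2, 1 - alpha/2)),       # sample quantiles
       model_based = c(mean(x) - z*sd(x), mean(x) + z*sd(x)))    # normal model
}
pred_intervals(rnorm(10000, mean = 10, sd = 2))   # both intervals near (6.08, 13.92)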
Notes
44 If one estimator is inconsistent, then the other (consistent) estimator is generally preferred.
45 The number of confidence intervals, out of 100, for which µ is outside the confidence interval is a Binomial(100, 0.05) random variable. If the
number of Monte Carlo simulations grows large, the percentage of simulations for which µ falls outside the 95% confidence interval gets arbitrarily close to 5%.
46 The sample variance sx² of a Bernoulli random variable is (1/(n – 1)) Σᵢ₌₁ⁿ (xi – x̄)², which simplifies to

(1/(n – 1)) Σᵢ₌₁ⁿ (xi² – 2xi·x̄ + x̄²) = (1/(n – 1)) (Σᵢ₌₁ⁿ xi² – 2x̄ Σᵢ₌₁ⁿ xi + nx̄²) = (1/(n – 1)) (nx̄ – 2nx̄² + nx̄²) = (n/(n – 1)) x̄(1 – x̄),

applying the fact that xi² = xi for an indicator variable xi. Thus, sx = √(n/(n – 1)) · √(x̄(1 – x̄)).
Exercises
1. A professor wants to estimate the probability of cannabis use (in the last year) in the population of students who
take economic statistics. Concerned that students might not honestly answer a direct question about cannabis use, she
uses a method known as randomized response to elicit honest responses. There are 300 students in her economic
statistics class. Before class, she creates 300 pieces of paper, numbered 1 through 300. As students come into class,
they randomly pick a piece of paper (without the professor knowing their number). Once everyone is seated, she gives
the following instructions: “Please answer the question Is the last digit of your phone number even? if your piece of
paper has a number less than or equal to 150. Please answer the question Have you used cannabis in the last year? if
your piece of paper has a number greater than 150.” Using a phone-response system, she observes the total number
of “yes” responses and nothing else. Let Y denote the random variable associated with the total number of “yes”
responses from the 300-student class. Let π denote the probability of cannabis use in the last year. Assume (i) students
honestly respond to their question and (ii) the probability of an even last-digit is 0.5.
(a) What is E(Y) in terms of π?
(b) Propose an estimator of π that is a function of Y, and show that the proposed estimator is unbiased. What is the
value of the estimate when the realization of Y is 125?
(c) *The estimator in (b) estimates the unconditional probability of cannabis use in the last year. Using Bayes’
Theorem, determine both (i) the conditional probability of cannabis use given a “yes” response and (ii) the
conditional probability of cannabis use given a “no” response in terms of π. Plug in the estimate of π from (b)
(when Y = 125) to yield estimates of these conditional probabilities.
(d) Suppose an alternative randomization technique is used. Rather than the paper system, each student is asked
to flip a coin and, then, to answer the first question if the flip is heads and the second question if the flip is tails.
In this case, while the expected number of students answering the first question is 150, the actual number may
be different from 150. Is the estimator from (b) still unbiased? How would you expect the variance of the
coin-flip-based estimator to compare to the variance of the paper-based estimator?
(e) Returning to the paper randomization, suppose the professor alters the instructions to have the students answer
the first question if their number is less than or equal to 75 and the second question if their number is greater
than 75. What is an appropriate estimator of π (in terms of Y) in this case?
(f) In this part, computer simulations will be used to compare the performance of the estimators for the three
different randomized-response alternatives (alternative 1: original question, alternative 2: part (d), alternative
3: part (e)). Assume that the probability of cannabis use in the last year is known to be 30%. For each of
the three alternatives, conduct 10,000 simulations in R of the 300-student class responses and calculate the
associated estimates. What are the averages of the estimates for each of the three alternatives? What are the
standard deviations of the estimates for each of the three alternatives? Do the relative sizes of the standard
deviations make sense?
2. *Consider a negative binomial random variable X ∼ NegBin(r, π).
(a) For r ≥ 2, show that π̂X = (r – 1)/(X + r – 1) is an unbiased estimator of π.
(b) An employee at an investment management company has been asked to call the company’s clients to see which
ones are interested in receiving information about a new mutual fund. The employee has decided that she
will take a break after the fourth successful call, where “success” means the client is interested in receiving
information about the new mutual fund. Based on (a), what is the estimate of the success probability if the
successes occur on the 5’th, 12’th, 14’th, and 21’st calls?
(c) For r ≥ 1, show that θ̂X = (X + r)/r is an unbiased estimator of θ = 1/π.
(d) The result in (c) suggests an alternative estimator of π, given by π̃X = r/(X + r). Using this estimator, what is the estimate of the success probability for the scenario described in (b)?
(e) It turns out that both π̂X and π̃X are consistent estimators of π (when we think of r growing very large). In this
part, computer simulations will be used to compare the performance of the alternative estimators. Specifically,
we consider the scenario from (b), but we assume that we know that the true success probability is 17% (π =
0.17). For r = 4, conduct 100,000 simulated i.i.d. draws of X ∼ NegBin(r, π) in R and, for each draw, calculate
the two estimates based upon π̂X and π̃X .
i. What are the average values of the two estimators (over the simulations)?
ii. One way to compare estimators is to calculate the mean absolute error, defined as the average of
|estimate – true value| over the simulations. What are the mean absolute errors of the two estimators?
iii. Another way to compare estimators is to calculate the mean squared error, defined as the average of
(estimate – true value)2 over the simulations. What are the mean squared errors of the two estimators?
iv. Repeat the simulations and parts i.-iii. for r = 100 instead of r = 4.
3. Referring to Proposition 13.3, explain why the sum of i.i.d. random variables, S = Σᵢ₌₁ⁿ Xi, is an unbiased estimator of nµX but not a consistent estimator of nµX.
4. Consider an i.i.d. random sample x1 , x2 , …, xn drawn from a N(10, 4) distribution, where n is a very large number.
In thinking about the box plot with whiskers and outliers (Section 6.5.1), provide your best guesses for the values of
the following quantities:
(a) Sample median (line within the box)
(b) Top and bottom of the box
(c) Upper and lower whiskers
(d) The percentage of points (“outliers”) above the upper whisker
5. Assume that the monthly returns on a certain asset are i.i.d. draws from a normally distributed random variable X
with unknown mean and standard deviation. You observe the monthly returns for one year (n = 12), and calculate the
sample average to be 0.01 and the sample standard deviation to be 0.08.
(a) Using the appropriate t distribution, provide 95% and 99% confidence intervals for the population mean µX .
(b) How do the confidence intervals in (a) compare to the confidence intervals based upon a N(0, 1) distribution,
rather than the t distribution, for calculating the critical values?
(c) Suppose instead that n = 120, and you calculate the same sample average (0.01) and sample standard deviation
(0.08). Using the normal approximation, provide 95% and 99% confidence intervals for µX .
6. Assume that IQ scores are normally distributed, but the population mean and standard deviation of the normal
distribution are unknown. You collect a random sample of 20 individuals for which you calculate x̄ = 98 and sx = 12.
(a) Using the appropriate t distribution, provide 90% and 95% confidence intervals for the population mean µX .
(b) How do the confidence intervals in (a) compare to the confidence intervals based upon a N(0, 1) distribution,
rather than the t distribution, for calculating the critical values?
(c) If you instead had x̄ = 102 and sx = 12, how would the width of the 95% confidence interval for µX compare to
the one in (a)?
(d) If you instead had x̄ = 98 and sx = 16, how would the width of the 95% confidence interval for µX compare to
the one in (a)?
7. A company’s weekly profits are i.i.d. draws of a normal random variable X ∼ N(µ, σ 2 ). After n weeks of profits
(x1 , x2 , …, xn ) are observed, the following two confidence intervals for µ, based on t-distribution inference, are
constructed:
95% confidence interval for µ: (7.901, 12.099)
and
90% confidence interval for µ: (8.355, 11.645).
(a) What is the sample average of weekly profits?
(b) What is n? (Hint: Think about what the ratio of confidence-interval widths says about the ratio between the
critical values. Then, use R to determine the value of n consistent with that ratio.)
(c) What is the sample standard deviation of weekly profits?
(d) What are the one-sided 95% confidence intervals for µ? There are two such intervals, one of the form (L, ∞)
and one of the form (–∞, U).
8. A car factory implements a new production process and, over the course of the first 7 days, produces 198, 208, 206,
225, 234, 210, and 187 cars. Assume the daily production numbers are i.i.d. draws from a normal distribution.
(a) Calculate the sample mean and sample standard deviation of daily production.
(b) Determine the one-sided 95% confidence interval, of the form (L, ∞), for the population average of daily
production.
9. You are a fast-food mogul and own 50 franchises of Longhorn Burgers and 40 franchises of Perfect Pitas. Suppose
the monthly revenue (in thousands of dollars) for every franchise is an i.i.d. random variable, with Longhorn Burgers
revenues drawn from a distribution with population mean 20 and population standard deviation 4 and Perfect Pitas
revenues drawn from a distribution with population mean 15 and population standard deviation 3.
(a) In a given month, what is the approximate distribution of the sample average of the monthly revenues at the 40
franchises of Perfect Pitas?
(b) In a given month, what is the approximate distribution of the sample average of monthly revenues at all 90
franchises?
10. The two-sided asymptotic 95% confidence interval for a certain parameter is (10, 20).
(a) What is the two-sided asymptotic 80% confidence interval for the same parameter?
(b) For the one-sided asymptotic 95% confidence interval (L, ∞), what is the value of L?
11. A university’s IT department has a large inventory of old computer monitors, some of which are no longer working.
To estimate the proportion of non-working monitors, the IT staff tests 60 of them (at random) and finds that 15 are not
working. Find the asymptotic 95% confidence interval for the population proportion of monitors that are not working.
12. A landscaping company offers a promotion in a suburban neighborhood, whereby they mow a house’s lawn free the
first time and then ask the homeowner if they would like to continue having regular (paid) service thereafter. Suppose
n homeowners allow the company to mow their lawn for free. The company’s owner is interested in the probability π
that such homeowners continue with the paid service. After observing the proportion of n homeowners that continue
with the paid service, the company’s owner finds that the asymptotic 95% confidence interval for π is (0.1503, 0.2997).
(a) What is the sample proportion of homeowners that continued with the paid service?
(b) What is n?
13. An advertising company wishes to estimate the population mean of the distribution of hours of television watched
per household per day. Suppose the population standard deviation of hours watched per household per day is known
to be 2.8 hours. The company decides that it wants the asymptotic 99% confidence interval for the population mean to
be no wider than 0.5 hours. What is the minimum sample size that results in a small enough confidence interval?
14. Use the exams dataset, which contains data for 77 students on two different exams (exam1 and exam2). Suppose
exam1 scores are i.i.d. draws from a random variable with mean µ1 and standard deviation σ1 and exam2 scores are
i.i.d. draws from a random variable with mean µ2 and standard deviation σ2 .
(a) Which is larger: (i) the middle of the asymptotic 95% confidence interval for µ1 or (ii) the middle of the
asymptotic 95% confidence interval for µ2 ?
(b) Which is larger: (i) the width of the asymptotic 95% confidence interval for µ1 or (ii) the width of the asymptotic
95% confidence interval for µ2 ?
(c) What is the asymptotic 90% confidence interval for µ1 ?
(d) What is the asymptotic 90% confidence interval for µ1 + µ2 , the sum of the exam population means?
(e) What is the asymptotic 90% confidence interval for the population correlation ρexam1,exam2 ?
(f) The standard errors of the sample standard deviations of exam1 and exam2 are 2.1122 and 2.8632,
respectively. Which is larger: (i) the upper endpoint of the asymptotic 95% confidence interval of σexam1 or
(ii) the upper endpoint of the asymptotic 95% confidence interval of σexam2 ?
15. A researcher surveys 500 individuals on their level of happiness x (on a scale from 1 to 10) and their annual income
y (in thousands of dollars). Assume that the observed data are i.i.d. draws from the joint distribution of the underlying
random variables (X, Y).
(a) Before the researcher observes the data, what is the largest possible width of the asymptotic 95% confidence
interval for the population correlation ρXY (i.e., over all possible realizations of rxy )?
(b) The researcher calculates rxy = 0.23 based upon the observed sample. What is the asymptotic 95% confidence
interval for ρXY ?
(c) Using rxy = 0.23 again, what is the one-sided asymptotic 95% confidence interval, of the form (L, 1), for ρXY ?
(The upper end is 1 and not ∞ since ρXY ≤ 1.)
16. In a random survey of 250 undergraduates, 150 respondents indicate that they consume caffeine daily.
(a) Provide an asymptotic 95% confidence interval for the probability πC that an undergraduate from the population
consumes caffeine daily.
(b) Another survey of a different sample of 250 undergraduates indicates that 140 out of the 250 respondents sleep
at least seven hours on a daily basis. Let πS denote the true probability that an undergraduate sleeps at least seven
hours on a daily basis. Provide an asymptotic 95% confidence interval for the difference πC – πS .
17. In the sp500 data, 236 of the 360 monthly returns for the S&P 500 Index (idx) are positive. Assume that each
monthly return is an i.i.d. draw from some underlying random variable.
(a) Provide an asymptotic 90% confidence interval for the probability that the S&P 500 Index monthly return is
positive.
(b) A risk-averse investor is worried about large negative returns. In the sp500 data, 33 of the 360 monthly returns
for the S&P 500 Index are less than –0.05. Provide a one-sided asymptotic 95% confidence interval, of the
form (–∞, U), for the probability that the S&P 500 Index monthly return is less than –0.05.
18. Use the cps dataset for this question.
(a) Form a table of gender and union to determine how many male workers are union vs non-union and how
many female workers are union vs non-union.
(b) What is the asymptotic 95% confidence interval for the probability that a male worker is in a union?
(c) What is the asymptotic 95% confidence interval for the probability that a female worker is in a union?
(d) What is the asymptotic 95% confidence interval for the difference between the probability that a male worker
is in a union and the probability that a female worker is in a union?
19. A survey of 150 students from college A finds that the average SAT math score is 610 with a sample standard
deviation of 50. A similar survey of 200 students from college B finds that the average and standard deviation are 580
and 45, respectively.
(a) What is the asymptotic 90% confidence interval for the population average of SAT math scores at college A?
(b) What is the asymptotic 90% confidence interval for the population average of SAT math scores at college B?
(c) What is the asymptotic 90% confidence interval for the difference between the population average at college A
and the population average at college B?
(d) Based upon the confidence interval from (c), do you think that it is likely that the population average for college
A students is larger than the population average for college B students?
20. *An economist wants to study the gender wage gap in a particular industry by collecting salary data from male and
female workers. In this industry, 80% of workers are male. Let µm = E(Xm ) and µf = E(Xf ) denote the population means
of salaries for male and female workers, respectively. The economist is interested in forming confidence intervals for
the difference µm – µf , and the size of the total sample collected (male and female combined) is n. Let γ be the
proportion of the sample made up by male workers, so that there are γn male workers and (1 – γ)n female workers.
(a) If σm2 = Var(Xm ) and σf2 = Var(Xf ), what is the asymptotic variance of X̄m – X̄f in terms of n, γ, σm2 , and σf2 ?
(b) What value of γ (in terms of σm , σf , and/or n) minimizes the asymptotic variance of X̄m – X̄f ?
(c) If the variances of male and female wages are the same (σm2 = σf2 ), what value of γ minimizes the asymptotic
variance of X̄m – X̄f ?
(d) If the economist collects the data with a simple random sample, the proportion of male workers in the
sample will be approximately 80%. For the case of equal wage variances (σm2 = σf2 ), how would the width
of the asymptotic confidence interval based upon γ = 0.8 (simple random sample) compare to the width of
the asymptotic confidence interval using the optimal γ found in (c)? Does this argue for oversampling or
undersampling of female workers?
21. The number of customers that enter a coffee shop during a given minute of the day (say, between 2:00pm and
2:01pm) is distributed as a Poisson(λ) random variable X, where λ is an unknown parameter. Suppose the coffee
shop gathers data over the course of 100 days, each day recording the number of customers that enter between 2:00pm
and 2:01pm, and finds that the sample average is 2.12 and the sample standard deviation is 1.53. You may assume that
the number of customers on each day is independent from other days.
(a) The population mean of a Poisson random variable X is µX = λ. Provide an asymptotic 95% confidence interval
for λ.
(b) Thinking of the sample average x̄ as an estimate of µX = λ, what is the estimated probability of having exactly
two customers enter between 2:00pm and 2:01pm on a given day? (Plug x̄ in for λ in the Poisson pmf formula.)
(c) Repeat (b), but now use the lower and upper endpoints of the confidence interval from (a) to calculate the
estimated probability of having exactly two customers enter between 2:00pm and 2:01pm on a given day.
(Since there is uncertainty in our estimate of λ, as reflected by the confidence interval, this part shows how that
uncertainty translates to estimation of the probability value.)
22. Use the strikes dataset for this question. This dataset contains information on worker contract strikes within United
States manufacturing for the period 1968-1976. There are 566 observations on the variable duration (strike duration,
in weeks).
(a) What are the sample mean, sample median, and sample standard deviation of duration?
(b) Draw a (density) histogram of duration with 10 bins.
(c) Given the right-skewed nature of duration, you consider whether a log-normal distribution might be a good
description of duration. Generate a new variable lndur equal to the natural logarithm of duration.
(d) If lndur ∼ N(µ, σ 2 ) is true, the sample mean of lndur is a consistent estimator of µ. Provide an asymptotic 95%
confidence interval for µ.
(e) If lndur ∼ N(µ, σ²) is true, the expected value of duration is e^(µ + σ²/2). Using the sample mean and sample
standard deviation of lndur as estimates of µ and σ, respectively, plug into the expected-value formula to
get an estimated expected value of duration. How does this estimate compare to the sample mean of duration?
(f) Draw a (density) histogram of lndur with 10 bins. What do you conclude about the log-normal distribution
being a good model for duration?
23. The number of workplace injuries at a certain factory is tracked over 200 weeks. The average of the weekly number
of injuries is 0.4. Assume that each weekly observation is an i.i.d. draw from a Poisson(λ) random variable.
(a) Provide an asymptotic 90% confidence interval for λ.
(b) Show that P(X > 0) = 1 – P(X = 0) is an increasing function of λ.
(c) Based upon (b), the endpoints of an asymptotic 90% confidence interval for P(X > 0) can be determined from the following two probabilities: (i) the probability that there are any workplace injuries in a given week based upon the Poisson(λL) distribution, where λL is the lower endpoint of the interval from (a), and (ii) the probability that there are any workplace injuries in a given week based upon the Poisson(λU) distribution, where λU is the upper endpoint of the interval from (a). Calculate these endpoints.
24. A convenience store sells lottery tickets. On the day of a large drawing, the time (in minutes) between lottery-ticket
purchases can be considered i.i.d. draws of an exponential random variable X ∼ Exp(θ). Suppose the average of 100
observed times-between-purchases is 1.8 minutes.
(a) What is the asymptotic 90% confidence interval for 1/θ?
(b) Use the endpoints of the interval from (a) to form an asymptotic 90% confidence interval for θ. The resulting
interval need not be symmetric. (Hint: P(L ≤ 1/θ ≤ U) = P(1/U ≤ θ ≤ 1/L).)
(c) Based on the continuous mapping theorem (Proposition 14.9), how would you consistently estimate θ given
the consistent estimator of 1/θ?
(d) *Based on the delta method, what is the asymptotic standard error associated with the estimate from (c)?
(e) Provide an asymptotic 90% confidence interval for θ using the estimate from (c) and the standard error from
(d). How does this interval compare to the one found in (b)?
25. A prolific inventor submits 150 patent applications to the U.S. Patent and Trademark Office (USPTO), and
110 are successful (resulting in a patent being issued). Assume that the success of each patent application is an
i.i.d. Bernoulli(π) random variable.
(a) What are the estimated odds of a patent application being successful? (Recall that odds is defined as π/(1 – π).)
(b) Provide an asymptotic 95% confidence interval for the odds of a patent application being successful.
26. For i.i.d. X1 , X2 , …, Xn ∼ Poisson(λ) random variables, a consistent estimator of µX = λ is the sample average.
(a) Based on the continuous mapping theorem (Proposition 14.9), how would you consistently estimate the population standard deviation (σX = √λ) of the underlying Poisson random variable?
(b) *Based on the delta method, provide a formula for the asymptotic standard error associated with the estimate from (a). If x̄ = 1.2 and n = 200, what is the asymptotic 95% confidence interval for sd(X) = √λ?
15 The bootstrap
Chapter 14 considered estimation of standard errors and confidence intervals based upon two types of statistical
inference: finite-sample inference and asymptotic inference. For finite-sample inference, the finite-sample (exact)
sampling distribution results from Chapter 12 were used as the basis for confidence intervals for the population mean
of i.i.d. normal random variables. For asymptotic inference, the more general asymptotic (large-sample) sampling
distribution results from Chapter 13 were used as the basis for confidence intervals associated with any asymptotically normal estimator.
This chapter introduces another type of statistical inference, known as bootstrap inference or, more concisely, the
bootstrap. The bootstrap is a resampling-based method that can be used as an alternative to the inference approaches
in Chapter 14. Why might the bootstrap be needed as an alternative method for statistical inference? Here are three
different reasons:
1. Even with a very large sample, when the asymptotic distribution provides a good approximation, the asymptotic
variance (or standard deviation) formula might not provide a simple approach for estimation of an estimator’s standard
error. For example, as seen in Chapter 13, when using a sample quantile to estimate a population quantile, the formula
for the asymptotic standard deviation involves an unknown pdf fX (·) value that needs to be estimated for a standard
error to be calculated. The bootstrap provides an alternative approach that avoids estimation of the pdf altogether.
2. In certain situations, we may be interested in estimating a parameter for which the asymptotic variance is not readily
available and/or difficult to determine. For example, suppose the two random variables X and Y are associated with two
variables in an observed sample and we are interested in the difference between the population standard deviations,
σX – σY . Section 14.4.4 considered a similar situation in which an asymptotic confidence interval for µX – µY was
proposed, but the reasoning used there does not obviously extend to σX – σY . While sX – sY is an appropriate estimator
of σX – σY , determining the asymptotic standard deviation of sX – sY , as an estimator of σX – σY , is difficult since
sX and sY are not necessarily independent of each other. The bootstrap provides an alternative approach for this
situation that does not require the researcher to analytically determine the asymptotic variance. There are many other similar situations, in which the asymptotic variance is difficult to obtain, for which the bootstrap provides an appealing alternative. For example, the bootstrap can be used to form a confidence interval for the difference between the
population mean µX and population median τX,0.5 , based upon a random sample associated with a random variable X.
As another example, the bootstrap can be used to form a confidence interval for the difference between two population
correlations, based upon a random sample of multivariate data. For the labor force data, for instance, if we are interested
in the relative magnitudes of ρearnwk,educ and ρearnwk,age , a confidence interval for ρearnwk,educ – ρearnwk,age would be useful.
3. The observed sample may not be large enough for the asymptotic distribution to provide a good approximation of
an estimator’s true sampling distribution. While a finite-sample distribution can be used in some specific cases, like
estimation of the population mean of i.i.d. normal random variables, it is more likely that a finite-sample distribution
will not be available. For instance, for estimation of the population mean µX of i.i.d. random variables with an
unknown distribution, there is no finite-sample distribution that can be used, but the asymptotic sampling distribution
may not provide a good approximation if the sample is very small. As another example, the asymptotic distribution
associated with estimation of the true correlation ρXY between two random variables X and Y is known to provide a
poor approximation of the true sampling distribution when the sample is small and the true correlation ρXY is close to
1 or –1. For these examples and others where we might be concerned the sample is not large enough, the bootstrap
provides an alternative approach to construct confidence intervals.
Definition 15.1 For a given sample of size n, a bootstrap sample of size n is constructed by treating the original
sample as the population of interest and sampling with replacement by making n i.i.d. draws from the sample.
This resampling is done many times, and the number of bootstrap replications is denoted B. Some additional notation
is required to distinguish bootstrap sample observations from the original sample observations. For a univariate sample
{x1 , x2 , …, xn }, a given bootstrap sample is denoted
{x1b , x2b , …, xnb } for b ∈ {1, 2, …, B}.
For a bivariate sample {(x1, y1), (x2, y2), …, (xn, yn)}, a given bootstrap sample is denoted
{(x1b, y1b), (x2b, y2b), …, (xnb, ynb)} for b ∈ {1, 2, …, B}.
If there are more than two variables, the notation can be generalized to include additional variables.
Example 15.1 Consider the following bivariate data with seven observations (n = 7):
{(xi , yi )}7i=1 = {(4, 8), (3, 6), (8, 10), (12, 1), (0, 15), (10, 3), (5, 6)}.
These data can be shown in a table, similar to how they might appear in a spreadsheet:
i xi yi
1 4 8
2 3 6
3 8 10
4 12 1
5 0 15
6 10 3
7 5 6
To create the first bootstrap sample, associated with b = 1, the computer draws a sample of seven observations, with
replacement, from the original sample of seven observations. We make seven i.i.d. draws from the set of row numbers
{1, 2, 3, 4, 5, 6, 7}, where each row number is equally likely to be drawn, each with probability 1/7. Suppose the seven
draws are
7, 5, 2, 5, 6, 7, 3,
meaning the seventh row is drawn first, the fifth row is drawn second, the second row is drawn third, and so on, yielding
the first bootstrap sample:
i xi1 yi1
1 5 6
2 0 15
3 3 6
4 0 15
5 10 3
6 5 6
7 8 10
There are a couple of important things to note. First, the entire row is drawn each time a bootstrap observation is
created. That is, the x and y variables are jointly drawn for each bootstrap observation, which is important so that the
relationship between the x and y variables from the original sample is reflected and preserved in the bootstrap sample.
Second, due to the sampling being done with replacement, repeated observations are to be expected in the bootstrap
sample. In this first bootstrap sample, the fifth and seventh observations from the original sample both appear twice.
Suppose the seven draws for the second bootstrap sample (b = 2) are
5, 2, 2, 2, 1, 4, 6,
the seven draws for the third bootstrap sample (b = 3) are
2, 4, 6, 5, 7, 2, 1,
and the seven draws for the fourth bootstrap sample (b = 4) are
4, 3, 3, 5, 4, 1, 4.
The following table shows the original sample alongside the four bootstrap samples. For the original sample and
the bootstrap samples, the table also shows seven different descriptive statistics: means of x and y, medians of x and
y, standard deviations of x and y, and the correlation between x and y.
Sample Bootstrap samples
i xi yi xi1 yi1 xi2 yi2 xi3 yi3 xi4 yi4
1 4 8 5 6 0 15 3 6 12 1
2 3 6 0 15 3 6 12 1 8 10
3 8 10 3 6 3 6 10 3 8 10
4 12 1 0 15 3 6 0 15 0 15
5 0 15 10 3 4 8 5 6 12 1
6 10 3 5 6 12 1 3 6 4 8
7 5 6 8 10 10 3 4 8 12 1
x̄ 6 4.43 5 5.29 8
x̃0.5 5 5 3 4 8
sx 4.20 3.78 4.32 4.23 4.62
ȳ 7 8.71 6.43 6.43 6.57
ỹ0.5 6 6 6 6 8
sy 4.62 4.75 4.43 4.43 5.62
rxy –0.790 –0.762 –0.845 –0.870 –0.898
Focusing first on the mean of the x variable, the original sample has sample mean x̄ = 6. Due to the randomness
associated with the construction of the bootstrap samples, the sample means for the bootstrap samples should not
be expected to be equal to 6, and the sample means of x are 4.43, 5, 5.29, and 8 for the four bootstrap samples.
Similarly, for the other descriptive statistics, the statistic associated with any bootstrap sample can differ from the
statistic associated with the original sample.
The following R code constructs a single bootstrap sample:
set.seed(1234)
# original sample from Example 15.1
df <- data.frame(x = c(4, 3, 8, 12, 0, 10, 5),
                 y = c(8, 6, 10, 1, 15, 3, 6))
# draw seven index values with replacement, then form the bootstrap sample
bs_index <- sample(1:7, 7, replace = TRUE)
bs_df <- df[bs_index, ]
bs_df
## x y
## 4 12 1
## 2 3 6
## 6 10 3
## 5 0 15
## 4.1 12 1
## 7 5 6
## 1 4 8
print(paste("Means for bootstrap sample: x", round(mean(bs_df$x),2), ", y", round(mean(bs_df$y),2)))
## [1] "Means for bootstrap sample: x 6.57 , y 5.71"
print(paste("Medians for bootstrap sample: x", median(bs_df$x), ", y", median(bs_df$y)))
## [1] "Medians for bootstrap sample: x 5 , y 6"
print(paste("Stdevs for bootstrap sample: x", round(sd(bs_df$x),2), ", y", round(sd(bs_df$y),2)))
## [1] "Stdevs for bootstrap sample: x 4.76 , y 4.89"
print(paste("Correlation for bootstrap sample:", round(cor(bs_df$x,bs_df$y),3)))
## [1] "Correlation for bootstrap sample: -0.924"
The data frame df contains the original sample of n = 7 observations for x and y. The sample command randomly
draws seven index values from the set {1, 2, …, 7} with replacement (replace = TRUE argument). From the value
of bs_index, we see that the index value 4 is drawn twice and the index value 3 is not drawn at all. The data frame
bs_df is assigned to be the bootstrap sample associated with the index values bs_index. The print commands
summarize the descriptive statistics for this bootstrap sample.
Example 15.1 shows four different bootstrap samples associated with an original sample. In practice, however,
many more bootstrap samples are used for inference, corresponding to a large value of B. Ideally, we would want to
construct all possible bootstrap samples from the original sample, but that’s infeasible. After all, by the multiplication
rule, the total number of possible (distinct) bootstrap samples is equal to n^n. Even with n = 7, the total number of
distinct bootstrap samples is 823,543.
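As a quick check of this count in R:
7^7
## [1] 823543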
The next section considers bootstrap sampling when the number of bootstrap samples B is chosen to be large. With
the statistic or estimator of interest calculated for each of the bootstrap samples, we can consider the distribution of
the statistic or estimator over the large number of bootstrap samples.
Definition 15.2 The bootstrap sampling distribution of a statistic or estimator s(·) is the distribution of the values
of the statistic or estimator applied to each of the B bootstrap samples.
For a univariate statistic s(x1 , x2 , …, xn ), the statistic for the b-th bootstrap sample is s(x1b , x2b , …, xnb ). For example,
the sample mean x̄ is the statistic or estimate associated with the estimator X̄, and its bootstrap sampling distribution
is the distribution of the sample means for each of the B bootstrap samples, {x1b , x2b , …, xnb } for b ∈ {1, 2, …, B}. If the
bootstrap sample means are denoted x̄b , then
x̄b = (1/n) Σ_{i=1}^{n} xib for each b ∈ {1, 2, …, B}.
The collection of the B values, {x̄1, x̄2, …, x̄B}, is the bootstrap sampling distribution of the sample mean.
As another example, the sample correlation rxy is the estimate associated with the estimator rXY, and the bootstrap sample correlations are

rxyb = [ (1/(n–1)) Σ_{i=1}^{n} (xib – x̄b)(yib – ȳb) ] / [ √( (1/(n–1)) Σ_{i=1}^{n} (xib – x̄b)² ) √( (1/(n–1)) Σ_{i=1}^{n} (yib – ȳb)² ) ] for each b ∈ {1, 2, …, B},

where x̄b = (1/n) Σ_{i=1}^{n} xib and ȳb = (1/n) Σ_{i=1}^{n} yib. The collection of the B values, {rxy1, rxy2, …, rxyB}, is the bootstrap sampling distribution of the sample correlation between x and y.
Example 15.2 Continuing Example 15.1, consider the bootstrap sampling distributions of four statistics: the sample
mean x̄, the sample mean ȳ, the sample standard deviation sx , and the sample correlation rxy . Increasing B to 1,000,
Figure 15.1 shows the histograms of the statistics calculated for each of the 1,000 bootstrap samples. For each
histogram, a corresponding density curve is drawn, and the value of the original sample statistic is indicated by a
vertical dashed line. For the sample mean x̄ (top-left graph), the bootstrap sampling distribution looks fairly symmetric
around x̄ = 6, with nearly the entire distribution between 3 and 9. For the sample mean ȳ (top-right graph), the
bootstrap sampling distribution also looks fairly symmetric, this time around ȳ = 7. In contrast, the bootstrap sampling
distributions for the sample standard deviation sx (bottom-left graph) and the sample correlation rxy (bottom-right
graph) appear asymmetric. For the sample correlation, the sample statistic rxy = –0.790 is clearly to the right of the
peak of the distribution, which occurs below –0.9. The bootstrap sampling distribution for rxy has a very long right
tail, while there is no left tail since the sample correlation is never below –1.
Here is the R code to create the B = 1,000 bootstrap samples, calculate their descriptive statistics, and draw the
graphs in Figure 15.1:
set.seed(1234)
# sketch: create B = 1000 bootstrap samples from df (the Example 15.1
# data frame) and store four statistics for each
B <- 1000
bs_meanx <- rep(0, B); bs_meany <- rep(0, B)
bs_sdx <- rep(0, B); bs_corxy <- rep(0, B)
for (b in 1:B) {
  bs_df <- df[sample(1:7, 7, replace = TRUE), ]
  bs_meanx[b] <- mean(bs_df$x)
  bs_meany[b] <- mean(bs_df$y)
  bs_sdx[b] <- sd(bs_df$x)
  bs_corxy[b] <- cor(bs_df$x, bs_df$y)
}
# simplified plotting (Figure 15.1 in the text adds density curves and
# dashed vertical lines at the original-sample statistics)
par(mfrow = c(2, 2))
hist(bs_meanx, freq = FALSE); hist(bs_meany, freq = FALSE)
hist(bs_sdx, freq = FALSE); hist(bs_corxy, freq = FALSE)
Figure 15.1
Bootstrap sampling distributions
Let s1 denote the statistic of interest calculated for the first bootstrap sample, s2 the statistic for the second bootstrap sample, and so on. For univariate data, sb is shorthand for s(x1b, x2b, …, xnb). Using this notation, the bootstrap standard error is defined as follows:
Definition 15.3 The bootstrap standard error of a statistic s(·), denoted seB, is the standard deviation of the s(·) statistic over the B bootstrap samples,

seB = √( (1/(B–1)) Σ_{b=1}^{B} (sb – s̄B)² ),

where sb is the statistic for the b-th bootstrap sample and s̄B is the average of the statistic over the B bootstrap samples.
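In R, once the B bootstrap statistics have been stored in a vector, computing seB is a one-liner, since sd() uses the B – 1 divisor in Definition 15.3. A minimal sketch (the vector name bs_stats is assumed for illustration):
# bootstrap standard error: sample standard deviation of the stored
# bootstrap statistics (vector name bs_stats assumed for illustration)
se_B <- sd(bs_stats)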
For large samples, the bootstrap standard error can be used as an alternative to the asymptotic standard error. Even in cases where the asymptotic distribution provides a good approximation to an estimator's sampling distribution, the bootstrap standard error can be very useful if the asymptotic standard error is difficult to calculate (e.g., sample quantiles) or if the formula for the asymptotic standard deviation is unknown (as in the examples at the beginning of the chapter).
When does the bootstrap standard error provide an appropriate alternative to the asymptotic standard error? Generally speaking, as long as an estimator θ̂X is √n-consistent and asymptotically normal, the bootstrap standard error seB gets arbitrarily close to the asymptotic standard deviation of θ̂X as n → ∞ and B → ∞. The reasons for the
two “→ ∞” conditions are quite different. The n → ∞ requirement is needed so that the sample is large enough for
the asymptotic (normal) sampling distribution to be arbitrarily close to the estimator’s true sampling distribution. The
B → ∞ requirement, on the other hand, is needed to eliminate the sampling error associated with bootstrap sampling,
as there is an inherent randomness due to the resampling process. In practice, B should be chosen as large as possible,
subject to the constraints of the computer, to minimize sampling error. In most applied work, B is chosen to be in the
thousands, and a larger choice is always preferred if possible.
The bootstrap standard error can be used for any of the √n-consistent and asymptotically normal estimators already discussed in this book, as well as the regression estimators discussed in later chapters. For estimators that are not √n-consistent and asymptotically normal, the bootstrap is not guaranteed to provide valid inference. An example is the sample maximum, maxX = max(X1, X2, …, Xn), for which the bootstrap should not be used.
Example 15.3 (Mean and median of a log-normal random variable) Suppose the sample x1 , x2 , …, x100 consists of
i.i.d. draws from a log-normal distribution, with ln(X) ∼ N(0, 1). The sample size is n = 100. The following R code
first draws the sample x1 , x2 , …, x100 from the population, calculates the sample mean x̄ and sample median x̃0.5 , and
calculates the asymptotic standard error se(x̄) = sx/√n. For the sample median, the asymptotic standard error is difficult
to calculate, as previously discussed. Using B = 10,000 bootstrap iterations, the code calculates bootstrap standard
errors for the sample mean x̄ and the sample median x̃0.5 :
set.seed(1234)
nobs <- 100
x <- rlnorm(nobs, meanlog = 0, sdlog = 1)  # sample with ln(X) ~ N(0,1)
mean(x); median(x); sd(x)/sqrt(nobs)       # sample stats and se of the mean
B <- 10000
bs_meanx <- rep(0, B)
bs_medianx <- rep(0, B)
for (i in 1:B) {
  bs_index <- sample(1:nobs, nobs, replace = TRUE)
  bs_x <- x[bs_index]
  bs_meanx[i] <- mean(bs_x)
  bs_medianx[i] <- median(bs_x)
}
sd(bs_meanx); sd(bs_medianx)               # bootstrap standard errors
The original sample has sample mean x̄ = 1.524, sample median x̃0.5 = 0.681, and sample standard deviation sx =
2.171. For the sample mean, the asymptotic standard error is se(x̄) = sx/√n = 0.2171. Using B = 10,000 bootstrap samples,
the bootstrap standard error for the sample mean is seB (x̄) = 0.2165, and the bootstrap standard error for the sample
median is seB (x̃0.5 ) = 0.0877. The bootstrap standard error seB (x̄) = 0.2165 is very close to the asymptotic standard
error se(x̄) = 0.2171.
How about statistical inference for the difference between the population mean µX and the population median τX,0.5 ?
The difference between the sample mean and the sample median, x̄ – x̃0.5 = 1.5239 – 0.6808 = 0.8431, is an estimate
of µX – τX,0.5 . (The difference µX – τX,0.5 is non-zero here since the log-normal is a right-skewed distribution, with
population mean greater than population median.) Once bootstrap sampling has been done, we can calculate the
bootstrap standard error seB (x̄ – x̃0.5 ) by calculating the difference x̄ – x̃0.5 for each bootstrap sample and taking the
standard deviation of the resulting B values. This process, based upon the B = 10,000 bootstrap samples, yields the
bootstrap standard error seB (x̄ – x̃0.5 ) = 0.1874.
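In R, given the vectors bs_meanx and bs_medianx from the code above, this calculation takes a single line:
# bootstrap standard error of the mean-median difference
sd(bs_meanx - bs_medianx)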
Example 15.4 (Monthly stock returns: correlation differences) Example 14.11 provided a sample correlation matrix,
with asymptotic standard errors, for a set of six stocks from the sp500 dataset. Each sample correlation in that
matrix is an estimate of the underlying population correlation. For example, the sample correlation rHD,LOW is an
estimate of the population correlation ρHD,LOW , and its asymptotic standard error can be used for statistical inference.
What if we are instead interested in the difference between two population correlations? For instance, the difference
ρHD,LOW – ρHD,BAC is of interest if we want to know if the correlation between HD and LOW returns is larger or
smaller than the correlation between HD and BAC returns. While the difference rHD,LOW – rHD,BAC provides a logical
estimator of ρHD,LOW – ρHD,BAC , it is difficult to calculate an asymptotic standard error for rHD,LOW – rHD,BAC . This
situation is different from the two-sample setting considered in Section 14.4.5, where the asymptotic standard error of
the correlation difference was easy to determine since the two correlations for a two-sample problem are independent
of each other (see Example 14.16). Here, since they are based upon the same sample, the sample correlations
rHD,LOW and rHD,BAC are not independent, meaning the asymptotic standard error is complicated and depends upon
their covariance/correlation. Rather than attempting to derive this more complicated asymptotic standard error, the
bootstrap provides a simpler alternative. We repeatedly create bootstrap samples, and for each bootstrap sample, the
bootstrap statistic is the difference between the HD-LOW correlation and the HD-BAC correlation.
The following R code calculates the bootstrap standard error for rHD,LOW – rHD,BAC using B = 1,000:
set.seed(1234)
# sketch: column names HD, LOW, and BAC are assumed for the sp500 data frame
nobs <- nrow(sp500)
B <- 1000
bs_rdiff <- rep(0, B)
for (b in 1:B) {
  bs_df <- sp500[sample(1:nobs, nobs, replace = TRUE), ]
  bs_rdiff[b] <- cor(bs_df$HD, bs_df$LOW) - cor(bs_df$HD, bs_df$BAC)
}
# output the estimate and its bootstrap standard error
rdiff <- cor(sp500$HD, sp500$LOW) - cor(sp500$HD, sp500$BAC)
print(paste("Estimate:", round(rdiff, 3), " Bootstrap se:", round(sd(bs_rdiff), 3)))
The estimate of the difference ρHD,LOW – ρHD,BAC is rHD,LOW – rHD,BAC = 0.317, and the bootstrap standard error is
seB (rHD,LOW – rHD,BAC ) = 0.061. There is nothing special about the pairs HD-LOW and HD-BAC used in this example,
and the same approach could be used to calculate a bootstrap standard error for the difference between any two
correlations (e.g., rBAC,WFC – rMRO,COP ).
The 99% confidence interval provides strong statistical evidence that ρHD,LOW – ρHD,BAC is positive or, equivalently,
that ρHD,LOW is greater than ρHD,BAC .
Example 15.7 (Labor force data) Example 14.12 reported the bootstrap standard errors for various sample quantiles,
including the sample deciles and sample quartiles, of the weekly earnings variable. Here is the R code used to create
the table in Example 14.12, with B = 5,000 bootstrap iterations used to calculate the bootstrap standard errors and the
normal-based confidence intervals:
set.seed(1234)
# initialize variables
nobs <- nrow(cpsemployed)
B <- 5000
# sketch of the resampling loop: bootstrap the sample deciles and quartiles
# of earnwk (this particular set of quantile levels is assumed)
probs <- sort(c(seq(0.1, 0.9, by = 0.1), 0.25, 0.75))
bs_q <- matrix(0, B, length(probs))
for (b in 1:B) {
  bs_earnwk <- cpsemployed$earnwk[sample(1:nobs, nobs, replace = TRUE)]
  bs_q[b, ] <- quantile(bs_earnwk, probs)
}
How about calculating a bootstrap standard error and normal-based bootstrap interval for the interquartile range
τX,0.75 – τX,0.25 ? The following R code calculates these quantities, again using B = 5,000 bootstrap iterations:
set.seed(1234)
# initialize variables
nobs <- nrow(cpsemployed)
B <- 5000
bs_iqr <- rep(0,B)
# resampling loop: calculate the IQR for each bootstrap sample
for (b in 1:B) {
  bs_earnwk <- cpsemployed$earnwk[sample(1:nobs, nobs, replace = TRUE)]
  bs_iqr[b] <- IQR(bs_earnwk)
}
# output the bootstrap standard error and normal-based CI for the IQR
iqr_earnwk <- IQR(cpsemployed$earnwk)
print(paste("IQR: ", round(iqr_earnwk,1),
", bootstrap se ", round(sd(bs_iqr),1),
", 95% CI (", round(iqr_earnwk-1.96*sd(bs_iqr),1),
",", round(iqr_earnwk+1.96*sd(bs_iqr),1), ")", sep=""))
## [1] "IQR: 673.6, bootstrap se 23.3, 95% CI (628,719.2)"
The estimated IQRearnwk is 673.6, with a bootstrap standard error seB (IQRearnwk ) = 23.3 and a normal-based
bootstrap 95% confidence interval of (628.0, 719.2). With 95% confidence, it can be said that the population IQR
is between 628.0 and 719.2.
percentile interval may be asymmetric, with the distance between the lower endpoint and the estimate potentially
being different from the distance between the estimate and the upper endpoint.47
While the importance of choosing a large value for B has already been discussed, a large B is particularly important when forming bootstrap percentile intervals. These intervals are based upon the extremes of the bootstrap sampling
distribution (e.g., the 2.5% and 97.5% quantiles in the case of a two-sided 95% interval), and estimators of extreme
quantiles are less precise than estimators of other quantities like the standard deviation. Therefore, all else equal,
we want a larger choice of B for calculating percentile intervals than would be needed for calculation of a bootstrap
standard error.
Example 15.8 (Mean and median of a log-normal random variable) Example 15.3 provided R code to calculate
the bootstrap distribution (B = 10,000 bootstrap statistics) associated with the sample mean x̄, the sample median
x̃0.5 , and the difference x̄ – x̃0.5 . The bootstrap statistics for the sample mean and the sample median were stored in
the vectors bs_meanx and bs_medianx, respectively. With these two vectors available, we can calculate 95%
bootstrap percentile intervals with the following R code:
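# 95% bootstrap percentile intervals: sample 2.5% and 97.5% quantiles
# (sketch using the bs_meanx and bs_medianx vectors from Example 15.3)
quantile(bs_meanx, c(0.025, 0.975))
quantile(bs_medianx, c(0.025, 0.975))
quantile(bs_meanx - bs_medianx, c(0.025, 0.975))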
The results are as follows: a 95% bootstrap percentile interval of (1.136, 1.976) for µX (as compared to the 95%
normal-based bootstrap confidence interval (1.099, 1.948)), a 95% bootstrap percentile interval of (0.604, 0.877) for
τX,0.5 (as compared to the 95% normal-based bootstrap confidence interval (0.509, 0.853)), and a 95% bootstrap
percentile interval of (0.487, 1.211) for µX – τX,0.5 (as compared to the 95% normal-based bootstrap confidence
interval (0.476, 1.210)). The only meaningful difference seems to arise for the τX,0.5 interval, which may suggest that the
sample size (n = 100) is not large enough for asymptotic normality of the sample median estimator with the underlying
log-normal random variable.48
Changing the confidence level of the percentile interval is straightforward. The following R code instead calculates
90% bootstrap percentile intervals by changing the quantiles from 2.5% and 97.5% to 5% and 95%, respectively:
quantile(bs_meanx, c(0.05,0.95))
## 5% 95%
## 1.185190 1.900511
quantile(bs_medianx, c(0.05,0.95))
## 5% 95%
## 0.6088298 0.8469239
quantile(bs_meanx-bs_medianx, c(0.05,0.95))
## 5% 95%
## 0.5295099 1.1399688
The resulting 90% bootstrap percentile intervals are (1.185, 1.901), (0.609, 0.847), and (0.530, 1.140) for µX , τX,0.5 ,
and µX – τX,0.5 , respectively.
Example 15.9 (Monthly stock returns: correlation differences) Example 15.4 provided R code to calculate the
bootstrap distribution (B = 1,000 bootstrap statistics) associated with rHD,LOW – rHD,BAC , with the results in the vector
bs_rdiff. With this vector available, we can calculate a 95% bootstrap percentile interval by calculating the sample
2.5% and 97.5% quantiles.
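Using the bs_rdiff vector, this is one line of R:
# 95% bootstrap percentile interval for the correlation difference
quantile(bs_rdiff, c(0.025, 0.975))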
The 95% bootstrap percentile interval is (0.20, 0.44), which is the same to two decimal places as the normal-based
bootstrap 95% confidence interval calculated in Example 15.6.
Example 15.10 (Labor market data) Example 15.7 provided R code to calculate the bootstrap distribution (B = 5,000)
associated with the IQR of the weekly earnings variable earnwk from the cps dataset, with the results in the vector
bs_iqr. With this vector available, we can calculate a 95% bootstrap percentile interval, as in the previous examples:
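# 95% bootstrap percentile interval for the population IQR of earnwk
quantile(bs_iqr, c(0.025, 0.975))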
The 95% bootstrap percentile interval for the population IQR is (617.7, 701.0), as compared to the normal-based
bootstrap 95% confidence interval (628.0, 719.2).
Notes
47 While the ability of the bootstrap percentile interval to handle non-normality and asymmetry is appealing, there are no general theoretical
results showing that the bootstrap percentile interval actually performs better than a normal-based interval (either asymptotic or bootstrap). In fact,
the main theoretical results supporting the use of the bootstrap involve large samples (n → ∞), in which case the asymptotic normal confidence
interval should work well. For large samples, another bootstrap method known as the studentized bootstrap or bootstrap-t method has been shown
to perform better than asymptotic normal confidence intervals; this method is beyond the scope of this book, but the interested reader can easily find
references for it.
48 Indeed, the histogram of the bootstrap median estimates confirms that the bootstrap distribution does not have a nice bell shape like that of
the distribution of the bootstrap mean estimates. The interested reader can confirm with the commands hist(bs_meanx,breaks=50) and
hist(bs_medianx,breaks=50).
Exercises
1. Suppose a bootstrap sample of size n is created by sampling with replacement from the i.i.d. sample {x1 , x2 , …, xn }.
(a) What is the probability that a specific observation xj , for some j ∈ {1, 2, …, n}, is in the bootstrap sample at
least once?
(b) What is the probability that a specific observation xj , for some j ∈ {1, 2, …, n}, is in the bootstrap sample at
least twice?
(c) Evaluate the probabilities from (a) and (b) for n = 100.
2. Exercise 14.25 considered a prolific inventor who submits 150 patent applications to the U.S. Patent and Trademark
Office (USPTO), with 110 being successful (resulting in a patent being issued). Assume that the success of each patent
application can be considered an i.i.d. Bernoulli(π) random variable.
(a) Rather than creating a dataset with 110 successes (ones) and 40 failures (zeros), a shortcut to create bootstrap samples for i.i.d. Bernoulli data is to repeatedly draw from the appropriate binomial distribution, which in this case is Binomial(150, 110/150). For instance, the first draw from this binomial might be 105, corresponding to a bootstrap sample with 105 successes (ones) and 45 failures (zeros). Use this approach with 5,000 iterations to calculate a bootstrap standard error and construct a two-sided normal-based 95% bootstrap confidence interval for π.
(b) Using the same approach, conduct 5,000 bootstrap iterations to calculate a bootstrap standard error and construct a two-sided normal-based 95% bootstrap confidence interval for the odds π/(1 – π).
(c) Using the same approach, conduct 5,000 bootstrap iterations to construct a two-sided 95% bootstrap percentile interval for the odds π/(1 – π).
3. Use the strikes dataset for this question. This dataset contains information on worker contract strikes within United
States manufacturing for the period 1968-1976. There are 566 observations on the variable duration (strike duration,
in weeks). Let X denote the random variable associated with duration.
(a) What are the sample 75% quantile and sample 90% quantile of duration? What is the sample IQR of duration?
(b) Use the bootstrap with 5,000 iterations to construct two-sided normal-based bootstrap 95% confidence intervals
for τX,0.75 , τX,0.90 , and IQRX .
(c) Let πlong = P(X > 52) be the probability that a strike lasts longer than a year. Use the bootstrap with 5,000
iterations to construct a two-sided normal-based bootstrap 95% confidence interval for πlong . How does this
interval compare to the 95% asymptotic confidence interval for πlong ?
4. Due to concerns about a dangerous intersection, a town gathers data on the weekly number of car accidents in the
intersection. The data for 50 consecutive weeks are summarized by the following table:
# accidents 0 1 2 3 4 5 6
# weeks 23 14 5 2 5 0 1
Assume that the number of accidents each week is an i.i.d. draw from a random variable X.
(a) Create a data frame or a vector of 50 observations in R based upon the table.
(b) Calculate the sample mean and the sample variance.
(c) Use the bootstrap with 5,000 iterations to construct a two-sided normal-based bootstrap 95% confidence interval for µX and a two-sided normal-based bootstrap 95% confidence interval for σX².
(d) A town official studied statistics in college and wonders whether X is a Poisson random variable. She recalls that a feature of a Poisson(λ) random variable is that the population mean and population variance are both equal to λ and, therefore, equal to each other. Let θ = µX – σX² be the difference between the population mean and population variance, which would be θ = 0 if X is truly Poisson. Use the bootstrap with 5,000 iterations to construct a two-sided normal-based bootstrap 95% confidence interval for θ = µX – σX². What does the confidence interval say about X being Poisson?
(e) Same as (d), but construct a two-sided 95% bootstrap percentile interval for θ = µX – σX².
5. There are 106 unemployed individuals in the cps data, for whom the variable lfstatus (labor-force status) is equal
to “Unemployed” and the variable unempwks (weeks unemployed) is reported. For this question, focus only on the
sample of 106 unemployed individuals.
(a) Draw a histogram of unempwks for the unemployed individuals.
(b) Given the right-skewed distribution of unempwks, a classmate suggests taking a log transformation of the
variable to get a distribution that is more symmetric. Create a new variable lnunempwks that does so, and
calculate its sample mean and sample median.
(c) If the distribution of lnunempwks is symmetric, it should be the case that the population mean µX is equal
to the population median τX,0.5 for the underlying random variable X. Use the bootstrap with 5,000 iterations
to construct a two-sided normal-based bootstrap 95% confidence interval for θ = µX – τX,0.5 . What does the
confidence interval say about the population mean of X being equal to the population median of X?
(d) Same as (c), but do so for the (ratio) parameter θ = µX/τX,0.5.
6. There are 2,809 employed individuals in the cps data, which is the sample of interest for this question. Suppose the
probability of union membership, in the population, is πm for male workers and πf for female workers.
(a) What is the estimate of the ratio πm/πf, given by π̂m/π̂f, where π̂m is the observed sample proportion of union members among male workers and π̂f is the observed sample proportion of union members among female workers?
(b) Use the bootstrap with 5,000 iterations to construct a two-sided normal-based bootstrap 95% confidence interval for the ratio πm/πf. (Hint: Create bootstrap samples separately for the male-worker subsample and the female-worker subsample.)
(c) The odds ratio (OR) is a measure used by statisticians to compare the likelihood of a certain outcome occurring in two different groups. In the context of this union-gender example, the odds ratio is

[πm/(1 – πm)] / [πf/(1 – πf)],

which is the ratio between the odds of a male worker being in a union and the odds of a female worker being in a union. Plugging π̂m and π̂f in for πm and πf, what is the estimated OR? Use the bootstrap with 5,000 iterations to construct a two-sided normal-based bootstrap 95% confidence interval for the OR.
16 Hypothesis testing
This chapter introduces the concept of hypothesis testing. Sections 16.1-16.3 focus on testing a hypothesis about the
value of a single unknown parameter, and Section 16.4 extends the framework to consider tests of multiple hypotheses.
For the case of a single unknown parameter, we fix ideas by denoting the unknown parameter of interest by θ. The
following examples motivate the usefulness of hypothesis tests for a single parameter.
Example 16.1 (Widget website) Examples 2.1, 14.5, and 14.15 considered the purchase probabilities for three groups
of widgets.com users: recipients of e-mail A, recipients of e-mail B, and non-recipients. The parameters πA , πB , and
πC denote the purchase probabilities of these three groups, respectively. To compare the effectiveness of e-mail A versus
e-mail B, the difference πA – πB is the quantity of interest. If this difference is positive, then e-mail A is more effective
than e-mail B; if this difference is negative, then e-mail B is more effective than e-mail A; and, if this difference is zero,
then e-mail A and e-mail B are equally effective. If θ = πA – πB is the parameter equal to the difference between the two
purchase probabilities, we want to know whether θ = 0 (no difference in e-mail effectiveness) or θ ≠ 0 (difference in e-mail effectiveness). Example 14.15 constructed a 95% confidence interval (–0.085, 0.045) for θ = πA – πB based upon the estimate pA = 60/300 = 0.20 of πA, the estimate pB = 66/300 = 0.22 of πB, and the standard errors se(pA) and se(pB). From this confidence
interval, it appears that zero is a plausible value for θ since it falls within the confidence interval. Therefore, it would
be expected that a formal statistical test would not be able to rule out θ = 0 with a high level of confidence. A test of the
hypothesis θ = 0 is known as a two-sided test since statistical evidence of either θ < 0 or θ > 0 would call into question
the hypothesis θ = 0.
Example 16.2 (Investment opportunity) You are interested in the possibility of buying a business that produces and
sells a certain product. By your calculations, the true average of weekly sales would need to be at least $10,000 in
order for the investment to be worthwhile. As part of due diligence, you obtain weekly sales figures from the business for
10 randomly chosen weeks. For those 10 weeks, the sample mean of weekly sales is $11,200, and the sample standard
deviation of weekly sales is $3,400. If θ denotes the population average of weekly sales, measured in thousands of
dollars, you are interested in knowing whether θ ≤ 10 (the business is not a worthwhile investment) or θ > 10 (the
business is a worthwhile investment). The observed sample is {x1 , x2 , …, x10 }, where xi is weekly sales in thousands of
dollars for a particular week. If these observations can be considered i.i.d. draws from some distribution, the sample
mean x̄ = 11.2 serves as an estimate of θ and, therefore, provides some evidence against θ ≤ 10. But how strong is this
evidence? A more formal test needs to take into account the fact that the sample mean is a random variable. As seen
below, the estimate’s standard error accounts for the potential imprecision of the estimator, similar to what was done
for confidence intervals in Chapter 14. A test of the hypothesis θ ≤ 10 is known as a one-sided test since only statistical
evidence of θ > 10 would call into question the hypothesis θ ≤ 10.
As done for confidence intervals in Chapter 14, this chapter first considers hypothesis testing for the unknown
population mean of i.i.d. normal random variables and then considers hypothesis testing for the more general case
of an unknown parameter for which an asymptotic normal estimator is available. For the former case, covered in
Section 16.1, testing is based upon the exact sampling distribution of the sample mean estimator. For the latter case,
covered in Section 16.2, testing is based upon the asymptotic distribution of the estimator.
Before getting into the details of the tests, some additional notation and terminology is needed.
Definition 16.1 The null hypothesis is the hypothesis to be tested and is often denoted H0 . The alternative hypothesis
is the opposite of the null hypothesis and is often denoted H1 (though some other sources use the notation Ha ).
For a two-sided test of an unknown parameter θ, the null hypothesis is
H0 : θ = c,
for some known constant c specified by the researcher. The alternative hypothesis, which is the opposite of the null
hypothesis, is
H1 : θ ≠ c.
The alternative hypothesis is true whenever the null hypothesis is false, and vice versa.49 The hypothesis test of
H0 determines whether or not there is statistical evidence to reject H0 : θ = c. In Example 16.1, the null hypothesis is
H0 : θ = 0. Since θ is unknown, an estimate of θ that is far away from the hypothesized value c should provide statistical
evidence against H0 : θ = c, but what does “far away” mean? Due to the randomness and noise inherent in the estimate
of θ, the estimate’s standard error will help to quantify how “far away” the estimate of θ is from c.
For a one-sided test of an unknown parameter θ, the null hypothesis is either
H0 : θ ≥ c
or
H0 : θ ≤ c,
for some known constant c specified by the researcher. The direction of the inequality in the null hypothesis H0 depends
upon the situation. In Example 16.2, for example, the null hypothesis of interest is H0 : θ ≤ 10.
For the null hypothesis H0 : θ ≥ c, the alternative hypothesis is
H1 : θ < c,
and the hypothesis test of H0 determines whether or not there is statistical evidence to reject H0 : θ ≥ c. Statistical
evidence against H0 : θ ≥ c comes from an estimate of θ that is far below c.
For the null hypothesis H0 : θ ≤ c, the alternative hypothesis is
H1 : θ > c,
and the hypothesis test of H0 determines whether or not there is statistical evidence to reject H0 : θ ≤ c. Statistical
evidence against H0 : θ ≤ c comes from an estimate of θ that is far above c. Again, the notions of “far below” and “far
above” will be formalized in terms of an estimate’s standard error.
16.1 Finite-sample hypothesis testing: population mean of i.i.d. normal random variables
This section considers hypothesis tests for the population mean µ associated with normally distributed i.i.d. random
variables X1 , X2 , …, Xn ∼ N(µ, σ 2 ). Section 14.2 covered this case in detail and used the exact sampling distribution
results for the sample mean estimator X̄ from Section 12.1.2. Proposition 14.2 provided the key result to construct
confidence intervals for µ:
(X̄ – µ) / (sX/√n) ∼ tn–1,

where X̄ is the sample mean estimator, sX is the sample standard deviation estimator, and tn–1 is the t-distribution with n – 1 degrees of freedom. The standard deviation of the estimator X̄ is sX/√n. The ratio (X̄ – µ)/(sX/√n) is known as the t-ratio and indicates the number of standard deviations that the estimator X̄ is away from the parameter µ. The t-ratio is positive when X̄ is greater than µ and negative when X̄ is less than µ. Unfortunately, even after observing the realized sample mean x̄ and standard deviation sx, the t-ratio is unknown since the parameter µ is unknown.
t-test rejection rule (5% level): reject H0 : µ = c if the magnitude of the t-statistic, |x̄ – c|/(sx/√n), is greater than the critical value tn–1,0.025; otherwise, do not reject H0.

There are two possible conclusions from this t-test rejection rule. Either we “reject” the null hypothesis H0, which
occurs when the magnitude of the t-statistic is above the critical value, or we “do not reject” the null hypothesis H0 ,
which occurs when the magnitude of the t-statistic is below the critical value. A t-statistic with large magnitude
(greater than the critical value) provides evidence against H0 : µ = c and, therefore, it is said that H0 : µ = c is rejected.
On the other hand, a t-statistic with a small magnitude (less than the critical value) does not provide evidence against
H0 : µ = c; such a t-statistic is not surprising given the underlying tn–1 distribution that holds when H0 is true. For this
case, it is said that H0 : µ = c is not rejected. While some discussions of hypothesis testing use the term “accept H0” rather than “do not reject H0,” the use of the term “accept H0” is not advisable. After all, it is never possible to have strong evidence in favor of H0 : µ = c since there is always some uncertainty from the estimation of µ. The most that can be said, in the case that |x̄ – c|/(sx/√n) < tn–1,0.025, is that there is not sufficient statistical evidence against the null hypothesis H0. As such, using the phrase “do not reject H0” is appropriate. Even in cases where there is a strong prior belief that H0 is false, there might be a failure to reject H0 just because a small sample size leads to a large standard error sx/√n.
The magnitude of the t-statistic, |x̄ – c|/(sx/√n), can be thought of as a “statistical distance” from the estimate x̄ to the hypothesized value c. While |x̄ – c| gives the actual distance from the estimate x̄ to the hypothesized value c, there is no way to know if the actual distance |x̄ – c| is small or large due to the uncertainty associated with the estimate x̄. Dividing by the standard error sx/√n accounts for this uncertainty, so that the statistical distance |x̄ – c|/(sx/√n) is a distance in terms of the number of standard errors that x̄ is from c. For this statistical distance, unlike the actual distance, statistical theory tells us what types of values should be expected if the null hypothesis is true; specifically, as discussed above, the statistical distance given by the magnitude of the t-statistic, |x̄ – c|/(sx/√n), should be the absolute value of a realized draw from the tn–1 distribution if the null hypothesis H0 : µ = c is true.
The level of the test, a term introduced in the rejection rule above, is formally defined as follows:
Definition 16.2 The level or significance level of a hypothesis test, denoted by α, is the probability that the null
hypothesis H0 is rejected when the null hypothesis H0 is true. The level of a test is also called the type I error of the
test.
For the rejection rule above, the level of the hypothesis test is α = 5% or α = 0.05. Figure 16.1 provides a graphical
view of the rejection regions on the tn–1 distribution. If the magnitude of the t-statistic is larger than tn–1,0.025 , its value
falls into either the gray region for the left tail (if the t-statistic is negative) or the gray region for the right tail (if the
t-statistic is positive). If H0 is true, there is a 95% probability that the t-statistic falls in the middle region between the
two values –tn–1,0.025 and tn–1,0.025 . But, even if H0 is true, there is still a 5% probability that the rejection rule above says
to “reject H0 ” due to the t-statistic falling in the left or right tail; for such cases, the test is wrong about the rejection,
and the level of the test (α = 5% here) indicates the probability that the test rejects when H0 is true.
We can generalize the t-test and the rejection rule to levels other than α = 0.05. Letting α denote the level of the test, the relevant probability statement is that

P( |X̄ – c|/(sX/√n) < tn–1,α/2 ) = 1 – α when H0 : µ = c is true.

This probability statement leads to the following rejection rule for a t-test at the α level: reject H0 : µ = c if |x̄ – c|/(sx/√n) is greater than the critical value tn–1,α/2; otherwise, do not reject H0.
There is a tradeoff involved in choosing the level of the test. By its definition, the level indicates how likely it is
to reject H0 when H0 is actually true. Therefore, decreasing the level from 10% to 5% lowers the probability that
an incorrect rejection of H0 occurs. On the other hand, since the t-statistic is the same regardless of the level of the
test, decreasing the level from 10% to 5% leads to a lower chance that H0 is rejected (even if H0 is false). This lower
rejection rate occurs since the critical value increases from tn–1,0.05 to tn–1,0.025 when the level is changed from 10% to
5%: if the magnitude of the t-statistic is less than tn–1,0.05 , the test would not reject H0 at either level; if the magnitude
of the t-statistic is greater than tn–1,0.025 , the test would reject H0 at either level; and, if the magnitude of the t-statistic
is between tn–1,0.05 and tn–1,0.025 , the test would reject H0 at the 10% level but not the 5% level.
Figure 16.1
Rejection areas for the t-test at a 5% level
Example 16.3 (Food truck) In Example 14.3, the weekly profits X of a food truck were assumed to be normally
distributed with unknown mean µ and variance σ 2 , with each week’s profits being an i.i.d. draw. Weekly profits were
recorded for a total of six weeks, with the sample average of weekly profits equal to $1200 and the sample standard
deviation equal to $200. Suppose the food truck owner, prior to seeing the data, believed that the true average µ of
weekly profits was $1000. This belief corresponds to the null hypothesis H0 : µ = 1000. The t-statistic is (1200 – 1000)/(200/√6) ≈ 2.449. For a test at the 5% level, with α = 0.05, the appropriate critical value is tn–1,0.025 = t5,0.025 ≈ 2.571. Therefore, the t-test does not reject H0 : µ = 1000 at a 5% level since |2.449| < 2.571.
# calculate t-statistic
tstat <- (1200-1000)/(200/sqrt(6))
tstat
## [1] 2.44949
# critical value for t-test at 5% level
qt(0.975,5)
## [1] 2.570582
For a test at the 10% level (α = 0.10), the appropriate critical value is tn–1,0.05 = t5,0.05 ≈ 2.015. Therefore, the t-test
does reject H0 : µ = 1000 at a 10% level since |2.449| ≥ 2.015. For this null hypothesis, then, there is rejection at the
10% level but not the 5% level.
Example 14.3 calculated confidence intervals for µ based upon the sample information provided above. The two-sided 95% confidence interval for µ is

( x̄ – t6–1,0.025 · sx/√n , x̄ + t6–1,0.025 · sx/√n ) ≈ (990, 1410),

and the two-sided 90% confidence interval for µ is

( x̄ – t6–1,0.05 · sx/√n , x̄ + t6–1,0.05 · sx/√n ) ≈ (1035, 1365).
The hypothesized value of c = 1000 lies within the 95% confidence interval for µ but not within the 90% confidence interval for µ. With respect to the 95% confidence interval, the value of 1000 is a plausible value since it lies within that interval. Not coincidentally, this fact corresponds with the finding that H0 : µ = 1000 is not rejected at the 5% level. On the other hand, with respect to the 90% confidence interval, the value of 1000 does not appear likely since it lies outside the interval. Again, not coincidentally, this fact corresponds with the finding that H0 : µ = 1000 is rejected at the 10% level.
The last part of Example 16.3 highlights the connection between two-sided 1 – α confidence intervals for µ and the
rejection decision for tests at the α level. The following proposition formally states this relationship:
Proposition 16.1. For a t-test of the null hypothesis H0 : µ = c at the α level, the null hypothesis H0 is rejected if c lies
outside the two-sided 1 – α confidence interval for µ and is not rejected if c lies inside the two-sided 1 – α confidence
interval for µ.
To show this result, note that the two-sided 1 – α confidence interval for µ is

( x̄ – tn–1,α/2 · sx/√n , x̄ + tn–1,α/2 · sx/√n ),

so c is inside this interval if and only if

x̄ – tn–1,α/2 · sx/√n < c < x̄ + tn–1,α/2 · sx/√n.

The first inequality is equivalent to

(x̄ – c)/(sx/√n) < tn–1,α/2,

and the second inequality is equivalent to

(x̄ – c)/(sx/√n) > –tn–1,α/2.

Putting these two inequalities together yields

–tn–1,α/2 < (x̄ – c)/(sx/√n) < tn–1,α/2 or, equivalently, |x̄ – c|/(sx/√n) < tn–1,α/2,

which corresponds to the “do not reject” rule for testing H0 : µ = c at the α level. Therefore, c is within the two-sided 1 – α confidence interval if and only if H0 : µ = c is not rejected at the α level.
Therefore, an alternative and equivalent rejection rule for the t-test is the following: reject H0 : µ = c at the α level if c lies outside the two-sided 1 – α confidence interval for µ; otherwise, do not reject H0.
Definition 16.3 The p-value of a test of the null hypothesis H0 is the smallest level α∗ such that the test rejects H0 at
the level α∗ .
In the case of the two-sided t-test of H0 : µ = c, the p-value is

p-value = P(|T| > |t-statistic|) = P( |T| > |x̄ – c|/(sx/√n) ) when H0 is true, where T ∼ tn–1.
In words, if the null hypothesis H0 is true, the p-value is the probability of observing a t-statistic at least as large in
magnitude as the one actually observed.
Figure 16.2 provides a graphical depiction of the p-value. The value of the realized t-statistic is denoted by t-stat in
the figure, with |t-stat| on the right side of the graph and –|t-stat| on the left side of the graph. The p-value is equal to
the total area in the two tails, which includes the area to the left of –|t-stat| and the area to the right of |t-stat|, and is
represented by the gray shading.
As an example, if the t-statistic for H0 : µ = c is calculated to be –1.2 for a sample size of n = 15, R can be used to
find that P(T > 1.2) = P(T < –1.2) ≈ 0.125 for T ∼ t14 . Then, the associated p-value is P (|T| > 1.2) ≈ (2)(0.125) = 0.250,
meaning there is a 25% chance of seeing a t-statistic at least as large as 1.2 in magnitude if the null hypothesis H0 is
true. Due to the symmetry of the t-distribution, the p-value can be calculated as either two times the area of the left tail
or two times the area of the right tail.
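This calculation is easy to reproduce in R; a quick sketch for the numbers above:
# two-sided p-value for a t-statistic of -1.2 with n - 1 = 14 degrees of freedom
2 * pt(-1.2, df = 14)   # approximately 0.250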
For a t-test at the 5% level, the critical value tn–1,0.025 is the value that gives 5% total area in the tail to the right of tn–1,0.025 and the tail to the left of –tn–1,0.025. If the p-value for this t-test is less than 0.05 or 5%, it must be the case that the associated t-statistic t∗ is larger in magnitude than tn–1,0.025 so that the total area of the tails to the right of |t∗| and to the left of –|t∗| is less than 0.05. Similarly, if the p-value for this t-test is greater than 0.05 or 5%, it must be the case that the associated t-statistic is smaller in magnitude than tn–1,0.025 so that the total area of the tails to the
Figure 16.2
p-value for a t-test
right of |t∗| and to the left of –|t∗| is greater than 0.05. This same idea holds for any level of the test, suggesting the following rejection rule for the t-test based upon the p-value: reject H0 at the α level if the p-value is less than α; otherwise, do not reject H0.
The advantage of reporting the p-value associated with the t-test is that it immediately indicates whether the null
hypothesis H0 would be rejected at any specified level. For example, if the p-value is 0.08, the null hypothesis H0 would
be rejected for any level α greater than 0.08, and the null hypothesis H0 would not be rejected for any level α less than
0.08. Larger t-statistic magnitudes are associated with lower p-values, meaning H0 is more likely to be rejected, and
smaller t-statistic magnitudes are associated with higher p-values, meaning H0 is less likely to be rejected.
Example 16.4 (Food truck) In Example 16.3, a t-statistic of 2.449 was calculated for the null hypothesis H0 : µ =
1000. Since the sample size is n = 6, the distribution of interest is the t5 distribution. For a random variable T ∼ t5 , R
calculates that the p-value is 0.058:
# calculate t-statistic
tstat <- (1200-1000)/(200/sqrt(6))
# two-sided p-value: twice the area in the right tail of the t5 distribution
2*(1-pt(tstat,5))   # approximately 0.058
The left tail, corresponding to P(T < –2.449), has an area of 0.029, and the right tail, corresponding to P(T > 2.449),
also has an area of 0.029, yielding the p-value of 0.058. Therefore, the null hypothesis H0 : µ = 1000 is rejected for any
level above 0.058 and not rejected for any level below 0.058. This finding is consistent with Example 16.3, where H0
was rejected at the 10% level but not at the 5% level.
Definition 16.4 For a specific alternative hypothesis, the power of a test is the probability that the test correctly
rejects the null hypothesis H0 . For a specific alternative hypothesis, the probability that the test does not reject the null
hypothesis, which is equal to one minus the power, is called the type II error of the test.
If the specific alternative hypothesis µ = d (for d ≠ c) is true, then

(X̄ – d) / (sX/√n) ∼ tn–1.
Since H0 is not true, the t-statistic (X̄ – c)/(sX/√n) is not distributed as a tn–1 random variable. Instead, the t-statistic can be written as

(X̄ – c)/(sX/√n) = (X̄ – d)/(sX/√n) + (d – c)/(sX/√n),

so that the t-statistic is a tn–1 random variable plus the term (d – c)/(sX/√n). If d is much larger than c, this term is a large positive number so that the observed t-statistic should look like a random draw from tn–1 plus a large positive number, which makes it more likely to be above the right-tail critical value for rejection. Similarly, if d is much smaller than c, this term is a large negative number so that the observed t-statistic should look like a random draw from tn–1 minus a large number, which makes it more likely to be below the left-tail critical value for rejection.
To help visualize the power of the two-sided t-test, Figure 16.3 shows the power of the test under two different
specific alternative hypotheses. In both graphs, the thin curve corresponds to the tn–1 distribution for the t-statistic
that would hold under the null hypothesis H0 , along with the critical values –tn–1,0.025 and tn–1,0.025 . The bold curves
represent the actual distributions for the t-statistic under two specific alternative hypotheses, with d being 2.5 standard
deviations above c in the top graph and 4 standard deviations above c in the bottom graph. The gray areas indicate the
rejection probabilities using the critical-value cutoff rejection rule. For both graphs, the rejection probability is clearly
much larger than 5%, which is the rejection rate under the null hypothesis. Since d is farther away from c in the bottom
Figure 16.3
Power of a t-test
graph, the power of the test is larger in the bottom graph, as indicated by the larger gray area. The type II errors are given by the white areas under the bold curves, and since the type II error is just one minus the power, the type II error is smaller in the bottom graph.
For any specific alternative µ = d, with d ≠ c, the power of the t-test is an increasing function of the magnitude of $\frac{d - c}{s_X/\sqrt{n}}$, so that the power depends upon the sample size and the value of d. For a given sample size n, the power of the test increases
when the specific alternative hypothesis has d farther away from c. Intuitively, this relationship makes sense since it is
easier to find evidence against H0 : µ = c when d is not close to c. For any specific alternative hypothesis, the power of
the test is increasing in the sample size n. Larger samples make it easier to reject the null hypothesis when it is false.
In fact, if the sample size n gets arbitrarily large (n → ∞), the power of the test becomes arbitrarily close to 100% if
the null hypothesis H0 is false.
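Although the text does not include code for power calculations, the power of a t-test is easy to approximate by simulation. The following is a minimal sketch for the food-truck setting of Example 16.4, under the illustrative assumption (not from the text) that weekly sales are normally distributed with true mean d = 1200 and standard deviation 200:
# simulate the power of the two-sided t-test at the 5% level
# (illustrative assumption: true mu = 1200 and sigma = 200)
set.seed(123)
n <- 6
tcrit <- qt(0.975, n-1)
reject <- replicate(100000, {
  x <- rnorm(n, mean = 1200, sd = 200)
  abs((mean(x)-1000)/(sd(x)/sqrt(n))) >= tcrit
})
mean(reject)   # proportion of samples in which H0: mu = 1000 is rejected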
For the one-sided test of H0 : µ ≥ c, evidence against the null hypothesis corresponds to a t-statistic that is very negative. To have a test at the α level, the area under the tn–1 distribution to the left of the critical value must be equal to α. The corresponding rejection rule is:

Rejection rule for the one-sided t-test of H0 : µ ≥ c (test at the α level):
• Reject H0 : µ ≥ c at the α level if t-statistic ≤ –tn–1,α .
• Do not reject H0 : µ ≥ c at the α level if t-statistic > –tn–1,α .
Figure 16.4 shows the rejection areas for both types of one-sided tests for a 5% level. The top graph shows the
rejection area for testing H0 : µ ≥ c, with the gray region corresponding to values less than –tn–1,0.05 and having an area
of 5%. The bottom graph shows the rejection area for testing H0 : µ ≤ c, with the gray region corresponding to values
greater than tn–1,0.05 and having an area of 5%.
Example 16.5 (Investment opportunity) Continuing Example 16.2, let’s make the additional assumption that the 10
observed weekly sales figures are i.i.d. draws from a normal distribution. Knowing that the sample mean of weekly
sales is $11,200 (x̄ = 11.2) and the sample standard deviation of weekly sales is $3,400 (sx = 3.4), would the null
hypothesis H0 : θ ≤ 10, which corresponds to the business not being a worthwhile investment, be rejected at the 5%
level? The t-statistic is
$$\frac{\bar{x} - c}{s_x/\sqrt{n}} = \frac{11.2 - 10}{3.4/\sqrt{10}} \approx 1.116.$$
Since tn–1,0.05 = t9,0.05 ≈ 1.833, the null hypothesis H0 : θ ≤ 10 is not rejected since the t-statistic 1.116 is less than the
critical value 1.833. How about a one-sided test at the 10% level? The critical value is t9,0.10 ≈ 1.383, so that the null
hypothesis is still not rejected at the 10% level since 1.116 < 1.383. Here is the R code for the necessary calculations:
Figure 16.4
Rejection areas for one-sided t-tests at a 5% level
# calculate t-statistic
tstat <- (11.2-10)/(3.4/sqrt(10))
tstat
## [1] 1.116098
# critical value for test at 5% level
qt(0.95,9)
## [1] 1.833113
# critical value for test at 10% level
qt(0.90,9)
## [1] 1.383029
The concept of p-values can be extended to one-sided hypothesis testing. Again, the difference from two-sided
testing is that only one of the tails is considered, with the left tail used for H0 : µ ≥ c and the right tail used for
H0 : µ ≤ c. For the one-sided t-test of H0 : µ ≥ c, the p-value is
$$\text{p-value} = P(T < \text{t-statistic}) = P\!\left(T < \frac{\bar{x} - c}{s_x/\sqrt{n}}\right) \text{ when } H_0 \text{ is true, where } T \sim t_{n-1}.$$
For the one-sided t-test of H0 : µ ≤ c, the p-value is
$$\text{p-value} = P(T > \text{t-statistic}) = P\!\left(T > \frac{\bar{x} - c}{s_x/\sqrt{n}}\right) \text{ when } H_0 \text{ is true, where } T \sim t_{n-1}.$$
As with a two-sided test, the p-value can be used to determine whether a test of a one-sided null hypothesis should
be rejected. The null hypothesis H0 is rejected if the p-value is less than the level α of the test and not rejected if the
p-value is greater than the level α of the test.
Example 16.6 (Investment opportunity) Continuing Example 16.5, the p-value associated with the t-statistic 1.116 is
the area to the right of 1.116 under the t9 distribution.
Since P(T > 1.116) ≈ 0.147 for a random variable T ∼ t9 , the p-value for the one-sided test of H0 : µ ≤ 10 is 0.147.
Thus, a one-sided t-test does not reject H0 : µ ≤ 10 for any level below 14.7%. Even if µ = 10, this p-value tells us that
there would be a 14.7% probability of seeing a t-statistic at least as large as the one observed (1.116).
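As a check, this one-sided p-value can be computed directly in R using the t-statistic from Example 16.5:
# one-sided p-value: area to the right of the t-statistic under t9
tstat <- (11.2-10)/(3.4/sqrt(10))
1-pt(tstat,9)   # approximately 0.147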
Finally, the connection between one-sided 1 – α confidence intervals and one-sided tests at the α level can be
established. Recall that the one-sided 1 – α confidence intervals for µ are
$$\left(\bar{x} - t_{n-1,\alpha}\frac{s_x}{\sqrt{n}},\ \infty\right) \quad \text{and} \quad \left(-\infty,\ \bar{x} + t_{n-1,\alpha}\frac{s_x}{\sqrt{n}}\right).$$
Proposition 16.2. For a t-test of the null hypothesis H0 : µ ≥ c at the α level, the null hypothesis H0 is rejected if c lies outside the one-sided 1 – α confidence interval $\left(-\infty,\ \bar{x} + t_{n-1,\alpha}\frac{s_x}{\sqrt{n}}\right)$ for µ and is not rejected if c lies inside that interval. For a t-test of the null hypothesis H0 : µ ≤ c at the α level, the null hypothesis H0 is rejected if c lies outside the one-sided 1 – α confidence interval $\left(\bar{x} - t_{n-1,\alpha}\frac{s_x}{\sqrt{n}},\ \infty\right)$ for µ and is not rejected if c lies inside that interval.
Example 16.7 (Investment opportunity) Re-visiting Example 16.5, the same conclusions for the one-sided tests at the
5% and 10% levels can be obtained by using one-sided confidence intervals and Proposition 16.2. The one-sided 95%
confidence interval for µ is
$$\left(\bar{x} - t_{n-1,\alpha}\frac{s_x}{\sqrt{n}},\ \infty\right) = \left(11.2 - t_{9,0.05}\,\frac{3.4}{\sqrt{10}},\ \infty\right) \approx (9.23,\ \infty),$$
meaning H0 : µ ≤ 10 is not rejected at a 5% level since the value 10 is within this confidence interval. Similarly, the one-sided 90% confidence interval for µ is
$$\left(\bar{x} - t_{n-1,\alpha}\frac{s_x}{\sqrt{n}},\ \infty\right) = \left(11.2 - t_{9,0.10}\,\frac{3.4}{\sqrt{10}},\ \infty\right) \approx (9.71,\ \infty),$$
meaning H0 : µ ≤ 10 is not rejected at a 10% level since the value 10 is within this confidence interval. But for the one-sided 85% confidence interval for µ, which is
$$\left(\bar{x} - t_{n-1,\alpha}\frac{s_x}{\sqrt{n}},\ \infty\right) = \left(11.2 - t_{9,0.15}\,\frac{3.4}{\sqrt{10}},\ \infty\right) = \left(11.2 - 1.0997\cdot\frac{3.4}{\sqrt{10}},\ \infty\right) \approx (10.02,\ \infty),$$
the value 10 is not within the interval. Thus, H0 : µ ≤ 10 is rejected at a 15% level, which agrees with the p-value of 0.147 found in Example 16.6.
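The lower bounds of these one-sided confidence intervals can be verified with a few lines of R:
# lower bounds of the one-sided 95%, 90%, and 85% confidence intervals
11.2 - qt(0.95,9)*3.4/sqrt(10)   # approximately 9.23
11.2 - qt(0.90,9)*3.4/sqrt(10)   # approximately 9.71
11.2 - qt(0.85,9)*3.4/sqrt(10)   # approximately 10.02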
We now consider an unknown parameter θ that can be estimated by an asymptotically normal estimator. In the same way that Section 14.4 provided asymptotic confidence intervals for this general case, this section considers hypothesis tests related to θ for a large sample.
We fix ideas by considering an estimator θ̂X and estimate θ̂x based upon a univariate random variable and data,
respectively, but the results generalize to other estimators and estimates, including those based upon multivariate data
and multiple samples. Following the treatment in Section 14.4, consider an asymptotically normal estimator θ̂X with
asymptotic distribution
$$\hat\theta_X \overset{a}{\sim} N\!\left(\theta, \frac{V}{n}\right) \quad \text{or, equivalently,} \quad \frac{\hat\theta_X - \theta}{\sqrt{V/n}} \overset{a}{\sim} N(0, 1),$$
where θ is the parameter (estimand) of interest and the associated asymptotic standard deviation is $\sqrt{V/n}$. For the estimate θ̂x , based upon the observed sample, the z-ratio is defined as
$$\text{z-ratio} = \frac{\hat\theta_x - \theta}{se(\hat\theta_x)},$$
where se(θ̂x ) is the standard error associated with the estimate θ̂x . For a large sample, the z-ratio is a realized draw from the standard normal distribution N(0, 1) since se(θ̂x ) is a consistent estimator of the asymptotic standard deviation $\sqrt{V/n}$. However, since θ is an unknown parameter, the z-ratio cannot be calculated.
Following the same approach as Section 16.1, consider the thought experiment in which the null hypothesis H0 : θ = c
is assumed to be true for some constant c. Define the z-statistic as
$$\text{z-statistic} = \frac{\hat\theta_x - c}{se(\hat\theta_x)},$$
which can be calculated since c is known. The z-statistic is the number of standard errors that the estimate θ̂x is away
from c, with the z-statistic being positive if the estimate θ̂x is above c and negative if the estimate θ̂x is below c. If the
null hypothesis H0 : θ = c is true, the z-statistic should be a realized draw from the standard normal distribution N(0, 1).
Before observing the data, there is approximately a 95% probability that the realized z-statistic is between –1.96 and
1.96 if H0 : θ = c is true or, more generally, a 1 – α probability that the realized z-statistic is between –zα/2 and zα/2 if
H0 : θ = c is true. Thus, the rejection rule for the two-sided test of H0 : θ = c based upon the z-statistic is as follows:

Rejection rule for the two-sided z-test (test at the α level):
• Reject H0 : θ = c at the α level if |z-statistic| ≥ zα/2 .
• Do not reject H0 : θ = c at the α level if |z-statistic| < zα/2 .
The resulting test is known as a z-test, as it uses critical values from the normal distribution, in contrast to the t-test
of Section 16.1 that uses critical values from the t-distribution. Unlike the t-test, the critical values for the z-test do not depend upon the sample size n. The magnitude of the z-statistic, $\left|\frac{\hat\theta_x - c}{se(\hat\theta_x)}\right|$, can be viewed as a statistical distance,
measuring the number of standard errors that the estimate θ̂x is from the hypothesized value c. By dividing by the
standard error se(θ̂x ), this statistical distance takes into account the uncertainty associated with the estimate θ̂x and,
based upon statistical theory, should be the absolute value of a realized draw from the N(0, 1) distribution if the null
hypothesis H0 : θ = c is true.
Example 16.8 (Widget website) For the e-mail experiment summarized in Example 16.1, suppose the e-mail marketing
director at widgets.com believes that the true purchase probability for e-mail A recipients is 25%. The null
hypothesis for this pre-experiment belief is
H0 : πA = 0.25.
Given that 60 out of 300 e-mail A recipients actually made a purchase, what would a z-test of this null hypothesis conclude? The unknown parameter here is θ = πA , and the hypothesized value is c = 0.25. The estimate of πA is θ̂x = x̄ = pA = 60/300 = 0.20. Then, since the standard error of x̄ is 0.0231, as calculated in Example 14.5, the z-statistic is
$$\frac{\hat\theta_x - c}{se(\hat\theta_x)} = \frac{0.20 - 0.25}{0.0231} \approx -2.16.$$
This z-statistic says that the estimated probability of 0.20 is 2.16 standard errors below the hypothesized true
probability of 0.25. For a test at the 5% level, the null hypothesis H0 : πA = 0.25 is rejected since |z-statistic| = |–2.16| = 2.16 ≥ 1.96 = z0.025 . For a test at the 1% level, the null hypothesis H0 is not rejected since |z-statistic| = 2.16 < 2.576 = z0.005 .
# calculate z-statistic
zstat <- (pa-0.25)/se_pa
zstat
## [1] -2.165064
# critical value for test at 5% level
qnorm(0.975)
## [1] 1.959964
# critical value for test at 1% level
qnorm(0.995)
## [1] 2.575829
How about testing whether e-mail campaign A and e-mail campaign B are equally effective? As discussed in
Example 16.1, the null hypothesis of interest is
H0 : πA = πB or, equivalently, H0 : πA – πB = 0.
This null hypothesis fits into the framework developed above for the unknown parameter θ = πA – πB , and the
alternative hypothesis is
H1 : πA ≠ πB or, equivalently, H1 : πA – πB ≠ 0.
The estimate of θ = πA – πB is θ̂x = pA – pB = 60/300 – 66/300 = 0.20 – 0.22 = –0.02. From Example 14.15, the standard error of this estimate is se(pA – pB ) = 0.0332. Then, the z-statistic for testing H0 : πA – πB = 0 is
$$\frac{(p_A - p_B) - 0}{se(p_A - p_B)} = \frac{-0.02 - 0}{0.0332} \approx -0.60.$$
Therefore, the null hypothesis H0 : πA – πB = 0 is not rejected at the 5% level since | – 0.60| < 1.96. The null hypothesis
is also not rejected at the 10% level since | – 0.60| < 1.645. As a result, there does not appear to be strong evidence
that, based upon the sample of e-mail A recipients and e-mail B recipients observed, there is a statistically significant
difference between πA and πB .
Using the same approach, we can test whether there are differences between (i) the purchase probability of e-mail A
recipients and non-recipients and (ii) the purchase probability of e-mail B recipients and non-recipients. For (i), the
null hypothesis is
H0 : πA – πC = 0,
which has z-statistic
$$\frac{(p_A - p_C) - 0}{se(p_A - p_C)} = \frac{(0.20 - 0.15) - 0}{0.0242} \approx 2.07.$$
The null hypothesis H0 : πA – πC = 0 is rejected at the 5% level since |2.07| ≥ 1.96, providing evidence of a statistically
significant difference between πA and πC . For (ii), the null hypothesis is
H0 : πB – πC = 0,
which has z-statistic
$$\frac{(p_B - p_C) - 0}{se(p_B - p_C)} = \frac{(0.22 - 0.15) - 0}{0.0250} \approx 2.80.$$
The null hypothesis H0 : πB – πC = 0 is rejected at the 5% level since |2.80| ≥ 1.96, providing evidence of a statistically significant difference between πB and πC . Taking these hypothesis tests together, there appears to be evidence that each of the e-mail recipient groups has a greater purchase probability than the non-recipient group; on the other hand, there is no strong evidence of a difference between the purchase probabilities of the e-mail A recipients and the e-mail B recipients.
Example 16.9 (Education and earnings) For the cps dataset, the correlation between weekly earnings (y = earnwk)
and education (x = educ) among the n = 2809 employed individuals is rxy = 0.325. From Example 14.10, the standard
error of rxy , as an estimate of the population correlation ρXY , is equal to 0.0169. To test whether there is any correlation
between weekly earnings and education in the population, the appropriate null hypothesis is
H0 : ρXY = 0.
The z-statistic for testing H0 : ρXY = 0 is
$$\frac{\hat\theta_x - c}{se(\hat\theta_x)} = \frac{r_{xy} - c}{se(r_{xy})} = \frac{0.325 - 0}{0.0169} \approx 19.23,$$
indicating that the sample correlation is 19.23 standard errors above 0, which is a lot! Thus, the null hypothesis
H0 : ρXY = 0 is rejected at the 5% level since |19.23| ≥ 1.96 and the 1% level since |19.23| ≥ 2.576. These rejections
are certainly not borderline rejections since the z-statistic is so large in magnitude. Testing whether the population
correlation is equal to zero is often of interest in other settings, as it provides an easy way to statistically support the
idea that two variables are related by rejecting that they are unrelated. To perform such a test using asymptotic theory
and a z-statistic, we only need the sample correlation and its standard error.50
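A quick sketch of this calculation in R, using the estimate and standard error reported above:
# z-statistic for H0: rho_XY = 0 and its two-sided p-value
zstat <- (0.325-0)/0.0169
zstat                     # approximately 19.23
2*(1-pnorm(abs(zstat)))   # numerically zero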
Example 16.10 (Exam score data) Example 14.13 considered the dataset exams, which contains scores on two
different 100-point exams for 77 students. The two exam scores are given by the variables exam1 and exam2. To
test whether the exams are equally difficult, a null hypothesis of interest is whether their true averages are the same:
H0 : µexam1 = µexam2 or, equivalently, H0 : µexam1 – µexam2 = 0.
The unknown parameter is θ = µexam1 – µexam2 . From Example 14.13, the estimated score difference is
$$\overline{\text{exam1}} - \overline{\text{exam2}} = 6.42,$$
with a standard error of
$$se(\overline{\text{exam1}} - \overline{\text{exam2}}) = 1.23.$$
The z-statistic associated with H0 : µexam1 – µexam2 = 0 is
$$\frac{6.42 - 0}{1.23} \approx 5.22,$$
which provides evidence of a statistically significant difference between the population average of exam1 and the
population average of exam2. The null hypothesis is rejected at a 5% level since |5.22| ≥ 1.96 and also at a 1% level
since |5.22| ≥ 2.576.
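A sketch of the corresponding calculation in R, using the reported estimate and standard error:
# z-statistic for H0: mu_exam1 - mu_exam2 = 0 and its two-sided p-value
zstat <- (6.42-0)/1.23
zstat                     # approximately 5.22
2*(1-pnorm(abs(zstat)))   # essentially zero (roughly 2e-07)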
Similar to the t-test of Section 16.1, the z-test can also be conducted through the use of confidence intervals. The rejection rule for the z-test based upon the two-sided confidence interval is as follows:

Rejection rule for the two-sided z-test based on the confidence interval (test at the α level):
• Reject H0 : θ = c at the α level if c lies outside the two-sided 1 – α confidence interval for θ.
• Do not reject H0 : θ = c at the α level if c lies inside the two-sided 1 – α confidence interval for θ.
Example 16.11 (Widget website) For the z-test of the null hypothesis H0 : πA = 0.25 in Example 16.8, the estimated purchase probability is θ̂x = x̄ = pA = 60/300 = 0.20, with the standard error of x̄ equal to 0.0231. As shown in Example 14.5, a 95% confidence interval for πA is
confidence interval for πA is
(0.20 – (1.96)(0.0231), 0.20 + (1.96)(0.0231)) ≈ (0.155, 0.245).
Since 0.25 does not fall within this 95% confidence interval, H0 : πA = 0.25 is rejected at the 5% level, the same
conclusion as reached in Example 16.8. On the other hand, a 99% confidence interval for πA is
(0.20 – (2.576)(0.0231), 0.20 + (2.576)(0.0231)) ≈ (0.140, 0.260),
meaning H0 : πA = 0.25 is not rejected at the 1% level since 0.25 is within the interval.
In the case of the two-sided z-test of H0 : θ = c, the p-value is
$$\text{p-value} = P(|Z| > |\text{z-statistic}|) = P\!\left(|Z| > \left|\frac{\hat\theta_x - c}{se(\hat\theta_x)}\right|\right) \text{ when } H_0 \text{ is true, where } Z \sim N(0, 1).$$
If the null hypothesis H0 is true, the p-value is the probability of observing a z-statistic at least as large in magnitude
as the one actually observed. Graphically, as shown in Figure 16.5, the p-value is equal to the total area in the two
tails, adding the area in the tail to the left of –|z-stat| and the area in tail to the right of |z-stat|. This figure is identical to
Figure 16.2, except that the distribution in Figure 16.5 is the N(0, 1) distribution while the distribution in Figure 16.2
is the tn–1 distribution.
For the z-test, like the t-test of Section 16.1, knowing the p-value tells us whether the null hypothesis H0 : θ = c is
rejected at any level α. When the p-value is less than α, the z-statistic must be in either the left tail (less than –zα/2 ) or
in the right tail (greater than zα/2 ), indicating rejection at the level α. On the other hand, when the p-value is greater
than α, the z-statistic is between –zα/2 and zα/2 , indicating a lack of rejection at the level α.
Example 16.12 (Widget website) For the z-test of the null hypothesis H0 : πA = 0.25 in Example 16.8, the calculated
z-statistic is
$$\frac{\hat\theta_x - c}{se(\hat\theta_x)} = \frac{0.20 - 0.25}{0.0231} \approx -2.16.$$
The associated p-value, for Z ∼ N(0, 1), is
p-value = P(|Z| > |z-statistic|) = P (|Z| > 2.16) ≈ 0.030.
Figure 16.5
p-value for a z-test
# calculate z-statistic
zstat <- (pa-0.25)/se_pa
# calculate two-sided p-value
2*(1-pnorm(abs(zstat)))
## [1] 0.03038282
The null hypothesis H0 : πA = 0.25 is rejected at any level above 3.0% and not rejected at any level below 3.0%,
which agrees with the conclusion in Examples 16.8 and 16.11 that H0 is rejected at a 5% level but not a 1% level.
Similarly, we can calculate p-values associated with the z-tests of the three null hypotheses H0 : πA – πB = 0, H0 :
πA – πC = 0, and H0 : πB – πC = 0 considered in Example 16.8.
# calculate z-statistics
zstat_abdiff <- (pa-pb)/sqrt(se_pa^2 + se_pb^2)
zstat_acdiff <- (pa-pc)/sqrt(se_pa^2 + se_pc^2)
zstat_bcdiff <- (pb-pc)/sqrt(se_pb^2 + se_pc^2)
zstat_abdiff
## [1] -0.6015661
zstat_acdiff
## [1] 2.064674
zstat_bcdiff
## [1] 2.79972
# calculate the p-values for the z-tests
2*(1-pnorm(abs(zstat_abdiff)))
## [1] 0.547463
2*(1-pnorm(abs(zstat_acdiff)))
## [1] 0.03895389
2*(1-pnorm(abs(zstat_bcdiff)))
## [1] 0.005114694
The following table summarizes the z-statistics and p-values for the three null hypotheses:
Null hypothesis z-statistic p-value
H0 : πA – πB = 0 –0.60 0.547
H0 : πA – πC = 0 2.06 0.039
H0 : πB – πC = 0 2.80 0.005
Having the p-values makes it easy to see that the second and third null hypotheses are rejected at a 5% level and
the first hypothesis is not. The strongest evidence of a difference is between πB and πC , with an associated p-value of
0.005, for which H0 : πB – πC = 0 is rejected at any level above 0.5%. The weakest evidence of a difference is between πA and πB , with the large p-value of 0.547 indicating that H0 : πA – πB = 0 is not rejected at any level below 54.7%.
The rejection areas for one-sided z-tests are similar to those depicted in Figure 16.4 for one-sided t-tests. For
instance, for a test at the 5% level, the 5% rejection area for testing H0 : θ ≥ c corresponds to all z-statistic values
less than –z0.05 , whereas the 5% rejection area for testing H0 : θ ≤ c corresponds to all z-statistic values greater than
z0.05 .
Example 16.13 (Betting strategy) An experienced sports gambler is convinced that they have a strategy that is
profitable at a certain casino. They have tested their strategy betting on 120 games, with a 55% success rate (66
winning bets on the 120 total games). Due to fees that the casino charges, the gambler needs a success rate π of at
least 52% to be profitable over the long run. The gambler is therefore interested in testing, and hopes to be able to
reject, the one-sided null hypothesis
H0 : π ≤ 0.52,
which corresponds to unprofitable π values. The alternative hypothesis H1 : π > 0.52 corresponds to profitable π values.
The observed success rate is consistent with the alternative hypothesis H1 being true, but the gambler is concerned
that the high realized success rate may have arisen due to chance. The estimate of π is 0.55, and the only additional
information needed for the z-test is the standard error of this estimate. Under the assumption that the success of each bet is an i.i.d. Bernoulli(π) draw, the standard error is $\sqrt{\frac{(0.55)(0.45)}{120}} \approx 0.0454$. The z-statistic is $\frac{0.55 - 0.52}{0.0454} \approx 0.661$, meaning the null hypothesis H0 : π ≤ 0.52 is not rejected at a 5% level since 0.661 < 1.645 = z0.05 . The gambler is right to be concerned. What if the 55% success rate had occurred for a much larger set of games, say 360 games instead of 120 games? In that case, the standard error would be $\sqrt{\frac{(0.55)(0.45)}{360}} \approx 0.0262$, yielding a z-statistic of $\frac{0.55 - 0.52}{0.0262} \approx 1.144$. Even with this larger sample, H0 : π ≤ 0.52 is not rejected at a 5% level since 1.144 < 1.645 = z0.05 .
Similar to one-sided t-tests, the concept of p-values can be extended to one-sided z-tests. For the one-sided z-test of
H0 : θ ≥ c, the p-value is
$$\text{p-value} = P(Z < \text{z-statistic}) = P\!\left(Z < \frac{\hat\theta_x - c}{se(\hat\theta_x)}\right) \text{ when } H_0 \text{ is true, where } Z \sim N(0, 1).$$
For the one-sided z-test of H0 : θ ≤ c, the p-value is
$$\text{p-value} = P(Z > \text{z-statistic}) = P\!\left(Z > \frac{\hat\theta_x - c}{se(\hat\theta_x)}\right) \text{ when } H_0 \text{ is true, where } Z \sim N(0, 1).$$
# calculate z-statistic
zstat <- (0.55-0.52)/sqrt(0.55*(1-0.55)/120)
zstat
## [1] 0.6605783
# calculate one-sided p-value
1-pnorm(zstat)
## [1] 0.2544414
The null hypothesis H0 : π ≤ 0.52 is not rejected at any level below 25.4%. Example 16.13 also considered how
things would change with a larger set of games (360) and the same success rate (55%), in which case the z-statistic
increases to 1.144 due to the lower standard error. The p-value would be P(Z > 1.14) ≈ 0.126, considerably lower than
0.254 but still implying that H0 : π ≤ 0.52 is not rejected at any level below 12.6%.
# calculate z-statistic
zstat <- (0.55-0.52)/sqrt(0.55*(1-0.55)/360)
zstat
## [1] 1.144155
# calculate one-sided p-value
1-pnorm(zstat)
## [1] 0.1262797
of the exam scores, which are $s_{exam1} = 14.19$ and $s_{exam2} = 14.15$. The estimated difference of the average scores is approximately 45% of the standard deviation on either exam, which is a practically important magnitude.
It is especially important to think about practical significance when sample sizes are large. A t-statistic or z-statistic has the standard error in its denominator, and since the standard error is proportional to $1/\sqrt{n}$, the test statistic becomes arbitrarily large in magnitude as n increases for any fixed value of its numerator. As a result, even if the estimate in the numerator is very small in magnitude, with little or no practical significance, it is possible to find that the estimate is statistically significant with a very low p-value. In the exam example, suppose the estimated difference in averages is 0.52 rather than 6.42 and the sample standard deviations are unchanged. With an extremely large sample size, the standard error would eventually be small enough that the null hypothesis of equal exam-score averages would be rejected at a 5% level, making the estimated difference statistically significant. The estimated difference would not,
however, be very practically significant since it represents only about 3.7% of the standard deviation on either exam.
The idea here is that with a very large sample, we can precisely estimate parameters whose true values are close
to zero from a practical point of view. The precision of the estimate can lead to a low p-value (when testing against
zero) and, thus, statistical significance even though the magnitude of the estimate is not practically significant. The
following example illustrates this issue.
Example 16.15 (Fundraising campaign) A non-profit organization would like to increase the average level of giving
among its 100,000 past donors, and a consultant has suggested a physical mail campaign to augment their usual e-
mail outreach efforts. The non-profit randomly selects 50,000 of its past donors as the “treatment” group to receive the
physical mailing, and the other 50,000 past donors serve as the “control” group that receives only the e-mail outreach.
The average donation received from the treatment donors is $32.20, with a sample standard deviation of $28.90, and
the average donation received from the control donors is $31.80, with a sample standard deviation of $29.10. If µT
and µC denote the population means of donations for the treatment and control subpopulations, respectively, a test of
the null hypothesis
H0 : µT = µC or H0 : µT – µC = 0
has a z-statistic equal to
$$\frac{32.20 - 31.80}{\sqrt{\frac{28.90^2}{50000} + \frac{29.10^2}{50000}}} \approx 2.18$$
and a p-value of approximately 0.029. Therefore, the estimated difference in average donations ($0.40 or 40 cents)
is statistically significant at a 5% level. But the 40-cent differential is likely not practically significant for the
organization, as 40 cents is just over 1% of the average donation level and might also be offset or negated by the
costs of the mail campaign.
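A sketch reproducing the calculation in R:
# z-statistic and p-value for the difference in average donations
zstat <- (32.20-31.80)/sqrt(28.90^2/50000 + 29.10^2/50000)
zstat                     # approximately 2.18
2*(1-pnorm(abs(zstat)))   # approximately 0.029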
and
H0 : πB = πC versus H1 : πB ≠ πC .
The p-values associated with these three z-tests were 0.547, 0.039, and 0.005, respectively. What if instead we wanted
to simultaneously test the equality of all three purchase probabilities, πA = πB = πC ? In order for πA = πB = πC to be
true, it must be the case that both πA = πB and πB = πC are true. Therefore, the null hypothesis can be written
H0 : πA = πB , πB = πC ,
where the convention is to read the comma in H0 as “and.” The null hypothesis H0 is false when either πA ≠ πB or πB ≠ πC , giving the alternative hypothesis
H1 : πA ≠ πB or πB ≠ πC .
For H0 to be false and H1 to be true, it is enough to have either πA ≠ πB or πB ≠ πC . For example, if πA ≠ πB and
πB = πC , the null hypothesis is false. A few remarks about the formulation of the null hypothesis (H0 : πA = πB , πB = πC )
are necessary. First, it is unnecessary to also include πA = πC in the statement of H0 since that equality is implied
by the other two. Inclusion of πA = πC is redundant. Second, the choice of the two equalities in the statement of H0
doesn’t matter, as long as all three purchase probabilities are involved. That is, it is equally appropriate to specify
the null hypothesis as H0 : πA = πC , πB = πC or H0 : πA = πB , πA = πC . For any of these equivalent statements of the null
hypothesis, the conclusion of the Wald test described below will be the same.52
To formalize the test of multiple hypotheses, suppose Q hypotheses are to be tested simultaneously. The notation
θ1 , θ2 , …, θQ denotes the unknown parameters to be tested against hypothesized values c1 , c2 , …, cQ , respectively. To
keep things general, each of the θj parameters may itself be a linear function of multiple unknown parameters, an idea
illustrated in the examples considered below. The null hypothesis of interest is
H0 : θ1 = c1 , θ2 = c2 , …, θQ = cQ .
This null hypothesis is said to consist of Q linear restrictions since each θj may be a linear function of parameters.
The associated test of the null hypothesis H0 tests whether all of the Q linear restrictions are true. The alternative
hypothesis, which is true when one or more of the Q linear restrictions are false, is
H1 : θ1 ≠ c1 or θ2 ≠ c2 or … or θQ ≠ cQ .
Example 16.17 (Widget website) For the null hypothesis
H0 : πA = πB , πB = πC
discussed in Example 16.16, there are two linear restrictions (Q = 2). Using the notation developed above, this null
hypothesis can be written
H0 : θ1 = 0, θ2 = 0,
where θ1 = πA – πB , θ2 = πB – πC , and c1 = c2 = 0. The alternative hypothesis
H1 : θ1 ≠ 0 or θ2 ≠ 0
is true when πA ≠ πB or πB ≠ πC .
Example 16.18 (Asset correlation) Suppose an investor’s portfolio currently consists of three assets A, B, and C.
The investor is considering whether to add an asset D to the portfolio but only wants to do so if there is no evidence
that asset D’s daily returns have correlation with the daily returns of the three other assets. Let ρAD denote the true
correlation of the daily returns of asset A and asset D, and similarly for ρBD and ρCD . The null hypothesis is
H0 : ρAD = 0, ρBD = 0, ρCD = 0,
which has three linear restrictions (Q = 3), with θ1 = ρAD , θ2 = ρBD , θ3 = ρCD , and c1 = c2 = c3 = 0. The alternative
hypothesis is
H1 : ρAD ≠ 0 or ρBD ≠ 0 or ρCD ≠ 0.
If the investor uses historical data to estimate the three correlations (ρAD , ρBD , ρCD ) and test the null hypothesis H0 , a
rejection of H0 would provide statistical evidence that at least one of the correlations is non-zero, in which case the
investor would not want to add asset D to the portfolio.
Example 16.19 (Earnings and marital status) The labor-force data in cps contains a categorical variable marstatus
indicating marital status, with four possible values: “Married,” “Divorced,” “Widowed,” and “Never married.”
Suppose we want to test whether there is any relationship between average weekly earnings (earnwk) and marital
status. Put another way, are average weekly earnings (earnwk) the same or different for the four groups of workers
as delineated by marital status? For notation, use M for married workers, D for divorced workers, W for widowed
workers, and N for never-married workers, and let µM , µD , µW , µN denote the population mean of weekly earnings for
each of the four subpopulations of employed individuals. Since we want to test µM = µD = µW = µN , the null hypothesis
can be written
H0 : µM = µD , µD = µW , µW = µN ,
which has Q = 3 linear restrictions, with θ1 = µM – µD , θ2 = µD – µW , θ3 = µW – µN , and c1 = c2 = c3 = 0. Again, as in
Example 16.16, there are equivalent ways of writing the null hypothesis (e.g., H0 : µM = µD , µM = µW , µM = µN ). The
alternative hypothesis is
H1 : µM ≠ µD or µD ≠ µW or µW ≠ µN .
If the null hypothesis H0 is tested and rejected, there is evidence of a statistically significant difference in average
weekly earnings between at least two of the four subpopulations of employed individuals.
To conduct a Wald test of a null hypothesis with multiple linear restrictions, we require $\sqrt{n}$-consistent and asymptotically normal estimators of the Q parameters θ1 , θ2 , …, θQ . Let the realized estimates of those parameters
be denoted as θ̂1 , θ̂2 , …, θ̂Q . Intuitively, when these estimates are “close to” the hypothesized values c1 , c2 , …, cQ ,
respectively, we are in a situation that is consistent with the null hypothesis H0 being true. On the other hand, when
one or more of the estimates is “far from” the hypothesized values c1 , c2 , …, cQ , respectively, we are in a situation that
is consistent with the alternative hypothesis H1 being true. The z-test discussed in Section 16.2 can be used to test
whether any individual θ̂j is close to an individual cj , but it is only useful for testing a single linear restriction rather
than multiple linear restrictions simultaneously.
To provide more intuition, consider the simplest case of Q = 2, where the null hypothesis is
H0 : θ1 = c1 , θ2 = c2 ,
and the alternative hypothesis is
H1 : θ1 ≠ c1 or θ2 ≠ c2 .
The estimates of θ1 and θ2 are θ̂1 and θ̂2 , respectively. The differences θ̂1 – c1 and θ̂2 – c2 form the basis of the Wald test, and the conclusion of the test is based upon how far each of these differences is from zero. If θ̂1 – c1 and θ̂2 – c2 are both very close to zero, in statistical terms, that situation would be consistent with the null hypothesis H0 being true. If the null hypothesis H0 is true, meaning θ1 = c1 and θ2 = c2 , then both θ̂1 – c1 and θ̂2 – c2 should be realizations of draws from normal distributions that are centered at zero. For the z-test, recall that the magnitude of the z-statistic, $\left|\frac{\hat\theta_x - c}{se(\hat\theta_x)}\right|$, is a statistical distance between θ̂x and c, obtained by dividing the actual distance |θ̂x – c| by the standard error se(θ̂x ). Unlike the z-test setting, there are now two distances, |θ̂1 – c1 | and |θ̂2 – c2 |, to consider when Q = 2 hypotheses are being tested.
The Wald statistic generalizes the notion of a statistical distance to handle both higher dimensions (two for the
Q = 2 case) and possible correlation between the two estimators of θ1 and θ2 . As the formula for the Wald statistic
requires more advanced mathematics for the general case, a more complete description of the Wald statistic is left for
the Appendix. Instead, for the remaining discussion in this section, we assume that R is able to calculate the Wald
statistic. In the case of Q = 2, the realized Wald statistic is a draw from a $\chi^2_2$ distribution (chi-square distribution with 2 degrees of freedom) if the null hypothesis H0 is true. The Wald statistic is always non-negative, which is intuitive since
it is a statistical distance measure. Values of the Wald statistic close to zero are consistent with the null hypothesis
H0 since such values arise when θ̂1 is close to c1 and θ̂2 is close to c2 . On the other hand, large positive values of the
Wald statistic are consistent with the alternative hypothesis since such values arise when θ̂1 is far from c1 and/or θ̂2 is
far from c2 . As a result, the Wald test is a one-sided test, where for Q = 2 rejection only occurs in the right tail of the $\chi^2_2$ distribution. For instance, for a test at a 5% level, we reject H0 if the Wald statistic is greater than the 95% quantile of the $\chi^2_2$ distribution, which is approximately 5.991. For a test at a 10% level, we reject H0 if the Wald statistic is greater than the 90% quantile of the $\chi^2_2$ distribution, which is approximately 4.605. These critical values are calculated in R with the qchisq function:
qchisq(0.95,2)
## [1] 5.991465
qchisq(0.90,2)
## [1] 4.60517
The Wald statistic generalizes to additional linear restrictions. For higher Q, there are more estimators to be
considered and, therefore, more distances (between realized estimates and hypothesized values) taken into account by
the Wald statistic. The biggest difference for higher Q is the sampling distribution of the Wald statistic when the null
hypothesis is true. Specifically, the realized Wald statistic is a draw from a $\chi^2_Q$ distribution (chi-square distribution with Q degrees of freedom) when the null hypothesis H0 : θ1 = c1 , θ2 = c2 , …, θQ = cQ is true. Intuitively, as more distances
are added into the Wald statistic, the overall statistical distance measure is expected to increase, which corresponds to
the thicker right tails of the chi-square distribution as the degrees of freedom increase.
If a statistical package can conduct a Wald test, it usually provides the value of the Wald statistic and/or the p-value
for the test itself. As with t-tests and z-tests, the p-value is the most useful, as it immediately tells us whether the
null hypothesis H0 would be rejected at any level. In the interest of completeness, however, we also describe how to
conduct the test based upon the Wald statistic itself. First, notation for the critical value of a chi-square distribution is
required.
Definition 16.5 The critical value wQ,q denotes the (1 – q) quantile of the $\chi^2_Q$ distribution. For example, w2,0.05 is the 95% quantile of the $\chi^2_2$ distribution, and w2,0.10 is the 90% quantile of the $\chi^2_2$ distribution.
The following proposition states the sampling distribution of the Wald statistic when the null hypothesis is true:
Proposition 16.3. The Wald statistic associated with testing the null hypothesis
H0 : θ1 = c1 , θ2 = c2 , …, θQ = cQ
is distributed as a $\chi^2_Q$ random variable if H0 is true. The probability that the Wald statistic is greater than the critical value wQ,α is equal to α if H0 is true.
For Q hypotheses being tested, given the $\chi^2_Q$ sampling distribution of the Wald statistic when H0 is true, the following
rejection rule based on critical values can be used:
Rejection rule for the Wald test based on the Wald statistic (test at the α level):
• Reject H0 : θ1 = c1 , θ2 = c2 , …, θQ = cQ at the α level if Wald statistic ≥ wQ,α .
• Do not reject H0 : θ1 = c1 , θ2 = c2 , …, θQ = cQ at the α level if Wald statistic < wQ,α .
Figure 16.6 provides a graphical view of the rejection area for a test at the 5% level, where the critical value is
wQ,0.05 . The gray region, corresponding to any Wald statistic above the critical value wQ,0.05 , indicates the area where
Figure 16.6
Rejection area for the Wald test at a 5% level
H0 is rejected. Therefore, the probability of rejecting H0 when H0 is true is equal to 5%, as expected since it’s the level
of the test.
Alternatively, if the p-value for the Wald test is available, this p-value can be directly used to test the null hypothesis,
similar to t-tests and z-tests. For the Wald test, the p-value is the probability that a $\chi^2_Q$ random variable is greater than the Wald statistic. If the null hypothesis H0 is true, the p-value provides the probability that a realized Wald statistic is
at least as large as the one calculated for the observed sample. With the p-value, the rejection rule should look familiar:
Rejection rule for the Wald test based on the p-value (test at the α level):
• Reject H0 : θ1 = c1 , θ2 = c2 , …, θQ = cQ at the α level if p-value < α.
• Do not reject H0 : θ1 = c1 , θ2 = c2 , …, θQ = cQ at the α level if p-value > α.
Figure 16.7 shows how the p-value relates to the Wald statistic and the $\chi^2_Q$ distribution. For a given Wald statistic, labeled “Wald stat” in the figure, the p-value is the area of the gray region to the right of the Wald statistic under the $\chi^2_Q$ distribution.
We now re-visit two of the examples from the beginning of this section.
Example 16.20 (Widget website) Continuing Example 16.17, the null hypothesis
H0 : πA = πB , πB = πC
or, equivalently,
H0 : θ1 = 0, θ2 = 0,
Figure 16.7
p-value for a Wald test
where θ1 = πA – πB and θ2 = πB – πC , corresponds to the purchase probabilities being the same for the three groups (e-
mail A recipients, e-mail B recipients, and non-recipients). There are Q = 2 hypotheses being tested. The Wald statistic
associated with H0 turns out to be 11.17. The p-value is calculated in R with the pchisq function:
1-pchisq(11.17,2)
## [1] 0.00375375
The p-value is approximately 0.004, meaning the null hypothesis is rejected at any level above 0.4% and providing
strong statistical evidence that the three purchase probabilities are not all equal to each other. By itself, the Wald test
doesn’t tell us which of the tested equalities is causing the rejection, although the previous evidence from the z-tests
for this example suggests that the rejection is being driven by the fact that the estimate of πC (pC = 0.15) is much lower
than the estimates of πA (pA = 0.20) and πB (pB = 0.22).
Example 16.21 (Earnings and marital status) Continuing Example 16.19, where µM , µD , µW , µN denoted the
population mean of weekly earnings for the four subpopulations based on marital status (“Married,” “Divorced,”
“Widowed,” “Never married,” respectively), the null hypothesis associated with these population means being equal
to each other is
H0 : µM = µD , µD = µW , µW = µN .
There are Q = 3 hypotheses being tested. The sample means of weekly earnings for the four groups are
x̄M = 1047, x̄D = 902, x̄W = 661, and x̄N = 820.
The Wald statistic associated with H0 turns out to be 80.90, which has a p-value of 0.000.
1-pchisq(80.90,3)
## [1] 0
Thus, the null hypothesis is rejected at any level, providing strong statistical evidence against the population means
being the same for the four groups. We might wonder if this result is being driven by the much lower sample average of
weekly earnings observed for the “Widowed” group, with x̄W = 661. To ignore the “Widowed” group, we could instead
test the null hypothesis
H0 : µM = µD , µD = µN ,
which has Q = 2 hypotheses. For this null hypothesis, the Wald statistic is 49.31, still with a p-value of 0.000, indicating
again that there is strong statistical evidence against the population means being the same for the three remaining
groups.
What happens for the Wald test when there is only a single restriction (Q = 1)? In this case, the null hypothesis
is H0 : θ1 = c1 , which can be tested with a z-test. Would we get a different answer using a Wald test? Thankfully,
the answer is no. Whether the test of H0 is conducted using a z-test or a Wald test, the p-value for the test will be
numerically identical, meaning the rejection conclusion is also the same for the two tests. This equivalence arises
since, in the Q = 1 case, the Wald statistic is exactly equal to the square of the z-statistic, and the critical value w1,α is exactly equal to the square of the critical value zα/2 . The latter fact follows from a $\chi^2_1$ random variable being equal to the square of a Z ∼ N(0, 1) random variable.
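This equivalence is easy to verify numerically in R, shown here for α = 0.05:
# w_{1,0.05} equals the square of z_{0.025}
qchisq(0.95,1)   # approximately 3.841
qnorm(0.975)^2   # approximately 3.841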
To allow for this possibility, we introduce the estimated asymptotic variance matrix with L rows and L columns, denoted V̂:
$$\hat{V} = \begin{pmatrix} \hat{v}_{11} & \hat{v}_{12} & \cdots & \hat{v}_{1L} \\ \hat{v}_{21} & \hat{v}_{22} & \cdots & \hat{v}_{2L} \\ \vdots & \vdots & \ddots & \vdots \\ \hat{v}_{L1} & \hat{v}_{L2} & \cdots & \hat{v}_{LL} \end{pmatrix}.$$
The diagonal elements are the estimated asymptotic variances of the estimates γ̂1 , γ̂2 , …, γ̂L . For example, v̂22 is the
estimated asymptotic variance of γ̂2 . The off-diagonal elements are the estimated asymptotic covariances between two
estimates. For example, v̂12 is the estimated asymptotic covariance between γ̂1 and γ̂2 . (The true covariance will be
zero if the two underlying estimators are independent.)
To represent the Q linear restrictions being tested, we use a matrix denoted R (with Q rows and L columns) and a vector denoted c (with Q rows), so that the null hypothesis can be written compactly as H0 : Rγ = c, where γ = (γ1 , γ2 , …, γL )′ is the vector of parameters. Each row of R, together with the corresponding element of c, encodes one linear restriction. The following examples illustrate:
• Testing the equality of four population averages: For the population averages (µA , µB , µC , µD ), we let γ1 = µA , γ2 = µB , γ3 = µC , and γ4 = µD . The null hypothesis H0 : µA = µB = µC = µD has three linear restrictions, given by γ1 = γ2 , γ2 = γ3 , and γ3 = γ4 . We have L = 4 parameters and Q = 3 restrictions. Then, R and c are
$$R = \begin{pmatrix} 1 & -1 & 0 & 0 \\ 0 & 1 & -1 & 0 \\ 0 & 0 & 1 & -1 \end{pmatrix} \quad \text{and} \quad c = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}.$$
• Testing that all parameters are equal to zero: In some situations, it may be of interest to test whether a set of
parameters are all equal to zero. For instance, this test is often used in the context of multiple regression models
(Chapter 18). If we have L parameters γ1 , γ2 , …, γL and want to test
H0 : γ1 = γ2 = · · · = γL = 0,
there are Q = L linear restrictions (γ1 = 0, γ2 = 0, …, γL = 0). Then, R and c are
$$R = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix} \quad \text{and} \quad c = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}.$$
In this case, R is the identity matrix, with ones along the diagonal and zeros everywhere else.
• Other tests: The framework using R and c is quite general. Suppose we have L = 5 parameters and want to jointly
test the following three restrictions:
H0 : γ1 + γ2 = 4, γ3 = 2γ4 , γ5 = 10.
This null hypothesis says that the first two parameters sum to 4, the third parameter is two times the size of the
fourth parameter, and the fifth parameter is equal to 10. The corresponding R and c are
$$R = \begin{pmatrix} 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & -2 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix} \quad \text{and} \quad c = \begin{pmatrix} 4 \\ 0 \\ 10 \end{pmatrix}.$$
The first row has the restriction γ1 + γ2 = 4, the second row has the restriction γ3 – 2γ4 = 0, and the third row has the
restriction γ5 = 10.
Since R does not have a built-in function for general Wald tests, we introduce a user-defined function wald_test below. The function wald_test takes four arguments: the estimate vector gamma_hat (γ̂), the asymptotic variance matrix var_gamma_hat (V̂), the restriction matrix R (R), and the constant vector c (c). The restriction matrix R and the constant vector c are optional arguments, with default values equal to the identity matrix (ones along the diagonal and zeros for all non-diagonal elements) and the zero vector, respectively; these default values correspond to a null hypothesis whose restrictions are that each element of γ is equal to zero. The function wald_test returns a list containing two elements, the Wald statistic (W) and its associated p-value (p_value). The p-value is calculated by determining the area to the right of the Wald statistic for the $\chi^2_Q$ distribution.
# when R has one row (one restriction), make sure R has matrix type
if (!is.matrix(R)) {
R <- t(as.matrix(R))
}
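Only this fragment of wald_test appears here; the complete script is available on the companion website. As a guide, here is a minimal sketch of the full function that is consistent with the description above and incorporates the matrix-type check just shown (the quadratic form is the standard Wald formula; the book's actual implementation may differ in details):
wald_test <- function(gamma_hat, var_gamma_hat,
                      R = diag(length(gamma_hat)),
                      c = rep(0, length(gamma_hat))) {
  # when R has one row (one restriction), make sure R has matrix type
  if (!is.matrix(R)) {
    R <- t(as.matrix(R))
  }
  Q <- nrow(R)
  # discrepancies between estimated restrictions and hypothesized values
  diff <- R %*% gamma_hat - c
  # Wald statistic: quadratic form in the inverse variance of R %*% gamma_hat
  W <- as.numeric(t(diff) %*% solve(R %*% var_gamma_hat %*% t(R)) %*% diff)
  # p-value: area to the right of W under the chi-square(Q) distribution
  p_value <- 1 - pchisq(W, df = Q)
  return(list(W = W, p_value = p_value))
}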
We also define additional R functions to handle estimation of the asymptotic variance matrix V̂ for some of the
more common examples of the Wald test.
Sample proportions of independent samples: The function var_prop_indep estimates the asymptotic
variance matrix for a vector of sample proportions estimated on different (independent) samples. Its arguments are
pi_hat, the vector of sample proportions, and nobs, the vector of underlying sample sizes for each sample.
return(var_pi_hat)
}
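Only the closing lines of var_prop_indep appear above; a minimal sketch of the full function, consistent with its description:
var_prop_indep <- function(pi_hat, nobs) {
  # estimated variance of each sample proportion is p(1-p)/n;
  # independence across samples makes all covariances zero
  var_pi_hat <- diag(pi_hat*(1-pi_hat)/nobs, nrow = length(pi_hat))
  return(var_pi_hat)
}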
Example 16.22 (Widget website) We re-visit Example 16.20 to detail how the Wald statistic and p-value are
calculated for the null hypothesis H0 : πA = πB = πC . Here is the R code that uses the functions var_prop_indep
and wald_test:
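The code block itself is on the companion website; a sketch reconstructed from the description that follows looks approximately like this:
# estimated purchase probabilities for the three groups
gamma_hat <- c(0.20, 0.22, 0.15)
# asymptotic variance matrix for independent samples of sizes 300, 300, 2400
var_gamma_hat <- var_prop_indep(gamma_hat, c(300,300,2400))
# restrictions: gamma1 - gamma2 = 0 and gamma2 - gamma3 = 0
R <- rbind(c(1,-1,0),c(0,1,-1))
wald_test(gamma_hat, var_gamma_hat, R, c(0,0))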
First, we store the estimated purchase probabilities (γ̂1 = π̂A = 0.20, γ̂2 = π̂B = 0.22, γ̂3 = π̂C = 0.15) in the vector
gamma_hat. Then, we estimate the asymptotic covariance matrix using var_prop_indep with the arguments
gamma_hat (the Bernoulli parameter estimates) and the vector of sample sizes c(300,300,2400). The matrix
R and the vector c are defined to correspond to the linear restrictions γ1 – γ2 = 0 and γ2 – γ3 = 0. We use the rbind
function to construct the R matrix. The rbind function takes vectors as arguments, where each vector corresponds
to one row, and stacks the vectors as rows in a matrix. The command R <- rbind(c(1,-1,0),c(0,1,-1))
stores the matrix
$$\begin{pmatrix} 1 & -1 & 0 \\ 0 & 1 & -1 \end{pmatrix}$$
in the variable R. Finally, the function wald_test is called and outputs both the Wald statistic and the p-value, as
seen previously in Example 16.20.
Sample means of independent samples: The function var_mean_indep estimates the asymptotic variance
matrix for sample means estimated on different (independent) samples. Its argument is x_vectors, a list of the
sample vectors (i.e., the actual observations in each sample). Each of the sample vectors within the list x_vectors
may have different length.
# initialize variables
num_means <- length(x_vectors)
tempvec <- rep(0, num_means)
return(var_mean)
}
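Again, only fragments of var_mean_indep survive here; a minimal sketch of the full function:
var_mean_indep <- function(x_vectors) {
  # initialize variables
  num_means <- length(x_vectors)
  tempvec <- rep(0, num_means)
  # estimated variance of each sample mean is s^2/n for that sample
  for (j in 1:num_means) {
    tempvec[j] <- var(x_vectors[[j]])/length(x_vectors[[j]])
  }
  # independent samples: diagonal asymptotic variance matrix
  var_mean <- diag(tempvec, nrow = num_means)
  return(var_mean)
}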
What is a list? A list in R is a collection of data objects, which may be of different data types. The list x_vectors is
a collection of the sample vectors for each of the (independent) samples. For example, in the case of two samples, given
by sample1 and sample2, the function call would be var_mean_indep(list(sample1,sample2)),
where the function list combines its arguments into a list object.
Example 16.23 (Earnings and marital status) We re-visit Example 16.21 to detail how the Wald statistic and p-value are calculated for the null hypothesis H0 : µM = µD = µW = µN , corresponding to the population mean of weekly earnings being the same for the four subpopulations based on marital status (“Married,” “Divorced,” “Widowed,” and “Never married,” respectively). In this case, the number of parameters is L = 4, and the number of linear restrictions in H0 is Q = 3. Here is the R code to conduct the Wald test:
Example 16.21 also considered dropping the “Widowed” subsample and testing H0 : µM = µD = µN . In this case, the
number of parameters is L = 3, and the number of linear restrictions in H0 is Q = 2. Here is the R code to conduct the
Wald test:
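A sketch for this second test, dropping the “Widowed” group (under the same assumptions as above):
# test equality of means for the three remaining groups (Q = 2)
groups2 <- c("Married", "Divorced", "Never married")
earn_list2 <- lapply(groups2, function(g) cps$earnwk[cps$marstatus == g])
wald_test(sapply(earn_list2, mean), var_mean_indep(earn_list2),
          rbind(c(1,-1,0), c(0,1,-1)), c(0,0))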
Sample means from the same sample: The function var_mean_onesample estimates the asymptotic variance
matrix for the sample means of several variables from the same sample. Its arguments are df, the data frame containing
the sample, and vars, a vector of either variable names or indices that identifies which elements of the data frame df
to consider.
return(var_mean)
}
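As with the earlier helper functions, only the closing lines survive here; a minimal sketch of the full function:
var_mean_onesample <- function(df, vars) {
  # keep only the requested variables
  x <- df[, vars]
  n <- nrow(x)
  # means computed from the same sample are correlated, so use the full
  # sample covariance matrix of the variables, divided by the sample size
  var_mean <- cov(x)/n
  return(var_mean)
}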
Example 16.24 (Exam score data) Example 16.10 implemented a z-test to test the equality of µexam1 and µexam2 , the
population means associated with the variables exam1 and exam2 in the exams dataset. There are L = 2 parameters,
and the z-test is a special case of the Wald test with Q = 1 linear restriction. The null hypothesis is H0 : µexam1 – µexam2 =
0. Here is the R code to conduct the Wald test:
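A sketch of this code, assuming the exams data frame is loaded:
# Wald test of mu_exam1 - mu_exam2 = 0 (L = 2 parameters, Q = 1 restriction)
gamma_hat <- c(mean(exams$exam1), mean(exams$exam2))
var_gamma_hat <- var_mean_onesample(exams, c("exam1","exam2"))
# R is a single row; the matrix-type check inside wald_test handles this
wald_test(gamma_hat, var_gamma_hat, R = c(1,-1), c = 0)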
The p-value from this Wald test is the same as the p-value from the z-test in Example 16.10, and the Wald statistic
(approximately 27.13) is equal to the z-statistic (approximately 5.22) squared. If there were another exam within the
dataset, say exam3, we could test the equality of all three population means, H0 : µexam1 = µexam2 = µexam3 , by modifying
the R code to account for the extra parameter (L = 3) and the extra restriction (Q = 2).
Notes
49 The use of the word “null” in “null hypothesis” stems from the fact that many tests of interest consider c = 0, so that H0 : θ = 0. If θ is a parameter that measures some sort of effect, which is common in regression models, the null hypothesis H0 corresponds to no effect or a “null” effect.
50 An alternative approach is to use a simple linear regression of one variable on the other variable, which is covered in Chapter 17. For the simple
linear regression, the z-test uses the slope estimate and its standard error.
51 The test is named after Abraham Wald, who introduced the idea in a 1943 paper in the Transactions of the American Mathematical Society.
52 More formally, the p-values associated with the tests of these three null hypotheses are numerically identical.
Exercises
1. You are considering investing in a company called Tech Trove. The company has been listed on a stock exchange
for ten years, with the following annual returns:
0.04, –0.10, 0.17, 0.02, –0.19, –0.08, –0.01, 0.09, 0.22, 0.06.
Assume the annual returns are i.i.d. draws from a normal distribution with unknown mean µ and unknown variance σ 2 .
(a) What is the 95% confidence interval for µ?
(b) What is the t-statistic for the test of the null hypothesis H0 : µ = 0? Do you reject H0 at a 5% level?
(c) What is the p-value for the test of the null hypothesis H0 : µ = 0?
(d) Without doing any additional calculations, how would the t-statistic for testing H0 : µ = –0.01 compare to the
t-statistic found in (b)?
(e) Without doing any additional calculations, how would the p-value for testing H0 : µ = –0.01 compare to the
p-value found in (c)?
2. A random sample of 12 undergraduates enrolled in an introductory economics course is asked to predict the annual income for their first job after graduation. The sample average of the responses is $49,000, and the sample standard deviation of the responses is $8,000. Assume that the individual responses are drawn independently from a normal distribution.
(a) If µX denotes the population average of the predicted annual income, what is the t-statistic associated with
testing H0 : µX = 50000?
(b) Do you reject H0 : µX = 50000 at a 5% level?
(c) What is the p-value for the t-test of H0 : µX = 50000?
(d) If the same sample average and sample standard deviation were observed for a sample size of n = 1200 (instead
of n = 12), would you reject H0 : µX = 50000 at a 5% level?
(e) For (d), is it necessary to assume that the individual responses are drawn from a normal distribution?
3. The average grade among fourth graders on a math aptitude test in a certain state is 62.5 out of 100 points. (Assume
62.5 is the population average.) An educational company has developed an app for elementary-school math and would
like to test whether the app improves average performance on the aptitude test among fourth graders. Let µX denote
the population mean of test scores among students who use the app.
(a) To provide statistical evidence that the app increases the true average of test scores, what are the appropriate
one-sided null hypothesis and alternative hypothesis?
(b) The company provides the app to 15 randomly selected students. The sample mean and sample standard
deviation of test scores are 68.3 and 13.5, respectively. Under the assumption that the test scores are i.i.d. and
normally distributed, would the t-test of the null hypothesis in (a) be rejected at the 5% level? at the 10% level?
(c) What is the p-value associated with the t-test in (b)?
4. Suppose 200 students are chosen at random to do a blind taste test of green M&M’s versus red M&M’s. Of the 200
students, 112 students prefer green M&M’s and 88 prefer red M&M’s. Let π denote the population probability that a
randomly selected student prefers green M&M’s.
(a) What is the asymptotic 95% confidence interval for π?
(b) State the null hypothesis that corresponds to green M&M’s and red M&M’s being equally preferred in the
population.
(c) What is the p-value associated with the test of the null hypothesis in (b)?
(d) Suppose another 200 students are chosen to do the blind taste test, and once again(!) 112 prefer green M&M’s
and 88 prefer red M&M’s. How do the following quantities for the 400-student sample compare to the
associated quantities for the initial 200-student sample?
i. Middle of the asymptotic 95% confidence interval for π
ii. Width of the asymptotic 95% confidence interval for π
iii. The p-value for testing the null hypothesis that green M&M’s and red M&M’s are equally preferred in
the population
5. Use the metricsgrades dataset for this question. These data are from a graduate econometrics course with 68
students.
(a) Provide an asymptotic 95% confidence interval for the difference in the population means of total (composite
course score) for the subpopulations of domestic (domestic = 1) and international (domestic = 0) students.
(b) What is the p-value for the z-test of the null hypothesis that the population means of total (composite course
score) for the subpopulations of domestic (domestic = 1) and international (domestic = 0) students are the same?
(c) Define an indicator variable hiscore that is equal to 1 if total > 80 and 0 otherwise. What is the p-value for the
z-test of the null hypothesis that P(hiscore = 1|domestic = 1) = P(hiscore = 1|domestic = 0)?
6. *A classmate gives you a coin to toss and claims it is a fair coin, but you have your suspicions. Let π be the
probability of heads with this coin.
(a) If you toss the coin 100 times and get 58 heads, what is the p-value for testing H0 : π = 0.50?
(b) Suppose the true heads probability is π = 0.51, so that
$$\frac{\bar{X} - 0.51}{\sqrt{\frac{(0.51)(0.49)}{n}}} \sim N(0, 1).$$
The z-statistic for testing H0 : π = 0.50, before observing outcomes of the coin tosses, can be written
$$\frac{\bar{X} - 0.50}{\sqrt{\frac{(0.51)(0.49)}{n}}} = \frac{\bar{X} - 0.51}{\sqrt{\frac{(0.51)(0.49)}{n}}} + \frac{0.51 - 0.50}{\sqrt{\frac{(0.51)(0.49)}{n}}},$$
where the first term is a N(0, 1) random variable and the second term is a constant (depending on n). Given
this expression, how many tosses n∗ would be required for there to be a 99% probability that the z-statistic is
greater than 1.96?
(c) For the n∗ found in (b), and still assuming π = 0.51, conduct 100,000 simulations in R to determine the probability that the z-statistic is greater than 1.96. For this part, use the usual $se(\bar{x}) = \sqrt{\frac{\bar{x}(1-\bar{x})}{n}}$ formula rather than the “true” $\sqrt{\frac{(0.51)(0.49)}{n}}$ formula.
(d) Repeat (b) for different values of the true heads probability, with π = {0.51, 0.52, …, 0.59, 0.60} being the
possibilities. Plot a graph of required sample size n∗ against probability π.
7. You have an estimate θ̂x = 1.2 of an unknown parameter θ based upon a sample size of n = 100. You calculate that
the p-value associated with a z-test of H0 : θ = 1 is p∗ . How do the following quantities compare with p∗ ?
(a) the p-value for a z-test of H0 : θ = 1.1
(b) the p-value for a one-sided z-test of H0 : θ ≤ 1
8. Suppose the smoking rate among adults aged 25-44 in the United States is 11.2%. A public-health researcher has
an informational video that she is convinced will lower the prevalence of smoking in this age group. She randomly
selects a sample (including smokers and non-smokers) from this age group and then follows up in six months to ask
whether they are a smoker or not. Let π denote the probability of smoking for an adult aged 25-44 who sees the video.
(a) To provide statistical evidence that the informational video decreases the smoking rate, what are the appropriate
one-sided null hypothesis and alternative hypothesis?
(b) If the researcher finds that 10% of participants are smokers in her follow-up survey, how large would her sample
need to be to reject the null hypothesis in (a) at a 5% level?
9. Use the sp500 dataset for this question.
(a) What are the z-statistic and p-value for testing that the population average of Home Depot (HD) monthly returns
is equal to the population average of Lowe’s (LOW) monthly returns?
(b) The sample average of the market-index (IDX) monthly returns is 0.0078. If you are interested in testing
whether the population average of Home Depot monthly returns is the same as the population average of the
market-index monthly returns, would it be appropriate to test H0 : µHD = 0.0078? Explain.
(c) *Use the bootstrap for this part to calculate the standard error required for the test. What are the z-statistic
and p-value for testing that the population median of Home Depot monthly returns is equal to the population
median of Lowe’s monthly returns?
(d) *Use the bootstrap for this part to calculate the standard error required for the test. What are the z-statistic
and p-value for testing that the population standard deviation of Home Depot monthly returns is equal to the
population standard deviation of Lowe’s monthly returns?
(e) *Use the bootstrap for this part to calculate the standard error required for the test. What are the z-statistic and
p-value for testing that ρHD,IDX (the population correlation between Home Depot returns and the market-index
returns) is equal to ρLOW,IDX (the population correlation between Lowe’s returns and the market-index returns)?
10. Use the cps dataset for this question. Focus on the sample of 2,809 employed individuals.
(a) Provide the sample correlation matrix for the variables hrslastwk, age, educ, and ownchild.
(b) For each sample correlation in the correlation matrix from (a), calculate the p-value for the z-test of the null
hypothesis that the population correlation is equal to zero. Which of the correlations are statistically significant
at a 5% level?
(c) Define an indicator variable collgrad that is equal to 1 if educ ≥ 16 and 0 otherwise. Calculate the sample
standard deviation of hrslastwk for the two subsamples of college graduates (collgrad = 1) and non-college
graduates (collgrad = 0). What are the z-statistic and p-value for the test of the null hypothesis that the
population standard deviations for the two subpopulations (college graduates and non-college graduates) are
the same? (Hint: Use the se_sx function defined in Section 14.4.)
11. *A podcast provider has 1,000 customers who have signed up for a free three-month trial. It would like to do an
A/B test of two alternative plans to get customers to subscribe after the free trial, where plan A guarantees a $2.99
monthly fee forever if the customer pays 12 months up-front and plan B guarantees a $3.99 monthly fee forever but
allows cancellation at any time. Let πA and πB be the true probabilities of subscription for plans A and B, respectively.
The podcast provider chooses the number of customers offered plan A, denoted n∗ , with the 1000 – n∗ other customers
offered plan B. Let XA and XB be the Bernoulli random variables for the two plans, with success indicating a subscriber.
(a) If n∗ = 300, what is the asymptotic distribution of X̄A – X̄B in terms of πA and πB ?
(b) What value of n∗ minimizes the asymptotic variance of X̄A – X̄B (in terms of πA and πB )?
(c) If the null hypothesis H0 : πA = πB is true, what value of n∗ minimizes the asymptotic variance of X̄A – X̄B ?
(d) Suppose plan A is slightly more effective than plan B, with πA – πB = ε for some small number ε > 0. Then, the
z-statistic for testing H0 : πA = πB, before observing outcomes, can be written
(X̄A – X̄B)/sd(X̄A – X̄B) = (X̄A – X̄B – ε)/sd(X̄A – X̄B) + ε/sd(X̄A – X̄B).
Since ε is small, you may assume that sd(X̄A – X̄B) is approximately the same as it would be for πA = πB. Explain
why the choice of n∗ found in (c) is the best choice for testing H0 : πA = πB, in the sense that it is the choice that
makes rejection the most likely when πA – πB = ε.
12. Use the brands dataset for this question. The dataset consists of 14,560 observations on customers who purchased
a candy bar in their last visit to a specific market. There are five brands, numbered 1 through 5, and the last_brand
variable indicates the brand that was purchased on the last visit. The purchase variable is 1 if the customer purchases a
candy bar during their current visit and 0 otherwise. If a purchase is made (purchase = 1), the variable brand indicates
the brand purchased on the current visit; if no purchase is made (purchase = 0), the variable brand has a value of 0.
(a) For each of the five conditional purchase probabilities (purchase given brand 1 on last visit, purchase given
brand 2 on last visit, and so on), provide the estimated probability and an asymptotic 95% confidence interval.
(b) For the 10 possible pairs of different brands (b1 and b2 ), provide the p-value for the test of
H0 : P(purchase = 1|last_brand = b1 ) = P(purchase = 1|last_brand = b2 ).
How many pairs indicate a statistically significant difference at a 5% level?
(c) Create a new variable same_brand equal to 1 if brand = last_brand and 0 otherwise. For each brand b ∈
{1, 2, 3, 4, 5}, provide the estimate of the conditional probability
P(same_brand = 1|last_brand = b, purchase = 1),
along with an asymptotic 95% confidence interval.
(d) For the 10 possible pairs of different brands (b1 and b2 ), provide the p-value for the test of
H0 : P(same_brand = 1|last_brand = b1 , purchase = 1) = P(same_brand = 1|last_brand = b2 , purchase = 1).
How many pairs indicate a statistically significant difference at a 5% level?
(e) *Returning to the conditional purchase probabilities in (a), use the wald_test function to test the null
hypothesis that all five probabilities are equal to each other. What is the p-value? Do you reject at the 5%
level?
17 The simple linear regression model
The linear regression model describes the relationship between an outcome variable and one or more explanatory
variables. The linear regression estimator, also known as the least-squares estimator, is one of the most commonly
used techniques for data analysis. This estimator is used for two main purposes: (i) making predictions about the
outcome variable based upon values of the explanatory variable(s) and (ii) estimating the causal effect of one or more
explanatory variables on the outcome variable.
This chapter focuses on the simplest form of the model, one with a single explanatory variable, known as the
simple linear regression model. Chapter 18 considers a more general model that allows for multiple explanatory
variables, known as the multiple linear regression model. The linear regression model is “linear” since it assumes a
linear relationship between the outcome variable and the explanatory variable(s), but it allows for additional random
noise in the outcome variable. As discussed in Chapter 18, the linear nature is not as restrictive as the terminology
suggests since nonlinearity can be incorporated with additional explanatory variables.
We focus on cross-sectional data in both this chapter and Chapter 18, with the sample assumed to consist of
i.i.d. draws from the underlying population. While linear regression models and estimators certainly apply to other
types of data, including time-series data and panel data, the treatment of these cases is beyond the scope of this book.
• Exogeneity assumption: The assumption E(U|X) = 0, known as the exogeneity assumption, states that the
explanatory variable X provides no information about the expected value of U. Regardless of the value of X, the
conditional expectation of U is always equal to zero.53
The exogeneity assumption E(U|X) = 0 requires that the random variables X and U are uncorrelated (σXU = ρXU = 0),
but exogeneity is a stronger assumption than σXU = ρXU = 0.
Proposition 17.1. If the exogeneity assumption E(U|X) = 0 holds, then σXU = ρXU = 0. Moreover, for any function g(·),
if the exogeneity assumption holds, then g(X) and U are uncorrelated: σg(X)U = ρg(X)U = 0.
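A short verification of the first claim, using the law of iterated expectations (a sketch added here for completeness):
σXU = E(XU) – E(X)E(U) = E{X E(U|X)} – E(X) E{E(U|X)} = 0.
The same argument with g(X) in place of X gives σg(X)U = 0, and zero covariance implies zero correlation.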
While the exogeneity assumption implies that X and U are uncorrelated, the reverse is not necessarily true. Even if
X and U are uncorrelated, E(U|X = x) may be non-zero for some x if U is correlated with some nonlinear function of X.
Under the exogeneity assumption, the conditional expectation E(Y|X) simplifies to a linear function of X:
E(Y|X) = E(α + βX + U|X) = E(α|X) + E(βX|X) + E(U|X) = α + βX.
The first equality comes from plugging in α + βX + U for Y based upon the SLR model. The second equality follows
from the fact that the expected value of the sum of random variables is the sum of the expected values of the random
variables. The third equality uses several facts: α is a constant, so E(α|X) = α; after conditioning on X, βX is also a
constant, so E(βX|X) = βX; and, E(U|X) = 0 by the exogeneity assumption.
Therefore, the SLR model can be thought of as a linear model for E(Y|X). The meaning of the two parameters, the
intercept α and the slope β, follows directly from the equation E(Y|X) = α + βX:
• Meaning of the intercept α: When X = 0, the conditional expectation is E(Y|X = 0) = α + β · 0 = α, implying
α = E(Y|X = 0).
α is the population mean of Y conditional on X = 0. Graphically, α is the y-intercept of the line E(Y|X) = α + βX.
The units of α are the same as the units of the random variable Y. Whether the value of α has a meaningful
interpretation depends upon the particular application. Generally speaking, if zero is a sensible value for the X
variable, then the value of α has a meaningful interpretation. For instance, if Y measures an individual’s earnings
and X measures years of work experience, α is the population average of earnings conditional on no work experience
(i.e., individuals newly in the workforce). On the other hand, if Y measures an individual’s earnings and X measures
an individual’s age in years, α would not be meaningful by itself since it corresponds to the population earnings
conditional on an individual being zero years of age.
• Meaning of the slope β: The parameter β is the slope of the E(Y|X) = α + βX line, indicating how much E(Y|X)
Figure 17.1
Data-generating process for the SLR model (a realized point (x∗, y∗) with residual u∗ relative to the line E(Y|X = x) = α + βx, which has intercept (0, α) and slope β)
from the conditional distribution of U given X = x∗, the realized draw of Y is y∗ = α + βx∗ + u∗. Figure 17.1 provides a
graphical representation of this data-generating process. In the figure, the conditional expectation E(Y|X) = α + βX is
drawn with a positive intercept α and a positive slope β. The point (0, α) is where the E(Y|X) = α + βX line crosses the
y-axis. The figure shows a single realized data point (x∗, y∗), arising from the process just described. For the x∗ value
drawn from the unconditional distribution of X, the conditional expectation E(Y|X = x∗) = α + βx∗ is read off the line.
For the point in the figure, the draw of u∗ from the conditional distribution of U given X = x∗ is a positive value, leading
to a y∗ = α + βx∗ + u∗ value that is above the line (y∗ > α + βx∗). If the draw of u∗ had been negative, the observed data
point would have been below the line.
Example 17.1 (Advertising and sales) Suppose a retail company has stores in 80 different cities. During a particular
week, the company randomly chooses 20 cities for on-line targeted advertising on a social media website. The targeted
advertising is costly, which is why the company advertises to a subset of the 80 cities. If SALES is the random variable
for weekly sales at a given location and AD is the random variable indicating whether the location receives targeted
advertising (AD = 1) or not (AD = 0), the SLR model relating SALES and AD is
SALES = α + βAD + U with E(U|AD) = 0.
The exogeneity assumption E(U|AD) = 0 is likely to hold here. Since AD is assigned randomly by the company to be
0 or 1, there is no reason to expect a relationship between AD and the unobservable U. This situation is an example
of an A/B experiment where the SLR model allows the estimation of the causal effect of targeted advertising (AD) on
weekly sales (SALES). With the exogeneity assumption E(U|AD) = 0 holding, the SLR model implies that
E(SALES|AD = 0) = α and E(SALES|AD = 1) = α + β.
Figure 17.2
SLR model for sales and advertising (conditional expectations α at AD = 0 and α + β at AD = 1 along the line E(SALES|AD) = α + βAD)
Figure 17.2 shows the SLR model for this example. While the line E(SALES|AD) = α + βAD is defined for all values
of AD, the only relevant values are AD = 0 (no advertising) and AD = 1 (advertising), and the conditional expectations
for these two values are indicated by the solid points. The figure is drawn assuming a positive slope β > 0; of course,
since β is unknown, it is possible that β = 0 (advertising has no effect) or even β < 0 (advertising has a negative effect).
The parameter α is the population mean of sales for cities not receiving targeted advertising, and α + β is the
population mean of sales for cities receiving targeted advertising. Subtracting the E(SALES|AD = 0) equation from the
E(SALES|AD = 1) equation yields
β = E(SALES|AD = 1) – E(SALES|AD = 0),
which is the causal effect of targeted advertising, equal to the difference in the population mean of sales between
targeted cities and non-targeted cities. Since we will discuss how to estimate β below, the SLR model thus provides
a natural framework for estimating the difference in average outcomes for the “A” and “B” groups in an A/B experiment.
Example 17.2 (Earnings and union status) Suppose we are interested in modeling the relationship between weekly
earnings, measured by the random variable EARNWK, and union status, measured by the random variable UNION,
among employed individuals in the population. EARNWK is a continuous random variable, and UNION is an indicator
variable equal to 1 for a union member and 0 for a non-member. The SLR model relating EARNWK and UNION is
EARNWK = α + βUNION + U with E(U|UNION) = 0.
This example is similar to Example 17.1 since the explanatory variable is an indicator variable. α is the conditional
population average of weekly earnings for non-members,
α = E(EARNWK|UNION = 0),
and β is the difference in the population average of weekly earnings between union members and non-members,
β = E(EARNWK|UNION = 1) – E(EARNWK|UNION = 0).
Unlike Example 17.1, however, the explanatory variable UNION is not randomly assigned. Should we expect that the
exogeneity assumption E(U|UNION) = 0 holds here? There is the possibility that union status may be related to the
unobservable U. For instance, if union members tend to have higher skill levels than non-members and, thus, higher
productivity than non-members, it could be the case that UNION and U are positively correlated. Such a correlation
between UNION and U would imply that the exogeneity assumption E(U|UNION) = 0 does not hold. As discussed
below, failure of the exogeneity assumption makes causal inference difficult (e.g., testing whether union membership
causes higher earnings) but does not preclude estimating the association between earnings and union membership
(e.g., estimating the difference between average earnings for union members and average earnings for non-members).
Example 17.3 (Earnings and education) Rather than looking at the relationship between weekly earnings and union
membership (Example 17.2), suppose we are interested in the relationship between weekly earnings (EARNWK) and
years of educational attainment (EDUC) among employed individuals in the population. The SLR model relating
EARNWK and EDUC is
EARNWK = α + βEDUC + U with E(U|EDUC) = 0.
EDUC is a discrete explanatory variable that can take several different values, in contrast to the binary explanatory
variable UNION. As a result, the linearity assumption is considerably stronger here since it requires that the linear
relationship, given by the intercept α and the slope β, holds over the entire range of the EDUC variable, as opposed
to holding for the two possible values of UNION.54 The slope β is the change in the conditional expectation of weekly
earnings associated with a one-year change in education. The intercept α is not really meaningful by itself since
EDUC = 0 is far below the minimum education level traditionally observed in labor-force datasets in the United
States. As in Example 17.2, there is a concern that the exogeneity assumption may not hold here. Specifically, it is
generally believed by labor economists that EDUC is positively correlated with U since education likely has a positive
association with unobserved productivity, unobserved family wealth, etc. This positive association would imply that
E(U|EDUC) = 0 does not hold.
Example 17.4 (Cigarette sales and cigarette taxes) As discussed in Example 5.3, there is a lot of variation in state-
level cigarette taxes in the United States. Standard economic theory predicts a negative relationship between cigarette
sales (the “demand for cigarettes”) and cigarette taxes since a higher tax is associated with a higher price. We use
per-capita sales to make the data comparable across states. Let CIGSALES denote the number of packs per capita sold
in a year in a given state, and let CIGTAX denote the state tax (in dollars) on a pack of cigarettes. Both CIGSALES
and CIGTAX are continuous variables, and the SLR model relating the two variables is
CIGSALES = α + βCIGTAX + U with E(U|CIGTAX) = 0.
Since cigarette taxes are the result of a political and legislative process in each state, there might be a concern that the
exogeneity assumption E(U|CIGTAX) = 0 does not hold. For instance, if there are very few smokers in a given state,
there may be little resistance to a law that increases cigarette taxes. On the other hand, if there are many smokers in a
given state, there might be more resistance to such a law. If true, these arguments imply a negative correlation between
CIGTAX and the unobservable U since U is higher in states where people have a greater inclination to smoke.
Example 17.5 (Monthly stock returns and the overall market) One of the simplest and most commonly used models
for stock returns is a SLR model that relates the return of an individual stock to the return of the overall stock market.
The standard way to measure the return of the overall stock market is to use an index, like the S&P 500 or the Russell
2000. Generally, the individual stock would be part of the index being used. To fix ideas, let RSTOCK denote the monthly
return of a specific stock and RIDX denote the monthly return of the S&P 500 index. Then, the SLR model relating the
individual stock return to the overall market return is55
RSTOCK = α + βRIDX + U with E(U|RIDX ) = 0.
The intercept
α = E(RSTOCK |RIDX = 0)
is the population mean of the individual stock return RSTOCK conditional on RIDX = 0, which corresponds to the S&P 500
index being unchanged (zero market return) during a month. The slope β is the change in the conditional expectation
of the individual stock’s return RSTOCK associated with a one-unit change in the index return RIDX . In terms of returns,
a one-unit change is huge since it corresponds to a 100 percentage-point change. Instead, β can be interpreted in terms
of a smaller change in RIDX , like a one percentage-point change. For a one percentage-point (0.01) change in RIDX ,
the associated change in the conditional expectation of RSTOCK is 0.01β. The nature of the individual stock returns’
relationship to the overall market returns depends upon β as follows:
• β = 0: The expected stock return is not related to the market return.
• β = 1: The expected stock return moves exactly in tandem with the market return. For instance, if the market return
is 0.02 higher, the expected stock return is also 0.02 higher.
• 0 < β < 1: The expected stock return moves in the same direction as the market return but with a smaller magnitude.
For instance, if the market return is 0.02 higher, the expected stock return is higher by an amount less than 0.02.
• β > 1: The expected stock return moves in the same direction as the market return but with a larger magnitude. For
instance, if the market return is 0.02 higher, the expected stock return is higher by an amount greater than 0.02.
• β < 0: The expected stock return moves in the opposite direction as the market return. The expected stock return
decreases when the market return increases, and the expected stock return increases when the market return
decreases. Due to the behavior of the expected stock return in this case, such a stock is often called countercyclical.
Figure 17.3
SLR model and observed data (points (x∗, y∗) above and (x∗∗, y∗∗) below the line E(Y|X = x) = α + βx)
data point (xi , yi ) in the figure, the value of the residual ui = yi – α – βxi is equal to the vertical distance from the data
point to the α + βx line, with points above the line having positive residual values and points below the line having
negative residual values.
To develop estimators of the SLR model parameters, we use the following proposition which states how the
parameters α and β are related to the population descriptive statistics of the random variables (Y, X). The key idea
is that, once the parameters are expressed in terms of population descriptive statistics, estimators can be developed by
using sample descriptive statistics in place of population descriptive statistics.
Proposition 17.2. If the SLR model holds and σX2 > 0, then the slope β and intercept α are related to the population
descriptive statistics as follows:
β = σXY/σX²  or, equivalently,  β = ρXY (σY/σX),
and
α = µY – βµX = µY – (σXY/σX²) µX.
The assumption that σX² > 0 means that X is not constant and, since σX² is in the denominator of the β expression,
ensures that β is a well-defined parameter. To show the first result for β, note that
Cov(X, Y) = Cov(X, α + βX + U) = Cov(X, α) + Cov(X, βX) + Cov(X, U) = βVar(X).
The last equality follows from Cov(X, α) = 0, since α is constant, and Cov(X, U) = 0, which is implied by the exogeneity
assumption E(U|X) = 0. Dividing both sides by Var(X), which is possible since Var(X) is assumed to be positive,
yields β = Cov(X, Y)/Var(X) = σXY/σX². The second result for β follows from the relationship between a population
covariance and a population correlation: since σXY = ρXY σX σY, it follows that β = σXY/σX² = ρXY (σY/σX).
Based upon the least-squares estimates α̂ and β̂, the estimated regression line is
Ê(Y|X = x) = α̂ + β̂x,
where Ê(Y|X = x) denotes the estimate of the true conditional expectation E(Y|X = x). Since the formula for α̂ can be
re-written as
ȳ = α̂ + β̂x̄,
it follows that the point of sample means (x̄, ȳ) falls on the estimated regression line. That is, ȳ is the estimated
conditional expectation of Y given that X is equal to the sample mean x̄.
Proposition 17.3. If the SLR model holds and s²X > 0, the least-squares slope estimator
β̂XY = sXY/s²X = rXY (sY/sX)
and the least-squares intercept estimator
α̂XY = Ȳ – β̂XY X̄
are consistent estimators of β and α, respectively.
Consistency follows from the continuous mapping theorem (Proposition 14.10). β̂XY = sXY/s²X consistently estimates
β = σXY/σX² since sXY and s²X are consistent estimators of σXY and σX², respectively. And, α̂XY = Ȳ – β̂XY X̄ consistently
estimates α = µY – βµX since Ȳ and X̄ are consistent estimators of µY and µX, respectively. It turns out that the least-
squares estimators β̂XY and α̂XY are also unbiased estimators of β and α, respectively. But since we will only focus
on asymptotic statistical inference for the least-squares estimators, the consistency property is the important one. The
estimators are also asymptotically normal, as discussed in detail below.
The lm function, where “lm” is short for “linear model,” is the primary function used for least-squares estimation
in R. Although lm has many different arguments available, the following provides the basic syntax that handles most
linear-regression problems of interest:
• lm(formula, data, subset): Returns the results from the least-squares estimation of the model specified
by formula. The optional argument data specifies the data frame to be used and can greatly simplify how
formula is written. The optional argument subset is a logical vector that specifies the subset of data to be used
for estimation.
The lm function automatically ignores observations for which the variables used in the formula argument have
missing (NA) values. Therefore, it is not necessary to remove rows of the vector or data frame with NA values.
Here are some simple examples that illustrate the usage of the lm function:
• lm(df$y~df$x): This lm command returns the results from the least-squares estimation of a SLR model with
df$y as the outcome variable and df$x as the explanatory variable. The syntax for the formula argument has
the outcome variable before the tilde (~) and the explanatory variable after the tilde. Here, df is a data frame that
contains the two variables.
• lm(y~x, data=df): This lm command is identical to lm(df$y~df$x), with the data=df argument
indicating that the variables in the formula argument are in the df data frame.
• lm(y~x, data=df, subset=(x>10)): This lm command does the least-squares estimation, with df$y
as the outcome variable and df$x as the explanatory variable, on the subsample for which (x>10) is true.
Example 17.6 (Earnings and union status) Example 17.2 considered a SLR model for the relationship between weekly
earnings (EARNWK) and union membership (UNION):
EARNWK = α + βUNION + U.
To estimate this model, consider the sample of employed individuals (n = 2809) from the cps dataset. The outcome
variable is earnwk, and the explanatory variable is an indicator variable that indicates union membership. Although
unionstatus is a categorical variable (with categories “Non-union” and “Union”) in the dataset, for the purposes of
estimating the SLR model, we define a binary variable with union = 1 for the “Union” category and union = 0 for the
“Non-union” category. The least-squares estimates of α and β can be calculated in R using the lm function:
lm(earnwk~union, data=cps)
##
## Call:
## lm(formula = earnwk ~ union, data = cps)
##
## Coefficients:
## (Intercept) union
## 947 251
lm(earnwk~unionstatus, data=cps)
##
## Call:
## lm(formula = earnwk ~ unionstatus, data = cps)
##
## Coefficients:
## (Intercept) unionstatusUnion
## 947 251
The lm function recognizes unionstatus as a categorical variable and automatically creates an indicator
variable, which appears as unionstatusUnion in the output and is equal to 1 when unionstatus is Union
and 0 when unionstatus is Non-union. This capability extends to categorical variables with more than two
categories. For example, the command lm(earnwk~race, data=cps) would automatically create two indicator
variables based upon the categorical variable race since it has three categories. Whether we explicitly create the
indicator variable(s) based upon a categorical variable or let lm do it for us is largely a matter of preference, though
creating the indicator variables ourselves allows us to explicitly indicate which is the “omitted category” in the model.
The relationship between the least-squares estimates and the different subsample averages of y, seen in Example 17.6,
is a general property for any least-squares estimates when x is a binary variable.
Proposition 17.4. For a SLR model with a binary explanatory variable X, the least squares estimates α̂ and β̂ satisfy
α̂ = ȳ0
and
β̂ = ȳ1 – ȳ0 ,
where ȳ0 is the average of y values in the x = 0 subsample and ȳ1 is the average of y values in the x = 1 subsample.
The intercept estimate α̂ corresponds to the sample average of y values for the x = 0 subsample, whereas the slope
estimate β̂ is the difference in the sample average of y values for the x = 1 subsample and the sample average of y
values for the x = 0 subsample. Therefore, Proposition 17.4 implies that the estimated line α̂ + β̂x passes through the
two points corresponding to the sample averages of y values for the x = 0 and x = 1 subsamples. Specifically, the points
(0, ȳ0) and (1, ȳ1) lie on the estimated line, corresponding to the slope estimate β̂ = (ȳ1 – ȳ0)/(1 – 0) = ȳ1 – ȳ0. It is
important to stress that this property does not generalize to an x variable that has more than two possible values.
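As a quick numerical check of Proposition 17.4 (a sketch, assuming the 0/1 variable union defined in Example 17.6 is available in the cps data frame):
mean(cps$earnwk[cps$union == 0])   # ybar0, matching the intercept estimate (947)
mean(cps$earnwk[cps$union == 1]) - mean(cps$earnwk[cps$union == 0])   # ybar1 - ybar0, matching the slope estimate (251)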
Example 17.7 (Cigarette sales and cigarette taxes) Example 17.4 considered a SLR model for the relationship
between state-level annual cigarette sales (number of packs per capita), given by the random variable CIGSALES,
and state-level cigarette taxes (dollars per pack), given by the random variable CIGTAX:
CIGSALES = α + βCIGTAX + U.
The dataset cigdata contains data for 2019 that can be used to estimate this SLR model. Specifically, data for the 50
individual states plus the District of Columbia yields a sample with n = 51 observations. The realized outcome variable
is cigsales, and the realized explanatory variable is cigtax.
lm(cigsales~cigtax, data=cigdata)
##
## Call:
## lm(formula = cigsales ~ cigtax, data = cigdata)
##
## Coefficients:
## (Intercept) cigtax
## 55.95 -9.49
Figure 17.4
Least-squares estimated line for cigarette data
An estimate of this difference is 2β̂ = (2)(–9.49) = –18.98, or expected state-level sales that are estimated to be 18.98
packs per-capita lower among the population of states with a $3 per-pack tax as compared to the population of states
with a $1 per-pack tax. This estimate says something about expected state-level sales. It is possible that the realized
state-level sales in a specific $3-tax state may be higher than the realized state-level sales in a specific $1-tax state. But
averaging over many such states with $3 taxes and $1 taxes, the negative slope estimate says that expected state-level
sales are estimated to be lower in states with $3 taxes than in states with $1 taxes.
Here is the R code used to draw the scatter plot and least-squares regression line in Figure 17.4:
par(mfrow = c(1,1))
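The remaining plotting commands are not reproduced above; a minimal sketch consistent with the description below (the axis labels are an assumption) is:
plot(cigdata$cigtax, cigdata$cigsales, xlab = "cigtax", ylab = "cigsales")
abline(lm(cigsales~cigtax, data = cigdata))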
After the scatter plot is drawn with the plot function, the least-squares regression line is drawn by using the lm
regression itself as the argument of the abline function. The abline function determines the intercept and slope of
the regression line from the lm regression results.
Example 17.8 (Monthly stock returns and the overall market) Example 17.5 introduced a SLR model of the
relationship between the returns of an individual stock and the returns of a market index:
RSTOCK = α + βRIDX + U.
We estimate the parameters of this model using a sample from the sp500 dataset, which has n = 364 monthly
observations. The variable IDX contains the monthly returns for the S&P 500 index, which we use as the realized
values of the random variable RIDX from the model above. Since the model can be used for any individual stock, let’s
use Home Depot (HD) as the first one to examine.
lm(HD~IDX, data=sp500)
##
## Call:
## lm(formula = HD ~ IDX, data = sp500)
##
## Coefficients:
## (Intercept) IDX
## 0.00873 1.02045
Figure 17.5
Least-squares estimated line for Home Depot versus market index
monthly return. For Bank of America, the intercept estimate of α̂ = 0.0014 indicates that the estimated expected monthly
return of Bank of America is 0.0014, or 0.14%, when the market index monthly return is zero.
Figure 17.6
Least-squares estimated line (the true line E(Y|X = x) = α + βx from the SLR model versus the estimated line Ê(Y|X = x) = α̂ + β̂x, with fitted value ŷi = α̂ + β̂xi at the point (xi, yi))
out to be less (i.e., a flatter slope) than the true β. Looking at the specific point (xi , yi ) highlighted in the figure, the
fitted value ŷi = α̂ + β̂xi is read off the estimated line, with the realized outcome yi greater than the fitted value ŷi and
the point being above the estimated line.
ŷi = 55.95 – 9.49xi
and
ûi = yi – ŷi = yi – (55.95 – 9.49xi).
These fitted values and estimated residuals are calculated by the lm function in R. Specifically, if the results from
lm estimation are stored in a variable called results, the fitted values and estimated residuals are contained in
results$fitted.values and results$residuals, respectively:
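For the cigarette regression of Example 17.7, a minimal sketch consistent with the variable names uhat and yhat used in Example 17.10 is:
results <- lm(cigsales~cigtax, data = cigdata)
yhat <- results$fitted.values   # fitted values
uhat <- results$residuals       # estimated residuals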
The following table shows fitted values ŷi and estimated residuals ûi for the first seven states in the data:
State xi (cigtaxi ) yi (cigsalesi ) ŷi ûi
Alaska 2.00 30.4 37.0 –6.6
Alabama 0.675 53.1 49.5 3.6
Arkansas 1.15 47.6 45.0 2.6
Arizona 2.00 20.7 37.0 –16.3
California 2.87 15.8 28.7 –12.9
Colorado 0.84 30.4 48.0 –17.6
Connecticut 4.35 21.9 14.7 7.2
The negative slope estimate is reflected by the fact that the fitted values are larger in states with lower tax rates. For
example, Alabama has a low tax rate of $0.675 per pack and a high fitted value of 49.5 packs per-capita, whereas
Connecticut has a high tax rate of $4.35 per pack and a low fitted value of 14.7 packs per-capita. The fitted value ŷi
can be thought of as an in-sample prediction since it’s an estimate of the conditional expectation of the outcome given
that the explanatory variable is equal to xi . For Alabama, the fitted value of 49.5 is what the estimates say we should
expect, on average, for a state having a tax rate of $0.675. The actual observed outcome for Alabama is yi = 53.1 packs
per-capita, which is 3.6 higher than the fitted value or in-sample prediction. The 3.6 value is the estimated residual ûi
for Alabama, which can also be thought of as an in-sample prediction error. For Connecticut, the observed outcome
is 21.9 packs per-capita, which is 7.2 higher than the fitted value of 14.7. That is, the fitted value ŷi = 14.7 packs per-
capita is what would be expected, on average, for a state with a tax rate xi = 4.35, and the estimated residual ûi = 7.2
indicates that the observed outcome is 7.2 packs per-capita higher than the in-sample prediction.
Estimation of the SLR model is often called least-squares estimation. The “least-squares” descriptor is used since
an important property of the estimates α̂ and β̂ is that they minimize a summation that involves squared estimated
residuals. This least-squares property is stated in the following proposition:
Proposition 17.6. If s²x > 0, the least-squares estimates α̂ = ȳ – β̂x̄ and β̂ = sxy/s²x are the values of a and b that minimize
the function
S(a, b) = Σᵢ₌₁ⁿ (yi – a – bxi)².
The values a and b can be thought of as guesses for the intercept α and the slope β, respectively. Then, the term
yi – a – bxi is an estimated residual associated with those guesses. The function S(a, b) is a summation of the squared
values of these estimated residuals over all the sample observations. Why do we want to minimize the function S(a, b)?
The intuition is that “good guesses” for a and b are associated with an estimated regression line that goes through the
data points in such a way that the vertical distances from the outcomes (yi ) to the estimated line (a + bxi ) are relatively
small. The proof of Proposition 17.6 uses partial derivatives, and the interested reader can refer to the endnote.57
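As a numerical illustration (a sketch, not part of the book's scripts), a general-purpose minimizer applied to S(a, b) recovers the least-squares estimates from Example 17.7:
S <- function(par, y, x) sum((y - par[1] - par[2]*x)^2)   # par = c(a, b)
optim(c(0, 0), S, y = cigdata$cigsales, x = cigdata$cigtax)$par
## approximately (55.95, -9.49), matching the lm estimates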
Evaluated at the least-squares estimates, the function S(a, b) from Proposition 17.6 is
S(α̂, β̂) = Σᵢ₌₁ⁿ (yi – α̂ – β̂xi)² = Σᵢ₌₁ⁿ ûᵢ²,
which is the sum of squared estimated residuals. This summation provides a measure of the overall noise in the yi
values that is not explained by the estimated regression line, but it is not normalized by the number of observations
and, therefore, can be large simply because of a large sample size. A more useful measure is one that normalizes by
the sample size n. The following definition introduces the residual variance estimate, which is based upon the sum
of squared estimated residuals but also normalizes by the sample size:
Definition 17.4 The residual variance estimate is σ̂U² = (1/(n – 2)) Σᵢ₌₁ⁿ ûᵢ², and the residual standard deviation
estimate is σ̂U = √((1/(n – 2)) Σᵢ₌₁ⁿ ûᵢ²).
The division by n – 2, rather than n or n – 1, used in these two definitions is discussed below.
The following proposition summarizes some important properties of the estimated residuals and the residual
variance estimate:
Proposition 17.7. (Properties of estimated residuals) The estimated residuals ûi = yi – α̂ – β̂xi , based upon the least-
squares estimates α̂ and β̂, have the following properties:
(i) The sample average of the estimated residuals is zero: (1/n) Σᵢ₌₁ⁿ ûᵢ = 0.
(ii) The sample correlation between the values of the explanatory variable and the estimated residuals is zero:
rxû = 0.
(iii) The sample correlation between the fitted values and the estimated residuals is zero:
rŷû = 0.
(iv) The residual variance estimate σ̂U² = (1/(n – 2)) Σᵢ₌₁ⁿ ûᵢ² is a consistent (and unbiased) estimate of σU², the
unconditional variance of the random variable U.
(v) The residual standard deviation estimate σ̂U = √((1/(n – 2)) Σᵢ₌₁ⁿ ûᵢ²) is a consistent estimate of σU, the
unconditional standard deviation of the random variable U.
Properties (i) and (ii) follow directly from the minimization problem in Proposition 17.6 (see endnote 57).
Property (i) says that, on average, the difference between the realized outcome yi and its fitted value ŷi is equal to
zero. Equivalently, property (i) says that, on average, the fitted values are equal to the sample mean ȳ:
(1/n) Σᵢ₌₁ⁿ ŷᵢ = (1/n) Σᵢ₌₁ⁿ yᵢ – (1/n) Σᵢ₌₁ⁿ ûᵢ = (1/n) Σᵢ₌₁ⁿ yᵢ = ȳ.
Thinking of the fitted value ŷi as an in-sample prediction of the outcome Y associated with X = xi , the residual ûi can be
thought of as a prediction error. This property, which says that the average prediction error is equal to zero, is desirable
since a non-zero average prediction error would indicate that the estimation is systematically over- or under-estimating
the expected outcomes. Property (ii) says that the explanatory variable xi and the estimated residual ûi are uncorrelated.
Knowing whether xi is above or below the sample mean x̄ does not provide information about whether ûi tends to be
above or below its sample mean (which is zero). This property is desirable since a non-zero correlation between xi
and ûi would indicate that there is information contained in the xi values that is not being utilized. For example, if the
correlation between xi and ûi were positive, it would imply that ûi tends to be positive for higher values of xi (xi > x̄) and
negative for lower values of xi (xi < x̄), meaning a line with a larger slope would provide a better fit than the estimated
regression line.
Property (ii) is a sample counterpart to the population property that X and U are uncorrelated, which is implied
by the exogeneity assumption E(U|X) = 0. That said, the fact that xi and ûi are uncorrelated is a property of the least-
squares estimates that holds regardless of whether or not the exogeneity assumption is true. It might be tempting to try
to test whether X and U are uncorrelated by looking at the sample correlation between xi and ûi . Unfortunately, such
an approach is not useful since that sample correlation is always zero, even if X and U are correlated.
Property (iii) follows directly from property (ii) since ŷi is a perfect linear function of xi . Specifically, the covariance
between fitted values and estimated residuals is
sŷû = Cov(α̂ + β̂x, û) = β̂Cov(x, û) = 0,
which implies rŷû = 0.
Property (iv) provides a way to estimate the overall noise or variation contained in the unobservable U. The
unconditional variance of U is σU², and the estimate σ̂U² = (1/(n – 2)) Σᵢ₌₁ⁿ ûᵢ² is consistent, getting arbitrarily close
to σU² as the sample size increases. The 1/(n – 2) scaling is usually used in practice since it also leads to σ̂U² being
unbiased. As compared to the 1/(n – 1) scaling used for a sample variance estimate, the “2” in the 1/(n – 2) scaling
accounts for the two estimates α̂ and β̂ needed to calculate the ûᵢ values. Since we focus on the asymptotic properties
of least-squares estimation, the use of the 1/(n – 2) scaling, rather than say a 1/n scaling, becomes inconsequential for
a large sample size n. As a summary measure, the estimated residual variance σ̂U² may not be ideal since its units are
in the units of y squared. On the other hand, the estimated residual standard deviation σ̂U is a measure that is in the
units of y and, therefore, more easily interpretable. Property (v) states that σ̂U is consistent, so that it gets arbitrarily
close to σU as the sample size increases.
The properties of estimated residuals are illustrated using the cigarette sales-tax regression example:
Example 17.10 (Cigarette sales and cigarette taxes) Example 17.9 calculated the estimated residuals and fitted values
from least-squares estimation in R, storing them as the variables uhat and yhat. The following R output confirms
that the sample average of estimated residuals is equal to zero (property (i)) and, equivalently, that the sample average
of fitted values is equal to the sample average of the outcome variable:
mean(uhat)
## [1] 0.00000000000000019355
mean(cigdata$cigsales)
## [1] 38.82
mean(yhat)
## [1] 38.82
The following R output confirms that the correlation between the estimated residuals and the explanatory variable
is equal to zero (property (ii)), as is the correlation between the estimated residuals and the fitted values (property
(iii)):
cor(uhat,cigdata$cigtax)
## [1] -0.000000000000000023861
cor(uhat,yhat)
## [1] 0.0000000000000000066505
Finally, the following R code calculates the residual standard deviation estimate:
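The code itself is not reproduced here; a minimal sketch consistent with the two approaches described below (reusing uhat and results) is:
n <- nrow(cigdata)
sqrt(sum(uhat^2)/(n-2))    # direct formula for the residual standard deviation estimate
summary(results)$sigma     # equivalently, from the stored lm results
sd(cigdata$cigsales)       # for comparison: the sample standard deviation of the outcome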
The residual standard deviation estimate is σ̂U ≈ 12.52 packs per-capita. To get a sense of how large this residual
standard deviation is, it can be compared to the standard deviation of the outcome variable, which is sy ≈ 16.65 packs
per-capita. The estimate σ̂U is determined in two ways by the code above. The first way is to use the formulas for σ̂U2
and σ̂U directly, based upon the estimated residual ûi values. The second way is to access the stored results from lm
estimation using the summary(results)$sigma command. The expression summary(results) provides a
nice summary view of the lm estimation results, and some of its components (like $sigma for the residual standard
deviation estimate) can be easily accessed. Here is the output from summary(results) for this example:
summary(results)
##
## Call:
## lm(formula = cigsales ~ cigtax, data = cigdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.92 -8.10 -0.86 5.01 39.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.95 3.24 17.25 < 0.0000000000000002 ***
## cigtax -9.49 1.51 -6.28 0.000000088 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.5 on 49 degrees of freedom
## Multiple R-squared: 0.446,Adjusted R-squared: 0.434
## F-statistic: 39.4 on 1 and 49 DF, p-value: 0.0000000875
This output contains many elements that have not yet been discussed, but the σ̂U ≈ 12.52 estimate can be seen near
the end, with the line beginning Residual standard error.
Working directly from the definition for estimated residuals, ûi = yi – ŷi , the outcome variable yi can be written as
yi = ŷi + ûi .
This equation shows that yi can be decomposed into two parts, the fitted value ŷi = α̂ + β̂xi , which is the part of the
outcome that is directly related to the explanatory variable xi , and the estimated residual ûi , which is the part of
the outcome that is uncorrelated with the explanatory variable xi . This decomposition can be used to determine how
well least-squares estimation, based upon the explanatory variable xi , explains the variation in the yi outcomes. From
property (iii) of Proposition 17.7, we know that the fitted values ŷi and estimated residuals ûi are uncorrelated with
each other (sŷû = 0), meaning the sample variance of y is
s²y = s²ŷ + s²û.
Thus, the variation in the outcome variable yi is also decomposed into two parts: the variation explained by the
explanatory variable xi, given by s²ŷ = (1/(n – 1)) Σᵢ₌₁ⁿ (ŷᵢ – ȳ)², and the variation left unexplained by the explanatory
variable, given by s²û = (1/(n – 1)) Σᵢ₌₁ⁿ ûᵢ². (s²û and σ̂U² are slightly different since the former has a 1/(n – 1) scaling
and the latter has a 1/(n – 2) scaling.) Therefore, the fraction of the variation in the outcome variable that is explained
by the explanatory variable is equal to
s²ŷ / s²y,
which can also be written as
1 – s²û / s²y
since s²y = s²ŷ + s²û implies s²ŷ/s²y = (s²y – s²û)/s²y = 1 – s²û/s²y. This measure of overall regression fit is known as the
R-squared value.
Definition 17.5 The R-squared value associated with least-squares estimation of the SLR model, denoted R², is
R² = s²ŷ/s²y = 1 – s²û/s²y.
For instance, if R2 = 0.24, we say that “the explanatory variable explains 24% of the variation in the outcome
variable.” It is important to note that the sample size n does not have a direct impact on R2 . While having a very
large sample improves the precision of the least-squares estimates, there is no reason to expect that R² increases for
larger n. A large sample still has the same inherent residual noise as a smaller sample. In fact, looking at the expression
for R², we see that R² = 1 – s²û/s²y should get arbitrarily close to 1 – σU²/σY² as n → ∞.
The terminology “R-squared” comes from the fact that R² is equal to the square of the sample correlation between
the outcomes yi and the fitted values ŷi, as stated in the following proposition:
Proposition 17.8. The R-squared value is equal to the square of the correlation between outcomes yi and fitted values
ŷi:
R² = r²yŷ.
For simple linear regression, it is also the case that R² = r²yx.
The result of Proposition 17.8 can be shown as follows:
r²yŷ = s²yŷ / (s²y s²ŷ) = (s²ŷ)² / (s²y s²ŷ) = s²ŷ/s²y = R²,
where the first equality follows from the definition of correlation and the second equality follows from the fact that
syŷ = sŷŷ + sûŷ = s²ŷ + 0 = s²ŷ. Since ŷ is a linear transformation of x, it also follows that r²yŷ = r²yx and, therefore, R² = r²yx.
Since the correlation ryŷ is between –1 and 1 (inclusive), it follows from Proposition 17.8 that
0 ≤ R² ≤ 1.
At one extreme, R² = 0 corresponds to a case where there is no correlation between yi and ŷi, which can only happen
if β̂ = 0. Graphically, this case has a flat estimated regression line, with α̂ = ȳ and β̂ = 0. For R² = 0, the explanatory
variable xi explains none of the variation in the outcome yi. At the other extreme, R² = 1 corresponds to a case where
there is perfect correlation, either positive (ryŷ = 1) or negative (ryŷ = –1). Graphically, the estimated regression line has
β̂ ≠ 0 and passes exactly through all of the sample observation points (xi, yi). For R² = 1, the explanatory variable xi
explains 100% of the variation in the outcome yi.
For values of R² strictly between 0 and 1, there is correlation between yi and ŷi but not perfect correlation. A
regression with a larger magnitude of ryŷ has a larger R². The sign of the correlation ryŷ does not impact R², so for
instance the correlations ryŷ = 0.8 and ryŷ = –0.8 both correspond to R² = 0.8² = 0.64.
Example 17.11 (Cigarette sales and cigarette taxes) Continuing Example 17.7, the decomposition of the outcome
variance s²y = s²ŷ + s²û involves the following sample variances:
s²y = 277.155, s²ŷ = 123.522, and s²û = 153.632.
The R-squared value is
R² = 123.522/277.155 ≈ 0.446, or 44.6%,
which indicates that state-level cigarette taxes explain 44.6% of the variation in state-level cigarette sales. Therefore,
55.4% of the variation in state-level cigarette sales is left unexplained by state-level cigarette taxes. Another way
to determine the R-squared value, which doesn’t require the least-squares estimates at all, is to square the sample
correlation between cigsales and cigtax. This sample correlation is rcigsales,cigtax = –0.6676 for the observed sample,
meaning R2 = (–0.6676)2 ≈ 0.446. The following R code illustrates the various equivalent methods of calculating R2
for this example:
y <- cigdata$cigsales
x <- cigdata$cigtax
var(yhat)/var(y)
## [1] 0.44568
1-var(uhat)/var(y)
## [1] 0.44568
cor(y,yhat)^2
## [1] 0.44568
cor(y,x)^2
## [1] 0.44568
summary(results)$r.squared
## [1] 0.44568
The first two calculations of R2 are based upon Definition 17.5. The second two calculations are based upon
Proposition 17.8. The last expression accesses the R2 value directly from the lm regression results.
Example 17.12 (Monthly stock returns and the overall market) Continuing Example 17.8, we consider the overall fit
of the least-squares estimation using an individual stock return as the outcome y and the market index return (IDX) as
the explanatory variable x. For the six stocks considered in Example 17.8, the following table reports the R-squared
value and the estimated residual standard deviation associated with the least-squares estimates:
Stock                     R²     σ̂U
Home Depot (HD)           0.336  0.0602
Lowe’s (LOW)              0.256  0.0791
Bank of America (BAC)     0.350  0.0850
Wells Fargo (WFC)         0.289  0.0689
Marathon Oil (MRO)        0.251  0.1048
ConocoPhillips (COP)      0.285  0.0692
The R-squared values are fairly similar across the six individual stocks. The lowest R-squared is for Marathon Oil,
indicating that 25.1% of the variation in Marathon Oil’s monthly returns is explained by S&P 500 monthly returns,
and the highest R-squared is for Bank of America, indicating that 35.0% of the variation in Bank of America’s monthly
returns is explained by S&P 500 monthly returns. For these two stocks, since R² is equal to the square of the sample
correlation between x and y, the corresponding sample correlations are rMRO,IDX = √0.251 ≈ 0.501 and rBAC,IDX =
√0.350 ≈ 0.592. (These sample correlations must be positive, rather than negative, because the least-squares slope
estimates reported in Example 17.8 were both positive.)
Here is the R code to produce the table of R2 and σ̂U values above:
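The script itself is not reproduced here; a minimal sketch that loops over the six tickers (column names in sp500 assumed to match Example 17.8) is:
stocks <- c("HD", "LOW", "BAC", "WFC", "MRO", "COP")
for (s in stocks) {
  res <- summary(lm(reformulate("IDX", response = s), data = sp500))
  cat(s, round(res$r.squared, 3), round(res$sigma, 4), "\n")
}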
Definition 17.6 The residuals of an SLR model are homoskedastic if the conditional variance Var(U|X = x) is
constant and does not depend upon x. Equivalently, since σU2 is the unconditional variance of U, the residuals are
homoskedastic if Var(U|X = x) = σU2 for all x. The residuals are said to exhibit homoskedasticity.
Figure 17.7 provides a graphical representation of the data-generating process for a SLR model with homoskedastic
residuals. The figure shows three possible values (x∗ , x∗∗ , x∗∗∗ ) for the random variable X. At each of these three values,
the conditional expectation E(Y|X = x) is the value for which the associated vertical line passes through the E(Y|X =
x) = α + βx line. The realized y outcome depends upon this conditional expectation but also the realized residual. The
realized residual is a draw of U, conditional upon the value of X. For X = x∗ , this conditional distribution is shown as
the rotated pdf curve for U conditional on X = x∗ . This pdf curve is centered at the SLR model line, and its variance
describes the variance of residual draws associated with X = x∗. Similarly, for x∗∗ and x∗∗∗, the rotated pdf curves shown
in the figure represent the distributions of U conditional on X = x∗∗ and X = x∗∗∗, respectively. In Figure 17.7, the shape of
the conditional distribution of U is exactly the same for X = x∗ , X = x∗∗ , and X = x∗∗∗ , meaning the conditional variance
Var(U|X = x) is the same for these three possible values of X. Although the constant conditional residual variance has
Figure 17.7
SLR model with homoskedastic residuals (pdfs of U given X = x∗, x∗∗, and x∗∗∗, centered on the line E(Y|X = x) = α + βx)
only been shown for three values in the figure, the same conditional residual variance would arise for any other value
of X if the residuals are homoskedastic.
Definition 17.7 The residuals of an SLR model are heteroskedastic if the conditional variance Var(U|X = x) is non-
constant and depends upon x. The residuals are said to exhibit heteroskedasticity.
Figure 17.8 shows how the data-generating process differs when the SLR model has heteroskedastic residuals. The
figure shows a situation in which the variance of the residual U depends upon the value of X, with the residual variance
increasing for larger values of X. The distribution of U conditional on X = x∗ has a lower variance, with its pdf being
more tightly distributed around the SLR line. The variance of the distribution of U conditional on X = x∗∗ is larger, as
indicated by the increased dispersion of the pdf curve, and the variance of the distribution of U conditional on X = x∗∗∗
is even larger with a more dispersed pdf.
To see how the cases of homoskedastic errors and heteroskedastic errors affect the realized sample, we artificially
create two different samples based upon the SLR model, one exhibiting homoskedasticity and one exhibiting
heteroskedasticity. Specifically, we assume that the SLR model
Y = 1 + 8X + U
describes the conditional expectation E(Y|X) = 1 + 8X for both samples, but the conditional distribution of U differs for
the two samples, as follows:
• Sample 1: Homoskedastic errors, where the distribution of U given X = x is N(0, 1)
• Sample 2: Heteroskedastic errors, where the distribution of U given X = x is N(0, 4x2 )
Figure 17.9 shows the two samples generated by the SLR model under the two assumptions on the residuals. For each
sample, the sample size is n = 500, and the x values are draws from the U(0, 1) distribution. The y values are generated
Figure 17.8
SLR model with heteroskedastic residuals (pdfs of U given X = x∗, x∗∗, and x∗∗∗, centered on the line E(Y|X = x) = α + βx)
as y = 1 + 8x + u, where u is a draw from the distribution of U given X = x, as specified for the two samples above.
The scatter plot on the top corresponds to Sample 1 with homoskedastic errors, and the scatter plot on the bottom
corresponds to Sample 2 with heteroskedastic errors. A solid line corresponding to the SLR line E(Y|X) = 1 + 8X is
drawn for reference in both plots. For the sample with homoskedastic errors, the vertical spread of data points around
the SLR line stays roughly the same throughout the full range of possible x values. On the other hand, for the sample
with heteroskedastic errors, the vertical spread of data points around the SLR line changes a lot over the range of x
values, with very low conditional variance of the residuals for x values near zero and a steadily increasing conditional
variance of the residuals as x increases.
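To make the two data-generating processes concrete, here is a minimal R sketch of how such samples can be generated (the seed and plotting details are our own illustrative choices, not necessarily those used to produce Figure 17.9):

set.seed(329)                          # assumed seed, for reproducibility
n <- 500
x <- runif(n)                          # x values drawn from U(0,1)
# Sample 1: homoskedastic errors, U given X = x is N(0, 1)
y1 <- 1 + 8*x + rnorm(n, mean = 0, sd = 1)
# Sample 2: heteroskedastic errors, U given X = x is N(0, 4x^2), so sd = 2x
y2 <- 1 + 8*x + rnorm(n, mean = 0, sd = 2*x)
# scatter plots with the SLR line E(Y|X) = 1 + 8X drawn for reference
plot(x, y1, main = "Sample 1 (homoskedastic)"); abline(a = 1, b = 8)
plot(x, y2, main = "Sample 2 (heteroskedastic)"); abline(a = 1, b = 8)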
Practitioners care about the homoskedasticity/heteroskedasticity of residuals for two main reasons. First, the
appropriate way to calculate standard errors associated with the least-squares estimates turns out to depend upon
whether the residuals are homoskedastic or heteroskedastic. This issue is discussed in more detail below. Second, if we
are interested in determining a predictive interval for Y given X = x, this interval clearly depends upon the conditional
variance of U given X = x. For instance, looking at Sample 2 in Figure 17.9, a predictive interval for Y associated with
a low value of x should be much narrower than a predictive interval for Y associated with a high value of x, since
the conditional variance of the residuals is much higher for higher values of x. We return to this idea in Section 18.8,
where we discuss predictive intervals based upon least-squares estimation.
The formulas for the asymptotic variances Vα/n and Vβ/n in the general case of heteroskedasticity are quite complicated and, therefore, omitted from our discussion. That said, statistical packages like R are able to calculate standard errors based upon these formulas, and these standard errors are known as heteroskedasticity-robust standard errors or sometimes, more concisely, as robust standard errors. For the least-squares estimates α̂ and β̂, which are the realizations of the least-squares estimators for the observed sample, these robust standard errors are denoted se(α̂) and se(β̂).
[Figure 17.9: Homoskedasticity versus heteroskedasticity. Top panel: scatter plot of Sample 1 (homoskedastic); bottom panel: scatter plot of Sample 2 (heteroskedastic). Each panel includes the SLR line for reference.]
The asymptotic-variance formulas in Proposition 17.10 highlight the features of the population and sample that
affect the precision of the least-squares estimators. To obtain standard errors for the case of homoskedasticity, sample
descriptive statistics can be plugged in for population statistics, so that
$$\operatorname{se}(\hat{\alpha}) = \sqrt{\frac{\widehat{V}_{\alpha}}{n}} = \sqrt{\frac{s_{\hat{u}}^2}{n}\left(1 + \frac{\bar{x}^2}{s_x^2}\right)} = \frac{s_{\hat{u}}}{\sqrt{n}}\sqrt{1 + \frac{\bar{x}^2}{s_x^2}}$$

and

$$\operatorname{se}(\hat{\beta}) = \sqrt{\frac{\widehat{V}_{\beta}}{n}} = \sqrt{\frac{s_{\hat{u}}^2}{n s_x^2}} = \frac{s_{\hat{u}}}{\sqrt{n}\, s_x}.$$
Let’s focus on the slope estimate β̂ first, as its standard error expression is a bit simpler. There are three factors that
affect the standard error of β̂:
• Sample size: Larger n leads to a smaller standard error se(β̂), so that having more observations is better for precision. The standard error has the usual 1/√n scaling for √n-consistent and asymptotically normal estimators. For example, quadrupling the sample size should lead to a standard error that is roughly half as large.
• Residual noise: A smaller residual variance σU2, as estimated by s2û, leads to a smaller standard error se(β̂). Although not within our control, less noise in the residuals is better for precision.
• Variation of the explanatory variable: A larger variance σX2 of the explanatory variable, as estimated by s2x, leads
to a smaller standard error se(β̂). Since the slope β represents the change in E(Y|X = x) associated with changes
in x, the intuition is that it is better to have observations that exhibit a wide range of x values. In cases where the
distribution of X may be in our control, as might be the case in choosing a survey population (e.g., if age is the
explanatory variable, choosing to interview people between 25 and 45 rather than between 30 and 40), a choice
with higher variance σX2 leads to a lower standard error se(β̂), holding n and σU2 fixed.
Now, looking at the formula for the standard error of the intercept estimate α̂, these three factors affect se(α̂) in the same way. Larger n, smaller residual variance σU2, and larger explanatory variable variance σX2 are all associated with lower se(α̂) or more precise intercept estimates. There is a fourth factor for se(α̂), which is that a larger value of µ2X or x̄2 leads to a larger standard error se(α̂). Recall that α̂ is an estimate of E(Y|X = 0). When the average X value is far from zero (i.e., µX very negative or µX very positive), as indicated by µ2X being large, this relationship says that it becomes more difficult to precisely estimate the intercept α. The best-case scenario, for se(α̂), is when x̄ = 0, in which case the standard-error formula simplifies to se(α̂) = sû/√n, which is similar to the formula for the standard error of a sample mean.
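To make the standard-error formula concrete, the following sketch (with simulated data of our own, and taking s2x to be the 1/n-scaled sample variance) computes se(β̂) directly and checks it against lm's output:

set.seed(1)
n <- 200
x <- runif(n)
y <- 1 + 8*x + rnorm(n)                      # homoskedastic residuals
fit <- lm(y ~ x)
uhat <- resid(fit)
s2_uhat <- sum(uhat^2) / (n - 2)             # residual variance estimate
s2_x <- mean((x - mean(x))^2)                # sample variance of x (1/n scaling)
se_beta <- sqrt(s2_uhat / (n * s2_x))        # se(beta-hat) = s_uhat / (sqrt(n) s_x)
c(manual = se_beta,
  lm = summary(fit)$coefficients["x", "Std. Error"])   # the two agree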
Unfortunately, the built-in R function lm does not calculate heteroskedasticity-robust standard errors and instead reports standard errors based upon the restrictive assumption of homoskedastic residuals. Thankfully, there are several R packages with functions for calculating robust standard errors. We will use a package called estimatr, whose least-squares regression function lm_robust will be illustrated in the examples throughout this chapter and Chapter 18. For now, here is the code that installs and loads the estimatr package:
install.packages("estimatr")
library(estimatr)
Example 17.13 (Cigarette sales and cigarette taxes) Example 17.7 reported the least-squares estimates for the SLR
model using cigarette sales as the outcome variable and cigarette tax as the explanatory variable:
α̂ = 55.95 and β̂ = –9.49.
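The robust standard errors for this model come from estimating it with lm_robust and storing the result in results. A sketch of the call, using the same dataset and variable names as note 59 (the printed output is omitted here):

results <- lm_robust(cigsales ~ cigtax, data = cigdata)
summary(results)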
The commands results and summary(results) provide alternative ways of displaying the lm_robust
estimation results, with the latter providing additional information like the R-squared value. In addition to the
parameter estimates and standard errors, the output includes additional columns for hypothesis testing and confidence
intervals that will be discussed in the next section. In the summary(results) output, the HC2 specified as the
“standard error type” is the method that lm_robust uses to calculate the robust standard errors.
When results is assigned to the lm_robust regression function call, lots of useful information about the
regression is stored in results, including the following:
• results$res_var: residual variance estimate σ̂U2
• results$r.squared: R-squared value
• results$fitted.values: a vector (of length n) with the fitted values
• results$coefficients: a vector with the estimates, in the same order as the output
• results$std.error: a vector with the standard errors, in the same order as the output
Here are some examples of how these quantities can be accessed after the regression above:
# R-squared value
results$r.squared
## [1] 0.4456806
# residual standard deviation estimate
sqrt(results$res_var)
## [1] 12.52068
# slope estimate and its standard error
results$coefficients[2]
## cigtax
## -9.487131
results$std.error[2]
## cigtax
## 1.063502
Example 17.14 (Monthly stock returns and the overall market) Example 17.8 reported the least-squares estimates
for SLR models that related monthly returns of six individual stocks to monthly returns of the S&P 500 index. The
following table augments those estimates with their heteroskedasticity-robust standard errors reported in parentheses:
α̂ (se) β̂ (se)
Home Depot (HD) 0.0087 (0.0032) 1.020 (0.082)
Lowe’s (LOW) 0.0117 (0.0042) 1.107 (0.104)
Bank of America (BAC) 0.0014 (0.0046) 1.489 (0.146)
Wells Fargo (WFC) 0.0054 (0.0039) 1.048 (0.131)
Marathon Oil (MRO) –0.0006 (0.0053) 1.449 (0.203)
ConocoPhillips (COP) 0.0028 (0.0037) 1.042 (0.116)
The results from this table are obtained using the lm_robust function for each of the six SLR models. For example,
the first row corresponds to the SLR model with HD as the outcome variable and IDX as the explanatory variable,
with results given by the following code:
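A sketch of that call, using the same sp500 variable names that appear in the examples of Chapter 18 (output omitted):

results <- lm_robust(HD ~ IDX, data = sp500)
summary(results)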
Based upon the asymptotic normality of the least-squares estimators, confidence intervals for α and β are

$$(\hat{\alpha} - z_{\alpha/2}\operatorname{se}(\hat{\alpha}),\ \hat{\alpha} + z_{\alpha/2}\operatorname{se}(\hat{\alpha}))$$

and

$$(\hat{\beta} - z_{\alpha/2}\operatorname{se}(\hat{\beta}),\ \hat{\beta} + z_{\alpha/2}\operatorname{se}(\hat{\beta})),$$

respectively. For example, the 95% confidence interval for β is (β̂ – 1.96 se(β̂), β̂ + 1.96 se(β̂)), and the 90% confidence interval for β is (β̂ – 1.645 se(β̂), β̂ + 1.645 se(β̂)).
Example 17.15 (Cigarette sales and cigarette taxes) Using the standard errors of the least-squares estimates from
Example 17.13, we can construct confidence intervals for α and β. The 95% confidence interval for α is
(α̂ – z0.025 se(α̂), α̂ + z0.025 se(α̂)) = (55.95 – (1.96)(2.924), 55.95 + (1.96)(2.924)) ≈ (50.07, 61.82).
It can be said with 95% confidence that the intercept parameter α, or equivalently the conditional expectation
E[CIGSALES|CIGTAX = 0], is between 50.07 and 61.82 packs per capita. The 95% confidence interval for β is
(β̂ – z0.025 se(β̂), β̂ + z0.025 se(β̂)) = (–9.49 – (1.96)(1.064), –9.49 + (1.96)(1.064)) ≈ (–11.62, –7.35).
It can be said with 95% confidence that the slope parameter β, or equivalently the change in expected cigarette sales
associated with a one-dollar increase in cigarette taxes, is between –11.62 and –7.35 packs per capita.
The lm_robust function automatically calculates these confidence intervals in R. The default is 95% confidence
intervals, but the optional argument alpha can be set at other values (different from the default alpha=0.05) to get
other confidence intervals. Here is the code for calculating 95% confidence intervals and 90% confidence intervals:
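A sketch of these calls for the cigarette regression (the alpha argument of lm_robust controls the confidence level):

# 95% confidence intervals (the default, alpha = 0.05)
lm_robust(cigsales ~ cigtax, data = cigdata)
# 90% confidence intervals
lm_robust(cigsales ~ cigtax, data = cigdata, alpha = 0.10)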
Moving to hypothesis testing, suppose we want to test whether the slope parameter β is equal to some constant c,
so that the null hypothesis is
H0 : β = c.
The z-statistic for testing this null hypothesis is

$$\text{z-statistic} = \frac{\hat{\beta} - c}{\operatorname{se}(\hat{\beta})}$$
and indicates the number of standard errors that β̂ is above c (positive z-statistic) or the number of standard errors
that β̂ is below c (negative z-statistic). Applying the z-test approach from Section 16.2, the rejection rule for testing
H0 : β = c at the α level is:
• Reject H0 : β = c at the α level if |z-statistic| = |β̂ – c|/se(β̂) ≥ zα/2.
• Do not reject H0 : β = c at the α level if |z-statistic| = |β̂ – c|/se(β̂) < zα/2.
As with other asymptotically normal estimators, we can calculate a p-value based upon the z-statistic, and this p-value
can be used to test the null hypothesis at any level α. For the null hypothesis H0 : β = c, the p-value is

$$\text{p-value} = P(|Z| > |\text{z-statistic}|) = P\!\left(|Z| > \left|\frac{\hat{\beta} - c}{\operatorname{se}(\hat{\beta})}\right|\right), \text{ where } Z \sim N(0, 1).$$
Returning to the summary(results) output for the cigarette regression of Example 17.13, the first and second columns provide the estimates (α̂ in the first row, β̂ in the second row) and the standard errors of the estimates (se(α̂) in the first row, se(β̂) in the second row). The third column is the z-statistic for testing H0 : α = 0 (first row) and for testing H0 : β = 0 (second row). For the case of c = 0, the test of H0 : α = 0 has a z-statistic equal to α̂/se(α̂), and the test of H0 : β = 0 has a z-statistic equal to β̂/se(β̂). Therefore, the z-statistic values in the third column
are equal to the values in the first column (estimates) divided by the values in the second column (standard errors).
Finally, the fourth column reports the p-value associated with testing the parameter against zero, which is the p-value
for H0 : α = 0 in the first row and the p-value for H0 : β = 0 in the second row.
Seeing such a low p-value for the α parameter is to be expected since it would be surprising if we could not reject that expected cigarette sales are equal to zero when there is no cigarette tax. In fact, from its z-statistic, the estimate
α̂ is 19.14 standard errors above zero! The p-value associated with H0 : β = 0 is more interesting. This p-value is again
very small (zero to many decimal places), meaning the null hypothesis H0 : β = 0 is rejected at any level. In other
words, the slope estimate is statistically significant at any level, meaning there is strong statistical support for the
existence of a negative relationship between state-level cigarette sales and state-level cigarette taxes. The magnitude
of this relationship can be provided by using a confidence interval, like the 95% confidence interval for β provided in
Example 17.15.
We’ve already seen that results contains information about the regression, and several quantities associated with
the confidence intervals and testing are accessible after results is assigned to the lm_robust function call above:
• results$statistic: a vector with the z-statistics of the estimates, in the same order as the output
• results$p.value: a vector with the p-values for the two-sided test against 0, in the same order as the output
• results$conf.low and results$conf.high: vectors with the lower and upper endpoints of the confidence intervals, respectively
Returning to the Home Depot regression of Example 17.14, a test of the null hypothesis H0 : α = 0 is interesting here since it is a test of whether the expected Home Depot
monthly return is equal to zero when the S&P 500 monthly return is equal to zero. Based upon the p-value of 0.0069,
this null hypothesis is rejected at any level above 0.69%. The null hypothesis H0 : β = 0 is less interesting here since it
would be surprising to find no relationship (i.e., no correlation) between Home Depot returns and S&P 500 returns.
Indeed, the p-value is equal to zero (to many decimal places) since the estimated slope is 12.55 standard errors above
zero. A more interesting null hypothesis to test is H0 : β = 1. When β = 1, a 0.01 change in the S&P 500 monthly return is associated with an expected change of 0.01 in the Home Depot monthly return. The z-statistic associated with H0 : β = 1 is

$$\text{z-statistic} = \frac{\hat{\beta} - 1}{\operatorname{se}(\hat{\beta})} = \frac{1.020 - 1}{0.0817} \approx 0.245.$$
The associated p-value is 2(1 – Φ(0.245)) ≈ 0.806. Therefore, using either the z-statistic rejection rule or the p-value
rejection rule, the null hypothesis H0 : β = 1 would not be rejected at a 5% level. Based upon the p-value, H0 : β = 1
would not be rejected at any reasonable level.
Example 17.18 (Earnings and union status) Example 17.6 provided the least-squares estimates for the SLR model
with weekly earnings as the outcome variable and union membership as the explanatory variable. The more complete
results are reported below:
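A sketch of the estimation call behind these results (the data-frame name cpsemployed is illustrative, following the convention suggested in Exercise 5; output omitted):

results <- lm_robust(earnwk ~ union, data = cpsemployed)
summary(results)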
The test of H0 : α = 0 is not interesting here (why?), so we focus on the test of H0 : β = 0. Recall that
β = E(EARNWK|UNION = 1) – E(EARNWK|UNION = 0),
meaning the null hypothesis H0 : β = 0 is true if there is no difference between the expected weekly earnings of union
workers and the expected weekly earnings of non-union workers. The p-value of zero (to many decimal places)
indicates that H0 : β = 0 is rejected at any level, providing evidence that the association of weekly earnings with union
status is statistically significant. The 95% confidence interval for β is
(β̂ – z0.025 se(β̂), β̂ + z0.025 se(β̂)) = (251.2 – (1.96)(45.83), 251.2 + (1.96)(45.83)) ≈ (161.3, 341.0),
meaning we can say with 95% confidence that the true difference between the expected weekly earnings of union
workers and the expected weekly earnings of non-union workers is between $161.30 and $341.00.
Example 17.19 (A/B testing) Example 17.18 showed that the SLR model and least-squares estimation can be used to
directly test whether the expected value of an outcome variable Y differs over two subpopulations, where the binary
variable X indicates which of the two subpopulations an observation is in. In Example 17.18, the random variable
Y corresponds to weekly earnings and the random variable X to union membership (X = 1 for union member, X = 0
for non-member). The case of an A/B test, where the outcome of interest is a continuous outcome Y, also fits in this
framework. If there are two possible treatments (A or B) and X is a binary variable indicating the treatment (say, X = 1
for treatment B, X = 0 for treatment A), then
α = E(Y|X = 0) = expected value of Y for treatment A
and
α + β = E(Y|X = 1) = expected value of Y for treatment B.
This setup is a generalization of the advertising “experiment” considered in Example 17.1, where the Y random
variable was SALES and the X random variable was AD (equal to 1 for cities receiving targeted advertising and 0 for
cities not receiving targeted advertising). To test whether the expected value of Y differs for the two treatments, the
null hypothesis of interest is H0 : β = 0. The z-test involves using either the z-statistic β̂/se(β̂) or its corresponding p-value. A confidence interval for the difference between the expected outcome Y for treatment B and the expected outcome Y for treatment A is just the standard confidence interval for β.
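As an illustration, here is a minimal A/B-test sketch with simulated data (all names and numbers are our own illustrative choices):

library(estimatr)
set.seed(42)
n <- 1000
x <- rbinom(n, 1, 0.5)                 # treatment indicator: 1 = treatment B, 0 = treatment A
y <- 50 + 3*x + rnorm(n, sd = 10)      # true treatment effect is beta = 3
summary(lm_robust(y ~ x))              # z-test of H0: beta = 0 and a 95% CI for beta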
First, let’s consider the relationship between the random variables Y and X without specifying an exogeneity
assumption. To do so, the following proposition provides an important result concerning the decomposition of Y
into two parts, one which is a linear function of X and the other which is a random variable uncorrelated with X:
Proposition 17.11. If X and Y are random variables with σX2 > 0, Y can be decomposed into a linear function of X and
another random variable V such that
Y = α∗ + β ∗ X + V with Cov(X, V) = 0 and E(V) = 0.
For this decomposition, the parameter β∗ is

$$\beta^* = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)} = \frac{\sigma_{XY}}{\sigma_X^2},$$

and the parameter α∗ is

$$\alpha^* = \mu_Y - \beta^*\mu_X = \mu_Y - \frac{\sigma_{XY}}{\sigma_X^2}\,\mu_X.$$
While the equation in Proposition 17.11 looks very similar to the specification of the SLR model, it is important
to stress that the relationship described by the proposition is not a model, in the sense that no assumptions have been
made to state the relationship between Y and X. The observant reader will notice that β∗ is the population analogue of the least-squares slope estimator β̂XY = sxy/s2x, and likewise α∗ is the population analogue of the least-squares intercept estimator α̂XY.
causal parameter β. Therefore, even if a positive slope estimate β̂ is obtained, it’s still possible that there is no causal
effect of education on expected earnings (β = 0) since β̂ is providing an over-estimate.
Notes
53 It is not restrictive to assume that E(U|X) is equal to zero, rather than some other constant, since the parameter α could always be changed to yield E(U|X) = 0. For instance, if we had Y = α′ + βX + U′ with E(U′|X) = c for some constant c, the model could be re-written as Y = α + βX + U with α = α′ + c and U = U′ – c.
54 In fact, linearity is not really an assumption at all for the case of a binary explanatory variable since a line can always be drawn exactly through
the two points E(Y|X = 0), at X = 0, and E(Y|X = 1), at X = 1. Once a third value is possible for the explanatory variable, the linearity assumption
becomes important since it requires that all three conditional expectations lie along a line.
55 A slightly more complicated version of this model, which incorporates the “risk-free rate,” is the capital asset pricing model (CAPM). In the
CAPM model, the outcome variable is the stock return minus the risk-free rate (e.g., the interest rate on a Treasury bond), and the explanatory
variable is the market index return minus the risk-free rate.
56 We are being a little loose with terminology here. Consistency is a property of an estimator, so a “consistent estimate” refers to the realization
of a consistent estimator.
57 Take partial derivatives of S(a, b) with respect to a and b and set them both equal to zero:

$$\frac{\partial S(a, b)}{\partial a} = -2\sum_{i=1}^{n}(y_i - a - b x_i) = 0$$

and

$$\frac{\partial S(a, b)}{\partial b} = -2\sum_{i=1}^{n}(y_i - a - b x_i)x_i = 0.$$

The first equation implies nȳ – na – bnx̄ = 0, or a = ȳ – bx̄. The second equation implies $\sum_{i=1}^{n}(y_i - a - b x_i)x_i = 0$, and plugging in a = ȳ – bx̄ yields

$$\sum_{i=1}^{n}(y_i - \bar{y} - b(x_i - \bar{x}))x_i = 0$$

or, equivalently,

$$\sum_{i=1}^{n}(y_i - \bar{y} - b(x_i - \bar{x}))(x_i - \bar{x}) = 0.$$

Solving this last equation for b yields

$$b = \frac{\sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{s_{xy}}{s_x^2} = \hat{\beta}.$$

Finally, plugging β̂ for b in a = ȳ – bx̄ yields a = ȳ – β̂x̄ = α̂.
58 Although not explicitly stated in Proposition 17.9, the additional technical assumption that X and U have finite variances is required to apply CLT results and prove the result.
59 The interested reader can confirm that the non-robust standard errors obtained using lm are se(α̂) = 3.24 and se(β̂) = 1.51. To do so, store
the results with the command results <- lm(cigsales~cigtax, data=cigdata) and view a summary of the results, including the
standard errors, with summary(results).
Exercises
1. Consider a sample {(xi , yi )}ni=1 , where x and y are standardized variables with sample correlation rxy = 0.7.
(a) For the SLR model Y = α + βX + U, what are the least-squares estimates α̂ and β̂?
(b) Under the exogeneity assumption E(U|X) = 0, what is the interpretation of the least-squares slope estimate β̂?
2. A manufacturing company has 100 factories spread across the United States. It has purchased a new technology
that it can deploy at 30 of its factories. Assume that the company assigns the technology randomly to 30 factories. In
the subsequent year, the company collects data on total production (prodi , in thousands of units) at each factory, with
techi = 1 if the factory has the new technology and techi = 0 otherwise. Let PROD and TECH be the associated random
variables, and assume the following SLR model holds:
PROD = 10 + 3TECH + U with E(U|TECH) = 0,
with an additional assumption that U|TECH ∼ N(0, 4).
(a) Explain why the exogeneity assumption E(U|TECH) = 0 is likely to hold.
(b) What is the conditional distribution of PROD given TECH = 0? What is the conditional distribution of PROD
given TECH = 1?
(c) Determine P(PROD > 12|TECH = 1) – P(PROD > 12|TECH = 0).
(d) If factory A has the new technology and factory B does not, what is the distribution of the difference between
factory A production and factory B production?
3. Use the hrs dataset for this question. The data consist of 6,052 non-married individuals who are 50 and older.
Consider a SLR model, where the outcome variable is annual out-of-pocket medical costs (outofpocket_costs, in
dollars) during 2000, and the explanatory variable is age (age, in years).
(a) Use lm to estimate the SLR model, and store the results in hrs_results.
(b) Interpret the slope estimate β̂.
(c) What is the estimated difference in expected out-of-pocket medical costs between a 70-year-old and a 60-year-
old?
(d) What percentage of the residuals are positive? negative?
(e) Which five observations are associated with the largest magnitudes of the estimated residuals?
(f) What percentage of fitted values are negative? Is this problematic, since outofpocket_costs ≥ 0?
(g) Draw a scatter plot of outofpocket_costs versus age.
(h) Add the least-squares regression line to the plot in (g) with the command abline(hrs_results,
col="blue"). Are the residuals left-skewed, right-skewed, or approximately symmetric?
(i) Estimate the SLR model separately for men (male = 1) and women (male = 0). How do the slope estimates from
the two regressions compare?
4. Use the baseball dataset for this question. The data consist of 30 Major League Baseball teams for the 2022
regular season. Consider a SLR model, where the outcome variable is a team’s average attendance at its home games
(attend_home) and the explanatory variable is the team’s winning percentage for the season (winpct_22). A team
winning half its games has winpct_22 = 0.5, and a team winning 55% of its games has winpct_22 = 0.55.
(a) Use lm to estimate the SLR model, and store the results in mlb_results.
(b) Draw a scatter plot of attend_home versus winpct_22. Add the least-squares regression line to the plot with the
command abline(mlb_results, col="blue").
(c) What does the slope estimate β̂ say about the difference between a team with 55% winning percentage and
50% winning percentage?
(d) How much of the variation in attend_home is left unexplained by the least-squares regression?
(e) What is the estimated standard deviation of the SLR model residual?
(f) Re-run the regression using the previous year’s winning percentage (winpct_21) as the explanatory variable.
Focusing on the slope estimate and the R-squared value, how do the results compare to the original regression?
(g) Define a new outcome variable pctattend equal to attend_home divided by capacity (the size of the team’s
stadium). Estimate the SLR model with pctattend as the outcome variable and winpct_22 as the explanatory
variable. What does the slope estimate β̂ say about the difference between a team with 55% winning percentage
and 50% winning percentage?
(h) To determine the association between team payroll (payroll, in millions of dollars) and team performance
(winpct_22), run a regression with winpct_22 as the outcome variable and payroll as the explanatory variable.
What is the estimated difference in expected winning percentage between a team with a payroll of $200 million
and a team with a payroll of $150 million?
5. Use the cps dataset for this question, and focus on the sample of 2,809 employed workers. You’ll find it easiest to
create a new data frame cpsemployed for the employed workers.
(a) Use lm_robust to estimate the SLR model from Example 17.3 with earnwk as the outcome variable and
educ as the explanatory variable.
(b) Interpret the slope estimate β̂.
(c) Provide a 95% asymptotic confidence interval for the SLR slope parameter.
(d) Interpret the R-squared value.
(e) Use plot to plot the estimated residuals versus educ, and add a horizontal line at zero.
(f) From the plot in (e), do you think the residuals are homoskedastic or heteroskedastic? Explain.
(g) From the plot in (e), do you think the conditional expectation of the residuals is zero for all values of educ?
Explain.
6. Use the exams dataset for this question. You are interested in modeling exam2 performance based upon exam1
performance.
(a) Use lm_robust to estimate the SLR model with exam2 as the outcome variable and exam1 as the explanatory
variable.
(b) Interpret the slope estimate β̂.
(c) Perform a z-test of H0 : β = 0 at the 5% level. What do you conclude?
(d) Perform a z-test of H0 : β = 1 at the 5% level. What do you conclude? What is the p-value for this test?
(e) Provide an estimate of the conditional expectation E(exam2|exam1 = 80).
(f) Draw two histograms, one of the actual exam2 values and one of the fitted values from the regression. Which
histogram has lower dispersion, and why?
(g) Standardize both variables (exam1 and exam2) and re-run the regression with lm_robust.
i. Interpret the slope estimate of the new regression.
ii. How does the R-squared value compare to the R-squared value of the original regression?
iii. How does the z-statistic for testing H0 : β = 0 compare to the original regression?
7. Consider a sample {(xi , yi )}ni=1 , where x̄ = 0 and ȳ = 0. For the SLR model Y = α + βX + U, show that the least-squares estimates are

$$\hat{\alpha} = 0 \quad \text{and} \quad \hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}.$$
8. A dataset has IQ scores (iq) for 930 individuals, between the ages of 20 and 24, along with the IQ scores of their mothers (momiq) and fathers (dadiq). The average parental IQ is avgiq = (momiq + dadiq)/2. With iq as the outcome variable, the following table summarizes the results from regressions for three different SLR models: (i) momiq as the explanatory variable, (ii) dadiq as the explanatory variable, and (iii) avgiq as the explanatory variable.
SLR: iq on momiq SLR: iq on dadiq SLR: iq on avgiq
α̂ (se) 69.585 (3.394) 77.156 (3.426) 56.487 (4.331)
β̂ (se) 0.299 (0.034) 0.231 (0.035) 0.434 (0.044)
R2 0.0925 0.0522 0.1161
Each of the IQ variables has a sample mean close to 100 and a sample standard deviation close to 15.
(a) Which of the three explanatory variables explains the most variation in iq? What is the sample correlation
between this variable and iq?
(b) Interpret the slope estimate from the regression of iq on momiq.
(c) Provide a 95% asymptotic confidence interval for the slope in the SLR model with momiq as the explanatory
variable.
(d) Provide an estimate of the expected difference in IQ between an individual whose mother’s IQ is 110 and an
individual whose mother’s IQ is 105. What is the standard error of this estimate?
(e) Thinking about an observation with momiq = 120, what is the fitted value from the regression of iq on momiq?
Thinking about an observation with dadiq = 120, what is the fitted value from the regression of iq on dadiq?
Thinking about an observation with avgiq = 120, what is the fitted value from the regression of iq on avgiq?
Explain why one of these fitted values is markedly different from the others.
(f) Focus on the SLR with avgiq as the explanatory variable. Copy and execute the R code below, which draws
the (solid) estimated regression line, a dashed 45-degree line (iq = avgiq), and a dotted horizontal line at the
outcome mean (iq = 100). The graph shows a phenomenon known as regression to the mean. For avgiq values
between 110 and 130, how do the fitted values of iq compare to avgiq and the mean of iq? For avgiq values
between 70 and 90, how do the fitted values of iq compare to avgiq and the mean of iq?
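(The code block below is a sketch reconstructed from the description in part (f), drawing the lines from the reported least-squares estimates; the book's actual code may differ.)

curve(56.487 + 0.434*x, from = 55, to = 145,
      xlab = "avgiq", ylab = "iq")        # solid: estimated regression line
abline(a = 0, b = 1, lty = "dashed")      # dashed: 45-degree line (iq = avgiq)
abline(h = 100, lty = "dotted")           # dotted: horizontal line at the outcome mean (iq = 100)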
(c) To assess the causal effect of an informational campaign about a new tax-credit policy, the Internal Revenue
Service mails a postcard explaining the policy to a random subset of taxpayers. The new policy increases the
tax refund for an individual who claims it. The outcome variable is the total refund amount for an individual,
and the explanatory variable is an indicator of whether they received the postcard.
(d) To assess the causal effect of policing on crime, a researcher collects data for hundreds of cities with the
outcome variable being crimes-per-capita and the explanatory variable being police-officers-per-capita.
(e) To assess the causal effect of weather on crime, a researcher collects data for hundreds of days for a single city
with the outcome variable being daily crimes and the explanatory variable being daily rainfall total.
12. *For this question, you will conduct Monte Carlo simulations to illustrate the performance of the slope z-test in a
SLR model Y = α + βX + U. For all simulations, the marginal distribution of Y is N(0, 1) and the marginal distribution
of X is N(0, 1).
(a) Consider the case where Y and X are independent, so that α = β = 0. Conduct 10,000 Monte Carlo simulations, where for each simulation you: (i) create an i.i.d. sample {(yi , xi )} of size n = 100, with yi drawn from N(0, 1) and xi drawn from N(0, 1), (ii) estimate the least-squares regression, and (iii) test H0 : β = 0 at the 1% level, the 5% level, and the 10% level (and keep track of the results). Over the 10,000 simulations, what is the percentage of times that the test is rejected at the 1% level, the 5% level, and the 10% level? Are your findings as expected?
(b) Consider the case where Y = 0.2X + U. Follow the process in (a), except with the following change in the sample creation step (i): Draw xi from N(0, 1) and ui from N(0, 1), and generate yi = 0.2xi + √0.96 ui. (This sampling ensures that yi is drawn from a N(0, 1) marginal distribution.) Over the 10,000 simulations, what is the percentage of times that the test is rejected at the 1% level, the 5% level, and the 10% level?
(c) Same as (b), except Y = 0.5X + U and generate each yi as yi = 0.5xi + √0.75 ui. Over the 10,000 simulations, what is the percentage of times that the test is rejected at the 1% level, the 5% level, and the 10% level?
(d) Explain why the rejection rates change for (a)-(c).
(e) Without actually doing the simulations, how do you think the results would change in (a)-(c) if you tested
H0 : α = 0 instead of H0 : β = 0?
13. Use the sp500 dataset for this question. We have considered the returns of two bank stocks, Bank of America (BAC)
and Wells Fargo (WFC). Let’s add a third bank stock, M&T Bank Corporation (MTB). You are interested in predicting
WFC returns but with only one explanatory variable, so that the two possible SLR models are
WFC = α1 + β1 BAC + U1
and
WFC = α2 + β2 MTB + U2 .
(a) Use lm_robust to run the two least-squares regressions. How do β̂1 and β̂2 compare to each other? How do
the two R-squared values compare?
(b) *It’s difficult to test H0 : β1 = β2 or form a confidence interval for β1 – β2 since (i) the estimates β̂1 and β̂2 are
not independent and (ii) the regression output from (a) doesn’t provide the covariance of the two estimates.
As an alternative, use the bootstrap to estimate the standard error of β̂1 – β̂2 . Use the bootstrap with 5,000
iterations. During the b’th iteration, for b ∈ {1, 2, …, 5000}, run both regressions (on the same bootstrap
sample) and calculate β̂1b – β̂2b . Then, calculate the bootstrap standard error. What is the normal-based bootstrap
95% confidence interval for β1 – β2 ? Do you reject H0 : β1 = β2 at a 5% level?
(c) *Use the bootstrap, as in (b), to form a normal-based bootstrap 95% confidence interval for ρ2WFC,BAC –
ρ2WFC,MTB , which is the limit to which the difference in R-squared values converges for a large sample.
Rather than calculating β̂1b – β̂2b in each iteration, calculate the difference in R-squared values between the
first regression (WFC on BAC) and the second regression (WFC on MTB).
(d) *Use the bootstrap, as in (b), to form a normal-based bootstrap 95% confidence interval for σU1 – σU2 , the
difference between the standard deviations of the two regression residuals.
Chapter 17 introduced the simple linear regression (SLR) model to model the relationship between an outcome
variable Y and a single explanatory variable X. The SLR model is often too simplistic since there may be multiple
explanatory variables that should be included in a model describing the outcome variable. The regression approach
is easily generalizable to allow for more explanatory variables. This chapter introduces the multiple linear regression (MLR) model, which generalizes the SLR model of the previous chapter. While many of the ideas and results for the SLR model carry over to the multiple regression model, there are important new issues that arise with additional explanatory variables.
• Exogeneity assumption: The exogeneity assumption E(U|X) = 0 implies that there is no correlation between the
unobservable U and any of the explanatory variables:
Cov(Xk , U) = 0 for any k ∈ {1, 2, …, K}.
Knowing the value of one or more of the explanatory variables tells us nothing about the expected value of U, as its
conditional expectation is zero for all possible values. Importantly, the exogeneity assumption says nothing about
the relationship between the explanatory variables. Any two explanatory variables Xk and Xℓ may be correlated with each other and, in most cases, would be expected to be.
In one sense, the MLR exogeneity assumption seems stronger than the SLR exogeneity assumption since it requires
that U is uncorrelated with a larger set of explanatory variables. In another sense, however, the inclusion of additional
explanatory variables in the MLR model may make the exogeneity assumption more plausible than it was without
those variables. For example, in Example 17.21, it was argued that the exogeneity assumption was not likely to hold
for the SLR model of state-level cigarette sales and taxes, as the unobservable U is likely to be negatively related
with cigarette taxes. If we also had a state-level variable that measured the pro-tax sentiment of residents (e.g., the
percentage of residents who say yes to “Do you favor increased cigarette taxes?”), the inclusion of this additional
explanatory variable as a second variable in a MLR model could make the Cov(CIGTAX, U) = 0 assumption more
plausible.
For the MLR model with the exogeneity assumption, the conditional expectation of Y given the explanatory
variables is
E(Y|X) = α + β1 X1 + β2 X2 + · · · + βK XK
since E(U|X) = 0. The interpretation of the parameters (α, β1 , β2 , …, βK ) of the MLR model follows from this
conditional expectation:
• Meaning of the intercept α: Note that
α = E(Y|X1 = X2 = · · · = XK = 0),
which is the conditional expectation of Y when all of the explanatory variables are equal to zero. Whether α has a
practical interpretation depends upon the specific MLR model being considered and, specifically, whether zero is a
relevant value for each of the explanatory variables.
• Meaning of the slope parameters: To discuss the meaning of the slope parameters (β1, β2, …, βK), consider some arbitrary “starting values” for the explanatory variables, say (X1, X2, …, XK) = (x1∗, x2∗, …, xK∗). Starting with the first explanatory variable X1, what happens to the conditional expectation of Y if X1 is increased by one unit while the values of all the other explanatory variables are held fixed? The change in the conditional expectation is equal to

E(Y|X1 = x1∗ + 1, X2 = x2∗, …, XK = xK∗) – E(Y|X1 = x1∗, X2 = x2∗, …, XK = xK∗) = β1,

so the slope β1 is the change in the expected outcome associated with a one-unit increase in X1, holding the other explanatory variables fixed.
In fact, it is always possible to consider the more general case when the values of the explanatory variables change from (X1, X2, …, XK) = (x1∗, x2∗, …, xK∗) to (X1, X2, …, XK) = (x1∗∗, x2∗∗, …, xK∗∗), in which case the change in the expected outcome Y is

β1(x1∗∗ – x1∗) + β2(x2∗∗ – x2∗) + · · · + βK(xK∗∗ – xK∗).
This expression can be used to make statements about changes in the expected outcome Y when multiple explanatory
variables change values, which includes cases where an explanatory variable may be a direct function of one or more
other explanatory variables.
Example 18.1 (Monthly stock returns) Example 17.5 considered an SLR model with an individual stock return as the
outcome variable and a market-index return as the explanatory variable. This SLR model captures the association
between the performance of the stock of an individual company and the overall performance of the stock market.
Previously, in Chapter 7 (see, for example, Example 7.13), we measured the correlation between the returns of several
individual stocks. By using a MLR model, we can model an individual stock return with both a market-index return
and the return of a related stock (or the returns of several related stocks). As an example, consider a MLR model with
monthly returns of Home Depot (HD) as the outcome variable and monthly returns of the S&P 500 index (IDX) and
monthly returns of Lowe’s (LOW) as explanatory variables:
HD = α + β1 IDX + β2 LOW + U with E(U|IDX, LOW) = 0
or, equivalently,
E(HD|IDX, LOW) = α + β1 IDX + β2 LOW.
The intercept parameter α = E(HD|IDX = 0, LOW = 0) is the expected monthly return for Home Depot when the
monthly returns for both the S&P 500 index and Lowe’s are equal to zero. The slope parameter β1 measures how
much the conditional expectation of Home Depot returns changes when the S&P 500 return changes, holding the
Lowe’s return fixed. For example, if the S&P return changes by 0.01 (one percentage point) and the Lowe’s return
is unchanged, the conditional expectation of Home Depot returns changes by 0.01β1 . This interpretation is different
from the interpretation of the slope parameter in a SLR model with only the S&P 500 as an explanatory variable. Such
a model would be HD = α + βIDX + U, where β measures the association between Home Depot returns and S&P 500
returns but does not control for (or hold fixed) the returns for Lowe’s. Similarly, the slope parameter β2 measures how
much the conditional expectation of Home Depot returns changes when the Lowe’s return changes, holding the S&P
500 return fixed. If the Lowe’s return changes by 0.01 (one percentage point) and the S&P return is unchanged, the
conditional expectation of Home Depot returns changes by 0.01β2 . The interpretation of this parameter is different
from an SLR model with only the Lowe’s return as an explanatory variable, where the slope parameter measures the
association between Home Depot and Lowe’s without controlling for the S&P return.
To guarantee that (α̂, β̂1 , β̂2 , …, β̂K ) is the unique solution to this minimization problem, the following two assumptions
are required:
• Assumption MLR-VarX: Each explanatory variable xk has a positive sample variance. That is, s2xk > 0 for each
k ∈ {1, 2, …, K}.
• Assumption MLR-NPC: For each explanatory variable xk, it is not possible to write xk as a linear combination of the other explanatory variables.
Assumption MLR-VarX is a very weak assumption, as it just requires that each explanatory variable is non-constant.
This assumption generalizes the assumption made in Chapter 17 for the SLR model (s2x > 0 for the single explanatory
variable x). Assumption MLR-NPC requires that there is no perfect collinearity among the explanatory variables, which means that it is not possible to express one explanatory variable perfectly as a linear combination of one or more other explanatory variables. For instance, if one explanatory variable is height in inches, having another explanatory variable heightft, which is height in feet, would violate the assumption of no perfect collinearity since height = 12 × heightft. As another example, if the indicator variable union is an explanatory variable equal to 1 for union members and 0 for non-members, having another explanatory variable nonunion (equal to 1 for non-members and 0 for union members) would violate the assumption of no perfect collinearity since nonunion = 1 – union. This assumption
is also not very restrictive, as it is generally only violated when a practitioner makes a mistake in specifying the set of explanatory variables, as would be the case, for example, when including both a union indicator variable and a nonunion indicator variable in the MLR model. The following proposition states that these two assumptions are sufficient to ensure that the least-squares estimates uniquely minimize the S(a, b1, b2, …, bK) function:61
Proposition 18.1. If Assumption MLR-VarX and Assumption MLR-NPC hold, the minimization of

$$S(a, b_1, b_2, \ldots, b_K) = \sum_{i=1}^{n}(y_i - a - b_1 x_{i1} - b_2 x_{i2} - \cdots - b_K x_{iK})^2$$

has a unique solution.
Example 18.2 (Monthly stock returns) Using the sp500 dataset, the parameters (α, β1 , β2 ) of the MLR model from
Example 18.1,
HD = α + β1 IDX + β2 LOW + U with E(U|IDX, LOW) = 0,
can be estimated. The first column of the following table shows the least-squares estimates (α̂, β̂1 , β̂2 ), obtained by
minimizing the function S(a, b1 , b2 ). For comparison purposes, the second column shows the least-squares estimates
for a SLR model having only IDX (S&P 500 monthly returns) as an explanatory variable, and the third column shows
the least-squares estimates for a SLR model having only LOW (Lowe’s monthly returns) as an explanatory variable.
                       MLR estimates   SLR estimates (IDX only)   SLR estimates (LOW only)
α (intercept)              0.004               0.009                      0.006
β1 (slope on IDX)          0.595               1.020                        —
β2 (slope on LOW)          0.384                 —                        0.522
The MLR estimates can be obtained from the lm or lm_robust function in R. Since standard errors are not being
calculated yet, the lm function is sufficient:
lm(HD~IDX+LOW, data=sp500)
##
## Call:
## lm(formula = HD ~ IDX + LOW, data = sp500)
##
## Coefficients:
## (Intercept) IDX LOW
## 0.00425 0.59491 0.38425
The syntax to include more than one explanatory variable is to use a plus sign (+) in between variable names, so
that IDX+LOW indicates there are two explanatory variables, IDX and LOW, in the model.
For the MLR estimates, the intercept estimate α̂ = 0.004 (or 0.4%) is an estimate of E(HD|IDX = LOW = 0), the
expected monthly return of Home Depot when both the S&P 500 and Lowe’s have zero monthly return. The slope
estimate β̂1 = 0.595 estimates the expected change in Home Depot returns given a one-unit change in S&P 500 returns,
holding the returns of Lowe’s fixed. Therefore, for a 0.01 increase in S&P 500 returns (IDX) with Lowe’s returns
(LOW) held fixed, we estimate that the Home Depot returns will, on average, increase by 0.00595. Similarly, for a
0.01 increase in Lowe’s returns (LOW) with S&P 500 returns (IDX) held fixed, we estimate that the Home Depot
returns will, on average, increase by 0.00384. The estimates can also be used to directly estimate the conditional
expectation of HD for any specific values of IDX and LOW. For instance, in a month where the S&P 500 market return
is 4% (IDX = 0.04) and Lowe’s return is 2% (LOW = 0.02), the estimated conditional expectation of HD is
Ê(HD|IDX = 0.04, LOW = 0.02) = α̂ + β̂1 (0.04) + β̂2 (0.02) = 0.004 + (0.595)(0.04) + (0.384)(0.02) ≈ 0.035,
which is approximately 3.5%. Although we do not yet know whether there is a statistically significant difference between the two slope estimates β̂1 and β̂2, the estimates suggest that the S&P 500 returns have a stronger association (slope of 0.595) than Lowe's returns (slope of 0.384) with expected Home Depot returns.
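This arithmetic can be verified directly in R, using the coefficients from the lm fit above:

coefs <- coef(lm(HD ~ IDX + LOW, data = sp500))
unname(coefs["(Intercept)"] + coefs["IDX"]*0.04 + coefs["LOW"]*0.02)   # approximately 0.035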
Why is the estimated slope on IDX (β̂ = 1.020) in the SLR model, reported in the second column of the table, so
much larger than the MLR estimated slope of 0.595? First, it’s important to realize that the underlying population
parameters are measuring different things, with the SLR slope measuring the association of a one-unit change in IDX
with expected HD but completely ignoring LOW and the MLR slope measuring the association of a one-unit change in
IDX with expected HD with LOW held fixed. For the SLR model, the least-squares estimate of the IDX slope parameter
is essentially forced to pick up any association with HD that may be coming from LOW since LOW is not in the model.
It turns out that, perhaps unsurprisingly, there is a positive correlation between the S&P returns (IDX) and Lowe’s
returns (LOW), with rIDX,LOW = 0.506. Therefore, when IDX increases by one unit, LOW also tends to increase, so
the SLR slope parameter on IDX is expected to be larger than the MLR slope parameter on IDX, as the SLR slope
parameter captures the direct association of IDX with expected HD but also the indirect association of LOW with
expected HD. Similar reasoning can be applied to explain why the SLR slope estimate for LOW, reported in the third
column of the table, is larger than the MLR slope estimate for LOW.
Example 18.3 (Cigarette sales and cigarette taxes) In Examples 17.4 and 17.7, we introduced a SLR model to explain
state-level cigarette sales (CIGSALES) with state-level cigarette taxes (CIGTAX). We now add a variable to the model,
yielding a MLR model with two explanatory variables. The new variable is the binary variable PRODUCER, which is
equal to 1 for any state producing more than 20 million pounds of tobacco in 2019 and 0 otherwise.
CIGSALES = α + β1 CIGTAX + β2 PRODUCER + U with E(U|CIGTAX, PRODUCER) = 0.
The PRODUCER variable is included to allow for the possibility that cigarette sales, even after controlling for cigarette taxes, may be higher in states that produce tobacco (e.g., due to greater acceptance of tobacco and smoking). In the dataset, there are seven states (Georgia, Kentucky, North Carolina, Pennsylvania, South Carolina, Tennessee, and Virginia) with producer = 1. The following table reports the least-squares estimates (α̂, β̂1 , β̂2 ) of the MLR model
parameters in the first column and, for comparison, the SLR estimates from Example 17.7 in the second column:
MLR estimates SLR estimates
α (intercept) 54.28 55.95
β1 (slope on CIGTAX) –8.97 –9.49
β2 (slope on PRODUCER) 5.37
The MLR estimates are obtained using the lm function in R:
lm(cigsales~cigtax+producer, data=cigdata)
##
## Call:
## lm(formula = cigsales ~ cigtax + producer, data = cigdata)
##
## Coefficients:
## (Intercept) cigtax producer
## 54.28 -8.97 5.37
There is not much change in the CIGTAX slope estimate in the MLR model (–8.97) as compared to the SLR model
(–9.49). Again, the meaning of the two parameters and estimates is different, as the MLR estimate measures what
happens when holding PRODUCER fixed and the SLR estimate does not. For the MLR model, the slope estimate
β̂1 = –8.97 implies that a one-unit (one dollar) change in state cigarette taxes, holding fixed whether the state is a
tobacco producer, is associated with an estimated decrease of 8.97 packs per capita, on average. For the tobacco-
producer variable, a one-unit change in the variable involves going from a non-producing state (PRODUCER = 0) to a
producing state (PRODUCER = 1). The slope estimate β̂2 = 5.37 estimates the difference in expected per-capita packs
sold between a tobacco-producing state and a non-producing state, holding state cigarette taxes fixed. Whether this
difference is statistically significant depends upon the standard error of the estimate, an issue that we re-visit later.
Example 18.4 (Weekly earnings) In the previous chapter, SLR models were used to model weekly earnings in terms
of union status (Example 17.2) and in terms of educational attainment (Example 17.3). With the MLR model, we can
model weekly earnings in terms of several explanatory variables simultaneously. We use the cps data on n = 2809
employed individuals and the following four explanatory variables (K = 4) to model the outcome earnwk (weekly
earnings):
educ = years of education
exper = years of experience
union = 1 if individual is a union member, 0 otherwise
female = 1 if individual is female, 0 otherwise
For the experience (exper) variable, we adopt a standard definition used by labor economists, which defines62
exper = age – educ – 6, where age is age in years.
With both exper and educ in the MLR model, it is not possible to also include age since, by the definition of exper, the
age variable is a perfect linear combination of educ and exper, which would violate Assumption MLR-NPC.
Here is the R code to construct the female and exper variables and calculate the MLR least-squares estimates:
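A sketch of such code follows (the exper definition uses the formula above; constructing female from an underlying sex variable is our assumption, and the book's actual code may differ):

cps$exper <- cps$age - cps$educ - 6                  # potential experience
cps$female <- ifelse(cps$sex == "female", 1, 0)      # assumed underlying variable 'sex'
results <- lm_robust(earnwk ~ educ + exper + union + female, data = cps)
summary(results)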
member (female = 0, union = 1) is equal to β̂4 = –344.8. Estimated differences in average weekly earnings between other groups can be calculated similarly.
The least-squares minimization problem in Proposition 18.1 yields the least-squares estimates for the observed
sample that has been drawn from the population. The values of the least-squares estimates depend upon the particular
sample that happened to be drawn from the population. In thinking about the sampling distribution associated with
least-squares estimation, the distribution of the least-squares estimators (α̂XY , β̂1,XY , β̂2,XY , …, β̂K,XY ) is described by
the distribution of all possible least-squares estimates that could arise from every possible n-observation i.i.d. sample
drawn from the population. Like the least-squares estimators for the SLR model parameters, the least-squares
estimators for the MLR model have the desirable properties of consistency and asymptotic normality, with the
latter property allowing for large-sample inference using the normal distribution. The following proposition states
the consistency and asymptotic normality properties of the least-squares estimators of the MLR model:63
Proposition 18.2. If the MLR model holds and Assumption MLR-VarX and Assumption MLR-NPC hold for any
possible sample drawn from the population, least-squares estimation is consistent, with α̂XY being a consistent
estimator of α and each β̂k,XY being a consistent estimator of βk for each k ∈ {1, 2, …, K}. Moreover, the least-squares
estimators are asymptotically normal, with
$$\sqrt{n}(\hat{\alpha}_{XY} - \alpha) \overset{a}{\sim} N(0, V_{\alpha})$$

and

$$\sqrt{n}(\hat{\beta}_{k,XY} - \beta_k) \overset{a}{\sim} N(0, V_{\beta_k}) \quad \text{for each } k \in \{1, 2, \ldots, K\}.$$

The asymptotic variance of the intercept estimator α̂XY is Vα/n, and the asymptotic variance of each slope estimator β̂k,XY is Vβk/n for each k ∈ {1, 2, …, K}.
Proposition 18.2 says that the least-squares estimators get arbitrarily close to the underlying MLR parameters as
the sample size increases. The asymptotic normality of the estimators allows construction of confidence intervals and
hypothesis testing based upon the normal distribution, as discussed in further detail in Section 18.3.
The estimated residual is the difference between the realized outcome yi and its fitted value ŷi . When an estimated
residual is small in magnitude, it means the realized outcome is close to the value predicted by the fitted value.
Since the least-squares estimators are consistent, the fitted values and estimated residuals calculated from the least-
squares estimates are also consistent. The following proposition states this result, generalizing the SLR result
(Proposition 17.5):
Proposition 18.3. (Consistency of fitted values and estimated residuals) Assume that the MLR model holds and
Assumption MLR-VarX and Assumption MLR-NPC hold for any possible sample drawn from the population. Then,
for any i ∈ {1, 2, …, n}, the fitted value ŷi is a consistent estimate of
E(Y|X1 = xi1 , X2 = xi2 , …, XK = xiK ) = α + β1 xi1 + β2 xi2 + · · · + βK xiK ,
and the estimated residual ûi is a consistent estimate of the population residual
ui = yi – α – β1 xi1 – β2 xi2 – · · · – βK xiK .
Example 18.5 (Weekly earnings) Continuing Example 18.4, the following table shows detailed information for the
first ten observations in the cps dataset, including the values of the four explanatory variables (educ, exper, union,
female), the observed outcome value (earnwk), and the fitted values and estimated residuals.
i educi experi unioni femalei yi (earnwki ) ŷi ûi
1 14 30 0 0 577 1269 –692
2 18 10 0 1 3049 1270 1779
3 18 6 0 1 2500 1250 1250
4 12 17 0 0 300 983 –683
5 12 38 0 0 1000 1086 –86
6 7.5 28.5 0 1 1000 196 804
7 12 37 0 1 650 737 –87
8 13 39 0 1 1712 857 854
9 16 17 0 1 820 1082 –262
10 12 41 1 0 1240 1244 –4
We calculate the fitted values and estimated residuals in R in the same way as seen for the SLR model:
yhat <- results$fitted.values; uhat <- cps$earnwk - yhat   # fitted values and estimated residuals
# output first ten estimated residuals and fitted values, rounded to the nearest dollar
round(uhat[1:10]); round(yhat[1:10])
The fitted values are based upon the least-squares parameter estimates, using the formula in the definition of ŷi .
Looking at the sixth observation (i = 6), for instance, the fitted value is very low since the education level is so low; the
fitted value indicates an estimate of $196 for the estimated expected earnings of an individual with those characteristics
(female, non-union worker with 7.5 years of education and 28.5 years of experience). The fitted values for the first three
observations happen to be very close to each other due to the values of their explanatory variables. Looking at the
second and third observations, the only difference is a four-year difference in experience (exper = 10 for i = 2 and
exper = 6 for i = 3), leading to a difference in fitted values of 4β̂exper = (4)(4.9) ≈ 20. The estimated residuals ûi exhibit
a lot of variation. Some observations have fitted values quite close to the actual outcomes and estimated residuals close
to zero, most notably for i ∈ {5, 7, 10}, while other observations have fitted values very far from the actual outcomes
and estimated residuals with large magnitudes, most notably for i ∈ {2, 3}.
Summary measures of the overall size of the model's residuals can be obtained by generalizing the residual variance and standard deviation estimates introduced for the SLR model. Here are their definitions for the MLR model:

$$\hat{\sigma}_U^2 = \frac{1}{n-K-1}\sum_{i=1}^{n}\hat{u}_i^2 \quad \text{and} \quad \hat{\sigma}_U = \sqrt{\frac{1}{n-K-1}\sum_{i=1}^{n}\hat{u}_i^2}.$$

As compared to the σ̂U2 and σ̂U formulas for least-squares estimation of the SLR model, the formulas for the MLR model involve a scaling of 1/(n – K – 1) rather than 1/(n – 2). This scaling accounts for the estimation of the (K + 1) parameters in the MLR model, with the denominator equal to n – (K + 1) = n – K – 1. The SLR formulas are a special case of the MLR formulas with K = 1. While the 1/(n – K – 1) scaling does not differ much numerically from either a 1/(n – 1) or 1/n scaling when n is large, this scaling does ensure that the residual variance estimator is unbiased.
The following proposition summarizes the properties of the estimated residuals from least-squares estimation of the
MLR model. This proposition generalizes Proposition 17.7, which considered the properties of the estimated residuals
for the SLR case.
Proposition 18.4. (Properties of estimated residuals) The estimated residuals
ûi = yi – α̂ – β̂1 xi1 – β̂2 xi2 – · · · – β̂K xiK,
based upon the least-squares estimates (α̂, β̂1 , β̂2 , …, β̂K ) have the following properties:
(i) The sample average of the estimated residuals is zero:
(1/n) Σ_{i=1}^n ûi = 0.
(ii) The sample correlation between the values of any explanatory variable and the estimated residuals is zero:
rxk û = 0 for every k ∈ {1, 2, …, K}.
(iii) The sample correlation between the fitted values and the estimated residuals is zero:
rŷû = 0.
(iv) The residual variance estimate σ̂U² = (1/(n–K–1)) Σ_{i=1}^n ûi² is a consistent (and unbiased) estimate of σU², the
unconditional variance of the random variable U.
(v) The residual standard deviation estimate σ̂U = √( (1/(n–K–1)) Σ_{i=1}^n ûi² ) is a consistent estimate of σU, the unconditional
standard deviation of the random variable U.
Properties (i), (iii), (iv), and (v) are basically the same as those stated for the SLR model in Proposition 17.7,
with the different scaling 1/(n–K–1) used for properties (iv) and (v) here. Property (ii) states that the sample correlation
between estimated residuals and any of the explanatory variables is equal to zero. That is, the explanatory variable
xk is uncorrelated with the estimated residual û for any k ∈ {1, 2, …, K}. This property is a sample analogue of the
corresponding population property, that the population residual U is uncorrelated with each random variable Xk , for
k ∈ {1, 2, …, K}, which is implied by the exogeneity assumption E(U|X) = 0. Since property (ii) is a by-product of
least-squares estimation, it is true even if the exogeneity assumption doesn’t actually hold. As such, it is not fruitful to
use a sample correlation between an explanatory variable xk and the estimated residual û to test whether Xk is related
to U, as the sample correlation is always zero.
With estimated residuals and fitted values defined for least-squares estimation of the MLR model, we can generalize
the R-squared measure that was introduced in Definition 17.5 for the SLR model.
Definition 18.5 The R-squared value associated with least-squares estimation of the MLR model, denoted R2 , is
R² = s²ŷ / s²y = 1 – s²û / s²y.
This definition is identical to the definition of R-squared for least-squares estimation of the SLR model. In terms
of interpretation, the only difference for the MLR model is that there may be more than a single explanatory variable
explaining the variation in the outcome variable. For example, if K = 3 and R2 = 0.36, then the three explanatory
variables x1 , x2 , and x3 explain 36% of the variation in the y variable.
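This definition can also be verified directly from sample variances. Here is a minimal sketch in R, assuming y, yhat, and uhat hold the outcome values, fitted values, and estimated residuals for an estimated MLR model:
# R-squared as a ratio of sample variances; both lines give the same number
var(yhat)/var(y)
1 - var(uhat)/var(y)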
The following proposition summarizes the important properties of the R-squared value for MLR models:
Proposition 18.5. (Properties of R-squared) The R-squared value, based upon the least-squares estimates
(α̂, β̂1 , β̂2 , …, β̂K ), has the following properties:
(i) R² = r²yŷ
(ii) 0 ≤ R2 ≤ 1
(iii) If an explanatory variable xK+1 is added to the MLR model, least-squares estimation of the new MLR model
(with K + 1 explanatory variables) has an R-squared value (call it R2K+1 ) that is at least as large as the original R2
value (from the MLR model with K explanatory variables): R2K+1 ≥ R2 .
Properties (i) and (ii) have been seen previously for SLR models. R2 is equal to the square of the correlation between
the values of the outcome variable y and the fitted values ŷ. To the degree that the explanatory variables provide a better
fit/prediction for the outcome variable, the correlation between y and ŷ is larger in magnitude and, therefore, R2 is
higher. The extreme of R2 = 1 corresponds to 100% of the variation in y being explained by the explanatory variables,
which happens when y is a perfect linear function of the explanatory variables. The extreme of R2 = 0 corresponds
to none of the variation in y being explained by the explanatory variables, which happens only when all of the slope
estimates β̂1 , β̂2 , …, β̂K are exactly equal to zero. Property (ii) follows directly from property (i) since the correlation
ryŷ satisfies –1 ≤ ryŷ ≤ 1.
Property (iii) states that the R-squared value (weakly) increases when a variable is added to the MLR model and
new least-squares estimates are calculated. The intuition here is that you can’t do worse in explaining the variation
in the outcome variable y when you add an explanatory variable to the model. (Mathematically, if the estimated
slope on the added variable is exactly equal to zero, the fitted values would be the same as the original least-squares
estimation, so that R2 is unchanged. For any other (non-zero) estimate of the slope of the added variable, the R2 value
increases as compared to the original R2 value.) By continually applying property (iii), R2 increases if we add more
and more explanatory variables to the MLR model. This property should be considered a cautionary tale. Specifically,
if a practitioner is too focused on the R2 value, it may cause them to add too many explanatory variables to a model.
In choosing which explanatory variables belong in the model, R2 should not be the guiding force; instead, variables
should be included based upon prior knowledge that the researcher has about their practical or economic relevance in
explaining the outcome variable.
Example 18.6 (Monthly stock returns) Continuing Example 18.2, the following table reports the R-squared values
and residual standard deviation estimates for least-squares estimation of three models for the outcome HD: (i) MLR
with IDX and LOW as explanatory variables, (ii) SLR with IDX as the explanatory variable, and (iii) SLR with LOW
as the explanatory variable:
Example 18.7 (Weekly earnings) Re-visiting Example 18.4, the following table shows how R2 changes when
explanatory variables are added one-by-one to the model that explains weekly earnings (earnwk):
Explanatory variables R2
educ 0.106
educ, exper 0.108
educ, exper, union 0.115
educ, exper, union, female 0.166
Property (iii) of Proposition 18.5, that R2 (weakly) increases as variables are added to the model, is illustrated here. The first row corresponds
to a SLR model with only educ as an explanatory variable. Education alone explains 10.6% of the variation in weekly
earnings. Adding experience (exper) to the model provides only a small incremental increase in R2 to 0.108, as does
adding union, which increases R2 to 0.115. Adding the female variable provides a larger increase, to an R2 of 0.166,
meaning the four explanatory variables together explain 16.6% of the variation in weekly earnings.
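The R2 values in this table can be obtained by estimating the four nested models and extracting each R-squared value; a sketch:
# estimate the four nested models and report each R-squared
summary(lm_robust(earnwk~educ, data=cps))$r.squared
summary(lm_robust(earnwk~educ+exper, data=cps))$r.squared
summary(lm_robust(earnwk~educ+exper+union, data=cps))$r.squared
summary(lm_robust(earnwk~educ+exper+union+female, data=cps))$r.squared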
This same reasoning applies to any other explanatory variable Xk in the MLR model. For any k ∈ {1, 2, …, K}, the
random variable X̃k is the part of Xk that is not related to the other explanatory variables in the model. To precisely
estimate βk , it is better to have a lot of independent variation in the Xk variable, which happens when the variance of
X̃k , Var(X̃k ) = σ²X̃k , is high.
As we did for the SLR model (Proposition 17.10), we consider the case of homoskedastic residuals for the MLR
model as a pedagogical tool to see the factors that affect the standard errors of the slope estimates. To be clear,
homoskedasticity is not required to perform least-squares estimation, as the standard errors of the least-squares
estimates can and should be calculated under the more relaxed case of heteroskedasticity.
The definitions of homoskedasticity and heteroskedasticity are generalized to the MLR model as follows:
Definition 18.6 The residuals of a MLR model are homoskedastic if the conditional variance
Var(U|X1 = x1 , X2 = x2 , …, XK = xK ) = σU2
is constant and does not depend upon the values x1 , x2 , …, xK . The residuals are said to exhibit homoskedasticity.
Definition 18.7 The residuals of a MLR model are heteroskedastic if the conditional variance
Var(U|X1 = x1 , X2 = x2 , …, XK = xK )
is non-constant and depends upon the values x1 , x2 , …, xK . The residuals are said to exhibit heteroskedasticity.
In the case of homoskedastic residuals, the following proposition provides the asymptotic variance formulas for the
least-squares slope estimators:
Proposition 18.7. If the MLR model holds, Assumption MLR-VarX and Assumption MLR-NPC hold for any possible
sample drawn from the population, and the residuals are homoskedastic, then the asymptotic variance of β̂k,XY (the
least-squares slope estimator of βk ) is
Vβk / n = σU² / (n σ²X̃k ) for each k ∈ {1, 2, …, K}.
The variance in the denominator of the asymptotic-variance expression is the independent variation in Xk , given by
σ²X̃k , and not the overall variation in Xk , given by σ²Xk . To obtain the standard errors of the least-squares slope estimates,
sample descriptive statistics can be plugged in for the population statistics, so that
se(β̂k ) = √(V̂βk / n) = √(s²û / (n s²x̃k )) = sû / (√n sx̃k ) for each k ∈ {1, 2, …, K},
where sû is the sample standard deviation of the estimated residuals and sx̃k is the sample standard deviation of the
variable that measures the part of xk that is not linearly related to the other explanatory variables.64 Thus, there are
three factors that affect the standard errors of the least-squares slope estimates, with the first two being the same as
seen for the SLR model:
• Sample size: Larger n leads to a smaller standard error se(β̂k ), with the usual 1/√n scaling.
• Residual noise: A smaller residual variance σU², as estimated by s²û, leads to a smaller standard error se(β̂k ).
• Independent variation of the explanatory variable: Whereas the overall variation of the explanatory variable affects
the standard error for the SLR model, it is the independent variation of an explanatory variable xk that affects se(β̂k )
for the MLR model. The independent variation of the random variable Xk is given by the variance of the random
variable X̃k introduced above and measures how much variation is left in Xk after the linear relationship with other
explanatory variables has been accounted for. When this independent variation, as estimated by s2x̃k , is larger, the
standard error se(β̂k ) is smaller.
For the third factor, while a lot of independent variation in xk is good for the precision of β̂k (i.e., a low se(β̂k )), it is
also true that the precision of β̂k is poor (i.e., a high se(β̂k )) when there is little independent variation in xk . When the
variable xk is highly correlated with and well explained by the other explanatory variables in the model, the variance
σ²X̃k can be very low, which can cause the asymptotic variance of the slope estimator β̂k,XY and the standard error of
the estimate β̂k to “blow up” since this variance appears in the denominator of the asymptotic-variance formula in
Proposition 18.7. This issue, when high correlation among two or more explanatory variables causes standard errors
to be large, is known as multicollinearity. While some textbooks offer “solutions” to multicollinearity, there really is
no solution if we wish to keep all of the affected variables in the model. For instance, suppose K = 2 and the variables
x1 and x2 are highly correlated with each other. In such a case, it is possible that x1 and x2 together provide a good
prediction of the outcome variable y, as reflected by a high R2 value, even if both β̂1 and β̂2 have very high standard
errors due to the multicollinearity. While the two variables are jointly important in explaining y, their high correlation
means that it is difficult or impossible to disentangle the separate effects of x1 and x2 on the expected outcome. β1
measures how X1 affects E(Y|X) while holding X2 fixed, but estimating β1 is difficult because there is little variation
in x1 once x2 is held fixed due to the high value of rx1 x2 .
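To make the idea of independent variation concrete, sx̃k can be computed by regressing xk on the other explanatory variables and taking the sample standard deviation of the residuals. Here is a minimal sketch for the educ slope in the weekly-earnings model, assuming uhat holds the estimated residuals from the full least-squares regression:
# part of educ not linearly related to the other explanatory variables
aux <- lm(educ~exper+union+female, data=cps)
educ_tilde <- residuals(aux)
# homoskedastic-case standard error for the educ slope: s_uhat/(sqrt(n)*s_educ_tilde)
sd(uhat)/(sqrt(nrow(cps))*sd(educ_tilde))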
Example 18.8 (Cigarette sales and cigarette taxes) The following table reproduces the least-squares estimates from
Example 18.3, now with heteroskedasticity-robust standard errors reported in parentheses:
MLR estimates SLR estimates
α (intercept) 54.28 (3.51) 55.95 (2.92)
β1 (slope on CIGTAX) –8.97 (1.19) –9.49 (1.06)
β2 (slope on PRODUCER) 5.37 (5.08)
The standard error for the cigarette-tax slope in the MLR model is se(β̂1 ) = 1.19. This standard error is only slightly
higher than the standard error of 1.06 for the cigarette-tax slope in the SLR model, suggesting that even after
controlling for producer (whether or not a state is a tobacco producer) there is still a lot of independent variation
left in cigtax for precisely estimating β1 in the MLR model.
The MLR estimates in the table are obtained in R using the lm_robust function:
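A sketch of the call, with producer as the tobacco-producer indicator in cigdata:
lm_robust(cigsales~cigtax+producer, data=cigdata)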
Example 18.9 (Monthly stock returns) The following table reproduces the least-squares estimates from Example 18.2,
now with heteroskedasticity-robust standard errors reported in parentheses:
MLR estimates SLR estimates SLR estimates
α (intercept) 0.004 (0.003) 0.009 (0.003) 0.006 (0.003)
β1 (slope on IDX) 0.595 (0.091) 1.020 (0.082)
β2 (slope on LOW) 0.384 (0.047) 0.522 (0.040)
For the MLR model, the standard error of the slope on IDX (S&P 500 returns) is se(β̂1 ) = 0.091. This standard error
is roughly 11% higher than the standard error of 0.082 obtained for the slope on IDX in the SLR model. The higher
standard error for the MLR model is not too surprising since β1 is estimated using only the independent variation of
IDX after controlling for LOW. Since IDX and LOW are correlated, with rIDX,LOW = 0.5061, the amount of independent
variation in IDX used in estimation of the MLR model is somewhat lower than the overall variation of IDX used in
estimation of the SLR model. Similarly, for the same reason, the standard error on LOW (Lowe’s returns) is higher for
estimation of the MLR model, with se(β̂2 ) = 0.047, as compared to estimation of the SLR model with only LOW as an
explanatory variable, where the standard error is 0.040.
Here is the R code to produce the estimates and standard errors in the table:
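A sketch of the three estimation calls, using the sp500 dataset and the variable names from the example:
lm_robust(HD~IDX+LOW, data=sp500)  # MLR with both explanatory variables
lm_robust(HD~IDX, data=sp500)      # SLR with IDX only
lm_robust(HD~LOW, data=sp500)      # SLR with LOW only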
The asymptotic normality of the least-squares estimators (Proposition 18.2) allows practitioners to use the normal
distribution for (i) calculating confidence intervals for the MLR parameters and (ii) conducting hypothesis tests
involving the MLR parameters. The 1 – α confidence interval for the intercept parameter α is
(α̂ – zα/2 se(α̂), α̂ + zα/2 se(α̂)),
and the 1 – α confidence interval for each slope parameter βk , for k ∈ {1, 2, …, K}, is
(β̂k – zα/2 se(β̂k ), β̂k + zα/2 se(β̂k )).
The specific case of 95% confidence intervals for α and βk are, respectively,
(α̂ – 1.96se(α̂), α̂ + 1.96se(α̂))
and
(β̂k – 1.96se(β̂k ), β̂k + 1.96se(β̂k )).
Example 18.10 (Weekly earnings) The following R code shows the least-squares estimates from Example 18.4, with
heteroskedasticity-robust standard errors and 95% confidence intervals also reported:
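A sketch of the call (lm_robust reports heteroskedasticity-robust standard errors and 95% confidence intervals by default):
results <- lm_robust(earnwk~educ+exper+union+female, data=cps)
results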
We are 95% confident that the change in expected weekly earnings associated with a one-year change in education,
holding all other variables fixed, is between $96.67 and $125.21. For the union-membership slope β3 , the 95%
confidence interval is
(142.70 – (1.96)(44.45), 142.70 + (1.96)(44.45)) ≈ (55.59, 229.91).
We are 95% confident that the difference in expected weekly earnings between a union member and a non-member,
holding all other variables fixed, is between $55.59 and $229.91. While this interval for β3 is quite wide (approximately
$174 wide), it does not include zero, supporting the idea that the union differential is statistically meaningful even if
it’s not very precisely estimated.
We can obtain other confidence intervals for the parameters by changing the optional alpha argument for the
lm_robust function. Here is output with 90% confidence intervals:
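A sketch of the call with alpha set to 0.10:
lm_robust(earnwk~educ+exper+union+female, data=cps, alpha=0.10)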
due to possible correlation between the estimators β̂1,XY and β̂2,XY . More generally, if X1 increases by v1 units
and X2 increases by v2 units, the relevant change in E(Y|X) is the linear combination v1 β1 + v2 β2 , with estimate
v1 β̂1 + v2 β̂2 and standard error se(v1 β̂1 + v2 β̂2 ).
• Difference in effects: If X1 and X2 are comparable variables (e.g., they are measured in the same units), we might be
interested in the difference of the two variables’ effects on E(Y|X). This difference is given by the linear combination
β1 – β2 . (This turns out to be a special case of the case considered above, thinking about increasing X1 by one unit
and decreasing X2 by one unit.)
• Estimating the conditional expectation: The conditional expectation E(Y|X) is itself a linear combination of the
three model parameters. For X1 = x1∗ and X2 = x2∗ , the conditional expectation is
E(Y|X1 = x1∗ , X2 = x2∗ ) = α + β1 x1∗ + β2 x2∗ .
This linear combination can be re-written 1 · α + x1∗ · β1 + x2∗ · β2 , making it clear that the expression is a linear
combination of the three parameters, with coefficients 1, x1∗ , and x2∗ . The estimate of the conditional expectation is
obtained by plugging in the least-squares estimates,
Ê(Y|X1 = x1∗ , X2 = x2∗ ) = α̂ + β̂1 x1∗ + β̂2 x2∗ ,
and has standard error se(α̂ + β̂1 x1∗ + β̂2 x2∗ ) = se(1 · α̂ + x1∗ · β̂1 + x2∗ · β̂2 ).
To accommodate general forms of linear combinations, we introduce a (row) vector that contains the appropriate
coefficients for any given linear combination of the model parameters. In the example above, there are three model
parameters (α, β1 , β2 ) and three estimates (α̂, β̂1 , β̂2 ), so the vector will have three elements. The first element
corresponds to the constant multiplying α̂, the second element corresponds to the constant multiplying β̂1 , and the
third element corresponds to the constant multiplying β̂2 . For the three examples above, the vectors would be:
0 1 1 for β̂1 + β̂2 ,
0 1 –1 for β̂1 – β̂2 ,
and
1 x1∗ x2∗ for α̂ + β̂1 x1∗ + β̂2 x2∗ .
More generally, the MLR model has K + 1 model parameters (α, β1 , β2 , …, βK ) and K + 1 estimates (α̂, β̂1 , β̂2 , …, β̂K ),
so the vector will have K + 1 elements. For example, with K = 4 explanatory variables, the linear combination β1 +
3β3 – 2β4 is represented by the vector
(0 1 0 3 –2)
since β1 + 3β3 – 2β4 = 0α + 1β1 + 0β2 + 3β3 + (–2)β4 .
To facilitate the calculation of a standard error for a linear combination of least-squares estimates, a user-defined R
function linear_combination has been written:
• linear_combination(regresults, linvec): Takes regresults, the results from a lm_robust
regression, and linvec, a vector that specifies the linear combination of interest, as arguments and returns the
estimate (estimate) of the linear combination and the standard error (se) of the estimate. The vector linvec
has the same length (K + 1) as the parameter vector (α, β1 , β2 , …, βK ), with its elements (a1 , a2 , a3 , …, aK+1 )
corresponding to the coefficients of each parameter that give the linear combination of the estimates, a1 α̂ + a2 β̂1 +
a3 β̂2 + · · · + aK+1 β̂K .
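A minimal sketch consistent with this description, computing the estimate and its standard error from the robust variance matrix of the lm_robust results:
linear_combination <- function(regresults, linvec) {
  # estimate of the linear combination: sum of linvec times the parameter estimates
  est <- sum(linvec * coef(regresults))
  # standard error: square root of linvec' Vhat linvec
  se <- sqrt(as.numeric(t(linvec) %*% vcov(regresults) %*% linvec))
  list(estimate = est, se = se)
}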
Example 18.11 (Monthly stock returns) Parameter estimates and standard errors for the MLR model
E(HD|IDX, LOW) = α + β1 IDX + β2 LOW were provided in Example 18.9. We use the linear_combination
function to provide estimates and standard errors for some linear combinations of the parameters, specifically the
following: (i) β1 + β2 , representing the combined effect of increasing both IDX and LOW, (ii) β1 – β2 , representing the
difference in the effects of IDX and LOW, and (iii) α + 0.05IDX + 0.03LOW, representing the conditional expectation
E(HD|IDX = 0.05, LOW = 0.03). The following R code provides output for these three cases:
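A sketch of the three calls, assuming results holds the lm_robust fit of the MLR model:
results <- lm_robust(HD~IDX+LOW, data=sp500)
linear_combination(results, c(0,1,1))        # (i) beta1 + beta2
linear_combination(results, c(0,1,-1))       # (ii) beta1 - beta2
linear_combination(results, c(1,0.05,0.03))  # (iii) alpha + 0.05*beta1 + 0.03*beta2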
We have β̂1 + β̂2 = 0.979, with a standard error of 0.069. If IDX and LOW both increase by 0.01 (one percentage
point), these estimates imply an associated increase in the conditional expectation of HD of 0.00979, with a standard
error of 0.00069. For the difference in the slope parameters, we have β̂1 – β̂2 = 0.211, with a standard error of 0.127,
which implies the 95% confidence interval for β1 – β2 is (–0.039, 0.460). And, the estimated conditional expectation
Ê(HD|IDX = 0.05, LOW = 0.03) = α̂ + β̂1 (0.05) + β̂2 (0.03) = 0.0455,
with a standard error of 0.0042, which implies the 95% confidence interval for the true conditional expectation
E(HD|IDX = 0.05, LOW = 0.03) is (0.0372, 0.0538).
Xk in the model at the outset, the failure to reject H0 : βk = 0 might prompt a practitioner to drop Xk from the model.
That said, variables should be dropped from the MLR model with caution. First, it’s possible that we just don’t have
a large enough sample to estimate a statistically significant slope on Xk . Second, it can be useful to have a variable in
the model even if its slope estimate is not statistically significant, as someone who is looking at the results might have
expected the variable to matter in the model and would be interested to see that it does not.
Example 18.12 (Monthly stock returns) The following R code shows the MLR results from Example 18.9:
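A sketch of the call, whose summary output includes the test statistics and p-values:
summary(lm_robust(HD~IDX+LOW, data=sp500))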
The output contains z-statistics and p-values for testing each individual MLR parameter versus zero. Since α =
E(HD|IDX = LOW = 0), the null hypothesis H0 : α = 0 corresponds to testing whether the expected Home Depot return
(HD) is equal to zero when the S&P 500 return (IDX) and Lowe’s return (LOW) are both equal to zero. From the table,
the z-statistic for this null hypothesis is 1.53, so H0 : α = 0 would not be rejected at a 5% level (1.53 < z0.025 = 1.96) or
at a 10% level (1.53 < z0.05 = 1.645). The p-value of 0.1266 indicates that H0 : α = 0 would not be rejected at any level
below 12.66%. For both slope parameters, the test of H0 : βk = 0 has a p-value equal to zero to many decimal places.
Therefore, H0 : β1 = 0 and H0 : β2 = 0 are both rejected at any level, indicating that IDX and LOW are both statistically
significant variables in the MLR model.
Example 18.13 (Cigarette sales and cigarette taxes) The following R code shows the MLR results from Example 18.8,
with the z-statistics and p-values for testing each parameter versus zero:
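A sketch of the call, again with producer as the tobacco-producer indicator:
summary(lm_robust(cigsales~cigtax+producer, data=cigdata))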
The state-level cigarette tax (CIGTAX) variable is statistically significant, as the p-value of 0.0000 indicates that
H0 : β1 = 0 is rejected at any level. On the other hand, the tobacco-producing state indicator (PRODUCER) variable
has a z-statistic of 1.06 and a p-value of 0.296 associated with testing H0 : β2 = 0. The null hypothesis H0 : β2 = 0
corresponds to PRODUCER not being in the MLR model, and H0 : β2 = 0 would not be rejected at any level below
29.6%. Given these results, whether or not to leave PRODUCER in the model is a choice that the practitioner needs
to make. If PRODUCER is dropped from the model, the resulting model is the SLR model with only CIGTAX as an
explanatory variable; from Example 18.8, the cigarette-tax slope estimate for that SLR model is –9.49 with a standard
error of 1.06.
Other hypothesis tests may be of interest for a MLR model, including the following:
Testing a single linear restriction: The simple null hypotheses above are examples of single linear restrictions on
the MLR parameters. What if we want to test whether the slope on one variable X1 is equal to the slope on another
variable X2 ? If the two variables are measured in the same units, a test of H0 : β1 = β2 assesses whether the partial effect
of X1 on E(Y|X), holding all other variables fixed, is the same as the partial effect of X2 on E(Y|X), holding all other
variables fixed. The null hypothesis H0 : β1 = β2 is equivalent to
H0 : β1 – β2 = 0,
three 0 elements since each of the parameters is being tested against zero. Using the notation from the Appendix to
Chapter 16, the number of restrictions is Q = 3 and the number of parameters is L = K + 1 = 7, and we have
    ( 0 0 1 0 0 0 0 )            ( 0 )
R = ( 0 0 0 0 1 0 0 )    and c = ( 0 ).
    ( 0 0 0 0 0 1 0 )            ( 0 )
Each row of the matrix R is a specific linear combination of the L = K + 1 model parameters. For the specific R matrix
shown above, the first row corresponds to β2 , the second row to β4 , and the third row to β5 . Each element of the c (column)
vector provides the constant against which the corresponding row of R is being tested. For the example shown, these
elements are all zero.
Suppose the MLR model for Home Depot (HD) monthly returns is augmented to include the monthly returns of
Bank of America (BAC) and Wells Fargo (WFC) as explanatory variables, with the results provided below:
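A sketch of the augmented estimation call, with results2 as an assumed object name that is used again below:
results2 <- lm_robust(HD~IDX+LOW+BAC+WFC, data=sp500)
summary(results2)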
Based on the z-statistics and p-values, the IDX and LOW variables remain statistically significant, but the p-values
on the two added variables indicate lack of significance for each variable individually (p-value of 0.30 for BAC and
p-value of 0.89 for WFC). These results are suggestive that the two added variables are not valuable in the model for
explaining Home Depot’s returns, and a Wald test of H0 : β3 = β4 = 0 provides a p-value for their joint significance:
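A sketch of the test, assuming a companion-website helper function (called wald_test here) that takes the regression results, the restriction matrix R, and the vector c:
# rows of R select beta3 (BAC) and beta4 (WFC) among (alpha, beta1, beta2, beta3, beta4)
R <- rbind(c(0,0,0,1,0),
           c(0,0,0,0,1))
c0 <- c(0,0)
wald_test(results2, R, c0)  # hypothetical helper name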
## $W
## [1] 1.4126
##
## $p_value
## [1] 0.49346
The rbind function stacks the rows of linear restrictions to form the matrix R. The resulting p-value of 0.49 implies
that H0 : β3 = β4 = 0, corresponding to BAC and WFC not being in the model, would not be rejected at any reasonable
level. This test, therefore, provides support for dropping the two variables from the model.
of more than one explanatory variable. For instance, if X is a variable, we can include both X and X 2 as explanatory
variables to allow Y to be a non-linear function of X. Such non-linear specifications are considered in Section 18.6.2.
Also, if X1 and X2 are explanatory variables, we can include the interaction X1 X2 in the model. Such interaction
variables, considered in Section 18.6.3, allow for more flexibility in how the expected outcome is associated with the
explanatory variables.
# estimate Model I
model1 <- lm_robust(earnwk~married+divorced+widowed, data=cps)
model1
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) 820.512 27.929 29.3785 1.2281e-165 765.7488 875.276 2805
## married 226.198 33.675 6.7170 2.2374e-11 160.1667 292.229 2805
## divorced 81.235 43.091 1.8852 5.9504e-02 -3.2573 165.728 2805
## widowed -159.499 58.786 -2.7132 6.7040e-03 -274.7669 -44.231 2805
# estimate Model II
model2 <- lm_robust(earnwk~married+divorced+widowed+educ+exper+union+female, data=cps)
model2
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) -500.4410 98.1084 -5.10090 3.6061e-07 -692.81309 -308.0690 2801
## married 204.3868 33.2880 6.13994 9.4230e-10 139.11518 269.6583 2801
## divorced 86.3465 41.5833 2.07647 3.7941e-02 4.80958 167.8834 2801
## widowed -50.4511 64.6772 -0.78005 4.3543e-01 -177.27084 76.3685 2801
## educ 108.4820 7.2819 14.89757 2.3140e-48 94.20361 122.7603 2801
## exper 3.3677 1.3992 2.40697 1.6150e-02 0.62425 6.1112 2801
## union 134.2413 44.7788 2.99787 2.7426e-03 46.43840 222.0441 2801
## female -341.3955 25.6764 -13.29607 3.6589e-39 -391.74215 -291.0489 2801
In Model I, with no other explanatory variables, the intercept estimate has a direct interpretation as the estimate of
the expected weekly earnings for the omitted category since
α = E(earnwk|married = divorced = widowed = 0).
The intercept estimate α̂ = 820.51 implies that the average weekly earnings for never-married individuals is $820.51.
Each of the three slope estimates is interpreted as the difference in expected weekly earnings for the corresponding
category and the omitted category. For instance, the slope on the married variable is
βmarried = E(earnwk|married = 1) – E(earnwk|nevermarried = 1),
with the estimate β̂married = 226.20 implying that the estimated difference in expected weekly earnings between married
and never-married individuals is $226.20. Similarly, β̂divorced = 81.24 implies that the estimated difference in expected
weekly earnings between divorced and never-married individuals is $81.24, and β̂widowed = –159.50 implies that the
estimated difference in expected weekly earnings between widowed and never-married individuals is –$159.50. For
any of these differences with the omitted category, we can test the statistical significance of the difference with a z-test
for the appropriate slope estimate. For example, the null hypothesis H0 : βmarried = 0, corresponding to married and
never-married individuals having the same expected weekly earnings, has a z-statistic of
β̂married / se(β̂married ) = 226.20 / 33.68 ≈ 6.72
and p-value of 0.0000. Thus, H0 : βmarried = 0 is rejected at any level, indicating a statistically significant difference
between the expected weekly earnings of married and never-married individuals. The z-tests of H0 : βdivorced = 0 and
H0 : βwidowed = 0 have z-statistics of 1.89 and –2.71, respectively, and p-values of 0.060 and 0.007, respectively. For a
5% level, H0 : βdivorced = 0 is not rejected, while H0 : βwidowed = 0 is rejected.
How about differences for two of the categories included in the MLR model? Let’s say we are interested in the
difference in expected earnings between married and divorced individuals,
βmarried – βdivorced = E(earnwk|married = 1) – E(earnwk|divorced = 1).
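Since this difference is a linear combination of the Model I parameters, its estimate and standard error can be obtained with the linear_combination function; a sketch:
# coefficients (0, 1, -1, 0) on (alpha, beta_married, beta_divorced, beta_widowed)
linear_combination(model1, c(0,1,-1,0))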
would be in the model. For a one-unit change in X1 , from X1 = x1∗ to X1 = x1∗ + 1, the associated change in E(Y|X), holding
the other explanatory variables fixed, is67
E(Y|X1 = x1∗ + 1, X4 = x4∗ , …, XK = xK∗ ) – E(Y|X1 = x1∗ , X4 = x4∗ , …, XK = xK∗ )
= β1 + β2 ((x1∗ + 1)² – (x1∗ )²) + β3 ((x1∗ + 1)³ – (x1∗ )³).
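A sketch of the estimation call for the weekly-earnings model with educ² added, with results as the assumed object name used in the code below:
results <- lm_robust(earnwk~educ+I(educ^2)+exper+union+female, data=cps)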
The second variable is specified as I(educ^2), and the I() syntax tells the lm_robust function to do the
calculation within parentheses for each observation and include the resulting variable in the least-squares estimation.
The following table reports these results, with the original model in the first column and the new model, with educ2
included, in the second column. Standard errors are reported in parentheses.
             MLR model            MLR model
             with educ            with educ, educ²
Intercept    –431.03 (100.08)     883.69 (123.10)
educ         110.94 (7.27)        –122.71 (23.59)
educ²                             9.821 (1.169)
exper        4.897 (1.315)        5.287 (1.294)
union        142.75 (44.39)       143.96 (43.98)
female       –344.79 (25.82)      –345.52 (25.21)
R2           0.166                0.198
σ̂U           685.78               672.64
For a z-test of H0 : βeduc² = 0, the z-statistic is 9.821/1.169 ≈ 8.4 with a p-value of 0.000, indicating that the educ² variable
belongs in the model. Including educ² in the model increases the R-squared value from 16.6% to 19.8% and decreases the
residual standard deviation estimate from $685.78 to $672.64. On the other hand, adding educ² to the model has very
little effect on the estimates of the slopes for exper, union, and female.
Let’s take a closer look at the partial effects of education implied by the two models. For the model with only educ,
the estimated partial effect of a one-year change in education on E(earnwk|X) is constant and equal to β̂educ = 110.94.
For the model with educ² included, the estimated partial effect of a one-year change in education, from educ∗ to
educ∗ + 1, on E(earnwk|X) is
β̂educ + β̂educ² (2educ∗ + 1) = –122.71 + 9.821(2educ∗ + 1).
To calculate a standard error for this partial effect, we use the linear_combination function to calculate
se(β̂educ + β̂educ² (2educ∗ + 1)) for a given value of educ∗ . The following table summarizes the partial effects estimated
by the two models for four possible values of educ∗ , with standard errors reported in parentheses:
Change in E(earnwk|X) when:     MLR model with educ    MLR model with educ, educ²
educ changes from 10 to 11      110.94 (7.27)          83.54 (4.84)
educ changes from 12 to 13      110.94 (7.27)          122.82 (7.65)
educ changes from 14 to 15      110.94 (7.27)          162.11 (11.71)
educ changes from 16 to 17      110.94 (7.27)          201.39 (16.12)
For the quadratic model, the estimated partial effect of education increases a lot as the level of education increases.
For example, the estimated partial effect at educ∗ = 16 is equal to $201.39 and is roughly 64% larger in magnitude
than the estimated partial effect at educ∗ = 12, which is equal to $122.82.
Here is the R code to calculate standard errors for the estimated partial effects in the quadratic model:
# consider educ values of 10, 12, 14, and 16
educ_vec <- c(10,12,14,16)
# calculate the partial effect and standard error for each educ value
for (educ_star in educ_vec) {
  print(linear_combination(results, c(0,1,2*educ_star+1,0,0,0)))
}
## $estimate
## [1] 83.535
##
## $se
## [1] 4.8381
##
## $estimate
## [1] 122.82
##
## $se
## [1] 7.6458
##
## $estimate
## [1] 162.11
##
## $se
## [1] 11.714
##
## $estimate
## [1] 201.39
##
## $se
## [1] 16.115
model, which contains X1 , X2 , the interaction variable X3 = X1 X2 , and additional explanatory variables:
E(Y|X) = α + β1 X1 + β2 X2 + β3 X1 X2 + β4 X4 + · · · + βK XK .
The partial effect of increasing X1 by one unit, from x1∗ to x1∗ + 1, on E(Y|X) is equal to
E(Y|X1 = x1∗ + 1, X2 = x2∗ , …, XK = xK∗ ) – E(Y|X1 = x1∗ , X2 = x2∗ , …, XK = xK∗ )
= β1 + β3 ((x1∗ + 1)x2∗ – x1∗ x2∗ )
= β1 + β3 x2∗ .
Therefore, the partial effect of X1 on E(Y|X) is a function of x2∗ , the value of X2 . When β3 is positive, the partial
effect of X1 on E(Y|X) is an increasing function of x2∗ , and when β3 is negative, the partial effect of X1 on E(Y|X) is a
decreasing function of x2∗ .
Similarly, we can determine the partial effect of increasing X2 by one unit, from x2∗ to x2∗ + 1, on E(Y|X):
E(Y|X1 = x1∗ , X2 = x2∗ + 1, …, XK = xK∗ ) – E(Y|X1 = x1∗ , X2 = x2∗ , …, XK = xK∗ )
= β2 + β3 (x1∗ (x2∗ + 1) – x1∗ x2∗ )
= β2 + β3 x1∗ .
The partial effect of X2 on E(Y|X) depends on x1∗ , with the partial effect increasing in x1∗ if β3 is positive and decreasing
in x1∗ if β3 is negative. Thus, a feature of a model with the interaction variable X1 X2 is that, as long as β3 ≠ 0, it must
be the case that both the partial effect of X1 depends upon X2 and the partial effect of X2 depends upon X1 . When
including X1 X2 in a model, it is also good practice to always include both of the original variables X1 and X2 in the
model, even if it turns out that one or both appear to be insignificant based upon z-tests.
To test whether an interaction variable is statistically significant, a z-test can be used for testing the null hypothesis
H0 : β3 = 0. Rejection of H0 : β3 = 0 indicates statistical significance of the interaction variable, and a failure to reject
H0 : β3 = 0 would support dropping the interaction variable, especially if the p-value is very high.
Example 18.16 (Weekly earnings) Example 18.10 provided least-squares estimates and standard errors for a MLR
model of weekly earnings (earnwk) with educ, exper, female, and union as explanatory variables. To allow for the
possibility that the partial effect of education on weekly earnings depends upon a worker’s experience level, the
interaction variable educ · exper can be added to the MLR model. This inclusion also allows for the partial effect of
experience on weekly earnings to depend upon a worker’s education level. The I() syntax for lm_robust can be
used to include the interaction variable in the model without creating a new variable. For the educ · exper interaction,
the added variable is I(educ*exper):
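A sketch of the call, with results as the assumed object name used below:
results <- lm_robust(earnwk~educ+exper+I(educ*exper)+union+female, data=cps)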
summary(results)$r.squared
## [1] 0.1785
The following table shows the least-squares estimates for the original model without the interaction variable and
the new model with the added interaction variable. Standard errors are reported in parentheses.
The interaction variable seems to belong in the model, as the z-statistic for testing H0 : βeduc·exper = 0 is –3.194/0.529 ≈
–6.04. The negative β̂educ·exper implies that (i) the estimated partial effect of education on weekly earnings (β̂educ +
β̂educ·exper exper∗ ) is a decreasing function of experience and (ii) the estimated partial effect of experience on weekly
earnings (β̂exper + β̂educ·exper educ∗ ) is a decreasing function of education. We calculate standard errors for these partial
effects with the linear_combination function. The following table compares some estimated partial effects based
upon the two models, with standard errors reported in parentheses:
Change in E(earnwk|X) when:               MLR model without interaction    MLR model with educ · exper
educ changes by one unit, exper = 10      110.94 (7.27)                    166.26 (13.79)
educ changes by one unit, exper = 20      110.94 (7.27)                    134.32 (9.38)
exper changes by one unit, educ = 12      4.897 (1.315)                    7.591 (1.265)
exper changes by one unit, educ = 14      4.897 (1.315)                    1.204 (1.601)
For the model without the interaction variable educ · exper, the estimated partial effects of the two variables are
constant, with the estimated partial effect of education given by β̂educ = 110.94 and the estimated partial effect of
experience given by β̂exper = 4.897. For the model with the interaction variable, the estimated partial effect of educ
on E(earnwk|X) declines from $166.26 at 10 years of experience (exper = 10) to $134.32 at 20 years of experience
(exper = 20). Since β̂educ·exper is negative, this estimated partial effect would continue to decrease at even higher values
of exper. The estimated partial effect of exper on E(earnwk|X) declines from $7.59 at 12 years of education (educ = 12)
to $1.20 at 14 years of education. In fact, the estimated partial effect of exper at educ = 14 is not statistically significant,
as the z-statistic for testing the partial effect against zero is 1.204/1.601 ≈ 0.75, which is associated with a p-value of 0.45.
Here is the R code to calculate standard errors for the estimated partial effects in the interaction model:
# consider exper values of 10 and 20
exper_vec <- c(10,20)
# calculate the partial effect and standard error for each exper value
for (exper_star in exper_vec) {
  print(linear_combination(results, c(0,1,0,exper_star,0,0)))
}
## $estimate
## [1] 166.26
##
## $se
## [1] 13.792
##
## $estimate
## [1] 134.32
##
## $se
## [1] 9.3784
# consider educ values of 12 and 14
educ_vec <- c(12,14)
# calculate the partial effect and standard error for each educ value
for (educ_star in educ_vec) {
print(linear_combination(results, c(0,0,1,educ_star,0,0)))
}
## $estimate
## [1] 7.5911
##
## $se
## [1] 1.2648
##
## $estimate
## [1] 1.2038
##
## $se
## [1] 1.6005
When one of the variables in an interaction variable is an indicator variable, we can determine whether, and by how
much, the partial effect of the other variable depends on the two possible values of the indicator variable. For the MLR
model
E(Y|X) = α + β1 X1 + β2 X2 + β3 X1 X2 + β4 X4 + · · · + βK XK ,
consider the case where X1 is an indicator variable, with X1 ∈ {0, 1}. The only relevant one-unit change for X1 is going
from 0 to 1, and the partial effect of X1 on E(Y|X) is equal to
β1 + β3 x2∗
when X2 = x2∗ . The partial effect of X2 on E(Y|X) is equal to
β2 + β3 x1∗ = β2 if x1∗ = 0, and β2 + β3 if x1∗ = 1,
meaning β3 measures the difference between the partial effect of X2 at X1 = 1 and the partial effect of X2 at X1 = 0.
For instance, for the weekly earnings model considered in Example 18.16, we could allow for the partial effect of
education on weekly earnings to depend upon union membership by adding an interaction between educ and union;
alternatively, we could allow for the partial effect of experience on weekly earnings to depend upon union membership
by adding an interaction between exper and union.
Example 18.17 (Monthly stock returns and sample-splitting) The dataset sp500 has 364 monthly observations,
spanning over 30 years between 1991 and 2021. A possible concern with using a regression model for such a long
time horizon is that the “true model” might have changed over time. For instance, what if the relationship between
an individual stock’s return and the market index return has not remained the same over the 30+ years observed in
the data? To address that concern, one approach is to split the sample by creating an indicator variable post2005 that
indicates which observations are after 2005:
post2005 = 1 if the observation is after 2005 (2006–2021), and post2005 = 0 if the observation is 2005 or earlier (1991–2005).
The MLR model
E(HD|IDX, post2005) = α + β1 post2005 + β2 IDX + β3 post2005 · IDX
includes the interaction variable post2005 · IDX, allowing the partial effect of the market index monthly return (IDX)
to depend upon whether post2005 = 0 or post2005 = 1. Specifically, if IDX increases by one unit, the expected HD
return changes by β2 if post2005 = 0 and by β2 + β3 if post2005 = 1. The parameter β3 is, therefore, the difference in
the partial effects of the market index return on the Home Depot return between the 2006-2021 period and the 1991-
2005 period. There’s nothing special about HD (Home Depot) here, so we can replace the outcome variable by the
monthly return of any individual stock. The following table shows the least-squares estimates of this model for four
different stocks: Home Depot (HD), Lowe’s (LOW), Bank of America (BAC), and ConocoPhillips (COP). Standard
errors are reported in parentheses, and a row containing the p-value for the test of the statistical significance of the
interaction variable (H0 : β3 = 0) has been included.
HD LOW BAC COP
Intercept 0.009 (0.005) 0.018 (0.007) 0.008 (0.005) 0.008 (0.005)
post2005 0.000 (0.006) –0.012 (0.008) –0.012 (0.009) –0.010 (0.007)
IDX 1.150 (0.134) 1.070 (0.174) 0.998 (0.138) 0.697 (0.118)
post2005 · IDX –0.239 (0.164) 0.068 (0.214) 0.904 (0.272) 0.635 (0.202)
p-value for H0 : β3 = 0 0.147 0.752 0.001 0.002
Starting with the Home Depot (HD) model, the estimated slope on the interaction variable, β̂3 = –0.239, indicates
that the association between expected HD returns and IDX returns is lower in the post-2005 period. Specifically,
when IDX goes up by 0.01, expected HD is estimated to increase by 0.01β̂2 = 0.01150 in the pre-2005 period and
0.01(β̂2 + β̂3 ) = 0.00911 in the post-2005 period. But, in looking at the p-value of 0.147 for the test of H0 : β3 = 0,
we do not reject that the interaction variable has a true slope of zero at the 10% level. There is limited statistical
evidence of a meaningful difference in the partial effects of IDX on HD between the pre-2005 and post-2005
periods. For the Lowe’s (LOW) model, the p-value of 0.752 for testing H0 : β3 = 0 is very high, meaning we would not
reject H0 : β3 = 0 at any reasonable level. The picture is quite different for the other two stocks, Bank of America (BAC)
and ConocoPhillips (COP), where the p-values for H0 : β3 = 0 are equal to 0.001 and 0.002, respectively. These low
p-values indicate that the interaction variable is statistically significant in both the BAC model and the COP model.
For the BAC model, when IDX goes up by 0.01, expected BAC is estimated to go up by 0.00998 (with standard error
0.00138) in the pre-2005 period and by 0.01902 (with standard error 0.00235) in the post-2005 period.
Here is the R code to calculate the least-squares estimates above, along with the standard error for the post-2005
period IDX partial effect for the BAC model:
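A sketch, assuming sp500 contains a year variable identifying each observation’s calendar year:
# create the post-2005 indicator and estimate the interaction model for BAC
sp500$post2005 <- as.numeric(sp500$year > 2005)
results_bac <- lm_robust(BAC~post2005+IDX+I(post2005*IDX), data=sp500)
results_bac
# post-2005 partial effect of IDX (beta2 + beta3) and its standard error
linear_combination(results_bac, c(0,0,1,1))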
or, equivalently,
E(ln(Y)|X) = α + β1 X1 + β2 X2 + · · · + βK XK .
Estimation of this MLR model proceeds by using ln(Y) as the outcome variable rather than Y. The interpretation of
the model parameters, however, is more complicated. Thinking about the partial effect of the X1 variable, for instance,
a one-unit change from X1 = x1∗ to X1 = x1∗ + 1, holding all other variables fixed, leads to the conditional expectation
E(ln(Y)|·) changing by β1 :
E(ln(Y)|X1 = x1∗ + 1, X2 = x2∗ , …, XK = xK∗ ) – E(ln(Y)|X1 = x1∗ , X2 = x2∗ , …, XK = xK∗ ) = β1 .
Unfortunately, this partial effect is difficult to describe in a useful way since it is in terms of the log-outcome rather
than the outcome itself. For instance, if Y is weekly earnings, the partial effect of a one-unit increase in X1 is a change
of β1 in the conditional expectation of log-earnings. To get an interpretation in terms of Y itself, we use the following
calculus-based approximation:68
∆ln(Y) ≈ ∆Y/Y when ∆Y is small.
Since ∆Y/Y is the percentage change in the variable Y, the partial-effect formula above says that a one-unit change in
X1 , holding all other variables fixed, is associated with an expected percentage change in Y approximately equal to β1 .
Example 18.18 (Weekly earnings) Using the cps dataset, here are the results from least-squares estimation of a MLR
model using log-earnings as the outcome variable and educ, exper, union, and female as the explanatory variables:
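A sketch of the call:
lm_robust(I(log(earnwk))~educ+exper+union+female, data=cps)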
The use of I(log(earnwk)) specifies the natural logarithm of weekly earnings as the outcome variable. Using
the approximation result, a one-year change in education, holding the other variables fixed, is associated with an
expected change of 0.100 or 10.0% in weekly earnings. Similarly, a one-year change in experience, holding the
other variables fixed, is associated with an expected change of 0.0057 or 0.57% in weekly earnings. For the union
indicator variable, where a one-unit change means going from a non-union worker to a union worker, union workers
are expected to earn approximately 0.199 or 19.9% more than non-union workers, holding all else fixed. And, female
workers are expected to earn approximately 0.401 or 40.1% less than male workers, holding all else fixed.
The R-squared value is 0.194 or 19.4%. Importantly, this R-squared value is not comparable to R-squared values
for regressions with weekly earnings as the outcome variable. Since the outcome variable here is log-earnings, the
R-squared value says that the explanatory variables explain 19.4% of the variation in log-earnings, not 19.4% of the
variation in earnings.
In some cases, a regression model may have an explanatory variable that is also log-transformed. Without loss of
generality, let’s say that the first explanatory variable is positive-valued (X1 > 0) and log-transformed, so that
ln(Y) = α + β1 ln(X1 ) + β2 X2 + · · · + βK XK + U with E(U|X) = 0
or, equivalently,
E(ln(Y)|X) = α + β1 ln(X1 ) + β2 X2 + · · · + βK XK .
The approximation ∆ln(X1 ) ≈ ∆X1 /X1 (when ∆X1 is small) also holds here. To have the approximation perform well, it
is best to consider a small change in X1 , say a one-percent change (∆X1 /X1 = 0.01). For a one-percent change in X1 , holding all other
variables fixed, the expected percentage change in Y is then approximately equal to 0.01β1 .
Example 18.19 (Weekly earnings) We modify the MLR model from Example 18.18 to use a log-transformation of the
experience (exper) variable. Here are the results from least-squares estimation of the model:
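A sketch of the call:
lm_robust(I(log(earnwk))~educ+I(log(exper))+union+female, data=cps)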
The slope estimates for educ, union, and female are very similar to the estimates from Example 18.18, and their
interpretations are similar as well. The R-squared is also virtually identical to that found in Example 18.18, suggesting
that the overall fit from this model specification is quite similar to the model specification that used exper without a log
transformation. For the log-transformed exper variable, the slope estimate is 0.133. Therefore, a one-percent change
in experience, holding all other variables fixed, is estimated to be associated with an expected percentage change in
weekly earnings of (0.01)(0.133) = 0.00133, or 0.133%.
Example 18.20 (Cigarette sales and cigarette taxes) To get a partial effect of taxes in terms of percentages, the SLR
model from Example 17.4 can be changed to have log transformations of both the outcome variable (CIGSALES)
and the explanatory variable (CIGTAX). Here are the results from least-squares estimation of the model with both
variables log-transformed:
lm_robust(I(log(cigsales))~I(log(cigtax)), data=cigdata)
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) 3.7086 0.04616 80.341 1.105e-53 3.6158 3.8013 49
## I(log(cigtax)) -0.4277 0.06354 -6.732 1.726e-08 -0.5554 -0.3001 49
A one-percent change in state-level cigarette taxes is estimated to be associated with an expected percentage change
in cigarette sales of (0.01)(–0.428) = –0.00428 or –0.428%. (This estimate quantifies the price elasticity of demand for
cigarettes.)
While the exogeneity assumption E(U|X) = 0 is assumed to simplify exposition, this assumption is not necessary for
the purposes of predicting the value of the outcome variable.69 The approaches described below can be applied even
if there is doubt about the exogeneity assumption holding. That is, even in a model where it may not be possible to
establish causality due to failure of the exogeneity assumption, we can still use least-squares estimation for predictive
purposes.
If the values of the explanatory variables are X1 = x1∗ , X2 = x2∗ , …, XK = xK∗ , the new outcome Y ∗ can be written
Y ∗ = α + β1 x1∗ + β2 x2∗ + · · · + βK xK∗ + U ∗ ,
where U ∗ is the population residual associated with Y ∗ . There are two parts of the outcome Y ∗ , the part linearly related
to the explanatory variables, which is the conditional expectation
E(Y ∗ |X1 = x1∗ , X2 = x2∗ , …, XK = xK∗ ) = α + β1 x1∗ + β2 x2∗ + · · · + βK xK∗ ,
and the part unrelated to the explanatory variables, which is the population residual U ∗ . The least-squares estimates
can be used to estimate the conditional-expectation part, with
Ê(Y ∗ |X1 = x1∗ , X2 = x2∗ , …, XK = xK∗ ) = α̂ + β̂1 x1∗ + β̂2 x2∗ + · · · + β̂K xK∗ .
Due to the consistency of least-squares estimation, this estimate of the conditional-expectation part gets arbitrarily
close to the true conditional expectation of Y ∗ as the sample size grows. In this section, we assume that the sample size
n is large enough that we can ignore any estimation imprecision for the conditional-expectation part of Y ∗ , meaning
the uncertainty associated with the asymptotic predictive interval for Y ∗ comes only from the uncertainty associated
with the population residual U ∗ .70
Since Y ∗ = E(Y ∗ |X) + U ∗ , the distribution of Y ∗ conditional on X is the same as the distribution of U ∗ conditional
on X but shifted by an amount equal to E(Y ∗ |X). The shape of the conditional distribution of Y ∗ given X is identical
to the shape of the conditional distribution of U ∗ given X. Therefore, determining a predictive interval for Y ∗ given X
simplifies to determining a predictive interval for U ∗ given X and then adding Ê(Y ∗ |X), which is a consistent estimate
of E(Y ∗ |X). While the appealing properties of the least-squares estimators, including consistency and asymptotic
normality, do not require any distributional assumptions on the population residuals, imposing additional assumptions
can lead to simplified predictive intervals. To illustrate, we’ll focus on a specific distribution assumption, namely the
assumption of normally distributed residuals.
The remainder of this section considers how a predictive interval can be constructed for Y ∗ given X in four different
cases: (i) U ∗ is normally distributed and homoskedastic, (ii) U ∗ is normally distributed and heteroskedastic, (iii) U ∗
has an unspecified distribution that does not depend on X, and (iv) U ∗ has an unspecified distribution that depends
on X.
Case (i): U ∗ is normally distributed and homoskedastic. In this case, the conditional distribution of U ∗ does not
depend upon the explanatory variables and is always N(0, σU2 ), where σU2 is the unconditional variance of U. The 1 – α
probability interval for U ∗ is
(–zα/2 σU , zα/2 σU ).
Based upon the least-squares estimates, this 1 – α probability interval can be consistently estimated by
(–zα/2 σ̂U , zα/2 σ̂U ),
where σ̂U = √( (1/(n–K–1)) Σ_{i=1}^n ûi² ) is the residual standard deviation estimate. Then, the 1 – α asymptotic predictive interval
for Y ∗ , given the values (x1∗ , x2∗ , …, xK∗ ) for the explanatory variables, is
(α̂ + β̂1 x1∗ + β̂2 x2∗ + · · · + β̂K xK∗ – zα/2 σ̂U , α̂ + β̂1 x1∗ + β̂2 x2∗ + · · · + β̂K xK∗ + zα/2 σ̂U ).
Example 18.21 (Monthly stock returns) Suppose we assume homoskedastic and normally distributed residuals in
the MLR model with Home Depot returns (HD) as the outcome variable and S&P 500 index returns (IDX) and
Lowe’s returns (LOW) as the explanatory variables. From Example 18.6, the residual standard deviation estimate is
σ̂U = 0.052. A 95% predictive interval for HD given IDX = 0 and LOW = 0 is
(α̂ – z0.025 σ̂U , α̂ + z0.025 σ̂U ) = (0.004 – (1.96)(0.052), 0.004 + (1.96)(0.052)) ≈ (–0.098, 0.106),
meaning there is a 95% probability that the Home Depot monthly return is between –9.8% and 10.6% when the S&P
500 and Lowe’s returns are equal to zero. Asymptotic predictive intervals can be formed for other choices of the values
of IDX and LOW. For example, if IDX and LOW are both equal to 0.05, the 95% predictive interval for HD is
0.004 + (0.595)(0.05) + (0.384)(0.05) ± (1.96)(0.052) ≈ (–0.049, 0.155),
and if IDX and LOW are both equal to –0.05, the 95% predictive interval for HD is
0.004 + (0.595)(–0.05) + (0.384)(–0.05) ± (1.96)(0.052) ≈ (–0.147, 0.057).
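These intervals can be computed directly from the least-squares estimates; a minimal sketch, assuming results holds the lm_robust(HD~IDX+LOW, data=sp500) fit and using σ̂U = 0.052 from Example 18.6:
b <- coef(results)
sigmahat <- 0.052
# 95% predictive intervals at three (IDX, LOW) value pairs
for (x in list(c(0,0), c(0.05,0.05), c(-0.05,-0.05))) {
  ehat <- b[1] + b[2]*x[1] + b[3]*x[2]  # estimated conditional expectation
  print(round(c(ehat - 1.96*sigmahat, ehat + 1.96*sigmahat), 3))
}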
Case (ii): U∗ is normally distributed and heteroskedastic. In this case, the conditional distribution of U∗ depends
upon the explanatory variables. While this conditional distribution is normal, by assumption, the conditional variance
of U ∗ depends upon the values of the explanatory variables. To provide a predictive interval for U ∗ , then, we need
to explicitly model the conditional variance of U ∗ as a function of the explanatory variables. A particularly simple
approach is to adopt a model with a linear specification like the original MLR model, with
Var(U|X) = γ + δ1 X1 + δ2 X2 + · · · + δK XK .
The drawback to this model is that, depending upon the parameter values, the conditional variance is not guaranteed
to be positive for all values of the explanatory variables. Note that Var(U|X) = E((U – E(U))2 |X) = E(U 2 |X) since
E(U|X) = 0, so that the conditional-variance model becomes
E(U 2 |X) = γ + δ1 X1 + δ2 X2 + · · · + δK XK .
To estimate the parameters (γ, δ1 , δ2 , …, δK ), the least-squares estimator can be applied to a model with ûi² as the
outcome variable and (xi1 , xi2 , …, xiK ) as the explanatory variables.71 Then, the conditional variance of U given X is
consistently estimated by
V̂ar(U|X) = γ̂ + δ̂1 X1 + δ̂2 X2 + · · · + δ̂K XK ,
and the conditional standard deviation of U given X is consistently estimated by
ŝd(U|X) = √(V̂ar(U|X)) = √(γ̂ + δ̂1 X1 + δ̂2 X2 + · · · + δ̂K XK ) if V̂ar(U|X) > 0.
Then, the estimated 1 – α probability interval for U∗ , given values (x1∗ , x2∗ , …, xK∗ ) for the explanatory variables, is
( –zα/2 √(γ̂ + δ̂1 x1∗ + · · · + δ̂K xK∗ ), zα/2 √(γ̂ + δ̂1 x1∗ + · · · + δ̂K xK∗ ) ).
The 1 – α asymptotic predictive interval for Y∗ , given values (x1∗ , x2∗ , …, xK∗ ) for the explanatory variables, is
( α̂ + β̂1 x1∗ + · · · + β̂K xK∗ – zα/2 √(γ̂ + δ̂1 x1∗ + · · · + δ̂K xK∗ ),
  α̂ + β̂1 x1∗ + · · · + β̂K xK∗ + zα/2 √(γ̂ + δ̂1 x1∗ + · · · + δ̂K xK∗ ) ).
Example 18.22 (Birthweight data) The dataset births contains information on 50,249 births in the United States
during the month of December 2021. Birth outcomes are of interest to health economists since adverse birth outcomes,
like low birthweight, can be associated with high healthcare costs. The MLR model has the outcome variable bweight,
the baby’s birthweight measured in grams, modeled as a function of the following explanatory variables:
age = mother’s age (in years)
hsgrad = 1 if mother is high-school grad and not beyond, 0 otherwise
somecoll = 1 if mother has some college but not college grad, 0 otherwise
collgrad = 1 if mother is a 4-year college grad, 0 otherwise
married = 1 if mother is married, 0 otherwise
smoke = 1 if mother smoked during pregnancy, 0 otherwise
male = 1 if baby is male, 0 otherwise
This model specification has three indicator variables associated with educational-attainment categories, with the
omitted category being nonhsgrad (non-high school graduates). Their estimates should therefore be interpreted as
differences from non-high school graduates.
For the conditional-variance model, we use a specification that, like the MLR model, is a linear function of the
explanatory variables:
Var(U|X) = E(U 2 |X) = γ + δ1 age + δ2 hsgrad + δ3 somecoll + δ4 collgrad + δ5 married + δ6 smoke + δ7 male.
To estimate the parameters of the conditional-variance model, the least-squares estimates (α̂, β̂1 , …, β̂7 ) of the MLR
model are used to construct the estimated residuals ûi for i = 1, …, n. Then, the û2i values for i = 1, …, n are obtained by
squaring each ûi . The squared estimated residuals û2i are used as the outcome variable, and least-squares estimation
yields the estimates (γ̂, δ̂1 , …, δ̂7 ) of the parameters for the conditional-variance model.
The following table shows the least-squares estimates of the MLR model side-by-side with the estimates of the
conditional-variance model. Standard errors are reported in parentheses, and the additional columns provide the
p-values for testing the MLR slopes (βk ’s) against zero and the conditional-variance slopes (δk ’s) against zero.
                 MLR model:           p-value for              Var(U|X) model:        p-value for
                 estimate (s.e.)      H0 : βk = 0              estimate (s.e.)        H0 : δk = 0
α (intercept) 3262.93 (28.13) γ (intercept) 194758.2 (28661.5)
β1 (age) –5.377 (0.828) 0.000 δ1 (age) 3889.3 (850.5) 0.000
β2 (hsgrad) 32.69 (16.57) 0.048 δ2 (hsgrad) –13472.8 (17101.7) 0.431
β3 (somecoll) 66.59 (16.06) 0.000 δ3 (somecoll) –16022.7 (16554.2) 0.333
β4 (collgrad) 92.56 (15.77) 0.000 δ4 (collgrad) –59650.4 (16193.9) 0.000
β5 (married) 44.73 (5.15) 0.000 δ5 (married) –14443.6 (5270.0) 0.006
β6 (smoke) –168.68 (19.94) 0.000 δ6 (smoke) 42771.7 (19780.7) 0.031
β7 (male) 99.39 (4.68) 0.000 δ7 (male) 29592.6 (4774.4) 0.000
Looking at the conditional-variance model estimates, the p-values indicate that age, collgrad, married, smoke, and
male all have statistically significant associations, at a 5% level, with the conditional variance of the MLR residual.
These p-values provide strong evidence that the residual variance depends upon the explanatory variables and,
therefore, that the MLR model residuals are heteroskedastic. For the age variable, the estimate δ̂1 = 3889.3 means that
a one-year increase in age is associated with an estimated increase in the residual variance of 3889.3, holding all
other variables fixed; since birthweight is measured in grams, the residual variance and this estimate are in units of
grams squared. For the smoke variable, the estimate δ̂6 = 42771.7 means that the residual variance for a mother who
smokes during pregnancy is estimated to be 42771.7 larger than the residual variance for a non-smoking mother,
holding all other variables fixed.
The MLR model and Var(U|X) model estimates can be used together to provide a predictive interval for birthweight
based on any specific values of the explanatory variables. For instance, consider a 30-year-old mother (age = 30) who
is a college graduate (collgrad = 1), is married (married = 1), doesn’t smoke during pregnancy (smoke = 0), and has a
male child (male = 1). The estimated conditional expectation of birthweight, based upon the MLR model, is
3262.93 + (–5.377)(30) + (92.56)(1) + (44.73)(1) + (99.39)(1) ≈ 3338.3.
The estimated conditional variance of the MLR residual, based upon the Var(U|X) model, is
194758.2 + (3889.3)(30) + (–59650.4)(1) + (–14443.6)(1) + (29592.6)(1) ≈ 266934.9.
Therefore, the estimated conditional standard deviation of the MLR residual is
√266934.9 ≈ 516.7,
and a 95% predictive interval for birthweight, given these values of the explanatory variables, is
(3338.3 – (1.96)(516.7), 3338.3 + (1.96)(516.7)) ≈ (2325.6, 4351.0).
The R code below, a sketch assuming the births data frame has been loaded and the estimatr package installed, calculates the estimates for the MLR model and Var(U|X) model reported above, along with the predictive interval:
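library(estimatr)
# MLR model for birthweight
mlr <- lm_robust(bweight ~ age + hsgrad + somecoll + collgrad + married +
                   smoke + male, data = births)
summary(mlr)
# conditional-variance model: regress the squared residuals on the same variables
births$uhat_sq <- (births$bweight - predict(mlr, newdata = births))^2
varmod <- lm_robust(uhat_sq ~ age + hsgrad + somecoll + collgrad + married +
                      smoke + male, data = births)
summary(varmod)
# 95% predictive interval for a 30-year-old married college-graduate
# non-smoking mother with a male child
newobs <- data.frame(age = 30, hsgrad = 0, somecoll = 0, collgrad = 1,
                     married = 1, smoke = 0, male = 1)
yhat <- predict(mlr, newdata = newobs)             # approximately 3338.3
sdhat <- sqrt(predict(varmod, newdata = newobs))   # approximately 516.7
c(yhat - qnorm(0.975) * sdhat, yhat + qnorm(0.975) * sdhat)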
Case (iii): U ∗ has an unspecified distribution that does not depend on X. In this case, the conditional distribution
of U ∗ given X is the same as the unconditional distribution of U ∗ since it is assumed that the distribution of U ∗ does
not depend on X. Without the assumption of normality, the quantiles of the U ∗ distribution can’t be determined using
the zα/2 critical values for the normal distribution. Instead, we can directly estimate the desired quantiles of the U ∗
distribution using the corresponding sample quantiles of the estimated residuals ûi . Let v̂q denote the sample q-th
quantile of the distribution of the estimated residuals {û1 , û2 , …, ûn }. Then, the estimated 1 – α probability interval for
U ∗ is
(v̂α/2 , v̂1–α/2 ),
and the estimated 1 – α asymptotic predictive interval for Y ∗ , given the values (x1∗ , x2∗ , …, xK∗ ) for the explanatory
variables, is
(α̂ + β̂1 x1∗ + β̂2 x2∗ + · · · + β̂K xK∗ + v̂α/2 , α̂ + β̂1 x1∗ + β̂2 x2∗ + · · · + β̂K xK∗ + v̂1–α/2 ).
Example 18.23 (Monthly stock returns) Re-visiting Example 18.21, we consider predictive intervals for Home Depot
returns (HD), based upon S&P 500 index returns and Lowe’s returns, but without assuming normality of the residuals.
The estimated residuals {û1 , û2 , …, ûn } have sample 2.5% and 97.5% quantiles
v̂0.025 = –0.113 and v̂0.975 = 0.104,
which can be used to construct 95% predictive intervals for HD. When IDX = LOW = 0, the 95% predictive interval
for HD is
(0.004 – 0.113, 0.004 + 0.104) ≈ (–0.109, 0.108).
When IDX = LOW = 0.05, the 95% predictive interval for HD is
(0.004 + (0.595)(0.05) + (0.384)(0.05) – 0.113, 0.004 + (0.595)(0.05) + (0.384)(0.05) + 0.104) ≈ (–0.060, 0.157).
When IDX = LOW = –0.05, the 95% predictive interval for HD is
(0.004 + (0.595)(–0.05) + (0.384)(–0.05) – 0.113, 0.004 + (0.595)(–0.05) + (0.384)(–0.05) + 0.104) ≈ (–0.158, 0.059).
These predictive intervals are quite similar to those found in Example 18.21 under the assumption of normality. To
construct predictive intervals for other confidence levels, we use different sample quantiles of the estimated residuals.
For example, for 90% predictive intervals, the appropriate sample quantiles are v̂0.05 and v̂0.95 , which are –0.087 and
0.083, respectively.
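A minimal sketch of this residual-quantile approach, assuming the monthly-returns data are in a data frame named returns with variables HD, IDX, and LOW (the data frame name here is illustrative):

library(estimatr)
mlr <- lm_robust(HD ~ IDX + LOW, data = returns)
uhat <- returns$HD - predict(mlr, newdata = returns)   # estimated residuals
v <- quantile(uhat, c(0.025, 0.975))                   # sample residual quantiles
# 95% predictive interval for HD when IDX = LOW = 0.05
yhat <- predict(mlr, newdata = data.frame(IDX = 0.05, LOW = 0.05))
c(yhat + v[1], yhat + v[2])
# for 90% predictive intervals, use quantile(uhat, c(0.05, 0.95)) instead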
Case (iv): U ∗ has an unspecified distribution that depends on X. This case is the most general, as it involves no
assumptions on the conditional distribution of U ∗ given X and, therefore, is always applicable. Unfortunately, this case
is also the most difficult to handle and requires methods beyond the scope of this book. While we used (unconditional)
quantile estimates of the residuals for case (iii), the basic idea is that conditional quantile estimates of the residuals
should be used in this general case. That is, what is the q-th quantile of U ∗ given the values (x1∗ , x2∗ , …, xK∗ ) for the
explanatory variables? This q-th conditional quantile can be modeled as72
vq(U∗|X) = γ^(q) + δ1^(q) X1 + δ2^(q) X2 + · · · + δK^(q) XK ,
where vq (U ∗ |X) represents the q-th quantile of U ∗ given X. Estimation of this model, for any q ∈ (0, 1), requires
a method called quantile regression. If the parameters are estimated consistently for q = α/2 and q = 1 – α/2, the
estimated 1 – α predictive interval of Y ∗ , given the values (x1∗ , x2∗ , …, xK∗ ) for the explanatory variables, is
(α̂ + β̂1 x1∗ + · · · + β̂K xK∗ + γ̂^(α/2) + δ̂1^(α/2) x1∗ + · · · + δ̂K^(α/2) xK∗ ,
α̂ + β̂1 x1∗ + · · · + β̂K xK∗ + γ̂^(1–α/2) + δ̂1^(1–α/2) x1∗ + · · · + δ̂K^(1–α/2) xK∗ ).
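Although quantile regression is beyond the scope of this book, the quantreg package implements it. The sketch below follows the alternative mentioned in note 72, modeling the conditional quantiles of Y directly, and again assumes the illustrative returns data frame:

library(quantreg)
# conditional 2.5% and 97.5% quantiles of HD given IDX and LOW
q_lo <- rq(HD ~ IDX + LOW, tau = 0.025, data = returns)
q_hi <- rq(HD ~ IDX + LOW, tau = 0.975, data = returns)
# 95% predictive interval for HD when IDX = LOW = 0.05
newobs <- data.frame(IDX = 0.05, LOW = 0.05)
c(predict(q_lo, newdata = newobs), predict(q_hi, newdata = newobs))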
A simple version of the conditional Bernoulli model is the linear probability model (LPM), where the conditional
probability of Y = 1 given X is modeled as a linear function of the explanatory variables, similar to the MLR model:
P(Y = 1|X) = α + β1 X1 + β2 X2 + · · · + βK XK .
The interpretation of the LPM parameters is most easily understood in terms of conditional probabilities:
• Meaning of the intercept α: α = P(Y = 1|X1 = X2 = · · · = XK = 0), which is the conditional probability of Y = 1 when
all explanatory variables are equal to zero. Whether α has a practical interpretation depends on whether zero is a
relevant value for each of the explanatory variables.
• Meaning of the slope parameters: Consider the simplest case where each explanatory variable Xk only enters
into the βk Xk term, which rules out polynomials and interaction variables. Similar to the approach for MLR
slope parameters, consider specific values (x1∗ , x2∗ , …, xK∗ ) for the explanatory variables. Then, the partial effect of
increasing Xk by one unit, from xk∗ to xk∗ + 1, is to change the conditional probability P(Y = 1|X) by βk . For instance,
if β2 = 0.03, the partial effect of increasing X2 by one unit is to change P(Y = 1|X) by 0.03 or 3 percentage points.
Interpretation of partial effects in the presence of other types of variables, like categorical variables, polynomial
variables, or interaction variables, is similar to the interpretation for MLR parameters (Section 18.6), except that the
partial effects are interpreted in terms of P(Y = 1|X).
Since E(Y|X) = P(Y = 1|X), the LPM can be re-written as
E(Y|X) = α + β1 X1 + β2 X2 + · · · + βK XK .
This equation has exactly the same form as the MLR model and, therefore, least-squares estimation can be used to
estimate the LPM parameters. Let (α̂, β̂1 , β̂2 , …, β̂K ) denote the least-squares estimates of the LPM parameters, and let
se(α̂) and se(β̂k ), for k ∈ {1, 2, …, K}, denote the corresponding heteroskedasticity-robust standard errors. In addition
to providing estimates of the partial effects discussed above, the least-squares estimates also provide estimates of the
in-sample predicted probabilities. Specifically, the fitted values from least-squares estimation are consistent estimates
of the conditional probability of Y = 1 given the observed values of the explanatory variables:
ŷi = α̂ + β̂1 xi1 + β̂2 xi2 + · · · + β̂K xiK = P̂(Y = 1|X1 = xi1 , X2 = xi2 , …, XK = xiK ).
A drawback of the LPM is that it can imply conditional probabilities that are less than zero and/or greater than
one when there are continuous explanatory variables. As a simple example, consider the case of a single explanatory
variable X1 , so that P(Y = 1|X1) = α + β1 X1 . If β1 ≠ 0 and X1 is a continuous variable whose values may extend
without bound in both directions, it must be the case that P(Y = 1|X1) < 0 for some range of X1 and P(Y = 1|X1) > 1
for some other range of X1 . To formally deal with this issue, practitioners sometimes use a non-linear model for
P(Y = 1|X) that restricts the probability to be strictly between zero and one. (The most popular versions of the
non-linear model are the probit model and the
logit model.) That said, the LPM is frequently used since it leads to easily interpretable estimates of partial effects and,
even for many empirical applications with continuous explanatory variables, most or all of the estimated in-sample
probabilities fall between zero and one.73
Example 18.24 (Widget website) We consider the use of the LPM for A/B testing when the outcome variable is binary.
Specifically, let Y be the binary outcome indicating whether a purchase is made (Y = 1) or not (Y = 0). The explanatory
variables are emailA, which is a binary variable indicating whether a user receives e-mail A (emailA = 1) or not
(emailA = 0), and emailB, which is a binary variable indicating whether a user receives e-mail B (emailB = 1) or not
(emailB = 0). The omitted category, corresponding to emailA = emailB = 0, is for users who do not receive an e-mail
and, therefore, are in the control group. As described in Example 2.1, there are a total of n = 3000 observations, of
which 300 have emailA = 1, 300 have emailB = 1, and 2,400 have emailA = emailB = 0. Of the emailA = 1 observations,
20% (or 60 out of 300) have Y = 1 and 80% have Y = 0. Of the emailB = 1 observations, 22% (or 66 out of 300) have
Y = 1 and 78% have Y = 0. And, of the emailA = emailB = 0 observations, 15% (or 360 out of 2,400) have Y = 1 and 85%
have Y = 0. The dataset widgets contains the emailA, emailB, and purchase (Y) variables for the 3,000 observations.
The LPM is P(Y = 1|emailA, emailB) = α + β1 emailA + β2 emailB. The least-squares estimates of the LPM parameters
are provided in the following table, along with standard errors, z-statistics, and p-values for the z-test of each
parameter being equal to zero.
Estimate Standard error z-statistic p-value
α 0.1500 0.0073 20.58 0.000
β1 0.0500 0.0243 2.06 0.039
β2 0.0700 0.0250 2.80 0.005
The parameter estimates are as expected, with
P̂(Y = 1|emailA = emailB = 0) = α̂ = 0.15,
P̂(Y = 1|emailA = 1) = α̂ + β̂1 = 0.15 + 0.05 = 0.20,
and
P̂(Y = 1|emailB = 1) = α̂ + β̂2 = 0.15 + 0.07 = 0.22.
The estimates β̂1 = 0.05 and β̂2 = 0.07 indicate that e-mail A recipients and e-mail B recipients are 5 percentage points
and 7 percentage points more likely, respectively, than non-recipients to make a purchase. The p-value of 0.039 for
H0 : β1 = 0 indicates that, at a 5% level, there is a statistically significant difference between the purchase probability
for e-mail A recipients and the control group. Similarly, the p-value of 0.005 for H0 : β2 = 0 indicates that, for any level
above 0.5%, there is a statistically significant difference between the purchase probability for e-mail B recipients and
the control group. A z-test of H0 : β1 = β2 , which can be conducted by either making e-mail A recipients the omitted
category or by directly calculating se(β̂1 – β̂2 ), has a p-value of 0.548. Therefore, there is no statistically meaningful
difference between the purchase probabilities of e-mail A recipients and e-mail B recipients since H0 : β1 = β2 cannot
be rejected at any reasonable level.
The following R code, a sketch assuming the widgets data frame has been loaded, calculates the LPM estimates above, with the p-value for the H0 : β1 = β2 test obtained by making e-mail A recipients the omitted category:
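library(estimatr)
lpm <- lm_robust(purchase ~ emailA + emailB, data = widgets)
summary(lpm)   # estimates, standard errors, z-statistics, and p-values
# test H0: beta1 = beta2 by making e-mail A recipients the omitted category;
# the slope on emailB then estimates beta2 - beta1
widgets$noemail <- 1 - widgets$emailA - widgets$emailB
lpm2 <- lm_robust(purchase ~ emailB + noemail, data = widgets)
summary(lpm2)  # p-value on emailB tests H0: beta1 = beta2 (approximately 0.548)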
Example 18.25 (Union membership) The previous examples using the cps dataset considered weekly earnings as the
outcome variable. In this example, we instead consider union membership, measured by the indicator variable union,
as the outcome variable. The least-squares estimates of the LPM describe how other variables are associated with,
and can be used to predict, union membership for the sample of 2,809 employed individuals. The following R code,
a sketch assuming the cps data frame has been loaded, provides the least-squares estimates for one such LPM, with
explanatory variables educ, exper, exper2 (the square of exper), and female:
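library(estimatr)
cps$exper2 <- cps$exper^2   # construct exper2 if not already present
lpm_union <- lm_robust(union ~ educ + exper + exper2 + female, data = cps)
summary(lpm_union)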
We can also use the LPM estimates to directly predict the probability of union membership for any specific values
of the explanatory variables. For example, for a female worker (female = 1) with 12 years of education (educ = 12) and
15 years of experience (exper = 15, exper2 = 225), the predicted probability of union membership is
–0.1344 + (0.0099)(12) + (0.00843)(15) + (–0.00011)(225) + (–0.0610)(1) ≈ 0.0249 or 2.49%,
with a standard error of 0.0097 (or 0.97%).
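This predicted probability and its standard error can be obtained directly from the fitted model; a sketch continuing from the code above:

newobs <- data.frame(educ = 12, exper = 15, exper2 = 225, female = 1)
predict(lpm_union, newdata = newobs, se.fit = TRUE)
# fit is approximately 0.0249, with se.fit approximately 0.0097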
Notes
60 For two explanatory variables (K = 2), the graph is three-dimensional. The MLR conditional expectation, E(Y|X) = α + β1 X1 + β2 X2 , is a
two-dimensional plane that extends forever in three-dimensional space. (Think of a sheet of paper that may be tilted and extends forever.) If y is measured
vertically, positive residuals are associated with data points that are above the MLR plane, negative residuals are associated with data points that are
below the MLR plane, and the magnitude of a residual is the vertical distance from the data point to the plane.
61 If either Assumption MLR-VarX or Assumption MLR-NPC is violated, the minimization problem has an infinite number of possible solutions.
For instance, if x1 has zero variance, there are an infinite number of combinations of a and b1 that could be chosen to minimize S(a, b1 , b2 , …, bK ).
62 Six years of age is roughly when children in the United States start their education, so the experience variable approximates the number of
post-education years. The variable is not a perfect measure of experience since it doesn’t take into account any period(s) of unemployment.
63 Although not explicitly stated in Proposition 18.2, additional technical assumptions are required to prove the result. Specifically, we require
that U has a finite variance, each component of X has a finite variance, and the covariance between any two components of X is finite.
64 sx̃k can be calculated as the sample standard deviation of the estimated residuals from least-squares estimation of a model that has xk as the
outcome variable and all other x variables as the explanatory variables, which corresponds to the decomposition described in Proposition 18.6.
65 Alternatively, lm_robust(earnwk~marstatus, data=cps) automatically creates three indicator variables from marstatus.
66 An alternative approach is to use derivatives to approximate the change in E(Y|X). A small change dx1 in X1 , from x1∗ to x1∗ + dx1 , is associated
with a change of E(Y|X) equal to (β1 + 2β2 x1∗)dx1 since ∂E(Y|X)/∂x1 = β1 + 2β2 x1∗.
67 For the derivative approach, a small change dx1 in X1 , from x1∗ to x1∗ + dx1 , is associated with a change of E(Y|X) equal to
(β1 + 2β2 x1∗ + 3β3 x1∗2)dx1 since ∂E(Y|X)/∂x1 = β1 + 2β2 x1∗ + 3β3 x1∗2.
68 This approximation is based upon the derivative of the natural logarithm, d ln(Y)/dY = 1/Y.
69 Proposition 17.11 can be generalized to the case of multiple explanatory variables, with the decomposition of Y into a linear function of the
explanatory variables and a random variable uncorrelated with the explanatory variables: Y = α∗ + β1∗ X1 + β2∗ X2 + · · · + βK∗ XK + V, with Cov(Xk , V) = 0
for k ∈ {1, 2, …, K} and E(V) = 0. For this decomposition, the least-squares estimates (α̂, β̂1 , β̂2 , …, β̂K ) consistently estimate (α∗ , β1∗ , β2∗ , …, βK∗ )
even if the exogeneity assumption of the MLR model doesn't hold.
70 For small sample sizes, where there may be imprecision in the estimate of E(Y ∗ |X), the resulting asymptotic predictive interval is not wide
enough. One approach for gauging whether imprecision in the estimate of E(Y ∗ |X) affects the predictive interval is to utilize the bootstrap from
Chapter 15. Specifically, a predictive interval can be constructed based upon each bootstrap sample to see how much the predictive interval varies
over bootstrap samples. If the estimates of E(Y ∗ |X) are very precise, there should be little difference in the predictive intervals over bootstrap
samples.
71 There are alternative conditional-variance models that guarantee positive estimated conditional variances. One oft-used model is the nonlinear
(exponential) model
E(U2 |X) = eγ+δ1 X1 +δ2 X2 +···+δK XK ,
for which the parameters (γ, δ1 , δ2 , …, δK ) can be estimated by many statistical packages. With consistent estimates (γ̂, δ̂1 , δ̂2 , …, δ̂K ), the
estimated conditional standard deviation of U∗ given (x1∗ , x2∗ , …, xK∗ ) is √(eγ̂+δ̂1 x1∗ +···+δ̂K xK∗).
72 An alternative approach is to directly model the conditional quantiles vq(Y|X) rather than using the E(Y|X) model at all.
73 With only discrete explanatory variables, P(Y = 1|X) may be guaranteed to be between zero and one. A simple example is an LPM with a single
binary variable X1 , for which P(Y = 1|X1 = 0) = α and P(Y = 1|X1 = 1) = α + β.
Exercises
1. Use the widgets dataset for this question. These data are for 3,000 users, 300 of whom receive e-mail A (emailA = 1),
300 of whom receive e-mail B (emailB = 1), and 2,400 of whom receive neither (emailA = emailB = 0). The outcome
variable of interest is amount, which is the total amount purchased (in dollars) by the user.
(a) How many users have amount = 0?
(b) What are the sample averages of amount for the three subsamples corresponding to e-mail A recipients, e-mail
B recipients, and non-recipients?
(c) Use lm_robust to estimate the multiple regression of amount on emailA and emailB. Interpret the intercept
estimate and the two slope estimates. How do these estimates relate to the sample averages in (b)?
(d) What is the p-value for the z-test of H0 : βemailA = 0? What do you conclude from this p-value?
(e) Create a binary variable nonrecipient equal to 1 for non-recipients and 0 for e-mail A and e-mail B recipients.
Re-run the regression using emailA and nonrecipient as the explanatory variables. Test whether there is a
significant difference, at the 5% level, between average purchases for e-mail A users and e-mail B users.
2. Use the metricsgrades dataset for this question. These data are from a graduate econometrics course with 68
students, containing the following variables:
total = overall composite course grade (out of 100 points)
gre_quant = score on GRE quantitative test (out of 170 points)
gre_verbal = score on GRE (English) verbal test (out of 170 points)
domestic = 1 if domestic (U.S.) student, 0 otherwise
(a) Provide the sample correlation matrix for the four variables. Which variable has the largest correlation (in
magnitude) with total?
(b) Use lm_robust to estimate the multiple regression with total as the outcome variable and the other three
variables as explanatory variables.
(c) Interpret the estimate of βgre_quant .
(d) What is the estimated conditional expectation of total for a non-domestic student with a GRE quantitative score
of 160 and a GRE verbal score of 150?
(e) What is the estimated standard deviation of the regression model’s residual?
(f) Test H0 : βdomestic = 0 at a 10% level. What do you conclude?
(g) Drop domestic from the regression and re-run it. How do the results compare to the original regression?
(h) Now put domestic back in the regression and instead drop gre_verbal. Re-run the regression. What happens to
the statistical significance of domestic, and why?
3. Use the cigdata dataset for this question. Example 18.8 provided the results from a regression of cigsales on cigtax
and producer.
(a) Add the variable price_pack (equal to the total price per pack) to the model and re-run the regression using
lm_robust.
(b) How does the R-squared value of this regression compare to the R-squared value of the regression without the
price_pack variable?
(c) What happens to the statistical significance of the slope on the state tax (cigtax)?
(d) Considering the correlation between cigtax and price_pack, explain the result in (c).
(e) Now drop the state-tax (cigtax) variable, and re-run the regression with price_pack and producer as the
explanatory variables. How do the results compare to the regression in Example 18.8?
(f) Do you prefer the MLR model with cigtax and producer or the MLR model with price_pack and producer?
4. Use the mutualfunds dataset for this question. The sample, consisting of 206 mutual funds categorized as “Large
Blend Equity” by Morningstar, includes the following variables:
return_10yr = ten-year annualized return
expense_ratio = annual fee (e.g., 0.005 is an annual fee of 0.5%)
manager_tenure = tenure of current fund manager (in years)
fund_age = age of fund (in years)
load = "Y" if fund has a sales charge, "N" otherwise
(a) Add the binary variable hasload, equal to 1 if the fund has a sales charge and 0 otherwise, to the data frame.
(b) Use lm_robust to estimate the multiple regression with return_10yr as the outcome variable and the four
explanatory variables expense_ratio, manager_tenure, fund_age, and hasload.
(c) Interpret the R-squared value.
(d) If the expense_ratio value increases by 0.001 (0.1%), what does the estimate of βexpense_ratio imply about the
conditional expectation of return_10yr?
(e) Do any of the variables appear to be statistically significant at a 5% level? If so, which one(s)?
(f) Provide a 90% asymptotic confidence interval for βhasload .
(g) *You are considering dropping the fund_age and manager_tenure variables from the regression. You are
worried about multicollinearity, so you want to test them jointly. Use the test_linear_restrictions
function to determine the p-value for the test of H0 : βfund_age = βmanager_tenure = 0.
5. Use the congress dataset for this question. The data consist of Congressional election outcomes in the United States
between 1948 and 1990. For this question, you will focus on the subsample of 476 Congressional district elections that
occurred in 1990. Each election is between a Democrat and a Republican, where demvoteshare (between 0 and 1) gives
the fraction of votes received by the Democrat, meaning that the Democrat won the election if demvoteshare > 0.5. The
explanatory variables of interest are:
medianincome = median income within the district
pcturban = fraction (between 0 and 1) of district residents who live in an urban area
pctblack = fraction (between 0 and 1) of district residents who are black
pcthighschl = fraction (between 0 and 1) of district residents who have a high-school degree
(a) Plot the histogram of demvoteshare. Is the distribution unimodal or bimodal?
(b) The variable democrat is equal to 1 if the Democrat won and 0 if the Republican won. Of the 476 elections in
1990, what fraction were won by Democrats?
(c) Provide the sample averages of the explanatory variables separately for elections won by Democrats and
elections won by Republicans.
(d) Run the necessary simple linear regressions (with democrat as the explanatory variable) to test, at a 5% level,
whether each of the explanatory variables has a different population mean in the two subsamples.
(e) Use lm_robust to estimate the multiple regression with demvoteshare as the outcome variable and the four
explanatory variables above.
(f) Interpret the slope estimate for medianincome, thinking about a $1,000 change.
(g) Interpret the slope estimate for pcturban, thinking about a change of 10 percentage points.
(h) Add lagdemocrat, which is equal to 1 if the Democrat won the previous election and 0 otherwise, and re-run
the regression. How do the results change?
(i) Interpret the slope estimate for lagdemocrat in the regression in (h), and provide an asymptotic 95% confidence
interval for βlagdemocrat .
(j) Without actually doing it, imagine re-running the regression in (e) using repvoteshare = 1 – demvoteshare as
the outcome variable and the same explanatory variables. Describe how the slope estimates, their z-statistics,
and their p-values would change.
6. Use the hrs dataset for this question. The data consist of 6,052 non-married individuals who are 50 and older.
For this question, focus on the subsample of 3,983 individuals with positive (non-zero) out-of-pocket medical costs
(outofpocket_costs, in dollars) during 2000. The explanatory variables of interest are:
age = individual’s age (in years)
ins_none = 1 if individual has no health insurance, 0 otherwise
ins_medicare = 1 if individual has Medicare insurance, 0 otherwise
male = 1 if individual is male, 0 otherwise
(a) Use lm_robust to estimate the multiple regression with outofpocket_costs as the outcome variable and the
four explanatory variables above. Interpret the slope estimates for age and ins_none.
(b) Draw a histogram of outofpocket_costs.
(c) Create a new variable, ln_oopc, equal to the natural logarithm of outofpocket_costs.
iv. *Use the linear_combination function to determine the standard errors for the two partial effects
in (e)(iii).
(f) As a benchmark, actual inflation was 3.2% in 2006, 2.9% in 2007, and 3.8% in 2008. Therefore, a sensible
forecast for inflation should probably fall in the 2% to 5% range. Create a binary variable accurate equal to 1 if
2 ≤ inflation_pred ≤ 5 and 0 otherwise. Run an LPM regression with accurate as the outcome variable, using
the same explanatory variables as in (a). Interpret the slope estimates for finlit_score and collgrad.
10. You have a sample of 1,000 graduating seniors from a certain university. The outcome variable y is equal to 1 if the
student has a job offer and 0 otherwise. The explanatory variable x is equal to 1 if the student is an economics major
and 0 otherwise. The joint sample counts are given by the following table:
                          econ (x)
                        0        1
  offer (y)    0      270       20
               1      630       80
(a) For the LPM model P(Y = 1|X) = α + βX, what are the least-squares estimates α̂ and β̂? Use only the table above
to answer this part.
(b) Interpret the slope estimate β̂.
(c) In R, create a data frame with 1,000 rows and 2 columns that corresponds to the table of joint sample counts
above. Use lm_robust to confirm the answer to (a). What is the p-value for testing H0 : β = 0?
11. Use the births dataset considered in Example 18.22. Since the healthcare costs associated with births are mostly
concentrated on babies with low birthweight, public health researchers and economists use a specific definition of “low
birthweight” that corresponds to babies having birthweight less than 2500 grams.
(a) Create a new variable lowbwt equal to 1 if bweight is less than 2500 and 0 otherwise.
(b) Run an LPM regression with lowbwt as the outcome, using the explanatory variables from Example 18.22.
(c) What are the highest and lowest LPM fitted values (predicted probabilities)?
(d) Plot the LPM fitted values (predicted probabilities) versus age.
(e) Interpret the LPM slope estimate for smoke.
(f) What do the LPM results say about the difference in low-birthweight probabilities for college-graduate mothers
(collgrad = 1) versus high-school graduate mothers (hsgrad = 1), holding all other variables fixed?
(g) What is the p-value associated with testing that there is no difference in (f)? For this part, either test the linear
combination directly or re-run the LPM with a different omitted education category.
(h) Provide a 95% confidence interval for the difference in low-birthweight probabilities between a 30-year-old
mother and a 25-year-old mother, holding all other variables fixed. Is the difference statistically significant at a
5% level?
(i) Add a quadratic age variable (age squared) to the LPM and re-run the regression. Provide a 95% confidence
interval for the difference in low-birthweight probabilities between a 30-year-old mother and a 25-year-old
mother, holding all other variables fixed. Is the difference statistically significant at a 5% level?
(j) For the LPM regression in (i), you should see very high p-values for the two age-variable slopes. Does this
result suggest that the two age variables should be dropped from the model? Explain why or why not.
(k) *For the LPM regression in (i), use the test_linear_restrictions function to determine the p-value
for testing that the slopes for the two age variables are both equal to zero.
12. *Use the brands dataset for this question. Refer to Exercise 16.12. Run an LPM regression with purchase as the
outcome variable and four indicator variables (for four of the five possible values of last_brand) as explanatory
variables. Use the regression results and the test_linear_restrictions function as necessary to answer parts
(b) and (e) of Exercise 16.12.