Textbook
ECO 329, Fall 2024
Jason Abrevaya
Contents
1 The basics of R
1.1 Installing R
1.2 Arithmetic operations and mathematical functions
1.3 Variables and data types
1.4 Vectors
1.5 Output
1.6 Programming
1.7 Writing functions
1.8 Data frames and file input
1.9 Missing values
1.10 R packages
Exercises
1 The basics of R
1.1 Installing R
R is a statistical programming language for data analysis and visualization that is widely used by economists and data
scientists. This book uses R to illustrate statistical concepts and to implement analytical methods. For the best
experience working with R, readers should also install the software package RStudio. While RStudio is, strictly
speaking, not required to run R, it provides a user-friendly graphical interface that makes R much easier to use.
RStudio provides an advanced editor with features that include syntax highlighting, code completion, and debugging.
It also has several tools to streamline data analysis, including a workspace viewer and a plotting window.
Both R and RStudio are available for standard operating systems (Windows, macOS, Linux) and can be downloaded
for free at https://cran.r-project.org and https://posit.co/downloads, respectively. Be sure to
download and install R prior to downloading and installing RStudio.
RStudio has four main “panes” as part of its interface, as seen in Figure 1.1:
1. Source (top left): This pane allows the user to create, edit, and save R scripts. Datasets can also be browsed in this
pane. This pane only appears when a script or dataset has been opened.
2. Console (bottom left): This pane allows the user to enter commands directly in R and see the output.
3. Environment/History (top right): This pane displays the current workspace, including datasets, variables, and
functions that have been introduced. This pane also provides access to previous commands and output.
4. Plots/Help/Packages/Files (bottom right): This pane displays plots, provides access to R documentation, allows the
installation and management of R packages, and allows direct access to file directories.
Within RStudio, there are two main options for writing R commands:
• The console: This “command line” interface appears when RStudio is opened. Commands can be typed directly
into the console, and R executes them immediately. Using the console is a good option for simple calculations or
experimentation with R functions.
• R scripts: An R script is a text file containing a series of R commands. A script file can be edited and saved, making
it useful for more complex data analysis and programming tasks. With RStudio, it is straightforward to run an entire
script or a selected section of a script.
The two options have their advantages and disadvantages. The console is quick and easy to use, but it can be difficult
to keep track of all the commands that have been issued. An R script, on the other hand, makes it easier to organize
code, make code more readable, and save code. But, since many commands from an R script are executed at once, it
may take more time to figure out why code is not performing as expected.
Throughout the book, we use the console option for R commands and output to demonstrate how commands work.
However, all R code is accessible as script files on the companion website https://www.probstats4econ.com,
organized by chapter and section. These files enable readers to run large chunks of code conveniently. Moreover,
readers can modify scripts to see the impact of code edits on output, add additional analysis, or cut and paste from
script file(s) to create new scripts.
Figure 1.1: The four panes in RStudio

1.2 Arithmetic operations and mathematical functions
R can serve as a calculator, with the standard arithmetic operators + (addition), - (subtraction), * (multiplication), / (division), and ^ (exponentiation).
# examples of arithmetic operations
5+3
## [1] 8
5*3
## [1] 15
5/3
## [1] 1.666667
5^3
## [1] 125
In this block of code, the first line is a “comment” that is not executed by R. When writing R code, we can start any
line with the number-sign character (#) to indicate that the line is a descriptive comment rather than a command to
be evaluated. The first command that gets evaluated is 5+3, and R returns the answer 8. Any printed R output in this
book is on a line starting with two number-sign characters (##). How about the [1] that appears right after ##? In
cases where a single value is returned, the [1] appears before the output. Later in this chapter, we consider cases in
which multiple values are returned simultaneously (as part of a “vector”), and when there are multiple lines of output,
the number within the [] brackets for a particular line indicates which element is being shown first.
R follows the standard mathematical order of operations, which from highest to lowest priority is:
1. Parentheses
2. Exponentiation
3. Multiplication and division (performed from left to right)
4. Addition and subtraction (performed from left to right)
(3+4)^2
## [1] 49
3+4^2
## [1] 19
3+2*4^2
## [1] 35
3+2*4+2
## [1] 13
(3+2)*(4+2)
## [1] 30
For the first expression, the addition within the parentheses occurs before the exponentiation. For the second
expression, the exponentiation occurs before the addition. For the third expression, the exponentiation occurs first,
followed by the multiplication and then the addition. For the fourth expression, the multiplication occurs before the
two additions. For the fifth expression, the expressions within both parentheses are evaluated before being multiplied together.
In addition to arithmetic operators, R has many mathematical functions that facilitate calculations and data analysis.
Here are some examples of commonly used mathematical functions:
• abs(x): Calculates the absolute value |x|.
• sqrt(x): Calculates the square root √x.
• exp(x): Calculates the exponential value e^x. exp(1) returns Euler’s constant e ≈ 2.718282 since e^1 = e.
• log(x): Calculates the natural logarithm ln(x).
• factorial(x): Calculates the factorial x! = x(x – 1)(x – 2) · · · (3)(2)(1).
The following output shows examples using abs(x) and sqrt(x).
5*abs(-3)
## [1] 15
sqrt(17)/2
## [1] 2.061553
R functions, like the mathematical functions above, can have a required argument or multiple required arguments,
along with optional arguments that can be specified during function calls. To determine these arguments for any
function, R documentation or “help” can be requested by typing a question mark (?) followed by the function name. For
example, the command ?sqrt requests the documentation for the sqrt function, which then appears in the bottom-
right window of the RStudio interface. The documentation indicates that the “usage” for the function is sqrt(x),
meaning it has a single required argument x and no optional arguments. In the example above, the argument is a single
number (17), though the documentation specifies that the argument x can be a “numeric or complex vector or array.”
The concept of vectors is described in Section 1.4.
The log function has an optional argument, which can be confirmed by requesting documentation with ?log.
The documentation indicates the “usage” for the function is log(x, base = exp(1)), meaning it has a required
argument x and an optional argument base. If the optional argument is omitted, then its default value is exp(1),
resulting in the natural logarithm (base e). To calculate a base 10 logarithm, there are two ways to call the log function
with the optional argument.
log(100,base=10)
## [1] 2
log(100,10)
## [1] 2
log(100)
## [1] 4.60517
The first and second commands are equivalent, with the first explicitly using the name of the base argument and the
second relying on the “usage” having the base as the second argument. The former approach is generally preferred, as
a function may have multiple optional arguments, and specifying the name of the optional argument avoids confusion
and mistakes. The third command uses no optional argument, so the natural logarithm (base e) of 100 is calculated.
Two additional examples of mathematical functions with optional arguments are round and signif:
• round(x, digits = 0): Rounds a number x to a specified number of digits. If the argument digits is
not specified, the default value is 0, in which case the function rounds to the nearest integer.
• signif(x, digits = 6): Rounds a number x to a specified number of significant digits, given by the
argument digits (whose default value is 6).
round(50/3)
## [1] 17
round(exp(1),digits=4)
## [1] 2.7183
signif(50/3,digits=5)
## [1] 16.667
1.3 Variables and data types
A variable stores a value that can be referenced and updated in later commands. The assignment operator <- assigns a value to a variable, as the following examples illustrate.
x <- 8
x
## [1] 8
x+5
## [1] 13
x <- 2*x
x
## [1] 16
frog
## Error in eval(expr, envir, enclos): object ’frog’ not found
The first command assigns the value 8 to the variable x. This command has no output associated with it. The next
command, simply x, does provide output, corresponding to the value 8 stored in the variable x. The variable can then
be used in other expressions. The third command outputs the value of x+5, which is 13. Importantly, this command
does not change the value of the variable x, which is still 8. The fourth command does change the value of the variable
x with another assignment operator <-. Specifically, the new value assigned to x is two times the old value of x, or
16. The last command shows that an error message is returned when we refer to a variable name, in this case frog,
that has not been assigned a value.
Variables can store different types of data. The variable x above has a numeric value, but variables can also contain
text strings, logical values (indicating true or false), and other types of data. The data type of a variable can be
determined by the class() function.
y <- 3.4
class(y)
## [1] "numeric"
str <- "Economic statistics"
class(str)
## [1] "character"
R has several basic data types, including the following:
• numeric: a data type for numbers (e.g., 3.4 or 8)
• character: a data type for text strings (e.g., "Economic statistics")
• logical: a data type for the logical values TRUE and FALSE
• factor: a data type for a categorical variable with a fixed set of possible values (e.g., a variable for eye color
with the three possible values Blue, Brown, and Other or a variable for restaurant ratings with the four possible
values Excellent, Good, Fair, and Poor)
If a variable is no longer needed, we can delete it with the rm function. For example, if the variable x has been assigned
a value, the command rm(x) removes the variable from the R working environment. The command rm(list =
ls()) removes all variables from the R working environment, though it should be used with caution since R does
not ask for confirmation when the rm function is called.
Logical values arise from comparisons. R’s comparison operators are == (equal to), != (not equal to), < (less than), <= (less than or equal to), > (greater than), and >= (greater than or equal to). Note that equality is tested with two equal signs (==), not one.
8 == 3+5
## [1] TRUE
x <- 16
x > 12
## [1] TRUE
xsmall <- (x<=9)
xsmall
## [1] FALSE
The first command returns TRUE since 8 is exactly equal to 3+5. The second command assigns the value 16 to the
variable x, and the third command returns TRUE since x is strictly greater than 12. The fourth command assigns the
value of x<=9, which is FALSE since x is not less than or equal to 9, to the variable xsmall. As a result, the variable
xsmall has a logical data type, and the last command outputs the value of xsmall.
The logical operators “and” and “or” are represented by the symbols & and |, respectively, and can be used to
combine multiple logical values and return a new logical value based on the combination. The “and” operator (&)
returns TRUE if both of the logical values being considered are TRUE and FALSE otherwise. The “or” operator (|)
returns TRUE if at least one of the logical values being considered is TRUE and FALSE otherwise.
x <- 16
(x>12) & (x<=9)
## [1] FALSE
(x>12) | (x<=9)
## [1] TRUE
The “not” operator (!) can be applied to a logical value to return the opposite value. If v is a logical variable, the
expression !v is FALSE if v is TRUE and TRUE if v is FALSE.
x <- 16
(x>12)
## [1] TRUE
!(x>12)
## [1] FALSE
Parentheses can be used for more complex logical expressions and to control the order of operations.
x <- 16
y <- 30
((y>2*x)|(y<3*x)) & (abs(x-y)!=10)
## [1] TRUE
In this example, ((y>2*x)|(y<3*x)) is TRUE since (y<3*x) is TRUE. The absolute value of the difference
between x and y is not equal to 10, so the expression (abs(x-y)!=10) is also TRUE, meaning the overall
expression obtained by applying the “and” (&) operator is also TRUE.
A useful feature of logical values is that we can perform mathematical operations on them. When a logical value is
included in a mathematical expression in R, a TRUE value is treated as a one and a FALSE value is treated as a zero.
The following R code provides a few simple examples.
TRUE+FALSE
## [1] 1
x <- 4.3
y <- 8.7
1*(x<4)
## [1] 0
(x>4)*(x<7)
## [1] 1
(y>4)*(y<7)
## [1] 0
The first expression, TRUE+FALSE, evaluates as 1+0. After setting the values of the x and y variables, the
1*(x<4) expression returns 0 since (x<4) is false, (x>4)*(x<7) returns 1 since (x>4) and (x<7) are both
true, and (y>4)*(y<7) returns 0 since (y>4) is true and (y<7) is false.
As will be seen in Section 1.6, logical data types also provide a convenient way to control the flow of a program or
to make decisions based on whether certain conditions hold.
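Turning to strings, a character value is assigned to a variable in the same way as a numeric value. The short example below matches the description that follows.
str <- "intro to statistics"
str
## [1] "intro to statistics"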
The first command assigns the (string) value "intro to statistics" to a variable named str, and the
second command outputs the value of the variable str.
There are several useful functions for manipulating strings, including the following:
• nchar(x): Returns the number of characters in the string x.
• toupper(x) and tolower(x): Convert all the characters in the string x to uppercase or lowercase,
respectively.
• substr(x, start, stop): Extracts a portion of the string x, called a “substring,” that starts at the character
indicated by the start parameter and ends at the character indicated by the stop parameter.
• paste(..., sep = " "): Takes one or more values (strings, numbers, etc.) as arguments and pastes them
together as a single string. The optional parameter sep, whose default value is a space (" "), is inserted between
each of the strings being pasted together.
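The following examples apply these string functions; the specific values assigned to x and y here are illustrative.
str <- "intro to statistics"
nchar(str)
## [1] 19
toupper(str)
## [1] "INTRO TO STATISTICS"
substr(str,3,11)
## [1] "tro to st"
x <- 3
y <- 7
paste("x is",x,"and y is",y)
## [1] "x is 3 and y is 7"
paste("x",x,"y",y,sep="")
## [1] "x3y7"
str2 <- paste(str,str)
str2
## [1] "intro to statistics intro to statistics"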
This example creates the str variable as a string, as before with a character data type. The outputs of the nchar
and toupper functions are self-explanatory. For the substr(str,3,11) command, the output is the substring
that starts at the third character of str, which is the first t, and ends at the 11th character of str, which is the
first t in statistics. The remaining commands illustrate how the paste function can be used. After the x and y
variables are assigned values, the first paste function results in a string that is output with the default space (" ")
separator, and the second paste function results in a string that is output with no separator between the arguments.
The command str2 <- paste(str,str) pastes two copies of str together, with the default space separator.
Sometimes it is useful to know whether or not a particular substring is contained within a string. The grepl
function provides one way to do this:
• grepl(pattern, x): Returns TRUE if the string pattern is contained within the string x and FALSE
otherwise. The function grepl has several optional arguments, including for instance ignore.case, which
indicates whether the case of the letters (upper versus lower) should be ignored and whose default value is FALSE.
The interested reader can use the R documentation, by typing ?grepl, to get more information about grepl and
related functions.
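The following example illustrates grepl; the pattern in the last command, which uses the regular-expression "or" operator |, is one natural way to check for three straight responses of any answer.
answers <- "ACBBAD"
grepl("AA",answers)
## [1] FALSE
grepl("BB",answers)
## [1] TRUE
grepl("AAA",answers)
## [1] FALSE
grepl("AAA|BBB|CCC|DDD",answers)
## [1] FALSE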
In this example, answers is assigned the value "ACBBAD", corresponding to a student’s answers to six multiple-
choice exam questions, each of which has the possible answers A, B, C, and D. The first grepl command asks whether
the substring "AA" (two straight A responses) is within answers. The second grepl command asks whether two
straight B responses are within answers. The third grepl command asks whether three straight A responses are
within answers. And the fourth grepl command asks whether there is a sequence of three straight responses of
any of A, B, C, or D.
1.4 Vectors
A vector is a collection of elements of the same data type. For example, a vector can be a collection of numerical
values, a collection of logical values, or a collection of strings. Vectors can be created and manipulated in various
ways.
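The c (combine) function is the most basic way to create a vector. In the sketch below, the specific element values are illustrative.
numvec <- c(8,12,5,10,3)
numvec
## [1]  8 12  5 10  3
c(numvec,18)
## [1]  8 12  5 10  3 18
answers <- c("A","C","B","B","A","D")
tfvec <- c(TRUE,FALSE,TRUE,TRUE)
c(8,12,"A")
## [1] "8"  "12" "A"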
The first command creates a vector with five numeric elements and assigns it to the variable numvec. In the third
command, the c(numvec,18) constructs a vector consisting of the original numvec vector with an additional
numeric value (18) added to the end of the vector. The assignment of the answers variable shows that a vector can
consist of strings, each of which happen to be one-character strings here. The assignment of the tfvec variable shows
that a vector can consist of logical values (TRUE or FALSE). The last command, c(8,12,"A") illustrates that the
elements of a vector must all be of the same type. Notice that, unlike numvec, the 8 and 12 values are contained
within quotation marks, meaning "8" and "12" are strings. When R sees that the string "A" is part of the vector
being created, it forces the other values (which were numeric in this case) to be strings.
There are other useful functions for creating vectors. For instance, we can create a numeric vector containing a
sequence of numbers using either the seq function or the : operator:
• seq(from, to, by = 1): Returns a numeric vector consisting of a sequence of numbers that starts at the
value of the argument from, with each successive element of the sequence obtained by adding the increment by
(an optional argument whose default value is 1) and the argument to indicating when the sequence should end. For
an increasing sequence of numbers (positive by value), the last number in the sequence will be less than or equal to
the argument to. For a decreasing sequence of numbers (negative by value), the last number in the sequence will
be greater than or equal to the argument to. (When from is larger than to, the default value of by is -1.)
• The : operator: Returns a numeric vector consisting of a sequence of numbers between the two operands, where
the increment is equal to one in absolute value. Specifically, the command from:to is shorthand for seq(from, to, by =
1) if from is less than to and shorthand for seq(from, to, by = -1) if from is greater than to.
The following examples illustrate the use of the seq function and the : operator.
seq(1,10)
## [1] 1 2 3 4 5 6 7 8 9 10
seq(1,10,2)
## [1] 1 3 5 7 9
seq(5,4,-0.1)
## [1] 5.0 4.9 4.8 4.7 4.6 4.5 4.4 4.3 4.2 4.1 4.0
1:6
## [1] 1 2 3 4 5 6
1.2:6
## [1] 1.2 2.2 3.2 4.2 5.2
6:1
## [1] 6 5 4 3 2 1
vec <- c(1:6,seq(10,20,2))
vec
## [1] 1 2 3 4 5 6 10 12 14 16 18 20
For the seq(1,10,2) command, the increment for the sequence is 2, and the last element of the sequence is
9 since the next element of the sequence (11) would be larger than the end of the sequence (10) that was specified.
Similarly, for the 1.2:6 command, where the increment is automatically equal to one, the last element of the sequence
is 5.2 since the next element would be larger than the end of the sequence (6) specified. The creation of the vec
variable illustrates that sequences can be stored in a vector variable and can also be combined through the use of the
c function. In this case, the vec variable has been assigned to a vector that consists of the integers between 1 and 6
(inclusive) and then the even integers between 10 and 20 (inclusive).
The rep function is a convenient way to create or “initialize” a vector that, unlike the seq function, can be used
for any type of vector, including numeric, logical, and string.
• rep(x, times = 1): Returns a vector created by repeating the value x a certain number of times, specified
by the optional argument times (whose default value is 1).
rep(0,5)
## [1] 0 0 0 0 0
rep(1,5)
## [1] 1 1 1 1 1
rep(FALSE,8)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
An individual element of a vector can be accessed by specifying its position, or “index,” within square brackets after
the vector’s name (e.g., x[3] refers to the third element of the vector x). To access multiple elements of the vector,
we specify a vector of “indices” within the square brackets, as illustrated below.
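In the examples below, numvec is a ten-element numeric vector whose values are chosen to be consistent with the output shown later in this section, and the command for the last three elements is one natural possibility.
numvec <- c(8,12,5,10,3,18,7,10,2,8)
numvec[3]
## [1] 5
numvec[2:5]
## [1] 12  5 10  3
numvec[(length(numvec)-2):length(numvec)]
## [1] 10  2  8
numvec[c(4,1,9)]
## [1] 10  8  2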
In this example, the variable numvec is assigned to be a vector containing ten numeric values. The numvec[3]
command returns the third element of numvec. The numvec[2:5] command returns a vector consisting of the
second through fifth elements (inclusive) of numvec. The next command uses the length function to assist in
returning a vector with the last three elements of numvec. The numvec[c(4,1,9)] command illustrates that the
chosen elements need not be consecutive, as this command returns a vector with the fourth element, first element, and
ninth element of numvec.
There are several useful functions that take a vector as their argument, including the following:
• min(x) and max(x): Return the smallest and largest elements, respectively, of the vector x.
• sort(x): Returns a vector with the elements of the vector x sorted in increasing order.
• unique(x): Returns a vector containing the distinct elements of the vector x.
• sum(x): Returns the sum of the elements of the vector x. (If x is a logical vector, TRUE and FALSE are treated
as the values 1 and 0, respectively.)
• mean(x): Returns the arithmetic mean of the elements of the vector x, equal to the sum of the elements divided
by the number of elements. (If x is a logical vector, TRUE and FALSE are treated as the values 1 and 0, respectively.)
• cumsum(x): Returns a vector containing the cumulative or “running” sum of the elements of the vector x. The
first element is first element of x, the second element is the sum of the first two elements of x, the third element
is the sum of the first three elements of x, and so on. (If x is a logical vector, TRUE and FALSE are treated as the
values 1 and 0, respectively.)
The functions min, max, sort, and unique can be used for numeric, logical, and string vectors, whereas the
functions sum, mean, and cumsum are not meant for string vectors.
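Applying these functions to the numvec vector defined above gives the following results.
min(numvec)
## [1] 2
max(numvec)
## [1] 18
sort(numvec)
## [1]  2  3  5  7  8  8 10 10 12 18
unique(numvec)
## [1]  8 12  5 10  3 18  7  2
sum(numvec)
## [1] 83
mean(numvec)
## [1] 8.3
cumsum(numvec)
## [1]  8 20 25 35 38 56 63 73 75 83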
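Arithmetic operations, mathematical functions, and logical comparisons are applied to vectors on an element-by-element basis, as the following examples illustrate.
numvec + 1
## [1]  9 13  6 11  4 19  8 11  3  9
numvec + 1:10
## [1]  9 14  8 14  8 24 14 18 11 18
sqrt(numvec)
## [1] 2.828427 3.464102 2.236068 3.162278 1.732051 4.242641 2.645751 3.162278
## [9] 1.414214 2.828427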
sqrt(numvec)[2:4]
## [1] 3.464102 2.236068 3.162278
numvec >= 9
## [1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE
prices <- c(1.24,3.12,0.78,2.22,4.57,2.89,4.08,1.83,3.78,2.66)
numvec*prices
## [1] 9.92 37.44 3.90 22.20 13.71 52.02 28.56 18.30 7.56 21.28
revenue <- sum(numvec*prices)
revenue
## [1] 214.89
The numvec variable is a vector with ten numerical values that was used in a previous example. The result of the
command numvec + 1 is a vector which takes each element of numvec and adds the value 1 to it. In contrast,
for the command numvec + 1:10, where both operands are ten-element vectors, the vector that is returned is an
element-by-element sum of the two operands. The first element is the sum of the first element of numvec (8) and
the first element of 1:10 (1), the second element is the sum of the second element of numvec (12) and the second
element of 1:10 (2), and so on. A mathematical function like sqrt returns a vector with the function applied to each
element of the vector argument. In this case, we can confirm that the resulting vector has the square root of 8 as its first
element, the square root of 12 as its second element, and so on. The command sqrt(numvec)[2:4] demonstrates
the generality of indexing vectors; since sqrt(numvec) is itself a vector, with the same length as the original
numvec, the [2:4] indexing returns the second through fourth elements of the square root applied to the numvec
vector. The numvec >= 9 command illustrates how logical operators are also applied on an element-by-element
basis. The resulting vector is a logical vector, where the first element indicates whether the first element of numvec
(8) is greater than 9, the second element indicates whether the second element of numvec (12) is greater than 9, and
so on. The last set of commands, involving the prices and revenue variables, illustrates how vector operations can
be used in a simple economic example. If the variable numvec represents the quantities of ten different goods that
are purchased at a certain store and the variable prices represents the corresponding prices of these ten goods, the
command numvec*prices returns a vector with ten elements, each corresponding to the revenue associated with
a given good. Then, the variable revenue, obtained with the sum function, provides the total revenue for all of the
goods.
Here are some examples of element-by-element operations and functions for string and logical vectors.
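In the sketch below, the specific vectors strvec, logicvec1, and logicvec2 are illustrative.
strvec <- c("ab","cd","ab")
toupper(strvec)
## [1] "AB" "CD" "AB"
paste(strvec,".",sep="")
## [1] "ab." "cd." "ab."
strvec=="ab"
## [1]  TRUE FALSE  TRUE
logicvec1 <- c(TRUE,TRUE,FALSE,FALSE)
logicvec2 <- c(TRUE,FALSE,TRUE,FALSE)
logicvec1|logicvec2
## [1]  TRUE  TRUE  TRUE FALSE
logicvec1&logicvec2
## [1]  TRUE FALSE FALSE FALSE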
For the string vector strvec, toupper(strvec) capitalizes each element, paste(strvec,".",sep="")
appends a period (.) to the end of each element, and strvec=="ab" does a logical comparison of each element
with the string "ab". For the logical vectors, logicvec1 and logicvec2 are created as logical vectors with four
elements. The commands logicvec1|logicvec2 and logicvec1&logicvec2 apply the “or” (|) and the
“and” (&) operators, respectively, to the two vectors on an element-by-element basis.
• all(x): Returns TRUE if all of the values in the logical vector x are TRUE, and FALSE otherwise.
• any(x): Returns TRUE if any of the values in the logical vector x are TRUE, and FALSE otherwise.
• which(x): Returns a vector of indices corresponding to TRUE elements of the logical vector x.
• ifelse(test, yes, no): Returns a vector based upon a logical condition, given by the argument test,
where the value given by the yes argument is used if the condition is TRUE and the value given by the no argument
is used if the condition is FALSE.
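The following examples apply these functions to the numvec and prices vectors.
all(numvec >= 9)
## [1] FALSE
any(numvec >= 9)
## [1] TRUE
ifelse(prices > 3.50,"high price","low price")
## [1] "low price"  "low price"  "low price"  "low price"  "high price"
## [6] "low price"  "high price" "low price"  "high price" "low price"
ifelse(prices > 3.50,3.50,prices)
## [1] 1.24 3.12 0.78 2.22 3.50 2.89 3.50 1.83 3.50 2.66
which(prices > 3.50)
## [1] 5 7 9
high_prices <- prices[which(prices > 3.50)]
high_prices
## [1] 4.57 4.08 3.78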
Assume again that numvec represents the quantities of ten different goods purchased at a store and the variable
prices represents the corresponding prices of these ten goods. The all(numvec >= 9) command returns
FALSE since the quantities are not all greater than or equal to 9, and the any(numvec >= 9) command returns
TRUE since at least one of the quantities is greater than or equal to 9. The first ifelse command returns a string
vector, with the value "high price" corresponding to any element of prices greater than 3.50 and the value
"low price" corresponding to any element of prices that is not greater than 3.50. The second ifelse
does something slightly different, creating a numeric vector where each element is either exactly 3.50 (when the
corresponding element of prices is greater than 3.50) or equal to the corresponding element of prices (when
this element is not greater than 3.50). The which(prices > 3.50) command returns a vector of the indices
corresponding to the elements of prices that are greater than 3.50. There are three such elements, corresponding
to the indices 5, 7, and 9 of the prices vector. The usefulness of the which function is illustrated in
the subsequent command, where which(prices > 3.50) is itself used within the square brackets [] to select
elements of the prices vector. Specifically, the new variable high_prices is a vector that consists of only the
elements of prices for which the condition prices > 3.50 is true.
A particularly useful function to use for a logical vector is the sum function. When x is a logical vector, sum(x)
treats TRUE values as ones and FALSE values as zeros, which means that sum(x) returns the total number of TRUE
values that the vector x contains. As an example, using the numvec and prices vectors defined above, the following
code uses the sum function to count how many of the numvec elements are greater than or equal to 9 and how many
of the prices elements are greater than 3.50.
sum(numvec >= 9)
## [1] 4
sum(prices > 3.50)
## [1] 3
For a logical vector x, mean(x) returns the number of TRUE elements in x divided by the total number of
elements of x, which is the proportion or fraction of elements of x that are TRUE.
mean(numvec >= 9)
## [1] 0.4
mean(prices > 3.50)
## [1] 0.3
Throughout the book, we will apply the sum or mean functions to logical vectors as part of computer simulations
involving random numbers. To preview this type of calculation, we briefly introduce the function runif. When called
with a single argument n, the function runif(n) returns a vector of n random (real) numbers between zero and one.
This “uniform” random variable will be discussed in Chapter 10, but for now we can think of any real number between
zero and one as being equally likely to be chosen.
set.seed(1234)
x <- runif(1000)
x[1:10]
## [1] 0.113703411 0.622299405 0.609274733 0.623379442 0.860915384 0.640310605
## [7] 0.009495756 0.232550506 0.666083758 0.514251141
sum(x<0.3)
## [1] 291
mean(x<0.3)
## [1] 0.291
mean((x>0.6)*(x<0.8))
## [1] 0.198
Ignore the first command for now, as the set.seed function will be discussed in Chapter 2. The second command
creates a vector x containing 1,000 random numbers between zero and one, and the third command outputs the first
ten values of x. The sum(x<0.3) command outputs the number of random numbers, out of 1,000, that are less
than 0.3, and the mean(x<0.3) command outputs the sum divided by the length of x (1,000). The final command,
mean((x>0.6)*(x<0.8)), returns the fraction of the 1,000 random numbers that are between 0.6 and 0.8 since
(x>0.6)*(x<0.8) evaluates to 1 when both x>0.6 and x<0.8 are TRUE and 0 otherwise. For this simulation,
29.1% of the 1,000 random numbers are less than 0.3, and 19.8% are between 0.6 and 0.8.
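A string vector can be converted to a categorical variable with the factor function. In the sketch below, the specific True/False answers are illustrative.
answers <- c("True","False","False","True","True")
answers
## [1] "True"  "False" "False" "True"  "True"
tfvec <- factor(answers)
tfvec
## [1] True  False False True  True
## Levels: False True
levels(tfvec)
## [1] "False" "True"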
The first command assigns answers to be a string vector with the student’s answers, and the second command
outputs this vector. The third command creates tfvec by applying the factor function to the original vector
answers. The factor function converts the original string vector into a vector that has the factor data type. As
the output shows, tfvec has the same values as answers, but R has automatically detected that the Levels of the
factor variable are False and True. The levels function can be used to return the levels associated with a factor
variable. By default, R orders the levels of a factor variable alphabetically unless specified otherwise.
If a categorical variable has a natural ordering, we can also explicitly specify the factor levels and do so in the correct
order. For example, suppose we have customer ratings for two different restaurants, contained in the vectors vec1 and
vec2. There are four possible ratings, which are (in ascending order) Poor, Fair, Good, and Excellent.
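The setup commands below re-create the ratings vectors and specify the ordered levels; they are consistent with the output that follows.
vec1 <- c("Good","Excellent","Good","Poor","Good")
vec2 <- c("Fair","Excellent","Excellent","Poor","Good","Excellent")
rating_levels <- c("Poor","Fair","Good","Excellent")
ratings1 <- factor(vec1, levels = rating_levels)
ratings2 <- factor(vec2, levels = rating_levels)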
ratings1
## [1] Good Excellent Good Poor Good
## Levels: Poor Fair Good Excellent
ratings2
## [1] Fair Excellent Excellent Poor Good Excellent
## Levels: Poor Fair Good Excellent
table(ratings1)
## ratings1
## Poor Fair Good Excellent
## 1 0 3 1
table(ratings2)
## ratings2
## Poor Fair Good Excellent
## 1 1 1 3
The assignment of rating_levels to the four ratings categories tells R the order of the categories. Then,
ratings1 and ratings2 are created as vectors of factor variables, based upon the original vec1 and vec2
string vectors and using the levels specified by rating_levels. When ratings1 is output, the levels are shown
in the order that we specified, and moreover the output shows the level Fair even though there are no Fair ratings
in the ratings1 vector. Another way to output and view the data within the factor-variable vector is with the table
command, which provides a tabulation (count) of the number of vector elements that are within each category. For the
table of ratings1, all four categories are shown, with a 0 indicating no elements with a Fair rating.
1.5 Output
In the examples considered thus far, R supplies output automatically for most expressions, with the exception of
variable assignments.
x <- 5
x
## [1] 5
x/3
## [1] 1.666667
In this example, the variable assignment x <- 5 does not result in output, but the commands x and x/3 both
result in output, corresponding to the values of the two expressions.
The print function provides an alternative method for providing output. In many cases, the command print(x)
leads to the same output as the command x, but the print function is sometimes preferred by R users since (i) it has
optional arguments that can be useful and (ii) it can provide output that is better formatted for tables, regressions, etc.
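For example, the digits argument of print controls how many significant digits are displayed:
print(5/3)
## [1] 1.666667
print(5/3, digits = 3)
## [1] 1.67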
1.6 Programming
1.6.1 Conditional (if-else) execution
The simplest version of conditional execution involves a single if statement, where a command or series of commands
gets executed if a certain logical condition holds.
# if (logical condition) {
# ... commands ...
# }
The syntax for the if statement has a logical condition, which evaluates to TRUE or FALSE, within parentheses
after the if keyword. If the logical condition evaluates to TRUE, the sequence of commands within the curly braces {
and } are executed.
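The following example, in which the string value and the printed messages are illustrative, shows an if statement in action.
str <- "intro to statistics"
if (nchar(str) > 10) {
  print("The string is long.")
  print(nchar(str))
}
## [1] "The string is long."
## [1] 19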
The if statement checks whether the string str has more than ten characters. Since it does, the commands within {
and } are executed, resulting in two lines of output. Had str been a string with ten or fewer characters, the commands
within { and } would have been skipped, and no output would have resulted.
What if we want to execute different commands when the logical condition within the if statement does not hold?
In this case, we use an if-else statement that executes one set of commands if the logical condition holds and a
different set of commands if it does not.
# if (logical condition) {
# ... commands executed if condition is TRUE...
# } else {
# ... commands executed if condition is FALSE...
# }
This syntax has the else keyword appearing after the right curly brace } of the original if statement and the
“else” commands contained within a second set of curly braces { and } after the else keyword.
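The following example, in which the initial price value is an illustrative number greater than 20, shows an if-else statement in action.
price <- 25   # illustrative value above 20
if (price > 20) {
  print("The price is too high.")
  price <- 0.90*price
} else {
  print("The price is not too high.")
}
## [1] "The price is too high."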
The if statement checks if price is greater than 20, which is the case here. As a result, the output "The price
is too high." is provided, and the variable price is set to 90% of its original value. Had price been less than or
equal to 20, the output "The price is not too high." would have been provided, and the variable price
would have been left unchanged.
We can check additional logical conditions within an if-else statement by using the keywords else if rather
than else, as illustrated in the following example.
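Here the initial value of price is an illustrative number below 10.
price <- 8   # illustrative value below 10
if (price > 20) {
  print("The price is too high.")
  price <- 0.90*price
} else if (price < 10) {
  print("The price is too low.")
  price <- 1.10*price
} else {
  print("The price is not too high or too low.")
}
## [1] "The price is too low."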
Like the previous example, the first logical condition (price > 20) is checked by the initial if statement,
with the two commands below it executed if this condition is TRUE. In this example, however, if the (price >
20) condition is FALSE, the subsequent else if checks whether the logical condition (price < 10). If this
condition is TRUE, the output "The price is too low." is provided, and the variable price is set to 110%
of its original value. If this condition is FALSE, in which case price is between 10 and 20, the output "The price
is not too high or too low." is provided.
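1.6.2 for loops
A for loop repeatedly executes a set of commands, once for each value in a sequence:
# for (var in sequence) {
# ... commands ...
# }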
In this syntax, var is a variable used to store values during the loop, and sequence is a vector specifying the
sequence of values over which the for loop will iterate. For each value in sequence, the for loop executes the
commands specified within the curly braces { and }. For instance, if we want to conduct 10,000 simulations of a
certain process, a for loop can loop over the simulation number (from 1 to 10,000), and the commands within the
loop will be repeatedly executed for each simulation.
As an example, we consider how a for loop can be used to calculate the Fibonacci sequence, which is a sequence
of numbers in which each number is the sum of the two preceding numbers. The sequence starts with the numbers 1
and 1, so that the Fibonacci sequence is 1, 1, 2, 3, 5, 8, 13, 21, ….
sequence_length <- 10
fib_sequence <- rep(0,sequence_length)
fib_sequence[1] <- 1
fib_sequence[2] <- 1
for (i in 3:sequence_length) {
fib_sequence[i] <- fib_sequence[i-2]+fib_sequence[i-1]
}
print(fib_sequence)
## [1] 1 1 2 3 5 8 13 21 34 55
The variable sequence_length is the desired length of the Fibonacci sequence, set to 10 here. Then,
the variable fib_sequence is initialized as a numeric vector having all zeros and with length equal to the
specified sequence_length. The next two commands assign the value 1 to both the first and second elements
of fib_sequence, which corresponds to 1 and 1 being the first two numbers in the Fibonacci sequence. The
for loop does the rest of the work in determining the Fibonacci sequence. The variable i loops from 3 to
sequence_length. With sequence_length being 10, the variable i has the value 3 on the first iteration
of the loop, 4 on the second iteration of the loop, and so on, through 10 on the eighth iteration of the loop.
On the first iteration of the loop, fib_sequence[3] is assigned to the sum of fib_sequence[1] and
fib_sequence[2], which is 2. On the second iteration of the loop, fib_sequence[4] is assigned to the sum
of fib_sequence[2] and fib_sequence[3], which is 3. This process continues until the final iteration of the
loop, where fib_sequence[10] is assigned to the sum of fib_sequence[8] and fib_sequence[9]. The
last print command provides the output with the Fibonacci sequence of length sequence_length.
The following example involves the use of strings and illustrates how the sequence vector can be something other
than a simple sequence of numeric values.
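In this sketch, the last two strings in strvec are illustrative; the output below pins down only their first letters.
strvec <- c("cow","horse","pig","cat","goat")   # "cat" and "goat" are illustrative
first_letters <- ""
for (str in strvec) {
  first_letters <- paste(first_letters,substr(str,1,1),sep="")
}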
print(first_letters)
## [1] "chpcg"
The for loop builds up a string first_letters that consists of the first letters from each string in the string
vector strvec. The variable first_letters is initialized to be an empty string "". In the first iteration of the
for loop, the variable str is equal to "cow", the first element of strvec. The expression substr(str,1,1)
yields the first character of str, which is "c", and the paste function pastes it at the end of first_letters, so
that the value of first_letters after the first iteration is "c". In the second iteration, the variable str is equal
to "horse", and the character "h" gets added to the end of first_letters, so that its value is "ch" after the
second iteration. This process continues, with a total of five iterations, and the print command outputs the final value
of first_letters.
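1.6.3 while loops
A while loop repeatedly executes a set of commands as long as a logical condition remains TRUE:
# while (logical condition) {
# ... commands ...
# }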
The syntax for the while loop has a logical condition, which evaluates to TRUE or FALSE, within parentheses after
the while keyword. If the logical condition is TRUE, the commands within the curly braces { and } are executed.
After the commands are executed, the computer goes back to the start of the while loop to check the logical condition
again. If the logical condition is still TRUE, the commands within the curly braces { and } are executed again. This
process continues until the while logical condition is FALSE, at which point the computer stops the loop (and, if
there are commands after the loop (i.e., after the right curly brace }), jumps to those commands).
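In the sketch below, the first two daily sales values (38 and 52) match the discussion that follows, while the remaining values and the exact form of the final print command are assumptions.
sales <- c(38,52,41,49,42,35,20)   # values after day 2 are illustrative
total_sales <- 0
idx <- 0
while (total_sales < 200) {
  idx <- idx + 1
  total_sales <- total_sales + sales[idx]
}
print(c(idx,total_sales))
## [1]   5 222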
The numeric vector sales contains data on daily sales for a certain company, and a while loop determines how
many days it takes for the company’s total (cumulative) sales to be at least 200 units. The total_sales variable,
used to keep track of cumulative daily sales, is initialized to 0. The idx variable, used to keep track of the number of
the day (i.e., the index of the sales vector), is also initialized to 0. When the while loop is first reached, the logical
condition is TRUE since total_sales is equal to 0. The commands within the loop increment idx, so it now has a
value of 1, and add sales[1] to total_sales, which now has a value of 38. The logical condition remains true,
so the commands within the loop are executed again, leading to an idx value of 2 and a total_sales of 90. This
process continues until idx is equal to 5, and total_sales is equal to 222, at which point the logical condition
for the while loop is FALSE, so the loop is ended and the print command outputs the information.
1.7 Writing functions
In addition to using R’s built-in functions, we can write our own functions. Writing a function involves the following steps:
• Name the function and specify its arguments: The function is given a name, and its arguments are enclosed in
parentheses after the function name. Some arguments may be required, and some arguments may be optional, as we
have seen for built-in functions like round and log.
• Write the code: The code for the function, which is contained within curly braces { and }, performs the desired
calculations.
• Return the output: The return function specifies the output of the function, which could be any type of object,
like a number, a string, a logical value, or a vector.
The following example shows how to write a function that calculates the area of a triangle based upon the values of
its base and its height.
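triangle_area <- function(base, height) {
  area <- (1/2)*base*height
  return(area)
}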
triangle_area(10,5)
## [1] 25
area
## Error in eval(expr, envir, enclos): object ’area’ not found
The name of the function is triangle_area, and it is “assigned” to be a function with two required arguments,
base and height. Within the function, the variable area calculates the area of the triangle, which is 1/2 times
base times height. Then, the return function returns the function’s output, which is the value of area. The
triangle_area(10,5) command makes a call to the function triangle_area with argument values 10 (for
base) and 5 (for height), and the resulting area of 25 is printed. The final command, area, results in an error
message. Even though the variable area is defined within the function triangle_area, it is not recognized by R
once the function evaluation is complete; we say that the variable area is “local” to the function. The same is true
for any variables created within a function. It is also true for the arguments of the function; the variables base and
height would not be recognized after the function evaluation is complete. When a variable or argument is defined
inside a function, it becomes a “local” variable by default. As such, the variable or argument name can be used freely
within the function without affecting any variables outside of the function. For example, if height were a variable
that already existed before the call to triangle_area, the use of the argument name height would not affect the
value of the variable height. This idea is illustrated below.
height <- 8
triangle_area(10,5)
## [1] 25
height
## [1] 8
The variable height is assigned the value 8. Then, even though the triangle_area function uses an argument
with the same name (height) and assigns the value 5 to that argument, the value of the original variable
remains 8 after the function evaluation is complete.
Without any changes, the triangle_area function can actually take vectors as its two arguments, with base
being a vector of triangle bases and height being a vector of triangle heights. The first elements of base and
height correspond to the first triangle, the second elements correspond to the second triangle, and so on. Then,
since R automatically performs arithmetic operations on an element-by-element basis, the function triangle_area
returns a vector of the areas for each of the triangles.
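For instance, with illustrative vectors of bases and heights:
triangle_area(c(10,6,4), c(5,3,10))
## [1] 25  9 20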
The following example shows how to define optional arguments in a function. We define the function distance
to calculate the distance between any two points (x1, y1) and (x2, y2), given by √((x1 – x2)² + (y1 – y2)²).
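distance <- function(x1, y1, x2 = 0, y2 = 0) {
  return(sqrt((x1-x2)^2 + (y1-y2)^2))
}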
distance(3,4)
## [1] 5
distance(3,4,5,-1)
## [1] 5.385165
x1 and y1 are required arguments for distance, while x2 and y2 are optional arguments. For the optional
arguments, an equal sign (=) appears after the argument and is followed by the default value. The default values for
x2 and y2 are both equal to 0, corresponding to (x2 , y2 ) = (0, 0) being the origin if the function is called without the
optional arguments. Thus, the distance(3,4) command returns the distance from (3, 4) to (0, 0), which is equal to 5,
and the distance(3,4,5,-1) command returns the distance from (3, 4) to (5, –1), which is √29, or approximately
5.385165.
1.8 Data frames and file input
A data frame is a two-dimensional data structure in which each column corresponds to a variable and each row corresponds to an observation. The data.frame function can be used to create a data frame directly:
df <- data.frame(name = c("Amy", "Blake", "Chloe"), age = c(28, 41, 32), employed = c("Yes", "No", "Yes"))
print(df)
## name age employed
## 1 Amy 28 Yes
## 2 Blake 41 No
## 3 Chloe 32 Yes
mean(df$age)
## [1] 33.66667
sapply(df,class)
## name age employed
## "character" "numeric" "character"
df$employed <- factor(df$employed)
sapply(df,class)
## name age employed
## "character" "numeric" "factor"
The first command, using the data.frame function, assigns the data frame to the variable df. This data frame
is specified to have three variables (columns): name, age, and employed. The data frame has three rows, or
three observations for each variable, as can be seen most clearly from the output of the print(df) command.
The $ syntax refers to an individual variable within the data frame, so that df$age refers to the age variable
(column). The command mean(df$age) outputs the arithmetic mean of the three values of age. The next command,
sapply(df,class), indicates the data type associated with each of the three variables in df. The function
sapply applies the function specified by its second argument (in this case class) to each variable in the data
frame specified in its first argument (in this case df). The resulting output indicates that name and employed are
strings and age is numeric. Since it is more appropriate to treat employed as a factor variable, with possible values
"Yes" and "No", the factor function in the next command re-assigns the employed variable so that its values
are factors rather than strings. The output of the second sapply command confirms that the data type of employed
has changed.
R provides various ways to access data frame elements, including rows, columns, and other selected data cells. For
the data frame df created above, the following examples show some different possibilities. Just as square bracket
syntax ([ and ]) references elements of vectors, square brackets can also be used to reference elements of data frames.
The difference is that data frames are two-dimensional objects whereas vectors are one-dimensional objects.
df$name
## [1] "Amy" "Blake" "Chloe"
df$age[1]
## [1] 28
df[1,2]
## [1] 28
df[,2]
## [1] 28 41 32
df[,c("name","employed")]
## name employed
## 1 Amy Yes
## 2 Blake No
## 3 Chloe Yes
df[1:2,]
## name age employed
## 1 Amy 28 Yes
## 2 Blake 41 No
df[df$age>30,]
## name age employed
## 2 Blake 41 No
## 3 Chloe 32 Yes
The expression df$name refers to the name variable and outputs a vector with all of the values for that variable in
the data frame. The expression df$age[1] refers to the age variable, with df$age being a vector of all the age
values and df$age[1] being the first element of that vector, which is 28. An alternative method to access the same
age value from the data frame is to directly reference the appropriate element of the data frame, which is what the
df[1,2] expression does. Note that df[1,2] has two indices specified within the square brackets and separated by
a comma. For this syntax, the first index (or indices) refers to the row(s) of the data frame, and the second index (or
indices) refers to the column(s) or variable(s) of the data frame. Therefore, df[1,2] is the first element (row 1) of the
second variable (column 2, which is the age variable), which has the same value 28 as the df$age[1] expression.
The expression df[,2] omits the first index within the square brackets, which tells R to provide all the rows for the
specified column(s). Therefore, the value of df[,2] is the vector of all of the values for the second (age) variable. In
the following expression, df[,c("name","employed")], the variables are specified by name rather than index
number, and all rows for name and employed are returned since the first index is again omitted. The result is itself
a data frame with two columns, in contrast to the vector that was returned by df[,2] when only a single column
was specified. The expression df[1:2,] omits the second index within the square brackets, which tells R to provide
all the columns for the specified row(s). As such, the result of df[1:2,] is itself a data frame which contains the
first and second rows, as specified by 1:2, of the original data frame df. This notation provides a convenient way
to select certain rows from the data frame, which can be extremely useful for data analysis. The final expression,
df[df$age>30,], provides a simple example of selecting a subset of data based upon a logical condition. This
expression returns a data frame with all rows for which the age variable is greater than 30.
While the data.frame function can be used to directly create a data frame, it is much more common to create
a data frame based upon an existing file that contains data. R has the ability to read many different types of files,
including text files and spreadsheet (e.g., Excel) files. In the interest of space, we focus on two specific file formats in
this section, the csv format and the RData format.
A csv (comma-separated values) file is a text file in which each line contains the values for one observation, separated by commas. For example, the file exams.csv contains two exam scores for each student in a class, and its first few lines look like this:
exam1,exam2
63,71
86,84
68,63
94,82
...
...
To load a csv file as a data frame, here are the necessary steps:
1. Set your “working directory”: R uses the working directory when trying to load a file. The current working
directory can be checked by typing getwd(). The working directory can be set by either (i) changing it
in the “Preferences...” settings in RStudio or (ii) using the setwd function to specify the directory (e.g.,
setwd("C:/stat-course/data/")).
2. Load the csv file: With the working directory set, the read.csv function loads the csv file. The first argument
of the read.csv function is the file name, provided within double quotes. By default, read.csv assumes that the
first line of the file contains the variable names. (If the first line of the file contains the first row of data rather than
variable names, the optional argument header = FALSE should be specified. Then, R reads in the data and creates
the variable names as V1, V2, V3, etc.) There are several optional arguments that can be specified, with details available
from the ?read.csv documentation. Two commonly used options can help to make sure that the data types of the
variables are appropriate:
• stringsAsFactors: Specifying stringsAsFactors = TRUE ensures that string data are read in as factor
variables. For example, if a variable has only the values "Yes" and "No", the default is for read.csv to read
the variable in as a string variable, but stringsAsFactors = TRUE instead reads it in as a factor variable
with two possible values.
• colClasses: Specifying the colClasses argument directly indicates the desired data types for each of the
variables. For example, if the file zipcode.csv contains just a single variable with five-digit U.S. zip codes in it,
those zip codes can be read in as strings rather than numbers with the expression read.csv("zipcode.csv",
colClasses = c("character")). With multiple variables, the argument colClasses is a vector with
length equal to the number of variables and data types specified, in order, corresponding to the variables. For
example, colClasses = c("character","numeric","factor") would read in three variables, the
first as a string variable, the second as a numeric variable, and the third as a factor variable.
The following example loads the file exams.csv into R. Since exam1 and exam2 should be treated as numeric
variables and R automatically reads them in as such, no additional arguments are specified for the read.csv
function. (The form of the two print commands below is one natural possibility.)
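exams <- read.csv("exams.csv")
exams[5:9,]
##   exam1 exam2
## 5    60    59
## 6    75    80
## 7    79    68
## 8    72    67
## 9    90    79
print(max(exams$exam1))
## [1] 99
print(max(exams$exam2))
## [1] 96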
The variable exams is assigned to be a data frame associated with the data in exams.csv. Then, any of the usual
commands can be used to access or manipulate the data in exams. The exams[5:9,] command returns a data
frame consisting of the data for the fifth through ninth students. The two print commands output the maximum
scores for the exam1 and exam2 variables.
Once a data frame has been created from a file, there are several useful R functions that can be used to examine and
summarize the data frame, including the following:
• View(df): Displays df as a spreadsheet in the RStudio script window.
• head(df, n = 6): Returns the first six rows of df. The optional argument n, whose default value is 6, can be
changed to return a different number of rows.
• tail(df, n = 6): Returns the last six rows of df. The optional argument n, whose default value is 6, can be
changed to return a different number of rows.
• str(df): Displays the structure of df, including the number of observations,
the variable data types, and some sample values for each variable.
• summary(df): Provides a summary of the variables in df. The information provided for a variable depends
upon its data type. For numeric variables, descriptive measures (including minimum value, maximum value, and
arithmetic mean) are provided. For factor variables, observation counts for some or all categories are provided.
• nrow(df) and ncol(df): Return the number of rows and columns in df, respectively.
• names(df): Returns a string vector containing the names of the variables in df.
head(exams)
## exam1 exam2
## 1 63 71
## 2 86 84
## 3 68 63
## 4 94 82
## 5 60 59
## 6 75 80
str(exams)
## 'data.frame': 77 obs. of 2 variables:
## $ exam1: int 63 86 68 94 60 75 79 72 90 79 ...
## $ exam2: int 71 84 63 82 59 80 68 67 79 85 ...
summary(exams)
## exam1 exam2
## Min. :33.00 Min. :20.0
## 1st Qu.:68.00 1st Qu.:64.0
## Median :79.00 Median :73.0
## Mean :77.31 Mean :70.9
## 3rd Qu.:89.00 3rd Qu.:81.0
## Max. :99.00 Max. :96.0
names(exams)
## [1] "exam1" "exam2"
We can add variables to an existing data frame. For example, to add the variable avg (the average of the two exam
scores) to the exams data frame, we do an assignment command for exams$avg:
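exams$avg <- (exams$exam1+exams$exam2)/2
str(exams)
## 'data.frame': 77 obs. of 3 variables:
## $ exam1: int 63 86 68 94 60 75 79 72 90 79 ...
## $ exam2: int 71 84 63 82 59 80 68 67 79 85 ...
## $ avg : num 67 85 65.5 88 59.5 77.5 73.5 69.5 84.5 82 ...
write.csv(exams,file="exams-edited.csv")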
The output from str(exams) confirms that a third variable, named avg, has been added to the exams data frame.
Had we instead done the assignment as avg <- (exams$exam1+exams$exam2)/2, without the exams$
prefix, the vector avg would have been created correctly, but it would simply be another variable in R and not part of
the exams data frame. Since the dataset has been changed, we might want to save the new data frame to a file. The
write.csv command saves the data frame exams to a new csv file called exams-edited.csv; alternatively,
we could over-write the original file with the command write.csv(exams,file="exams.csv").
As with other types of variables, a data frame can be removed from the R working environment using the rm
function. For example, if we are done using the exams data frame, rm(exams) removes it from the R working
environment.
The RData format is R’s own file format for saving objects. An RData file is loaded with the load function:
load("exams.RData")
str(exams)
## 'data.frame': 77 obs. of 2 variables:
## $ exam1: int 63 86 68 94 60 75 79 72 90 79 ...
## $ exam2: int 71 84 63 82 59 80 68 67 79 85 ...
The file exams.RData contains a data frame already stored within the exams variable. As such, there is no
assignment necessary in the first command, in contrast to the assignment that was necessary when using read.csv
for a csv file. The command str(exams) confirms that the data frame is loaded correctly. The save function is
used to save a data frame (and/or other R objects) to an RData file. In the code below, we add the variable avg to the
exams data frame and then save the exams data frame to a different RData file:
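exams$avg <- (exams$exam1+exams$exam2)/2
save(exams, file = "exams-edited.RData")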
Even though the file name is different, the name of the data frame (exams) remains the same, and the new file
exams-edited.RData contains the exams data frame.
The following code shows a simple example where we save more than one object to an RData file:
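instructor <- "Jones"   # illustrative value; the original string is not shown in the source
save(exams, instructor, file = "exams-edited.RData")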
We first create a string variable instructor. The save command now has two arguments before the file
argument is specified, causing both the exams data frame and the instructor string variable to be saved in
exams-edited.RData. Now, when this file gets loaded by the load command, both objects are created in the
R environment. To save more objects in an RData file, this basic idea can be extended to include many objects as
arguments before the file argument is specified for the save function.
1.9 Missing values
Real-world datasets often contain missing values, which R represents with the special value NA. The is.na function indicates which elements are missing, and the na.omit function drops any rows of a data frame that contain missing values.
hourly_wage <- c(12.50, NA, 10.75, 11.00, NA, NA, 14.80, 13.25)
age <- c(24, 42, 31, 61, 55, 26, 34, 59)
is.na(hourly_wage)
## [1] FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
hourly_wage[!is.na(hourly_wage)]
## [1] 12.50 10.75 11.00 14.80 13.25
sum(is.na(hourly_wage))
## [1] 3
worker_df <- data.frame(wagehr = hourly_wage, age = age)
na.omit(worker_df)
## wagehr age
## 1 12.50 24
## 3 10.75 31
## 4 11.00 61
## 7 14.80 34
## 8 13.25 59
The hourly_wage vector has three missing values and five non-missing values. The expression
is.na(hourly_wage) returns a logical vector having TRUE values corresponding to the missing elements of
hourly_wage and FALSE values for the other elements. The expression sum(is.na(hourly_wage)) returns
the total number of missing values in hourly_wage since TRUE and FALSE are treated as 1 and 0, respectively,
by the sum function. The hourly_wage[!is.na(hourly_wage)] expression returns a vector with all non-
missing elements of hourly_wage. The data.frame function creates the data frame worker_df consisting of
the hourly wage and age variables. Then, the na.omit(worker_df) command returns a data frame consisting of
only those rows with non-missing hourly wage values.
Several R functions that take vectors as arguments do not work correctly in the presence of missing (NA) values.
For instance, to calculate the average hourly wage for workers, the expression mean(hourly_wage) would seem
appropriate. Unfortunately, that expression doesn’t perform as desired, and the optional argument na.rm = TRUE
needs to be specified.
mean(hourly_wage)
## [1] NA
mean(hourly_wage, na.rm = TRUE)
## [1] 12.46
max(hourly_wage, na.rm = TRUE)
## [1] 14.8
The expression mean(hourly_wage) returns NA, indicating that R is unable to calculate the mathematical
average when the vector has missing values. Adding the optional argument na.rm = TRUE fixes the problem,
with 12.46 being reported as the average of the five non-missing values. Similarly, using na.rm = TRUE for
the max function allows R to calculate the largest value (14.8) among the non-missing values. If the expression
max(hourly_wage) had been used without na.rm = TRUE, the expression would also have a result of NA.
1.10 R packages
An R package is a collection of functions, data, and/or documentation that are bundled together so that they can be
easily installed and used. Such packages provide add-on capabilities to the standard R statistical software, and they
are a convenient way to share code with others and to organize and re-use your own code. Before an R package can
be used, it first needs to be installed using the install.packages function. Once the package has been installed,
it can be used in an R session with the library function, and the description for the package can be accessed using
the optional help argument for the library function.
As an example, we consider the installation and use of stringr, an R package that provides several useful
functions for string manipulation. First, the stringr package must be installed:
install.packages("stringr")
The package name stringr is enclosed in double quotes when used as an argument of the install.packages
function. With the stringr package installed, its contents can be used in R after the command
library(stringr) is executed.
library(stringr)
library(help = stringr)
str_trim(" testing this out ")
## [1] "testing this out"
The double quotes are unnecessary for stringr when used as an argument for the library function. The second
command, library(help = stringr), opens a window in the RStudio script pane with information about the
stringr library and its included functions. For example, the function str_trim removes any extra spaces from
the start and end of a string; in the example above, there are two spaces at the start of the string that are removed and
one at the end of the string that is removed. In most cases, we can access detailed documentation for new functions that
are added through packages. The command ?str_trim, for instance, provides documentation for the str_trim
function.
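The datasets used throughout this book are available in the companion probstats4econ package, which is installed and loaded in the same way: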
install.packages("probstats4econ")
library(probstats4econ)
The data frames in the probstats4econ package are “lazily loaded” into R, which means they are available
to the user even though they will not initially appear in the (upper-right) Environment/History pane. As an example,
after loading the package with the install.packages and library functions, we can immediately use the
exams data frame within the console. Even after using the dataset, it will still not appear in the Environment/History
pane unless a change (e.g., adding a new variable) is made to the data frame. (R itself has several datasets lazily
loaded when RStudio is launched. A complete list of datasets can be seen by using the command library(help =
datasets).)
Aside from the probstats4econ package, this book primarily considers the use of R without installation of
additional packages, which is known as “base R.” That said, there are several R packages that are very useful for
data analysis. Examples include ggplot2, a package for creating high-quality graphics and data visualizations, and
dplyr, a package for data manipulation. Both ggplot2 and dplyr are part of the tidyverse, a larger collection of packages that
is popular among data scientists.
Notes
1 The RStudio console window does not display the ## at the beginning of output lines. This book puts the ## at the beginning of output so that
a full block of commands and output can be copied and run: the commands are executed, while the output lines (now "comments") are ignored.
2 R has an integer data type, which provides a more efficient way to store integer-valued variables in memory.
3 R has a time data type for storing time values (e.g., 11:49:20).
4 It is also possible for a function to have multiple return values, which can be done by returning a list object. An R list is a collection of different
objects, which need not be of the same data type. The code for wald_test in Chapter 14 is an example having a list of two objects returned.
5 There are point-and-click alternatives in RStudio to read data files as well, including (1) click on “Import Dataset” in the (upper-right)
Environment/History pane, (2) click on the “Files” tab in the (lower-right) Plots/Help/Packages/Files pane and select the desired directory and
file, and (3) click on the “Import Dataset” submenu from the main “File” menu.
Exercises
1. For each of the following sections of R code, indicate (i) what output, if any, is provided by R and (ii) the final value
of the variable x. Answer without using R.
(a) Code section #1:
x <- 16
sqrt(x)
x <- 16
x <- sqrt(x)
x <- c(x,x)
x <- c(79,16,53,44)
x <- sort(x, decreasing = TRUE)
length(x)
y <- c(6.7,-3.3,4.2)
x <- (1:3)*y
min(x)
x <- (x > 0)
sum(x)
2. A company has the following daily sales over the course of nine days: 38, 52, 24, 61, 47, 18, 29, 44, 41.
(a) Create a numerical vector sales that contains the daily sales.
(b) Provide a single command to calculate the average daily sales.
(c) Provide a single command to calculate the number of days for which sales are (strictly) between 40 units and
60 units.
(d) Provide a single command to calculate the proportion of days for which sales are (strictly) between 40 units
and 60 units.
3. Provide a single command, involving the sqrt function, to output a vector of all “perfect squares” that are less than
1,000 and in ascending order. The "perfect squares," in ascending order, are 1², 2², 3², 4², 5², … or 1, 4, 9, 16, 25, ….
4.
(a) Create a vector mult_two consisting of all multiples of two (i.e., even numbers) between 1 and 200
(inclusive).
(b) Create a vector mult_three consisting of all multiples of three between 1 and 200 (inclusive).
(c) Provide a single command to calculate how much longer mult_two is than mult_three.
(d) Using mult_two and mult_three, create a new vector mult_vec consisting of all numbers between 1
and 200 (inclusive) that are a multiple of two, a multiple of three, or both. The vector mult_vec should have
(i) the vector elements in increasing order and (ii) no duplicate elements (e.g., the number 6 should only appear
once even though it’s in both mult_two and mult_three). How many elements does mult_vec have?
5. Write a function rectangle_area that calculates the area of a rectangle based upon two arguments. The first
argument is base, the length of the rectangle base. The second argument is height, the rectangle height. Make the
second argument optional, with the default value specified by height = base, corresponding to a square. Confirm
that rectangle_area(3) returns the area of a 3 × 3 square and rectangle_area(4,5) returns the area of a
4 × 5 rectangle.
6. Write a function even_product that takes a single numerical (integer) argument x and returns the product of the
first x even integers. For example, even_product(4) should return the product of 2, 4, 6, and 8, which is 384.
7. Refer to the R code (Section 1.6.2) that calculates the first ten numbers in the Fibonacci sequence.
(a) Modify the code to create a function fibonacci that takes sequence_length as an argument and
returns the Fibonacci sequence as a vector. Confirm that fibonacci(10) returns the correct sequence of
ten numbers.
(b) Does your fibonacci function work when called with 1 or 2 as its argument? If not, modify the code to
handle these two cases.
(c) The Tribonacci sequence is similar to the Fibonacci sequence, except that each element of the Tribonacci
sequence is the sum of the previous three numbers in the sequence. The beginning of the Tribonacci sequence
is 1, 1, 2, 4, 7, 13, 24, 44, .... Write a function tribonacci that takes a single (positive-integer) argument and
returns the Tribonacci sequence of that length.
8. Use a while loop to determine the smallest positive integer n for which ln(n) + ln(2n) is greater than 7.
9. Write a function longest_trend, with a numerical vector x as its single argument, that returns the
number associated with the longest subvector of strictly increasing values in x. For instance, for the vector
c(3,-1,1,0,-2), the function would return 2 since the sequence (-1,1) is the longest subvector of strictly
increasing values. Similarly, for the vector c(78,43,21,-5), the function would return 1 since the elements
are decreasing; for the vector c(-10,5,0,3,21,56,8), the function would return 4 since the sequence
(0,3,21,56) is the longest subvector of strictly increasing values.
(a) Write the function longest_trend using either a for loop or a while loop. Confirm that the function
returns the correct values for the three example vectors specified in the question.
(b) Modify the function longest_trend to include a second optional argument decrease with a default value
of FALSE. When decrease is FALSE, the function should work as above. When decrease is TRUE, the
function should return the number associated with the longest subvector of strictly decreasing values in x. In
the three example vectors given in the question, the function should return 3, 4, and 2, respectively, when
decrease=TRUE.
10. Using the sales vector created in Question 2, answer the following questions.
(a) Create a vector sales_yesterday that contains the daily sales that occurred yesterday, which should
be c(NA,38,52,24,61,47,18,29,44), where the first element is missing (NA) since there is no
observation before the first day. Try to create this new vector by using the original sales vector (rather
than the brute-force method of assigning the new vector to the list of values specified).
(b) Use a single command that, ignoring the first day, calculates the number of days for which sales are strictly
greater than the previous day’s sales.
(c) Using the vectors sales and sales_yesterday, calculate the average daily sales on days for which the
previous day’s sales were strictly less than 30 units.
11. Create a vector of 5,000 random numbers between zero and one with the following two commands:
set.seed(1234) followed by x <- runif(5000).
(a) Use a single sum command to calculate the sum of the 5,000 numbers.
(b) Use a single mean command to calculate the proportion of random numbers between 0.15 and 0.40.
(c) Thinking about the vector as an ordered sequence of random numbers, determine the proportion of times that
an element of the sequence is within 0.1 of the previous element of the sequence. The first element has no
previous element, so the proportion should be calculated for the remaining 4,999 elements of the sequence.
12. Suppose the 20-character string returns, defined below, indicates whether the stock price of a certain company
goes up, indicated by U, or down, indicated by D, over the course of 20 days of trading on the stock market.
(a) Using a for loop to loop over the characters of returns, determine how many days the stock goes up.
(b) Using another for loop, determine how many times a D (stock-price drop) is followed immediately by a U
(stock-price increase).
(c) Use the grepl command to determine whether there is a streak of four consecutive days of stock-price
increases. Use the grepl command to determine whether there is a streak of four consecutive days of
stock-price drops.
(d) Write a function strtovec that takes a string string as a single argument and returns a vector consisting
of the single characters that comprise string. For example, strtovec("abc") should return the vector
consisting of the elements "a", "b", and "c".
(e) Using the strtovec function from the previous part, provide a single command to determine the number of
days the stock goes up (based upon returns).
13. Use the exams dataset, read into the data frame exams, for this question.
(a) Create a vector scorediff that contains the difference between the exam2 score and the exam1 score for
each student. What is the average of scorediff? What is the maximum value of scorediff and which
student (i.e., which row number) has this maximum value? What is the minimum value of scorediff and
which student has this minimum value?
(b) The professor would like to reward students who show improvement on the second exam. Specifically, she
will place 70% weight on the second exam if a student’s second exam score is at least 5% higher than
the student’s first exam score. Otherwise, she will just place 50% weight on both exams. Create a vector
composite_score that calculates a composite score based upon these grading guidelines. For a student
who gets 50% weight on both exams, the composite score should be the sum of the two exam scores; for a
student who gets 70% weight on the second exam, the composite score should be 0.6 times the first exam score
plus 1.4 times the second exam score. What is the average of composite_score? Which student benefits
the most from the 70% weighting rule?
14. Use the cigdata dataset, read into the data frame cigdata, for this question. The dataset contains information on
cigarette taxes, prices, and sales in 2019 for each state (plus the District of Columbia) in the United States.
(a) Use the nrow function to confirm the number of observations.
(b) The variables cigprice and cigtax are the average price of a pack of cigarettes and the tax per pack of
cigarettes, respectively, for each state. Which states have the highest and lowest values for these two variables?
(The states can be identified by either of the string variables state or statename.)
(c) What is the average tax per pack in the data?
(d) Write a command that gives a vector of the five highest values of the tax per pack.
(e) Create a variable cigdata$statetax_pct equal to the state-tax percentage, defined as the tax per pack
divided by the average pack price for each state. What are the minimum, maximum, and average of this
variable?
2 Probability
Probability theory underlies all of statistics, providing a mathematical framework for quantifying uncertainty and
randomness in real-world phenomena. It allows practitioners to model and analyze the likelihood of different outcomes
in a given situation. Key statistical concepts, like estimation, confidence intervals, and hypothesis tests, are all based
upon probability theory. Therefore, to properly understand statistics and apply statistical inference, it is important to
have a solid foundation in the basic concepts of probability theory. This chapter provides an introduction to probability
theory by discussing the meaning of probability and introducing the fundamental properties of probabilities. Chapter 3
builds upon the material of this chapter and introduces the concepts of conditional probability and independence.
Before jumping into terminology and definitions, we first consider a motivating example:
Example 2.1 (Widget website) The website widgets.com has 3,000 total registered users and wants to test the
effectiveness of two possible e-mail campaigns. E-mail A is sent to 300 users at random, e-mail B is sent to 300 users
at random, and the other 2,400 users receive no e-mail. For each registered user, widgets.com has the following
information at some point (say, one week) after the e-mail campaigns:
campaign = A if user receives e-mail A
           B if user receives e-mail B
           None if user receives no e-mail
purchase = Y if user has made a purchase in the last week
           N if user has not made a purchase in the last week
Despite being a relatively simple example, there are many probability concepts associated with this experiment. First,
what is meant by “at random” when we say that an e-mail is sent to users at random? Intuitively, we mean that any
user has the same chance of receiving the e-mail as any other user. One way to choose the users at random for the two
e-mail campaigns is as follows:
• E-mail A: The first recipient is chosen randomly from the 3,000 users, so that everyone has a 1/3000 chance of
being chosen. The second recipient is chosen randomly from the 2,999 remaining users, so that each remaining user
has a 1/2999 chance of being chosen. This process continues through the 300th recipient, with each of the 2,701
remaining users having a 1/2701 chance of being chosen.
• E-mail B: The first recipient is chosen randomly from the 2,700 remaining users, so that each remaining user has
a 1/2700 chance of being chosen. The second recipient is chosen randomly from the 2,699 remaining users, so that
each remaining user has a 1/2699 chance of being chosen. This process continues through the 300th recipient, with
each of the 2,401 remaining users having a 1/2401 chance of being chosen.
• The 2,400 remaining users who did not receive e-mail A or e-mail B are the users for which the value of campaign
is None. We can think of this group of users as a “control group” to which we can compare the e-mail A recipients
and/or the e-mail B recipients.
This method of selecting the e-mail recipients is called sampling without replacement. Starting with the full sample
of 3,000 users, the 300 e-mail A recipients are chosen one-by-one. Once a user is randomly selected to receive e-
mail A, that user can not be randomly selected again. The user is not replaced back into the sample, from which the
terminology “without replacement” comes. Instead, the user is removed from the sample that is used to randomly
select the subsequent e-mail recipients.
The sample function in R can be used to implement the random selection of e-mail A and e-mail B recipients. Here
is a description of the usage of sample:
• sample(x, size, replace = FALSE, prob = NULL): Returns a random sample of size size from
the elements in the vector x. The optional argument replace indicates whether sampling should be done without
replacement (which is the default, replace = FALSE) or with replacement (replace = TRUE). The optional
argument prob specifies how likely each element of x is to be sampled, with the default that each element is equally
likely to be chosen.
If the 3,000 registered users are uniquely identified by a user number, ranging from 1 to 3000, the following code
randomly selects the e-mail recipients:
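temp <- sample(1:3000, 600) # randomly pick 600 of the 3,000 user numbers
emailA_recipient <- temp[1:300] # first 300 sampled users receive e-mail A
emailB_recipient <- temp[301:600] # last 300 sampled users receive e-mail B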
The vector 1:3000 represents the full set of users. The vector temp is created with the sample function by
randomly picking 600 elements from 1:3000, where each of the elements is equally likely to be chosen. Then, the
emailA_recipient vector is assigned to be the first 300 elements of temp, and the emailB_recipient vector
is assigned to be the last 300 elements of temp. The non-recipients are the elements of 1:3000 that are not in either
emailA_recipient or emailB_recipient.
Before implementation of this e-mail experiment, there are some known probabilities and some unknown
probabilities:
• For any given user, the chance or probability that she receives e-mail A is known. Since each user has the same
chance of being chosen to receive e-mail A, this known probability is 300/3000 = 1/10 or 10%. Similarly, the known
probability of receiving e-mail B is 1/10 or 10%, and the known probability of receiving no e-mail is 2400/3000 = 4/5
or 80%.
• For any user who receives e-mail A, the chance or probability that she makes a purchase is an unknown probability.
The same is true of the probability that an e-mail B recipient makes a purchase and the probability that a user who
received neither e-mail makes a purchase.
Ultimately, the goal of the e-mail experiment by widgets.com is to determine how effective the e-mail campaigns
are. Specifically, the following questions are of interest:
• Is campaign A more (or less) effective than campaign B? In terms of probabilities, is the probability of purchase
by an e-mail A recipient higher (or lower) than the probability of purchase by an e-mail B recipient?
• Is campaign A more (or less) effective than no campaign?
Suppose widgets.com finds that 60 e-mail A recipients (20% of the 300) made a purchase, 66 e-mail B recipients
(22% of the 300) made a purchase, and 360 of the non-recipients (15% of the 2400) made a purchase. For these
specific users, the results indicate that the e-mail B campaign is slightly more successful than the e-mail A campaign
(22% versus 20%) in leading to purchases and even more so when compared to the non-recipients (22% versus 15%).
But is there enough evidence to conclude that the e-mail B campaign is truly better than the e-mail A campaign (or
no e-mail at all)? Should widgets.com use e-mail B for its future campaigns instead of e-mail A? Perhaps the
higher purchase rate (22%) for e-mail B recipients happened by chance, and it might not be the case that we would
see the same result for a new batch of users receiving the two e-mail campaigns. As seen later in the book, the power
of statistical inference is the ability to take outcomes such as these (i.e., the purchasing outcomes for the three groups)
and provide an analysis of what might happen for new users who are presented with e-mail A, e-mail B, or neither.
Before leaving this example, it is worth noting that widgets.com might not be simply interested in whether or
not a purchase is made but also the revenue from such purchases. For example, the following information, in addition
to campaign and purchase above, could be collected:
amount = dollar amount of purchases by the user (equal to zero if no purchases) in the last week
Whereas purchase is a binary outcome, with the two possibilities purchase (Y) or no purchase (N), amount can
potentially take on many different values, depending upon the prices and quantities of the widgets available for
purchase. Based upon the results of their e-mail campaign experiment, widgets.com might like to determine which
campaign is likely to be more successful with respect to revenues or profits for new users.
Definition 2.2 The sample space, denoted S, is the set of all possible outcomes for an experiment.
The mathematical concept of a set is used in Definition 2.2. To review, a set is a collection of distinct objects that is
specified by listing its elements within curly braces { and }. The ordering of the elements within a set does not matter,
and a set does not have duplicate elements.
The simplest sample space has two possible outcomes. If there is only one outcome, there is no uncertainty, and
therefore it does not constitute an experiment. A classic example of a two-outcome sample space is a coin toss, where
the possible outcomes are heads (denoted H) and tails (denoted T). The sample space for a coin toss is S = {H, T}.
Since the order of outcomes doesn’t matter, S = {T, H} is equivalent to S = {H, T}.
Here are some additional examples of two-outcome sample spaces:
• asset return for the year: S = {U, D}, where U denotes a positive return (“up”) and D denotes a negative return
(“down”)
• website visitor purchase behavior: S = {Y, N}, where Y indicates purchase and N indicates no purchase
• student exam result: S = {P, F}, where P indicates pass and F indicates fail
• worker’s union status: S = {U, NU}, where U indicates a union worker and NU indicates a non-union worker
Of course, sample spaces can have more than two outcomes. Some examples include:
• roll of a six-sided die: S = {1, 2, 3, 4, 5, 6}
• day of the week of a baby’s birth: S = {Mon, Tue, Wed, Thu, Fri, Sat, Sun}
• number of days during January for which a city’s temperature is above freezing: S = {0, 1, 2, ..., 31}, where the
“...” shorthand indicates that the sample space contains 32 outcomes, consisting of all integers between 0 and 31
(inclusive)
Oftentimes, an experiment involves multiple “trials” of the same underlying process:
• tossing two coins: S = {HH, HT, TH, TT}
• purchase behavior of the first three visitors to a website on a given day:
S = {YYY, YYN, YNY, YNN, NYY, NYN, NNY, NNN}
In these last two examples, we’ve implicitly assumed that the order of the outcomes matters, with the outcome HT
(heads then tails) being different from the outcome TH (tails then heads). The number of possible outcomes for the
two coin tosses is four, which is the number of possible outcomes for the first toss (two) times the number of possible
outcomes for the second toss (two). The number of possible outcomes for the purchase behavior of the three website
visitors is eight, which is the number of possible outcomes for the first visitor (two) times the number of possible
outcomes for the second visitor (two) times the number of possible outcomes for the third visitor (two). Using this
same logic, the size of the sample space for an experiment involving the tossing of five coins is 2 · 2 · 2 · 2 · 2 = 2⁵ = 32,
and the size of the sample space for an experiment involving the tossing of ten coins is 2¹⁰ = 1024.
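This multiplication logic is easy to check in R. The following sketch (not from the text; the names outcomes and S are chosen for illustration) enumerates the sample space for the three website visitors with the expand.grid function and counts its outcomes:
# all Y/N combinations for three visitors: 2 * 2 * 2 = 8 outcomes
outcomes <- expand.grid(first = c("Y","N"), second = c("Y","N"), third = c("Y","N"))
S <- paste0(outcomes$first, outcomes$second, outcomes$third)
length(S)
## [1] 8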
For the coin toss and website purchase examples, if we only care about the total number of heads or purchases, the
sample spaces would be:
• number of heads from two coin tosses: S = {0, 1, 2}
• number of purchases from the first three visitors to a website on a given day: S = {0, 1, 2, 3}
As compared with the sample spaces for the sequences of coin tosses or website purchases, these sample spaces have
fewer possible outcomes. Why? Here, the order of the outcomes doesn’t matter since only the total number of heads
or purchases matters. For the coin tosses, the outcome of 1 total head arises when the sequence HT or the sequence
TH occurs. For the website purchases, the outcome of 1 total purchase arises when YNN, NYN, or NNY occurs.
In R, we can use a vector to represent a finite sample space. For example, the following code creates vectors for the
sample spaces for a coin toss, a roll of a die, and a sequence of two coin tosses:
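# (vector names chosen for illustration; the original code is not shown)
coin_space <- c("H","T")
die_space <- 1:6
twocoins_space <- c("HH","HT","TH","TT")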
If there is concern that a vector vec contains duplicate elements, unique(vec) returns a vector (or set) of the
distinct elements. The sort function can be used to order the elements, if desired.
Example 2.2 (Two car dealerships) Suppose you own two car dealerships, dealership A and dealership B, where
dealership A has four salespeople and dealership B has three salespeople. Consider the experiment associated with
the number of salespeople at dealership A and the number of salespeople at dealership B that sell a car on a given
day. The following table enumerates all of the possible outcomes associated with the sample space for this experiment.
               Dealership B
                 0       1       2       3
Dealership A
        0      (0, 0)  (0, 1)  (0, 2)  (0, 3)
        1      (1, 0)  (1, 1)  (1, 2)  (1, 3)
        2      (2, 0)  (2, 1)  (2, 2)  (2, 3)
        3      (3, 0)  (3, 1)  (3, 2)  (3, 3)
        4      (4, 0)  (4, 1)  (4, 2)  (4, 3)
For instance, the outcome (2, 1) occurs when two of the salespeople at dealership A and one of the salespeople at
dealership B sell a car on a given day, and the outcome (0, 0) occurs when no one sells a car on a given day. The
sample space S is the set of the 20 possible outcomes in the table. The number of possible outcomes is equal to the
five possible outcomes for dealership A times the four possible outcomes for dealership B.
Although the examples considered so far have had a finite number of outcomes, Definition 2.2 does not restrict the
sample space to be a finite set. For example, the sample space corresponding to the number of patents that a company
receives in a given year can be written as S = {0, 1, 2, 3, …}, the set of all non-negative integers. It may seem strange
to specify this sample space to be infinite, as it is seemingly impossible for a company to receive 1 million patents in a
given year. But there’s no obvious way to specify what the maximum number of patents should be (1,000? 10,000?), so
it makes more sense to allow arbitrarily large values even though the larger values are extremely unlikely. The sample
space S = {0, 1, 2, 3, …} is useful for many other situations where it is not obvious how we would arbitrarily set a
maximum possible value, for instance the number of children in a given family, the number of times that an individual
is arrested, the number of initial public offerings (IPOs) on the New York Stock Exchange in a given year, etc.
Infinite sample spaces can arise in other ways. Thinking about website purchases again, if we are interested in the
sequence of website visitors on a given day until a purchase is made, the sample space is
S = {Y, NY, NNY, NNNY, NNNNY, …}.
An equivalent approach is to consider the sample space corresponding to the number of visitors observed until a
purchase is made, which is
S = {1, 2, 3, 4, 5, …}.
These two approaches are equivalent since there is a direct one-to-one relationship between the outcomes in the two
sample spaces. Y corresponds to 1, NY corresponds to 2, NNY corresponds to 3, and so on.
An infinite sample space also naturally arises when the underlying experiment involves a quantity that can be a real
number. Some examples include the following:
• the income of a given individual: S = [0, ∞), where [0, ∞) denotes all non-negative real numbers
– The outcome 0 corresponds to an individual with no income. The specification of S here allows for an
arbitrarily large income value, as there’s no way to reasonably set a maximum.
• the fraction of income that a given employed individual saves in a given year: S = [0, 1], where [0, 1] denotes all
real numbers between 0 and 1 (inclusive)
– The outcomes 0 and 1 correspond to the individual saving none of their income or all of their income,
respectively. An outcome of 0.2 corresponds to the individual saving 20% of their income. Allowing for
all real numbers between 0 and 1 is convenient here since it places no restrictions on what the savings
(numerator) or the income (denominator) can be. Even in cases where the observed data are rounded (e.g.,
to two decimal places), it is still useful to specify the sample space in terms of real numbers for conceptual
purposes.
• the return of an asset in a given year: S = [–1, ∞), where [–1, ∞) denotes all real numbers greater than or equal to
–1
– If the asset price at the beginning of the year is p0 and at the end of the year is p1, the asset return is (p1 – p0)/p0.
Assuming the asset price can not be negative, the lowest possible asset return is –1, which occurs when
p1 = 0. There is no upper bound on the sample space S since p1 can be arbitrarily large.
2.2 Events
As defined above, the sample space is the set of the possible outcomes for a given experiment. To be able to consider
the possible combinations of an experiment’s outcomes, we define an event and some specific types of events:
Definition 2.3 An event is any subset of outcomes within the sample space S. A simple event is exactly one outcome
from S, whereas a composite event consists of more than one outcome from S.
If the sample space S is finite, the number of simple events is the size of the sample space S. The sample space S
is itself an event since S is a subset of itself and, moreover, is a composite event since S has more than one outcome.
Definition 2.4 For any event E with a finite number of outcomes, let |E| denote the number of outcomes in event E.
Using this notation, a simple event E has |E| = 1, and a composite event E has |E| > 1.
Example 2.3 (Three website visitors) For the first three visitors to a website on a given day, the sample space for their
purchase behavior is
S = {YYY, YYN, YNY, YNN, NYY, NYN, NNY, NNN}.
Each element of S is a simple event. For example, the event E = {YYY} is a subset of S corresponding to three purchases
in a row. There are eight different simple events since |S| = 8. How about the event A that the first two visitors make a
purchase? A = {YYY, YYN} is a composite event, with two possible outcomes from S. How about the event B that two
total purchases are made? B = {YYN, YNY, NYY} is also a composite event, with three possible outcomes from S.
Example 2.4 (Two car dealerships) In Example 2.2, the sample space S has 20 outcomes, so that |S| = 20. There are
20 simple events associated with S. One example is the event E = {(2, 2)} that two salespeople at each dealership
sell a car on a given day. How about the event, denoted E′, that the total number of salespeople who sell a car is
equal to two? This composite event is E′ = {(0, 2), (1, 1), (2, 0)}. How about the event, denoted E′′, that the number of
salespeople at dealership B who sell a car is greater than the number of salespeople at dealership A who sell a car?
This composite event is E′′ = {(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)}.
For any event, the result of the experiment is that either the event occurs or the event does not occur. If the event
does not occur, we say that the complement of the event occurs.
Definition 2.5 The complement of an event A, denoted Ac , is the set of all outcomes in the sample space S that are
not in A.
Example 2.5 (Three website visitors) From Example 2.3, the event B = {YYN, YNY, NYY} corresponds to two total
purchases. Its complement is Bc = {YYY, YNN, NYN, NNY, NNN}, which contains all outcomes for which there were
not two total purchases made, or equivalently Bc contains all outcomes for which either zero, one, or three total
purchases were made. The size of the event Bc (5 outcomes) is the size of the sample space S (9 outcomes) minus the
size of the event B (4 outcomes).
Since any outcome in the sample space S must be either in A or its complement Ac , we have the following
proposition:
Proposition 2.1. For a sample space S with a finite number of outcomes, |A| + |Ac | = |S| for any event A.
Since many possible events may be associated with a sample space S, it is useful to have ways to think about
multiple events at once. The definitions below introduce the concepts of the union of two events and the intersection of
two events, which correspond to either of two events occurring (the union) and both events occurring (the intersection).
We start with the union.
Definition 2.6 The union of events A and B, denoted A ∪ B, is the set of all outcomes in event A or event B or in both.
The union is sometimes read as “A or B.”
Example 2.6 (Three website visitors) Recall that event A = {YYY, YYN} corresponds to the first two visitors
making a purchase and event B = {YYN, YNY, NYY} to a total of two purchases being made. The event A ∪ B =
{YYY, YYN, YNY, NYY} corresponds to the first two visitors making a purchase or a total of two purchases being
made. The outcome YYN is in both event A and event B, but it appears only once in the event A ∪ B since A ∪ B is a set.
If A and B are events with a finite number of outcomes, |A ∪ B| must always be less than or equal to the sum of |A|
and |B|. In Example 2.6, |A| = 2, |B| = 3, and |A ∪ B| = 4. The only case in which |A ∪ B| = |A| + |B| is when A and B have
no outcomes in common; in such a case, A and B are said to be disjoint events.
Definition 2.7 Events A and B are disjoint events (or mutually exclusive events) if they have no outcomes in common;
that is, there is no outcome in the sample space S that is an element of both event A and event B.
In Example 2.6, A and B are not disjoint events since they share the outcome YYN. Since disjoint events have no
outcomes in common, they can be thought of as events that can not possibly occur at the same time. In the example
of tossing two coins, where S = {HH, HT, TH, TT}, the event A that an equal number of heads and tails appear (A =
{HT, TH}) and the event B that two heads appear (B = {HH}) are disjoint events since they have no outcomes in
common and can not possibly occur at the same time.
Proposition 2.2. If A and B are events with a finite number of outcomes, |A ∪ B| ≤ |A| + |B|. Moreover, |A ∪ B| = |A| + |B|
if A and B are disjoint events, and |A ∪ B| < |A| + |B| if A and B are not disjoint events.
We now move to the intersection of two events:
Definition 2.8 The intersection of events A and B, denoted A ∩ B, is the set of all outcomes in both event A and
event B. The intersection is sometimes read as “A and B.”
Example 2.7 (Three website visitors) Recall that event A = {YYY, YYN} corresponds to the first two visitors making a
purchase and event B = {YYN, YNY, NYY} to a total of two purchases being made. The event A ∩ B corresponds to the
first two visitors making a purchase and a total of two purchases being made. In this case, A ∩ B = {YYN} since YYN
is the only outcome in both events.
If A and B are events with a finite number of outcomes, |A ∩ B| can not possibly be larger than either |A| or |B|:
Proposition 2.3. If A and B are events with a finite number of outcomes, |A ∩ B| ≤ |A| and |A ∩ B| ≤ |B|. The only case
in which |A ∩ B| = |A| is when event A is a subset of event B (i.e., all outcomes in A are also outcomes in B). Similarly,
the only case in which |A ∩ B| = |B| is when event B is a subset of event A.
Again focusing on events with a finite number of outcomes, there is an interesting relationship between the size of
the union A ∪ B and the size of the intersection A ∩ B. We know that all of the outcomes in event A are in A ∪ B, and all
of the outcomes in event B are in A ∪ B. However, the size of A ∪ B is not necessarily equal to |A| + |B| since Proposition
2.2 states that |A ∪ B| < |A| + |B| when A and B are not disjoint (i.e., they have outcomes in common). Note that A ∪ B
contains three types of outcomes: (i) outcomes in both A and B, of which there are |A ∩ B|; (ii) outcomes in A but not B,
of which there are |A| – |A ∩ B|; and (iii) outcomes in B but not A, of which there are |B| – |A ∩ B|. Therefore, the number
of outcomes in A ∪ B is |A ∩ B| + |A| – |A ∩ B| + |B| – |A ∩ B|, or |A| + |B| – |A ∩ B|. In this last expression, the subtraction
of |A ∩ B| effectively eliminates the double-counting in |A| + |B| for the outcomes that appear in both A and B. This
result is formally stated in the following proposition:
Proposition 2.4. If A and B are events with a finite number of outcomes, |A ∪ B| = |A| + |B| – |A ∩ B|.
Previously, Proposition 2.2 stated that |A ∪ B| = |A| + |B| only in the case that A and B are disjoint. When A and B are
disjoint, there are no outcomes in A ∩ B, so that |A ∩ B| = 0, which agrees with the result in Proposition 2.4.
An event that has no outcomes is called the null event and is formally defined as follows:
Definition 2.9 The null event consists of no outcomes and is denoted ∅. The size of the null event is |∅| = 0.
Disjoint events A and B have A ∩ B = ∅.
Several R functions facilitate the use and manipulation of (finite) events or sets, including the following:
• elt %in% x: Returns TRUE if the element elt is in the set x and FALSE otherwise.
• union(x, y): Returns the union of the sets x and y.
• intersect(x, y): Returns the intersection of the sets x and y.
• setdiff(x, y): Returns the elements of the set x that are not in the set y. If x is the sample space, then the
expression setdiff(x, y) returns the complement of the event y.
Also, since a vector represents an event or set in R, the number of outcomes in the event is the size of the vector
and can be determined with the length function. The following code illustrates the use of these functions for the
example above where S is the sample space for the purchase behavior of three customers, A is the event that the first
two customers make a purchase, and B is the event that a total of two customers make a purchase:
S <- c("YYY","YYN","YNY","YNN","NYY","NYN","NNY","NNN")
A <- c("YYY","YYN")
B <- c("YYN","YNY","NYY")
length(S)
## [1] 8
"YYY" %in% S
## [1] TRUE
union(A,B)
## [1] "YYY" "YYN" "YNY" "NYY"
intersect(A,B)
## [1] "YYN"
setdiff(S,B)
## [1] "YYY" "YNN" "NYN" "NNY" "NNN"
no_purchases <- c("NNN")
intersect(no_purchases,A)
## character(0)
length(intersect(no_purchases,A))
## [1] 0
The total number of outcomes, returned by length(S), is 8. The union and intersect functions properly
return A ∪ B and A ∩ B, respectively. The setdiff(S,B) expression returns the elements of S not in B, which
is the complement Bc . The last two expressions consider the intersection of the event of no purchases, {NNN},
with the event A. The intersection {NNN} ∩ A = ∅, which is indicated by the output of character(0) from the
intersect(no_purchases,A) expression. The size of {NNN} ∩ A = ∅ is zero, as indicated by the length
function.
For another example of the union and the intersection of events, we return to the car dealership example:
Example 2.8 (Two car dealerships) Recall from Example 2.4 that E′ = {(0, 2), (1, 1), (2, 0)} is the event that the total
number of salespeople selling a car is 2 and E′′ = {(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)} is the event that the number
of salespeople at dealership B selling a car is greater than the number of salespeople at dealership A selling a car. The
events E′ and E′′ are not disjoint since they share the outcome (0, 2). The intersection E′ ∩ E′′ is the event containing
this one outcome, E′ ∩ E′′ = {(0, 2)}. The union E′ ∪ E′′, which contains all outcomes for which the total number of
salespeople selling a car is 2 or for which more dealership B salespeople sell a car than dealership A salespeople, is
E′ ∪ E′′ = {(0, 1), (0, 2), (0, 3), (1, 1), (1, 2), (1, 3), (2, 0), (2, 3)}.
We have |E′| = 3, |E′′| = 6, |E′ ∩ E′′| = 1, and |E′ ∪ E′′| = 3 + 6 – 1 = 8, where the –1 eliminates the double-counting of the
(0, 2) outcome in the union.
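These counts can be checked in R by representing each outcome as a string (a quick sketch; the vector names are illustrative):
Eprime <- c("(0,2)","(1,1)","(2,0)")
Edblprime <- c("(0,1)","(0,2)","(0,3)","(1,2)","(1,3)","(2,3)")
length(union(Eprime, Edblprime))
## [1] 8
length(intersect(Eprime, Edblprime))
## [1] 1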
The notion of unions and intersections can be extended to more than two events, as follows:
Definition 2.10 Suppose there are k events, denoted A1 , A2 , …, Ak . The union of these k events, denoted A1 ∪ A2 ∪
· · · ∪ Ak , is the set of all outcomes that are in any of the events A1 , A2 , …, Ak .
Definition 2.11 Suppose there are k events, denoted A1 , A2 , …, Ak . The intersection of these k events, denoted A1 ∩
A2 ∩ · · · ∩ Ak , is the set of all outcomes that are in all of the events A1 , A2 , …, Ak .
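Consider simulating ten tosses of a fair coin with the sample function:
sample(c("H","T"), 10, replace = TRUE)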
The sample space {H, T}, represented by the vector c("H","T"), is the first argument for the sample function.
The second argument (10) indicates the number of coin tosses to be simulated. The argument replace = TRUE
specifies sampling with replacement; that is, for each coin toss, it is always possible to get heads or tails regardless of
what has occurred previously. Finally, there is no need to specify the optional prob argument since the two outcomes
are equally likely. If we had an “unfair” coin that was more likely to come up heads, say with a 60% chance, then the
command sample(c("H","T"), 10, replace = TRUE, prob = c(0.6,0.4)) would be used, with
probabilities 0.6 and 0.4 associated with "H" and "T", respectively.
Using the sample function, we can conduct many random simulations and summarize the results with very few
lines of code, sometimes even just one line of code. For example, to simulate 10,000 coin tosses, we change the second
argument of the sample function. And, to check how often heads occurs, we use the mean function to calculate the
proportion of the vector elements equal to H:
set.seed(1234)
cointosses <- sample(c("H","T"), 10000, replace = TRUE)
mean(cointosses=="H")
## [1] 0.4963
The set.seed(1234) command sets the random seed in the computer. The argument 1234 has been arbitrarily
chosen, but the purpose of the set.seed function is to ensure that we get the same simulation results if we run it
at a different time or on a different computer. Setting a random seed is good practice, as it makes it easier for you
and others to replicate results from simulations involving randomness. Throughout the book, the set.seed(1234)
command is used before any simulations involving randomness. For each of the 10,000 tosses, the computer randomly
chooses between heads (H) and tails (T). The vector cointosses is assigned to the full vector of 10,000 coin tosses,
and the mean(cointosses=="H") command calculates the proportion of heads over the 10,000 tosses. The result
is a proportion of 49.63%, or 4,963 heads out of 10,000 tosses.
To see how the calculated frequency of heads changes as the number of simulations increases, Figure 2.1 plots
calculated frequencies against the simulation number. After each of the 10,000 tosses, we calculate the fraction of
heads that have occurred so far. The x-axis has the number of the coin toss, and the y-axis has the fraction of heads
after each toss. A dotted line is drawn at the value of 0.5. When very few tosses have occurred, which corresponds to
the leftmost part of the x-axis, the fraction of heads can be quite far away from 0.5, with the points appearing to be
more “noisy.” As the number of tosses increases, moving to the right along the x-axis, the points get less “noisy” with
the fraction of heads stabilizing around 0.5. The complete R code is as follows:
Figure 2.1
Head frequency for 10,000 simulated coin tosses (x-axis: simulation number; y-axis: cumulative frequency of heads, with a dotted line at 0.5)
set.seed(1234)
cointosses <- sample(c("H","T"), 10000, replace = TRUE)
cumul_heads <- cumsum(cointosses=="H") # cumulative count of heads after each toss
freq_heads <- cumul_heads/(1:10000) # cumulative fraction of heads after each toss
# plot the freq_heads vector (when plot has only a single variable (vector) argument,
# it plots the variable versus 1 through the length of the vector)
plot(freq_heads,xlab="Simulation number",ylab="Cumulative frequency of heads",cex=0.5)
abline(h=0.5,lty=3)
The function cumsum calculates the cumulative number of heads through any number of tosses, which is stored in
the vector cumul_heads. Figure 2.1 is created by the plot and abline functions, with the plot command
plotting the vector freq_heads against the number of the simulation and the abline command drawing a
horizontal line at the value 0.5 (h=0.5 argument) that is dotted (lty=3 argument). To highlight the dotted line,
we could also change its color; for example, adding col="blue" as an argument to the abline command would
make it blue.
How many coin tosses are required for the fraction of heads to be close to 0.5 and stabilize near that value? While
we don’t have the tools to answer that question yet, the takeaway from Figure 2.1 is that a higher number of tosses
has a realized fraction of heads that is more likely to be closer to the probability P(A) = 0.5. Additional examples of
computer-simulated experiments are provided below, but at this point we can provide a more formal description of
what is meant by P(A), the probability of an event A associated with an experiment and sample space.
Imagine being able to repeat an experiment a large number of times, say n times. Think really large — a million, a
billion, a trillion, or more! For each experiment, record whether the event A has occurred. Let nA be the total number of
times that event A occurs over the n experiments. The fraction, or frequency, of experiments in which event A occurs
is equal to nA/n. Then, P(A) is the number that nA/n approaches as n gets arbitrarily large. (We are implicitly assuming
that there is a number (a “limit”) that nA/n approaches as n gets arbitrarily large. This idea, known as the Law of Large
Numbers, is discussed in Chapter 13.)
Thinking about probability in terms of a large number of repeated experiments is known as the frequentist
interpretation of probability. The word “frequentist” is used since the probability of event A is viewed as the long-run
frequency of A occurring in a large number of repeated experiments.
Before discussing the properties of probabilities, a few more examples are considered to illustrate the frequentist
interpretation of probability.
Example 2.10 (Tossing two coins) Consider the experiment of tossing two fair coins. The sample space S =
{HH, HT, TH, TT} has four possible outcomes. If A = {HH} is the event corresponding to two heads, what is P(A)? For
fair coins, it’s perhaps not surprising that each of the four outcomes in S are going to be equally likely, so we would
expect P(A) = 0.25. (In Chapter 3, we formalize why this probability is equal to 0.25.) Figure 2.2 shows the results from
10,000 simulations, with the computer randomly tossing two coins for each simulation. The x-axis is the number of the
simulation, and the y-axis is the calculated frequency of HH occurring through that number of simulations. Similar to
the simple coin toss example, the frequencies appear to stabilize when the number of experiments gets larger, but here
the stabilization occurs at a level close to the P(A) = 0.25 value, indicated by the horizontal dotted line.
set.seed(1234)
coin1tosses <- sample(c("H","T"), 10000, replace = TRUE)
coin2tosses <- sample(c("H","T"), 10000, replace = TRUE)
cumul_twoheads <- cumsum(coin1tosses=="H" & coin2tosses=="H") # running count of two-heads
The vectors coin1tosses and coin2tosses each contain the outcomes of 10,000 coin tosses. The occurrence
of two heads happens when the corresponding elements of these two vectors are both "H". For example, the fifth
simulation of two coin tosses results in two heads when both coin1tosses[5] and coin2tosses[5] are "H".
The vector cumul_twoheads contains the cumulative number of times that the two-heads event has occurred.
If we are only interested in approximating the probability of getting two heads in two tosses, rather than tracking
the cumulative frequencies, the following code suffices:
Figure 2.2
Double-head frequency for 10,000 simulations of two coin tosses (x-axis: simulation number; y-axis: cumulative frequency of two-heads, with a dotted line at 0.25)
set.seed(1234)
coin1tosses <- sample(c("H","T"), 10000, replace=TRUE)
coin2tosses <- sample(c("H","T"), 10000, replace=TRUE)
mean(coin1tosses=="H" & coin2tosses=="H")
## [1] 0.2513
Example 2.11 (Six-sided die) Consider the experiment of rolling a fair six-sided die. The sample space S =
{1, 2, 3, 4, 5, 6} has six possible outcomes. Let A = {6} be the simple event that a 6 is rolled. The die being “fair”
means that each of the six outcomes is equally likely, so that P(A) = 1/6. Figure 2.3 shows the results from 10,000
simulations of this experiment, with the computer randomly rolling a six-sided die for each of the 10,000 simulations.
The frequency of a 6 being rolled stabilizes as the number of experiments gets larger, and it appears to stabilize at
around 1/6, which is the level of the horizontal dotted line.
Figure 2.3
Six frequency for 10,000 simulations of a die roll (x-axis: simulation number; y-axis: cumulative frequency of sixes, with a dotted line at 1/6)
set.seed(1234)
dierolls <- sample(1:6, 10000, replace = TRUE)
The sample space is represented by the vector 1:6, consisting of the integers between 1 and 6 (inclusive), and
this argument is the only difference from the coin-toss example with sample space c("H","T"). The optional prob
argument is not specified since each outcome is equally likely.
If we are only interested in approximating the probability of rolling a 6, the following code suffices:
set.seed(1234)
dierolls <- sample(1:6, 10000, replace = TRUE)
mean(dierolls==6)
## [1] 0.1652
Definition 2.12 A1 , A2 , …, Ak are a collection of disjoint events if there is no pair of events within the collection for
which there is a shared outcome. In terms of mathematical notation, A1 , A2 , …, Ak are a collection of disjoint events if
Ai ∩ Aj = ∅ for any i, j ∈ {1, …, k} with i ≠ j.
Intuitively, A1 , A2 , …, Ak being disjoint means that any given event, say Aj , can not possibly happen if any of the
other events happens. Axiom 3 states that the probability that any of the disjoint events occurs is equal to the sum of
the probabilities of each of the individual events. A simple case of disjoint A1 , A2 , …, Ak arises when each Aj contains
a single outcome (i.e., Aj is a simple event) that is different from the other (simple) events.
Example 2.12 (Three website visitors) For the three website visitors, let Aj denote the event that exactly j total
purchases are made, and let B be the event that at least one purchase is made, so that B = A1 ∪ A2 ∪ A3. Since A1, A2,
and A3 are disjoint, Axiom 3 implies that P(B) = P(A1) + P(A2) + P(A3): the probability of at least one purchase being
made is the probability of exactly one purchase being made plus the probability of exactly two purchases being made
plus the probability of exactly three purchases being made.
At this point, we can’t say anything more about the value of this probability since the purchase probability is unknown.
The probability axioms lead to several other interesting properties of probabilities, some of which are stated in the
following proposition:
Proposition 2.5. (Properties of probabilities) Let A and B be any two events. The following properties are implied by
the Axioms of Probability:
(i) (Probability of a complement) P(Ac ) = 1 – P(A).
(ii) (Probability of the null event) P(∅) = 0.
(iii) (Partitioning an event) P(A) = P(A ∩ B) + P(A ∩ Bc ).
(iv) (Probability of the union of events) P(A ∪ B) = P(A) + P(B) – P(A ∩ B).
(v) If A and B are disjoint events, P(A ∩ B) = 0 and P(A ∪ B) = P(A) + P(B).
(vi) (De Morgan’s laws for probabilities) P((A ∪ B)c ) = P(Ac ∩ Bc ) and P((A ∩ B)c ) = P(Ac ∪ Bc ).
(vii) (Equally likely outcomes) Suppose S is a finite sample space with k possible outcomes. If every outcome in S
is equally likely to occur,
P(A1 ) = P(A2 ) = · · · = P(Ak ) = 1/k
for each of the simple events A1 , A2 , …, Ak in S.
Property (i) follows from the facts that Ac and A are disjoint, Ac ∪ A = S, and P(S) = 1, which taken together imply
P(Ac ) + P(A) = 1 or, equivalently, P(Ac ) = 1 – P(A).
Example 2.13 (Three website visitors) In Example 2.12, we considered the probability of the event B that at
least one purchase is made. An alternative approach to finding that probability is to consider Bc , which is the
event A0 that no purchases are made, Bc = A0 = {NNN}. Applying property (i), we have P(B) = 1 – P(A0 ). Since
we know that P(A0 ∪ A1 ∪ A2 ∪ A3 ) = P(A0 ) + P(A1 ) + P(A2 ) + P(A3 ) = 1, note that P(B) = 1 – P(A0 ) is equivalent to
P(B) = P(A1 ) + P(A2 ) + P(A3 ).
Property (ii) says that the probability of the null event is equal to zero, which intuitively makes sense since it is
impossible for the experiment to have no outcome. This property follows from ∅ = S c , to which we apply property (i):
P(∅) = 1 – P(S) = 1 – 1 = 0.
Property (iii) involves partitioning event A into two disjoint events, one event with outcomes also in the event B,
which is A ∩ B, and one event with outcomes not in event B, which is A ∩ Bc . The two events A ∩ B and A ∩ Bc are
disjoint since no outcome can be in both B and Bc . The union of A ∩ B and A ∩ Bc is the event A since every outcome
in A is either in A ∩ B or A ∩ Bc . Applying Axiom 3 then implies that P(A) = P(A ∩ B) + P(A ∩ Bc ).
Example 2.14 (Six-sided die) Let A = {3, 4, 5, 6} be the event of rolling at least a 3. Let B = {2, 4, 6} be the event of
rolling an even number. A can be partitioned into its even numbers, A ∩ B = {4, 6}, and its odd numbers, A ∩ Bc = {3, 5}.
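This partition can be verified with the set functions introduced earlier (a quick sketch):
A <- 3:6
B <- c(2, 4, 6)
intersect(A, B) # the even numbers of A, which is A ∩ B
## [1] 4 6
setdiff(A, B) # the odd numbers of A, which is A ∩ Bc
## [1] 3 5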
Property (iv), which states P(A ∪ B) = P(A) + P(B) – P(A ∩ B) for events A and B, provides the relationship between
the probability of a union of events and the probability of an intersection of events. We already know from Axiom
3 that for disjoint events A and B, we have P(A ∪ B) = P(A) + P(B). But property (iv) is more general, as it covers
events which are disjoint, when A ∩ B = ∅, and events that are not disjoint, when A ∩ B ≠ ∅. As we saw before
in the context of counting the number of outcomes in A ∪ B (recall Proposition 2.4, where |A ∪ B| = |A| + |B| – |A ∩ B|),
subtracting the probability P(A ∩ B) avoids the double-counting that occurs from the P(A) and P(B) terms for outcomes
that are in both A and B.
One way to visualize property (iv) is through the use of a Venn diagram, as shown in Figure 2.4. This figure shows
two different scenarios, one in which the intersection A ∩ B contains one or more outcomes (the Venn diagram on the
left) and one in which the intersection A ∩ B contains no outcomes (the Venn diagram on the right, having A ∩ B = ∅).
For both Venn diagrams, the light-gray circle corresponds to the event A and the dark-gray circle corresponds to the
event B. In the left Venn diagram, the intersection A ∩ B is the region where the two circles overlap. If the area of each
Figure 2.4
Venn diagrams and P(A ∩ B) (left: events A and B with a nonempty intersection A ∩ B; right: disjoint events A and B with A ∩ B = ∅)
circle is equal to the probability of the corresponding event, with the area of the overlap A ∩ B being equal to P(A ∩ B),
property (iv) follows intuitively. The quantity P(A) + P(B) is the sum of the areas of the two circles, but this sum double
counts the region A ∩ B. Therefore, to get P(A ∪ B), which is the area of the full shaded region that only counts the
region A ∩ B once, we must subtract off P(A ∩ B). This leads to P(A ∪ B) = P(A) + P(B) – P(A ∩ B). In the right Venn
diagram, the events A and B have no outcomes in common, so that A ∩ B = ∅ and, by property (v), P(A ∩ B) = 0. In this
case, property (iv) simplifies to P(A ∪ B) = P(A) + P(B) since there is no issue with double counting.
For the Venn diagram on the left, the event A is partitioned into two parts, A ∩ B (outcomes in A that are also in B)
and A ∩ Bc (outcomes in A that are not in B). This partitioning corresponds to property (iii), P(A) = P(A ∩ B) + P(A ∩ Bc).
Similarly, the event B is partitioned into two parts, A ∩ B (outcomes in B that are also in A) and Ac ∩ B (outcomes in B
that are not in A), so that P(B) = P(A ∩ B) + P(Ac ∩ B).
Example 2.15 (Six-sided die) Let A = {3, 4, 5, 6} be the event of rolling at least a 3. Let B = {2, 4, 6} be the event
of rolling an even number. Then, A ∪ B = {2, 3, 4, 5, 6} and A ∩ B = {4, 6}. Note that P(A) + P(B) double-counts the
probability of rolling a 4 or rolling a 6, and the subtraction of P(A ∩ B) corrects this double-counting in the formula
P(A ∪ B) = P(A) + P(B) – P(A ∩ B). With respect to the Venn diagram in Figure 2.4, the light-gray circle would represent
{3, 4, 5, 6}, the dark-gray circle would represent {2, 4, 6}, and the overlap would represent {4, 6}.
Property (v) states that it is impossible for two disjoint events to occur at the same time, which follows directly from
property (ii) since A ∩ B = ∅ for disjoint A and B.
Property (vi) states the probability properties associated with famous set-theory properties known as De Morgan’s
Laws. The two properties for sets are that
(A ∪ B)c = Ac ∩ Bc and (A ∩ B)c = Ac ∪ Bc .
The first set property is that the complement of a union of two events is the intersection of the complement of the two
events, which holds since any outcome which is not in either A or B must be in both Ac and Bc (and vice versa). Both
(A ∪ B)c and Ac ∩ Bc consist of outcomes that belong to neither A nor B. Similarly, the second set property is that the
complement of the intersection of two events is the union of the complement of the two events, which holds since any
outcome which is not in both A and B must be in either Ac or Bc (and vice versa). Both (A ∩ B)c and Ac ∪ Bc consist
of outcomes that are not in both A and B. Property (vi) follows immediately once we know (A ∪ B)c = Ac ∩ Bc and
(A ∩ B)c = Ac ∪ Bc .
Property (vii) provides the basis for calculating event probabilities when the outcomes of an experiment are equally
likely, as with the toss of a fair coin or the roll of a fair six-sided die. The result itself is a direct implication of
Axioms 2 and 3. The simple events A1 , …, Ak are disjoint, so that A1 ∪ · · · ∪ Ak = S and, applying Axioms 2 and 3, we
have P(A1) + P(A2) + · · · + P(Ak) = P(S) = 1. Since each of the k events is equally likely, each event probability is 1/k.
Example 2.16 (Six-sided die) Let A = {3, 4, 5, 6} be the event of rolling at least a 3. If all outcomes of S =
{1, 2, 3, 4, 5, 6} are equally likely, as for a fair die, the probability of any outcome is 1/6. The event A can be thought of
as the union of the four disjoint simple events {3}, {4}, {5}, and {6}, so that P(A) is the sum of the probability of these
four simple events, which is 4/6 or 2/3. Similarly, for the event B = {2, 4, 6} that an even number is rolled, P(B) = 3/6 = 1/2.
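For experiments with equally likely outcomes, event probabilities like these are easy to verify in R by representing the sample space and events as vectors. Here is a minimal sketch for the fair-die example (the vector names are our own):
> S <- 1:6
> A <- c(3, 4, 5, 6)
> B <- c(2, 4, 6)
> length(A)/length(S)
[1] 0.6666667
> length(B)/length(S)
[1] 0.5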
Property (iv) from Proposition 2.5, which is the union-of-events property, can be extended to more than two events,
as stated in the following proposition:
Proposition 2.6. (Generalization of the union-of-events property)
For two events A and B,
P(A ∪ B) = P(A) + P(B) – P(A ∩ B).
For three events A, B, and C,
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) – P(A ∩ B) – P(A ∩ C) – P(B ∩ C) + P(A ∩ B ∩ C).
For four events A, B, C, and D,
P(A ∪ B ∪ C ∪ D) = P(A) + P(B) + P(C) + P(D)
– P(A ∩ B) – P(A ∩ C) – P(A ∩ D) – P(B ∩ C) – P(B ∩ D) – P(C ∩ D)
+ P(A ∩ B ∩ C) + P(A ∩ B ∩ D) + P(A ∩ C ∩ D) + P(B ∩ C ∩ D)
– P(A ∩ B ∩ C ∩ D).
And, so on, for larger numbers of events.
For the case of three events in Proposition 2.6, where
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) – P(A ∩ B) – P(A ∩ C) – P(B ∩ C) + P(A ∩ B ∩ C),
Figure 2.5 provides a Venn diagram that illustrates how this property works. This figure is similar to the left diagram
in Figure 2.4 except that the third event C has been added. Again, the area of each circle should be thought of as the
probability of the associated event, as should any of the areas of the overlaps of the circles. The intersection events
A ∩ B, A ∩ C, B ∩ C, and A ∩ B ∩ C are indicated in the figure. To be clear, the region for intersection A ∩ B includes
the region for A ∩ B ∩ C; similarly, the regions for A ∩ C and B ∩ C also include the region for A ∩ B ∩ C. When
the areas of the three circles are added up, giving P(A) + P(B) + P(C), the sum double counts the areas associated
with A ∩ B, A ∩ C, and B ∩ C. As before, subtracting the quantities P(A ∩ B), P(A ∩ C), and P(B ∩ C) accounts for
the double counting. Unfortunately, when these three probabilities are subtracted, the area associated with A ∩ B ∩ C,
which is P(A ∩ B ∩ C), is subtracted one too many times. The region corresponding to A ∩ B ∩ C gets counted three
times in P(A) + P(B) + P(C) and then subtracted three times when P(A ∩ B), P(A ∩ C), and P(B ∩ C) are subtracted, so
the quantity P(A ∩ B ∩ C) is added back to get the correct P(A ∪ B ∪ C), as in the formula above.
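The inclusion-exclusion formula can be verified numerically in R for equally likely outcomes by computing both sides with set operations. A minimal sketch, using three arbitrarily chosen events on S = {1, …, 20} (the events and names are our own):
> S <- 1:20
> A <- S[S %% 2 == 0]; B <- S[S %% 3 == 0]; C <- 1:5
> p <- function(E) length(E)/length(S)   # probability under equally likely outcomes
> p(union(union(A, B), C))
[1] 0.75
> p(A) + p(B) + p(C) - p(intersect(A, B)) - p(intersect(A, C)) -
+   p(intersect(B, C)) + p(intersect(intersect(A, B), C))
[1] 0.75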
Notes
6. A more general version of the third axiom adds the following to allow for an infinite number of disjoint events: If A1, …, Ak, … is a countably
infinite collection of disjoint events,
P(A1 ∪ A2 ∪ A3 ∪ · · ·) = P(A1) + P(A2) + P(A3) + · · ·.
“Countably infinite” means that there is a natural way to count the events or, put another way, to label the events 1, 2, 3, and so on.
[Figure 2.5: Venn diagram and P(A ∩ B ∩ C). Three overlapping circles for events A, B, and C within the sample space S, with the intersections A ∩ B, A ∩ C, B ∩ C, and A ∩ B ∩ C labeled.]
7. To show that property (iv) holds, consider the two events B and A ∩ Bc. These two events are disjoint since any outcome in B can’t possibly be
in Bc or, therefore, A ∩ Bc . We also know that (A ∩ Bc ) ∪ B = A ∪ B, so that
P(A ∪ B) = P(A ∩ Bc ) + P(B) = P(A) – P(A ∩ B) + P(B) = P(A) + P(B) – P(A ∩ B)
from applying property (iii), plugging in P(A) – P(A ∩ B) for P(A ∩ Bc ).
Exercises
1. You have five songs on your playlist, with songs 1 and 2 by Beyoncé and songs 3, 4, and 5 by Pink. You listen to the
playlist in random order, but without repeats (e.g. once song 1 is played, it doesn’t get played again). You continue to
listen until a song by Pink is played. For example, 214 is one possible song sequence (outcome).
(a) What is the sample space S?
(b) What is the event A that song 5 is played?
(c) What is the event B that song 2 is not played?
2. For two events A and B, suppose every outcome in A is also in B. State whether each of the following statements is
true or false, and explain why.
(a) The size of B, denoted |B|, is strictly less than the size of A, denoted |A|.
(b) The union of A and B is B.
(c) Every outcome in Bc is in Ac .
3. Consider randomly selecting a college student. Let V be the event that the student has a paid video-streaming
account, and let M be the event that the student has a paid music-streaming account. Suppose P(V) = 0.7 and P(M) =
0.5.
(a) Is it possible that P(V ∩ M) = 0.6? Why or why not?
(b) Suppose the probability that the student has both a video-streaming account and a music-streaming account is
0.35.
i. What is the probability that the student has at least one of the two types of accounts?
ii. What is the probability that the student has neither type of account?
iii. In terms of V and M, what is the event that the student has a video-streaming account but no music-
streaming account? What is the probability of this event?
4. Consider the experiment of picking a number randomly from the sample space S = {1, 2, 3, …, 999, 1000}. Let A be
the event that the number is a multiple of three, let B = {500, 501, …, 699, 700} be the event that the number is between
500 and 700 (inclusive), and let C be the event that the number is a perfect square (i.e., C = {1, 4, 9, 16, …, 900, 961}).
For this question, use R to create sets (vectors) and perform the necessary set operations.
(a) Create three vectors eventA, eventB, and eventC for the events A, B, and C and a vector samplespace
for S.
(b) Create a vector containing A ∪ B and calculate |A ∪ B|.
(c) Create a vector containing A ∩ C and calculate |A ∩ C|.
(d) Create a vector containing Cc and calculate |Cc |.
(e) Create a vector containing A ∩ B ∩ C and calculate |A ∩ B ∩ C|.
(f) Create a vector containing (A ∩ Bc ) ∪ C and calculate |(A ∩ Bc ) ∪ C|.
5. Suppose that, on any given weekday, 70% of college students eat breakfast, 60% do homework, and 85% do at least
one of these two things.
(a) What is the probability that a randomly selected student eats breakfast and does homework?
(b) What is the probability that a randomly selected student does neither activity?
6. The probability of the union, P(A ∪ B) = P(A) + P(B) – P(A ∩ B), includes outcomes that are in both A and B. Provide
a formula for the probability that exactly one of the events A and B occurs (but not both) in terms of P(A), P(B), and
P(A ∪ B).
7. A company has two research projects R1 and R2 , each of which either results in a patent or not. The probability that
project R1 results in a patent is 0.45, the probability that project R2 results in a patent is 0.15, and the probability that
both projects result in a patent is 0.05.
(a) What is the probability that at least one of the two projects results in a patent?
(b) What is the probability that neither of the two projects results in a patent?
(c) What is the probability that exactly one of the two projects results in a patent?
8. The sample space for the price of a company’s stock on a given day is S = [0, ∞), which consists of all non-negative
real numbers (including zero). In this case, S has an infinite number of possible outcomes. Consider the following
events: A = [80, 100], B = (60, 90], and C = (95, ∞). A square bracket [ or ] indicates that an interval is inclusive for
that endpoint, and a parenthesis ( or ) indicates that an interval is exclusive for that endpoint. Therefore, A is the set
of prices p such that 80 ≤ p ≤ 100, B is the set of prices p such that 60 < p ≤ 90, and C is the set of prices p such that
p > 95. Using the bracket and parenthesis notation and (when necessary) the union operator, what are the following
events?
(a) Cc
(b) Bc
(c) A∩B
(d) A∪C
(e) Ac ∪ B
(f) A ∩ B ∩ Cc
9. At a given company, worker salaries for the following year have just been decided. A worker’s salary can either
increase (I), decrease (D), or remain the same (R). Consider observing whether it increases, decreases, or stays the
same for three different employees.
(a) What is the sample space S?
(b) What is the event A that all three workers have different outcomes?
(c) What is the event B that exactly one of the three workers has a salary increase?
(d) What is the event C that exactly two of the three workers have the same outcome?
(e) Determine the following events: Cc and B ∩ C.
(f) Are A and C disjoint events?
(g) Are A and C collectively exhaustive?
10. A small information systems firm has the resources to respond to two invitations to submit a proposal for a contract.
When a proposal is submitted, it may be accepted outright, rejected outright, or a revision of the proposal may be
requested. If a revision is requested, submission of the revision leads to acceptance or rejection. You may assume that
the firm always submits a revision if it is requested.
(a) What is the sample space? Develop your own notation for this part.
(b) Define the event A to be “both proposals are eventually accepted,” the event B to be “both proposals are
eventually rejected,” and event C to be “a revision is submitted.” Which outcomes belong to each of these
events?
(c) Are A and B disjoint events? Are A and B collectively exhaustive?
(d) Are A and C disjoint events? Are A and C collectively exhaustive?
11. Burger Barn and Patty Palace will both open one new restaurant in Texas next year. Burger Barn is choosing among
four cities: A(ustin), D(allas), H(ouston), and S(an Antonio). Patty Palace is only choosing among three cities (A, D,
H) since it already has too many locations in San Antonio. Here is the probability table associated with their decisions:
                        Burger Barn
                    A      D      H      S
              A   0.07   0.12   0.16   0.12
Patty Palace  D   0.03   0.02   0.15   0.15
              H   0.06   0.04   0.03   0.05
(a) What is the probability that Burger Barn locates in Dallas and Patty Palace locates in Houston?
(b) What is the probability that Burger Barn locates in Dallas?
(c) What is the probability that Patty Palace locates in Houston?
(d) What is the probability that Burger Barn and Patty Palace locate in the same city?
(e) What is the probability that Burger Barn and Patty Palace locate in different cities?
12. An investor has three investment opportunities (A, B, and C) and is asked to rank them in preference order with the
most preferred listed first. Since she is indifferent between the three investments, she ranks them randomly.
(a) What are the outcomes in S, and what are the probabilities for each of the outcomes?
(b) What is the probability that investment C is ranked first?
(c) What is the probability that investment C is ranked first and investment A is ranked last?
13. On a particular day, Steve’s Sneaker Shop is beginning to stock pairs of a highly anticipated new sneaker. Hundreds
of customers are expected at the store. The owner (Steve) decides that one of the first 50 customers will receive a free
pair of the sneakers, and he randomly picks a number between 1 and 50 (inclusive), each being equally likely, before
the store opens.
(a) What is the probability that the fifth customer gets the free pair of sneakers?
(b) What is the probability that the free pair of sneakers is given away before the 20th customer?
(c) Conduct 10,000 simulations in R to approximate the probabilities in (a) and (b), where each simulation involves
a random number being chosen from the vector 1:50.
14.
(a) Modify the R code for 10,000 coin-toss simulations, used to create Figure 2.1 in Section 2.3, to replace the fair
coin with an unfair coin. Specifically, consider an unfair coin that comes up heads with 60% probability. In
addition to changing the random tosses, change where the dotted line appears.
(b) Repeat (a) but with only 100 simulations. How does the figure compare to the one in (a)?
15. At a large technology company, 80% of employees work full time (at least 40 hours per week), 70% have a flexible
work arrangement (that allows them to work partially from home), and 50% have a company-issued laptop computer.
Refer to these three employee characteristics as A, B, and C, respectively, so that P(A) = 0.8, P(B) = 0.7, and P(C) =
0.5. The following things are also true: 60% of employees work full time and have a flexible work arrangement,
38% of employees work full time and have a company-issued laptop computer, 42% of employees have a flexible
work arrangement and have a company-issued laptop computer, and 35% work full time and have a flexible work
arrangement and have a company-issued laptop computer.
(a) What is the probability that an employee has at least one of the three characteristics?
(b) What is the probability that an employee has exactly one of the three characteristics? (For this part, you might
find it easiest to use a Venn diagram.)
(c) What is the probability that an employee has at least two of the three characteristics?
This chapter builds upon the basic probability properties from Chapter 2 to say more about what happens when
there are multiple events. What can be said about the probability that two events both occur? What can be said
about the probability of one event occurring if it is known that the other event occurs? Various types of probabilities
are discussed, including joint probabilities, conditional probabilities, and marginal probabilities, and the important
concept of independence is introduced.
Definition 3.1 P(A), the probability of an event A, is also called the unconditional probability or marginal
probability of A.
We say P(A) is an “unconditional probability” since we do not condition on any other event occurring, in contrast to
the conditional probability introduced below. And, as seen later in this chapter, the terminology “marginal probability”
relates to the idea that this unconditional probability will sometimes appear in the “margin” of a probability table.
Each outcome in the sample space has an unconditional or marginal probability associated with it, and the collection
of these probabilities is the probability distribution:
Definition 3.2 A probability distribution is a complete description of the probabilities associated with every outcome
in the sample space S.
Example 3.1 (Six-sided die) For a fair die and S = {1, 2, 3, 4, 5, 6}, the probability distribution is
P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6.
For any two events, the joint probability is defined as the probability that both events occur:
Definition 3.3 The joint probability of events A and B is the probability P(A ∩ B) that both events occur.
Examples of joint probabilities were seen in Chapter 2. For instance, Example 2.15 considered the joint probability
of rolling at least a 3 (event A) and rolling an even number (event B).
We can now formally define the conditional probability:
Definition 3.4 The conditional probability P(A|B), which is the probability of event A given that event B has
occurred, is
P(A|B) = P(A ∩ B)/P(B)
if P(B) > 0. P(A|B) is often read as “the probability of A given B” or “the probability of A conditional on B.”
The P(B) > 0 condition ensures no division by zero. Having P(B) > 0 is not restrictive since it just means the event
being conditioned upon can actually happen. When we condition on the event B, we focus only on outcomes in B and
ignore any outcomes in the rest of the sample space S (given by Bc ).
The conditional probability is quite different from the joint probability. For the six-sided die example, P(A|B) is the
probability the die roll is at least a 3 if we know that the die roll is an even number. Knowing B has occurred means
that the possible outcomes are {2, 4, 6}, so now what is the probability that the die roll is at least 3? Flipping things
around, P(B|A) is the probability the die roll is an even number if we know that the die roll is at least 3. Knowing A
has occurred tells us that the possible outcomes are {3, 4, 5, 6}, so now what is the probability that the die roll is even?
We can calculate both of these conditional probabilities using Definition 3.4.
Example 3.2 (Six-sided die) The probabilities of A = {3, 4, 5, 6}, a roll of at least 3, and B = {2, 4, 6}, an even roll, are
P(A) = 2/3 and P(B) = 1/2. The joint probability is P(A ∩ B) = 1/3 since A ∩ B = {4, 6}. The conditional probability that the
die roll is at least a 3 given that the die roll is even is
P(A|B) = P(A ∩ B)/P(B) = (1/3)/(1/2) = 2/3,
which makes sense since A ∩ B = {4, 6} has two of the three outcomes in B = {2, 4, 6}. Reversing the roles of A and B,
the conditional probability that the die roll is even given that the die roll is at least a 3 is
P(B|A) = P(B ∩ A)/P(A) = (1/3)/(2/3) = 1/2.
We have P(B ∩ A) = P(A ∩ B) since B ∩ A = A ∩ B. Again, the answer is intuitive since A ∩ B = {4, 6} has two of the four
outcomes in A = {3, 4, 5, 6}.
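Conditional probabilities with equally likely outcomes can also be computed in R by dividing the size of the intersection by the size of the conditioning event. A minimal sketch (the vector names are our own):
> S <- 1:6
> A <- c(3, 4, 5, 6); B <- c(2, 4, 6)
> length(intersect(A, B))/length(B)   # P(A|B)
[1] 0.6666667
> length(intersect(A, B))/length(A)   # P(B|A)
[1] 0.5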
The conditional probability is itself a probability, and its main properties follow directly from the Axioms of
Probability introduced in Chapter 2:
Proposition 3.1. (Properties of conditional probabilities) If B is an event with P(B) > 0, the following properties hold:
(i) 0 ≤ P(A|B) ≤ 1 for any event A.
(ii) If A1, A2, …, Ak is a collection of disjoint events, P(A1 ∪ A2 ∪ · · · ∪ Ak|B) = P(A1|B) + P(A2|B) + · · · + P(Ak|B).
(iii) If A1 , A2 , …, Ak is a collection of disjoint and exhaustive events, P(A1 ∪ A2 ∪ · · · ∪ Ak |B) = 1.
(iv) P(Ac |B) = 1 – P(A|B) for any event A.
(v) P(A1 |B) = P(A1 ∩ A2 |B) + P(A1 ∩ Ac2 |B) for any events A1 and A2 .
(vi) P(A1 ∪ A2 |B) = P(A1 |B) + P(A2 |B) – P(A1 ∩ A2 |B) for any events A1 and A2 .
(vii) If A1 and A2 are disjoint events, P(A1 ∩ A2 |B) = 0.
The properties look very similar to the first and third probability axioms and the properties from Proposition 2.5,
with the difference being that all of the probabilities in Proposition 3.1 are conditional probabilities, conditioning on
the event B.
Proposition 3.2. (Multiplication rule) For any two events A and B,
P(A ∩ B) = P(A|B)P(B) if P(B) > 0
and
P(A ∩ B) = P(B|A)P(A) if P(A) > 0.
Proposition 3.2 follows directly from Definition 3.4 since P(A|B) = P(A ∩ B)/P(B) and P(B|A) = P(A ∩ B)/P(A), respectively. It can
be useful to have the two alternative equations since in some cases you might know P(B) and P(A|B) but not P(A) and
P(B|A), or vice versa.
Example 3.3 (Product returns) Suppose widgets.com sells three types of widgets, each with different purchase
probabilities and return rates, as follows:
• Widget 1: The probability that a widget purchase is Widget 1 is 60%. The probability that a Widget 1 purchase is
returned is 15%.
• Widget 2: The probability that a widget purchase is Widget 2 is 30%. The probability that a Widget 2 purchase is
returned is 25%.
• Widget 3: The probability that a widget purchase is Widget 3 is 10%. The probability that a Widget 3 purchase is
returned is 35%.
Let the events A1 , A2 , and A3 correspond to a purchase being Widget 1, Widget 2, and Widget 3, respectively. We know
P(A1 ) = 0.6, P(A2 ) = 0.3, and P(A3 ) = 0.1. Let R correspond to the event that a widget purchase is returned. We know
the conditional probabilities P(R|A1 ) = 0.15, P(R|A2 ) = 0.25, and P(R|A3 ) = 0.35. Given this information, what is the
joint probability that a widget purchase is Widget 1 and the purchase is returned? We have
P(A1 ∩ R) = P(R|A1 )P(A1 ) = (0.15)(0.6) = 0.09.
(Why not use the alternative formula P(A1 ∩ R) = P(A1 |R)P(R) here?)
How about the joint probability that a widget purchase is Widget 2 and the purchase is not returned? The event of
the purchase not being returned is Rc , so that
P(A2 ∩ Rc ) = P(Rc |A2 )P(A2 ) = (1 – 0.25)(0.3) = 0.225.
We can do even more with conditional probabilities after introducing two important results, the Law of Total
Probability and Bayes’ Theorem. We start with the Law of Total Probability, which is based upon the concept of
partitioning introduced in Chapter 2:
Proposition 3.3. (Law of Total Probability) If A1 , A2 , …, Ak are disjoint events and also exhaustive events (that is,
P(A1 ) + P(A2 ) + · · · + P(Ak ) = 1), then for any event B,
P(B) = P(B|A1)P(A1) + P(B|A2)P(A2) + · · · + P(B|Ak)P(Ak).
Using the disjoint and exhaustive events A1 , …, Ak , the event B can itself be partitioned into k events. These k events
are B ∩ A1 , B ∩ A2 , …, B ∩ Ak . Since A1 , …, Ak are exhaustive events, the union of these k events is equal to B since any
outcome in B must be in one of the k partitions. Applying Axiom 3, then, yields
P(B) = P(B ∩ A1 ) + P(B ∩ A2 ) + · · · + P(B ∩ Ak ).
The Venn diagram in Figure 3.1 illustrates the idea of partitioning. The Venn diagram depicts a partition of the sample
space S into five disjoint and exhaustive events A1 , A2 , A3 , A4 , and A5 . The event B is represented by the gray circle,
and B is itself partitioned into five events: B ∩ A1, B ∩ A2, B ∩ A3, B ∩ A4, and B ∩ A5. (In Figure 3.1, each of the B ∩ Aj
events is non-empty, though it is certainly possible in other examples to have B ∩ Aj = ∅.) Then, the overall probability
of event B is the sum of the five probabilities P(B ∩ A1 ), P(B ∩ A2 ), P(B ∩ A3 ), P(B ∩ A4 ), and P(B ∩ A5 ).
Having P(B) = P(B ∩ A1 ) + P(B ∩ A2 ) + · · · + P(B ∩ Ak ) for disjoint and exhaustive events A1 , …, Ak leads directly
to the Law of Total Probability result. Specifically, from the multiplication rule (Proposition 3.2), P(B ∩ A1 ) =
P(B|A1)P(A1) and similarly for the other partitions, so that:
P(B) = P(B|A1 )P(A1 ) + P(B|A2 )P(A2 ) + · · · + P(B|Ak )P(Ak ).
[Figure 3.1: Venn diagram for event partitions. The sample space S is partitioned into five disjoint and exhaustive events A1, A2, A3, A4, and A5; the circle for event B is split into B ∩ A1, B ∩ A2, B ∩ A3, B ∩ A4, and B ∩ A5.]
Example 3.4 (Product returns) Continuing Example 3.3, the following question can now be answered: For a widget
purchase, what is the probability that the widget is returned? In other words, what is the unconditional probability
P(R)? A direct application of the Law of Total Probability gives
P(R) = P(R|A1 )P(A1 ) + P(R|A2 )P(A2 ) + P(R|A3 )P(A3 )
= (0.15)(0.6) + (0.25)(0.3) + (0.35)(0.1) = 0.2.
The unconditional probability of a return is 0.2 or 20%. From the equation above, this unconditional probability is
a weighted average of the three conditional return probabilities for the different types of widgets, where the weights
are the probabilities of the partitions (here, the probabilities of the three types of widgets). With the unconditional
probability P(R), we can also determine the conditional probability of each widget type given that a widget is returned.
If a widgets.com employee receives a return in a sealed package, what is the probability that this return is Widget
1 or Widget 2 or Widget 3? These conditional probabilities can be determined as follows:
P(A1|R) = P(A1 ∩ R)/P(R) = P(R|A1)P(A1)/P(R) = (0.15)(0.6)/0.2 = 0.45
P(A2|R) = P(A2 ∩ R)/P(R) = P(R|A2)P(A2)/P(R) = (0.25)(0.3)/0.2 = 0.375
P(A3|R) = P(A3 ∩ R)/P(R) = P(R|A3)P(A3)/P(R) = (0.35)(0.1)/0.2 = 0.175
The conditioning information changes the probabilities. Whereas the unconditional probability of A1 is 0.6, the
probability of A1 given that it is a returned widget is 0.45. For A2 , the unconditional probability is 0.3, and the
conditional probability given R is 0.375. For A3 , the unconditional probability is 0.1, and the conditional probability
given R is 0.175.
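Calculations like these are easy to script. A minimal sketch in R for the widget example, storing the purchase probabilities and conditional return rates as vectors (the object names are our own):
> prior <- c(0.6, 0.3, 0.1)    # P(A1), P(A2), P(A3)
> ret <- c(0.15, 0.25, 0.35)   # P(R|A1), P(R|A2), P(R|A3)
> pR <- sum(ret * prior)       # Law of Total Probability
> pR
[1] 0.2
> ret * prior / pR             # conditional probabilities P(A1|R), P(A2|R), P(A3|R)
[1] 0.450 0.375 0.175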
As seen in Example 3.4, probabilities can be updated once new information is incorporated. We move from an
unconditional probability, with no information given, to a conditional probability, where we condition upon new
information. This approach to “updating” probabilities is embodied in a famous result known as Bayes’ Theorem:
Proposition 3.4. (Bayes’ Theorem) If A1 , A2 , …, Ak are disjoint and exhaustive events, then for any event B with
P(B) > 0,
P(Ai|B) = P(B|Ai)P(Ai) / [P(B|A1)P(A1) + P(B|A2)P(A2) + · · · + P(B|Ak)P(Ak)].
To see why Bayes’ Theorem holds, recall that P(Ai|B) = P(Ai ∩ B)/P(B) by the definition of a conditional probability. For the
numerator, P(Ai ∩ B) = P(B|Ai)P(Ai) by the multiplication rule. For the denominator, P(B) = P(B|A1)P(A1) + · · · + P(B|Ak)P(Ak) by the
Law of Total Probability.
In the context of Bayes’ Theorem, the unconditional probability P(Ai ) is sometimes referred to as the prior
probability of event Ai since it is the probability prior to considering any additional information. The conditional
probability P(Ai |B) is sometimes referred to as the posterior probability of event Ai since the probability has been
updated with the information that event B has occurred. Incorporation of conditioning information and moving from a
prior probability to a posterior probability is called Bayesian updating.
Example 3.5 (Vaccination and infection) Suppose the probability that an individual is vaccinated against a certain
disease is 70%. Let V be the event that an individual is vaccinated, with P(V) = 0.7, and let NV = Vc be the event that an individual is
unvaccinated, with P(NV) = 0.3. Let D be the event that an individual is infected with the underlying disease. Suppose
the probability of infection is 5% for vaccinated individuals (P(D|V) = 0.05) and 50% for unvaccinated individuals
(P(D|NV) = 0.5). Then, what is the unconditional probability of infection P(D)? Since V and NV are disjoint and
exhaustive, the Law of Total Probability yields
P(D) = P(D|V)P(V) + P(D|NV)P(NV) = (0.05)(0.7) + (0.5)(0.3) = 0.185.
The probability of infection is a weighted average of the conditional probabilities of infection. With P(D) determined,
we can use Bayes’ Theorem to determine the conditional probability that an individual is vaccinated given that the
individual is infected:
P(V|D) = P(D|V)P(V) / [P(D|V)P(V) + P(D|NV)P(NV)] = (0.05)(0.7) / [(0.05)(0.7) + (0.5)(0.3)] ≈ 0.189.
The conditional probability of vaccination given infection is approximately 18.9%. We could also determine P(NV|D)
using Bayes’ Theorem, but once we know P(V|D) ≈ 0.189, we immediately have P(NV|D) = 1 – P(V|D) ≈ 0.811 or
81.1%.
What would happen if the vaccination rate were much higher, say 90%, so that P(V) = 0.9 and P(NV) = 0.1? With
this higher vaccination rate, the conditional probability of vaccination given infection is
P(V|D) = P(D|V)P(V) / [P(D|V)P(V) + P(D|NV)P(NV)] = (0.05)(0.9) / [(0.05)(0.9) + (0.5)(0.1)] ≈ 0.474.
We get a much higher conditional probability of vaccination given infection in this case. Even though the infection
rate for vaccinated individuals is quite low (5%), the very high unconditional probability of vaccination leads to the
vaccinated making up a very large proportion of the infected individuals, much larger than with a 70% vaccination
rate. In the partition terminology, the vaccinated “partition” of the infected event gets larger when the vaccinated
probability goes up. The interested reader can try this exercise with even higher vaccination rates, like 95% and 99%,
and determine P(V|D).
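This exercise is easy to carry out in R. A minimal sketch, wrapping Bayes’ Theorem in a function of the vaccination rate (the function name and default arguments are our own):
> pVgivenD <- function(pV, pDV = 0.05, pDNV = 0.5) {
+   pDV * pV / (pDV * pV + pDNV * (1 - pV))   # Bayes' Theorem with two partitions
+ }
> pVgivenD(c(0.7, 0.9, 0.95, 0.99))
[1] 0.1891892 0.4736842 0.6551724 0.9082569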
Example 3.6 (Product returns) From Example 3.3, recall that the three types of widgets made up 60%, 30%, and 10%
of purchases with corresponding return rates of 15%, 25%, and 35%. Here is a probability table that corresponds to
these probabilities:
                              Returned
                         Yes (R)   No (Rc)   Total
          Widget 1 (A1)   0.090     0.510    0.600
Purchase  Widget 2 (A2)   0.075     0.225    0.300
          Widget 3 (A3)   0.035     0.065    0.100
          Total           0.200     0.800
The events A1 , A2 , and A3 are disjoint and exhaustive, and the events R and Rc are disjoint and exhaustive. Each of
the bold numbers in the table corresponds to a joint probability. For instance, the 0.090 value is the joint probability
of a Widget 1 purchase and a return: P(A1 ∩ R) = 0.090. The 0.225 value is the joint probability of a Widget 2 purchase
and no return: P(A2 ∩ Rc) = 0.225. Since both the rows and columns have collections of disjoint and exhaustive events,
the joint probabilities in the table must sum to one. In this case, 0.090 + 0.510 + 0.075 + 0.225 + 0.035 + 0.065 = 1.
Both a row and a column labeled “Total” have been included in this probability table. The “Total” column provides
the totals for the joint probabilities in each row, and the “Total” row provides the totals for the joint probabilities in
each column. These “marginal” probabilities are the unconditional probabilities corresponding to the given row or
column that is being summed. For the “Widget 1” row, the 0.600 value is the sum of P(A1 ∩ R) = 0.090 and P(A1 ∩ Rc ) =
0.510, so that P(A1 ) = 0.6. For the “Yes” column, the 0.200 value is the sum of P(A1 ∩ R) = 0.090, P(A2 ∩ R) = 0.075,
and P(A3 ∩ R) = 0.035, so that P(R) = 0.200.
When the joint probabilities are completely specified, as in this case, conditional probabilities can be calculated
quite easily. For instance, the conditional probability of a Widget 1 purchase (A1 ) given that the purchase is returned
(R) is
P(A1|R) = P(A1 ∩ R)/P(R) = 0.090/0.200 = 0.45,
where the numerator is the joint probability 0.090 from the table and the denominator is the marginal probability
P(R) = 0.200 in the “Total” row. Similarly, the conditional probability of a return (R) given a Widget 1 purchase (A1 )
is
P(R|A1) = P(R ∩ A1)/P(A1) = 0.090/0.600 = 0.15,
where the numerator is the joint probability 0.090 from the table and the denominator is the marginal probability
P(A1 ) = 0.600 in the “Total” column.
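Probability tables can be represented in R as matrices, with the marginal probabilities recovered by summing rows and columns. A minimal sketch for this table (the object names are our own):
> tab <- matrix(c(0.090, 0.510, 0.075, 0.225, 0.035, 0.065),
+               nrow = 3, byrow = TRUE,
+               dimnames = list(c("A1", "A2", "A3"), c("R", "Rc")))
> rowSums(tab)   # marginal probabilities P(A1), P(A2), P(A3)
 A1  A2  A3 
0.6 0.3 0.1 
> colSums(tab)   # marginal probabilities P(R), P(Rc)
  R  Rc 
0.2 0.8 
> tab["A1", "R"] / colSums(tab)["R"]   # P(A1|R)
   R 
0.45 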
Example 3.7 (Vaccination and infection) From Example 3.5, recall that the vaccination probability was P(V) = 0.7,
the probability of infection given vaccination was P(D|V) = 0.05, and the probability of infection given non-vaccination
was P(D|NV) = 0.5. The following probability table corresponds to these vaccination/infection probabilities:
                           Infected
                      Yes (D)   No (Dc)   Total
Vaccinated  Yes (V)    0.035     0.665    0.700
            No (NV)    0.150     0.150    0.300
            Total      0.185     0.815
Each of the four joint probabilities comes directly from the information on the vaccination probability and the
conditional infection probabilities. For instance, the 0.035 value is P(V ∩ D) = P(D|V)P(V) = (0.05)(0.7) = 0.035.
From the probability table, the conditional probability of vaccination given infection is
P(V|D) = P(V ∩ D)/P(D) = 0.035/0.185 ≈ 0.189,
which matches the Bayes’ Theorem calculation in Example 3.5.
= (0.06 + 0.02 + 0.01)/(0.01 + 0.02 + 0.06 + 0.02 + 0.01) = 0.75.
Another example is the conditional probability of two dealership A salespeople selling a car (A2 ) given that the
number of salespeople selling a car at the two dealerships are equal to each other. Let E be the event that the number
of salespeople selling a car at the two dealerships are equal to each other, which is
E = (A0 ∩ B0) ∪ (A1 ∩ B1) ∪ (A2 ∩ B2) ∪ (A3 ∩ B3),
the union of “both 0” and “both 1” and “both 2” and “both 3.” The conditional probability is
P(A2|E) = P(A2 ∩ E)/P(E) = 0.11/(0.02 + 0.08 + 0.11 + 0.02) = 11/23.
3.4 Independence
In this section, we consider the concept of independence of events, which essentially means that knowing that
one event occurs does not provide any additional information about the other event. In situations where events are
independent of each other, the calculation of joint probabilities is greatly simplified. We start with the definition of
independent events.
Definition 3.5 Events A and B are independent if P(A|B) = P(A). Events A and B are dependent if P(A|B) ≠ P(A).
The P(A|B) = P(A) condition says that knowing B occurs does not affect the probability of A occurring. The
conditional probability of A given B is the same as the unconditional probability of A. Even though Definition 3.5
specifies that only P(A|B) = P(A) is required for independence, P(A|B) = P(A) immediately implies that P(B|A) = P(B)
(that is, knowing A does not affect the probability of B):
P(B|A) = P(B ∩ A)/P(A) = P(A|B)P(B)/P(A) = P(B).
Therefore, independence of A and B can be established by checking either P(A|B) = P(A) or P(B|A) = P(B). It is not
necessary to check both. Likewise, to show dependence of A and B, either P(A|B) ≠ P(A) or P(B|A) ≠ P(B) can be
checked.
If events A and B are independent, knowing that B occurs doesn’t affect the probability of A. It is also the case that
knowing that B does not occur doesn’t affect the probability of A:
Proposition 3.5. If the events A and B are independent, P(A|Bc ) = P(A).
Example 3.9 (Product returns) Based upon the probability table in Example 3.6, are the events A1 (Widget 1 purchase)
and R (return) independent? Note that P(A1 |R) = 0.09/0.20 = 0.45, which is not equal to the unconditional probability
P(A1 ) = 0.6. Alternatively, note that P(R|A1 ) = 0.15, which is not equal to the unconditional probability P(R) = 0.2.
Intuitively, it makes sense that A1 and R are dependent since the return probability depends upon the type of widget
purchased; that is, knowing that Widget 1 is purchased provides additional information about the return probability.
Example 3.10 (Vaccination and infection) From the probability table in Example 3.7, the vaccination event (V) and
infection event (D) are dependent since P(D|V) = 0.05 and P(D) = 0.185. The dependence of these events arises since
the infection probability changes depending upon whether an individual is vaccinated or non-vaccinated.
Example 3.11 (Two coin tosses) Recall the experiment where two coins are tossed, with sample space S =
{HH, HT, TH, TT}. Let the event H1 correspond to the first toss being heads and event H2 correspond to the second
toss being heads. If the two tosses have nothing to do with each other (that is, the outcome of one toss has no
effect on the outcome of the other toss), then H1 and H2 are independent events with P(H1 |H2 ) = P(H1 ) = 0.5 and
P(H2 |H1 ) = P(H2 ) = 0.5. Due to independence, each of the probabilities for the outcomes in S is equal to 0.25:
P(HH) = P(H1 |H2 )P(H2 ) = P(H1 )P(H2 ) = (0.5)(0.5) = 0.25
P(HT) = P(H1 |H2c )P(H2c ) = P(H1 )P(H2c ) = (0.5)(0.5) = 0.25
P(TH) = P(H1c |H2 )P(H2 ) = P(H1c )P(H2 ) = (0.5)(0.5) = 0.25
P(TT) = P(H1c |H2c )P(H2c ) = P(H1c )P(H2c ) = (0.5)(0.5) = 0.25
In Example 3.11, the joint probabilities for the two coin tosses each simplify to the product of two unconditional
probabilities. This simplification is a general property of independent events:
Proposition 3.6. Events A and B are independent if and only if P(A ∩ B) = P(A)P(B).
This proposition provides an alternative method to check independence without conditional probabilities.
Example 3.12 (Product returns) A1 and R are dependent since P(A1 ∩ R) = 0.09 and P(A1 )P(R) = (0.6)(0.2) = 0.12.
Example 3.13 (Vaccination and infection) V and D are dependent since P(V ∩ D) = 0.035 and P(V)P(D) =
(0.7)(0.185) = 0.1295.
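Proposition 3.6’s product check takes one line in R once the joint and marginal probabilities are stored. A minimal sketch for the widget example (the names are our own):
> pA1_and_R <- 0.090      # joint probability P(A1 ∩ R) from the table
> pA1 <- 0.6; pR <- 0.2   # marginal probabilities
> pA1 * pR                # P(A1)P(R), which differs from 0.090, so A1 and R are dependent
[1] 0.12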
The concept of independence can be extended beyond two events, as follows:
Definition 3.6 (Independence of more than two events) Events A1 , A2 , …, Ak are mutually independent if, for any
subset of the events, the joint probability is equal to the product of the individual probabilities. Equivalently, the
conditional probability of any event Ai given any subset of other events is equal to the unconditional probability of Ai .
For the case of two events, this definition corresponds exactly to the notion of independence in Proposition 3.6. For
three events (A1 , A2 , A3 ), the events are mutually independent if
P(A1 ∩ A2 ) = P(A1 )P(A2 ), P(A1 ∩ A3 ) = P(A1 )P(A3 ), P(A2 ∩ A3 ) = P(A2 )P(A3 )
and
P(A1 ∩ A2 ∩ A3 ) = P(A1 )P(A2 )P(A3 ).
Or, in terms of conditional probabilities, the three events are mutually independent if
P(A1 |A2 ) = P(A1 |A3 ) = P(A1 |A2 ∩ A3 ) = P(A1 ),
P(A2 |A1 ) = P(A2 |A3 ) = P(A2 |A1 ∩ A3 ) = P(A2 ),
and
P(A3 |A1 ) = P(A3 |A2 ) = P(A3 |A1 ∩ A2 ) = P(A3 ).
At this point, it is useful to introduce more concise notation for the intersection (∩) of events. Specifically, we use a
comma in between events to indicate intersection. For example, we write
P(A1 , A2 ) for P(A1 ∩ A2 ),
P(A1 , A2 , A3 ) for P(A1 ∩ A2 ∩ A3 ),
P(A1 |A2 , A3 ) for P(A1 |A2 ∩ A3 ),
P(A1 , A2 |A3 ) for P(A1 ∩ A2 |A3 ),
and so on.
When there are many events, the number of possible products involved in the definition of mutual independence
can be quite large. Consider the case of 50 events (A1 , A2 , …, A50 ), for which mutual independence is equivalent to the
following:
P(A1 , A2 ) = P(A1 )P(A2 ) and similarly for any two events in {A1 , …, A50 },
P(A1 , A2 , A3 ) = P(A1 )P(A2 )P(A3 ) and similarly for any three events in {A1 , …, A50 },
⋮
P(A1 , A2 , …, A49 ) = P(A1 )P(A2 ) · · · P(A49 ) and similarly for any 49 events in {A1 , …, A50 },
and P(A1 , A2 , …, A50 ) = P(A1 )P(A2 ) · · · P(A50 ).
Thankfully, in practice, we rarely check all the probability-product possibilities for such a large number of events.
Instead, the more common situation is that we assume mutual independence of a collection of events and then use
the implied probability-product equalities. For instance, for 50 tosses of a coin, we might assume that the events
corresponding to heads on each toss (denoted H1 , H2 , …, H50 ) are mutually independent since the result of one coin
toss should not depend on the result of any other coin toss(es). With independence of the 50 coin tosses, it is then easy
to calculate joint probabilities involving outcomes on any of the individual tosses. As one example, the probability of
heads on tosses 10, 20, and 30 is P(H10, H20 , H30 ) = P(H10 )P(H20 )P(H30 ) = (0.5)(0.5)(0.5) = 0.125. In fact, since heads
and tails have the same probability 0.5 for every toss, the probability of any joint outcome of three tosses is going to
be equal to 0.125.
Example 3.14 (Three website visitors) Recall Example 2.3, which considered the purchase behavior of the first three
visitors to a website on a given day, with sample space
S = {YYY, YYN, YNY, YNN, NYY, NYN, NNY, NNN}.
Let A1, A2, and A3 denote the events that the first, second, and third visitor makes a purchase, respectively:
A1 = {YYY, YYN, YNY, YNN}
A2 = {YYY, YYN, NYY, NYN}
A3 = {YYY, YNY, NYY, NNY}
Assume that A1 , A2 , and A3 are mutually independent, which is sensible if the purchase behavior of one website
visitor is not affected by the purchase behavior of other website visitors. Moreover, assume that the unconditional
purchase probability of any given website visitor is 20%, so that P(A1 ) = P(A2 ) = P(A3 ) = 0.2. The unconditional
probability of non-purchase by any website visitor is 80%, so that P(Ac1 ) = P(Ac2 ) = P(Ac3 ) = 0.8. With the mutual
independence assumption, we can determine the probability for any outcome in S. For instance, the probability of
YNY (first and third visitors make a purchase, second visitor does not) is P(A1 )P(Ac2 )P(A3 ) = (0.2)(0.8)(0.2) = 0.032.
With the ability to calculate these joint probabilities, the probabilities associated with the total number of purchases
can also be determined. Let B0 , B1 , B2 , and B3 denote the events that zero, one, two, and three total purchases are
made, respectively:
B0 = {NNN}, B1 = {YNN, NYN, NNY}, B2 = {YYN, YNY, NYY}, B3 = {YYY}.
Their probabilities are:
P(B0) = (0.8)(0.8)(0.8) = 0.512                                      (outcome NNN)
P(B1) = (0.2)(0.8)(0.8) + (0.8)(0.2)(0.8) + (0.8)(0.8)(0.2) = 0.384  (outcomes YNN, NYN, NNY)
P(B2) = (0.2)(0.2)(0.8) + (0.2)(0.8)(0.2) + (0.8)(0.2)(0.2) = 0.096  (outcomes YYN, YNY, NYY)
P(B3) = (0.2)(0.2)(0.2) = 0.008                                      (outcome YYY)
These four probabilities sum to one since B0 , B1 , B2 , and B3 are disjoint and exhaustive events.
Does the total number of purchases convey any useful information about whether a given individual makes a
purchase? Let’s focus on the first website visitor and consider the conditional probabilities P(A1 |B0 ), P(A1 |B1 ),
P(A1 |B2 ), and P(A1 |B3 ):
P(A1|B0) = P(A1, B0)/P(B0) = 0/0.512 = 0
P(A1|B1) = P(A1, B1)/P(B1) = (0.2)(0.8)(0.8)/0.384 = 1/3
P(A1|B2) = P(A1, B2)/P(B2) = [(0.2)(0.2)(0.8) + (0.2)(0.8)(0.2)]/0.096 = 2/3
P(A1|B3) = P(A1, B3)/P(B3) = (0.2)(0.2)(0.2)/0.008 = 1
These four conditional probabilities make intuitive sense. If no purchases occur (B0 ), the first visitor can’t possibly
make a purchase, so P(A1 |B0 ) = 0. If three purchases occur (B3 ), the first visitor must make a purchase, so P(A1 |B3 ) = 1.
When one total purchase is made (B1 ), the probability that the purchaser is the first visitor is 1/3 since there’s nothing
inherently different about the three visitors, meaning their chance of being the purchaser should be the same (1/3 each).
When two total purchases are made (B2 ), the probability that the purchaser is the first visitor is 2/3 since the chance
of any of the three visitors being the non-purchaser should be the same (1/3 each). So, it’s certainly the case that
information about the total number of purchases is useful for updating the probability that the first visitor makes a
purchase (A1 ). One can check that A1 and B0 are dependent, as are A1 and B1 , A1 and B2 , and A1 and B3 .
Likewise, whether or not a given individual makes a purchase is also informative about the number of total purchases
that are made. For instance, if the first visitor makes a purchase (A1 ), then
P(B0|A1) = P(A1, B0)/P(A1) = 0/0.2 = 0
P(B1|A1) = P(A1, B1)/P(A1) = (0.2)(0.8)(0.8)/0.2 = 0.64
P(B2|A1) = P(A1, B2)/P(A1) = [(0.2)(0.2)(0.8) + (0.2)(0.8)(0.2)]/0.2 = 0.32
P(B3|A1) = P(A1, B3)/P(A1) = (0.2)(0.2)(0.2)/0.2 = 0.04
These conditional probabilities differ from the unconditional probabilities of B0 , B1 , B2 , and B3 .
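A short simulation in R can confirm these conditional probabilities. A minimal sketch, simulating many triples of independent visitors (the names and the 10,000 replication count are our own choices):
> set.seed(1)
> n <- 10000
> v1 <- runif(n) < 0.2   # TRUE when visitor 1 purchases (probability 0.2)
> v2 <- runif(n) < 0.2
> v3 <- runif(n) < 0.2
> total <- v1 + v2 + v3
> mean(v1[total == 1])   # should be close to P(A1|B1) = 1/3
> mean(v1[total == 2])   # should be close to P(A1|B2) = 2/3
The two sample proportions should land near 1/3 and 2/3, with the simulation error shrinking as n grows.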
When a collection of events is not mutually independent, it can be more complicated to determine joint
probabilities. The usual approach is to consider successive conditioning. This idea has been seen in the context of
two events, where
P(A1 , A2 ) = P(A2 |A1 )P(A1 ) or P(A1 , A2 ) = P(A1 |A2 )P(A2 ).
For three events, this idea can be extended to
P(A1 , A2 , A3 ) = P(A3 |A1 , A2 )P(A2 |A1 )P(A1 ).
The joint probability of A1 ∩ A2 ∩ A3 , which is P(A1 , A2 , A3 ), is the probability that A1 occurs times the probability A2
occurs given A1 occurring times the probability A3 occurs given both A1 and A2 occurring. The order of events can be
switched around however we would like. Depending upon the problem, it might be easier to find the joint probability
with one ordering rather than another. For three events, the following successive conditioning equations are also true:
P(A1 , A2 , A3 ) = P(A2 |A1 , A3 )P(A3 |A1 )P(A1 ),
P(A1 , A2 , A3 ) = P(A3 |A1 , A2 )P(A1 |A2 )P(A2 ),
P(A1 , A2 , A3 ) = P(A1 |A2 , A3 )P(A3 |A2 )P(A2 ),
P(A1, A2, A3) = P(A2|A1, A3)P(A1|A3)P(A3), and
P(A1, A2, A3) = P(A1|A2, A3)P(A2|A3)P(A3).
The probability that the first heads appears on the 10th toss is P(A10) = 1/2^10 = 1/1024 ≈ 0.00098, or approximately 0.098%.
While this probability of A10 is quite low, it’s not equal to zero, so it is possible that A10 occurs. This is true even for
much larger k. While the probability of Ak gets smaller and smaller, it never reaches zero.
In the repeated coin toss example, there are an infinite number of possible events A1 , A2 , A3 , … corresponding to
when the first heads is observed. These events are disjoint. For their probabilities to constitute a probability distribution,
it must be the case that they sum to one. To show that P(A1) + P(A2) + P(A3) + · · · = 1/2 + 1/2² + 1/2³ + · · · is equal to
one, we introduce a general fact about the sum of an infinite geometric series:
Proposition 3.7. For real numbers a and r, with |r| < 1, the sum of an infinite geometric series is
a + ar + ar² + ar³ + · · · = a/(1 – r).
The series is “geometric” since each successive term in the series is multiplied by the same constant r and “infinite”
since the terms in the series continue forever. The condition |r| < 1 ensures that the sum of the series has a well-defined
value rather than diverging to infinity or negative infinity.
Example 3.17 (Coin tosses until a head) Proposition 3.7 can be applied with a = 1/2 and r = 1/2 to yield
P(A1) + P(A2) + P(A3) + · · · = (1/2)/(1 – 1/2) = 1,
confirming that the probabilities constitute a probability distribution over the sample space S = {1, 2, 3, …}. We can
use Proposition 3.7 to determine other probabilities for this experiment. For example, the probability that it takes at
least four tosses to see heads is
P(A4) + P(A5) + P(A6) + · · · = 1/16 + 1/32 + 1/64 + · · · = (1/16)/(1 – 1/2) = 1/8.
Alternatively, to determine this probability, we can think about the complement of the event that the heads occurs
during the first three tosses, whose probability is
1 – (P(A1) + P(A2) + P(A3)) = 1 – (1/2 + 1/4 + 1/8) = 1/8.
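Both routes to the answer are easy to check numerically in R by truncating the infinite series at a large number of terms. A minimal sketch:
> sum(0.5^(1:50))   # P(A1) + P(A2) + ..., truncated at 50 terms
[1] 1
> sum(0.5^(4:50))   # P(A4) + P(A5) + ..., which equals 1/8
[1] 0.125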
Example 3.18 (Website visitors until a purchase) The example of purchase behavior for three website visitors can be
extended to allow for an infinite stream of website visitors. Assume independence of each visitor’s purchase behavior;
that is, for any subset of visitors, the individual purchase events are mutually independent. Assume that the probability
of a purchase by any given visitor is 20%. Similar to the coin toss example, let Ak denote the event corresponding to
the first purchase being made by the k-th website visitor. The probabilities of A1 , A2 , … are
P(A1) = 0.2                             (outcome Y)
P(A2) = (0.8)(0.2) = 0.16               (outcome NY)
P(A3) = (0.8)(0.8)(0.2) = 0.128         (outcome NNY)
⋮
P(Ak) = (0.8)^(k–1)(0.2)
⋮
For Ak to occur, the first k – 1 visitors do not make a purchase and then the k-th visitor makes a purchase, which
corresponds to P(Ak) = (0.8)^(k–1)(0.2). These probabilities sum to one since Proposition 3.7 can be applied with a = 0.2
and r = 0.8. What is the probability that it takes at least ten website visitors before a purchase is observed? Applying
Proposition 3.7 yields
P(A10) + P(A11) + · · · = (0.8)^9(0.2) + (0.8)^10(0.2) + · · · = (0.8)^9(0.2)/(1 – 0.8) = (0.8)^9 ≈ 0.134.
We get (0.8)^9 as the probability here, which is equal to the probability that the first nine visitors are non-purchasers,
an event that is equivalent to seeing at least ten visitors before a purchase.
Example 3.19 (Number of patents) Let the sample space S = {0, 1, 2, 3, …} contain the possible outcomes for the
number of patents that a firm is awarded in a given year. Let Ak denote the event that the firm is awarded k patents
in a given year. Assume that the probability of being awarded no patents is P(A0 ) = 0.5 and the probability of being
awarded one patent is P(A1 ) = 0.3. If the probabilities for Ak for k ≥ 2 (events associated with two or more patents) are
P(Ak) = 0.3c^(k–1) for some constant c, what must the value of c be? Note that P(A2) = 0.3c, P(A3) = 0.3c², P(A4) = 0.3c³,
and so on. Since the events A0, A1, A2, … are disjoint and exhaustive, their probabilities must sum to one:
P(A0) + P(A1) + P(A2) + P(A3) + · · · = 0.5 + 0.3 + 0.3c + 0.3c² + · · · = 0.5 + 0.3/(1 – c),
so that c = 0.4. For this value of c, P(A2 ) = 0.12, P(A3 ) = 0.048, P(A4 ) = 0.0192, and so on.
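As a numerical sanity check, the probabilities can be summed in R after truncating the series at a large number of terms (a minimal sketch; the name c0 avoids masking R’s built-in c function):
> c0 <- 0.4
> 0.5 + sum(0.3 * c0^(0:100))   # P(A0) + P(A1) + P(A2) + ...
[1] 1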
Exercises
1. A survey of college students with cell phones finds that 50% have an iPhone (I), 30% have a Samsung phone (S), and
20% have another phone (A). Moreover, 70% of iPhone owners have unlimited data plans, 60% of Samsung owners
have unlimited data plans, and 45% of other phone owners have unlimited data plans. Let U denote the event for
having an unlimited data plan.
(a) What is the probability that a surveyed student has a Samsung phone and an unlimited data plan?
(b) What is the probability that a surveyed student has an unlimited data plan?
(c) For a surveyed student with an unlimited data plan, what is the probability that they own an iPhone?
2. The percentages of blood types in the United States (ignoring positive and negative sub-types) are as follows: A
(40%), B (11%), AB (4%), and O (45%). Suppose two individuals are drawn randomly from the population and that
their blood types are independent of each other.
(a) What is the probability that both individuals have blood type A?
(b) What is the probability that the blood types of the two individuals are the same?
(c) What is the probability that the blood types of the two individuals are different?
3. You have three different e-mail accounts, with 70% of e-mails coming into account A, 20% coming into account B,
and 10% coming into account C. The likelihood that an e-mail is spam is 1%, 2%, and 5% for the three accounts,
respectively.
(a) What is the probability that a randomly selected e-mail is spam?
(b) If you know that a received e-mail is spam, what is the probability that it is in account A?
4. An inventor submits two patent applications. Let P1 be the event that the first patent application is successful, and
let P2 be the event that the second patent application is successful. Assume that P1 and P2 are independent, with
P(P1 ) = 0.6 and P(P2 ) = 0.3.
(a) If the first patent application is unsuccessful, what is the probability that the second application is successful?
(b) What is the probability that at least one of the two patent applications is successful?
(c) If you know that at least one of the patent applications is successful, what is the probability that only the second
application is successful?
5. One of two possible suspects (Bert or Ernie) commits a crime. Bert has more of a checkered past than Ernie, so
suppose the prior probability of Bert committing the crime is 70% and the prior probability of Ernie committing
the crime is 30%. There was an eyewitness to the crime who says that she saw Ernie do it. The probability that the
eyewitness is correct is equal to 90%. That is, the probability of the eyewitness saying she saw Ernie given that Ernie
committed the crime is 0.9. Similarly, the probability of the eyewitness saying she saw Bert given that Bert committed
the crime is 0.9. What is the posterior probability that Ernie committed the crime given the eyewitness testimony?
6. The probability that an individual in the United States has a certain disease is 5%. There is a diagnostic test that
correctly detects the disease 97% of the time (3% false negative) and correctly detects absence of the disease 99% of
the time (1% false positive). A randomly selected individual has a diagnostic test, and the result is positive (detects the
disease). What is the posterior probability that the individual has the disease?
7. At a certain large company, 70% of the economists and 50% of the data scientists have a graduate degree.
Furthermore, 80% of the employees at the company are economists and 20% are data scientists. If a
randomly chosen employee has a graduate degree, what is the probability that the employee is an economist?
8. For a certain election, the probability of a registered voter being a “young voter” (under 40) is 35%, and the
probability of a registered voter being an “older voter” (at least 40) is 65%. Registered young voters are less likely to
actually vote (55% probability) in the election than registered older voters (75% probability). If a randomly selected
registered voter did not vote in the election, what is the probability that it is a young voter?
9. The use of generative artificial intelligence (AI) tools to assist with writing has become prevalent. It is sometimes
possible to detect the use of an AI tool based upon words that are more frequently used by an AI tool as compared to a
human. This question considers a simple example. Suppose that, prior to the introduction of AI tools, the probability
that a student uses the word “delve” in a submitted paper is 3%, whereas after the introduction of AI tools, the
probability jumps to 15%. Let g denote the probability that a student uses an AI tool (after its introduction). Assume
that the probability that a student uses the word “delve” remains at 3% conditional on the student not using the AI tool
(after its introduction).
(a) If g = 1 (all students use the AI tool), what is the probability that a student uses “delve” in a submitted paper
after the introduction of AI tools?
(b) For this part, assume that the probability that a student uses the AI tool is equal to 40% (that is, g = 0.4).
i. What is the probability that a student uses “delve” in a submitted paper if the student uses an AI tool?
ii. If “delve” is used in a submitted paper, what is the probability that the student used an AI tool?
(c) *For this part, assume that g is unknown. What is the lowest possible value of g?
10. Two companies, Acme Manufacturing and Bolts Emporium, have shares of stock that trade on an exchange. On any
given day, the following probability table describes the performance of the two stocks, where each stock can go up in
price, go down in price, or stay the same.
                                  Bolts Emporium
                                Down   No Change    Up
                    Down        0.18      0.02     0.22
Acme Manufacturing  No Change   0.03      0.02     0.04
                    Up          0.14      0.05     0.30
(a) What is the probability that Acme Manufacturing’s stock goes up on a given day?
(b) What is the probability that Bolts Emporium’s stock has some change (“Down” or “Up”) on a given day?
(c) What is the probability that at least one of the two stocks has no change on a given day?
(d) Are the events “Acme Manufacturing’s stock goes up” and “Bolts Emporium’s stock goes up” independent?
(e) What is the probability that Acme Manufacturing’s stock goes up on a day that Bolts Emporium’s stock goes
up?
(f) What is the probability that Acme Manufacturing’s stock goes up on a day that Bolts Emporium’s stock doesn’t
go down?
(g) If exactly one of the two stocks goes up on a given day, what is the probability that it is Acme Manufacturing?
11. The table below gives the joint counts of age and rank of the faculty of a major university in a recent year. There are
a total of 1164 faculty members in the table.
                        Rank
               Assistant   Associate   Full
     Under 30      57           3        2
     30-39        163         170       54
Age  40-49         61         125      160
     50-59         36          68      155
     60 and over    3          15       92
Interpret the relative frequencies as true probabilities for this question. For example, the joint probability that a
randomly chosen professor is under 30 and an assistant professor is 57/1164.
(a) What is the probability that a randomly chosen faculty member is a full professor?
(b) What is the probability that a randomly chosen faculty member is 50+ years old?
(c) What is the probability that a randomly chosen faculty member is 50+ years old and a full professor?
(d) What is the probability that a randomly chosen faculty member is 50+ years old or a full professor?
(e) What is the probability that a professor who is 50+ years old is a full professor?
(f) What is the probability that a full professor is 50+ years old?
12. Suppose that, on any given weekday, 70% of college students eat breakfast, 60% do homework, and 85% do at least
one of these two things.
(a) Are the events “student eats breakfast” and “student does homework” independent?
(b) If a randomly selected student eats breakfast, what is the probability they do homework?
(c) Suppose two students are selected at random and their behaviors are independent of each other. What is the
probability that exactly one of the two students does homework?
(d) Suppose 80% of students that eat breakfast on a given day also eat lunch that day, while 90% of students that
don’t eat breakfast on a given day eat lunch that day. For a student who eats lunch on a given day, what is the
probability they eat breakfast on that day?
13. Suppose A and B are independent events.
(a) Show that A and Bc are independent events.
(b) Show that Ac and Bc are independent events.
14. Consider a group of 10 stocks, with 20% (two stocks) corresponding to “tech” companies and 80% (eight stocks)
corresponding to “non-tech” companies. Suppose you randomly pick two (different) stocks, with T1 denoting that the
first stock is “tech” and T2 denoting that the second stock is “tech.”
(a) Determine P(T1 ), P(T2 ), and P(T1 ∪ T2 ). Are T1 and T2 independent?
(b) Repeat (a), but assume that there are 100 stocks with 20% (20 stocks) corresponding to “tech” companies.
(c) Repeat (a), but assume that there are 1,000 stocks with 20% (200 stocks) corresponding to “tech” companies.
(d) In thinking about the independence checks in (a)-(c), what can you say about independence becoming
“approximately” correct as the number of stocks increases?
15. Four employees at a given company individually go out for lunch. Suppose they each randomly and independently
choose a restaurant from among seven choices.
(a) What is the probability that each employee goes to a different restaurant?
(b) What is the probability that all four employees go to the same restaurant? (The restaurant can be any one of the
seven possibilities.)
(c) Conduct 10,000 computer simulations in R to confirm your answers to (a) and (b).
16. A space technology company is attempting to launch four satellites into orbit, using four separate rockets. Assume
that the probability of success for each launch is 90% and that the launches are independent of each other.
(a) What is the probability distribution of the total number of successful launches?
(b) Conditional on knowing that at least three launches were successful, what is the probability that exactly three
launches were successful?
17. You have three quarters, two dimes, three nickels, and one penny in your pocket. When removing a coin from your
pocket, you may assume that there is an equal chance of picking any of the coins that are in your pocket.
(a) If you randomly remove one coin, what is the probability that a quarter is not chosen?
(b) If you randomly remove one coin at a time, without putting coins back into your pocket, what is the probability
that the first quarter is not chosen until the third coin or later?
(c) Rather than being interested in the sequence of coins, suppose you are interested in the total amount of money
that you take out of your pocket before the first quarter is taken out. What are the outcomes in the sample space
S for this experiment?
(d) Conduct 10,000 computer simulations in R to approximate the probability that exactly $0.15 is taken out of
your pocket before the first quarter is taken out.
18. Suppose you repeatedly and independently roll a fair six-sided die. Let S1 denote the event that the first six appears
on the first roll, S2 denote the event that the first six appears on the second roll, and so on.
(a) What is P(S1 )?
(b) What is P(Sk ) for an arbitrary k ∈ {1, 2, 3, …}?
(c) Show that the P(Sk ) probabilities sum to one.
(d) What is the probability that it takes at least ten rolls for a six to appear?
(e) What is the probability that the first six appears on an even roll (i.e., for k ∈ {2, 4, 6, 8, …})?
(f) What is the probability that the first six appears on the second roll given that the first six appears on an even
roll?
(g) Conduct 10,000 computer simulations in R to confirm your answer in (d). For each simulation, use a while
loop to repeatedly simulate a die roll until a six appears.
(h) Modifying your code from (g) as necessary, what is the average number of rolls that it takes for a six to appear
over the 10,000 simulations?
19. Two individuals A and B play the following dice game: the players alternate rolling a fair six-sided die, and a player
wins if she rolls the same number as the previous player did. The players continue alternating rolls until someone
wins. Assume that player A starts and that they can’t win on the first roll (since there was no previous roll). What is
the probability that player A wins?
20. An electronics store (Electric City) sells laptop computers. On any given day, the probability that Electric City
doesn’t sell a laptop computer is 0.35, and the probability that it sells k laptop computers (for k > 0) is 0.25c^(k–1).
(a) What is the value of c?
(b) What is the probability that the store sells at least three laptop computers?
(c) Another store in the same city (Computer King) also sells laptop computers. On any given day, this store has
probabilities 0.20, 0.50, and 0.30 of selling zero, one, and two laptop computers, respectively. If sales at the
two stores are independent of each other, what is the probability that Electric City sells more laptop computers
than Computer King on any given day?
(d) Using the same information provided in (c), on any given day, what is the probability that Electric City sells
exactly two laptop computers given that Electric City sells more laptop computers than Computer King?
21. *A gambler is playing a casino game where they win $100 with probability p and lose $100 with probability 1 – p,
with p < 0.5 (so that the casino has an advantage).
(a) Suppose the gambler starts with $200. If the gambler repeatedly plays the game until they lose all their money
or until they have $400, what is the probability (in terms of p) that they lose all their money?
(b) Conduct 10,000 simulations in R to confirm your answer in (a) for the values p = 0.48 (small casino advantage)
and p = 0.45 (large casino advantage).
22. *In college volleyball, the first team to score 25 points wins a set. But, if the teams are tied at 24-24, the set continues
until one team has two points more than the other team (that is, 26-24 or 27-25 or 28-26 and so on). Suppose two teams,
call them team A and team B, have reached a 24-24 score. Team A is the better team, and the probability that team A
wins any future point is 60%. Assume that all future points are mutually independent.
(a) What is the probability that team A wins the set by a 26-24 score?
(b) What is the probability that the set becomes tied at 25-25?
(c) What is the probability that team A wins the set?
(d) Conditional on team A winning the set, what is the probability that team A wins the set by a 26-24 score?
(e) Conduct 10,000 simulations in R to confirm your answer in (c). For each simulation, use a while loop to
repeatedly simulate points until one of the two teams wins.
(f) Modify your code from (e) to record, for each simulated game, the number of points scored by the winning
team. What is the average winning score over the 10,000 simulations?
Combinatorics is a topic in mathematics that involves counting methods. In many interesting situations, our ability to
determine probabilities requires that we count and/or enumerate the possible outcomes. Chapters 2 and 3 considered
some simple examples (e.g., sequences of coin-toss outcomes and sequences of customer-purchase outcomes) of
counting different types of outcomes. As another example, consider a lottery where the winning numbers are four
distinct numbers that the lottery authority chooses from the set {1, 2, …, 30}. How many different ways can the four
numbers be drawn? If you were to play the lottery at random (i.e., randomly picking four numbers from the 30 possible
numbers), what is your chance of winning the lottery?
is (8)(12)(4) = 384. If you’re restricted to picking just one of the funds, the number of choices from the sum rule is
8 + 12 + 4 = 24.
Definition 4.1 An ordered subset of distinct choices is called a permutation, and Pn,k denotes the number of
permutations of size k that can be formed from n objects (for k ≤ n).
Proposition 4.5. (Number of permutations) The number of permutations of size k that can be formed from n objects
(for k ≤ n), denoted Pn,k , is
Pn,k = n!/(n – k)!,
where j! = (1)(2) · · · (j), read as “j factorial,” is the product of all positive integers up through j. An equivalent formula
is
Pn,k = (n)(n – 1) · · · (n – k + 1).
Example 4.8 (Board of directors with titled positions) The number of permutations for the three titled board positions
in Example 4.6 is P20,3 = 20!/17! = (20)(19)(18) = 6,840.
factorial(20)/factorial(17)
## [1] 6840
20*19*18
## [1] 6840
Example 4.9 (Stock portfolio with different weights) Suppose there are 100 possible stocks, and you want to form
a 30%/25%/20%/15%/10% weighted portfolio of five stocks. How many possible portfolios are there? The order of
the five stocks matters here since a different weight is being placed on each of the five choices. Then, the number of
portfolio choices is P100,5 = 100!/95! = (100)(99)(98)(97)(96) = 9,034,502,400 (over 9 billion!).
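This count is easy to verify in R; using prod on the vector 100:96 avoids computing very large factorials:
prod(100:96)
## [1] 9034502400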
Definition 4.2 An unordered subset of choices is called a combination, and Cn,k denotes the number of combinations
of size k that can be formed from n objects (for k ≤ n).
Proposition 4.6. (Number of combinations) The number of combinations of size k that can be formed from n objects
(for k ≤ n), denoted Cn,k , is
Cn,k = Pn,k/k! = n!/(k!(n – k)!).
Sometimes the binomial-coefficient notation, read as “n choose k” and written here as (n choose k), is used as an
alternative to the Cn,k notation, so that
(n choose k) = Cn,k = n!/(k!(n – k)!).
Example 4.10 (Board of directors with no titled positions) The number of combinations for the three (untitled)
board positions in Example 4.7 is C20,3 = (20 choose 3) = 20!/(3!17!) = ((20)(19)(18))/((3)(2)(1)) = 1,140.
Proposition 4.6 provides an exact relationship between C20,3 and P20,3, specifically that C20,3 = P20,3/3! = P20,3/6.
As discussed above, any ordered choice of three members is equivalent to five other ordered choices of those three
members. Why is that the case? For a group of three members, there are six different orderings: three choices for the
first member, then two choices for the second member, and then one choice for the third member, which is a total of
(3)(2)(1) = 3! = 6. The division of P20,3 by 3! in the C20,3 formula accounts for the fact that the six possible orderings
of any three members are equivalent to just one unordered group of the three members. The R function choose(n,k)
calculates (n choose k) = Cn,k.
choose(20,3)
## [1] 1140
Example 4.11 (Stock portfolio with equal weights) Suppose there are 100 possible stocks. You want to invest equally
(20% each) in five different stocks. How many possible portfolios are there? The order of the five stocks does
not matter since it’s an equally weighted portfolio. Then, the number of portfolio choices is C100,5 = (100 choose 5) =
100!/(5!95!) = ((100)(99)(98)(97)(96))/((5)(4)(3)(2)(1)) = 75,287,520, still a large number but smaller than P100,5 by
a factor of 5! = 120.
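The choose function provides a quick check of this count:
choose(100,5)
## [1] 75287520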
Example 4.12 (Choice of bonus stocks) We return to Example 4.5, where we considered an on-line trading app
bonus in which the user is allowed to choose three shares from the set of ten possible stocks {S1 , S2 , …, S10 }. The
important difference from Example 4.11 is that the user is not restricted to picking three distinct stocks. There are
three possibilities: (i) the user’s choice of shares involves three distinct stocks, (ii) the user’s choice of shares involves
two distinct stocks, with two shares of one stock and one share of the other stock, and (iii) the user’s choice of shares
involves one stock, with all three shares being that stock. If we can determine the number of possible choices associated
with each of these three possibilities, the sum rule can be used to determine the total number of possible choices. For
case (i), where the choice of shares involves three distinct stocks, the number of possible choices is C10,3 = (10 choose 3) = 120
since the order of the three stocks doesn’t matter (each has one share). For case (ii), where the choice of shares involves
two distinct stocks, the number of possible choices is P10,2 = (10)(9) = 90 since the order of the two stocks matters (one
has two shares, the other has one share). For case (iii), where the choice of shares involves one stock, the number of
possible choices is just 10. Thus, by the sum rule, the total number of possible choices is 120 + 90 + 10 = 220.
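A one-line check of this sum-rule calculation in R:
choose(10,3) + 10*9 + 10
## [1] 220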
What if the department also needs to specify a chair for each of the three committees? We can think of there being
m = 7 subsets now, rather than m = 4 subsets, as each of the three committees is effectively split into two subsets, one
for the chair member (1 faculty member) and one for the non-chair members (11 for admissions, 7 for curriculum, 3
for alumni relations). The number of possible ways to form the committees, with the chairs specified, is considerably
larger than the answer above and is equal to
(26 choose 11, 1, 7, 1, 3, 1, 2) = 26!/(11!1!7!1!3!1!2!) = 167,051,941,056,000.
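A quick numerical check in R (the 1! terms equal one and are omitted):
factorial(26)/(factorial(11)*factorial(7)*factorial(3)*factorial(2))
## [1] 1.670519e+14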
Example 4.15 (Stock portfolio with sectors) Consider again the problem of forming a five-stock equally weighted
portfolio, with 20% weight on each stock, but now suppose 20 of the 100 stocks are classified as “tech” stocks and 80
of the 100 stocks are classified as “non-tech” stocks. How many equally weighted five-stock portfolios have exactly
two “tech” stocks and three “non-tech” stocks? There are C20,2 = (20 choose 2) ways of choosing the “tech” stocks
(the ordering of the two stocks doesn’t matter) and C80,3 = (80 choose 3) ways of choosing the “non-tech” stocks, so
that, by the product rule, the total number of five-stock portfolios with two “tech” stocks and three “non-tech” stocks
is (20 choose 2)(80 choose 3).
If we form a five-stock equally weighted portfolio by picking five stocks from the 100 total at random, what is
the probability the portfolio has two “tech” stocks and three “non-tech” stocks? We’ve already found the relevant
numerator, which is (20 choose 2)(80 choose 3). The relevant denominator is the total number of possible five-stock
portfolios, which is C100,5 = (100 choose 5). Thus, the probability is
(20 choose 2)(80 choose 3)/(100 choose 5) = [((20)(19))/((2)(1))] × [((80)(79)(78))/((3)(2)(1))] / [((100)(99)(98)(97)(96))/((5)(4)(3)(2)(1))] ≈ 0.207 or 20.7%.
How about the probability that a randomly chosen five-stock portfolio has at most two “tech” stocks? Using similar
reasoning, the number of five-stock portfolios with zero “tech” stocks is (20 choose 0)(80 choose 5) and the number
with one “tech” stock is (20 choose 1)(80 choose 4). Thus, the probability of having at most two “tech” stocks in a
randomly chosen five-stock portfolio is
[(20 choose 0)(80 choose 5) + (20 choose 1)(80 choose 4) + (20 choose 2)(80 choose 3)]/(100 choose 5) ≈ 0.947 or 94.7%.
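Both probabilities are straightforward to verify with the choose function:
choose(20,2)*choose(80,3)/choose(100,5)
## [1] 0.2073438
(choose(20,0)*choose(80,5) + choose(20,1)*choose(80,4) + choose(20,2)*choose(80,3))/choose(100,5)
## [1] 0.9467972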
Although we have used combinatorics to answer interesting probability questions in the examples above, some
problems are too complicated to solve analytically in this way (or at least complicated enough that you would rather
not try!). Such an example is considered below and illustrates how computer simulation offers an alternative method
for calculating probabilities.
Example 4.16 (The likelihood of “streaks”) If a coin is tossed 100 times, what is the probability of observing a
streak of at least five consecutive heads during the 100 tosses? (Try guessing before reading any further.) With two
possibilities for each toss, there are 2^100 possible 100-coin sequences by the product rule. Moreover, if the coin tosses
are independent of each other, each of these 2^100 sequences must be equally likely. Therefore, the probability of
observing a streak of at least five consecutive heads during the 100 tosses is equal to S/2^100, where S is the number of
100-coin sequences with a streak of at least five consecutive heads. Unfortunately, it is extremely difficult to analytically
determine the value of S. An alternative approach is to use computer simulation, as follows:
• Step 1: Simulate the experiment of flipping 100 coins, with probability 1/2 of heads and probability 1/2 of tails for
each toss.
• Step 2: For the simulated sequence of 100 coin tosses, check and record whether or not there is a streak of at least
five heads.
• Repeat Steps 1 and 2 many times, and then determine the frequency or proportion of simulated 100-coin sequences
that contain a streak of at least five consecutive heads.
set.seed(1234)
We use a for loop to repeatedly conduct the 100-toss experiment. The loop is executed 100,000 times. During
each iteration of the loop, (i) 100 coin tosses are simulated, (ii) a string of the results is created and stored in
toss_string, using the collapse = "" option for the paste function, (iii) the variable streak_counter
is incremented by one if the string "HHHHH" occurs within toss_string, and (iv) the cumulative streak count is set
to the value of streak_counter. After the loop, the vector of cumulative frequencies freq_streaks is created
and plotted using the plot function.
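The full script is available on the companion website; a minimal sketch consistent with the description above is the
following, where cum_streaks is an assumed name for the vector of cumulative streak counts (the other object names
are taken from the text):
num_simulations <- 100000
streak_counter <- 0
cum_streaks <- numeric(num_simulations)  # assumed name: cumulative streak counts
for (i in 1:num_simulations) {
  # (i) simulate 100 coin tosses
  tosses <- sample(c("H", "T"), 100, replace = TRUE)
  # (ii) create a single string of the results
  toss_string <- paste(tosses, collapse = "")
  # (iii) increment the counter if a five-head streak occurs in the string
  if (grepl("HHHHH", toss_string)) streak_counter <- streak_counter + 1
  # (iv) record the cumulative streak count
  cum_streaks[i] <- streak_counter
}
# vector of cumulative frequencies, plotted as in Figure 4.1
freq_streaks <- cum_streaks / (1:num_simulations)
plot(freq_streaks, type = "l", xlab = "Simulation number",
     ylab = "Cumulative frequency of five-head streak occurrence")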
The simulated frequencies level off at around 0.81 or 81%. The actual calculated frequency after the 100,000
simulations, resulting from the command freq_streaks[num_simulations], is 0.81156. How can we
determine the accuracy of the computer-simulated frequency? That is, how close is the computer’s answer to the
“true” probability of the event, which is the probability of seeing a five-head streak in a 100-coin sequence? While a
meaningful answer can’t be provided yet, the statistical tools developed in this book will allow simulation accuracy to
be quantified. There is simulation error whenever a computer simulates probabilities, and we can quantify how large
this simulation error is likely to be.
Exercises
1. Your phone has 30 different songs available: 10 by artist A, 8 by artist B, and 12 by artist C.
(a) How many different ways can you play first a song by artist A and then a song by artist B?
(b) How many different ways can you play first a song by artist A and then a song by artist B and then a song by
artist C?
(c) If you no longer restrict the order of the artists, how many different ways can you play three songs that each
have a different artist?
Figure 4.1 Simulated frequency of a five-head streak among 100 coin tosses (cumulative frequency of five-head
streak occurrence plotted against simulation number)
2. A 12-person jury is being selected from a pool of 50 potential jurors. Among the potential jurors, there are 20 men
and 30 women.
(a) How many possible ways are there to select the jury?
(b) If the judge insists on there being an equal number of men and women on the jury, how many possible ways
are there to select the jury?
(c) The jury needs a foreperson, who is one of the 12 jurors. Ignoring the gender restriction for now, how many
possible ways are there to select the jury and foreperson?
(d) Now assume that the judge insists that there are an equal number of men and women on the jury and that the
foreperson is a woman. How many possible ways are there to select the jury and foreperson?
3. A company visits a college campus to interview students. The company has seven Economics majors, six Finance
majors, and five Accounting majors from which to choose. Unfortunately, the company has lost everyone’s résumés,
so they randomly pick three students to interview.
(a) What is the probability that all three interviewees are Economics students?
(b) What is the probability that all three interviewees are from the same major?
(c) What is the probability that the set of three interviewees has either no Economics students or no Finance
students?
(d) What is the probability that at least one of the majors has no students interviewed?
4. At the “Pick One” restaurant, you have to choose one dish option from each of the following four dinner courses:
Appetizer (Vegetarian, Chicken, Pork), Salad (Vegetarian, Chicken), Small Plate (Vegetarian, Beef, Seafood), Large
Plate (Vegetarian, Beef, Chicken, Pork, Seafood). So, in order, CVBS is one possible choice.
(c) Following the logic from the previous part, if the size of the class is 20 students, what is the probability that at
least two of them share a birthday? Try to simplify the formula using factorial functions, but don’t worry about
calculating the exact answer.
(d) Using this approach for any class size N, one can show that the probability that at least two of the students
share a birthday is equal to
1 – (N! × (365 choose N))/365^N.
Apply this formula in R with N = 20 to calculate the probability for (c).
(e) Write a function birthday that takes an integer n ≥ 2 as its single argument and returns the probability from
the formula in (d).
(f) Using the birthday function from (e), how many students would need to be in a class for the probability of
two students sharing a birthday to be greater than 90%?
(g) Using the birthday function from (e), plot the probabilities of at least two students sharing a birthday against
class size N for all values of N between 20 and 60 (inclusive).
9. (Matching Problem) At the beginning of an exam, every student is required to leave their phone on the professor’s
desk. After the exam is over, the professor randomly gives back the phones to students when they turn in their exams.
(a) If there are only three students in the class, what is the probability that at least one student gets their own phone
back after the exam?
(b) If there are only four students in the class, what is the probability that at least one student gets their own phone
back after the exam? (If you’re having trouble with this part, list all P4,4 = 4! = 24 possible permutations for the
order of the phones being returned.)
(c) While it is possible to mathematically generalize the results from (a) and (b) to an arbitrary class size N, an
alternative approach is to use computer simulation to approximate the probability. Conduct 10,000 computer
simulations in R to approximate the probability that at least one student gets their own phone back in a class of
20 students.
(d) Generalize your code from (c) by writing a function matching that takes an integer n ≥ 2 as its single
argument and returns the approximate probability (based on 10,000 simulations) that at least one student gets
their own phone back in a class of n students.
10. A business magazine rates S&P 500 mutual fund managers based upon how often their fund beats the return on the
S&P index. Any given manager has a 50% chance of beating the S&P index in a given year (and a 50% chance of not beating it), and
each year’s performance is mutually independent.
(a) The magazine gives its “Gold Star” rating to any manager who has beaten the S&P index in at least five of the
last six years. What is the probability of any given manager getting the “Gold Star” rating?
(b) The magazine gives its “Silver Star” rating to any manager who has beaten the S&P index in exactly four of
the last six years. What is the probability of any given manager getting the “Silver Star” rating?
(c) If you know that a manager has beaten the S&P index in at least four of the last six years, what is the probability
that the manager gets the “Gold Star” rating? the “Silver Star” rating?
11. A publisher has six economics textbooks in its catalog. The publisher is deciding on its advertising strategy, and
each textbook can be advertised or not.
(a) How many distinct outcomes are possible? (Order does not matter.)
(b) How many outcomes have four or more of the economics textbooks advertised?
(c) A marketing specialist is hired to rank the six textbooks according to projected sales over the next two years.
How many distinct rankings are possible?
(d) The publisher decides to commission new editions of three of the six textbooks, one in each of the next three
years. The marketing specialist is asked to choose three textbooks for new editions, specifying the textbook to
be updated in the first year, the textbook to be updated in the second year, and the textbook to be updated in the
third year. How many distinct choices of the three textbooks are possible? (Order matters.)
(e) The publisher decides to start an annual mailing to economics professors of an advertising pamphlet that
focuses on three of its six textbooks. The company does not want to focus on the same three textbooks more
than once. It does not mind repeating one or two of them, but it does not want all three to be the same as in a
previous mailing. How many years can the publisher stick to this policy before it is forced to change it?
(f) How would your answer to (e) change if one of the authors just won a Nobel Prize, and it is decided that this
author’s textbook must be included every year?
(g) Suppose that, of the five remaining textbooks (other than the Nobel Prize author’s book), three are advanced
and two are introductory. How does your answer change if, in addition to the text by the Nobel Prize winner,
one text must be advanced and the other must be introductory?
(h) If the three textbooks are chosen randomly, what is the probability that the textbook by the Nobel Prize winner
is included?
(i) If the three textbooks are chosen randomly, what is the probability that at least one introductory textbook is
included?
(j) If three textbooks are chosen randomly for four years in a row, what is the probability that the textbook by the
Nobel Prize winner is included in all four years?
12. Modify the R code from Example 4.16 to approximate the probability of having a streak of at least six consecutive
heads in 100 tosses. Change the limits of the y-axis, using the ylim = c(0, 1) option. What is the simulated frequency
of streaks? Now do the same exercise for a streak of at least seven consecutive heads in 100 tosses.
13. Refer to Example 4.14 for this question.
(a) Conduct 100,000 simulations in R to approximate the two probabilities found analytically in Example 4.14:
(i) the probability that a randomly chosen three-letter website name has only alphabetic characters, and (ii) the
probability that a randomly chosen three-letter website name has three distinct alphabetic characters. (Hint:
Sample from the vector 0:35, where 0 through 9 correspond to the numerical characters, and 10 through 35
correspond to the alphabetic characters.)
(b) Conduct 100,000 simulations in R to approximate the probability that the sum of any numerical characters in a
randomly chosen three-letter website name is greater than 10. (If there are no numerical characters in a website
name, the sum should be treated as zero.)
This chapter discusses several types of economic data that are typically encountered in practice. Also, this chapter
formalizes the concept of sampling, providing a framework to explain how data are generated and observed.
Definition 5.2 Time-series data consist of observations on the same unit that are measured at different points in
time. When the time series consists of a single variable, the data are univariate time-series data. When the time series
consists of more than one variable, the data are multivariate time-series data.
Again, the idea of a “unit” is general and can be many things in practice (an individual, a firm, a country, etc). Some
examples of time-series datasets include the following:
Example 5.4 (Macroeconomic indicators for the United States) To track and analyze the overall macroeconomy of the
United States, annual data for the unemployment rate, inflation, GDP growth, and the budget deficit can be collected.
Example 5.5 (Asset returns) There is a wealth of time-series data available for financial markets, which can be
collected at almost any time frequency — annually, monthly, daily, hourly, and even by the minute. An example of a
time-series dataset would be the daily returns for a given stock, like Apple, or for some other financial asset.
Whereas the units, like individuals or firms, in cross-sectional data can generally be treated as unrelated to each
other, it is often the case that the observations in a time-series dataset are related to each other. For example, the
price of Apple’s stock on a given day is going to be closely related to the price of Apple’s stock on the previous day.
Likewise, the inflation rate in the United States for a given year is likely to be related to the inflation rate for the
previous year.
Definition 5.3 Panel (or longitudinal) data have both a cross-sectional dimension and a time-series dimension, with
information about the same cross-sectional units being observed at different points in time. The number of times that
each cross-sectional unit is observed is at least two.
Here are some examples:
Example 5.6 (State-level panel data on cigarettes) Building upon Example 5.3, annual data can be collected for all
50 states on cigarette taxes (tax per pack), cigarette prices (average price per pack), and smoking rates among certain
age groups. Whereas the cross-sectional dataset only allows analysis of data collected at one point in time, the panel
dataset allows analysis of how cigarette taxes and smoking rates have changed over time and perhaps whether there
is a relationship between tax changes and smoking-rate changes.
Example 5.7 (Influenza data) To analyze trends in influenza infections and vaccinations in the United States, monthly
data over the course of several years can be collected for all 50 states on influenza vaccination rates, deaths caused
by influenza, and hospitalizations caused by influenza.
Example 5.8 (Global macroeconomic panel data) Example 5.4 considered time-series data of macroeconomic
variables for the United States. To build a panel dataset for the global economy, time-series data with the same
variables, at the same annual frequency, could be collected for other countries. The resulting panel dataset would
have annual data on the unemployment rate, inflation rate, and GDP growth rate for many countries over a period of
several years.
The three types of data discussed here (cross-sectional, time-series, and panel) do not cover all possibilities of
interest. For example, in cases where we have different cross sections that are observed at different points in time, the
data are known as a repeated cross section. Whereas each cross-sectional unit in a panel dataset is observed more than
once, each unit is observed only once in the repeated cross section (i.e., in one of the time periods but not the others).
Definition 5.6 A discrete variable (or discrete numerical variable) is a variable where the number of possible values
can be counted, even if the number of possible values is infinite.
The number of children in a given household and the number of patents awarded to a given firm in a given year are
both examples of discrete variables. Even though these variables will usually have pretty low values in any observed
data, we can still think of these variables as having an infinite number of possible values, with any value from the
set {0, 1, 2, …} possible. In other cases, a discrete variable may be inherently finite. Some examples of finite discrete
variables include a student’s score on an Advanced Placement (AP) exam, which is in {1, 2, 3, 4, 5}, the number of
states that have a Republican governor in a given year, which is in {0, 1, 2, …, 50}, or the number of months for which
a given stock has a positive return in a given year, which is in {0, 1, 2, …, 12}. Although these examples all involve
integer values, there is nothing in the definition of discrete variables that restricts values to be integers. For example,
shoe size is a discrete variable that can take non-integer values for “half” sizes.
In contrast to discrete variables, a continuous variable has values along some portion, or all, of the real line, so that
the number of possible values is not countable.
Definition 5.7 A continuous variable (or continuous numerical variable) is a variable that can take on any value
on some interval or intervals of the real line, including perhaps the entire real line.
Examples of continuous variables include the monthly rainfall in a given city (measured in inches, but not rounded),
the daily stock return for a given stock, the fraction of monthly income that a given employed individual saves in a
given month, and the annual GDP of a given country. The possible values are different in these examples, with monthly
rainfall and annual GDP being non-negative real numbers in [0, ∞), the daily stock return being any real number in
[–1, ∞), and the savings fraction being any real number in [0, 1].
As seen in future chapters, the probability models used to model discrete variables and continuous variables are
fundamentally different. Think about the comparison between one of our discrete variable examples (the number
of children in a household) and one of our continuous variable examples (the amount of monthly rainfall in a city).
Suppose we are interested in knowing the probability that the value of the variable is between one and three (inclusive).
For the number of children variable, we would want to add up three probabilities: the probability of one child, the
probability of two children, and the probability of three children. For the monthly rainfall variable, that approach is
not appropriate since the variable can take any non-integer value between one and three. Instead, we use calculus to
“add up” (integrate) the probabilities over all the possible values between one and three.
Since probability models for continuous variables are often easier to work with than those for discrete variables, it
is often the case that a discrete variable will be modeled as continuous when it is reasonable to do so. The concept of
an “approximately continuous” variable covers this case:
Definition 5.8 An approximately continuous variable (or approximately continuous numerical variable) is a
discrete variable that can be treated and modeled as continuous, which is the case when two conditions are met:
(i) the unit of measurement is small compared to typical values of the variable, and (ii) the number of values in the
dataset is large, with relatively few repeats.
Some examples of approximately continuous variables are the weekly earnings of a given individual, the number of
employees in a given firm, and the credit score for a given individual. Thinking about the weekly earnings variable,
the values would generally be reported in dollars (perhaps as an integer or perhaps reported to two decimal places),
so formally speaking the variable is discrete. But the number of possible values is extremely large, and the unit of
measurement (dollars) is very small relative to the typical weekly wage, which can be in the hundreds or thousands
of dollars. Likewise, the number of employees in a given firm can take on many possible values and can be in the
hundreds or even thousands, whereas the unit of measurement (one worker) is relatively small.
Definition 5.10 A categorical variable is ordered if there is a natural ordering to the choices. A categorical variable
is unordered if there is no natural ordering to the choices.
In practice, it is often useful to use numerical values to indicate that a categorical variable has a certain value.
Specifically, a discrete zero-one variable can be used to indicate if a categorical variable has a given value. A one
indicates that it has that value, and a zero indicates that it does not have that value. This zero-one variable can be called
an indicator variable, a dummy variable, or a binary variable. Here the binary variable is numerical, in contrast to
the binary categorical variable in Definition 5.9.
To see how this works, first consider one of the binary categorical variable examples. For the home ownership
variable, which has possible values “yes” and “no,” we can define the variable owner to be
owner = 1 if the individual owns a home (“yes”), and owner = 0 if not (“no”).
While we could also define the variable nonowner to be 1 if the individual does not own a home and 0 if the individual
does, there’s no need to do so since the value of nonowner is known if the value of owner is known.
Now consider the labor force status variable, which is a categorical variable with more than two categories. In this
case, three different indicator variables employed, unemployed, and notinlf can be defined:
employed = 1 if the individual is employed, and 0 if not (unemployed or not in labor force)
unemployed = 1 if the individual is unemployed, and 0 if not (employed or not in labor force)
notinlf = 1 if the individual is not in the labor force, and 0 if not (employed or unemployed)
Using three indicator variables here is overkill, as once the values of two indicator variables are known, the value of
the third variable is known. Since an individual is in one and only one of the three categories, we have
employed + unemployed + notinlf = 1.
Therefore, if employed and unemployed are already defined, it follows that notinlf = 1 – employed – unemployed, so
that it’s unnecessary to specify notinlf as a third variable. For three categories, only two indicator variables (for two
of the categories) are needed to completely characterize the categorical variable. It doesn’t matter which of the three
categories is the “omitted category.”
This basic idea generalizes to categorical variables with more categories. If a categorical variable has C different
categories that are disjoint and exhaustive, using the same terminology introduced for events in Section 2.2, C – 1
indicator variables can be used to completely describe the categorical variable, where it doesn’t really matter which
category is omitted. Of course, it is important to have disjoint and exhaustive categories for this to be true, as that
implies that one and only one category may be the actual value for the categorical variable.
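In R, indicator variables can be created by converting logical comparisons to numeric values. A small illustration,
using a hypothetical vector of labor force statuses for four individuals:
lfstatus <- c("Employed", "Unemployed", "Not in LF", "Employed")
employed <- as.numeric(lfstatus == "Employed")
unemployed <- as.numeric(lfstatus == "Unemployed")
# notinlf is redundant given the other two indicators
notinlf <- 1 - employed - unemployed
employed
## [1] 1 0 0 1
notinlf
## [1] 0 0 1 0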
Definition 5.11 The population is the entire group for which conclusions are to be made from statistical analysis.
Definition 5.12 The sample consists of the specific units from the population for which data are collected and
observed.
Definition 5.13 The sample size, usually denoted n, is the number of units collected and observed.
The collected and observed data in the sample, consisting of n units, are taken or “drawn” from the underlying
population of interest. In many cases, the population is very large, and it is impossible to collect data on the entire
population due to cost, logistics, and/or other factors.
Example 5.9 (Election polling) We are interested in predicting the outcome of a political election with two candidates,
candidate A and candidate B. We conduct a pre-election poll, where the population of interest consists of all likely
voters. The collected sample is a subset of the population of likely voters who are surveyed to ask if they intend to vote
for candidate A or candidate B.
Example 5.10 (Salary expectations) We are interested in determining what students at Capita University think about
their post-graduation jobs and salaries. The population consists of all Capita University students. The sample is
a subset of Capita University students who are surveyed about their predicted post-graduation salary and job and
maybe additional items.
In both of these examples, it’s impractical to survey the entire population of interest, so drawing a representative
sample makes sense. In other cases, we can collect data on all of the units available, as would be the case, for example,
if we want to collect data for all 50 states in the United States.
Example 5.11 (State-level cigarette data) We are interested in collecting data on cigarette taxes/prices and smoking
rates for individual states in a given year. The population of interest consists of the states in the United States. Since
it’s feasible to collect data on all 50 states, the sample consists of the cigarette taxes/prices and smoking rates for all
50 states, with sample size n = 50.
Definition 5.14 A sample of size n is called a simple random sample if each element of the population is equally likely
to be sampled or, equivalently, if any possible sample of size n is equally likely to be chosen from the population.
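In R, the sample function can be used to draw a simple random sample. For instance, to draw n = 10 units at random
from a population of 100 units (indexed 1 through 100):
# each of the choose(100,10) possible samples is equally likely;
# the particular draw varies from run to run
sample(1:100, 10)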
While a thorough discussion of non-random sampling procedures is beyond the scope of this book, it is important
to understand why a sample might not be a simple random sample and how a non-random sample may affect the
analysis being conducted. When the sample is not a simple random sample, it may introduce sampling bias (or
sample selection bias) if the sampling method is related to the variable(s) of interest.
Example 5.12 (Election polling) Returning to Example 5.9, would a simple random sample be drawn from the
population of likely voters if a landline phone survey is used? The answer depends on whether or not someone
having a landline (i.e., a non-cellular phone in one’s residence) is related to whether they would favor candidate A
or candidate B. If it’s known that older voters are much more likely than younger voters to both support candidate A
and have a landline, the landline survey is not going to yield a simple random sample. Actually, surveying through
landlines by itself would violate the definition of a simple random sample in Definition 5.14 since older voters are
more likely to have landlines and therefore more likely than younger voters to be sampled in the first place. However,
that fact by itself would not mean that the sampling approach is a poor one if it weren’t also the case that use of
a landline is related to candidate preference. In this case, since the goal is to infer the proportion of likely voters
who favor each candidate, the landline survey is problematic because landline use is related to candidate preference.
Sampling bias or sample selection bias is introduced since the sample is more likely to have candidate A supporters
than a simple random sample would have.
Example 5.13 (Salary expectations) Returning to Example 5.10, let’s assume that every Capita University student is
assigned a random student ID number when they first arrive on campus. Consider two alternative sampling approaches
for gathering data about job/salary expectations: method 1, which involves surveying all Capita University students
whose student ID number ends in a “5,” and method 2, which involves surveying all Capita University students who
are in an advanced economics course. Method 1 does not lead to sampling bias since the last number of the student ID
number is chosen at random and, thus, should not be systematically related to any of the variables (salary expectation,
job expectation, etc) in which we might be interested.8 Method 2 is likely to lead to sampling bias since students in
the advanced economics course are not representative of the population of all Capita University students, except in
the unlikely case that the advanced economics course is required for all students. For instance, if advanced economics
students are more likely to end up in high-paying jobs than other students, surveying only those students will most
likely lead to higher salary expectations than we would expect from representative students in the population.
Definition 5.15 A stratified random sample is created by splitting the population into defined subpopulations
or strata and drawing a simple random sample from each subpopulation or stratum. Such a sample is called a
proportionate stratified random sample if the sizes of the strata samples are proportional to the true probabilities
of the strata in the population and a disproportionate stratified random sample if they are not.
The use of a proportionate stratified random sample ensures that the obtained sample is representative of the
population with respect to the defined strata.
Example 5.14 (Political polling) Suppose a political poll of 100 individuals is being conducted in a particular area,
where each individual is asked whether they prefer the candidate from political party A or the candidate from political
party B. Within the area, 60% of the population are registered with political party A and 40% are registered with
political party B. A proportionate stratified random sample results from drawing a simple random sample of 60 party
A individuals (from the subpopulation of party A individuals) and a simple random sample of 40 party B individuals
(from the subpopulation of party B individuals). Without imposing the proportionate strata, a simple random sample
of 100 individuals might over-represent party A individuals or party B individuals. For instance, if there were 65 party
A individuals in the simple random sample, we’d intuitively get an over-estimate of the preference for the party A
candidate from the poll (assuming that party A individuals are more likely than party B individuals to support the
party A candidate).
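A minimal sketch of this proportionate stratified draw in R, assuming a hypothetical population data frame with a
party variable:
# hypothetical population: 10,000 registered individuals (60% party A, 40% party B)
population <- data.frame(id = 1:10000,
                         party = rep(c("A", "B"), times = c(6000, 4000)))
# simple random sample within each stratum: 60 from party A, 40 from party B
rows_A <- sample(which(population$party == "A"), 60)
rows_B <- sample(which(population$party == "B"), 40)
stratified_sample <- population[c(rows_A, rows_B), ]
table(stratified_sample$party)
##
##  A  B
## 60 40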
While the use of proportionate stratified random sampling is appealing in certain circumstances, there are also
sometimes reasons for using disproportionate sampling instead. With disproportionate sampling, one or more strata
will be oversampled relative to the true proportion(s) in the population, and one or more strata will be undersampled
relative to the true proportion(s) in the population.
Example 5.15 (Rare disease) In a certain population, suppose the probability that an individual has a particular
genetic disease is 0.05% (1-in-2,000 chance). If a researcher is interested in studying the characteristics of individuals
with the disease or making comparisons of those characteristics to individuals without the disease, she needs to have
a considerable number of individuals with the disease in her sample. Thinking about a simple random sample, even a
sample of size 10,000 would have very few individuals (five, on average) with the disease. Instead, the researcher would
want to oversample the subpopulation or stratum of individuals with the disease. For example, she might construct
a disproportionate stratified random sample with 20% of the individuals having the disease and 80% not having
the disease. Since the subpopulation of individuals with the disease is oversampled, the subpopulation of individuals
without the disease is undersampled.
Example 5.16 (Union and non-union wages) In the population of employed individuals in the United States, the
proportion that are union members is 10%, and it is well-known that union workers earn more, on average, than
non-union workers. If a researcher is interested in estimating the average earnings in the overall population of workers,
it may be sensible to construct a proportionate stratified random sample based upon the true proportions of union
workers (10%) and non-union workers (90%) in the population. On the other hand, if a researcher wants to compare
the average earnings between union workers and non-union workers, it might make more sense to oversample union
workers so that the size of the union-worker and non-union-worker samples are comparable.
Notes
8 Whether method 1 yields a simple random sample depends upon how you think about the population. If the population consists of students
before they arrive on campus, method 1 yields a simple random sample since the assignment of the student ID hasn’t occurred yet, and any student
is equally likely to have a “5” as the last digit. If the population consists of students after they arrive on campus, method 1 won’t pick any students
who don’t have a “5” as the last digit, so in that sense the units of the population are not equally likely to be chosen. This distinction is unimportant
here since, as the discussion has highlighted, there is no sampling bias introduced by method 1.
Exercises
1. For each of the following examples, indicate whether the data are cross-sectional, time-series, or panel data.
(a) An avid runner records the number of miles that she runs every day for 100 straight days.
(b) A random sample of 100 older adults, aged 65 and older, were surveyed in January 2023 about their health
status and medical expenditures.
(c) A financial analyst randomly picks 20 companies that are listed on the New York Stock Exchange and gathers
data on their profits and sales for the year 2022.
(d) A Canadian economist gathers annual data on each of Canada’s 13 political territories, including the population
and the unemployment rate for each territory, for each year between 2000 and 2020.
(e) A random sample of 100 college students is asked whether or not they have received an influenza vaccine in
the last year.
(f) A random sample of 100 college seniors is asked for their semester GPA for each of their first six semesters at
the university.
2. A survey is taken of Economics majors at a particular university. For each of the following variables, indicate
whether the variable is (i) categorical, (ii) discrete but not approximately continuous, (iii) discrete and approximately
continuous, or (iv) continuous.
(a) Favorite economics professor.
(b) Number of economics courses taken prior to taking econometrics (a required course for majors).
(c) Born in the United States or not.
(d) Cumulative GPA prior to the current semester (not rounded).
3. For each of the following variables, indicate whether the variable is (i) categorical, (ii) discrete but not approximately
continuous, (iii) discrete and approximately continuous, or (iv) continuous.
(a) The daily number of deliveries that a major on-line retailer makes to U.S residential addresses.
(b) Whether or not a person has a flu vaccination in a given year.
(c) The time that it takes a personal shopper at a grocery store to complete a customer’s order.
(d) The number of cellphones that an individual has owned in their lifetime.
4. Consider drawing a simple random sample of n = 10 observations from a population consisting of 100 units.
(a) How many possible ways are there to draw the simple random sample?
(b) What is the probability that a given observation from the population is in the simple random sample that is
drawn?
(c) What is the probability that any two given observations from the population are in the simple random sample
that is drawn?
5. A major city has a complete census of all restaurants within its city limits and is interested in the percentage that
have public-health violations. It doesn’t have the resources to conduct inspections at all restaurants, so it must draw
a sample from the full census (population). For each of the following sampling possibilities, explain whether there
would be concern about sample selection bias and explain why.
(a) The city has six different zip codes and conducts inspections at all restaurants in one of the six zip codes.
(b) The city conducts inspections at 15% of the restaurants chosen completely at random.
(c) The city conducts inspections at all restaurants having a street address number ending in 3.
(d) The city conducts inspections at the 30 largest restaurants in the census.
6. For each of the following examples, discuss whether there is a potential for sample selection bias and explain why.
For a given example, there may be multiple reasons for sample selection bias.
(a) A pharmaceutical company wants to test the efficacy of a new medication for treating a disease. The company
is interested in the population of all individuals having the disease, and they recruit participants through
advertisements posted in medical facilities and in online forums.
(b) A firm wants to determine the effectiveness of an employee training program, i.e. how effective the training is
for a representative employee at the firm. The program is voluntary, so the firm measures the effectiveness (the
difference between pre-program productivity and post-program productivity) for employees who enroll in the
program.
(c) The Social Security Administration (SSA) has comprehensive earnings-history data for every United States
citizen. A researcher with access to SSA data is interested in the average earnings of 30-year-old citizens in
2020. She randomly draws 5,000 individuals from the SSA data who were 30 years old in 2020 and gathers
their earnings data for that year.
(d) A magazine reports the average salaries of graduates from law schools. To gather their data, the magazine
conducts an on-line survey of law-school graduates, who are asked to voluntarily share their own salary
information.
(e) An economist would like to analyze the economic impact of recent tax cuts on small businesses. She collects
data from businesses that voluntarily express willingness to respond to a survey.
7. The following table summarizes the political party affiliation (A or B) and age group (“Under 30,” “30-50,” and
“Over 50”) for the population of registered voters in a particular voting district:
                  Party
                 A      B
      Under 30  15%     5%
Age   30-50     25%    15%
      Over 50   20%    20%
For example, 15% of the registered voters are affiliated with party A and under 30 years old.
(a) If a sample of 1,000 voters is stratified on the basis of party affiliation alone, how many voters in the sample
are affiliated with party A versus party B?
(b) If a sample of 1,000 voters is stratified on the basis of both party affiliation and age group, how many strata are
there and how many voters are in each stratum?
(c) A sample of 1,000 voters has exactly 380 voters in the “Over 50” age group. Which one (or more) of the
following are possible:
i. the sample is a simple random sample
ii. the sample is a proportionate stratified random sample, stratified by age group
iii. the sample is a proportionate stratified random sample, stratified by party affiliation
Chapter 5 introduced different types of variables that might be observed in a dataset. This chapter considers actual
dataset examples and introduces common descriptive statistics and visual devices that can be used to summarize
variables. This chapter focuses on descriptive statistics and visuals for univariate data, considering summary measures
of a single variable, and Chapter 7 focuses on descriptive statistics and visuals that summarize the relationship between
two or more variables.
Recall that n denotes the sample size. For a single variable, generically denoted x, the following notation denotes a
sample of observations:
{x1, x2, …, xn},
or, more concisely, {xi} for i = 1, …, n. The term statistic is defined to be any numerical measure that is based upon the sample:
Definition 6.1 A statistic is a function of the observed sample data. For univariate data, a statistic has the form
s(x1 , x2 , …, xn ) for some function s(·).
A descriptive statistic is a statistic whose purpose is to describe data in some way. Descriptive statistics can be used
to describe how likely certain values are, where the “center” of the sample observations is, how “noisy” the sample
observations are, etc. This chapter considers descriptive statistics for categorical data and numerical data separately.
While it’s not possible to cover the full universe of descriptive statistics used in real-world applications, several of the
most prevalent ones are considered.
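For example, the sample mean is a statistic; written as an R function of the sample, it might look like the following:
# a statistic is a function of the observed sample; e.g., the sample mean
s <- function(x) sum(x) / length(x)
s(c(2, 4, 6, 8))
## [1] 5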
• Categorical variables:
– statefips: 51 possible two-character codes (“AL”, “AK”, “AZ”, ..., “WI”, “WV”, “WY”) for the 50 U.S.
states plus Washington, DC
– gender: two values (“Female”, “Male”)
– metro: two values (“Metro”, “Non-metro”), indicating whether individual lives in a metropolitan area or
not
– race: three values (“Black”, “White”, “Other”)
– hispanic: two values (“Hispanic”, “Non-hispanic”)
– marstatus: four values (“Married”, “Divorced”, “Widowed”, “Never married”)
– lfstatus: three values (“Employed”, “Unemployed”, “Not in LF”)
– ottipcomm: two values (“Yes”, “No”), indicating whether earnings include overtime, tips, and/or
commissions; missing if lfstatus is “Unemployed” or “Not in LF”
– hourly: two values (“Hourly”, “Non-hourly”); missing if lfstatus is “Unemployed” or “Not in LF”
– unionstatus: two values (“Union”, “Non-union”); missing if lfstatus is “Unemployed” or “Not in LF”
• Numerical variables:
– age (in years): age of the individual; values provided as integers
– hrslastwk (in hours): hours worked by the individual last week; values provided as integers; missing if
lfstatus is “Unemployed” or “Not in LF”
– unempwks (in weeks): number of weeks that an individual has been unemployed; values provided as
integers; missing if lfstatus is “Employed” or “Not in LF”
– wagehr (in dollars): hourly wage for the individual; values provided to two decimal places (cents); missing
if hourly is “Non-hourly” or if lfstatus is “Employed” or “Not in LF”
– earnwk (in dollars): earnings for the individual last week; values provided as integers; missing if lfstatus is
“Unemployed” or “Not in LF”
– ownchild: number of children in the individual’s household
– educ (in years): highest level of education attained by the individual
Having loaded the cps dataset into a data frame called cps, the following R code uses the functions str and
summary to show the structure of the data frame and to summarize its variables, respectively:
str(cps)
## 'data.frame': 4013 obs. of 17 variables:
## $ statefips : Factor w/ 51 levels "AK","AL","AR",..: 5 1 27 35 41 43 2 6 21 44 ...
## $ age : int 50 34 50 30 40 35 56 42 55 58 ...
## $ hrslastwk : int 40 40 NA 44 NA 30 40 25 40 46 ...
## $ unempwks : int NA NA NA NA NA NA NA NA NA NA ...
## $ wagehr : num 12 NA NA NA NA NA 25 8 NA NA ...
## $ earnwk : num 577 3049 NA 2500 NA ...
## $ ownchild : int 0 1 4 0 0 0 0 0 0 0 ...
## $ educ : num 14 18 16 18 12 12 12 7.5 12 13 ...
## $ gender : Factor w/ 2 levels "Female","Male": 2 1 2 1 1 2 2 1 1 1 ...
## $ metro : Factor w/ 2 levels "Metro","Non-metro": 1 1 1 1 1 1 1 1 1 1 ...
## $ race : Factor w/ 3 levels "Black","Other",..: 1 3 3 3 1 3 3 1 2 3 ...
## $ hispanic : Factor w/ 2 levels "Hispanic","Non-hispanic": 2 2 1 2 2 2 2 2 2 1 ...
## $ marstatus : Factor w/ 4 levels "Divorced","Married",..: 3 2 2 3 3 1 2 4 3 2 ...
## $ lfstatus : Factor w/ 3 levels "Employed","Not in LF",..: 1 1 2 1 2 1 1 1 1 1 ...
## $ ottipcomm : Factor w/ 2 levels "No","Yes": 1 1 NA 1 NA 1 1 1 1 1 ...
## $ hourly : Factor w/ 2 levels "Hourly","Non-hourly": 1 2 NA 2 NA 2 1 1 2 2 ...
## $ unionstatus: Factor w/ 2 levels "Non-union","Union": 1 1 NA 1 NA 1 1 1 1 1 ...
summary(cps)
## statefips age hrslastwk unempwks
## CA : 341 Min. :30.00 Min. : 1.00 Min. : 1.00
## TX : 250 1st Qu.:37.00 1st Qu.:40.00 1st Qu.: 4.00
## FL : 196 Median :45.00 Median :40.00 Median : 8.50
## NY : 162 Mean :45.02 Mean :40.34 Mean : 17.75
## OH : 109 3rd Qu.:53.00 3rd Qu.:42.00 3rd Qu.: 20.00
## GA : 108 Max. :59.00 Max. :99.00 Max. :119.00
## (Other):2847 NA's :1204 NA's :3907
## wagehr earnwk ownchild educ
## Min. : 1.01 Min. : 12.0 Min. :0.0000 Min. : 0.00
## 1st Qu.:12.78 1st Qu.: 520.0 1st Qu.:0.0000 1st Qu.:12.00
## Median :16.41 Median : 770.0 Median :0.0000 Median :12.00
## Mean :18.60 Mean : 971.2 Mean :0.7478 Mean :12.57
## 3rd Qu.:22.00 3rd Qu.:1193.6 3rd Qu.:1.0000 3rd Qu.:14.00
## Max. :90.00 Max. :8779.7 Max. :7.0000 Max. :18.00
## NA's :2174 NA's :1204
## gender metro race hispanic
## Female:2093 Metro :3175 Black: 476 Hispanic : 745
## Male :1920 Non-metro: 838 Other: 349 Non-hispanic:3268
## White:3188
##
##
##
##
## marstatus lfstatus ottipcomm hourly
## Divorced : 707 Employed :2809 No :2341 Hourly :1839
## Married :2377 Not in LF :1098 Yes : 468 Non-hourly: 970
## Never married: 853 Unemployed: 106 NA's:1204 NA's :1204
## Widowed : 76
##
##
##
## unionstatus
## Non-union:2533
## Union : 276
## NA's :1204
##
##
##
##
The str(cps) command shows the structure of the cps data frame, indicating the number of observations and
variables and, for each variable, indicating the variable type and the first few observed values from the data. The
summary(cps) command shows more detailed information about each variable, with category counts shown for
categorical (factor) variables and descriptive statistics shown for numerical variables; for variables with missing
values, the number of missing values is indicated in the row labeled NA’s. For example, the unionstatus variable
has 2,533 observations in the Non-union category, 276 observations in the Union category, and 1,204 with a
missing (NA) value.
We can also get the summary statistics for a single variable by specifying that variable, rather than the whole data
frame, as the argument for the summary function.
summary(cps$earnwk)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 12.0 520.0 770.0 971.2 1193.6 8779.7 1204
Example 6.2 (Monthly stock returns) The sp500 dataset consists of 364 monthly observations (January 1991 through
April 2021) for a set of 266 individual stocks, each of which is part of the S&P 500 stock market index. Each variable
in the dataset corresponds to a single company, with the variable name corresponding to the company’s stock ticker.
For example, the variable names for AT&T and Bank of America are T and BAC, respectively. For any given stock, the
observations constitute a time series with a sample size of n = 364. The data are all numerical, with each observation
representing a monthly return for a given stock. The monthly return for month m is defined as
$$\text{return}_m = \frac{\text{price}_m - \text{price}_{m-1}}{\text{price}_{m-1}},$$

where price_m is the price at the end of month m and price_{m-1} is the price at the end of the previous month m − 1. This variable is continuous with possible values in [−1, ∞), where the −1 corresponds to price_m being 0. The variable values are unitless here. Even though stock prices price_m and price_{m-1} have monetary units (dollars), the formula above indicates that the monetary units in the numerator cancel those in the denominator, leaving no units for return_m.
After loading the sp500 dataset in R, we can use the head function to display the first few observations for T and
BAC, along with the dates:
head(sp500[,c("Date","T","BAC")])
## Date T BAC
## 1 1991-02-01 0.041747550 0.03999942
## 2 1991-03-01 0.044186174 0.18376090
## 3 1991-04-01 -0.048997714 0.07811400
## 4 1991-05-01 -0.017966258 0.14237314
## 5 1991-06-01 0.033815331 -0.15133543
## 6 1991-07-01 0.009346888 -0.01935816
The monthly return for AT&T in January 1991 was approximately 4.17%, and the monthly return for Bank of
America was approximately 4.00%. During the first six months, AT&T and Bank of America both had four positive
monthly returns and two negative monthly returns, though the timing of the negative-return months was different for
the two companies.
In the rest of this chapter, we discuss several different descriptive statistics and data visualization options for different
types of data, including the following:
• categorical data: sample proportion, bar charts
• discrete or continuous numerical data: histograms, measures of location (sample mean, sample median, sample
quantiles), box plots, measures of dispersion (interquartile range, sample variance, sample standard deviation)
For a categorical variable x with C categories, the most basic descriptive statistics are the number and the fraction of observations within each category c ∈ {1, 2, …, C}. These quantities are known as the sample counts and sample proportions, respectively.
Definition 6.2 For a categorical variable x, the sample count associated with category c is the number of observations in category c:

$$\text{sample count for category } c = \sum_{i=1}^{n} 1(x_i = c).$$
In this definition, the function 1(xi = c) has the value 1 when xi = c (the i-th observation is in category c) and the value
0 when xi ≠ c (the i-th observation is not in category c). This function is an example of an indicator function, which is
used elsewhere in the book. More generally, an indicator function 1(E) is equal to 1 if the event E is true and 0 if the
event E is not true.
Definition 6.3 For a categorical variable x, the sample proportion associated with category c is the fraction or percentage of observations in category c:

$$\text{sample proportion for category } c = \frac{\sum_{i=1}^{n} 1(x_i = c)}{n} = \frac{1}{n}\sum_{i=1}^{n} 1(x_i = c).$$
Since the C categories of x are disjoint and exhaustive, every xi is in one and only one category, which immediately
implies the following:
Proposition 6.1. For any sample {x1 , x2 , …, xn } of a categorical variable x, the sum of the sample counts is n, and the
sum of the sample proportions is 1.
Similar to the discussion of categorical variables in Section 5.2.2, the sample proportion of any category c can be
inferred if the sample proportions of the other C – 1 categories are known since the proportions sum to one. In the
case of a binary categorical variable, the sample proportion of one of the two categories completely summarizes the
observed data.
Example 6.3 (Labor force data) Let’s focus on the labor force status (lfstatus) variable from the cps data. The sample
counts for the lfstatus categories were already seen in Example 6.1, as part of the output from the summary(cps)
command. Another method to directly tabulate the sample counts for a categorical variable, which works even when
there are more categories than will fit in the summary output, is to use the table function.
table(cps$lfstatus)
##
## Employed Not in LF Unemployed
## 2809 1098 106
table(cps$lfstatus)/nrow(cps)
##
## Employed Not in LF Unemployed
## 0.69997508 0.27361077 0.02641415
The first table command provides the sample counts. The second command, which divides by the sample size (i.e., the number of rows in the dataset, given by nrow(cps)), provides the sample proportions. The sample counts and sample proportions for the three categories are provided more neatly in the following table:

Category      Sample count   Sample proportion
Employed          2809            0.700
Not in LF         1098            0.274
Unemployed         106            0.026
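The bar charts in Figure 6.1 are produced with the barplot function. A minimal sketch consistent with the description below (the chart titles and the ylim for the proportions chart are assumptions):

# display the two bar charts stacked vertically (2 rows, 1 column)
par(mfrow = c(2,1))
# bar chart of sample counts
barplot(table(cps$lfstatus), ylim=c(0,3000), main="Sample counts")
# bar chart of sample proportions
barplot(table(cps$lfstatus)/nrow(cps), ylim=c(0,1), main="Sample proportions")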
For the first barplot command, the first argument is table(cps$lfstatus) (the table of sample counts),
the second argument (ylim=c(0,3000)) specifies the lower and upper limit to be used for the y-axis, and the third
argument (main) provides a title for the chart. The second barplot command is similar, except that the table of
sample proportions is the first argument and the values of the ylim and main are different. The ylim argument is
optional, and R uses its default upper/lower limits for the y-axis if it is omitted. The formatting command par(mfrow
= c(2,1)) is used before the bar charts are produced. This command provides a convenient way to display multiple
graphs simultaneously. The c(2,1) can be changed to accommodate a different number of rows and columns to be
displayed. Here, there are 2 rows and 1 column, leading to the first bar chart being displayed on top of the second bar
chart. If there were six different graphs to be displayed in three rows of two graphs each, the appropriate command would be par(mfrow = c(3,2)). The graphs are displayed from left to right, starting with the first row and continuing
left-to-right for subsequent rows.
Figure 6.1
Bar charts of labor-force status (CPS data)
A histogram partitions the range of a numerical variable into intervals (“bins”) and draws a rectangle over each bin, where the height of the rectangle indicates how many of the variable’s values are within that bin. To construct the histogram,
the bins need to be specified, which involves specifying the bin width and the starting/ending values for the bins. To
illustrate how this works, we start with a simple example involving a discrete numerical variable.
Example 6.5 (Labor force data) Consider the age variable from the cps data, which is a discrete numerical variable
that can take on the integer values 30, 31, …, 59. Figure 6.2 shows four different histograms for the age variable.
The top two histograms look identical, except for the y-axis, with the one on the left having “frequency” (counts) of
observations and the one on the right having the “density” of observations. Both of these histograms have 30 bins,
each with a bin width of 1 year. The bins themselves are (29.5, 30.5], (30.5, 31.5], (31.5, 32.5], and so on through
(58.5, 59.5]. Since age is only integer-valued, the (29.5, 30.5] bin contains observations with age = 30, the (30.5, 31.5]
bin contains observations with age = 31, and so on through the (58.5, 59.5] bin which contains observations with
age = 59. Looking at these two histograms, the two most observed age values are age = 56 and age = 59, and the two
least observed age values are age = 38 and age = 44. For the “frequency” histogram on the top-left, the height of each
rectangle is the count of observations within the corresponding bin or, equivalently, the count of observations with
that specific age value. For the “density” histogram on the top-right, the height of each rectangle turns out to be the
proportion of observations in the associated bin or, equivalently, the proportion of observations with that specific age
value; as we’ll see, these heights are proportions for this histogram since the bin widths are exactly equal to one.
The bottom two histograms use a bin width of 2 years, with the 15 bins defined as (29.5, 31.5], (31.5, 33.5],
(33.5, 35.5], and so on through (57.5, 59.5]. The first bin contains the observations with age = 30 or age = 31, the
second bin contains the observations with age = 32 or age = 33, and so on through the last bin which contains the
observations with age = 58 or age = 59. For the bottom-left “frequency” histogram, the height of each rectangle still
corresponds to a count of the observations within the associated bin. Since the wider 2-year bins naturally contain
more observations than the 1-year bins, the scale of the y-axis is considerably larger. For instance, the height of the
first rectangle for the 2-year bin width is equal to the sum of the first two rectangles for the 1-year bin width histogram
above it. For the bottom-right “density” histogram, the scale of the y-axis is quite similar to the “density” histogram
with 1-year bins (top-right). The “density” values on the y-axis make the area of each rectangle equal to the proportion or fraction of observations within the associated bin. For example, the height (“density”) of the first bin is
approximately 0.036, so that the area of the rectangle is approximately (0.036)(2) = 0.072, with roughly 7.2% of the
observations having age = 30 or age = 31.
The R code to create Figure 6.2 uses the function hist:
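A sketch consistent with the description below (the 2×2 panel layout and blank titles are assumptions):

par(mfrow = c(2,2))
# top row: 1-year bins, counts (left) and densities (right)
hist(cps$age, breaks=seq(29.5,59.5,1), main="", xlab="Age")
hist(cps$age, breaks=seq(29.5,59.5,1), freq=FALSE, main="", xlab="Age")
# bottom row: 2-year bins, counts (left) and densities (right)
hist(cps$age, breaks=seq(29.5,59.5,2), main="", xlab="Age")
hist(cps$age, breaks=seq(29.5,59.5,2), freq=FALSE, main="", xlab="Age")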
The first argument for the hist function is the variable of interest (here cps$age). The optional breaks
argument allows the user to directly specify the starting/ending values for the bins. For the histograms in the top row,
the one-year bins are specified using the vector seq(29.5,59.5,1); for the histograms in the bottom row, the
two-year bins are specified using the vector seq(29.5,59.5,2). The optional argument freq indicates whether
the y-axis of the histogram should display counts (when freq is TRUE, which is the default) or densities (when freq
is FALSE). The main argument provides a title for the histogram, and the xlab argument specifies the text to be
shown on the x-axis (here specified as Age rather than the default value, which is cps$age).
The properties of the histograms described in Example 6.5 are stated in the following general propositions:
Proposition 6.2. For a frequency or count histogram, the height of any rectangle is the number of observations within
the associated bin. The sum of the heights of all of the rectangles is equal to the sample size n.
Proposition 6.3. For a density histogram, the area of any rectangle (height times bin width) is the proportion or
fraction of observations within the associated bin. The sum of the areas of all of the rectangles is equal to 1 or 100%.
From this point forward, we focus on density histograms since they have a direct relationship with the probability
distributions introduced in Chapters 8 and 10.
While Example 6.5 considers a discrete numerical variable, the next example considers a continuous numerical
variable and discusses the choice of the bin width or the number of bins in more detail.
Example 6.6 (Labor force data) Consider the weekly earnings (earnwk) variable from the cps data, which is a
continuous, or at least an approximately continuous, numerical variable. The earnwk variable has non-missing values
for the 2809 employed individuals in the sample. Figure 6.3 shows six different density histograms for earnwk, with
the number of bins specified as 10, 20, 50, 100, 200, and 500. These six histograms illustrate an inherent tradeoff
when choosing the number of bins. If the number of bins is chosen to be too small (or, equivalently, the bin width to be
too large), the histogram will be less “noisy” but might miss key aspects of the shape of the distribution of the data.
The top-left histogram with 10 bins completely misses the “hump” in the distribution of the observed weekly earnings.
This hump, located just below $1000 per week, first becomes evident in the histogram with 20 bins and even more so
in the histograms with 50 bins and 100 bins. As the number of bins gets even larger, however, the histograms display
Figure 6.2
Histograms of age (CPS data)
less smoothness and more noise. The two histograms with 200 bins and 500 bins have rectangle heights that jump up
and down. This lack of smoothness can be exacerbated by variables where data may be bunched at round numbers,
as is the case for the earnwk variable where round numbers like $1,000 or $2,000 are more likely to be reported than,
say, $1,019 or $1,992. When the number of bins is very large (or the bin width very small), the histogram is unable to
smooth out this type of bunching.
Here is the R code used to create Figure 6.3:
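A sketch consistent with the description that follows (the 3×2 layout and blank titles are assumptions):

# weekly earnings are observed only for employed individuals
cpsemployed <- cps[cps$lfstatus=="Employed",]
nrow(cpsemployed)
par(mfrow = c(3,2))
hist(cpsemployed$earnwk, breaks=10, freq=FALSE, main="", xlab="Weekly earnings")
hist(cpsemployed$earnwk, breaks=20, freq=FALSE, main="", xlab="Weekly earnings")
hist(cpsemployed$earnwk, breaks=50, freq=FALSE, main="", xlab="Weekly earnings")
hist(cpsemployed$earnwk, breaks=100, freq=FALSE, main="", xlab="Weekly earnings")
hist(cpsemployed$earnwk, breaks=200, freq=FALSE, main="", xlab="Weekly earnings")
hist(cpsemployed$earnwk, breaks=500, freq=FALSE, main="", xlab="Weekly earnings")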
Since weekly earnings are observed only for employed individuals, the relevant rows of the data frame cps are first
selected. Specifically, cpsemployed is created as a new data frame which consists of the rows of cps for which
cps$lfstatus=="Employed" is TRUE. The nrow command confirms that the number of employed individuals
is 2809. The six hist commands each use the breaks argument, but here a single number is provided for breaks, which specifies the number of bins for the histogram. (When breaks is a single number, R treats it as a suggestion and may adjust the break points slightly.) This use of breaks is in contrast to Example 6.5, where a vector was provided for breaks, corresponding to the starting/ending values for the bins rather than the number of bins.
How should the number of bins (equivalently, the bin width) be chosen? A popular data-driven choice is the Freedman-Diaconis rule, which sets

$$\text{bin width} = \frac{2\,\text{IQR}_x}{n^{1/3}},$$

where IQRx is the interquartile range introduced in Section 6.5.1, and then uses enough bins of that width to cover the range from xmin to xmax, where xmax and xmin are the maximum and minimum x values in the sample, respectively. For the earnwk variable in Example 6.6, the Freedman-Diaconis rule yields a choice of 92 bins, based upon a calculated bin width of 95.48, since the maximum and minimum weekly earnings are 8779.73 and 12. Of the histograms displayed in Figure 6.3, this choice yields a histogram
similar to the one with 100 bins. Figure 6.4 shows the earnwk histogram with the 92 bins from the Freedman-Diaconis
rule. In addition, the figure includes a density curve that overlays the histogram. Most statistical packages, including
R, offer the option to draw this type of smooth density curve either on its own or along with a histogram. In Figure 6.4,
the density curve roughly passes through the tops of the histogram rectangles, but it does so in a smooth fashion rather
Figure 6.3
Histograms of weekly earnings (CPS data)
than jumping from flat step to flat step as the histogram does. While a formal discussion of density curves is beyond
the scope of this book, a density curve (i) can be quite useful as a descriptive visual for a numerical variable and (ii) has
a direct relationship to the probability distributions introduced in Chapters 8 and 10.
Here is the R code used to create Figure 6.4:
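A sketch consistent with the description below and with the code shown later for Figure 6.5:

hist(cpsemployed$earnwk, breaks=92, freq=FALSE, main="", xlab="Weekly earnings")
lines(density(cpsemployed$earnwk), lwd=2)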
The hist command is similar to those seen in Example 6.6, here with breaks=92 specified. The lines
command, with density(cpsemployed$earnwk) specified as its first argument, draws the density curve on
the same graph as the original histogram. The optional “line width” argument lwd=2 specifies a slightly thicker line
for the density curve, as compared to the default value of lwd=1.
Figure 6.4
Histogram of weekly earnings with Freedman-Diaconis bin width and density curve
Definition 6.4 The sample mean or sample average of observations x1 , x2 , …, xn , denoted x̄, is

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$
The sample mean depends on the value of each and every observation in the sample. As such, the sample mean can
be sensitive to unusually small or unusually large values of x, sometimes known as outliers.
The sample median is a descriptive statistic used to describe the center of the sample. The sample median depends
only on the relative order of the observations and, specifically, the observation value(s) directly “in the middle” of the
sample. The sample median is not affected by the values of outliers.
Definition 6.5 The sample median of observations x1 , x2 , …, xn , denoted x̃1/2 or x̃0.5 , is the value for which half of
the observations are below (≤) x̃1/2 and half of the observations are above (≥) x̃1/2 .
Unlike the sample mean, the sample median does not have a closed-form formula.
To determine the sample median x̃1/2 , the following procedure can be used:
• Sort the observations from lowest to highest. There may be some repeated values (“ties”).
• If the sample size n is even, x̃1/2 is the average of the (n/2)-th and (n/2 + 1)-th values in the sorted sample.
• If the sample size n is odd, x̃1/2 is the ((n + 1)/2)-th value in the sorted sample.
Both the sample mean and the sample median have the same units as the underlying x variable. For instance, if the x
variable is measured in dollars, the sample mean x̄ and sample median x̃1/2 are also measured in dollars.
Example 6.9 (Labor force data) Examples 6.5 and 6.6 considered histograms for the age and earnwk variables
from the cps data. The sample average and the sample median can be calculated in R using the mean and median
functions, respectively.
mean(cps$age)
## [1] 45.0167
median(cps$age)
## [1] 45
mean(cps$earnwk, na.rm = TRUE)
## [1] 971.1785
median(cps$earnwk, na.rm = TRUE)
## [1] 770
Since earnwk has missing values for non-employed individuals, we specify the optional argument na.rm =
TRUE for the mean and median functions to ignore the missing values. Alternatively, the cpsemployed data
frame from Example 6.6 could be used; for instance, the command mean(cpsemployed$earnwk) would give the
same result as mean(cps$earnwk, na.rm = TRUE).
For the age variable, the sample mean (45.02 years) and the sample median (45 years) are quite close to each other,
which is expected from the fairly symmetric histograms in Example 6.5. For the earnwk variable, it’s a very different
story, with the sample mean of weekly earnings ($971.18) much larger than the sample median ($770). Figure 6.5
shows the same histogram as Figure 6.4, but now with the sample mean and sample median indicated on the graph.
This histogram exhibits a long right tail, with some very large weekly earnings values observed in the right tail. There
is no long left tail in the histogram since earnwk must be positive. The right-tail values cause the sample mean to be
larger than the sample median. Since the sample mean depends on all observations, the large weekly earnings values
in the right tail effectively pull the sample mean to the right. On the other hand, the sample median does not increase
due to the very large right-tail values. Even if all the weekly earnings values in the right tail were instead equal to
2000, the sample median would be unchanged, as there would still be 50% of observations below 770 and 50% of
observations above 770.
Figure 6.5
Right skewness of the distribution of weekly earnings
# histogram of weekly earnings with 92 bins (Freedman-Diaconis), with estimated density overlaid
hist(cpsemployed$earnwk, breaks=92, freq=FALSE, main="", xlab="Weekly earnings")
lines(density(cpsemployed$earnwk), lwd=2)
abline(v=mean(cpsemployed$earnwk), lwd=2, lty=3)
abline(v=median(cpsemployed$earnwk), lwd=2, lty=2)
legend("topright", legend=c("Sample median","Sample mean"), lty=c(2,3), lwd=c(2,2))
The hist and lines commands are identical to those used for Figure 6.4. The two abline commands add
vertical lines, due to the inclusion of the optional argument v, at the sample mean and the sample median of earnwk.
The vertical lines for the sample mean and the sample median are dotted and dashed, respectively. Finally, the legend
command shows how a legend can be added to a graph. The first argument indicates where the legend appears, and the
remaining arguments specify the legend text (legend), line type (lty), and line width (lwd). Refer to the legend
documentation in R for more details on the available options.
When a histogram is characterized by a long right tail, like weekly earnings in Example 6.9, the variable is said to
have a right-skewed distribution.
Definition 6.6 A variable has a right-skewed distribution if there is a longer tail on the right side of its distribution
than on the left side of its distribution, as exhibited by a histogram or density curve.
A variable with a right-skewed sample distribution usually has its sample mean greater than its sample median. Many
economic variables naturally have right-skewed distributions, including earnings or wealth for a sample of individuals,
sales or profits for a sample of firms, gross domestic product (GDP) for a sample of countries, etc. While left-skewed
distributions are less common in economics, we provide a formal definition in the interest of completeness.
Definition 6.7 A variable has a left-skewed distribution if there is a longer tail on the left side of its distribution than
on the right side of its distribution, as exhibited by a histogram or density curve.
A variable with a left-skewed sample distribution usually has its sample mean less than its sample median.
Unlike the weekly earnings variable in Example 6.9, the age variable exhibits neither a right-skewed distribution
nor a left-skewed distribution. In fact, the histogram for the age variable looks approximately flat on both sides of the
center of its distribution. This type of distribution is said to be an approximately symmetric distribution.
Definition 6.8 A variable has an approximately symmetric distribution if the shape of the distribution to the left of
the sample median is approximately a mirror image of the shape of the distribution to the right of the sample median.
If a variable does not have an approximately symmetric distribution, the variable is said to have an asymmetric
distribution.
A variable with an approximately symmetric sample distribution has a sample mean that is close to the sample
median, meaning that either provides a good measure of the center of the sample distribution. A sample with either a
right-skewed distribution or a left-skewed distribution must be an asymmetric distribution, as the presence of a longer
tail on either the right or the left means that the distribution cannot possibly appear to have mirror images on the two
sides of the sample median.
Example 6.10 (Monthly stock returns) Example 6.2 introduced the monthly stock return dataset sp500. Focusing
again on the monthly returns for AT&T (T) and Bank of America (BAC), the sample means and sample medians can
be calculated with the mean and median functions in R. Alternatively, the sample mean and sample median are part
of the output provided by the summary function applied to a numerical variable.
mean(sp500$T)
## [1] 0.008239883
median(sp500$T)
## [1] 0.007072945
mean(sp500$BAC)
## [1] 0.01294691
median(sp500$BAC)
## [1] 0.014761
summary(sp500[,c("T","BAC")])
## T BAC
## Min. :-0.191214 Min. :-0.52203
## 1st Qu.:-0.022352 1st Qu.:-0.03675
## Median : 0.007073 Median : 0.01476
## Mean : 0.008240 Mean : 0.01295
## 3rd Qu.: 0.047903 3rd Qu.: 0.06571
## Max. : 0.276617 Max. : 0.72658
The mean monthly return for AT&T is 0.824%, and the median monthly return is 0.707%. Both measures are larger for Bank of America, whose monthly returns have a sample mean of 1.295% and a sample median of 1.476%.
Figure 6.6 shows histograms and density curves for the T and BAC variables. Both distributions exhibit a “bell curve”
shape and look approximately symmetric. That said, the histograms do not look identical to each other, as the BAC
distribution has many more monthly returns that are larger in magnitude (i.e., either large positive returns or large
negative returns). This feature is discussed further when the dispersion of distributions is introduced in Section 6.5.
Figure 6.6
Histograms of monthly stock returns
The use of the hist and lines functions to draw the histograms and density curves is similar to the
examples above. For both histograms, the vector for the breaks argument is specified to be the same
(seq(-0.8,0.8,0.02)) for ease of comparison.
To generalize the idea of a sample median to other parts of the sample distribution, sample quantiles are defined as
follows:
Definition 6.9 For any q where 0 < q < 1, the sample quantile x̃q is a value for which (100q)% of the observations
are below (≤) x̃q and (100 – 100q)% of the observations are above (≥) x̃q . The sample median x̃1/2 is a special case
of the sample quantile where q = 1/2. Other commonly used sample quantiles are sample quartiles, corresponding to
q ∈ {0.25, 0.50, 0.75}, and sample deciles, corresponding to q ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}.
As an example, for q = 0.9, the sample quantile x̃0.9 is the value for which 90% of the observations are less than or
equal to x̃0.9 and 10% of the observations are greater than or equal to x̃0.9 . We call x̃0.9 the sample 90% quantile. For
the sample quartiles, the 25% quantile is the first quartile or lower quartile, and the 75% quantile is the third quartile
or upper quartile. For the sample deciles, the 10% quantile is the first decile, the 20% quantile is the second decile,
and so on. Like the sample mean and sample median, any sample quantile x̃q has the same units as the underlying x
variable.
Most statistical packages have functions to calculate sample quantiles, and in practice we rely upon the statistical
package to calculate these quantiles. There are several alternative algorithms to calculate sample quantiles, meaning
that one statistical package might give a slightly different answer than another statistical package. In reasonably sized
samples, these small differences will not be practically meaningful.
Here is a procedure that generalizes the procedure used previously for calculating a sample median and can be used
to manually compute the sample quantile for any value q between 0 and 1:
• Sort the observations from lowest to highest. There may be some repeated values (“ties”), which is fine.
• If nq is an integer, then x̃q is the average of the nq-th value and the (nq + 1)-th value in the sorted sample.
• If nq is not an integer, then x̃q is the ⌈nq⌉-th value in the sorted sample, where ⌈nq⌉ denotes the smallest integer larger than nq.
To see how this works, consider a sample with sample size n = 50. First, the 50 observations are sorted in ascending
order, from lowest to highest. For the 10% quantile (q = 0.1), nq = 5 is an integer, so x̃0.1 is the average of the 5-th and
6-th values of the sorted sample. For the 25% quantile (q = 0.25), nq = 12.5 is not an integer, so x̃0.25 is equal to the
13-th value of the sorted sample since ⌈nq⌉ = ⌈12.5⌉ = 13.
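For concreteness, here is an illustrative implementation of this procedure (manual_quantile is a hypothetical helper, not part of base R, and the integer check ignores floating-point edge cases):

manual_quantile <- function(x, q) {
  xs <- sort(x)              # sort the observations from lowest to highest
  n <- length(xs)
  nq <- n*q
  if (nq == floor(nq)) {     # nq is an integer: average the nq-th and (nq+1)-th sorted values
    (xs[nq] + xs[nq+1])/2
  } else {                   # nq is not an integer: take the ceiling(nq)-th sorted value
    xs[ceiling(nq)]
  }
}
manual_quantile(c(4,3,8,12,0,10,5), 0.5)   # returns 5, the middle value of the sorted sample

Because of the alternative algorithms mentioned above, this function will not always exactly match the values returned by a statistical package.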
The R function quantile calculates sample quantiles:
• quantile(x, probs = ...): Returns sample quantiles of the vector x, where the quantiles returned are specified by the probs argument. For example, quantile(x, probs = c(0.25,0.75)) returns the sample 25% and 75% quantiles.
Example 6.11 (Labor force data) Sample quantiles provide a more complete description of the distribution of weekly earnings (earnwk) from the cps data. The sample quartiles and sample deciles of earnwk can be computed with the quantile function:
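A sketch of the commands, using the cpsemployed data frame from Example 6.6:

quantile(cpsemployed$earnwk, probs = c(0.25,0.5,0.75))   # sample quartiles
quantile(cpsemployed$earnwk, probs = seq(0.1,0.9,0.1))   # sample deciles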
Interpreting the values for the sample quantiles is straightforward. For instance, for the sample 70% quantile (x̃0.7 =
1080), approximately 70% of the sample observations are below 1080 and approximately 30% are above 1080.
Figure 6.7
Sample quantiles of the weekly earnings distribution
In Figure 6.7, five different quantiles (q = 0.1, 0.25, 0.5, 0.75, 0.9) are shown on the histogram. This figure helps to
visualize where the sample quantiles lie along the distribution. Due to the right skewness of earnwk, the 75% and 90%
quantile values are pulled to the right. The distance between the 75% quantile and the sample median (50% quantile)
is larger than the distance between the 25% quantile and the sample median, and the distance between the 90%
quantile and the sample median is much larger than the distance between the 10% quantile and the sample median.
As this example illustrates, the skewness of a sample distribution has implications for the sample quantile values.
For a right-skewed distribution, the distance between the 75% quantile and the sample median (x̃0.75 – x̃0.5 ) would be
expected to be larger than the distance between the 25% quantile and the sample median (x̃0.5 – x̃0.25 ) and similarly
for higher quantiles like the 90% quantile (x̃0.9 – x̃0.5 larger than x̃0.5 – x̃0.1 ) and the 95% quantile (x̃0.95 – x̃0.5 larger
than x̃0.5 – x̃0.05 ). In contrast, for a sample distribution that is approximately symmetric, the distance between the 75%
quantile and the sample median would be expected to be similar to the distance between the 25% quantile and the
sample median and similarly for higher quantiles like the 90% quantile (x̃0.9 – x̃0.5 similar to x̃0.5 – x̃0.1 ) and the 95%
quantile (x̃0.95 – x̃0.5 similar to x̃0.5 – x̃0.05 ). As an example, the age variable from the labor-force data is approximately
symmetric and has a sample median of 45. The 25% and 75% sample quantiles are 37 and 53, respectively, which are
equidistant (8 years) from the sample median. The 10% and 90% sample quantiles are 32 and 57, respectively, which
are nearly equidistant (13 years and 12 years, respectively) from the sample median.
Figure 6.8
Location and dispersion of variable distributions
Let’s consider some simple examples of how variables can differ in terms of their location and/or dispersion. In the
four graphs of Figure 6.8, hypothetical sample distributions for two different variables are depicted as a solid bell-
shaped density curve and a dotted bell-shaped density curve. In the top-left graph, the two variables have the same
central location (sample median), but the dotted distribution has longer (and thicker) left and right tails, exhibiting
more dispersion than the solid distribution. The solid distribution has more observations closer to the center and fewer
observations in the tails. In the top-right graph, the two variables have the same dispersion since their shapes are
identical, but their locations are different, with the solid distribution having a higher sample median than the dotted
distribution. In the bottom-left graph, the solid distribution has a higher sample median and less dispersion than the
dotted distribution. And, finally, in the bottom-right graph, the solid distribution has a lower sample median and less
dispersion than the dotted distribution.
While histograms and density curves provide a way to visually compare the dispersion of different variables, it
is also important to have numerical descriptive statistics to characterize the dispersion of variables. Section 6.5.1
introduces the interquartile range and a descriptive visual, known as a box plot, that is based in part on the interquartile
range. Section 6.5.2 introduces the sample standard deviation and sample variance descriptive statistics.
(Recall that the Freedman-Diaconis bin width introduced earlier is proportional to IQRx . Therefore, for the same sample size n, if one variable has an IQR that is twice as large as another variable, that
variable would have a bin width that is twice as large as the other variable. The relationship between the bin width and
the IQR seems sensible, as it ensures that different variables have comparable numbers of observations within their
histogram bins.)
Example 6.12 (Labor force data) For the weekly earnings (earnwk) variable from the cps data, the first quartile and
third quartile of the sample are 520 and 1194, respectively, so approximately half of the sample has earnwk values
between 520 and 1194. The IQR for earnwk is 1194 – 520 = 674 dollars. For the age variable, the IQR is 53 – 37 = 16
years. It does not make sense to directly compare the IQR values for earnwk and age since they are in different
units. That is, it would not be appropriate to say that the earnwk variable exhibits more dispersion than age. In other
situations, when two variables have the same units, it can make sense to directly compare the IQR values.
The IQR can be calculated in R directly using the IQR function or, alternatively, as the difference between the 75%
and 25% sample quantiles.
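A sketch of both approaches, using the employed subsample:

IQR(cpsemployed$earnwk)                                    # direct calculation
diff(quantile(cpsemployed$earnwk, probs = c(0.25,0.75)))   # difference of sample quantiles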
A useful descriptive visual based upon the IQR is a box plot. While there are several variants of the box plot, two
alternative versions of the box plot are considered here:
• Box plot with whiskers at minimum and maximum: The “box” extends from the sample 25% quantile (first quartile)
to the sample 75% quantile (third quartile), with the sample median indicated by a line within the box. The
“whiskers” are indicated by lines at the minimum value and the maximum value in the sample.
• Box plot with whiskers and outliers: The “box” extends from the sample 25% quantile (first quartile) to the sample
75% quantile (third quartile), with the sample median indicated by a line within the box. The “upper whisker” is
indicated by a line at the minimum of the following two values: xmax and x̃0.75 + 1.5IQRx . The “lower whisker” is
indicated by a line at the maximum of the following two values: xmin and x̃0.25 – 1.5IQRx . The “outliers,” observations
that are either above the upper whisker or below the lower whisker, are indicated by dots or circles.
The second version (box plot with whiskers and outliers) is usually preferred by practitioners since the first version, like
the range descriptive statistic, is too sensitive to the minimum and maximum values. The best way to fully understand
how these box plots are constructed is to consider an example.
Example 6.13 (Labor force data) For the weekly earnings (earnwk) variable from the cps data, Figure 6.9 shows the
two versions of the box plot described above. The one on the left is the box plot with whiskers and outliers, and the one
on the right is the box plot with whiskers at minimum and maximum. Both box plots have the same “box,” extending
from the first quartile (520) to the third quartile (1194) with a line indicating the sample median (770). The height of
each box is the IQR value of 674. The box plot on the right has the lower whisker at the minimum value (12) and the
upper whisker at the maximum value (8780). For the box plot on the left, the lower whisker is indicated by a line at
max(xmin , x̃0.25 – 1.5IQRx ) = max(12, 520 – (1.5)(674)) = max(12, –491) = 12,
and the upper whisker is indicated by a line at
min(xmax , x̃0.75 + 1.5IQRx ) = min(8780, 1194 + (1.5)(674)) = min(8780, 2205) = 2205.
Finally, the box plot on the left has outliers represented by circles. In this case, all of the outliers are above the upper
whisker, as there can be no observations below the lower whisker, which is located at the minimum value of the sample.
This box plot offers another visual confirmation of the right skewness of the earnwk variable. The distance from the
sample median to the third quartile is larger than the distance from the first quartile to the sample median, and a long
right tail of outliers appears above the upper whisker with no such left tail of outliers below the lower whisker.
The R code to create Figure 6.9 uses the function boxplot:
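A sketch consistent with the description below (the side-by-side layout and axis label are assumptions):

par(mfrow = c(1,2))
boxplot(cpsemployed$earnwk, ylab="Weekly earnings")            # whiskers and outliers (default)
boxplot(cpsemployed$earnwk, range=0, ylab="Weekly earnings")   # whiskers at minimum and maximum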
Figure 6.9
Box plots of weekly earnings (CPS data)
The default for the boxplot function is to have whiskers and outliers displayed. The first boxplot command
creates this box plot for the cpsemployed$earnwk variable. The second boxplot command creates a box plot
with whiskers at minimum and maximum by specifying the optional argument range=0.
Example 6.14 (Monthly stock returns) As in Example 6.10, we focus on the monthly returns for AT&T (T) and Bank
of America (BAC) from the sp500 dataset. Figure 6.10 shows the box plots, with whiskers and outliers, for T and
BAC, drawn with the same y-axis (extending from –0.6 to 0.8) for ease of comparison. Here is the R code to create
Figure 6.10:
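A sketch, with the titles and axis labels matching those shown in Figure 6.10 (the side-by-side layout is an assumption):

par(mfrow = c(1,2))
boxplot(sp500$T, ylim=c(-0.6,0.8), ylab="Monthly return (T)", main="Box plot (whiskers and outliers)")
boxplot(sp500$BAC, ylim=c(-0.6,0.8), ylab="Monthly return (BAC)", main="Box plot (whiskers and outliers)")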
Figure 6.10
Box plots of monthly stock returns
Neither box plot shows strong evidence of right or left skewness, with the first and third quartiles and the lower and upper whiskers fairly equidistant from the sample medians. In this example, a direct comparison of the distributions of T
and BAC is possible since they are both unitless. As already seen in the histograms from Example 6.10, the distribution
of BAC (Bank of America’s monthly stock returns) exhibits more dispersion than the distribution of T (AT&T’s monthly
stock returns). The box extends farther in both directions for BAC as compared to T, corresponding to a larger IQR
value (0.1025) for BAC than the IQR value (0.0703) for T. BAC also has more extreme outliers in both directions
than T does. In fact, BAC has six observed returns greater than 0.3 in magnitude (either below –0.3 or above +0.3),
whereas T has no such observed returns greater than 0.3 in magnitude.
Definition 6.12 The deviation from mean for the i-th observation of the x variable is xi – x̄.
Example 6.15 Suppose n = 7, and the sample for x is {4, 3, 8, 12, 0, 10, 5}. The sample mean is x̄ = 6. The deviations from mean for each of the observations are as follows:

i        1    2    3    4    5    6    7
xi       4    3    8   12    0   10    5
xi − x̄  −2   −3    2    6   −6    4   −1
The sum of the deviations from mean in Example 6.15 is exactly zero, which is a general result for any sample:

Proposition 6.4. The sum of the deviations from mean, $\sum_{i=1}^{n}(x_i - \bar{x})$, is equal to zero. The average of the deviations from mean, $\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})$, is also equal to zero.
Definition 6.13 The sample mean absolute deviation of observations x1 , x2 , …, xn , denoted MADx , is

$$\text{MAD}_x = \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|.$$
The MADx descriptive statistic is interpreted as the average distance that sample observations are from the sample
mean. MADx is always non-negative, and it’s strictly positive unless all xi values are the same. The units of MADx are
the same as the units of x, making it convenient for interpretation.
Example 6.16 Continuing Example 6.15, the absolute deviation values are added to the table:

i         1    2    3    4    5    6    7
xi        4    3    8   12    0   10    5
xi − x̄   −2   −3    2    6   −6    4   −1
|xi − x̄|  2    3    2    6    6    4    1

Then, MADx = (1/7)(2 + 3 + 2 + 6 + 6 + 4 + 1) = 24/7 ≈ 3.43. The average distance that an observation is from the sample mean is 24/7.
Example 6.17 (Monthly stock returns) The MADx values for the AT&T monthly stock returns (T) and the Bank of
America monthly stock returns (BAC) can be calculated in R:
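Base R's mad function computes a different statistic (a scaled median absolute deviation), so a direct calculation from Definition 6.13 is sketched here; the command for BAC, whose output follows, is shown, and the T value is computed analogously:

mean(abs(sp500$BAC - mean(sp500$BAC)))   # average absolute deviation from the sample mean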
## [1] 0.07205011
Since the two variables are unitless, the MADx statistics are also unitless and can be directly compared. On average,
the BAC monthly returns are farther from their sample mean than the T returns. The average distance of the AT&T
monthly returns to their sample mean is 0.04709 or 4.709%, and the average distance of the Bank of America monthly
returns to their sample mean is 0.07205 or 7.205%. The sample dispersion is much larger for BAC, with its MADx
value 53% larger than the MADx value for T.
An alternative dispersion measure based upon the deviations from mean is the sample variance, where the squared distance (xi − x̄)² is used instead of the absolute distance |xi − x̄|:

Definition 6.14 The sample variance of observations x1 , x2 , …, xn , denoted s²x , is

$$s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2.$$

Definition 6.15 The sample standard deviation of observations x1 , x2 , …, xn , denoted sx , is the square root of the sample variance, $s_x = \sqrt{s_x^2}$.
While the sample standard deviation sx is in the units of the original x variable, it does not have a simple
interpretation in the same way that MADx does. Recall that the MADx can be interpreted as the average distance
of sample observations from their mean. Unfortunately, due to the presence of the square root in the definition of sx ,
the sample standard deviation is not an average of some interesting underlying quantity. As seen later in the book,
the meaning of the sample standard deviation depends upon the specific underlying distribution of the variable; for
instance, the sample standard deviation has a particularly interesting interpretation in the case of a variable that has
a “normal distribution.” For now, the sample standard deviation should be thought of as an alternative dispersion
measure.
Example 6.18 Continuing Example 6.16, the squared deviation values are added to the table:

i          1    2    3    4    5    6    7
xi         4    3    8   12    0   10    5
xi − x̄    −2   −3    2    6   −6    4   −1
|xi − x̄|   2    3    2    6    6    4    1
(xi − x̄)²  4    9    4   36   36   16    1
The sample variance is

$$s_x^2 = \frac{1}{7-1}\left(4 + 9 + 4 + 36 + 36 + 16 + 1\right) = \frac{106}{6} = \frac{53}{3},$$

and the sample standard deviation is $s_x = \sqrt{s_x^2} = \sqrt{53/3} \approx 4.20$.
Example 6.19 (Monthly stock returns) Continuing Example 6.17, the sample variances and sample standard
deviations for the AT&T monthly stock returns (T) and the Bank of America monthly stock returns (BAC) can be
calculated in R using the var and sd functions, respectively:
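A sketch of the commands:

var(sp500$T); sd(sp500$T)       # sample variance and standard deviation for AT&T
var(sp500$BAC); sd(sp500$BAC)   # sample variance and standard deviation for Bank of America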
As with MADx , the sample standard deviation sx indicates that BAC has more dispersion than T. The standard
deviation of BAC (0.10530) is approximately 67% larger than the standard deviation of T (0.06307).
Example 6.20 (Union versus non-union wages) Suppose we are interested in a comparison of weekly earnings for
union workers versus non-union workers. Using the cps data, we construct two subsamples based upon the unionstatus
variable, one which consists of the 276 union workers and one which consists of the 2,533 non-union workers. Focusing
on weekly earnings (x = earnwk), the following table provides descriptive statistics for the union and non-union
subsamples:
Sample               n      x̄       MADx      s²x       sx
Union workers         276  1197.7   532.4   518378.8   720.0
Non-union workers    2533   946.5   488.8   562120.3   749.7
The sample mean of weekly earnings is roughly $250 higher for union workers ($1,198) than for non-union workers
($947). The dispersion measures provide mixed evidence on the relative dispersion of the distribution of union weekly
earnings versus the distribution of non-union weekly earnings. The MADx measure suggests slightly more dispersion
for union workers, whereas the sx measure suggests slightly more dispersion for non-union workers. To see whether
the histograms for the two subsamples provide any further evidence of their relative dispersion, Figure 6.11 plots the
two histograms and density curves, using the same x-axis for ease of comparison. It’s clear why union workers have
higher average weekly earnings, as there is a much higher proportion of observations with earnwk > 1000 in the top
histogram compared to the bottom histogram. But, consistent with the descriptive statistics, it’s unclear from these
histograms whether earnings are more dispersed in one versus the other.
Example 6.21 (Male versus female wages) Suppose we are instead interested in a comparison of weekly earnings for
male workers versus female workers. The approach is similar to Example 6.20, except we construct two subsamples
based on gender, one consisting of the 1,501 male workers and one consisting of the 1,308 female workers. Again
focusing on weekly earnings (x = earnwk), the following table provides the descriptive statistics for the two subsamples:
Sample            n      x̄       MADx      s²x       sx
Male workers      1501  1117.3   529.9   610217.8   781.2
Female workers    1308   803.5   415.1   457066.4   676.1
The average weekly earnings for male workers ($1,117) is over $300 higher than the average weekly earnings for
female workers ($804). The MADx and sx statistics both provide evidence that the distribution of male weekly earnings
Figure 6.11
Histograms of weekly earnings for union and non-union subsamples
is more dispersed than the distribution of female weekly earnings. Figure 6.12, which shows the histograms and density
curves for the two subsamples, indicates the higher dispersion for male weekly earnings distribution is a result of its
thicker and longer right tail as compared to the female weekly earnings distribution.
When x ∈ {0, 1} is a binary or indicator variable, the sample variance and sample standard deviation have a particularly simple form. For a binary x variable, the sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the sample proportion of observations with xi = 1 (since the xi = 0 observations do not contribute to the summation). As an example, let's say that x is an indicator of whether a worker is in a union, with 1 indicating a union worker and 0 indicating a non-union worker. For the sample considered in Example 6.20, we have x̄ = 276/2809 ≈ 0.098, or approximately 9.8% of workers being in a union. The following proposition indicates that the sample variance of a binary x variable depends only on x̄:
Proposition 6.5. If x ∈ {0, 1} is a binary variable, the sample variance of x is

$$s_x^2 = \frac{n}{n-1}\,\bar{x}(1 - \bar{x}),$$

and the sample standard deviation of x is

$$s_x = \sqrt{s_x^2} = \sqrt{\frac{n}{n-1}\,\bar{x}(1 - \bar{x})}.$$
The proof of this proposition is left as an exercise (Exercise 6.10). For the union example,

$$s_x^2 = \frac{2809}{2808}\cdot\frac{276}{2809}\cdot\frac{2533}{2809} \approx 0.0886
\qquad\text{and}\qquad
s_x = \sqrt{\frac{2809}{2808}\cdot\frac{276}{2809}\cdot\frac{2533}{2809}} \approx 0.2977.$$
Figure 6.12
Histograms of weekly earnings for male and female subsamples
In a sense, the sample variance doesn’t provide any additional information (beyond x̄) about the binary variable x since
it’s directly a function of x̄. Once the sample proportion of ones is known, it completely determines both the sample
mean and the sample variance of the binary variable.
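The proposition is easy to verify numerically; a sketch using the union indicator for the employed subsample (assuming the cpsemployed data frame from Example 6.6):

x <- as.numeric(cpsemployed$unionstatus == "Union")   # binary union indicator
n <- length(x)
xbar <- mean(x)           # sample proportion of union workers (approximately 0.098)
var(x)                    # sample variance computed directly
(n/(n-1))*xbar*(1-xbar)   # Proposition 6.5 formula; the two values agree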
Definition 6.16 The sample mode or modal outcome of observations x1 , x2 , …, xn is the value that occurs most often.
It is possible that there is more than one sample mode or modal outcome, which happens when two or more outcomes are tied for occurring most often in the sample.
When x is a categorical variable, the sample mode is the category that occurs most often in the sample. For numerical
variables, the sample mode is generally most useful for discrete variables and perhaps for continuous variables which
have “focal” responses/values. For a continuous numerical variable, even if the sample mode is not useful (e.g., if
most or all of the values are distinct), we may refer to a distribution as a unimodal distribution if its histogram or
density curve exhibits only one “hump,” a bimodal distribution if its histogram or density curve exhibits only two
“humps,” and so on. For example, the various distributions of weekly earnings from the cps data have been unimodal
distributions with a single hump.
Example 6.22 (Labor force data) Consider the distribution of the hrslastwk variable (hours worked last week) for the
sample of 2,809 employed individuals from the cps data. Figure 6.13 provides three different ways of looking at the
Figure 6.13
Histograms and box plots of weekly hours worked (CPS data)
distribution of hrslastwk: a histogram with one-hour bin widths, a histogram with ten-hour bin widths, and a box plot
with whiskers and outliers. We use the following R code, specifying the breaks argument to be seq(0.5,99.5,1)
for the one-hour bin-width histogram and seq(-5,105,10) for the ten-hour bin-width histogram:
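A sketch consistent with that description (the 1×3 layout and axis labels are assumptions):

par(mfrow = c(1,3))
hist(cpsemployed$hrslastwk, breaks=seq(0.5,99.5,1), freq=FALSE, main="", xlab="Hours worked last week")
hist(cpsemployed$hrslastwk, breaks=seq(-5,105,10), freq=FALSE, main="", xlab="Hours worked last week")
boxplot(cpsemployed$hrslastwk, ylab="Hours worked last week")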
Since hrslastwk is integer-valued, the histogram with one-hour bin widths has a rectangle corresponding to every
possible value of hrslastwk. A huge spike at hrslastwk = 40 is evident. In the sample, 40 hours is the modal outcome with
1408 individuals reporting 40 hours of work, representing just over half of the sample. Some other less pronounced
spikes can be seen, with the next most common values being 50 hours (158 individuals), 45 hours (106 individuals),
60 hours (96 individuals), and 35 hours (93 individuals). Perhaps not surprisingly, these common outcomes are all at
“round” numbers, which can be a feature of survey variables when individuals are asked for their recollection of a
particular activity. The second histogram, with ten-hour bin widths, smooths away each of the spikes seen in the first
histogram. This histogram still makes it clear that the vast majority of the individuals work around 40 hours per week.
Finally, note the somewhat strange box in the box plot on the right. The line indicating the sample median is right at
the bottom of the box, which also represents the lower quartile. The lower quartile and the median of the sample are
both equal to 40 hours here because there are so many individuals with that value in the dataset.
Definition 6.17 If a and b are known constants, the variable y = a + bx is a linear transformation of the x variable.
Moreover, the values yi = a + bxi (for i = 1, …, n) are linear transformations of the sample observations {x1 , …, xn }.
In some cases, a linear transformation changes the units of a variable, as in the first two examples considered below.
But linear transformations are much more general and can be used to create new variables of interest, as seen in the
third and fourth examples below.
Example 6.23 (Height) If x is height in inches, the variable y that measures height in feet is a linear transformation of x with a = 0 and b = 1/12:

x = height in inches
y = height in feet = (1/12)x
Example 6.24 (Earnings) If x is weekly earnings, the variable y that measures annualized earnings is a linear
transformation of x with a = 0 and b = 52:
x = weekly earnings
y = annualized earnings = 52x
The units of x are dollars per week, and the units of y are dollars per year.
Example 6.25 (Non-working hours) If x is the number of hours worked last week, the variable y that measures the
number of non-working hours last week is a linear transformation of x with a = 168 and b = –1:
x = hours worked last week
y = non-working hours last week = (24)(7) – x = 168 – x
Example 6.26 (Website profits) Suppose x is the number of widgets purchased at a website on a given day. If the price
of a widget is p, the daily fixed cost for the website is f , and the marginal cost of a widget is c, the variable y that
measures the website’s daily profit is a linear transformation of x with a = –f and b = p – c:
x = daily purchases of widgets
y = daily profit = –f + (p – c)x
The units of y are dollars if the units of f are dollars and the units of p and c are dollars per widget.
For a linear transformation y of a variable x, the location and/or dispersion of the distribution of the new yi
observations may differ from the distribution of the original xi observations. Figure 6.14 considers four different
examples to illustrate this point. For each of the four graphs shown, the same density curve is used for the hypothetical
sample associated with the x variable. The distribution for x is bell-shaped, centered around 2, and has nearly all of
its observations between 0 and 4. The top-left graph shows the density curve for y = 2x, a linear transformation with
a = 0 and b = 2. The center of the y density curve is shifted to the right. The center appears to be located at 4, which is
equal to b (2) times the center of the x density curve (2). The dispersion is also greater for the y density curve, which
occurs since the b value is greater than one and therefore increases the scale of the x observations. Moving next to the
top-right graph, we have a density curve for y = 0.5x, a linear transformation with a = 0 and b = 0.5. Again, the central
location of the y density curve differs from that of the x density curve, but this time it is smaller and is located around
1, which is b (0.5) times the center of the x density curve (2). The dispersion of the y density curve is now less than that
of the x density curve, which happens since the b value is less than one and has the effect of decreasing the scale of
the observations. For the bottom-left graph, we have the density curve for y = 3 + x, a linear transformation with a = 3
and b = 1. In this case, the shape of the y density curve looks identical to that of the x density curve, but it is shifted
over to the right by 3 units, corresponding to the value of a. As the shapes of the two density curves are identical,
their dispersions are also the same, which occurs since b is exactly equal to 1. Finally, for the bottom-right graph, we
have the density curve for y = 3 + 0.5x, a linear transformation with a = 3 and b = 0.5. There is less dispersion for the y
density curve, again due to b being less than one. The shape of the y density curve in this case looks identical to the
graph above it (where y = 0.5x) but shifted to the right by a = 3 units. The central location of the y density is affected
by both the a value (a shift of 3 to the right) and the b value (a shift of 1 to the left due to scaling the x variable by 0.5),
resulting in an overall shift of 2 units (4 as compared to the central location of the x density at 2).
To formalize some of the properties exhibited in Figure 6.14, the following proposition states how the various
descriptive statistics for location and dispersion are affected by a linear transformation:
Proposition 6.6. If a and b are known constants and y = a + bx is a linear transformation of x, the descriptive
statistics for the sample {y1 , y2 , …, yn } have the following relationships to the descriptive statistics for the sample
{x1 , x2 , …, xn }:
(i) (sample mean) ȳ = a + bx̄
(ii) (sample variance) s²y = b²s²x
(iii) (sample standard deviation) sy = |b|sx
(iv) (sample quantiles)¹⁴ ỹq = a + bx̃q if b ≥ 0
(v) (sample IQR)¹⁵ IQRy = |b|IQRx if b ≥ 0
(vi) (sample MAD) MADy = |b|MADx
It is useful to show each of these properties. For the sample mean in (i),
ȳ = (1/n) Σᵢ₌₁ⁿ yi = (1/n) Σᵢ₌₁ⁿ (a + bxi) = (1/n)(Σᵢ₌₁ⁿ a + Σᵢ₌₁ⁿ bxi) = (1/n)(na + bnx̄) = a + bx̄.
ȳ is a linear function of x̄, with the same a and b constants as the original linear transformation.
For the sample variance in (ii),
s²y = (1/(n–1)) Σᵢ₌₁ⁿ (yi – ȳ)² = (1/(n–1)) Σᵢ₌₁ⁿ (a + bxi – (a + bx̄))² = (1/(n–1)) Σᵢ₌₁ⁿ (bxi – bx̄)² = b² · (1/(n–1)) Σᵢ₌₁ⁿ (xi – x̄)² = b²s²x.
Whereas the x variable is scaled by b in the linear transformation y, the sample variance of y gets scaled by b². The
additive constant a drops out of the s²y expression, which makes intuitive sense since a shifts the location of the
original x distribution but does not change its shape or dispersion. The sign of b doesn't matter here, so that, for
example, the sample variance of y = 1 + 2x is the same as the sample variance of y = 1 – 2x, both being four times the
sample variance of x. The magnitude of b dictates whether or not the
sample variance of y is smaller or larger than the sample variance of x. From the sample variance result (s²y = b²s²x),
s²y > s²x if |b| > 1, s²y < s²x if |b| < 1, and s²y = s²x if |b| = 1.

Figure 6.14
Linear transformations of a variable (four panels: a = 0, b = 2; a = 0, b = 0.5; a = 3, b = 1; a = 3, b = 0.5)
For the sample standard deviation in (iii),
sy = √(s²y) = √(b²s²x) = |b|sx,
where √(b²) = |b| is used for the last equality. As with the sample variance, the relative sizes of the sample standard
deviations of x and y depend upon the magnitude of b: sy > sx when |b| > 1, sy < sx when |b| < 1, and sy = sx when |b| = 1.
For the sample quantiles in (iv), showing that ỹq = a + bx̃q when b ≥ 0 is straightforward. Let's say that the sample of
x values has been sorted, from lowest to highest, to calculate a sample quantile x̃q. When b is positive, if the sample of
y values is sorted, from lowest to highest, the sorted y values will be in the same exact ordering as the sorted x values.
Therefore, when the algorithm from Section 6.4.2 for calculating a sample quantile is applied, the sample quantile of
y will be ỹq = a + bx̃q . As a special case of this result, the sample medians are related by the equation ỹ0.5 = a + bx̃0.5 .
For the sample IQR in (v), when b ≥ 0, the result follows from the result for sample quantiles. Since ỹ0.25 = a + bx̃0.25
and ỹ0.75 = a + bx̃0.75 ,
IQRy = ỹ0.75 – ỹ0.25 = a + bx̃0.75 – (a + bx̃0.25 ) = b(x̃0.75 – x̃0.25 ) = bIQRx .
As with the sample variance and sample standard deviation, the additive constant a has no effect on the IQR dispersion
measure. While a may shift the location of the distribution, it has no effect on the difference between the quantiles
within the distribution. Instead, the sample IQR of y is just a scaled version of the sample IQR of x, with the same
scaling (b) as for the sample standard deviation.
For the sample MAD in (vi),
MADy = (1/n) Σᵢ₌₁ⁿ |yi – ȳ| = (1/n) Σᵢ₌₁ⁿ |a + bxi – (a + bx̄)| = (1/n) Σᵢ₌₁ⁿ |b(xi – x̄)| = |b| · (1/n) Σᵢ₌₁ⁿ |xi – x̄| = |b|MADx.
The additive constant a does not affect MADy , and the scaling constant b affects MADy in the same way as seen for
the sample standard deviation and the sample IQR.
Example 6.27 (Height) Example 6.23 had x = height in inches and y = (1/12)x = height in feet. For a sample of heights,
Proposition 6.6 implies
ȳ = (1/12)x̄, ỹ0.5 = (1/12)x̃0.5, s²y = (1/144)s²x, and sy = (1/12)sx.
Example 6.28 (Website profits) In Example 6.26, the website had daily widget sales x and daily profits y = –f + (p – c)x,
where f is the fixed daily cost, p is the widget price, and c is the marginal cost of producing each widget. For a sample
of daily sales x, from which daily profits y are derived,
ȳ = –f + (p – c)x̄, ỹ0.5 = –f + (p – c)x̃0.5 (if p > c), s²y = (p – c)²s²x, and sy = |p – c|sx.
Example 6.29 (Earnings) For Example 6.24, with x being weekly earnings and y = 52x being annualized earnings,
let’s consider the descriptive statistics associated with the actual weekly earnings (earnwk) variable from the cps data:
             x̄       s²x        sx      x̃0.5    IQRx    MADx
earnwk (x)   971.2    563227.1   750.48   770     673.6   497.8
Figure 6.15
Histograms of weekly hours worked and non-working hours
The histogram of non-working hours (y = 168 – x) is the mirror image of the histogram of hours worked, shifted by
the large additive constant a = 168. Taking the mirror image of the distribution doesn't affect dispersion, as implied by
the results in Proposition 6.6 for |b| = 1. For instance, the sample standard deviations sx and sy are both approximately
11.28, and the sample MAD statistics MADx and MADy are both approximately 6.53. Since x + y = 168, the sample
mean of x (x̄ ≈ 40.3) and the sample mean of y (ȳ ≈ 127.7) sum to 168.
Interestingly, regardless of the original sample {x1, x2, …, xn}, the sample of standardized values {y1, y2, …, yn},
where yi = (xi – x̄)/sx, always has sample mean ȳ = 0, since (applying Proposition 6.6 with a = –x̄/sx and b = 1/sx)
ȳ = –x̄/sx + (1/sx)x̄ = 0,
and sample standard deviation sy = 1, since
sy = (1/sx)sx = 1.
Example 6.31 (Monthly stock returns) Returning to the AT&T monthly stock returns (T) and the Bank of America
monthly stock returns (BAC) from the sp500 dataset, the following R code standardizes variables for both T and BAC:
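(A sketch of the standardization step, which subtracts each variable's sample mean and divides by its sample
standard deviation; R's scale function performs the same calculation.)

# standardized monthly returns: mean zero, standard deviation one
T_std <- (sp500$T - mean(sp500$T)) / sd(sp500$T)
BAC_std <- (sp500$BAC - mean(sp500$BAC)) / sd(sp500$BAC)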
summary(T_std)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.1626 -0.4851 -0.0185 0.0000 0.6289 4.2555
sd(T_std)
## [1] 1
summary(BAC_std)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -5.08030 -0.47193 0.01723 0.00000 0.50103 6.77693
sd(BAC_std)
## [1] 1
The summary statistics for the two standardized variables confirm that the sample means are equal to zero and the
sample standard deviations are equal to one. Figure 6.16 shows the histograms and density curves for the standardized
variables. As compared to the distributions of the original variables T and BAC in Figure 6.6, where BAC has visibly
more dispersion than T, the dispersion of the two standardized variables in Figure 6.16 is quite similar due to the
division by their respective standard deviations. The distributions are also centered around zero due to the de-meaning.
And, for both standardized variables, a very large proportion of the observations are between –2 (two sample standard
deviations below the sample mean) and +2 (two sample standard deviations above the sample mean). As we’ll see in
Chapter 11, this property is to be expected for variables with distributions that are approximately bell-shaped.
The optional argument type="l" provides a line plot rather than a scatter plot. With only a single variable
specified in the first argument to plot, R automatically plots the monthly return data against the row number or,
as specified on the x-axis label, the “Index.”
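A command along the following lines produces such a plot (a sketch; the y-axis label is taken from Figure 6.17):

# line plot of AT&T monthly returns against observation number
plot(sp500$T, type="l", ylab="Monthly return (T)")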
Figure 6.16
Histograms of standardized monthly stock returns
To directly incorporate information about actual dates or times associated with the observations, an alternative
approach in R is to create a time-series object and then draw the time-series plot:
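(A sketch; the object names ts_T and ts_BAC match those used later in this section.)

# monthly time-series objects starting in January 1991
ts_T <- ts(sp500$T, start=c(1991,1), frequency=12)
ts_BAC <- ts(sp500$BAC, start=c(1991,1), frequency=12)
plot(ts_T, ylab="Monthly return (T)")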
The ts function creates the time-series object based upon the variable specified by the first argument sp500$T.
The optional argument start=c(1991,1) specifies that the time series begins in the first month of the year 1991,
and the optional argument frequency=12 specifies that the observations are monthly (i.e., at a frequency of 12
per year). The resulting time-series plot, shown in Figure 6.18, looks identical to Figure 6.17, as it should. The only
difference is that the date values appear on the x-axis, now labeled “Time.” The dispersion of the AT&T monthly
returns is particularly high between 1998 and 2004. This feature of the data is missed by either the histogram or box
plot since neither of those visualizations incorporates the time dimension.
We can also graph multiple time-series plots at once. Two approaches are considered, one in which the time-series
plots are shown on the same graph and one in which the time-series plots are vertically stacked. The following R
code yields Figure 6.19, with the time-series plots for both AT&T monthly returns (T) and Bank of America monthly
returns (BAC) shown on the same graph:
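(In sketch form; the y-axis limits are a guess chosen so that both series are fully visible.)

# AT&T returns as a solid line, Bank of America returns as a dotted line
plot(ts_T, lty=1, ylim=c(-0.6,0.6), ylab="Monthly returns")
lines(ts_BAC, lty=3)
legend("topright", legend=c("T","BAC"), lty=c(1,3))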
Figure 6.17
Time-series plot of monthly stock returns
The plot function draws the AT&T time-series plot with a solid line. The lines function draws the Bank of
America time-series plot as a dotted line and does so on the same graph since lines is used instead of plot. The
legend function creates the legend in the top-right corner of the graph. The greater dispersion of Bank of America's
returns is evident in this figure, with the dotted time-series plot having more pronounced spikes. The largest of these
spikes occurred during the financial crisis in the years before 2010.
Including multiple time-series plots on the same graph can make it difficult to distinguish the behavior of the
variables. The following R code instead separately graphs the two time-series plots for T and BAC and stacks them
vertically, as shown in Figure 6.20:
plot(cbind(ts_T,ts_BAC), main="")
The cbind function combines the two time-series objects ts_T and ts_BAC into a single multivariate time-series
object, and plot vertically stacks the time-series plots for the variables in the resulting object. This approach
generalizes to more
Figure 6.18
Time-series plot of monthly stock returns
time-series variables, with additional variables included in the cbind function. Alternatively, the first argument can
be supplied directly as a data frame that consists of time-series objects.
Notes
9 The S&P 500, which is an abbreviation for the Standard and Poor’s 500, is a stock market index that tracks the stock performance of 500 of
the largest companies listed on United States stock exchanges, including the New York Stock Exchange and the Nasdaq. There are fewer than 500
stocks in the sp500 dataset since we have only included companies that were part of the S&P 500 index for the full time period between January
1991 and April 2021.
10 Negative stock prices are not possible. There are other investments, like “short sales” of stocks, for which returns can be negative and arbitrarily
large in magnitude.
11 A pie chart is an alternative descriptive visual that can be used, also using either count values or proportion values, where the sizes of the pie
wedges correspond to the sample proportions. An advantage of the bar chart is that it’s easier to visually compare bar heights rather than the size of
pie wedges.
12 Also unlike the sample mean, the sample median may not be uniquely defined when n is even. In the algorithm for calculating the sample
median for even n, choosing any value in between the two middle points would satisfy the formal definition.
13 An alternative measure is the sample median absolute deviation, (1/n) Σᵢ₌₁ⁿ |xi – x̃0.5|, calculated with the mad function in R.
14 The result for negative b is more subtle. For b < 0, we have ỹq ≈ a + bx̃1–q, where we use an approximation sign “≈” since we don’t get exact
equality based upon the procedure given in Section 6.4.2 for calculating sample quantiles. When b is negative, if we sort the sample of y values
from lowest to highest, the sorted values will be in exactly the reverse ordering of the sorted x values. As an example, if we had n = 101 and y = –x,
the sample 25% quantile of y is the 26th value from the sorted y values, which corresponds to the 75th value from the sorted x values (which is x̃0.74);
so we have ỹ0.25 = a + bx̃0.74 ≈ a + bx̃0.75 = a + bx̃1–0.25.
15 The result for negative b (b < 0) is approximate, IQRy ≈ |b|IQRx, again due to the procedure given in Section 6.4.2.
Exercises
1. Use the cps dataset for this question.
(a) Provide a table of the marital status (marstatus) categorical variable.
Figure 6.19
Time-series plots of monthly stock returns
(b) For a bar chart of marstatus with sample proportions on the y-axis, what is the height of the “Divorced” bar?
(c) Conditional on someone being not “Married,” what is the sample proportion that are “Divorced”?
(d) Consider the indicator variable notmarried equal to 0 if the variable value is “Married” and 1 otherwise. What
is the sample mean of notmarried?
2. For a sample with five observations (n = 5), the deviations-from-mean for the first four observations are –1.1, 2.3,
–0.8, and 0.2.
(a) What is the deviation from mean for the fifth observation?
(b) What is the sample variance?
3. For the cps dataset, there are 4,013 observations on adults aged 30 to 59. One of the variables is the number of
children (ownchild), with the following table describing how many observations take on each of the possible values
0, 1, 2, 3, 4, 5, 6, 7:
ownchild (# of children) 0 1 2 3 4 5 6 7
# of observations 2432 654 584 233 81 21 5 3
For this question, use only the numbers in the table and not the actual cps dataset; you can use R for mathematical
calculations, but don’t use the built-in R functions for descriptive statistics.
(a) What is the sample average of ownchild?
(b) What is the sample median of ownchild?
(c) What is the sample 90% quantile of ownchild?
(d) Is the distribution of ownchild symmetric, left-skewed, or right-skewed?
(e) What is the sample mean absolute deviation of ownchild?
Figure 6.20
Time-series plots of monthly stock returns
1.89, 1.99, 2.14, 2.51, 5.03, 3.81, 1.97, 2.31, 2.91, 3.97, 2.68, 2.44.
(a) What is the sample mean absolute deviation (MADx) of monthly rainfall?
(b) What is the sample variance (s²x) of monthly rainfall?
(c) What is the sample standard deviation (sx) of monthly rainfall?
(d) If y is monthly rainfall measured in feet, what are MADy, s²y, and sy?
6. For the cps dataset, here are summary statistics of the hourly wage last week (wagehr). The variable wagehr is
missing for the 2,174 individuals who are not hourly employees, so that it has numeric values for the 1,839 employed
individuals paid on an hourly basis.
summary(cps$wagehr)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.01 12.78 16.41 18.60 22.00 90.00 2174
(e) Create a standardized version of gre_quant called gre_quant_std, using the sample mean and sample standard
deviation of the full sample. Confirm that gre_quant_std has sample mean zero and sample standard deviation
one.
(f) What are the sample mean, sample standard deviation, and sample IQR of gre_quant_std for the subsamples
of domestic students and non-domestic students?
9. Logan’s Lemonade sells lemonade for $6 per cup each Sunday at a local farmer’s market. The number of cups sold
each Sunday is represented by the variable x. Over 16 weeks, the descriptive statistics for x are x̄ = 120 and sx = 20.
(a) What are the sample mean and sample standard deviation of Logan’s Lemonade’s daily revenues (price times
quantity sold)?
(b) The fee to have a booth at the farmer’s market is $100 each Sunday. If it also costs Logan’s Lemonade $1 per
cup to make the lemonade, what are the sample mean and sample standard deviation of Logan’s Lemonade’s
daily profits (revenues minus costs)?
(c) If you knew the value of the sample median x̃0.5 = x∗ , how would the sample median of daily revenues be related
to x∗ ? How about the sample median of daily profits?
10. Suppose x ∈ {0, 1} is a binary variable.
(a) *Show that the sample variance of x is
s²x = (n/(n–1)) x̄(1 – x̄).
(Hint: Use the following facts: (i) xi² = xi if xi ∈ {0, 1} and (ii) Σᵢ₌₁ⁿ xi = nx̄.)
(b) In a sample of 200 individuals, 13 had an emergency-room visit in the last year. If the variable x is an indicator
of an emergency-room visit (equal to 1 if visit occurred, 0 if not), what is the sample standard deviation of x?
(c) *Provide a formula for the sample mean absolute deviation of x (MADx ) in terms of x̄. The formula should not
contain any individual xi values.
11. The sample skewness gx is a descriptive statistic that measures the skewness of a sample distribution, defined as
gx = [(1/n) Σᵢ₌₁ⁿ (xi – x̄)³] / (sx)³.
Positive values of gx are associated with right-skewed sample distributions, with gx > 1 considered highly right-skewed.
Negative values of gx are associated with left-skewed sample distributions, with gx < –1 considered highly left-skewed.
Values of gx closer to zero are associated with sample distributions that are neither left-skewed nor right-skewed.
(a) What are the units of gx ?
(b) Write an R function skewness that takes a vector x, containing the sample, as its only argument and returns
the sample skewness.
(c) For the cps dataset, use the skewness function to calculate the sample skewness of the earnwk and age
variables for the subsample of employed individuals.
(d) How would the sample skewness of earnwk change if weekly earnings were measured in thousands of dollars
rather than dollars?
12. Use the bitcoin dataset for this question. This dataset consists of daily prices and returns for the Bitcoin
cryptocurrency between January 1, 2020 and December 31, 2021. There are 731 observations on the three price
variables (high = daily high price, low = daily low price, close = end-of-day price) and 730 observations on the daily
return (return).
(a) Draw a time-series plot of the closing daily price for the full time series.
(b) For the first 60 observations (through February 29, 2020), draw a time-series plot with the three price variables
(high, low, close) on the same plot. Draw high and low as solid lines and close as a dotted line.
(c) Draw a time-series plot of the daily returns for the full time series.
(d) The time-series plot in (c) should indicate periods of low variance and periods of high variance. Eyeball the
graph and pick one low-variance range and one high-variance range, and then confirm what you see visually
by calculating the sample standard deviations for the two ranges you’ve identified.
13. Use the inflation dataset for this question. This panel dataset consists of annual inflation rates for 45 countries over
the ten-year period between 2010 and 2019. The variables are country (a categorical (factor) variable with a three-
character abbreviation), year (values between 2010 and 2019), and inflation (in percentage points; e.g., inflation = 3
means 3% annual inflation).
(a) How many observations are associated with deflation (a negative inflation rate)?
(b) How many countries experience deflation at some point during the 10-year period?
(c) Which countries have the lowest and highest average inflation over the 10-year period?
(d) Which countries have the lowest and highest standard deviation of inflation over the 10-year period?
(e) Draw a time-series plot for the inflation rate of the United States (USA).
(f) Draw a time-series plot that has the inflation rates for the United States (USA), Canada (CAN), and Mexico
(MEX) on the same plot. Make the three lines a different style and/or color to differentiate them, and make sure
that the y-axis range allows all three lines to be completely visible.
Chapter 6 introduced descriptive statistics and visuals for univariate data. In this chapter, the focus shifts to the case of
data on two variables, known as bivariate data, with the introduction of descriptive statistics and visuals that describe
the relationship between the two variables. Rather than having a single variable for each observational unit, consider a
situation where two variables x and y are observed as a collection of pairs
{(x1 , y1 ), (x2 , y2 ), …, (xn , yn )}
or, more concisely, {(xi, yi)} for i = 1, …, n. As before, n denotes the sample size.
Example 7.1 (Education and earnings) If we are interested in the relationship between earnings and educational
attainment, data can be collected on a sample of workers with x being the years of educational attainment and y being
weekly (or annual) earnings. A positive association is expected between these two variables.
Example 7.2 (Monthly stock returns) Chapter 6 considered several examples using the monthly stock return data
from sp500, specifically for the case of AT&T (stock ticker T) and Bank of America (stock ticker BAC). Suppose we
are instead interested in looking at the relationship between monthly stock returns for two companies that operate
in the same industry, in this case the home improvement industry. In the dataset, there are monthly stock returns for
both Home Depot (stock ticker HD) and Lowe’s (stock ticker LOW), so let x be the monthly stock return for HD and
let y be the monthly stock return for LOW. For these two companies, there are different factors that might affect the
relationship of their stock returns. On one hand, since they are competitors in the industry, it might be expected that
one company does worse when the other company does better. On the other hand, since both companies are affected
by the same macroeconomic conditions (e.g., home construction levels), it might be expected that both companies do
better than usual (or worse than usual) at the same time.
Examples 7.1 and 7.2 involve numerical data for the variables being considered. While this chapter focuses primarily
on descriptive statistics and visualization for numerical variables, Section 7.1 briefly considers categorical variables
before the case of numerical variables is covered in Section 7.2.
observed sample must be in one and only one of these Cx × Cy categories. As such, the sum of the joint sample counts
over all the categories is equal to the sample size n, and the sum of the joint sample proportions is equal to one.
Example 7.3 (Race and labor-force status) We consider the categorical variables race and lfstatus from the cps
dataset, which allow us to examine whether there is a relationship between race and labor-force status. race has
three categories (“Black”, “White”, “Other”) and lfstatus has three categories (“Employed”, “Unemployed”, “Not
in LF”), so the total number of joint categories is nine. To completely describe the joint sample distribution, a table of
joint sample counts and a table of joint sample proportions can be created in R:
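(A sketch consistent with the commands described below.)

# joint sample counts, with row and column totals added
addmargins(table(cps$lfstatus, cps$race))
# joint sample proportions: divide each count by the sample size
addmargins(table(cps$lfstatus, cps$race)/nrow(cps))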
The command table(cps$lfstatus, cps$race) provides the joint sample counts, while passing
table(cps$lfstatus, cps$race) as an argument to the function addmargins provides row totals and
column totals (labeled by Sum). The table of joint sample proportions, created by the second addmargins command,
divides each count by the sample size (nrow(cps), which is 4,013).
Looking at the top-left element of each table, there are 324 individuals who are black and employed, representing
0.0807 or 8.07% of the sample. The row and column totals make it easy to say something about either of the
individual variables lfstatus and race. For instance, looking at the row labeled Sum, there are 3,188 white individuals,
representing 79.45% of the sample.
Unfortunately, these tables do not make it easy to compare labor-force status of the different racial groups since
there are different numbers of individuals in the three racial categories. Similarly, these two tables do not make it easy
to compare the racial breakdown of the different labor-force statuses since there are different numbers of individuals
in the three labor-force status categories. To facilitate such comparisons, we introduce a different table containing
conditional sample proportions. The idea is to calculate the sample proportions of one categorical variable given the
value of the other categorical variable. As an example, the function prop.table can be used in R to create a table
of the conditional sample proportions of labor-force status, where we condition upon the value of the racial category:
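(A sketch of that command.)

# proportions of labor-force status conditional on race; each column sums to one
prop.table(table(cps$lfstatus, cps$race), margin=2)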
The original table, given by table(cps$lfstatus, cps$race), is passed as the first argument to the
prop.table function. The margin argument indicates which variable should be conditioned on, with margin=2
specifying that the second variable (cps$race) is the one being conditioned on here. It can be verified that the
sum of the proportions in each column is equal to one. The values in this table could have been calculated directly
from the information in the joint sample count table. For instance, the “Black” and “Employed” value, which is the
proportion of black individuals who are employed, is obtained by dividing the joint sample count of 324 by the total
for the “Black” column (476), which yields 324/476 ≈ 0.6807. From this table, we see that the sample proportion of black
individuals who are employed (0.6807) is less than the sample proportion of white individuals who are employed
(0.7039). There is also a lower sample proportion of white individuals who are not in the labor force (0.2707), as
compared to either black individuals (0.2857) or other-race individuals (0.2837).
By changing the margin argument for the prop.table function, we can create a table of conditional sample
proportions of the racial categories, where we condition on labor-force status:
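(A sketch, changing only the margin value.)

# proportions of race conditional on labor-force status; each row sums to one
prop.table(table(cps$lfstatus, cps$race), margin=1)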
The margin=1 argument specifies that we condition on the first variable (cps$lfstatus). It can be verified that
the sum of the proportions in each row is equal to one. The values in this table could have been calculated directly from
the joint sample count table, by dividing any of the joint sample counts by its associated row total. For instance, the
“Black” and “Employed” value, which is the proportion of employed individuals who are black, is obtained by dividing
the joint sample count of 324 by the total for the “Employed” row (2,809), which yields 324/2809 ≈ 0.1153. The proportion
of white individuals among employed individuals (0.7989) is larger than the proportion of white individuals among
unemployed workers (0.7642), whereas the proportion of black individuals among employed individuals (0.1153) is
smaller than the proportion of black individuals among unemployed individuals (0.1509).
Building upon the idea of using conditional sample proportions as a descriptive tool, as in Example 7.3, we can do
something similar graphically by providing a descriptive visual (e.g., a bar chart) of one variable conditioned on the
categorical value of the other variable.
Example 7.4 (Race and labor-force status) To visually assess the association between race and labor-force status,
bar chart versions of the two conditional sample proportions tables from Example 7.3 can be created. Figure 7.1 shows
bar charts of labor-force status given race. Consistent with Example 7.3, the bar chart indicates that the proportion of
employed individuals is highest among white individuals and lowest among black individuals, whereas the proportion
of not-in-labor-force individuals is lowest among white individuals. Here is the R code used to create Figure 7.1:
Figure 7.1
Labor-force status proportions by race
# create sample count table for race and labor-force status variables
tbl_racelf <- table(cps$race, cps$lfstatus)
# barplot command --- categories on x-axis are based upon columns (lfstatus) of the table
barplot(prop.table(tbl_racelf, margin=1), ylim=c(0,0.8), col=c("gray30","gray50","gray70"),
legend.text=rownames(tbl_racelf), beside=TRUE, main="")
When a table is provided as the first argument to the barplot function, the bars are grouped into categories on
the x-axis corresponding to the columns of the table. The table tbl_racelf has been specified to have labor-force
status as the columns. The additional argument beside=TRUE causes the bars for each of the racial categories to
be displayed side-by-side, and the legend.text argument specifies the racial categories, which are given by the
vector returned by rownames(tbl_racelf).
Figure 7.2 shows bar charts of the racial categories given labor-force status. Consistent with Example 7.3, the bar
charts indicate that the proportion of white individuals is highest within the group of employed individuals and lowest
within the group of unemployed individuals, whereas the reverse is true for black individuals. Here is the R code used
to create Figure 7.2:
Figure 7.2
Race proportions by labor-force status
# create sample count table for labor-force status and race variables
tbl_lfrace <- table(cps$lfstatus, cps$race)
# barplot command --- categories on x-axis are based upon columns (race) of the table
barplot(prop.table(tbl_lfrace, margin=1), ylim=c(0,0.8), col=c("gray30","gray50","gray70"),
legend.text=rownames(tbl_lfrace), beside=TRUE,
args.legend = list(x="topleft",inset=0.01), main="")
The primary difference from the previous code is that the rows and columns of the created table (tbl_lfrace)
are now labor-force status and race, respectively. The position of the legend is also specified since the default position
would interfere with the displayed bars.
7.1.2 Bivariate data with one categorical variable and one numerical variable
To assess the relationship between a categorical variable and a numerical variable, we can examine how the descriptive
statistics and distribution of the numerical variable vary over different categories of the categorical variable. With
descriptive statistics, the easiest approach is to report descriptive statistics of the numerical variable for each possible
value of the categorical variable. This approach is equivalent to breaking the full sample into different subsamples,
each of which corresponds to one of the categories for the categorical variable.
Example 7.5 (Race and earnings) Let’s examine the relationship between weekly earnings (earnwk) and racial
category (race) from the cps dataset. The full sample of employed individuals has n = 2809, which is broken into
three subsamples for race = “Black” (324 observations), race = “White” (2,244 observations), and race = “Other”
(241 observations). For instance, if cpsemployed is a data frame in R for the 2,809 employed individuals, the
subsample descriptive statistics can be calculated with the tapply function:
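(A sketch consistent with the discussion below.)

# weekly-earnings statistics computed separately for each racial category
tapply(cpsemployed$earnwk, cpsemployed$race, mean)
tapply(cpsemployed$earnwk, cpsemployed$race, sd)
iqrvec <- tapply(cpsemployed$earnwk, cpsemployed$race, IQR)
iqrvec["White"]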
Had we used the original data frame cps, which contains missing values for the weekly earnings variable,
rather than the cpsemployed data frame, the optional argument na.rm=TRUE would have been required for
the last two uses of the tapply function above. For example, the sample standard deviation command would be
tapply(cps$earnwk, cps$race, sd, na.rm=TRUE). For the IQR calculations, the results are stored in
the vector iqrvec, and the specific IQR value for the white subsample of workers is obtained by referring to the
"White" index within square brackets.
Figure 7.3
Box plots of weekly earnings by race
In the same way that descriptive statistics can be applied to different subsamples based upon the categorical values,
descriptive visuals can also be applied to these subsamples. For example, histograms, density curves, and/or box plots
can be drawn and compared for the different subsamples.
Example 7.6 (Race and earnings) Figure 7.3 shows a box plot of weekly earnings (earnwk) for each of the three
subsamples. From left to right, the box plots correspond to black workers, other-race workers, and white workers.
Comparing the box plots for black and white workers, there is a higher sample median for white workers, a larger IQR
(height of the box) for white workers, and a more pronounced right skew for white workers. Figure 7.3 is particularly
easy to create in R:
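(A sketch; the axis labels are taken from the figure.)

# box plots of weekly earnings, split by the race variable
boxplot(cps$earnwk ~ cps$race, xlab="Race", ylab="Weekly earnings")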
The syntax here for the first argument of the boxplot function has a tilde character (~) between two variables,
where the first variable (cps$earnwk) is the variable of interest for the box plot and the second variable
(cps$race) is a categorical variable used to split the sample into subsamples. For a similar figure to be drawn
based upon gender rather than race, cps$earnwk~cps$gender would be the first argument (and xlab should
be appropriately changed).
Histograms and density curves can be used as alternatives to box plots. As an example, Figure 7.4 shows the density
curves of the earnwk variable for the three racial categories. To easily compare the distributions, the density curves
are drawn on the same graph, with the same x-axis and y-axis. This figure tells much the same story as Figure 7.3. All
three earnings distributions have a unimodal shape, with the earnings distribution for white workers exhibiting a much
thicker right tail than the earnings distribution for black workers. This thicker right tail explains the higher dispersion
statistics seen in Example 7.5. The hump for the black-worker earnings distribution is considerably higher and peaks
at a lower earnings level. A comparison of the curves indicates that a much larger proportion of black workers have
weekly earnings below $1000 than either white workers or other-race workers.

Figure 7.4
Density of weekly earnings by race
Figure 7.4 can be created along the following lines in R:
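(A sketch: the subsample vector names, axis limits, and legend placement are assumptions.)

# earnings vectors for the three racial subsamples
earnwk_black <- cpsemployed$earnwk[cpsemployed$race=="Black"]
earnwk_other <- cpsemployed$earnwk[cpsemployed$race=="Other"]
earnwk_white <- cpsemployed$earnwk[cpsemployed$race=="White"]
# density curves on a common graph, distinguished by line type
plot(density(earnwk_black), lty=1, main="", xlab="Weekly earnings", xlim=c(0,8000), ylim=c(0,0.0013))
lines(density(earnwk_other), lty=3)
lines(density(earnwk_white), lty=2)
legend("topright", legend=c("Black","Other","White"), lty=c(1,3,2))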
First, vectors containing the earnwk values for each of the three subsamples are created. Then, each of the three
density curves are plotted on the same graph, with each having a different line type (lty=1 (the default) being a solid
line for black individuals, lty=3 being a dotted line for other-race individuals, and lty=2 being a dashed line for
white individuals).
In addition to a standard scatter plot for two variables, R can draw multiple scatter plots simultaneously for a set
of three or more variables. This expanded scatter plot provides a scatter plot for every possible pair of variables from
among the set of variables specified. To see how the expanded scatter plot works in practice, let’s add an additional
variable (hours worked last week (hrslastwk)) to the two variables (educ and earnwk) already specified. Figure 7.6
shows the expanded scatter plot for these three variables. There are six different plots since there are P3,2 = (3)(2) = 6
ways to choose an ordered pair of variables from a set of three variables. (In general, for a set of k variables, an expanded
scatter plot has Pk,2 = (k)(k – 1) scatter plots.) The scatter plot in the middle of the top row is a plot of educ versus
earnwk, whereas the scatter plot on the left of the second row is a plot of earnwk versus educ, essentially reversing
the roles of the x and y variables. Similarly, the scatter plot on the right of the second row is a plot of earnwk versus
hrslastwk, whereas the scatter plot in the middle of the bottom row is a plot of hrslastwk versus earnwk. These two
earnings and hours worked plots indicate a positive relationship between earnings and hours worked, which is perhaps
unsurprising. In addition to earnings tending to be higher when hours worked are larger, the dispersion of earnings
also appears to increase at higher levels of hours worked.
The R code to create the expanded scatter plot in Figure 7.6 is rather simple:
Figure 7.5
Scatter plot of weekly earnings versus years of education
# expanded scatter plot, with weekly earnings, years of education, and weekly hours worked
plot(cps[,c("educ","earnwk","hrslastwk")])
Here, the function plot takes a data frame as its first and only argument, and in this case the three columns for the
variables of interest are selected.
Example 7.8 (Monthly stock returns) Continuing Example 7.2, let’s look at the relationship between the monthly stock
returns for Home Depot (HD) and Lowe’s (LOW). Figure 7.7 shows a scatter plot of LOW (y-axis) versus HD (x-axis).
The plot indicates a positive association between HD and LOW, as the cloud of points are roughly contained within
an oval stretching from the bottom left to the upper right of the plot. It is more likely to see low values of LOW when HD
is low and high values of LOW when HD is high. This positive association supports the idea that the two companies’
monthly returns tend to move in the same direction due to common macroeconomic conditions affecting their industry.
An expanded scatter plot can include more variables, so let’s add two additional stocks, Bank of America (stock
ticker BAC) and Wells Fargo (stock ticker WFC), both in the banking industry. Figure 7.8 shows the expanded scatter
plot with four stocks (HD, LOW, BAC, WFC).
Figure 7.6
Expanded scatter plot of weekly earnings, education, and weekly hours worked
The two BAC-WFC plots, located at the bottom right of the figure, indicate a positive association similar to the one
for HD and LOW. BAC tends to be higher when WFC is higher, and BAC tends to be lower when WFC is lower. For
some of the other plots, specifically the ones across industries (e.g., one stock from home improvement (HD or LOW)
versus one stock from banking (BAC or WFC)), there are positive relationships but not as strong as either the HD-
LOW relationship or the BAC-WFC relationship. For example, in the plot of Lowe’s (LOW) versus Bank of America
(BAC), located in the third box of the second row, the cloud of points has a roughly oval shape which is slightly tilted
to the right from vertical. The slight tilt indicates a positive relationship, but the tilt is not as strong as that seen in, for
example, the plot of Lowe’s (LOW) versus Home Depot (HD), located in the first box of the second row.
Figure 7.7
Scatter plot of Lowe’s monthly returns versus Home Depot’s monthly returns
Definition 7.1 For observations (x1, y1), (x2, y2), …, (xn, yn), the sample covariance between x and y, denoted sxy, is
sxy = (1/(n–1)) Σᵢ₌₁ⁿ (xi – x̄)(yi – ȳ).
The 1/(n–1) scaling for the sample covariance is the same as the scaling for the sample variance s²x. The units of the
sample covariance sxy are the units of x times the units of y. For example, if x is education (in years) and y is weekly
earnings (in dollars), the units of sxy are years × dollars. The sample variance is a special case of the sample covariance,
obtained by taking the covariance of a variable x with itself:
sxx = (1/(n–1)) Σᵢ₌₁ⁿ (xi – x̄)(xi – x̄) = (1/(n–1)) Σᵢ₌₁ⁿ (xi – x̄)² = s²x.
Also, the ordering of the variables x and y does not matter: sxy = syx since (xi – x̄)(yi – ȳ) = (yi – ȳ)(xi – x̄) for each i.
There are four types of contributions to the summation Σᵢ₌₁ⁿ (xi – x̄)(yi – ȳ) in the sample covariance sxy:
Figure 7.8
Expanded scatter plot of monthly stock returns
Figure 7.9
Scatter plot of seven-observation sample
(i) Observation (xi, yi) with xi above its mean and yi above its mean:
xi > x̄ and yi > ȳ implies (xi – x̄)(yi – ȳ) > 0
(ii) Observation (xi, yi) with xi below its mean and yi below its mean:
xi < x̄ and yi < ȳ implies (xi – x̄)(yi – ȳ) > 0
(iii) Observation (xi, yi) with xi above its mean and yi below its mean:
xi > x̄ and yi < ȳ implies (xi – x̄)(yi – ȳ) < 0
(iv) Observation (xi, yi) with xi below its mean and yi above its mean:
xi < x̄ and yi > ȳ implies (xi – x̄)(yi – ȳ) < 0
The first two types of observations, with xi and yi either both above their means or both below their means, lead
to positive contributions to the sample covariance. The second two types of observations, with xi and yi on opposite
sides of their means, lead to negative contributions to the sample covariance. The overall sample covariance measure
involves a combination of all four types of observations, so its sign depends upon the relative magnitudes of the
(xi – x̄)(yi – ȳ) > 0 contributions from (i) and (ii) versus the magnitudes of the (xi – x̄)(yi – ȳ) < 0 contributions from (iii)
and (iv).
Generally speaking, there is a positive sample covariance when larger values of x tend to be associated with larger
values of y and smaller values of x tend to be associated with smaller values of y. In terms of the sample averages, the
sample covariance is positive when the x and y values are more likely to be on the same side of x̄ and ȳ, respectively,
than they are to be on opposite sides. On the other hand, there is a negative sample covariance when larger values of
x tend to be associated with smaller values of y and smaller values of x tend to be associated with larger values of y.
In terms of the sample averages, the sample covariance is negative when the x and y values are more likely to be on
opposite sides of x̄ and ȳ, respectively, than they are to be on the same side.
Example 7.9 Consider the following bivariate data with seven observations (n = 7):
{(xi, yi)} = {(4, 8), (3, 6), (8, 10), (12, 1), (0, 15), (10, 3), (5, 6)}.
Figure 7.9 shows a scatter plot of these data, indicating a clear negative relationship between x and y. The following
table provides a detailed calculation of (xi – x̄)(yi – ȳ) for each observation:
i 1 2 3 4 5 6 7
xi 4 3 8 12 0 10 5
yi 8 6 10 1 15 3 6
xi – x̄ –2 –3 2 6 –6 4 –1
yi – ȳ 1 –1 3 –6 8 –4 –1
(xi – x̄)(yi – ȳ) –2 +3 +6 –36 –48 –16 +1
There are three positive contributions and four negative contributions to the sample covariance, with the negative
contributions considerably larger in magnitude. In particular, the (12, 1) point and the (0, 15) point have contributions
of –36 and –48, respectively, since their x and y values are very far away from their respective means. The (12, 1) point
is a type (iii) observation with x above its mean and y below its mean, while the (0, 15) point is a type (iv) observation
with x below its mean and y above its mean. On the scatter plot, these two points are the ones in the lower-right corner
and the upper-left corner, respectively. Summing the values in the table’s bottom row and dividing by n – 1, the sample
covariance is
sxy = (1/(7–1))(–2 + 3 + 6 – 36 – 48 – 16 + 1) = (1/6)(–92) = –46/3 ≈ –15.33.
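(The calculation can be confirmed in R, either directly from the definition or with the built-in cov function.)

x <- c(4, 3, 8, 12, 0, 10, 5)
y <- c(8, 6, 10, 1, 15, 3, 6)
# sample covariance computed from its definition
sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)
## [1] -15.33333
cov(x, y)
## [1] -15.33333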
Example 7.10 (Monthly stock returns) In Example 7.8, the positive association between the monthly stock returns of
Home Depot (HD) and Lowe’s (LOW) is evident from the scatter plot (Figure 7.7). The sample covariance is positive,
with sHD,LOW = 0.004378 calculated in R using the cov function with the two variables as arguments:
cov(sp500$HD, sp500$LOW)
## [1] 0.004378172
To understand why the sample covariance is positive, Figure 7.10 re-draws the scatter plot of LOW versus HD with
a horizontal line drawn at the sample mean of LOW (0.02029) and a vertical line drawn at the sample mean of HD
(0.01668).
The positive contributions to the sample covariance are made by the observations in the upper-right quadrant
(type (i) observations) and the lower-left quadrant (type (ii) observations). The negative contributions to the sample
covariance are made by the observations in the lower-right quadrant (type (iii) observations) and the upper-left
quadrant (type (iv) observations). There are many more observations in the upper-right and lower-left quadrants
(positive contributions) than there are in the lower-right and upper-left quadrants (negative contributions). To be
more precise, there are 139 points of type (i), 141 points of type (ii), 43 points of type (iii), and 41 points of type (iv).
The sum of the (xi – x̄)(yi – ȳ) terms for the four types of observations are 0.933703 (type (i)), 0.842564 (type (ii)),
–0.075304 (type (iii)), and –0.111687 (type (iv)). Adding these four values together and dividing by n – 1 = 363 yields
the sample covariance sHD,LOW = 0.004378.
Example 7.11 (Education and earnings) In Example 7.7, the positive relationship between weekly earnings and
educational attainment was evident in the scatter plot of earnwk versus educ (Figure 7.5). As in Example 7.10,
the scatter plot can be re-drawn with lines at the sample means of the two variables. This scatter plot is shown
Figure 7.10
Scatter plot of LOW versus HD, with lines at sample means
in Figure 7.11, with a horizontal line at the sample mean of earnwk (971.18 dollars) and a vertical line at the
sample mean of educ (12.82 years). For these data, there are 600 upper-right quadrant (type (i)) observations, 1,113
lower-left quadrant (type (ii)) observations, 702 lower-right quadrant (type (iii)) observations, and 394 upper-left
quadrant (type (iv)) observations, with overall contributions to the sum of (xi – x̄)(yi – ȳ) given by 1388551, 811985, –335102, and –218892,
respectively. In this example, it is not just the number of upper-right quadrant (type (i)) observations that influences the
sample covariance but also the large magnitude of the (xi – x̄)(yi – ȳ) contributions caused by the considerable number
of points that have earnwk (y) values far above the sample mean of earnwk (ȳ). The resulting sample covariance is
seduc,earnwk = 586.375.
cov(cpsemployed$educ, cpsemployed$earnwk)
## [1] 586.3751
The units of the sample covariance seduc,earnwk are years × dollars or, equivalently, dollars × years. These strange
units make it very difficult to interpret what the numerical value 586.375 means. While the positive covariance does
reflect the positive relationship between earnwk and educ, the numerical value of the sample covariance may not be
an ideal statistic for quantifying this relationship.
As seen in Example 7.11, the sample covariance value can be difficult to interpret, even if the sign of the sample
covariance indicates whether there is a positive or negative relationship between two variables. The problem involves
the units of the sample covariance, which are the units of x times the units of y. It would be more useful if we had
a descriptive statistic that could be compared for different pairs of bivariate data. For example, is the relationship
Figure 7.11
Scatter plot of weekly earnings versus education, with lines at sample means
between educ and earnwk stronger than the relationship between hrslastwk and earnwk? The sample covariance is not
useful for this purpose since the units of the first sample covariance are years × dollars and the units of the second
sample covariance are hours × dollars. Even after the sample covariances seduc,earnwk and shrslastwk,earnwk are calculated,
their values are not comparable since they are measured in different units.
To address this undesirable feature of the sample covariance, we introduce the sample correlation, a descriptive
statistic that also measures the linear association between two variables but in a way that is comparable across different
pairs of variables.
Definition 7.2 For observations (x1, y1), (x2, y2), …, (xn, yn), the sample correlation between x and y, denoted rxy, is
rxy = sxy/(sx sy),
where sxy = (1/(n–1)) Σᵢ₌₁ⁿ (xi – x̄)(yi – ȳ), sx = √((1/(n–1)) Σᵢ₌₁ⁿ (xi – x̄)²), and sy = √((1/(n–1)) Σᵢ₌₁ⁿ (yi – ȳ)²).
The sample correlation between x and y is the sample covariance between x and y divided by the product of the
standard deviations of x and y. Importantly, since the units of sx are the units of x and the units of sy are the units of
y, the units of the numerator (units of x times units of y) cancel the units of the denominator, leading to the sample
correlation rxy being unitless. This fact and additional properties about the sample correlation are stated in the following
proposition:
Proposition 7.1. (Properties of the sample correlation) The sample correlation rxy has the following properties:
(i) rxy is unitless;
(ii) the sign of the sample correlation is the same as the sign of the sample covariance,
sign(rxy ) = sign(sxy );
(iii) rxx = 1;
(iv) –1 ≤ rxy ≤ 1.
Property (i) has already been discussed. For property (ii), note that both sx > 0 and sy > 0 since standard deviations
are positive. Therefore, the denominator in sxy/(sx sy) is also positive, meaning the sign of rxy must be the same as the
sign of the numerator sxy. Property (iii) says that the correlation of a variable x with itself is exactly equal to one. For
rxx = sxx/(sx sx), the numerator and denominator are both equal to the sample variance s²x, yielding rxx = 1. Finally, property (iv) says that
The proof of this property is beyond the scope of this book, but the property makes sense if two extreme cases are
considered. Intuitively, the strongest positive correlation should be between a variable x and itself, in which case the
sample correlation is equal to 1 from property (iii). Likewise, the strongest negative correlation should be between a
variable x and its negative (–x), in which case the sample correlation is equal to –1.¹⁷ As property (iv) states, any other
sample correlation has a value between the two extremes of –1 and 1.
To see how sample correlation values relate to the bivariate association in scatter plots, Figure 7.12 shows a set of
six different scatter plots, each with a different rxy value. The top row of three scatter plots has rxy = 0.4, rxy = 0.8, and
rxy = 1, and the bottom row of three scatter plots has rxy = –0.4, rxy = –0.8, and rxy = 0. The rxy = 1 scatter plot indicates
a perfect linear and positive relationship between x and y. In this case, a positive-sloped line can be drawn through
all of the points. For rxy = 0.4 and rxy = 0.8, both plots reveal a positive relationship between x and y, but the larger
correlation (rxy = 0.8) appears to describe a stronger positive relationship in the sense that the cloud of points is closer
to the extreme of a linear relationship as compared with the smaller correlation (rxy = 0.4). The comparison between
the rxy = –0.4 and rxy = –0.8 scatter plots is similar. Both indicate a negative relationship between x and y, with the
rxy = –0.8 plot indicating a stronger negative relationship as the cloud of points is closer to the extreme of a negative
linear relationship as compared with rxy = –0.4. Finally, the rxy = 0 scatter plot is a case where there is no evident
relationship between x and y. The cloud of points in this case does not show a tendency to be either upward sloping or
downward sloping.
Before considering some empirical examples involving the sample correlation, it’s important to discuss one potential
pitfall associated with the sample correlation measure. As stated previously, the sample correlation is only meant to
measure the linear association between two variables. As such, a sample correlation of zero means that there is no
linear relationship between two variables, but it does not necessarily mean that there is no relationship whatsoever
between the two variables. Figure 7.13 pictures a rather extreme example, with a scatter plot of points that lie exactly
along a parabola. For these bivariate data, there is a perfect non-linear relationship between x and y. However, it turns
out that the sample correlation between the two variables is exactly equal to zero. The scatter plot on the right, with
horizontal and vertical lines drawn at the sample means, indicates why this is the case. The positive contributions to
the sample covariance from the upper-right quadrant are cancelled out exactly by the negative contributions from the
upper-left quadrant, and similarly the positive contributions from the lower-left quadrant are cancelled out exactly by
the negative contributions from the lower-right quadrant. The resulting sample covariance is zero, meaning the sample
correlation is also zero.
There’s no way to avoid this feature of the sample correlation, as the sample correlation is specifically designed
to measure the linear association between two variables. However, this non-linear scatter plot example highlights the
importance of using both descriptive visuals, like scatter plots, and descriptive statistics, like sample correlations, when
examining the association between variables.
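The parabola example can be reproduced numerically as well; here is a minimal sketch using a made-up symmetric grid of points (not the actual data behind Figure 7.13):
x <- seq(-2, 2, by = 0.5)
y <- x^2       # a perfect non-linear (parabolic) relationship
cor(x, y)      # zero, up to floating-point error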
With the sample correlation measure now in our descriptive statistics toolkit, we re-visit the previous examples.
Example 7.12 (Education and earnings) In Example 7.11, the sample covariance of educational attainment and
weekly earnings was seduc,earnwk = 586.375. Dividing by the product of the sample standard deviations of educ and
earnwk (with seduc = 2.4030) gives the sample correlation reduc,earnwk = seduc,earnwk/(seduc searnwk) ≈ 0.325, as the R
output below confirms.
Figure 7.12
Scatter plots for different sample correlations
Figure 7.13
Scatter plot for a sample with zero sample correlation
cor(cpsemployed$educ, cpsemployed$earnwk)
## [1] 0.3251519
Since the sample correlation is unitless, it doesn’t matter which units are used for educ or earnwk. For instance, if
earnings were measured in thousands of dollars rather than dollars (i.e., earnwkthous = earnwk/1000),
reduc,earnwkthous = seduc,earnwkthous/(seduc searnwkthous) = (seduc,earnwk · (1/1000))/(seduc · (searnwk · (1/1000))) = reduc,earnwk.
Similarly, if education were measured in months rather than years (i.e., educmonths = 12educ),
reducmonths,earnwk = seducmonths,earnwk/(seducmonths searnwk) = (seduc,earnwk · 12)/((seduc · 12) searnwk) = reduc,earnwk.
With the sample correlation being unitless, the sample correlation reduc,earnwk can also be compared to the sample
correlation between earnwk and some other variable. For instance, for the relationship between weekly earnings and
hours worked, the sample correlation is rhrslastwk,earnwk ≈ 0.368.
cor(cpsemployed$hrslastwk, cpsemployed$earnwk)
## [1] 0.3681961
A comparison of this value with the reduc,earnwk value (0.325) indicates that the two sample correlations are quite
similar, with some evidence that the positive relationship is slightly larger for earnwk and hrslastwk than it is for
earnwk and educ.
Example 7.13 (Monthly stock returns) Example 7.10 considered the relationship between monthly stock returns for
Home Depot (HD) and Lowe’s (LOW) by drawing a scatter plot and providing the sample covariance sHD,LOW . The
sample correlation between HD and LOW is
rHD,LOW = sHD,LOW/(sHD sLOW) = 0.004378/((0.073706)(0.091598)) ≈ 0.648.
cor(sp500$HD, sp500$LOW)
## [1] 0.6484954
A sample correlation of 0.648 seems pretty high, but how does it compare to the sample correlations for other pairs
of stocks? Since the sample correlations for other pairs of stocks can be directly compared to the rHD,LOW value, we
can look at the sample correlations among a larger group of stocks. Previously, Bank of America (BAC) and Wells
Fargo (WFC) were considered in the expanded scatter plot of Example 7.8. Let’s add two more companies to the mix,
specifically Marathon Oil (stock ticker MRO) and ConocoPhillips (stock ticker COP), both of which are in the oil
industry.
A correlation matrix can be created in R using the cor function and a single argument that contains a data frame
with the desired variables:
cor(sp500[,c("HD","LOW","BAC","WFC","MRO","COP")])
## HD LOW BAC WFC MRO COP
## HD 1.0000000 0.6484954 0.3311974 0.2803797 0.1887287 0.2145689
## LOW 0.6484954 1.0000000 0.3566068 0.2619187 0.1811412 0.2560374
## BAC 0.3311974 0.3566068 1.0000000 0.6919604 0.3313437 0.3392213
## WFC 0.2803797 0.2619187 0.6919604 1.0000000 0.3793915 0.3960973
## MRO 0.1887287 0.1811412 0.3313437 0.3793915 1.0000000 0.7709914
## COP 0.2145689 0.2560374 0.3392213 0.3960973 0.7709914 1.0000000
This correlation matrix provides all possible pairwise sample correlations of the monthly returns for a set of
six stocks (HD, LOW, BAC, WFC, MRO, COP). Interestingly, all of the sample correlations are positive, which is
consistent with the idea that all six companies are affected by overall macroeconomic conditions, especially since the
time frame here (over 20 years of returns data) is quite long. There are a lot of duplicate values in the correlation
matrix since the sample correlation of x and y (rxy ) is the same as the sample correlation of y and x (ryx ). For instance,
rBAC,LOW = rLOW,BAC = 0.357 appears both as the third value in the second row and the second value in the third row.
To make the correlation matrix easier to read, it is common to report only the upper-right triangle of sample
correlations (above the diagonal) or the lower-left triangle (below the diagonal). Here is a version of the correlation
matrix that reports the upper-right triangle of sample correlations:
      HD     LOW    BAC    WFC    MRO    COP
HD    1.000  0.648  0.331  0.280  0.189  0.215
LOW          1.000  0.357  0.262  0.181  0.256
BAC                 1.000  0.692  0.331  0.339
WFC                        1.000  0.379  0.396
MRO                               1.000  0.771
COP                                      1.000
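A display like this can be produced in R; here is a minimal sketch using base functions (the object name cm is ours):
cm <- round(cor(sp500[,c("HD","LOW","BAC","WFC","MRO","COP")]), 3)
cm[lower.tri(cm)] <- NA      # blank out the duplicated lower triangle
print(cm, na.print = "")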
The three highest values in the correlation matrix are rMRO,COP = 0.771, rBAC,WFC = 0.692, and rHD,LOW = 0.648. Not
coincidentally, these three stock pairs correspond to the three pairs of companies that are in the same industry, with
MRO and COP in the oil industry, BAC and WFC in the banking industry, and HD and LOW in the home improvement
industry. Looking at COP, for instance, its sample correlation with MRO (0.771) is much larger than its sample
correlations with the other four companies, which range from 0.215 to 0.396. Similarly, for HD, its sample correlation
with LOW (0.648) is much larger than its sample correlations with the other four companies, which range from 0.189
to 0.331. The four lowest values in the correlation matrix are rLOW,MRO = 0.181, rHD,MRO = 0.189, rHD,COP = 0.215, and
rLOW,COP = 0.256, suggesting that the shocks affecting the home improvement and oil industries are less related to each
other than the shocks affecting other pairs of the three industries.
# sample mean of weekly earnings for non-union workers
mean(cpsemployed$earnwk[cpsemployed$unionstatus != "Union"])
## [1] 946.5013
# sample correlation between union indicator and weekly earnings
union_var <- ifelse(cpsemployed$unionstatus=="Union",1,0)
cor(union_var, cpsemployed$earnwk)
## [1] 0.0996309
Among the 276 union workers with x = 1, the sample mean of weekly earnings is $1,198 (ȳ1 = 1198). Among the
2,533 non-union workers with x = 0, the sample mean of weekly earnings is $947 (ȳ0 = 947). Since ȳ1 > ȳ0 , the sample
correlation between x and y is positive, with rxy ≈ 0.100.
The results from Proposition 7.2 hold if y is also a binary variable. In that case, ȳ1 is the proportion of the x = 1
subsample with y = 1, and ȳ0 is the proportion of the x = 0 subsample with y = 1. The sample correlation between x and
y is positive if the proportion of y = 1 observations is higher in the x = 1 subsample than it is in the x = 0 subsample,
and negative otherwise.
svw = (1/(n–1)) Σ_{i=1}^n (vi – v̄)(wi – w̄)
= (1/(n–1)) Σ_{i=1}^n (a + bxi – (a + bx̄))(c + dyi – (c + dȳ))
= (1/(n–1)) Σ_{i=1}^n (b(xi – x̄))(d(yi – ȳ))
= bd · (1/(n–1)) Σ_{i=1}^n (xi – x̄)(yi – ȳ)
= bd sxy,
where the second equality follows from v̄ = a + bx̄ and w̄ = c + dȳ (Proposition 6.6).
Then, the sample correlation between v and w is
rvw = svw/(sv sw) = (bd sxy)/((|b|sx)(|d|sy)) = (bd/(|b||d|)) · (sxy/(sx sy)) = (bd/(|b||d|)) rxy.
The expression bd/(|b||d|) is equal to 1 if b and d have the same sign, which occurs when b and d are both positive or both
negative, and –1 if b and d have opposite signs. Therefore, rvw = rxy when b and d have the same sign, and rvw = –rxy
when b and d have opposite signs. The additive constants a and c do not affect either svw or rvw.
These results for svw and rvw are summarized in the following proposition:
Proposition 7.3. Suppose a, b, c, and d are known constants, v = a + bx is a linear transformation of x,
and w = c + dy is a linear transformation of y. The sample covariance and sample correlation for the sample
{(v1 , w1 ), (v2 , w2 ), …, (vn , wn )} have the following relationships to the sample covariance and sample correlation for
the sample {(x1 , y1 ), (x2 , y2 ), …, (xn , yn )}:
(i) (sample covariance) svw = bd sxy. The sign of svw is the same as the sign of sxy when b and d are both positive or
both negative, and the sign of svw is opposite the sign of sxy when b and d have opposite signs.
(ii) (sample correlation) rvw = rxy if b and d are both positive or both negative, and rvw = –rxy if b and d have
opposite signs.
(iii) (transforming one variable and not the other) svy = b sxy, and rvy = rxy if b ≥ 0 and rvy = –rxy if b < 0.
Property (iii), where one variable (v = a + bx) is transformed but the other is not, is just a special case of transforming
both variables with c = 0 and d = 1.
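Proposition 7.3 is straightforward to verify numerically; the sketch below uses made-up data and constants (all names and values are ours):
set.seed(1)
x <- rnorm(100)
y <- 0.5*x + rnorm(100)
v <- 3 + 2*x           # v = a + bx with a = 3, b = 2
w <- -1 - 4*y          # w = c + dy with c = -1, d = -4
all.equal(cov(v, w), 2*(-4)*cov(x, y))   # property (i): svw = bd*sxy
all.equal(cor(v, w), -cor(x, y))         # property (ii): b > 0, d < 0, so rvw = -rxy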
In a situation where v and w are constructed by simply changing the units of x and y, in which case b and d would
be positive numbers, the sign of svw is the same as the sign of sxy , but the magnitude of svw is scaled by bd. In this case,
rvw = rxy , meaning the sample correlation has the nice property that it is not affected by the units of the two variables.
Example 7.15 (Earnings and height) Suppose x denotes weekly earnings and y denotes height in inches, with sample
covariance sxy and sample correlation rxy. If v is annualized earnings (v = 52x) and w is height in feet (w = (1/12)y), the
sample covariance svw and the sample correlation rvw are
svw = (52)(1/12) sxy = (13/3) sxy and rvw = rxy.
Example 7.16 (Earnings and education) Suppose x denotes weekly earnings and y denotes educational attainment
in years. If x is transformed into annualized earnings, so that v = 52x, the sample covariance between annualized
earnings and education (in years) is svy = 52sxy , and the sample correlation is rvy = rxy . For the cps dataset, the sample
covariance and sample correlation are sxy = 586.3751 and rxy = 0.325, so that svy = 30491.51 and rvy = 0.325.
Example 7.17 (Website profits) Recall the website profit example considered in Example 6.26, where x is the daily
widget purchases, p is the price per widget, f is the website’s daily fixed cost, and c is the marginal cost per widget.
Let’s assume that p > c, which means that the price for a widget is greater than the marginal cost of a widget. The daily
profit is y = –f + (p – c)x. If we have data on both x and y, the sample covariance is sxy = (p – c)sxx = (p – c)s²x, and the
sample correlation is 1 since (p – c) > 0. There is a perfect linear relationship between x and y here since y is defined
to be a linear function of x, based on the three constants (f , p, c).
Suppose we also have data on a variable u measuring the daily number of unique visitors to the website. A positive
sample correlation is expected between u and x (more likely to have more purchases on days with more visitors and
fewer purchases on days with fewer visitors), but it would not be a perfect linear relationship. In thinking about the
relationship between daily visitors and daily profits, how would the sample covariance suy and the sample correlation
ruy compare to the sample covariance sux and the sample correlation rux ? Since suy = (p – c)sux and ruy = rux (since
(p – c) > 0), the sample correlation between visitors and profits is the same as the sample correlation between visitors
and purchases. The equality ruy = rux arises since profits have a perfect linear relationship with purchases.
Figure 7.14
Variation in x + y for different sample correlations
From the formula s²v = s²x + s²y + 2sxy, a negative correlation (sxy < 0) leads to s²v < s²x + s²y. The variance of the
sum of the variables is less than the sum of the variances of the variables in this case. The reduction in variance arises
precisely because of the negative relationship between x and y, where the tendency of the variables to be on opposite
sides of their means also leads to a tendency for x + y to be closer to its mean.
Next, we discuss the other result in property (ii), which involves the sample variance of the difference of two
variables. When v = x – y, the resulting sample variance is s²v = s²x + s²y – 2sxy. In the case of no correlation between x
and y (sxy = 0), we have s²v = s²x + s²y, so that the sample variance of the difference of two variables is the sum of the
i i
i i
i i
“"ps4e (ECO 329 Fall 2024)"” — 2024/8/20 — 7:04 — page 169 — #176
i i
variances of the two variables. When there is correlation, the reasoning for the presence of the –2sxy is similar to that
from above. Let's start with the positive correlation (sxy > 0) case. When x and y are positively correlated, their values
tend to be on the same side of their respective means, so that the difference x – y tends to be smaller in magnitude than
it would be in the case of zero correlation. The positive relationship between x and y leads to counteracting effects for
x – y, in contrast to the “exaggerating” effects for x + y, and causes the sample variance s²v to be lower, due to the –2sxy
term, than it would be in the zero correlation case. On the other hand, when x and y are negatively correlated, they tend
to be on opposite sides of their respective means, which means the difference x – y tends to be larger in magnitude as
compared to the case of zero correlation. The negative relationship between x and y increases the dispersion of x – y,
with the term –2sxy being positive since sxy < 0.
Example 7.18 (Earnings of siblings) Suppose we have data on the earnings of a sample of adult siblings, where x is
weekly earnings for the older sibling and y is weekly earnings for the younger sibling. For these variables, a positive
correlation (rxy > 0) is expected due to common factors like similar parenting, similar educational background, similar
socioeconomic background, etc. Thus, the variance of their combined weekly earnings, v = x + y, is larger than the sum
of the variances of their individual weekly earnings: s²v = s²x+y > s²x + s²y. The positive correlation between the siblings'
earnings leads to a greater dispersion in the sum of their earnings. As compared to two randomly chosen individuals
from the population (i.e., non-siblings), in which case there is zero correlation, there is a larger variance for the sum
of the earnings of adult siblings.
How about the average of the siblings' weekly earnings? For v = (x + y)/2 = (1/2)x + (1/2)y, the sample variance of v is
s²v = s²(1/2)x+(1/2)y = (1/4)s²x + (1/4)s²y + 2(1/2)(1/2)sxy = (1/4)s²x + (1/4)s²y + (1/2)sxy.
How about the difference between the siblings’ weekly earnings, say v = x – y (older sibling’s wages minus younger
sibling’s wages)? For v = x – y, the sample variance of v is
s²v = s²x–y = s²x + s²y – 2sxy,
with the positive correlation in their wages leading to a decreased dispersion in the difference, as compared to the
zero correlation case.
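These variance relationships can be checked numerically; below is a minimal sketch with made-up, positively correlated data standing in for the siblings' earnings (all names and values are ours):
set.seed(7)
x <- rnorm(50, mean = 1000, sd = 200)          # older sibling's weekly earnings
y <- 0.6*x + rnorm(50, mean = 400, sd = 150)   # positively correlated with x
var(x + y)
var(x) + var(y) + 2*cov(x, y)   # equals var(x + y)
var(x - y)
var(x) + var(y) - 2*cov(x, y)   # equals var(x - y), and is smaller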
Example 7.19 (Two-stock portfolio) Suppose we have data on the returns for two different stocks, with x denoting
the return for stock A and y denoting the return for stock B. These could be monthly returns or annual returns or
returns for some other time interval. We won’t specify the time interval for now, in the interest of keeping things
general. By applying linear combinations, these data can be used to see how a particular two-stock portfolio would
have performed over the same time period. Specifically, consider the two-stock portfolio where a fraction a is invested
in stock A and the remainder, (1 – a), is invested in stock B, with a being a constant between 0 and 1. Then, the return
on the two-stock portfolio is the linear combination
v = ax + (1 – a)y.
Applying Proposition 7.4, the sample mean and variance of the return on the two-stock portfolio are
v̄ = ax̄ + (1 – a)ȳ
and
s²v = a²s²x + (1 – a)²s²y + 2a(1 – a)sxy.
The sample mean of the portfolio return is a weighted average of the sample means for the two stocks, with weight
a placed on stock A's sample mean and weight 1 – a placed on stock B's sample mean. The sample variance s²v is
a measure of the risk associated with the portfolio since it tells us how much variability there is in the portfolio’s
observed returns. Part of this sample variance comes from the sample variances of the individual stocks, reflected by
the a²s²x and (1 – a)²s²y terms in the s²v formula. But there is also a third term, 2a(1 – a)sxy. If the stocks' returns are
positively correlated, the 2a(1 – a)sxy term is positive since sxy > 0 (a and 1 – a are also positive since 0 < a < 1). The stock
returns move together in this case, with the returns tending to be on the same side of their respective means, leading
to an increased variance or risk of the two-stock portfolio, relative to the case of no correlation. On the other hand, if
the stocks' returns are negatively correlated, the 2a(1 – a)sxy term is negative and leads to a decreased variance or risk
of the two-stock portfolio, relative to the case of no correlation.
Interestingly, even in the case of zero correlation between the stocks' returns (sxy = 0), there is a “diversification”
effect that reduces the variance or risk of the two-stock portfolio. Using the fact that k² < k for any positive constant
k < 1, we can show this effect as follows:
s²v = a²s²x + (1 – a)²s²y < a s²x + (1 – a)s²y ≤ a max(s²x, s²y) + (1 – a) max(s²x, s²y) = max(s²x, s²y),
which means that s²v is less than the maximum of the two sample variances s²x and s²y. For this case of zero correlation,
the same must then also be true for the standard deviations: sv < max(sx, sy).
Let's consider real-world examples of two-stock portfolios from the sp500 dataset. The following table shows sample
means and standard deviations for three stocks (Bank of America (BAC), Wells Fargo (WFC), and ConocoPhillips
(COP)) and also two different two-stock portfolios, one equally weighted (a = 1 – a = 0.5) between BAC and WFC
and one equally weighted (a = 1 – a = 0.5) between COP and WFC:

                       x̄        sx
BAC                    0.01295  0.10530
WFC                    0.01351  0.08157
COP                    0.01093  0.08173
(1/2)BAC + (1/2)WFC    0.01323  0.08607
(1/2)COP + (1/2)WFC    0.01222  0.06822
For the BAC-WFC portfolio, we can confirm that, for the sample means, 0.01323 = (0.5)(0.01295) + (0.5)(0.01351).
For the sample standard deviation, using the sample correlation rBAC,WFC = 0.692 (Example 7.13) and the fact that
sBAC,WFC = rBAC,WFC sBAC sWFC , we have
s²(1/2)BAC+(1/2)WFC = 0.25s²BAC + 0.25s²WFC + 0.5sBAC,WFC
= 0.25(0.10530)² + 0.25(0.08157)² + 0.5(0.692)(0.10530)(0.08157) ≈ 0.00741,
so that the portfolio standard deviation is √0.00741 ≈ 0.08607, matching the value reported in the table.
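The portfolio return series itself can be constructed directly from the data and summarized; a minimal sketch (the variable name port is ours):
port <- 0.5*sp500$BAC + 0.5*sp500$WFC   # equally weighted BAC-WFC portfolio
mean(port)
sd(port)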
Figure 7.15
Means and standard deviations for weighted two-stock portfolios
For a linear combination of three variables, v = k + ax + by + cz, the sample mean is v̄ = k + ax̄ + bȳ + cz̄,
analogous to property (i) of Proposition 7.4. For the sample variance s²v, the expression is more complicated:
s²v = a²s²x + b²s²y + c²s²z + 2ab sxy + 2ac sxz + 2bc syz.
This formula says that the sample variance of v involves the variances of the three variables individually, through the
a²s²x + b²s²y + c²s²z component, but it also involves each of the possible pairwise covariances between the three variables.
With three variables, there are (3 choose 2) = 3 different pairs of variables and, thus, three different covariances.
These results can be generalized to an even larger number of variables in the linear combination. The following
proposition provides the general results for the case of m ≥ 2 variables:
Proposition 7.5. Suppose k and a1, a2, …, am are known constants, and
v = k + a1x1 + a2x2 + ··· + amxm = k + Σ_{j=1}^m aj xj
is a linear combination of the m variables x1 , x2 , …, xm . The descriptive statistics for the sample {v1 , v2 , …, vn } have
the following relationships to the descriptive statistics for the sample of observations for the variables x1 , x2 , …, xm :
(i) (sample mean) v̄ = k + a1 x̄1 + a2 x̄2 + ··· + am x̄m = k + Σ_{j=1}^m aj x̄j
(ii) (sample variance) s²v = Σ_{j=1}^m aj² s²xj + 2 Σ_{j=1}^{m–1} Σ_{ℓ=j+1}^m aj aℓ sxjxℓ
(iii) (sample standard deviation) sv = √s²v = √( Σ_{j=1}^m aj² s²xj + 2 Σ_{j=1}^{m–1} Σ_{ℓ=j+1}^m aj aℓ sxjxℓ )
Property (i) remains simple, with the sample mean of v having the same linear relationship with the sample means
of the variables as in the original linear combination. As seen above for the case of three variables (m = 3), property (ii)
says that the sample variance of the linear combination v involves each of the sample variances of the m variables,
through the Σ_{j=1}^m aj² s²xj summation term, but also all possible pairwise covariances among the m variables, through the
double summation term. There are (m choose 2) = m(m – 1)/2 terms in the double summation, corresponding to the possible variable
pairs among the m variables. Finally, property (iii) says that the sample standard deviation, as always, is equal to the
square root of the sample variance.
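Property (ii) can also be checked numerically, using the fact that the double-sum expression equals the quadratic form a′Sa, where S is the sample covariance matrix of the m variables; here is a minimal sketch with made-up data (all names are ours):
set.seed(42)
X <- matrix(rnorm(300), ncol = 3)   # 100 observations of x1, x2, x3
a <- c(2, -1, 0.5)
v <- 10 + X %*% a                   # linear combination with k = 10
all.equal(as.numeric(var(v)), as.numeric(t(a) %*% cov(X) %*% a))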
to be able to choose whether to receive the treatment, as that choice itself could be related to many outside factors
whereas the randomization, by construction, is not.
Notes
16 Alternatively, a single argument can be used for plot if it consists of a data frame with the x variable as the first column and the y variable as
the second column. For Figure 7.5, the appropriate argument would be cps[,c("educ","wagehr")].
17 The sample correlation between x and –x is equal to –1 since rx,–x = sx,–x/(sx s–x) = –sx,x/(sx sx) = –s²x/s²x = –1, using the facts that
sx,–x = –sx,x and s–x = sx.
Exercises
1. Use the cps dataset for this question, focusing on the subsample of 2,809 employed individuals.
(a) Provide a table of the joint sample counts for hourly (rows) and race (columns).
(b) Provide a table of the joint sample proportions for hourly (rows) and race (columns).
(c) Same as (b), but condition on race so that the column values sum to one. What does the table say about the
relationship between race and being paid hourly?
(d) Draw a figure similar to Figure 7.1 with a bar plot of hourly-wage status by race. As compared to Figure 7.1,
this figure will have only two categories (“Hourly” and “Non-hourly”) on the x-axis.
(e) Draw box plots of weekly earnings (earnwk) by hourly-wage status.
(f) Draw the two densities of weekly earnings (earnwk), one for hourly workers and one for non-hourly workers,
on the same graph. Use different line styles or colors to differentiate the two densities.
(g) Based on (e) and (f), how does the sample mean of weekly earnings for hourly workers compare to that of
non-hourly workers? How about the sample standard deviation? How about the sample 90% quantile?
2. Use the sp500 dataset for this question. The variable IDX represents the monthly return for the overall stock market
(as measured by the S&P 500 index).
(a) Use the command sp500$mkt <- ifelse(sp500$IDX>=0,"Up","Down") to create a categorical
variable mkt whose value is Up when the market has a positive monthly return and Down when the market has
a negative monthly return.
(b) Provide a table of mkt. What proportion of observations have mkt equal to Up?
(c) Use the summary and sd functions to provide summary statistics of Apple’s monthly returns (AAPL).
(d) Repeat (c) on the two subsamples of observations corresponding to mkt = Up and mkt = Down, using the
tapply function. How do the sample means and sample standard deviations compare to each other for the
two subsamples?
(e) Draw two densities of AAPL, one for the mkt = Up subsample and one for the mkt = Down subsample, on the
same graph. Use different line styles or colors to differentiate the two densities.
(f) Defining the binary variable mktup to be 1 if mkt = Up and 0 if mkt = Down, use Proposition 7.2 to calculate the
sample correlation between AAPL and mktup based upon the sample means and sample standard deviations of
AAPL and mktup. The necessary information is available from the answers to (b), (c), and (d).
3. For a sample size of six observations and two variables x and y, draw a possible scatter plot of data for which
(xi – x̄)(yi – ȳ) is positive for every i = 1, …, 6 but x and y are not perfectly correlated.
4. Use the dataset auctions for this question. The dataset consists of 684 eBay auctions for Apple iPod Mini devices in
June and July 2006. The binary variables new, used, and refurb indicate the condition of the device (e.g., a new device
has new = 1 and used = refurb = 0). For this question, focus only on the subsample of 624 auctions of used and new
devices, so drop those with refurb = 1.
(a) Draw side-by-side box plots of auction sales prices (finalprice, in dollars) for used devices and new devices.
(b) What is the sample correlation between finalprice and new?
(c) Draw a scatter plot of finalprice versus the number of bidders (bidders).
(d) What is the sample correlation between finalprice and bidders?
(e) For the plot in (c), does the variability of finalprice change for larger values of bidders?
(f) For the plot in (c), does the average finalprice appear to always increase as the value of bidders increases?
Explain.
5. For a sample of firms, there are data on x = electricity purchased (in kilowatt-hours, or kwh) and y = firm revenues
(in dollars). What are the units of the following?
(a) Sample average of electricity purchased
(b) Variance of firm revenues
(c) Covariance between electricity purchased and firm revenues
(d) Correlation between electricity purchased and firm revenues
6. Last week, widgets.com had daily sales of 24, 13, 19, 21, 12, 28, and 18 widgets.
(a) What is the sample median of daily widget sales?
(b) What is the sample mean of daily widget sales?
(c) What is the sample standard deviation of daily widget sales?
(d) If daily fixed cost is 100 dollars and profit margin per widget is 20 dollars, daily profits are given by –100 +
20 widgets. Using the results for linear transformations of univariate data, what are the sample median, sample
mean, and sample standard deviation of daily profits?
(e) What is the sample correlation between daily widget sales and daily profits?
(f) What is the sample covariance between daily widget sales and daily profits?
7. A survey of CEOs collected data on x = salary (in thousands of dollars) and y = education (in years) for each CEO.
The sample has x̄ = 200, sx = 30, ȳ = 17, and sy = 3. The sample covariance between x and y is sxy = 50.
(a) What is the sample correlation between salary and education?
(b) What is the sample covariance between salary in dollars (not thousands of dollars) and education?
(c) What is the sample correlation between salary in dollars and education?
8. You have data on the monthly returns of two stocks A and B, given respectively by the variables x and y. The sample
variance for stock A is 0.006 (s²x = 0.006) and the sample variance for stock B is 0.008 (s²y = 0.008).
(a) What must be true about the correlation between x and y for the average return (1/2)(x + y) to have a sample
variance less than or equal to 0.007?
(b) What must be true about the correlation between x and y for the average return (1/2)(x + y) to have a sample
variance less than or equal to 0.006?
(c) What must be true about the correlation between x and y for the difference in returns (x – y) to have a sample
variance less than 0.012?
9. You have data for 1,230 individuals, including their education (in years, denoted educ) as well as the education, in
years, for each individual’s mother (motheduc) and father (fatheduc). The sample correlation matrix is:
           educ   motheduc  fatheduc
educ       1.000  0.452     0.440
motheduc          1.000     0.599
fatheduc                    1.000
The sample variance of educ is 5.543, the sample variance of motheduc is 5.190, and the sample variance of fatheduc
is 10.653. The sample covariance between motheduc and fatheduc is 4.454.
(a) What is the sample covariance between educ and motheduc? What are the units?
(b) What is the sample variance of the sum of motheduc and fatheduc?
(c) What is the sample variance of the average of motheduc and fatheduc?
(d) What is the sample variance of the difference of motheduc and fatheduc?
(e) Explain why the sample variance in (b) is higher than the sample variance in (d).
(f) What additional information is needed to calculate the sample variance of the average of all three education
variables (educ, motheduc, fatheduc)?
10. Use the exams dataset for this question.
(a) Draw a scatter plot of the second exam score (exam2) versus the first exam score (exam1), with lines drawn
through the sample means of the two variables. Specify the range of both axes to be between 0 and 100.
(b) Using the appropriate R commands (rather than counting points in the scatter plot), how many points are in
each of the four quadrants of the scatter plot in (a)?
(c) What is the sample correlation between exam1 and exam2?
(d) Based on (c), which of the following would have a higher sample standard deviation: the sum of the two exam
scores or the difference of the two exam scores? Answer without R.
(e) If you standardize both exam1 and exam2 (by de-meaning and dividing by the sample standard deviation), what
would be the sample correlation between the two standardized exam scores? Answer without R.
(f) If you standardize both exam1 and exam2 as in (e), what would be the variance of the sum of the two
standardized exam scores? Answer without R.
(g) Create two new variables exam1_std and exam2_std with the standardized exam scores. Suppose the instructor
calculates a composite score (score) as the sum of 0.75 times the higher standardized exam score and 0.25
times the lower standardized exam score. Create the new variable score. What are the sample mean and sample
standard deviation of the composite scores? If the instructor would like the top 20 students in the class to get
an A, what would be the appropriate cutoff for the composite scores?
11. (a) *For two variables x and y, show that the sample covariance of x and y is
sxy = (n/(n – 1)) ((xy)‾ – x̄ȳ),
where (xy)‾ = (1/n) Σ_{i=1}^n xi yi is the sample average of the xi yi values.
(b) Suppose that x ∈ {0, 1} and y ∈ {0, 1} are binary variables. Using the result in (a) and the fact that a binary x
has sample variance s²x = (n/(n – 1)) x̄(1 – x̄), provide a formula for rxy in terms of x̄, ȳ, and (xy)‾.
(c) A city has two newspapers, the Daily Bugle and the Daily Planet. The binary variable x indicates whether a
local company advertises in the Daily Bugle in a given year (1 means yes, 0 means no), and the binary variable
y indicates whether a local company advertises in the Daily Planet in a given year (1 means yes, 0 means no).
The following table describes the advertising behavior of a sample of 80 local companies in a given year:
             y
             0     1
x     0      24    28
      1      18    10
Using the result from (b), what is the sample correlation between x and y?
12. Use the sp500 dataset for this question. If the data are not already visible in the top-left window of RStudio, use the
command View(sp500).
(a) First, focus on the first 20 stocks that appear in the spreadsheet. Ignore IDX, which is in the second column, so
the first 20 stocks are given by the stock tickers AAPL through APA. Output the descriptive statistics for these
20 stocks, using the command summary(sp500[,3:22]). What is the sample mean for AMD?
(b) Use the command sapply(sp500[,3:22], sd) to calculate sample standard deviations for the stocks
in (a). The sapply function “applies” the sd function to each of the columns of the first argument
sp500[,3:22]. Which stock has the highest sample standard deviation? Which stock has the lowest sample
standard deviation?
(c) Create a new variable that contains the monthly returns for a portfolio with equal (1/2) weights on the first two
stocks (AAPL, ABMD). What are the sample mean and sample standard deviation for this two-stock portfolio?
(d) Create a new variable that contains the monthly returns for a portfolio with equal (1/3) weights on the first
three stocks (AAPL, ABMD, ABT). What are the sample mean and sample standard deviation for this three-stock
portfolio?
(e) *Continue this process to get an equally weighted 4-stock portfolio with the first 4 stocks, an equally weighted
5-stock portfolio with the first 5 stocks, and so on, through an equally weighted 20-stock portfolio with the
first 20 stocks. For each portfolio, calculate the sample mean and sample standard deviation. Then, make
two plots: (i) sample mean versus the number of stocks (ranging from 2 to 20) in the portfolio and (ii)
sample standard deviation versus the number of stocks (ranging from 2 to 20) in the portfolio. (Hint: A useful
function is rowMeans, which creates a vector that averages across columns of a data frame. For example, the
command portfolio <- rowMeans(sp500[,3:8]) creates a portfolio variable corresponding
to an equally weighted portfolio consisting of the first six stocks, in columns 3 through 8.)
(f) *Rather than using the first 20 stocks as in (e), instead choose 20 stocks randomly without replacement from
the full set of stocks, contained in columns 3 through 268 of the data frame. Calculate the sample average
and sample standard deviation for the equally weighted 20-stock portfolio. Use a loop to do this 1,000 times,
randomly picking 20 stocks each time, and store the sample means and sample standard deviations along
the way. Plot the histogram and/or smoothed density of the 1,000 sample means. Plot the histogram and/or
smoothed density of the 1,000 sample standard deviations.
8 Discrete random variables
Chapter 5 introduced the concept of numerical variables, including both discrete numerical variables and continuous
numerical variables. Then, Chapter 6 introduced several descriptive statistics for summarizing numerical variables.
At this point, we take a step back and, applying the concepts of probability theory from Chapter 2, formally model
the process by which a numerical variable arises and is observed. This chapter provides the theoretical framework for
discrete numerical variables, and Chapter 10 introduces the theoretical framework for continuous numerical variables.
While the approaches for the two types of variables have similarities, the mathematics are sufficiently different that it
is useful to consider them separately. For instance, while discrete summations are used for much of the analysis in the
discrete case, integration is necessary for much of the analysis in the continuous case. Although categorical variables
are not covered explicitly in this chapter, we have already seen that categorical variables can be represented by one or
more discrete (binary) numerical variables, so the framework discussed in this chapter is applicable.
Definition 8.1 A random variable X is a function that maps each outcome of the sample space S to a number.
While this definition indicates that X is a function, it is standard to suppress the argument when referring to the
random variable X, in the interest of brevity. If e is a simple event (outcome) from S, it is more accurate to say that
X(e) is some numerical value. Usually, however, the argument e is suppressed, with X instead of X(e) used to denote
the random variable. The standard convention is to use a capital letter, like X rather than x, to denote a random variable.
For modeling discrete variables, a specific type of random variable known as a discrete random variable is used:
Definition 8.2 A discrete random variable is a random variable with a finite or countable set of possible values.
The notation for the set of possible values, referred to in Definition 8.2, is the same as introduced in Section 8.1,
with x1∗ , x2∗ , …, xK∗ for finite K and x1∗ , x2∗ , …, xk∗ , … for infinite K.
Here are some examples of discrete random variables:
• Coin toss: S = {H, T}, X = 1 if H, X = 0 if T
• Roll of a die: S = {1, 2, 3, 4, 5, 6}, X is equal to the outcome
• Three website visitors, purchase (Y) or not (N):
S = {YYY, YYN, YNY, YNN, NYY, NYN, NNY, NNN}
– Total number of purchases X = the number of purchases in the three-visitor sequence
– Indicator of at least two purchases T = 1 if the number of purchases is ≥ 2 and 0 otherwise
• Website visitors until first purchase: S = {Y, NY, NNY, NNNY, …}, with corresponding X values 1, 2, 3, 4, …
• Number of firm patents: S = {0, 1, 2, 3, …}, X is equal to the outcome
• Income threshold: S contains all possible annual income values, X = 1 if outcome > $100,000 and 0 otherwise
Definition 8.3 The probability mass function (pmf) of a discrete random variable X, denoted pX (·), gives the
probability of each possible value of X:
pX (xk∗ ) = P(X = xk∗ ).
For each possible value xk∗, the probability pX(xk∗) can be determined as follows:
• Step 1: Find all of the outcomes in S for which X = xk∗:
A = {e : X(e) = xk∗}.
• Step 2: Calculate the probability of the event A from Step 1, so that pX(xk∗) = P(A).
Let's determine the pmf for four of the examples listed above:
• Coin toss: For a fair coin, the pmf is pX(0) = pX(1) = 0.5.
• Roll of a die: For a fair die, the pmf is pX(xk∗) = 1/6 for each of the six possible outcomes.
• Three website visitors, with X = total number of purchases: Suppose that the purchase probability is 20% (0.2)
for each visitor and that the purchase behavior by each visitor is independent of other visitors. Then, the pmf is
pX (0) = P(NNN) = 0.8³ = 0.512,
pX (1) = P(YNN ∪ NYN ∪ NNY) = (0.2)(0.8)(0.8) + (0.8)(0.2)(0.8) + (0.8)(0.8)(0.2) = 0.384,
pX (2) = P(YYN ∪ YNY ∪ NYY) = (0.2)(0.2)(0.8) + (0.2)(0.8)(0.2) + (0.8)(0.2)(0.2) = 0.096,
and
pX (3) = P(YYY) = 0.2³ = 0.008.
• X = website visitors until first purchase, with the same assumptions as the previous example: The pmf is
pX (1) = 0.2,
pX (2) = (0.8)(0.2) = 0.16,
pX (3) = (0.8)(0.8)(0.2) = 0.128,
and, so on, with the general formula
pX (xk∗ ) = (0.8)^(k–1) (0.2).
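These last two pmf's can be double-checked against R's built-in binomial and geometric pmf functions (dbinom and dgeom; note that dgeom counts the number of non-purchasing visitors before the first purchase, so its argument is k – 1):
dbinom(0:3, size = 3, prob = 0.2)   # number of purchases among three visitors
## [1] 0.512 0.384 0.096 0.008
dgeom(0:2, prob = 0.2)              # visitors until first purchase, k = 1, 2, 3
## [1] 0.200 0.160 0.128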
For these four examples, Figure 8.1 provides a graph of each of the four pmf’s. Each graph has the probability values
on the y-axis and possible xk∗ values on the x-axis, with each vertical line indicating the probability associated with any
given possible value xk∗ . For instance, the pmf for the coin toss, in the top-left graph, has two vertical lines drawn at x = 0
and x = 1, each with height 0.5, corresponding to the equal probabilities of tails (X = 0) and heads (X = 1). Similarly,
the pmf for the die roll, in the top-right graph, has vertical lines with heights 1/6 at each of the six possible values. In
the lower-right graph for the fourth example (website visitors until first purchase), note that the x-axis only extends to
a maximum of xk∗ = 20 even though K is infinite. Although the x-axis could be extended more, the probability values
become very close to zero as xk∗ gets larger; for example, at xk∗ = 20, the probability is pX (20) ≈ 0.00288 = 0.288%.
Figure 8.1
Probability mass functions for four examples
Definition 8.4 The cumulative distribution function (cdf) of a discrete random variable X, denoted FX (·), gives the
probability that X is less than or equal to any argument x0 of FX (·):
FX (x0 ) = P(X ≤ x0 ) = Σ_{xk∗ ≤ x0} pX (xk∗ ).
The generic argument x0 can take on any possible value on the real line. Though x0 can be equal to a possible value
xk∗ , the definition also allows for x0 values that are between possible values of the random variable. The cdf has the
following properties:
0 ≤ FX (x0 ) ≤ 1 for every x0
and
x0 < x1 =⇒ FX (x0 ) ≤ FX (x1 ).
The first property follows directly from the fact that FX (x0 ) = P(X ≤ x0 ) is a probability and must therefore be
between 0 and 1 (inclusive). The second property says that the cdf is a weakly increasing function. For x0 < x1 , this
property holds since
FX (x1 ) = P(X ≤ x1 ) = P(X ≤ x0 or x0 < X ≤ x1 ) ≥ P(X ≤ x0 ) = FX (x0 ).
Let’s consider the four examples from Section 8.2.2 to illustrate the concept of cdf’s.
Example 8.2 (Coin toss) For a fair coin, the pmf and the cdf at 0 and 1 are given in the following table:
xk∗ pX (xk∗ ) FX (xk∗ )
0 0.5 0.5
1 0.5 1
How about the cdf FX (x0 ) for other values of x0 ? For x0 = –0.4, FX (–0.4) = P(X ≤ –0.4) = 0 since there are no possible
values of X below –0.4. The same is true of any negative x0 value, so that FX (x0 ) = 0 when x0 < 0. This means that the
FX (·) function jumps from 0 to 0.5 exactly at the point x0 = 0. For x0 = 0.6 (a point between 0 and 1), FX (0.6) = P(X ≤
0.6) = P(X = 0) = 0.5. The same is true of any x0 that is strictly between 0 and 1, meaning the FX (·) function jumps from
0.5 to 1 exactly at the point x0 = 1. Finally, for any x0 value greater than 1, FX (x0 ) = P(X ≤ x0 ) = P(X ≤ 1) = 1.
Taking these results together, the cdf is a step function, as shown in Figure 8.2. The graph has been drawn with
the x-axis extending from –1 to 2, but it should be understood that the cdf extends to the left forever (with a value of
0) and to the right forever (with a value of 1). The solid lines indicate the cdf value at any given x0 value. There is
a “closed dot” and an “open dot” at the x0 values where the function jumps up. The closed dot is the cdf value at
the corresponding point. For instance, at x0 = 0, the cdf value is FX (0) = 0.5, represented by the closed dot and not the
open dot. Similarly, at x0 = 1, the cdf value is FX (1) = 1, represented by the closed dot and not the open dot.
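The step function in Figure 8.2 can be reproduced with the base R function stepfun; a minimal sketch (the object name Fcoin is ours):
Fcoin <- stepfun(c(0, 1), c(0, 0.5, 1), right = FALSE)   # cdf jumps at 0 and 1
Fcoin(-0.4); Fcoin(0.6); Fcoin(1.5)                      # 0, 0.5, 1
plot(Fcoin, verticals = FALSE, main = "cdf for fair coin toss")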
Example 8.3 (Six-sided die) For a fair die, the pmf and the cdf for the possible outcome values are given in the table
below:
xk∗   pX(xk∗)   FX(xk∗)
1     1/6       1/6
2     1/6       2/6
3     1/6       3/6
4     1/6       4/6
5     1/6       5/6
6     1/6       1
For any number x0 in between these outcomes, the cdf FX (x0 ) is the FX (xk∗ ) for the largest xk∗ that is less than or
equal to x0. For instance, if x0 = 3.7, the cdf is FX (3.7) = P(X ≤ 3.7) = P(X ≤ 3) = 1/2. Also, FX (x0 ) = 0 for x0 < 1, and FX (x0 ) = 1
for x0 > 6. Figure 8.3 shows the cdf for the fair die roll. Again, the closed dots indicate the cdf values at the x0 points
where the cdf jumps up, and the cdf extends forever to the left (with value 0) and to the right (with value 1).
Example 8.4 (Three website visitors) For the example with three website visitors and independent purchase
probabilities of 0.2, the pmf and cdf of X = number of purchases are given by the following table:
xk∗   pX(xk∗)   FX(xk∗)
0     0.512     0.512
1     0.384     0.896
2     0.096     0.992
3     0.008     1
Figure 8.2
Cumulative distribution function for fair coin toss
Figure 8.3
Cumulative distribution function for fair die roll
For the example of website visitors until first purchase, the cdf values for other x0 values can be determined in
the same way as the last example. For example, for x0 = 2.8 (between 2 and 3), the cdf is FX (2.8) = FX (2) = 0.36. As
k increases, the cdf value gets closer and closer to one. Mathematically, FX (xk∗ ) = 1 – (0.8)^k, and as k → ∞,
(0.8)^k → 0 so that FX (xk∗ ) → 1. The cdf never reaches one, but its value gets closer and closer to one for larger k.
Figure 8.4
Cumulative distribution function for number of purchases
Definition 8.5 The population mean (or population average or expected value) of a discrete random variable X,
denoted µX or E(X), is
µX = E(X) = Σ_k xk∗ pX (xk∗ ).
The population mean µX or expected value E(X) is a weighted average of the possible xk∗ , where the weights are the
pmf probabilities pX (xk∗ ) = P(X = xk∗ ). Notice the similarity between Σ_k xk∗ pk (for the sample mean) and Σ_k xk∗ pX (xk∗ )
(for the population mean), both of which are weighted averages of the xk∗ values, with the weights being the sample
proportions for the sample mean and the true outcome probabilities for the population mean. Recall the frequentist
interpretation of true probabilities discussed in Section 2.3, where we viewed the probability of an outcome as the
long-run frequency or proportion of the outcome being observed over repeated experiments. Here, for the thought
experiment of taking many repeated draws from the population, the probability pX (xk∗ ) is the number that the sample
proportion pk approaches as the sample size n gets arbitrarily large. Since this relationship holds for each probability
pX (xk∗ ) and each pk , the sample mean x̄ should also approach the population mean µX as the sample size n gets arbitrarily
large. This result, known as the Law of Large Numbers, is formalized in Chapter 13, but it is worthwhile to introduce
the intuition here. To summarize, for any given sample, generally it is the case that pk ≠ pX (xk∗ ) for each of the possible
outcomes and x̄ ≠ µX , with equality occurring only by chance; but, as the sample size increases, the sample proportion
pk gets closer to the true probability pX (xk∗ ) for each possible outcome and x̄ gets closer to µX .
Example 8.7 (Union status) Let X be a binary random variable representing union status, equal to 1 for a union
worker and 0 for a non-union worker. The population consists of the union status x ∈ {0, 1} for all possible workers.
The population mean is
µX = E(X) = 0 × pX (0) + 1 × pX (1) = pX (1).
The population mean is equal to the probability of union status in the population, pX (1) = P(X = 1). As there is nothing
special about union status in this example, µX = P(X = 1) holds for any binary random variable X ∈ {0, 1}.
Example 8.8 (Six-sided die) For X ∈ {1, 2, 3, 4, 5, 6} being the outcome of a fair die roll, the pmf probabilities are
pX (xk∗ ) = 1/6 for each outcome. The population mean or expected value of X is
µX = E(X) = 1 × (1/6) + 2 × (1/6) + 3 × (1/6) + 4 × (1/6) + 5 × (1/6) + 6 × (1/6) = 21/6 = 3.5.
Example 8.9 (Three website visitors) For the example with three website visitors and independent purchase
probabilities of 0.2, Example 8.4 provided the pmf of X = number of purchases. The table below lists the pmf values
pX (xk∗ ) for xk∗ ∈ {0, 1, 2, 3} and calculates the xk∗ pX (xk∗ ) terms in the summation for µX .
xk∗   pX(xk∗)   xk∗ pX(xk∗)
0     0.512     0
1     0.384     0.384
2     0.096     0.192
3     0.008     0.024
Summing the xk∗ pX(xk∗) column gives the population mean µX = E(X) = 0.384 + 0.192 + 0.024 = 0.6.
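As a quick check in R (using the binomial pmf function dbinom for the three-visitor example; the variable names are ours):
xk <- 0:3
pk <- dbinom(xk, size = 3, prob = 0.2)   # 0.512, 0.384, 0.096, 0.008
sum(xk * pk)                             # population mean E(X)
## [1] 0.6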
While we have been able to calculate the population mean in the examples above, there is no guarantee that the
population mean or expected value is well-defined (i.e., a finite number), as the following example illustrates:
Example 8.10 (Infinite expected value) Suppose pX (x) = 1/x for any x that is a power of two — that is, x ∈
{2, 4, 8, …, 2^k, …}. These outcomes and probabilities constitute a valid pmf since Σ_{k=1}^∞ (1/2^k) = (1/2)/(1 – 1/2) = 1, but the
expected value E(X) = Σ_{k=1}^∞ 2^k (1/2^k) = Σ_{k=1}^∞ 1 is infinite.
The sample variance is (approximately) a weighted average of the (xk∗ – x̄)² values, where the weights are the associated
sample proportions of the outcomes. For the population variance, there are two main differences from this sample
variance formula. First, as with the population mean, the weights are the true probabilities of the possible outcomes
rather than the sample proportions. Second, the sample mean, which appears in each of the (xk∗ – x̄)² expressions, is
replaced by the population mean µX . The formal definition of the population variance is given below:
Definition 8.6 The population variance of a discrete random variable X, denoted σ²X or Var(X), is
σ²X = Var(X) = E[(X – µX )²] = Σ_k (xk∗ – µX )² pX (xk∗ ).
The sample variance is a weighted average of the squared difference of possible outcomes from the sample mean,
and the population variance is a weighted average of the squared difference of possible outcomes from the population
mean. The weights for the sample variance are the sample proportions, and the weights for the population variance
are the true probabilities pX (xk∗ ) = P(X = xk∗ ). For a given sample, the sample proportions generally differ from the pmf
probabilities (pk ≠ pX (xk∗ )), leading to a sample mean that differs from the population mean (x̄ ≠ µX ) and a sample
variance that differs from the population variance (s²x ≠ σ²X ). That said, if we again conduct the thought experiment
of taking many repeated draws from the population, it is expected that the sample variance s²x becomes closer to the
population variance σ²X as the sample size n gets larger and larger. The reason here is that, as the sample size n becomes
very large, (i) the sample proportions pk get closer to the true pmf probabilities pX (xk∗ ) and (ii) the sample mean x̄ gets
closer to the population mean µX , which taken together imply that each (xk∗ – x̄)² pk term in s²x gets closer to each
(xk∗ – µX )² pX (xk∗ ) term in σ²X .
We also define the population standard deviation associated with the random variable X:
Definition 8.7 The population standard deviation of a discrete random variable X, denoted σX or sd(X), is
σX = sd(X) = √σ²X = √( Σ_k (xk∗ – µX )² pX (xk∗ ) ).
The population standard deviation σX is different from the sample standard deviation sx in the same way that the
population variance σ²X is different from the sample variance s²x . Similar to the variance measures, the sample standard
deviation sx is expected to get closer to the population standard deviation σX as the sample size n gets larger and larger.
Example 8.11 (Union status) In Example 8.7, where X ∈ {0, 1} is a binary random variable representing union status
(1 for union, 0 for non-union), the population mean was shown to be µX = pX (1). Using this result and the fact that
pX (0) = 1 – pX (1), the population variance is
σ²X = Var(X) = (0 – µX )² pX (0) + (1 – µX )² pX (1)
= (0 – pX (1))² pX (0) + (1 – pX (1))² pX (1)
= (0 – pX (1))² (1 – pX (1)) + (1 – pX (1))² pX (1)
= (pX (1)² + (1 – pX (1)) pX (1)) (1 – pX (1))
= pX (1)(1 – pX (1)).
For this binary X, the population variance is pX (1)(1 – pX (1)), and the population standard deviation is
σX = sd(X) = √( pX (1)(1 – pX (1)) ).
As in Example 8.7, there’s nothing special about union status here, so these are general results for a binary random
variable X ∈ {0, 1}. These results will be re-visited when binary random variables are discussed further in Chapter 9.
Example 8.12 (Six-sided die) In Example 8.8, the population mean was shown to be µX = 3.5 for the random variable
X ∈ {1, 2, 3, 4, 5, 6} being the outcome of a fair die roll. Since the probability of each outcome is 1/6, we have
σ²X = Var(X) = Σ_{k=1}^6 (xk∗ – 3.5)² pX (xk∗ )
= Σ_{k=1}^6 (xk∗ – 3.5)² (1/6)
= ((–2.5)² + (–1.5)² + (–0.5)² + (0.5)² + (1.5)² + (2.5)²) (1/6)
= (17.5)(1/6) = 35/12 ≈ 2.9167.
The population standard deviation is σX = √(35/12) ≈ 1.7078. To see how the sample descriptive statistics relate to the
population descriptive statistics as the sample size n increases, Figure 8.5 shows the results from a computer simulation
of rolling a fair die 5,000 times. After each die roll, the following four quantities are updated: (i) sample proportion of
6 being the outcome of a roll (top-left graph), (ii) sample mean of the die rolls (top-right graph), (iii) sample variance
of the die rolls (bottom-left graph), and (iv) sample standard deviation of the die rolls (bottom-right graph). The figure
provides plots of these four quantities as the number of tosses n (along the x-axis) increases to 5,000. For comparison
purposes, the corresponding population statistics are drawn as horizontal dotted lines on each plot. These values are
the probability of a 6 (pX (6) = 1/6), the population mean (µX = 3.5), the population variance (σ²X = 35/12), and the
population standard deviation (σX = √(35/12)). As evident in the four graphs, the sample descriptive statistic in each case gets very
close to its corresponding population statistic as n increases toward 5,000.
Figure 8.5 is created in R by simulating the 5,000 die rolls and updating the four running statistics after each roll.
The code begins by setting a random seed so that the simulation is reproducible; a minimal sketch of the remaining
steps follows (with our variable names, and with only the running-mean plot shown):
set.seed(1234)
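# (sketch continues; the variable names below are ours, not necessarily the textbook's)
n <- 5000
rolls <- sample(1:6, n, replace = TRUE)             # simulate 5,000 fair die rolls
prop6 <- cumsum(rolls == 6) / (1:n)                 # running proportion of 6's
runmean <- cumsum(rolls) / (1:n)                    # running sample mean
runvar <- sapply(1:n, function(i) var(rolls[1:i]))  # running sample variance (NA at i = 1)
runsd <- sqrt(runvar)                               # running sample standard deviation
plot(1:n, runmean, type = "l", xlab = "n", ylab = "Sample mean of rolls")
abline(h = 3.5, lty = 3)                            # population mean, for comparison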
8.4.1 Joint probability mass function and joint cumulative distribution function
Let K and L denote the number of possible outcomes for X and Y, respectively, where K and/or L may be infinite.
The possible outcomes for X are {x1∗ , x2∗ , …, xK∗ } if K is finite and {x1∗ , x2∗ , …, xk∗ , …} if K is infinite. The possible
outcomes for Y are {y1∗ , y2∗ , …, yL∗ } if L is finite and {y1∗ , y2∗ , …, yℓ∗ , …} if L is infinite. The concept of the probability
mass function can be extended to a joint probability mass function as follows:
Figure 8.5
Descriptive statistics for 5,000 simulated die rolls
Definition 8.8 The joint probability mass function (joint pmf) of two discrete random variables X and Y, denoted
pXY (·, ·), gives the joint probability
pXY (xk∗ , yℓ∗ ) = P(X = xk∗ ∩ Y = yℓ∗ ) = P(X = xk∗ , Y = yℓ∗ )
for every possible outcome xk∗ for X and every possible outcome yℓ∗ for Y.
The collection of possible (xk∗ , yℓ∗ ) values is a set of disjoint and exhaustive outcomes, meaning the joint pmf satisfies
the following properties:
0 ≤ pXY (xk∗ , yℓ∗ ) ≤ 1 for any possible outcome pair (xk∗ , yℓ∗ )
and
Σ_{(k,ℓ)} pXY (xk∗ , yℓ∗ ) = Σ_k Σ_ℓ pXY (xk∗ , yℓ∗ ) = 1.
The joint pmf probabilities are each between zero and one (inclusive) and sum to one. The summations in the
expression above are written in two equivalent ways: (i) Σ_{(k,ℓ)} is a summation over all possible pairs (k, ℓ), and
(ii) Σ_k Σ_ℓ is a “double summation” with the inner summation taken over possible ℓ values and the outer summation
taken over possible k values.
If X and Y are both finite discrete random variables, the number of possible joint outcomes is KL. Following the
approach that was taken for probability tables in Section 3.3, the joint pmf can be represented by a table as follows:
       y1∗               y2∗               ···   yL∗
x1∗    pXY (x1∗ , y1∗ )   pXY (x1∗ , y2∗ )   ···   pXY (x1∗ , yL∗ )
x2∗    pXY (x2∗ , y1∗ )   pXY (x2∗ , y2∗ )   ···   pXY (x2∗ , yL∗ )
⋮      ⋮                 ⋮                 ⋱     ⋮
xK∗    pXY (xK∗ , y1∗ )   pXY (xK∗ , y2∗ )   ···   pXY (xK∗ , yL∗ )
Example 8.15 (Phone and computer ownership) Continuing Example 8.13, where X = number of phones owned and
Y = number of computers owned, let's assume that no individual in the population owns more than two phones or
more than two computers, so that the possible outcomes are {0, 1, 2} for each variable (K = 3 and L = 3). Suppose the
joint pmf is given by the following probability table:
                      Y (# computers)
                      0      1      2
              0       0.06   0.03   0.01
X (# phones)  1       0.22   0.48   0.05
              2       0.02   0.04   0.09
The joint pmf values in this table are the probabilities associated with the population and not sample proportions
based upon an observed sample. For example, pXY (0, 1) = 0.03 indicates there is a 3% probability that an individual
drawn from the population doesn’t own a phone and owns one computer. Similarly, pXY (1, 2) = 0.05 indicates that there
is a 5% probability that an individual drawn from the population owns one phone and two computers.
The concept of the cdf can also be extended to the case of two discrete random variables, as follows:
Definition 8.9 The joint cumulative distribution function (joint cdf) of two discrete random variables, denoted
FXY (·, ·), gives the probability that both X and Y are less than or equal to their corresponding arguments:
FXY (x0 , y0 ) = P(X ≤ x0 ∩ Y ≤ y0 ) = P(X ≤ x0 , Y ≤ y0 ).
The joint cdf has the following properties:
0 ≤ FXY (x0 , y0 ) ≤ 1 for every x0 and y0
and
x0 < x1 =⇒ FXY (x0 , y0 ) ≤ FXY (x1 , y0 ) and y0 < y1 =⇒ FXY (x0 , y0 ) ≤ FXY (x0 , y1 ).
The first property follows from the fact that FXY (x0 , y0 ) is a probability. The first part of the second property says that the
joint cdf is weakly increasing in its first argument; that is, if y0 is held fixed, the value of the joint cdf weakly increases
as x0 is increased. Similarly, the second part says that the joint cdf is weakly increasing in its second argument, so
holding x0 fixed, the value of the joint cdf weakly increases as y0 is increased.
Example 8.16 (Phone and computer ownership) Using the probability table in Example 8.15, the joint cdf can be
calculated at any specified arguments. For example, with x0 = 2 and y0 = 1, the joint cdf FXY (2, 1) = P(X ≤ 2, Y ≤ 1)
is the sum of all the probabilities from the first two columns, which is 0.85. If y0 is held fixed at y0 = 1, focusing
on the subpopulation of individuals who own one computer, the joint cdf values for the possible values of X are
FXY (0, 1) = 0.09, FXY (1, 1) = 0.79, and FXY (2, 1) = 0.85. These joint cdf values increase as x0 increases with y0 fixed.
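Continuing with the jp matrix from the previous code chunk, these joint cdf values can be verified by summing the appropriate block of the table (the helper function Fxy is our own):

Fxy <- function(x0, y0) sum(jp[1:(x0+1), 1:(y0+1)])  # sum cells with X <= x0 and Y <= y0
Fxy(2, 1)
## [1] 0.85
Fxy(0, 1)
## [1] 0.09
Fxy(1, 1)
## [1] 0.79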
Definition 8.10 The conditional probability mass function (conditional pmf) of X given Y, denoted pX|Y(·|·), is

pX|Y(xk*|yℓ*) = P(X = xk*|Y = yℓ*) = pXY(xk*, yℓ*) / pY(yℓ*).

Similarly, the conditional pmf of Y given X, denoted pY|X(·|·), is

pY|X(yℓ*|xk*) = P(Y = yℓ*|X = xk*) = pXY(xk*, yℓ*) / pX(xk*).
For the probability table when K and L are finite, the conditional probability pX|Y(xk*|yℓ*) is focused on the joint probabilities in the column corresponding to Y = yℓ*, and the conditional probability pY|X(yℓ*|xk*) is focused on the joint probabilities in the row corresponding to X = xk*. The conditional pmf is itself a pmf, having the properties that its probabilities are nonnegative and add up to one. Therefore, a conditional cdf can be defined based upon the conditional pmf's probabilities:
Definition 8.11 The conditional cumulative distribution function (conditional cdf) of X given Y, denoted FX|Y(·|·), gives the probability that X is less than or equal to any argument x0 conditional on Y = yℓ*:

FX|Y(x0|yℓ*) = P(X ≤ x0|Y = yℓ*) = Σ_{xk* ≤ x0} pX|Y(xk*|yℓ*).

Similarly, the conditional cdf of Y given X, denoted FY|X(·|·), gives the probability that Y is less than or equal to any argument y0 conditional on X = xk*:

FY|X(y0|xk*) = P(Y ≤ y0|X = xk*) = Σ_{yℓ* ≤ y0} pY|X(yℓ*|xk*).
Example 8.17 (Phone and computer ownership) The joint probability table from Example 8.15 is replicated below, with the marginal pmf's for X and Y now included.

                      Y (# computers)
                      0      1      2      pX(x)
               0      0.06   0.03   0.01   0.10
X (# phones)   1      0.22   0.48   0.05   0.75
               2      0.02   0.04   0.09   0.15
        pY(y)         0.30   0.55   0.15
Dividing each column of joint probabilities by the corresponding marginal probability of Y gives the conditional pmf of X given Y; for instance, conditioning on Y = 1 gives pX|Y(0|1) = 0.03/0.55 = 3/55, pX|Y(1|1) = 0.48/0.55 = 48/55, and pX|Y(2|1) = 0.04/0.55 = 4/55. Likewise, conditioning on X = 0 gives pY|X(0|0) = 0.6, pY|X(1|0) = 0.3, and pY|X(2|0) = 0.1. In both cases, the conditional pmf values sum to one (3/55 + 48/55 + 4/55 = 1 conditioning on Y = 1, and 0.6 + 0.3 + 0.1 = 1 conditioning on X = 0), as expected for pmf's.
Since a conditional pmf is just a special case of a pmf, it is natural to introduce conditional versions of the population
descriptive statistics, including the mean, variance, and standard deviation:
Definition 8.12 The population conditional mean or conditional expectation of X given Y = yℓ*, denoted µX|Y=yℓ*, is

µX|Y=yℓ* = E(X|Y = yℓ*) = Σ_k xk* pX|Y(xk*|yℓ*).
Definition 8.13 The population conditional variance of X given Y = yℓ*, denoted σ²X|Y=yℓ*, is

σ²X|Y=yℓ* = Var(X|Y = yℓ*) = Σ_k (xk* – µX|Y=yℓ*)² pX|Y(xk*|yℓ*).
Definition 8.14 The population conditional standard deviation of X given Y = yℓ*, denoted σX|Y=yℓ*, is

σX|Y=yℓ* = sd(X|Y = yℓ*) = √(σ²X|Y=yℓ*) = √( Σ_k (xk* – µX|Y=yℓ*)² pX|Y(xk*|yℓ*) ).
These definitions are stated for the conditional distribution of X given Y. For the conditional distribution of Y given
X, the same definitions can be used with the roles of X and Y reversed. For example, the population conditional mean
or conditional expectation of Y given X = xk* is

µY|X=xk* = E(Y|X = xk*) = Σ_ℓ yℓ* pY|X(yℓ*|xk*).
Example 8.18 (Phone and computer ownership) In Example 8.17, the conditional pmf associated with phone ownership (X) given that an individual owned one computer (Y = 1) is: pX|Y(0|1) = 3/55, pX|Y(1|1) = 48/55, and pX|Y(2|1) = 4/55. Then, using the definitions above, the population conditional mean of X given Y = 1 is

µX|Y=1 = E(X|Y = 1) = 0 × (3/55) + 1 × (48/55) + 2 × (4/55) = 56/55 ≈ 1.018,
the population conditional variance of X given Y = 1 is

σ²X|Y=1 = Var(X|Y = 1) = (0 – 56/55)² × (3/55) + (1 – 56/55)² × (48/55) + (2 – 56/55)² × (4/55) ≈ 0.1269,
and the population conditional standard deviation of X given Y = 1 is

σX|Y=1 = sd(X|Y = 1) ≈ √0.1269 ≈ 0.3563.
These population descriptive statistics describe the number of phones owned (X) in the subpopulation of individuals
who own one computer. The unconditional population mean of X is µX = (0)(0.10) + (1)(0.75) + (2)(0.15) = 1.05, and the unconditional population variance of X is σ²X = (0 – 1.05)²(0.10) + (1 – 1.05)²(0.75) + (2 – 1.05)²(0.15) = 0.2475.
Therefore, the conditional distribution of X given Y = 1 has a slightly lower population mean and a much lower
population variance, indicating that knowing Y = 1 provides useful information about the distribution of X.
Similarly, we can find the population descriptive statistics associated with the conditional pmf of Y given X = 0 found in Example 8.17, which is pY|X(0|0) = 0.6, pY|X(1|0) = 0.3, and pY|X(2|0) = 0.1. The population conditional mean of Y given X = 0 is

µY|X=0 = E(Y|X = 0) = 0 × 0.6 + 1 × 0.3 + 2 × 0.1 = 0.5,
the population conditional variance of Y given X = 0 is

σ²Y|X=0 = Var(Y|X = 0) = (0 – 0.5)² × 0.60 + (1 – 0.5)² × 0.30 + (2 – 0.5)² × 0.10 = 0.45,

and the population conditional standard deviation of Y given X = 0 is σY|X=0 = √0.45 ≈ 0.6708.
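The conditional calculations in Example 8.18 can be verified with a few lines of R; this sketch again reuses the jp matrix from Example 8.15 (the object names are our own):

pX.given.Y1 <- jp[,"1"]/sum(jp[,"1"])   # conditional pmf of X given Y = 1
x <- 0:2
mu <- sum(x*pX.given.Y1)                # conditional mean (56/55)
sigma2 <- sum((x - mu)^2*pX.given.Y1)   # conditional variance
round(c(mu, sigma2, sqrt(sigma2)), 4)
## [1] 1.0182 0.1269 0.3563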
Definition 8.15 The population covariance between two discrete random variables X and Y, denoted σXY, is

σXY = Cov(X, Y) = Σ_(k,ℓ) (xk* – µX)(yℓ* – µY) pXY(xk*, yℓ*).
If we conduct the thought experiment of taking many repeated draws from the population, the sample covariance sxy should get close to the population covariance σXY as the sample size n gets larger and larger. We expect this to occur since (i) the joint probability pXY(xk*, yℓ*) is the true long-run frequency that the sample proportion pkℓ approaches for large n and (ii) the sample means x̄ and ȳ approach the population means µX and µY for large n.
It is important to stress the difference between the sample covariance and the population covariance. While both are measures of linear association, the sample covariance measures the association between x and y in the observed sample, and the population covariance measures the association between X and Y in the population. For the population covariance, the terms (xk* – µX)(yℓ* – µY) pXY(xk*, yℓ*) in the summation make positive contributions for possible outcomes xk* and yℓ* on the same side of their respective population means, µX and µY, and negative contributions for possible outcomes xk* and yℓ* on opposite sides of their respective population means. The overall population covariance depends upon the magnitude and the weighting of each term, where the weights are the joint probabilities pXY(xk*, yℓ*). The sign of the population covariance indicates, for a joint draw of X and Y from the population, whether the X and Y draws tend to be on the same side of their population means (a positive covariance) or on opposite sides of their population means (a negative covariance).
The units of the population covariance σXY are the units of X times the units of Y. Like the sample covariance,
these units make the population covariance difficult to interpret. As before, it is useful to have a unitless measure of
linear association, this time in the population, that is easier to interpret. With the population covariance defined, the
population correlation is defined analogously to the sample correlation:
Definition 8.16 The population correlation between two discrete random variables X and Y, denoted ρXY, is

ρXY = Corr(X, Y) = σXY / (σX σY).
The population covariance and correlation have properties analogous to those seen for the sample covariance and
correlation (Proposition 7.1).
Proposition 8.4. For random variables X and Y, the population covariance σXY and the population correlation ρXY
satisfy the following properties:
(i) ρXY is unitless;
(ii) the sign of the population correlation is the same as the sign of the population covariance,
sign(ρXY ) = sign(σXY );
(iii) ρXX = 1;
(iv) –1 ≤ ρXY ≤ 1.
Proposition 8.4 does not assume that X and Y are discrete random variables. These properties hold more generally
for any type of numerical random variables X and Y, so there is no reason to restrict things to the case of discrete
random variables.
Example 8.19 (Phone and computer ownership) With X = number of phones owned and Y = number of computers
owned, recall the joint pmf from Example 8.15:
Y (# computers)
0 1 2
0 0.06 0.03 0.01
X (# phones) 1 0.22 0.48 0.05
2 0.02 0.04 0.09
We have already seen that the population means are µX = 1.05 and µY = 0.85. Thus, the population covariance is
σXY = (0 – 1.05)(0 – 0.85)(0.06) + (0 – 1.05)(1 – 0.85)(0.03) + (0 – 1.05)(2 – 0.85)(0.01)
+(1 – 1.05)(0 – 0.85)(0.22) + (1 – 1.05)(1 – 0.85)(0.48) + (1 – 1.05)(2 – 0.85)(0.05)
+(2 – 1.05)(0 – 0.85)(0.02) + (2 – 1.05)(1 – 0.85)(0.04) + (2 – 1.05)(2 – 0.85)(0.09)
= 0.1275.
The population correlation is

ρXY = σXY / (σX σY) = 0.1275 / (√0.2475 × √0.4275) ≈ 0.392,
where the values for σX and σY were calculated in Example 8.17. The population correlation ρXY ≈ 0.392 indicates a
positive relationship between the random variables X and Y.
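Here is a sketch of the covariance and correlation calculation in R, once more reusing the jp matrix (the helper names are our own). The same steps applied to the probability table of Example 8.20 below reproduce σXY = 0.12 and ρXY ≈ 0.492:

x <- 0:2; y <- 0:2
px <- rowSums(jp); py <- colSums(jp)
muX <- sum(x*px); muY <- sum(y*py)
sigmaXY <- sum(outer(x - muX, y - muY)*jp)   # weighted sum over all (x,y) cells
sigmaX <- sqrt(sum((x - muX)^2*px))
sigmaY <- sqrt(sum((y - muY)^2*py))
round(c(cov=sigmaXY, corr=sigmaXY/(sigmaX*sigmaY)), 4)
##    cov   corr
## 0.1275 0.3920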
Example 8.20 (Stock price up or down?) Suppose X and Y are defined as binary random variables based upon the monthly returns of two underlying stocks, stock A and stock B:

X = 1 if stock A's price goes up during the month (positive return), and X = 0 if stock A's price goes down or stays the same during the month (non-positive return);

Y = 1 if stock B's price goes up during the month (positive return), and Y = 0 if stock B's price goes down or stays the same during the month (non-positive return).

We want to know whether there is a relationship between X and Y in the population. Let's assume that the joint pmf of X and Y is given by the following probability table:
           Y
           0      1
     0     0.30   0.15
X    1     0.10   0.45
Since both X and Y are binary variables, their population means and variances are

µX = pX(1) = 0.55,   σ²X = pX(1)(1 – pX(1)) = 0.2475,

and

µY = pY(1) = 0.60,   σ²Y = pY(1)(1 – pY(1)) = 0.24.
The population covariance between X and Y is
σXY = (0 – 0.55)(0 – 0.60)(0.30) + (0 – 0.55)(1 – 0.60)(0.15)
+(1 – 0.55)(0 – 0.60)(0.10) + (1 – 0.55)(1 – 0.60)(0.45) = 0.12,
and the population correlation between X and Y is

ρXY = σXY / (σX σY) = 0.12 / (√0.2475 × √0.24) ≈ 0.492.
There is a positive relationship between X and Y, as it is much more likely for X and Y to both be above their population means or both be below their population means (75%) as opposed to being on opposite sides of their population means (25%). When X = 1, it is much more likely that Y = 1 than in the overall population; the conditional probability pY|X(1|1) = 0.45/0.55 ≈ 0.818 as compared to the unconditional probability pY(1) = 0.60. Similarly, when X = 0, it is much more likely that Y = 0 than in the overall population; the conditional probability pY|X(0|0) = 0.30/0.45 ≈ 0.667 as compared to the unconditional probability pY(0) = 0.40.
Definition 8.17 Two discrete random variables X and Y are independent if and only if

pXY(xk*, yℓ*) = pX(xk*) pY(yℓ*)   for every possible outcome pair (xk*, yℓ*)

or, equivalently,

FXY(xk*, yℓ*) = FX(xk*) FY(yℓ*)   for every possible outcome pair (xk*, yℓ*).

If this equality fails for any (xk*, yℓ*), then the discrete random variables X and Y are dependent.
Independent discrete random variables X and Y have joint probabilities pXY(xk*, yℓ*) equal to the product of the marginal probabilities pX(xk*) and pY(yℓ*) for every possible outcome pair (xk*, yℓ*). This concept of independence is
closely related to the independence of events discussed in Chapter 3, specifically that two events are independent if and
only if their joint probability is equal to the product of the marginal probabilities of the two events (Proposition 3.6).
Here, any given outcome of X being observed can be thought of as an event and, likewise, any given outcome of Y
being observed can be thought of as an event. Framed in this way, Definition 8.17 is equivalent to any possible outcome
(event) associated with X being independent of any possible outcome (event) associated with Y, in the sense discussed
in Chapter 3.
To show that two discrete random variables are dependent, it is only necessary to show that there is one outcome pair (xk*, yℓ*) for which pXY(xk*, yℓ*) ≠ pX(xk*) pY(yℓ*).
Example 8.21 (Phone and computer ownership) Continuing Example 8.19, are X (number of phones owned) and Y (number of computers owned) independent? The answer is no since, for instance, pXY(0, 0) = 0.06 and pX(0)pY(0) = (0.10)(0.30) = 0.03 are not equal to each other. Other joint probabilities could be checked, but it's sufficient to find just one case where pXY(xk*, yℓ*) ≠ pX(xk*) pY(yℓ*) to show that X and Y are dependent random variables.
Example 8.22 (Stock price up or down?) Continuing Example 8.20, are X (the binary variable indicating a positive return for stock A) and Y (the binary variable indicating a positive return for stock B) independent? The answer is no since pXY(0, 0) = 0.30 and pX(0)pY(0) = (0.45)(0.40) = 0.18 are not equal to each other.
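In R, this check amounts to comparing the joint pmf matrix to the outer product of the marginal pmf's; a sketch using the jp matrix from Example 8.15:

indep <- outer(rowSums(jp), colSums(jp))   # joint pmf that independence would imply
round(indep, 4)
##        0      1      2
## 0 0.0300 0.0550 0.0150
## 1 0.2250 0.4125 0.1125
## 2 0.0450 0.0825 0.0225
any(abs(jp - indep) > 1e-12)   # TRUE: some cell differs, so X and Y are dependent
## [1] TRUE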
At times throughout the book, we have made the implicit assumption that certain random variables are independent
of each other. For example, for two coin tosses that have nothing to do with each other, the tosses can be thought of as
independent random variables. To be more precise, if X = 1 if the first coin is heads and 0 if the first coin is tails and
Y = 1 if the second coin is heads and 0 if the second coin is tails, the following joint probability table represents the
case when X and Y are independent:
           Y
           0      1
     0     0.25   0.25
X    1     0.25   0.25
The joint probabilities are products of the respective marginal probabilities since each marginal probability (of heads or tails for either toss) is equal to 0.5. Similarly, for two fair die rolls X ∈ {1, 2, 3, 4, 5, 6} and Y ∈ {1, 2, 3, 4, 5, 6} that are independent, the joint probability of any of the 36 possible outcomes (x, y) is 1/36 since each marginal probability for X is equal to 1/6 and each marginal probability for Y is equal to 1/6. Here is another example of two independent discrete random variables:
Example 8.23 (Website purchases) For two visitors to a website, let X = 1 if the first visitor makes a purchase and 0
otherwise and Y = 1 if the second visitor makes a purchase and 0 otherwise. If the marginal probability of purchase is
0.2 for both individuals, the probability table in the case of independent X and Y (i.e., their purchase behaviors are
not related) is:
           Y
           0                    1
     0     (0.8)(0.8) = 0.64    (0.8)(0.2) = 0.16
X    1     (0.2)(0.8) = 0.16    (0.2)(0.2) = 0.04
An alternative way to show that two discrete random variables are dependent is to consider the population covariance
or population correlation. If the population covariance/correlation is non-zero, there is some linear association between
the two random variables, meaning they must be dependent. The following proposition formally states this result:
Proposition 8.5. If the discrete random variables X and Y are independent, the population covariance σXY and the
population correlation ρXY are equal to zero (σXY = ρXY = 0). Equivalently, if X and Y have a non-zero population
covariance or correlation, X and Y are dependent.
The reverse is not necessarily true: having a population covariance/correlation between X and Y equal to zero does
not necessarily imply independence. As with sample covariance/correlation, these population statistics only measure
the linear relationship between two random variables, so it’s possible that there is some non-linear dependence between
the two random variables even when the covariance/correlation is equal to zero. The following example shows such a
situation, where X and Y are dependent even though the population covariance is zero:
Example 8.24 Consider the following probability table for X ∈ {0, 1} and Y ∈ {–1, 0, 1}:
            Y
            –1     0      1      pX(x)
      0     0      0.5    0      0.5
X     1     0.25   0      0.25   0.5

Here, µX = 0.5 and µY = (–1)(0.25) + (0)(0.50) + (1)(0.25) = 0, so the population covariance is

σXY = (0 – 0.5)(0 – 0)(0.5) + (1 – 0.5)(–1 – 0)(0.25) + (1 – 0.5)(1 – 0)(0.25) = 0.

Yet X and Y are dependent: for instance, pXY(0, 0) = 0.5 while pX(0)pY(0) = (0.5)(0.5) = 0.25, so the joint probability is not the product of the marginal probabilities.
Example 8.26 Consider the following probability table for X ∈ {0, 1} and Y ∈ {–1, 0, 1}, with the marginal
probabilities also calculated:
            Y
            –1      0      1      pX(x)
      0     7/36    2/36   3/36   12/36
X     1     14/36   4/36   6/36   24/36
The conditional distributions of Y|X are the same as the marginal distribution of Y:

pY|X(–1|0) = 7/12,  pY|X(0|0) = 2/12,  pY|X(1|0) = 3/12   (conditional on X = 0)

pY|X(–1|1) = 14/24,  pY|X(0|1) = 4/24,  pY|X(1|1) = 6/24   (conditional on X = 1)
The probabilities in the X = 1 row are all proportional to the corresponding probabilities in the X = 0 row; specifically, each one is two times the corresponding probability in the X = 0 row. This proportionality is what leads the conditional pmf's to be unchanged and, therefore, equal to the marginal pmf of Y. Similarly, the conditional distributions of X|Y are the same as the marginal distribution of X, which can be verified by the reader. Therefore, in this case, the random variables X and Y are independent.
The concept of independence can be generalized to additional random variables, with the intuition remaining
the same. When random variables have no relationship with each other, they are independent; otherwise, they are
dependent. Definition 8.17, which considered the case of two discrete random variables, can be extended to the general
case of multiple discrete random variables:
Definition 8.18 The m discrete random variables X1, X2, …, Xm, where m ≥ 2, are independent if and only if

P(X1 = xk1*, X2 = xk2*, …, Xm = xkm*) = P(X1 = xk1*) P(X2 = xk2*) ··· P(Xm = xkm*)

for any possible joint outcome (xk1*, xk2*, …, xkm*) of (X1, X2, …, Xm). If this equality fails for any joint outcome (xk1*, xk2*, …, xkm*), then the discrete random variables X1, X2, …, Xm are dependent.
Using the notation pX1X2···Xm(xk1*, xk2*, …, xkm*) to denote the joint pmf of (X1, X2, …, Xm), the definition can be re-stated as the discrete random variables X1, X2, …, Xm being independent if and only if

pX1X2···Xm(xk1*, xk2*, …, xkm*) = pX1(xk1*) pX2(xk2*) ··· pXm(xkm*)

for any possible joint outcome (xk1*, xk2*, …, xkm*) of (X1, X2, …, Xm). Equivalently, the definition could be stated in terms of cdf's rather than pmf's, with the discrete random variables X1, X2, …, Xm being independent if and only if

FX1X2···Xm(xk1*, xk2*, …, xkm*) = FX1(xk1*) FX2(xk2*) ··· FXm(xkm*)

for any possible joint outcome (xk1*, xk2*, …, xkm*) of (X1, X2, …, Xm).
As with two discrete random variables, the independence of multiple random variables is characterized by the joint
probability of an outcome being equal to the product of the marginal probabilities. It becomes more difficult to check
independence when the number of random variables increases, as there are more joint probabilities to consider. That
said, the most important use of this concept is to apply the fact that the joint probability is equal to the product of the
marginal probabilities when it is either known or assumed that a collection of random variables are independent.
Consider the linear transformations V = a + bX and W = c + dY, where a, b, c, and d are known constants. The following table provides the population covariance and correlation of V and W in terms of the population covariance and correlation of X and Y, along with a side-by-side comparison to the results for the analogous sample descriptive statistics from Section 6.7:
                         Sample                       Population
Linear transformations   v = a + bx, w = c + dy       V = a + bX, W = c + dY
Covariance               svw = bd sxy                 σVW = bd σXY
Correlation              rvw = rxy if bd > 0          ρVW = ρXY if bd > 0
                         rvw = –rxy if bd < 0         ρVW = –ρXY if bd < 0
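The sign-flip result in this table can be illustrated by simulation; the sketch below uses arbitrary constants of our own choosing (a = 3, b = 2, c = 5, d = –4, so bd < 0):

set.seed(1)
x <- sample(0:2, 10000, replace=TRUE)
y <- x + sample(0:1, 10000, replace=TRUE)   # y constructed to be correlated with x
v <- 3 + 2*x    # V = a + bX with b = 2
w <- 5 - 4*y    # W = c + dY with d = -4
all.equal(cor(v, w), -cor(x, y))   # bd < 0 flips the sign; magnitude unchanged
## [1] TRUE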
Example 8.29 (Placing two bets) Example 8.28 considered a situation in which a bet was placed on the outcome of an event. For this example, let's consider two separate bets on two different events, where X = 1 if the first event occurs and 0 otherwise and Y = 1 if the second event occurs and 0 otherwise. For the first bet, the winnings are w1 if X = 1 and the losses are ℓ1 if X = 0. For the second bet, the winnings are w2 if Y = 1 and the losses are ℓ2 if Y = 0. The net gains on the two bets, V and W respectively, are given by the following linear transformations:

V = –ℓ1 + (w1 + ℓ1)X   and   W = –ℓ2 + (w2 + ℓ2)Y.

The population covariance between V and W is

σVW = (w1 + ℓ1)(w2 + ℓ2)σXY,

and the population correlation between V and W is ρVW = ρXY since w1 > 0, ℓ1 > 0, w2 > 0, and ℓ2 > 0. Perhaps not
surprisingly, the sign of the population correlation between the net gains on the two bets is the same as the sign of the
population correlation between the two events. If Y = 1 is more likely when X = 1 as compared to X = 0, this correlation
is positive. If Y = 1 is less likely when X = 1 as compared to X = 0, this correlation is negative. The equality ρVW = ρXY
also tells us that, in addition to the signs being equal, the magnitude of the two population correlations is the same.
As seen in the table above, this last finding is a general one, with linear transformations of two random variables
having a population correlation with the same magnitude as the population correlation of the underlying two random
variables.
For two random variables X and Y, the population variances of the sum and the difference are

V = X + Y =⇒ σ²V = σ²X + σ²Y + 2σXY

and

V = X – Y =⇒ σ²V = σ²X + σ²Y – 2σXY.
Example 8.30 (Correlated purchases) Suppose a website sells two types of widgets (widget A and widget B), and a
visitor to the website is allowed to purchase one of each type of widget. For a given visitor to the website, let X = 1 if
they buy a widget of type A and X = 0 if not, and let Y = 1 if they buy a widget of type B and Y = 0 if not. The joint pmf
for X and Y is given in the following probability table:
           Y
           0      1
     0     0.70   0.05
X    1     0.05   0.20
X and Y are positively correlated here, with

σXY = (0 – 0.25)(0 – 0.25)(0.70) + (0 – 0.25)(1 – 0.25)(0.05) + (1 – 0.25)(0 – 0.25)(0.05) + (1 – 0.25)(1 – 0.25)(0.20) = 0.1375.
Proposition 8.7. Suppose the random variable

V = k + a1X1 + a2X2 + · · · + amXm,

where k, a1, a2, …, am are constants, is a linear combination of the m random variables X1, X2, …, Xm. The population statistics for the random variable V have the following relationships to the population statistics for the random variables X1, X2, …, Xm:

(i) (population mean) µV = k + a1µX1 + a2µX2 + · · · + amµXm = k + Σ_{j=1}^{m} aj µXj

(ii) (population variance) σ²V = Σ_{j=1}^{m} aj² σ²Xj + 2 Σ_{j=1}^{m–1} Σ_{ℓ=j+1}^{m} aj aℓ σXjXℓ

(iii) (population standard deviation) σV = √(σ²V) = √( Σ_{j=1}^{m} aj² σ²Xj + 2 Σ_{j=1}^{m–1} Σ_{ℓ=j+1}^{m} aj aℓ σXjXℓ )
Part (ii) of Proposition 8.7 implies that, for random variables that are not independent of each other, the population variance of the linear combination depends upon the covariances that exist between any pair of random variables. For a linear combination of m = 3 random variables (V = k + a1X1 + a2X2 + a3X3), the population variance is

σ²V = a1² σ²X1 + a2² σ²X2 + a3² σ²X3 + 2a1a2 σX1X2 + 2a1a3 σX1X3 + 2a2a3 σX2X3,

which has C(3, 2) = 3 covariance terms. More generally, the number of covariance terms is C(m, 2) for a linear combination of m random variables. As seen in the next section, the population variance σ²V simplifies considerably when X1, X2, …, Xm are independent since all the covariance terms are zero, leaving just the sum of the scaled underlying variances.
Two specific cases of linear combinations that are particularly useful are (i) the sum of independent random variables and (ii) the average of independent random variables:19

(i) Sum of independent random variables: V = X1 + X2 + · · · + Xm

µV = µX1 + µX2 + · · · + µXm = Σ_{j=1}^{m} µXj

σ²V = σ²X1 + σ²X2 + · · · + σ²Xm = Σ_{j=1}^{m} σ²Xj

(ii) Average of independent random variables: V = (1/m)(X1 + X2 + · · · + Xm)

µV = (1/m)µX1 + (1/m)µX2 + · · · + (1/m)µXm = (1/m) Σ_{j=1}^{m} µXj

σ²V = (1/m²)σ²X1 + (1/m²)σ²X2 + · · · + (1/m²)σ²Xm = (1/m²) Σ_{j=1}^{m} σ²Xj
Example 8.31 (Three rolls of a die) Suppose a fair die is rolled three times, where X1, X2, and X3 are the random variables associated with the possible outcomes of the three rolls. Assume that the three random variables are independent, meaning the outcome of each roll has nothing to do with the outcomes of the other rolls. In that case, the random variable representing the sum of the three outcomes, X1 + X2 + X3, has population mean

µX1 + µX2 + µX3 = 3.5 + 3.5 + 3.5 = 10.5

and population variance

σ²X1 + σ²X2 + σ²X3 = 35/12 + 35/12 + 35/12 = 35/4.

Since each roll has the same population mean (3.5) and population variance (35/12), the expected value of the sum is equal to three times the expected value of a single roll, and the population variance of the sum is equal to three times the population variance of a single roll.
The random variable representing the average of the three outcomes, (1/3)(X1 + X2 + X3), has population mean

(1/3)µX1 + (1/3)µX2 + (1/3)µX3 = (1/3)(3.5 + 3.5 + 3.5) = 3.5

and population variance

(1/9)σ²X1 + (1/9)σ²X2 + (1/9)σ²X3 = (1/9)(35/12 + 35/12 + 35/12) = 35/36.

The expected value for the average of three die rolls is the same as the expected value of a single roll, and the population variance of the average of three die rolls is 1/3 times the population variance of a single roll.
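A quick Monte Carlo check of Example 8.31 (a sketch; the sample statistics should be close to, but not exactly equal to, the population values):

set.seed(1234)
sums <- replicate(100000, sum(sample(1:6, 3, replace=TRUE)))
c(mean(sums), var(sums))      # population values: 10.5 and 35/4 = 8.75
c(mean(sums/3), var(sums/3))  # population values: 3.5 and 35/36 (about 0.972)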
Example 8.32 (100 coin tosses) Suppose a fair coin is tossed 100 times, where X1 , X2 , …, X100 are the random
variables associated with the possible outcomes of the tosses. Assume that the coin tosses are all independent of
each other. For any given coin toss Xj , the population mean is 0.5, and the population variance is (0.5)(1 – 0.5) = 0.25.
The random variable representing the sum of the 100 outcomes has population mean

µX1 + µX2 + · · · + µX100 = (100)(0.5) = 50

and population variance

σ²X1 + σ²X2 + · · · + σ²X100 = (100)(0.25) = 25.

For 100 coin tosses, the population mean or expected value is 50 heads. The population variance is 100 times the population variance of a single coin toss.
The random variable representing the average of the 100 outcomes has population mean

(1/100)(µX1 + µX2 + · · · + µX100) = (1/100)(100)(0.5) = 0.5

and population variance

(1/100²)σ²X1 + (1/100²)σ²X2 + · · · + (1/100²)σ²X100 = (1/100²)(100)(0.25) = 0.0025.

The expected value for the average of 100 tosses is the same as the expected value of a single toss, both equal to 0.5, and the population variance of the average of 100 tosses is 1/100 times the population variance of a single toss.
For Examples 8.31 and 8.32, the underlying i.i.d. random variables X1, X2, …, Xm share a common population mean µX, equal to 3.5 for the die rolls and 0.5 for the coin tosses, and a common population variance σ²X, equal to 35/12 for the die rolls and 0.25 for the coin tosses. When i.i.d. variables X1, X2, …, Xm share a common mean µX and a common variance σ²X, it follows that the population mean, population variance, and population standard deviation of the sum of the random variables are mµX, mσ²X, and √m σX, respectively, and the population mean, population variance, and population standard deviation of the average of the random variables are µX, σ²X/m, and σX/√m, respectively.
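These i.i.d. formulas are easy to check by simulation; a sketch for the coin-toss case with m = 100:

set.seed(5678)
one.avg <- function() mean(sample(0:1, 100, replace=TRUE))  # average of 100 tosses
avgs <- replicate(20000, one.avg())
c(mean(avgs), sd(avgs))   # population values: 0.5 and 0.5/sqrt(100) = 0.05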
As Example 8.33 illustrates, there is a two-step approach to calculating the expected value of a function of a discrete
random variable. First, we apply the function to each possible outcome xk∗ to get the full set of possible outcomes.
Second, we use the original probabilities pX (xk∗ ) to weight the new set of possible outcomes. This approach is stated as
a general result in the following proposition:
Proposition 8.8. For any function g(x) and a discrete random variable X with pmf pX(·), the population mean or expected value of g(X) is

µg(X) = E(g(X)) = Σ_k g(xk*) pX(xk*).
A special case of Proposition 8.8 is the formula for the population variance (Definition 8.6), where g(X) = (X – µX )2 .
Example 8.34 (Fair-die gamble) Consider the following gamble based upon the roll of a fair die. A player pays $2, rolls a fair die whose outcome is the random variable X ∈ {1, 2, 3, 4, 5, 6}, and receives a payout of √X dollars. Applying Proposition 8.8 with g(X) = √X – 2 yields an expected value of net winnings of

Σ_{k=1}^{6} (√k – 2) × (1/6) = (1/6)(√1 + √2 + √3 + √4 + √5 + √6) – 2 ≈ –0.1947.
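Since each outcome of the die has probability 1/6, the expected net winnings can be verified in one line of R:

mean(sqrt(1:6) - 2)
## [1] -0.1946963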
Proposition 8.8 can be generalized to multiple discrete random variables. The following proposition considers the
case of two discrete random variables, where the function of interest now has two arguments (random variables X and
Y) and the joint pmf pXY (·, ·) is used:
Proposition 8.9. For any function g(x, y) and discrete random variables X and Y with joint pmf pXY(·, ·), the population mean or expected value of g(X, Y) is

µg(X,Y) = E(g(X, Y)) = Σ_(k,ℓ) g(xk*, yℓ*) pXY(xk*, yℓ*).
A special case of Proposition 8.9 is the formula for the population covariance (Definition 8.15), where g(X, Y) =
(X – µX )(Y – µY ).
Example 8.35 (Two projects) A firm is undertaking two projects A and B, each of which may succeed or fail. The joint
probabilities of success and failure are pAB (0, 0) (both fail), pAB (0, 1) (A fails, B succeeds), pAB (1, 0) (A succeeds, B
fails), and pAB (1, 1) (both succeed). The firm realizes a profit of KA if project A succeeds, a profit of KB if B succeeds,
and an additional profit Kextra if both succeed, where KA , KB , and Kextra are constants. Then, the firm’s profit is
g(A, B) = KA A + KB B + Kextra AB.
The additional profit Kextra is realized only when AB = 1 or A = B = 1. Applying Proposition 8.9, the expected profit is
E(g(A, B)) = (0)pAB (0, 0) + (KB )pAB (0, 1) + (KA )pAB (1, 0) + (KA + KB + Kextra )pAB (1, 1)
or
E(g(A, B)) = KA (pAB (1, 0) + pAB (1, 1)) + KB (pAB (0, 1) + pAB (1, 1)) + Kextra pAB (1, 1).
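A small sketch of this calculation in R, using hypothetical joint probabilities and profit constants chosen purely for illustration:

pAB <- matrix(c(0.2, 0.2,
                0.2, 0.4), nrow=2, byrow=TRUE, dimnames=list(A=0:1, B=0:1))
KA <- 10; KB <- 8; Kextra <- 5          # hypothetical profit constants
profit <- function(a, b) KA*a + KB*b + Kextra*a*b
# expected profit: weight each profit outcome by its joint probability
sum(profit(0,0)*pAB[1,1], profit(0,1)*pAB[1,2],
    profit(1,0)*pAB[2,1], profit(1,1)*pAB[2,2])
## [1] 12.8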
Notes
18 The result for FX(xk*) simplifies using the formula for the sum of a finite geometric series: Σ_{j=1}^{k} a r^(j–1) = a(1 – r^k)/(1 – r).
19 The results for the expectation µV do not require independence, as µV does not depend upon the covariances.
Exercises
1. The pmf of the number of machines X at a given factory that break down in a day is:
# of breakdowns (xk*)      0     1     2      3
probability (pX(xk*))      0.2   0.5   0.25   0.05
(c) Thinking about Option 1, what probability p on the $10,000 profit (and 1 – p on the $1,000 loss) would yield
an expected profit equal to that of Option 3?
(d) If the investor randomly chooses one of the three options (each with probability 1/3), what is the probability that
the investor realizes a profit?
(e) Let X2 denote the random variable associated with the profit or loss from Option 2. Draw the cdf associated with X2.
6. The joint pmf of two random variables X and Y is given by the following table:
Y
1 2 3
1 0.24 0.12 0.04
X 2 0.12 0.06 0.02
3 0.24 0.12 0.04
(a) Are X and Y independent?
(b) What is the expected value of X + Y?
(c) What is the population variance of X + Y?
(d) What is the expected value of XY?
(e) What is the probability of X = 1 conditional on Y = X + 1?
7. For two positive numbers p and q with p + q < 1, the joint pmf of random variables X and Y is given by the following
table:
        Y
        y1*    y2*
   x1*  p      ???
X  x2*  q      ???
If X and Y are independent, what are the values of the two joint probabilities (in terms of p and q) in the second
column?
8. A bank has two car lanes with ATM machines. The following table provides the joint pmf for the number of cars in
each of the two lanes at a given time:
Cars in Lane 2 (Y)
0 1 2 3
0 0.05 0.15 0.02 0
Cars in Lane 1 (X) 1 0.15 0.20 0.10 0
2 0.02 0.10 0.10 0.03
3 0 0 0.03 0.05
(a) What is the probability that exactly one car is in Lane 1?
(b) What is the probability that exactly one car is in either lane?
(c) What is the probability that exactly one car is in both lanes?
(d) What is the probability that there are the same number of cars in the two lanes?
(e) What is the probability that exactly one car is in Lane 1 if two cars are in Lane 2?
(f) Are X and Y independent?
(g) Calculate the population covariance of X and Y.
(h) Consider the total number of cars in the two lanes, given by X + Y. What is the pmf of X + Y? What is E(X + Y)?
What is Var(X + Y)?
(i) Consider the difference in the number of cars in the two lanes, given by X – Y. What is the pmf of X – Y? What
is E(X – Y)? What is Var(X – Y)?
(j) Does the difference between Var(X + Y) and Var(X – Y) found in (h) and (i) make sense given the population
covariance found in (g)?
9. Consider the population of families in the United States having at least two children. The following table describes
the probability that a randomly chosen family has a third child, broken down by the number of boys that the family
has among its first two children:
# of boys among first two children 0 1 2
Probability of having a third child 0.30 0.25 0.30
Let B ∈ {0, 1, 2} be the random variable representing the number of boys among the first two children. Let T be the
random variable indicating whether the family has a third child, with T = 1 for families having a third child and T = 0
for families not having a third child. Assume the following pmf for B: pB (0) = 0.25, pB (1) = 0.50, pB (2) = 0.25.
(a) Explain why the probabilities in the table do not add up to one.
(b) In terms of B and T, what does the 0.25 value in the table represent?
(c) What is the unconditional probability that a family with two children has a third child?
(d) Show the joint pmf of B and T in a table.
(e) Are B and T independent?
(f) What is the population covariance between B and T?
10. You are faced with two possible investment choices, whose annual returns are represented by the random variables
X and Y. The possible returns for the first investment (X) are 1%, 2%, 3%, and 4%, and the possible returns for the
second investment (Y) are 2%, 3%, and 4%. For simplicity, we omit the “%” sign below. The joint pmf is:
X
1 2 3 4
2 0.10 0.05 0.05 0
Y 3 0.05 0.15 0.05 0.05
4 0.05 0.05 0.15 0.25
(a) What is the joint cdf evaluated at X = 2 and Y = 3 (FXY (2, 3))?
(b) What are the marginal pmf’s of X and Y?
(c) What is the conditional pmf of X given Y = 3?
(d) What is the expected value of X?
(e) What is the expected value of Y?
(f) What is the population variance of X?
(g) What is the population variance of Y?
(h) What is the population covariance between X and Y?
(i) You put $1,000 into investment X and $2,000 into investment Y. The trades cost a total of $20 to execute.
Write an expression for the random variable G, which represents your net gain (in dollars) over the next year.
(Remember that the returns are in percentages, so for instance X = 1 corresponds to a 1% return on $1000,
which is (0.01)(1000) = 10 dollars.)
(j) What is the expected value of G?
(k) What is the population variance of G?
11. Companies A and B are in an R&D race to develop a new technology. Each company may develop the technology
in one, two, or three years. For each company, the probability that the technology is developed in one year is 20%, the
probability that the technology is developed in two years is 50%, and the probability that the technology is developed
in three years is 30%. Assume that the companies’ efforts are independent of each other; the time that it takes Company
B to develop the technology has nothing to do with how long it takes Company A to do so.
Regardless of who develops the technology first, both companies begin commercial production in five years. If one
company develops the technology first, its market share is 75% and its competitor’s market share is 25%. If the two
companies take the same number of years to develop the technology, each has a market share of 50%.
Define the random variables A and B as the number of years that it takes Company A and Company B to develop
the technology, respectively.
(a) Find the joint pmf of A and B.
(b) Let M denote the eventual market share for Company A. What is the pmf of M?
(c) Each year of R&D costs the company $5 million. The total market is worth $100 million in net revenues
(sales minus costs of production), which the two firms divide according to their market shares. Let the random
variable R be the profits of Company A, equal to net revenues minus the R&D costs.
i. Write R as a function of A and M.
ii. What is the expected value of R (in millions of dollars)?
12. Suppose a fair die is rolled 200 times, with each roll being independent. The rolls are recorded as a 200-character
sequence of integers between 1 and 6 (inclusive).
(a) What is the probability that the sequence 354 occurs in the first three rolls?
(b) *What is the expected value of the number of times that the sequence 354 shows up in the full 200-roll
sequence? (Hint: Think about the random variables Xj defined as 1 if 354 shows up starting with the j-th
roll and 0 otherwise. Then, use the result in Proposition 8.7 for the expected value of a linear combination of
random variables.)
(c) Conduct 10,000 simulations in R to confirm your answer to (b). Each simulation involves rolling a fair die 200
times and counting the number of times that the sequence 354 occurs.
13. The annual income X (in thousands of dollars) in a population of workers has expected value 70 and population
standard deviation 50. Assume all workers’ incomes are independent of each other. For this question, you may assume
X is a discrete random variable.
(a) What is the expected value and population standard deviation for the average of two workers’ annual incomes?
(b) What is the expected value and population standard deviation for the difference between two workers’ annual
incomes?
(c) What is the expected value and population standard deviation for the average of 10 workers’ annual incomes?
(d) What is the expected value and population standard deviation for the total of 10 workers’ annual incomes?
(e) Assume that every worker in the population works 2,080 hours (52 weeks times 40 hours per week). Now,
answer (d) using hourly incomes rather than annual incomes.
14. For the population of graduating Economics students, the probability distribution (pmf) associated with X = the
number of job offers a student receives before graduation is as follows:
x 0 1 2 3 4
pX (x) 0.15 0.40 0.30 0.10 0.05
(a) Use R to determine the expected value of X. (Hint: Create two vectors, one containing the possible outcomes
for X and one containing their associated probabilities, and then do the appropriate calculation based upon
these vectors.)
(b) Use R to determine the population variance of X.
(c) Use R to determine the population standard deviation of X.
(d) Use the sample function to create a vector with 10,000 random draws of X.
i. How do the sample mean, sample variance, and sample standard deviation compare to their population
counterparts?
ii. What proportion of the random draws are less than or equal to 2? How does that compare to FX (2)?
Chapter 8 introduced the concept of a discrete random variable, defined some population descriptive statistics for
discrete random variables, and considered the properties of linear transformations and linear combinations involving
discrete random variables. This chapter introduces some commonly used models for discrete random variables. Each
of these models gives rise to a pmf for the underlying discrete random variable, where the pmf depends upon the
parameter(s) of the specific model being used. For instance, for the binary random variable X associated with a
website purchase (X = 1 if purchase, X = 0 if not), we can formalize a model that assumes there is some (possibly
unknown) probability π of a purchase; here, π is the “parameter” of the model, which gives rise to the pmf for X, with
pX (1) = π and pX (0) = 1 – π.
Definition 9.1 A random variable X is a Bernoulli random variable with parameter π if X ∈ {0, 1} and π = pX (1) =
P(X = 1). We write X ∼ Bernoulli(π), where “∼” is read “is distributed as.”
Examples 8.7 and 8.11 considered a Bernoulli random variable for union status, where X = 1 indicated a union
worker and X = 0 a non-union worker. The results from those examples for the union-status random variable hold for
any Bernoulli random variable, as summarized in the proposition below:
Proposition 9.1. If X ∼ Bernoulli(π),

µX = E(X) = π,   σ²X = π(1 – π),   and   σX = √(π(1 – π)).
The population mean µX is the average over the zero and one values in the population. Since the zeros contribute nothing to the average, µX is equal to the true proportion of ones in the population, which is pX(1) or π. For instance, for π = 0.4, there is a 40% chance of drawing a 1 from the population in a single experiment associated with X. For this π value, the population variance and population standard deviation are σ²X = (0.4)(0.6) = 0.24 and σX = √0.24, respectively.
Figure 9.1
Bernoulli variance as a function of π

The population variance σ²X = π(1 – π) is a quadratic function of π, as shown in Figure 9.1. The largest possible variance occurs when π is exactly equal to 0.5 (or 1/2), which is the case for a fair coin toss. Having π = 0.5 maximizes the variance or "noise" associated with the experiment of drawing a value of the random variable X from the population. The population variance, as a function of π, decreases as we move away from the 0.5 value, either to the left toward 0 or to the right toward 1. For π = 0.8, for instance, there is less uncertainty about X, as compared to the π = 0.5 case, since it is much more likely to see a 1 value than a 0 value. For even larger values, say π = 0.9 and π = 0.95, this uncertainty decreases even more. At the extremes of π = 0 and π = 1, there is no uncertainty at all, so that the variance is equal to 0, indicating that X is not even random in these extreme cases. There is a symmetry in the σ²X = π(1 – π) formula, so with respect to the variance, there is no difference between π = 0.8 and π = 0.2. For these two π values, the uncertainty or variance of X is the same; the only difference is which of the two outcomes has the 80% probability of being drawn. Of course, the population mean for these two π values would be different since µX = π.
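Figure 9.1 can be generated with code along the following lines (a sketch; not necessarily the script used for the published figure):

pp <- seq(0, 1, by=0.001)
plot(pp, pp*(1 - pp), type="l", xlab=expression(pi), ylab="Variance")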
Whether or not the parameter π is known depends upon the application. For a coin toss, it makes sense to say that
π = 0.5 is known. For a political poll, π is presumably unknown, and polls are conducted to figure out what π is. In
a case where π is unknown, like a political poll, Chapter 14 describes how to estimate π and how to measure the
uncertainty of that estimate.
Definition 9.2 The discrete random variables X1 , X2 , …, Xn are independent and identically distributed (i.i.d.) if
(i) X1 , X2 , …, Xn are independent and (ii) each Xj has the same probability mass function.
When X1 , X2 , …, Xn are independent and identically distributed (i.i.d.) Bernoulli(π) random variables, the random
variable associated with the sum of the Xj ’s is a binomial random variable:
Definition 9.3 A random variable X is a binomial random variable with parameters n and π, written X ∼
Binomial(n, π), if
X = X1 + X2 + · · · + Xn ,
where X1 , X2 , …, Xn are independent random variables with each Xj ∼ Bernoulli(π).
The possible outcomes for X are {0, 1, …, n}, with 0 corresponding to no successes among the n Bernoulli trials
and n corresponding to all successes. Since X is a linear combination of independent discrete random variables,
Proposition 8.7 implies that the population mean of the binomial random variable is

µX = µX1 + µX2 + · · · + µXn = nπ,

and the population variance of the binomial random variable is

σ²X = σ²X1 + σ²X2 + · · · + σ²Xn = nπ(1 – π).
Example 9.1 (Website purchases) Each day, a website records whether each of the first ten visitors makes a purchase
(1) or not (0). Assume (i) each visitor’s behavior is independent of every other visitor and (ii) the probability of
purchase is 20% for each visitor. Then, X ∼ Binomial(10, 0.2) is the total number of the first 10 visitors on a given day that make a purchase. The possible values of X are {0, 1, 2, …, 10}, and its population mean and variance are
µX = nπ = (10)(0.2) = 2

and

σ²X = nπ(1 – π) = (10)(0.2)(0.8) = 1.6.
The population mean and variance do not directly tell us the probabilities of the possible X outcomes. Let’s consider
the probability of three purchases being made by the first 10 visitors, which is P(X = 3). To get X = 3, it must be the case
that exactly 3 visitors make a purchase and exactly 7 visitors do not make a purchase. For a sequence of 10 visitors,
the number of possible ways to get 3 visitors purchasing (Bernoulli Xj = 1) and 7 visitors not purchasing (Bernoulli Xj = 0) is the binomial coefficient C(10, 3) = 120. One of these sequences is

(1, 1, 1, 0, 0, 0, 0, 0, 0, 0),

which occurs with probability

(0.2)(0.2)(0.2)(0.8)(0.8)(0.8)(0.8)(0.8)(0.8)(0.8) = (0.2)^3 (0.8)^7.

Similarly, any other sequence with three 1's and seven 0's also occurs with probability (0.2)^3 (0.8)^7, so that the total probability of three purchases is

P(X = 3) = C(10, 3) (0.2)^3 (0.8)^7 ≈ 0.201.
How about P(X = 4)? There are C(10, 4) = 210 sequences corresponding to 4 purchases and 6 non-purchases, each with probability (0.2)^4 (0.8)^6, so that

P(X = 4) = C(10, 4) (0.2)^4 (0.8)^6 ≈ 0.088.

This idea extends to all possible outcomes in {0, 1, 2, …, 10}, with

P(X = k) = C(10, k) (0.2)^k (0.8)^(10–k)   for any k ∈ {0, 1, 2, …, 10}.

The quantity C(10, k) is the number of sequences with k successes and 10 – k failures out of 10 trials, where the probability of any individual sequence is equal to (0.2)^k (0.8)^(10–k). For P(X = 0), note that C(10, 0) = 1, corresponding to the one possible sequence (all zeros) for no purchases. And, for P(X = 10), note that C(10, 10) = 1, corresponding to the one possible sequence of all ones (all 10 individuals making a purchase).
The results derived in Example 9.1 can be generalized to any binomial random variable:
Proposition 9.2. If X ∼ Binomial(n, π), then the probability mass function of X is

P(X = k) = C(n, k) π^k (1 – π)^(n–k)   for k ∈ {0, 1, …, n}.

The following R functions are useful for working with a binomial random variable:
• dbinom(x, size, prob): Returns the pmf of a Binomial(size, prob) random variable evaluated at the argument x, which may be a single number or a vector.
• pbinom(x, size, prob): Returns the cdf of a Binomial(size, prob) random variable evaluated at the argument x, which may be a single number or a vector.
• rbinom(n, size, prob): Returns a vector of n random draws of a Binomial(size, prob) random variable.
R has functions like these for other random variables, always using the convention that a function with d at the
beginning (e.g., dbinom) returns a density or pmf, a function with p at the beginning (e.g., pbinom) returns a cdf,
and a function with r at the beginning (e.g., rbinom) creates random draws of the random variable.
The following R code calculates P(X = 3) and P(X = 4) from Example 9.1, where X ∼ Binomial(10, 0.2):
dbinom(3,10,0.2)
## [1] 0.2013266
dbinom(4,10,0.2)
## [1] 0.08808038
Using the vector 0:10, consisting of all integers between 0 and 10 (inclusive), the complete pmf and cdf can be
calculated:
dbinom(0:10,10,0.2)
## [1] 0.1073741824 0.2684354560 0.3019898880 0.2013265920 0.0880803840
## [6] 0.0264241152 0.0055050240 0.0007864320 0.0000737280 0.0000040960
## [11] 0.0000001024
pbinom(0:10,10,0.2)
## [1] 0.1073742 0.3758096 0.6777995 0.8791261 0.9672065 0.9936306 0.9991356
## [8] 0.9999221 0.9999958 0.9999999 1.0000000
And, the following code simulates 50 i.i.d. draws from the X ∼ Binomial(10, 0.2) distribution of Example 9.1:
set.seed(1234)
rbinom(50,10,0.2)
## [1] 1 2 2 2 3 2 0 1 2 2 3 2 1 4 1 3 1 1 1 1 1 1 1 0 1 3 2 4 3 0 2 1 1 2 1 3 1 1
## [39] 5 3 2 2 1 2 1 2 2 2 1 3
Figure 9.2
Probability mass function for stock-increase example
Example 9.2 (Days of stock price increases) Suppose that whether a stock goes up on a given day is described by a Bernoulli(0.6) random variable, where "success" (1) indicates the stock goes up and "failure" (0) indicates the stock doesn't go up. Moreover, assume that the Bernoulli random variables associated with each day are independent of each other. Under these assumptions, for a sequence of 20 days, what is the probability that the stock goes up on exactly 12 of the 20 days? In this case, X ∼ Binomial(20, 0.6), implying

P(X = 12) = C(20, 12) (0.6)^12 (0.4)^8 ≈ 0.180.
The probabilities P(X = x) for other x values can be similarly calculated. The R code below creates Figure 9.2, which
graphs the pmf of X. The type="h" optional argument for the plot function is used to draw a vertical line from the
x-axis to each pmf value.
plot(0:20, dbinom(0:20,20,0.6), type="h", axes=FALSE, main="pmf for # of days that stock goes up",
xlab="", ylab="Probability", xlim=c(0,20), ylim=c(0,0.2))
axis(1, at=0:20)
axis(2, at=seq(0,0.2,0.02))
X = 12 is the most likely (modal) outcome here, and the probabilities decrease either to the left or to the right of 12.
The pmf looks somewhat symmetric, but it is not exactly symmetric around 12; for instance, P(X = 13) is slightly larger
than P(X = 11).
Interestingly, the pmf in Example 9.2 has an approximate bell shape. This particular shape arises due to the
parameters of the binomial distribution (n = 20, π = 0.6) being considered. To examine other possible shapes of
binomial distributions, Figure 9.3 shows pmf’s for four different binomial random variables: (i) Binomial(10, 0.1)
in the upper-left graph, (ii) Binomial(10, 0.2) in the upper-right graph, (iii) Binomial(10, 0.5) in the lower-left graph,
and (iv) Binomial(100, 0.1) in the lower-right graph. For the Binomial(10, 0.1) random variable, the success probability
is so low (10%) that the most likely outcomes are one success and zero successes, with the other outcomes (two or
more successes) having declining probabilities. There is no evident bell shape in this graph. For the Binomial(10, 0.2)
random variable, the bell shape begins to emerge, although there isn’t really a left tail since the lowest possible outcome
is zero successes. For the Binomial(10, 0.5) random variable, the pmf has an approximate bell shape and, moreover,
a perfectly symmetric one. With a success probability of 0.5, the likelihood of having four successes out of ten trials
is the same as having six successes out of ten trials, the likelihood of having three successes is the same as having
seven successes, and so on. Finally, for the Binomial(100, 0.1) random variable, the pmf has a bell shape even though
the success probability is low (10%) and, in fact, the same as we had with the Binomial(10, 0.1) random variable. The
difference here is that the number of trials is much larger (100) than the number of trials (10) for the Binomial(10, 0.1)
random variable.
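Figure 9.3
Probability mass functions for binomial random variables

Code along these lines could produce the four panels of Figure 9.3 (a sketch; the axis details of the published figure may differ):

par(mfrow=c(2,2))
plot(0:10, dbinom(0:10,10,0.1), type="h", xlab="", ylab="Probability")
plot(0:10, dbinom(0:10,10,0.2), type="h", xlab="", ylab="Probability")
plot(0:10, dbinom(0:10,10,0.5), type="h", xlab="", ylab="Probability")
plot(0:100, dbinom(0:100,100,0.1), type="h", xlab="", ylab="Probability")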
1-pbinom(100,200,0.52)
## [1] 0.6900274
Even though the true probability (0.52) of favoring candidate A is greater than 1/2, there is approximately a 31.0% chance that the observed proportion in the 200-voter poll is less than or equal to 1/2. What if the poll has more participants? If 1,000 voters were polled instead of 200 voters, so that X ∼ Binomial(1000, 0.52), the probability that the proportion is strictly greater than 1/2 is equal to

P(P > 0.5) = P(X > 500) = Σ_{k=501}^{1000} C(1000, k) (0.52)^k (0.48)^(1000–k) ≈ 0.891.
Figure 9.4
Probability mass functions for voter-poll example
1-pbinom(500,1000,0.52)
## [1] 0.8914189
In this larger poll, there is a considerably lower chance, approximately 10.9%, that the observed proportion in the poll is less than or equal to 1/2. If we continue to increase the size of the poll, the probability P(P > 0.5) gets closer and closer to one. Figure 9.4 shows the pmf's for the 200-voter and 1,000-voter polls. For each pmf, a dashed vertical line is drawn at 0.5, and a solid vertical line is drawn at 0.52. For both pmf's, the solid vertical line is at the center of the distribution. The probability P(P > 0.5) corresponds to the sum of the probability values to the right of the dashed vertical line. In comparing the two pmf's, a much larger portion of the distribution is to the right of 0.50 for the 1,000-voter poll, corresponding to the 0.891 value above, as compared to the 200-voter poll, corresponding to the 0.690 value above. The dispersion for the 200-voter poll is also clearly much larger than the dispersion for the 1,000-voter poll since σ²P = π(1 – π)/n = (0.52)(0.48)/n.
Definition 9.4 A geometric random variable X with parameter π, written X ∼ Geo(π), is the number of failures
observed before a success for a sequence of X1 , X2 , … independent random variables with each Xj ∼ Bernoulli(π).
The possible outcomes of X ∼ Geo(π) are {0, 1, 2, …}. If the first trial is a success, then X = 0 since there were no
failures before the success. If the second trial is a success, then X = 1 since there was one failure before the success.
And so on. Since successes are more likely for higher values of π, we should tend to get low values of X if π is high
and high values of X if π is low.
Example 9.4 (Visitors before a purchase) On a given day, the widgets.com website has a sequence of visitors,
each of whom has a 20% probability of purchasing a widget during their visit. The purchase behavior for the sequence
of visitors is described by a sequence of i.i.d. Bernoulli(0.2) random variables, with Xj = 1 for purchase and Xj = 0 for
no purchase. Then, the pmf of X ∼ Geo(0.2), the number of non-purchasing individuals that visit the website prior to
a purchase being made, is
P(X = 0) = P(X1 = 1) = 0.2
P(X = 1) = P(X1 = 0, X2 = 1) = (0.8)(0.2) = 0.16
P(X = 2) = P(X1 = 0, X2 = 0, X3 = 1) = (0.8)²(0.2) = 0.128
⋮
P(X = k) = P(X1 = 0, X2 = 0, …, Xk = 0, Xk+1 = 1) = (0.8)^k (0.2)
⋮
The term “geometric” is used for this random variable since each successive probability is equal to the previous
probability multiplied by the same constant (0.8 here). The interested reader can use Proposition 3.7 to confirm that
these probabilities add up to one.
The following proposition provides the general form of the pmf of a geometric random variable:
Proposition 9.3. If X ∼ Geo(π), then the probability mass function of X is

P(X = k) = (1 – π)^k π   for k ∈ {0, 1, 2, …}.

The population mean of X is

µX = (1 – π)/π,

the population variance of X is

σ²X = (1 – π)/π²,

and the population standard deviation of X is

σX = √(1 – π) / π.
The following R functions are useful for working with a geometric random variable:
• dgeom(x, prob): Returns the pmf of a Geo(prob) random variable evaluated at the argument x, which may
be a single number or a vector.
• pgeom(x, prob): Returns the cdf of a Geo(prob) random variable evaluated at the argument x, which may
be a single number or a vector.
• rgeom(n, prob): Creates a vector of n i.i.d. random draws of a Geo(prob) random variable.
For the X ∼ Geo(0.2) random variable of Example 9.4, the pmf and cdf values for k ∈ {0, 1, · · · , 10} are calculated as
follows:
dgeom(0:10,0.2)
## [1] 0.20000000 0.16000000 0.12800000 0.10240000 0.08192000 0.06553600
## [7] 0.05242880 0.04194304 0.03355443 0.02684355 0.02147484
pgeom(0:10,0.2)
## [1] 0.2000000 0.3600000 0.4880000 0.5904000 0.6723200 0.7378560 0.7902848
## [8] 0.8322278 0.8657823 0.8926258 0.9141007
Regardless of the value of π, P(X = 0) is the largest probability in the pmf. The successive probabilities are strictly
decreasing as k gets higher since each one is equal to the previous probability times (1 – π), a value less than one.
Figure 9.5 shows the pmf’s for four different geometric random variables: (i) X ∼ Geo(0.1) in the upper-left graph,
(ii) X ∼ Geo(0.2) in the upper-right graph, (iii) X ∼ Geo(0.5) in the lower-left graph, and (iv) X ∼ Geo(0.8) in the
lower-right graph. Each graph has the same y-axis, extending from 0 to 0.8, for ease of comparison.
For X ∼ Geo(0.1), the pmf starts at 0.1 and slowly declines with a long right tail, so that even quite high values for
X (i.e., many failures before a success) are possible. For X ∼ Geo(0.2), considered in Example 9.4, the pmf starts at
0.2 and declines more quickly than the π = 0.1 case. The decline in probabilities becomes even more pronounced for
X ∼ Geo(0.5) and X ∼ Geo(0.8). X ∼ Geo(0.5) is the random variable describing the number of tails observed before a
head is tossed for a sequence of fair coin tosses. When X ∼ Geo(0.8), it is very likely that the random variable X has a
low value; in this case, P(X = 0) = 0.8 and P(X ≤ 2) = 0.8 + (0.2)(0.8) + (0.2)^2 (0.8) = 0.992, so that there is only a 0.8%
chance of an X value of 3 or higher.
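This cdf value is confirmed by pgeom:
pgeom(2,0.8)
## [1] 0.992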
While we will not prove the results for the population descriptive statistics given in Proposition 9.3, we can apply
the formulas for the population mean, population variance, and the population standard deviation. For the website
purchase example (Example 9.4), where X ∼ Geo(0.2), the population mean is µX = 0.8/0.2 = 4, and the population
variance is σX^2 = 0.8/0.2^2 = 20. For the much larger π parameter (π = 0.8) in the lower-right graph of Figure 9.5,
the population mean is µX = 0.2/0.8 = 0.25, and the population variance is σX^2 = 0.2/0.8^2 = 0.3125. Consistent
with Figure 9.5, the population mean and variance are both much lower for π = 0.8 than they are for π = 0.2.
Definition 9.5 A negative binomial random variable X with parameters π and r ≥ 1, written X ∼ NegBin(r, π), is
the number of failures observed before r successes for a sequence of X1 , X2 , … independent random variables with
each Xj ∼ Bernoulli(π).
When r = 1, a negative binomial random variable is a geometric random variable. That is, X ∼ NegBin(1, π) is
equivalent to X ∼ Geo(π). The following R functions are useful for working with a negative binomial random variable:
Figure 9.5
Probability mass functions for geometric random variables
• dnbinom(x, size, prob): Returns the pmf of a NegBin(size, prob) random variable evaluated at the
argument x, which may be a single number or a vector.
• pnbinom(x, size, prob): Returns the cdf of a NegBin(size, prob) random variable evaluated at the
argument x, which may be a single number or a vector.
• rnbinom(n, size, prob): Creates a vector of n i.i.d. random draws of a NegBin(size, prob) random
variable.
Since the geometric random variable is a special case of the negative binomial random variable,
dnbinom(x,1,prob) is equivalent to dgeom(x,prob), pnbinom(x,1,prob) is equivalent to pgeom(x,prob),
and rnbinom(x,1,prob) is equivalent to rgeom(x,prob).
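For example, a quick check of the pmf equivalence:
dnbinom(0:5,1,0.2)
## [1] 0.200000 0.160000 0.128000 0.102400 0.081920 0.065536
dgeom(0:5,0.2)
## [1] 0.200000 0.160000 0.128000 0.102400 0.081920 0.065536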
Example 9.5 (Visitors before three purchases) Consider the same setup as Example 9.4, except that we are now
interested in the number of non-purchasing individuals that visit the website prior to three purchases being made,
which is the random variable X ∼ NegBin(3, 0.2). To have X = 0, the first three visitors must make a purchase:
P(X = 0) = (0.2)^3.
For X = 1, the fourth visitor makes a purchase, with the other two purchases occurring among the first three visitors.
In other words, the possible sequences associated with X = 1 are (0, 1, 1, 1), (1, 0, 1, 1), and (1, 1, 0, 1), so that:

P(X = 1) = (3)(0.8)(0.2)^3 = 0.0192.

For X = 2, the fifth visitor makes a purchase, with the other two purchases occurring among the first four visitors.
How many sequences have two purchases and two non-purchases among the first four visitors? There are \binom{4}{2} = 6 such
sequences, so that:

P(X = 2) = \binom{4}{2} (0.8)^2 (0.2)^3 = 0.03072.
dnbinom(0:2,3,0.2)
## [1] 0.00800 0.01920 0.03072
For the general case, for X = k, the (k + 3)-th visitor makes a purchase, with the other two purchases occurring
among the first k + 2 visitors. There are \binom{k+2}{2} such sequences, so that:

P(X = k) = \binom{k+2}{2} (0.8)^k (0.2)^3.
Figure 9.6 shows the pmf of X ∼ NegBin(3, 0.2) in the top graph. The most likely outcomes are roughly between 5 and
10. As a comparison, two other pmf’s are shown in the figure. The first is the pmf for a lower purchase probability (10%
rather than 20%) in the middle graph, X ∼ NegBin(3, 0.1). With the lower success probability, the lower outcomes for
X become less likely and the distribution gets stretched out with a very long and thick right tail. The second is the
pmf for a larger number of purchases (4 rather than 3) in the bottom graph, X ∼ NegBin(4, 0.2). Since more purchases
are required, the distribution appears to shift a little to the right and also is a bit lower at its peak relative to the
NegBin(3, 0.2) pmf.
Here is the R code to create Figure 9.6:
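A minimal version (plotting options may differ from the script on the companion website):
par(mfrow=c(3,1))
plot(0:50, dnbinom(0:50,3,0.2), type="h", main="NegBin(3,0.2) pmf", xlab="", ylab="Probability")
plot(0:50, dnbinom(0:50,3,0.1), type="h", main="NegBin(3,0.1) pmf", xlab="", ylab="Probability")
plot(0:50, dnbinom(0:50,4,0.2), type="h", main="NegBin(4,0.2) pmf", xlab="", ylab="Probability")
par(mfrow=c(1,1))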
Let's generalize the approach from Example 9.5 to a general X ∼ NegBin(r, π), for any success probability π and
any number of successes r. For X = k, it must be the case that (i) there are a total of k + r trials with k failures and r
successes and (ii) the last trial is a success. Combining these two facts, there are k failures and r − 1 successes among the
first k + r − 1 trials. There are \binom{k+r-1}{r-1} possible sequences for which this is true. And, the probability of any individual
sequence of the k + r trials is (1 − π)^k π^r, corresponding to the k failures and r successes. The following proposition
provides the general form of the pmf of a negative binomial random variable:
provides the general form of the pmf of a negative binomial random variable:
Proposition 9.4. If X ∼ NegBin(r, π), then the probability mass function of X is

P(X = k) = \binom{k+r-1}{r-1} (1 − π)^k π^r for k ∈ {0, 1, 2, …}.
Figure 9.6
Probability mass functions for negative binomial random variables
The final discrete model of this chapter, the Poisson model, describes the number of times that an event occurs within
a fixed time interval when events occur at a constant average rate, independent
of when the last event occurred. The parameter for the Poisson model, denoted λ, represents the expected value
(population average) of the number of events that occur within the fixed time interval.
Definition 9.6 A random variable X is a Poisson random variable with parameter λ > 0, written X ∼ Poisson(λ),
if X represents the number of times that an event occurs within a fixed time interval and the following are true:
(i) λ = µX = E(X), (ii) the average rate at which events occur is constant (depending on λ) and does not depend on
whether previous events have occurred, (iii) the occurrence of one event does not affect the likelihood that a future
event occurs, and (iv) two events cannot occur at exactly the same instant.
The possible outcomes for X ∼ Poisson(λ) are {0, 1, 2, …}, with no upper limit imposed on the value of X. Here are
two examples of Poisson models:
Example 9.6 (Coffee shop customers) Consider a random variable X given by the number of customers that arrive at
a coffee shop between 10am and 11am on a given weekday. The expected value or population mean of X is 20. So, on
average, a customer arrives every three minutes during the 10am-to-11am hour. For the assumptions in Definition 9.6
to be true, it must be the case that the likelihood of one customer arriving has nothing to do with other customers arriving
and that the likelihood of arrival is constant throughout the hour. Then, X ∼ Poisson(20).
Example 9.7 (R&D and patents) A firm has an R&D department that does research throughout the year. Every so
often, the department makes a discovery that leads to a patent application. Let the random variable X be the number
of discoveries (or patent applications) by the firm in a given year. If the expected value of discoveries in a given year
is equal to two and the R&D process satisfies the assumptions from Definition 9.6, then X ∼ Poisson(2).
The following proposition provides the pmf of a Poisson random variable:
Proposition 9.5. If X ∼ Poisson(λ), then the probability mass function of X is

P(X = k) = \frac{e^{-λ} λ^k}{k!} for k ∈ {0, 1, 2, 3, …}.

The population mean of X is

µX = λ,

the population variance of X is

σX^2 = λ,

and the population standard deviation of X is

σX = \sqrt{λ}.
Deriving the probabilities associated with the pmf, P(X = k) = e^{-λ} λ^k / k!, is beyond the scope of this book, but we can
still use the results from Proposition 9.5 for examples involving Poisson random variables. A Poisson random variable
has the interesting property that its population mean is the same as its population variance, with both equal to the
parameter λ.
The following R functions are useful for working with a Poisson random variable:
• dpois(x, lambda): Returns the pmf of a Poisson(lambda) random variable evaluated at the argument x,
which may be a single number or a vector.
• ppois(x, lambda): Returns the cdf of a Poisson(lambda) random variable evaluated at the argument x,
which may be a single number or a vector.
• rpois(n, lambda): Creates a vector of n i.i.d. random draws of a Poisson(lambda) random variable.
Example 9.8 (R&D and patents) Let’s continue Example 9.7, where the number of discoveries X in a given year is
a Poisson random variable, X ∼ Poisson(2). The population mean µX and population variance σX2 are both equal to
λ = 2, and the population standard deviation is σX = \sqrt{2}. The pmf is

P(X = 0) = \frac{e^{-2} 2^0}{0!} = e^{-2} ≈ 0.135

P(X = 1) = \frac{e^{-2} 2^1}{1!} = 2e^{-2} ≈ 0.271

P(X = 2) = \frac{e^{-2} 2^2}{2!} = 2e^{-2} ≈ 0.271

P(X = 3) = \frac{e^{-2} 2^3}{3!} = \frac{4}{3} e^{-2} ≈ 0.180

P(X = 4) = \frac{e^{-2} 2^4}{4!} = \frac{2}{3} e^{-2} ≈ 0.090

⋮

P(X = k) = \frac{e^{-2} 2^k}{k!}

⋮
The pmf of X ∼ Poisson(2) is shown in the top graph in Figure 9.7. The most likely outcomes are X = 1 and X = 2,
both with probability 0.271 from above. The pmf indicates that large values of X are not likely. For example, the
probability of having more than 5 discoveries, P(X > 5), is approximately 0.0166 or 1.66%. What would happen if the
firm had a more productive R&D department? If the Poisson parameter is increased to 4, corresponding to an expected
value of four discoveries in a given year, the pmf is shown in the bottom graph of Figure 9.7. In this case, the most
likely outcomes are X = 3 and X = 4, both with probability approximately equal to 0.195. As compared to λ = 2, larger
outcomes for X are much more likely with λ = 4; for instance, P(X > 5) is approximately 0.215 or 21.5%, and even the
probability of 8 discoveries, P(X = 8), is approximately 2.98%.
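These probabilities can be verified in R:
1-ppois(5,2)
## [1] 0.01656361
1-ppois(5,4)
## [1] 0.2148696
dpois(8,4)
## [1] 0.02977018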
Here is the R code to create Figure 9.7:
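A minimal version (plotting options may differ from the script on the companion website):
par(mfrow=c(2,1))
plot(0:10, dpois(0:10,2), type="h", main="Poisson(2) pmf", xlab="", ylab="Probability")
plot(0:10, dpois(0:10,4), type="h", main="Poisson(4) pmf", xlab="", ylab="Probability")
par(mfrow=c(1,1))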
Example 9.9 (Coffee shop customers) For the coffee shop example (Example 9.6), X ∼ Poisson(20) is the number of
customers that arrive at a coffee shop between 10am and 11am on a given weekday. Figure 9.8 shows the pmf of
X ∼ Poisson(20).
plot(0:40, dpois(0:40,20), type="h", main="", xlab="Customers between 10am and 11am", ylab="Probability")
For this λ value, the pmf has a bell shape that peaks around X = 20; in fact, the most likely outcomes here are
X = 19 and X = 20, both of which occur with a probability of approximately 0.089 or 8.9%. The bell shape of this pmf
is quite different from the shape of the pmf for the lower λ value of λ = 2 in Figure 9.7. In fact, when λ is very large,
alternative approaches can be used to model the random variable (e.g., using a normal distribution, which is discussed
in Chapter 11).
Figure 9.7
Probability mass functions for Poisson patent example
What if we are interested in the number of customers that arrive during a one-minute interval (still at some point
between 10am and 11am on a weekday) rather than during a one-hour interval? If the expected number of customers
during the one-hour interval is 20, the expected number of customers during a one-minute interval must be 20/60 = 1/3
since there is a constant arrival rate. Figure 9.9 shows the pmf for the associated Poisson random variable, which
is a Poisson(1/3) random variable. The arrival of zero customers is by far the most likely outcome (71.7%), followed
by one customer (23.9%), and two customers (4.0%), with these three outcomes accounting for a total of 99.5% in
probability. Despite the same underlying process as the one-hour interval, this pmf has a quite different shape due to
the much lower λ parameter value. With λ = 1/3, the arrival of a customer is a more unusual event since it must occur
during a one-minute interval, as compared to the arrival of a customer during the full one-hour interval.
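These three probabilities can be verified with dpois:
dpois(0:2,1/3)
## [1] 0.71653131 0.23884377 0.03980729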
Notes
20 Let Y1 be the number of failures before the first success, Y2 be the number of failures after the first success and before the second success, and
so on through Yr being the number of failures after the (r − 1)-th success and before the r-th success. With these definitions, Y1, Y2, …, Yr are all
geometric random variables Geo(π). Moreover, since the underlying Bernoulli trials are independent, it is also the case that Y1, Y2, …, Yr are
independent random variables. For X ∼ NegBin(r, π), we have X = Y1 + Y2 + · · · + Yr. Then, using the results for a linear combination of
independent random variables (Proposition 10.13),

µX = µY1 + µY2 + · · · + µYr = r \frac{1 − π}{π}

and

σX^2 = σY1^2 + σY2^2 + · · · + σYr^2 = r \frac{1 − π}{π^2}.
Exercises
1. Let X = 1 if a fair die roll results in 5 or 6 and X = 0 otherwise.
Figure 9.8
Probability mass function for Poisson coffee shop example
Figure 9.9
Probability mass function for Poisson coffee shop example
(d) Plot the pmf associated with the random variable X = the total number of macroeconomists (out of five) that
support the proposal.
5. A recent study found that 13% of individuals in the United States are left-handed and 87% are right-handed.
(a) For a group of n ≥ 2 independent individuals, what is the probability that at least one individual is left-handed?
The expression should be a function of n. Evaluate the expression for n = 3, n = 4, and n = 5.
(b) For a group of n ≥ 2 independent individuals, what is the probability that exactly one individual is left-handed
if at least one individual is left-handed? The expression should be a function of n. Evaluate the expression for
n = 3, n = 4, and n = 5.
(c) For a group of n ≥ 2 independent individuals, what is the probability that exactly two individuals are left-handed
if at least one individual is left-handed? The expression should be a function of n. Evaluate the expression for
n = 3, n = 4, and n = 5.
6. You manage a factory and, due to the complexity of the manufacturing process, there is a 10% probability that any
one of your manufactured products has a defect. You may assume that each product is independent; that is, whether one
product is defective is not associated with another product being defective. On any given day, your factory produces
2000 products. Let the random variable D = the total number of defective products on a given day.
(a) Plot the pmf associated with D.
(b) What is the most likely value of D?
(c) What is P(200 < D ≤ 300)?
(d) What is P(200 ≤ D ≤ 300)?
(e) Find the smallest integer d for which P(200 – d ≤ D ≤ 200 + d) is greater than 90%. For this value of d, what is
P(200 – d ≤ D ≤ 200 + d)? (Hint: Start at d = 0 and repeatedly increase d until the probability is above 90%.)
7. A university is experiencing increased enrollment, so it needs to accommodate more students in its classes. The
university schedules a 100-student class in a room that seats only 90 students.
(a) On a given day, suppose each student attends class with an 85% probability and all students can be considered
independent of each other. What is the probability that all students who attend the class will have a seat?
(b) *Now suppose the attendance probability π of an individual student is unknown. Limiting the possible values to
the set π ∈ {0.01, 0.02, …, 0.99, 1.00}, what is the largest value of π for which there is at least a 99% probability
that all students who attend the class will have a seat?
8. An entrepreneur is intent on starting a profitable business. The probability that she succeeds in any given year is
15%, and her success in any given year is independent of the outcomes in any previous year. The entrepreneur starts
trying in year 1. Let Y ∈ {1, 2, 3, · · · } be the random variable indicating the year in which the entrepreneur is first
successful in starting a profitable business.
(a) Explain why Y is not itself a geometric random variable.
(b) Write Y in terms of a geometric random variable.
(c) Determine the probability P(4 ≤ Y ≤ 10) using the R function dgeom.
(d) What is the expected value of Y?
(e) What is the population standard deviation of Y?
9. A family really wants to have a daughter, so they decide to keep having children until they have a daughter. Assume
that the gender of any given child is independent of the genders of all other children. The probability of having a
daughter for any given birth is 48.8% (yes, slightly lower than 50%).
(a) What is the probability that the family has at least two sons before having a daughter?
(b) What is the expected number of sons that the family has before having a daughter?
(c) If the family wants to have two daughters, what is the probability that the family has at least two sons before
having two daughters? What is the expected number of sons that the family has before having two daughters?
10. A political activist collects petition signatures by going door-to-door in a neighborhood. Suppose the probability
of successfully getting a signature at any given house is 5% and the random variables associated with success at all
houses are i.i.d. Let H ∈ {20, 21, 22, 23, …} be the random variable indicating the number of houses visited to obtain
20 signatures.
(a) Explain why H is not itself a negative binomial random variable.
(b) Write H in terms of a negative binomial random variable.
(c) What is the expected value of H?
(d) What is the population standard deviation of H?
(e) Determine the probability P(300 ≤ H ≤ 400) using the R function pnbinom.
(f) A more convincing individual has a success probability of 7%. Conduct 10,000 simulations in R to approximate
the probability that this individual (7% success rate) visits strictly fewer houses than the other individual (5%
success rate) to obtain 20 signatures.
11. *A multinomial distribution provides a generalization of the binomial distribution that allows for more than two
outcomes. Specifically, the multinomial distribution is based upon n i.i.d. draws of a discrete random variable with
m ≥ 2 possible outcomes (denoted, without loss of generality, {1, 2, …, m}) and probabilities p1, p2, …, pm (with
\sum_{j=1}^{m} p_j = 1). The discrete random variables X1, X2, …, Xm correspond to the total number of times that outcome 1
occurs, outcome 2 occurs, and so on through outcome m. The joint pmf is

p_{X1 X2 ··· Xm}(x1, x2, …, xm) = P(X1 = x1, X2 = x2, …, Xm = xm) = \frac{n!}{x1! x2! · · · xm!} p1^{x1} p2^{x2} · · · pm^{xm},

where x1 + x2 + · · · + xm = n.
There are three brands of a certain product (brands 1, 2, and 3). In the population, 15% of consumers prefer brand 1,
25% of consumers prefer brand 2, and 60% of consumers prefer brand 3.
(a) If ten consumers are chosen at random, what is the probability that two prefer brand 1, three prefer brand 2,
and five prefer brand 3?
(b) If ten consumers are chosen at random, what is the probability that five prefer brand 3 and more consumers
prefer brand 2 than brand 1?
(c) If ten consumers are chosen at random, what is the expected value of the number of consumers that prefer
brand 3? (Hint: Don’t use the joint pmf. Instead, consider the binomial distribution based upon brand 3 being
chosen or not chosen.)
(d) Make 10,000 simulated draws of (X1 , X2 , X3 ) in R for ten randomly chosen consumers. That is, for each
simulation, consider ten hypothetical consumers, assign them to a preferred brand based upon the probabilities
(15%, 25%, 60%), and let (X1 , X2 , X3 ) be the overall counts for the three brands.
(e) Using the simulated draws from (d), confirm your answer to (b).
(f) Using the simulated draws from (d), what is the approximate population correlation between X1 and X2 ? Does
the sign of this correlation make sense?
12. A financial company releases a market update e-mail every Friday afternoon. The expected number of typos in a
given e-mail is 0.8, and the number of typos follows a Poisson distribution.
(a) What is the probability that a given e-mail has no typos?
(b) What is the probability that a given e-mail has two or more typos?
(c) *Assume that each weekly e-mail can be considered independent. This part considers the number of typos in
two consecutive weekly e-mails, denoted X1 and X2 .
i. Use a while loop in R to calculate the probability that the two e-mails have the same number of typos,
given by
P(X1 = X2 ) = P(X1 = X2 = 0) + P(X1 = X2 = 1) + P(X1 = X2 = 2) + · · ·
Since this probability involves an infinite sum, continue looping until the probability P(X1 = X2 = j)
falls below a very small value, say 0.000001, to get an accurate approximation of the probability.
ii. Conduct 10,000 simulations in R, taking independent draws of X1 and X2 in each simulation, to confirm
your answer to (c)(i).
13. Consider the number of customers X that arrive at a coffee shop during one minute (within the 10am-11am hour).
Suppose the store manager knows that the average arrival rate of customers is two per minute. Based upon this
information, we can model X as a Poisson random variable, X ∼ Poisson(2), with pmf

pX(x) = \frac{e^{-2} 2^x}{x!} for x = 0, 1, 2, 3, …
(a) Using the pmf formula directly, what is the probability that no customers arrive during a given minute?
(b) Confirm your answer to (a) by using the R function dpois.
(c) Plot the pmf of X for all values less than or equal to 10.
(d) What is the probability that four or more customers arrive during a given minute?
(e) Simulate 10,000 draws of X ∼ Poisson(2) in R. What proportion of the 10,000 draws are greater than or equal
to four? Is your answer similar, aside from simulation noise, to the probability found in (d)?
(f) A more popular coffee shop on the other side of town has an average of three customers per minute during the
10am-11am hour, with the number of customers arriving during a given minute (denoted Y) distributed as a
Poisson(3) random variable. Assume that X and Y are independent of each other.
i. What is the joint probability that, during a given minute, there is one customer who arrives at the
first store (the λ = 2 one) and one customer who arrives at the second store (the λ = 3 one)?
ii. Use simulated draws of X and Y to approximate the probability that more customers arrive at the first
store (the λ = 2 one) than the second store (the λ = 3 one) during a given minute. Specifically, simulate
10,000 draws of both X ∼ Poisson(2) and Y ∼ Poisson(3), and calculate the proportion of times that the
X draw is strictly greater than the Y draw.
iii. What is the expected value of the total number of customers arriving at the two stores during a given
minute?
iv. What is the population standard deviation of the total number of customers arriving at the two stores
during a given minute?
v. Use simulated draws of X and Y (again 10,000 for each) to approximate the probability that the total
number of customers arriving at the two stores is greater than or equal to six during a given minute.
14. *Suppose X1 ∼ Poisson(λ1) and X2 ∼ Poisson(λ2) are independent random variables. Show that X1 + X2 is a
Poisson(λ1 + λ2) random variable by verifying that

P(X1 + X2 = k) = \frac{e^{-(λ1+λ2)} (λ1 + λ2)^k}{k!} for k ∈ {0, 1, 2, 3, …}.

(Hint: Use the “binomial theorem,” which is

(a + b)^m = \sum_{j=0}^{m} \binom{m}{j} a^j b^{m-j} = \sum_{j=0}^{m} \frac{m!}{j!(m-j)!} a^j b^{m-j}.)
10 Continuous random variables
Chapter 8 formalized the concept of a discrete random variable, introducing the probability mass function (pmf) as
a complete characterization of the random variable and discussing population quantities like the population mean
and population variance. Chapter 9 considered some examples of models that are used in specific situations with
discrete random variables. This chapter departs from the case of discrete random variables and formalizes the concept
of a continuous random variable, where the underlying quantity being measured is a continuous or approximately
continuous variable.
Definition 10.1 A continuous random variable is a random variable that can take on any value on some interval or
intervals of the real line, including perhaps the entire real line, and for which the probability of any specific outcome
x∗ occurring is equal to zero.
Before discussing the last part of this definition (“the probability of any specific outcome x∗ occurring is equal to
zero”), let’s first consider a few examples of continuous random variables:
• Weekly earnings: S contains all possible (positive) weekly earnings values for employed individuals, and X is
equal to the outcome.
• Monthly stock return:21 S contains all possible monthly returns for a given stock, and X is equal to the outcome.
• Unemployment rate: S contains all possible unemployment rates, ranging from 0% to 100% (or 0 to 1), and X is
equal to the outcome.
To see why any specific outcome has zero probability, suppose X is equally likely to take on any value in the interval
[0, 1]. If we had P(X = x∗ ) = c for some positive number c > 0, then it must also be the case that any other outcome on
the [0, 1] interval must also have probability c of occurring since every value is equally likely. But, there are an infinite
number of possible outcomes on the interval [0, 1], meaning the probabilities of all of these outcomes would sum to
infinity, violating the Axioms of Probability.
The fact that P(X = x∗ ) = 0 for any specific outcome x∗ is a key distinguishing feature of a continuous random
variable, in contrast to a discrete random variable which has positive probabilities for at least two discrete outcomes.
For a continuous random variable, rather than analyzing probabilities of specific outcomes, which are all equal to zero,
the probabilities associated with intervals of outcomes are considered.
Definition 10.2 The probability density function (pdf) of a continuous random variable X is a function fX (·) such
that for any two numbers a and b with a ≤ b,
P(a ≤ X ≤ b) = \int_a^b fX(x) dx.
This definition places no restriction on the shape of the pdf fX (·). The fact that any outcome x∗ has a zero probability
follows directly from Definition 10.2 since
P(X = x∗) = P(x∗ ≤ X ≤ x∗) = \int_{x∗}^{x∗} fX(x) dx = 0.
As special cases, Definition 10.2 also implies

P(X ≤ b) = \int_{-∞}^{b} fX(x) dx and P(X ≥ a) = \int_{a}^{∞} fX(x) dx.
Figure 10.1
An example of a probability density function
There is nothing special about the endpoints 0 and 1 in the U(0, 1) random variable, and a uniform random variable
can be specified with different endpoints. For example, if X is a uniform random variable between 5 and 10, written
X ∼ U(5, 10), the pdf is a constant value within the [5, 10] interval and zero outside the [5, 10] interval. The constant
value can’t be equal to 1, as it is for U(0, 1), since the area of the rectangle would be 1 × (10 – 5) = 5. To have a
rectangle area equal to one, the constant value must be 1/5 or 0.2, implying the pdf is

fX(x) = \begin{cases} 0.2 & \text{if } 5 ≤ x ≤ 10 \\ 0 & \text{otherwise} \end{cases}
Figure 10.3 shows the pdf curve for X ∼ U(5, 10). Probabilities of intervals can be easily determined here as well. For
instance, P(7 ≤ X ≤ 9) = (0.2)(9 – 7) = 0.4.
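This probability can also be calculated with the R function punif, introduced below:
punif(9,5,10)-punif(7,5,10)
## [1] 0.4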
More generally, the pdf for a uniform random variable X ∼ U(a, b), where a < b, is

fX(x) = \begin{cases} \frac{1}{b-a} & \text{if } a ≤ x ≤ b \\ 0 & \text{otherwise} \end{cases}

The height of the pdf, equal to \frac{1}{b-a}, is the value that ensures that the area of the rectangle, which has width
b − a, is equal to one.
The following R functions are useful for working with a uniform random variable:
• dunif(x, min=0, max=1): Returns the pdf of a U(min, max) random variable evaluated at the argument x,
which may be a single number or a vector. The optional arguments min and max have default values of 0 and 1,
respectively.
Figure 10.2
Probability density function of a U(0, 1) random variable
• punif(x, min=0, max=1): Returns the cdf of a U(min, max) random variable evaluated at the argument x,
which may be a single number or a vector. The optional arguments min and max have default values of 0 and 1,
respectively.
• runif(n, min=0, max=1): Creates a vector of n i.i.d. random draws of a U(min, max) random variable.
The optional arguments min and max have default values of 0 and 1, respectively.
The punif function returns the cdf, discussed below in Section 10.3. The following code shows examples of dunif
and runif for a U(5, 10) random variable:
dunif(4,5,10)
## [1] 0
dunif(6,5,10)
## [1] 0.2
dunif(8,5,10)
## [1] 0.2
set.seed(1234)
runif(20,5,10)
Figure 10.3
Probability density function of a U(5, 10) random variable
Example 10.2 (Triangular distribution) Let X be a “triangular” random variable, with pdf

fX(x) = \begin{cases} x & \text{if } 0 ≤ x ≤ 1 \\ 2 − x & \text{if } 1 < x ≤ 2 \\ 0 & \text{otherwise} \end{cases}

Figure 10.4 shows the pdf curve for X, which has a triangular shape. To confirm that the area under the pdf curve is
equal to one, note that the area of a triangle is 1/2 times the length of the triangle base times the height of the triangle,
which here is (1/2)(2 − 0)(1) = 1.
Probabilities of intervals can be determined from the pdf's shape in Figure 10.4. Alternatively, the integral formula
can be used, so for example, the probability that X is less than 0.5 is

P(X < 0.5) = \int_{-∞}^{0.5} fX(x) dx = \int_0^{0.5} x dx = \left.\frac{x^2}{2}\right|_0^{0.5} = 0.125.

As another (more difficult) example, the probability that X is between 0.5 and 1.3 is

P(0.5 ≤ X ≤ 1.3) = \int_{0.5}^{1.3} fX(x) dx = \int_{0.5}^{1} x dx + \int_{1}^{1.3} (2 − x) dx = \left.\frac{x^2}{2}\right|_{0.5}^{1} + \left.\frac{-(2-x)^2}{2}\right|_{1}^{1.3} = \left(\frac{1}{2} − \frac{1}{8}\right) + \left(\frac{-0.49}{2} + \frac{1}{2}\right) = 0.63.
Figure 10.4
Probability density function of a triangular random variable
A new function tri_pdf is defined since R does not have a built-in function for the pdf of a triangular distribution.
The function tri_pdf returns x when x is between 0 and 1, 2-x when x is between 1 and 2, and 0 otherwise.
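A version of tri_pdf consistent with that description (a sketch; the companion-website script may differ in styling) is:
tri_pdf <- function(x) {
  ifelse(x >= 0 & x <= 1, x, ifelse(x > 1 & x <= 2, 2 - x, 0))
}
x <- seq(-1, 3, by=0.01)
plot(x, tri_pdf(x), type="l", xlab="x", ylab=expression(f[X](x)))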
Definition 10.3 The cumulative distribution function (cdf) of a continuous random variable X, denoted FX (·), gives
the probability that X is less than or equal to any argument x0 of FX (·):
FX(x0) = P(X ≤ x0) = \int_{-∞}^{x_0} fX(x) dx.
Like the cdf for a discrete random variable (Definition 8.4), the cdf for a continuous random variable is equal to
P(X ≤ x0 ). Rather than the discrete summation used for a discrete random variable, the cdf of a continuous random
variable involves integration of the pdf fX (·) from –∞ to the argument x0 . The properties of the cdf of a continuous
random variable are given in the following proposition:
Proposition 10.2. The cumulative distribution function FX(·) of a continuous random variable X has the following
properties:
(i) 0 ≤ FX(x0) ≤ 1 for every x0
(ii) x0 < x1 ⟹ FX(x0) ≤ FX(x1)
(iii) At every x0 for which the derivative F′X(x0) exists, fX(x0) = F′X(x0).
Property (i) follows from the fact that FX (x0 ) = P(X ≤ x0 ) is a probability and, therefore, must be between zero and
one (inclusive). Property (ii) says that the cdf FX (·) is a weakly increasing function. While it may stay level on certain
intervals of the real line, the cdf can never decrease as x increases. Property (iii), which is a re-statement of the “first
fundamental theorem of calculus,” provides an approach to derive the pdf fX (·) if the cdf FX (·) is known.
If the cdf FX (·) is completely known, determining the probabilities of intervals is considerably simplified as
compared to the case when only the pdf fX (·) is known. When only fX (·) is known, we generally must calculate the
integral associated with an interval, whereas integration is unnecessary if the cdf FX (·) is known. Using the cdf, to
determine the probability that X is less than a value a,
P(X ≤ a) = P(X < a) = FX (a).
To determine the probability that X is greater than a value a,
P(X ≥ a) = P(X > a) = 1 – FX (a).
And to determine the probability that X is between two values a and b, where a < b,
P(a ≤ X ≤ b) = P(a < X ≤ b) = P(a ≤ X < b) = P(a < X < b) = FX (b) – FX (a).
Figure 10.5 shows an example of a pdf fX (·) and its associated cdf FX (·). The top graph shows the pdf along with
shading corresponding to the area representing P(X ≤ a). The middle graph shows the same pdf along with shading
corresponding to the area representing P(X ≤ b). The bottom graph shows the cdf associated with the pdf in the two
graphs above. The cdf starts at zero and increases, eventually approaching one. The probability P(a ≤ X ≤ b) can be
determined by the difference FX (b) – FX (a), where the two cdf values are read off the y-axis. In terms of the two pdf’s,
the probability P(a ≤ X ≤ b) is the difference between the area to the left of b (middle graph) and the area to the left of
a (top graph) or, equivalently, the area under the pdf curve between a and b.
Example 10.3 (Uniform distribution) Recall from Example 10.1 that the pdf of the uniform random variable X ∼
U(0, 1) is

fX(x) = \begin{cases} 1 & \text{if } 0 ≤ x ≤ 1 \\ 0 & \text{otherwise} \end{cases}

To determine the cdf FX(·) from the pdf fX(·), the integral FX(x0) = P(X ≤ x0) = \int_{-∞}^{x_0} fX(x) dx must be evaluated for all
possible values of x0. There are three cases that can be treated separately (x0 ≤ 0, 0 < x0 ≤ 1, and x0 > 1):
x0 ≤ 0: FX(x0) = \int_{-∞}^{x_0} fX(x) dx = \int_{-∞}^{x_0} 0 dx = 0
Figure 10.5
Example of a cumulative distribution function
0 < x0 ≤ 1: FX(x0) = \int_{-∞}^{x_0} fX(x) dx = \int_{0}^{x_0} 1 dx = x\big|_0^{x_0} = x0

x0 > 1: FX(x0) = \int_{-∞}^{x_0} fX(x) dx = \int_{0}^{1} 1 dx = x\big|_0^{1} = 1
Putting these results together, the cdf of X is

FX(x) = \begin{cases} 0 & \text{if } x ≤ 0 \\ x & \text{if } 0 < x ≤ 1 \\ 1 & \text{if } x > 1 \end{cases}
Figure 10.6 shows this cdf for X ∼ U(0, 1). The pdf function is a constant on the (0, 1) interval for the uniform random
variable U(0, 1), which yields a cdf function that is linear on the (0, 1) interval due to the integration of the pdf.
In Example 10.1, P(0.2 ≤ X ≤ 0.5) = 0.3 was determined by integrating pdf fX (·) from 0.2 to 0.5. Using the cdf FX (·),
P(0.2 ≤ X ≤ 0.5) = FX (0.5) – FX (0.2) = 0.5 – 0.2 = 0.3. The answer can also be verified in R:
punif(0.5)-punif(0.2)
## [1] 0.3
Since the default for punif is the U(0, 1) distribution, we do not specify the endpoints of the uniform distribution.
Figure 10.6
Cumulative distribution function for a U(0, 1) random variable
Example 10.4 (Triangular distribution) From Example 10.2, the pdf of the triangular random variable X is

fX(x) = \begin{cases} x & \text{if } 0 ≤ x ≤ 1 \\ 2 − x & \text{if } 1 < x ≤ 2 \\ 0 & \text{otherwise} \end{cases}

To determine the cdf, the same approach as Example 10.3 can be used, except that there are now four different cases
to consider (x0 < 0, 0 ≤ x0 ≤ 1, 1 < x0 ≤ 2, and x0 > 2):
x0 < 0: FX(x0) = \int_{-∞}^{x_0} fX(x) dx = \int_{-∞}^{x_0} 0 dx = 0

0 ≤ x0 ≤ 1: FX(x0) = \int_{-∞}^{x_0} fX(x) dx = \int_{0}^{x_0} x dx = \left.\frac{x^2}{2}\right|_0^{x_0} = \frac{x_0^2}{2}

1 < x0 ≤ 2: FX(x0) = \int_{-∞}^{x_0} fX(x) dx = \int_{0}^{1} x dx + \int_{1}^{x_0} (2 − x) dx = \left.\frac{x^2}{2}\right|_0^{1} + \left.\frac{-(2-x)^2}{2}\right|_1^{x_0} = 1 − \frac{(2 - x_0)^2}{2}

x0 > 2: FX(x0) = \int_{-∞}^{x_0} fX(x) dx = \int_{0}^{1} x dx + \int_{1}^{2} (2 − x) dx = \left.\frac{x^2}{2}\right|_0^{1} + \left.\frac{-(2-x)^2}{2}\right|_1^{2} = 1
Figure 10.7
Cumulative distribution function for a triangular random variable
Figure 10.7 shows the cdf for this triangular distribution. The pdf function is linear on the (0, 1) and (1, 2) intervals
for this random variable, which yields a cdf function that is quadratic on the (0, 1) and (1, 2) intervals due to the
integration of the pdf. Here is the R code to create Figure 10.7, with a new function tri_cdf defined for the cdf of
the triangular distribution:
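A minimal sketch consistent with the cdf derived above (the companion-website script may differ in styling):
tri_cdf <- function(x) {
  ifelse(x < 0, 0, ifelse(x <= 1, x^2/2, ifelse(x <= 2, 1 - (2-x)^2/2, 1)))
}
x <- seq(-1, 3, by=0.01)
plot(x, tri_cdf(x), type="l", xlab="x", ylab=expression(F[X](x)))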
In Example 10.2, the probability P(0.5 ≤ X ≤ 1.3) was determined by integrating the pdf fX(·) from 0.5 to 1.3.
Alternatively, using the complete cdf FX(·) specified above,

P(0.5 ≤ X ≤ 1.3) = FX(1.3) − FX(0.5) = \left(1 − \frac{(2-1.3)^2}{2}\right) − \frac{0.5^2}{2} = \frac{1.51}{2} − \frac{0.25}{2} = 0.63.
tri_cdf(1.3)-tri_cdf(0.5)
## [1] 0.63
integrate(dunif,0.2,0.8)
## 0.6 with absolute error < 0.0000000000000067
integrate(dunif,-Inf,0.8)
## 0.7999995 with absolute error < 0.0000013
integrate(tri_pdf,0.5,1.3)
## 0.63 with absolute error < 0.000000000000007
These examples illustrate that the integrate function can be used with either a built-in R function (like dunif)
or a user-defined function (like tri_pdf). The first integrate command integrates the pdf of X ∼ U(0, 1) random
variable (dunif) from 0.2 to 0.8, which is equivalent to FX (0.8) – FX (0.2). The second integrate command
uses a lower limit of –∞, so it is equivalent to the cdf FX (0.8). The third integrate command provides an
alternative method of determining P(0.5 ≤ X ≤ 1.3) when X has a triangular distribution. Rather than determining
the cdf function and coding it, as we did with tri_cdf above, the pdf tri_pdf is integrated directly here. For
each use of integrate, R returns the result of the numerical integration along with the phrase “with absolute
error < ...” Since it’s performing numerical integration, the function integrate is only approximating the
value of the integral, but the very small absolute errors indicate the approximations are very accurate. (The desired
accuracy can be specified as an argument of integrate. For this and other options, refer to the R documentation for
integrate.)
As seen in the R output above, the integrate function does more than simply return the numeric value of the
integral. If the value of the integral needs to be stored in a variable and/or used in subsequent calculations, this value
can be accessed directly by appending $value after the integrate function.
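For instance, to store the probability P(0.5 ≤ X ≤ 1.3) for the triangular distribution in a variable:
p <- integrate(tri_pdf,0.5,1.3)$value
p
## [1] 0.63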
Definition 10.4 For any q, with 0 < q < 1, the population quantile τX,q of a continuous random variable X is the value
for which
P(X ≤ τX,q ) = FX (τX,q ) = q.
A special case is the population median (τX,0.5 or τX,1/2 ) of a continuous random variable X, with
P(X ≤ τX,0.5 ) = FX (τX,0.5 ) = 0.5.
Figure 10.8
Cumulative distribution function for a mixture random variable
Let’s say that we are interested in the population 80% quantile τX,0.8 of X, where X is a continuous random variable.
τX,0.8 is the value for which there is a probability of 80% that X is less than or equal to τX,0.8 : P(X ≤ τX,0.8 ) = FX (τX,0.8 ) =
0.8. If we had a graph of the cdf FX (·), finding this population 80% quantile would involve drawing a horizontal line
at 0.8 and finding the x value where this line hits the cdf function. The top graph of Figure 10.9 shows an example
of a cdf, where the population 80% quantile (τX,0.8 ) and population median (τX,0.5 ) are shown as the values where the
horizontal lines at 0.8 and 0.5, respectively, cross the cdf curve. The corresponding pdf is shown in the bottom graph
of Figure 10.9, with the population 80% quantile τX,0.8 and the population median τX,0.5 indicated. For the population median τX,0.5,
the area under the pdf to the left of τX,0.5 is equal to 0.5. For the population 80% quantile τX,0.8 , the area under the pdf
to the left of τX,0.8 is equal to 0.8.
Just as the sample interquartile range IQRx = x̃0.75 – x̃0.25 is defined as the difference between the sample 75% quantile
and sample 25% quantile, the population interquartile range is defined analogously as the difference between the
75% population quantile and 25% population quantile:
Definition 10.5 The population interquartile range (population IQR) of a continuous random variable X, denoted
IQRX , is
IQRX = τX,0.75 – τX,0.25 .
The following example illustrates how population quantiles can be determined analytically when the cdf FX (·) is
known. Since it is easier to find population quantiles from the cdf, it is recommended to first find the cdf if only the
pdf is specified.
Figure 10.9
Example of population quantiles for a continuous random variable
Example 10.6 (Triangular distribution) Continuing Example 10.4, recall that the cdf of the triangular random
variable X is

FX(x) = \begin{cases} 0 & \text{if } x < 0 \\ \frac{x^2}{2} & \text{if } 0 ≤ x ≤ 1 \\ 1 − \frac{(2-x)^2}{2} & \text{if } 1 < x ≤ 2 \\ 1 & \text{if } x > 2 \end{cases}

Suppose we are interested in determining the population 40% quantile. We want to find the value τX,0.4 for which
FX(τX,0.4) = 0.4. We know we don't get a cdf value equal to 0.4 for x ≤ 0, so we move to the 0 < x ≤ 1 interval. Is there
an x ∈ (0, 1] such that \frac{x^2}{2} = 0.4? The answer is yes, with x = \sqrt{0.8}, so that τX,0.4 = \sqrt{0.8} ≈ 0.894.

Suppose we are interested in determining the population 70% quantile. We want to find the value τX,0.7 for which
FX(τX,0.7) = 0.7. We know that we need to move past the 0 < x ≤ 1 interval since the maximum cdf value, which occurs
at x = 1, is only 0.5. Moving to the 1 < x ≤ 2 interval, is there an x ∈ (1, 2] such that 1 − \frac{(2-x)^2}{2} = 0.7? The answer is yes,
with x = 2 − \sqrt{0.6}, so that τX,0.7 = 2 − \sqrt{0.6} ≈ 1.225.
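These quantiles can also be found numerically with uniroot, using the tri_cdf sketch from Example 10.4 (an illustrative approach rather than the book's script):
uniroot(function(x) tri_cdf(x)-0.4, c(0,2))$root   # approximately 0.894
uniroot(function(x) tri_cdf(x)-0.7, c(0,2))$root   # approximately 1.225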
For commonly used continuous distributions, like the uniform distribution, R provides functions to calculate
population quantiles.
• qunif(p, min=0, max=1): Returns the population quantiles of a U(min, max) random variable specified
by the argument p, which may be a single number or a vector. The optional arguments min and max have default
values of 0 and 1, respectively.
R uses the convention of having q as the first letter for functions that return population quantiles. The following code
returns the population 80% quantile of a U(0, 1) random variable and the population 50% and 75% quantiles of a
U(5, 10) random variable:
qunif(0.8)
## [1] 0.8
qunif(c(0.50,0.75),5,10)
## [1] 7.50 8.75
Definition 10.6 The population mean (or population average or expected value) of a continuous random variable
X, denoted µX or E(X), is

µX = E(X) = \int_{-∞}^{∞} x fX(x) dx.
The integral in Definition 10.6 can be thought of as a summation over very small intervals of the x values.
Specifically, consider a “slice” of the pdf function right at the value x, with height fX (x) and width dx. The area of
this slice is fX (x)dx and can be thought of as the probability that the outcome of X is within the slice. The integral
essentially sums over all of these little slices, and the expected value provides a weighted average of the x values with
weights given by the “probabilities” fX (x)dx.
Example 10.7 (Uniform distribution) The pdf of the uniform random variable X ∼ U(0, 1) is

fX(x) = \begin{cases} 1 & \text{if } 0 ≤ x ≤ 1 \\ 0 & \text{otherwise} \end{cases}

The population mean is

µX = \int_{-∞}^{∞} x fX(x) dx = \int_0^1 x dx = \left.\frac{x^2}{2}\right|_0^1 = 0.5,

as expected, since X ∼ U(0, 1) is symmetric around the center of the (0, 1) interval.
How about the general uniform distribution X ∼ U(a, b) for a < b? The pdf is

fX(x) = \begin{cases} \frac{1}{b-a} & \text{if } a ≤ x ≤ b \\ 0 & \text{otherwise} \end{cases}
Again, due to symmetry around the middle of the (a, b) interval, the population mean is the midpoint \frac{a+b}{2}, which can
be confirmed as follows:

µX = \int_{-∞}^{∞} x fX(x) dx = \int_a^b \frac{x}{b-a} dx = \left.\frac{x^2}{2(b-a)}\right|_a^b = \frac{b^2 - a^2}{2(b-a)} = \frac{a+b}{2}.
In Example 10.7, we referred to the symmetry of the uniform distribution, a concept which we formally define:
Definition 10.7 A continuous random variable X is said to have a symmetric distribution or a symmetric pdf if
there is a midpoint, call it x∗ , for which the pdf on one side of x∗ is a mirror image of the pdf on the other side of x∗ .
Mathematically, with the midpoint x∗ , a symmetric distribution is symmetric about x∗ and has
fX (x∗ – v) = fX (x∗ + v) for all v ≥ 0
or, equivalently,
FX (x∗ – v) = 1 – FX (x∗ + v) for all v ≥ 0.
Symmetric distributions have some nice properties:
Proposition 10.3. A continuous random variable X with a symmetric distribution, for which the pdf is symmetric
around the midpoint x∗ , has the following properties:
(i) the population mean and the population median are equal to each other and to x∗ :
µX = τX,0.5 = x∗
(ii) for any q, with 0 < q < 0.5, the population quantiles τX,q and τX,1–q are equidistant from the midpoint x∗ :
|τX,q – x∗ | = |τX,1–q – x∗ | or, equivalently, |τX,q – τX,0.5 | = |τX,1–q – τX,0.5 |
Both properties should make intuitive sense since the part of the symmetric distribution to the left of its midpoint x∗
is a mirror image of the part of the distribution to the right of its midpoint x∗. For the uniform distributions in Example
10.7, X ∼ U(0, 1) has a midpoint x∗ = 0.5 and X ∼ U(a, b) has a midpoint x∗ = \frac{a+b}{2}. For X ∼ U(0, 1), the equidistant
quantile property (property (ii)) clearly holds, as for instance the population 20% and 80% quantiles, τX,0.2 = 0.2 and
τX,0.8 = 0.8, are the same distance from the midpoint x∗ = τX,0.5 = 0.5.
As was the case for discrete random variables, it is possible that the population mean or expected value of a
continuous random variable is not well-defined, as illustrated by the following example:
Example 10.8 (Infinite expected value) Consider the following pdf for X:

fX(x) = \begin{cases} \frac{1}{x^2} & \text{if } x > 1 \\ 0 & \text{otherwise} \end{cases}

This pdf is valid since \int_1^{∞} \frac{1}{x^2} dx = \left.-\frac{1}{x}\right|_1^{∞} = 0 − (−1) = 1, but the expected value of X is infinite since

µX = \int_1^{∞} x \cdot \frac{1}{x^2} dx = \int_1^{∞} \frac{1}{x} dx = \ln(x)\big|_1^{∞} = \ln(∞) − 0 = ∞.
10.4.3 Population variance and population standard deviation

Section 8.3.2 introduced the population variance for a discrete random variable X, with possible outcomes {xk∗}_{k=1}^{K} for
finite or (countably) infinite K:

σX^2 = Var(X) = E\left((X − µX)^2\right) = \sum_k (xk∗ − µX)^2 pX(xk∗).

For a discrete random variable X, the population variance is a weighted average of the (xk∗ − µX)^2 values, for all possible
outcomes xk∗, with weights given by the true probabilities of each xk∗ in the population. As with the population mean,
this definition won't work for a continuous random variable due to the number of outcomes being uncountable and each
outcome having probability zero. Instead, an analogous definition for a continuous random variable is introduced,
replacing the summation with an integral and replacing the pmf values with pdf values:
Definition 10.8 The population variance of a continuous random variable X, denoted σX^2 or Var(X), is

σX^2 = Var(X) = E\left((X − µX)^2\right) = \int_{-∞}^{∞} (x − µX)^2 fX(x) dx.

The population standard deviation is defined as the square root of the population variance:

Definition 10.9 The population standard deviation of a continuous random variable X, denoted σX or sd(X), is

σX = sd(X) = \sqrt{σX^2} = \sqrt{\int_{-∞}^{∞} (x − µX)^2 fX(x) dx}.
Example 10.9 (Uniform distribution) In Example 10.7, the population mean of X ∼ U(0, 1) was determined to be
µX = 0.5 and, in the general case, the population mean of X ∼ U(a, b) was determined to be µX = \frac{a+b}{2}. How about
their respective population variances and population standard deviations? Starting with X ∼ U(0, 1), the population
variance is

σX^2 = \int_{-∞}^{∞} (x − µX)^2 fX(x) dx = \int_0^1 (x − 0.5)^2 (1) dx = \left.\frac{(x-0.5)^3}{3}\right|_0^1 = \frac{1}{24} − \left(-\frac{1}{24}\right) = \frac{1}{12},

and the population standard deviation is

σX = \sqrt{σX^2} = \sqrt{\frac{1}{12}} = \frac{1}{\sqrt{12}}.

For the general case, X ∼ U(a, b), the population variance is

σX^2 = \int_{-∞}^{∞} (x − µX)^2 fX(x) dx = \int_a^b \left(x − \frac{a+b}{2}\right)^2 \frac{1}{b-a} dx = \frac{1}{b-a} \left.\frac{\left(x - \frac{a+b}{2}\right)^3}{3}\right|_a^b = \frac{1}{b-a}\left(\frac{(b-a)^3}{24} − \frac{(a-b)^3}{24}\right) = \frac{(b-a)^2}{12},

and the population standard deviation is

σX = \sqrt{σX^2} = \sqrt{\frac{(b-a)^2}{12}} = \frac{b-a}{\sqrt{12}}.
Example 10.10 (Triangular distribution) The triangular distribution (Example 10.2) has pdf

fX(x) = \begin{cases} x & \text{if } 0 ≤ x ≤ 1 \\ 2 − x & \text{if } 1 < x ≤ 2 \\ 0 & \text{otherwise} \end{cases}
From the graph of the pdf fX (·) in Figure 10.4, the triangular distribution appears to be symmetric with midpoint x∗ = 1.
We can confirm that it’s a symmetric distribution by checking that fX (x∗ – v) = fX (x∗ + v) for all v ≥ 0. First, for any v
such that 0 ≤ v ≤ 1, fX (1 – v) = 1 – v since 0 ≤ 1 – v ≤ 1 and fX (1 + v) = 2 – (1 + v) = 1 – v since 1 ≤ 1 + v ≤ 2. Second, for
any v > 1, fX (1 – v) = fX (1 + v) = 0 since 1 – v < 0 and 1 + v > 2. Therefore, X has a symmetric distribution, meaning its
population mean is equal to the value of the midpoint, µX = x∗ = 1.
The population variance of X is

σX^2 = \int_{-∞}^{∞} (x − µX)^2 fX(x) dx = \int_0^1 (x − 1)^2 x dx + \int_1^2 (x − 1)^2 (2 − x) dx = \frac{1}{12} + \frac{1}{12} = \frac{1}{6},

and the population standard deviation of X is

σX = \sqrt{σX^2} = \sqrt{\frac{1}{6}} = \frac{1}{\sqrt{6}}.
10.4.4 Using integration in R to calculate population statistics

Section 10.3.1 discussed numerical integration in R, using the integrate function. Since population means,
variances, and standard deviations are each defined in terms of an integral, we can use numerical integration to calculate
their values if the pdf fX(·) is known, providing an alternative to analytic derivation of the integral. For example, the
population mean is equal to \int_{-∞}^{∞} x fX(x) dx, which can be evaluated by providing a function that calculates x fX(x) as the
first argument of integrate and −∞ and ∞ as the second and third arguments. Here is an example for a U(0, 1)
random variable:
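(A minimal sketch consistent with the description above; the companion-website script may differ.)
# Standard call: reports the value along with its accuracy
integrate(function(x) x*dunif(x), -Inf, Inf)
# Appending $value keeps just the numerical value of the integral
mu <- integrate(function(x) x*dunif(x), -Inf, Inf)$value
mu
## [1] 0.5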
The numerical integration gives the correct answer µX = 0.5 for X ∼ U(0, 1). Two slightly different versions of the
integrate command are used. The first, a standard call to integrate, reports the value along with the accuracy.
The second, which appends the syntax $value, provides just the numerical value of the integral. This
version is useful when we need to use the numerical value of the integral in further calculations, as seen below for the
population variance.

The population variance is equal to \int_{-∞}^{∞} (x − µX)^2 fX(x) dx, so numerical integration requires a function that returns
(x − µX)^2 fX(x) to be used as the first argument of integrate. Here is an example for a U(0, 1) random variable:
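(Again a minimal sketch, reusing the mu value computed above.)
mu <- integrate(function(x) x*dunif(x), -Inf, Inf)$value
integrate(function(x) (x-mu)^2*dunif(x), -Inf, Inf)   # approximately 0.08333333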
The numerical integration provides the correct answer σX^2 = 1/12 ≈ 0.083333 for X ∼ U(0, 1).
There’s nothing inherently special about the uniform distribution in these R code examples. To calculate the
population mean and population variance for the triangular distribution, for example, the density function tri_pdf
that was defined in Example 10.2 can be substituted in the code wherever dunif had appeared.
(iv) (population quantiles) τY,q = a + bτX,q if b ≥ 0 and X is a continuous random variable; τY,q = a + bτX,1–q if b < 0
and X is a continuous random variable
(v) (population IQR) IQRY = |b|IQRX if X is a continuous random variable
Parts (i) through (iii) are identical to the results from Section 8.5.1 for discrete random variables. The additive
constant a affects the population mean but not the population variance or standard deviation, whereas the scaling
constant b affects all three quantities. Part (iv) states that, when b is positive, the population quantiles for the linear
combination Y (τY,q ) are the same linear function of the population quantiles of X (τX,q ); when b is negative, in which
case we can think of the distribution being flipped around before being scaled, the linear function is applied to τX,1–q
rather than τX,q . Part (v) is analogous to the result for the sample IQR, where we had IQRy = |b|IQRx for the linear
transformation y = a + bx of the variable x.
Example 10.11 (Annualized earnings) If X represents the weekly earnings for an employed individual from the
population, the linear transformation Y = 52X is a random variable that represents the annualized earnings for an
employed individual from the population. Similar to the results for descriptive statistics in Example 6.29, Proposition
10.4 can be applied here to get µY = 52µX , σY2 = 522 σX2 = 2704σX2 , σY = 52σX , τY,q = 52τX,q for any q ∈ (0, 1), and
IQRY = 52IQRX .
Example 10.12 (Uniform distribution) Let X ∼ U(0, 1) be a uniform random variable on the interval (0, 1). For
constants a and b, where a < b, if the random variable Y is defined as the linear transformation

Y = a + (b − a)X,

the population mean of Y is

µY = a + (b − a)µX = a + \frac{b-a}{2} = \frac{a+b}{2},

the population variance of Y is

σY^2 = (b − a)^2 σX^2 = \frac{(b-a)^2}{12},

and the population standard deviation of Y is

σY = |b − a|σX = \frac{b-a}{\sqrt{12}}.
Recall from Examples 10.7 and 10.9 that these quantities are the same population mean, variance, and standard
deviation derived for the U(a, b) distribution. In fact, Y ∼ U(a, b) can be shown as follows:
FY(y) = P(Y ≤ y) = P(a + (b − a)X ≤ y) = P\left(X ≤ \frac{y-a}{b-a}\right) = \frac{y-a}{b-a} for y ∈ (a, b),

which implies that

fY(y) = F′Y(y) = \frac{1}{b-a} for y ∈ (a, b),

by applying part (iii) of Proposition 10.2. Thus, the pdf of Y is exactly the same as the pdf of a U(a, b) random variable.
This example shows that the linear transformation of a uniform U(0, 1) random variable is also a uniform random
variable. Using this fact to derive the population mean, variance, and standard deviation is much easier than the
work required in Example 10.9 to derive those quantities directly. And, any uniform random variable can be written as
some linear transformation of the U(0, 1) random variable. For instance, to get Y ∼ U(–2, 5), the linear transformation
Y = –2 + 7X can be used for X ∼ U(0, 1).
A specific linear transformation that is often quite useful is the transformation that “standardizes” a random variable
X to have population mean equal to zero and population variance (and population standard deviation) equal to one.
This standardized random variable is formally defined as follows, where we do not restrict the type of random
variable:
Definition 10.10 For a random variable X, a standardized random variable Z is constructed by “de-meaning” the random variable X and then dividing by its standard deviation:
Z = (X – µX)/σX.
From the definition, Z is a linear transformation of X with additive constant a = –µX/σX and scaling constant b = 1/σX. Using the results from Proposition 10.4,
µZ = a + bµX = –µX/σX + (1/σX)µX = 0,
σZ² = b²σX² = (1/σX)²σX² = 1,
and
σZ = |b|σX = (1/σX)σX = 1.
Therefore, the standardized random variable Z has population mean equal to zero and population variance and standard
deviation both equal to one. Moreover, Z is unitless since both its numerator X – µX and its denominator σX are in the
units of X. The random variable Z can be interpreted as the number of population standard deviations that X is from
its population mean µX , with negative Z corresponding to X being below µX and positive Z corresponding to X being
above µX . For instance, Z = –1.5 indicates that X is 1.5 standard deviations below µX (that is, X = µX – 1.5σX ), and
Z = 2.7 indicates that X is 2.7 standard deviations above µX (that is, X = µX + 2.7σX ).
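As a quick numerical illustration (a sketch, with our own choice of X), the following code standardizes draws of X ∼ U(60, 100), which has µX = 80 and σX = 40/√12, and confirms that the resulting draws of Z have sample mean near zero and sample standard deviation near one:
set.seed(1234)
x <- runif(100000, 60, 100)          # X ~ U(60,100)
z <- (x - 80)/(40/sqrt(12))          # Z = (X - mu_X)/sigma_X
mean(z)                              # approximately 0
sd(z)                                # approximately 1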
Definition 10.11 The joint probability density function (joint pdf) of continuous random variables X and Y is a
function fXY (·, ·) such that, for any two numbers a and b with a ≤ b and any two numbers c and d with c ≤ d,
P(a ≤ X ≤ b, c ≤ Y ≤ d) = ∫_c^d ∫_a^b fXY(x, y) dx dy.
Whereas the probability of an interval for a single continuous random variable is a single integral, the joint
probability of X and Y being in their respective intervals is a double integral. The inner integral integrates over the
values x between a and b, while the outer integral integrates over the values y between c and d. Since X and Y are
continuous random variables, the joint probability of any specific outcome is equal to zero; that is, P(X = x, Y = y) = 0 for
any (x, y). Like the marginal pdf, the joint pdf will be non-negative for all (x, y) and integrate to one. These properties,
along with the relationship of the joint pdf to the marginal pdf’s, are stated in the following proposition:
Proposition 10.5. For continuous random variables X and Y, the joint probability density function fXY (·, ·) has the
following properties:
(i) fXY(x, y) ≥ 0 for all (x, y)
(ii) ∫_{–∞}^{∞} ∫_{–∞}^{∞} fXY(x, y) dx dy = 1
(iii) fX(x) = ∫_{–∞}^{∞} fXY(x, y) dy for all x
(iv) fY(y) = ∫_{–∞}^{∞} fXY(x, y) dx for all y
Property (iii) states that the marginal pdf of X, evaluated at x, is obtained by fixing x and integrating the joint pdf
over all values of y. This relationship is analogous to the case of discrete random variables, where the marginal pmf of
X, evaluated at a possible outcome xk*, is obtained by fixing xk* and summing the joint pmf over all possible yℓ* values.
Similarly, property (iv) states that the marginal pdf of Y, evaluated at y, is obtained by fixing y and integrating the joint
pdf over all values of x.
Example 10.13 (Unrelated uniform random variables) Suppose X and Y are continuous random variables with joint
pdf
fXY(x, y) =
2 cases: 1 if 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, and 0 otherwise.
It can be confirmed that the joint pdf integrates to one:
∫_{–∞}^{∞} ∫_{–∞}^{∞} fXY(x, y) dx dy = ∫_0^1 ∫_0^1 1 dx dy = ∫_0^1 1 dy = 1.
The double integral ∫_0^1 ∫_0^1 1 dx dy can be interpreted as the volume of a rectangular solid, with height 1 (the constant fXY(·, ·) value) and rectangular sides of 1 (for the range of x) and 1 (for the range of y).
For the marginal pdf of X, when 0 ≤ x ≤ 1,
fX(x) = ∫_{–∞}^{∞} fXY(x, y) dy = ∫_0^1 1 dy = 1,
and otherwise (when x < 0 or x > 1),
fX(x) = ∫_{–∞}^{∞} fXY(x, y) dy = ∫_{–∞}^{∞} 0 dy = 0.
Therefore, X is a uniform random variable, X ∼ U(0, 1). Similarly, it can be shown that Y ∼ U(0, 1).
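The claim that the joint pdf integrates to one can also be checked numerically. The following sketch (our own code, not from the companion website) approximates the double integral with a Riemann sum over a fine grid:
f <- function(x, y) ifelse(x >= 0 & x <= 1 & y >= 0 & y <= 1, 1, 0)
grid <- seq(0.0005, 0.9995, by=0.001)   # midpoints of 1000x1000 cells
sum(outer(grid, grid, f)) * 0.001^2     # approximately 1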
Example 10.14 (Related uniform random variables) Suppose X and Y are continuous random variables with joint pdf
fXY(x, y) =
2 if 0 ≤ x ≤ 0.5 and 0 ≤ y ≤ 0.5
2 if 0.5 < x ≤ 1 and 0.5 < y ≤ 1
0 otherwise
In this example, the values of X and Y are related to each other, which is not the case in Example 10.13. Here, X can only be in the range (0, 0.5) when Y is in the range (0, 0.5)
and vice versa, and X can only be in the range (0.5, 1) when Y is in the range (0.5, 1) and vice versa.
The following definition generalizes the definition of the joint cdf, introduced in Chapter 8 for discrete random
variables (Definition 8.9), to all types of random variables.
Definition 10.12 The joint cumulative distribution function (joint cdf) of two random variables X and Y, denoted
FXY (·, ·), gives the probability that both X and Y are less than or equal to their corresponding arguments:
FXY (x0 , y0 ) = P(X ≤ x0 , Y ≤ y0 ).
When X and Y are both continuous random variables, the joint cdf can be written in terms of the joint pdf:
FXY(x0, y0) = ∫_{–∞}^{y0} ∫_{–∞}^{x0} fXY(x, y) dx dy.
Regardless of the types of the two random variables X and Y, the joint cdf has the same properties that were stated
in Chapter 8.
For the related uniform random variables of Example 10.14, the joint cdf can be determined by calculating the double integral from Definition 10.12 for several different regions based upon the (x, y) values. The joint cdf, whose derivation is left as an exercise (Exercise 10.13), is
FXY(x, y) =
0 if x < 0 or y < 0
2xy if 0 ≤ x ≤ 0.5 and 0 ≤ y ≤ 0.5
0.5 + 2(x – 0.5)(y – 0.5) if 0.5 < x ≤ 1 and 0.5 < y ≤ 1
x if 0 ≤ x ≤ 0.5 and y > 0.5
y if x > 0.5 and 0 ≤ y ≤ 0.5
x if 0.5 < x ≤ 1 and y > 1
y if x > 1 and 0.5 < y ≤ 1
1 if x > 1 and y > 1
Definition 10.13 For a continuous random variable X and a random variable Y, the conditional probability density function (conditional pdf) of X given Y is denoted fX|Y(·|·). The function fX|Y(·|y) is the pdf associated with X when Y = y; when X and Y are both continuous random variables, it is given by
fX|Y(x|y) = fXY(x, y)/fY(y) for all (x, y) such that fY(y) > 0.
Definition 10.14 For random variables X and Y, the conditional cumulative distribution function (conditional cdf) of X given Y, denoted FX|Y(·|·), is
FX|Y(x|y) = P(X ≤ x|Y = y) for all (x, y).
If X and Y are both continuous random variables,
FX|Y(x|y) = P(X ≤ x|Y = y) = (∫_{–∞}^x fXY(v, y) dv)/fY(y) for all (x, y) such that fY(y) > 0.
We consider three examples, the first having a discrete X and continuous Y and the second and third having X and
Y both continuous.
Example 10.16 (Data analyst salaries) A large firm employs many data analysts, some of whom have graduate
degrees. Let X ∈ {0, 1} be an indicator of whether a data analyst has a graduate degree (X = 1 for graduate degree,
X = 0 for no graduate degree). Suppose the salaries (in thousands of dollars) of data analysts without a graduate
degree can be modeled as a U(60, 100) random variable and salaries of data analysts with a graduate degree can be
modeled as a U(90, 210) random variable. In this case, X is discrete and has only two possible values, leading to two
conditional pdf's of interest:
fY|X(y|0) = 1/40 if 60 ≤ y ≤ 100, and 0 otherwise,
and
fY|X(y|1) = 1/120 if 90 ≤ y ≤ 210, and 0 otherwise.
The associated conditional cdf's are
FY|X(y|0) =
0 if y < 60
(y – 60)/40 if 60 ≤ y ≤ 100
1 if y > 100
and
FY|X(y|1) =
0 if y < 90
(y – 90)/120 if 90 ≤ y ≤ 210
1 if y > 210
Example 10.17 (Unrelated uniform random variables) Continuing Example 10.13, the conditional pdf of X given Y
is defined for y such that 0 ≤ y ≤ 1, for which
fX|Y(x|y) = fXY(x, y)/fY(y) = fXY(x, y)/1 = 1 if 0 ≤ x ≤ 1, and 0 otherwise.
The second equality follows from the marginal distribution Y ∼ U(0, 1), for which fY (y) = 1 when 0 ≤ y ≤ 1. Thus,
the conditional distribution of X given Y = y, for 0 ≤ y ≤ 1, is the uniform distribution U(0, 1), which is the same as the
marginal distribution of X.
Similarly, the conditional pdf of Y given X is defined for x such that 0 ≤ x ≤ 1, for which
fY|X(y|x) = fXY(x, y)/fX(x) = fXY(x, y)/1 = 1 if 0 ≤ y ≤ 1, and 0 otherwise.
Thus, the conditional distribution of Y given X = x, for 0 ≤ x ≤ 1, is the uniform distribution U(0, 1), which is the same
as the marginal distribution of Y.
Example 10.18 (Related uniform random variables) Continuing Example 10.14, the conditional pdf of X given Y is
defined for y such that 0 ≤ y ≤ 1. Recall that the marginal distribution of Y is the U(0, 1) distribution, with fY (y) = 1 for
0 ≤ y ≤ 1. For y values such that 0 ≤ y ≤ 0.5, the conditional pdf of X given Y is
fX|Y(x|y) = fXY(x, y)/fY(y) = 2 if 0 ≤ x ≤ 0.5, and 0 otherwise,
and for y values such that 0.5 < y ≤ 1, the conditional pdf of X given Y is
fX|Y(x|y) = fXY(x, y)/fY(y) = 2 if 0.5 < x ≤ 1, and 0 otherwise.
Thus, the conditional distribution of X given Y is U(0, 0.5) when 0 ≤ y ≤ 0.5 and U(0.5, 1) when 0.5 < y ≤ 1. Similarly,
it can be shown that the conditional distribution of Y given X is U(0, 0.5) when 0 ≤ x ≤ 0.5 and U(0.5, 1) when 0.5 <
x ≤ 1.
Since a conditional pdf is itself a pdf, it is natural to introduce conditional versions of the population descriptive
statistics, including the mean, variance, and standard deviation:
Definition 10.15 The population conditional mean or conditional expectation of a continuous random variable X
given Y = y, denoted µX|Y=y, is
µX|Y=y = E(X|Y = y) = ∫_{–∞}^{∞} x fX|Y(x|y) dx.
Definition 10.16 The population conditional variance of a continuous random variable X given Y = y, denoted σ²X|Y=y, is
σ²X|Y=y = Var(X|Y = y) = ∫_{–∞}^{∞} (x – µX|Y=y)² fX|Y(x|y) dx.
Definition 10.17 The population conditional standard deviation of a continuous random variable X given Y = y, denoted σX|Y=y, is
σX|Y=y = sd(X|Y = y) = √(σ²X|Y=y).
As with the definitions of the conditional pdf and the conditional cdf, the roles of Y and X can be reversed in
these definitions (when Y is a continuous random variable).
Example 10.19 (Data analyst salaries) Continuing Example 10.16, where the conditional distribution of salaries for
data analysts with no graduate degree (Y|X = 0) is U(60, 100) and the conditional distribution of salaries for data
analysts with a graduate degree (Y|X = 1) is U(90, 210), the conditional expectations and conditional variances are
E(Y|X = 0) = (60 + 100)/2 = 80, Var(Y|X = 0) = (100 – 60)²/12 ≈ 133.33,
E(Y|X = 1) = (90 + 210)/2 = 150, Var(Y|X = 1) = (210 – 90)²/12 = 1200.
Thus, as expected from a comparison of the two conditional distributions, the expected value and variance of salaries
for non-graduate-degree data analysts are both lower than they are for graduate-degree data analysts.
Example 10.20 (Related uniform random variables) Continuing Example 10.14, the conditional expectation of X
given Y = y is
E(X|Y = y) = ∫_0^0.5 (x)(2) dx = 0.25
when 0 ≤ y ≤ 0.5, and
E(X|Y = y) = ∫_0.5^1 (x)(2) dx = 0.75
when 0.5 < y ≤ 1. The conditional variance of X given Y = y is
Var(X|Y = y) = ∫_0^0.5 (x – 0.25)²(2) dx = [(2/3)(x – 0.25)³]_0^0.5 = 1/48
when 0 ≤ y ≤ 0.5, and
Var(X|Y = y) = ∫_0.5^1 (x – 0.75)²(2) dx = [(2/3)(x – 0.75)³]_0.5^1 = 1/48.
In fact, in this simple example, it would have been possible to determine the conditional expectations and conditional
variances without calculating the integrals. Since we already knew that the conditional distribution X|Y = y is U(0, 0.5)
when 0 ≤ y ≤ 0.5 and U(0.5, 1) when 0.5 < y ≤ 1, the conditional expectations are just the midpoints of the uniform
distributions (0.25 and 0.75, respectively) and the conditional variances are both equal to 0.5²/12 = 1/48 from the general variance formula (b – a)²/12 for a U(a, b) distribution.
When information is available about the conditional distributions of a random variable, and therefore its conditional
expectations, we can determine the (unconditional) expected value of that random variable. This idea is analogous
to the Law of Total Probability from Chapter 3, where the unconditional probability is equal to a weighted sum
of conditional probabilities. The following proposition provides several results for expressing the (unconditional)
expected value of a random variable in terms of conditional expectations:
Proposition 10.6. (Expected value in terms of conditional expectations) Let Y be a random variable.
(i) If A1, A2, …, Ak are disjoint and exhaustive events, then²²
E(Y) = Σ_{j=1}^k E(Y|Aj)P(Aj).
Example 10.21 (Data analyst salaries) For Example 10.16, where the conditional distribution of salaries for data
analysts with no graduate degree (Y|X = 0) is U(60, 100) and the conditional distribution of salaries for data analysts
with a graduate degree (Y|X = 1) is U(90, 210), let π = P(X = 1) denote the probability that a data analyst has a
graduate degree. Then, the expected value of salary is related to the two conditional expectations of salaries as follows:
E(Y) = E(Y|X = 0) · P(X = 0) + E(Y|X = 1) · P(X = 1)
= ((60 + 100)/2)(1 – π) + ((90 + 210)/2)π = 80 + 70π.
For instance, if π = 0.2 or 20%, the expected value of salary is E(Y) = 94 or $94,000.
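A simulation sketch of this calculation for π = 0.2 (our own code; variable names are ours):
set.seed(1234)
n <- 100000
grad <- runif(n) < 0.2                            # X = 1 with probability 0.2
salary <- ifelse(grad, runif(n, 90, 210), runif(n, 60, 100))
mean(salary)                                      # theory: 80 + (70)(0.2) = 94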
Definition 10.18 The population covariance of continuous random variables X and Y, denoted σXY or Cov(X, Y), is
σXY = Cov(X, Y) = E[(X – µX)(Y – µY)] = ∫_{–∞}^{∞} ∫_{–∞}^{∞} (x – µX)(y – µY) fXY(x, y) dx dy.
Definition 10.19 The population correlation of continuous random variables X and Y, denoted ρXY or Corr(X, Y), is
ρXY = Corr(X, Y) = σXY/(σXσY).
The properties for the population covariance and population correlation were provided earlier in Proposition 8.4. The
population correlation ρXY is unitless, with –1 ≤ ρXY ≤ 1, and its sign is always the same as the sign of the population
covariance (sign(ρXY ) = sign(σXY )).
Example 10.22 (Related uniform random variables) We can determine the population covariance and population
correlation for the random variables X and Y introduced in Example 10.14, where the joint pdf was
fXY(x, y) =
2 if 0 ≤ x ≤ 0.5 and 0 ≤ y ≤ 0.5
2 if 0.5 < x ≤ 1 and 0.5 < y ≤ 1
0 otherwise
We have used µX = µY = 0.5 and σX = σY = 1/√12 since X and Y are U(0, 1) random variables. The high positive
correlation indicates a strong positive association, which is expected given the specification of the joint pdf.
It should be noted that the population covariance σXY = Cov(X, Y) = E [(X – µX )(Y – µY )] is a well-defined population
concept more generally than the specific cases considered in Chapter 8 (X and Y both discrete) and this section (X and Y
both continuous). Regardless of the form of X and Y, we can think of taking many draws of (xi, yi) from the population, and the population covariance σXY is the number to which the sample covariance sxy = (1/(n – 1)) Σ_{i=1}^n (xi – x̄)(yi – ȳ) eventually converges (i.e., for very large n).²³ One particular case of interest that has not been considered thus far is the covariance between a discrete X and a continuous Y. For simplicity, let's consider a binary (Bernoulli) X ∈ {0, 1}, with X ∼ Bernoulli(π), in which case²⁴
σXY = E [(X – µX )(Y – µY )]
= E [(X – µX )(Y – µY )|X = 0] P(X = 0) + E [(X – µX )(Y – µY )|X = 1] P(X = 1)
= E [(0 – π)(Y – µY )|X = 0] (1 – π) + E [(1 – π)(Y – µY )|X = 1] π
= π(1 – π) (E [Y – µY |X = 1] – E [Y – µY |X = 0])
= π(1 – π) (E [Y|X = 1] – E [Y|X = 0]) .
The second equality follows from application of Proposition 10.6. For the third equality, we plug in the value 0 for X
when conditioning on X = 0 and the value 1 when conditioning on X = 1, and we use the fact that P(X = 1) = µX = π.
For the fourth equality, we pull the constants (–π and π, respectively) outside the conditional expectations and then simplify. Finally, for the fifth equality, the additive constants (–µY for both) are pulled out of the two conditional expectations and cancel out. Since π(1 – π) is always positive, the population covariance σXY and, thus, the population correlation ρXY are positive when E [Y|X = 1] > E [Y|X = 0], negative when E [Y|X = 1] < E [Y|X = 0], and zero when
E [Y|X = 1] = E [Y|X = 0]. This result is analogous to the discussion in Section 7.2.3, where the sample correlation
between a discrete x and a continuous y was considered.
Example 10.23 (Data analyst salaries) Continuing Example 10.16, where the conditional distribution of salaries for
data analysts with no graduate degree (Y|X = 0) is U(60, 100) and the conditional distribution of salaries for data
analysts with a graduate degree (Y|X = 1) is U(90, 210), the population covariance is
σXY = π(1 – π) (E [Y|X = 1] – E [Y|X = 0]) = π(1 – π)(150 – 80) = 70π(1 – π),
where π = P(X = 1) is the probability that a data analyst at the firm has a graduate degree. There is a positive
covariance and, therefore, a positive correlation between the graduate-degree indicator X and the salary Y since
the population (conditional) mean of salaries for graduate-degree data analysts is higher than the population
(conditional) mean of non-graduate-degree data analysts. For instance, if π = 0.20, the population covariance is
σXY = (70)(0.2)(0.8) = 11.2.
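The covariance can likewise be approximated by simulation. A sketch for π = 0.2 (our own code):
set.seed(1234)
n <- 100000
x <- 1*(runif(n) < 0.2)                           # graduate-degree indicator
y <- ifelse(x == 1, runif(n, 90, 210), runif(n, 60, 100))
cov(x, y)                                         # theory: (70)(0.2)(0.8) = 11.2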
Definition 10.20 The continuous random variables X and Y are independent if and only if
fXY (x, y) = fX (x)fY (y) for every x and y
or, equivalently,
FXY (x, y) = FX (x)FY (y) for every possible outcome pair (x, y),
where FXY (x, y) = P(X ≤ x, Y ≤ y). If this equality fails for any (x, y), then the continuous random variables X and Y
are dependent.
Independence based upon the cdf (that is, FXY (x, y) = FX (x)FY (y) for all (x, y)) is the same condition for independence
seen for discrete random variables in Definition 8.17. In fact, since the cdf is a unifying concept for all random
variables, including those that might be a mixture of both discrete and continuous outcomes, the following definition
is a general definition of independence that applies to any type of random variable.
Definition 10.21 The random variables X and Y are independent if and only if
FXY (x, y) = FX (x)FY (y) for every possible outcome pair (x, y),
where FXY (x, y) = P(X ≤ x, Y ≤ y). If this equality fails for any (x, y), then the random variables X and Y are dependent.
This definition is general in the sense that X can be discrete, continuous, or some mixture of discrete and continuous,
as can Y. For instance, X could be a binary (discrete) random variable, while Y is a uniform (continuous) random
variable.
Proposition 10.7. If the random variables X and Y are independent, the population covariance σXY and population
correlation ρXY are equal to zero (σXY = ρXY = 0). Equivalently, if the random variables X and Y have a non-zero
population covariance or correlation, X and Y are dependent.
As shown in Example 8.24, it is possible for discrete random variables to be dependent and have population
covariance/correlation equal to zero. The same is true of continuous random variables: dependent continuous random variables can have population covariance/correlation equal to zero.²⁵
A general characterization of independence can also be given in terms of conditional distributions:
Proposition 10.8. The random variables X and Y are independent if and only if:
• For any possible value y, the conditional distribution of X given Y = y is the same as the marginal distribution of X.
Mathematically, FX|Y (x|y) = FX (x) for any x and any possible value y.
• For any possible value x, the conditional distribution of Y given X = x is the same as the marginal distribution of Y.
Mathematically, FY|X (y|x) = FY (y) for any y and any possible value x.
It is sufficient to show just one of the two equivalences between the conditional distributions and the marginal distributions. For instance,
if we show that FX|Y (x|y) = FX (x) for any x and any possible value y, it is unnecessary to also show FY|X (y|x) = FY (y) for
any y and any possible value x.
As with Definition 10.21, Proposition 10.8 does not restrict the type of random variables since the results are stated
in terms of the conditional cdf. X can be discrete, continuous, or a mixture of the two, as can Y. If X and Y are both
discrete, having FX|Y (x|y) = FX (x) for any x and y is equivalent to having pX|Y (x|y) = pX (x) for all possible outcomes
(x, y). If X and Y are both continuous, having FX|Y (x|y) = FX (x) for any x and y is equivalent to having fX|Y (x|y) = fX (x)
for all possible (x, y).
Example 10.24 (Unrelated uniform random variables) In Example 10.17, for the joint pdf
fXY(x, y) = 1 if 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, and 0 otherwise,
we showed that the conditional pdf of X given Y = y (for 0 ≤ y ≤ 1) is the same as the marginal pdf of X. Also, we
showed that the conditional pdf of Y given X = x (for 0 ≤ x ≤ 1) is the same as the marginal pdf of Y. Then, from
Proposition 10.8, X and Y are independent. Since they are independent, we also know σXY = ρXY = 0 without having to calculate σXY = ∫_{–∞}^{∞} ∫_{–∞}^{∞} (x – µX)(y – µY) fXY(x, y) dx dy.
If two random variables are independent, it becomes much simpler to determine joint probabilities. For two discrete
random variables, the definition of independence itself (Definition 8.17) stated that the joint probability of an outcome
(x, y) is equal to the product of the marginal probabilities of x and y. For continuous variables, the probability of a
specific joint outcome (x, y) is zero, so instead we consider the joint probability of X being in some interval and Y
being in some other interval. The following proposition says that this joint probability is equal to the product of the
marginal probabilities of the two random variables being in their respective intervals:
Proposition 10.9. If X and Y are independent random variables, then for a ≤ b and c ≤ d,
P(a ≤ X ≤ b, c ≤ Y ≤ d) = P(a ≤ X ≤ b) · P(c ≤ Y ≤ d).
The statement of this proposition is not restricted to continuous X and Y and holds for any types of random variables.
Example 10.25 (Producing a movie) The cost of producing a movie depends upon the number of days of filming (X)
and the number of days of editing (Y), with filming costing $1 million per day and editing costing $500,000 or $0.5
million per day. Assume that X and Y are both uniformly distributed and independent, as follows:
X ∼ U(60, 80) and Y ∼ U(40, 50).
The assumption of uniform distributions allows for partial days. The marginal pdf's are
fX(x) = 0.05 if 60 ≤ x ≤ 80, and 0 otherwise, and fY(y) = 0.1 if 40 ≤ y ≤ 50, and 0 otherwise.
Since X and Y are independent, the joint pdf satisfies fXY(x, y) = fX(x)fY(y) for all (x, y), yielding
fXY(x, y) = 0.005 if 60 ≤ x ≤ 80 and 40 ≤ y ≤ 50, and 0 otherwise.
For any y such that 40 ≤ y ≤ 50, the conditional pdf of X is the same as the marginal pdf of X: fX|Y(x|y) = fX(x), which is 0.05 if 60 ≤ x ≤ 80 and 0 otherwise. Similarly, for any x such that 60 ≤ x ≤ 80, the conditional pdf of Y is the same as the marginal pdf of Y: fY|X(y|x) = fY(y), which is 0.1 if 40 ≤ y ≤ 50 and 0 otherwise. The independence also simplifies
the calculation of joint cdf probabilities. For instance, the probability that X (days of filming) is less than or equal to
70 and Y (days of editing) is less than or equal to 45 is
P(X ≤ 70, Y ≤ 45) = P(X ≤ 70)P(Y ≤ 45) = ((0.05)(10)) × ((0.1)(5)) = (0.5)(0.5) = 0.25.
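This joint probability is easy to confirm by simulation. A sketch with 100,000 draws (our own code):
set.seed(1234)
x <- runif(100000, 60, 80)                        # days of filming
y <- runif(100000, 40, 50)                        # days of editing
mean(x <= 70 & y <= 45)                           # theory: (0.5)(0.5) = 0.25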
Proposition 10.9 generalizes to more than two random variables. First, we provide a definition of independence for
multiple random variables:
Definition 10.22 The random variables X1 , X2 , …, Xm , for m ≥ 2, are independent if and only if
FX1X2···Xm(x1, x2, …, xm) = FX1(x1)FX2(x2) ··· FXm(xm) = ∏_{j=1}^m FXj(xj) for all (x1, x2, …, xm).
Definition 10.23 The random variables X1 , X2 , …, Xn are independent and identically distributed (i.i.d.) if
(i) X1 , X2 , …, Xn are independent and (ii) each Xj has the same cdf (FXj (x) = FX (x) for all j ∈ {1, 2, …, n} and all x).
This definition includes discrete random variables (Definition 9.2) as a special case but also applies to continuous
random variables and random variables that are a mixture of discrete and continuous outcomes.
Example 10.27 (Producing a movie) Continuing Example 10.25, where X ∼ U(60, 80), the number of days of filming,
and Y ∼ U(40, 50), the number of days of editing, are independent random variables. The production cost, denoted C,
is $1 million per day of filming and $0.5 million per day of editing:
C = X + 0.5Y (in millions of dollars).
The population mean of C, or the expected cost, is
µC = µX + 0.5µY = 70 + (0.5)(45) = 92.5,
or $92.5 million. The population variance of C is
σC² = σX² + (0.5)²σY² = (80 – 60)²/12 + (0.5)²(50 – 40)²/12 = 425/12,
and the population standard deviation of C is
σC = √(425/12) ≈ 5.95,
or approximately $5.95 million.
Suppose $90 million has been budgeted for the movie. What is the probability that the film goes over budget? The
probability of going over budget is
P(C > 90) = P(X + 0.5Y > 90),
so we need to find the region of possible (x, y) values for which x + 0.5y > 90. Figure 10.10 helps to visualize the
problem. The rectangle represents the range of the possible values for X and Y, with x between 60 and 80 and y
between 40 and 50. The diagonal line is the y = 2(90 – x) = 180 – 2x line, such that any (x, y) value above this line has
y > 180 – 2x or x + 0.5y > 90. Then, P(X + 0.5Y > 90) is obtained by integrating the joint pdf fXY (x, y) = 0.005 over this
region. Since the pdf is constant within the region, the problem simplifies to determining the area of the gray region in
Figure 10.10 and multiplying it by 0.005, the height of the rectangular solid (given by the pdf). The area of the gray
trapezoid is ((10 + 15)/2) · 10 = 125, so that the probability of going over budget is P(C > 90) = (125)(0.005) = 0.625 or 62.5%.
An alternative approach to calculate P(C > 90) = P(X + 0.5Y > 90) is with computer simulation. Rather than
analytically deriving the probability of the gray region in Figure 10.10, the computer simulation approximates the
probability by repeatedly making random draws of X and Y and seeing how often the linear combination X + 0.5Y is
larger than 90. The following R code approximates the over-budget probability with 100,000 simulated draws of X
and Y:
set.seed(1234)
x <- runif(100000, min=60, max=80)   # 100,000 draws of X ~ U(60,80)
y <- runif(100000, min=40, max=50)   # 100,000 draws of Y ~ U(40,50)
mean(x + 0.5*y > 90)                 # fraction of simulated costs above 90
# (the last three lines are a reconstruction of the simulation described
# in the text; the original script is available on the companion website)
The approximated over-budget probability is 62.62%, which is very close to the true probability of 62.5%.
Figure 10.10
Over-budget region for the movie example (the boundary line y = 180 – 2x crosses the rectangle of possible (x, y) values at (65, 50) and (70, 40))
For the sum V = X1 + X2 + ··· + Xm of independent random variables X1, X2, …, Xm, the population mean and population variance are
µV = µX1 + µX2 + ··· + µXm = Σ_{j=1}^m µXj
and
σV² = σ²X1 + σ²X2 + ··· + σ²Xm = Σ_{j=1}^m σ²Xj.
When V = (1/m)(X1 + X2 + ··· + Xm) is the average of independent random variables,
µV = (1/m)µX1 + (1/m)µX2 + ··· + (1/m)µXm = (1/m) Σ_{j=1}^m µXj
and
σV² = (1/m²)σ²X1 + (1/m²)σ²X2 + ··· + (1/m²)σ²Xm = (1/m²) Σ_{j=1}^m σ²Xj.
Example 10.28 (Sum of independent uniform random variables) Suppose X1 ∼ U(0, 1) and X2 ∼ U(0, 1) are
independent uniform random variables. The sum V = X1 + X2 has population mean
µV = µX1 + µX2 = 0.5 + 0.5 = 1,
population variance
σV² = σ²X1 + σ²X2 = 1/12 + 1/12 = 1/6,
and population standard deviation
σV = √(1/6) = 1/√6.
Interestingly, V = X1 + X2 is actually the triangular distribution from Example 10.2, which the interested reader can
verify (by finding FV (·) and checking that it’s the cdf of the triangular distribution). The population mean, variance,
and standard deviation are the same as those derived for the triangular distribution in Example 10.10. Knowing that
the triangular distribution V is the sum of two independent U(0, 1) random variables greatly simplifies the calculation
of these quantities, as compared to the brute-force method using the population variance formula in Example 10.10.
How about the sum of three independent U(0, 1) random variables? In this case, V = X1 + X2 + X3 , with population
mean
µV = µX1 + µX2 + µX3 = (3)(0.5) = 1.5,
population variance
σV² = σ²X1 + σ²X2 + σ²X3 = (3)(1/12) = 1/4,
and population standard deviation
σV = √(3/12) = 1/2.
This idea extends to the sum of m independent U(0, 1) random variables, V = X1 + X2 + ··· + Xm, with population mean
µV = µX1 + µX2 + ··· + µXm = (m)(0.5) = 0.5m,
population variance
σV² = σ²X1 + σ²X2 + ··· + σ²Xm = (m)(1/12) = m/12,
and population standard deviation
σV = √(m/12).
For the sum of independent U(0, 1) random variables, the population mean is m times the common population mean
of the Xj ’s, and the population variance is m times the common population variance of the Xj ’s.
If we are instead interested in the average of m independent U(0, 1) random variables, V = (1/m)(X1 + X2 + ··· + Xm), the population mean is
µV = (1/m)µX1 + (1/m)µX2 + ··· + (1/m)µXm = (1/m)(m)(0.5) = 0.5,
the population variance is
σV² = (1/m²)σ²X1 + (1/m²)σ²X2 + ··· + (1/m²)σ²Xm = (1/m²)(m)(1/12) = 1/(12m),
and the population standard deviation is σV = 1/√(12m). Figure 10.11 shows the pdf of the average V for several values of m; as m increases, the distribution concentrates around 0.5 and begins to take on a bell shape.
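A simulation sketch (our own code) for the average of m = 10 independent U(0, 1) draws:
set.seed(1234)
m <- 10
v <- replicate(10000, mean(runif(m)))   # 10,000 draws of the average
mean(v)                                 # theory: 0.5
var(v)                                  # theory: 1/(12*10) ~ 0.00833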
Figure 10.11
Probability density functions for the average of m i.i.d. U(0, 1) random variables (panels for m = 1, m = 2, m = 3, and m = 10)
Proposition 10.14. Suppose X1 , X2 , …, Xm are i.i.d. random variables with common population mean µX and
population variance σX2 .
(i) For the sum of the i.i.d. random variables, V = X1 + X2 + ··· + Xm,
µV = mµX, σV² = mσX², and σV = √m σX.
(ii) For the average of the i.i.d. random variables, V = (1/m)(X1 + X2 + ··· + Xm),
µV = µX, σV² = σX²/m, and σV = σX/√m.
In addition to the generalized results in Proposition 10.14, the tendency for the sum V = X1 + X2 + · · · + Xm or
the average V = m1 (X1 + X2 + · · · + Xm ) to look like a bell-shaped distribution as m gets larger is also a general
phenomenon that occurs for any i.i.d. random variables. This phenomenon was previously encountered in Chapter 9
for binomial random variables since a Binomial(n, π) random variable is the sum of n i.i.d. Bernoulli variables.
For example, Figure 9.3 showed a Binomial(100, 0.1) random variable with a distribution that appeared bell-shaped and approximately symmetric. Even though the pmf of a Bernoulli(0.1) is certainly not bell-shaped, the sum of m Bernoulli(0.1) random variables takes on a symmetric, bell-shaped distribution for the large m = 100 value.
The following example illustrates this same phenomenon for an asymmetric continuous distribution that is quite
different from the uniform distribution.
Example 10.29 (Asymmetric distribution) We consider a random variable with an asymmetric distribution, known as
an exponential random variable (discussed in more detail in Chapter 11). The top-left graph in Figure 10.12 shows the pdf of a random variable X that takes only positive values (x > 0). In this example, the population mean is
µX = 2. Assume that X1 , X2 , …, Xm are i.i.d. random variables with this distribution. The remaining three graphs in
Figure 10.12 show the distribution of the average V = m1 (X1 + X2 + · · · + Xm ) for three different values of m (m = 3 in
the top-right graph, m = 10 in the lower-left graph, and m = 20 in the lower-right graph). While each Xj is asymmetric,
the distribution of V becomes closer and closer to a symmetric distribution as m increases. Even with m = 3, the
distribution begins to look like a bell-shaped distribution, though it is more asymmetric than when m increases to 10
or 20. For the higher m values, the nearly symmetric distributions are approximately centered around µX = 2 and the
dispersion decreases as m is increased from 10 to 20.
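A sketch along the following lines (our own code, using rexp with rate 1/2 so that the population mean is 2) reproduces the flavor of Figure 10.12 for m = 10:
set.seed(1234)
m <- 10
v <- replicate(10000, mean(rexp(m, rate=0.5)))   # 10,000 draws of the average
mean(v)                                          # approximately 2
hist(v, breaks=50)                               # roughly bell-shaped around 2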
Figure 10.12
Probability density functions for the average of m i.i.d. exponential random variables (panels for m = 1, m = 3, m = 10, and m = 20)
Figure 10.13
Financial advisor investment fee schedule (g(X), the investment fee, plotted against X, the portfolio return)
For a customer whose annual portfolio return is represented by the random variable X, the financial advisor charges an investment fee equal to g(X) times the overall
portfolio value, where
g(X) = 0.002 + 0.013/(1 + e^(–40X)).
The function g(·) is increasing, with an asymptote of 0.002 (or 0.2%) to the left and an asymptote of 0.015 (or 1.5%)
to the right, as illustrated in Figure 10.13. Assuming that the annual portfolio return for the customer is uniformly
distributed, with X ∼ U(–0.10, 0.25), what is the cdf of the investment fee g(X)? We first determine the relevant sample
space associated with g(X). The minimum possible value and maximum possible value are
g(–0.10) = 0.002 + 0.013/(1 + e^(–40(–0.10))) and g(0.25) = 0.002 + 0.013/(1 + e^(–40(0.25))),
respectively. Then, the cdf of g(X) is
Fg(X)(v) = P(0.002 + 0.013/(1 + e^(–40X)) ≤ v) = P(1/(1 + e^(–40X)) ≤ (v – 0.002)/0.013) for g(–0.10) < v < g(0.25).
We require additional simplification to get an inequality in terms of X. For any v between g(–0.10) and g(0.25),
Fg(X)(v) = P(0.013/(v – 0.002) ≤ 1 + e^(–40X)) = P((0.015 – v)/(v – 0.002) ≤ e^(–40X)) = P(ln((0.015 – v)/(v – 0.002)) ≤ –40X),
which yields
Fg(X)(v) = P(X ≤ (1/40) ln((v – 0.002)/(0.015 – v))) = (1/0.35)((1/40) ln((v – 0.002)/(0.015 – v)) – (–0.10)).
The last equality follows from X ∼ U(–0.10, 0.25), which implies a pdf of fX(x) = 1/(0.25 – (–0.10)) = 1/0.35 for x ∈ [–0.10, 0.25].
For a continuous random variable X, Proposition 10.17 implies a particularly convenient relationship between the
population quantiles of g(X) and the population quantiles of X. The result is given in the following proposition:
Proposition 10.18. If the function g(·) is strictly increasing on the sample space associated with a continuous random
variable X, the population q-th quantile of the random variable g(X) is
τg(X),q = g(τX,q ) for any q ∈ (0, 1).
To get the q-th population quantile of g(X), we just apply the g(·) function to the q-th population quantile of X. Why
does this relationship hold? Recall that τX,q is the value for which P(X ≤ τX,q ) is equal to q. Therefore, P(g(X) ≤ g(τX,q ))
is also equal to q since g(·) is a strictly increasing function, meaning that the population q-th quantile of g(X) is g(τX,q).
For Example 10.32, where X ∼ U(0, 1), the population q-th quantile (for q ∈ (0, 1)) of √X is √q since the population q-th quantile of X is q; similarly, the population q-th quantile of X² is q² for q ∈ (0, 1).
For Example 10.33, what is the population median of the investment fee g(X)? According to Proposition 10.18,
since the population median of X is the middle of the (–0.10, 0.25) interval, which is 0.075,
τg(X),0.5 = g(τX,0.5) = g(0.075) = 0.002 + 0.013/(1 + e^(–40(0.075))) ≈ 0.01438.
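A simulation sketch confirming this median (the function name gfee is ours):
set.seed(1234)
gfee <- function(x) 0.002 + 0.013/(1 + exp(-40*x))
x <- runif(100000, -0.10, 0.25)                  # X ~ U(-0.10, 0.25)
median(gfee(x))                                  # theory: g(0.075) ~ 0.01438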
10.10 Random variables with discrete and continuous outcomes
This chapter has already alluded to random variables that are a mixture or hybrid of discrete and continuous random
variables, with the idea first introduced in Section 10.1.1 and then again in Section 10.3.2 when we discussed the
generality of the cdf. Unfortunately, while the cdf applies generally to discrete random variables, continuous random
variables, and mixtures of the two, our definitions of many population quantities have been quite different for discrete
and continuous random variables. For example, the population mean and the population variance are both defined as
summations for discrete random variables and as integrals for continuous random variables.
Although a completely unified framework that handles arbitrary mixtures of discrete and continuous random
variables is beyond the scope of this book, we briefly discuss how to determine population quantities for such mixtures.
The basic approach, which works in most cases of practical interest, is to apply summations for the discrete outcomes,
to apply integrals for the continuous outcomes, and then sum the results. This idea is best illustrated through examples.
Example 10.34 (Uniform distribution with point masses) As in Example 10.5, suppose the cdf of X is
FX(x) =
0 if x < 0
0.3 + 0.5x if 0 ≤ x < 1
1 if x ≥ 1
This cdf corresponds to probabilities P(X = 0) = 0.3 and P(X = 1) = 0.2 for the two possible discrete outcomes, which leaves a probability of 0.5 for the (uniform) continuous outcomes x ∈ (0, 1). The population mean of X is
µX = E(X) = (0)(0.3) + (1)(0.2) + (∫_0^1 x(1) dx)(0.5) = 0.45,
where the first two terms account for the two discrete outcomes and the integral accounts for the continuous outcomes.
This calculation partitions the expected value E(X) into a weighted average of conditional expectations, with
E(X) = E(X|X = 0) · P(X = 0) + E(X|X = 1) · P(X = 1) + E(X|0 < X < 1) · P(0 < X < 1),
which simplifies to
E(X) = (0)(0.3) + (1)(0.2) + (0.5)(0.5) = 0.45.
Similarly, the population variance of X is
σX² = (0 – 0.45)²(0.3) + (1 – 0.45)²(0.2) + (∫_0^1 (x – 0.45)²(1) dx)(0.5) ≈ 0.164.
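These values can be checked with a simulation sketch (our own code) that draws X = 0 with probability 0.3, X = 1 with probability 0.2, and X ∼ U(0, 1) otherwise:
set.seed(1234)
n <- 100000
u <- runif(n)
x <- ifelse(u < 0.3, 0, ifelse(u < 0.5, 1, runif(n)))
mean(x)                                          # theory: 0.45
var(x)                                           # theory: ~ 0.164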
Example 10.35 (Average hourly wages, with zeros) Suppose we are interested in the average hourly wage in a
population of teenagers, but we want to include teenagers who are not working (i.e., zero hourly wage). If 40% of
teenagers do not work and the remaining 60% have wages (in dollars) drawn from a U(8, 16) random variable, the
population mean of wages is
(0.4)(0) + (0.6) ∫_8^16 (x/8) dx = (0.6) [x²/16]_8^16 = (0.6)((256 – 64)/16) = 7.2,
or $7.20. The population variance of wages is
(0.4)(0 – 7.2)² + (0.6) ∫_8^16 ((x – 7.2)²/8) dx = (0.4)(0 – 7.2)² + (0.6) [(x – 7.2)³/24]_8^16 = 37.76,
meaning the population standard deviation of wages is √37.76, or approximately $6.14.
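A simulation sketch for this example (variable names are ours):
set.seed(1234)
n <- 100000
working <- runif(n) < 0.6                        # 60% of teenagers work
wage <- ifelse(working, runif(n, 8, 16), 0)      # zero wage for non-workers
mean(wage)                                       # theory: 7.2
var(wage)                                        # theory: 37.76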
Notes
21 Strictly speaking, the outcome X = –1 (which corresponds to the stock price dropping to zero) may not have zero probability, but we assume
that it does for ease of exposition.
22 This result also holds for an infinite set of disjoint and exhaustive events, in which case the summation becomes an infinite summation.
23 In the terminology of Chapter 14, the sample covariance is said to be a “consistent estimator” of the population covariance σXY.
24 For general discrete X, the same basic idea applies except that there are more (possibly infinite) possible outcomes xk* for X, resulting in a summation of terms that are probabilities pX(xk*) times conditional expectations given X = xk*.
25 In Chapter 11, however, we will see that for X and Y with normal distributions, independence of X and Y holds if and only if the population covariance/correlation is equal to zero.
26 The inverse function g–1 (·) is well-defined since g(·) is strictly increasing.
Exercises
1. Consider the probability density function (pdf)
fX(x) = 3x² for 0 ≤ x ≤ 1, and 0 otherwise.
(a) Show that fX (·) is a valid pdf.
(b) What is the population mean of X?
(c) What is the population variance of X?
(d) What is the value of the cdf of X evaluated at 0.4?
(e) What is the population median of X?
(f) Confirm your answers to (a), (b), and (d) in R by defining functions corresponding to fX (x) and xfX (x) and using
the integrate function.
2. In the United States, a FICO Score is a measure of an individual’s credit worthiness that companies (lenders, credit-
card issuers, etc) use to determine whether to extend credit. An individual’s FICO Score is an integer between 300 and
850, with low scores indicating low credit worthiness and high scores indicating high credit worthiness. Assume that
we can consider FICO Scores to be an approximately continuous variable, even though they are integers. According
to fico.com, the population probabilities associated with different intervals in April 2023 were as follows:
Interval 300-499 500-549 550-599 600-649 650-699 700-749 750-799 800-850
Probability 0.028 0.057 0.068 0.086 0.120 0.164 0.236 0.241
Let X denote the random variable associated with an individual’s FICO Score.
(a) What is FX (799)?
(b) What are the smallest and largest possible values of FX (520)?
(c) In which interval is the population median of X?
(d) What are the smallest and largest possible values of the population mean of X?
3. Consider the following pdf for a random variable X defined on the interval [0, 1]:
fX(x) =
0.8 for x ∈ [0, 0.5]
K for x ∈ (0.5, 1]
0 otherwise
(a) What is K? Graph the pdf of X.
(b) What is the cdf of X? Graph the cdf of X.
(c) What is the expected value of X?
(d) What is the population variance of X?
(e) What is the population 10% quantile of X?
(f) What is the population 90% quantile of X?
(g) Write an R function rsplitunif that takes a single argument n and returns a vector with n draws of the
random variable X. (Hint: Think of X as being generated in the following way: with probability 0.4, X is drawn
from a U(0, 0.5) distribution, and with probability 0.6, X is drawn from a U(0.5, 1) distribution.)
(h) Using the function rsplitunif, conduct 10,000 simulations in R to confirm the answers to (c) and (d).
(i) Suppose Y ∼ U(0, 1) and that X and Y are independent. Using the functions runif and rsplitunif,
conduct 10,000 simulations in R to approximate P(X > Y).
(j) Suppose Y ∼ U(0, 1) and that X and Y are independent. Using the functions runif and rsplitunif,
conduct 10,000 simulations in R to approximate P(|X – Y| ≤ 0.1), the probability that X and Y are within 0.1 of
each other.
4. Let X be a random variable with pdf
fX(x) =
0.5x if 0 ≤ x ≤ 1
1 – 0.5x if 1 < x ≤ 2
0.25 if 2 < x ≤ 4
0 otherwise
(i) *The following R function rparabola takes a single argument n and returns n random draws of the random
variable X:
i. Copy and paste the code into R to define the rparabola function. Using the rparabola function,
draw a histogram of 100,000 simulated draws of X. How does the histogram compare to the pdf graph
from (c)?
ii. Consider two cities, with associated market shares X1 and X2 for Office Plus, where X1 and X2 are i.i.d. random variables with pdf fX(·). Using the rparabola function, draw a histogram of 100,000 simulated draws of (X1 + X2)/2, the average market share for Office Plus in the two cities.
iii. Consider ten cities, with associated market shares X1, …, X10 for Office Plus, where X1, …, X10 are i.i.d. random variables with pdf fX(·). Using the rparabola function, draw a histogram of 100,000 simulated draws of (X1 + X2 + ··· + X10)/10, the average market share for Office Plus in the ten cities.
6. *Consider a car insurance policy that has a deductible of $500 and maximum coverage of $15,000. If the policy
holder gets into an accident and submits a claim, the policy holder must pay the first $500 of the claim and the
insurance company pays the rest of the claim up to the maximum of $15,000. Therefore, the maximum the insurance
company pays on a claim is $14,500, which happens when the claim is $15,000 or more. Suppose the pdf of the
claim X is
fX(x) = (6/20000³)(20000 – x)x for 0 ≤ x ≤ 20000, and 0 otherwise.
Let Y be the amount that the insurance company pays on the claim. Y has discrete and continuous outcomes, with 0
and 14500 being discrete outcomes with positive probabilities and y ∈ (0, 14500) being continuous outcomes.
(a) What is the probability, P(Y = 0), that the insurance company pays nothing?
(b) What is the probability, P(Y = 14500), that the insurance company pays the maximum amount?
(c) Determine the cdf of Y.
(d) What is the probability that the insurance company pays between $10,000 and $12,000 on the claim?
7. The lifetime of a new restaurant, in years, is described by the random variable X with pdf
fX(x) = 1/(x + 1)² for x > 0, and 0 otherwise.
For example, if X = 2, the restaurant is in business for exactly two years and then closes.
(a) Determine the cdf of X.
(b) What is the population median of X?
(c) What is the probability that the restaurant stays in business for at least two years?
(d) If a company opens three restaurants, each of which has a lifetime that is an independent draw of X, what is the
probability that none of the restaurants lasts two years?
8. *Let X1, X2, …, Xn be i.i.d. continuous random variables with population median τX,0.5. Let minX = min(X1, X2, …, Xn) and maxX = max(X1, X2, …, Xn).
(a) Show that P(minX ≤ τX,0.5 ≤ maxX) = 1 – (1/2)^(n–1). (Hint: Consider the probability of the complement.)
(b) For n = 4, think about having many different four-observation samples {x1 , x2 , x3 , x4 }, each of which is a
realization of the {X1 , X2 , X3 , X4 } i.i.d. draws. For what fraction of these samples is the population median
τX,0.5 between min(x1, x2, x3, x4) and max(x1, x2, x3, x4)?
(c) What is the smallest value of n for which the probability in (a) is at least 95%?
(d) What is the smallest value of n for which the probability in (a) is at least 99%?
9. Each of the following R commands makes 10,000 random draws that are stored in the vector x. For each command,
(i) describe the distribution that the random draws are being drawn from and (ii) provide your best guess at what the
value of mean(x) would be.
(a) x <- 10*runif(10000)-7
(b) x <- 1*(runif(10000)<0.7)
(c) x <- 2*(runif(10000)<0.4)-1
(d) x <- runif(10000,1,3)+runif(10000,5,8)
10. A store's sales depend on whether it is a weekday (Monday through Friday) or a weekend day (Saturday or Sunday).
Specifically, sales (in thousands of dollars) are distributed as a uniform random variable U(1, 3) on a weekday and as a
uniform random variable U(2, 5) on a weekend day. Let X denote the random variable associated with the store’s sales
on a randomly chosen day of the week (i.e., the probability associated with each day is 1/7). We say that X is a mixture
of uniform random variables.
(a) Determine the cdf of X by applying the Law of Total Probability
P(X ≤ x) = P(X ≤ x|weekday)P(weekday) + P(X ≤ x|weekend)P(weekend).
(b) Determine the pdf of X.
(c) Conduct 10,000 simulations in R, making a random draw of X in each simulation, and plot a histogram of the
draws to confirm your answer to (b).
(d) What is the population mean of X? Either calculate the integral or use the fact that
E(X) = E(X|weekday)P(weekday) + E(X|weekend)P(weekend).
(e) What is the population median of X?
11. An investor is considering two different real estate properties, Property A and Property B. The annual return from
Property A, call it RA , follows a uniform distribution between 4% and 9%, and the annual return from Property B, call
it RB , follows a uniform distribution between 2% and 12%. Assume that RA and RB are independent.
(a) What are the pdf’s of RA and RB ?
(b) Which property has a higher probability of an annual return greater than 7%?
(c) What is the probability that both RA and RB are greater than 7%?
(d) What are the expected values and population standard deviations of RA and RB ?
(e) What is P(RB > RA )?
(f) Conduct 10,000 simulations in R, making random draws of RA and RB for each simulation, to confirm your
answer to (e).
12. Two buses arrive randomly and independently at a station between 8:00am and 8:20am. The arrival time of each bus
can be modeled as a uniform distribution.
(a) If each bus stays at the station for two minutes after its arrival, what is the probability that the two buses are
simultaneously at the station at some point?
(b) Conduct 10,000 simulations in R to confirm your answer to (a).
(c) For this part, vary the amount of time t (in minutes) that the two buses remain at the station. For each integer
value t ∈ {1, 2, · · · , 9, 10}, conduct 10,000 simulations in R to approximate the probability that the two buses
are simultaneously at the station at some point. Plot the approximated probabilities versus t.
(b) Suppose the first two attempted hires are successful (H1 = 1, H2 = 1), where it is assumed that H1 and H2 are
independent. Determine the (posterior) distribution of PH conditional on H1 = 1 and H2 = 1. That is, what is
fPH (p|H1 = 1, H2 = 1) as a function of p for p ∈ [0, 1]? (Hint: Replace “H1 = 1” by “H1 = 1, H2 = 1” in the formula
from (a) and use the independence assumption.)
(c) If the first hire is successful (H1 = 1) and the second hire is unsuccessful (H2 = 0), how would your answer to
(b) change?
(d) Rather than PH ∼ U(0, 1), suppose the prior distribution is instead described by the pdf
fPH(p) = 6p(1 – p) for p ∈ [0, 1], and 0 otherwise.
i. Confirm that fPH (p) is a valid pdf.
ii. Without calculating integrals by hand, provide a graph in R that shows the prior distribution of PH and
the posterior distribution of PH conditional on the first hire being successful (H1 = 1). (Hint: Use the
formula from (a) and the integrate function to determine fPH (p|H1 = 1) for possible values of p.)
18. Conduct 10,000 simulations to approximate the expected value and the standard deviation of the investment after
two years in Example 10.31.
19. The winner of a raffle receives a payout of X 2 dollars, where X is drawn from a U(20, 40) random variable.
(a) What is the sample space associated with the payout X 2 ?
(b) What is the cdf of the payout X 2 ?
(c) What is the pdf of the payout X 2 ?
(d) What is the expected value of the payout X 2 ?
(e) *For this part, assume that the payout is X 2 – 10X instead of X 2 .
i. Confirm that the payout is an increasing function of X for the relevant values of X (20 ≤ X ≤ 40).
ii. What is the sample space associated with the payout X 2 – 10X?
iii. What is the cdf of the payout X 2 – 10X? (Hint: Use the quadratic formula to determine P(X 2 – 10X ≤ v).)
20. A cryptocurrency miner owns a computer server that “mines” cryptocurrency for 20 hours during the day (4:00am
to midnight). Let X ∈ [0, 20] denote a random variable that indicates, on a given day, how many hours the server runs
without crashing. If X < 20, the server crashes before midnight; if X = 20, the server does not crash and runs until
midnight. Suppose the cdf of X is
FX(x) =
0 for x ≤ 0
0.005x + 0.00025x² for 0 < x < 20
1 for x ≥ 20
(a) What is P(X = 20)? (Think about the jump in FX (x) that occurs at x = 20.)
(b) If the cryptocurrency miner makes the equivalent of $1,000 for each hour that the server runs, what is the
probability that the miner makes more than $16,000 on a given day?
(c) What is the pdf associated with the conditional distribution of X given 0 < X < 20?
(d) If the cryptocurrency miner makes the equivalent of $1,000 for each hour that the server runs, what is the
expected value of the amount of money made on a given day?
Definition 11.1 A normal random variable X with mean parameter µ and variance parameter σ², denoted X ∼ N(µ, σ²), has the pdf
fX(x) = (1/(σ√(2π))) e^(–(1/2)((x – µ)/σ)²) for –∞ < x < ∞.
Figure 11.1 shows the shape of the pdf fX (·) for a normal random variable X ∼ N(µ, σ 2 ). The pdf is bell-shaped and
symmetric, with a peak at the center of the distribution where x = µ. The pdf is positive for all values along the real
line since e^v is always positive, even for negative v. The pdf decreases as x moves either to the left or to the right of the
x = µ value.
The following proposition states some of the important properties of a normal random variable:
Proposition 11.1. If X ∼ N(µ, σ²) is a normal random variable, then
(i) µX = E(X) = µ
(ii) σX² = Var(X) = σ²
(iii) σX = sd(X) = σ
(iv) X is symmetric around µ, with population median τX,0.5 = µ
(v) The maximum value of the pdf fX(x) occurs at x = µ, with the pdf strictly increasing to the left of x = µ (that is, f(x1) < f(x2) if x1 < x2 < µ) and strictly decreasing to the right of x = µ (that is, f(x1) > f(x2) if µ < x1 < x2).
Properties (i) and (ii) can be verified by using the definitions of the population mean and the population variance, but the proofs are complicated due to the form of the normal pdf.²⁷ Property (iv) holds since, for any v > 0,
fX(µ + v) = (1/(σ√(2π))) e^(–(1/2)(v/σ)²) = (1/(σ√(2π))) e^(–(1/2)(–v/σ)²) = fX(µ – v).
Property (v) holds since fX(x) = (1/(σ√(2π))) e^(–(1/2)((x – µ)/σ)²) is maximized when the exponent –(1/2)((x – µ)/σ)² is equal to zero, which happens when x = µ.²⁸
Since µ and σ 2 are the population mean and the population variance, respectively, the parameters of X ∼ N(µ, σ 2 )
describe the location and dispersion of X. The top graph of Figure 11.2 shows two normal random variables with
the same population variance, where one distribution is centered at µ and the other (dotted) distribution is shifted
to the right, with the same shape and centered at a larger population mean. The bottom graph of Figure 11.2 shows
two normal random variables centered around the same population mean, where the dotted distribution has a larger
Figure 11.1
Probability density function for a normal distribution
population variance than the solid distribution. As the population variance increases, with the population mean fixed,
more probability gets shifted into the left and right tails, increasing the spread of the distribution.
The cdf FX(x) of a normal random variable X ∼ N(µ, σ²) is
FX(x0) = P(X ≤ x0) = ∫_{–∞}^{x0} (1/(σ√(2π))) e^(–(1/2)((x – µ)/σ)²) dx.
Unfortunately, there’s no simple closed-form formula for the integral needed to calculate FX (x0 ). That said, as
discussed below, R has built-in functions to calculate the pdf fX (x0 ) and the cdf FX (x0 ) for any x0 . Since the random
variable X ∼ N(µ, σ 2 ) is symmetric around µ, the following two properties of the normal cdf hold:
FX (µ) = 0.5
and
FX (µ – v) = 1 – FX (µ + v) for any v > 0.
Figure 11.3 shows the normal cdf curve (bottom graph) along with the associated normal pdf curve (top graph). The
cdf is an “S-shaped” curve. As the argument of FX (·) gets more and more negative, the value of FX (·) gets arbitrarily
close to zero but is always strictly greater than zero. Similarly, as the argument of FX (·) gets more and more positive,
the value of FX (·) gets arbitrarily close to one but is always strictly less than one.
The following R functions are useful for working with a normal random variable:
• dnorm(x, mean=0, sd=1): Returns the pdf of a normal random variable, with mean mean and standard
deviation sd, evaluated at the argument x, which may be a single number or a vector. The optional arguments
mean and sd have default values of 0 and 1, respectively.
Figure 11.2
Location and variance of normal random variables (top panel: a shift in location; bottom panel: an increase in variance)
• pnorm(x, mean=0, sd=1): Returns the cdf of a normal random variable, with mean mean and standard
deviation sd, evaluated at the argument x, which may be a single number or a vector. The optional arguments
mean and sd have default values of 0 and 1, respectively.
• rnorm(n, mean=0, sd=1): Creates a vector of n i.i.d. random draws of a normal random variable with mean mean and standard deviation sd. The optional arguments mean and sd have default values of 0 and 1, respectively.
• qnorm(p, mean=0, sd=1): Returns the population quantiles of a normal random variable, with mean mean
and standard deviation sd, specified by the argument p, which may be a single number or a vector. The optional
arguments mean and sd have default values of 0 and 1, respectively.
The ability to easily calculate the pdf and cdf of a normal random variable is particularly appealing due to the
complicated nature of the pdf and the lack of a closed-form expression for the cdf.
Figure 11.3
Cumulative distribution function for a normal random variable (shown below the associated pdf; the S-shaped cdf rises from 0 to 1, with FX(µ) = 0.5)
set.seed(1234)
dnorm(0)
## [1] 0.3989423
dnorm(1)
## [1] 0.2419707
pnorm(1)
## [1] 0.8413447
rnorm(10)
## [1] 0.1329808
dnorm(1,mean=0,sd=3)
## [1] 0.1257944
pnorm(1,mean=0,sd=3)
## [1] 0.6305587
rnorm(10,mean=0,sd=3)
## [1] -1.4315781 -2.9951593 -2.3287617 0.1933765 2.8784822 -0.3308565
## [7] -1.5330285 -2.7335862 -2.5115150 7.2475055
After the random seed is set, the next four commands in the code use the default values mean=0 and sd=1,
corresponding to X ∼ N(0, 1). In order, they return the pdf values fX (0) and fX (1), the cdf value FX (1), and ten
i.i.d. draws for X ∼ N(0, 1). The last four commands use mean=0 and sd=3, returning (in order) the pdf values
fX (0) and fX (1), the cdf value FX (1), and ten i.i.d. draws for X ∼ N(0, 9).
For a normal random variable X ∼ N(µ, σ²), three rule-of-thumb probability intervals are commonly used:
• P(µ – σ ≤ X ≤ µ + σ) ≈ 0.6827, meaning the probability that X is within one standard deviation (σ) of its mean
µ is approximately 68%.
• P(µ – 2σ ≤ X ≤ µ + 2σ) ≈ 0.9545, meaning the probability that X is within two standard deviations (2σ) of its mean
µ is approximately 95%.
• P(µ – 3σ ≤ X ≤ µ + 3σ) ≈ 0.9973, meaning the probability that X is within three standard deviations (3σ) of its
mean µ is nearly 100%, as there is only a 0.27% probability that X is more than 3σ away from µ.
Figure 11.4 shows the pdf of a normal random variable X ∼ N(µ, σ 2 ), with the gray region in the top graph indicating
P(µ – σ ≤ X ≤ µ + σ) and the gray region in the bottom graph indicating P(µ – 2σ ≤ X ≤ µ + 2σ). The probability P(µ –
σ ≤ X ≤ µ + σ) ≈ 0.6827 is equal to the area under the pdf curve between µ – σ and µ + σ, and the probability P(µ –
2σ ≤ X ≤ µ + 2σ) ≈ 0.9545 is equal to the area under the pdf curve between µ – 2σ and µ + 2σ.
A more exact 95% probability interval, commonly used for statistical inference, follows from the fact that
P(µ – 1.96σ ≤ X ≤ µ + 1.96σ) ≈ 0.9500.
There is a 95% probability that X is within 1.96 standard deviations, rather than two standard deviations, of its mean µ.
An exact 90% probability interval, also commonly used, follows from the fact that
P(µ – 1.645σ ≤ X ≤ µ + 1.645σ) ≈ 0.9000.
There is a 90% probability that X is within 1.645 standard deviations of its mean µ.
Example 11.1 (Asset returns) Suppose the annual return on an asset X is normally distributed, with X ∼
N(0.07, (0.08)2 ). The annual return has population mean µX = 0.07 (or 7%) and population standard deviation
σX = 0.08 (or 8%). Then, there is approximately a 68% probability that the annual return is between 0.07 – 0.08 = –0.01
and 0.07 + 0.08 = 0.15. Also, there is a 95% probability that the annual return is between 0.07 – (1.96)(0.08) = –0.0868
and 0.07 + (1.96)(0.08) = 0.2268.
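These endpoints can be computed directly in R; a brief sketch, passing the mean and standard deviation to qnorm:
0.07 + c(-1, 1)*0.08                         # approximate 68% interval
qnorm(c(0.025, 0.975), mean=0.07, sd=0.08)   # 95% interval, approx (-0.0868, 0.2268)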
Definition 11.2 A standard normal random variable, often denoted Z, is a normal random variable with µ = 0,
σ² = 1, and σ = 1. That is, Z ∼ N(0, 1).
Plugging µ = 0 and σ = 1 into the pdf formula from Definition 11.1, the standard normal Z ∼ N(0, 1) has pdf
fZ(z) = (1/√(2π)) e^(–z²/2) for –∞ < z < ∞.
The standard normal distribution Z ∼ N(0, 1) is used so often that its pdf and cdf are often represented by special
notation. The standard normal pdf is denoted φ(·), with
φ(z) = fZ (z),
and the standard normal cdf is denoted Φ(·), with
Φ(z) = FZ (z).
Figure 11.4
Probability intervals for a normal random variable
Figure 11.5 shows the pdf curve for a standard normal random variable Z ∼ N(0, 1). The distribution is symmetric
around zero, with φ(–v) = φ(v) and Φ(–v) = 1 – Φ(v) for all v > 0. The peak of the standard normal distribution occurs
at z = 0, with φ(0) ≈ 0.3989.
For the standard normal random variable, a 95% probability interval is
P(–1.96 ≤ Z ≤ 1.96) ≈ 0.9500,
and a 90% probability interval is
P(–1.645 ≤ Z ≤ 1.645) ≈ 0.9000.
Where do the values 1.96 and 1.645 come from? The qnorm function can be used to confirm that –1.96 and 1.96 are,
respectively, the 2.5% and 97.5% quantiles of the N(0, 1) distribution and that –1.645 and 1.645 are, respectively, the
5% and 95% quantiles of the N(0, 1) distribution.
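For instance:
qnorm(c(0.025, 0.975))
## [1] -1.959964  1.959964
qnorm(c(0.05, 0.95))
## [1] -1.644854  1.644854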
Figure 11.5
Probability density function for a N(0, 1) random variable
pnorm(1.96)
## [1] 0.9750021
pnorm(1.96)-pnorm(-1.96)
## [1] 0.9500042
pnorm(-1.645)
## [1] 0.04998491
pnorm(1.645)
## [1] 0.9500151
pnorm(1.645)-pnorm(-1.645)
## [1] 0.9000302
Due to the symmetry of the normal distribution, there is a 2.5% probability that Z < –1.96 and a 2.5% probability that
Z > 1.96. Therefore, the value 1.96 is the population 97.5% quantile of Z,
Φ(1.96) = 0.975,
leaving 2.5% probability in the tail to the right of 1.96. Similarly, for the 90% probability interval (–1.645, 1.645),
there is a 5% probability that Z < –1.645 and a 5% probability that Z > 1.645. The value 1.645 is the population 95%
quantile of Z,
Φ(1.645) = 0.95,
leaving 5% probability in the tail to the right of 1.645.
This approach can be used for any symmetric probability interval of the standard normal random variable. For
instance, for a 70% probability interval, we need to find the value c such that the (–c, c) interval is associated with a
15% probability that Z < –c and a 15% probability that Z > c. The appropriate value of c is the population 85% quantile
of Z, which can be found in R:
qnorm(0.85)
## [1] 1.036433
The population 85% quantile is approximately 1.036, so that Φ(1.036) ≈ 0.85, Φ(–1.036) ≈ 0.15, and
P(–1.036 ≤ Z ≤ 1.036) ≈ 0.70.
Similarly, for a 80% probability interval, we find the population 90% quantile of Z, leaving 10% in the right tail:
qnorm(0.90)
## [1] 1.281552
The population 90% quantile is approximately 1.282, so that Φ(1.282) ≈ 0.90, Φ(–1.282) ≈ 0.10, and
P(–1.282 ≤ Z ≤ 1.282) ≈ 0.80.
and
P(a ≤ X ≤ b) = P((a – µ)/σ ≤ Z ≤ (b – µ)/σ) = Φ((b – µ)/σ) – Φ((a – µ)/σ).
Example 11.2 (Asset returns) Example 11.1 considered annual asset returns given by the random variable X ∼
N(0.07, (0.08)2 ). The normal random variable X can be standardized by de-meaning it and dividing by its standard
deviation:
Z = (X – 0.07)/0.08 ∼ N(0, 1).
If Z = 1.5, X is 1.5 standard deviations above its mean and X = 0.19. If Z = –2.5, X is 2.5 standard deviations below its
mean and X = –0.13. The probability of a positive annual return is
P(X > 0) = P((X – 0.07)/0.08 > (0 – 0.07)/0.08) = P(Z > –0.875) = 1 – Φ(–0.875).
For the first equality, we subtract 0.07 and then divide by 0.08 for both X (on the left of the >) and 0 (on the right of
the >) so that the probability remains the same. Then, we calculate 1 – Φ(–0.875) ≈ 0.8092:
1-pnorm(-0.875)
## [1] 0.809213
Alternatively, since 1 – Φ(–0.875) = Φ(0.875) by symmetry of Z, pnorm(0.875) would give the same answer.
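Another equivalent route is to leave the standardization to R's optional arguments:
1-pnorm(0, mean=0.07, sd=0.08)
## [1] 0.809213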
Using the results on probability intervals for Z ∼ N(0, 1) in Section 11.1.2, probability intervals for general X ∼
N(µ, σ 2 ), including the rule-of-thumb intervals from Section 11.1.1, can be constructed. These intervals are based
upon X = µ + σZ being a linear transformation of Z. For example, for a 95% probability interval,
0.95 = P(–1.96 ≤ Z ≤ 1.96) = P(µ – 1.96σ ≤ µ + σZ ≤ µ + 1.96σ) = P(µ – 1.96σ ≤ X ≤ µ + 1.96σ),
where the second equality is obtained by multiplying each of the three terms within the probability by σ and then
adding µ to each term. Similarly, the 90% probability interval for X is
0.90 = P(–1.645 ≤ Z ≤ 1.645) = P(µ – 1.645σ ≤ X ≤ µ + 1.645σ).
For other symmetric probability intervals of X, centered around µ, the constant that provides the desired probability
can be determined. As an example, suppose we want an 85% probability interval for X, leaving 7.5% probability less
than the lower end of the interval and 7.5% probability greater than the upper end of the interval. For the standard
normal, the 92.5% quantile is equal to 1.440, so that Φ(1.440) = 0.925 and Φ(–1.440) = 0.075.
qnorm(0.925)
## [1] 1.439531
Therefore,
P(–1.440 ≤ Z ≤ 1.440) = 0.85,
and, equivalently,
P(µ – 1.440σ ≤ X ≤ µ + 1.440σ) = 0.85.
Using the parameters from Example 11.2 (µ = 0.07, σ = 0.08), there is an 85% probability that X (annual asset return)
is between 0.07 – (1.440)(0.08) = –0.0452 and 0.07 + (1.440)(0.08) = 0.1852.
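The same endpoints can be obtained in one step by passing the mean and standard deviation to qnorm (a small sketch using the parameters above):
qnorm(c(0.075, 0.925), mean=0.07, sd=0.08)   # approximately -0.0452 and 0.1852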
1-pnorm(-2.5/sqrt(17))
## [1] 0.7278552
Example 11.4 (Two-asset portfolio) Consider two assets whose annual returns X and Y are described by the following
two normal distributions:
X ∼ N(0.07, (0.08)2 ) and Y ∼ N(0.03, (0.01)2 ).
Think of asset X as a riskier asset, like a stock mutual fund, with a higher average return but more risk due to the
higher variance. Think of asset Y as a less risky asset, like a bond mutual fund, with a lower average return but less
risk. Suppose a portfolio is constructed by investing half of the money in asset X and half of the money in asset Y, so
that the returns on the two-asset portfolio are
V = 0.5X + 0.5Y.
If the asset returns are uncorrelated (ρXY = 0), then
µV = 0.5µX + 0.5µY = (0.5)(0.07) + (0.5)(0.03) = 0.05,
σV² = (0.5)²σX² + (0.5)²σY² = (0.5)²(0.08)² + (0.5)²(0.01)² = 0.001625,
and
σV = √0.001625 ≈ 0.0403.
Thus, V ∼ N(0.05, 0.001625) when ρXY = 0. The average return for the two-asset portfolio V is exactly at the
midpoint of the averages of the two returns since the portfolio is equally weighted. The standard deviation and variance
of V indicate that V is less risky than asset X and more risky than asset Y.
What if the asset returns X and Y are positively correlated instead, say with ρXY = 0.2? The population mean of
V = 0.5X + 0.5Y is unchanged, and the population variance of V becomes
σV² = (0.5)²σX² + (0.5)²σY² + 2(0.5)(0.5)σXY = (0.5)²(0.08)² + (0.5)²(0.01)² + (2)(0.5)(0.5)(0.2)(0.08)(0.01) = 0.001705,
using σXY = ρXY σX σY. The population standard deviation of V is σV = √0.001705 ≈ 0.0413, which is 2.5% higher than
the standard deviation of 0.0403 for the case of no asset correlation. The positive correlation between X and Y leads to
a tendency for the asset returns to move together, resulting in higher variance than when X and Y are uncorrelated.
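As a quick check, both portfolio standard deviations can be computed in R (a small sketch using the parameter values above):
w <- 0.5
sqrt(w^2*(0.08)^2 + (1-w)^2*(0.01)^2)                              # rho = 0, approx 0.0403
sqrt(w^2*(0.08)^2 + (1-w)^2*(0.01)^2 + 2*w*(1-w)*0.2*0.08*0.01)    # rho = 0.2, approx 0.0413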
Even though the average return for X is greater than the average return for Y, it is possible that the realized Y
value is larger than the realized X value due to the variances of the returns. The probability P(Y > X) can be written as
P(Y > X) = P(Y – X > 0), which is a probability in terms of a linear combination. The difference Y – X has population
mean 0.03 – 0.07 = –0.04 and population variance (0.01)² + (0.08)² – (2)(0.2)(0.08)(0.01) = 0.00618, so that
P(Y > X) = P(Y – X > 0) = P((Y – X – (–0.04))/√0.00618 > (0 – (–0.04))/√0.00618) = P(Z > 0.04/√0.00618),
which is approximately 0.3054 or 30.54%.
1-pnorm(0.04/sqrt(0.00618))
## [1] 0.3054386
The population mean and population standard deviation can also be determined for a two-asset portfolio with
unequal weights. Figure 11.6 summarizes the population mean and standard deviation for V = wX + (1 – w)Y, with
weights ranging from w = 0 to w = 1. There are black dots shown for portfolios with w = 0.2, w = 0.5, and w = 0.8. As
[Curve tracing the population mean (x-axis) against the population standard deviation of portfolio return (y-axis) as w moves from w = 0 to w = 1, with points marked at w = 0.2, w = 0.5, and w = 0.8.]
Figure 11.6
Population mean and standard deviation for weighted two-asset portfolios
w moves from 0 to 1, the mean and standard deviation move from the values associated with the Y asset to the values
associated with the X asset.
Proposition 11.4 generalizes to more than two normal random variables. Specifically, the linear combination of any
number of normal random variables is also a normal random variable. For example, when V = X1 + X2 + X3 for normal
random variables X1 , X2 , and X3 , Proposition 11.4 implies that the linear combination X1 + X2 is normally distributed
and, then, also that the linear combination of X1 + X2 and X3 is also normally distributed. This type of reasoning can
be extended to additional random variables.
Proposition 11.5. If X1, X2, …, Xm are normal random variables with population means µ1, µ2, …, µm and
population variances σ1², σ2², …, σm², respectively, then
V = k + a1X1 + a2X2 + · · · + amXm
is a normal random variable with population mean µV = k + a1µ1 + a2µ2 + · · · + amµm.
The variance expression for the general linear combination is more complicated, as it depends upon the (possibly
non-zero) covariances between the random variables (see Proposition 8.7). For the case of independent normal random
variables, the variance expression for the linear combination simplifies considerably, and the following proposition
provides a complete specification of the distribution of the linear combination:
Proposition 11.6. If X1, X2, …, Xm are independent normal random variables with population means µ1, µ2, …, µm
and population variances σ1², σ2², …, σm², respectively, then
V = k + a1X1 + a2X2 + · · · + amXm
is a normal random variable, with
V ∼ N(k + a1µ1 + a2µ2 + · · · + amµm, a1²σ1² + a2²σ2² + · · · + am²σm²).
(i) For a sum of i.i.d. normal random variables, V = X1 + X2 + · · · + Xm, with each Xj ∼ N(µ, σ²),
V ∼ N(mµ, mσ²) and σV = √m · σ.
(ii) For an average of i.i.d. normal random variables, V = (1/m)(X1 + X2 + · · · + Xm), with each Xj ∼ N(µ, σ²),
V ∼ N(µ, σ²/m) and σV = σ/√m.
Example 11.5 (Monthly sales) Assume that the monthly sales M at Hayden’s Hardware, measured in thousands of
dollars, are i.i.d. and normally distributed, M ∼ N(10, 4). The total sales in a given year are Y = M1 + M2 + · · · + M12,
where each Mj ∼ N(10, 4) is i.i.d. The total sales in a given year are normally distributed with
Y ∼ N(120, 48) and σY = √48 ≈ 6.928.
There is a 90% probability that Y is in the 120 ± (1.645)(6.928) interval, which is (108.6, 131.4). There is a 95%
probability that Y is in the 120 ± (1.96)(6.928) interval, which is (106.4, 133.6).
The average monthly sales in a given year, A = (1/12)(M1 + M2 + · · · + M12), is normally distributed with
A ∼ N(10, 4/12) and σA = 1/√3 ≈ 0.577.
There is a 90% probability that A is in the 10 ± (1.645)(0.577) interval, which is (9.05, 10.95). There is a 95%
probability that A is in the 10 ± (1.96)(0.577) interval, which is (8.87, 11.13). The standard deviation of the average
monthly sales in a given year (0.577) is considerably lower than the standard deviation of monthly sales in any given
month. Averaging over 12 months leads to a lower variance and less dispersion. If the average is taken over more
months, the dispersion would continue to decrease. With an average taken over 24 months, the standard deviation is
1/√6 ≈ 0.408. With an average taken over 36 months, the standard deviation is 1/3 ≈ 0.333.
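The 90% and 95% intervals above can be verified with qnorm; a brief sketch:
120 + qnorm(c(0.05, 0.95))*sqrt(48)      # 90% interval for Y, approx (108.6, 131.4)
10 + qnorm(c(0.025, 0.975))*sqrt(4/12)   # 95% interval for A, approx (8.87, 11.13)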
Definition 11.3 A positive-valued random variable X is a log-normal random variable, with parameters µ and σ 2 ,
if ln(X) ∼ N(µ, σ 2 ).
The function ln(·) is the natural log function, satisfying e^ln(x) = x for all x > 0. The possible outcomes for X are all
positive real numbers since e^ln(x) > 0 for any possible value of ln(x). Figure 11.7 shows an example of a pdf curve for
a log-normal random variable. The pdf curve is always positive for x > 0 and is clearly asymmetric and right-skewed.
(The pdf is equal to zero for x ≤ 0.) The population median of X is equal to e^µ since the population median of ln(X) is
Figure 11.7
Probability density function for a log-normal random variable
µ. Since ln(·) is an increasing function of its argument, this fact follows from
P(X < e^µ) = P(ln(X) < µ) = 0.5.
Due to the right-skewness of the log-normal random variable, the population mean of X is greater than the population
median e^µ.
Example 11.6 (Weekly earnings) Consider the earnwk variable for employed individuals from the cps dataset.
Example 6.9 showed the right-skewed sample distribution of weekly earnings. The top graph in Figure 11.8 is the
histogram of weekly earnings, this time with a normal distribution (solid curve) drawn over the histogram. The graphed
normal distribution has population mean equal to the sample mean of earnwk and population variance equal to the
sample variance of earnwk, as that distribution would match the histogram pretty closely if the earnwk values were
truly drawn from a normal distribution. It’s evident that the normal distribution does not fit the histogram of earnwk
well, as the histogram is too right-skewed to be matched by the normal distribution. The bottom graph in Figure 11.8
shows the histogram of ln(earnwk), with a normal distribution (solid curve) drawn over the histogram. In this case,
the population mean and variance are chosen to be the sample mean and variance of ln(earnwk) rather than earnwk.
The histogram of ln(earnwk) looks much more symmetric than the histogram of earnwk, with the right tail no longer
evident. The normal distribution seems to provide a pretty good fit to the histogram, certainly much better than in the
top graph. Taken together, these graphs suggest that the normal distribution might provide a good model for ln(earnwk)
but not earnwk, meaning a log-normal model for earnwk is more sensible than a normal model for earnwk.
The full script that creates Figure 11.8 is available on the companion website; a minimal sketch of the approach, assuming a data frame cps (restricted to employed individuals with non-missing earnwk) has been loaded, is the following:
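# top graph: histogram of weekly earnings with a fitted normal curve overlaid
hist(cps$earnwk, freq=FALSE, xlab="weekly earnings", main="")
curve(dnorm(x, mean=mean(cps$earnwk), sd=sd(cps$earnwk)), add=TRUE)
# bottom graph: histogram of log weekly earnings with a fitted normal curve overlaid
hist(log(cps$earnwk), freq=FALSE, xlab="ln(weekly earnings)", main="")
curve(dnorm(x, mean=mean(log(cps$earnwk)), sd=sd(log(cps$earnwk))), add=TRUE)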
Figure 11.8
Distributions of weekly earnings and log weekly earnings
For a log-normal random variable X with ln(X) ∼ N(µ, σ²), the cdf at any x0 > 0 is
FX(x0) = P(X ≤ x0) = P(ln(X) ≤ ln(x0)) = Fln(X)(ln(x0)).
The second equality holds since ln(·) is a strictly increasing function. Since ln(X) ∼ N(µ, σ²) is a normal random
variable, Fln(X)(ln(x0)) is the cdf of a N(µ, σ²) random variable evaluated at ln(x0).
Then, the probability that X is in the interval [a, b], for 0 < a < b, can also be written in terms of the normal cdf:
P(a ≤ X ≤ b) = P(ln(a) ≤ ln(X) ≤ ln(b)) = Fln(X) (ln(b)) – Fln(X) (ln(a)).
Using this relationship, 90% and 95% probability intervals for X can be constructed:
P(µ – 1.645σ ≤ ln(X) ≤ µ + 1.645σ) ≈ 0.90 =⇒ P(e^(µ–1.645σ) ≤ X ≤ e^(µ+1.645σ)) ≈ 0.90.
There is a 90% probability that X is between e^(µ–1.645σ) and e^(µ+1.645σ), with a 5% probability that X is below e^(µ–1.645σ)
and a 5% probability that X is above e^(µ+1.645σ). There is a 95% probability that X is between e^(µ–1.96σ) and e^(µ+1.96σ),
with a 2.5% probability that X is below e^(µ–1.96σ) and a 2.5% probability that X is above e^(µ+1.96σ). Unlike the probability
intervals for a normal random variable, these probability intervals are asymmetric in the sense that the endpoints are not
equidistant from the population median e^µ. For example, the difference e^µ – e^(µ–1.96σ) is not the same as e^(µ+1.96σ) – e^µ.
This approach can be used for any probability interval. For instance, for a 70% probability interval, with 15%
probabilities each in the left tail and the right tail, Φ(1.036) ≈ 0.85 implies
P(µ – 1.036σ ≤ ln(X) ≤ µ + 1.036σ) ≈ 0.70 =⇒ P(e^(µ–1.036σ) ≤ X ≤ e^(µ+1.036σ)) ≈ 0.70.
Example 11.7 (Weekly earnings) Assume that weekly earnings X are log-normally distributed with ln(X) ∼
N(6.5, (0.7)2 ). These parameters roughly correspond to the log-normal distribution shown in Figure 11.8. Then, a
95% probability interval for X is
P(e^(6.5–1.96(0.7)) ≤ X ≤ e^(6.5+1.96(0.7))) = P(168.68 ≤ X ≤ 2622.81) = 0.95.
There is a 95% probability that weekly earnings are between $168.68 and $2,622.81, a 2.5% probability that weekly
earnings are less than $168.68, and a 2.5% probability that weekly earnings are greater than $2,622.81.
The following R functions are useful for working with a log-normal random variable:
• dlnorm(x, meanlog=0, sdlog=1): Returns the pdf of a log-normal random variable evaluated at the
argument x, which may be a single number or a vector. The optional arguments meanlog and sdlog have default
values of 0 and 1, respectively, and represent the mean and standard deviation of the natural log of the random
variable.
• plnorm(x, meanlog=0, sdlog=1): Returns the cdf of a log-normal random variable evaluated at the
argument x, which may be a single number or a vector. The optional arguments meanlog and sdlog have default
values of 0 and 1, respectively, and represent the mean and standard deviation of the natural log of the random
variable.
• rlnorm(n, meanlog=0, sdlog=1): Creates a vector of n i.i.d. random draws of a log-normal random
variable. The optional arguments meanlog and sdlog have default values of 0 and 1, respectively, and represent
the mean and standard deviation of the natural log of the random variable.
• qlnorm(p, meanlog=0, sdlog=1): Returns the population quantiles of a log-normal random variable
specified by the argument p, which may be a single number or a vector. The optional arguments meanlog and
sdlog have default values of 0 and 1, respectively, and represent the mean and standard deviation of the natural
log of the random variable.
For instance, although the probability intervals in Example 11.7 were determined through the use of the normal
distribution, they can also be calculated directly in R based upon the quantiles of the log-normal distribution:
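# a sketch of the calculation, using meanlog = 6.5 and sdlog = 0.7 from Example 11.7
qlnorm(c(0.05, 0.95), meanlog=6.5, sdlog=0.7)    # 90% interval
qlnorm(c(0.025, 0.975), meanlog=6.5, sdlog=0.7)  # 95% interval, approx (168.68, 2622.81)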
For the same distribution ln(X) ∼ N(6.5, (0.7)²), the following code calculates the population mean of X, which is
e^(µ+σ²/2) for a log-normal random variable, and the population standard deviation of X, which is e^(µ+σ²/2)√(e^(σ²) – 1):
exp(6.5+0.5*(0.7)^2)
## [1] 849.7991
exp(6.5+0.5*(0.7)^2)*sqrt(exp(0.7^2)-1)
## [1] 675.7459
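The probability of weekly earnings above $1,000 and i.i.d. draws from this distribution can be obtained with plnorm and rlnorm; a brief sketch:
1 - plnorm(1000, meanlog=6.5, sdlog=0.7)   # P(X > 1000), approximately 0.28
rlnorm(10, meanlog=6.5, sdlog=0.7)         # ten i.i.d. draws of X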
Definition 11.4 A random variable X is a chi-square random variable, denoted X ∼ χ² or X ∼ χ²₁, if X = Z² and
Z ∼ N(0, 1).
Figure 11.9
Probability density functions for chi-square random variables
All possible outcomes of X ∼ χ²₁ are non-negative since X = Z². This random variable is a special case of a more
general chi-square random variable, which involves the sum of squared independent standard normals. Specifically,
if Z1, Z2, …, Zm are i.i.d. N(0, 1) random variables, Z1² + Z2² + · · · + Zm² is said to have a chi-square distribution with m
degrees of freedom.
Definition 11.5 A random variable X is a chi-square random variable with m degrees of freedom, denoted X ∼ χ²ₘ,
if X = Z1² + Z2² + · · · + Zm² and Z1, Z2, …, Zm are i.i.d. N(0, 1).
Figure 11.9 shows the distributions for four different chi-square random variables, corresponding to 4 degrees of
freedom (χ²₄), 6 degrees of freedom (χ²₆), 8 degrees of freedom (χ²₈), and 10 degrees of freedom (χ²₁₀). The x-axis has
been arbitrarily cut off at 20, but each of the four distributions has a right tail that extends forever. The four distributions
are all right-skewed. And, as the value for the degrees of freedom increases, both the mean and the dispersion of the
distributions increase. This last feature should not be surprising since every time we increase the degrees of freedom
we are adding additional Zj² terms to the random variable. In fact, it turns out that, for X ∼ χ²ₘ, the population mean
and variance are µX = m and σX² = 2m, respectively.
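These two population values can be checked by simulation; a small sketch for the χ²₆ case:
set.seed(1234)
draws <- rchisq(100000, df=6)   # 100,000 i.i.d. chi-square draws with 6 df
mean(draws)                     # close to m = 6
var(draws)                      # close to 2m = 12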
The following R functions are useful for working with chi-square random variables:
• dchisq(x, df): Returns the pdf of a chi-square random variable with df degrees of freedom evaluated at the
argument x, which may be a single number or a vector.
• pchisq(x, df): Returns the cdf of a chi-square random variable with df degrees of freedom evaluated at the
argument x, which may be a single number or a vector.
• rchisq(n, df): Creates a vector of n i.i.d. random draws of a chi-square random variable with df degrees of
freedom.
• qchisq(p, df): Returns the population quantiles of a chi-square random variable with df degrees of freedom
specified by the argument p, which may be a single number or a vector.
qchisq(0.90,4)
## [1] 7.77944
qchisq(0.90,6)
## [1] 10.64464
qchisq(0.90,8)
## [1] 13.36157
qchisq(0.90,10)
## [1] 15.98718
pchisq((1.96)^2,1)
## [1] 0.9500042
pchisq((1.645)^2,1)
## [1] 0.9000302
The qchisq commands calculate the population 90% quantiles associated with the four chi-square distributions
shown in Figure 11.9. As expected, the population 90% quantiles increase as the degrees of freedom increase, reflecting
a higher likelihood to have values farther out in the right tail. The pchisq commands illustrate a connection between
the cdf of a χ²₁ random variable and the cdf of a N(0, 1) random variable. Whereas 1.96 is the population 97.5%
quantile of N(0, 1), the pchisq((1.96)^2,1) command confirms that the value (1.96)² is the population 95%
quantile of χ²₁. This result arises since, if X ∼ χ²₁ and Z ∼ N(0, 1),
P(X < (1.96)²) = P(Z² < (1.96)²) = P(–1.96 < Z < 1.96) ≈ 0.95,
where the second equality follows from the fact that Z² < (1.96)² can only happen when |Z| < 1.96. Similarly, with
1.645 being the population 95% quantile of N(0, 1), the pchisq((1.645)^2,1) command confirms that the value
(1.645)² is the population 90% quantile of χ²₁, which follows from
P(X < (1.645)²) = P(Z² < (1.645)²) = P(–1.645 < Z < 1.645) ≈ 0.90.
Since the exponential model usually concerns time-related events, it is an example of a duration model. In the
examples above, the exponential model can be thought of as modeling the duration of the website visit, the duration of
the worker strike, or the duration of the customer service phone call. The formal definition of an exponential random
variable is the following:
Definition 11.6 An exponential random variable X with parameter θ > 0, written X ∼ Exp(θ), is a positive-valued
random variable with pdf
fX (x) = θe–θx for x > 0.
The parameter θ describes how quickly the underlying event is expected to occur. Larger values of θ correspond
to events that are expected to occur more quickly (i.e., shorter durations), whereas smaller values of θ correspond to
events that are not expected to occur quickly (i.e., longer durations). Since it turns out that the population mean of X
is µX = E(X) = 1/θ, the value 1/θ can be thought of as the expected time until the event occurs. For example, for θ = 1, the
expected time until the event occurs is 1 time unit, whereas for θ = 0.2, the expected time until the event occurs is 5
time units.
The cdf FX(·) for an exponential random variable X is obtained by integrating the pdf. For any value x0 > 0,
FX(x0) = ∫_0^{x0} θe^(–θx) dx = [–e^(–θx)]_0^{x0} = 1 – e^(–θx0).
Figure 11.10 graphs three different pdf curves corresponding to three different values of the θ parameter, with θ = 1
in the left graph, θ = 0.5 in the middle graph, and θ = 0.2 in the right graph. Each of these three pdf’s appears to
be a strictly decreasing function of x, which is a property of any exponential random variable since the derivative
f′X(x) = –θ²e^(–θx) < 0 for all x > 0. Regardless of the value of the parameter θ, it is always more likely for an exponential
random variable to have smaller values than larger values; for example, the probability that X is in (0, 1) is larger than
the probability that X is in (1, 2). As the value of θ decreases, moving from left to right in Figure 11.10, the height of
the pdf near zero decreases and the right tail becomes thicker.
The population descriptive statistics for an exponential random variable are given in the following proposition.³⁰
Proposition 11.8. If X ∼ Exp(θ), the population mean of X is µX = 1/θ, the population variance of X is σX² = 1/θ²,
and the population standard deviation of X is σX = 1/θ.
Example 11.8 (Duration of website visit) For a particular website, assume that the number of minutes that any given
visitor spends on the website, before leaving, is an exponential random variable X ∼ Exp(0.5). The expected duration
of the website visit, µX, is 1/0.5 = 2 minutes. The standard deviation of the duration of the website visit is also 2
minutes. The cdf, derived above as FX(x) = 1 – e^(–θx), can be used to calculate probabilities of intervals. For example,
the probability that the duration of the website visit is between 1 and 2 minutes is
FX(2) – FX(1) = (1 – e^(–(2)(0.5))) – (1 – e^(–(1)(0.5))) = e^(–0.5) – e^(–1) ≈ 0.239.
The largest probability for a one-minute interval is for a duration between 0 and 1, with probability e^(–0) – e^(–0.5) ≈ 0.393.
The following R functions are useful for working with exponential random variables:
Figure 11.10
Probability density functions for exponential random variables
• dexp(x, rate=1): Returns the pdf of an exponential random variable with rate θ equal to rate evaluated at
the argument x, which may be a single number or a vector.
• pexp(x, rate=1): Returns the cdf of an exponential random variable with rate θ equal to rate evaluated at
the argument x, which may be a single number or a vector.
• rexp(n, rate=1): Creates a vector of n i.i.d. random draws of an exponential random variable with rate θ
equal to rate.
• qexp(p, rate=1): Returns the population quantiles of an exponential random variable with rate θ equal to
rate, specified by the argument p, which may be a single number or a vector.
pexp(1,rate=0.5)-pexp(0,rate=0.5)
## [1] 0.3934693
pexp(2,rate=0.5)-pexp(1,rate=0.5)
## [1] 0.2386512
pexp(3,rate=0.5)-pexp(2,rate=0.5)
## [1] 0.1447493
set.seed(1234)
rexp(10,rate=0.5)
## [1] 5.00351721 0.49351777 0.01316391 3.48549218 0.77436517 0.17989934
## [7] 1.64816303 0.40523580 1.67608064 1.52086060
temp <- rexp(100000,rate=0.5)
mean(temp)
## [1] 1.998902
sd(temp)
## [1] 1.983554
The first rexp command simulates ten i.i.d. draws from an exponential random variable with θ = 0.5. The second
use of rexp, in the assignment of the variable temp, simulates 100,000 draws, and the values of mean(temp) and
sd(temp) are both very close to the population mean and standard deviation of 1/θ = 2.
One potential drawback of the exponential model is that the exponential pdf is a strictly decreasing function of x,
regardless of the parameter value θ. This feature means that the assumption of an exponential random variable implies
that shorter durations are always more likely than longer durations. To relax this restriction, a more flexible duration
model would be needed. An example of such a model is the Weibull model, which is a two-parameter model that
generalizes the exponential model. Specifically, a Weibull random variable X has the pdf
fX(x) = αθ(θx)^(α–1) e^(–(θx)^α) for x > 0,
where the two parameters α and θ are both positive.³¹ The exponential pdf is a special case, corresponding to α = 1.
Other values of α lead to pdf shapes different from those associated with the exponential model. For example, for
certain θ and α values, the Weibull pdf can increase until reaching a peak and then decrease afterwards, which is a
more appropriate model if the most likely durations are not close to zero but rather at some other value.
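As a quick check of the special case, the Weibull pdf with α = 1 matches the exponential pdf; note that R’s dweibull uses a scale parameter equal to 1/θ (here θ = 0.5, so scale = 2):
dexp(2, rate=0.5)
## [1] 0.1839397
dweibull(2, shape=1, scale=1/0.5)
## [1] 0.1839397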
For a sequence of i.i.d. exponential random variables X1, X2, X3, … representing the times between successive events,
the number of events that have occurred by an elapsed time T is determined as follows: if T is less than X1, the event
count is 0; if T is between X1 and X1 + X2, the event count is 1; if T is between X1 + X2 and X1 + X2 + X3, the event
count is 2; and so on.³²
Figure 11.11
Sequence of exponential random variables
To illustrate the connection between exponential random variables and Poisson random variables, we re-visit the
example of customers visiting a coffee shop (Example 9.6). In that example, the situation was modeled in terms of the
number or count of customers, assumed to be a Poisson random variable. Here, we instead model the customer arrival
times as exponential random variables and, using Proposition 11.9, infer that the number of customers is a Poisson
random variable. The advantage of this approach is that we can say something about the distributions of both the arrival
times between customers (exponential random variables) and the count of customers (a Poisson random variable).
Example 11.9 (Coffee shop customers) Suppose the arrival time, in hours, of a new customer at a coffee shop (since
the last customer arrived) is an Exp(20) random variable, so that the average arrival time is 1/20 = 0.05 hours or 3
minutes. If it is assumed that the arrival time of each successive customer is also an i.i.d. Exp(20) random variable,
Proposition 11.9 implies that the number of customers that arrives over the course of an hour is a Poisson(20) random
variable. Likewise, the number of customers that arrive over the course of two hours is a Poisson(40) random variable.
If we are interested in the arrival times themselves, rather than the count, the distribution associated with the Exp(θ)
random variable can be used for any single arrival time (e.g., between customers in Example 11.9). How about the
time that it takes two customers to arrive? This time would be a draw from the random variable given by the sum of
two i.i.d. Exp(θ) random variables. For example, the time for the first two customers to arrive is a draw from X1 + X2 .
Unfortunately, X1 + X2 is a more complicated distribution (not an exponential), but we can use computer simulations
to approximate this distribution and its properties (mean, standard deviation, etc). Similarly, the time for the first
three customers to arrive is a draw from X1 + X2 + X3 , whose distribution can again be approximated via computer
simulations. Perhaps not surprisingly, as seen in Example 10.29 and Figure 10.12 (for the average of exponential
random variables), the shape of the distribution of the sum of exponential random variables begins to look bell-shaped
as the number of random variables in the sum increases.³³
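A minimal sketch of such a simulation, using the coffee-shop arrival rate θ = 20 from Example 11.9:
set.seed(1234)
# time for the first two customers to arrive: draws from X1 + X2
twosum <- rexp(100000, rate=20) + rexp(100000, rate=20)
mean(twosum)   # population mean is 2/theta = 0.1
sd(twosum)     # population sd is sqrt(2)/theta, approximately 0.0707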
Definition 11.7 A mixture distribution is the probability distribution associated with a random variable X that is
based upon a collection of underlying random variables Y1, Y2, …, Ym using the following two-step process: (i) a
random variable is selected at random from the collection Y1, Y2, …, Ym according to the probabilities π1, π2, …, πm,
where Σ_{j=1}^{m} πj = 1, and (ii) the realized value of X is the realized value of the selected random variable.
To focus on mixtures of normal random variables, we provide the following definition as a special case of
Definition 11.7:
Definition 11.8 A mixture of normal random variables is a random variable with a mixture distribution based upon
normal random variables Y1 , Y2 , …, Ym .
In the following example, we re-visit the data-analyst salary example (Example 10.16), first noting that the original
example involved a mixture of uniform random variables and then considering a different mixture distribution based
upon normal random variables instead of uniform random variables:
Example 11.10 (Data analyst salaries) Example 10.16 considered the salaries of data analysts at a large firm, where
the salaries for non-graduate-degree data analysts and graduate-degree data analysts were modeled as different
uniform random variables. Using the notation from Definition 11.7, let Y1 ∼ U(60, 100) and Y2 ∼ U(90, 210) denote
these two random variables, respectively. If the probability that a data analyst at the firm has a graduate degree is
20%, the random variable for data-analyst salaries is a mixture of Y1 and Y2 with probabilities π1 = 0.8 and π2 = 0.2.
Now, suppose the two salary distributions are modeled as normal random variables rather than uniform random
variables, specifically with
Y1 ∼ N(80, 10²) and Y2 ∼ N(150, 30²),
still with π1 = 0.8 and π2 = 0.2. To visualize the mixture distribution X, based upon Y1 and Y2 with probabilities 0.8
and 0.2, respectively, the following R code simulates 1,000,000 draws of X:
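# a sketch of the setup assumed by the construction below (values from Example 11.10)
y1 <- rnorm(1000000, mean=80, sd=10)    # 1,000,000 draws of Y1 ~ N(80, 10^2)
y2 <- rnorm(1000000, mean=150, sd=30)   # 1,000,000 draws of Y2 ~ N(150, 30^2)
temp <- runif(1000000)                  # U(0,1) draws used to select Y1 or Y2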
# construct the mixture random variable, with probs 0.8 and 0.2
salary <- (temp<=0.8)*y1 + (temp>0.8)*y2
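plot(density(salary))          # smoothed density plot of the simulated mixture draws
abline(v=c(80,150), lty=3)     # dotted lines at the means of Y1 and Y2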
The temp vector consists of U(0, 1) draws that are used to determine whether Y1 or Y2 is the chosen random
variable for a given draw of X. Y1 is chosen with probability 0.8, or equivalently when the corresponding element
of temp is less than or equal to 0.8, whereas Y2 is chosen when the corresponding element of temp is greater than
0.8. The salary assignment command stores the full vector of simulated X draws. Figure 11.12 shows the smoothed
density plot output by the R code, with dotted lines drawn at the values of 80 and 150, corresponding to the means of Y1
and Y2 , respectively. The resulting distribution has two humps or modes at approximately 80 and 150. The hump near
80 is much higher than the one near 150 since the probability (80%) that X is drawn from the Y1 random variable is
so much higher than the probability (20%) that X is drawn from the Y2 random variable. To see the effect of changing
the probabilities of the two underlying distributions, the interested reader can alter the code above by replacing the
π1 value (0.8) with other values.
Notes
27 Alternatively, the population mean result (µX = µ) follows directly from property (iv).
28 The “strictly increasing” and “strictly decreasing” results can be shown by taking the derivative of fX (x) with respect to x and verifying that the
derivative is positive for x < µ and negative for x > µ.
29 That is, e^E(ln(X)) < E(e^ln(X)) = E(X), which is a special case of a result in statistics known as Jensen’s inequality.
30 The interested reader can confirm these properties by working out the appropriate integrals. For example, µX = 1/θ can be shown by
evaluating the integral ∫_0^∞ x θe^(–θx) dx.
31 The interested reader can look at the documentation for the Weibull-related R functions: dweibull, pweibull, rweibull, and
qweibull.
32 Proposition 11.9 allows for an infinite number of possible events so that the time T can be any value.
Figure 11.12
Mixture of two normal random variables
33 The exact distribution of a sum of i.i.d. exponential random variables is a special case of a distribution known as the gamma distribution.
Exercises
1. An airline knows that the duration of the flight from Austin to Nashville is uniformly distributed between 110
minutes and 130 minutes. The flight departs at 1:00pm.
(a) If the airline wants the probability of a late arrival to be 20%, what time should they state as the arrival time?
(b) When the flight lands, it may have to wait until an arrival gate is available for passengers to deplane. The time
that the arrival gate becomes available is uniformly distributed between 2:50pm and 3:00pm. If the flight time
and the time that the arrival gate becomes available are independent, what is the probability that the flight will
have to wait for its arrival gate when it lands?
(c) Same as (b), but now assume flight time and arrival-gate availability are independent normal random variables.
Assume that the two random variables have the same means as the uniform random variables described above,
and that the standard deviation of flight time is 5 minutes and the standard deviation of arrival-gate availability
is 2.5 minutes.
2. A credit card company knows that the monthly balance X of a representative customer is normally distributed:
X ∼ N(300, 2500).
Assume that the monthly balances of customers are independent draws from X.
(a) If X1 and X2 are the monthly balances for two randomly chosen customers, what is the probability that X1 + X2
is greater than $700?
(b) If X1 and X2 are the monthly balances for two randomly chosen customers, let Y = X2/X1 denote the ratio of the two
customers’ balances. Here, Y is a non-linear combination of X1 and X2, and Y is itself not a normal random
variable. Conduct 100,000 simulations in R to approximate the following quantities: (i) the mean of Y, (ii) the
standard deviation of Y, (iii) the median of Y, and (iv) the probability that Y > 1.5.
3. Maternal smoking during pregnancy has a negative association with birthweight. Suppose the distribution of a
newborn child’s birthweight (in grams) is BWS ∼ N(3050, 590²) if the mother smokes during pregnancy, while the
distribution is BWNS ∼ N(3260, 530²) if the mother does not smoke during pregnancy.
(a) Which of the following is larger: the pdf of BWS evaluated at 3050 or the pdf of BWNS evaluated at 3260?
Explain why.
(b) Plot the two pdf’s on the same graph, over the range between 2000 grams and 4500 grams.
(c) A baby that weighs less than 2500 grams is classified as “low birthweight.” What is the probability of a low-
birthweight baby if the mother smokes during pregnancy? What is the probability of a low-birthweight baby if
the mother does not smoke during pregnancy?
(d) What is the probability that the birthweight of a baby born to a smoking mother is greater than the birthweight
of a baby born to a non-smoking mother? (Treat the two births as independent.)
(e) Conduct 10,000 simulations in R to confirm your answer to (d).
(f) Now consider two births associated with smoking mothers and two births associated with non-smoking
mothers, where the four birthweights are independent random variables. What is the probability that the average
of the two birthweights for the smoking mothers is greater than the average of the two birthweights for the
non-smoking mothers?
(g) Conduct 10,000 simulations in R to confirm your answer to (f).
(h) There are approximately 453.6 grams in a pound. What is the normal distribution associated with birthweight
in pounds for a baby born to a mother who smokes during pregnancy?
4. The annual returns for two stocks, for the companies Widgetville and Planet Widget, are given by normal
distributions
Widgetville: X ∼ N(0.10, 0.0064) and Planet Widget: Y ∼ N(0.06, 0.0049),
with positive correlation ρXY = 0.3.
(a) What is the probability that Widgetville’s return is greater than Planet Widget’s return in a given year?
(b) If you buy $100 of Widgetville stock and $200 of Planet Widget stock, what is the distribution of the net
gain/loss (in dollars) on your portfolio after one year?
(c) For the portfolio in (b), what is the probability that the net gain is greater than $30 after one year?
(d) *Now suppose you can only invest in Widgetville stock. Write a function widgetgain(amt, yrs, numsim)
with three arguments: amt is the amount invested in Widgetville stock, yrs is the number of years that
the money is invested, and numsim is the number of simulations. The function should return a vector of
length numsim, where each element of the vector is a simulation of the net gain/loss over yrs years from an
investment of amt in Widgetville stock. Assume that the annual return in each year is an independent draw
from the random variable X, but make sure to allow for compounding. For instance, for yrs = 2, if $100
is invested with a 0.10 return in year 1 and a 0.05 return in year 2, you would have $100(1 + 0.10) = $110
after one year and $110(1 + 0.05) = $115.50 after two years, yielding $15.50 net profit. Draw histograms
of widgetgain(100, 10, 10000) and widgetgain(100, 20, 10000), and calculate their respective
simulated means and standard deviations.
5. A clothing store has to decide whether to spend money on advertising for the upcoming month. If the store does not
advertise, the distribution of monthly revenue (in thousands of dollars) is NA ∼ N(90, 36). If the store does advertise,
the distribution of monthly revenue (in thousands of dollars) is A ∼ N(95, 25). The cost of advertising is $4,000.
The correlation between NA and A is ρNA,A . (This correlation is likely positive since there are factors that affect monthly
revenue whether or not the store advertises.) The profitability X associated with advertising is a random variable, with
X = A – NA – 4.
(a) What is the expected value of X?
(b) What is the population standard deviation of X, in terms of ρNA,A ? Calculate σX for ρNA,A = 0.6 and ρNA,A = 0.8.
(c) Using the fact that X is also normally distributed, plot the probability of positive profitability, P(X > 0), versus
ρNA,A for ρNA,A = {0.1, 0.2, ..., 0.8, 0.9}.
6. Suppose X1 ∼ N(0, 4), X2 ∼ N(1, 1), and X3 ∼ N(2, 9) are independent random variables.
(a) What is the expected value of the average of draws from X1 , X2 , and X3 ?
(b) What is the population variance of the average of draws from X1 , X2 , and X3 ?
(c) What is the probability that X2 is larger than the average (X1 + X3)/2?
(d) What is the probability that X1 is larger than the average (X2 + X3)/2?
(e) Conduct 10,000 simulations in R to approximate the probability that X2 is closer to X1 than it is to X3 , which is
P(|X2 – X1 | < |X2 – X3 |).
7. A worker has three projects (A, B, and C) that she needs to complete, but there is uncertainty about how long each
project will take to complete. The completion times, in hours, are draws from normal random variables:
TA ∼ N(12, 9), TB ∼ N(24, 16), and TC ∼ N(8, 4).
Assume that TA , TB , and TC are independent.
(a) What is the distribution of the total completion time for the three projects?
(b) Suppose the worker does project A and then B and then C. Conduct 100,000 simulations in R to approximate
the pmf of X = the number of projects completed within 40 hours (a week of work).
(c) Suppose the worker does project C and then A and then B. Conduct 100,000 simulations in R to approximate
the pmf of X = the number of projects completed within 40 hours (a week of work).
(d) *For this part, drop the independence assumption and assume that the correlation between any two completion
times is equal to 0.4 (that is, ρTA TB = ρTA TC = ρTB TC = 0.4). What is the distribution of the total completion time
for the three projects? How does this compare to your answer to (a)?
8. A truncated normal random variable is restricted to a range (a, b) and, within that range, has a pdf proportional to
the pdf of a normal random variable. Specifically, the pdf of a truncated normal random variable based upon a N(µ, σ 2 )
random variable and restricted to the range (a, b) is
fX(x) = (1/σ) · φ((x – µ)/σ) / [Φ((b – µ)/σ) – Φ((a – µ)/σ)] if a < x < b, and fX(x) = 0 otherwise.
(a) Explain why the variance of X is less than σ 2 (without actually determining σX2 ).
(b) Write an R function dtruncnorm(x,mean,sd,a,b) that returns the pdf of a truncated normal, based upon
a normal with mean mean and standard deviation sd and restricted to the range between a and b, evaluated at
each of the elements of the vector x.
(c) Write an R function rtruncnorm(n,mean,sd,a,b) that returns a vector of n i.i.d. random draws of a
truncated normal, based upon a normal with mean mean and standard deviation sd and restricted to the range
between a and b. (Hint: Continually make i.i.d. draws from the normal distribution until there are n values
between a and b.)
(d) In a population of credit-card holders, each individual has some probability L of making a late payment in
a given month. Suppose the distribution of these probabilities follows a truncated normal distribution on the
range (0, 1) based upon a N(0.1, 0.3²) random variable.
i. Using the dtruncnorm function, draw the density function of L for values ranging between –0.5 and
1.5 on the x-axis.
ii. Using the rtruncnorm function to create 100,000 simulated draws of L, what are the approximate
values of E(L), sd(L), τL,0.5 , and P(L < 0.1)?
9. Suppose the wealth (in dollars) of 70-year-olds in the United States is described by a log-normal random variable:
ln(W) ∼ N(10, 1).
(a) Using only the normal distribution, provide a 90% probability interval for the wealth of a randomly chosen
70-year-old.
(b) Using only the normal distribution, what is the probability that a randomly chosen 70-year-old has wealth
between $10,000 and $30,000?
(c) Conduct 100,000 simulations in R, using the rlnorm function, to confirm your answer to (b).
10. WebNet, a large technology company, owns many different websites. The monthly traffic T for any given website
(i.e., the number of unique visitors to the website) is a log-normal random variable with ln(T) ∼ N(10, 4).
(a) What is the population median of T?
(b) What is the probability that a given website has monthly traffic greater than 20,000?
(c) Assume that WebNet earns two cents for every unique visitor to any of its websites. Fill in the blank in the
following sentence: “There is a 90% probability that WebNet’s monthly earnings for a given website is greater
than dollars.”
11. A homeowner has an outdoor light that is always kept on. After replacing the light bulb, suppose the new bulb’s life
X (in years) is drawn from an exponential random variable with mean 0.8.
(a) What is the probability that the bulb lasts at least one year?
(b) What is the probability that the bulb lasts between one year and two years?
(c) *Suppose the homeowner immediately replaces a broken bulb with a new bulb, whose life is a new i.i.d. draw
of X. Conduct 100,000 simulations in R, using exponential random variables, to approximate the pmf of the
total number of bulbs B needed to keep the light illuminated for at least two years.
i. What are the approximate values of P(B = 1) and P(B = 2)?
ii. What is the approximate value of E(B)?
iii. Based upon Proposition 11.9, how is B related to a Poisson random variable?
12. A company’s customer service department takes calls throughout the day. After a given customer calls, the time
(in minutes) before the next customer calls is an i.i.d. exponential random variable with mean 4. The length of any
given call (in minutes) is also an exponential random variable, but having mean 3, and is independent of the length of
other calls and the arrival times of all calls. Suppose customer A calls at exactly 3:00pm, and the next two calls are by
customer B and customer C (in that order).
(a) What is the probability that customer B calls before 3:05pm?
(b) What is the probability that customer A’s call ends after 3:05pm?
(c) Conduct 100,000 simulations in R to approximate the following probabilities:
i. the probability that customer B calls before customer A’s call ends
ii. the probability that both customer B and customer C call before customer A’s call ends
iii. the probability that customer B’s call is still ongoing when both customer A’s and customer C’s calls
end
13. *A consumer wants to purchase a product, and she knows that the product costs p1 dollars at a major internet retailer.
She has to decide whether or not to spend time searching the internet for a lower price. Suppose there is a lower price
p2 < p1 available at another internet retailer, but that it takes some time T (in minutes) to find that retailer and price.
Assume that T is an exponential random variable, with T ∼ Exp(θ), and that the consumer’s opportunity cost of time
is c dollars per minute.
(a) What is the consumer’s net gain associated with finding the lower price (in terms of p1 , p2 , T, and c)?
(b) The consumer only wants to search for the lower price if the expected net gain is positive. What must be true
about c for the consumer to search? Your answer should be an inequality in terms of p1 , p2 , and θ.
(c) Now suppose the consumer never searches for more than m minutes (since she knows the exponential random
variable has a long right tail). So, if she finds the lower price p2 within m minutes, she faces that price;
otherwise, she faces the original price p1. What must be true about c for the consumer to search? Your answer
should be an inequality in terms of p1, p2, θ, and m. (Hint: Use the fact that ∫_a^b xθe^(–θx) dx = [–xe^(–θx) – e^(–θx)/θ]_a^b.)
14. Suppose X is an exponential random variable, with X ∼ Exp(θ).
(a) What is the population median of X?
(b) Provide a 95% probability interval (L, U) for X, where P(X < L) = P(X > U) = 0.025.
15. This question is a modified version of Exercise 10.10, now using normal random variables instead of uniform
random variables. A store’s sales depend upon whether it is a weekday (Monday through Friday) or a weekend
day (Saturday or Sunday). Specifically, sales (in thousands of dollars) are distributed as a normal random variable
N(2, 0.5²) on a weekday and as a normal random variable N(3.5, 0.75²) on a weekend day. Let X denote the random
variable associated with the store’s sales on a randomly chosen day of the week (i.e., the probability associated with
each day is 1/7), so that X is a mixture of normal random variables.
(a) What is E(X)?
(b) Determine P(X ≥ 3) analytically, in terms of the standard normal cdf Φ(·). Use the pnorm function in R to
calculate the probability.
(c) Create a vector with 1,000,000 simulated draws of X in R. When simulating these draws, store the vector with
the indicator of whether it is a weekday or not, as it will be needed for (e) below.
i. Confirm your answers to (a) and (b) based on the simulated draws.
ii. What is the approximate population median of X based upon the simulated draws?
iii. Find the approximate population 2.5% and 97.5% quantiles to construct an approximate 95%
probability interval for X.
iv. Draw the smoothed density associated with the simulated draws. Does the density curve appear to be
normal?
(d) Suppose you know that a given day’s sales are between $2,000 and $3,000. Use Bayes’ Theorem to determine
the probability that it is a weekday. (Hint: Let A denote the event that it is a weekday, let B denote the event
that 2 ≤ X ≤ 3, and determine P(A|B).)
(e) Use the simulated draws from (c) to confirm your answer to (d).
16. Let X ∈ {0, 1} be an indicator of whether an individual is female, with X = 1 for women and X = 0 for men. Height
Y (in inches) in a certain population is distributed as a N(64.5, 6.25) random variable for women and as a N(70, 9)
random variable for men. Assume P(X = 1) = 0.5.
(a) What is E(Y)?
(b) Create a vector with 1,000,000 simulated draws of Y in R.
i. Confirm your answer to (a) based on the simulated draws.
ii. What is the approximate population variance of Y?
iii. What is the approximate population IQR of Y?
iv. What is the approximate probability that Y is between 70 and 75? How does this probability compare
to the conditional probabilities P(70 ≤ Y ≤ 75|X = 0) and P(70 ≤ Y ≤ 75|X = 1)? (For each conditional
probability, use the appropriate normal distribution to calculate the actual probability rather than using
the simulations.)
(c) *Find σY² analytically, using the fact that
σY² = E((Y – µY)²) = E((Y – µY)²|X = 0)P(X = 0) + E((Y – µY)²|X = 1)P(X = 1).
A central goal of statistics is to use an observed sample to say something about how the data are truly generated,
sometimes called the data-generating process (DGP). In other words, the observed sample is used to characterize the
population from which the sample was drawn. To simplify matters, this chapter focuses on the case of a simple
random sample. Recall from Chapter 5 that a sample is a simple random sample if each element of the population
is equally likely to be sampled. To formalize matters, let’s assume that the observed random sample {x1 , x2 , …, xn }
are the realizations of a collection of n i.i.d. random variables {X1 , X2 , …, Xn }. That is, x1 is the realized outcome of
X1 , x2 is the realized outcome of X2 , and so on. Since the random variables {X1 , X2 , …, Xn } are i.i.d., they share a
common cdf FX (·). The goal is to use the sample {x1 , x2 , …, xn } to characterize the distribution FX (·) associated with
the population.
Example 12.1 (Math SAT scores) Let X be the random variable associated with math SAT score in a population of
students. With an observed random sample, what can be said about the population mean µX of math SAT scores?
Suppose a random sample of 10 students is collected, with sample mean x̄ = (1/10) Σ_{i=1}^{10} xi = 618. Intuitively, it seems like
x̄ = 618 should be a good “guess” for the population mean µX, but how good is it? How close is the sample mean
x̄ = 618 to µX? The precision of x̄ = 618 as a guess for µX depends upon the variability of the sample mean itself. If
a different set of 10 students had been collected from the population, would their sample mean also be close to 618?
How about for yet another randomly chosen set of 10 students? After all, the observed {x1, x2, …, x10} sample is just
one possible realization of the random variables {X1, X2, …, X10}, and therefore the descriptive statistic x̄ = (1/10) Σ_{i=1}^{10} xi
is just one possible realization of the sample mean X̄ = (1/10) Σ_{i=1}^{10} Xi.
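To make this idea concrete, the variability of x̄ across samples can be simulated; a small sketch, assuming (purely for illustration) a hypothetical N(600, 80²) population of math SAT scores:
set.seed(1234)
# 1,000 random samples of size 10, storing each sample's mean
xbars <- replicate(1000, mean(rnorm(10, mean=600, sd=80)))
head(xbars)    # the first few realized sample means
sd(xbars)      # variability of the sample mean across samples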
Figure 12.1 provides a visual representation of how we can think about drawing our random sample of 10 students.
There are many ways of drawing a random sample of n observations, and only one of those samples, represented by
the gray shading, is observed. For the sample {x1 , x2 , …, xn }, descriptive statistics like the sample mean x̄ and the
sample variance s²x can be calculated. Had one of the other samples been observed, a different sample mean and
sample variance would have been obtained. We are interested in characterizing the distribution of the realizations of x̄
(or s²x) over the possible random samples.
In Definition 6.1, a statistic s(x1 , x2 , …, xn ) was defined as a function of the observed sample data. The sample
mean x̄ and the sample variance s²x are examples of statistics. For considering the distribution of a statistic over the
possible random samples that can be drawn, the random variable s(X1 , X2 , …, Xn ) is introduced. The random variable
s(X1 , X2 , …, Xn ) involves the same function s(·) applied to the random variables X1 , X2 , …, Xn rather than the observed
variables x1 , x2 , …, xn . In the case of the sample mean,
s(x1, x2, …, xn) = x̄ = (1/n) ∑ᵢ₌₁ⁿ xi
is the sample mean for a sample {x1, x2 , …, xn }, and
s(X1, X2, …, Xn) = X̄ = (1/n) ∑ᵢ₌₁ⁿ Xi
Figure 12.1 (schematic: a sequence of possible samples {x1, x2, …, x10}, with "Your sample" shaded gray)
Random sampling and sampling distributions
is the random variable associated with the sample mean. Before observing the data, X̄ is itself a random variable since it
depends on the random variables X1 , X2 , …, Xn . After observing the data, x1 is the realization of X1 , x2 is the realization
of X2 , and so on through xn , and x̄ is the realization of the random variable X̄.
Similarly, in the case of the sample variance,
s(x1, x2, …, xn) = s²x = (1/(n–1)) ∑ᵢ₌₁ⁿ (xi – x̄)²
and
s(X1, X2, …, Xn) = s²X = (1/(n–1)) ∑ᵢ₌₁ⁿ (Xi – X̄)².
Before observing the data, s²X is itself a random variable. The subscript X, and not x, indicates that s²X is a random
variable. After observing the data, the sample variance s²x is the realization of the random variable s²X.
The distribution of a statistic over the possible random samples is known as the sampling distribution and is
formally defined as follows:
Definition 12.1 The sampling distribution of a statistic s(X1 , X2 , …, Xn ) is the probability distribution of the statistic
over all possible random samples of size n from the population.
For a given sample size n, the sampling distribution of a statistic is also sometimes called the exact sampling
distribution or the finite-sample distribution. This chapter focuses on some examples where the exact sampling
distribution of a statistic can be determined based upon the specific form of the underlying random variables
X1 , X2 , …, Xn . For example, in the case of i.i.d. Bernoulli random variables or i.i.d. normal random variables, the
exact sampling distribution of the sample mean for a random sample of size n can be determined. For i.i.d. normal
random variables, the exact sampling distribution of the sample variance for a random sample of size n can also be
determined. In more general cases, however, it can be difficult to characterize the exact sampling distribution of a
statistic, even for a simple statistic like the sample mean. As it turns out, much more can be said when the sample size
n is large, as more general results are available to provide an approximate sampling distribution, rather than an exact
sampling distribution, for a wide range of statistics. This idea is the focus of Chapter 13, which considers large-sample
or asymptotic distributions for statistics, like the sample mean and many others, for large sample sizes.
The results of Example 12.2 can be generalized to any sample size n, and the following proposition summarizes the
results for the sampling distribution of X̄ for n i.i.d. Bernoulli random variables:
Proposition 12.1. If X1, X2, …, Xn are i.i.d. Bernoulli(π) random variables, the sampling distribution of X̄ is
P(X̄ = j/n) = C(n, j) π^j (1 – π)^(n–j) for j ∈ {0, 1, 2, …, n},
where C(n, j) = n!/(j!(n – j)!) is the binomial coefficient. The population mean of X̄ is µX̄ = E(X̄) = π, and the
population variance of X̄ is σ²X̄ = Var(X̄) = π(1 – π)/n.
X̄ ∼ N(µ, σ²/10) and σX̄ = σ/√10.
Thus, the observed sample mean x̄, the average of sales over the 10 years of data, can be thought of as a single draw
from a normal random variable with mean µ and standard deviation σ/√10. From the properties of normal random
variables, the observed sample mean x̄ is within 1.96σX̄ of the population mean µ with probability 95% and within
1.645σX̄ of the population mean µ with probability 90%.
The results of Example 12.3 can be generalized to any sample size n, and the following proposition summarizes the
results for the sampling distribution of X̄ for n i.i.d. normal random variables:
Proposition 12.2. If X1, X2, …, Xn are i.i.d. N(µ, σ²) random variables, the sampling distribution of X̄ is
X̄ ∼ N(µ, σ²/n).
The population mean of X̄ is µX̄ = E(X̄) = µ, the population variance of X̄ is σ²X̄ = Var(X̄) = σ²/n, and the
population standard deviation of X̄ is σX̄ = σ/√n.
Figure 12.2 (four panels: n = 2, n = 4, n = 10, n = 20; each plots pX̄(v) against v)
Sampling distributions of the sample mean of i.i.d. Bernoulli(0.2) random variables
Example 12.5 (Log-normal random variables) Suppose X1 , X2 , …, Xn are i.i.d. log-normal random variables, with
ln(Xi) ∼ N(µ, σ²) for each i ∈ {1, 2, …, n}. The sum of log-normal random variables does not have a log-normal
distribution, and therefore the average of log-normal random variables does not have a log-normal distribution.³⁵
While there are no general results regarding the exact sampling distribution of X̄ for i.i.d. log-normal random
variables, computer simulation can be used to approximate the sampling distribution for specific values of n, µ, and
σ². For a log-normal distribution with ln(X) ∼ N(0, 1) (µ = 0, σ² = σ = 1), Figure 12.4 shows the simulated sampling
distributions for four different sample sizes (n = 2, n = 5, n = 10, and n = 20). The assumed data-generating process
is used to make simulated draws. For instance, for n = 2, X1 and X2 are drawn randomly and independently from
a log-normal distribution with ln(X) ∼ N(0, 1) and the average of the two draws is calculated, and this process is
repeated many times. For the graphs shown in the figure, 100,000 simulations are used. The top-left graph shows
the smoothed density associated with the 100,000 draws of X̄ = (1/2)(X1 + X2) for n = 2. A similar process is used for the
other sample sizes. For n = 20, X1, X2, …, X20 are randomly and independently drawn from a log-normal distribution
with ln(X) ∼ N(0, 1), which gives simulated draws of X̄ = (1/20)(X1 + X2 + ⋯ + X20), and the lower-right graph shows
the smoothed density associated with 100,000 draws of X̄. Comparing across sample sizes, the dispersion in the
Figure 12.3 (four panels: n = 2, n = 3, n = 5, n = 10; each plots fX̄(v) against v)
Sampling distributions of the sample mean of i.i.d. U(0, 1) random variables
distributions decreases as the sample size gets larger, as expected. Also, the right skewness characteristic of the log-
normal distribution is evident at the smaller sample sizes (n = 2 and n = 5), but the right skewness is less dramatic
for n = 10 and has nearly disappeared for n = 20. In fact, the distribution of X̄ for n = 20 looks more like a normal
distribution than it does a log-normal distribution.
The following R code approximates the sampling distributions for Figure 12.4:
set.seed(1234)
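# A sketch of the replicate calls described below; the first object name
# (xmean_2) comes from the text, the other three are assumed to be analogous:
xmean_2 <- replicate(100000, mean(rlnorm(2, meanlog=0, sdlog=1)))
xmean_5 <- replicate(100000, mean(rlnorm(5, meanlog=0, sdlog=1)))
xmean_10 <- replicate(100000, mean(rlnorm(10, meanlog=0, sdlog=1)))
xmean_20 <- replicate(100000, mean(rlnorm(20, meanlog=0, sdlog=1)))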
The code uses the function replicate, which is useful for simulations involving random variables:
• replicate(n, expr): Returns a vector of length n containing the results of evaluating the expression expr
a total of n times.
The first use of replicate in the code creates the vector xmean_2, with 100,000 values for the 100,000 evaluations
of the expression mean(rlnorm(2, meanlog=0, sdlog=1)), which calculates the sample mean over 2
draws of the specified log-normal distribution. The other three replicate expressions are similar, with different
sample sizes specified.
The simulation approach in Example 12.5 is extremely general and can be used to approximate many different
sampling distributions. As long as (i) the distribution of the underlying i.i.d. random variables is known and (ii) a
computer can be used to simulate random draws from that distribution, simulation can always approximate the
sampling distribution of X̄ for a given sample size n. Moreover, since there is nothing special about the statistic X̄,
this simulation approach can also be used to approximate the sampling distribution of other statistics. For example, to
approximate the sampling distribution of the sample variance s²X, the only step we need to change is how the statistic
is calculated after the draws of X1, X2, …, Xn. For X̄, the value of the draw of X̄ = (1/n) ∑ᵢ₌₁ⁿ Xi is used, whereas for s²X,
the value of the draw of s²X = (1/(n–1)) ∑ᵢ₌₁ⁿ (Xi – X̄)² is used. The approach can also be used for other statistics, like sample
quantiles (including the sample median), sample IQR, and others, as the sketch below illustrates.
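For instance, a minimal sketch, assuming the same log-normal data-generating process and n = 10 (the object names here are illustrative, not from the book's script files):
set.seed(1234)
# draws of the sample variance and sample IQR for n = 10 log-normal observations
svar_10 <- replicate(100000, var(rlnorm(10, meanlog=0, sdlog=1)))
siqr_10 <- replicate(100000, IQR(rlnorm(10, meanlog=0, sdlog=1)))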
Figure 12.4 (four panels: n = 2, n = 5, n = 10, n = 20; each plots fX̄(v) against v)
Sampling distributions of the sample mean of log-normal random variables
The sample variance and the sample standard deviation, viewed as random variables, are, respectively,
s²X = (1/(n–1)) ∑ᵢ₌₁ⁿ (Xi – X̄)² and sX = √(s²X) = √((1/(n–1)) ∑ᵢ₌₁ⁿ (Xi – X̄)²).
qchisq(c(0.025,0.975),9)
## [1] 2.700389 19.022768
The 2.5% and 97.5% quantiles of the χ²₉ distribution are 2.700 and 19.023, respectively, so there is a 95% probability
of a χ²₉ random variable being in the interval (2.700, 19.023). Since 2500s²X ∼ χ²₉, the probability of s²X being in the
interval (2.700/2500, 19.023/2500) ≈ (0.0011, 0.0076) is also equal to 95%. There is a 2.5% probability that s²X is
less than 0.0011 and a 2.5% probability that s²X is greater than 0.0076. With a probability interval for the sample
variance s²X, we can construct a probability interval for the sample standard deviation sX since sX = √(s²X) and √· is
an increasing function. Specifically, taking the square root of each endpoint, the 95% probability interval for sX is
(√0.0011, √0.0076) ≈ (0.033, 0.087).
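The endpoints for sX can be computed directly from the chi-squared quantiles; a quick sketch in R:
sqrt(qchisq(c(0.025,0.975),9)/2500)  # approximately 0.033 and 0.087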
We can also calculate the probability that the sample variance or the sample standard deviation is in some pre-
specified interval. For instance, what is the probability that the sample standard deviation is less than 0.05? When a
sample standard deviation is calculated, it will either be less than 0.05 or greater than 0.05 (with certainty), but what
is the probability before the sample is observed? This probability is
P(sX < 0.05) = P(s²X < (0.05)²) = P(2500s²X < 2500(0.05)²) = P(2500s²X < 6.25),
and since 2500s²X ∼ χ²₉, this probability is the cdf of the χ²₉ distribution evaluated at 6.25:
pchisq(6.25,9)
## [1] 0.2853401
Therefore, P(sX < 0.05) ≈ 0.285, meaning there is a 28.5% chance that the observed sample standard deviation will
be less than 0.05.
set.seed(1234)
num_simulations <- 100000
# create a vector of sample variances for n=2, n=3, n=5, and n=10
varunif_2 <- replicate(num_simulations, var(runif(2)))
varunif_3 <- replicate(num_simulations, var(runif(3)))
varunif_5 <- replicate(num_simulations, var(runif(5)))
varunif_10 <- replicate(num_simulations, var(runif(10)))
The replicate function conducts the 100,000 simulations with a single command for each of the four sample
sizes. Using an almost identical approach, we can simulate sampling distributions for the sample standard deviation
sX. Rather than calculating the sample variance in each of the 100,000 simulations, we calculate the sample standard
deviation (replacing the var function with the sd function in the code above) and graph the smoothed densities over the
100,000 simulated standard deviations for each sample size. Figure 12.6 shows these simulated distributions, using the
same four sample sizes as above. The y-axis label is fsX(v), corresponding to the pdf of the random variable sX. Again,
the dispersion of the distributions decreases for the larger sample sizes, and the sampling distribution is approximately
symmetric when n = 10.
Example 12.8 (Log-normal random variables) Suppose X1 , X2 , …, Xn are i.i.d. log-normal random variables, with
ln(Xi ) ∼ N(0, 1) for each i ∈ {1, 2, …, n}. Figure 12.7 shows the simulated sampling distributions of the standard
deviation sX for n = 10 and n = 30, using 100,000 simulations for each sample size. Even with a sample size of n = 30,
the sampling distribution of sX is distinctly right-skewed. In fact, at first glance, the sampling distributions for n = 10
and n = 30 look quite similar. A closer look, especially for the lower values on the x-axis, reveals that the pdf for
n = 30 starts increasing at slightly larger values than the pdf for n = 10, and the peak occurs at a slightly larger value for
n = 30. It may not be obvious that there is reduced dispersion in the distribution when moving from n = 10 to n = 30,
but we know from Proposition 12.3 that the variance of the n = 30 distribution must be lower than the variance of the
n = 10 distribution. Although neither sample size is large enough to result in a symmetric and bell-shaped sampling
distribution, we would get such a symmetric and bell-shaped sampling distribution eventually if we continue to increase
the sample size n. This general idea is the focus of Chapter 13, where the concept of asymptotic or large-sample
sampling distributions is discussed.
Here is the R code to create Figure 12.7, again using the replicate function:
Figure 12.5 (four panels: n = 2, n = 3, n = 5, n = 10; each plots fs²X(v) against v)
Sampling distributions of the sample variance for i.i.d. U(0, 1) random variables
set.seed(1234)
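# A sketch of the simulations Example 12.8 describes (the object names are
# assumed, not taken from the book's script files):
sdlnorm_10 <- replicate(100000, sd(rlnorm(10, meanlog=0, sdlog=1)))
sdlnorm_30 <- replicate(100000, sd(rlnorm(30, meanlog=0, sdlog=1)))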
Figure 12.6 (four panels: n = 2, n = 3, n = 5, n = 10; each plots fsX(v) against v)
Sampling distributions of the sample standard deviation for i.i.d. U(0, 1) random variables
Figure 12.7 (two panels: n = 10 and n = 30; each plots fsX(v) against v)
Sampling distributions of the sample standard deviation for i.i.d. log-normal random variables
Figure 12.8 (three panels: n = 5, n = 10, n = 30; each plots fmaxX(v) against v)
Sampling distributions of the sample maximum for i.i.d. N(0, 1) random variables
random variable X̃0.5 . To fix ideas, consider a random sample of n = 3 observations. If the observations are sorted
in order, the sample median is the middle value of the three observations. What would its sampling distribution be?
Since the sample median is always the middle of the three observations, the sampling distribution should be less
dispersed than the original N(µ, σ²) distribution. The middle observation is less likely to fall in either tail, since that
would require at least one other observation even further out in the tail. Using similar reasoning,
the sampling distribution of the sample median should become tighter around the center µ of the original distribution
as the sample size grows. Also, since the original random variables are all symmetric, it would be surprising if the
sampling distribution of the sample median was not also symmetric. While it’s difficult to analytically determine the
sampling distribution in this case, unlike the sample maximum in Example 12.9, computer simulations can approximate
the sampling distributions of the sample median for different sample sizes. Figure 12.9 shows the simulated sampling
distributions of the sample median for three sample sizes (n = 3, n = 10, and n = 20), where the standard normal N(0, 1)
distribution is assumed. For each sample size, we use 100,000 simulations for the sampling distribution, where in each
simulation a random sample is drawn and the sample median is calculated. As predicted, the distributions become
tighter around the center, at zero, as the sample size becomes larger. The distributions all look symmetric. As a
comparison, for the n = 20 graph at the bottom, the dotted line shows the sampling distribution of the sample mean X̄,
which is X̄ ∼ N(0, 1/20) since n = 20. So, the sampling distributions of X̃0.5 and X̄ are both centered around zero, but the
sampling distribution of X̄ appears to exhibit less dispersion (lower variance) than the sampling distribution of X̃0.5 .
Put another way, it is somewhat more likely that the sample mean will be closer to the true mean/median of zero than
the sample median.
Figure 12.9 (three panels: n = 3, n = 10, n = 20; each plots fX̃0.5(v) against v; the n = 20 panel also shows the N(0, 1/20) density of X̄ as a dotted curve)
Sampling distributions of the sample median for i.i.d. N(0, 1) random variables
set.seed(1234)
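# A sketch of the simulations Example 12.10 describes (the object names are
# assumed, not taken from the book's script files):
mednorm_3 <- replicate(100000, median(rnorm(3)))
mednorm_10 <- replicate(100000, median(rnorm(10)))
mednorm_20 <- replicate(100000, median(rnorm(20)))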
Even when the sampling distribution is difficult or impossible to obtain analytically, the simulation approach
illustrated in Example 12.10 can be used to simulate the sampling distribution for a statistic. Example 12.10 didn’t
require any special properties of the normal distribution, so the same approach could be used for any underlying
distribution of the i.i.d. random variables in the case of the sample median. For other statistics, like other sample
quantiles or the sample IQR, the only step in the simulation process that needs to be changed is calculating the
appropriate statistic after each simulated random sample is drawn. This approach even works for bivariate statistics,
like the sample covariance or the sample correlation, if the underlying joint distribution of the bivariate random
variables is fully known.
Notes
34 The exact sampling distribution of X̄ is not easy to derive, but it is known as the Irwin–Hall distribution.
35 In contrast, the product of log-normal random variables is log-normal. For instance, if ln(X1) and ln(X2) are i.i.d. N(µ, σ²) random variables, then ln(X1X2) = ln(X1) + ln(X2), which has a N(2µ, 2σ²) distribution, meaning X1X2 is log-normal.
36 Although the proof of Proposition 12.4 is complicated, some intuition can be developed for the result by noting that (n – 1)s²X = ∑ᵢ₌₁ⁿ (Xi – X̄)², which implies (n – 1)s²X/σ² = ∑ᵢ₌₁ⁿ ((Xi – X̄)/σ)². If this expression had the population mean µ rather than X̄, it would be ∑ᵢ₌₁ⁿ ((Xi – µ)/σ)², which is the sum of n squared i.i.d. N(0, 1) random variables. That would imply a χ²ₙ distribution rather than the χ²ₙ₋₁ distribution in the proposition. Thus, the fact that X̄ appears in the expression for s²X, rather than the population mean µ, means that we "lose a degree of freedom" when using X̄ in place of µ. The interested reader can try working this out explicitly for the case of n = 2, where the result simplifies to s²X/σ² ∼ χ²₁.
37 One exception is the case of i.i.d. Bernoulli random variables, but that's arguably not very interesting since the sample variance is s²X = X̄(1 – X̄), so the exact sampling distribution of X̄ (a scaled binomial distribution) can be used directly to construct the sampling distribution for s²X.
Exercises
1. Let X ∈ {1, 2, 3} be a discrete random variable with pmf
P(X = 1) = 0.2, P(X = 2) = 0.3, P(X = 3) = 0.5.
(a) Consider a random sample of two observations, where X1 and X2 are i.i.d. random variables with the pmf above.
What is the sampling distribution of X̄?
(b) Consider a random sample of three observations, where X1 , X2 , and X3 are i.i.d. random variables with the pmf
above. What is the sampling distribution of X̄?
(c) Consider a random sample of three observations, where X1 , X2 , and X3 are i.i.d. random variables with the pmf
above. What is the sampling distribution of the sample median?
2. Suppose the probability of a recession (R = 1) in the United States in any given year is 10% and that the realizations
of the Bernoulli random variable R ∼ Bernoulli(0.1) in different years are independent. Consider a period of 10
consecutive years (n = 10) with realizations R1 , R2 , …, R10 .
(a) What is the sampling distribution of R̄, the sample proportion of 10 years in which there is a recession? What
are the mean and the standard deviation of the sampling distribution?
(b) What is the sampling distribution of T = ∑ᵢ₌₁¹⁰ Ri, the total number of recession years? What are the mean and
the standard deviation of the sampling distribution? What is P(T ≥ 2)?
3. Let X ∈ {1, 2, 3, 4, 5, 6} be the random variable associated with the roll of a fair die.
(a) What is the sampling distribution associated with the total of two independent rolls of a fair die?
(b) What is the sampling distribution associated with the maximum of two independent rolls of a fair die?
(c) Conduct 100,000 simulations in R and draw the histogram that approximates the sampling distribution
associated with the total of five independent rolls of a fair die.
(d) Conduct 100,000 simulations in R and draw the histogram that approximates the sampling distribution
associated with the maximum of five independent rolls of a fair die.
(e) What are the mean and variance of the random variable associated with the sum of 100 independent rolls of a
fair die?
(f) What are the mean and variance of the random variable associated with the average of 100 independent rolls
of a fair die?
4. A credit card company knows that the monthly balance X of a representative customer is normally distributed:
X ∼ N(300, 2500).
Assume that the monthly balances of customers are independent draws from X.
(a) Let X1 , X2 , …, X100 denote the monthly balances for 100 randomly chosen customers.
i. What is the distribution of X̄, the average monthly balance of the 100 customers?
ii. Determine a 99% probability interval for X̄.
(b) The credit card company considers a customer to be a “low-balance customer” if she has a monthly balance
below $200. Let L be an indicator variable equal to 1 for a low-balance customer and 0 otherwise.
i. What is the distribution of L?
ii. For 20 randomly chosen customers, what is the probability of at least one low-balance customer?
iii. If L1, L2, …, L100 denote the "low-balance" random variables for 100 randomly chosen customers,
what is the distribution of the sample proportion L̄ = (1/100) ∑ᵢ₌₁¹⁰⁰ Li?
5. Three firms have annual profits (in millions of dollars), denoted X1 , X2 , and X3 , that are log-normally distributed.
Assume that
ln X1 ∼ N(0, 1), ln X2 ∼ N(0.5, 0.25), and ln X3 ∼ N(0.75, 0.16)
are independent of each other. Based on 100,000 simulations in R, draw a histogram (with 100 bins) and a smoothed
density corresponding to the distribution of X1 + X2 + X3 , the sum of annual profits. How does the average of the
simulated draws of X1 + X2 + X3 compare to the population mean of X1 + X2 + X3 ?
6. Consider i.i.d. random variables X1 , X2 , …, Xn drawn from a N(µ, 4) distribution.
(a) For n = 10, what is P(|X̄ – µ| < 0.1) (i.e., the probability that X̄ is within 0.1 of µ)?
(b) What is the smallest value of n that guarantees P(|X̄ – µ| < 0.1) > 0.95?
(c) If you instead want P(|X̄ – µ| < 0.1) > 0.90, would you require a larger or smaller n as compared to (b)?
(d) If the random variables were instead drawn from a N(µ, 9) distribution and you want P(|X̄ – µ| < 0.1) > 0.95,
would you require a larger or smaller n as compared to (b)?
7. In the population, IQ (“intelligence quotient”) scores are normally distributed with a mean of 100 and a standard
deviation of 15. You intend to obtain IQ scores for a random sample of 20 individuals, from which you will calculate
the sample average and the sample standard deviation.
(a) For a single individual, what is the probability that the IQ score is greater than 105?
(b) For a sample of 20 individuals, what is the probability that the sample average is greater than 105?
(c) For a sample of 20 individuals, provide an 80% probability interval for the sample standard deviation. That is,
what are the values a and b satisfying P(sX < a) = 0.1 and P(sX > b) = 0.1, so that P(a ≤ sX ≤ b) = 0.8?
(d) Conduct simulations in R using the assumed N(100, 15²) distribution and n = 20. Specifically, use 100,000
simulations, where for each simulation you draw an i.i.d. sample of size n = 20 from the assumed distribution.
i. Draw a (density) histogram of the 100,000 sample averages.
ii. What is the standard deviation of the 100,000 sample averages?
iii. Draw a (density) histogram of the 100,000 sample standard deviations.
iv. What is the standard deviation of the 100,000 sample standard deviations?
v. What is the proportion of the 100,000 simulations for which the sample average is greater than 105?
Does this proportion approximate the exact probability from (b)?
For this question, use the following fact about independent Poisson random variables: If X1 , X2 , …, Xk are independent
Poisson random variables with Poisson parameters λ1 , λ2 , …, λk , respectively, the sum X1 + X2 + · · · + Xk is a
Poisson(λ1 + λ2 + · · · + λk ) random variable.
(a) What is the sampling distribution of the total number (over all ten companies) of new drug discoveries in a
given year?
(b) What is the sampling distribution of the average number (per company) of new drug discoveries in a given
year?
(c) Conduct 10,000 simulations in R to approximate the sampling distribution of the sample median of the total
number of new drug discoveries in a given year. Draw the histogram and calculate the mean and standard
deviation of the sampling distribution.
10. Consider a sample of n = 100 observations, where the underlying random variables X1 , X2 , …, X100 are i.i.d. uniform
U(0, 1) random variables.
(a) Conduct 10,000 simulations in R to approximate the sampling distribution of the sample interquartile range
IQRX . Draw the density and report the mean and standard deviation of the sampling distribution. Is the mean
of the sampling distribution close to what you expected? (Think about the population interquartile range.)
(b) *This part considers the trimmed mean of a sample, a descriptive statistic which is defined as the sample mean
calculated on the sample after dropping the most extreme observations. For example, the 5% trimmed mean
for n = 100 is the sample mean on the 90 observations that remain after dropping the 5 largest observations
and the 5 smallest observations. Similarly, the 2% trimmed mean for n = 100 is the sample mean on the 96
observations that remain after dropping the 2 largest observations and the 2 smallest observations. Conduct
10,000 simulations in R to approximate the sampling distributions of the sample mean, the 5% trimmed mean,
and the 2% trimmed mean. Draw their densities on the same graph to compare the distributions. Calculate
the mean and standard deviation for each of the three sampling distributions, and comment on how the values
compare to each other.
11. *In an English auction, the price of an object increases as bidders continue to bid on the object. A well-known
theoretical prediction in economics is that the bidder with the highest valuation for the object wins the auction, and
the highest bid is equal to the second-highest valuation among the bidders. For instance, among a group of bidders, if
the highest and second-highest valuations of an object are $92 and $88, respectively, the prediction is that the bidder
with the highest valuation bids $88, wins the auction, and realizes a “surplus” equal to the valuation minus the bid, or
$92 – $88 = $4.
For this question, assume that a seller holds an auction for an item she values at $85, and assume the theoretical
prediction described above holds in practice. There is a group of B bidders for the object, with valuations V1 , V2 , …, VB
for the object, where each valuation Vi is an i.i.d. draw of a U(80, 100) random variable.
(a) First, consider the case of two bidders (B = 2).
i. What is the sampling distribution of the winning bid?
ii. What is the expected value of the seller’s surplus (equal to the winning bid minus $85)?
iii. What is the probability that the seller’s surplus is negative?
iv. What is the expected value of the winner’s surplus (equal to the winner’s valuation minus the winning
bid)?
(b) For B = 3, what is the sampling distribution of the winning bid?
(c) For each B value in {5, 10, 15}, conduct 100,000 simulated auctions in R in which you determine the winning
bid and the winner’s valuation for each simulated auction. Based on the results of the simulated auctions,
(i) graph the estimated distributions (pdf’s) of both the seller’s surplus and the winner’s surplus and (ii) calculate
the expected values of both the seller’s surplus and the winner’s surplus.
12. Let maxX = max(X1 , X2 , …, Xn ) be the maximum of n random variables X1 , X2 , …, Xn , as in Section 12.3.
(a) If X1 , X2 , …, Xn are i.i.d. draws from a U(0, 1) random variable, determine the probability that maxX is greater
than 0.98 as a function of n. What is the smallest n for which this probability is at least 95%?
(b) If X1 , X2 , …, Xn are i.i.d. draws from a N(0, 1) random variable, determine the probability that maxX is greater
than 3 as a function of n. What is the smallest n for which this probability is at least 95%?
(c) If X1 , X2 , …, Xn are i.i.d. rolls of a fair die, with Xi ∈ {1, 2, 3, 4, 5, 6}, what are the cdf and pmf of maxX ?
(d) If X1 , X2 , …, Xn are i.i.d. draws from a N(0, 1) random variable, what is the cdf of the minimum value minX =
min(X1 , X2 , …, Xn ) as a function of n?
(e) If X1 , X2 , …, Xn are i.i.d. draws from a U(0, 1) random variable, what is the cdf of the second-largest value as
a function of n?
Chapter 12 explored sampling distributions of statistics when the sample size n is fixed and the complete distribution
of the underlying i.i.d. random variables is known. Analytical characterizations were provided for the exact sampling
distribution in some cases (e.g., the sample mean of Bernoulli random variables, the sample mean of normal random
variables, the sample variance of normal random variables), while simulations were used for others (e.g., the standard
deviation of a uniform distribution, the sample median of a normal distribution). In this chapter, we shift focus to the
situation where the sample size n “grows large” and characterize the sampling distribution of a statistic in this context.
Two main motivations drive the consideration of large-sample or “asymptotic” sampling distributions. First,
most real-world datasets used by economists and other practitioners tend to be large, which could mean hundreds
or thousands of observations or even millions of observations. For instance, the stock-return dataset sp500 used
throughout the book has 364 observations, while the labor-force dataset cps has several thousand. These sample
sizes far exceed the “small n” examples in Chapter 12. Second, with large n, remarkable statistical results enable
the characterization of the sampling distribution of a statistic. These results typically indicate that the asymptotic
sampling distribution is a normal distribution for most statistics discussed in this book. This contrasts with the exact
sampling distributions in Chapter 12, where normality holds only in very specific cases (e.g., the sample mean of
i.i.d. normal random variables) but more generally the sampling distribution’s shape depends on the specific sample
size, the specific statistic, and the specific underlying distribution of the random variables. As such, for large samples,
we will later (in Chapter 14) apply properties of the normal distribution to further analyze the variability of a given
statistic and, when viewed as a guess or estimator of an underlying parameter, its precision.
While the mean and variance of X̄ are already known to be µX and σ²X/n, respectively, the CLT provides the much
stronger result that the asymptotic sampling distribution of X̄ is a normal distribution with those mean and variance
parameters. The CLT is a remarkable result since the underlying distribution of the random variables X1 , X2 , …, Xn can
be anything. The CLT holds whether the underlying distribution is discrete or continuous (or some combination of the
two) and whether the underlying distribution is symmetric or asymmetric.
Since a similar result for the specific case of i.i.d. normal random variables has been seen previously (Section 12.1.2),
it is important to understand the difference between that result and the general CLT result. For i.i.d. normal random
variables, with X ∼ N(µ, σ²), the exact sampling distribution is X̄ ∼ N(µ, σ²/n). The use of "∼" rather than "∼ᵃ"
indicates an exact sampling distribution rather than an asymptotic distribution. The X̄ ∼ N(µ, σ²/n) distribution holds
for any sample size, even for very small n, while the CLT result X̄ ∼ᵃ N(µX, σ²X/n) is an approximate sampling
distribution that requires large n.
The CLT requires that the sample size n is “sufficiently large,” but what does “sufficiently large” mean in practice?
Many textbooks give simple rules of thumb, like saying that n > 30 (a sample with more than 30 observations) is
sufficient to have a large sample. Unfortunately, in reality, the number of observations required for the CLT to hold (i.e.,
for the normal approximation to be accurate) depends upon the distribution of the underlying random variables. For
instance, a heavily right-skewed distribution of the random variables usually requires larger n than a nicely symmetric
distribution of the random variables. In fact, in some of the distribution graphs seen in Chapter 12, there were cases
where the distribution of X̄ looked normal for very small sample sizes (e.g., the uniform distribution) and other cases
where it did not (e.g., the log-normal distribution). For the specific case of Bernoulli random variables, which arises
frequently in practice and is discussed in more detail below, an oft-used and effective rule of thumb is that the normal
approximation works well when both nπ > 10 and n(1 – π) > 10. This Bernoulli rule of thumb requires a larger sample
size when the success probability π is closer to 0 or 1, which is the case where the Bernoulli distribution is more
asymmetric. With π = 0.5, the rule of thumb suggests having n > 20 is sufficient to use the CLT. In contrast, the rule of
thumb would suggest n > 50 for the CLT approximation when π = 0.2 or π = 0.8 and n > 100 when π = 0.1 or π = 0.9.
Example 13.1 (Restaurant franchises) Suppose a fast-food mogul owns 60 franchises of Burger Depot. The monthly
revenue, in thousands of dollars, for each of the 60 franchises is an i.i.d. random variable drawn from a distribution
with population mean µX = 20 and population standard deviation σX = 4. In a given month, the approximate
distribution of the sample average of monthly revenues at the 60 franchises, based upon the CLT, is
X̄ ∼ᵃ N(20, 4²/60).
Then, an approximate 95% probability interval for the sample average of monthly revenues at the 60 franchises is
(20 – 1.96·(4/√60), 20 + 1.96·(4/√60)) ≈ (18.99, 21.01),
and an approximate 90% probability interval for the sample average of monthly revenues at the 60 franchises is
(20 – 1.645·(4/√60), 20 + 1.645·(4/√60)) ≈ (19.15, 20.85).
If the average monthly revenue falls below $18,500 in a given month, the fast-food mogul will need a bank loan. The
probability of that happening in a given month is
P(X̄ < 18.5) = P((X̄ – 20)/(4/√60) < (18.5 – 20)/(4/√60)) = Φ((18.5 – 20)/(4/√60)) ≈ 0.00184, or 0.184%.
pnorm((18.5-20)/(4/sqrt(60)))
## [1] 0.001837806
The CLT (Proposition 13.2) provides the asymptotic distribution for the average X̄ of i.i.d. random variables. Since
the sum X1 + X2 + · · · + Xn is equal to nX̄, an immediate corollary of the CLT is the following proposition for the
asymptotic distribution of the sum of i.i.d. random variables:
Proposition 13.3. (Asymptotic distribution of the sum of i.i.d. random variables) If X1 , X2 , …, Xn are i.i.d. random
variables with finite population mean µX and finite population variance σ²X, then for sufficiently large n, the sum
S = ∑ᵢ₌₁ⁿ Xi = nX̄ is approximately normally distributed, with
S ∼ᵃ N(nµX, nσ²X).
X̄ has an approximate normal distribution by the CLT, which means that nX̄ also is approximately normal since it’s
a scaled version of X̄. From Proposition 10.14, the mean and variance of the normal distribution that approximates the
sum S = ∑ᵢ₌₁ⁿ Xi = nX̄ are n and n² times the mean and variance, respectively, of the sample mean X̄.
Example 13.2 (Restaurant franchises) Continuing Example 13.1, the approximate distribution for the total monthly
revenues at the 60 franchises is
S = 60X̄ ∼ᵃ N(60·20, 60·4²), or S ∼ᵃ N(1200, 960).
Then, an approximate 95% probability interval for the total monthly revenues at the 60 franchises is
(1200 – 1.96√960, 1200 + 1.96√960) ≈ (1139, 1261).
This interval can also be obtained by multiplying the endpoints of the interval (18.99, 21.01) for X̄, calculated in
Example 13.1, by 60.
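As a check, the endpoints of this 95% probability interval can be computed in R; a quick sketch:
1200+c(-1.96,1.96)*sqrt(960)
## [1] 1139.272 1260.728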
13.1.3 Normal approximation for the sample proportion or binomial random variable
In Section 12.1.1, the exact sampling distribution of X̄ was determined when X1 , X2 , …, Xn are i.i.d. Bernoulli(π)
random variables. Recall that X̄, the sample mean or sample proportion of successes, has an exact sampling distribution
equivalent to a Binomial(n, π) random variable scaled by 1/n, which is true for any sample size n and success
probability π. This result followed from the fact that X1 + X2 + ⋯ + Xn is, by definition, a Binomial(n, π) random
variable, so that X̄ = (1/n)(X1 + X2 + ⋯ + Xn) is a binomial random variable scaled by 1/n. The population statistics for X̄ are
µX̄ = π, σ²X̄ = π(1 – π)/n, and σX̄ = √(π(1 – π)/n).
Applying the CLT, the asymptotic distribution of X̄ is
X̄ ∼ᵃ N(π, π(1 – π)/n) or, equivalently, (X̄ – π)/√(π(1 – π)/n) ∼ᵃ N(0, 1).
For the binomial random variable Y = X1 + X2 + · · · + Xn , where by definition Y ∼ Binomial(n, π), the population
mean is µY = nπ and the population variance is σ²Y = nπ(1 – π). Since Y = nX̄, the CLT result for the sample proportion
implies
Y ∼ᵃ N(nπ, nπ(1 – π)) or, equivalently, (Y – nπ)/√(nπ(1 – π)) ∼ᵃ N(0, 1).
The rule of thumb discussed above stated that the CLT normal approximation works well when both nπ >
10 and n(1 – π) > 10. To illustrate this phenomenon, Figure 13.1 considers four different sample sizes (n = 10,
n = 20, n = 50, and n = 100) when the success probability is π = 0.2. For π = 0.2, the rule of thumb suggests the
Figure 13.1 (four panels: n = 10, n = 20, n = 50, n = 100; vertical bars show P(X̄ = v), with the normal approximation as a dotted curve)
Sampling distributions of the sample mean for i.i.d. Bernoulli(0.2) random variables
asymptotic approximation works well when n > 50. For each graph in Figure 13.1, the exact-distribution pmf for the
sample proportion X̄ is shown with vertical bars, and the dotted curve associated with the asymptotic distribution
N(0.2, 0.2(1 – 0.2)/n) is shown for comparison. For the smallest sample size (n = 10), the problem with the normal
approximation appears in the left tail, as it has positive probabilities associated with negative values for X̄. This problem
largely disappears for n = 20, and even for that sample size (which is lower than the rule-of-thumb suggestion), the
normal approximation looks pretty good. For the larger sample sizes (n = 50 and n = 100), the normal approximation
matches the pmf almost exactly.
How do we calculate probabilities or probability intervals using the normal approximation? Let’s start with a
binomial random variable. As an example, let’s say that n = 50 and π = 0.2, so that Y ∼ Binomial(50, 0.2). The true
probability of exactly 10 successes out of 50 trials is
P(Y = 10) = C(50, 10) (0.2)^10 (0.8)^40 ≈ 0.1398.
dbinom(10,50,0.2)
## [1] 0.139819
Since nπ = 10 and nπ(1 – π) = 8, the approximate sampling distribution of Y is N(10, 8). The practical issue here is
that only integer outcomes between 0 and 50 are possible, so any of the non-integer values are not actually possible.
To determine the probability P(Y = 10), then, we don’t want to evaluate the pdf of N(10, 8) at the value 10. Instead,
we assume that the continuous interval (9.5, 10.5) corresponds to the discrete outcome 10, and similarly for any other
possible outcome; (10.5, 11.5) corresponds to the outcome 11, (35.5, 36.5) corresponds to the outcome 36, and so
on. Then, if W ∼ N(10, 8) denotes the normal approximation, the probability of 10 successes based upon the normal
approximation is
P(9.5 < W < 10.5) = Φ((10.5 – 10)/√8) – Φ((9.5 – 10)/√8) ≈ 0.1403,
which is quite close to the true probability of 0.1398.
pnorm((10.5-10)/sqrt(8))-pnorm((9.5-10)/sqrt(8))
## [1] 0.1403162
This method of looking at the probability of a continuous bin to approximate the discrete-outcome probability
is known as a continuity correction. To illustrate how the continuity correction can be used for different types of
probability intervals, where the inequalities may be strict or weak, the following table considers some additional
examples based upon the Y ∼ Binomial(50, 0.2) distribution.
Event probability                           P(Y = 10)           P(6 < Y < 12)       P(6 < Y ≤ 12)       P(6 ≤ Y ≤ 12)
Probability based upon exact pmf            0.1398              0.6073              0.7105              0.7659
Normal approx. with continuity correction   P(9.5 < W < 10.5)   P(6.5 < W < 11.5)   P(6.5 < W < 12.5)   P(5.5 < W < 12.5)
Approx. probability based upon normal       0.1403              0.5941              0.7037              0.7558
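As an illustration, the P(6 < Y < 12) column can be reproduced in R; a quick sketch:
pbinom(11,50,0.2)-pbinom(6,50,0.2)                # exact pmf: about 0.6073
pnorm((11.5-10)/sqrt(8))-pnorm((6.5-10)/sqrt(8))  # normal approx.: about 0.5941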
The continuity correction can be used in a similar way for calculating probabilities and probability intervals
associated with the sample proportion X̄ for i.i.d. Bernoulli random variables. The only difference is the scale of the
outcomes associated with X̄ as compared to the binomial Y. For example, if X̄ is the sample proportion of successes
when n = 50 and π = 0.2, the possible outcomes for X̄ are {0, 1/50, 2/50, …, 49/50, 1} or {0, 0.02, 0.04, …, 0.98, 1}. For
the outcome X̄ = 0.20, the continuity correction would use the interval (0.19, 0.21) since 0.19 is midway between
0.20 and the outcome below it (0.18) and 0.21 is midway between 0.20 and the outcome above it (0.22). For the
probability P(0.20 ≤ X̄ ≤ 0.26), the continuity correction would entail using the
interval (0.19,0.27) for calculation of
q
(0.2)(0.8)
an approximate probability based upon the asymptotic normal distribution N 0.2, 50 .
Example 13.3 (College-educated adults) In the United States, the probability that an adult aged 25 to 34 has at least
a bachelor’s degree is 40%. Suppose a random sample of 100 adults aged 25 to 34 is drawn from the population.
The sample proportion X̄ with at least a bachelor’s degree, among the n = 100 individuals, has
µX̄ = 0.40, σ²X̄ = (0.4)(0.6)/100 = 0.0024, and σX̄ = √0.0024 ≈ 0.0490.
The normal approximation can be used to calculate 90% and 95% probability intervals for X̄. Without a continuity
correction, an approximate 90% probability interval for X̄ is
(µX̄ – 1.645σX̄, µX̄ + 1.645σX̄) ≈ (0.40 – 1.645(0.0490), 0.40 + 1.645(0.0490)) ≈ (0.3194, 0.4806).
For example, the exact probability that X̄ is between 0.35 and 0.45, computed from the underlying Binomial(100, 0.40) distribution, is:
pbinom(45,100,0.40)-pbinom(34,100,0.40)
## [1] 0.738573
The approximate probability, based upon the normal approximation and using a continuity correction, is
P(0.35 ≤ X̄ ≤ 0.45) ≈ P(0.345 < W < 0.455) for W ∼ N(0.4, 0.0024)
= Φ((0.455 – 0.4)/0.0490) – Φ((0.345 – 0.4)/0.0490) ≈ 0.7384.
pnorm((0.455-0.4)/0.0490)-pnorm((0.345-0.4)/0.0490)
## [1] 0.7383284
Now, consider a situation in which half of the sample (50 observations) is female and half (50 observations) is male.
If the probability that a female adult aged 25 to 34 has at least a bachelor’s degree is 44% and the probability for
a male adult aged 25 to 34 is 36%, what is the probability that the observed sample proportion of females with at
least a bachelor’s degree is greater than the observed sample proportion of males with at least a bachelor’s degree?
Letting X̄f denote the sample proportion among 50 females and X̄m denote the sample proportion among 50 males,
this probability is
P(X̄f > X̄m ) = P(X̄f – X̄m > 0).
The random variable X̄f – X̄m is a linear combination of two sample proportions, specifically the difference between
two sample proportions. While we don’t have specific results about the distribution of the difference of (scaled)
binomial random variables, we do have such results about the distribution of the difference of normal random
variables. Therefore, a normal approximation can be used for both X̄f and X̄m to significantly simplify the calculation
of P(X̄f – X̄m > 0).⁴⁰ Using the CLT, the asymptotic distributions of X̄f and X̄m are
X̄f ∼ᵃ N(0.44, (0.44)(0.56)/50) and X̄m ∼ᵃ N(0.36, (0.36)(0.64)/50).
Moreover, X̄f and X̄m are independent random variables since they are both based upon i.i.d. random variables from the
population. Using results for linear combinations of normal random variables, X̄f – X̄m is also approximately normal.
To characterize the normal distribution, the population mean and variance of X̄f – X̄m are determined as follows:
E(X̄f – X̄m ) = E(X̄f ) – E(X̄m ) = 0.44 – 0.36 = 0.08
and
Var(X̄f – X̄m) = Var(X̄f) + Var(X̄m) = ((0.44)(0.56) + (0.36)(0.64))/50 = 0.009536.
Thus,
X̄f – X̄m ∼ᵃ N(0.08, 0.009536),
which implies
P(X̄f – X̄m > 0) ≈ 1 – Φ(–0.08/√0.009536) ≈ 0.7937.
1-pnorm(-0.08/sqrt(0.009536))
## [1] 0.7936729
Example 13.4 (Political polling) Suppose a political poll is conducted, where a random sample of voters is asked
whether they intend to vote for candidate A or candidate B. Let π denote the true probability that a randomly chosen
voter from the population intends to vote for candidate A. How many voters must be polled so that the width of the
(approximate) 95% probability interval for X̄ is no greater than six percentage points wide? π is unknown here, which
is why the poll is being conducted. The approximate 95% probability interval for X̄ is
(π – 1.96√(π(1 – π)/n), π + 1.96√(π(1 – π)/n)),
so that the width of the interval is
(π + 1.96√(π(1 – π)/n)) – (π – 1.96√(π(1 – π)/n)) = 3.92√(π(1 – π)/n).
To ensure that the width of the interval is less than or equal to six percentage points, we require
3.92√(π(1 – π)/n) ≤ 0.06 or, equivalently, n ≥ (3.92/0.06)² π(1 – π).
This inequality must hold for any possible value for π. Since π(1 – π) is maximized when π = 0.5, we require
n ≥ (3.92/0.06)² (0.5)(1 – 0.5) ≈ 1067.1 or, equivalently, n ≥ 1068.
Therefore, at least 1,068 voters must be polled to get a 95% probability interval for X̄ that is no greater than six
percentage points wide.
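This sample-size calculation can be done directly in R; a quick sketch:
ceiling((3.92/0.06)^2*(0.5)*(1-0.5))
## [1] 1068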
Example 13.5 (Simulation error) In Chapter 2, computer simulations were used to illustrate the idea that a probability
is the limit to which a long-run frequency converges. For example, Figure 2.1 showed the cumulative frequency of heads
after 10,000 simulations. Since the parameter of the underlying Bernoulli(π) random variable is known to be π = 0.5,
the asymptotic sampling distribution can be used to provide information about the simulation error that would be
expected with this many simulations. Specifically, the 95% probability interval is
(0.5 – 1.96√(0.5(1 – 0.5)/10000), 0.5 + 1.96√(0.5(1 – 0.5)/10000)) ≈ (0.490, 0.510),
meaning there is a 95% probability that the observed heads frequency for 10,000 coin tosses is between 49.0%
and 51.0%.
If the number of simulations is 100,000 rather than 10,000, the 95% probability interval for the observed heads
frequency is
(0.5 – 1.96√(0.5(1 – 0.5)/100000), 0.5 + 1.96√(0.5(1 – 0.5)/100000)) ≈ (0.497, 0.503),
and, utilizing the fact that the 99.5% quantile of a N(0, 1) random variable is τZ,0.995 ≈ 2.576, a 99% probability
interval for the observed heads frequency is
(0.5 – 2.576√(0.5(1 – 0.5)/100000), 0.5 + 2.576√(0.5(1 – 0.5)/100000)) ≈ (0.496, 0.504).
This example illustrates how increasing the number of simulations leads to reduced simulation error. In fact, using the
approach of the previous example (Example 13.4), we can determine the number of simulations that are needed to get
a desired width of the probability interval. In the case of the 99% confidence interval, to get an interval that has a
width less than 0.002 (very narrow!), we need
2 × 2.576√(0.5(1 – 0.5)/n) < 0.002 or, equivalently, n > (5.152/0.002)² (0.5)(1 – 0.5) = 1,658,944.
With this many simulated coin tosses, there is a 99% probability that the observed heads frequency is within 0.001, or
0.1%, of the true heads probability of 0.5, or 50%.
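Again, the required number of simulations can be computed in R; a quick sketch:
(2*2.576*sqrt(0.5*(1-0.5))/0.002)^2
## [1] 1658944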
denominator. This shrinking variance for large n is consistent with the first property that the sample variance s²X gets
arbitrarily close to σ²X, which is the mean and center of the asymptotic normal distribution. Again, the asymptotic normal
sampling distribution given by this proposition is remarkable, as it holds regardless of the shape of the underlying
distribution of the i.i.d. random variables.
Example 13.6 (Normal random variables) When X1, X2, …, Xn are i.i.d. normal random variables, there is an exact sampling
distribution for s²X, described by (n – 1)s²X/σ² ∼ χ²ₙ₋₁, that holds for any sample size n, including very small sample sizes.
For larger samples, there is also an asymptotic distribution given by Proposition 13.4. Figure 13.2 shows both sampling
distributions (the exact distribution and the asymptotic distribution) of s²X when X1, X2, …, Xn are i.i.d. N(0, 1) random
variables for four different sample sizes (n = 10, n = 20, n = 50, and n = 100), with the exact distribution given by the
solid curve and the asymptotic distribution given by the dotted curve. There is some difference between the exact
distribution and asymptotic distribution for the smallest sample size (n = 10), with the exact distribution peaking at a
value less than one and exhibiting right skewness. At n = 20, the two distributions are much closer, with the peak for
the exact distribution just slightly lower than one. The two distributions are extremely close to each other at n = 50
and virtually identical at n = 100, suggesting the asymptotic normal approximation works well for sample sizes of 50
and larger. That said, even at n = 20, the normal approximation does pretty well in approximating the exact sampling
distribution of s²X.
Example 13.7 (Uniform random variables) In Example 12.7, simulation methods approximated the sampling
distribution of s²X for i.i.d. U(0, 1) random variables for some small sample sizes, and even with just ten observations
(n = 10) the sampling distribution appeared symmetric and bell-shaped. For i.i.d. U(0, 1) random variables, the
appropriate asymptotic distribution can be derived by determining the mean and variance of the normal distribution
given in part (ii) of Proposition 13.4. First, the mean of the distribution is σ²X, which is 1/12 for X ∼ U(0, 1). Second, the
variance of the distribution is (E((X – µX)⁴) – (σ²X)²)/n, which can be evaluated by determining E((X – µX)⁴):
E((X – µX)⁴) = ∫₀¹ (x – 0.5)⁴ dx = [(x – 0.5)⁵/5]₀¹ = 0.00625 – (–0.00625) = 0.0125 = 1/80.
Plugging into the variance expression yields
(E((X – µX)⁴) – (σ²X)²)/n = (1/80 – (1/12)²)/n = 1/(180n),
so that the asymptotic distribution is
s²X ∼ᵃ N(1/12, 1/(180n)).
As an example, for n = 50, the distribution is s²X ∼ᵃ N(1/12, 1/9000), so that an approximate 95% probability interval for
s²X is (1/12 – 1.96/√9000, 1/12 + 1.96/√9000) ≈ (0.0627, 0.1040). There is approximately a 95% probability that the sample
variance will be between 0.0627 and 0.1040.
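This asymptotic approximation can be checked by simulation; a minimal sketch for n = 50 (the object name is illustrative):
set.seed(1234)
varunif_50 <- replicate(100000, var(runif(50)))
mean(varunif_50)  # should be close to 1/12 ≈ 0.0833
sd(varunif_50)    # should be close to sqrt(1/9000) ≈ 0.0105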
Proposition 13.5. If X1, X2, …, Xn are i.i.d. random variables with finite population mean µX, finite population
variance σ²X, and E((X – µX)⁴) < ∞, then
(i) sX gets arbitrarily close to σX as n → ∞
Figure 13.2 (four panels: n = 10, n = 20, n = 50, n = 100; each plots fs²X(v) against v, with the exact distribution solid and the asymptotic distribution dotted)
Sampling distributions of the sample variance for i.i.d. N(0, 1) random variables
quantify this difference, probability intervals for the two statistics can be formed for a chosen sample size. With
100 observations (n = 100), the asymptotic standard deviation of X̄ is 1/√100 = 0.1, so that a 95% probability interval
for X̄ is (–0.196, 0.196); the asymptotic standard deviation of X̃0.5 is √(1.5711/100) ≈ 0.1253, so that a 95% probability
interval for X̃0.5 is (–1.96(0.1253), 1.96(0.1253)) ≈ (–0.246, 0.246). In the thought experiment where many different
100-observation samples are drawn from the population, this result says that the realizations of the sample mean (the
x̄ values) tend to be slightly closer to the true center of the distribution (at zero) than the realizations of the sample
median (the x̃0.5 values). Over many 100-observation samples, the sample mean is between –0.196 and 0.196 for 95%
of the samples, whereas the sample median is between –0.246 and 0.246 for 95% of the samples. So, in the case of the
N(0, 1) distribution, the sample mean provides a more precise measure of the center of the distribution. On the other
hand, recall that the sample median is a more robust measure of the center of the distribution since it is less affected
by outliers than the sample mean. As a result, there is a precision-robustness tradeoff here, with the sample mean
being more precise and less robust and the sample median being less precise and more robust. The interested reader
can show that this idea generalizes to other normal random variables (that is, X1, X2, …, Xn i.i.d. N(µ, σ²) random
variables), with the asymptotic variance of the sample mean being larger than the asymptotic variance of the sample
median.
Since other quantiles of a distribution may be of interest, we generalize Proposition 13.6 to other sample quantiles.
For i.i.d. random variables, any sample quantile gets arbitrarily close to its corresponding population quantile in large
samples and has an asymptotic distribution that is normal:
Proposition 13.7. If X1 , X2 , …, Xn are i.i.d. continuous random variables with pdf fX (·) and population quantiles τX,q ,
then for any q ∈ (0, 1),
(i) X̃q gets arbitrarily close to τX,q as n → ∞
(ii) for sufficiently large n, X̃q is approximately normally distributed:
X̃q ∼ᵃ N( τX,q , q(1 – q)/(n fX(τX,q)²) ).
The sample median (q = 0.5) is a special case of Proposition 13.7, with q(1 – q) = 1/4 leading to the asymptotic variance
in Proposition 13.6. Beyond the sample size n, which enters in the usual 1/n form, both q and fX (τX,q ) affect the value of
the asymptotic variance, so that the asymptotic variance generally varies as q varies. The following example illustrates
how the asymptotic distributions and probability intervals can be determined at different quantiles.
Example 13.10 (Normal random variables) Continuing Example 13.9, suppose X1 , X2 , …, Xn are i.i.d. N(0, 1) random
variables, and again consider a sample of 100 observations (n = 100). Using the asymptotic variance formula from
Proposition 13.7, the asymptotic standard deviation for any quantile q is
√(q(1 – q)) / (√100 · φ(τX,q)).
The 95% probability interval for X̃q can be constructed as the true quantile τX,q plus or minus 1.96 times the asymptotic
standard deviation. The following table shows the asymptotic standard deviations and 95% probability intervals for
five different quantiles (q = 0.1, q = 0.25, q = 0.5, q = 0.75, and q = 0.9)
q      τX,q      φ(τX,q)   √(q(1–q))/(√100·φ(τX,q))   95% interval for X̃q
0.10   –1.2816   0.1755    0.1709                     (–1.617, –0.947)
0.25   –0.6745   0.3178    0.1363                     (–0.942, –0.407)
0.50    0.0000   0.3989    0.1253                     (–0.246, 0.246)
0.75    0.6745   0.3178    0.1363                     (0.407, 0.942)
0.90    1.2816   0.1755    0.1709                     (0.947, 1.617)
For q = 0.5, the values in the table correspond to those in Example 13.9. While q(1 – q) is largest at q = 0.5, the
asymptotic standard deviation for q = 0.5 is actually the smallest among those shown, which arises since the value of
the pdf φ(0) is much larger than the pdf φ(·) evaluated at the other quantiles. The asymptotic standard deviations are
largest at the extreme quantiles (q = 0.1 and q = 0.9), so the probability intervals for X̃0.1 and X̃0.9 are also the widest.
Due to the symmetry of the N(0, 1) distribution, the asymptotic standard deviation for q = 0.1 is the same as that for
q = 0.9, and similarly the asymptotic standard deviation for q = 0.25 is the same as that for q = 0.75. As a result, the
widths of the probability intervals for X̃0.1 and X̃0.9 are the same, as are the widths of the probability intervals for X̃0.25
and X̃0.75 .
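The entries in the table can be reproduced in R using the qnorm and dnorm functions; the following sketch (not from the companion scripts) computes the asymptotic standard deviations and interval endpoints for all five quantiles at once:
q <- c(0.1, 0.25, 0.5, 0.75, 0.9)
tau <- qnorm(q)                               # population quantiles of N(0,1)
se <- sqrt(q*(1-q))/(sqrt(100)*dnorm(tau))    # asymptotic standard deviations
cbind(q, tau, lower = tau - 1.96*se, upper = tau + 1.96*se)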
How about the interquartile range? Recall that the sample statistic IQRx is the difference between the 75% sample
quantile x̃0.75 and the 25% sample quantile x̃0.25 . The associated random variable is X̃0.75 – X̃0.25 for underlying
i.i.d. random variables X1 , X2 , …, Xn , with the IQRx statistic being a realization of X̃0.75 – X̃0.25 for the sample that
happens to be observed. Then, X̃0.75 – X̃0.25 has an asymptotic normal distribution since, from Proposition 13.7, both
X̃0.25 and X̃0.75 have asymptotic normal distributions. The following proposition gives the specific asymptotic normal
distribution associated with X̃0.75 – X̃0.25 :41
Proposition 13.8. If X1 , X2 , …, Xn are i.i.d. continuous random variables with pdf fX (·) and population quantiles τX,q ,
then
(i) X̃0.75 – X̃0.25 gets arbitrarily close to τX,0.75 – τX,0.25 as n → ∞
(ii) for sufficiently large n, X̃0.75 – X̃0.25 is approximately normally distributed:
X̃0.75 – X̃0.25 ∼ᵃ N( τX,0.75 – τX,0.25 , (1/(16n)) · [ 3/fX(τX,0.25)² + 3/fX(τX,0.75)² – 2/(fX(τX,0.25)·fX(τX,0.75)) ] ).
Example 13.11 (Uniform random variables) Suppose X1 , X2 , …, Xn are i.i.d. U(0, 1) random variables, so that
τX,0.25 = 0.25 and τX,0.75 = 0.75. Then, since fX (v) = 1 for all v ∈ (0, 1), the asymptotic distribution of X̃0.75 – X̃0.25 is
X̃0.75 – X̃0.25 ∼ᵃ N(0.5, 1/(4n)).

For a sample of 100 observations (n = 100), the asymptotic variance is 1/400 and the asymptotic standard deviation is √(1/400) = 0.05, so that a 95% probability interval for X̃0.75 – X̃0.25 is (0.5 – 1.96(0.05), 0.5 + 1.96(0.05)) = (0.402, 0.598).
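As a rough check, the following simulation sketch (not from the companion scripts) draws repeated U(0, 1) samples with n = 100 and examines the realized interquartile ranges; R's default quantile function uses a slightly different sample-quantile definition, so small discrepancies are expected:
set.seed(13)
iqr <- replicate(100000, {
  x <- runif(100)
  quantile(x, 0.75) - quantile(x, 0.25)   # realized sample IQR
})
c(mean(iqr), sd(iqr))                     # close to 0.5 and 0.05
mean(iqr > 0.402 & iqr < 0.598)           # coverage close to 0.95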
For sample size n, let {(X1 , Y1 ), (X2 , Y2 ), …, (Xn , Yn )} denote the random variables associated with the n draws of
bivariate data from the population. For any realized sample, the sample correlation rxy can be calculated. Denote the
associated random variable as rXY , where rxy is the realization of rXY that arises from the particular sample that happens
to be observed. The following proposition formally states that the sample correlation rXY gets arbitrarily close to the
population correlation ρXY in large samples and has an asymptotic normal distribution:
Proposition 13.9. If (X1 , Y1 ), (X2 , Y2 ), …, (Xn , Yn ) are i.i.d. bivariate random variables with population
correlation ρXY , then
(i) rXY gets arbitrarily close to ρXY as n → ∞
(ii) for sufficiently large n, rXY is approximately normally distributed:
rXY ∼ᵃ N( ρXY , (1 – ρ²XY)²/n ).
n
Interestingly, the asymptotic distribution of the sample correlation depends only on the sample size n and the
population correlation ρXY and not on any other feature of the joint distribution of X and Y. When the population
correlation is zero (ρXY = 0), the asymptotic distribution of the sample correlation simplifies to rXY ∼ᵃ N(0, 1/n), in which case a 95% probability interval for rXY is (–1.96/√n, 1.96/√n). Thus, even though the true correlation is zero, the observed
sample correlation will not be exactly zero, except by some rare coincidence, but a probability interval for rXY can
be quantified. For n = 100, the 95% probability interval for rXY is (–0.196, 0.196) when ρXY = 0; for n = 400, the 95%
probability interval for rXY is (–0.098, 0.098) when ρXY = 0; and so on.
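A quick simulation sketch (not from the companion scripts) illustrates this property for independent standard normal pairs with n = 100:
set.seed(7)
r <- replicate(10000, cor(rnorm(100), rnorm(100)))  # rho_XY = 0 by construction
mean(abs(r) < 1.96/sqrt(100))                       # close to 0.95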
Example 13.12 (Bivariate normal random variables) This example considers two random variables X and Y that
have a bivariate normal distribution, which is the case when aX + bY is normal for any values a and b. For bivariate
normal random variables X and Y, the marginal distributions of X and Y are both normal (plugging in a = 1, b = 0
for the former and a = 0, b = 1 for the latter). Denoting the marginal distributions as X ∼ N(µX , σX2 ) and Y ∼ N(µY , σY2 )
and the population correlation between X and Y as ρXY , computer simulations can be used to determine the exact
sampling distribution of rXY since most statistical packages provide the ability to draw (x, y) randomly from bivariate
normal random variables X and Y.42 For simplicity, consider the case where the marginal distributions of X and Y are
both standard normal, so that µX = µY = 0 and σX = σY = 1. Figure 13.3 compares the simulation-based exact sampling
distribution of rXY with the asymptotic sampling distribution of rXY for two different sample sizes (n = 50 and n = 100)
and three different population correlation values (ρXY = 0, ρXY = 0.4, and ρXY = 0.8). For the simulation-based exact
sampling distributions, 100,000 simulations are used, and the solid black lines in the graphs show the density plot
of the realized rxy values.43 As a comparison, the dotted curves show the asymptotic normal distribution given in Proposition 13.9, which is rXY ∼ᵃ N(0, 1/n) for ρXY = 0, rXY ∼ᵃ N(0.4, (0.84)²/n) for ρXY = 0.4, and rXY ∼ᵃ N(0.8, (0.36)²/n) for
ρXY = 0.8. Even with a sample size of n = 50, the asymptotic distributions appear to be quite close to the exact sampling
distributions, although there are some small differences for ρXY = 0.4 and slightly bigger differences for ρXY = 0.8.
Statisticians have previously documented that it takes larger samples for the asymptotic distribution to provide a good
approximation when the population correlation ρXY is very large, which is consistent with the evidence from the n = 50
graphs in the top row. At the larger sample size of n = 100, there is still a slight discrepancy between the exact sampling
distribution and the asymptotic distribution for ρXY = 0.8.
For this example, knowing the specific type of distribution (i.e., the bivariate normal) is only important for being
able to simulate the exact sampling distributions of rXY . Even without knowing the form of the joint distribution
of X and Y, Proposition 13.9 provides the large-sample sampling distribution from just the population correlation
ρXY. For example, in the case of a sample with 100 observations (n = 100) and ρXY = 0.4, the asymptotic distribution of rXY is rXY ∼ᵃ N(0.4, (0.84)²/100), meaning the asymptotic standard deviation is √((0.84)²/100) = 0.084 and a 95% probability interval for rXY is (0.4 – (1.96)(0.084), 0.4 + (1.96)(0.084)) ≈ (0.235, 0.565). Thus, over the possible 100-observation
i.i.d. samples that can be drawn from the population (with ρXY = 0.4), there is a 95% probability that the realized
sample correlation rxy is between 0.235 and 0.565. For n = 100 and ρXY = 0.8, the asymptotic distribution of rXY is
rXY ∼ᵃ N(0.8, (0.36)²/100), meaning the asymptotic standard deviation is √((0.36)²/100) = 0.036 and a 95% probability interval for rXY is (0.8 – (1.96)(0.036), 0.8 + (1.96)(0.036)) ≈ (0.729, 0.871).

Figure 13.3
Sampling distributions of the sample correlation for bivariate normal random variables (densities frXY(v) against v for n = 50 (top row) and n = 100 (bottom row), with ρXY = 0, 0.4, and 0.8 across the columns)
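For readers who want to replicate the simulation-based distributions, a minimal sketch (assuming the MASS package is available for its mvrnorm function) is:
library(MASS)
set.seed(1)
rho <- 0.4
Sigma <- matrix(c(1, rho, rho, 1), 2, 2)   # bivariate standard normal
r <- replicate(10000, {
  xy <- mvrnorm(100, mu = c(0, 0), Sigma = Sigma)
  cor(xy[,1], xy[,2])
})
sd(r)                          # close to (1-rho^2)/sqrt(100) = 0.084
quantile(r, c(0.025, 0.975))   # close to (0.235, 0.565)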
Figure 13.4
Sampling distributions of the sample maximum for i.i.d. N(0, 1) random variables (densities fmaxX(v) for n = 5000 and n = 20000)
and n = 30). But what happens as the sample size n gets large? Does the asymptotic distribution of maxX look normal?
Figure 13.4 shows the exact sampling distributions, based upon the formulas from Example 12.9, for the much larger
sample sizes of n = 5000 and n = 20000. Even with these very large sample sizes, the sampling distributions of maxX
are asymmetric and right-skewed, suggesting that the asymptotic sampling distribution is non-normal. In fact, it is
known that maxX has an asymptotic distribution known as a Gumbel distribution rather than a normal distribution.
Notes
38 A more formal statement of the LLN requires a concept known as convergence in probability. The mathematical condition corresponding to the phrase “gets arbitrarily close to µX” is that, for any ε > 0,

lim n→∞ P(|X̄ – µX| < ε) = 1.

No matter how small ε is, there is always a large enough sample such that there is a probability arbitrarily close to 1 that the distance between X̄ and µX is less than ε.
39 As with the LLN, a formal statement of the CLT requires more advanced statistical concepts. Specifically, a commonly used version of the CLT uses the concept of convergence in distribution and states that the random variable √n(X̄ – µX) converges in distribution to the normal distribution N(0, σX²).
40 Alternatively, to use the exact distributions of X̄f and X̄m (i.e., the scaled binomial random variables), simulation methods can be used to approximate P(X̄f > X̄m).
41 The mean of the asymptotic distribution is E(X̃0.75) – E(X̃0.25) = τX,0.75 – τX,0.25. The random variables X̃0.25 and X̃0.75 are not independent. The asymptotic variance is the sum of the asymptotic variances of X̃0.25 and X̃0.75, each of which is obtained from Proposition 13.7, minus two times the asymptotic covariance. The asymptotic covariance of X̃0.25 and X̃0.75 is 1/(16n·fX(τX,0.25)·fX(τX,0.75)).
42 For bivariate normal random variables X and Y, where the marginal distributions are X ∼ N(µX, σX²) and Y ∼ N(µY, σY²), the joint pdf is

fXY(x, y) = [1 / (2πσXσY√(1 – ρ²XY))] · exp{ –[1/(2(1 – ρ²XY))] · [ ((x – µX)/σX)² + ((y – µY)/σY)² – 2ρXY(x – µX)(y – µY)/(σXσY) ] }.
Exercises
1. 50 individuals are randomly selected from a population in which the probability of owning a dog is 25%.
(a) Based on the binomial distribution, what is the probability that the sample proportion of dog owners is strictly
greater than 30%?
(b) Based on the binomial distribution, what is the probability that the sample proportion of dog owners is between
25% and 35%?
(c) Based on the normal-distribution approximation to the binomial distribution, what is the probability that the
sample proportion of dog owners is between 25% and 35%? Use the continuity correction.
2. Major League Baseball teams play 162 games during the regular season. A team’s final winning percentage is equal
to their total number of wins divided by 162. Assume that any given team has a “true” win probability and that their
wins/losses are 162 i.i.d. draws from a Bernoulli random variable with this win probability.
(a) If team A has win probability πA = 0.60, what is the asymptotic distribution of team A’s winning percentage
WA ?
(b) If team A has win probability πA = 0.60 and team B has win probability πB = 0.55, what is the asymptotic
distribution of the difference in winning percentages, WA – WB , for the two teams? (Assume that WA and WB
are independent of each other.)
(c) Based upon your answer to (b), what is P(WB > WA )?
(d) If you look at winning percentages halfway through the season (after 81 games) rather than at the end of the
season, what is the probability that team B’s winning percentage is higher than team A’s winning percentage?
(e) Conduct 100,000 simulations in R to approximate the probabilities in (c) and (d) using the exact sampling
distributions (based on the binomial) of the winning percentage. Use a strict inequality in the simulations so
that equal winning percentages are not counted.
(f) Returning to asymptotic distributions, how many games in a season would be required to ensure that P(WA >
WB ) is at least 99%?
3. Consider a two-candidate election between candidates A and B. The probability that an older registered voter (aged
65 or over) favors candidate A is 70%. The probability that a younger registered voter (under age 65) favors candidate B
is 56%. 200 older voters and 1,000 younger voters turn out for the election. Assume that the voters that turn out for
the election are randomly drawn from the subpopulations of registered voters.
(a) What is the asymptotic distribution associated with the sample proportion of votes for candidate A?
(b) What is the approximate 95% probability interval for the sample proportion of votes for candidate A?
(c) What is the approximate probability that candidate A wins the election?
4. A worker at a data-entry company enters numbers into a spreadsheet and makes errors at a 0.1% rate; that is, the
probability that the worker makes an error for any given spreadsheet cell is π = 0.1% or π = 0.001. Consider the number
of errors, given by the random variable X, that the worker makes when entering 50,000 cells of data. Assume that the
underlying Bernoulli(π) trials are i.i.d.
(a) What is the exact sampling distribution of X?
(b) Using the exact sampling distribution, what is P(45 ≤ X ≤ 55)?
(c) What is the asymptotic distribution of X?
(d) Using the asymptotic distribution and the continuity correction, what is P(45 ≤ X ≤ 55)?
(e) Using the asymptotic distribution, provide a 95% probability interval for X.
(f) Another worker at the company makes errors at a 0.11% rate, slightly higher than the worker described above.
Again assume that the underlying Bernoulli trials are i.i.d.
i. Using the asymptotic distributions, what is the probability that this worker makes more errors than the
other worker if both enter 50,000 cells of data?
ii. Using the asymptotic distributions, what is a 95% probability interval for the total number of errors for
the two workers?
5. Frank’s Factory produces computer chips on two separate production lines. The output of each production
line is 10,000 computer chips per day, with a 2% probability that any given chip is defective. Assume that the
quality/defectiveness of each computer chip is independent.
(a) What is the asymptotic distribution of the sample proportion of defects among the 20,000 chips produced on a
given day?
(b) What is the asymptotic distribution of the difference between the total number of defects on one production
line and the total number of defects on the other production line?
(c) What is the approximate probability that the magnitude of the difference in (b) is greater than 10? (Do not
worry about a continuity correction.)
(d) How would your answer to (c) change if one production line has a defect probability of 2.1% instead of 2%?
6. For a random sample of 400 unemployed workers drawn from the population, the duration of unemployment (in
weeks) for each worker is an i.i.d. draw of a random variable X. Given the large sample size, you can assume that the
CLT implies that X̄ has an approximately normal distribution.
(a) If P(X̄ > 21) = 0.5, what is E(X)?
(b) If P(X̄ > 21) = 0.4, what can be said about E(X)?
(c) If E(X) = 20.7 and σX = 11.2, what is P(X̄ > 21)?
(d) How would the answer to (c) change if n = 1600 rather than n = 400?
7. The Air Quality Index (AQI) is used by the U.S. Environmental Protection Agency (EPA) as an overall measure of
air quality. The AQI has a scale of 0 to 500, with lower values for better air quality. For instance, the range 0 to 50 is
considered “good,” the range 51 to 100 is considered “moderate” (acceptable air quality, minimal risk), the range 101
to 150 is considered “unhealthy for sensitive groups,” and higher values indicate even more unhealthy air quality.
Suppose 52 weekly AQI measures are taken during the year in both Augusta, Maine and Los Angeles, California.
Assume that all AQI measures are independent of each other, with Augusta’s measures i.i.d. draws from the QA
random variable with expected value 25 and standard deviation 15 and Los Angeles’s measures i.i.d. draws from the
QL random variable with expected value 45 and standard deviation 25.
(a) What is the asymptotic distribution of Q̄A , the sample average of AQI for Augusta over 52 weeks?
(b) What is the asymptotic distribution of Q̄L , the sample average of AQI for Los Angeles over 52 weeks?
(c) What is the asymptotic distribution of Q̄L – Q̄A ?
(d) Are you able to say anything about the probability P(QA > QL ) in a given week? Explain why or why not.
8. Cindy’s Cereals sells cereal in 20-ounce boxes. Its manufacturing process leads to actual weights (in ounces) that are
i.i.d. draws from the random variable X ∼ N(20, 1/900). Suppose it repackages any box weighing less than 19.9 ounces.
(a) What is the probability that any given box is repackaged?
(b) For a manufacturing run of 20,000 boxes, what is the approximate (normal) distribution of the number of boxes
that are repackaged?
(c) The profits per box are P1 for boxes weighing at least 19.9 ounces and P2 for boxes weighing less than 19.9
ounces, where P1 > P2 due to the repackaging required for the latter. For a manufacturing run of 20,000 boxes,
what is the approximate (normal) distribution of total profits in terms of P1 and P2 ?
9. *Allison’s Apparel, a women’s clothing store, is open eight hours each day. On any given day, it is known that the
arrival time for the next customer has an expected value of 5 minutes and a population standard deviation of 3 minutes.
(Arrival time is measured since store opening for the first customer and since the last customer’s arrival for each
subsequent customer.) Assume that all arrival times are independent of each other. Let T denote the random variable
associated with the total number of customers that shop at Allison’s Apparel on a given day.
(a) Ignore for a moment that the store eventually closes. What is the asymptotic distribution of the average arrival
time (in minutes) for n customers? What is the asymptotic distribution of the total amount of time (in minutes)
that it takes n customers to arrive?
(b) What is the approximate probability that the total time it takes 100 customers to arrive is less than 480 minutes?
(c) Explain why the probability in (b) is equal to P(T ≥ 100).
(d) Using the same reasoning as in (b) and (c), approximate P(T ≥ 101) and P(T = 100) = P(T ≥ 100) – P(T ≥ 101).
(e) The pmf of T, evaluated at 100, was determined in (d). Using the same reasoning, plot the pmf of T over the
range of values {80, 81, …, 119, 120} in R.
10. The number of visitors X to a popular website during the one minute between 10:00am and 10:01am is a Poisson
random variable with λ = 120.
(a) Thinking of X as the sum of 60 i.i.d. random variables Y1 , Y2 , …, Y60 ∼ Poisson(2), where each Yi is the number
of visitors in a given second, what is the asymptotic normal distribution associated with X?
(b) Using the asymptotic normal distribution, provide an approximate 90% probability interval for X.
(c) Calculate the exact probability, based on the Poisson(120) distribution, that X is within the interval from (b).
11. Use Proposition 13.6 and Proposition 13.7 for this question.
(a) What is the asymptotic distribution of the sample median X̃0.5 if X ∼ U(0, 1) and n = 400?
(b) What is the asymptotic distribution of the sample median X̃0.5 if X ∼ U(a, b) and n = 400?
(c) What is the asymptotic distribution of the sample 75% quantile X̃0.75 if X ∼ U(0, 1) and n = 400? Provide an
approximate 90% probability interval for X̃0.75 .
12. Use Proposition 13.8 to determine a 95% probability interval for X̃0.75 – X̃0.25 when X ∼ N(0, 1) and n = 100.
13. *As mentioned in Section 13.3.2, there are some concerns with the approximation provided by the asymptotic
distribution of the sample correlation,
rXY ∼ᵃ N( ρXY , (1 – ρ²XY)²/n ),
especially when the magnitude of the population correlation, |ρXY |, is large. In particular, when n is not large, the
actual sampling distribution of rXY may be quite asymmetric for large |ρXY |. An alternative approach, proposed by
statistician R. A. Fisher in a 1915 paper in Biometrika and known as the Fisher transformation, can provide more
accurate confidence intervals for rXY . The idea is to consider the asymptotic distribution of the following (increasing)
function of rXY ,
(1/2)·ln( (1 + rXY)/(1 – rXY) ),
rather than rXY itself. The asymptotic distribution of the Fisher transformation of rXY is
(1/2)·ln( (1 + rXY)/(1 – rXY) ) ∼ᵃ N( (1/2)·ln( (1 + ρXY)/(1 – ρXY) ), 1/n ).
(a) Provide a 95% probability interval for (1/2)·ln((1 + rXY)/(1 – rXY)) in terms of n if ρXY = 0.
(b) Suppose a probability interval for (1/2)·ln((1 + rXY)/(1 – rXY)) has been calculated to be (L, U). That is,

P( L ≤ (1/2)·ln((1 + rXY)/(1 – rXY)) ≤ U ) = p
for some probability p. Using this probability expression, construct a probability interval (rL , rU ) for rXY , with
P(rL ≤ rXY ≤ rU ) = p. (Hint: Exponentiate the three quantities within the probability.)
(c) How does the 95% probability interval based on the Fisher transformation compare to the 95% probability
interval based on the original rXY distribution when ρXY = 0 and n = 100?
(d) How does the 95% probability interval based on the Fisher transformation compare to the 95% probability
interval based on the original rXY distribution when ρXY = 0.85 and n = 40?
14 Estimation and confidence intervals

Building upon the concept of sampling distributions from Chapters 12 and 13, this chapter introduces estimation,
which involves the use of a statistic as a guess or estimate of an underlying quantity of interest. As an example, the
sample mean might be used to estimate the true population mean in the usual situation where the population mean is
unknown. The sample mean has been introduced as a statistic based upon an observed sample, and it can be viewed
as playing two different roles, first as a descriptive statistic for the observed sample and second as an estimate of
the unknown population mean. This idea is formalized below and generalized to other statistics that may be used for
estimation purposes. In addition, to quantify the precision associated with a given estimate of an underlying quantity of
interest, this chapter also introduces confidence intervals, which provide a range of plausible values for the quantity of
interest based upon the estimation procedure and some pre-specified confidence level. In the case of the sample mean,
for instance, methods to construct confidence intervals for the unknown population mean, based upon the realization
of the sample mean, are introduced.
estimator gives the right answer (the estimand). More precisely, over all of the possible samples of size n that can be
drawn from the population, the expected value of the sample mean X̄ is equal to the population mean µX . Similarly,
over all of the possible samples of size n that can be drawn from the population, the expected value of the sample
variance s²X is equal to the population variance σX². The 1/(n – 1) scaling for s²X is required to make it an unbiased estimator, as it would be biased if the scaling were 1/n instead.
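A brief simulation sketch (not from the companion scripts) illustrates the difference between the two scalings, using i.i.d. N(0, 4) draws so that the true variance is 4:
set.seed(123)
n <- 10
s2 <- replicate(100000, {
  x <- rnorm(n, mean = 0, sd = 2)
  c(var(x), sum((x - mean(x))^2)/n)   # 1/(n-1) scaling versus 1/n scaling
})
rowMeans(s2)   # close to 4 (unbiased) versus 4*(n-1)/n = 3.6 (biased)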
The unbiasedness of the sample mean in part (i) of Proposition 14.1 has several interesting applications to previously
discussed random variable models:
• For i.i.d. Bernoulli(π) random variables, the sample mean X̄ is an unbiased estimator of the success probability π,
which also means that nX̄ is an unbiased estimator of the population mean nπ of a Binomial(n, π) random variable.
• For i.i.d. Poisson(λ) random variables, the sample mean X̄ is an unbiased estimator of λ since µX = λ.
• For i.i.d. N(µ, σ²) random variables, the sample mean X̄ is an unbiased estimator of µ.
• For i.i.d. Exp(θ) random variables, the sample mean X̄ is an unbiased estimator of 1/θ since µX = 1/θ.
While the sample mean and sample variance are unbiased, many other estimators are not unbiased. For instance, the
sample standard deviation and the sample correlation are generally biased estimators, with E(sX) ≠ σX and E(rXY) ≠ ρXY.
In fact, among the estimators listed in the table in Section 14.1.1, only the sample mean and sample variance are
guaranteed to be unbiased estimators. While having a biased estimator might seem problematic, the bias of estimators
like sX or rXY is generally only an issue in very small samples. Rather than concerning ourselves with what happens
in small samples, a much more important property of an estimator θ̂X is that it gets close to the “right answer” (the
estimand θ) for large sample sizes. This property is known as consistency of an estimator, and the formal definition of
a consistent estimator is provided below:
Definition 14.3 An estimator θ̂X = s(X1 , X2 , …, Xn ) is a consistent estimator of θ if θ̂X gets arbitrarily close to θ as
n → ∞.
Consistency is generally considered the minimal requirement for a statistical estimator to be useful in practice. The
consistency of many estimators, including all of those listed in the table in Section 14.1.1, has already been stated
in the propositions of Chapter 13, when the large-sample or asymptotic sampling distributions of various descriptive
statistics were discussed. For example, the Law of Large Numbers (Proposition 13.1) states that the sample mean X̄, when viewed as an estimator, is a consistent estimator of the population mean µX since X̄ = (1/n)·Σⁿᵢ₌₁ Xᵢ gets arbitrarily close to µX as n → ∞. As another example, for the sample correlation, part (i) of Proposition 13.9 states that rXY is a consistent estimator of the population correlation ρXY. Thus, even though rXY may be a biased estimator of ρXY, rXY still gets arbitrarily close to ρXY as n → ∞, so that for a large sample the sample correlation is an appropriate estimator for the population correlation.
In addition to the consistency properties provided in Chapter 13 for several descriptive statistics, as estimators of
their associated population quantities, the propositions in Chapter 13 also stated that each of these descriptive statistics
has an asymptotic sampling distribution that is normally distributed. When an estimator has an asymptotic sampling
distribution that is normally distributed, the estimator is said to be an asymptotically normal estimator:
Definition 14.4 An estimator θ̂X = s(X1, X2, …, Xn) is said to be a √n-consistent and asymptotically normal estimator (or, more concisely, an asymptotically normal estimator) if

θ̂X ∼ᵃ N( θ , V/n )

for some V that does not depend on n, or equivalently

√n( θ̂X – θ ) ∼ᵃ N(0, V).
For example, from the Central Limit Theorem (Proposition 13.2), the sample mean X̄ is an asymptotically normal estimator of the population mean µX, with

X̄ ∼ᵃ N( µX , σX²/n )  or, equivalently,  √n( X̄ – µX ) ∼ᵃ N(0, σX²).
The “√n-consistent” phrase used in Definition 14.4 refers to the rate at which the estimator X̄ approaches the estimand µX. To see why that’s the case, note that the asymptotic standard deviation of X̄ is σX/√n, so that the width of any probability interval for X̄ is proportional to 1/√n. The proportionality factor 1/√n is not unique to the sample mean, but rather is a general feature of any √n-consistent and asymptotically normal estimator. From Definition 14.4, if the asymptotic variance of an estimator θ̂X is equal to V/n, where V does not depend upon n, the asymptotic standard deviation of θ̂X is equal to √V/√n, which is proportional to 1/√n.
As another example, part (ii) of Proposition 13.9 provides the asymptotic distribution of the sample correlation rXY:

rXY ∼ᵃ N( ρXY , (1 – ρ²XY)²/n )  or, equivalently,  √n( rXY – ρXY ) ∼ᵃ N(0, (1 – ρ²XY)²).

The asymptotic standard deviation of the estimator rXY is (1 – ρ²XY)/√n. As the sample size n gets large, the standard deviation of rXY shrinks toward zero, corresponding to the consistency of the sample correlation rXY, which says that rXY gets arbitrarily close to ρXY as n → ∞.
Consider, as an example, two estimators of the population mean µX: the full-sample sample mean θ̂Xa, based upon all n observations, and the half-sample sample mean θ̂Xb, based upon only the first n/2 observations. Both estimators are consistent and asymptotically normal estimators of µX, but it seems like θ̂Xa should be the preferred estimator since it is based upon more information, using all n observations as compared to the n/2 observations used for θ̂Xb. The way to quantify that θ̂Xa is a more precise estimator than θ̂Xb is to compare the asymptotic variances or the asymptotic standard deviations of the two estimators. The asymptotic variances of θ̂Xa and θ̂Xb are, respectively,

σX²/n  and  σX²/(n/2) = 2σX²/n,

and the asymptotic standard deviations of θ̂Xa and θ̂Xb are, respectively,

σX/√n  and  √2·σX/√n.
Therefore, the asymptotic standard deviation of the half-sample sample mean θ̂Xb is √2 times larger than the asymptotic standard deviation of the full-sample sample mean θ̂Xa. On this basis, the full-sample sample mean θ̂Xa should be the preferred estimator. In statistical terminology, it is said that the estimator θ̂Xa is more efficient than the estimator θ̂Xb, an idea that is stated more generally in the following definition:
Definition 14.5 If there are two asymptotically normal estimators θ̂Xa and θ̂Xb of the parameter θ, the estimator θ̂Xa
is (asymptotically) more efficient than the estimator θ̂Xb if the asymptotic variance of θ̂Xa is less than the asymptotic
variance of estimator θ̂Xb . The estimator θ̂X is asymptotically efficient among all asymptotically normal estimators if it
is more efficient than any other asymptotically normal estimator.
For the example above, the full-sample sample mean θ̂Xa is more efficient than the half-sample sample mean θ̂Xb ,
but that doesn’t necessarily mean there isn’t some other estimator that is more efficient than θ̂Xa . That is, we have not
shown that the sample mean X̄ is the asymptotically efficient estimator since that would require showing that it has a
lower asymptotic variance than any other asymptotically normal estimator of µX .
Example 14.1 (Normal random variables, mean versus median) For a normal random variable N(µ, σ 2 ), the
population mean and population median are both equal to µ. The sample mean and sample median are two possible
estimators of the parameter µ. Example 13.6 considered the standard normal distribution, where X1 , X2 , …, Xn are
i.i.d. N(0, 1) random variables and µ = 0. In that example, it was shown that the asymptotic variance of the sample
mean X̄ was smaller than the asymptotic variance for the sample median X̃0.5 since
X̄ ∼ᵃ N(0, 1/n)

and

X̃0.5 ∼ᵃ N( 0 , 1/(4nφ(0)²) ),

with 1/(4nφ(0)²) ≈ 1/(4n(0.3989)²) ≈ 1.5711/n > 1/n. Therefore, when the underlying random variables have a standard normal
distribution, the sample mean X̄ is a more efficient estimator of µ = 0 than the sample median X̃0.5 . The interested
reader can show that this result holds for the general case of i.i.d. X1 , X2 , …, Xn ∼ N(µ, σ 2 ) random variables.
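The efficiency comparison can be seen directly in a short simulation sketch (values assumed for illustration, not from the companion scripts):
set.seed(42)
means <- replicate(50000, mean(rnorm(100)))
medians <- replicate(50000, median(rnorm(100)))
var(means)     # close to 1/n = 0.01
var(medians)   # close to 1.5711/n = 0.0157, so the mean is more efficient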
14.2 Finite-sample confidence intervals: population mean of i.i.d. normal random variables
This chapter considers how confidence intervals for parameters (estimands) can be constructed for a wide range of
estimators. For an unknown parameter θ, the goal is to provide an interval of plausible values for θ based upon the
estimator θ̂X . For example, a 95% confidence interval for θ is an interval for which, before observing the sample,
there is a 95% probability that the parameter θ is in the interval created by the estimation procedure. The first case
considered, in this section, is a confidence interval for the population mean µ associated with normally distributed
i.i.d. random variables X1 , X2 , …, Xn ∼ N(µ, σ 2 ). For this specific case, the exact sampling distribution results for the
sample mean X̄ from Section 12.1.2 are used to construct a confidence interval for µ. The resulting confidence interval is an
example of a finite-sample or exact confidence interval, as it is valid for any sample size n, even very small n. While
this specific case is interesting and sometimes useful, the resulting confidence interval does not generalize to other
settings. Therefore, to provide a more general method of constructing confidence intervals, subsequent sections of this
chapter consider the appropriate confidence interval based upon any asymptotically normal estimator. The resulting
confidence interval is an asymptotic confidence interval valid in large samples.
This section considers an observed sample {x1 , x2 , …, xn }, where the underlying i.i.d. random variables X1 , X2 , …, Xn
have a normal distribution N(µ, σ 2 ). The goal is to construct a confidence interval for the parameter µ based upon the
sample mean estimator X̄. From Section 12.1.2, the exact sampling distribution for the sample mean X̄ is
X̄ ∼ N( µ , σ²/n )  or, equivalently,  (X̄ – µ)/(σ/√n) ∼ N(0, 1).
Based upon this exact sampling distribution, we can construct a probability interval for X̄ when µ and σ 2 are assumed
to be known. For instance, a 95% probability interval for X̄ is
( µ – 1.96·σ/√n , µ + 1.96·σ/√n ),
meaning that, over all possible samples of size n that can be drawn from the population, the probability that X̄ is in the (µ – 1.96·σ/√n, µ + 1.96·σ/√n) interval is equal to 0.95:

P( X̄ ∈ ( µ – 1.96·σ/√n , µ + 1.96·σ/√n ) ) = 0.95.
In most cases of interest, however, the parameters µ and σ 2 of the normal distribution are not known to the researcher.
As a result, we would like to essentially flip things around here, forming a probability interval for the parameter µ based
upon the estimator rather than a probability for the estimator based upon the parameters µ and σ 2 . To do so, the first
step is to re-write the exact sampling distribution in terms of the difference X̄ – µ, as follows
X̄ – µ ∼ N( 0 , σ²/n )  or, equivalently,  µ – X̄ ∼ N( 0 , σ²/n ).

Both X̄ – µ and µ – X̄ have the same N(0, σ²/n) distribution due to the symmetry of the normal distribution. The latter distribution says that µ – X̄ is normally distributed with mean zero and standard deviation σ/√n. Therefore, over all possible samples of size n from the population, there is a 95% probability that µ – X̄ is between –1.96·σ/√n and 1.96·σ/√n:

P( µ – X̄ ∈ ( –1.96·σ/√n , 1.96·σ/√n ) ) = 0.95,
or, equivalently, by adding X̄ to the µ – X̄ term and to the interval endpoints,
P( µ ∈ ( X̄ – 1.96·σ/√n , X̄ + 1.96·σ/√n ) ) = 0.95.
If σ were known, this probability interval would be useful as a confidence interval for µ. Unfortunately, since σ is
unknown, this probability interval is not directly applicable since the endpoints can’t be calculated.
Rather than using the unknown population standard deviation σ, a sensible alternative is to use the sample standard
deviation sX, which is an estimator of σ. The complication introduced by using the estimator sX in place of the parameter σ is that, while (X̄ – µ)/(σ/√n) ∼ N(0, 1), the same is not true for the exact sampling distribution of (X̄ – µ)/(sX/√n). In fact, (X̄ – µ)/(sX/√n) does not have a normal distribution, but rather a different distribution known as a t-distribution:
Proposition 14.2. If X1 , X2 , …, Xn are i.i.d. N(µ, σ 2 ) random variables,
(X̄ – µ)/(sX/√n) ∼ tn–1,
where X̄ is the sample mean, sX is the sample standard deviation, and tn–1 is a t-distribution with n – 1 degrees of
freedom.
The exact form of the t-distribution, as given by the pdf or cdf, is somewhat complicated and therefore not explicitly
shown here. The following R functions are useful for working with random variables that follow a t-distribution:
• dt(x, df): Returns the pdf of a t-distributed random variable with df degrees of freedom evaluated at the argument x, which may be a single number or a vector.
• pt(x, df): Returns the cdf of a t-distributed random variable with df degrees of freedom evaluated at the argument x, which may be a single number or a vector.
• qt(p, df): Returns the population quantiles of a t-distributed random variable with df degrees of freedom at the probabilities given by the argument p, which may be a single number or a vector.
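For example, with 10 degrees of freedom, these functions can be used as follows:
dt(0, 10)       # pdf at 0: about 0.389, slightly below dnorm(0) = 0.399
pt(2, 10)       # cdf at 2: about 0.963
qt(0.975, 10)   # 97.5% quantile: about 2.228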
Figure 14.1
t-distributions for different degrees of freedom (densities ft3(v), ft5(v), ft10(v), and ft30(v), each compared with the N(0, 1) pdf)
In each graph, the t-distribution pdf is shown as a solid curve, and the standard normal pdf is shown as a dotted curve for comparison. The thicker tails, also associated with
a lower peak at zero, are quite evident in the top two graphs, for degrees of freedom equal to 3 and 5. For degrees of
freedom equal to 10, the t-distribution is visibly closer to the N(0, 1) distribution, with less thick tails than the graphs
in the top row. For degrees of freedom equal to 30, the t-distribution is nearly indistinguishable from the N(0, 1)
distribution.
The intuition is that using the estimator sX in place of the true σ introduces uncertainty into the ratio (X̄ – µ)/(sX/√n), and the
thicker tails of the tn–1 distribution, as compared to the N(0, 1) distribution, account for this additional uncertainty. The
uncertainty is larger for smaller sample sizes since sX is a less precise estimator of σ when n is small, reflected by the
thicker tails associated with lower degrees of freedom in the t-distribution. For larger sample sizes, sX becomes more
precise as an estimator of σ, corresponding to thinner tails for the tn–1 distribution. Since sX gets arbitrarily close to σ
as n increases, it should not be surprising that the limit of the tn–1 distribution is the N(0, 1) distribution, as (X̄ – µ)/(sX/√n) ≈ (X̄ – µ)/(σ/√n) for large n.
To quantify the difference that will be seen in probability intervals associated with a t-distribution, rather than a
N(0, 1) distribution, the following table provides the 97.5% quantile (used to construct a 95% probability interval) and
the 95% quantile (used to construct a 90% probability interval) for several different values of the degrees of freedom.
Sample size n Distribution (tn–1 ) 97.5% quantile (tn–1,0.025 ) 95% quantile (tn–1,0.05 )
4 t3 3.182 2.353
6 t5 2.571 2.015
11 t10 2.228 1.812
31 t30 2.042 1.697
101 t100 1.984 1.660
501 t500 1.965 1.648
“large” N(0, 1) 1.960 1.645
The table shows some new notation corresponding to the quantiles of the tn–1 distribution, specifically the tn–1,0.025
notation in the 97.5% quantile column and the tn–1,0.05 notation in the 95% quantile column. This notation is formally
defined as follows:
Definition 14.6 The critical value tn–1,q denotes the (1 – q) quantile of the tn–1 distribution. For example, tn–1,0.025 is
the 97.5% quantile of the tn–1 distribution, and tn–1,0.05 is the 95% quantile of the tn–1 distribution.
For the t30 distribution, corresponding to n = 31, the 97.5% quantile t30,0.025 is 2.042, which is approximately 4%
larger than the 97.5% quantile (1.960) of the N(0, 1) distribution, meaning a 95% probability interval based on the
t30 distribution has a width that is approximately 4% larger than the interval based on the N(0, 1) distribution. As n
increases, the quantile values for the tn–1 distribution approach those of the N(0, 1) distribution. For n = 101 and the
t100 distribution, the quantile values in the table are approximately 1% larger than the quantile values for the N(0, 1)
distribution. The difference becomes almost entirely negligible for n = 501 and the t500 distribution, with the quantiles
only about 0.2% larger than the N(0, 1) quantiles.
The following R code shows how the critical values in the table above are calculated with the qt function:
qt(c(0.975,0.95),3)
## [1] 3.182446 2.353363
qt(c(0.975,0.95),5)
## [1] 2.570582 2.015048
qt(c(0.975,0.95),10)
## [1] 2.228139 1.812461
qt(c(0.975,0.95),30)
## [1] 2.042272 1.697261
qt(c(0.975,0.95),100)
## [1] 1.983972 1.660234
qt(c(0.975,0.95),500)
## [1] 1.964720 1.647907
Figure 14.2
Critical values for a t-distribution (densities ft5(v) and ft30(v), with shaded 2.5% and 5% tail areas)

Figure 14.2 provides some graphical examples of critical values. The top two graphs show the 2.5% critical values for the t5 and t30 distributions. In the top-left graph, the gray area to the right of the t5,0.025 ≈ 2.571 critical value has probability 2.5%, and due to symmetry, the gray area to the left of –t5,0.025 has probability 2.5%. Therefore, the probability for the interval between –t5,0.025 and t5,0.025 is equal to 95%. The top-right graph, for the t30 distribution
is similar, except that the critical value t30,0.025 ≈ 2.042 is not as large as the t5,0.025 ≈ 2.571 critical value. Since the
t30 distribution has less thick tails than the t5 distribution, the critical value must be further to the left to still have the
probability to the right of t30,0.025 , again represented by the gray area, equal to 2.5%. The bottom two graphs show the
5% critical values for the t5 and t30 distributions. For each of these two graphs, the gray area in the right tail has 5%
probability, as does the gray area in the left tail, meaning there is 90% probability of being between –t5,0.05 and t5,0.05
for the t5 distribution and between –t30,0.05 and t30,0.05 for the t30 distribution.
qt(0.90,19)
## [1] 1.327728
To make these probability intervals useful in practice, the only remaining step is to use the realized estimates x̄ and sx, associated with the observed sample, in place of the estimators X̄ and sX. The estimated standard deviation of the sampling distribution of X̄ is equal to sX/√n, which is replaced by sx/√n. This latter quantity is known as the standard error of the sample mean estimator.
Definition 14.7 The standard error is the estimated standard deviation of the sampling distribution of the
estimator θ̂X .
Then, the general result is that, for a value α, the (1 – α) confidence interval for µ is

( x̄ – tn–1,α/2·sx/√n , x̄ + tn–1,α/2·sx/√n ).

When α = 0.05, the 95% confidence interval for µ is

( x̄ – tn–1,0.025·sx/√n , x̄ + tn–1,0.025·sx/√n ).
Figure 14.3
Monte Carlo simulations of confidence intervals (100 simulated intervals plotted against the simulation number)
there is a 95% probability that the quantity (µ – X̄)/(sX/√n) is above –tn–1,0.05. The critical value here uses the 5% probability, rather than the 2.5% probability used for a two-sided confidence interval, since the full 5% probability must be to the left of –tn–1,0.05. This probability can be written as

P( (µ – X̄)/(sX/√n) > –tn–1,0.05 ) = 0.95

or, equivalently, as

P( µ > X̄ – tn–1,0.05·sX/√n ) = 0.95.

This probability can also be written in terms of a one-sided interval, as

P( µ ∈ ( X̄ – tn–1,0.05·sX/√n , ∞ ) ) = 0.95.

Using similar reasoning, since there is a 95% probability that the quantity (µ – X̄)/(sX/√n) is below tn–1,0.05, it follows that

P( (µ – X̄)/(sX/√n) < tn–1,0.05 ) = 0.95,

which implies

P( µ < X̄ + tn–1,0.05·sX/√n ) = 0.95

or

P( µ ∈ ( –∞ , X̄ + tn–1,0.05·sX/√n ) ) = 0.95.
As with a two-sided confidence interval, a one-sided confidence interval is implemented in practice by replacing X̄
with the sample mean x̄ and sX with the sample standard deviation sx . These one-sided confidence intervals can also
be generalized to other probabilities, as stated in the following proposition.
Proposition 14.4. (One-sided confidence intervals for the sample mean of i.i.d. normal random variables) If X1, X2, …, Xn are i.i.d. N(µ, σ²) random variables, the probability that

µ > X̄ – tn–1,α·sX/√n

is equal to 1 – α, and the probability that

µ < X̄ + tn–1,α·sX/√n

is equal to 1 – α. The associated 1 – α confidence intervals for µ are

( x̄ – tn–1,α·sx/√n , ∞ )

and

( –∞ , x̄ + tn–1,α·sx/√n ),

where x̄ is the sample mean and sx/√n is the standard error based upon the sample standard deviation sx.
Example 14.4 (Food truck) Continuing Example 14.3, one-sided confidence intervals can be constructed based upon
Proposition 14.4. For instance, the owner of the food truck may be most concerned about the downside risk. As such,
the owner would be more interested in having 95% confidence, or some other confidence level, that weekly profits are
above a certain value. A one-sided 95% confidence interval for µ is
( x̄ – t6–1,0.05·sx/√n , ∞ ) ≈ (1200 – (2.015)(81.65), ∞) ≈ (1035, ∞).
The owner can be 95% confident that the true average of weekly profits is greater than $1035. Note that $1035 is the
same value as the lower end of the two-sided 90% confidence interval found in Example 14.3, as the same critical
value t5,0.05 is used for both calculations.
What if the owner would like even more certainty, say 99% confidence rather than 95% confidence? The critical
value t6–1,0.05 ≈ 2.015 is replaced with t6–1,0.01 ≈ 3.365, yielding the one-sided 99% confidence interval
( x̄ – t6–1,0.01·sx/√n , ∞ ) ≈ (1200 – (3.365)(81.65), ∞) ≈ (925, ∞).
Since the owner wants to be more certain here (99% probability), the lower end of the one-sided confidence interval
($925) is considerably lower than the lower end of the 95% one-sided confidence interval ($1035).
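These two intervals can be verified with qt, using the x̄ = 1200 and standard error 81.65 values carried over from Example 14.3 (a sketch, not from the companion scripts):
xbar <- 1200; se <- 81.65; n <- 6
xbar - qt(0.95, n - 1)*se   # one-sided 95% lower bound: about 1035
xbar - qt(0.99, n - 1)*se   # one-sided 99% lower bound: about 925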
Definition 14.8 The critical value zq denotes the (1 – q) quantile of the N(0, 1) distribution. For example, z0.025 ≈ 1.96
is the 97.5% quantile of the N(0, 1) distribution, and z0.05 ≈ 1.645 is the 95% quantile of the N(0, 1) distribution.
Following the same reasoning from Section 14.2, but now using z critical values rather than t critical values,

P( (µX – X̄)/(sX/√n) ∈ (–z0.025, z0.025) ) = 0.95

and

P( µX ∈ ( X̄ – z0.025·sX/√n , X̄ + z0.025·sX/√n ) ) = 0.95.

For the case of i.i.d. normal random variables, this 95% probability interval is the natural large-sample version of the finite-sample interval P( µX ∈ ( X̄ – tn–1,0.025·sX/√n , X̄ + tn–1,0.025·sX/√n ) ) = 0.95 since the limit of the tn–1 distribution, as n gets larger, is the N(0, 1) distribution, meaning the limit of the tn–1,0.025 critical value, as n gets larger, is the z0.025 critical value. Importantly, however, the 95% asymptotic probability interval also applies for i.i.d. random variables that are non-normal.
The 95% probability interval can be generalized to other probability intervals, with the (1 – α) probability interval

P( µX ∈ ( X̄ – zα/2·sX/√n , X̄ + zα/2·sX/√n ) ) = 1 – α.
The critical value zα/2 is the value for which there is probability of α/2 that a N(0, 1) random variable is larger than
zα/2 , and by symmetry –zα/2 is the value for which there is probability of α/2 that a N(0, 1) random variable is less than
–zα/2 . The following proposition formally states the results for a two-sided confidence interval based upon this general
probability interval.
Proposition 14.5. (Two-sided confidence interval for the sample mean of i.i.d. random variables) If X1, X2, …, Xn are i.i.d. random variables with population mean µX and large n, the probability that µX is in the interval

( X̄ – zα/2·sX/√n , X̄ + zα/2·sX/√n )

is approximately equal to 1 – α. The associated asymptotic 1 – α confidence interval for µX is

( x̄ – zα/2·sx/√n , x̄ + zα/2·sx/√n ),

where x̄ is the sample mean and sx/√n is the standard error based upon the sample standard deviation sx.
The standard error sx/√n is sometimes called an asymptotic standard error since it is based upon the asymptotic sampling distribution, but we often simply call it a “standard error” so that the same terminology can be used when referring to a finite-sample standard error or an asymptotic standard error. The following R code defines a function se_meanx that calculates the standard error of x̄, given by sx/√n, for any vector of data:
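# A sketch of the se_meanx function described in the text: the standard
# error of the sample mean, sd(x)/sqrt(n), with optional removal of NAs
se_meanx <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  sd(x)/sqrt(length(x))
}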
The function se_meanx is defined with an optional argument na.rm so that vectors with missing (NA) values can
be handled, similar to the optional argument na.rm available for built-in functions like mean and sd.
For the case of i.i.d. Bernoulli(π) random variables, the sample mean X̄ is an estimator of the population mean or true success probability π. Since the population variance is σX² = π(1 – π), the asymptotic standard deviation of X̄ is

σX/√n = √( π(1 – π)/n ),

which suggests two possible approaches to calculating the standard error: (i) plug in the sample standard deviation sx for σX, so that the standard error is sx/√n, or (ii) plug in the sample mean x̄ for π, so that the standard error is √( x̄(1 – x̄)/n ). Approach (ii) is valid since the sample mean is a consistent estimator of π, so that √(x̄(1 – x̄)) will be arbitrarily close to √(π(1 – π)). Since x̄ is just the sample proportion of successes, or the fraction of ones observed, this approach is appealing since it doesn’t require the extra step of calculating sx. Approach (i) and approach (ii) give nearly identical standard errors, as it can be shown that sx = √(n/(n – 1))·√(x̄(1 – x̄)) and the scaling factor √(n/(n – 1)) ≈ 1 for large n.46
Example 14.5 (Widget website) In Example 2.1, widgets.com conducted an e-mail campaign experiment in which
300 users received e-mail A, 300 users received e-mail B, and 2400 users received no e-mail. The resulting purchase
probabilities were 20% (60 out of 300) for e-mail A recipients, 22% (66 out of 300) for e-mail B recipients, and 15%
(360 out of 2400) for non-recipients. Using Proposition 14.5, confidence intervals for the true purchase probabilities
of the three groups can be calculated. Let πA denote the purchase probability of an e-mail A recipient, πB denote the
purchase probability of an e-mail B recipient, and πC denote the purchase probability of a non-recipient. Starting with
the sample of e-mail A recipients, assume that the random variables X1 , X2 , …, X300 are i.i.d. Bernoulli(πA ), with a
one (success) associated with a purchase. The observed proportion of successes (purchases) is pA = 0.20, so that the
sample mean is
x̄ = pA = 0.20,
and the standard error (of the sample mean) is

sx/√n = √( pA(1 – pA)/n ) = √( (0.2)(0.8)/300 ) ≈ 0.0231.
Then, the asymptotic 95% confidence interval for πA is
( x̄ – z0.025·sx/√n , x̄ + z0.025·sx/√n ) ≈ (0.20 – (1.96)(0.0231), 0.20 + (1.96)(0.0231)) ≈ (0.155, 0.245).
Similar calculations yield an asymptotic 95% confidence interval for πB ,
( x̄ – z0.025·sx/√n , x̄ + z0.025·sx/√n ) ≈ ( 0.22 – 1.96·√((0.22)(0.78)/300) , 0.22 + 1.96·√((0.22)(0.78)/300) ) ≈ (0.173, 0.267),
and an asymptotic 95% confidence interval for πC ,
( x̄ – z0.025·sx/√n , x̄ + z0.025·sx/√n ) ≈ ( 0.15 – 1.96·√((0.15)(0.85)/2400) , 0.15 + 1.96·√((0.15)(0.85)/2400) ) ≈ (0.136, 0.164).
The standard errors for the three confidence intervals are 0.023, 0.024, and 0.007, respectively. The first two standard errors are similar since the sample size n = 300 is the same and the sample standard deviations differ only slightly (√((0.20)(0.80)) versus √((0.22)(0.78))). The third standard error is much lower due to the larger sample size n = 2400.
Figure 14.4 provides a graphical representation of the 95% confidence intervals for the purchase probabilities
of the three groups. The confidence intervals for πA and πB have very similar widths since they have similar standard errors. The confidence interval for πC (no e-mail) is much narrower since the sample size n = 2400 is
much larger, corresponding to the fact that pC = 0.15 is a more precise estimate of πC than the estimates for the
smaller samples. Comparing the confidence intervals for πA and πB , it seems that there’s not strong evidence that the
purchase probability of e-mail B recipients, πB , is greater than the purchase probability of e-mail A recipients, πA .
Even though the observed purchase probability is larger for e-mail B recipients, there is a lot of overlap between the
confidence intervals of the two purchase probabilities πA and πB . (Later in this chapter, a more formal method for
directly looking at the difference πA – πB is considered.) In contrast, the narrow confidence interval for πC leads to no
overlap with the πB confidence interval and only small overlap with the πA confidence interval, offering some evidence
that the purchase probability πC is lower than either πA or πB . (Again, a more formal examination of the differences
πA – πC and πB – πC is provided later.)
Example 14.6 (Labor force data) The confidence interval for a sample mean estimator can be applied to variables
from a cross-sectional dataset, at least in cases where the observations can plausibly be considered to be realizations
of i.i.d. random variables. This idea is illustrated by calculating standard errors and confidence intervals for several
variables from the cps dataset. The following table shows the sample mean x̄, the standard error sx/√n, the asymptotic
95% confidence interval for µX , and the asymptotic 90% confidence interval for µX for five different variables: age
(age in years), educ (education in years), ownchild (number of children in household), earnwk (weekly earnings), and
union (1 if union member, 0 if not). For the first three variables, the quantities are calculated based upon the full
sample size n = 4013. For the last two variables, the quantities are calculated based upon the employed sample size
n = 2809.
Figure 14.4
Asymptotic 95% confidence intervals for true widget purchase probabilities (e-mail A (πA), e-mail B (πB), and no e-mail (πC))
Variable   n     x̄       se = sx/√n   95% CI for µX       90% CI for µX
age        4013  45.02    0.142        (44.74, 45.30)      (44.78, 45.25)
educ       4013  12.57    0.039        (12.50, 12.65)      (12.51, 12.64)
ownchild   4013  0.748    0.0176       (0.713, 0.782)      (0.719, 0.777)
earnwk     2809  971.18   14.16        (943.42, 998.93)    (947.89, 994.47)
union      2809  0.098    0.0056       (0.087, 0.109)      (0.089, 0.107)
The abbreviation “se” is used for the standard error sx/√n. For the education variable educ, the asymptotic 95%
confidence interval for the population mean is (12.50, 12.65). Since this confidence interval is quite narrow, the sample
mean 12.57 is providing a precise estimate of the population mean, which occurs since the sample size n = 4013 is
so large here. For the weekly earnings variable earnwk, the sample mean of 971.18 dollars is the estimate of the
population mean of weekly earnings for the population of employed individuals, with the associated 95% asymptotic
confidence interval being between 943.42 dollars and 998.93 dollars. The union variable union is an indicator
variable, so its observations can be viewed as realizations of Bernoulli random variables. The estimate of the true
probability that an employed individual from the population is in a union is 9.8%, with an asymptotic 95% confidence
interval for the true probability of union membership being between 8.7% and 10.9%.
Here is the R code used to calculate the n, x̄, and se = sx/√n columns in the table above:
# loop over the five variables of interest
for (varname in c("age","educ","ownchild","earnwk","union")) {
  # effective sample size: number of non-missing observations
  nobs <- sum(!is.na(cps[,varname]))
  mean_var <- mean(cps[,varname], na.rm=TRUE)
  se_mean <- sd(cps[,varname], na.rm=TRUE)/sqrt(nobs)
  # output results
  print(paste(varname,":","n",nobs,"Mean",signif(mean_var,digits=5),"SE(mean)",signif(se_mean,digits=5)))
}
## [1] "age : n 4013 Mean 45.017 SE(mean) 0.14205"
## [1] "educ : n 4013 Mean 12.573 SE(mean) 0.038863"
## [1] "ownchild : n 4013 Mean 0.74782 SE(mean) 0.017606"
## [1] "earnwk : n 2809 Mean 971.18 SE(mean) 14.16"
## [1] "union : n 2809 Mean 0.098256 SE(mean) 0.0056172"
The code loops through the five variable names of interest. For each variable given by varname, the expression
sum(!is.na(cps[,varname])) returns the number of non-missing observations, which is the effective sample
size n. To limit the number of significant digits reported by R, the function signif is used so that five significant
digits, as specified by the second argument digits=5, are reported for the values of mean_var and se_mean.
Example 14.7 (Simulation error: the likelihood of “streaks”) In Example 4.16, computer simulations estimated the
probability that a streak of at least five heads occurs during a sequence of 100 coin tosses. Over 100,000 simulations,
the calculated frequency of streaks was 0.81156, corresponding to 81,156 of the 100,000 simulations. The associated
standard error is

√(0.81156(1 – 0.81156)/100000) ≈ 0.001237,
so that an asymptotic 99% confidence interval for the true probability (of observing a sequence of at least five heads
in 100 coin tosses) is
(0.81156 – z0.005 (0.001237), 0.81156 + z0.005 (0.001237)) ≈ (0.8084, 0.8147),
using z0.005 ≈ 2.576. Thus, it is very likely that the true probability is close to 81% with this many simulations.
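These calculations can be verified directly in R; the following is a minimal sketch using the values from the example (the object names are illustrative):
p_hat <- 81156/100000                    # observed frequency of streaks
se_p  <- sqrt(p_hat*(1 - p_hat)/100000)  # standard error, approx 0.001237
z     <- qnorm(0.995)                    # z_0.005 critical value, approx 2.576
c(p_hat - z*se_p, p_hat + z*se_p)        # asymptotic 99% CI, approx (0.8084, 0.8147)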
For a general asymptotically normal estimator θ̂X of a parameter θ, with asymptotic variance V/n, the standardized estimator satisfies

(θ̂X – θ)/√(V/n) ∼a N(0, 1).

Applying the z0.025 = 1.96 critical value of the standard normal distribution gives

P(–1.96 < (θ̂X – θ)/√(V/n) < 1.96) = 0.95
or, equivalently,
P(θ̂X – 1.96·√(V/n) < θ < θ̂X + 1.96·√(V/n)) = 0.95.
To calculate an asymptotic 95% confidence interval for θ, the realized estimate θ̂x is used in place of the estimator θ̂X, and the standard error associated with θ̂x is used in place of √(V/n). For the standard error, a consistent estimate of V is required, call it V̂, so that the standard error is √(V̂/n). For example, in the case of the sample mean X̄ as an estimator of the population mean µX, the results from Section 14.3 use V̂ = sx², so that the standard error is √(V̂/n) = sx/√n. As another example, in the case of the sample correlation rXY as an estimator of the population correlation ρXY, the asymptotic variance shown above has V = (1 – ρXY²)², which can be consistently estimated by plugging in rxy for ρXY, so that V̂ = (1 – rxy²)² and the standard error is √(V̂/n) = (1 – rxy²)/√n.
The 95% probability statement above can be generalized to other probability levels by changing the z0.025 = 1.96
critical value to the appropriate critical value needed for the chosen level of probability. To get a probability 1 – α, the
appropriate critical value is zα/2 :
P(θ̂X – zα/2·√(V/n) < θ < θ̂X + zα/2·√(V/n)) = 1 – α.
The general form of Proposition 14.5, which holds for any asymptotically normal estimator θ̂X , is given by the
following proposition:
Proposition 14.7. (Two-sided confidence intervals based upon an asymptotically normal estimator) If θ̂X is an asymptotically normal estimator of θ with θ̂X ∼a N(θ, V/n), the probability that θ is in the interval

(θ̂X – zα/2·√(V/n), θ̂X + zα/2·√(V/n))

is approximately equal to 1 – α for large n. The associated asymptotic 1 – α confidence interval for θ is

(θ̂x – zα/2·√(V̂/n), θ̂x + zα/2·√(V̂/n)),

where θ̂x is the realized estimate of θ and √(V̂/n) is the standard error based upon a consistent estimate V̂ of V.
A convenient notation is se(θ̂x), denoting the standard error for the estimate θ̂x, with

se(θ̂x) = √(V̂/n),

and the asymptotic 1 – α confidence interval for θ, from Proposition 14.7, is

(θ̂x – zα/2·se(θ̂x), θ̂x + zα/2·se(θ̂x)).
Proposition 14.7 provides a very powerful result, as it covers all of the asymptotically normal estimators introduced
in this book. More generally, beyond the estimators we’ve covered, if there is an asymptotically normal estimator θ̂X
of θ for which a computer is able to produce both the estimate θ̂x and the standard error se(θ̂x ), the asymptotic 1 – α
confidence interval is always (θ̂x – zα/2 se(θ̂x ), θ̂x + zα/2 se(θ̂x )).
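Proposition 14.7 translates into a one-line computation. The following sketch wraps it in a small helper function (the name ci_asym is illustrative, not from earlier chapters):
# asymptotic (1 - alpha) confidence interval from an estimate and its std error
ci_asym <- function(est, se, alpha = 0.05) {
  z <- qnorm(1 - alpha/2)
  c(est - z*se, est + z*se)
}
ci_asym(0.325, 0.0169)   # e.g., 95% CI for a correlation estimate, approx (0.292, 0.358)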
In Section 14.3, the asymptotic confidence interval was given for the population mean, based upon the sample mean
of an i.i.d. sample, with
asymptotic standard deviation = √(V/n) = σX/√n and standard error se(x̄) = √(V̂/n) = sx/√n.
For the specific case of underlying i.i.d. Bernoulli(π) random variables, where the sample mean is an estimator of the
true probability π = µX ,
asymptotic standard deviation = √(V/n) = √(π(1 – π)/n) and standard error se(x̄) = √(V̂/n) = √(x̄(1 – x̄)/n).
For the remainder of this section, several other examples of using the asymptotic confidence interval are considered.
Before those examples, however, we state the general result for one-sided confidence intervals:
Proposition 14.8. (One-sided confidence intervals based upon an asymptotically normal estimator) If θ̂X is an asymptotically normal estimator of θ with θ̂X ∼a N(θ, V/n), the probability that

θ > θ̂X – zα·√(V/n)

is approximately equal to 1 – α for large n, and the probability that

θ < θ̂X + zα·√(V/n)

is approximately equal to 1 – α for large n. The associated asymptotic 1 – α confidence intervals for θ are

(θ̂x – zα·√(V̂/n), ∞) and (–∞, θ̂x + zα·√(V̂/n)),

where θ̂x is the realized estimate of θ and √(V̂/n) is the standard error based upon a consistent estimate V̂ of V.
From Proposition 14.8, the one-sided asymptotic (1 – α) confidence intervals for θ are
(θ̂x – zα se(θ̂x ), ∞) and (–∞, θ̂x + zα se(θ̂x )),
where θ̂x is the estimate of θ and se(θ̂x ) is its standard error.
Consider next the sample standard deviation sX as an estimator of the population standard deviation σX, whose asymptotic variance is V = (E((X – µX)⁴) – (σX²)²)/(4σX²). To consistently estimate V, which is necessary to derive a formula for the standard error se(sx), the realized sample standard deviation sx is plugged in for σX and the summation

(1/n) Σᵢ₌₁ⁿ (xi – x̄)⁴

is plugged in for E((X – µX)⁴), leading to

V̂ = ((1/n) Σᵢ₌₁ⁿ (xi – x̄)⁴ – (sx²)²)/(4sx²) and se(sx) = √(V̂/n) = √((1/n) Σᵢ₌₁ⁿ (xi – x̄)⁴ – (sx²)²)/(2√n·sx).

With this formula for se(sx), the asymptotic 1 – α confidence interval for the population standard deviation is

(sx – zα/2·se(sx), sx + zα/2·se(sx)).
The following R code defines a function se_sx that calculates the standard error se(sx ) for any vector of data:
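One implementation consistent with the formula above (the version used for the book’s calculations is available on the companion website) is:
# standard error of the sample standard deviation, se(s_x)
se_sx <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  n  <- length(x)
  sx <- sd(x)
  m4 <- mean((x - mean(x))^4)      # (1/n) * sum of (x_i - xbar)^4
  sqrt(m4 - sx^4)/(2*sqrt(n)*sx)
}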
Example 14.9 (Standard deviation of monthly stock returns) Considering the monthly stock return data from sp500,
confidence intervals can be formed for the population standard deviation of monthly returns for any given company.
The following table shows the sample standard deviation sx , its standard error, and the associated 95% confidence
interval (in the column labeled “95% CI for σX ”) for six different stocks (HD, LOW, BAC, WFC, MRO, COP).
Company sx se(sx ) 95% CI for σX
HD 0.0737 0.00312 (0.0676, 0.0798)
LOW 0.0916 0.00361 (0.0845, 0.0987)
BAC 0.1053 0.00882 (0.0880, 0.1226)
WFC 0.0816 0.00538 (0.0710, 0.0921)
MRO 0.1210 0.01064 (0.1001, 0.1418)
COP 0.0817 0.00502 (0.0719, 0.0916)
For example, the sample standard deviation sx of Home Depot (HD) monthly returns is 0.0737, and we can say with
95% confidence that the population standard deviation σX is between 0.0676 and 0.0798. The standard errors in the
“se(sx )” column vary a lot, with the largest value of 0.01064 for MRO, indicating that its standard deviation estimate
is the least precise or, equivalently, that the associated confidence interval is the widest of the six stocks.
The following R code calculates the sx and se(sx ) values for the table above:
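A sketch of that calculation, assuming the six return series are columns of sp500 named as in the table and that se_sx is defined as above:
# sample stdev and its standard error for each stock's monthly returns
for (s in c("HD","LOW","BAC","WFC","MRO","COP")) {
  x <- sp500[, s]
  print(paste(s, ": sx", signif(sd(x), 4), "se(sx)", signif(se_sx(x), 4)))
}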
Example 14.10 (Education and earnings) For the sample of n = 2809 employed individuals from the cps dataset,
the sample correlation between education (x = educ) and weekly earnings (y = earnwk) is rxy ≈ 0.325. The associated
standard error is

se(rxy) = (1 – rxy²)/√n ≈ (1 – (0.325)²)/√2809 ≈ 0.0169.
The asymptotic 95% confidence interval for the population correlation ρXY between education and earnings is
(rxy – z0.025·se(rxy), rxy + z0.025·se(rxy)) ≈ (0.325 – (1.96)(0.0169), 0.325 + (1.96)(0.0169)) ≈ (0.292, 0.358),
and the asymptotic 99% confidence interval for the population correlation ρXY is
(rxy – z0.005·se(rxy), rxy + z0.005·se(rxy)) ≈ (0.325 – (2.576)(0.0169), 0.325 + (2.576)(0.0169)) ≈ (0.281, 0.369).
It can be said that, with 99% confidence, the population correlation ρXY is between 0.281 and 0.369. This 99%
confidence interval provides strong evidence that the true correlation between education and earnings is positive
since a correlation value of zero is not in the interval and, in fact, is far below the lower endpoint. Since the standard error is 0.0169, the number of standard errors by which zero falls below the lower endpoint 0.281 is 0.281/0.0169 ≈ 16.6.
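These calculations can be sketched in R as follows, using the educ and earnwk variables from cps:
# correlation between education and earnings, its std error, and CIs
x <- cps[,"educ"]; y <- cps[,"earnwk"]
n <- sum(!is.na(x) & !is.na(y))                  # employed sample size (2809)
r <- cor(x, y, use="complete.obs")
se_r <- (1 - r^2)/sqrt(n)
c(r - qnorm(0.975)*se_r, r + qnorm(0.975)*se_r)  # 95% CI, approx (0.292, 0.358)
c(r - qnorm(0.995)*se_r, r + qnorm(0.995)*se_r)  # 99% CI, approx (0.281, 0.369)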
Example 14.11 (Monthly stock returns) Example 7.13 provided a correlation matrix for six stocks (HD, LOW, BAC, WFC, MRO, COP) from the sp500 dataset. Applying the se(rxy) = (1 – rxy²)/√n formula, the correlation matrix can be
augmented to provide standard errors, reported in parentheses, alongside the sample correlation values:
HD LOW BAC WFC MRO COP
HD 1.000 0.648 (0.030) 0.331 (0.047) 0.280 (0.048) 0.189 (0.051) 0.215 (0.050)
LOW 1.000 0.357 (0.046) 0.262 (0.049) 0.181 (0.051) 0.256 (0.049)
BAC 1.000 0.692 (0.027) 0.331 (0.047) 0.339 (0.046)
WFC 1.000 0.379 (0.045) 0.396 (0.044)
MRO 1.000 0.771 (0.021)
COP 1.000
Since each of the stock pairs has the same sample size n = 364, the standard error se(rxy) = (1 – rxy²)/√n is a decreasing function of rxy, so the larger correlation values are associated with lower standard errors. The largest sample correlation of
0.771, between MRO and COP, has a se(rxy ) value of 0.021, whereas the lowest sample correlation of 0.181, between
LOW and MRO, has a se(rxy ) value of 0.051.
The confidence intervals for different correlations can be compared to each other. For instance, the asymptotic
95% confidence interval for ρHD,LOW is (0.589, 0.707), whereas the asymptotic 95% confidence interval for ρHD,BAC is
(0.239, 0.423). The sample correlation rHD,LOW = 0.648 is considerably higher than the sample correlation rHD,BAC = 0.331, and these two confidence intervals provide strong evidence that ρHD,LOW > ρHD,BAC, rather than the observed difference having arisen by chance. The two 95% confidence intervals for ρHD,LOW and ρHD,BAC have no overlap at all and, in terms of the standard error magnitudes, are separated by a large distance.
In general, it is very useful to report the standard error of an estimate alongside the estimate itself, as the reader can
then form any asymptotic confidence interval based upon those two numbers. For instance, for the sample correlation
rxy = 0.648 and the standard error se(rHD,LOW ) = 0.030 for the HD and LOW monthly stock returns, a reader could
“ballpark” an asymptotic 95% confidence interval in their head by using a 0.648 plus-or-minus two standard error
(0.030) interval. Alternatively, to more formally calculate a confidence interval, the appropriate critical value can be used; for example, an asymptotic 90% confidence interval for ρXY is (0.648 – (1.645)(0.030), 0.648 + (1.645)(0.030)), using z0.05 ≈ 1.645.
Here is the R code to create the table above:
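The full script is available on the companion website; a minimal sketch that produces the same ingredients, assuming the six return columns of sp500 are named as in the table, is:
# correlation matrix and matrix of standard errors (1 - r^2)/sqrt(n)
stocks <- c("HD","LOW","BAC","WFC","MRO","COP")
R  <- cor(sp500[, stocks])
n  <- nrow(sp500)              # 364 monthly observations per pair
SE <- (1 - R^2)/sqrt(n)
round(R, 3)                    # sample correlations
round(SE, 3)                   # associated standard errors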
Moving to the general case, Proposition 13.7 provided the asymptotic distribution for the sample quantile X̃q , which is
an estimator of the population quantile τX,q :
X̃q ∼a N(τX,q, q(1 – q)/(n·fX(τX,q)²)).
The estimate is the realized sample quantile x̃q . The asymptotic standard deviation is
√(V/n) = √(q(1 – q))/(√n·fX(τX,q)),
which again contains the problematic density fX (τX,q ). At this point, given the difficulty associated with estimating
this density, let’s assume that the sample quantile x̃q and its associated standard error se(x̃q ) can be calculated by a
statistical package like R. Then, an asymptotic 1 – α confidence interval for the population quantile τX,q is calculated
in the usual way:
(x̃q – zα/2 se(x̃q ), x̃q + zα/2 se(x̃q )).
Example 14.12 (Labor force data) In Example 6.11, various sample quantiles, including all sample deciles and
sample quartiles, were calculated for the weekly earnings (earnwk) variable from the cps dataset. The following table
augments the table from Example 6.11 by also including standard errors and asymptotic 95% confidence intervals.
Where do the standard errors se(x̃q ) come from? They are calculated using the bootstrap, covered in Chapter 15, and
the R code/results are specifically provided in Example 15.7.
q x̃q se(x̃q ) 95% CI for τX,q
0.1 355 7.0 (341, 368)
0.2 480 7.4 (465, 495)
0.25 520 11.7 (497, 543)
0.3 576 7.4 (562, 590)
0.4 670 10.4 (650, 690)
0.5 770 11.1 (748, 792)
0.6 900 16.2 (868, 932)
0.7 1080 21.9 (1037, 1123)
0.75 1194 23.7 (1147, 1240)
0.8 1346 23.9 (1299, 1393)
0.9 1750 52.0 (1648, 1852)
For the sample median, the estimate is 770 dollars (per week), with a standard error of 11.1 dollars (per week) and an asymptotic 95% confidence interval (748, 792) for the population median τX,0.5 of weekly earnings. From the asymptotic standard deviation formula, √(q(1 – q))/(√n·fX(τX,q)), there are two factors that affect the size of the standard error. The
term q(1 – q) is maximized at q = 0.5, so this term leads to smaller standard errors for q values closer to 0 or 1. On
the other hand, the term fX (τX,q ) leads to smaller standard errors at quantiles that have higher associated pdf fX (·)
values. Looking at Figure 6.7, the density of weekly earnings peaks just above the 0.25 quantile of the distribution. In
the right tail, the sparsity of the wage data, associated with very low pdf values, leads to much higher standard errors
even though q(1 – q) is relatively small there. As a result, the estimate of the 90% population quantile is considerably less
precise than the estimates of the other quantiles, whereas the most precise estimates are at the 10%, 20%, and 30%
population quantiles.
Next, consider a confidence interval for the difference in population means, µX – µY, in settings where both x and y are observed for each cross-sectional unit. Here are a few examples where such a confidence interval would be of interest:
interest:
• Exam score data: An instructor gives two exams, where X is the random variable associated with scores on the first
exam and Y is the random variable associated with scores on the second exam. If both exam scores are available for
a sample of students, a confidence interval for µX – µY provides information about the difference in the true average
scores for the two exams.
• Asset return data: X and Y are random variables associated with the returns on two different assets. If both
asset returns are available for observations over the same time period, a confidence interval for µX – µY provides
information about the difference in the true average returns of the two assets.
• Website user activity: X and Y are binary variables associated with two different actions by a website user, where a
value of 1 indicates the action is taken and 0 indicates the action is not taken. For example, X and Y could correspond
to whether or not the user clicks on two different links on the website homepage, or X and Y could correspond to
whether or not the user purchases two different products from the website. If both binary variables are observed for
a sample of website visitors, a confidence interval for µX – µY provides information about the difference in the true
probabilities of the two actions.
Assume that a sample {(x1 , y1 ), (x2 , y2 ), …, (xn , yn )} is observed, with underlying i.i.d. bivariate random variables
{(X1 , Y1 ), (X2 , Y2 ), …, (Xn , Yn )}. The random variables may be discrete or continuous, and we make no assumptions
about their distributions. In practice, as indicated by the examples above, X and Y would be expected to have the same
units, so that the difference µX – µY is meaningful, and generally would have the same type of distribution (e.g., both
binary in the website example, both continuous in the asset return example, both approximately continuous in the
exam score example).
The logical estimator for µX – µY is the difference in sample means X̄ – Ȳ. Since both X̄ and Ȳ have asymptotically
normal sampling distributions, the linear combination X̄ – Ȳ also has an asymptotically normal sampling distribution.
Let W = X – Y denote the linear combination (difference) of the random variables X and Y, and let wi = xi – yi denote
the corresponding linear combination (difference) of the observed variables. Applying the result for the asymptotic
sampling distribution of the sample mean to the random variable W yields
W̄ ∼a N(µW, σW²/n)

or, equivalently, since W̄ = X̄ – Ȳ and µW = µX–Y = µX – µY,

X̄ – Ȳ ∼a N(µX – µY, σW²/n).
The asymptotic variance σW² has not been simplified since it depends upon the covariance between X and Y. The appropriate standard error is sw/√n, as
sw = √((1/(n – 1)) Σᵢ₌₁ⁿ (wi – w̄)²) = √((1/(n – 1)) Σᵢ₌₁ⁿ (xi – yi – (x̄ – ȳ))²)
is a consistent estimate of σW .
Then, the two-sided asymptotic 1 – α confidence interval for µX – µY , the difference in population means, is
((x̄ – ȳ) – zα/2·sw/√n, (x̄ – ȳ) + zα/2·sw/√n) = ((x̄ – ȳ) – zα/2·sx–y/√n, (x̄ – ȳ) + zα/2·sx–y/√n).
One-sided asymptotic confidence intervals can also be constructed based upon the estimate x̄ – ȳ, the standard error sw/√n, and appropriate critical values.
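The following sketch implements this interval for generic paired data vectors x and y (the function name is illustrative):
# asymptotic (1 - alpha) CI for mu_X - mu_Y from paired observations
ci_meandiff <- function(x, y, alpha = 0.05) {
  w  <- x - y                    # per-unit differences w_i = x_i - y_i
  se <- sd(w)/sqrt(length(w))    # s_w / sqrt(n)
  z  <- qnorm(1 - alpha/2)
  c(mean(w) - z*se, mean(w) + z*se)
}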
Example 14.13 (Exam score data) The dataset exams contains the scores on two different exams, out of 100 points,
for a sample of 77 students. There are two variables, exam1 and exam2, indicating the scores on the first exam and the second exam, respectively.
linkA = linkB = 1. The following R code shows how the standard deviation of the difference linkA – linkB can be calculated by the two methods:
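A sketch of the two methods, assuming linkA and linkB are 0/1 columns of a data frame named visits (a placeholder name):
# method 1: form the per-user difference directly and take its sample stdev
sd(visits$linkA - visits$linkB)
# method 2: use var(X - Y) = var(X) + var(Y) - 2*cov(X, Y)
sqrt(var(visits$linkA) + var(visits$linkB) - 2*cov(visits$linkA, visits$linkB))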
For example, when the cps sample is split into two subsamples based upon union status, with one subsample being union employees and the other subsample being non-union employees, interest may lie in the difference between the population mean of weekly earnings for the two corresponding subpopulations.
Similarly, the difference in population variance or population standard deviation of weekly earnings could be
examined, or the difference in the population correlation between education and earnings for the two subpopulations
could be examined.
To formalize the two-sample setting, assume that one sample is labeled as sample A and the other sample is labeled as
sample B, where θA and θB are the underlying parameters of interest for the two corresponding populations. The sample
sizes of the two samples are nA and nB , respectively. Assume that θ̂XA and θ̂XB are asymptotically normal estimators of
θA and θB , with
θ̂XA ∼a N(θA, VA/nA) and θ̂XB ∼a N(θB, VB/nB),

where the realized parameter estimates are denoted θ̂xA and θ̂xB. The associated standard errors are se(θ̂xA) = √(V̂A/nA) and se(θ̂xB) = √(V̂B/nB), where V̂A and V̂B are consistent estimates of VA and VB.
A key property of the two-sample setting is that the two estimators θ̂XA and θ̂XB are independent of each other since
they are based upon different random samples. Therefore, when considering the asymptotic variance of the difference
θ̂XA – θ̂XB , it is not necessary to consider the covariance between the two estimators since their covariance is equal
to zero. The difference θ̂XA – θ̂XB , as a linear combination of θ̂XA and θ̂XB , is asymptotically normal. The mean of the
asymptotic sampling distribution is θA – θB, and the variance of the asymptotic sampling distribution is VA/nA + VB/nB, so that

θ̂XA – θ̂XB ∼a N(θA – θB, VA/nA + VB/nB).
To get the standard error for this estimator, the consistent estimates V̂A and V̂B are plugged in for VA and VB and a
square root is taken, leading to
se(θ̂xA – θ̂xB) = √(V̂A/nA + V̂B/nB) = √(se(θ̂xA)² + se(θ̂xB)²).
Thus, the two-sided asymptotic 1 – α confidence interval for the parameter difference θA – θB is
((θ̂xA – θ̂xB) – zα/2·√(se(θ̂xA)² + se(θ̂xB)²), (θ̂xA – θ̂xB) + zα/2·√(se(θ̂xA)² + se(θ̂xB)²)).
Conveniently, only the estimate θ̂xA and standard error se(θ̂xA ) from sample A and the estimate θ̂xB and standard error
se(θ̂xB ) from sample B are needed to calculate this confidence interval.
Example 14.15 (Widget website) For the widgets.com e-mail experiment, Example 14.5 provided confidence
intervals for the purchase probabilities πA , πB , and πC associated with the subpopulations of e-mail A recipients, e-mail
B recipients, and non-recipients. The estimates for these three parameters are the observed purchase frequencies
pA = 60/300 = 0.20, pB = 66/300 = 0.22, and pC = 360/2400 = 0.15,

with associated standard errors

se(pA) = √(pA(1 – pA)/300) = √((0.20)(0.80)/300) ≈ 0.0231,

se(pB) = √(pB(1 – pB)/300) = √((0.22)(0.78)/300) ≈ 0.0239,

and

se(pC) = √(pC(1 – pC)/2400) = √((0.15)(0.85)/2400) ≈ 0.0073.
In this example, confidence intervals are calculated for the difference in purchase probabilities for two of the
subpopulations: πA – πB , πA – πC , and πB – πC . (It’s unnecessary to separately consider the differences πB – πA or
πC – πB since their confidence intervals can be inferred directly from the confidence intervals for πA – πB and πB – πC ,
respectively.) The standard error of pA – pB , as an estimator for πA – πB , is
se(pA – pB) = √((0.20)(0.80)/300 + (0.22)(0.78)/300) ≈ 0.0332.
The asymptotic 95% confidence interval for πA – πB is
((0.20 – 0.22) – (1.96)(0.0332), (0.20 – 0.22) + (1.96)(0.0332)) ≈ (–0.085, 0.045).
This interval is quite wide, with 95% confidence that the difference in purchase probabilities πA – πB is between –8.5%
and 4.5%. The value of zero, corresponding to no difference (πA = πB ), is within this interval and therefore is plausible.
The asymptotic 95% confidence intervals for πA – πC and πB – πC can be constructed similarly. The associated
standard errors are

se(pA – pC) = √((0.20)(0.80)/300 + (0.15)(0.85)/2400) ≈ 0.0242

and

se(pB – pC) = √((0.22)(0.78)/300 + (0.15)(0.85)/2400) ≈ 0.0250.
The asymptotic 95% confidence interval for πA – πC is

((0.20 – 0.15) – (1.96)(0.0242), (0.20 – 0.15) + (1.96)(0.0242)) ≈ (0.003, 0.097),

and the 95% asymptotic confidence interval for πB – πC is

((0.22 – 0.15) – (1.96)(0.0250), (0.22 – 0.15) + (1.96)(0.0250)) ≈ (0.021, 0.119).

Unlike the confidence interval for πA – πB, these two confidence intervals provide statistical evidence of differences in the purchase probabilities, with the first confidence interval indicating 95% confidence that the difference πA – πC is between 0.3% and 9.7% and the second confidence interval indicating 95% confidence that the difference πB – πC is between 2.1% and 11.9%. The value of zero, corresponding to no difference, is not contained within either of the two confidence intervals.
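A compact R sketch of these difference calculations, using the counts from the example:
# estimates, std errors, and 95% CIs for differences in purchase probabilities
pA <- 60/300; pB <- 66/300; pC <- 360/2400
se_p <- function(p, n) sqrt(p*(1 - p)/n)     # std error of a sample proportion
z <- qnorm(0.975)
se_AB <- sqrt(se_p(pA,300)^2 + se_p(pB,300)^2)
se_AC <- sqrt(se_p(pA,300)^2 + se_p(pC,2400)^2)
se_BC <- sqrt(se_p(pB,300)^2 + se_p(pC,2400)^2)
c((pA-pB) - z*se_AB, (pA-pB) + z*se_AB)      # approx (-0.085, 0.045)
c((pA-pC) - z*se_AC, (pA-pC) + z*se_AC)      # approx (0.003, 0.097)
c((pB-pC) - z*se_BC, (pB-pC) + z*se_BC)      # approx (0.021, 0.119)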
Example 14.16 (Union workers versus non-union workers) Example 6.20 provided descriptive statistics for weekly
earnings of union workers and non-union workers, based upon the cps data. Part of the table from Example 6.20 is
reproduced below:
Sample n x̄ sx
Union workers 276 1197.7 720.0
Non-union workers 2533 946.5 749.7
Using U and NU subscripts to distinguish union and non-union workers, the standard errors associated with the sample means x̄U = 1197.7 and x̄NU = 946.5 are

se(x̄U) = 720.0/√276 ≈ 43.34 and se(x̄NU) = 749.7/√2533 ≈ 14.90,
so that the standard error of x̄U – x̄NU , as an estimator of µX,U – µX,NU , is
se(x̄U – x̄NU) = √(720.0²/276 + 749.7²/2533) ≈ 45.83.
The following R code uses the function se_meanx, defined in Section 14.3, to calculate these quantities:
# calculate std errors of the sample averages of union and non-union wages
se_union <- se_meanx(cps[cps$unionstatus=="Union","earnwk"], na.rm=TRUE)
se_nonunion <- se_meanx(cps[cps$unionstatus=="Non-union","earnwk"], na.rm=TRUE)
se_union
## [1] 43.33802
se_nonunion
## [1] 14.89694
# calculate std error of the difference in sample averages
sqrt(se_union^2 + se_nonunion^2)
## [1] 45.82688
The asymptotic 95% confidence interval for the difference in the population means of weekly earnings, µX,U – µX,NU ,
is
((1197.7 – 946.5) – (1.96)(45.83), (1197.7 – 946.5) + (1.96)(45.83)) ≈ (161, 341).
It can be said with 95% confidence that the difference between the population mean weekly earnings of union workers and non-union workers is between 161 and 341 dollars. This confidence interval provides fairly strong evidence that the difference is indeed positive, though the width of the interval indicates that the estimated difference of 251 dollars is not very precise.
How about the difference in the population standard deviation of weekly earnings for union versus non-union
workers? That is, what can be said about the difference in the variation of wage distributions for the two
subpopulations? The sample standard deviations sx,U = 720.0 and sx,NU = 749.7 are estimates of the population
standard deviations σX,U and σX,NU . Based upon the formula from Section 14.4.1 for the standard error of the sample
standard deviation, the associated standard errors are
se(sx,U ) = 55.18 and se(sx,NU ) = 34.56.
Then, the standard error of sx,U – sx,NU , as an estimator of σX,U – σX,NU , is
se(sx,U – sx,NU) = √(se(sx,U)² + se(sx,NU)²) ≈ 65.1.
The following R code uses the function se_sx, defined in Section 14.4.1, to calculate these quantities:
# calculate std errors of the sample stdevs of union and non-union earnings
# (assuming se_sx accepts na.rm, like se_meanx above)
se_sx_union <- se_sx(cps[cps$unionstatus=="Union","earnwk"], na.rm=TRUE)
se_sx_nonunion <- se_sx(cps[cps$unionstatus=="Non-union","earnwk"], na.rm=TRUE)
se_sx_union
## [1] 55.18488
se_sx_nonunion
## [1] 34.56397
# calculate std error of difference in stdevs of earnings
sqrt(se_sx_union^2 + se_sx_nonunion^2)
## [1] 65.11558
The asymptotic 95% confidence interval for the difference in population standard deviations of weekly earnings,
σX,U – σX,NU , is
((720.0 – 749.7) – (1.96)(65.1), (720.0 – 749.7) + (1.96)(65.1)) ≈ (–157, 98),
indicating no statistical evidence of a difference between the two population standard deviations.
To illustrate the generality of this two-sample approach, we consider looking at a difference in correlations for the
two subpopulations. Specifically, what can be said about the difference in the population correlation between weekly
earnings and education for union versus non-union workers? Does the confidence interval for the difference provide
any statistical evidence that the earnings-education relationship is different for union and non-union workers? For notation, let
rxy,U and rxy,NU denote the sample correlations between earnwk and educ for union workers and non-union workers,
respectively, and let ρXY,U and ρXY,NU denote the corresponding population correlations. The sample correlations
between earnwk and educ for the two subsamples are
rxy,U ≈ 0.253 and rxy,NU ≈ 0.329.
cor(cps[cps$unionstatus=="Union","earnwk"],cps[cps$unionstatus=="Union","educ"],use="complete.obs")
## [1] 0.2529239
cor(cps[cps$unionstatus=="Non-union","earnwk"],cps[cps$unionstatus=="Non-union","educ"],use="complete.obs")
## [1] 0.3287519
The optional argument use="complete.obs" tells the cor function to use only those observations for which
all variables have non-missing (non-NA) values. This argument is similar to the na.rm=TRUE optional argument for
functions like mean and sd.
Using the formula for the standard error of a sample correlation (Section 14.4.2), the standard errors are
se(rxy,U) = (1 – 0.253²)/√276 ≈ 0.056 and se(rxy,NU) = (1 – 0.329²)/√2533 ≈ 0.018.
Then, the standard error of rxy,U – rxy,NU , as an estimator of ρXY,U – ρXY,NU , is
se(rxy,U – rxy,NU) = √(se(rxy,U)² + se(rxy,NU)²) ≈ 0.059.
The following R code uses the function se_rxy, defined in Section 14.4.2, to calculate these quantities:
# calculate std errors of the two sample correlations
# (assuming se_rxy takes the two data vectors, as defined in Section 14.4.2)
se_rxy_union <- se_rxy(cps[cps$unionstatus=="Union","earnwk"],
                       cps[cps$unionstatus=="Union","educ"])
se_rxy_nonunion <- se_rxy(cps[cps$unionstatus=="Non-union","earnwk"],
                          cps[cps$unionstatus=="Non-union","educ"])
se_rxy_union
## [1] 0.05634235
se_rxy_nonunion
## [1] 0.01772186
sqrt(se_rxy_union^2 + se_rxy_nonunion^2)
## [1] 0.05906374
As an example, suppose we are interested in the quantity IQRX/σX, which is the population IQR of the random variable X in terms of standard deviations (e.g., a value of 4 would indicate that the IQR is four standard deviations wide). For i.i.d. random variables X1, X2, …, Xn, consistent estimators of IQRX and σX are X̃0.75 – X̃0.25 and sX, respectively, and Proposition 14.10 implies that

(X̃0.75 – X̃0.25)/sX

is a consistent estimator of IQRX/σX.
As another example, suppose we are interested in comparing two population correlations ρX1 X2 and ρX3 X4 based
upon data from a single dataset (like the sp500 dataset). Consistent estimators are the sample correlations rX1 X2
and rX3 X4 , respectively. Two alternative ways to compare correlations are to look at differences or to look at ratios.
Proposition 14.10 implies that the difference rX1X2 – rX3X4 is a consistent estimator of the difference ρX1X2 – ρX3X4 and also that the ratio rX1X2/rX3X4 is a consistent estimator of the ratio ρX1X2/ρX3X4 (if ρX3X4 ≠ 0).
How about asymptotic normality? The good news is that, due to a result known as the delta method, functions of
estimators will generally be asymptotically normal if the underlying estimators are themselves asymptotically normal.
For the single-parameter case, the only additional assumption needed is that the function f (·) is differentiable at the
true parameter θ. The following proposition provides the asymptotic-variance formula for the single-parameter case:
Proposition 14.11. (Delta method) If θ̂X is an asymptotically normal estimator of θ, with √n(θ̂X – θ) ∼a N(0, V), and f(·) is a continuous function that is differentiable at θ, then f(θ̂X) is an asymptotically normal estimator of f(θ), with

√n(f(θ̂X) – f(θ)) ∼a N(0, f′(θ)²V)

or, equivalently,

f(θ̂X) ∼a N(f(θ), f′(θ)²V/n).
In the one-parameter case with an estimator f(θ̂X), the mean of the asymptotic distribution is f(θ) since f(θ̂X) is a consistent estimator of f(θ); the variance of the asymptotic distribution is f′(θ)²·V/n, where V/n is the asymptotic variance of the original estimator. For example, for the odds estimator f(X̄) = X̄/(1 – X̄) discussed above, f′(π) = 1/(1 – π)² is obtained by taking the derivative of π/(1 – π) with respect to π. Therefore, the asymptotic variance of f(X̄) = X̄/(1 – X̄), as an estimator of f(π) = π/(1 – π), is

(1/(1 – π)²)² · (π(1 – π)/n) = π/(n(1 – π)³)

since π(1 – π)/n is the asymptotic variance of X̄ as an estimator of π. Based upon this asymptotic variance, the asymptotic standard deviation is √(π/(n(1 – π)³)), leading to the following standard error for the odds estimator:

se(x̄/(1 – x̄)) = √(x̄/(n(1 – x̄)³)).
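A sketch of the odds estimate and its delta-method standard error for a generic 0/1 data vector (the function name is illustrative):
# odds estimate xbar/(1-xbar) and its std error sqrt(xbar/(n*(1-xbar)^3))
odds_with_se <- function(x) {
  n <- length(x); xbar <- mean(x)
  c(odds = xbar/(1 - xbar), se = sqrt(xbar/(n*(1 - xbar)^3)))
}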
In the two-parameter case with an estimator f (θXa , θXb ), the mean of the asymptotic normal distribution is f (θa , θb );
the variance of the asymptotic distribution is more complicated than the one-parameter case since it involves partial
derivatives and the covariance between the two estimators.
Definition 14.9 An asymptotic (1 – α) predictive interval for X is (θ̂α/2 , θ̂1–α/2 ), where θ̂α/2 is a consistent estimate of
the population α/2’th quantile τX,α/2 and θ̂1–α/2 is a consistent estimate of the population (1 – α/2)’th quantile τX,1–α/2 .
There are two types of asymptotic predictive intervals, model-free intervals and model-based intervals. A model-free
interval uses the sample quantiles as consistent estimates of the population quantiles, so that the model-free asymptotic
(1 – α) predictive interval is
(θ̂α/2 , θ̂1–α/2 ) = (x̃α/2 , x̃1–α/2 ).
An advantage of this interval is that it doesn’t require any knowledge or assumptions about the distribution of X. For
the case of α = 0.05, the asymptotic 95% predictive interval has lower endpoint x̃0.025 and upper endpoint x̃0.975 . For
a large sample, there is approximately a 95% probability that a new draw of X from the population is between x̃0.025
and x̃0.975 . Since x̃0.025 and x̃0.975 are only estimates of the population quantiles τX,0.025 and τX,0.975 , it is a good idea
to calculate the standard errors for both x̃0.025 and x̃0.975 as a way of assessing how close the estimated endpoints are
likely to be to the true endpoints.
A model-based interval uses the model and estimates of the model’s parameters rather than the sample quantiles.
As an example, for a normal random variable X ∼ N(µ, σ 2 ), Section 11.1 provided the (1 – α) probability interval
(τX,α/2 , τX,1–α/2 ) = (µ – zα/2 σ, µ + zα/2 σ),
which suggests the model-based asymptotic (1 – α) predictive interval
(θ̂α/2 , θ̂1–α/2 ) = (x̄ – zα/2 sx , x̄ + zα/2 sx ).
Since X̄ is a consistent estimator of µ and sX is a consistent estimator of σ, X̄ – zα/2 sX and X̄ + zα/2 sX are consistent
estimators of µ – zα/2 σ and µ + zα/2 σ, respectively. The realized endpoints x̄ – zα/2 sx and x̄ + zα/2 sx become arbitrarily
close to the true endpoints as the sample size gets larger. The model-based interval for a normal random variable has
its endpoints equidistant from x̄, whereas a model-free interval would not necessarily have this property. Also, if the
model is true, it’s likely that a model-based interval provides more precise estimates of the endpoints than a model-
free interval does. The intuition is that the model-based interval uses additional information, specifically the model
being assumed for X, as compared to the model-free interval. That said, for very large samples, the model-free and
model-based intervals should look quite similar since both are based upon consistent estimates of the endpoints.
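The two intervals are easy to compare in R; a sketch for a generic data vector (the function name is illustrative):
# model-free vs model-based asymptotic 95% predictive intervals
pred_intervals <- function(x, alpha = 0.05) {
  z <- qnorm(1 - alpha/2)
  list(model_free  = quantile(x, c(alpha/2, 1 - alpha/2)),       # sample quantiles
       model_based = c(mean(x) - z*sd(x), mean(x) + z*sd(x)))    # normal model
}
pred_intervals(rnorm(10000, mean = 10, sd = 2))   # both intervals near (6.08, 13.92)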
Notes
44 If one estimator is inconsistent, then the other (consistent) estimator is generally preferred.
45 The number of confidence intervals, out of 100, for which µ is outside the confidence interval is a Binomial(100, 0.05) random variable. If the
number of Monte Carlo simulations grows large, the percentage of simulations for which µ falls outside the 95% confidence interval gets arbitrarily close to 5%.
46 The sample variance sx² of a Bernoulli random variable is (1/(n – 1)) Σᵢ₌₁ⁿ (xi – x̄)², which simplifies to

(1/(n – 1)) Σᵢ₌₁ⁿ (xi² – 2xi·x̄ + x̄²) = (1/(n – 1)) (Σᵢ₌₁ⁿ xi² – 2x̄ Σᵢ₌₁ⁿ xi + nx̄²) = (1/(n – 1)) (nx̄ – 2nx̄² + nx̄²) = (n/(n – 1)) x̄(1 – x̄),

applying the fact that xi² = xi for an indicator variable xi. Thus, sx = √(n/(n – 1)) · √(x̄(1 – x̄)).
Exercises
1. A professor wants to estimate the probability of cannabis use (in the last year) in the population of students who
take economic statistics. Concerned that students might not honestly answer a direct question about cannabis use, she
uses a method known as randomized response to elicit honest responses. There are 300 students in her economic
statistics class. Before class, she creates 300 pieces of paper, numbered 1 through 300. As students come into class,
they randomly pick a piece of paper (without the professor knowing their number). Once everyone is seated, she gives
the following instructions: “Please answer the question Is the last digit of your phone number even? if your piece of
paper has a number less than or equal to 150. Please answer the question Have you used cannabis in the last year? if
your piece of paper has a number greater than 150.” Using a phone-response system, she observes the total number
of “yes” responses and nothing else. Let Y denote the random variable associated with the total number of “yes”
responses from the 300-student class. Let π denote the probability of cannabis use in the last year. Assume (i) students
honestly respond to their question and (ii) the probability of an even last-digit is 0.5.
(a) What is E(Y) in terms of π?
(b) Propose an estimator of π that is a function of Y, and show that the proposed estimator is unbiased. What is the
value of the estimate when the realization of Y is 125?
(c) *The estimator in (b) estimates the unconditional probability of cannabis use in the last year. Using Bayes’
Theorem, determine both (i) the conditional probability of cannabis use given a “yes” response and (ii) the
conditional probability of cannabis use given a “no” response in terms of π. Plug in the estimate of π from (b)
(when Y = 125) to yield estimates of these conditional probabilities.
(d) Suppose an alternative randomization technique is used. Rather than the paper system, each student is asked
to flip a coin and, then, to answer the first question if the flip is heads and the second question if the flip is tails.
In this case, while the expected number of students answering the first question is 150, the actual number may
be different from 150. Is the estimator from (b) still unbiased? How would you expect the variance of the
coin-flip-based estimator to compare to the variance of the paper-based estimator?
(e) Returning to the paper randomization, suppose the professor alters the instructions to have the students answer
the first question if their number is less than or equal to 75 and the second question if their number is greater
than 75. What is an appropriate estimator of π (in terms of Y) in this case?
(f) In this part, computer simulations will be used to compare the performance of the estimators for the three
different randomized-response alternatives (alternative 1: original question, alternative 2: part (d), alternative
3: part (e)). Assume that the probability of cannabis use in the last year is known to be 30%. For each of
the three alternatives, conduct 10,000 simulations in R of the 300-student class responses and calculate the
associated estimates. What are the averages of the estimates for each of the three alternatives? What are the
standard deviations of the estimates for each of the three alternatives? Do the relative sizes of the standard
deviations make sense?
2. *Consider a negative binomial random variable X ∼ NegBin(r, π).
(a) For r ≥ 2, show that π̂X = (r – 1)/(X + r – 1) is an unbiased estimator of π.
(b) An employee at an investment management company has been asked to call the company’s clients to see which
ones are interested in receiving information about a new mutual fund. The employee has decided that she
will take a break after the fourth successful call, where “success” means the client is interested in receiving
information about the new mutual fund. Based on (a), what is the estimate of the success probability if the
successes occur on the 5’th, 12’th, 14’th, and 21’st calls?
(c) For r ≥ 1, show that θ̂X = (X + r)/r is an unbiased estimator of θ = 1/π.
(d) The result in (c) suggests an alternative estimator of π, given by π̃X = r/(X + r). Using this estimator, what is the estimate of the success probability for the scenario described in (b)?
(e) It turns out that both π̂X and π̃X are consistent estimators of π (when we think of r growing very large). In this
part, computer simulations will be used to compare the performance of the alternative estimators. Specifically,
we consider the scenario from (b), but we assume that we know that the true success probability is 17% (π =
0.17). For r = 4, conduct 100,000 simulated i.i.d. draws of X ∼ NegBin(r, π) in R and, for each draw, calculate
the two estimates based upon π̂X and π̃X .
i. What are the average values of the two estimators (over the simulations)?
ii. One way to compare estimators is to calculate the mean absolute error, defined as the average of
|estimate – true value| over the simulations. What are the mean absolute errors of the two estimators?
iii. Another way to compare estimators is to calculate the mean squared error, defined as the average of
(estimate – true value)2 over the simulations. What are the mean squared errors of the two estimators?
iv. Repeat the simulations and parts i.-iii. for r = 100 instead of r = 4.
3. Referring to Proposition 13.3, explain why the sum of i.i.d. random variables, S = Σᵢ₌₁ⁿ Xi, is an unbiased estimator of nµX but not a consistent estimator of nµX.
4. Consider an i.i.d. random sample x1 , x2 , …, xn drawn from a N(10, 4) distribution, where n is a very large number.
In thinking about the box plot with whiskers and outliers (Section 6.5.1), provide your best guesses for the values of
the following quantities:
(a) Sample median (line within the box)
(b) Top and bottom of the box
(c) Upper and lower whiskers
(d) The percentage of points (“outliers”) above the upper whisker
5. Assume that the monthly returns on a certain asset are i.i.d. draws from a normally distributed random variable X
with unknown mean and standard deviation. You observe the monthly returns for one year (n = 12), and calculate the
sample average to be 0.01 and the sample standard deviation to be 0.08.
(a) Using the appropriate t distribution, provide 95% and 99% confidence intervals for the population mean µX .
(b) How do the confidence intervals in (a) compare to the confidence intervals based upon a N(0, 1) distribution,
rather than the t distribution, for calculating the critical values?
(c) Suppose instead that n = 120, and you calculate the same sample average (0.01) and sample standard deviation
(0.08). Using the normal approximation, provide 95% and 99% confidence intervals for µX .
6. Assume that IQ scores are normally distributed, but the population mean and standard deviation of the normal
distribution are unknown. You collect a random sample of 20 individuals for which you calculate x̄ = 98 and sx = 12.
(a) Using the appropriate t distribution, provide 90% and 95% confidence intervals for the population mean µX .
(b) How do the confidence intervals in (a) compare to the confidence intervals based upon a N(0, 1) distribution,
rather than the t distribution, for calculating the critical values?
(c) If you instead had x̄ = 102 and sx = 12, how would the width of the 95% confidence interval for µX compare to
the one in (a)?
(d) If you instead had x̄ = 98 and sx = 16, how would the width of the 95% confidence interval for µX compare to
the one in (a)?
7. A company’s weekly profits are i.i.d. draws of a normal random variable X ∼ N(µ, σ 2 ). After n weeks of profits
(x1 , x2 , …, xn ) are observed, the following two confidence intervals for µ, based on t-distribution inference, are
constructed:
95% confidence interval for µ: (7.901, 12.099)
and
90% confidence interval for µ: (8.355, 11.645).
(a) What is the sample average of weekly profits?
(b) What is n? (Hint: Think about what the ratio of confidence-interval widths says about the ratio between the
critical values. Then, use R to determine the value of n consistent with that ratio.)
(c) What is the sample standard deviation of weekly profits?
(d) What are the one-sided 95% confidence intervals for µ? There are two such intervals, one of the form (L, ∞)
and one of the form (–∞, U).
8. A car factory implements a new production process and, over the course of the first 7 days, produces 198, 208, 206,
225, 234, 210, and 187 cars. Assume the daily production numbers are i.i.d. draws from a normal distribution.
(a) Calculate the sample mean and sample standard deviation of daily production.
(b) Determine the one-sided 95% confidence interval, of the form (L, ∞), for the population average of daily
production.
9. You are a fast-food mogul and own 50 franchises of Longhorn Burgers and 40 franchises of Perfect Pitas. Suppose
the monthly revenue (in thousands of dollars) for every franchise is an i.i.d. random variable, with Longhorn Burgers
revenues drawn from a distribution with population mean 20 and population standard deviation 4 and Perfect Pitas
revenues drawn from a distribution with population mean 15 and population standard deviation 3.
(a) In a given month, what is the approximate distribution of the sample average of the monthly revenues at the 40
franchises of Perfect Pitas?
(b) In a given month, what is the approximate distribution of the sample average of monthly revenues at all 90
franchises?
10. The two-sided asymptotic 95% confidence interval for a certain parameter is (10, 20).
(a) What is the two-sided asymptotic 80% confidence interval for the same parameter?
(b) For the one-sided asymptotic 95% confidence interval (L, ∞), what is the value of L?
11. A university’s IT department has a large inventory of old computer monitors, some of which are no longer working.
To estimate the proportion of non-working monitors, the IT staff tests 60 of them (at random) and finds that 15 are not
working. Find the asymptotic 95% confidence interval for the population proportion of monitors that are not working.
12. A landscaping company offers a promotion in a suburban neighborhood, whereby they mow a house’s lawn free the
first time and then ask the homeowner if they would like to continue having regular (paid) service thereafter. Suppose
n homeowners allow the company to mow their lawn for free. The company’s owner is interested in the probability π
that such homeowners continue with the paid service. After observing the proportion of n homeowners that continue
with the paid service, the company’s owner finds that the asymptotic 95% confidence interval for π is (0.1503, 0.2997).
(a) What is the sample proportion of homeowners that continued with the paid service?
(b) What is n?
13. An advertising company wishes to estimate the population mean of the distribution of hours of television watched
per household per day. Suppose the population standard deviation of hours watched per household per day is known
to be 2.8 hours. The company decides that it wants the asymptotic 99% confidence interval for the population mean to
be no wider than 0.5 hours. What is the minimum sample size that results in a small enough confidence interval?
14. Use the exams dataset, which contains data for 77 students on two different exams (exam1 and exam2). Suppose
exam1 scores are i.i.d. draws from a random variable with mean µ1 and standard deviation σ1 and exam2 scores are
i.i.d. draws from a random variable with mean µ2 and standard deviation σ2 .
(a) Which is larger: (i) the middle of the asymptotic 95% confidence interval for µ1 or (ii) the middle of the
asymptotic 95% confidence interval for µ2 ?
(b) Which is larger: (i) the width of the asymptotic 95% confidence interval for µ1 or (ii) the width of the asymptotic
95% confidence interval for µ2 ?
(c) What is the asymptotic 90% confidence interval for µ1 ?
(d) What is the asymptotic 90% confidence interval for µ1 + µ2 , the sum of the exam population means?
(e) What is the asymptotic 90% confidence interval for the population correlation ρexam1,exam2 ?
(f) The standard errors of the sample standard deviations of exam1 and exam2 are 2.1122 and 2.8632,
respectively. Which is larger: (i) the upper endpoint of the asymptotic 95% confidence interval of σexam1 or
(ii) the upper endpoint of the asymptotic 95% confidence interval of σexam2 ?
15. A researcher surveys 500 individuals on their level of happiness x (on a scale from 1 to 10) and their annual income
y (in thousands of dollars). Assume that the observed data are i.i.d. draws from the joint distribution of the underlying
random variables (X, Y).
(a) Before the researcher observes the data, what is the largest possible width of the asymptotic 95% confidence
interval for the population correlation ρXY (i.e., over all possible realizations of rxy )?
(b) The researcher calculates rxy = 0.23 based upon the observed sample. What is the asymptotic 95% confidence
interval for ρXY ?
(c) Using rxy = 0.23 again, what is the one-sided asymptotic 95% confidence interval, of the form (L, 1), for ρXY ?
(The upper end is 1 and not ∞ since ρXY ≤ 1.)
16. In a random survey of 250 undergraduates, 150 respondents indicate that they consume caffeine daily.
(a) Provide an asymptotic 95% confidence interval for the probability πC that an undergraduate from the population
consumes caffeine daily.
(b) Another survey of a different sample of 250 undergraduates indicates that 140 out of the 250 respondents sleep
at least seven hours on a daily basis. Let πS denote the true probability that an undergraduate sleeps at least seven
hours on a daily basis. Provide an asymptotic 95% confidence interval for the difference πC – πS .
17. In the sp500 data, 236 of the 360 monthly returns for the S&P 500 Index (idx) are positive. Assume that each
monthly return is an i.i.d. draw from some underlying random variable.
(a) Provide an asymptotic 90% confidence interval for the probability that the S&P 500 Index monthly return is
positive.
(b) A risk-averse investor is worried about large negative returns. In the sp500 data, 33 of the 360 monthly returns
for the S&P 500 Index are less than –0.05. Provide a one-sided asymptotic 95% confidence interval, of the
form (–∞, U), for the probability that the S&P 500 Index monthly return is less than –0.05.
18. Use the cps dataset for this question.
(a) Form a table of gender and union to determine how many male workers are union vs non-union and how
many female workers are union vs non-union.
(b) What is the asymptotic 95% confidence interval for the probability that a male worker is in a union?
(c) What is the asymptotic 95% confidence interval for the probability that a female worker is in a union?
(d) What is the asymptotic 95% confidence interval for the difference between the probability that a male worker
is in a union and the probability that a female worker is in a union?
19. A survey of 150 students from college A finds that the average SAT math score is 610 with a sample standard
deviation of 50. A similar survey of 200 students from college B finds that the average and standard deviation are 580
and 45, respectively.
(a) What is the asymptotic 90% confidence interval for the population average of SAT math scores at college A?
(b) What is the asymptotic 90% confidence interval for the population average of SAT math scores at college B?
(c) What is the asymptotic 90% confidence interval for the difference between the population average at college A
and the population average at college B?
(d) Based upon the confidence interval from (c), do you think that it is likely that the population average for college
A students is larger than the population average for college B students?
20. *An economist wants to study the gender wage gap in a particular industry by collecting salary data from male and
female workers. In this industry, 80% of workers are male. Let µm = E(Xm ) and µf = E(Xf ) denote the population means
of salaries for male and female workers, respectively. The economist is interested in forming confidence intervals for
the difference µm – µf , and the size of the total sample collected (male and female combined) is n. Let γ be the
proportion of the sample made up by male workers, so that there are γn male workers and (1 – γ)n female workers.
(a) If σm2 = Var(Xm ) and σf2 = Var(Xf ), what is the asymptotic variance of X̄m – X̄f in terms of n, γ, σm2 , and σf2 ?
(b) What value of γ (in terms of σm , σf , and/or n) minimizes the asymptotic variance of X̄m – X̄f ?
(c) If the variances of male and female wages are the same (σm2 = σf2 ), what value of γ minimizes the asymptotic
variance of X̄m – X̄f ?
(d) If the economist collects the data with a simple random sample, the proportion of male workers in the
sample will be approximately 80%. For the case of equal wage variances (σm2 = σf2 ), how would the width
of the asymptotic confidence interval based upon γ = 0.8 (simple random sample) compare to the width of
the asymptotic confidence interval using the optimal γ found in (c)? Does this argue for oversampling or
undersampling of female workers?
21. The number of customers that enter a coffee shop during a given minute of the day (say, between 2:00pm and
2:01pm) is distributed as a Poisson(λ) random variable X, where λ is an unknown parameter. Suppose the coffee
shop gathers data over the course of 100 days, each day recording the number of customers that enter between 2:00pm
and 2:01pm, and finds that the sample average is 2.12 and the sample standard deviation is 1.53. You may assume that
the number of customers on each day is independent from other days.
(a) The population mean of a Poisson random variable X is µX = λ. Provide an asymptotic 95% confidence interval
for λ.
(b) Thinking of the sample average x̄ as an estimate of µX = λ, what is the estimated probability of having exactly
two customers enter between 2:00pm and 2:01pm on a given day? (Plug x̄ in for λ in the Poisson pmf formula.)
(c) Repeat (b), but now use the lower and upper endpoints of the confidence interval from (a) to calculate the
estimated probability of having exactly two customers enter between 2:00pm and 2:01pm on a given day.
(Since there is uncertainty in our estimate of λ, as reflected by the confidence interval, this part shows how that
uncertainty translates to estimation of the probability value.)
22. Use the strikes dataset for this question. This dataset contains information on worker contract strikes within United
States manufacturing for the period 1968-1976. There are 566 observations on the variable duration (strike duration,
in weeks).
(a) What are the sample mean, sample median, and sample standard deviation of duration?
(b) Draw a (density) histogram of duration with 10 bins.
(c) Given the right-skewed nature of duration, you consider whether a log-normal distribution might be a good
description of duration. Generate a new variable lndur equal to the natural logarithm of duration.
(d) If lndur ∼ N(µ, σ 2 ) is true, the sample mean of lndur is a consistent estimator of µ. Provide an asymptotic 95%
confidence interval for µ.
(e) If lndur ∼ N(µ, σ²) is true, the expected value of duration is e^(µ + σ²/2). Using the sample mean and sample
standard deviation of lndur as estimates of µ and σ, respectively, plug into the expected-value formula to
get an estimated expected value of duration. How does this estimate compare to the sample mean of duration?
(f) Draw a (density) histogram of lndur with 10 bins. What do you conclude about the log-normal distribution
being a good model for duration?
23. The number of workplace injuries at a certain factory is tracked over 200 weeks. The average of the weekly number
of injuries is 0.4. Assume that each weekly observation is an i.i.d. draw from a Poisson(λ) random variable.
(a) Provide an asymptotic 90% confidence interval for λ.
(b) Show that P(X > 0) = 1 – P(X = 0) is an increasing function of λ.
(c) Based upon (b), the endpoints of an asymptotic 90% confidence interval for P(X > 0) can be determined from the following two probabilities: (i) the probability that there are any workplace injuries in a given week based upon the Poisson(λL) distribution, where λL is the lower endpoint of the interval from (a), and (ii) the probability that there are any workplace injuries in a given week based upon the Poisson(λU) distribution, where λU is the upper endpoint of the interval from (a). Calculate these endpoints.
24. A convenience store sells lottery tickets. On the day of a large drawing, the time (in minutes) between lottery-ticket
purchases can be considered i.i.d. draws of an exponential random variable X ∼ Exp(θ). Suppose the average of 100
observed times-between-purchases is 1.8 minutes.
(a) What is the asymptotic 90% confidence interval for 1/θ?
(b) Use the endpoints of the interval from (a) to form an asymptotic 90% confidence interval for θ. The resulting
interval need not be symmetric. (Hint: P(L ≤ 1/θ ≤ U) = P(1/U ≤ θ ≤ 1/L).)
(c) Based on the continuous mapping theorem (Proposition 14.9), how would you consistently estimate θ given
the consistent estimator of 1/θ?
(d) *Based on the delta method, what is the asymptotic standard error associated with the estimate from (c)?
(e) Provide an asymptotic 90% confidence interval for θ using the estimate from (c) and the standard error from
(d). How does this interval compare to the one found in (b)?
25. A prolific inventor submits 150 patent applications to the U.S. Patent and Trademark Office (USPTO), and
110 are successful (resulting in a patent being issued). Assume that the success of each patent application is an
i.i.d. Bernoulli(π) random variable.
(a) What are the estimated odds of a patent application being successful? (Recall that odds is defined as π/(1 – π).)
(b) Provide an asymptotic 95% confidence interval for the odds of a patent application being successful.
26. For i.i.d. X1 , X2 , …, Xn ∼ Poisson(λ) random variables, a consistent estimator of µX = λ is the sample average.
(a) Based on the continuous mapping theorem (Proposition 14.9), how would you consistently estimate the population standard deviation (σX = √λ) of the underlying Poisson random variable?
(b) *Based on the delta method, provide a formula for the asymptotic standard error associated with the estimate from (a). If x̄ = 1.2 and n = 200, what is the asymptotic 95% confidence interval for sd(X) = √λ?
15 The bootstrap
Chapter 14 considered estimation of standard errors and confidence intervals based upon two types of statistical
inference: finite-sample inference and asymptotic inference. For finite-sample inference, the finite-sample (exact)
sampling distribution results from Chapter 12 were used as the basis for confidence intervals for the population mean
of i.i.d. normal random variables. For asymptotic inference, the more general asymptotic (large-sample) sampling
distribution results from Chapter 13 were used as the basis for confidence intervals associated with any asymptotically normal estimator.
This chapter introduces another type of statistical inference, known as bootstrap inference or, more concisely, the
bootstrap. The bootstrap is a resampling-based method that can be used as an alternative to the inference approaches
in Chapter 14. Why might the bootstrap be needed as an alternative method for statistical inference? Here are three
different reasons:
1. Even with a very large sample, when the asymptotic distribution provides a good approximation, the asymptotic
variance (or standard deviation) formula might not provide a simple approach for estimation of an estimator’s standard
error. For example, as seen in Chapter 13, when using a sample quantile to estimate a population quantile, the formula
for the asymptotic standard deviation involves an unknown pdf fX (·) value that needs to be estimated for a standard
error to be calculated. The bootstrap provides an alternative approach that avoids estimation of the pdf altogether.
2. In certain situations, we may be interested in estimating a parameter for which the asymptotic variance is not readily
available and/or difficult to determine. For example, suppose the two random variables X and Y are associated with two
variables in an observed sample and we are interested in the difference between the population standard deviations,
σX – σY . Section 14.4.4 considered a similar situation in which an asymptotic confidence interval for µX – µY was
proposed, but the reasoning used there does not obviously extend to σX – σY . While sX – sY is an appropriate estimator
of σX – σY , determining the asymptotic standard deviation of sX – sY , as an estimator of σX – σY , is difficult since
sX and sY are not necessarily independent of each other. The bootstrap provides an alternative approach for this
situation that does not require the researcher to analytically determine the asymptotic variance. There are many other similar situations, in which the asymptotic variance is difficult to obtain, for which the bootstrap provides an appealing alternative. For example, the bootstrap can be used to form a confidence interval for the difference between the
population mean µX and population median τX,0.5 , based upon a random sample associated with a random variable X.
As another example, the bootstrap can be used to form a confidence interval for the difference between two population
correlations, based upon a random sample of multivariate data. For the labor force data, for instance, if we are interested
in the relative magnitudes of ρearnwk,educ and ρearnwk,age , a confidence interval for ρearnwk,educ – ρearnwk,age would be useful.
3. The observed sample may not be large enough for the asymptotic distribution to provide a good approximation of
an estimator’s true sampling distribution. While a finite-sample distribution can be used in some specific cases, like
estimation of the population mean of i.i.d. normal random variables, it is more likely that a finite-sample distribution
will not be available. For instance, for estimation of the population mean µX of i.i.d. random variables with an
unknown distribution, there is no finite-sample distribution that can be used, but the asymptotic sampling distribution
may not provide a good approximation if the sample is very small. As another example, the asymptotic distribution
associated with estimation of the true correlation ρXY between two random variables X and Y is known to provide a
poor approximation of the true sampling distribution when the sample is small and the true correlation ρXY is close to
1 or –1. For these examples and others where we might be concerned the sample is not large enough, the bootstrap
provides an alternative approach to construct confidence intervals.
Definition 15.1 For a given sample of size n, a bootstrap sample of size n is constructed by treating the original
sample as the population of interest and sampling with replacement by making n i.i.d. draws from the sample.
This resampling is done many times, and the number of bootstrap replications is denoted B. Some additional notation
is required to distinguish bootstrap sample observations from the original sample observations. For a univariate sample
{x1 , x2 , …, xn }, a given bootstrap sample is denoted
{x1b , x2b , …, xnb } for b ∈ {1, 2, …, B}.
For a bivariate sample {(x1, y1), (x2, y2), …, (xn, yn)}, a given bootstrap sample is denoted
{(x1b, y1b), (x2b, y2b), …, (xnb, ynb)} for b ∈ {1, 2, …, B}.
If there are more than two variables, the notation can be generalized to include additional variables.
Example 15.1 Consider the following bivariate data with seven observations (n = 7):
{(xi , yi )}7i=1 = {(4, 8), (3, 6), (8, 10), (12, 1), (0, 15), (10, 3), (5, 6)}.
These data can be shown in a table, similar to how they might appear in a spreadsheet:
i xi yi
1 4 8
2 3 6
3 8 10
4 12 1
5 0 15
6 10 3
7 5 6
To create the first bootstrap sample, associated with b = 1, the computer draws a sample of seven observations, with
replacement, from the original sample of seven observations. We make seven i.i.d. draws from the set of row numbers
{1, 2, 3, 4, 5, 6, 7}, where each row number is equally likely to be drawn, each with probability 1/7. Suppose the seven
draws are
7, 5, 2, 5, 6, 7, 3,
meaning the seventh row is drawn first, the fifth row is drawn second, the second row is drawn third, and so on, yielding
the first bootstrap sample:
i xi1 yi1
1 5 6
2 0 15
3 3 6
4 0 15
5 10 3
6 5 6
7 8 10
There are a couple of important things to note. First, the entire row is drawn each time a bootstrap observation is
created. That is, the x and y variables are jointly drawn for each bootstrap observation, which is important so that the
relationship between the x and y variables from the original sample is reflected and preserved in the bootstrap sample.
Second, due to the sampling being done with replacement, repeated observations are to be expected in the bootstrap
sample. In this first bootstrap sample, the fifth and seventh observations from the original sample both appear twice.
Suppose the seven draws for the second bootstrap sample (b = 2) are
5, 2, 2, 2, 1, 4, 6,
the seven draws for the third bootstrap sample (b = 3) are
2, 4, 6, 5, 7, 2, 1,
and the seven draws for the fourth bootstrap sample (b = 4) are
4, 3, 3, 5, 4, 1, 4.
The following table shows the original sample alongside the four bootstrap samples. For the original sample and
the bootstrap samples, the table also shows seven different descriptive statistics: means of x and y, medians of x and
y, standard deviations of x and y, and the correlation between x and y.
Sample Bootstrap samples
i xi yi xi1 yi1 xi2 yi2 xi3 yi3 xi4 yi4
1 4 8 5 6 0 15 3 6 12 1
2 3 6 0 15 3 6 12 1 8 10
3 8 10 3 6 3 6 10 3 8 10
4 12 1 0 15 3 6 0 15 0 15
5 0 15 10 3 4 8 5 6 12 1
6 10 3 5 6 12 1 3 6 4 8
7 5 6 8 10 10 3 4 8 12 1
x̄ 6 4.43 5 5.29 8
x̃0.5 5 5 3 4 8
sx 4.20 3.78 4.32 4.23 4.62
ȳ 7 8.71 6.43 6.43 6.57
ỹ0.5 6 6 6 6 8
sy 4.62 4.75 4.43 4.43 5.62
rxy –0.790 –0.762 –0.845 –0.870 –0.898
Focusing first on the mean of the x variable, the original sample has sample mean x̄ = 6. Due to the randomness
associated with the construction of the bootstrap samples, the sample means for the bootstrap samples should not
be expected to be equal to 6, and the sample means of x are 4.43, 5, 5.29, and 8 for the four bootstrap samples.
Similarly, for the other descriptive statistics, the statistic associated with any bootstrap sample can differ from the
statistic associated with the original sample.
The following R code constructs a single bootstrap sample:
set.seed(1234)
# original sample from Example 15.1
df <- data.frame(x = c(4, 3, 8, 12, 0, 10, 5),
                 y = c(8, 6, 10, 1, 15, 3, 6))
# draw seven index values with replacement, then form the bootstrap sample
bs_index <- sample(1:7, 7, replace = TRUE)
bs_df <- df[bs_index, ]
bs_df
## x y
## 4 12 1
## 2 3 6
## 6 10 3
## 5 0 15
## 4.1 12 1
## 7 5 6
## 1 4 8
print(paste("Means for bootstrap sample: x", round(mean(bs_df$x),2), ", y", round(mean(bs_df$y),2)))
## [1] "Means for bootstrap sample: x 6.57 , y 5.71"
print(paste("Medians for bootstrap sample: x", median(bs_df$x), ", y", median(bs_df$y)))
## [1] "Medians for bootstrap sample: x 5 , y 6"
print(paste("Stdevs for bootstrap sample: x", round(sd(bs_df$x),2), ", y", round(sd(bs_df$y),2)))
## [1] "Stdevs for bootstrap sample: x 4.76 , y 4.89"
print(paste("Correlation for bootstrap sample:", round(cor(bs_df$x,bs_df$y),3)))
## [1] "Correlation for bootstrap sample: -0.924"
The data frame df contains the original sample of n = 7 observations for x and y. The sample command randomly
draws seven index values from the set {1, 2, …, 7} with replacement (replace = TRUE argument). From the value
of bs_index, we see that the index value 4 is drawn twice and the index value 3 is not drawn at all. The data frame
bs_df is assigned to be the bootstrap sample associated with the index values bs_index. The print commands
summarize the descriptive statistics for this bootstrap sample.
Example 15.1 shows four different bootstrap samples associated with an original sample. In practice, however,
many more bootstrap samples are used for inference, corresponding to a large value of B. Ideally, we would want to
construct all possible bootstrap samples from the original sample, but that’s infeasible. After all, by the multiplication
rule, the total number of possible (distinct) bootstrap samples is equal to n^n. Even with n = 7, the total number of
distinct bootstrap samples is 823,543.
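As a quick check of this count in R:
7^7
## [1] 823543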
The next section considers bootstrap sampling when the number of bootstrap samples B is chosen to be large. With
the statistic or estimator of interest calculated for each of the bootstrap samples, we can consider the distribution of
the statistic or estimator over the large number of bootstrap samples.
Definition 15.2 The bootstrap sampling distribution of a statistic or estimator s(·) is the distribution of the values
of the statistic or estimator applied to each of the B bootstrap samples.
For a univariate statistic s(x1 , x2 , …, xn ), the statistic for the b-th bootstrap sample is s(x1b , x2b , …, xnb ). For example,
the sample mean x̄ is the statistic or estimate associated with the estimator X̄, and its bootstrap sampling distribution
is the distribution of the sample means for each of the B bootstrap samples, {x1b , x2b , …, xnb } for b ∈ {1, 2, …, B}. If the
bootstrap sample means are denoted x̄b , then
x̄b = (1/n) Σ_{i=1}^{n} xib for each b ∈ {1, 2, …, B}.
The collection of the B values, {x̄1, x̄2, …, x̄B}, is the bootstrap sampling distribution of the sample mean.
As another example, the sample correlation rxy is the estimate associated with the estimator rXY, and the bootstrap sample correlations are

rxyb = [ (1/(n–1)) Σ_{i=1}^{n} (xib – x̄b)(yib – ȳb) ] / [ √( (1/(n–1)) Σ_{i=1}^{n} (xib – x̄b)² ) √( (1/(n–1)) Σ_{i=1}^{n} (yib – ȳb)² ) ] for each b ∈ {1, 2, …, B},

where x̄b = (1/n) Σ_{i=1}^{n} xib and ȳb = (1/n) Σ_{i=1}^{n} yib. The collection of the B values, {rxy1, rxy2, …, rxyB}, is the bootstrap sampling distribution of the sample correlation between x and y.
Example 15.2 Continuing Example 15.1, consider the bootstrap sampling distributions of four statistics: the sample
mean x̄, the sample mean ȳ, the sample standard deviation sx , and the sample correlation rxy . Increasing B to 1,000,
Figure 15.1 shows the histograms of the statistics calculated for each of the 1,000 bootstrap samples. For each
histogram, a corresponding density curve is drawn, and the value of the original sample statistic is indicated by a
vertical dashed line. For the sample mean x̄ (top-left graph), the bootstrap sampling distribution looks fairly symmetric
around x̄ = 6, with nearly the entire distribution between 3 and 9. For the sample mean ȳ (top-right graph), the
bootstrap sampling distribution also looks fairly symmetric, this time around ȳ = 7. In contrast, the bootstrap sampling
distributions for the sample standard deviation sx (bottom-left graph) and the sample correlation rxy (bottom-right
graph) appear asymmetric. For the sample correlation, the sample statistic rxy = –0.790 is clearly to the right of the
peak of the distribution, which occurs below –0.9. The bootstrap sampling distribution for rxy has a very long right
tail, while there is no left tail since the sample correlation is never below –1.
Here is the R code to create the B = 1,000 bootstrap samples, calculate their descriptive statistics, and draw the
graphs in Figure 15.1:
set.seed(1234)
# sketch: create B = 1000 bootstrap samples from df (the Example 15.1
# data frame) and store four statistics for each
B <- 1000
bs_meanx <- rep(0, B); bs_meany <- rep(0, B)
bs_sdx <- rep(0, B); bs_corxy <- rep(0, B)
for (b in 1:B) {
  bs_df <- df[sample(1:7, 7, replace = TRUE), ]
  bs_meanx[b] <- mean(bs_df$x)
  bs_meany[b] <- mean(bs_df$y)
  bs_sdx[b] <- sd(bs_df$x)
  bs_corxy[b] <- cor(bs_df$x, bs_df$y)
}
# simplified plotting (Figure 15.1 in the text adds density curves and
# dashed vertical lines at the original-sample statistics)
par(mfrow = c(2, 2))
hist(bs_meanx, freq = FALSE); hist(bs_meany, freq = FALSE)
hist(bs_sdx, freq = FALSE); hist(bs_corxy, freq = FALSE)
Figure 15.1
Bootstrap sampling distributions
Let s1 denote the statistic of interest calculated for the first bootstrap sample, s2 the statistic for the second bootstrap sample, and so on. For univariate data, sb is shorthand for s(x1b, x2b, …, xnb). Using this notation, the bootstrap standard error is defined as follows:
Definition 15.3 The bootstrap standard error of a statistic s(·), denoted seB, is the standard deviation of the s(·) statistic over the B bootstrap samples,

seB = √( (1/(B–1)) Σ_{b=1}^{B} (sb – s̄B)² ),

where sb is the statistic for the b-th bootstrap sample and s̄B is the average of the statistic over the B bootstrap samples.
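In R, once the B bootstrap statistics have been stored in a vector, computing seB is a one-liner, since sd() uses the B – 1 divisor in Definition 15.3. A minimal sketch (the vector name bs_stats is assumed for illustration):
# bootstrap standard error: sample standard deviation of the stored
# bootstrap statistics (vector name bs_stats assumed for illustration)
se_B <- sd(bs_stats)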
For large samples, the bootstrap standard error can be used as an alternative to the asymptotic standard error. Even in cases where the asymptotic distribution provides a good approximation to an estimator's sampling distribution, the bootstrap standard error can be very useful if the asymptotic standard error is difficult to calculate (e.g., sample quantiles) or if the formula for the asymptotic standard deviation is unknown (as in the examples at the beginning of the chapter).
When does the bootstrap standard error provide an appropriate alternative to the asymptotic standard error? Generally speaking, as long as an estimator θ̂X is √n-consistent and asymptotically normal, the bootstrap standard error seB gets arbitrarily close to the asymptotic standard deviation of θ̂X as n → ∞ and B → ∞. The reasons for the
two “→ ∞” conditions are quite different. The n → ∞ requirement is needed so that the sample is large enough for
the asymptotic (normal) sampling distribution to be arbitrarily close to the estimator’s true sampling distribution. The
B → ∞ requirement, on the other hand, is needed to eliminate the sampling error associated with bootstrap sampling,
as there is an inherent randomness due to the resampling process. In practice, B should be chosen as large as possible,
subject to the constraints of the computer, to minimize sampling error. In most applied work, B is chosen to be in the
thousands, and a larger choice is always preferred if possible.
The bootstrap standard error can be used for any of the √n-consistent and asymptotically normal estimators already discussed in this book, as well as the regression estimators discussed in later chapters. For estimators that are not √n-consistent and asymptotically normal, the bootstrap is not guaranteed to provide valid inference. An example is the sample maximum, maxX = max(X1, X2, …, Xn), for which the bootstrap should not be used.
Example 15.3 (Mean and median of a log-normal random variable) Suppose the sample x1 , x2 , …, x100 consists of
i.i.d. draws from a log-normal distribution, with ln(X) ∼ N(0, 1). The sample size is n = 100. The following R code
first draws the sample x1 , x2 , …, x100 from the population, calculates the sample mean x̄ and sample median x̃0.5 , and
calculates the asymptotic standard error se(x̄) = sx/√n. For the sample median, the asymptotic standard error is difficult
to calculate, as previously discussed. Using B = 10,000 bootstrap iterations, the code calculates bootstrap standard
errors for the sample mean x̄ and the sample median x̃0.5 :
set.seed(1234)
nobs <- 100
x <- rlnorm(nobs, meanlog = 0, sdlog = 1)  # sample with ln(X) ~ N(0,1)
mean(x); median(x); sd(x)/sqrt(nobs)       # sample stats and se of the mean
B <- 10000
bs_meanx <- rep(0, B)
bs_medianx <- rep(0, B)
for (i in 1:B) {
  bs_index <- sample(1:nobs, nobs, replace = TRUE)
  bs_x <- x[bs_index]
  bs_meanx[i] <- mean(bs_x)
  bs_medianx[i] <- median(bs_x)
}
sd(bs_meanx); sd(bs_medianx)               # bootstrap standard errors
The original sample has sample mean x̄ = 1.524, sample median x̃0.5 = 0.681, and sample standard deviation sx =
2.171. For the sample mean, the asymptotic standard error is se(x̄) = sx/√n = 0.2171. Using B = 10,000 bootstrap samples,
the bootstrap standard error for the sample mean is seB (x̄) = 0.2165, and the bootstrap standard error for the sample
median is seB (x̃0.5 ) = 0.0877. The bootstrap standard error seB (x̄) = 0.2165 is very close to the asymptotic standard
error se(x̄) = 0.2171.
How about statistical inference for the difference between the population mean µX and the population median τX,0.5 ?
The difference between the sample mean and the sample median, x̄ – x̃0.5 = 1.5239 – 0.6808 = 0.8431, is an estimate
of µX – τX,0.5 . (The difference µX – τX,0.5 is non-zero here since the log-normal is a right-skewed distribution, with
population mean greater than population median.) Once bootstrap sampling has been done, we can calculate the
bootstrap standard error seB (x̄ – x̃0.5 ) by calculating the difference x̄ – x̃0.5 for each bootstrap sample and taking the
standard deviation of the resulting B values. This process, based upon the B = 10,000 bootstrap samples, yields the
bootstrap standard error seB (x̄ – x̃0.5 ) = 0.1874.
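In R, given the vectors bs_meanx and bs_medianx from the code above, this calculation takes a single line:
# bootstrap standard error of the mean-median difference
sd(bs_meanx - bs_medianx)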
Example 15.4 (Monthly stock returns: correlation differences) Example 14.11 provided a sample correlation matrix,
with asymptotic standard errors, for a set of six stocks from the sp500 dataset. Each sample correlation in that
matrix is an estimate of the underlying population correlation. For example, the sample correlation rHD,LOW is an
estimate of the population correlation ρHD,LOW , and its asymptotic standard error can be used for statistical inference.
What if we are instead interested in the difference between two population correlations? For instance, the difference
ρHD,LOW – ρHD,BAC is of interest if we want to know if the correlation between HD and LOW returns is larger or
smaller than the correlation between HD and BAC returns. While the difference rHD,LOW – rHD,BAC provides a logical
estimator of ρHD,LOW – ρHD,BAC , it is difficult to calculate an asymptotic standard error for rHD,LOW – rHD,BAC . This
situation is different from the two-sample setting considered in Section 14.4.5, where the asymptotic standard error of
the correlation difference was easy to determine since the two correlations for a two-sample problem are independent
of each other (see Example 14.16). Here, since they are based upon the same sample, the sample correlations
rHD,LOW and rHD,BAC are not independent, meaning the asymptotic standard error is complicated and depends upon
their covariance/correlation. Rather than attempting to derive this more complicated asymptotic standard error, the
bootstrap provides a simpler alternative. We repeatedly create bootstrap samples, and for each bootstrap sample, the
bootstrap statistic is the difference between the HD-LOW correlation and the HD-BAC correlation.
The following R code calculates the bootstrap standard error for rHD,LOW – rHD,BAC using B = 1,000:
set.seed(1234)
# sketch: column names HD, LOW, and BAC are assumed for the sp500 data frame
nobs <- nrow(sp500)
B <- 1000
bs_rdiff <- rep(0, B)
for (b in 1:B) {
  bs_df <- sp500[sample(1:nobs, nobs, replace = TRUE), ]
  bs_rdiff[b] <- cor(bs_df$HD, bs_df$LOW) - cor(bs_df$HD, bs_df$BAC)
}
# output the estimate and its bootstrap standard error
rdiff <- cor(sp500$HD, sp500$LOW) - cor(sp500$HD, sp500$BAC)
print(paste("Estimate:", round(rdiff, 3), " Bootstrap se:", round(sd(bs_rdiff), 3)))
The estimate of the difference ρHD,LOW – ρHD,BAC is rHD,LOW – rHD,BAC = 0.317, and the bootstrap standard error is
seB (rHD,LOW – rHD,BAC ) = 0.061. There is nothing special about the pairs HD-LOW and HD-BAC used in this example,
and the same approach could be used to calculate a bootstrap standard error for the difference between any two
correlations (e.g., rBAC,WFC – rMRO,COP ).
The 99% confidence interval provides strong statistical evidence that ρHD,LOW – ρHD,BAC is positive or, equivalently,
that ρHD,LOW is greater than ρHD,BAC .
Example 15.7 (Labor force data) Example 14.12 reported the bootstrap standard errors for various sample quantiles,
including the sample deciles and sample quartiles, of the weekly earnings variable. Here is the R code used to create
the table in Example 14.12, with B = 5,000 bootstrap iterations used to calculate the bootstrap standard errors and the
normal-based confidence intervals:
set.seed(1234)
# initialize variables
nobs <- nrow(cpsemployed)
B <- 5000
# sketch of the resampling loop: bootstrap the sample deciles and quartiles
# of earnwk (this particular set of quantile levels is assumed)
probs <- sort(c(seq(0.1, 0.9, by = 0.1), 0.25, 0.75))
bs_q <- matrix(0, B, length(probs))
for (b in 1:B) {
  bs_earnwk <- cpsemployed$earnwk[sample(1:nobs, nobs, replace = TRUE)]
  bs_q[b, ] <- quantile(bs_earnwk, probs)
}
How about calculating a bootstrap standard error and normal-based bootstrap interval for the interquartile range
τX,0.75 – τX,0.25 ? The following R code calculates these quantities, again using B = 5,000 bootstrap iterations:
set.seed(1234)
# initialize variables
nobs <- nrow(cpsemployed)
B <- 5000
bs_iqr <- rep(0,B)
# resampling loop: calculate the IQR for each bootstrap sample
for (b in 1:B) {
  bs_earnwk <- cpsemployed$earnwk[sample(1:nobs, nobs, replace = TRUE)]
  bs_iqr[b] <- IQR(bs_earnwk)
}
# output the bootstrap standard error and normal-based CI for the IQR
iqr_earnwk <- IQR(cpsemployed$earnwk)
print(paste("IQR: ", round(iqr_earnwk,1),
", bootstrap se ", round(sd(bs_iqr),1),
", 95% CI (", round(iqr_earnwk-1.96*sd(bs_iqr),1),
",", round(iqr_earnwk+1.96*sd(bs_iqr),1), ")", sep=""))
## [1] "IQR: 673.6, bootstrap se 23.3, 95% CI (628,719.2)"
The estimated IQRearnwk is 673.6, with a bootstrap standard error seB (IQRearnwk ) = 23.3 and a normal-based
bootstrap 95% confidence interval of (628.0, 719.2). With 95% confidence, it can be said that the population IQR
is between 628.0 and 719.2.
percentile interval may be asymmetric, with the distance between the lower endpoint and the estimate potentially
being different from the distance between the estimate and the upper endpoint.47
While the importance of choosing a large value for B has already been discussed, a large B is particularly important when forming bootstrap percentile intervals. These intervals are based upon the extremes of the bootstrap sampling
distribution (e.g., the 2.5% and 97.5% quantiles in the case of a two-sided 95% interval), and estimators of extreme
quantiles are less precise than estimators of other quantities like the standard deviation. Therefore, all else equal,
we want a larger choice of B for calculating percentile intervals than would be needed for calculation of a bootstrap
standard error.
Example 15.8 (Mean and median of a log-normal random variable) Example 15.3 provided R code to calculate
the bootstrap distribution (B = 10,000 bootstrap statistics) associated with the sample mean x̄, the sample median
x̃0.5 , and the difference x̄ – x̃0.5 . The bootstrap statistics for the sample mean and the sample median were stored in
the vectors bs_meanx and bs_medianx, respectively. With these two vectors available, we can calculate 95%
bootstrap percentile intervals with the following R code:
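# 95% bootstrap percentile intervals: sample 2.5% and 97.5% quantiles
# (sketch using the bs_meanx and bs_medianx vectors from Example 15.3)
quantile(bs_meanx, c(0.025, 0.975))
quantile(bs_medianx, c(0.025, 0.975))
quantile(bs_meanx - bs_medianx, c(0.025, 0.975))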
The results are as follows: a 95% bootstrap percentile interval of (1.136, 1.976) for µX (as compared to the 95%
normal-based bootstrap confidence interval (1.099, 1.948)), a 95% bootstrap percentile interval of (0.604, 0.877) for
τX,0.5 (as compared to the 95% normal-based bootstrap confidence interval (0.509, 0.853)), and a 95% bootstrap
percentile interval of (0.487, 1.211) for µX – τX,0.5 (as compared to the 95% normal-based bootstrap confidence
interval (0.476, 1.210)). The only meaningful difference seems to arise for the τX,0.5 interval, which may suggest that the
sample size (n = 100) is not large enough for asymptotic normality of the sample median estimator with the underlying
log-normal random variable.48
Changing the confidence level of the percentile interval is straightforward. The following R code instead calculates
90% bootstrap percentile intervals by changing the quantiles from 2.5% and 97.5% to 5% and 95%, respectively:
quantile(bs_meanx, c(0.05,0.95))
## 5% 95%
## 1.185190 1.900511
quantile(bs_medianx, c(0.05,0.95))
## 5% 95%
## 0.6088298 0.8469239
quantile(bs_meanx-bs_medianx, c(0.05,0.95))
## 5% 95%
## 0.5295099 1.1399688
The resulting 90% bootstrap percentile intervals are (1.185, 1.901), (0.609, 0.847), and (0.530, 1.140) for µX , τX,0.5 ,
and µX – τX,0.5 , respectively.
Example 15.9 (Monthly stock returns: correlation differences) Example 15.4 provided R code to calculate the
bootstrap distribution (B = 1,000 bootstrap statistics) associated with rHD,LOW – rHD,BAC , with the results in the vector
bs_rdiff. With this vector available, we can calculate a 95% bootstrap percentile interval by calculating the sample
2.5% and 97.5% quantiles.
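Using the bs_rdiff vector, this is one line of R:
# 95% bootstrap percentile interval for the correlation difference
quantile(bs_rdiff, c(0.025, 0.975))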
The 95% bootstrap percentile interval is (0.20, 0.44), which is the same to two decimal places as the normal-based
bootstrap 95% confidence interval calculated in Example 15.6.
Example 15.10 (Labor market data) Example 15.7 provided R code to calculate the bootstrap distribution (B = 5,000)
associated with the IQR of the weekly earnings variable earnwk from the cps dataset, with the results in the vector
bs_iqr. With this vector available, we can calculate a 95% bootstrap percentile interval, as in the previous examples:
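# 95% bootstrap percentile interval for the population IQR of earnwk
quantile(bs_iqr, c(0.025, 0.975))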
The 95% bootstrap percentile interval for the population IQR is (617.7, 701.0), as compared to the normal-based
bootstrap 95% confidence interval (628.0, 719.2).
Notes
47 While the ability of the bootstrap percentile interval to handle non-normality and asymmetry is appealing, there are no general theoretical
results showing that the bootstrap percentile interval actually performs better than a normal-based interval (either asymptotic or bootstrap). In fact,
the main theoretical results supporting the use of the bootstrap involve large samples (n → ∞), in which case the asymptotic normal confidence
interval should work well. For large samples, another bootstrap method known as the studentized bootstrap or bootstrap-t method has been shown
to perform better than asymptotic normal confidence intervals; this method is beyond the scope of this book, but the interested reader can easily find
references for it.
48 Indeed, the histogram of the bootstrap median estimates confirms that the bootstrap distribution does not have a nice bell shape like that of
the distribution of the bootstrap mean estimates. The interested reader can confirm with the commands hist(bs_meanx,breaks=50) and
hist(bs_medianx,breaks=50).
Exercises
1. Suppose a bootstrap sample of size n is created by sampling with replacement from the i.i.d. sample {x1 , x2 , …, xn }.
(a) What is the probability that a specific observation xj , for some j ∈ {1, 2, …, n}, is in the bootstrap sample at
least once?
(b) What is the probability that a specific observation xj , for some j ∈ {1, 2, …, n}, is in the bootstrap sample at
least twice?
(c) Evaluate the probabilities from (a) and (b) for n = 100.
2. Exercise 14.25 considered a prolific inventor who submits 150 patent applications to the U.S. Patent and Trademark
Office (USPTO), with 110 being successful (resulting in a patent being issued). Assume that the success of each patent
application can be considered an i.i.d. Bernoulli(π) random variable.
(a) Rather than creating a dataset with 110 successes (ones) and 40 failures (zeros), a shortcut to create bootstrap samples for i.i.d. Bernoulli data is to repeatedly draw from the appropriate binomial distribution, which in this case is Binomial(150, 110/150). For instance, the first draw from this binomial might be 105, corresponding to a bootstrap sample with 105 successes (ones) and 45 failures (zeros). Use this approach with 5,000 iterations to calculate a bootstrap standard error and construct a two-sided normal-based 95% bootstrap confidence interval for π.
(b) Using the same approach, conduct 5,000 bootstrap iterations to calculate a bootstrap standard error and construct a two-sided normal-based 95% bootstrap confidence interval for the odds π/(1 – π).
(c) Using the same approach, conduct 5,000 bootstrap iterations to construct a two-sided 95% bootstrap percentile interval for the odds π/(1 – π).
3. Use the strikes dataset for this question. This dataset contains information on worker contract strikes within United
States manufacturing for the period 1968-1976. There are 566 observations on the variable duration (strike duration,
in weeks). Let X denote the random variable associated with duration.
(a) What are the sample 75% quantile and sample 90% quantile of duration? What is the sample IQR of duration?
(b) Use the bootstrap with 5,000 iterations to construct two-sided normal-based bootstrap 95% confidence intervals
for τX,0.75 , τX,0.90 , and IQRX .
(c) Let πlong = P(X > 52) be the probability that a strike lasts longer than a year. Use the bootstrap with 5,000
iterations to construct a two-sided normal-based bootstrap 95% confidence interval for πlong . How does this
interval compare to the 95% asymptotic confidence interval for πlong ?
4. Due to concerns about a dangerous intersection, a town gathers data on the weekly number of car accidents in the
intersection. The data for 50 consecutive weeks are summarized by the following table:
# accidents 0 1 2 3 4 5 6
# weeks 23 14 5 2 5 0 1
Assume that the number of accidents each week is an i.i.d. draw from a random variable X.
(a) Create a data frame or a vector of 50 observations in R based upon the table.
(b) Calculate the sample mean and the sample variance.
(c) Use the bootstrap with 5,000 iterations to construct a two-sided normal-based bootstrap 95% confidence interval for µX and a two-sided normal-based bootstrap 95% confidence interval for σX².
(d) A town official studied statistics in college and wonders whether X is a Poisson random variable. She recalls that a feature of a Poisson(λ) random variable is that the population mean and population variance are both equal to λ and, therefore, equal to each other. Let θ = µX – σX² be the difference between the population mean and population variance, which would be θ = 0 if X is truly Poisson. Use the bootstrap with 5,000 iterations to construct a two-sided normal-based bootstrap 95% confidence interval for θ = µX – σX². What does the confidence interval say about X being Poisson?
(e) Same as (d), but construct a two-sided 95% bootstrap percentile interval for θ = µX – σX².
5. There are 106 unemployed individuals in the cps data, for whom the variable lfstatus (labor-force status) is equal
to “Unemployed” and the variable unempwks (weeks unemployed) is reported. For this question, focus only on the
sample of 106 unemployed individuals.
(a) Draw a histogram of unempwks for the unemployed individuals.
(b) Given the right-skewed distribution of unempwks, a classmate suggests taking a log transformation of the
variable to get a distribution that is more symmetric. Create a new variable lnunempwks that does so, and
calculate its sample mean and sample median.
(c) If the distribution of lnunempwks is symmetric, it should be the case that the population mean µX is equal
to the population median τX,0.5 for the underlying random variable X. Use the bootstrap with 5,000 iterations
to construct a two-sided normal-based bootstrap 95% confidence interval for θ = µX – τX,0.5 . What does the
confidence interval say about the population mean of X being equal to the population median of X?
(d) Same as (c), but do so for the (ratio) parameter θ = µX/τX,0.5.
6. There are 2,809 employed individuals in the cps data, which is the sample of interest for this question. Suppose the
probability of union membership, in the population, is πm for male workers and πf for female workers.
(a) What is the estimate of the ratio πm/πf, given by π̂m/π̂f, where π̂m is the observed sample proportion of union members among male workers and π̂f is the observed sample proportion of union members among female workers?
(b) Use the bootstrap with 5,000 iterations to construct a two-sided normal-based bootstrap 95% confidence interval for the ratio πm/πf. (Hint: Create bootstrap samples separately for the male-worker subsample and the female-worker subsample.)
(c) The odds ratio (OR) is a measure used by statisticians to compare the likelihood of a certain outcome occurring in two different groups. In the context of this union-gender example, the odds ratio is

[πm/(1 – πm)] / [πf/(1 – πf)],

which is the ratio between the odds of a male worker being in a union and the odds of a female worker being in a union. Plugging π̂m and π̂f in for πm and πf, what is the estimated OR? Use the bootstrap with 5,000 iterations to construct a two-sided normal-based bootstrap 95% confidence interval for the OR.
16 Hypothesis testing
This chapter introduces the concept of hypothesis testing. Sections 16.1-16.3 focus on testing a hypothesis about the
value of a single unknown parameter, and Section 16.4 extends the framework to consider tests of multiple hypotheses.
For the case of a single unknown parameter, we fix ideas by denoting the unknown parameter of interest by θ. The
following examples motivate the usefulness of hypothesis tests for a single parameter.
Example 16.1 (Widget website) Examples 2.1, 14.5, and 14.15 considered the purchase probabilities for three groups
of widgets.com users: recipients of e-mail A, recipients of e-mail B, and non-recipients. The parameters πA , πB , and
πC denote the purchase probabilities of these three groups, respectively. To compare the effectiveness of e-mail A versus
e-mail B, the difference πA – πB is the quantity of interest. If this difference is positive, then e-mail A is more effective
than e-mail B; if this difference is negative, then e-mail B is more effective than e-mail A; and, if this difference is zero,
then e-mail A and e-mail B are equally effective. If θ = πA – πB is the parameter equal to the difference between the two
purchase probabilities, we want to know whether θ = 0 (no difference in e-mail effectiveness) or θ ≠ 0 (difference in e-mail effectiveness). Example 14.15 constructed a 95% confidence interval (–0.085, 0.045) for θ = πA – πB based upon the estimate pA = 60/300 = 0.20 of πA, the estimate pB = 66/300 = 0.22 of πB, and the standard errors se(pA) and se(pB). From this confidence
interval, it appears that zero is a plausible value for θ since it falls within the confidence interval. Therefore, it would
be expected that a formal statistical test would not be able to rule out θ = 0 with a high level of confidence. A test of the
hypothesis θ = 0 is known as a two-sided test since statistical evidence of either θ < 0 or θ > 0 would call into question
the hypothesis θ = 0.
Example 16.2 (Investment opportunity) You are interested in the possibility of buying a business that produces and
sells a certain product. By your calculations, the true average of weekly sales would need to be at least $10,000 in
order for the investment to be worthwhile. As part of due diligence, you obtain weekly sales figures from the business for
10 randomly chosen weeks. For those 10 weeks, the sample mean of weekly sales is $11,200, and the sample standard
deviation of weekly sales is $3,400. If θ denotes the population average of weekly sales, measured in thousands of
dollars, you are interested in knowing whether θ ≤ 10 (the business is not a worthwhile investment) or θ > 10 (the
business is a worthwhile investment). The observed sample is {x1 , x2 , …, x10 }, where xi is weekly sales in thousands of
dollars for a particular week. If these observations can be considered i.i.d. draws from some distribution, the sample
mean x̄ = 11.2 serves as an estimate of θ and, therefore, provides some evidence against θ ≤ 10. But how strong is this
evidence? A more formal test needs to take into account the fact that the sample mean is a random variable. As seen
below, the estimate’s standard error accounts for the potential imprecision of the estimator, similar to what was done
for confidence intervals in Chapter 14. A test of the hypothesis θ ≤ 10 is known as a one-sided test since only statistical
evidence of θ > 10 would call into question the hypothesis θ ≤ 10.
As done for confidence intervals in Chapter 14, this chapter first considers hypothesis testing for the unknown
population mean of i.i.d. normal random variables and then considers hypothesis testing for the more general case
of an unknown parameter for which an asymptotic normal estimator is available. For the former case, covered in
Section 16.1, testing is based upon the exact sampling distribution of the sample mean estimator. For the latter case,
covered in Section 16.2, testing is based upon the asymptotic distribution of the estimator.
Before getting into the details of the tests, some additional notation and terminology is needed.
Definition 16.1 The null hypothesis is the hypothesis to be tested and is often denoted H0 . The alternative hypothesis
is the opposite of the null hypothesis and is often denoted H1 (though some other sources use the notation Ha ).
For a two-sided test of an unknown parameter θ, the null hypothesis is
H0 : θ = c,
for some known constant c specified by the researcher. The alternative hypothesis, which is the opposite of the null
hypothesis, is
H1 : θ ≠ c.
The alternative hypothesis is true whenever the null hypothesis is false, and vice versa.49 The hypothesis test of
H0 determines whether or not there is statistical evidence to reject H0 : θ = c. In Example 16.1, the null hypothesis is
H0 : θ = 0. Since θ is unknown, an estimate of θ that is far away from the hypothesized value c should provide statistical
evidence against H0 : θ = c, but what does “far away” mean? Due to the randomness and noise inherent in the estimate
of θ, the estimate’s standard error will help to quantify how “far away” the estimate of θ is from c.
For a one-sided test of an unknown parameter θ, the null hypothesis is either
H0 : θ ≥ c
or
H0 : θ ≤ c,
for some known constant c specified by the researcher. The direction of the inequality in the null hypothesis H0 depends
upon the situation. In Example 16.2, for example, the null hypothesis of interest is H0 : θ ≤ 10.
For the null hypothesis H0 : θ ≥ c, the alternative hypothesis is
H1 : θ < c,
and the hypothesis test of H0 determines whether or not there is statistical evidence to reject H0 : θ ≥ c. Statistical
evidence against H0 : θ ≥ c comes from an estimate of θ that is far below c.
For the null hypothesis H0 : θ ≤ c, the alternative hypothesis is
H1 : θ > c,
and the hypothesis test of H0 determines whether or not there is statistical evidence to reject H0 : θ ≤ c. Statistical
evidence against H0 : θ ≤ c comes from an estimate of θ that is far above c. Again, the notions of “far below” and “far
above” will be formalized in terms of an estimate’s standard error.
16.1 Finite-sample hypothesis testing: population mean of i.i.d. normal random variables
This section considers hypothesis tests for the population mean µ associated with normally distributed i.i.d. random
variables X1 , X2 , …, Xn ∼ N(µ, σ 2 ). Section 14.2 covered this case in detail and used the exact sampling distribution
results for the sample mean estimator X̄ from Section 12.1.2. Proposition 14.2 provided the key result to construct
confidence intervals for µ:
(X̄ – µ) / (sX/√n) ∼ tn–1,

where X̄ is the sample mean estimator, sX is the sample standard deviation estimator, and tn–1 is the t-distribution with n – 1 degrees of freedom. The standard deviation of the estimator X̄ is sX/√n. The ratio (X̄ – µ)/(sX/√n) is known as the t-ratio and indicates the number of standard deviations that the estimator X̄ is away from the parameter µ. The t-ratio is positive when X̄ is greater than µ and negative when X̄ is less than µ. Unfortunately, even after observing the realized sample mean x̄ and standard deviation sx, the t-ratio is unknown since the parameter µ is unknown.
t-test rejection rule (5% level): reject H0 : µ = c if the magnitude of the t-statistic, |x̄ – c|/(sx/√n), is greater than the critical value tn–1,0.025; otherwise, do not reject H0.

There are two possible conclusions from this t-test rejection rule. Either we “reject” the null hypothesis H0, which
occurs when the magnitude of the t-statistic is above the critical value, or we “do not reject” the null hypothesis H0 ,
which occurs when the magnitude of the t-statistic is below the critical value. A t-statistic with large magnitude
(greater than the critical value) provides evidence against H0 : µ = c and, therefore, it is said that H0 : µ = c is rejected.
On the other hand, a t-statistic with a small magnitude (less than the critical value) does not provide evidence against
H0 : µ = c; such a t-statistic is not surprising given the underlying tn–1 distribution that holds when H0 is true. For this
case, it is said that H0 : µ = c is not rejected. While some discussions of hypothesis testing use the term “accept H0” rather than “do not reject H0,” the use of the term “accept H0” is not advisable. After all, it is never possible to have strong evidence in favor of H0 : µ = c since there is always some uncertainty from the estimation of µ. The most that can be said, in the case that |x̄ – c|/(sx/√n) < tn–1,0.025, is that there is not sufficient statistical evidence against the null hypothesis H0. As such, using the phrase “do not reject H0” is appropriate. Even in cases where there is a strong prior belief that H0 is false, there might be a failure to reject H0 just because a small sample size leads to a large standard error sx/√n.
The magnitude of the t-statistic, |x̄ – c|/(sx/√n), can be thought of as a “statistical distance” from the estimate x̄ to the hypothesized value c. While |x̄ – c| gives the actual distance from the estimate x̄ to the hypothesized value c, there is no way to know if the actual distance |x̄ – c| is small or large due to the uncertainty associated with the estimate x̄. Dividing by the standard error sx/√n accounts for this uncertainty, so that the statistical distance |x̄ – c|/(sx/√n) is a distance in terms of the number of standard errors that x̄ is from c. For this statistical distance, unlike the actual distance, statistical theory tells us what types of values should be expected if the null hypothesis is true; specifically, as discussed above, the statistical distance given by the magnitude of the t-statistic, |x̄ – c|/(sx/√n), should be the absolute value of a realized draw from the tn–1 distribution if the null hypothesis H0 : µ = c is true.
The level of the test, a term introduced in the rejection rule above, is formally defined as follows:
Definition 16.2 The level or significance level of a hypothesis test, denoted by α, is the probability that the null
hypothesis H0 is rejected when the null hypothesis H0 is true. The level of a test is also called the type I error of the
test.
For the rejection rule above, the level of the hypothesis test is α = 5% or α = 0.05. Figure 16.1 provides a graphical
view of the rejection regions on the tn–1 distribution. If the magnitude of the t-statistic is larger than tn–1,0.025 , its value
falls into either the gray region for the left tail (if the t-statistic is negative) or the gray region for the right tail (if the
t-statistic is positive). If H0 is true, there is a 95% probability that the t-statistic falls in the middle region between the
two values –tn–1,0.025 and tn–1,0.025 . But, even if H0 is true, there is still a 5% probability that the rejection rule above says
to “reject H0 ” due to the t-statistic falling in the left or right tail; for such cases, the test is wrong about the rejection,
and the level of the test (α = 5% here) indicates the probability that the test rejects when H0 is true.
We can generalize the t-test and the rejection rule to levels other than α = 0.05. Letting α denote the level of the test, the relevant probability statement is that

P( |X̄ – c|/(sX/√n) < tn–1,α/2 ) = 1 – α when H0 : µ = c is true.

This probability statement leads to the following rejection rule for a t-test at the α level: reject H0 : µ = c if |x̄ – c|/(sx/√n) is greater than the critical value tn–1,α/2; otherwise, do not reject H0.
There is a tradeoff involved in choosing the level of the test. By its definition, the level indicates how likely it is
to reject H0 when H0 is actually true. Therefore, decreasing the level from 10% to 5% lowers the probability that
an incorrect rejection of H0 occurs. On the other hand, since the t-statistic is the same regardless of the level of the
test, decreasing the level from 10% to 5% leads to a lower chance that H0 is rejected (even if H0 is false). This lower
rejection rate occurs since the critical value increases from tn–1,0.05 to tn–1,0.025 when the level is changed from 10% to
5%: if the magnitude of the t-statistic is less than tn–1,0.05 , the test would not reject H0 at either level; if the magnitude
of the t-statistic is greater than tn–1,0.025 , the test would reject H0 at either level; and, if the magnitude of the t-statistic
is between tn–1,0.05 and tn–1,0.025 , the test would reject H0 at the 10% level but not the 5% level.
Figure 16.1
Rejection areas for the t-test at a 5% level
Example 16.3 (Food truck) In Example 14.3, the weekly profits X of a food truck were assumed to be normally
distributed with unknown mean µ and variance σ 2 , with each week’s profits being an i.i.d. draw. Weekly profits were
recorded for a total of six weeks, with the sample average of weekly profits equal to $1200 and the sample standard
deviation equal to $200. Suppose the food truck owner, prior to seeing the data, believed that the true average µ of
weekly profits was $1000. This belief corresponds to the null hypothesis H0 : µ = 1000. The t-statistic is (1200 – 1000)/(200/√6) ≈ 2.449. For a test at the 5% level, with α = 0.05, the appropriate critical value is tn–1,0.025 = t5,0.025 ≈ 2.571. Therefore, the t-test does not reject H0 : µ = 1000 at a 5% level since |2.449| < 2.571.
# calculate t-statistic
tstat <- (1200-1000)/(200/sqrt(6))
tstat
## [1] 2.44949
# critical value for t-test at 5% level
qt(0.975,5)
## [1] 2.570582
For a test at the 10% level (α = 0.10), the appropriate critical value is tn–1,0.05 = t5,0.05 ≈ 2.015. Therefore, the t-test
does reject H0 : µ = 1000 at a 10% level since |2.449| ≥ 2.015. For this null hypothesis, then, there is rejection at the
10% level but not the 5% level.
Example 14.3 calculated confidence intervals for µ based upon the sample information provided above. The two-sided 95% confidence interval for µ is

( x̄ – t6–1,0.025 · sx/√n , x̄ + t6–1,0.025 · sx/√n ) ≈ (990, 1410),

and the two-sided 90% confidence interval for µ is

( x̄ – t6–1,0.05 · sx/√n , x̄ + t6–1,0.05 · sx/√n ) ≈ (1035, 1365).
The hypothesized value of c = 1000 lies within the 95% confidence interval for µ but not within the 90% confidence interval for µ. With respect to the 95% confidence interval, the value of 1000 is a plausible value since it lies within that interval. Not coincidentally, this fact corresponds with the finding that H0 : µ = 1000 is not rejected at the 5% level. On the other hand, with respect to the 90% confidence interval, the value of 1000 does not appear likely since it lies outside the interval. Again, not coincidentally, this fact corresponds with the finding that H0 : µ = 1000 is rejected at the 10% level.
The last part of Example 16.3 highlights the connection between two-sided 1 – α confidence intervals for µ and the
rejection decision for tests at the α level. The following proposition formally states this relationship:
Proposition 16.1. For a t-test of the null hypothesis H0 : µ = c at the α level, the null hypothesis H0 is rejected if c lies
outside the two-sided 1 – α confidence interval for µ and is not rejected if c lies inside the two-sided 1 – α confidence
interval for µ.
To show this result, note that the two-sided 1 – α confidence interval for µ is

( x̄ – tn–1,α/2 · sx/√n , x̄ + tn–1,α/2 · sx/√n ),

so c is inside this interval if and only if

x̄ – tn–1,α/2 · sx/√n < c < x̄ + tn–1,α/2 · sx/√n.

The first inequality is equivalent to

(x̄ – c)/(sx/√n) < tn–1,α/2,

and the second inequality is equivalent to

(x̄ – c)/(sx/√n) > –tn–1,α/2.

Putting these two inequalities together yields

–tn–1,α/2 < (x̄ – c)/(sx/√n) < tn–1,α/2 or, equivalently, |x̄ – c|/(sx/√n) < tn–1,α/2,

which corresponds to the “do not reject” rule for testing H0 : µ = c at the α level. Therefore, c is within the two-sided 1 – α confidence interval if and only if H0 : µ = c is not rejected at the α level.
Therefore, an alternative and equivalent rejection rule for the t-test is the following: reject H0 : µ = c at the α level if c lies outside the two-sided 1 – α confidence interval for µ; otherwise, do not reject H0.
Definition 16.3 The p-value of a test of the null hypothesis H0 is the smallest level α∗ such that the test rejects H0 at
the level α∗ .
In the case of the two-sided t-test of H0 : µ = c, the p-value is

p-value = P(|T| > |t-statistic|) = P( |T| > |x̄ – c|/(sx/√n) ) when H0 is true, where T ∼ tn–1.
In words, if the null hypothesis H0 is true, the p-value is the probability of observing a t-statistic at least as large in
magnitude as the one actually observed.
Figure 16.2 provides a graphical depiction of the p-value. The value of the realized t-statistic is denoted by t-stat in
the figure, with |t-stat| on the right side of the graph and –|t-stat| on the left side of the graph. The p-value is equal to
the total area in the two tails, which includes the area to the left of –|t-stat| and the area to the right of |t-stat|, and is
represented by the gray shading.
As an example, if the t-statistic for H0 : µ = c is calculated to be –1.2 for a sample size of n = 15, R can be used to
find that P(T > 1.2) = P(T < –1.2) ≈ 0.125 for T ∼ t14 . Then, the associated p-value is P (|T| > 1.2) ≈ (2)(0.125) = 0.250,
meaning there is a 25% chance of seeing a t-statistic at least as large as 1.2 in magnitude if the null hypothesis H0 is
true. Due to the symmetry of the t-distribution, the p-value can be calculated as either two times the area of the left tail
or two times the area of the right tail.
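This calculation is easy to reproduce in R; a quick sketch for the numbers above:
# two-sided p-value for a t-statistic of -1.2 with n - 1 = 14 degrees of freedom
2 * pt(-1.2, df = 14)   # approximately 0.250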
For a t-test at the 5% level, the critical value tn–1,0.025 is the value that gives 5% total area in the tail to the right of tn–1,0.025 and the tail to the left of –tn–1,0.025. If the p-value for this t-test is less than 0.05 or 5%, it must be the case that the associated t-statistic t∗ is larger in magnitude than tn–1,0.025 so that the total area of the tails to the right of |t∗| and to the left of –|t∗| is less than 0.05. Similarly, if the p-value for this t-test is greater than 0.05 or 5%, it must be the case that the associated t-statistic is smaller in magnitude than tn–1,0.025 so that the total area of the tails to the
Figure 16.2
p-value for a t-test
right of |t∗| and to the left of –|t∗| is greater than 0.05. This same idea holds for any level of the test, suggesting the following rejection rule for the t-test based upon the p-value: reject H0 at the α level if the p-value is less than α; otherwise, do not reject H0.
The advantage of reporting the p-value associated with the t-test is that it immediately indicates whether the null
hypothesis H0 would be rejected at any specified level. For example, if the p-value is 0.08, the null hypothesis H0 would
be rejected for any level α greater than 0.08, and the null hypothesis H0 would not be rejected for any level α less than
0.08. Larger t-statistic magnitudes are associated with lower p-values, meaning H0 is more likely to be rejected, and
smaller t-statistic magnitudes are associated with higher p-values, meaning H0 is less likely to be rejected.
Example 16.4 (Food truck) In Example 16.3, a t-statistic of 2.449 was calculated for the null hypothesis H0 : µ =
1000. Since the sample size is n = 6, the distribution of interest is the t5 distribution. For a random variable T ∼ t5 , R
calculates that the p-value is 0.058:
# calculate t-statistic
tstat <- (1200-1000)/(200/sqrt(6))
# two-sided p-value: twice the area in the right tail of the t5 distribution
2*(1-pt(tstat,5))   # approximately 0.058
The left tail, corresponding to P(T < –2.449), has an area of 0.029, and the right tail, corresponding to P(T > 2.449),
also has an area of 0.029, yielding the p-value of 0.058. Therefore, the null hypothesis H0 : µ = 1000 is rejected for any
level above 0.058 and not rejected for any level below 0.058. This finding is consistent with Example 16.3, where H0
was rejected at the 10% level but not at the 5% level.
Definition 16.4 For a specific alternative hypothesis, the power of a test is the probability that the test correctly
rejects the null hypothesis H0 . For a specific alternative hypothesis, the probability that the test does not reject the null
hypothesis, which is equal to one minus the power, is called the type II error of the test.
If the specific alternative hypothesis µ = d (for d ≠ c) is true, then

(X̄ – d) / (sX/√n) ∼ tn–1.
Since H0 is not true, the t-statistic (X̄ – c)/(sX/√n) is not distributed as a tn–1 random variable. Instead, the t-statistic can be written as

(X̄ – c)/(sX/√n) = (X̄ – d)/(sX/√n) + (d – c)/(sX/√n),

so that the t-statistic is a tn–1 random variable plus the term (d – c)/(sX/√n). If d is much larger than c, this term is a large positive number so that the observed t-statistic should look like a random draw from tn–1 plus a large positive number, which makes it more likely to be above the right-tail critical value for rejection. Similarly, if d is much smaller than c, this term is a large negative number so that the observed t-statistic should look like a random draw from tn–1 minus a large number, which makes it more likely to be below the left-tail critical value for rejection.
To help visualize the power of the two-sided t-test, Figure 16.3 shows the power of the test under two different
specific alternative hypotheses. In both graphs, the thin curve corresponds to the tn–1 distribution for the t-statistic
that would hold under the null hypothesis H0 , along with the critical values –tn–1,0.025 and tn–1,0.025 . The bold curves
represent the actual distributions for the t-statistic under two specific alternative hypotheses, with d being 2.5 standard
deviations above c in the top graph and 4 standard deviations above c in the bottom graph. The gray areas indicate the
rejection probabilities using the critical-value cutoff rejection rule. For both graphs, the rejection probability is clearly
much larger than 5%, which is the rejection rate under the null hypothesis. Since d is farther away from c in the bottom
Figure 16.3
Power of a t-test
graph, the power of the test is larger in the bottom graph, as indicated by the larger gray area. The type II errors are given by the white areas under the bold curves, and since the type II error is just one minus the power, the type II error is smaller in the bottom graph.
For any specific alternative µ = d, with d ≠ c, the power of the t-test is an increasing function of the magnitude of $\frac{d - c}{s_X/\sqrt{n}}$, so that the power depends upon the sample size and the value of d. For a given sample size n, the power of the test increases
when the specific alternative hypothesis has d farther away from c. Intuitively, this relationship makes sense since it is
easier to find evidence against H0 : µ = c when d is not close to c. For any specific alternative hypothesis, the power of
the test is increasing in the sample size n. Larger samples make it easier to reject the null hypothesis when it is false.
In fact, if the sample size n gets arbitrarily large (n → ∞), the power of the test becomes arbitrarily close to 100% if
the null hypothesis H0 is false.
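Although the text does not include code for power calculations, the power of a t-test is easy to approximate by simulation. The following is a minimal sketch for the food-truck setting of Example 16.4, under the illustrative assumption (not from the text) that weekly sales are normally distributed with true mean d = 1200 and standard deviation 200:
# simulate the power of the two-sided t-test at the 5% level
# (illustrative assumption: true mu = 1200 and sigma = 200)
set.seed(123)
n <- 6
tcrit <- qt(0.975, n-1)
reject <- replicate(100000, {
  x <- rnorm(n, mean = 1200, sd = 200)
  abs((mean(x)-1000)/(sd(x)/sqrt(n))) >= tcrit
})
mean(reject)   # proportion of samples in which H0: mu = 1000 is rejected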
For the one-sided test of H0 : µ ≥ c, evidence against the null hypothesis corresponds to a t-statistic that is very negative. To have a test at the α level, the area under the tn–1 distribution to the left of the critical value must be equal to α. The corresponding rejection rule is:

Rejection rule for the one-sided t-test of H0 : µ ≥ c (test at the α level):
• Reject H0 : µ ≥ c at the α level if t-statistic ≤ –tn–1,α .
• Do not reject H0 : µ ≥ c at the α level if t-statistic > –tn–1,α .
Figure 16.4 shows the rejection areas for both types of one-sided tests for a 5% level. The top graph shows the
rejection area for testing H0 : µ ≥ c, with the gray region corresponding to values less than –tn–1,0.05 and having an area
of 5%. The bottom graph shows the rejection area for testing H0 : µ ≤ c, with the gray region corresponding to values
greater than tn–1,0.05 and having an area of 5%.
Example 16.5 (Investment opportunity) Continuing Example 16.2, let’s make the additional assumption that the 10
observed weekly sales figures are i.i.d. draws from a normal distribution. Knowing that the sample mean of weekly
sales is $11,200 (x̄ = 11.2) and the sample standard deviation of weekly sales is $3,400 (sx = 3.4), would the null
hypothesis H0 : θ ≤ 10, which corresponds to the business not being a worthwhile investment, be rejected at the 5%
level? The t-statistic is
$$\frac{\bar{x} - c}{s_x/\sqrt{n}} = \frac{11.2 - 10}{3.4/\sqrt{10}} \approx 1.116.$$
Since tn–1,0.05 = t9,0.05 ≈ 1.833, the null hypothesis H0 : θ ≤ 10 is not rejected since the t-statistic 1.116 is less than the
critical value 1.833. How about a one-sided test at the 10% level? The critical value is t9,0.10 ≈ 1.383, so that the null
hypothesis is still not rejected at the 10% level since 1.116 < 1.383. Here is the R code for the necessary calculations:
Figure 16.4
Rejection areas for one-sided t-tests at a 5% level
# calculate t-statistic
tstat <- (11.2-10)/(3.4/sqrt(10))
tstat
## [1] 1.116098
# critical value for test at 5% level
qt(0.95,9)
## [1] 1.833113
# critical value for test at 10% level
qt(0.90,9)
## [1] 1.383029
The concept of p-values can be extended to one-sided hypothesis testing. Again, the difference from two-sided
testing is that only one of the tails is considered, with the left tail used for H0 : µ ≥ c and the right tail used for
H0 : µ ≤ c. For the one-sided t-test of H0 : µ ≥ c, the p-value is
$$\text{p-value} = P(T < \text{t-statistic}) = P\!\left(T < \frac{\bar{x} - c}{s_x/\sqrt{n}}\right) \text{ when } H_0 \text{ is true, where } T \sim t_{n-1}.$$
For the one-sided t-test of H0 : µ ≤ c, the p-value is
$$\text{p-value} = P(T > \text{t-statistic}) = P\!\left(T > \frac{\bar{x} - c}{s_x/\sqrt{n}}\right) \text{ when } H_0 \text{ is true, where } T \sim t_{n-1}.$$
As with a two-sided test, the p-value can be used to determine whether a test of a one-sided null hypothesis should
be rejected. The null hypothesis H0 is rejected if the p-value is less than the level α of the test and not rejected if the
p-value is greater than the level α of the test.
Example 16.6 (Investment opportunity) Continuing Example 16.5, the p-value associated with the t-statistic 1.116 is
the area to the right of 1.116 under the t9 distribution.
Since P(T > 1.116) ≈ 0.147 for a random variable T ∼ t9 , the p-value for the one-sided test of H0 : µ ≤ 10 is 0.147.
Thus, a one-sided t-test does not reject H0 : µ ≤ 10 for any level below 14.7%. Even if µ = 10, this p-value tells us that
there would be a 14.7% probability of seeing a t-statistic at least as large as the one observed (1.116).
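As a check, this one-sided p-value can be computed directly in R using the t-statistic from Example 16.5:
# one-sided p-value: area to the right of the t-statistic under t9
tstat <- (11.2-10)/(3.4/sqrt(10))
1-pt(tstat,9)   # approximately 0.147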
Finally, the connection between one-sided 1 – α confidence intervals and one-sided tests at the α level can be
established. Recall that the one-sided 1 – α confidence intervals for µ are
$$\left(\bar{x} - t_{n-1,\alpha}\frac{s_x}{\sqrt{n}},\ \infty\right) \quad \text{and} \quad \left(-\infty,\ \bar{x} + t_{n-1,\alpha}\frac{s_x}{\sqrt{n}}\right).$$
Proposition 16.2. For a t-test of the null hypothesis H0 : µ ≥ c at the α level, the null hypothesis H0 is rejected if c lies outside the one-sided 1 – α confidence interval $\left(-\infty,\ \bar{x} + t_{n-1,\alpha}\frac{s_x}{\sqrt{n}}\right)$ for µ and is not rejected if c lies inside that interval. For a t-test of the null hypothesis H0 : µ ≤ c at the α level, the null hypothesis H0 is rejected if c lies outside the one-sided 1 – α confidence interval $\left(\bar{x} - t_{n-1,\alpha}\frac{s_x}{\sqrt{n}},\ \infty\right)$ for µ and is not rejected if c lies inside that interval.
Example 16.7 (Investment opportunity) Re-visiting Example 16.5, the same conclusions for the one-sided tests at the
5% and 10% levels can be obtained by using one-sided confidence intervals and Proposition 16.2. The one-sided 95%
confidence interval for µ is
$$\left(\bar{x} - t_{n-1,\alpha}\frac{s_x}{\sqrt{n}},\ \infty\right) = \left(11.2 - t_{9,0.05}\,\frac{3.4}{\sqrt{10}},\ \infty\right) \approx (9.23,\ \infty),$$
meaning H0 : µ ≤ 10 is not rejected at a 5% level since the value 10 is within this confidence interval. Similarly, the one-sided 90% confidence interval for µ is
$$\left(\bar{x} - t_{n-1,\alpha}\frac{s_x}{\sqrt{n}},\ \infty\right) = \left(11.2 - t_{9,0.10}\,\frac{3.4}{\sqrt{10}},\ \infty\right) \approx (9.71,\ \infty),$$
meaning H0 : µ ≤ 10 is not rejected at a 10% level since the value 10 is within this confidence interval. But for the one-sided 85% confidence interval for µ, which is
$$\left(\bar{x} - t_{n-1,\alpha}\frac{s_x}{\sqrt{n}},\ \infty\right) = \left(11.2 - t_{9,0.15}\,\frac{3.4}{\sqrt{10}},\ \infty\right) = \left(11.2 - 1.0997\cdot\frac{3.4}{\sqrt{10}},\ \infty\right) \approx (10.02,\ \infty),$$
the value 10 is not within the interval. Thus, H0 : µ ≤ 10 is rejected at a 15% level, which agrees with the p-value of 0.147 found in Example 16.6.
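The lower bounds of these one-sided confidence intervals can be verified with a few lines of R:
# lower bounds of the one-sided 95%, 90%, and 85% confidence intervals
11.2 - qt(0.95,9)*3.4/sqrt(10)   # approximately 9.23
11.2 - qt(0.90,9)*3.4/sqrt(10)   # approximately 9.71
11.2 - qt(0.85,9)*3.4/sqrt(10)   # approximately 10.02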
We now consider an unknown parameter θ that can be estimated by an asymptotically normal estimator. In the same way that Section 14.4 provided asymptotic confidence intervals for this general case, this section considers hypothesis tests related to θ for a large sample.
We fix ideas by considering an estimator θ̂X and estimate θ̂x based upon a univariate random variable and data,
respectively, but the results generalize to other estimators and estimates, including those based upon multivariate data
and multiple samples. Following the treatment in Section 14.4, consider an asymptotically normal estimator θ̂X with
asymptotic distribution
$$\hat\theta_X \overset{a}{\sim} N\!\left(\theta, \frac{V}{n}\right) \quad \text{or, equivalently,} \quad \frac{\hat\theta_X - \theta}{\sqrt{V/n}} \overset{a}{\sim} N(0, 1),$$
where θ is the parameter (estimand) of interest and the associated asymptotic standard deviation is $\sqrt{V/n}$. For the estimate θ̂x , based upon the observed sample, the z-ratio is defined as
$$\text{z-ratio} = \frac{\hat\theta_x - \theta}{se(\hat\theta_x)},$$
where se(θ̂x ) is the standard error associated with the estimate θ̂x . For a large sample, the z-ratio is a realized draw from the standard normal distribution N(0, 1) since se(θ̂x ) is a consistent estimator of the asymptotic standard deviation $\sqrt{V/n}$. However, since θ is an unknown parameter, the z-ratio cannot be calculated.
Following the same approach as Section 16.1, consider the thought experiment in which the null hypothesis H0 : θ = c
is assumed to be true for some constant c. Define the z-statistic as
$$\text{z-statistic} = \frac{\hat\theta_x - c}{se(\hat\theta_x)},$$
which can be calculated since c is known. The z-statistic is the number of standard errors that the estimate θ̂x is away
from c, with the z-statistic being positive if the estimate θ̂x is above c and negative if the estimate θ̂x is below c. If the
null hypothesis H0 : θ = c is true, the z-statistic should be a realized draw from the standard normal distribution N(0, 1).
Before observing the data, there is approximately a 95% probability that the realized z-statistic is between –1.96 and
1.96 if H0 : θ = c is true or, more generally, a 1 – α probability that the realized z-statistic is between –zα/2 and zα/2 if
H0 : θ = c is true. Thus, the rejection rule for the two-sided test of H0 : θ = c based upon the z-statistic is as follows:

Rejection rule for the two-sided z-test (test at the α level):
• Reject H0 : θ = c at the α level if |z-statistic| ≥ zα/2 .
• Do not reject H0 : θ = c at the α level if |z-statistic| < zα/2 .
The resulting test is known as a z-test, as it uses critical values from the normal distribution, in contrast to the t-test
of Section 16.1 that uses critical values from the t-distribution. Unlike the t-test, the critical values for the z-test do not depend upon the sample size n. The magnitude of the z-statistic, $\left|\frac{\hat\theta_x - c}{se(\hat\theta_x)}\right|$, can be viewed as a statistical distance,
measuring the number of standard errors that the estimate θ̂x is from the hypothesized value c. By dividing by the
standard error se(θ̂x ), this statistical distance takes into account the uncertainty associated with the estimate θ̂x and,
based upon statistical theory, should be the absolute value of a realized draw from the N(0, 1) distribution if the null
hypothesis H0 : θ = c is true.
Example 16.8 (Widget website) For the e-mail experiment summarized in Example 16.1, suppose the e-mail marketing
director at widgets.com believes that the true purchase probability for e-mail A recipients is 25%. The null
hypothesis for this pre-experiment belief is
H0 : πA = 0.25.
Given that 60 out of 300 e-mail A recipients actually made a purchase, what would a z-test of this null hypothesis conclude? The unknown parameter here is θ = πA , and the hypothesized value is c = 0.25. The estimate of πA is θ̂x = x̄ = pA = 60/300 = 0.20. Then, since the standard error of x̄ is 0.0231, as calculated in Example 14.5, the z-statistic is
$$\frac{\hat\theta_x - c}{se(\hat\theta_x)} = \frac{0.20 - 0.25}{0.0231} \approx -2.16.$$
This z-statistic says that the estimated probability of 0.20 is 2.16 standard errors below the hypothesized true
probability of 0.25. For a test at the 5% level, the null hypothesis H0 : πA = 0.25 is rejected since |z-statistic| = |–2.16| = 2.16 ≥ 1.96 = z0.025 . For a test at the 1% level, the null hypothesis H0 is not rejected since |z-statistic| = 2.16 < 2.576 = z0.005 .
# calculate z-statistic
zstat <- (pa-0.25)/se_pa
zstat
## [1] -2.165064
# critical value for test at 5% level
qnorm(0.975)
## [1] 1.959964
# critical value for test at 1% level
qnorm(0.995)
## [1] 2.575829
How about testing whether e-mail campaign A and e-mail campaign B are equally effective? As discussed in
Example 16.1, the null hypothesis of interest is
H0 : πA = πB or, equivalently, H0 : πA – πB = 0.
This null hypothesis fits into the framework developed above for the unknown parameter θ = πA – πB , and the
alternative hypothesis is
H1 : πA ≠ πB or, equivalently, H1 : πA – πB ≠ 0.
The estimate of θ = πA – πB is θ̂x = pA – pB = 60/300 – 66/300 = 0.20 – 0.22 = –0.02. From Example 14.15, the standard error of this estimate is se(pA – pB ) = 0.0332. Then, the z-statistic for testing H0 : πA – πB = 0 is
$$\frac{(p_A - p_B) - 0}{se(p_A - p_B)} = \frac{-0.02 - 0}{0.0332} \approx -0.60.$$
Therefore, the null hypothesis H0 : πA – πB = 0 is not rejected at the 5% level since | – 0.60| < 1.96. The null hypothesis
is also not rejected at the 10% level since | – 0.60| < 1.645. As a result, there does not appear to be strong evidence
that, based upon the sample of e-mail A recipients and e-mail B recipients observed, there is a statistically significant
difference between πA and πB .
Using the same approach, we can test whether there are differences between (i) the purchase probability of e-mail A
recipients and non-recipients and (ii) the purchase probability of e-mail B recipients and non-recipients. For (i), the
null hypothesis is
H0 : πA – πC = 0,
which has z-statistic
$$\frac{(p_A - p_C) - 0}{se(p_A - p_C)} = \frac{(0.20 - 0.15) - 0}{0.0242} \approx 2.07.$$
The null hypothesis H0 : πA – πC = 0 is rejected at the 5% level since |2.07| ≥ 1.96, providing evidence of a statistically
significant difference between πA and πC . For (ii), the null hypothesis is
H0 : πB – πC = 0,
which has z-statistic
$$\frac{(p_B - p_C) - 0}{se(p_B - p_C)} = \frac{(0.22 - 0.15) - 0}{0.0250} \approx 2.80.$$
The null hypothesis H0 : πB – πC = 0 is rejected at the 5% level since |2.80| ≥ 1.96, providing evidence of a statistically significant difference between πB and πC . Taking these hypothesis tests together, there appears to be evidence that each of the e-mail recipient groups has a greater purchase probability than the non-recipient group; on the other hand, there is no strong evidence of a difference between the purchase probabilities of the e-mail A recipients and the e-mail B recipients.
Example 16.9 (Education and earnings) For the cps dataset, the correlation between weekly earnings (y = earnwk)
and education (x = educ) among the n = 2809 employed individuals is rxy = 0.325. From Example 14.10, the standard
error of rxy , as an estimate of the population correlation ρXY , is equal to 0.0169. To test whether there is any correlation
between weekly earnings and education in the population, the appropriate null hypothesis is
H0 : ρXY = 0.
The z-statistic for testing H0 : ρXY = 0 is
$$\frac{\hat\theta_x - c}{se(\hat\theta_x)} = \frac{r_{xy} - c}{se(r_{xy})} = \frac{0.325 - 0}{0.0169} \approx 19.23,$$
indicating that the sample correlation is 19.23 standard errors above 0, which is a lot! Thus, the null hypothesis
H0 : ρXY = 0 is rejected at the 5% level since |19.23| ≥ 1.96 and the 1% level since |19.23| ≥ 2.576. These rejections
are certainly not borderline rejections since the z-statistic is so large in magnitude. Testing whether the population
correlation is equal to zero is often of interest in other settings, as it provides an easy way to statistically support the
idea that two variables are related by rejecting that they are unrelated. To perform such a test using asymptotic theory
and a z-statistic, we only need the sample correlation and its standard error.50
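A quick sketch of this calculation in R, using the estimate and standard error reported above:
# z-statistic for H0: rho_XY = 0 and its two-sided p-value
zstat <- (0.325-0)/0.0169
zstat                     # approximately 19.23
2*(1-pnorm(abs(zstat)))   # numerically zero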
Example 16.10 (Exam score data) Example 14.13 considered the dataset exams, which contains scores on two
different 100-point exams for 77 students. The two exam scores are given by the variables exam1 and exam2. To
test whether the exams are equally difficult, a null hypothesis of interest is whether their true averages are the same:
H0 : µexam1 = µexam2 or, equivalently, H0 : µexam1 – µexam2 = 0.
The unknown parameter is θ = µexam1 – µexam2 . From Example 14.13, the estimated score difference is
$$\overline{\text{exam1}} - \overline{\text{exam2}} = 6.42,$$
with a standard error of
$$se(\overline{\text{exam1}} - \overline{\text{exam2}}) = 1.23.$$
The z-statistic associated with H0 : µexam1 – µexam2 = 0 is
$$\frac{6.42 - 0}{1.23} \approx 5.22,$$
which provides evidence of a statistically significant difference between the population average of exam1 and the
population average of exam2. The null hypothesis is rejected at a 5% level since |5.22| ≥ 1.96 and also at a 1% level
since |5.22| ≥ 2.576.
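A sketch of the corresponding calculation in R, using the reported estimate and standard error:
# z-statistic for H0: mu_exam1 - mu_exam2 = 0 and its two-sided p-value
zstat <- (6.42-0)/1.23
zstat                     # approximately 5.22
2*(1-pnorm(abs(zstat)))   # essentially zero (roughly 2e-07)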
Similar to the t-test of Section 16.1, the z-test can also be conducted through the use of confidence intervals. The rejection rule for the z-test based upon the two-sided confidence interval is as follows:

Rejection rule for the two-sided z-test based on the confidence interval (test at the α level):
• Reject H0 : θ = c at the α level if c lies outside the two-sided 1 – α confidence interval for θ.
• Do not reject H0 : θ = c at the α level if c lies inside the two-sided 1 – α confidence interval for θ.
Example 16.11 (Widget website) For the z-test of the null hypothesis H0 : πA = 0.25 in Example 16.8, the estimated purchase probability is θ̂x = x̄ = pA = 60/300 = 0.20, with the standard error of x̄ equal to 0.0231. As shown in Example 14.5, a 95% confidence interval for πA is
confidence interval for πA is
(0.20 – (1.96)(0.0231), 0.20 + (1.96)(0.0231)) ≈ (0.155, 0.245).
Since 0.25 does not fall within this 95% confidence interval, H0 : πA = 0.25 is rejected at the 5% level, the same
conclusion as reached in Example 16.8. On the other hand, a 99% confidence interval for πA is
(0.20 – (2.576)(0.0231), 0.20 + (2.576)(0.0231)) ≈ (0.140, 0.260),
meaning H0 : πA = 0.25 is not rejected at the 1% level since 0.25 is within the interval.
In the case of the two-sided z-test of H0 : θ = c, the p-value is
$$\text{p-value} = P(|Z| > |\text{z-statistic}|) = P\!\left(|Z| > \left|\frac{\hat\theta_x - c}{se(\hat\theta_x)}\right|\right) \text{ when } H_0 \text{ is true, where } Z \sim N(0, 1).$$
If the null hypothesis H0 is true, the p-value is the probability of observing a z-statistic at least as large in magnitude
as the one actually observed. Graphically, as shown in Figure 16.5, the p-value is equal to the total area in the two
tails, adding the area in the tail to the left of –|z-stat| and the area in tail to the right of |z-stat|. This figure is identical to
Figure 16.2, except that the distribution in Figure 16.5 is the N(0, 1) distribution while the distribution in Figure 16.2
is the tn–1 distribution.
For the z-test, like the t-test of Section 16.1, knowing the p-value tells us whether the null hypothesis H0 : θ = c is
rejected at any level α. When the p-value is less than α, the z-statistic must be in either the left tail (less than –zα/2 ) or
in the right tail (greater than zα/2 ), indicating rejection at the level α. On the other hand, when the p-value is greater
than α, the z-statistic is between –zα/2 and zα/2 , indicating a lack of rejection at the level α.
Example 16.12 (Widget website) For the z-test of the null hypothesis H0 : πA = 0.25 in Example 16.8, the calculated
z-statistic is
$$\frac{\hat\theta_x - c}{se(\hat\theta_x)} = \frac{0.20 - 0.25}{0.0231} \approx -2.16.$$
The associated p-value, for Z ∼ N(0, 1), is
p-value = P(|Z| > |z-statistic|) = P (|Z| > 2.16) ≈ 0.030.
Figure 16.5
p-value for a z-test
# calculate z-statistic
zstat <- (pa-0.25)/se_pa
# calculate two-sided p-value
2*(1-pnorm(abs(zstat)))
## [1] 0.03038282
The null hypothesis H0 : πA = 0.25 is rejected at any level above 3.0% and not rejected at any level below 3.0%,
which agrees with the conclusion in Examples 16.8 and 16.11 that H0 is rejected at a 5% level but not a 1% level.
Similarly, we can calculate p-values associated with the z-tests of the three null hypotheses H0 : πA – πB = 0, H0 :
πA – πC = 0, and H0 : πB – πC = 0 considered in Example 16.8.
# calculate z-statistics
zstat_abdiff <- (pa-pb)/sqrt(se_pa^2 + se_pb^2)
zstat_acdiff <- (pa-pc)/sqrt(se_pa^2 + se_pc^2)
zstat_bcdiff <- (pb-pc)/sqrt(se_pb^2 + se_pc^2)
zstat_abdiff
## [1] -0.6015661
zstat_acdiff
## [1] 2.064674
zstat_bcdiff
## [1] 2.79972
# calculate the p-values for the z-tests
2*(1-pnorm(abs(zstat_abdiff)))
## [1] 0.547463
2*(1-pnorm(abs(zstat_acdiff)))
## [1] 0.03895389
2*(1-pnorm(abs(zstat_bcdiff)))
## [1] 0.005114694
The following table summarizes the z-statistics and p-values for the three null hypotheses:
Null hypothesis z-statistic p-value
H0 : πA – πB = 0 –0.60 0.547
H0 : πA – πC = 0 2.06 0.039
H0 : πB – πC = 0 2.80 0.005
Having the p-values makes it easy to see that the second and third null hypotheses are rejected at a 5% level and
the first hypothesis is not. The strongest evidence of a difference is between πB and πC , with an associated p-value of
0.005, for which H0 : πB – πC = 0 is rejected at any level above 0.5%. The weakest evidence of a difference is between πA and πB , with the large p-value of 0.547 indicating that H0 : πA – πB = 0 is not rejected at any level below 54.7%.
The rejection areas for one-sided z-tests are similar to those depicted in Figure 16.4 for one-sided t-tests. For
instance, for a test at the 5% level, the 5% rejection area for testing H0 : θ ≥ c corresponds to all z-statistic values
less than –z0.05 , whereas the 5% rejection area for testing H0 : θ ≤ c corresponds to all z-statistic values greater than
z0.05 .
Example 16.13 (Betting strategy) An experienced sports gambler is convinced that they have a strategy that is
profitable at a certain casino. They have tested their strategy betting on 120 games, with a 55% success rate (66
winning bets on the 120 total games). Due to fees that the casino charges, the gambler needs a success rate π of at
least 52% to be profitable over the long run. The gambler is therefore interested in testing, and hopes to be able to
reject, the one-sided null hypothesis
H0 : π ≤ 0.52,
which corresponds to unprofitable π values. The alternative hypothesis H1 : π > 0.52 corresponds to profitable π values.
The observed success rate is consistent with the alternative hypothesis H1 being true, but the gambler is concerned
that the high realized success rate may have arisen due to chance. The estimate of π is 0.55, and the only additional
information needed for the z-test is the standard error of this estimate. Under the assumption that the success of each bet is an i.i.d. Bernoulli(π) draw, the standard error is $\sqrt{\frac{(0.55)(0.45)}{120}} \approx 0.0454$. The z-statistic is $\frac{0.55 - 0.52}{0.0454} \approx 0.661$, meaning the null hypothesis H0 : π ≤ 0.52 is not rejected at a 5% level since 0.661 < 1.645 = z0.05 . The gambler is right to be concerned. What if the 55% success rate had occurred for a much larger set of games, say 360 games instead of 120 games? In that case, the standard error would be $\sqrt{\frac{(0.55)(0.45)}{360}} \approx 0.0262$, yielding a z-statistic of $\frac{0.55 - 0.52}{0.0262} \approx 1.144$. Even with this larger sample, H0 : π ≤ 0.52 is not rejected at a 5% level since 1.144 < 1.645 = z0.05 .
Similar to one-sided t-tests, the concept of p-values can be extended to one-sided z-tests. For the one-sided z-test of
H0 : θ ≥ c, the p-value is
$$\text{p-value} = P(Z < \text{z-statistic}) = P\!\left(Z < \frac{\hat\theta_x - c}{se(\hat\theta_x)}\right) \text{ when } H_0 \text{ is true, where } Z \sim N(0, 1).$$
For the one-sided z-test of H0 : θ ≤ c, the p-value is
$$\text{p-value} = P(Z > \text{z-statistic}) = P\!\left(Z > \frac{\hat\theta_x - c}{se(\hat\theta_x)}\right) \text{ when } H_0 \text{ is true, where } Z \sim N(0, 1).$$
# calculate z-statistic
zstat <- (0.55-0.52)/sqrt(0.55*(1-0.55)/120)
zstat
## [1] 0.6605783
# calculate one-sided p-value
1-pnorm(zstat)
## [1] 0.2544414
The null hypothesis H0 : π ≤ 0.52 is not rejected at any level below 25.4%. Example 16.13 also considered how
things would change with a larger set of games (360) and the same success rate (55%), in which case the z-statistic
increases to 1.144 due to the lower standard error. The p-value would be P(Z > 1.14) ≈ 0.126, considerably lower than
0.254 but still implying that H0 : π ≤ 0.52 is not rejected at any level below 12.6%.
# calculate z-statistic
zstat <- (0.55-0.52)/sqrt(0.55*(1-0.55)/360)
zstat
## [1] 1.144155
# calculate one-sided p-value
1-pnorm(zstat)
## [1] 0.1262797
of the exam scores, which are $s_{exam1} = 14.19$ and $s_{exam2} = 14.15$. The estimated difference of the average scores is approximately 45% of the standard deviation on either exam, which is a practically important magnitude.
It is especially important to think about practical significance when sample sizes are large. A t-statistic or z-statistic has the standard error in its denominator, and since the standard error is proportional to $1/\sqrt{n}$, the test statistic becomes arbitrarily large in magnitude as n increases for any fixed value of its numerator. As a result, even if the estimate in the numerator is very small in magnitude, with little or no practical significance, it is possible to find that the estimate is statistically significant with a very low p-value. In the exam example, suppose the estimated difference in averages is 0.52 rather than 6.42 and the sample standard deviations are unchanged. With an extremely large sample size, the standard error would eventually be small enough that the null hypothesis of equal exam-score averages would be rejected at a 5% level, making the estimated difference statistically significant. The estimated difference would not,
however, be very practically significant since it represents only about 3.7% of the standard deviation on either exam.
The idea here is that with a very large sample, we can precisely estimate parameters whose true values are close
to zero from a practical point of view. The precision of the estimate can lead to a low p-value (when testing against
zero) and, thus, statistical significance even though the magnitude of the estimate is not practically significant. The
following example illustrates this issue.
Example 16.15 (Fundraising campaign) A non-profit organization would like to increase the average level of giving
among its 100,000 past donors, and a consultant has suggested a physical mail campaign to augment their usual e-
mail outreach efforts. The non-profit randomly selects 50,000 of its past donors as the “treatment” group to receive the
physical mailing, and the other 50,000 past donors serve as the “control” group that receives only the e-mail outreach.
The average donation received from the treatment donors is $32.20, with a sample standard deviation of $28.90, and
the average donation received from the control donors is $31.80, with a sample standard deviation of $29.10. If µT
and µC denote the population means of donations for the treatment and control subpopulations, respectively, a test of
the null hypothesis
H0 : µT = µC or H0 : µT – µC = 0
has a z-statistic equal to
$$\frac{32.20 - 31.80}{\sqrt{\frac{28.90^2}{50000} + \frac{29.10^2}{50000}}} \approx 2.18$$
and a p-value of approximately 0.029. Therefore, the estimated difference in average donations ($0.40 or 40 cents)
is statistically significant at a 5% level. But the 40-cent differential is likely not practically significant for the
organization, as 40 cents is just over 1% of the average donation level and might also be offset or negated by the
costs of the mail campaign.
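A sketch reproducing the calculation in R:
# z-statistic and p-value for the difference in average donations
zstat <- (32.20-31.80)/sqrt(28.90^2/50000 + 29.10^2/50000)
zstat                     # approximately 2.18
2*(1-pnorm(abs(zstat)))   # approximately 0.029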
and
H0 : πB = πC versus H1 : πB ≠ πC .
The p-values associated with these three z-tests were 0.547, 0.039, and 0.005, respectively. What if instead we wanted
to simultaneously test the equality of all three purchase probabilities, πA = πB = πC ? In order for πA = πB = πC to be
true, it must be the case that both πA = πB and πB = πC are true. Therefore, the null hypothesis can be written
H0 : πA = πB , πB = πC ,
where the convention is to read the comma in H0 as “and.” The null hypothesis H0 is false when either πA ≠ πB or πB ≠ πC , giving the alternative hypothesis
H1 : πA ≠ πB or πB ≠ πC .
For H0 to be false and H1 to be true, it is enough to have either πA ≠ πB or πB ≠ πC . For example, if πA ≠ πB and
πB = πC , the null hypothesis is false. A few remarks about the formulation of the null hypothesis (H0 : πA = πB , πB = πC )
are necessary. First, it is unnecessary to also include πA = πC in the statement of H0 since that equality is implied
by the other two. Inclusion of πA = πC is redundant. Second, the choice of the two equalities in the statement of H0
doesn’t matter, as long as all three purchase probabilities are involved. That is, it is equally appropriate to specify
the null hypothesis as H0 : πA = πC , πB = πC or H0 : πA = πB , πA = πC . For any of these equivalent statements of the null
hypothesis, the conclusion of the Wald test described below will be the same.52
To formalize the test of multiple hypotheses, suppose Q hypotheses are to be tested simultaneously. The notation
θ1 , θ2 , …, θQ denotes the unknown parameters to be tested against hypothesized values c1 , c2 , …, cQ , respectively. To
keep things general, each of the θj parameters may itself be a linear function of multiple unknown parameters, an idea
illustrated in the examples considered below. The null hypothesis of interest is
H0 : θ1 = c1 , θ2 = c2 , …, θQ = cQ .
This null hypothesis is said to consist of Q linear restrictions since each θj may be a linear function of parameters.
The associated test of the null hypothesis H0 tests whether all of the Q linear restrictions are true. The alternative
hypothesis, which is true when one or more of the Q linear restrictions are false, is
H1 : θ1 ≠ c1 or θ2 ≠ c2 or … or θQ ≠ cQ .
Example 16.17 (Widget website) For the null hypothesis
H0 : πA = πB , πB = πC
discussed in Example 16.16, there are two linear restrictions (Q = 2). Using the notation developed above, this null
hypothesis can be written
H0 : θ1 = 0, θ2 = 0,
where θ1 = πA – πB , θ2 = πB – πC , and c1 = c2 = 0. The alternative hypothesis
H1 : θ1 ≠ 0 or θ2 ≠ 0
is true when πA ≠ πB or πB ≠ πC .
Example 16.18 (Asset correlation) Suppose an investor’s portfolio currently consists of three assets A, B, and C.
The investor is considering whether to add an asset D to the portfolio but only wants to do so if there is no evidence
that asset D’s daily returns have correlation with the daily returns of the three other assets. Let ρAD denote the true
correlation of the daily returns of asset A and asset D, and similarly for ρBD and ρCD . The null hypothesis is
H0 : ρAD = 0, ρBD = 0, ρCD = 0,
which has three linear restrictions (Q = 3), with θ1 = ρAD , θ2 = ρBD , θ3 = ρCD , and c1 = c2 = c3 = 0. The alternative
hypothesis is
H1 : ρAD ≠ 0 or ρBD ≠ 0 or ρCD ≠ 0.
If the investor uses historical data to estimate the three correlations (ρAD , ρBD , ρCD ) and test the null hypothesis H0 , a
rejection of H0 would provide statistical evidence that at least one of the correlations is non-zero, in which case the
investor would not want to add asset D to the portfolio.
Example 16.19 (Earnings and marital status) The labor-force data in cps contains a categorical variable marstatus
indicating marital status, with four possible values: “Married,” “Divorced,” “Widowed,” and “Never married.”
Suppose we want to test whether there is any relationship between average weekly earnings (earnwk) and marital
status. Put another way, are average weekly earnings (earnwk) the same or different for the four groups of workers
as delineated by marital status? For notation, use M for married workers, D for divorced workers, W for widowed
workers, and N for never-married workers, and let µM , µD , µW , µN denote the population mean of weekly earnings for
each of the four subpopulations of employed individuals. Since we want to test µM = µD = µW = µN , the null hypothesis
can be written
H0 : µM = µD , µD = µW , µW = µN ,
which has Q = 3 linear restrictions, with θ1 = µM – µD , θ2 = µD – µW , θ3 = µW – µN , and c1 = c2 = c3 = 0. Again, as in
Example 16.16, there are equivalent ways of writing the null hypothesis (e.g., H0 : µM = µD , µM = µW , µM = µN ). The
alternative hypothesis is
H1 : µM ≠ µD or µD ≠ µW or µW ≠ µN .
If the null hypothesis H0 is tested and rejected, there is evidence of a statistically significant difference in average
weekly earnings between at least two of the four subpopulations of employed individuals.
To conduct a Wald test of a null hypothesis with multiple linear restrictions, we require $\sqrt{n}$-consistent and asymptotically normal estimators of the Q parameters θ1 , θ2 , …, θQ . Let the realized estimates of those parameters
be denoted as θ̂1 , θ̂2 , …, θ̂Q . Intuitively, when these estimates are “close to” the hypothesized values c1 , c2 , …, cQ ,
respectively, we are in a situation that is consistent with the null hypothesis H0 being true. On the other hand, when
one or more of the estimates is “far from” the hypothesized values c1 , c2 , …, cQ , respectively, we are in a situation that
is consistent with the alternative hypothesis H1 being true. The z-test discussed in Section 16.2 can be used to test
whether any individual θ̂j is close to an individual cj , but it is only useful for testing a single linear restriction rather
than multiple linear restrictions simultaneously.
To provide more intuition, consider the simplest case of Q = 2, where the null hypothesis is
H0 : θ1 = c1 , θ2 = c2 ,
and the alternative hypothesis is
H1 : θ1 ≠ c1 or θ2 ≠ c2 .
The estimates of θ1 and θ2 are θ̂1 and θ̂2 , respectively. The differences θ̂1 – c1 and θ̂2 – c2 form the basis of the Wald test, and the conclusion of the test is based upon how far each of these differences is from zero. If θ̂1 – c1 and θ̂2 – c2 are both very close to zero, in statistical terms, that situation would be consistent with the null hypothesis H0 being true. If the null hypothesis H0 is true, meaning θ1 = c1 and θ2 = c2 , then both θ̂1 – c1 and θ̂2 – c2 should be realizations of draws from normal distributions that are centered at zero. For the z-test, recall that the magnitude of the z-statistic, $\left|\frac{\hat\theta_x - c}{se(\hat\theta_x)}\right|$, is a statistical distance between θ̂x and c, obtained by dividing the actual distance |θ̂x – c| by the standard error se(θ̂x ). Unlike the z-test setting, there are now two distances, |θ̂1 – c1 | and |θ̂2 – c2 |, to consider when Q = 2 hypotheses are being tested.
The Wald statistic generalizes the notion of a statistical distance to handle both higher dimensions (two for the
Q = 2 case) and possible correlation between the two estimators of θ1 and θ2 . As the formula for the Wald statistic
requires more advanced mathematics for the general case, a more complete description of the Wald statistic is left for
the Appendix. Instead, for the remaining discussion in this section, we assume that R is able to calculate the Wald
statistic. In the case of Q = 2, the realized Wald statistic is a draw from a $\chi^2_2$ distribution (chi-square distribution with 2 degrees of freedom) if the null hypothesis H0 is true. The Wald statistic is always non-negative, which is intuitive since
it is a statistical distance measure. Values of the Wald statistic close to zero are consistent with the null hypothesis
H0 since such values arise when θ̂1 is close to c1 and θ̂2 is close to c2 . On the other hand, large positive values of the
Wald statistic are consistent with the alternative hypothesis since such values arise when θ̂1 is far from c1 and/or θ̂2 is
far from c2 . As a result, the Wald test is a one-sided test, where for Q = 2 rejection only occurs in the right tail of the $\chi^2_2$ distribution. For instance, for a test at a 5% level, we reject H0 if the Wald statistic is greater than the 95% quantile of the $\chi^2_2$ distribution, which is approximately 5.991. For a test at a 10% level, we reject H0 if the Wald statistic is greater than the 90% quantile of the $\chi^2_2$ distribution, which is approximately 4.605. These critical values are calculated in R with the qchisq function:
qchisq(0.95,2)
## [1] 5.991465
qchisq(0.90,2)
## [1] 4.60517
The Wald statistic generalizes to additional linear restrictions. For higher Q, there are more estimators to be
considered and, therefore, more distances (between realized estimates and hypothesized values) taken into account by
the Wald statistic. The biggest difference for higher Q is the sampling distribution of the Wald statistic when the null
hypothesis is true. Specifically, the realized Wald statistic is a draw from a $\chi^2_Q$ distribution (chi-square distribution with Q degrees of freedom) when the null hypothesis H0 : θ1 = c1 , θ2 = c2 , …, θQ = cQ is true. Intuitively, as more distances
are added into the Wald statistic, the overall statistical distance measure is expected to increase, which corresponds to
the thicker right tails of the chi-square distribution as the degrees of freedom increase.
If a statistical package can conduct a Wald test, it usually provides the value of the Wald statistic and/or the p-value
for the test itself. As with t-tests and z-tests, the p-value is the most useful, as it immediately tells us whether the
null hypothesis H0 would be rejected at any level. In the interest of completeness, however, we also describe how to
conduct the test based upon the Wald statistic itself. First, notation for the critical value of a chi-square distribution is
required.
Definition 16.5 The critical value wQ,q denotes the (1 – q) quantile of the $\chi^2_Q$ distribution. For example, w2,0.05 is the 95% quantile of the $\chi^2_2$ distribution, and w2,0.10 is the 90% quantile of the $\chi^2_2$ distribution.
The following proposition states the sampling distribution of the Wald statistic when the null hypothesis is true:
Proposition 16.3. The Wald statistic associated with testing the null hypothesis
H0 : θ1 = c1 , θ2 = c2 , …, θQ = cQ
is distributed as a $\chi^2_Q$ random variable if H0 is true. The probability that the Wald statistic is greater than the critical value wQ,α is equal to α if H0 is true.
For Q hypotheses being tested, given the $\chi^2_Q$ sampling distribution of the Wald statistic when H0 is true, the following
rejection rule based on critical values can be used:
Rejection rule for the Wald test based on the Wald statistic (test at the α level):
• Reject H0 : θ1 = c1 , θ2 = c2 , …, θQ = cQ at the α level if Wald statistic ≥ wQ,α .
• Do not reject H0 : θ1 = c1 , θ2 = c2 , …, θQ = cQ at the α level if Wald statistic < wQ,α .
Figure 16.6 provides a graphical view of the rejection area for a test at the 5% level, where the critical value is
wQ,0.05 . The gray region, corresponding to any Wald statistic above the critical value wQ,0.05 , indicates the area where
Figure 16.6
Rejection area for the Wald test at a 5% level
H0 is rejected. Therefore, the probability of rejecting H0 when H0 is true is equal to 5%, as expected since it’s the level
of the test.
Alternatively, if the p-value for the Wald test is available, this p-value can be directly used to test the null hypothesis,
similar to t-tests and z-tests. For the Wald test, the p-value is the probability that a $\chi^2_Q$ random variable is greater than the Wald statistic. If the null hypothesis H0 is true, the p-value provides the probability that a realized Wald statistic is
at least as large as the one calculated for the observed sample. With the p-value, the rejection rule should look familiar:
Rejection rule for the Wald test based on the p-value (test at the α level):
• Reject H0 : θ1 = c1 , θ2 = c2 , …, θQ = cQ at the α level if p-value < α.
• Do not reject H0 : θ1 = c1 , θ2 = c2 , …, θQ = cQ at the α level if p-value > α.
Figure 16.7 shows how the p-value relates to the Wald statistic and the $\chi^2_Q$ distribution. For a given Wald statistic, labeled “Wald stat” in the figure, the p-value is the area of the gray region to the right of the Wald statistic under the $\chi^2_Q$ distribution.
We now re-visit two of the examples from the beginning of this section.
Example 16.20 (Widget website) Continuing Example 16.17, the null hypothesis
H0 : πA = πB , πB = πC
or, equivalently,
H0 : θ1 = 0, θ2 = 0,
Figure 16.7
p-value for a Wald test
where θ1 = πA – πB and θ2 = πB – πC , corresponds to the purchase probabilities being the same for the three groups (e-
mail A recipients, e-mail B recipients, and non-recipients). There are Q = 2 hypotheses being tested. The Wald statistic
associated with H0 turns out to be 11.17. The p-value is calculated in R with the pchisq function:
1-pchisq(11.17,2)
## [1] 0.00375375
The p-value is approximately 0.004, meaning the null hypothesis is rejected at any level above 0.4% and providing
strong statistical evidence that the three purchase probabilities are not all equal to each other. By itself, the Wald test
doesn’t tell us which of the tested equalities is causing the rejection, although the previous evidence from the z-tests
for this example suggests that the rejection is being driven by the fact that the estimate of πC (pC = 0.15) is much lower
than the estimates of πA (pA = 0.20) and πB (pB = 0.22).
Example 16.21 (Earnings and marital status) Continuing Example 16.19, where µM , µD , µW , µN denoted the
population mean of weekly earnings for the four subpopulations based on marital status (“Married,” “Divorced,”
“Widowed,” “Never married,” respectively), the null hypothesis associated with these population means being equal
to each other is
H0 : µM = µD , µD = µW , µW = µN .
There are Q = 3 hypotheses being tested. The sample means of weekly earnings for the four groups are
x̄M = 1047, x̄D = 902, x̄W = 661, and x̄N = 820.
The Wald statistic associated with H0 turns out to be 80.90, which has a p-value of 0.000.
1-pchisq(80.90,3)
## [1] 0
Thus, the null hypothesis is rejected at any level, providing strong statistical evidence against the population means
being the same for the four groups. We might wonder if this result is being driven by the much lower sample average of
weekly earnings observed for the “Widowed” group, with x̄W = 661. To ignore the “Widowed” group, we could instead
test the null hypothesis
H0 : µM = µD , µD = µN ,
which has Q = 2 hypotheses. For this null hypothesis, the Wald statistic is 49.31, still with a p-value of 0.000, indicating
again that there is strong statistical evidence against the population means being the same for the three remaining
groups.
What happens for the Wald test when there is only a single restriction (Q = 1)? In this case, the null hypothesis
is H0 : θ1 = c1 , which can be tested with a z-test. Would we get a different answer using a Wald test? Thankfully,
the answer is no. Whether the test of H0 is conducted using a z-test or a Wald test, the p-value for the test will be
numerically identical, meaning the rejection conclusion is also the same for the two tests. This equivalence arises
since, in the Q = 1 case, the Wald statistic is exactly equal to the square of the z-statistic, and the critical value w1,α is exactly equal to the square of the critical value zα/2 . The latter fact follows from a $\chi^2_1$ random variable being equal to the square of a Z ∼ N(0, 1) random variable.
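This equivalence is easy to verify numerically in R, shown here for α = 0.05:
# w_{1,0.05} equals the square of z_{0.025}
qchisq(0.95,1)   # approximately 3.841
qnorm(0.975)^2   # approximately 3.841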
To allow for this possibility, we introduce the estimated asymptotic variance matrix with L rows and L columns, denoted V̂:
$$\hat{V} = \begin{pmatrix} \hat{v}_{11} & \hat{v}_{12} & \cdots & \hat{v}_{1L} \\ \hat{v}_{21} & \hat{v}_{22} & \cdots & \hat{v}_{2L} \\ \vdots & \vdots & \ddots & \vdots \\ \hat{v}_{L1} & \hat{v}_{L2} & \cdots & \hat{v}_{LL} \end{pmatrix}.$$
The diagonal elements are the estimated asymptotic variances of the estimates γ̂1 , γ̂2 , …, γ̂L . For example, v̂22 is the
estimated asymptotic variance of γ̂2 . The off-diagonal elements are the estimated asymptotic covariances between two
estimates. For example, v̂12 is the estimated asymptotic covariance between γ̂1 and γ̂2 . (The true covariance will be
zero if the two underlying estimators are independent.)
To represent the Q linear restrictions being tested, we use a matrix denoted R (with Q rows and L columns) and a vector denoted c (with Q rows), so that the null hypothesis can be written compactly as H0 : Rγ = c, where γ = (γ1 , γ2 , …, γL )′ is the vector of parameters. Each row of R, together with the corresponding element of c, encodes one linear restriction. The following examples illustrate:
• Testing the equality of four population averages: For the population averages (µA , µB , µC , µD ), we let γ1 = µA , γ2 = µB , γ3 = µC , and γ4 = µD . The null hypothesis H0 : µA = µB = µC = µD has three linear restrictions, given by γ1 = γ2 , γ2 = γ3 , and γ3 = γ4 . We have L = 4 parameters and Q = 3 restrictions. Then, R and c are
$$R = \begin{pmatrix} 1 & -1 & 0 & 0 \\ 0 & 1 & -1 & 0 \\ 0 & 0 & 1 & -1 \end{pmatrix} \quad \text{and} \quad c = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}.$$
• Testing that all parameters are equal to zero: In some situations, it may be of interest to test whether a set of
parameters are all equal to zero. For instance, this test is often used in the context of multiple regression models
(Chapter 18). If we have L parameters γ1 , γ2 , …, γL and want to test
H0 : γ1 = γ2 = · · · = γL = 0,
there are Q = L linear restrictions (γ1 = 0, γ2 = 0, …, γL = 0). Then, R and c are
$$R = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix} \quad \text{and} \quad c = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}.$$
In this case, R is the identity matrix, with ones along the diagonal and zeros everywhere else.
• Other tests: The framework using R and c is quite general. Suppose we have L = 5 parameters and want to jointly
test the following three restrictions:
H0 : γ1 + γ2 = 4, γ3 = 2γ4 , γ5 = 10.
This null hypothesis says that the first two parameters sum to 4, the third parameter is two times the size of the
fourth parameter, and the fifth parameter is equal to 10. The corresponding R and c are
$$R = \begin{pmatrix} 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & -2 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix} \quad \text{and} \quad c = \begin{pmatrix} 4 \\ 0 \\ 10 \end{pmatrix}.$$
The first row has the restriction γ1 + γ2 = 4, the second row has the restriction γ3 – 2γ4 = 0, and the third row has the
restriction γ5 = 10.
Since R does not have a built-in function for general Wald tests, we introduce a user-defined function wald_test below. The function wald_test takes four arguments: the estimate vector gamma_hat (γ̂), the asymptotic variance matrix var_gamma_hat (V̂), the restriction matrix R (R), and the constant vector c (c). The restriction matrix R and the constant vector c are optional arguments, with default values equal to the identity matrix (ones along the diagonal and zeros for all non-diagonal elements) and the zero vector, respectively; these default values correspond to a null hypothesis whose restrictions are that each element of γ is equal to zero. The function wald_test returns a list containing two elements, the Wald statistic (W) and its associated p-value (p_value). The p-value is calculated by determining the area to the right of the Wald statistic for the $\chi^2_Q$ distribution.
# when R has one row (one restriction), make sure R has matrix type
if (!is.matrix(R)) {
R <- t(as.matrix(R))
}
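Only this fragment of wald_test appears here; the complete script is available on the companion website. As a guide, here is a minimal sketch of the full function that is consistent with the description above and incorporates the matrix-type check just shown (the quadratic form is the standard Wald formula; the book's actual implementation may differ in details):
wald_test <- function(gamma_hat, var_gamma_hat,
                      R = diag(length(gamma_hat)),
                      c = rep(0, length(gamma_hat))) {
  # when R has one row (one restriction), make sure R has matrix type
  if (!is.matrix(R)) {
    R <- t(as.matrix(R))
  }
  Q <- nrow(R)
  # discrepancies between estimated restrictions and hypothesized values
  diff <- R %*% gamma_hat - c
  # Wald statistic: quadratic form in the inverse variance of R %*% gamma_hat
  W <- as.numeric(t(diff) %*% solve(R %*% var_gamma_hat %*% t(R)) %*% diff)
  # p-value: area to the right of W under the chi-square(Q) distribution
  p_value <- 1 - pchisq(W, df = Q)
  return(list(W = W, p_value = p_value))
}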
We also define additional R functions to handle estimation of the asymptotic variance matrix V̂ for some of the
more common examples of the Wald test.
Sample proportions of independent samples: The function var_prop_indep estimates the asymptotic
variance matrix for a vector of sample proportions estimated on different (independent) samples. Its arguments are
pi_hat, the vector of sample proportions, and nobs, the vector of underlying sample sizes for each sample.
return(var_pi_hat)
}
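Only the closing lines of var_prop_indep appear above; a minimal sketch of the full function, consistent with its description:
var_prop_indep <- function(pi_hat, nobs) {
  # estimated variance of each sample proportion is p(1-p)/n;
  # independence across samples makes all covariances zero
  var_pi_hat <- diag(pi_hat*(1-pi_hat)/nobs, nrow = length(pi_hat))
  return(var_pi_hat)
}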
Example 16.22 (Widget website) We re-visit Example 16.20 to detail how the Wald statistic and p-value are
calculated for the null hypothesis H0 : πA = πB = πC . Here is the R code that uses the functions var_prop_indep
and wald_test:
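The code block itself is on the companion website; a sketch reconstructed from the description that follows looks approximately like this:
# estimated purchase probabilities for the three groups
gamma_hat <- c(0.20, 0.22, 0.15)
# asymptotic variance matrix for independent samples of sizes 300, 300, 2400
var_gamma_hat <- var_prop_indep(gamma_hat, c(300,300,2400))
# restrictions: gamma1 - gamma2 = 0 and gamma2 - gamma3 = 0
R <- rbind(c(1,-1,0),c(0,1,-1))
wald_test(gamma_hat, var_gamma_hat, R, c(0,0))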
First, we store the estimated purchase probabilities (γ̂1 = π̂A = 0.20, γ̂2 = π̂B = 0.22, γ̂3 = π̂C = 0.15) in the vector
gamma_hat. Then, we estimate the asymptotic covariance matrix using var_prop_indep with the arguments
gamma_hat (the Bernoulli parameter estimates) and the vector of sample sizes c(300,300,2400). The matrix
R and the vector c are defined to correspond to the linear restrictions γ1 – γ2 = 0 and γ2 – γ3 = 0. We use the rbind
function to construct the R matrix. The rbind function takes vectors as arguments, where each vector corresponds
to one row, and stacks the vectors as rows in a matrix. The command R <- rbind(c(1,-1,0),c(0,1,-1))
stores the matrix
$$\begin{pmatrix} 1 & -1 & 0 \\ 0 & 1 & -1 \end{pmatrix}$$
in the variable R. Finally, the function wald_test is called and outputs both the Wald statistic and the p-value, as
seen previously in Example 16.20.
Sample means of independent samples: The function var_mean_indep estimates the asymptotic variance
matrix for sample means estimated on different (independent) samples. Its argument is x_vectors, a list of the
sample vectors (i.e., the actual observations in each sample). Each of the sample vectors within the list x_vectors
may have different length.
# initialize variables
num_means <- length(x_vectors)
tempvec <- rep(0, num_means)
return(var_mean)
}
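Again, only fragments of var_mean_indep survive here; a minimal sketch of the full function:
var_mean_indep <- function(x_vectors) {
  # initialize variables
  num_means <- length(x_vectors)
  tempvec <- rep(0, num_means)
  # estimated variance of each sample mean is s^2/n for that sample
  for (j in 1:num_means) {
    tempvec[j] <- var(x_vectors[[j]])/length(x_vectors[[j]])
  }
  # independent samples: diagonal asymptotic variance matrix
  var_mean <- diag(tempvec, nrow = num_means)
  return(var_mean)
}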
What is a list? A list in R is a collection of data objects, which may be of different data types. The list x_vectors is
a collection of the sample vectors for each of the (independent) samples. For example, in the case of two samples, given
by sample1 and sample2, the function call would be var_mean_indep(list(sample1,sample2)),
where the function list combines its arguments into a list object.
Example 16.23 (Earnings and marital status) We re-visit Example 16.21 to detail how the Wald statistic and p-value are calculated for the null hypothesis H0 : µM = µD = µW = µN , corresponding to the population mean of weekly earnings being the same for the four subpopulations based on marital status (“Married,” “Divorced,” “Widowed,” and “Never married,” respectively). In this case, the number of parameters is L = 4, and the number of linear restrictions in H0 is Q = 3. Here is the R code to conduct the Wald test:
Example 16.21 also considered dropping the “Widowed” subsample and testing H0 : µM = µD = µN . In this case, the
number of parameters is L = 3, and the number of linear restrictions in H0 is Q = 2. Here is the R code to conduct the
Wald test:
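A sketch for this second test, dropping the “Widowed” group (under the same assumptions as above):
# test equality of means for the three remaining groups (Q = 2)
groups2 <- c("Married", "Divorced", "Never married")
earn_list2 <- lapply(groups2, function(g) cps$earnwk[cps$marstatus == g])
wald_test(sapply(earn_list2, mean), var_mean_indep(earn_list2),
          rbind(c(1,-1,0), c(0,1,-1)), c(0,0))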
Sample means from the same sample: The function var_mean_onesample estimates the asymptotic variance
matrix for the sample means of several variables from the same sample. Its arguments are df, the data frame containing
the sample, and vars, a vector of either variable names or indices that identifies which elements of the data frame df
to consider.
return(var_mean)
}
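As with the earlier helper functions, only the closing lines survive here; a minimal sketch of the full function:
var_mean_onesample <- function(df, vars) {
  # keep only the requested variables
  x <- df[, vars]
  n <- nrow(x)
  # means computed from the same sample are correlated, so use the full
  # sample covariance matrix of the variables, divided by the sample size
  var_mean <- cov(x)/n
  return(var_mean)
}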
Example 16.24 (Exam score data) Example 16.10 implemented a z-test to test the equality of µexam1 and µexam2 , the
population means associated with the variables exam1 and exam2 in the exams dataset. There are L = 2 parameters,
and the z-test is a special case of the Wald test with Q = 1 linear restriction. The null hypothesis is H0 : µexam1 – µexam2 =
0. Here is the R code to conduct the Wald test:
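A sketch of this code, assuming the exams data frame is loaded:
# Wald test of mu_exam1 - mu_exam2 = 0 (L = 2 parameters, Q = 1 restriction)
gamma_hat <- c(mean(exams$exam1), mean(exams$exam2))
var_gamma_hat <- var_mean_onesample(exams, c("exam1","exam2"))
# R is a single row; the matrix-type check inside wald_test handles this
wald_test(gamma_hat, var_gamma_hat, R = c(1,-1), c = 0)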
The p-value from this Wald test is the same as the p-value from the z-test in Example 16.10, and the Wald statistic
(approximately 27.13) is equal to the z-statistic (approximately 5.22) squared. If there were another exam within the
dataset, say exam3, we could test the equality of all three population means, H0 : µexam1 = µexam2 = µexam3 , by modifying
the R code to account for the extra parameter (L = 3) and the extra restriction (Q = 2).
Notes
49 The use of the word “null” in “null hypothesis” stems from the fact that many tests of interest consider c = 0, so that H0 : θ = 0. If θ is a parameter that measures some sort of effect, which is common in regression models, the null hypothesis H0 corresponds to no effect or a “null” effect.
50 An alternative approach is to use a simple linear regression of one variable on the other variable, which is covered in Chapter 17. For the simple
linear regression, the z-test uses the slope estimate and its standard error.
51 The test is named after Abraham Wald, who introduced the idea in a 1943 paper in the Transactions of the American Mathematical Society.
52 More formally, the p-values associated with the tests of these three null hypotheses are numerically identical.
Exercises
1. You are considering investing in a company called Tech Trove. The company has been listed on a stock exchange
for ten years, with the following annual returns:
0.04, –0.10, 0.17, 0.02, –0.19, –0.08, –0.01, 0.09, 0.22, 0.06.
Assume the annual returns are i.i.d. draws from a normal distribution with unknown mean µ and unknown variance σ 2 .
(a) What is the 95% confidence interval for µ?
(b) What is the t-statistic for the test of the null hypothesis H0 : µ = 0? Do you reject H0 at a 5% level?
(c) What is the p-value for the test of the null hypothesis H0 : µ = 0?
(d) Without doing any additional calculations, how would the t-statistic for testing H0 : µ = –0.01 compare to the
t-statistic found in (b)?
(e) Without doing any additional calculations, how would the p-value for testing H0 : µ = –0.01 compare to the
p-value found in (c)?
2. A random sample of 12 undergraduates enrolled in an introductory economics course is asked to predict the annual income for their first job after graduation. The sample average of the responses is $49,000, and the sample standard deviation of the responses is $8,000. Assume that the individual responses are drawn independently from a normal distribution.
(a) If µX denotes the population average of the predicted annual income, what is the t-statistic associated with
testing H0 : µX = 50000?
(b) Do you reject H0 : µX = 50000 at a 5% level?
(c) What is the p-value for the t-test of H0 : µX = 50000?
(d) If the same sample average and sample standard deviation were observed for a sample size of n = 1200 (instead
of n = 12), would you reject H0 : µX = 50000 at a 5% level?
(e) For (d), is it necessary to assume that the individual responses are drawn from a normal distribution?
3. The average grade among fourth graders on a math aptitude test in a certain state is 62.5 out of 100 points. (Assume
62.5 is the population average.) An educational company has developed an app for elementary-school math and would
like to test whether the app improves average performance on the aptitude test among fourth graders. Let µX denote
the population mean of test scores among students who use the app.
(a) To provide statistical evidence that the app increases the true average of test scores, what are the appropriate
one-sided null hypothesis and alternative hypothesis?
(b) The company provides the app to 15 randomly selected students. The sample mean and sample standard
deviation of test scores are 68.3 and 13.5, respectively. Under the assumption that the test scores are i.i.d. and
normally distributed, would the t-test of the null hypothesis in (a) be rejected at the 5% level? at the 10% level?
(c) What is the p-value associated with the t-test in (b)?
4. Suppose 200 students are chosen at random to do a blind taste test of green M&M’s versus red M&M’s. Of the 200
students, 112 students prefer green M&M’s and 88 prefer red M&M’s. Let π denote the population probability that a
randomly selected student prefers green M&M’s.
(a) What is the asymptotic 95% confidence interval for π?
(b) State the null hypothesis that corresponds to green M&M’s and red M&M’s being equally preferred in the
population.
(c) What is the p-value associated with the test of the null hypothesis in (b)?
(d) Suppose another 200 students are chosen to do the blind taste test, and once again(!) 112 prefer green M&M’s
and 88 prefer red M&M’s. How do the following quantities for the 400-student sample compare to the
associated quantities for the initial 200-student sample?
i. Middle of the asymptotic 95% confidence interval for π
ii. Width of the asymptotic 95% confidence interval for π
iii. The p-value for testing the null hypothesis that green M&M’s and red M&M’s are equally preferred in
the population
5. Use the metricsgrades dataset for this question. These data are from a graduate econometrics course with 68
students.
(a) Provide an asymptotic 95% confidence interval for the difference in the population means of total (composite
course score) for the subpopulations of domestic (domestic = 1) and international (domestic = 0) students.
(b) What is the p-value for the z-test of the null hypothesis that the population means of total (composite course
score) for the subpopulations of domestic (domestic = 1) and international (domestic = 0) students are the same?
(c) Define an indicator variable hiscore that is equal to 1 if total > 80 and 0 otherwise. What is the p-value for the
z-test of the null hypothesis that P(hiscore = 1|domestic = 1) = P(hiscore = 1|domestic = 0)?
6. *A classmate gives you a coin to toss and claims it is a fair coin, but you have your suspicions. Let π be the
probability of heads with this coin.
(a) If you toss the coin 100 times and get 58 heads, what is the p-value for testing H0 : π = 0.50?
(b) Suppose the true heads probability is π = 0.51, so that
$$\frac{\bar{X} - 0.51}{\sqrt{\frac{(0.51)(0.49)}{n}}} \sim N(0, 1).$$
The z-statistic for testing H0 : π = 0.50, before observing outcomes of the coin tosses, can be written
$$\frac{\bar{X} - 0.50}{\sqrt{\frac{(0.51)(0.49)}{n}}} = \frac{\bar{X} - 0.51}{\sqrt{\frac{(0.51)(0.49)}{n}}} + \frac{0.51 - 0.50}{\sqrt{\frac{(0.51)(0.49)}{n}}},$$
where the first term is a N(0, 1) random variable and the second term is a constant (depending on n). Given
this expression, how many tosses n∗ would be required for there to be a 99% probability that the z-statistic is
greater than 1.96?
(c) For the n∗ found in (b), and still assuming π = 0.51, conduct 100,000 simulations in R to determine the probability that the z-statistic is greater than 1.96. For this part, use the usual $se(\bar{x}) = \sqrt{\frac{\bar{x}(1-\bar{x})}{n}}$ formula rather than the “true” $\sqrt{\frac{(0.51)(0.49)}{n}}$ formula.
(d) Repeat (b) for different values of the true heads probability, with π = {0.51, 0.52, …, 0.59, 0.60} being the
possibilities. Plot a graph of required sample size n∗ against probability π.
7. You have an estimate θ̂x = 1.2 of an unknown parameter θ based upon a sample size of n = 100. You calculate that
the p-value associated with a z-test of H0 : θ = 1 is p∗ . How do the following quantities compare with p∗ ?
(a) the p-value for a z-test of H0 : θ = 1.1
(b) the p-value for a one-sided z-test of H0 : θ ≤ 1
8. Suppose the smoking rate among adults aged 25-44 in the United States is 11.2%. A public-health researcher has
an informational video that she is convinced will lower the prevalence of smoking in this age group. She randomly
selects a sample (including smokers and non-smokers) from this age group and then follows up in six months to ask
whether they are a smoker or not. Let π denote the probability of smoking for an adult aged 25-44 who sees the video.
(a) To provide statistical evidence that the informational video decreases the smoking rate, what are the appropriate
one-sided null hypothesis and alternative hypothesis?
(b) If the researcher finds that 10% of participants are smokers in her follow-up survey, how large would her sample
need to be to reject the null hypothesis in (a) at a 5% level?
9. Use the sp500 dataset for this question.
(a) What are the z-statistic and p-value for testing that the population average of Home Depot (HD) monthly returns
is equal to the population average of Lowe’s (LOW) monthly returns?
(b) The sample average of the market-index (IDX) monthly returns is 0.0078. If you are interested in testing
whether the population average of Home Depot monthly returns is the same as the population average of the
market-index monthly returns, would it be appropriate to test H0 : µHD = 0.0078? Explain.
(c) *Use the bootstrap for this part to calculate the standard error required for the test. What are the z-statistic
and p-value for testing that the population median of Home Depot monthly returns is equal to the population
median of Lowe’s monthly returns?
(d) *Use the bootstrap for this part to calculate the standard error required for the test. What are the z-statistic
and p-value for testing that the population standard deviation of Home Depot monthly returns is equal to the
population standard deviation of Lowe’s monthly returns?
(e) *Use the bootstrap for this part to calculate the standard error required for the test. What are the z-statistic and
p-value for testing that ρHD,IDX (the population correlation between Home Depot returns and the market-index
returns) is equal to ρLOW,IDX (the population correlation between Lowe’s returns and the market-index returns)?
10. Use the cps dataset for this question. Focus on the sample of 2,809 employed individuals.
(a) Provide the sample correlation matrix for the variables hrslastwk, age, educ, and ownchild.
(b) For each sample correlation in the correlation matrix from (a), calculate the p-value for the z-test of the null
hypothesis that the population correlation is equal to zero. Which of the correlations are statistically significant
at a 5% level?
(c) Define an indicator variable collgrad that is equal to 1 if educ ≥ 16 and 0 otherwise. Calculate the sample
standard deviation of hrslastwk for the two subsamples of college graduates (collgrad = 1) and non-college
graduates (collgrad = 0). What are the z-statistic and p-value for the test of the null hypothesis that the
population standard deviations for the two subpopulations (college graduates and non-college graduates) are
the same? (Hint: Use the se_sx function defined in Section 14.4.)
11. *A podcast provider has 1,000 customers who have signed up for a free three-month trial. It would like to do an
A/B test of two alternative plans to get customers to subscribe after the free trial, where plan A guarantees a $2.99
monthly fee forever if the customer pays 12 months up-front and plan B guarantees a $3.99 monthly fee forever but
allows cancellation at any time. Let πA and πB be the true probabilities of subscription for plans A and B, respectively.
The podcast provider chooses the number of customers offered plan A, denoted n∗ , with the 1000 – n∗ other customers
offered plan B. Let XA and XB be the Bernoulli random variables for the two plans, with success indicating a subscriber.
(a) If n∗ = 300, what is the asymptotic distribution of X̄A – X̄B in terms of πA and πB ?
(b) What value of n∗ minimizes the asymptotic variance of X̄A – X̄B (in terms of πA and πB )?
(c) If the null hypothesis H0 : πA = πB is true, what value of n∗ minimizes the asymptotic variance of X̄A – X̄B ?
(d) Suppose plan A is slightly more effective than plan B, with πA – πB = ε for some small number ε > 0. Then, the
z-statistic for testing H0 : πA = πB, before observing outcomes, can be written
(X̄A – X̄B)/sd(X̄A – X̄B) = (X̄A – X̄B – ε)/sd(X̄A – X̄B) + ε/sd(X̄A – X̄B).
Since ε is small, you may assume that sd(X̄A – X̄B) is approximately the same as it would be for πA = πB. Explain
why the choice of n∗ found in (c) is the best choice for testing H0 : πA = πB, in the sense that it is the choice that
makes rejection the most likely when πA – πB = ε.
12. Use the brands dataset for this question. The dataset consists of 14,560 observations on customers who purchased
a candy bar in their last visit to a specific market. There are five brands, numbered 1 through 5, and the last_brand
variable indicates the brand that was purchased on the last visit. The purchase variable is 1 if the customer purchases a
candy bar during their current visit and 0 otherwise. If a purchase is made (purchase = 1), the variable brand indicates
the brand purchased on the current visit; if no purchase is made (purchase = 0), the variable brand has a value of 0.
(a) For each of the five conditional purchase probabilities (purchase given brand 1 on last visit, purchase given
brand 2 on last visit, and so on), provide the estimated probability and an asymptotic 95% confidence interval.
(b) For the 10 possible pairs of different brands (b1 and b2 ), provide the p-value for the test of
H0 : P(purchase = 1|last_brand = b1 ) = P(purchase = 1|last_brand = b2 ).
How many pairs indicate a statistically significant difference at a 5% level?
(c) Create a new variable same_brand equal to 1 if brand = last_brand and 0 otherwise. For each brand b ∈
{1, 2, 3, 4, 5}, provide the estimate of the conditional probability
P(same_brand = 1|last_brand = b, purchase = 1),
along with an asymptotic 95% confidence interval.
(d) For the 10 possible pairs of different brands (b1 and b2 ), provide the p-value for the test of
H0 : P(same_brand = 1|last_brand = b1 , purchase = 1) = P(same_brand = 1|last_brand = b2 , purchase = 1).
How many pairs indicate a statistically significant difference at a 5% level?
(e) *Returning to the conditional purchase probabilities in (a), use the wald_test function to test the null
hypothesis that all five probabilities are equal to each other. What is the p-value? Do you reject at the 5%
level?
17 The simple linear regression model
The linear regression model describes the relationship between an outcome variable and one or more explanatory
variables. The linear regression estimator, also known as the least-squares estimator, is one of the most commonly
used techniques for data analysis. This estimator is used for two main purposes: (i) making predictions about the
outcome variable based upon values of the explanatory variable(s) and (ii) estimating the causal effect of one or more
explanatory variables on the outcome variable.
This chapter focuses on the simplest form of the model, one with a single explanatory variable, known as the
simple linear regression model. Chapter 18 considers a more general model that allows for multiple explanatory
variables, known as the multiple linear regression model. The linear regression model is “linear” since it assumes a
linear relationship between the outcome variable and the explanatory variable(s), but it allows for additional random
noise in the outcome variable. As discussed in Chapter 18, the linear nature is not as restrictive as the terminology
suggests since nonlinearity can be incorporated with additional explanatory variables.
We focus on cross-sectional data in both this chapter and Chapter 18, with the sample assumed to consist of
i.i.d. draws from the underlying population. While linear regression models and estimators certainly apply to other
types of data, including time-series data and panel data, the treatment of these cases is beyond the scope of this book.
• Exogeneity assumption: The assumption E(U|X) = 0, known as the exogeneity assumption, states that the
explanatory variable X provides no information about the expected value of U. Regardless of the value of X, the
conditional expectation of U is always equal to zero.53
The exogeneity assumption E(U|X) = 0 requires that the random variables X and U are uncorrelated (σXU = ρXU = 0),
but exogeneity is a stronger assumption than σXU = ρXU = 0.
Proposition 17.1. If the exogeneity assumption E(U|X) = 0 holds, then σXU = ρXU = 0. Moreover, for any function g(·),
if the exogeneity assumption holds, then g(X) and U are uncorrelated: σg(X)U = ρg(X)U = 0.
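A short verification of the first claim, using the law of iterated expectations (a sketch added here for completeness):
σXU = E(XU) – E(X)E(U) = E{X E(U|X)} – E(X) E{E(U|X)} = 0.
The same argument with g(X) in place of X gives σg(X)U = 0, and zero covariance implies zero correlation.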
While the exogeneity assumption implies that X and U are uncorrelated, the reverse is not necessarily true. Even if
X and U are uncorrelated, E(U|X = x) may be non-zero for some x if U is correlated with some nonlinear function of X.
Under the exogeneity assumption, the conditional expectation E(Y|X) simplifies to a linear function of X:
E(Y|X) = E(α + βX + U|X) = E(α|X) + E(βX|X) + E(U|X) = α + βX.
The first equality comes from plugging in α + βX + U for Y based upon the SLR model. The second equality follows
from the fact that the expected value of the sum of random variables is the sum of the expected values of the random
variables. The third equality uses several facts: α is a constant, so E(α|X) = α; after conditioning on X, βX is also a
constant, so E(βX|X) = βX; and, E(U|X) = 0 by the exogeneity assumption.
Therefore, the SLR model can be thought of as a linear model for E(Y|X). The meaning of the two parameters, the
intercept α and the slope β, follows directly from the equation E(Y|X) = α + βX:
• Meaning of the intercept α: When X = 0, the conditional expectation is E(Y|X = 0) = α + β · 0 = α, implying
α = E(Y|X = 0).
α is the population mean of Y conditional on X = 0. Graphically, α is the y-intercept of the line E(Y|X) = α + βX.
The units of α are the same as the units of the random variable Y. Whether the value of α has a meaningful
interpretation depends upon the particular application. Generally speaking, if zero is a sensible value for the X
variable, then the value of α has a meaningful interpretation. For instance, if Y measures an individual’s earnings
and X measures years of work experience, α is the population average of earnings conditional on no work experience
(i.e., individuals newly in the workforce). On the other hand, if Y measures an individual’s earnings and X measures
an individual’s age in years, α would not be meaningful by itself since it corresponds to the population earnings
conditional on an individual being zero years of age.
• Meaning of the slope β: The parameter β is the slope of the E(Y|X) = α + βX line, indicating how much E(Y|X)
Figure 17.1
Data-generating process for the SLR model (a realized point (x∗, y∗) with residual u∗ relative to the line E(Y|X = x) = α + βx, which has intercept (0, α) and slope β)
from the conditional distribution of U given X = x∗, the realized draw of Y is y∗ = α + βx∗ + u∗. Figure 17.1 provides a
graphical representation of this data-generating process. In the figure, the conditional expectation E(Y|X) = α + βX is
drawn with a positive intercept α and a positive slope β. The point (0, α) is where the E(Y|X) = α + βX line crosses the
y-axis. The figure shows a single realized data point (x∗, y∗), arising from the process just described. For the x∗ value
drawn from the unconditional distribution of X, the conditional expectation E(Y|X = x∗) = α + βx∗ is read off the line.
For the point in the figure, the draw of u∗ from the conditional distribution of U given X = x∗ is a positive value, leading
to a y∗ = α + βx∗ + u∗ value that is above the line (y∗ > α + βx∗). If the draw of u∗ had been negative, the observed data
point would have been below the line.
Example 17.1 (Advertising and sales) Suppose a retail company has stores in 80 different cities. During a particular
week, the company randomly chooses 20 cities for on-line targeted advertising on a social media website. The targeted
advertising is costly, which is why the company advertises to a subset of the 80 cities. If SALES is the random variable
for weekly sales at a given location and AD is the random variable indicating whether the location receives targeted
advertising (AD = 1) or not (AD = 0), the SLR model relating SALES and AD is
SALES = α + βAD + U with E(U|AD) = 0.
The exogeneity assumption E(U|AD) = 0 is likely to hold here. Since AD is assigned randomly by the company to be
0 or 1, there is no reason to expect a relationship between AD and the unobservable U. This situation is an example
of an A/B experiment where the SLR model allows the estimation of the causal effect of targeted advertising (AD) on
weekly sales (SALES). With the exogeneity assumption E(U|AD) = 0 holding, the SLR model implies that
E(SALES|AD = 0) = α and E(SALES|AD = 1) = α + β.
Figure 17.2
SLR model for sales and advertising (conditional expectations α at AD = 0 and α + β at AD = 1 along the line E(SALES|AD) = α + βAD)
Figure 17.2 shows the SLR model for this example. While the line E(SALES|AD) = α + βAD is defined for all values
of AD, the only relevant values are AD = 0 (no advertising) and AD = 1 (advertising), and the conditional expectations
for these two values are indicated by the solid points. The figure is drawn assuming a positive slope β > 0; of course,
since β is unknown, it is possible that β = 0 (advertising has no effect) or even β < 0 (advertising has a negative effect).
The parameter α is the population mean of sales for cities not receiving targeted advertising, and α + β is the
population mean of sales for cities receiving targeted advertising. Subtracting the E(SALES|AD = 0) equation from the
E(SALES|AD = 1) equation yields
β = E(SALES|AD = 1) – E(SALES|AD = 0),
which is the causal effect of targeted advertising, equal to the difference in the population mean of sales between
targeted cities and non-targeted cities. Since we will discuss how to estimate β below, the SLR model thus provides
a natural framework for estimating the difference in average outcomes for the “A” and “B” groups in an A/B experiment.
Example 17.2 (Earnings and union status) Suppose we are interested in modeling the relationship between weekly
earnings, measured by the random variable EARNWK, and union status, measured by the random variable UNION,
among employed individuals in the population. EARNWK is a continuous random variable, and UNION is an indicator
variable equal to 1 for a union member and 0 for a non-member. The SLR model relating EARNWK and UNION is
EARNWK = α + βUNION + U with E(U|UNION) = 0.
This example is similar to Example 17.1 since the explanatory variable is an indicator variable. α is the conditional
population average of weekly earnings for non-members,
α = E(EARNWK|UNION = 0),
and β is the difference in the population average of weekly earnings between union members and non-members,
β = E(EARNWK|UNION = 1) – E(EARNWK|UNION = 0).
Unlike Example 17.1, however, the explanatory variable UNION is not randomly assigned. Should we expect that the
exogeneity assumption E(U|UNION) = 0 holds here? There is the possibility that union status may be related to the
unobservable U. For instance, if union members tend to have higher skill levels than non-members and, thus, higher
productivity than non-members, it could be the case that UNION and U are positively correlated. Such a correlation
between UNION and U would imply that the exogeneity assumption E(U|UNION) = 0 does not hold. As discussed
below, failure of the exogeneity assumption makes causal inference difficult (e.g., testing whether union membership
causes higher earnings) but does not preclude estimating the association between earnings and union membership
(e.g., estimating the difference between average earnings for union members and average earnings for non-members).
Example 17.3 (Earnings and education) Rather than looking at the relationship between weekly earnings and union
membership (Example 17.2), suppose we are interested in the relationship between weekly earnings (EARNWK) and
years of educational attainment (EDUC) among employed individuals in the population. The SLR model relating
EARNWK and EDUC is
EARNWK = α + βEDUC + U with E(U|EDUC) = 0.
EDUC is a discrete explanatory variable that can take several different values, in contrast to the binary explanatory
variable UNION. As a result, the linearity assumption is considerably stronger here since it requires that the linear
relationship, given by the intercept α and the slope β, holds over the entire range of the EDUC variable, as opposed
to holding for the two possible values of UNION.54 The slope β is the change in the conditional expectation of weekly
earnings associated with a one-year change in education. The intercept α is not really meaningful by itself since
EDUC = 0 is far below the minimum education level traditionally observed in labor-force datasets in the United
States. As in Example 17.2, there is a concern that the exogeneity assumption may not hold here. Specifically, it is
generally believed by labor economists that EDUC is positively correlated with U since education likely has a positive
association with unobserved productivity, unobserved family wealth, etc. This positive association would imply that
E(U|EDUC) = 0 does not hold.
Example 17.4 (Cigarette sales and cigarette taxes) As discussed in Example 5.3, there is a lot of variation in state-
level cigarette taxes in the United States. Standard economic theory predicts a negative relationship between cigarette
sales (the “demand for cigarettes”) and cigarette taxes since a higher tax is associated with a higher price. We use
per-capita sales to make the data comparable across states. Let CIGSALES denote the number of packs per capita sold
in a year in a given state, and let CIGTAX denote the state tax (in dollars) on a pack of cigarettes. Both CIGSALES
and CIGTAX are continuous variables, and the SLR model relating the two variables is
CIGSALES = α + βCIGTAX + U with E(U|CIGTAX) = 0.
Since cigarette taxes are the result of a political and legislative process in each state, there might be a concern that the
exogeneity assumption E(U|CIGTAX) = 0 does not hold. For instance, if there are very few smokers in a given state,
there may be little resistance to a law that increases cigarette taxes. On the other hand, if there are many smokers in a
given state, there might be more resistance to such a law. If true, these arguments imply a negative correlation between
CIGTAX and the unobservable U since U is higher in states where people have a greater inclination to smoke.
Example 17.5 (Monthly stock returns and the overall market) One of the simplest and most commonly used models
for stock returns is a SLR model that relates the return of an individual stock to the return of the overall stock market.
The standard way to measure the return of the overall stock market is to use an index, like the S&P 500 or the Russell
2000. Generally, the individual stock would be part of the index being used. To fix ideas, let RSTOCK denote the monthly
return of a specific stock and RIDX denote the monthly return of the S&P 500 index. Then, the SLR model relating the
individual stock return to the overall market return is55
RSTOCK = α + βRIDX + U with E(U|RIDX ) = 0.
The intercept
α = E(RSTOCK |RIDX = 0)
is the population mean of the individual stock return RSTOCK conditional on RIDX = 0, which corresponds to the S&P 500
index being unchanged (zero market return) during a month. The slope β is the change in the conditional expectation
of the individual stock’s return RSTOCK associated with a one-unit change in the index return RIDX . In terms of returns,
a one-unit change is huge since it corresponds to a 100 percentage-point change. Instead, β can be interpreted in terms
of a smaller change in RIDX , like a one percentage-point change. For a one percentage-point (0.01) change in RIDX ,
the associated change in the conditional expectation of RSTOCK is 0.01β. The nature of the individual stock returns’
relationship to the overall market returns depends upon β as follows:
• β = 0: The expected stock return is not related to the market return.
• β = 1: The expected stock return moves exactly in tandem with the market return. For instance, if the market return
is 0.02 higher, the expected stock return is also 0.02 higher.
• 0 < β < 1: The expected stock return moves in the same direction as the market return but with a smaller magnitude.
For instance, if the market return is 0.02 higher, the expected stock return is higher by an amount less than 0.02.
• β > 1: The expected stock return moves in the same direction as the market return but with a larger magnitude. For
instance, if the market return is 0.02 higher, the expected stock return is higher by an amount greater than 0.02.
• β < 0: The expected stock return moves in the opposite direction as the market return. The expected stock return
decreases when the market return increases, and the expected stock return increases when the market return
decreases. Due to the behavior of the expected stock return in this case, such a stock is often called countercyclical.
Figure 17.3
SLR model and observed data (points (x∗, y∗) above and (x∗∗, y∗∗) below the line E(Y|X = x) = α + βx)
data point (xi , yi ) in the figure, the value of the residual ui = yi – α – βxi is equal to the vertical distance from the data
point to the α + βx line, with points above the line having positive residual values and points below the line having
negative residual values.
To develop estimators of the SLR model parameters, we use the following proposition which states how the
parameters α and β are related to the population descriptive statistics of the random variables (Y, X). The key idea
is that, once the parameters are expressed in terms of population descriptive statistics, estimators can be developed by
using sample descriptive statistics in place of population descriptive statistics.
Proposition 17.2. If the SLR model holds and σX2 > 0, then the slope β and intercept α are related to the population
descriptive statistics as follows:
β = σXY/σX²  or, equivalently,  β = ρXY (σY/σX),
and
α = µY – βµX = µY – (σXY/σX²) µX.
The assumption that σX² > 0 means that X is not constant and, since σX² is in the denominator of the β expression,
ensures that β is a well-defined parameter. To show the first result for β, note that
Cov(X, Y) = Cov(X, α + βX + U) = Cov(X, α) + Cov(X, βX) + Cov(X, U) = βVar(X).
The last equality follows from Cov(X, α) = 0, since α is constant, and Cov(X, U) = 0, which is implied by the exogeneity
assumption E(U|X) = 0. Dividing both sides by Var(X), which is possible since Var(X) is assumed to be positive,
yields β = Cov(X, Y)/Var(X) = σXY/σX². The second result for β follows from the relationship between a population
covariance and a population correlation: since σXY = ρXY σX σY, it follows that β = σXY/σX² = ρXY (σY/σX).
Based upon the least-squares estimates α̂ and β̂, the estimated regression line is
Ê(Y|X = x) = α̂ + β̂x,
where Ê(Y|X = x) denotes the estimate of the true conditional expectation E(Y|X = x). Since the formula for α̂ can be
re-written as
ȳ = α̂ + β̂x̄,
it follows that the point of sample means (x̄, ȳ) falls on the estimated regression line. That is, ȳ is the estimated
conditional expectation of Y given that X is equal to the sample mean x̄.
Proposition 17.3. If the SLR model holds and s²X > 0, the least-squares slope estimator
β̂XY = sXY/s²X = rXY (sY/sX)
and the least-squares intercept estimator
α̂XY = Ȳ – β̂XY X̄
are consistent estimators of β and α, respectively.
Consistency follows from the continuous mapping theorem (Proposition 14.10). β̂XY = sXY/s²X consistently estimates
β = σXY/σX² since sXY and s²X are consistent estimators of σXY and σX², respectively. And, α̂XY = Ȳ – β̂XY X̄ consistently
estimates α = µY – βµX since Ȳ and X̄ are consistent estimators of µY and µX, respectively. It turns out that the least-
squares estimators β̂XY and α̂XY are also unbiased estimators of β and α, respectively. But since we will only focus
on asymptotic statistical inference for the least-squares estimators, the consistency property is the important one. The
estimators are also asymptotically normal, as discussed in detail below.
The lm function, where “lm” is short for “linear model,” is the primary function used for least-squares estimation
in R. Although lm has many different arguments available, the following provides the basic syntax that handles most
linear-regression problems of interest:
• lm(formula, data, subset): Returns the results from the least-squares estimation of the model specified
by formula. The optional argument data specifies the data frame to be used and can greatly simplify how
formula is written. The optional argument subset is a logical vector that specifies the subset of data to be used
for estimation.
The lm function automatically ignores observations for which the variables used in the formula argument have
missing (NA) values. Therefore, it is not necessary to remove rows of the vector or data frame with NA values.
Here are some simple examples that illustrate the usage of the lm function:
• lm(df$y~df$x): This lm command returns the results from the least-squares estimation of a SLR model with
df$y as the outcome variable and df$x as the explanatory variable. The syntax for the formula argument has
the outcome variable before the tilde (~) and the explanatory variable after the tilde. Here, df is a data frame that
contains the two variables.
• lm(y~x, data=df): This lm command is identical to lm(df$y~df$x), with the data=df argument
indicating that the variables in the formula argument are in the df data frame.
• lm(y~x, data=df, subset=(x>10)): This lm command does the least-squares estimation, with df$y
as the outcome variable and df$x as the explanatory variable, on the subsample for which (x>10) is true.
Example 17.6 (Earnings and union status) Example 17.2 considered a SLR model for the relationship between weekly
earnings (EARNWK) and union membership (UNION):
EARNWK = α + βUNION + U.
To estimate this model, consider the sample of employed individuals (n = 2809) from the cps dataset. The outcome
variable is earnwk, and the explanatory variable is an indicator variable that indicates union membership. Although
unionstatus is a categorical variable (with categories “Non-union” and “Union”) in the dataset, for the purposes of
estimating the SLR model, we define a binary variable with union = 1 for the “Union” category and union = 0 for the
“Non-union” category. The least-squares estimates of α and β can be calculated in R using the lm function:
lm(earnwk~union, data=cps)
##
## Call:
## lm(formula = earnwk ~ union, data = cps)
##
## Coefficients:
## (Intercept) union
## 947 251
lm(earnwk~unionstatus, data=cps)
##
## Call:
## lm(formula = earnwk ~ unionstatus, data = cps)
##
## Coefficients:
## (Intercept) unionstatusUnion
## 947 251
The lm function recognizes unionstatus as a categorical variable and automatically creates an indicator
variable, which appears as unionstatusUnion in the output and is equal to 1 when unionstatus is Union
and 0 when unionstatus is Non-union. This capability extends to categorical variables with more than two
categories. For example, the command lm(earnwk~race, data=cps) would automatically create two indicator
variables based upon the categorical variable race since it has three categories. Whether we explicitly create the
indicator variable(s) based upon a categorical variable or let lm do it for us is largely a matter of preference, though
creating the indicator variables ourselves allows us to explicitly indicate which is the “omitted category” in the model.
The relationship between the least-squares estimates and the different subsample averages of y, seen in Example 17.6,
is a general property for any least-squares estimates when x is a binary variable.
Proposition 17.4. For a SLR model with a binary explanatory variable X, the least squares estimates α̂ and β̂ satisfy
α̂ = ȳ0
and
β̂ = ȳ1 – ȳ0 ,
where ȳ0 is the average of y values in the x = 0 subsample and ȳ1 is the average of y values in the x = 1 subsample.
The intercept estimate α̂ corresponds to the sample average of y values for the x = 0 subsample, whereas the slope
estimate β̂ is the difference in the sample average of y values for the x = 1 subsample and the sample average of y
values for the x = 0 subsample. Therefore, Proposition 17.4 implies that the estimated line α̂ + β̂x passes through the
two points corresponding to the sample averages of y values for the x = 0 and x = 1 subsamples. Specifically, the points
(0, ȳ0) and (1, ȳ1) lie on the estimated line, corresponding to the slope estimate β̂ = (ȳ1 – ȳ0)/(1 – 0) = ȳ1 – ȳ0. It is
important to stress that this property does not generalize to an x variable that has more than two possible values.
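As a quick numerical check of Proposition 17.4 (a sketch, assuming the 0/1 variable union defined in Example 17.6 is available in the cps data frame):
mean(cps$earnwk[cps$union == 0])   # ybar0, matching the intercept estimate (947)
mean(cps$earnwk[cps$union == 1]) - mean(cps$earnwk[cps$union == 0])   # ybar1 - ybar0, matching the slope estimate (251)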
Example 17.7 (Cigarette sales and cigarette taxes) Example 17.4 considered a SLR model for the relationship
between state-level annual cigarette sales (number of packs per capita), given by the random variable CIGSALES,
and state-level cigarette taxes (dollars per pack), given by the random variable CIGTAX:
CIGSALES = α + βCIGTAX + U.
The dataset cigdata contains data for 2019 that can be used to estimate this SLR model. Specifically, data for the 50
individual states plus the District of Columbia yields a sample with n = 51 observations. The realized outcome variable
is cigsales, and the realized explanatory variable is cigtax.
lm(cigsales~cigtax, data=cigdata)
##
## Call:
## lm(formula = cigsales ~ cigtax, data = cigdata)
##
## Coefficients:
## (Intercept) cigtax
## 55.95 -9.49
Figure 17.4
Least-squares estimated line for cigarette data
An estimate of this difference is 2β̂ = (2)(–9.49) = –18.98, or expected state-level sales that are estimated to be 18.98
packs per-capita lower among the population of states with a $3 per-pack tax as compared to the population of states
with a $1 per-pack tax. This estimate says something about expected state-level sales. It is possible that the realized
state-level sales in a specific $3-tax state may be higher than the realized state-level sales in a specific $1-tax state. But
averaging over many such states with $3 taxes and $1 taxes, the negative slope estimate says that expected state-level
sales are estimated to be lower in states with $3 taxes than in states with $1 taxes.
Here is the R code used to draw the scatter plot and least-squares regression line in Figure 17.4:
par(mfrow = c(1,1))
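The remaining plotting commands are not reproduced above; a minimal sketch consistent with the description below (the axis labels are an assumption) is:
plot(cigdata$cigtax, cigdata$cigsales, xlab = "cigtax", ylab = "cigsales")
abline(lm(cigsales~cigtax, data = cigdata))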
After the scatter plot is drawn with the plot function, the least-squares regression line is drawn by using the lm
regression itself as the argument of the abline function. The abline function determines the intercept and slope of
the regression line from the lm regression results.
Example 17.8 (Monthly stock returns and the overall market) Example 17.5 introduced a SLR model of the
relationship between the returns of an individual stock and the returns of a market index:
RSTOCK = α + βRIDX + U.
We estimate the parameters of this model using a sample from the sp500 dataset, which has n = 364 monthly
observations. The variable IDX contains the monthly returns for the S&P 500 index, which we use as the realized
values of the random variable RIDX from the model above. Since the model can be used for any individual stock, let’s
use Home Depot (HD) as the first one to examine.
lm(HD~IDX, data=sp500)
##
## Call:
## lm(formula = HD ~ IDX, data = sp500)
##
## Coefficients:
## (Intercept) IDX
## 0.00873 1.02045
Figure 17.5
Least-squares estimated line for Home Depot versus market index
monthly return. For Bank of America, the intercept estimate of α̂ = 0.0014 indicates that the estimated expected monthly
return of Bank of America is 0.0014, or 0.14%, when the market index monthly return is zero.
Figure 17.6
Least-squares estimated line (the true line E(Y|X = x) = α + βx from the SLR model versus the estimated line Ê(Y|X = x) = α̂ + β̂x, with fitted value ŷi = α̂ + β̂xi at the point (xi, yi))
out to be less (i.e., a flatter slope) than the true β. Looking at the specific point (xi , yi ) highlighted in the figure, the
fitted value ŷi = α̂ + β̂xi is read off the estimated line, with the realized outcome yi greater than the fitted value ŷi and
the point being above the estimated line.
ŷi = 55.95 – 9.49xi
and
ûi = yi – ŷi = yi – (55.95 – 9.49xi).
These fitted values and estimated residuals are calculated by the lm function in R. Specifically, if the results from
lm estimation are stored in a variable called results, the fitted values and estimated residuals are contained in
results$fitted.values and results$residuals, respectively:
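For the cigarette regression of Example 17.7, a minimal sketch consistent with the variable names uhat and yhat used in Example 17.10 is:
results <- lm(cigsales~cigtax, data = cigdata)
yhat <- results$fitted.values   # fitted values
uhat <- results$residuals       # estimated residuals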
The following table shows fitted values ŷi and estimated residuals ûi for the first seven states in the data:
State xi (cigtaxi ) yi (cigsalesi ) ŷi ûi
Alaska 2.00 30.4 37.0 –6.6
Alabama 0.675 53.1 49.5 3.6
Arkansas 1.15 47.6 45.0 2.6
Arizona 2.00 20.7 37.0 –16.3
California 2.87 15.8 28.7 –12.9
Colorado 0.84 30.4 48.0 –17.6
Connecticut 4.35 21.9 14.7 7.2
The negative slope estimate is reflected by the fact that the fitted values are larger in states with lower tax rates. For
example, Alabama has a low tax rate of $0.675 per pack and a high fitted value of 49.5 packs per-capita, whereas
Connecticut has a high tax rate of $4.35 per pack and a low fitted value of 14.7 packs per-capita. The fitted value ŷi
can be thought of as an in-sample prediction since it’s an estimate of the conditional expectation of the outcome given
that the explanatory variable is equal to xi . For Alabama, the fitted value of 49.5 is what the estimates say we should
expect, on average, for a state having a tax rate of $0.675. The actual observed outcome for Alabama is yi = 53.1 packs
per-capita, which is 3.6 higher than the fitted value or in-sample prediction. The 3.6 value is the estimated residual ûi
for Alabama, which can also be thought of as an in-sample prediction error. For Connecticut, the observed outcome
is 21.9 packs per-capita, which is 7.2 higher than the fitted value of 14.7. That is, the fitted value ŷi = 14.7 packs per-
capita is what would be expected, on average, for a state with a tax rate xi = 4.35, and the estimated residual ûi = 7.2
indicates that the observed outcome is 7.2 packs per-capita higher than the in-sample prediction.
Estimation of the SLR model is often called least-squares estimation. The “least-squares” descriptor is used since
an important property of the estimates α̂ and β̂ is that they minimize a summation that involves squared estimated
residuals. This least-squares property is stated in the following proposition:
Proposition 17.6. If s²x > 0, the least-squares estimates α̂ = ȳ – β̂x̄ and β̂ = sxy/s²x are the values of a and b that minimize
the function
S(a, b) = Σᵢ₌₁ⁿ (yi – a – bxi)².
The values a and b can be thought of as guesses for the intercept α and the slope β, respectively. Then, the term
yi – a – bxi is an estimated residual associated with those guesses. The function S(a, b) is a summation of the squared
values of these estimated residuals over all the sample observations. Why do we want to minimize the function S(a, b)?
The intuition is that “good guesses” for a and b are associated with an estimated regression line that goes through the
data points in such a way that the vertical distances from the outcomes (yi ) to the estimated line (a + bxi ) are relatively
small. The proof of Proposition 17.6 uses partial derivatives, and the interested reader can refer to the endnote.57
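As a numerical illustration (a sketch, not part of the book's scripts), a general-purpose minimizer applied to S(a, b) recovers the least-squares estimates from Example 17.7:
S <- function(par, y, x) sum((y - par[1] - par[2]*x)^2)   # par = c(a, b)
optim(c(0, 0), S, y = cigdata$cigsales, x = cigdata$cigtax)$par
## approximately (55.95, -9.49), matching the lm estimates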
Evaluated at the least-squares estimates, the function S(a, b) from Proposition 17.6 is
S(α̂, β̂) = Σᵢ₌₁ⁿ (yi – α̂ – β̂xi)² = Σᵢ₌₁ⁿ ûᵢ²,
which is the sum of squared estimated residuals. This summation provides a measure of the overall noise in the yi
values that is not explained by the estimated regression line, but it is not normalized by the number of observations
and, therefore, can be large simply because of a large sample size. A more useful measure is one that normalizes by
the sample size n. The following definition introduces the residual variance estimate, which is based upon the sum
of squared estimated residuals but also normalizes by the sample size:
Definition 17.4 The residual variance estimate is σ̂U² = (1/(n – 2)) Σᵢ₌₁ⁿ ûᵢ², and the residual standard deviation
estimate is σ̂U = √((1/(n – 2)) Σᵢ₌₁ⁿ ûᵢ²).
The division by n – 2, rather than n or n – 1, used in these two definitions is discussed below.
The following proposition summarizes some important properties of the estimated residuals and the residual
variance estimate:
Proposition 17.7. (Properties of estimated residuals) The estimated residuals ûi = yi – α̂ – β̂xi , based upon the least-
squares estimates α̂ and β̂, have the following properties:
(i) The sample average of the estimated residuals is zero: (1/n) Σᵢ₌₁ⁿ ûᵢ = 0.
(ii) The sample correlation between the values of the explanatory variable and the estimated residuals is zero:
rxû = 0.
(iii) The sample correlation between the fitted values and the estimated residuals is zero:
rŷû = 0.
(iv) The residual variance estimate σ̂U² = (1/(n – 2)) Σᵢ₌₁ⁿ ûᵢ² is a consistent (and unbiased) estimate of σU², the
unconditional variance of the random variable U.
(v) The residual standard deviation estimate σ̂U = √((1/(n – 2)) Σᵢ₌₁ⁿ ûᵢ²) is a consistent estimate of σU, the
unconditional standard deviation of the random variable U.
Properties (i) and (ii) follow directly from the minimization problem in Proposition 17.6 (see endnote 57).
Property (i) says that, on average, the difference between the realized outcome yi and its fitted value ŷi is equal to
zero. Equivalently, property (i) says that, on average, the fitted values are equal to the sample mean ȳ:
(1/n) Σᵢ₌₁ⁿ ŷᵢ = (1/n) Σᵢ₌₁ⁿ yᵢ – (1/n) Σᵢ₌₁ⁿ ûᵢ = (1/n) Σᵢ₌₁ⁿ yᵢ = ȳ.
Thinking of the fitted value ŷi as an in-sample prediction of the outcome Y associated with X = xi , the residual ûi can be
thought of as a prediction error. This property, which says that the average prediction error is equal to zero, is desirable
since a non-zero average prediction error would indicate that the estimation is systematically over- or under-estimating
the expected outcomes. Property (ii) says that the explanatory variable xi and the estimated residual ûi are uncorrelated.
Knowing whether xi is above or below the sample mean x̄ does not provide information about whether ûi tends to be
above or below its sample mean (which is zero). This property is desirable since a non-zero correlation between xi
and ûi would indicate that there is information contained in the xi values that is not being utilized. For example, if the
correlation between xi and ûi were positive, it would imply that ûi tends to be positive for higher values of xi (xi > x̄) and
negative for lower values of xi (xi < x̄), meaning a line with a larger slope would provide a better fit than the estimated
regression line.
Property (ii) is a sample counterpart to the population property that X and U are uncorrelated, which is implied
by the exogeneity assumption E(U|X) = 0. That said, the fact that xi and ûi are uncorrelated is a property of the least-
squares estimates that holds regardless of whether or not the exogeneity assumption is true. It might be tempting to try
to test whether X and U are uncorrelated by looking at the sample correlation between xi and ûi . Unfortunately, such
an approach is not useful since that sample correlation is always zero, even if X and U are correlated.
Property (iii) follows directly from property (ii) since ŷi is a perfect linear function of xi . Specifically, the covariance
between fitted values and estimated residuals is
sŷû = Cov(α̂ + β̂x, û) = β̂Cov(x, û) = 0,
which implies rŷû = 0.
Property (iv) provides a way to estimate the overall noise or variation contained in the unobservable U. The
unconditional variance of U is σU², and the estimate σ̂U² = (1/(n – 2)) Σᵢ₌₁ⁿ ûᵢ² is consistent, getting arbitrarily close
to σU² as the sample size increases. The 1/(n – 2) scaling is usually used in practice since it also leads to σ̂U² being
unbiased. As compared to the 1/(n – 1) scaling used for a sample variance estimate, the “2” in the 1/(n – 2) scaling
accounts for the two estimates α̂ and β̂ needed to calculate the ûᵢ values. Since we focus on the asymptotic properties
of least-squares estimation, the use of the 1/(n – 2) scaling, rather than say a 1/n scaling, becomes inconsequential for
a large sample size n. As a summary measure, the estimated residual variance σ̂U² may not be ideal since its units are
in the units of y squared. On the other hand, the estimated residual standard deviation σ̂U is a measure that is in the
units of y and, therefore, more easily interpretable. Property (v) states that σ̂U is consistent, so that it gets arbitrarily
close to σU as the sample size increases.
The properties of estimated residuals are illustrated using the cigarette sales-tax regression example:
Example 17.10 (Cigarette sales and cigarette taxes) Example 17.9 calculated the estimated residuals and fitted values
from least-squares estimation in R, storing them as the variables uhat and yhat. The following R output confirms
that the sample average of estimated residuals is equal to zero (property (i)) and, equivalently, that the sample average
of fitted values is equal to the sample average of the outcome variable:
mean(uhat)
## [1] 0.00000000000000019355
mean(cigdata$cigsales)
## [1] 38.82
mean(yhat)
## [1] 38.82
The following R output confirms that the correlation between the estimated residuals and the explanatory variable
is equal to zero (property (ii)), as is the correlation between the estimated residuals and the fitted values (property
(iii)):
cor(uhat,cigdata$cigtax)
## [1] -0.000000000000000023861
cor(uhat,yhat)
## [1] 0.0000000000000000066505
Finally, the following R code calculates the residual standard deviation estimate:
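The code itself is not reproduced here; a minimal sketch consistent with the two approaches described below (reusing uhat and results) is:
n <- nrow(cigdata)
sqrt(sum(uhat^2)/(n-2))    # direct formula for the residual standard deviation estimate
summary(results)$sigma     # equivalently, from the stored lm results
sd(cigdata$cigsales)       # for comparison: the sample standard deviation of the outcome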
The residual standard deviation estimate is σ̂U ≈ 12.52 packs per-capita. To get a sense of how large this residual
standard deviation is, it can be compared to the standard deviation of the outcome variable, which is sy ≈ 16.65 packs
per-capita. The estimate σ̂U is determined in two ways by the code above. The first way is to use the formulas for σ̂U2
and σ̂U directly, based upon the estimated residual ûi values. The second way is to access the stored results from lm
estimation using the summary(results)$sigma command. The expression summary(results) provides a
nice summary view of the lm estimation results, and some of its components (like $sigma for the residual standard
deviation estimate) can be easily accessed. Here is the output from summary(results) for this example:
summary(results)
##
## Call:
## lm(formula = cigsales ~ cigtax, data = cigdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.92 -8.10 -0.86 5.01 39.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.95 3.24 17.25 < 0.0000000000000002 ***
## cigtax -9.49 1.51 -6.28 0.000000088 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.5 on 49 degrees of freedom
## Multiple R-squared: 0.446,Adjusted R-squared: 0.434
## F-statistic: 39.4 on 1 and 49 DF, p-value: 0.0000000875
This output contains many elements that have not yet been discussed, but the σ̂U ≈ 12.52 estimate can be seen near
the end, with the line beginning Residual standard error.
Working directly from the definition for estimated residuals, ûi = yi – ŷi , the outcome variable yi can be written as
yi = ŷi + ûi .
This equation shows that yi can be decomposed into two parts, the fitted value ŷi = α̂ + β̂xi , which is the part of the
outcome that is directly related to the explanatory variable xi , and the estimated residual ûi , which is the part of
the outcome that is uncorrelated with the explanatory variable xi . This decomposition can be used to determine how
well least-squares estimation, based upon the explanatory variable xi , explains the variation in the yi outcomes. From
property (iii) of Proposition 17.7, we know that the fitted values ŷi and estimated residuals ûi are uncorrelated with
each other (sŷû = 0), meaning the sample variance of y is
s²y = s²ŷ + s²û.
Thus, the variation in the outcome variable yi is also decomposed into two parts: the variation explained by the
explanatory variable xi, given by s²ŷ = (1/(n – 1)) Σᵢ₌₁ⁿ (ŷᵢ – ȳ)², and the variation left unexplained by the explanatory
variable, given by s²û = (1/(n – 1)) Σᵢ₌₁ⁿ ûᵢ². (s²û and σ̂U² are slightly different since the former has a 1/(n – 1) scaling
and the latter has a 1/(n – 2) scaling.) Therefore, the fraction of the variation in the outcome variable that is explained
by the explanatory variable is equal to
s²ŷ / s²y,
which can also be written as
1 – s²û / s²y
since s²y = s²ŷ + s²û implies s²ŷ/s²y = (s²y – s²û)/s²y = 1 – s²û/s²y. This measure of overall regression fit is known as the
R-squared value.
Definition 17.5 The R-squared value associated with least-squares estimation of the SLR model, denoted R², is
R² = s²ŷ/s²y = 1 – s²û/s²y.
For instance, if R2 = 0.24, we say that “the explanatory variable explains 24% of the variation in the outcome
variable.” It is important to note that the sample size n does not have a direct impact on R2 . While having a very
large sample improves the precision of the least-squares estimates, there is no reason to expect that R² increases for
larger n. A large sample still has the same inherent residual noise as a smaller sample. In fact, looking at the expression
for R², we see that R² = 1 – s²û/s²y should get arbitrarily close to 1 – σU²/σY² as n → ∞.
The terminology “R-squared” comes from the fact that R² is equal to the square of the sample correlation between
the outcomes yi and the fitted values ŷi, as stated in the following proposition:
Proposition 17.8. The R-squared value is equal to the square of the correlation between outcomes yi and fitted values
ŷi:
R² = r²yŷ.
For simple linear regression, it is also the case that R² = r²yx.
The result of Proposition 17.8 can be shown as follows:
r²yŷ = s²yŷ / (s²y s²ŷ) = (s²ŷ)² / (s²y s²ŷ) = s²ŷ/s²y = R²,
where the first equality follows from the definition of correlation and the second equality follows from the fact that
syŷ = sŷŷ + sûŷ = s²ŷ + 0 = s²ŷ. Since ŷ is a linear transformation of x, it also follows that r²yŷ = r²yx and, therefore, R² = r²yx.
Since the correlation ryŷ is between –1 and 1 (inclusive), it follows from Proposition 17.8 that
0 ≤ R² ≤ 1.
At one extreme, R² = 0 corresponds to a case where there is no correlation between yi and ŷi, which can only happen
if β̂ = 0. Graphically, this case has a flat estimated regression line, with α̂ = ȳ and β̂ = 0. For R² = 0, the explanatory
variable xi explains none of the variation in the outcome yi. At the other extreme, R² = 1 corresponds to a case where
there is perfect correlation, either positive (ryŷ = 1) or negative (ryŷ = –1). Graphically, the estimated regression line has
β̂ ≠ 0 and passes exactly through all of the sample observation points (xi, yi). For R² = 1, the explanatory variable xi
explains 100% of the variation in the outcome yi.
For values of R² strictly between 0 and 1, there is correlation between yi and ŷi but not perfect correlation. A
regression with a larger magnitude of ryŷ has a larger R². The sign of the correlation ryŷ does not impact R², so for
instance the correlations ryŷ = 0.8 and ryŷ = –0.8 both correspond to R² = 0.8² = 0.64.
Example 17.11 (Cigarette sales and cigarette taxes) Continuing Example 17.7, the decomposition of the outcome
variance s²y = s²ŷ + s²û involves the following sample variances:
s²y = 277.155, s²ŷ = 123.522, and s²û = 153.632.
The R-squared value is
R² = 123.522/277.155 ≈ 0.446, or 44.6%,
which indicates that state-level cigarette taxes explain 44.6% of the variation in state-level cigarette sales. Therefore,
55.4% of the variation in state-level cigarette sales is left unexplained by state-level cigarette taxes. Another way
to determine the R-squared value, which doesn’t require the least-squares estimates at all, is to square the sample
correlation between cigsales and cigtax. This sample correlation is rcigsales,cigtax = –0.6676 for the observed sample,
meaning R2 = (–0.6676)2 ≈ 0.446. The following R code illustrates the various equivalent methods of calculating R2
for this example:
y <- cigdata$cigsales
x <- cigdata$cigtax
var(yhat)/var(y)
## [1] 0.44568
1-var(uhat)/var(y)
## [1] 0.44568
cor(y,yhat)^2
## [1] 0.44568
cor(y,x)^2
## [1] 0.44568
summary(results)$r.squared
## [1] 0.44568
The first two calculations of R2 are based upon Definition 17.5. The second two calculations are based upon
Proposition 17.8. The last expression accesses the R2 value directly from the lm regression results.
Example 17.12 (Monthly stock returns and the overall market) Continuing Example 17.8, we consider the overall fit
of the least-squares estimation using an individual stock return as the outcome y and the market index return (IDX) as
the explanatory variable x. For the six stocks considered in Example 17.8, the following table reports the R-squared
value and the estimated residual standard deviation associated with the least-squares estimates:
Stock                     R²     σ̂U
Home Depot (HD)           0.336  0.0602
Lowe’s (LOW)              0.256  0.0791
Bank of America (BAC)     0.350  0.0850
Wells Fargo (WFC)         0.289  0.0689
Marathon Oil (MRO)        0.251  0.1048
ConocoPhillips (COP)      0.285  0.0692
The R-squared values are fairly similar across the six individual stocks. The lowest R-squared is for Marathon Oil,
indicating that 25.1% of the variation in Marathon Oil’s monthly returns is explained by S&P 500 monthly returns,
and the highest R-squared is for Bank of America, indicating that 35.0% of the variation in Bank of America’s monthly
returns is explained by S&P 500 monthly returns. For these two stocks, since R² is equal to the square of the sample
correlation between x and y, the corresponding sample correlations are rMRO,IDX = √0.251 ≈ 0.501 and rBAC,IDX =
√0.350 ≈ 0.592. (These sample correlations must be positive, rather than negative, because the least-squares slope
estimates reported in Example 17.8 were both positive.)
Here is the R code to produce the table of R2 and σ̂U values above:
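The script itself is not reproduced here; a minimal sketch that loops over the six tickers (column names in sp500 assumed to match Example 17.8) is:
stocks <- c("HD", "LOW", "BAC", "WFC", "MRO", "COP")
for (s in stocks) {
  res <- summary(lm(reformulate("IDX", response = s), data = sp500))
  cat(s, round(res$r.squared, 3), round(res$sigma, 4), "\n")
}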
Definition 17.6 The residuals of an SLR model are homoskedastic if the conditional variance Var(U|X = x) is
constant and does not depend upon x. Equivalently, since σU2 is the unconditional variance of U, the residuals are
homoskedastic if Var(U|X = x) = σU2 for all x. The residuals are said to exhibit homoskedasticity.
Figure 17.7 provides a graphical representation of the data-generating process for a SLR model with homoskedastic
residuals. The figure shows three possible values (x∗ , x∗∗ , x∗∗∗ ) for the random variable X. At each of these three values,
the conditional expectation E(Y|X = x) is the value for which the associated vertical line passes through the E(Y|X =
x) = α + βx line. The realized y outcome depends upon this conditional expectation but also the realized residual. The
realized residual is a draw of U, conditional upon the value of X. For X = x∗ , this conditional distribution is shown as
the rotated pdf curve for U conditional on X = x∗ . This pdf curve is centered at the SLR model line, and its variance
describes the variance of residual draws associated with X = x∗. Similarly, for x∗∗ and x∗∗∗, the rotated pdf curves shown
in the figure represent the distributions of U conditional on X = x∗∗ and X = x∗∗∗, respectively. In Figure 17.7, the shape of
the conditional distribution of U is exactly the same for X = x∗ , X = x∗∗ , and X = x∗∗∗ , meaning the conditional variance
Var(U|X = x) is the same for these three possible values of X. Although the constant conditional residual variance has
Figure 17.7
SLR model with homoskedastic residuals (pdfs of U given X = x∗, x∗∗, and x∗∗∗, centered on the line E(Y|X = x) = α + βx)
only been shown for three values in the figure, the same conditional residual variance would arise for any other value
of X if the residuals are homoskedastic.
Definition 17.7 The residuals of an SLR model are heteroskedastic if the conditional variance Var(U|X = x) is non-
constant and depends upon x. The residuals are said to exhibit heteroskedasticity.
Figure 17.8 shows how the data-generating process differs when the SLR model has heteroskedastic residuals. The
figure shows a situation in which the variance of the residual U depends upon the value of X, with the residual variance
increasing for larger values of X. The distribution of U conditional on X = x∗ has a lower variance, with its pdf being
more tightly distributed around the SLR line. The variance of the distribution of U conditional on X = x∗∗ is larger, as
indicated by the increased dispersion of the pdf curve, and the variance of the distribution of U conditional on X = x∗∗∗
is even larger with a more dispersed pdf.
To see how the cases of homoskedastic errors and heteroskedastic errors affect the realized sample, we artificially
create two different samples based upon the SLR model, one exhibiting homoskedasticity and one exhibiting
heteroskedasticity. Specifically, we assume that the SLR model
Y = 1 + 8X + U
describes the conditional expectation E(Y|X) = 1 + 8X for both samples, but the conditional distribution of U differs for
the two samples, as follows:
• Sample 1: Homoskedastic errors, where the distribution of U given X = x is N(0, 1)
• Sample 2: Heteroskedastic errors, where the distribution of U given X = x is N(0, 4x2 )
Figure 17.9 shows the two samples generated by the SLR model under the two assumptions on the residuals. For each
sample, the sample size is n = 500, and the x values are draws from the U(0, 1) distribution. The y values are generated
Figure 17.8
SLR model with heteroskedastic residuals (pdfs of U given X = x∗, x∗∗, and x∗∗∗, centered on the line E(Y|X = x) = α + βx)
as y = 1 + 8x + u, where u is a draw from the distribution of U given X = x, as specified for the two samples above.
The scatter plot on the top corresponds to Sample 1 with homoskedastic errors, and the scatter plot on the bottom
corresponds to Sample 2 with heteroskedastic errors. A solid line corresponding to the SLR line E(Y|X) = 1 + 8X is
drawn for reference in both plots. For the sample with homoskedastic errors, the vertical spread of data points around
the SLR line stays roughly the same throughout the full range of possible x values. On the other hand, for the sample
with heteroskedastic errors, the vertical spread of data points around the SLR line changes a lot over the range of x
values, with very low conditional variance of the residuals for x values near zero and a steadily increasing conditional
variance of the residuals as x increases.
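To make the two data-generating processes concrete, here is a minimal R sketch of how such samples can be generated (the seed and plotting details are our own illustrative choices, not necessarily those used to produce Figure 17.9):

set.seed(329)                          # assumed seed, for reproducibility
n <- 500
x <- runif(n)                          # x values drawn from U(0,1)
# Sample 1: homoskedastic errors, U given X = x is N(0, 1)
y1 <- 1 + 8*x + rnorm(n, mean = 0, sd = 1)
# Sample 2: heteroskedastic errors, U given X = x is N(0, 4x^2), so sd = 2x
y2 <- 1 + 8*x + rnorm(n, mean = 0, sd = 2*x)
# scatter plots with the SLR line E(Y|X) = 1 + 8X drawn for reference
plot(x, y1, main = "Sample 1 (homoskedastic)"); abline(a = 1, b = 8)
plot(x, y2, main = "Sample 2 (heteroskedastic)"); abline(a = 1, b = 8)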
Practitioners care about the homoskedasticity/heteroskedasticity of residuals for two main reasons. First, the
appropriate way to calculate standard errors associated with the least-squares estimates turns out to depend upon
whether the residuals are homoskedastic or heteroskedastic. This issue is discussed in more detail below. Second, if we
are interested in determining a predictive interval for Y given X = x, this interval clearly depends upon the conditional
variance of U given X = x. For instance, looking at Sample 2 in Figure 17.9, a predictive interval for Y associated with
a low value of x should be much narrower than a predictive interval for Y associated with a high value of x, since
the conditional variance of the residuals is much higher for higher values of x. We return to this idea in Section 18.8,
where we discuss predictive intervals based upon least-squares estimation.
The formulas for the asymptotic variances Vα/n and Vβ/n in the general case of heteroskedasticity are quite complicated and, therefore, omitted from our discussion. That said, statistical packages like R are able to calculate standard errors based upon these formulas, and these standard errors are known as heteroskedasticity-robust standard errors or sometimes, more concisely, as robust standard errors. For the least-squares estimates α̂ and β̂, which are the realizations of the least-squares estimators for the observed sample, these robust standard errors are denoted se(α̂) and se(β̂).
[Figure 17.9: Homoskedasticity versus heteroskedasticity. Top panel: scatter plot of Sample 1 (homoskedastic); bottom panel: scatter plot of Sample 2 (heteroskedastic). Each panel includes the SLR line for reference.]
The asymptotic-variance formulas in Proposition 17.10 highlight the features of the population and sample that
affect the precision of the least-squares estimators. To obtain standard errors for the case of homoskedasticity, sample
descriptive statistics can be plugged in for population statistics, so that
$$\operatorname{se}(\hat{\alpha}) = \sqrt{\frac{\widehat{V}_{\alpha}}{n}} = \sqrt{\frac{s_{\hat{u}}^2}{n}\left(1 + \frac{\bar{x}^2}{s_x^2}\right)} = \frac{s_{\hat{u}}}{\sqrt{n}}\sqrt{1 + \frac{\bar{x}^2}{s_x^2}}$$

and

$$\operatorname{se}(\hat{\beta}) = \sqrt{\frac{\widehat{V}_{\beta}}{n}} = \sqrt{\frac{s_{\hat{u}}^2}{n s_x^2}} = \frac{s_{\hat{u}}}{\sqrt{n}\, s_x}.$$
Let’s focus on the slope estimate β̂ first, as its standard error expression is a bit simpler. There are three factors that
affect the standard error of β̂:
• Sample size: Larger n leads to a smaller standard error se(β̂), so that having more observations is better for precision. The standard error has the usual 1/√n scaling for √n-consistent and asymptotically normal estimators. For example, quadrupling the sample size should lead to a standard error that is roughly half as large.
• Residual noise: A smaller residual variance σU2, as estimated by s2û, leads to a smaller standard error se(β̂). Although not within our control, less noise in the residuals is better for precision.
• Variation of the explanatory variable: A larger variance σX2 of the explanatory variable, as estimated by s2x, leads
to a smaller standard error se(β̂). Since the slope β represents the change in E(Y|X = x) associated with changes
in x, the intuition is that it is better to have observations that exhibit a wide range of x values. In cases where the
distribution of X may be in our control, as might be the case in choosing a survey population (e.g., if age is the
explanatory variable, choosing to interview people between 25 and 45 rather than between 30 and 40), a choice
with higher variance σX2 leads to a lower standard error se(β̂), holding n and σU2 fixed.
Now, looking at the formula for the standard error of the intercept estimate α̂, these three factors affect se(α̂) in the same way. Larger n, smaller residual variance σU2, and larger explanatory variable variance σX2 are all associated with lower se(α̂) or more precise intercept estimates. There is a fourth factor for se(α̂), which is that a larger value of µ2X or x̄2 leads to a larger standard error se(α̂). Recall that α̂ is an estimate of E(Y|X = 0). When the average X value is far from zero (i.e., µX very negative or µX very positive), as indicated by µ2X being large, this relationship says that it becomes more difficult to precisely estimate the intercept α. The best-case scenario, for se(α̂), is when x̄ = 0, in which case the standard-error formula simplifies to se(α̂) = sû/√n, which is similar to the formula for the standard error of a sample mean.
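To make the standard-error formula concrete, the following sketch (with simulated data of our own, and taking s2x to be the 1/n-scaled sample variance) computes se(β̂) directly and checks it against lm's output:

set.seed(1)
n <- 200
x <- runif(n)
y <- 1 + 8*x + rnorm(n)                      # homoskedastic residuals
fit <- lm(y ~ x)
uhat <- resid(fit)
s2_uhat <- sum(uhat^2) / (n - 2)             # residual variance estimate
s2_x <- mean((x - mean(x))^2)                # sample variance of x (1/n scaling)
se_beta <- sqrt(s2_uhat / (n * s2_x))        # se(beta-hat) = s_uhat / (sqrt(n) s_x)
c(manual = se_beta,
  lm = summary(fit)$coefficients["x", "Std. Error"])   # the two agree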
Unfortunately, the built-in R function lm does not calculate heteroskedasticity-robust standard errors and instead reports standard errors based upon the restrictive assumption of homoskedastic residuals. Thankfully, there are several R packages with functions for calculating robust standard errors. We will use a package called estimatr, whose least-squares regression function lm_robust will be illustrated in the examples throughout this chapter and Chapter 18. For now, here is the code that installs and loads the estimatr package:
install.packages("estimatr")
library(estimatr)
Example 17.13 (Cigarette sales and cigarette taxes) Example 17.7 reported the least-squares estimates for the SLR
model using cigarette sales as the outcome variable and cigarette tax as the explanatory variable:
α̂ = 55.95 and β̂ = –9.49.
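The robust standard errors for this model come from estimating it with lm_robust and storing the result in results. A sketch of the call, using the same dataset and variable names as note 59 (the printed output is omitted here):

results <- lm_robust(cigsales ~ cigtax, data = cigdata)
summary(results)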
The commands results and summary(results) provide alternative ways of displaying the lm_robust
estimation results, with the latter providing additional information like the R-squared value. In addition to the
parameter estimates and standard errors, the output includes additional columns for hypothesis testing and confidence
intervals that will be discussed in the next section. In the summary(results) output, the HC2 specified as the
“standard error type” is the method that lm_robust uses to calculate the robust standard errors.
When results is assigned to the lm_robust regression function call, lots of useful information about the
regression is stored in results, including the following:
• results$res_var: residual variance estimate σ̂U2
• results$r.squared: R-squared value
• results$fitted.values: a vector (of length n) with the fitted values
• results$coefficients: a vector with the estimates, in the same order as the output
• results$std.error: a vector with the standard errors, in the same order as the output
Here are some examples of how these quantities can be accessed after the regression above:
# R-squared value
results$r.squared
## [1] 0.4456806
# residual standard deviation estimate
sqrt(results$res_var)
## [1] 12.52068
# slope estimate and its standard error
results$coefficients[2]
## cigtax
## -9.487131
results$std.error[2]
## cigtax
## 1.063502
Example 17.14 (Monthly stock returns and the overall market) Example 17.8 reported the least-squares estimates
for SLR models that related monthly returns of six individual stocks to monthly returns of the S&P 500 index. The
following table augments those estimates with their heteroskedasticity-robust standard errors reported in parentheses:
α̂ (se) β̂ (se)
Home Depot (HD) 0.0087 (0.0032) 1.020 (0.082)
Lowe’s (LOW) 0.0117 (0.0042) 1.107 (0.104)
Bank of America (BAC) 0.0014 (0.0046) 1.489 (0.146)
Wells Fargo (WFC) 0.0054 (0.0039) 1.048 (0.131)
Marathon Oil (MRO) –0.0006 (0.0053) 1.449 (0.203)
ConocoPhillips (COP) 0.0028 (0.0037) 1.042 (0.116)
The results from this table are obtained using the lm_robust function for each of the six SLR models. For example,
the first row corresponds to the SLR model with HD as the outcome variable and IDX as the explanatory variable,
with results given by the following code:
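A sketch of that call, using the same sp500 variable names that appear in the examples of Chapter 18 (output omitted):

results <- lm_robust(HD ~ IDX, data = sp500)
summary(results)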
Based upon the asymptotic normality of the least-squares estimators, confidence intervals for α and β are

$$(\hat{\alpha} - z_{\alpha/2}\operatorname{se}(\hat{\alpha}),\ \hat{\alpha} + z_{\alpha/2}\operatorname{se}(\hat{\alpha}))$$

and

$$(\hat{\beta} - z_{\alpha/2}\operatorname{se}(\hat{\beta}),\ \hat{\beta} + z_{\alpha/2}\operatorname{se}(\hat{\beta})),$$

respectively. For example, the 95% confidence interval for β is (β̂ – 1.96 se(β̂), β̂ + 1.96 se(β̂)), and the 90% confidence interval for β is (β̂ – 1.645 se(β̂), β̂ + 1.645 se(β̂)).
Example 17.15 (Cigarette sales and cigarette taxes) Using the standard errors of the least-squares estimates from
Example 17.13, we can construct confidence intervals for α and β. The 95% confidence interval for α is
(α̂ – z0.025 se(α̂), α̂ + z0.025 se(α̂)) = (55.95 – (1.96)(2.924), 55.95 + (1.96)(2.924)) ≈ (50.07, 61.82).
It can be said with 95% confidence that the intercept parameter α, or equivalently the conditional expectation
E[CIGSALES|CIGTAX = 0], is between 50.07 and 61.82 packs per capita. The 95% confidence interval for β is
(β̂ – z0.025 se(β̂), β̂ + z0.025 se(β̂)) = (–9.49 – (1.96)(1.064), –9.49 + (1.96)(1.064)) ≈ (–11.62, –7.35).
It can be said with 95% confidence that the slope parameter β, or equivalently the change in expected cigarette sales
associated with a one-dollar increase in cigarette taxes, is between –11.62 and –7.35 packs per capita.
The lm_robust function automatically calculates these confidence intervals in R. The default is 95% confidence
intervals, but the optional argument alpha can be set at other values (different from the default alpha=0.05) to get
other confidence intervals. Here is the code for calculating 95% confidence intervals and 90% confidence intervals:
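A sketch of these calls for the cigarette regression (the alpha argument of lm_robust controls the confidence level):

# 95% confidence intervals (the default, alpha = 0.05)
lm_robust(cigsales ~ cigtax, data = cigdata)
# 90% confidence intervals
lm_robust(cigsales ~ cigtax, data = cigdata, alpha = 0.10)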
Moving to hypothesis testing, suppose we want to test whether the slope parameter β is equal to some constant c,
so that the null hypothesis is
H0 : β = c.
The z-statistic for testing this null hypothesis is

$$\text{z-statistic} = \frac{\hat{\beta} - c}{\operatorname{se}(\hat{\beta})}$$
and indicates the number of standard errors that β̂ is above c (positive z-statistic) or the number of standard errors
that β̂ is below c (negative z-statistic). Applying the z-test approach from Section 16.2, the rejection rule for testing
H0 : β = c at the α level is:
• Reject H0 : β = c at the α level if |z-statistic| = |β̂ – c|/se(β̂) ≥ zα/2.
• Do not reject H0 : β = c at the α level if |z-statistic| = |β̂ – c|/se(β̂) < zα/2.
As with other asymptotically normal estimators, we can calculate a p-value based upon the z-statistic, and this p-value
can be used to test the null hypothesis at any level α. For the null hypothesis H0 : β = c, the p-value is

$$\text{p-value} = P(|Z| > |\text{z-statistic}|) = P\!\left(|Z| > \left|\frac{\hat{\beta} - c}{\operatorname{se}(\hat{\beta})}\right|\right), \text{ where } Z \sim N(0, 1).$$
Returning to the summary(results) output for the cigarette regression of Example 17.13, the first and second columns provide the estimates (α̂ in the first row, β̂ in the second row) and the standard errors of the estimates (se(α̂) in the first row, se(β̂) in the second row). The third column is the z-statistic for testing H0 : α = 0 (first row) and for testing H0 : β = 0 (second row). For the case of c = 0, the test of H0 : α = 0 has a z-statistic equal to α̂/se(α̂), and the test of H0 : β = 0 has a z-statistic equal to β̂/se(β̂). Therefore, the z-statistic values in the third column
are equal to the values in the first column (estimates) divided by the values in the second column (standard errors).
Finally, the fourth column reports the p-value associated with testing the parameter against zero, which is the p-value
for H0 : α = 0 in the first row and the p-value for H0 : β = 0 in the second row.
Seeing such a low p-value for the α parameter is to be expected since it would be surprising if we could not reject that expected cigarette sales are equal to zero when there is no cigarette tax. In fact, from its z-statistic, the estimate
α̂ is 19.14 standard errors above zero! The p-value associated with H0 : β = 0 is more interesting. This p-value is again
very small (zero to many decimal places), meaning the null hypothesis H0 : β = 0 is rejected at any level. In other
words, the slope estimate is statistically significant at any level, meaning there is strong statistical support for the
existence of a negative relationship between state-level cigarette sales and state-level cigarette taxes. The magnitude
of this relationship can be provided by using a confidence interval, like the 95% confidence interval for β provided in
Example 17.15.
We’ve already seen that results contains information about the regression, and several quantities associated with
the confidence intervals and testing are accessible after results is assigned to the lm_robust function call above:
• results$statistic: a vector with the z-statistics of the estimates, in the same order as the output
• results$p.value: a vector with the p-values for the two-sided test against 0, in the same order as the output
• results$conf.low and results$conf.high: vectors with the lower and upper endpoints of the confidence intervals, respectively
Returning to the Home Depot regression of Example 17.14, a test of the null hypothesis H0 : α = 0 is interesting here since it is a test of whether the expected Home Depot
monthly return is equal to zero when the S&P 500 monthly return is equal to zero. Based upon the p-value of 0.0069,
this null hypothesis is rejected at any level above 0.69%. The null hypothesis H0 : β = 0 is less interesting here since it
would be surprising to find no relationship (i.e., no correlation) between Home Depot returns and S&P 500 returns.
Indeed, the p-value is equal to zero (to many decimal places) since the estimated slope is 12.55 standard errors above
zero. A more interesting null hypothesis to test is H0 : β = 1. When β = 1, a 0.01 change in the S&P 500 monthly return is associated with an expected change of 0.01 in the Home Depot monthly return. The z-statistic associated with H0 : β = 1 is

$$\text{z-statistic} = \frac{\hat{\beta} - 1}{\operatorname{se}(\hat{\beta})} = \frac{1.020 - 1}{0.0817} \approx 0.245.$$
The associated p-value is 2(1 – Φ(0.245)) ≈ 0.806. Therefore, using either the z-statistic rejection rule or the p-value
rejection rule, the null hypothesis H0 : β = 1 would not be rejected at a 5% level. Based upon the p-value, H0 : β = 1
would not be rejected at any reasonable level.
Example 17.18 (Earnings and union status) Example 17.6 provided the least-squares estimates for the SLR model
with weekly earnings as the outcome variable and union membership as the explanatory variable. The more complete
results are reported below:
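A sketch of the estimation call behind these results (the data-frame name cpsemployed is illustrative, following the convention suggested in Exercise 5; output omitted):

results <- lm_robust(earnwk ~ union, data = cpsemployed)
summary(results)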
The test of H0 : α = 0 is not interesting here (why?), so we focus on the test of H0 : β = 0. Recall that
β = E(EARNWK|UNION = 1) – E(EARNWK|UNION = 0),
meaning the null hypothesis H0 : β = 0 is true if there is no difference between the expected weekly earnings of union
workers and the expected weekly earnings of non-union workers. The p-value of zero (to many decimal places)
indicates that H0 : β = 0 is rejected at any level, providing evidence that the association of weekly earnings with union
status is statistically significant. The 95% confidence interval for β is
(β̂ – z0.025 se(β̂), β̂ + z0.025 se(β̂)) = (251.2 – (1.96)(45.83), 251.2 + (1.96)(45.83)) ≈ (161.3, 341.0),
meaning we can say with 95% confidence that the true difference between the expected weekly earnings of union
workers and the expected weekly earnings of non-union workers is between $161.30 and $341.00.
Example 17.19 (A/B testing) Example 17.18 showed that the SLR model and least-squares estimation can be used to
directly test whether the expected value of an outcome variable Y differs over two subpopulations, where the binary
variable X indicates which of the two subpopulations an observation is in. In Example 17.18, the random variable
Y corresponds to weekly earnings and the random variable X to union membership (X = 1 for union member, X = 0
for non-member). The case of an A/B test, where the outcome of interest is a continuous outcome Y, also fits in this
framework. If there are two possible treatments (A or B) and X is a binary variable indicating the treatment (say, X = 1
for treatment B, X = 0 for treatment A), then
α = E(Y|X = 0) = expected value of Y for treatment A
and
α + β = E(Y|X = 1) = expected value of Y for treatment B.
This setup is a generalization of the advertising “experiment” considered in Example 17.1, where the Y random
variable was SALES and the X random variable was AD (equal to 1 for cities receiving targeted advertising and 0 for
cities not receiving targeted advertising). To test whether the expected value of Y differs for the two treatments, the
null hypothesis of interest is H0 : β = 0. The z-test involves using either the z-statistic β̂/se(β̂) or its corresponding p-value. A confidence interval for the difference between the expected outcome Y for treatment B and the expected outcome Y for treatment A is just the standard confidence interval for β.
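As an illustration, here is a minimal A/B-test sketch with simulated data (all names and numbers are our own illustrative choices):

library(estimatr)
set.seed(42)
n <- 1000
x <- rbinom(n, 1, 0.5)                 # treatment indicator: 1 = treatment B, 0 = treatment A
y <- 50 + 3*x + rnorm(n, sd = 10)      # true treatment effect is beta = 3
summary(lm_robust(y ~ x))              # z-test of H0: beta = 0 and a 95% CI for beta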
First, let’s consider the relationship between the random variables Y and X without specifying an exogeneity
assumption. To do so, the following proposition provides an important result concerning the decomposition of Y
into two parts, one which is a linear function of X and the other which is a random variable uncorrelated with X:
Proposition 17.11. If X and Y are random variables with σX2 > 0, Y can be decomposed into a linear function of X and
another random variable V such that
Y = α∗ + β ∗ X + V with Cov(X, V) = 0 and E(V) = 0.
For this decomposition, the parameter β∗ is

$$\beta^* = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)} = \frac{\sigma_{XY}}{\sigma_X^2},$$

and the parameter α∗ is

$$\alpha^* = \mu_Y - \beta^*\mu_X = \mu_Y - \frac{\sigma_{XY}}{\sigma_X^2}\,\mu_X.$$
While the equation in Proposition 17.11 looks very similar to the specification of the SLR model, it is important
to stress that the relationship described by the proposition is not a model, in the sense that no assumptions have been
made to state the relationship between Y and X. The observant reader will notice that β∗ is the population analogue of the least-squares slope estimator β̂XY = sxy/s2x, and likewise α∗ is the population analogue of the least-squares intercept estimator α̂XY.
causal parameter β. Therefore, even if a positive slope estimate β̂ is obtained, it’s still possible that there is no causal
effect of education on expected earnings (β = 0) since β̂ is providing an over-estimate.
Notes
53 It is not restrictive to assume that E(U|X) is equal to zero, rather than some other constant, since the parameter α could always be changed to yield E(U|X) = 0. For instance, if we had Y = α′ + βX + U′ with E(U′|X) = c for some constant c, the model could be re-written as Y = α + βX + U with α = α′ + c and U = U′ – c.
54 In fact, linearity is not really an assumption at all for the case of a binary explanatory variable since a line can always be drawn exactly through
the two points E(Y|X = 0), at X = 0, and E(Y|X = 1), at X = 1. Once a third value is possible for the explanatory variable, the linearity assumption
becomes important since it requires that all three conditional expectations lie along a line.
55 A slightly more complicated version of this model, which incorporates the “risk-free rate,” is the capital asset pricing model (CAPM). In the
CAPM model, the outcome variable is the stock return minus the risk-free rate (e.g., the interest rate on a Treasury bond), and the explanatory
variable is the market index return minus the risk-free rate.
56 We are being a little loose with terminology here. Consistency is a property of an estimator, so a “consistent estimate” refers to the realization
of a consistent estimator.
57 Take partial derivatives of S(a, b) with respect to a and b and set them both equal to zero:

$$\frac{\partial S(a, b)}{\partial a} = -2\sum_{i=1}^{n}(y_i - a - b x_i) = 0$$

and

$$\frac{\partial S(a, b)}{\partial b} = -2\sum_{i=1}^{n}(y_i - a - b x_i)x_i = 0.$$

The first equation implies nȳ – na – bnx̄ = 0, or a = ȳ – bx̄. The second equation implies $\sum_{i=1}^{n}(y_i - a - b x_i)x_i = 0$, and plugging in a = ȳ – bx̄ yields

$$\sum_{i=1}^{n}(y_i - \bar{y} - b(x_i - \bar{x}))x_i = 0$$

or, equivalently,

$$\sum_{i=1}^{n}(y_i - \bar{y} - b(x_i - \bar{x}))(x_i - \bar{x}) = 0.$$

Solving this last equation for b yields

$$b = \frac{\sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{s_{xy}}{s_x^2} = \hat{\beta}.$$

Finally, plugging β̂ for b in a = ȳ – bx̄ yields a = ȳ – β̂x̄ = α̂.
58 Although not explicitly stated in Proposition 17.9, the additional technical assumption that X and U have finite variances is required to apply CLT results and prove the result.
59 The interested reader can confirm that the non-robust standard errors obtained using lm are se(α̂) = 3.24 and se(β̂) = 1.51. To do so, store
the results with the command results <- lm(cigsales~cigtax, data=cigdata) and view a summary of the results, including the
standard errors, with summary(results).
Exercises
1. Consider a sample {(xi , yi )}ni=1 , where x and y are standardized variables with sample correlation rxy = 0.7.
(a) For the SLR model Y = α + βX + U, what are the least-squares estimates α̂ and β̂?
(b) Under the exogeneity assumption E(U|X) = 0, what is the interpretation of the least-squares slope estimate β̂?
2. A manufacturing company has 100 factories spread across the United States. It has purchased a new technology
that it can deploy at 30 of its factories. Assume that the company assigns the technology randomly to 30 factories. In
the subsequent year, the company collects data on total production (prodi , in thousands of units) at each factory, with
techi = 1 if the factory has the new technology and techi = 0 otherwise. Let PROD and TECH be the associated random
variables, and assume the following SLR model holds:
PROD = 10 + 3TECH + U with E(U|TECH) = 0,
with an additional assumption that U|TECH ∼ N(0, 4).
(a) Explain why the exogeneity assumption E(U|TECH) = 0 is likely to hold.
(b) What is the conditional distribution of PROD given TECH = 0? What is the conditional distribution of PROD
given TECH = 1?
(c) Determine P(PROD > 12|TECH = 1) – P(PROD > 12|TECH = 0).
(d) If factory A has the new technology and factory B does not, what is the distribution of the difference between
factory A production and factory B production?
3. Use the hrs dataset for this question. The data consist of 6,052 non-married individuals who are 50 and older.
Consider a SLR model, where the outcome variable is annual out-of-pocket medical costs (outofpocket_costs, in
dollars) during 2000, and the explanatory variable is age (age, in years).
(a) Use lm to estimate the SLR model, and store the results in hrs_results.
(b) Interpret the slope estimate β̂.
(c) What is the estimated difference in expected out-of-pocket medical costs between a 70-year-old and a 60-year-
old?
(d) What percentage of the residuals are positive? negative?
(e) Which five observations are associated with the largest magnitudes of the estimated residuals?
(f) What percentage of fitted values are negative? Is this problematic, since outofpocket_costs ≥ 0?
(g) Draw a scatter plot of outofpocket_costs versus age.
(h) Add the least-squares regression line to the plot in (g) with the command abline(hrs_results,
col="blue"). Are the residuals left-skewed, right-skewed, or approximately symmetric?
(i) Estimate the SLR model separately for men (male = 1) and women (male = 0). How do the slope estimates from
the two regressions compare?
4. Use the baseball dataset for this question. The data consist of 30 Major League Baseball teams for the 2022
regular season. Consider a SLR model, where the outcome variable is a team’s average attendance at its home games
(attend_home) and the explanatory variable is the team’s winning percentage for the season (winpct_22). A team
winning half its games has winpct_22 = 0.5, and a team winning 55% of its games has winpct_22 = 0.55.
(a) Use lm to estimate the SLR model, and store the results in mlb_results.
(b) Draw a scatter plot of attend_home versus winpct_22. Add the least-squares regression line to the plot with the
command abline(mlb_results, col="blue").
(c) What does the slope estimate β̂ say about the difference between a team with 55% winning percentage and
50% winning percentage?
(d) How much of the variation in attend_home is left unexplained by the least-squares regression?
(e) What is the estimated standard deviation of the SLR model residual?
(f) Re-run the regression using the previous year’s winning percentage (winpct_21) as the explanatory variable.
Focusing on the slope estimate and the R-squared value, how do the results compare to the original regression?
(g) Define a new outcome variable pctattend equal to attend_home divided by capacity (the size of the team’s
stadium). Estimate the SLR model with pctattend as the outcome variable and winpct_22 as the explanatory
variable. What does the slope estimate β̂ say about the difference between a team with 55% winning percentage
and 50% winning percentage?
(h) To determine the association between team payroll (payroll, in millions of dollars) and team performance
(winpct_22), run a regression with winpct_22 as the outcome variable and payroll as the explanatory variable.
What is the estimated difference in expected winning percentage between a team with a payroll of $200 million
and a team with a payroll of $150 million?
5. Use the cps dataset for this question, and focus on the sample of 2,809 employed workers. You’ll find it easiest to
create a new data frame cpsemployed for the employed workers.
(a) Use lm_robust to estimate the SLR model from Example 17.3 with earnwk as the outcome variable and
educ as the explanatory variable.
(b) Interpret the slope estimate β̂.
(c) Provide a 95% asymptotic confidence interval for the SLR slope parameter.
(d) Interpret the R-squared value.
(e) Use plot to plot the estimated residuals versus educ, and add a horizontal line at zero.
(f) From the plot in (e), do you think the residuals are homoskedastic or heteroskedastic? Explain.
(g) From the plot in (e), do you think the conditional expectation of the residuals is zero for all values of educ?
Explain.
6. Use the exams dataset for this question. You are interested in modeling exam2 performance based upon exam1
performance.
(a) Use lm_robust to estimate the SLR model with exam2 as the outcome variable and exam1 as the explanatory
variable.
(b) Interpret the slope estimate β̂.
(c) Perform a z-test of H0 : β = 0 at the 5% level. What do you conclude?
(d) Perform a z-test of H0 : β = 1 at the 5% level. What do you conclude? What is the p-value for this test?
(e) Provide an estimate of the conditional expectation E(exam2|exam1 = 80).
(f) Draw two histograms, one of the actual exam2 values and one of the fitted values from the regression. Which
histogram has lower dispersion, and why?
(g) Standardize both variables (exam1 and exam2) and re-run the regression with lm_robust.
i. Interpret the slope estimate of the new regression.
ii. How does the R-squared value compare to the R-squared value of the original regression?
iii. How does the z-statistic for testing H0 : β = 0 compare to the original regression?
7. Consider a sample {(xi , yi )}ni=1 , where x̄ = 0 and ȳ = 0. For the SLR model Y = α + βX + U, show that the least-squares estimates are

$$\hat{\alpha} = 0 \quad \text{and} \quad \hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}.$$
8. A dataset has IQ scores (iq) for 930 individuals, between the ages of 20 and 24, along with the IQ scores of their mothers (momiq) and fathers (dadiq). The average parental IQ is avgiq = (momiq + dadiq)/2. With iq as the outcome variable, the following table summarizes the results from regressions for three different SLR models: (i) momiq as the explanatory variable, (ii) dadiq as the explanatory variable, and (iii) avgiq as the explanatory variable.
SLR: iq on momiq SLR: iq on dadiq SLR: iq on avgiq
α̂ (se) 69.585 (3.394) 77.156 (3.426) 56.487 (4.331)
β̂ (se) 0.299 (0.034) 0.231 (0.035) 0.434 (0.044)
R2 0.0925 0.0522 0.1161
Each of the IQ variables has a sample mean close to 100 and a sample standard deviation close to 15.
(a) Which of the three explanatory variables explains the most variation in iq? What is the sample correlation
between this variable and iq?
(b) Interpret the slope estimate from the regression of iq on momiq.
(c) Provide a 95% asymptotic confidence interval for the slope in the SLR model with momiq as the explanatory
variable.
(d) Provide an estimate of the expected difference in IQ between an individual whose mother’s IQ is 110 and an
individual whose mother’s IQ is 105. What is the standard error of this estimate?
(e) Thinking about an observation with momiq = 120, what is the fitted value from the regression of iq on momiq?
Thinking about an observation with dadiq = 120, what is the fitted value from the regression of iq on dadiq?
Thinking about an observation with avgiq = 120, what is the fitted value from the regression of iq on avgiq?
Explain why one of these fitted values is markedly different from the others.
(f) Focus on the SLR with avgiq as the explanatory variable. Copy and execute the R code below, which draws
the (solid) estimated regression line, a dashed 45-degree line (iq = avgiq), and a dotted horizontal line at the
outcome mean (iq = 100). The graph shows a phenomenon known as regression to the mean. For avgiq values
between 110 and 130, how do the fitted values of iq compare to avgiq and the mean of iq? For avgiq values
between 70 and 90, how do the fitted values of iq compare to avgiq and the mean of iq?
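(The code block below is a sketch reconstructed from the description in part (f), drawing the lines from the reported least-squares estimates; the book's actual code may differ.)

curve(56.487 + 0.434*x, from = 55, to = 145,
      xlab = "avgiq", ylab = "iq")        # solid: estimated regression line
abline(a = 0, b = 1, lty = "dashed")      # dashed: 45-degree line (iq = avgiq)
abline(h = 100, lty = "dotted")           # dotted: horizontal line at the outcome mean (iq = 100)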
(c) To assess the causal effect of an informational campaign about a new tax-credit policy, the Internal Revenue
Service mails a postcard explaining the policy to a random subset of taxpayers. The new policy increases the
tax refund for an individual who claims it. The outcome variable is the total refund amount for an individual,
and the explanatory variable is an indicator of whether they received the postcard.
(d) To assess the causal effect of policing on crime, a researcher collects data for hundreds of cities with the
outcome variable being crimes-per-capita and the explanatory variable being police-officers-per-capita.
(e) To assess the causal effect of weather on crime, a researcher collects data for hundreds of days for a single city
with the outcome variable being daily crimes and the explanatory variable being daily rainfall total.
12. *For this question, you will conduct Monte Carlo simulations to illustrate the performance of the slope z-test in a
SLR model Y = α + βX + U. For all simulations, the marginal distribution of Y is N(0, 1) and the marginal distribution
of X is N(0, 1).
(a) Consider the case where Y and X are independent, so that α = β = 0. Conduct 10,000 Monte Carlo simulations, where for each simulation you: (i) create an i.i.d. sample {(yi , xi )} of size n = 100, with yi drawn from N(0, 1) and xi drawn from N(0, 1), (ii) estimate the least-squares regression, and (iii) test H0 : β = 0 at the 1% level, the 5% level, and the 10% level (and keep track of the results). Over the 10,000 simulations, what is the percentage of times that the test is rejected at the 1% level, the 5% level, and the 10% level? Are your findings as expected?
(b) Consider the case where Y = 0.2X + U. Follow the process in (a), except with the following change in the sample creation step (i): Draw xi from N(0, 1) and ui from N(0, 1), and generate yi = 0.2xi + √0.96 ui. (This sampling ensures that yi is drawn from a N(0, 1) marginal distribution.) Over the 10,000 simulations, what is the percentage of times that the test is rejected at the 1% level, the 5% level, and the 10% level?
(c) Same as (b), except Y = 0.5X + U and generate each yi as yi = 0.5xi + √0.75 ui. Over the 10,000 simulations, what is the percentage of times that the test is rejected at the 1% level, the 5% level, and the 10% level?
(d) Explain why the rejection rates change for (a)-(c).
(e) Without actually doing the simulations, how do you think the results would change in (a)-(c) if you tested
H0 : α = 0 instead of H0 : β = 0?
13. Use the sp500 dataset for this question. We have considered the returns of two bank stocks, Bank of America (BAC)
and Wells Fargo (WFC). Let’s add a third bank stock, M&T Bank Corporation (MTB). You are interested in predicting
WFC returns but with only one explanatory variable, so that the two possible SLR models are
WFC = α1 + β1 BAC + U1
and
WFC = α2 + β2 MTB + U2 .
(a) Use lm_robust to run the two least-squares regressions. How do β̂1 and β̂2 compare to each other? How do
the two R-squared values compare?
(b) *It’s difficult to test H0 : β1 = β2 or form a confidence interval for β1 – β2 since (i) the estimates β̂1 and β̂2 are
not independent and (ii) the regression output from (a) doesn’t provide the covariance of the two estimates.
As an alternative, use the bootstrap to estimate the standard error of β̂1 – β̂2 . Use the bootstrap with 5,000
iterations. During the b’th iteration, for b ∈ {1, 2, …, 5000}, run both regressions (on the same bootstrap
sample) and calculate β̂1b – β̂2b . Then, calculate the bootstrap standard error. What is the normal-based bootstrap
95% confidence interval for β1 – β2 ? Do you reject H0 : β1 = β2 at a 5% level?
(c) *Use the bootstrap, as in (b), to form a normal-based bootstrap 95% confidence interval for ρ2WFC,BAC –
ρ2WFC,MTB , which is the limit to which the difference in R-squared values converges for a large sample.
Rather than calculating β̂1b – β̂2b in each iteration, calculate the difference in R-squared values between the
first regression (WFC on BAC) and the second regression (WFC on MTB).
(d) *Use the bootstrap, as in (b), to form a normal-based bootstrap 95% confidence interval for σU1 – σU2 , the
difference between the standard deviations of the two regression residuals.
Chapter 17 introduced the simple linear regression (SLR) model to model the relationship between an outcome
variable Y and a single explanatory variable X. The SLR model is often too simplistic since there may be multiple
explanatory variables that should be included in a model describing the outcome variable. The regression approach
is easily generalizable to allow for more explanatory variables. This chapter introduces the multiple linear regression (MLR) model, which generalizes the SLR model of the previous chapter. While many of the ideas and results for the SLR model carry over to the multiple regression model, there are important new issues that arise with additional explanatory variables.
• Exogeneity assumption: The exogeneity assumption E(U|X) = 0 implies that there is no correlation between the
unobservable U and any of the explanatory variables:
Cov(Xk , U) = 0 for any k ∈ {1, 2, …, K}.
Knowing the value of one or more of the explanatory variables tells us nothing about the expected value of U, as its
conditional expectation is zero for all possible values. Importantly, the exogeneity assumption says nothing about
the relationship between the explanatory variables. Any two explanatory variables Xk and Xℓ may be correlated with each other and, in most cases, would be expected to be.
In one sense, the MLR exogeneity assumption seems stronger than the SLR exogeneity assumption since it requires
that U is uncorrelated with a larger set of explanatory variables. In another sense, however, the inclusion of additional
explanatory variables in the MLR model may make the exogeneity assumption more plausible than it was without
those variables. For example, in Example 17.21, it was argued that the exogeneity assumption was not likely to hold
for the SLR model of state-level cigarette sales and taxes, as the unobservable U is likely to be negatively related
with cigarette taxes. If we also had a state-level variable that measured the pro-tax sentiment of residents (e.g., the
percentage of residents who say yes to “Do you favor increased cigarette taxes?”), the inclusion of this additional
explanatory variable as a second variable in a MLR model could make the Cov(CIGTAX, U) = 0 assumption more
plausible.
For the MLR model with the exogeneity assumption, the conditional expectation of Y given the explanatory
variables is
E(Y|X) = α + β1 X1 + β2 X2 + · · · + βK XK
since E(U|X) = 0. The interpretation of the parameters (α, β1 , β2 , …, βK ) of the MLR model follows from this
conditional expectation:
• Meaning of the intercept α: Note that
α = E(Y|X1 = X2 = · · · = XK = 0),
which is the conditional expectation of Y when all of the explanatory variables are equal to zero. Whether α has a
practical interpretation depends upon the specific MLR model being considered and, specifically, whether zero is a
relevant value for each of the explanatory variables.
• Meaning of the slope parameters: To discuss the meaning of the slope parameters (β1, β2, …, βK), consider some arbitrary “starting values” for the explanatory variables, say (X1, X2, …, XK) = (x1∗, x2∗, …, xK∗). Starting with the first explanatory variable X1, what happens to the conditional expectation of Y if X1 is increased by one unit while the values of all the other explanatory variables are held fixed? The change in the conditional expectation is equal to

E(Y|X1 = x1∗ + 1, X2 = x2∗, …, XK = xK∗) – E(Y|X1 = x1∗, X2 = x2∗, …, XK = xK∗) = β1,

so the slope β1 is the change in the expected outcome associated with a one-unit increase in X1, holding the other explanatory variables fixed.
In fact, it is always possible to consider the more general case when the values of the explanatory variables change from (X1, X2, …, XK) = (x1∗, x2∗, …, xK∗) to (X1, X2, …, XK) = (x1∗∗, x2∗∗, …, xK∗∗), in which case the change in the expected outcome Y is

β1(x1∗∗ – x1∗) + β2(x2∗∗ – x2∗) + · · · + βK(xK∗∗ – xK∗).
This expression can be used to make statements about changes in the expected outcome Y when multiple explanatory
variables change values, which includes cases where an explanatory variable may be a direct function of one or more
other explanatory variables.
Example 18.1 (Monthly stock returns) Example 17.5 considered an SLR model with an individual stock return as the
outcome variable and a market-index return as the explanatory variable. This SLR model captures the association
between the performance of the stock of an individual company and the overall performance of the stock market.
Previously, in Chapter 7 (see, for example, Example 7.13), we measured the correlation between the returns of several
individual stocks. By using a MLR model, we can model an individual stock return with both a market-index return
and the return of a related stock (or the returns of several related stocks). As an example, consider a MLR model with
monthly returns of Home Depot (HD) as the outcome variable and monthly returns of the S&P 500 index (IDX) and
monthly returns of Lowe’s (LOW) as explanatory variables:
HD = α + β1 IDX + β2 LOW + U with E(U|IDX, LOW) = 0
or, equivalently,
E(HD|IDX, LOW) = α + β1 IDX + β2 LOW.
The intercept parameter α = E(HD|IDX = 0, LOW = 0) is the expected monthly return for Home Depot when the
monthly returns for both the S&P 500 index and Lowe’s are equal to zero. The slope parameter β1 measures how
much the conditional expectation of Home Depot returns changes when the S&P 500 return changes, holding the
Lowe’s return fixed. For example, if the S&P return changes by 0.01 (one percentage point) and the Lowe’s return
is unchanged, the conditional expectation of Home Depot returns changes by 0.01β1 . This interpretation is different
from the interpretation of the slope parameter in a SLR model with only the S&P 500 as an explanatory variable. Such
a model would be HD = α + βIDX + U, where β measures the association between Home Depot returns and S&P 500
returns but does not control for (or hold fixed) the returns for Lowe’s. Similarly, the slope parameter β2 measures how
much the conditional expectation of Home Depot returns changes when the Lowe’s return changes, holding the S&P
500 return fixed. If the Lowe’s return changes by 0.01 (one percentage point) and the S&P return is unchanged, the
conditional expectation of Home Depot returns changes by 0.01β2 . The interpretation of this parameter is different
from an SLR model with only the Lowe’s return as an explanatory variable, where the slope parameter measures the
association between Home Depot and Lowe’s without controlling for the S&P return.
To guarantee that (α̂, β̂1 , β̂2 , …, β̂K ) is the unique solution to this minimization problem, the following two assumptions
are required:
• Assumption MLR-VarX: Each explanatory variable xk has a positive sample variance. That is, s2xk > 0 for each
k ∈ {1, 2, …, K}.
• Assumption MLR-NPC: For each explanatory variable xk, it is not possible to write xk as a linear combination of the other explanatory variables.
Assumption MLR-VarX is a very weak assumption, as it just requires that each explanatory variable is non-constant.
This assumption generalizes the assumption made in Chapter 17 for the SLR model (s2x > 0 for the single explanatory
variable x). Assumption MLR-NPC requires that there is no perfect collinearity among the explanatory variables, which means that it is not possible to express one explanatory variable perfectly as a linear combination of one or more other explanatory variables. For instance, if one explanatory variable is height in inches, having another explanatory variable heightft, which is height in feet, would violate the assumption of no perfect collinearity since height = 12 × heightft. As another example, if the indicator variable union is an explanatory variable equal to 1 for union members and 0 for non-members, having another explanatory variable nonunion (equal to 1 for non-members and 0 for union members) would violate the assumption of no perfect collinearity since nonunion = 1 – union. This assumption
is also not very restrictive, as it is generally only violated when a practitioner makes a mistake in specifying the set of explanatory variables, as would be the case, for example, when including both a union indicator variable and a nonunion indicator variable in the MLR model. The following proposition states that these two assumptions are sufficient to ensure that the least-squares estimates uniquely minimize the S(a, b1, b2, …, bK) function:61
Proposition 18.1. If Assumption MLR-VarX and Assumption MLR-NPC hold, the minimization of

$$S(a, b_1, b_2, \ldots, b_K) = \sum_{i=1}^{n}(y_i - a - b_1 x_{i1} - b_2 x_{i2} - \cdots - b_K x_{iK})^2$$

has a unique solution.
Example 18.2 (Monthly stock returns) Using the sp500 dataset, the parameters (α, β1 , β2 ) of the MLR model from
Example 18.1,
HD = α + β1 IDX + β2 LOW + U with E(U|IDX, LOW) = 0,
can be estimated. The first column of the following table shows the least-squares estimates (α̂, β̂1 , β̂2 ), obtained by
minimizing the function S(a, b1 , b2 ). For comparison purposes, the second column shows the least-squares estimates
for a SLR model having only IDX (S&P 500 monthly returns) as an explanatory variable, and the third column shows
the least-squares estimates for a SLR model having only LOW (Lowe’s monthly returns) as an explanatory variable.
                       MLR estimates   SLR estimates (IDX only)   SLR estimates (LOW only)
α (intercept)              0.004               0.009                      0.006
β1 (slope on IDX)          0.595               1.020                        —
β2 (slope on LOW)          0.384                 —                        0.522
The MLR estimates can be obtained from the lm or lm_robust function in R. Since standard errors are not being
calculated yet, the lm function is sufficient:
lm(HD~IDX+LOW, data=sp500)
##
## Call:
## lm(formula = HD ~ IDX + LOW, data = sp500)
##
## Coefficients:
## (Intercept) IDX LOW
## 0.00425 0.59491 0.38425
The syntax to include more than one explanatory variable is to use a plus sign (+) in between variable names, so
that IDX+LOW indicates there are two explanatory variables, IDX and LOW, in the model.
For the MLR estimates, the intercept estimate α̂ = 0.004 (or 0.4%) is an estimate of E(HD|IDX = LOW = 0), the
expected monthly return of Home Depot when both the S&P 500 and Lowe’s have zero monthly return. The slope
estimate β̂1 = 0.595 estimates the expected change in Home Depot returns given a one-unit change in S&P 500 returns,
holding the returns of Lowe’s fixed. Therefore, for a 0.01 increase in S&P 500 returns (IDX) with Lowe’s returns
(LOW) held fixed, we estimate that the Home Depot returns will, on average, increase by 0.00595. Similarly, for a
0.01 increase in Lowe’s returns (LOW) with S&P 500 returns (IDX) held fixed, we estimate that the Home Depot
returns will, on average, increase by 0.00384. The estimates can also be used to directly estimate the conditional
expectation of HD for any specific values of IDX and LOW. For instance, in a month where the S&P 500 market return
is 4% (IDX = 0.04) and Lowe’s return is 2% (LOW = 0.02), the estimated conditional expectation of HD is
Ê(HD|IDX = 0.04, LOW = 0.02) = α̂ + β̂1 (0.04) + β̂2 (0.02) = 0.004 + (0.595)(0.04) + (0.384)(0.02) ≈ 0.035,
which is approximately 3.5%. Although we do not yet know whether there is a statistically significant difference between the two slope estimates β̂1 and β̂2, the estimates suggest that the S&P 500 returns have a stronger association (slope of 0.595) than Lowe's returns (slope of 0.384) with expected Home Depot returns.
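This arithmetic can be verified directly in R, using the coefficients from the lm fit above:

coefs <- coef(lm(HD ~ IDX + LOW, data = sp500))
unname(coefs["(Intercept)"] + coefs["IDX"]*0.04 + coefs["LOW"]*0.02)   # approximately 0.035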
Why is the estimated slope on IDX (β̂ = 1.020) in the SLR model, reported in the second column of the table, so
much larger than the MLR estimated slope of 0.595? First, it’s important to realize that the underlying population
parameters are measuring different things, with the SLR slope measuring the association of a one-unit change in IDX
with expected HD but completely ignoring LOW and the MLR slope measuring the association of a one-unit change in
IDX with expected HD with LOW held fixed. For the SLR model, the least-squares estimate of the IDX slope parameter
is essentially forced to pick up any association with HD that may be coming from LOW since LOW is not in the model.
It turns out that, perhaps unsurprisingly, there is a positive correlation between the S&P returns (IDX) and Lowe’s
returns (LOW), with rIDX,LOW = 0.506. Therefore, when IDX increases by one unit, LOW also tends to increase, so
the SLR slope parameter on IDX is expected to be larger than the MLR slope parameter on IDX, as the SLR slope
parameter captures the direct association of IDX with expected HD but also the indirect association of LOW with
expected HD. Similar reasoning can be applied to explain why the SLR slope estimate for LOW, reported in the third
column of the table, is larger than the MLR slope estimate for LOW.
Example 18.3 (Cigarette sales and cigarette taxes) In Examples 17.4 and 17.7, we introduced a SLR model to explain
state-level cigarette sales (CIGSALES) with state-level cigarette taxes (CIGTAX). We now add a variable to the model,
yielding a MLR model with two explanatory variables. The new variable is the binary variable PRODUCER, which is
equal to 1 for any state producing more than 20 million pounds of tobacco in 2019 and 0 otherwise.
CIGSALES = α + β1 CIGTAX + β2 PRODUCER + U with E(U|CIGTAX, PRODUCER) = 0.
The PRODUCER variable is included to allow for the possibility that cigarette sales, even after controlling for cigarette taxes, may be higher in states that produce tobacco (e.g., due to greater acceptance of tobacco and smoking). In the dataset, there are seven states (Georgia, Kentucky, North Carolina, Pennsylvania, South Carolina, Tennessee, and Virginia) with producer = 1. The following table reports the least-squares estimates (α̂, β̂1 , β̂2 ) of the MLR model
parameters in the first column and, for comparison, the SLR estimates from Example 17.7 in the second column:
MLR estimates SLR estimates
α (intercept) 54.28 55.95
β1 (slope on CIGTAX) –8.97 –9.49
β2 (slope on PRODUCER) 5.37
The MLR estimates are obtained using the lm function in R:
lm(cigsales~cigtax+producer, data=cigdata)
##
## Call:
## lm(formula = cigsales ~ cigtax + producer, data = cigdata)
##
## Coefficients:
## (Intercept) cigtax producer
## 54.28 -8.97 5.37
There is not much change in the CIGTAX slope estimate in the MLR model (–8.97) as compared to the SLR model
(–9.49). Again, the meaning of the two parameters and estimates is different, as the MLR estimate measures what
happens when holding PRODUCER fixed and the SLR estimate does not. For the MLR model, the slope estimate
β̂1 = –8.97 implies that a one-unit (one dollar) change in state cigarette taxes, holding fixed whether the state is a
tobacco producer, is associated with an estimated decrease of 8.97 packs per capita, on average. For the tobacco-
producer variable, a one-unit change in the variable involves going from a non-producing state (PRODUCER = 0) to a
producing state (PRODUCER = 1). The slope estimate β̂2 = 5.37 estimates the difference in expected per-capita packs
sold between a tobacco-producing state and a non-producing state, holding state cigarette taxes fixed. Whether this
difference is statistically significant depends upon the standard error of the estimate, an issue that we re-visit later.
Example 18.4 (Weekly earnings) In the previous chapter, SLR models were used to model weekly earnings in terms
of union status (Example 17.2) and in terms of educational attainment (Example 17.3). With the MLR model, we can
model weekly earnings in terms of several explanatory variables simultaneously. We use the cps data on n = 2809
employed individuals and the following four explanatory variables (K = 4) to model the outcome earnwk (weekly
earnings):
educ = years of education
exper = years of experience
union = 1 if individual is a union member, 0 otherwise
female = 1 if individual is female, 0 otherwise
For the experience (exper) variable, we adopt a standard definition used by labor economists, which defines62
exper = age – educ – 6, where age is age in years.
With both exper and educ in the MLR model, it is not possible to also include age since, by the definition of exper, the
age variable is a perfect linear combination of educ and exper, which would violate Assumption MLR-NPC.
Here is the R code to construct the female and exper variables and calculate the MLR least-squares estimates:
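A sketch of such code follows (the exper definition uses the formula above; constructing female from an underlying sex variable is our assumption, and the book's actual code may differ):

cps$exper <- cps$age - cps$educ - 6                  # potential experience
cps$female <- ifelse(cps$sex == "female", 1, 0)      # assumed underlying variable 'sex'
results <- lm_robust(earnwk ~ educ + exper + union + female, data = cps)
summary(results)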
member (female = 0, union = 1) is equal to β̂4 = –344.8. Estimated differences in average weekly earnings between other groups can be calculated similarly.
The least-squares minimization problem in Proposition 18.1 yields the least-squares estimates for the observed
sample that has been drawn from the population. The values of the least-squares estimates depend upon the particular
sample that happened to be drawn from the population. In thinking about the sampling distribution associated with
least-squares estimation, the distribution of the least-squares estimators (α̂XY , β̂1,XY , β̂2,XY , …, β̂K,XY ) is described by
the distribution of all possible least-squares estimates that could arise from every possible n-observation i.i.d. sample
drawn from the population. Like the least-squares estimators for the SLR model parameters, the least-squares
estimators for the MLR model have the desirable properties of consistency and asymptotic normality, with the
latter property allowing for large-sample inference using the normal distribution. The following proposition states
the consistency and asymptotic normality properties of the least-squares estimators of the MLR model:63
Proposition 18.2. If the MLR model holds and Assumption MLR-VarX and Assumption MLR-NPC hold for any
possible sample drawn from the population, least-squares estimation is consistent, with α̂XY being a consistent
estimator of α and each β̂k,XY being a consistent estimator of βk for each k ∈ {1, 2, …, K}. Moreover, the least-squares
estimators are asymptotically normal, with
$$\sqrt{n}(\hat{\alpha}_{XY} - \alpha) \overset{a}{\sim} N(0, V_{\alpha})$$

and

$$\sqrt{n}(\hat{\beta}_{k,XY} - \beta_k) \overset{a}{\sim} N(0, V_{\beta_k}) \quad \text{for each } k \in \{1, 2, \ldots, K\}.$$

The asymptotic variance of the intercept estimator α̂XY is Vα/n, and the asymptotic variance of each slope estimator β̂k,XY is Vβk/n for each k ∈ {1, 2, …, K}.
Proposition 18.2 says that the least-squares estimators get arbitrarily close to the underlying MLR parameters as
the sample size increases. The asymptotic normality of the estimators allows construction of confidence intervals and
hypothesis testing based upon the normal distribution, as discussed in further detail in Section 18.3.
The estimated residual is the difference between the realized outcome yi and its fitted value ŷi . When an estimated
residual is small in magnitude, it means the realized outcome is close to the value predicted by the fitted value.
Since the least-squares estimators are consistent, the fitted values and estimated residuals calculated from the least-
squares estimates are also consistent. The following proposition states this result, generalizing the SLR result
(Proposition 17.5):
Proposition 18.3. (Consistency of fitted values and estimated residuals) Assume that the MLR model holds and
Assumption MLR-VarX and Assumption MLR-NPC hold for any possible sample drawn from the population. Then,
for any i ∈ {1, 2, …, n}, the fitted value ŷi is a consistent estimate of
E(Y|X1 = xi1 , X2 = xi2 , …, XK = xiK ) = α + β1 xi1 + β2 xi2 + · · · + βK xiK ,
and the estimated residual ûi is a consistent estimate of the population residual
ui = yi – α – β1 xi1 – β2 xi2 – · · · – βK xiK .
Example 18.5 (Weekly earnings) Continuing Example 18.4, the following table shows detailed information for the
first ten observations in the cps dataset, including the values of the four explanatory variables (educ, exper, union,
female), the observed outcome value (earnwk), and the fitted values and estimated residuals.
i educi experi unioni femalei yi (earnwki ) ŷi ûi
1 14 30 0 0 577 1269 –692
2 18 10 0 1 3049 1270 1779
3 18 6 0 1 2500 1250 1250
4 12 17 0 0 300 983 –683
5 12 38 0 0 1000 1086 –86
6 7.5 28.5 0 1 1000 196 804
7 12 37 0 1 650 737 –87
8 13 39 0 1 1712 857 854
9 16 17 0 1 820 1082 –262
10 12 41 1 0 1240 1244 –4
We calculate the fitted values and estimated residuals in R in the same way as seen for the SLR model:
yhat <- results$fitted.values; uhat <- cps$earnwk - yhat   # fitted values and estimated residuals
# output first ten estimated residuals and fitted values, rounded to the nearest dollar
round(uhat[1:10]); round(yhat[1:10])
The fitted values are based upon the least-squares parameter estimates, using the formula in the definition of ŷi .
Looking at the sixth observation (i = 6), for instance, the fitted value is very low since the education level is so low; the
fitted value indicates an estimate of $196 for the estimated expected earnings of an individual with those characteristics
(female, non-union worker with 7.5 years of education and 28.5 years of experience). The fitted values for the first three
observations happen to be very close to each other due to the values of their explanatory variables. Looking at the
second and third observations, the only difference is a four-year difference in experience (exper = 10 for i = 2 and
exper = 6 for i = 3), leading to a difference in fitted values of 4β̂exper = (4)(4.9) ≈ 20. The estimated residuals ûi exhibit
a lot of variation. Some observations have fitted values quite close to the actual outcomes and estimated residuals close
to zero, most notably for i ∈ {5, 7, 10}, while other observations have fitted values very far from the actual outcomes
and estimated residuals with large magnitudes, most notably for i ∈ {2, 3}.
Summary measures of the overall size of the model's residuals can be obtained by generalizing the residual variance and standard deviation estimates introduced for the SLR model. Here are their definitions for the MLR model:

$$\hat{\sigma}_U^2 = \frac{1}{n-K-1}\sum_{i=1}^{n}\hat{u}_i^2 \quad \text{and} \quad \hat{\sigma}_U = \sqrt{\frac{1}{n-K-1}\sum_{i=1}^{n}\hat{u}_i^2}.$$

As compared to the σ̂U2 and σ̂U formulas for least-squares estimation of the SLR model, the formulas for the MLR model involve a scaling of 1/(n – K – 1) rather than 1/(n – 2). This scaling accounts for the estimation of the (K + 1) parameters in the MLR model, with the denominator equal to n – (K + 1) = n – K – 1. The SLR formulas are a special case of the MLR formulas with K = 1. While the 1/(n – K – 1) scaling does not differ much numerically from either a 1/(n – 1) or 1/n scaling when n is large, this scaling does ensure that the residual variance estimator is unbiased.
The following proposition summarizes the properties of the estimated residuals from least-squares estimation of the
MLR model. This proposition generalizes Proposition 17.7, which considered the properties of the estimated residuals
for the SLR case.
Proposition 18.4. (Properties of estimated residuals) The estimated residuals
ûi = yi – α̂ – β̂1 xi1 – β̂2 xi2 – · · · – β̂K xiK,
based upon the least-squares estimates (α̂, β̂1 , β̂2 , …, β̂K ) have the following properties:
(i) The sample average of the estimated residuals is zero:
(1/n) Σ_{i=1}^n ûi = 0.
(ii) The sample correlation between the values of any explanatory variable and the estimated residuals is zero:
rxk û = 0 for every k ∈ {1, 2, …, K}.
(iii) The sample correlation between the fitted values and the estimated residuals is zero:
rŷû = 0.
(iv) The residual variance estimate σ̂U² = (1/(n–K–1)) Σ_{i=1}^n ûi² is a consistent (and unbiased) estimate of σU², the
unconditional variance of the random variable U.
(v) The residual standard deviation estimate σ̂U = √( (1/(n–K–1)) Σ_{i=1}^n ûi² ) is a consistent estimate of σU, the unconditional
standard deviation of the random variable U.
Properties (i), (iii), (iv), and (v) are basically the same as those stated for the SLR model in Proposition 17.7,
with the different scaling 1/(n–K–1) used for properties (iv) and (v) here. Property (ii) states that the sample correlation
between estimated residuals and any of the explanatory variables is equal to zero. That is, the explanatory variable
xk is uncorrelated with the estimated residual û for any k ∈ {1, 2, …, K}. This property is a sample analogue of the
corresponding population property, that the population residual U is uncorrelated with each random variable Xk , for
k ∈ {1, 2, …, K}, which is implied by the exogeneity assumption E(U|X) = 0. Since property (ii) is a by-product of
least-squares estimation, it is true even if the exogeneity assumption doesn’t actually hold. As such, it is not fruitful to
use a sample correlation between an explanatory variable xk and the estimated residual û to test whether Xk is related
to U, as the sample correlation is always zero.
With estimated residuals and fitted values defined for least-squares estimation of the MLR model, we can generalize
the R-squared measure that was introduced in Definition 17.5 for the SLR model.
Definition 18.5 The R-squared value associated with least-squares estimation of the MLR model, denoted R2 , is
R² = s²ŷ / s²y = 1 – s²û / s²y.
This definition is identical to the definition of R-squared for least-squares estimation of the SLR model. In terms
of interpretation, the only difference for the MLR model is that there may be more than a single explanatory variable
explaining the variation in the outcome variable. For example, if K = 3 and R2 = 0.36, then the three explanatory
variables x1 , x2 , and x3 explain 36% of the variation in the y variable.
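This definition can also be verified directly from sample variances. Here is a minimal sketch in R, assuming y, yhat, and uhat hold the outcome values, fitted values, and estimated residuals for an estimated MLR model:
# R-squared as a ratio of sample variances; both lines give the same number
var(yhat)/var(y)
1 - var(uhat)/var(y)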
The following proposition summarizes the important properties of the R-squared value for MLR models:
Proposition 18.5. (Properties of R-squared) The R-squared value, based upon the least-squares estimates
(α̂, β̂1 , β̂2 , …, β̂K ), has the following properties:
(i) R² = r²yŷ
(ii) 0 ≤ R2 ≤ 1
(iii) If an explanatory variable xK+1 is added to the MLR model, least-squares estimation of the new MLR model
(with K + 1 explanatory variables) has an R-squared value (call it R2K+1 ) that is at least as large as the original R2
value (from the MLR model with K explanatory variables): R2K+1 ≥ R2 .
Properties (i) and (ii) have been seen previously for SLR models. R2 is equal to the square of the correlation between
the values of the outcome variable y and the fitted values ŷ. To the degree that the explanatory variables provide a better
fit/prediction for the outcome variable, the correlation between y and ŷ is larger in magnitude and, therefore, R2 is
higher. The extreme of R2 = 1 corresponds to 100% of the variation in y being explained by the explanatory variables,
which happens when y is a perfect linear function of the explanatory variables. The extreme of R2 = 0 corresponds
to none of the variation in y being explained by the explanatory variables, which happens only when all of the slope
estimates β̂1 , β̂2 , …, β̂K are exactly equal to zero. Property (ii) follows directly from property (i) since the correlation
ryŷ satisfies –1 ≤ ryŷ ≤ 1.
Property (iii) states that the R-squared value (weakly) increases when a variable is added to the MLR model and
new least-squares estimates are calculated. The intuition here is that you can’t do worse in explaining the variation
in the outcome variable y when you add an explanatory variable to the model. (Mathematically, if the estimated
slope on the added variable is exactly equal to zero, the fitted values would be the same as the original least-squares
estimation, so that R2 is unchanged. For any other (non-zero) estimate of the slope of the added variable, the R2 value
increases as compared to the original R2 value.) By continually applying property (iii), R2 increases if we add more
and more explanatory variables to the MLR model. This property should be considered a cautionary tale. Specifically,
if a practitioner is too focused on the R2 value, it may cause them to add too many explanatory variables to a model.
In choosing which explanatory variables belong in the model, R2 should not be the guiding force; instead, variables
should be included based upon prior knowledge that the researcher has about their practical or economic relevance in
explaining the outcome variable.
Example 18.6 (Monthly stock returns) Continuing Example 18.2, the following table reports the R-squared values
and residual standard deviation estimates for least-squares estimation of three models for the outcome HD: (i) MLR
with IDX and LOW as explanatory variables, (ii) SLR with IDX as the explanatory variable, and (iii) SLR with LOW
as the explanatory variable:
Example 18.7 (Weekly earnings) Re-visiting Example 18.4, the following table shows how R2 changes when
explanatory variables are added one-by-one to the model that explains weekly earnings (earnwk):
Explanatory variables R2
educ 0.106
educ, exper 0.108
educ, exper, union 0.115
educ, exper, union, female 0.166
Property (iii) of Proposition 18.5, that R2 (weakly) increases as variables are added to the model, is illustrated here. The first row corresponds
to a SLR model with only educ as an explanatory variable. Education alone explains 10.6% of the variation in weekly
earnings. Adding experience (exper) to the model provides only a small incremental increase in R2 to 0.108, as does
adding union, which increases R2 to 0.115. Adding the female variable provides a larger increase, to an R2 of 0.166,
meaning the four explanatory variables together explain 16.6% of the variation in weekly earnings.
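The R2 values in this table can be obtained by estimating the four nested models and extracting each R-squared value; a sketch:
# estimate the four nested models and report each R-squared
summary(lm_robust(earnwk~educ, data=cps))$r.squared
summary(lm_robust(earnwk~educ+exper, data=cps))$r.squared
summary(lm_robust(earnwk~educ+exper+union, data=cps))$r.squared
summary(lm_robust(earnwk~educ+exper+union+female, data=cps))$r.squared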
This same reasoning applies to any other explanatory variable Xk in the MLR model. For any k ∈ {1, 2, …, K}, the
random variable X̃k is the part of Xk that is not related to the other explanatory variables in the model. To precisely
estimate βk , it is better to have a lot of independent variation in the Xk variable, which happens when the variance of
X̃k , Var(X̃k ) = σ²X̃k , is high.
As we did for the SLR model (Proposition 17.10), we consider the case of homoskedastic residuals for the MLR
model as a pedagogical tool to see the factors that affect the standard errors of the slope estimates. To be clear,
homoskedasticity is not required to perform least-squares estimation, as the standard errors of the least-squares
estimates can and should be calculated under the more relaxed case of heteroskedasticity.
The definitions of homoskedasticity and heteroskedasticity are generalized to the MLR model as follows:
Definition 18.6 The residuals of a MLR model are homoskedastic if the conditional variance
Var(U|X1 = x1 , X2 = x2 , …, XK = xK ) = σU2
is constant and does not depend upon the values x1 , x2 , …, xK . The residuals are said to exhibit homoskedasticity.
Definition 18.7 The residuals of a MLR model are heteroskedastic if the conditional variance
Var(U|X1 = x1 , X2 = x2 , …, XK = xK )
is non-constant and depends upon the values x1 , x2 , …, xK . The residuals are said to exhibit heteroskedasticity.
In the case of homoskedastic residuals, the following proposition provides the asymptotic variance formulas for the
least-squares slope estimators:
Proposition 18.7. If the MLR model holds, Assumption MLR-VarX and Assumption MLR-NPC hold for any possible
sample drawn from the population, and the residuals are homoskedastic, then the asymptotic variance of β̂k,XY (the
least-squares slope estimator of βk ) is
Vβk / n = σU² / (n σ²X̃k ) for each k ∈ {1, 2, …, K}.
The variance in the denominator of the asymptotic-variance expression is the independent variation in Xk , given by
σ²X̃k , and not the overall variation in Xk , given by σ²Xk . To obtain the standard errors of the least-squares slope estimates,
sample descriptive statistics can be plugged in for the population statistics, so that
se(β̂k ) = √(V̂βk / n) = √(s²û / (n s²x̃k )) = sû / (√n sx̃k ) for each k ∈ {1, 2, …, K},
where sû is the sample standard deviation of the estimated residuals and sx̃k is the sample standard deviation of the
variable that measures the part of xk that is not linearly related to the other explanatory variables.64 Thus, there are
three factors that affect the standard errors of the least-squares slope estimates, with the first two being the same as
seen for the SLR model:
• Sample size: Larger n leads to a smaller standard error se(β̂k ), with the usual 1/√n scaling.
• Residual noise: A smaller residual variance σU², as estimated by s²û, leads to a smaller standard error se(β̂k ).
• Independent variation of the explanatory variable: Whereas the overall variation of the explanatory variable affects
the standard error for the SLR model, it is the independent variation of an explanatory variable xk that affects se(β̂k )
for the MLR model. The independent variation of the random variable Xk is given by the variance of the random
variable X̃k introduced above and measures how much variation is left in Xk after the linear relationship with other
explanatory variables has been accounted for. When this independent variation, as estimated by s2x̃k , is larger, the
standard error se(β̂k ) is smaller.
For the third factor, while a lot of independent variation in xk is good for the precision of β̂k (i.e., a low se(β̂k )), it is
also true that the precision of β̂k is poor (i.e., a high se(β̂k )) when there is little independent variation in xk . When the
variable xk is highly correlated with and well explained by the other explanatory variables in the model, the variance
σ²X̃k can be very low, which can cause the asymptotic variance of the slope estimator β̂k,XY and the standard error of
the estimate β̂k to “blow up” since this variance appears in the denominator of the asymptotic-variance formula in
Proposition 18.7. This issue, when high correlation among two or more explanatory variables causes standard errors
to be large, is known as multicollinearity. While some textbooks offer “solutions” to multicollinearity, there really is
no solution if we wish to keep all of the affected variables in the model. For instance, suppose K = 2 and the variables
x1 and x2 are highly correlated with each other. In such a case, it is possible that x1 and x2 together provide a good
prediction of the outcome variable y, as reflected by a high R2 value, even if both β̂1 and β̂2 have very high standard
errors due to the multicollinearity. While the two variables are jointly important in explaining y, their high correlation
means that it is difficult or impossible to disentangle the separate effects of x1 and x2 on the expected outcome. β1
measures how X1 affects E(Y|X) while holding X2 fixed, but estimating β1 is difficult because there is little variation
in x1 once x2 is held fixed due to the high value of rx1 x2 .
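To make the idea of independent variation concrete, sx̃k can be computed by regressing xk on the other explanatory variables and taking the sample standard deviation of the residuals. Here is a minimal sketch for the educ slope in the weekly-earnings model, assuming uhat holds the estimated residuals from the full least-squares regression:
# part of educ not linearly related to the other explanatory variables
aux <- lm(educ~exper+union+female, data=cps)
educ_tilde <- residuals(aux)
# homoskedastic-case standard error for the educ slope: s_uhat/(sqrt(n)*s_educ_tilde)
sd(uhat)/(sqrt(nrow(cps))*sd(educ_tilde))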
Example 18.8 (Cigarette sales and cigarette taxes) The following table reproduces the least-squares estimates from
Example 18.3, now with heteroskedasticity-robust standard errors reported in parentheses:
MLR estimates SLR estimates
α (intercept) 54.28 (3.51) 55.95 (2.92)
β1 (slope on CIGTAX) –8.97 (1.19) –9.49 (1.06)
β2 (slope on PRODUCER) 5.37 (5.08)
The standard error for the cigarette-tax slope in the MLR model is se(β̂1 ) = 1.19. This standard error is only slightly
higher than the standard error of 1.06 for the cigarette-tax slope in the SLR model, suggesting that even after
controlling for producer (whether or not a state is a tobacco producer) there is still a lot of independent variation
left in cigtax for precisely estimating β1 in the MLR model.
The MLR estimates in the table are obtained in R using the lm_robust function:
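A sketch of the call, with producer as the tobacco-producer indicator in cigdata:
lm_robust(cigsales~cigtax+producer, data=cigdata)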
Example 18.9 (Monthly stock returns) The following table reproduces the least-squares estimates from Example 18.2,
now with heteroskedasticity-robust standard errors reported in parentheses:
MLR estimates SLR estimates SLR estimates
α (intercept) 0.004 (0.003) 0.009 (0.003) 0.006 (0.003)
β1 (slope on IDX) 0.595 (0.091) 1.020 (0.082)
β2 (slope on LOW) 0.384 (0.047) 0.522 (0.040)
For the MLR model, the standard error of the slope on IDX (S&P 500 returns) is se(β̂1 ) = 0.091. This standard error
is roughly 11% higher than the standard error of 0.082 obtained for the slope on IDX in the SLR model. The higher
standard error for the MLR model is not too surprising since β1 is estimated using only the independent variation of
IDX after controlling for LOW. Since IDX and LOW are correlated, with rIDX,LOW = 0.5061, the amount of independent
variation in IDX used in estimation of the MLR model is somewhat lower than the overall variation of IDX used in
estimation of the SLR model. Similarly, for the same reason, the standard error on LOW (Lowe’s returns) is higher for
estimation of the MLR model, with se(β̂2 ) = 0.047, as compared to estimation of the SLR model with only LOW as an
explanatory variable, where the standard error is 0.040.
Here is the R code to produce the estimates and standard errors in the table:
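A sketch of the three estimation calls, using the sp500 dataset and the variable names from the example:
lm_robust(HD~IDX+LOW, data=sp500)  # MLR with both explanatory variables
lm_robust(HD~IDX, data=sp500)      # SLR with IDX only
lm_robust(HD~LOW, data=sp500)      # SLR with LOW only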
The asymptotic normality of the least-squares estimators (Proposition 18.2) allows practitioners to use the normal
distribution for (i) calculating confidence intervals for the MLR parameters and (ii) conducting hypothesis tests
involving the MLR parameters. The 1 – α confidence interval for the intercept parameter α is
(α̂ – zα/2 se(α̂), α̂ + zα/2 se(α̂)),
and the 1 – α confidence interval for each slope parameter βk , for k ∈ {1, 2, …, K}, is
(β̂k – zα/2 se(β̂k ), β̂k + zα/2 se(β̂k )).
The specific case of 95% confidence intervals for α and βk are, respectively,
(α̂ – 1.96se(α̂), α̂ + 1.96se(α̂))
and
(β̂k – 1.96se(β̂k ), β̂k + 1.96se(β̂k )).
Example 18.10 (Weekly earnings) The following R code shows the least-squares estimates from Example 18.4, with
heteroskedasticity-robust standard errors and 95% confidence intervals also reported:
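A sketch of the call (lm_robust reports heteroskedasticity-robust standard errors and 95% confidence intervals by default):
results <- lm_robust(earnwk~educ+exper+union+female, data=cps)
results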
We are 95% confident that the change in expected weekly earnings associated with a one-year change in education,
holding all other variables fixed, is between $96.67 and $125.21. For the union-membership slope β3 , the 95%
confidence interval is
(142.70 – (1.96)(44.45), 142.70 + (1.96)(44.45)) ≈ (55.59, 229.91).
We are 95% confident that the difference in expected weekly earnings between a union member and a non-member,
holding all other variables fixed, is between $55.59 and $229.91. While this interval for β3 is quite wide (approximately
$174 wide), it does not include zero, supporting the idea that the union differential is statistically meaningful even if
it’s not very precisely estimated.
We can obtain other confidence intervals for the parameters by changing the optional alpha argument for the
lm_robust function. Here is output with 90% confidence intervals:
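A sketch of the call with alpha set to 0.10:
lm_robust(earnwk~educ+exper+union+female, data=cps, alpha=0.10)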
due to possible correlation between the estimators β̂1,XY and β̂2,XY . More generally, if X1 increases by v1 units
and X2 increases by v2 units, the relevant change in E(Y|X) is the linear combination v1 β1 + v2 β2 , with estimate
v1 β̂1 + v2 β̂2 and standard error se(v1 β̂1 + v2 β̂2 ).
• Difference in effects: If X1 and X2 are comparable variables (e.g., they are measured in the same units), we might be
interested in the difference of the two variables’ effects on E(Y|X). This difference is given by the linear combination
β1 – β2 . (This turns out to be a special case of the case considered above, thinking about increasing X1 by one unit
and decreasing X2 by one unit.)
• Estimating the conditional expectation: The conditional expectation E(Y|X) is itself a linear combination of the
three model parameters. For X1 = x1∗ and X2 = x2∗ , the conditional expectation is
E(Y|X1 = x1∗ , X2 = x2∗ ) = α + β1 x1∗ + β2 x2∗ .
This linear combination can be re-written 1 · α + x1∗ · β1 + x2∗ · β2 , making it clear that the expression is a linear
combination of the three parameters, with coefficients 1, x1∗ , and x2∗ . The estimate of the conditional expectation is
obtained by plugging in the least-squares estimates,
Ê(Y|X1 = x1∗ , X2 = x2∗ ) = α̂ + β̂1 x1∗ + β̂2 x2∗ ,
and has standard error se(α̂ + β̂1 x1∗ + β̂2 x2∗ ) = se(1 · α̂ + x1∗ · β̂1 + x2∗ · β̂2 ).
To accommodate general forms of linear combinations, we introduce a (row) vector that contains the appropriate
coefficients for any given linear combination of the model parameters. In the example above, there are three model
parameters (α, β1 , β2 ) and three estimates (α̂, β̂1 , β̂2 ), so the vector will have three elements. The first element
corresponds to the constant multiplying α̂, the second element corresponds to the constant multiplying β̂1 , and the
third element corresponds to the constant multiplying β̂2 . For the three examples above, the vectors would be:
0 1 1 for β̂1 + β̂2 ,
0 1 –1 for β̂1 – β̂2 ,
and
1 x1∗ x2∗ for α̂ + β̂1 x1∗ + β̂2 x2∗ .
More generally, the MLR model has K + 1 model parameters (α, β1 , β2 , …, βK ) and K + 1 estimates (α̂, β̂1 , β̂2 , …, β̂K ),
so the vector will have K + 1 elements. For example, with K = 4 explanatory variables, the linear combination β1 +
3β3 – 2β4 is represented by the vector
(0 1 0 3 –2)
since β1 + 3β3 – 2β4 = 0α + 1β1 + 0β2 + 3β3 + (–2)β4 .
To facilitate the calculation of a standard error for a linear combination of least-squares estimates, a user-defined R
function linear_combination has been written:
• linear_combination(regresults, linvec): Takes regresults, the results from a lm_robust
regression, and linvec, a vector that specifies the linear combination of interest, as arguments and returns the
estimate (estimate) of the linear combination and the standard error (se) of the estimate. The vector linvec
has the same length (K + 1) as the parameter vector (α, β1 , β2 , …, βK ), with its elements (a1 , a2 , a3 , …, aK+1 )
corresponding to the coefficients of each parameter that give the linear combination of the estimates, a1 α̂ + a2 β̂1 +
a3 β̂2 + · · · + aK+1 β̂K .
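A minimal sketch consistent with this description, computing the estimate and its standard error from the robust variance matrix of the lm_robust results:
linear_combination <- function(regresults, linvec) {
  # estimate of the linear combination: sum of linvec times the parameter estimates
  est <- sum(linvec * coef(regresults))
  # standard error: square root of linvec' Vhat linvec
  se <- sqrt(as.numeric(t(linvec) %*% vcov(regresults) %*% linvec))
  list(estimate = est, se = se)
}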
Example 18.11 (Monthly stock returns) Parameter estimates and standard errors for the MLR model
E(HD|IDX, LOW) = α + β1 IDX + β2 LOW were provided in Example 18.9. We use the linear_combination
function to provide estimates and standard errors for some linear combinations of the parameters, specifically the
following: (i) β1 + β2 , representing the combined effect of increasing both IDX and LOW, (ii) β1 – β2 , representing the
difference in the effects of IDX and LOW, and (iii) α + 0.05IDX + 0.03LOW, representing the conditional expectation
E(HD|IDX = 0.05, LOW = 0.03). The following R code provides output for these three cases:
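A sketch of the three calls, assuming results holds the lm_robust fit of the MLR model:
results <- lm_robust(HD~IDX+LOW, data=sp500)
linear_combination(results, c(0,1,1))        # (i) beta1 + beta2
linear_combination(results, c(0,1,-1))       # (ii) beta1 - beta2
linear_combination(results, c(1,0.05,0.03))  # (iii) alpha + 0.05*beta1 + 0.03*beta2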
We have β̂1 + β̂2 = 0.979, with a standard error of 0.069. If IDX and LOW both increase by 0.01 (one percentage
point), these estimates imply an associated increase in the conditional expectation of HD of 0.00979, with a standard
error of 0.00069. For the difference in the slope parameters, we have β̂1 – β̂2 = 0.211, with a standard error of 0.127,
which implies the 95% confidence interval for β1 – β2 is (–0.039, 0.460). And, the estimated conditional expectation
Ê(HD|IDX = 0.05, LOW = 0.03) = α̂ + β̂1 (0.05) + β̂2 (0.03) = 0.0455,
with a standard error of 0.0042, which implies the 95% confidence interval for the true conditional expectation
E(HD|IDX = 0.05, LOW = 0.03) is (0.0372, 0.0538).
Xk in the model at the outset, the failure to reject H0 : βk = 0 might prompt a practitioner to drop Xk from the model.
That said, variables should be dropped from the MLR model with caution. First, it’s possible that we just don’t have
a large enough sample to estimate a statistically significant slope on Xk . Second, it can be useful to have a variable in
the model even if its slope estimate is not statistically significant, as someone who is looking at the results might have
expected the variable to matter in the model and would be interested to see that it does not.
Example 18.12 (Monthly stock returns) The following R code shows the MLR results from Example 18.9:
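A sketch of the call, whose summary output includes the test statistics and p-values:
summary(lm_robust(HD~IDX+LOW, data=sp500))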
The output contains z-statistics and p-values for testing each individual MLR parameter versus zero. Since α =
E(HD|IDX = LOW = 0), the null hypothesis H0 : α = 0 corresponds to testing whether the expected Home Depot return
(HD) is equal to zero when the S&P 500 return (IDX) and Lowe’s return (LOW) are both equal to zero. From the table,
the z-statistic for this null hypothesis is 1.53, so H0 : α = 0 would not be rejected at a 5% level (1.53 < z0.025 = 1.96) or
at a 10% level (1.53 < z0.05 = 1.645). The p-value of 0.1266 indicates that H0 : α = 0 would not be rejected at any level
below 12.66%. For both slope parameters, the test of H0 : βk = 0 has a p-value equal to zero to many decimal places.
Therefore, H0 : β1 = 0 and H0 : β2 = 0 are both rejected at any level, indicating that IDX and LOW are both statistically
significant variables in the MLR model.
Example 18.13 (Cigarette sales and cigarette taxes) The following R code shows the MLR results from Example 18.8,
with the z-statistics and p-values for testing each parameter versus zero:
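A sketch of the call, again with producer as the tobacco-producer indicator:
summary(lm_robust(cigsales~cigtax+producer, data=cigdata))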
The state-level cigarette tax (CIGTAX) variable is statistically significant, as the p-value of 0.0000 indicates that
H0 : β1 = 0 is rejected at any level. On the other hand, the tobacco-producing state indicator (PRODUCER) variable
has a z-statistic of 1.06 and a p-value of 0.296 associated with testing H0 : β2 = 0. The null hypothesis H0 : β2 = 0
corresponds to PRODUCER not being in the MLR model, and H0 : β2 = 0 would not be rejected at any level below
29.6%. Given these results, whether or not to leave PRODUCER in the model is a choice that the practitioner needs
to make. If PRODUCER is dropped from the model, the resulting model is the SLR model with only CIGTAX as an
explanatory variable; from Example 18.8, the cigarette-tax slope estimate for that SLR model is –9.49 with a standard
error of 1.06.
Other hypothesis tests may be of interest for a MLR model, including the following:
Testing a single linear restriction: The simple null hypotheses above are examples of single linear restrictions on
the MLR parameters. What if we want to test whether the slope on one variable X1 is equal to the slope on another
variable X2 ? If the two variables are measured in the same units, a test of H0 : β1 = β2 assesses whether the partial effect
of X1 on E(Y|X), holding all other variables fixed, is the same as the partial effect of X2 on E(Y|X), holding all other
variables fixed. The null hypothesis H0 : β1 = β2 is equivalent to
H0 : β1 – β2 = 0,
three 0 elements since each of the parameters is being tested against zero. Using the notation from the Appendix to
Chapter 16, the number of restrictions is Q = 3 and the number of parameters is L = K + 1 = 7, and we have
    ( 0 0 1 0 0 0 0 )            ( 0 )
R = ( 0 0 0 0 1 0 0 )    and c = ( 0 ).
    ( 0 0 0 0 0 1 0 )            ( 0 )
Each row of the matrix R is a specific linear combination of the L = K + 1 model parameters. For the specific R matrix
shown above, the first row corresponds to β2 , the second row to β4 , and the third row to β5 . Each element of the c (column)
vector provides the constant against which the corresponding row of R is being tested. For the example shown, these
elements are all zero.
Suppose the MLR model for Home Depot (HD) monthly returns is augmented to include the monthly returns of
Bank of America (BAC) and Wells Fargo (WFC) as explanatory variables, with the results provided below:
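A sketch of the augmented estimation call, with results2 as an assumed object name that is used again below:
results2 <- lm_robust(HD~IDX+LOW+BAC+WFC, data=sp500)
summary(results2)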
Based on the z-statistics and p-values, the IDX and LOW variables remain statistically significant, but the p-values
on the two added variables indicate lack of significance for each variable individually (p-value of 0.30 for BAC and
p-value of 0.89 for WFC). These results are suggestive that the two added variables are not valuable in the model for
explaining Home Depot’s returns, and a Wald test of H0 : β3 = β4 = 0 provides a p-value for their joint significance:
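A sketch of the test, assuming a companion-website helper function (called wald_test here) that takes the regression results, the restriction matrix R, and the vector c:
# rows of R select beta3 (BAC) and beta4 (WFC) among (alpha, beta1, beta2, beta3, beta4)
R <- rbind(c(0,0,0,1,0),
           c(0,0,0,0,1))
c0 <- c(0,0)
wald_test(results2, R, c0)  # hypothetical helper name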
## $W
## [1] 1.4126
##
## $p_value
## [1] 0.49346
The rbind function stacks the rows of linear restrictions to form the matrix R. The resulting p-value of 0.49 implies
that H0 : β3 = β4 = 0, corresponding to BAC and WFC not being in the model, would not be rejected at any reasonable
level. This test, therefore, provides support for dropping the two variables from the model.
of more than one explanatory variable. For instance, if X is a variable, we can include both X and X 2 as explanatory
variables to allow Y to be a non-linear function of X. Such non-linear specifications are considered in Section 18.6.2.
Also, if X1 and X2 are explanatory variables, we can include the interaction X1 X2 in the model. Such interaction
variables, considered in Section 18.6.3, allow for more flexibility in how the expected outcome is associated with the
explanatory variables.
# estimate Model I
model1 <- lm_robust(earnwk~married+divorced+widowed, data=cps)
model1
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) 820.512 27.929 29.3785 1.2281e-165 765.7488 875.276 2805
## married 226.198 33.675 6.7170 2.2374e-11 160.1667 292.229 2805
## divorced 81.235 43.091 1.8852 5.9504e-02 -3.2573 165.728 2805
## widowed -159.499 58.786 -2.7132 6.7040e-03 -274.7669 -44.231 2805
# estimate Model II
model2 <- lm_robust(earnwk~married+divorced+widowed+educ+exper+union+female, data=cps)
model2
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) -500.4410 98.1084 -5.10090 3.6061e-07 -692.81309 -308.0690 2801
## married 204.3868 33.2880 6.13994 9.4230e-10 139.11518 269.6583 2801
## divorced 86.3465 41.5833 2.07647 3.7941e-02 4.80958 167.8834 2801
## widowed -50.4511 64.6772 -0.78005 4.3543e-01 -177.27084 76.3685 2801
## educ 108.4820 7.2819 14.89757 2.3140e-48 94.20361 122.7603 2801
## exper 3.3677 1.3992 2.40697 1.6150e-02 0.62425 6.1112 2801
## union 134.2413 44.7788 2.99787 2.7426e-03 46.43840 222.0441 2801
## female -341.3955 25.6764 -13.29607 3.6589e-39 -391.74215 -291.0489 2801
In Model I, with no other explanatory variables, the intercept estimate has a direct interpretation as the estimate of
the expected weekly earnings for the omitted category since
α = E(earnwk|married = divorced = widowed = 0).
The intercept estimate α̂ = 820.51 implies that the average weekly earnings for never-married individuals is $820.51.
Each of the three slope estimates is interpreted as the difference in expected weekly earnings for the corresponding
category and the omitted category. For instance, the slope on the married variable is
βmarried = E(earnwk|married = 1) – E(earnwk|nevermarried = 1),
with the estimate β̂married = 226.20 implying that the estimated difference in expected weekly earnings between married
and never-married individuals is $226.20. Similarly, β̂divorced = 81.24 implies that the estimated difference in expected
weekly earnings between divorced and never-married individuals is $81.24, and β̂widowed = –159.50 implies that the
estimated difference in expected weekly earnings between widowed and never-married individuals is –$159.50. For
any of these differences with the omitted category, we can test the statistical significance of the difference with a z-test
for the appropriate slope estimate. For example, the null hypothesis H0 : βmarried = 0, corresponding to married and
never-married individuals having the same expected weekly earnings, has a z-statistic of
β̂married / se(β̂married ) = 226.20 / 33.68 ≈ 6.72
and p-value of 0.0000. Thus, H0 : βmarried = 0 is rejected at any level, indicating a statistically significant difference
between the expected weekly earnings of married and never-married individuals. The z-tests of H0 : βdivorced = 0 and
H0 : βwidowed = 0 have z-statistics of 1.89 and –2.71, respectively, and p-values of 0.060 and 0.007, respectively. For a
5% level, H0 : βdivorced = 0 is not rejected, while H0 : βwidowed = 0 is rejected.
How about differences for two of the categories included in the MLR model? Let’s say we are interested in the
difference in expected earnings between married and divorced individuals,
βmarried – βdivorced = E(earnwk|married = 1) – E(earnwk|divorced = 1).
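Since this difference is a linear combination of the Model I parameters, its estimate and standard error can be obtained with the linear_combination function; a sketch:
# coefficients (0, 1, -1, 0) on (alpha, beta_married, beta_divorced, beta_widowed)
linear_combination(model1, c(0,1,-1,0))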
would be in the model. For a one-unit change in X1 , from X1 = x1∗ to X1 = x1∗ + 1, the associated change in E(Y|X), holding
the other explanatory variables fixed, is67
E(Y|X1 = x1∗ + 1, X4 = x4∗ , …, XK = xK∗ ) – E(Y|X1 = x1∗ , X4 = x4∗ , …, XK = xK∗ )
= β1 + β2 ((x1∗ + 1)² – (x1∗ )²) + β3 ((x1∗ + 1)³ – (x1∗ )³).
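A sketch of the estimation call for the weekly-earnings model with educ² added, with results as the assumed object name used in the code below:
results <- lm_robust(earnwk~educ+I(educ^2)+exper+union+female, data=cps)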
The second variable is specified as I(educ^2), and the I() syntax tells the lm_robust function to do the
calculation within parentheses for each observation and include the resulting variable in the least-squares estimation.
The following table reports these results, with the original model in the first column and the new model, with educ2
included, in the second column. Standard errors are reported in parentheses.
             MLR model            MLR model
             with educ            with educ, educ²
Intercept    –431.03 (100.08)     883.69 (123.10)
educ         110.94 (7.27)        –122.71 (23.59)
educ²                             9.821 (1.169)
exper        4.897 (1.315)        5.287 (1.294)
union        142.75 (44.39)       143.96 (43.98)
female       –344.79 (25.82)      –345.52 (25.21)
R2           0.166                0.198
σ̂U           685.78               672.64
For a z-test of H0 : βeduc² = 0, the z-statistic is 9.821/1.169 ≈ 8.4 with a p-value of 0.000, indicating that the educ² variable
belongs in the model. Including educ² in the model increases the R-squared value from 16.6% to 19.8% and decreases the
residual standard deviation estimate from $685.78 to $672.64. On the other hand, adding educ² to the model has very
little effect on the estimates of the slopes for exper, union, and female.
Let’s take a closer look at the partial effects of education implied by the two models. For the model with only educ,
the estimated partial effect of a one-year change in education on E(earnwk|X) is constant and equal to β̂educ = 110.94.
For the model with educ² included, the estimated partial effect of a one-year change in education, from educ∗ to
educ∗ + 1, on E(earnwk|X) is
β̂educ + β̂educ² (2educ∗ + 1) = –122.71 + 9.821(2educ∗ + 1).
To calculate a standard error for this partial effect, we use the linear_combination function to calculate
se(β̂educ + β̂educ² (2educ∗ + 1)) for a given value of educ∗ . The following table summarizes the partial effects estimated
by the two models for four possible values of educ∗ , with standard errors reported in parentheses:
Change in E(earnwk|X) when:     MLR model with educ    MLR model with educ, educ²
educ changes from 10 to 11      110.94 (7.27)          83.54 (4.84)
educ changes from 12 to 13      110.94 (7.27)          122.82 (7.65)
educ changes from 14 to 15      110.94 (7.27)          162.11 (11.71)
educ changes from 16 to 17      110.94 (7.27)          201.39 (16.12)
For the quadratic model, the estimated partial effect of education increases a lot as the level of education increases.
For example, the estimated partial effect at educ∗ = 16 is equal to $201.39 and is roughly 64% larger in magnitude
than the estimated partial effect at educ∗ = 12, which is equal to $122.82.
Here is the R code to calculate standard errors for the estimated partial effects in the quadratic model:
# consider educ values of 10, 12, 14, and 16
educ_vec <- c(10,12,14,16)
# calculate the partial effect and standard error for each educ value
for (educ_star in educ_vec) {
  print(linear_combination(results, c(0,1,2*educ_star+1,0,0,0)))
}
## $estimate
## [1] 83.535
##
## $se
## [1] 4.8381
##
## $estimate
## [1] 122.82
##
## $se
## [1] 7.6458
##
## $estimate
## [1] 162.11
##
## $se
## [1] 11.714
##
## $estimate
## [1] 201.39
##
## $se
## [1] 16.115
model, which contains X1 , X2 , the interaction variable X3 = X1 X2 , and additional explanatory variables:
E(Y|X) = α + β1 X1 + β2 X2 + β3 X1 X2 + β4 X4 + · · · + βK XK .
The partial effect of increasing X1 by one unit, from x1∗ to x1∗ + 1, on E(Y|X) is equal to
E(Y|X1 = x1∗ + 1, X2 = x2∗ , …, XK = xK∗ ) – E(Y|X1 = x1∗ , X2 = x2∗ , …, XK = xK∗ )
= β1 + β3 ((x1∗ + 1)x2∗ – x1∗ x2∗ )
= β1 + β3 x2∗ .
Therefore, the partial effect of X1 on E(Y|X) is a function of x2∗ , the value of X2 . When β3 is positive, the partial
effect of X1 on E(Y|X) is an increasing function of x2∗ , and when β3 is negative, the partial effect of X1 on E(Y|X) is a
decreasing function of x2∗ .
Similarly, we can determine the partial effect of increasing X2 by one unit, from x2∗ to x2∗ + 1, on E(Y|X):
E(Y|X1 = x1∗ , X2 = x2∗ + 1, …, XK = xK∗ ) – E(Y|X1 = x1∗ , X2 = x2∗ , …, XK = xK∗ )
= β2 + β3 (x1∗ (x2∗ + 1) – x1∗ x2∗ )
= β2 + β3 x1∗ .
The partial effect of X2 on E(Y|X) depends on x1∗ , with the partial effect increasing in x1∗ if β3 is positive and decreasing
in x1∗ if β3 is negative. Thus, a feature of a model with the interaction variable X1 X2 is that, as long as β3 ≠ 0, it must
be the case that both the partial effect of X1 depends upon X2 and the partial effect of X2 depends upon X1 . When
including X1 X2 in a model, it is also good practice to always include both of the original variables X1 and X2 in the
model, even if it turns out that one or both appear to be insignificant based upon z-tests.
To test whether an interaction variable is statistically significant, a z-test can be used for testing the null hypothesis
H0 : β3 = 0. Rejection of H0 : β3 = 0 indicates statistical significance of the interaction variable, and a failure to reject
H0 : β3 = 0 would support dropping the interaction variable, especially if the p-value is very high.
Example 18.16 (Weekly earnings) Example 18.10 provided least-squares estimates and standard errors for a MLR
model of weekly earnings (earnwk) with educ, exper, female, and union as explanatory variables. To allow for the
possibility that the partial effect of education on weekly earnings depends upon a worker’s experience level, the
interaction variable educ · exper can be added to the MLR model. This inclusion also allows for the partial effect of
experience on weekly earnings to depend upon a worker’s education level. The I() syntax for lm_robust can be
used to include the interaction variable in the model without creating a new variable. For the educ · exper interaction,
the added variable is I(educ*exper):
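A sketch of the call, with results as the assumed object name used below:
results <- lm_robust(earnwk~educ+exper+I(educ*exper)+union+female, data=cps)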
summary(results)$r.squared
## [1] 0.1785
The following table shows the least-squares estimates for the original model without the interaction variable and
the new model with the added interaction variable. Standard errors are reported in parentheses.
The interaction variable seems to belong in the model, as the z-statistic for testing H0 : βeduc·exper = 0 is –3.194/0.529 ≈
–6.04. The negative β̂educ·exper implies that (i) the estimated partial effect of education on weekly earnings (β̂educ +
β̂educ·exper exper∗ ) is a decreasing function of experience and (ii) the estimated partial effect of experience on weekly
earnings (β̂exper + β̂educ·exper educ∗ ) is a decreasing function of education. We calculate standard errors for these partial
effects with the linear_combination function. The following table compares some estimated partial effects based
upon the two models, with standard errors reported in parentheses:
Change in E(earnwk|X) when:               MLR model without interaction    MLR model with educ · exper
educ changes by one unit, exper = 10      110.94 (7.27)                    166.26 (13.79)
educ changes by one unit, exper = 20      110.94 (7.27)                    134.32 (9.38)
exper changes by one unit, educ = 12      4.897 (1.315)                    7.591 (1.265)
exper changes by one unit, educ = 14      4.897 (1.315)                    1.204 (1.601)
For the model without the interaction variable educ · exper, the estimated partial effects of the two variables are
constant, with the estimated partial effect of education given by β̂educ = 110.94 and the estimated partial effect of
experience given by β̂exper = 4.897. For the model with the interaction variable, the estimated partial effect of educ
on E(earnwk|X) declines from $166.26 at 10 years of experience (exper = 10) to $134.32 at 20 years of experience
(exper = 20). Since β̂educ·exper is negative, this estimated partial effect would continue to decrease at even higher values
of exper. The estimated partial effect of exper on E(earnwk|X) declines from $7.59 at 12 years of education (educ = 12)
to $1.20 at 14 years of education. In fact, the estimated partial effect of exper at educ = 14 is not statistically significant,
as the z-statistic for testing the partial effect against zero is 1.204/1.601 ≈ 0.75, which is associated with a p-value of 0.45.
Here is the R code to calculate standard errors for the estimated partial effects in the interaction model:
# consider exper values of 10 and 20
exper_vec <- c(10,20)
# calculate the partial effect and standard error for each exper value
for (exper_star in exper_vec) {
  print(linear_combination(results, c(0,1,0,exper_star,0,0)))
}
## $estimate
## [1] 166.26
##
## $se
## [1] 13.792
##
## $estimate
## [1] 134.32
##
## $se
## [1] 9.3784
# consider educ values of 12 and 14
educ_vec <- c(12,14)
# calculate the partial effect and standard error for each educ value
for (educ_star in educ_vec) {
print(linear_combination(results, c(0,0,1,educ_star,0,0)))
}
## $estimate
## [1] 7.5911
##
## $se
## [1] 1.2648
##
## $estimate
## [1] 1.2038
##
## $se
## [1] 1.6005
When one of the variables in an interaction variable is an indicator variable, we can determine whether, and by how
much, the partial effect of the other variable depends on the two possible values of the indicator variable. For the MLR
model
E(Y|X) = α + β1 X1 + β2 X2 + β3 X1 X2 + β4 X4 + · · · + βK XK ,
consider the case where X1 is an indicator variable, with X1 ∈ {0, 1}. The only relevant one-unit change for X1 is going
from 0 to 1, and the partial effect of X1 on E(Y|X) is equal to
β1 + β3 x2∗
when X2 = x2∗ . The partial effect of X2 on E(Y|X) is equal to
β2 + β3 x1∗ = β2 if x1∗ = 0, and β2 + β3 if x1∗ = 1,
meaning β3 measures the difference between the partial effect of X2 at X1 = 1 and the partial effect of X2 at X1 = 0.
For instance, for the weekly earnings model considered in Example 18.16, we could allow for the partial effect of
education on weekly earnings to depend upon union membership by adding an interaction between educ and union;
alternatively, we could allow for the partial effect of experience on weekly earnings to depend upon union membership
by adding an interaction between exper and union.
Example 18.17 (Monthly stock returns and sample-splitting) The dataset sp500 has 364 monthly observations,
spanning over 30 years between 1991 and 2021. A possible concern with using a regression model for such a long
time horizon is that the “true model” might have changed over time. For instance, what if the relationship between
an individual stock’s return and the market index return has not remained the same over the 30+ years observed in
the data? To address that concern, one approach is to split the sample by creating an indicator variable post2005 that
indicates which observations are after 2005:
post2005 = 1 if the observation is after 2005 (2006–2021), and post2005 = 0 if the observation is 2005 or earlier (1991–2005).
The MLR model
E(HD|IDX, post2005) = α + β1 post2005 + β2 IDX + β3 post2005 · IDX
includes the interaction variable post2005 · IDX, allowing the partial effect of the market index monthly return (IDX)
to depend upon whether post2005 = 0 or post2005 = 1. Specifically, if IDX increases by one unit, the expected HD
return changes by β2 if post2005 = 0 and by β2 + β3 if post2005 = 1. The parameter β3 is, therefore, the difference in
the partial effects of the market index return on the Home Depot return between the 2006-2021 period and the 1991-
2005 period. There’s nothing special about HD (Home Depot) here, so we can replace the outcome variable by the
monthly return of any individual stock. The following table shows the least-squares estimates of this model for four
different stocks: Home Depot (HD), Lowe’s (LOW), Bank of America (BAC), and ConocoPhillips (COP). Standard
errors are reported in parentheses, and a row containing the p-value for the test of the statistical significance of the
interaction variable (H0 : β3 = 0) has been included.
HD LOW BAC COP
Intercept 0.009 (0.005) 0.018 (0.007) 0.008 (0.005) 0.008 (0.005)
post2005 0.000 (0.006) –0.012 (0.008) –0.012 (0.009) –0.010 (0.007)
IDX 1.150 (0.134) 1.070 (0.174) 0.998 (0.138) 0.697 (0.118)
post2005 · IDX –0.239 (0.164) 0.068 (0.214) 0.904 (0.272) 0.635 (0.202)
p-value for H0 : β3 = 0 0.147 0.752 0.001 0.002
Starting with the Home Depot (HD) model, the estimated slope on the interaction variable, β̂3 = –0.239, indicates
that the association between expected HD returns and IDX returns is lower in the post-2005 period. Specifically,
when IDX goes up by 0.01, expected HD is estimated to increase by 0.01β̂2 = 0.01150 in the pre-2005 period and
0.01(β̂2 + β̂3 ) = 0.00911 in the post-2005 period. But, in looking at the p-value of 0.147 for the test of H0 : β3 = 0,
we do not reject that the interaction variable has a true slope of zero at the 10% level. There is limited statistical
evidence of a meaningful difference in the partial effects of IDX on HD between the pre-2005 and post-2005
periods. For the Lowe’s (LOW) model, the p-value of 0.752 for testing H0 : β3 = 0 is very high, meaning we would not
reject H0 : β3 = 0 at any reasonable level. The picture is quite different for the other two stocks, Bank of America (BAC)
and ConocoPhillips (COP), where the p-values for H0 : β3 = 0 are equal to 0.001 and 0.002, respectively. These low
p-values indicate that the interaction variable is statistically significant in both the BAC model and the COP model.
For the BAC model, when IDX goes up by 0.01, expected BAC is estimated to go up by 0.00998 (with standard error
0.00138) in the pre-2005 period and by 0.01902 (with standard error 0.00235) in the post-2005 period.
Here is the R code to calculate the least-squares estimates above, along with the standard error for the post-2005
period IDX partial effect for the BAC model:
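A sketch, assuming sp500 contains a year variable identifying each observation’s calendar year:
# create the post-2005 indicator and estimate the interaction model for BAC
sp500$post2005 <- as.numeric(sp500$year > 2005)
results_bac <- lm_robust(BAC~post2005+IDX+I(post2005*IDX), data=sp500)
results_bac
# post-2005 partial effect of IDX (beta2 + beta3) and its standard error
linear_combination(results_bac, c(0,0,1,1))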
or, equivalently,
E(ln(Y)|X) = α + β1 X1 + β2 X2 + · · · + βK XK .
Estimation of this MLR model proceeds by using ln(Y) as the outcome variable rather than Y. The interpretation of
the model parameters, however, is more complicated. Thinking about the partial effect of the X1 variable, for instance,
a one-unit change from X1 = x1∗ to X1 = x1∗ + 1, holding all other variables fixed, leads to the conditional expectation
E(ln(Y)|·) changing by β1 :
E(ln(Y)|X1 = x1∗ + 1, X2 = x2∗ , …, XK = xK∗ ) – E(ln(Y)|X1 = x1∗ , X2 = x2∗ , …, XK = xK∗ ) = β1 .
Unfortunately, this partial effect is difficult to describe in a useful way since it is in terms of the log-outcome rather
than the outcome itself. For instance, if Y is weekly earnings, the partial effect of a one-unit increase in X1 is a change
of β1 in the conditional expectation of log-earnings. To get an interpretation in terms of Y itself, we use the following
calculus-based approximation:68
∆ln(Y) ≈ ∆Y/Y when ∆Y is small.
Since ∆Y/Y is the percentage change in the variable Y, the partial-effect formula above says that a one-unit change in
X1 , holding all other variables fixed, is associated with an expected percentage change in Y approximately equal to β1 .
Example 18.18 (Weekly earnings) Using the cps dataset, here are the results from least-squares estimation of a MLR
model using log-earnings as the outcome variable and educ, exper, union, and female as the explanatory variables:
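A sketch of the call:
lm_robust(I(log(earnwk))~educ+exper+union+female, data=cps)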
The use of I(log(earnwk)) specifies the natural logarithm of weekly earnings as the outcome variable. Using
the approximation result, a one-year change in education, holding the other variables fixed, is associated with an
expected change of 0.100 or 10.0% in weekly earnings. Similarly, a one-year change in experience, holding the
other variables fixed, is associated with an expected change of 0.0057 or 0.57% in weekly earnings. For the union
indicator variable, where a one-unit change means going from a non-union worker to a union worker, union workers
are expected to earn approximately 0.199 or 19.9% more than non-union workers, holding all else fixed. And, female
workers are expected to earn approximately 0.401 or 40.1% less than male workers, holding all else fixed.
The R-squared value is 0.194 or 19.4%. Importantly, this R-squared value is not comparable to R-squared values
for regressions with weekly earnings as the outcome variable. Since the outcome variable here is log-earnings, the
R-squared value says that the explanatory variables explain 19.4% of the variation in log-earnings, not 19.4% of the
variation in earnings.
In some cases, a regression model may have an explanatory variable that is also log-transformed. Without loss of
generality, let’s say that the first explanatory variable is positive-valued (X1 > 0) and log-transformed, so that
ln(Y) = α + β1 ln(X1 ) + β2 X2 + · · · + βK XK + U with E(U|X) = 0
or, equivalently,
E(ln(Y)|X) = α + β1 ln(X1 ) + β2 X2 + · · · + βK XK .
The approximation ∆ln(X1 ) ≈ ∆X1 /X1 (when ∆X1 is small) also holds here. To have the approximation perform well, it
is best to consider a small change in X1 , say a one-percent change (∆X1 /X1 = 0.01). For a one-percent change in X1 , holding all other
variables fixed, the expected percentage change in Y is then approximately equal to 0.01β1 .
Example 18.19 (Weekly earnings) We modify the MLR model from Example 18.18 to use a log-transformation of the
experience (exper) variable. Here are the results from least-squares estimation of the model:
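A sketch of the call:
lm_robust(I(log(earnwk))~educ+I(log(exper))+union+female, data=cps)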
The slope estimates for educ, union, and female are very similar to the estimates from Example 18.18, and their
interpretations are similar as well. The R-squared is also virtually identical to that found in Example 18.18, suggesting
that the overall fit from this model specification is quite similar to the model specification that used exper without a log
transformation. For the log-transformed exper variable, the slope estimate is 0.133. Therefore, a one-percent change
in experience, holding all other variables fixed, is estimated to be associated with an expected percentage change in
weekly earnings of (0.01)(0.133) = 0.00133, or 0.133%.
Example 18.20 (Cigarette sales and cigarette taxes) To get a partial effect of taxes in terms of percentages, the SLR
model from Example 17.4 can be changed to have log transformations of both the outcome variable (CIGSALES)
and the explanatory variable (CIGTAX). Here are the results from least-squares estimation of the model with both
variables log-transformed:
lm_robust(I(log(cigsales))~I(log(cigtax)), data=cigdata)
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) 3.7086 0.04616 80.341 1.105e-53 3.6158 3.8013 49
## I(log(cigtax)) -0.4277 0.06354 -6.732 1.726e-08 -0.5554 -0.3001 49
A one-percent change in state-level cigarette taxes is estimated to be associated with an expected percentage change
in cigarette sales of (0.01)(–0.428) = –0.00428 or –0.428%. (This estimate quantifies the price elasticity of demand for
cigarettes.)
While the exogeneity assumption E(U|X) = 0 is assumed to simplify exposition, this assumption is not necessary for
the purposes of predicting the value of the outcome variable.69 The approaches described below can be applied even
if there is doubt about the exogeneity assumption holding. That is, even in a model where it may not be possible to
establish causality due to failure of the exogeneity assumption, we can still use least-squares estimation for predictive
purposes.
If the values of the explanatory variables are X1 = x1∗ , X2 = x2∗ , …, XK = xK∗ , the new outcome Y ∗ can be written
Y ∗ = α + β1 x1∗ + β2 x2∗ + · · · + βK xK∗ + U ∗ ,
where U ∗ is the population residual associated with Y ∗ . There are two parts of the outcome Y ∗ , the part linearly related
to the explanatory variables, which is the conditional expectation
E(Y ∗ |X1 = x1∗ , X2 = x2∗ , …, XK = xK∗ ) = α + β1 x1∗ + β2 x2∗ + · · · + βK xK∗ ,
and the part unrelated to the explanatory variables, which is the population residual U ∗ . The least-squares estimates
can be used to estimate the conditional-expectation part, with
Ê(Y ∗ |X1 = x1∗ , X2 = x2∗ , …, XK = xK∗ ) = α̂ + β̂1 x1∗ + β̂2 x2∗ + · · · + β̂K xK∗ .
Due to the consistency of least-squares estimation, this estimate of the conditional-expectation part gets arbitrarily
close to the true conditional expectation of Y ∗ as the sample size grows. In this section, we assume that the sample size
n is large enough that we can ignore any estimation imprecision for the conditional-expectation part of Y ∗ , meaning
the uncertainty associated with the asymptotic predictive interval for Y ∗ comes only from the uncertainty associated
with the population residual U ∗ .70
Since Y ∗ = E(Y ∗ |X) + U ∗ , the distribution of Y ∗ conditional on X is the same as the distribution of U ∗ conditional
on X but shifted by an amount equal to E(Y ∗ |X). The shape of the conditional distribution of Y ∗ given X is identical
to the shape of the conditional distribution of U ∗ given X. Therefore, determining a predictive interval for Y ∗ given X
simplifies to determining a predictive interval for U ∗ given X and then adding Ê(Y ∗ |X), which is a consistent estimate
of E(Y ∗ |X). While the appealing properties of the least-squares estimators, including consistency and asymptotic
normality, do not require any distributional assumptions on the population residuals, imposing additional assumptions
can lead to simplified predictive intervals. To illustrate, we’ll focus on a specific distribution assumption, namely the
assumption of normally distributed residuals.
The remainder of this section considers how a predictive interval can be constructed for Y ∗ given X in four different
cases: (i) U ∗ is normally distributed and homoskedastic, (ii) U ∗ is normally distributed and heteroskedastic, (iii) U ∗
has an unspecified distribution that does not depend on X, and (iv) U ∗ has an unspecified distribution that depends
on X.
Case (i): U ∗ is normally distributed and homoskedastic. In this case, the conditional distribution of U ∗ does not
depend upon the explanatory variables and is always N(0, σU2 ), where σU2 is the unconditional variance of U. The 1 – α
probability interval for U ∗ is
(–zα/2 σU , zα/2 σU ).
Based upon the least-squares estimates, this 1 – α probability interval can be consistently estimated by
(–zα/2 σ̂U , zα/2 σ̂U ),
where σ̂U = √( (1/(n–K–1)) Σ_{i=1}^n ûi² ) is the residual standard deviation estimate. Then, the 1 – α asymptotic predictive interval
for Y ∗ , given the values (x1∗ , x2∗ , …, xK∗ ) for the explanatory variables, is
(α̂ + β̂1 x1∗ + β̂2 x2∗ + · · · + β̂K xK∗ – zα/2 σ̂U , α̂ + β̂1 x1∗ + β̂2 x2∗ + · · · + β̂K xK∗ + zα/2 σ̂U ).
Example 18.21 (Monthly stock returns) Suppose we assume homoskedastic and normally distributed residuals in
the MLR model with Home Depot returns (HD) as the outcome variable and S&P 500 index returns (IDX) and
Lowe’s returns (LOW) as the explanatory variables. From Example 18.6, the residual standard deviation estimate is
σ̂U = 0.052. A 95% predictive interval for HD given IDX = 0 and LOW = 0 is
(α̂ – z0.025 σ̂U , α̂ + z0.025 σ̂U ) = (0.004 – (1.96)(0.052), 0.004 + (1.96)(0.052)) ≈ (–0.098, 0.106),
meaning there is a 95% probability that the Home Depot monthly return is between –9.8% and 10.6% when the S&P
500 and Lowe’s returns are equal to zero. Asymptotic predictive intervals can be formed for other choices of the values
of IDX and LOW. For example, if IDX and LOW are both equal to 0.05, the 95% predictive interval for HD is
0.004 + (0.595)(0.05) + (0.384)(0.05) ± (1.96)(0.052) ≈ (–0.049, 0.155),
and if IDX and LOW are both equal to –0.05, the 95% predictive interval for HD is
0.004 + (0.595)(–0.05) + (0.384)(–0.05) ± (1.96)(0.052) ≈ (–0.147, 0.057).
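These intervals can be computed directly from the least-squares estimates; a minimal sketch, assuming results holds the lm_robust(HD~IDX+LOW, data=sp500) fit and using σ̂U = 0.052 from Example 18.6:
b <- coef(results)
sigmahat <- 0.052
# 95% predictive intervals at three (IDX, LOW) value pairs
for (x in list(c(0,0), c(0.05,0.05), c(-0.05,-0.05))) {
  ehat <- b[1] + b[2]*x[1] + b[3]*x[2]  # estimated conditional expectation
  print(round(c(ehat - 1.96*sigmahat, ehat + 1.96*sigmahat), 3))
}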
Case (ii): U∗ is normally distributed and heteroskedastic. In this case, the conditional distribution of U∗ depends
upon the explanatory variables. While this conditional distribution is normal, by assumption, the conditional variance
of U ∗ depends upon the values of the explanatory variables. To provide a predictive interval for U ∗ , then, we need
to explicitly model the conditional variance of U ∗ as a function of the explanatory variables. A particularly simple
approach is to adopt a model with a linear specification like the original MLR model, with
Var(U|X) = γ + δ1 X1 + δ2 X2 + · · · + δK XK .
The drawback to this model is that, depending upon the parameter values, the conditional variance is not guaranteed
to be positive for all values of the explanatory variables. Note that Var(U|X) = E((U – E(U))2 |X) = E(U 2 |X) since
E(U|X) = 0, so that the conditional-variance model becomes
E(U 2 |X) = γ + δ1 X1 + δ2 X2 + · · · + δK XK .
To estimate the parameters (γ, δ1 , δ2 , …, δK ), the least-squares estimator can be applied to a model with ûi² as the
outcome variable and (xi1 , xi2 , …, xiK ) as the explanatory variables.71 Then, the conditional variance of U given X is
consistently estimated by
V̂ar(U|X) = γ̂ + δ̂1 X1 + δ̂2 X2 + · · · + δ̂K XK ,
and the conditional standard deviation of U given X is consistently estimated by
ŝd(U|X) = √(V̂ar(U|X)) = √(γ̂ + δ̂1 X1 + δ̂2 X2 + · · · + δ̂K XK ) if V̂ar(U|X) > 0.
Then, the estimated 1 – α probability interval for U∗ , given values (x1∗ , x2∗ , …, xK∗ ) for the explanatory variables, is
( –zα/2 √(γ̂ + δ̂1 x1∗ + · · · + δ̂K xK∗ ), zα/2 √(γ̂ + δ̂1 x1∗ + · · · + δ̂K xK∗ ) ).
The 1 – α asymptotic predictive interval for Y∗ , given values (x1∗ , x2∗ , …, xK∗ ) for the explanatory variables, is
( α̂ + β̂1 x1∗ + · · · + β̂K xK∗ – zα/2 √(γ̂ + δ̂1 x1∗ + · · · + δ̂K xK∗ ),
  α̂ + β̂1 x1∗ + · · · + β̂K xK∗ + zα/2 √(γ̂ + δ̂1 x1∗ + · · · + δ̂K xK∗ ) ).
Example 18.22 (Birthweight data) The dataset births contains information on 50,249 births in the United States
during the month of December 2021. Birth outcomes are of interest to health economists since adverse birth outcomes,
like low birthweight, can be associated with high healthcare costs. The MLR model has the outcome variable bweight,
the baby’s birthweight measured in grams, modeled as a function of the following explanatory variables:
age = mother’s age (in years)
hsgrad = 1 if mother is high-school grad and not beyond, 0 otherwise
somecoll = 1 if mother has some college but not college grad, 0 otherwise
collgrad = 1 if mother is a 4-year college grad, 0 otherwise
married = 1 if mother is married, 0 otherwise
smoke = 1 if mother smoked during pregnancy, 0 otherwise
male = 1 if baby is male, 0 otherwise
This model specification has three indicator variables associated with educational-attainment categories, with the
omitted category being nonhsgrad (non-high school graduates). Their estimates should therefore be interpreted as
differences from non-high school graduates.
For the conditional-variance model, we use a specification that, like the MLR model, is a linear function of the
explanatory variables:
Var(U|X) = E(U 2 |X) = γ + δ1 age + δ2 hsgrad + δ3 somecoll + δ4 collgrad + δ5 married + δ6 smoke + δ7 male.
To estimate the parameters of the conditional-variance model, the least-squares estimates (α̂, β̂1 , …, β̂7 ) of the MLR
model are used to construct the estimated residuals ûi for i = 1, …, n. Then, the û2i values for i = 1, …, n are obtained by
squaring each ûi . The squared estimated residuals û2i are used as the outcome variable, and least-squares estimation
yields the estimates (γ̂, δ̂1 , …, δ̂7 ) of the parameters for the conditional-variance model.
The following table shows the least-squares estimates of the MLR model side-by-side with the estimates of the
conditional-variance model. Standard errors are reported in parentheses, and the additional columns provide the
p-values for testing the MLR slopes (βk ’s) against zero and the conditional-variance slopes (δk ’s) against zero.
                 MLR model:           p-value for              Var(U|X) model:        p-value for
                 estimate (s.e.)      H0 : βk = 0              estimate (s.e.)        H0 : δk = 0
α (intercept) 3262.93 (28.13) γ (intercept) 194758.2 (28661.5)
β1 (age) –5.377 (0.828) 0.000 δ1 (age) 3889.3 (850.5) 0.000
β2 (hsgrad) 32.69 (16.57) 0.048 δ2 (hsgrad) –13472.8 (17101.7) 0.431
β3 (somecoll) 66.59 (16.06) 0.000 δ3 (somecoll) –16022.7 (16554.2) 0.333
β4 (collgrad) 92.56 (15.77) 0.000 δ4 (collgrad) –59650.4 (16193.9) 0.000
β5 (married) 44.73 (5.15) 0.000 δ5 (married) –14443.6 (5270.0) 0.006
β6 (smoke) –168.68 (19.94) 0.000 δ6 (smoke) 42771.7 (19780.7) 0.031
β7 (male) 99.39 (4.68) 0.000 δ7 (male) 29592.6 (4774.4) 0.000
Looking at the conditional-variance model estimates, the p-values indicate that age, collgrad, married, smoke, and
male all have statistically significant associations, at a 5% level, with the conditional variance of the MLR residual.
These p-values provide strong evidence that the residual variance depends upon the explanatory variables and,
therefore, that the MLR model residuals are heteroskedastic. For the age variable, the estimate δ̂1 = 3889.3 means that
a one-year increase in age is associated with an estimated increase in the residual variance of 3889.3, holding all
other variables fixed; since birthweight is measured in grams, the residual variance and this estimate are in units of
grams squared. For the smoke variable, the estimate δ̂6 = 42771.7 means that the residual variance for a mother who
smokes during pregnancy is estimated to be 42771.7 larger than the residual variance for a non-smoking mother,
holding all other variables fixed.
The MLR model and Var(U|X) model estimates can be used together to provide a predictive interval for birthweight
based on any specific values of the explanatory variables. For instance, consider a 30-year-old mother (age = 30) who
is a college graduate (collgrad = 1), is married (married = 1), doesn’t smoke during pregnancy (smoke = 0), and has a
male child (male = 1). The estimated conditional expectation of birthweight, based upon the MLR model, is
3262.93 + (–5.377)(30) + (92.56)(1) + (44.73)(1) + (99.39)(1) ≈ 3338.3.
The estimated conditional variance of the MLR residual, based upon the Var(U|X) model, is
194758.2 + (3889.3)(30) + (–59650.4)(1) + (–14443.6)(1) + (29592.6)(1) ≈ 266934.9.
Therefore, the estimated conditional standard deviation of the MLR residual is
√266934.9 ≈ 516.7,
and a 95% predictive interval for birthweight, given these values of the explanatory variables, is
(3338.3 – (1.96)(516.7), 3338.3 + (1.96)(516.7)) ≈ (2325.6, 4351.0).
The R code below, a sketch assuming the births data frame has been loaded and the estimatr package installed, calculates the estimates for the MLR model and Var(U|X) model reported above, along with the predictive interval:
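library(estimatr)
# MLR model for birthweight
mlr <- lm_robust(bweight ~ age + hsgrad + somecoll + collgrad + married +
                   smoke + male, data = births)
summary(mlr)
# conditional-variance model: regress the squared residuals on the same variables
births$uhat_sq <- (births$bweight - predict(mlr, newdata = births))^2
varmod <- lm_robust(uhat_sq ~ age + hsgrad + somecoll + collgrad + married +
                      smoke + male, data = births)
summary(varmod)
# 95% predictive interval for a 30-year-old married college-graduate
# non-smoking mother with a male child
newobs <- data.frame(age = 30, hsgrad = 0, somecoll = 0, collgrad = 1,
                     married = 1, smoke = 0, male = 1)
yhat <- predict(mlr, newdata = newobs)             # approximately 3338.3
sdhat <- sqrt(predict(varmod, newdata = newobs))   # approximately 516.7
c(yhat - qnorm(0.975) * sdhat, yhat + qnorm(0.975) * sdhat)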
Case (iii): U ∗ has an unspecified distribution that does not depend on X. In this case, the conditional distribution
of U ∗ given X is the same as the unconditional distribution of U ∗ since it is assumed that the distribution of U ∗ does
not depend on X. Without the assumption of normality, the quantiles of the U ∗ distribution can’t be determined using
the zα/2 critical values for the normal distribution. Instead, we can directly estimate the desired quantiles of the U ∗
distribution using the corresponding sample quantiles of the estimated residuals ûi . Let v̂q denote the sample q-th
quantile of the distribution of the estimated residuals {û1 , û2 , …, ûn }. Then, the estimated 1 – α probability interval for
U ∗ is
(v̂α/2 , v̂1–α/2 ),
and the estimated 1 – α asymptotic predictive interval for Y ∗ , given the values (x1∗ , x2∗ , …, xK∗ ) for the explanatory
variables, is
(α̂ + β̂1 x1∗ + β̂2 x2∗ + · · · + β̂K xK∗ + v̂α/2 , α̂ + β̂1 x1∗ + β̂2 x2∗ + · · · + β̂K xK∗ + v̂1–α/2 ).
Example 18.23 (Monthly stock returns) Re-visiting Example 18.21, we consider predictive intervals for Home Depot
returns (HD), based upon S&P 500 index returns and Lowe’s returns, but without assuming normality of the residuals.
The estimated residuals {û1 , û2 , …, ûn } have sample 2.5% and 97.5% quantiles
v̂0.025 = –0.113 and v̂0.975 = 0.104,
which can be used to construct 95% predictive intervals for HD. When IDX = LOW = 0, the 95% predictive interval
for HD is
(0.004 – 0.113, 0.004 + 0.104) ≈ (–0.109, 0.108).
When IDX = LOW = 0.05, the 95% predictive interval for HD is
(0.004 + (0.595)(0.05) + (0.384)(0.05) – 0.113, 0.004 + (0.595)(0.05) + (0.384)(0.05) + 0.104) ≈ (–0.060, 0.157).
When IDX = LOW = –0.05, the 95% predictive interval for HD is
(0.004 + (0.595)(–0.05) + (0.384)(–0.05) – 0.113, 0.004 + (0.595)(–0.05) + (0.384)(–0.05) + 0.104) ≈ (–0.158, 0.059).
These predictive intervals are quite similar to those found in Example 18.21 under the assumption of normality. To
construct predictive intervals for other confidence levels, we use different sample quantiles of the estimated residuals.
For example, for 90% predictive intervals, the appropriate sample quantiles are v̂0.05 and v̂0.95 , which are –0.087 and
0.083, respectively.
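A minimal sketch of this residual-quantile approach, assuming the monthly-returns data are in a data frame named returns with variables HD, IDX, and LOW (the data frame name here is illustrative):

library(estimatr)
mlr <- lm_robust(HD ~ IDX + LOW, data = returns)
uhat <- returns$HD - predict(mlr, newdata = returns)   # estimated residuals
v <- quantile(uhat, c(0.025, 0.975))                   # sample residual quantiles
# 95% predictive interval for HD when IDX = LOW = 0.05
yhat <- predict(mlr, newdata = data.frame(IDX = 0.05, LOW = 0.05))
c(yhat + v[1], yhat + v[2])
# for 90% predictive intervals, use quantile(uhat, c(0.05, 0.95)) instead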
Case (iv): U ∗ has an unspecified distribution that depends on X. This case is the most general, as it involves no
assumptions on the conditional distribution of U ∗ given X and, therefore, is always applicable. Unfortunately, this case
is also the most difficult to handle and requires methods beyond the scope of this book. While we used (unconditional)
quantile estimates of the residuals for case (iii), the basic idea is that conditional quantile estimates of the residuals
should be used in this general case. That is, what is the q-th quantile of U ∗ given the values (x1∗ , x2∗ , …, xK∗ ) for the
explanatory variables? This q-th conditional quantile can be modeled as72
vq(U∗|X) = γ^(q) + δ1^(q) X1 + δ2^(q) X2 + · · · + δK^(q) XK ,
where vq (U ∗ |X) represents the q-th quantile of U ∗ given X. Estimation of this model, for any q ∈ (0, 1), requires
a method called quantile regression. If the parameters are estimated consistently for q = α/2 and q = 1 – α/2, the
estimated 1 – α predictive interval of Y ∗ , given the values (x1∗ , x2∗ , …, xK∗ ) for the explanatory variables, is
(α̂ + β̂1 x1∗ + · · · + β̂K xK∗ + γ̂^(α/2) + δ̂1^(α/2) x1∗ + · · · + δ̂K^(α/2) xK∗ ,
α̂ + β̂1 x1∗ + · · · + β̂K xK∗ + γ̂^(1–α/2) + δ̂1^(1–α/2) x1∗ + · · · + δ̂K^(1–α/2) xK∗ ).
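Although quantile regression is beyond the scope of this book, the quantreg package implements it. The sketch below follows the alternative mentioned in note 72, modeling the conditional quantiles of Y directly, and again assumes the illustrative returns data frame:

library(quantreg)
# conditional 2.5% and 97.5% quantiles of HD given IDX and LOW
q_lo <- rq(HD ~ IDX + LOW, tau = 0.025, data = returns)
q_hi <- rq(HD ~ IDX + LOW, tau = 0.975, data = returns)
# 95% predictive interval for HD when IDX = LOW = 0.05
newobs <- data.frame(IDX = 0.05, LOW = 0.05)
c(predict(q_lo, newdata = newobs), predict(q_hi, newdata = newobs))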
A simple version of the conditional Bernoulli model is the linear probability model (LPM), where the conditional
probability of Y = 1 given X is modeled as a linear function of the explanatory variables, similar to the MLR model:
P(Y = 1|X) = α + β1 X1 + β2 X2 + · · · + βK XK .
The interpretation of the LPM parameters is most easily understood in terms of conditional probabilities:
• Meaning of the intercept α: α = P(Y = 1|X1 = X2 = · · · = XK = 0), which is the conditional probability of Y = 1 when
all explanatory variables are equal to zero. Whether α has a practical interpretation depends on whether zero is a
relevant value for each of the explanatory variables.
• Meaning of the slope parameters: Consider the simplest case where each explanatory variable Xk only enters
into the βk Xk term, which rules out polynomials and interaction variables. Similar to the approach for MLR
slope parameters, consider specific values (x1∗ , x2∗ , …, xK∗ ) for the explanatory variables. Then, the partial effect of
increasing Xk by one unit, from xk∗ to xk∗ + 1, is to change the conditional probability P(Y = 1|X) by βk . For instance,
if β2 = 0.03, the partial effect of increasing X2 by one unit is to change P(Y = 1|X) by 0.03 or 3 percentage points.
Interpretation of partial effects in the presence of other types of variables, like categorical variables, polynomial
variables, or interaction variables, is similar to the interpretation for MLR parameters (Section 18.6), except that the
partial effects are interpreted in terms of P(Y = 1|X).
Since E(Y|X) = P(Y = 1|X), the LPM can be re-written as
E(Y|X) = α + β1 X1 + β2 X2 + · · · + βK XK .
This equation has exactly the same form as the MLR model and, therefore, least-squares estimation can be used to
estimate the LPM parameters. Let (α̂, β̂1 , β̂2 , …, β̂K ) denote the least-squares estimates of the LPM parameters, and let
se(α̂) and se(β̂k ), for k ∈ {1, 2, …, K}, denote the corresponding heteroskedasticity-robust standard errors. In addition
to providing estimates of the partial effects discussed above, the least-squares estimates also provide estimates of the
in-sample predicted probabilities. Specifically, the fitted values from least-squares estimation are consistent estimates
of the conditional probability of Y = 1 given the observed values of the explanatory variables:
ŷi = α̂ + β̂1 xi1 + β̂2 xi2 + · · · + β̂K xiK = P̂(Y = 1|X1 = xi1 , X2 = xi2 , …, XK = xiK ).
A drawback of the LPM is that it can imply conditional probabilities that are less than zero and/or greater than
one when there are continuous explanatory variables. As a simple example, consider the case of a single explanatory
variable X1 , so that P(Y = 1|X1) = α + β1 X1 . If β1 ≠ 0 and X1 is a continuous variable whose values may extend
without bound in both directions, it must be the case that P(Y = 1|X1) < 0 for some range of X1 and P(Y = 1|X1) > 1
for some other range of X1 . To formally deal with this issue, practitioners sometimes use a non-linear model for
P(Y = 1|X) that restricts the probability to be strictly between zero and one. (The most popular versions of the
non-linear model are the probit model and the
logit model.) That said, the LPM is frequently used since it leads to easily interpretable estimates of partial effects and,
even for many empirical applications with continuous explanatory variables, most or all of the estimated in-sample
probabilities fall between zero and one.73
Example 18.24 (Widget website) We consider the use of the LPM for A/B testing when the outcome variable is binary.
Specifically, let Y be the binary outcome indicating whether a purchase is made (Y = 1) or not (Y = 0). The explanatory
variables are emailA, which is a binary variable indicating whether a user receives e-mail A (emailA = 1) or not
(emailA = 0), and emailB, which is a binary variable indicating whether a user receives e-mail B (emailB = 1) or not
(emailB = 0). The omitted category, corresponding to emailA = emailB = 0, is for users who do not receive an e-mail
and, therefore, are in the control group. As described in Example 2.1, there are a total of n = 3000 observations, of
which 300 have emailA = 1, 300 have emailB = 1, and 2,400 have emailA = emailB = 0. Of the emailA = 1 observations,
20% (or 60 out of 300) have Y = 1 and 80% have Y = 0. Of the emailB = 1 observations, 22% (or 66 out of 300) have
Y = 1 and 78% have Y = 0. And, of the emailA = emailB = 0 observations, 15% (or 360 out of 2,400) have Y = 1 and 85%
have Y = 0. The dataset widgets contains the emailA, emailB, and purchase (Y) variables for the 3,000 observations.
The LPM is P(Y = 1|emailA, emailB) = α + β1 emailA + β2 emailB. The least-squares estimates of the LPM parameters
are provided in the following table, along with standard errors, z-statistics, and p-values for the z-test of each
parameter being equal to zero.
Estimate Standard error z-statistic p-value
α 0.1500 0.0073 20.58 0.000
β1 0.0500 0.0243 2.06 0.039
β2 0.0700 0.0250 2.80 0.005
The parameter estimates are as expected, with
P̂(Y = 1|emailA = emailB = 0) = α̂ = 0.15,
P̂(Y = 1|emailA = 1) = α̂ + β̂1 = 0.15 + 0.05 = 0.20,
and
P̂(Y = 1|emailB = 1) = α̂ + β̂2 = 0.15 + 0.07 = 0.22.
The estimates β̂1 = 0.05 and β̂2 = 0.07 indicate that e-mail A recipients and e-mail B recipients are 5 percentage points
and 7 percentage points more likely, respectively, than non-recipients to make a purchase. The p-value of 0.039 for
H0 : β1 = 0 indicates that, at a 5% level, there is a statistically significant difference between the purchase probability
for e-mail A recipients and the control group. Similarly, the p-value of 0.005 for H0 : β2 = 0 indicates that, for any level
above 0.5%, there is a statistically significant difference between the purchase probability for e-mail B recipients and
the control group. A z-test of H0 : β1 = β2 , which can be conducted by either making e-mail A recipients the omitted
category or by directly calculating se(β̂1 – β̂2 ), has a p-value of 0.548. Therefore, there is no statistically meaningful
difference between the purchase probabilities of e-mail A recipients and e-mail B recipients since H0 : β1 = β2 cannot
be rejected at any reasonable level.
The following R code, a sketch assuming the widgets data frame has been loaded, calculates the LPM estimates above, with the p-value for the H0 : β1 = β2 test obtained by making e-mail A recipients the omitted category:
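library(estimatr)
lpm <- lm_robust(purchase ~ emailA + emailB, data = widgets)
summary(lpm)   # estimates, standard errors, z-statistics, and p-values
# test H0: beta1 = beta2 by making e-mail A recipients the omitted category;
# the slope on emailB then estimates beta2 - beta1
widgets$noemail <- 1 - widgets$emailA - widgets$emailB
lpm2 <- lm_robust(purchase ~ emailB + noemail, data = widgets)
summary(lpm2)  # p-value on emailB tests H0: beta1 = beta2 (approximately 0.548)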
Example 18.25 (Union membership) The previous examples using the cps dataset considered weekly earnings as the
outcome variable. In this example, we instead consider union membership, measured by the indicator variable union,
as the outcome variable. The least-squares estimates of the LPM describe how other variables are associated with,
and can be used to predict, union membership for the sample of 2,809 employed individuals. The following R code,
a sketch assuming the cps data frame has been loaded, provides the least-squares estimates for one such LPM, with
explanatory variables educ, exper, exper2 (the square of exper), and female:
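library(estimatr)
cps$exper2 <- cps$exper^2   # construct exper2 if not already present
lpm_union <- lm_robust(union ~ educ + exper + exper2 + female, data = cps)
summary(lpm_union)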
We can also use the LPM estimates to directly predict the probability of union membership for any specific values
of the explanatory variables. For example, for a female worker (female = 1) with 12 years of education (educ = 12) and
15 years of experience (exper = 15, exper2 = 225), the predicted probability of union membership is
–0.1344 + (0.0099)(12) + (0.00843)(15) + (–0.00011)(225) + (–0.0610)(1) ≈ 0.0249 or 2.49%,
with a standard error of 0.0097 (or 0.97%).
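This predicted probability and its standard error can be obtained directly from the fitted model; a sketch continuing from the code above:

newobs <- data.frame(educ = 12, exper = 15, exper2 = 225, female = 1)
predict(lpm_union, newdata = newobs, se.fit = TRUE)
# fit is approximately 0.0249, with se.fit approximately 0.0097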
Notes
60 For two explanatory variables (K = 2), the graph is three-dimensional. The MLR conditional expectation, E(Y|X) = α + β1 X1 + β2 X2 , is a
two-dimensional plane that extends forever in three-dimensional space. (Think of a sheet of paper that may be tilted and extends forever.) If y is measured
vertically, positive residuals are associated with data points that are above the MLR plane, negative residuals are associated with data points that are
below the MLR plane, and the magnitude of a residual is the vertical distance from the data point to the plane.
61 If either Assumption MLR-VarX or Assumption MLR-NPC is violated, the minimization problem has an infinite number of possible solutions.
For instance, if x1 has zero variance, there are an infinite number of combinations of a and b1 that could be chosen to minimize S(a, b1 , b2 , …, bK ).
62 Six years of age is roughly when children in the United States start their education, so the experience variable approximates the number of
post-education years. The variable is not a perfect measure of experience since it doesn’t take into account any period(s) of unemployment.
63 Although not explicitly stated in Proposition 18.2, additional technical assumptions are required to prove the result. Specifically, we require
that U has a finite variance, each component of X has a finite variance, and the covariance between any two components of X is finite.
64 sx̃k can be calculated as the sample standard deviation of the estimated residuals from least-squares estimation of a model that has xk as the
outcome variable and all other x variables as the explanatory variables, which corresponds to the decomposition described in Proposition 18.6.
65 Alternatively, lm_robust(earnwk~marstatus, data=cps) automatically creates three indicator variables from marstatus.
66 An alternative approach is to use derivatives to approximate the change in E(Y|X). A small change dx1 in X1 , from x1∗ to x1∗ + dx1 , is associated
with a change of E(Y|X) equal to (β1 + 2β2 x1∗)dx1 since ∂E(Y|X)/∂x1 = β1 + 2β2 x1∗.
67 For the derivative approach, a small change dx1 in X1 , from x1∗ to x1∗ + dx1 , is associated with a change of E(Y|X) equal to
(β1 + 2β2 x1∗ + 3β3 x1∗2)dx1 since ∂E(Y|X)/∂x1 = β1 + 2β2 x1∗ + 3β3 x1∗2.
68 This approximation is based upon the derivative of the natural logarithm, d ln(Y)/dY = 1/Y.
69 Proposition 17.11 can be generalized to the case of multiple explanatory variables, with the decomposition of Y into a linear function of the
explanatory variables and a random variable uncorrelated with the explanatory variables: Y = α∗ + β1∗ X1 + β2∗ X2 + · · · + βK∗ XK + V, with Cov(Xk , V) = 0
for k ∈ {1, 2, …, K} and E(V) = 0. For this decomposition, the least-squares estimates (α̂, β̂1 , β̂2 , …, β̂K ) consistently estimate (α∗ , β1∗ , β2∗ , …, βK∗ )
even if the exogeneity assumption of the MLR model doesn't hold.
70 For small sample sizes, where there may be imprecision in the estimate of E(Y ∗ |X), the resulting asymptotic predictive interval is not wide
enough. One approach for gauging whether imprecision in the estimate of E(Y ∗ |X) affects the predictive interval is to utilize the bootstrap from
Chapter 15. Specifically, a predictive interval can be constructed based upon each bootstrap sample to see how much the predictive interval varies
over bootstrap samples. If the estimates of E(Y ∗ |X) are very precise, there should be little difference in the predictive intervals over bootstrap
samples.
71 There are alternative conditional-variance models that guarantee positive estimated conditional variances. One oft-used model is the nonlinear
(exponential) model
E(U2 |X) = eγ+δ1 X1 +δ2 X2 +···+δK XK ,
for which the parameters (γ, δ1 , δ2 , …, δK ) can be estimated by many statistical packages. With consistent estimates (γ̂, δ̂1 , δ̂2 , …, δ̂K ), the
estimated conditional standard deviation of U∗ given (x1∗ , x2∗ , …, xK∗ ) is √(eγ̂+δ̂1 x1∗ +···+δ̂K xK∗).
72 An alternative approach is to directly model the conditional quantiles vq(Y|X) rather than using the E(Y|X) model at all.
73 With only discrete explanatory variables, P(Y = 1|X) may be guaranteed to be between zero and one. A simple example is an LPM with a single
binary variable X1 , for which P(Y = 1|X1 = 0) = α and P(Y = 1|X1 = 1) = α + β.
Exercises
1. Use the widgets dataset for this question. These data are for 3,000 users, 300 of whom receive e-mail A (emailA = 1),
300 of whom receive e-mail B (emailB = 1), and 2,400 of whom receive neither (emailA = emailB = 0). The outcome
variable of interest is amount, which is the total amount purchased (in dollars) by the user.
(a) How many users have amount = 0?
(b) What are the sample averages of amount for the three subsamples corresponding to e-mail A recipients, e-mail
B recipients, and non-recipients?
(c) Use lm_robust to estimate the multiple regression of amount on emailA and emailB. Interpret the intercept
estimate and the two slope estimates. How do these estimates relate to the sample averages in (b)?
(d) What is the p-value for the z-test of H0 : βemailA = 0? What do you conclude from this p-value?
(e) Create a binary variable nonrecipient equal to 1 for non-recipients and 0 for e-mail A and e-mail B recipients.
Re-run the regression using emailA and nonrecipient as the explanatory variables. Test whether there is a
significant difference, at the 5% level, between average purchases for e-mail A users and e-mail B users.
2. Use the metricsgrades dataset for this question. These data are from a graduate econometrics course with 68
students, containing the following variables:
total = overall composite course grade (out of 100 points)
gre_quant = score on GRE quantitative test (out of 170 points)
gre_verbal = score on GRE (English) verbal test (out of 170 points)
domestic = 1 if domestic (U.S.) student, 0 otherwise
(a) Provide the sample correlation matrix for the four variables. Which variable has the largest correlation (in
magnitude) with total?
(b) Use lm_robust to estimate the multiple regression with total as the outcome variable and the other three
variables as explanatory variables.
(c) Interpret the estimate of βgre_quant .
(d) What is the estimated conditional expectation of total for a non-domestic student with a GRE quantitative score
of 160 and a GRE verbal score of 150?
(e) What is the estimated standard deviation of the regression model’s residual?
(f) Test H0 : βdomestic = 0 at a 10% level. What do you conclude?
(g) Drop domestic from the regression and re-run it. How do the results compare to the original regression?
(h) Now put domestic back in the regression and instead drop gre_verbal. Re-run the regression. What happens to
the statistical significance of domestic, and why?
3. Use the cigdata dataset for this question. Example 18.8 provided the results from a regression of cigsales on cigtax
and producer.
(a) Add the variable price_pack (equal to the total price per pack) to the model and re-run the regression using
lm_robust.
(b) How does the R-squared value of this regression compare to the R-squared value of the regression without the
price_pack variable?
(c) What happens to the statistical significance of the slope on the state tax (cigtax)?
(d) Considering the correlation between cigtax and price_pack, explain the result in (c).
(e) Now drop the state-tax (cigtax) variable, and re-run the regression with price_pack and producer as the
explanatory variables. How do the results compare to the regression in Example 18.8?
(f) Do you prefer the MLR model with cigtax and producer or the MLR model with price_pack and producer?
4. Use the mutualfunds dataset for this question. The sample, consisting of 206 mutual funds categorized as “Large
Blend Equity” by Morningstar, includes the following variables:
return_10yr = ten-year annualized return
expense_ratio = annual fee (e.g., 0.005 is an annual fee of 0.5%)
manager_tenure = tenure of current fund manager (in years)
fund_age = age of fund (in years)
load = "Y" if fund has a sales charge, "N" otherwise
(a) Add the binary variable hasload, equal to 1 if the fund has a sales charge and 0 otherwise, to the data frame.
(b) Use lm_robust to estimate the multiple regression with return_10yr as the outcome variable and the four
explanatory variables expense_ratio, manager_tenure, fund_age, and hasload.
(c) Interpret the R-squared value.
(d) If the expense_ratio value increases by 0.001 (0.1%), what does the estimate of βexpense_ratio imply about the
conditional expectation of return_10yr?
(e) Do any of the variables appear to be statistically significant at a 5% level? If so, which one(s)?
(f) Provide a 90% asymptotic confidence interval for βhasload .
(g) *You are considering dropping the fund_age and manager_tenure variables from the regression. You are
worried about multicollinearity, so you want to test them jointly. Use the test_linear_restrictions
function to determine the p-value for the test of H0 : βfund_age = βmanager_tenure = 0.
5. Use the congress dataset for this question. The data consist of Congressional election outcomes in the United States
between 1948 and 1990. For this question, you will focus on the subsample of 476 Congressional district elections that
occurred in 1990. Each election is between a Democrat and a Republican, where demvoteshare (between 0 and 1) gives
the fraction of votes received by the Democrat, meaning that the Democrat won the election if demvoteshare > 0.5. The
explanatory variables of interest are:
medianincome = median income within the district
pcturban = fraction (between 0 and 1) of district residents who live in an urban area
pctblack = fraction (between 0 and 1) of district residents who are black
pcthighschl = fraction (between 0 and 1) of district residents who have a high-school degree
(a) Plot the histogram of demvoteshare. Is the distribution unimodal or bimodal?
(b) The variable democrat is equal to 1 if the Democrat won and 0 if the Republican won. Of the 476 elections in
1990, what fraction were won by Democrats?
(c) Provide the sample averages of the explanatory variables separately for elections won by Democrats and
elections won by Republicans.
(d) Run the necessary simple linear regressions (with democrat as the explanatory variable) to test, at a 5% level,
whether each of the explanatory variables has a different population mean in the two subsamples.
(e) Use lm_robust to estimate the multiple regression with demvoteshare as the outcome variable and the four
explanatory variables above.
(f) Interpret the slope estimate for medianincome, thinking about a $1,000 change.
(g) Interpret the slope estimate for pcturban, thinking about a change of 10 percentage points.
(h) Add lagdemocrat, which is equal to 1 if the Democrat won the previous election and 0 otherwise, and re-run
the regression. How do the results change?
(i) Interpret the slope estimate for lagdemocrat in the regression in (h), and provide an asymptotic 95% confidence
interval for βlagdemocrat .
(j) Without actually doing it, imagine re-running the regression in (e) using repvoteshare = 1 – demvoteshare as
the outcome variable and the same explanatory variables. Describe how the slope estimates, their z-statistics,
and their p-values would change.
6. Use the hrs dataset for this question. The data consist of 6,052 non-married individuals who are 50 and older.
For this question, focus on the subsample of 3,983 individuals with positive (non-zero) out-of-pocket medical costs
(outofpocket_costs, in dollars) during 2000. The explanatory variables of interest are:
age = individual’s age (in years)
ins_none = 1 if individual has no health insurance, 0 otherwise
ins_medicare = 1 if individual has Medicare insurance, 0 otherwise
male = 1 if individual is male, 0 otherwise
(a) Use lm_robust to estimate the multiple regression with outofpocket_costs as the outcome variable and the
four explanatory variables above. Interpret the slope estimates for age and ins_none.
(b) Draw a histogram of outofpocket_costs.
(c) Create a new variable, ln_oopc, equal to the natural logarithm of outofpocket_costs.
iv. *Use the linear_combination function to determine the standard errors for the two partial effects
in (e)(iii).
(f) As a benchmark, actual inflation was 3.2% in 2006, 2.9% in 2007, and 3.8% in 2008. Therefore, a sensible
forecast for inflation should probably fall in the 2% to 5% range. Create a binary variable accurate equal to 1 if
2 ≤ inflation_pred ≤ 5 and 0 otherwise. Run an LPM regression with accurate as the outcome variable, using
the same explanatory variables as in (a). Interpret the slope estimates for finlit_score and collgrad.
10. You have a sample of 1,000 graduating seniors from a certain university. The outcome variable y is equal to 1 if the
student has a job offer and 0 otherwise. The explanatory variable x is equal to 1 if the student is an economics major
and 0 otherwise. The joint sample counts are given by the following table:
                          econ (x)
                        0        1
  offer (y)    0      270       20
               1      630       80
(a) For the LPM model P(Y = 1|X) = α + βX, what are the least-squares estimates α̂ and β̂? Use only the table above
to answer this part.
(b) Interpret the slope estimate β̂.
(c) In R, create a data frame with 1,000 rows and 2 columns that corresponds to the table of joint sample counts
above. Use lm_robust to confirm the answer to (a). What is the p-value for testing H0 : β = 0?
11. Use the births dataset considered in Example 18.22. Since the healthcare costs associated with births are mostly
concentrated on babies with low birthweight, public health researchers and economists use a specific definition of “low
birthweight” that corresponds to babies having birthweight less than 2500 grams.
(a) Create a new variable lowbwt equal to 1 if bweight is less than 2500 and 0 otherwise.
(b) Run an LPM regression with lowbwt as the outcome, using the explanatory variables from Example 18.22.
(c) What are the highest and lowest LPM fitted values (predicted probabilities)?
(d) Plot the LPM fitted values (predicted probabilities) versus age.
(e) Interpret the LPM slope estimate for smoke.
(f) What do the LPM results say about the difference in low-birthweight probabilities for college-graduate mothers
(collgrad = 1) versus high-school graduate mothers (hsgrad = 1), holding all other variables fixed?
(g) What is the p-value associated with testing that there is no difference in (f)? For this part, either test the linear
combination directly or re-run the LPM with a different omitted education category.
(h) Provide a 95% confidence interval for the difference in low-birthweight probabilities between a 30-year-old
mother and a 25-year-old mother, holding all other variables fixed. Is the difference statistically significant at a
5% level?
(i) Add a quadratic age variable (age squared) to the LPM and re-run the regression. Provide a 95% confidence
interval for the difference in low-birthweight probabilities between a 30-year-old mother and a 25-year-old
mother, holding all other variables fixed. Is the difference statistically significant at a 5% level?
(j) For the LPM regression in (i), you should see very high p-values for the two age-variable slopes. Does this
result suggest that the two age variables should be dropped from the model? Explain why or why not.
(k) *For the LPM regression in (i), use the test_linear_restrictions function to determine the p-value
for testing that the slopes for the two age variables are both equal to zero.
12. *Use the brands dataset for this question. Refer to Exercise 16.12. Run an LPM regression with purchase as the
outcome variable and four indicator variables (for four of the five possible values of last_brand) as explanatory
variables. Use the regression results and the test_linear_restrictions function as necessary to answer parts
(b) and (e) of Exercise 16.12.