
MA1202 Introductory Statistics

Lecturer: Tatiana Tyukina

Lecture notes
written by Tomasz Kosmala and Tatiana Tyukina

6th February 2023


Foreword
I wrote these lecture notes using the slides from Tatiana Tyukina, who lectured this course
in previous years. I also used lecture notes by Markus Riedle for teaching Probability
Theory at King’s College London, and lecture notes that I received from Sameer Murthy for
teaching Probability and Statistics I and from George Deligiannidis for teaching Probability
and Statistics II at KCL. I also used N. A. Weiss, Introductory Statistics, Pearson, 2017 and
D. Wackerly, W. Mendenhall, and R. L. Scheaffer, Mathematical Statistics with Applications,
7th edition, 2008.
This is the first year that we use lecture notes for teaching statistics at Leicester, so they
may contain numerous typos and mistakes. I would greatly appreciate it if you sent any
comments and corrections to tt51@leicester.ac.uk, so that we can correct them for
everyone’s benefit. The pdf file on Blackboard will be updated throughout the term, so I
suggest not printing it. The date of the most recent update is on the first page.
The notes include numerous Examples and Exercises, some of them with solutions. Similar
problems will be assigned for you to solve every week and discuss in the feedback sessions.
Tomasz Kosmala

Contents

1 Introduction and reminder on probability distributions
  1.1 Introduction
    1.1.1 Qualitative and quantitative variables
    1.1.2 Random Variable
  1.2 Discrete Probability Distributions
  1.3 Continuous Probability Distributions
  1.4 Quantiles
  1.5 Moments
  1.6 Moment-Generating Functions

2 Descriptive Statistics
  2.1 Graphical representation of Data
    2.1.1 Distribution Shapes
  2.2 Descriptive Measures
    2.2.1 Measures of central tendency
    2.2.2 Measures of spread

3 Estimators
  3.1 Point Estimators
    3.1.1 Method of Moments
    3.1.2 Method of Maximum Likelihood
    3.1.3 Expectation and variance of the sample mean
  3.2 Properties of Point Estimators
    3.2.1 Unbiasedness
    3.2.2 Efficiency
    3.2.3 Sufficiency
    3.2.4 Consistency
  3.3 Probability distributions
    3.3.1 Gamma Distribution
    3.3.2 χ² distribution
    3.3.3 Student’s t-distribution
    3.3.4 F-distribution
  3.4 Interval Estimation
    3.4.1 Definition
    3.4.2 Pivots
    3.4.3 Confidence intervals based on the normal distribution
    3.4.4 Confidence intervals based on the t-distribution
    3.4.5 Confidence intervals for variance
    3.4.6 Confidence intervals for binomial random variables
    3.4.7 Sample Size
    3.4.8 Interval Estimation for two population means

4 Hypothesis testing
  4.1 Definitions
    4.1.1 Hypotheses, test statistic, rejection region
    4.1.2 Z-test for sample mean
    4.1.3 Errors when testing hypotheses
    4.1.4 p-value
  4.2 List of most common statistical tests
    4.2.1 t-test
    4.2.2 Testing for proportions (small sample)
    4.2.3 Testing for variance
    4.2.4 Tests for paired samples

5 Goodness of Fit
  5.1 For continuous random variables
  5.2 Probability Distributions in R
  5.3 Plots of the densities and probability mass functions

Bibliography

Chapter 1

Introduction and reminder on probability distributions

1.1 Introduction
What is statistics? The Cambridge Dictionary defines it as follows:

Statistics is the science of collecting and studying numbers that give information about
particular situations or events.

and Investopedia.com says:

Statistics is a form of mathematical analysis that uses quantified models, representations and
synopses for a given set of experimental data or real-life studies. Statistics studies
methodologies to gather, review, analyze and draw conclusions from data.

We all have some intuitive understanding of what statistics is. We can identify two distinct
branches of statistics:

• Descriptive statistics consists of methods for organizing and summarizing information.

• Inferential statistics consists of methods for drawing and measuring the reliability of
conclusions about a population based on information obtained from a sample of the
population.

A population is the collection of the individuals or items under consideration in a statistical
study. A sample is that part of the population from which information is obtained.
To put it simply, the role of descriptive statistics is to describe some collection of items/events
by summarizing the features of the elements. For example, one can measure the height of all
students attending this lecture and then calculate the average height and its standard deviation,
find the tallest person, etc. On the other hand, the role of inferential statistics is to characterize
a large population based on some selection of elements. For example, one could select
100 people in Leicester, measure their height, calculate the average and then argue that this is an
estimate of the average height of people living in Leicester. The role of a statistician is often to
develop methods for making the step from the sample to the whole population mathematically
precise, that is, to measure its reliability and precision.
Statistics is closely connected to probability. Probabilists use knowledge about the distribution
of a random variable to draw conclusions about the probability of obtaining a particular
sample. Continuing the earlier example, a probabilist who knows the distribution of people’s
height is able to calculate how many people in Leicester, on average, are taller than 1.9 m.
On the other hand, the goal of the statistician is often to investigate the properties of the
distribution that can be inferred from a sample. Statisticians analyse data for the purpose
of making generalizations and decisions. In summary, probability can be considered a more
abstract approach, whereas statistics is more practical and data-driven.
Statisticians
• assist in organizing, describing, summarizing, and displaying experimental data;

• assist in designing experiments and surveys (sampling);

• draw inferences and make decisions based on data.

1.1.1 Qualitative and quantitative variables


Statistical data may represent qualitative or quantitative variables. Qualitative variables are
not numeric. Examples of qualitative variables are eye colour and preferences for political parties.
On the other hand, quantitative variables take numerical values. These could be, for instance,
height in centimetres, the number of car crashes at a given junction, or stock prices in £. Quantitative
variables can be either discrete (when one can make a list of possible values) or continuous
(when their values are real numbers). Depending on the discrete or continuous character of the
variable, we may model it with one of the probability distributions of Section 1.2 or 1.3.
In the rest of this chapter we review and extend important concepts from probability (from
the MA1061 module).

1.1.2 Random Variable


Any process whose outcome is not known in advance but is random is termed an experiment.
Assume that the experiment can be repeated any number of times under identical conditions.
Each repetition is called a trial.
The sample space associated with an experiment is the set consisting of all possible outcomes.
A sample space is also referred to as a probability space and is usually denoted by S
or Ω. An outcome in S is also called a sample point. An event A is a subset of outcomes in
S, that is, A ⊆ S. We say that an event A occurs if the outcome of the experiment is in A.

Definition 1.1. Let $S$ be a sample space of an experiment. Probability $P(\cdot)$ is a real-valued
function that assigns to each event $A$ in the sample space $S$ a number $P(A) \in [0, 1]$, called
the probability of $A$, with the following conditions satisfied:

(i) It is zero for the impossible event: $P(\emptyset) = 0$, and unity for the certain event: $P(S) = 1$.

(ii) It is additive over the union of an infinite number of pairwise disjoint events, that is, if
$A_1, A_2, \ldots$ form a sequence of pairwise mutually exclusive events (i.e. $A_i \cap A_j = \emptyset$ for
$i \neq j$) in $S$, then

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i).$$

A random variable (abbreviated as r.v.) $X$ is a function defined on a sample space $S$ that
associates one and only one real number, $X(s) = x$, to each outcome $s \in S$.

A discrete random variable takes a finite or countably infinite number of possible values
with specific probabilities associated with each value.
The probability mass function (abbreviated as pmf) of a discrete random variable $X$ is the
function

$$p(x_i) = P(X = x_i), \quad i = 1, 2, 3, \ldots$$

For a discrete random variable we can make a list $x_1, x_2, \ldots$ of the values attained with positive
probability. The list may be finite or infinite (in the latter case we say it is countably infinite).
The cumulative distribution function (abbreviated as cdf) $F$ of the random variable $X$ is
defined by

$$F(x) = P(X \le x) = \sum_{x_i \le x} p(x_i), \quad \text{for } -\infty < x < \infty.$$

A cumulative distribution function is also called a distribution function. We may use the
notation $p_X$ and $F_X$ to stress that these are the probability mass function and cumulative distribution
function of a random variable $X$. For a given function $p$ to be a pmf, it needs to satisfy the
following two conditions: $p(x) \ge 0$ for all values of $x \in \mathbb{R}$, and $\sum_{i=1}^{\infty} p(x_i) = 1$.

Exercise 1.2. Suppose that a fair coin is tossed twice so that the sample space is
$S = \{HH, HT, TH, TT\}$. Let $X$ be the number of heads.
(a) Find the probability function for $X$.
(b) Find the cumulative distribution function of $X$.

Solution. We have

x       0     1     2
p(x)    1/4   1/2   1/4

and

$$F(x) = \begin{cases} 0, & x < 0, \\ 1/4, & x \in [0, 1), \\ 3/4, & x \in [1, 2), \\ 1, & x \ge 2. \end{cases}$$
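
The solution to Exercise 1.2 can be reproduced by enumerating the sample space. The following is a minimal Python sketch (the notes themselves use R and Excel; Python and the names `pmf` and `cdf` are choices made here for illustration only). It counts heads in each equally likely outcome and sums the pmf to obtain the cdf.

```python
from fractions import Fraction
from itertools import product

# Sample space for two tosses of a fair coin: HH, HT, TH, TT (equally likely).
outcomes = ["".join(o) for o in product("HT", repeat=2)]

# pmf of X = number of heads: add 1/4 for every outcome with that many heads.
pmf = {}
for o in outcomes:
    x = o.count("H")
    pmf[x] = pmf.get(x, Fraction(0)) + Fraction(1, len(outcomes))

def cdf(x):
    """F(x) = P(X <= x): sum the pmf over all attained values not exceeding x."""
    return sum((p for value, p in pmf.items() if value <= x), Fraction(0))

for x in sorted(pmf):
    print(x, pmf[x], cdf(x))
```

Evaluating `cdf` at non-integer points such as 0.5 or 1.5 shows the step shape of the distribution function: it is constant between the attained values and jumps at 0, 1 and 2.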

Exercise 1.3. The probability mass function of a discrete random variable X is given in the
following table:
x -1 2 5 8 10
p(x) 0.1 0.15 0.25 0.3 0.2
a) Find the cumulative distribution function F (x) and graph it.
b) Find P (X > 2).

A continuous random variable attains uncountably many values, such as the points on a
real line.

Definition 1.4. Let $X$ be a random variable. Suppose that there exists a nonnegative real-valued
function $f : \mathbb{R} \to \mathbb{R}_+$ such that for any interval $[a, b]$,

$$P(X \in [a, b]) = \int_a^b f(x)\,dx.$$

Then X is called a continuous random variable. The function f is called the probability density
function or simply density (and abbreviated as pdf) of X.

The cumulative distribution function is given by

$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt.$$

Sometimes, we use the notation $f_X$ and $F_X$ to stress the fact that $f_X$ is the density of a
random variable $X$ and that $F_X$ is the cumulative distribution function of $X$.
For a given function $f$ to be a pdf, it needs to satisfy the following two conditions: $f(x) \ge 0$
for all values of $x \in \mathbb{R}$, and $\int_{-\infty}^{\infty} f(t)\,dt = 1$.

Exercise 1.5. The probability density function of a random variable $X$ is given by

$$f(x) = \begin{cases} cx, & 0 < x < 4, \\ 0, & \text{otherwise.} \end{cases}$$

(a) Find $c$.
(b) Find the distribution function $F(x)$.
(c) Compute $P(1 < X < 3)$.

1.2 Discrete Probability Distributions
Definition 1.6. A random variable $X$ is said to be uniformly distributed over the numbers
$1, 2, 3, \ldots, n$ if

$$P(X = i) = \frac{1}{n}, \quad \text{for } i = 1, 2, \ldots, n.$$

The uniform probability mass function is:

$$p(i) = P(X = i) = \frac{1}{n}, \quad \text{for } i = 1, 2, \ldots, n.$$

The cumulative uniform distribution function is:

$$F(x) = P(X \le x) = \sum_{i \le x} p(i) = \sum_{i \le x} \frac{1}{n} = \frac{\lfloor x \rfloor}{n},$$

for $0 \le x \le n$. Here, $\lfloor x \rfloor$ is the floor of $x$, that is, the largest integer smaller than or equal to
$x$.
Definition 1.7. A binomial random variable is a discrete random variable that describes the
number of successes $X$ in a sequence of $n$ Bernoulli trials. The binomial probability mass
function is defined as:

$$p(k) = P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}, \quad k = 0, 1, 2, \ldots, n.$$

The corresponding cumulative distribution function is:

$$F(x) = P(X \le x) = \sum_{k \le x} p(k) = \sum_{k \le x} \binom{n}{k} p^k (1 - p)^{n-k},$$

for $0 \le x \le n$.
The binomial distribution arises in the following scenario. Suppose we carry out an experiment
which may result in 2 outcomes, say a success or a failure. We repeat this experiment $n$ times
and count the number of successes. Bernoulli trials are a fixed sequence of $n$ identical
repetitions of the same Bernoulli experiment. For each trial the probability of success is $p \in (0, 1)$
and the probability of failure is $q = 1 - p$. The outcome of each experiment is independent of
previous experiments and does not influence any subsequent outcomes.
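
To make the binomial formulas concrete, here is a short Python sketch that evaluates the pmf and cdf directly from the definition. The function names `binom_pmf` and `binom_cdf` and the values n = 10, p = 0.5 are illustrative assumptions, not part of the notes.

```python
from math import comb, floor

def binom_pmf(k, n, p):
    """P(X = k) = C(n, k) p^k (1 - p)^(n - k) for k = 0, 1, ..., n."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(x, n, p):
    """F(x) = sum of the pmf over all integers 0 <= k <= x (capped at n)."""
    upper = min(floor(x), n)
    return sum(binom_pmf(k, n, p) for k in range(upper + 1))

n, p = 10, 0.5  # hypothetical example: ten tosses of a fair coin
print(binom_pmf(5, n, p))   # probability of exactly five successes
print(binom_cdf(5, n, p))   # probability of at most five successes
```

Summing `binom_pmf(k, n, p)` over k = 0, ..., n should return 1 up to rounding error, which is a quick check that the function is a valid pmf.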
Definition 1.8. The probability mass function and the cumulative distribution function
of the geometric distribution are:

$$p(k) = P(X = k) = (1 - p)^{k-1} p, \quad \text{for } k = 1, 2, \ldots \tag{1.1}$$

$$F(x) = P(X \le x) = \sum_{k \le x} (1 - p)^{k-1} p = 1 - (1 - p)^{\lfloor x \rfloor} \tag{1.2}$$

for $x \ge 0$.

The geometric random variable is the number $X$ of Bernoulli trials needed to get one
success. Note that $X$ is supported on the set $S = \{1, 2, 3, \ldots\}$.
Sometimes the geometric distribution is defined differently. One can say that $X$ is the
number of failures before the first success. Note that with this definition, $X$ is supported on the
set $S = \{0, 1, 2, \ldots\}$, as one can have $X = 0$ if the first trial results in a success. If $X$ represents
the number of failures before the first success, then instead of (1.1)-(1.2) we have

$$p(k) = P(X = k) = (1 - p)^k p, \quad \text{for } k = 0, 1, 2, 3, \ldots$$

$$F(x) = P(X \le x) = \sum_{k \le x} (1 - p)^k p = 1 - (1 - p)^{\lfloor x \rfloor + 1}$$

for $x \ge 0$.
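
The two conventions for the geometric distribution differ only by a shift of one. The Python sketch below (function names and the values p = 0.3, x = 4.7 are assumptions chosen for illustration) compares the closed-form cdfs with direct sums of the pmf and checks the shift relation.

```python
from math import floor, isclose

def pmf_trials(k, p):
    """Convention (1.1): k = number of trials up to and including the first success."""
    return (1 - p)**(k - 1) * p

def cdf_trials(x, p):
    """Closed form (1.2), valid for x >= 0."""
    return 1 - (1 - p)**floor(x)

def pmf_failures(k, p):
    """Alternative convention: k = number of failures before the first success."""
    return (1 - p)**k * p

def cdf_failures(x, p):
    """Closed form for the alternative convention, valid for x >= 0."""
    return 1 - (1 - p)**(floor(x) + 1)

p, x = 0.3, 4.7  # hypothetical values
sum_trials = sum(pmf_trials(k, p) for k in range(1, floor(x) + 1))
sum_failures = sum(pmf_failures(k, p) for k in range(0, floor(x) + 1))
print(isclose(sum_trials, cdf_trials(x, p)))
print(isclose(sum_failures, cdf_failures(x, p)))
```

Since the number of failures equals the number of trials minus one, `pmf_failures(k, p)` agrees with `pmf_trials(k + 1, p)` for every k.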

Definition 1.9. A discrete random variable $X$ has a Poisson distribution with parameter
$\lambda > 0$ if its probability mass function is given by:

$$p(k) = P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad \text{for } k = 0, 1, 2, \ldots$$

The cumulative distribution function is:

$$F(x) = P(X \le x) = \sum_{k \le x} p(k) = \sum_{k \le x} \frac{\lambda^k e^{-\lambda}}{k!} = e^{-\lambda} \sum_{k \le x} \frac{\lambda^k}{k!},$$

for $0 \le x < \infty$.
A Poisson random variable describes the number of random events occurring in a fixed
unit of time or space. For example, the number of customers that entered a supermarket
between 9 and 10 AM (here ‘a random event’ means the arrival of a customer at the supermarket).
More precisely,
More precisely,

• k is the number of times an event occurs in a time interval and k = 0, 1, 2, . . ..

• The occurrence of one event does not affect the probability that a second event will
occur. That is, events occur independently.

• The average rate at which events occur is independent of any occurrences.

• Two events cannot occur at exactly the same instant; instead, at each very small sub-
interval, either exactly one event occurs, or no event occurs.

The fact that these properties give rise to a Poisson random variable can be derived formally;
see D. D. Wackerly, W. Mendenhall and R. L. Scheaffer, Mathematical Statistics with
Applications, 7th edition, Section 3.8.
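
As an illustration of the Poisson formulas, the following Python sketch evaluates the pmf and cdf and approximates the mean and variance by truncating the series at k = 99. The rate λ = 3 and the truncation point are assumptions made for this example; for a much larger λ more terms would be needed.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) = lam^k e^(-lam) / k! for k = 0, 1, 2, ..."""
    return lam**k * exp(-lam) / factorial(k)

def poisson_cdf(x, lam):
    """F(x) = sum of the pmf over all integers 0 <= k <= x (0 for x < 0)."""
    if x < 0:
        return 0.0
    return sum(poisson_pmf(k, lam) for k in range(int(x) + 1))

lam = 3.0  # hypothetical rate, e.g. three arrivals per hour on average
mean = sum(k * poisson_pmf(k, lam) for k in range(100))
var = sum((k - mean)**2 * poisson_pmf(k, lam) for k in range(100))
print(mean, var)  # both should be close to lam
```

Both numerical values come out close to λ, in line with the standard fact that a Poisson random variable has mean and variance equal to λ.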

Figure 1.1: Probability mass function of the Poisson distribution

1.3 Continuous Probability Distributions


Definition 1.10. The probability density function of the continuous uniform distribution on $[a, b]$ is
defined as:

$$f(x) = \begin{cases} \frac{1}{b-a}, & a \le x \le b, \\ 0, & \text{otherwise.} \end{cases}$$

The cumulative distribution function is:

$$F(x) = \begin{cases} 0, & x < a, \\ \frac{x-a}{b-a}, & a \le x \le b, \\ 1, & x > b, \end{cases}$$

for $-\infty < x < \infty$.
The uniform distribution describes an experiment where there is an arbitrary outcome that
lies between certain bounds $a$ and $b$, and obtaining a value in each small interval $[x, x + dx]$
is equally probable.

Definition 1.11. A random variable $X$ is called Gaussian or normal if its probability density
function $f(x)$ is of the form

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}},$$

where $-\infty < x < \infty$ and $\mu$ and $\sigma$ are two parameters such that $-\infty < \mu < \infty$, $\sigma > 0$.

Figure 1.2: Density of the normal distribution

The cumulative distribution function is:

$$F(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{x} e^{-\frac{(t-\mu)^2}{2\sigma^2}}\,dt$$

for $-\infty < x < \infty$. This integral cannot be simplified further. There is no ‘nice’ formula for
the Gaussian cumulative distribution function.
Notation 1.12. We denote the normal distribution with mean $\mu$ and variance $\sigma^2$ by $N(\mu, \sigma^2)$.
Definition 1.13. The exponential distribution is the probability distribution with the density

$$f(x) = \begin{cases} \lambda e^{-\lambda x}, & x \ge 0, \\ 0, & x < 0, \end{cases} \tag{1.3}$$

where $\lambda > 0$ is a parameter called the rate parameter. The cumulative distribution function is

$$F(x) = \begin{cases} 1 - e^{-\lambda x}, & x \ge 0, \\ 0, & x < 0. \end{cases}$$

The exponential distribution is sometimes parametrised in a different way (footnote 1). Let the scale
parameter $\beta > 0$ be given by $\beta = 1/\lambda$. Then

$$f(x) = \begin{cases} \frac{1}{\beta} e^{-x/\beta}, & x \ge 0, \\ 0, & x < 0. \end{cases} \tag{1.4}$$

Footnote 1. When I say ‘parametrised in two ways’, this means that we can define the density with two different
formulas, (1.3) or (1.4); however, if I carefully choose the parameters (taking $\beta = 1/\lambda$), then the formulas become
the same.

The exponential distribution is often used for modelling the waiting time for an event. For
example, if $N_t$ is the number of customers that joined the queue before time $t$, the times
between arrivals of the customers would be modelled by exponential random variables.

1.4 Quantiles
Definition 1.14. For a random variable $X$ and $0 < p < 1$, the $p$-th quantile of $X$, denoted
by $\varphi_p$, is the smallest value such that $P(X \le \varphi_p) = F(\varphi_p) \ge p$.
If $F$ is an invertible function, then the $p$-th quantile is simply

$$\varphi_p = F^{-1}(p).$$

$\varphi_{0.5}$ is the median. We often express quantiles in percentages, e.g. the 0.05-quantile is the 5th
percentile.
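
Both parts of the quantile definition can be illustrated numerically. In the Python sketch below, the exponential quantile is obtained by solving F(x) = 1 − e^(−λx) = p for x, which gives x = −ln(1 − p)/λ, while the discrete case scans for the smallest value whose cdf reaches p. The function names, the rate λ = 2 and the pmf of the number of heads in two fair coin tosses are illustrative assumptions.

```python
from math import exp, log, isclose

def exp_cdf(x, lam):
    """Cdf of the exponential distribution with rate lam."""
    return 1 - exp(-lam * x) if x >= 0 else 0.0

def exp_quantile(p, lam):
    """Invert F(x) = 1 - exp(-lam x): the p-th quantile is -ln(1 - p) / lam."""
    return -log(1 - p) / lam

def discrete_quantile(p, pmf):
    """Smallest attained value x with F(x) >= p, for a pmf given as {value: probability}."""
    total = 0.0
    for value in sorted(pmf):
        total += pmf[value]
        if total >= p:
            return value
    return max(pmf)  # guard against rounding when p is very close to 1

lam = 2.0  # hypothetical rate
median = exp_quantile(0.5, lam)
print(median, exp_cdf(median, lam))

coin_pmf = {0: 0.25, 1: 0.5, 2: 0.25}  # number of heads in two fair tosses
print(discrete_quantile(0.5, coin_pmf))
```

In the discrete case the cdf is not invertible, which is why the definition asks for the smallest value with F(x) ≥ p instead of a solution of F(x) = p.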

1.5 Moments
Definition 1.15. Let $X$ be any random variable. For any positive integer $r$, the $r$-th moment
($r$-th moment about the origin) is

$$\mu_r := E[X^r].$$

The central $r$-th moment of $X$ ($r$-th moment of $X$ about the mean), denoted $\mu'_r$, is given by

$$\mu'_r := E[(X - \mu_1)^r].$$

The moments are only defined if the expectation is well-defined and finite (footnote 2).
If $X$ is a continuous random variable with a pdf $f_X(x)$, then

$$\mu_r = \int_{-\infty}^{+\infty} x^r f_X(x)\,dx,$$

and

$$\mu'_r = \int_{-\infty}^{+\infty} (x - \mu_1)^r f_X(x)\,dx,$$

provided $\int_{-\infty}^{\infty} |x|^r f_X(x)\,dx < \infty$.
Definition 1.16. The $r$-th standardized moment, $\tilde{\mu}_r$, is a moment that is normalized, typically
by the standard deviation raised to the power of $r$, $\sigma^r$:

$$\tilde{\mu}_r = \frac{E[(X - \mu)^r]}{\sigma^r}.$$

Footnote 2. When we calculate the $r$-th moment we either find the sum of a series (discrete case) or an integral (continuous
case). In some rare cases these series and integrals may diverge.

We also have

• When $r = 1$, the subscript is usually omitted, i.e. $\mu := \mu_1$ is the mean (or expectation)
of $X$.

• When $r = 2$, $\mu'_2 = \sigma^2$ is called the variance, and $\sigma$ is called the standard deviation.

• We define the skewness $E[(X - \mu)^3]/(E[(X - \mu)^2])^{3/2}$, the third standardized moment,
and the kurtosis $E[(X - \mu)^4]/(E[(X - \mu)^2])^2$, the fourth standardized moment.

The third central moment of the distribution shows the extent to which the distribution is not
symmetric about the mean. For example, since the density of $Z \sim N(0, 1)$ is symmetric about zero, its
third moment is zero:

$$E[Z^3] = \int_{-\infty}^{+\infty} x^3 \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2}\,dx = 0.$$

Figure 1.3: The function $x^3 \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2}$.

The fourth moment expresses the flatness, or both the ‘peakedness’ of the distribution
and the heaviness of its tails.
We illustrate these definitions with two exercises, one involving a discrete random variable
and one a continuous random variable.

Exercise 1.17. To find out the prevalence of smallpox vaccine use, a researcher inquired into
the number of times 200 randomly selected people aged 16 and over in an African village
had been vaccinated. He obtained the following figures:

N            0        1        2        3        4        5
proportion   17/200   30/200   58/200   50/200   38/200   7/200

Assume that these proportions continue to hold exhaustively for the population of that village.

a) What is the expected number of times those people in the village had been vaccinated?

b) What is the standard deviation?
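
One way to check a hand calculation for Exercise 1.17 is to apply the definitions of the first two moments to the table directly. The Python sketch below is only such a check under that reading of the exercise; the variable names are illustrative, and exact fractions are used so that the variance is not affected by rounding.

```python
from fractions import Fraction
from math import sqrt

# Table from Exercise 1.17: value of N -> number of people out of 200.
counts = {0: 17, 1: 30, 2: 58, 3: 50, 4: 38, 5: 7}
total = sum(counts.values())
pmf = {k: Fraction(c, total) for k, c in counts.items()}

# First moment E[N] and second moment E[N^2] from the definition.
mean = sum(k * p for k, p in pmf.items())
second_moment = sum(k**2 * p for k, p in pmf.items())

# Variance as the second central moment: E[N^2] - (E[N])^2.
variance = second_moment - mean**2
std_dev = sqrt(variance)

print(float(mean), float(variance), std_dev)
```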

Exercise 1.18. Let $X$ be a random variable with pdf:

$$f(x) = \begin{cases} \frac{3}{64} x^2 (4 - x), & 0 \le x \le 4, \\ 0, & \text{otherwise.} \end{cases}$$

Find the expected value and variance of $X$.

Example 1.19. Find the expectation and the variance for a random variable $X$:

• with Poisson distribution: $E[X] = \lambda$, $\mathrm{Var}(X) = \lambda$, see e.g. the MA1061 module.

• with normal distribution: $E[X] = \mu$, $\mathrm{Var}(X) = \sigma^2$. This can be shown by lengthy
calculations involving integrals or using the notions introduced in the next section.

1.6 Moment-Generating Functions

Definition 1.20. For a random variable $X$, suppose that there is a positive number $h$ such that
for $-h < t < h$ the expectation $E[e^{tX}]$ exists (footnote 3). The moment-generating function (abbreviated
as mgf) of the random variable $X$ is defined by

$$M_X(t) = E[e^{tX}]$$

for $t \in \mathbb{R}$ such that $E[e^{tX}]$ exists.

Footnote 3. This assumption is needed because, as mentioned earlier, the series and integrals used for finding an
expectation may diverge.

We have

$$M_X(t) = \sum_{i=1}^{\infty} e^{t x_i} p_X(x_i), \quad \text{if } X \text{ is discrete, taking values } x_1, x_2, \ldots,$$

$$M_X(t) = \int_{-\infty}^{\infty} e^{tx} f_X(x)\,dx, \quad \text{if } X \text{ is continuous.}$$

Using the Taylor expansion

$$e^{tX} = 1 + tX + \frac{(tX)^2}{2!} + \cdots + \frac{(tX)^n}{n!} + \cdots$$

and plugging it into the definition of the mgf, we get

$$M_X(t) = E[e^{tX}] = 1 + tE[X] + \frac{t^2}{2!}E[X^2] + \cdots + \frac{t^n}{n!}E[X^n] + \cdots$$

On the other hand,

$$M_X(t) = M_X(0) + M_X'(0)\,t + \frac{M_X''(0)}{2!}t^2 + \cdots + \frac{M_X^{(n)}(0)}{n!}t^n + \cdots.$$

Comparing the coefficients, we must have

$$M_X'(0) = E[X], \quad M_X''(0) = E[X^2], \quad \ldots, \quad M_X^{(n)}(0) = E[X^n].$$

We have thus justified:

Theorem 1.21. If $M_X(t)$ exists, then for any positive integer $r$,

$$\left.\frac{d^r M_X}{dt^r}\right|_{t=0} = M_X^{(r)}(0) = \mu_r.$$

This is an important result, which provides a way of finding expectation, variance and
higher moments of random variables.
Other properties of moment-generating functions are:
Theorem 1.22. (i) The moment-generating function of $X$ is unique in the sense that, if
two random variables $X$ and $Y$ have the same mgf ($M_X(t) = M_Y(t)$ for $t$ in an interval
containing 0), then $X$ and $Y$ have the same distribution.

(ii) If $X$ and $Y$ are independent, then $M_{X+Y}(t) = M_X(t) M_Y(t)$. That is, the mgf of the sum
of two independent random variables is the product of the mgfs of the individual random
variables. The result can be extended to $n$ random variables.

(iii) Let $Y = aX + b$. Then $M_Y(t) = e^{bt} M_X(at)$.


Proof. Skipped.

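Example 1.23. We find the moment generating function of $Z \sim N(0, 1)$. We have

$$\begin{aligned}
M_Z(t) &= E[e^{tZ}] \\
&= \int_{-\infty}^{\infty} e^{tx} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2}\,dx \\
&= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x-t)^2 + \frac{1}{2}t^2}\,dx \quad \text{(we put the exponentials together and completed the square)} \\
&= e^{\frac{1}{2}t^2} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x-t)^2}\,dx.
\end{aligned}$$

Since the density of $N(\mu, \sigma^2)$ integrates to 1, we know that for any $\mu$ and $\sigma > 0$

$$\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx = 1.$$

In particular (taking $\sigma = 1$ and $\mu = t$) we get that

$$\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x-t)^2}\,dx = 1.$$

Thus

$$M_Z(t) = e^{\frac{1}{2}t^2}, \quad t \in \mathbb{R}.$$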
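
As a quick check of Theorem 1.21, the mgf of $Z$ can be differentiated to recover its first two moments. Combined with Theorem 1.22 (iii), the same calculation gives the mean and variance of a general normal random variable quoted in Example 1.19. The following is a sketch of that calculation, writing $X = \sigma Z + \mu$ and using the standard fact that this $X$ has the $N(\mu, \sigma^2)$ distribution.

```latex
\begin{align*}
M_Z'(t) &= t\,e^{t^2/2}, & M_Z'(0) &= 0 = E[Z], \\
M_Z''(t) &= (1 + t^2)\,e^{t^2/2}, & M_Z''(0) &= 1 = E[Z^2], \\
\operatorname{Var}(Z) &= E[Z^2] - (E[Z])^2 = 1. \\[1ex]
M_X(t) &= e^{\mu t} M_Z(\sigma t) = e^{\mu t + \sigma^2 t^2/2}, \\
M_X'(0) &= \mu, & M_X''(0) &= \sigma^2 + \mu^2, \\
\operatorname{Var}(X) &= M_X''(0) - \bigl(M_X'(0)\bigr)^2 = \sigma^2.
\end{align*}
```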

Chapter 2

Descriptive Statistics

Recall from Section 1.1 that a population is the collection of the individuals or items under
consideration in a statistical study. Recall that a sample is that part of the population from
which information is obtained.
A sample should be representative, that is, it should reflect as closely as possible the relevant
characteristics of the population. A sample that is not representative of the population
characteristics is called a biased sample.
Nevertheless, the results of a study almost never replicate the features of the whole
population exactly. There is always some error: a non-zero difference between the quantities
found in a study and the (unknown) values characteristic of the whole population.

Definition 2.1. Sampling errors occur when the sample is not representative, that is, when its
characteristics differ from those of the population.

The sample size is an important feature of any empirical study. Sample size depends on:

• the population size;

• the required reliability;

• the available resources (money, time, etc.).

Definition 2.2. Nonsampling errors occur in the collection, recording, and processing of
sample data.

Some typical sources of a nonsampling error are:

• poorly designed survey;

• missing values;

• incorrect measurements or responses.

2.1 Graphical representation of Data
Graphical representations of data include:

◦ frequency distribution (Qualitative Data) or histogram (Quantitative Data);

◦ relative frequency distribution (Qualitative Data) or histogram (Quantitative Data);

◦ Pareto Chart;

◦ Pie chart (Qualitative Data or Quantitative Data - for ranges of values);

◦ Horizontal Bar Chart (Nominal Qualitative Data);

◦ Stem-and-Leaf Diagram (Quantitative Data).

2.1.1 Distribution Shapes


Definition 2.3. The distribution of a data set is a table, graph, or formula that provides
the values of the observations and how often they occur.

Figure 2.1: Relative frequency

2.2 Descriptive Measures


Numbers that are used to describe data sets are called descriptive measures. Descriptive
measures that indicate where the center or most typical value of a data set lies are called
measures of central tendency or, more simply, measures of center.
Descriptive measures that indicate the amount of variation, or spread, in a data set are
referred to as measures of variation or measures of spread.

2.2.1 Measures of central tendency
Measures of central tendency are

• The Mean (Sample Mean): the sum of observations divided by the number of observations,

$$\bar{x} := \frac{1}{n} \sum_{i=1}^{n} x_i,$$

where $x_1, x_2, \ldots, x_n$ are observations and $n$ is the sample size.

The order statistics of $x_1, \ldots, x_n$ are their values put in increasing order, which we denote
$x_{(1)} \le x_{(2)} \le \ldots \le x_{(n)}$. $x_{(i)}$ is called the $i$-th order statistic.

• The Median is the point that divides the sample in half:

$$\tilde{x} := \begin{cases} x_{\left(\frac{n+1}{2}\right)}, & \text{if } n \text{ is odd,} \\ \frac{1}{2}\left(x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right), & \text{if } n \text{ is even.} \end{cases}$$

• An $\alpha$-trimmed Mean: the mean of the remaining middle $100(1 - 2\alpha)\%$ of the observations,

$$\bar{x}_\alpha := \frac{x_{(\lfloor n\alpha \rfloor + 1)} + x_{(\lfloor n\alpha \rfloor + 2)} + \cdots + x_{(n - \lfloor n\alpha \rfloor)}}{n - 2\lfloor n\alpha \rfloor} = \frac{\sum_{i=\lfloor n\alpha \rfloor + 1}^{n - \lfloor n\alpha \rfloor} x_{(i)}}{n - 2\lfloor n\alpha \rfloor},$$

where $\lfloor s \rfloor$ is the greatest integer smaller than or equal to $s$.

• The mode is the most frequently occurring value of the data.

Note that the mode is the only measure of center that can be used for qualitative data.

• A weighted mean is used when some data points contribute more ‘weight’ than others:

$$\bar{x}_w := \frac{\sum_{j=1}^{n} x_j w_j}{\sum_{j=1}^{n} w_j},$$

where $x_j$ denotes the data, $w_j$ the corresponding non-negative weights, and $n$ is the sample
size (i.e. the number of data points).

• A Grouped-Data Mean. Grouped data are data formed by aggregating individual observations
of a variable into groups, so that a frequency distribution of these groups serves
as a convenient means of summarizing or analyzing the data. Suppose $n$ observations
are grouped into $m$ classes indexed by $j = 1, \ldots, m$. Let $f_j$ be the frequency of each
group. The grouped-data mean can be calculated as:

$$\bar{x}_g := \frac{\sum_{j=1}^{m} x_j f_j}{\sum_{j=1}^{m} f_j} = \frac{\sum_{j=1}^{m} x_j f_j}{n},$$

where $x_j$ denotes either a class mark or a midpoint of each class.

Exercise 2.4. Some quantitative data were collected:

79, 61, 77, 74, 54, 70, 62, 83, 93, 66, 70, 89,
80, 98, 54, 90, 83, 50, 72, 82, 60, 83, 51, 86, 61

Use limit grouping with a first class of 50-59 and a class width of 10. Find the
sample mean, the median, the $\alpha$-trimmed mean ($\alpha = 5\%$), and the mode.
Solution. The ordered sample is

50, 51, 54, 54, 60, 61, 61, 62, 66, 70, 70, 72, 74, 77, 79,
80, 82, 83, 83, 83, 86, 89, 90, 93, 98

i.e. $x_{(1)} = 50$, $x_{(2)} = 51$ and so on, $x_{(25)} = 98$. Since $n = 25$ is odd, the median is $\tilde{x} = x_{(13)} = 74$.
The remaining calculations can be performed in Excel or R.
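
The notes suggest Excel or R for the remaining calculations; as an alternative, here is a minimal Python sketch of the same computations using only the standard library. The helper `trimmed_mean` and the use of class midpoints 54.5, 64.5, ... for the limit grouping 50-59, 60-69, ... are assumptions made for this illustration.

```python
from collections import Counter
from math import floor
from statistics import mean, median, multimode

data = [79, 61, 77, 74, 54, 70, 62, 83, 93, 66, 70, 89,
        80, 98, 54, 90, 83, 50, 72, 82, 60, 83, 51, 86, 61]

def trimmed_mean(values, alpha):
    """Drop floor(n * alpha) observations from each end, then average the rest."""
    xs = sorted(values)
    n = len(xs)
    k = floor(n * alpha)
    kept = xs[k:n - k]
    return sum(kept) / len(kept)

# Limit grouping with classes 50-59, 60-69, ...: x // 10 identifies the class,
# and the class midpoint is 10 * (x // 10) + 4.5.
class_freq = Counter(x // 10 for x in data)
grouped_mean = sum((10 * c + 4.5) * f for c, f in class_freq.items()) / len(data)

print("mean:", mean(data))
print("median:", median(data))
print("5% trimmed mean:", trimmed_mean(data, 0.05))
print("mode(s):", multimode(data))
print("grouped-data mean:", grouped_mean)
```

With $n = 25$ and $\alpha = 0.05$ we get $\lfloor n\alpha \rfloor = 1$, so the trimmed mean discards only the smallest and the largest observation.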

2.2.2 Measures of spread


The Sample Variance is defined as the average of the squared differences of the data points from
the mean of the data set:

$$\tilde{s}_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}. \tag{2.1}$$

Often we divide by $n - 1$ instead of $n$:

$$s_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}.$$

This will be motivated later in Section 3.2.1.
The Sample Standard Deviation measures variation by indicating how far, on average, the
observations are from the mean:

$$\tilde{s}_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}},$$

or

$$s_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}.$$

The Sample Variance for grouped data:

$$s_{x,\mathrm{grouped}}^2 = \frac{\sum_{j=1}^{m} (x_j - \bar{x})^2 f_j}{n - 1},$$

where $x_j$ denotes either a class mark or a midpoint, $f_j$ a class frequency, $m$ is the number of
classes or intervals, and $n = \sum_{j=1}^{m} f_j$ is the sample size.
The Sample Standard Deviation for grouped data:

$$s_{x,\mathrm{grouped}} = \sqrt{\frac{\sum_{j=1}^{m} (x_j - \bar{x})^2 f_j}{n - 1}}.$$
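
The two versions of the sample variance differ only in the divisor. The Python sketch below implements both from the definitions and cross-checks them against the standard library functions `statistics.pvariance` (divisor n) and `statistics.variance` (divisor n − 1). The data are those of Exercise 2.4; the function names `var_n` and `var_n_minus_1` are illustrative.

```python
import statistics
from math import sqrt

data = [79, 61, 77, 74, 54, 70, 62, 83, 93, 66, 70, 89,
        80, 98, 54, 90, 83, 50, 72, 82, 60, 83, 51, 86, 61]

def var_n(values):
    """Average squared deviation from the mean, divisor n (formula (2.1))."""
    m = sum(values) / len(values)
    return sum((x - m)**2 for x in values) / len(values)

def var_n_minus_1(values):
    """Sum of squared deviations from the mean divided by n - 1."""
    m = sum(values) / len(values)
    return sum((x - m)**2 for x in values) / (len(values) - 1)

print(var_n(data), statistics.pvariance(data))
print(var_n_minus_1(data), statistics.variance(data))
print(sqrt(var_n(data)), sqrt(var_n_minus_1(data)))  # the two standard deviations
```

For a sample of size n the two variances are related by a factor of n/(n − 1), so the difference matters mostly for small samples.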

The Range is the difference between the largest and smallest values of the sample:
$$\text{range} = x_{(\max)} - x_{(\min)} = x_{(n)} - x_{(1)}.$$
The Midrange is the average of the largest and the smallest values of the sample:
$$\text{midrange} = \frac{x_{(\max)} + x_{(\min)}}{2}.$$
We defined quantiles of a probability distribution in Section 1.4. Similarly, we can define
quantiles of a sample. If p ∈ (0, 1), the p-th quantile is a value x which divides the ordered
sample in such a way that 100p% of the values are below x and 100(1 − p)% are above x.
Instead of quantiles, we often use percentiles. A percentile is the value below which a given
percentage of the data falls; the percentage varies between 0 and 100. E.g. the 20th percentile
is the value that divides the sample into the bottom 20% and the top 80%.
Quartiles are the values that split the data into quarters, denoted as Q0 , Q1 , Q2 , Q3 , Q4 .
A five-number summary consists of

• Q0 = the smallest data value,

• Q1 = the first quartile, which divides the bottom half of the sample into halves,

• Q2 = the median,

• Q3 = the third quartile, which divides the top half of the sample into halves,

• Q4 = the largest data value.

The five-number summary is usually presented as a box plot. The range is r = Q4 − Q0. The
Interquartile Range (IQR) is the difference between the upper and lower quartiles of the
sample:
$$\mathrm{IQR} = Q_3 - Q_1.$$
Formally, we may define
$$Q_1 = \frac{x_{(\lfloor 1+(n-1)/4 \rfloor)} + x_{(\lceil 1+(n-1)/4 \rceil)}}{2},$$
$$Q_3 = \frac{x_{(\lfloor 1+3(n-1)/4 \rfloor)} + x_{(\lceil 1+3(n-1)/4 \rceil)}}{2}.$$

Chapter 3

Estimators

3.1 Point Estimators


We often repeat the same experiment multiple times. This results in a sequence of random
variables X1 , . . . , Xn , all of which have the same distribution. Moreover, each experiment is
carried out independently of the others, so the random variables are independent. From
now on we will deal only with sequences X1 , . . . , Xn of independent and identically distributed
random variables, abbreviated as i.i.d.
In practice, the distribution of Xi is unknown. We agree on certain simplifications. For
example, we assume that the distribution of these random variables is of certain type such as
exponential or normal, but with an unknown value of the parameter. The unknown parameter
is usually denoted by θ. θ can be a single real number or a vector:
Example 3.1. (i) We may assume that each Xi is Poisson. There is just one parameter
θ = λ.
(ii) We may assume that each Xi is normal. In this case θ is a two-dimensional vector as
the normal distribution has 2 parameters: θ = (µ, σ 2 ). This is a slightly awkward way to
write it down, but we usually deal with σ 2 , not σ.
A distribution with parameter θ is denoted by Pθ .
The set of all possible values of the parameter, the parameter space, is denoted by ΩΘ . In
Example 3.1 we have (i) ΩΘ = R+ and in (ii) ΩΘ = R × R+ .
The collection of distributions Pθ , θ ∈ ΩΘ , is called a parametric family.
Recall the following convention: when we write x = (x1 , . . . , xn ), we think about a particular
realisation of the sequence of experiments. When we write X1 , . . . , Xn we acknowledge
the fact that the results of our experiments are random.
The goal of estimation is the following:
given the sample x1 , . . . , xn , what guess can we make about the value of the parameter θ?   (3.1)
We formalize this procedure. Let k be the dimension of θ, that is θ = (θ1 , . . . , θk ). The
problem of point estimation is to determine the statistics

θ̂i = gi (X1 , . . . , Xn ), i = 1, . . . , k (3.2)

based on the observed sample data from the population. Any random variable that is calcu-
lated using the sample is called a statistic. Note that the functions gi establish the link between
the results of the experiments (the sample) and our guess for the value of the parameters, which
we decided to seek in (3.1).
The statistics θ̂i are called estimators for the parameters. The estimators θ̂i are random
variables as they depend on X1 , . . . , Xn . When a particular sample x1 , . . . , xn is taken, the
values calculated from these statistics are called estimates of the parameters.
In the rest of this section we study the two most popular methods of finding estimators:

• the method of moments,

• the method of maximum likelihood.

We then study the criteria for choosing a desired point estimator such as bias, efficiency,
sufficiency, and consistency.

3.1.1 Method of Moments


Recall that we defined moments, central moments and standardised moments in Definitions
1.15 and 1.16.
Sample moments are calculated with respect to the observed sample:

Definition 3.2. Let X = (X1 , X2 , . . . , Xn ) be a sample of independent and identically
distributed random variables. For any positive integer r the r-th sample moment of X is given
by
$$m_r = \frac{1}{n} \sum_{i=1}^{n} X_i^r.$$
For r = 1, $m_1 = \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ is the sample mean.

Suppose there are k parameters to be estimated: θ = (θ1 , . . . , θk ) and we have a sample of
size n. The Method of Moments Procedure consists of the following steps:

(i) Find k population moments, $\mu_r$, r = 1, 2, . . . , k. Each $\mu_r$ depends on one or more of the
parameters θ1 , . . . , θk .

(ii) Find the corresponding k sample moments, $m_r = \frac{1}{n}\sum_{i=1}^{n} X_i^r$, r = 1, 2, . . . , k. The
number of sample moments should equal the number of parameters to be estimated.

(iii) Solve the system of equations $\mu_r = m_r$, r = 1, 2, . . . , k, for the parameter θ =
(θ1 , . . . , θk ) to obtain the moment estimator θ̂ = (θ̂1 , . . . , θ̂k ).
Example 3.3. Let X1 , . . . , Xn be a random sample from a Bernoulli population with para-
meter p. Tossing a coin 10 times and equating heads to value 1 and tails to value 0, we
obtained the following values: 0 1 1 0 1 0 1 1 1 0. We find an estimator for p using the method
of moments.
For the Bernoulli random variable, $\mu_1 = E[X] = 1 \cdot p + 0 \cdot (1 - p) = p$. We require that
$\mu_1 = m_1$, that is
$$p = \frac{1}{n} \sum_{i=1}^{n} X_i.$$
Thus, the estimator of p is
$$\hat{p} = \frac{1}{n} \sum_{i=1}^{n} X_i,$$
i.e. we can estimate p as the ratio of the total number of heads to the total number of tosses.
For the particular sample x = (0, 1, 1, 0, 1, 0, 1, 1, 1, 0) we have $\sum_{i=1}^{n} x_i = 6$, so the moment
estimate of p is p̂ = 6/10.
Example 3.4. (i) Let the distribution of X be N (µ, σ 2 ). For a sample X1 , . . . , Xn of size
n, we use the method of moments to estimate µ and σ 2 . Since there are two parameters,
we find the first two moments: $\mu_1 = \mu$ and
$$\mu_2 = E[X_1^2] = \mathrm{Var}(X_1) + (E[X_1])^2 = \sigma^2 + \mu^2.$$
Thus the system
$$\begin{cases} \mu_1 = m_1 \\ \mu_2 = m_2 \end{cases}$$
gives
$$\begin{cases} \mu = \bar{X} \\ \sigma^2 + \mu^2 = \frac{1}{n}\sum_{i=1}^{n} X_i^2. \end{cases}$$
Thus the estimators are $\hat{\mu} = \bar{X}$ and
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} X_i^2 - \hat{\mu}^2 = \frac{1}{n}\sum_{i=1}^{n} X_i^2 - (\bar{X})^2$$
$$= \frac{1}{n}\left(\sum_{i=1}^{n} X_i^2 - n(\bar{X})^2\right) = \frac{1}{n}\left(\sum_{i=1}^{n} X_i^2 - 2n(\bar{X})^2 + n(\bar{X})^2\right) \tag{3.3}$$
$$= \frac{1}{n}\left(\sum_{i=1}^{n} X_i^2 - 2\sum_{i=1}^{n} X_i \bar{X} + \sum_{i=1}^{n} (\bar{X})^2\right) = \frac{1}{n}\sum_{i=1}^{n}\left(X_i^2 - 2X_i\bar{X} + (\bar{X})^2\right)$$
$$= \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2.$$
Note that (3.3) is ‘reverse engineering’ as we are performing these steps knowing that
the variance should be estimated by an expression similar to (2.1).

(ii) The following data (rounded to the third decimal digit) were generated from a normal
distribution with mean 2 and a standard deviation of 1.5: 3.163, 1.883, 3.252, 3.716, −0.049,
−0.653, 0.057, 4.098, 1.670, 1.396, 2.332, 1.838, 3.024, 2.706, 3.830, 3.349, −0.230, 1.496, 0.231,
2.987. Obtain the method of moments estimates of the true mean and the true variance.
Answer: µ̂ = 2.0048 and σ̂ 2 ≈ 2.0974, i.e. σ̂ ≈ 1.44824.¹
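The estimates in part (ii) can be reproduced with a short Python sketch. It simply evaluates the first two sample moments and applies µ̂ = m1 and σ̂² = m2 − m1²; the variable names are our own.

```python
from math import sqrt

# Data from Example 3.4 (ii)
data = [3.163, 1.883, 3.252, 3.716, -0.049, -0.653, 0.057, 4.098, 1.670, 1.396,
        2.332, 1.838, 3.024, 2.706, 3.830, 3.349, -0.230, 1.496, 0.231, 2.987]

n = len(data)
m1 = sum(data) / n                  # first sample moment
m2 = sum(x ** 2 for x in data) / n  # second sample moment

mu_hat = m1
sigma2_hat = m2 - m1 ** 2           # method of moments estimate of the variance
sigma_hat = sqrt(sigma2_hat)

print(mu_hat, sigma2_hat, sigma_hat)
```

Note that σ̂² computed this way uses the divisor n, not n − 1.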

Exercise 3.5. Let X1 , X2 , . . . , Xn be a sample of i.i.d. random variables with pdf, for θ > 0,
$$f_X(x) = \begin{cases} \theta x^{\theta-1}, & 0 < x \le 1, \\ 0, & \text{otherwise.} \end{cases}$$

(i) Use the method of moments to estimate θ.

(ii) For the following observations of X calculate the method of moments estimate for θ:

0.3, 0.5, 0.8, 0.6, 0.4, 0.4, 0.5, 0.8, 0.6, 0.3

Solution. Answer: $\hat{\theta} = \dfrac{\bar{X}}{1-\bar{X}}$.
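For part (ii), a minimal sketch of the computation is given below. It uses the fact that for this density E[X] = θ/(θ + 1), so equating E[X] to the sample mean and solving for θ gives θ̂ = X̄/(1 − X̄). The variable names are our own.

```python
# Observations from Exercise 3.5 (ii)
data = [0.3, 0.5, 0.8, 0.6, 0.4, 0.4, 0.5, 0.8, 0.6, 0.3]

x_bar = sum(data) / len(data)    # first sample moment
theta_hat = x_bar / (1 - x_bar)  # solves theta / (theta + 1) = x_bar

print(x_bar, theta_hat)
```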

3.1.2 Method of Maximum Likelihood


Definition 3.6. Let f (x1 , . . . , xn , θ), θ ∈ ΩΘ ⊂ Rk , be the joint probability mass (or density)
function of n random variables X1 , . . . , Xn with sample values x1 , . . . , xn .
The likelihood function of the sample is given by

L(θ) = L(θ, x1 , . . . , xn ) = f (x1 , . . . , xn , θ).

We emphasize that L is a function of θ for fixed sample values.

If X1 , . . . , Xn are discrete i.i.d. random variables with probability mass function p(x, θ),
then the likelihood function is given by
$$L(\theta) = P_\theta(X_1 = x_1, \ldots, X_n = x_n) = P_\theta(X_1 = x_1) \cdots P_\theta(X_n = x_n) = \prod_{i=1}^{n} P_\theta(X_i = x_i).$$
¹ If you use software to calculate the standard deviation and need exact results, you need to check
whether the programme uses the factor 1/n or 1/(n − 1). In Excel, the function STDEV.P uses 1/n and
STDEV.S uses 1/(n − 1).

Thus in the discrete case we have
$$L(\theta) = \prod_{i=1}^{n} p(x_i, \theta) \tag{3.4}$$
and similarly in the continuous case, if the probability density function is f (x, θ), the likelihood
function is
$$L(\theta) = \prod_{i=1}^{n} f(x_i, \theta). \tag{3.5}$$
In practice, we will always use either (3.4) or (3.5) to calculate the likelihood.
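Example 3.7. Continuing Example 3.1 we have

(i) If X1 , . . . , Xn ∼ P ois(λ), then as we said θ = λ and $p(k, \lambda) = e^{-\lambda}\frac{\lambda^k}{k!}$ for k = 0, 1, 2, . . ..
Thus
$$L(\lambda) = \prod_{i=1}^{n} p(x_i, \lambda) = \prod_{i=1}^{n} e^{-\lambda}\frac{\lambda^{x_i}}{x_i!} = e^{-n\lambda}\frac{\lambda^{x_1+\cdots+x_n}}{x_1! \cdots x_n!}.$$

(ii) If X1 , . . . , Xn ∼ N (µ, σ 2 ), then θ = (µ, σ 2 ) and $f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ for x ∈ R.
Thus
$$L(\mu, \sigma^2) = \prod_{i=1}^{n} f(x_i, \mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} = \frac{1}{(2\pi)^{n/2}\sigma^n} e^{-\frac{\sum_{i=1}^{n}(x_i-\mu)^2}{2\sigma^2}}.$$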

Definition 3.8. The Maximum Likelihood Estimates are those values θ̂ of the parameters
that maximize the likelihood function with respect to the parameter θ. That is,
$$L(\hat{\theta}, x_1, \ldots, x_n) = \max_{\theta \in \Omega_\Theta} L(\theta, x_1, \ldots, x_n), \tag{3.6}$$
where ΩΘ is the set of possible values of the parameter θ.

To put it differently, θ̂ is such that
$$L(\hat{\theta}, x_1, \ldots, x_n) \ge L(\theta, x_1, \ldots, x_n)$$
for all θ ∈ ΩΘ .
As explained earlier, since θ̂ depends on the sample, it is also a random variable.
The procedure to find Maximum Likelihood Estimators (MLEs) is:
(i) Define the likelihood function, L(θ).

(ii) Often it is easier to take the natural logarithm of L(θ), that is, to work with the log-likelihood
l(θ) = ln(L(θ)).

(iii) Differentiate l(θ) with respect to θ, and then equate the derivative to zero.

(iv) Solve for the parameter θ to obtain θ̂.

(v) Check whether θ̂ is a global maximizer.

Taking the logarithm in (ii) simplifies the calculations, and is justified because the logarithm is a
monotonically increasing function, so it doesn’t change the location of the maxima.
We now try this recipe in a few examples and exercises.

Example 3.9. Suppose X1 , . . . , Xn are discrete i.i.d. random variables with the geomet-
ric distribution with an unknown parameter p, so that P (X = x) = (1 − p)x−1 p for x =
1, 2, . . .. We find the maximum likelihood estimator for the parameter p for n independent
observations x1 , x2 , . . . , xn and calculate the estimate for the following set of observations:
231, 127, 60, 4, 3, 183, 2, 4, 71, 22.

Solution. We find the likelihood
$$L(p) = \prod_{i=1}^{n} (1-p)^{x_i - 1} p = p^n (1-p)^{\sum (x_i - 1)} = p^n (1-p)^{\sum x_i - n}.$$
Thus, the log-likelihood is
$$l(p) = \ln(L(p)) = n \ln(p) + \left(\sum x_i - n\right) \ln(1-p).$$
We differentiate and set the derivative to be 0:
$$\frac{dl(p)}{dp} = \frac{n}{p} + \left(\sum x_i - n\right)\frac{-1}{1-p} = 0.$$
Solving for p we get
$$\hat{p} = \frac{n}{\sum x_i} = \frac{1}{\bar{x}}.$$
Plugging in our data (n = 10, $\sum x_i = 707$) we get p̂ = 10/707 ≈ 0.014.
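The closed-form estimate can be cross-checked numerically. The sketch below evaluates the log-likelihood of the geometric model on a fine grid of p values and compares the grid maximizer with n/Σxi. The grid search is only a crude illustration of step (v) of the recipe, not a general optimisation method.

```python
from math import log

# Observations from Example 3.9
data = [231, 127, 60, 4, 3, 183, 2, 4, 71, 22]
n, total = len(data), sum(data)

def log_likelihood(p):
    """Log-likelihood l(p) = n ln p + (sum x_i - n) ln(1 - p)."""
    return n * log(p) + (total - n) * log(1 - p)

p_hat = n / total  # closed-form MLE

# Crude numerical check: maximise l(p) over the grid 0.0001, 0.0002, ..., 0.9999
grid = [k / 10000 for k in range(1, 10000)]
p_grid = max(grid, key=log_likelihood)

print(p_hat, p_grid)
```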

Exercise 3.10. Suppose an isolated weather-reporting station has an electronic device which
operates for a random time until a failure occurs. The station also has one spare device, and
the time Y until this second instrument is no longer available has a distribution with the density
$$f_Y(y, \theta) = \frac{1}{\theta^2}\, y\, e^{-y/\theta}, \quad 0 \le y < \infty, \ 0 < \theta < \infty. \tag{3.7}$$
Five data points have been collected: 9.2, 5.6, 18.4, 12.1, 10.7. Find the maximum likelihood
estimate for θ.

Solution. We find the likelihood
$$L(\theta) = \prod_{i=1}^{n} \frac{1}{\theta^2}\, y_i\, e^{-y_i/\theta} = \frac{1}{\theta^{2n}} \left(\prod_{i=1}^{n} y_i\right) e^{-\frac{1}{\theta}\sum_{i=1}^{n} y_i}.$$
Thus, the log-likelihood is
$$l(\theta) = \ln(L(\theta)) = -2n \ln(\theta) + \ln\left(\prod_{i=1}^{n} y_i\right) - \frac{1}{\theta}\sum_{i=1}^{n} y_i.$$
We differentiate and set the derivative to be 0:
$$\frac{dl(\theta)}{d\theta} = \frac{-2n}{\theta} + \frac{1}{\theta^2}\sum_{i=1}^{n} y_i = 0.$$
Solving for θ we get
$$\hat{\theta} = \frac{1}{2n}\sum_{i=1}^{n} y_i = \frac{1}{2}\bar{y}.$$
Plugging in the data (n = 5, $\sum y_i = 56$) we get θ̂ = 56/10 = 5.6.
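A short numerical check of this solution is sketched below. It evaluates θ̂ = ȳ/2 and confirms that the log-likelihood is smaller at two nearby values of θ; the comparison points 5.0 and 6.2 are arbitrary choices for illustration.

```python
from math import log

# Observations from Exercise 3.10
data = [9.2, 5.6, 18.4, 12.1, 10.7]
n = len(data)

def log_likelihood(theta):
    """l(theta) = -2n ln(theta) + sum ln(y_i) - (1/theta) sum y_i."""
    return -2 * n * log(theta) + sum(log(y) for y in data) - sum(data) / theta

theta_hat = sum(data) / (2 * n)  # closed-form MLE, equal to y_bar / 2

print(theta_hat, log_likelihood(theta_hat))
print(log_likelihood(5.0), log_likelihood(6.2))  # both smaller than at theta_hat
```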
If a family of probability models is indexed by two or more unknown parameters, θ1 , θ2 , . . . , θk ,
finding maximum likelihood estimates requires the solution of k simultaneous equations. For
example, for k = 2,
$$\begin{cases} \dfrac{\partial \ln L(\theta_1, \theta_2)}{\partial \theta_1} = 0, \\[2mm] \dfrac{\partial \ln L(\theta_1, \theta_2)}{\partial \theta_2} = 0. \end{cases}$$

Exercise 3.11. For a random sample X1 , X2 , . . . , Xn from the normal distribution N (µ, σ 2 ) with
the pdf
$$f_X(x) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
find the maximum likelihood estimators for µ and σ 2 . Compare these estimators to the method
of moments estimators discussed in the previous section.
Solution. Answer: $\hat{\mu} = \bar{X}$ and $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2$.

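Example 3.12. Let X1 , . . . , Xn be a random sample from U (0, θ), θ > 0. We find the MLE
of θ. The density is
$$f_X(x) = \begin{cases} \frac{1}{\theta}, & 0 < x \le \theta, \\ 0, & \text{otherwise.} \end{cases}$$
Thus
$$L(\theta, x_1, \ldots, x_n) = \begin{cases} \frac{1}{\theta^n}, & 0 < x_1, x_2, \ldots, x_n \le \theta, \\ 0, & \text{otherwise.} \end{cases}$$
Unlike the previous examples, we don’t take logarithms or calculate derivatives. Instead, we
observe that L is non-zero only when all $x_i$ are smaller than or equal to θ, or equivalently,
when $\max_{i=1,\ldots,n} x_i \le \theta$. Hence L(θ) = 0 for $\theta < \max_{i=1,\ldots,n} x_i$, while on
$[\max_{i=1,\ldots,n} x_i, \infty)$ the function $L(\theta) = \theta^{-n}$ is decreasing. Therefore L attains its
maximum at $\hat{\theta} = \max_{i=1,\ldots,n} x_i$.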

3.1.3 Expectation and variance of the sample mean


Recall that a statistic is a random variable which is a function of a sample: g(X1 , . . . , Xn ).
The distribution of such random variable is called the sampling distribution.
We saw in many examples that many estimators involve calculating the sample mean X̄
and therefore in this short section we investigate the sampling distribution of X̄. We find E[X̄]
and Var(X̄).

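Theorem 3.13. Let X1 , . . . , Xn be a random sample of size n from a population with mean
µ and variance σ 2 . Then $E[\bar{X}] = \mu$ and $\mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}$.

Proof. We have
$$E[\bar{X}] = E\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n}\sum_{i=1}^{n} \mu = \frac{n\mu}{n} = \mu$$
and by independence
$$\mathrm{Var}(\bar{X}) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}(X_i) = \frac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n}.$$

We summarize this in the following formulas:
$$\mu_{\bar{X}} = E[\bar{X}] = \mu, \qquad \sigma_{\bar{X}}^2 = \mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}.$$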
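Theorem 3.13 can be verified exactly for a small discrete population. The sketch below is an illustration with an assumed population (a fair six-sided die) and n = 3: it enumerates all 6³ equally likely samples and computes the mean and variance of X̄ with exact rational arithmetic.

```python
from fractions import Fraction
from itertools import product

# Assumed population: a fair die, each face with probability 1/6
faces = [Fraction(k) for k in range(1, 7)]
mu = sum(faces) / 6
sigma2 = sum((x - mu) ** 2 for x in faces) / 6

n = 3
outcomes = list(product(faces, repeat=n))  # all 6**n equally likely samples
means = [sum(sample) / n for sample in outcomes]

e_mean = sum(means) / len(means)                               # E[X_bar]
var_mean = sum((m - e_mean) ** 2 for m in means) / len(means)  # Var(X_bar)

print(mu, sigma2)        # population mean and variance
print(e_mean, var_mean)  # E[X_bar] = mu and Var(X_bar) = sigma2 / n
```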
Definition 3.14. The value $\sigma_{\bar{X}} = \sigma/\sqrt{n}$ is called the standard error of the mean.

If in Theorem 3.13 we additionally assume that the sample is from a Gaussian population,
then the sample mean is also Gaussian. First, though, we recall the following result from the
MA1061 - Probability module:

Proposition 3.15. Let X1 , X2 , . . . , Xn be independent normally distributed random variables
such that $X_i \sim N(\mu_i, \sigma_i^2)$. Let $Y = c_1 X_1 + c_2 X_2 + \cdots + c_n X_n$. Then
$Y \sim N\left(\sum_{i=1}^{n} c_i \mu_i, \sum_{i=1}^{n} c_i^2 \sigma_i^2\right)$.

Theorem 3.16. Let X1 , . . . , Xn be an i.i.d. random sample from N (µ, σ 2 ). Then
$\bar{X} \sim N(\mu, \sigma^2/n)$ and $\frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim N(0, 1)$, i.e. it is standard normal.

Proof. By Proposition 3.15, a linear combination of independent Gaussian random variables
is Gaussian, thus both $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}$ are Gaussian. Their mean and
variance can easily be found similarly to the previous proof.

The above theorems formalize the important concept of standardizing, that is creating
mean-zero and unit-variance random variables:
$$X \sim N(\mu, \sigma^2) \implies Z = \frac{X - \mu}{\sigma} \sim N(0, 1).$$

3.2 Properties of Point Estimators


In this section we define various properties of estimators. The purpose of studying them is to
be able to distinguish between good and poor estimators.

3.2.1 Unbiasedness
Definition 3.17. A point estimator θ̂ is called an unbiased estimator of the parameter θ if
E[θ̂] = θ. Otherwise θ̂ is said to be biased. Furthermore, the bias of θ̂ is given by
$$B(\hat{\theta}, \theta) = \mathrm{Bias}(\hat{\theta}, \theta) = E[\hat{\theta}] - \theta.$$

Example 3.18. Let X1 , . . . , Xn be a random sample from a Bernoulli population with parameter
p. We show that the method of moments estimator obtained in Example 3.3 is also an
unbiased estimator. We have
$$\hat{p} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{Y}{n},$$
where Y is a binomial random variable with E[Y ] = np, hence
$$E[\hat{p}] = \frac{1}{n} E[Y] = \frac{1}{n}\, np = p.$$
Theorem 3.19. The mean of a random sample X̄ is an unbiased estimator of the population
mean µ.

Proof. Let X1 , . . . , Xn be random variables with mean µ. Then, the sample mean is
$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ and
$$E[\bar{X}] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n}\, n\mu = \mu.$$
Hence, X̄ is an unbiased estimator of µ.

The following result motivates taking the factor $\frac{1}{n-1}$ when estimating the variance (and hence
the standard deviation):
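Theorem 3.20. Let X1 , . . . , Xn be a random sample drawn from an infinite population with
mean µ and variance σ 2 < ∞. Let
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$$
be the variance of the random sample. Then S 2 is an unbiased estimator for σ 2 .
Proof.
$$E[S^2] = \frac{1}{n-1} E\left[\sum_{i=1}^{n} \left((X_i - \mu) - (\bar{X} - \mu)\right)^2\right]$$
$$= \frac{1}{n-1} E\left[\sum_{i=1}^{n} (X_i - \mu)^2 - 2(\bar{X} - \mu)\sum_{i=1}^{n} (X_i - \mu) + \sum_{i=1}^{n} (\bar{X} - \mu)^2\right]$$
$$= \frac{1}{n-1} E\left[\sum_{i=1}^{n} (X_i - \mu)^2 - 2(\bar{X} - \mu)\, n(\bar{X} - \mu) + n(\bar{X} - \mu)^2\right]$$
$$= \frac{1}{n-1}\left(\sum_{i=1}^{n} E\left[(X_i - \mu)^2\right] - nE\left[(\bar{X} - \mu)^2\right]\right).$$
By Theorem 3.13, $E\left[(\bar{X} - \mu)^2\right] = \mathrm{Var}(\bar{X}) = \sigma^2/n$. Thus
$$E[S^2] = \frac{1}{n-1}\left(\sum_{i=1}^{n} \sigma^2 - n\frac{\sigma^2}{n}\right) = \frac{1}{n-1}\left(n\sigma^2 - \sigma^2\right) = \sigma^2.$$
Hence, S 2 is an unbiased estimator for σ 2 .

The estimator $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2$ is a biased estimator for σ 2 :
$$E[\hat{\sigma}^2] = \frac{n-1}{n}\sigma^2.$$
However, as the sample size n → ∞,
$$E[\hat{\sigma}^2] \to \sigma^2,$$
hence σ̂ 2 is asymptotically unbiased.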
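The unbiasedness of S² and the bias factor (n − 1)/n of σ̂² can be checked exactly by enumeration. The sketch below again assumes a fair die as the population and n = 3; this is our own illustrative choice, and the theorem itself holds for any population with finite variance.

```python
from fractions import Fraction
from itertools import product

# Assumed population: a fair die, each face with probability 1/6
faces = [Fraction(k) for k in range(1, 7)]
mu = sum(faces) / 6
sigma2 = sum((x - mu) ** 2 for x in faces) / 6

n = 3
s2_values, biased_values = [], []
for sample in product(faces, repeat=n):  # all 6**n equally likely samples
    x_bar = sum(sample) / n
    ss = sum((x - x_bar) ** 2 for x in sample)
    s2_values.append(ss / (n - 1))       # S^2, divisor n - 1
    biased_values.append(ss / n)         # sigma_hat^2, divisor n

e_s2 = sum(s2_values) / len(s2_values)
e_biased = sum(biased_values) / len(biased_values)

print(sigma2, e_s2, e_biased)
```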
Exercise 3.21. Let θ̂1 and θ̂2 be two unbiased estimators of θ. Show that a convex combin-
ation of θ̂1 and θ̂2
θ̂3 = aθ̂1 + (1 − a)θ̂2 , 0 ≤ a ≤ 1
is an unbiased estimator of θ.
Solution. We are given that E[θ̂1 ] = θ and E[θ̂2 ] = θ. Therefore,
E[θ̂3 ] = E[aθ̂1 + (1 − a)θ̂2 ] = aE[θ̂1 ] + (1 − a)E[θ̂2 ] = aθ + (1 − a)θ = θ.
Hence, θ̂3 is unbiased.

3.2.2 Efficiency
Among unbiased estimators, we may still argue that one is better than the other as the
following example shows.
Example 3.22. Let X1 , X2 , X3 be a sample of size n = 3 from a distribution with unknown
mean µ, −∞ < µ < ∞, where the variance σ 2 is a known positive number.
One can show that both θ̂1 = X̄ and θ̂2 = (2X1 + X2 + 5X3 )/8 are unbiased estimators
for µ. Indeed, for θ̂1 we have
1
E[θ̂1 ] = E[X̄] = · 3µ = µ
3
and for θ̂2 :
1 2µ + µ + 5µ
E[θ̂2 ] = · (2E[X1 ] + E[X2 ] + 5E[X3 ]) = = µ.
8 8
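However, the variances of θ̂1 and θ̂2 are different:
$$\mathrm{Var}(\hat{\theta}_1) = \frac{\sigma^2}{3},$$
$$\mathrm{Var}(\hat{\theta}_2) = \mathrm{Var}\left(\frac{2X_1 + X_2 + 5X_3}{8}\right) = \frac{4}{64}\sigma^2 + \frac{1}{64}\sigma^2 + \frac{25}{64}\sigma^2 = \frac{30}{64}\sigma^2.$$
Since Var(θ̂1 ) < Var(θ̂2 ), X̄ is a better unbiased estimator.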


Definition 3.23. Let θ̂1 and θ̂2 be two unbiased estimators for a parameter θ. If
   
Var θ̂1 < Var θ̂2

we say that θ̂1 is more efficient than θ̂2 .


The relative efficiency of θ̂1 with respect to θ̂2 is the ratio Var(θ̂2 )/ Var(θ̂1 ).
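For linear estimators such as those in Example 3.22, the variance is σ² times the sum of the squared weights, so the comparison reduces to simple arithmetic. The sketch below (with our own helper name `variance_factor`) computes these factors and the relative efficiency of θ̂1 with respect to θ̂2 exactly.

```python
from fractions import Fraction

def variance_factor(weights):
    """Var(sum w_i X_i) / sigma^2 for independent X_i with common variance."""
    return sum(w ** 2 for w in weights)

w1 = [Fraction(1, 3)] * 3                               # theta_hat_1 = X_bar
w2 = [Fraction(2, 8), Fraction(1, 8), Fraction(5, 8)]   # theta_hat_2

# Both sets of weights sum to 1, which is why both estimators are unbiased
assert sum(w1) == 1 and sum(w2) == 1

v1 = variance_factor(w1)   # Var(theta_hat_1) / sigma^2
v2 = variance_factor(w2)   # Var(theta_hat_2) / sigma^2
rel_eff = v2 / v1          # relative efficiency of theta_hat_1 w.r.t. theta_hat_2

print(v1, v2, rel_eff)
```

A relative efficiency greater than 1 means that θ̂1 has the smaller variance.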
Definition 3.24. An unbiased estimator θ̂∗ is the minimum variance estimator if

Var(θ̂∗ ) ≤ Var(θ̂)

for every unbiased estimator θ̂ of a parameter θ.


Theorem 3.25 (The Cramér-Rao lower bound). Let fX (x, θ) be a pdf with
continuous first-order and second-order derivatives with respect to θ.
Let X1 , X2 , . . . , Xn be a random sample from fX (x, θ), and suppose that the set of x values
where fX (x, θ) ≠ 0 does not depend on θ. Let θ̂ = g(X1 , X2 , . . . , Xn ) be any unbiased estimator
of θ. Then
$$\mathrm{Var}(\hat{\theta}) \ge \left\{ nE\left[\left(\frac{\partial \ln f_X(X, \theta)}{\partial \theta}\right)^2\right] \right\}^{-1} = \left\{ -nE\left[\frac{\partial^2 \ln f_X(X, \theta)}{\partial \theta^2}\right] \right\}^{-1}.$$
We skip the proof of this Theorem.
The expression
$$I(\theta) = nE\left[\left(\frac{\partial \ln f_X(X, \theta)}{\partial \theta}\right)^2\right] = -nE\left[\frac{\partial^2 \ln f_X(X, \theta)}{\partial \theta^2}\right]$$
is called the Fisher information. This Theorem says that the variance of an unbiased
estimator cannot be smaller than $I(\theta)^{-1}$.

Definition 3.26. The unbiased estimator θ̂ is said to be efficient if the variance of θ̂ equals
to the Cramér-Rao lower bound associated with fX (x, θ).
The efficiency of an unbiased estimator θ̂ is the ratio of the Cramér-Rao lower bound for
fX (x, θ) to the variance of θ̂.

An efficient estimator is optimal in the sense that it has the smallest possible variance
among all unbiased estimators.

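Example 3.27. Let X1 , X2 , . . . , Xn be a random sample from the Poisson distribution
$p_X(x, \lambda) = \frac{e^{-\lambda}\lambda^x}{x!}$, x = 0, 1, . . .. We compare the Cramér-Rao lower bound for pX (x, λ) to the variance
of the maximum likelihood estimator for λ. Firstly, we find the MLE. The likelihood is given
by
$$L(\lambda) = \prod_{i=1}^{n} \frac{e^{-\lambda}\lambda^{x_i}}{x_i!} = e^{-n\lambda}\frac{\lambda^{\sum x_i}}{\prod x_i!}.$$
The log-likelihood is $\ln L(\lambda) = -n\lambda + \ln(\lambda)\sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \ln(x_i!)$. Differentiating with
respect to λ we get
$$\frac{d \ln L(\lambda)}{d\lambda} = -n + \lambda^{-1}\sum_{i=1}^{n} x_i = 0.$$
Solving for λ gives $\hat{\lambda} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{X}$.
Secondly, we find the Cramér-Rao Lower Bound. Differentiating
$\ln p_X(x, \lambda) = -\lambda + x\ln\lambda - \ln(x!)$ twice we get
$$\frac{d \ln p_X(x, \lambda)}{d\lambda} = -1 + \frac{x}{\lambda},$$
$$\frac{d^2 \ln p_X(x, \lambda)}{d\lambda^2} = -\frac{x}{\lambda^2}.$$
Thus, the Fisher information is equal to (recall from Exercise 1.19 that E[X] = λ)
$$I(\lambda) = -nE\left[\frac{d^2 \ln p_X(X, \lambda)}{d\lambda^2}\right] = \frac{n}{\lambda^2} E[X] = \frac{n\lambda}{\lambda^2} = \frac{n}{\lambda}.$$
Thirdly, the Cramér-Rao Lower Bound (CRLB) is
$$\mathrm{CRLB} = \frac{1}{I(\lambda)} = \frac{\lambda}{n}$$
and is equal to the variance of λ̂:
$$\mathrm{Var}(\hat{\lambda}) = \mathrm{Var}\left(\frac{\sum_{i=1}^{n} X_i}{n}\right) = \frac{\sum_{i=1}^{n} \mathrm{Var}(X_i)}{n^2} = \frac{\sum_{i=1}^{n} \lambda}{n^2} = \frac{n\lambda}{n^2} = \frac{\lambda}{n}.$$
We conclude that the maximum-likelihood estimator λ̂ = X̄ for the parameter λ of the Poisson
distribution is efficient.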
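The Fisher information of the Poisson model can also be checked numerically. The sketch below computes the per-observation quantity E[(d ln p/dλ)²] by summing over the pmf (truncated at x = 200, which is an assumption that is harmless for small λ) and compares the resulting bound with λ/n. The values λ = 3 and n = 20 are arbitrary illustrative choices.

```python
from math import exp

def poisson_pmf_values(lam, x_max):
    """Yield (x, p(x)) for x = 0..x_max using p(x) = p(x-1) * lam / x."""
    p = exp(-lam)
    for x in range(x_max + 1):
        if x > 0:
            p *= lam / x
        yield x, p

lam, n = 3.0, 20  # illustrative values

# E[(d/d lambda ln p(X, lambda))^2] with d/d lambda ln p = x / lambda - 1
info_one = sum(p * (x / lam - 1) ** 2 for x, p in poisson_pmf_values(lam, 200))

crlb = 1 / (n * info_one)  # Cramer-Rao lower bound for a sample of size n

print(info_one, 1 / lam)   # per-observation information vs 1 / lambda
print(crlb, lam / n)       # bound vs Var(X_bar) = lambda / n
```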
For biased estimators, one can measure the precision of an estimator by finding expected
squared distance between the true value of the parameter and its estimator:

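Definition 3.28. The mean square error of the estimator θ̂, denoted by MSE(θ̂), is defined
as
$$\mathrm{MSE}(\hat{\theta}) = E\left[(\hat{\theta} - \theta)^2\right].$$

We have
$$\mathrm{MSE}(\hat{\theta}) = E\left[\left(\hat{\theta} - E[\hat{\theta}] + E[\hat{\theta}] - \theta\right)^2\right]$$
$$= E\left[\left(\hat{\theta} - E[\hat{\theta}]\right)^2 + \left(E[\hat{\theta}] - \theta\right)^2 + 2\left(\hat{\theta} - E[\hat{\theta}]\right)\left(E[\hat{\theta}] - \theta\right)\right]$$
$$= E\left[\left(\hat{\theta} - E[\hat{\theta}]\right)^2\right] + E\left[\left(E[\hat{\theta}] - \theta\right)^2\right] + 2E\left[\hat{\theta} - E[\hat{\theta}]\right]\left(E[\hat{\theta}] - \theta\right)$$
$$= \mathrm{Var}(\hat{\theta}) + \left(E[\hat{\theta}] - \theta\right)^2 \quad \text{because } E\left[\hat{\theta} - E[\hat{\theta}]\right] = 0,$$
$$= \mathrm{Var}(\hat{\theta}) + \left(\mathrm{Bias}(\hat{\theta}, \theta)\right)^2.$$

Definition 3.29. The unbiased estimator θ̂ that minimizes the mean square error among all
unbiased estimators is called the Minimum-Variance Unbiased Estimator of θ.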
Exercise 3.30. If X has a binomial distribution with parameters n and p, then p̂1 = X/n is
an unbiased estimator of p. Another estimator of p is p̂2 = (X + 1)/(n + 2).
1) Derive the bias of p̂2 .
2) Derive MSE(p̂1 ) and MSE(p̂2 ).
3) Show that for p ≈ 0.5 MSE(p̂2 ) < MSE(p̂1 ).
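A numerical check of this exercise is sketched below. It uses E[X] = np and Var(X) = np(1 − p) together with MSE = variance + bias², and evaluates both mean square errors exactly for n = 10 (an arbitrary illustrative choice). At p = 1/2 the estimator p̂2 has the smaller MSE, while for p far from 1/2 (here p = 1/10) the unbiased estimator p̂1 does better.

```python
from fractions import Fraction

def mse_p1(n, p):
    """MSE of p1 = X/n: unbiased, so MSE = Var = p(1-p)/n."""
    return p * (1 - p) / n

def bias_p2(n, p):
    """Bias of p2 = (X+1)/(n+2); equals (1 - 2p)/(n + 2)."""
    return (n * p + 1) / (n + 2) - p

def mse_p2(n, p):
    """MSE of p2 = variance + bias^2."""
    variance = n * p * (1 - p) / (n + 2) ** 2
    return variance + bias_p2(n, p) ** 2

n = 10
for p in (Fraction(1, 2), Fraction(1, 10)):
    print(p, mse_p1(n, p), mse_p2(n, p))
```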

3.2.3 Sufficiency
Definition 3.31. Let X = (X1 , . . . , Xn ) be a random sample from a probability distribution
with unknown parameter θ. Then, the statistic U = g(X1 , . . . , Xn ) is said to be sufficient for
θ if the conditional pdf fX (x1 , . . . , xn |U = u) (or pmf pX (x1 , . . . , xn |U = u)) does not depend
on θ for any value of u.
An estimator of θ that is a function of a sufficient statistic for θ is said to be a sufficient
estimator of θ.

Example 3.32. Let X1 , . . . , Xn be a random sample of size n drawn from the Bernoulli pmf:
$p_X(k, p) = p^k (1-p)^{1-k}$, where k = 0, 1 and p is an unknown parameter. The maximum
likelihood estimator for p is $\hat{p} = \frac{1}{n}\sum_{i=1}^{n} X_i$. Let us also denote the maximum likelihood
estimate for p given a sample x = (k1 , . . . , kn ) by $p_e = \frac{1}{n}\sum_{i=1}^{n} k_i$.
We show that p̂ is a sufficient estimator for p. We have
$$P(\hat{p} = p_e) = P\left(\sum_{i=1}^{n} X_i = np_e\right) = \binom{n}{np_e} p^{np_e} (1-p)^{n-np_e}. \tag{3.8}$$
The conditional pmf is equal to
$$P(X_1 = k_1, \ldots, X_n = k_n \mid \hat{p} = p_e) = \frac{P(X_1 = k_1, \ldots, X_n = k_n \text{ and } \hat{p} = p_e)}{P(\hat{p} = p_e)} = \frac{P(X_1 = k_1, \ldots, X_n = k_n)}{P(\hat{p} = p_e)}$$
$$= \frac{p^{k_1}(1-p)^{1-k_1} \cdots p^{k_n}(1-p)^{1-k_n}}{P(\hat{p} = p_e)} = \frac{p^{\sum_{i=1}^{n} k_i}(1-p)^{n-\sum_{i=1}^{n} k_i}}{P(\hat{p} = p_e)} = \frac{p^{np_e}(1-p)^{n-np_e}}{P(\hat{p} = p_e)}.$$
Plugging in (3.8) we get
$$P(X_1 = k_1, \ldots, X_n = k_n \mid \hat{p} = p_e) = \frac{1}{\binom{n}{np_e}},$$
which does not depend on p. This shows that p̂ is a sufficient statistic.
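The key point of this example, that the conditional probability is the same for every p, can be illustrated with exact arithmetic. The sketch below uses an arbitrary sample of size 5 and three arbitrary values of p; the function name `conditional_prob` is our own.

```python
from fractions import Fraction
from math import comb

def conditional_prob(sample, p):
    """P(X_1=k_1,...,X_n=k_n | sum X_i = sum k_i) for i.i.d. Bernoulli(p)."""
    n, s = len(sample), sum(sample)
    joint = p ** s * (1 - p) ** (n - s)                   # P(X_1=k_1,...,X_n=k_n)
    prob_sum = comb(n, s) * p ** s * (1 - p) ** (n - s)   # P(sum X_i = s), cf. (3.8)
    return joint / prob_sum

sample = (1, 0, 1, 1, 0)  # illustrative sample with n = 5 and sum 3
values = [conditional_prob(sample, Fraction(a, 10)) for a in (2, 5, 9)]

print(values)  # the same value 1 / C(5, 3) for every p
```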


Theorem 3.33. If θ̂ is a sufficient statistic for θ, then any one-to-one function of θ̂ is also
a sufficient statistic for θ.
We skip the proof. For example, $\hat{p}^* = n\hat{p} = n \cdot \frac{1}{n}\sum_{i=1}^{n} X_i = \sum_{i=1}^{n} X_i$ is also sufficient.
We give an example of a statistic which is not sufficient:
Example 3.34. Continuing the setting of Example 3.32, a statistic which is not sufficient is
for example $\hat{p}^* = X_1$. We have
$$P(X_1 = k_1, \ldots, X_n = k_n \mid \hat{p}^* = k_1) = \frac{p^{\sum_{i=1}^{n} k_i}(1-p)^{n-\sum_{i=1}^{n} k_i}}{p^{k_1}(1-p)^{1-k_1}},$$
which depends on p. We conclude that $\hat{p}^* = X_1$ is not sufficient.
In practice, in order to verify that an estimator is sufficient we use the following criterion.
Theorem 3.35 (Neyman-Fisher factorization criterion). Let θ̂ = u(X1 , . . . , Xn ) be a statistic
based on the random sample X1 , . . . , Xn . Then, θ̂ is a sufficient statistic for θ if and only if
the discrete joint pmf pX (x1 , . . . , xn , θ) (which depends on the parameter θ) can be factored
into two non-negative functions,
$$p_X(x_1, \ldots, x_n, \theta) = g(u(x_1, \ldots, x_n), \theta) \cdot h(x_1, \ldots, x_n), \quad \text{for all } x_1, \ldots, x_n,$$
where g(θ̂, θ) is a function only of θ̂ and θ and h(x1 , . . . , xn ) is a function of only x1 , . . . , xn
and not of θ.
The same statement holds for the continuous case.

Note that h may also depend on θ̂ as θ̂ = u(x1 , . . . , xn ) is itself a function of the sample.

Example 3.36. Let X1 = k1 , ..., Xn = kn be a random sample of size n from the Poisson
distribution. We show that λ̂ = X̄ is a sufficient statistic for λ. We have
$$p_X(k_1, \ldots, k_n, \lambda) = \prod_{i=1}^{n} e^{-\lambda}\frac{\lambda^{k_i}}{k_i!} = e^{-n\lambda}\lambda^{\sum k_i}(k_1! \cdots k_n!)^{-1} = e^{-n\lambda}\lambda^{n\hat{\lambda}}(k_1! \cdots k_n!)^{-1}$$
$$= g(\lambda, \hat{\lambda})\, h(k_1, \ldots, k_n),$$
if we take $g(\lambda, \hat{\lambda}) = e^{-n\lambda}\lambda^{n\hat{\lambda}}$ and $h(k_1, \ldots, k_n) = (k_1! \cdots k_n!)^{-1}$.

Exercise 3.37. Let X1 , ..., Xn denote a random sample from a geometric population with
parameter p:
pX (x; p) = p(1 − p)x−1 , x = 1, 2, 3...
Show that X̄ is sufficient for p.

3.2.4 Consistency
Definition 3.38. A sequence of random variables X1 , X2 , . . . converges in probability to a
random variable X if for every ε > 0
$$\lim_{n \to \infty} P(|X_n - X| < \varepsilon) = 1. \tag{3.9}$$
Convergence in probability is denoted as $X_n \xrightarrow{p} X$.

Note that (3.9) is equivalent to
$$\lim_{n \to \infty} P(|X_n - X| \ge \varepsilon) = 0.$$

Consistency, which we now define, involves investigating the convergence of the estimator
as the sample size increases. Intuitively, the larger the sample, the more accurate estimate of
the parameter. Since the sample size n is very important in these considerations, we denote
by θ̂n an estimator, which is based on the sample containing n observations.

Definition 3.39. A sequence θ̂n = u(X1 , X2 , . . . , Xn ), n = 1, 2, . . . is said to be a consistent
sequence of estimators for θ if it converges in probability to θ, i.e. for every ε > 0
$$\lim_{n \to \infty} P\left(|\hat{\theta}_n - \theta| < \varepsilon\right) = 1.$$
Consistency means that the probability of our estimator being within some small ε-interval
of θ can be made as close to one as we like by making the sample size n sufficiently large.
The fact that the sample mean is a consistent estimator for the true mean no matter what
pdf the data come from is often referred to as the weak law of large numbers:

Theorem 3.40 (Weak Law of Large Numbers). Let X1 , X2 , . . . be i.i.d. random variables with
E[Xi ] = µ and Var(Xi ) = σ 2 < ∞. Define $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then for every ε > 0,
$$\lim_{n \to \infty} P\left(|\bar{X}_n - \mu| < \varepsilon\right) = 1,$$
that is, X̄n converges in probability to µ.

Proof. Recall from MA1601 the Markov inequality: for any random variable X and any
non-negative function g we have for any k > 0,
$$P(g(X) \ge k) \le \frac{E[g(X)]}{k}.$$
Using this, we have
$$P\left(|\bar{X}_n - \mu| < \varepsilon\right) = 1 - P\left(|\bar{X}_n - \mu| \ge \varepsilon\right) = 1 - P\left((\bar{X}_n - \mu)^2 \ge \varepsilon^2\right)$$
$$\ge 1 - \frac{E[(\bar{X}_n - \mu)^2]}{\varepsilon^2} = 1 - \frac{\mathrm{Var}(\bar{X}_n)}{\varepsilon^2} = 1 - \frac{\sigma^2}{n\varepsilon^2}.$$
Thus $P\left(|\bar{X}_n - \mu| < \varepsilon\right) \to 1$ as n → ∞.
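The bound used in this proof can be compared with exact probabilities in a simple case. The sketch below assumes fair coin tosses (µ = 1/2, σ² = 1/4) and ε = 1/10, both our own illustrative choices, and computes P(|X̄n − µ| ≥ ε) exactly from the binomial distribution alongside the bound σ²/(nε²).

```python
from fractions import Fraction
from math import comb

def tail_prob(n, eps):
    """Exact P(|X_bar_n - 1/2| >= eps) for n fair coin tosses."""
    half = Fraction(1, 2)
    favourable = sum(comb(n, k) for k in range(n + 1)
                     if abs(Fraction(k, n) - half) >= eps)
    return Fraction(favourable, 2 ** n)

def chebyshev_bound(n, eps):
    """Upper bound sigma^2 / (n eps^2) with sigma^2 = 1/4."""
    return Fraction(1, 4) / (n * eps ** 2)

eps = Fraction(1, 10)
for n in (10, 100, 1000):
    print(n, float(tail_prob(n, eps)), float(chebyshev_bound(n, eps)))
```

The exact probabilities decrease much faster than the bound, which only guarantees convergence to zero.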

Definition 3.41. A sequence of random variables X1 , X2 , . . . converges in distribution to a
random variable X if
$$\lim_{n \to \infty} F_{X_n}(x) = F_X(x)$$
at all points x where the cdf FX (x) is continuous. Convergence in distribution is denoted as
$X_n \xrightarrow{d} X$.

Convergence in probability is stronger than convergence in distribution. That is, if
$X_n \xrightarrow{p} X$, then $X_n \xrightarrow{d} X$. The converse is not necessarily true. The proofs are beyond the
scope of this course.

Theorem 3.42 (The Central Limit Theorem). Let X1 , X2 , . . . be a sequence of independent
identically distributed random variables such that E[Xi ] = µ and Var(Xi ) = σ 2 < ∞. Define
$$U_n = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}.$$
Then the sequence of random variables U1 , U2 , . . . converges in distribution to a random variable
U ∼ N (0, 1).

The assertion of this theorem means that $\lim_{n \to \infty} F_{U_n}(u) = \Phi(u)$, where Φ is the cdf of the
standard normal distribution. Equivalently:
$$\lim_{n \to \infty} P(U_n \le z) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{z} e^{-t^2/2}\, dt.$$
The Central Limit Theorem means that the sample mean is asymptotically normally distributed
whatever the distribution of the original random variables is.
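To see the theorem at work, the sketch below computes P(Un ≤ z) exactly when the Xi are fair coin tosses (so µ = 1/2 and σ = 1/2, an assumed example) and compares it with Φ(z), evaluated through the error function. The function names are our own.

```python
from math import comb, erf, floor, sqrt

def std_normal_cdf(z):
    """Phi(z), the cdf of N(0, 1), via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def cdf_standardised_mean(n, z):
    """Exact P(U_n <= z) when X_i are fair coin tosses (mu = sigma = 1/2)."""
    # U_n <= z  <=>  sum X_i <= n/2 + z * sqrt(n) / 2
    k_max = min(floor(n / 2 + z * sqrt(n) / 2), n)
    return sum(comb(n, k) for k in range(k_max + 1)) / 2 ** n

z = 1.0
for n in (10, 100, 1000):
    print(n, cdf_standardised_mean(n, z), std_normal_cdf(z))
```

Because the coin-toss sum is discrete, the exact cdf approaches Φ(z) in small jumps rather than smoothly.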

3.3 Probability distributions


In this section we introduce some new probability distributions, which are useful in interval
estimation and hypothesis testing. For lack of time in this course, we state the properties
of these distributions (like expectation and variance) without derivations. In most cases the
proofs of those formulas are straightforward.

3.3.1 Gamma Distribution


Definition 3.43. The gamma function, denoted by Γ(a), is defined as
$$\Gamma(a) = \int_{0}^{\infty} e^{-x} x^{a-1}\, dx, \quad a > 0.$$
It can be shown that for a > 1, Γ(a) = (a − 1)Γ(a − 1). For n ∈ N
$$\Gamma(n) = (n-1)!$$
and
$$\Gamma\left(\frac{1}{2}\right) = \sqrt{\pi}.$$
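These properties are easy to check numerically with `math.gamma` from the Python standard library. The particular arguments below are arbitrary illustrative choices.

```python
from math import factorial, gamma, pi, sqrt

# Recursion Gamma(a) = (a - 1) * Gamma(a - 1), checked at a = 4.5
print(gamma(4.5), 3.5 * gamma(3.5))

# Gamma(n) = (n - 1)! for a natural number, checked at n = 5
print(gamma(5), factorial(4))

# Gamma(1/2) equals the square root of pi
print(gamma(0.5), sqrt(pi))
```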
Definition 3.44. A random variable X is said to possess a gamma probability distribution
with parameters α > 0 and β > 0 if it has the pdf given by
$$f(x) = \begin{cases} \dfrac{1}{\beta^\alpha \Gamma(\alpha)}\, x^{\alpha-1} e^{-x/\beta}, & x > 0, \\ 0, & \text{otherwise.} \end{cases}$$
We denote this by Gamma(α, β) or Γ(α, β). The parameter α is called a shape parameter,
and β is called a scale parameter.
If X ∼ Γ(α, β):
$$E[X] = \alpha\beta, \qquad \mathrm{Var}(X) = \alpha\beta^2$$
and the moment-generating function is
$$M_X(t) = (1 - \beta t)^{-\alpha} \quad \text{for } t < \frac{1}{\beta}.$$

Figure 3.1: Density of the Gamma distribution

A special case of the gamma distribution is the exponential distribution, obtained for α = 1, β = 1/λ,
for which we get $f(x) = \lambda e^{-\lambda x}$, x > 0.
We give the following results without proof:
Theorem 3.45. Let X1 , X2 , . . . , Xn be independent random variables such that Xj ∼ Γ(αj , β),
j = 1, 2, . . . , n. Then $Y = \sum_{k=1}^{n} X_k$ is a $\Gamma\left(\sum_{j=1}^{n} \alpha_j, \beta\right)$ random variable.

Corollary 3.46. Let X1 , X2 , . . . , Xn be i.i.d. random variables, each with an exponential
distribution with mean β (that is, Xj ∼ Γ(1, β)). Then $Y = \sum_{k=1}^{n} X_k$ has the Gamma distribution
Γ(n, β).
Example 3.47. In Exercise 3.10 we described the following situation. A weather-reporting
station has two devices, which can operate for a random, exponentially distributed time X1 ∼
Exp(λ) and X2 ∼ Exp(λ). The second device is turned on when the first device fails, so the
total time the station can operate is Y = X1 + X2 . According to Corollary 3.46, Y ∼ Γ(2, 1/λ),
i.e. it has a density given by (3.7) with θ = 1/λ.

3.3.2 χ2 distribution
Usually when defining a new continuous distribution, we give a formula for its density. This
time we will construct it differently.
Definition 3.48. Let Z1 , Z2 , . . . , Zn be independent random variables with Zi ∼ N (0, 1). If
$Y = \sum_{i=1}^{n} Z_i^2$, then Y follows the chi-square distribution with n degrees of freedom. We write
$Y \sim \chi_n^2$.
In particular, if Z ∼ N (0, 1) and X = Z 2 , then X follows the chi-square distribution with
1 degree of freedom.
If $X \sim \chi_n^2$, then the probability density is²:
$$f(x) = \begin{cases} \dfrac{1}{2^{n/2}\Gamma(n/2)}\, x^{n/2-1} e^{-x/2}, & x > 0, \\ 0, & \text{otherwise.} \end{cases}$$
² Derivation of the χ2 pdf: https://en.wikipedia.org/wiki/Proofs_related_to_chi-squared_distribution

Figure 3.2: Density of the χ2 -distribution

Comparing with Definition 3.44, we see that the χ2 -distribution with n degrees of freedom is the
same as Γ(n/2, 2). The expectation and variance are
$$E[X] = n, \qquad \mathrm{Var}(X) = 2n,$$
and the moment-generating function is
$$M(t) = (1 - 2t)^{-n/2} \quad \text{for } t < \frac{1}{2}. \tag{3.10}$$
We have the following result about sums of χ2 -random variables (without proof):
Corollary 3.49. If X1 , X2 , . . . , Xn are independent RVs such that $X_j \sim \chi^2(r_j)$, j = 1, 2, . . . , n,
then $Y = \sum_{j=1}^{n} X_j$ is a $\chi^2\left(\sum_{j=1}^{n} r_j\right)$ RV.

Theorem 3.50. Let X1 , X2 , . . . , Xn be independent random variables with Xi ∼ N (µ, σ 2 ).
If $Y = \sum_{i=1}^{n} (X_i - \mu)^2/\sigma^2$, then $Y \sim \chi_n^2$.

Proof. This is obvious from the definition of the χ2 -distribution, since $\frac{X_i - \mu}{\sigma}$ is standard normal.

We also have the following result about independence:

Theorem 3.51. If X1 , . . . , Xn is a random sample from a normal population with the mean
µ and variance σ 2 , then X̄ and S 2 are independent.
A proof of this theorem can be found in Introduction to Mathematical Statistics and Its
Applications by Richard J. Larsen and Morris L. Marx (2014); see Appendix 7.A.2, Some Distribution
Results for Ȳ and S 2 .
Theorem 3.52. Let X1 , X2 , . . . , Xn be independent random variables with Xi ∼ N (µ, σ 2 ).
Define the sample variance as
n
2 1 X
S = (Xi − X̄)2 . (3.11)
n−1
i=1

42
Then
(n − 1)S 2
∼ χ2n−1 .
σ2
Proof. This is the most advanced proof of this module. We have
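$$\frac{(n-1)}{\sigma^2} S^2 = \frac{(n-1)}{\sigma^2} \cdot \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 = \frac{1}{\sigma^2} \sum_{i=1}^{n} \big( (X_i - \mu) - (\bar{X} - \mu) \big)^2$$
$$= \sum_{i=1}^{n} \frac{(X_i - \mu)^2}{\sigma^2} - 2 \sum_{i=1}^{n} \frac{(X_i - \mu)}{\sigma} \frac{(\bar{X} - \mu)}{\sigma} + \sum_{i=1}^{n} \frac{(\bar{X} - \mu)^2}{\sigma^2}. \tag{3.12}$$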

The middle term satisfies


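$$\sum_{i=1}^{n} \frac{(X_i - \mu)}{\sigma} \frac{(\bar{X} - \mu)}{\sigma} = \frac{(\bar{X} - \mu)}{\sigma} \sum_{i=1}^{n} \frac{(X_i - \mu)}{\sigma} = \frac{(\bar{X} - \mu)}{\sigma} \cdot \frac{\sum_{i=1}^{n} X_i - n\mu}{\sigma} = n \frac{(\bar{X} - \mu)^2}{\sigma^2}.$$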

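Plugging this into (3.12), and noting that the last sum in (3.12) equals $n(\bar{X} - \mu)^2/\sigma^2$, we get
$$\frac{(n-1)}{\sigma^2} S^2 = \sum_{i=1}^{n} \frac{(X_i - \mu)^2}{\sigma^2} - n \frac{(\bar{X} - \mu)^2}{\sigma^2}.$$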

Equivalently,
$$\frac{(n-1)}{\sigma^2} S^2 + n \frac{(\bar{X} - \mu)^2}{\sigma^2} = \sum_{i=1}^{n} \frac{(X_i - \mu)^2}{\sigma^2}.$$

Let
$$Y_1 = \frac{(n-1)}{\sigma^2} S^2, \qquad Y_2 = n \frac{(\bar{X} - \mu)^2}{\sigma^2}, \qquad Y_3 = \sum_{i=1}^{n} \frac{(X_i - \mu)^2}{\sigma^2}.$$

The variables Y1 and Y2 are independent by Theorem 3.51. By Theorem 1.22(ii)

MY1 (t)MY2 (t) = MY3 (t). (3.13)

Since $Y_3 \sim \chi^2_n$, we have $M_{Y_3}(t) = (1 - 2t)^{-n/2}$ (cf. Theorem 3.50 and formula (3.10)). Note
that $\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ is a standard normal random variable. Thus $Y_2 = \left( \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \right)^2 \sim \chi^2_1$. We conclude
$M_{Y_2}(t) = (1 - 2t)^{-1/2}$. From (3.13)

$$M_{Y_1}(t) = \frac{M_{Y_3}(t)}{M_{Y_2}(t)} = (1 - 2t)^{-(n-1)/2},$$

which is the moment-generating function of the $\chi^2$-distribution with $(n-1)$ degrees of freedom.
Since the moment-generating function uniquely identifies the distribution (Theorem 1.22(i))
we conclude that $\frac{(n-1)}{\sigma^2} S^2$ has the $\chi^2$-distribution with $(n-1)$ degrees of freedom.
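
The statement of Theorem 3.52 can be checked numerically. The following is a minimal simulation sketch, not part of the original notes: it draws many normal samples, computes $(n-1)S^2/\sigma^2$ for each, and compares the empirical mean and variance with $n-1$ and $2(n-1)$, the mean and variance of $\chi^2_{n-1}$. The function name, the seed and the parameter values are illustrative assumptions.

```python
import random
import statistics


def scaled_sample_variance(n, mu, sigma, rng):
    """Return (n-1) * S^2 / sigma^2 for one simulated normal sample of size n."""
    xs = [rng.gauss(mu, sigma) for _ in range(n)]
    return (n - 1) * statistics.variance(xs) / sigma**2


rng = random.Random(0)          # fixed seed, illustrative choice
n = 6                           # sample size, so n - 1 = 5 degrees of freedom
draws = [scaled_sample_variance(n, 2.0, 3.0, rng) for _ in range(20000)]

sim_mean = statistics.fmean(draws)     # should be close to n - 1
sim_var = statistics.variance(draws)   # should be close to 2 * (n - 1)
print(sim_mean, sim_var)
```

The simulated values do not depend on the choice of $\mu$ and $\sigma$, which is what makes $(n-1)S^2/\sigma^2$ usable as a pivot later on.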

If X ∼ χ2n , then from the χ2 table, we can read the values of χ2α,n (or χ2α (n)) such that

P (X > χ2α,n ) = α. (3.14)


χ2α,n is a (1 − α)-quantile of the χ2 -distribution with n degrees of freedom.
Exercise 3.53. Let the random variables $X_1, X_2, \dots, X_5$ be a random sample from an $N(5, 1)$ distribution.
Find a number $\varepsilon$ such that $P\left( \sum_{i=1}^{5} (X_i - 5)^2 \le \varepsilon \right) = 0.90$.

Solution. Since $\sum_{i=1}^{5} (X_i - 5)^2 \sim \chi^2_5$, we look in the table for the row with 5 degrees
of freedom (df), and for $\chi^2_{0.10}$ we obtain $\varepsilon = 9.23635$.

Alternatively, we can find the quantiles in MATLAB: eps=chi2inv(0.9, 5) or in R: eps<-qchisq(0.9, df=5)
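
The table value can also be sanity-checked without a statistics package. The sketch below (an illustration using only the Python standard library, with hypothetical helper names) simulates sums of five squared standard normals and estimates the probability of falling below $\varepsilon = 9.23635$.

```python
import random


def chi2_draw(df, rng):
    """One draw from chi-square with df degrees of freedom, as a sum of squared N(0,1)."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))


rng = random.Random(1)   # fixed seed, illustrative choice
eps = 9.23635            # table value from the solution above
trials = 50000
hits = sum(chi2_draw(5, rng) <= eps for _ in range(trials))
coverage = hits / trials  # Monte Carlo estimate of P(Y <= eps), expected near 0.90
print(coverage)
```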


Exercise 3.54. Let X1 , X2 , . . . , X10 be a random sample from a standard normal distribution.
Find the numbers a and b such that
$$P\left( a \le \sum_{i=1}^{10} X_i^2 \le b \right) = 0.95.$$

3.3.3 Student’s t-distribution


Suppose that a random sample $X_1, X_2, \dots, X_n$ is taken from a normally distributed population.
The objective is to draw an inference about $\mu$.
So far we mentioned that $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$. However, in practice we do not know $\sigma$, so
we will need to replace it with its estimator $S$ defined in (2.1). When using $\frac{\bar{X} - \mu}{S/\sqrt{n}}$, Student's
t-distribution becomes useful.³

Definition 3.55. If $Y$ and $Z$ are independent random variables, $Y$ has a $\chi^2_n$ distribution, and
$Z \sim N(0, 1)$, then
$$T = \frac{Z}{\sqrt{Y/n}}$$
is said to have a (Student) t-distribution with $n$ degrees of freedom. We denote this by $T \sim T_n$.
³ William Sealy Gosset - English statistician, chemist and brewer who served as Head Brewer of Guinness
and Head Experimental Brewer of Guinness - derived the formula for the pdf in 1908.
Ronald Aylmer Fisher presented a rigorous mathematical derivation of Gosset's density in 1924.

Figure 3.3: Density of the t-distribution

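If $X \sim T_n$, then the pdf is
$$f(x) = \frac{\Gamma\left( \frac{n+1}{2} \right)}{\sqrt{n\pi}\, \Gamma\left( \frac{n}{2} \right) \left( 1 + \frac{x^2}{n} \right)^{(n+1)/2}}, \qquad -\infty < x < \infty.$$

The pdf for $T_n$ is symmetric: $f_{T_n}(x) = f_{T_n}(-x)$, for all $x$. The expectation (for $n > 1$) and the variance (for $n > 2$) are
$$E[X] = 0, \qquad \mathrm{Var}(X) = \frac{n}{n-2}.$$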

Theorem 3.56. If $\bar{X}$ and $S^2$ are the mean and the variance of a random sample of size $n$
from a normal population with the mean $\mu$ and variance $\sigma^2$, then

$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$

has a t-distribution with $(n-1)$ degrees of freedom.

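Proof. By Theorem 3.16

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).$$
By Theorem 3.52
$$Y = \frac{(n-1)S^2}{\sigma^2} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (X_i - \bar{X})^2 \sim \chi^2_{n-1}.$$

Hence,
$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}} = \frac{\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}}{\sqrt{\frac{(n-1)S^2}{\sigma^2 (n-1)}}} = \frac{Z}{\sqrt{\frac{Y}{n-1}}}.$$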

Also, by Theorem 3.51 X̄ and S 2 are independent. Thus, Y and Z are independent, and by
definition, T follows a t-distribution with (n − 1) degrees of freedom.
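
A simulation can illustrate why the t-distribution, rather than the normal, is the right reference here. This is a sketch under stated assumptions: the sample size $n = 5$, the seed and the function name are illustrative, and 2.776 is the standard table value $t_{0.025,4}$. The code estimates how often $|T|$ exceeds the t critical value and how often it exceeds the normal value 1.96.

```python
import random
from math import sqrt
from statistics import fmean, stdev


def t_statistic(xs, mu):
    """Compute (x_bar - mu) / (s / sqrt(n)) for one sample."""
    return (fmean(xs) - mu) / (stdev(xs) / sqrt(len(xs)))


rng = random.Random(4)          # fixed seed, illustrative choice
n, mu, sigma = 5, 1.0, 2.0      # hypothetical population parameters
ts = [t_statistic([rng.gauss(mu, sigma) for _ in range(n)], mu)
      for _ in range(20000)]

tail_t = sum(abs(t) > 2.776 for t in ts) / len(ts)  # expected near 0.05
tail_z = sum(abs(t) > 1.96 for t in ts) / len(ts)   # noticeably larger than 0.05
print(tail_t, tail_z)
```

Using 1.96 with a small sample and an estimated standard deviation would therefore reject or exclude too often, which is the practical reason for the heavier tails of $T_{n-1}$.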

A conclusion from this Theorem is that if a random sample of size n is given, then the
corresponding degrees of freedom will be (n − 1).

Exercise 3.57. For the standard normal distribution, the critical values used for two-sided 95% and 99%
intervals are 1.96 and 2.58, respectively.
What are the corresponding values for the t-distribution if
(a) n = 4, (b) n = 12, (c) n = 30, (d) n = 40, (e) n = 100?

Solution. We find that:


95% 99%
tα,4
tα,12
tα,30
tα,40
tα,100

3.3.4 F -distribution
Definition 3.58. Suppose that $U \sim \chi^2_m$ and $V \sim \chi^2_n$ are independent random variables. A
random variable of the form
$$F = \frac{U/m}{V/n}$$
is said to have an F-distribution with $m$ and $n$ degrees of freedom. We denote this by $F \sim F(m, n)$.

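If $X \sim F(m, n)$, then the pdf is

$$f(x) = \begin{cases} \dfrac{\Gamma\left( \frac{m+n}{2} \right)}{\Gamma\left( \frac{m}{2} \right) \Gamma\left( \frac{n}{2} \right)} \left( \dfrac{m}{n} \right)^{m/2} x^{m/2 - 1} \left( 1 + \dfrac{m}{n} x \right)^{-\frac{m+n}{2}}, & x > 0, \\ 0, & \text{otherwise,} \end{cases}$$

and the expectation (for $n > 2$) and variance (for $n > 4$) are
$$E[X] = \frac{n}{n-2}, \qquad \mathrm{Var}(X) = \frac{2n^2 (m + n - 2)}{m (n-2)^2 (n-4)}.$$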
We denote the quantiles of the F-distribution by $F_\alpha(m, n)$. If we know $F_\alpha(m, n)$, it is possible
to find $F_{1-\alpha}(n, m)$ by using the identity

$$F_{1-\alpha}(n, m) = \frac{1}{F_\alpha(m, n)}. \tag{3.15}$$

The reason for studying the F distribution is that it allows us to study the ratio of sample
variances of two populations:

Figure 3.4: Density of the F -distribution

Theorem 3.59. Let two independent random samples of size $m$ and $n$ be drawn from two
normal populations with variances $\sigma_1^2$, $\sigma_2^2$, respectively. If the variances of the random samples
are given by $S_1^2$, $S_2^2$, respectively, then the statistic

$$F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} = \frac{S_1^2 \sigma_2^2}{S_2^2 \sigma_1^2}$$
has the F-distribution with $(m-1)$ and $(n-1)$ degrees of freedom.
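Proof. By Theorem 3.52,
$$U = \frac{(m-1)S_1^2}{\sigma_1^2} \sim \chi^2_{m-1} \quad \text{and} \quad V = \frac{(n-1)S_2^2}{\sigma_2^2} \sim \chi^2_{n-1},$$
and $U$ and $V$ are independent because the two samples are independent.
So, by definition,
$$F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} = \frac{U/(m-1)}{V/(n-1)} \sim F(m-1, n-1).$$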

Corollary 3.60. If $\sigma_1^2 = \sigma_2^2$, then

$$F = \frac{S_1^2}{S_2^2} \sim F(m-1, n-1).$$

Two populations, with σ12 = σ22 , are called homogeneous with respect to their variances.
Exercise 3.61. (i) Use the table for F-distributions to find the value of $x$ such that $P(0.109 < F_{4,6} < x) = 0.95$,
where $F_{4,6}$ denotes a random variable with the F-distribution with 4
and 6 degrees of freedom.

(ii) Use the table for F-distributions to find the value of $x$ such that $P(0.427 < F_{11,7} < 1.69) = x$,
where $F_{11,7}$ denotes a random variable with the F-distribution with 11 and 7 degrees of
freedom.
Solution. We have the following tables in Richard J. Larsen, Morris L. Marx An Introduction
to Mathematical Statistics and Its Applications, 5th Edition, page 704-717.

(i) We have $P(F_{4,6} < 0.109) = 0.025$. Thus $P(F_{4,6} < x) = P(0.109 < F_{4,6} < x) + P(F_{4,6} < 0.109) = 0.975$.
Thus $x \approx 6.23$.
(ii) P (0.427 < F11,7 < 1.69) = P (F11,7 < 1.69) − P (F11,7 < 0.427) ≈ 0.75 − 0.10 = 0.65.

Exercise 3.62. Let $S_1^2$ denote the sample variance for a random sample of size 10 from
Population I with normal pdf and let $S_2^2$ denote the sample variance for a random sample of
size 8 from Population II with normal pdf. The variance of Population I is assumed to be
three times the variance of Population II.
Find two numbers $a$ and $b$ such that
$$P\left( a < \frac{S_1^2}{S_2^2} < b \right) = 0.90$$
assuming $S_1^2$ to be independent of $S_2^2$.
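Solution. From the assumption $\sigma_1^2 = 3\sigma_2^2$, $n_1 = 10$, $n_2 = 8$, so
$$\frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} = \frac{S_1^2/(3\sigma_2^2)}{S_2^2/\sigma_2^2} = \frac{S_1^2}{3S_2^2} \sim F_{n_1-1,\, n_2-1} = F_{9,7}.$$

We look for $L$ and $U$ such that
$$P\left( L < \frac{S_1^2}{3S_2^2} < U \right) = 0.90. \tag{3.16}$$

From the tables we get $L = F_{0.05,(9,7)} \approx 0.304$ and $U = F_{0.95,(9,7)} \approx 3.68$, where the subscripts 0.05 and 0.95 are cumulative probabilities. Multiplying the inequality in (3.16) by 3 gives

$$P\left( 0.912 < \frac{S_1^2}{S_2^2} < 11.04 \right) = 0.90,$$

i.e. $a = 0.912$, $b = 11.04$.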

3.4 Interval Estimation


3.4.1 Definition
Point estimators do not give us any information about their precision. The true value of the
parameter will seldom be equal to the estimated value. The goal of interval estimation is to
calculate an interval, which contains the true value of the parameter with high probability.
For an interval estimator of a single parameter θ, we use the random sample to find two
quantities L and U with the following two properties:

• P (L ≤ θ ≤ U ) is high,

• the interval [L, U ] is relatively narrow.

Definition 3.63. The problem of confidence estimation is that of finding a family of
random sets $S(X)$ for a parameter $\theta$ such that for a given $\alpha$, $0 < \alpha < 1$,

$$P_\theta(\theta \in S(X)) \ge 1 - \alpha, \quad \text{for all } \theta \in \Theta.$$

In the confidence interval estimation, the sets S(X) are intervals: S(X) = [L(X), U (X)]. The
limits L and U are called the lower and the upper confidence limits, respectively.
The probability 1 − α that a confidence interval contains the true parameter θ is called the
confidence coefficient.

Example 3.64. Let’s say that we have an i.i.d sample X1 , . . . , Xn from the normal distribution
N (µ, σ 2 ) and the value of σ is known but µ is unknown and needs to be estimated. We find
a 95%-confidence interval for µ.

The estimator for the mean is $\hat{\mu} = \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$. Hence, according to Theorem 3.16,
$\bar{X} \sim N(\mu, \sigma^2/n)$ and $Z := \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$. We need two numbers $a, b \in \mathbb{R}$ such that
$P(a < Z < b) = 0.95$. These can't be easily calculated from the density or the cdf of the
Gaussian distribution. However, one can use statistical tables to pick $a$ and $b$. By convention,
we pick these values in a symmetric manner, that is we make sure $P(Z < a) \approx 0.025$ and
$P(Z > b) \approx 0.025$. This will make the confidence interval as narrow as possible.

Figure 3.5: Construction of the 95% confidence interval using the standard normal density

We see that $a = -1.96$ and $b = 1.96$. We have

$$P\left( -1.96 \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le 1.96 \right) = 0.95$$
or equivalently
$$P\left( \bar{X} - \frac{\sigma}{\sqrt{n}} 1.96 \le \mu \le \bar{X} + \frac{\sigma}{\sqrt{n}} 1.96 \right) = 0.95. \tag{3.17}$$
In practice (outside an exam) we'd use some software to find the quantiles of N (0, 1):
MATLAB: x = norminv([0.025 0.975])
R: x=qnorm(0.025) and x=qnorm(0.975)
EXCEL: =NORMINV(0.025,0,1) and =NORMINV(0.975,0,1)
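
For completeness, the same quantiles can be obtained in Python. The sketch below uses `statistics.NormalDist` from the standard library (available from Python 3.8); the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# Lower and upper cut-offs leaving 2.5% probability in each tail.
a = std_normal.inv_cdf(0.025)
b = std_normal.inv_cdf(0.975)
print(a, b)
```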

Example 3.65. Suppose that 6.5, 9.2, 9.9, 12.4 are realisations of a random sample
from $N(\mu, 0.8^2)$.
We construct a 95% confidence interval for $\mu$ using (3.17). Here $n = 4$, $\bar{x} = 9.5$ and $1.96 \cdot 0.8/\sqrt{4} = 0.784$, so the 95% confidence interval is

$$[8.72, 10.28].$$

The correct interpretation of confidence interval for the population mean is that if samples
of the same size, n, are drawn repeatedly from a population, and a confidence interval is
calculated from each sample, then 95% of these intervals should contain the population mean.
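
The repeated-sampling interpretation can be illustrated with a short simulation. The sketch below is an illustration, not part of the original notes: `z_interval` is a hypothetical helper implementing (3.17), it is first applied to the data of Example 3.65, and then to many simulated samples drawn with an assumed true mean of 10 to estimate how often the interval covers it.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def z_interval(xs, sigma, level=0.95):
    """Confidence interval for the mean of a normal sample with known sigma."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * sigma / sqrt(len(xs))
    centre = fmean(xs)
    return centre - half_width, centre + half_width


# Data from Example 3.65.
lo, hi = z_interval([6.5, 9.2, 9.9, 12.4], sigma=0.8)
print(round(lo, 2), round(hi, 2))

# Repeated sampling: the true mean (assumed 10.0 here) is fixed,
# the interval changes from sample to sample.
rng = random.Random(2)
true_mu, trials, covered = 10.0, 10000, 0
for _ in range(trials):
    sample = [rng.gauss(true_mu, 0.8) for _ in range(4)]
    a, b = z_interval(sample, 0.8)
    covered += a <= true_mu <= b
coverage = covered / trials  # expected to be close to 0.95
print(coverage)
```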

Figure 3.6: Interpretation of the confidence intervals

3.4.2 Pivots
Definition 3.66. Let X ∼ Pθ . A random variable T (X, θ) is known as a pivot if the distri-
bution of T (X, θ) does not depend on θ.

The pivotal method relies on our knowledge of sampling distributions.

The pivotal quantity should have the following two characteristics:

• It is a function of the random sample (a statistic or an estimator θ̂) and the unknown
parameter θ, where θ is the only unknown quantity, and

• It has a probability distribution that does not depend on the parameter θ.

Suppose that θ̂ = g(X) is a point estimate of θ, and let T (θ̂, θ) be the pivotal quantity.
Let a and b be constants with (a < b), such that

P (a ≤ T (θ̂, θ) ≤ b) = 1 − α

for a given value of α, (0 < α < 1).


Hence, if we resolve this inequality in terms of θ, the result will be a desired confidence interval.

In the previous example involving the normal distribution we had

$$T(\bar{X}, \mu) = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}, \qquad T \sim N(0, 1).$$

Note that $T(\bar{X}, \mu)$ depends on the estimator and the parameter but its distribution doesn't.
As we saw earlier (in Example 3.64), the inequality $a \le T(\bar{X}, \mu) \le b$ can be solved and gives
a confidence interval for $\mu$.

Theorem 3.67. Let T (X, θ) be a pivot such that for each θ, T (X, θ) is a statistic, and as a
function of θ, T is either strictly increasing or decreasing at each x ∈ R.
Let Λ ⊆ R be the range of T , and for every λ ∈ Λ and x ∈ R, let the equation λ = T (x, θ) be
solvable. Then one can construct a confidence interval for θ at any level.

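Proof. Let $0 < \alpha < 1$ and choose a pair of numbers $\lambda_1(\alpha), \lambda_2(\alpha) \in \Lambda$ such that

$$P_\theta(\lambda_1(\alpha) < T(X, \theta) < \lambda_2(\alpha)) \ge 1 - \alpha.$$

Since $T$ is monotone in $\theta$, the equations $T(x, \theta) = \lambda_1(\alpha)$ and $T(x, \theta) = \lambda_2(\alpha)$ can be solved uniquely for $\theta$ for every $x$. Writing $\underline{\theta}(X) < \bar{\theta}(X)$ for the two solutions, we obtain

$$P_\theta(\underline{\theta}(X) < \theta < \bar{\theta}(X)) \ge 1 - \alpha \quad \text{for all } \theta,$$

where $\underline{\theta}(X)$ and $\bar{\theta}(X)$ are random variables.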

The condition that λ = T (x, θ) be solvable will be satisfied if, for example, T is continuous
and strictly increasing or decreasing as a function of θ ∈ Θ.
Procedure for finding CI for θ using pivot

(i) Find an estimator θ̂ = g(X) of θ (usually MLE of θ works).

(ii) Find a function of θ̂ and θ, T (θ̂, θ) (pivot), such that the probability distribution of
T (θ̂, θ) does not depend on θ.

(iii) Find a < b such that P (a < T (θ̂, θ) < b) = 1 − α.


Choose $a$ and $b$ such that $P(T(\hat{\theta}, \theta) \le a) = \frac{\alpha}{2} = P(T(\hat{\theta}, \theta) \ge b)$.

(iv) Transform the pivot confidence interval to a confidence interval for the parameter θ:
P (L < θ < U ) = 1 − α, where L is the lower confidence limit and U is the upper
confidence limit.

Exercise 3.68. Suppose the random sample X1 , . . . , Xn has U (0, θ) distribution. Construct
a 90% confidence interval for θ. Identify the upper and lower confidence limits.

Figure 3.7: Construction of the 95% confidence interval for the fT (t) = ntn−1

Solution. The MLE estimator for θ is Y = max1≤i≤n Xi , see Example 3.12.


The random variable $Y$ has the pdf $f_Y(y) = n y^{n-1}/\theta^n$ for $0 \le y \le \theta$, which depends on the parameter
$\theta$, so $Y$ itself is not a pivot.
Consider instead $T = Y/\theta$. The cdf of $T$ is $F_T(t) = t^n$ for $0 \le t \le 1$, and the pdf of $T$
is $f_T(t) = n t^{n-1}$, $0 \le t \le 1$ (check!).
We want to find $a, b$ such that $P\left( a \le \frac{Y}{\theta} \le b \right) = 0.90$, i.e. $F_T(a) = 0.05$ and $F_T(b) = 0.95$.

We conclude that $a = \sqrt[n]{0.05}$ and $b = \sqrt[n]{0.95}$. The 90% confidence interval is:
$$P\left( \frac{Y}{\sqrt[n]{0.95}} \le \theta \le \frac{Y}{\sqrt[n]{0.05}} \right) = 0.90.$$
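
The pivotal construction for the uniform distribution can be tried out numerically. The following is a minimal sketch, assuming a true value $\theta = 4$ and samples of size 8 (both hypothetical); `uniform_theta_interval` is an illustrative name for the interval derived above.

```python
import random


def uniform_theta_interval(xs, level=0.90):
    """Interval [Y / b, Y / a] for theta in U(0, theta), using the pivot T = Y / theta."""
    n = len(xs)
    alpha = 1 - level
    y = max(xs)                        # MLE of theta
    a = (alpha / 2) ** (1 / n)         # F_T(a) = alpha / 2
    b = (1 - alpha / 2) ** (1 / n)     # F_T(b) = 1 - alpha / 2
    return y / b, y / a


rng = random.Random(3)                 # fixed seed, illustrative choice
theta, trials, covered = 4.0, 10000, 0
for _ in range(trials):
    sample = [rng.uniform(0, theta) for _ in range(8)]
    lower, upper = uniform_theta_interval(sample)
    covered += lower <= theta <= upper
coverage = covered / trials            # expected to be close to 0.90
print(coverage)
```

Note that the lower limit is always above the sample maximum, as it must be, since $\theta$ cannot be smaller than any observation.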

3.4.3 Confidence intervals based on the normal distribution


Let $X_1, X_2, \dots, X_n$ be a sample from the Gaussian distribution $N(\mu, \sigma^2)$. Assume that the
standard deviation is known. Let $z_\alpha$ be a $(1 - \alpha)$-quantile of $N(0, 1)$ (see Definition 1.14),
that is
$$Z \sim N(0, 1) \implies P(Z > z_\alpha) = \alpha.$$
We construct a confidence interval for the mean. According to Theorem 3.16, $Z := \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$. We have
$$P(|Z| \ge z_{\alpha/2}) = \alpha.$$
For example when $\alpha = 5\%$, then $z_{\alpha/2} = 1.96$. Equivalently, we have

$$P(|Z| \le z_{\alpha/2}) = 1 - \alpha.$$
Since $\left| \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \right| \le z_{\alpha/2} \iff \bar{X} - \frac{\sigma}{\sqrt{n}} z_{\alpha/2} \le \mu \le \bar{X} + \frac{\sigma}{\sqrt{n}} z_{\alpha/2}$, we get
$$P\left( \bar{X} - \frac{\sigma}{\sqrt{n}} z_{\alpha/2} \le \mu \le \bar{X} + \frac{\sigma}{\sqrt{n}} z_{\alpha/2} \right) = 1 - \alpha.$$
The $(1 - \alpha)$-confidence interval for $\mu$ is thus $\left[ \bar{X} - \frac{\sigma}{\sqrt{n}} z_{\alpha/2},\ \bar{X} + \frac{\sigma}{\sqrt{n}} z_{\alpha/2} \right]$.
Due to the Central Limit Theorem (Theorem 3.42) this method can also be used when the
sample is not Gaussian but the sample size $n$ is large. Also, typically the standard deviation
is unknown, but the sample standard deviation $S$ is a good enough replacement (when the sample
size is large).

Exercise 3.69. Hemoglobin levels in 11-year old boys have a normal distribution with un-
known mean µ and σ = 1.209 g/dl. Suppose that a random sample of size 10 has the sample
mean 12 g/dl. Find the 90% confidence interval for µ.

Exercise 3.70. Let $\bar{X}$ be the mean of a random sample of size $n$ from a distribution that is
$N(\mu, 9)$. Find $n$ such that $P(\bar{X} - 1 < \mu < \bar{X} + 1) = 0.90$, approximately.

Exercise 3.71. Let a random sample of size 17 from the normal distribution N (µ, σ 2 ) yield
x̄ = 4.7 and σ 2 = 5.76. Determine a 90% confidence interval for µ.

3.4.4 Confidence intervals based on the t-distribution


Let $X_1, X_2, \dots, X_n$ be a sample from the Gaussian distribution $N(\mu, \sigma^2)$ with unknown mean
$\mu$ and unknown variance $\sigma^2$. Recall that $\bar{X}$ is the sample mean and $S^2$ is the sample variance.
By Theorem 3.56
$$\frac{\bar{X} - \mu}{S/\sqrt{n}} \sim T_{n-1}.$$
This allows us to justify confidence intervals for the population mean when the true variance is
unknown:
$$P\left( -t_{\alpha/2, n-1} \le \frac{\bar{X} - \mu}{S/\sqrt{n}} \le t_{\alpha/2, n-1} \right) = 1 - \alpha.$$
Equivalently:
$$P\left( \bar{X} - t_{\alpha/2, n-1} \cdot S/\sqrt{n} \le \mu \le \bar{X} + t_{\alpha/2, n-1} \cdot S/\sqrt{n} \right) = 1 - \alpha.$$

In the formulas above, $t_{\alpha/2, n-1}$ is a $(1 - \frac{\alpha}{2})$-quantile of the t-distribution with $n - 1$ degrees
of freedom.

Example 3.72. A manufacturer wants to estimate the reaction time of fuses under 20%
overload. To run the test, a random sample of 20 fuses was subjected to a 20% overload,
and it was found that the times it took them to blow had a mean of 10.4 minutes and
a sample standard deviation of 1.6 minutes. It can be assumed that the data constitute a
random sample from a normal population.

(i) Construct a 95% confidence interval to estimate the mean reaction time.

(ii) Construct a one-tailed (lower) 95% confidence interval to estimate the mean reaction time.

Solution. (i) Let $X_1, X_2, \dots, X_{20}$ represent the random sample of 20 fuses. We use

$$P\left( \bar{X} - t_{\alpha/2, n-1} \cdot S/\sqrt{n} \le \mu \le \bar{X} + t_{\alpha/2, n-1} \cdot S/\sqrt{n} \right) = 1 - \alpha.$$

We are given $\bar{X} = 10.4$ and $S = 1.6$. From the table for the Student t-distribution, we obtain
$t_{\alpha/2, n-1} = t_{0.025, 19} = 2.093$. Hence,
$$P\left( 10.4 - 2.093 \cdot 1.6/\sqrt{20} \le \mu \le 10.4 + 2.093 \cdot 1.6/\sqrt{20} \right) = 0.95$$

i.e.
$$P(9.65 \le \mu \le 11.15) = 0.95.$$
Hence, the 95% confidence interval for the mean fuse operating time is (9.65, 11.15).

(ii) For the one-tailed interval, we need to consider the upper boundary, because the reaction
time should be as short as possible. The confidence interval is:
$$P\left( \mu \le \bar{X} + t_{\alpha, n-1} \cdot S/\sqrt{n} \right) = 1 - \alpha.$$

$t_{\alpha, n-1} = t_{0.05, 19} = 1.73$, so

$$P\left( \mu \le 10.4 + 1.73 \cdot 1.6/\sqrt{20} \right) = P(\mu \le 11.02) = 0.95.$$
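
The arithmetic of this example is easy to reproduce. The sketch below is an illustration only: the Python standard library has no t-quantile function, so the critical values 2.093 and 1.73 are passed in as read from the t-table, and `t_interval` is a hypothetical helper name.

```python
from math import sqrt


def t_interval(mean, s, n, t_crit):
    """Two-sided interval mean +/- t_crit * s / sqrt(n); t_crit comes from a t-table."""
    half_width = t_crit * s / sqrt(n)
    return mean - half_width, mean + half_width


# Part (i): two-sided 95% interval with t_{0.025,19} = 2.093.
lo, hi = t_interval(10.4, 1.6, 20, t_crit=2.093)

# Part (ii): one-sided upper limit with t_{0.05,19} = 1.73.
upper = 10.4 + 1.73 * 1.6 / sqrt(20)

print(round(lo, 2), round(hi, 2), round(upper, 2))
```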

3.4.5 Confidence intervals for variance


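Let $\chi^2_{\alpha/2, n-1}$ and $\chi^2_{1-\alpha/2, n-1}$ be quantiles of the $\chi^2$ distribution with $n - 1$ degrees of freedom:

$$P(Y > \chi^2_{\alpha/2, n-1}) = \frac{\alpha}{2}, \qquad P(Y < \chi^2_{1-\alpha/2, n-1}) = \frac{\alpha}{2} \qquad \text{for } Y \sim \chi^2_{n-1}.$$
Suppose that the population is normal. As $(n-1)S^2/\sigma^2 \sim \chi^2_{n-1}$ we have

$$P\left( \chi^2_{1-\alpha/2, n-1} < \frac{(n-1)S^2}{\sigma^2} < \chi^2_{\alpha/2, n-1} \right) = 1 - \alpha.$$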

The procedure to find confidence interval for σ 2 :

(i) Calculate x̄ and s2 from the sample x1 , . . . , xn .

(ii) Find $L = \chi^2_{1-\alpha/2, n-1}$ and $U = \chi^2_{\alpha/2, n-1}$ using the $\chi^2$ table with $(n-1)$ degrees of
freedom.

(iii) Compute the $(1-\alpha)100\%$ confidence interval for the population variance $\sigma^2$ as $\left( \frac{(n-1)s^2}{\chi^2_{\alpha/2}}, \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}} \right)$,
where the $\chi^2$-values are with $(n-1)$ degrees of freedom.

Exercise 3.73. Suppose we have an independent random sample X1 , . . . , X10 from N (µ, σ 2 )
and we want to find a 95% confidence interval for σ 2 .

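Solution. As $(n-1)S^2/\sigma^2 \sim \chi^2_{n-1}$, the pivot is $T = 9S^2/\sigma^2 \sim \chi^2_9$. Recall that the appropriate
notation for the quantiles of the $\chi^2$-distribution was introduced in (3.14). Using the $\chi^2$ table, we
find the lower bound $L = \chi^2_{1-\alpha/2, n-1} = \chi^2_{0.975, 9}$ and the upper bound $U = \chi^2_{\alpha/2, n-1} = \chi^2_{0.025, 9}$: $L = 2.70$, $U = 19.02$,
then
$$P(L < T < U) = P\left( L < \frac{(n-1)S^2}{\sigma^2} < U \right) = 1 - \alpha,$$
$$P\left( \frac{(n-1)S^2}{U} < \sigma^2 < \frac{(n-1)S^2}{L} \right) = 1 - \alpha,$$
$$P\left( \frac{9S^2}{19.02} < \sigma^2 < \frac{9S^2}{2.70} \right) = 0.95.$$
The confidence interval is
$$\left( \frac{9S^2}{19.02}, \frac{9S^2}{2.70} \right).$$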
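
To see the interval with actual numbers, the sketch below evaluates it for a hypothetical observed sample variance $s^2 = 4.0$ (an assumption made purely for illustration; the exercise gives no data). The $\chi^2$ values 2.70 and 19.02 are the table values used in the solution, and `variance_interval` is an illustrative name.

```python
def variance_interval(s2, n, chi2_low, chi2_high):
    """Interval ((n-1)s^2 / chi2_high, (n-1)s^2 / chi2_low) for sigma^2.

    chi2_low and chi2_high are the lower and upper chi-square table values
    with n - 1 degrees of freedom.
    """
    return (n - 1) * s2 / chi2_high, (n - 1) * s2 / chi2_low


# Hypothetical observed sample variance; n = 10 as in Exercise 3.73.
lo, hi = variance_interval(s2=4.0, n=10, chi2_low=2.70, chi2_high=19.02)
print(round(lo, 3), round(hi, 3))
```

The interval is not symmetric around $s^2$, because the $\chi^2$ distribution is skewed.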

3.4.6 Confidence intervals for binomial random variables


A population proportion is the proportion (percentage) of a population that has a specified
attribute.
Confidence intervals for proportions are very common in statistical studies as researchers
often want to investigate the presence or lack of a certain feature in individuals. For example, in
clinical trials the researcher investigates whether a vaccine is effective or ineffective. A social researcher
may wish to investigate if voters support party A or party B. In each case they use a sample
of Bernoulli random variables to measure a proportion.
Let $X_1, \dots, X_n$ be independent Bernoulli random variables with probability of success $p$. Then $X =
X_1 + \dots + X_n$ represents the number of successes in $n$ independent trials and follows the Binomial
distribution, where $n$ is large and $p = P(\text{success})$ is unknown.
Recall that $\hat{p} = X/n$ is an MLE for $p$, $E[X/n] = p$ and $\mathrm{Var}(X/n) = p(1-p)/n$. The random
variable $\frac{X/n - p}{\sqrt{p(1-p)}/\sqrt{n}} = \frac{X - np}{\sqrt{np(1-p)}}$ can be approximated by the standard normal distribution
according to the Central Limit Theorem (Theorem 3.42).
We approximate the $p$ in the variance by $\hat{p}$ and we get that approximately
$$\frac{X - np}{\sqrt{np(1-p)}} \approx \frac{X/n - p}{\sqrt{(X/n)(1 - (X/n))/n}} \sim N(0, 1).$$
Thus we use the quantiles of the normal distribution and obtain an approximate confidence interval:
$$P\left( -z_{\alpha/2} \le \frac{X/n - p}{\sqrt{(X/n)(1 - (X/n))/n}} \le z_{\alpha/2} \right) \approx 1 - \alpha.$$
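
Solving the inequality above for $p$ gives the interval $\hat{p} \pm z_{\alpha/2} \sqrt{\hat{p}(1-\hat{p})/n}$. A minimal sketch of this computation follows; the data (120 successes in 400 trials) are hypothetical and the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(x, n, level=0.95):
    """Normal-approximation interval for a binomial proportion (large n assumed)."""
    p_hat = x / n
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


# Hypothetical data: 120 successes in 400 trials.
lo, hi = proportion_interval(120, 400)
print(round(lo, 4), round(hi, 4))
```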

Symmetric confidence intervals can be represented in the form

point estimate ± margin of error.
Definition 3.74. The margin of error $E$ for the estimate of $\mu$ is

$$E = z_{\alpha/2} \cdot \sigma/\sqrt{n}. \tag{3.18}$$

The margin of error for the estimate of a population mean indicates the accuracy with
which a sample mean estimates the unknown population mean.
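Let $d$ be the width of a $100(1-\alpha)\%$ confidence interval for the true proportion $p$. Then
$$d = \left( \frac{X}{n} + z_{\alpha/2} \sqrt{\frac{(X/n)(1 - (X/n))}{n}} \right) - \left( \frac{X}{n} - z_{\alpha/2} \sqrt{\frac{(X/n)(1 - (X/n))}{n}} \right)$$
$$= 2 z_{\alpha/2} \sqrt{\frac{(X/n)(1 - (X/n))}{n}} \le 2 z_{\alpha/2} \sqrt{\frac{1}{4n}} = \frac{z_{\alpha/2}}{\sqrt{n}},$$
since $\hat{p}(1 - \hat{p}) \le \frac{1}{4}$ for every $\hat{p} \in [0, 1]$.
Definition 3.75. The margin of error for confidence level $100(1-\alpha)\%$ associated with an
estimate $\frac{X}{n}$, where $X$ is the number of successes in $n$ independent trials and $p$ is unknown, is
$$E = \frac{z_{\alpha/2}}{2\sqrt{n}}.$$

If we can estimate that the true value of $p$ is greater than $\frac{1}{2}$ (or less than $\frac{1}{2}$), with $p_g$ denoting our guess for $p$, the margin of
error for confidence level $100(1-\alpha)\%$ associated with an estimate $\frac{X}{n}$ is
$$E = \frac{z_{\alpha/2} \sqrt{p_g (1 - p_g)}}{\sqrt{n}}.$$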

3.4.7 Sample Size


When designing a statistical study, we need to decide what the sample size should be, for example
how many people should be invited to participate in a clinical trial. Solving equation (3.18)
for $n$ we get the sample size needed to construct a confidence interval of a prescribed width.
The minimal sample size required for estimation of the population mean $\mu$ at level $(1 - \alpha)$
with margin of error $E$ is given by
$$n = \frac{z_{\alpha/2}^2 \sigma^2}{E^2},$$
rounded up to the next integer if this is not an integer.
For Bernoulli trials we have $\sigma^2 = p(1-p) \le \frac{1}{4}$. Thus, a $(1-\alpha)$-level confidence interval for
a population proportion $p$, if $p$ is unknown, that has a margin of error of at most $E$ can be
obtained by choosing
$$n = \frac{z_{\alpha/2}^2}{4E^2}$$
rounded up to the next integer. Or, if we can make an 'educated guess' about the value of $p$ (let's
say it is at most equal to a value $p_g$), then the smallest sample size required is
$$n = \frac{z_{\alpha/2}^2}{E^2}\, p_g (1 - p_g).$$
Exercise 3.76. The Bureau of Labor Statistics collects information on the ages of people in
the civilian labor force and publishes the results in Current Population Survey.
(i) Determine the sample size needed in order to be 95% confident that $\mu$
(mean age of all people in the civilian labor force) is within 0.5 year of the point estimate
$\bar{X}$, assuming that $\sigma = 12.1$ years.

(ii) Find a 95% confidence interval for $\mu$ if a sample of the size determined in part (i) has a
mean age of 43.8 years.
Solution. (i) We have $E = 0.5$, and for a 95% confidence interval $\alpha = 0.05$ and $z_{\alpha/2} = 1.96$, so
$$n \ge \frac{z_{\alpha/2}^2 \sigma^2}{E^2} = \frac{1.96^2 \cdot 12.1^2}{0.5^2} \approx 2249.8.$$
Hence, $n = 2250$.

(ii) $\bar{X} = 43.8$, hence the 95% CI is

$$\left[ 43.8 - 1.96 \frac{12.1}{\sqrt{2250}},\ 43.8 + 1.96 \frac{12.1}{\sqrt{2250}} \right] = [43.3, 44.3].$$
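
The sample-size formulas of this section can be wrapped in two small functions. The sketch below is illustrative (function names and the 3% margin in the second call are assumptions); it uses the exact normal quantile rather than the rounded 1.96, which gives the same rounded-up answer for Exercise 3.76.

```python
from math import ceil
from statistics import NormalDist


def sample_size_mean(sigma, margin, level=0.95):
    """Smallest n with z_{alpha/2} * sigma / sqrt(n) <= margin."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return ceil((z * sigma / margin) ** 2)


def sample_size_proportion(margin, level=0.95, p_guess=0.5):
    """Smallest n for a proportion; p_guess = 0.5 is the worst case p(1-p) = 1/4."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return ceil(z**2 * p_guess * (1 - p_guess) / margin**2)


n_mean = sample_size_mean(sigma=12.1, margin=0.5)   # Exercise 3.76(i)
n_prop = sample_size_proportion(margin=0.03)        # hypothetical 3% margin
print(n_mean, n_prop)
```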
Exercise 3.77. Hemoglobin levels in 11-year old boys have a normal distribution with un-
known mean µ and σ = 1.209 g/dl. How large a sample is needed to estimate µ with 95%
confidence and a margin of error of 0.5?

3.4.8 Interval Estimation for two population means


Often we want to compare the parameters in two populations, for example we may want to
compare the blood sugar level of a group of patients that were given a new drug with a control
group that was given a placebo.
Suppose that two random samples are selected from the two populations and that the random
variables (and observations) are independent both within each sample
and between the two samples.
Let X1,1 , . . . , X1,n1 be a random sample from a distribution with mean µ1 and variance σ12 ,
and let X2,1 , . . . , X2,n2 be a random sample from a distribution with mean µ2 and variance
σ22 . In this section we will obtain an estimate or a confidence interval for µ1 − µ2 . The answer
to this question depends on whether the samples are Gaussian and whether the sample size is
large.

Large sample - normal distribution
Let $X_{1,1}, \dots, X_{1,n_1}$ be a random sample from a normal distribution $N(\mu_1, \sigma_1^2)$, and let $X_{2,1}, \dots, X_{2,n_2}$
be a random sample from a normal distribution $N(\mu_2, \sigma_2^2)$.
Let $\bar{X}_1 = \frac{1}{n_1} \sum_{i=1}^{n_1} X_{1,i}$ and $\bar{X}_2 = \frac{1}{n_2} \sum_{i=1}^{n_2} X_{2,i}$.
As we assume that the two samples are independent, the averages $\bar{X}_1$ and $\bar{X}_2$ are also
independent, and the distribution of $\bar{X}_1 - \bar{X}_2$ is $N\left( \mu_1 - \mu_2, \frac{1}{n_1} \sigma_1^2 + \frac{1}{n_2} \sigma_2^2 \right)$. Thus, similarly to
the one-sample case, the $(1-\alpha)$-confidence interval for $\mu_1 - \mu_2$ is:
$$\bar{X}_1 - \bar{X}_2 \pm z_{\alpha/2} \sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}.$$

To apply this formula we need to know $\sigma_1$ and $\sigma_2$. If $\sigma_1^2$ and $\sigma_2^2$ are unknown, then $\sigma_1$
and $\sigma_2$ can be replaced by the respective sample standard deviations $S_1$ and $S_2$, provided that the
samples are large (say $n_1, n_2 \ge 30$). In such a case the confidence interval is:
$$\bar{X}_1 - \bar{X}_2 \pm z_{\alpha/2} \sqrt{S_1^2/n_1 + S_2^2/n_2}.$$

For large samples, by the Central Limit Theorem, this formula may be applied even if the
samples are not Gaussian.

Small samples - t-distribution


Assume that the samples are from normal populations and are independent.
If the two populations have unknown variances, but we can assume that $\sigma_1^2 = \sigma_2^2 = \sigma^2$,
then an estimate of the common variance can be obtained by pooling the sample variances of the two
samples.

Definition 3.78. The pooled sample variance $S_p^2$ is

$$S_p^2 = \frac{\sum_{i=1}^{n_1} (X_{1,i} - \bar{X}_1)^2 + \sum_{i=1}^{n_2} (X_{2,i} - \bar{X}_2)^2}{n_1 + n_2 - 2} = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}.$$

It can be shown that

$$T = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t_{n_1 + n_2 - 2}$$

(has a t-distribution with $n_1 + n_2 - 2$ degrees of freedom).

Therefore, if the two samples are independent and are from two normal populations with equal
variances, the confidence interval for $\mu_1 - \mu_2$ is determined by

$$P(-t_{\alpha/2, n_1+n_2-2} < T < t_{\alpha/2, n_1+n_2-2}) = 1 - \alpha.$$

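Rearranging for $\mu_1 - \mu_2$ we get
$$P\left( (\bar{X}_1 - \bar{X}_2) - t_{\alpha/2, n_1+n_2-2} \cdot S_p \sqrt{\tfrac{1}{n_1} + \tfrac{1}{n_2}} < \mu_1 - \mu_2 < (\bar{X}_1 - \bar{X}_2) + t_{\alpha/2, n_1+n_2-2} \cdot S_p \sqrt{\tfrac{1}{n_1} + \tfrac{1}{n_2}} \right) = 1 - \alpha.$$

Thus the confidence interval is

$$\bar{X}_1 - \bar{X}_2 \pm t_{\alpha/2, n_1+n_2-2} \cdot S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}.$$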

Exercise 3.79. Independent random samples from two normal populations with equal vari-
ances produced the following data.
Sample 1 : 1.2, 3.1, 1.7, 2.8, 3.0
Sample 2 : 4.2, 2.7, 3.6, 3.9

(i) Calculate the pooled estimate of σ 2 .

(ii) Obtain a 90% confidence interval for µ1 − µ2 .

Solution. (i) We have $n_1 = 5$, $n_2 = 4$. Also:

$$\bar{x}_1 = 2.36, \quad s_1^2 = 0.733,$$

$$\bar{x}_2 = 3.6, \quad s_2^2 = 0.420.$$

Hence,
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = 0.599.$$

(ii) For the confidence coefficient 0.90, $\alpha = 0.10$ and from the t-table, $t_{0.05,7} = 1.895$. Thus,
a 90% confidence interval for $\mu_1 - \mu_2$ is
$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2, n_1+n_2-2} \cdot s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} = (2.36 - 3.6) \pm 1.895 \cdot \sqrt{0.599 \left( \frac{1}{5} + \frac{1}{4} \right)} = -1.24 \pm 0.98$$

or $(-2.22, -0.26)$.
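
The numbers in this solution can be reproduced directly from the raw data. The sketch below is an illustration: `pooled_interval` is a hypothetical helper, and the critical value 1.895 is taken from the t-table as in the solution, since the standard library offers no t-quantiles.

```python
from math import sqrt
from statistics import fmean, variance


def pooled_interval(xs, ys, t_crit):
    """Pooled variance and interval for mu1 - mu2 (equal variances assumed)."""
    n1, n2 = len(xs), len(ys)
    sp2 = ((n1 - 1) * variance(xs) + (n2 - 1) * variance(ys)) / (n1 + n2 - 2)
    half_width = t_crit * sqrt(sp2 * (1 / n1 + 1 / n2))
    diff = fmean(xs) - fmean(ys)
    return sp2, diff - half_width, diff + half_width


sample1 = [1.2, 3.1, 1.7, 2.8, 3.0]
sample2 = [4.2, 2.7, 3.6, 3.9]
sp2, lo, hi = pooled_interval(sample1, sample2, t_crit=1.895)  # t_{0.05,7}
print(round(sp2, 3), round(lo, 2), round(hi, 2))
```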

Small samples with different variances
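If the equality of the variances cannot be reasonably assumed, $\sigma_1^2 \ne \sigma_2^2$, the previous procedure
still can be used, except that the pivot random variable is
$$T = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}} \sim t_\nu \quad \text{(approximately)},$$

where $\nu$ is the number of degrees of freedom defined as (rounded down):

$$\nu = \frac{\left( \frac{S_1^2}{n_1} + \frac{S_2^2}{n_2} \right)^2}{\frac{\left( S_1^2/n_1 \right)^2}{n_1 - 1} + \frac{\left( S_2^2/n_2 \right)^2}{n_2 - 1}}.$$
The confidence interval is
$$\bar{X}_1 - \bar{X}_2 \pm t_{\alpha/2, \nu} \cdot \sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}.$$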
Exercise 3.80. Assume that the two populations are normally distributed with unknown
and unequal variances. Two independent samples are taken with the following summary
statistics:
$n_1 = 16$, $\bar{x}_1 = 20.17$, $s_1 = 4.3$
$n_2 = 11$, $\bar{x}_2 = 19.23$, $s_2 = 3.8$
Construct a 95% confidence interval for $\mu_1 - \mu_2$.
Solution. $\nu = 23.312$, which is rounded down to 23, so $t_{0.025, 23} = 2.069$; the 95% confidence interval for $\mu_1 - \mu_2$ is

$$-2.3106 < \mu_1 - \mu_2 < 4.1906.$$
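
The degrees-of-freedom formula is the step most prone to arithmetic slips, so a short check may help. In this sketch `welch_df` is an illustrative name, and the critical value 2.069 is the table value $t_{0.025,23}$ (an input taken from the t-table, not computed by the code).

```python
from math import floor, sqrt


def welch_df(s1, n1, s2, n2):
    """Approximate degrees of freedom for unequal variances (before rounding down)."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))


nu = welch_df(4.3, 16, 3.8, 11)
t_crit = 2.069                       # table value t_{0.025, 23}
half_width = t_crit * sqrt(4.3**2 / 16 + 3.8**2 / 11)
diff = 20.17 - 19.23
lo, hi = diff - half_width, diff + half_width
print(round(nu, 3), floor(nu), round(lo, 4), round(hi, 4))
```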

Proportions
Let $X_1$ and $X_2$ denote the numbers of successes observed in two independent sets of $n_1$ and
$n_2$ Bernoulli trials, respectively, where $p_1$ and $p_2$ are the true success probabilities associated
with each set of trials. According to the Central Limit Theorem, the distributions of $X_1$ and
$X_2$ can be approximated by the Gaussian distribution and thus, approximately,
$$T = \frac{(X_1/n_1 - X_2/n_2) - (p_1 - p_2)}{\sqrt{\frac{(X_1/n_1)(1 - X_1/n_1)}{n_1} + \frac{(X_2/n_2)(1 - X_2/n_2)}{n_2}}} \sim N(0, 1)$$

and the confidence interval for $p_1 - p_2$ is

$$(X_1/n_1 - X_2/n_2) \pm z_{\alpha/2} \cdot \sqrt{\frac{(X_1/n_1)(1 - X_1/n_1)}{n_1} + \frac{(X_2/n_2)(1 - X_2/n_2)}{n_2}}.$$

The approximation is valid provided that the two samples are independent, large, and $(X_i/n_i) n_i > 5$
and $(1 - X_i/n_i) n_i > 5$ for $i = 1, 2$.

Exercise 3.81. The phenomenon of handedness has been extensively studied in human pop-
ulations. The percentages of adults who are right-handed, left-handed, and ambidextrous are
well documented. What is not so well known is that a similar phenomenon is present in lower
animals. Dogs, for example, can be either right-pawed or left-pawed.
Suppose that in a random sample of 200 beagles, it is found that 55 are left-pawed and
that in a random sample of 200 collies, 40 are left-pawed.
Obtain a 95% confidence interval for p1 − p2 .

Solution. Let $X_1$ be the number of left-pawed beagles in the sample of size $n_1 = 200$, and
$X_2$ be the number of left-pawed collies in the sample of size $n_2 = 200$:

$$X_1/n_1 = 55/200 = 0.275, \qquad X_2/n_2 = 40/200 = 0.2.$$

The requirements are satisfied, hence we can use the approximation by the standard normal distribution,
so the 95% confidence interval for $p_1 - p_2$ is:
$$(X_1/n_1 - X_2/n_2) \pm z_{\alpha/2} \cdot \sqrt{\frac{(X_1/n_1)(1 - X_1/n_1)}{n_1} + \frac{(X_2/n_2)(1 - X_2/n_2)}{n_2}}$$
$$= 0.275 - 0.2 \pm 1.96 \sqrt{\frac{0.275 \cdot 0.725}{200} + \frac{0.2 \cdot 0.8}{200}}$$
$$= 0.075 \pm 0.083.$$
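
The same computation can be sketched in code. The function name below is an illustrative assumption; the inputs are the counts from Exercise 3.81.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_interval(x1, n1, x2, n2, level=0.95):
    """Normal-approximation interval for p1 - p2 from two independent samples."""
    p1, p2 = x1 / n1, x2 / n2
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return p1 - p2 - half_width, p1 - p2 + half_width


# Beagles: 55 of 200 left-pawed; collies: 40 of 200 left-pawed.
lo, hi = two_proportion_interval(55, 200, 40, 200)
print(round(lo, 3), round(hi, 3))
```

The resulting interval contains 0, so at this confidence level the data do not rule out equal proportions of left-pawed dogs in the two breeds.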

Chapter 4

Hypothesis testing

Let’s motivate hypothesis testing with the following example: a car manufacturer is looking
for additives that might increase car’s performance. As a pilot study, they send thirty cars
fuelled with a new additive on a road. Without the additive, those same cars are known to
have average fuel consumption µ0 = 25.0 mpg with a standard deviation of σ0 = 2.4 mpg.
The fuel consumption is assumed to be normally distributed.
Suppose it turns out that the thirty cars average x̄ = 26.3 mpg with the additive. Can the
company claim that the additive has significant effect on mileage increase?
To formalize this setting, we may say that the company wants to collect a sample X1 , . . . , Xn ∼
N (µ, σ02 ) to compare two claims:
(i) H0 : the cars with the new additive have average fuel consumption equal to µ = 25.0
mpg i.e. there is no improvement, with
(ii) H1 : the cars with new additive have mileage higher than 25.0 i.e. there is an improve-
ment.
These can be written mathematically as:
(i) H0 : µ = 25.0,
(ii) H1 : µ > 25.0.
It seems natural to use $Z = \frac{\bar{X} - 25.0}{2.4/\sqrt{n}}$ and reject $H_0$ in favour of $H_1$ if we get a value of $Z$
larger than some threshold value $z$.
Let us now introduce this framework formally.

4.1 Definitions
4.1.1 Hypotheses, test statistic, rejection region
A statistical test consists of the following 4 components:

(i) The null hypothesis, denoted by H0, is usually the nullification of a claim. Unless evidence
from the data indicates otherwise, the null hypothesis is assumed to be true.

(ii) The alternative hypothesis, denoted by HA (or sometimes denoted by H1), is the claim
the truth (or validity) of which needs to be shown.

(iii) The test statistic, denoted by TS, is a function of the sample measurements upon which
the statistical decision, to reject or not reject the null hypothesis, will be based.

(iv) A rejection region (or a critical region) is the region (denoted by RR) that specifies the
values of the observed test statistic for which the null hypothesis will be rejected. This
is the range of values of the test statistic that corresponds to the rejection of H0 at some
fixed level of significance α.

Having specified those and carried out the calculations, we reach a conclusion, that is
an answer to the question posed at the beginning of the whole process. If the value of the
observed test statistic falls in the rejection region, the null hypothesis is rejected and we will
conclude that there is enough evidence to decide that the alternative hypothesis is true. If the
test statistic does not fall in the rejection region, we conclude that we cannot reject the null
hypothesis.
Failure to reject the null hypothesis does not necessarily mean that the null hypothesis is
true.

4.1.2 Z-test for sample mean


Let $X_1, X_2, \dots, X_n$ be from the Normal Distribution $N(\mu, \sigma^2)$ with known value of $\sigma$. The test
hypotheses are:
$$H_0: \mu = \mu_0, \qquad H_A: \mu \ne \mu_0.$$
The test statistic and its distribution given the null hypothesis are

$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1).$$
Let us denote the calculated value of the statistic by $z_{TS} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$. Given a significance level
$\alpha$ we find a rejection region using the quantile of the standard normal distribution $z_{\alpha/2}$:

$$RR = \{|Z| \ge z_{\alpha/2}\}.$$

Conclusion: If $z_{TS} \le -z_{\alpha/2}$ or $z_{TS} \ge z_{\alpha/2}$ reject $H_0$, otherwise do not reject.


The test in which the alternative hypothesis involves the ‘6=’ sign are called two-tailed.
When conducting a two-tailed test we are checking simply for a deviation of the mean from
the value assumed in the null hypothesis (in either direction). If we want to assert that the
mean is higher (resp. lower) than the value in the null hypothesis we use a one-tailed test:

64
Let X1 , X2 , . . . , Xn be a sample from the normal distribution N (µ, σ²) with known value of σ. The test hypotheses are:
\[ H_0 : \mu = \mu_0, \qquad H_A : \mu > \mu_0 \quad (\text{resp. } \mu < \mu_0). \]
The test statistic is again
\[ Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1). \]
The rejection region for significance level α is
\[ RR = \{ Z \ge z_{\alpha} \} \quad \text{resp.} \quad RR = \{ Z \le -z_{\alpha} \}. \]
Conclusion: If $z_{TS} \ge z_{\alpha}$ we reject the null hypothesis (resp. if $z_{TS} \le -z_{\alpha}$ we reject the
null hypothesis). Otherwise, we do not reject H0.
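The two-tailed and one-tailed decision rules can be collected in one small R function. This is only a sketch: the function name z_test, its arguments and the numbers in the usage line are illustrative and do not come from the notes. The quantile zα is obtained as qnorm(1 - alpha).

```r
# Sketch of the Z-test decision rule (sigma known). Names are illustrative.
z_test <- function(xbar, mu0, sigma, n, alpha = 0.05,
                   alternative = c("two.sided", "greater", "less")) {
  alternative <- match.arg(alternative)
  z_ts <- (xbar - mu0) / (sigma / sqrt(n))            # observed test statistic
  reject <- switch(alternative,
    two.sided = abs(z_ts) >= qnorm(1 - alpha / 2),    # RR = {|Z| >= z_{alpha/2}}
    greater   = z_ts >= qnorm(1 - alpha),             # RR = {Z >= z_alpha}
    less      = z_ts <= -qnorm(1 - alpha))            # RR = {Z <= -z_alpha}
  list(z = z_ts, reject = reject)
}

# Hypothetical data: n = 25, sigma = 2, sample mean 10.5, H0: mu = 10.
res <- z_test(xbar = 10.5, mu0 = 10, sigma = 2, n = 25, alternative = "two.sided")
print(res)
```

Only the rejection region changes with the alternative; the statistic itself is the same in all three cases.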
Under the null hypothesis, the probability distribution of the test statistic is known and does
not depend on the unknown parameters of the population.
With this framework we can now solve the problem of the car manufacturer posed at the
beginning of this chapter.

Exercise 4.1. A car manufacturer is looking for additives that might increase a car’s performance.
As a pilot study, they send thirty cars fuelled with a new additive on a road. Without
the additive, those same cars are known to have average fuel consumption µ0 = 25.0 mpg
with a standard deviation of σ0 = 2.4 mpg. The fuel consumption is assumed to be normally
distributed. Suppose it turns out that the thirty cars average x̄ = 26.3 mpg with the additive.
At the 5% significance level, what should the company conclude?

Solution. With µ0 = 25.0 we have
\[ H_0 : \mu = \mu_0, \qquad H_A : \mu > \mu_0, \]
\[ TS: \quad Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1). \]
Thus
\[ z_{TS} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{26.3 - 25.0}{2.4/\sqrt{30}} = 2.97. \]
This is a one-tailed test, so
\[ RR = \{ Z \ge z_{\alpha} \}, \qquad z_{\alpha} = z_{0.05} = 1.65. \]
Since $z_{TS} = 2.97 > z_{\alpha}$, we reject H0 and conclude that at the 5% significance level the company
can claim that the additive provides an increase in petrol mileage.

4.1.3 Errors when testing hypotheses
In a statistical test, it is impossible to establish the truth of a hypothesis with 100% certainty.
There are two possible types of errors.
Definition 4.2. A type I error is made if H0 is rejected when in fact H0 is true. The
probability of type I error is denoted by α. That is,

P (rejecting H0|H0 is true) = α.

The probability of type I error, α, is called the level of significance.


A type II error is made if H0 is accepted (not rejected) when in fact HA is true. The
probability of a type II error is denoted by β. That is,

P (not rejecting H0|H0 is false) = β.

These definitions can be summarized as follows (TS denotes the test statistic):
\[ \alpha = P(TS \in RR \mid H_0) \tag{4.1} \]
\[ \beta = P(TS \notin RR \mid H_A) \tag{4.2} \]

The right and wrong decisions we may take when conducting a statistical test are summarized
in the following table:

Statistical Decision and Error Probabilities
                        True State of Null Hypothesis
Statistical Decision    H0 True               H0 False
Do not reject H0        Correct decision      Type II error (β)
Reject H0               Type I error (α)      Correct decision

In another nomenclature, these are called true/false positive/negative results:

                        True State of Null Hypothesis
Statistical Decision    H0 True               H0 False
Do not reject H0        True Negative         False Negative (β)
Reject H0               False Positive (α)    True Positive

It is desirable that a test should have both α and β as small as possible. It can be shown
that there is a relation between the sample size, α (the type I error probability) and β (the
type II error probability). Once we are given the sample size n, an α, a simple alternative HA,
and a test statistic, β is exactly determined and we have no control over it. In other words,
for a given sample size and test statistic, any effort to lower β will increase α and vice versa.
However, by increasing the sample size n, we can decrease β to an acceptable level for the same α.
Definition 4.3. The power or sensitivity of a test is the probability that the null hypothesis
is rejected given that the alternative hypothesis is true: 1 − β
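To see how the sample size drives β, the power of the upper-tailed Z-test against a simple alternative µ = µ1 can be computed directly. Under µ1 the statistic Z is normal with mean (µ1 − µ0)/(σ/√n) and variance 1, so the power is 1 − Φ(zα − (µ1 − µ0)/(σ/√n)). The sketch below uses the hypothetical alternative µ1 = 26 together with µ0 = 25 and σ = 2.4; the function name z_power is illustrative.

```r
# Power of the upper-tailed Z-test against a simple alternative mu = mu1.
# z_power and mu1 = 26 are illustrative assumptions, not taken from the notes.
z_power <- function(mu0, mu1, sigma, n, alpha = 0.05) {
  z_alpha <- qnorm(1 - alpha)
  shift <- (mu1 - mu0) / (sigma / sqrt(n))   # mean of Z when mu = mu1
  1 - pnorm(z_alpha - shift)                 # P(Z >= z_alpha | mu = mu1) = 1 - beta
}

sizes <- c(10, 30, 100)
power <- sapply(sizes, function(n) z_power(mu0 = 25, mu1 = 26, sigma = 2.4, n = n))
beta <- 1 - power
print(data.frame(n = sizes, power = power, beta = beta))
```

For fixed α the power increases with n, so β decreases, which is the behaviour described before the definition.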

4.1.4 p-value
Definition 4.4. Corresponding to an observed value of a test statistic, the p-value is the
lowest level of significance at which the null hypothesis would have been rejected.

\[ p\text{-value} = \min\{ \alpha \in (0, 1) : \text{we reject } H_0 \text{ at significance level } \alpha \}. \]

Based on the p-value we report the test result:


• Choose the maximum value of α that you are willing to tolerate.

• If the p-value of the test is less than the maximum value of α, reject H0.
For the simple Z-test of Section 4.1.2 the p-value can be found as follows ($z_{TS}$ is the value of the
test statistic):
\[ p\text{-value} = \begin{cases}
P(TS < z_{TS} \mid H_0) = \Phi(z_{TS}) & \text{for a lower-tail test,} \\
P(TS > z_{TS} \mid H_0) = 1 - \Phi(z_{TS}) & \text{for an upper-tail test,} \\
P(|TS| > |z_{TS}| \mid H_0) = 2(1 - \Phi(|z_{TS}|)) & \text{for a two-tailed test.}
\end{cases} \]

Exercise 4.5. Find the p-value for Exercise 4.1.

Solution. We found
\[ z_{TS} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{26.3 - 25.0}{2.4/\sqrt{30}} = 2.97. \]
The p-value is P (TS > 2.97 | H0) = 1 − Φ(2.97) = 1 − 0.9985 = 0.0015. This is less than 0.05, so the
result is consistent with the previous conclusion: we can reject H0 at the 5% significance level.
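The same numbers can be reproduced in R with pnorm, which evaluates Φ. The helper z_p_value below is an illustrative name, not part of the notes; it covers the three cases of the p-value formula for the Z-test.

```r
# p-value of a Z-test for the three types of alternative (illustrative helper).
z_p_value <- function(z_ts, tail = c("upper", "lower", "two")) {
  tail <- match.arg(tail)
  switch(tail,
    lower = pnorm(z_ts),                    # P(TS < z_TS | H0)
    upper = 1 - pnorm(z_ts),                # P(TS > z_TS | H0)
    two   = 2 * (1 - pnorm(abs(z_ts))))     # P(|TS| > |z_TS| | H0)
}

# Data of Exercise 4.1: n = 30, sample mean 26.3, mu0 = 25.0, sigma = 2.4.
z_ts <- (26.3 - 25.0) / (2.4 / sqrt(30))
p <- z_p_value(z_ts, "upper")
print(c(z = z_ts, p_value = p))
```

Because R does not round the statistic to 2.97 first, the printed p-value can differ from the table-based value in the last digits.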

4.2 List of the most common statistical tests


4.2.1 t-test
If the standard deviation is unknown and the sample size is small, we use a t-test instead
of the z-test. The procedure is the same as described in Section 4.1.2 except that the test
statistic is
\[ T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t_{n-1}. \]
Under the null hypothesis, the test statistic has a t distribution with n − 1 degrees of freedom,
so we use quantiles of $t_{n-1}$ instead of N (0, 1).
Exercise 4.6. A company would like to assess fatigue in steel plant workers due to heat stress.
A random sample of 25 casting workers had a mean post-work heart rate of 78.3 beats per
minute (bpm) with the sample standard deviation of 11.2 bpm.
At the 5% significance level, do the data provide sufficient evidence to conclude that the
mean post-work heart rate for casting workers exceeds the normal resting heart rate of 72
bpm? Assume that the post-work heart rates of casting workers are normally distributed.

Solution. 1) Hypotheses: H0: µ = 72, HA: µ > 72 (one-sided test).

2) The test statistic $TS = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}$ gives the value $t_{TS} = \frac{78.3 - 72}{11.2/\sqrt{25}} = 2.8125$.

3) $t_{critical} = t_{0.05,24} = 1.711$, so RR = {TS > 1.711}.

4) Conclusion: $t_{TS} > t_{critical}$, hence reject H0.

5) Additionally we may report the p-value: P (t > 2.81) < P (t > 2.797) = 0.005.
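The steps of this solution can be checked in R, where qt gives the quantiles of the t distribution and pt its distribution function. The variable names are illustrative; the data are those of Exercise 4.6.

```r
# One-sample, upper-tailed t-test from summary statistics (Exercise 4.6).
n <- 25; xbar <- 78.3; s <- 11.2; mu0 <- 72; alpha <- 0.05

t_ts <- (xbar - mu0) / (s / sqrt(n))                  # observed test statistic
t_crit <- qt(1 - alpha, df = n - 1)                   # t_{0.05,24}
p_value <- pt(t_ts, df = n - 1, lower.tail = FALSE)   # P(t > t_TS)
reject <- t_ts > t_crit

print(c(t = t_ts, critical = t_crit, p_value = p_value))
```

Software returns the p-value itself rather than the bound read off from the tables.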
In the following exercise, since the sample size is large (greater than 30), we use the z-test.
Exercise 4.7. We need to analyse the height of wheat plants in a field under a different irrigation
(watering) regime. Assume that the height has a normal distribution with unknown variance.
We measured 35 plants from different parts of the field and found that the average height is 102 cm and
the sample variance is 16. It was shown that under the usual regime the mean height of the plants is
100 cm.
We want to test, at the 5% significance level, the hypothesis that the change of irrigation regime
has a significant influence on the height of the plants.
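Solution. 1) Hypotheses: H0: µ = 100, HA: µ ≠ 100 (two-sided test)

2) TS: $z_{TS} = \frac{102 - 100}{4/\sqrt{35}} = 2.96$

3) RR: $z_{critical} = \pm 1.96$ (two-sided), that is RR = {|Z| ≥ 1.96}

4) Conclusion: $|z_{TS}| > z_{critical}$, hence reject H0

5) p-value: P (|Z| > 2.96) = 2(1 − Φ(2.96)) = 2(1 − 0.9985) = 0.003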


In the following exercise we test for proportions. Since the sample size is large, we can use
the z-test.
Example 4.8. A machine in a certain factory must be repaired if it produces more than 12%
defectives among the large lot of items it produces in a week. A random sample of 175 items
from a week’s production contains 35 defectives, and it is decided that the machine must be
repaired.
Does the sample evidence support this decision? Use α = 0.02.
Compute the p-value.
Solution:
1) Hypotheses:

2) TS:

3) RR:

4) Conclusion:

5) p-value:
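One possible way to check an answer to Example 4.8 numerically is sketched below. It assumes the usual large-sample statistic for a proportion, $z = (\hat{p} - p_0)/\sqrt{p_0(1 - p_0)/n}$, and reads the question as H0: p = 0.12 against HA: p > 0.12; both the reading of the hypotheses and the variable names are assumptions of this sketch.

```r
# Large-sample Z-test for a proportion, upper-tailed (data of Example 4.8).
# Assumed hypotheses: H0: p = 0.12 against HA: p > 0.12.
n <- 175; x <- 35; p0 <- 0.12; alpha <- 0.02

p_hat <- x / n                                    # sample proportion of defectives
z_ts <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)    # standard error computed under H0
z_crit <- qnorm(1 - alpha)                        # z_{0.02}
p_value <- 1 - pnorm(z_ts)
reject <- z_ts >= z_crit

print(c(p_hat = p_hat, z = z_ts, critical = z_crit, p_value = p_value))
```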

4.2.2 Testing for proportions (small sample)


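Take a single (small) sample.
Suppose that X1 , X2 , . . . , Xn is a random sample of Bernoulli random variables where n
is too small to use the approximation by the normal distribution. Let X = X1 + · · · + Xn denote
the number of successes.
For testing H0 : p = p0 the critical region should be defined using the exact binomial
distribution:
\[ P(X = k) = \binom{n}{k} p_0^k (1 - p_0)^{n-k}. \]
Rejection region: RR = {X ≤ k1} ∪ {X ≥ k2}, where k1 and k2 are chosen so that
\[ P(X \le k_1) + P(X \ge k_2) = \alpha \]
(or, because X is discrete, as close to α as possible without exceeding it).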
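The cut-offs k1 and k2 can be found with qbinom and pbinom. The sketch below splits α equally between the two tails, which is one common convention and an assumption here, and uses hypothetical values n = 15, p0 = 0.3 and α = 0.05 that do not come from the notes.

```r
# Exact two-sided rejection region for H0: p = p0 from the binomial distribution.
# n, p0 and alpha are hypothetical; alpha is split equally between the tails.
n <- 15; p0 <- 0.3; alpha <- 0.05

# Largest k1 with P(X <= k1) <= alpha/2 (k1 = -1 would mean an empty lower tail).
k1 <- qbinom(alpha / 2, n, p0) - 1
if (pbinom(k1 + 1, n, p0) <= alpha / 2) k1 <- k1 + 1

# Smallest k2 with P(X >= k2) <= alpha/2.
k2 <- qbinom(1 - alpha / 2, n, p0) + 1

# Actual probability of a type I error for RR = {X <= k1} or {X >= k2}.
size <- pbinom(k1, n, p0) + pbinom(k2 - 1, n, p0, lower.tail = FALSE)
print(c(k1 = k1, k2 = k2, size = size))
```

Because X is discrete, the attained size is typically smaller than the nominal α.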

Exercise 4.9. How long sporting events last is quite variable. This variability can cause
problems for TV broadcasters, since the amount of commercials and commentator blather
varies with the length of the event. Assume that the lengths for a random sample of 16
middle-round contests at the 2008 Wimbledon Championships in women’s tennis has sample
standard deviation 27.25.
Assuming that match lengths are normally distributed, test the hypothesis that standard
deviation of a match length is no more than 25 mins using α = 0.05.
Compute the p-value.

Solution. :

1) Hypotheses:

2) TS:

3) RR:

4) Conclusion:

5) p-value:
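A numerical check for Exercise 4.9 can be sketched with the statistic $(n - 1)S^2/\sigma_0^2$, which has a $\chi^2_{n-1}$ distribution for normal data (see Appendix A). The sketch reads the question as H0: σ = 25 against HA: σ > 25; that reading and the variable names are assumptions.

```r
# Upper-tailed chi-square test for a standard deviation (data of Exercise 4.9).
# Assumed hypotheses: H0: sigma = 25 against HA: sigma > 25.
n <- 16; s <- 27.25; sigma0 <- 25; alpha <- 0.05

chi_ts <- (n - 1) * s^2 / sigma0^2                          # observed test statistic
chi_crit <- qchisq(1 - alpha, df = n - 1)                   # chi^2_{0.05,15}
p_value <- pchisq(chi_ts, df = n - 1, lower.tail = FALSE)   # P(chi^2_15 > chi_ts)
reject <- chi_ts >= chi_crit

print(c(statistic = chi_ts, critical = chi_crit, p_value = p_value))
```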

4.2.3 Testing for variance
When testing hypotheses about the variances of two populations, the F -distribution introduced in
Section 3.3.4 is useful. Let $X_1, X_2, \dots, X_{n_1}$ be from the normal distribution $N(\mu_1, \sigma_1^2)$ and
$Y_1, Y_2, \dots, Y_{n_2}$ be from the normal distribution $N(\mu_2, \sigma_2^2)$, with the two samples independent. We test
\[ H_0 : \sigma_1^2 = \sigma_2^2 \]
against
\[ H_A : \sigma_1^2 \neq \sigma_2^2 \quad (\text{resp. } H_A : \sigma_1^2 > \sigma_2^2 \text{ or } \sigma_1^2 < \sigma_2^2). \]
The test statistic is
\[ F = \frac{S_1^2}{S_2^2} = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \sim F_{n_1 - 1, n_2 - 1} \]
(because under the null hypothesis σ1 = σ2 ). We denote the value of the test statistic for a
particular sample by $f_{TS} = s_1^2/s_2^2$.
We find the rejection region for fixed α using the statistical tables of the F -distribution.

Exercise 4.10. Consider two independent random samples $X_1, \dots, X_{n_1}$ from the $N(\mu_1, \sigma_1^2)$
distribution and $Y_1, \dots, Y_{n_2}$ from the $N(\mu_2, \sigma_2^2)$ distribution.
Test $H_0 : \sigma_1^2 = \sigma_2^2$ versus $H_A : \sigma_1^2 \neq \sigma_2^2$ for α = 0.20 using the following basic statistics:
Sample    Size    Sample Mean    Sample Variance
1         25      410            95
2         16      390            300

Solution. 1) Hypotheses: $H_0 : \sigma_1^2 = \sigma_2^2$ versus $H_A : \sigma_1^2 \neq \sigma_2^2$ (two-sided)

2) TS: $F = S_1^2/S_2^2$. Thus $f_{TS} = 95/300 = 0.317$

3) RR: From the tables $F_{0.1,24,15} = 1.90$. By (3.15) $F_{0.9,24,15} = 1/F_{0.1,15,24} = 1/1.78 = 0.56$.
Hence RR = {F ≤ 0.56} ∪ {F ≥ 1.90}.

4) Conclusion: $f_{TS} < 0.56$, hence reject H0; there is evidence that the population variances
are not equal.

5) p-value: 2 · P (F ≤ 0.317) = 2 · 0.00595 ≈ 0.012
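The quantiles read from the F tables and the p-value can be reproduced with qf and pf. This is a sketch with illustrative variable names, using the summary statistics of Exercise 4.10.

```r
# Two-sided F-test for equality of variances (summary statistics of Exercise 4.10).
n1 <- 25; n2 <- 16; s1_sq <- 95; s2_sq <- 300; alpha <- 0.20

f_ts <- s1_sq / s2_sq
lower <- qf(alpha / 2, df1 = n1 - 1, df2 = n2 - 1)       # F_{0.9,24,15} in the notes' notation
upper <- qf(1 - alpha / 2, df1 = n1 - 1, df2 = n2 - 1)   # F_{0.1,24,15}
reject <- f_ts <= lower || f_ts >= upper

# Two-sided p-value: twice the smaller tail probability.
p_value <- 2 * min(pf(f_ts, n1 - 1, n2 - 1),
                   pf(f_ts, n1 - 1, n2 - 1, lower.tail = FALSE))
print(c(f = f_ts, lower = lower, upper = upper, p_value = p_value))
```

Note that qf takes the lower-tail probability, whereas the subscript in $F_{\alpha,m,n}$ refers to the upper tail.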

4.2.4 Tests for paired samples


We say that the samples are paired if each score in one sample is paired with a specific score
in the other sample. In this case, the two samples are not independent. For example, a group
of people may be weighed before and after following a certain diet program. This results in
two measurements for each person. The first sample consists of pre-diet measurements and the
second consists of post-diet measurements.

Let $X_{1,1}, \dots, X_{1,n}$ be the first sample and $X_{2,1}, \dots, X_{2,n}$ be the second sample. The procedure
to test the significance of the difference between two population means when the samples
are dependent is as follows:
(i) Calculate for each pair of scores the difference between the two scores,
$D_i = X_{1,i} - X_{2,i}$, i = 1, 2, . . . , n.

(ii) The differences $D_1, \dots, D_n$ are i.i.d. random variables. If $d_1, \dots, d_n$ are the observed values of
$D_1, \dots, D_n$, compute
\[ \bar{d} = \frac{1}{n} \sum_{i=1}^{n} d_i, \qquad s_d^2 = \frac{1}{n-1} \sum_{i=1}^{n} (d_i - \bar{d})^2. \]

(iii) Now the testing proceeds as in the case of a single sample. Let $\mu_D = E[D]$ be the
expected value of the difference. The hypotheses are $H_0 : \mu_D = 0$ against $H_A : \mu_D \neq 0$
(or a one-sided alternative).
Exercise 4.11. A new diet and exercise program has been advertised as a remarkable way to
reduce blood glucose levels in diabetic patients. Ten randomly selected diabetic patients are
put on the program, and the results after 1 month are given by the following table:
Before    268   225   252   192   307   228   246   298   231   185
After     106   186   223   110   203   101   211   176   194   203
Do the data provide sufficient evidence to support the claim that the new program reduces
blood glucose level in diabetic patients? Use α = 0.05.
Solution. We find the differences D = After − Before:
Before    268   225   252   192   307   228   246   298   231   185
After     106   186   223   110   203   101   211   176   194   203
D        -162   -39   -29   -82  -104  -127   -35  -122   -37    18
From the table, the mean of the differences is $\bar{d} = -71.9$ and the standard deviation is $s_d = 56.2$.
1) Hypotheses: $H_0 : \mu_D = 0$ versus $H_A : \mu_D < 0$ (one-sided)

2) TS: $T = \frac{\bar{D} - 0}{S_D/\sqrt{n}} \sim t_{n-1}$; we find $t_{TS} = \frac{-71.9}{56.2/\sqrt{10}} = -4.046$

3) This is a one-sided (lower-tail) test, so the critical value is $-t_{0.05,9} = -1.833$ and RR = {T ≤ −1.833}.

4) Conclusion: $t_{TS} = -4.046 < -1.833$, reject H0. The sample evidence suggests that the
new diet and exercise program is effective.

5) p-value: P (t ≤ −4.046) = 0.00145.
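The same test can be run in R, either by hand on the differences or with the built-in t.test using paired = TRUE. As in the solution, the differences are taken as After − Before, so a reduction corresponds to alternative = "less". Variable names are illustrative.

```r
# Paired t-test for the glucose data of Exercise 4.11, differences = after - before.
before <- c(268, 225, 252, 192, 307, 228, 246, 298, 231, 185)
after  <- c(106, 186, 223, 110, 203, 101, 211, 176, 194, 203)

d <- after - before
d_bar <- mean(d)                              # mean of the differences
s_d <- sd(d)                                  # sample standard deviation (divisor n - 1)
t_ts <- d_bar / (s_d / sqrt(length(d)))       # observed test statistic

# Built-in equivalent; tests whether the mean of (after - before) is below 0.
res <- t.test(after, before, paired = TRUE, alternative = "less")
print(c(d_bar = d_bar, s_d = s_d, t = t_ts, p_value = res$p.value))
```

The statistic computed without rounding $s_d$ differs from the hand value only in the third decimal place.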

Chapter 5

Goodness of Fit

The idea behind the chi-square goodness-of-fit test is to check if given data come from a
particular probability distribution. We have outcomes X of an experiment that can produce
k different results. We say that X is a categorical random variable (with k categories). Let
pi be the probability of observing category i. We want to test if the data come from a certain
probability distribution:
Categories       1       2       ···   k
Probabilities    p1,0    p2,0    ···   pk,0
For example X could be a result of a coin toss X ∈ {head, tail} (k = 2) or a result of rolling a
die X ∈ {1, 2, 3, 4, 5, 6} (k = 6). The experiment is repeated n times and we count the number
of times that each category is attained.
The null hypothesis can be stated as

H0 : p1 = p1,0 , p2 = p2,0 , . . . , pk = pk,0 ,

The alternative hypothesis is that at least one of these equalities doesn’t hold.
Let
• ni be the number of observations in the i-th category, n = n1 + · · · + nk . The ni ’s are called
observed frequencies.
• pi be the probability of getting an observation in the i-th category.
Then npi,0 is the expected count for the i-th category and is called the expected frequency. Note
that the expected frequencies are calculated assuming the null hypothesis is true. The probabilities
pi,0 can either be given from the very beginning or estimated from the sample (in
which case the notation p̂i,0 would be more appropriate).
We use the statistic
\[ G = \sum_{i=1}^{k} \frac{(n_i - n p_{i,0})^2}{n p_{i,0}} \sim \chi^2_{k-1-l}, \]
which has approximately a χ² distribution with k − 1 − l degrees of freedom, where

• k is the number of classes,

• l is the number of parameters that were estimated while calculating the probabilities
pi,0 .

This approximation works well provided that the expected frequency of each class is at least
5. If some expected frequencies are smaller than 5 we may combine categories.
Note that small values of the test statistic suggest that the null hypothesis holds whereas
large values of G suggest that the alternative is true. Therefore, the rejection region is constructed
using the right tail of the χ² distribution: $RR = \{ G \ge \chi^2_{\alpha,k-1-l} \}$.

Exercise 5.1. Researchers in Germany concluded that the risk of heart attack on a Monday
for a working person may be as much as 50% greater than on any other day.
The researchers kept track of heart attacks and coronary arrests over a period of 5 years
among 330,000 people who lived near Augsburg, Germany. In an attempt to verify the
researchers’ claim, 200 working people who had recently had heart attacks were surveyed.
The counts by day of the week were:
Sunday   Monday   Tuesday   Wednesday   Thursday   Friday   Saturday
24       36       27        26          32         26       29
Do these data present sufficient evidence to indicate that there is a difference in the percentages
of heart attacks that occur on different days of the week? Test using α = 0.05.

Solution. The sample size is n = 200 and the observed frequencies ni are listed in the table. The hypotheses are
\[ H_0 : p_1 = p_2 = \dots = p_7 = 1/7, \qquad H_A : p_i \neq 1/7 \text{ for at least one } i. \]
The assumption npi ≥ 5 is satisfied (each expected frequency is 200/7 ≈ 28.6). The goodness-of-fit statistic is:
\[ g = \frac{(24 - 200/7)^2}{200/7} + \frac{(36 - 200/7)^2}{200/7} + \dots + \frac{(29 - 200/7)^2}{200/7} = 3.63. \]
We did not estimate any parameter, so l = 0 and the number of degrees of freedom is k − 1 − l =
6. The critical value is $\chi^2_{0.05,6} = 12.59$; since g = 3.63 < 12.59, H0 cannot be rejected.
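This calculation can be reproduced in R, both directly from the formula for the goodness-of-fit statistic and with the built-in chisq.test. Variable names are illustrative; the counts are those of Exercise 5.1.

```r
# Chi-square goodness-of-fit test against equal probabilities (Exercise 5.1).
observed <- c(Sun = 24, Mon = 36, Tue = 27, Wed = 26, Thu = 32, Fri = 26, Sat = 29)
n <- sum(observed)
k <- length(observed)

expected <- n * rep(1 / k, k)                    # n * p_{i,0} = 200/7 for every day
g <- sum((observed - expected)^2 / expected)     # observed value of the statistic
crit <- qchisq(0.95, df = k - 1)                 # chi^2_{0.05,6}; l = 0 parameters estimated

res <- chisq.test(observed, p = rep(1 / k, k))   # built-in equivalent
print(c(g = g, critical = crit, p_value = res$p.value))
```

Here chisq.test always uses k − 1 degrees of freedom, so it matches the formula only when no parameters were estimated (l = 0).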

5.1 For continuous random variables


To test the hypothesis:

H0 : The given data follow a specific probability distribution F0

versus
HA : The data do not follow the specified probability distribution

we divide the range of the random variable into a finite number of bands. Specify the ranges as
$r_i = (Y_i^L, Y_i^U]$, i = 1, 2, . . . , k, where k is the number of classes and $Y_i^L$ and $Y_i^U$ are the lower
and upper limits of class i, respectively.
Then the goodness-of-fit statistic is defined similarly as before:
\[ G = \sum_{i=1}^{k} \frac{(n_i - n p_i)^2}{n p_i}, \]
where $n_i$ is the observed frequency in class i and $p_i$ is the expected (theoretical) relative
frequency of class i. Note that $p_i$ can be found using the cumulative distribution function
$F_0$ appearing in H0:
\[ p_i = F_0(Y_i^U) - F_0(Y_i^L). \]

Exercise 5.2. (Question 1 from MockTest 1)


Data was collected for the mileage ratings of 40 cars of a new car model determined for an
environmental survey. The frequency distribution is presented in the table:

(0, 32] (32, 34] (34, 36] (36, 38] (38, 40] (40, 42] (42, ∞)
3 6 8 9 8 4 2

Test the hypothesis that the distribution is normal N (µ, 10), for α = 0.05

Solution. As we need to test for N (µ, 10), where µ is unknown, according to the theorem
the value of µ for the null hypothesis can be found using the MLE for µ.
We can show that the MLE for µ for a population with normal distribution is X̄; hence, as
x̄ = 36.765:

H0 : the data follow N (36.765, 10), HA : the data do not follow N (36.765, 10)

Using the statistical tables or software we can find the values of the cumulative distribution
function and the expected frequencies for each interval. The information is collected in the
table:
(YiL , YiU ]    (0, 34]    (34, 36]    (36, 38]    (38, 40]    (40, ∞)
ni              9          8           9           8           6
F0 (YiL )       0.0000     0.1910      0.4044      0.6519      0.8468
F0 (YiU )       0.1910     0.4044      0.6519      0.8468      1.0000
p̂i              0.1910     0.2135      0.2475      0.1949      0.1532
np̂i             7.64       8.5386      9.9003      7.7965      6.128

Note that we combined some classes in order to have npi ≥ 5. After this operation there are
k = 5 classes. We estimated one parameter, so l = 1 and thus there are k − 1 − l = 3 degrees
of freedom.

For the goodness-of-fit statistic
\[ G = \sum_{i=1}^{k} \frac{(n_i - n \hat{p}_i)^2}{n \hat{p}_i}, \]
the observed value is g = 0.374 and the corresponding critical value from the χ² distribution is
$\chi^2_{0.05,5-1-1} = 7.81$. Therefore, we cannot reject the null hypothesis: the data are consistent
with a sample obtained from a population with normal distribution.
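The table of expected frequencies can be rebuilt in R with pnorm. The sketch takes 10 to be the variance, so the standard deviation is √10, which reproduces the tabulated values of F0. The statistic computed this way is close to, but not exactly, the value quoted above, since the hand calculation works with rounded probabilities. Variable names are illustrative.

```r
# Goodness-of-fit test for binned data against N(36.765, 10) (Exercise 5.2).
# Assumption: 10 is the variance, so sd = sqrt(10).
breaks <- c(0, 34, 36, 38, 40, Inf)     # class limits after combining classes
observed <- c(9, 8, 9, 8, 6)
mu_hat <- 36.765
sigma <- sqrt(10)

p_hat <- diff(pnorm(breaks, mean = mu_hat, sd = sigma))   # F0(upper) - F0(lower)
expected <- sum(observed) * p_hat                         # n * p_hat_i
g <- sum((observed - expected)^2 / expected)

df <- length(observed) - 1 - 1          # k - 1 - l with l = 1 estimated parameter
crit <- qchisq(0.95, df = df)           # chi^2_{0.05,3}
print(rbind(p_hat = round(p_hat, 4), expected = round(expected, 3)))
print(c(g = g, critical = crit))
```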

Exercise 5.3. The speeds of vehicles (in mph) passing through a section of Highway 75 are
recorded for a random sample of 150 vehicles and are given below. Test the hypothesis that
the speeds are normally distributed with a mean of 70 and a standard deviation of 4. Use
α = 0.01.
Range 40 − 55 56 − 65 66 − 75 76 − 85 > 85
Number 12 14 78 40 6

Exercise 5.4. Based on the sample data of 50 days contained in the following table, test the
hypothesis that the daily mean temperatures in the City of Tampa are normally distributed
with mean 77 and variance 6. Use α = 5%.

Temperature 46 − 55 56 − 65 66 − 75 76 − 85 86 − 95
Number of days 4 6 13 23 4

Appendix A: Summary of important random variables and their distributions

Variable                                                      Distribution          Reference
$T(\bar{X}, \mu) = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$     N (0, 1)              Theorem 3.16
$T = \frac{(n-1)S^2}{\sigma^2}$                               $\chi^2_{n-1}$        Theorem 3.52
$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$                        $t_{n-1}$             Theorem 3.56
$F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2}$               F (n1 − 1, n2 − 1)    Theorem 3.59

Appendix B: Matlab and R code

5.2 Probability Distributions in R


• Discrete probability distributions
Distribution      pmf       cdf       inverse cdf    random deviates
Binomial          dbinom    pbinom    qbinom         rbinom
Geometric         dgeom     pgeom     qgeom          rgeom
Hypergeometric    dhyper    phyper    qhyper         rhyper
Poisson           dpois     ppois     qpois          rpois
• Continuous probability distributions
Distribution      pdf       cdf       inverse cdf    random deviates
Normal            dnorm     pnorm     qnorm          rnorm
Exponential       dexp      pexp      qexp           rexp
Uniform           dunif     punif     qunif          runif

5.3 Plots of the densities and probability mass functions


Figure 3.1 can be plotted in MATLAB:
b=1;
a=2;
x=0:0.1:10;
y = gampdf(x,a,b);
plot(x,y)
or in R:
a<-4;
b<-1;
x<-seq(0,10,0.1);
y<-dgamma(x, shape=a, scale = b, log = FALSE);
plot(x,y)

Figure 3.2 can be plotted in MATLAB:
n=1;
x=0:0.1:10;
y=chi2pdf(x,n);
plot(x,y)
or in R:
n<-3;
x<-seq(0,10,0.1);
y<-dchisq(x, df=n);
plot(x,y)
To plot pdf of t-distribution in Figure 3.3 in MATLAB:
n=5;
x=-10:0.1:10;
y=tpdf(x,n);
plot(x,y)
and in R:
n<-5;
x<-seq(-10,10,0.1);
y<-dt(x, df=n);
plot(x,y)
To find x percentile of t-distribution, MATLAB:
n=5;
x=0.025;
y= tinv(x,n)
R:
n<-5;
x<-0.025;
y<-qt(x, df=n)
To plot pdf of F (m, n)-distribution (Figure 3.4), MATLAB:
m=10;
n=5;
x=0:0.1:10;
y=fpdf(x,m,n);
plot(x,y)

R:

m<-10;
n<-5;
x<-seq(0,10,0.1);
y<-df(x, df1=m, df2=n);
plot(x,y)

To find x percentile of F (m, n)-distribution in MATLAB:

m=10;
n=5;
x=0.025;
y=finv(x,m,n)

in R:

m<-10;
n<-5;
x<-0.025;
y<-qf(x, df1=m, df2=n)
