Data, Models, Parameters, and Statistics
In this lecture, we will see more datasets and give a brief introduction to some typical
models and setups.
In statistics, our starting point is a collection of data $X_1, \ldots, X_n$. Each $X_i$ could be a number, a vector, or even a matrix. Our goal is to draw useful information from the data.
Examples:
1. Old Faithful data.
data(faithful)
faithful
eruptions: numeric. Eruption time in minutes.
waiting: numeric. Waiting time to next eruption (in minutes).
2. ChickWeight data
data(ChickWeight)
ChickWeight
weight: a numeric vector giving the body weight of the chick (gm).
Time: a numeric vector giving the number of days since birth when the measurement
was made.
Chick: an ordered factor with levels 18 < ... < 48 giving a unique identifier for the
chick.
Diet: a factor with levels 1,...,4 indicating which experimental diet the chick received.
3. Longley's Economic Regression Data
data(longley)
longley
This is a macroeconomic data set that provides a well-known example of highly collinear
regression.
GNP.deflator: GNP implicit price deflator (1954=100)
GNP: Gross National Product.
Unemployed: number of unemployed.
Armed.Forces: number of people in the armed forces.
Population: ‘noninstitutionalized’ population >= 14 years of age.
Year: the year (time).
Employed: number of people employed.
lm(Employed ~ GNP, data = longley)
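A quick way to see the collinearity (this check is our addition, not part of the original example):
round(cor(longley), 2)   # pairwise correlations; several predictors are correlated above 0.99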
4. Air passenger data.
data(AirPassengers)
AirPassengers
Assumptions: Once we have a dataset, we need proper assumptions to do statistical
inference (estimation, testing, prediction, confidence intervals, etc.).
1. The samples are independent.
2. The samples are identically distributed.
3. Relationship among the coordinates of each sample (linear, for example).
4. The samples follow a particular distribution (normal, exponential, uniform, etc.).
5. ……..
We should be careful when applying these assumptions to a dataset.
Parameters: If we assume the samples follow some particular distribution, the distribution
will have parameters, which are generally unknown.
Example: Michelson-Morley Speed of Light Data.
data(morley)
morley
attach(morley)
hist(Speed)
qqnorm(Speed)
The Speed samples are approximately normal, so it is reasonable to assume Speed follows a
$N(\mu, \sigma^2)$ distribution. But the parameters $\mu$ and $\sigma^2$ are unknown. We need to
estimate them in some cases.
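In R, natural estimates of $\mu$ and $\sigma^2$ (justified in the estimation discussion below) are the sample mean and sample variance:
mean(Speed)   # estimate of mu
var(Speed)    # estimate of sigma^2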
Basic Models and Goals
1. Estimation.
Observe i.i.d. samples $X_1, \ldots, X_n$. They follow some distribution with parameter
$\theta$. Our goal is to estimate $\theta$, or more generally, a function $g(\theta)$ of $\theta$.
2. Confidence Interval.
We do not need an actual point estimate of the parameter, but we want to find an interval,
computed from the data, that covers the true parameter with high probability (for example,
95%).
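As a quick sketch in R, under the normal assumption, t.test reports a 95% confidence interval for the mean of the morley Speed data:
data(morley)
t.test(morley$Speed)$conf.int   # 95% confidence interval for the mean speed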
3. Hypothesis Testing.
We want to get a yes-or-no answer to some question, for example $\theta = \theta_0$ or
$\theta \neq \theta_0$, or $\theta \le \theta_0$ or $\theta > \theta_0$.
For example, in the ChickWeight data we want to compare the weights of chicks on
different diets.
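A minimal sketch in R (restricting to diets 1 and 2 at day 21 is our choice for illustration, not fixed by the data):
data(ChickWeight)
final <- subset(ChickWeight, Time == 21)              # weights on the last measurement day
d12 <- droplevels(subset(final, Diet %in% c(1, 2)))   # keep only diets 1 and 2
t.test(weight ~ Diet, data = d12)                     # two-sample t test for equal mean weights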
4. Prediction.
Predict the value of the next observation; for example, the air passenger data.
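One possible sketch in R; Holt-Winters seasonal smoothing is our choice of method here, not the only one:
data(AirPassengers)
fit <- HoltWinters(AirPassengers)   # exponential smoothing with trend and seasonality
predict(fit, n.ahead = 12)          # predict the next 12 months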
5. Linear Regression Model.
We observe paired data $(x_1, Y_1), \ldots, (x_n, Y_n)$. We assume $x_1, \ldots, x_n$ are nonrandom
and $Y_1, \ldots, Y_n$ are realizations of the random variables
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad i = 1, \ldots, n,$$
where $\epsilon_1, \ldots, \epsilon_n$ are independent random variables with expectation 0 and variance $\sigma^2$.
$\beta_0$ and $\beta_1$ are unknown parameters. $y = \beta_0 + \beta_1 x$ is called the regression line. We want to
estimate it.
data(trees)
attach(trees)
plot(Volume, Girth)
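A short continuation (we regress Girth on Volume to match the axes of the plot above):
fit <- lm(Girth ~ Volume, data = trees)   # least squares estimates of intercept and slope
abline(fit)                               # add the fitted regression line to the plot
coef(fit)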
Measurement of Performance
Once we get an answer to a statistical problem, we need to know how good it is; we
need to measure the performance of our decision.
Unbiased estimation.
Mean squared error.
Efficiency.
……
Unbiased Estimator
In this lecture, we will study the estimation problem. Our goal here is to use the
dataset to estimate a quantity of interest. We will focus on the case where the quantity
of interest is a certain function of the parameter of the distribution of the samples.
Examples:
1. data(morley)
We want to estimate the speed of light, under the normal assumption.
2. Exponential distribution (lifetime of a machine).
X<-rexp(100,rate=2)
Let us pretend that we do not know the true parameter (which is 2), and estimate it
based on the samples.
An estimate is a value that depends only on the dataset $x_1, \ldots, x_n$, i.e., the estimate
is a function $\hat{\theta}(x_1, \ldots, x_n)$ of the dataset.
One can often think of several estimates for the parameter of interest.
In Example 1, we could use the sample mean or the sample median.
In Example 2, we could use the reciprocal of the sample mean, $1/\bar{X}$, or $\log 2$ divided by
the sample median (the median of the exponential distribution with rate $\lambda$ is $\log 2 / \lambda$).
Then we need to answer the following questions:
When is one estimate better than another? Does there exist a best estimate?
Since the dataset $x_1, \ldots, x_n$ is a realization of the random variables $X_1, \ldots, X_n$,
the estimate $\hat{\theta}(x_1, \ldots, x_n)$ is a realization of the random variable $\hat{\theta}(X_1, \ldots, X_n)$,
which is called an estimator.
Example:
y <- rep(0, 50)
z <- rep(0, 50)
for (i in 1:50) {
  X <- rexp(100, rate = 2)    # draw a fresh sample of size 100
  y[i] <- 1/mean(X)           # estimate 1: reciprocal of the sample mean
  z[i] <- log(2)/median(X)    # estimate 2: log(2) over the sample median
}
For each set of samples we have an estimate, so the estimator $\hat{\theta}(X_1, \ldots, X_n)$ is a
random variable. We need to investigate the behavior of the estimators.
hist(y); mean(y); var(y);
hist(z); mean(z); var(z);
The mean squared error of an estimator $\hat{\theta}$ is defined as $\mathrm{MSE}(\hat{\theta}) = E_\theta(\hat{\theta} - \theta)^2$, which we can approximate from the simulated estimates:
mean((y-2)^2)
mean((z-2)^2)
Now we know that an estimator $\hat{\theta}$ is a random variable. The probability distribution of
$\hat{\theta}$ is also called the sampling distribution of $\hat{\theta}$.
Definition: An estimator $\hat{\theta}$ is called an unbiased estimator for the parameter $\theta$, if $E_\theta[\hat{\theta}] = \theta$
for all $\theta$. Generally, the difference $E_\theta[\hat{\theta}] - \theta$ is called the bias of $\hat{\theta}$.
Let us consider the normal mean problem. Suppose $X_1, \ldots, X_n$ follow the $N(\mu, \sigma^2)$
distribution and we want to estimate $\mu$. Since $\mu$ is the expectation of the distribution,
an intuitive estimator is the sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$. This is an unbiased estimator.
Unbiased estimator for expectation and variance
Suppose $X_1, \ldots, X_n$ are i.i.d. random variables with mean $\mu$ and variance $\sigma^2$. Now
we have the following unbiased estimators for both of them.
$\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$ is an unbiased estimator of $\mu$, and
$S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2$ is an unbiased estimator of $\sigma^2$.
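In R, mean() and var() compute exactly these estimators (var divides by n - 1). A quick simulation sketch illustrating unbiasedness:
m <- replicate(1000, mean(rnorm(50, mean = 1, sd = 2)))   # 1000 sample means
v <- replicate(1000, var(rnorm(50, mean = 1, sd = 2)))    # 1000 sample variances
mean(m)   # close to the true mean 1
mean(v)   # close to the true variance 4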
Remark: Unbiased estimators do not necessarily exist, and unbiasedness does not
always carry over: $\hat{\theta}$ being an unbiased estimator of $\theta$ does not mean $g(\hat{\theta})$ is an
unbiased estimator of $g(\theta)$, unless $g$ is a linear function.
Method of Moments
From the previous normal example, we can see that if the parameter of interest is the
expectation or variance of the distribution, we can use the sample mean or
sample variance to estimate it. This estimator is reasonable.
Suppose we have i.i.d. samples $X_1, \ldots, X_n$ following some distribution with unknown
parameter $\theta$. Now we want to estimate this parameter $\theta$. We first calculate the
expectation of the distribution, $E_\theta[X]$. Usually, this is a function of $\theta$ (think about the
normal or exponential distribution). Suppose $E_\theta[X] = m(\theta)$; then under suitable
conditions, $\theta$ can be written as $\theta = m^{-1}(E_\theta[X])$. Since we can always use the sample mean
to estimate the expectation, we have an intuitive estimator of $\theta$: $\hat{\theta} = m^{-1}(\bar{X})$.
In general, we can calculate the expectation of a function of $X$. Suppose $E_\theta[g(X)] = m(\theta)$
for some function $g$. In the previous discussion, $g(x) = x$.
Actually, $g$ could be any function, for example $g(x) = x^2$, $g(x) = x^3$, etc., as long as its
expectation is easy to compute. Then an estimator of $\theta$ would be
$\hat{\theta} = m^{-1}\left(\frac{1}{n}\sum_{i=1}^n g(X_i)\right).$
This method is called the method of moments. By the Law of Large Numbers, we know
that these estimators are not bad.
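A minimal sketch in R for the exponential case from Example 2, where $m(\lambda) = E_\lambda[X] = 1/\lambda$ and hence $m^{-1}(\bar{X}) = 1/\bar{X}$:
X <- rexp(100, rate = 2)   # samples with true rate 2
1/mean(X)                  # method-of-moments estimate of the rate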