Sampling distribution and parameter estimation
Junzhe Bao
Department of Biostatistics and Epidemiology
Email: statistics_bjz@126.com
Outline
1 Sampling distribution and sampling error
2 Central Limit Theorem and Standard error
3 z distribution and t-distribution
4 Parameter estimation
Sampling distribution and sampling error
➢ The population of interest is often very large, and it is difficult to obtain
data for every individual in it. Therefore, samples are drawn from the
population to estimate population parameters.
➢ For example, suppose we want to know the average triglyceride level of adult
men in a city. It is not feasible to draw blood from every adult man in the city
and test it. Instead, a representative sample can be drawn, such as 500 adult
men, and their average level used to estimate the average level for the city.
➢ Suppose that the average triglyceride concentration of these 500 men was
187 milligrams/deciliter. If you drew another sample of 500 men and tested their
triglycerides, their average level might be 191. Continue the process for 100
samples. The sample mean (average level) then becomes a random variable, and
the sample means 187, 191, 184, . . . , 196 constitute a sampling distribution
of sample means.
➢ A sampling distribution of sample means is a distribution using the means
computed from all possible random samples of a specific size taken from a
population.
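The repeated-sampling idea above can be sketched as a small simulation. This is only an illustration: the population of 100,000 men and its mean (190) and standard deviation (40) are assumptions made for the example, not values from the study described.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: triglyceride levels (mg/dL) of 100,000 adult men.
# The mean of 190 and standard deviation of 40 are illustrative assumptions.
population = [random.gauss(190, 40) for _ in range(100_000)]
population_mean = statistics.fmean(population)

# Draw 100 samples of 500 men each and record every sample mean.
# random.sample draws without replacement within one sample, as a survey would.
sample_means = [
    statistics.fmean(random.sample(population, 500))
    for _ in range(100)
]

# Each sample mean differs a little from the population mean (sampling error).
sampling_errors = [m - population_mean for m in sample_means]

print(f"population mean: {population_mean:.1f}")
print(f"first five sample means: {[round(m, 1) for m in sample_means[:5]]}")
print(f"largest sampling error: {max(abs(e) for e in sampling_errors):.1f}")
```

The list `sample_means` plays the role of the values 187, 191, 184, . . . , 196: it is an empirical approximation of the sampling distribution of the mean.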
➢ If the samples are randomly selected with replacement, the sample
means, for the most part, will be somewhat different from the
population mean. These differences are caused by sampling error.
➢ Sampling error is the difference between the sample measure and the
corresponding population measure due to the fact that the sample is
not a perfect representation of the population.
Central Limit Theorem and Standard error
The Central Limit Theorem:
As the sample size (n) increases without limit, the shape of the distribution of the sample
means taken with replacement from a population will approach a normal distribution.
1. When the original variable is normally distributed, the distribution of the sample
means will be normally distributed, for any sample size n.
2. When the original variable might not be normally distributed, a sample size of
30 or more is generally needed (a common rule of thumb) to use a normal distribution
to approximate the distribution of the sample means. The larger the sample, the
better the approximation will be.
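A minimal simulation sketch of the theorem, assuming a strongly right-skewed exponential population (an arbitrary choice for illustration). It uses skewness as a rough measure of how far the distribution of sample means is from the symmetric normal shape.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

def sample_means(n, repetitions=5000):
    """Means of `repetitions` samples of size n from an exponential population."""
    return [
        statistics.fmean([random.expovariate(1.0) for _ in range(n)])
        for _ in range(repetitions)
    ]

def skewness(values):
    """Skewness coefficient: 0 for a symmetric distribution such as the normal."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])

# Skewness of the simulated sample means for small, moderate, and large n.
skew_by_n = {n: skewness(sample_means(n)) for n in (2, 30, 200)}

for n, skew in skew_by_n.items():
    print(f"n = {n:>3}: skewness of the sample means = {skew:.2f}")
```

The skewness shrinks toward 0 as n grows, which is what the approach to a normal distribution looks like in this setting.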
➢ The standard deviation introduced in the previous course is used to reflect the
variation between individual observations. The standard deviation of the
sample means is called the standard error of the mean.
➢ Suppose a professor gave an 8-point quiz to a small class of four students. The
results of the quiz were 2, 6, 4, and 8. For the sake of discussion, assume that
the four students constitute the population. The mean of the population is
$$\mu = \frac{2 + 6 + 4 + 8}{4} = 5$$
The standard deviation of the population is
$$\sigma = \sqrt{\frac{(2-5)^2 + (6-5)^2 + (4-5)^2 + (8-5)^2}{4}} = \sqrt{5} \approx 2.236$$
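Now, if all samples of size 2 are taken with replacement from the population
(2, 6, 4, and 8) and the mean of each sample is found, the distribution is as shown.

Sample   Mean     Sample   Mean
2, 2     2        6, 2     4
2, 4     3        6, 4     5
2, 6     4        6, 6     6
2, 8     5        6, 8     7
4, 2     3        8, 2     5
4, 4     4        8, 4     6
4, 6     5        8, 6     7
4, 8     6        8, 8     8

A frequency distribution of sample means is as follows.

Sample mean   2   3   4   5   6   7   8
Frequency     1   2   3   4   3   2   1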
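The standard deviation of sample means is
$$\sigma_{\bar{X}} = \sqrt{\frac{(2-5)^2 + 2(3-5)^2 + 3(4-5)^2 + 4(5-5)^2 + 3(6-5)^2 + 2(7-5)^2 + (8-5)^2}{16}} = \sqrt{2.5} \approx 1.581$$
which is the same as the population standard deviation divided by $\sqrt{2}$:
$$\frac{\sigma}{\sqrt{n}} = \frac{2.236}{\sqrt{2}} \approx 1.581$$
In summary, if all possible samples of size n are taken with replacement from the
same population, the mean of the sample means equals the population mean; and
the standard deviation of the sample means equals
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$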
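The quiz example can be checked by enumerating every sample directly. The following sketch uses only the four scores given above; the variable names are chosen for illustration.

```python
import itertools
import math
import statistics
from collections import Counter

population = [2, 6, 4, 8]  # the four quiz scores, treated as the population
n = 2                      # sample size

mu = statistics.fmean(population)      # population mean
sigma = statistics.pstdev(population)  # population standard deviation

# All 4**2 = 16 ordered samples of size 2 drawn with replacement.
samples = list(itertools.product(population, repeat=n))
sample_means = [statistics.fmean(s) for s in samples]

# Frequency distribution of the sample means.
frequencies = Counter(sample_means)

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.pstdev(sample_means)  # standard error of the mean

for mean in sorted(frequencies):
    print(f"sample mean {mean:.0f}: frequency {frequencies[mean]}")
print(f"mean of sample means = {mean_of_means:.3f}, population mean = {mu:.3f}")
print(f"SD of sample means   = {sd_of_means:.3f}, sigma / sqrt(n) = {sigma / math.sqrt(n):.3f}")
```

Because every possible sample is enumerated, the two equalities hold exactly here rather than approximately.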
z distribution and t-distribution
According to the central limit theorem, even if the population from which the
sample is drawn does not follow a normal distribution, the sample mean
approximately follows a normal distribution when the sample size is large enough.
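$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim z \text{ distribution}$$
When n is large, the same holds approximately with S in place of σ:
$$\frac{\bar{X} - \mu}{S/\sqrt{n}} \sim z \text{ distribution (approximately)}$$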
The population standard deviation (σ) is often unknown in actual research. If the sample
standard deviation (S) is used in place of the population standard deviation (σ), the
standardized sample mean no longer follows the z distribution but follows the t distribution:
$$t = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(v)$$
where v = n − 1 is the degrees of freedom, which determines the shape of the t distribution.
The t distribution shares some characteristics of the normal distribution and differs from
it in others. The t distribution is similar to the standard normal distribution in these ways:
1. It is bell-shaped.
2. It is symmetric about the mean.
3. The mean, median, and mode are equal to 0 and are located at the center of the
distribution.
4. The curve never touches the x axis.
The t distribution differs from the standard normal distribution in the following ways:
1. The variance is greater than 1.
2. The t distribution is actually a family of curves based on the concept of degrees of
freedom, which is related to sample size.
3. As the sample size increases, the t distribution approaches the standard normal
distribution.
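The similarities and differences listed above can be seen numerically. The sketch below writes out the standard density formula for Student's t distribution by hand (the Python standard library has no t distribution) and compares it with the standard normal density. The helper names `t_pdf` and `t_variance` are introduced here for illustration.

```python
import math
from statistics import NormalDist

def t_pdf(x, v):
    """Density of Student's t distribution with v degrees of freedom."""
    coefficient = math.gamma((v + 1) / 2) / (math.sqrt(v * math.pi) * math.gamma(v / 2))
    return coefficient * (1 + x * x / v) ** (-(v + 1) / 2)

def t_variance(v):
    """Variance of the t distribution, v / (v - 2), defined for v > 2."""
    return v / (v - 2)

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Compare the peak (x = 0) and the tail (x = 3) as the degrees of freedom grow.
for v in (3, 10, 30, 100):
    print(
        f"v = {v:>3}: t pdf at 0 = {t_pdf(0, v):.4f}, "
        f"t pdf at 3 = {t_pdf(3, v):.4f}, variance = {t_variance(v):.3f}"
    )
print(
    f"normal : pdf at 0 = {standard_normal.pdf(0):.4f}, "
    f"pdf at 3 = {standard_normal.pdf(3):.4f}, variance = 1.000"
)
```

For small v the t curve is lower at the center and heavier in the tails than the standard normal curve, and its variance exceeds 1. As v increases, all three quantities move toward the normal values.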
Parameter estimation
Parameter estimation refers to the estimation of population parameters from sample
statistics. There are two types of parameter estimation commonly used: point
estimation and interval estimation.
Suppose a college president wishes to estimate the average age of students attending
classes this semester. The president could select a random sample of 100 students
and find the average age of these students, say, 22.3 years. From the sample mean,
the president could infer that the average age of all the students is 22.3 years. This
type of estimate is called a point estimate.
A point estimate is a specific numerical value estimate of a parameter. The best point
estimate of the population mean (μ) is the sample mean ($\bar{X}$).
You might ask why other measures of central tendency, such as the
median and mode, are not used to estimate the population mean.
The reason is that, for populations that are approximately normal,
the means of samples vary less than other statistics (such as medians
and modes) when many samples are selected from the same population.
Therefore, the sample mean is the best estimate of the population mean.
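A small simulation sketch of this comparison, assuming a normally distributed population of student ages with mean 22.3 and standard deviation 3.0. The standard deviation, the sample size of 25, and the function name are assumptions made for the example.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

def spread_of_estimators(sample_size=25, repetitions=4000):
    """Return the SD of the sample means and of the sample medians across many samples."""
    means, medians = [], []
    for _ in range(repetitions):
        # Hypothetical normal population: mean age 22.3 years, SD 3.0 years.
        sample = [random.gauss(22.3, 3.0) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    return statistics.stdev(means), statistics.stdev(medians)

sd_of_means, sd_of_medians = spread_of_estimators()

print(f"SD of sample means:   {sd_of_means:.3f}")
print(f"SD of sample medians: {sd_of_medians:.3f}")
```

Under this normal-population assumption the sample means cluster more tightly around 22.3 than the sample medians do. The comparison could come out differently for heavy-tailed populations.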
The sample mean will be, for the most part, somewhat different from the population
mean due to sampling error. Therefore, you might ask a second question: How good
is a point estimate?
The answer is that there is no way of knowing how close a particular point estimate
is to the population mean. This answer places some doubt on the accuracy of point
estimates. For this reason, statisticians prefer another type of estimate, called an
interval estimate.
An interval estimate of a parameter is an interval or a range of values used to
estimate the parameter. This estimate may or may not contain the value of the
parameter being estimated.
Either the interval contains the parameter or it does not. A degree of confidence
(usually a percent) can be assigned before an interval estimate is made. For instance,
you may wish to be 95% confident that the interval contains the true population
mean. Another question then arises. Why 95%? Why not 99 or 99.5%?
If you desire to be more confident, such as 99 or 99.5% confident, then you must
make the interval larger. For example, a 99% confidence interval for the mean age of
college students might be 21.7 < μ < 22.9, or 22.3 ± 0.6. Hence, a tradeoff occurs.
To be more confident that the interval contains the true population mean, you must
make the interval wider.
The confidence level of an interval estimate of a parameter is the probability that the
interval estimate will contain the parameter, assuming that a large number of
samples are selected and that the estimation process on the same parameter is
repeated.
A confidence interval is a specific interval estimate of a parameter determined by
using data obtained from a sample and by using the specific confidence level of the
estimate.
Intervals constructed in this way are called confidence intervals. Three common
confidence intervals are used: the 90%, the 95%, and the 99% confidence intervals.
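The "repeated estimation" reading of the confidence level can be illustrated with a simulation. The sketch assumes a normal population with known mean 22.3 and standard deviation 3.0 (both hypothetical) and builds the interval as the sample mean plus or minus 1.96 standard errors, the usual 95% interval when σ is known.

```python
import math
import random
import statistics
from statistics import NormalDist

random.seed(3)  # fixed seed so the sketch is reproducible

# Hypothetical population mean, population SD, and sample size.
MU, SIGMA, N = 22.3, 3.0, 100
Z = NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% confidence level

def interval_covers_mu():
    """Draw one sample, build its 95% interval, and report whether it contains MU."""
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    x_bar = statistics.fmean(sample)
    margin = Z * SIGMA / math.sqrt(N)
    return x_bar - margin < MU < x_bar + margin

repetitions = 5000
coverage = sum(interval_covers_mu() for _ in range(repetitions)) / repetitions

print(f"fraction of intervals containing the population mean: {coverage:.3f}")
```

Any single interval either contains μ or does not. The 95% describes the long-run fraction of intervals, built this way, that contain it.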
Confidence Intervals for the Mean When σ Is Known
The central limit theorem states that when the sample size is large, approximately
95% of the sample means of samples of the same size taken from a population will fall
within ±1.96 standard errors of the population mean (z distribution and z table).
Therefore, the 95% confidence interval of the population mean is
$$\bar{X} \pm 1.96\left(\frac{\sigma}{\sqrt{n}}\right)$$
that is,
$$\bar{X} - 1.96\left(\frac{\sigma}{\sqrt{n}}\right) < \mu < \bar{X} + 1.96\left(\frac{\sigma}{\sqrt{n}}\right)$$
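The formula above generalizes to other confidence levels by replacing 1.96 with the matching z value. A minimal sketch follows; the function name `z_confidence_interval` and the value σ = 2.0 are assumptions for illustration, while the sample mean of 22.3 and n = 100 echo the college-age example.

```python
import math
from statistics import NormalDist

def z_confidence_interval(x_bar, sigma, n, confidence=0.95):
    """Confidence interval for the mean when the population SD (sigma) is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 when confidence is 0.95
    margin = z * sigma / math.sqrt(n)
    return x_bar - margin, x_bar + margin

# Hypothetical inputs: 100 students, sample mean age 22.3, assumed sigma of 2.0.
intervals = {
    level: z_confidence_interval(22.3, 2.0, 100, level)
    for level in (0.90, 0.95, 0.99)
}

for level, (lower, upper) in intervals.items():
    print(f"{level:.0%} confidence interval: {lower:.2f} < mu < {upper:.2f}")
```

The intervals widen as the confidence level rises, which is the tradeoff between confidence and precision.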
Confidence Intervals for the Mean When σ Is Unknown (s to represent σ)
Ten randomly selected people were asked how long they slept at night. The mean
time was 7.1 hours, and the standard deviation was 0.78 hour. Find the 95%
confidence interval of the mean time (population mean). Assume the variable is
normally distributed.
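Since σ is unknown and s must replace it, the t distribution must be used for the
confidence interval (the t table replaces the z table). With v = 10 − 1 = 9 degrees
of freedom, $t_{\alpha/2} = 2.262$.
The 95% confidence interval can be found by substituting in the formula.
$$\bar{X} - t_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right) < \mu < \bar{X} + t_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right)$$
$$7.1 - 2.262\left(\frac{0.78}{\sqrt{10}}\right) < \mu < 7.1 + 2.262\left(\frac{0.78}{\sqrt{10}}\right)$$
$$6.54 < \mu < 7.66$$
The result can be reported as 7.1 (6.54, 7.66).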
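The arithmetic of the sleep example can be reproduced in a few lines. The Python standard library has no t distribution, so the critical value 2.262 (95% confidence, 9 degrees of freedom) is passed in as read from a t table. The function name is chosen for illustration.

```python
import math

def t_confidence_interval(x_bar, s, n, t_critical):
    """Confidence interval for the mean when sigma is unknown and s replaces it."""
    margin = t_critical * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin

# Sleep example: n = 10, mean 7.1 hours, s = 0.78 hour.
# t critical value for 95% confidence and 9 degrees of freedom, from a t table.
lower, upper = t_confidence_interval(x_bar=7.1, s=0.78, n=10, t_critical=2.262)

print(f"{lower:.2f} < mu < {upper:.2f}")  # 6.54 < mu < 7.66
```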
Test
The difference between confidence interval and reference range
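Confidence interval: estimates a range that is likely to include the population parameter.
$$\bar{X} \pm t_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right) \quad \text{or} \quad \bar{X} \pm z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)$$
Reference range: the range of most individual observations.
$$\bar{X} \pm z_{\alpha/2}\, s \quad \text{or percentiles such as } P_5 / P_{95}$$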
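The contrast between the two ranges is easiest to see side by side. The sketch below reuses the triglyceride figures from earlier (sample mean 187, n = 500) and assumes a sample standard deviation of 40, which is hypothetical. Because n is large, it uses z in place of t for the confidence interval as an approximation.

```python
import math
from statistics import NormalDist

def reference_range(x_bar, s, coverage=0.95):
    """Range expected to contain most individual observations (normal data assumed)."""
    z = NormalDist().inv_cdf(1 - (1 - coverage) / 2)
    return x_bar - z * s, x_bar + z * s

def large_sample_confidence_interval(x_bar, s, n, confidence=0.95):
    """Approximate confidence interval for the population mean (z used since n is large)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin

# Sample mean and n from the triglyceride example; s = 40 is a hypothetical value.
x_bar, s, n = 187.0, 40.0, 500

ref_low, ref_high = reference_range(x_bar, s)
ci_low, ci_high = large_sample_confidence_interval(x_bar, s, n)

print(f"95% reference range:     {ref_low:.1f} to {ref_high:.1f}")
print(f"95% confidence interval: {ci_low:.1f} to {ci_high:.1f}")
```

The reference range is built from the standard deviation and describes individuals. The confidence interval is built from the standard error and describes the mean, so it is narrower by a factor of $\sqrt{n}$.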
Test
The difference and connection between standard deviation and standard error
Standard deviation: describes the dispersion of normally distributed data,
reflecting the variation among individual values.
As the sample size increases, the sample standard deviation tends to stabilize
and approach the population standard deviation.
The standard deviation can be used to describe dispersion, calculate the
coefficient of variation, determine the medical reference range, and calculate
standard errors.
Standard error: describes the sampling error of the mean, reflecting how well
the sample mean represents the population mean.
As the sample size increases, the standard error decreases.
The standard error reflects the size of the sampling error and can be used
to estimate the confidence interval of the population mean and to perform
hypothesis testing.
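The different behavior of the two quantities as n grows can be sketched with a simulation. The normal population with mean 190 and standard deviation 40 is a hypothetical choice for illustration.

```python
import math
import random
import statistics

random.seed(4)  # fixed seed so the sketch is reproducible

SIGMA = 40.0  # hypothetical population standard deviation

# For each sample size, store (sample SD, standard error of the mean).
results = {}
for n in (10, 100, 1000, 10000):
    sample = [random.gauss(190.0, SIGMA) for _ in range(n)]
    s = statistics.stdev(sample)           # sample standard deviation
    results[n] = (s, s / math.sqrt(n))     # standard error = s / sqrt(n)

for n, (s, se) in results.items():
    print(f"n = {n:>5}: standard deviation = {s:6.2f}, standard error = {se:6.3f}")
```

The sample standard deviation settles near the population value of 40, while the standard error keeps shrinking in proportion to $1/\sqrt{n}$.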