Wachemo University
College of Medicine and Health Sciences
School of Public Health
Lecture Notes for Biostatistics
Abriham S. Areba (Assistant Professor)
abriamshiferaw@gmail.com
October, 2024
Hossana, Ethiopia
STATISTICAL ESTIMATION AND HYPOTHESIS TESTING
o The process of drawing conclusions about an entire population
based on the data in a sample is known as statistical inference.
Types of statistical inference
Statistical Estimation
Statistical Hypothesis testing
Estimation is the computation of a statistic from sample data, often
yielding a value that is an approximation (guess) of its target, an
unknown true population parameter value.
Estimation is used when the investigator does not have any prior notion about the values or
characteristics of the population parameter.
Abriham S.
Estimator: the rule or random variable that helps us to approximate a
population parameter.
Estimate: a particular value that an estimator assumes for a given sample.
Properties of Best Estimator
To explain these properties, let $\hat{\theta}$ be an estimator of $\theta$.
Unbiased Estimator: An estimator whose expected value is the
value of the parameter being estimated, i.e. $E(\hat{\theta}) = \theta$.
Consistent Estimator: An estimator which gets closer to the value of
the parameter as the sample size increases, i.e. $\hat{\theta}$ gets closer to $\theta$ as the
sample size increases.
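Unbiasedness can be checked by brute force on a very small population. The sketch below is only an illustration: the population values, the sample size and the helper names are assumptions, not part of these notes. It enumerates every equally likely ordered sample of size 2 (drawn with replacement) and averages each estimator over all of them.

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy population and sample size (illustrative assumptions).
population = [2, 4, 6]
n = 2

def mean(values):
    """Exact arithmetic mean as a Fraction."""
    return Fraction(sum(values), len(values))

def variance(values, divisor):
    """Sum of squared deviations from the mean, divided by `divisor`."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / divisor

mu = mean(population)                          # population mean
sigma_sq = variance(population, len(population))  # population variance

# Every equally likely ordered sample of size n, drawn with replacement.
samples = list(product(population, repeat=n))

expected_x_bar = mean([mean(s) for s in samples])
expected_s_sq = mean([variance(s, n - 1) for s in samples])   # divisor n - 1
expected_biased = mean([variance(s, n) for s in samples])     # divisor n

print("E(sample mean) =", expected_x_bar, " mu =", mu)
print("E(S^2), divisor n-1 =", expected_s_sq, " sigma^2 =", sigma_sq)
print("E(S^2), divisor n   =", expected_biased)
```

In this toy case the sample mean and the variance with divisor n - 1 average out exactly to the population values, while the variance with divisor n falls short, which is what "biased" means.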
Relatively Efficient Estimator: The estimator for a parameter with the
smallest variance.
This actually compares two or more estimators for one parameter.
Two methods of estimation are commonly used:
✓ Point Estimation and
✓ Interval Estimation
Point Estimation: A single numerical value used to estimate the
corresponding population parameter
❖ A point estimate is of the form: [Value]
❖ $\bar{X}$ is an estimator of the population mean $\mu$
❖ $S$ is an estimator of the population standard deviation $\sigma$
❖ $\hat{p}$ is an estimator of the population proportion $P$
Interval Estimation (CI): Specifies a range of reasonable values for the
population parameter based on a point estimate
❖ Provides more information about a population characteristic than does
a point estimate
❖ Gives information about the precision of an estimate
– When sampling variability is high, the CI will be wide to reflect
the uncertainty of the observation.
– Wider CIs indicate less certainty.
✓ Gives information about closeness to unknown population parameters
The general formula for all confidence intervals is:
Lower limit = Point Estimate - (Critical Value) x (Standard Error)
Upper limit = Point Estimate +(Critical Value) x (Standard Error)
Interval Estimation form is [lower limit, upper limit]
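The general formula can be written as one small function. This is a minimal sketch: the function name and the input values are hypothetical and serve only to show how the two limits are formed.

```python
def confidence_interval(point_estimate, critical_value, standard_error):
    """Return (lower limit, upper limit) = estimate -/+ critical value * SE."""
    margin = critical_value * standard_error
    return point_estimate - margin, point_estimate + margin

# Hypothetical inputs: point estimate 50, critical value 1.96, standard error 2.
lower, upper = confidence_interval(50.0, 1.96, 2.0)
print(f"[{lower:.2f}, {upper:.2f}]")
```

Every interval in the rest of these notes has this shape; only the critical value and the standard error change from case to case.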
Estimation for Population
Confidence Interval for Population Mean
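Case 1: If n is large, or the population is normal with known variance $\sigma^2$
$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$
We can therefore use the normal distribution curve to compute confidence
intervals.
$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ has a normal distribution with mean = 0 and variance = 1.
To obtain the value of Z, there is an area of size $1-\alpha$ such that
$P\left(-Z_{\alpha/2} < Z < Z_{\alpha/2}\right) = 1 - \alpha$
$\alpha$ = the probability that the interval does not contain the parameter
$Z_{\alpha/2}$ = the value of the standard normal distribution with an area of $\alpha/2$ to its right
Substituting for Z and solving the inequalities for $\mu$:
$P\left(-Z_{\alpha/2} < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < Z_{\alpha/2}\right) = 1 - \alpha$
$P\left(\bar{X} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha$
Thus, a $(1-\alpha)100\%$ confidence interval for $\mu$ is given by
$\left(\bar{X} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{X} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$
o The end points of the interval, $\bar{X} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$ and $\bar{X} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$, are called
confidence limits and
o the probability $1-\alpha$ is called the degree of confidence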
Case 2: If the sample size is small and $\sigma^2$ is unknown
$t = \frac{\bar{X} - \mu}{S/\sqrt{n}}$ has a t distribution with $n-1$ degrees of freedom.
A $(1-\alpha)100\%$ confidence interval for $\mu$ is
$\left(\bar{X} - t_{\alpha/2}(n-1)\frac{S}{\sqrt{n}},\ \bar{X} + t_{\alpha/2}(n-1)\frac{S}{\sqrt{n}}\right)$
Where:
$t_{\alpha/2}(n-1)$ is the critical value of the t distribution providing an area of $\frac{\alpha}{2}$ in the right tail of
the t-distribution with $n-1$ df, and $S = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n-1}}$
The width of the confidence interval is measured in units of the standard error,
which is just the standard deviation of the sampling distribution of the statistic.
Commonly used CLs are 90%, 95% and 99%, for which $Z_{\alpha/2}$ = 1.645, 1.96 and 2.576 respectively.
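Example 1: A sample of size 25 from a normal population gave a mean of 32.
Given that the population standard deviation is 4.2, find
a) A 95% confidence interval for the population mean.
Solution:
$\bar{X} = 32$, $\sigma = 4.2$, $n = 25$, $\alpha = 0.05$, $Z_{\alpha/2} = 1.96$
$\left(\bar{X} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{X} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$
$\left(32 - 1.96 \times \frac{4.2}{\sqrt{25}},\ 32 + 1.96 \times \frac{4.2}{\sqrt{25}}\right)$
$(30.35,\ 33.65)$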
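The calculation in Example 1 can be reproduced with Python's standard library. This is a sketch under the Case 1 assumptions (known $\sigma$, normal population); the function name `z_interval` is hypothetical, and the critical value is taken from `statistics.NormalDist` rather than from a table.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence=0.95):
    """Confidence interval for a mean when sigma is known (Case 1)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # Z_{alpha/2}, about 1.96 for 95%
    margin = z * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin

# Values from Example 1.
lower, upper = z_interval(x_bar=32, sigma=4.2, n=25)
print(round(lower, 2), round(upper, 2))
```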
Example 2. A drug company is testing a new drug which is supposed to
reduce blood pressure. From the six people who are used as subjects, it is
found that the average drop in blood pressure is 2.28 points, with a standard
deviation of 0.95 points. What is the 95% confidence interval for the mean
change in pressure?
$\bar{X} = 2.28$, $S = 0.95$, $n = 6$, $\alpha = 0.05$, $t_{\alpha/2}(n-1) = t_{0.025}(5) = 2.571$
$\left(\bar{X} - t_{\alpha/2}(n-1)\frac{S}{\sqrt{n}},\ \bar{X} + t_{\alpha/2}(n-1)\frac{S}{\sqrt{n}}\right)$
$\left(2.28 - 2.571 \times \frac{0.95}{\sqrt{6}},\ 2.28 + 2.571 \times \frac{0.95}{\sqrt{6}}\right)$
$(2.28 - 0.997,\ 2.28 + 0.997)$
$(1.28,\ 3.28)$
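The same arithmetic for Example 2 can be sketched in Python. The standard library has no t distribution, so the critical value 2.571 is passed in as the table value quoted in the example; the function name `t_interval` is hypothetical.

```python
from math import sqrt

def t_interval(x_bar, s, n, t_crit):
    """Confidence interval for a mean when sigma is unknown (Case 2).

    t_crit is t_{alpha/2}(n - 1), read from a t table by the caller.
    """
    margin = t_crit * s / sqrt(n)
    return x_bar - margin, x_bar + margin

# Values from Example 2; 2.571 is t_{0.025}(5) as quoted in the example.
lower, upper = t_interval(x_bar=2.28, s=0.95, n=6, t_crit=2.571)
print(round(lower, 2), round(upper, 2))
```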
Confidence Interval for Population Proportion
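When the sample size n is large, according to the central limit theorem,
the sampling distribution of $\hat{p}$ is approximately normal
with mean $p$ and variance $\frac{pq}{n}$, where $q = 1 - p$. That is,
if n is large, $\hat{p} \sim N(\mu_{\hat{p}}, \sigma_{\hat{p}}^2)$ with $\mu_{\hat{p}} = p$ and $\sigma_{\hat{p}}^2 = \frac{pq}{n}$
A $(1-\alpha)100\%$ CI for p is given by
$\left(\hat{p} - Z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}},\ \hat{p} + Z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}\right)$
where: $\hat{p} = \frac{x}{n}$ and $\hat{q} = 1 - \hat{p}$,
if $nP \ge 5$ and $n(1 - P) \ge 5$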
Example: We wish to estimate the prevalence of skin cancer among U.S. women
45-54 years old. We have a sample of 5000 women, of which 28 developed cancer.
a) Calculate a point estimate for the prevalence of skin cancer
b) Construct a 95% CI for the prevalence of skin cancer
Solution: Given: n = 5000; x = those who developed cancer = 28
$\alpha = 0.05$, $\frac{\alpha}{2} = 0.025$, $Z_{0.025} = 1.96$
a) $\hat{p} = \frac{x}{n} = \frac{28}{5000} = 0.0056$. That is, the prevalence is about 0.56%.
b) $\left(\hat{p} - Z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}},\ \hat{p} + Z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}\right)$
$\left(0.0056 - 1.96\sqrt{\frac{0.0056(1 - 0.0056)}{5000}},\ 0.0056 + 1.96\sqrt{\frac{0.0056(1 - 0.0056)}{5000}}\right)$
$(0.0035,\ 0.0077)$
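A short Python sketch of the proportion interval, applied to the skin cancer data. The function name is hypothetical, and checking the large-sample condition with $\hat{p}$ in place of the unknown P is an assumption of this sketch.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(x, n, confidence=0.95):
    """Large-sample confidence interval for a population proportion."""
    p_hat = x / n
    # Assumption: the condition nP >= 5 and n(1 - P) >= 5 is checked with p_hat.
    if n * p_hat < 5 or n * (1 - p_hat) < 5:
        raise ValueError("sample too small for the normal approximation")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, p_hat - margin, p_hat + margin

# Data from the example: 28 cases among 5000 women.
p_hat, lower, upper = proportion_interval(x=28, n=5000)
print(f"p_hat = {p_hat:.4f}, CI = ({lower:.4f}, {upper:.4f})")
```

Carrying the standard error at full precision matters here, because the proportion is very small and early rounding shifts the fourth decimal place of the limits.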
Difference Between Two Means
A. Known Variances (Two Independent Samples)
B. Unknown Population Variances
With equal population variances, we can obtain a pooled estimate from the sample
variances.
A $(1-\alpha)100\%$ CI for $\mu_1 - \mu_2$ is:
$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2}(n_1 + n_2 - 2)\sqrt{S_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$
$S_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$
If $0.5 \le \frac{s_1^2}{s_2^2} \le 2$, we assume that the population variances are equal.
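The pooled interval can be sketched as follows. All summary statistics below are hypothetical values chosen for illustration, the function names are assumptions, and the critical value $t_{0.025}(20) = 2.086$ is entered by hand from a t table.

```python
from math import sqrt

def variances_look_equal(s1_sq, s2_sq):
    """Rule of thumb from the notes: 0.5 <= s1^2 / s2^2 <= 2."""
    return 0.5 <= s1_sq / s2_sq <= 2

def pooled_variance(n1, s1_sq, n2, s2_sq):
    """Weighted average of the two sample variances."""
    return ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)

def pooled_interval(x1_bar, x2_bar, n1, s1_sq, n2, s2_sq, t_crit):
    """CI for mu1 - mu2 with equal, unknown population variances."""
    sp_sq = pooled_variance(n1, s1_sq, n2, s2_sq)
    margin = t_crit * sqrt(sp_sq * (1 / n1 + 1 / n2))
    diff = x1_bar - x2_bar
    return diff - margin, diff + margin

# Hypothetical summary statistics (not from the notes).
n1, x1_bar, s1_sq = 10, 20.0, 4.0
n2, x2_bar, s2_sq = 12, 18.0, 5.0
t_crit = 2.086  # t_{0.025}(n1 + n2 - 2 = 20), table value

if variances_look_equal(s1_sq, s2_sq):
    lower, upper = pooled_interval(x1_bar, x2_bar, n1, s1_sq, n2, s2_sq, t_crit)
    print(f"Sp^2 = {pooled_variance(n1, s1_sq, n2, s2_sq):.2f}")
    print(f"CI for mu1 - mu2: ({lower:.3f}, {upper:.3f})")
```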
Confidence Interval for two Population Proportions
o We are often interested in comparing proportions from two populations.
o Both sample sizes should be large: at least 40 observations in each sample.
Statistical Hypothesis Testing
▪ Hypothesis Testing (HT) provides an objective framework for
making decisions using probabilistic methods.
▪ The purpose of HT is to aid the clinician, researcher or
administrator in reaching a decision (conclusion) concerning a
population by examining a sample from that population.
▪ HT makes inferences about a population parameter when the investigator
has a prior notion about the value of the parameter.
Types of Hypothesis
1. Null Hypothesis
2. Alternative Hypothesis
The Null Hypothesis
It is the hypothesis (statement) of equality or of no difference between the
hypothesized value and the population value.
✓ (The effect of interest is zero = no difference)
✓ H0 is a statement of agreement (or no difference)
✓ H0 is always about a population parameter, not about a sample
statistic
Begin with the assumption that H0 is true
✓ Similar to the notion of innocent until proven guilty
• Always contains an “=”, “≤” or “≥” sign
• May or may not be rejected
The Alternative Hypothesis, $H_1$ or $H_A$
✓ It is the hypothesis available when H0 has to be rejected
• It is the hypothesis of difference.
• Is generally the hypothesis that is believed (or needs to
be supported) by the researcher.
• Is a statement that disagrees with (opposes) H0
(The effect of interest is not zero)
✓ May or may not be accepted
Errors in Hypothesis Tests
Type I Error
❖ The error committed when a true Ho is rejected
❖ Considered a serious type of error
❖ The probability of a type I error is the probability of rejecting the
Ho when it is true
❖ The probability of type I error is α
❖ Called level of significance of the test
Type II Error
❖ The error committed when a false Ho is not rejected
❖ The probability of Type II Error is 𝛽
❖ Usually unknown but larger than α
Power of Test
❖ The probability of rejecting the Ho when it is false.
❖ Power = 1 – β = 1- probability of type II error
We would like to maintain low probability of a Type I error (α) and
low probability of a Type II error (β) [high power = 1 - β].
Type I & II Error Relationship
Type I & Type II errors can not happen at the same time.
Type I error can only occur if 𝐻𝑜 is true
Type II error can only occur if 𝐻𝑜 is false
For a fixed sample size, if the Type I error probability ($\alpha$) increases, then the Type II error probability
($\beta$) decreases.
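The trade-off between $\alpha$ and $\beta$ can be seen numerically for a left-tailed Z test. This is a sketch with hypothetical inputs ($\mu_0 = 1600$, an assumed true mean of 1570, $\sigma = 120$, n = 16); the function name is also an assumption.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def type_ii_error(mu0, mu_true, sigma, n, alpha):
    """Beta for a left-tailed Z test of H0: mu = mu0 when the mean is mu_true."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    # Under the true mean, the test statistic is N(shift, 1).
    shift = (mu_true - mu0) / (sigma / sqrt(n))
    power = STD_NORMAL.cdf(-z_alpha - shift)   # P(Z_cal < -z_alpha)
    return 1 - power

# Hypothetical setting; the true mean is never known in practice.
beta_05 = type_ii_error(mu0=1600, mu_true=1570, sigma=120, n=16, alpha=0.05)
beta_10 = type_ii_error(mu0=1600, mu_true=1570, sigma=120, n=16, alpha=0.10)
print(f"alpha = 0.05: beta = {beta_05:.3f}, power = {1 - beta_05:.3f}")
print(f"alpha = 0.10: beta = {beta_10:.3f}, power = {1 - beta_10:.3f}")
```

With the sample size held fixed, raising $\alpha$ from 0.05 to 0.10 lowers $\beta$ and therefore raises the power.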
General steps in hypothesis testing:
1. State the null and the alternative hypothesis.
2. Choose a fixed significance level α.
3. Choose an appropriate test statistic
4. Identify the critical region.
5. Compute the value of the test statistic.
6. Make a decision.
7. State the conclusion.
Hypothesis Testing about the Population Mean
I. For two-sided test $H_0: \mu = \mu_0$ versus $H_A: \mu \ne \mu_0$
Reject $H_0$ if $|Z_{cal}| > Z_{\alpha/2}$
Figure: Area of acceptance and rejection of $H_0$ (Two-tailed test)
II. For right-tailed test $H_0: \mu = \mu_0$ versus $H_A: \mu > \mu_0$
Reject $H_0$ if $Z_{cal} > Z_{\alpha}$
Figure: Area of acceptance and rejection of $H_0$ (right-tailed test)
III. For left-tailed test $H_0: \mu = \mu_0$ versus $H_A: \mu < \mu_0$
Reject $H_0$ if $Z_{cal} < -Z_{\alpha}$
Figure: Area of acceptance and rejection of $H_0$ (left-tailed test)
When the t statistic is used instead, reject $H_0$ if
$|t_{cal}| > t_{\alpha/2}(n-1)$ (two-tailed test)
$t_{cal} > t_{\alpha}(n-1)$ (right-tailed test)
$t_{cal} < -t_{\alpha}(n-1)$ (left-tailed test)
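The three Z rejection rules can be collected into one function. This is a minimal sketch: the function name and the `tail` labels are assumptions, the two-tailed rule is applied to the absolute value of the statistic, and critical values come from `statistics.NormalDist` rather than a table.

```python
from statistics import NormalDist

def reject_h0(z_cal, alpha, tail):
    """Apply the Z rejection rule; tail is 'two', 'right' or 'left'."""
    std_normal = NormalDist()
    if tail == "two":
        return abs(z_cal) > std_normal.inv_cdf(1 - alpha / 2)   # Z_{alpha/2}
    if tail == "right":
        return z_cal > std_normal.inv_cdf(1 - alpha)            # Z_alpha
    if tail == "left":
        return z_cal < -std_normal.inv_cdf(1 - alpha)           # -Z_alpha
    raise ValueError("tail must be 'two', 'right' or 'left'")

# Illustrative values of the computed statistic.
print(reject_h0(1.7, 0.05, "right"))  # compared with Z_0.05, about 1.645
print(reject_h0(1.7, 0.05, "two"))    # compared with Z_0.025, about 1.96
```

The same statistic can lead to different decisions in a one-tailed and a two-tailed test, because the two-tailed test splits $\alpha$ between both tails.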
Draw and state the conclusion
❖ If the numerical value of the test statistic falls in the rejection region, we
reject H0 and conclude in favor of H1.
❖ If the test statistic does not fall in the rejection region, we do not reject
H0
Another way to state the conclusion
❖ Reject H0 if P-value < $\alpha$ (for example, 0.05)
❖ Do not reject H0 if P-value ≥ $\alpha$
P-value is the probability of obtaining a test statistic at least as extreme as
the one that was actually observed, assuming the null hypothesis is true.
The larger the absolute value of the test statistic, the smaller the P-value.
The smaller the P-value, the stronger the evidence against H0.
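For a Z statistic, the P-value can be computed directly from the standard normal distribution. The sketch below is illustrative; the function name, the `tail` labels and the input values are assumptions.

```python
from statistics import NormalDist

def p_value(z_cal, tail):
    """P-value of a Z statistic; tail is 'left', 'right' or 'two'."""
    cdf = NormalDist().cdf
    if tail == "left":
        return cdf(z_cal)                    # area to the left of z_cal
    if tail == "right":
        return 1 - cdf(z_cal)                # area to the right of z_cal
    if tail == "two":
        return 2 * (1 - cdf(abs(z_cal)))     # both tails beyond |z_cal|
    raise ValueError("tail must be 'left', 'right' or 'two'")

# Illustrative statistics: a larger |Z| gives a smaller P-value.
for z in (1.0, 2.0, 3.0):
    print(z, round(p_value(z, "two"), 4))
```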
Example 1: The mean life time of a sample of 16 fluorescent light
bulbs produced by a company is computed to be 1570 hours. The
population standard deviation is 120 hours. Can we conclude that the
mean life time of light bulbs is less than 1600 hours?
Given: $n = 16$, $\bar{X} = 1570$, $\mu_0 = 1600$, $\sigma = 120$
To test the hypothesis:
Step 1: $H_0: \mu = 1600$ versus $H_A: \mu < 1600$
Step 2: Level of significance, $\alpha = 0.05$
Step 3: The Z statistic is appropriate because $\sigma$ is known.
Step 4: Critical region (left-tailed test): reject $H_0$ if $Z_{cal} < -Z_{0.05} = -1.645$
Step 5: Computations:
$Z_{cal} = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} = \frac{1570 - 1600}{120/\sqrt{16}} = -1$
Step 6: Decision: Since $Z_{cal} = -1 > -Z_{0.05} = -1.645$,
we do not reject H0.
Step 7: Conclusion
At the 5% level of significance, there is not sufficient evidence to conclude that
the mean life time of light bulbs is less than 1600 hours.
Example 2: It is known in a pharmacological experiment that rats fed with a
particular diet over a certain period gain an average of 40 g in weight. A
new diet was tried on a sample of 20 rats, yielding a mean weight gain of 43 g
with variance 7 g². Test the hypothesis that the new diet is an improvement,
assuming normality. (Use a 5% level of significance.)
Given: $n = 20$, $\bar{X} = 43$, $\mu_0 = 40$, $S^2 = 7$, $\alpha = 0.05$
1. $H_0: \mu = 40$ (The new diet is not an improvement)
versus
$H_A: \mu > 40$ (The new diet is an improvement)
2. $\alpha = 0.05$
3. The t-test statistic is used because n is small and $\sigma$ is unknown.
4. Critical region: $t_{\alpha}(n-1) = t_{0.05}(20-1) = t_{0.05}(19) = 1.729$
5. $t_{cal} = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} = \frac{43 - 40}{\sqrt{7}/\sqrt{20}} = 5.07$
6. Since $t_{cal} = 5.07 > t_{0.05}(19) = 1.729$, we reject $H_0$.
7. At the 5% level of significance, we conclude that the new diet is an improvement.
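The test statistic in Example 2 can be checked with a few lines of Python. This is a sketch: the function name is hypothetical, and the critical value 1.729 is the table value quoted in the example, since the standard library has no t distribution.

```python
from math import sqrt

def one_sample_t(x_bar, mu0, s_sq, n):
    """t statistic for H0: mu = mu0, given the sample variance s_sq."""
    return (x_bar - mu0) / sqrt(s_sq / n)

# Values from Example 2.
t_cal = one_sample_t(x_bar=43, mu0=40, s_sq=7, n=20)
t_crit = 1.729              # t_{0.05}(19), table value from the example
reject = t_cal > t_crit     # right-tailed rejection rule

print(f"t_cal = {t_cal:.2f}, reject H0: {reject}")
```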
Hypothesis Testing about the Population Proportion
o The value of the sample proportion $\hat{p}$ is compared with a hypothesized population
proportion $P_0$ so as to arrive at a conclusion about the hypothesis.
Thus the hypotheses to test the population proportion P are as follows:
i. $H_0: P = P_0$ versus $H_1: P \ne P_0$ (Two-tailed test)
ii. $H_0: P = P_0$ versus $H_1: P > P_0$ (Right-tailed test)
iii. $H_0: P = P_0$ versus $H_1: P < P_0$ (Left-tailed test)
To conduct hypothesis testing, it is assumed that the sampling distribution of a
proportion is approximately normal for a large sample.
Using $\hat{p}$ and its standard deviation under $H_0$, $\sigma_{\hat{p}}$, we compute a value for the Z-test statistic as follows:
$Z = \frac{\hat{p} - P_0}{\sigma_{\hat{p}}} = \frac{\hat{p} - P_0}{\sqrt{\frac{P_0(1 - P_0)}{n}}} \sim N(0, 1)$
Example 1: In a survey of injection drug users in a large city, a researcher found
that 18 out of 423 were HIV positive. We wish to know if we can conclude that fewer
than 5 percent of the injection drug users in the sampled population are HIV positive.
Use 5% level of significance.
Given: n = 423; x = drug users in the sample who are HIV positive = 18,
$P_0 = 0.05$, $\hat{p} = \frac{x}{n} = \frac{18}{423} = 0.0426$
1. $H_0: P = 0.05$ versus $H_A: P < 0.05$
2. $\alpha = 0.05$
3. Z-test statistic
4. Left-tailed test; the critical (table) value is $-Z_{\alpha} = -Z_{0.05} = -1.645$
5. $Z_{cal} = \frac{\hat{p} - P_0}{\sqrt{\frac{P_0(1 - P_0)}{n}}} = \frac{0.0426 - 0.05}{\sqrt{\frac{0.05(1 - 0.05)}{423}}} = \frac{-0.0074}{0.0106} = -0.70$
6. Since $Z_{cal} = -0.70 > -Z_{0.05} = -1.645$, we do not reject $H_0$.
7. At the 5% level of significance, there is not sufficient evidence to conclude that
fewer than 5 percent of the injection drug users in the population are HIV positive.
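The proportion test above can be reproduced as a short Python sketch. The function name is hypothetical; the critical value and the P-value are taken from `statistics.NormalDist` instead of a table.

```python
from math import sqrt
from statistics import NormalDist

def proportion_z(x, n, p0):
    """Z statistic for H0: P = p0, using the standard error under H0."""
    p_hat = x / n
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# Data from the example: 18 HIV-positive users out of 423, P0 = 0.05.
z_cal = proportion_z(x=18, n=423, p0=0.05)
z_crit = -NormalDist().inv_cdf(0.95)   # -Z_0.05 for a left-tailed test
reject = z_cal < z_crit
p_val = NormalDist().cdf(z_cal)        # left-tail P-value

print(f"Z_cal = {z_cal:.2f}, critical value = {z_crit:.3f}")
print(f"P-value = {p_val:.3f}, reject H0: {reject}")
```

The P-value is well above 0.05, which agrees with the decision reached from the critical value.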
Exercise
1. A random sample of 200 senior school students produces a mean weight of
58 kg with standard deviation 4 kg.
A. Construct a 95% CI for the population mean weight
B. Test the hypothesis that the mean weight of the population is greater
than 60 kg. (Use $\alpha = 0.01$ level of significance)
2. The life expectancy of people in the year 1999 in a country is expected to
be 50 years. A survey was conducted in eleven regions of the country and
the data obtained, in years, are given below: Life expectancy (years):
54.2, 50.4, 44.2, 49.7, 55.4, 47.0, 58.2, 56.6, 61.9, 57.5, and 53.4.
A. Estimate 95% CI
B. Do the data confirm the expected view? (Use 10% level of
significance)
3. The mean life time of a sample of 16 fluorescent light bulbs produced by a
company is computed to be 1570 hours. The population standard
deviation is 120 hours.
A. Estimate 95% CI
B. Can we conclude that the life time of light bulbs is not equal to 1600?
(Use 10% level of significance)
4. A survey was conducted to study the dental health practices & attitudes of a
certain urban adult population. Of 300 adults interviewed, 123 said that they
regularly had a dental check up twice a year.
a) Calculate a point estimate for the population proportion
b) Construct a 95% CI for the population proportion
c) At the 0.01 level of significance, can we conclude that the population
proportion is 0.5? (Use, 𝑍0.01 = 2.33)