Statistical Analysis and Design
Dr. Ryan Jeffrey P. Curbano
MODULE 1 - ONE SAMPLE TEST
What is Inferential Statistics
• Inferential statistics allows you to make predictions (“inferences”) from
data. With inferential statistics, you take data from samples and
make generalizations about a population.
• Inferential statistics use a random sample of data taken from a
population to describe and make inferences about the population.
• Inferential statistics are valuable when examination of each member of
an entire population is not convenient or possible.
• For example, measuring the diameter of every nail manufactured in a
mill is impractical. Instead, you can measure the diameters of a
representative random sample of nails and use the information from
the sample to make generalizations about the diameters of all of the
nails.
There are two main areas of inferential
statistics:
1. Estimating parameters. This means taking a statistic from your
sample data (for example, the sample mean) and using it to say
something about a population parameter (e.g., the population
mean).
2. Hypothesis tests. This is where you use sample data to answer
research questions. For example, you might be interested in
knowing whether a new cancer drug is effective, or whether breakfast
helps children perform better in school.
What is Hypothesis Testing
• A hypothesis test is a rule that specifies whether to accept or reject a
claim about a population depending on the evidence provided by a
sample of data.
• A hypothesis test examines two opposing hypotheses about a
population: the null hypothesis and the alternative hypothesis. The
null hypothesis is the statement being tested. Usually the null
hypothesis is a statement of "no effect" or "no difference". The
alternative hypothesis is the statement you want to be able to
conclude is true based on evidence provided by the sample data.
Methods Used in Hypothesis Testing
1. Traditional Method
• Widely used, especially in statistics books
2. P-value or Probability Value Method
• The P-value is the probability of getting a sample statistic, or a more
extreme sample statistic, in the direction of H1 when H0 is true.
• It is the actual area under the standard normal distribution curve
representing the probability of a particular sample statistic, or a
more extreme sample statistic, occurring if H0 is true.
3. Confidence Interval
• A range of values, derived from sample statistics, that is likely
to contain the value of an unknown population parameter
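The tail-area definition of the P-value can be sketched in Python. This is a minimal illustration that assumes the test value is a z statistic; the function name `p_value_from_z` and the `alternative` labels are made up for this sketch and do not come from the slides.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def p_value_from_z(z, alternative="two-sided"):
    """Return the P-value for a computed z test value.

    alternative: "two-sided" (H1: not equal), "greater" (H1: >), "less" (H1: <).
    """
    if alternative == "two-sided":
        # Area in both tails beyond |z|
        return 2 * (1 - STD_NORMAL.cdf(abs(z)))
    if alternative == "greater":
        # Area in the right tail
        return 1 - STD_NORMAL.cdf(z)
    if alternative == "less":
        # Area in the left tail
        return STD_NORMAL.cdf(z)
    raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")


print(round(p_value_from_z(1.96), 3))              # two-sided test
print(round(p_value_from_z(1.645, "greater"), 3))  # right-tailed test
```

The direction of H1 decides which tail area is reported, which is why the same test value can give different P-values.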
Basic steps to correctly set up and perform
a hypothesis test.
• For example, the manager of a pipe manufacturing facility must ensure that
the diameters of its pipes equal 5 cm. The manager follows the basic steps
for doing a hypothesis test.
1. Specify the hypotheses.
• First, the manager formulates the hypotheses. The null hypothesis is: The population mean of all
the pipes is equal to 5 cm. Formally, this is written as: H0: μ = 5
• Because they need to ensure that the pipes are not larger or smaller than 5 cm, the manager
chooses the two-sided alternative hypothesis, which states that the population mean of all the
pipes is not equal to 5 cm. Formally, this is written as H1: μ ≠ 5
2. Choose a significance level (also called alpha or α).
• The manager selects a significance level 0.05, which is the most commonly
used significance level
3. Collect the data
• They collect a sample of pipes and measure their diameters.
4. Compare the p-value from the test to the significance level.
• After they perform the hypothesis test, the manager obtains a p-value of
0.004. The p-value is less than the significance level of 0.05.
5. Decide whether to reject or fail to reject the null hypothesis.
• The manager rejects the null hypothesis and concludes that the mean
diameter of all pipes is not equal to 5 cm.
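The last two steps reduce to a single comparison. The sketch below applies it to the manager's numbers (p-value 0.004, α = 0.05); the helper name `decide` is hypothetical.

```python
def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 when the p-value is at most alpha."""
    if not 0 <= p_value <= 1 or not 0 < alpha < 1:
        raise ValueError("p_value must be in [0, 1] and alpha in (0, 1)")
    return "reject H0" if p_value <= alpha else "fail to reject H0"


# Values from the pipe example: p-value 0.004, significance level 0.05
print(decide(0.004, alpha=0.05))
```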
About the null and alternative hypotheses
• The null and alternative hypotheses are two mutually exclusive statements
about a population. A hypothesis test uses sample data to determine
whether to reject the null hypothesis.
1. Null hypothesis (H0)
• The null hypothesis states that a population parameter (such as the mean,
the standard deviation, and so on) is equal to a hypothesized value. The
null hypothesis is often an initial claim that is based on previous analyses
or specialized knowledge.
2. Alternative hypothesis (H1)
• The alternative hypothesis states that a population parameter is smaller than,
greater than, or different from the hypothesized value in the null hypothesis. The
alternative hypothesis is what you might believe to be true or hope to prove
true.
One-sided and two-sided hypotheses
• The alternative hypothesis can be either one-sided or two-sided.
1. Two-sided
• Use a two-sided alternative hypothesis (also known as a
nondirectional hypothesis) to determine whether the population
parameter is either greater than or less than the hypothesized
value. A two-sided test can detect when the population parameter
differs in either direction, but it has less power than a one-sided test.
2. One-sided
• Use a one-sided alternative hypothesis (also known as a directional
hypothesis) to determine whether the population parameter differs
from the hypothesized value in a specific direction. You can specify
the direction to be either greater than or less than the hypothesized
value. A one-sided test has greater power than a two-sided test, but it
cannot detect whether the population parameter differs in the
opposite direction.
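The power trade-off between one-sided and two-sided tests can be checked numerically. The sketch below assumes a one-sample z-test with known σ; the function name `power_z_test` and the scenario values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def power_z_test(mu0, mu_true, sigma, n, alpha=0.05, alternative="two-sided"):
    """Power of a one-sample z-test when the true mean is mu_true."""
    # Shift of the test statistic's mean when the true mean is mu_true
    delta = (mu_true - mu0) / (sigma / sqrt(n))
    if alternative == "greater":
        z_crit = Z.inv_cdf(1 - alpha)
        return 1 - Z.cdf(z_crit - delta)
    if alternative == "less":
        z_crit = Z.inv_cdf(1 - alpha)
        return Z.cdf(-z_crit - delta)
    # Two-sided: reject in either tail, with alpha split between the tails
    z_crit = Z.inv_cdf(1 - alpha / 2)
    return (1 - Z.cdf(z_crit - delta)) + Z.cdf(-z_crit - delta)


# Hypothetical scenario: H0 mean 5.00, true mean 5.02, sigma 0.05, n = 25
args = dict(mu0=5.00, mu_true=5.02, sigma=0.05, n=25)
two_sided = power_z_test(**args)
right_tailed = power_z_test(**args, alternative="greater")
left_tailed = power_z_test(**args, alternative="less")
print(f"two-sided: {two_sided:.3f}")
print(f"one-sided (greater): {right_tailed:.3f}")
print(f"one-sided (less): {left_tailed:.5f}")
```

In this scenario the true mean lies above the hypothesized value, so the "greater" test has the most power, the two-sided test has somewhat less, and the "less" test has almost none.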
Critical Values
• A critical value is the value that separates the critical region from the noncritical region.
• Critical or rejection region – the range of values of the test value that indicates that there is a
significant difference and that the null hypothesis should be rejected.
• Noncritical region – the range of values of the test value that indicates that the
difference was probably due to chance and that the null hypothesis should
not be rejected.
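For a z-test, the critical values come from the inverse of the standard normal distribution. A minimal sketch follows; the function name `z_critical` and the `alternative` labels are assumptions made for illustration.

```python
from statistics import NormalDist


def z_critical(alpha=0.05, alternative="two-sided"):
    """Return the z critical value(s) that bound the rejection region."""
    z = NormalDist()
    if alternative == "two-sided":
        upper = z.inv_cdf(1 - alpha / 2)
        return (-upper, upper)       # reject when z < -upper or z > upper
    if alternative == "greater":
        return z.inv_cdf(1 - alpha)  # reject when z > this value
    if alternative == "less":
        return z.inv_cdf(alpha)      # reject when z < this value
    raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")


low, high = z_critical(0.05)
print(f"two-sided, alpha = 0.05: {low:.3f} and {high:.3f}")
print(f"right-tailed, alpha = 0.05: {z_critical(0.05, 'greater'):.3f}")
```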
What are type I and type II errors?
No hypothesis test is 100% certain. Because the test is based on
probabilities, there is always a chance of making an incorrect
conclusion.
A. Type I error
• When the null hypothesis is true and you reject it, you make a type I
error. The probability of making a type I error is α, which is the level
of significance you set for your hypothesis test. An α of 0.05
indicates that you are willing to accept a 5% chance of rejecting the
null hypothesis when it is actually true. To lower this risk, you
must use a lower value for α. However, using a lower value for α
means that you will be less likely to detect a true difference if one
really exists.
B. Type II error
• When the null hypothesis is false and you fail to reject it, you make a
type II error. The probability of making a type II error is β, and the
power of the test is 1 − β. You can decrease your risk of
committing a type II error by ensuring your test has enough power. You
can do this by ensuring your sample size is large enough to detect a
practical difference when one truly exists.
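Both error rates can be estimated by simulation. The sketch below assumes a two-sided z-test on normally distributed data with known σ; the function name `rejection_rate`, the means, σ, sample sizes, and seed are all hypothetical choices.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def rejection_rate(true_mean, mu0=5.0, sigma=0.1, n=25, alpha=0.05,
                   trials=4000, seed=1):
    """Fraction of simulated samples in which a two-sided z-test rejects H0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        z = (fmean(sample) - mu0) / (sigma / sqrt(n))
        if abs(z) >= z_crit:
            rejections += 1
    return rejections / trials


# H0 is true (true mean equals mu0): every rejection is a type I error
type_1_rate = rejection_rate(true_mean=5.0)
# H0 is false (true mean 5.05): every failure to reject is a type II error
type_2_rate = 1 - rejection_rate(true_mean=5.05)
# Same false H0 with a larger sample: the type II error rate drops
type_2_rate_large_n = 1 - rejection_rate(true_mean=5.05, n=100)

print(f"type I rate: {type_1_rate:.3f}")
print(f"type II rate, n = 25: {type_2_rate:.3f}")
print(f"type II rate, n = 100: {type_2_rate_large_n:.3f}")
```

The estimated type I rate should land near α, and the type II rate should shrink as the sample size grows.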
Level of Significance
• The level of significance is the threshold used to decide whether to reject or
fail to reject the null hypothesis. In hypothesis testing, 100% accuracy is not
possible when making that decision.
• It is the maximum probability of committing a Type I error, denoted by α (the
Greek letter alpha).
Steps of the P-value Method
▪ State the null hypothesis (H0) and the alternative hypothesis (H1).
▪ Choose the level of significance, α, and the sample size.
▪ Determine the test statistic and sampling distribution.
▪ Compute the test value.
▪ Determine the P-value.
▪ Make a statistical decision.
▪ State the conclusion.
DECISION RULE
If P-value ≤ α, reject H0; if P-value > α, do not reject H0.
Confidence Interval and Hypothesis Testing
❑ A confidence interval is a range of values, constructed so that there is a
specified probability that the value of a parameter lies within it.
❑ Decision Rule (confidence interval)
When the confidence interval contains the hypothesized mean,
do not reject H0.
When the confidence interval does not contain the hypothesized
mean, reject H0.
❑ Decision Rule (critical value, two-sided test)
If |Z computed| < Z critical, do not reject H0.
If |Z computed| ≥ Z critical, reject H0.
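For a two-sided z-test, the two decision rules always agree. The sketch below checks this on hypothetical summary data; the function name `z_test_and_interval` and all numeric inputs are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def z_test_and_interval(xbar, mu0, sigma, n, alpha=0.05):
    """Return (z, z_critical, confidence interval) for a two-sided z-test."""
    se = sigma / sqrt(n)                        # standard error of the mean
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    z = (xbar - mu0) / se
    interval = (xbar - z_crit * se, xbar + z_crit * se)
    return z, z_crit, interval


# Hypothetical sample means tested against a hypothesized mean of 5.0
for xbar in (5.01, 5.03, 5.05):
    z, z_crit, (low, high) = z_test_and_interval(xbar, mu0=5.0, sigma=0.1, n=25)
    reject_by_z = abs(z) >= z_crit
    reject_by_ci = not (low <= 5.0 <= high)
    print(f"xbar = {xbar}: reject by z = {reject_by_z}, reject by CI = {reject_by_ci}")
```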
Test on Single Mean, Variance Known or
Population Standard Deviation Known
• It is a statistical test for the population mean. It is used when n ≥ 30 or
the population is normally distributed, and the population standard
deviation is known.
• Formula
$Z = \dfrac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
Where:
X̄ = sample mean
μ = population mean
σ = population standard deviation
s = sample standard deviation
n = number of observations in a
sample
Confidence Interval Formula
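In the usual textbook notation, the two-sided confidence interval for a population mean when σ is known can be written as below. The symbol $z_{\alpha/2}$ (the standard normal critical value, about 1.96 for 95% confidence) is introduced here as an assumption; the remaining symbols are the ones defined for the test formula.

```latex
\bar{X} - z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \;\le\; \mu \;\le\; \bar{X} + z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}
```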
Hypotheses for 1-Sample Z
Assumptions
• The population standard deviation is known
• The data must be continuous
• The sample data should not be severely skewed, and the sample size
should be greater than 30
• The sample data should be selected randomly
• Each observation should be independent from all other observations
• Determine an appropriate sample size
Example 1.0
• A scientist for a company that manufactures processed food wants to
assess the percentage of fat in the company's bottled sauce. The
advertised percentage is 15%. The scientist measures the percentage
of fat in 20 random samples. Previous measurements found that the
population standard deviation is 2.6%.
1. State Hypothesis
H0: µ = 15%
H1: µ ≠ 15%
• Choose Stat > Basic Statistics > 1-Sample Z.
• From the drop-down list, select One or more
samples, each in a column and enter Percent
Fat.
• In Known standard deviation, enter 2.6.
• Select Perform hypothesis test.
• In Hypothesized mean, enter 15.
• Click OK
Result
[Figure: software output with callouts for the sample mean, the confidence interval, the population mean, and the computed value]
P-value Interpretation:
The null hypothesis states that the mean
of the percentage of fat equals 15%.
Because the p-value is 0.012, which is less
than the significance level of 0.05, the
scientist rejects the null hypothesis. The
results indicate that the mean percentage of
fat differs from 15%.
Confidence Interval Interpretation:
In these results, the estimate of the
population mean for fat percentage is
16.46%. You can be 95% confident that
the population mean is between 15.321%
and 17.599%.
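The reported p-value and confidence interval can be reproduced from the summary values in the example. This is a sketch using Python's standard library rather than the statistical package, and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Summary values taken from the example
sample_mean = 16.46   # estimate reported in the results (%)
mu0 = 15.0            # advertised (hypothesized) mean (%)
sigma = 2.6           # known population standard deviation (%)
n = 20                # number of random samples
alpha = 0.05

std_normal = NormalDist()
se = sigma / sqrt(n)                         # standard error of the mean
z = (sample_mean - mu0) / se                 # computed test value
p_value = 2 * (1 - std_normal.cdf(abs(z)))   # two-sided p-value
z_crit = std_normal.inv_cdf(1 - alpha / 2)
ci = (sample_mean - z_crit * se, sample_mean + z_crit * se)

print(f"z = {z:.2f}, p-value = {p_value:.3f}")
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```

Because the p-value is below 0.05 and the interval excludes 15, both approaches lead to rejecting H0.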
T-Test for Single Mean (Variance Unknown), n < 30
❑ It is a statistical procedure used to test the difference between the
sample mean and a known (hypothesized) value of the population mean. The
sample size should be less than 30.
❑ Assumptions in One-Sample t-test
◦ The population must be approximately normally distributed
◦ Samples drawn from the population should be random
◦ Cases of the samples should be independent
◦ Sample size should be less than 30
◦ The hypothesized population mean should be known.
Formula
$t = \dfrac{\bar{X} - \mu}{s / \sqrt{n}}$, with df = n − 1
Where:
X̄ = sample mean
μ = population mean
s = sample standard deviation
n = number of observations in the sample
df = degrees of freedom
Confidence Interval
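A minimal Python sketch of the one-sample t computation and its confidence interval, mean ± t critical × s/√n. The data values are hypothetical, and the critical value is a standard t-table entry supplied by the caller, because the Python standard library has no t distribution; the function name `one_sample_t` is made up for this sketch.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(data, mu0, t_crit):
    """Return (t, df, confidence interval) for a one-sample t-test.

    t_crit is the two-sided critical value for df = n - 1, read from a
    t table by the caller.
    """
    n = len(data)
    xbar = mean(data)
    s = stdev(data)              # sample standard deviation (n - 1 divisor)
    se = s / sqrt(n)             # estimated standard error of the mean
    t = (xbar - mu0) / se
    interval = (xbar - t_crit * se, xbar + t_crit * se)
    return t, n - 1, interval


# Hypothetical monthly costs (not the data behind any example in this module)
costs = [210, 250, 190, 230, 220]
# Table value t(0.025, df = 4) = 2.776 for a two-sided test at alpha = 0.05
t, df, (low, high) = one_sample_t(costs, mu0=200, t_crit=2.776)
print(f"t = {t:.2f}, df = {df}, 95% CI = ({low:.2f}, {high:.2f})")
```

Here |t| is smaller than the critical value and the interval contains 200, so this hypothetical sample would not lead to rejecting H0.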
Hypotheses for 1-Sample t
Example 2.0
• An economist wants to determine whether the monthly energy cost
for families has changed from the previous year, when the mean cost
per month was $200. The economist randomly samples 25 families
and records their energy costs for the current year. Use α = 0.05.
• The economist performs a 1-sample t-test to determine whether the
monthly energy cost differs from $200.
1. State Hypothesis
H0: µ = $200
H1: µ ≠ $200
Procedure:
• Choose Stat > Basic Statistics > 1-Sample t.
• From the drop-down list, select One or more
samples, each in a column and enter Energy
Cost.
• Select Perform hypothesis test.
• In Hypothesized mean, enter 200.
• Click OK.
RESULTS
P-value Interpretation
The null hypothesis states that the mean of
the energy costs is $200. Because the p-value
is 0.000, which is less than the significance
level of 0.05, the economist rejects the null
hypothesis and concludes that the average
monthly energy cost for families differs from
$200.
Confidence Interval Interpretation
In these results, the estimate of the
population mean for energy cost is 330.6.
You can be 95% confident that the
population mean is between 266.9 and
394.2.
Test on Single Proportion
• Many hypothesis testing situations involve proportions (a proportion
is the same as a percentage of the population). This test can be
treated as a binomial experiment when there are only two
outcomes and the probability of success does not change from trial to
trial (the outcomes of the trials are independent).
• Assumptions:
• Subjects are randomly selected
• Population distribution is normal
• Observations are dichotomous
Hypotheses for Single Proportion
Formula
Confidence Interval Formula
Example
• A marketing analyst wants to determine whether mailed
advertisements for a new product result in a response rate different
from the national average. A random sample of 1000 households is
chosen to receive advertisements. Of the 1000 households sampled,
87 make a purchase after receiving the advertisement.
• The analyst performs a 1 proportion test to determine whether the
proportion of households that made a purchase is different from the
national average of 6.5%.
• Open the 1 Proportion dialog box.
• Mac: Statistics > 1-Sample Inference > Proportion
• PC: STATISTICS > One Sample > Proportion
• From the drop-down list, select Summarized
data.
• In number of events, enter 87.
• In number of trials, enter 1000.
• Select Perform hypothesis test.
• In Hypothesized proportion, enter 0.065.
• Click OK.
Results
Interpretation of Results based on CI
In these results, the estimate of the population proportion for
households that made a purchase is 0.087. You can be 95% confident that
the population proportion is between approximately 0.07 and 0.106.
Interpretation of Results based on P-value
The null hypothesis states that the proportion of
households that make a purchase equals 0.065.
Because the p-value is 0.0085, which is less than the
significance level of 0.05, the analyst rejects the null
hypothesis. The results indicate that the proportion of
households that make a purchase is different from the
national average of 6.5%.
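A hand calculation with the normal approximation, z = (p̂ − p0) / √(p0(1 − p0)/n), leads to the same decision. The sketch below uses the counts from the example. Its p-value and interval need not match the software figures reported above, which may come from a different method such as an exact binomial test (an assumption, since the slides do not say).

```python
from math import sqrt
from statistics import NormalDist

events, trials = 87, 1000   # purchases and households from the example
p0 = 0.065                  # hypothesized proportion (national average)

p_hat = events / trials                     # sample proportion
se_null = sqrt(p0 * (1 - p0) / trials)      # standard error under H0
z = (p_hat - p0) / se_null
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Normal-approximation 95% interval, based on the sample proportion
se_hat = sqrt(p_hat * (1 - p_hat) / trials)
z_crit = NormalDist().inv_cdf(0.975)
ci = (p_hat - z_crit * se_hat, p_hat + z_crit * se_hat)

print(f"z = {z:.3f}, approximate p-value = {p_value:.4f}")
print(f"approximate 95% CI: ({ci[0]:.4f}, {ci[1]:.4f})")
```

Under either method the p-value is well below 0.05, so the conclusion that the response rate differs from 6.5% is unchanged.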
Test on Single Variance
• Use the single variance test to estimate either the variance or the standard
deviation of a population and to compare that value to a target value
or a reference value.
• Assumptions
• The sample data should be selected randomly
• The sample data should not be severely skewed, and the sample size should
be greater than 40
• Each observation should be independent from all other observations
• Determine an appropriate sample size
Formula
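The test value commonly used for a single variance is the chi-square statistic χ² = (n − 1)s² / σ0², with n − 1 degrees of freedom, which assumes normally distributed data. The sketch below computes it for hypothetical summary values and compares it with chi-square table values supplied by the caller; the function names and numbers are illustrative, and σ0 denotes the hypothesized standard deviation.

```python
def chi_square_statistic(n, s, sigma0):
    """Chi-square test value for H0: sigma = sigma0, with df = n - 1."""
    return (n - 1) * s ** 2 / sigma0 ** 2


def two_sided_decision(statistic, lower_crit, upper_crit):
    """Reject H0 when the statistic falls outside the two critical values."""
    if statistic < lower_crit or statistic > upper_crit:
        return "reject H0"
    return "fail to reject H0"


# Hypothetical summary: n = 25 observations, s = 1.3, hypothesized sigma = 1
stat = chi_square_statistic(n=25, s=1.3, sigma0=1.0)
# Chi-square table values for df = 24 at alpha = 0.05 (two-sided)
decision = two_sided_decision(stat, lower_crit=12.401, upper_crit=39.364)
print(f"chi-square = {stat:.2f} -> {decision}")
```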
Example
• The manager of a lumber yard wants to assess the performance of a
saw mill that cuts beams that are supposed to be 100 cm long. The
manager takes a sample of 50 beams from the saw mill and measures
their lengths.
• The manager performs a 1 variance test to determine whether the
standard deviation of the saw mill is different from 1.
• Choose Stat > Basic Statistics > 1 Variance.
• From the drop-down list, select One or more
samples, each in a column and enter Length.
• Select Perform hypothesis test and enter 1 in Value.
• Click OK.
Results
Interpretation of Results using P-value approach
Because a previous analysis showed that the data does not
appear to come from a normal distribution, the manager uses
the confidence interval for the Bonett method. Because the p-
value is greater than 0.05, the manager cannot conclude that
the population standard deviation is different from 1.
Interpretation of Results using CI
In these results, the estimate of the population standard
deviation for the length of beams is 0.871, and the estimate
of the population variance is 0.759. Because the data did not
pass a normality test, use the Bonett method. You can be 95%
confident that the population standard deviation is between
0.704 and 1.121.
TEST OF NORMALITY
Overview for Normality Test
• Use Normality Test to determine whether data do not follow a normal
distribution
• For example, a food scientist for a company that manufactures
processed food wants to assess the percentage of fat in the
company's bottled sauce to ensure the percentage is not different
from the advertised value of 15%. The scientist wants to determine
whether the data does not follow a normal distribution before
performing other analyses with this data.
• Where to find this analysis
• To perform a normality test, choose Stat > Basic Statistics > Normality Test.
Hypotheses for Normality Test
• For a normality test, the hypotheses are as follows.
H0: Data follow a normal distribution.
H1: Data do not follow a normal distribution.
Assumptions
• The data must be numeric
• The sample data should be selected randomly
• The sample size should be greater than 20
Note:
If p-value > 0.05 – Do not reject H0; the data can be treated as normally distributed
If p-value ≤ 0.05 – Reject H0; the data are not normally distributed
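One common normality test is the Anderson-Darling test. The sketch below computes its statistic from scratch, with the mean and standard deviation estimated from the data. It compares the small-sample-adjusted statistic against 0.752, a commonly tabulated 5% critical value (an assumption of this sketch), rather than computing a p-value; the two data sets are hypothetical and built from quantiles so the output is repeatable.

```python
from math import exp, log
from statistics import NormalDist, mean, stdev


def anderson_darling_normal(data):
    """Return the adjusted Anderson-Darling statistic for normality.

    The mean and standard deviation are estimated from the data, and the
    small-sample adjustment A2 * (1 + 0.75/n + 2.25/n**2) is applied.
    """
    x = sorted(data)
    n = len(x)
    fitted = NormalDist(mean(x), stdev(x))
    cdf = [fitted.cdf(v) for v in x]
    total = sum(
        (2 * i + 1) * (log(cdf[i]) + log(1 - cdf[n - 1 - i]))
        for i in range(n)
    )
    a2 = -n - total / n
    return a2 * (1 + 0.75 / n + 2.25 / n ** 2)


# Hypothetical data sets built from quantiles
std = NormalDist()
bell_shaped = [std.inv_cdf((i + 0.5) / 30) for i in range(30)]
right_skewed = [exp(v) for v in bell_shaped]

CRITICAL_5_PERCENT = 0.752  # assumed tabulated value for the adjusted statistic
for name, values in (("bell-shaped", bell_shaped), ("right-skewed", right_skewed)):
    stat = anderson_darling_normal(values)
    verdict = "reject H0" if stat > CRITICAL_5_PERCENT else "fail to reject H0"
    print(f"{name}: adjusted A^2 = {stat:.3f} -> {verdict}")
```

A statistic above the critical value corresponds to a p-value below 0.05, so both forms of the rule give the same decision.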
Example
• The manager of a lumber yard wants to assess the performance of a
saw mill that cuts beams that are supposed to be 100 cm long. The
manager takes a sample of 50 beams from the saw mill and measures
their lengths. The manager wants to determine whether the saw mill
has different performance.
• The manager wants to verify the assumption of normality before
performing a hypothesis test.
Using Anderson-Darling
Interpretation:
The data points are not relatively close to
the fitted distribution line. The p-value is
less than or equal to the significance level of
0.05. Therefore, the manager rejects the
null hypothesis and concludes that the data
do not follow a normal distribution.
Activity for One Sample Test
Activity 1.0
• The Edison Electric Institute has published figures on the number of
kilowatt hours used annually by various home appliances. It is
claimed that a vacuum cleaner uses an average of 46 kilowatt hours
per year. If a random sample of 12 homes included in a planned study
indicates that vacuum cleaners use an average of 42 kilowatt hours
per year with a standard deviation of 11.9 kilowatt hours, does this
suggest at the 0.05 level of significance that vacuum cleaners use, on
average, less than 46 kilowatt hours annually? Assume the population
of kilowatt hours to be normal.
Activity 2.0
• It is claimed that automobiles are driven on average more than
20,000 kilometers per year. To test this claim, 100 randomly selected
automobile owners are asked to keep a record of the kilometers they
travel. Would you agree with this claim if the random sample showed
an average of 23,500 kilometers and a standard deviation of 3900
kilometers?
Activity 3.0
• According to a dietary study, high sodium intake may be related to
ulcers, stomach cancer, and migraine headaches. The human
requirement for salt is only 220 milligrams per day, which is surpassed
in most single servings of ready-to-eat cereals. If a random sample of
20 similar servings of a certain cereal has a mean sodium content of
244 milligrams and a standard deviation of 24.5 milligrams, does this
suggest at the 0.05 level of significance that the average sodium
content for a single serving of such cereal is greater than 220
milligrams? Assume the distribution of sodium contents to be normal.
Activity 4.0
• Test the hypothesis that the average content of containers of a
particular lubricant is 10 liters if the contents of a random sample of
10 containers are 10.2, 9.7, 10.1, 10.3, 10.1, 9.8, 9.9, 10.4, 10.3, and
9.8 liters. Use a 0.01 level of significance and assume that the
distribution of contents is normal.
Activity 5.0
• A commonly prescribed drug for relieving nervous tension is believed
to be only 60% effective. Experimental results with a new drug
administered to a random sample of 100 adults who were suffering
from nervous tension show that 70 received relief. Is this sufficient
evidence to conclude that the new drug is superior to the one
commonly prescribed? Use a 0.05 level of significance.
Activity 6.0
• A new radar device is being considered for a certain missile defense
system. The system is checked by experimenting with aircraft in which
a kill or a no kill is simulated. If, in 300 trials, 250 kills occur, accept or
reject, at the 0.04 level of significance, the claim that the probability
of a kill with the new system does not exceed the 0.8 probability of
the existing device.
Activity 7.0
• Past experience indicates that the time required for high school
seniors to complete a standardized test is a normal random variable
with a standard deviation of 6 minutes. Test the hypothesis that σ = 6
against the alternative that σ ≠ 6 if a random sample of the test times
of 20 high school seniors has a standard deviation s = 4.51. Use a 0.05
level of significance.
Activity 8.0
• Past data indicate that the amount of money contributed by the
working residents of a large city to a volunteer rescue squad is a
normal random variable with a standard deviation of $1.40. It has
been suggested that the contributions to the rescue squad from just
the employees of the sanitation department are much more variable.
If the contributions of a random sample of 12 employees from the
sanitation department have a standard deviation of $1.75, can we
conclude at the 0.01 level of significance that the standard deviation
of the contributions of all sanitation workers is greater than that of all
workers living in the city?
Activity 9.0
• A manufacturer of sports equipment has developed a new synthetic
fishing line that the company claims has a mean breaking strength of
8 kilograms with a standard deviation of 0.5 kilogram. Test the
hypothesis that μ = 8 kilograms against the alternative that μ ≠ 8
kilograms if a random sample of 50 lines is tested and found to have a
mean breaking strength of 7.8 kilograms. Use a 0.01 level of
significance.
Activity 10
• According to the journal Chemical Engineering, an important property
of a fiber is its water absorbency. A random sample of 20 pieces of
cotton fiber was taken and the absorbency on each piece was
measured. The following are the absorbency values. Test the
normality of the data below.