STAT 1520 Notes
STAT1520
ECONOMIC AND BUSINESS
STATISTICS
- Premium Edition -
Exclusively Designed by
FIRST-CLASS HONOURS
- High Distinction -
A
PART A
Topic Summary
& Exam Tips
INTRODUCTION
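1. Mean: sample $\bar{X}$, population $\mu$
2. Variance: sample $s^2$, population $\sigma^2$
3. Standard Deviation: sample $s$, population $\sigma$
4. Proportion: sample $p$, population $\pi$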
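Mean: The average value in a dataset. $\text{Mean} = \frac{\sum_{i=1}^{N} X_i}{N}$
Median: The middle value in a dataset, located at the $\frac{n+1}{2}$th position
First quartile: 25th percentile of the sorted sample data, located at the $\frac{n+1}{4}$th position
Third quartile: 75th percentile of the sorted sample data, located at the $3 \times \frac{n+1}{4}$th position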
BASIC PROBABILITY
$P(B|A) = \frac{P(A \text{ and } B)}{P(A)}$, where $P(A)$ is the Marginal Probability of A
$P(A|B) = \frac{P(A \text{ and } B)}{P(B)}$, where $P(B)$ is the Marginal Probability of B
Type I
The scatter diagram graphs pairs of numerical data, with one variable on each axis, to look for a
relationship between them.
Type II
**Multiple box plot: represents the ranges of values of multiple variables. A box plot is a standardised
way of displaying the distribution of data based on: Lower Fence, 1st Quartile, Median, 3rd
Quartile, Upper Fence.
Type III
Cross-tabulation*
Categorical Variable Categorical Variable
Multiple Bar Chart**
*Cross-Tabulations: Cross tabulation is a statistical tool that is used to analyze categorical data.
Your eye color can be divided into 'categories' (i.e., blue, brown, green), and it is impossible for
eye color to belong to more than one category (i.e., color).
**Multiple Bar Chart: A multi-bar chart is a bar chart in which multiple data sets are represented
by drawing the bars side by side in a cluster (i.e., on- off- campus).
Exam Tips 1: If you are given an exact probability and you want to find the probability of the event
happening a certain number of times out of x (i.e. 10 times out of 100, or 99 times out of 1000),
use the Binomial Distribution.
Exam Tips 2: If your question has an average of an event happening per unit (i.e. per unit of time,
cycle, event) and you want to find the probability of a certain number of events happening in a
period of time, then use the Poisson Distribution.
PROPORTION ($p$)
Example: Calculate 90% confidence interval.
Find the value of the Z score corresponding to the (given) confidence level.
Z-Score = −1.645, +1.645

MEAN ($\mu$, $\sigma$)
Example: Find 95% interval estimate for the number of absent days.
Find the value of the Z score corresponding to the (given) confidence level.
Z-Score = −1.96, +1.96

MEAN ($\bar{X}$, $s$, $n$)
Example: Calculate 99% confidence interval of all houses.
To find t-critical, we need the Degrees of Freedom and $\alpha$ = 1% (for 99% confidence). Use t-table:
t-critical = 2.6387
PROPORTION MEAN
A hypothesis test is a statistical test that is used to determine whether there is enough statistical
evidence in a sample of data to infer that a certain condition is true for an entire population. A
hypothesis test examines two opposing hypotheses about a population: the Null Hypothesis $H_0$ and the
Alternative Hypothesis $H_1$.
STEP 1: State Null Hypothesis and Alternative Hypothesis
Proportion                                      Mean
$H_0: \pi = a$                                  $H_0: \mu = a$
Population proportion equals a                  Population mean equals a
$H_1: \pi \ne a$                                $H_1: \mu \ne a$
Population proportion differs from a            Population mean differs from a
$H_0: \pi \le a$                                $H_0: \mu \le a$
Population proportion less than or equal to a   Population mean less than or equal to a
$H_1: \pi > a$                                  $H_1: \mu > a$
Population proportion higher than a             Population mean higher than a
$H_0: \pi \ge a$                                $H_0: \mu \ge a$
Population proportion higher than or equal to a Population mean higher than or equal to a
$H_1: \pi < a$                                  $H_1: \mu < a$
Population proportion less than a               Population mean less than a
STEP 2: Determine tail test (DEPENDS ON THE SIGN OF THE ALTERNATIVE HYPOTHESIS)
STEP 3: Determine value of Z and t critical value
LOWER TAIL TEST (area $\alpha$ in the lower tail)
Z-SCORE: for the given $\alpha$, read the critical value of Z from the Z-table.
T-SCORE: for the given $\alpha$, determine DF and read the matching column of the t-table.

UPPER TAIL TEST (area $\alpha$ in the upper tail)
Z-SCORE: for the given $\alpha$, read the critical value of Z from the Z-table.
T-SCORE: for the given $\alpha$, determine DF and read the matching column of the t-table.
ONLY 1 VALUE OF Z-SCORE AND POSITIVE; ONLY 1 VALUE OF T-SCORE AND POSITIVE

TWO TAILS TEST (area $\alpha/2$ in each tail)
Z-SCORE: for the given $\alpha$, read the two critical values of Z [−, +] from the Z-table.
T-SCORE: for the given $\alpha$, determine DF and read the matching column of the t-table.
TWO VALUES OF Z-SCORE (−, +); TWO VALUES OF T-SCORE (−, +)
STEP 4: DECISION RULES
• Lower tail test: If Z-calculate is less than Z-critical, or t-calculate is less than t-critical, we reject the null hypothesis.
• Upper tail test: If Z-calculate is higher than Z-critical, or t-calculate is higher than t-critical, we reject the null
hypothesis.
• Two tails test: If Z-calculate is less than −Z-critical or higher than +Z-critical, we reject the null hypothesis.
If t-calculate is less than −t-critical or higher than +t-critical, we reject the null hypothesis.
STEP 5: Calculate Z-calculate and t-calculate
STEP 6: Decision and conclusion
The decision depends on the decision rule and the calculated value → Reject or do not reject $H_0$
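The critical Z values used in Step 3 can also be computed rather than read from a table. The following is a minimal sketch using Python's standard library; the function name `z_critical` and the three significance levels in the loop are assumptions chosen for illustration, not values taken from these notes. Critical t values need the degrees of freedom and are not covered here.

```python
from statistics import NormalDist


def z_critical(alpha, tail):
    """Critical Z value(s) for significance level alpha and the given tail type."""
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    if tail == "lower":
        return (std_normal.inv_cdf(alpha),)      # one negative value
    if tail == "upper":
        return (std_normal.inv_cdf(1 - alpha),)  # one positive value
    if tail == "two":
        z = std_normal.inv_cdf(1 - alpha / 2)    # alpha is split across both tails
        return (-z, z)
    raise ValueError("tail must be 'lower', 'upper' or 'two'")


# Assumed significance levels, chosen only for illustration.
for alpha in (0.10, 0.05, 0.01):
    for tail in ("lower", "upper", "two"):
        values = [round(z, 3) for z in z_critical(alpha, tail)]
        print(f"alpha={alpha:.2f} {tail:>5}: {values}")
```

A two-tails test returns a negative and a positive value, while the one-tail tests return a single value with the sign of the tail.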
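$\hat{y}_i = b_1 + b_2 x_i$
$\hat{e}_i = y_i - \hat{y}_i = y_i - b_1 - b_2 x_i$
(Formula 2D) $\sum x_i \, b_1 + \sum x_i^2 \, b_2 = \sum x_i y_i$
(Formula 3D) $b_2 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{n \sum X_i Y_i - \sum X_i \sum Y_i}{n \sum X_i^2 - \left(\sum X_i\right)^2}$
(Formula 4D) $b_1 = \bar{y} - b_2 \bar{x}$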
SR2 The average value of the random error $e$ is $E(e) = 0$, since we assume that $E(y) = \beta_1 + \beta_2 x$
SR4 The covariance between any pair of random errors, $e_i$ and $e_j$, is $\mathrm{cov}(e_i, e_j) = \mathrm{cov}(y_i, y_j) = 0$
SR5 The variable 𝒙 is not random and must take at least two different values
Gauss-Markov Theorem: Under the assumptions SR1 – SR5 of the linear regression model
the estimators $b_1$ and $b_2$ have the smallest variance of all linear and unbiased estimators of $\beta_1$
and $\beta_2$. They are the Best Linear Unbiased Estimators (BLUE) of $\beta_1$ and $\beta_2$.
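The least squares formulas above can be sketched in a few lines of Python. This assumes the notation $\hat{y} = b_1 + b_2 x$, with $b_1$ as the intercept and $b_2$ as the slope; the function name `ols_fit` and the sample data are hypothetical and only illustrate the arithmetic.

```python
def ols_fit(x, y):
    """Return (b1, b2): least squares intercept and slope for y = b1 + b2 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    if s_xx == 0:
        # SR5: x must take at least two different values.
        raise ValueError("x must take at least two different values")
    b2 = s_xy / s_xx          # slope
    b1 = y_bar - b2 * x_bar   # intercept
    return b1, b2


# Hypothetical data lying exactly on the line y = 1 + 2x.
b1, b2 = ols_fit([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])
print(b1, b2)
```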
B
PART B
Exam Practice
Questions & Answers
The waiting time is defined as the time the customer enters the line to when he or she reaches the
teller window. A random sample of 15 customers is selected, and the results are as follows:
4.21 5.55 0.5 5.13 4.77 2.34 3.54 3.2 4.5 6.1 0.38 5.12 6.46 12.19 3.79
Use an appropriate technique to summarise the data and answer the following questions.
Part a. What are the shortest and the longest waiting times?
Part c. Around what values, if any, are the waiting times concentrated?
Part d. The standard deviation is approximately 2.8 minutes, what does this mean?
Part h. Determine if there are any unusual waiting times in the above dataset.
Part i. Use the results above to provide a summary of the waiting time in plain language.
Step-by-step Solutions
Part a.
0.38 0.5 2.34 3.2 3.54 3.79 4.21 4.5 4.77 5.12 5.13 5.55 6.1 6.46 12.19
Part b.
Part c.
Calculate Median
Location of median value $= \frac{15 + 1}{2} =$ 8th position
Value of median is at 8th position in the sorted data array above, or 4.5
Part d.
On average, each individual customer’s waiting time deviates by 2.8 minutes from the mean of 4.51
minutes.
Part e.
Location of 1st Quartile $= \frac{15 + 1}{4} =$ 4th position
Value of 1st Quartile is at 4th position in the sorted data array above, or 3.2.
Part f.
Interquartile Range $= Q_3 - Q_1 = 5.55 - 3.2 = 2.35$
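The position-based median and quartile calculations can be checked with a short script. This is a sketch under the (n + 1) position rule used in these notes; the helper name `value_at_position` is hypothetical, and rounding a fractional position to the nearest whole position (averaging the two neighbours at an exact half) is an assumption modelled on the worked solutions.

```python
import math

waiting_times = [4.21, 5.55, 0.5, 5.13, 4.77, 2.34, 3.54, 3.2,
                 4.5, 6.1, 0.38, 5.12, 6.46, 12.19, 3.79]


def value_at_position(sorted_data, position):
    """Value at a 1-based position in sorted data, using the (n + 1) rule."""
    lower = math.floor(position)
    fraction = position - lower
    if fraction == 0.5:
        # Exactly halfway: average the two neighbouring values.
        return (sorted_data[lower - 1] + sorted_data[lower]) / 2
    index = lower if fraction < 0.5 else lower + 1  # round to nearest position
    return sorted_data[index - 1]


data = sorted(waiting_times)
n = len(data)
q1 = value_at_position(data, (n + 1) / 4)
median = value_at_position(data, (n + 1) / 2)
q3 = value_at_position(data, 3 * (n + 1) / 4)
iqr = q3 - q1
print(q1, median, q3, round(iqr, 2))
```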
Part h.
Since all data values lie between the lower limit and the upper limit, there seem to be no outliers in the
data set.
Part i.
The average waiting time for these 15 customers was around 4.5 minutes.
The shortest waiting time was 0.4 minutes, while the longest waiting time was 12.2 minutes.
The shortest 25% of waiting times did not exceed 3.2 minutes, while the longest 25% lasted
at least 5.6 minutes. Thus, the middle half was spread across around 2.4 minutes.
The waiting time of 12.2 minutes seemed unusually long as it was almost twice the second
Question 2 - Probability
A survey has been conducted of companies involved in software development. It showed that the last
200 computer software packages recently released, the production expenditure and the profitability
300 or more 25 25
Part a. For these data, construct the row percentage contingency table.
Part b. Calculate Probability of Production Expenditure for at least $300,000 and Unprofitable.
Part c. Calculate Probability of Production Expenditure for less than $100,000 given Profitable.
Part d. Test whether Production Expenditure and Profitability are dependent on, or independent of, each other.
Part e. Without any further calculations, state whether you think production cost & profitability are
Step-by-step Solutions
Part 1. (a)
Step 1: Find Marginal Probability of each variable
300 or more 25 25 50
Exam Tips: Values are summed in the rows for a total of 100%.
Total 70 30 100
Part 1. (b)
Part 2. (a)
From the table in Step 1, Production Expenditure $100,000 is a total of (60 + 50)
Part 2. (b)
$P(\text{Expenditure} \ge 300 \text{ and Unprofitable}) = \frac{25}{200} = 0.125$ or 12.5%
Part 2. (c)
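$P(\text{Expenditure} < 100 \text{ and Profitable}) = \frac{15}{200} = 0.075$ or 7.5%
$P(\text{Profitable}) = \frac{60}{200} = 0.30$ or 30.0%
$P(\text{Expenditure} < 100 \text{ given Profitable}) = \frac{P(\text{Expenditure} < 100 \text{ and Profitable})}{P(\text{Profitable})}$
$P(\text{Expenditure} < 100 \text{ given Profitable}) = \frac{0.075}{0.30} = 0.25$ or 25.0%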
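The conditional probability calculation, and the comparison used to judge independence, can be reproduced as follows. This is a small sketch using the counts quoted in the solution (15, 60 and 90 out of 200); the variable names are illustrative only.

```python
import math

total = 200
low_and_profitable = 15  # expenditure < $100,000 and profitable
profitable = 60          # all profitable packages
low_expenditure = 90     # all packages with expenditure < $100,000

p_joint = low_and_profitable / total
p_profitable = profitable / total
p_low = low_expenditure / total

# Conditional probability: P(A given B) = P(A and B) / P(B)
p_low_given_profitable = p_joint / p_profitable

# The two events are independent only if the conditional equals the marginal.
independent = math.isclose(p_low_given_profitable, p_low)
print(round(p_low_given_profitable, 4), p_low, independent)
```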
Part 2. (d)
$P(\text{Expenditure} < 100) = \frac{90}{200} = 0.45$ or 45.0%
$P(\text{Expenditure} < 100 \text{ given Profitable}) = 25.0\%$
Since $P(\text{Expenditure} < 100) \ne P(\text{Expenditure} < 100 \text{ given Profitable})$, we can conclude that
Part 2. (e)
If the production expenditure is less than $100,000, barely 17% of software packages are profitable. If
the expenditure is increased to somewhere in the range of $100,000 to less than $300,000, around
33% are profitable. Finally, with expenditure in excess of $300,000 chances of profitability are at 50%.
In general, it seems that if production expenditure increases, so does the chance of being profitable.
Part a.
The probability that an individual packet of biscuits is damaged in a box of biscuits has been found to
be 0.2 over many years for a company. If a box consists of 10 packets of biscuits, what is the probability
that:
$P(\text{exactly } 2) = {}^{20}C_{2} \times p^{2} \times q^{18}$
$= \frac{20!}{2! \, 18!} \times 0.2^{2} \times 0.8^{18} = 0.1369$
ii. One (1) or more packets of biscuits will be damaged in the box?
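$P(1 \text{ or more}) = 1 - P(0)$
$= 1 - {}^{20}C_{0} \times p^{0} \times q^{20}$
$= 1 - \frac{20!}{0! \, 20!} \times 0.2^{0} \times 0.8^{20}$
$= 1 - 0.0115 = 0.9885$
Hence there is a 98.85% chance that one (1) or more packets of biscuits will be damaged.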
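The two binomial results can be verified numerically. The sketch below assumes n = 20 and p = 0.2, which are the values that appear in the worked calculation; `binomial_pmf` is a hypothetical helper name.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for a binomial distribution with n trials and success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


n, p = 20, 0.2  # values used in the worked calculation
p_exactly_two = binomial_pmf(2, n, p)
p_one_or_more = 1 - binomial_pmf(0, n, p)  # complement of "none"
print(round(p_exactly_two, 4), round(p_one_or_more, 4))
```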
iii. How many packets of damaged biscuits would you expect in a box of 10 packets of
biscuits?
Part b.
Seriously injured people arrive at an emergency unit of a hospital at an average rate of 3 per hour.
Assume that the number arriving has a Poisson distribution. What is the probability -
i. Exactly three (3) seriously injured people arrive in the next hour?
Arrivals follow a Poisson distribution with $\lambda = 0.5$ persons per minute expected, so 5 arrivals are expected in 10 minutes.
So $P(X = 7) = \frac{e^{-5} \, 5^{7}}{7!} = 0.1044$
Hence there is a 10.44% chance that 7 people will arrive in the next 10 minutes.
ii. One (1) or more seriously injured people arrive in the next 30 minutes?
Hence there is a 99.75% chance that 1 or more people will arrive in the next 12 minutes.
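The Poisson figures quoted in the solutions can be reproduced with the standard library. This sketch assumes the rate of 0.5 arrivals per minute stated in the worked answer, giving an expected 5 arrivals in 10 minutes and 6 in 12 minutes; `poisson_pmf` is a hypothetical helper name.

```python
from math import exp, factorial


def poisson_pmf(k, mean):
    """P(X = k) for a Poisson distribution with the given mean number of events."""
    return exp(-mean) * mean ** k / factorial(k)


rate_per_minute = 0.5  # rate stated in the worked answer

# Exactly 7 arrivals in 10 minutes (expected count 0.5 * 10 = 5).
p_seven_in_ten = poisson_pmf(7, rate_per_minute * 10)

# One or more arrivals in 12 minutes (expected count 0.5 * 12 = 6).
p_one_or_more_in_twelve = 1 - poisson_pmf(0, rate_per_minute * 12)

print(round(p_seven_in_ten, 4), round(p_one_or_more_in_twelve, 4))
```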
Part a. The life (in hours) of a particular brand of light bulb is known to be uniformly distributed
between three hundred (300) and eight hundred (800) hours.
So, height $= \frac{1}{500} = 0.002$
ii. What is probability that a light bulb will last less than five hundred (500) hours?
with a mean of 20 hours and a standard deviation of 10 hours. The company is considering a
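$Z = \frac{X - \mu}{\sigma} = \frac{30 - 20}{10} = 1$
$0.5 - 0.3413 = 0.1587$
$Z_1 = \frac{X_1 - \mu}{\sigma} = \frac{15 - 20}{10} = -0.5$
$Z_2 = \frac{X_2 - \mu}{\sigma} = \frac{30 - 20}{10} = 1$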
iii. If the company wants to replace less than 5% of all its batteries under a warranty,
how many hours of use should the warranty cover?
From the Z-table, a lower-tail area of 0.05, or 5%, corresponds to Z = −1.645
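The normal-distribution steps can be checked with `statistics.NormalDist`. This is a sketch that assumes the battery life is normally distributed with mean 20 hours and standard deviation 10 hours, as the Z-score calculations suggest; the variable names are illustrative.

```python
from statistics import NormalDist

battery_life = NormalDist(mu=20, sigma=10)  # assumed: normal, in hours

# P(X > 30): area above Z = 1.
p_more_than_30 = 1 - battery_life.cdf(30)

# P(15 < X < 30): area between Z = -0.5 and Z = 1.
p_between_15_and_30 = battery_life.cdf(30) - battery_life.cdf(15)

# Warranty length such that only 5% of batteries fail before it (Z = -1.645).
warranty_hours = battery_life.inv_cdf(0.05)

print(round(p_more_than_30, 4), round(p_between_15_and_30, 4), round(warranty_hours, 2))
```

The warranty figure is the same as $\mu + Z\sigma = 20 - 1.645 \times 10$ hours.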
Part c. The game of Poker consists of dealing 5 cards to each player from a pack of 52 different cards
without replacement. In Poker the order in which cards are dealt is not important. For example, one
How many different Poker hands of five cards are possible from a pack of 52 cards?
Order is not important, so use combinations. Hence the number of possible poker hands of 5
No. of hands $= {}^{52}C_{5} = \frac{52!}{5! \, 47!} = \frac{52 \times 51 \times 50 \times 49 \times 48 \times 47!}{5 \times 4 \times 3 \times 2 \times 1 \times 47!}$
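The combination can be evaluated directly. A minimal sketch using `math.comb`, with the factorial form shown alongside as a cross-check:

```python
from math import comb, factorial

# Number of 5-card hands from 52 cards when order does not matter.
hands = comb(52, 5)

# Cross-check against the factorial formula 52! / (5! * 47!).
hands_from_factorials = factorial(52) // (factorial(5) * factorial(47))

print(hands, hands == hands_from_factorials)
```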
We wish to estimate the mean Lot Size (square metres) of all houses in the Tasmania region.
Assume the random sample of 120 houses sold are representative of all houses in Tasmania.
Part a.
Calculate the 95% confidence interval estimate of the mean lot size (square metres) given that
there were n = 120 houses in the sample, sample mean = 1175 (square metres) and standard
Step-by-step Solutions
Step 5 Applying formula
$\bar{x} \pm t \times \frac{s}{\sqrt{n}} = 1175 \pm 1.9799 \times \frac{373}{\sqrt{120}} = 1175 \pm 67.42$
[1107.58, 1242.42] square metres
Step 6 Conclusion
We are 95% confident that the mean lot size of all houses in Tasmania is somewhere between
1107.58 $m^2$ and 1242.42 $m^2$.
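The interval arithmetic can be reproduced as below. This sketch takes the t-critical value 1.9799 from the worked solution rather than computing it; `mean_confidence_interval` is a hypothetical helper name.

```python
from math import sqrt


def mean_confidence_interval(sample_mean, sample_sd, n, t_critical):
    """Return (margin, lower, upper) for a confidence interval of the mean."""
    margin = t_critical * sample_sd / sqrt(n)
    return margin, sample_mean - margin, sample_mean + margin


# Values from the worked solution; t_critical is read from the t-table.
margin, lower, upper = mean_confidence_interval(1175, 373, 120, 1.9799)
print(round(margin, 2), round(lower, 2), round(upper, 2))
```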
Part b.
Suppose that the mean lot size for Sydney overall is 1,000 square metres. From your confidence
interval in part (a), what can we say about the lot sizes of Tasmania houses compared to Sydney
overall?
Given the above confidence interval, it seems that the mean lot size for all Tasmania houses is
Part c.
For the confidence interval calculated above in part (a), had you used 90% confidence instead of
95%, could you have come to a different conclusion in (b)? Explain your answer. (No calculations
are required).
90% confidence instead of 95% would produce a narrower interval. Hence, this would not change
Part d.
For the confidence interval calculated above in part (a), had the sample size of houses been n = 50
instead of n =120, could you have come to a different conclusion in (b)? Explain your answer. (No
calculations are required).
If we decrease the sample size, the Margin of Error would increase, in turn resulting in a wider
confidence interval – potentially including the Sydney mean of 1,000. Hence, this might change
Part e.
If the sample size of houses was n =20 what potential problems could there be in performing this
type of analysis?
If the sample size was less than 30, we would not be able to simply assume that the distribution of
the sample mean is normal. If, upon checking, it was not normal, we would not be able to construct
a confidence interval
At a recent Union meeting of Westfield staff, concern was expressed about the increasing number
of hours that stores were open. Staff felt that they were being made to work longer and longer hours.
One union official claimed that, on average, all Westfield stores were open (i.e. trading) for more
To test this claim, the Union took a survey of 100 stores where it was found that the average
opening hours (𝑥̅ ) was 104.35 with a standard deviation (s) of 23.677 hours.
Use the “Six Steps in Hypothesis Testing” to see if there are grounds to the Union's claim.
Part a.
Part b.
Part c.
Part d.
Part e.
Part f.
Step 2
Upper tail test
Exam Tips 1: How do we know? It depends on the sign of $H_1$:
$H_1$ with > : (upper tail test)
$H_1$ with < : (lower tail test)
$H_1$ with ≠ : (2-tails test)
Step 3
Because we only know the standard deviation of the sample, $s = 23.677$, we apply the t-score calculation.
Exam Tips 2: We use Z-Score in Hypothesis Testing when:
• Test for Proportion
• Test for Mean + $\sigma$ given
Step 5
$t = \frac{\bar{X} - \mu}{s / \sqrt{n}} = \frac{104.35 - 100}{23.677 / \sqrt{100}} = \frac{4.35}{2.3677} = 1.8372$
Step 6
Hence, there is sufficient evidence to conclude that, on average, all Westfield stores are open
for more than 100 hours, and we agree with the Union's claim.
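The t-score in Step 5 and the upper tail decision can be sketched as follows. The critical value 1.6604 is an assumption (a t-table value for 99 degrees of freedom at a 5% significance level, upper tail); the significance level itself is not stated in the surviving text, so treat it as illustrative.

```python
from math import sqrt


def t_statistic(sample_mean, hypothesised_mean, sample_sd, n):
    """t-calculate for a one-sample test of the mean with unknown sigma."""
    return (sample_mean - hypothesised_mean) / (sample_sd / sqrt(n))


t_calc = t_statistic(104.35, 100, 23.677, 100)

# ASSUMPTION: t-critical for df = 99, alpha = 5%, upper tail (from a t-table).
T_CRITICAL = 1.6604

# Upper tail test: reject H0 when t-calculate is higher than t-critical.
reject_h0 = t_calc > T_CRITICAL
print(round(t_calc, 4), reject_h0)
```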
Exam Tips: Students should provide a graph of the rejection regions at ‘Step 4’ to gain full marks.
supermarkets. In particular, one Board member has suggested that a key factor in improving Sales is
for individual supermarkets to advertise more. You subsequently develop a simple regression model to
try and help explain the variation in sales. Below is the computer output of your simple regression
model. The dependent variable is Sales (measured in $million) and the independent variable is
Model Summary
Coefficients
(a) How well does this model do in explaining the variation in sales? Explain fully.
(b) Write down the regression model equation and use it to predict sales for a store that spends
$100,000 on advertising.
(c) In the simple regression output above, the advertising variable has a 95% confidence interval
Step-by-step Solutions
Part a.
R2 = 0.709
70.9% of the variation in sales can be explained by the variation in the amount spent on advertising.
The remaining 29.1% of variation would be explained by other factors, or variables, not in the model.
Part b.
Hence, predicted sales for a store that spends $100,000 on advertising would be $9.545 million.
Part c.
We are 95% confident that, on average, for every extra $1000 spent on advertising, sales will increase
C
PART C
Sample Final
Examination Paper
Question 1
A bank branch located in a commercial district of a city has developed a process to improve
customer service during the noon to 1 pm lunch period. The waiting time in minutes of all
customers during this hour is recorded over a period of one week. The waiting time is
defined as the time the customer enters the line to when he or she reaches the teller window.
(a) Following is a table of key summary measures for the minutes of waiting time for
these 16 customers. Use the numbers above, and/or the summary measures given
below, to complete the table by including the eight (8) missing summary measures.
Minimum 2.8
Maximum 17.2
Total 138
Median (ii)
Mode (iii)
Range (vi)
(b) Refer to the Interquartile Range. Using your result, explain in plain language how
(c) Refer to the Standard Deviation. Using your result, explain in plain language how
(d) We would describe this data set of 16 values as being skewed to the right (or having
a positive skew). Provide two sets of evidence from your table of summary measures
(3 + 3 + 3 + 2) = 11 marks
Question 2
Investigate whether or not Age of Store is dependent on Location. You have the following
0 < 5 years 8 15 15 38
5 < 10 years 22 26 7 55
10 < 15 years 6 14 12 32
15 < 20 years 2 5 11 18
20 < 25 years 0 1 6 7
Total 38 61 51 150
% of Total Location
Total
Age of Store Country Mall Strip
0 < 5 years 5.3% 10.0% 10.0% 25.3%
% of Row Location
Total
Age of Store Country Mall Strip
0 < 5 years 21.0% 39.5% 39.5% 100%
% of Column Location
Total
Age of Store Country Mall Strip
0 < 5 years 21.0% 24.6% 29.4% 25.3%
5 < 10 years 57.9% 42.6% 13.7% 36.7%
10 < 15 years 15.8% 23.0% 23.5% 21.3%
15 < 20 years 5.3% 8.2% 21.6% 12.0%
20 < 25 years 0.0% 1.6% 11.8% 4.7%
Total 100% 100% 100% 100%
(a) Complete the four (4) missing boxes in the cross-tabulation table above.
(b) If you were to choose one store at random from the 150 in the group:
(i) What is the probability it would be a Mall store with an Age of less than 5
years?
(iii) What is the probability it would be aged from 10 to less than 15 years given
(iv) What is the probability it would be a Country store given that it was aged 20
years or more?
(4 + 8) = 12 marks
Question 3
(a) Wilson is interested in determining the true proportion of all customers who rank
the length of time they have to spend in queues as ‘excellent’. Given that 16.25% of
the 400 customers who were surveyed gave a rating of ‘excellent’, calculate a 90%
(b) Suppose management want to know the true proportion of customers who rank the
length of time they have to spend in queues as ‘excellent’ to within 3% with 95%
requirements?
(c) Below is output from a 90% confidence interval for the difference between two
population proportions. In this case the proportion of females who gave a rating of
‘excellent’ versus the same proportion of males. Based on the output can you
conclude that there is a difference between all male and all female customers?
Confidence Interval: Two sided Confidence Level % 90.0% Sample Proportion Difference % -3.079
Sample 2: Count 2 24
(3 + 2 + 2) = 7 marks
Question 4
We wish to estimate the mean Lot Size (square metres) of all houses in the Tasmania
region. Assume the random sample of 120 houses sold are representative of all houses in
Tasmania.
(a) Calculate the 95% confidence interval estimate of the mean lot size (square metres)
given that there were n = 120 houses in the sample, sample mean = 1175 (square
(b) Suppose that the mean lot size for Sydney overall is 1,000 square metres. From your
confidence interval in part (a), what can we say about the lot sizes of Tasmania
(c) For the confidence interval calculated above in part (a), had you used 90%
confidence instead of 95%, could you have come to a different conclusion in (b)?
(d) For the confidence interval calculated above in part (a), had the sample size of
houses been n = 50 instead of n =120, could you have come to a different conclusion
(e) If the sample size of houses was n =20 what potential problems could there be in
(3 + 2 + 2 + 1 + 1) = 9 marks
Question 5
The Foodmart Board is concerned about variation in Sales ($ million) between individual
supermarkets. In particular, one Board member has suggested that a key factor in
improving Sales is for individual supermarkets to advertise more. Other Board members
believe there would be several factors, such as Gender of the manager (where Male is coded
as 0 and Female is 1), or car parking spaces, that would influence Sales.
You now develop a multiple regression model to try and explain the variation in sales.
Model Summary
Coefficients
(a) Are all of the variables included in the model significant? Explain.
(c) Give a practical interpretation of the coefficients b 0 and b 1 , in the regression model.
(d) How well does this model do in explaining variation in weekly shops? Explain.
(e) What is the purpose of the Adjusted R2 in the regression model above.
(f) Can you conclude the population coefficient (β 1 ) for Advertisement is not zero?
Explain.
(3 + 2 + 3 + 2 + 1 + 2) = 13 marks
Question 6
Six months ago, Cali supermarkets launched their new website which allows customers to
shop online. The web development company who created web site believes that 10% of
Cali’s customers will use the site for shopping. Cali’s management is interested in whether
the actual proportion of customers who would shop online is any different to that claimed
(a) Write down the null and alternative hypotheses in both symbols and words for the
above situation.
(b) A sample of 400 randomly selected Cali customers was taken. Of the 400 customers,
28 said they would use the new website to shop online. Using this information,
complete the hypothesis test from part (a) using a level of significance of 5%.
(c) Based on your answer to part (b), write down if the p-value would be bigger or
(2 + 4 + 2) = 8 marks
END OF EXAMINATION
D
PART D
Suggested Solutions
to Final Exam Paper
SOLUTIONS
Question 1
Part a
Minimum 2.8
Maximum 17.2
Total 138
Re-arrange data
(ii) Median
Hence, the median is the average value between the data points at positions 8th and 9th
Median $= \frac{8.2 + 8.3}{2} = 8.25$
(iii) Mode
Mode is the number with highest frequency of occurrence in the data table, or 5.9 (2
Position of 1st Quartile $= \frac{1}{4} \times (\text{Sample size} + 1) = \frac{1}{4} \times (16 + 1) = \frac{1}{4} \times 17 = 4.25$th
Hence, First Quartile is the value at data point position 4th (rounded)
Position of 3rd Quartile $= \frac{3}{4} \times (\text{Sample size} + 1) = \frac{3}{4} \times (16 + 1) = \frac{3}{4} \times 17 = 12.75$th
Hence, Third Quartile is the value at data point position 13th (rounded)
(vi) Range
Part b
The Interquartile Range measures the amount of variation in the data set. In this case, the middle
50% of “waiting time at the line till customers reach the teller window” lie within a range of 6.4
minutes.
Part c
The Standard Deviation also measures the amount of variation in the data set. It tells us how far,
on average, the data is away from the mean (average). In this case, the average variation away
Part d
Skewness Coefficient is 0.684. The positive value indicates positive skew. This value is close to 0,
indicating slight skewness.
Mean of 8.625 minutes, which is higher than median of 8.25 minutes. Positive skewness results in
the mean being pulled to the right of the median. Mean not far from the median, slight skewness.
(11 marks)
Question 2
Part a
% of Total Location
Total
Age of Store Country Mall Strip
0 < 5 years 5.3% 10.0% 10.0% 25.3%
Hints:
$\frac{7}{150} = 4.7\%$ (rounded); $\frac{5}{150} = 3.3\%$ (rounded)
% of Row Location
Total
Age of Store Country Mall Strip
0 < 5 years 21.0% 39.5% 39.5% 100%
Hints:
$\frac{26}{55} = 47.3\%$ (rounded); $\frac{2}{18} = 11.1\%$ (rounded)
Part b
$\frac{15}{150} = 10\%$
$\frac{7}{150} = 4.7\%$
Probability it would be aged from 10 to less than 15 years given that it was a Strip store
$\frac{12}{51} = 23.5\%$
Probability it would be a Country store given that it was aged 20 years or more
$\frac{0}{7} = 0\%$
(12 marks)
Question 3
Part a
Sample Proportion
Sample Size
$n = 400$
Part b
$Z = \pm 1.96$
Hence, if the management want to know the true proportion of customers who rank the length of
time they have to spend in queues as ‘excellent’ to within 3% with 95% confidence, 581 data
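The figure of 581 can be reproduced with the usual sample-size formula for a proportion, $n = \frac{Z^2 \, p (1 - p)}{E^2}$. The sketch below assumes the planning proportion is the sample value of 16.25% from part (a) and that the result is rounded up; `sample_size_for_proportion` is a hypothetical helper name.

```python
from math import ceil


def sample_size_for_proportion(z, p, margin):
    """Smallest sample size giving the requested margin of error for a proportion."""
    return ceil(z ** 2 * p * (1 - p) / margin ** 2)


# ASSUMPTION: p = 0.1625 is taken from the survey in part (a).
n_required = sample_size_for_proportion(z=1.96, p=0.1625, margin=0.03)
print(n_required)
```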
Part c
We cannot conclude that there is a difference between all male and female customers, who gave a
rating of ‘excellent’.
It is because the numbers are mixed; that is, we have a positive and a negative number.
(7 marks)
Question 4
Part a
INSTRUCTIONS
Step 2 Population standard deviation $\sigma$ known? No → Use t scores
Step 5 Applying formula
$\bar{x} \pm t \times \frac{s}{\sqrt{n}}$
$1175 \pm 1.9799 \times \frac{373}{\sqrt{120}}$
$1175 \pm 67.4158$
metres.
Part b
Given the above confidence interval, it seems that the mean lot size for all
Tasmania houses is higher than the mean lot size for Sydney houses.
Part c
Part d
If we decrease the sample size, the Margin of Error would increase, in turn
of 1,000.
Part e
If the sample size was less than 30, we would not be able to simply assume that
If, upon checking, it was not normal, we would not be able to construct a
confidence interval.
(9 marks)
Question 5
Part a
Part b
� = 𝟒𝟒. 𝟏𝟏𝟏𝟏𝟎𝟎 + 𝟎𝟎. 𝟎𝟎𝟑𝟑𝟗𝟗 ∗ 𝐀𝐀𝐌𝐌𝐚𝐚 − 𝟎𝟎. 𝟎𝟎𝟒𝟒𝟓𝟓 ∗ 𝐌𝐌𝐏𝐏𝐚𝐚𝐒𝐒𝐌𝐌𝐌𝐌 + 𝟎𝟎. 𝟎𝟎𝟐𝟐𝟏𝟏 ∗ 𝐂𝐂𝐌𝐌𝐚𝐚 𝐒𝐒𝐒𝐒𝐌𝐌𝐇𝐇𝐌𝐌𝐏𝐏
𝐒𝐒𝐌𝐌𝐒𝐒𝐌𝐌𝐏𝐏
Part c
b 0 = 4.660
A supermarket will have sales of $4.660 million when all other variables have a value
of zero.
b 1 = 0.039
Part d
The remaining 27.10% of the variation in Sales can be explained by other factors
Part e
The adjusted R2 is ONLY used to compare one regression model with another,
Part f
Because the p-value for this variable is less than 0.05, we can conclude that the population
coefficient ($\beta_1$) for Advertising is not zero; it is a significant variable in the
model.
(13 marks)
Question 6
Step 1
$H_0: \pi = 10\%$
$H_1: \pi \ne 10\%$
Step 2
Step 3
Step 4
If the sample produces a Z Score lower than $-1.96$ or higher than $+1.96$, we will
reject $H_0$.
Step 5
Sample Proportion $= \frac{28}{400} = 0.07$
Step 6
As the sample Z-Score is (-2), which is lower than the critical value of Z (-1.96), we
reject $H_0$.
Conclusion
P-value
We know that:
• Given that we reject $H_0$, the p-value must be smaller than alpha.
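The Z-score of -2 and the claim about the p-value can be checked numerically. This sketch uses the standard one-sample test for a proportion, with the standard error computed under the null hypothesis; the helper name `proportion_z` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z(sample_prop, null_prop, n):
    """Z-calculate for a one-sample test of a proportion."""
    standard_error = sqrt(null_prop * (1 - null_prop) / n)
    return (sample_prop - null_prop) / standard_error


z_calc = proportion_z(28 / 400, 0.10, 400)

# Two tails test: the p-value is the area in both tails beyond |z_calc|.
p_value = 2 * NormalDist().cdf(-abs(z_calc))

alpha = 0.05
print(round(z_calc, 4), round(p_value, 4), p_value < alpha)
```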
(8 marks)