Problems on Sampling
1. Suppose you decide to solve each of the following problems by collecting data through
probability sampling. In the context of each problem, define (i) the population, (ii) the element,
(iii) an appropriate sampling design, (iv) the sampling frame, (v) the sampling unit,
(vi) the character under study (variable or attribute), (vii) the parameter of interest,
and (viii) an appropriate statistic for estimating the parameter. (You may
assume any sample size that would give a reasonable estimate. The sample size
determination problem is discussed later.)
Problem 1. To estimate the percentage of families in Ahmedabad who have sent
their children to private schools. (To assess the demand for private school education
in Ahmedabad)
Problem 2. To estimate the average time per day (in hours) that a college student
in Ahmedabad spends on social networking site(s). (To understand the networking habits
of Ahmedabad college students)
Problem 3. To estimate the percentage of students studying in grade VI-VIII in
schools run by the Ahmedabad Municipality Corporation who are able to read a sim-
ple Gujarati text. (To assess the reading skill of the students studying in municipality
schools)
Problem 4. To estimate the mean rating (1: poor, 2: okay, 3: good, 4: very good &
5: Excellent) of all the members of the IIMA gymnasium about its perceived service
quality. (To assess the service quality of the Gymnasium)
Problem 5. To estimate the percentage of C-section deliveries during 2016 in private
hospitals in Ahmedabad. (According to WHO, the rate of C-section deliveries should ideally
be between 10% and 15%. To understand whether the hospitals follow this norm.)
2. Census data are used to calculate the literacy rate of India. In the census, household literacy
data are collected by asking the head of the household which members are literate and which are
not.
Later, a sample of 12000 households was selected from five states in the Hindi belt of India.
From the selected households, literacy data were collected by the census method
(CM) and also by giving a simple reading test (RT) to each member of the household.
It was found that the estimate of reading literacy rate by CM is at least 16% more
than that by RT. Also the estimate obtained by census method was found to be very
close to the literacy rate of the states reported by the government. Ref: Can India's literate
read? International Review of Education (2010), pp. 705-728.
In the above, the literacy data collected by the census method are clearly
subject to substantial error. Suggest a simple method to reduce this
error.
3. Trump's Muslim ban: Different agencies conducted opinion surveys after Trump's proposal
to bar Muslim noncitizens from entering the United States, at least temporarily.
CBS News: "Do you think the US should temporarily ban Muslims from other countries
from entering the United States, or not?" 36% support, 58% oppose.
YOUGOV: "Do you agree or disagree that there should be a total and complete
shutdown of Muslims entering the United States until our country's representatives
can figure out what is going on?" 45% agree, 41% disagree.
What could be the reason for such divergent results in two surveys? Is it
possible to avoid this kind of bias?
Ref: https://www.nytimes.com/2015/12/16/upshot/how-unpopular-is-trumps-muslim-ban-depends-how-you-ask.html
4. An experiment was conducted with the following two questions (Schuman & Presser
(1981)):
A. Do you think United States should let Communist newspaper reporters from other
countries come in here and send back to their papers the news as they see it?
B. Do you think a communist country like Russia should let American newspaper
reporters come in and send back to America the news as they see it?
When the questions were asked in the order AB (BA), 54.7% (74.6%) answered Yes to A and
63.7% (81.9%) answered Yes to B.
Why do the outcomes diverge? Suggest a method to reduce the bias.
5. Suppose an FPM student would like to draw a random sample of size 200 from the
population of mid-level HR executives working in the financial sector in India for her
thesis. She has two options for collecting data:
(i) Collect data from those who would visit the campus for MDP/customized pro-
grammes in the next one year.
(ii) Collect data by mailing the questionnaire (web survey) to the mid-level HR executives
of a large number of companies.
Which of the two options would she prefer? Discuss the kind of errors
that are expected in the two methods. Is it feasible to get a truly random
sample in this case (truly random sample means that each element in the
population has equal probability of being selected)? What could be the
major inhibitor/inhibitors? Suggest a sampling design that could be used
for getting a representative sample.
6. Suppose there are 10 colleges in a city. To select a college student at random from the
population of all students, a college is chosen at random, and then from the chosen
college, a student is picked up at random.
(i) Do you think that the sampling procedure will lead to the random selection of a
student?
(ii) If not, what procedure should be followed?
7. Suppose a probability sample of students of size 500 or more is to be chosen from the
colleges in Ahmedabad (assuming all the colleges have 500 or more students). Two
sampling schemes are suggested:
(i) Select a college at random and collect data from each student of the selected col-
lege.
(ii) Select a city block at random and collect data from all the college students resid-
ing in the selected block.
Which sampling method would you prefer? Give your justification.
8. In each of the following studies indicate whether the data are collected by observa-
tional study or experimental study.
(i) A car manufacturer has developed a new engine to enhance the mileage of an
existing model of car. The manufacturer finds the mileages of 100 cars manufactured
with the new engine to estimate the average mileage before marketing it.
(ii) The R & D team of a pharmaceutical company administers a newly developed
pain relieving drug to 100 terminally ill patients to get an estimate of the average
number of hours of relief.
(iii) A public interest group tested 100 cell phones of a particular model to estimate
the average number of hours the battery works after full charging.
(iv) A researcher recorded the increase in sugar level in blood of 100 diabetics after
they drank 300 ml of coke.
(v) A researcher collected data from 100 randomly chosen college students in Ahmed-
abad on the average number of hours in a day each of them talks on the cell phone.
(vi) To estimate the prevalence of HIV in India among children aged one to five years
in 2017, an NGO decides to carry out medical tests on 500000 randomly
chosen children.
(vii) A public interest group tests 100 one-kg packets of Basmati rice of a particular
brand for pesticide content.
(viii) In 2016, Broadcast Audience Research Council India installed meters in ten
thousand households selected by a proper sampling design from all over India to
monitor the TV watching habits of the people living in these households. These me-
ters recorded who in the family watched which TV programmes in 2016.
9. Suppose ten thousand payment vouchers were generated at IIMA in 2016. An auditor
checks the vouchers by drawing a probability sample, a practice often called audit sampling.
(i) Why might simple random sampling not be appropriate? What alternative
sampling design or designs might the auditor use?
(ii) Which sampling design would you prefer? Justify it.
(iii) Describe the sampling error (which auditors often call sampling risk) and the non-sampling
errors in this context.
10. Suppose a consignment of 50 sacks of chilli powder (each containing 20 kg and 36 inches long)
is to be inspected for molds (a fungus that grows on chilli powder)
by the lab of the Spices Board of India. A sample of 100 specimens, each of 50 grams of chilli
powder, is to be selected from the sacks. You may consider each sack to be
divided into six layers, each six inches long (along its length), with each such layer
being a sampling unit.
Discuss an appropriate method of sampling in this context. (This is often called sack sampling.)
11. Suppose on Aruna's birthday two of her friends decide independently to present her
an Amazon gift cheque. Suppose Amazon gift cheques come in denominations of Rs. 500,
Rs. 1000 and Rs. 1500 only. Suppose each friend picks one of the denominations
at random.
(i) Find the probability distribution of the total amount of the gift cheques that they
pick.
(ii) Find the mean and standard deviation of the total amount.
(iii) Check from the standard formulas whether you get the same answers.
(iv) Do (i)-(iii) if her friends decide to select different denominations at random.
(v) Suppose now that each friend picks a gift cheque of Rs. 500, Rs. 1000 and Rs. 1500 with
probability 0.5, 0.3 and 0.2 respectively. Then do exercises (i)-(ii) above.
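A brute-force check of Problem 11 can be sketched in Python by listing every ordered pair of picks. This is only one possible way to verify hand calculations; the helper names total_distribution and mean_sd are illustrative, and part (iv) is read here as "the two friends never pick the same denomination, all six ordered pairs being equally likely", which is an assumption about the intended meaning.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations, product
from math import sqrt

DENOMS = [500, 1000, 1500]

def total_distribution(outcomes):
    """Collapse ((a, b), probability) pairs into {total amount: probability}."""
    dist = Counter()
    for (a, b), p in outcomes:
        dist[a + b] += p
    return dict(dist)

def mean_sd(dist):
    """Mean and standard deviation of a discrete distribution."""
    mean = sum(t * p for t, p in dist.items())
    var = sum((t - mean) ** 2 * p for t, p in dist.items())
    return float(mean), sqrt(var)

# Parts (i)-(ii): independent picks, each denomination equally likely.
equal = total_distribution(
    ((a, b), Fraction(1, 9)) for a, b in product(DENOMS, repeat=2)
)

# Part (iv), under the assumption stated above: six equally likely ordered pairs.
different = total_distribution(
    ((a, b), Fraction(1, 6)) for a, b in permutations(DENOMS, 2)
)

# Part (v): independent picks with unequal probabilities.
probs = {500: Fraction(5, 10), 1000: Fraction(3, 10), 1500: Fraction(2, 10)}
weighted = total_distribution(
    ((a, b), probs[a] * probs[b]) for a, b in product(DENOMS, repeat=2)
)

if __name__ == "__main__":
    for label, dist in [("equal", equal), ("different", different), ("weighted", weighted)]:
        print(label, sorted(dist.items()), mean_sd(dist))
```

Exact fractions are used for the probabilities so that the distribution can be compared directly with a hand-built table.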
12. An alchemist visited the court of a medieval warlord and said, "Your excellency, here
is my tribute to you. I have six envelopes. One of these contains a single copper coin,
another contains two copper coins, while a third one contains three copper coins.
The remaining three envelopes are empty. Kindly pick any three of these six
envelopes at random and without replacement. I shall convert all the coins in the
selected envelopes to gold coins dating from the period of King Solomon; you can
imagine their value as antiques!" "But what happens if I end up picking only the three
empty envelopes?" thundered the warlord. "I shall behead you then." "Take it easy, your
excellency," calmly replied the alchemist. "I am also a sorcerer. In that extreme case,
I shall make seven gold coins for you, again dating from King Solomon's era, simply
from the air." Assume that all the claims of the alchemist were true and that he kept
all his promises (the latter point is natural given the threat to his head!). Let X
be the number of gold coins that the warlord eventually ended up with. Obtain (a)
P(X = 3), (b) P(X = 4), (c) P(X = 5), (d) P(X = 6), (e) P(X = 7), (f) E(X) and (g)
Var(X).
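The distribution of X in Problem 12 can be checked by enumerating all 20 equally likely choices of three envelopes. The following Python sketch does this; the names envelopes and gold_coins are illustrative only.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

# Copper coins in each of the six envelopes (three envelopes are empty).
envelopes = [1, 2, 3, 0, 0, 0]

def gold_coins(picked):
    """Gold coins received: the coins picked, or 7 if all three envelopes are empty."""
    total = sum(picked)
    return total if total > 0 else 7

# Enumerate by envelope index so the three empty envelopes stay distinct.
samples = list(combinations(range(len(envelopes)), 3))
counts = Counter(gold_coins([envelopes[i] for i in s]) for s in samples)

dist = {x: Fraction(c, len(samples)) for x, c in sorted(counts.items())}
mean = sum(x * p for x, p in dist.items())
var = sum(x * x * p for x, p in dist.items()) - mean ** 2

if __name__ == "__main__":
    for x, p in dist.items():
        print(f"P(X = {x}) = {p}")
    print("E(X) =", float(mean), "Var(X) =", float(var))
```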
13. A textbook on business statistics contains five chapters. A student, who is not very
serious, takes a simple random sample (without replacement) of three chapters. He
studies these three chapters with some seriousness and completely ignores the re-
maining two chapters.
In the final examination, the question paper on this subject consists of five questions,
one from each chapter. The questions from Chapters 1 and 2 are compulsory and
carry 18 and 12 marks respectively. The questions from the other three chapters
carry 20 marks each and each student is supposed to answer any one of these three
questions (even if a student answers more than one of these three questions, he/she
gets credit for only one of them). Thus the maximum possible score for any student
is 50.
Obviously, the student under consideration gets zero in any question from a chapter
that he had ignored (so he does his best to avoid such a question, if possible). Fur-
thermore, as he is not very serious with his studies, he gets only 50% of the marks
in any question from a chapter that he had included for study. Let T be his score in
the examination.
Obtain the probability distribution of T and hence the expectation and variance of
T.
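One way to check the distribution of T in Problem 13 is to enumerate the ten possible samples of three chapters. The sketch below assumes, as the problem states, that the student earns half marks on studied chapters, zero on ignored ones, and attempts an optional question only if he has studied its chapter. The function name score is illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

def score(studied):
    """Exam score for a given set of studied chapters (50% of marks on each)."""
    total = 0
    if 1 in studied:
        total += 9    # half of 18 marks, compulsory question from Chapter 1
    if 2 in studied:
        total += 6    # half of 12 marks, compulsory question from Chapter 2
    if studied & {3, 4, 5}:
        total += 10   # half of 20 marks, one optional question is credited
    return total

samples = [set(s) for s in combinations(range(1, 6), 3)]
counts = Counter(score(s) for s in samples)

dist = {t: Fraction(c, len(samples)) for t, c in sorted(counts.items())}
mean = sum(t * p for t, p in dist.items())
var = sum(t * t * p for t, p in dist.items()) - mean ** 2

if __name__ == "__main__":
    print(dist, float(mean), float(var))
```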
14. Suppose a circus owner had 5 crocodiles to ship from Chennai to Mumbai. The
shipping company agreed to ship them but would charge Rs. 20,000 per 100 kg. Naturally,
they needed to know the total weight of all five crocodiles. Weighing a crocodile is
both difficult and expensive. Let us name the crocodiles Jumbo
(J), Kambo (K), Lambo (L), Mambo (M) and Shambo (S). They hired a statistician
to estimate the total weight by weighing only two crocodiles. The statistician
proposed the following procedure.
Step 1. Select two crocodiles at random without replacement.
Step 2. Weigh them, find the mean weight and multiply it by 5.
Following the statistician's procedure, the total weight came out to be 1750 kg. The
manager of the shipping company was not happy with the estimate. After observing
the size of the crocodiles, and from his experience of shipping crocodiles, the manager
felt that the estimate was very low. The statistician, however, claimed that his
estimate was unbiased and that, if the distribution of weight could be assumed to be normal,
it was actually the best among all unbiased estimates.
There was a man who had helped the company weigh crocodiles in the past. He
could estimate the weight of a crocodile from its length and
age. His estimation error was always within 10 kg. The manager called him.
His estimates of the weights (in kg) were: 1000 (J), 600 (K), 500 (L), 400 (M) and 300 (S).
(i) Had the manager accepted the statistician's estimate, what would have
been the minimum loss to the shipping company?
This episode, of course, led the manager to distrust statistical
estimation theory, and he decided not to consult the statistician
in future.
(ii) As a statistician, how would you have advised the manager in this case?
Incidentally, the estimated weights of the crocodiles by the second method matched
with the actual weights. Write down all possible samples of size two. For each sample
find the sample mean and hence the probability distribution of the sample mean. Find
the mean and variance of the distribution of the sample mean. Check whether the
values of the mean and standard deviation of the probability distribution of sample
mean match with the values that you get directly by using the formulas.
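The enumeration asked for at the end of Problem 14 can be sketched in Python as follows. The weights are those given in the problem; the comparison uses the standard formulas for simple random sampling without replacement, E(sample mean) = mu and Var(sample mean) = (sigma^2 / n)(N - n)/(N - 1), with sigma^2 the population variance (divisor N). Variable names are illustrative.

```python
from itertools import combinations
from statistics import mean, pvariance

# Actual weights (kg) of the five crocodiles, as given in the problem.
weights = {"J": 1000, "K": 600, "L": 500, "M": 400, "S": 300}
N, n = len(weights), 2

# All C(5, 2) = 10 equally likely samples and their sample means.
sample_means = {
    pair: mean([weights[c] for c in pair]) for pair in combinations(weights, n)
}

# Mean and variance of the sampling distribution, obtained by enumeration.
enum_mean = mean(sample_means.values())
enum_var = pvariance(sample_means.values())

# The same quantities from the formulas for SRS without replacement.
mu = mean(weights.values())
sigma2 = pvariance(weights.values())
formula_var = sigma2 / n * (N - n) / (N - 1)

# Estimated total weight (5 x sample mean) for every possible sample.
estimated_totals = {pair: N * m for pair, m in sample_means.items()}

if __name__ == "__main__":
    for pair, m in sample_means.items():
        print(pair, m, estimated_totals[pair])
    print(enum_mean, mu, enum_var, formula_var)
```

The smallest estimated total in this listing corresponds to the sample containing the two lightest crocodiles.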
15. A statistician who belonged to a group of rebels was taken prisoner by the
army of King Juna and produced before the king. The king offered to play a game
with him that might save his life. Six bags of coins labeled B1 to B6 are placed before
him. Each bag contains either gold or silver coins. The statistician has to pick two
bags at random and observe their contents. Based on this information he has to predict
the number of bags containing gold coins. Naturally, as a statistician, his prediction
would be six times the proportion of gold coin bags in the sample. If he predicts
correctly he will be freed; if he errs by 1 bag, he will be imprisoned for 5 years;
otherwise he will be executed.
Suppose the king ordered that two bags of gold coins and four bags of silver
coins be placed.
(i) What are the possible choices of sample of two bags?
(ii) For each choice find the proportion of bags having gold coins.
(iii) Find the probability distribution of sample proportion of bags having gold coins,
and hence the probability distribution of the estimated number of gold bags.
(iv) Find the probabilities of the statistician getting free and getting executed respec-
tively.
(v) Find the mean and standard deviation of the distribution of the estimated num-
ber of gold bags. Check whether the results match with the results obtained from
the formulas.
(vi) What would be the best strategy for the king to maximize the chance of the
statistician's execution?
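Parts (iv) and (vi) of Problem 15 can be explored by enumerating the 15 possible samples of two bags for each possible number of gold bags. The Python sketch below is one way to do so; outcome_probs is an illustrative name, and the prediction rule is the one stated in the problem (six times the sample proportion of gold bags).

```python
from fractions import Fraction
from itertools import combinations

def outcome_probs(n_gold, n_bags=6, n_sample=2):
    """Probabilities of being freed, imprisoned or executed when n_gold bags hold gold."""
    bags = [1] * n_gold + [0] * (n_bags - n_gold)   # 1 = gold bag, 0 = silver bag
    samples = list(combinations(range(n_bags), n_sample))
    probs = {"freed": Fraction(0), "imprisoned": Fraction(0), "executed": Fraction(0)}
    for s in samples:
        # Prediction: number of bags times the sample proportion of gold bags.
        estimate = Fraction(n_bags * sum(bags[i] for i in s), n_sample)
        error = abs(estimate - n_gold)
        if error == 0:
            key = "freed"
        elif error == 1:
            key = "imprisoned"
        else:
            key = "executed"
        probs[key] += Fraction(1, len(samples))
    return probs

# Execution probability for every number of gold bags the king could choose.
execution_by_gold = {k: outcome_probs(k)["executed"] for k in range(7)}

if __name__ == "__main__":
    print(outcome_probs(2))
    print(execution_by_gold)
```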
16. As a promotion strategy, a cell phone company decides to offer a discount of either
Rs. 5000 or Rs. 3000 or Rs. 2000 to the first 10000 customers e-ordering a particular
model on its website. The price of the phone is Rs. 10000. As soon as the customer
places an order the discount amount will be flashed and will be deducted from the
price. To decide on the discount to be offered to a customer, the company uses the
following random mechanism. As soon as an order is placed, a digit from 0 to 9
will be selected at random. If the chosen digit is either 0 or 1, the offered discount
will be Rs. 5000, if it is between 2 and 4, the discount will be Rs. 3000, and otherwise
the discount will be Rs. 2000.
(i) Find the probability distribution of the price of the phone for a customer (one
among the first 10000) and also find its mean and standard deviation.
Ans: Mean = 7100, SD = 1135.78 (approx).
(ii) Suppose a couple places an order for two such phones (supposing their orders are
among the first 10000). Find the probability distribution of total price of the two
phones. Also find its mean and standard deviation.
Ans: For the average price of the two phones: Mean = 7100, SD = 803.12 (approx).
(iii) Suppose a local cell phone shop owner places orders for 40 cell phones (all are
among the first 10000 orders) using his network of friends and family members.
(a) Find the mean and the standard deviation of the average price of the forty phones.
(b) Find an approximation to its probability distribution.
(c) Also find the approximate probability that the average price is (i) less than or equal to
Rs. 6000, (ii) more than Rs. 7000 and (iii) between Rs. 6000 and Rs. 8000.
(Hint: Use central limit theorem for finding approximation to the distri-
bution of average)
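The calculations in Problem 16 can be sketched in Python as below. The price distribution is built directly from the digit mechanism described in the problem, and the probabilities in part (iii) use the normal approximation suggested by the hint. The names discount_for and clt are illustrative, and NormalDist comes from the Python standard library (version 3.8 or later).

```python
from collections import Counter
from fractions import Fraction
from math import sqrt
from statistics import NormalDist

LIST_PRICE = 10000

def discount_for(digit):
    """Discount implied by the random digit, as described in the problem."""
    if digit in (0, 1):
        return 5000
    if 2 <= digit <= 4:
        return 3000
    return 2000

# Part (i): price distribution over the ten equally likely digits.
counts = Counter(LIST_PRICE - discount_for(d) for d in range(10))
price_dist = {price: Fraction(c, 10) for price, c in sorted(counts.items())}
mean = sum(x * p for x, p in price_dist.items())
var = sum((x - mean) ** 2 * p for x, p in price_dist.items())
sd = sqrt(var)

# Part (ii): total and average price of two independent orders.
total2_mean, total2_sd = 2 * float(mean), sd * sqrt(2)
avg2_sd = sd / sqrt(2)

# Part (iii): normal approximation for the average price of 40 phones.
clt = NormalDist(float(mean), sd / sqrt(40))
p_le_6000 = clt.cdf(6000)
p_gt_7000 = 1 - clt.cdf(7000)
p_6000_to_8000 = clt.cdf(8000) - clt.cdf(6000)

if __name__ == "__main__":
    print(price_dist, float(mean), round(sd, 2))
    print(total2_mean, round(total2_sd, 2), round(avg2_sd, 2))
    print(p_le_6000, p_gt_7000, p_6000_to_8000)
```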
17. (Application in Statistical Quality Control): A manufacturing process is supposed
to produce capsules containing 400 mg of a chemical, say, C. However, variation in
a manufacturing process is inherent, so the contents of different capsules would not
be identical. Suppose the regulatory authority makes it mandatory that the content
of every capsule should be between 399 mg. and 401 mg. To ensure it, the mean
and standard deviation of the contents produced by the manufacturing process are
set at 400 mg and 0.5 mg. The production supervisor knows from experience
that the standard deviation of the process rarely changes. However, he feels
that continuous monitoring of the process is necessary for checking the stability of
the mean of the process. A consultant suggested that he implement the following
procedure.
Every hour during a shift, a sample of 100 capsules is to be selected; if the
average content of the sample falls below 399.90 mg or above 400.10 mg, stop the process and
look for the trouble.
(i) Find the probability of a false alarm if this procedure is followed.
(ii) Assuming that the mean has actually shifted to 400.1 mg, find the probability that
the shift will be detected by a sample.
(iii) Find the probability that the change in mean will remain undetected after inspecting
two consecutive samples since the beginning of the morning shift.
(iv) Find the probability that it remains undetected in the first two samples and gets detected
at the inspection of the third sample.
(v) Suppose the process produces 10000 capsules per hour. What is the expected
number of capsules produced that will violate the norm of the regulatory authority
till the change in mean is detected in an eight hour shift?
(If instead of 100 the sample size is 25, what assumption would be neces-
sary for the calculation of the above probabilities? Make the assumption
and solve it.)
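Parts (i) to (iv) of Problem 17 can be checked with a short Python sketch. It assumes that the sample mean of 100 capsules is approximately normal with standard deviation 0.5/sqrt(100), and that successive hourly samples are independent. The function name alarm_probability and its parameter names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def alarm_probability(true_mean, sigma=0.5, n=100, lower=399.90, upper=400.10):
    """Probability that the sample mean falls outside the control limits."""
    xbar = NormalDist(true_mean, sigma / sqrt(n))
    return xbar.cdf(lower) + (1 - xbar.cdf(upper))

# (i) False alarm: the process mean is still 400 mg.
false_alarm = alarm_probability(400.0)

# (ii) Detection by a single sample after the mean has shifted to 400.1 mg.
detect = alarm_probability(400.1)

# (iii) The shift stays undetected in two consecutive (independent) samples.
undetected_two = (1 - detect) ** 2

# (iv) Undetected in the first two samples and detected at the third.
detected_at_third = (1 - detect) ** 2 * detect

if __name__ == "__main__":
    print(false_alarm, detect, undetected_two, detected_at_third)
```

Calling alarm_probability with n=25 gives the corresponding figures for the smaller sample size, provided the capsule contents themselves can be assumed to be normally distributed.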
18. (Application in Statistical Quality Control) For assessing the quality of lots sent by
vendors, the quality control department uses a sampling inspection plan to decide
whether to accept or reject a lot. Suppose the department receives lots of size 100 (N).
The sampling inspection plan then selects a sample at random without replacement
from the lot, say of size 10 (n, to be specified), and if the sample contains, say,
more than 1 (c, to be specified) defective item, the decision is to reject the lot;
otherwise the lot is not rejected. Sampling is often the only option if the testing is destructive
in nature.
In designing sampling inspection plans, the interests of both the consumer and the
vendor should be protected. Since the decision to accept or reject a lot is taken on the
basis of a sample, there is a chance that a good (bad) lot may be rejected (accepted).
To avoid rejection of good lots, the vendor imposes a condition like: if a lot
has 5% (p1) defective items, the chance of rejecting such a lot should not exceed 10%
(VRisk). Let us call it the vendor's risk. On the other hand, to reduce acceptance of
bad lots, the consumer imposes a condition like: the chance of accepting a lot with
10% (p2) defective items should not exceed 10% (CRisk). Let us call it the consumer's
risk.
For illustration, suppose the lot size is N = 20 and the sampling plan chosen is
given by n = 5, c = 0 (in other words, draw a sample of size 5; if no defective
is found the lot is accepted, otherwise it is rejected). Suppose you are also given
p1 = 5%, VRisk = 10%, p2 = 10%, CRisk = 10%.
(i) Is the above sampling plan able to meet the specified vendor's risk and consumer's
risk?
(ii) If the actual number of defectives in the lot is 4, what is the chance of accepting
such a lot by using the above sampling plan?
(iii) Solve (i) and (ii) with N = 1000, n = 20, c = 2, p1 = 5%, VRisk = 10%, p2 =
10%, CRisk = 10%.
(iv) Solve (iii) when the number of defectives in the lot is equal to 40.
(Use the binomial approximation)
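The acceptance probabilities in Problem 18 can be computed with a short Python sketch. For the small lot (N = 20) it uses the exact hypergeometric probability of finding at most c defectives in a sample drawn without replacement; for the large lot it uses the binomial approximation mentioned in the problem. The function names accept_prob_hypergeom and accept_prob_binom are illustrative.

```python
from math import comb

def accept_prob_hypergeom(N, D, n, c):
    """P(at most c defectives in a sample of n) for a lot of N items with D defectives."""
    favourable = sum(comb(D, k) * comb(N - D, n - k) for k in range(c + 1))
    return favourable / comb(N, n)

def accept_prob_binom(n, p, c):
    """Binomial approximation to the acceptance probability for a large lot."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c + 1))

# Plan N = 20, n = 5, c = 0.
vendor_risk = 1 - accept_prob_hypergeom(20, 1, 5, 0)    # lot with 5% defectives (1 item)
consumer_risk = accept_prob_hypergeom(20, 2, 5, 0)      # lot with 10% defectives (2 items)
accept_four_defectives = accept_prob_hypergeom(20, 4, 5, 0)

# Plan N = 1000, n = 20, c = 2, using the binomial approximation.
vendor_risk_large = 1 - accept_prob_binom(20, 0.05, 2)
consumer_risk_large = accept_prob_binom(20, 0.10, 2)
accept_forty_defectives = accept_prob_binom(20, 40 / 1000, 2)

if __name__ == "__main__":
    print(vendor_risk, consumer_risk, accept_four_defectives)
    print(vendor_risk_large, consumer_risk_large, accept_forty_defectives)
```

A plan meets the stated conditions when the computed vendor's risk and consumer's risk are both at most 0.10.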
Answers to Questions
1. See Table 1
Table 1: Question 1 solution

Problem 1
Population: Families in Ahmedabad
Element: Family
Design: SRSWOR; can involve stratification and cluster sampling by areas/locality, etc.
Frame: All households in Ahmedabad, with details that help identify clusters and strata
Unit: Household
Variable: 1/0, whether the family has sent any child to a private school
Parameter: % of families in Ahmedabad who have sent at least one child to a private school
Estimator: Sample proportion

Problem 2
Population: College students in Ahmedabad
Element: Student
Design: SRSWOR; may involve cluster or stratified sampling based on locality or type of college (engg/medical); clusters may involve different days during the time frame of the study
Frame: Colleges in Ahmedabad, along with details that help identify strata and clusters
Unit: College
Variable: Time spent on social networking sites on a given day
Parameter: Average time spent on social networking sites per day
Estimator: Sample mean

Problem 3
Population: Current class VI-VIII students in Gujarat
Element: Student
Design: May involve stratification and cluster sampling, and then SRSWOR
Frame: List of Ahmedabad municipal schools
Unit: School
Variable: Number of students in class VI-VIII who can read Gujarati; total number of students in those classes
Parameter: Percentage of students of class VI to VIII who can read Gujarati
Estimator: Sample mean of the percentage of students who can read Gujarati from each school, OR the ratio estimator (number of students who can read Gujarati in sample)/(number of students in sample)

Problem 4
Population: Members of the IIMA Gymnasium
Element: Member
Design: SRSWOR
Frame: List of all users of the gym
Unit: Member
Variable: Rating of quality of service
Parameter: Average rating (or, e.g., percentage of people who chose a rating of 4 or 5)
Estimator: Sample mean (or proportion)

Problem 5
Population: All deliveries that happened in 2016 in private hospitals of Ahmedabad
Element: Delivery, identified by some code, date and hospital
Design: SRSWOR
Frame: List of all private hospitals in Ahmedabad
Unit: Hospital
Variable: Number of deliveries; number of C-sections
Parameter: % of deliveries that were C-sections
Estimator: Ratio estimator (number of C-sections in sample)/(number of deliveries in sample)
2. Answer: There is a measurement error, because in the CM method literacy is
determined simply by asking the head of the household, whose response can be biased or wrong.
The error can be reduced by using the RT method and by training the personnel who collect the
data to administer the test.
3. Answer: The wording of the second question casts the ban in a favourable light and hence
influences the respondent. By saying "until our country's representatives can figure out what is
going on", it implies that something is perhaps wrong and that the authorities are looking
into it, which people may generally find hard to disagree with.
4. Answer: This is called the question-order effect. It may be easy for an American to
agree with question A. However, if A is asked first, he/she may feel compelled to
agree with B. A similar effect occurs in the reverse order, i.e. the answer to the second
question is constrained because a view has already been expressed in the first.
Typically such an issue can be avoided by asking a more general question first, followed
by specifics. For example: "Do you think a country should allow reporters from other
countries to come in and report news back to their home countries?" Then follow up
with specific questions.
5. Answer: A sample of MDP participants is convenient to obtain. However,
it may not represent the population of HR executives. Since only selected executives
are sent to MDPs at IIMA, there will be a sampling bias. The second approach has
the potential to yield a more representative sample. However, there may be issues with
non-response. A better way to collect data here may be to first create a frame of
companies that are of interest and sample the companies; once the companies are reached,
a frame of executives can be created and a sample drawn from it. In
general, it is difficult to collect a truly random sample. Convenience and judgement
may need to be used while collecting the data.
6. Answer: Suppose the colleges have unequal numbers of students, $N_1, N_2, \ldots, N_{10}$.
Let $N = N_1 + N_2 + \cdots + N_{10}$ denote the total number of students. For SRS, the probability
of picking a student should be $1/N$, i.e. the same for all students. However, in (i) the
probability is $\frac{1}{10 N_i}$ if the student is from college $i$. For SRS, the student needs to be picked
randomly from a list of all students obtained by pooling all colleges.
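The difference between the two selection schemes can be illustrated numerically. The Python sketch below uses hypothetical enrolment figures (the list sizes is an assumption made purely for illustration, not data from the problem) and compares each student's selection probability under the two-stage scheme with the equal probability obtained by pooling all students.

```python
from fractions import Fraction

# Hypothetical enrolments of the 10 colleges (illustrative numbers only).
sizes = [500, 800, 1000, 1200, 1500, 2000, 2500, 3000, 3500, 4000]

def two_stage_prob(college_size, n_colleges):
    """Selection probability of one student: pick a college, then a student within it."""
    return Fraction(1, n_colleges) * Fraction(1, college_size)

N = sum(sizes)
srs_prob = Fraction(1, N)   # every student equally likely under pooled SRS

two_stage = [two_stage_prob(n_i, len(sizes)) for n_i in sizes]

# Sanity check: the two-stage probabilities still add up to 1 over all students.
total = sum(p * n_i for p, n_i in zip(two_stage, sizes))

if __name__ == "__main__":
    print("Pooled SRS:", srs_prob)
    for n_i, p in zip(sizes, two_stage):
        print(f"College with {n_i} students: {p} per student")
    print("Total probability:", total)
```

With these figures a student in the smallest college is eight times as likely to be selected as a student in the largest one; the two schemes coincide only when all colleges have the same number of students.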
7. Answer: Clearly (i) is more convenient. However, it will work only if the heterogeneity
within any college is similar to that in the city as a whole and if the variation
across colleges is small. Method (ii) is preferable to (i) if there is variation
across colleges with respect to the variable of interest, since it would cover more than one
college and capture some of the variation that exists across colleges.
8. Answer: (i) Experimental (E), (ii) E, (iii) Observational (O), (iv) E, (v) O, (vi) O, (vii) O,
(viii) O
9. Answer: Since some payments may be large and some small, SRS may end up
selecting perhaps only small payments. To avoid such a situation,
one may do stratified sampling by size of payment. After stratification, systematic
sampling may be preferred to SRS, again to avoid selecting similar-sized
vouchers. Sampling error is present because the audit cannot
be done on all the vouchers but only on a representative sample. Non-sampling errors
here would mainly include any errors made in the audit process itself, since it may
not be possible to audit every voucher perfectly.
10. Answer: We need 100 samples from 50 sacks, so it is better to take 2 samples
from each sack to cover the heterogeneity that may exist across sacks. While selecting
the 2 samples from a sack, it is important to ensure that they are not close
to each other. A systematic sampling scheme may be better here: choose
one layer in the sack at random and then choose the layer that is 3 layers away.
For example, if the first sample comes from layer 1, take the second from layer 4. If
the first chosen layer is 5, take the second from layer 2 (wrapping around after layer 6).
11. Answer: (ii) Mean = 2000, SD = 577.35; (iv) Mean = 2000, SD = 408.25; (v) Mean = 1700,
SD = 552.27.
12. Answer: (a) 0.3, (b) 0.15, (c) 0.15, (d) 0.05, (e) 0.05, (f) 3.35, (g) 2.6275
13. Answer: The possible values of T are 10, 16, 19 and 25 with respective probabilities
0.1, 0.3, 0.3 and 0.3; E(T) = 19, V(T) = 21.6
14. Answer: (i) Rs. 200000 (ii) A statistical estimate should at least be accompanied by its
standard error to show the uncertainty in the estimate.
15. Answer: (vi) Best strategy: two gold and four silver, or four gold and two silver.
16. Answer: (iii)(c): (i) 0 (approx), (ii) 0.71 (approx), (iii) 1 (approx)
17. Answer: (i) 0.05 (ii) 0.5 (iii) 0.25 (iv) 0.125 (v) 9804 (approx. Assuming production
stops after the 8 hour shift)
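18. Answer: (i) Vendor's risk: No (probability of rejection = 0.25). Consumer's risk: No
(probability of acceptance = 0.55). (ii) 0.28 (the chance of rejecting such a lot is about 0.72)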