Chapter 1
OBTAINING DATA
Introduction
Statistics may be defined as the science that deals with the collection, organization,
presentation, analysis, and interpretation of data in order to be able to draw judgments or
conclusions that help in the decision-making process. The two parts of this definition correspond
to the two main divisions of Statistics. These are Descriptive Statistics and Inferential Statistics.
Descriptive Statistics, which is referred to in the first part of the definition, deals with the
procedures that organize, summarize and describe quantitative data. It seeks merely to describe
data. Inferential Statistics, implied in the second part of the definition, deals with making a
judgment or a conclusion about a population based on the findings from a sample that is taken
from the population.
Statistical Terms
Before proceeding to the discussion of the different methods of obtaining data, let us first
define some statistical terms:
Population or Universe refers to the totality of objects, persons, places, or things used in a
particular study. It comprises all members of a particular group of objects (items) or people
(individuals), etc., which are the subjects or respondents of a study.
Sample is any subset of a population, that is, a few members of a population.
Data are facts, figures and information collected on some characteristics of a population or
sample. These can be classified as qualitative or quantitative data.
Ungrouped (or raw) data are data which are not organized in any specific way. They are simply
the collection of data as they are gathered.
Grouped Data are raw data organized into groups or categories with corresponding
frequencies. Organized in this manner, the data are referred to as a frequency distribution.
Parameter is a descriptive measure of a characteristic of a population.
Statistic is a descriptive measure of a characteristic of a sample.
Constant is a characteristic or property of a population or sample which is common to all
members of the group.
Variable is a measure or characteristic or property of a population or sample that may have a
number of different values. It differentiates a particular member from the rest of the group. It is
the characteristic or property that is measured, controlled, or manipulated in research. Variables
differ in many respects, most notably in the role they are given in the research and in the type of
measures that can be applied to them.
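The terms above can be illustrated with a small sketch. The scores below are hypothetical values invented for illustration, and treating the first four scores as the sample is only a convenient assumption, not a recommended sampling method.

```python
from collections import Counter
from statistics import mean

# Hypothetical ungrouped (raw) data: quiz scores for a population of 10 students.
population_scores = [7, 8, 8, 9, 7, 10, 8, 9, 7, 8]

# Grouped data: each distinct score paired with its frequency
# (a frequency distribution).
frequency_distribution = dict(sorted(Counter(population_scores).items()))

# Parameter: a descriptive measure computed from the whole population.
population_mean = mean(population_scores)

# Statistic: the same measure computed from a sample
# (here, simply the first four scores, for illustration only).
sample_scores = population_scores[:4]
sample_mean = mean(sample_scores)

print(frequency_distribution)
print(population_mean, sample_mean)
```

Here the score is the variable, since it differs from student to student, while anything shared by all ten students (for example, taking the same quiz) would be a constant.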
1.1 Methods of Data Collection
Collection of data is the first step in conducting a statistical inquiry. It refers to data
gathering, a systematic method of collecting and measuring data from different sources of
information in order to provide answers to relevant questions. This involves acquiring information
from published literature, surveys through questionnaires or interviews, experimentation, documents
and records, tests or examinations, and other forms of data-gathering instruments. The person
who conducts the inquiry is an investigator, the one who helps in collecting information is an
enumerator, and information is collected from a respondent. Data can be primary or secondary.
According to Wessel, “Data collected in the process of investigation are known as primary data.”
These are collected for the investigator’s use from the primary source. Secondary data, on the
other hand, are collected by another organization for its own use but are also obtained by the
investigator for his or her use. According to M.M. Blair, “Secondary data are those already in existence for
some other purpose than answering the question in hand.”
In the field of engineering, the three basic methods of collecting data are the retrospective
study, the observational study, and the designed experiment. A retrospective study uses
a population or sample of historical data that has been archived over some period of
time. It may involve a significant amount of data, but those data may contain relatively little useful
information about the problem: some of the relevant data may be missing, recording or
transcription errors may be present, or other important data may not have been gathered and
archived. As a result, statistical analysis of historical data may identify interesting
phenomena, but solid and reliable explanations of them are difficult to obtain.
In an observational study, however, the process or population is observed and disturbed as little as
possible, and the quantities of interest are recorded. In a designed experiment, deliberate or
purposeful changes are made to the controllable variables of the system or process. The resulting
system output data are observed, and an inference or decision is made about which variables are
responsible for the observed changes in output performance. Experiments designed
with basic principles such as randomization are needed to establish cause-and-effect
relationships. Much of what we know in the engineering and physical-chemical sciences is
developed through testing or experimentation. In engineering, there are problem areas with no
scientific or engineering theory that is directly or completely applicable, so experimentation and
observation of the resulting data are the only way to solve them. Even when there is a good
underlying scientific theory to explain the phenomena of interest, tests or experiments are
almost always necessary to confirm the applicability and validity of the theory in
a specific situation or environment. Designed experiments are very important in engineering
design and development and in the improvement of manufacturing processes in which statistical
thinking and statistical methods play an important role in planning, conducting, and analyzing the
data. (Montgomery, et al., 2018)
1.2 Planning and Conducting Surveys
A survey is a method of asking respondents some well-constructed questions. It is an efficient
way of collecting information and is easy to administer, and a wide variety of information can be
collected. The researcher can stay focused on the questions that are of interest and
are necessary to the statistical inquiry or study.
However, surveys depend on the respondent's honesty, motivation, memory, and ability to
respond, and answers may sometimes yield vague data. Surveys can be done through
face-to-face interviews or self-administered through the use of questionnaires. The advantages of
face-to-face interviews include fewer misunderstood questions, fewer incomplete responses, higher
response rates, and greater control over the environment in which the survey is administered;
also, the researcher can collect additional information if any of the respondents’ answers need
clarifying. The disadvantages of face-to-face interviews are that they can be expensive and
time-consuming and may require a large staff of trained interviewers. In addition, the response can be
biased by the appearance or attitude of the interviewer.
Self-administered surveys are less expensive than interviews. They can be administered in large
numbers, do not require many interviewers, and put less pressure on respondents.
However, in self-administered surveys, respondents are more likely to stop participating
midway through the survey and cannot ask for questions to be clarified. Response rates are
also lower than in personal interviews.
When designing a survey, the following steps are useful:
1. Determine the objectives of your survey: What questions do you want to answer?
2. Identify the target population and sample: Whom will you interview? Who will be the
respondents? What sampling method will you use?
3. Choose an interviewing method: face-to-face interview, phone interview, self-
administered paper survey, or internet survey.
4. Decide what questions you will ask in what order, and how to phrase them.
5. Conduct the interview and collect the information.
6. Analyze the results by making graphs and drawing conclusions.
In choosing the respondents, sampling techniques are necessary. Sampling is the process of
selecting units (e.g., people, organizations) from a population of interest. A sample must be
representative of the target population. The target population is the entire group a researcher is
interested in; the group about which the researcher wishes to draw conclusions.
There are two ways of selecting a sample: non-probability sampling and
probability sampling.
Non-Probability Sampling
Non-probability sampling is also called judgment or subjective sampling. This method is
convenient and economical, but the inferences made from the findings are less reliable.
The most common types of non-probability sampling are convenience sampling, purposive
sampling and quota sampling.
In convenience sampling, the researcher uses a means of obtaining information from
respondents that is convenient for the researcher but can introduce bias.
In purposive sampling, the selection of respondents is predetermined according to the
characteristic of interest made by the researcher. Randomization is absent in this type of
sampling.
There are two types of quota sampling: proportional and non-proportional. In proportional quota
sampling, the major characteristics of the population are represented by sampling a proportional
amount of each.
For instance, if you know the population has 40% women and 60% men, and you want a
total sample size of 100, you will continue sampling until you have 40 women and 60 men, and then
you will stop.
Non-proportional quota sampling is a bit less restrictive. In this method, a minimum number of
sampled units in each category is specified, and the researcher is not concerned with having
numbers that match the proportions in the population.
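The proportional quota example (40% women, 60% men, total sample of 100) can be sketched as follows. The function name `proportional_quota_sample` and the stream of respondents are hypothetical, introduced only for illustration; respondents are accepted in the order they are encountered, which is exactly why quota sampling is not random.

```python
def proportional_quota_sample(respondents, quotas):
    """Accept respondents in arrival order until every category quota is filled.

    respondents: iterable of (name, category) pairs, in the order encountered.
    quotas: dict mapping category -> number of respondents wanted.
    """
    remaining = dict(quotas)
    sample = []
    for name, category in respondents:
        if remaining.get(category, 0) > 0:
            sample.append((name, category))
            remaining[category] -= 1
        if all(count == 0 for count in remaining.values()):
            break  # every quota is met, so sampling stops
    return sample


# Hypothetical stream of respondents (every third one is a woman).
stream = [(f"r{i}", "woman" if i % 3 == 0 else "man") for i in range(400)]

# 40% and 60% of a total sample size of 100.
quotas = {"woman": 40, "man": 60}
sample = proportional_quota_sample(stream, quotas)

women = sum(1 for _, category in sample if category == "woman")
men = sum(1 for _, category in sample if category == "man")
print(len(sample), women, men)
```

A non-proportional variant would keep the same loop but set each quota to a chosen minimum rather than to a share of the total.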
Probability Sampling
In probability sampling, every member of the population is given an equal chance to be selected
as a part of the sample. There are several probability techniques. Among these are simple
random sampling, stratified sampling and cluster sampling.
Simple Random Sampling
Simple random sampling is the basic sampling technique where a group of subjects (a sample)
is selected for study from a larger group (a population). Each individual is chosen entirely by
chance and each member of the population has an equal chance of being included in the
sample. Every possible sample of a given size has the same chance of selection; i.e. each
member of the population is equally likely to be chosen at any stage in the sampling process.
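A minimal sketch of simple random sampling, assuming the population can be listed in full. The helper name `simple_random_sample` and the numbered population are hypothetical; the standard-library call `random.Random.sample` draws without replacement, so every subset of the requested size is equally likely.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct members from the population, each subset equally likely."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(population, n)


# Hypothetical population: members numbered 1 to 100.
population = list(range(1, 101))
sample = simple_random_sample(population, 10, seed=1)
print(sample)
```

Fixing the seed is only for reproducibility in this illustration; in an actual study the draw would normally not be repeated with a chosen seed.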
Stratified Sampling
There may often be factors which divide up the population into sub-populations (groups / strata),
and the measurement of interest may vary among the different sub-populations. This has to be
accounted for when a sample is selected from the population in order to obtain a sample that is
representative of the population. This is achieved by stratified sampling.
A stratified sample is obtained by taking samples from each stratum or sub-group of a
population. When a sample is to be taken from a population with several strata, the proportion of
each stratum in the sample should be the same as in the population.
Stratified sampling techniques are generally used when the population is heterogeneous, or
dissimilar, where certain homogeneous, or similar, sub-populations can be isolated (strata).
Simple random sampling is most appropriate when the entire population from which the sample
is taken is homogeneous. Some reasons for using stratified sampling over simple random
sampling are:
1. the cost per observation in the survey may be reduced;
2. estimates of the population parameters may be wanted for each subpopulation;
3. increased accuracy at given cost.
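The idea that each stratum should appear in the sample in the same proportion as in the population can be sketched as below. The function `stratified_sample` and the year-level strata are hypothetical; the sketch assumes proportional allocation with simple rounding, which may not add up exactly to the requested total for other stratum sizes.

```python
import random


def stratified_sample(strata, total_n, seed=None):
    """Proportional allocation: sample each stratum in proportion to its size.

    strata: dict mapping stratum name -> list of members.
    """
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Share of the sample given to this stratum, rounded to a whole unit.
        k = round(total_n * len(members) / population_size)
        # Simple random sampling is applied within each stratum.
        sample[name] = rng.sample(members, k)
    return sample


# Hypothetical population of 1000 students in three strata.
strata = {
    "first_year": [f"F{i}" for i in range(500)],
    "second_year": [f"S{i}" for i in range(300)],
    "third_year": [f"T{i}" for i in range(200)],
}
sample = stratified_sample(strata, 100, seed=1)
print({name: len(members) for name, members in sample.items()})
```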
Cluster Sampling
Cluster sampling is a sampling technique where the entire population is divided into groups, or
clusters, and a random sample of these clusters is selected. All observations in the selected
clusters are included in the sample.
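Cluster sampling can be sketched in the same style. The function `cluster_sample` and the class sections used as clusters are hypothetical names for illustration; the sketch selects whole clusters at random and then keeps every member of each selected cluster.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly select whole clusters and include every member of each one.

    clusters: dict mapping cluster name -> list of members.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = [member for name in chosen for member in clusters[name]]
    return chosen, sample


# Hypothetical population: five class sections of four students each.
clusters = {
    f"section_{s}": [f"{s}{i}" for i in range(1, 5)] for s in "ABCDE"
}
chosen, sample = cluster_sample(clusters, 2, seed=1)
print(chosen, sample)
```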
1.3 Planning and Conducting Experiments: Introduction to Design of Experiments
The products and processes in the engineering and scientific disciplines are mostly derived from
experimentation. An experiment is a series of tests conducted in a systematic manner to
increase the understanding of an existing process or to explore a new product or process.
Design of Experiments, or DOE, is a tool to develop an experimentation strategy that maximizes
learning using minimum resources. Design of Experiments is widely used by
engineers and scientists in improving existing processes by maximizing the yield and
decreasing the variability, or in developing new products and processes. It is a technique used
to identify the "vital few" factors in the most efficient manner and then direct the process to its
best setting to meet the ever-increasing demand for improved quality and increased productivity.
The methodology of DOE ensures that all factors and their interactions are systematically
investigated, resulting in reliable and complete information. There are five stages to be carried
out for the design of experiments. These are planning, screening, optimization, robustness
testing and verification.
1. Planning
It is important to plan the course of experimentation carefully before embarking upon the
process of testing and data collection. At this stage, the objectives of conducting
the experiment or investigation are identified, and the time and resources available to achieve the
objectives are assessed. Individuals from different disciplines related to the product or process should
compose the team that will conduct the investigation. They are to identify possible factors to
investigate and the most appropriate responses to measure. A team approach promotes synergy
that gives a richer set of factors to study and thus a more complete experiment. Experiments
which are carefully planned lead to increased understanding of the product or process.
Well-planned experiments are easy to execute and analyze using the available statistical
software.
2. Screening
Screening experiments are used to identify the important factors that affect the process under
investigation out of the large pool of potential factors. The screening process eliminates unimportant
factors so that attention can be focused on the key factors. Screening experiments are usually efficient
designs which require few executions and focus on the vital factors and not on interactions.
3. Optimization
After narrowing down the important factors affecting the process, the next step is to determine the
best settings of these factors to achieve the objectives of the investigation. The objectives may be to
increase yield, to decrease variability, or to find settings that achieve both at the same time,
depending on the product or process under investigation.
4. Robustness Testing
Once the optimal settings of the factors have been determined, it is important to make the
product or process insensitive to variations resulting from changes in factors that affect the
process but are beyond the control of the analyst. Such factors are referred to as noise or
uncontrollable factors that are likely to be experienced in the application environment. It is
important to identify such sources of variation and take measures to ensure that the product or
process is made robust or insensitive to these factors.
5. Verification
This final stage involves validation of the optimum settings by conducting a few follow-up
experimental runs. This is to confirm that the process functions as expected and all objectives
are achieved.
Chapter 2
Probability
Introduction
Probability is simply how likely an event is to happen. “The chance of rain today is 50%” is a
statement that expresses our belief about the possibility of rain. The likelihood of an outcome
is measured by assigning a number from the interval [0, 1] or a percentage from 0 to 100%.
The higher the number, the more likely the event is to happen. A probability of zero
(0) indicates that the outcome is impossible, while a probability of one (1)
indicates that the outcome is certain to occur.
This module intends to discuss the concept of probability for discrete sample spaces, its
application, and ways of solving the probabilities of different statistical data.
Probability
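When all outcomes are equally likely, the probability of an event is the number of ways the event
can occur divided by the total number of possible outcomes.
For example, the probability of flipping a coin and getting heads is ½, because there is 1 way of
getting a head and the total number of possible outcomes is 2 (a head or a tail). We write
P(heads) = ½.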
• The probability of something which is certain to happen is 1.
• The probability of something which is impossible to happen is 0.
• The probability of something not happening is 1 minus the probability that it will happen.
Experiment – any process that generates a set of data.
Event – a set of possible outcomes of a probability experiment. An event can consist of one outcome
or more than one outcome.
Simple event – an event with one outcome.
Compound event – an event with more than one outcome.
2.1 Sample Space and Relationships among Events
Sample space is the set of all possible outcomes or results of a random experiment. The sample
space is represented by the letter S. Each outcome in the sample space is called an element of that
set. An event is a subset of this sample space and is represented by the letter E. This can be
illustrated in a Venn Diagram. In Figure 2.1, the sample space is represented by the rectangle
and the events by the circles inside the rectangle.
The events A and B (in a to c) and A, B and C (in d and e) are all subsets of the sample space S.
Figure 2.1 Venn diagrams of sample space with events (adapted from Montgomery et al., 2003)
For example, if a die is rolled, the sample space is {1, 2, 3, 4, 5, 6}. An event can be
{1, 3, 5}, the set of odd numbers. Similarly, when a coin is tossed twice, the sample
space is {HH, HT, TH, TT}.
Difference between Sample Space and Events
As discussed at the beginning, the sample space is the set of all possible outcomes of an experiment,
and an event is a subset of the sample space. Let us try to understand this with a few examples. What
happens when we toss a coin three times? If a coin is tossed three times, we get the following
combinations:
HHH, HHT, HTH, THH, TTH, THT, HTT and TTT
All these are the outcomes of the experiment of tossing a coin three times. Hence, we can say
the sample space is the set given by
S = {HHH, HHT, HTH, THH, TTH, THT, HTT, TTT}
Now, suppose the event is the set of outcomes in which there are exactly two heads. The
outcomes with exactly two heads are HHT, HTH and THH; hence the event is given by
E = {HHT, HTH, THH}
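The sample space and event for three coin tosses can be enumerated mechanically. This is a small sketch using the standard-library `itertools.product`; the variable names are chosen here for illustration.

```python
from itertools import product

# Sample space S: every sequence of H/T over three tosses.
sample_space = {"".join(outcome) for outcome in product("HT", repeat=3)}

# Event E: outcomes with exactly two heads.
event_two_heads = {outcome for outcome in sample_space if outcome.count("H") == 2}

print(sorted(sample_space))
print(sorted(event_two_heads))

# Every element of E is in S, so E is a subset of S.
print(event_two_heads <= sample_space)
```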
We can clearly see that each element of set E is in set S, so E is a subset of S. There can be
more than one event. In this case, we can have an event of getting exactly one tail or an event of
getting exactly one head. If we have more than one event, we can represent these events by E1,
E2, E3, etc. We can have more than one event for a sample space, but there is one and only
one sample space for an event. If the events E1, E2, E3, …, En together cover all the possible
outcomes of the sample space, then we have
S = E1 ∪ E2 ∪ E3 ∪ … ∪ En
We can understand this with the help of a simple example. Consider the experiment of rolling a
die. We have the sample space
S = {1, 2, 3, 4, 5, 6}
Now, if event E1 is getting an odd number as the outcome and E2 is getting an even number as the
outcome for this experiment, then we can represent E1 and E2 as the following sets:
E1 = {1, 3, 5}
E2 = {2, 4, 6}
So we have
{1, 3, 5} ∪ {2, 4, 6} = {1, 2, 3, 4, 5, 6}, or S = E1 ∪ E2
Hence, we can say that the union of events E1 and E2 is S.
Null space – a subset of the sample space that contains no elements, denoted by the
symbol Ø. It is also called the empty space.
Operations with Events
Intersection of events
The intersection of two events A and B is denoted by the symbol A ∩ B. It is the event containing
all elements that are common to A and B. This is illustrated as the shaded region in Figure 2.1
(c).
For example,
Let A = {3,6,9,12,15} and B = {1,3,5,8,12,15,17}; then A ∩ B = {3,12,15}
Let X = {q, w, e, r, t} and Y = {a, s, d, f}; then X ∩ Y = Ø, since X and Y have no elements in
common.
Mutually Exclusive Events
Two events are mutually exclusive if they have no elements in common.
This is illustrated in Figure 2.1 (b) where we can see that A ∩ B = Ø.
Union of Events
The union of events A and B is the event containing all the elements that belong to A or to B or to
both and is denoted by the symbol A ∪ B. The elements of A ∪ B may be listed or defined by the
rule A ∪ B = { x | x ∈ A or x ∈ B }.
For example,
Let A = {a,e,i,o,u} and B = {b,c,d,e,f}; then A ∪ B = {a,b,c,d,e,f,i,o,u}
Let X = {1,2,3,4} and Y = {3,4,5,6}; then X ∪ Y = {1,2,3,4,5,6}
Complement of an Event
The complement of an event A with respect to S is the set of all elements of S that are not in A
and is denoted by A’. The shaded region in Figure 2.1 (e) shows (A ∩ C)’.
For example,
Consider the sample space S = {dog, cow, bird, snake, pig}
Let A = {dog, bird, pig}; then A’ = {cow, snake}
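The event operations in this section map directly onto Python's built-in set type. The sketch below reuses the examples from the text; only the variable names are introduced here.

```python
# Intersection: elements common to both events.
A = {3, 6, 9, 12, 15}
B = {1, 3, 5, 8, 12, 15, 17}
intersection = A & B

# Mutually exclusive events: the intersection is the empty set.
X = set("qwert")
Y = set("asdf")
mutually_exclusive = X.isdisjoint(Y)

# Union: elements that belong to either event or to both.
vowels = set("aeiou")
letters = set("bcdef")
union = vowels | letters

# Complement: elements of the sample space S that are not in the event.
S = {"dog", "cow", "bird", "snake", "pig"}
animals = {"dog", "bird", "pig"}
complement = S - animals

print(sorted(intersection))
print(mutually_exclusive)
print(sorted(union))
print(sorted(complement))
```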
Probability of an Event
Sample space and events play important roles in probability. Once we have the sample space and
the event, we can easily find the probability of that event. We have the following formula to find the
probability of an event:
P(E) = n(E) / n(S)
Where,
n(S) represents the number of elements in the sample space of an experiment;
n(E) represents the number of elements in the event set; and
P(E) represents the probability of the event.
When probabilities are assigned to the outcomes in a sample space, each probability must lie
between 0 and 1 inclusive, and the sum of all probabilities assigned must be
equal to 1. Therefore,
0 ≤ P(E) ≤ 1 and P(S) = 1
Let us try to understand this with the help of an example. If a die is tossed, the sample space is
{1, 2, 3, 4, 5, 6}. In this set, the number of elements is 6. Now, if the event is the set
of odd numbers on a die, then we have {1, 3, 5} as the event. In this set, we have 3
elements. So, the probability of getting an odd number in a single throw of a die is given by
P(E) = n(E) / n(S) = 3/6 = 1/2
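The ratio of the number of elements in the event to the number of elements in the sample space can be checked for the die example with a short sketch. The helper name `probability` is hypothetical, and the calculation assumes all outcomes are equally likely; `fractions.Fraction` keeps the result exact.

```python
from fractions import Fraction


def probability(event, sample_space):
    """Classical probability n(E) / n(S) for equally likely outcomes."""
    if not event <= sample_space:
        raise ValueError("the event must be a subset of the sample space")
    return Fraction(len(event), len(sample_space))


S = {1, 2, 3, 4, 5, 6}
odd = {n for n in S if n % 2 == 1}

p_odd = probability(odd, S)
print(p_odd)                  # probability of an odd number
print(probability(S, S))      # a certain event
print(probability(set(), S))  # an impossible event
```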
2.2 Counting Rules Useful in Probability
Multiplicative Rule
Suppose you have j sets of elements, n1 in the first set, n2 in the second set, ..., and nj in
the jth set. Suppose you wish to form a sample of j elements by taking one element from each of
the j sets. The number of possible samples is then given by:
n1 × n2 × … × nj
Permutation Rule
An arrangement of elements in a distinct order is called a permutation. Given a single set
of n distinctly different elements, you wish to select k elements from the n and arrange them
within k positions. The number of different permutations of the n elements taken k at a time is
denoted P(n, k) and is equal to
P(n, k) = n! / (n − k)!
Partitions Rule
Suppose a single set of n distinctly different elements exists. You wish to partition
them into k sets, with the first set containing n1 elements, the second containing n2 elements,
..., and the kth set containing nk elements. The number of different partitions is
n! / (n1! n2! … nk!)
Where,
n1 + n2 + … + nk = n
The numerator gives the permutations of the n elements. The terms in the denominator remove
the duplicates due to the same assignments in the k sets (multinomial coefficients).
Combinations Rule
A sample of k elements is to be chosen from a set of n elements. The number of different
samples of k elements that can be selected from n is equal to
C(n, k) = n! / (k! (n − k)!)
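The four counting rules can be tried out with Python's standard `math` module, using the usual factorial formulas: n!/(n − k)! for permutations, n!/(n1! n2! … nk!) for partitions, and n!/(k!(n − k)!) for combinations. The wrapper function names and the small numeric inputs below are hypothetical, chosen only for illustration (`math.prod`, `math.perm` and `math.comb` require Python 3.8 or later).

```python
import math


def multiplicative_rule(set_sizes):
    """Number of samples formed by taking one element from each set."""
    return math.prod(set_sizes)


def permutations_count(n, k):
    """Ordered arrangements of k elements chosen from n: n! / (n - k)!."""
    return math.perm(n, k)


def partitions_count(group_sizes):
    """Ways to split n elements into groups of the given sizes."""
    result = math.factorial(sum(group_sizes))
    for size in group_sizes:
        result //= math.factorial(size)  # divides exactly at every step
    return result


def combinations_count(n, k):
    """Unordered samples of k elements chosen from n: n! / (k! (n - k)!)."""
    return math.comb(n, k)


print(multiplicative_rule([3, 4, 2]))  # one element from sets of size 3, 4, 2
print(permutations_count(5, 2))        # arrange 2 of 5 elements
print(partitions_count([2, 2, 1]))     # split 5 elements into groups of 2, 2, 1
print(combinations_count(5, 2))        # choose 2 of 5 elements
```

Note that the permutation count is the combination count multiplied by k!, since each unordered sample of k elements can be arranged in k! orders.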
2.3 Rules of Probability
Before discussing the rules of probability, we state the following definitions:
• Two events are mutually exclusive or disjoint if they cannot occur at the same time.
• The probability that Event A occurs, given that Event B has occurred, is called a
conditional probability. The conditional probability of Event A, given Event B, is denoted by the
symbol P (A|B).
• The complement of an event is the event not occurring. The probability that Event A will
not occur is denoted by P (A').
• The probability that Events A and B both occur is the probability of the intersection of A
and B. The probability of the intersection of Events A and B is denoted by P (A ∩ B). If Events A
and B are mutually exclusive, P(A ∩ B) = 0.
• The probability that Events A or B occur is the probability of the union of A and B. The
probability of the union of Events A and B is denoted by P(A ∪ B).
• If the occurrence of Event A changes the probability of Event B, then Events A and B are
dependent. On the other hand, if the occurrence of Event A does not change the probability of
Event B, then Events A and B are independent.
Rule of Addition
Rule 1: If two events A and B are mutually exclusive, then:
𝑃(𝐴 ∪ 𝐵) = 𝑃(𝐴) + 𝑃(𝐵)
Rule 2: If events A and B are not mutually exclusive events, then:
𝑃(𝐴 ∪ 𝐵) = 𝑃(𝐴) + 𝑃(𝐵) − 𝑃(𝐴 ∩ 𝐵)
Example 1. A student goes to the library. The probability that she checks out (a) a work of fiction
is 0.40, (b) a work of non-fiction is 0.30, and (c) both fiction and non-fiction is 0.20. What is the
probability that the student checks out a work of fiction, non-fiction, or both?
Solution:
Let F = the event that the student checks out fiction;
Let N = the event that the student checks out non-fiction.
Then, based on the rule of addition:
𝑃(𝐹 ∪ 𝑁) = 𝑃(𝐹) + 𝑃(𝑁) − 𝑃(𝐹 ∩ 𝑁)
𝑃(𝐹 ∪ 𝑁) = 0.40 + 0.30 − 0.20 = 0.50
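The rule of addition can be checked numerically for the library example. The function name `union_probability` is hypothetical; exact fractions are used instead of floating-point numbers so that the arithmetic is not affected by rounding.

```python
from fractions import Fraction


def union_probability(p_a, p_b, p_both=0):
    """Rule of addition: P(A or B) = P(A) + P(B) - P(A and B).

    For mutually exclusive events, p_both is 0 and Rule 1 is recovered.
    """
    return p_a + p_b - p_both


p_fiction = Fraction(40, 100)
p_nonfiction = Fraction(30, 100)
p_both = Fraction(20, 100)

p_either = union_probability(p_fiction, p_nonfiction, p_both)
print(p_either, float(p_either))
```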
Rule of Multiplication
Rule 1: When two events A and B are independent, then:
𝑃(𝐴 ∩ 𝐵) = 𝑃(𝐴)𝑃(𝐵)
Dependent - Two outcomes are said to be dependent if knowing that one of the outcomes has
occurred affects the probability that the other occurs.
Conditional Probability - The conditional probability of an event B in relationship to an event A is
the probability that event B occurs given that event A has already occurred. The probability is
denoted by 𝑃(𝐵|𝐴).
Rule 2: When two events are dependent, the probability of both occurring is:
𝑃(𝐴 ∩ 𝐵) = 𝑃(𝐴)𝑃(𝐵|𝐴)
Example 1. A day’s production of 850 manufactured parts contains 50 parts that do not meet
customer requirements. Two parts are selected randomly without replacement from the batch.
What is the probability that the second part is defective given that the first part is defective?
Solution:
Let A = event that the first part selected is defective
Let B = event that the second part selected is defective.
P (B|A) =?
If the first part is defective, prior to selecting the second part, the batch contains 849
parts, of which 49 are defective, therefore
P (B|A) = 49/849
Example 2. An urn contains 6 red marbles and 4 black marbles. Two marbles are drawn without
replacement from the urn. What is the probability that both of the marbles are black?
Solution:
Let A = the event that the first marble is black;
and let B = the event that the second marble is black.
We know the following:
• In the beginning, there are 10 marbles in the urn, 4 of which are black. Therefore, P (A) =
4/10.
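• After the first selection, if the first marble drawn is black, there are 9 marbles in the urn, 3 of
which are black. Therefore, P (B|A) = 3/9.
Therefore, based on the rule of multiplication:
P (A ∩ B) = P (A) P (B|A) = (4/10) (3/9) = 12/90 = 2/15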
Example 3. Two cards are selected without replacement from a standard pack of 52 cards. What is
the probability that they are both queens?
Solution:
Let A = the event that the first card is a queen
Let B = the event that the second card is also a queen
We require P (A ∩ B). Notice that these events are dependent because the probability that the
second card is a queen depends on whether or not the first card is a queen.
P (A ∩ B) = P (A) P (B|A)
P (A) = 4/52 = 1/13 and P (B|A) = 3/51
P (A ∩ B) = (1/13) (3/51) = 1/221 ≈ 0.004525
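The rule of multiplication for dependent events can be cross-checked by brute force: list every ordered pair of draws without replacement and count the favorable ones. This is a sketch for the urn and card examples, assuming a standard 52-card deck; the variable names are chosen here for illustration.

```python
from fractions import Fraction
from itertools import permutations

# Urn example: 6 red and 4 black marbles, two drawn without replacement.
urn = ["R"] * 6 + ["B"] * 4
# Ordered pairs of distinct marble positions (10 * 9 = 90 pairs).
marble_draws = list(permutations(range(len(urn)), 2))
both_black = sum(1 for i, j in marble_draws if urn[i] == "B" and urn[j] == "B")
p_both_black = Fraction(both_black, len(marble_draws))

# Card example: two cards drawn without replacement from 52 cards.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(rank, suit) for rank in ranks for suit in suits]
# Ordered pairs of distinct cards (52 * 51 = 2652 pairs).
card_draws = list(permutations(deck, 2))
both_queens = sum(1 for a, b in card_draws if a[0] == "Q" and b[0] == "Q")
p_both_queens = Fraction(both_queens, len(card_draws))

print(p_both_black)   # agrees with (4/10) * (3/9)
print(p_both_queens)  # agrees with (4/52) * (3/51)
```

Counting ordered pairs mirrors the formula P(A ∩ B) = P(A) P(B|A): the number of choices for the first draw is multiplied by the number of choices left for the second.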
Rule of Subtraction
The probability that event A will occur is equal to 1 minus the probability that event A will not
occur.
𝑃(𝐴) = 1 − 𝑃(𝐴′)
Example 1. The probability that Bill will not graduate from college is 0.8. What is the probability
that Bill will graduate from college?
Solution:
Let A = the event that Bill graduates from college, so that 𝑃(𝐴′) = 0.8.
𝑃(𝐴) = 1 − 0.8 = 0.2