LEARNING MODULE SURIGAO STATE COLLEGE OF TECHNOLOGY
Module no. 1
Obtaining Data
Topic: 1.1. Methods of Data Collection
1.1.1 Retrospective Study
1.1.2 Observational Study
1.1.3 Designed Experiments
1.2. Planning and Conducting Surveys
1.2.1 Sampling Methods
1.2.2 Sources of Bias in Sampling and Surveys
1.3. Planning and Conducting Experiments: Introduction to Design of
Experiments
1.3.1 Strategy of Experimentation
1.3.2 Mechanistic and Empirical Model
Time Frame: 2 hours
Introduction:
Historically, measurements were obtained from a sample of people and generalized to a
population, and the terminology has remained. Sometimes the data are all of the
observations in the population. This results in a census. However, in the engineering
environment, the data are almost always a sample that has been selected from the
population. Three basic methods of collecting data are
• A retrospective study using historical data
• An observational study
• A designed experiment
An effective data-collection procedure can greatly simplify the analysis and lead to
improved understanding of the population or process that is being studied. We now
consider some examples of these data-collection methods.
Objectives:
At the end of this topic, the students should be able to
1. Discuss the different methods that engineers use to collect data;
2. Describe the different methods of sampling in planning and conducting surveys;
3. Identify the advantages that designed experiments have in comparison to other
methods of collecting engineering data.
MATH 114 – ENGINEERING DATA ANALYSIS (ENGR. VERNON V. LIZA) 1
Pre – Test
Module 1 – Obtaining Data
Name: Subject:
Course/Section: Date:
Direction: Read the questions carefully.
1. What are three methods of collecting data?
2. What are the differences between population and sample?
3. What is the difference between a mechanistic model and an empirical model?
Learning Activities:
1.1 Methods of Data Collection
1.1.1 Retrospective Study
Montgomery, Peck, and Vining (2012) describe an acetone-butyl alcohol distillation
column for which the concentration of acetone in the distillate (the output product
stream) is an important variable. (A distillation column separates a liquid mixture
into its component parts, or fractions, based on the differences in their volatilities.)
Factors that may affect the distillate are the reboil temperature, the condensate
temperature, and the reflux rate. Production personnel obtain and archive the
following records:
• The concentration of acetone in an hourly test sample of output product
• The reboil temperature log, which is a record of the reboil temperature over time
• The condenser temperature controller log
• The nominal reflux rate each hour
The reflux rate should be held constant for this process. Consequently, production
personnel change this very infrequently.
A retrospective study would use either all or a sample of the historical process data
archived over some period of time. The study objective might be to discover how the
two temperatures and the reflux rate affect the acetone concentration in the output
product stream. However, this type of study presents some problems:
1. We may not be able to see the relationship between the reflux rate and acetone
concentration because the reflux rate did not change much over the historical
period.
2. The archived data on the two temperatures (which are recorded almost
continuously) do not correspond perfectly to the acetone concentration
measurements (which are made hourly). It may not be obvious how to construct
an approximate correspondence.
3. Production maintains the two temperatures as closely as possible to desired
targets or set points. Because the temperatures change so little, it may be
difficult to assess their real impact on acetone concentration.
4. In the narrow ranges within which they do vary, the condensate temperature
tends to increase with the reboil temperature. Consequently, the effects of these
two process variables on acetone concentration may be difficult to separate.
As you can see, a retrospective study may involve a significant amount of data, but
those data may contain relatively little useful information about the problem.
Furthermore, some of the relevant data may be missing, there may be transcription or
recording errors resulting in outliers (or unusual values), or data on other important
factors may not have been collected and archived.
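As an illustration of problem 2 above, one possible (and by no means the only) way to construct an approximate correspondence is to average the temperature readings logged during the hour before each acetone measurement. The sketch below assumes a hypothetical log with one reading every 5 minutes; all values, names, and the 60-minute window are invented for illustration.

```python
from statistics import mean

# Hypothetical reboil temperature log: one reading every 5 minutes,
# stored as (minutes since start, temperature).
temperature_log = [(m, 150.0 + 0.01 * m) for m in range(0, 180, 5)]

# Hypothetical hourly acetone concentrations: (minutes since start, value).
acetone_samples = [(60, 0.91), (120, 0.93), (180, 0.92)]


def window_average(log, end_minute, window=60):
    """Average the readings logged in the `window` minutes before `end_minute`."""
    values = [temp for minute, temp in log
              if end_minute - window <= minute < end_minute]
    return mean(values) if values else None


# One row per hourly test sample: (minute, average temperature, concentration).
aligned = [(minute, window_average(temperature_log, minute), conc)
           for minute, conc in acetone_samples]

for row in aligned:
    print(row)
```

Other choices, such as taking the reading closest in time to each test sample, are equally defensible; the point is that the analyst has to pick a rule, and the conclusions may depend on it.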
1.1.2 Observational Study
In an observational study, the engineer observes the process or population, disturbing it
as little as possible, and records the quantities of interest. Because these studies are
usually conducted for a relatively short time period, sometimes variables that are not
routinely measured can be included. In the distillation column, the engineer would
design a form to record the two temperatures and the reflux rate when acetone
concentration measurements are made. It may even be possible to measure the input
feed stream concentrations so that the impact of this factor could be studied.
Generally, an observational study tends to solve problems 1 and 2 and goes a long
way toward obtaining accurate and reliable data. However, observational studies may
not help resolve problems 3 and 4.
1.1.3 Designed Experiments
In a designed experiment, the engineer makes deliberate or purposeful changes in the
controllable variables of the system or process, observes the resulting system output
data, and then makes an inference or decision about which variables are responsible for
the observed changes in output performance. The nylon connector example below
illustrates a designed experiment; that is, a deliberate change was made in the
connector’s wall thickness with the objective of discovering whether or not a stronger
pull-off force could be obtained. Experiments designed with basic principles such as
randomization are needed to establish cause-and-effect relationships.
Example:
Suppose that an engineer is designing a nylon connector to be used in an automotive
engine application. The engineer is considering establishing the design specification on
wall thickness at 3∕32 inch but is somewhat uncertain about the effect of this decision on
the connector pull-off force. If the pull-off force is too low, the connector may fail when it
is installed in an engine. Eight prototype units are produced and their pull-off forces
measured, resulting in the following data (in pounds): 12.6, 12.9, 13.4, 12.3, 13.6, 13.5,
12.6, 13.1. As we anticipated, not all of the prototypes have the same pull-off force. We
say that there is variability in the pull-off force measurements.
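The variability in the eight measurements can be summarized numerically. The short sketch below uses only the data given in the example; the choice of the mean, the sample standard deviation, and the range as summaries is ours.

```python
from statistics import mean, stdev

# Pull-off forces (in pounds) of the eight prototype connectors.
pull_off_force = [12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1]

average = mean(pull_off_force)
spread = stdev(pull_off_force)  # sample standard deviation (divides by n - 1)
force_range = max(pull_off_force) - min(pull_off_force)

print(f"mean = {average:.2f} lb, std dev = {spread:.3f} lb, "
      f"range = {force_range:.1f} lb")
```

The mean is 13.0 pounds and the sample standard deviation is about 0.48 pounds, so individual prototypes differ from the average by roughly half a pound.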
Much of what we know in the engineering and physical-chemical sciences is
developed through testing or experimentation. Designed experiments play a very
important role in engineering design and development and in the improvement of
manufacturing processes.
1.2 Planning and Conducting Surveys
Surveys are useful for describing the characteristics of a large population. Careful
planning helps ensure that the sample is representative, so that the results can be
used to draw conclusions and make important decisions.
Population
A population is the entire group of individuals, scores, measurements, etc. about which
we want information.
Sample
A sample is the part of the population from which we actually collect information. It
is used to draw conclusions about the whole.
Random Selection
Random selection is a process of gathering a representative sample for a particular
study. Random means that the people are chosen by chance; each person has the same
probability of being chosen.
1.2.1 Sampling Methods
There are two types of sampling methods:
Probability sampling involves random selection, allowing you to make statistical
inferences about the whole group.
Non-probability sampling involves non-random selection based on convenience
or other criteria, allowing you to easily collect initial data.
Probability sampling methods:
1. Simple Random Sampling – all members of a population have an equal chance of
being selected, which helps avoid selection bias. You can use tools like random
number generators or other techniques that are based entirely on chance when
conducting this type of sampling.
2. Systematic Sampling – similar to simple random sampling, but is usually slightly
easier to conduct. Every member of the population is listed with a number and
individuals are chosen at regular intervals instead of randomly generating
numbers.
3. Stratified Random Sampling – the population is divided into subgroups (strata) so
that subjects within the same subgroup share the same characteristics (e.g.
gender, age) then a sample is drawn from each.
4. Cluster Sampling – the population is divided into sections (clusters), and each
cluster should have characteristics similar to those of the whole population. Some
of the clusters are then randomly selected, and all members of the selected
clusters are included in the sample.
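The four probability sampling methods above can be sketched in a few lines of code. This is a minimal illustration, not a survey tool: the population of 100 numbered units, the two strata "A" and "B", the clusters of 10 units, and all function names are hypothetical.

```python
import random

rng = random.Random(42)  # fixed seed only so the sketch is repeatable

# Hypothetical sampling frame: 100 units, each with a stratum and a cluster.
population = [
    {"id": i, "stratum": "A" if i < 60 else "B", "cluster": i // 10}
    for i in range(100)
]


def simple_random_sample(units, n):
    """Every unit has the same chance of being selected."""
    return rng.sample(units, n)


def systematic_sample(units, n):
    """Pick a random starting point, then take units at regular intervals."""
    step = len(units) // n
    start = rng.randrange(step)
    return units[start::step][:n]


def stratified_sample(units, fraction):
    """Draw a simple random sample of the same fraction from each stratum."""
    sample = []
    for name in sorted({u["stratum"] for u in units}):
        members = [u for u in units if u["stratum"] == name]
        sample.extend(rng.sample(members, round(len(members) * fraction)))
    return sample


def cluster_sample(units, n_clusters):
    """Randomly select whole clusters and keep all of their members."""
    cluster_ids = sorted({u["cluster"] for u in units})
    chosen = set(rng.sample(cluster_ids, n_clusters))
    return [u for u in units if u["cluster"] in chosen]


print(len(simple_random_sample(population, 10)))
print([u["id"] for u in systematic_sample(population, 10)])
print(len(stratified_sample(population, 0.1)))
print(len(cluster_sample(population, 2)))
```

Note how the stratified sample guarantees that both strata are represented in proportion to their size, while the cluster sample contains only units from the two selected clusters.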
Non-probability sampling methods:
1. Convenience sampling – an easy and inexpensive way to gather initial data, in
which the individuals who happen to be most accessible to the researcher are
included. There is no way to tell whether the sample is representative of the
population, so it cannot produce generalizable results.
2. Voluntary response sampling – mainly based on ease of access. People
volunteer themselves (e.g. by responding to a public online survey) instead of the
researcher choosing participants and directly contacting them.
3. Purposive sampling – involves the researcher using their judgment to select a
sample that is most useful to the purposes of the research.
4. Snowball sampling – used to recruit participants via other participants if the
population is hard to find. Just like a snowball increasing in size (sample size),
the sampling technique can go on and on until the researcher has enough data to
analyze and draw conclusions.
1.2.2 Sources of Bias in Sampling and Surveys
The two common methods of collecting data that usually produce biased results are
1. Convenience Samples where there is selection of individuals that are easiest to
reach.
2. Voluntary Response Samples where respondents decide if they want to be
included in a survey.
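A small simulation can show why a voluntary response sample tends to be biased. The sketch below is built entirely on made-up assumptions: a population of 5000 satisfaction scores from 1 to 10, and response probabilities of 0.50 for dissatisfied people (score of 3 or less) and 0.05 for everyone else.

```python
import random
from statistics import mean

rng = random.Random(7)  # fixed seed only so the sketch is repeatable

# Hypothetical population: satisfaction scores from 1 (low) to 10 (high).
population = [rng.randint(1, 10) for _ in range(5000)]
true_mean = mean(population)

# Simple random sample: every person has the same chance of selection.
srs_mean = mean(rng.sample(population, 200))


def volunteers(score):
    """Assumed behavior: dissatisfied people respond far more often."""
    return rng.random() < (0.50 if score <= 3 else 0.05)


# Voluntary response sample: respondents decide whether to take part.
voluntary_mean = mean([s for s in population if volunteers(s)])

print(f"population mean         = {true_mean:.2f}")
print(f"simple random sample    = {srs_mean:.2f}")
print(f"voluntary response mean = {voluntary_mean:.2f}")
```

Under these assumptions the voluntary response mean falls well below the population mean, because the people who chose to respond are not typical of the whole population. Collecting more voluntary responses would not remove this bias.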
Often, physical laws (such as Ohm’s law and the ideal gas law) are applied to help
design products and processes. We are familiar with this reasoning from general laws to
specific cases. But it is also important to reason from a specific set of measurements
to more general cases. This reasoning is from a sample (such as the eight connectors)
to a population (such as the connectors that will be in the products that are sold to
customers). The reasoning is referred to as statistical
inference. See Figure 2.1. Clearly, reasoning based on measurements from some
objects to measurements on all objects can result in errors (called sampling errors).
However, if the sample is selected properly, these risks can be quantified and an
appropriate sample size can be determined.
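As a rough illustration of how such risks can be quantified, the sketch below computes the standard error of the mean for the eight connector measurements and a trial sample size. The target margin of 0.2 pound, the normal multiplier 1.96 (roughly 95% confidence), and the use of the sample standard deviation in place of the unknown population value are all assumptions made for this example.

```python
import math
from statistics import stdev

# Pull-off forces (in pounds) of the eight prototype connectors.
pull_off_force = [12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1]

n = len(pull_off_force)
s = stdev(pull_off_force)           # sample standard deviation
standard_error = s / math.sqrt(n)   # typical error of the sample mean

# Assumed target: estimate the mean to within +/- 0.2 lb with roughly 95%
# confidence, treating s as if it were the population standard deviation.
margin = 0.2
z = 1.96
n_required = math.ceil((z * s / margin) ** 2)

print(f"standard error = {standard_error:.3f} lb")
print(f"sample size for a margin of {margin} lb: {n_required}")
```

With these assumptions the standard error is about 0.17 pound, and about 22 connectors would be needed to reach the chosen margin. This calculation is only meaningful when the sample has been selected properly, for example by random selection.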
Figure 2.1 Statistical inference is one type of reasoning.
1.3 Planning and Conducting Experiments: Introduction to Design of Experiments
Experiments are used to study the performance of processes and systems.
Figure 3.1
The objectives of the experiment may include:
1. Determining which variables are most influential on the response y.
2. Determining where to set the influential x’s such that
• y is almost always near the desired nominal value
• variability in y is small
• the effects of the uncontrollable variables $z_1, z_2, \ldots, z_q$ are minimized
Experiments often involve several factors.
Example:
In a golf experiment, several factors may affect the golfer's score, such as the
following:
• Type of driver
• Type of ball
• Walking vs. riding
• Type of beverage
• Time of round
• Weather
• Type of golf spike
• Etc.
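1.3.1 Strategy of Experimentation
The general approach to planning and conducting the experiment is called the strategy
of experimentation. Three common strategies are:
1. Best-guess approach
• frequently used in practice
• often works reasonably well, because the experimenters often have a great deal of
technical or theoretical knowledge of the system
• disadvantages: if the initial best guess does not produce the desired results,
another guess must be made, which can take a long time; there is no guarantee that
the best solution has been found
2. One-factor-at-a-time (OFAT)
• used extensively in practice
• disadvantages: fails to consider interaction between the factors and is less
efficient than a factorial experiment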
Figure 3.2 OFAT
3. Factorial experiment: factors are varied together
• extremely important when an experiment involves several factors
• all possible combinations of the factors across their levels are used in the
design
• enables the experimenter to investigate the individual effects of each factor and
to determine whether the factors interact
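The difference between the one-factor-at-a-time and factorial strategies can be seen in a small sketch. The response values below are invented: two hypothetical factors A and B, each at a low (-1) and a high (+1) level, where the response improves only when both factors are high.

```python
from itertools import product

# Hypothetical responses for two factors A and B at levels -1 and +1.
# The response improves only when both factors are high (an interaction).
response = {(-1, -1): 10, (+1, -1): 10, (-1, +1): 10, (+1, +1): 20}


def ofat(start=(-1, -1)):
    """Vary one factor at a time, keeping a change only if it improves y."""
    best = start
    for index in range(2):
        trial = list(best)
        trial[index] = +1
        trial = tuple(trial)
        if response[trial] > response[best]:
            best = trial
    return best


def factorial():
    """Run every combination of levels and return the best setting."""
    return max(product((-1, +1), repeat=2), key=lambda run: response[run])


print("OFAT settles on     ", ofat(), "with y =", response[ofat()])
print("Factorial settles on", factorial(), "with y =", response[factorial()])
```

In this made-up case, changing either factor alone shows no improvement, so the OFAT search stays at the starting point, while the factorial design tests the combination where both factors are high and finds the better response.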
Figure 3.3 factorial design: two factors; each at two levels
Consider again the problem involving the choice of wall thickness for the nylon
connector. This is a simple illustration of a designed experiment. The engineer chose
two wall thicknesses for the connector and performed a series of tests to obtain pull-off
force measurements at each wall thickness.
Designed experiments offer a very powerful approach to studying complex systems,
such as the distillation column (section 1.1). This process has three factors—the two
temperatures and the reflux rate—and we want to investigate the effect of these three
factors on output acetone concentration. A good experimental design for this problem
must ensure that we can separate the effects of all three factors on the acetone
concentration. The specified values of the three factors used in the experiment are
called factor levels. Typically, we use a small number of levels such as two or three for
each factor. For the distillation column problem, suppose that we use two levels, "high"
and "low" (denoted +1 and −1, respectively), for each of the three factors. A very
reasonable experiment design strategy uses every possible combination of the factor
levels to form a basic experiment with eight different settings for the process. See Table
1.1 for this experimental design.
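The eight settings of such a design can be enumerated mechanically. The sketch below lists every combination of the low (-1) and high (+1) levels of the three factors; the order in which the runs are listed is an arbitrary choice made here and may differ from the order used in the table.

```python
from itertools import product

factors = ["reboil temperature", "condensate temperature", "reflux rate"]

# Every combination of low (-1) and high (+1) for the three factors.
design = list(product((-1, +1), repeat=len(factors)))

print("run  " + "  ".join(f"{name:>22}" for name in factors))
for run, levels in enumerate(design, start=1):
    print(f"{run:>3}  " + "  ".join(f"{level:>+22d}" for level in levels))
```

Each row is one setting of the process; with three factors at two levels each there are 2 × 2 × 2 = 8 rows.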
Figure 3.4 illustrates that this design forms a cube in terms of these high and low
levels. With each setting of the process conditions, we allow the column to reach
equilibrium, take a sample of the product stream, and determine the acetone
concentration. We then can draw specific inferences about the effect of these factors.
Such an approach allows us to proactively study a population or process.
An important advantage of factorial experiments is that they allow one to detect an
interaction between factors. Consider only the two temperature factors in the distillation
experiment. Suppose that the response concentration is poor when the reboil
temperature is low, regardless of the condensate temperature. That is, the condensate
temperature has no effect when the reboil temperature is low. However, when the reboil
temperature is high, a high condensate temperature generates a good response, but a
low condensate temperature generates a poor response. That is, the condensate
temperature changes the response when the reboil temperature is high. The effect of
condensate temperature depends on the setting of the reboil temperature, and these
two factors are said to interact in this case.
Figure 3.4 The factorial design for the distillation column.
If the four combinations of high
and low reboil and condensate temperatures were not tested, such an interaction would
not be detected.
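The interaction described above can be expressed as a simple calculation on the four temperature combinations. The response values in this sketch are invented to match the description (a poor response of 60 and a good response of 90); only the pattern matters.

```python
# Hypothetical acetone responses, keyed by (reboil level, condensate level),
# with -1 for low and +1 for high. The numbers are invented for illustration.
y = {(-1, -1): 60, (-1, +1): 60, (+1, -1): 60, (+1, +1): 90}

# Effect of condensate temperature at each reboil setting.
effect_at_low_reboil = y[(-1, +1)] - y[(-1, -1)]
effect_at_high_reboil = y[(+1, +1)] - y[(+1, -1)]

# Interaction: half the difference between the two conditional effects.
interaction = (effect_at_high_reboil - effect_at_low_reboil) / 2

print("condensate effect at low reboil: ", effect_at_low_reboil)
print("condensate effect at high reboil:", effect_at_high_reboil)
print("interaction effect:              ", interaction)
```

A nonzero interaction means the effect of one factor changes with the level of the other. Computing it requires all four combinations, which is exactly what the factorial design provides.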
We can easily extend the factorial strategy to more factors. Suppose that the
engineer wants to consider a fourth factor, type of distillation column. There are two
types: the standard one and a newer design. Figure 3.5 illustrates how all four factors—
reboil temperature, condensate temperature, reflux rate, and column design—could be
investigated in a factorial design. Because all four factors are still at two levels, the
experimental design can still be represented geometrically as a cube (actually, it’s a
hypercube). Notice that as in any factorial design, all possible combinations of the four
factors are tested. The experiment requires 16 trials.
Generally, if there are k factors and each has two levels, a factorial experimental
design will require $2^k$ runs. For example, with k = 4, the $2^4$ design in Figure 3.5 requires
16 tests. Clearly, as the number of factors increases, the number of trials required in a
factorial experiment increases rapidly. This quickly becomes unfeasible from the
viewpoint of time and other resources. Fortunately, with four to five or more factors, it is
usually unnecessary to test all possible combinations of factor levels. A fractional
factorial experiment is a variation of the basic factorial arrangement in which only a
subset of the factor combinations is actually tested. Figure 3.6 shows a fractional
factorial experimental design for the distillation column. The circled test combinations in
this figure are the only test combinations that need to be run. This experimental design
requires only 8 runs instead of the original 16; consequently it would be called a one-
half fraction.
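The run counts and one way of choosing a one-half fraction can be sketched as follows. The rule used here, keeping the runs in which the level of the fourth factor equals the product of the levels of the other three, is a common construction but is an assumption on our part; the source does not state which eight combinations are circled in the figure.

```python
from itertools import product


def runs_required(k, levels=2):
    """Number of runs in a full factorial with k factors."""
    return levels ** k


# Full factorial in four factors (A, B, C, D), each at -1 or +1.
full = list(product((-1, +1), repeat=4))

# Assumed one-half fraction: keep the runs with D = A * B * C.
half = [run for run in full if run[3] == run[0] * run[1] * run[2]]

for k in range(2, 8):
    print(f"k = {k}: {runs_required(k)} runs")
print("full design:", len(full), "runs; half fraction:", len(half), "runs")
```

The table of run counts shows how quickly the full factorial grows, which is the motivation for running only a carefully chosen fraction.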
Figure 3.5 A four-factor factorial experiment for the distillation column.
Figure 3.6 A fractional factorial experiment for the distillation column.
This is an excellent experimental design in which to study all four factors. It will provide
good information about the individual effects of the four factors and some information
about how these factors interact.
Factorial and fractional factorial experiments are used extensively by engineers and
scientists in industrial research and development, where new technology, products, and
processes are designed and developed and where existing products and processes are
improved.
Basic Principles
1. Randomization: running the trials in random order
• both the allocation of the experimental material and the order in which the
individual runs of the experiment are performed are randomly determined
2. Replication: an independent repeat run of each factor combination
• allows the experimenter to obtain an estimate of the experimental error
• gives a more precise estimate of the true mean response for one of the factor levels
• reflects sources of variability both between runs and within runs
• differs from repeated measurements, which are several measurements taken on the
same run
3. Blocking: a design technique
• used to reduce the variability transmitted from nuisance factors
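The three principles can be combined in a simple run plan. The sketch below is one possible arrangement, assuming a hypothetical nuisance factor "day": each of the eight settings of a three-factor, two-level design is run once per day (replication across two blocks), and the run order within each day is shuffled (randomization).

```python
import random
from itertools import product

rng = random.Random(2024)  # fixed seed only so the example is repeatable

# The eight settings of a three-factor, two-level factorial design.
settings = list(product((-1, +1), repeat=3))

# Blocking: each day is a block. Replication: every setting is run once in
# each block. Randomization: the run order within a block is shuffled.
blocks = {}
for day in ("day 1", "day 2"):
    order = settings[:]
    rng.shuffle(order)
    blocks[day] = order

for day, order in blocks.items():
    print(day)
    for run_number, levels in enumerate(order, start=1):
        print(f"  run {run_number}: {levels}")
```

Because every setting appears in every block, differences between days affect all settings equally and do not distort the comparison between factor levels.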
Since so much engineering work involves testing and experimentation, it is essential
that all engineers understand the basic principles of planning efficient and effective
experiments.
1.3.2 Mechanistic and Empirical Model
Models play an important role in the analysis of nearly all engineering problems. Much
of the formal education of engineers involves learning about the models relevant to
specific fields and the techniques for applying these models in problem formulation and
solution.
Mechanistic Model
A mechanistic model is built from our underlying knowledge of the basic physical
mechanism.
As a simple example, suppose that we are measuring the flow of current in a thin
copper wire. Our model for this phenomenon might be Ohm’s law:
$$I = E/R$$
where I is the current, E is the voltage, and R is the resistance.
Empirical Model
An empirical model uses our engineering and scientific knowledge of a phenomenon, but it
is not directly developed from our theoretical or first-principles understanding of the
underlying mechanism.
For instance, suppose that we are interested in the number average molecular
weight ($M_n$) of a polymer. Now we know that $M_n$ is related to the viscosity of the
material (V), and it also depends on the amount of catalyst (C) and the temperature (T)
in the polymerization reactor when the material is manufactured. The relationship
between $M_n$ and these variables is
$$M_n = f(V, C, T)$$
say, where the form of the function f is unknown. Perhaps a working model could be
developed from a first-order Taylor series expansion, which would produce a model of
the form
$$M_n = \beta_0 + \beta_1 V + \beta_2 C + \beta_3 T$$
where the β’s are unknown parameters. Now just as in Ohm’s law, this model will not
exactly describe the phenomenon, so we should account for the other sources of
variability that may affect the molecular weight by adding another term to the model;
therefore,
$$M_n = \beta_0 + \beta_1 V + \beta_2 C + \beta_3 T + \epsilon$$
is the model that we will use to relate molecular weight to the other three variables.
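In practice the unknown β’s of an empirical model are estimated from data. The sketch below illustrates the idea with ordinary least squares on simulated data; the least-squares method, the "true" parameter values, the ranges of V, C, and T, and the size of the random error are all assumptions chosen for illustration, not values from a real polymerization process.

```python
import random


def fit_linear(rows, y):
    """Least-squares estimates for y = b0 + b1*x1 + ... (normal equations)."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(p)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


rng = random.Random(0)                # fixed seed for repeatability
true_beta = [50.0, 2.0, -1.5, 0.3]    # made-up beta_0 .. beta_3

rows, y = [], []
for _ in range(40):
    V = rng.uniform(10, 20)           # viscosity (hypothetical range)
    C = rng.uniform(1, 5)             # amount of catalyst (hypothetical range)
    T = rng.uniform(150, 200)         # temperature (hypothetical range)
    eps = rng.gauss(0, 0.5)           # random error term
    rows.append((V, C, T))
    y.append(true_beta[0] + true_beta[1] * V + true_beta[2] * C
             + true_beta[3] * T + eps)

beta_hat = fit_linear(rows, y)
print("estimated betas:", [round(b, 3) for b in beta_hat])
```

The estimates are close to, but not exactly equal to, the values used to generate the data, because of the error term ε. For real work a statistics library would normally be used instead of hand-written elimination.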
Self-Evaluation:
1. Which of the three basic methods of data collection is the least useful? Why?
2. Which of the two basic sampling methods is more useful in conducting
surveys? Why?
3. What advantages does the designed experiment have over the other two methods
of collecting data?
Review of Concepts:
1. The three methods of data collection are
a. retrospective study
b. observational study
c. designed experiment
2. A population is the entire group of individuals, scores, measurements, etc. about
which we want information.
3. A sample is the part of the population from which we actually collect information
and which is used to draw conclusions about the whole.
4. The four types of probability sampling method:
a. Simple Random Sampling
b. Systematic Sampling
c. Stratified Random Sampling
d. Cluster Sampling
5. The four types of non-probability sampling method:
a. Convenience sampling
b. Voluntary response sampling
c. Purposive sampling
d. Snowball sampling
6. The three strategies of experimentation are
a. Best-guess approach
b. One-factor-at-a-time (OFAT)
c. Factorial experiment
7. The basic principles of conducting experiments (design of experiments) are
Randomization, replication and blocking.
8. A mechanistic model is built from our underlying knowledge of the basic physical
mechanism.
9. An empirical model uses our engineering and scientific knowledge of a
phenomenon, but it is not directly developed from our theoretical or
first-principles understanding of the underlying mechanism.
References:
Douglas C. Montgomery & George C. Runger. Applied Statistics And Probability
For Engineers. John Wiley & Sons; 7th ed. 2018.
Hongshik Ahn. Probability And Statistics For Sciences & Engineering with
Examples in R. Cognella, Inc.; 2nd ed. 2018.
Jay L. Devore. Probability and Statistics for Engineering and the Sciences.
Cengage Learning; 9th ed. 2016.