EDA - EDA lecture notes
Engineering (Batangas State University)
Downloaded by JOHN ALVIN SAPUNGAN (23-05385@g.batstate-u.edu.ph)
AMPARO, Faylon C.
ENGINEERING DATA ANALYSIS - MATH 403

CHAPTER 1: OBTAINING DATA

INTENDED LEARNING OUTCOMES
At the end of this module, it is expected that the students will be able to:
1. Demonstrate an understanding of the different methods of obtaining data.
2. Explain the procedures in planning and conducting surveys and experiments.

WHAT IS STATISTICS?
Statistics may be defined as the science that deals with the collection, organization, presentation, analysis, and interpretation of data in order to be able to draw judgments or conclusions that help in the decision-making process.

Real-life examples of statistics:
Stock Market Data Analysis
Weather Forecasting

DIVISION OF STATISTICS
Descriptive Statistics, which is referred to in the first part of the definition, deals with the procedures that organize, summarize and describe quantitative data. It seeks merely to describe data.
Inferential Statistics, implied in the second part of the definition, deals with making a judgment or a conclusion about a population based on the findings from a sample that is taken from the population.

STATISTICAL TERMS
Population or Universe
Sample
Data (Grouped or Ungrouped)
Parameter
Statistic
Constant
Variable

1.1 METHODS OF COLLECTION
Collection of data is the first step in conducting a statistical inquiry. It refers to data gathering: a systematic method of collecting and measuring data from different sources of information in order to provide answers to relevant questions.
There are several ways to collect data.
There are two types of data that you can collect.
There are three basic methods of collecting data.

WAYS TO COLLECT DATA
This involves acquiring information from published literature, surveys through questionnaires or interviews, experimentation, documents and records, tests or examinations, and other forms of data-gathering instruments.
The person who conducts the inquiry is an investigator, the one who helps in collecting information is an enumerator, and information is collected from a respondent.

TYPES OF DATA
Primary data - collected for the investigator's use from the primary source.
Secondary data - collected by some other organization for its own use, but the investigator also obtains it for his or her use. Secondary data are those already in existence for some purpose other than answering the question in hand.

BASIC METHODS OF COLLECTING DATA
A retrospective study uses the population or a sample of historical data that has been archived over some period of time.

In an observational study, the process or population is observed and disturbed as little as possible, and the quantities of interest are recorded.

In a designed experiment, deliberate or purposeful changes are made in the controllable variables of the system or process.

There are problem areas with no scientific or engineering theory that is directly or completely applicable, so experimentation and observation of the resulting data are the only way to solve them.

1.2 PLANNING AND CONDUCTING SURVEYS
A survey is a method of asking respondents some well-constructed questions. It is an efficient way of collecting information and is easy to administer, and a wide variety of information can be collected.
Face-to-Face Interviews or Self-Administered

DESIGNING A SURVEY
1. Determine the objectives of your survey: What questions do you want to answer?
2. Identify the target population and sample: Whom will you interview? Who will be the respondents? What sampling method will you use?
3. Choose an interviewing method: face-to-face interview, phone interview, self-administered paper survey, or internet survey.
4. Decide what questions you will ask, in what order, and how to phrase them.
5. Conduct the interview and collect the information.
6. Analyze the results by making graphs and drawing conclusions.

In choosing the respondents, sampling techniques are necessary. Sampling is the process of selecting units (e.g., people, organizations) from a population of interest. The sample must be representative of the target population. The target population is the entire group a researcher is interested in; that is, the group about which the researcher wishes to draw conclusions.

WAYS OF SELECTING SAMPLE
Non-probability sampling is also called judgment or subjective sampling. This method is convenient and economical, but the inferences made based on the findings are not so reliable.
Types of Non-Probability Sampling
In convenience sampling, the researcher uses a means of obtaining information from the respondents that is convenient for the researcher but can bias the results.
In purposive sampling, the selection of respondents is predetermined according to the characteristic of interest set by the researcher. Randomization is absent in this type of sampling.

There are two types of quota sampling: proportional and non-proportional. In proportional quota sampling, the major characteristics of the population are represented by sampling a proportional amount of each.

WAYS OF SELECTING SAMPLE
Probability sampling - every member of the population is given an equal chance to be selected as a part of the sample. There are several probability sampling techniques.

Types of Probability Sampling
Simple random sampling is the basic sampling technique where a group of subjects (a sample) is selected for study from a larger group (a population).
Stratified sampling - data are obtained by taking samples from each stratum or sub-group of a population. When a sample is to be taken from a population with several strata, the proportion of each stratum in the sample should be the same as in the population.
Cluster sampling - a sampling technique where the entire population is divided into groups, or clusters, and a random sample of these clusters is selected.

1.3 PLANNING AND CONDUCTING EXPERIMENTS: INTRODUCTION TO DESIGN OF EXPERIMENTS
An experiment is a series of tests conducted in a systematic manner to increase the understanding of an existing process or to explore a new product or process. Design of Experiments, or DOE, is a tool for developing an experiment. It is a technique for identifying the "vital few" factors in the most efficient manner and then directing the process to its best setting, to meet the ever-increasing demand for improved quality and increased productivity. It is an experimentation strategy that maximizes learning using minimum resources.

FIVE STAGES TO CARRY OUT DOE
1. Planning
This stage covers identification of the objectives of conducting the experiment or investigation, and assessment of the time and resources available to achieve the objectives.
2. Screening
Screening experiments are used to identify the important factors that affect the process under investigation out of the large pool of potential factors. The screening process eliminates unimportant factors so that attention is focused on the key factors.
3. Optimization
This stage determines the best settings of these factors to achieve the objectives of the investigation. The objectives may be to increase yield, to decrease variability, or to find settings that achieve both at the same time, depending on the product or process under investigation.
4. Robustness Testing
It is important to identify sources of variation and take measures to ensure that the product or process is made robust, or insensitive, to these factors.
5. Verification
This final stage involves validation of the optimum settings by conducting a few follow-up experimental runs. This is to confirm that the process functions as expected and that all objectives are achieved.

CHAPTER 2: PROBABILITY

INTENDED LEARNING OUTCOMES
At the end of this module, it is expected that the students will be able to:
1. Understand and describe sample spaces and events for random experiments.
2. Explain the concept of probability and its application to different situations.
3. Define and illustrate the different probability rules.
4. Solve for the probability of different statistical data.

WHAT IS PROBABILITY?
Probability is simply how likely an event is to happen.
EXAMPLE: "The chance of rain today is 50%" is a statement that quantifies our thoughts on the possibility of rain.
A higher number means the event is more likely to happen than a lower number does.

2.1 SAMPLE SPACE & RELATIONSHIP AMONG EVENTS
Sample space - the set of all possible outcomes or results of a random experiment. The sample space is represented by the letter S.
Element of the set - each outcome in the sample space.
Event - a subset of the sample space; it is represented by the letter E.
Venn Diagram
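As a sketch of the sample space and event definitions in Section 2.1, the block below enumerates the sample space S for a coin tossed three times and picks out the event E of exactly two heads. It assumes a fair coin, so every outcome is equally likely; the variable names are illustrative and not part of the notes.

```python
from fractions import Fraction
from itertools import product

# Sample space S: every outcome of tossing a coin three times.
sample_space = {"".join(toss) for toss in product("HT", repeat=3)}

# Event E: the outcomes with exactly two heads (a subset of S).
event = {outcome for outcome in sample_space if outcome.count("H") == 2}

# With equally likely outcomes (assumed), P(E) = |E| / |S|.
probability = Fraction(len(event), len(sample_space))

print(sorted(sample_space))
print(sorted(event))
print(probability)  # 3/8
```

Representing S and E as Python sets means the event operations (intersection, union, complement) map directly onto `&`, `|`, and set difference.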
Sample space is simply the set of possible results of an experiment, while an event is the set of those results that satisfy a given condition.

Example 1:
What is the probability of getting an odd number when you roll a die?
S = {1, 2, 3, 4, 5, 6}
E = {1, 3, 5}
P(E) = 3/6 = 1/2

Example 2:
What is the probability of getting exactly two heads when a coin is tossed three times?
S = {HHH, HHT, HTH, THH, TTH, THT, HTT, TTT}
E = {HHT, HTH, THH}
P(E) = 3/8

Operations with Events
Intersection of Events
Let A = {3, 6, 9, 12, 15} and B = {1, 3, 5, 8, 12, 15, 17}; then A ∩ B = {3, 12, 15}
Let X = {q, w, e, r, t} and Y = {a, s, d, f}; then X ∩ Y = ∅, since X and Y have no elements in common.
Mutually Exclusive Events
Two events are mutually exclusive if they have no elements in common.
Union of Events
Let A = {a, e, i, o, u} and B = {b, c, d, e, f}; then A ∪ B = {a, b, c, d, e, f, i, o, u}
Let X = {1, 2, 3, 4} and Y = {3, 4, 5, 6}; then X ∪ Y = {1, 2, 3, 4, 5, 6}
Complement of an Event
Consider the sample space S = {dog, cow, bird, snake, pig}
Let A = {dog, bird, pig}; then A′ = {cow, snake}

2.2 COUNTING RULES IN PROBABILITY
Permutation Rule - the arrangement of elements in a distinct order.
nPr = n! / (n − r)!
Example: How many 3-letter words, with or without meaning, can be formed out of the letters of the word SWING when repetition of letters is not allowed?
Solution: 5P3 = 5! / (5 − 3)! = 60
Alternate Solution: On your calculator, type 5P3; you will get 60.

Combination Rule - the selection of r of n objects without regard to the order of arrangement.
nCr = n! / [r! (n − r)!]
Example: How many combinations can be taken out of the letters A, B, and C taken 3 at a time?
Solution: 3C3 = 3! / (3! 0!) = 1
Alternate Solution: On your calculator, type 3C3; you will get 1.

2.3 RULES OF PROBABILITY
Definition of Terms
Mutually exclusive or disjoint - two events cannot occur at the same time.
Conditional probability - the probability that Event A occurs, given that Event B has occurred.
Complement of an event - the event not occurring.
Intersection - both events occur.

RULE OF ADDITION
Rule 1: If two events A and B are mutually exclusive, then: P(A ∪ B) = P(A) + P(B)

Rule 2: If events A and B are not mutually exclusive, then: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

Example:
A student goes to the library. The probability that she checks out (a) a work of fiction is 0.40, (b) a work of non-fiction is 0.30, and (c) both fiction and non-fiction is 0.20. What is the probability that the student checks out a work of fiction, non-fiction, or both?

Solution:
Let F = the event that the student checks out fiction;
Let N = the event that the student checks out non-fiction.
Then, based on the rule of addition:
P(F ∪ N) = P(F) + P(N) − P(F ∩ N)
P(F ∪ N) = 0.4 + 0.3 − 0.2 = 0.5

RULE OF MULTIPLICATION
Rule 1: When two events A and B are independent, then: P(A ∩ B) = P(A) P(B)

Rule 2: When two events are dependent, the probability of both occurring is: P(A ∩ B) = P(A) P(B|A)
where P(B|A) = P(A ∩ B) / P(A), provided that P(A) ≠ 0
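The counting rules of Section 2.2 and the probability rules of Section 2.3 can be checked numerically. The sketch below redoes the SWING permutation, the A, B, C combination, and the library example with Python's standard `math` module. The conditional probability P(N | F) at the end is not asked for in the notes; it is added only to illustrate the formula under Rule 2.

```python
import math

# Permutation rule: ordered arrangements of 3 letters taken from the 5 letters of SWING.
words = math.perm(5, 3)

# Combination rule: unordered selections of 3 letters from A, B, C.
selections = math.comb(3, 3)

# Rule of addition for events that are not mutually exclusive (library example).
p_fiction, p_nonfiction, p_both = 0.40, 0.30, 0.20
p_either = p_fiction + p_nonfiction - p_both

# Rule of multiplication rearranged: P(N | F) = P(F and N) / P(F), valid since P(F) != 0.
p_nonfiction_given_fiction = p_both / p_fiction

print(words, selections)                                         # 60 1
print(round(p_either, 2), round(p_nonfiction_given_fiction, 2))  # 0.5 0.5
```

`math.perm` and `math.comb` require Python 3.8 or later; they play the same role as the nPr and nCr keys on a calculator.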
RULE OF SUBTRACTION
The probability that event A will occur is equal to 1 minus the probability that event A will not occur.
P(A) = 1 − P(A′)
Example 1: The probability that Bill will graduate from college is 0.8. What is the probability that Bill will not graduate from college?
Solution: Let A = the event that Bill does not graduate, so that P(A′) = 0.8. Then P(A) = 1 − 0.8 = 0.2

ADDITIONAL PROBLEMS
1. How many different committees of 6 can be formed from 12 boys and 8 girls if each committee consists of 4 boys and 2 girls?
Solution: (12C4)(8C2) = (495)(28)
Answer: 13,860

CHAPTER 3: DISCRETE PROBABILITY DISTRIBUTIONS
In this chapter, the analysis of several random experiments and discrete random variables that often appear in applications is discussed. A discussion of the basic sample space of the random experiment is frequently omitted, and the distribution of a particular random variable is directly described.

INTENDED LEARNING OUTCOMES
At the end of this module, it is expected that the students will be able to:
1. Determine probabilities from probability mass functions.
2. Determine probabilities from cumulative distribution functions, and cumulative distribution functions from probability mass functions.
3. Calculate means and variances for discrete random variables.
4. Understand the assumptions for each of the discrete probability distributions presented.
5. Select an appropriate discrete probability distribution to calculate probabilities in specific applications.
6. Calculate probabilities and determine means and variances for each of the discrete probability distributions presented.
DISCRETE PROBABILITY DISTRIBUTION
Discrete distribution - describes the probability of occurrence of each value of a discrete random variable.
Discrete random variable - a random variable that has countable values, such as a list of non-negative integers.
Non-zero probability
Presented in tabular form
3.1 RANDOM VARIABLES AND THEIR PROBABILITY DISTRIBUTIONS
Random variable - a variable whose value is subject to variations due to chance.
Discrete random variables - variables that can take on either a finite or at most a countably infinite set of discrete values.

Example of Discrete Probability Distribution
The value x1 takes on the probability p1, the value x2 takes on the probability p2, and so on. The probabilities pi must satisfy two requirements: every probability pi is a number between 0 and 1, and the sum of all the probabilities is 1 (p1 + p2 + ⋯ + pk = 1).

Probability Distributions for Discrete Random Variables
Examples of discrete random variables include:
The number of eggs that a hen lays in a given day (it can't be 2.3)
The number of people going to a given soccer match
The number of students that come to class on a given day
The number of people in line at McDonald's on a given day and time
The expected value of a random variable is the weighted average of all possible values that this random variable can take on.

Discrete Random Variable
A discrete random variable X has a countable number of possible values. The probability distribution of a discrete random variable X lists the values and their probabilities, such that xi has a probability of pi. The probabilities pi must satisfy two requirements:
1. Every probability pi is a number between 0 and 1.
2. The sum of the probabilities is 1: p1 + p2 + ⋯ + pk = 1.

Cumulative Distribution Functions
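The two requirements above are easy to check in code, and the same table of probabilities gives the cumulative distribution by a running sum. The sketch below uses an illustrative distribution that is an assumption here (the number of heads in two flips of a fair coin); the function and variable names are hypothetical.

```python
from fractions import Fraction
from itertools import accumulate

# Illustrative distribution (an assumption): number of heads in two fair coin flips.
values = [0, 1, 2]
probabilities = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]

def is_valid_distribution(probs):
    """Check both requirements: every p_i lies in [0, 1] and the p_i sum to 1."""
    return all(0 <= p <= 1 for p in probs) and sum(probs) == 1

# Cumulative distribution: F(x_i) = P(X <= x_i), the running total of the p_i.
cumulative = list(accumulate(probabilities))

print(is_valid_distribution(probabilities))
for x, p, f in zip(values, probabilities, cumulative):
    print(x, p, f)
```

Using `Fraction` instead of floats keeps the "sums to 1" check exact; with floating-point probabilities a small tolerance would be needed instead of `==`.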
Expected Value Definition
In probability theory, the expected value (or expectation, mathematical expectation, EV, mean, or first moment) of a random variable is the weighted average of all possible values that this random variable can take on.
Example: Determine the expected value when you roll a die.
Soln.:
Let X represent the outcome of a roll of a six-sided die. The possible values for X are {1, 2, 3, 4, 5, 6}, all equally likely (each having a probability of 1/6). The expectation of X is:
E[X] = (1 x 1/6) + (2 x 1/6) + (3 x 1/6) + (4 x 1/6) + (5 x 1/6) + (6 x 1/6) = 3.5.
In this case, since all outcomes are equally likely, we could have simply averaged the numbers together:
(1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5.
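The weighted-average calculation for the die can be written as a short sketch. The variance line is an addition not worked in the notes; it uses the standard definition Var(X) = E[(X − E[X])²], and the names are illustrative.

```python
from fractions import Fraction

# Fair six-sided die: each face has probability 1/6.
faces = range(1, 7)
p = Fraction(1, 6)

# E[X] = sum of x * P(X = x) over all possible values x.
expected_value = sum(x * p for x in faces)

# Var(X) = E[(X - E[X])^2], the probability-weighted squared deviation.
variance = sum((x - expected_value) ** 2 * p for x in faces)

print(expected_value)  # 7/2
print(variance)        # 35/12
```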
3.4 THE BINOMIAL DISTRIBUTION
Binomial Distribution
A binomial random variable is the number of successes x in n repeated trials of a binomial experiment. The probability distribution of a binomial random variable is called a binomial distribution.
Example:
Suppose we flip a coin two times and count the number of heads (successes). The possible outcomes are {TT, HT, TH, HH}. The binomial random variable is the number of heads, which can take on values of 0, 1, or 2. The binomial distribution is presented below.
Number of heads: 0, 1, 2
Probability: 0.25, 0.50, 0.25

The binomial distribution has the following properties:
The mean of the distribution (μx) is equal to n * P.
The variance (σ²x) is n * P * (1 − P).
The standard deviation (σx) is sqrt[n * P * (1 − P)].

3.5 THE POISSON DISTRIBUTION
A Poisson distribution is the probability distribution that results from a Poisson experiment.

Attributes of a Poisson Experiment
A Poisson experiment is a statistical experiment that has the following properties:
The experiment results in outcomes that can be classified as successes or failures.
The average number of successes (μ) that occurs in a specified region is known.
The probability that a success will occur is proportional to the size of the region.
The probability that a success will occur in an extremely small region is virtually zero.
Note that the specified region could take many forms. For instance, it could be a length, an area, a volume, a period of time, etc.
Notation
The following notation is helpful, when we talk about the
Poisson distribution.
e: A constant equal to approximately 2.71828.
(Actually, e is the base of the natural logarithm
system.)
μ: The mean number of successes that occur in a
specified region.
x: The actual number of successes that occur in a
specified region.
P (x; μ): The Poisson probability that exactly x
successes occur in a Poisson experiment, when the
mean number of successes is μ.
Poisson Distribution
A Poisson random variable is the number of successes
that result from a Poisson experiment. The probability
distribution of a Poisson random variable is called a
Poisson distribution.
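Using the notation above, P(x; μ) is given by the standard Poisson formula e^(−μ) μ^x / x!. A minimal sketch of that formula, together with a cumulative version that sums it from 0 up to an upper limit, is shown below. The function names and the two sets of inputs are illustrative assumptions.

```python
import math

def poisson_pmf(x, mu):
    """P(x; mu): probability of exactly x successes when the mean number is mu."""
    return math.exp(-mu) * mu ** x / math.factorial(x)

def poisson_cdf(x_max, mu):
    """Probability of at most x_max successes: P(0; mu) + ... + P(x_max; mu)."""
    return sum(poisson_pmf(x, mu) for x in range(x_max + 1))

# Illustrative inputs: exactly 3 successes with mean 2, and at most 3 with mean 5.
print(round(poisson_pmf(3, 2), 3))
print(round(poisson_cdf(3, 5), 4))
```

`math.exp` is used instead of the rounded constant 2.71828, so results may differ from hand calculations in the last digit.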
Poisson Formula.
Suppose we conduct a Poisson experiment, in which the average number of successes within a given region is μ. Then, the Poisson probability is:
P(x; μ) = (e^(−μ)) (μ^x) / x!
where x is the actual number of successes that result from the experiment, and e is approximately equal to 2.71828.

The Poisson distribution has the following properties:
The mean of the distribution is equal to μ.
The variance is also equal to μ.

Example: The average number of homes sold by the Cursor Realty company is 2 homes per day. What is the probability that exactly 3 homes will be sold tomorrow?
Solution: This is a Poisson experiment in which we know the following:
μ = 2; since 2 homes are sold per day, on average.
x = 3; since we want to find the likelihood that 3 homes will be sold tomorrow.
e = 2.71828; since e is a constant equal to approximately 2.71828.

We plug these values into the Poisson formula as follows:
P(x; μ) = (e^(−μ)) (μ^x) / x! = (2.71828^(−2)) (2^3) / 3!
P(3; 2) = (0.13534) (8) / 6
P(3; 2) = 0.180
Thus, the probability of selling 3 homes tomorrow is 0.180.

Cumulative Poisson Probability
A cumulative Poisson probability refers to the probability that the Poisson random variable is greater than or equal to some specified lower limit and less than or equal to some specified upper limit.

Example. Suppose the average number of lions seen on a 1-day safari is 5. What is the probability that tourists will see fewer than four lions on the next 1-day safari?

Solution: This is a Poisson experiment in which we know the following:
μ = 5; since 5 lions are seen per safari, on average.
x = 0, 1, 2, or 3; since we want to find the likelihood that tourists will see fewer than 4 lions; that is, we want the probability that they will see 0, 1, 2, or 3 lions.

To solve this problem, we need to find the probability that tourists will see 0, 1, 2, or 3 lions. Thus, we need to calculate the sum of four probabilities: P(0; 5) + P(1; 5) + P(2; 5) + P(3; 5).
To compute this sum, we use the Poisson formula: P(x; μ) = (e^(−μ)) (μ^x) / x!
P(x ≤ 3; 5) = P(0; 5) + P(1; 5) + P(2; 5) + P(3; 5)
P(x ≤ 3; 5) = [(e^(−5)) (5^0) / 0!] + [(e^(−5)) (5^1) / 1!] + [(e^(−5)) (5^2) / 2!] + [(e^(−5)) (5^3) / 3!]
P(x ≤ 3; 5) = [(0.006738) (1) / 1] + [(0.006738) (5) / 1] + [(0.006738) (25) / 2] + [(0.006738) (125) / 6]
P(x ≤ 3; 5) = [0.006738] + [0.03369] + [0.084224] + [0.140375]
P(x ≤ 3; 5) = 0.2650
Thus, the probability of seeing no more than 3 lions is 0.2650.

CHAPTER 4: CONTINUOUS PROBABILITY DISTRIBUTIONS

INTENDED LEARNING OUTCOMES
At the end of this module, it is expected that the students will be able to:
1. Determine the probabilities from probability density functions
2. Determine the probabilities from cumulative distribution functions
3. Calculate means and variances for continuous random variables
4. Standardize normal random variables
5. Use the table for cumulative distribution function of a standard normal distribution to calculate probabilities
6. Approximate probabilities for some binomial and Poisson distributions
7. Use continuity corrections to improve the normal approximations to those binomial and Poisson distributions.

4.1 CONTINUOUS RANDOM VARIABLES AND THEIR PROBABILITY DISTRIBUTION
A continuous random variable has a probability of zero of assuming exactly any of its values.
Its probability distribution cannot be given in tabular form.

Let us discuss a random variable whose values are the heights of all people over 21 years of age. Of course, there is an infinite number of possible heights between any two values (e.g., between 163.5 and 164.5 cm, or even between 163.99 and 164.01 cm). The probability of selecting at random a person who is exactly 164 centimeters tall is so remote that we assign a probability of zero to that event.

This is not the case, however, if we talk about the probability of selecting a person who is at least 163 centimeters but not more than 165 centimeters tall. Now we are dealing with an interval rather than a point value of our random variable.
Example: Let X be a uniform (a, b) random variable.
What is E(X) given the function:
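The function referred to above did not survive in these notes. Assuming it is the standard uniform density, f(x) = 1/(b − a) for a ≤ x ≤ b and 0 elsewhere, the expected value can be worked out as follows:

```latex
E(X) = \int_a^b x \, f(x) \, dx
     = \int_a^b \frac{x}{b-a} \, dx
     = \frac{b^2 - a^2}{2(b-a)}
     = \frac{a+b}{2}
```

Under that assumption the mean is the midpoint of the interval, which matches the intuition that every value between a and b is equally likely.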
The Normal Distribution is the most important and most
widely used continuous probability distribution. It is the
cornerstone of the application of statistical inference in
analysis of data because the distributions of several
important sample statistics tend towards a normal
distribution as the sample size increases.
Empirical studies have indicated that the Normal
distribution provides an adequate approximation to the
distributions of many physical variables. Specific
examples include meteorological data, such as
temperature and rainfall, measurements on living
organisms, scores on aptitude tests, physical
measurements of manufactured parts, weights of
contents of food packages, volumes of liquids in
bottles/cans, instrumentation errors and other deviations
from established norms, and so on.
The graphical appearance of the Normal distribution is a
symmetrical bell-shaped curve that extends without
bound in both positive and negative directions.
REMEMBER:
When solving normal distribution problems, follow these steps:
1. Convert the random variable to its corresponding Z score.
2. Sketch the graph.
3. Find the area from the table.
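The steps above can be sketched in code, with the table lookup replaced by the error function from Python's `math` module. The Z score is computed in the usual way, Z = (X − mean) / standard deviation. The heights example (mean 164 cm, standard deviation 10 cm) is a hypothetical input chosen for illustration, and the function names are not from the notes.

```python
import math

def standard_normal_cdf(z):
    """Area under the standard normal curve to the left of z (what a Z table lists)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_probability_between(lower, upper, mean, std_dev):
    """P(lower <= X <= upper) for a normal X with the given mean and standard deviation."""
    # Step 1: convert each bound to its Z score.
    z_lower = (lower - mean) / std_dev
    z_upper = (upper - mean) / std_dev
    # Step 3: look up the two areas and subtract (step 2, the sketch, is done by hand).
    return standard_normal_cdf(z_upper) - standard_normal_cdf(z_lower)

# Hypothetical inputs: heights with mean 164 cm and standard deviation 10 cm.
print(round(normal_probability_between(154, 174, 164, 10), 4))  # 0.6827
```

The printed value is the area within one standard deviation of the mean, which is a handy check against a printed Z table.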
Example: There is an 85% chance that a student will pass the Engineering Data Analysis subject. A random sample of 120 students taking the course is selected.
Find the probability that:
a. At least 110 students will pass.
b. At most 105 students will pass.
c. Between 99 and 108 students will pass.
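One way to approach this example, assuming the intended method is the normal approximation to the binomial with a continuity correction, is sketched below. The number of passing students is treated as binomial with n = 120 and P = 0.85, so the mean is n * P and the variance is n * P * (1 − P). Each bound is widened by 0.5 before converting to a Z score. Part (c) is read as inclusive (99 to 108), which is an assumption. The exact binomial sum is printed alongside for comparison; the function names are illustrative.

```python
import math

n, p = 120, 0.85
mean = n * p                          # about 102 students expected to pass
std_dev = math.sqrt(n * p * (1 - p))  # square root of about 15.3

def phi(z):
    """Standard normal cumulative probability."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_approx(low, high):
    """Approximate P(low <= X <= high), widening each end by 0.5 (continuity correction)."""
    return phi((high + 0.5 - mean) / std_dev) - phi((low - 0.5 - mean) / std_dev)

def exact_binomial(low, high):
    """Exact P(low <= X <= high) from the binomial probabilities, for comparison."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(low, high + 1))

for label, low, high in [("a. at least 110", 110, n),
                         ("b. at most 105", 0, 105),
                         ("c. 99 to 108 inclusive", 99, 108)]:
    print(label, round(normal_approx(low, high), 4), round(exact_binomial(low, high), 4))
```

Comparing the two columns shows how close the approximation is for this sample size; the small gap is the price of replacing a discrete distribution with a continuous one.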
Continuity Correction
The binomial and Poisson distributions are discrete, whereas the normal distribution is continuous. We need to take this into account, using a continuity correction, when we use the normal distribution to approximate a binomial or Poisson distribution.
In the discrete distribution, each probability is represented by a rectangle (right-hand diagram).
When working out probabilities, we want to include whole rectangles, which is what continuity correction is all about.

The exponential distribution obtains its name from the exponential function in the probability density function. Plots of the exponential distribution for selected values of λ are shown in Fig. 4.4. For any value of λ, the exponential distribution is quite skewed.

Figure 4.4 Probability density function of exponential random variables for selected values of λ.

If the random variable X has an exponential distribution with parameter λ, then μ = E(X) = 1/λ and σ² = V(X) = 1/λ².

It is important to use consistent units in the calculation of probabilities, means, and variances involving exponential random variables. The following example illustrates unit conversions.

Example: Laptops produced by company ABC last 5 years on average. The lifespan of each laptop is exponentially distributed. What is the probability that a laptop will last:
a. Less than 3 years?
b. More than 10 years?
c. Between 4 and 7 years?
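For the laptop example in the exponential distribution discussion, a minimal sketch is given below. It assumes the standard exponential cumulative distribution function, P(X ≤ x) = 1 − e^(−λx) for x ≥ 0, and takes λ = 1/5 per year because the mean of an exponential random variable is 1/λ. The function and variable names are illustrative.

```python
import math

mean_life_years = 5.0
rate = 1.0 / mean_life_years  # lambda = 1 / mean, in units of "per year"

def exponential_cdf(x, lam):
    """P(X <= x) = 1 - e^(-lam * x) for x >= 0."""
    return 1.0 - math.exp(-lam * x)

p_less_than_3 = exponential_cdf(3, rate)
p_more_than_10 = 1.0 - exponential_cdf(10, rate)
p_between_4_and_7 = exponential_cdf(7, rate) - exponential_cdf(4, rate)

print(round(p_less_than_3, 4), round(p_more_than_10, 4), round(p_between_4_and_7, 4))
```

Note that the rate and the times are both expressed in years; mixing units (for example a rate per year with a time in months) is the error the text warns about.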