CS3352 FOUNDATIONS OF DATA SCIENCE
UNIT II DESCRIBING DATA
Types of Data - Types of Variables - Describing Data with Tables and Graphs - Describing Data
with Averages - Describing Variability - Normal Distributions and Standard (z) Scores.
I. TYPES OF DATA
Any statistical analysis is performed on data, a collection of actual observations or scores in
a survey or an experiment.
THREE TYPES OF DATA
The precise form of a statistical analysis often depends on whether data are qualitative,
ranked, or quantitative.
Qualitative Data
A set of observations where any single observation is a word, letter, or numerical code
that represents a class or category. Generally, qualitative data consist of words (Yes or No),
letters (Y or N), or numerical codes (0 or 1) that represent a class or category.
Ex: Academic Major, Gender
Ranked Data
A set of observations where any single observation is a number that indicates relative
standing. Ranked data consist of numbers (1st, 2nd, . . . 40th place) that represent relative
standing within a group.
Ex: Third place
Quantitative Data
A set of observations where any single observation is a number that represents an amount
or a count. Quantitative
data consist of numbers (weights of 238, 170, . . . 185 lbs) that represent an amount or a count.
Ex: Age, Family size, IQ Score, Temperature
II. TYPES OF VARIABLES
General Definition
A variable is a characteristic or property that can take on different values.
Discrete and Continuous Variables
Quantitative variables can be further distinguished in terms of whether they are discrete or
continuous.
A discrete variable consists of isolated numbers separated by gaps. It is a variable whose value is
obtained by counting. (Discrete variable: physical objects, countable.)
Examples include most counts, such as the number of children in a family (1, 2, 3, etc.).
Ex: Number of Boys in a class.
A continuous variable consists of numbers whose values, at least in theory, have no restrictions.
(Continuous variable: non-physical quantity, infinitely many possible values.)
Examples include amounts, such as weights of male statistics students.
Ex: Temperature is a continuous variable
Height of boys in a class
Weight of Students in a class
Income
Age
Approximate Numbers
An approximate number, time, or position is close to the correct number, time, or position but is not
exact.
Whenever values are rounded off, as is always the case with actual values for continuous
variables, the resulting numbers are approximate.
Ex: A student whose weight is listed as 150 lbs could actually weigh between 149.5 and
150.5 lbs. In effect, any value for a continuous variable, such as 150 lbs, must be identified with
a range of values from 149.5 to 150.5 rather than with a solitary value.
Independent and Dependent Variables
No.  Independent Variable        Dependent Variable
1    Variable that is changed    Variable affected by the change
2    Cause                       Effect
3    Manipulated (handled)       Measured (scalable)
Independent Variable
A variable that is manipulated to determine its effect on the dependent variable.
In an experiment, an independent variable is the treatment manipulated by the investigator. The
independent variable is the variable that the experimenter manipulates or changes, and it is assumed to
have a direct effect on the dependent variable.
Ex: The liquid used to water each plant
Dependent variable
It is the variable being tested and measured in an experiment.
When a variable is believed to have been influenced by the independent variable, it is called a
dependent variable. In an experimental setting, the dependent variable is measured, counted, or
recorded by the investigator.
Ex: Change in the height or health of plant.
Ex: Independent Variable - Stress (Cause); Dependent Variable - Mental state of a human
being (Effect).
Independent Variable vs Dependent Variable
Independent Variable                     Dependent Variable
Type of drug treatment                   Behavioural adjustment
Different types of drug treatment or     Measurement of activity levels and
psychological treatment                  eating behaviour
Confounding Variable
A confounding variable is a third variable that is associated with, and influences, both the
independent and the dependent variable.
An uncontrolled variable that compromises the interpretation of a study is known as a
confounding variable.
III. DESCRIBING DATA WITH TABLES AND GRAPHS
TABLES (FREQUENCY DISTRIBUTIONS)
FREQUENCY DISTRIBUTIONS FOR QUANTITATIVE DATA
RELATIVE FREQUENCY DISTRIBUTIONS
CUMULATIVE FREQUENCY DISTRIBUTIONS
FREQUENCY DISTRIBUTIONS FOR QUALITATIVE (NOMINAL) DATA
GRAPHS
GRAPHS FOR QUANTITATIVE DATA
TYPICAL SHAPES
A GRAPH FOR QUALITATIVE (NOMINAL) DATA
TABLES (FREQUENCY DISTRIBUTIONS)
FREQUENCY DISTRIBUTIONS FOR QUANTITATIVE DATA
A frequency distribution is a collection of observations produced by sorting observations
into classes and showing their frequency (f ) of occurrence in each class.
Frequency Distribution for Ungrouped Data
Frequency Distribution for Grouped Data
Frequency Distribution for Ungrouped Data
A frequency distribution produced whenever observations are sorted into classes of single
values.
An ungrouped frequency distribution displays the frequency of each individual data
value rather than groups of data values.
Ex: Suppose we conduct a survey in which we ask 15 households how many pets they have in
their home. The results are as follows:
1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8
Here’s an example of an ungrouped frequency distribution for our survey data:
This type of frequency distribution allows us to directly see how often different values occurred
in our dataset. For example:
4 families had 1 pet
3 families had 2 pets
2 families had 3 pets
1 family had 4 pets
And so on.
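The counts listed above can be reproduced with a short Python sketch. It uses only the 15 survey values given in the example; the variable names `pets` and `ungrouped` are illustrative choices, not part of the original notes.

```python
from collections import Counter

# Number of pets reported by the 15 surveyed households (from the example above).
pets = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8]

# In an ungrouped distribution every distinct value is its own class.
ungrouped = dict(sorted(Counter(pets).items()))

print("Pets  Frequency")
for value, f in ungrouped.items():
    print(f"{value:>4}  {f:>9}")
```

The frequencies add up to 15, the number of households surveyed.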
When to Use Ungrouped Frequency Distributions
Ungrouped frequency distributions can be useful when you want to see how often each
individual value occurs in a dataset.
Note that ungrouped frequency distributions work best with small datasets in which there are
only a few unique values.
For example, in our survey data from earlier there were only 8 unique values so it made sense to
create an ungrouped frequency distribution.
However, if we had a dataset with hundreds or thousands of unique values, an ungrouped
frequency distribution would be incredibly long and difficult to gather information from.
For larger datasets, it makes sense to construct grouped frequency distributions.
How to Visualize Ungrouped Frequency Distributions
The easiest way to visualize the values in an ungrouped frequency distribution is to create
a frequency polygon, which displays the frequencies of each individual value in a simple chart.
Here’s what a frequency polygon would look like for our sample data:
Frequency polygon
This helps us quickly gain an understanding of how often each value occurs in the dataset.
Alternatively, we could create a bar chart to display the exact same data using bars rather than
a single line:
Bar chart
Both charts allow us to quickly understand the distribution of values in our dataset.
Frequency Distribution for Grouped Data
A frequency distribution produced whenever observations are sorted into classes of more
than one value.
Ex: Suppose we conduct a survey in which we ask 15 households how many pets they have
in their home. The results are as follows:
1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8
One way to summarize these results is to create a frequency distribution, which tells us how
frequently different values occur in a dataset.
Often we use grouped frequency distributions, in which we create groups of values and then
summarize how many observations from a dataset fall into those groups.
Here’s an example of a grouped frequency distribution for our survey data:
We first created groups of size 2, then we counted how many individual observations from the
dataset fell in each group. For example:
7 families had either 1 or 2 pets
3 families had either 3 or 4 pets
3 families had either 5 or 6 pets
2 families had either 7 or 8 pets
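As a minimal sketch of the grouping step described above, the following Python code sorts the same 15 survey values into classes of width 2. The names `class_width` and `grouped` are assumptions made for illustration.

```python
# Number of pets reported by the 15 surveyed households (from the example above).
pets = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8]

class_width = 2      # groups of size 2: 1-2, 3-4, 5-6, 7-8
lowest = min(pets)   # the first class starts at the smallest observation

grouped = {}
for x in sorted(pets):
    # Lower limit of the class that contains x.
    lower = lowest + ((x - lowest) // class_width) * class_width
    label = f"{lower}-{lower + class_width - 1}"
    grouped[label] = grouped.get(label, 0) + 1

for label, f in grouped.items():
    print(f"{label} pets: {f} families")
```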
RELATIVE FREQUENCY DISTRIBUTIONS
Relative frequency distributions show the frequency of each class as a part or fraction of
the total frequency for the entire distribution.
For example, suppose we gather a simple random sample of 400 households in a city and
record the number of pets in each household. The following table shows the results:
This table represents a frequency distribution.
A related distribution is known as a relative frequency distribution, which shows the relative
frequency of each value in a dataset as a percentage of all frequencies.
For example, in the previous table we saw that there were 400 total households. To find the
relative frequency of each value in the distribution, we simply divide each individual frequency
by 400:
Note that relative frequency distributions have the following properties:
Each individual relative frequency is between 0% and 100%.
The sum of all individual relative frequencies adds up to 100%.
If these conditions are not met, then the relative frequency distribution is not valid.
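The division step and the two validity conditions can be sketched in Python. Because the 400-household table is not reproduced in these notes, this sketch reuses the 15-household pet survey from the earlier example instead; that substitution and the variable names are assumptions.

```python
from collections import Counter

# Assumption: the 15-household survey from the earlier example stands in
# for the 400-household table, which is not reproduced in these notes.
pets = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8]

counts = Counter(pets)
total = sum(counts.values())

# Relative frequency = class frequency / total frequency.
relative = {value: f / total for value, f in sorted(counts.items())}

for value, rf in relative.items():
    print(f"{value} pets: {rf:.1%}")

# Validity checks: each value lies between 0 and 1, and together they sum to 1 (100%).
all_in_range = all(0 <= rf <= 1 for rf in relative.values())
sums_to_one = abs(sum(relative.values()) - 1.0) < 1e-9
print(all_in_range, sums_to_one)
```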
Visualizing a Relative Frequency Distribution
The most common way to visualize a relative frequency distribution is to create a relative
frequency histogram, which displays the individual data values along the x-axis of a graph and
uses bars to represent the relative frequencies of each class along the y-axis.
For example, here’s what a relative frequency histogram would look like for the data in our
previous example:
The x-axis displays the number of pets in the household and the y-axis displays the relative
frequency of households that have that number of pets.
This histogram is a useful way for us to visualize the distribution of relative frequencies.
CUMULATIVE FREQUENCY DISTRIBUTIONS
Cumulative frequency distributions show the total number of observations in each class and in all
lower-ranked classes.
Constructing Cumulative Frequency Distributions
To convert a frequency distribution into a cumulative frequency distribution, add to the
frequency of each class the sum of the frequencies of all classes ranked below it.
Types of Cumulative Frequency Distribution
The cumulative frequency distribution is classified into two different types namely: less
than cumulative frequency and more/greater than cumulative frequency.
Less Than Cumulative Frequency:
The less than cumulative frequency distribution is obtained by successively adding the
frequencies of all the previous classes to the frequency of the class against which it is written. In this
type, cumulation runs from the lowest class to the highest.
Greater Than Cumulative Frequency:
The greater than cumulative frequency is also known as the more than type cumulative
frequency. Here, the greater than cumulative frequency distribution is obtained by determining
the cumulative total frequencies starting from the highest class to the lowest class.
Graphical Representation of Less Than and More Than Cumulative Frequency
Representing cumulative frequency graphically is easy and convenient compared with
representing it using a table, bar graph, frequency polygon, etc.
The cumulative frequency graph can be plotted in two ways:
1. Cumulative frequency distribution curve(or ogive) of less than type
2. Cumulative frequency distribution curve(or ogive) of more than type
Less Than Cumulative Frequency:
Steps to Construct Less than Cumulative Frequency Curve
The steps to construct the less than cumulative frequency curve are as follows:
1. Mark the upper limit on the horizontal axis or x-axis.
2. Mark the cumulative frequency on the vertical axis or y-axis.
3. Plot the points (x, y) in the coordinate plane where x represents the upper limit value and
y represents the cumulative frequency.
4. Finally, join the points and draw the smooth curve.
5. The curve so obtained gives a cumulative frequency distribution graph of less than type.
To draw a cumulative frequency distribution graph of less than type, consider the following
cumulative frequency distribution table which gives the number of participants in any level of
essay writing competition according to their age:
Table 1 Cumulative frequency distribution table of less than type
Level of   Age group          Age group      Number of participants   Cumulative
essay      (class interval)                  (frequency)              frequency
Level 1    10-15              Less than 15   20                       20
Level 2    15-20              Less than 20   32                       52
Level 3    20-25              Less than 25   18                       70
Level 4    25-30              Less than 30   30                       100
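The cumulative frequency column of Table 1 is a running total of the frequency column. A minimal Python sketch, with illustrative variable names, shows the calculation and the (x, y) points that would be plotted for the less than ogive:

```python
from itertools import accumulate

# Upper class limits and frequencies taken from Table 1.
upper_limits = [15, 20, 25, 30]
frequencies = [20, 32, 18, 30]

# Less than cumulative frequency: running total from the lowest class upward.
less_than_cf = list(accumulate(frequencies))

# Points (upper limit, cumulative frequency) for the less than ogive.
ogive_points = list(zip(upper_limits, less_than_cf))
print(ogive_points)
```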
On plotting corresponding points according to table 1, we have
Greater Than Cumulative Frequency:
Steps to Construct Greater than Cumulative Frequency Curve
The steps to construct the more than/greater than cumulative frequency curve are as follows:
1. Mark the lower limit on the horizontal axis.
2. Mark the cumulative frequency on the vertical axis.
3. Plot the points (x, y) in the coordinate plane where x represents the lower limit value, and
y represents the cumulative frequency.
4. Finally, draw the smooth curve by joining the points.
5. The curve so obtained gives the cumulative frequency distribution graph of more than
type.
To draw a cumulative frequency distribution graph of more than type, consider the same
cumulative frequency distribution table, which gives the number of participants in any level of
essay writing competition according to their age:
Table 2 Cumulative frequency distribution table of more than type
Level of   Age group          Age group      Number of participants   Cumulative
essay      (class interval)                  (frequency)              frequency
Level 1    10-30              More than 10   20                       100
Level 2    15-30              More than 15   32                       80
Level 3    20-30              More than 20   18                       48
Level 4    25-30              More than 25   30                       30
On plotting these points, we get a curve as shown in the graph 2.
These graphs are helpful in finding the median of a given data set. The median can be found
by drawing both types of cumulative frequency distribution curves on the same graph. The x-value
of the point of intersection of the two curves gives the median of the given set of data. For the
given table 1, the median can be calculated as shown:
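The intersection can also be located numerically. The sketch below assumes the plotted points are joined by straight line segments rather than a hand-drawn smooth curve, and that the less than curve starts at 0 at the lowest class boundary. Under those assumptions the two curves cross where the cumulative frequency equals half the total. Variable names are illustrative.

```python
from itertools import accumulate

# Class boundaries and frequencies from Table 1.
limits = [10, 15, 20, 25, 30]
frequencies = [20, 32, 18, 30]
total = sum(frequencies)

# Cumulative counts at each boundary.
less_than = [0] + list(accumulate(frequencies))   # participants below the boundary
more_than = [total - c for c in less_than]        # participants at or above it

# less_than + more_than == total at every boundary, so the curves meet
# where both equal total / 2. Interpolate linearly inside that class.
half = total / 2
for i, f in enumerate(frequencies):
    if less_than[i] <= half <= less_than[i + 1]:
        width = limits[i + 1] - limits[i]
        median = limits[i] + (half - less_than[i]) / f * width
        break

print(more_than[:-1])   # the "more than" column of Table 2
print(median)
```

For this table the crossing falls in the 15-20 class, at an age of about 19.7 years.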
FREQUENCY DISTRIBUTIONS FOR QUALITATIVE (NOMINAL) DATA
When, among a set of observations, any single observation is a word, letter, or numerical code,
the data are qualitative. Frequency distributions for qualitative data are easy to construct.
For the Facebook profile survey, the frequency distribution reveals that Yes replies are
approximately twice as prevalent as No replies.
Ordered Qualitative Data
When, however, qualitative data have an ordinal level of measurement because
observations can be ordered from least to most, that order should be preserved in the
frequency table.
Relative and Cumulative Distributions for Qualitative Data
Frequency distributions for qualitative variables can always be converted into relative frequency
distributions. Furthermore, if measurement is ordinal because observations can be ordered from
least to most, cumulative frequencies (and cumulative percentages) can be used.
GRAPHS
GRAPHS FOR QUANTITATIVE DATA
Histograms
A bar-type graph for quantitative data. The common boundaries between adjacent bars
emphasize the continuity of the data, as with continuous variables.
The body of the histogram consists of a series of bars whose heights reflect the
frequencies for the various classes.
Equal units along the horizontal axis (the X axis, or abscissa) reflect the various
class intervals of the frequency distribution.
Equal units along the vertical axis (the Y axis, or ordinate) reflect increases in
frequency.
The intersection of the two axes defines the origin at which both numerical scales
equal 0.
Numerical scales always increase from left to right along the horizontal axis and
from bottom to top along the vertical axis.
Histograms
Frequency Polygon
A line graph for quantitative data that also emphasizes the continuity of continuous
variables.
Constructing frequency polygon
Place dots at the midpoints of each bar top or, in the absence of bar tops, at midpoints for classes
on the horizontal axis, and connect them with straight lines.
Transition from Histogram to Frequency polygon
TYPICAL SHAPES
The following are some of the more typical shapes for smoothed frequency polygons (which ignore the
inevitable irregularities of real data).
A. NORMAL
The familiar bell-shaped silhouette of the normal curve can be superimposed on many
frequency distributions, including those for uninterrupted gestation periods of human fetuses
and scores on standardized tests.
B. BIMODAL
A bimodal distribution reflects the coexistence of two different types of observations in the same distribution.
C. POSITIVELY SKEWED
A lopsided distribution caused by a few extreme observations in the positive direction (to the
right of the majority of observations) is a positively skewed distribution.
D. NEGATIVELY SKEWED
A lopsided distribution caused by a few extreme observations in the negative direction (to the
left of the majority of observations) is a negatively skewed distribution.
A GRAPH FOR QUALITATIVE (NOMINAL) DATA
The distribution, based on replies to the question “Do you have a Facebook profile?”, appears as
a bar graph. A glance at this graph confirms that Yes replies occur approximately twice as often
as No replies.
A person’s answer to the question “Do you have a Facebook profile?” is either Yes or No, not
some impossible intermediate value, such as 40 percent Yes and 60 percent No. Gaps are placed
between adjacent bars of bar graphs to emphasize the discontinuous nature of qualitative data. A
bar graph also can be used with quantitative data to emphasize the discontinuous nature of a
discrete variable.
Bar Chart
IV. DESCRIBING DATA WITH AVERAGES
Statistics is the branch of mathematics dealing with the collection, analysis, interpretation, and
presentation of masses of numerical data.
Descriptive Statistics
Descriptive statistics provides us with tools—tables, graphs, averages, ranges,
correlations—for organizing and summarizing the inevitable variability in collections of actual
observations or scores.
Examples are:
1. A tabular listing, ranked from most to least.
2. A graph showing the annual change in global temperature during the last 30 years.
3. A report that describes the average difference in grade point average (GPA) between
college students.
Inferential Statistics
Statistics also provides tools—a variety of tests and estimates—for generalizing beyond
collections of actual observations. This more advanced area is known as inferential statistics.
For example:
An assertion about the relationship between job satisfaction and overall happiness.
Different statistical methods:
MODE
MEDIAN
MEAN
WHICH AVERAGE?
MODE
The mode reflects the value of the most frequently occurring score.
It is the value in a set of numbers that appears most often.
Ex: Determining the mode of the following retirement ages: 60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63. (Here,
63 occurs the most.)
More Than One Mode
Bimodal: Distributions with two obvious peaks, even though they are not exactly the
same height, are referred to as bimodal.
Multimodal: Distributions with more than two peaks are referred to as multimodal.
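As a small sketch, the mode of the retirement ages in the example above can be found with Python's standard `statistics` module. `multimode` returns every value tied for the highest frequency, so it also covers bimodal and multimodal data in the strict sense of exact ties.

```python
from statistics import multimode

# Retirement ages from the example above.
ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]

# All values that occur most often (a single value here).
modes = multimode(ages)
print(modes)  # [63]
```

Note that the notes also call a distribution bimodal when its two peaks differ slightly in height; `multimode` reports only exact ties, so such cases still need a look at the frequency table or graph.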
MEDIAN
The median reflects the middle value when observations are ordered from least to
most.
The median splits a set of ordered observations into two equal parts, the upper and
lower halves.
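A minimal sketch of the ordering step, reusing the retirement ages from the mode example (the reuse of that data set here is an illustrative choice):

```python
from statistics import median

# Retirement ages from the mode example.
ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]

ordered = sorted(ages)                 # order from least to most
middle = ordered[len(ordered) // 2]    # odd count: the single middle value

print(ordered)
print(middle, median(ages))
```

With an even number of scores there is no single middle value; `statistics.median` then returns the mean of the two middle scores.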
MEAN
The mean is the most common average, one you have doubtless calculated many times.
The mean is found by adding all scores and then dividing by the number of scores.
That is,
Mean = (sum of all scores) / (number of scores)
Two types of MEAN
Sample Mean
A sample is a subset of scores.
The sample mean is the balance point for a sample, found by dividing the sum for the values of all scores in
the sample by the number of scores in the sample.
Formula for Sample Mean
It’s usually more efficient to substitute symbols for words in statistical formulas,
including the word formula given above for the mean. When symbols are used, X̄ designates the
sample mean, and the formula becomes
X̄ = ΣX / n
and reads: “X-bar equals the sum of the variable X divided by the sample size n.” (Note that the
uppercase Greek letter sigma (Σ) is read as the sum of, not as sigma.)
Formula for Population Mean
Population Mean (μ)
The balance point for a population, found by dividing the sum for all scores in the
population by the number of scores in the population.
The formula for the population mean differs from that for the sample mean only because
of a change in some symbols. In statistics, Greek symbols usually describe population
characteristics, such as the population mean, while English letters usually describe sample
characteristics, such as the sample mean. The population mean is represented by μ (pronounced
“mu”), the lowercase Greek letter m for mean,
μ = ΣX / N
where the uppercase letter N refers to the population size. Otherwise, the calculations are the
same as those for the sample mean.
Sample question: All 57 residents in a nursing home were surveyed to see how many times a day
they eat meals.
1 meal (2 people)
2 meals (7 people)
3 meals (28 people)
4 meals (12 people)
5 meals (8 people)
What is the population mean for the number of meals eaten per day?
μ = ((1x2)+(2x7)+(3x28)+(4x12)+(5x8))/57
= (2+14+84+48+40)/57
= 188/57
= 3.298 (approx. 3.3)
The population mean is approximately 3.3 meals per day.
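The same calculation can be checked with a short Python sketch. The dictionary name `meals_to_people` is an illustrative assumption; the numbers come from the survey above.

```python
# Meals per day -> number of residents, from the nursing home survey above.
meals_to_people = {1: 2, 2: 7, 3: 28, 4: 12, 5: 8}

N = sum(meals_to_people.values())                              # population size
total_meals = sum(m * p for m, p in meals_to_people.items())   # sum of all scores
mu = total_meals / N                                           # population mean

print(N, total_meals, f"{mu:.2f}")
```

Each meal count is weighted by the number of residents reporting it, which is equivalent to listing all 57 scores and adding them up.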
WHICH AVERAGE
If Distribution Is Not Skewed
When a distribution of scores is not too skewed, the values of the mode, median, and
mean are similar, and any of them can be used to describe the central tendency of the
distribution.
If Distribution Is Skewed
Positively skewed distribution
In a positively skewed distribution, the mean is greater than the median, and the tail extends in the
positive direction (right side).
Negatively skewed distribution
In a negatively skewed distribution, the mean is less than the median, and the tail extends in the
negative direction (left side).
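A few extreme scores pull the mean toward the tail while the median barely moves. The sketch below illustrates this with two small data sets that are hypothetical, invented here purely for illustration and not taken from the notes.

```python
from statistics import mean, median

# Hypothetical scores: one extreme value on the right (positive skew).
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]
# Hypothetical scores: one extreme value on the left (negative skew).
left_skewed = [1, 17, 18, 18, 18, 19, 19, 20]

for name, data in [("positive skew", right_skewed), ("negative skew", left_skewed)]:
    print(f"{name}: mean = {mean(data):.2f}, median = {median(data):.2f}")
```

In the first set the mean lies above the median; in the second it lies below. This is why the median is often preferred for describing skewed distributions.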
V. DESCRIBING VARIABILITY
Descriptions of the amount by which scores are dispersed or scattered in a distribution.
This chapter describes several measures of variability, including the range, the interquartile
range, the variance, and most important, the standard deviation.
INTUITIVE APPROACH
RANGE
VARIANCE
STANDARD DEVIATION
DEGREES OF FREEDOM (df )
INTERQUARTILE RANGE (IQR)
a. INTUITIVE APPROACH
You probably already possess an intuitive feel for differences in variability.
From the above diagrams, each of the three frequency distributions consists of seven scores
with the same mean (10) but with different variabilities. Before reading on, rank the three
distributions from least to most variable. Your intuition was correct if you concluded that
distribution A has the least variability, distribution B has intermediate variability, and
distribution C has the most variability.
Ex:
For distribution A with the least (zero) variability, all seven scores have the same
value (10).
For distribution B with intermediate variability, the values of scores vary slightly
(one 9 and one 11), and
For distribution C with most variability, they vary even more (one 7, two 9s, two
11s, and one 13).
b. RANGE
The range is the difference between the largest and smallest scores.
Distribution A, the least variable, has the smallest range of 0 (from 10 to 10); distribution B,
the moderately variable, has an intermediate range of 2 (from 11 to 9); and distribution C, the
most variable, has the largest range of 6 (from 13 to 7), in agreement with our intuitive
judgments about differences in variability. The range is a handy measure of variability that can
readily be calculated and understood.
c. VARIANCE (Type of Mean)
The mean of all squared deviation scores.
The variance merits attention, particularly for its square root, the standard deviation, because these
measures serve as key components for other important statistical measures.
Mean of the Squared Deviations
Before calculating the variance (a type of mean), negative signs must be eliminated from
deviation scores. Squaring each deviation—that is, multiplying each deviation by itself—
generates a set of squared deviation scores, all of which are positive. (Remember, the product of
any two numbers with similar signs is always positive.) Now it’s merely a matter of adding the
consistently positive values of all squared deviation scores and then dividing by the total number
of scores to produce the mean of all squared deviation scores, also known as the variance.
Variance = (Sum of all squared deviations) / (Number of scores)
For A, variance = 0/7 = 0.00
For B, variance = ((−1)² + 0² + 0² + 0² + 0² + 0² + (1)²)/7 = 2/7 = 0.29
For C, variance = ((−3)² + (−1)² + (−1)² + 0² + (1)² + (1)² + (3)²)/7 = 22/7 = 3.14
Its value equals 0.00 for the least variable distribution, A, 0.29 for the moderately variable
distribution, B, and 3.14 for the most variable distribution, C, in agreement with our intuitive
judgments about the relative variability of these three distributions.
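The three variances can be reproduced with a minimal Python sketch. The score lists are reconstructed from the descriptions of distributions A, B, and C given earlier (seven scores each, mean 10); the function name `variance` is an illustrative choice.

```python
# Distributions A, B, and C as described in the intuitive approach.
A = [10, 10, 10, 10, 10, 10, 10]
B = [9, 10, 10, 10, 10, 10, 11]
C = [7, 9, 9, 10, 11, 11, 13]

def variance(scores):
    """Mean of all squared deviation scores (divides by the number of scores)."""
    mean = sum(scores) / len(scores)
    squared_deviations = [(x - mean) ** 2 for x in scores]
    return sum(squared_deviations) / len(scores)

for name, scores in [("A", A), ("B", B), ("C", C)]:
    print(f"{name}: range = {max(scores) - min(scores)}, variance = {variance(scores):.2f}")
```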
d. STANDARD DEVIATION
A rough measure of the average (or standard) amount by which scores deviate on either side of
their mean.
The standard deviation is the square root of the mean of all squared deviations from the
mean, that is,
standard deviation = √variance
The value of the standard deviation can never be negative.
For distribution C in Figure 4.1, the square root of the variance of 3.14 yields a standard
deviation of 1.77. Given this perspective, a standard deviation of 1.77 is a rough measure of the
average amount by which the seven scores in distribution C (7, 9, 9, 10, 11, 11, 13) deviate on
either side of their mean of 10. In other words, the standard deviation of 1.77 is a rough measure
of the average amount for the seven deviation scores in distribution C, namely, one 0, four 1s,
and two 3s.
Sum of Squares (SS)
The sum of squared deviation scores.
Calculating the standard deviation requires that we obtain first a value for the variance. However,
calculating the variance requires, in turn, that we obtain the sum of the squared deviation scores.
The sum of squared deviation scores, or more simply the sum of squares, is symbolized by SS.
Sum of Squares (SS)
Population
Definition Formula : SS = Σ(X − μ)²
Computational Formula : SS = ΣX² − (ΣX)²/N
Variance : σ² = SS/N
Standard Deviation : σ = √(SS/N)
Sample
Definition Formula: SS = Σ(X − X̄)²
Computational Formula: SS = ΣX² − (ΣX)²/n
Variance: s² = SS/(n − 1)
Standard Deviation: s = √(SS/(n − 1))
Sum of Squares Formulas for Population
Standard Deviation for Population σ
A rough measure of the average amount by which scores in the population deviate on either side
of their population mean.
Recall that, most generally, a mean is defined as the sum of all scores divided by the number of
scores. Since the variance is the mean of all squared deviation scores, it can be defined as the
sum of all squared deviation scores divided by the number of scores:
σ² = SS / N
where σ² (pronounced “sigma squared”) represents the population variance, SS is the sum of squared
deviations for the population, and N is the population size.
The definition formula provides the most accessible version of the population sum of squares:
SS = Σ(X − μ)²
where SS represents the sum of squares, Σ directs us to sum over the expression to its right, and
(X − μ)² denotes each of the squared deviation scores. The formula reads: “The sum of squares equals
the sum of all squared deviation scores.” You can reconstruct this formula by remembering the following three
steps:
1. Subtract the population mean, μ, from each original score, X, to obtain a deviation score, X −
μ.
2. Square each deviation score, (X − μ)², to eliminate negative signs.
3. Sum all squared deviation scores, Σ(X − μ)².
Standard Deviation for Sample (s)
A rough measure of the average amount by which scores in the sample deviate on either side
of their sample mean.
Although the sum of squares term remains essentially the same for both populations and
samples, there is a small but important change in the formulas for the variance and standard
deviation for samples. This change appears in the denominator of each formula where N, the
population size, is replaced not by n, the sample size, but by n − 1, as shown:
s² = SS / (n − 1)
s = √(SS / (n − 1))
where s² and s represent the sample variance and sample standard deviation, and SS is the sample
sum of squares.
Sum of Squares Formulas for Sample
Sample notation can be substituted for population notation in the above two formulas without
causing any essential changes:
SS = Σ(X − X̄)²
SS = ΣX² − (ΣX)²/n
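A short Python sketch can confirm that the definition and computational formulas for SS agree, and show the effect of dividing by n − 1 instead of N. It uses the seven scores of distribution C; treating the same scores first as a population and then as a sample is done here only for comparison, and the variable names are illustrative.

```python
from math import sqrt

# Scores of distribution C from the variability examples.
scores = [7, 9, 9, 10, 11, 11, 13]
n = len(scores)
mean = sum(scores) / n

# Definition formula: sum of squared deviations from the mean.
ss_definition = sum((x - mean) ** 2 for x in scores)
# Computational formula: sum of X squared minus (sum of X) squared over n.
ss_computational = sum(x * x for x in scores) - sum(scores) ** 2 / n

# Treated as a population: divide SS by N.
population_sd = sqrt(ss_definition / n)
# Treated as a sample: divide SS by n - 1.
sample_sd = sqrt(ss_definition / (n - 1))

print(ss_definition, ss_computational)
print(f"{population_sd:.2f} {sample_sd:.2f}")
```

Both formulas give SS = 22. Dividing by n − 1 makes the sample standard deviation slightly larger than the population value of 1.77.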
e. DEGREES OF FREEDOM (df )
The degrees of freedom in a statistical calculation represent how many values involved in a
calculation have the freedom to vary.
Ex: Consider 3 students in a class, A, B, and C, who have secured 10, 5, and 15 marks. Identify the
degrees of freedom for the given data.
A 10
B 5
C 15
Step 1: Average (Mean) = (10+5+15)/3 = 30/3 = 10
Step 2: Find the difference from the mean for A, B, and C.
A: 10-10 = 0
B: 5-10 = -5
C: 15-10 = ?
Conclusion: The deviations from the mean must add up to 0, so once the deviations for A and B are
known, the deviation for C is fixed: 0 + (-5) + ? = 0, so ? = 5.
Here, only two of the three deviations are free to vary.
Hence, the degrees of freedom is n - 1 = 3 - 1 = 2.
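The constraint behind this example can be sketched in a few lines of Python: deviations from the mean always sum to zero, so the last deviation is determined by the others. Variable names are illustrative.

```python
# Marks of the three students from the example above.
marks = {"A": 10, "B": 5, "C": 15}
mean = sum(marks.values()) / len(marks)

dev_a = marks["A"] - mean     # free to vary
dev_b = marks["B"] - mean     # free to vary
dev_c = -(dev_a + dev_b)      # forced: all deviations must sum to zero

df = len(marks) - 1           # n - 1 values are free to vary
print(dev_a, dev_b, dev_c, df)
```

The forced value of `dev_c` matches the directly computed deviation, 15 − 10 = 5.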
f. INTERQUARTILE RANGE (IQR)
The range for the middle 50 percent of the scores.
The interquartile range (IQR) is simply the range for the middle 50 percent of the scores.
More specifically, the IQR equals the distance between the third quartile (or 75th percentile) and
the first quartile (or 25th percentile), that is, after the highest quarter (or top 25 percent) and the
lowest quarter (or bottom 25 percent) have been trimmed from the original set of scores.
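As a hedged sketch, the IQR can be computed with `statistics.quantiles` from the Python standard library. Several conventions exist for locating quartiles in small data sets, so other tools or textbooks may give slightly different values; this example relies on the function's default method and reuses the scores of distribution C as an illustrative data set.

```python
from statistics import quantiles

# Scores of distribution C from the variability examples.
scores = [7, 9, 9, 10, 11, 11, 13]

# n=4 splits the ordered data into quarters and returns Q1, Q2 (median), Q3.
q1, q2, q3 = quantiles(scores, n=4)
iqr = q3 - q1

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```

Unlike the range (13 − 7 = 6), the IQR ignores the most extreme scores at either end.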
VI. NORMAL DISTRIBUTIONS AND STANDARD (Z) SCORES
The shape of the normal distribution curve is determined by two parameters: the mean (μ) and the
standard deviation (σ). It is abbreviated as N(μ, σ).
The mean determines the location of the center of the graph, and the standard deviation determines
the height and width of the graph.
THE NORMAL CURVE
A theoretical curve noted for its symmetrical bell-shaped form.
Normal curve superimposed on the distribution of heights.
Properties of the Normal Curve
Obtained from a mathematical equation, the normal curve is a theoretical curve defined
for a continuous variable and noted for its symmetrical bell-shaped form.
Because the normal curve is symmetrical, its lower half is the mirror image of its upper
half.
Being bell shaped, the normal curve peaks above a point midway along the horizontal
spread and then tapers off gradually in either direction from the peak.
The values of the mean, median (or 50th percentile), and mode, located at a point midway
along the horizontal spread, are the same for the normal curve.
Different Normal Curves
A. Different Means, Same Standard Deviation
B. Same Mean, Different Standard Deviations
z SCORES
A unit-free, standardized score that indicates how many standard deviations a score is above or
below the mean of its distribution.
To obtain a z score, express any original score, whether measured in inches, milliseconds,
dollars, IQ points, etc., as a deviation from its mean (by subtracting its mean) and then split this
deviation into standard deviation units (by dividing by its standard deviation), that is,
z = (X – μ) / σ
where X is the original score and μ and σ are the mean and the standard deviation, respectively,
for the normal distribution of the original scores.
A z score consists of two parts:
1. a positive or negative sign indicating whether it’s above or below the mean; and
2. a number indicating the size of its deviation from the mean in standard deviation units.
Converting to z Scores
To answer the question about eligible FBI applicants, replace X with 66 (the maximum
permissible height), μ with 69 (the mean height), and σ with 3 (the standard deviation of heights)
and solve for z as follows:
z = (66-69)/3 = -3/3 = -1
Ex: A score of 470 on the SAT math test, given a mean of 500 and a standard deviation of 100.
z = (470-500)/100
= -30/100
= -0.30
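Both conversions can be checked with a small Python sketch. The helper name `z_score` is hypothetical, and the SAT example takes the standard deviation as 100, the value used as the divisor in the worked calculation.

```python
def z_score(x, mean, sd):
    """Deviation from the mean, expressed in standard deviation units."""
    return (x - mean) / sd

# FBI height example: X = 66, mean = 69, standard deviation = 3.
z_fbi = z_score(66, 69, 3)
# SAT math example: X = 470, mean = 500, standard deviation taken as 100.
z_sat = z_score(470, 500, 100)

print(z_fbi, z_sat)
```

Both z scores are negative because each original score lies below its mean.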
STANDARD NORMAL CURVE or STANDARD NORMAL DISTRIBUTION
The tabled normal curve for z scores, with a mean of 0 and a standard deviation of 1.
The standard normal distribution is one of the forms of the normal distribution. It occurs when
a normal random variable has a mean equal to zero and a standard deviation equal to one. In
other words, a normal distribution with a mean 0 and standard deviation of 1 is called the
standard normal distribution. Also, the standard normal distribution is centred at zero, and the
standard deviation gives the degree to which a given measurement deviates from the mean.
The random variable of a standard normal distribution is known as the standard score or a z-
score. It is possible to transform every normal random variable X into a z score using the
following formula:
z = (X – μ) / σ
where X is a normal random variable, μ is the mean of X, and σ is the standard deviation of X.
Area of Standard Normal Distribution
Diagrammatically, the probability that Z is less than a, written Φ(a) and read from the standard
normal distribution table, is demonstrated as follows:
P(Z < –a)
As noted above, the standard normal distribution table gives probabilities only for values less than
a positive z value (i.e., z values on the right-hand side of the mean). By the symmetry of the curve,
P(Z < –a) = P(Z > a) = 1 – Φ(a).
P(Z > a)
The probability P(Z > a) is 1 – Φ(a). To understand the reasoning behind this, look at the
illustration below:
You know Φ(a), and the total area under the standard normal curve is 1, so by
subtraction: P(Z > a) = 1 – Φ(a).
P(Z > –a)
The probability P(Z > –a) is P(Z < a), which is Φ(a). To see this, we use the
symmetry of the standard normal distribution curve. We are trying to find the area of the region
to the right of –a.
By symmetry, this area is the same size as the area to the left of a, which we
can read straight from the standard normal distribution table: it is
P(Z < a). In this way, P(Z > –a) is P(Z < a), which is Φ(a).
Probability between z values
Let us find the probability between the values of z, i.e., a and b.
Consider the graph given below:
Now,
P(Z < b) – P(Z < a) = Φ(b) – Φ(a)
Thus,
P(a < Z < b) = Φ(b) – Φ(a)
Here, the values of a and b are positive.
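The four area rules above can be verified numerically with `statistics.NormalDist` from the Python standard library (Python 3.8 or later), whose `cdf` method plays the role of Φ. The values a = 1 and b = 2 are illustrative choices, not taken from the notes.

```python
from statistics import NormalDist

Z = NormalDist()   # standard normal: mean 0, standard deviation 1
phi = Z.cdf        # phi(a) = P(Z < a), the value read from the table

a, b = 1.0, 2.0    # illustrative z values

p_less = phi(a)                 # P(Z < a)  = phi(a)
p_greater = 1 - phi(a)          # P(Z > a)  = 1 - phi(a)
p_greater_neg = 1 - phi(-a)     # P(Z > -a), which equals phi(a) by symmetry
p_between = phi(b) - phi(a)     # P(a < Z < b) = phi(b) - phi(a)

print(f"{p_less:.4f} {p_greater:.4f} {p_greater_neg:.4f} {p_between:.4f}")
```

Using the cumulative function directly avoids the table's restriction to positive z values, but the symmetry argument is still what makes P(Z > –a) equal Φ(a).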
Standard Normal Distribution Uses
The standard normal distribution is a tool to translate a normal distribution into numbers.
We may use it to get more information about the data set than was initially known.
The standard normal distribution allows us to quickly estimate the probability of specific
values occurring in our distribution or to compare data sets with varying means and standard
deviations.
Also, the z-score of the standard normal distribution is interpreted as the number of
standard deviations a data point falls above or below the mean.
Characteristics of Standard Normal Distribution
A z-score of a standard normal distribution is a standard score that indicates how many standard
deviations away from the mean an individual value (x) lies:
When z-score is positive, the x-value is greater than the mean
When z-score is negative, the x-value is less than the mean
When z-score is equal to 0, the x-value is equal to the mean