Statistics for the
Sciences
2023-24
Department of Mathematics
University of the West Indies
Kingston, Jamaica
Course Code: STAT1001
Course Title: Statistics for the Sciences
Class Times & Venue: MONDAYS 8-10 (M3) / WEDNESDAY 1-2 (M3)
Lecturer: Ajani Ausaru
Office Hours: 9 – 10 a.m. Tuesdays
Location: Room #6, upstairs in the Department of
Mathematics
Email: ajani.ausaru02@uwimona.edu.jm
Assessment
Incourse exam … 20%
Project … 15%
Homework Assignments (x2) … 15%
Final exam … 50%
1.0 Introduction- Definitions
Definitions
The following definitions are important in
understanding the underlying concepts behind
statistics:
Statistics: “the science of data involving collecting,
classifying, summarizing, organizing, analyzing, and
interpreting numerical information” -- McClave, Dietrich,
Sincich
Observation: a single collected data value (point).
Data or Data set: a set of numerical observations.
Definitions
Individuals: The people or objects from whom, or about whom,
data is collected.
Individuals may be people, but they may also be animals or things.
Example: Freshmen, 6-week-old babies, golden retrievers,
fields of corn, cells
Population: The entire group of individuals that we wish to get
information about in a single study.
Sample: A subgroup within a population.
[Figure: diagram labelled Individual, Sample, and Population]
Definitions
Variable: A characteristic of an individual.
When we collect data about individuals, we collect
values of variables. A variable can take different
values for different individuals.
More Examples of Variables: Weight, Age, Race,
Shoe Size, Favorite Football Team
Two types of variables
A variable can be either
Quantitative
Something that can be counted or measured for each individual and then added,
subtracted, averaged, etc., across individuals in the population.
Example: How tall you are, your age, your blood cholesterol level, the number of
credit cards you own.
Numerical characteristics or quantities of the individuals such as Height (65 inches
tall), Weight (180 pounds), and Income ($40,000 per year).
HINT: most quantitative variables are accompanied by units
Qualitative variable / Categorical
Something that falls into one of several categories. What can be counted is the
count or proportion of individuals in each category.
Example: Your blood type (A, B, AB, O), your hair color, your ethnicity, whether you
paid income tax last tax year or not.
Non-numerical characteristics or labels of the individual such as their Race (Black,
White, Asian…), Gender (Male, Female), Political Party (Democrat, Republican,
Independent)
Two types of variables
Determine if the following variables are
Qualitative or Quantitative:
Number of siblings –---------------------- ANSWER: ____
Number on a football player’s jersey –- ANSWER:____
Phone number – --------------------------- ANSWER:____
Cost (in dollars) to fill up a Corvette –- ANSWER:____
Marital status –---------------------------- ANSWER:____
Two Types of Quantitative Variables
1) Discrete variable: this kind of quantitative variable takes
values from a finite set of numbers or from a countable set of
numbers.
Example 1: Suppose we ask a UWI student the number of days
per week (7 days a week) that they go to school. The possible
values are {0, 1, 2, 3, 4, 5, 6, 7}, which is a finite set that contains
8 numbers.
Example 2: Suppose we ask UWI students how many times
they have visited the library this year. The possible values for this
number are only 0, 1, 2, 3, … (it cannot be 1.5 or 1.2).
These values represent counts, so this is discrete data.
Two Types of Quantitative Variables
2) Continuous variable: this kind of quantitative
variable takes values from infinitely many numbers,
and there are no gaps between these numbers
(i.e. they are uncountable).
Example of Continuous variable: Suppose we ask
UWI students the time that it takes them to drive to
school. It can be 10.0 minutes, 10.01 minutes, 10.001
minutes, etc. There are infinitely many possible
answers from each individual, and the possible values
cannot be counted, so this is continuous data.
Discrete vs. Continuous
Determine if the following
variables are Discrete or
Continuous:
Volume of water lost each day through a leaky faucet – ANSWER:____
Number of donors at blood bank – ANSWER:____
Points scored in an NCAA basketball game – ANSWER:____
Weight of a randomly selected person – ANSWER:____
Example BMW cars
Model           Body Style    Weight (lb)   # of Seats
M/Z3 Coupe      Coupe         2945          2
M/Z3 Roadster   Convertible   2690          2
3 Series        Coupe         2780          5
5 Series        Sedan         3450          5
7 Series        Sedan         4255          5
Z8              Convertible   3600          2
Identify the individuals.
____________________
Identify the variables.
_____________________
Identify the data corresponding to the variables.
_____________________
_____________________
_____________________
Determine whether each variable is qualitative, continuous, or
discrete.
___________________
___________________
___________________
Graphing Qualitative/Categorical Data
Many different graphs (charts) can be used to visualize
qualitative data. Three common ones are shown here:
Bar Chart
Pareto Chart
Pie Chart
Bar Chart
[Figure: Bar graph of accidents involving Firestone tire models]
Bar Chart:
-Each category is labeled on the
x-axis and is represented by one
bar.
-The bar’s height (y-axis) shows
the count or the percentage for
that particular category.
NOTE: The categories can
appear in any order regardless
of the bar’s height.
Interpretation:________________
____________________________
____________________________
Pareto Chart
[Figure: Pareto graph of accidents involving Firestone tire models]
Pareto Chart:
is just like a bar chart EXCEPT
the categories on the graph
appear in descending order of bar height.
So, using the bar chart example
above, the Pareto chart looks
like the chart in the figure.
Interpretation: From this chart,
we can see which categories the
majority of the individuals
belong to ___________________.
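The difference between the bar chart and the Pareto chart is only the order of the bars, which a short Python sketch can make concrete. The blood-type observations below are made up for illustration and are not data from this course.

```python
from collections import Counter

# Hypothetical observations of a categorical variable (blood type);
# these values are invented for illustration only.
blood_types = ["O", "A", "O", "B", "A", "O", "AB", "O", "A", "O"]

# Bar chart: one bar per category, height = count in that category.
counts = Counter(blood_types)

# Pareto chart: the same bars, sorted from tallest to shortest.
pareto_order = sorted(counts.items(), key=lambda item: item[1], reverse=True)

for category, count in pareto_order:
    print(f"{category:>2} | {'#' * count} ({count})")
```

Reading the first line of the output gives the category that most individuals belong to.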
Pie Charts
Pie Chart:
Each sector represents one category and
there must be no gaps between the
sectors.
The proportion of the pie occupied by
each sector is equal to the percentage
contributed by the category it
represents.
All percentages must add up to 100%.
NOTE: Categories may be displayed
in any order around the pie.
Interpretation:______________________
___________________________________
___________________________________
___________________________________
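As a rough sketch of how the sectors of a pie chart are sized, the Python lines below turn category counts into percentages and sector angles. The hair-colour counts are hypothetical, chosen only so that the arithmetic is easy to follow.

```python
from collections import Counter

# Hypothetical sample of 20 hair colours (illustration only).
hair_colors = ["black"] * 10 + ["brown"] * 6 + ["blond"] * 3 + ["red"] * 1

counts = Counter(hair_colors)
total = sum(counts.values())

# Each sector's share of the pie equals the category's percentage.
percentages = {color: 100 * n / total for color, n in counts.items()}

# A full circle is 360 degrees, so 1% of the pie is 3.6 degrees.
angles = {color: 3.6 * pct for color, pct in percentages.items()}

for color in counts:
    print(f"{color:<6} {percentages[color]:5.1f}%  {angles[color]:6.1f} degrees")
```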
Graphing Quantitative Data
Just as for qualitative data, there are many ways to
visualize quantitative data. Three common ways are
shown here:
1) Histograms
2) Stem Plots
3) Time Plots
Histograms
Histograms look like bar charts,
with some important differences.
Data are grouped into classes and
the class limits are marked on the
horizontal axis.
Counts are marked on the
vertical axis, which must begin at
zero.
The data classes must:
be non-overlapping,
be of equal width,
cover the entire range of data
values without any gaps
The class intervals are labeled on the x-axis, and the counts or
percent of values that lie within a class are labeled on the y-axis.
Histograms
For large datasets and/or quantitative variables that take many values:
- Divide the possible values into classes or intervals of equal widths.
- Count how many observations fall into each interval. Instead of
counts, one may also use percents.
- Draw a picture representing the distribution: each bar height is
equal to the number (or percent) of observations in its interval.
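The three steps can be sketched in Python. The nine values are the ones used in the box plot example in these notes; the class width of 20 and the starting point of 0 are assumptions made for this illustration.

```python
# Data values from the box plot example in these notes.
data = [25, 3, 40, 33, 99, 60, 58, 42, 44]

# Assumed class layout: five classes of width 20 covering 0 up to 100.
lower_limit = 0
class_width = 20
num_classes = 5

# Step 2: count how many observations fall into each class.
counts = [0] * num_classes
for value in data:
    index = (value - lower_limit) // class_width
    counts[index] += 1

# Step 3: draw one bar per class; bar length = count in that class.
for i, count in enumerate(counts):
    low = lower_limit + i * class_width
    high = low + class_width
    print(f"[{low:>3}, {high:>3}) | {'#' * count} ({count})")
```

Each class includes its lower limit and excludes its upper limit, so the classes do not overlap and leave no gaps.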
Interpreting histograms
When describing a quantitative variable, we look for the overall
pattern and for striking deviations from that pattern. We can describe
the overall pattern of a histogram by its shape, center, and spread.
[Figure: Histogram with a line connecting each column → too detailed]
[Figure: Histogram with a smoothed curve highlighting the overall pattern of
the distribution]
Shapes of histograms
[Figure: four example histograms, one for each shape below]
Uniform: the same number of data values in each class
Symmetric: right half of histogram is mirror image of left half
Right-skewed: more low and fewer high data values
Left-skewed: fewer low and more high data values
Interpreting histograms
Overall Pattern
Shape: symmetric, right-skewed, or left-skewed
Center: the value that has the property that roughly half of the values
(50% of the values) are larger than it and roughly half of the values are
smaller than it
Spread: lowest and highest values (or the distance between them)
Deviations from the overall pattern:
Outliers: values that fall outside of the overall pattern.
Interpreting histograms
Example: Histogram of Architectural Firm Staff Employees
[Figure: histogram with Staff counts on the x-axis and Frequency on the y-axis]
Shape: right-skewed
Center: 60 staff
Spread: 0 to 140 staff members
No outliers
Stem Plots
Stem Plots: used to organize quantitative data by
separating the numbers into stems and leaves.
The stem is all digits except the last digit of the number
The leaf is the last digit of the number you are given.
In some cases, leaves may consist of more than just the last
digit and stems would then consist of all but those digits.
Stem Plots
Example:
Given the number 118, the stem is 11, leaf is 8.
Given the number 18, the stem is 1, leaf is 8.
Given the number 2.1, the stem is 2, leaf is 1.
How to Construct a Stem Plot
1 – Separate each data value into a stem (all but the last digit) and a
leaf (the last digit)
2 – Draw up a table with two columns, label the left column "Stem"
and the right column "Leaves"
3 – In the Stem column, write down the stems in ascending order from
top to bottom. Be sure to include all stems between the first and the
last, even if there is no data value that has that stem
4 – In the Leaves column, write down the leaves beside the
appropriate stem, in ascending order from left to right
5 – Write down a key with units for your stem plot: e.g. 8|9 = 8.9
grams.
Stem Plot - Example
Example:
Construct a stem plot of the following lengths (in inches):
0.61, 0.70, 0.74, 0.82, 0.86, 0.63, 0.92, 0.98, 0.65, 0.49,
0.67, 0.78
Stem | Leaves
KEY:
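The construction steps can be sketched in Python for the twelve lengths in this example. The choice to write the stem as the tenths digit only (so that 4|9 stands for 0.49 inches) is an assumption about how the key is written.

```python
from collections import defaultdict

lengths = [0.61, 0.70, 0.74, 0.82, 0.86, 0.63,
           0.92, 0.98, 0.65, 0.49, 0.67, 0.78]

# Step 1: split each value into a stem and a leaf (the last digit).
leaves_by_stem = defaultdict(list)
for value in lengths:
    hundredths = round(value * 100)      # 0.61 -> 61, avoids float noise
    stem, leaf = divmod(hundredths, 10)  # 61 -> stem 6, leaf 1
    leaves_by_stem[stem].append(leaf)

# Step 3: list every stem from the first to the last, even empty ones.
stems = range(min(leaves_by_stem), max(leaves_by_stem) + 1)

# Step 4: sort the leaves on each stem in ascending order.
rows = {stem: sorted(leaves_by_stem.get(stem, [])) for stem in stems}

print("Stem | Leaves")
for stem, leaves in rows.items():
    print(f"{stem:>4} | {' '.join(str(leaf) for leaf in leaves)}")

# Step 5: the key tells the reader how to read a stem and leaf together.
print("KEY: 4|9 = 0.49 inches")
```

The stem 5 appears with no leaves because no length lies between 0.50 and 0.59.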
Stem Plot
If there are very few stems (when the data cover only a very small range
of values), then we may want to create more stems by splitting the
original stems.
Example: If all of the data values are between 150 and 179, then we may
choose to use the following stems:
15
15
16
16
17
17
Leaves 0–4 would go on each upper stem (first “15”), and leaves 5–9
would go on each lower stem (second “15”).
Bar & Pie Charts in Excel
This will be demonstrated in class using the Excel
example.
Bar Chart
Commands – highlight the data then click on the
“insert” tab then click on “column chart” for bar chart.
You can click on “chart design” to get different options
for the charts.
Pie Chart
Commands – highlight the data then click on the
“insert” tab then click on “pie chart” for pie chart.
Excel Data & Output
Describing distributions with numbers
Measure of center: mean and median
Measure of spread: quartiles, IQR, standard
deviation.
The five-number summary and boxplots
Outliers
Choosing among summary statistics
Measures of Center
Measures of center (Mean, Median): values that are used to
describe the center of a data set.
Mean: the average of a data set. To find the mean, divide the sum
of all data values by the total number of values in the
data set.
Sample mean (statistic): The mean of a sample data set is
denoted by x̄ (what we are interested in finding since statistics
is based on sample data).
Population mean (parameter): The mean of a population
data set is denoted by the Greek letter µ.
Mean: Mathematical notation
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i \]
Example: The following gives the volumes (in ounces)
of the Coke in different cans. Find the mean of this sample:
12.3, 12.1, 12.4, 12.1, 12.2
Step 1: Find n.
Step 2: Use the formula to find the mean.
Learn right away how to get the mean using your
calculators.
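Besides a calculator, the two steps can be checked with a few lines of Python. This is only a sketch using the Coke volumes from the example.

```python
# Volumes (in ounces) of the Coke in five cans, from the example.
volumes = [12.3, 12.1, 12.4, 12.1, 12.2]

# Step 1: find n, the number of data values.
n = len(volumes)

# Step 2: add up all the values and divide by n.
sample_mean = sum(volumes) / n

print(f"n = {n}")
print(f"sample mean = {sample_mean:.2f} ounces")
```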
Median
Median: the middle value of an arranged data set placed in
ascending order.
(i.e. 50% of the values lie below this value, and 50% lie above this
value).
Calculation of Median
1) Arrange the data values in an ascending order from smallest to
largest.
1a) If the number of data values is odd, then the median is the
number in the exact middle. Find the location of the median by
counting (n+1)/2 observations up from the bottom of the list.
1b) If the number of data values is even, the median is the mean
of the two middle numbers in the sorted data. The location of
the median is again (n+1)/2 observations up from the bottom of
the list, which falls halfway between the two middle observations.
Median
Example 1: (Odd data set) Find the median of the
following five students’ scores of an exam: 90, 60, 50,
41, 92.
Step 1: Place the data values in ascending order.
Step 2: Do we have an even/odd data set?
Step 3: Find median.
Example 2: (Even data set) Find the median of the
following six people’s salary (in thousand dollars): 90,
60, 45, 46, 100, 46.
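The calculation rule for the median can be sketched as a small Python function and tried on both examples. The function name `median` is chosen here for illustration.

```python
def median(values):
    """Middle value of the data after sorting in ascending order."""
    ordered = sorted(values)          # Step 1: ascending order
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                    # odd n: the exact middle value
        return ordered[middle]
    # even n: the mean of the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2

exam_scores = [90, 60, 50, 41, 92]        # Example 1 (odd data set)
salaries = [90, 60, 45, 46, 100, 46]      # Example 2 (even data set)

print("Median exam score:", median(exam_scores))
print("Median salary (thousand dollars):", median(salaries))
```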
Mode
Mode: the most frequently occurring number.
Example: What is the mode of the following numbers?
3, 4, 7, 1, 3, 5, 8, 9, 3.
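One way to check a mode is to count how often each number occurs. The sketch below does this in Python with the numbers from the example.

```python
from collections import Counter

numbers = [3, 4, 7, 1, 3, 5, 8, 9, 3]

# Count how many times each number occurs.
frequencies = Counter(numbers)

# The mode is the number with the highest count.
mode, times = frequencies.most_common(1)[0]

print(f"mode = {mode} (occurs {times} times)")
```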
Quartiles
Quartiles divide the ordered data set into four equal-sized
groups. Q1, Q2 & Q3
Finding the quartiles
(i) Order the data set from smallest to largest, smallest on the
leftmost end.
(ii) For Q2: Find the overall median of the entire data set .
(iii) For Q1: Find the median of the first half of observations that lie
to the left of Q2.
(iv) For Q3: Find the median of the second half of observations
that lie to the right of Q2.
Five-Number Summary
Five-number summary: used to describe the center,
variation, and distribution of the data, and to
reveal whether outlier(s) exist. The five numbers are
as follows:
Minimum data value (denote it as min.)
Maximum data value (denote it as max.)
Q2 (median)
Q1
Q3
Five-Number Summary Examples
Examples:
Find the quartiles of the following data sets.
(i) 36, 74 , 85 , 29 , 10
Step 1: Place in ascending order. 10, 29, 36, 74, 85
Step 2: Find Q2: 36
Step 3: Find Q1: (10+29) / 2 = 19.5
Step 4: Find Q3: (74+85) / 2 = 79.5
(ii) 36 , 74 , 85 , 29 , 10 , 43
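The quartile steps used in these notes (when n is odd, the median itself is left out of both halves) can be sketched in Python as follows. Other textbooks and software packages use slightly different conventions, so their quartiles may not match these values exactly. The function names are chosen for illustration.

```python
def median(ordered):
    """Median of a list that is already sorted in ascending order."""
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                 # observations left of Q2
    # For odd n, skip the middle observation itself.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)

print(quartiles([36, 74, 85, 29, 10]))        # data set (i)
print(quartiles([36, 74, 85, 29, 10, 43]))    # data set (ii)
```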
Five-Number Summary and Box Plot
The Five-Number Summary of a Distribution:
Minimum, Q1, M, Q3, Maximum
Box Plot (Graphical Display of the Five-
Number Summary):
[Figure: box plot with Min, Q1, M, Q3, and Max marked]
Box Plots can be displayed vertically also!!!
Five-Number Summary and Box Plot
Example: Write out the five number summary and construct a
box plot for the given data set: 25, 3, 40, 33, 99, 60, 58, 42, 44.
Step 1: Place in ascending order. 3, 25, 33, 40, 42, 44, 58, 60, 99
Step 2: Min = 3
Step 3: Max = 99
Step 4: Find Q2: 42
Step 5: Find Q1: (33+25) / 2 = 29
Step 6: Find Q3: (60+58) / 2 = 59
Step 7: Construct box plot (in space below).
You should get something like
this:
Measures of Variation/ Spread
Measures of Variation- Range, IQR, Variance,
Standard Deviation: a value that is used to describe
the spread/variability within a data set.
Range: the difference between the largest value and
the smallest value in a data set.
Range=largest value-smallest value
Range
It is generally not a good measure of spread because
of its extreme sensitivity to outliers.
We only use the max and min of the data set to
calculate the range, so if either of these is an outlier it
will increase the range dramatically.
The range will then not give us a good measure of the
spread within the data set.
Example of Finding the Range
Given the data set: 2.5, 5.7, 1.9, 4.8, 6.8, 3.1
Range = 6.8 - 1.9 = 4.9
Inter-quartile Range or IQR
Inter-quartile Range or IQR: the difference
between the third Quartile and the first Quartile.
IQR=Q3-Q1
So the IQR in the Box-Plot example is: IQR = 59-29 =
30
Identifying Outliers
Identifying Outliers - Calculating the “fences”
The "fences" of a data set help us to identify the
outliers (or absence thereof).
Lower fence = Q1 − (1.5 × IQR)
Upper fence = Q3 + (1.5 × IQR)
* You can show the above by letting L = Lower fence and U =
Upper fence.
NOTE: Outliers are now any data points that are either less than
the lower fence or greater than the upper fence.
Calculating Outliers
Example:
Consider the following sorted teachers’ salaries
(dollars per academic year):
26700, 27500, 28000, 29000, 29750, 30000, 31000,
31500, 31500, 33000, 34000, 89000
Conduct a formal statistical test for outliers.
Calculating Outliers –Example Cont’d
Answer
1) Find Q1, Q2, and Q3: Q1=28,500, Q2=30,500, Q3=32,250.
2) Calculate IQR: IQR= Q3-Q1=3750
3) U = Q3 + 1.5 * IQR = 32,250 + 1.5 * 3750 = 37875.
4) L = Q1 - 1.5 * IQR = 28,500 - 1.5 * 3750 = 22875.
Low Outlier(s): there is no value that lies below 22,875.
High Outlier(s): 89,000 is an outlier because it lies above 37,875.
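The fence calculation can be reproduced with a short Python sketch. It uses the same median-of-halves quartile method as these notes; the helper name `median` is chosen for illustration.

```python
def median(ordered):
    """Median of a list that is already sorted in ascending order."""
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

salaries = [26700, 27500, 28000, 29000, 29750, 30000,
            31000, 31500, 31500, 33000, 34000, 89000]

ordered = sorted(salaries)
n = len(ordered)
half = n // 2

# Quartiles: medians of the lower and upper halves of the data.
q1 = median(ordered[:half])
q3 = median(ordered[half + 1:] if n % 2 == 1 else ordered[half:])
iqr = q3 - q1

# Fences: 1.5 IQRs below Q1 and above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Outliers lie below the lower fence or above the upper fence.
outliers = [x for x in ordered if x < lower_fence or x > upper_fence]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"Lower fence = {lower_fence}, Upper fence = {upper_fence}")
print(f"Outliers: {outliers}")
```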
Standard deviation
Sample standard deviation (statistic): measures
the variation of sample data values from the sample
mean x̄.
The sample standard deviation is denoted by s.
\[ s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} \]
Standard deviation
We need to calculate \( \sum_{i=1}^{n}(x_i - \bar{x})^2 \).
It is often simpler to complete this step using the tabular
method as follows.
Step 1) Calculate \( \bar{x} \)
Step 2) Calculate the difference between each data value
and the sample mean: \( x_i - \bar{x} \)
Step 3) Square each difference: \( (x_i - \bar{x})^2 \)
Step 4) Add all the squared differences together:
\( \sum_{i=1}^{n}(x_i - \bar{x})^2 \)
Standard deviation
While this gives a concise formula to follow, we will
break the calculations into parts to keep things simple
for everybody.
Example: Find the sample standard deviation of the
following sample data.
12, 12.3, 11.5, 10.5, 8.7, 2.5, 13.5
Standard deviation – Example Cont’d
Step 1) First we calculate the sample mean.
n = 7, so we have
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{\sum_{i=1}^{7} x_i}{7} = \frac{12 + 12.3 + 11.5 + 10.5 + 8.7 + 2.5 + 13.5}{7} = \frac{71}{7} \approx 10.1429 \]
Although this example has 4 decimal places, you can
round to 2 decimal places to keep things simple.
Standard deviation – Example Cont’d
xi      Step 2: xi - x̄                Step 3: (xi - x̄)²
12      12 - 10.1429 = 1.8571          1.8571² = 3.4488
12.3    12.3 - 10.1429 = 2.1571        2.1571² = 4.6531
11.5    11.5 - 10.1429 = 1.3571        1.3571² = 1.8417
10.5    10.5 - 10.1429 = 0.3571        0.3571² = 0.1275
8.7     8.7 - 10.1429 = -1.4429        (-1.4429)² = 2.0820
2.5     2.5 - 10.1429 = -7.6429        (-7.6429)² = 58.4139
13.5    13.5 - 10.1429 = 3.3571        3.3571² = 11.2701
Step 4: ∑ (xi - x̄)² = 81.8371
Standard deviation – Example Cont’d
Step 5) Divide the sum (from step 4) by n-1.
This gives
\[ \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} = \frac{81.8371}{7-1} = 13.6395 \]
Step 6) Finally, take the square root, which gives us what we are looking for.
\[ s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}} = \sqrt{13.6395} = 3.6932 \]
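The six steps of the tabular method can also be followed in Python, which is a convenient way to check hand calculations. This sketch keeps the mean unrounded, so intermediate values may differ from hand-rounded ones in the last decimal place.

```python
import math

data = [12, 12.3, 11.5, 10.5, 8.7, 2.5, 13.5]
n = len(data)

# Step 1: the sample mean.
mean = sum(data) / n

# Step 2: the difference between each value and the mean.
deviations = [x - mean for x in data]

# Step 3: square each difference.
squared_deviations = [d ** 2 for d in deviations]

# Step 4: add the squared differences together.
sum_of_squares = sum(squared_deviations)

# Step 5: divide by n - 1 (this is the sample variance).
variance = sum_of_squares / (n - 1)

# Step 6: take the square root to get the sample standard deviation s.
s = math.sqrt(variance)

print(f"mean = {mean:.4f}")
print(f"sum of squared deviations = {sum_of_squares:.4f}")
print(f"variance = {variance:.4f}")
print(f"s = {s:.4f}")
```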
Properties of the standard deviation
1. s is always positive or zero.
2. s = 0 only when there is absolutely no variation, i.e. when all
the observations are the same.
3. s is not resistant — outliers and extreme values in skewed
distributions increase the value of s.
Variance: square of the standard deviation. For a sample, the
sample variance is s².
Example: What is the sample variance of the previous example?
When to use which
When the distribution is                     summarize center and spread with
fairly symmetric with no outliers            x̄ and s
strongly skewed or has extreme outliers      M and IQR
Properties of Mean & Median
The median is resistant to extreme values and the mean is not.
The median is more appropriate for strongly skewed distributions or
distributions with outliers.
The mean should be reserved for fairly symmetric distributions with
no outliers.
It is appropriate to use the interquartile range as the measure of
spread when the median is used as the measure of center, since the
median itself is a quartile and the quartiles are resistant to extreme
values and outliers.
Standard deviation should be used as the measure of spread when the
mean is used as the measure of center because it measures spread
about the mean.
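To see what "resistant" means in practice, the sketch below compares the mean and the median of the teachers' salaries from the outlier example, with and without the 89,000 outlier. It uses Python's standard `statistics` module.

```python
import statistics

salaries = [26700, 27500, 28000, 29000, 29750, 30000,
            31000, 31500, 31500, 33000, 34000, 89000]

# The same data with the high outlier (89000) removed.
without_outlier = [x for x in salaries if x != 89000]

mean_with = statistics.mean(salaries)
median_with = statistics.median(salaries)
mean_without = statistics.mean(without_outlier)
median_without = statistics.median(without_outlier)

print(f"With outlier:    mean = {mean_with:.2f}, median = {median_with}")
print(f"Without outlier: mean = {mean_without:.2f}, median = {median_without}")
```

Removing one value moves the mean by several thousand dollars, while the median changes by only 500, which is why the median is preferred when outliers are present.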