Highline Class, BI 348
Basic Business Analytics using Excel, Chapter 02
Descriptive Statistics
1
Topics
Data Types & Default Alignment in Excel
Raw Data, Data
Variable, Element, Observation
Proper Data Set: Proper Table of Data
Population and Sample
Categorical and Quantitative Data
Cross Sectional and Time Series Data
Sources of Data
Sort & Filter to Organize Data
Conditional Formatting to Visualizing Data
Topics
Frequency Distributions for Categorical Data, Charts:
Column
Frequency Distributions for Quantitative Data, Charts:
Histogram
Skew of Histograms
Cumulative Distributions
3
Topics
Measures of Location
Mean
Median
Mode
Geometric Mean
Measures of Variability
Range
Variance
Standard Deviation
Coefficient of Variation
Z-score: Number of Standard Deviations
Topics
The Normal Distribution & the Empirical Rule
Identifying Outliers
Percentiles and Quartiles
Box Plots
Raw Data: Data stored in its smallest size
No:
Yes:
Addresses
313 173rd Blvd, Kent, WA 981215
316 66th Blvd, Kent, WA 981244
4358 23rd St, Kent, WA 981225
965 151st St, Kent, WA 981162
7900 173rd Lane, Kent, WA 981266
4047 15th Ave, Kent, WA 981228
4907 13th Ave, Kent, WA 981232
3789 4th Blvd, Seattle, WA 981152
2977 66th Lane, Seattle, WA 981171
3392 23rd St, Seattle, WA 981131
Address
313 173rd Blvd
316 66th Blvd
4358 23rd St
965 151st St
7900 173rd Lane
4047 15th Ave
4907 13th Ave
3789 4th Blvd
2977 66th Lane
3392 23rd St
City
Kent
Kent
Kent
Kent
Kent
Kent
Kent
Seattle
Seattle
Seattle
State
WA
WA
WA
WA
WA
WA
WA
WA
WA
WA
Zip
981215
981244
981225
981162
981266
981228
981232
981152
981171
981131
Why?
Because it is easier to analyze data when it is stored in its smallest parts
Data:
Textbook: Facts or figures collected, analyzed and summarized
for presentation and interpretation
Data = all the unorganized raw data in a Proper Data Set
Transaction
Date
Number
12568
12569
12570
12571
12572
12573
12574
12575
12576
Sales
12/1/2014
12/1/2014
12/2/2014
12/2/2014
12/2/2014
12/3/2014
12/3/2014
12/3/2014
12/3/2014
SalesRep
$19,161 Jo
$15,027 Gigi
$12,953 Chin
$12,670 Jo
$8,893 Gigi
$4,667 Chin
$20,272 Jo
$20,204 Gigi
$17,223 Chin
Data Types & Default Alignment in Excel
Empty Cells Not really a Data Type, but it is a "thing" in Excel that can sometimes
cause problems.
**Refer to Empty Cells as "Empty Cells", not blanks.
Why Default Alignment? Because Left means Excel thinks it is Text and Right means Excel
thinks it is a Number. This is important when dealing with data because some systems will
mistakenly import numbers as text. Numbers as text do not always behave like you expect
(like not being added by the SUM function. The Default Alignment is a visual cue that
informs us about how Excel sees the data.
Proper Data Set: Proper Table of Data
A structure for your data set
necessary so that Excel Data
Analysis features like Sort,
Filter and PivotTables will work
correctly:
1. Fields in first row (no empty
cells)
2. Records or Observations in
rows
3. Empty cells or Excel
Row/Column Headers all the
way around Data Set
4. Try not to have empty cells in
data set
Transaction
Number
Date
12568
12569
12570
12571
12572
12573
12574
12575
12576
Sales
12/1/2014
12/1/2014
12/2/2014
12/2/2014
12/2/2014
12/3/2014
12/3/2014
12/3/2014
12/3/2014
SalesRep
$19,161 Jo
$15,027 Gigi
$12,953 Chin
$12,670 Jo
$8,893 Gigi
$4,667 Chin
$20,272 Jo
$20,204 Gigi
$17,223 Chin
Terms for Proper Data Set
Primary Key /
List of Unique Elements
Variables
Element = Entities on
which data are collected.
We are collecting data for
each Transaction Number.
Transaction Number is the
Element.
Each row is
a Record /
Observation
10
All 4 are called Fields (Column Headers)
Variable, Element, Observation
Variable
A characteristic or quantity of interest that can take on different values
A Variable is also known as a Field or Column Header in Database terminology
Example: Street address, City, State, Zip for a customer
Element
Entities on which data are collected
Like collecting data for an Employee or Invoice Number
Primary Key
When the first column in a Proper Data Set contains a Unique List of Elements, it is
called a Primary Key.
Primary Key, Unique List of Elements, List of Unique Identifiers, Distinct List are all
synonyms
The Primary Key assure that data collected for a give element is stored in one and
only one place.
Observation or Record
A set of values corresponding to a set of Variables (Fields) for a set of Elements
11
Proper Data Set with a Primary Key / List
of Unique Elements:
Proper Data Set:
12
Proper Data Set with NO Primary Key /
List of Unique Elements:
Proper Data Set:
Using the PivotTable feature we can create a
Proper Data Set with a Primary Key (Unique
List of Products or Elements):
13
Variables
Variable (from previous slide)
A characteristic or quantity of interest that can take on different values
Decision Variables
Variables under the direct control of decision makers
Example
The Quantity Variable for a manufacturer. Managers can decide how
many to make each day.
Random (uncertain variables) Variables:
In general, variables that are outside of the decision makers control
A quantity whose value is not known with certainty
Example:
Stock Price of Yahoo
Number of units sold of a particular product
14
Variables and Variation
Variation
If you own Yahoo Stock, you would be
interested in the Variation in the Variable
Price (Adj Close).
The difference in a variable measured over
observations
Differences over time
Differences between customers or products
**We will have a numerical measure for
variation later
Roll of Descriptive Statistics:
Collect Past Observed Values for Variables
or Realizations of Variables or Raw Data
or Data
Analyze Data to gain a better understanding
of the variation and its impact on the
business setting/situation
15
Population and Sample
Population
All elements of interest
Sample
Subset of the population
Random sampling
A sampling method to gather a representative sample of the
population data.
Each element comes from the same population (Target Population)
Each element is selected independently (without bias)
16
Categorical and Quantitative Data
Quantitative Data
Number Data on which numeric and arithmetic operations, such as
addition, subtraction, multiplication, and division, can be performed.
Discrete Quantitative Data: There are gaps between numbers, like
counting: 1, 2, 3
Continuous Quantitative Data: There are no gaps between numbers,
like weight, time, money. The number depends on the measurement
instrument.
Categorical Data
Not Number Data, like Product Names or Yes No Data on which arithmetic
operations cannot be performed.
17
Data Terminology
Cross-sectional Data
Cross-sectional Data
Data collected from several
elements/entities at the same, or
approximately the same, point in time.
Sep 22, 2015
Market Cap:
Employees:
Qtrly Rev Growth (yoy):
Revenue (ttm):
Gross Margin (ttm):
EBITDA (ttm):
Operating Margin (ttm):
Net Income (ttm):
EPS (ttm):
P/E (ttm):
PEG (5 yr expected):
P/S (ttm):
GOOG
426.88B
69.61B
22.62B
14.39B
YHOO
FB
Industry
28.62B
261.91B 277.63M
57148
12500
10955
0.11
0.15
0.39
4.87B
14.64B
132.20M
0.62
0.67
0.83
541.75M 6.38B
3.47M
0.02
0.32
0.26
6.94B
2.72B
N/A
21.22
7.2
0.98
29.34
4.22
94.47
1.22
-2.38
1.59
6.26
6.02
18.39
355
0.15
Time Series Data
Data collected over several time
periods (Year, Month, Day, Hour).
Charts of time series data are common
in business and economics.
Help analysts understand what
happened in the past, identify trends
over time, and project future levels for
the time series.
0.58
0.01
0
33.33
1.07
3.74
18
Sources of Data
Experimental study
A variable of interest is first identified.
Then one or more other variables are identified and controlled or manipulated so that
data can be obtained about how they influence the variable of interest.
Nonexperimental study or observational study - Make no attempt to control the
variables of interest.
A survey is perhaps the most common type of observational study.
Existing Data Sets:
Customer Lists
Sales or Expense Lists
Census Data
Weather Data
Government sources (data.gov)
Purchase data from companies such as: Bloomberg, Dow Jones
19
Sort & Filter to Organize Data
Sort
Organize the Raw Data by sorting
Example: Sort Sales biggest to
smallest
Sort Buttons in Data Ribbon
Sort columns one by one, with
the Major Sort last.
Sort Dialog Box
Make sure that Major Sort
on top.
Keyboard for Sort: Alt, D, S
Filter
Must have a Proper Data Set
Filter Button in Data Ribbon
Great for querying a data set
(Extracting Observations / Records
from a Proper Data Set) to get a
sub-set of data based on a set of
conditions or criteria
20
PivotTables
What does a PivotTable do?
Makes calculations with criteria.
PivotTables create reports that contain calculations with
criteria.
21
How to create PivotTable:
Visualize the PivotTable 1st, see the row headers and column headers, see the values.
Must have Proper Data Set: 1) Field Names in first rows, 2) empty cells or row/column
headers all around data set
Click in one cell in Proper Data Set
Insert Ribbon Tab, Tables group, PivotTable button, make sure location has not data
below it.
Keyboard: Alt, N, V.
Keyboard on new sheet: Alt, N, V, Enter
From Field List, drag field name (Criteria for calculations) to Row Header or Column
Header
From Field List drag field you want to make a calculation upon to values area
Formatting:
Design, Report Layout, Show in Tabular or Outline Form
Right-Click: Number Formatting (so format follows the field if you Pivot)
22
Inside the PivotTable:
Pivot: drag and drop fields
Filter from dropdown arrows
Change calculation:
Right-click Summarize Values As (Change Function)
or
Right-click Show Values As (New Calculation)
If you want more than one calculation, drop the field into the Values area more
than one time and then change the calculation.
To Group, after dragging field to row area, Right-click, Group.
When Grouping in a PivotTable, Numbers with Decimals trigger ambiguous labels.
When Grouping in a PivotTable, Numbers with NO Decimals create unambiguous
labels
23
Conditional Formatting to Visualizing Data
Each cell in the highlighted range must get a logical test
that comes out TRUE (apply formatting) or FALSE (do NOT
apply formatting)
Logical test can be created with built-in features or Logical
Formulas
Great for visualizing data based on a set of conditions or
criteria
24
Frequency Distributions and
Column/Bar Charts for Categorical Data
Frequency Distribution for Categorical Data is a tabular summary which:
1. Shows the number of observations (count or frequency) in each of a set
categories (unique list from data set)
2. Categories must be Collectively Exhaustive Categories (enough categories so
nothing is left out) and Mutually Exclusive Categories (no item can fit into more
than one category)
3. Goal is to is to provide information about frequencies (count)
Relative Frequency Distribution
Shows decimal value that represents "parts compared to the whole" (used in
chapter 4 for assigning probabilities)
Percent Frequency Distribution
Formats Relative Frequencies with Percent Number Format
25
Frequency Distributions and
Column/Bar Charts for Categorical Data
Column/Bar Chart:
Used to show Frequency Distribution or Relative/Percent Frequency
Distribution for Categorical Data
Counts across categories. Height of columns convey count. Order of
categories conveys no info
There are "gaps" between columns to indicate that the data is
categorical or a discrete quantitative variable (not a continuous
quantitative variable). Columns do not touch
26
Frequency Distributions and
Column/Bar Charts for Categorical Data
PivotTable:
COUNTIFS function:
Web Site
Frequency % Frequency
amazon.com
11436
43.12%
coloradoboomerangs.com
6380
24.05%
ebay.com
5810
21.90%
gel-boomerang.com
2898
10.93%
Grand Total
26524
100.00%
Web Site
amazon.com
coloradoboomerangs.com
ebay.com
gel-boomerang.com
Total
Frequency
% Frequency
11436
43.12%
6380
24.05%
5810
21.90%
2898
10.93%
26524
100.00%
Car Chart (Column on its side):
Boomerang Inc. 2015 Sales Frequency by Web Site
gel-boomerang.com
ebay.com
coloradoboomerangs.com
amazon.com
2898
5810
6380
11436
27
Frequency Distributions for Quantitative Data
Frequency Distribution is a tabular summary which:
1. Shows the number of observations (count or frequency) in each of
several nonoverlapping categories / classes / bins
Categories, classes and bins are synonyms
2. Categories must be Collectively Exhaustive Categories (enough
categories so nothing is left out) and Mutually Exclusive Categories
(no one item can fit into 2 or more categories)
3. Goal is to is to provide information about frequencies (count) and
reveal the shape of the quantitative data
28
1.
2.
3.
4.
5.
Creating Classes for Quantitative Variables
The goal is to reveal the natural distribution or shape or variation of the data. This is the "art side of
statistics". It takes practice to get the hang of it.
Determine the number of nonoverlapping classes. Goal is to have enough to show natural shape of
data. One general guideline is: 2^k > n, where n = count and k = number of classes.
Determine the width of each class with something like: Approx. width = (Max-Min)/(Number of
classes).
Determine the class limits: the key is to not create classes where you would double count.
1. If you have a discrete variable (or a continuous variable that is shown as a whole number) it is just a
matter of getting the lower and upper limit, like: 0-9, 10-19...
2. If you have a continuous variable and you choose to use the upper limit from the previous class as the
lower limit for the current class, be sure to include the equal sign on either the lower or upper, but
not both. Create classes like: 0 <= Sales < 20, 20 <= Sales <40... or 0 up to 20, 20 up to 40...
3. When we create a set of classes, we create a type of category for our continuous quantitative variable
4. Making the classes all the same width helps to create tables & charts that are more easily interpreted
5. Sometimes if there are a few large values or small values, it may be efficient to create an open ended
class
Class midpoint is calculated as the halfway mark between the lower and upper limit
29
Relative Frequency Distributions for
Quantitative Data
Relative Frequency Distribution:
Shows decimal value that represents "parts compared to the
whole
Often the basis for probability calculations (Relative Method)
Percent Frequency Distribution:
Formats Relative Frequencies with Percent Number Format
30
Histograms for Quantitative Data
Histograms
Used to show frequency distribution of continuous quantitative data
over a set of class intervals (lower and upper limit for each category)
Column or Bar Charts where columns are touching to indicate that the
variable is continuous
Columns touch to indicate that no numbers can fit between classes.
"No numbers can fit between columns - no gaps"
Height of columns convey count
Order of classes is important to help reveal shape of data, or
distribution of data.
31
Cumulative Distributions
Cumulative Frequency Distribution
is a tabular summary which:
Shows the cumulative number of
observations (count or frequency) in
each of the categories or classes.
Count for "less than or equal to" upper
limit of class. The last class will be
equal to the count of all items in the
data set
Cumulative Percent Frequency
Distribution is a tabular summary
which:
Shows the percent cumulative
frequency in each of the categories or
classes. Calculation is based on
Running Total divided by count of all
items in the data set. The last class will
be equal to 100%
With any particular class you can say
something like: "xx% of the
occurrences are less than or equal to
the upper limit of the class"
Example of
Frequency
Distribution
& Cumulative
Percent
Frequency
Distribution
32
Excel Methods to Create Frequency Distribution
COUNTIFS Excel function with two criteria
Count between the lower and upper limit
Because you have control over the comparative operators, you can create any type of Upper and Lower Limit.
This is different than with the PivotTable Grouping feature and the FREQUENCY Array Function.
PivotTables and the Grouping feature
When Grouping in a PivotTable:
Integer data yields unambiguous labels
Decimal data yields ambiguous labels
Remember: when you are counting between an upper and lower limit, the Upper Limit is NOT included and the
Lower Limit IS included; unlike formulas we do not have control over how the upper and lower limits work when
grouping.
FREQUENCY Array Function:
Next slide has full details about this function
One note here: For FRQUENCY Array Formula when you are counting between an upper and lower limit,
the Upper Limit IS included and the Lower Limit is NOT included; unlike formulas we do not have control
over how the upper and lower limits work when grouping.
FREQUENCY Array Function and Data Analysis Tools, Histogram yield the same answer.
Data Analysis Tools, Histogram
You must add this feature in: File tab, Options, Add-ins, Manage: Excel Ass-ins, Click Go, Check box for
Analysis Toolpak, Click OK
This feature will create the Frequency Table (just like the FREQUENCY Array Function), a Histogram and a
Cumulative Distribution. If Gap Width in Chart is not zero, you must change it!!
FREQUENCY Array Function and Data Analysis Tools, Histogram yield the same answer.
33
FREQUENCY Array Function
FREQUENCY counts how many numbers are in each category.
The bins_array argument contains the upper values for the categoriesnumbers only.
The data_array argument contains the values to countnumbers only.
Keep in mind the following about categories:
Categories are automatically created. There is no visual indication of how the categories are
organized.
The first category counts all the values less than or equal to the first upper limit.
The middle categories count between a lower limit and an upper limit. The lower limit is not
included in the category. The upper limit is included in the category.
The last category catches all the values that are greater than the last upper limit.
There is always one more category than there are bins.
Because this is an array function, you must select the destination range before creating the
formula and enter the formula with Ctrl+Shift+Enter.
If you have n values in the bins_array argument, the selected destination range should contain
n+ + 1 cells.
34
Sales Data
35
Frequency Distributions and
Histograms for Quantitative Data
PivotTable:
Frequency
256
249
246
333
934
975
318
337
2094
2025
4174
1962
341
213
211
226
966
984
1813
1773
1579
4062
2015 Transactional Frequency by Hour
234
219
234
1579
1813
1773
984
226
211
213
341
966
4062
4174
1962
337
318
975
2025
2094
934
333
249
246
256
26524
219
Time (Lower Limit)
12 AM
1 AM
2 AM
3 AM
4 AM
5 AM
6 AM
7 AM
8 AM
9 AM
10 AM
11 AM
12 PM
1 PM
2 PM
3 PM
4 PM
5 PM
6 PM
7 PM
8 PM
9 PM
10 PM
11 PM
Grand Total
Histogram:
12 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11
AM AM AM AM AM AM AM AM AM AM AM AM PM PM PM PM PM PM PM PM PM PM PM PM
36
% Cumulative
Frequency Frequency
16,431
61.95%
71.64%
2,570
77.30%
1,501
2,021
84.92%
1,432
90.31%
935
93.84%
707
96.51%
405
98.03%
253
98.99%
89
99.32%
56
99.53%
54
99.74%
49
99.92%
18
99.99%
3
100.00%
26,524
Histogram & % Cumulative Line Chart:
Frequency
% Cumulative Frequency
120.00%
100.00%
80.00%
Revenue (Upper NOT Included)
20.00%
3
18
49
54
56
89
253
405
707
40.00%
935
1,432
2,021
1,501
60.00%
2,570
When Grouping Decimal Quantitative
Data in a PivotTable to create an
upper and lower limit, Upper Limit is
not included.!!!
When using the FREQUENCY Array
Function or the Data Analysis
Revenue (Upper
NOT Included)
0-200
200-400
400-600
600-800
800-1000
1000-1200
1200-1400
1400-1600
1600-1800
1800-2000
2000-2200
2200-2400
2400-2600
2600-2800
2800-3000
Grand Total
16,431
Frequency
Distribution
and Histogram
for Revenue
with
PivotTable:
Frequency & % Frequency Distribution with PivotTable:
0.00%
37
Frequency
Distribution and
Histogram for
Revenue with
FREQUENCY
Array Function:
When using the FREQUENCY Array
Function or the Data Analysis
Histogram feature, for the upper and
lower limit classes/categories, the
Upper Limit IS Included
38
Check to see why the two methods yield different
answers
When Grouping Decimal Quantitative Data in a PivotTable to create an
upper and lower limit, Upper Limit is not included.!!!
When using the FREQUENCY Array Function or the Data Analysis
Histogram feature, for the upper and lower limit classes/categories, the
Upper Limit IS Included
39
Frequency
Distribution and
Histogram for
Revenue with Data
Analysis Histogram
Feature:
When using the FREQUENCY Array
Function or the Data Analysis
Histogram feature, for the upper and
lower limit classes/categories, the
Upper Limit IS Included
40
Why PivotTables Rule: Because you can add Criteria
through a Slicer and Drill Down in the Data
41
Skew of Histograms
What does the distribution
of Histogram Columns look
like?
Skew Left or Negative
means a few short
Histogram Columns are on
the low end (pull mean
down)
Skew Right or Positive
means a few short
Histogram Columns are on
the high end (pull mean up)
No Skew means the
distribution is bell shaped or
nearly bell shaped
Perfect Bell Shape Mean
= Median = Mode
42
Measures of Location
Measures of Location:
Average = Typical Value = Measure of central location
"Typical Values" calculated so that we have one value that can
represent all the data points.
Examples:
Mean
Median
Mode
Geometric Mean
43
Mean, Median, Mode
Mean
Arithmetic Mean: Add them up and divide by the count
Good for quantitative data when there are not extreme values - extreme values can make the mean
look too big or too small (Median more representative of a typical value in that case)
Use AVERAGE function
Median
Sort, then take the one in the middle. If count odd, take one in middle, if even, average middle two.
Marks the point in the sorted list (an actual number) where 50% of the numbers are above and 50%
of the numbers are below
Good for quantitative data when there are extreme values (like house prices and salaries)
Use MEDIAN function
Mode
One that occurs most frequently (can be bimodal, multimodal)
Good for Categorical Data (Nominal and Ordinal)
Use MODE.SNGL for quantitative data and COUNTIF or PivotTable for Categorical or quantitative data.
MODE.SNGL will only show 1 mode if the data set is bi-modal or multi-modal. MODE.MULT can be
used for multiple modes.
44
Mean
45
Geometric Mean
Use Geometric Mean when you have "Growth Rates" or "Rates of Change and you want:
True "Average" Compounding Rate per Period
You have a Begin Value and you want to calculate the End Value after a number of periods, like
in Finance
Arithmetic Mean overestimates
Arithmetic Mean is for additive processes; Geometric Mean is for multiplicative processes
Arithmetic mean is used in some situations like for Standard Deviation, Correlation, and other
calculations that do not require True "Average" Compounding Rate per Period.
"Growth Rates" or "Rates of Change = % change from one period to the next
Growth Factor = Growth Rate + 1
Growth Factor is value that you use when calculating End Value from Begin Value.
Like: BeginValue*(1+GeometricMean)^NumberOfPeriods = EndValue
In Finance: PV*(1+PeriodRate)^NumberOfPeriods = FV
Growth Factor ALWAYS >= 0
Growth Factor > 1 means positive growth
Growth Factor < 1 means negative growth
46
Geometric Mean
Geometric Mean = Average Compounding Rate per Period
Geometric Mean Formula 1:
Use when you are given all the "Growth Rates" or "Rates of Change:
Formulas:
GEOMEAN(RangeOfGrowthRates+1) = Growth Factor
GEOMEAN(RangeOfGrowthRates+1)-1 = Geometric Mean
Geometric Mean Formula 2:
Use when you are given the Begin Value, End Value and the number of periods
Formulas:
(EndValue/BegValue)^(1/NumberOfPeriods)-1 1 = Geometric Mean
RRI(NumberOfPeriods,BegValue, EndValue) or RRI(n,PV,FV) 1 = Geometric Mean
47
Geometric Mean = Average Compounding Rate:
48
Variability
49
Variability
Synonyms for Variability:
Variability
Dispersion
Spread In Data
How Spread Out Is Data?
Are the Data Points Clustered Around the
Mean?
Does the Mean Fairly Represent the Data
Points?
Measures of Variability
Range
Variance
Standard Deviation
Coefficient of Variation
Z-score
50
Range and Interquartile Range
Range
Max - Min
Simple to calculate. Sensitive to extreme values
Interquartile Range
Quartile 3 - Quartile 1
The range for the middle 50% of the data. It overcomes
the sensitivity to extreme values
51
Deviation: X1 Xbar = Particular Value - Xbar
How far is the Particular Value
from the Mean (Average)?
For any data set, the sum of
the deviations is always
zero!!!!
This is why mathematically, we
either square (Variance or
Standard Deviation) or take the
absolute value (Mean Absolute
Value) for calculating our
measures of variation.
52
Formulas for Variance and Standard Deviation:
53
Proof that
two
formulas
for
Sample
Standard
Deviation
are equal
54
Variance
A Numerical Measure that says how much variability there is in the data points
Variance uses all the data points, not just some like Range and Interquartile
Range
Variance has squared units, which makes interpreting it difficult.
Although Variance has squared units, it has many uses in statistics, especially
with Regression Analysis (chapter 4) and Hypothesis Testing
Standard Deviation undoes the squared units and is thus easier to interpret.
Use VAR.P function for population data
Use VAR.S function for sample data.
55
Standard Deviation = SD
Standard Deviation uses all the data points, not just some like Range and Interquartile Range
Standard Deviation does not have squared units (like Variance) and is thus easier to interpret
Standard deviation has the same units as the data!!
The sample standard deviation is a point estimator of the population standard deviation
=
^2
( )
1
Interpretation of Standard Deviation:
A Numerical Measure that says how much variability there is in the data points
Standard Deviation Is Like An Average Of The Deviations
Standard Deviation tells us how fairly the mean represents its data points
Standard Deviation tells us how clustered the data points are around the mean
For financial assets standard deviation is a measure of risk or fluctuation in asset value
Use STDEV.P function for population data
Use STDEV.S for sample data.
56
Standard Deviation: How Fairly Does Mean Represent Its Data Points?
57
Coefficient of Variation
Formula = SD/Mean
Coefficient of Variation converts the SD to SD per unit of Mean
For every one unit of mean, what is the SD?
If you add Percent Number Formatting, it shows SD as a percentage of Mean
What percentage is SD in relation to the Mean?
Use Coefficient Of Variation to compare:
Data in different units.
Data in the same units, but the means are far apart.
58
Z-score: Number of Standard Deviations
Formula for z-score = Deviation/SD = (Xi - Xbar)/SD
Excel Function: STANDARDIZE(X,Mean,SD)
z Score = How Many Standard Deviation is a particular value ways from the mean?
z < 0, value below mean
z > 0, value above mean
z = 0, value is equal to mean
Z score measures the relative location of a particular x in the data set (as compared to the
mean), in units of standard deviation.
Relative Location in terms of "Number of Standard Deviations
z Score = Standardized Value
Observations in 2 different data sets that have the same z-score are said to have the
same relative location or the same number of standard deviations away from the
mean.
59
Uses of z-score:
Used in the Standard Normal Bell Curve or Empirical Rule
One way to measure Outliers (extreme values) is to consider any value
that has z-score greater than 3 to be an Outlier
60
Example of Bell Shaped Normal Distribution:
61
Empirical Rule
62
Example of Empirical Rule:
63
Identifying Outliers: 3 Z Rule
One way to measure
Outliers (extreme values) is
to consider any value that
has z-score greater than 3
to be an Outlier.
In Sep. and Oct. of 1981 the
10-year Government Bond
Yield was above 15%.
This was a value more than 3
standard deviations away from
the mean and is therefore
considered an outlier.
64
Measures for Location: Percentiles
Percentiles:
Percentile: Create Marker in sorted
data set that divides set into 2
Parts with about P% Below the
Marker and 1-P% Above
Excel Functions:
PERCENTILE.EXC
.EXC = Exclusive: Excludes 0% & 100% =
Min and Max values cannot be found -0% and 100% are not allowed
PERCENTILE.INC
.INC = Inclusive: Includes 0% & 100% =
Min and Max values CAN be found 0%
= Min & 100% = Max
For Large Data Sets the two
functions calculate similar answers
65
Measures for Location: Quartiles
Quartiles:
Create Marker in sorted data set
that divides set into four equal
parts:
Each part contains approximately 25%
of the observations.
The three Markers are referred to as
quartiles:
1 = first quartile, or 25th percentile
2 = second quartile, or 50th percentile
(also the median)
3 = third quartile, or 75th percentile
Excel Functions:
QUARTILE.EXC
.EXC = Exclusive: Min and Max values
cannot be found -- can only enter 1, 2,
3 in second argument
QUARTILE.INC
.INC = Inclusive: 0 = Min, 1 = Quartile 1,
2 = Quartile 2, 3 = Quartile 3, 4 = Max
For Large Data Sets the two
functions calculate similar answers
66
Percentile & Quartile Are Markers That Divide A
Set Of Sorted Numbers Into Two Sets
67
Box Plots by hand
68
Box Plots
No easy way to create Box Plots in Excel
Reference video for how to do it in Excel:
Excel 2010 Statistics #28: Box & Whisker Plot: Stacked Bar with Mean Point Plotted and
Outlier Lines
https://www.youtube.com/watch?v=bgaN446TQXo
XL Minor Add-in makes it easy to create single and multiple variable data sets
Must have a Proper Data Set.
69
Box Plots in XL Minor
70
Box Plots in Excel 2016:
71
Dont Forget:
Q: Why MUST we have a Proper Data Set?
A: So we can ask questions of each Field (Variable)!!!!
Like in a PivotTable when we drag a field like Sales Rep
to ask the question: What is the total sales for each Sales
Rep?
Q: Why do Histograms have No Gap Width?
A: Continuous Quantitative Data that is grouped has
no gaps between categories - so columns must touch
(have no gap width) to visually indicate that no
numbers can fit between the categories or columns.
72