Using R For Financial Data

This document provides an introduction to using R for accounting and finance applications. It outlines learning outcomes which include understanding and using statistical concepts and techniques, identifying accounting and finance applications, using R to analyze accounting and finance data, and interpreting statistical results within an accounting and finance context. The document describes assessments, preparing students for success, and communicating learning outcomes. It also provides an outline of content to be covered, including introducing R, loading and reviewing different types of accounting and finance data in R, and performing basic statistical analysis.

Accounting & Finance Applications

for
Introduction to Statistics with R

Theophanis C. Stratopoulos, PhD

&

Duane Kennedy, PhD

March 20, 2018

University of Waterloo - 2018

Electronic copy available at: https://ssrn.com/abstract=3184054




Preface

Learning statistics using R to analyze real accounting and finance data is
part of the School of Accounting and Finance (SAF) strategy of building a
foundation in business analytics and technology for all AFM students. The
objective of this foundation is to provide students with sufficient knowledge
and skills in analytics and technology, by the end of their second year, so
that they can do the following:
1. Convince recruiters that they are capable of handling entry-level
analytics jobs. This means that, given a well-defined business problem and
a reasonably clean data set, they should be able to identify and extract
appropriate data, clean and transform data to create new variables,
perform basic descriptive analytics to understand relevant patterns,
perform basic predictive analytics and interpret results in the context
of the problem, and communicate these findings.
2. Be prepared to pursue an optional specialization in business analytics
and technology, which could include the following types of courses:
domain-specific analytics areas such as audit, tax, finance, or performance
management analytics; more advanced analytics topics focused
on interpretation of the analysis, communication of business implications,
and use of data visualization tools; additional courses in statistics
(e.g., focus on predictive analytics); and/or additional courses in
computer science (e.g., machine learning or big data analytics using
Hadoop/MapReduce).

Intended Learning Outcomes


To build a foundation in business analytics and technology for
all AFM students, Statistics with R aims to deliver the following learning
outcomes. By the end of the term, students should be able to:

1. Understand and use statistical concepts and techniques, and interpret
statistical results.


2. Identify accounting and finance (A&F) specific applications of statistical
concepts and techniques.
3. Identify practical sources of A&F data (e.g., stock market data, finan-
cial statement data, actual company specific sales and inventory data)
and how data is structured.
4. Use R to load/understand data, generate new variables (transform
data), and perform appropriate statistical analysis on A&F data.
5. Interpret statistical results leveraging A&F knowledge.

Assessment
1. Students have (need to decide the time) hours to work on an online quiz
(available on Learn), which is structured like a homework assignment.
2. The typical quiz is a combination of multiple choice and numeric questions.
(a) The multiple choice questions focus on topics such as:
i. Statistical concepts and techniques.
ii. A&F applications of statistical concepts and techniques.
iii. Practical sources of A&F data and properties/structure of
these data sets.
iv. Understanding and transforming A&F data.
v. Interpreting statistical results leveraging A&F knowledge.
vi. Commonly used R commands.
(b) To answer the numeric questions, students will have to use R to
perform statistical analysis on a given A&F data set. The numeric
questions focus on topics such as:
i. Loading, understanding, and transforming A&F data.
ii. Performing statistical analysis.
iii. Interpreting statistical results.
3. The weekly quizzes are open book/open notes.
4. Students can take each quiz twice and the highest score counts.
5. Quizzes are available from Friday afternoon until Monday morning.
Students can take the quizzes at any time during this window.
6. The first quiz will be in the second week of classes. There is no quiz
in the first week.
7. Students can drop one (lowest) quiz score. There are no make-ups for
any reason.


Prepare Students for Success


1. Develop teaching notes (R primer) with step-by-step instructions on
how to use R to analyze real accounting and finance data.
2. Provide students with an annotated version of the R script for each
chapter of the R primer, which they can use to replicate every step shown
in the corresponding chapter of the R primer.
3. Provide students with videos (screen captures) that they can use for
specific tasks that require memorizing/replicating steps needed to set
up software or extract data from a specific database.
4. Generate an index of key terms (R commands) that students can use
as a reference point for finding relevant examples.
5. Prepare practice problems, which are available through Learn. The
practice problems aim to help students prepare for graded assignments
(understand the structure and content of a graded assignment), help them
gauge their understanding of concepts and material (they can take the
practice problems as many times as they want and receive instant feedback),
and do this without the risk of losing points (practice problems are
not graded). The practice problems are based on a test library; thus,
when a student retakes a practice problem, it randomly generates a
different version each time.
6. Run a discussion board on Learn, where students can post questions
related to the content of R assignments, and issues related to using R
(e.g., reading error messages).
7. Assign SAF fellows/tutors to offer help at the computer lab.

Communicate and Reinforce Learning Outcomes


1. Include the intended learning outcomes in the course syllabus and in
the opening page on Learn.
2. Start each chapter of the R primer with the chapter specific version of
intended learning outcomes.
3. Start each class with a slide summarizing the chapter specific version
of intended learning outcomes.
4. End each class with a slide summarizing the chapter specific version
of intended learning outcomes, and how statistics with R helps them
prepare to succeed on the chapter specific learning outcomes.

Contents

1 Introduction to R
1.1 Introduction
1.2 R Installation
1.3 RStudio
1.4 R packages
1.5 R Basics
1.5.1 Operations
1.5.2 Defining Variables
1.5.3 How to Clean the Content of Console
1.5.4 How to Clean the Content of Environment
1.6 Working with R files
1.7 Using Functions
1.7.1 Generating basic descriptive statistics

2 Load and Review Data
2.1 Management Accounting Data
2.1.1 Prepare R Environment (RStudio)
2.1.2 Load and Review Data
2.1.3 Creating Subsets
2.1.4 Ordering Data
2.1.5 Basic Descriptive Statistics
2.1.6 Practice Problems
2.2 Stock Market Data
2.2.1 Plotting Stock Market Data
2.2.2 SP500 Data
2.2.3 Combine Two Data Sets - cbind
2.2.4 Renaming Variables
2.2.5 Printing Graphs Side-by-Side
2.2.6 Subsetting Stock Market Data Based on Time
2.3 Financial Accounting Data
2.3.1 Financial Ratios: Profitability


2.3.2 Load and Review Compustat Data
2.3.3 Create Profitability Ratios
2.3.4 Basic Descriptive Statistics
2.3.5 Graph Apple Performance over Time

3 Summary Statistics
3.1 Management Accounting Data
3.1.1 Load and Review Sales by Store Data
3.1.2 Summary Statistics
3.1.3 Detecting Outliers
3.1.4 Boxplot
3.1.5 Subset of Outliers
3.1.6 Practice Problems: Sales by Store
3.1.7 Frequency Tables for Categorical Data
3.1.8 Practice Problems
3.2 Stock Market Data
3.2.1 Stock Returns
3.2.2 Summary Statistics: Stock Returns
3.2.3 Detecting Outliers: Stock Returns
3.2.4 Identifying Negative Outliers in Stock Returns
3.2.5 Categorical Variable for Stock Returns
3.2.6 Frequency Tables for Categorical Stock Returns
3.2.7 Practice Problems: Do Stock Returns Have Memory?
3.3 Financial Accounting Data
3.3.1 Summary Statistics & Outliers: Profits
3.3.2 Summary Statistics & Outliers: Financial Ratios
3.3.3 Practice Problems: Financial Ratios
3.4 Solutions to Selected Practice Problems
3.4.1 Sales by Store (p. 44)
3.4.2 Do Stock Returns Have Memory? (p. 54)
3.4.3 Financial Ratios (p. 59)

4 Hypothesis Testing
4.1 Management Accounting Data
4.1.1 Load and Review Data
4.1.2 Null and Alternative Hypothesis
4.1.3 Create and Review Random Samples
Random Sample: Liquor
Random Sample: Wine
4.1.4 Practice Problems: Random Samples
4.1.5 Hypothesis Testing (two sided t-test)


4.1.6 Practice Problems: Bibitor Sales
4.2 Stock Market Data
4.2.1 Load and Review Data
4.2.2 Null and Alternative Hypothesis
4.2.3 Working with Dates: Subset Bear Market
Bear Market: Random Sample
Bear Market: Hypothesis Testing
4.2.4 Subset Bull Market
Bull Market: Random Sample
Bull Market: Hypothesis Testing
4.2.5 Practice Problems: Stock Returns
4.3 Financial Accounting Data
4.3.1 Load and Review Data
4.3.2 Null and Alternative Hypothesis
4.3.3 Create Sub-set for Hypothesis Testing
Stage 1 (dt_2010_Q4)
Stage 2 (dt_2012_oldQ4)
Stage 3 (dt_2012_minusOldQ4)
Stage 4 (dt_2012_rs)
4.3.4 Hypothesis Testing
4.3.5 Practice Problems: Performance Sustainability
4.4 Visual Representation: cbind & rbind
4.5 Solutions to Selected Practice Problems
4.5.1 Hypothesis Testing for Bibitor Sales (4.1.6)
Compare Prices (p. 73)
Compare Units Sold (p. 73)
Compare Revenues (p. 73)
Use rbind to combine random samples and compare prices (p. 73)
Use describe() to view detailed descriptive statistics (p. 74)
4.5.2 Hypothesis Testing for Stock Returns (4.2.5)
Stock Returns in Bear Market (p. 81)
Stock Returns in Bull Market (p. 82)
4.5.3 Hypothesis Testing: Performance Sustainability (4.3.5)
Performance Sustainability - 2014 (p. 87)
Performance Sustainability - 2016 (p. 87)


I Appendix
A R Script
A.1 R Script: Introduction to R
A.2 R Script: Load and Review Data
A.2.1 Section 2.1: Management Accounting Data
A.2.2 Section 2.2: Stock Market Data
A.2.3 Section 2.3: Financial Accounting Data
A.3 R Script: Summary Statistics
A.3.1 Section 3.1: Management Accounting Data
A.3.2 Section 3.2: Stock Market Data
A.3.3 Section 3.3: Financial Accounting Data
A.4 R Script: Hypothesis Testing
A.4.1 Section 4.1: Management Accounting Data
A.4.2 Section 4.2: Stock Market Data
A.4.3 Section 4.3: Financial Accounting Data

B Compustat
B.1 Compustat
B.1.1 Compustat: Single Firm
B.1.2 Compustat: Entire Industry

Alphabetical Index



Chapter 1

Introduction to R

Learning Objectives
By the end of this chapter students should have achieved the following learn-
ing objectives (know how to do the following):
1. Learn how to install R and RStudio.
2. Become familiar with the RStudio interface (environment) and the role
of packages in R.
3. Perform some basic tasks in R. For example, use R as a calculator,
define (create) variables, work with R files, and generate descriptive
statistics.

1.1 Introduction
R is a very powerful and versatile software package that is used in several
required and elective courses in the School of Accounting and Finance (SAF).
In very general terms, R can be used to extract, transform, and analyze struc-
tured (e.g., financial statements based data or sales data) and unstructured
data (e.g., Tweets, emails, blogs).
While the list of R applications is very extensive, here are some examples
of what you will learn as an SAF student: run SQL queries in order
to extract and transform data; organize data (e.g., pivot tables); run simple
statistical analyses, such as generating descriptive statistics, identifying
outliers, and performing hypothesis tests; run regression analysis and
classification analysis (logistic regression); perform sentiment analysis;
and leverage neural network applications to make predictions.
Chances are that you have heard from your fellow students that the
learning curve of R is relatively steep, and you may have heard that some
of the things that you can do with R you can also do with Excel. You may
wonder, why bother with R? There are several reasons why it is worth
your time to make this investment.

• First, the spectrum of applications that you can work on is very large.
It ranges from basic statistical analysis to very advanced data mining
techniques and text mining for sentiment analysis.
• Second, once you have learned R, it is relatively easy to work
with other packages (e.g., SAS, STATA, SPSS, MATLAB, Octave).
• Third, R and Python are among the most powerful and most widely used
tools for data analytics.
• Fourth, R is open-source software, which means that it is free.

Figure 1.1¹ shows the trade-off between difficulty and complexity for
Excel and R.

Figure 1.1: Excel vs R Trade Off

¹ The figure was prepared by Gordon Shotwell and is available from the following
URL: http://blog.yhat.com/posts/R-for-excel-users.html.


As you can see, the difficulty (learning curve) of Excel is initially very
low. The red Excel line grows very slowly along the horizontal axis. As
complexity (i.e., what you can do with Excel) increases, however, the line
becomes almost vertical, which means there is a limit. Alternatively, you
can think of this as saying that it becomes much more difficult to do
advanced analysis with Excel.
Looking at the R line, we can see that it rises very quickly, which
means that the learning curve is steeper in the beginning. This makes sense
because there is a lot of new terminology to cover. However, once you learn
to work with R, it is much easier to go from basic statistical analysis to
advanced data mining techniques.

1.2 R Installation
R may already be installed on your computer. Check Applications on your Mac
or All Programs on your Windows machine. If it is not pre-installed, you
can download and install it from the appropriate URL (shown below). For the
basic installation, you can simply follow the directions on your screen and
accept the default settings.

• For Macs from: http://cran.r-project.org/bin/macosx/

• For Windows from: http://cran.r-project.org/bin/windows/base/

NB: Install R on your machine before RStudio!

1.3 RStudio
While we can run our analysis from R, the interface of RStudio is more
user friendly. RStudio is an integrated development environment (IDE) that
will let us see all components of our R project in an integrated interface.
Download RStudio from http://www.rstudio.com/. Please remember that
we install R first, and then RStudio.
The interface of RStudio (shown in Figure 1.2) is divided into four panels
(areas):²

² If the interface looks different from the one shown in Figure 1.2, we can change it by
selecting Preferences on a Mac or Tools > Global Options in Windows and then selecting Pane
Layout. If we want our interface to match Figure 1.2, we can use the drop-down menu in
each pane and select Source for the upper left pane, Console for the upper right, etc.

Figure 1.2: RStudio Interface

1. Source: this is where we write script that we want to save. This is
helpful when we work on projects that we may want to revisit at a later
point. We can create a new R script file by selecting File > New File > R
Script. The new file will show as Untitled1 in the source pane.

2. The Console plays a dual role. This is where we execute script
interactively and where we see the output/results of our script. Executing
interactively means that when we type an instruction in the console and hit
return, the line is immediately executed and the output is shown in the console.
If we execute a line or an entire script in the source area, the output will
also be shown in the console.

3. The Files, Plots, Packages, Help, Viewer area serves multiple needs.
For example, if our script contains instructions for the creation of a graph,
the output will show up in Plots. If we need to install specific packages
that would allow us to execute some sophisticated functions, we do this from
the Packages area. We can view the files in our working directory or access
the built-in Help functionality.

4. The Environment area shows the data set(s) currently open and the
variables in each one of these data sets. The History keeps track of every
single line of script that we have entered through the console.


1.4 R packages
R has a collection of ready-to-use programs, called packages. For example,
we can make some very powerful graphs with the package ggplot2, we can
generate detailed descriptive statistics with the package psych, run SQL
queries from within R using sqldf, and perform sentiment analysis based on
Twitter data using twitteR and stringr.
There are a couple of ways that we can install these packages. First,
we can type the command install.packages() in the console. For
example, the following line would install the psych package:

install.packages("psych")

Alternatively, we can click Packages > Install and select the package from
the new window/drop-down menu (see Figure 1.3). In the new window, make
sure that the Install Dependencies option is selected.

Figure 1.3: Install Packages in R

Once the packages have been installed, we can load them (i.e., include
them in our script) using the library() function. For example, the following
line would load the psych package:

library(psych)


• We install a package only once. However, we need to load the appropriate
library every time we need to leverage the functionality of the
specific package in our script.
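
The install-once, load-every-time rule can be sketched as a small helper. This is only an illustration: the function name load_or_install is our own and is not part of the primer or of any package. It checks whether a package is already installed, installs it only if it is missing, and then loads it.

```r
# Hypothetical helper (the name load_or_install is an assumption made for
# this sketch): install a package only if it is missing, then load it.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)              # runs only when the package is missing
  }
  library(pkg, character.only = TRUE)  # needed in every new R session
  invisible(TRUE)
}

# "stats" ships with R, so this call downloads nothing.
load_or_install("stats")

# For the primer's example we would call: load_or_install("psych")
```

The argument character.only = TRUE tells library() that pkg holds the package name as text rather than being the name itself.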

1.5 R Basics
1.5.1 Operations
In its simplest form, we can use R as a calculator to perform basic operations.
For example, in the console area we can write (See also Figure 1.4):

> 50+20

It will return

[1] 70

> 80-30

It will return

[1] 50

> 20*3

It will return

[1] 60

> 54/9

It will return

[1] 6

We can calculate powers as follows:

> 2^3

[1] 8

or

> 2**3

[1] 8

Figure 1.4: Console: Basic Operations

We can combine operations. We need to be careful to use parentheses in
the proper places.

> (2^3)+(80-50)

[1] 38
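
To see why the placement of parentheses matters, the following short sketch compares expressions that look similar but are evaluated in a different order. The variable names are ours and are used only for illustration.

```r
# Multiplication is evaluated before addition unless parentheses say otherwise.
no_parens   <- 2 + 3 * 4     # 2 + 12
with_parens <- (2 + 3) * 4   # 5 * 4

# Exponentiation is evaluated before the minus sign in front of a number.
neg_power_a <- -2^2          # read as -(2^2)
neg_power_b <- (-2)^2        # the negative number itself is squared

print(c(no_parens, with_parens, neg_power_a, neg_power_b))
# [1] 14 20 -4  4
```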

We can ask R to create a sequence of values. For example, if we want to get
the numbers 1 through 10, we can do this as follows:

> 1:10

[1] 1 2 3 4 5 6 7 8 9 10

1.5.2 Defining Variables


To define a new variable (object) in R, we use the assignment operator <-,
which is typed as a less-than sign followed by a minus sign. For example,
with the following commands, we assign the value 5 to variable x and the
value 9 to variable y, and then we perform operations on the two of them:


> x<- 5
> y<- 9
> x*y

[1] 45

> x-y

[1] -4

> x+y

[1] 14

We can combine existing values to create a new variable/data set using
the function c(). For example, we can create a new variable that combines
the values of x and y as follows:

> z <- c(x,y)


> z

As you can see z takes two values.

[1] 5 9

We continue adding more variables and more values.

> w <- c(7,z,7,x,7,y)


> w

The new variable w, takes the following values:

[1] 7 5 9 7 5 7 9
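
As a brief sketch of what we can do with a variable that holds several values, the lines below rebuild w from the example above and then count, pick out, and transform its values. Nothing here goes beyond base R.

```r
x <- 5
y <- 9
z <- c(x, y)
w <- c(7, z, 7, x, 7, y)   # c() flattens z into its individual values

length(w)      # number of values in w: 7
w[2]           # the second value: 5
sum(w == 7)    # how many values are equal to 7: 3
w * 2          # arithmetic is applied to every value in turn
# [1] 14 10 18 14 10 14 18
```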

NB: When creating new objects (e.g., variables, data sets) in R, we need
to keep two things in mind:

1. R is case sensitive. This means that the lower case variable x is
different from the upper case variable X.
2. When we specify a new variable in R, the name cannot start with a number.
If the name of the new variable that we want to create starts with
a number (e.g., 123x <- 5), R will return the following error message:
Error: unexpected symbol in "123x"
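
Both naming rules can be checked with a few lines of base R. This is a minimal sketch: the values are arbitrary, and make.names() is a base R function that shows how R itself would repair an invalid name.

```r
# Rule 1: x and X are two different variables.
x <- 5
X <- 10
x == X                 # FALSE

# Rule 2: a name may contain digits, but it cannot start with one.
x123 <- 5              # valid: the digits come after a letter
make.names("123x")     # R's own repair of the invalid name: "X123x"
```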


1.5.3 How to Clean the Content of Console


We can remove all content shown in the console area by pressing Ctrl+L.

1.5.4 How to Clean the Content of Environment


Notice that in the problems that we have been working on, we keep using
x as the name of our variable. It is good practice, after we are done with
a problem and before we move to the next one, to clear the data sets and
variables from the R environment. We do this to avoid accidentally using
a variable left over from an earlier problem.
Cleaning the R environment is also important when working with very
large data sets. If you have multiple large data sets from different
projects open, your computer may slow down. You may receive a message asking
you to close some of the applications that are running.
We can clean the R environment by simply clicking the button showing
the picture of the broom (see Figure 1.5).

Figure 1.5: How to Clean the R Environment
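
The environment can also be cleaned from the console or from a script with the base R function rm(). The sketch below uses made-up variable names and removes only the objects it creates; the last comment shows the command that clears everything, which is the typed equivalent of the broom button.

```r
# Two throwaway variables (the names are made up for this sketch).
old_sales <- c(100, 200)
old_costs <- c(60, 150)

rm(old_sales)                        # remove a single object
exists("old_sales", inherits = FALSE)  # FALSE: it is gone

rm(list = c("old_costs"))            # remove objects listed by name
ls()                                 # lists whatever is still in the environment

# To remove everything at once, as the broom button does, we would run:
# rm(list = ls())
```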

1.6 Working with R files


The console area is good for working interactively with R. However,
sometimes we may want to start working on a problem, save our work, and come
back and finish it later. To do this, we go into the Source area and
select Create a new R script (see Figure 1.6), select File > New File >
R Script, or simply use the shortcut Ctrl+Shift+N. In the Source area, we
can start typing our script line by line.

Figure 1.6: Create R Script

Please notice that when we hit Enter in the Source area, no results are
shown; the cursor simply moves to the next line. If we want to execute one
line at a time, we can do this by clicking on the Run icon or using the
shortcut Ctrl+Enter (see Figure 1.6).
As shown in Figure 1.7, if we want to run all lines in our R script, we
can do this by clicking on the Source icon and selecting Source with Echo
from the drop-down menu. Once we are done, we can select Save or Save As
to save the file in our working directory on our computer.
When we create R files, it is a good idea to add comments that either
explain what we are trying to do or simply provide a reminder related to
the function that we are using. In R, we can add a comment by using the
# sign. Everything that follows the # is just a comment and does not affect
the functionality of our R script. Please keep in mind that we can add a
comment either at the beginning of a new line or after a command.
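
A short script with both kinds of comments might look as follows. The numbers and variable names are invented for the illustration.

```r
# A comment on its own line: compute revenue from price and units sold.
price <- 20                # a comment after a command: unit price (made up)
units <- 3                 # units sold (made up)
revenue <- price * units

# A line that starts with # is not executed, so the next line does nothing:
# revenue <- revenue * 100

revenue
# [1] 60
```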

1.7 Using Functions


R has built-in functions that automate tasks. For example, we have already seen
that 1:10 will generate a sequence of numbers from 1 to 10. An alternative
approach would be to use one of the built-in functions in R. In this case,
we need to use the function seq(), which stands for sequence. Within the
function seq(), we need to specify the beginning (from), end (to), as well
as the increment (by). If we don't specify the increment, the default value is
1. For example, if we want numbers from 1.0 to 2.0 in increments of 0.2, we
write this as follows:


Figure 1.7: Writing R Script in Source Area

> seq(1, 2, 0.2)

[1] 1.0 1.2 1.4 1.6 1.8 2.0
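
The same call can be written with the argument names spelled out, which makes the script easier to read. The sketch below also shows the default increment and length.out, a further argument of seq() that asks for a given number of values instead of a given increment. The variable names are ours.

```r
# Same sequence as above, with the arguments named explicitly.
by_step <- seq(from = 1, to = 2, by = 0.2)

# Without an increment, seq() counts in steps of 1, like 1:10.
default_step <- seq(1, 10)
all(default_step == 1:10)             # TRUE

# length.out requests a number of values rather than an increment.
by_count <- seq(from = 1, to = 2, length.out = 6)
isTRUE(all.equal(by_step, by_count))  # TRUE: both give 1.0 1.2 ... 2.0
```

The comparison uses all.equal() rather than == because decimal increments such as 0.2 are stored with tiny rounding differences.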

If we want to get help on any R function, we type the question mark
(?) followed by the name of the function. For example, a search for seq
would be as follows: ?seq

1.7.1 Generating basic descriptive statistics


In the following example, we are going to create a new variable based on a
sequence of even numbers from 2 to 20, and use the length() function to find
the number of observations, sum() function to find the summation of these
numbers, min() for the minimum, max() for the maximum, and mean() to
generate the average of these values.

> x <- seq(2, 20, 2)
> x

[1] 2 4 6 8 10 12 14 16 18 20

> length(x)

[1] 10

> sum(x)

[1] 110


> min(x)

[1] 2

> max(x)

[1] 20

> mean(x)

[1] 11
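
As a quick check on how these functions relate to each other, the sketch below recomputes the mean by hand and adds three other base R functions, range(), median(), and summary(), which are not used in the example above.

```r
x <- seq(2, 20, 2)

# The mean is the sum of the values divided by the number of values.
sum(x) / length(x)   # 11, the same as mean(x)

range(x)             # minimum and maximum together: 2 20
median(x)            # middle of the ordered values: 11
summary(x)           # minimum, quartiles, median, mean, and maximum at once
```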



Chapter 2

Load and Review Data

Learning Objectives
By the end of this chapter students should have achieved the following learn-
ing objectives (know how to do the following):

1. Load/import data sets in .csv format, as well as stock market data
from the web.
2. Review the structure of these data sets.
3. Create subsets by specifying observations, time periods, and selected
variables.
4. Order data based on ascending or descending order of a specific vari-
able.
5. Generate basic descriptive statistics (min, max, mean).
6. Create new variables (e.g., financial ratios) from existing variables (e.g.,
profits, sales and assets).
7. Generate graphs for observing trends in time-series.

2.1 Management Accounting Data


"Bibitor, LLC is a retail liquor company with 79 locations throughout the
state of Lincoln. They sell spirits and wine products. Bibitor, LLC has been
serving the area for over 50 years. Their wine and spirits selection is hand
picked and focused on value. Their employees are trained as personal beverage
concierges and provide unmatched service to all customers. There are over
10,000 brands throughout the organization."
Through the HUB of Analytics Education portal¹ we have access to detailed
data related to Bibitor's daily operations. There are millions of lines of
sales, purchases, inventory, invoices, expenses, payroll, and store data in its
database. In this section, we will load and review data related to the size of
the Bibitor stores from the file named tStores.csv.

¹ http://www.hubae.org/
Unlike Excel worksheets, which are limited to 1,048,576 rows (about a
million), .csv files have no built-in limit on the number of rows. This
difference matters because some of Bibitor's files have more than a million
observations (e.g., the sales file has over 12 million lines).

2.1.1 Prepare R Environment (RStudio)


If this is your first time using R, please read and follow the directions in
Chapter 1 on how to set up R and RStudio. Please make sure to familiarize
yourself with the RStudio environment and basic R operations before you
continue with the rest of this chapter.²

1. Create a directory to store your data and analysis.


2. Start RStudio, and follow directions in Chapter 1 (p. 9) on how to
create a new R file. Name the file BBTR_intro and save it in the same
directory.
3. Set the working directory to this folder as follows:
• From the menu select Session
• select Set Working Directory
• select To Source File Location. The full path of the working
directory will show in the console area of RStudio.
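
The menu steps above can also be carried out from a script with setwd() and getwd(). The lines below are a minimal sketch rather than part of the original script: the folder name BBTR_analysis is hypothetical, and a temporary location is used so that the example runs on any machine.

```r
# Hypothetical project folder; tempdir() is used so the sketch runs anywhere.
project_dir <- file.path(tempdir(), "BBTR_analysis")
dir.create(project_dir, showWarnings = FALSE)

old_dir     <- setwd(project_dir)  # setwd() returns the previous working directory
current_dir <- getwd()             # full path of the new working directory
print(current_dir)

setwd(old_dir)                     # go back to where we started
```

In your own work you would replace project_dir with the path of the directory created in step 1.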

2.1.2 Load and Review Data


Load/Import Data: To import store data, we are going to use the R
package data.table. If this is the first time you are using this package, follow
the directions in Chapter 1 (p. 5) on how to install packages in R. Remember, we
install the package once, but we have to load it using the function library()
in every new R session in which we want to use it.
After the package has been loaded with library(data.table), we use
the function fread() to import the data from "tStores.csv" and save it to
a new data set that we call tStores.

> library(data.table)
> tStores <- fread("tStores.csv")
2
The complete R Script used for the creation of this section is available in Appendix
A.2.1 (p. 99).
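
If data.table is not available, the same kind of file can be read with the base R function read.csv(). The sketch below is only an illustration of that alternative: it writes three rows (copied from the tStores output shown later in this section) to a temporary file and reads them back, so it does not need the actual tStores.csv. The object names sample_csv and stores_sample are made up for the example.

```r
# Three sample rows copied from the tStores output in this section.
sample_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "Store,City,Location,SqFt",
  "1,HARDERSFIELD,HARDERSFIELD #1,8300",
  "10,HORNSEY,HORNSEY #10,12000",
  "11,CARDEND,CARDEND #11,6600"
), sample_csv)

# read.csv() is the base R reader; stringsAsFactors = FALSE keeps text columns as chr.
stores_sample <- read.csv(sample_csv, stringsAsFactors = FALSE)
str(stores_sample)
```

For files with millions of lines, fread() is generally the faster choice, which is why the notes use it.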

University of Waterloo - 2018



Names of Variables: We can review the names of variables in the data
set using the function names(). The function takes one argument, the name
of the data set. This means that inside the parentheses, we specify the name
of the data set as follows:

> names(tStores)

[1] "Store" "City" "Location" "SqFt"

The data set has four variables: Store (a unique identifier for each
store/observation), City, Location, and the size of each store in square feet (SqFt).

Data Structure: We can get detailed information about the entire data
set (e.g., number of observations, variables, and format for each variable)
with the function str(...). The function takes one argument, the name of
the data set:

> str(tStores)

Classes 'data.table' and 'data.frame': 79 obs. of 4 variables:


$ Store : int 1 10 11 12 13 14 15 16 17 18 ...
$ City : chr "HARDERSFIELD" "HORNSEY" "CARDEND" ...
$ Location: chr "HARDERSFIELD #1" "HORNSEY #10" ...
$ SqFt : int 8300 12000 6600 4600 3200 7000 12200 3600 ...

The data set has 79 observations (one row in the table for each store) and 4
variables (columns in the table). The variables Store and SqFt are integers
(int), while the variables City and Location are formatted as text (chr).
Knowing the format of variables matters when doing statistical analysis. For
example, we can calculate the average of numeric data, whereas for categorical
data we would instead create a frequency table (or pie chart).
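
The distinction between numeric and categorical variables can be sketched as follows. The values are the first six stores from the tStores output in this section; the object names (sq_ft, city, avg_size, city_counts) are illustrative only.

```r
# Sample values copied from the tStores output shown in this section.
sq_ft <- c(8300, 12000, 6600, 4600, 3200, 7000)      # numeric variable
city  <- c("HARDERSFIELD", "HORNSEY", "CARDEND",
           "LEESIDE", "TARMSWORTH", "BROMWICH")      # text (categorical) variable

avg_size    <- mean(sq_ft)   # an average is meaningful for numeric data
city_counts <- table(city)   # a frequency table suits categorical data

print(avg_size)
print(city_counts)
```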

Review Sample of Observations from Dataset: With the function
head(...), we can view the first six observations in the data set. With the
function tail(...), we can view the last six observations in the data set.
The script below shows the head of the store data. As an exercise, you should
generate the last six observations of the data.

> head(tStores)


Store City Location SqFt


1: 1 HARDERSFIELD HARDERSFIELD #1 8300
2: 10 HORNSEY HORNSEY #10 12000
3: 11 CARDEND CARDEND #11 6600
4: 12 LEESIDE LEESIDE #12 4600
5: 13 TARMSWORTH TARMSWORTH #13 3200
6: 14 BROMWICH BROMWICH #14 7000
The table above - showing the first 6 rows from tStores - does not provide
information that is significantly more useful than that provided by the str()
function shown earlier. However, as we will see below, the head and tail
functions can be useful after we have created subsets of data and ordered the
data in ways that are meaningful to us.
Both the head and tail functions can take a second argument that spec-
ifies the number of observations to be shown. That is, the second argument
is in addition to the first argument that specifies the data set. For example,
with the script below, we limit the number of observations to 3.
> head(tStores,3)
Store City Location SqFt
1: 1 HARDERSFIELD HARDERSFIELD #1 8300
2: 10 HORNSEY HORNSEY #10 12000
3: 11 CARDEND CARDEND #11 6600

2.1.3 Creating Subsets


In most cases, developing a useful, focused analysis will entail limiting the
number of observations or variables we use. We are going to see some exam-
ples of how we create such subsets.

Example 1: Create Subset Specifying Observations. We want to create
a new data set that contains the first three observations and all variables
from the data set. We do this by adding square brackets after the name
of the data set. Inside the brackets we have the statement 1:3 followed by a
comma.
> tStores[1:3,]
Store City Location SqFt
1: 1 HARDERSFIELD HARDERSFIELD #1 8300
2: 10 HORNSEY HORNSEY #10 12000
3: 11 CARDEND CARDEND #11 6600


Because in this example we have specified that the output is to be the first 3
observations for all the variables, the result is the same as that above
(head(tStores, 3)).

Example 2: Create Subset Specifying Variables. We want to create
a data set that has all observations (i.e., for stores 1 through 79), but only
the last three variables (i.e., the variables in columns 2 through 4). We do
this by adding square brackets after the name of the data set. Inside the
brackets we have a comma followed by the statement 2:4.

> tStores[, 2:4]

City Location SqFt


1: HARDERSFIELD HARDERSFIELD #1 8300
2: HORNSEY HORNSEY #10 12000
3: CARDEND CARDEND #11 6600
... Lines 4 to 76 have been deleted ...
77: ALNERWICK ALNERWICK #8 5400
78: BLACKPOOL BLACKPOOL #9 7000
79: BALLYMENA BALLYMENA #79 12000
City Location SqFt

Example 3: Create Subset Specifying Observations and Variables.
We can create a subset that shows the first three observations and last three
variables using the combination of the above notations. Inside the brackets
we specify the range of observations (1:3) and the range of variables (2:4),
separated by a comma.
> tStores[1:3, 2:4]

City Location SqFt


1: HARDERSFIELD HARDERSFIELD #1 8300
2: HORNSEY HORNSEY #10 12000
3: CARDEND CARDEND #11 6600

Subsetting with data[rows, columns]: The main message from these
examples is that when we work with data sets we use the following format:
data[rows, columns]. Using this format we can generate the following
combinations:
1. All rows and all columns, by including just a comma inside the brackets:
data[,]
2. Specified rows and all columns, by including a constraint before the
comma and nothing after: data[rows,]
3. All rows and specified columns, by leaving the area before the comma
blank and specifying the constraint after the comma: data[,columns]
4. Specified rows and columns, by imposing constraints before and after
the comma: data[rows, columns]
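
The four combinations can be tried on a small example. The sketch below builds a base R data.frame from the first three rows shown earlier (the notes themselves use a data.table, but these bracket patterns read the same way); the object names are illustrative.

```r
# Base R data.frame built from the first three rows shown in this section.
stores <- data.frame(
  Store    = c(1, 10, 11),
  City     = c("HARDERSFIELD", "HORNSEY", "CARDEND"),
  Location = c("HARDERSFIELD #1", "HORNSEY #10", "CARDEND #11"),
  SqFt     = c(8300, 12000, 6600)
)

all_rows_all_cols <- stores[, ]        # 1. all rows, all columns
first_two_rows    <- stores[1:2, ]     # 2. specified rows, all columns
last_three_cols   <- stores[, 2:4]     # 3. all rows, specified columns
rows_and_cols     <- stores[1:2, 2:4]  # 4. specified rows and columns

print(dim(rows_and_cols))              # number of rows, number of columns
```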

Combine head() with subsetting: Recall that the first argument of the
function head is the name of the data set. This means that we can use any
subset that we have created as that argument. The following example shows
how we do this to achieve the same output as the one shown in Example 3
above. Notice that the first argument of head (everything before the final
comma) is the subset.
> head(tStores[,2:4],3)

City Location SqFt


1: HARDERSFIELD HARDERSFIELD #1 8300
2: HORNSEY HORNSEY #10 12000
3: CARDEND CARDEND #11 6600

2.1.4 Ordering Data


When reviewing data, we may want to view observations in ascending or
descending order of a specific variable. For example, we may want to review
the smallest and largest Bibitor stores. In R the ordering is done using the
function order(). The argument inside the parentheses is the name of the
variable, and the function returns the row positions that put that variable in
sorted order. The default ordering is ascending (smallest to largest) and this is
done by simply entering the name of the variable. To achieve descending
order (largest to smallest) we put a negative sign in front of the variable
name.
Since our data set has more than one variable, we need to specify the
variable. We can do this by using the name of the data set, followed by the
dollar sign, and the name of the variable.
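
To see what order() actually returns, it helps to run it on a short vector. The sketch below uses the first six SqFt values from the store data; the object names are illustrative. It also shows the decreasing = TRUE argument, which is an alternative to the negative sign.

```r
sq_ft <- c(8300, 12000, 6600, 4600, 3200, 7000)  # first six SqFt values from tStores

asc_positions  <- order(sq_ft)    # row positions from smallest to largest value
desc_positions <- order(-sq_ft)   # negating the values reverses the order

print(asc_positions)              # positions, not the values themselves
print(sq_ft[asc_positions])       # values in ascending order
print(sq_ft[desc_positions])      # values in descending order

# Alternative to the negative sign, which also works for non-numeric variables.
desc_alternative <- order(sq_ft, decreasing = TRUE)
```

Placing these positions before the comma in data[rows, columns] is what reorders the rows of the data set.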

Ordering Ascending Order: With the following script we specify that
we want to see the first three observations of a subset (head(subset, 3)).
The subset is made up of the last three variables of the store data set that
has been ordered in terms of store size from smallest to largest
(tStores[order(tStores$SqFt), 2:4]).

> head(tStores[order(tStores$SqFt),2:4],3)


City Location SqFt


1: HORNSEY HORNSEY #3 1100
2: FURNESS FURNESS #18 2700
3: CESTERFIELD CESTERFIELD #64 2800
From this we can see that the three smallest Bibitor stores are 1100, 2700,
and 2800 square feet respectively.

Ordering Descending Order: With the following script we specify that
we want to see the first three observations of a subset (head(subset, 3)).
The subset is made up of the last three variables of the store data set that
has been ordered in terms of store size from largest to smallest
(tStores[order(-tStores$SqFt), 2:4]).
> head(tStores[order(-tStores$SqFt),2:4],3)
City Location SqFt
1: GARIGILL GARIGILL #49 33000
2: EANVERNESS EANVERNESS #66 20000
3: PITMERDEN PITMERDEN #34 18400
From this we can see that the three largest Bibitor stores are 33000, 20000,
and 18400 square feet respectively. The largest store (#49) is thirty times
larger than the smallest store (#3).

2.1.5 Basic Descriptive Statistics


In Chapter 1 (p. 11) we learned how to generate the minimum (min()),
the maximum (max()), and average (mean()) for a variable. We will use
these functions for store size. Again, since our data set has more than one
variable, when we want to generate our statistics, we need to specify the
variable by using the name of the data set, followed by the dollar sign, and the
name of the variable.

Minimum and Maximum


> min(tStores$SqFt)
[1] 1100
> max(tStores$SqFt)
[1] 33000
These results, of course, show the same minimum and maximum that were
revealed above when the data were ordered.


Average We can use the function mean() to calculate the average store
size as follows:
> mean(tStores$SqFt)
[1] 7893.671
The average Bibitor store is about 7,894 square feet in size.
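
The three statistics can also be obtained in fewer calls. The sketch below is an optional shortcut, not part of the original script: range() returns the minimum and maximum together, and summary() reports them alongside the mean and quartiles. It uses only the first six SqFt values, so the numbers differ from the full-data results above.

```r
sq_ft <- c(8300, 12000, 6600, 4600, 3200, 7000)  # first six SqFt values from tStores

size_min   <- min(sq_ft)
size_max   <- max(sq_ft)
size_mean  <- mean(sq_ft)
size_range <- range(sq_ft)     # minimum and maximum in one call
size_stats <- summary(sq_ft)   # min, quartiles, median, mean, max

print(size_range)
print(size_stats)
```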

2.1.6 Practice Problems


Read the note below before you work on these practice problems.
The data sets for the following problems are very large. Before you import
a data set, you may want to close all other applications on your computer.
Depending on the memory and processor of your machine, the loading time
may range from less than a minute to 10 to 20 minutes. As soon as you are
done with each problem, follow the directions in section 1.5.4 (p. 9) to clean
the R environment.
To answer the following questions, you can continue working with the
same R file (BBTR_intro.R). Import the sales (SalesFINAL.csv) and purchase
(PurchasesFINAL.csv) data.
1. What is the number of observations and variables in the sales data?
2. Are there any variables in the sales data for which we cannot generate
descriptive statistics? Hint: Are there variables which are not numeric?
3. Are there any variables in the sales data for which you could create
descriptive statistics, but doing so does not add any value?
4. Order your sales data in terms of sales price (descending). What is the
price of the most expensive product?
5. Order your sales data in terms of sales price (ascending). What is the
price of the least expensive product?
6. Generate basic descriptive statistics for sales price based on sales data.
7. Generate basic descriptive statistics for purchase price based on purchase
data.

2.2 Stock Market Data


In this section, we will use the package quantmod to import and review stock
market data from 2006 to 2016 for Amazon and contrast them with the SP500
index. As you will learn in the introduction to information technology class
(AFM241), Amazon introduced its cloud computing technology in the summer of 2006.3
3
The R Script used for the creation of this section is available in Appendix A.2.2 (p. 100).


If this is your first time using the quantmod package, follow the directions
in Chapter 1 (p. 5) on how to install packages in R. Clear the RStudio
environment and create a new R file. We load the package using
library(quantmod) and use the command Sys.setenv(TZ = "UTC")
to avoid warnings and error messages related to time zones.

> library(quantmod)
> Sys.setenv(TZ = "UTC")

We use the function getSymbols(...) to extract stock market data. The
function takes the following arguments:

1. The ticker symbol of the stock, index, or ETF in quotation marks. The
ticker of Amazon is "AMZN".4
2. The source that supplies the financial data. In this example, we
use Yahoo Finance (src="yahoo").
3. The frequency for extracting data. This can be daily, weekly, monthly,
quarterly, or annually. For weekly data, we specify periodicity="weekly".
4. The beginning (from=) and the end (to=) of the time period. Notice
that we specify the date using the format "yyyy-mm-dd" and tell R to
read this format as a date (as.Date).

> getSymbols(c("AMZN"), src="yahoo", periodicity="weekly",
      from=as.Date("2006-01-01"), to=as.Date("2016-12-31"))

[1] "AMZN"

When run, the function creates a new data set called AMZN. We are not going
to use the function str(...) to review it because the type of data set
generated by quantmod is beyond the scope of these notes. Instead,
we use the function names as well as head and tail.

> names(AMZN)

[1] "AMZN.Open" "AMZN.High" "AMZN.Low"


[4] "AMZN.Close" "AMZN.Volume" "AMZN.Adjusted"

The data set has six variables (columns). More specifically, for each
observation/week in the data set, we have the following variables: the opening price
of AMZN (AMZN.Open), highest price (AMZN.High), lowest price (AMZN.Low),
closing price (AMZN.Close), volume/number of shares traded (AMZN.Volume),
and the adjusted closing price (AMZN.Adjusted). As you will learn in your
financial management classes, the adjusted closing price accounts for such
things as stock splits and dividends; hence it is the most useful variable
for observing the movement of the stock over time.
4
To extract both data sets at the same time, we specify
getSymbols(c("AMZN", "^GSPC"), src="yahoo", ...).

Number of Observations in Dataset (nrow): The function nrow() provides
the number of rows/observations in a data set. Running the function
on the Amazon data, we find that the data set has 574 rows. In other words,
in the period from 2006-01-01 to 2016-12-31 there are 574 weeks.

> nrow(AMZN)

[1] 574
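
The count of 574 can be cross-checked without downloading anything. The sketch below assumes that the weekly observations advance in exact 7-day steps starting on 2006-01-01, which is what the dates in the head and tail outputs suggest; it uses only base R date functions, not quantmod, and the object names are illustrative.

```r
# Weekly dates from the start of the requested range; seq() stops at the
# last weekly date that does not pass the end date.
week_starts <- seq(as.Date("2006-01-01"), as.Date("2016-12-31"), by = "week")

n_weeks    <- length(week_starts)
first_week <- week_starts[1]
last_week  <- week_starts[n_weeks]

print(n_weeks)
print(c(first_week, last_week))
```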

Using the functions head and tail respectively, we can verify that the start-
ing week corresponds to January 1st, 2006 and the ending week corresponds
to December 25th, 2016.

> head(AMZN,3)

AMZN.Open AMZN.High AMZN.Low AMZN.Close AMZN.Volume


2006-01-01 47.47 48.58 46.25 47.87 26593200
2006-01-08 46.55 47.10 44.00 44.40 37376900
2006-01-15 43.95 45.24 43.10 43.92 27838600
AMZN.Adjusted
2006-01-01 47.87
2006-01-08 44.40
2006-01-15 43.92

> tail(AMZN,3)

AMZN.Open AMZN.High AMZN.Low AMZN.Close AMZN.Volume


2016-12-11 766.40 782.46 754.00 757.77 22334600
2016-12-18 758.89 774.39 756.16 760.59 12381900
2016-12-25 763.40 780.00 748.28 749.87 13232600
AMZN.Adjusted
2016-12-11 757.77
2016-12-18 760.59
2016-12-25 749.87


2.2.1 Plotting Stock Market Data


As mentioned above, if we want to review the price of Amazon stock over
time, we will need to focus on the adjusted closing price. For this we can
create a subset that specifies that we want all observations in the data set and
the 6th variable. As we have seen on p. 18, we can create this subset using
the square bracket notation. More specifically, we leave the area before the
comma blank and put the position of the variable (6) after the comma (AMZN[,6]).
Having specified this subset, we can use the function plot to generate the
graph showing the adjusted closing price of Amazon, as follows:

> plot(AMZN[,6])

The graph (Figure 2.1) shows that after the introduction of cloud computing,
the price of Amazon went from around 100 to around 800. This means a
return of around 700%.
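
The return arithmetic can be checked directly. The sketch below first applies the simple return formula to the rounded levels quoted in the text, and then to the first and last adjusted closing prices from the head(AMZN,3) and tail(AMZN,3) outputs above. Because the second calculation starts from the first week of 2006 rather than from the level read off the graph, its result is considerably larger. The object names are illustrative.

```r
# Simple (holding-period) return: change in price divided by starting price.
approx_return <- (800 - 100) / 100   # rounded levels quoted in the text
print(100 * approx_return)           # in percent

price_start <- 47.87    # AMZN.Adjusted, week of 2006-01-01
price_end   <- 749.87   # AMZN.Adjusted, week of 2016-12-25

simple_return <- (price_end - price_start) / price_start
print(round(100 * simple_return, 1)) # in percent
```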

Figure 2.1: Amazon: 2006-2016

2.2.2 SP500 Data


Repeating the above process, we use the SP500 ticker symbol ("^GSPC") to
extract the price of the SP500 index as follows:5
5
We can find the ticker symbol for any stock or index by running a search on Yahoo
Finance or Google Finance.


> getSymbols("^GSPC", src="yahoo", periodicity="weekly",
      from=as.Date("2006-01-01"), to=as.Date("2016-12-31"))

[1] "GSPC"

> names(GSPC)

[1] "GSPC.Open" "GSPC.High" "GSPC.Low"


[4] "GSPC.Close" "GSPC.Volume" "GSPC.Adjusted"

> nrow(GSPC)

[1] 574

> head(GSPC,1)

GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume


2006-01-01 1248.29 1286.09 1245.74 1285.45 9949800000
GSPC.Adjusted
2006-01-01 1285.45

> tail(GSPC,1)

GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume


2016-12-25 2266.23 2273.82 2233.62 2238.83 9386710000
GSPC.Adjusted
2016-12-25 2238.83

As we can see, the SP500 data set (GSPC) has the same number of variables
(6), the same number of observations (574), as well as the same starting week
(2006-01-01) and ending week (2016-12-25).

2.2.3 Combine Two Data Sets - cbind


Based on the above analysis, we know that the two data sets (AMZN and
GSPC) have the same number of observations (574) and cover the same time
period (2006 to 2016). In addition to this, we know that the main variable
of interest is the adjusted closing price. We can combine the two data sets
into one using the function cbind as follows:6

1. specify the first subset: AMZN[,6]
2. specify the second subset: GSPC[,6]
3. combine the two subsets: cbind(AMZN[,6], GSPC[,6])

6
A visual representation of cbind appears in section 4.4 (p. 88).

> dt1 <- cbind(AMZN[,6], GSPC[,6])

> head(dt1,3); tail(dt1,3)

AMZN.Adjusted GSPC.Adjusted
2006-01-01 47.87 1285.45
2006-01-08 44.40 1287.61
2006-01-15 43.92 1261.49

AMZN.Adjusted GSPC.Adjusted
2016-12-11 757.77 2258.07
2016-12-18 760.59 2263.79
2016-12-25 749.87 2238.83

> nrow(dt1)

[1] 574

> names(dt1)

[1] "AMZN.Adjusted" "GSPC.Adjusted"
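
The effect of cbind can be reproduced on a small scale with base R data frames. This is a sketch under simplifying assumptions: the data sets returned by getSymbols are time-series objects rather than data frames, and the three rows below are copied from the head(dt1,3) output. With plain data frames, cbind matches rows by position, which is why it matters that both series cover the same weeks in the same order.

```r
# Adjusted closing prices copied from the head(dt1, 3) output above.
week <- c("2006-01-01", "2006-01-08", "2006-01-15")
amzn <- data.frame(AMZN.Adjusted = c(47.87, 44.40, 43.92), row.names = week)
gspc <- data.frame(GSPC.Adjusted = c(1285.45, 1287.61, 1261.49), row.names = week)

# cbind() places the columns side by side, matching rows by position.
combined <- cbind(amzn, gspc)
print(combined)
```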

2.2.4 Renaming Variables


The new data set has two variables: AMZN.Adjusted and GSPC.Adjusted.
These names are relatively long. We can rename them using the function
names as follows:

1. We specify that we want to focus on the names of variables of the data
set: names(dt1)
2. We specify the variables that we want to rename (1st and 2nd) using
the bracket notation: [1:2]
3. We use the function c(...) (see p. 8) to specify the new names:
c("AMZN", "SP500")

> names(dt1)[1:2] <- c("AMZN", "SP500")


> names(dt1)

[1] "AMZN" "SP500"
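
The same pattern works for renaming a single variable by giving one position instead of a range. The sketch below uses a small base R data frame with two rows from the combined data; the name SP500_index is hypothetical and chosen only to show the second rename.

```r
prices <- data.frame(AMZN.Adjusted = c(47.87, 44.40),
                     GSPC.Adjusted = c(1285.45, 1287.61))

names(prices)[1:2] <- c("AMZN", "SP500")   # rename both columns by position
names(prices)[2]   <- "SP500_index"        # rename only the second column
print(names(prices))
```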


2.2.5 Printing Graphs Side-by-Side


We can print two graphs side-by-side by changing the graphical parameters
(par(...)) so that the plotting area is laid out as a grid with one row and
two columns (mfcol=c(1,2)).

> par(mfcol=c(1,2))

Once the format has been changed, we develop the two graphs.

> plot(dt1$AMZN)
> plot(dt1$SP500)

The output (Figure 2.2) compares side-by-side the performance of Amazon
on the left versus SP500 on the right.

1. Approximately, what was the percentage change of Amazon during this
period?
2. What was the approximate change of SP500 during the same period?

• When comparing graphs which are presented side-by-side pay very close
attention to the scales used.

Figure 2.2: Amazon & SP500: 2006-2016

We can return the plotting layout to its original setting (one row and one
column) as follows:

> par(mfcol=c(1,1))
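
An alternative to resetting the layout by hand is to store the previous settings when changing them, since par() returns the old values of whatever it changes. The sketch below is one way to do this. It draws to an off-screen device (pdf(NULL)) only so that it runs without a display; in RStudio you can drop the pdf(NULL) and dev.off() lines to see the plots. The plotted numbers are the first three adjusted prices shown earlier.

```r
pdf(NULL)                         # off-screen device so the sketch runs anywhere

old_par <- par(mfcol = c(1, 2))   # 1 row x 2 columns; previous settings are returned
plot(c(47.87, 44.40, 43.92), type = "l", main = "AMZN")
plot(c(1285.45, 1287.61, 1261.49), type = "l", main = "SP500")

par(old_par)                      # restore the layout that was active before
layout_after <- par("mfcol")
print(layout_after)

invisible(dev.off())              # close the off-screen device
```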


2.2.6 Subsetting Stock Market Data Based on Time


During the years 2007-08 the US economy went through the financial crisis.
It will be interesting to contrast the Amazon stock and SP500 index during
this sub-period. We can create this graph in two different ways. The first
approach would be to use the function getSymbols(...) to extract data for
both Amazon and SP500 over the period 2007 to 2010 and use it to create
graphs for Amazon and SP500.
The second approach would be to create a subset based on the dt1 data
set. Note that the data set produced by the function getSymbols(...) is
indexed by dates instead of row numbers. This means that we can subset by
specifying the date range as follows: ['startDate::endDate']. When we subset
like this, we don't use the comma. Hence, the creation of a new data set that
covers the years 2007 to 2010 can be done as follows:

> dt2 <- dt1['2007-01-01::2010-12-31']

Using the new data set, we can print the two graphs as follows:

> par(mfcol=c(1,2))
> plot(dt2$AMZN); plot(dt2$SP500)
> par(mfcol=c(1,1))
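
The logic behind the date-range subset can be illustrated with plain dates. The sketch below does not use the dt1 object; it assumes a weekly index in exact 7-day steps from 2006-01-01 and keeps the weeks that fall inside the 2007 to 2010 window with an ordinary logical condition. The object names are illustrative.

```r
# Weekly dates standing in for the date index of dt1 (assumed 7-day steps).
week <- seq(as.Date("2006-01-01"), as.Date("2016-12-31"), by = "week")

# Keep only the weeks between the two boundary dates (inclusive).
in_window   <- week >= as.Date("2007-01-01") & week <= as.Date("2010-12-31")
window_week <- week[in_window]

print(length(window_week))   # number of weekly observations in the window
print(range(window_week))    # first and last week kept
```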

Figure 2.3: Amazon & SP500: 2007-2010

The new graph (Figure 2.3) shows that Amazon was able to recover and
achieve a price higher than the pre-crisis level much faster than the SP500.
More specifically, by 2010, Amazon had reached a price level that was almost
twice the pre-crisis level. By 2010, SP500 had not yet reached its pre-crisis
level.


2.3 Financial Accounting Data


Financial statements are used primarily by two groups: the management of
the firm and external stakeholders (investors and creditors). A firm's
management relies on financial accounting data to make strategic decisions
(e.g., expansion into a new market), tactical decisions (e.g., acquisition of
resources/investments), and operational decisions (e.g., offering discounts on
the sales of certain products). Investors and creditors analyze financial
statements to decide whether to invest in and extend credit to a particular firm.
In this section, we will learn how to load and review financial accounting
data that has been extracted from Compustat (see Appendix B). More specif-
ically, we will load and review financial data for firms competing in the same
industry as Apple for the fiscal years 2010 to 2016, and create profitability
ratios.

2.3.1 Financial Ratios: Profitability


Profitability measures (e.g., net profit margin and ROA) are some of the
most important metrics for both managers and stakeholders because they
tell us how successful a company is. The viability of a firm depends on its
ability to generate profits. The following paragraphs provide a summary of
some of the most commonly used profitability ratios and how to calculate
them based on Compustat data.7
Depending on how we measure profits, we can have several variations.
Some of the most common profitability ratios are listed below.
Gross margin (GM) is defined as sales (Compustat code = SALE) minus cost of
goods sold (Compustat code = COGS) over sales (SALE).8

$$GM_{it} = \frac{SALE_{it} - COGS_{it}}{SALE_{it}} \tag{2.1}$$

Operating margin (OM) is defined as operating income before depreciation
(Compustat code = OIBDP) to sales (SALE).

$$OM_{it} = \frac{OIBDP_{it}}{SALE_{it}} \tag{2.2}$$

Profit margin (PM) is defined as income before extraordinary items
(Compustat code = IB) to sales (SALE).

$$PM_{it} = \frac{IB_{it}}{SALE_{it}} \tag{2.3}$$

7
You may want to review your notes from the introduction to financial accounting
(AFM101).
8
In the following formulas, the subscript i refers to the specific firm and the subscript
t refers to the time period.

Return on assets (ROA) measures a firm's ability to generate profits per
dollar of total assets (Compustat code = AT).9 Since we have three versions of
profits, we will create three versions of ROA. The first version of ROA is
based on gross profits, i.e., the same numerator as the gross margin (GM):

$$ROA\ (\text{with GM})_{it} = \frac{SALE_{it} - COGS_{it}}{AT_{it}} \tag{2.4}$$

The second specification of ROA is based on operating income before
depreciation (OIBDP). This means that we use the same numerator as the
operating margin (OM):

$$ROA\ (\text{with OM})_{it} = \frac{OIBDP_{it}}{AT_{it}} \tag{2.5}$$

The third version of ROA is based on income before extraordinary items
(IB), i.e., the same numerator as the profit margin (PM):

$$ROA\ (\text{with PM})_{it} = \frac{IB_{it}}{AT_{it}} \tag{2.6}$$

One of the major benefits of using financial ratios to measure profitability,
rather than actual profits, is the removal of the size factor. Because all
values are expressed relative to the firm's sales or assets, it becomes
feasible to compare companies of different sizes. Please keep in mind that
the profitability ratios will not make sense unless the denominator is greater
than zero.
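
As a worked instance of equations 2.1 to 2.6, the sketch below computes all six ratios for a single firm-year, using Apple's fiscal 2010 values from the Compustat extract reviewed later in this chapter. The helper safe_ratio() is hypothetical, not part of the original script; it returns NA whenever the denominator is not greater than zero, reflecting the caution above.

```r
# Hypothetical helper: NA when the denominator is not positive,
# since the ratio is then not meaningful.
safe_ratio <- function(numerator, denominator) {
  ifelse(denominator > 0, numerator / denominator, NA_real_)
}

# Apple, fiscal year 2010 (values in millions of dollars).
sale <- 65225; cogs <- 38609; oibdp <- 19317; ib <- 14013; at <- 75183

gm     <- safe_ratio(sale - cogs, sale)   # equation 2.1
om     <- safe_ratio(oibdp, sale)         # equation 2.2
pm     <- safe_ratio(ib, sale)            # equation 2.3
roa_gm <- safe_ratio(sale - cogs, at)     # equation 2.4
roa_om <- safe_ratio(oibdp, at)           # equation 2.5
roa_pm <- safe_ratio(ib, at)              # equation 2.6

print(round(c(gm = gm, om = om, pm = pm), 4))
print(round(c(roa_gm = roa_gm, roa_om = roa_om, roa_pm = roa_pm), 4))
```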

2.3.2 Load and Review Compustat Data


Following the directions shown in Appendix B (p. 117), we have created
and saved a data set named industryAAPL_2010_2016.csv. We clean the R
environment, and create a new R script file. Using the package data.table
and the function fread we import the data (industryAAPL_2010_2016.csv)
in R and review the names of the variables in the data set using the function
names.10
9
The creation of ROA is based on input from the balance sheet (assets) and income
statement (sales and income). The balance sheet measures values at a point in time while
the income statement measures values over a period of time. To resolve this problem, we
need to use the average of assets rather than assets at a given point in time. We can do this
by taking the average of current period assets and prior period assets (i.e., lagged assets).
Generating the average of assets is complicated when working with multiple firms and
multiple years. It requires the use of a panel data package in R, which is beyond the scope
of this introductory primer. In the rest of the discussion, we will use just end-of-period
total assets in the denominator.
> library(data.table)
> dt <- fread("industryAAPL_2010_2016.csv")
> names(dt)

[1] "gvkey" "datadate" "fyear" "tic" "conm"


[6] "at" "cogs" "ib" "oibdp" "sale"
[11] "loc" "naics" "sic"

The data set has 13 variables (column names):
1. gvkey: a unique identifier assigned by Compustat to each company.
2. datadate: the ending date of the filing period.
3. fyear: the fiscal year.
4. tic: the ticker symbol.
5. conm: the company's name.
6. at: total assets.
7. cogs: cost of goods sold.
8. ib: income before extraordinary items.
9. oibdp: operating income before depreciation.
10. sale: sales/revenue.
11. loc: country/location of the firm's headquarters.
12. naics: North American Industry Classification System code.
13. sic: Standard Industrial Classification (SIC) code.

> str(dt)

Classes 'data.table' and 'data.frame': 441 obs. of 13 variables:
$ gvkey : int 1117 1117 1117 1117 1117 1117 1117 1635 ...
$ datadate: int 20101231 20111231 20121231 20131231 ...
$ fyear : int 2010 2011 2012 2013 2014 2015 2016 2010 ...
$ tic : chr "RWC" "RWC" "RWC" "RWC" ...
$ conm : chr "RELM WIRELESS CORP" "RELM WIRELESS CORP" ...
$ at : num 34.8 31.8 34 34 37 ...
$ cogs : num 13.8 12.6 13.2 13.8 16.6 ...
$ ib : num -0.66 -0.493 2.065 1.142 1.623 ...
$ oibdp : num 0.183 0.759 4.248 3.116 3.713 ...
$ sale : num 26 24.1 27.6 27 31 ...
$ loc : chr "USA" "USA" "USA" "USA" ...
$ naics : int 334220 334220 334220 334220 334220 334220 ...
$ sic : int 3663 3663 3663 3663 3663 3663 3663 3663 ...
10
The R Script used for the creation of this section is available in Appendix A.2.3 (p. 101).

Reviewing the structure of the data set, we see that it has 441 observations
and there is a combination of numeric (num or int) and text variables (chr).

Subset - Apple Data: Using the notation data[rows, columns] we can create and
view a subset that has just the financial data for Apple. In section 2.1.3 (p. 16)
we have seen that we can create a subset by specifying the row numbers. An
alternative approach is to specify that we want to see rows that meet
a certain condition (i.e., have a ticker symbol equal to AAPL). We can do
this by setting the term before the comma as follows: dt$tic=="AAPL".
With the following command, we can view the Apple data for variables
3 through 10 (fyear, tic, conm, at, cogs, ib, oibdp, and sale).

> dt[dt$tic=="AAPL",c(3:10)]

fyear tic conm at cogs ib oibdp sale


1: 2010 AAPL APPLE INC 75183 38609 14013 19317 65225
2: 2011 AAPL APPLE INC 116371 62609 25922 35612 108249
3: 2012 AAPL APPLE INC 176064 84641 41733 58446 156508
4: 2013 AAPL APPLE INC 207000 99849 37037 55756 170910
5: 2014 AAPL APPLE INC 231839 104312 39510 60449 182795
6: 2015 AAPL APPLE INC 290479 129589 53394 81730 233715
7: 2016 AAPL APPLE INC 321686 121576 45687 69276 215091

2.3.3 Create Profitability Ratios


The three margin ratios (GM, OM, and PM) all have sales in the denominator.
To avoid problems with firms that for any reason may have reported zero or a
very small level of sales, we create a subset that limits data to firms with
sales above one million dollars.11 We can do this by specifying that we want to
keep observations where sale > 1, as follows:

> dt <- dt[dt$sale>1,]


11
Compustat data are reported in millions of dollars.


We can see by running the function nrow, that the new data set has 396
observations.

> nrow(dt)

[1] 396
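
The effect of such a filter can be seen on a toy example. In the sketch below the tickers and sales figures are made up for illustration and do not come from the Compustat file. It uses a base R data frame, where a missing sales value would otherwise produce a row of NAs in the result, so the condition also checks for missing values explicitly.

```r
# Hypothetical firms (tickers and sales are made up), sales in millions of dollars.
firms <- data.frame(tic  = c("AAA", "BBB", "CCC", "DDD"),
                    sale = c(250.0, 0.4, NA, 0))

keep     <- !is.na(firms$sale) & firms$sale > 1   # TRUE only for usable rows
filtered <- firms[keep, ]

print(nrow(firms))      # rows before filtering
print(nrow(filtered))   # rows after filtering
```

Comparing nrow() before and after the filter, as done above for dt, is a quick check on how many observations were dropped.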

Using the formulas 2.1-2.3, we can specify the creation of the three prof-
itability ratios as follows:

> dt$gm <- (dt$sale-dt$cogs)/dt$sale


> dt$om <- dt$oibdp/dt$sale
> dt$pm <- dt$ib/dt$sale

We can see that the new variables (gm, om, and pm) have been added to the
data set.

> names(dt)

[1] "gvkey" "datadate" "fyear" "tic" "conm"


[6] "at" "cogs" "ib" "oibdp" "sale"
[11] "loc" "naics" "sic" "gm" "om"
[16] "pm"

We can visually inspect the new variables (profitability ratios), by creating
and viewing a subset that has all Apple observations (dt$tic=="AAPL")
and the following variables: the fiscal year and ticker symbol (3rd and 4th
variable), all variables used for the creation of the profitability ratios (6
through 10), as well as the derived profitability ratios (14 through 16).

> dt[dt$tic=="AAPL",c(3:4, 6:10, 14:16)]

fyear tic at cogs ib oibdp sale gm


1: 2010 AAPL 75183 38609 14013 19317 65225 0.4080644
2: 2011 AAPL 116371 62609 25922 35612 108249 0.4216205
3: 2012 AAPL 176064 84641 41733 58446 156508 0.4591906
4: 2013 AAPL 207000 99849 37037 55756 170910 0.4157802
5: 2014 AAPL 231839 104312 39510 60449 182795 0.4293498
6: 2015 AAPL 290479 129589 53394 81730 233715 0.4455255
7: 2016 AAPL 321686 121576 45687 69276 215091 0.4347695

om pm
1: 0.2961594 0.2148409
2: 0.3289823 0.2394664
3: 0.3734378 0.2666509
4: 0.3262302 0.2167047
5: 0.3306929 0.2161438
6: 0.3496994 0.2284577
7: 0.3220776 0.2124078

Using a similar approach we can create the ROA ratios and observe them by
creating a subset.12

> dt$ROA_gm <- (dt$sale-dt$cogs)/dt$at


> dt$ROA_om <- dt$oibdp/dt$at
> dt$ROA_pm <- dt$ib/dt$at

> dt[dt$tic=="AAPL",c(3:4, 6:10, 17:19)]

fyear tic at cogs ib oibdp sale ROA_gm


1: 2010 AAPL 75183 38609 14013 19317 65225 0.3540162
2: 2011 AAPL 116371 62609 25922 35612 108249 0.3921939
3: 2012 AAPL 176064 84641 41733 58446 156508 0.4081868
4: 2013 AAPL 207000 99849 37037 55756 170910 0.3432899
5: 2014 AAPL 231839 104312 39510 60449 182795 0.3385237
6: 2015 AAPL 290479 129589 53394 81730 233715 0.3584631
7: 2016 AAPL 321686 121576 45687 69276 215091 0.2907027

ROA_om ROA_pm
1: 0.2569331 0.1863852
2: 0.3060213 0.2227531
3: 0.3319588 0.2370331
4: 0.2693527 0.1789227
5: 0.2607370 0.1704200
6: 0.2813629 0.1838136
7: 0.2153529 0.1420236

2.3.4 Basic Descriptive Statistics


Using the functions for minimum (min), maximum (max), and average (mean)
we can generate descriptive statistics and compare the performance of Apple
with the rest of the industry.
12
To avoid scientific notation in our results, we can run the function
options(scipen = 9). Any sufficiently large positive value has the same effect.


GM: minimum Apple versus minimum of Industry. We can focus on
gm and generate the minimum value for Apple over the years 2010 to 2016
as follows:

> min(dt[dt$tic=="AAPL",]$gm)

[1] 0.4080644

The minimum for the entire industry is calculated as follows:

> min(dt$gm)

[1] -3.583661

At this point, we may want to find out which company has such a low per-
formance. We can do this by creating a subset that specifies that we want
to see the observation that is associated with the minimum of gross margin.

> dt[dt$gm==min(dt$gm),]

gvkey datadate fyear tic conm at cogs


1: 29332 20161231 2016 PRKR PARKERVISION INC 8.576 18.628

ib oibdp sale loc naics sic gm om


1: -21.509 -14.564 4.064 USA 334220 3663 -3.583661 -3.583661

pm ROA_gm ROA_om ROA_pm


1: -5.292569 -1.698228 -1.698228 -2.508046

GM: maximum Apple versus maximum of Industry

> max(dt[dt$tic=="AAPL",]$gm)

[1] 0.4591906

> max(dt$gm)

[1] 0.9974311

• Q: Can you determine which firm achieved such a high gross margin?


GM: average Apple versus average of Industry


> mean(dt[dt$tic=="AAPL",]$gm)
[1] 0.4306144
> mean(dt$gm)
[1] 0.4357146
The results indicate the gross margin of Apple seems to be close to the
industry average.
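
The three comparisons above repeat the same pattern, so they can also be
collected in one small table. The sketch below uses made-up gross margins
(the tickers XYZ and ABC and all values are hypothetical) and a helper
function, min_max_mean, that is introduced here only for illustration.

```r
# Hypothetical gross margins for a handful of firm-years (illustration only).
toy <- data.frame(
  tic = c("AAPL", "AAPL", "XYZ", "XYZ", "ABC"),
  gm  = c(0.41, 0.46, -0.20, 0.10, 0.90)
)

# Small helper returning the three statistics as a named vector
min_max_mean <- function(x) c(min = min(x), max = max(x), mean = mean(x))

apple    <- min_max_mean(toy$gm[toy$tic == "AAPL"])
industry <- min_max_mean(toy$gm)

# Stack the two results so they can be compared row by row
comparison <- rbind(apple, industry)
print(comparison)

# which.max() gives the row position of the largest value,
# which identifies the firm behind the maximum
top_firm <- toy$tic[which.max(toy$gm)]
top_firm
```

With the chapter's data, the same helper could be applied to dt$gm and to
the Apple subset, assuming the gm column has been created as shown earlier.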

2.3.5 Graph Apple Performance over Time


The simplest way to observe a firm’s performance over time is to create a
graph showing time on the horizontal axis and the target variable on the
vertical axis. The ggplot2 package allows for the creation of some very
advanced graphs. In the following example, we will leverage ggplot2 to
create a graph showing Apple sales over time.
We start by creating a subset that contains only the Apple observations and
loading the library ggplot2.
> dt_AAPL <- dt[dt$tic=="AAPL",]
> library(ggplot2)
The script below builds the graph as follows:
1. We choose a name for our graph (AAPL_sales).
2. We use the function ggplot from the package ggplot2 to build the
graph.
3. The function takes two arguments here. The first is the name of the
data set (dt_AAPL). The second is the aesthetics mapping (the x and y
axes in this graph). More specifically, the argument aes specifies that
the x-axis is based on the variable fiscal year (fyear) and the y-axis is
based on the variable sales (sale).
4. The package ggplot2 works incrementally. This means that we can add
more features (layers) to a graph using the plus (+) sign. In our case,
we specify that we want to see the data points using the function
geom_point().
> AAPL_sales <- ggplot(dt_AAPL, aes(x=fyear, y=sale)) +
geom_point()
> AAPL_sales
The resulting graph is shown in Figure 2.4.


Figure 2.4: Apple Sales 2010-2016 - version 1

Refining the Graph As mentioned above, the ggplot2 package gives us the
ability to refine the AAPL_sales graph incrementally. More specifically,
we are going to make the following changes:

1. We will add the function geom_line(), which means that we want to see
a line connecting the data points.
2. We want the horizontal axis to read Fiscal Year, and
3. the vertical axis to read Sales (millions).

> AAPL_sales <- AAPL_sales+geom_line() +
xlab("Fiscal Year") + ylab("Sales (millions)")
> AAPL_sales

The resulting graph is shown in Figure 2.5. From this, we can see that
Apple's sales more than tripled from 2010 (65,225) to 2016 (215,091),
after peaking in 2015 (233,715).
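
To put a number on the growth visible in the graph, the sales figures from
the Apple output earlier in this section can be typed in directly. The
sketch below is only an illustration; the variable names growth_multiple
and peak_year are assumptions introduced here.

```r
# Apple sales in millions, copied from the Apple output earlier in this section
fyear <- 2010:2016
sale  <- c(65225, 108249, 156508, 170910, 182795, 233715, 215091)

# Ratio of 2016 sales to 2010 sales
growth_multiple <- sale[fyear == 2016] / sale[fyear == 2010]

# Fiscal year with the highest sales
peak_year <- fyear[which.max(sale)]

round(growth_multiple, 2)
peak_year
```

A multiple above 3 means that sales in 2016 were more than three times the
2010 level, even though the highest value was reached one year earlier.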


Figure 2.5: Apple Sales 2010-2016 - version 2

Chapter 3

Summary Statistics

Learning Objectives
By the end of this chapter students should have achieved the following learn-
ing objectives (know how to do the following):
1. Generate summary descriptive statistics
2. Identify outliers using the interquartile range approach
3. Create a boxplot
4. Create a subset that indicates which observations are outliers
5. Create a frequency table for categorical variables

3.1 Management Accounting Data


In chapter 2 (p. 13) we learned that Bibitor is a retail liquor company with
79 locations throughout the state of Lincoln and we reviewed the sizes of
these stores (square feet). In this chapter, we are going to take a closer
look at the size of the stores (generate summary descriptive statistics) and
see if there are any stores which are much bigger (mega-stores) than the
average Bibitor store. The management of the company might be interested
in finding/comparing the sales, costs, and/or revenues of these mega-stores
to the rest of Bibitor stores.
For our analysis, we will use a file named salesByStore.csv. The file
contains sales data, as well as the size, for each store.

3.1.1 Load and Review Sales by Store Data


Follow the directions in section 2.1.1 (p. 14) on how to prepare the R
environment. Create a new R file and name it BBTR_summaryStats.R. We use
the function fread() to import the data from "salesByStore.csv" and save it
to a new data set that we call dt1. Using the function str(), we review the
structure of the data set.

> # set working directory to source file location


> library(data.table)
> dt1 <- fread("salesByStore.csv")
> str(dt1)

Classes 'data.table' and 'data.frame': 79 obs. of 5 variables:


$ Store : int 1 2 3 4 5 6 7 8 9 10 ...
$ SqFt : int 8300 7900 1100 7600 2900 6500 7100 ...
$ unitSold : int 576092 403023 33962 279623 124272 ...
$ averagePrice: num 15 16.1 17.3 13.9 12.9 ...
$ revenue : num 6912704 6091406 436062 3201746 1322488 ...

The data set has 79 observations (one row in the table for each store) and
5 variables (columns in the table). The variable Store provides the unique
identification of each store; SqFt provides the size of each store in square feet;
unitSold captures the units sold (number of bottles); averagePrice captures
the average price per unit (bottle) sold; and revenue captures the revenues
(sales) generated by each store.

3.1.2 Summary Statistics


In chapter 2 (p. 19) we learned how to generate basic descriptive statistics,
such as the min, max, and average. The function summary() returns these
basic descriptive statistics, together with the quartiles, for a variable
(e.g., SqFt) as follows:

> summary(dt1$SqFt)

Min. 1st Qu. Median Mean 3rd Qu. Max.


1100 4000 6400 7894 10200 33000

From this we can see that the smallest store is 1100 square feet, while the
largest is 33000. About 25% of stores are smaller than 4000 square feet; in
other words, the 1st quartile (Q1) is 4000. The median is 6400, so about 50%
of stores are smaller than 6400 square feet and about 50% are larger. The
average store is 7894 square feet. About 75% of stores are below the 3rd
quartile (Q3 = 10200).
We can generate the first quartile, third quartile, or any other percentile
we want using the function quantile. The function takes two arguments:
the variable and the fraction of observations that should fall below the
value (a number between 0 and 1). For example, we can generate the 1st
quartile of store size by specifying the variable (SqFt) and a fraction of
.25, as follows:

> quantile(dt1$SqFt, .25)

25%
4000

If we change the percentage to .75 it will return the 3rd quartile (Q3).

> quantile(dt1$SqFt, .75)

75%
10200

We can take the difference between the two of them to calculate the in-
terquartile range.

> quantile(dt1$SqFt, .75)-quantile(dt1$SqFt, .25)

75%
6200

Alternatively, we can generate the interquartile range (IQR) using the func-
tion IQR as follows:

> IQR(dt1$SqFt)

[1] 6200
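
The quartile and IQR calls above can be tried without the data file. The
sketch below uses only the seven SqFt values visible in the str() output
(the vector name sqft7 is an assumption made for this illustration), so its
results differ from those for all 79 stores. By default quantile()
interpolates between neighbouring observations when the requested position
falls between two data points.

```r
# SqFt values of the first seven stores, as shown in the str() output above
sqft7 <- c(8300, 7900, 1100, 7600, 2900, 6500, 7100)

q1 <- quantile(sqft7, 0.25)   # first quartile
q3 <- quantile(sqft7, 0.75)   # third quartile

# Both quartiles can also be requested in a single call
quantile(sqft7, c(0.25, 0.75))

# The difference of the two quartiles equals IQR();
# unname() drops the "75%" label that the subtraction would otherwise keep
unname(q3 - q1)
IQR(sqft7)
```

The "75%" label printed above the value 6200 in the earlier subtraction is
simply the name carried over from the first operand; it does not mean that
the result is a 75th percentile.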

3.1.3 Detecting Outliers


We can use the interquartile range to detect outliers. Remember that values
below the lower whisker (Q1 − 1.5 ∗ IQR) or above the upper whisker (Q3 +
1.5 ∗ IQR) are outliers. We can generate the lower and upper whisker as
follows:

> lWhisker4sQFt <- quantile(dt1$SqFt,.25)-1.5*IQR(dt1$SqFt)


> lWhisker4sQFt

25%
-5300


> uWhisker4sQFt <- quantile(dt1$SqFt,.75)+1.5*IQR(dt1$SqFt)


> uWhisker4sQFt

75%
19500

The size of the smallest store (1100) is not below the lower whisker (-5300),
therefore there are no outliers on the left side of the distribution. However,
the largest store (33000) is above the upper whisker (19500), hence at least
this store is an outlier.
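
The arithmetic behind the two whiskers can be reproduced from the numbers
reported by summary() alone. This is a minimal sketch; the variable names
are chosen for this illustration only.

```r
# Quartiles reported by summary(dt1$SqFt) above
q1  <- 4000
q3  <- 10200
iqr <- q3 - q1                 # 6200

lower <- q1 - 1.5 * iqr        # lower cut-off
upper <- q3 + 1.5 * iqr        # upper cut-off

# Compare the smallest and largest store with the cut-offs
smallest <- 1100
largest  <- 33000
c(lower = lower, upper = upper)
smallest < lower   # FALSE: no outlier on the low side
largest > upper    # TRUE: the largest store is an outlier
```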

3.1.4 Boxplot
A boxplot is a way of graphically showing data in their quartiles, including the
"whiskers" as noted above (i.e., indicators of the variability of data above
the third quartile and below the first quartile). In R, we can create the
boxplot for the size of the stores (shown in Figure 3.1) using the function
boxplot and specifying the variable that we would like to graph as follows:
> boxplot(dt1$SqFt)

Figure 3.1: Boxplot - Size of Stores (Square Feet)

The boxplot shows us that there are 2 outliers on the upper end. This
means that there are two values above the upper whisker (19500). In the
previous section we have seen that the minimum value for the store size is
1100 and the calculated lower whisker is -5300. Because no store is smaller
than 1100 square feet (and a negative size is impossible), the boxplot draws
the lower whisker at the minimum value (1100), that is, at the most extreme
observation that is not an outlier, rather than at -5300.
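
The numbers that a boxplot draws can be inspected with the base R function
boxplot.stats(). The sketch below uses a small illustrative vector (seven
store sizes from the str() output plus the largest store; the name sqft8 is
an assumption). Note that boxplot.stats() works with "hinges", which can
differ slightly from the quartiles returned by quantile().

```r
# Seven store sizes from the str() output above plus the largest store (33000)
sqft8 <- c(8300, 7900, 1100, 7600, 2900, 6500, 7100, 33000)

bs <- boxplot.stats(sqft8)   # uses the 1.5 * IQR rule by default (coef = 1.5)

# lower whisker end, lower hinge, median, upper hinge, upper whisker end
bs$stats

# observations that would be drawn as individual points (outliers)
bs$out
```

In this small example the whisker ends are actual observations (the
smallest and largest values that are not outliers), which is the behaviour
described in the paragraph above.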

3.1.5 Subset of Outliers


Figure 3.1 shows that there are two outliers. This means that there are two
stores whose size is much bigger than the rest of the stores. We can identify
these two stores by creating a subset that returns only the observations that
meet a certain condition (i.e., store size higher than the uWhisker4sQFt).
In the following example, we use an ifelse function to add a new variable
(sqftOutlier) to the data set. The variable takes two values: 1 if the store
size is below lWhisker4sQFt OR above uWhisker4sQFt; otherwise, its value
is 0.
Inside the parentheses of the ifelse statement there are three parts
(arguments) separated by commas.

1. The first part (dt1$SqFt < lWhisker4sQFt | dt1$SqFt > uWhisker4sQFt)
is our condition and it reads as follows: check to see if the value of
dt1$SqFt is below the lower whisker or (|) above the upper whisker (see
footnote 1).
2. The second part (1) says that if the condition is true, the new variable
(dt1$sqftOutlier) takes the value of 1.
3. The third part (0) says that if the condition is not true, the new variable
(dt1$sqftOutlier) takes the value of 0.

> dt1$sqftOutlier <-


ifelse(dt1$SqFt<lWhisker4sQFt|dt1$SqFt>uWhisker4sQFt,1,0)

We use the new variable to create and view the subset that returns only the
outliers (dt1[dt1$sqftOutlier == 1, ]) below.

> dt4outlier <- dt1[dt1$sqftOutlier==1,]


> dt4outlier

Store SqFt unitSold averagePrice revenue sqftOutlier


1: 49 33000 682382 15.49496 9002459 1
2: 66 20000 1140215 18.03583 18153040 1

Footnote 1: In R, the vertical bar (|) indicates OR and the ampersand (&)
indicates AND.

Therefore, stores 49 and 66 are the two very large stores (outliers) shown in
Figure 3.1.
Alternatively, we can simply use the condition from the function ifelse
to generate the subset.

> dt1[dt1$SqFt<lWhisker4sQFt|dt1$SqFt>uWhisker4sQFt,]

Store SqFt unitSold averagePrice revenue sqftOutlier


1: 49 33000 682382 15.49496 9002459 1
2: 66 20000 1140215 18.03583 18153040 1

The first method (creating a new dataset) is useful if there are a lot of
outliers/exceptions and we need to perform further statistical analysis to
understand patterns or common themes across the entire data set of outliers.
For example, we can use the new data set to analyze (generate summary
statistics) for the group of outliers. The second is more useful when dealing
with just a handful of observations, and a simple visual review would be
enough to see what is going on.
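
When a subset of outliers is long, it helps to sort it before reviewing it.
The sketch below shows one way to do this in base R on a small illustrative
data frame (seven stores from the str() output plus the two outlier stores;
the object names stores and big are assumptions made for this example).

```r
# Small illustrative data set: seven stores from the str() output above
# plus the two outlier stores reported in the text.
stores <- data.frame(
  Store = c(1, 2, 3, 4, 5, 6, 7, 49, 66),
  SqFt  = c(8300, 7900, 1100, 7600, 2900, 6500, 7100, 33000, 20000)
)

upper <- 19500   # upper whisker calculated in section 3.1.3

# Keep only the stores above the upper whisker
big <- stores[stores$SqFt > upper, ]

# Sort from largest to smallest and show at most the first ten rows
big <- big[order(big$SqFt, decreasing = TRUE), ]
head(big, 10)
```

Because head() returns all rows when there are fewer than ten, the same two
lines work whether the subset is short or long.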

3.1.6 Practice Problems: Sales by Store


Use data from the table salesByStore.csv to answer the following questions
(see footnote 2):

1. Generate summary statistics for units sold (unitSold)
2. Use the IQR to detect outliers in unitSold
3. Use the Boxplot to visually detect outliers in unitSold
4. If there are more than ten outliers, show the top ten. Otherwise,
show all of them.

3.1.7 Frequency Tables for Categorical Data


Bibitor stores sell over ten thousand different products, which are classified
as either liquor or wine. Liquor is recorded as 1 and wine as 2. In the
following example, we will create frequency tables (counts and proportions)
in order to find how many, and what proportion, of the products sold in
fiscal year 2016 were liquor and how many were wine.
For this analysis, we will load the file salesByProduct.csv, which contains
sales data, as well as the classification, for each one of the products sold
in Bibitor stores in fiscal year 2016.

Footnote 2: See p. 61 for the solution to these practice problems.

> dt2 <- fread("salesByProduct.csv")


> str(dt2)

Classes 'data.table' and 'data.frame': 10473 obs. of 6 variables:


$ Brand : int 58 60 61 62 63 68 70 72 75 77 ...
$ Description : chr "Gekkeikan Black & Gold Sake"
"Canadian Club 1858 VAP" "Margaritaville Silver" ...
$ Classification: int 1 1 1 1 1 1 1 1 1 1 ...
$ unitSold : int 3163 1931 281 2997 2498 1 8 492 40 ...
$ averagePrice : num 13 10.6 14 38.3 40.3 ...
$ revenue : num 41087 20498 3931 114704 100667 ...

The data set has 10473 observations (one line/observation for each unique
product that Bibitor sells) and 6 variables. The variable Brand is a unique
identifier for each product. Description is the name of the product. Classifi-
cation captures the two categories of products sold (1=liquor, 2=wine). The
remaining variables are units sold (unitSold), average price (averagePrice),
and sales (revenue) for each product.
We can generate the product count (frequency of products) based on their
classification as liquor or wine, using the function table() and specifying the
variable we want to analyze (Classification).

> table(dt2$Classification)

1 2
3182 7291

The results show that there are 3182 liquors and 7291 wines. If we wrap
the function table inside the function prop.table, the output is
expressed as a proportion of total observations.

> prop.table(table(dt2$Classification))

1 2
0.3038289 0.6961711

Therefore, 30.4% of products are liquors and 69.6% are wines.
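
The codes 1 and 2 are easy to misread in a table. One option is to attach
labels with factor() before counting. The sketch below rebuilds the
Classification column from the counts reported above instead of reading the
file; the names classification, product_type, counts, and pct are
assumptions made for this illustration.

```r
# Rebuild the Classification column from the counts reported above
classification <- rep(c(1, 2), times = c(3182, 7291))

# Attach readable labels to the numeric codes
product_type <- factor(classification, levels = c(1, 2),
                       labels = c("liquor", "wine"))

counts <- table(product_type)
pct    <- round(100 * prop.table(counts), 1)   # percentages, one decimal

counts
pct
```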

3.1.8 Practice Problems


Use data from the table salesByProduct.csv to answer the following questions:

1. Create a data set that has only liquors and name it dt_liquor (see
footnote 3).
(a) Are there outliers in units sold (unitSold)?
(b) How many?
(c) If there are more than ten, show the top ten.
(d) Are there outliers in the average price (averagePrice)?
(e) How many?
(f) If there are more than ten, show the top ten.
(g) Are there outliers in sales (revenue)?
(h) How many?
(i) If there are more than ten, show the top ten.
2. Create a data set that has only wines and name it dt_wine. Perform
the same analysis as above for outliers in units sold, average price, and
revenue.
3. Summarize the main points of your analysis and their implications for
the management of Bibitor

3.2 Stock Market Data


Following the approach shown in Chapter 2 (see section 2.2), we load and
review weekly stock market data for Amazon and S&P500 from 2006 to 2016.
Our focus is on adjusted closing price.

> options(scipen=999)
> library(quantmod)
> Sys.setenv(TZ = "UTC")
> getSymbols(c("AMZN", "^GSPC"), src="yahoo",
periodicity="weekly",
from=as.Date("2006-01-01"), to=as.Date("2016-12-31"))

[1] "AMZN" "GSPC"

> names(AMZN)

[1] "AMZN.Open" "AMZN.High" "AMZN.Low"


[4] "AMZN.Close" "AMZN.Volume" "AMZN.Adjusted"

> names(GSPC)

[1] "GSPC.Open" "GSPC.High" "GSPC.Low"


[4] "GSPC.Close" "GSPC.Volume" "GSPC.Adjusted"

Footnote 3: Hint: Use the square bracket and specify that classification
is 1.

> dt1 <- cbind(AMZN[,6], GSPC[,6])


> head(dt1,3); tail(dt1,3)

AMZN.Adjusted GSPC.Adjusted
2006-01-01 47.87 1285.45
2006-01-08 44.40 1287.61
2006-01-15 43.92 1261.49

AMZN.Adjusted GSPC.Adjusted
2016-12-11 757.77 2258.07
2016-12-18 760.59 2263.79
2016-12-25 749.87 2238.83

> names(dt1)[1:2] <- c("AMZN", "SP500")


> names(dt1)

[1] "AMZN" "SP500"

If you have any questions related to the above commands and/or output,
please review section 2.2.

3.2.1 Stock Returns


We use the function head() to review the top observations (weeks) of the
combined data set (dt1 ). As we can see below, during the first two weeks of
2006, the price of Amazon went from 47.87 to 44.40, and the SP500 index
went from 1285.45 to 1287.61.

> head(dt1)

AMZN SP500
2006-01-01 47.87 1285.45
2006-01-08 44.40 1287.61
2006-01-15 43.92 1261.49
2006-01-22 45.22 1283.72
2006-01-29 38.33 1264.03
2006-02-05 38.52 1266.99

We can use these values to calculate the rate of return using the following
formula:

$$\text{return}_t = \frac{\text{Price}_t - \text{Price}_{t-1}}{\text{Price}_{t-1}} \tag{3.1}$$

For example, if the current week is 2006-01-08, the current week's price is
$\text{Price}_t = 44.40$, last week's price is $\text{Price}_{t-1} = 47.87$,
and the return is
$\text{return}_t = \frac{44.40 - 47.87}{47.87} = -0.072$.
This means that the Amazon stock had a return of -7.2%; in other words, the
Amazon price dropped by 7.2%.
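
The worked example can be checked directly in R, and the same formula can
be applied to a whole vector of prices at once. The sketch below uses the
first four Amazon prices from the output above in a plain numeric vector
(not the xts object used in the chapter); the names price and ret are
assumptions made for this illustration.

```r
# First four weekly adjusted prices of Amazon, from the head(dt1) output above
price <- c(47.87, 44.40, 43.92, 45.22)

# Single worked example: return for the week of 2006-01-08
(44.40 - 47.87) / 47.87

# Vectorized version: diff() gives Price_t - Price_{t-1},
# head(price, -1) drops the last element and so gives Price_{t-1}
ret <- diff(price) / head(price, -1)
round(ret, 4)
```

The result has one element fewer than the price vector, because no return
can be computed for the first week.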
In order to calculate market returns, we need to have the current price
($\text{Price}_t$) as well as the previous (last week's) price
($\text{Price}_{t-1}$). In R, we can create a new variable that shows the
previous price, using the function lag(). The function takes two arguments:
the variable and the number of lags. Typically, we don't specify the number
of periods/weeks we want to go back (number of lags); R then uses the
default value, which is one period/week. Because dt1 is an xts object here,
lag() places each price in the row of the following week, as the output
below shows.

> dt1$lagAMZN <- lag(dt1$AMZN)


> dt1$lagSP500 <- lag(dt1$SP500)
> head(dt1)

AMZN SP500 lagAMZN lagSP500


2006-01-01 47.87 1285.45 NA NA
2006-01-08 44.40 1287.61 47.87 1285.45
2006-01-15 43.92 1261.49 44.40 1287.61
2006-01-22 45.22 1283.72 43.92 1261.49
2006-01-29 38.33 1264.03 45.22 1283.72
2006-02-05 38.52 1266.99 38.33 1264.03

Based on the above output, we can see that during the first week of 2006
(2006-01-01), the current price ($\text{Price}_t$) of Amazon was 47.87
(AMZN=47.87) and the previous price ($\text{Price}_{t-1}$) was NA (i.e., not
available). During the second week of 2006 (2006-01-08), the current AMZN
price was 44.40, and the previous week's price (lagAMZN) was 47.87.
Using formula 3.1, we calculate the rate of return for Amazon (rtrnAMZN)
and SP500 (rtrnSP500) as follows:

> dt1$rtrnAMZN <- (dt1$AMZN - dt1$lagAMZN)/dt1$lagAMZN


> dt1$rtrnSP500 <- (dt1$SP500 - dt1$lagSP500)/dt1$lagSP500
> head(dt1)

AMZN SP500 lagAMZN lagSP500 rtrnAMZN


2006-01-01 47.87 1285.45 NA NA NA
2006-01-08 44.40 1287.61 47.87 1285.45 -0.07248793
2006-01-15 43.92 1261.49 44.40 1287.61 -0.01081090
2006-01-22 45.22 1283.72 43.92 1261.49 0.02959934
2006-01-29 38.33 1264.03 45.22 1283.72 -0.15236618
2006-02-05 38.52 1266.99 38.33 1264.03 0.00495690


rtrnSP500
2006-01-01 NA
2006-01-08 0.001680372
2006-01-15 -0.020285642
2006-01-22 0.017622003
2006-01-29 -0.015338191
2006-02-05 0.002341686

Notice that, as a result of using the function lag, our data set has missing
values in the first observation (week). This happens because our data set
does not contain the last week of December 2005. To generate the one
period/week lag for the first week of January 2006, we would need the price
from the last week of December 2005. Since this piece of information is not
available, R returns a missing value (NA). We can eliminate the rows with
missing values from our data set (dt1) using the function na.omit as follows:
> dt1 <- na.omit(dt1)
> head(dt1,2)

AMZN SP500 lagAMZN lagSP500 rtrnAMZN


2006-01-08 44.40 1287.61 47.87 1285.45 -0.07248793
2006-01-15 43.92 1261.49 44.40 1287.61 -0.01081090

rtrnSP500
2006-01-08 0.001680372
2006-01-15 -0.020285642
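
The effect of na.omit can be seen on a small plain data frame. The sketch
below builds the lag by hand, because the base R lag() function does not
shift the values of an ordinary numeric vector the way it does for the xts
object used above. The object names toy and toy_clean are assumptions made
for this illustration.

```r
# Plain data frame with the first four weekly Amazon prices
toy <- data.frame(AMZN = c(47.87, 44.40, 43.92, 45.22))

# Manual one-period lag: put NA first and drop the last price
toy$lagAMZN  <- c(NA, head(toy$AMZN, -1))
toy$rtrnAMZN <- (toy$AMZN - toy$lagAMZN) / toy$lagAMZN

nrow(toy)                    # 4 rows, the first one contains NA
toy_clean <- na.omit(toy)    # drops every row that has an NA in any column
nrow(toy_clean)              # 3 rows remain
```

Since na.omit removes a row if any of its columns is missing, it is worth
checking how many rows are lost when a data set has missing values for
other reasons as well.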

3.2.2 Summary Statistics: Stock Returns


We use the summary function and specify that we want summary statistics
for the rate of return for Amazon (rtrnAMZN) and SP500 (rtrnSP500) as
follows:
> summary(dt1[,5:6])

Index rtrnAMZN rtrnSP500


Min. :2006-01-08 Min. :-0.181380 Min. :-0.181955
1st Qu.:2008-10-05 1st Qu.:-0.024144 1st Qu.:-0.009933
Median :2011-07-03 Median : 0.003721 Median : 0.002027
Mean :2011-07-03 Mean : 0.006214 Mean : 0.001283
3rd Qu.:2014-03-30 3rd Qu.: 0.033160 3rd Qu.: 0.014131
Max. :2016-12-25 Max. : 0.392658 Max. : 0.120258

Practice Problems Review the above results and answer the following
questions:

1. Compare the average weekly return of Amazon to SP500. Which one
generated higher returns?
2. Compare the range of returns (max-min) of Amazon and SP500. Which
one seems to be more volatile (have a wider range)?
3. In the stock market there is a saying that higher risks mean higher
returns. If range is a proxy for risk, is this saying supported by the
above results? (See footnote 4.)

3.2.3 Detecting Outliers: Stock Returns


Using the IQR we can detect outliers in the Amazon stock returns as follows:

> lwrWhiskerAMZN <-


quantile(dt1$rtrnAMZN, .25)-1.5*IQR(dt1$rtrnAMZN)
> lwrWhiskerAMZN

25%
-0.1101005

> uprWhiskerAMZN <-


quantile(dt1$rtrnAMZN, .75)+1.5*IQR(dt1$rtrnAMZN)
> uprWhiskerAMZN

75%
0.1191161

Based on the above, we see that weekly Amazon stock returns which are
below -11% or above 11.9% are outliers.
Using the same approach, we find that SP500 returns which are below
-4.6% or above 5% are outliers.

> lwrWhiskerSP500 <-


quantile(dt1$rtrnSP500, .25)-1.5*IQR(dt1$rtrnSP500)
> lwrWhiskerSP500

25%
-0.04602874

Footnote 4: Hint: The saying is supported if the stock/index that has the
widest range generates on average higher returns.

> uprWhiskerSP500 <-


quantile(dt1$rtrnSP500, .75)+1.5*IQR(dt1$rtrnSP500)
> uprWhiskerSP500

75%
0.0502261
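
The whisker calculation is repeated four times above with only the variable
changing, so it can be wrapped in a small function. The helper below,
iqr_cutoffs, is hypothetical (it is not part of the original script) and is
shown on a toy vector whose quartiles are easy to verify by hand.

```r
# Hypothetical helper (not part of the original script): returns the
# lower and upper cut-offs used above for a numeric vector x.
iqr_cutoffs <- function(x, k = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  spread <- q[2] - q[1]                 # interquartile range
  c(lower = q[1] - k * spread, upper = q[2] + k * spread)
}

# Toy example: for the values 1 to 9 the quartiles are 3 and 7,
# so the cut-offs are 3 - 1.5 * 4 and 7 + 1.5 * 4
cutoffs <- iqr_cutoffs(1:9)
cutoffs
```

Applied to the chapter's data, a call such as iqr_cutoffs(dt1$rtrnAMZN)
would be expected to reproduce the two Amazon whiskers shown above, although
that call is not run here.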

Practice Problems Based on the above analysis and summary statistics:

1. Do you expect to find observations (weeks) in the Amazon data that
have produced extremely high or extremely low returns (i.e., outliers)?
Explain why. (See footnote 5.)
2. Do you expect to find observations (weeks) in the SP500 data that
have produced extremely high or extremely low returns (i.e., outliers)?
Explain why.
3. Take a close look at the time period of this analysis. Can you speculate
on what time period SP500 had its worst performance (-18.2%)?

3.2.4 Identifying Negative Outliers in Stock Returns


Based on the summary statistics, we know that during its worst week the
Amazon stock dropped by 18.1% (a return of -18.1%). During its worst week
SP500 dropped by 18.2%. As an investor, you may want to identify and
analyze the time period (week) when these returns happened. We can use the
method shown in section 2.1.3 to focus on the observation that corresponds
to the week when the return had its minimum value as follows:

> dt1[dt1$rtrnAMZN==min(dt1$rtrnAMZN),]

AMZN SP500 lagAMZN lagSP500 rtrnAMZN rtrnSP500


2006-07-23 27.17 1278.55 33.19 1240.29 -0.1813799 0.03084763

> dt1[dt1$rtrnSP500==min(dt1$rtrnSP500),]

AMZN SP500 lagAMZN lagSP500 rtrnAMZN rtrnSP500


2008-10-05 56.25 899.22 67 1099.23 -0.1604478 -0.1819547

From the above results, we can see that within the period 2006-2016, SP500
had its worst weekly return in the week of 2008-10-05 and Amazon had its
worst weekly return in the week of 2006-07-23.
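
An alternative to comparing each value with min() is the base R function
which.min(), which returns the position of the (first) smallest value. The
sketch below applies it to the first five Amazon returns shown in section
3.2.1, stored in plain vectors; the names week, rtrn, and worst are
assumptions made for this illustration.

```r
# First five weekly Amazon returns from the output in section 3.2.1
week <- as.Date(c("2006-01-08", "2006-01-15", "2006-01-22",
                  "2006-01-29", "2006-02-05"))
rtrn <- c(-0.07248793, -0.01081090, 0.02959934, -0.15236618, 0.00495690)

worst <- which.min(rtrn)   # position of the smallest return
week[worst]                # the week in which it occurred
rtrn[worst]                # the return itself
```

Within these five weeks the lowest return falls in the week of 2006-01-29;
over the full sample the subsets above identify the weeks of 2006-07-23 and
2008-10-05.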

Footnote 5: Hint: compare the min and max returns to the lower and upper
whisker respectively.

Practice Question Search the web and try to find information about these
two periods in order to answer the following questions.
1. What happened around the week of 2008-10-05? Does it make sense
that both SP500 and Amazon had such a steep drop in the same week?
Explain why.
2. Try to find information related to Amazon around the period of
2006-07-23. Does it make sense that Amazon experienced such a steep drop
but not SP500? Explain why. (See footnote 6.)

3.2.5 Categorical Variable for Stock Returns


We can convert a continuous variable into a categorical one by introducing
one or more thresholds. For example, at least theoretically, the variable
rtrnAMZN can take any value from -1 (the price falls to zero) upward, with
no upper limit. From an investment standpoint, there are three ranges that
matter: whether the returns are positive, negative, or zero. Alternatively,
we can say that the direction in which the stock moved was up, down, or
unchanged. As we have seen in section 3.1.5 (p. 43), we can create a new
variable based on certain conditions using an ifelse() statement.
Before we create our categorical variables, we will need to convert the
class (structure) of the stock market data from xts to data.frame. We
create a new data set (dt2) that has only the stock returns (5th and 6th
variables), converted with as.data.frame, as follows:

> dt2 <- as.data.frame(dt1[,5:6])

The typical ifelse() statement has two categories. However, in our case we
have three (up, down, unchanged). To achieve this, we nest a second
ifelse() statement in the last argument of the first ifelse() statement
(the value used when the first condition is false), as follows:
> dt2$directionAMZN <- ifelse(dt2$rtrnAMZN>0, "AMZN_up",
ifelse(dt2$rtrnAMZN==0,"AMZN_unchanged","AMZN_down"))

Footnote 6: Hint: Amazon introduced its cloud computing services (AWS) in
the summer of 2006. If you want to find out about the market reaction to
Bezos' decision to introduce cloud computing, read the following article:
https://www.bloomberg.com/news/articles/2006-11-12/jeff-bezos-risky-bet
We will learn about the firm and market effects of emerging technologies,
such as cloud computing, in the Introduction to Information Systems
(AFM241).

Notice that with the first ifelse() statement we test whether the return was
positive. If the answer is yes, we assign the value AMZN_up. However, if the
answer is no, we have two more conditions to explore: was the return zero or
negative? We test them by introducing the second ifelse() statement, which
tests whether the return was zero. If the answer is yes, we assign the value
AMZN_unchanged. If the answer is no, and given the fact that we already
know that the return is not positive, we assign the value AMZN_down.
We repeat the same approach for the SP500 index.

> dt2$directionSP500 <- ifelse(dt2$rtrnSP500>0, "SP500_up",


ifelse(dt2$rtrnSP500==0,"SP500_unchanged","SP500_down"))

We review the first few observations using the function head() below.

> head(dt2)

rtrnAMZN rtrnSP500 directionAMZN directionSP500


2006-01-08 -0.07248793 0.001680372 AMZN_down SP500_up
2006-01-15 -0.01081090 -0.020285642 AMZN_down SP500_down
2006-01-22 0.02959934 0.017622003 AMZN_up SP500_up
2006-01-29 -0.15236618 -0.015338191 AMZN_down SP500_down
2006-02-05 0.00495690 0.002341686 AMZN_up SP500_up
2006-02-12 0.01739354 0.015982762 AMZN_up SP500_up
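
The nested ifelse() logic can be tried on a tiny vector that contains one
return of each kind. The sketch below also shows an alternative based on
sign(), which is offered only as an illustration and is not the approach
used in the chapter; all object names are assumptions.

```r
# Three illustrative returns: negative, exactly zero, positive
r <- c(-0.07, 0, 0.03)

# Nested ifelse(), as used above
direction_nested <- ifelse(r > 0, "up",
                           ifelse(r == 0, "unchanged", "down"))

# Alternative: sign() returns -1, 0 or 1, which we shift to positions 1, 2, 3
direction_labels <- c("down", "unchanged", "up")
direction_sign   <- direction_labels[sign(r) + 2]

direction_nested
direction_sign
```

Both versions give the same labels. A category such as "unchanged" only
shows up in a frequency table if at least one observation falls into it.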

3.2.6 Frequency Tables for Categorical Stock Returns


One Way Frequency Table As we have seen in section 3.1.7 (p. 44) we
can use the function table to create the frequency count for categorical data.

> table(dt2$directionSP500)

SP500_down SP500_up
252 321

> table(dt2$directionAMZN)

AMZN_down AMZN_up
270 303

Practice Problem Interpret and contrast the results from these two tables.

Two Way Frequency Table In order to create a two-way table, we add one
more argument (i.e., specify a second variable/factor) inside the function
table(). The sequence of these arguments matters. The first argument
specifies the rows, and the second one specifies the columns of the two-way
table. With the following command, we specify that we want to create a
two-way table that shows the possible directions of the SP500 index as
rows, and the directions of Amazon stock as columns.
> table(dt2$directionSP500,dt2$directionAMZN)
AMZN_down AMZN_up
SP500_down 178 74
SP500_up 92 229
As we can see, the most frequent outcome (229 of 573 weeks, about 40%) is
that both Amazon and SP500 moved up. The second most frequent outcome
(178 weeks, about 31%) is that both Amazon and SP500 moved down.
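
A two-way table can also be read row by row, which answers questions such
as "in the weeks when SP500 went up, how often did Amazon go up too?". The
sketch below rebuilds the table from the counts shown above and uses the
margin argument of prop.table(); the object names tab and row_prop are
assumptions made for this illustration.

```r
# Rebuild the two-way table from the counts shown above
# (matrix() fills column by column: AMZN_down first, then AMZN_up)
tab <- matrix(c(178, 92, 74, 229), nrow = 2,
              dimnames = list(SP500 = c("SP500_down", "SP500_up"),
                              AMZN  = c("AMZN_down", "AMZN_up")))
tab <- as.table(tab)

# Row proportions: each row sums to 1
row_prop <- prop.table(tab, margin = 1)
round(row_prop, 3)
```

Each row now describes the distribution of Amazon's direction given the
direction of SP500 in that week.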

Practice problem Follow the directions shown in section 3.1.7 to create
relative frequencies.

3.2.7 Practice Problems: Do Stock Returns Have Memory?
The objective in this exercise is to create a two-way table that lets us generate
frequencies of the direction of Amazon stock movement this week (up or
down) and the direction last week. Continue working with the R script used
to create the stock market data for this chapter (see p. 62 and footnote 7).
1. In the data set dt1 add a new variable (lag_rtrnAMZN) that captures
Amazon returns with one period lag.
2. Remove missing values from dt1.
3. Create a new data set (dt3) that has current and lagged stock return
for Amazon. Make sure to convert it with as.data.frame.
4. Create a new categorical variable in dt3, called directionAMZN that
captures the direction of Amazon stock based on rtrnAMZN.
5. Create a new categorical variable in dt3, called lag_directionAMZN
that captures the direction of Amazon stock based on lag_rtrnAMZN.
6. Create a frequency table that has lag_directionAMZN in rows, and
directionAMZN in columns.
7. Interpret the results. Hint: Use intersection or conditional probabilities
to interpret your results.

Footnote 7: The solution/R script for this problem is on p. 62.

3.3 Financial Accounting Data


We load and review financial accounting data from 2010 to 2016 for all
publicly traded firms competing in the same industry as Apple. The approach
shown below is the same as the one followed in sections 2.3.2 - 2.3.3.

> # set working directory to source file location


> library(data.table)
> dt <- fread("industryAAPL_2010_2016.csv")
> names(dt)

[1] "gvkey" "datadate" "fyear" "tic" "conm"


[6] "at" "cogs" "ib" "oibdp" "sale"
[11] "loc" "naics" "sic"

> nrow(dt)

[1] 441

> dt <- dt[dt$sale>1,]


> nrow(dt)

[1] 396

If you have any questions related to the above commands and/or output
please review sections 2.3.2 - 2.3.3.
For the rest of the analysis we will focus on a single year (2010).

> dt_2010 <- dt[dt$fyear==2010,]

3.3.1 Summary Statistics & Outliers: Profits


Summary statistics Our first objective will be to generate summary
statistics for two profitability measures: income before extraordinary items
(IB), and operating income before depreciation and amortization (OIBDP).
These two variables are the 8th and 9th in the data set; therefore, we can
generate summary statistics as follows:

> summary(dt_2010[,8:9])


ib oibdp
Min. : -874.000 Min. : -56.546
1st Qu.: -4.081 1st Qu.: 0.222
Median : 1.008 Median : 4.555
Mean : 262.069 Mean : 490.497
3rd Qu.: 14.468 3rd Qu.: 34.996
Max. :14013.000 Max. :19317.000
As expected, the above results show that there is a wide spectrum of values
for both profitability measures. Clearly, the max ib and oibdp, which belong
to Apple (see p. 31), are outliers.

Mental Math We can use mental math to validate that the max ib is an
outlier as follows: First, round Q1 and Q3 very generously. So Q1 is around
-5 and Q3 is around 15. This means that IQR is around 20 (15 − (−5) = 20)
and half of it is around 10. Therefore 1.5 ∗ IQR is around 30. Based on
these values, the upper whisker is 15 + 30 = 45, which is well below the max
(14013). Thus we can conclude that the max value is an outlier in this data
set.
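
The mental-math estimate can be compared with the exact calculation, using
only the quartiles and the maximum printed by summary(). This is a minimal
sketch; the variable names are chosen for this illustration only.

```r
# Quartiles and maximum of ib, taken from the summary() output above
q1     <- -4.081
q3     <- 14.468
max_ib <- 14013

iqr   <- q3 - q1              # 18.549
upper <- q3 + 1.5 * iqr       # exact upper whisker

upper
max_ib > upper                # TRUE: the maximum is an outlier
```

The exact upper whisker is close to the rounded estimate of 45, so the
generous rounding does not change the conclusion.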

Categorical variables & One Way Frequency Tables As we have seen from
the summary statistics, there seem to be a lot of firms that have negative
IB and/or OIBDP. Establishing a threshold of zero profits for OIBDP, we
can create a categorical variable that lets us distinguish between firms
that have profits or losses using an ifelse statement as follows:
> dt_2010$status_OIBDP <-
ifelse(dt_2010$oibdp>=0, "profits_OIBDP", "losses_OIBDP")
According to the results shown below, more than 22% of the firms in the
industry have reported negative OIBDP.
> prop.table(table(dt_2010$status_OIBDP))
losses_OIBDP profits_OIBDP
0.2253521 0.7746479
Using the same approach we create the categorical variable to classify
firms that have negative IB. As we can see below, the percentage of firms
that have reported negative IB is almost 44%.
> dt_2010$status_IB <-
ifelse(dt_2010$ib>=0, "profits_IB", "losses_IB")
> prop.table(table(dt_2010$status_IB))


losses_IB profits_IB
0.4366197 0.5633803
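The same ifelse() and prop.table() pattern can be tried on a small, made-up data set before applying it to the full file. This is only a sketch: the data frame toy and its values are hypothetical and are not taken from the Compustat data.

```r
# Hypothetical operating profits for five firms (not from the data set)
toy <- data.frame(firm  = c("A", "B", "C", "D", "E"),
                  oibdp = c(12.5, -3.1, 0, 48.0, -0.7))

# Zero or positive counts as profits, negative as losses
toy$status <- ifelse(toy$oibdp >= 0, "profits", "losses")

counts <- table(toy$status)    # one-way frequency table
shares <- prop.table(counts)   # relative frequencies (they sum to 1)
round(100 * shares, 1)         # the same shares expressed as percentages
```

Note that with the >= comparison a firm with exactly zero profit is placed in the profits group; using > instead would move it to the losses group.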

3.3.2 Summary Statistics & Outliers: Financial Ratios


As we have seen, Apple generates more profits than other firms. This was
expected because Apple is the largest firm in this industry. However, this
does not necessarily mean that Apple is more profitable than its competitors.
When we want to compare firms of different size, we generate profitability
ratios that report profits per dollar of sales (profit margin) or per dollar of
assets (ROA).

Profit margin We can calculate the three different versions of profit margin
(2.1-2.3) for all firms in the industry as follows:
> dt_2010$gm <- (dt_2010$sale-dt_2010$cogs)/dt_2010$sale
> dt_2010$om <- dt_2010$oibdp/dt_2010$sale
> dt_2010$pm <- dt_2010$ib/dt_2010$sale
> names(dt_2010)
[1] "gvkey" "datadate" "fyear" "tic"
[5] "conm" "at" "cogs" "ib"
[9] "oibdp" "sale" "loc" "naics"
[13] "sic" "status_OIBDP" "status_IB" "gm"
[17] "om" "pm"

Summary statistics Since the three versions of profit margin are variables
16-18, we can generate summary statistics as follows:
> summary(dt_2010[,16:18])
gm om pm
Min. :-0.01508 Min. :-1.341778 Min. :-2.17918
1st Qu.: 0.27485 1st Qu.: 0.007278 1st Qu.:-0.09114
Median : 0.40806 Median : 0.075490 Median : 0.01317
Mean : 0.40877 Mean : 0.015356 Mean :-0.10142
3rd Qu.: 0.50625 3rd Qu.: 0.134599 3rd Qu.: 0.05769
Max. : 0.78430 Max. : 0.506473 Max. : 0.80502
Using mental math, we can quickly come to the conclusion that the third
version of profit margin (pm) is the one that seems to have the most extreme
outliers. We are interested in these outliers because these are the firms that
do extremely well or extremely poorly.


Outliers We use the IQR to identify outliers on pm as follows:


> lwrWhisker_pm <-
quantile(dt_2010$pm, .25)-1.5*IQR(dt_2010$pm)
> lwrWhisker_pm

25%
-0.3143745

> uprWhisker_pm <-
quantile(dt_2010$pm, .75)+1.5*IQR(dt_2010$pm)
> uprWhisker_pm

75%
0.2809268
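Because the whisker calculation is repeated for several variables in this chapter, it can be convenient to wrap it in a small function. The helper whiskers() below is hypothetical (it is not part of base R or of the course scripts), and the vector x holds made-up values; it is meant as a sketch of the same Q1 − 1.5·IQR and Q3 + 1.5·IQR rule used above.

```r
# Hypothetical helper: IQR-based whiskers for a numeric vector
whiskers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]                 # same value as IQR(x, na.rm = TRUE)
  c(lower = q[1] - k * spread, upper = q[2] + k * spread)
}

# Illustrative profit margins (not from the data set)
x <- c(-2.2, -0.09, 0.01, 0.02, 0.05, 0.06, 0.80)
w <- whiskers(x)
w
outliers <- x[x < w["lower"] | x > w["upper"]]   # values outside the whiskers
outliers
```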

Practice Problem Based on the above analysis and summary statistics,
explain why you expect to find outliers in the variable pm.

Relative Competitive Position In strategic management, we are interested
in a firm’s relative competitive position. A firm that outperforms its
competitors has a competitive advantage, while a firm that underperforms
has a competitive disadvantage. In the following example, we create a very
crude measure of a firm’s relative competitive position in terms of pm, by
creating three groups: first, firms that are extreme outliers (above the upper
whisker) and perform well above the rest of the industry; second, firms that
perform above the industry median but are not top outliers; third, firms that
perform at or below the industry median.
> dt_2010$relativePosition_pm <-
ifelse(dt_2010$pm>uprWhisker_pm, "topOutlier_pm",
ifelse(dt_2010$pm>median(dt_2010$pm),
"aboveMedian_pm", "belowMedian_pm"))
Using this new categorical variable (relativePosition_pm) we can generate a
frequency distribution within the industry as follows:

> table(dt_2010$relativePosition_pm)

aboveMedian_pm belowMedian_pm  topOutlier_pm
            33             36              2

> prop.table(table(dt_2010$relativePosition_pm))


aboveMedian_pm belowMedian_pm  topOutlier_pm
    0.46478873     0.50704225     0.02816901

Therefore, there are 2 firms (around 3%) that perform well above the rest
of the industry. There are 33 firms (around 46%) that perform above the
industry median, but below the top performers. The remaining 36 firms
(around 51%) perform at or below the industry median.
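The nested ifelse() used for the three groups can be traced on a short, made-up vector. This is a sketch under stated assumptions: the vector pm and the shortened labels are illustrative and do not come from the data set.

```r
# Illustrative profit margins (hypothetical values)
pm <- c(-0.30, -0.05, 0.00, 0.01, 0.02, 0.04, 0.06, 0.90)

upper <- quantile(pm, 0.75, names = FALSE) + 1.5 * IQR(pm)   # upper whisker
med   <- median(pm)

# The outer test is evaluated first, so a top outlier is never
# relabelled as "aboveMedian" by the inner ifelse()
position <- ifelse(pm > upper, "topOutlier",
                   ifelse(pm > med, "aboveMedian", "belowMedian"))

table(position)
```

Because both comparisons are strict (>), a firm whose margin equals the median exactly falls into the belowMedian group.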

Practice Problem Can you find the two top performing firms?

3.3.3 Practice Problems: Financial Ratios


The term outlier means different things to different people. For auditors,
outliers may be signals of potential fraud; for investors, outliers may be firms
that generate very high returns; for strategic analysts, outliers may be firms
that have a competitive advantage over their competitors. The objective
of this exercise is to focus on relatively large firms competing in the same
industry as Apple, and identify firms that outperform their competitors in
terms of ROA (operating margin).8

1. Load the data set industryAAPL_2010_2016.csv.
2. Create a new data set (dt1) that is limited to observations in fiscal year
2016.
3. Remove from dt1 companies that have sales or assets less than 10
million.9 How many observations (firms) are in your data set?
4. Create a new categorical variable (status_OIBDP) that takes the value
profits_OIBDP if the firm has operating profits (oibdp > 0), and the
value losses_OIBDP if the firm has operating losses. What percentage
of firms have operating profits, and what percentage have operating losses?
5. Calculate the ROA based on operating profits (ROA_om),10 and generate
summary statistics. What are the mean and median of ROA_om?
6. Using the dt1 data, calculate the lower whisker for ROA_om. Interpret
your finding.
7. Create a new data set (dt2) from which you have removed all firms
that have an ROA_om below the lower whisker that was calculated
above. How many firms are in your new data set (dt2)?
8. Generate summary statistics for ROA_om in dt2. What are the mean and
median? Compare them to the mean and median in step 5.
8 The solution (R script) for this problem is on p. 63.
9 You can create dt1 in one step.
10 See formula 2.5 (p. 29).


9. Using the dt2 data, calculate the upper whisker for ROA_om. Interpret
your finding.
10. Create a new categorical variable (relativePosition_ROA_om) that
takes the value topOutlier_ROA_om if the firm’s ROA_om is above
the upper whisker; aboveMedian_ROA_om if the firm’s ROA_om is
above the median, and belowMedian_ROA_om if the firm is below the
median.
11. How many firms are in the top group (topOutlier_ROA_om)? If there
are fewer than ten, list them.


3.4 Solutions to Selected Practice Problems


3.4.1 Sales by Store (p. 44)
> # set working directory to source file location
> library(data.table)
> dt1 <- fread("salesByStore.csv")
> summary(dt1$unitSold)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  33962  193070  324654  408908  469678 1623158

> lWhisker4UnitSold <-
quantile(dt1$unitSold,.25)-1.5*IQR(dt1$unitSold)
> lWhisker4UnitSold

25%
-221843.2

> upperWhisker <- quantile(dt1$unitSold,.75)+1.5*IQR(dt1$unitSold)


> upperWhisker

75%
884590.8

> boxplot(dt1$unitSold)
> dt1$units_Outlier <-
ifelse(dt1$unitSold<lWhisker4UnitSold|dt1$unitSold>upperWhisker,1,0)
> nrow(dt1[dt1$units_Outlier==1,])

[1] 8

> dt1[dt1$unitSold<lWhisker4UnitSold|dt1$unitSold>upperWhisker,]

Store SqFt unitSold averagePrice revenue sqftOutlier


1: 34 18400 1369606 17.27582 20730030 0
2: 38 14000 1337209 17.36106 19505303 0
3: 50 5800 899439 16.85767 13958967 0
4: 66 20000 1140215 18.03583 18153040 1
5: 67 9200 977455 17.78134 15487755 0
6: 69 4800 976492 16.87477 14516917 0
7: 73 15000 1415916 17.77670 22554592 0
8: 76 5400 1623158 18.08060 26064575 0


units_Outlier
1: 1
2: 1
3: 1
4: 1
5: 1
6: 1
7: 1
8: 1

3.4.2 Do Stock Returns Have Memory? (p. 54)


dt1$lag_rtrnAMZN <- lag(dt1$rtrnAMZN)
dt1 <- na.omit(dt1)
dt3 <- as.data.frame(dt1[,c(5,7)])
dt3$directionAMZN <- ifelse(dt3$rtrnAMZN>0, "AMZN_up",
ifelse(dt3$rtrnAMZN==0,"AMZN_unchanged","AMZN_down"))
dt3$lag_directionAMZN <- ifelse(dt3$lag_rtrnAMZN>0,
"lag_AMZN_up",
ifelse(dt3$lag_rtrnAMZN==0,"lag_AMZN_unchanged",
"lag_AMZN_down"))
head(dt3)
rtrnAMZN lag_rtrnAMZN directionAMZN
2006-01-15 -0.01081090 -0.07248793 AMZN_down
2006-01-22 0.02959934 -0.01081090 AMZN_up
2006-01-29 -0.15236618 0.02959934 AMZN_down
2006-02-05 0.00495690 -0.15236618 AMZN_up
2006-02-12 0.01739354 0.00495690 AMZN_up
2006-02-19 -0.02143407 0.01739354 AMZN_down
lag_directionAMZN
2006-01-15 lag_AMZN_down
2006-01-22 lag_AMZN_down
2006-01-29 lag_AMZN_up
2006-02-05 lag_AMZN_down
2006-02-12 lag_AMZN_up
2006-02-19 lag_AMZN_up
table(dt3$lag_directionAMZN,dt3$directionAMZN)
AMZN_down AMZN_up
lag_AMZN_down 129 140
lag_AMZN_up 140 163


3.4.3 Financial Ratios (p. 59)


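> library(data.table)
> dt <- fread("industryAAPL_2010_2016.csv")
> names(dt)

 [1] "gvkey" "datadate" "fyear" "tic" "conm"
 [6] "at" "cogs" "ib" "oibdp" "sale"
[11] "loc" "naics" "sic"

> # Note: with | a firm is kept when either assets or sales exceed 10.
> # To drop every firm with assets or sales below 10 (step 3), use & instead;
> # the counts reported below come from the | version shown here.
> dt1 <- dt[(dt$at>10 | dt$sale>10) & dt$fyear==2016,c(3:10)]
> nrow(dt1)

[1] 43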

> dt1$status_OIBDP <- ifelse(dt1$oibdp>=0, "profits_OIBDP",
"losses_OIBDP")
> prop.table(table(dt1$status_OIBDP))

 losses_OIBDP profits_OIBDP
    0.2325581     0.7674419

> dt1$ROA_om <- dt1$oibdp/dt1$at
> summary(dt1$ROA_om)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-0.57097  0.00303  0.06952  0.03512  0.11795  0.32426

> lwrWhisker_ROA_om <-
quantile(dt1$ROA_om, .25)-1.5*IQR(dt1$ROA_om)
> lwrWhisker_ROA_om

       25%
-0.1693562

> dt2 <- dt1[dt1$ROA_om>lwrWhisker_ROA_om,]
> nrow(dt2)

[1] 40

> summary(dt2$ROA_om)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-0.13370  0.01786  0.07532  0.07288  0.12324  0.32426


> names(dt2)

 [1] "fyear" "tic" "conm" "at"
 [5] "cogs" "ib" "oibdp" "sale"
 [9] "status_OIBDP" "ROA_om"

> uprWhisker_ROA_om <-
quantile(dt2$ROA_om, .75)+1.5*IQR(dt2$ROA_om)
> uprWhisker_ROA_om

     75%
0.281306

> dt2$relativePosition_ROA_om <-
ifelse(dt2$ROA_om>uprWhisker_ROA_om, "topOutlier_ROA_om",
ifelse(dt2$ROA_om>median(dt2$ROA_om),
"aboveMedian_ROA_om", "belowMedian_ROA_om"))
> names(dt2)

 [1] "fyear" "tic"
 [3] "conm" "at"
 [5] "cogs" "ib"
 [7] "oibdp" "sale"
 [9] "status_OIBDP" "ROA_om"
[11] "relativePosition_ROA_om"

> table(dt2$relativePosition_ROA_om)

aboveMedian_ROA_om belowMedian_ROA_om  topOutlier_ROA_om
                19                 20                  1

> prop.table(table(dt2$relativePosition_ROA_om))

aboveMedian_ROA_om belowMedian_ROA_om  topOutlier_ROA_om
             0.475              0.500              0.025

> dt2[dt2$relativePosition_ROA_om=="topOutlier_ROA_om",]

   fyear  tic                  conm      at    cogs      ib
1:  2016 UBNT UBIQUITI NETWORKS INC 748.051 335.306 213.616
    oibdp    sale  status_OIBDP   ROA_om relativePosition_ROA_om
1: 242.56 666.395 profits_OIBDP 0.324256       topOutlier_ROA_om



Chapter 4

Hypothesis Testing

Learning Objectives
By the end of this chapter students should have achieved the following learn-
ing objectives (know how to do the following):
1. Generate random samples using sample().
2. Run t-tests using t.test() and interpret output.
3. Select appropriate options/arguments when using t.test().
4. Append data sets using rbind and perform t-test on grouped data.
5. Generate detailed descriptive statistics using describe().

4.1 Management Accounting Data


In chapter 3 (p. 44) we saw that Bibitor stores sell over ten thousand different
products. These products are classified as either liquor (coded as 1) or wine
(coded as 2). More specifically, we know that there are 3182 different brands
of liquor and 7291 brands of wine. For planning and budgeting reasons, the
management of Bibitor wants answers to the following questions: On average
are the prices of liquor products higher than wine products? On average, do
we sell more bottles of wine or liquor? Are average revenues about the same
for liquor and wine products?

Important Note The management of Bibitor has access to the entire pop-
ulation of products sold. Therefore, we can answer these questions by simply
comparing the results of summary statistics. This means that there is no
need to take random samples and do hypothesis testing. However, learning
how to perform hypothesis testing is a critical stepping stone for understand-
ing how to evaluate more advanced statistical techniques, such as regression


analysis. Given the objective of this chapter, which is to introduce hypothesis
testing, we will pretend that we do not have access to the entire population of
product sales. We will take one random sample from liquor products and one
from wine products, and we will use these samples to perform the hypothesis
tests that answer the above questions.

4.1.1 Load and Review Data


For the rest of our analysis, we will set the working directory to source file
location and load the file that contains sales organized by product for fiscal
year 2016 (salesByProduct).

> library(data.table)
> dt1 <- fread("salesByProduct.csv")

Please recall from the discussion in section 3.1.7, that the file contains
sales data, as well as classification, for each one of the 10473 products sold
in Bibitor stores in fiscal year 2016.

Practice Problem: Review Chapter 3.

1. Create the table that shows frequency distribution of liquor and wine
brands.
2. Create the table that shows the relative frequency distribution (per-
centage) of liquor and wine brands.

Using the notation [,], we create a subset (dt1_L) for liquor products by
imposing the constraint that Classification equals 1, and a subset (dt1_W)
for wine products with the constraint that Classification equals 2.

> dt1_L <- dt1[dt1$Classification==1,]


> dt1_W <- dt1[dt1$Classification==2,]

Summary statistics for units sold, average price, and revenue for each
subset are shown below. As we can see, the average units sold is much higher
(6065.9) for liquor than for wine (1783). Similarly, the average price and
average revenue for liquor (36.47 and 88364, respectively) are higher than
the corresponding values for wine (30.97 and 21930.6, respectively).

> summary(dt1_L[, 4:6])


    unitSold          averagePrice         revenue
 Min.   :     1.0   Min.   :   0.49   Min.   :      1
 1st Qu.:    87.2   1st Qu.:  10.99   1st Qu.:   1899
 Median :   902.0   Median :  18.53   Median :  18203
 Mean   :  6065.9   Mean   :  36.47   Mean   :  88364
 3rd Qu.:  5431.2   3rd Qu.:  30.30   3rd Qu.:  78597
 Max.   :319248.0   Max.   :4999.99   Max.   :4634015

> summary(dt1_W[, 4:6])

    unitSold        averagePrice          revenue
 Min.   :     1   Min.   :   0.99   Min.   :      3.0
 1st Qu.:    24   1st Qu.:  10.29   1st Qu.:    551.4
 Median :   140   Median :  14.99   Median :   3220.1
 Mean   :  1783   Mean   :  30.97   Mean   :  21930.6
 3rd Qu.:  1176   3rd Qu.:  26.15   3rd Qu.:  16178.0
 Max.   :200150   Max.   :4111.99   Max.   :2183992.9

As mentioned above, for practical purposes these statistics, which are
based on the entire population, would have been enough to provide feedback
to the management of Bibitor. However, in order to show how hypothesis
testing is done, we will proceed with the creation of random samples and run
the hypothesis tests based on sample data.

4.1.2 Null and Alternative Hypothesis


The first question that the management of Bibitor wants answered is the
following: On average, is the price of liquor different from the price of
wine? As worded, the question does not indicate
an expectation that either category of products would be priced on average
higher than the other. This means that we can convert this question into a
two-sided hypothesis test, as follows:
$H_0: \mu_{liquor} = \mu_{wine}$
$H_1: \mu_{liquor} \neq \mu_{wine}$
where $\mu_{liquor}$ is the average population price of liquor, and $\mu_{wine}$ the
average population price of wine.

4.1.3 Create and Review Random Samples


For the testing of this hypothesis, we will need a random sample of liquor
brands and another for wines. Let’s say that we want 25 random observations


from the liquor data and another 25 from the wine data. A manual way of
creating a random sample for liquor would be to write the row number of
each row (observation) in the liquor data set on a small piece of paper.
The data set has 3182 rows. Put all 3182 pieces of paper in a box,
shake it well, and draw 25 pieces. The numbers on these 25 pieces
are the row numbers of the 25 observations that make up our random
sample.
The function sample() in R achieves the same effect. In the form used here,
the function takes two arguments: 1) the vector of values from which to draw
(here, the row numbers of the data set), and 2) the number of values to draw.
By default it samples without replacement, so no row number is drawn twice.
For example, if we want
to take a random sample of 25 row numbers from the data set of liquor
products (dt1_L) we can write this as follows:

> sample(1:nrow(dt1_L),25)

The term 1:nrow(dt1_L) is the equivalent of writing the row numbers on
pieces of paper. The term sample(,25) is the equivalent of removing 25
pieces of paper from the box. The resulting numbers are shown below.

[1] 988 246 175 831 1991 214 1157 2394 752 546 619 2753
[13] 2999 695 2680 2819 326 1145 2459 1549 3113 18 1580 1027
[25] 1069

This means that the random sample is made from observations (row numbers)
988, 246, 175, etc.
If we were to re-run the same command, it would produce a different
random sample.

> sample(1:nrow(dt1_L),25)

[1] 1826 2498 595 1429 2980 2800 2399 468 603 1575 1289 2094
[13] 1992 2573 1124 1172 1913 2321 2148 2932 3020 1864 2791 160
[25] 2199

Our second random sample is made of row numbers 1826, 2498, 595, etc. If
you run exactly the same code on your computer, you will get a different
set of numbers. This is similar to drawing from the box that contains the
pieces of paper: every time we draw, we get different results.


Generate Replicable Results There is a way to force R to
generate the same random sample on two different computers. We can do
this by using the function set.seed() and entering a number inside the
parentheses. Any number can be chosen; it serves as the starting
point for the calculation of the set of random numbers. As long
as the same seed number has been entered on both computers (and both use
the same random number generator settings), both
machines will produce the same random sample. In the example below,
and for the rest of our analysis, we use the seed number 123.
> set.seed(123)
> sample(1:nrow(dt1_L),25)
[1] 916 2508 1301 2808 2989 145 1678 2834 1751 1449 3036 1438
[13] 2148 1815 327 2850 780 134 1038 3020 2813 2190 2025 3141
[25] 2071
If we were to repeat this, we would get the same 25 observations.
> set.seed(123)
> sample(1:nrow(dt1_L),25)
[1] 916 2508 1301 2808 2989 145 1678 2834 1751 1449 3036 1438
[13] 2148 1815 327 2850 780 134 1038 3020 2813 2190 2025 3141
[25] 2071
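The effect of resetting the seed can also be checked programmatically rather than by comparing the printed numbers by eye. The sketch below uses illustrative object names (first_draw, second_draw, third_draw) and the population size 3182 quoted above.

```r
set.seed(123)
first_draw <- sample(1:3182, 25)

set.seed(123)                        # reset the generator to the same state
second_draw <- sample(1:3182, 25)

identical(first_draw, second_draw)   # same seed, same sample

third_draw <- sample(1:3182, 25)     # no reset: the random stream simply continues
identical(first_draw, third_draw)
```

One caveat, stated with some caution: R 3.6.0 changed the default algorithm behind sample(), so a current R installation will generally not reproduce the row numbers printed above even with seed 123. Calling RNGkind(sample.kind = "Rounding") before set.seed() should restore the older behaviour (R issues a warning when this option is used).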

Random Sample: Liquor


Using the subset notation [,], we create a random sample (dt1_L_rs) for
liquor products by keeping only the 25 rows whose row numbers were drawn
at random. We can do this by placing the function call
sample(1:nrow(dt1_L),25) before the comma in [,].
> set.seed(123)
> dt1_L_rs <- dt1_L[sample(1:nrow(dt1_L),25),]
> head(dt1_L_rs, 3)
Brand Description Classification unitSold
1: 2199 St George Terrior Gin 1 1491
2: 5357 Campari Negroni 1 58
3: 2910 Full Throttle Vanila Whiskey 1 347
averagePrice revenue
1: 30.27125 45108.09
2: 34.99000 2029.42
3: 19.99000 6936.53


> tail(dt1_L_rs, 3)
Brand Description Classification unitSold
1: 4054 Patron XO Cafe Liqueur 1 3020
2: 8985 J Roget Spumante 1 8939
3: 4167 Shellback Silver Rum 1 23
averagePrice revenue
1: 13.24544 39981.80
2: 5.99000 53544.61
3: 14.99000 344.77

Practice Problem: Verify that the row number 916 corresponds to brand
number 2199 shown as the first observation in the random sample.

Random Sample: Wine


We repeat the same approach to generate the random sample of 25 obser-
vations for wine products. Remember, that in order to make the results
replicable, we have to add the set.seed(123) before we create the sample.
> set.seed(123)
> dt1_W_rs <- dt1_W[sample(1:nrow(dt1_W),25),]
> head(dt1_W_rs, 3)
Brand Description Classification unitSold
1: 18212 Pulenta Est Malbec Mendoza 2 129
2: 31045 Presidential 20-Yr Tawny Dec 2 155
3: 20537 The Pass Svgn Bl 2 24
averagePrice revenue
1: 25.99000 3352.71
2: 61.46541 9558.45
3: 8.99000 215.76
> tail(dt1_W_rs, 3)
Brand Description Classification unitSold
1: 24410 Marques de Grinon 10 Caliza 2 24
2: 46662 Oak Leaf Chard Cal 2 72
3: 24634 Marques de Caceres RSV Red 2 5055
averagePrice revenue
1: 17.99000 431.76
2: 2.99000 215.28
3: 16.96957 85208.45
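When you need to verify which rows ended up in a sample, it can help to store the sampled row numbers in their own object before subsetting. The following sketch uses a small hypothetical table (products) in place of dt1_L or dt1_W; the column values are made up.

```r
# Hypothetical product table standing in for dt1_L or dt1_W
products <- data.frame(Brand = 101:110,
                       averagePrice = c(9.99, 12.5, 30, 7.25, 18,
                                        22, 5.5, 40, 15, 11))

set.seed(123)
rows <- sample(1:nrow(products), 4)   # row numbers drawn without replacement
products_rs <- products[rows, ]       # keep only the sampled rows

rows
products_rs
```

Keeping rows as a separate object makes it straightforward to look up, for example, products[rows[1], ] and confirm that it matches the first observation of the sample.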


4.1.4 Practice Problems: Random Samples


1. Use the seed number 123 and extract 25 row numbers from the list of
wine products.
2. Write down the first three and the last three row numbers.
3. Verify that the first three row numbers that you have match
the three products shown in the output of the head(dt1_W_rs, 3)
statement.
4. Verify that the last three row numbers that you have match
the three products shown in the output of the tail(dt1_W_rs, 3)
statement.

4.1.5 Hypothesis Testing (two sided t-test)


The function t.test() performs the one- or two-sample t-test. In its simplest
form and for testing two samples, the function needs two arguments: the first
variable (x) and the second variable (y). This default setup will perform
a two-sided test ($\mu_x \neq \mu_y$), report a 95% confidence interval
(which corresponds to α = .05), and assume unequal population variances.
For our hypothesis testing (p. 67), we use the following script:

> t.test(dt1_L_rs$averagePrice, dt1_W_rs$averagePrice,
var.equal = FALSE,
alternative= "two.sided", conf.level = 0.95)

This means that within the function t.test(), we have specified the
following arguments:

1. What are the two variables that we want to compare?
We want to compare the average price of liquor (dt1_L_rs$averagePrice)
to the average price of wine (dt1_W_rs$averagePrice).
2. What is the level of significance?
The function t.test() has no separate argument for the level of
significance (α); an argument such as level = 0.95 is not among its
documented arguments and has no effect on the result. Instead, α is implied
by the confidence level (conf.level = 1 − α, so α = 0.05 corresponds to
conf.level = 0.95), and we compare the reported p-value with α ourselves.
3. What is our assumption about population variances?
We use the argument var.equal to specify our assumption regarding
the equality of variance in the two populations from which the samples
were taken. The options are equal (var.equal = TRUE) or not equal
(var.equal = FALSE). The default value is: var.equal = FALSE.
4. What is the form of our alternative hypothesis?


We specify the alternative hypothesis using the argument alternative,
which takes one of the following three values: "two.sided", "less",
or "greater". The default value is: alternative = "two.sided".1
5. What is the confidence interval that we want to see?
By default the function t.test() will produce the 95% confidence
interval. If we would like to see a different confidence level (e.g., 80%),
we can do this using the argument conf.level = 0.80. The default
value is: conf.level = 0.95.

The resulting output is shown in figure 4.1.

Figure 4.1: t-test output

From this we can see that:
1. The first variable (x) is the price of liquor and the second one (y) is the
price of wine.
2. The value of the t-statistic is -0.043021.
3. The p-value is 0.9659. Since this is a two-sided test, this means that, if
the null hypothesis is true, the probability of observing a t-statistic
greater than 0.0430 or less than -0.0430 is 96.59%. Given that the level of
significance (α) for this test is 5%, and our p-value > α, there is not
enough statistical evidence to reject the null hypothesis that the mean
prices of wine and liquor are the same.
4. The alternative hypothesis states that the difference between the two
means is not equal to zero (i.e., the means are different).
5. The 95% confidence interval is between -11.52 and 11.04. This means
that plausible values for the difference in average prices (liquor minus
wine) range from -$11.52 to $11.04. Since this range contains the value of
zero, we can again conclude that there is not enough evidence to reject the
null hypothesis (i.e., that the true difference in means is equal to zero).
6. The average price of liquor (variable x) in the sample was $19.84 and
the sample average price of wine (y variable) was $20.08.
1 The other two options are "less" (i.e., the alternative hypothesis that the average
price of liquor is less than the average price of wine) and "greater" (i.e., the alternative
hypothesis that the average price of liquor is greater than the average price of wine).
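The quantities read off the printed output can also be extracted from the object that t.test() returns, which makes the decision rule explicit. The sketch below uses simulated prices, not the Bibitor samples; the names x, y, alpha, res, and reject_h0 are illustrative assumptions.

```r
set.seed(123)
x <- rnorm(25, mean = 20, sd = 15)   # simulated "liquor" prices (hypothetical)
y <- rnorm(25, mean = 20, sd = 15)   # simulated "wine" prices (hypothetical)

alpha <- 0.05
res <- t.test(x, y, alternative = "two.sided",
              var.equal = FALSE, conf.level = 1 - alpha)

res$statistic    # t-statistic
res$p.value      # p-value
res$conf.int     # confidence interval for the difference in means
res$estimate     # the two sample means

reject_h0 <- res$p.value < alpha     # decision rule: reject H0 when p-value < alpha
reject_h0
```

For a two-sided test with conf.level set to 1 − α, rejecting the null hypothesis and finding that the confidence interval excludes zero are two views of the same decision.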

Based on these results, our message to Bibitor managers is as follows: There
does not seem to be a statistically significant difference between the average
price of liquor and wine.
Please keep in mind that statistical significance does not necessarily imply
economic significance, and vice versa.

4.1.6 Practice Problems: Bibitor Sales


1. Compare the results of the above t-test (Figure 4.1) with the ones you
get if you do not specify any argument other than the variables
(i.e., t.test(dt1_L_rs$averagePrice, dt1_W_rs$averagePrice)).
Are there any differences in the two outputs? Explain what/why.
2. Run a t-test to compare the average price of liquor versus wine. In your
t-test, use α = 10%, assume the variances are
not equal, set the alternative to the price of liquor being greater than the
price of wine, and request a 90% confidence interval.2
3. Run a t-test to compare the average units sold (bottles) of liquor versus
wine. In your t-test, use α = 1%, assume the
variances are not equal, set the alternative to units of liquor being different
from units of wine, and request a 99% confidence interval.3
4. Run a t-test to compare the average revenues from liquor to average
revenues from wine. In your t-test, use α = 10%,
assume the variances are not equal, set the alternative to revenues from liquor
being different from revenues from wine, and request a 90% confidence interval.4
5. In chapter 2 (p. 24) we learned how to combine two data sets side-by-side
using the function cbind. If we have two data sets that have
exactly the same variables and the observations are sequential, we can
append one at the bottom of the other using the function rbind.5
The objective of the following exercise is to show you an alternative
approach to organizing data for a t-test comparison of two means.
2 The solution to this practice problem is on p. 90.
3 The solution to this practice problem is on p. 90.
4 The solution to this practice problem is on p. 90.
5 For a visual representation of the difference between cbind and rbind see 4.4 (p. 88).


(a) Use the function rbind to combine the random sample from liquor
products (dt1_L_rs) and the random sample from wine products
(dt1_W_rs) and save it as a new data set (dt1_rs).
(b) How many variables are in the new data set (dt1_rs)?
(c) How many observations are in the new data set (dt1_rs)? Observe
that the new data set has one variable that captures the
average price and another variable that captures the product
classification.
(d) When our data set has observations that can be divided into
groups (i.e., liquor and wine), we can run
a t-test that compares the means of the two groups as follows:
t.test(dt1_rs$averagePrice ~ dt1_rs$Classification).
(e) Use the above formula to run the t-test and compare the new
results with the results in Figure 4.1. If you followed the directions
above, the test results should be identical.6
6. The function describe() from the package psych provides more de-
tailed summary statistics. The output among others, includes the stan-
dard error.7 The function in its simplest form takes just one argument,
the variable or variables for which we would like to see descriptive
statistics. For example, the following statement will generate descrip-
tive statistics for the 4th variable (units sold) from the random sample
of wine products: describe( dt1_W_rs[,4]).
(a) Use the function describe() to generate descriptive statistics for
units sold, price, and revenue from the random sample of liquor
products.
(b) What are the average value, standard deviation, number of
observations, and standard error (se) for units sold?
(c) A quick and dirty way to create an approximately 95% confidence
interval is to multiply the standard error (se) by two, and then
add and subtract this from the mean. Use this approach to create
confidence intervals for units sold, price, and revenue.
(d) Use the function describe() to generate descriptive statistics for
units sold, price, and revenue from the random sample of wine
products.
(e) What are the average value, standard deviation, number of
observations, and standard error (se) for units sold?
(f) Use the quick and dirty way to generate confidence intervals for
units sold, price, and revenue of wine.
6 The solution to this practice problem is on p. 91.
7 The solution to this practice problem is on p. 91.


(g) Use your confidence intervals to compare average units sold, price,
and revenue between liquor and wine.
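The idea behind problem 5, stacking two samples with rbind and testing by group, can be previewed on simulated data. This is a sketch only: the data frames liquor, wine, and stacked are hypothetical stand-ins for dt1_L_rs, dt1_W_rs, and dt1_rs, and the simulated prices are not the Bibitor values.

```r
set.seed(123)
# Two hypothetical samples with identical columns
liquor <- data.frame(Classification = 1, averagePrice = rnorm(25, 20, 15))
wine   <- data.frame(Classification = 2, averagePrice = rnorm(25, 20, 15))

stacked <- rbind(liquor, wine)   # append wine below liquor
dim(stacked)                     # rows and columns of the combined data set

# Two ways of asking for the same Welch two-sample t-test
two_vectors <- t.test(liquor$averagePrice, wine$averagePrice)
by_group    <- t.test(averagePrice ~ Classification, data = stacked)

all.equal(two_vectors$p.value, by_group$p.value)
```

The formula version relies on the grouping variable having exactly two distinct values; the group that sorts first (here 1, liquor) plays the role of x.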

4.2 Stock Market Data


When it comes to stock market data, there are numerous questions/hypothe-
ses that investors would want to know/test. A relatively simple question that
we could consider is the following one: During bear markets, are average stock
returns of Amazon higher than S&P500? During bull markets are average
returns of Amazon higher than S&P500? In the rest of this section, we test
the first question, by generating a random sample from the period of finan-
cial crisis of 2007-08. Similarly, we test the second question by generating a
random sample from the period of 2016-17.8
Following the approach shown in Chapter 2 (see section 2.2), we load and
review weekly stock market data for Amazon and S&P500 from 2006 to 2017.

4.2.1 Load and Review Data


> library(quantmod)
> Sys.setenv(TZ = "UTC")
> getSymbols(c("AMZN", "^GSPC"), src="yahoo",
periodicity="weekly",
from=as.Date("2006-01-01"), to=as.Date("2017-12-31"))

[1] "AMZN" "GSPC"

> names(AMZN)

[1] "AMZN.Open" "AMZN.High" "AMZN.Low"


[4] "AMZN.Close" "AMZN.Volume" "AMZN.Adjusted"

> names(GSPC)

[1] "GSPC.Open" "GSPC.High" "GSPC.Low"


[4] "GSPC.Close" "GSPC.Volume" "GSPC.Adjusted"

As we have seen in section 2.2.3 (p. 24), we can use the function cbind
to combine the two data sets, as follows:
8 Random samples are being used to demonstrate hypothesis testing.


> dt1 <- cbind(AMZN[,6], GSPC[,6])


> names(dt1)[1:2] <- c("AMZN", "SP500")
> names(dt1)

[1] "AMZN" "SP500"

We leverage the function lag() to create lagged stock market prices
and use them to calculate stock returns. We remove missing values using the
function na.omit().
> dt1$lagAMZN <- lag(dt1$AMZN)
> dt1$lagSP500 <- lag(dt1$SP500)
> dt1$rtrnAMZN <- (dt1$AMZN - dt1$lagAMZN)/dt1$lagAMZN
> dt1$rtrnSP500 <- (dt1$SP500 - dt1$lagSP500)/dt1$lagSP500
> dt1 <- na.omit(dt1)
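The return formula can be traced by hand on a few made-up prices. The sketch below works on a plain numeric vector, so the one-period shift is written out explicitly instead of relying on lag(), which shifts values in this way for the time-series objects returned by getSymbols(). The prices and object names are hypothetical.

```r
# Hypothetical weekly closing prices
price <- c(100, 110, 99, 99)

lag_price <- c(NA, head(price, -1))        # previous week's price
rtrn <- (price - lag_price) / lag_price    # simple weekly return

toy <- na.omit(data.frame(price, lag_price, rtrn))   # drops the first week
toy
```

The first week has no previous price, so its return is missing; this is the row that na.omit() removes.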

4.2.2 Null and Alternative Hypothesis


Both questions that we want to test can be summarized as follows:
$H_0: \mu_{rtrnAMZN} = \mu_{rtrnSP500}$
$H_1: \mu_{rtrnAMZN} > \mu_{rtrnSP500}$
However, since the Amazon and S&P500 returns are matched on a week-by-week
basis, it makes sense to create a new variable (delta) that captures
the difference between Amazon and S&P500 returns. Hence, the null and
alternative hypotheses become
$H_0: \mu_{delta} = 0$
$H_1: \mu_{delta} > 0$
With the following script, we create the new variable (delta), and review the
first three observations of three variables: Amazon returns (5th), S&P500
returns (6th), and delta (7th).

> dt1$delta <- dt1$rtrnAMZN-dt1$rtrnSP500


> names(dt1)

[1] "AMZN" "SP500" "lagAMZN" "lagSP500" "rtrnAMZN"


[6] "rtrnSP500" "delta"

> dt1[1:3,5:7]

rtrnAMZN rtrnSP500 delta


2006-01-08 -0.07248793 0.001680372 -0.074168299
2006-01-15 -0.01081090 -0.020285642 0.009474741
2006-01-22 0.02959934 0.017622003 0.011977338
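Testing whether the mean of delta is zero is the same as running a paired test on the two matched return series. A minimal sketch, assuming two short made-up return vectors (stock and index are hypothetical names and values):

```r
# Hypothetical weekly returns for a stock and an index (illustrative values)
stock <- c(0.030, -0.012, 0.021, 0.004, -0.008, 0.015)
index <- c(0.010, -0.020, 0.018, 0.006, -0.011, 0.002)

# Approach used in this section: one-sample test on the weekly differences
delta <- stock - index
oneSample <- t.test(delta, mu = 0, alternative = "greater")

# Equivalent paired test on the two matched series
paired <- t.test(stock, index, paired = TRUE, alternative = "greater")

# Both approaches report the same t statistic and p-value
c(oneSample$statistic, paired$statistic)
c(oneSample$p.value, paired$p.value)
```

Working with delta directly keeps the later scripts shorter, because only one column has to be sampled and tested.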


4.2.3 Working with Dates: Subset Bear Market


In order to test our hypotheses, we need to take random samples during
specific time periods. A careful look at the names() output shows that while
the trading weeks (dates) are printed to the left of each row, they are not
listed as a variable in dt1. The data set returned by quantmod is a
time-series (xts) object, which stores the dates as the row index rather
than as a column. Therefore, in order to specify the time periods, we need to
create the variable date. As we can see below, this is a two-stage process:
First, we need to change the format of the data set to a data frame (i.e.,
the standard R data set format; think of the data frame as something similar
to .xls for Excel files). We do this using the function as.data.frame().
Within the parentheses, we specify the existing data set that we want to
convert to a data frame. Since we need just stock returns and delta for the
rest of our analysis, we limit the new data set to just these variables. The
dates become the row names of the data frame.
Second, we need to create the new variable date. We use the function
as.Date to convert the row names from text to dates. The function takes
two arguments. The first one specifies the text (row.names(dt2)) that we
want to convert to the Date format. The second one specifies how that text
is formatted ("%Y-%m-%d"). This means that the row names are formatted as
yyyy-mm-dd (see footnote 9).
> dt2 <- as.data.frame(dt1[,5:7])
> dt2$date <- as.Date(row.names(dt2), "%Y-%m-%d")
Assuming that the financial crisis (bear market) lasted approximately
from January 2007 to December 2008, we use the [,] notation to create a
subset for the bear market (dtBear) and review it.
> dtBear <- dt2[dt2$date>=as.Date("2007-01-01")
& dt2$date<=as.Date("2008-12-31"),]
> head(dtBear,2);tail(dtBear,2)
rtrnAMZN rtrnSP500 date delta
2007-01-07 -0.004430493 0.0149108821 2007-01-07 -0.01934137
2007-01-14 -0.030890078 -0.0001607431 2007-01-14 -0.03072933
rtrnAMZN rtrnSP500 date delta
2008-12-21 0.004266835 -0.01698430 2008-12-21 0.02125113
2008-12-28 0.049826227 0.06759853 2008-12-28 -0.01777231
The results show that our sub-set has correctly captured the intended
beginning and ending dates.
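The two steps (text to Date, then a date-range filter) can be tried on a small stand-alone data frame. This is a sketch under stated assumptions: the data frame df and its four rows are made up for illustration.

```r
# Hypothetical data frame whose row names hold the dates as text
df <- data.frame(delta = c(0.01, -0.02, 0.03, 0.00),
                 row.names = c("2006-12-31", "2007-01-07",
                               "2008-12-28", "2009-01-04"))

# Convert the row names (text) to a proper Date variable
df$date <- as.Date(row.names(df), "%Y-%m-%d")
class(df$date)

# Keep only the rows that fall inside the 2007-2008 window
bear <- df[df$date >= as.Date("2007-01-01") &
           df$date <= as.Date("2008-12-31"), ]
bear
```

Only the two rows dated inside the window survive the filter; the rows from December 2006 and January 2009 are dropped.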
Footnote 9: We use %y to indicate the year abbreviated to two digits, %Y for
the year in four digits, %m for the month in two digits, %b for the
abbreviated month name, %B for the complete month name, and %d for the day.
The separator can be a space, a forward slash (/), or a hyphen (-).

Bear Market: Random Sample


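In order to test our hypothesis, we need a random sample from the bear
market. There are a couple of options on how to do this. The first option is
to randomly select 15 trading weeks (same as the approach shown in section
4.1.3). The second option is to randomly pick a trading date and the 14 weeks
that follow. For the rest of the analysis, we will use the second approach.
Set the seed to 999 to make the analysis replicable and select a random
start date which is at least 15 weeks before the end of the bear period.
Note that R evaluates 1:nrow(dtBear)-15 as (1:nrow(dtBear))-15, so the
candidate values run from -14 to nrow(dtBear)-15 and can include zero or
negative numbers, which are not valid start rows. Writing the range as
1:(nrow(dtBear)-14) avoids this, although it would generally draw a
different start date from the one shown below.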

> set.seed(999)
> startDate <- sample(1:nrow(dtBear)-15,1)
> startDate

[1] 26

Using the [,] notation we can see that the trading week that corresponds to
the 26th observation is the first week of July 2007.

> dtBear[startDate,]

rtrnAMZN rtrnSP500 date delta


2007-07-01 0.008185893 0.01801973 2007-07-01 -0.009833839

Therefore, we can create our random sample (dtBear_15w), which is made up
of the fifteen trading weeks that start on July 1, 2007, as follows:

> dtBear_15w <- dtBear[startDate:(startDate+14),]

> nrow(dtBear_15w)

[1] 15
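Because the colon operator is evaluated before subtraction, the range passed to sample() deserves a second look. The sketch below shows the difference between the two ways of writing the range and draws a window that is guaranteed to fit; the sizes n and k are assumptions chosen for illustration.

```r
# Hypothetical sizes: 104 weekly rows, window of 15 consecutive weeks
n <- 104
k <- 15

# Without parentheses, ':' is evaluated before '-', so this is (1:n) - k
range(1:n - k)        # runs from -14 to 89

# With parentheses, the candidates are the valid start rows 1 to n - k + 1
range(1:(n - k + 1))  # runs from 1 to 90

# Draw one start row and build the window of k consecutive rows
set.seed(999)
start <- sample(1:(n - k + 1), 1)
rows <- start:(start + k - 1)
length(rows)
```

With the parenthesized range, every possible start row leaves room for the full window, so the subset never runs past the last observation.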

Bear Market: Hypothesis Testing


During our sample period, the average difference (Amazon return minus
S&P500 return) was 0.019 (1.9%) and the standard deviation was 0.068.

> mean(dtBear_15w$delta)

[1] 0.01911319

> sd(dtBear_15w$delta)


[1] 0.06805197

While the average difference is greater than zero, we don't know if the
difference is statistically significant. To test this we use the function
t.test(). In the script below, we have specified that our target variable is
delta, the value of the population delta under the null hypothesis is zero
(µ = 0), the alternative is one sided (µ > 0), and we want to see the 90%
confidence interval. The level of significance (10%) is not an argument of
t.test(); we apply it ourselves by comparing the p-value with α.

> t.test(dtBear_15w$delta, mu=0,
alternative= "greater", conf.level = 0.90)

Based on the results below, the p-value (0.1475) is greater than the chosen
level of significance (α = 10%). Therefore, there is not enough statistical
evidence to reject the null hypothesis.

One Sample t-test

data: dtBear_15w$delta
t = 1.0878, df = 14, p-value = 0.1475
alternative hypothesis: true mean is greater than 0
90 percent confidence interval:
-0.004520262 Inf
sample estimates:
mean of x
0.01911319

This means that, based on our sample, there is not enough statistical evidence
to conclude that on average Amazon returns are higher than S&P500 returns
during a bear market.
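The numbers in the t.test() output can be traced back to the sample mean, standard deviation, and sample size reported above. A short sketch of that arithmetic follows; the variable names are illustrative, and the three inputs are the values printed in this section.

```r
# Summary values reported in this section for the 15-week bear market sample
n <- 15
xbar <- 0.01911319   # mean of delta
s <- 0.06805197      # standard deviation of delta

se <- s / sqrt(n)                                     # standard error of the mean
tStat <- (xbar - 0) / se                              # test statistic under H0
pValue <- pt(tStat, df = n - 1, lower.tail = FALSE)   # one-sided p-value
lowerBound <- xbar - qt(0.90, df = n - 1) * se        # 90% one-sided lower bound

round(c(t = tStat, p = pValue, lower = lowerBound), 4)
```

The rounded results should agree with the t statistic, p-value, and lower confidence bound shown in the One Sample t-test output.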

4.2.4 Subset Bull Market


To repeat the analysis during a bull market, we create a sub-set that covers
the period from January 2016 to December 2017. Using the [,] notation we
create and review the sub-set for the bull market (dtBull).

> dtBull <- dt2[dt2$date>=as.Date("2016-01-01")
& dt2$date<=as.Date("2017-12-31"),]
> head(dtBull,2);tail(dtBull,2)


rtrnAMZN rtrnSP500 date delta


2016-01-03 -0.10185093 -0.05964457 2016-01-03 -0.04220636
2016-01-10 -0.06073634 -0.02169585 2016-01-10 -0.03904049

rtrnAMZN rtrnSP500 date delta


2017-12-17 -0.0091422815 0.002814112 2017-12-17 -0.011956393
2017-12-24 0.0009500377 -0.003626071 2017-12-24 0.004576108

Bull Market: Random Sample


Using the same approach as the one used for the financial crisis (bear market),
we create a random sample of 15 consecutive trading weeks based on the
2016-17 (bull market) data.

> set.seed(999)
> startDate <- sample(1:nrow(dtBull)-15,1)
> startDate

[1] 26

> dtBull[startDate,]

rtrnAMZN rtrnSP500 date delta


2016-06-26 0.03822818 0.03216825 2016-06-26 0.006059931

> dtBull_15w <- dtBull[startDate:(startDate+14),]


> nrow(dtBull_15w)

[1] 15

Bull Market: Hypothesis Testing


The average difference between Amazon and S&P500 returns (delta) during
the sample period is 0.00867 and the standard deviation is 0.01647.

> mean(dtBull_15w$delta)

[1] 0.008670262

> sd(dtBull_15w$delta)

[1] 0.01647467


The fact that the mean delta is positive means that, in this sample, the
average return of Amazon was higher than the average return of the S&P500.
However, we don't know if this difference is statistically significant. To
test this, we use t.test() with the same specifications as in the bear
market. The script and results are shown below.
> t.test(dtBull_15w$delta, mu=0,
alternative= "greater", conf.level = 0.90)
One Sample t-test

data: dtBull_15w$delta
t = 2.0383, df = 14, p-value = 0.03044
alternative hypothesis: true mean is greater than 0
90 percent confidence interval:
0.002948849 Inf
sample estimates:
mean of x
0.008670262
Based on the above results, the p-value (0.03044) is less than the chosen level
of significance (α = 10%). Therefore, there is enough statistical evidence to
reject the null hypothesis. It seems that, based on evidence from the period
2016-17, the average return on Amazon was higher than the average return on
the S&P500 during a bull market.
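The comparison between the p-value and α can also be done in code, because t.test() returns an object whose components can be read directly. A minimal sketch, assuming a made-up vector of return differences (delta and alpha are illustrative names and values):

```r
# Hypothetical sample of weekly return differences (illustrative values)
delta <- c(0.020, 0.010, 0.030, -0.010, 0.020, 0.000, 0.015)
alpha <- 0.10   # chosen level of significance

res <- t.test(delta, mu = 0, alternative = "greater",
              conf.level = 1 - alpha)

res$p.value            # p-value of the one-sided test
res$p.value < alpha    # TRUE means the null hypothesis is rejected
res$conf.int           # one-sided confidence interval
```

Reading res$p.value this way is convenient when the same test has to be repeated for several samples, as in the practice problems.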

4.2.5 Practice Problems: Stock Returns


The difference in returns between Amazon and S&P500 may depend on many
different factors. For example, the results may differ depending on the timing
of the sample. A sample selected at the beginning of the financial crisis
may generate different results than a sample selected near the end. Similarly,
a sample collected during a period when Amazon had announced earnings
that were higher than analysts' expectations may produce different results
when compared to a sample from a period when one of Amazon's competitors
had announced a new strategic initiative. In more advanced classes we will
learn how to control for such factors. Keep this in mind as you work on the
practice problems for this section.
1. Set the seed to 123 and work with the bear market subset (dtBear) to
compare returns between Amazon and S&P500, using a sample of 25
observations. The R script for this problem is on p. 92.

(a) Select a random start date that would allow you to take a sample
of 25 observations. What is the startDate observation?
(b) What is the first trading date in the sample?
(c) What is the average delta in the sample?
(d) What is the standard deviation?
(e) Run a one sided t-test with the alternative greater than zero and
a 10% level of significance.
(f) What is the p-value?
(g) State the conclusion based on these results.
2. Set the seed to 123 and work with the bull market subset (dtBull) to
compare returns between Amazon and S&P500, using a sample of 25
observations (see footnote 11).
(a) Select a random start date that would allow you to take a sample
of 25 observations. What is the startDate observation?
(b) What is the first trading date in the sample?
(c) What is the average delta in the sample?
(d) What is the standard deviation?
(e) Run a one sided t-test with the alternative greater than zero and
a 10% level of significance.
(f) What is the p-value?
(g) State the conclusion based on these results.

4.3 Financial Accounting Data


Financial analysts are interested in the ability of firms to maintain their
performance. For example, if we take a sample of firms that have a high
operating margin, are they more likely to outperform a random sample of
their competitors two, four, or six years later?
To test these questions, we are going to use financial accounting data from
2010 to 2016 for all publicly traded firms competing in the same industry as
Apple.

4.3.1 Load and Review Data


Following the approach shown in sections 2.3.2 - 2.3.3, we load and review the
financial data, and create financial profitability ratios.
Footnote 11: The R script for this problem is on p. 93.


> library(data.table)
> dt <- fread("industryAAPL_2010_2016.csv")
> names(dt)
[1] "gvkey" "datadate" "fyear" "tic" "conm"
[6] "at" "cogs" "ib" "oibdp" "sale"
[11] "loc" "naics" "sic"
> nrow(dt)
[1] 441
> dt <- dt[dt$sale>1,]
> nrow(dt)
[1] 396
> dt$gm <- (dt$sale-dt$cogs)/dt$sale
> dt$om <- dt$oibdp/dt$sale
> dt$pm <- dt$ib/dt$sale
The rest of this section is organized as follows:
1. Identify the subset of companies that had an operating margin (om)
above the 3rd quartile (top quartile) in 2010. We will save these com-
panies in a data set named dt_2010_Q4.
2. Based on the 2012 data, create a sub-set that contains only firms that
were in the 4th Quartile in 2010. Name this new set dt_2012_oldQ4.
3. Remove the firms that were in the 4th Quartile in 2010 from the 2012
data set, and name the new sub-set dt_2012_minusOldQ4.
4. Generate a random sample based on the data set dt_2012_minusOldQ4.
Name the random sample dt_2012_rs. The random sample should
have the same number of observations as the firms in dt_2012_oldQ4.

4.3.2 Null and Alternative Hypothesis


Is the 2012 average operating margin of the population of firms that were
in the 4th Quartile in 2010 higher than the 2012 average operating margin
of the population of their competitors? We state this in a form of null and
alternative hypothesis as follows:
H0 : The 2012 average om of the population of firms that were top per-
formers in 2010 is equal to the 2012 average om of the population of their
competitors.
H1 : The 2012 average om of the population of firms that were top
performers in 2010 is greater than the 2012 average om of the population of
their competitors.


4.3.3 Create Sub-set for Hypothesis Testing


Stage 1 (dt_2010_Q4 )
We use the [,] to extract 2010 data and generate summary statistics for
operating margin.

> dt_2010 <- dt[dt$fyear==2010,]


> names(dt_2010)

[1] "gvkey" "datadate" "fyear" "tic" "conm"


[6] "at" "cogs" "ib" "oibdp" "sale"
[11] "loc" "naics" "sic" "gm" "om"
[16] "pm"

> summary(dt_2010$om)

Min. 1st Qu. Median Mean 3rd Qu. Max.


-1.341778 0.007278 0.075490 0.015356 0.134599 0.506473

As we can see, firms that had an operating margin (om) above 13.46% are
in the top quartile (above the 3rd quartile). Remember that, as we have
seen in Chapter 3 (p. 40), we can calculate the 3rd quartile using
quantile(x, .75).
With the following script, we generate the sub-set dt_2010_Q4 and print
the ticker symbol of all firms that were in the 4th quartile in 2010.

> dt_2010_Q4 <- dt_2010[dt_2010$om>quantile(dt_2010$om,.75),]


> print(dt_2010_Q4$tic)

[1] "ANEN" "AAPL" "CMTL" "ERIC" "TCCO" "TKLC" "TMSG"


[8] "APSG" "ARRS" "VSAT" "PPEHF" "TSTC" "CNTF" "MRTKF"
[15] "SATS" "ZSTN" "UBNT" "EVTZF"

As we can see there were 18 firms that were in the 4th quartile in terms of
their operating margin in 2010.
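The top-quartile filter can be illustrated on a handful of made-up operating margins. In the sketch below, the vector om and its eight values are assumptions chosen so that the arithmetic is easy to follow.

```r
# Hypothetical operating margins for eight firms (illustrative values)
om <- c(0.01, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.30)

# 3rd quartile: the cut-off below which 75% of the values fall
q3 <- quantile(om, .75)
q3            # 0.1625 with R's default quantile method

# Firms strictly above the cut-off form the top quartile
om[om > q3]   # 0.20 and 0.30
```

Two of the eight firms (25%) lie above the cut-off, which is what we expect from a top-quartile filter.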

Stage 2 (dt_2012_oldQ4 )
Our objective in this section is to use 2012 data, in order to create a sub-set
that contains only firms that were in the 4th Quartile in 2010. We are going
to name this new set dt_2012_oldQ4.
We start by focusing on the financial data from fiscal year 2012.


> dt_2012 <- dt[dt$fyear==2012,]


To create the new data set (dt_2012_oldQ4) we need the list of firms
that meet the following condition: their ticker symbol must belong to the
list of firms from dt_2010_Q4. In R the operator %in% performs
this task. It checks whether each value in data set A also appears (%in%) in
data set B, and returns TRUE or FALSE for each value. When
used in conjunction with the [,] notation it can produce the new data set
as follows:
> dt_2012_oldQ4 <-
dt_2012[dt_2012$tic %in% c(dt_2010_Q4$tic),]
The new data set has only 13 firms. We can speculate that the remaining 5
firms may have been acquired, gone private, or filed for bankruptcy.
> nrow(dt_2012_oldQ4)

[1] 13

The ticker symbols of the 13 firms are shown below:

> print(dt_2012_oldQ4$tic)

[1] "ANEN" "AAPL" "CMTL" "ERIC" "TCCO" "ARRS" "VSAT"


[8] "PPEHF" "TSTC" "CNTF" "SATS" "UBNT" "EVTZF"

Stage 3 (dt_2012_minusOldQ4 )
In order to see how the firms which were at the top quartile in 2010 compare
to their competitors in 2012, we need to remove them from the list of 2012
firms. This means that we need to remove firms with ticker symbol "ANEN",
"AAPL", "CMTL" etc., from the list of 2012 firms.
In the previous step, we used the following statement dt_2012$tic %in%
c(dt_2010_Q4$tic) to extract the firms that belonged in the list of top
performing 2010 firms. In order to create the dt_2012_minusOldQ4 we
need the opposite of this statement. We need firms that do not belong to
the top performing 2010 firms. To do this we enclose the statement in
parentheses and add an exclamation point (!), R's logical NOT operator, before
it. In other words, the statement becomes: !(dt_2012$tic %in% c(dt_2010_Q4$tic)).
Using the [,] notation and the above statement, we create the sub-set
dt_2012_minusOldQ4 :

> dt_2012_minusOldQ4 <-
dt_2012[!(dt_2012$tic %in% c(dt_2010_Q4$tic)),]
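The %in% operator and its negation split a data set into two complementary parts. A minimal sketch on made-up data follows; the tickers, the data frame firms2012, and the vector top2010 are hypothetical.

```r
# Hypothetical 2012 firms and a hypothetical list of 2010 top performers
firms2012 <- data.frame(tic = c("AAA", "BBB", "CCC", "DDD"),
                        om = c(0.20, 0.05, -0.10, 0.12),
                        stringsAsFactors = FALSE)
top2010 <- c("AAA", "DDD", "ZZZ")   # ZZZ has no 2012 record

# Firms that were top performers in 2010 and still report in 2012
oldQ4 <- firms2012[firms2012$tic %in% top2010, ]

# Everyone else: negate the same condition with !
minusOldQ4 <- firms2012[!(firms2012$tic %in% top2010), ]

oldQ4$tic
minusOldQ4$tic

# The two parts together account for every 2012 row
nrow(oldQ4) + nrow(minusOldQ4) == nrow(firms2012)
```

The hypothetical ticker ZZZ shows why the 2012 subset can be smaller than the 2010 list: a firm with no 2012 record simply does not appear in either part.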


Practice Problem:
1. What is the number of rows in dt_2012_minusOldQ4 ?
2. Verify that the number of rows in the data set dt_2012_minusOldQ4
plus the number of rows in dt_2012_oldQ4 is equal to the number of
rows in dt_2012.

Stage 4 (dt_2012_rs)
Since the data set dt_2012_minusOldQ4 has more observations than the
data set dt_2012_oldQ4, we will need to take a random sample of 13 obser-
vations from the dt_2012_minusOldQ4.

> set.seed(123)
> dt_2012_rs <-
dt_2012_minusOldQ4[sample(1:nrow(dt_2012_minusOldQ4),13),]
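Setting the seed before each call to sample() is what makes the random sample repeatable. The sketch below demonstrates this on a made-up data frame (firms and its values are assumptions). One caveat worth keeping in mind: R changed its default sampling algorithm in version 3.6.0, so the rows drawn for a given seed on a current installation may differ from output produced with an older version.

```r
# Hypothetical data frame of 10 competitor firms (illustrative values)
firms <- data.frame(tic = paste0("F", 1:10),
                    om = seq(-0.20, 0.25, length.out = 10),
                    stringsAsFactors = FALSE)

# Same seed before each draw gives the same rows
set.seed(123)
rs1 <- firms[sample(1:nrow(firms), 4), ]

set.seed(123)
rs2 <- firms[sample(1:nrow(firms), 4), ]

identical(rs1, rs2)       # TRUE: same seed, same rows
anyDuplicated(rs1$tic)    # 0: sample() draws without replacement by default
```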

4.3.4 Hypothesis Testing


The 2012 average om of the sample of firms that were in the top quartile
of performance in 2010 is 12.23% and the standard deviation is 0.2022. The
corresponding values for the sample of their competitors are -13.33% and
0.491.

> mean(dt_2012_oldQ4$om)

[1] 0.1223479

> sd(dt_2012_oldQ4$om)

[1] 0.2022083

> mean(dt_2012_rs$om)

[1] -0.1332882

> sd(dt_2012_rs$om)

[1] 0.4909646

Based on these values, it seems that the 2010 top performers were able to
sustain their superiority two years later. However, we don't know if this
difference is statistically significant.
Results based on a one sided t-test are shown below.


> t.test(dt_2012_oldQ4$om, dt_2012_rs$om,
var.equal = FALSE,
alternative= "greater", conf.level = 0.90)

Welch Two Sample t-test

data: dt_2012_oldQ4$om and dt_2012_rs$om


t = 1.7359, df = 15.957, p-value = 0.05093
alternative hypothesis: true difference in means is greater than 0
90 percent confidence interval:
0.05875454 Inf
sample estimates:
mean of x mean of y
0.1223479 -0.1332882

The p-value (0.05093) is less than the chosen level of significance (10%).
Therefore, we can conclude that, based on evidence from our sample, the
2012 average om of the top performing firms in 2010 seems to be higher than
the 2012 average om of their competitors. Two years later, these firms
seem to be able to maintain their advantage.
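The Welch statistic and its degrees of freedom can be reproduced from the four summary values reported above. A short sketch of that arithmetic, with illustrative variable names and the inputs taken from this section:

```r
# Summary values reported in this section (13 firms in each sample)
n1 <- 13; m1 <- 0.1223479;  s1 <- 0.2022083   # 2010 top-quartile firms in 2012
n2 <- 13; m2 <- -0.1332882; s2 <- 0.4909646   # random sample of competitors

# Squared standard errors of the two sample means
v1 <- s1^2 / n1
v2 <- s2^2 / n2

# Welch t statistic: difference in means over its standard error
tStat <- (m1 - m2) / sqrt(v1 + v2)

# Welch-Satterthwaite approximation for the degrees of freedom
dfWelch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# One-sided p-value for the alternative "greater"
pValue <- pt(tStat, df = dfWelch, lower.tail = FALSE)

round(c(t = tStat, df = dfWelch, p = pValue), 4)
```

The non-integer degrees of freedom in the output come from this approximation, which is used because var.equal = FALSE does not assume the two groups share the same variance.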

4.3.5 Practice Problems: Performance Sustainability


1. Is the 2014 average operating margin of the population of firms that
were in the 4th Quartile in 2010 higher than the 2014 average operating
margin of the population of their competitors? The R script for this
problem is on p. 94.
2. Is the 2016 average operating margin of the population of firms that
were in the 4th Quartile in 2010 higher than the 2016 average operating
margin of the population of their competitors? The R script for this
problem is on p. 95.


4.4 Visual Representation: cbind & rbind


When to use cbind Suppose that we have two tables. The first one (4.1)
has two variables X and Y and three observations. The second table (4.2)
has two variables X and Z and three observations. As we can see, both tables
contain the same values of variable X in the same order.

X Y X Z
1 10 1 100
2 20 2 200
3 30 3 300

Table 4.1: XY Table 4.2: XZ

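In this case, we can place them side-by-side and have them share the
common variable X using the function cbind. Here we pass two arguments:
the left table (XY) and the right table (XZ).

> XYZ <- cbind(XY, XZ)

The resulting output (XYZ) is shown in Table 4.3. This is how cbind behaves
for the time-indexed data sets of section 4.2, where X plays the role of the
shared date index. For ordinary data frames, cbind keeps both copies of X;
in that case merge(XY, XZ, by = "X") produces Table 4.3.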

X Y Z
1 10 100
2 20 200
3 30 300

Table 4.3: XYZ
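For ordinary data frames, cbind and merge behave differently when both tables contain X. A minimal sketch using the values of Tables 4.1 and 4.2, rebuilt here as plain data frames:

```r
# Tables 4.1 and 4.2 as ordinary data frames
XY <- data.frame(X = 1:3, Y = c(10, 20, 30))
XZ <- data.frame(X = 1:3, Z = c(100, 200, 300))

# cbind places the tables side by side and keeps every column
names(cbind(XY, XZ))             # "X" "Y" "X" "Z"

# merge matches rows on the shared variable and keeps it once
XYZ <- merge(XY, XZ, by = "X")
names(XYZ)                       # "X" "Y" "Z"
XYZ
```

merge also matches rows by the value of X rather than by position, so it remains correct when the two tables are sorted differently.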

When to use rbind Suppose that we have two tables. The first one (4.4)
has two variables Year and Y and three observations for years 2014 through
2016. The second table (4.5) has the same two variables Year and Y and
observations for years 2017 and 2018. As we can see both tables have the
same variables Year and Y.

Year Y          Year Y
2014 40         2017 90
2015 50         2018 65
2016 70

Table 4.4: dt1  Table 4.5: dt2


In this case, it makes sense to append the second table (dt2 ) at the
bottom of the first table (dt1 ) using the function rbind. The function takes
two arguments: we need to specify the top table (dt1) and the bottom table
(dt2).

> dt <- rbind(dt1, dt2)

The resulting output (dt) is shown in Table 4.6.

Year Y
2014 40
2015 50
2016 70
2017 90
2018 65

Table 4.6: dt
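The stacking step can be reproduced with the values of Tables 4.4 and 4.5. In this sketch the two tables are rebuilt as plain data frames:

```r
# Tables 4.4 and 4.5 as data frames with identical column names
dt1 <- data.frame(Year = 2014:2016, Y = c(40, 50, 70))
dt2 <- data.frame(Year = 2017:2018, Y = c(90, 65))

# rbind appends the rows of dt2 below the rows of dt1
dt <- rbind(dt1, dt2)
nrow(dt)    # 5
dt
```

For data frames, rbind matches columns by name, so both tables need the same variable names for the rows to line up.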


4.5 Solutions to Selected Practice Problems


4.5.1 Hypothesis Testing for Bibitor Sales (4.1.6)
Compare Prices (p. 73)
> t.test(dt1_L_rs$averagePrice, dt1_W_rs$averagePrice,
var.equal = FALSE,
alternative= "greater", conf.level = 0.90)

Welch Two Sample t-test

data: dt1_L_rs$averagePrice and dt1_W_rs$averagePrice


t = -0.043021, df = 47.418, p-value = 0.5171
alternative hypothesis: true difference in means is greater than 0
90 percent confidence interval:
-7.532431 Inf
sample estimates:
mean of x mean of y
19.84072 20.08207

Compare Units Sold (p. 73)


> t.test(dt1_L_rs$unitSold, dt1_W_rs$unitSold,
var.equal = FALSE,
alternative= "two.sided", conf.level = 0.99)

Welch Two Sample t-test

data: dt1_L_rs$unitSold and dt1_W_rs$unitSold


t = 1.717, df = 30.872, p-value = 0.09599
alternative hypothesis: true difference in means is not equal to 0
99 percent confidence interval:
-799.2495 3469.8095
sample estimates:
mean of x mean of y
2098.36 763.08

Compare Revenues (p. 73)


> t.test(dt1_L_rs$revenue, dt1_W_rs$revenue,
var.equal = FALSE,
alternative= "two.sided", conf.level = 0.90)


Welch Two Sample t-test

data: dt1_L_rs$revenue and dt1_W_rs$revenue


t = 1.5834, df = 32.932, p-value = 0.1229
alternative hypothesis: true difference in means is not equal to 0
90 percent confidence interval:
-1205.284 36198.728
sample estimates:
mean of x mean of y
28721.02 11224.29

Use rbind to combine random samples and compare prices (p. 73)
> dt1_rs <- rbind(dt1_L_rs, dt1_W_rs)
> t.test(dt1_rs$averagePrice~dt1_rs$Classification,
var.equal = FALSE,
alternative= "two.sided", conf.level = 0.95)

Welch Two Sample t-test

data: dt1_rs$averagePrice by dt1_rs$Classification


t = -0.043021, df = 47.418, p-value = 0.9659
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.52456 11.04186
sample estimates:
mean in group 1 mean in group 2
19.84072 20.08207

Use describe() to view detailed descriptive statistics (p. 74)


> library(psych)
> describe(dt1_L_rs[,4:6])

vars n mean sd median trimmed mad


unitSold 1 25 2098.36 3631.85 347.00 1358.33 508.53
averagePrice 2 25 19.84 18.70 16.99 16.90 11.86
revenue 3 25 28721.02 50584.58 9646.56 18966.60 14062.21

min max range skew kurtosis se
unitSold 3.00 14183.00 14180.00 2.03 3.25 726.37
averagePrice 0.99 88.28 87.29 2.00 4.70 3.74
revenue 2.97 240969.17 240966.20 2.95 9.51 10116.92

> describe(dt1_W_rs[,4:6])

vars n mean sd median trimmed mad


unitSold 1 25 763.08 1388.83 158.00 425.86 226.84
averagePrice 2 25 20.08 20.90 16.17 15.61 8.63
revenue 3 25 11224.29 22222.98 3943.84 5445.65 5527.25

min max range skew kurtosis se


unitSold 3.00 5072.00 5069.00 2.32 4.41 277.77
averagePrice 2.99 104.99 102.00 2.91 8.55 4.18
revenue 65.97 85208.45 85142.48 2.66 5.80 4444.60

4.5.2 Hypothesis Testing for Stock Returns (4.2.5)


Stock Returns in Bear Market (p. 81)
> set.seed(123)
> startDate <- sample(1:nrow(dtBear)-25,1)
> startDate

[1] 5

> dtBear[startDate,]

rtrnAMZN rtrnSP500 date delta


2007-02-04 0.03557106 -0.007132027 2007-02-04 0.04270309

> dtBear_25w <- dtBear[startDate:(startDate+24),]


> nrow(dtBear_25w)

[1] 25

> mean(dtBear_25w$delta)

[1] 0.03577445

> sd(dtBear_25w$delta)

[1] 0.09114545

> t.test(dtBear_25w$delta, mu=0,
alternative= "greater", conf.level = 0.90)

One Sample t-test

data: dtBear_25w$delta
t = 1.9625, df = 24, p-value = 0.0307
alternative hypothesis: true mean is greater than 0
90 percent confidence interval:
0.0117515 Inf
sample estimates:
mean of x
0.03577445

Stock Returns in Bull Market (p. 82)


> set.seed(123)
> startDate <- sample(1:nrow(dtBull)-25,1)
> startDate

[1] 5

> dtBull[startDate,]

rtrnAMZN rtrnSP500 date delta


2016-01-31 -0.1445826 -0.03102191 2016-01-31 -0.1135607

> dtBull_25w <- dtBull[startDate:(startDate+24),]


> nrow(dtBull_25w)

[1] 25

> mean(dtBull_25w$delta)

[1] 0.005769844

> sd(dtBull_25w$delta)

[1] 0.03722621

> t.test(dtBull_25w$delta, mu=0,
alternative= "greater", conf.level = 0.90)


One Sample t-test

data: dtBull_25w$delta
t = 0.77497, df = 24, p-value = 0.223
alternative hypothesis: true mean is greater than 0
90 percent confidence interval:
-0.004041763 Inf
sample estimates:
mean of x
0.005769844

4.5.3 Hypothesis Testing: Performance Sustainability (4.3.5)

Performance Sustainability - 2014 (p. 87)
> dt_2014 <- dt[dt$fyear==2014,]
> dt_2014_oldQ4 <- dt_2014[dt_2014$tic %in% c(dt_2010_Q4$tic),]
> nrow(dt_2014_oldQ4)

[1] 11

> print(dt_2014_oldQ4$tic)

[1] "AAPL" "CMTL" "ERIC" "TCCO" "ARRS" "VSAT" "PPEHF"


[8] "CNTF" "SATS" "UBNT" "EVTZF"

> dt_2014_minusOldQ4 <-
dt_2014[!(dt_2014$tic %in% c(dt_2010_Q4$tic)),]
> set.seed(123)
> dt_2014_rs <-
dt_2014_minusOldQ4[sample(1:nrow(dt_2014_minusOldQ4),11),]
> mean(dt_2014_oldQ4$om)

[1] 0.1390558

> sd(dt_2014_oldQ4$om)

[1] 0.1754235

> mean(dt_2014_rs$om)

[1] -0.2775218


> sd(dt_2014_rs$om)

[1] 0.7319373

> t.test(dt_2014_oldQ4$om, dt_2014_rs$om,
var.equal = FALSE,
alternative= "greater", conf.level = 0.90)

Welch Two Sample t-test

data: dt_2014_oldQ4$om and dt_2014_rs$om


t = 1.8357, df = 11.145, p-value = 0.04661
alternative hypothesis: true difference in means is greater than 0
90 percent confidence interval:
0.1074212 Inf
sample estimates:
mean of x mean of y
0.1390558 -0.2775218

Performance Sustainability - 2016 (p. 87)


> dt_2016 <- dt[dt$fyear==2016,]
> dt_2016_oldQ4 <- dt_2016[dt_2016$tic %in% c(dt_2010_Q4$tic),]
> nrow(dt_2016_oldQ4)

[1] 11

> print(dt_2016_oldQ4$tic)

[1] "AAPL" "CMTL" "ERIC" "TCCO" "ARRS" "VSAT" "PPEHF"


[8] "CNTF" "SATS" "UBNT" "EVTZF"

> dt_2016_minusOldQ4 <-
dt_2016[!(dt_2016$tic %in% c(dt_2010_Q4$tic)),]
> set.seed(123)
> dt_2016_rs <-
dt_2016_minusOldQ4[sample(1:nrow(dt_2016_minusOldQ4),11),]
> mean(dt_2016_oldQ4$om)

[1] 0.07157916

> sd(dt_2016_oldQ4$om)


[1] 0.4081531

> mean(dt_2016_rs$om)

[1] 0.04039443

> sd(dt_2016_rs$om)

[1] 0.2157376

> t.test(dt_2016_oldQ4$om, dt_2016_rs$om,
var.equal = FALSE,
alternative= "greater", conf.level = 0.90)

Welch Two Sample t-test

data: dt_2016_oldQ4$om and dt_2016_rs$om


t = 0.22403, df = 15.183, p-value = 0.4129
alternative hypothesis: true difference in means is greater than 0
90 percent confidence interval:
-0.1553189 Inf
sample estimates:
mean of x mean of y
0.07157916 0.04039443



Part I

Appendix

Appendix A

R Script Used in Chapters

A.1 R Script: Introduction to R


A.2 R Script: Load and Review Data
A.2.1 Section 2.1: Management Accounting Data
The following R script was used to prepare the section on how to load and
review management accounting data.

# set working directory to source file location


library(data.table)
tStores <- fread("tStores.csv")
names(tStores)
str(tStores)
head(tStores)
head(tStores,3)

tStores[1:3,]
tStores[1:3, 2:4]
tStores[, 2:4]
head(tStores[,2:4],3)
tail(tStores[,2:4],3)

head(tStores[order(tStores$SqFt),2:4],3)
head(tStores[order(-tStores$SqFt),2:4],3)

min(tStores$SqFt)
max(tStores$SqFt)


mean(tStores$SqFt)

A.2.2 Section 2.2: Stock Market Data


The following R script was used to prepare the section on how to load and
review Stock Market data.

library(quantmod)
Sys.setenv(TZ = "UTC")
getSymbols(c("AMZN"), src="yahoo", periodicity="weekly",
from=as.Date("2006-01-01"), to=as.Date("2014-12-31"))
names(AMZN)
head(AMZN,3)
tail(AMZN,3)
nrow(AMZN)
plot(AMZN[,6])
getSymbols(c("^GSPC"),src="yahoo",periodicity="weekly",
from=as.Date("2006-01-01"),to=as.Date("2014-12-31"))
names(GSPC)
nrow(GSPC)
head(AMZN,1);head(GSPC,1)
tail(AMZN,1);tail(GSPC,1)

dt1<-cbind(AMZN[,6],GSPC[,6])
head(dt1,3);tail(dt1,3)
nrow(dt1)
names(dt1)
names(dt1)[1:2]<-c("AMZN","SP500")
names(dt1)

par(mfcol=c(1,2))
plot(dt1$AMZN);plot(dt1$SP500)
par(mfcol=c(1,1))

dt2<-dt1['2007-01-01::2010-12-31']
par(mfcol=c(1,2))
plot(dt2$AMZN);plot(dt2$SP500)
par(mfcol=c(1,1))


A.2.3 Section 2.3: Financial Accounting Data

The following R script was used to prepare the section on how to load and
review Financial Accounting data.

# set working directory to source file location


library(data.table)
dt<-fread("industryAAPL_2010_2016.csv")
names(dt)
str(dt)
dt[dt$tic=="AAPL",c(3:10)]
dt<-dt[dt$sale>1,]
nrow(dt)
dt$gm<-(dt$sale-dt$cogs)/dt$sale
dt$om<-dt$oibdp/dt$sale
dt$pm<-dt$ib/dt$sale
names(dt)
dt[dt$tic=="AAPL",c(3:4,6:10,14:16)]
options(scipen=9)
dt$ROA_gm<-(dt$sale-dt$cogs)/dt$at
dt$ROA_om<-dt$oibdp/dt$at
dt$ROA_pm<-dt$ib/dt$at
dt[dt$tic=="AAPL",c(3:4,6:10,17:19)]
min(dt[dt$tic=="AAPL",]$gm)
min(dt$gm)
max(dt[dt$tic=="AAPL",]$gm)
max(dt$gm)
mean(dt[dt$tic=="AAPL",]$gm)
mean(dt$gm)
dt_AAPL<-dt[dt$tic=="AAPL",]
library(ggplot2)
AAPL_sales<-ggplot(dt_AAPL,aes(x=fyear,y=sale))+
geom_point()
AAPL_sales
AAPL_sales<-AAPL_sales+
geom_line()+
xlab("FiscalYear")+
ylab("Sales(millions)")
AAPL_sales


A.3 R Script: Summary Statistics


A.3.1 Section 3.1: Management Accounting Data
The following R script was used to prepare the section on summary statistics
for management accounting data.

# set working directory to source file location


library(data.table)
dt1<-fread("salesByStore.csv")
str(dt1)
summary(dt1$SqFt)

quantile(dt1$SqFt,.25)
quantile(dt1$SqFt,.75)
quantile(dt1$SqFt,.75)-quantile(dt1$SqFt,.25)
IQR(dt1$SqFt)

lWhisker4sQFt<-quantile(dt1$SqFt,.25)-1.5*IQR(dt1$SqFt)
lWhisker4sQFt
uWhisker4sQFt<-quantile(dt1$SqFt,.75)+1.5*IQR(dt1$SqFt)
uWhisker4sQFt
boxplot(dt1$SqFt)

dt1$sqftOutlier<-
ifelse(dt1$SqFt<lWhisker4sQFt|dt1$SqFt>uWhisker4sQFt,1,0)
dt4outlier <- dt1[dt1$sqftOutlier==1,]
dt4outlier

dt1[dt1$SqFt<lWhisker4sQFt|dt1$SqFt>uWhisker4sQFt,]

dt2<-fread("salesByProduct.csv")
str(dt2)
table(dt2$Classification)
prop.table(table(dt2$Classification))

A.3.2 Section 3.2: Stock Market Data


The following R script was used to prepare the section on summary statistics
for stock market data.


options(scipen=999)
library(quantmod)
Sys.setenv(TZ="UTC")
getSymbols(c("AMZN","^GSPC"),src="yahoo",
periodicity="weekly",
from=as.Date("2006-01-01"),to=as.Date("2016-12-31"))
names(AMZN)
names(GSPC)

dt1<-cbind(AMZN[,6],GSPC[,6])
head(dt1,3);tail(dt1,3)
names(dt1)[1:2]<-c("AMZN","SP500")
names(dt1)
head(dt1)

dt1$lagAMZN<-lag(dt1$AMZN)
dt1$lagSP500<-lag(dt1$SP500)
head(dt1)
dt1$rtrnAMZN<-(dt1$AMZN-dt1$lagAMZN)/dt1$lagAMZN
dt1$rtrnSP500<-(dt1$SP500-dt1$lagSP500)/dt1$lagSP500
head(dt1)
dt1<-na.omit(dt1)
head(dt1)
summary(dt1[,5:6])

lwrWhiskerAMZN<-
quantile(dt1$rtrnAMZN,.25)-1.5*IQR(dt1$rtrnAMZN)
lwrWhiskerAMZN
uprWhiskerAMZN<-
quantile(dt1$rtrnAMZN,.75)+1.5*IQR(dt1$rtrnAMZN)
uprWhiskerAMZN
lwrWhiskerSP500<-
quantile(dt1$rtrnSP500,.25)-1.5*IQR(dt1$rtrnSP500)
lwrWhiskerSP500
uprWhiskerSP500<-
quantile(dt1$rtrnSP500,.75)+1.5*IQR(dt1$rtrnSP500)
uprWhiskerSP500

dt1[dt1$rtrnAMZN==min(dt1$rtrnAMZN),]
dt1[dt1$rtrnSP500==min(dt1$rtrnSP500),]


dt2<-as.data.frame(dt1[,5:6])
dt2$directionAMZN<-ifelse(dt2$rtrnAMZN>0,"AMZN_up",
ifelse(dt2$rtrnAMZN==0,"AMZN_unchanged","AMZN_down"))
dt2$directionSP500<-ifelse(dt2$rtrnSP500>0,"SP500_up",
ifelse(dt2$rtrnSP500==0,"SP500_unchanged","SP500_down"))
head(dt2)
table(dt2$directionSP500)
table(dt2$directionAMZN)
table(dt2$directionSP500,dt2$directionAMZN)

A.3.3 Section 3.3: Financial Accounting Data


# set working directory to source file location
library(data.table)
dt<-fread("industryAAPL_2010_2016.csv")
names(dt)
nrow(dt)

dt<-dt[dt$sale>1,]
nrow(dt)
dt_2010<-dt[dt$fyear==2010,]
summary(dt_2010[,8:9])
dt_2010$status_OIBDP<-ifelse(dt_2010$oibdp>=0,
"profits_OIBDP","losses_OIBDP")
prop.table(table(dt_2010$status_OIBDP))
dt_2010$status_IB<-ifelse(dt_2010$ib>=0,"profits_IB",
"losses_IB")
prop.table(table(dt_2010$status_IB))

dt_2010$gm<-(dt_2010$sale-dt_2010$cogs)/dt_2010$sale
dt_2010$om<-dt_2010$oibdp/dt_2010$sale
dt_2010$pm<-dt_2010$ib/dt_2010$sale
names(dt_2010)
summary(dt_2010[,16:18])

lwrWhisker_pm<-
quantile(dt_2010$pm,.25)-1.5*IQR(dt_2010$pm)
lwrWhisker_pm
uprWhisker_pm<-
quantile(dt_2010$pm,.75)+1.5*IQR(dt_2010$pm)
uprWhisker_pm


dt_2010$relativePosition_pm<-
ifelse(dt_2010$pm>uprWhisker_pm,"topOutlier_pm",
ifelse(dt_2010$pm>median(dt_2010$pm),
"aboveMedian_pm","belowMedian_pm"))

table(dt_2010$relativePosition_pm)
prop.table(table(dt_2010$relativePosition_pm))
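
The nested ifelse() used for relativePosition_pm evaluates its conditions in order, so a firm is labelled as a top outlier first and is only compared to the median if that test fails. A small sketch with made-up profit margins (the vector pm below is hypothetical) shows how the three labels are assigned:

```r
# Hypothetical profit margins for nine firms (made-up numbers)
pm <- c(-0.40, -0.05, 0.02, 0.04, 0.06, 0.08, 0.10, 0.12, 0.90)

# Upper whisker: Q3 + 1.5*IQR
uprWhisker_pm <- unname(quantile(pm, .75) + 1.5 * IQR(pm))

# First test: above the upper whisker? If not, second test: above the median?
relativePosition_pm <-
  ifelse(pm > uprWhisker_pm, "topOutlier_pm",
         ifelse(pm > median(pm), "aboveMedian_pm", "belowMedian_pm"))

table(relativePosition_pm)
prop.table(table(relativePosition_pm))
```

Because the second test uses a strict inequality, a firm whose margin equals the median receives the label "belowMedian_pm".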


A.4 R Script: Hypothesis Testing

A.4.1 Section 4.1: Management Accounting Data

The following R script was used to prepare the section on hypothesis testing
for management accounting data.

library(data.table)
dt1 <- fread("salesByProduct.csv")
str(dt1)

dt1_L <- dt1[dt1$Classification==1,]


dt1_W <- dt1[dt1$Classification==2,]

summary(dt1_L[, 4:6])
summary(dt1_W[, 4:6])

sample(1:nrow(dt1_L),25)
sample(1:nrow(dt1_L),25)

set.seed(123)
sample(1:nrow(dt1_L),25)
set.seed(123)
sample(1:nrow(dt1_L),25)

set.seed(123)
dt1_L_rs<-dt1_L[sample(1:nrow(dt1_L),25),]
head(dt1_L_rs,3)
tail(dt1_L_rs,3)

set.seed(123)
dt1_W_rs<-dt1_W[sample(1:nrow(dt1_W),25),]
head(dt1_W_rs,3)
tail(dt1_W_rs,3)

# level= is not an argument of t.test(); conf.level sets the confidence level
t.test(dt1_L_rs$averagePrice, dt1_W_rs$averagePrice,
var.equal = FALSE, alternative = "two.sided", conf.level = 0.95)
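
The sampling and testing steps above can be rehearsed on simulated data. The sketch below is only an illustration: the two price vectors are random draws with hypothetical means and standard deviations, not the salesByProduct.csv data. It shows that resetting the seed reproduces the same sample and that var.equal = FALSE requests the Welch version of the two-sample t-test.

```r
# Hypothetical populations of average prices for two product classes
set.seed(123)
priceL <- rnorm(200, mean = 50, sd = 5)
priceW <- rnorm(200, mean = 53, sd = 8)

# Resetting the seed before each call makes sample() return the same rows
set.seed(123)
idx1 <- sample(1:200, 25)
set.seed(123)
idx2 <- sample(1:200, 25)
identical(idx1, idx2)

# Two-sided Welch test (unequal variances) at the 95% confidence level
res <- t.test(priceL[idx1], priceW[idx1],
              var.equal = FALSE, alternative = "two.sided",
              conf.level = 0.95)
res$method
res$statistic
res$p.value
res$conf.int
```

Storing the result in an object (res here) makes it easy to pull out the test statistic, p-value, and confidence interval individually.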


A.4.2 Section 4.2: Stock Market Data


The following R script was used to prepare the section on hypothesis testing
for stock market data.
library(quantmod)
Sys.setenv(TZ = "UTC")
getSymbols(c("AMZN", "^GSPC"),
src="yahoo", periodicity="weekly",
from=as.Date("2006-01-01"), to=as.Date("2017-12-31"))
names(AMZN)
names(GSPC)
dt1 <- cbind(AMZN[,6], GSPC[,6])
names(dt1)[1:2] <- c("AMZN", "SP500")
names(dt1)
dt1$lagAMZN <- lag(dt1$AMZN)
dt1$lagSP500 <- lag(dt1$SP500)
dt1$rtrnAMZN <- (dt1$AMZN - dt1$lagAMZN)/dt1$lagAMZN
dt1$rtrnSP500 <- (dt1$SP500 - dt1$lagSP500)/dt1$lagSP500
dt1 <- na.omit(dt1)
dt1$delta <- dt1$rtrnAMZN-dt1$rtrnSP500
names(dt1)
dt1[1:3,5:7]
dt2 <- as.data.frame(dt1[,5:7])
dt2$date <- as.Date(row.names(dt2), "%Y-%m-%d")
dtBear <- dt2[dt2$date>=as.Date("2007-01-01")
& dt2$date<=as.Date("2008-12-31"),]
head(dtBear,2);tail(dtBear,2)
set.seed(999)
# parentheses are needed because ":" is evaluated before "-"
startDate <- sample(1:(nrow(dtBear)-15),1)
startDate
dtBear[startDate,]
dtBear_15w <- dtBear[startDate:(startDate+14),]
nrow(dtBear_15w)
mean(dtBear_15w$delta)
sd(dtBear_15w$delta)
t.test(dtBear_15w$delta, mu = 0,
alternative = "greater", conf.level = 0.90)
dtBull <- dt2[dt2$date>=as.Date("2016-01-01")
& dt2$date<=as.Date("2017-12-31"),]
head(dtBull,2);tail(dtBull,2)
set.seed(999)


# parentheses are needed because ":" is evaluated before "-"
startDate <- sample(1:(nrow(dtBull)-15),1)
startDate
dtBull[startDate,]
dtBull_15w <- dtBull[startDate:(startDate+14),]
nrow(dtBull_15w)
mean(dtBull_15w$delta)
sd(dtBull_15w$delta)
t.test(dtBull_15w$delta, mu = 0,
alternative = "greater", conf.level = 0.90)
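
Two details of the script above are easy to miss: how R reads an expression such as 1:n-15, and what a one-sided test returns. The sketch below illustrates both with made-up inputs (n and delta are hypothetical; delta is a random draw, not actual return data).

```r
# Operator precedence: ":" is evaluated before "-"
n <- 20
range(1:n - 15)     # (1:n) - 15, i.e. values from -14 to 5
range(1:(n - 15))   # 1:5, valid start rows that leave room for a 15-week window

# One-sample, one-sided t-test on 15 hypothetical weekly return differences
set.seed(999)
delta <- rnorm(15, mean = 0.01, sd = 0.03)
res <- t.test(delta, mu = 0, alternative = "greater", conf.level = 0.90)
res$statistic
res$p.value
res$conf.int        # one-sided interval: the upper bound is Inf
```

With alternative = "greater" the null hypothesis is that the mean difference is at most zero, so the confidence interval reported by t.test() has only a lower bound.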

A.4.3 Section 4.3: Financial Accounting Data


The following R script was used to prepare the section on hypothesis testing
for financial accounting data.

# set working directory to source file location


library(data.table)
dt <- fread("industryAAPL_2010_2016.csv")
names(dt)
nrow(dt)
dt <- dt[dt$sale>1,]
nrow(dt)
dt$gm <- (dt$sale-dt$cogs)/dt$sale
dt$om <- dt$oibdp/dt$sale
dt$pm <- dt$ib/dt$sale
dt_2010 <- dt[dt$fyear==2010,]
names(dt_2010)
summary(dt_2010$om)
dt_2010_Q4 <- dt_2010[dt_2010$om>quantile(dt_2010$om,.75),]
print(dt_2010_Q4$tic)
dt_2012 <- dt[dt$fyear==2012,]
dt_2012_oldQ4 <- dt_2012[dt_2012$tic %in% c(dt_2010_Q4$tic),]
nrow(dt_2012_oldQ4)
print(dt_2012_oldQ4$tic)
dt_2012_minusOldQ4 <-
dt_2012[!(dt_2012$tic %in% c(dt_2010_Q4$tic)),]
set.seed(123)
dt_2012_rs <-
dt_2012_minusOldQ4[sample(1:nrow(dt_2012_minusOldQ4),13),]
mean(dt_2012_oldQ4$om)
sd(dt_2012_oldQ4$om)


mean(dt_2012_rs$om)
sd(dt_2012_rs$om)
t.test(dt_2012_oldQ4$om, dt_2012_rs$om,
var.equal = FALSE, alternative = "greater", conf.level = 0.90)

Appendix B

How to Extract Financial Data from Compustat

Intended Learning Outcomes


The objective of this Appendix is to introduce students to Wharton Research
Data Services (WRDS), one of the most comprehensive repositories of
accounting/financial data. More specifically, we are going to focus on
Compustat Capital IQ (henceforward Compustat), which is one of the
libraries/databases in WRDS that houses financial statement data of all
companies that are publicly traded.
By the end of the appendix students should have achieved the following
objectives:

1. Understand the data structure of the Compustat database.
2. Know what financial data are available per firm and time periods covered.
3. Be able to extract financial statement data for a single firm or an entire
industry.

B.1 Compustat
Financial statement analysis requires a point of reference. This point can be
the company itself or other firms/competitors. When the point of reference
is the company itself, we compare the firm’s performance in the current period
to its performance in prior periods (this means we use time-series data). In
Section B.1.1, we will learn how to extract data from Compustat for a single
company.


When the point of reference is a list of competitors or the entire industry,
we compare the firm versus its competitors at a given point in time (this
means we use cross-sectional data) or compare the firm versus its competitors
over time (this means we use panel data). In Section B.1.2, we will learn how
to extract data for an entire industry.
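
The difference between time-series, cross-sectional, and panel data can be seen in a small R sketch. The data frame below is entirely made up (the tickers AAA, BBB, and CCC and the sales figures are hypothetical); only the column names tic, fyear, and sale follow the naming used in the scripts of Appendix A.

```r
# Panel data: several firms observed over several fiscal years (made-up values)
panel <- data.frame(
  tic   = rep(c("AAA", "BBB", "CCC"), each = 3),
  fyear = rep(2010:2012, times = 3),
  sale  = c(10, 12, 15, 20, 19, 22, 5, 6, 8)
)

# Time-series data: one firm, all years
timeSeries <- panel[panel$tic == "AAA", ]

# Cross-sectional data: all firms, one year
crossSection <- panel[panel$fyear == 2011, ]

timeSeries
crossSection
```

The single-firm extraction described next corresponds to the timeSeries subset, while the industry extraction produces a file shaped like panel.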

B.1.1 Compustat: Single Firm


We will start by learning how to extract financial data needed to generate
profitability measures for Apple, from 2010 up to 2016. This means that we
need the following variables:
1. income before extraordinary items (Compustat code = IB)
2. sales (Compustat code = SALE)
3. operating income before depreciation (Compustat code = OIBDP)
4. cost of goods sold (Compustat code = COGS)
5. total assets (Compustat code = AT)
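
To see why these five variables are needed, the following sketch computes the profitability measures on made-up numbers (the data frame firm and all its values are hypothetical). The gross, operating, and profit margins follow the formulas used in the scripts of Appendix A. ROA is computed here as income before extraordinary items divided by total assets, which is an assumption about the definition, since other variants (for example, using average total assets) are also common.

```r
# Hypothetical annual data for one firm, using lower-case Compustat names
firm <- data.frame(
  fyear = 2010:2012,
  ib    = c(8, 10, 12),
  sale  = c(100, 120, 150),
  oibdp = c(20, 25, 33),
  cogs  = c(60, 70, 90),
  at    = c(200, 220, 250)
)

firm$gm  <- (firm$sale - firm$cogs) / firm$sale  # gross margin
firm$om  <- firm$oibdp / firm$sale               # operating margin
firm$pm  <- firm$ib / firm$sale                  # profit margin
firm$roa <- firm$ib / firm$at                    # ROA (assumed definition)

round(firm[, c("fyear", "gm", "om", "pm", "roa")], 3)
```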
To access Compustat, go to https://wrds-web.wharton.upenn.edu/wrds/
and enter your user name and password (see footnote 1). From the menu
(subscriptions) select Compustat Capital IQ > North America - Daily >
Fundamentals Annual, and in the new window make the following selections:
1. Choose data range
(a) Date Variable: select Fiscal Year
(b) Confine the date range to January 1, 2010 to December 31, 2016.
2. Apply company codes
(a) We need to specify the format of the company codes that we will
use to extract data from Compustat. The most common selections
are Ticker, GVKEY (a unique number assigned to each company
by Compustat), SIC (Standard Industrial Classification), and
NAICS (North American Industry Classification System).
(b) If we don’t know the company’s code, we can use the [Code
Lookup] feature in Compustat. For example, using the code
lookup, we can find that the ticker symbol for Apple is AAPL and
enter it manually, as shown in Figure B.1 (see footnote 2).
Footnote 1: Your class instructor will provide you with the user name and
password. Please do NOT try to change the password! This is a class password,
not an individual password.
Footnote 2: If we want to extract data for multiple companies, we can create a
spreadsheet with company codes and upload it. The spreadsheet must have the
companies sorted by GVKEY or CIK (numerical order) or by Ticker symbol
(alphabetical order). The spreadsheet must be saved as a text file (i.e., with
the extension “.txt”).


Figure B.1: Compustat: Entering Company Code (TIC)

(c) In the Screening Variables section, de-select all output options.
Given the nature of our analysis, we will not use this output.
(d) In the Conditional Statements section, we do not need to specify
any conditions in order to extract sales data for Apple (AAPL).
We will use these conditions in the next stage, when we extract
data for an entire industry.
3. Choose variable types
(a) Under “Select Variable Types”, select “Data Items”.
(b) When it comes to the financial variables that we want to extract,
we have the following options:
i. We can browse by category. For example, if we select the
category “Identifying Information ...”, we will see a list of
available variables. As we can see from Figure B.2, we used
this approach to select Company Name and Ticker Symbol.
ii. The second option is to search for specific variables.
For example, we know that we want to extract Income Before
Extraordinary Items. If we start typing “Income befor...”,
the list of options will show up, and we can select the right
variable (see Figure B.3). We select the remaining variables
by searching for sales (SALE), operating income before
depreciation (OIBDP), cost of goods sold (COGS), and total
assets (AT). In addition to the variables needed for the
calculation of ROA, we also select variables that show the
firm’s industry classification (NAICS and SIC), as well as
the firm’s headquarters location (LOC).
iii. Please notice that financial variables are organized by the
underlying financial statement. We can browse financial
variables available in the balance sheet, income statement,
and cash flow statement. Various items, such as common
shares, auditor’s opinion, and dividends per share, are listed
under Miscellaneous Items or Supplemental Data Items. You
may want to spend some time familiarizing yourself with the
list of variables.

Figure B.2: Compustat: Variable Selection from List

Figure B.3: Compustat: Search for Variable Selection

4. Select query output. In this final section, we make selections regarding
the format of the output file, compression, and date format.
(a) The output format will depend on the nature of the analysis we
are planning to do and the software package that we will use
for the statistical analysis. For data analysis with R, “comma
delimited text (*.csv)” is the best option. However, since this is
just an exercise for learning how to extract data, we will choose
HTML so we can see the output.
(b) For relatively small files, like the one in this exercise, there is no
need to compress the output.
(c) Leave the default selection for the date format.
(d) Since we will be using a common account for the entire class, do
not provide your email and do not save the query.
5. Submit query. Review your choices to make sure they are correct and
click the “submit query” button.
A new window will open showing the query summary (Figure B.4).
It is very important to make sure that the output shown matches the
choices that we have made.
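
For a query that covers several companies, the note on uploading company codes asks for a sorted list saved as a text file. Such a file can be prepared in R; the sketch below is only an illustration. The ticker list and the file name companyCodes.txt are hypothetical, and writing one code per line is an assumption about the layout that the upload expects.

```r
# Hypothetical list of ticker symbols, unsorted and with a duplicate
tickers <- c("MSFT", "AAPL", "GOOGL", "AAPL")

# Remove duplicates and sort alphabetically, as required for ticker symbols
codes <- sort(unique(tickers))

# Save as a plain text file, one code per line (assumed layout)
codeFile <- file.path(tempdir(), "companyCodes.txt")
writeLines(codes, codeFile)

# Check what was written
readLines(codeFile)
```

The file is written to a temporary directory here; in practice we would save it in the working directory before uploading it.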

As you can see in Figure B.4, the output shows the date range that we
have selected (Jan 2010 to Dec 2016), the input code (AAPL), the fact that
there are no constraints, and the seven variables that we have chosen:
company name (CONM), ticker symbol (TIC), total assets (AT), cost of goods
sold (COGS), income before extraordinary items (IB), operating income before
depreciation (OIBDP), and sales (SALE). Notice that industry classifications
(NAICS and SIC) and headquarters location (LOC) do not show as variables
selected.
Once the query has been generated, we click on the link and can see the
results in the browser. The output (a partial screenshot is shown in
Figure B.5) shows the financial data for Apple, starting with fiscal year 2010,
as well as the industry classification (NAICS = 334220 and SIC = 3663; see
footnote 3).
Footnote 3: From the web site of Statistics Canada
(http://www23.statcan.gc.ca/imdb/p3VD.pl?Function=getVD&TVD=307532)
we can find that NAICS = 334220 is for firms in radio and television
broadcasting and wireless communications equipment manufacturing.


Figure B.4: WRDS Data Request Summary

Figure B.5: Compustat Fundamentals Annual for Apple


In the following section, we will use this information (industry classification)
to extract data for the entire industry.

B.1.2 Compustat: Entire Industry


Being able to assess how a company does versus its competitors is very
useful for all accounting (e.g., auditing, management accounting, and
financial statement analysis) and finance (e.g., financial management,
investment analysis) areas. Finding competitors that are 100% comparable to
the firm that we want to assess is almost impossible. One option is to
hand-select a handful of direct competitors. The second option is to use one
of the commonly accepted industry classification systems (e.g., NAICS, SIC),
which try to classify firms in groups based on their operations (see
footnote 4).
As we have seen in Figure B.5, Apple’s NAICS classification is 334220. In
the following paragraphs, we will learn how to extract fundamental annual
data for an entire industry (please keep in mind that Compustat contains only
the publicly traded competitors of Apple). For this we will simply modify
our existing Compustat query as follows:

1. Date range: select Fiscal Year and confine the date range to January
1, 2010 to December 31, 2016.
2. Apply your company codes: instead of searching by a single firm,
select ‘Search the entire database’.
3. Screening Variables: de-select all output options. Given the nature
of our analysis, we will not use this output.
4. We leave the same variables selected as we had them in the Apple query.
Our list of variables includes the following:
(a) company name (CONM),
(b) ticker symbol (TIC),
(c) headquarters location (LOC),
(d) industry classifications (NAICS and SIC),
(e) total assets (AT),
(f) cost of goods sold (COGS),
(g) income before extraordinary items (IB),
(h) operating income before depreciation (OIBDP), and
(i) sales (SALE).
5. Under Conditional Statements we select NAICS from the first drop
down menu and set it equal to 334220 (See Figure B.6).
Footnote 4: Please visit Wikipedia to learn about NAICS and SIC, as well as
other commonly used industry classification systems:
https://en.wikipedia.org/wiki/Industry_classification

Figure B.6: Compustat: Conditional Statement

6. Select query output: we select “comma delimited text (*.csv)” as the
output format. The file is relatively small and there is no need to
compress it. Leave the default selection for the date format.
7. Since we will be using a common account for the entire class, do not
provide your email and do not save the query.
8. Submit query.

As mentioned above, it is very important to verify that the summary of
our query matches the choices that we have made. As we can see in Figure
B.7, the output shows the date range that we have selected (2010-2016), the
fact that we are extracting data from the entire database (-all-), the
constraint (NAICS eq 334220), and all the variables we have chosen.
Finally, we follow the on-screen directions to save the csv file to a local
directory and give it a more useful name (e.g., industryAAPL_2010_2016.csv).
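
Once the file has been saved, the same kind of verification can be repeated in R before any analysis. The sketch below is an illustration under stated assumptions: it reads a few made-up rows from a text string instead of the downloaded file, the firm names and tickers are hypothetical, and the lower-case column names (conm, tic, fyear, naics, sale) are assumed to match the headers in the csv file.

```r
# A few made-up rows standing in for the downloaded csv file
csvText <- "conm,tic,fyear,naics,sale
ALPHA CORP,AAA,2010,334220,10
ALPHA CORP,AAA,2016,334220,15
BETA INC,BBB,2012,334220,20"

dt <- read.csv(text = csvText, stringsAsFactors = FALSE)

# The fiscal years should fall within the date range of the query
range(dt$fyear)

# The file should contain only the industry set in the conditional statement
unique(dt$naics)

# Stop with an error if either check fails
stopifnot(all(dt$fyear >= 2010 & dt$fyear <= 2016),
          all(dt$naics == 334220))
```

With the actual download, the read.csv(text = csvText) call would be replaced by reading the saved file, for example with fread() as in the scripts of Appendix A.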


Figure B.7: Compustat: Data Request Summary - Industry

