2019 Book EssentialsOfBusinessAnalytics PDF
Bhimasankaram Pochiraju
Sridhar Seshadri Editors
Essentials
of Business
Analytics
An Introduction to the Methodology
and its Applications
International Series in Operations Research
& Management Science
Volume 264
Series Editor
Camille C. Price
Stephen F. Austin State University, TX, USA
Editors
Bhimasankaram Pochiraju
Applied Statistics and Computing Lab
Indian School of Business
Hyderabad, Telangana, India

Sridhar Seshadri
Gies College of Business
University of Illinois at Urbana Champaign
Champaign, IL, USA
This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Professor Bhimasankaram: With the divine
blessings of Bhagawan Sri Sri Sri Satya Sai
Baba, I dedicate this book to my parents—Sri
Pochiraju Rama Rao and Smt. Venkata
Ratnamma.
Sridhar Seshadri: I dedicate this book to
the memory of my parents, Smt. Ranganayaki
and Sri Desikachari Seshadri, my
father-in-law, Sri Kalyana Srinivasan
Ayodhyanath, and my dear friend,
collaborator and advisor, Professor
Bhimasankaram.
Contents
1 Introduction
Sridhar Seshadri
Part I Tools
2 Data Collection
Sudhir Voleti
3 Data Management—Relational Database Systems (RDBMS)
Hemanth Kumar Dasararaju and Peeyush Taori
4 Big Data Management
Peeyush Taori and Hemanth Kumar Dasararaju
5 Data Visualization
John F. Tripp
6 Statistical Methods: Basic Inferences
Vishnuprasad Nagadevara
7 Statistical Methods: Regression Analysis
Bhimasankaram Pochiraju and Hema Sri Sai Kollipara
8 Advanced Regression Analysis
Vishnuprasad Nagadevara
9 Text Analytics
Sudhir Voleti
Index
Disclaimer
This book contains information obtained from authentic and highly regarded
sources. Reasonable efforts have been made to publish reliable data and information,
but the author and publisher cannot assume responsibility for the validity of
all materials or the consequences of their use. The authors and publishers have
attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been
obtained. If any copyright material has not been acknowledged, please write and let
us know so that we may rectify this in any future reprint.
Acknowledgements
This book is the outcome of a truly collaborative effort amongst many people who
have contributed in different ways. We are deeply thankful to all the contributing
authors for their ideas and support. The book belongs to them. This book would not
have been possible without the help of Deepak Agrawal. Deepak helped in every
way, from editorial work, solution support, and programming help to coordination
with authors and researchers, among many other things. Soumithri Mamidipudi provided
editorial support, helped with writing summaries of every chapter, and proof-edited
the probability and statistics appendix and cases. Padmavati Sridhar provided
editorial support for many chapters. Two associate alumni of the Certificate
Programme in Business Analytics (CBA) at the Indian School of Business (ISB),
Ramakrishna Vempati and Suryanarayana Ambatipudi, helped with locating contemporary
examples and references. They suggested examples for the Retail Analytics and
Supply Chain Analytics chapters. Ramakrishna also contributed to the draft of the
Big Data chapter. Several researchers in the Advanced Statistics and Computing
Lab (ASC Lab) at ISB helped in many ways. Hema Sri Sai Kollipara provided
support for the cases, exercises, and technical and statistics support for various
chapters. Aditya Taori helped with examples for the machine learning chapters
and exercises. Saurabh Jugalkishor contributed examples for the machine learning
chapters. The ASC Lab’s researchers and Hemanth Kumar provided technical
support in preparing solutions for various examples referred to in the chapters. Ashish
Khandelwal, Fellow Program student at ISB, helped with the chapter on Linear
Regression. Dr. Kumar Eswaran and Joy Mustafi provided additional thoughts for
the Unsupervised Learning chapter. The editorial team comprising Faith Su, Mathew
Amboy and series editor Camille Price gave immense support during the book
proposal stage, guidance during editing, production, etc. The ASC Lab provided
the research support for this project.
We thank our families for their constant support during the 2-year-long project.
We thank each and every person associated with us during the beautiful journey of
writing this book.
Contributors
Sridhar Seshadri
Business analytics is the science of posing and answering data questions related to
business. Business analytics has rapidly expanded in the last few years to include
tools drawn from statistics, data management, data visualization, and machine
learning. There is increasing emphasis on big data handling to assimilate the advances
made in data sciences. As is often the case with applied methodologies, business
analytics has to be soundly grounded in applications in various disciplines and
business verticals to be valuable. The bridge between the tools and the applications
is the set of modeling methods used by managers and researchers in disciplines such as
finance, marketing, and operations. This book provides coverage of all three aspects:
tools, modeling methods, and applications.
The purpose of the book is threefold: to fill the void in graduate-level study
materials on how to translate business problems into data questions, how to obtain
optimal business solutions via analytics theory, and how to ground the solutions in
practice. In order to make the material self-contained, we have endeavored to provide
ample cases and data sets for practice and testing of tools. Each chapter comes
with data, examples, and exercises showing students what questions to ask, how to
apply the techniques using open source software, and how to interpret the results. In
our approach, simple examples are followed with medium to large applications and
solutions. The book can also serve as a self-study guide to professionals who wish
to enhance their knowledge about the field.
The distinctive features of the book are as follows:
• The chapters are written by experts from universities and industry.
• The main software tools used are R, Python, MS Excel, and MySQL. All of them are
current and widely used in industry.
S. Seshadri
Gies College of Business, University of Illinois at Urbana Champaign, Champaign, IL, USA
e-mail: sridhar@illinois.edu
• Extreme care has been taken to ensure continuity from one chapter to the next.
The editors have attempted to make sure that the content and flow are similar in
every chapter.
• In Part A of the book, the tools and modeling methodology are developed in
detail. Then this methodology is applied to solve business problems in various
verticals in Part B. Part C contains larger case studies.
• The Appendices cover required material on Probability theory, R, and Python, as
these serve as prerequisites for the main text.
The structure of each chapter is as follows:
• Each chapter has a business orientation. It starts with business problems, which
are transformed into technological problems. Methodology is developed to solve
the technological problems. Data analysis is done using suitable software and the
output and results are clearly explained at each stage of development. Finally, the
technological solution is transformed back to a business solution. The chapters
conclude with suggestions for further reading and a list of references.
• Exercises (with real data sets when applicable) are at the end of each chapter and
on the Web to test and enhance the understanding of the concepts and application.
• Caselets are used to illustrate the concepts in several chapters.
Data Collection: This chapter introduces the concepts of data collection and
problem formulation. It first establishes the foundation upon which the fields
of data sciences and analytics are based, and defines core concepts that will be used
throughout the rest of the book. The chapter starts by discussing the types of data
that can be gathered, and the common pitfalls that can occur when data analytics
does not take into account the nature of the data being used. It distinguishes between
primary and secondary data sources using examples, and provides a detailed
explanation of the advantages and constraints of each type of data. Following this,
the chapter details the types of data that can be collected and sorted. It discusses the
difference between nominal-, ordinal-, interval-, and ratio-based data and the ways
in which they can be used to obtain insights into the subject being studied.
The chapter then discusses problem formulation and its importance. It explains
how and why formulating a problem will impact the data that is gathered, and
thus affect the conclusions at which a research project may arrive. It describes
a framework by which a messy real-world situation can be clarified so that a
mathematical toolkit can be used to identify solutions. The chapter explains the
idea of decision-problems, which can be used to understand the real world, and
research-objectives, which can be used to analyze decision-problems.
The chapter also details the challenges faced when collecting and collating data.
It discusses the importance of understanding what data to collect, how to collect it,
how to assess its quality, and finally the most appropriate way of collating it so that
it does not lose its value.
The chapter ends with an illustrative example of how the retailing industry might
use various sources of data in order to better serve its customers and understand
their preferences.
Data Management—Relational Database Management Systems: This chapter
introduces the idea of data management and storage. The focus of the chapter
is on relational database management systems, or RDBMS. The RDBMS is the most
commonly used data organization system in enterprises. The chapter introduces and
explains the ideas using MySQL, an open-source relational database management
system that is queried with Structured Query Language (SQL) and is used by
many of the largest data management systems in the world.
The chapter describes the basic functions of a MySQL server, such as creating
databases, examining data tables, and performing functions and various operations
on data sets. The first set of instructions the chapter discusses is about the rules,
definition, and creation of relational databases. Then, the chapter describes how to
create tables and add data to them using MySQL server commands. It explains how
to examine the data present in the tables using the SELECT command.
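The sequence described above (create a table, add rows to it, then examine them with SELECT) can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for a MySQL server, and the table name, column names, and rows are hypothetical. MySQL syntax and data types differ from SQLite in some details.

```python
import sqlite3

# In-memory SQLite database, used here as a stand-in for a MySQL server (assumption).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create a table; the table and column names are hypothetical.
cur.execute(
    "CREATE TABLE customers ("
    "  customer_id INTEGER PRIMARY KEY,"
    "  name TEXT NOT NULL,"
    "  city TEXT)"
)

# Add data to the table.
cur.executemany(
    "INSERT INTO customers (customer_id, name, city) VALUES (?, ?, ?)",
    [(1, "Asha", "Hyderabad"), (2, "Ben", "Champaign"), (3, "Chen", "Hyderabad")],
)
conn.commit()

# Examine the data with SELECT, filtering and ordering the rows.
cur.execute(
    "SELECT name FROM customers WHERE city = ? ORDER BY name", ("Hyderabad",)
)
hyderabad_names = [row[0] for row in cur.fetchall()]
print(hyderabad_names)
conn.close()
```

The same three statements (CREATE TABLE, INSERT, SELECT) are the ones a reader would type at a MySQL prompt, with the parameter placeholders replaced by literal values.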
Data Management—Big Data: This chapter builds on some of the concepts
introduced in the previous chapter but focuses on big data tools. It describes what
really constitutes big data and presents the basics of tools such as Hadoop, Spark,
and the surrounding ecosystem.
The chapter begins by describing Hadoop’s uses and key features, as well as the
programs in its ecosystem that can also be used in conjunction with it. It also briefly
visits the concepts of distributed and parallel computing and big data cloud.
The chapter describes the architecture of the Hadoop runtime environment. It
starts by describing the cluster, which is the set of host machines, or nodes for
facilitating data access. It then moves on to the YARN infrastructure, which is
responsible for providing computational resources to the application. It describes
two main elements of the YARN infrastructure—the Resource Manager and the
Node Manager. It then details the HDFS Federation, which provides storage,
and also discusses other storage solutions. Lastly, it discusses the MapReduce
framework, which is the software layer.
The chapter then describes the functions of MapReduce in detail. MapReduce
divides tasks into subtasks, which it runs in parallel in order to increase efficiency. It
discusses the manner in which MapReduce takes lists of input data and transforms
them into lists of output data, by implementing a “map” process and a “reduce”
process, which it aggregates. It describes in detail the process steps that MapReduce
takes in order to produce the output, and describes how Python can be used to create
a MapReduce process for a word count program.
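The word count idea can be sketched in plain Python to show the map, shuffle, and reduce steps. This is a single-process illustration, not a Hadoop job: the function names and the input lines are hypothetical, and on a real cluster the framework would distribute the map and reduce tasks across nodes.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Shuffle: group the emitted values by key (the word)."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the list of counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


lines = ["big data tools", "big data cloud", "data"]  # hypothetical input
word_counts = reduce_phase(shuffle(map_phase(lines)))
print(word_counts)
```

Because each line is mapped independently and each word is reduced independently, both phases can run in parallel, which is the source of the efficiency gain described above.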
The chapter briefly describes Spark and an application using Spark. It concludes
with a discussion about cloud storage. The chapter uses a distributable Cloudera
virtual machine (VM) for its hands-on exercises.
Data Visualization: This chapter discusses how data is visualized and the way
that visualization can be used to aid in analysis. It starts by explaining that humans
use visuals to understand information, and that using visualizations incorrectly can
lead to mistaken conclusions. It discusses the importance of visualization as a
cognitive aid and the importance of working memory in the brain. It emphasizes
the role of data visualization in reducing the load on the reader.
The chapter details the six meta-rules of data visualization, which are as follows:
use the most appropriate chart, directly represent relationships between data, refrain
from asking the viewer to compare differences in area, never use color on top of
color, keep within the primal perceptions of the viewer, and chart with integrity.
Each rule is expanded upon in the chapter. The chapter discusses the kinds of
graphs and tables available to a visualizer, the advantages and disadvantages of 3D
visualization, and the best practices of color schemes.
Statistical Methods—Basic Inferences: This chapter introduces the fundamental
concepts of statistical inference, such as population parameters and sample statistics,
hypothesis testing, and analysis of variance. It begins by describing the differences
between population and sample means and variances and the methods to calculate
them. It explains the central limit theorem and its use in estimating the mean of a
population.
Confidence intervals are explained for samples in which variance is both
known and unknown. The concept of standard errors and the t- and Chi-squared
distributions are introduced. The chapter introduces hypothesis testing and the use
of statistical parameters to reject or fail to reject hypotheses. Type I and type II errors
are discussed.
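As a small illustration of the known-variance case, the sketch below computes a confidence interval for a mean and runs a two-sided test of a hypothesized mean using the normal distribution. The sample values, the standard deviation, and the function names are hypothetical. The unknown-variance case would replace the normal quantile with a t quantile, which is not shown here.

```python
from math import sqrt
from statistics import NormalDist, mean


def z_confidence_interval(sample, sigma, confidence=0.95):
    """Confidence interval for the population mean when sigma is known."""
    n = len(sample)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided normal quantile
    half_width = z * sigma / sqrt(n)                # quantile times the standard error
    centre = mean(sample)
    return centre - half_width, centre + half_width


def z_test_rejects(sample, sigma, mu0, alpha=0.05):
    """Two-sided z-test: True if H0 (population mean equals mu0) is rejected."""
    z_stat = (mean(sample) - mu0) / (sigma / sqrt(len(sample)))
    p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    return p_value < alpha


sample = [9.8, 10.4, 10.1, 9.9, 10.3, 10.0, 10.2, 9.7]  # hypothetical data
low, high = z_confidence_interval(sample, sigma=0.25)
print(round(low, 3), round(high, 3))
print(z_test_rejects(sample, sigma=0.25, mu0=10.0))
```

Rejecting a true hypothesis is a type I error, whose probability is controlled by alpha. Failing to reject a false one is a type II error.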
Methods to compare two different samples are explained. Analysis of variance
between two samples and within samples is also covered. The use of the
F-distribution in analyzing variance is explained. The chapter concludes with
a discussion of situations in which the means of several populations must be
compared. It explains how to use a technique called “Analysis of Variance (ANOVA)”
instead of carrying out pairwise comparisons.
Statistical Methods—Linear Regression Analysis: This chapter explains the idea
of linear regression in detail. It begins with some examples, such as predicting
newspaper circulation. It uses the examples to discuss the methods by which
linear regression obtains results. It describes a linear regression as a functional
form that can be used to understand relationships between outcomes and input
variables and perform statistical inference. It discusses the importance of linear
regression and its popularity, and explains the basic assumptions underlying linear
regression.
The modeling section begins by discussing a model in which there is only a
single regressor. It explains why a scatter-plot can be useful in understanding single-
regressor models, and the importance of visual representation in statistical inference.
It explains the ordinary least squares method of estimating a parameter, and the use
of the sum of squares of residuals as a measure of the fit of a model. The chapter then
discusses the use of confidence intervals and hypothesis testing in a linear regression
model. These concepts are used to describe a linear regression model in which there
are multiple regressors, and the changes that are necessary to adjust a single linear
regression model to a multiple linear regression model.
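For the single-regressor case, the ordinary least squares estimates have a closed form, and the sum of squared residuals measures the fit. The sketch below is a minimal illustration with hypothetical data and function names. It is not the chapter's own worked example.

```python
def fit_simple_ols(x, y):
    """Least-squares intercept b0 and slope b1 for the model y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx             # slope: co-variation of x and y over variation of x
    b0 = y_bar - b1 * x_bar    # the fitted line passes through the point of means
    return b0, b1


def sum_squared_residuals(x, y, b0, b1):
    """Sum of squared differences between observed and fitted responses."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


x = [1.0, 2.0, 3.0, 4.0, 5.0]    # hypothetical regressor values
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # hypothetical responses
b0, b1 = fit_simple_ols(x, y)
ssr = sum_squared_residuals(x, y, b0, b1)
print(round(b0, 3), round(b1, 3), round(ssr, 4))
```

Any other choice of intercept and slope gives a larger sum of squared residuals on the same data, which is what "least squares" means.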
The chapter then describes the ways in which the basic assumptions of the linear
regression model may be violated, and the need for further analysis and diagnostic
tools. It uses the famous Anscombe data sets in order to demonstrate the existence
of phenomena such as outliers and collinearity that necessitate further analysis. The
methods needed to deal with such problems are explained. The chapter considers
the ways in which the necessity for the use of such methods may be determined,
such as tools to determine whether some data points should be deleted or excluded
from the data set. The possible advantages and disadvantages of adding additional
regressors to a model are described. Dummy variables and their use are explained.
Examples are given for the case where there is only one category of dummy, and
then multiple categories.
The chapter then discusses assumptions regarding the error term. The effect of
the assumption that the error term is normally distributed is discussed, and the Q-Q
plot method of examining the truth of this assumption for the data set is explained.
The Box–Cox method of transforming the response variable in order to normalize
the error term is discussed. The chapter then discusses the idea that the error terms
may not have equal variance, that is, may not be homoscedastic. It explains possible
reasons for heteroscedasticity, and the ways to adapt the analysis to those situations.
The chapter considers the methods by which the regression model can be
validated. The root mean square error is introduced. Segmenting the data into
training and validation sets is explained. Finally, some frequently asked questions
are presented, along with exercises.
Statistical Methods—Advanced Regression: Three topics are covered in this
chapter. In the main body of the chapter, the tools for estimating the parameters of
regression models when the response variable is binary or categorical are presented.
The appendices to the chapter cover two other important techniques, namely,
maximum likelihood estimation (MLE) and how to deal with missing data.
The chapter begins with a description of logistic regression models. It continues
with diagnostics of logistic regression, including likelihood ratio tests, the Wald
test, and the Hosmer–Lemeshow test. It then discusses different pseudo R-squared
measures, such as Cox and Snell, Nagelkerke, and McFadden. Then, it discusses how to
choose the cutoff probability for classification, including discussion of discordant and
concordant pairs, the ROC curve, and Youden’s index. It concludes with a similar
discussion of the multinomial logistic function and regression. The chapter contains
a self-contained introduction to the maximum likelihood method and methods for
treating missing data. The ideas introduced in this chapter are used in several
following chapters in the book.
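Choosing a cutoff probability with Youden's index can be illustrated as follows. The index is J = sensitivity + specificity - 1, and the cutoff with the largest J is selected. The predicted probabilities, labels, candidate cutoffs, and function name below are hypothetical. In practice the probabilities would come from a fitted logistic regression model.

```python
def youden_best_cutoff(probs, labels, cutoffs):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1."""
    positives = sum(labels)
    negatives = len(labels) - positives
    best = None
    for c in cutoffs:
        predicted = [1 if p >= c else 0 for p in probs]
        true_pos = sum(1 for yhat, y in zip(predicted, labels) if yhat == 1 and y == 1)
        true_neg = sum(1 for yhat, y in zip(predicted, labels) if yhat == 0 and y == 0)
        j = true_pos / positives + true_neg / negatives - 1
        if best is None or j > best[1]:
            best = (c, j)
    return best


# Hypothetical predicted probabilities and observed classes (1 = event).
probs = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
cutoffs = [0.3, 0.38, 0.5, 0.75]
best_cutoff, best_j = youden_best_cutoff(probs, labels, cutoffs)
print(best_cutoff, best_j)
```

Each candidate cutoff corresponds to one point on the ROC curve, and J is the vertical distance of that point above the diagonal.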
Text Analytics: This is the first of several chapters that introduce specialized
analytics methods depending on the type of data and analysis. This chapter begins
by considering various motivating examples for text analysis. It explains the need
for a process by which unstructured text data can be analyzed, and the ways that
it can be used to improve business outcomes. It describes in detail the manner in
which Google used its text analytics software and its database of searches to identify
vectors of H1N1 flu. It lists out the most common sources of text data, with social
media platforms and blogs producing the vast majority.
The second section of the chapter concerns the ways in which text can be
analyzed. It describes two approaches: a “bag-of-words” approach, in which the
structure of the language is not considered important, and a “natural-language”
approach, in which structure and phrases are also considered.
The example of a retail chain surveying responses to a potential ice-cream
product is used to introduce some terminology. It uses this example to describe
the problems of analyzing sentences due to the existence of grammatical rules, such
as the abundance of articles or the different tense forms of verbs. Various methods
of dealing with these problems are introduced. The term-document matrix (TDM)
is introduced along with its uses, such as generation of wordclouds.
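A term-document matrix can be built with a few lines of Python, as sketched below. The chapter itself works in R. The documents, the tiny stopword list, and the function names here are hypothetical, and the sketch drops articles and other stopwords but does not handle tense forms, which would need stemming or lemmatization.

```python
from collections import Counter

STOPWORDS = {"the", "a", "is", "and", "it"}  # tiny illustrative stopword list


def tokenize(text):
    """Lowercase, strip simple punctuation, and drop stopwords."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in words if w and w not in STOPWORDS]


def term_document_matrix(docs):
    """Return (terms, matrix), where matrix[i][j] counts term i in document j."""
    counts = [Counter(tokenize(d)) for d in docs]
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix


docs = [  # hypothetical survey responses
    "The ice-cream is creamy and sweet.",
    "Sweet, sweet flavour!",
    "It is creamy.",
]
terms, tdm = term_document_matrix(docs)
for term, row in zip(terms, tdm):
    print(term, row)
```

Summing each row gives the overall term frequencies that a wordcloud would display.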
The third and fourth sections of the chapter describe how to run text analysis
and some elementary applications. The text walks through a basic use of the
program R to analyze text. It looks at two ways that the TDM can be used to run
text analysis—using a text-base to cluster or segment documents, and elementary
sentiment analysis.
Clustering documents is a method by which similar customers are sorted into
the same group by analyzing their responses. Sentiment analysis is a method by
which attempts are made to make value judgments and extract qualitative responses.
The chapter describes the models for both processes in detail with regard to an
example.
The fifth section of the chapter then describes the more advanced technique
of latent topic mining. Latent topic mining aims to identify themes present in a
corpus, or a collection of documents. The chapter uses the example of the mission
statements of Fortune-1000 firms in order to identify some latent topics.
The sixth section of the chapter concerns natural-language processing (NLP).
NLP is a set of techniques that enables computers to understand nuances in human
languages. The method by which NLP programs detect data is discussed. The ideas
of this chapter are further explored in the chapter on Deep Learning. The chapter
ends with exercises for the student.
Simulation: This chapter introduces the uses of simulation as a tool for analytics,
focusing on the example of a fashion retailer. It explains the use of Monte Carlo
simulation in the presence of uncertainty as an aid to making decisions that have
various trade-offs.
First, the chapter explains the purposes of simulation, and the ways it can be used
to design an optimal intervention. It differentiates between computer simulation,
which is the main aim of the chapter, and physical simulation. It discusses the
advantages and disadvantages of simulations, and mentions various applications of
simulation in real-world contexts.
The second part of the chapter discusses the steps that are followed in making a
simulation model. It explains how to identify dependent and independent variables,
and the manner in which the relationships between those variables can be modeled.
It describes the method by which input variables can be randomly generated,
and the output of the simulation can be interpreted. It illustrates these steps
using the example of a fashion retailer that needs to make a decision about
production.
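These steps can be sketched as a small Monte Carlo simulation of a production decision. Everything specific below is an assumption made for illustration: the normal demand distribution, the price, cost, and salvage values, and the candidate quantities are hypothetical and are not taken from the chapter's fashion retailer example.

```python
import random


def simulate_profit(order_qty, n_trials=10_000, seed=42):
    """Average profit over simulated demand for a given production quantity."""
    rng = random.Random(seed)                   # fixed seed so runs are repeatable
    price, cost, salvage = 100.0, 40.0, 10.0    # hypothetical economics per unit
    total = 0.0
    for _ in range(n_trials):
        demand = max(0.0, rng.gauss(1000, 200))  # assumed demand distribution
        sold = min(order_qty, demand)
        unsold = order_qty - sold
        total += price * sold + salvage * unsold - cost * order_qty
    return total / n_trials


# Decision variable: production quantity. Random input: demand. Output: profit.
candidates = [600, 1100, 1600]
results = {q: simulate_profit(q) for q in candidates}
best_qty = max(results, key=results.get)
print(results, best_qty)
```

Using the same seed for every candidate quantity means each option faces the same simulated demand values, which makes the comparison between options less noisy.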
The third part of the chapter describes decision-making under uncertainty and
the ways that simulation can be used. It describes how to set out a range of possible
interventions and how they can be modeled using a simulation. It discusses how to
use simulation processes in order to optimize decision-making under constraints, by
using the fashion retailer example in various contexts.
The chapter also contains a case study of a painting business deciding how much
to bid for a contract to paint a factory, and describes the solution to making this
decision. The concepts explained in this chapter are applied in different settings in
the following chapters.
Optimization: Optimization techniques are used in almost every application
in this book. This chapter presents some of the core concepts of constrained
optimization. The basic ideas are illustrated using one broad class of optimization
problems called linear optimization. Linear optimization covers the most widely
used models in business. In addition, because linear models are easy to visualize in
two dimensions, it offers a visual introduction to the basic concepts in optimization.
Additionally, the chapter provides a brief introduction to other optimization models
and techniques such as integer/discrete optimization, nonlinear optimization, search
methods, and the use of optimization software.
The linear optimization part is conventionally developed by describing the decision
variables, the objective function, constraints, and the assumptions underlying
the linear models. Using geometric arguments, it illustrates the concept of feasibility
and optimality. It then provides the basic theorems of linear programming. The
chapter then develops the idea of shadow prices, reduced costs, and sensitivity
analysis, which is the underpinning of any post-optimality business analysis. The
solver function in Excel is used for illustrating these ideas. Then, the chapter
explains how these ideas extend to integer programming and provides an outline
of the branch and bound method with examples. The ideas are further extended
to nonlinear optimization via examples of models for linear regression, maximum
likelihood estimation, and logistic regression.
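The geometric view of feasibility and optimality in two dimensions can be illustrated with a short sketch. It relies on the standard result that a linear program with a bounded, non-empty feasible region attains its optimum at a corner point, so it simply checks every intersection of two constraint boundaries. The product-mix numbers and the function name are hypothetical, and this brute-force approach is a teaching device, not a substitute for Excel's Solver or other optimization software.

```python
from itertools import combinations


def solve_lp_2d(c, constraints):
    """Maximize c[0]*x + c[1]*y subject to a*x + b*y <= r for each (a, b, r).

    Checks every intersection of two constraint boundaries and keeps the best
    feasible one; assumes the feasible region is bounded and non-empty.
    """
    best_point, best_value = None, float("-inf")
    for (a1, b1, r1), (a2, b2, r2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < 1e-12:           # parallel boundaries: no unique intersection
            continue
        x = (r1 * b2 - r2 * b1) / det  # Cramer's rule for the 2x2 system
        y = (a1 * r2 - a2 * r1) / det
        feasible = all(a * x + b * y <= r + 1e-9 for a, b, r in constraints)
        value = c[0] * x + c[1] * y
        if feasible and value > best_value:
            best_point, best_value = (x, y), value
    return best_point, best_value


# Hypothetical product-mix problem: maximize 3x + 5y.
constraints = [
    (1, 0, 4),    # x <= 4
    (0, 2, 12),   # 2y <= 12
    (3, 2, 18),   # 3x + 2y <= 18
    (-1, 0, 0),   # x >= 0, written as -x <= 0
    (0, -1, 0),   # y >= 0, written as -y <= 0
]
point, value = solve_lp_2d((3, 5), constraints)
print(point, value)
```

The constraints that hold with equality at the optimal corner are the binding ones, and these are the constraints for which shadow prices can be nonzero.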
Forecasting Analytics: Forecasting is perhaps the most commonly used method
in business analytics. This chapter introduces the idea of using analytics to predict
the outcomes in the future, and focuses on applying analytics tools for business and
operations. The chapter begins by explaining the difficulty of predicting the future
with perfect accuracy, and the importance of accepting the uncertainty inherent in
any predictive analysis.
The chapter then defines forecasting as estimating in unknown situations.
It describes data that can be used to make forecasts, but focuses on time-series
forecasting. It introduces the concepts of point-forecasts and prediction intervals,
which are used in time-series analysis as part of predictions of future outcomes. It
suggests reasons for the intervention of human judgment in the forecasts provided
by computers. It describes the core method of time-series forecasting—identifying
a model that forecasts the best.
The chapter then defines the survival analysis functions: the survival function and
the hazard function. It describes some simple types of hazard functions. It describes
some parametric and nonparametric methods of analysis, and defines the cases in
which nonparametric methods must be used. It explains the Kaplan–Meier method
in detail, along with an example. Semiparametric models are introduced for cases
in which several covariate variables are believed to contribute to survival. Cox’s
proportional hazards model and its interpretation are discussed.
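The Kaplan–Meier estimate of the survival function can be computed directly from durations and censoring flags, as in the sketch below. The data and the function name are hypothetical. At each distinct event time, the running survival probability is multiplied by one minus the fraction of at-risk subjects who experienced the event.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: time until the event or until censoring.
    observed: True if the event (for example, churn) was seen, False if censored.
    """
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    survival, curve = 1.0, []
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


durations = [2, 3, 3, 5, 6, 8]                       # hypothetical data
observed = [True, True, False, True, False, True]    # False marks a censored record
km_curve = kaplan_meier(durations, observed)
for time, prob in km_curve:
    print(time, round(prob, 4))
```

Censored records never count as events, but they remain in the at-risk set until their censoring time, which is how the method uses partial information.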
The chapter ends with a comparison between semiparametric and parametric
models, and a case study regarding churn data.
Unsupervised Learning: The first of the three machine learning chapters sets
out the philosophy of machine learning. This chapter explains why unsupervised
learning—an important paradigm in machine learning—is akin to uncovering the
proverbial needle in the haystack, discovering the grammar of the process that
generated the data, and exaggerating the “signal” while ignoring the “noise” in it.
The chapter covers methods of projection, clustering, and density estimation—three
core unsupervised learning frameworks that help us perceive the data in different
ways. In addition, the chapter describes collaborative filtering and applications of
network analysis.
The chapter begins by drawing the distinction between supervised and unsupervised
learning. It then presents a common approach to solving unsupervised learning
problems by casting them into an optimization framework. In this framework, there
are four steps:
• Intuition: to develop an intuition about how to approach the problem as an
optimization problem
• Formulation: to write the precise mathematical objective function in terms of data
using intuition
• Modification: to modify the objective function into something simpler or “more
solvable”
• Optimization: to solve the final objective function using traditional optimization
approaches
The chapter discusses principal components analysis (PCA), self-organizing
maps (SOM), and multidimensional scaling (MDS) under projection algorithms.
In clustering, it describes partitional and hierarchical clustering. Under density
estimation, it describes nonparametric and parametric approaches. The chapter
concludes with illustrations of collaborative filtering and network analysis.
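As one way to see the optimization framing applied to partitional clustering, the sketch below implements k-means (Lloyd's algorithm) in one dimension. k-means is chosen here only as a familiar example and the chapter's own treatment may differ. The data, starting centers, and function name are hypothetical. The objective is the sum of squared distances from each point to its assigned center, and the algorithm alternates two simpler steps that never increase it.

```python
def kmeans_1d(points, centers, n_iter=20):
    """Lloyd's algorithm in one dimension.

    Objective: minimize the sum of squared distances from each point to its
    assigned cluster center, alternating assignment and center updates.
    """
    centers = list(centers)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: (p - centers[k]) ** 2)
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its assigned points.
        new_centers = [
            sum(c) / len(c) if c else centers[k] for k, c in enumerate(clusters)
        ]
        if new_centers == centers:   # no change, so the algorithm has converged
            break
        centers = new_centers
    return centers, clusters


points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]   # hypothetical data
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
print(centers, clusters)
```

The result depends on the starting centers, so in practice the procedure is usually run from several starting points.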
Supervised Learning: In supervised learning, the aim is to learn from previously
identified examples. The chapter covers the philosophical, theoretical, and practical
aspects of one of the most common machine learning paradigms—supervised
learning—that essentially learns to map from an observation (e.g., symptoms and
test results of a patient) to a prediction (e.g., disease or medical condition), which
in turn is used to make decisions (e.g., prescription). The chapter then explores the
process, science, and art of building supervised learning models.
The first part explains the different paradigms in supervised learning (classification,
regression, retrieval, and recommendation) and how they differ by the nature
of their input and output. It then describes the process of learning, from feature
description to feature engineering to models to algorithms that help make the
learning happen.
Among algorithms, the chapter describes rule-based classifiers, decision trees, k-
nearest neighbor, Parzen window, and Bayesian and naïve Bayes classifiers. Among
discriminant functions that partition a region using an algorithm, linear (LDA) and
quadratic discriminant analysis (QDA) are discussed. A section describes recommendation
engines. Neural networks are then introduced followed by a succinct
introduction to a key algorithm called support vector machines (SVM). The chapter
concludes with a description of ensemble techniques, including bagging, random
forest, boosting, mixture of experts, and hierarchical classifiers. The specialized
neural networks for Deep Learning are explained in the next chapter.
Deep Learning: This chapter introduces the idea of deep learning as a part of
machine learning. It aims to explain the idea of deep learning and various popular
deep learning architectures. It has four main parts:
• Understand what deep learning is.
• Understand various popular deep learning architectures, and know when to use
which architecture for solving a business problem.
• How to perform image analysis using deep learning.
• How to perform text analysis using deep learning.
The chapter explains the origins of learning, from a single perceptron that mimics
the functioning of a neuron to the multilayered perceptron (MLP). It briefly recaps
the backpropagation algorithm and introduces the learning rate and error functions.
It then discusses the deep learning architectures applied to supervised, unsupervised,
and reinforcement learning. An example of using an artificial neural network for
recognizing handwritten digits (based on the MNIST data set) is presented.
The next section of the chapter describes Convolutional Neural Networks (CNN),
which are aimed at solving vision-related problems. The ImageNet data set is
introduced. The use of CNNs in the ImageNet Large Scale Visual Recognition
Challenge is explained, along with a brief history of the challenge. The biological
inspiration for CNNs is presented. Four layers of a typical CNN are introduced—
the convolution layer, the rectified linear units layer, the pooling layers, and the fully
connected layer. Each layer is explained, with examples. A unifying example using
the same MNIST data set is presented.
The third section of the chapter discusses recurrent neural networks (RNNs).
It begins by describing the motivation for sequence learning models, and their
use in understanding language. Traditional language models and their functions in
predicting words are explained. The chapter describes a basic RNN model with
three units, aimed at predicting the next word in a sentence. It walks through a detailed
example of how an RNN can be built for next word prediction. It presents some
uses of RNNs, such as image captioning and machine translation.
The next seven chapters contain descriptions of analytics usage in different
domains and different contexts. These are described next.
Retail Analytics: The chapter begins by introducing the background and definition
of retail analytics. It focuses on advanced analytics. It explains the use of analytics in four
main categories of business decisions: consumer, product, human resources, and
advertising. Several examples of retail analytics are presented, such as increasing
book recommendations during periods of cold weather. Complications in retail
analytics are discussed.
The second part of the chapter focuses on data collection in the retail sector. It
describes the traditional sources of retail data, such as point-of-sale devices, and
how they have been used in decision-making processes. It also discusses advances
in technology and the way that new means of data collection have changed the field.
These include the use of radio frequency identification technology, the Internet of
things, and Bluetooth beacons.
The third section describes methodologies, focusing on inventory, assortment,
and pricing decisions. It begins with modeling product-based demand in order
to make predictions. The L1-penalized regression method LASSO for retail demand
forecasting is introduced. The use of regression trees and artificial neural networks
is discussed in the same context. The chapter then discusses the use of such forecasts
in decision-making. It presents evidence that machine learning approaches benefit
revenue and profit in both price-setting and inventory-choice contexts.
Demand models into which consumer choice is incorporated are introduced.
The multinomial logit, mixed multinomial logit, and nested logit models are
described. Nonparametric choice models are also introduced as an alternative to
logit models. Optimal assortment decisions using these models are presented.
Attempts at learning customer preferences while optimizing assortment choices are
described.
The fourth section of the chapter discusses business challenges and opportunities.
The benefits of omnichannel retail are discussed, along with the need for retail
analytics to change in order to fit an omnichannel shop. It also discusses some recent
start-ups in the retail analytics space and their focuses.
Marketing Analytics: Marketing is one of the most important, historically earliest,
and most fascinating areas for applying analytics to solve business problems.
Due to the vast array of applications, only the most important ones are surveyed
in this chapter. The chapter begins by explaining the importance of using marketing
analytics for firms. It defines the various levels that marketing analytics can apply to:
the firm, the brand or product, and the customer. It introduces a number of processes
and models that can be used in analyzing and making marketing decisions, including
statistical analysis, nonparametric tools, and customer analysis. The processes
and tools discussed in this chapter will help in various aspects of marketing
such as target marketing and segmentation, price and promotion, customer valua-
tion, resource allocation, response analysis, demand assessment, and new product
development.
The second section of the chapter explains the use of the interaction effect
in regression models. Building on earlier chapters on regression, it explains the
utility of a term that captures the effect of one or more interactions between other
variables. It explains how to interpret the new variables and their significance. The modeling
of curvilinear relationships in order to identify curvilinear effects is discussed.
Mediation analysis is introduced, along with an example.
The third section describes data envelopment analysis (DEA), which is aimed at
improving the performance of organizations. It describes the manner in which DEA
works to present targets to managers and can be used to answer key operational
questions in Marketing: sales force productivity, performance of sales regions, and
effectiveness of geomarketing.
The next topic covered is conjoint analysis. It explains how knowing customers’
preferences provides invaluable information about how customers think and make
their decisions before purchasing products. Thus, it helps firms devise their market-
ing strategies including advertising, promotion, and sales activities.
The fifth section of the chapter discusses customer analytics. Customer lifetime
value (CLV), a measure of the value provided to firms by customers, is introduced,
along with some other measures. A method to calculate CLV is presented, along
with its limitations. The chapter also discusses two more measures of customer
value: customer referral value and customer influence value, in detail. Additional
topics are covered in the chapters on retail analytics and social media analytics.
Financial Analytics: Financial analytics, like marketing, has been a big consumer
of data. The topics chosen in this chapter provide one unified way of thinking
about analytics in this domain—valuation. This chapter focuses on the two main
branches of quantitative finance: the risk-neutral or “Q” world and the risk-averse
or “P” world. It describes the constraints and aims of analysts in each world, along
with their primary methodologies. It explains Q-quant theories such as the work of
Black and Scholes, and Harrison and Pliska. P-quant theories such as net present
value, capital asset pricing models, arbitrage pricing theory, and the efficient market
hypothesis are presented.
The methodology of financial data analytics is explained via a three-stage
process: asset price estimation, risk management, and portfolio analysis.
Asset price estimation is explained as a five-step process. It describes the use
of the random walk in identifying the variable to be analyzed. Several methods of
transforming the variable into one that is independent and identically distributed
are presented. A maximum likelihood estimation method to model variance is
explained. Monte Carlo simulations of projecting variables into the future are
discussed, along with pricing projected variables.
Risk management is discussed as a three-step process. The first step is risk
aggregation. Copula functions and their uses are explained. The second step,
portfolio assessment, is explained by using metrics such as Value at Risk. The third
step, attribution, is explained. Various types of capital at risk are listed.
Portfolio analysis is described as a two-stage process. Allocating risk for the
entire portfolio is discussed. Executing trades in order to move the portfolio to a
new risk/return level is explained.
A detailed example explaining each of the ten steps is presented, along with data
and code in MATLAB. This example also serves as a stand-alone case study on
financial analytics.
The last three chapters of the book contain case studies. Each of the cases comes
with a large data set upon which students can practice almost every technique and
modeling approach covered in the book. The Info Media case study explains the use
of viewership data to design promotional campaigns. The problem presented is to
determine a multichannel ad spots allocation in order to maximize “reach” given
a budget and campaign guidelines. The approach uses simulation to compute the
viewership and then uses the simulated data to link promotional aspects to the total
reach of a campaign. Finally, the model can be used to optimize the allocation of
budgets across channels.
The AAA airline case study illustrates the use of choice models to design airline
offerings. The main task is to develop a demand forecasting model, which predicts
the passenger share for every origin–destination pair (O–D pair) given AAA’s, as
well as competitors’, offerings. The students are asked to explore different models
including the MNL and machine learning algorithms. Once a demand model has
been developed it can be used to diagnose the current performance and suggest
various remedies, such as adding, dropping, or changing itineraries in specific city
pairs. The third case study, Ideal Insurance, is on fraud detection. The problem faced
by the firm is the growing cost of servicing and settling claims in their healthcare
practice. The students learn about the industry and its intricate relationships with
various stakeholders. They also get an introduction to rule-based decision support
systems. The students are asked to create a system for detecting fraud, which should
be superior to the current “rule-based” system.
This book is the first of its kind in both breadth and depth of coverage and serves as
a textbook for students of first-year graduate programs in analytics and long-duration
(1-year part-time) certificate programs in business analytics. It also serves as a
perfect guide for practitioners.
The content is based on the curriculum of the Certificate Programme in Business
Analytics (CBA), now renamed the Advanced Management Programme in Business
Analytics (AMPBA), of the Indian School of Business (ISB). The original curriculum
was created by Galit Shmueli. The curriculum was further developed by the
coeditors, Bhimasankaram Pochiraju and Sridhar Seshadri, who were responsible
for starting and mentoring the CBA program in ISB. Bhimasankaram Pochiraju has
been the Faculty Director of CBA since its inception and was a member of the
Academic Board. Sridhar Seshadri managed the launch of the program and since
then has chaired the academic development efforts. Based on the industry needs,
the curriculum continues to be modified by the Academic Board of the Applied
Statistics and Computing Lab (ASC Lab) at ISB.
Part I
Tools
Chapter 2
Data Collection
Sudhir Voleti
1 Introduction
Collecting data is the first step towards analyzing it. In order to understand and solve
business problems, data scientists must have a strong grasp of the characteristics of
the data in question. How do we collect data? What kinds of data exist? Where
is it coming from? Before beginning to analyze data, analysts must know how to
answer these questions. In doing so, we build the base upon which the rest of our
examination follows. This chapter aims to introduce and explain the nuances of data
collection, so that we understand the methods we can use to analyze it.
In 2017, video-streaming company Netflix Inc. was worth more than $80 billion,
more than 100 times its value when it listed in 2002. The company’s current position
as the market leader in the online-streaming sector is a far cry from its humble
beginning as a DVD rental-by-mail service founded in 1997. So, what had driven
Netflix’s incredible success? What helped its shares, priced at $15 each on their
initial public offering in May 2002, rise to nearly $190 in July 2017? It is well
known that a firm’s [market] valuation is the sum total in today’s money, or the net
present value (NPV), of all the profits the firm will earn over its lifetime. So investors
reckon that Netflix is worth tens of billions of dollars in profits over its lifetime.
Why might this be the case? After all, companies had been creating television and
cinematic content for decades before Netflix came along, and Netflix did not start
its own online business until 2007. Why is Netflix different from traditional cable
companies that offer shows on their own channels?
S. Voleti
Indian School of Business, Hyderabad, Telangana, India
e-mail: sudhir_voleti@isb.edu
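The valuation idea used in the paragraph above, market value as the net present value of lifetime profits, can be made concrete with a small sketch. The function name, the yearly profit figures, and the 10% discount rate are all assumptions chosen for illustration; they are not Netflix figures and not a method taken from this chapter.

```python
def net_present_value(cash_flows, discount_rate):
    """Discount a sequence of future yearly profits back to today's money.

    cash_flows[0] is the profit expected one year from now, cash_flows[1]
    the profit two years from now, and so on.
    """
    return sum(
        profit / (1.0 + discount_rate) ** year
        for year, profit in enumerate(cash_flows, start=1)
    )


# Hypothetical numbers for illustration only (not Netflix figures):
# profits of 100, 110, and 121 over three years, discounted at 10% per year.
example_profits = [100.0, 110.0, 121.0]
example_npv = net_present_value(example_profits, discount_rate=0.10)
print(round(example_npv, 2))
```

Each of the three hypothetical profits is worth about 90.91 in today's money, so a firm expected to earn them would be valued at roughly three times that amount under this simplified view.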
Moreover, the vast majority of Netflix’s content is actually owned by its
competitors. Though the streaming company invests in original programming, the
lion’s share of the material available on Netflix is produced by cable companies
across the world. Yet Netflix has access to one key asset that helps it to predict
where its audience will go and understand their every quirk: data.
Netflix can track every action that a customer makes on its website—what they
watch, how long they watch it for, when they tune out, and most importantly, what
they might be looking for next. This data is invaluable to its business—it allows the
company to target specific niches of the market with unerring accuracy.
On February 1, 2013, Netflix debuted House of Cards—a political thriller starring
Kevin Spacey. The show was a hit, propelling Netflix’s viewership and proving
that its online strategy could work. A few months later, Spacey applauded Netflix’s
approach and credited its use of data for its ability to take a risk on a project that every
other major television studio network had declined. Spacey said in Edinburgh, at the
Guardian Edinburgh International Television Festival¹ on August 22: “Netflix was
the only company that said, ‘We believe in you. We have run our data, and it tells us
our audience would watch this series.’”
Netflix’s data-oriented approach is key not just to its ability to pick winning
television shows, but to its global reach and power. Though competitors are
springing up the world over, Netflix remains at the top of the pack, and so long
as it is able to exploit its knowledge of how its viewers behave and what they prefer
to watch, it will remain there.
Let us take another example. The technology “cab” company Uber has taken the
world by storm in the past 5 years. In 2014, Uber’s valuation was a mammoth 40
billion USD, which by 2015 jumped another 50% to reach 60 billion USD. This
fact raises the question: what makes Uber so special? What competitive advantage,
strategic asset, and/or enabling platform accounts for Uber’s valuation numbers?
The investors reckon that Uber is worth tens of billions of dollars in profits over
its lifetime. Why might this be the case? Uber is after all known as a ride-sharing
business—and there are other cab companies available in every city.
We know that Uber is “asset-light,” in the sense that it does not own the cab fleet
or have drivers of the cabs on its direct payroll as employees. It employs a franchise
model wherein drivers bring their own vehicles and sign up for Uber. Yet Uber
does have one key asset that it actually owns, one that lies at the heart of its profit
projections: data. Uber owns all rights to every bit of data from every passenger,
every driver, every ride and every route on its network. Curious as to how much
data we are talking about? Consider this. Uber took 6 years to reach one billion
rides (Dec 2015). Six months later, it had reached the two billion mark. That is one
billion rides in 180 days, or roughly 5.5 million rides/day. How did having consumer data
play a factor in the exponential growth of a company such as Uber? Moreover, how
does data connect to analytics and, finally, to market value?
spacey-speech-why-netflix-model-can-save-television-video-full-transcript-1401970) accessed
on Sep 13, 2018.
Data is a valuable asset that helps build sustainable competitive advantage. It
enables what economists would call “supernormal profits” and thereby plausibly
justifies some of those wonderful valuation numbers we saw earlier. Uber had help,
of course. The nature of demand for its product (contractual personal transportation),
the ubiquity of its enabling platform (location-enabled mobile devices), and
the profile of its typical customers (the smartphone-owning, convenience-seeking
segment) have all contributed to its success. However, that does not take away from
the central point being motivated here—the value contained in data, and the need to
collect and corral this valuable resource into a strategic asset.
A well-known management adage goes, “We can only manage what we can mea-
sure.” But why is measurement considered so critical? Measurement is important
because it precedes analysis, which in turn precedes modeling. And more often than
not, it is modeling that enables prediction. Without prediction (determination of
the values an outcome or entity will take under specific conditions), there can be
no optimization. And without optimization, there is no management. The quantity
that gets measured is reflected in our records as “data.” The word data comes
from the Latin datum, meaning “given.” Thus, data (the plural of datum) are facts
which are given or known to be true. In what follows, we will explore some
preliminary conceptions about data, types of data, basic measurement scales, and
the implications therein.
Data collection for research and analytics can broadly be divided into two major
types: primary data and secondary data. Consider a project or a business task that
requires certain data. Primary data would be data that is collected “at source” (hence,
primary in form) and specifically for the research at hand. The data source could
be individuals, groups, organizations, etc. and data from them would be actively
elicited or passively observed and collected. Thus, surveys, interviews, and focus
groups all fall under the ambit of primary data. The main advantage of primary data
is that it is tailored specifically to the questions posed by the research project. The
disadvantages are cost and time.
On the other hand, secondary data is that which has been previously collected
for a purpose that is not specific to the research at hand. For example, sales records,
industry reports, and interview transcripts from past research are data that would
continue to exist whether or not the project at hand had come to fruition. A good
example of a means to obtain secondary data that is rapidly expanding is the API
(Application Programming Interface)—an interface that is used by developers to
securely query external systems and obtain a myriad of information.
In this chapter, we concentrate on data available in published sources and
websites (often called secondary data sources) as these are the most commonly used
data sources in business today.
The data and its analysis can also be classified on the basis of whether a single
unit is observed over multiple time points (time-series data), many units are observed
once (cross-sectional data), or many units are observed over multiple time periods
(panel data). The insights that can be drawn from the data depend on the nature
of data, with the richest insights available from panel data. The panel could be
balanced (all units are observed over all time periods) or unbalanced (observations
on a few units are missing for a few time points either by design or by accident).
If not too much data is missing, the gaps can be handled using the methods
described in Chap. 8.
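The classification above (time-series, cross-sectional, balanced or unbalanced panel) can be checked mechanically from a list of (unit, time period) pairs. The following is a minimal sketch using only the Python standard library; the function name `classify_dataset` and the store and quarter labels are hypothetical and serve only as illustration.

```python
from collections import defaultdict


def classify_dataset(observations):
    """Classify (unit, period) observations by their structure.

    observations: iterable of (unit_id, time_period) pairs, one per record.
    Returns "time-series", "cross-sectional", "balanced panel",
    or "unbalanced panel".
    """
    periods_by_unit = defaultdict(set)
    for unit, period in observations:
        periods_by_unit[unit].add(period)
    if not periods_by_unit:
        raise ValueError("no observations supplied")

    all_periods = set().union(*periods_by_unit.values())

    if len(periods_by_unit) == 1:
        return "time-series"        # a single unit followed over time
    if len(all_periods) == 1:
        return "cross-sectional"    # many units, each observed once
    # Panel data: balanced only if every unit appears in every period.
    balanced = all(p == all_periods for p in periods_by_unit.values())
    return "balanced panel" if balanced else "unbalanced panel"


# Hypothetical example: store_B has no record for the second quarter.
sales_records = [
    ("store_A", "2017Q1"), ("store_A", "2017Q2"),
    ("store_B", "2017Q1"),
]
print(classify_dataset(sales_records))
```

Because one store is missing one quarter, the sketch reports an unbalanced panel; adding the record ("store_B", "2017Q2") would make it balanced.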
5 Data Types
Generally, there are four types of data associated with four primary scales, namely,
nominal, ordinal, interval, and ratio. Nominal scale is used to describe categories in
which there is no specific order while the ordinal scale is used to describe categories
in which there is an inherent order. For example, green, yellow, and red are three
colors that in general are not bound by an inherent order. In such a case, a nominal
scale is appropriate. However, if we are using the same colors in connection with
the traffic light signals there is clear order. In this case, these categories carry an
ordinal scale. Typical examples of the ordinal scale are (1) sick, recovering, healthy;
(2) lower income, middle income, higher income; (3) illiterate, primary school pass,
higher school pass, graduate or higher, and so on. In the ordinal scale, the differences
in the categories are not of the same magnitude (or even of measurable magnitude).
Interval scale is used to convey relative magnitude information such as temperature.
The term “Interval” comes about because rulers (and rating scales) have intervals
of uniform lengths. Example: “I rate A as a 7 and B as a 4 on a scale of 10.”
In this case, we not only know that A is preferred to B, but we also have some
idea of how much more A is preferred to B. Ratio scales convey information on
an absolute scale. Example: “I paid $11 for A and $12 for B.” The 11 and 12
here are termed “absolute” measures because the corresponding zero point ($0) is
understood in the same way by different people (i.e., the measure is independent of
subject).
Another set of examples for the four data types, this time from the world of
sports, could be as follows. The numbers assigned to runners are of nominal data
type, whereas the rank order of winners is of the ordinal data type. Note in the latter
case that while we may know who came first and who came second, we would not
know by how much based on the rank order alone. A performance rating on a 0–10
scale would be an example of an interval scale. We see this used in certain sports
ratings (e.g., gymnastics) wherein judges assign points based on certain metrics.
Finally, in track and field events, the time to finish in seconds is an example of ratio
data. The reference point of zero seconds is well understood by all observers.
Table 2.1 A description of data and their types, sources, and examples
Category | Examples | Type | Sources^a
Internal data
Transaction data | Sales (POS/online) transactions, stock market orders and trades, customer IP and geolocation data | Numbers, text | http://times.cs.uiuc.edu/~wang296/Data/; https://www.quandl.com/; https://www.nyse.com/data/transactions-statistics-data-library; https://www.sec.gov/answers/shortsalevolume.htm
Customer preference data | Website click stream, cookies, shopping cart, wish list, preorder | Numbers, text | C:\Users\username\AppData\Roaming\Microsoft\Windows\Cookies; Nearbuy.com (advance coupon sold)
Experimental data | Simulation games, clinical trials, live experiments | Text, number, image, audio, video | https://www.clinicaltrialsregister.eu/; https://www.novctrd.com/; http://ctri.nic.in/
Customer relationship data | Demographics, purchase history, loyalty rewards data, phone book | Text, number, image, biometrics |
External data
Survey data | Census, national sample survey, annual survey of industries, geographical survey, land registry | Text, number, image, audio, video | http://www.census.gov/data.html; http://www.mospi.gov.in/; http://www.csoisw.gov.in/; https://www.gsi.gov.in/; http://landrecords.mp.gov.in/
Biometric data (fingerprint, retina, pupil, palm, face) | Immigration data, social security identity, Aadhar card (UID) | Number, text, image, biometric | http://www.migrationpolicy.org/programs/migration-data-hub; https://www.dhs.gov/immigration-statistics
(continued)
The reason why it matters what primary scale was used to collect data is that
downstream analysis is constrained by data type. For instance, with nominal data, all
we can compute are the mode, some frequencies and percentages. Nothing beyond
this is possible due to the nature of the data. With ordinal data, we can compute
the median and some rank order statistics in addition to whatever is possible with
nominal data. This is because ordinal data retains all the properties of the nominal
data type. When we proceed further to interval data and then on to ratio data,
we encounter a qualitative leap over what was possible before. Now, suddenly,
the arithmetic mean and the variance become meaningful. Hence, most statistical
analysis and parametric statistical tests (and associated inference procedures) all
become available. With ratio data, in addition to everything that is possible with
interval data, ratios of quantities also make sense.
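The way each scale adds to what is permissible on the scale before it can be sketched in code. The example below is an illustration only, built on Python's standard `statistics` module; the function name `permissible_summaries` and the sample values are assumptions. Ordinal values are assumed to be supplied as rank codes (for example, 1 < 2 < 3).

```python
import statistics

# Order matters: each scale inherits everything allowed on the scales before it.
SCALES = ["nominal", "ordinal", "interval", "ratio"]


def permissible_summaries(values, scale):
    """Return only the summary statistics that are meaningful for `scale`."""
    if scale not in SCALES:
        raise ValueError(f"unknown scale: {scale!r}")
    level = SCALES.index(scale)

    summary = {"mode": statistics.mode(values)}            # nominal and above
    if level >= 1:
        # median_low always returns an observed value, which suits rank codes.
        summary["median"] = statistics.median_low(values)  # ordinal and above
    if level >= 2:
        summary["mean"] = statistics.mean(values)          # interval and above
        summary["variance"] = statistics.variance(values)  # needs >= 2 values
    if level >= 3:
        # Ratios need a true zero point; assumes strictly positive values here.
        summary["max_to_min_ratio"] = max(values) / min(values)
    return summary


# Hypothetical data: jersey colors (nominal) and finish times in seconds (ratio).
print(permissible_summaries(["red", "green", "red"], "nominal"))
print(permissible_summaries([10.0, 20.0, 20.0, 40.0], "ratio"))
```

Asking for the mean of the jersey colors is simply not offered by the sketch, which mirrors the point made above: the scale on which data was collected limits the analysis that can follow.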
The multiple-choice examples that follow are meant to concretize the understand-
ing of the four primary scales and corresponding data types.
Even before data collection can begin, the purpose for which the data collection
is being conducted must be clarified. Enter problem formulation. The importance
of problem formulation cannot be overstated—it comes first in any research project,
ideally speaking. Moreover, even small deviations from the intended path at the very
beginning of a project’s trajectory can lead to a vastly different destination than was
intended. That said, problem formulation can often be a tricky issue to get right. To
see why, consider the musings of a decision-maker and country head for XYZ Inc.
Sales fell short last year. But sales would’ve approached target except for 6 territories in 2
regions where results were poor. Of course, we implemented a price increase across-the-
board last year, so our profit margin goals were just met, even though sales revenue fell
short. Yet, 2 of our competitors saw above-trend sales increases last year. Still, another
competitor seems to be struggling, and word on the street is that they have been slashing
prices to close deals. Of course, the economy was pretty uneven across our geographies last
year and the 2 regions in question, weak anyway, were particularly so last year. Then there
was that mess with the new salesforce compensation policy coming into effect last year. 1
of the 2 weak regions saw much salesforce turnover last year . . .
These are everyday musings in the lives of business executives and are far from
unusual. Depending on the identification of the problem, data collection strategies,
resources, and approaches will differ. The difficulty of readily pinpointing
any one cause or combination of causes as the specific problem highlights the issues
that crop up in problem formulation. Four important points jump out from the above
example. First, that reality is messy. Unlike textbook examples of problems, wherein
irrelevant information is filtered out a priori and only that which is required to solve
“the” identified problem exactly is retained, life seldom simplifies issues in such a
clear-cut manner. Second, borrowing from a medical analogy, there are symptoms—
observable manifestations of an underlying problem or ailment—and then there is
the cause or ailment itself. Symptoms could be a fever or a cold and the causes
could be bacterial or viral agents. However, curing the symptoms may not cure
the ailment. Similarly, in the previous example from XYZ Inc., we see symptoms
(“sales are falling”) and hypothesize the existence of one or more underlying
problems or causes. Third, note the pattern of connections between symptom(s) and
potential causes. One symptom (falling sales) is assumed to be coming from one
or more potential causes (product line, salesforce compensation, weak economy,
competitors, etc.). This brings up the fourth point—How can we diagnose a problem
(or cause)? One strategy would be to narrow the field of “ailments” by ruling out
low-hanging fruit, ideally as quickly and cheaply as feasible. It is not hard to see
that the data required for this problem depends on what potential ailments we have
shortlisted in the first place.
For illustrative purposes, consider a list of three probable causes from the messy
reality of the problem statement given above, namely, (1) product line is obsolete;
(2) customer-connect is ineffective; and (3) product pricing is uncompetitive (say).
Then, from this messy reality we can formulate decision problems (D.P.s) that
correspond to the three identified probable causes:
• D.P. #1: “Should new product(s) be introduced?”
• D.P. #2: “Should the advertising campaign be changed?”
• D.P. #3: “Should product prices be changed?”
Note what we are doing in mathematical terms—if messy reality is a large
multidimensional object, then these D.P.s are small-dimensional subsets of that
reality. This “reduces” a messy large-dimensional object to a relatively more
manageable small-dimensional one.
The D.P., even though it is of small dimension, may not contain sufficient detail
to map directly onto tools. Hence, another level of refinement called the research
objective (R.O.) may be needed. While the D.P. is a small-dimensional object,
the R.O. is (ideally) a one-dimensional object. Multiple R.O.s may be needed to
completely “cover” or address a single D.P. Furthermore, because each R.O. is
one-dimensional, it maps easily and directly onto one or more specific tools in
the analytics toolbox. A one-dimensional problem formulation component had better be
well defined. The R.O. has three essential parts that together lend necessary clarity
to its definition. R.O.s comprise (a) an action verb, (b) an actionable object, and
(c) brevity: they typically fit within one handwritten line. For instance, the
active voice statement “Identify the real and perceived gaps in our product line vis-
à-vis that of our main competitors” is an R.O. because its components action verb
(“identify”), actionable object (“real and perceived gaps”), and brevity are satisfied.
[Figure 2.1: Large-dimensional object → Decision Problem (relatively small-dimensional object) → Research Objective (one-dimensional object)]
Figure 2.1 depicts the problem formulation framework we just described in
pictorial form. It is clear from the figure that as we impose preliminary structure, we
effectively reduce problem dimensionality from large (messy reality) to somewhat
small (D.P.) to the concise and the precise (R.O.).
may be. The second type is descriptive research wherein the problem’s identity is
somewhat clear. For instance, “What kind of people buy our products?” or “Who is
perceived as competition to us?” These are examples of known-unknowns. The third
type is causal research wherein the problem is clearly defined. For instance, “Will
changing this particular promotional campaign raise sales?” is a clearly identified
known-unknown. Causal research (the cause in causal comes from the cause in
because) tries to uncover the “why” behind phenomena of interest and its most
powerful and practical tool is the experimentation method. It is not hard to see that
the level of clarity in problem definition vastly affects the choices available in terms
of data collection and downstream analysis.
Data collection is about data and about collection. We have seen the value inherent
in the right data in Sect. 1. In Sect. 3, we have seen the importance of clarity in
problem formulation while determining what data to collect. Now it is time to turn
to the “collection” piece of data collection. What challenges might a data scientist
typically face in collecting data? There are various ways to list the challenges that
arise. The approach taken here follows a logical sequence.
The first challenge is in knowing what data to collect. This often requires
some familiarity with or knowledge of the problem domain. Second, after the data
scientist knows what data to collect, the hunt for data sources can proceed apace.
Third, having identified data sources (the next section features a lengthy listing of
data sources in one domain as part of an illustrative example), the actual process
of mining raw data can follow. Fourth, once the raw data is mined, data quality
assessment follows. This includes various data cleaning/wrangling, imputation, and
other data “janitorial” work that consumes a major part of the typical data science
project’s time. Fifth, after assessing data quality, the data scientist must now judge
the relevance of the data to the problem at hand. While considering the above, at
each stage one has to take into consideration the cost and time constraints.
Consider a retailing context. What kinds of data would or could a grocery retail
store collect? Of course, there would be point-of-sale data on items purchased,
promotions availed, payment modes and prices paid in each market basket, captured
by UPC scanner machines. Apart from that, retailers would likely be interested in
(and can easily collect) data on a varied set of parameters. For example, that may
include store traffic and footfalls by time of the day and day of the week, basic
segmentation (e.g., demographic) of the store’s clientele, past purchase history of
customers (provided customers can be uniquely identified, that is, through a loyalty
or bonus program), routes taken by the average customer when navigating the
store, or time spent on an average by a customer in different aisles and product
departments. Clearly, in the retail sector, data sources and capture points are many
and varied, and the resulting data is typically large on the following three dimensions:
• Volume
• Variety (ranges from structured metric data on sales, inventory, and geo location
to unstructured data types such as text, images, and audiovisual files)
• Velocity (the speed at which data comes in and gets updated, e.g., sales or
inventory data, social media monitoring data, clickstreams, and RFID
(radio-frequency identification) feeds)
These are the three attributes that data must exhibit to be labeled “Big
Data” (Diebold 2012). The next subsection dives into the retail sector as an
illustrative example of data collection possibilities, opportunities, and challenges.
Collecting data from multiple sources will not result in rich insights unless the data
is collated to retain its integrity. Data validity may be compromised if proper care is
not taken during collation. One may face various challenges while trying to collate
the data. Below, we describe a few challenges along with the approaches to handle
them in the light of business problems.
• No common identifier: A challenge while collating data from multiple sources
arises due to the absence of common identifiers across different sources. The
analyst may seek a third identifier that can serve as a link between two data
sources.
• Missing data, data entry errors: Missing data can be ignored, deleted, or
imputed with relevant statistics (see Chap. 8).
• Different levels of granularity: The data could be aggregated at different levels.
For example, primary data is collected at the individual level, while secondary
data is usually available at the aggregate level. One can either aggregate the
data in order to bring all the observations to the same level of granularity or
can apportion the data using business logic.
• Change in data type over the period or across the samples: In financial and
economic data, many a time the base period or multipliers are changed, which
needs to be accounted for to achieve data consistency. Similarly, samples
collected from different populations such as India and the USA may suffer from
inconsistent definitions of time periods—the financial year in India is from April
to March and in the USA, it is from January to December. One may require
remapping of old versus new data types in order to bring the data to the same
level for analysis.
• Validation and reliability: As secondary data is collected by someone else, the
researcher may want to check its correctness and reliability before using it
to answer a particular research question.
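The remapping of time periods mentioned in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sales records are hypothetical, and labeling a fiscal year by the calendar year in which it ends is an assumed convention. It follows the text in treating the Indian financial year as April to March and the US one as January to December.

```python
from collections import defaultdict
from datetime import date


def fiscal_year(d: date, start_month: int) -> int:
    """Label a date by the calendar year in which its fiscal year ends (assumed convention)."""
    if start_month == 1:
        return d.year
    return d.year + 1 if d.month >= start_month else d.year


def total_by_fiscal_year(records, start_month):
    """Aggregate individual-level (date, amount) records to the fiscal-year level."""
    totals = defaultdict(float)
    for d, amount in records:
        totals[fiscal_year(d, start_month)] += amount
    return dict(totals)


# Hypothetical individual-level sales records: (date, amount).
sales = [
    (date(2017, 2, 10), 100.0),
    (date(2017, 4, 5), 250.0),
    (date(2018, 3, 31), 50.0),
]

india = total_by_fiscal_year(sales, start_month=4)  # April to March
usa = total_by_fiscal_year(sales, start_month=1)    # January to December
print(india)
print(usa)
```

The same records fall into different yearly totals under the two definitions, which is why the period definition must be reconciled before data from the two sources is compared or merged.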
Data presentation is also very important to understand the issues in the data. The
basic presentation may include relevant charts such as scatter plots, histograms, and
pie charts or summary statistics such as the number of observations, mean, median,
variance, minimum, and maximum. You will read more about data visualization in
Chap. 5 and about basic inferences in Chap. 6.
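As a small illustration of the summary statistics listed above, the following Python sketch computes them with the standard library. The footfall numbers are hypothetical and are used only to show the calculation.

```python
import statistics

# Hypothetical daily footfall counts for one store over a week.
footfall = [120, 135, 150, 110, 160, 145, 130]

summary = {
    "n": len(footfall),
    "mean": statistics.mean(footfall),
    "median": statistics.median(footfall),
    "variance": statistics.variance(footfall),  # sample variance (divides by n - 1)
    "min": min(footfall),
    "max": max(footfall),
}

for name, value in summary.items():
    print(name, value)
```

A quick look at such a summary often reveals issues such as impossible minimum values or a mean far from the median, which hint at data entry errors or outliers.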
Bradlow et al. (2017) provide a detailed framework to understand and classify the
various data sources becoming popular with retailers in the era of Big Data and
analytics. Figure 2.2, taken from Bradlow et al. (2017), “organizes (an admittedly
incomplete) set of eight broad retail data sources into three primary groups, namely,
(1) traditional enterprise data capture; (2) customer identity, characteristics, social
graph and profile data capture; and (3) location-based data capture.” The claim
is that insight and possibilities lie at the intersection of these groups of diverse,
contextual, and relevant data.
Traditional enterprise data capture (marked #1 in Fig. 2.2) from UPC scanners
combined with inventory data from ERP or SCM software and syndicated databases
(such as those from IRI or Nielsen) enable a host of analyses, including the
following:
[Fig. 2.2 (excerpt): 5. Mobile and app-based data (both the retailer's own app and syndicated sources); 6. Customers' subconscious, habit-based, or subliminally influenced choices (RFID, eye-tracking, etc.)]
Finally, data source #9 in Fig. 2.2 is pertinent largely to emerging markets and lets
small, unorganized-sector retailers (mom-and-pop stores) leverage their physical
location and act as fulfillment center franchisees for large retailers (Forbes 2015).
This chapter was an introduction to the important task of data collection, a process
that precedes and heavily influences the success or failure of data science and
analytics projects in meeting their objectives. We started with why data is such a
big deal and used an illustrative example (Uber) to see the value inherent in the
right kind of data. We followed up with some preliminaries on the four main types
of data, their corresponding four primary scales, and the implications for analysis
downstream. We then ventured into problem formulation, discussed why it is of
such critical importance in determining what data to collect, and built a simple
framework against which data scientists could check and validate their current
problem formulation tasks. Finally, we walked through an extensive example of the
various kinds of data sources available in just one business domain—retailing—and
the implications thereof.
Exercises
Ex. 2.1 Prepare the movie release dataset of all the movies released in the last 5 years
using IMDB.
(a) Find all movies that were released in the last 5 years.
(b) Generate a file containing URLs for the top 50 movies every year on IMDB.
(c) Read in the URL’s IMDB page and scrape the following information:
Producer(s), Director(s), Star(s), Taglines, Genres, (Partial) Storyline, Box
office budget, and Box office gross.
(d) Make a table out of these variables as columns with movie name being the first
variable.
(e) Analyze the movie-count for every Genre. See if you can come up with some
interesting hypotheses. For example, you could hypothesize that “Action Genres
occur significantly more often than Drama in the top-250 list.” or that “Action
movies gross higher than Romance movies in the top-250 list.”
(f) Write a markdown doc with your code and explanation. See if you can storify
your hypotheses.
Note: You can web-scrape with the rvest package in R or use any platform that
you are comfortable with.
Ex. 2.2 Download the movie reviews from IMDB for the list of movies.
(a) Go to www.imdb.com and examine the page.
(b) Scrape the page and tabulate the output into a data frame with columns “name,
url, movie type, votes.”
(c) Filter the data frame. Retain only those movies that got over 500 reviews. Let
us call this Table 1.
(d) Now for each of the remaining movies, go to the movie’s own web page on the
IMDB, and extract the following information:
Duration of the movie, Genre, Release date, Total views, Commercial
description from the top of the page.
(e) Add these fields to Table 1 in that movie’s row.
(f) Now build a separate table for each movie in Table 1 from that movie’s web
page on IMDB. Extract the first five pages of reviews of that movie and in each
review, scrape the following information:
Reviewer, Feedback, Likes, Overall, Review (text), Location (of the
reviewer), Date of the review.
(g) Store the output in a table. Let us call it Table 2.
(h) Create a list (List 1) with as many elements as there are rows in Table 1. For the
ith movie in Table 1, store Table 2 as the ith element of a second list, say, List 2.
Ex. 2.3 Download the Twitter data through APIs.
(a) Read up on how to use the Twitter API (https://dev.twitter.com/overview/api).
If required, create a Twitter ID (if you do not already have one).
(b) There are three evaluation dimensions for a movie at IMDB, namely, Author,
Feedback, and Likes. More than the dictionary meanings of these words, it is
interesting how they are used in different contexts.
(c) Download 50 tweets each that contain these terms and 100 tweets for each
movie.
(d) Analyze these tweets and classify what movie categories they typically refer to.
Insights here could, for instance, be useful in designing promotional campaigns
for the movies.
P.S.: R has a dedicated package, twitteR (note the capital R at the end). For additional
functions, refer to the twitteR package manual.
Ex. 2.4 Prepare the beer dataset of all the beers that got over 500 reviews.
(a) Go to (https://www.ratebeer.com/beer/top-50/) and examine the page.
(b) Scrape the page and tabulate the output into a data frame with columns “name,
url, count, style.”
(c) Filter the data frame. Retain only those beers that got over 500 reviews. Let us
call this Table 1.
(d) Now for each of the remaining beers, go to the beer’s own web page on the
ratebeer site, and scrape the following information:
(e) Now compile this data in a tabular format. Your data should have these columns:
• Sender name
• Time
• Latitude
• Longitude
• Type of place
(f) Extract your locations from the chat history table and plot them on Google Maps.
You can use the spatial DC code we used on this list of latitude and longitude
coordinates or use the leaflet package in R to do the same. Remember to extract
and map only your own locations, not those of other group members.
(g) Analyze your own movements over a week *AND* record your observations
about your travels as a story that connects these locations together.
References
Bijmolt, T. H. A., van Heerde, H. J., & Pieters, R. G. M. (2005). New empirical generalizations on
the determinants of price elasticity. Journal of Marketing Research, 42(2), 141–156.
Blair, E., & Blair, C. (2015). Applied survey sampling. Los Angeles: Sage Publications.
Blattberg, R. C., Kim, B.-D., & Neslin, S. A. (2008). Market basket analysis. Database Marketing:
Analyzing and Managing Customers, 339–351.
Bradlow, E., Gangwar, M., Kopalle, P., & Voleti, S. (2017). The role of big data and predictive
analytics in retailing. Journal of Retailing, 93, 79–95.
Diebold, F. X. (2012). On the origin(s) and development of the term ‘Big Data’.
Forbes. (2015). From Dabbawallas to Kirana stores, five unique E-commerce delivery innovations
in India. Retrieved April 15, 2015, from http://tinyurl.com/j3eqb5f.
Ghose, A., & Han, S. P. (2011). An empirical analysis of user content generation and usage
behavior on the mobile Internet. Management Science, 57(9), 1671–1691.
Ghose, A., & Han, S. P. (2014). Estimating demand for mobile applications in the new economy.
Management Science, 60(6), 1470–1488.
Larson, J. S., Bradlow, E. T., & Fader, P. S. (2005). An exploratory look at supermarket shopping
paths. International Journal of Research in Marketing, 22(4), 395–414.
Lee, N., Broderick, A. J., & Chamberlain, L. (2007). What is ‘neuromarketing’? A discussion and
agenda for future research. International Journal of Psychophysiology, 63(2), 199–204.
Luo, X., Andrews, M., Fang, Z., & Phang, C. W. (2014). Mobile targeting. Management Science,
60(7), 1738–1756.
Montgomery, D. C. (2017). Design and analysis of experiments (9th ed.). New York: John Wiley and
Sons.
Montgomery, A. L., Li, S., Srinivasan, K., & Liechty, J. C. (2004). Modeling online browsing and
path analysis using clickstream data. Marketing Science, 23(4), 579–595.
Murray, K. B., Di Muro, F., Finn, A., & Leszczyc, P. P. (2010). The effect of weather on consumer
spending. Journal of Retailing and Consumer Services, 17(6), 512–520.
Park, C. W., Iyer, E. S., & Smith, D. C. (1989). The effects of situational factors on in-store grocery
shopping behavior: The role of store environment and time available for shopping. Journal of
Consumer Research, 15(4), 422–433.
Rossi, P. E., & Allenby, G. M. (1993). A Bayesian approach to estimating household parameters.
Journal of Marketing Research, 30, 171–182.
Rossi, P. E., McCulloch, R. E., & Allenby, G. M. (1996). The value of purchase history data in
target marketing. Marketing Science, 15(4), 321–340.
Russell, G. J., & Petersen, A. (2000). Analysis of cross category dependence in market basket
selection. Journal of Retailing, 76(3), 367–392.
Steele, A. T. (1951). Weather’s effect on the sales of a department store. Journal of Marketing,
15(4), 436–443.
Vrechopoulos, A. P., O’Keefe, R. M., Doukidis, G. I., & Siomkos, G. J. (2004). Virtual store layout:
An experimental comparison in the context of grocery retail. Journal of Retailing, 80(1), 13–22.
Wedel, M., & Pieters, R. (2000). Eye fixations on advertisements and memory for brands: A model
and findings. Marketing Science, 19(4), 297–312.
Chapter 3
Data Management—Relational Database
Systems (RDBMS)
1 Introduction
Storage and management of data is a key aspect of data science. Data, simply
speaking, is nothing but a collection of facts—a snapshot of the world—that can
be stored and processed by computers. In order to process and manipulate data
efficiently, it is very important that data is stored in an appropriate form. Data comes
in many shapes and forms, and some of the most commonly known forms of data
are numbers, text, images, and videos. Depending on the type of data, there exist
multiple ways of storage and processing. In this chapter, we focus on one of the
most commonly known and pervasive means of data storage—relational database
management systems. We provide an introduction that equips the reader to perform
the essential operations. References for a deeper understanding are given at the end
of the chapter.
2 Motivating Example
Consider an online store that sells stationery to customers across a country. The
owner of this store would like to set up a system that keeps track of inventory, sales,
operations, and potential pitfalls. While she is currently able to do so on her own,
she knows that as her store scales up and starts to serve more and more people, she
will no longer have the capacity to manually record transactions and create records
for new occurrences. Therefore, she turns to relational database systems to run her
business more efficiently.
H. K. Dasararaju
Indian School of Business, Hyderabad, Telangana, India
P. Taori
London Business School, London, UK
e-mail: taori.peeyush@gmail.com
A database is a collection of organized data in the form of rows, columns, tables,
and indexes. In a database, even a small piece of information becomes data. We tend
to group related information together under a single name,
called a table. For example, all student-related data (student ID, student name, date
of birth, etc.) would be put in one table called the STUDENT table. This reduces the
effort needed to find a specific piece of information in the database. A
database is flexible: it grows as new data is added
and shrinks as data is deleted.
As data grows in size, there arises a need for a means of storing it efficiently such
that it can be found and processed quickly. In the “olden days” (which were not
that long ago), this was achieved via systematic filing systems where individual files
were catalogued and stored neatly according to a well-developed data cataloging
system (similar to the ones you will find in libraries or data storage facilities
in organizations). With the advent of computer systems, this role has now been
assumed by database systems. Plainly speaking, a database system is a digital
record-keeping system or an electronic filing cabinet. Database systems can be used
to store large amounts of data, and data can then be queried and manipulated later
using a querying mechanism/language. Some of the common operations that can
be performed in a database system are adding new files, updating old data files,
creating new databases, querying of data, deleting data files/individual records, and
adding more data to existing data files. Often, preprocessing and post-processing of
data happen using database languages. For example, one can selectively read data,
verify its correctness, and connect it to data structures within applications and then,
after processing, write it back into the database for storage and further processing.
With the advent of computers, the usage of database systems has become
ubiquitous in our personal and work lives. Whether we are storing information about
personal expenditures using an Excel file or making use of a MySQL database to
store product catalogues for a retail organization, databases are pervasive and in use
everywhere. We also discuss how the techniques covered in this
chapter differ from the methods for managing big data covered in the next chapter.
A database management system (DBMS) is the system software that enables users
to create, organize, and manage databases. As Fig. 3.1 illustrates, the DBMS serves
as an interface between the database and the end user, guaranteeing that information
is reliably organized and remains accessible.
[Fig. 3.1: Database application, database management system (DBMS), and database]
• First normal form (1NF): The table must contain “atomic” values only (a field
cannot hold multiple values, and the table should not contain duplicate values).
Example: Suppose the university wants to store the details of students who are
finalists of a competition. Table 3.1 shows the data.
Three students (Jon, Robb, and Ken) each have two different parents' numbers, so the
university put both numbers in the same field, as you can see in Table 3.1. This table is
not in 1NF because it violates the rule “only atomic values in a field”: there
are multiple values in the parents_number field. To bring the table into 1NF, we should
store the information as shown in Table 3.2.
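The conversion to 1NF can be sketched as follows. This is a minimal Python illustration, not the chapter's own method: the phone numbers are hypothetical placeholders, and `to_1nf` is a helper name introduced here for illustration only.

```python
# Hypothetical records: parents_number holds several values in one field,
# which violates 1NF. The numbers are made up for illustration.
raw = [
    {"student": "Jon", "parents_number": "111-1111, 222-2222"},
    {"student": "Janet", "parents_number": "333-3333"},
]


def to_1nf(rows, field, sep=","):
    """Return one row per value of a multi-valued field (hypothetical helper)."""
    out = []
    for row in rows:
        for value in row[field].split(sep):
            new_row = dict(row)            # copy the other attributes unchanged
            new_row[field] = value.strip()  # keep exactly one value per field
            out.append(new_row)
    return out


atomic = to_1nf(raw, "parents_number")
for row in atomic:
    print(row)
```

After the conversion, a student with two numbers appears in two rows, and every field holds a single value.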
• Second normal form (2NF): The table must be in first normal form, and no non-key
attribute may depend on a proper subset of any candidate key of the table.
Example: Assume a university needs to store the information of the instructors
and the topics they teach. They make a table that resembles the one given below
(Table 3.3) since an instructor can teach more than one topic.
Instructor_ID Topic
56121 Neural Network
56121 IoT
56132 Statistics
56133 Optimization
56133 Simulation
Here Instructor_ID and Topic are key attributes and Instructor_Age is a non-key
attribute. The table is in 1NF but not in 2NF because the non-key attribute
Instructor_Age depends on Instructor_ID alone, which is a proper subset of the
candidate key (Instructor_ID, Topic). To make the table conform to 2NF, we
can break it into two tables like the ones given in Table 3.4.
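The 2NF decomposition described above can be sketched in Python. The instructor IDs and topics are the ones listed in the text; the ages are hypothetical, since they are not available here, and the variable names are chosen for illustration.

```python
# Rows of (Instructor_ID, Topic, Instructor_Age); the ages are hypothetical.
rows = [
    (56121, "Neural Network", 37),
    (56121, "IoT", 37),
    (56132, "Statistics", 51),
    (56133, "Optimization", 43),
    (56133, "Simulation", 43),
]

instructor_age = {}    # Instructor_ID -> Instructor_Age, one entry per instructor
instructor_topic = []  # (Instructor_ID, Topic) pairs

for instructor_id, topic, age in rows:
    # The age depends on Instructor_ID alone, so it is stored once per instructor.
    known_age = instructor_age.setdefault(instructor_id, age)
    if known_age != age:
        raise ValueError(f"conflicting ages for instructor {instructor_id}")
    instructor_topic.append((instructor_id, topic))

print(instructor_age)
print(instructor_topic)
```

Storing each age once removes the redundancy that made the original table vulnerable to inconsistent updates.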
• Third normal form (3NF): The table must be in second normal form, and no non-key
attribute may be determined by another non-key attribute.
Example: Suppose the university wants to store the details of students who are
finalists of a competition. The table is shown in Table 3.5.
Here, Student_ID is the key attribute and all other attributes are non-key
attributes. Student_State, Student_city, and Student_Area depend on Student_ZIP,
and Student_ZIP depends on Student_ID, which makes these non-key attributes
transitively dependent on the key attribute. This violates the 3NF rule. To make
the table conform to 3NF, we can break it into two tables like the ones given in Table 3.6.
3NF is the form that is practiced and advocated across most organizational
environments because tables in 3NF are immune to most of the anomalies
associated with insertion, update, and deletion of data. However, there could be
specific instances when organizations might want to opt for alternate forms of table
normalization such as 4NF and 5NF. While 2NF and 3NF normalizations focus on
functional dependencies, 4NF and 5NF are more concerned with addressing multivalued
dependencies. A detailed discussion of 4NF and 5NF forms is beyond the scope
of discussion for this chapter, but the interested reader can learn more online from
various sources.1 It should be noted that in many organizational scenarios, the focus
is mainly on achieving 3NF.
Table 3.6 Breaking tables into two in order to agree with 3NF
Student table:
Student_ID   Student_Name   Student_ZIP
71121        Jon            10001
71122        Janet          60201
71123        Robb           02238
71124        Zent           90089
Student_zip table:
Student_ZIP   Student_State   Student_city   Student_Area
10001         New York        New York       Queens Manhattan
60201         Illinois        Chicago        Evanston
02238         Massachusetts   Boston         Cambridge
90089         California      Los Angeles    Trousdale
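To see that splitting a table for 3NF loses no information, the two tables of Table 3.6 can be joined back together. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made so the example runs without a server); storing ZIP codes as text is also a choice made here, so that the leading zero in 02238 is preserved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
conn.executescript(
    """
    CREATE TABLE student_zip (
        Student_ZIP   TEXT PRIMARY KEY,
        Student_State TEXT,
        Student_city  TEXT,
        Student_Area  TEXT
    );
    CREATE TABLE student (
        Student_ID   INTEGER PRIMARY KEY,
        Student_Name TEXT,
        Student_ZIP  TEXT REFERENCES student_zip (Student_ZIP)
    );
    """
)
conn.executemany(
    "INSERT INTO student_zip VALUES (?, ?, ?, ?)",
    [
        ("10001", "New York", "New York", "Queens Manhattan"),
        ("60201", "Illinois", "Chicago", "Evanston"),
        ("02238", "Massachusetts", "Boston", "Cambridge"),
        ("90089", "California", "Los Angeles", "Trousdale"),
    ],
)
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [(71121, "Jon", "10001"), (71122, "Janet", "60201"),
     (71123, "Robb", "02238"), (71124, "Zent", "90089")],
)

# Joining on Student_ZIP rebuilds the original wide table.
joined = conn.execute(
    """
    SELECT s.Student_ID, s.Student_Name, s.Student_ZIP,
           z.Student_State, z.Student_city, z.Student_Area
    FROM student AS s
    JOIN student_zip AS z ON s.Student_ZIP = z.Student_ZIP
    ORDER BY s.Student_ID
    """
).fetchall()
for row in joined:
    print(row)
```

Because each ZIP code's state, city, and area are stored once, a change to a ZIP code's details needs to be made in only one row.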
Most businesses today need to record and store information. Sometimes this may be
only for record keeping and sometimes data is stored for later use. We can store the
data in Microsoft Excel. But why is RDBMS the most widely used method to store
data?
Using Excel we can perform various functions like adding the data in rows and
columns, sorting of data by various metrics, etc. But Excel is a two-dimensional
spreadsheet and thus it is extremely hard to make connections between information
in various spreadsheets. It is easy to view the data or find a particular record in
Excel when the amount of information is small, but it becomes very hard to read the
information once it crosses a certain size: locating a specific record may mean
scrolling through many pages.
Unlike Excel, in RDBMS, the information is stored independently from the user
interface. This separation of storage and access makes the framework considerably
more scalable and versatile. In RDBMS, data can be easily cross-referenced
between multiple databases using relationships between them but there are no such
options in Excel. RDBMS utilizes centralized data storage systems that make
backup and maintenance much easier. Database frameworks also tend to be
significantly faster, as they are built to store and manipulate large datasets, unlike
Excel.
In this section, we will walk through the basics of creating a database using MySQL2
and query the database using the MySQL querying language. As described earlier
in the chapter, a MySQL database server is capable of hosting many databases. In
database parlance, a database is often also called a schema. Thus, a MySQL server
can contain a number of schemas. Each of those schemas (databases) is made up of a
number of tables, and every table contains rows and columns. Each row represents
an individual record or observation, and each column represents a particular attribute
such as age and salary.
When you launch the MySQL command prompt, you see a command line like
the one below (Fig. 3.2).
The command line starts with “mysql>” and you can run SQL scripts by closing
commands with semicolon (;).
In order to get started we will first check the databases that are already present in
a MySQL server. To do so, type “show databases;” in the command line. Once
you run this command, it will list all the available databases in the MySQL server
installation. The above-mentioned command is the first SQL query that we have
run. Please note that keywords and commands are case-insensitive in MySQL, unlike
in R and Python, where commands are case-sensitive.
mysql> SHOW DATABASES;
Output:
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| test               |
+--------------------+
4 rows in set (0.00 sec)
You will notice that there are already four schemas listed though we have
not yet created any of them. Of the four databases, “information_schema”,
“mysql”, and “performance_schema” are created by the MySQL server for its internal
monitoring and performance optimization purposes and should not be used when we
are creating our own database. Another schema, “test”, is created by MySQL during
the installation phase and is provided for testing purposes. You can remove the
“test” schema or use it to create your own tables.
Now let us create our own database. The syntax for creating a database in MySQL is:
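CREATE DATABASE databasename;
For example, the following command creates a database named “product_sales”:
mysql> CREATE DATABASE product_sales;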
Output:
Query OK, 1 row affected (0.00 sec)
Listing the databases again confirms that it has been created:
mysql> SHOW DATABASES;
Output:
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| product_sales      |
| test               |
+--------------------+
5 rows in set (0.00 sec)
A database can be deleted with the DROP DATABASE command:
mysql> DROP DATABASE product_sales;
Output:
Query OK, 0 rows affected (0.14 sec)
mysql> SHOW DATABASES;
Output:
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| test               |
+--------------------+
4 rows in set (0.00 sec)
Oftentimes, when you have to create a database, you might not be sure if a
database of the same name already exists in the system. In such cases, conditions
such as “IF EXISTS” and “IF NOT EXISTS” come in handy. When we execute the query
below, the database is created only if there is no other database of the same name.
This helps us avoid overwriting the existing database with the new one.
mysql> CREATE DATABASE IF NOT EXISTS product_sales;
Output:
Query OK, 1 row affected (0.00 sec)
One important point to keep in mind is to use SQL DROP commands with
extreme care, because once you delete an entity or an entry, there is no way to
recover the data.
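The effect of the IF NOT EXISTS and IF EXISTS conditions can be tried out safely in a throwaway database. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL; since SQLite has no CREATE DATABASE statement, the conditions are shown on a table named demo, a name chosen here for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE demo (id INTEGER)")

# Creating the same table again without a condition raises an error.
try:
    conn.execute("CREATE TABLE demo (id INTEGER)")
    duplicate_error = None
except sqlite3.OperationalError as exc:
    duplicate_error = str(exc)
print("second CREATE failed:", duplicate_error)

# With IF NOT EXISTS the statement is simply skipped when the table exists.
conn.execute("CREATE TABLE IF NOT EXISTS demo (id INTEGER)")

# With IF EXISTS, dropping twice is harmless; the second call does nothing.
conn.execute("DROP TABLE IF EXISTS demo")
conn.execute("DROP TABLE IF EXISTS demo")

tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
print(tables)
```

The conditions make scripts safe to rerun, but they do not make DROP reversible: the dropped table and its data are gone.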
There can be multiple databases available in the MySQL server. In order to work on
a specific database, we have to select the database first. The basic syntax to select a
database is:
USE databasename;
For example:
mysql> USE product_sales;
Output:
Database changed
When we run the above query, the default database becomes “product_sales”.
Whatever operations we now perform will be performed on this database. This
implies that if you have to use a specific table in the database, you can simply
do so by calling the table name. If at any point you want to check which
database is currently selected, issue the command:
mysql> SELECT DATABASE();
Output:
+---------------+
| DATABASE()    |
+---------------+
| product_sales |
+---------------+
1 row in set (0.00 sec)
If you want to check all tables in a database, then issue the following command:
mysql> SHOW TABLES;
Output:
Empty set (0.00 sec)
As of now it is empty since we have not yet created any table. Let us now go
ahead and create a table in the database. The basic syntax is:
CREATE TABLE tablename (column1name datatype, column2name datatype, ...);
The above command will create the table with the table name specified by the user.
You can also specify the optional condition IF EXISTS/IF NOT EXISTS similar to
the way you can specify them while creating a database. Since a table is nothing
but a collection of rows and columns, in addition to specifying the table name, you
would also want to specify the column names in the table and the type of data that
each column can contain. For example, let us go ahead and create a table named
“products.” We will then later inspect it in greater detail.
mysql> CREATE TABLE products (productID INT(10) UNSIGNED NOT NULL
AUTO_INCREMENT, code CHAR(6) NOT NULL DEFAULT '', productname
VARCHAR(30) NOT NULL DEFAULT '', quantity INT UNSIGNED NOT NULL
DEFAULT 0, price DECIMAL(5,2) NOT NULL DEFAULT 0.00, PRIMARY
KEY (productID) );
Output:
Query OK, 0 rows affected (0.41 sec)
mysql> SHOW TABLES;
Output:
+-------------------------+
| Tables_in_product_sales |
+-------------------------+
| products                |
+-------------------------+
1 row in set (0.00 sec)
You can always look up the schema of a table by issuing the “DESCRIBE”
command:
mysql> DESCRIBE products;
Output:
+-------------+------------------+------+-----+---------+----------------+
| Field       | Type             | Null | Key | Default | Extra          |
+-------------+------------------+------+-----+---------+----------------+
| productID   | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| code        | char(6)          | NO   |     |         |                |
| productname | varchar(30)      | NO   |     |         |                |
| quantity    | int(10) unsigned | NO   |     | 0       |                |
| price       | decimal(5,2)     | NO   |     | 0.00    |                |
+-------------+------------------+------+-----+---------+----------------+
5 rows in set (0.01 sec)
Once we have created the table, it is now time to insert data into the table. For now
we will look at how to insert data manually in the table. Later on we will see how we
can import data from an external file (such as CSV or text file) in the database. Let
us now imagine that we have to insert data into the products table we just created.
To do so, we make use of the following command:
mysql> INSERT INTO products VALUES (1, 'IPH', 'Iphone 5S Gold',
300, 625);
Output:
Query OK, 1 row affected (0.13 sec)
When we issue the above command, it will insert a single row of data into the
table “products.” The parentheses after VALUES specify the actual values that are
to be inserted. An important point to note is that values should be specified in the
same order as that of the columns when we created the table “products.” All numeric
data (integers and decimal values) are specified without quotes, whereas character
data must be specified within quotes.
Now let us go ahead and insert some more data into the “products” table:
mysql> INSERT INTO products VALUES(NULL, 'IPH',
'Iphone 5S Black', 8000, 655.25),(NULL, 'IPH',
'Iphone 5S Blue', 2000, 625.50);
Output:
Query OK, 2 rows affected (0.13 sec)
Records: 2 Duplicates: 0 Warnings: 0
In the above case, we inserted multiple rows of data at the same time. Each row
of data was specified within parentheses and the rows were separated by a comma
(,). Another point to note is that we kept the productID field as NULL when inserting
the data. This is to demonstrate that even if we provide NULL values, MySQL will
make use of the AUTO_INCREMENT attribute to assign values to each row.
Sometimes there might be a need where you want to provide data only for some
columns or you want to provide data in a different order as compared to the original
one when we created the table. This can be done using the following command:
mysql> INSERT INTO products (code, productname, quantity, price)
VALUES ('SNY', 'Xperia Z1', 10000, 555.48),('SNY', 'Xperia S',
8000, 400.49);
Output:
Query OK, 2 rows affected (0.13 sec)
Records: 2 Duplicates: 0 Warnings: 0
Notice here that we did not provide values for the productID column,
but rather explicitly specified the columns, and their order, into which we want
to insert the data. The productID column will be automatically populated using
the AUTO_INCREMENT attribute.
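The three ways of inserting rows shown above (explicit key, NULL key, and an explicit column list) can be replayed from a program. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL, which is an assumption of this example: SQLite spells the attribute AUTOINCREMENT and uses simpler column types, so the table definition is an approximation of the MySQL one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
conn.execute(
    """
    CREATE TABLE products (
        productID   INTEGER PRIMARY KEY AUTOINCREMENT,
        code        TEXT    NOT NULL DEFAULT '',
        productname TEXT    NOT NULL DEFAULT '',
        quantity    INTEGER NOT NULL DEFAULT 0,
        price       REAL    NOT NULL DEFAULT 0.0
    )
    """
)

# 1. All values given, in the order of the table's columns.
conn.execute("INSERT INTO products VALUES (1, 'IPH', 'Iphone 5S Gold', 300, 625)")

# 2. NULL for the key: the database assigns the next productID.
conn.executemany(
    "INSERT INTO products VALUES (NULL, ?, ?, ?, ?)",
    [("IPH", "Iphone 5S Black", 8000, 655.25),
     ("IPH", "Iphone 5S Blue", 2000, 625.50)],
)

# 3. Explicit column list that leaves out productID altogether.
conn.executemany(
    "INSERT INTO products (code, productname, quantity, price) VALUES (?, ?, ?, ?)",
    [("SNY", "Xperia Z1", 10000, 555.48),
     ("SNY", "Xperia S", 8000, 400.49)],
)

ids = [row[0] for row in conn.execute("SELECT productID FROM products ORDER BY productID")]
print(ids)
```

The `?` placeholders pass the values separately from the SQL text, which avoids quoting mistakes when the values come from a program rather than being typed by hand.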
Now that we have inserted some values into the products table, let us go ahead and
see how we can query the data. If you want to see all observations in a database
table, then make use of the SELECT * FROM tablename query:
mysql> SELECT * FROM products;
Output:
+-----------+------+-----------------+----------+--------+
| productID | code | productname     | quantity | price  |
+-----------+------+-----------------+----------+--------+
|         1 | IPH  | Iphone 5S Gold  |      300 | 625.00 |
|         2 | IPH  | Iphone 5S Black |     8000 | 655.25 |
|         3 | IPH  | Iphone 5S Blue  |     2000 | 625.50 |
|         4 | SNY  | Xperia Z1       |    10000 | 555.48 |
|         5 | SNY  | Xperia S        |     8000 | 400.49 |
+-----------+------+-----------------+----------+--------+
5 rows in set (0.00 sec)
The SELECT query is perhaps the most widely known query in SQL. It allows you to
query a database and get the observations matching your criteria. SELECT * is the
most generic query, which simply returns all observations in a table. The general
syntax of the SELECT query is as follows:
SELECT column1Name, column2Name, ... FROM tableName;
This returns the selected columns from the named table. Another variation
of the SELECT query is the following:
SELECT column1Name, column2Name, ... FROM tableName WHERE
someCondition;
In the above version, only those observations would be returned that match
the criteria specified by the user. Let us understand them with the help of a few
examples:
mysql> SELECT productname, quantity FROM products;
Output:
+-----------------+----------+
| productname | quantity |
+-----------------+----------+
| Iphone 5S Gold | 300 |
| Iphone 5S Black | 8000 |
| Iphone 5S Blue | 2000 |
| Xperia Z1 | 10000 |
| Xperia S | 8000 |
+-----------------+----------+
5 rows in set (0.00 sec)
mysql> SELECT productname, price FROM products WHERE price < 600;
Output:
+-------------+--------+
| productname | price |
+-------------+--------+
| Xperia Z1 | 555.48 |
| Xperia S | 400.49 |
+-------------+--------+
2 rows in set (0.00 sec)
The above query returns only the productname and price columns for those records
whose price is less than 600.
mysql> SELECT productname, price FROM products
WHERE price >= 600;
Output:
+-----------------+--------+
| productname | price |
+-----------------+--------+
| Iphone 5S Gold | 625.00 |
| Iphone 5S Black | 655.25 |
| Iphone 5S Blue | 625.50 |
+-----------------+--------+
3 rows in set (0.00 sec)
The above query returns only the productname and price columns for those records
whose price is greater than or equal to 600.
In order to select observations based on string comparisons, enclose the string
within quotes. For example:
mysql> SELECT productname, price FROM products WHERE code = 'IPH';
Output:
+-----------------+--------+
| productname | price |
+-----------------+--------+
| Iphone 5S Gold | 625.00 |
| Iphone 5S Black | 655.25 |
| Iphone 5S Blue | 625.50 |
+-----------------+--------+
3 rows in set (0.00 sec)
The above command gives you the name and price of the products whose code
is “IPH.”
In addition to this, you can perform string pattern matching using wildcard
characters. For example, you can use the operators LIKE and NOT LIKE to test
whether a string matches a specific pattern. For wildcard matches, use the
underscore character “_” to match exactly one character, and the percentage
sign “%” to match any number of characters (including none). Here
are a few examples:
• “phone%” will match strings that start with phone and can contain any characters
after.
• “%phone” will match strings that end with phone and can contain any characters
before.
• “%phone%” will match strings that contain phone anywhere in the string.
• “c_a” will match strings that start with “c” and end with “a” and contain any
single character in-between.
mysql> SELECT productname, price FROM products WHERE productname
LIKE 'Iphone%';
Output:
+-----------------+--------+
| productname | price |
+-----------------+--------+
| Iphone 5S Gold | 625.00 |
| Iphone 5S Black | 655.25 |
| Iphone 5S Blue | 625.50 |
+-----------------+--------+
3 rows in set (0.00 sec)
mysql> SELECT productname, price FROM products WHERE productname
LIKE '%Blue%';
Output:
+----------------+--------+
| productname | price |
+----------------+--------+
| Iphone 5S Blue | 625.50 |
+----------------+--------+
1 row in set (0.00 sec)
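The wildcard rules listed above can be checked one pattern at a time. The sketch below is an illustration only and assumes SQLite (through Python's built-in sqlite3 module) as a stand-in for MySQL; the helper name like and the sample strings are hypothetical. SQLite's LIKE uses the same “_” and “%” wildcards.

```python
import sqlite3

# Assumption: SQLite's LIKE operator stands in for MySQL's.
conn = sqlite3.connect(":memory:")


def like(text, pattern):
    """Ask the database whether `text` matches the LIKE `pattern`."""
    row = conn.execute("SELECT ? LIKE ?", (text, pattern)).fetchone()
    return bool(row[0])


examples = [
    ("phone booth", "phone%"),  # starts with "phone"
    ("smartphone", "%phone"),   # ends with "phone"
    ("telephones", "%phone%"),  # contains "phone"
    ("cba", "c_a"),             # exactly one character between "c" and "a"
    ("cbba", "c_a"),            # two characters in between, so no match
]

results = {(text, pattern): like(text, pattern) for text, pattern in examples}
for (text, pattern), matched in results.items():
    print(f"{text!r} LIKE {pattern!r}: {matched}")
```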
Additionally, you can use Boolean operators such as AND and OR in
SQL queries to combine multiple conditions.
mysql> SELECT * FROM products WHERE quantity >= 5000 AND
productname LIKE 'Iphone%';
Output:
+-----------+------+-----------------+----------+--------+
| productID | code | productname | quantity | price |
+-----------+------+-----------------+----------+--------+
| 2 | IPH | Iphone 5S Black | 8000 | 655.25 |
+-----------+------+-----------------+----------+--------+
1 row in set (0.00 sec)
This gives you all the details of products whose quantity is at least 5000 and
whose name begins with “Iphone”.
mysql> SELECT * FROM products WHERE quantity >= 5000 AND price >
650 AND productname LIKE 'Iphone%';
Output:
+-----------+------+-----------------+----------+--------+
| productID | code | productname | quantity | price |
+-----------+------+-----------------+----------+--------+
| 2 | IPH | Iphone 5S Black | 8000 | 655.25 |
+-----------+------+-----------------+----------+--------+
1 row in set (0.00 sec)
If you want to test whether a value matches any element of a set, you can use
the IN operator. For example:
mysql> SELECT * FROM products WHERE productname IN
('Iphone 5S Blue', 'Iphone 5S Black');
Output:
+-----------+------+-----------------+----------+--------+
| productID | code | productname | quantity | price |
+-----------+------+-----------------+----------+--------+
| 2 | IPH | Iphone 5S Black | 8000 | 655.25 |
| 3 | IPH | Iphone 5S Blue | 2000 | 625.50 |
+-----------+------+-----------------+----------+--------+
2 rows in set (0.00 sec)
This gives the product details for the names provided in the list specified in the
command (i.e., “Iphone 5S Blue”, “Iphone 5S Black”).
Similarly, if you want to test whether a value lies within a specific range,
you can use the BETWEEN operator. For example:
mysql> SELECT * FROM products WHERE (price BETWEEN 400 AND 600)
AND (quantity BETWEEN 5000 AND 10000);
Output:
+-----------+------+-------------+----------+--------+
| productID | code | productname | quantity | price |
+-----------+------+-------------+----------+--------+
| 4 | SNY | Xperia Z1 | 10000 | 555.48 |
| 5 | SNY | Xperia S | 8000 | 400.49 |
+-----------+------+-------------+----------+--------+
2 rows in set (0.00 sec)
This command gives you the details of products whose price is between 400 and 600
and whose quantity is between 5000 and 10000; BETWEEN includes both endpoints.
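That BETWEEN includes both endpoints can be verified directly. The following sketch is an illustration under assumptions: it rebuilds the sample products table in an in-memory SQLite database (Python's built-in sqlite3 module standing in for MySQL) and picks bounds equal to two stored prices, so a row sitting exactly on each boundary must appear if the range is inclusive.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; the data mirrors the products table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (productID INTEGER PRIMARY KEY, code TEXT, "
    "productname TEXT, quantity INTEGER, price REAL)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?, ?, ?)",
    [
        (1, "IPH", "Iphone 5S Gold", 300, 625.00),
        (2, "IPH", "Iphone 5S Black", 8000, 655.25),
        (3, "IPH", "Iphone 5S Blue", 2000, 625.50),
        (4, "SNY", "Xperia Z1", 10000, 555.48),
        (5, "SNY", "Xperia S", 8000, 400.49),
    ],
)

# Both bounds equal a stored price, so both boundary rows should be returned.
between_rows = conn.execute(
    "SELECT productname FROM products "
    "WHERE price BETWEEN 400.49 AND 555.48 ORDER BY productID"
).fetchall()

# The same condition written with explicit comparisons.
explicit_rows = conn.execute(
    "SELECT productname FROM products "
    "WHERE price >= 400.49 AND price <= 555.48 ORDER BY productID"
).fetchall()

print(between_rows)
print(between_rows == explicit_rows)
```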
Often, when we retrieve a large number of results, we want to sort
them in a specific order. To do so, we use ORDER BY in SQL. The
general syntax is:
SELECT ... FROM tableName
WHERE criteria
ORDER BY columnA ASC|DESC, columnB ASC|DESC
For example:
mysql> SELECT * FROM products WHERE productname LIKE 'Iphone%'
ORDER BY price DESC;
Output:
+-----------+------+-----------------+----------+--------+
| productID | code | productname | quantity | price |
+-----------+------+-----------------+----------+--------+
| 2 | IPH | Iphone 5S Black | 8000 | 655.25 |
| 3 | IPH | Iphone 5S Blue | 2000 | 625.50 |
| 1 | IPH | Iphone 5S Gold | 300 | 625.00 |
+-----------+------+-----------------+----------+--------+
3 rows in set (0.00 sec)
If you are getting a large number of results but want the output to be limited
only to a specific number of observations, then you can make use of LIMIT clause.
LIMIT followed by a number will limit the number of output results that will be
displayed.
mysql> SELECT * FROM products ORDER BY price LIMIT 2;
Output:
+-----------+------+-------------+----------+--------+
| productID | code | productname | quantity | price |
+-----------+------+-------------+----------+--------+
| 5 | SNY | Xperia S | 8000 | 400.49 |
| 4 | SNY | Xperia Z1 | 10000 | 555.48 |
+-----------+------+-------------+----------+--------+
2 rows in set (0.00 sec)
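ORDER BY and LIMIT are often combined to produce a "top N" list, and a second sort column breaks ties in the first. The sketch below illustrates this under assumptions: it uses an in-memory SQLite database (Python's built-in sqlite3 module) as a stand-in for MySQL, loaded with the same sample rows.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; the data mirrors the products table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (productID INTEGER PRIMARY KEY, code TEXT, "
    "productname TEXT, quantity INTEGER, price REAL)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?, ?, ?)",
    [
        (1, "IPH", "Iphone 5S Gold", 300, 625.00),
        (2, "IPH", "Iphone 5S Black", 8000, 655.25),
        (3, "IPH", "Iphone 5S Blue", 2000, 625.50),
        (4, "SNY", "Xperia Z1", 10000, 555.48),
        (5, "SNY", "Xperia S", 8000, 400.49),
    ],
)

# Largest stock first; products with equal quantity are ordered by ascending price.
top3 = conn.execute(
    "SELECT productname, quantity, price FROM products "
    "ORDER BY quantity DESC, price ASC LIMIT 3"
).fetchall()

for row in top3:
    print(row)
```

Two products share a quantity of 8000, so the second sort key (price) decides which of them is listed first.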
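You can also rename the columns in the output by giving each one an alias with
the AS keyword:
mysql> SELECT productID AS ID, code AS productCode, productname AS
Description, price AS Unit_Price FROM products ORDER BY ID;
Output: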
+----+-------------+-----------------+------------+
| ID | productCode | Description | Unit_Price |
+----+-------------+-----------------+------------+
| 1 | IPH | Iphone 5S Gold | 625.00 |
| 2 | IPH | Iphone 5S Black | 655.25 |
| 3 | IPH | Iphone 5S Blue | 625.50 |
| 4 | SNY | Xperia Z1 | 555.48 |
| 5 | SNY | Xperia S | 400.49 |
+----+-------------+-----------------+------------+
5 rows in set (0.00 sec)
A key use of SQL queries is to produce summary reports from large
amounts of data. This summarization involves data manipulation and
grouping. To support such reports, SQL provides operators such as
DISTINCT and GROUP BY that allow quick summarization of data.
Let us look at these operators one by one.
4.9.1 DISTINCT
A column may have duplicate values. We could use the keyword DISTINCT to
select only distinct values. We can also apply DISTINCT to several columns to
select distinct combinations of these columns. For example:
mysql> SELECT DISTINCT code FROM products;
Output:
+-----+
| Code |
+-----+
| IPH |
| SNY |
+-----+
2 rows in set (0.00 sec)
The GROUP BY clause allows you to collapse multiple records with a common
value into groups. For example,
mysql> SELECT * FROM products ORDER BY code, productID;
Output:
+-----------+------+-----------------+----------+--------+
| productID | code | productname | quantity | price |
+-----------+------+-----------------+----------+--------+
| 1 | IPH | Iphone 5S Gold | 300 | 625.00 |
| 2 | IPH | Iphone 5S Black | 8000 | 655.25 |
| 3 | IPH | Iphone 5S Blue | 2000 | 625.50 |
| 4 | SNY | Xperia Z1 | 10000 | 555.48 |
| 5 | SNY | Xperia S | 8000 | 400.49 |
+-----------+------+-----------------+----------+--------+
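5 rows in set (0.00 sec)
mysql> SELECT * FROM products GROUP BY code;
-- Only the first record in each group is shown. With the ONLY_FULL_GROUP_BY
-- SQL mode (enabled by default since MySQL 5.7.5), this query is rejected instead.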
Output:
+-----------+------+----------------+----------+--------+
| productID | code | productname | quantity | price |
+-----------+------+----------------+----------+--------+
| 1 | IPH | Iphone 5S Gold | 300 | 625.00 |
| 4 | SNY | Xperia Z1 | 10000 | 555.48 |
+-----------+------+----------------+----------+--------+
2 rows in set (0.00 sec)
mysql> SELECT COUNT(*) AS `Count` FROM products;
Output:
+-------+
| Count |
+-------+
| 5 |
+-------+
1 row in set (0.00 sec)
mysql> SELECT code, COUNT(*) FROM products GROUP BY code;
Output:
+------+----------+
| code | COUNT(*) |
+------+----------+
| IPH | 3 |
| SNY | 2 |
+------+----------+
2 rows in set (0.00 sec)
We got a count of 3 for “IPH” because our table has three entries with the
product code “IPH,” and similarly two entries for the product code “SNY.” Besides
COUNT(), there are many other aggregate functions such as AVG(), MAX(), MIN(),
and SUM(). For example,
mysql> SELECT MAX(price), MIN(price), AVG(price), SUM(quantity)
FROM products;
Output:
+------------+------------+------------+---------------+
| MAX(price) | MIN(price) | AVG(price) | SUM(quantity) |
+------------+------------+------------+---------------+
| 655.25 | 400.49 | 572.344000 | 28300 |
+------------+------------+------------+---------------+
1 row in set (0.00 sec)
This gives you the maximum price, minimum price, average price, and total quantity
of all the products in our products table. Now let us use the GROUP BY clause:
mysql> SELECT code, MAX(price) AS `Highest Price`, MIN(price) AS
`Lowest Price` FROM products GROUP BY code;
Output:
+------+---------------+--------------+
| code | Highest Price | Lowest Price |
+------+---------------+--------------+
| IPH | 655.25 | 625.00 |
| SNY | 555.48 | 400.49 |
+------+---------------+--------------+
2 rows in set (0.00 sec)
This means that the highest price of an iPhone in our database is 655.25
and the lowest price is 625.00. Similarly, the highest price of a Sony product is
555.48 and the lowest price is 400.49.
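To make concrete what "collapsing records into groups" means, the sketch below computes the same per-code summary twice: once with GROUP BY and once by grouping the rows by hand in a Python dictionary. It is an illustration under assumptions, using an in-memory SQLite database (Python's built-in sqlite3 module) as a stand-in for MySQL; the variable names are hypothetical.

```python
import sqlite3
from collections import defaultdict

rows = [
    (1, "IPH", "Iphone 5S Gold", 300, 625.00),
    (2, "IPH", "Iphone 5S Black", 8000, 655.25),
    (3, "IPH", "Iphone 5S Blue", 2000, 625.50),
    (4, "SNY", "Xperia Z1", 10000, 555.48),
    (5, "SNY", "Xperia S", 8000, 400.49),
]

# Assumption: in-memory SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (productID INTEGER PRIMARY KEY, code TEXT, "
    "productname TEXT, quantity INTEGER, price REAL)"
)
conn.executemany("INSERT INTO products VALUES (?, ?, ?, ?, ?)", rows)

# One output row per distinct code, with aggregates computed within each group.
sql_summary = conn.execute(
    "SELECT code, COUNT(*), MAX(price), MIN(price), SUM(quantity) "
    "FROM products GROUP BY code ORDER BY code"
).fetchall()

# The same computation by hand: bucket the rows by code, then aggregate each bucket.
groups = defaultdict(list)
for _product_id, code, _name, quantity, price in rows:
    groups[code].append((quantity, price))

manual_summary = [
    (
        code,
        len(items),
        max(price for _, price in items),
        min(price for _, price in items),
        sum(quantity for quantity, _ in items),
    )
    for code, items in sorted(groups.items())
]

print(sql_summary)
print(sql_summary == manual_summary)
```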
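To modify existing data, use the UPDATE ... SET command, with the following
syntax:
UPDATE tableName SET columnName = {value|NULL|DEFAULT}, ... WHERE
criteria
For example,
mysql> UPDATE products SET quantity = quantity + 50, price = 600.5
WHERE productname = 'Xperia Z1';
Output: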
Query OK, 1 row affected (0.14 sec)
Rows matched: 1 Changed: 1 Warnings: 0
mysql> SELECT * FROM products WHERE productname = 'Xperia Z1';
Output:
+-----------+------+-------------+----------+--------+
| productID | code | productname | quantity | price |
+-----------+------+-------------+----------+--------+
| 4 | SNY | Xperia Z1 | 10050 | 600.50 |
+-----------+------+-------------+----------+--------+
1 row in set (0.00 sec)
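When an UPDATE is issued from a program, it is useful to check how many rows were changed, much as the "Rows matched" line does in the console. The sketch below is an illustration under assumptions: it uses an in-memory SQLite database (Python's built-in sqlite3 module) as a stand-in for MySQL and reads the cursor's rowcount attribute after the update.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; the data mirrors the products table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (productID INTEGER PRIMARY KEY, code TEXT, "
    "productname TEXT, quantity INTEGER, price REAL)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?, ?, ?)",
    [
        (4, "SNY", "Xperia Z1", 10000, 555.48),
        (5, "SNY", "Xperia S", 8000, 400.49),
    ],
)

# The WHERE clause restricts the change to one product; without it,
# every row in the table would be updated.
cursor = conn.execute(
    "UPDATE products SET quantity = quantity + 50, price = 600.5 "
    "WHERE productname = ?",
    ("Xperia Z1",),
)
changed = cursor.rowcount
conn.commit()

updated_row = conn.execute(
    "SELECT * FROM products WHERE productname = ?", ("Xperia Z1",)
).fetchone()
untouched_row = conn.execute(
    "SELECT * FROM products WHERE productname = ?", ("Xperia S",)
).fetchone()

print(changed, updated_row)
```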
Use the DELETE FROM command to delete row(s) from a table; the syntax is:
DELETE FROM tableName; # deletes all rows from the table
DELETE FROM tableName WHERE criteria; # deletes only the row(s) that meet the criteria
For example,
mysql> DELETE FROM products WHERE productname LIKE 'Xperia%';
Output:
Query OK, 2 rows affected (0.03 sec)
mysql> SELECT * FROM products;
Output:
+-----------+------+-----------------+----------+--------+
| productID | code | productname | quantity | price |
+-----------+------+-----------------+----------+--------+
| 1 | IPH | Iphone 5S Gold | 300 | 625.00 |
| 2 | IPH | Iphone 5S Black | 8000 | 655.25 |
| 3 | IPH | Iphone 5S Blue | 2000 | 625.50 |
+-----------+------+-----------------+----------+--------+
3 rows in set (0.00 sec)
mysql> DELETE FROM products;
Output:
Query OK, 3 rows affected (0.14 sec)
mysql> SELECT * FROM products;
Output:
Empty set (0.00 sec)
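The difference between a DELETE with and without a WHERE clause can be seen by counting the rows that remain after each. The following sketch is an illustration under assumptions, again using an in-memory SQLite database (Python's built-in sqlite3 module) as a stand-in for MySQL.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; the data mirrors the products table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (productID INTEGER PRIMARY KEY, code TEXT, "
    "productname TEXT, quantity INTEGER, price REAL)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?, ?, ?)",
    [
        (1, "IPH", "Iphone 5S Gold", 300, 625.00),
        (2, "IPH", "Iphone 5S Black", 8000, 655.25),
        (3, "IPH", "Iphone 5S Blue", 2000, 625.50),
        (4, "SNY", "Xperia Z1", 10000, 555.48),
        (5, "SNY", "Xperia S", 8000, 400.49),
    ],
)


def count_rows():
    """Number of rows currently in the products table."""
    return conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]


# Conditional delete: only the rows matching the criteria are removed.
cursor = conn.execute("DELETE FROM products WHERE productname LIKE 'Xperia%'")
deleted_with_where = cursor.rowcount
remaining_after_where = count_rows()

# Unconditional delete: every remaining row is removed.
conn.execute("DELETE FROM products")
remaining_after_all = count_rows()
conn.commit()

print(deleted_with_where, remaining_after_where, remaining_after_all)
```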
Suppose that each product has one supplier, and each supplier supplies one or more
products. We could create a table called “suppliers” to store suppliers’ data (e.g.,
name, address, and phone number). We create a column with unique value called
supplierID to identify every supplier. We set supplierID as the primary key for the
table suppliers (to ensure uniqueness and facilitate fast search).
In order to relate the suppliers table to the products table, we add a new column
into the “products” table—the supplierID.
We then set the supplierID column of the products table as a foreign key which
references the supplierID column of the “suppliers” table to ensure the so-called
referential integrity. We need to first create the “suppliers” table, because the
“products” table references the “suppliers” table.
mysql> CREATE TABLE suppliers (supplierID INT UNSIGNED NOT NULL
AUTO_INCREMENT, name VARCHAR(30) NOT NULL DEFAULT '', phone
CHAR(8) NOT NULL DEFAULT '', PRIMARY KEY (supplierID));
Output:
Query OK, 0 rows affected (0.33 sec)
mysql> DESCRIBE suppliers;
Output:
+------------+------------------+------+-----+--------+---------------+
| Field | Type | Null | Key | Default | Extra |
+------------+------------------+------+-----+--------+---------------+
| supplierID | int(10) unsigned | NO | PRI | NULL | auto_increment |
| name | varchar(30) | NO | | | |
| phone | char(8) | NO | | | |
+------------+------------------+------+-----+--------+---------------+
3 rows in set (0.01 sec)
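mysql> INSERT INTO suppliers VALUES (501, 'ABC Traders', '88881111'),
(502, 'XYZ Company', '88882222'), (503, 'QQ Corp', '88883333');
Output: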
Query OK, 3 rows affected (0.13 sec)
Records: 3 Duplicates: 0 Warnings: 0
mysql> SELECT * FROM suppliers;
Output:
+------------+-------------+----------+
| supplierID | name | phone |
+------------+-------------+----------+
| 501 | ABC Traders | 88881111 |
| 502 | XYZ Company | 88882222 |
| 503 | QQ Corp | 88883333 |
+------------+-------------+----------+
3 rows in set (0.00 sec)
Instead of deleting and re-creating the products table, we shall use the
“ALTER TABLE” statement to add a new column, supplierID, to the products table.
As we deleted all the records from products in the last few queries, let us rerun
the three INSERT queries from Sect. 4.6 before running “ALTER TABLE.”
mysql> ALTER TABLE products ADD COLUMN supplierID INT UNSIGNED
NOT NULL;
Output:
Query OK, 0 rows affected (0.43 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> DESCRIBE products;
Output:
+-------------+-----------------+-----+-----+---------+---------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+-----------------+-----+-----+---------+---------------+
| productID | int(10) unsigned | NO | PRI | NULL | auto_increment |
| code | char(6) | NO | | | |
| productname | varchar(30) | NO | | | |
| quantity | int(10) unsigned | NO | | 0 | |
| price | decimal(5,2) | NO | | 0.00 | |
| supplierID | int(10) unsigned | NO | | NULL | |
+-------------+-----------------+-----+-----+---------+---------------+
6 rows in set (0.00 sec)
Now, we shall add a foreign key constraint on the supplierID column of the
“products” child table, referencing the “suppliers” parent table, to ensure that
every supplierID in the “products” table always refers to a valid supplierID in the
“suppliers” table. This is called referential integrity.
Before we add the foreign key, we need to set the supplierID of the existing
records in the “products” table to a valid supplierID in the “suppliers” table (say
supplierID = 501). As we have re-inserted the records into the “products” table,
we can do this with the UPDATE command.
mysql> UPDATE products SET supplierID = 501;
Output:
Query OK, 5 rows affected (0.04 sec)
Rows matched: 5 Changed: 5 Warnings: 0
mysql> ALTER TABLE products ADD FOREIGN KEY (supplierID) REFERENCES
suppliers (supplierID);
Output:
Query OK, 0 rows affected (0.56 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> DESCRIBE products;
Output:
+------------+-----------------+------+-----+---------+---------------+
| Field | Type | Null | Key | Default | Extra |
+------------+-----------------+------+-----+---------+---------------+
| productID | int(10) unsigned | NO | PRI | NULL | auto_increment |
| code | char(6) | NO | | | |
| productname | varchar(30) | NO | | | |
| quantity | int(10) unsigned | NO | | 0 | |
| price | decimal(5,2) | NO | | 0.00 | |
| supplierID | int(10) unsigned | NO | MUL | NULL | |
+------------+-----------------+------+-----+---------+---------------+
6 rows in set (0.00 sec)
mysql> SELECT * FROM products;
Output:
+-----------+------+-----------------+----------+--------+------------+
| productID | code | productname | quantity | price | supplierID |
+-----------+------+-----------------+----------+--------+------------+
| 1 | IPH | Iphone 5S Gold | 300 | 625.00 | 501 |
| 2 | IPH | Iphone 5S Black | 8000 | 655.25 | 501 |
| 3 | IPH | Iphone 5S Blue | 2000 | 625.50 | 501 |
| 4 | SNY | Xperia Z1 | 10000 | 555.48 | 501 |
| 5 | SNY | Xperia S | 8000 | 400.49 | 501 |
+-----------+------+-----------------+----------+--------+------------+
5 rows in set (0.00 sec)
mysql> UPDATE products SET supplierID = 502 WHERE productID = 1;
Output:
Query OK, 1 row affected (0.13 sec)
Rows matched: 1 Changed: 1 Warnings: 0
mysql> SELECT * FROM products;
Output:
+-----------+------+-----------------+----------+--------+------------+
| productID | code | productname | quantity | price | supplierID |
+-----------+------+-----------------+----------+--------+------------+
| 1 | IPH | Iphone 5S Gold | 300 | 625.00 | 502 |
| 2 | IPH | Iphone 5S Black | 8000 | 655.25 | 501 |
| 3 | IPH | Iphone 5S Blue | 2000 | 625.50 | 501 |
| 4 | SNY | Xperia Z1 | 10000 | 555.48 | 501 |
| 5 | SNY | Xperia S | 8000 | 400.49 | 501 |
+-----------+------+-----------------+----------+--------+------------+
5 rows in set (0.00 sec)
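Referential integrity is easiest to appreciate by watching the database refuse an invalid row. The sketch below is an illustration under assumptions: it uses an in-memory SQLite database (Python's built-in sqlite3 module) as a stand-in for MySQL, with simplified table definitions. One SQLite-specific detail is that foreign key enforcement must be switched on with PRAGMA foreign_keys = ON.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; table definitions are simplified.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.execute(
    "CREATE TABLE suppliers (supplierID INTEGER PRIMARY KEY, "
    "name TEXT NOT NULL, phone TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE products (productID INTEGER PRIMARY KEY, "
    "productname TEXT NOT NULL, supplierID INTEGER NOT NULL, "
    "FOREIGN KEY (supplierID) REFERENCES suppliers (supplierID))"
)

conn.execute("INSERT INTO suppliers VALUES (501, 'ABC Traders', '88881111')")
# Valid: supplier 501 exists in the parent table.
conn.execute("INSERT INTO products VALUES (1, 'Iphone 5S Gold', 501)")

# Invalid: there is no supplier 999, so the child row is rejected.
try:
    conn.execute("INSERT INTO products VALUES (2, 'Iphone 5S Black', 999)")
    rejected = False
except sqlite3.IntegrityError as error:
    rejected = True
    print("Rejected:", error)

product_count = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
print(product_count)
```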
SELECT command can be used to query and join data from two related tables.
For example, to list the product’s name (in products table) and supplier’s name (in
suppliers table), we could join the two tables using the two common supplierID
columns:
mysql> SELECT products.productname, price, suppliers.name FROM
products JOIN suppliers ON products.supplierID
= suppliers.supplierID WHERE price < 650;
Output:
+----------------+--------+-------------+
| productname | price | name |
+----------------+--------+-------------+
| Iphone 5S Gold | 625.00 | XYZ Company |
| Iphone 5S Blue | 625.50 | ABC Traders |
| Xperia Z1 | 555.48 | ABC Traders |
| Xperia S | 400.49 | ABC Traders |
+----------------+--------+-------------+
4 rows in set (0.00 sec)
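The same result can be obtained with the older comma-style join, in which the
join condition is placed in the WHERE clause:
mysql> SELECT products.productname, price, suppliers.name FROM
products, suppliers WHERE products.supplierID
= suppliers.supplierID AND price < 650;
Output: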
+----------------+--------+-------------+
| productname | price | name |
+----------------+--------+-------------+
| Iphone 5S Gold | 625.00 | XYZ Company |
| Iphone 5S Blue | 625.50 | ABC Traders |
| Xperia Z1 | 555.48 | ABC Traders |
| Xperia S | 400.49 | ABC Traders |
+----------------+--------+-------------+
4 rows in set (0.00 sec)
In the above query result, the column headings “productname” and “name” are not
very descriptive. We can create aliases for the headings. Let us use aliases for
the column names in the display.
mysql> SELECT products.productname AS `Product Name`, price,
suppliers.name AS `Supplier Name` FROM products JOIN suppliers
ON products.supplierID = suppliers.supplierID WHERE price < 650;
Output:
+----------------+--------+---------------+
| Product Name | price | Supplier Name |
+----------------+--------+---------------+
| Iphone 5S Gold | 625.00 | XYZ Company |
| Iphone 5S Blue | 625.50 | ABC Traders |
| Xperia Z1 | 555.48 | ABC Traders |
| Xperia S | 400.49 | ABC Traders |
+----------------+--------+---------------+
4 rows in set (0.00 sec)
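The join with aliases can also be run from a program, where the aliases become the column names that the program sees. The sketch below is an illustration under assumptions: it uses an in-memory SQLite database (Python's built-in sqlite3 module) as a stand-in for MySQL, keeps only the columns needed for the join, and introduces the short table aliases p and s, which are not in the query above.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; only the needed columns are kept.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE suppliers (supplierID INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE products (productID INTEGER PRIMARY KEY, productname TEXT, "
    "price REAL, supplierID INTEGER REFERENCES suppliers (supplierID))"
)
conn.executemany(
    "INSERT INTO suppliers VALUES (?, ?)",
    [(501, "ABC Traders"), (502, "XYZ Company"), (503, "QQ Corp")],
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?, ?)",
    [
        (1, "Iphone 5S Gold", 625.00, 502),
        (2, "Iphone 5S Black", 655.25, 501),
        (3, "Iphone 5S Blue", 625.50, 501),
        (4, "Xperia Z1", 555.48, 501),
        (5, "Xperia S", 400.49, 501),
    ],
)

# p and s are table aliases; the quoted names are column aliases for display.
cursor = conn.execute(
    'SELECT p.productname AS "Product Name", p.price AS price, '
    's.name AS "Supplier Name" '
    "FROM products AS p JOIN suppliers AS s ON p.supplierID = s.supplierID "
    "WHERE p.price < 650 ORDER BY p.productID"
)
headings = [column[0] for column in cursor.description]
joined = cursor.fetchall()

print(headings)
for row in joined:
    print(row)
```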
5 Summary
The chapter describes the essential commands for creating, modifying, and querying
an RDBMS. Detailed descriptions and examples can be found in the books
and websites listed in the reference section (Elmasri and Navathe 2014; Hoffer
et al. 2011; MySQL using R 2018; MySQL using Python 2018). You can also refer
to websites such as w3schools.com/sql and sqlzoo.net (both accessed on Jan 15,
2019), which help you learn SQL in a gamified console. Such practice will help you
learn to query large databases, which can otherwise be quite a challenge.
Exercises
Ex. 3.1 Print the list of all suppliers who do not keep stock of the iPhone 5S Black.
Ex. 3.2 Find out the product that has the biggest inventory by value (i.e., the product
that has the highest value in terms of total inventory).
Ex. 3.3 Print the name of the supplier who maintains the largest inventory of products.
Ex. 3.4 Due to the launch of a newer model, prices of iPhones have gone down and
the inventory value has to be written down. Create a new column (new_price) where
the price is marked down by 20% for all black- and gold-colored phones, whereas it
is marked down by 30% for the rest of the phones.
Ex. 3.5 Due to this recent markdown in prices (refer to Ex. 3.4), which supplier
takes the largest hit in terms of inventory value?
References
Elmasri, R., & Navathe, S. B. (2014). Database systems: Models, languages, design and
application. England: Pearson.
Hoffer, J. A., Venkataraman, R., & Topi, H. (2011). Modern database management. England:
Pearson.
MySQL using R. Retrieved February 2018, from https://cran.r-project.org/web/packages/RMySQL/RMySQL.pdf.
MySQL using Python. Retrieved February 2018, from http://mysql-python.sourceforge.net/MySQLdb.html.
Chapter 4
Big Data Management
P. Taori
London Business School, London, UK
e-mail: taori.peeyush@gmail.com
H. K. Dasararaju
Indian School of Business, Hyderabad, Telangana, India
1 Introduction
The twenty-first century is characterized by the digital revolution, and this
revolution is disrupting the way business decisions are made in every industry, be it
healthcare, life sciences, finance, insurance, education, entertainment, or retail.
The Digital Revolution, also known as the Third Industrial Revolution, started in
the 1980s and sparked the evolution of technology from analog electronic and
mechanical devices to the machine learning and artificial intelligence of today.
Today, people across the world interact and share information in various forms
such as content, images, or videos through social media platforms such as
Facebook, Twitter, LinkedIn, and YouTube.
The twenty-first century has also witnessed the rapid adoption of handheld and
wearable devices. The devices we use today, be they controllers or sensors used
in industrial applications, in the household, or for personal use, are generating
data at an alarming rate. The huge amounts of data generated today are often
termed big data. We have ushered in an age of big data-driven analytics, where
big data not only drives decision-making for firms but also affects the way we
use services in our daily lives. The statistics below help provide a perspective
on how much data pervades our lives today:
Prevalence of big data:
• The total amount of data generated by mankind is 2.7 zettabytes, and it continues
to grow at an exponential rate.
• In terms of digital transactions, according to an estimate by IDC, we shall soon
be conducting nearly 450 billion transactions per day.
• Facebook analyzes 30+ petabytes of user-generated data every day.
(Source: https://www.waterfordtechnologies.com/big-data-interesting-facts/,
accessed on Aug 10, 2018.)
With so much data around us, it is only natural to envisage that big data holds
tremendous value for businesses, firms, and society as a whole. While the potential
is huge, the challenges that big data analytics faces are also unique in their own
respect. Because of the sheer size and velocity of data involved, we cannot use
traditional computing methods to unlock big data value. This unique challenge has
led to the emergence of big data systems that can handle data at a massive scale. This
chapter builds on the concepts of big data: it tries to answer what really constitutes
big data and focuses on some of the big data tools. In this chapter, we discuss the
basics of big data tools such as Hadoop, Spark, and the surrounding ecosystem.
1 Note: When we say large datasets, that means data sizes ranging from petabytes to
exabytes and more. Please note that 1 byte = 8 bits.
Metric            Value
Byte (B)          2^0 bytes = 1 byte
Kilobyte (KB)     2^10 bytes
Megabyte (MB)     2^20 bytes
Gigabyte (GB)     2^30 bytes
Terabyte (TB)     2^40 bytes
Petabyte (PB)     2^50 bytes
Exabyte (EB)      2^60 bytes
Zettabyte (ZB)    2^70 bytes
Yottabyte (YB)    2^80 bytes
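Each unit in the note above is 2^10 = 1024 times the previous one, so the size of any unit follows from its position in the list. The short Python sketch below works through this arithmetic; the function names unit_size, to_bytes, and humanize are hypothetical helpers introduced only for this illustration.

```python
# Units in increasing order; each is 2**10 times the one before it.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]
BITS_PER_BYTE = 8


def unit_size(unit):
    """Size of one `unit` in bytes, using power-of-two multiples."""
    return 2 ** (10 * UNITS.index(unit))


def to_bytes(value, unit):
    """Convert `value` expressed in `unit` to a number of bytes."""
    return value * unit_size(unit)


def humanize(n_bytes):
    """Express a byte count in the largest unit that keeps the value at least 1."""
    value = float(n_bytes)
    index = 0
    while value >= 1024 and index < len(UNITS) - 1:
        value /= 1024
        index += 1
    return f"{value:.2f} {UNITS[index]}"


print(unit_size("PB") == 2 ** 50)           # a petabyte is 2^50 bytes
print(to_bytes(1, "KB") * BITS_PER_BYTE)    # bits in one kilobyte
print(humanize(to_bytes(30, "PB")))         # 30 petabytes, formatted
```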
what we were generating collectively even a few years ago. Unfortunately, the term
big data is used colloquially to describe a vast variety of data that is being generated.
When we describe traditional data, we tend to put it into three categories:
structured, unstructured, and semi-structured. Structured data is highly organized
information that can be easily stored in a spreadsheet or table using rows and
columns. Any data that we capture in a spreadsheet with clearly defined columns and
their corresponding values in rows is an example of structured data. Unstructured
data may have its own internal structure, but it does not conform to the standards of
structured data, where you define each field name and its type. Video files, audio
files, pictures, and text are typical examples of unstructured data. Semi-structured
data falls between these two categories. There is generally a loose
structure defined for data of this type, but we cannot impose stringent rules as we do
for structured data. Prime examples of semi-structured data are log files and
Internet of Things (IoT) data generated from a wide range of sensors and devices,
e.g., a clickstream log from an e-commerce website that records the date
and time at which classes/objects are instantiated, the IP address from which the
user is transacting, etc. In order to analyze such data, however, we need
to process it and extract the useful information into a structured format.
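The step from semi-structured to structured data can be sketched in a few lines of Python. The example is illustrative only: the log format, the field names, and the sample lines (including the IP addresses, taken from a range reserved for documentation) are all assumptions, not a format used by any particular website. Each line has a fixed prefix (date, time, IP address, request) followed by an optional, variable set of key=value pairs, which is what makes it semi-structured.

```python
import re
from datetime import datetime

# Hypothetical clickstream format: fixed prefix plus optional key=value extras.
LOG_PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) (?P<method>[A-Z]+) (?P<path>\S+)"
    r"(?P<extras>(?: \w+=\S+)*)$"
)


def parse_line(line):
    """Turn one log line into a dictionary, or return None if it does not fit."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record = {
        "timestamp": datetime.strptime(
            f"{match['date']} {match['time']}", "%Y-%m-%d %H:%M:%S"
        ),
        "ip": match["ip"],
        "method": match["method"],
        "path": match["path"],
    }
    # The optional extras differ from line to line.
    for pair in match["extras"].split():
        key, value = pair.split("=", 1)
        record[key] = value
    return record


lines = [
    "2018-08-10 14:32:07 203.0.113.5 GET /product/42 status=200 user=u17",
    "2018-08-10 14:32:09 203.0.113.9 POST /cart",
    "malformed entry",
]

records = [parse_line(line) for line in lines]
structured = [record for record in records if record is not None]
for record in structured:
    print(record)
```

The resulting dictionaries have named fields and typed values, so they could be loaded into a table of the kind discussed for structured data; lines that do not fit the pattern are set aside rather than guessed at.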
In order to put a structure to big data, we describe big data as having four
characteristics: volume, velocity, variety, and veracity. The infographic in Fig. 4.1
provides an overview through example.
We discuss each of the four characteristics briefly (also shown in Fig. 4.2):
1. Volume: It is the amount of the overall data that is already generated (by either
individuals or companies). The Internet alone generates huge amounts of data.
It is estimated that the Internet has around 14.3 trillion live web pages, which
amounts to 672 exabytes of accessible data.2
2. Variety: Data is generated from different types of sources, both internal and
external to the organization (such as social and behavioral sources), and comes in
different formats: structured, unstructured (analog data, GPS tracking
information, and audio/video streams), and semi-structured (XML, email,
and EDI).
3. Velocity: Velocity is the rate at which organizations and individuals
are generating data in the world today. For example, one study reveals that
400 hours of video are uploaded to YouTube every minute.3
4. Veracity: It describes the uncertainty inherent in the data, whether the obtained
data is correct or consistent. It is very rare that data presents itself in a form that
is ready to consume. Considerable effort goes into processing of data especially
when it is unstructured or semi-structured.
Processing big data for analytical purposes adds tremendous value to organizations
because it helps in making decisions that are data driven. In today’s world,
organizations tend to perceive the value of big data in two different ways:
1. Analytical usage of data: Organizations process big data to extract information
relevant to a field of study. This information can then be used to
predictive analytics, and forecasting to get timely and accurate insights that help
to make the best possible decisions. For example, we can provide online shoppers
with product recommendations that have been derived by studying the products
viewed and bought by them in the past. These recommendations help customers
find what they need quickly and also increase the retailer’s profitability.
2. Enable new product development: Recent successful startups are a great
example of leveraging big data analytics for new product enablement. Companies
such as Uber or Facebook use big data analytics to provide personalized services
to their customers in real time.
Uber is a taxi booking service that allows users to quickly book cab rides
from their smartphones by using a simple app. Uber's business operations rely
heavily on big data analytics and on leveraging insights effectively.
When passengers request a ride, Uber can instantly match the request
with the most suitable drivers, either located nearby or heading toward the
area where the ride is requested. Fares are calculated automatically: GPS
is used to determine the best possible route to avoid traffic and to estimate the
journey time, and proprietary algorithms make adjustments based on the
time that the journey might take.
In today’s world, every business and industry is affected by, and benefits from, big
data analytics in multiple ways. The growth in the excitement about big data is
evident everywhere. A number of actively developed technological projects focus on
big data solutions and a number of firms have come into business that focus solely
on providing big data solutions to organizations. Big data technology has evolved
to become one of the most sought-after technological areas by organizations as they
try to put together teams of individuals who can unlock the value inherent in big
data. We highlight a couple of use cases to understand the applications of big data
analytics.
1. Customer Analytics in the Retail industry
Retailers, especially those with large outlets across the country, generate
huge amounts of data in a variety of formats from various sources such as POS
transactions, billing details, loyalty programs, and CRM systems. This data
needs to be organized and analyzed in a systematic manner to derive meaningful
insights. Customers can be segmented based on their buying patterns and spend
at every transaction. Marketers can use this information for creating personalized
promotions. Organizations can also combine transaction data with customer
preferences and market trends to understand the increase or decrease in demand
for different products across regions. This information helps organizations to
determine the inventory level and make price adjustments.
2. Fraudulent claims detection in the Insurance industry
In industries like banking, insurance, and healthcare, fraud mostly involves
monetary transactions; fraudulent transactions that are not caught can
cause huge expenses and damage a firm's reputation. Prior to the advent of
big data analytics, many insurance firms identified fraudulent transactions using
statistical methods/models. However, these models have many limitations and
can prevent fraud only to a limited extent because model building can happen only
on sample data. Big data analytics enables the analyst to overcome the issue
of data volume: insurers can combine internal claim data with social data
and other publicly available data like bank statements, criminal records, and
medical bills of customers to better understand consumer behavior and identify
any suspicious behavior.
Big data is voluminous, varied, and scattered, so it requires different means of
processing than traditional data storage and processing systems
like RDBMS (relational database management systems), which are good at storing,
processing, and analyzing structured data only. Table 4.1 depicts how traditional
RDBMS differs from big data systems.
There are a number of technologies that are used to handle, process, and analyze
big data. Of them the ones that are most effective and popular are distributed
computing and parallel computing for big data, Hadoop for big data, and big data
cloud. In the remainder of the chapter, we focus on Hadoop but also briefly visit the
concepts of distributed and parallel computing and big data cloud.
Distributed Computing and Parallel Computing
Loosely speaking, distributed computing is the idea of dividing a problem
into multiple parts, each of which is operated upon by an individual machine or
computer. A key challenge in making distributed computing work is to ensure
that individual computers can communicate and coordinate their tasks. Similarly,
in parallel computing we try to improve the processing capability of a computer
system. This can be achieved by adding additional computational resources that
run parallel to each other to handle complex computations. If we combine the
concepts of both distributed and parallel computing together, the cluster of machines
will behave like a single powerful computer. Although the ideas are simple, there
are several challenges underlying distributed and parallel computing. We underline
them below.
Distributed Computing and Parallel Computing Limitations and Challenges
• Multiple failure points: If a single computer fails and the other machines cannot
reconfigure themselves in response, the overall system can go down.
• Latency: This is the aggregated delay in the system caused by delays in the
completion of individual tasks. It slows down system performance.
• Security: Unless handled properly, distributed systems carry a higher risk of
unauthorized user access.
• Software: The software used for distributed computing is complex, hard to
develop, expensive, and requires a specialized skill set. This makes it harder
for organizations to deploy distributed computing software in their
infrastructure.
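The divide-and-combine idea behind distributed and parallel computing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: threads on one machine stand in for the separate computers of a cluster, and the helper names split_into_chunks and parallel_sum are hypothetical, not part of any big data library.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, num_chunks):
    """Divide a list into num_chunks contiguous parts of near-equal size."""
    size, extra = divmod(len(items), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def parallel_sum(items, num_workers=4):
    """Sum each chunk on its own worker, then combine the partial results."""
    chunks = split_into_chunks(items, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)


numbers = list(range(1, 101))
print(parallel_sum(numbers))  # 5050, the same as sum(numbers)
```

The coordination step (collecting the partial sums) is trivial here; on a real cluster it is exactly where the failure, latency, and security challenges listed above arise.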
In order to overcome some of the issues that plagued distributed systems, companies
worked on coming up with solutions that would be easier to deploy, develop, and
maintain. The result of such an effort was Hadoop—the first open source big data
platform that is mature and has widespread usage. Hadoop was created by Doug
Cutting at Yahoo!, and derives its roots directly from the Google File System (GFS)
and the MapReduce programming model for distributed computing.
Earlier, while using distributed environments for processing huge volumes of
data, multiple nodes in a cluster could not always cooperate within a communication
system, thus creating a lot of scope for errors. The Hadoop platform provided
[Figure: Hadoop ecosystem, by layer. Data Management: Oozie (workflow monitoring), Flume (monitoring), Zookeeper (management). Data Access: Hive (SQL), Pig (data flow), Sqoop (RDBMS connector). Data Processing: MapReduce (cluster management), YARN (cluster and resource management). Data Storage: HDFS (distributed file system), HBase (column DB storage).]
We provide a brief description of each of these projects below (with the exception
of HDFS, MapReduce, and YARN that we discuss in more detail later).
HBase: HBase is an open source NoSQL database that leverages HDFS.
Some examples of NoSQL databases are HBase, Cassandra, and Amazon DynamoDB.
The main properties of HBase are:
• Strongly consistent reads and writes.
• Automatic sharding: rows of data are automatically split and stored across multiple
machines so that no single machine has the burden of storing the entire dataset.
This also enables fast searching and retrieval, as a search query does not have
to be performed over the entire dataset and can instead be run on the machine
that contains the specific data rows.
• Automatic Region Server failover: a feature that enables high availability of data
at all times. If a particular region’s server goes down, the data is still made
available through replica servers.
• Hadoop/HDFS integration.
HBase also supports parallel processing via MapReduce and has an easy-to-use
API.
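To make the idea of automatic sharding concrete, here is a minimal sketch that routes each row to one machine based on its row key, so that a lookup touches only the machine holding that key range. The split points, server names, and the put/get helpers are all hypothetical and chosen for illustration; this is not the HBase API.

```python
import bisect

# Hypothetical split points on the row key; each key range lives on one server.
SPLIT_KEYS = ["g", "n", "t"]
SERVERS = ["server-0", "server-1", "server-2", "server-3"]


def server_for(row_key):
    """Pick the server whose key range contains row_key."""
    return SERVERS[bisect.bisect_right(SPLIT_KEYS, row_key)]


def put(cluster, row_key, value):
    """Store a row on the one server responsible for its key."""
    cluster.setdefault(server_for(row_key), {})[row_key] = value


def get(cluster, row_key):
    """Look up a row by asking only the responsible server."""
    return cluster.get(server_for(row_key), {}).get(row_key)


cluster = {}
for key in ["apple", "kiwi", "orange", "zebra"]:
    put(cluster, key, key.upper())
print(server_for("kiwi"))     # server-1
print(get(cluster, "zebra"))  # ZEBRA
```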
Hive: While Hadoop is a great platform for big data analytics, a large number
of business users have limited knowledge of programming, and this can
become a hindrance to widespread adoption of big data platforms such as Hadoop.
Hive overcomes this limitation by providing a platform for writing SQL-type scripts that
can be run on Hadoop. It provides an SQL-like interface and data warehouse
infrastructure to Hadoop that helps users carry out analytics on big data
by writing SQL queries known as Hive queries. Hive query execution happens
via MapReduce: the Hive interpreter converts the query into MapReduce
jobs.
Pig: It is a procedural language platform used to develop shell-script-type
programs for MapReduce operations. Rather than writing MapReduce programs,
which can become cumbersome for nontrivial tasks, users can do data processing
by writing individual commands (similar to scripts) in a language known as
Pig Latin. Pig Latin is a data flow language; Pig translates the Pig Latin script into
MapReduce jobs, which then execute within Hadoop.
Sqoop: The primary purpose of Sqoop is to facilitate data transfer between
Hadoop and relational databases such as MySQL. Using Sqoop, users can import
data from relational databases to Hadoop and can also export data from Hadoop
to relational databases. It has a simple command-line interface for transferring
data between relational databases and Hadoop, and also supports incremental
import.
Oozie: In simple terms, Oozie is a workflow scheduler for managing Hadoop
jobs. Its primary job is to combine multiple jobs or tasks in a single unit of workflow.
This provides users with ease of access and comfort in scheduling and running
multiple jobs.
With a brief overview of multiple components of Hadoop ecosystem, let us
now focus our attention on understanding the Hadoop architecture and its core
components.
HDFS
HDFS provides a fault-tolerant distributed file storage system that can run on
commodity hardware and does not require specialized and expensive hardware. At
its very core, HDFS is a hierarchical file system where data is stored in directories.
It uses a master–slave architecture wherein one of the machines in the cluster is the
master and the rest are slaves. The master manages the data and the slaves whereas
the slaves service the read/write requests. The HDFS is tuned to efficiently handle
large files. It is also a favorable file system for Write-once Read-many (WORM)
applications.
HDFS functions on a master–slave architecture. The Master node is also referred
to as the NameNode. The slave nodes are referred to as the DataNodes. At any given
time, multiple copies of data are stored in order to ensure data availability in the
event of node failure. The number of copies to be stored is specified by replication
factor. The architecture of HDFS is specified in Fig. 4.5.
Fig. 4.5 Hadoop architecture (inspired from Hadoop architecture available on https://technocents.files.wordpress.com/2014/04/hdfs-architecture.png (accessed on Aug 10, 2018))
The NameNode maintains HDFS metadata and the DataNodes store the actual
data. When a client requests folders/records access, the NameNode validates the
request and instructs the DataNodes to provide the information accordingly. Let us
understand this better with the help of an example.
Suppose we want to store a 350 MB file into the HDFS. The following steps
illustrate how it is actually done (refer Fig. 4.6):
(a) The file is split into blocks of a fixed size; only the last block may be
smaller. The block size is decided during the formation of the cluster and is
usually 64 MB or 128 MB. Thus, our file will be split into three blocks
(assuming block size = 128 MB): two of 128 MB and one of 94 MB.
(b) Each block is replicated depending on the replication factor. Assuming the
factor to be 3, the total number of stored blocks will become 9.
(c) The three copies of the first block will then be distributed among the DataNodes
(based on the block placement policy which is explained later) and stored.
Similarly, the other blocks are also stored in the DataNodes.
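The arithmetic in the steps above can be checked with a short sketch. The function name plan_hdfs_storage is hypothetical, the sizes are in MB, and the defaults simply mirror the example (128 MB blocks, replication factor 3); it assumes Python 3 and a file size greater than zero.

```python
import math


def plan_hdfs_storage(file_size_mb, block_size_mb=128, replication_factor=3):
    """Return the block sizes for a file and the total number of stored replicas.

    Assumes file_size_mb is greater than zero.
    """
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is full-sized except possibly the last one.
    block_sizes = [block_size_mb] * (num_blocks - 1)
    block_sizes.append(file_size_mb - block_size_mb * (num_blocks - 1))
    total_replicas = num_blocks * replication_factor
    return block_sizes, total_replicas


sizes, replicas = plan_hdfs_storage(350)
print(sizes)     # [128, 128, 94]
print(replicas)  # 9
```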
Figure 4.7 represents the storage of blocks in the DataNodes. Nodes 1 and 2
are part of Rack1. Nodes 3 and 4 are part of Rack2. A rack is nothing but a
collection of data nodes connected to each other. Machines within the same rack
have faster access to each other than machines in different
racks. The block replication is in accordance with the block placement policy. The
decisions about which block is stored in which DataNode are taken by the
NameNode.
The major functionalities of the NameNode and the DataNode are as follows:
NameNode Functions:
• Serves as the interface for all file read/write requests by clients.
• Manages the file system namespace. The namespace maintains a
list of all files and directories in the cluster. It contains all metadata information
associated with various data blocks, and also maintains a list
of all data blocks and the nodes they are stored on.
• Performs typical operations associated with a file system such as file open/close,
renaming directories and so on.
• Determines which blocks of data are to be stored on which DataNodes.
Secondary NameNode Functions:
• Keeps snapshots of the NameNode’s state. If the NameNode fails, these snapshots
can be used to restore it; the secondary NameNode is a checkpointing helper
rather than an automatic standby.
• It takes snapshots of the primary NameNode information at regular intervals
and saves them in directories. These snapshots are known as
checkpoints, and can be used to restart the primary NameNode if
it fails.
DataNode Functions:
• DataNodes are the actual machines that store data and take care of read/write
requests from clients.
• DataNodes are responsible for the creation, replication, and deletion of data
blocks. These operations are performed by the DataNode only upon direction
from the NameNode.
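The division of labor described above, with the NameNode holding only metadata and the DataNodes holding the actual data, can be illustrated with a toy model. Everything here is a simplified assumption for illustration: the class ToyNameNode, the function read_file, the node names, and the file path are hypothetical and do not correspond to Hadoop's own classes.

```python
class ToyNameNode:
    """Keeps metadata only: which blocks make up a file and where they live."""

    def __init__(self):
        self.file_to_blocks = {}   # file path -> ordered list of block ids
        self.block_to_nodes = {}   # block id -> list of DataNode names

    def register(self, path, block_id, nodes):
        self.file_to_blocks.setdefault(path, []).append(block_id)
        self.block_to_nodes[block_id] = list(nodes)

    def locate(self, path):
        if path not in self.file_to_blocks:
            raise FileNotFoundError(path)
        return [(b, self.block_to_nodes[b]) for b in self.file_to_blocks[path]]


def read_file(name_node, data_nodes, path):
    """Ask the NameNode where the blocks live, then fetch them from DataNodes."""
    content = []
    for block_id, nodes in name_node.locate(path):
        first_replica = nodes[0]
        content.append(data_nodes[first_replica][block_id])
    return "".join(content)


# DataNodes are modeled as plain dictionaries: block id -> block content.
data_nodes = {"node1": {}, "node2": {}, "node3": {}}
name_node = ToyNameNode()
blocks = {"B1": "hello ", "B2": "big ", "B3": "data"}
placement = {"B1": ["node1", "node2"], "B2": ["node2", "node3"],
             "B3": ["node3", "node1"]}
for block_id, nodes in placement.items():
    for node in nodes:
        data_nodes[node][block_id] = blocks[block_id]
    name_node.register("/demo.txt", block_id, nodes)

print(read_file(name_node, data_nodes, "/demo.txt"))  # hello big data
```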
Now that we have discussed how Hadoop solves the storage problem associated
with big data, it is time to discuss how Hadoop performs parallel operations on the
data that is stored across multiple machines. The module in Hadoop that takes care
of computing is known as MapReduce.
3.5 MapReduce
MapReduce Principles
There are certain principles on which MapReduce programming is based. The
salient principles of MapReduce programming are:
• Move code to data—Rather than moving data to code, as is done in traditional
programming applications, in MapReduce we move code to data. By moving
code to data, Hadoop MapReduce removes the overhead of data transfer.
• Allow programs to scale transparently—MapReduce computations are executed
in such a way that there is no data overload—allowing programs to scale.
• Abstract away fault tolerance, synchronization, etc.—Hadoop MapReduce
implementation handles everything, allowing the developers to build only the
computation logic.
Figure 4.8 illustrates the overall flow of a MapReduce program.
MapReduce Functionality
Below are the MapReduce components and their functionality in brief:
• Master (Job Tracker): Coordinates all MapReduce tasks; manages job queues
and scheduling; monitors and controls task trackers; uses checkpoints to combat
failures.
• Slaves (Task Trackers): Execute individual map/reduce tasks assigned by Job
Tracker; write information to local disk (not HDFS).
• Job: A complete program that executes the Mapper and Reducer on the entire
dataset.
• Task: A localized unit that executes code on the data that resides on the local
machine. Multiple tasks comprise a job.
Let us understand this in more detail with the help of an example. In the following
example, we are interested in doing the word count for a large amount of text using
MapReduce. Before actually implementing the code, let us focus on the pseudo
logic for the program. It is important for us to think of a programming problem in
terms of MapReduce, that is, a mapper and a reducer. Both mapper and reducer take
(key,value) pairs as input and provide (key,value) pairs as output. A
mapper program could simply take each word as input and provide as output a
(key,value) pair of the form (word,1). This code would be run on all machines that
store the text on which we want to run the word count. Once all the mappers
have finished running, each one of them will have produced outputs of the form specified
above. After that, an automatic shuffle and sort phase kicks in that takes all
(key,value) pairs with the same key and passes them to a single machine (or reducer).
This ensures that the aggregation for each unique key happens over the
entire dataset. Imagine that we want to count all
occurrences of the word “Hello.” The only way this can be ensured is
if all occurrences of (“Hello”, 1) are passed to a single reducer. Once the reducer
receives its input, it sums up the values for each key, that is, 1,1,1, and so on.
Finally, the output would be (key, sum) for
each unique key.
Let us now implement this in terms of pseudo logic:
Program: Word Count Occurrences
Pseudo code:
input-key: document name
input-value: document content
Map (input-key, input-value)
    For each word w in input-value
        produce (w, 1)
output-key: a word
output-values: a list of counts
Reduce (output-key, values-list)
    int result = 0
    for each v in values-list
        result += v
    produce (output-key, result)
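The pseudo code above can be tried out on a single machine with a small Python simulation. This is a sketch under stated assumptions: the function names map_phase, shuffle_and_sort, and reduce_phase are made up for illustration, and the shuffle step that Hadoop performs automatically across machines is imitated here with an in-memory dictionary.

```python
from collections import defaultdict


def map_phase(doc_name, doc_content):
    """Mapper: emit (word, 1) for every word in the document."""
    for word in doc_content.split():
        yield (word, 1)


def shuffle_and_sort(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(word, counts):
    """Reducer: sum the list of counts for one word."""
    return (word, sum(counts))


documents = {"doc1": "Hello world", "doc2": "Hello Hadoop Hello"}
mapped = [pair for name, text in documents.items()
          for pair in map_phase(name, text)]
result = [reduce_phase(word, counts)
          for word, counts in shuffle_and_sort(mapped)]
print(result)  # [('Hadoop', 1), ('Hello', 3), ('world', 1)]
```

Note that every ("Hello", 1) pair ends up in the same group regardless of which document produced it, which is what allows a single reducer call to compute the full count.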
Now let us see how the pseudo code works with a detailed program (Fig. 4.9).
Hands-on Exercise:
For the purpose of this example, we make use of Cloudera Virtual Machine
(VM) distributable to demonstrate different hands-on exercises. You can download
a free copy of Cloudera VM from the Cloudera website and it comes prepackaged with
Hadoop. When you launch Cloudera VM the first time, close the Internet browser
and you will see the desktop that looks like Fig. 4.10.
This is a simulated Linux computer, which we can use as a controlled envi-
ronment to experiment with Python, Hadoop, and some other big data tools. The
platform is CentOS Linux, which is related to RedHat Linux. Cloudera is one major
distributor of Hadoop. Others include Hortonworks, MapR, IBM, and Teradata.
Once you have launched Cloudera, open the command-line terminal from the
menu (Fig. 4.11): Accessories → System Tools → Terminal
At the command prompt, you can enter Unix commands and hit ENTER to
execute them one at a time. We assume that the reader is familiar with basic Unix
commands or is able to read up about them in tutorials on the web. The prompt itself
may give you some helpful information.
A version of Python is already installed in this virtual machine. In order to
determine version information, type the following:
python -V
Now, you will need a file of text. You can make one quickly by dumping the
“help” text of a program you are interested in:
hadoop --help > hadoophelp.txt
To pipe your file into the word count program, use the following:
cat hadoophelp.txt | ./wordcount.py
The “./” is necessary here. It means that wordcount.py is located in your current
working directory.
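The listing of wordcount.py is not reproduced in this text, so the following is only a guess at what such a script might do: count whitespace-separated words and print one word and its count per line. The function names count_words and format_counts are hypothetical, and a small in-memory sample stands in for the piped input.

```python
from collections import Counter


def count_words(lines):
    """Count how often each whitespace-separated word appears in the lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def format_counts(counts):
    """Render one 'word<TAB>count' line per word, in alphabetical order."""
    return ["{0}\t{1}".format(word, counts[word]) for word in sorted(counts)]


sample = ["usage: hadoop command", "hadoop fs run a generic filesystem client"]
for line in format_counts(count_words(sample)):
    print(line)
```

To use the same logic in a pipeline such as the one above, the script would pass sys.stdin (after import sys) to count_words instead of the sample list, start with a line such as #!/usr/bin/env python, and be made executable.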
Now make a slightly longer pipeline so that you can sort it and read it all on one
screen at a time:
cat hadoophelp.txt | ./wordcount.py | sort | less
With the above program we got an idea of how to run Python scripts.
Now let us implement the code we have discussed earlier using MapReduce.
We implement a Map Program and a Reduce program. In order to do this, we will
need to make use of a utility that comes with Hadoop—Hadoop Streaming. Hadoop
streaming is a utility that comes packaged with the Hadoop distribution and allows
MapReduce jobs to be created with any executable as the mapper and/or the reducer.
The Hadoop streaming utility enables Python, shell scripts, or any other language to
be used as a mapper, reducer, or both. The Mapper and Reducer are both executables
that read input, line by line, from the standard input (stdin), and write output to the
standard output (stdout). The Hadoop streaming utility creates a MapReduce job,
submits the job to the cluster, and monitors its progress until it is complete. When the
mapper is initialized, each map task launches the specified executable as a separate
process. The mapper reads the input file and presents each line to the executable
via stdin. After the executable processes each line of input, the mapper collects the
output from stdout and converts each line to a key–value pair. The key consists of
the part of the line before the first tab character, and the value consists of the part
of the line after the first tab character. If a line contains no tab character, the entire
line is considered the key and the value is null. When the reducer is initialized, each
reduce task launches the specified executable as a separate process. The reducer
converts the input key–value pair to lines that are presented to the executable via
stdin. The reducer collects the executable’s results from stdout and converts each line
to a key–value pair. Similar to the mapper, the executable specifies key–value pairs
by separating the key and value by a tab character.
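The tab-splitting rule described above can be written as a small helper. This sketch only imitates the rule as stated in the text (key before the first tab, value after it, and a null value when there is no tab); the function name parse_streaming_line is hypothetical and is not part of Hadoop streaming itself.

```python
def parse_streaming_line(line):
    """Split one line of streaming output into (key, value) at the first tab."""
    line = line.rstrip("\n")
    key, tab, value = line.partition("\t")
    if not tab:
        # No tab character: the whole line is the key and the value is null.
        return line, None
    return key, value


print(parse_streaming_line("jack\t1"))      # ('jack', '1')
print(parse_streaming_line("a\tb\tc"))      # ('a', 'b\tc')
print(parse_streaming_line("no tab here"))  # ('no tab here', None)
```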
A Python Example:
To demonstrate how the Hadoop streaming utility can run Python as a MapRe-
duce application on a Hadoop cluster, the WordCount application can be imple-
mented as two Python programs: mapper.py and reducer.py. The code in mapper.py
is the Python program that implements the logic in the map phase of WordCount.
It reads data from stdin, splits the lines into words, and outputs each word with its
intermediate count to stdout. The code below implements the logic in mapper.py.
Example mapper.py:
#!/usr/bin/env python
import sys

# Read each line from stdin
for line in sys.stdin:
    # Get the words in each line
    words = line.split()
    # Generate the count for each word
    for word in words:
        # Write the key-value pair to stdout: word, tab, count of 1
        print('{0}\t{1}'.format(word, 1))
Once you have saved the above code in mapper.py, change the permissions of the
file by issuing
chmod a+x mapper.py
Finally, type the following. This will serve as our input. The “echo” command below
simply prints on screen the input that has been provided to it. In this case, it will print
“jack be nimble jack be quick”
echo "jack be nimble jack be quick"
Before attempting to execute the code, ensure that the mapper.py and reducer.py
files have execution permission. The following command will enable this for both
files:
chmod a+x mapper.py reducer.py
Also ensure that the first line of each file contains the proper path to Python.
This line enables mapper.py and reducer.py to execute as stand-alone executables.
The value #! /usr/bin/env python should work for most systems, but if it does not,
replace /usr/bin/env python with the path to the Python executable on your system.
To test the Python programs locally before running them as a MapReduce job,
they can be run from within the shell using the echo and sort commands. It is highly
recommended to test all programs locally before running them across a Hadoop
cluster.
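The local test described above (echo, then the mapper, then sort, then the reducer) can also be imitated inside a single Python session. The listing of reducer.py is not reproduced in this text, so run_reducer below is only a sketch of what a word count reducer could look like; both function names are hypothetical, and Python's sorted() stands in for the shell's sort command.

```python
from itertools import groupby


def run_mapper(lines):
    """Mimic mapper.py: emit 'word<TAB>1' for every word."""
    for line in lines:
        for word in line.split():
            yield "{0}\t{1}".format(word, 1)


def run_reducer(sorted_lines):
    """Sketch of a reducer: sum the counts of consecutive lines sharing a key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield "{0}\t{1}".format(word, total)


text = ["jack be nimble jack be quick"]
mapped = list(run_mapper(text))
shuffled = sorted(mapped)  # stands in for the shell's sort
for line in run_reducer(shuffled):
    print(line)
# Prints four tab-separated lines: be 2, jack 2, nimble 1, quick 1
```

The reducer relies on its input being sorted by key, which is why the sort step sits between the mapper and the reducer both here and on the cluster.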
Once the mapper and reducer programs are executing successfully against tests,
they can be run as a MapReduce application using the Hadoop streaming utility.
The command to run the Python programs mapper.py and reducer.py on a Hadoop
cluster is as follows:
/usr/bin/hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py -reducer reducer.py \
    -input /frost.txt -output /output
The options used with the Hadoop streaming utility are listed in Table 4.2.
A key challenge in MapReduce programming is thinking about a problem in
terms of map and reduce steps. Most of us are not trained to think naturally in
these terms. To build familiarity with MapReduce
programming, the exercises provided at the end of the chapter help in developing the
discipline.
Now that we have covered MapReduce programming, let us move to the
final core component of Hadoop: the resource manager, or YARN.
3.6 YARN
3.7 Spark
In the previous sections, we have focused solely on Hadoop, one of the first and
most widely used big data solutions. Hadoop was introduced to the world in 2005
and it quickly captured attention of organizations and individuals because of the
relative simplicity it offered for doing big data processing. Hadoop was primarily
designed for batch processing applications, and it performed a great job at that.
While Hadoop was very good with sequential data processing, users quickly started
realizing some of the key limitations of Hadoop. Two primary limitations were
difficulty in programming, and limited ability to do anything other than sequential
data processing.
In terms of programming, although Hadoop greatly simplified the process of
allocating resources and monitoring a cluster, somebody still had to write programs
in the MapReduce framework that would contain business logic and would be
executed on the cluster. This posed a couple of challenges: first, a user needed
good programming skills (for example, in the Java programming language) to be able to write
code; and second, business logic had to be broken down in the MapReduce
way of programming (i.e., thinking of a programming problem in terms of mappers
and reducers). This quickly became a limiting factor for nontrivial applications.
Second, because of the sequential processing nature of Hadoop, every time a user
queried a certain portion of the data, Hadoop would go through the entire dataset
sequentially to answer the query. This in turn implied long waiting times
for the results of even simple queries. Database users are accustomed to ad hoc data
querying, and this was a limiting factor for many of their business needs.
Additionally, there was a growing demand for big data applications that would
leverage concepts of real-time data streaming and machine learning on data at a
big scale. Since MapReduce was not primarily designed for such applications, the
alternative was to use other technologies such as Mahout or Storm for specialized
processing needs. The need to learn individual systems for specialized needs was a
limitation for organizations and developers alike.
Recognizing these limitations, researchers at the University of California, Berkeley’s
AMP Lab came up with a new project in 2012 that was later named Spark.
The idea of Spark was in many ways similar to what Hadoop offered, that is, a
stable, fast, and easy-to-use big data computational framework, but with several key
features that would make it better suited to overcome the limitations of Hadoop and to
also leverage features that Hadoop had previously ignored, such as the use of memory
for storing data and performing computations.
Spark has since become one of the hottest big data technologies and
has quickly become one of the mainstream projects in big data computing. An
increasing number of organizations are either using or planning to use Spark for
their project needs. At the same time, while Hadoop continues to enjoy the
leader’s position in big data deployments, Spark is quickly replacing MapReduce as
the computational engine of choice. Let us now look at some of the key features of
Spark that make it a versatile and powerful big data computing engine:
Ease of Usage: One of the key limitations of MapReduce was the requirement
to break down a programming problem in terms of mappers and reducers. While it
was fine to use MapReduce for trivial applications, it was not a very easy task to
implement mappers and reducers for nontrivial programming needs.
Spark overcomes this limitation by providing an easy-to-use programming
interface that in many ways is similar to what we would experience in a
programming language such as R or Python. This is achieved
by abstracting away the requirement of mappers and reducers and replacing it
with a number of operators (or functions) that are available to users through an
API (application programming interface). There are currently more than 80 functions that
Spark provides. These functions make writing code simple, as users simply
call them to get the programming done.
Another side effect of a simple API is that users do not have to write a lot of
boilerplate code, as was necessary with mappers and reducers. This makes programs
concise, requiring fewer lines of code than MapReduce. Such programs
are also easier to understand and maintain.
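To give a feel for the operator-based style described above, the sketch below expresses word count as a chain of function calls. It deliberately does not use Spark: ToyDataset is a hypothetical, single-machine stand-in written only for this illustration, and its method names (flat_map, map, reduce_by_key, collect) are assumptions chosen to suggest the flavor of such an API rather than to reproduce it.

```python
class ToyDataset:
    """A tiny in-memory stand-in for a distributed dataset (not Spark's API)."""

    def __init__(self, items):
        self.items = list(items)

    def flat_map(self, func):
        """Apply func to each item and flatten the resulting sequences."""
        return ToyDataset(out for item in self.items for out in func(item))

    def map(self, func):
        """Apply func to each item."""
        return ToyDataset(func(item) for item in self.items)

    def reduce_by_key(self, func):
        """Merge the values of (key, value) pairs that share a key."""
        merged = {}
        for key, value in self.items:
            merged[key] = func(merged[key], value) if key in merged else value
        return ToyDataset(merged.items())

    def collect(self):
        """Return the items as a sorted list."""
        return sorted(self.items)


lines = ToyDataset(["jack be nimble", "jack be quick"])
counts = (lines.flat_map(str.split)
               .map(lambda word: (word, 1))
               .reduce_by_key(lambda a, b: a + b)
               .collect())
print(counts)  # [('be', 2), ('jack', 2), ('nimble', 1), ('quick', 1)]
```

The whole computation fits in one chained expression, with no separate mapper and reducer programs to write, which is the conciseness the text refers to.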
In-memory computing: Perhaps one of the most talked about features of Spark
is in-memory computing. It is largely due to this feature that Spark is considered
up to 100 times faster than MapReduce (although this depends on several factors
such as computational resources, data size, and type of algorithm). Because of
the tremendous improvements in execution speed, Spark can handle the types of
applications where turnaround time needs to be small and speed of execution is
important. This implies that big data processing can be done on a near real-time
scale as well as for interactive data analysis.
This increase in speed is primarily made possible due to two reasons. The first
one is known as in-memory computing, and the second one is use of an advanced
execution engine. Let us discuss each of these features in more detail.
In-memory computing is one of the most talked about features of Spark that sets
it apart from Hadoop in terms of execution speed. While Hadoop primarily uses
the hard disk for data reads and writes, Spark makes use of the main memory (RAM)
of each individual computer to store intermediate data and computations. So while
data resides primarily on hard disks and is read from the hard disk the first
time, for any subsequent access it is kept in the computer’s RAM. Accessing
data from RAM can be on the order of 100 times faster than accessing it from hard disk, and for large
data processing this results in a lot of time saved in data access. While
this difference would not be noticeable for small datasets (ranging from a few KB to a
few MB), as soon as we start moving into the realm of big data processing (terabytes
or more), the speed differences become apparent. This allows data processing
to be done in a matter of minutes or hours for something that used to take days on
MapReduce.
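The benefit of keeping data in memory across repeated passes can be illustrated with a toy experiment that counts "disk reads" rather than measuring time. The class CountingSource and the two helper functions are hypothetical and exist only for this sketch; real systems involve far more than a read counter.

```python
class CountingSource:
    """Pretend disk-backed dataset that counts how often it is read."""

    def __init__(self, records):
        self._records = list(records)
        self.disk_reads = 0

    def read(self):
        self.disk_reads += 1
        return list(self._records)


def iterate_without_cache(source, passes):
    """Re-read the data from 'disk' on every pass."""
    return [sum(source.read()) for _ in range(passes)]


def iterate_with_cache(source, passes):
    """Read once, keep the data in memory, and reuse it on later passes."""
    cached = source.read()
    return [sum(cached) for _ in range(passes)]


uncached = CountingSource(range(1000))
iterate_without_cache(uncached, passes=5)
cached = CountingSource(range(1000))
iterate_with_cache(cached, passes=5)
print(uncached.disk_reads, cached.disk_reads)  # 5 1
```

Both variants compute the same results; only the number of trips to the slow storage layer differs, and that gap grows with the number of passes over the data.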
The second feature of Spark that makes it fast is implementation of an advanced
execution engine. The execution engine is responsible for dividing an application
into individual stages of execution such that the application executes in a time-
efficient manner. In the case of MapReduce, every application is divided into a
sequence of mappers and reducers that are executed in sequence. Because of the
sequential nature, optimization that can be done in terms of code execution is very
limited. Spark, on the other hand, does not impose any restriction of writing code in
terms of mappers and reducers. This essentially means that the execution engine of
Spark can divide a job into multiple stages and can run in a more optimized manner,
hence resulting in faster execution speeds.
Scalability: Similar to Hadoop, Spark is highly scalable. Computation resources
such as CPU, memory, and hard disks can be added to existing Spark clusters at
any time as data needs grow, and Spark can scale itself very easily. This suits
organizations well, as they do not have to pre-commit to an infrastructure and can
instead increase or decrease it dynamically depending on their business needs. From
a developer’s perspective, this is also an important feature, as they do not
have to make any changes to their code as the cluster scales; essentially their code
is independent of the cluster size.
Fault Tolerance: This feature of Spark is also similar to Hadoop. When a large
number of machines are connected together in a network, it is likely that some of
the machines will fail. Because of the fault tolerance feature of Spark, however, such
failures do not affect code execution. If certain machines fail during code
execution, Spark simply allocates those tasks to be run on another set of machines
in the cluster. This way, an application developer never has to worry about the state
of machines in a cluster and can rather focus on writing business logic for their
organization’s needs.
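The reallocation of tasks away from failed machines can be sketched as a simple assignment loop. This is a toy model under stated assumptions: the function run_with_failover, the worker names, and the is_alive check are all hypothetical, and the sketch says nothing about how Spark actually detects failures or schedules work.

```python
def run_with_failover(tasks, workers, is_alive):
    """Assign each task to the first live worker, skipping failed machines."""
    assignments = {}
    for index, task in enumerate(tasks):
        # Start from a different worker for each task to spread the load.
        offset = index % len(workers)
        candidates = workers[offset:] + workers[:offset]
        for worker in candidates:
            if is_alive(worker):
                assignments[task] = worker
                break
        else:
            raise RuntimeError("no live worker available for " + task)
    return assignments


workers = ["w1", "w2", "w3"]
failed = {"w2"}
plan = run_with_failover(["t1", "t2", "t3", "t4"], workers,
                         is_alive=lambda worker: worker not in failed)
print(plan)  # {'t1': 'w1', 't2': 'w3', 't3': 'w3', 't4': 'w1'}
```

The caller never mentions which machine is down; the task meant for the failed worker w2 simply lands on the next live one.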
Overall, Spark is a general purpose in-memory computing framework that
can easily handle programming needs such as batch processing of data, iterative
algorithms, real-time data analysis, and ad hoc data querying. Spark is easy to use
because of a simple API that it provides and is orders of magnitude faster than
traditional Hadoop applications because of in-memory computing capabilities.
Spark Applications
Since Spark is a general purpose big data computing framework, Spark can be
used for all of the types of applications that MapReduce environment is currently
suited for. Most of these applications are of the batch processing type. However,
because of the unique features of Spark such as speed and ability to handle iterative
processing and real-time data, Spark can also be used for a range of applications
that were not well suited for MapReduce framework. Additionally, because of the
speed of execution, Spark can also be used for ad hoc querying or interactive data
analysis. Let us briefly look at each of these application categories.
Iterative Data Processing
Iterative applications are those types of applications where the program has
to loop or iterate through the same dataset multiple times in a recursive fashion
In terms of the types of cluster managers that Spark can work with, there are
currently three cluster manager modes. Spark can either work in stand-alone mode
(or single machine mode in which a cluster is nothing but a single machine), or it
can work with YARN (the resource manager that ships with Hadoop), or it can work
with another resource manager known as Mesos. YARN and Mesos are capable of
working with multiple nodes together compared to a single machine stand-alone
mode.
Worker
A worker is similar to a node in Hadoop. It is the actual machine/computer that
provides computing resources such as CPU, RAM, and storage for execution of Spark
programs. A Spark program is run among a number of workers in a distributed
fashion.
Executor
These days, a typical computer comes with advanced configurations such as
multi core processors, 1 TB storage, and several GB of memory as standard. Since
any application at a point of time might not require all of the resources, potentially
many applications can be run on the same worker machine by properly allocating
resources for individual application needs. This is done through a JVM (Java Virtual
Machine) that is referred to as an executor. An executor is nothing but a dedicated
virtual machine within each worker machine on which an application executes.
These JVMs are created as the application need arises, and are terminated once the
application has finished processing. A worker machine can have potentially many
executors.
Task
As the name suggests, a task is an individual unit of work that is performed. This
work request would be specified by the executor. From a user’s perspective, they
do not have to worry about the number of executors and the division of program code
into tasks. That is taken care of by Spark during execution. The idea of having
these components is essentially to execute the application optimally by
distributing it across multiple threads and by dividing the application logically into
small tasks (Fig. 4.15).
Like Hadoop, the Spark ecosystem has a number of components, some of which are
actively developed and improved. In the next section, we discuss six
components that empower the Spark ecosystem (Fig. 4.16).
[Figure: a driver program connected to three worker nodes; each worker node hosts an executor, and executors run tasks]
Spark Core
• Spark Core uses a fundamental data structure called the RDD (Resilient Distributed
Dataset), which partitions data across all the nodes in a cluster and holds each
partition in memory as a unit for computations.
• The RDD is an abstraction that Spark exposes through language-integrated APIs;
Spark programs can be written in Python, Scala, Java, or R, and queried with SQL.
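To make the idea of partitioning concrete, the following is a minimal pure-Python sketch. It is not the Spark API: the helper names `partition`, `map_partitions`, and `reduce_partitions` are hypothetical and only imitate how a dataset might be split across nodes, transformed partition by partition, and then combined.

```python
from functools import reduce

def partition(data, num_partitions):
    """Split a list into num_partitions chunks (round-robin assignment)."""
    return [data[i::num_partitions] for i in range(num_partitions)]

def map_partitions(partitions, fn):
    """Apply fn to every element, one partition at a time."""
    return [[fn(x) for x in part] for part in partitions]

def reduce_partitions(partitions, fn):
    """Reduce each partition locally, then combine the partial results."""
    partials = [reduce(fn, part) for part in partitions if part]
    return reduce(fn, partials)

numbers = list(range(1, 11))                       # the values 1..10
parts = partition(numbers, 3)                      # three pretend "nodes"
squared = map_partitions(parts, lambda x: x * x)   # transformation per partition
total = reduce_partitions(squared, lambda a, b: a + b)

print(parts)
print(total)
```

Each partition is processed independently, which is what would allow the work to be spread over several machines; only the small partial results need to be brought together at the end.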
4 Big Data Management 101
Spark SQL
• Similar to Hive, Spark SQL allows users to write SQL queries that are then
translated into Spark programs and executed. Although Spark is easy to program,
not everyone is comfortable with programming. SQL, on the other hand, is an
easier language in which to learn and write commands. Spark SQL brings the power
of SQL to Spark.
• SQL queries can be run on Spark datasets known as DataFrames. A DataFrame is
similar to a spreadsheet or a relational database table, and users can
specify its schema and structure.
• Users can also interface their query outputs to visualization tools such as Tableau
for ad hoc and interactive visualizations.
Spark Streaming
• While big data systems were initially intended for batch processing of data,
the need for real-time processing of data at scale rose quickly. Hadoop as
a platform is great for sequential or batch processing of data but is not designed
for real-time data processing.
• Spark Streaming is the component of Spark that allows users to take real-time
streams of data and analyze them in near real time. Latency can be as low
as 1 s.
• Real-time data can be taken from a multitude of sources such as TCP/IP
connections, sockets, Kafka, Flume, and other message systems.
• Data once processed can be then output to users interactively or stored in HDFS
and other storage systems for later use.
• Spark streaming is based on Spark core, so the features of fault tolerance and
scalability are also available for Spark Streaming.
• On top of it even machine learning and graph processing algorithms can be
applied on real-time streams of data.
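Near real-time processing of this kind is commonly achieved by cutting the incoming stream into small batches (micro-batches) and processing each batch as it closes. The sketch below illustrates that idea in plain Python only; it does not use Spark, and the function name `micro_batches` and the sample events are hypothetical.

```python
from collections import Counter, defaultdict

def micro_batches(events, batch_seconds=1.0):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batch_id = int(timestamp // batch_seconds)
        batches[batch_id].append(value)
    # Return the batches in time order
    return [batches[b] for b in sorted(batches)]

# Hypothetical stream: (seconds since start, word seen)
events = [(0.1, "spark"), (0.7, "hadoop"), (1.2, "spark"),
          (1.9, "spark"), (2.4, "kafka")]

# Count words separately within each one-second batch
batch_counts = [Counter(batch) for batch in micro_batches(events)]
for i, counts in enumerate(batch_counts):
    print(f"batch {i}: {dict(counts)}")
```

With a one-second batch interval, results for an event become available roughly one second after it arrives, which matches the order of latency mentioned above.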
Spark MLlib
• Spark’s MLlib library provides a range of machine learning and statistical
functions that can be applied to big data. Common examples include functions
for regression, classification, clustering, dimensionality reduction, and
collaborative filtering.
GraphX
• GraphX is the Spark component for graph and connected-dataset computations; it
lets users express such computations directly rather than through complex SQL
queries.
• A GraphFrame is the data structure that contains a graph and is an extension
of the DataFrame discussed above. It relates datasets through vertices and edges,
producing clear and expressive data collections for analysis.
Here are a few sample programs in Python for a better understanding of Spark
RDDs and DataFrames.
$python rddtest.py
A sample Dataframe Program:
Consider a word count example.
Source file: input.txt
Content in file:
Business of Apple continues to grow with success of iPhone X. Apple is poised
to become a trillion-dollar company on the back of strong business growth of iPhone
X and also its cloud business.
Create a file named dftest.py, place the code shown below in it, and run it from the
Unix prompt. In the code below, we first read a text file named “input.txt” using the
textFile function of Spark and create an RDD named text. We then run the map function,
which takes each row of the “text” RDD, and convert the result to a DataFrame that can
be processed using Spark SQL. Next, we use the filter() function to keep only those
observations that contain the term “business.” Finally, search_word.count() returns
the number of observations in the search_word DataFrame.
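from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import col

# Assumes Spark 2.x or later; creates the SparkContext used as sc below
spark = SparkSession.builder.appName("dftest").getOrCreate()
sc = spark.sparkContext

# Read the file as an RDD of lines (adjust the path to your HDFS location)
text = sc.textFile("hdfs://input.txt")
# Create a DataFrame with a single column named "line"
df = text.map(lambda r: Row(r)).toDF(["line"])
# Keep only the lines that contain the term "business" (case-sensitive)
search_word = df.filter(col("line").like("%business%"))
# Count the lines that mention "business"
print(search_word.count())
# Fetch the lines that mention "business"
print(search_word.collect())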
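To see what the filter is expected to return without a Spark cluster, the same logic can be sketched in plain Python. This is only an approximation: it assumes input.txt holds the sample text split into three lines exactly as printed above (the real file layout is not stated), and it relies on the fact that a SQL LIKE pattern such as %business% is a case-sensitive substring match.

```python
# Assumed layout of input.txt: three lines, broken as printed in the text.
lines = [
    "Business of Apple continues to grow with success of iPhone X. Apple is poised",
    "to become a trillion-dollar company on the back of strong business growth of iPhone",
    "X and also its cloud business.",
]

# Equivalent of like("%business%"): a case-sensitive substring match per line.
matching = [line for line in lines if "business" in line]

# Case-insensitive variant, which also catches the capitalised "Business".
matching_any_case = [line for line in lines if "business" in line.lower()]

print(len(matching))
print(len(matching_any_case))
```

Under this assumed layout the first line is skipped by the case-sensitive match because it contains only the capitalised “Business”; lowercasing the text before matching would include it.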
From the earlier sections of this chapter we understand that data is growing at
an exponential rate, and organizations can benefit from the analysis of big data
in making right decisions that positively impact their business. One of the most
common issues that organizations face is with the storage of big data as it requires
quite a bit of investment in hardware, software packages, and personnel to manage
the infrastructure for big data environment. This is exactly where cloud computing
helps solve the problem by providing a set of computing resources that can be
shared through the Internet. The shared resources in a cloud computing platform
include storage solutions, computational units, networking solutions, development
and deployment tools, software applications, and business processes.
A cloud computing environment saves an organization costs, especially those
related to infrastructure, platform, and software applications, by providing a
framework that can be optimized and expanded horizontally. Consider an organization
that needs additional computational resources, storage capability, or software: it
pays only for the additional resources it acquires and only for the time it uses
them. This feature of the cloud computing environment is known as elasticity. It
frees organizations from worrying about overutilization or underutilization of
infrastructure, platform, and software resources.
Figure 4.17 depicts how various services that are typically used in an organization
setup can be hosted in cloud and users can connect to those services using various
devices.
Features of Cloud Computing Environments
The salient features of cloud computing environment are the following:
• Scalability: It is very easy for organizations to add an additional resource to an
existing infrastructure.
• Elasticity: Organizations can hire additional resources on demand and pay only
for the usage of those resources.
• Resource Pooling: Multiple departments of an organization or organizations
working on similar problems can share resources as opposed to hiring the
resources individually.
• Self-service: Most cloud computing vendors provide simple, easy-to-use
interfaces that let users request the services they want without help from
specialized staff.
• Low Cost: Careful planning, usage, and management of resources help reduce the
costs significantly compared to organizations owning the hardware. Remember,
organizations owning the hardware will also need to worry about maintenance
and depreciation, which is not the case with cloud computing.
• Fault Tolerance: Most of the cloud computing environments provide a feature of
shifting the workload to other components in case of failure so the service is not
interrupted.
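The cost argument behind elasticity can be illustrated with a small calculation. The sketch below is a deliberately simplified model with hypothetical numbers: it assumes the same hourly rate per server in both cases and ignores purchase price, maintenance, and depreciation, so it shows only the effect of paying for peak capacity versus paying for actual usage.

```python
def owned_cost(peak_servers, cost_per_server_hour, hours):
    """Owning hardware: pay for peak capacity for every hour, used or not."""
    return peak_servers * cost_per_server_hour * hours

def on_demand_cost(servers_needed_per_hour, cost_per_server_hour):
    """Elastic cloud: pay only for the servers actually used each hour."""
    return sum(n * cost_per_server_hour for n in servers_needed_per_hour)

# Hypothetical demand over one day: 2 servers normally, 10 during a 4-hour peak.
demand = [2] * 20 + [10] * 4
rate = 1.0  # hypothetical cost of one server for one hour (same in both cases)

fixed = owned_cost(max(demand), rate, len(demand))
elastic = on_demand_cost(demand, rate)

print(fixed, elastic)
```

The gap between the two figures comes entirely from capacity that sits idle outside the peak; the more uneven the demand, the larger the saving from paying per use.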
Fig. 4.18 Sharing of responsibilities based on different cloud delivery models. (Source: https://
www.hostingadvice.com/how-to/iaas-vs-paas-vs-saas/ (accessed on Aug 10, 2018))
data storage and data warehouse solution), and RDS (a relational database service
that provides instances of relational databases such as MySQL running out of the
box).
Google Cloud Platform: Google Cloud Platform (GCP) is a cloud services
offering by Google that competes directly with the services provided by AWS.
Although its current offerings are not as diverse as those of AWS, GCP is
quickly closing the gap by providing the bulk of the services that users can get on AWS.
A key benefit of using GCP is that the services are provided on the same platform
that Google uses for its own product development and service offerings, which makes
it very robust. At the same time, Google provides many of the services
either free or at a very low cost, making GCP a very attractive alternative.
The main components provided by GCP are Google Compute Engine (providing
services similar to Amazon EC2, EMR, and other computing solutions offered by
Amazon), Google BigQuery, and Google Prediction API (which provides machine
learning algorithms for end users).
Microsoft Azure: Microsoft Azure is similar in terms of offerings to AWS listed
above. It offers a plethora of services in terms of both hardware (computational
power, storage, networking, and content delivery) and software (software as a
service, and platform as a service) infrastructure for organizations and individuals to
build, test, and deploy their offerings. Additionally, Azure has made available
some of the pioneering machine learning APIs from Microsoft under the umbrella
of Microsoft Cognitive Toolkit. This makes it very easy for organizations to leverage
the power of machine learning and deep learning and to apply it to their datasets.
Hadoop Useful Commands for Quick Reference
Caution: Before trying any of these commands, create your own folders and practice
there. Running the commands against the paths shown below may directly affect the
Hadoop system.
Command                                                  Description
hdfs dfs -ls /                                           Lists all the files/directories for the given HDFS destination path
hdfs dfs -ls -d /hadoop                                  Lists directories as plain files; in this case, lists the details of the hadoop folder
hdfs dfs -ls -h /data                                    Formats file sizes in a human-readable fashion (e.g., 64.0m instead of 67108864)
hdfs dfs -ls -R /hadoop                                  Recursively lists all files in the hadoop directory and all of its subdirectories
hdfs dfs -ls /hadoop/dat*                                Lists all the files matching the pattern; in this case, all files inside the hadoop directory that start with ‘dat’
hdfs dfs -text /hadoop/derby.log                         Takes a file as input and outputs it in text format on the terminal
hdfs dfs -cat /hadoop/test                               Displays the content of the HDFS file test on stdout
hdfs dfs -appendToFile /home/ubuntu/test1 /hadoop/text2  Appends the content of the local file test1 to the HDFS file text2
hdfs dfs -cp /hadoop/file1 /hadoop1                      Copies a file from source to destination on HDFS; in this case, copies file1 from the hadoop directory to the hadoop1 directory
hdfs dfs -cp -p /hadoop/file1 /hadoop1                   Copies a file from source to destination on HDFS; -p preserves access and modification times, ownership, and the mode
hdfs dfs -cp -f /hadoop/file1 /hadoop1                   Copies a file from source to destination on HDFS; -f overwrites the destination if it already exists
hdfs dfs -mv /hadoop/file1 /hadoop1                      Moves files matching the specified pattern to a destination; when several files are moved, the destination must be a directory
hdfs dfs -rm /hadoop/file1                               Deletes the file (sends it to the trash)
hdfs dfs -rmr /hadoop                                    Like -rm, but deletes files and directories recursively (deprecated in favor of -rm -r)
hdfs dfs -rm -skipTrash /hadoop                          Like -rm, but deletes the file immediately instead of sending it to the trash
hdfs dfs -rm -f /hadoop                                  If the file does not exist, neither displays a diagnostic message nor modifies the exit status to reflect an error
hdfs dfs -rmdir /hadoop1                                 Deletes a directory (the directory must be empty)
hdfs dfs -mkdir /hadoop2                                 Creates a directory in the specified HDFS location
hdfs dfs -mkdir -p /hadoop2                              Creates a directory in the specified HDFS location, including any missing parent directories; does not fail even if the directory already exists
hdfs dfs -touchz /hadoop3                                Creates a file of zero length at <path> with the current time as its timestamp
hdfs dfs -df /hadoop                                     Reports the overall capacity, available space, and used space of the filesystem
hdfs dfs -df -h /hadoop                                  Same as -df, with sizes formatted in a human-readable fashion
hdfs dfs -du /hadoop/file                                Shows the amount of space, in bytes, used by the files that match the specified file pattern
hdfs dfs -du -s /hadoop/file                             Shows the total (summary) size rather than the size of each individual file that matches the pattern
hdfs dfs -du -h /hadoop/file                             Shows the space used by the files that match the specified file pattern, formatted in a human-readable fashion
Source: https://linoxide.com/linux-how-to/hadoop-commands-cheat-sheet/ (accessed on Aug 10,
2018)
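When these commands are scripted rather than typed, it can help to build and inspect the command line before running it. The following is a minimal sketch under stated assumptions: the helper names `hdfs_cmd` and `run_hdfs` and the practice path are hypothetical, only flags listed in the table are used, and nothing is executed unless `dry_run` is set to `False` on a machine with a configured Hadoop client.

```python
import subprocess

def hdfs_cmd(action, *paths, flags=()):
    """Build the argument list for `hdfs dfs -<action> [flags] <paths>`."""
    return ["hdfs", "dfs", f"-{action}", *flags, *paths]

def run_hdfs(action, *paths, flags=(), dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    cmd = hdfs_cmd(action, *paths, flags=flags)
    print(" ".join(cmd))
    if not dry_run:
        # Requires a configured Hadoop client on the PATH.
        result = subprocess.run(cmd, check=True, capture_output=True, text=True)
        return result.stdout
    return None

# Hypothetical practice folder, in the spirit of the caution above.
practice = "/user/student/practice"
run_hdfs("mkdir", practice)
run_hdfs("ls", practice, flags=["-h"])
run_hdfs("du", practice, flags=["-s", "-h"])
```

Keeping the dry run as the default makes it harder to run a destructive command such as a delete against a shared path by accident.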
All the datasets, code, and other material referred to in this section are available at
www.allaboutanalytics.net.
Exercises
These exercises are based on either MapReduce or Spark and would ask users to
code the programming logic using any of the programming languages they are
comfortable with (preferably, but not limited to Python) in order to run the code
on a Hadoop cluster:
Ex. 4.1 Write code that takes a text file as input, and provides word count of each
unique word as output using MapReduce. Once you have done so, repeat the same
exercise using Spark. The text file is provided with the name apple_discussion.txt.
Ex. 4.2 Now, extend the above program to output only the most frequently occurring
word and its count (rather than all words and their counts). Attempt this first using
MapReduce and then using Spark. Compare the differences in programming effort
required to solve the exercise. (Hint: for MapReduce, you might have to think of
multiple mappers and reducers.)
Ex. 4.3 Consider the text file children_names.txt. This file contains three columns
(name, gender, and count) and provides data on how many children were given a specific
name in a given period. Using this text file, count the number of births by the first
letter of the name (not by the whole name). Next, repeat the same process but only for
females (exclude all males from the count).
Ex. 4.4 A common problem in data analysis is to do statistical analysis on datasets.
In order to do so using a programming language such as R or Python, we simply use
the in-built functions provided by those languages. However, MapReduce provides
no such functions and so a user has to write the programming logic using mappers
and reducers. In this problem, consider the dataset numbers_dataset.csv. It contains
5000 randomly generated numbers. Using this dataset, write code in MapReduce
to compute five summary statistics (mean, median, minimum, maximum, and
standard deviation).
Ex. 4.5 Once you have solved the above problem using MapReduce, attempt to do
the same using Spark. For this you can make use of the in-built functions that Spark
provides.
Ex. 4.6 Consider the two csv files: Users.csv, and UserProfile.csv. Each file contains
information about users belonging to a company, and there are 5000 such records
in each file. Users.csv contains following columns: FirstName, Surname, Gender,
ID. UserProfile.csv has following columns: City, ZipCode, State, EmailAddress,
Username, Birthday, Age, CCNumber, Occupation, ID. The common field in both
files is ID, which is unique for each user. Using the two datasets, merge them into a
single data file using MapReduce. Please remember that the column on which merge
would be done is ID.
Ex. 4.7 Once you have completed the above exercise using MapReduce, repeat the
same using Spark and compare the differences between the two platforms.
Further Reading
5 Reasons Spark is Swiss Army Knife of Data Analytics. Retrieved August 10, 2018, from https://datafloq.com/read/5-ways-apache-spark-drastically-improves-business/1191.
A secret connection between Big Data and Internet of Things. Retrieved August 10, 2018, from https://channels.theinnovationenterprise.com/articles/a-secret-connection-between-big-data-and-the-internet-of-things.
Big Data: Are you ready for blast off. Retrieved August 10, 2018, from http://www.bbc.com/news/business-26383058.
Big Data: Why CEOs should care about it. Retrieved August 10, 2018, from https://www.forbes.com/sites/davefeinleib/2012/07/10/big-data-why-you-should-care-about-it-but-probably-dont/#6c29f11c160b.
Hadoop and Spark Enterprise Adoption. Retrieved August 10, 2018, from https://insidebigdata.com/2016/02/01/hadoop-and-spark-enterprise-adoption-powering-big-data-applications-to-drive-business-value/.
How companies are using Big Data and Analytics. Retrieved August 10, 2018, from https://www.mckinsey.com/business-functions/mckinsey-analytics/our-insights/how-companies-are-using-big-data-and-analytics.
How Uber uses Spark and Hadoop to optimize customer experience. Retrieved August 10, 2018, from https://www.datanami.com/2015/10/05/how-uber-uses-spark-and-hadoop-to-optimize-customer-experience/.
Karau, H., Konwinski, A., Wendell, P., & Zaharia, M. (2015). Learning spark: Lightning-fast big data analysis. Sebastopol, CA: O’Reilly Media, Inc.
Reinsel, D., Gantz, J., & Rydning, J. (April 2017). Data age 2025: The evolution of data to life-critical. Don’t focus on big data; focus on the data that’s big. IDC White Paper. URL: https://www.seagate.com/www-content/our-story/trends/files/Seagate-WP-DataAge2025-March-2017.pdf.
The Story of Spark Adoption. Retrieved August 10, 2018, from https://tdwi.org/articles/2016/10/27/state-of-spark-adoption-carefully-considered.aspx.
White, T. (2015). Hadoop: The definitive guide (4th ed.). Sebastopol, CA: O’Reilly Media, Inc.
Why Big Data is the new competitive advantage. Retrieved August 10, 2018, from https://iveybusinessjournal.com/publication/why-big-data-is-the-new-competitive-advantage/.
Why Apache Spark is a crossover hit for data scientists. Retrieved August 10, 2018, from https://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/.
Chapter 5
Data Visualization
John F. Tripp
1 Introduction
J. F. Tripp
Clemson University, Clemson, SC, USA
e-mail: jftripp@clemson.edu
2 Motivating Example
Consider Fig. 5.1. When you see this graph, what do you believe is true about the
“level” of the variable represented by the line? Is the level greater or less at point 2
compared with point 1?
If you are like most people, you assume that the level of the variable at point 2
is greater than the level at point 1. Why? Because it has been ingrained in you from
childhood that when you stack something (blocks, rocks, etc.), the more you stack,
the higher the stack becomes. From a very early age, you learn that “up” means
“more.”
Now consider Fig. 5.2. Based on this graph, what happened to gun deaths after
2005?[1]
Upon initial viewing, the reader may be led to believe that the number of gun
deaths went down after 2005. However, look more closely: is this really what
happened? If you observe the axes, you will notice that the graph designer inverted
the y-axis, making larger values appear at lower points in the graph than smaller
values. While many readers may be able to perceive the shift in the y-axis, not all
will. For all readers, this is a fundamental violation of primal perceptive processes.
It is not merely a violation of an established convention; it is a violation of the
principles that drove the establishment of that convention.[2]
This example is simple, but it illustrates the need for data visualizers to be
well trained in understanding human perception and to work with the natural
understanding of visual stimuli.
1 The “Stand Your Ground” law in Florida enabled people to shoot attackers in self-defense without
2 This example was strongly debated across the Internet when it appeared. For more information
about the reaction to the graph, see the Reddit thread at http://bit.ly/2ggVV7V.
114 J. F. Tripp
As Playfair notes in the quote above, even after a great deal of study, humans
cannot make clear sense of the data provided in tables, nor can they retain it well.
Turning back to Table 5.1, most people note that sets 1–3 are similar, primarily
because the X values are the same, and appear in the same order. However, for
all intents and purposes, these four sets of numbers are statistically equivalent. If
a typical regression analysis was performed on the four sets of data in Table 5.1,
the results would be identical. Among other statistical properties, the following are
valid for all four sets of numbers in Table 5.1:
• N = 11.
• Mean of X = 9.0.
• Mean of Y = 7.5.
• Equation of regression line: Y = 3 + 0.5X.
• Standard error of regression slope = 0.118.
• Correlation coefficient = 0.82.
• r² = 0.67.
However, if one compares the sets of data visually, the differences between the
data sets become immediately obvious, as illustrated in Fig. 5.3.
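The shared statistics listed above can be checked numerically. The sketch below assumes that Table 5.1 is Anscombe's quartet, the classic four-set example whose published summary values match the ones listed; the data values are the standard published ones, and the function name `summarize` is hypothetical. Only the Python standard library is used.

```python
from math import sqrt

# Anscombe's quartet (assumed to be the data behind Table 5.1).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    1: (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(x, y):
    """Return means, least-squares line, and correlation for paired data."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return {"n": n, "mean_x": mean_x, "mean_y": mean_y,
            "slope": slope, "intercept": intercept, "r": r, "r2": r * r}

results = {k: summarize(x, y) for k, (x, y) in quartet.items()}
for k, s in results.items():
    print(k, {name: round(value, 2) for name, value in s.items()})
```

Rounded to two decimals, the four rows of output agree with one another, even though a scatterplot of each set looks entirely different.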
Tables of data are excellent when the purpose of the table is to allow the
user to look up specific values, and when the relationships between the values
are direct. However, when relationships between sets or groups of numbers are
intended for presentation, the human perception is much better served through
graphical representations. The goal of the remainder of this chapter is to provide
the reader with enough background into human visual perception and graphical
data representation to become an excellent consumer and creator of graphical data
visualizations.
Humans are bombarded by information from multiple sources and through multiple
channels. This information is gathered by our five senses, and processed by the
brain. However, the brain is highly selective about what it processes and humans
are only aware of the smallest fraction of sensory input. Much of sensory input
is simply ignored by the brain, while other input is dealt with based on heuristic
rules and categorization mechanisms; these processes reduce cognitive load. Data
visualizations, when executed well, aid in the reduction of cognitive load, and assist
viewers in the processing of cognitive evaluations.
When dealing with data visualization, likely the most important cognitive concept
to understand is working memory. Working memory is the part of short-term
For the remainder of this chapter, we will review and discuss six meta-rules for
data visualization. These are based upon the work of many researchers and writers,
and provide a short and concise summary of the most important practices regarding
data visualization. However, it is important to note that these rules are intended to
describe how we can attempt to represent data visually with the highest fidelity and
integrity—not argue that all are immutable, or that trade-offs between the rules may
not have to be made. The meta-rules are presented in Table 5.2.
By following these meta-rules, data visualizers will be more likely to graphically
display the actual effect shown in the data. However, there are specific times and
reasons when the visualizer may choose to violate these rules. Some of the reasons
may be, for example, the need to make the visualization “eye-catching,” such as
for an advertisement. In these cases, knowing the effect on perceptive ability of
breaking the rules is important so that the visualizer understands what is being
lost. However, in some cases, the reasons for violating the rules may be because
the visualizer wishes to intentionally mislead. Examples of this kind of lack of
visualization integrity are common in political contexts.
These rules are made to be understood, and then followed to the extent that the
context requires. If the context requires an accurate understanding of the data, with
high fidelity, the visualizer should follow the rules as much as possible. If the context
requires other criteria to be weighed more heavily, then understanding the rules
allows the visualizer to understand how these other criteria are biasing the visual
perception of the audience.
Simplicity Over Complexity
Meta-Rule #1: The simplest chart is usually the one that communicates most clearly. Use
the “not wrong” chart—not the “cool” chart.
When attempting to visualize data, our concern should be, as noted above, to
reduce the cognitive load of the viewer. This means that we should eliminate sources
of confusion. While several of the other meta-rules are related to this first rule, the
concept of simplicity itself deserves discussion.
Many data visualizers focus on the aesthetic components of a visualization at
the expense of clearly communicating the message that is present in the data.
When the artistic concerns of the visualizer (e.g., the Florida dripping blood example
above) overwhelm the message in the data, confusion occurs. Aside from artistic
concerns, visualizers often choose to use multiple kinds of graphs to add “variety,”
especially when visualizing multiple relationships. For instance, instead of using
three stacked column charts to represent different part to whole relationships, the
visualizer might use one stacked column chart, one pie chart, and one bubble chart.
So instead of comparing three relationships represented in one way, the viewer must
attempt to interpret different graph types, as well as try to compare relationships.
This increases cognitive load, as the viewer has to keep track of both the data values
and the various manners in which the data has been encoded.
Instead, we should focus on selecting a “not wrong” graph.[3] To do this, one must
understand both the nature of the data that is available and the nature of
the relationship being visualized. Data is generally considered to be of two types:
quantitative and qualitative (or categorical). At its simplest, quantitative data
is (1) numeric and (2) appropriate to use in mathematical operations
(unit price, total sales, etc.). Qualitative or categorical data is (1) numeric
or text and (2) not appropriate (or not possible) to use in mathematical operations
(e.g., Customer ID #, City, State, Country, Department).
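The distinction matters in practice because a numeric field is not automatically quantitative. The sketch below uses hypothetical order records and field names to show one way of making the distinction explicit: quantitative fields are summarized arithmetically, whereas categorical fields, including the numeric customer ID, are only counted.

```python
from collections import Counter
from statistics import mean

# Hypothetical order records; field names are illustrative only.
orders = [
    {"customer_id": 1001, "city": "Hyderabad", "unit_price": 20.0},
    {"customer_id": 1002, "city": "Champaign", "unit_price": 35.0},
    {"customer_id": 1003, "city": "Hyderabad", "unit_price": 50.0},
]

# customer_id is numeric but categorical: it labels a customer,
# so averaging or summing it would be meaningless.
QUANTITATIVE = {"unit_price"}
CATEGORICAL = {"customer_id", "city"}

def summarize(records, field):
    """Summarize a field according to its declared data type."""
    values = [record[field] for record in records]
    if field in QUANTITATIVE:
        return {"mean": mean(values), "total": sum(values)}
    if field in CATEGORICAL:
        return dict(Counter(values))
    raise KeyError(f"unknown field: {field}")

print(summarize(orders, "unit_price"))
print(summarize(orders, "city"))
```

Declaring the type up front also guides the choice of chart: quantitative fields map naturally to axis positions, while categorical fields map to groups, labels, or colors.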
Is a Visualization Needed?
An initial analysis of the data is required to determine the necessity of a visual
representation. For instance, in the graphic shown in Fig. 5.4, a text statement is
redundantly presented as a graph.
When a simple statement is enough to communicate the message, a visualization
may not be needed at all.
Table or Graph?
Once you decide that you need a visual representation of the data to com-
municate your message, you need to choose between two primary categories
of visualization—tables vs. graphs. The choice between the two is somewhat
subjective and, in many cases, you may choose to use a combination of tables and
graphs to tell your data story. However, use the following heuristic to decide whether
to use a table or a graph:
If the information being represented in the visualization needs to display precise and specific
individual values, with the intention to allow the user to look up specific values, and
compare to other specific values, choose a table. If the information in the visualization
must display sets or groups of values for comparison, choose a graph.
3 By using the term “not wrong” instead of “right” or “correct,” we attempt to communicate the
fact that in many cases there is not a single “correct” visualization. Instead, there are visualizations
that are more or less “not wrong” along a continuum. In contrast, there are almost always multiple
“wrong” visualization choices.
1. X-axis placement
2. Y-axis placement
3. Size
4. Shape
5. Color
6. Animation (interactive only, often used to display time)
However, many graph types, because of their nature, reduce the number of
possible dimensions that can be displayed (Table 5.3). For instance, while a
scatterplot can display all six dimensions, a filled map can only display two: the
use of the map automatically eliminates the ability to modify dimensions 1–4. As
such, we are left with the ability to use color to show different levels of one data
dimension and animation to show how the level of that one dimension changes over time.
While the maximum number of dimensions that can be represented is six, it is unlikely
that any visualization should use all six. Shape especially is difficult to
use as a dimensional variable and should never be used with size. Notice in Fig. 5.5
that it is difficult to compare the relative sizes of differently shaped objects.
One of the biggest issues in many visualizations is that visualizers attempt to
encode too much information into a graph, attempting to tell multiple stories with a
single graph. However, this added complexity leads to the viewer having a reduced
ability to interpret any of the stories in the graph. Instead, visualizers should create
simpler, single story graphs, using the fewest dimensions possible to represent the
story that they wish to tell.
Direct Representation
Meta-Rule #2: Always directly represent the relationship you are trying to communicate.
Don’t leave it to the viewer to derive the relationship from other information.
It is imperative that the data visualizer provide the information that the data story
requires directly—and not rely on the viewer to have to interpret what relationships
we intend to communicate. For instance, we often wish to tell a story of differences,
such as deviations from plan, and budgets vs. actual. When telling a story of
differences, do not rely on the viewer to calculate the differences themselves.
Figure 5.6 illustrates a visualization that relies on the viewer to calculate
differences. Figure 5.7 presents the actual deviation through the profit margin,
allowing the viewer to focus on the essence of the data set.
Again, the goal of the visualizer is to tell a story while minimizing the cognitive
load on the viewer. By directly representing the relationship in question, we assist
the viewer in making the cognitive assessment we wish to illustrate. When we
“leave it to the viewer” to determine what the association or relationship is that
we are trying to communicate, not only do we increase cognitive load, but we also
potentially lose consistency in the way that the viewers approach the visualization.
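In practice, direct representation usually means computing the quantity of interest before charting it. The sketch below uses hypothetical budget and actual figures to compute the deviation once and display it directly as a crude text chart, so that the viewer never has to subtract one series from the other.

```python
# Hypothetical budget vs. actual figures (in $ thousands).
budget = {"Q1": 100, "Q2": 120, "Q3": 130, "Q4": 150}
actual = {"Q1": 110, "Q2": 115, "Q3": 140, "Q4": 135}

# Compute the deviation once, so the chart can plot it directly.
deviation = {quarter: actual[quarter] - budget[quarter] for quarter in budget}

for quarter, diff in deviation.items():
    # One character per $ thousand: "+" above budget, "-" below budget.
    bar = ("+" if diff > 0 else "-") * abs(diff)
    print(f"{quarter} {diff:+4d} {bar}")
```

Plotting `deviation` rather than the two original series keeps the story on a single dimension, which is the point of Meta-Rule #2.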
Fig. 5.6 The difference between sales and profit is not immediately clear
Table 5.4 Sales of fictitious salespeople
Salesperson        YTD sales (in $)   Share of sales
Deepika Padukone   1,140,000          37%
George Clooney     750,000            25%
Jennifer Garner    740,000            24%
Danny Glover       430,000            14%
immediately apparent from the visual. Using $\pi (d/2)^2$ we see that the “smaller” circle
has an area of $\pi/4$ and the “larger” circle has an area of $\pi/2$, hence 1:2.
Both examples are properly encoded: the area of each exactly maintains the
1:2 proportion found in the data. Which of the visualizations, however, more clearly
communicates what is present in the data? The simple bars: these vary on only one
dimension and therefore more clearly illustrate the relationship that is present in the
data.
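The arithmetic behind the circle example can be made explicit. The sketch below assumes, as the stated areas imply, that the smaller circle has a diameter of 1; the function names are hypothetical. It shows that encoding a 1:2 ratio by area requires scaling the diameter by the square root of 2, and that simply doubling the diameter would quadruple the area.

```python
from math import pi, sqrt

def circle_area(diameter):
    """Area of a circle from its diameter: pi * (d / 2) ** 2."""
    return pi * (diameter / 2) ** 2

def diameter_for_area(area):
    """Inverse of circle_area: the diameter that yields a given area."""
    return 2 * sqrt(area / pi)

small_d = 1.0
small_area = circle_area(small_d)               # pi / 4
large_d = diameter_for_area(2 * small_area)     # diameter for twice the area
area_ratio = circle_area(large_d) / small_area  # ratio encoded by area
naive_ratio = circle_area(2 * small_d) / small_area  # doubling the diameter

print(large_d, area_ratio, naive_ratio)
```

The viewer, however, tends to judge circles by width as much as by area, which is one reason the single-dimension bars communicate the 1:2 relationship more reliably.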
To 3D or Not to 3D
One of the software features that many people enjoy using, because it looks
“cool,” is 3D effects in graphs. However, like the example above, 3D effects create
cognitive load for the viewer, and create distortion in perception. Let us look
at a simple example. Table 5.4 presents some YTD sales data for four fictitious
salespersons.
In this data set, Deepika is well above average, and Danny is well below average.
The same information is presented in 2D and 3D, respectively, in Fig. 5.9a, b. Even
in the 2D representation, the pie chart is less clear than the table (from
the chart, can you identify the #2 salesperson?). Moreover, the 3D chart greatly
distorts the % share of Deepika due to perspective.
In fact, with 3D charts, the placement of the data points can change
the perception of the values in the data. Figure 5.10a, b illustrates this point. Note
that when Deepika is rotated to the back of the graph, the perception of her % share
of sales is reduced.
Pie Charts?
Although the previous example uses pie charts, this was simply to illustrate the
impact of 3D effects and placement. Even though its use is nearly universal, the pie
chart is not usually the best choice to represent the part-to-whole relationship. This
is because it requires the viewer to compare differences in area rather than along
a single visual dimension, which makes comparisons difficult.
Going back to the 2D example in Fig. 5.9a, it is very difficult to compare the
differences between George and Jennifer. The typical response to this in practice is
Fig. 5.9 (a) 2D pie chart of sales. (b) 3D pie chart of sales
Fig. 5.10 (a) Chart with Deepika at front. (b) Chart with Jennifer at front
to add the % values to the chart, as Fig. 5.11 illustrates. However, at this point, what
cognitive value does the pie chart add that the viewer would not have gained from
Table 5.4?
Building a Fit-for-Purpose Visualization
When considering the visualization that is to be created, the visualizer must focus
on the purpose of the viewer. Creating a stacked column, sorted bar chart, table,
or even a pie chart, could all be “not wrong” decisions. Remember, if the user is
interested in looking up precise values, the table might be the best choice. If the
user is interested in understanding the parts-to-whole relationship, a stacked column
or pie chart may be the best choice. Finally, if the viewer needs to understand rank
order of values, the sorted bar chart (Fig. 5.12) may be the best option.
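As a rough illustration of matching the output to the viewer's task, the sketch below takes the Table 5.4 figures and derives both the part-to-whole shares and the rank order, printing a simple text version of a sorted bar chart. The variable names and the text-bar rendering are illustrative choices, not the chapter's own tooling.

```python
# YTD sales from Table 5.4 (in $)
sales = {
    "Deepika Padukone": 1_140_000,
    "George Clooney": 750_000,
    "Jennifer Garner": 740_000,
    "Danny Glover": 430_000,
}

total = sum(sales.values())

# Part-to-whole view: each salesperson's share of total sales
shares = {name: amount / total for name, amount in sales.items()}

# Rank-order view: sort descending, as a sorted bar chart would
ranked = sorted(sales.items(), key=lambda item: item[1], reverse=True)

top_amount = ranked[0][1]
for rank, (name, amount) in enumerate(ranked, start=1):
    bar = "#" * round(40 * amount / top_amount)   # bar length varies on one dimension only
    print(f"{rank}. {name:<17} {amount:>9,} {shares[name]:>4.0%} {bar}")
```

In the sorted output the #2 salesperson is immediately identifiable, even though George and Jennifer differ by only $10,000.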
5 Data Visualization 125
consistently highest in the fourth quarter. This type of data presentation is good for
illustrating comparative performance, but only in a general way, as an overview. To
add more value to the heat map, one might decide to add the actual number to the
cells or, as a better choice, add a tool tip in an interactive visualization (Fig. 5.14).
In almost every case, there are multiple choices for visualizations that are “not
wrong.” The visualization that is “fit for purpose” is the one that properly illustrates
the data story, in the fashion that is most compatible with the cognitive task that the
viewer will use the visualization to complete.
Use Color Properly
Meta-Rule #4: Never use color on top of color—color is not absolute.
behind the small rectangles in Fig. 5.15 causes changes in perception as to the color
of the rectangle, which, in the context of data visualization, changes the meaning of
the color of the rectangles.
Second, color is not perceived as having an order. Although the spectrum has a
true order (e.g., Red, Orange, Yellow, Green, Blue, Indigo, Violet: “ROY G. BIV”),
when viewing rainbow colors, violet is not perceived as “more” or “less” than red.
Rainbow colors are simply perceived as different from one another, without having a
particular “order.” However, variation in the level of intensity of a single color is
perceived as having an order. This is illustrated in Fig. 5.17, and in the following quote:
“If people are given a series of gray paint chips and asked to put them in order, they will
consistently place them in either a dark-to-light or light-to-dark order. However, if people
are given paint chips colored red, green, yellow, and blue and asked to put them in order, the
results vary,” according to researchers David Borland and Russell M. Taylor II, professor
of computer science at the University of North Carolina at Chapel Hill.
Levels of a single variable (continuous data) are best represented using a gradient
of a single color. This representation of a single color, with different levels of
saturation, visually cues the user that, while the levels of the variable may be
different, the color represents levels of the same concept. When dealing with
categories (or categorical data), the use of different colors cues the users that the
different categories represent different concepts.
Best Practices for Color
When building a visualization, it is easy to create something that is
information-rich but that does not allow users to quickly zero in on the most
important information in the graph. One way to direct their attention is through
the choice of colors. In general, colors that are lower in saturation and further from
the primary colors on the color wheel are considered more “natural” colors, because
they are those most commonly found in nature. These are also more soothing colors
than the brighter, more primary colors.
For this reason, when designing a visualization, use “natural” colors as the
standard color palette, and brighter, more primary colors for emphasis (Fig. 5.18).
By using more natural colors in general, viewers will more calmly be able to
interpret the visualization. Using more primary colors for emphasis allows the
visualizer to control when and where to drive the attention of the viewer. When
this is done well, it allows the viewer to find important data more immediately,
limiting the need for the viewer to search the entire visualization to interpret where
the important information is. This helps to achieve the visualizer’s goal of reducing
the cognitive load on the viewer.
Use Viewers’ Experience to Your Advantage
Meta-Rule #5: Do not violate the primal perceptions of your viewers. Remember, up means
more.
In the example provided in Fig. 5.2, we reviewed the disorientation that can occur
when the viewers’ primal perceptive instincts are violated. When viewing the graph
with the reversed Y-axis, the initial perception is that when the line moves down,
it should have the meaning of “less.” This is violated in Fig. 5.2. However, why
is it that humans perceive “up,” even in a 2D line graph, as meaning “more”? The
answer lies in the experiences of viewers, and the way that the brain uses experience
to drive perception.
4 Based upon the discussion of the concept of non-order in the perception of rainbow colors, the
use of a two-color gradient will not have meaning outside of a particular context. For instance, a
red–green gradient may be interpreted as having meaning in the case of profit numbers that are
both positive and negative, but that same scale would not have an intuitive meaning in another
context. As such, it is better to avoid multicolor gradients unless the context has a meaning already
established for the colors.
For example, most children, when they are very young, play with blocks, or
stones, or some other group of objects. They sort them, stack them, line them up,
etc., and as they do so, they begin the process of wiring their brains’ perceptive
processes. This leads to the beginning of the brain’s ability to develop categories—
round is different from square, rough is different from smooth, large is different
from small, etc. At the same time, the brain learns that “more” takes up more space,
and “more,” when stacked, grows higher. It is from these and other childhood
experiences that the brain is taught “how the world works,” and these primal
perceptions drive understanding for the remainder of our lives.
When a visualizer violates these perceptions, it can cause cognitive confusion
in the viewer (e.g., Fig. 5.2). It can also create a negative emotional reaction in
the viewer, because the visualization conflicts with “how the world works.” The
visualizer who created Fig. 5.2 was more interested in creating a “dripping blood”
effect than in communicating clearly to her viewer, and the Internet firestorm that
erupted after the publication of that graph is evidence of the emotional reaction of
many of the viewers.
Another common viewer reaction is toward visualizations that present percentage
representations. Viewers understand the concept of percent when approaching a
graph, and they know that the maximum percent level is 100%. However, Fig. 5.19
illustrates that some graphs may be produced to add up to more than 100%.
This is because the graph is usually representing multiple part-to-whole
relationships at the same time, but not giving the viewer this insight. The solution to
this perception problem is to always represent the whole when presenting a percentage,
so that viewers can understand to which whole each of the parts is being compared
(Fig. 5.20).
There are obviously many more of these primal perceptions that viewers hold,
and modern software packages make it rather difficult to accidentally violate them.
In almost every case, these kinds of violations occur when the visualizer attempts to
change, so if 15 pixels represents a year at one point in the axis, 15 pixels should
not represent 3 years at another point in the axis.
Standardize (Monetary) Units. “In time-series displays of money, deflated and
standardized units [ . . . ] are almost always better than nominal units.” This means
that when comparing numbers, they should be standardized. For currency, this
means using a constant unit (e.g., 2014 Euros, or 1950 USD). For other units,
standardization requires consideration of the comparison being communicated.
For instance, in a comparison between deaths by shooting in two states, absolute
numbers may be distorted due to differences in population. Or in a comparison of
military spending, it may be appropriate to use standardization by GDP, land mass,
or population. In any case, standardization of data is an important concept in making
comparisons, and should be carefully considered in order to properly communicate
the relationship in question.
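The two kinds of standardization described here can be sketched in a few lines. Every number below (the death counts, populations, and price-index values) is hypothetical and chosen only to make the arithmetic easy to follow; the function names are likewise illustrative.

```python
def per_100k(count: float, population: float) -> float:
    """Standardize a raw count by population (rate per 100,000 people)."""
    return count / population * 100_000

def to_constant_units(nominal: float, price_index: float, base_index: float) -> float:
    """Deflate a nominal money amount into base-year (constant) units."""
    return nominal * base_index / price_index

# Hypothetical states: A has four times the deaths of B in absolute terms ...
state_a = {"deaths": 1_200, "population": 20_000_000}
state_b = {"deaths": 300, "population": 3_000_000}

rate_a = per_100k(state_a["deaths"], state_a["population"])
rate_b = per_100k(state_b["deaths"], state_b["population"])

# ... but once standardized, B has the higher rate.
print(f"State A: {rate_a:.1f} per 100,000")
print(f"State B: {rate_b:.1f} per 100,000")

# Hypothetical deflation: 150 nominal units at a price index of 120,
# expressed in base-year units (index 100).
print(to_constant_units(150, price_index=120, base_index=100))
```

The comparison reverses once population is taken into account, which is the distortion the text warns about when absolute numbers are compared directly.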
Present Data in Context. “Graphics must not quote data out of context” (Tufte,
p. 60). When telling any data story, no data has meaning until it is compared with
other data. This other data can be a standard value, such as the “normal” human body
temperature, or some other comparison of interest such as year-over-year sales. For
instance, while today’s stock price for Ford might be $30, does that single data point
provide you with any understanding as to whether that price is high, low, good, or
bad? Only when data is provided within an appropriate context can it have meaning.
Further, as Tufte illustrates, visualizers choose what they wish to show, and what
they choose to omit. By intentionally presenting data out of context, it is possible
to change the story of that data completely. Figure 5.22 illustrates the data from
Fig. 5.2 presented out of context, showing the year after the enactment of the “Stand
Your Ground” law.
From the graph in Fig. 5.22 it is not possible to understand if the trend in gun
deaths in Florida is any different than it was before the law was enacted. However,
by presenting this data in this manner, it would be possible for the visualizer to drive
public opinion that the law greatly increased gun deaths. Figure 5.23 adds context
to this data set by illustrating that the trajectory of gun deaths was different after the
law was enacted—it is clearly a more honest graph.
Show the Data. “Above all else show the data” (Tufte, p. 92). Tufte argues that
visualizers often fill significant portions of a graph with “non-data” ink. He argues
that, as much as possible, the visualizer should show the viewer data, in the form of
the actual data and annotations that call attention to particular “causality” in the
data, and so drive viewers to generate understanding.
When visualizers graph with low integrity, it reduces the fidelity of the
representation of the story that is in the data. I often have a student ask, “But what if
we WANT to distort the data?” If this is the case in your mind, check your motives.
If you are intentionally choosing to mislead your viewer, to lie about what the data
say, stop. You should learn the rules of visualization so that (1) you don’t break them
and unintentionally lie, and (2) you can more quickly perceive when a visualization
is lying to you.
Based upon the widespread recognition of the power of the visual representation of
data, and the emergence of sufficiently inexpensive computing power, many modern
software packages have emerged that are designed specifically for data visualization.
Even the more generalized software packages are now adding and upgrading
their visualization features more frequently. This chapter was not intended to
“endorse” a particular software package. It stands to illustrate some of the rules
that might explain why some software packages behave in a certain manner when
visualizing particular relationships.
Because most visualizers do not have the freedom to select any software they
choose (due to corporate standards), and because research companies such as
Gartner have published extensive comparative analyses of the various software
packages available for visualization, we do not recreate that here.
As stated above, a single chapter is far too little space to describe the intricacies of
data visualization. Few (2009) and Tufte and Graves-Morris (1983) are good sources
with which to broaden your knowledge of the “whys” of data visualization.
Exercises
References
Vishnuprasad Nagadevara
1 Introduction
The purpose of statistical analysis is to glean meaning from sets of data in order
to describe data patterns and relationships between variables. In order to do this,
we must be able to summarize and lay out datasets in a manner that allows us to
use more advanced methods of examination. This chapter introduces fundamental
methods of statistics, such as the central limit theorem, confidence intervals,
hypothesis testing and analysis of variance (ANOVA).
2 Motivating Examples
A real estate agent would like to estimate the average price per square foot of
residential property in the city of Hyderabad, India. Since it is not possible to collect
data for an entire city’s population in order to calculate the average value (μ), the
agent will have to obtain data for a random sample of residential properties. Based
on the average of this sample, they will have to infer, or draw conclusions, about the
possible value of the population average.
V. Nagadevara
IIM-Bangalore, Bengaluru, Karnataka, India
e-mail: nagadevara_v@isb.edu
by the sample mean and $P(\bar{X}_i)$ be the probability of observing the ith value. For
example, $\bar{X}_1 = 2.00$, and $P(\bar{X}_1) = 0.2$. It can be seen that the expected value of
the variable $\bar{X}$ is exactly equal to the population mean μ.
$$\sum_{i=1}^{5} \bar{X}_i \, P(\bar{X}_i) = 2.60 = \mu.$$
$$\sigma_{\bar{X}} = \sqrt{\sum_{i=1}^{5} \left(\bar{X}_i - \mu\right)^2 P(\bar{X}_i)} = 0.4163.$$
deviation $\sigma_{\bar{X}}$ as given above. The expression $\sqrt{\frac{N-n}{N-1}}$ used in the calculation of
$\sigma_{\bar{X}}$ is called the finite population multiplier (FPM). It can be seen that the FPM
approaches a value of 1 as the population size (N) becomes very large and sample
size (n) becomes small. Hence, this is applicable to small or finite populations. When
population sizes become large or infinite, the standard deviation of $\bar{X}$ reduces to $\sigma/\sqrt{n}$.
The standard deviation of $\bar{X}$ is also called standard error. This is a measure of the
variability of the $\bar{X}$s.
Just as $\bar{X}$ is a random variable, each of the sample statistics (sample standard
deviation S, computed as the standard deviation of the values in a sample, sample
proportion p, etc.) is also a random variable, having its own probability distribution.
These distributions, in general, are called sampling distributions.
The fact that the expected value of $\bar{X}$ is equal to μ implies that on average, the
sample mean is actually equal to the population mean. Thus, it makes $\bar{X}$ a very
good estimator of μ. The standard error ($\sigma_{\bar{X}}$), which is equal to $\sigma/\sqrt{n}$, indicates that as
the sample size (n) increases, the standard error decreases, making the estimator, $\bar{X}$,
approach closer and closer to μ.
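The behavior of the standard error can be sketched numerically. The function below assumes the usual form of the standard error, σ/√n, multiplied by the finite population multiplier √((N − n)/(N − 1)) when a population size is supplied; the values σ = 520 and n = 64 are illustrative inputs, and the function name is hypothetical.

```python
import math
from typing import Optional

def standard_error(sigma: float, n: int, population_size: Optional[int] = None) -> float:
    """Standard error of the sample mean.

    With no population size, returns sigma / sqrt(n).
    With a finite population of size N, applies the finite population
    multiplier sqrt((N - n) / (N - 1)).
    """
    se = sigma / math.sqrt(n)
    if population_size is not None:
        se *= math.sqrt((population_size - n) / (population_size - 1))
    return se

# Illustrative values: sigma = 520, sample size n = 64
for N in (100, 1_000, 1_000_000):
    print(N, round(standard_error(520, 64, N), 3))

# Large or infinite population: the multiplier is effectively 1
print(standard_error(520, 64))

# A larger sample reduces the standard error
print(standard_error(520, 256))
```

As N grows relative to n, the multiplier approaches 1 and the finite-population value converges to σ/√n.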
6 Statistical Methods: Basic Inferences 141
We can see from the earlier section that the sample mean, $\bar{X}$, is a point estimate
of the population mean. A point estimate is a specific numerical value estimate of
a parameter. The best point estimate of the population mean μ is the sample mean
$\bar{X}$. However, since a point estimator cannot provide the exact value of the
parameter, we compute an “interval estimate” for the parameter which covers the true
population mean a sufficiently large percentage of the time. We have also learned in the
previous section that $\bar{X}$ is a random variable and that it follows a normal distribution
with mean μ and standard deviation $\sigma_{\bar{X}}$ or an estimate of it. This standard deviation
is also referred to as the “standard error” and can be shown to be equal to $\sigma/\sqrt{n}$ (or
its estimate), where n is the sample size. We can use this property to calculate a
confidence interval for the population mean μ as described next.
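[Fig. 6.1: sampling distribution of $\bar{X}$ centered at μ, with 95% of the area lying between the limits LL and UL, each at a distance of $1.96\,\sigma_{\bar{X}}$ from μ]
$$P\left(\mu - 1.96\,\sigma_{\bar{X}} \le \bar{X} \le \mu + 1.96\,\sigma_{\bar{X}}\right) = 0.95 \qquad (6.1)$$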
The way to interpret Expression (6.2) is that if we take a large number of samples
of size n and calculate the sample means and then create an interval around each $\bar{X}$
with a distance of $\pm 1.96\,\sigma_{\bar{X}}$, then 95% of these intervals will be such that they
contain the population mean μ. Let us consider one such sample mean, $\bar{X}_1$, in
Fig. 6.1. We have built an interval around $\bar{X}_1$ whose width is the same as the width
between LL and UL. This interval is such that it contains μ. It can be seen that an
interval built around any $\bar{X}$ lying between LL and UL will always contain μ. It is
obvious from Expression (6.1) that 95% of all $\bar{X}$s will fall within this range. On the
other hand, an interval built around any $\bar{X}$ which is outside this range (such as $\bar{X}_2$
in Fig. 6.1) will be such that it does not contain μ. In Fig. 6.1, 5% of all $\bar{X}$s fall in
this category.
It should be noted that because μ is a parameter and not a random variable,
we cannot associate any probabilities with it. The random variable in Expression
(6.2) above is $\bar{X}$. Consequently, the two values, $\bar{X} - 1.96\,\sigma_{\bar{X}}$ and $\bar{X} + 1.96\,\sigma_{\bar{X}}$, are
also random variables. The appropriate way of interpreting the probability statement
(Expression 6.2) is that there is a large number of $\bar{X}$s possible and there are just
as many intervals, each having a width of $2 \times 1.96\,\sigma_{\bar{X}}$, and that 95% of these
intervals contain μ. The above statement can be rephrased
like this: “We are 95% confident that the interval built around an $\bar{X}$ contains μ.” In
other words, we are converting the probability into a confidence associated with the
interval. This interval is referred to as the “confidence interval,” and its associated
probability as the “confidence level.” We can decide on the desired confidence level
and calculate the corresponding confidence interval by using the appropriate z-value
drawn from the standard normal distribution. For example, if we are interested in a
90% confidence interval, the corresponding z-value is 1.645, and this value is to be
substituted in the place of 1.96 in Expression (6.2).
Let us now apply this concept to a business example. Swetha is running a
number of small-sized restaurants in Hyderabad and wants to estimate the mean
value of APC (average per customer check) in such restaurants. She knows that
the population standard deviation σ = ₹520 (₹ stands for Indian Rupee). She has
collected data from 64 customers, and the sample average $\bar{X}$ = ₹1586. The standard
deviation of $\bar{X}$ is $\sigma_{\bar{X}} = 520/\sqrt{64} = 65$. By substituting these values in Expression (6.2),
we get
$$P(1586 - 1.96 \times 65 \le \mu \le 1586 + 1.96 \times 65) = P(1458.6 \le \mu \le 1713.4) = 0.95$$
The above probability statement indicates that we are 95% confident that the
population mean μ (which we do not know) will lie between ₹1458.6 and ₹1713.4.
By using the sample mean (₹1586), we are able to calculate the possible interval
that contains μ and associate a probability (or confidence level) with it. Table 6.4
presents a few more confidence intervals and their corresponding confidence levels
as applied to this example.
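The interval for Swetha's example can be reproduced with a short sketch. It assumes a known population standard deviation and uses the standard library's `statistics.NormalDist` to look up the z-value instead of a printed table; the function name `mean_confidence_interval` is illustrative.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(sample_mean: float, sigma: float, n: int,
                             confidence: float = 0.95) -> tuple:
    """Confidence interval for the population mean when sigma is known."""
    # Two-sided interval: put (1 - confidence) / 2 in each tail
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sigma / math.sqrt(n)      # z-value times the standard error
    return sample_mean - margin, sample_mean + margin

# Swetha's data: sample mean 1586, sigma 520, n = 64
low, high = mean_confidence_interval(1586, 520, 64, confidence=0.95)
print(f"95% CI: {low:.1f} to {high:.1f}")

# Other confidence levels only change the z-value
for level in (0.90, 0.99):
    lo_, hi_ = mean_confidence_interval(1586, 520, 64, confidence=level)
    print(f"{level:.0%} CI: {lo_:.1f} to {hi_:.1f}")
```

The z-values computed this way carry more decimal places than the rounded table values (for example 1.96), so results for other confidence levels may differ slightly in the last digit from hand calculations.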
Notice that the width of the confidence interval increases as we increase the
confidence level. It may be noted that a larger confidence level means a higher
z-value, and an automatic increase in width. The width of the confidence interval
can be considered as a measure of the precision of the statement regarding the true
value of the population mean. That is, the larger the width of the interval, the lower
the precision associated with it. In other words, there is a trade-off between the
confidence level and the precision of the interval estimate.
Ideally, we would like to achieve higher confidence levels together with increased
interval precision. The width of the confidence interval depends not only on the z-value
as determined by the confidence level but also on the standard error. The standard
error depends on the sample size. As the sample size, n, increases, the standard
error decreases since $\sigma_{\bar{X}} = \sigma/\sqrt{n}$. By increasing the sample size (n), we can decrease
the standard error ($\sigma_{\bar{X}}$) with a consequent reduction in the width of the confidence
interval, effectively increasing the precision. Table 6.5 presents the width of the
confidence interval for two sample sizes, 64 (earlier sample size) and 256 (new
sample size), as applied to the earlier example. These widths are calculated for three
confidence levels.
It should be noted that the 99% confidence interval with n = 256 is narrower,
and therefore more precise, than the 90% confidence interval with n = 64.
We can consider the confidence interval as the point estimate, $\bar{X}$, adjusted for the
margin of error measured by the standard error, $\sigma_{\bar{X}}$. The margin of error
is in turn scaled by the z-value that corresponds to the chosen confidence level.
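The trade-off between confidence level, sample size, and interval width can be tabulated with a small sketch. It assumes σ = 520 from the restaurant example and the confidence levels 90%, 95%, and 99% (an assumption about which three levels are of interest); `interval_width` is an illustrative helper name.

```python
import math
from statistics import NormalDist

def interval_width(sigma: float, n: int, confidence: float) -> float:
    """Full width of a two-sided confidence interval for the mean (sigma known)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return 2 * z * sigma / math.sqrt(n)

print("level    n=64    n=256")
for confidence in (0.90, 0.95, 0.99):
    w64 = interval_width(520, 64, confidence)
    w256 = interval_width(520, 256, confidence)
    print(f"{confidence:.0%}   {w64:7.1f} {w256:7.1f}")
```

Quadrupling the sample size halves the width at every confidence level, which is why the 99% interval at n = 256 can be narrower than the 90% interval at n = 64.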
We can use the concept of confidence intervals to determine the ideal sample size. Is
the sample size of 64 ideal in the above example? Suppose Swetha wants to calculate
the ideal sample size, given that she desires a 95% confidence interval with a width
of ±₹100. We already know that the margin of error is determined by the z-value
and the standard error. Given that the confidence level is 95%, the margin of error
or the width of the confidence interval is
where:
• s is the sample standard deviation.
• Xi is the ith sample observation.
• $\bar{X}$ is the sample mean.
• n is the sample size.
The denominator, (n − 1), is referred to as the degrees of freedom (df). Abstractly,
it is the number of values that remain free to vary once the estimates used in the
calculation have been fixed. For example, say we need to seat 100 people in a theater
that has 100 seats. After the 99th person is seated, we will have only one seat and
one person left, leaving us no choice but to pair the two. In this way, we had 99
instances where we could make a choice between at least two options. So, for a
sample size of n = 100, we had n − 1 = 99 degrees of freedom. Similarly, since we
are using $\bar{X}$ as a proxy for the population mean μ, we lose one degree of freedom.
We cannot freely choose all n sample values since their mean is fixed at $\bar{X}$. As we
begin using more proxy values, we lose more degrees of freedom, and we will see
how this affects the variability of an estimate.
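The role of the (n − 1) denominator can be seen in a small sketch. The five customer-check values below are hypothetical, invented only for illustration; the hand computation is compared against `statistics.stdev`, which also divides by n − 1.

```python
import math
import statistics

# Hypothetical customer checks, for illustration only
checks = [1450, 1620, 1580, 1710, 1490]

n = len(checks)
mean = sum(checks) / n

# Deviations from the sample mean always sum to zero, so once n - 1 of them
# are known the last one is determined: one degree of freedom is lost.
deviations = [x - mean for x in checks]
print(sum(deviations))

# Sample standard deviation with the (n - 1) denominator
s = math.sqrt(sum(d ** 2 for d in deviations) / (n - 1))
print(round(s, 2))

# The standard library's stdev uses the same (n - 1) definition
print(round(statistics.stdev(checks), 2))
```

Dividing by n instead (as `statistics.pstdev` does) would give a smaller value, which is appropriate only when the data are the entire population rather than a sample.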
Since we are using the sample standard deviation s, instead of σ in our
calculations, there is more uncertainty associated with the confidence interval. In
order to account for this additional uncertainty, we use a t-distribution instead of a
standard normal distribution (“z”). In addition, we will need to consider the standard
error of the sample mean $s_{\bar{X}} = s/\sqrt{n}$ when calculating the confidence intervals using
the t-distribution.
Figure 6.2 presents the t-distribution with different degrees of freedom. As
the degrees of freedom increase, the t-distribution approaches a standard normal
distribution and will actually become a standard normal distribution when the
degrees of freedom reach infinity.
It can be seen from Fig. 6.2 that the t-distribution is symmetric and centered at
0. As the degrees of freedom increase, the distribution becomes taller and narrower,
converging to the standard normal distribution as the degrees of freedom approach
infinity.
146 V. Nagadevara
Fig. 6.2 t-Distribution with different degrees of freedom (curves shown for df = 1, df = 5,
and df = ∞). When df = ∞, we achieve a standard normal distribution
The t-value is defined as
$$t = \frac{\bar{X} - \mu}{s_{\bar{X}}}, \quad \text{where } s_{\bar{X}} = \frac{s}{\sqrt{n}} \qquad (6.5)$$
[Figure: t-distribution with 90% of the area between t = −1.711 and t = 1.711]
Estimating the confidence interval for the population proportion π is very similar to
that of the population mean, μ. The point estimate for π is the sample proportion
p. The sampling distribution of p can be derived from binomial distribution. It is
known that the binomial random variable is defined as the number of successes,
given the sample size n and probability of success, π. The expected value of a
binomial random variable is nπ, and the variance is nπ(1 − π). Consider a variable $D_i$
which denotes the outcome of a Bernoulli process. Let us map each success of the
Bernoulli process as $D_i = 1$ and failure as $D_i = 0$. Then the number of successes
X is nothing but $\sum D_i$ and the sample proportion, $p = X/n = \sum D_i / n$. In other words,
p is similar to the sample mean $\bar{D}$, and by virtue of the central limit theorem, p (as
analogous to the sample mean) is distributed normally (when n is sufficiently large).
The expected value and the variance of the estimator p are given below:
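$$E(p) = E\left(\frac{X}{n}\right) = \frac{1}{n}\,E(X) = \frac{1}{n}\,n\pi = \pi \qquad (6.7)$$
and
$$V(p) = V\left(\frac{X}{n}\right) = \frac{1}{n^2}\,V(X) = \frac{1}{n^2}\,n\pi(1-\pi) = \frac{\pi(1-\pi)}{n} \qquad (6.8)$$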
In other words, the sample proportion p is distributed normally with mean E(p) = π
(the population proportion) and variance V(p) = π(1 − π)/n. We use this relationship
to estimate the confidence interval for π.
Swetha wants to estimate a 99% confidence interval for the proportion (π) of
the families who come for dinner to the restaurant with more than four members
in the family. This is important for her because the tables in the restaurant are
“four seaters,” and when a group consisting of more than four members arrives, the
restaurant will have to join two tables together. She summarized the data for 160
families, and 56 of them had more than four members. The sample proportion is
56/160 = 0.35. The standard error of p, $\sigma_p$, is the square root of V(p), which is equal
to $\left(\frac{\pi(1-\pi)}{n}\right)^{0.5}$. In order to estimate the confidence interval for π, we need $\sigma_p$,
which in turn depends on the value of π! We know that p is a point estimate of π,
and hence we can use the value of p in coming up with a point estimate for $\sigma_p$. Thus,
this point estimate is $\sigma_p = \left(\frac{p(1-p)}{n}\right)^{0.5} = \left(\frac{0.35(1-0.35)}{160}\right)^{0.5} = 0.0377$. The z-value
corresponding to the 99% confidence level is 2.575.
The 99% confidence interval for the population proportion π is
$$P(0.35 - 2.575 \times 0.0377 \le \pi \le 0.35 + 2.575 \times 0.0377) = P(0.2529 \le \pi \le 0.4471) = 0.99$$
The above confidence interval indicates that Swetha should be prepared to join
the tables for 25–45% of the families visiting the restaurant.
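The same calculation can be sketched in code. It assumes the normal approximation described above, with p substituted for π in the standard error, and takes the z-value from `statistics.NormalDist` (about 2.576, slightly more precise than the table value 2.575); the function name is illustrative.

```python
import math
from statistics import NormalDist

def proportion_confidence_interval(successes: int, n: int,
                                   confidence: float = 0.99) -> tuple:
    """Normal-approximation confidence interval for a population proportion."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)            # point estimate of sigma_p
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return p - z * se, p + z * se

# 56 of 160 families had more than four members
low, high = proportion_confidence_interval(56, 160, confidence=0.99)
print(f"99% CI for the proportion: {low:.2f} to {high:.2f}")
```

Rounded to whole percentages, this gives the 25–45% range quoted for the share of families that will need joined tables.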
We can determine the ideal sample size for estimating the confidence interval for π
with a given confidence level and precision. The process is the same as that for the
mean as shown in Expression (6.3)—which is reproduced below for ready reference.
$$n = \left(\frac{z\,\sigma}{w}\right)^2$$
In the case of the population mean μ, we have assumed that the population
standard deviation, σ, is known. Unfortunately, in the present case, the standard
deviation itself is a function of π, and hence we need to approach it differently.
Substituting the formula for σ in Expression (6.3), we get
$$n = \left(\frac{z}{w}\right)^2 \pi(1-\pi) \qquad (6.9)$$
Needless to say, π(1 – π) is maximum when π = 0.5. Thus we can calculate
the maximum required sample size (rather than ideal sample size) by substituting
the value of 0.5 for π. Let us calculate the maximum required sample size for this
particular case. Suppose Swetha wants to estimate the 99% confidence interval with
a precision of ±0.06. Using the z-value of 2.575 corresponding to 99% confidence
level, we get
$$n = \left(\frac{2.575}{0.06}\right)^2 (0.5)(0.5) = 460.46 \approx 461$$
In other words, Swetha needs to collect data from 461 randomly selected families
visiting her restaurant. Based on Expression (6.9) above, it is clear that the greater
the desired precision (that is, the smaller the value of w), the larger the required
sample size; likewise, the higher the confidence level, the larger the required sample
size. Sometimes, it is possible to get a “best guess”
value for a proportion (other than 0.5) to be used in Expression (6.9). It could be
based on a previous sample study or a pilot study or just a “best judgment” by the
analyst.
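Expressions (6.3) and (6.9) translate directly into code. The sketch below rounds up to the next whole observation; the function names are illustrative, and the “best guess” of 0.35 in the last call is simply the earlier sample proportion reused as an example.

```python
import math

def sample_size_for_mean(z: float, sigma: float, w: float) -> int:
    """Expression (6.3): n = (z * sigma / w) ** 2, rounded up."""
    return math.ceil((z * sigma / w) ** 2)

def sample_size_for_proportion(z: float, w: float, pi: float = 0.5) -> int:
    """Expression (6.9): n = (z / w) ** 2 * pi * (1 - pi), rounded up.

    pi = 0.5 maximizes pi * (1 - pi) and so gives the maximum required size.
    """
    return math.ceil((z / w) ** 2 * pi * (1 - pi))

# Maximum required sample size: 99% confidence, precision of +/- 0.06
print(sample_size_for_proportion(2.575, 0.06))            # 461

# With a "best guess" proportion the requirement is smaller
print(sample_size_for_proportion(2.575, 0.06, pi=0.35))   # 420

# Mean of the customer check: 95% confidence, sigma = 520, precision of +/- 100
print(sample_size_for_mean(1.96, 520, 100))               # 104
```

Rounding up rather than to the nearest integer ensures that the achieved margin of error is no wider than the one requested.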
to the right of 2.7326 is equal to 0.95 (i.e., the area to the left of 2.7326 is equal to
1 – 0.95 = 0.05), and the area to the right of 15.5073 is equal to 0.05. Thus, the area
enclosed between these two χ² values is equal to 0.90.
Swetha had earlier calculated the sample standard deviation, s, as 12 min. The
variance s² is 144 min², and the degrees of freedom are 24. She wants to estimate
a 95% confidence interval for the population variance, σ². The formula for this
confidence interval is
$$P\left(\chi^2_{L,0.975} \le \frac{(n-1)\,s^2}{\sigma^2} \le \chi^2_{U,0.025}\right) = 0.95 \qquad (6.10)$$
where $\chi^2_{L,0.975}$ is the “lower” or left-hand side value of χ² from Table 6.13,
corresponding to column 0.975, and $\chi^2_{U,0.025}$ is the “upper” or right-side value
corresponding to column 0.025. The two values, drawn from Table 6.13, are 12.4012
and 39.3641. The 95% confidence interval for σ² is
$$P\left(\frac{24 \times 144}{39.3641} \le \sigma^2 \le \frac{24 \times 144}{12.4012}\right) = P\left(87.7957 \le \sigma^2 \le 278.6827\right) = 0.95$$
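The interval above can be reproduced with a few lines of code. The sketch takes the two χ² critical values as inputs, exactly as read from the table, rather than computing them, so that it needs only the standard library; the function name is illustrative, and n = 25 follows from the 24 degrees of freedom.

```python
import math

def variance_confidence_interval(s_squared: float, n: int,
                                 chi2_lower: float, chi2_upper: float) -> tuple:
    """Confidence interval for the population variance.

    chi2_lower and chi2_upper are the left- and right-hand chi-square
    critical values for n - 1 degrees of freedom, read from a table.
    """
    df = n - 1
    # Dividing by the larger critical value gives the lower limit, and vice versa
    return df * s_squared / chi2_upper, df * s_squared / chi2_lower

# s = 12 min, so s^2 = 144; df = 24; table values 12.4012 and 39.3641
low, high = variance_confidence_interval(144, 25, 12.4012, 39.3641)
print(f"95% CI for the variance: {low:.4f} to {high:.4f}")

# Taking square roots gives the corresponding interval for sigma (in minutes)
print(f"95% CI for the standard deviation: {math.sqrt(low):.2f} to {math.sqrt(high):.2f}")
```

Unlike the intervals for the mean and the proportion, this interval is not symmetric around the sample value of 144, because the χ² distribution is skewed.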
When estimating the confidence interval for μ, π, or σ², we do not make any prior
assumptions about the possible value of the parameter. The possible range within
which the parameter value is likely to fall is estimated based on the sample statistic.
However, sometimes the decision maker has enough reason to presume that the
parameter takes on a particular value and would like to test whether this presumption
is tenable or not based on the evidence provided by the sample data. This process is
referred to as hypothesis testing.
Let us consider a simple example of hypothesis testing. In the field of electronics,
some parts are coated with a specific material at a thickness of 50 μm. If the coating
is too thin, the insulation does not work properly, and if it is too thick, the part will
not fit properly with other components. The coating process is calibrated to achieve
50 μm thickness on average with a standard deviation of 8 μm. If the outcome is
as expected (50 μm thickness), then the process is said to be in control, and if not,
it is said to be out of control. Let us take a sample of 64 parts whose thickness is
measured and sample average, $\bar{X}$, is calculated. Based on the value of $\bar{X}$, we need
to infer whether the process is in control or not.
In order to effectively understand this, we create a decision table (refer to
Table 6.6).
H0 represents the null hypothesis, or the assumption that the parameter being
observed is actually equal to its hypothesized value, which in this case is μ = 50 μm.
It is obvious from Table 6.6 that there are only four possibilities in hypothesis
testing. The first column in Table 6.6 is based on the assumption that the “process
is in control.” The second column is that the “process is not in control.” Obviously,
these two are mutually exclusive. Similarly, the two rows representing the
conclusions based on $\bar{X}$ are also mutually exclusive.
Let us analyze the situation under the assumption for the null hypothesis wherein
the process is in control (first column) and μ is exactly equal to 50 μm. We denote
this as follows:
H0 : μ0 = 50 μm.
μ0 is a notation used to imply that we do not know the actual value of μ, but its
hypothesized value is 50 μm.
If our conclusion based on evidence provided by X puts us in the first cell
of column 1, then the decision is correct, and there is no error involved with the
decision. On the other hand, if the value of X is such that it warrants a conclusion
that the process is not in control, then the null hypothesis is rejected, resulting in
a Type I error. The probability of making a Type I error is represented by α. This
is actually a conditional probability, the condition being “the process is actually in
control.” Evaluating the consequences of such an error and choosing a value for α
allows us to indicate the extent to which the decision maker is willing
to tolerate a Type I error. Based on the value of α, we can evolve a decision rule to
either accept or reject the proposition that the “process is in control.”
The selection of α is subjective, and in our example we will pick 0.05 as the
tolerable limit for α.
Suppose we take a sample of size 64 and calculate the sample mean X. The
next step is to calculate the limits of the possible values of X within which the null
hypothesis cannot be rejected. Considering that α is set at 0.05, the probability of
accepting the null hypothesis when it is actually true is 1 − α = 0.95. The sample
mean is distributed normally with mean 50 and standard error $\sigma/\sqrt{n} = 8/\sqrt{64} = 1$. The limits
of X within which we accept the null hypothesis are given by
\[ P\left(\mu_0 - z\frac{\sigma}{\sqrt{n}} \le \bar{X} \le \mu_0 + z\frac{\sigma}{\sqrt{n}}\right) = P\left(50 - 1.96(1) \le \bar{X} \le 50 + 1.96(1)\right) = 0.95 \]
[Figure: sampling distribution of X under H0, with the acceptance region between 48.04 and 51.96 and a rejection region of area 0.025 in each tail.]
The null hypothesis is rejected when X falls in either rejection region. The probability of a Type I error
(α) is equally divided between the two tail regions since such an equal division leads
to the smallest range of an acceptance region, thereby maximizing the precision.
It should be noted that the selection of α is completely within the control of the
decision maker.
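The decision rule above can be reproduced in a few lines. The following is a minimal sketch using only Python's standard library; the variable names are illustrative and not part of the original text.

```python
from statistics import NormalDist

mu0 = 50.0      # hypothesized mean thickness (micrometres)
sigma = 8.0     # standard deviation of the coating process
n = 64          # sample size
alpha = 0.05    # tolerated probability of a Type I error

std_err = sigma / n ** 0.5                  # 8 / sqrt(64) = 1
z = NormalDist().inv_cdf(1 - alpha / 2)     # two-sided critical value, about 1.96
lower = mu0 - z * std_err
upper = mu0 + z * std_err

print(f"z = {z:.4f}")
print(f"accept H0 if {lower:.2f} <= sample mean <= {upper:.2f}")
```

Changing `alpha` or `n` in this sketch shows how the acceptance region widens or narrows.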
Now that we have a decision rule that says that we accept the null hypothesis as
long as X is between 48.04 and 51.96, we can develop the procedure to calculate
the probability of a Type II error. A Type II error is committed when the reality
is that the process is out of control and we erroneously accept the null hypothesis
(that the process is in control) based on evidence given by X and our decision rule.
If the process is out of control, we are actually operating on the second column
of Table 6.6. This implies that μ ≠ 50 μm, that is, μ can be any
value other than 50 μm. Let us pick one such value, 49.5 μm. This value is usually
denoted by μA because this is the alternate value for μ. This is referred to as the
“alternate hypothesis” and expressed as
H1 : μ0 ≠ 50
The distribution of X with μA = 49.5 is presented in Fig. 6.6. As per our decision
rule created earlier, we would still accept the null hypothesis that the process is
under control, if X falls between 48.04 and 51.96 μm. This conclusion is erroneous
because, with μA = 49.5, the process is actually out of control. This is nothing but a
Type II error. The probability of committing this error (β) is given by the area under
the curve (Fig. 6.6) between the values 48.04 and 51.96. This area is 0.9209.
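As a check on this area, the probability of a Type II error can be computed directly from the normal distribution centered at the alternate mean. This is a small sketch with illustrative variable names, assuming the acceptance limits 48.04 and 51.96 derived earlier.

```python
from statistics import NormalDist

std_err = 1.0                  # sigma / sqrt(n) = 8 / sqrt(64)
lower, upper = 48.04, 51.96    # acceptance limits from the decision rule
mu_a = 49.5                    # alternate value of the mean

# Sampling distribution of the sample mean when the true mean is mu_a.
sampling_dist = NormalDist(mu=mu_a, sigma=std_err)

# beta = probability that the sample mean still falls in the acceptance region.
beta = sampling_dist.cdf(upper) - sampling_dist.cdf(lower)
power = 1 - beta

print(f"beta = {beta:.4f}, power = {power:.4f}")
```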
[Fig. 6.6: distribution of X with μA = 49.5; the area between 48.04 and 51.96 is the value of β.]
As a matter of fact, the alternate value for μ can be any value other than 50 μm.
Table 6.7 presents values of β as well as 1 − β for different possible values of μ.
1 − β, called the power of the test, is interpreted as the probability of rejecting the
null hypothesis when it is not true (Table 6.6 Q4), which is the correct thing to do.
It can be seen from Table 6.7 that as we move farther away in either direction
from the hypothesized value of μ (50 μm in this case), β values decrease. At
μA = μ0 (the hypothesized value of μ), 1 – β is nothing but α and β is nothing
but 1 – α. This is clear when we realize that at the point μA = μ0 , we are no longer
in quadrants Q2 and Q4 of Table 6.6 but have actually “shifted” to the first column.
6 Statistical Methods: Basic Inferences 155
Fig. 6.7 Operating characteristic curve (β plotted against μ, for μ from 44 to 54)
[Fig. 6.8: 1 − β plotted against μ, with curves for N = 64 and N = 1024]
Figure 6.7 presents the values of β plotted against different values for μ. This
curve is referred to as the “operating characteristic” curve. What is interesting is the
plot of 1 – β against different values of μ, which is presented in Fig. 6.8.
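The values behind both curves can be tabulated with a short function. The sketch below is illustrative only: the function name `beta_for` and the grid of μ values are assumptions, and the sample sizes 64 and 1024 are taken from the labels in the figure.

```python
from statistics import NormalDist

def beta_for(mu_a, n, mu0=50.0, sigma=8.0, alpha=0.05):
    """Probability of accepting H0: mu = mu0 when the true mean is mu_a."""
    std_err = sigma / n ** 0.5
    z = NormalDist().inv_cdf(1 - alpha / 2)
    lower, upper = mu0 - z * std_err, mu0 + z * std_err
    dist = NormalDist(mu=mu_a, sigma=std_err)
    return dist.cdf(upper) - dist.cdf(lower)

# Tabulate beta (operating characteristic) and 1 - beta (power).
for mu_a in (46.0, 48.0, 49.5, 50.0, 50.5, 52.0, 54.0):
    b64 = beta_for(mu_a, 64)
    b1024 = beta_for(mu_a, 1024)
    print(f"mu = {mu_a:5.1f}  beta(64) = {b64:.4f}  "
          f"power(64) = {1 - b64:.4f}  power(1024) = {1 - b1024:.4f}")
```

For any alternate value other than 50, the larger sample gives a smaller β and hence a more powerful test.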
The hypothesis test is a very important aspect of statistical inference. The following
steps summarize the process involved in the hypothesis test:
1. State the null and alternate hypotheses with respect to the desired parameter (in
this case, the parameter is the population mean, μ).
E.g., H0 : μ0 = 50 μm
HA : μ0 ≠ 50 μm
2. Identify the sample statistic that is to be used for testing the null hypothesis. (In
this case, the sample statistic is X.)
3. Identify the sampling distribution of the sample statistic (normal distribution in
this case).
4. Decide on the value of α (0.05 in the above example).
5. Decide on the sample size (64 in the example).
6. Evolve the decision rule by calculating the acceptance/rejection region based
on α and the sampling distribution (e.g., accept if X falls between 48.04 and
51.96 μm, reject otherwise).
(Notice that the above steps can be completed even before the data is
collected.)
7. Draw the conclusions regarding the null hypothesis based on the decision rule.
There are four different ways to test the null hypothesis, and all the four are
conceptually similar. The methods described below use numbers drawn from an
earlier example in our discussion. We also reference the interval of ±1.96 in relation
to a confidence level of 95%.
H0 : μ0 = 50 μm
HA : μ0 ≠ 50 μm
σ = 8 μm
n = 64
σX = σ/√n = 8/√64 = 1 μm
X = 47.8 μm
1. The first method is to build an interval around μ (based on α) and test whether
the sample statistic, X, falls within this interval.
Example: The interval is 50 ± 1.96 (1) = 48.04 to 51.96, and since X falls
outside these limits, we reject the null hypothesis.
2. The second method is similar to estimating the confidence interval. Recall that
when α is set at 0.05, the probability of accepting the null hypothesis when it is
actually true is 1 – α = 0.95. Using the value of X, we can build a confidence
interval corresponding to (1 – α) and test whether the hypothesized value of μ
falls within these limits.
Example: The confidence interval corresponding to (1–0.05) = 0.95 is
47.8 ± 1.96 (1) = 45.84 to 49.76. Since the hypothesized value of μ, 50 μm,
falls outside these limits, we reject the null hypothesis.
3. Since the sampling distribution of X is a normal distribution, convert the values
into standard normal, and compare with the z-value associated with α. It is
important to note that the comparison is done with absolute values, since this
is a two-sided hypothesis test.
Example: The standard normal variate corresponding to X is
$\frac{\bar{X} - \mu}{\sigma_{\bar{X}}} = \frac{47.8 - 50}{1} = -2.2$. The absolute value 2.2 is greater than 1.96 and hence we reject
the null hypothesis. This process is shown in Fig. 6.9.
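[Fig. 6.9: standard normal distribution with rejection regions in both tails; the calculated value of −2.2 lies in the left rejection region.]
4. The fourth method uses the p-value.
[Fig. 6.10: sampling distribution of X under the null hypothesis; the area to the left of 47.8 is 0.0139.]
Let us consider the sampling distribution of X shown in Fig. 6.10. It should be noted
that this distribution is drawn based on the assumption that the null hypothesis is
true.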
(a) First, locate the X value of 47.8 μm on the x-axis, and note that its value is to
the left of μ = 50.
(b) Calculate the area under the curve to the left of 47.8 μm (if the X value happened
to be to the right of μ, we would have calculated the area under the curve to the right).
This area, highlighted in red, turns out to be 0.0139. This is the probability of
committing an error, if we reject the null hypothesis based on any value of X
less than or equal to 47.8.
(c) Since this is a two-sided test, we should be willing to reject the null hypothesis
for any value of X which is greater than or equal to 52.2 μm (which is
equidistant to μ on the right side). Because of the symmetry of a normal
distribution, that area also happens to be exactly equal to 0.0139.
(d) In other words, if we reject the null hypothesis based on the X value of 47.8, the
probability of a type I error is 0.0139 + 0.0139 = 0.0278. This value is called
the p-value, and it is different from α because α is determined by the decision
maker, while p-value is calculated based on the value of X.
(e) In this example, we have taken the value of α (which is the probability of
committing a type I error) as 0.05, indicating that we are willing to tolerate
an error level up to 0.05. The p-value is much less than 0.05, which implies
that this error level, while rejecting the null hypothesis, is less than what we are
willing to tolerate. In other words, we are willing to commit an error level up to
0.05 while the actual error level being committed is only 0.0278 (the p-value).
Hence the null hypothesis can be rejected. In general, the null hypothesis can be
rejected if the p-value is less than α.
(f) Since p = 0.0278 < α = 0.05, we reject the null hypothesis.
Note that the conclusion is the same for all four methods.
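The four methods can be placed side by side in code to confirm that they agree. The following is a minimal sketch using the numbers of the example; the variable names are illustrative.

```python
from statistics import NormalDist

mu0, sigma, n, alpha = 50.0, 8.0, 64, 0.05
x_bar = 47.8                                   # observed sample mean
std_err = sigma / n ** 0.5                     # 1 micrometre
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96

# Method 1: does the sample mean fall outside the interval around mu0?
reject_1 = not (mu0 - z_crit * std_err <= x_bar <= mu0 + z_crit * std_err)

# Method 2: does mu0 fall outside the confidence interval around the sample mean?
reject_2 = not (x_bar - z_crit * std_err <= mu0 <= x_bar + z_crit * std_err)

# Method 3: compare the absolute standard normal variate with the critical value.
z_calc = (x_bar - mu0) / std_err
reject_3 = abs(z_calc) > z_crit

# Method 4: compare the two-sided p-value with alpha.
p_value = 2 * NormalDist().cdf(-abs(z_calc))
reject_4 = p_value < alpha

print(f"z = {z_calc:.2f}, p-value = {p_value:.4f}")
print(reject_1, reject_2, reject_3, reject_4)
```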
The example discussed earlier involves a rejection region under both the tails of the
distribution and the α value to be distributed equally between the two tail regions.
Sometimes, the rejection region is only under one of the two tails and is referred
to as a one-tailed test. Consider the carrying capacity of a bridge. If the bridge is
designed to withstand a weight of 1000 tons but its actual capacity is more than
1000 tons, there is no reason for alarm. On the other hand, if the actual capacity of
the bridge is less than 1000 tons, then there is cause for alarm. The null and alternate
hypotheses in this case are set up as follows:
H0 : μ0 ≤ 1000 tons (bridge designed to withstand weight of 1000 tons or less).
HA : μ0 > 1000 tons (bridge designed to withstand weight > 1000 tons).
The bridge is subjected to a stress test at 25 different locations, and the average
(X) turned out to be 1026.4 tons. There is no reason to reject the null hypothesis for
any value of X less than 1000 tons. On the other hand, the null hypothesis can be
rejected only if X is significantly larger than 1000 tons.
The rejection region for this null hypothesis is shown in Fig. 6.11. The entire
probability (α) is under the right tail only. C is the critical value such that if the value
of X is less than C, the null hypothesis is not rejected, and the contractor is asked
to strengthen the bridge. If the value of X is greater than C, the null hypothesis is
rejected, traffic is allowed on the bridge, and the contractor is paid. The consequence
of a Type I error here is that traffic is allowed on an unsafe bridge leading to loss
of life. The decision maker in this case would like to select a very small value for α.
Let us consider a value of 0.005 for α.
If the population standard deviation is known, we can calculate C by using the
formula
\[ C = \mu_0 + z\frac{\sigma}{\sqrt{n}}, \quad \text{where } z = 2.575 \text{ corresponding to } \alpha = 0.005. \]
The population standard deviation is not known in this particular case, and
hence we calculate C using the sample standard deviation s and the t-distribution. The
formula is $C = \mu_0 + 2.797\frac{s}{\sqrt{25}}$, where s is the sample standard deviation and 2.797
is the t-value corresponding to 24 degrees of freedom and α = 0.005. The sample
standard deviation is s = 60 tons, and the standard error is $60/\sqrt{25} = 12$ tons, so
C = 1000 + 2.797(12) = 1033.56 tons.
Two aspects need to be noted here. First, traffic is allowed on the bridge only if
X is greater than 1033.56 tons, even though the required capacity is only 1000 tons.
The additional margin of 33.56 tons is to account for the uncertainty associated
with the sample statistics and playing it safe. If we reduce α further, the safety
margin that is required will correspondingly increase. In the results mentioned
earlier, X = 1026.4 tons. Based on our value of C, the null hypothesis is not rejected,
and the contractor is asked to strengthen the bridge. We should also note that the null
hypothesis is set to μ0 ≤ 1000 tons in favor of a stronger bridge and erring on the
safer side. In fact, it would actually be favoring the contractor (albeit at the cost of
safety) if the null hypothesis was set to μ0 ≥ 1000 tons.
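The critical value C for the bridge example can also be computed numerically rather than read from a table. The sketch below uses only the standard library, so the t-distribution is handled by two helper functions, `t_cdf` and `t_quantile`, which are illustrative and not from the text (a statistics package would normally supply them). The quantile it computes for 24 degrees of freedom is about 2.797, giving C of roughly 1033.56 tons.

```python
import math

def t_cdf(x, df, steps=2000):
    """P(T <= x) for Student's t, by Simpson's rule on [0, |x|] plus symmetry."""
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda u: const * (1 + u * u / df) ** (-(df + 1) / 2)
    a = abs(x)
    h = a / steps                      # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3               # P(0 <= T <= |x|)
    return 0.5 + area if x >= 0 else 0.5 - area

def t_quantile(p, df):
    """Value x with t_cdf(x, df) = p, found by bisection on [-50, 50]."""
    lo, hi = -50.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

mu0, s, n, alpha = 1000.0, 60.0, 25, 0.005
x_bar = 1026.4                              # average from the 25 stress tests
std_err = s / math.sqrt(n)                  # 12 tons
t_crit = t_quantile(1 - alpha, n - 1)       # right-tail critical value, 24 df
C = mu0 + t_crit * std_err

print(f"t_crit = {t_crit:.3f}, C = {C:.2f}")
print("reject H0" if x_bar > C else "do not reject H0")
```

Because the observed average of 1026.4 tons lies below C, the null hypothesis is not rejected.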
We can extend the concept of the hypothesis test for μ to a similar hypothesis test
for the population proportion, π. Swetha (see the section on confidence intervals)
is concerned about the occupancy of her restaurant. She noticed that sometimes,
the tables are empty, while other times there is a crowd of people in the lobby area
outside, waiting to be seated. Obviously, empty tables do not contribute to revenue,
but she is also worried that some of the people who are made to wait might go off
to other restaurants nearby. She believes it is ideal that people only wait to be seated
20% of the time. Note that this will be her null hypothesis. She decides to divide the
lunch period into four half-hour slots and observe how many customers, if any, wait
to be seated during each time slot. She collected this data for 25 working days, thus
totalling 100 observations (25 days × 4 lunch slots/day). She decides on the value
of α = 0.05.
The null hypothesis is as follows:
H0 : π0 = 0.2 (people wait to be seated 20% of the time).
HA : π0 ≠ 0.2.
The corresponding sample statistic is the sample proportion, p. Swetha found
that 30 out of the 100 time slots had people waiting to be seated. Since we develop
the decision rule based on the assumption that the null hypothesis is actually true,
the standard error of p is
\[ \sigma_p = \sqrt{\frac{0.2\,(1 - 0.2)}{100}} = 0.04. \]
The p-value for this example is shown in Fig. 6.12. Note that the probability that
p is greater than 0.3 is 0.0062 (and consequently the probability that p is less than
0.1 is also 0.0062, by symmetry). Thus, the p-value is 0.0062 × 2 = 0.0124. This is
considerably lower than the 5% that Swetha had decided. Hence, the null hypothesis
is rejected, and Swetha will have to reorganize the interior to create more seating
space so that customers do not leave for the competition.
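Swetha's test of the proportion can be reproduced with the normal approximation used above. This is a minimal sketch with illustrative variable names; the standard error is computed under the null hypothesis, as in the text.

```python
from statistics import NormalDist

pi0 = 0.2          # hypothesized proportion of slots with customers waiting
n = 100            # observed time slots (25 days x 4 slots per day)
waiting = 30       # slots in which customers were waiting
alpha = 0.05

p_hat = waiting / n
std_err = (pi0 * (1 - pi0) / n) ** 0.5        # standard error under H0
z_calc = (p_hat - pi0) / std_err
p_value = 2 * (1 - NormalDist().cdf(abs(z_calc)))   # two-sided test

print(f"std_err = {std_err:.2f}, z = {z_calc:.2f}, p-value = {p_value:.4f}")
print("reject H0" if p_value < alpha else "do not reject H0")
```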
[Fig. 6.12: sampling distribution of p under the null hypothesis (π0 = 0.2), with an area of 0.0062 in each tail, below 0.1 and above 0.3.]
There are many situations where one has to make comparisons between different
populations with respect to their means, proportions, or variances. This is where the
real usefulness of statistical inference becomes apparent. First, we will look at a
very simple situation where the data is collected as a pair of observations from the
same respondents. This is called paired-observation comparisons.
HA : μD0 > 0.
The average of $D_i$, $\bar{D}$, was 0.52 h, and the standard deviation was 1.8 h.
The standard error is $s_{\bar{D}} = 1.8/\sqrt{16} = 0.45$ h. The p-value is calculated as
$P(\bar{D} > 0.52 \mid \mu_{D0} = 0 \text{ and } s_{\bar{D}} = 0.45)$. Since we are using the sample standard
deviation instead of the population standard deviation (σ), we need to calculate the
probability using the t-distribution.
\[ t_{calc} = \frac{\bar{D} - \mu_{D0}}{s_{\bar{D}}} = \frac{0.52 - 0}{0.45} = 1.1556 \quad \text{and} \quad df = 16 - 1 = 15. \]
Using the t-distribution, the p-value is calculated as 0.1330 which is large enough
for us not to reject the null hypothesis. Thus, the conclusion is that the training
program was not effective.
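The one-sided p-value for the paired observations can be checked numerically. The sketch below integrates the t density with Simpson's rule because the standard library has no t-distribution; the helper `t_cdf` is an illustrative assumption, not a function from the text.

```python
import math

def t_cdf(x, df, steps=2000):
    """P(T <= x) for Student's t, by Simpson's rule on [0, |x|] plus symmetry."""
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda u: const * (1 + u * u / df) ** (-(df + 1) / 2)
    a = abs(x)
    h = a / steps                      # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3               # P(0 <= T <= |x|)
    return 0.5 + area if x >= 0 else 0.5 - area

d_bar, s_d, n = 0.52, 1.8, 16          # mean and std. dev. of the differences
std_err = s_d / math.sqrt(n)           # 0.45 h
t_calc = (d_bar - 0) / std_err         # hypothesized mean difference is 0
p_value = 1 - t_cdf(t_calc, n - 1)     # right-tail area with 15 df

print(f"t = {t_calc:.4f}, df = {n - 1}, p-value = {p_value:.4f}")
```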
Let us consider two samples which are drawn independently from two neighbourhoods
in Hyderabad, India. The first sample refers to the daily profit from a
restaurant in Jubilee Hills and the second refers to the daily profit for the same
restaurant in Hitech City. Swetha, who owns both these restaurants, wants to test
if there is a difference between the two population means. The null and alternate
hypotheses are:
H0 : μ1 = μ2 → μ1 – μ2 = 0.
HA : μ1 ≠ μ2 → μ1 – μ2 ≠ 0.
The original null hypothesis of H0 : μ1 = μ2 is restated as μ1 – μ2 = 0,
because (μ1 – μ2) can be estimated by $\bar{X}_1 - \bar{X}_2$ and the sampling distribution
of $\bar{X}_1 - \bar{X}_2$ is normal with mean (μ1 – μ2) and standard error $\sigma_{\bar{X}_1 - \bar{X}_2}$. When
the standard deviations of the populations are known, $\sigma_{\bar{X}_1 - \bar{X}_2}$ is calculated
as $\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$. The hypothesis can be simply tested by using the standard normal
distribution, as in the case of a single sample, that is,
\[ z_{calc} = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sigma_{\bar{X}_1 - \bar{X}_2}}, \]
and comparing it with the z-value corresponding to α, obtained from
the standard normal tables.
When σ1 and σ2 are not known, we need to use the sample standard deviations
s1 and s2 to calculate $s_{\bar{X}_1 - \bar{X}_2}$ and use the t-distribution instead of z. Since
σ1 and σ2 are not known, calculating $s_{\bar{X}_1 - \bar{X}_2}$ becomes somewhat involved. The
calculation of $s_{\bar{X}_1 - \bar{X}_2}$ depends on whether σ1 = σ2 or not. The formulae are
presented below:
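1. If σ1 ≠ σ2, then $s_{\bar{X}_1 - \bar{X}_2}$ is calculated as $\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$, and the associated degrees
of freedom is calculated using the formula
\[ df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1 - 1}\left(\frac{s_1^2}{n_1}\right)^2 + \frac{1}{n_2 - 1}\left(\frac{s_2^2}{n_2}\right)^2}. \]
2. If σ1 = σ2, then $s_{\bar{X}_1 - \bar{X}_2}$ is calculated using the pooled standard deviation $s_p$ as
\[ s_{\bar{X}_1 - \bar{X}_2} = s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}. \]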
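\[ H_0: \sigma_1^2 = \sigma_2^2 \;\rightarrow\; \frac{\sigma_1^2}{\sigma_2^2} = 1 \]
\[ H_A: \sigma_1^2 \neq \sigma_2^2 \;\rightarrow\; \frac{\sigma_1^2}{\sigma_2^2} \neq 1. \]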
As we had done earlier, the original null hypothesis, $\sigma_1^2 = \sigma_2^2$, is transformed
as $\sigma_1^2/\sigma_2^2 = 1$ so that the ratio of the sample variances can be used as the test
statistic. This null hypothesis can be tested using an “F” distribution. The random
variable “F” is defined as
\[ F = \frac{\chi_1^2/(n_1 - 1)}{\chi_2^2/(n_2 - 1)} = \frac{s_1^2/\sigma_1^2}{s_2^2/\sigma_2^2}. \]
$\bar{X}_1$ = ₹13,750; $\bar{X}_2$ = ₹14,110; $s_1^2$ = 183,184 ₹²; $s_2^2$ = 97,344 ₹²
It can be seen that the above expression for $s_p^2$ is nothing but the weighted average
of the two sample variances, with the respective degrees of freedom as weights. The
degrees of freedom associated with $s_p^2$ is the total number of observations minus 2,
that is, $n_1 + n_2 - 2$.
\[ s_{\bar{X}_1 - \bar{X}_2} = \sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = 144.0781 \]
\[ t_{calc} = \frac{(\bar{X}_1 - \bar{X}_2) - 0}{s_{\bar{X}_1 - \bar{X}_2}} = \frac{-360}{144.0781} = -2.4987. \]
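The two-sample calculations can be sketched in code from the summary statistics. The sample sizes are not stated in this passage, so the values n1 = 18 and n2 = 12 below are an assumption, chosen only because they reproduce the standard error of 144.0781 quoted above. The sketch computes the variance ratio F, the equal-variance (pooled) route and, for comparison, the unequal-variance route.

```python
import math

# Summary statistics quoted in the text (daily profit).
x1, x2 = 13750.0, 14110.0
var1, var2 = 183184.0, 97344.0

# ASSUMPTION: sample sizes are not given in this passage; 18 and 12 are
# used because they reproduce the quoted standard error of 144.0781.
n1, n2 = 18, 12

# Test statistic for equality of variances (ratio of sample variances).
f_ratio = var1 / var2

# Equal-variance route: pooled variance weighted by degrees of freedom.
sp2 = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
se_pooled = math.sqrt(sp2 * (1 / n1 + 1 / n2))
t_pooled = (x1 - x2 - 0) / se_pooled
df_pooled = n1 + n2 - 2

# Unequal-variance route: separate variances and approximate degrees of freedom.
a, b = var1 / n1, var2 / n2
se_unequal = math.sqrt(a + b)
df_unequal = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
t_unequal = (x1 - x2 - 0) / se_unequal

print(f"F = {f_ratio:.4f}")
print(f"pooled:  se = {se_pooled:.4f}, t = {t_pooled:.4f}, df = {df_pooled}")
print(f"unequal: se = {se_unequal:.4f}, t = {t_unequal:.4f}, df = {df_unequal:.2f}")
```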
We have just seen the procedure to compare the means of the two populations. In
many situations, we need to compare the means of a number of populations. For
example, Swetha has been receiving complaints that the time taken to collect the
order from the customers is inordinately long in some of her restaurants. She wants
to test whether there is any difference in the average time taken for collecting the
orders in four of her restaurants. In such a situation, we use a technique called
“analysis of variance (ANOVA)” instead of carrying out pairwise comparisons.
Swetha has set up the null and alternate hypotheses as
H0 : μ1 = μ2 = μ3 = μ4 .
HA : At least one μ is different.
The following assumptions are required for ANOVA:
1. All observations are independent of one another and randomly selected from the
respective population.
2. Each of the populations is approximately normal.
3. The variances for each of the populations are approximately equal to one another.
The ANOVA technique estimates the variances through two different routes and
compares them in order to draw conclusions about the population means. Swetha has
collected data from each of the four restaurants and summarized it in Table 6.8.
First we estimate the pooled variance of all the samples, just the way we had
done in the case of comparing two population means. Extending the formula to four
samples, $s_p^2$ is defined as
\[ s_p^2 = \frac{\sum_i (n_i - 1)\, s_i^2}{\sum_i (n_i - 1)}, \quad (i = 1, \ldots, 4) \]
Since one of the assumptions is that $\sigma_1^2 = \sigma_2^2 = \sigma_3^2 = \sigma_4^2 = \sigma^2$, the $s_p^2$ is an
estimate of $\sigma^2$. It may be remembered that the variance of the sample means ($\bar{X}$s) is
$\sigma_{\bar{X}}^2 = \frac{\sigma^2}{n}$ (consequently, $\sigma^2 = n\sigma_{\bar{X}}^2$), and it can be estimated by calculating the
variance of the $\bar{X}$s. The variance of the four sample means can be calculated by the
formula
\[ s_{\bar{X}}^2 = \frac{\sum_i \left(\bar{X}_i - \bar{\bar{X}}\right)^2}{k - 1}, \]
where k is the number of samples and $\bar{\bar{X}}$ is the overall
mean for all the observations. We can get an estimate of $\sigma^2$ by multiplying $s_{\bar{X}}^2$ with
the sample size. Unfortunately, the sample sizes are different and hence the formula
needs to be modified as
\[ \hat{\sigma}^2 = \frac{\sum_i n_i \left(\bar{X}_i - \bar{\bar{X}}\right)^2}{k - 1}, \]
where $\hat{\sigma}^2$ is an estimate of $\sigma^2$ obtained through the “means”
route. If the null hypothesis is true, these two estimates ($\hat{\sigma}^2$ and $s_p^2$) are estimating
the same $\sigma^2$. On the other hand, if the null hypothesis is not true, $\hat{\sigma}^2$ will be
significantly larger than $s_p^2$. The entire hypothesis test boils down to testing to see
if $\hat{\sigma}^2$ is significantly larger than $s_p^2$, using the F distribution (testing for equality
of variances). We always test to see if $\hat{\sigma}^2$ is larger than $s_p^2$ and not the other way,
and hence this F test is always one-sided (one-tailed test on the right side). This
is because $\hat{\sigma}^2$ is a combination of variation arising out of $\sigma^2$ as well as difference
between means, if any.
Using the summary data provided in Table 6.8, we can calculate $s_p^2$ as 10.16
and $\hat{\sigma}^2$ as 32.93. The F value is 32.93/10.16 = 3.24. The corresponding degrees
two, using the relationship between the three of them. Similar relationship exists
between the degrees of freedom corresponding to each source. There are several
excellent books that introduce the reader to statistical inference. We have provided
two such references at the end of this chapter (Aczel & Sounderpandian 2009; Stine
& Foster 2014).
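The two routes to estimating the variance in ANOVA can be sketched as a small function that works from group summaries. Table 6.8 is not reproduced here, so the sizes, means, and variances below are hypothetical values chosen only to illustrate the calculation; the function name is likewise illustrative.

```python
def anova_from_summary(sizes, means, variances):
    """Return (within, between, F, df_between, df_within) from group summaries."""
    k = len(sizes)
    total_n = sum(sizes)
    # Overall mean of all observations (size-weighted mean of group means).
    grand_mean = sum(n * m for n, m in zip(sizes, means)) / total_n
    # Pooled variance: weighted average of the sample variances.
    within = sum((n - 1) * v for n, v in zip(sizes, variances)) / (total_n - k)
    # Variance estimate obtained through the "means" route.
    between = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means)) / (k - 1)
    return within, between, between / within, k - 1, total_n - k

# HYPOTHETICAL summary data for four restaurants (not the data of Table 6.8).
sizes = [10, 12, 8, 10]
means = [12.0, 15.0, 13.0, 14.0]
variances = [9.0, 11.0, 10.0, 10.5]

within, between, f_value, df1, df2 = anova_from_summary(sizes, means, variances)
print(f"pooled variance = {within:.2f}, means-route estimate = {between:.2f}")
print(f"F = {f_value:.2f} with ({df1}, {df2}) degrees of freedom")
```

The resulting F value would then be compared with the right-tail critical value of the F distribution for the given degrees of freedom.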
All the datasets, code, and other material referred to in this section are available at
www.allaboutanalytics.net.
• Data 6.1: ANOVA.csv
• Data 6.2: Confidence_Interval.csv
• Data 6.3: Critical_Value_Sample_Prop.csv
• Data 6.4: Hypothesis_Testing.csv
• Data 6.5: Independent_Samples.csv
• Data 6.6: Paired_Observations.csv
Exercises
Ex. 6.1 Nancy is running a restaurant in New York. She had collected the data
from 100 customers related to average expenditure per customer in US dollars, per
customer eating time in minutes, number of family members, and average waiting
time to be seated in minutes. [Use the data given in Confidence_Interval.csv]
(a) Given the population standard deviation as US$ 480, estimate the average
expenditure per family along with confidence interval at 90%, 95%, and 99%
levels of confidence.
(b) Show that increased sample size can reduce the width of the confidence interval
with a given level of confidence, while, with the given sample size, the width of
the confidence interval increases as we increase the confidence level.
(c) For a width of 150, determine the optimal sample size at 95% level of
confidence.
Ex. 6.2 Nancy wants to estimate the average time spent by the customers and their
variability. [Use the data given in Confidence_Interval.csv]
(a) Using appropriate test-statistic at 90% confidence level, construct the confi-
dence interval for the sample estimate of average time spent by the customer.
(b) Using appropriate test-statistic at 95% confidence level, construct the confi-
dence interval for the sample estimate of variance of time spent by the customer.
Ex. 6.3 The dining tables in the restaurant are only four seaters, and Nancy has
to join two tables together whenever a group or family consisting of more than four
members arrives at the restaurant. [Use the data given in Confidence_Interval.csv]
(a) Using appropriate test-statistic at 99% confidence level, provide the confidence
interval for the proportion of families visiting the restaurant with more than four
members.
(b) Provide the estimate of the maximum sample size for the data to be collected to
have the confidence interval with a precision of ±0.7.
(c) Suppose it is ideal for Nancy to have 22% families waiting for a table
to be seated. Test the hypothesis that the restaurant is managing the
operation as per this norm of waiting time. [Use the data given in
Critical_Value_Sample_Prop.csv]
Ex. 6.4 In an electronic instrument making factory, some parts have been coated
with a specific material with a thickness of 51 μm. The company collected data
from 50 parts to measure the thickness of the coated material. Using various
methods of hypothesis testing, show that null hypothesis that the mean thickness
of coated material is equal to 51 μm is not rejected. [Use the data given in
Hypothesis_Testing.csv]
Ex. 6.5 A multinational bridge construction company was contracted to design
and construct bridges across various locations in London and New York with a
carrying capacity of 1000 tons. The bridge was subjected to a stress test at 50 different
locations. Using an appropriate test statistic at the 1% level of significance, test whether
the bridge can withstand a weight of at least 1000 tons. [Use the data given in
Critical_Value_Sample_Prop.csv]
Ex. 6.6 A multinational company provided training to its workers to perform a certain
production task more efficiently. To test the effectiveness of the training, a sample of
50 workers was taken, and the time they took to perform the same production task
was recorded before and after the training. Using an appropriate test statistic, test the
effectiveness of the training program. [Use the data given in Paired_Observations.csv]
Ex. 6.7 Bob is running two restaurants in Washington and New York. He wants
to compare the daily profits from the two restaurants. Using the appropriate test
statistic, test whether the average daily profits from the two restaurants are equal.
[Use the data given in Independent_Samples.csv]
Ex. 6.8 Paul is running four restaurants in Washington, New York, Ohio, and
Michigan. He has been receiving complaints that the time taken to collect the order
from the customer is inordinately long in some of his restaurants. Test whether
there is any difference in the average time taken for collecting orders in four of
his restaurants. [Use the data given in ANOVA.csv]
Ex. 6.9 Explain the following terms along with their relevance and uses in analysis
and hypothesis testing: [Use data used in Exercises 1–8 as above.]
(a) Standard normal variate
(b) Type I error
(c) Power of test
(d) Null hypothesis
(e) Degree of freedom
(f) Level of confidence
(g) Region of acceptance
(h) Operating characteristics curve
Appendix 1
Chi-square critical values (columns give the upper-tail probability)
df 0.995 0.990 0.975 0.950 0.900 0.100 0.050 0.025 0.010 0.005
30 13.7867 14.9535 16.7908 18.4927 20.5992 40.2560 43.7730 46.9792 50.8922 53.6720
40 20.7065 22.1643 24.4330 26.5093 29.0505 51.8051 55.7585 59.3417 63.6907 66.7660
50 27.9907 29.7067 32.3574 34.7643 37.6886 63.1671 67.5048 71.4202 76.1539 79.4900
60 35.5345 37.4849 40.4817 43.1880 46.4589 74.3970 79.0819 83.2977 88.3794 91.9517
70 43.2752 45.4417 48.7576 51.7393 55.3289 85.5270 90.5312 95.0232 100.4252 104.2149
80 51.1719 53.5401 57.1532 60.3915 64.2778 96.5782 101.8795 106.6286 112.3288 116.3211
90 59.1963 61.7541 65.6466 69.1260 73.2911 107.5650 113.1453 118.1359 124.1163 128.2989
100 67.3276 70.0649 74.2219 77.9295 82.3581 118.4980 124.3421 129.5612 135.8067 140.1695
26 4.23 3.37 2.98 2.74 2.59 2.47 2.39 2.32 2.27 2.22 2.18 2.15 2.12 2.09 2.05 1.99 1.95 1.90
27 4.21 3.35 2.96 2.73 2.57 2.46 2.37 2.31 2.25 2.20 2.17 2.13 2.10 2.08 2.04 1.97 1.93 1.88
28 4.20 3.34 2.95 2.71 2.56 2.45 2.36 2.29 2.24 2.19 2.15 2.12 2.09 2.06 2.02 1.96 1.91 1.87
29 4.18 3.33 2.93 2.70 2.55 2.43 2.35 2.28 2.22 2.18 2.14 2.10 2.08 2.05 2.01 1.94 1.90 1.85
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 2.16 2.13 2.09 2.06 2.04 1.99 1.93 1.89 1.84
References
Aczel, A. D., & Sounderpandian, J. (2009). Complete business statistics. New York: McGraw-Hill.
Stine, R. A., & Foster, D. (2014). Statistics for business: Decision making and analysis (SFB) (2nd
ed.). London: Pearson Education Inc.
Chapter 7
Statistical Methods: Regression Analysis
1 Introduction
Regression analysis is arguably one of the most commonly used and misused statistical
techniques in business and other disciplines. In this chapter, we systematically
develop linear regression modeling of data. Chapter 6 on basic inference is the only
prerequisite for this chapter. We start with a few motivating examples in Sect. 2.
Section 3 deals with the methods and diagnostics for linear regression. Section
3.1 is a discussion on what is regression and linear regression, in particular, and
why it is important. In Sect. 3.2, we elaborate on the descriptive statistics and the
basic exploratory analysis for a data set. We are now ready to describe the linear
regression model and the assumptions made to get good estimates and tests related
to the parameters in the model (Sect. 3.3). Sections 3.4 and 3.5 are devoted to the
development of basic inference and interpretations of the regression with single and
multiple regressors. In Sect. 3.6, we take the help of the famous Anscombe (1973)
data sets to demonstrate the need for further analysis. In Sect. 3.7, we develop the
basic building blocks to be used in constructing the diagnostics. In Sect. 3.8, we
use various residual plots to check whether there are basic departures from the
assumptions and to see if some transformations on the regressors are warranted.
B. Pochiraju
Applied Statistics and Computing Lab, Indian School of Business, Hyderabad, Telangana, India
H. S. S. Kollipara
Indian School of Business, Hyderabad, Telangana, India
e-mail: hemasri.kollipara@gmail.com
2 Motivating Examples
We all have a body fat called adipose tissue. If abdominal adipose tissue area
(AT) is large, it is a potential risk factor for cardiovascular diseases (Despres et al.
1991). The accurate way of determining AT is computed tomography (CT scan).
There are three issues with CT scan: (1) it involves irradiation, which in itself can
be harmful to the body; (2) it is expensive; and (3) good CT equipment is not
available in smaller towns, which may result in a grossly inaccurate measurement
of the area. Is there a way to predict AT using one or more anthropological
Mr. Warner Sr. owns a newspaper publishing house which publishes the daily
newspaper Newsexpress, having an average daily circulation of 500,000 copies. His
son, Mr. Warner Jr. came up with an idea of publishing a special Sunday edition of
Newsexpress. The father is somewhat conservative and said that they can do so if
it is almost certain that the average Sunday circulation (circulation of the Sunday
edition) is at least 600,000.
Mr. Warner Jr. has a friend Ms. Janelia, who is an analytics expert whom
he approached and expressed his problem. He wanted a fairly quick solution.
Ms. Janelia said that one quick way to examine this is to look at data on other
newspapers in that locality which have both daily edition and Sunday edition.
“Based on these data,” said Ms. Janelia, “we can fairly accurately determine the
lower bound for the circulation of your proposed Sunday edition.” Ms. Janelia
exclaimed, “However, there is no way to pronounce a meaningful lower bound with
certainty.”
What does Ms. Janelia propose to do in order to get an approximate lower bound
to the Sunday circulation based on the daily circulation? Are there any assumptions
that she makes? Is it possible to check them?
One of the important considerations both for the customer and manufacturer of
vehicles is the average mileage (miles per gallon of the fuel) that it gives. How
does one predict the mileage? Horsepower, top speed, age of the vehicle, volume,
and percentage of freeway running are some of the factors that influence mileage.
We have data on MPG (miles per gallon), HP (horsepower), and VOL (volume of
cab-space in cubic feet) for 81 vehicles. Do HP and VOL (often called explanatory
variables or regressors) adequately explain the variation in MPG? Does VOL have
explaining capacity of variation in MPG over and above HP? If so, for a fixed HP,
what would be the impact on the MPG if the VOL is decreased by 50 cubic feet? Are
some other important explanatory variables correlated with HP and VOL missing?
Once we have HP as an explanatory variable, is it really necessary to have VOL also
as another explanatory variable?
The sample dataset was inspired by an example in the book Basic Econometrics
by Gujarati and Sangeetha. The dataset “cars.csv” is available on the book’s website.
an approximate formula for the area of the leaf based on its length and breadth which
are relatively easy to measure? The dataset “leaf.csv” is available on the book’s
website.
3 Methods of Regression
Y = f (X1 , . . . , Xk , θ1 , . . . , θr , ε) . (7.1)
Here the functional form f is linear in the regressor, namely, daily circulation. It is
also linear in the parameters, α and β. The error ε has also come into the equation
as an additive term.
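In Example 2.4, notice that a natural functional form of area in terms of the length
and breadth is multiplicative. Thus we may postulate
$$\text{Area} = \alpha \cdot \text{Length}^{\beta_1} \cdot \text{Breadth}^{\beta_2} \cdot \varepsilon. \tag{7.3}$$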
Here the functional form is multiplicative in powers of length and breadth and
the error. It is not linear in the parameters either.
Alternatively, one may postulate
If the functional form, f, in Eq. (7.1), is linear in the parameters, then the regression
is called linear regression. As noted earlier, (7.2) is a linear regression equation.
What about Eq. (7.3)? As already noted, this is not a linear regression equation.
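However, if we make a log transformation on both sides, we get
$$\log(\text{Area}) = \log\alpha + \beta_1 \log(\text{Length}) + \beta_2 \log(\text{Breadth}) + \log\varepsilon,$$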
which is linear in the parameters: log α, β 1 , and β 2 . However, this model is not
linear in length and breadth.
Such a model is called an intrinsically linear regression model.
However, we cannot find any transformation of the model in (7.4) yielding a
linear regression model. Such models are called intrinsically nonlinear regression
models.
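The idea of an intrinsically linear model can be sketched in a few lines of Python. The sketch below is a simplified one-regressor analogue of the leaf example, y = α·x^β, and it uses made-up, noise-free numbers (not the leaf data); the helper name fit_line is our own. Taking logs on both sides turns the power model into a straight line in log x, which ordinary least squares can fit.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Hypothetical noise-free data generated from y = 3 * x^1.5 (an assumption).
x = [1.0, 2.0, 4.0, 8.0]
y = [3.0 * v ** 1.5 for v in x]

# After the log transformation the model is linear in log(alpha) and beta.
log_alpha, beta = fit_line([math.log(v) for v in x], [math.log(v) for v in y])
alpha = math.exp(log_alpha)
print(alpha, beta)
```

With real data the error term would not vanish, and the fit would recover α and β only approximately.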
2. To study the impact of one regressor on the response variable keeping other
regressors fixed. For example, one may be interested in the impact of one
additional year of education on the average wage for a person aged 40 years.
3. To verify whether the data support certain beliefs (hypotheses)—for example,
whether β 1 = β 2 = 1 in the Eq. (7.3) which, if upheld, would mean that the leaf
is more or less rectangular.
4. To use as an intermediate result in further analysis.
5. To calibrate an instrument.
Linear regression has become popular for the following reasons:
1. The methodology for linear regression is easily understood, as we shall see in the
following sections.
2. If the response variable and the regressors have a joint normal distribution, then
the regression (as we shall identify in Sect. 3.5, regression is the expectation of
the response variable conditional on the regressors) is a linear function of the
regressors.
3. Even though the model is not linear in the regressors, sometimes suitable
transformations on the regressors or response variable or both may lead to a linear
regression model.
4. The regression may not be a linear function in general, but a linear function may
be a good approximation in a small focused strip of the regressor surface.
5. The methodology developed for the linear regression may also act as a good first
approximation for the methodology for a nonlinear model.
We shall illustrate each of these as we go along.
Analysis of the data starts with the basic descriptive summary of each of the
variables in the data (the Appendix on Probability and Statistics provides the
background for the following discussion). The descriptive summary helps in
understanding the basic features of a variable such as the central tendency, the
variation, and a broad empirical distribution. More precisely, the basic summary
includes the minimum, the maximum, the first and third quartiles, the median, the
mean, and the standard deviation. The minimum and the maximum give us the
range. The mean and the median are measures of central tendency. The range,
the standard deviation, and the interquartile range are measures of dispersion. The
box plot depicting the five-point summary, namely, the minimum, the first quartile,
the median, the third quartile, and the maximum, gives us an idea of the empirical
distribution. We give below these measures for the (WC, AT) data.
From Table 7.1 and Fig. 7.1, it is clear that the distribution of WC is fairly
symmetric, about 91 cm, and the distribution of AT is skewed to the right. We shall
see later how this information is useful.
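As a small illustration of the descriptive summary described above, the following Python sketch computes the listed measures with the standard library. The numbers are hypothetical waist circumferences, not the (WC, AT) data, and the function name describe is our own. Quartile conventions differ between packages; here we assume the "inclusive" method of statistics.quantiles.

```python
import statistics

def describe(values):
    """Return the basic descriptive summary of one variable."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),        # sample standard deviation
        "range": max(values) - min(values),    # measure of dispersion
        "iqr": q3 - q1,                        # interquartile range
    }

# Hypothetical waist circumferences in cm (an assumption, not the book's data).
waist = [70, 75, 80, 85, 90, 95, 100, 105, 110]
summary = describe(waist)
print(summary)
```

When the mean and the median are close and the quartiles are roughly equidistant from the median, as here, the distribution is fairly symmetric.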
Fig. 7.1 Box plots of adipose tissue area and waist circumference
Note on notation: The notation E(Y|X) stands for the expected value of Y given
the value of X. It is computed using P(Y|X), which stands for the conditional
probability given X. For example, we are given the following information: when
X = 1, Y is normally distributed with mean 4 and standard deviation of 1; whereas,
when X = 2, Y is normally distributed with mean 5 and standard deviation of
1.1. Then E(Y|X = 1) = 4 and E(Y|X = 2) = 5. Similarly, the notation Var(Y|X) is
interpreted as the variance of Y given X. In this case Var(Y|X = 1) = 1 and
Var(Y|X = 2) = 1.21.
The objective is to draw inferences (estimation and testing) related to the
parameters $\beta_0, \ldots, \beta_k, \sigma^2$ in the model (7.6) based on the data
$(y_i, x_{i1}, \ldots, x_{ik})$, $i = 1, \ldots, N$, on a sample with N observations
from $(Y, X_1, \ldots, X_k)$. Note that Y is a column vector of size N × 1. The
transpose of a vector Y (or matrix M) is written as $Y^t$ ($M^t$); the transpose of a
column vector is a row vector and vice versa.
In (7.8), by $X = ((x_{ij}))$, we mean that X is a matrix whose element at the junction of
the ith row and jth column is $x_{ij}$. The size of X is N × k. In the matrix Z, 1
denotes the column vector each component of which is the number 1. The vector 1
is appended as the first column and the rest of the columns are taken from X; this
is the notation (1:X). The matrices X and Z are of orders N × k and N × (k+1),
respectively.
We make the following assumption regarding the errors ε1 , . . . ,εN :
$$\varepsilon \mid X \sim N_N\left(0, \sigma^2 I\right) \tag{7.9}$$
that is, the distribution of the errors given X is an N-dimensional normal distribution
with mean zero and covariance matrix equal to $\sigma^2$ times the identity matrix. In (7.9),
I denotes the identity matrix of order N × N. The identity matrix comes about
because the errors are independent of one another, therefore, covariance of one error
with another error equals zero.
From (7.8) and (7.9), we have
$$Y \mid X \sim N_N\left(Z\beta, \sigma^2 I\right).$$
In other words, the regression model Zβ represents the mean value of Y given X.
The errors are around the mean values.
What does the model (7.7) (or equivalently 7.8) together with the assumption
(7.9) mean? It translates into the following:
1. The model is linear in the parameters β 0 , . . . ,β k (L).
2. Errors conditional on the data on the regressors are independent (I).
3. Errors conditional on the data on the regressors have a joint normal
distribution (N).
4. The variance of each of the errors conditional on the data on the regressors is
σ 2 (E).
5. Each of the errors conditional on the data on the regressors has 0 mean.
In (4) above, E stands for equal variance. (5) above is usually called the exogene-
ity condition. This actually implies that the observed covariates are uncorrelated
with the unobserved covariates. The first four assumptions can be remembered
through an acronym: LINE.
Why are we talking about the distribution of $\varepsilon_i \mid X$? Is it not a single number? Let
us consider the adipose tissue example. We commonly notice that different people
having the same waist circumference do not necessarily have the same adipose tissue
area. Thus there is a distribution of the adipose tissue area for people with a waist
circumference of 70 cm. Likewise in the wage example, people with the same age
and education level need not get exactly the same wage.
The above assumptions will be used in drawing the inference on the parameters.
However, we have to check whether the data on hand support the above assumptions.
How does one draw the inferences and how does one check for the validity of
the assumptions? A good part of this chapter will be devoted to this and the
interpretations.
Let us consider Examples 2.1 and 2.2. Each of them has one regressor, namely,
WC in Example 2.1 and daily circulation in Example 2.2. Thus, we have bivariate
data in each of these examples. What type of relationship does the response variable
have with the regressor? We have seen in the previous chapter that covariance or
correlation coefficient between the variables is one measure of the relationship. We
shall explore this later where we shall examine the interpretation of the correlation
coefficient. But we clearly understand that it is just one number indicating the
relationship. However, we do note that each individual (WC, AT) is an ordered pair
and can be plotted as a point in the plane. This plot in the plane with WC as the
X-axis and AT as the Y-axis is called the scatterplot of the data. We plot with the
response variable on the Y-axis and the regressor on the X-axis. For the (WC, AT)
data the plot is given below:
What do we notice from this plot?
1. The adipose tissue area is by and large increasing with increase in waist
circumference.
2. The variation in the adipose tissue area is also increasing with increase in waist
circumference.
The correlation coefficient for this data is approximately 0.82. This tells us the
same thing as (1) above. It also tells that the strength of (linear) relationship between
the two variables is strong, which prompts us to fit a straight line to the data. But by
looking at the plot, we see that a straight line does not do justice for large values of
waist circumference as they are highly dispersed. (More details on this later.) So the
first lesson to be learnt is: If you have a single regressor, first look at the scatterplot.
This will give you an idea of the form of relationship between the response variable
and the regressor. If the graph suggests a linear relationship, one can then check the
correlation coefficient to assess the strength of the linear relationship between the
response variable and the regressor.
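The advice to look at the scatterplot before the correlation coefficient can be illustrated with a short Python sketch. The two small data sets below are invented for illustration (they are not the (WC, AT) data), and the function name correlation is our own. The first set is nearly linear; the second is a perfect parabola, for which the correlation coefficient is zero even though the relationship is exact.

```python
import math

def correlation(x, y):
    """Sample correlation coefficient between x and y."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values (assumptions made for illustration only).
wc = [70, 80, 90, 100, 110]
at = [40, 70, 95, 140, 160]                 # roughly linear in wc
curved = [(v - 90) ** 2 for v in wc]        # exact quadratic relationship

r_linear = correlation(wc, at)
r_curved = correlation(wc, curved)
print(r_linear, r_curved)
```

The correlation coefficient measures only the strength of the linear relationship, which is why the plot has to come first.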
We have the following linear regression model for the adipose tissue problem:
$$\text{Model}: \quad AT = \beta_0 + \beta_1\, WC + \varepsilon \tag{7.10}$$
The model described by (7.12) and (7.13) is a special case of the model described
by (7.7) and (7.9), with k = 1 (the number of regressors in the single regressor
case) and N = 109.
A linear regression model with one regressor is often referred to as a Simple
Linear Regression model.
Estimation of Parameters
From the assumptions it is clear that all the errors (conditional on the Waist
Circumference values) are of equal importance.
If we want to fit a straight line for the scatterplot in Fig. 7.2, we look for that
line for which some reasonable measure of the magnitude of the errors is small. A
straight line is completely determined by its intercept and slope, which we denote
by $\beta_0$ and $\beta_1$, respectively. With this straight line approximation, from (7.12), the
error for the ith observation is $\varepsilon_i = AT_i - \beta_0 - \beta_1 WC_i$, $i = 1, \ldots, 109$.
A commonly used measure for the magnitude of the errors is the sum of their
squares, namely, $\sum_{i=1}^{109} \varepsilon_i^2$. So if we want this measure of the magnitude of the
errors to be small, we should pick the values of $\beta_0$ and $\beta_1$ that minimize
$\sum_{i=1}^{109} \varepsilon_i^2$. This method is called the Method of Least Squares. This is achieved by
solving the equations (often called normal equations):
$$\begin{aligned}
\sum_{i=1}^{109} AT_i &= 109\,\beta_0 + \beta_1 \sum_{i=1}^{109} WC_i \\
\sum_{i=1}^{109} AT_i\, WC_i &= \beta_0 \sum_{i=1}^{109} WC_i + \beta_1 \sum_{i=1}^{109} WC_i^2
\end{aligned} \tag{7.14}$$
[Fig. 7.2: scatterplot of adipose tissue area (Y-axis) against waist circumference (X-axis)]
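Solving the two equations in (7.14) simultaneously, we get the estimators $\hat\beta_0, \hat\beta_1$
of $\beta_0, \beta_1$, respectively, as
$$\begin{aligned}
\hat\beta_1 &= \frac{cov(AT, WC)}{V(WC)} = \frac{\sum_{i=1}^{109} AT_i\, WC_i - \left(\sum_{i=1}^{109} AT_i\right)\left(\sum_{i=1}^{109} WC_i\right)/109}{\sum_{i=1}^{109} WC_i^2 - \left(\sum_{i=1}^{109} WC_i\right)^2/109} \\
\hat\beta_0 &= \overline{AT} - \hat\beta_1\, \overline{WC}
\end{aligned} \tag{7.15}$$
The estimated regression equation, or the fitted model, is
$$\widehat{AT} = \hat\beta_0 + \hat\beta_1\, WC. \tag{7.16}$$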
The predicted value (often called the fitted value) of the ith observation is given
by
$$\widehat{AT}_i = \hat\beta_0 + \hat\beta_1\, WC_i. \tag{7.17}$$
Notice that this is the part of the adipose tissue area for the ith observation
explained by our fitted model.
The part of the adipose tissue area of the ith observation, not explained by
our fitted model, is called the Pearson residual, henceforth referred to as residual
corresponding to the ith observation, and is given by
$$e_i = AT_i - \widehat{AT}_i. \tag{7.18}$$
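The least squares computations for a single regressor can be sketched as follows. The function mirrors the sums-based form of the estimators in (7.15); the five data points are hypothetical (not the 109 observations of the example), and the function name is our own.

```python
def fit_simple_regression(x, y):
    """Least-squares intercept and slope from the sums used in (7.15)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    slope = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x ** 2 / n)
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope

# Hypothetical waist circumference (cm) and adipose tissue area (sq. cm).
wc = [70, 80, 90, 100, 110]
at = [40, 70, 95, 140, 160]

b0, b1 = fit_simple_regression(wc, at)
fitted = [b0 + b1 * v for v in wc]                          # explained part
residuals = [obs - fit for obs, fit in zip(at, fitted)]     # unexplained part
print(b0, b1, residuals)
```

A useful check on any such fit is that the residuals sum to zero (up to rounding), which is exactly what the first normal equation requires.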
$$\hat\sigma^2 = \frac{R_0^2}{107}. \tag{7.19}$$
Here $R_0^2 = \sum_{i=1}^{109} e_i^2$ denotes the sum of squares of the residuals.
The data has 109 degrees of freedom. Since two parameters, $\beta_0$ and $\beta_1$, are estimated,
2 degrees of freedom are lost and hence the effective sample size is 107. That is the
reason for the denominator in (7.19). In the regression output produced by R (shown
below), the square root of $\hat\sigma^2$ is called the residual standard error, denoted as $s_e$.
Coefficient of Determination
How good is the fitted model in explaining the variation in the response variable, the
adipose tissue area? The variation in the adipose tissue area can be represented by
$\sum_{i=1}^{109} \left(AT_i - \overline{AT}\right)^2$. As we have seen above, the variation in adipose tissue area not
explained by our model is given by $R_0^2$. Hence the part of variation in the adipose
tissue area that is explained by our model is given by
$$\sum_{i=1}^{109} \left(AT_i - \overline{AT}\right)^2 - R_0^2. \tag{7.20}$$
Thus the proportion of the variation in the adipose tissue area that is explained
by our model is
$$\frac{\sum_{i=1}^{109} \left(AT_i - \overline{AT}\right)^2 - R_0^2}{\sum_{i=1}^{109} \left(AT_i - \overline{AT}\right)^2} = 1 - \frac{R_0^2}{\sum_{i=1}^{109} \left(AT_i - \overline{AT}\right)^2}. \tag{7.21}$$
Let us recall that $R^2$ is the proportion of variation in the response variable that
is explained by the model.
In the case of a single regressor, one can show that $R^2$ is the square of
the correlation coefficient between the response variable and the regressor (see
Exercise 7.2). This is the reason for saying that the correlation coefficient is a
measure of the strength of a linear relationship between the response variable and
the regressor (in the single regressor case).
However, the above two are the extreme cases. For almost all practical data sets,
$0 < R^2 < 1$. Should we be elated when $R^2$ is large or should we be necessarily
depressed when it is small? The fact is that $R^2$ is just one measure of fit. We shall come
back to this discussion later (see also Exercise 7.2).
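The quantities of the last two subsections can be computed together. The sketch below, on hypothetical data and with names of our own choosing, returns the residual sum of squares, the estimate of the error variance with n − 2 in the denominator, the residual standard error, and the proportion of variation explained. It also lets us check numerically that, with a single regressor, this proportion equals the square of the correlation coefficient.

```python
import math

def simple_fit_summary(x, y):
    """Residual sum of squares, variance estimate and R-squared for one regressor."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)          # total variation in the response
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    return {
        "rss": rss,
        "sigma2_hat": rss / (n - 2),               # two parameters are estimated
        "se": math.sqrt(rss / (n - 2)),            # residual standard error
        "r_squared": 1 - rss / syy,
        "r": sxy / math.sqrt(sxx * syy),
    }

# Hypothetical data (assumption): waist circumference and adipose tissue area.
wc = [70, 80, 90, 100, 110]
at = [40, 70, 95, 140, 160]
out = simple_fit_summary(wc, at)
print(out)
```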
Prediction for a New Observation
For a new individual whose waist circumference is available, say, WC = x0 cm,
how do we predict his abdominal adipose tissue? This is done by using the formula
(7.16). Thus the predicted value of the adipose tissue for this person is
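$$\widehat{AT} = \hat\beta_0 + \hat\beta_1\, x_0 \ \text{sq.cm}, \tag{7.22}$$
with the standard error of prediction given by
$$s_1 = s_e \sqrt{1 + \frac{1}{109} + \frac{\left(x_0 - \overline{WC}\right)^2}{\sum_{i=1}^{109} \left(WC_i - \overline{WC}\right)^2}}. \tag{7.23}$$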
The average value of the adipose tissue area for all the individuals with
WC = $x_0$ cm is also estimated by the formula (7.22), with the standard error given
by
$$s_2 = s_e \sqrt{\frac{1}{109} + \frac{\left(x_0 - \overline{WC}\right)^2}{\sum_{i=1}^{109} \left(WC_i - \overline{WC}\right)^2}}. \tag{7.24}$$
Notice the difference between (7.23) and (7.24). Clearly (7.23) is larger than
(7.24). This is not surprising because the variance of an observation is larger than
that of the average as seen in the Chap. 6, Statistical Methods—Basic Inferences.
Why does one need the standard-error-formulae in (7.23) and (7.24)? As we see in
Chap. 6, these are useful in obtaining the prediction and confidence intervals. Also
note that the confidence interval is a statement of confidence about the true line—
because we only have an estimate of the line. See (7.28) and (7.29) for details.
Testing of Hypotheses and Confidence Intervals
Consider the model (7.10). If β 1 = 0, then there is no linear relationship between
AT and WC. Thus testing
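$$H_0: \beta_1 = 0 \quad \text{against} \quad H_1: \beta_1 \neq 0 \tag{7.25}$$
amounts to testing whether WC is useful in predicting AT through this model. The
test is based on the statistic
$$\frac{\hat\beta_1}{\mathrm{S.E.}\left(\hat\beta_1\right)}. \tag{7.26}$$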
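A 95% confidence interval for $\beta_1$ is given by
$$\left(\hat\beta_1 - t_{107,0.025}\, \mathrm{S.E.}\left(\hat\beta_1\right),\ \hat\beta_1 + t_{107,0.025}\, \mathrm{S.E.}\left(\hat\beta_1\right)\right), \tag{7.27}$$
where $t_{107,0.025}$ stands for the 97.5th percentile value of the Student's t distribution
with 107 degrees of freedom.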
Similarly, a 95% confidence interval for the average value of the adipose tissue
for all individuals having waist circumference equal to $x_0$ cm is given by
$$\left(\widehat{AT} - t_{107,0.025}\, s_2,\ \widehat{AT} + t_{107,0.025}\, s_2\right). \tag{7.28}$$
Also, a 95% prediction interval for the adipose tissue for an individual having
waist circumference equal to $x_0$ cm is given by
$$\left(\widehat{AT} - t_{107,0.025}\, s_1,\ \widehat{AT} + t_{107,0.025}\, s_1\right). \tag{7.29}$$
From the expressions for $s_i$, i = 1, 2 (as in 7.23 and 7.24), it is clear that the widths
of the confidence and prediction intervals are smallest when $x_0 = \overline{WC}$ and get larger
and larger as $x_0$ moves farther away from $\overline{WC}$. This is the reason why it is said that
the prediction becomes unreliable if you try to predict the response variable value
for a regressor value outside the range of the regressor values. The same goes for
the estimation of the average response variable value.
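The behaviour of the two intervals can be seen in a small Python sketch. It uses the standard single-regressor expressions for the standard error of the estimated average and of an individual prediction, applied to hypothetical data with five observations. The critical value is passed in as a parameter, because the standard library has no t quantile function; 3.182 is the approximate table value for 3 degrees of freedom. All names are our own.

```python
import math

def intervals_at(x, y, x0, t_crit):
    """Confidence and prediction intervals at x0 for a one-regressor fit."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    se = math.sqrt(rss / (n - 2))              # residual standard error
    core = 1 / n + (x0 - xbar) ** 2 / sxx
    s2 = se * math.sqrt(core)                  # standard error of the estimated average
    s1 = se * math.sqrt(1 + core)              # standard error of an individual prediction
    centre = intercept + slope * x0
    return {
        "estimate": centre,
        "s1": s1,
        "s2": s2,
        "confidence": (centre - t_crit * s2, centre + t_crit * s2),
        "prediction": (centre - t_crit * s1, centre + t_crit * s1),
    }

# Hypothetical data (assumption), n = 5, so 5 - 2 = 3 degrees of freedom.
wc = [70, 80, 90, 100, 110]
at = [40, 70, 95, 140, 160]
T_CRIT = 3.182   # approximate 97.5th percentile of t with 3 degrees of freedom

at_mean = intervals_at(wc, at, 90, T_CRIT)     # x0 equal to the mean of wc
far_out = intervals_at(wc, at, 150, T_CRIT)    # x0 well outside the observed range
print(at_mean["confidence"], far_out["confidence"])
```

Both intervals are narrowest at the mean of the regressor and widen as x0 moves away from it, and the prediction interval is always the wider of the two.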
Let us now consider the linear regression output in R for regressing AT on WC
and interpret the same.
> model<-lm(AT ~ Waist, data=wc_at)
> summary(model)
Call:
lm(formula = AT ~ Waist, data = wc_at)
Residuals:
Min 1Q Median 3Q Max
-107.288 -19.143 -2.939 16.376 90.342
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -215.9815 21.7963 -9.909 <2e-16 ***
Waist 3.4589 0.2347 14.740 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
part of the variation in the response variable is explained by the fitted model
if the value is large. How does one judge how large is large? Statistically this
statistic has an F distribution with the numerator and denominator degrees of
freedom under the null hypothesis of the ineffectiveness of the model. In the
single regressor case, the F-statistic is the square of the t-statistic. In this case,
both the t-test for the significance of the regression coefficient estimate of WC
and the F-statistic test for the same thing, namely, whether WC is useful for
predicting AT (or, equivalently, in explaining the variation in AT) through the
model under consideration.
7. Consider the adult males in the considered population with a waist
circumference of 100 cm. We want to estimate the average abdominal
adipose tissue area of these people. Using (7.30), the point estimate is
−215.9815 + 3.4589 × 100 = 129.9085 square centimeters.
8. Consider the same situation as in point 7 above. Suppose we want a 95%
confidence interval for the average abdominal adipose tissue area of all
individuals with waist circumference equal to 100 cm. Using the formula (7.28),
we have the interval [122.5827, 137.2262]. Now consider a specific individual
whose waist circumference is 100 cm. Using the formula (7.29), we have
the 95% prediction interval for this individual’s abdominal adipose tissue as
[63.94943, 195.8595].
9. In point 7 above, if the waist circumference is taken as 50 cm, then
using (7.30), the estimated average adipose tissue area turns out to be
−215.9815 + 3.4589 × 50 = −43.0365 square centimeters, which is absurd.
Where is the problem? The model is constructed for the waist circumference in
the range (63.5 cm, 119.90 cm). The formula (7.30) is applicable for estimating
the average adipose tissue area when the waist circumference is in the range of
waist circumference used in the estimation of the regression equation. If one
goes much beyond this range, then the confidence intervals and the prediction
intervals as constructed in the point above will be too large for the estimation
or prediction to be useful.
10. Testing whether WC is useful in predicting the abdominal adipose tissue area
through our model is equivalent to testing the null hypothesis, β 1 = 0. This
can be done using (7.26) and this is already available in the output before Fig.
7.3. The corresponding p-value is less than 2 × 10^−16. This means that if β 1 were
in fact 0, a t-value at least as extreme as the one observed would occur with
probability less than 2 × 10^−16. Thus we can safely reject the null hypothesis and
declare that WC is useful for predicting AT as per this model.
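The arithmetic behind several of the points above can be reproduced directly from the coefficient table in the R output. The sketch below only re-uses the reported (rounded) estimates and standard error, so small differences from R's own t value are due to rounding; the variable and function names are our own.

```python
# Estimates and standard error as reported (rounded) in the R output above.
intercept, slope = -215.9815, 3.4589
slope_se = 0.2347

def estimated_at(waist_cm):
    """Estimated average adipose tissue area (sq. cm) for a given waist (cm)."""
    return intercept + slope * waist_cm

at_100 = estimated_at(100)     # waist value inside the observed range
at_50 = estimated_at(50)       # waist value far below the observed range

t_slope = slope / slope_se     # t statistic for the slope
f_from_t = t_slope ** 2        # single regressor case: F is the square of t

print(at_100, at_50, t_slope, f_from_t)
```

The negative value at 50 cm is a reminder that the fitted line should not be used far outside the range of waist circumferences on which it was estimated.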
While we used this regression model of AT on Waist, this is not an appropriate
model since the equal variance assumption is violated. For a suitable model for this
problem, see Sect. 3.16.
Let us consider the example in Sect. 2.3. Here we have two regressors, namely, HP
and VOL. As in the single regressor case, we can first look at the summary statistics
and the box plots to understand the distribution of each variable (Table 7.2 and
Fig. 7.4).
From the above summary statistics and plots, we notice the following:
(a) The distribution of MPG is slightly left skewed.
(b) The distribution of VOL is right skewed and there are two points which are far
away from the center of the data (on this variable), one to the left and the other
to the right.
(c) The distribution of HP is heavily right skewed.
Does point (a) above indicate a violation of the normality assumption (7.9)? Not
necessarily, since the assumption (7.9) talks about the conditional distribution of
MPG given VOL and HP whereas the box plot of MPG relates to the unconditional
distribution of MPG. As we shall see later, this assumption is examined using
residuals which are the representatives of the errors. Point (b) can be helpful when
we identify, interpret and deal with unusual observations which will be discussed
in the section. Point (c) will be helpful when we will look at the residual plots to
identify suitable transformations which will be discussed in Sect. 3.10.
We can then look at the scatterplots for every pair of the variables: MPG, VOL,
and HP. This can be put in the form of a matrix of scatterplots.
Scatterplots Matrix
The matrix of the scatterplots in Fig. 7.5, is called the scatterplot matrix of all the
variables, namely, the response variable and the regressors. Unlike the scatterplot
in the single regressor case, the scatterplot matrix in the multiple regressors case
is of limited importance. In the multiple regressors case, we are interested in the
influence of a regressor over and above that of other regressors. We shall elaborate
upon this further as we go along. However, the scatterplots in the scatterplot matrix
ignore the influence of the other regressors. For example, from the scatterplot of HP
vs MPG (second row first column element in Fig. 7.5), it appears that MPG has a
quadratic relationship with HP. But this ignores the impact of the other regressor
on both MPG and HP. How do we take this into consideration? After accounting for
these impacts, will the quadratic relationship still hold? We shall study these aspects
in Sect. 3.10. The scatterplot matrix is useful in finding out whether there is almost
perfect linear relationship between a pair of regressors. Why is this important? We
shall study this in more detail in Sect. 3.13.
[Fig. 7.5: scatterplot matrix of HP, MPG, and VOL]
$$\text{Model}: \quad MPG = \beta_0 + \beta_1\, HP + \beta_2\, VOL + \varepsilon \tag{7.31}$$
Where does the error, ε, come from? We know that MPG is not completely
determined by HP and VOL. For example, the age of the vehicle, the type of the
road (smooth, bumpy, etc.) and so on impact the MPG. Moreover, as we have seen
in the scatterplot: MPG vs HP, a quadratic term in HP may be warranted too. (We
shall deal with this later.) All these are absorbed in ε.
The model described by (7.33) and (7.34) is a special case of the model described
by (7.7) and (7.9) where k = 2 and N = 81.
In the formula (7.32), β 1 and β 2 are the rates of change in MPG with respect
to HP and VOL when the other regressor is kept fixed. Thus, these are partial
rates of change. So strictly speaking, β 1 and β 2 should be called partial regression
coefficients. When there is no confusion we shall refer to them as just regression
coefficients.
Estimation of Parameters
As in the single regressor case, we estimate the parameters by the method of
least squares (least sum of squares of the errors). Thus, we are led to the normal
equations:
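$$\begin{aligned}
\sum_{i=1}^{81} MPG_i &= 81\,\beta_0 + \beta_1 \sum_{i=1}^{81} HP_i + \beta_2 \sum_{i=1}^{81} VOL_i \\
\sum_{i=1}^{81} MPG_i\, HP_i &= \beta_0 \sum_{i=1}^{81} HP_i + \beta_1 \sum_{i=1}^{81} HP_i^2 + \beta_2 \sum_{i=1}^{81} HP_i\, VOL_i \\
\sum_{i=1}^{81} MPG_i\, VOL_i &= \beta_0 \sum_{i=1}^{81} VOL_i + \beta_1 \sum_{i=1}^{81} HP_i\, VOL_i + \beta_2 \sum_{i=1}^{81} VOL_i^2
\end{aligned} \tag{7.35}$$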
The solution of the system of equations in (7.35) yields the estimates $\hat\beta_0, \hat\beta_1, \hat\beta_2$
of the parameters $\beta_0, \beta_1, \beta_2$, respectively.
Thus, the estimated regression equation or the fitted model is
$$\widehat{MPG} = \hat\beta_0 + \hat\beta_1\, HP + \hat\beta_2\, VOL. \tag{7.36}$$
Along similar lines to the single regressor case, the fitted value and the residual
corresponding to the ith observation are, respectively,
$$\begin{aligned}
\widehat{MPG}_i &= \hat\beta_0 + \hat\beta_1\, HP_i + \hat\beta_2\, VOL_i, \\
e_i &= MPG_i - \widehat{MPG}_i.
\end{aligned} \tag{7.37}$$
We recall that the fitted value and the residual for the ith observation are the
explained and the unexplained parts of the observation based on the fitted model.
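The normal equations for two regressors form a 3 × 3 linear system, which can be built and solved as in the following sketch. The data are hypothetical and noise-free (generated from known coefficients, so that the solution can be checked), and the function names and the Gaussian elimination helper are our own; in practice one would rely on a statistical package such as R.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]      # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                      # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_two_regressors(y, x1, x2):
    """Least-squares estimates obtained from the normal equations."""
    rows = [(1.0, a, b) for a, b in zip(x1, x2)]        # rows of Z = (1 : X)
    ztz = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    zty = [sum(r[i] * obs for r, obs in zip(rows, y)) for i in range(3)]
    return solve_linear_system(ztz, zty)

# Hypothetical data (assumption): MPG generated exactly from known coefficients.
hp = [50, 80, 100, 120, 150, 200]
vol = [90, 100, 85, 110, 95, 120]
mpg = [60 - 0.1 * h - 0.2 * v for h, v in zip(hp, vol)]

b0, b1, b2 = fit_two_regressors(mpg, hp, vol)
fitted = [b0 + b1 * h + b2 * v for h, v in zip(hp, vol)]
residuals = [obs - fit for obs, fit in zip(mpg, fitted)]
print(b0, b1, b2)
```

Because the data contain no error term, the residuals are zero up to rounding and the generating coefficients are recovered; with real data the residuals carry the unexplained part.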
Residual Sum of Squares
As before, we use the residuals $e_i$, the sample representatives of the errors, to
check the assumptions (7.9) on the errors.
The sum of squares of residuals, $R_0^2 = \sum_{i=1}^{81} e_i^2$, is the part of the variation in
MPG that is not explained by the fitted model (7.36) obtained by the method of
least squares.
The estimator of $\sigma^2$ is obtained as
$$\hat\sigma^2 = \frac{R_0^2}{78}. \tag{7.38}$$
The data has 81 degrees of freedom. Since three parameters, $\beta_0$, $\beta_1$, and $\beta_2$, are
estimated, three degrees of freedom are lost and hence the effective sample size
is 81 − 3 = 78. That is the reason for the denominator in (7.38). As mentioned
earlier, the square root of $\hat\sigma^2$ is called the residual standard error in the R package.
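Thus the proportion of the variation in MPG that is explained by our fitted
model is
$$\frac{\sum_{i=1}^{81} \left(MPG_i - \overline{MPG}\right)^2 - R_0^2}{\sum_{i=1}^{81} \left(MPG_i - \overline{MPG}\right)^2} = 1 - \frac{R_0^2}{\sum_{i=1}^{81} \left(MPG_i - \overline{MPG}\right)^2}. \tag{7.40}$$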
It can be shown that $R^2$ almost always increases with more regressors. (It never
decreases when more regressors are introduced.) So $R^2$ may not be a very good
criterion to judge whether a new regressor should be included. So it is meaningful
to look for criteria which impose a penalty for unduly bringing in a new regressor
into the model. One such criterion is Adjusted $R^2$ defined below. (When we deal
with subset selection, we shall introduce more criteria for choosing a good subset of
regressors.) We know that both $R_0^2$ and $\sum_{i=1}^{81} \left(MPG_i - \overline{MPG}\right)^2$ are representatives
of the error variance $\sigma^2$. But the degrees of freedom for the former is 81 − 3 = 78
and for the latter is 81 − 1 = 80. When we are comparing both of them as in (7.40),
some feel that we should compare the measures per unit degree of freedom as then
they will be true representatives of the error variance. Accordingly, adjusted $R^2$ is
defined as
$$\text{Adj } R^2 = 1 - \frac{R_0^2 / (81 - 3)}{\sum_{i=1}^{81} \left(MPG_i - \overline{MPG}\right)^2 / (81 - 1)}. \tag{7.41}$$
The adjusted $R^2$ can be written as $1 - \frac{n-1}{n-K}\left(1 - R^2\right)$, where n is the number
of observations and K is the number of parameters of our model that are being
estimated. From this it follows that, unlike $R^2$, adjusted $R^2$ may decrease with the
introduction of a new regressor. However, adjusted $R^2$ may become negative and
does not have the same intuitive interpretation as that of $R^2$.
As we can see, adjusted $R^2$ is always smaller than the value of $R^2$. A practical
thumb rule is to examine whether $R^2$ and adjusted $R^2$ are quite far apart. One naïve
way to judge this is to see if the relative change, $\left(R^2 - \text{adj } R^2\right)/R^2$, is more than 10%. If it
is not, go ahead and interpret $R^2$. However, if it is, then it is an indication that there
is some issue with the model: either there is an unnecessary regressor in the model
or there is some unusual observation. We shall talk about the unusual observations
in Sect. 3.12.
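The adjusted R² and the thumb rule on the relative change are simple enough to code directly. In the sketch below the R² values are hypothetical inputs chosen for illustration (they are not taken from the cars output), and the function names are our own.

```python
def adjusted_r_squared(r_squared, n_obs, n_params):
    """Adjusted R-squared: 1 - (n - 1)/(n - K) * (1 - R-squared)."""
    return 1 - (n_obs - 1) / (n_obs - n_params) * (1 - r_squared)

def relative_gap(r_squared, n_obs, n_params):
    """Relative change (R-squared - adjusted R-squared) / R-squared."""
    adj = adjusted_r_squared(r_squared, n_obs, n_params)
    return (r_squared - adj) / r_squared

# Hypothetical case 1: n = 81 observations, K = 3 parameters, R-squared = 0.75.
adj_large_sample = adjusted_r_squared(0.75, 81, 3)
gap_large_sample = relative_gap(0.75, 81, 3)

# Hypothetical case 2: few observations and many parameters; the value is negative.
adj_small_sample = adjusted_r_squared(0.30, 10, 5)
gap_small_sample = relative_gap(0.30, 10, 5)

print(adj_large_sample, gap_large_sample)
print(adj_small_sample, gap_small_sample)
```

In the first case the relative change is well under 10%, so R² can be interpreted as it stands; in the second the gap is large and the adjusted value is negative, which signals that the model is over-parameterized for the data.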
Let us now consider Example 2.3, the gasoline consumption problem. We give
the R-output of the linear regression of MPG on HP and VOL in Table 7.3.
Call:
lm(formula = MPG ~ HP + VOL, data = Cars)
Residuals:
Min 1Q Median 3Q Max
-8.3128 -3.3714 -0.1482 2.8260 15.4828
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 66.586385 2.505708 26.574 < 2e-16 ***
HP -0.110029 0.009067 -12.135 < 2e-16 ***
VOL -0.194798 0.023220 -8.389 1.65e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
7. A 95% confidence interval for the average value of MPG for all vehicles with
HP = 150 and VOL = 100 cubic feet can be obtained as follows.
Let Z be the matrix of order 81 × 3 where each element in the first column is 1,
and the second and third columns, respectively, are data on HP and VOL on vehicles 1
to 81 in that order. Let $u^t = (1 \;\; 150 \;\; 100)$. Then the required confidence interval
is given by the formula
$$\left(u^t\hat\beta - \sqrt{u^t \left(Z^t Z\right)^{-1} u\; \hat\sigma^2}\;\, t_{78,0.025},\ \ u^t\hat\beta + \sqrt{u^t \left(Z^t Z\right)^{-1} u\; \hat\sigma^2}\;\, t_{78,0.025}\right)$$
where $\hat\sigma^2$, the estimate of the error variance, is the square of the residual standard
error in the output. The computed interval is (29.42484, 31.77965).
A 95% prediction interval for MPG of a vehicle with HP = 150 and VOL = 100
cubic feet can be obtained as follows. Let Z, u, and $\hat\sigma^2$ be as specified in point 7
above. Then the required prediction interval is given by the formula:
$$\left(u^t\hat\beta - \sqrt{\left(1 + u^t \left(Z^t Z\right)^{-1} u\right) \hat\sigma^2}\;\, t_{78,0.025},\ \ u^t\hat\beta + \sqrt{\left(1 + u^t \left(Z^t Z\right)^{-1} u\right) \hat\sigma^2}\;\, t_{78,0.025}\right).$$
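Both intervals are centered at the point estimate obtained by multiplying u with the vector of estimated coefficients. As a quick consistency check, the sketch below computes this point estimate from the coefficients reported in Table 7.3 and compares it with the midpoint of the confidence interval quoted above; the variable names are our own, and the small discrepancy comes from rounding in the reported figures.

```python
# Estimated coefficients (intercept, HP, VOL) as reported in Table 7.3.
beta_hat = [66.586385, -0.110029, -0.194798]
u = [1, 150, 100]                        # HP = 150, VOL = 100 cubic feet

# Point estimate of MPG: the inner product of u and beta_hat.
point_estimate = sum(a * b for a, b in zip(u, beta_hat))

# 95% confidence interval for the average MPG as quoted in the text.
ci_lower, ci_upper = 29.42484, 31.77965
midpoint = (ci_lower + ci_upper) / 2
half_width = (ci_upper - ci_lower) / 2

print(point_estimate, midpoint, half_width)
```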
We have so far considered the cases of single regressor and multiple regressors in
Linear Regression and estimated the model which is linear in parameters and also
in regressors. We interpreted the outputs and obtained the relevant confidence and
prediction intervals. We also performed some basic tests of importance. Are we
done? First, let us consider the data sets of Anscombe (1973) as given in Table 5.1
and Fig. 5.3 of the data visualization chapter (Chap. 5). Look at Table 5.1. There
are four data sets, each having data on two variables. The plan is to regress yi on
xi , i = 1, . . . , 4. From the summary statistics, we notice the means and standard
deviations of the x’s are the same across the four data sets and the same is true for
the y’s. Further, the correlation coefficients are the same across the four data sets.
Based on the formulae (7.15) adapted to these data sets, the estimated regression
lines are the same. Moreover, the correlation coefficient is 0.82 which is substantial.
So it appears that linear regression is equally meaningful in all the four cases and
gives the same regression line.
Now let us look at the plots in the Fig. 5.3. For the first data set, a linear regression
seems reasonable. From the scatterplot of the second data set, a parabola is more
appropriate. For the third data set, barring the third observation (13, 12.74), the
remaining data points lie almost perfectly on a straight line, different from the line
in the plot. In the fourth data set, the x values are all equal to 8, except for the eighth
observation where the x value is 19. The slope is influenced by this observation.
Otherwise we would never use these data (in the fourth data set) to predict y based
on x.
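The point of Anscombe's example can be verified numerically. The sketch below uses the standard published values of the Anscombe (1973) quartet, which we assume match Table 5.1, and computes the least squares line and the correlation coefficient for each data set; the function and variable names are our own.

```python
import math

X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def line_and_correlation(x, y):
    """Least-squares intercept, slope and correlation coefficient."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope, sxy / math.sqrt(sxx * syy)

results = {name: line_and_correlation(x, y) for name, (x, y) in ANSCOMBE.items()}
for name, (intercept, slope, r) in results.items():
    print(name, round(intercept, 2), round(slope, 3), round(r, 3))
```

The four fitted lines and correlation coefficients agree to about two decimal places, although the plots show four very different situations. The numbers alone cannot distinguish them.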
This reiterates the observation made earlier: One should examine suitable plots
and other diagnostics to be discussed in the following sections before being satisfied
with a regression. In the single regressor case, the scatterplot would reveal a lot
of useful information as seen above. However, if there are several regressors,
scatterplots alone will not be sufficient since we want to assess the influence of
a regressor on the response variable after controlling for the other regressors.
Scatterplots ignore the information on the other regressors. As we noticed earlier, the
residuals are the sample representatives of the errors. This leads to the examination
of some suitable residual plots to check the validity of the assumptions in Eq. (7.9).
In the next section, we shall develop some basic building blocks for constructing the
diagnostics.
The Hat Matrix: First, we digress a little bit to discuss a useful matrix. Consider the
linear regression model (7.8) with the assumptions (7.9). There is a matrix H, called
the hat matrix, which transforms the vector of response values Y to the vector of the
fitted values $\hat{Y}$. In other words, $\hat{Y} = HY$. The hat matrix is given by $H = Z\left(Z^t Z\right)^{-1} Z^t$. It is
called the hat matrix because when applied to Y it gives the estimated (or hat) value
of Y. The hat matrix has some good properties, some of which are listed below:
(a) $Var(\hat{Y}) = \sigma^2 H$.
(b) Var(residuals) = $\sigma^2 (I - H)$, where I is the identity matrix.
Interpretation of Leverage Values and Residuals Using the Hat Matrix
Diagonal elements $h_{ii}$ of the hat matrix have a good interpretation. If $h_{ii}$ is large,
then the regressor part of data for the ith observation is far from the center of the
regressor data. If there are N observations and k regressors, then $h_{ii}$ is considered to
be large if it is $2(k+1)/N$ or higher.
Let us now look at the residuals. If the residual corresponding to the ith observation
is large, then the fit of the ith observation is poor. To examine whether a residual
is large, we look at a standardized version of the residual. It can be shown that
the mean of each residual is 0 and the variance of the ith residual is given by
$\mathrm{Var}(e_i) = (1 - h_{ii})\sigma^2$. We recall that $\sigma^2$ is the variance of the error of an observation
(of course conditional on the regressors). The estimate of $\sigma^2$ obtained from the
model after dropping the ith observation, denoted by $\hat{\sigma}^2_{(i)}$, is preferred (since a large ith
residual also makes a large contribution to the residual sum of squares and hence to the
estimate of $\sigma^2$).
An observation is called an Outlier if its fit in the estimated model is poor, or,
equivalently, if its residual is large.
The statistic that is used to check whether the ith observation is an outlier is
$$r_i = \frac{e_i - 0}{\sqrt{(1 - h_{ii})\,\hat{\sigma}^2_{(i)}}}, \qquad (7.43)$$
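often referred to as the ith studentized residual, which has a t distribution with
N − k − 2 degrees of freedom (one fewer than the N − k − 1 of the full fit, because the
variance estimate is computed from the remaining N − 1 observations) under the null
hypothesis that the ith observation is not an outlier.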
When one says that an observation is an outlier, it is in the context of the
estimated model under consideration. The same observation may be an outlier with
respect to one model and may not be an outlier in a different model (see Exercise
7.3).
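The leverage rule and the studentized residual of Eq. (7.43) can be sketched in R as follows. This is an illustration under assumptions: the built-in `mtcars` data stands in for the chapter's data, and `influence()`, `hatvalues()`, and `rstudent()` are base R functions used here to check the hand computation.

```r
# Stand-in data: built-in mtcars (an assumption, not the chapter's Cars data)
fit <- lm(mpg ~ hp + disp, data = mtcars)
N <- nobs(fit)
k <- length(coef(fit)) - 1                 # number of regressors, intercept excluded

h <- hatvalues(fit)                        # leverage values h_ii
high_leverage <- names(h)[h >= 2 * (k + 1) / N]

e       <- resid(fit)
sigma_i <- influence(fit)$sigma            # sigma estimated with observation i dropped
r_manual <- e / sqrt((1 - h) * sigma_i^2)  # Eq. (7.43)

# The hand computation should agree with R's rstudent()
gap   <- max(abs(r_manual - rstudent(fit)))
worst <- names(which.max(abs(r_manual)))   # observation with the largest |r_i|
```

Whether the observation named in `worst` is an outlier is then judged against the t distribution, in the context of this particular model.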
As we shall see, the residuals and the leverage values form the basic building
blocks for the deletion diagnostics, to be discussed in Sect. 3.12.
We shall now proceed to check whether the fitted model is adequate. This involves
the checking of the assumptions (7.9). If the fitted model is appropriate then the
residuals are uncorrelated with the fitted values and also the regressors. We examine
these by looking at the residual plots:
(a) Fitted values vs residuals
(b) Regressors vs residuals
In each case the residuals are plotted on the Y-axis. If the model is appropriate,
then each of the above plots should yield a random scatter. The deviation from the
random scatter can be tested, and an R package command for the above plots gives
the test statistics along with the p-values.
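A rough sketch of such plots and tests in base R is given below. It rests on several assumptions: the built-in `mtcars` data is a stand-in for the chapter's data; the curvature check shown (adding the square of one regressor, or of the fitted values, and reading the p-value of its t-test) is one common way to test for deviation from random scatter and is only in the spirit of Tukey's test for the fitted values; and the command the text alludes to is probably `residualPlots()` from the car package, which the source does not name.

```r
# Stand-in data: built-in mtcars (an assumption, not the chapter's Cars data)
fit <- lm(mpg ~ hp + disp, data = mtcars)
res <- resid(fit)

# Residuals go on the Y-axis in every panel; drawn only in an interactive session
if (interactive()) {
  op <- par(mfrow = c(1, 3))
  plot(mtcars$hp,   res, xlab = "hp",            ylab = "Residuals"); abline(h = 0, lty = 2)
  plot(mtcars$disp, res, xlab = "disp",          ylab = "Residuals"); abline(h = 0, lty = 2)
  plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals"); abline(h = 0, lty = 2)
  par(op)
}

# Curvature check: add one squared term at a time and read off its p-value
curvature_p <- sapply(c("hp", "disp"), function(v) {
  d <- mtcars
  d$sq <- d[[v]]^2
  summary(lm(mpg ~ hp + disp + sq, data = d))$coefficients["sq", "Pr(>|t|)"]
})

# Same idea for the fitted values
d <- mtcars
d$sq <- fitted(fit)^2
fitted_p <- summary(lm(mpg ~ hp + disp + sq, data = d))$coefficients["sq", "Pr(>|t|)"]
```

A small p-value signals a deviation from random scatter; as the text stresses, the plot itself is still needed to decide what form the correction should take.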
Let us look at the residual plots and test statistics for the fitted model for the
gasoline consumption problem given in Sect. 3.7.
Fig. 7.7 The residual plots for gasoline consumption model in Sect. 3.7
How do we interpret Fig. 7.7 and Table 7.4? Do they also suggest a suitable
corrective action in case one such is warranted?
We need to interpret the figure and the table together.
We notice the following.
1. The plot HP vs residual does not appear to be a random scatter. Table 7.4 also
confirms that the p-value corresponding to HP is very small. Furthermore, if we
look at the plot more closely we can see a parabolic pattern. Thus the present
model is not adequate. We can try to additionally introduce the square term for
HP to take care of this parabolic nature in the plot.
2. From the plot VOL vs residual alone, it is not clear whether the scatter is really
random. Read it in conjunction with Table 7.4. The p-value corresponding to
VOL is not small. So we conclude that there is no significant deviation from a
random scatter.
3. The same is the case with the plot of fitted values vs residual, for which the
corresponding test is the Tukey test.
A note of caution is in order. The R package command for the residual plots
leads to plots where a parabolic curve is drawn to notify a deviation from random
scatter. First, the deviation from random scatter must be confirmed from the table
of tests for deviation from random scatter by looking at the p-value. Second, even
if the p-value is small, it does not automatically mean a square term is warranted.
Your judgment of the plot is important to decide whether a square term is warranted
or something else is to be done. One may wonder why it is important to have the
plots if, anyway, we need to get the confirmation from the table of tests for random
scatter. The test only tells us whether there is a deviation from random scatter but
it does not guide us to what transformation is appropriate in case of deviation. See,
for example, the residual plot of HP vs Residual in Fig. 7.7. Here we clearly see a
parabolic relationship indicating that we need to bring in HP square. We shall see
more examples in the following sections.
Let us add the new regressor which is the square of HP and look at the regression
output and the residual plots.
The following points emerge:
1. From Table 7.5, we notice that HP_Sq is highly significant (p-value: $1.2 \times 10^{-15}$)
and its coefficient is positive.
2. R square and Adj. R square are pretty close. So we can interpret R square. The
present model explains 89.2% of the variation in MPG. (Recall that the model in
Sect. 3.7 explained only 75% of the variation in MPG.)
3. Figure 7.8 and Table 7.6 indicate that the residuals are uncorrelated with the fitted
values and regressors.
4. Based on Table 7.5, how do we assess the impact of HP on MPG? The partial
derivative of MPG with respect to HP is −0.4117 + 0.001808 HP. Unlike in
the model in Sect. 3.7 (see point 4), in the present model, the impact of a unit
increase in HP on the estimated MPG, keeping VOL constant, depends on
the level of HP. At the median value of HP (which is equal to 100; see Table
7.2), one unit increase in HP will lead to a reduction in MPG by 0.2309 when
VOL is kept constant. From the partial derivative, it is clear that as long as HP
is smaller than about 227.71, one unit increase in HP will lead to a reduction in
MPG, keeping VOL constant. If HP is greater than this threshold, then based on
this model, MPG will increase (happily!) with increasing HP when VOL is
held constant.
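The arithmetic behind point 4 can be written out as a short worked derivation. The coefficients are those reported in Table 7.5; the layout below is only an illustrative restatement.

```latex
\begin{align*}
\widehat{MPG} &= 77.44 - 0.4117\,HP - 0.1018\,VOL + 0.0009041\,HP^2 \\
\frac{\partial\,\widehat{MPG}}{\partial\,HP} &= -0.4117 + 2(0.0009041)\,HP \approx -0.4117 + 0.001808\,HP \\
\text{At } HP = 100:\quad & -0.4117 + 0.001808 \times 100 = -0.2309 \\
\text{Sign change at:}\quad & HP = \frac{0.4117}{0.001808} \approx 227.71
\end{align*}
```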
The question is: Are we done? No, we still need to check a few other assumptions,
like normality of the errors. Are there some observations which are driving
the results? Are there some important variables omitted?
Suppose we have performed a linear regression with some regressors. If the plot
of fitted values vs residuals shows a linear trend, then it is an indication that there
is an omitted regressor. However, it does not give any clue as to what this regressor
is. This has to come from domain knowledge. It may also happen that, after the
regression is done, we found another variable which, we suspect, has an influence
on the response variable. In the next section we study the issue of bringing in a new
regressor.
Table 7.5 Regression output of MPG vs HP, HP_SQ (Square of HP), and VOL
> model2<-lm(MPG ~ HP + VOL + HP_sq, data=Cars)
> summary(model2)
Call:
lm(formula = MPG ~ HP + VOL + HP_sq, data = Cars)
Residuals:
   Min     1Q Median     3Q    Max
-8.288 -2.037  0.561  1.786 11.008
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.744e+01  1.981e+00  39.100  < 2e-16 ***
HP          -4.117e-01  3.063e-02 -13.438  < 2e-16 ***
VOL         -1.018e-01  1.795e-02  -5.668  2.4e-07 ***
HP_sq        9.041e-04  9.004e-05  10.042  1.2e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Fig. 7.8 Residual plots corresponding to the fitted model in Table 7.5
Let us consider again the gasoline consumption problem. Suppose we have run a
regression of MPG on VOL. We feel that HP also has some power to explain the
variation in MPG. Should we bring in HP? A part of MPG is already explained
by VOL. So the unexplained part of MPG is the residual e after regressing MPG
on VOL. There is a residual value corresponding to each observation. Let us call
these residual values $e_i$, i = 1, . . . , 81. So we will bring in HP if it has the power
to explain this residual. At the same time, a part of the explaining power of HP
may already have been contained in VOL. Therefore, the part of HP that is useful
for us is the residual f after regressing HP on VOL. Let us call these residuals
$f_i$, i = 1, . . . , 81. It now boils down to the single regressor case of regressing e
on f. It is natural to look at the scatterplot $(f_i, e_i)$, i = 1, . . . , 81. This
scatterplot is called the Added Variable Plot. The correlation coefficient between
e and f is called the partial correlation coefficient between MPG and HP fixing
(or equivalently, eliminating the effect of) VOL. Let us look at this added variable
plot (Fig. 7.9).
The following points emerge from the above added variable plot.
(a) As the residual of HP increases, the residual of MPG decreases, by and large.
Thus it is useful to bring in HP into the regression in addition to VOL.
(b) We can notice a parabolic trend in the plot, suggesting that HP be brought
in using a quadratic function. Thus, it reinforces our earlier decision (see
Sect. 3.10) to use HP and HP_SQ. Barring exceptions, the added
variable plot suggests a suitable functional form in which the new regressor can
be added.
(c) The partial correlation coefficient measures the strength of a linear relationship
between the two residuals under consideration. For the gasoline consumption example,
the partial correlation coefficient between MPG and HP keeping VOL fixed is
−0.808. The correlation coefficient between MPG and HP is −0.72. (As we
know the negative sign indicates that as HP increases, MPG decreases.) Is it
surprising?
It is a common misconception that the correlation coefficient between two
variables is always larger than or equal to the partial correlation coefficient
between these variables after taking away the effect of other variables. Notice
that the correlation coefficient between MPG and HP represents the strength of
the linear relationship between these variables ignoring the effect of VOL. The
partial correlation coefficient between MPG and HP after taking away the effect
of VOL is the simple correlation coefficient between the residuals e and f which
are quite different from MPG and HP, respectively. So a partial correlation
coefficient can be larger than or equal to or less than the corresponding
correlation coefficient.
We give an interesting formula for computing the partial correlation coefficient
using just the basic regression output. Let T denote the t-statistic value for a
regressor. Let there be N observations and k (≥ 2) regressors plus one intercept. Then
the square of the partial correlation coefficient between this regressor and the
response variable (Greene 2012, p. 77) is given by
$$\frac{T^2}{T^2 + (N - k - 1)}. \qquad (7.44)$$
The partial correlation coefficient itself carries the sign of T. Recall (see 7.38) that
N − k − 1 is the degrees of freedom for the residual sum of squares.
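The two routes to the partial correlation, the residual-on-residual construction behind the added variable plot and the t-statistic formula, can be compared numerically. The sketch below assumes the built-in `mtcars` data as a stand-in, with `disp` playing the role of VOL and `hp` the role of HP; these substitutions are illustrative and not the chapter's data.

```r
# Stand-in data: built-in mtcars (an assumption, not the chapter's Cars data)
e <- resid(lm(mpg ~ disp, data = mtcars))   # part of the response not explained by disp
f <- resid(lm(hp  ~ disp, data = mtcars))   # part of hp not explained by disp

partial_r <- cor(e, f)                      # partial correlation of mpg and hp given disp

# Route 2: t-statistic of hp in the regression on both regressors
fit    <- lm(mpg ~ disp + hp, data = mtcars)
t_hp   <- summary(fit)$coefficients["hp", "t value"]
df_res <- df.residual(fit)                  # N - k - 1
r2_from_t <- t_hp^2 / (t_hp^2 + df_res)     # squared partial correlation

# Added variable plot: residuals f on the X-axis, residuals e on the Y-axis
if (interactive()) plot(f, e, xlab = "hp | disp", ylab = "mpg | disp")
```

Here `partial_r^2` and `r2_from_t` should agree up to rounding, and `partial_r` has the same sign as `t_hp`.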
Look at the third data set of Anscombe (1973) described in Sect. 3.8 and its
scatterplot (Fig. 5.3) in the Data Visualization chapter (Chap. 5). But for the third
observation (13, 12.74), the other observations fall almost perfectly on a straight
line. Just this observation is influencing the slope and the correlation coefficient.
In this section, we shall explain what we mean by an influential observation, give
methods to identify such observations, and finally discuss what one can do with an
influential observation.
We list below some of the quantities of interest in the estimation of a linear
regression model.
1. Regression coefficient estimates
2. Fit of an observation
3. Standard errors of the regression coefficient estimates
4. The error variance
5. Coefficient of multiple determination
In a linear regression model (7.8) with the assumptions as specified in (7.9), no
single observation has any special status. If the presence or absence of a particular
observation can make a large (to be specified) difference to some or all of the
quantities above, we call such an observation an influential observation.
Let us describe some notation before we give the diagnostic measures. Consider
the model specified by (7.8) and (7.9) and its estimation in Sect. 3.7. Recall the
definitions of the fitted values $\hat{y}_i$, the residuals $e_i$, the residual sum of squares $R_0^2$,
and the coefficient of multiple determination $R^2$, given in Sect. 3.7. Let $\hat{\beta}$ denote
the vector of the estimated regression coefficients.
In order to assess the impact of an observation on the quantities (1)–(5)
mentioned above, we set aside an observation, say the ith observation, and estimate
the model with the remaining observations. We denote the vector of estimated
regression coefficients, fitted value of the jth observation, residual of the jth
observation, residual sum of squares and the coefficient of multiple determination
after dropping the ith observation by $\hat{\beta}_{(i)}$, $\hat{y}_{j(i)}$, $e_{j(i)}$, $R_{0(i)}^2$, and $R_{(i)}^2$, respectively.
We give below a few diagnostic measures that are commonly used to detect
influential observations.
(a) Cook’s distance: This is an overall measure of scaled difference in the fit of the
observations due to dropping an observation. This is also a scaled measure of
the difference between the vectors of regression coefficient estimates before and
after dropping an observation. More specifically, Cook’s distance after dropping
the ith observation, denoted by $\mathrm{Cookd}_i$, is proportional to $\sum_{j=1}^{N} (\hat{y}_j - \hat{y}_{j(i)})^2$,
where N is the number of observations. ($\mathrm{Cookd}_i$ is actually a squared distance.)
The ith observation is said to be influential if Cookdi is large. If Cookdi is larger
than a cutoff value (usually 80th or 90th percentile value of F distribution with
parameters k and N – k − 1 where N is the number of observations and k is the
number of regressors), then the ith observation is considered to be influential.
In practice, a graph is drawn with an observation number in the X-axis and
Cook’s distance in the Y-axis, called the index plot of Cook’s distance, and a
few observations with conspicuously large Cook’s distance values are treated
as influential observations.
(b) $\mathrm{DFFITS}_i$: This is a scaled absolute difference in the fits of the ith observation
before and after the deletion of the ith observation. More specifically, $\mathrm{DFFITS}_i$
is proportional to $|\hat{y}_i - \hat{y}_{i(i)}|$. Observations with DFFITS larger than $2\sqrt{(k+1)/N}$
are flagged as influential observations.
(c) COVRATIOi : COVRATIOi measures the change in the overall variability of the
regression coefficient estimates due to the deletion of the ith observation. More
specifically it is the ratio of the determinants of the covariance matrices of the
regression coefficient estimates after and before dropping the ith observation.
If $|\mathrm{COVRATIO}_i - 1| > 3(k+1)/N$, then the ith observation is flagged as an
influential observation in connection with the standard errors of the estimates.
It is instructive to also look at the index plot of COVRATIO.
(d) The scaled residual sum of squares estimates the error variance as we have seen
in (7.38). The difference in the residual sum of squares, $R_0^2$ and $R_{0(i)}^2$, before and
after deletion of the ith observation, respectively, is given by
$$R_0^2 - R_{0(i)}^2 = \frac{e_i^2}{1 - h_{ii}}.$$
Thus the ith observation is flagged as influential in connection with error variance
if it is an outlier (see 7.43).
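The four deletion diagnostics above can be computed with base R, as in the following sketch. It assumes the built-in `mtcars` data as a stand-in for the chapter's data; `cooks.distance()`, `dffits()`, and `covratio()` are functions from R's stats package, and the cutoffs are the ones quoted in the text (the 80th percentile is picked here from the "80th or 90th" range).

```r
# Stand-in data: built-in mtcars (an assumption, not the chapter's data)
fit <- lm(mpg ~ hp + disp, data = mtcars)
N <- nobs(fit)
k <- length(coef(fit)) - 1                 # number of regressors

h   <- hatvalues(fit)
e   <- resid(fit)
cd  <- cooks.distance(fit)
dff <- dffits(fit)
cvr <- covratio(fit)

# Flags using the cutoffs quoted in the text
flags <- data.frame(
  cook     = cd > qf(0.80, k, N - k - 1),
  dffits   = abs(dff) > 2 * sqrt((k + 1) / N),
  covratio = abs(cvr - 1) > 3 * (k + 1) / N
)

# Check of item (d): the drop in the residual sum of squares equals e_i^2 / (1 - h_ii)
rss_full <- sum(e^2)
rss_drop <- sapply(seq_len(N), function(i) {
  sum(resid(lm(mpg ~ hp + disp, data = mtcars[-i, ]))^2)
})
gap <- max(abs((rss_full - rss_drop) - unname(e^2 / (1 - h))))

# Index plot of Cook's distance, as described in (a)
if (interactive()) plot(cd, type = "h", xlab = "Observation", ylab = "Cook's distance")
```

Rows of `flags` with any `TRUE` entry are candidates for closer scrutiny, not for automatic deletion.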
Two points are worth noting:
(a) If an observation is found to be influential, it does not automatically suggest
“off with the head.” The diagnostics above are only markers suggesting that an
influential observation has to be carefully examined to find out whether there is
an explanation from the domain knowledge and the data collection process why
it looks different from the rest of the data. Any deletion should be contemplated
only after there is a satisfactory explanation for dropping, from the domain
knowledge.
(b) The diagnostics are based on the model developed. If the model under
consideration is found to be inappropriate otherwise, then these diagnostics are not
applicable.
We shall illustrate the use of Cook’s distance using the following example on
cigarette consumption. The data set “CigaretteConsumption.csv” is available on the
book’s website.
Example 7.1. A national insurance organization in USA wanted to study the
consumption pattern of cigarettes in all 50 states and the District of Columbia. The
variables chosen for the study are given below.
Variable   Definition
Age        Median age of a person living in a state
HS         % of people over 25 years of age in a state who completed high school
Income     Per capita personal income in a state (in dollars)
Black      % of blacks living in a state
Female     % of females living in a state
Price      Weighted average price (in cents) of a pack of cigarettes in a state
Sales      Number of packs of cigarettes sold in a state on a per capita basis
The R output of the regression of Sales on the other variables is in Table 7.7.
The index plots of Cook’s distance and studentized residuals are given in
Fig. 7.10.
From the Cook’s distance plot, observations 9, 29, and 30 appear to be influential.
Observations 29 and 30 are also outliers. (These also are influential with respect to
error variance.) On scrutiny, it turns out that observations 9, 29, and 30 correspond to
Washington DC, Nevada, and New Hampshire, respectively. Washington DC is the
capital city and has a vast floating population due to tourism and otherwise. Nevada
is different from a standard state because of Las Vegas. New Hampshire does not
impose sales tax. It does not impose income tax at state level. Thus these three states
behave differently from other states with respect to cigarette consumption. So it is
meaningful to consider regression after dropping these observations.
The corresponding output is provided in Table 7.8.
Table 7.7 R output of the regression of Sales on the other variables
Call:
lm(formula = Sales ~ ., data = CigaretteConsumption[, -1])
Residuals:
    Min      1Q  Median     3Q     Max
-48.398 -12.388  -5.367  6.270 133.213
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 103.34485  245.60719   0.421  0.67597
Age           4.52045    3.21977   1.404  0.16735
HS           -0.06159    0.81468  -0.076  0.94008
Income        0.01895    0.01022   1.855  0.07036 .
Black         0.35754    0.48722   0.734  0.46695
Female       -1.05286    5.56101  -0.189  0.85071
Price        -3.25492    1.03141  -3.156  0.00289 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Fig. 7.10 Index plots for cigarette consumption model in Table 7.7 (panels: Cook’s distance; Studentized residuals)
Table 7.8 Regression results for cigarette consumption data after dropping observations 9, 29,
and 30
> model4<- lm (Sales ~ ., data = CigaretteConsumption[-c(9,29,30),-1])
> summary(model4)
Call:
lm(formula = Sales ~ ., data = CigaretteConsumption[-c(9, 29,
30), -1])
Residuals:
Min 1Q Median 3Q Max
-40.155 -8.663 -2.194 6.301 36.043
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 100.68317 136.24526 0.739 0.464126
Age 1.84871 1.79266 1.031 0.308462
HS -1.17246 0.52712 -2.224 0.031696 *
Income 0.02084 0.00546 3.817 0.000448 ***
Black -0.30346 0.40567 -0.748 0.458702
Female 1.12460 3.07908 0.365 0.716810
Price -2.78195 0.57818 -4.812 2.05e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
models. There is also significant reduction in the residual standard error (from 28.17
to 15.02). Furthermore, R2 has improved from 0.32 to 0.49.
This is not to say that we are done with the analysis. There are more checks, such
as checks for normality, heteroscedasticity, etc., that are pending.
3.13 Collinearity
Let us revisit Example 2.3 (cars) which we analyzed in Sects. 3.7 and 3.10. We
incorporate data on an additional variable, namely, the weights (WT) of these cars.
A linear regression of MPG on HP, HP_SQ, VOL, and WT is performed. The output
is given below.
Table 7.9 Output for the regression of MPG on HP, HP_SQ, VOL, and WT
> model5<-lm(MPG ~ HP + HP_sq + VOL + WT, data = Cars)
> summary(model5)
Call:
lm(formula = MPG ~ HP + HP_sq + VOL + WT, data = Cars)
Residuals:
    Min      1Q  Median      3Q     Max
-8.3218 -2.0723  0.5592  1.7386 10.9699
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.734e+01  2.126e+00  36.374  < 2e-16 ***
HP          -4.121e-01  3.099e-02 -13.299  < 2e-16 ***
HP_sq        9.053e-04  9.105e-05   9.943 2.13e-15 ***
VOL         -4.881e-02  3.896e-01  -0.125    0.901
WT          -1.573e-01  1.156e+00  -0.136    0.892
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Compare the output in Tables 7.5 and 7.9. The following points emerge:
(a) WT is insignificant, as noticed in Table 7.9.
(b) VOL, which is highly significant in Table 7.5, turns out to be highly insignificant
in Table 7.9. Thus, once we introduce WT, both VOL and WT become
insignificant, which looks very surprising.
(c) The coefficient estimates of VOL in Tables 7.5 and 7.9 (corresponding to the
models without and with WT, respectively) are −0.1018 and −0.0488, which are
quite far apart. Also, the corresponding standard errors are 0.01795 and 0.3896.
Thus, once we introduce the variable WT, the standard error of the coefficient
for VOL goes up by more than 20 times and the magnitude of the coefficient is
roughly halved.
(d) There is virtually no change in R square.
Since there is virtually no change in R square, it is understandable why WT
is insignificant. But why did VOL, which was highly significant before WT was
introduced, become highly insignificant once WT is introduced? Let us explore. Let
us look at the scatterplot matrix (Fig. 7.11).
One thing that is striking is that VOL and WT are almost perfectly linearly
related. So in the presence of WT, VOL has virtually no additional explaining
capacity for the variation in the residual part of MPG not already explained by WT.
The same holds for WT: it has no additional explaining capacity in the
presence of VOL. If both of them are in the list of regressors, both of them become
insignificant for this reason. Let us look at the added variable plots for VOL and WT
in the model corresponding to Table 7.9, which confirm the same thing (Fig. 7.12).
It is said that there is a collinear relationship among some of the regressors if one
of them has an almost perfect linear relationship with others. If there are collinear
relationships among regressors, then we say that there is the problem of Collinearity.
Fig. 7.11 Scatterplot matrix with variables MPG, HP, VOL, and WT
Why should we care if there is collinearity? What are the symptoms? How do we
detect collinearity and if detected, what remedial measures can be taken?
Some of the symptoms of collinearity are as follows:
(a) R square is high, but almost all the regressors are insignificant.
(b) Standard errors of important regressors are large, and important regressors
become insignificant.
(c) Very small changes in the data produce large changes in the regression
coefficient estimates.
We noticed (b) above in our MPG example.
How does one detect collinearity? If the collinear relation is between a pair of
regressors, it can be detected using the scatterplot matrix and the correlation matrix.
In the MPG example, we detected collinearity between VOL and WT this way.
Suppose that more than two variables are involved in a collinear relationship. In
order to check whether a regressor is involved in a collinear relationship, one can
use variance inflation factors (VIF).
The variance inflation factor for the ith regressor, denoted by $\mathrm{VIF}_i$, is defined
as the factor by which the variance of a single observation, $\sigma^2$, is multiplied to
get the variance of the regression coefficient estimate of the ith regressor. It can be
shown that $\mathrm{VIF}_i$ is the reciprocal of $1 - R_i^2$, where $R_i^2$ is the coefficient of multiple
determination of the ith regressor with the other regressors. The ith regressor is said
to be involved in a collinear relationship if $R_i^2$ is large. There is no unanimity on
how large is considered to be large, but 95% is an accepted norm. If $R_i^2$ is 95% or
larger, then $\mathrm{VIF}_i$ is at least 20.
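A minimal sketch of the reciprocal relation is given below: each regressor is regressed on the others and the resulting R square is turned into a VIF. The built-in `mtcars` data and the regressor names `hp`, `disp`, and `wt` are assumptions used for illustration; the cross-check against the inverse of the correlation matrix is a standard identity, not something taken from the text.

```r
# Stand-in data: built-in mtcars (an assumption, not the chapter's Cars data)
regs <- c("hp", "disp", "wt")

vif <- sapply(regs, function(v) {
  others <- setdiff(regs, v)
  aux <- lm(reformulate(others, response = v), data = mtcars)  # regressor v on the rest
  1 / (1 - summary(aux)$r.squared)                             # VIF = 1 / (1 - R_i^2)
})

# Cross-check: VIFs are the diagonal of the inverse correlation matrix of the regressors
vif_check <- diag(solve(cor(mtcars[, regs])))

involved <- names(vif)[vif >= 20]   # regressors beyond the cutoff of 20 quoted in the text
```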
Variance decomposition proportions (VP): Table 7.10 (see Belsley et al. 2005)
is sometimes used to identify the regressors involved in collinear relationships. We
shall explain below how to use the VP table operationally for the case where there
are four regressors and an intercept. The general case will follow similarly. For more
details and the theoretical basis, one can refer to BKW.
By construction, the sum of all the elements in each column starting from
column 3 (columns corresponding to the intercept and the regressors), namely,
$\sum_{j=0}^{4} \pi_{ji}$, is 1 for i = 0, 1, . . . , 4.
Algorithm
Step 0: Set i = 0.
Step 1: Check whether the condition index $c_{5-i}$ is not more than 30. If yes,
declare that there is no serious issue of collinearity to be dealt with and stop. If no,
go to Step 2.
Step 2: Note down the variables for which the π values in the (5 − i)th row are
at least 0.5. Declare these variables as variables involved in a collinear relationship.
Go to Step 3.
Step 3: Delete the row 5 − i from the table and recalibrate the π values in each
column corresponding to the intercept and the regressors so that the sum of each
such column is again 1.
Step 4: Replace i by i + 1 and go to Step 1.
When the algorithm comes to a stop, say, at i = 3, you have found 3 (i.e., i)
collinear relationships, one for each condition index that exceeded 30.
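The steps above can be sketched in base R as follows. This is an illustration under stated assumptions, not the book's own code: the condition indices and variance decomposition proportions are computed from the singular value decomposition of the design matrix with columns scaled to unit length, which is the usual construction in Belsley, Kuh, and Welsch; the built-in `mtcars` data and the names `vp`, `cond_index`, and `found` are hypothetical stand-ins.

```r
# Stand-in data: built-in mtcars (an assumption, not the chapter's Cars data)
fit <- lm(mpg ~ hp + disp + wt, data = mtcars)
X  <- model.matrix(fit)                        # intercept column included
Xs <- sweep(X, 2, sqrt(colSums(X^2)), "/")     # scale every column to unit length

sv <- svd(Xs)                                  # singular values in decreasing order
cond_index <- max(sv$d) / sv$d                 # condition indices, largest one last

phi <- sweep(sv$v^2, 2, sv$d^2, "/")           # phi[k, j] = v_kj^2 / d_j^2
vp  <- t(phi / rowSums(phi))                   # rows: condition indices, columns: coefficients
colnames(vp) <- colnames(X)                    # each column of vp sums to 1

# Steps 1-4: work upward from the largest condition index
tab <- vp
ci  <- cond_index
found <- list()
while (length(ci) > 1 && ci[length(ci)] > 30) {
  last <- nrow(tab)
  found[[length(found) + 1]] <- colnames(tab)[tab[last, ] >= 0.5]  # Step 2
  tab <- tab[-last, , drop = FALSE]                                # Step 3: delete the row
  tab <- sweep(tab, 2, colSums(tab), "/")                          # ... and recalibrate columns
  ci  <- ci[-last]                                                 # Step 4
}
```

Each element of `found` lists the variables in one collinear relationship; an empty list means no condition index exceeded 30.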
Let us return to the MPG example and the model that led us to the output in
Table 7.9. The VIFs and the variance decomposition proportions table are given in
Tables 7.11 and 7.12.
The VIFs of VOL and WT are very high (our cutoff value is about 20), and
thereby imply that each of VOL and WT is involved in collinear relationships. This
also explains what we already observed, namely, VOL and WT became insignificant
(due to large standard errors). The VIFs of HP and HP_SQ are marginally higher
than the cutoff.
From the variance decomposition proportions table, we see that there is a
collinear relationship between VOL and WT (condition index is 338.128 and the
relevant π values corresponding to VOL and WT are 1 and 0.999, respectively).
The next largest condition index is 29.667, which is just under 30. Hence, we can
conclude that we have only one mode of collinearity.
What remedial measures can be taken once the collinear relationships are
discovered?
Let us start with the easiest. If the intercept is involved in a collinear relationship,
subtract an easily interpretable value close to the mean of each of the other
regressors involved in that relationship and run the regression again. You will notice
that the regression coefficient estimates and their standard errors remain the same
as in the earlier regression. Only the intercept coefficient and its standard error will
change. The intercept will no longer be involved in the collinear relationship.
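The claim about centering can be checked with a small sketch. It assumes the built-in `mtcars` data as a stand-in; the values 150 and 3 are simply round numbers close to the means of `hp` and `wt` in that data set and are chosen for illustration.

```r
# Stand-in data: built-in mtcars (an assumption, not the chapter's data)
fit_raw <- lm(mpg ~ hp + wt, data = mtcars)
fit_ctr <- lm(mpg ~ I(hp - 150) + I(wt - 3), data = mtcars)   # shifted regressors

cols <- c("Estimate", "Std. Error")
slopes_raw <- summary(fit_raw)$coefficients[-1, cols]
slopes_ctr <- summary(fit_ctr)$coefficients[-1, cols]

# Slopes and their standard errors are unchanged by the shift
slope_gap <- max(abs(slopes_raw - slopes_ctr))

# Only the intercept row changes: estimate and standard error before and after
intercepts <- rbind(
  raw      = summary(fit_raw)$coefficients[1, cols],
  centered = summary(fit_ctr)$coefficients[1, cols]
)
```

After the shift the intercept is the estimated response at `hp = 150` and `wt = 3`, which is easier to interpret than the response at zero horsepower and zero weight.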
Consider one collinear relationship involving some regressors. One can delete the
regressor that has the smallest partial correlation with the response variable given
the other regressors. This takes care of this collinear relationship. One can repeat
this procedure with the other collinear relationships.
We describe below a few other procedures, stepwise regression, best subset
regression, ridge regression, and lasso regression, which are commonly employed
to combat collinearity in a blanket manner.
It may be noted that subset selection in the pursuit of an appropriate model is of
independent interest for various reasons, some of which we mention below.
(a) The “kitchen sink” approach of keeping many regressors may lead to collinear
relationships.
(b) Cost can be a consideration, and each regressor may add to the cost. Some
balancing may be needed.
(c) Ideally there should be at least ten observations per estimated parameter.
Otherwise one may find significances by chance. When the number of observations
is not large, one has to restrict the number of regressors also.
The criteria that we describe below for selecting a good subset are based on
the residual sum of squares. Let us assume that we have one response variable
and k regressors in our regression problem. Suppose we have already included r
regressors (r < k) into the model. We now want to introduce one more regressor from
among the remaining k – r regressors into the model. The following criteria place a
penalty for bringing in a new regressor. The coefficient of multiple determination,
R2 unfortunately never decreases when a new regressor is introduced. However,
adjusted R2 can decrease when a new regressor is introduced if it is not sufficiently
valuable. We introduce a few other criteria here for which the value of the criterion
increases unless the residual sum of squares decreases sufficiently by introducing
the new regressor, indicating that it is not worth introducing the regressor under
consideration. The current trend in stepwise regression is to start with the model in
which all the k regressors are introduced into the model and drop the regressors as
long as the criterion value decreases and stop at a stage where dropping a regressor
increases the criterion value. The object is to get to a subset of regressors for which
the criterion has the least value. We use the following notation:
N = The number of observations
k = The total number of regressors
r = The number of regressors used in the current model
$\sum_{i=1}^{N} (y_i - \bar{y})^2$ = The sum of squared deviations of the response variable from its mean
$R_{0r}^2$ = The sum of squared residuals when the specific subset of r regressors is used in the model
$R_{0k}^2$ = The sum of squared residuals when all the k regressors are used in the model
The major criteria used in this connection are given below:
(a) Adjusted $R^2$: $1 - \frac{N-1}{N-r} \cdot \frac{R_{0r}^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$ (see also Eq. 7.11)
(b) AIC: $\log\left(\frac{R_{0r}^2}{N}\right) + \frac{2r}{N}$
(c) BIC: $\log\left(\frac{R_{0r}^2}{N}\right) + \frac{r \log N}{N}$
(d) Mallow’s $C_p$: $\frac{R_{0r}^2}{R_{0k}^2/(N-k-1)} - N + 2r$
Note: The AIC, or Akaike Information Criterion, equals twice the negative of
the log-likelihood penalized by twice the number of regressors. This criterion has
general applicability in model selection. The BIC or Bayes Information Criterion is
similar but has a larger penalty than AIC and like AIC has wider application than
regression.
Among (a), (b), and (c) above, the penalty for introducing a new regressor is
in the ascending order. The criterion (d) compares the residual sum of squares of
the reduced model with that of the full model. One considers the subset models for
which Cp is close to r and chooses the model with the least number of regressors
from among these models.
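As a hedged illustration of criteria (b) to (d), the sketch below evaluates them for one subset model. It makes several assumptions: the built-in `mtcars` data stands in for the chapter's data; AIC and BIC are taken in the per-observation forms log(RSS/N) + 2r/N and log(RSS/N) + r log(N)/N; and r counts regressors without the intercept. R's `extractAIC()`, which `step()` reports, uses N log(RSS/N) plus twice the number of estimated coefficients including the intercept, so it differs from N times the AIC below by a constant and ranks models of equal size identically.

```r
# Stand-in data: built-in mtcars (an assumption, not the chapter's data)
full <- lm(mpg ~ hp + disp + wt + qsec, data = mtcars)   # all k = 4 regressors
sub  <- lm(mpg ~ hp + wt, data = mtcars)                 # subset with r = 2 regressors

N <- nobs(full)
k <- 4
r <- 2
rss_r <- sum(resid(sub)^2)     # residual sum of squares of the subset model
rss_k <- sum(resid(full)^2)    # residual sum of squares of the full model

aic <- log(rss_r / N) + 2 * r / N
bic <- log(rss_r / N) + r * log(N) / N
cp  <- rss_r / (rss_k / (N - k - 1)) - N + 2 * r

# Link to R's scale: N * aic counts r parameters, extractAIC counts r + 1
aic_r_scale <- N * aic + 2
```

With N = 32, log(N) exceeds 2, so the BIC penalty per regressor is larger than the AIC penalty, in line with the ordering of penalties described in the text.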
We illustrate the stepwise procedure with the cigarette consumption example
(Example 7.1) using the criterion AIC. We give the R output as in Table 7.13.
The first model includes all the six regressors, and the corresponding AIC is
266.56. In the next stage one of the six regressors is dropped at a time, keeping all
other regressors, and the AIC value is noted. When Female is dropped, keeping all
other regressors intact, the AIC is 264.71. Likewise, when Age is dropped, the AIC
is 265.79, and so on. We notice the least AIC corresponds to the model dropping
Female. In the next stage, the model with the remaining five regressors is considered.
Again, the procedure of dropping one regressor from this model is considered and
the corresponding AIC is noted. (The dropped regressor is brought back and its
AIC is also noted. In the case of AIC this is not necessary because this AIC value is
already available in a previous model. However, for the case of Adj. R2 this need not
necessarily be the case.) We find that the least AIC equaling 263.23 now corresponds
to the model which drops Black from the current model with the five regressors. The
procedure is repeated with the model with the four regressors. In this case dropping
any variable from this model yields an increased AIC. The stepwise procedure stops
here. Thus, the stepwise method yields the model with the four regressors Age, HS,
Income, and Price. The corresponding estimated model is given below.
Compare Tables 7.14 and 7.8. We notice that the significance levels of HS,
Income, and Price have remained the same (in fact, the p-values are slightly smaller
in the subset model). Age, which was insignificant in the full model (Table 7.8),
is now (Table 7.14) significant at the 5% level (p-value 0.039). So when some unnecessary
regressors are dropped, some of the insignificant regressors may become significant.
It may also be noted that while R2 dropped marginally from 0.4871 in the full model
to 0.4799 in the subset model, there is a clear
increase in adjusted R2 from 0.412 in the full model to 0.4315 in the subset model.
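The elimination procedure described above can be sketched in R with the built-in step() function, which uses AIC when its penalty is left at the default. The data below are simulated purely for illustration; the variable names x1 to x4 and the coefficients are assumptions and have nothing to do with the cigarette data.

```r
# Illustrative sketch on simulated data (not the cigarette data set).
set.seed(1)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n); x4 <- rnorm(n)
y  <- 2 + 3 * x1 - 2 * x2 + rnorm(n)   # x3 and x4 are pure noise
sim <- data.frame(y, x1, x2, x3, x4)

full <- lm(y ~ x1 + x2 + x3 + x4, data = sim)

# Backward elimination: at each stage drop the regressor whose removal
# lowers AIC the most; stop when no removal lowers AIC.
selected <- step(full, direction = "backward", trace = 0)

aic_full     <- extractAIC(full)[2]      # AIC on the scale reported by step()
aic_selected <- extractAIC(selected)[2]
formula(selected)
```

Setting direction = "both" would additionally allow a previously dropped regressor to re-enter, which is closer to the stepwise procedure in the text.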
Start: AIC=266.56
Sales ~ Age + HS + Income + Black + Female + Price
Step: AIC=264.71
Sales ~ Age + HS + Income + Black + Price
Step: AIC=263.23
Sales ~ Age + HS + Income + Price
Call:
lm(formula = Sales ~ Age + HS + Income + Price, data = Cig_data)
Residuals:
Min 1Q Median 3Q Max
-40.196 -8.968 -1.563 8.525 36.117
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 124.966923 37.849699 3.302 0.001940 **
Age 2.628064 1.232730 2.132 0.038768 *
HS -0.894433 0.333147 -2.685 0.010267 *
Income 0.019223 0.004915 3.911 0.000322 ***
Price -2.775861 0.567766 -4.889 1.45e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In such a case, stepwise regression can be employed. For the cigarette consumption
data, the best subset regression using AIC leads to the same regressors as in the
stepwise regression using AIC.
Ridge Regression and Lasso Regression
In the least squares method (see estimation of parameters in Sects. 3.6 and 3.7),
we estimate the parameters by minimizing the error sum of squares. Ridge regression
and lasso regression minimize the error sum of squares subject to constraints
that place an upper bound on the magnitude of the regression coefficients. Ridge
regression minimizes the error sum of squares subject to $\sum_i \beta_i^2 \le c$, where c is
a constant. Lasso regression (Tibshirani 1996) minimizes the error sum of squares
subject to $\sum_i |\beta_i| \le d$, where d is a constant. Both these methods are based on
the idea that the regression coefficients are bounded in practice. In ridge regression
all the regressors are included in the regression and the coefficient estimates are
nonzero for all the regressors. However, in lasso regression it is possible that some
regressors may get omitted. It may be noted that both these methods yield biased
estimators which have some interesting optimal properties under certain conditions.
The estimates for the regression coefficients for the Cigarette consumption data
are given in Table 7.15.
In ridge regression, all the regressors have nonzero coefficient estimates which
are quite different from those obtained in Table 7.8 or Table 7.14. The signs match,
however. Lasso regression drops the same variables as in stepwise regression and
best subset regression. The coefficient estimates are also not too far off.
In practice, it is better to consider stepwise/best subset regression and lasso
regression and compare the results before a final subset is selected.
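To make the shrinkage idea concrete, here is a minimal base-R sketch of ridge regression. It relies on the standard fact that the constrained problem is equivalent to adding a penalty lambda times the sum of squared coefficients to the error sum of squares, which has a closed-form solution. The data are simulated, and the names ridge_coef and lambda are introduced here only for illustration; this is not the code used to produce Table 7.15.

```r
# Illustrative sketch on simulated data; ridge via its closed-form solution.
set.seed(2)
n <- 60
X <- matrix(rnorm(n * 3), n, 3)
X[, 3] <- X[, 1] + 0.1 * rnorm(n)          # make two columns nearly collinear
y <- drop(1 + X %*% c(2, -1, 0.5) + rnorm(n))

Xs <- scale(X)                  # centre and scale the regressors
yc <- y - mean(y)               # centre the response (no intercept needed)

ridge_coef <- function(lambda) {
  p <- ncol(Xs)
  # Solve (X'X + lambda * I) b = X'y
  drop(solve(crossprod(Xs) + lambda * diag(p), crossprod(Xs, yc)))
}

b_ols   <- ridge_coef(0)        # lambda = 0 reproduces least squares
b_ridge <- ridge_coef(10)       # lambda > 0 shrinks the coefficients

sum(b_ols^2)                    # squared length of the least squares estimate
sum(b_ridge^2)                  # smaller squared length under ridge
```

Lasso has no such closed form, which is why a package such as glmnet is normally used for it.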
Lasso regression
> lasso.mod <- glmnet(as.matrix(Cig_data[,-7]), as.matrix(Cig_data[,7]), alpha = 1, lambda = lambda)
> predict(lasso.mod, s = 0, Cig_data[,-7], type = 'coefficients')[1:6,]
Gender discrimination in wages is a highly debated topic. Does a man having the
same educational qualification as a woman earn a higher wage on average? In order
to study the effect of gender on wage, controlling for the educational level, we use data
Call:
lm(formula = Wage ~ Education + Female, data = female)
Residuals:
Min 1Q Median 3Q Max
-12.440 -3.603 -1.353 1.897 91.603
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.85760 0.51279 -1.672 0.0945 .
Education 0.95613 0.04045 23.640 <2e-16 ***
Female -2.26291 0.15198 -14.889 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
on hourly average wage, years of education, and gender on 8546 adults collected by
Current Population Survey (1994), USA. Gender is a qualitative attribute. How does
one estimate the effect of gender on hourly wages? An individual can either be male
or female with no quantitative dimension. We need a quantifiable variable that can
be incorporated in the multiple regression framework, indicating gender. One way
to “quantify” such attributes is by constructing an artificial variable that takes on
values 1 or 0, indicating the presence or absence of the attribute. We can use 1 to
denote that the person is a female and 0 to represent a male. Such a variable is
called a Dummy Variable. A Dummy Variable is an indicator variable that reveals
(indicates) whether an observation possesses a certain characteristic or not. In other
words, it is a device to classify data into mutually exclusive categories such as male
and female.
We create the dummy variable called “female,” where female = 1, if gender is
female and female = 0, if gender is male. Let us write our regression equation:
Y = β0 + β1 X + β2 female + ε,
Fig. 7.13 Parallel regression lines for the wages of males and females (x-axis: Years of Education; y-axis: Average Wage)
Table 7.17 Wage equation with interaction term between female and education
> dummy_reg2 <-lm(Wage ~ Education + Female + Female.Education, data=female)
> summary(dummy_reg2)
Call:
lm(formula = Wage ~ Education + Female + Female.Education, data = female)
Residuals:
Min 1Q Median 3Q Max
-12.168 -3.630 -1.340 1.904 91.743
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.07569 0.69543 0.109 0.913
Education 0.88080 0.05544 15.887 < 2e-16 ***
Female -4.27827 1.02592 -4.170 3.07e-05 ***
Female.Education 0.16098 0.08104 1.986 0.047 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Thus, the interaction term is the product of the dummy variable and education.
Female and Education interact to produce a new variable Female * Education. For
the returns to education data, here is the regression output when we include the
interaction term (Table 7.17).
We notice that the interaction effect is positive and significant at the 5% level.
What does this mean?
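The predicted wage for a male (Female = 0) is given by
Wage = 0.07569 + 0.88080 Education,
and that for a female (Female = 1) by
Wage = (0.07569 − 4.27827) + (0.88080 + 0.16098) Education = −4.20258 + 1.04178 Education.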
Notice that for the two regression equations, both the slope and the intercepts are
different!
Good news for the feminist school! An additional year of education is worth more
to females because β3 = 0.16098 > 0. An additional year of education is worth about
$0.88 extra hourly wage for men and about $1.04 (0.88080 + 0.16098 = 1.04178)
for women. A man with 10 years of education earns
$(0.07569 + 0.88080 × 10) = $8.88369 average hourly wage. A woman with
10 years of education earns $(−4.20258 + 1.04178 × 10) = $6.21522 average
hourly wage.
Thus, we see that there are two effects at work: (a) the female wage-dampening
effect (through a lower intercept for women) across education and (b) a narrowing of
the gap in wage with years of education. This is depicted visually in Fig. 7.14.
It appears from the above figure that women start earning more than men starting
from 27 years of education. Unfortunately, this conclusion cannot be drawn from
the data on hand as the maximum level of education in our data is 18 years.
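The arithmetic behind these statements can be checked with a few lines of R. The coefficients are copied from Table 7.17; the function name predict_wage is introduced here only for illustration.

```r
# Coefficients copied from Table 7.17
b0    <- 0.07569    # intercept
b_edu <- 0.88080    # Education
b_fem <- -4.27827   # Female
b_int <- 0.16098    # Female.Education

# Fitted hourly wage for a given education level and gender (female = 0 or 1)
predict_wage <- function(edu, female) {
  b0 + b_edu * edu + b_fem * female + b_int * female * edu
}

wage_man   <- predict_wage(10, female = 0)   # man with 10 years of education
wage_woman <- predict_wage(10, female = 1)   # woman with 10 years of education

# The two fitted lines cross where b_fem + b_int * edu = 0
crossover <- -b_fem / b_int
```

The crossover falls between 26 and 27 years of education, well outside the observed range, which is why the extrapolation is not warranted.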
So far we have considered the case of two categories. The returns to education
data set considered in this section also contains the variable “PERACE” (race).
An individual can come from five different races: WHITE, BLACK, AMERICAN
INDIAN, ASIAN OR PACIFIC ISLANDER, and OTHER. The question under consideration
is: Is there also racial discrimination in wages? How do we model race as a
regressor in the wage determination equation? Clearly, one dummy variable taking
values 0, 1 will not work! One possibility is that we assign five different dummy
variables for the five races: D1 = 1 if White and 0 otherwise; D2 = 1 if Black
and 0 otherwise; D3 = 1 if American Indian and 0 otherwise; D4 = 1 if Asian
or Pacific Islander and 0 otherwise; and D5 = 1 if Other and 0 otherwise.
The regression output after introducing these dummies into the model is as given
in Table 7.18.
What are the NAs corresponding to “Other”? Clearly, there is some problem
with our method. R did not compute estimates for the “Other” dummy, citing the
reason “1 not defined because of singularities.” The issue is the following:
for every individual, one and only one of D1–D5 is 1 and the rest are all 0. Hence the
sum of D1–D5 is 1 for each individual, and there is perfect collinearity between
the intercept and the dummies D1–D5.
What is the solution? When you have n categories, assign either (a) n dummies
and no intercept or (b) (n − 1) dummies and an intercept. In our example, for the
five races, either assign four dummies and an intercept or assign five dummies
but no intercept.
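The collinearity problem and its two remedies can be reproduced on simulated data. The sketch below does not use the wage data; the race labels, the sample size, and the wage values are made up for illustration.

```r
# Illustrative sketch on simulated data (not the wage data set).
set.seed(3)
n <- 200
lev  <- c("White", "Black", "AmIndian", "Asian", "Other")
race <- factor(sample(lev, n, replace = TRUE), levels = lev)
wage <- 10 + rnorm(n)

# One 0/1 column per category: D1 = White, ..., D5 = Other
D <- model.matrix(~ race - 1)
colnames(D) <- paste0("D", 1:5)
dat <- data.frame(wage, D)

# Five dummies plus an intercept: perfectly collinear, so one estimate is NA
fit_trap <- lm(wage ~ D1 + D2 + D3 + D4 + D5, data = dat)

# Remedy (a): five dummies and no intercept
fit_a <- lm(wage ~ 0 + D1 + D2 + D3 + D4 + D5, data = dat)

# Remedy (b): four dummies and an intercept ("Other" is the base category)
fit_b <- lm(wage ~ D1 + D2 + D3 + D4, data = dat)

coef(fit_trap)
```

Remedies (a) and (b) give identical fitted values; only the interpretation of the coefficients changes.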
R automatically inputted four dummies to denote the five race categories and one
dummy to denote gender. If female = 0, and all four race dummies (D1, D2, D3,
D4) = 0, then estimated regression equation is
Call:
lm(formula = Wage ~ Education + Female + D1 + D2 + D3 + D4 + D5, data = female)
Residuals:
Min 1Q Median 3Q Max
-12.527 -3.615 -1.415 1.963 92.124
Thus, the intercept here denotes the effect of all the excluded categories—that is,
the effect of the base category “Other male.” All the dummies measure the effect
over and above the base category “Other male.” Looking at the R-output in Table
7.18, we infer that none of the race dummies is significant. (White is just about
significant at the 10% level.) Whether you are white or black or any other race does
not affect your wages. No racial discrimination in wages! But since the coefficient
female is negative in estimate and highly significant, there is gender discrimination
in wages!
Consider the “Education” variable. Till now, we were estimating the incremental
effect of an additional year of education on wages. Moving the education level from
class 5 to class 6 is not very likely to make a difference to wages. Rather,
going from school to college or college to higher education may make a difference.
Perhaps a more sensible way to model education is to group it into categories as
shown in Table 7.19.
The categories defined here (school, college, and higher education) are also
qualitative but they involve an ordering—a college graduate is higher up the
Call:
lm(formula = Wage ~ Female + College + Higher_Ed + Female.College +
Female.Higher_Ed, data = female)
Residuals:
Min 1Q Median 3Q Max
-13.694 -3.688 -1.539 2.068 91.112
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.1322 0.1440 70.356 < 2e-16 ***
Female -2.2438 0.2049 -10.950 < 2e-16 ***
College 1.7267 0.2281 7.570 4.12e-14 ***
Higher_Ed 6.0848 0.7416 8.205 2.65e-16 ***
Female.College 0.3164 0.3144 1.006 0.314
Female.Higher_Ed 0.6511 1.0940 0.595 0.552
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
education ladder than a school pass out and one with a degree higher than a college
degree is still higher up. To incorporate the effect of education categories, ordinary
dummy variables are not enough. The effect of College on wages is over and above
the effect of schooling on wages. The effect of “Higher Education” on wages will
be some notches higher than that of college education on wages. Assign dummy
variables as shown in Table 7.20.
The output after incorporating these dummies, usually called ordinal dummies, is
shown in Table 7.21.
How do we interpret this output? For men (Female = 0), the increment in the hourly wage for
completing college education over high school is $1.7267 and that for completing
higher education over a college degree is $6.0848. Both these are highly significant
(based on the p-values).
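A small sketch of how ordinal dummies can be coded and read is given below. We assume cumulative coding (each dummy switches on at a threshold and stays on), with school taken as at most 12 years of education; the cutoff of 16 years between college and higher education, and the five education values, are assumptions made only for this illustration. The coefficients are copied from Table 7.21.

```r
education <- c(8, 12, 14, 16, 18)          # hypothetical years of education

# Ordinal (cumulative) dummies
College   <- as.integer(education > 12)    # college or beyond
Higher_Ed <- as.integer(education > 16)    # beyond a college degree (assumed cutoff)

coding <- data.frame(education, College, Higher_Ed)

# Fitted hourly wage for men (Female = 0) using Table 7.21
wage_male <- 10.1322 + 1.7267 * College + 6.0848 * Higher_Ed
cbind(coding, wage_male)
```

With this coding each coefficient is the step up from the previous rung of the ladder, so the higher education group receives both increments.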
For identifying that there is a dummy in play using residual plots see Exercise
7.4.
For an interesting application of dummy variables and interactions among them,
see Exercise 7.5.
For the use of interaction between two continuous variables in linear regression,
see the chapter on marketing analytics (Chap. 19).
One of the assumptions we made in the linear regression model (Sect. 3.5) is that
the errors are normally distributed. Do the data on hand and the model proposed
support this assumption? How does one check this? As we mentioned several
times in Sect. 3.7 and later, the residuals of different types are the representatives
of the errors. We describe below one visual way of examining the normality of
errors, known as the Q-Q plot of the studentized residuals (see (7.43) for the definition
of a studentized residual). In a Q-Q plot, Q stands for quantile. First, order the
observations of a variable of interest in the ascending order. We recall that the first
quartile is that value of the variable below which 25% of the ordered observations lie
and above which 75% of the ordered observations lie. Let p be a fraction such that
0 < p < 1. Then the pth quantile is defined as that value of the variable below which a
proportion p of the ordered observations lie and above which a proportion 1 − p of
the ordered observations lie. In the normal Q-Q plot of the studentized residuals, the
quantiles of the studentized residuals are plotted against the corresponding quantiles
of the standard normal distribution. This plot is called the normal Q-Q plot of the
studentized residuals. Let the ith quantile of studentized residuals (often referred to
as sample quantile) be denoted by qi and the corresponding quantile of the standard
normal distribution (often referred to as theoretical quantile) be denoted by ti . If the
normality assumption holds, in the ideal case, the points (ti, qi), i = 1, ..., N fall on a straight
line. Since we are dealing with a sample, we cannot expect all the points to fall exactly on
a straight line. So a confidence band around the ideal straight line is also plotted. If
the points go out of the band, then there is a concern regarding normality. For more
details regarding Q-Q plots, see the discussion on Stack Exchange (footnote 2).
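The construction of the normal Q-Q plot can be sketched in base R as follows. The data are simulated with normal errors, so the points should lie close to a straight line; the plotting positions returned by ppoints() are one common choice and are an assumption here, and the confidence band is not drawn.

```r
# Illustrative sketch on simulated data with normal errors.
set.seed(4)
n <- 100
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

q_sample <- sort(rstudent(fit))     # ordered studentized residuals (the q_i)
q_theory <- qnorm(ppoints(n))       # matching standard normal quantiles (the t_i)

plot(q_theory, q_sample,
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")
abline(0, 1)                        # reference line expected under normality

qq_cor <- cor(q_theory, q_sample)   # close to 1 when the points are nearly linear
```

The built-in qqnorm() function produces essentially the same picture from the residuals in one call.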
We give below the Q-Q plots of the studentized residuals for different models
related to Example 2.1. For Example 2.1, we consider the models AT vs WC, $\sqrt{AT}$
vs WC, and log AT vs WC and $WC^2$. (We shall explain in the case study later why
the latter two models are of importance.) The Q-Q plots are given in Fig. 7.15.
Compare the three plots in Fig. 7.15. We find that AT vs WC is not satisfactory
as several points are outside the band. The plot of $\sqrt{AT}$ vs WC is somewhat better
and that of log AT vs WC, $WC^2$ is highly satisfactory.
There are also formal tests for testing for normality. We shall not describe them
here. The interested reader may consult Thode (2002). A test tells us whether the
null hypothesis of normality of errors is rejected. However, the plots are helpful in
taking the remedial measures, if needed.
If the residuals support normality, then do not make any transformation, as a
nonlinear transformation of a normally distributed variable generally does not yield
a normal distribution.
On the other hand, if the Q-Q plot shows significant deviation from normality, it may
be due to one or more of several factors, some of which are given below:
where $\tilde{Y} = (y_1 \times y_2 \times \cdots \times y_N)^{1/N}$ is the geometric mean of the data on the response
variable.
With this transformation, the model will now be Y(λ) = Zβ + ε. This model has
an additional parameter λ over and above the parametric vector β. Here the errors
are minimized over β and λ by the method of least squares. In practice, for various
values of λ in (−2, 2), the parametric vector β is estimated and the corresponding
estimated log-likelihood is computed. The values of the estimated log-likelihood (y-axis)
are plotted against the corresponding value of λ. The value of λ at which the
estimated log-likelihood is maximum is used in (7.45) to compute the transformed
response variable. Since the value of λ is estimated from a sample, a confidence
interval for λ is also obtained. In practice, that value of λ in the confidence interval
is selected which is easily interpretable. We shall illustrate this with Example 2.1
where the response variable is chosen as AT and the regressor as WC. The plot of
log-likelihood against λ is given below (Fig. 7.16).
Notice that the value of λ at which the log-likelihood is the maximum is 0.3434.
This is close to 1/3. Still it is not easily interpretable. But 0.5 is also in the
confidence interval and this corresponds to the square-root which is more easily
interpretable. One can use this and make the square-root transformation on the AT
variable.
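The grid search over λ described above can be sketched in base R. We assume the usual geometric-mean-scaled form of the Box-Cox transformation, under which the log-likelihood is, up to a constant, −(N/2) log(RSS/N). The data are simulated so that the square root of the response is linear in the regressor; they are not the AT and WC data, and the function name boxcox_loglik is introduced only for illustration.

```r
# Illustrative sketch on simulated data (sqrt(y) is linear in x by construction).
set.seed(5)
n <- 200
x <- runif(n, 1, 10)
y <- (5 + 2 * x + rnorm(n))^2

gm <- exp(mean(log(y)))              # geometric mean of the response

boxcox_loglik <- function(lambda) {
  # Geometric-mean-scaled Box-Cox transform of y
  if (abs(lambda) < 1e-8) {
    y_l <- gm * log(y)
  } else {
    y_l <- (y^lambda - 1) / (lambda * gm^(lambda - 1))
  }
  rss <- sum(resid(lm(y_l ~ x))^2)
  -n / 2 * log(rss / n)              # log-likelihood up to an additive constant
}

lambdas    <- seq(-2, 2, by = 0.01)
ll         <- sapply(lambdas, boxcox_loglik)
lambda_hat <- lambdas[which.max(ll)]

plot(lambdas, ll, type = "l", xlab = "lambda", ylab = "log-likelihood")

# Approximate 95% confidence interval: values of lambda whose log-likelihood
# is within qchisq(0.95, 1) / 2 of the maximum
ci <- range(lambdas[ll > max(ll) - qchisq(0.95, 1) / 2])
```

In practice one would then pick an interpretable value inside ci, as the text does with the square root.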
3.16 Heteroscedasticity
One of the assumptions in the least squares estimation and testing was that of equal
variance of the errors (E in LINE). If this assumption is violated, then the errors do
not have equal status and the standard least squares method is not quite appropriate.
The unequal variance situation is called heteroscedasticity. We shall talk about the
sources for heteroscedasticity, the methods for detecting the same, and finally the
remedial measures that are available.
Consider AT–Waist problem (Example 2.1). Look at Fig. 7.2 and observation (ii)
following the figure. We noticed that the variation in adipose tissue area increases
with increasing waist circumference. This is a typical case of heteroscedasticity.
We shall now describe some possible sources for heteroscedasticity as given in
Gujarati et al. (2013).
(a) Error-learning models: As the number of hours put in typing practice increases,
the average number of typing errors as well as their variance decreases.
(b) As income increases, not only savings increases but the variability in savings
also increases—people have more choices with their income than to just save!
(c) Error variance changes with values of X due to some secondary issue.
[Figure: scatterplot titled “Waist vs. log(AT)”]
(d) Omitted variable: Due to the omission of some relevant regressor which is
correlated with a regressor in the model, the omitted regressor remains in the
error part and hence the error demonstrates a pattern when plotted against X—
for example, in the demand function of a commodity, if you specify its own
price but not the price of its substitutes and complement goods available in the
market.
(e) Skewness in distribution: Distribution of income and education—bulk of income
and wealth concentrated in the hands of a few.
(f) Incorrect data transformation, incorrect functional form specification (say linear
X instead of Quadratic X, the true relation).
How does one detect heteroscedasticity? If you have a single regressor as
in the case of Example 2.1 one can examine the scatterplot. If there are more
regressors, one can plot the squared residuals against the fitted values and/or
the regressors. If the plot is a random scatter, then you are fine with respect to
the heteroscedasticity problem. Otherwise the pattern in the plot can give a clue
regarding the nature of heteroscedasticity. For details regarding this and for formal
tests for heteroscedasticity, we refer the reader to Gujarati et al. (2013).
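The plot of squared residuals against fitted values can be sketched as follows. The data are simulated so that the error spread grows with the regressor; the correlation computed at the end is only a crude numerical companion to the plot and is not one of the formal tests referred to above.

```r
# Illustrative sketch on simulated heteroscedastic data.
set.seed(6)
n <- 200
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)   # error standard deviation grows with x

fit    <- lm(y ~ x)
res_sq <- resid(fit)^2

plot(fitted(fit), res_sq,
     xlab = "Fitted values", ylab = "Squared residuals")

# A random scatter would give a value near 0; here a positive trend is expected
trend <- cor(res_sq, fitted(fit))
```

A fan-shaped pattern in the plot, as here, points to error variance that increases with the level of the response.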
Coming back to Example 2.1, one way to reduce the variation among adipose
tissue values is by transforming AT to Log AT. Let us look at the scatterplot of Log
AT vs Waist (Fig. 7.17).
We notice that the variation is more or less uniform across the range of Waist.
However, we notice that there is a parabolic relationship between Log AT and Waist.
So we fit a linear regression of Log AT on Waist and the square of Waist. The output
is given below in Table 7.22.
The corresponding normal Q-Q plot is shown in Fig. 7.18.
Table 7.22 Regression output for the linear regression of Log AT on Waist and Square of Waist
> modellog<-lm(log(AT)~ Waist + Waist_sq, data=wc_at)
> summary(modellog)
Call:
lm(formula = log(AT) ~ Waist + Waist_sq, data = wc_at)
Residuals:
Min 1Q Median 3Q Max
-0.69843 -0.20915 0.01436 0.20993 0.90573
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.8240714 1.4729616 -5.312 6.03e-07 ***
Waist 0.2288644 0.0322008 7.107 1.43e-10 ***
Waist_sq -0.0010163 0.0001731 -5.871 5.03e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Fig. 7.18 Normal Q-Q plot of the standardized residuals in the linear regression of Log AT on
Waist and Square of Waist
The corresponding plot of fitted values against residuals is given in Fig. 7.19.
Thus, both the plots tell us that the model is reasonable.
Alternatively, one can look at the linear regression of AT on Waist and look at
the plot of squared residuals on Waist which is given in Fig. 7.20.
Fig. 7.19 Plot of fitted values vs residuals in the linear regression of log AT on Waist and Square
of Waist
Call:
lm(formula = ATdWaist ~ Waistin, data = wc_at)
Residuals:
Min 1Q Median 3Q Max
-0.91355 -0.21062 -0.02604 0.17590 0.84371
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.5280 0.2158 16.35 <2e-16 ***
Waistin -222.2786 19.1967 -11.58 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The linear regression model is developed using the sample on hand. Using the
residual sum of squares, R2 , and other measures that we discussed, we can examine
the performance of the model on the sample. But how does it perform on the
population in general? How can we get an idea regarding this? For this purpose,
if we have a reasonably large data set, say about 100 observations or more, about 20–30%
of randomly selected observations from the sample are kept aside. These data
are called validation data. The remaining data are called training data. The model
is developed using the training data. The developed model’s performance on the
validation data is then checked using one or more of the following measures:
(a) Root Mean Square Error (RMSE)
We have the residual sum of squares, $R_0^2$, for the training data set. Using
the developed model, predict the response variable for each of the observations
in the validation data and compute the residual. Look at the residual sum
of squares, $R_0^2(V)$, for the validation data set. Let N1 and N2 denote the
number of observations in the training and validation data sets, respectively.
The square root of the residual sum of squares per observation is called the RMSE.
Let RMSE(T) = $\sqrt{R_0^2/N_1}$ and RMSE(V) = $\sqrt{R_0^2(V)/N_2}$ denote the RMSE
for the training and validation data sets. Compare
the RMSE for the training and validation data sets by computing
|RMSE(V) − RMSE(T)|/RMSE(T). If this is large, then the model does not
fit well to the population. A thumb rule is to use a cutoff of 10%.
(b) Comparison of R2
Define the achieved R2 for the validation data set as
$R^2(V) = 1 - \frac{R_0^2(V)}{s^2}$, where $s^2$ is the sum of squared deviations of the response
variable part of the validation data from their mean. Compare the achieved R2
with R2 of the training data set in the same manner as the RMSE above.
(c) Compare the box plots of the residuals in the training and validation data sets.
If we have a small data set, we can use the cross-validation method described
as follows. Keep one observation aside and predict the response variable of this
observation based on the model developed from the remaining observations. Get
the residual. Repeat this process for all the observations. Let the sum of squared
residuals be denoted by $R_0^2(C)$. Compute the cross-validation R2 as
$R^2(C) = 1 - \frac{R_0^2(C)}{\sum_i (y_i - \bar{y})^2}$.
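The validation measures above can be computed with a few lines of base R. The sketch below uses simulated data and a 25% hold-out; the helper name rmse and all variable names are introduced here for illustration.

```r
# Illustrative sketch on simulated data.
set.seed(7)
n <- 150
x <- runif(n, 0, 10)
dat <- data.frame(x = x, y = 1 + 2 * x + rnorm(n))

# Keep about 25% of the observations aside as validation data
val_id <- sample(n, size = round(0.25 * n))
train  <- dat[-val_id, ]
valid  <- dat[val_id, ]

fit    <- lm(y ~ x, data = train)
pred_v <- predict(fit, newdata = valid)

# (a) RMSE on training and validation data, and their relative difference
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse_t     <- rmse(train$y, fitted(fit))
rmse_v     <- rmse(valid$y, pred_v)
rel_change <- abs(rmse_v - rmse_t) / rmse_t    # compare with the 10% thumb rule

# (b) Achieved R^2 on the validation data
r2_v <- 1 - sum((valid$y - pred_v)^2) / sum((valid$y - mean(valid$y))^2)

# Leave-one-out cross-validation R^2 (useful for small data sets)
loo_res <- sapply(seq_len(n), function(i) {
  fit_i <- lm(y ~ x, data = dat[-i, ])
  dat$y[i] - predict(fit_i, newdata = dat[i, ])
})
r2_c <- 1 - sum(loo_res^2) / sum((dat$y - mean(dat$y))^2)
```

The cross-validation R2 can never exceed the ordinary R2 of the model fitted to all the observations, since each left-out residual is at least as large in magnitude as the corresponding in-sample residual.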
4 FAQs
1. I have fitted a linear regression model to my data and found that R2 = 0.07.
Should I abandon performing regression analysis on this data set?
Answer: There are several points to consider.
(a) If R2 is significantly different from 0 (based on the F-test), and if the
assumptions are not violated, then one can use the model to study the impact
of a significant regressor on the response variable when other regressors are
kept constant.
(b) If R2 is not significantly different from 0, then this model is not useful. This
is not to say that you should abandon the data set.
(c) If the object is to predict the response variable based on the regressor
knowledge of a new observation, this model performs poorly.
(d) Suitable transformations may improve the model performance including R2 .
(See the Exercise 7.1.) The basic principle is to look for simple models, as
more complicated models tend to over-fit to the data on hand.
2. I have data for a sample on a response variable and one regressor. Why should I
bother about regression which may not pass through any point when I can do a
polynomial interpolation which passes through all the points leading to a perfect
fit to the data?
Answer: The data is related to a sample and our object is to develop a model
for the population. While it is true that the polynomial interpolation formula
is an exact fit to the data on hand, the formulae will be quite different if we
bring in another observation or drop an observation. Thus, the model that we
develop is not stable and is thus not suitable for the problem on hand. Moreover,
the regressor considered is unlikely to be the only variable which impacts
the response variable. Thus, there is error intrinsically present in the model.
Interpolation ignores this. This is one of the reasons why over-fitting leads to
problems.
3. I have performed the linear regression and found two of the regressors are
insignificant. Can I drop them?
Answer: There can be several reasons for a regressor to be insignificant:
(a) This regressor may be involved in a collinear relationship. If some other
regressor, which is also insignificant, is involved in this linear relationship,
you may find the regressor insignificant. (See Table 7.9 where both VOL and
WT are insignificant. Drop the variable WT and you will find that VOL is
significant.)
(b) A variable in the current regression may not be significant. It is possible that
if you bring in a new regressor, then in the presence of the new regressor, this
variable may become significant. This is due to the fact that the correlation
coefficient between a regressor and the response variable can be smaller than the
partial correlation coefficient between the same two variables fixing another
variable. (See Models 3 and 4 in Exercise 7.4.)
(c) Sometimes some unusual observations may make a regressor insignificant.
If there are reasons to omit such observations based on domain knowledge,
you may find the regressor significant.
(d) Sometimes moderately insignificant regressors are retained in the model for
parsimony.
(e) Dropping a regressor may be contemplated as a last resort after exhausting
all possibilities mentioned above.
4. I ran a regression of sales on an advertisement and some other regressors and
found the advertisement’s effect to be insignificant and this is not intuitive. What
should I do?
Answer: Examine points (a)–(c) in (3) above. Sometimes model misspecifica-
tion can also lead to such a problem. After exhausting all these possibilities, if
you still find the problem then you should examine whether your advertisement
is highly uninspiring.
5. I have performed a linear regression. The residual plots suggested that I need to
bring in the square term of a regressor also. Once I included the
square term, I found from the variance decomposition proportions table that
there is collinearity among the intercept, the regressor, and its square. What
should I do?
Answer: Subtract an interpretable value close to the mean from the regressor
and repeat the same with its square term. Usually this problem gets solved.
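The effect of this centering can be seen in a small simulated example. The range 50 to 60 and the centering value 55 are made up for illustration.

```r
# Illustrative sketch on simulated data.
set.seed(8)
x <- runif(100, 50, 60)        # a regressor whose values are far from zero

cor_raw <- cor(x, x^2)         # the regressor and its square are almost collinear

xc <- x - 55                   # subtract an interpretable value close to the mean
cor_centered <- cor(xc, xc^2)  # the association is now much weaker

c(raw = cor_raw, centered = cor_centered)
```

One would then regress the response on xc and xc^2 instead of x and x^2; the fitted values are unchanged, only the parametrization differs.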
6. I have performed a linear regression and got the following estimated equation:
$\hat{Y} = -4.32 + 542.7X_1 - 3.84X_2 + 0.043X_3$. Can I conclude that the relative
importance of X1, X2, X3 is in that order?
Answer: The value of the regression coefficient estimate depends on the scale
of measurement of that regressor. It is possible that X1 ,X2 ,X3 are measured
in centimeters, meters, and kilometers, respectively. The way to assess is by
looking at their t-values. There is also some literature on relative importance
of regressors. One may see Kruskal (1987) and Gromping (2006).
7. How many observations are needed to perform a meaningful linear regression?
Answer: The thumb rule is at least ten observations per estimated parameter.
At least 30 observations are needed to estimate a linear regression model with
two regressors and an intercept. Otherwise you may get significance by chance.
8. I made two separate regressions for studying the returns to education–one for
men and the other for women. Can I compare the regression coefficients of both
the models for education?
Answer: If you want to test the equality of coefficients in the two models, you
need an estimate of the covariance between the two coefficient estimates, which
cannot be obtained from separate regressions.
9. How do I interpret the intercept in a linear regression equation?
Answer: We shall answer through a couple of examples.
Consider the wage equation $\widehat{wage} = 14.3 + 2.83 \times edu$, where edu stands for
years of education. If you also have illiterates in the data used to develop the
model, then the intercept 14.3 is the average wage of an illiterate (obtained from
the wage equation by taking edu = 0). However, if your data is on individuals
who have at least 7 years of education, do not interpret the intercept.
Consider again the wage equation in Table 7.21. Here the intercept is the
average wage of males having education of at most 12 years.
All the datasets, code, and other material referred to in this section are available at
www.allaboutanalytics.net.
• Data 7.1: AnscombesQuarter.csv
• Data 7.2: cars.csv
• Code 7.1: cars.R
• Data 7.3: cigarette_consumption.csv
• Code 7.2: cigarette_consumption.R
• Data 7.4: female.csv
• Code 7.3: female.R
• Data 7.5: healthcare1997.csv
• Data 7.6: leaf.csv
• Data 7.7: US_Dept_of_Commerce.csv
• Data 7.8: wage.csv
• Data 7.9: wc-at.csv
• Code 7.4: wc-at.R
Exercises
Value −2 −1 0 1 2
Probability 0.2 0.2 0.2 0.2 0.2
Define $Y = 3X^2 + 2$ and $Z = X^2$. Show that X and Y are uncorrelated. Show also
that the correlation coefficient between Y and $Z = X^2$ is 1.
Y 0.6 0.2 0.2 0.2 0.1 0.1 0.1 0.05 0.05 0.05
X 2.01 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
(h) In model 4, consider a male and a female of same age, 28, and the same
education, of 10 years. Who earns more and by how much? Does it change
if the age is 40 and the education is 5 years?
(i) If one wishes to test whether the females catch up with males with increasing
level of education, how would you modify model 4 and what test do you
perform?
(j) Notice that the Q-Q plots in models 1 and 3 are similar. Are they satisfactory?
What about those in models 2 and 4? What is the reason for this dramatic
change?
Ex. 7.5 In order to study the impact of bank deregulation on income inequality,
yearly data was collected on the following for two states, say 1 and 0 during the
years 1976 to 2006. Bank deregulation was enacted in state 1 and not in state
0. Gini index is used to measure income inequality. To control for time-varying
changes in a state’s economy, we use the US Department of Commerce data
(“US_Dept_of_Commerce.csv”) to calculate the growth rate of per capita Gross
State Product (GSP). We also control for the unemployment rate, obtained from
the Bureau of Labor Statistics, and a number of state-specific, time-varying socio-
demographic characteristics, including the percentage of high school dropouts, the
proportion of blacks, and the proportion of female-headed households.
Name                  Description
Log_gini              Logarithm of Gini index of income inequality
Gsp_pc_growth         Growth rate of per capita Gross State Product (2000 dollars)
Prop_blacks           Proportion blacks
Prop_dropouts         Proportion of dropouts
Prop_female_headed    Proportion female-headed households
Unemployment          Unemployment
Post                  Bank deregulation dummy
Treatment             Denoting two different states 1 and 0
Interaction           Post*treatment
Wrkyr                 Year of study
References
Vishnuprasad Nagadevara
Three topics are covered in this chapter. In the main body of the chapter, the tools for
estimating the parameters of regression models when the response variable is binary
or categorical are presented. The appendices cover two other important techniques,
namely, maximum likelihood estimation (MLE) and how to deal with missing data.
1 Introduction
In general, regression analysis requires that the response variable or the dependent
variable is a continuous and quantifiable variable, while the independent or
explanatory variables can be either quantifiable or indicator (nominal or categorical)
variables. The indicator variables are managed using dummy variables as discussed
in Chap. 7 (Statistical Methods: Linear Regression Analysis). Sometimes it becomes
necessary to use indicator or categorical variables as the response variable. Some
such situations are discussed in the next section.
V. Nagadevara
IIM-Bangalore, Bengaluru, Karnataka, India
e-mail: nagadevara_v@isb.edu
2 Motivation
$$ p = P(Y = 1 \mid X = x) = \alpha + \beta x + \varepsilon $$
where p is the probability that Y takes on a value of 1 given the explanatory variable
x, α and β are the regression coefficients, and ε is the error term.
There are two problems associated with the above model. The first is that the
model is unbounded, implying that p can take on any value depending on the
values of X and the regression coefficients. In reality, p cannot be less than 0 or
8 Advanced Regression Analysis 249
The error term ε does not follow a normal distribution since it can take on only two
values for a given x.
$$ p = P(Y = 1 \mid X = x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}} $$
$$ (1 - p) = \frac{1}{1 + e^{\alpha + \beta x}} $$
[Figure: probability (vertical axis) plotted against Variable X (horizontal axis) for the logistic and linear models]
The above model is usually referred to as the “logit (logistic probability unit)
model.”
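The contrast between the linear and the logistic forms can be checked numerically. The following is a minimal sketch, not taken from the chapter's R script; all coefficient values are hypothetical and chosen only so that the linear model leaves the [0, 1] range over the plotted values of x.

```python
import math

def linear_probability(alpha, beta, x):
    """Linear probability model: the result is not restricted to [0, 1]."""
    return alpha + beta * x

def logistic_probability(alpha, beta, x):
    """Logistic response function e^z / (1 + e^z), written as 1 / (1 + e^-z)."""
    z = alpha + beta * x
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, for illustration only.
xs = [0, 400, 800, 1200, 1600]
linear = [linear_probability(-0.2, 0.001, x) for x in xs]
logistic = [logistic_probability(-4.0, 0.005, x) for x in xs]

for x, p_lin, p_log in zip(xs, linear, logistic):
    print(f"x={x:5d}  linear={p_lin:6.3f}  logistic={p_log:6.3f}")
```

With these assumed values the linear prediction is negative at the low end and exceeds 1 at the high end, while the logistic prediction stays strictly between 0 and 1.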
The parameters of the logistic regression are estimated using the maximum likeli-
hood method. A brief description of the maximum likelihood method is given below.
(Please see Appendix 2 for more details on the maximum likelihood method.)
$$ p = P(Y = 1 \mid X = x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}} = f(x). $$
$$ L = P(Y_1, Y_2, \ldots, Y_n) = \prod_{i=1}^{n} f(x_i)^{Y_i} \, (1 - f(x_i))^{(1 - Y_i)}. $$
$$ \ln(L) = LL = \sum_{i=1}^{n} Y_i \ln\{f(x_i)\} + \sum_{i=1}^{n} (1 - Y_i) \ln\{1 - f(x_i)\}. $$
We can take the partial derivatives of the above log likelihood function (LL)
with respect to α and β and equate the resultant equations to 0. Solving these two
equations for the two unknowns α and β, we can obtain their estimates. These two
equations do not have a closed form solution and hence the solution is obtained
numerically through an iterative process.
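One common iterative scheme is Newton-Raphson, which repeatedly solves the two score equations using the second derivatives of LL. The sketch below is an illustration under stated assumptions: the data are hypothetical, the function name `fit_logistic` is invented for this example, and the chapter does not say which algorithm its software uses.

```python
import math

def fit_logistic(xs, ys, tol=1e-8, max_iter=50):
    """Newton-Raphson estimates of alpha and beta for a one-variable logit model."""
    alpha, beta = 0.0, 0.0
    for iteration in range(1, max_iter + 1):
        ps = [1.0 / (1.0 + math.exp(-(alpha + beta * x))) for x in xs]
        # Partial derivatives of LL with respect to alpha and beta (the score).
        g_a = sum(y - p for y, p in zip(ys, ps))
        g_b = sum(x * (y - p) for x, y, p in zip(xs, ys, ps))
        # Negative second derivatives of LL (the information matrix).
        ws = [p * (1.0 - p) for p in ps]
        h_aa = sum(ws)
        h_ab = sum(w * x for w, x in zip(ws, xs))
        h_bb = sum(w * x * x for w, x in zip(ws, xs))
        det = h_aa * h_bb - h_ab ** 2
        # Newton step: inverse of the 2x2 information matrix times the score.
        step_a = (h_bb * g_a - h_ab * g_b) / det
        step_b = (h_aa * g_b - h_ab * g_a) / det
        alpha += step_a
        beta += step_b
        if max(abs(step_a), abs(step_b)) < tol:
            break
    return alpha, beta, iteration

# Hypothetical data: x is a predictor, y is a binary response.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
alpha_hat, beta_hat, n_iter = fit_logistic(xs, ys)
print(alpha_hat, beta_hat, n_iter)
```

At the returned estimates both partial derivatives are numerically zero, which is the condition described in the paragraph above.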
3.2 Interpretation of β
Let us consider an example where we know the existing odds and would like to
estimate the new odds corresponding to an increase in the value of x. Let us assume
that the response variable is purchasing a washing machine and the independent
variable is monthly income. Let us say the current value of x is $12 (income
measured in $'000s), the corresponding odds are 1 (implying that the probability is 0.5),
and the exponentiated β is 1.16. We are interested in finding the increase in the
odds and probability of purchasing the washing machine when the monthly income
increases by $2 (in '000s). Each unit increase in x multiplies the odds by exp(β), so
the new odds can be calculated by using the following formula:
$$ \text{New Odds} = \text{Old Odds} \times \left(e^{\beta}\right)^{\text{Change in income}} = 1 \times 1.16^{2} = 1.3456. $$
The corresponding probability is 1.3456/(1 + 1.3456) = 0.5737. Thus, the probability
of purchasing a washing machine increases from 0.5 to 0.5737 when the monthly
income increases from $12,000 to $14,000.
The logistic regression can be extended from a single explanatory variable to
multiple explanatory variables easily. If we have three explanatory variables, the
logistic response function becomes
There are three diagnostic tests available for testing the logistic regression model.
They are as follows:
• Likelihood ratio test
• Wald’s test
• Hosmer and Lemeshow test
Likelihood Ratio
The likelihood ratio test compares two different models, one without any
independent variables and the other with independent variables. This is a chi-square
test with degrees of freedom equal to the number of independent variables. It
checks whether the model with the independent variables fits the data significantly
better than the model without them. In other words, it is an omnibus test of the
entire model and is comparable to the F test in multiple regression. The chi-square
value is calculated based on two models, the first without any explanatory
variables and the second with the explanatory variables. The null hypothesis with
respect to the likelihood ratio test is as follows:
H0 : β1 = β2 = . . . = βk = 0.
HA : at least one βj is not equal to zero.
Let us start with the log likelihood function given below and also the data
presented in Table 8.1.
$$ \ln(L) = LL = \sum_{i=1}^{n} Y_i \ln\{f(x_i)\} + \sum_{i=1}^{n} (1 - Y_i) \ln\{1 - f(x_i)\}. $$
$$ P(Y = \text{“Y”}) = \frac{N_C}{N} = \frac{106}{147} = 0.7211. $$
In order to obtain a statistic that follows a chi-square distribution, we multiply the
log likelihood function (LL) by –2. Substituting $N_L/N$ for $f(x_i)$ and $N_C/N$ for $\{1 - f(x_i)\}$ in the log
likelihood function, we get
$$ -2LL_0 = -2\left[ N_L \ln\left(\frac{N_L}{N}\right) + N_C \ln\left(\frac{N_C}{N}\right) \right] = 174.025. $$
We denote the above as –2LL0 because there are 0 explanatory variables. The
logit function is fitted using “Experience” as the explanatory variable, and the result
is reproduced in Table 8.2. It can be seen from the table that it required four iterations
to obtain the estimates for α and β. The corresponding –2LL is 165.786. Let us refer
to this as –2LL1 because there is only one explanatory variable. The chi-square
value is calculated as the difference between –2LL0 and –2LL1 (i.e., 174.025 –
165.786 = 8.239). The degree of freedom corresponding to this chi-square is 1
because there is one independent variable.
The R code (“Advanced_Regression_Analysis.R”) to generate the tables and charts
referred to in the chapter is provided on the website. Table 8.23 (Relevant R functions)
in Appendix 3 contains a summary of the R commands used in this chapter.
The p-value corresponding to chi-square value of 8.239 with 1 degree of freedom
is 0.004. Considering the small p-value, the null hypothesis can be rejected,
concluding that the explanatory variable “Experience” has a significant impact on
the probability of attrition.
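The arithmetic of this likelihood ratio test can be reproduced from the counts and the –2LL value quoted above. The sketch below is illustrative only: it takes N_C = 106 and N = 147 from the text, derives N_L = 147 – 106 = 41, and uses the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable, so its upper-tail probability equals erfc(sqrt(x/2)). The function names are invented for this example.

```python
import math

def neg2_loglik_null(n_l, n_c):
    """-2LL for the model with no explanatory variables, from the two class counts."""
    n = n_l + n_c
    return -2.0 * (n_l * math.log(n_l / n) + n_c * math.log(n_c / n))

def chi2_upper_tail_1df(x):
    """P(chi-square with 1 df > x), using the standard normal relationship."""
    return math.erfc(math.sqrt(x / 2.0))

neg2ll0 = neg2_loglik_null(n_l=41, n_c=106)  # counts from the text: 147 - 106 = 41
neg2ll1 = 165.786                            # -2LL with "Experience", as reported
chi_square = neg2ll0 - neg2ll1
p_value = chi2_upper_tail_1df(chi_square)

print(round(neg2ll0, 3), round(chi_square, 3), round(p_value, 3))
```

For models with more than one explanatory variable the degrees of freedom exceed 1, and a general chi-square routine from a statistics library would be needed instead of the shortcut used here.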
Wald’s Test
While the likelihood ratio test is used for testing the significance of the overall
model, Wald’s test is used to test the significance of each of the individual βj s. The
null hypothesis in this case is:
H0 : βj = 0.
HA : βj ≠ 0.
The statistic for Wald’s test follows a chi-square distribution with 1 degree of
freedom. The statistic for each βj is calculated as
$$ W = \left( \frac{\hat{\beta}_j - 0}{Se(\hat{\beta}_j)} \right)^{2}, $$
256 V. Nagadevara
Table 8.3 Logistic regression coefficients for the employee attrition model
                                                           95% C.I. for Exp(β)
              β       S.E.    Wald      df    Sig.    Exp(β)    Lower    Upper
Constant      1.605   .311    26.596    1     .000    4.977
Experience    −.010   .004    7.567     1     .006    .990      .982     .997
where $Se(\hat{\beta}_j)$ is the standard error of the estimate $\hat{\beta}_j$. The logistic regression
coefficients along with the standard errors and Wald’s statistics and significance
levels are presented in Table 8.3.
The null hypothesis that β = 0 is rejected based on the p-value (Sig.) of 0.006.
Table 8.3 also presents the 95% confidence interval for Exp(β). The confidence interval
for β is calculated based on the formula below:
$$ \hat{\beta} \pm Z_{\alpha/2} \, Se(\hat{\beta}) $$
This is a two-sided confidence interval of level (1 – α) for β, where $Z_{\alpha/2}$
is the standard normal value corresponding to a confidence level of (1 – α) × 100%.
Exponentiating the two limits gives the corresponding interval for Exp(β).
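The Wald statistic and the confidence interval can be computed directly from a coefficient and its standard error. The sketch below uses the rounded values printed for Experience in Table 8.3 (β = –0.010, S.E. = 0.004). Because these inputs are rounded to three decimals, the results only approximate the table: the statistic comes out near 6.25 rather than the reported 7.567, which was presumably computed from unrounded estimates. The function names are illustrative.

```python
import math

def wald_statistic(beta_hat, se):
    """Wald statistic for H0: beta = 0, chi-square with 1 degree of freedom."""
    return ((beta_hat - 0.0) / se) ** 2

def confidence_interval(beta_hat, se, z=1.96):
    """Two-sided interval beta_hat +/- z * se (z = 1.96 for a 95% level)."""
    return beta_hat - z * se, beta_hat + z * se

beta_hat, se = -0.010, 0.004      # rounded values from Table 8.3
w = wald_statistic(beta_hat, se)
p_value = math.erfc(math.sqrt(w / 2.0))   # upper tail of chi-square with 1 df
lower, upper = confidence_interval(beta_hat, se)
exp_lower, exp_upper = math.exp(lower), math.exp(upper)

print(round(w, 3), round(p_value, 4))
print(round(exp_lower, 3), round(exp_upper, 3))
```

The exponentiated limits land close to the interval for Exp(β) shown in the table, which is a useful sanity check on the formula even with rounded inputs.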
Hosmer–Lemeshow Test
The Hosmer–Lemeshow test is also a chi-square test which is used as a goodness
of fit test. It checks how well the logistic regression model fits the data. The test
divides the dataset into deciles and calculates observed and expected frequencies
for each decile. Then it calculates the “H” statistic using the formula
$$ H = \sum_{j=1}^{2} \sum_{i=1}^{g} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = \sum_{i=1}^{g} \frac{(O_{1i} - E_{1i})^2}{n_i \pi_i (1 - \pi_i)} $$
where $O_{ij}$ is the observed frequency of the ith group and jth category, $E_{ij}$ is the expected
frequency of the ith group and jth category, $n_i$ is the number of observations in the ith
group, g is the number of groups, and $\pi_i$ is the predicted risk for the ith group. The
statistic H is similar to the chi-square statistic used in goodness of fit tests.
The cases are grouped together according to their predicted values from the
logistic regression model. Specifically, the predicted values are arrayed from lowest
to highest and then separated into several groups of approximately equal size. While
the standard recommendation by Hosmer and Lemeshow is ten groups, the actual
number of groups can be varied based on the distribution of the predicted values.
The null and alternate hypotheses for Hosmer–Lemeshow test are:
H0 : the logistic regression model fits the data.
HA : the logistic regression model does not fit the data.
Table 8.4 presents the contingency table for Hosmer–Lemeshow test, while
Table 8.5 presents the Hosmer–Lemeshow statistic with the significance levels.
The contingency table had only nine classes owing to the number of observed
and expected frequencies. Consequently, the chi-square of the final model had only
seven degrees of freedom.
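The grouping and the H statistic described above can be sketched in a few lines. This is a simplified illustration, not the procedure used by SPSS or by the chapter's R script: the observations and predicted probabilities are hypothetical, groups are formed by simply slicing the sorted predictions into equal parts, and the degrees of freedom are taken as the number of groups minus 2, in line with the nine classes and seven degrees of freedom mentioned in the text.

```python
def hosmer_lemeshow(y, p_hat, groups=10):
    """Return (H, degrees of freedom) using the single-category form of the formula."""
    order = sorted(range(len(y)), key=lambda i: p_hat[i])  # lowest to highest prediction
    n = len(order)
    h = 0.0
    used = 0
    for g in range(groups):
        idx = order[g * n // groups:(g + 1) * n // groups]
        if not idx:
            continue
        n_g = len(idx)
        observed = sum(y[i] for i in idx)       # O_1i: observed events in the group
        expected = sum(p_hat[i] for i in idx)   # E_1i: sum of predicted probabilities
        pi_bar = expected / n_g                 # average predicted risk in the group
        h += (observed - expected) ** 2 / (n_g * pi_bar * (1.0 - pi_bar))
        used += 1
    return h, used - 2

# Hypothetical predictions and outcomes, for illustration only.
p_hat = [(i + 0.5) / 20 for i in range(20)]
y = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1]
h_stat, df = hosmer_lemeshow(y, p_hat, groups=4)
print(round(h_stat, 3), df)
```

A small H relative to the chi-square distribution with the given degrees of freedom is consistent with the null hypothesis that the model fits the data.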
where n is the sample size. The full model is the one which has the explanatory
variables included in the model and used for further analysis such as calculating
class probability or prediction.
It is not necessary that the maximum possible value of Cox and Snell R2 is 1.
Nagelkerke R2 adjusts Cox and Snell R2 such that the maximum possible value is
equal to 1.
The Nagelkerke R2 is calculated using the formula
$$ \text{Nagelkerke } R^2 = \frac{1 - \left[ \dfrac{L(\text{Intercept only model})}{L(\text{Full model})} \right]^{2/n}}{1 - \left[ L(\text{Intercept only model}) \right]^{2/n}} $$
McFadden R2 is defined as
$$ \text{McFadden } R^2 = 1 - \frac{\ln(L_{\text{Full model}})}{\ln(L_{\text{Intercept only model}})} $$
where $L_{\text{Intercept only model}}$ ($L_0$) is the value of the likelihood function for a model with
no predictors and $L_{\text{Full model}}$ ($L_M$) is the likelihood for the model being estimated
using predictors.
The rationale for this formula is that ln(L0 ) plays a role analogous to the residual
sum of squares in linear regression. Consequently, this formula corresponds to
a proportional reduction in “error variance.” Hence, this is also referred to as
“Deviance R2 .”
These Pseudo R2 values are intended to mimic the R2 values of linear regression.
But the interpretation is not the same. Nevertheless, they can be interpreted as a
metric for the amount of variation explained by the model, and in general, these
values can be used to gauge the goodness of fit. It is expected that the larger the
value of pseudo R2 , the better is the fit. It should also be noted that only Nagelkerke
R2 can reach a maximum possible value of 1.0.
The three pseudo R2 values for the logistic regression model using the attrition
data (data from Table 8.1) are presented in Table 8.6.
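All three measures can be computed from the two –2LL values and the sample size. The sketch below is illustrative: it uses –2LL0 = 174.025, –2LL1 = 165.786, and n = 147 as quoted earlier for the one-variable attrition model, and it assumes the standard definition of Cox and Snell R2, namely 1 – (L0/LM)^(2/n), which is also the numerator of the Nagelkerke formula above. The function name is invented for this example.

```python
import math

def pseudo_r2(neg2ll_null, neg2ll_full, n):
    """Return (Cox and Snell, Nagelkerke, McFadden) pseudo R2 values."""
    ll0 = -neg2ll_null / 2.0   # ln(L0), intercept-only model
    llm = -neg2ll_full / 2.0   # ln(LM), full model
    # (L0 / LM)^(2/n) is evaluated on the log scale to avoid underflow.
    cox_snell = 1.0 - math.exp(2.0 * (ll0 - llm) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * ll0 / n)
    nagelkerke = cox_snell / max_cox_snell
    mcfadden = 1.0 - llm / ll0
    return cox_snell, nagelkerke, mcfadden

cs, nk, mf = pseudo_r2(174.025, 165.786, 147)
print(round(cs, 4), round(nk, 4), round(mf, 4))
```

Working with log likelihoods rather than raw likelihoods matters in practice, since L itself is a product of many probabilities and quickly becomes too small to represent.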
Normally the logistic model sets the cutoff value at 0.5 by default. It implies that
those observations which have a predicted probability of less than or equal to 0.5
are classified as belonging to one category and those with a probability of above 0.5
are classified as belonging to the other category. This cutoff value can be tweaked in
order to improve the prediction accuracy. The tweaking can be based on either the
classification plot or Youden’s index.
Classification Plot
The classification plot is created by plotting the frequencies of observations on
the Y-axis and the predicted probability on the X-axis. It also presents the cutoff
value of probability on the X-axis. The classification plot for the attrition model
obtained from SPSS is presented in Fig. 8.2.
The classification table with the cutoff value of 0.5 is presented in Table 8.7.
There are a series of “N”s and “Y”s below the X-axis. These indicate the cutoff
of 0.5 (i.e., the point where Ns change to Ys). Each N and Y in the graph above
the X-axis represents 5 data points. There are two Ns and four Ys to the left of 0.5,
but they are not shown because of the granularity. There are a few Ns to the right of 0.5,
and these are misclassifications. If the cutoff is shifted beyond 0.5, some of these
will be correctly classified. A cutoff of 0.7 was also tried and found to
classify the Ns more accurately, as shown in Table 8.8. The downside is that more
Ys are misclassified by the shift in the cutoff value.
It can be seen from Fig. 8.2 and Table 8.7 that many of those who have left the
company are misclassified as “continuing with the company.” The overall prediction
accuracy of 70.7% is of no use because the model is completely ineffective with
respect to those who have left the company. The very objective of the exercise is to
be able to predict who are likely to leave the company so that appropriate strategy
(even at an individual level) can be created to retain them. It is important to predict
accurately as many as possible of those who are likely to leave the company, even at the cost
of misclassifying those who are going to stay with the company. The prediction
accuracy of those who are likely to leave the company can be increased by tweaking
the cutoff value. By analyzing Fig. 8.2, it can be deduced that a cutoff value of
0.70 can give better results in terms of predicting those who are not continuing with
the company. The prediction accuracies with 0.7 as the cutoff value are presented
in Table 8.8. The new cutoff value of 0.7 has resulted in a significant increase in
the prediction accuracy of those who are likely to leave the company. The overall
prediction accuracy has marginally gone down to 68.7%.
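The trade-off created by moving the cutoff can be seen on a toy example. The sketch below uses hypothetical outcomes and predicted probabilities (not the attrition data), labels an observation “Y” when its predicted probability exceeds the cutoff, and reports the accuracy for each class and overall.

```python
def class_accuracy(actual, p_hat, cutoff):
    """Per-class and overall accuracy when 'Y' is predicted for p_hat > cutoff."""
    predicted = ["Y" if p > cutoff else "N" for p in p_hat]
    result = {}
    for label in ("N", "Y"):
        members = [i for i, a in enumerate(actual) if a == label]
        correct = sum(1 for i in members if predicted[i] == label)
        result[label] = correct / len(members)
    matches = sum(1 for a, q in zip(actual, predicted) if a == q)
    result["overall"] = matches / len(actual)
    return result

# Hypothetical data, for illustration only.
actual = ["N", "N", "N", "N", "Y", "Y", "Y", "Y", "Y", "Y"]
p_hat = [0.45, 0.62, 0.66, 0.74, 0.58, 0.68, 0.69, 0.78, 0.81, 0.90]

at_050 = class_accuracy(actual, p_hat, 0.5)
at_070 = class_accuracy(actual, p_hat, 0.7)
print(at_050)
print(at_070)
```

In this made-up example, raising the cutoff from 0.5 to 0.7 improves the accuracy for the “N” class at the expense of the “Y” class and of the overall accuracy, which mirrors the pattern described for Tables 8.7 and 8.8.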
Youden’s Index
As in the case of any predictive technique, the sensitivity and specificity
statistics can be calculated for the logistic regression. The formulae for sensitivity
and specificity are reproduced below for quick reference. The values of sensitivity
and specificity corresponding to the predictions shown in Table 8.7 are also given
below.
$$ \text{Sensitivity} = \frac{\text{True Positive (TP)}}{\text{True Positive (TP)} + \text{False Negative (FN)}} = 41.5\% $$
Youden’s index enables us to calculate the cutoff probability such that the
following function is maximized.
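Youden's index is conventionally defined as J = sensitivity + specificity – 1, and the cutoff that maximizes J can be found by a simple search. The sketch below is a minimal illustration on hypothetical data with a coarse grid of candidate cutoffs; the function names and the grid are assumptions made for this example.

```python
def sensitivity_specificity(actual, p_hat, cutoff):
    """Sensitivity and specificity when a positive is predicted for p_hat > cutoff."""
    tp = sum(1 for a, p in zip(actual, p_hat) if a == 1 and p > cutoff)
    fn = sum(1 for a, p in zip(actual, p_hat) if a == 1 and p <= cutoff)
    tn = sum(1 for a, p in zip(actual, p_hat) if a == 0 and p <= cutoff)
    fp = sum(1 for a, p in zip(actual, p_hat) if a == 0 and p > cutoff)
    return tp / (tp + fn), tn / (tn + fp)

def best_cutoff_by_youden(actual, p_hat, candidates):
    """Return (cutoff, J) for the candidate with the largest Youden's index."""
    best_cutoff, best_j = None, None
    for cutoff in candidates:
        sens, spec = sensitivity_specificity(actual, p_hat, cutoff)
        j = sens + spec - 1.0
        if best_j is None or j > best_j:   # the first cutoff wins in case of a tie
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j

# Hypothetical data, for illustration only.
actual = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
p_hat = [0.45, 0.62, 0.66, 0.74, 0.58, 0.68, 0.69, 0.78, 0.81, 0.90]
candidates = [i / 10 for i in range(1, 10)]

cutoff, j = best_cutoff_by_youden(actual, p_hat, candidates)
print(cutoff, round(j, 3))
```

In practice the candidate cutoffs are usually the distinct predicted probabilities themselves, which is equivalent to picking the point on the ROC curve farthest above the diagonal.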
[Figure: ROC curve plotting Sensitivity (vertical axis) against (1 - Specificity) (horizontal axis), with the point p marked on the curve]
Let us consider the first two observations (Employee numbers 142 and 125)
of Table 8.7. The probabilities estimated by the model are 0.4034 and 0.6590,
respectively. There is no cutoff probability that can predict these two observations
correctly. If we set the cutoff at a point less than 0.4034, Employee number 125
will be misclassified as “Yes.” If we set the cutoff above 0.6590, Employee number
142 will be misclassified. If we set the cutoff between 0.4034 and 0.6590, both
observations will be misclassified. Such pairs are called “discordant” pairs.
Now, let us consider the last two observations (Employee numbers 136 and 90) of
Table 8.7. The estimated probabilities are 0.5061 and 0.7019. If we set the cutoff
at any point between these two probabilities, both the observations are classified
correctly. Such pairs are called “concordant” pairs. Needless to say, a logistic
regression model with a large number of concordant pairs is preferred.
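Counting concordant and discordant pairs amounts to comparing every observation from one class with every observation from the other. The sketch below uses the four predicted probabilities quoted above; the actual outcomes are not listed in this passage, so they are inferred from the description (Employee 142 and 90 as “Yes,” Employee 125 and 136 as “No”) and should be treated as assumptions.

```python
def pair_counts(actual, p_hat):
    """Return (concordant, discordant, tied) counts over all Yes/No pairs."""
    positives = [p for a, p in zip(actual, p_hat) if a == 1]
    negatives = [p for a, p in zip(actual, p_hat) if a == 0]
    concordant = discordant = tied = 0
    for p_yes in positives:
        for p_no in negatives:
            if p_yes > p_no:
                concordant += 1   # some cutoff classifies both correctly
            elif p_yes < p_no:
                discordant += 1   # no cutoff classifies both correctly
            else:
                tied += 1
    return concordant, discordant, tied

# Employees 142, 125, 136, 90; outcomes inferred from the text (1 = "Yes").
actual = [1, 0, 0, 1]
p_hat = [0.4034, 0.6590, 0.5061, 0.7019]
print(pair_counts(actual, p_hat))
```

The share of concordant pairs among all Yes/No pairs is a common summary of how well the model ranks the observations.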
The attrition dataset has many more explanatory variables in addition to Experience
(Refer to dataset “employee_attrition_nvars.csv” on the website). These variables
are:
• TotExp_Binned: Total Experience (Binned)
• Age
• Gender
• ExpinTeam_Binned: Experience in the Team (Binned)
• Pos_Binned: Position in the company (Categorical)
• ExpCrnt_Binned: Experience in the current position (Binned)
• Tech_Binned: Technology specialization (Categorical)
• TotalJobHops: Total Job Hops
• CL_binned: Change in Use of Casual Leave (CL)
• PL_binned: Change in Use of Privilege Leave (PL)
• LC_Binned: Late coming to work
The logit function with k explanatory variables is
$$ \frac{p}{(1 - p)} = e^{\alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_k x_k} $$
Table 8.11 Model summary. Estimation terminated at iteration number 6 because parameter
estimates changed by less than .001
Step    −2 Log likelihood    Cox and Snell R2    Nagelkerke R2    McFadden R2
1       57.76                .533                .765             .638
2       58.34                .531                .762             .635
3       59.94                .525                .754             .625
4       61.39                .520                .747             .616
5       62.96                .515                .739             .606
of the final model had only 7 degrees of freedom. The contingency table and the
Hosmer and Lemeshow test are presented in Tables 8.12 and 8.13, respectively.
Model Selection
When we have multiple explanatory variables, the model selection can be
facilitated by Akaike information criterion (AIC) or Bayesian information criterion
(BIC). The concepts of AIC and BIC were discussed in Chap. 7 on linear regression
analysis. In the case of logistic regression, these criteria are defined as
Table 8.14 AIC and BIC values of different models
Step (Model)    −2 Log likelihood    AIC      BIC
1               57.758               85.76    126.33
2               58.347               84.35    122.02
3               59.949               83.95    118.72
4               61.386               83.38    115.26
5               62.96                82.96    111.94
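The values in Table 8.14 are consistent with the usual definitions AIC = –2LL + 2k and BIC = –2LL + k ln(n), where k is the number of estimated parameters and n is the sample size. The sketch below recomputes the table under two assumptions that are inferred from the printed values rather than stated in the text: k falls from 14 in step 1 to 10 in step 5, and n = 134.

```python
import math

def aic(neg2ll, k):
    """Akaike information criterion: -2LL + 2k."""
    return neg2ll + 2 * k

def bic(neg2ll, k, n):
    """Bayesian information criterion: -2LL + k * ln(n)."""
    return neg2ll + k * math.log(n)

n = 134  # assumption: inferred from the printed BIC values, not stated in the text
# (step, -2LL from Table 8.14, assumed number of parameters k)
steps = [(1, 57.758, 14), (2, 58.347, 13), (3, 59.949, 12), (4, 61.386, 11), (5, 62.96, 10)]

table = [(step, aic(d, k), bic(d, k, n)) for step, d, k in steps]
for step, a, b in table:
    print(step, round(a, 2), round(b, 2))
```

Both criteria decrease from step 1 to step 5 even though –2LL increases, because each step removes a parameter and the penalty term shrinks by more than the fit deteriorates.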
Fig. 8.4 Classification table of the final model with a cutoff value of 0.5
model is able to predict those who left the company with an accuracy level of 76.3%.
The prediction accuracy with respect to those who are continuing with the company
is 90.6%. The overall prediction accuracy of the final model is 86.6%.
The classification plot of the final model is presented in Fig. 8.4. It can be seen
from the classification plot that a cutoff value of 0.5 is appropriate.
The final model is significant and fits the data the best. The variables that are
significant in determining the probability of attrition are total experience (binned),
experience in the team, and the position in the company (a categorical variable).
This model can predict those who are likely to leave the company with an accuracy
level of about 76.3%. In summary, this model can be effectively used to predict
those who are likely to attrite at the individual level and take necessary action.
The discussion above on logistic regression was with respect to dependent variables
which are qualitative and binary in nature. In other words, the dependent variable
can take on only two possible values. There are many situations where the
dependent variable is qualitative in nature but has more than two categories, that
is, multinomial. When these categories are nominal, then such models are called
polytomous models. These categories can also be ordinal in nature. A typical
example would be ratings of employees into five categories, namely, (1) Excellent,
(2) Very Good, (3) Average, (4) Below Average, and (5) Poor. We would like to
predict the performance of the employee based on different explanatory variables
such as age, education, and years of work experience. Such models where the
dependent variable is ordinal are called ordinal logistic regression models. The
concept of binary logistic regression can be easily extended to multinomial logistic
regression.
Consider a situation where there is a qualitative dependent variable with k
categories. We select one of the k categories as the base category and build
logit functions relative to it. If there is no ordinality in the categories, any of
the k categories can be selected as the base category. The logit function with m
independent variables and the kth category as the base is defined as
$$ \ln\left( \frac{p_j(X_i)}{p_k(X_i)} \right) = \alpha_j + \beta_{1j} X_{1i} + \beta_{2j} X_{2i} + \cdots + \beta_{mj} X_{mi} $$
where j = 1, 2, ..., (k − 1), i = 1 to n, and $\sum_{j=1}^{k} p_j = 1$.
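Given the (k – 1) logit functions, the category probabilities follow from the constraint that they sum to 1: p_j = exp(logit_j) / (1 + sum of exp(logit)) for the non-base categories and p_k = 1 / (1 + sum of exp(logit)) for the base. The sketch below illustrates this conversion for three categories; the coefficients and the observation are hypothetical and the function names are invented for this example.

```python
import math

def linear_predictor(alpha, betas, x):
    """alpha_j + beta_1j * x_1 + ... + beta_mj * x_m for one observation."""
    return alpha + sum(b * xi for b, xi in zip(betas, x))

def multinomial_probabilities(logits):
    """Convert (k - 1) logits relative to the base category into k probabilities.

    The base category is returned last; its logit is 0 by construction.
    """
    exps = [math.exp(z) for z in logits]
    denom = 1.0 + sum(exps)
    probs = [e / denom for e in exps]
    probs.append(1.0 / denom)
    return probs

# Hypothetical coefficients for two logit functions (category 1 vs. 3, category 2 vs. 3).
x = [1.2, 0.5]
logit_1 = linear_predictor(0.3, [0.8, -1.1], x)
logit_2 = linear_predictor(-0.2, [0.4, 0.6], x)
probs = multinomial_probabilities([logit_1, logit_2])
print([round(p, 4) for p in probs])
```

The predicted category for an observation is typically the one with the largest of these probabilities.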
1. Single cocoon weight (COCWT): This is simply the average weight of a cocoon.
This is usually calculated by selecting 25 cocoons at random, taking the total
weight, and then calculating the average of a single cocoon weight. This is
measured in grams or centigrams.
2. Shell weight (SHELLWT): This is the average of the single shell weight. The
shell is the portion of the cocoon after the pupa is removed. This is calculated
by taking the same 25 cocoons that are used for calculating the single cocoon
weight. The pupae are removed from these 25 cocoons, and then the average
weight of the shells is calculated. The shell yields the raw silk and hence the
higher the shell weight, the higher the yield of the raw silk. This is also measured
in grams or centigrams.
3. Shell ratio (SR): This is defined as the ratio of average shell weight to the average
single cocoon weight and expressed as a percentage. This ratio actually estimates
the raw silk content of each cocoon. Thus, the higher the shell ratio, the better is
the quality.
4. Filament length (FILENGTH): This is the total length of the silk filament reeled
from the cocoon. This is measured in meters.
5. Filament size (FILSIZE): This is the thickness of the silk filament. This is also
expressed as the denier. The denier is expressed as the weight of the silk filament
measured in grams for 9,000 m of the filament. A lower denier implies finer silk
filament and hence is more desirable.
6. Reelability (REEBLTY): This is a measure of the re-reelability of the silk
filament. It is the ratio of the number of cocoons reeled without a break to the total number
of cocoons cast, and it is measured as a percentage. This ratio is calculated
from the number of times of casting filaments and the number of cocoons reeled.
This characteristic actually measures the frequency of breakages of the filament
during reeling.
7. Raw silk (RAWSILK): This is a measure of the raw silk expressed as a
percentage. It is the ratio of the number of kilograms of cocoons required to
produce 1 kilogram of raw silk and expressed as a percentage.
8. Neatness (NEATNESS): This measures the neatness of the silk filament. This
is expressed as a percentage. The number of small knots and loops and the
frequency of distribution on raw silk are represented as percentage by comparing
a sample of 20 panels taken on a seriplane board, with the standard photographs
for neatness defects. This characteristic has an impact on the quality of the fabrics
woven from the silk.
9. Boil loss (BOILLOSS): Boil loss or degumming loss is the loss of sericin that is
used as the gum for binding the silk filaments together in the form of a cocoon.
Cocoons selected for reeling are boiled in soap solution for removing the gum or
sericin. This is the ratio of the weight of cocoons after degumming to the original
weight of cocoons (green cocoons) and expressed as a percentage.
There are 18,127 observations of which 5903 are of low quality, 6052 are of
medium quality, and the remaining 6172 are of high quality.
Since there are three categories, two different functions were built for prediction.
These were Low vs. High and Medium vs. High. The estimated coefficients along
with the standard errors are presented in Table 8.16. All the coefficients are
statistically significant at the 5% level.
Pseudo R2 values are presented in Table 8.17. All three pseudo R2 values, namely, Cox and
Snell R2, Nagelkerke R2, and McFadden R2, are fairly high.
The model fitting information values (AIC, BIC, and –2LL) along with the chi-
square value are presented in Table 8.18. The degrees of freedom are 18 (since there
are nine explanatory variables and two functions). The p-value of 0.000 suggests
that the models are statistically significant.
The prediction accuracies are presented in Table 8.19. The two models predict
the quality with an accuracy level of 88.7%.
6 Conclusion
The flexibility that logistic regression offers makes it a very useful technique
in predicting categorical dependent variables (binary as well as multinomial).
Logistic regression predicts the probability rather than the event itself. Unlike linear
regression, it ensures that the probability remains between the limits, namely, 0
and 1. Logistic regression is more suitable for prediction when the majority of the
explanatory variables are metric in nature. While the binary logit function can be
used for predicting categorical variables which can take on only two values, the
technique can be easily extended to variables that take on more than two values. The
multinomial logistic model can also be used to predict variables which are ordinal
in nature. The ordinal logistic model is beyond the scope of this chapter. Interested
readers may look at the additional readings given at the end of this chapter.
Further Reading
• Agresti, A. (2002). Categorical data analysis (2nd ed.). Hoboken, NJ: John
Wiley and Sons.
• Chatterjee, S., & Hadi, A. S. (2012). Regression analysis by example (5th ed.).
Hoboken, NJ: Wiley-Interscience.
• DeMaris, A. (1995). A tutorial on logistic regression. Journal of Marriage and
Family, 57(4), 956–968.
• Kleinbaum, D. G., & Klein, M. (2010). Logistic regression, a self-learning text
(3rd ed.). New York: Springer.
• Menard, S. (2002). Applied logistic regression analysis. Sage University paper
(Vol. 106).
All the datasets, code, and other material referred to in this section are available at
www.allaboutanalytics.net.
• Code 8.1: Advanced_Regression_Analysis.R
• Data 8.1: Employee_attrition.csv
• Data 8.2: Employee_attrition_nvars.csv
• Data 8.3: Quality_Index.csv
Exercises
Iteration historya,b,c,d
Coefficients
Iteration −2 Log likelihood Constant Age
1 114.451 3.057 −.080
2 114.205 3.488 −.092
3 114.204 3.504 −.092
4 114.204 3.504 −.092
a Method: Enter
b Constant is included in the model
c Initial −2 Log Likelihood: 114.451
d Estimation terminated at iteration number 4 because
Ex. 8.3 The beta coefficients and the standard errors are given in the table below.
Calculate Wald’s statistic and test the two coefficients for statistical significance.
What are the degrees of freedom associated with Wald’s statistic? What are the p-
values?
β S.E.
Age −.092 .045
Constant 3.504 1.276
Ex. 8.4 Calculate a 95% confidence interval for the β coefficient for the variable
“Age.” What is the interpretation of the coefficient for Age, –0.092?
Ex. 8.5 The model summary is given in the table below. Comment on the model
based on the pseudo R2 s and –2LL.
Ex. 8.6 Hosmer and Lemeshow chi-square is given below. What is the conclusion
with respect to the model based on Hosmer and Lemeshow test?
Chi-square Df Sig.
23.437 8 .001
Ex. 8.7 The prediction matrix for the above logistic regression is given below.
Calculate Youden’s index based on this matrix.
                              Predicted
                              Subscribed or not      Percentage correct
Observed                      N         Y
Subscribed or not      N      1         27           3.6
                       Y      3         69           95.8
Overall percentage                                   70.0
Missing data is a common issue in almost all analyses. There are a number
of ways for handling missing data, but each of them has its own advantages
and disadvantages. This section discusses some of the methods of missing value
imputation and the associated advantages and disadvantages.
The type of missing values can be classified into different categories. These
categories are described below:
(a) Missing at random: when the probability of non-response to a question depends
only on the other items where the response is complete, then it is categorized as
“Missing at Random.”
(b) Missing completely at random: if the probability of a value missing is the same
for all observations, then it is categorized as “Missing Completely at Random.”
(c) Missing value that depends on unobserved predictors: when the value that is
missing depends on information that has not been recorded and the same infor-
mation also predicts the missing values. For example, a discomfort associated
with a particular treatment might lead to patients dropping out of a treatment
leading to missing values.
(d) Missing values depending on the variable itself : this occurs when the proba-
bility of missing values in a variable depends on the variable itself. Persons
belonging to very-high-income groups may not want to report their income
which leads to missing values in the income variable.
Handling Missing Data by Deletion
Many times, the missing data problem can be handled simply by discarding the
data. One method is to exclude all observations where the values are missing. For
example, in regression analysis, any observation in which the value of
the dependent variable or of any independent variable is missing
is excluded from the analysis. There are two disadvantages of this method. The first
is that the exclusion of observations may introduce bias, especially if those excluded
differ significantly from those which are included in the analysis. The second is that
there may be only a few observations left for analysis after deletion of observations
with missing values.
This method is often referred to as “Complete Case Analysis.” This method is
also called “List-wise Deletion”. This method is most suited when there are only a
few observations with missing values.
The next method of discarding the data is called “Available Case Analysis”
or “Pair-wise Deletion.” The analysis is carried out with respect to only those
observations where the values are available for a particular variable. For example,
let us say, out of 1000 observations, information on income is available only for
870 and information on age is available for 960 observations. The analysis with
respect to age is carried out using 960 observations, whereas the analysis of income
is carried out for 870 observations. The disadvantage of this method is that the
analysis of different variables is based on different subsets and hence these are
neither consistent nor comparable.
Handling Missing Data by Imputation
These methods involve imputing the missing values. The advantage is that the
observations with missing values need not be excluded from the analysis. The
missing values are replaced by the best possible estimate.
• Mean Imputation
This method involves substituting missing values with the mean of the observed
values of the variable. Even though it is the simplest method, mean imputation
reduces the variance and pulls the correlations between variables toward “zero.”
Imputation by median or mode instead of mean can also be done. In order to
maintain a certain amount of variation, the missing values can be replaced by the “group
mean.” The variable with missing values is grouped into different bins and mean
values of each group (bin) are calculated. The missing values in any particular group
are replaced by the corresponding group mean, instead of overall mean.
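The difference between overall-mean and group-mean imputation can be sketched as follows. The data are hypothetical, missing values are represented by `None`, and the group labels are assumed to be available for every observation (including those with missing values) and to contain at least one observed value each.

```python
def impute_mean(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def impute_group_mean(values, groups):
    """Replace each missing value with the mean of the observed values in its group."""
    totals = {}
    for v, g in zip(values, groups):
        if v is not None:
            s, c = totals.get(g, (0.0, 0))
            totals[g] = (s + v, c + 1)
    return [totals[g][0] / totals[g][1] if v is None else v
            for v, g in zip(values, groups)]

# Hypothetical incomes with two missing values, and a grouping (bin) label.
income = [30, None, 34, 80, None, 90]
group = ["a", "a", "a", "b", "b", "b"]

by_mean = impute_mean(income)
by_group = impute_group_mean(income, group)
print(by_mean)
print(by_group)
```

In this example the overall mean places both imputed values in the middle of the range, whereas the group means keep them close to the other members of their own bin, which preserves more of the original variation.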
• Imputation by Linear Regression
The missing value can be imputed by using simple linear regression. The first
step is to treat the variable with missing values as the dependent variable and identify
several predictor variables in the dataset. The identification of the predictor variables
can be done using correlation. These variables are used to create a regression
equation, and the missing values of the dependent variable are estimated using the
regression equation. Sometimes, an iterative process is used where all the missing
values are first imputed and using the completed set of observations, the regression
coefficients are re-estimated, and the missing values are recalculated. The process is
repeated until the difference in the imputed values between successive iterations is
below a predetermined threshold. While this method provides “good estimates” for
the missing values, the disadvantage is that the variables tend to fit too well because
the missing values themselves are estimated using the other variables of the dataset
itself.
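A minimal single-pass sketch of regression imputation in base R is given below. The simulated data, the variable names, and the choice of age as the only predictor are assumptions made for illustration; the iterative refinement described above is not shown.

```r
set.seed(1)  # hypothetical simulated data
n      <- 50
age    <- round(runif(n, 25, 60))
income <- 20 + 1.5 * age + rnorm(n, sd = 5)
income[c(3, 17, 42)] <- NA            # knock out three values
df <- data.frame(age, income)

# Fit on the complete cases (lm drops rows with NA by default)
fit <- lm(income ~ age, data = df)

# Replace only the missing entries with the fitted values
miss <- is.na(df$income)
df$income_imputed <- df$income
df$income_imputed[miss] <- predict(fit, newdata = df[miss, ])

print(df[miss, ])
```

Because the imputed points lie exactly on the fitted line, refitting the model on the completed data would make the relationship look stronger than it is, which is the overfitting concern raised in the text.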
• Imputation with Time Series Data
Certain imputation methods are specific to time series data. One method is to
carry forward the last observation or carry backward the next observation. Another
method is to linearly interpolate the missing value using the adjacent values. This
method works better where the time series data exhibits trend. Wherever seasonality
is involved, linear interpolation can be carried out after adjusting for seasonality.
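Both time series approaches can be sketched in base R. The short series below is hypothetical, and the seasonal adjustment step mentioned above is left out.

```r
y <- c(10, 12, NA, NA, 18, 20, NA, 24)   # hypothetical series with gaps

# Last observation carried forward (LOCF)
locf <- y
for (i in seq_along(locf)[-1]) {
  if (is.na(locf[i])) locf[i] <- locf[i - 1]
}

# Linear interpolation between the adjacent observed values
t_idx  <- seq_along(y)
interp <- approx(x = t_idx[!is.na(y)], y = y[!is.na(y)], xout = t_idx)$y

print(locf)
print(interp)
```

On a trending series such as this one, the interpolated values follow the trend while LOCF produces flat steps, which is why interpolation is preferred when a trend is present.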
• Imputation of Categorical Variables
Categorical variables, by their nature, require different methods for imputation.
Imputation by mode is the simplest method. Yet it will introduce bias, just the same
way as imputation by mean. Missing values of categorical variables can be treated as
a separate category in itself. Different prediction models such as classification trees,
k-nearest neighbors, logistic regression, and clustering can be used to estimate the
missing values. The disadvantage of these methods is that they require building
high-level predictive models, which can itself be expensive and time consuming.
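The two simplest options for a categorical variable, mode imputation and a separate “missing” category, can be sketched as follows. The payment variable is a hypothetical example; the model-based alternatives are not shown.

```r
payment <- c("card", "cash", NA, "card", "card", NA, "cash")  # hypothetical

# Mode imputation: replace NA with the most frequent observed level
mode_level   <- names(which.max(table(payment)))   # table() ignores NA by default
mode_imputed <- ifelse(is.na(payment), mode_level, payment)

# Alternative: keep missingness as its own category
as_category <- factor(ifelse(is.na(payment), "missing", payment))

print(table(mode_imputed))
print(table(as_category))
```

Mode imputation inflates the share of the most common level, whereas the separate category leaves the observed proportions untouched at the cost of one extra level.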
As mentioned earlier, missing data is a major issue in data analysis. While there
are a number of methods available for imputing the missing values, no single
method is the “best.” The method to be used depends on the type of
dataset, the type of variables, and the type of analysis.
Further Reading
• Enders, C. K. (2010). Applied missing data analysis. New York: The Guilford
Press.
• Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data (2nd
ed.). Hoboken, NJ: John Wiley & Sons, Inc.
• Yadav, M. L. (2018). Handling missing values: A study of popular imputation
packages in R. Knowledge-Based Systems. Retrieved July 24, 2018, from
https://doi.org/10.1016/j.knosys.2018.06.012.
One of the most commonly used techniques for estimating the parameters of
a mathematical model is the least squares estimation, which is commonly used
in linear regression. Maximum likelihood estimation (MLE) is another approach
developed for the estimation of parameters where the least squares method is not
applicable, especially when the estimation involves complex nonlinear models.
MLE involves an iterative process, and the availability of computing power has
made MLE more popular recently. Since MLE does not impose any restrictions on
the distribution or characteristics of independent variables, it is becoming a preferred
approach for estimation.
Let us consider a scenario where a designer boutique is trying to determine the
probability, π , of a purchase made by using a credit card. The boutique is interested
in calculating the value of p which is a maximum likelihood estimate of π. The
boutique had collected the data of 50 purchases and found that 32 out of 50 were
credit card purchases and the remaining 18 were cash purchases.
The maximum likelihood estimation process starts with the definition of a
likelihood function, L(β), where β is the vector of unknown parameters. The
elements of the vector β are the individual parameters β0 , β1 , β2 , . . . , βk. The
likelihood function, L(β), is the joint probability or likelihood of obtaining the data
that was observed. The data of the boutique mentioned above can be described by
binomial distribution with 32 successes observed out of 50 trials. The likelihood
function for this example can be expressed as
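Under the binomial assumption just stated, with 32 successes in 50 trials and p denoting a candidate value of π, the likelihood and its logarithm take the standard binomial form. The display below is a sketch written from that assumption; the symbol L(p) is used here in place of the general L(β).

```latex
L(p) = \binom{50}{32}\, p^{32} (1-p)^{18},
\qquad
\log L(p) = \log\binom{50}{32} + 32 \log p + 18 \log (1-p)

\frac{d \log L(p)}{dp} = \frac{32}{p} - \frac{18}{1-p} = 0
\;\;\Longrightarrow\;\;
p = \frac{32}{50} = 0.64
```

The maximizing value is simply the observed proportion of credit card purchases.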
[Figure: log of the likelihood function plotted against values of p from 0 to 1]
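A curve of this kind can be traced numerically. The sketch below evaluates the binomial log likelihood for the boutique data (32 credit card purchases out of 50) over a grid of candidate values; the grid spacing of 0.01 is an arbitrary choice.

```r
# Candidate values of p (0 and 1 are excluded because the log is undefined there)
p <- seq(0.01, 0.99, by = 0.01)

# Binomial log likelihood of observing 32 successes in 50 trials
log_lik <- dbinom(32, size = 50, prob = p, log = TRUE)

# The grid point with the largest log likelihood
p_hat <- p[which.max(log_lik)]
print(p_hat)
```

The grid maximum coincides with the sample proportion 32/50.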
Table 8.20 Sample data
Sl. no.   Income (’000)   Sl. no.   Income (’000)
1         131             11        80
2         107             12        97
3         88              13        98
4         75              14        92
5         83              15        76
6         136             16        103
7         72              17        84
8         113             18        91
9         109             19        124
10        130             20        82
Consider the data on income levels presented in Table 8.20. There are 20
observations, and we would like to estimate the two parameters, mean and standard
deviation of the population from which these observations are drawn.
The estimation process involves the probability density function of the distribu-
tion involved. Assuming that the above observations are drawn from a univariate
normal distribution, the density function is
$$L_i = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(X_i-\mu)^2}{2\sigma^2}}$$
The joint probability of two events, Ei and Ej , occurring is the product of the two
probabilities, considering that the two events are independent of each other. Even
though Li and Lj , the two likelihood functions associated with observations i and
j, are not exactly probabilities, the same rule applies. There are 20 observations in
the given sample (Table 8.20). Thus, the likelihood of the sample is given by the
product of the corresponding likelihood values.
The sample likelihood is given by
$$L = \prod_{i=1}^{20} \left[ \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(X_i-\mu)^2}{2\sigma^2}} \right]$$
The likelihood values of the 20 observations are presented in Table 8.21. These
values are calculated based on μ = 100 and σ² = 380 (these two values are taken
from a set of all possible values).
To get the sample likelihood, the above likelihood values are to be multiplied.
Since the likelihood values are small, the sample likelihood will have an extremely
small value (in this case, it happens to be 7.5707 × 10⁻³⁹). It will be much better
to convert the individual Li values to their log values, and the log likelihood of the
sample can be obtained simply by adding the log values. The log likelihood of the
sample is obtained by
$$\log L = \sum_{i=1}^{20} \log \left[ \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(X_i-\mu)^2}{2\sigma^2}} \right]$$
Table 8.22 presents the values of Li along with their log values. The log
likelihood value (log L value) of the entire sample obtained from Table 8.22 is
–87.7765. The Log Li values in the above table are based on assumed values of
μ = 100 and σ² = 380. The exercise is repeated with different values of μ. The
log L values of the entire sample are plotted against the possible values of μ in
Fig. 8.6.
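The calculation can be reproduced with a few lines of base R. The sketch below uses the 20 incomes from Table 8.20; the grid of μ values (92 to 102 in steps of 0.5) is an assumption chosen to span the range of interest.

```r
# Incomes ('000) from Table 8.20
income <- c(131, 107, 88, 75, 83, 136, 72, 113, 109, 130,
            80, 97, 98, 92, 76, 103, 84, 91, 124, 82)

# Log likelihood of the whole sample for given mu and sigma^2
log_lik <- function(mu, sigma2) {
  sum(dnorm(income, mean = mu, sd = sqrt(sigma2), log = TRUE))
}

# Starting values used in the text
ll_start <- log_lik(100, 380)
print(ll_start)        # log likelihood of the sample
print(exp(ll_start))   # the (extremely small) sample likelihood itself

# Repeat for a grid of mu values, holding sigma^2 at 380
mu_grid <- seq(92, 102, by = 0.5)
ll_grid <- sapply(mu_grid, log_lik, sigma2 = 380)
mu_best <- mu_grid[which.max(ll_grid)]
print(mu_best)
```

Working on the log scale avoids multiplying 20 very small numbers and turns the product into a sum.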
Fig. 8.6 Log L values of the entire sample and possible values of μ
It can be seen from Fig. 8.6 that the maximum value of log L is obtained when
μ = 98.5. This is the maximum likelihood estimate (μ̂) for the population
parameter, μ. The entire exercise is repeated with different possible values of σ²
while keeping the value of μ at 98.5. The values of log L corresponding to different
values of σ² are plotted in Fig. 8.7.
The maximum likelihood estimate for σ² based on Fig. 8.7 is 378. Under the
assumed univariate normal distribution, the maximum likelihood estimates for the
sample data are therefore μ = 98.5 and σ² = 378.
Fig. 8.7 Log L values of the entire sample and possible values of σ²
This appendix provides a brief description of the maximum likelihood estimation
method. It demonstrates the method by using a univariate normal distribution. The
same technique can be extended to multivariate distribution functions. It is obvious
that the solution requires many iterations and considerable computing power. It may
also be noted that the values of the likelihood function are very small, and
consequently, the log likelihood function values tend to be negative. It is a general
practice to use the −log likelihood function (negative of the log likelihood function)
and correspondingly minimize it instead of maximizing the log likelihood function.
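The practice of minimizing the negative log likelihood can be sketched with R's general-purpose optimizer `optim()`. The incomes are those of Table 8.20; the starting values, the default Nelder-Mead method, and the log-scale parameterization of σ² are choices made here for illustration.

```r
# Incomes ('000) from Table 8.20
income <- c(131, 107, 88, 75, 83, 136, 72, 113, 109, 130,
            80, 97, 98, 92, 76, 103, 84, 91, 124, 82)

# Negative log likelihood; sigma^2 is passed on the log scale so it stays positive
neg_log_lik <- function(par) {
  mu     <- par[1]
  sigma2 <- exp(par[2])
  -sum(dnorm(income, mean = mu, sd = sqrt(sigma2), log = TRUE))
}

# Minimize, starting from mu = 100 and sigma^2 = 380
fit        <- optim(par = c(100, log(380)), fn = neg_log_lik)
mu_hat     <- fit$par[1]
sigma2_hat <- exp(fit$par[2])

# Closed-form MLEs for a normal sample, for comparison
mu_closed     <- mean(income)                   # 98.55
sigma2_closed <- mean((income - mu_closed)^2)   # 377.7475 (divides by n, not n - 1)

print(c(mu_hat = mu_hat, sigma2_hat = sigma2_hat))
print(c(mu_closed = mu_closed, sigma2_closed = sigma2_closed))
```

For the normal distribution the closed-form answers make the numerical search unnecessary; the optimizer is shown because the same pattern carries over to models without closed-form solutions.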
Appendix 3
We provide the R functions and command syntax that are used to build various
tables and charts referred to in the chapter. Refer to Table 8.23. It can be helpful for
practice.
Sudhir Voleti
1 Introduction
The main focus of this textbook thus far has been the analysis of numerical data.
Text analytics, introduced in this chapter, concerns itself with understanding and
examining data in word formats, which tend to be more unstructured and therefore
more complex. Text analytics uses tools such as those embedded in R in order to
extract meaning from large amounts of word-based data. Two methods are described
in this chapter: bag-of-words and natural language processing (NLP). This chapter
is focused on the bag-of-words approach. The bag-of-words approach does not
attribute meaning to the sequence of words. Its applications include clustering or
segmentation of documents and sentiment analysis. Natural language processing
uses the order and “type” of words to infer the meaning. Hence, NLP deals more
with issues such as parts of speech.
Consider the following scenarios: A manager wants to know the broad contours of
what customers are saying when calling into the company’s call center. A firm wants
to know if there are persistent patterns in the content of their customer feedback
S. Voleti
Indian School of Business, Hyderabad, Telangana, India
e-mail: sudhir_voleti@isb.edu
So what text data sources are individuals and organizations finding most common or
useful for their various purposes? Figure 9.1 shows the results of a survey conducted
at the 2014 Journal of Data Analysis Techniques (JDAT) conference in Europe,
attended by academia and industry alike. The results are unsurprising. Social media
and other user-generated content sources top the list—microblogs, full-length blogs,
online forums, Facebook postings, etc. followed by more “traditional” text content
sources such as mainstream media news articles, surveys (of customers, employees),
reports, and medical records and compliance filings.
Data collection and extraction from these sources is now possible at scale thanks to
the development of new tools, standards, protocols, and techniques (e.g., Application
Programming Interfaces (APIs)).
What are the building blocks of text analysis? What is the basic unit of analysis?
How similar or different is text analysis from the analysis of more structured, metric
data? The next section offers a briefing.
There are two broad approaches to handling text analysis. The first, bag-of-words,
assumes that words in the text are “exchangeable,” that is, their order does not matter
in conveying meaning. While this assumption vastly oversimplifies the quirks of text
as a means of communication, it vastly reduces the dimensionality of the text object
and makes the analysis problem very tractable. Thus, if the order did not matter then
two occurrences of the same word in any text document could be clustered together
and their higher-level summaries (such as counts and frequencies) could be used
for analysis. Indeed, the starting point of most text analysis is a data object called
the term-document matrix (TDM) which lists the counts for each term (word/phrase
token) in each document within a given set of text documents. The second approach,
natural language processing (NLP), attempts to interpret “natural language” and
assumes that content (as also context) depends on the order and “type” of words
used. Hence, NLP deals more with issues such as parts of speech and named entity
recognition.
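To make the bag-of-words idea concrete, the sketch below builds a small document-term matrix in base R. The three documents and the crude tokenizer are hypothetical; packages such as “tm” provide fuller implementations.

```r
# Hypothetical mini-corpus
docs <- c(d1 = "vanilla and chocolate",
          d2 = "chocolate chip and chocolate fudge",
          d3 = "vanilla")

# Crude tokenizer: lowercase, then split on anything that is not a letter
tokens <- strsplit(tolower(docs), "[^a-z]+")
vocab  <- sort(unique(unlist(tokens)))

# Rows = documents, columns = terms, cells = counts (word order is discarded)
dtm <- t(vapply(tokens,
                function(tk) as.integer(table(factor(tk, levels = vocab))),
                integer(length(vocab))))
dimnames(dtm) <- list(names(docs), vocab)

print(dtm)
```

Transposing this object with `t(dtm)` gives the term-document matrix, with terms as rows and documents as columns.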
Consider a passage from Shakespeare’s As You Like It, “All the world’s a
stage, and all the men and women merely players: they have their exits and
their entrances; and one man in his time plays many parts . . . ” While some may
see profound meaning in this (admittedly nonrandom) collection of words, Fig.
9.2 displays what happens when we process this text input in a computer’s text
analysis program (in this case, the open source R platform, in particular its “tm”
and “stringr” packages). The code that runs the above processing uses two user-
defined functions, “Clean_String()” and “Clean_Text_Block()”, for which a tutorial
can be found on Matt Denny’s Academic Website1 or you may refer to the code
(Fig_9.2_Shakespeare_quote.R) available on the book’s website. The breaking up
of text into “words” (or more technically, word “tokens”) is called tokenization.
While what we see in this simple routine are single-word tokens (or unigrams), it
is possible to define, identify, and extract phrases of two or more words that “go
together” in the text (bigrams, trigrams, etc.). However, that would require us to
consider the order in which the words first occurred in the text and would form a
part of NLP.
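A minimal base R sketch of tokenization is given below. It is a stand-in for illustration and does not reproduce the “Clean_String()” and “Clean_Text_Block()” functions mentioned above; the cleaning rules (lowercasing, keeping only letters, apostrophes, and spaces) are assumptions.

```r
passage <- "All the world's a stage, and all the men and women merely players"

# Lowercase, drop punctuation except apostrophes, split on spaces
clean    <- gsub("[^a-z' ]", "", tolower(passage))
unigrams <- strsplit(clean, " +")[[1]]

# Token counts, most frequent first
print(sort(table(unigrams), decreasing = TRUE))

# Bigrams pair each token with its successor, so word order now matters
bigrams <- paste(head(unigrams, -1), tail(unigrams, -1))
print(bigrams)
```

The unigram counts ignore where each word occurred, whereas the bigrams can only be formed by walking through the tokens in their original order.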
># View a sample of the DTM, sorted from most to least frequent token count
>dtm_clean <- dtm_clean[,order(apply(dtm_clean,2,sum),decreasing=T)]
>inspect(dtm_clean[1:5,1:5])
<<DocumentTermMatrix (documents: 5, terms: 5)>>
Non-/sparse entries: 11/14
Sparsity : 56%
Maximal term length: 7
Weighting : term frequency (tf)
Sample :
Terms
Docs butter chip chocol mint vanilla
1 0 0 1 0 1
3 0 0 1 0 1
4 1 0 1 0 0
5 0 1 1 0 0
6 1 0 1 0 1
>
Fig. 9.4 (a, b) Wordclouds under the TF and TFIDF weighing schemes
second weighing scheme for tokens across documents, labeled TFIDF for “term
frequency–inverse document frequency,” has gained popularity. The basic idea is
that a token’s frequency should be normalized by the number of documents in which
that token occurs (i.e., its document frequency). Thus, tokens with very
high frequency spread evenly across documents tend to get discounted whereas
those which occur more in some documents rather than all over the corpus gain
in importance. Thus, “butter-scotch” would get a higher TFIDF score than “ice-
cream” because only a subset of people would say the former, thereby raising its
inverse document frequency. Various TFIDF weighing schemes have been proposed
and implemented in different text analysis packages. It is prudent to test a few to see
if they make more sense in the context of a given corpus than the simple TF scheme.
Figure 9.4b displays a TFIDF wordcloud. The differences between that and the TF
wordcloud in Fig. 9.4a are visibly apparent.
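One common TFIDF variant can be sketched in base R as follows. The small count matrix is hypothetical, and the formula used here (term count multiplied by the log of the number of documents over the document frequency) is only one of the several schemes mentioned above.

```r
# Hypothetical counts: 4 respondents (rows) by 3 tokens (columns)
dtm <- matrix(c(1, 1, 0,
                1, 0, 0,
                1, 0, 1,
                1, 1, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("doc", 1:4),
                              c("ice-cream", "vanilla", "butter-scotch")))

doc_freq <- colSums(dtm > 0)             # number of documents containing each token
idf      <- log(nrow(dtm) / doc_freq)    # inverse document frequency
tfidf    <- sweep(dtm, 2, idf, "*")      # scale each token's column by its IDF

print(round(tfidf, 3))
```

In this toy example “ice-cream” appears in every document and is weighted down to zero, while “butter-scotch,” used by a single respondent, receives the largest weight.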
Wordclouds are very basic in that they can only say so much. Beyond simple
term frequency, one might also want to know which tokens occur most frequently
together within a document. For instance, do “vanilla” and “chocolate” go together?
Or do “vanilla” and “peanut butter” go together more? More generally, does
someone who uses the term “big data” also say “analytics”? A co-occurrence graph
(COG) highlights token-pairs that tend to co-occur the most within documents
across the corpus. The idea behind the COG is straightforward. If two tokens
co-occur in documents more often than by random chance, we can use a network graph
framework to “connect” the two tokens as nodes with a link or an “edge.” Because
the odds are high that any two tokens will co-occur at least once in some document
in the corpus, we introduce connection “thresholds” that ensure two node tokens are
linked only if they co-occur more than the threshold number of times. Even so, we
often see too many connections across too many nodes. Hence, in “cleaned” COGs,
we designate nodes as “primary” (central to the graph) and “secondary” (peripheral
nodes which only connect to one central node at a time), and we suppress
interconnections between peripheral nodes for visual clarity. The R shiny app
(discussed in the next section) can be used to generate a cleaned COG from the top
tokens in the ice-cream dataset, as shown in Fig. 9.5. We can interpret the COG in
Fig. 9.5 as follows. We assume the peripheral (pink) nodes connect to one another
only through the central (green) nodes, thus taking away much of the clutter in the
earlier COG. Again, the links or edges between nodes appear only if the
co-occurrence score crosses a predefined threshold.
Fig. 9.5 A cleaned co-occurrence graph (COG) for the ice-cream dataset
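The thresholding step can be sketched in base R. The small document-term matrix and the threshold of 1 are hypothetical, and the sketch stops at the adjacency matrix; drawing the graph and the primary/secondary node cleaning are not shown.

```r
# Hypothetical DTM: 5 documents by 3 tokens
dtm <- matrix(c(1, 1, 0,
                1, 1, 0,
                1, 0, 1,
                0, 1, 1,
                1, 1, 0),
              nrow = 5, byrow = TRUE,
              dimnames = list(NULL, c("vanilla", "chocolate", "mint")))

present <- (dtm > 0) * 1          # 1 if the token appears in the document
cooc    <- crossprod(present)     # token-by-token co-occurrence counts
diag(cooc) <- 0                   # ignore a token's co-occurrence with itself

threshold <- 1                    # hypothetical cutoff
adjacency <- cooc > threshold     # TRUE marks an edge in the graph

print(cooc)
print(adjacency)
```

With this cutoff only the vanilla and chocolate pair survives, since the other pairs co-occur in a single document each.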
Today, a wide variety of platforms and tools are available to run standard text
analysis routines. To run the analysis and obtain outputs similar to the ones shown
in the figures, do the following.
• Open RStudio and copy-paste the code for “shiny” example3 from author’s
github account into a script file.
– Now, select lines 37–39 (#Basic Text-An-app) in RStudio and click “Run.”
RStudio first installs required libraries from the source file (line 38). Then it
launches the shiny app for Factor An (line 39).
• Examine what the app is like—the Input Sidebar and the Output Tabs. This entire
process has been described in the YouTube video tutorial4 for a sample dataset
and set of shiny apps in R.
• Now use the app and read in the ice-cream data, either as a text or a csv file.
• Once the app has run (may take up to several minutes to finish processing
depending on the size of the dataset and the specification of the local machine),
explore the output tabs to see what shows up. The video tutorial provides some
assistance in this regard.
3 https://raw.githubusercontent.com/sudhir-voleti/profile-script/master/sudhir%20shiny%20app
to a heterogeneous mass. So, how do marketers segment and on what basis? Well,
several bases can be considered: demographic (age, income, family size, zip code,
etc.) and psychographic (brand conscious, price sensitive, etc.) are two examples.
The ideal basis for marketers specifically is “customer need,” which is often latent
and hard to observe. In such instances, text offers a way to access qualitative insights
that would otherwise not be reachable by traditional means. After segmentation,
marketers would typically evaluate which segments are worth pursuing and how to
target them.
Let us run a segmentation of the ice-cream survey’s respondents based on the
ice-cream flavor preferences they have freely stated in unstructured text form. The
idea is to simply take the DTM and run the k-means clustering algorithm (see
Chap. 15 on Unsupervised Learning) on it. Thus, documents that are “similar”
based on the tokens used would group together. To proceed with using the k-means
algorithm, we must first input the number of clusters we want to work with.
To run segmentation, select the code block (#Segmentation-discriminant-targeting App)
in RStudio and click “Run.” Line 27 launches the shiny app for segmentation,
discriminant, and classification. Upload the ice-cream data for segmentation (refer
to Fig. 9.6). We can estimate the number of clusters through the scree plot on the
“Summary—Segmentation” tab. Looking at where the “elbow” is sharpest (refer
to Fig. 9.7), we can arrive at what a reasonable number of clusters would be. The
current example suggests that 2, 4, or 7 clusters might be optimal. Let us go with 7
(since we have over a thousand respondents).
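Outside the app, the same idea can be sketched with base R's `kmeans()`. The simulated document-term matrix below is hypothetical, with three preference groups built in, and the within-cluster sum of squares is used as the scree criterion; the app's own plot may use a different measure.

```r
set.seed(42)

# Hypothetical DTM: 30 respondents, 3 tokens, three built-in preference groups.
# Each row of `rates` holds the expected token counts for one respondent.
rates <- rbind(
  matrix(rep(c(4, 0.3, 0.3), 10), ncol = 3, byrow = TRUE),
  matrix(rep(c(0.3, 4, 0.3), 10), ncol = 3, byrow = TRUE),
  matrix(rep(c(0.3, 0.3, 4), 10), ncol = 3, byrow = TRUE)
)
dtm <- matrix(rpois(length(rates), lambda = rates), nrow = nrow(rates),
              dimnames = list(NULL, c("vanilla", "chocolate", "mint")))

# Scree values: total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) kmeans(dtm, centers = k, nstart = 20)$tot.withinss)
print(round(wss, 1))

# Fit the chosen solution and inspect segment sizes and centroid profiles
fit <- kmeans(dtm, centers = 3, nstart = 20)
print(table(fit$cluster))
print(round(fit$centers, 2))
```

By construction one would expect the drop in the scree values to flatten after three clusters; with real survey text the elbow is rarely this clean, which is why several candidate values are often considered.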
We can interpret the output as follows. The “Summary—Segmentation” tab
yields segment sizes, centroid profiles, and other classic descriptive data about the
obtained clusters. The tab “Segmentation—Data” yields a downloadable version
of segment assignment to each document. The two output tabs “Segmentation—
Wordcloud” and “Segmentation co-occurrence” display segment wordclouds and
COGs alongside segment sizes. We can observe that the rather large Segment 1 (at
44% of the sample) seems to be those who prefer vanilla followed by those who
prefer butter-pecan at Segment 2 (8.4% of the sample) and so on.
Sentiment Analysis
Sentiment mining is an attempt to detect, extract, and assess value judgments,
subjective opinion, and emotional content in text data. This raises the next question,
“How is sentiment measured?” Valence is the technical term for the subjective
inclination of a document—measured along a positive/neutral/negative continuum.
Valence can be measured and scored.
Machines cannot determine the existence of sentiment content in a given word
or phrase without human involvement. To aid machines in detecting and evaluating
sentiment in large text corpora, we build word-lists of words and phrases that are
likely to carry sentiment content. In addition, we attach a “score” to each word or
phrase to represent the amount of sentiment content found in that text value. Since
sentiment is typically measured along a positive/neutral/negative continuum, the
steps needed to perform an analysis are fairly straightforward. First, we start with a
processed and cleaned corpus. Second, we match the stemmed tokens of each document
in the corpus against positive (or negative) word-lists. Third, we score each document
in the corpus with a positive (or negative) “polarity.” We can then plot the
distribution of sentiment scores for each document as well as for the corpus as a
whole. The positive and negative word-lists we use in the App are from the Princeton
dictionary5. Since they are general, their effectiveness is somewhat limited in
detecting sentiment in a given content. To customize sentiment analysis, one can and
probably should build one’s own context-specific sentiment scoring scheme giving
valence weights for the most common phrases that occur in the domain of interest.
Ideally, businesses and organizations would make domain-specific wordlists (and
corresponding sentiment scores) for greater accuracy.
[Figure: scree plot with “Variances” on the vertical axis (0.5 to 3.5) against 1 to 10 on the horizontal axis]
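The word-list approach can be sketched in a few lines of base R. The word-lists, the documents, and the scoring rule (count of positive matches minus count of negative matches) are hypothetical; stemming is skipped and every word carries a weight of one.

```r
# Hypothetical word-lists and documents, for illustration only
positive <- c("love", "great", "delicious")
negative <- c("hate", "bland", "awful")
docs <- c("I love vanilla, it is delicious",
          "Mint is bland and awful",
          "Chocolate chip")

# Polarity = number of positive matches minus number of negative matches
score_doc <- function(txt) {
  tokens <- strsplit(tolower(txt), "[^a-z]+")[[1]]
  sum(tokens %in% positive) - sum(tokens %in% negative)
}

polarity <- vapply(docs, score_doc, integer(1), USE.NAMES = FALSE)
print(data.frame(doc = docs, polarity = polarity))
```

The third document contains no word from either list and scores zero, which mirrors the observation that noun-heavy text yields little sentiment signal. A domain-specific scheme would replace the unit weights with valence scores.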
Let us see the result of running sentiment analysis on the ice-cream dataset. In
the R app that you previously launched, look at the last two output tabs, namely
“sentiment analysis” and “sentiment score data.” One reason why the sentiment-
laden wordclouds in the first output tab are so small is that most sentiment-laden
tokens are adjectives while the ice-cream dataset consists of mostly nouns. Once
sentiment scores by document are obtained, a host of downstream analysis becomes
possible. The documents can now be sorted based on their “polarity” toward the
topic at hand (i.e., ice-cream flavors). Such analyses are useful to businesses study-
ing brand mentions on social media, for instance. Running trackers of sentiment
over time can also provide useful information to managers. What we saw in this app
was elementary sentiment mining. More advanced versions of sentiment mining and
polarity-scoring schemes, leveraging natural language processing or NLP (discussed
in detail in the next section), can be performed.
Let us step back and look at what we have covered with elementary text analysis
thus far. In a nutshell, we have been able to rapidly crunch through raw text input on
a scalable level, reduce open-ended text to a finite dimensional object (TDM), apply
standard analysis techniques (k-means segmentation) to this object, and sense what
might be the major preference groups and important attributes that emerged through
our analysis. These techniques help us to achieve a state in which a survey method
can be leveraged for business insight. Next, we leave behind “basic” text analysis
and head into a somewhat more advanced version wherein we will uncover hidden
structure beneath text data, that is, latent topic mining and modeling.
We know that organizations have huge data stored in text form—and it is likely
that people are looking for ways to extract “meaning” from these texts. Imagine
files piled up on a desk—the pile represents a corpus and each individual file a
“document.” One may ask, can we “summarize” the text content of those files? If
so, what might a summary look like? One answer is that it should contain broad
commonalities in coherent, inter-related text content patterns that occur across
the files in the pile. Thus, one way to view meaning is in the form of coherent,
condensed “topics” or “themes” that underlie a body of text. In the past, automated
analysis has relied on simple models that do not directly address themes or topics,
leaving us to derive meaning through a manual analysis of the text corpus. An
approach to detect and extract coherent “themes” in the text data has become
popular. It is the basis for topic mining of text corpus described in detail below.
To illustrate what underlying themes might look like or mean in a text corpus
context, we take a simple example. Consider a text corpus of 20 product reviews.
Suppose there are two broad themes or topics in the structure of the corpus, “price”
and “brand.” Further, suppose that each document is a mixture of these two topics
in different proportions. We can then argue that a document that talks about “price”
90% of the time, and “brand” 10%, should have nine times more “price” terms
than “brand” terms. A topic model formalizes this intuition mathematically. The
algorithm yields which tokens belong to which topics with what probability, and
which documents “load” (have strong association, see below for an example) on
which topics with what proportion. Given this, we can sort, order, plot, analyze, etc.
the tokens and the documents.
So, how can we directly mine for these latent topics or themes in text? The
basic idea is analogous to the well-known Factor Analytic procedures (see
Chap. 15 on Unsupervised Learning). Recall what happens in traditional factor
analysis. A dataset with R rows and C columns is factorized into two components—
an R × F scores matrix and an F × C loadings matrix (R stands for the number
of observations, C stands for the attributes of each observation, F then stands for
the number of factors into which the data is decomposed). These factors are then
labeled and interpreted in terms of the composite combinations of variables that
load on them. For example, we can characterize each factor by observing which
variables “load” highest onto it (Variable 1 positively, Variable 4 negatively, etc.).
Using this we interpret what this means and give each factor an informative label.
Now let us see what would happen if instead of a traditional metric variable dataset,
our dataset was a document term matrix (DTM) with D documents and T terms?
And instead of conventional factors, we use the term “Topic Factors” for the factor.
This would result in a D × F scores matrix of documents-on-factors, and an F × T
loadings matrix of terms-on-factors. What are these matrices? What do these scores
and loadings mean in a text context? Let us find out through a hands-on example.
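Before turning to the example, the shapes of the two matrices can be illustrated with a truncated singular value decomposition in base R. This is only a stand-in used to show the dimensions; it is not the latent topic model applied by the app, and the simulated counts are hypothetical.

```r
set.seed(7)
n_docs <- 6; n_terms <- 5; n_factors <- 2

# Hypothetical document-term matrix of counts (D x T)
dtm <- matrix(rpois(n_docs * n_terms, lambda = 2), nrow = n_docs)

# Keep only the first n_factors singular vectors
s <- svd(dtm, nu = n_factors, nv = n_factors)
scores   <- s$u %*% diag(s$d[1:n_factors], nrow = n_factors)  # D x F: documents on factors
loadings <- t(s$v)                                            # F x T: terms on factors

# Rank-F reconstruction of the D x T matrix
approx_dtm <- scores %*% loadings

print(dim(scores))
print(dim(loadings))
print(dim(approx_dtm))
```

Each row of the scores matrix describes one document in terms of the factors, and each row of the loadings matrix describes one factor in terms of the vocabulary.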
The next dataset we will use is a comma-separated values (CSV) file containing
the mission statements from a subset of Fortune 1000 firms. It is currently stored on
the author’s github page. You can run the following lines of R code in RStudio to
save the file on your local machine.
> # saving the data file as a .csv in your local machine
> mission.stments.data = read.csv("https://raw.githubusercontent.com/sudhir-voleti/sample-data-sets/master/Mission%20Statements%20v1.csv")
> # save file as data1.csv on your local machine
> write.csv(mission.stments.data, "data1.csv")
> ### Topic Mining App
> # runGitHub() is a shiny function; load the package in case the dependency script does not
> library(shiny)
> source("https://raw.githubusercontent.com/sudhir-voleti/text-topic-analysis-shinyapp/master/dependency-text-topic-analysis-shinyapp.R")
> runGitHub("text-topic-analysis-shinyapp", "sudhir-voleti")
Now invoke the Topic mining app in the shiny apps list (code above) and explore
the app’s input fields and output tabs. Now read in the saved .csv file on mission
statements into the app. Similar to the process used in our basic text analysis app,
this too tokenizes the corpus, creates a TDM as its basic unit of analysis, and applies
a latent topic model upon it. It calls upon the user (in this case, us) to tell it the
optimal number of topics we think there are in the corpus. The default in the app is
2, but this can be manually changed.
The “TDM & Word Cloud” tab shows the corpus level wordcloud and the “Topic
Model—Summary” tab shows the top few phrases loading onto each topic factor.
The next three tabs display topic model output. At first glance, the output in the
“Topics Wordcloud” and “Topics Co-occurrence” tabs may seem no different from
those in the Basic Text An app. However, while the Basic Text An app represented
TDM-based clusters (wherein entire documents are hard-allocated to one cluster
or another), what we see now are topic factors that can co-exist within the same
document. To recap, this model says in a nutshell that every document in the corpus
is a mixture of topics in some proportion and that each topic is a distribution over
word or phrase tokens with some topic membership probability. Both the topic
membership probability vector per token and topic proportions vector per document
can be simultaneously estimated in a Bayesian manner.
Returning to the example at hand, Fig. 9.8a, b show the wordclouds from Topics
1 and 2 respectively. Topic 1 seems to emphasize “corporation” and allied terms
(services, systems, provide), whereas Topic 2 emphasizes “Customers” with allied
terms (solutions, employees, product-services). This suggests that Topic 1 is perhaps
a measure of company (and perhaps, product) centricity whereas Topic 2 is that
of customer centricity in firms’ mission statements. As before, the co-occurrence
graphs are used in conjunction with the wordclouds for better interpretation.
Fig. 9.8 (a, b) Wordclouds for topic factors 1 and 2 for the mission statements dataset
Finally, the last tab in the app, “Data with topic proportions” yields a download-
able file that shows the proportion of each document that is dominated by tokens
loading on a particular topic. Thus, in this example, Microsoft corporation seems
to have a 70–30 split between the company and customer centricity of its mission
statement while that of toymaker Mattel, Inc. is 15–85.
While the “bag-of-words” approach we have seen thus far is useful, it has severe
limitations. Human language, or “natural language,” is much more complex
than a bag-of-words representation can capture. The same set of words used in a
different order could produce a different meaning. Even the same set of words uttered
in a different tone could have a different meaning. Often, what precedes or succeeds
particular sentences or paragraphs can impact the contextual meaning of any set of
words. This brings our discussion to natural language processing, or NLP. By definition, NLP
is a set of techniques that enable computers to detect nuances in human language
that humans are able to detect automatically. Here, “nuances” refers to entities,
relationships, context, and meaning among other things. So, what are some things
we humans process automatically when reading or writing “natural” text? We parse
text out into paragraphs and sentences—while we may not explicitly label the parts-
of-speech (nouns, verbs, etc.), we can certainly understand and identify them. We
notice names of people, places, dates, etc. (“entities”) as they come up. And we can
infer whether a sentence or paragraph portrays a happy, angry, or sad tone. What
even human children appear to do effortlessly presents some tall challenges to the
machine. Human language is too rich and subtle for computer languages to capture
anywhere near the total amount of information “encoded” in it.
Between R and Python, the two main open-source data science alternatives,
which is better for NLP? Python’s natural language toolkit6 (NLTK) is a clear
winner here, but R is steadily closing the gaps with each passing quarter. The
Apache OpenNLP7 package in R provides access from the local machine to some
trained NLP models. In what follows, let us very briefly see some of the main
functions that NLP of written text involves.
The simplest NLP functions involve recognizing sentences (and words) as
distinct data forms. Tokenization does achieve some of this for words but sentence
annotations come well within the ambit of NLP. So how does a machine identify
that one sentence has ended and another has begun? Typically, large corpora of
preclassified and annotated text content are fed to the machine and the machine
trains off this data. The output of applying this trained algorithm on new or “virgin”
data is an annotated document that delineates every word and sentence in the
text separately. How can we use these sentence level annotations? Theoretically,
sentences could now take the place of documents to act as our “rows” in the TDM.
A simple sentence expresses a single idea, typically (unlike compound sentences or
paragraphs). So, instead of doing sentiment analysis at the document level—we can
do it at the sentence level. We can then see which associations (adjectives) appear
with what nouns (brands, people, places, etc.). Co-occurrence graphs (COGs) can be
built giving more weight to words co-occurring in sentences than in the document
as a whole. We now have a building block that we could scale up and apply to
documents and to corpora, in principle.
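The idea of treating sentences as the unit of analysis can be sketched in a few lines of Python. This is only an illustration: the splitter below is a naive punctuation rule, not a trained sentence annotator of the kind described above, and the function names and sample text are made up for the example.

```python
import re
from collections import Counter
from itertools import combinations

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Lowercase the sentence and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", sentence.lower())

def sentence_cooccurrence(text):
    """Count unordered word pairs that appear together in the same sentence."""
    pairs = Counter()
    for sentence in split_sentences(text):
        words = sorted(set(tokenize(sentence)))
        pairs.update(combinations(words, 2))
    return pairs

# Made-up sample text for illustration.
sample = "The camera is great. The battery is weak! Would I buy it again?"
sentences = split_sentences(sample)
cooc = sentence_cooccurrence(sample)
print(sentences)
print(cooc[("camera", "great")], cooc[("battery", "great")])
```

Because pairs are counted within sentences only, "camera" is linked to "great" but "battery" is not, which is the kind of sharper association a sentence-level COG is meant to capture.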
A popular application of NLP is in the Named Entity Recognition (NER)
space. To illustrate, imagine a large pile of files on your work desk, a stack
of documents forming a corpus. Your task is to identify and extract every
instance of a person’s name, an organization’s name, a phone number, or some
combination of these that occurs in that corpus. This is an
NER problem. An entity is basically a proper noun, such as the name of a person or
place. In R, OpenNLP’s NER annotator identifies and extracts entities of interest
from an annotated document of the kind we saw previously. OpenNLP can find dates,
locations, money, organizations, percentages, people, and times (corresponding to
“date,” “location,” “money,” “organization,” “percentage,” “person,” “misc”). The
quality of the data recovery and the results that we get depend hugely on the type
and training of the NER algorithm being employed. OpenNLP’s NER annotator is
fairly basic but does well on western entities. A more detailed exposition of NLP can
be found in several excellent books on the subject; see, for example, Bird et al. (2009)
and Robinson and Silge (2017). Also, Exercise 9.2 provides step-by-step guidance
on NER problem-solving using the OpenNLP package in R.
6 NLTK package and documentation are available on http://www.nltk.org/ (accessed on Feb 10,
2018).
7 Apache OpenNLP package and documentation are available on https://opennlp.apache.org/
All the datasets, code, and other material referred to in this section are available at
http://www.allaboutanalytics.net.
• Data 9.1: Generate_Document_Word_Matrix.cpp
• Code 9.1: Github_shiny_code.R
• Code 9.2: Icecream.R
• Data 9.2: Icecream.txt
Exercises
Ex. 9.1 Go to IMDB and extract 100 reviews (50 positive and 50 negative) for your
favorite movie.
(a) Preprocess the data: remove punctuation marks, numbers, and non-ASCII
characters, convert the whole text to lowercase/uppercase, remove stop-words,
and apply stemming.
Do elementary text analysis:
(b) Create a document term matrix.
(c) Check word-clouds and COGs under both TF (term frequency) and TFIDF
(term frequency-inverse document frequency) weighting schemes to see which
configurations appear most meaningful/informative.
(d) Iterate by updating the stop-words list, etc.
(e) Compare each review’s polarity score with its star rating. You can choose to use
a simple cor() function to check correlation between the two data columns.
(f) Now, make a recommendation. Which movie attributes or aspects (plot? star cast?
length? etc.) worked well and should be retained by the studio? Which ones did not
work well and should be changed?
Explore by trial and error different configurations (which stop-words
to use for maximum meaning? TF or TFIDF? etc.) in the text analytics of a
simple corpus. You may also use topic modeling if you wish.
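As a starting point for parts (a) to (c), the preprocessing and weighting steps can be sketched as follows. This is a minimal sketch in Python rather than R, under several assumptions: the stop-word list and the two reviews are made up, stemming is omitted, the preprocessing drops non-ASCII characters, and TFIDF is taken to be the raw term count times log(N / document frequency), which is only one of several common variants.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list (an assumption); update it iteratively as in part (d).
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "of", "it", "this"}

def preprocess(text):
    """Lowercase, drop non-ASCII characters, punctuation and digits, remove stop-words."""
    text = text.lower()
    text = text.encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"[^a-z\s]", " ", text)
    return [w for w in text.split() if w not in STOP_WORDS]

def term_frequencies(docs):
    """One Counter of term counts per document: the rows of a document term matrix."""
    return [Counter(preprocess(d)) for d in docs]

def tfidf(tf_rows):
    """Weight each count by log(N / number of documents containing the term)."""
    n_docs = len(tf_rows)
    doc_freq = Counter()
    for row in tf_rows:
        doc_freq.update(row.keys())
    return [
        {term: count * math.log(n_docs / doc_freq[term]) for term, count in row.items()}
        for row in tf_rows
    ]

# Two made-up reviews for illustration.
reviews = [
    "The plot was gripping and the cast was superb!",
    "A dull plot; the 2 leads lacked chemistry.",
]
tf = term_frequencies(reviews)
weights = tfidf(tf)
print(tf[0])
print(weights[0])
```

Under this weighting, a term such as "plot" that appears in every review receives a weight of zero, which is why TF and TFIDF word-clouds can look quite different.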
Ex. 9.2 NLP for Entity recognition.
(a) Select one well-known firm from the list of the Fortune 500 firms.
(b) For the selected firm, scrape its Wikipedia page.
(c) Using openNLP, find all the locations and persons mentioned in the Wikipedia
page.
Note: You can use either openNLP’s NER functionality or, alternatively,
the home-brewed noun-phrase chunker. If using the latter, manually separate
persons and locations of interest.
(d) Plot all the extracted locations from the Wikipedia page on a map.
(e) Extract all references to numbers (dollar amounts, number of employees, etc.)
using Regex.
Algorithm:
Step 1: Choose one company from the list of Fortune 500 firms (the list itself
can be obtained by web scraping). For example, I have chosen Walmart.
Step 2: Navigate to the Wikipedia page of the selected firm. For example, copy-paste
the top paragraphs of the page into a string in R.
Step 3: Install OpenNLP in R
1. Load the required packages in R and perform basic tokenization.
2. Now generate annotators that compute sentence and word annotations
in openNLP.
3. Now annotate persons and locations using the entity annotator. Example:
Maxent_Entity_Annotator(kind = "location") and Maxent_Entity_Annotator(kind =
"person").
Step 4: Load the ggmap and rworldmap packages to plot the locations on a map.
Step 5: Using regular expressions in R, match the patterns of numbers and display
the results.
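The number-matching idea in Step 5 can be illustrated with a short sketch. It is written in Python rather than R, and both the pattern and the sample sentence (about a fictitious firm) are assumptions made for the example; real Wikipedia text will need a richer pattern.

```python
import re

# Optional currency prefix, digits with optional thousands separators,
# optional decimal part, optional magnitude word (all illustrative choices).
NUMBER_PATTERN = re.compile(
    r"(?:US\$|\$)?\d+(?:,\d{3})*(?:\.\d+)?"
    r"(?:\s(?:thousand|million|billion|trillion))?"
)

def extract_numbers(text):
    """Return every substring that looks like a number or a dollar amount."""
    return NUMBER_PATTERN.findall(text)

# Made-up sentence about a hypothetical company.
sample = (
    "Acme Corp reported revenue of $514.4 billion in 2019 "
    "and employs 2,200,000 people in 27 countries."
)
matches = extract_numbers(sample)
print(matches)
```

Since the pattern uses only non-capturing groups, `findall` returns the full matched strings, so dollar amounts keep their currency sign and magnitude word.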
Ex. 9.3 Empirical topic modeling.
(a) Choose three completely different subjects. For example, choose “cricket,”
“macroeconomics,” and “astronomy.”
(b) Scrape Wikipedia pages related to the chosen subjects. Treat each paragraph as
a document and annotate each document with its category (approximately 50
paragraphs should be analyzed for each category). For example, for the subject cricket
we can search for One Day International, Test cricket, IPL, etc.
(c) Now create a simulated corpus of 50 documents thus: The first of the 50
documents is a simple concatenation of the first document from subject 1, from
subject 2 and from subject 3. Likewise, for the other 49 documents.
Thus, our simulated corpus now has “composite” documents, that is, docu-
ments composed of three distinct subjects each.
(d) Run the latent topic model code for k = 3 topics on this simulated corpus of 50
composite documents.
(e) Analyze the topic model results—Word clouds, COGs, topic proportions in
documents.
(f) See
• Whether the topic model is able to separate each subject from other subjects.
To what extent is it able to do so?
• Are there mixed tokens (with high lift in more than one topic)? Are the
highest LIFT tokens and the document topic proportions (ETA scores) clear
and able to identify each topic?
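The construction of the simulated corpus in part (c) can be sketched as below. This is a minimal sketch in Python with placeholder paragraphs; the function name and the inputs are hypothetical, and in the exercise each list would hold about 50 scraped paragraphs.

```python
def build_composite_corpus(*subject_docs):
    """Concatenate the i-th document of every subject into one composite document."""
    lengths = {len(docs) for docs in subject_docs}
    if len(lengths) != 1:
        raise ValueError("every subject needs the same number of documents")
    return [" ".join(group) for group in zip(*subject_docs)]

# Placeholder paragraphs standing in for scraped Wikipedia text.
cricket = ["cricket para 1", "cricket para 2"]
macro = ["macro para 1", "macro para 2"]
astro = ["astro para 1", "astro para 2"]

corpus = build_composite_corpus(cricket, macro, astro)
print(corpus[0])
```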
References
Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python. Sebastopol, CA:
O’Reilly Media.
Robinson, D., & Silge, J. (2017). Text mining with R: A tidy approach. Sebastopol, CA: O’Reilly
Media.
Part II
Modeling Methods
Chapter 10
Simulation
Sumit Kunnumkal
1 Introduction
S. Kunnumkal
Indian School of Business, Hyderabad, Telangana, India
e-mail: sumit.kunnumkal@queensu.ca
2 Motivating Examples
We will use the following example throughout the chapter: Consider a fashion
retailer who has to place an order for a fashion product well in advance of the selling
season, when there is considerable uncertainty in the demand for the product. The
fashion item is manufactured in its factories overseas and so the lead time to obtain
the product is fairly long. If the retailer orders too large a quantity, then it is possible
that the retailer is left with unsold items at the end of the selling season, and this
being a fashion product loses a significant portion of its value at the end of the
season. On the other hand, if the retailer orders too little, then it is possible that
the product may be stocked out during the selling season. Since the lead time of the
product is long compared to the length of the selling season (typically 12 weeks), the
retailer is unable to replenish inventory of the product during the selling season and
stock outs represent a missed sales opportunity. The retailer would like to understand
how much to order factoring in these trade-offs.
The example above is a business problem where
decisions have to be made in the presence of uncertainty, considering a number
of different trade-offs. In this chapter, we will study Monte Carlo simulation, which
is an effective technique for making decisions in the presence of uncertainty.
What Is Simulation?
Applications
While simulation has its origins in World War II, it continues to find new
and interesting applications: HARVEY is a biomedical engineering software that
simulates the flow of blood throughout the human body based on medical images
of a patient. This can be a useful tool to inform surgical planning or to design new
drug delivery systems (Technology Review 2017). GE uses a simulation model of
a wind farm to inform the configuration of each wind turbine before the actual
construction (GE Look ahead 2015). UPS has developed a simulation software
called ORION ride that it uses to simulate the effectiveness of new package delivery
routes before actually rolling them out (Holland et al. 2017). Jian et al. (2016)
describe an application to bike-sharing systems, while Davison (2014) describes
applications in finance.
The continued interest in building simulation models to answer business ques-
tions stems from a number of reasons. For one, the greater availability of data
allows for models that are able to describe the underlying uncertainty more
accurately. A second reason is the growing complexity of business problems in
terms of the volume and frequency of the transactions as well as the nonlinear
nature of the relationships between the variables. Nonlinear models tend to quickly
become challenging to analyze mathematically and a simulation model is a partic-
ularly effective technique in such situations. Furthermore, the performance metrics
obtained by simulation models, such as expected values, are also more appropriate
for business situations involving a large volume of transactions. Expected values
can be interpreted as long run averages and are more meaningful when applied to
a large number of repeated transactions. Finally, advances in computing hardware
and software also make it possible to run large and complex simulation models in
practice.
One of the main advantages of a simulation model is that it is easy to build and
follow. A simulation model is a virtual mirror of the real-world business problem. It
is therefore easy to communicate what the simulation model is doing, since there is a
one-to-one correspondence between elements in the real-world problem and those in
the simulation model. A simulation model is also flexible and can be used to model
complex business processes, where the mathematical analysis becomes difficult. It
is also possible to obtain a range of performance metrics from the simulation model
and this can be particularly helpful when there are multiple factors to be considered
when making decisions.
One disadvantage of a simulation model is that it often tends to be a “black-
box” model. It describes what is happening, but does not provide insight into the
underlying reasons. Since simulation models are easy to build, there is a tendency
to add a number of extraneous features to the model. This drives up the complexity,
which makes it difficult to derive insight from the model, and it
usually does not improve the quality of the solution.
Before we go and build a simulation model, we note that we can answer the
questions exactly in this case since the relation between the input and the output
is linear. We therefore do the exact analysis first so that we have a benchmark
against which to compare the simulation results later.
Exact analysis: Since Y is a linear function of X, we can use linearity of
expectations and conclude that E[Y] = 250E[X] – 150,000 = Rs 95,000. Therefore,
the expected profit is Rs. 95,000.
We have $V[Y] = 250^2 V[X] = 250^2 \times 300^2$. The standard deviation of Y is
therefore 250 * 300 = 75,000. Therefore, the standard deviation of the retailer’s
profit is Rs. 75,000.
Since X is normally distributed and Y is a linear function of X, Y is also normally
distributed with a mean of 95,000 and a standard deviation of 75,000. We can use
z-tables or Excel functions to determine the answer. For example, using the Excel
function NORM.DIST(.), we get P(Y <= 0) = 0.10 and P(Y >= 100,000) = 0.47.
Therefore, there is about a 10% chance that the retailer would not break even and
there is a 47% chance that the retailer’s profits would exceed Rs. 100,000.
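The exact analysis can also be reproduced outside Excel. The following is a minimal sketch using the `NormalDist` class from Python's standard library in place of NORM.DIST(.); the variable names are chosen for illustration.

```python
from statistics import NormalDist

MARGIN = 250          # profit margin per unit sold (Rs)
FIXED_COST = 150_000  # fixed cost (Rs)

# Demand X is normal with mean 980 and standard deviation 300.
demand = NormalDist(mu=980, sigma=300)

# Y = 250 X - 150,000 is a linear function of X, so Y is also normal.
profit = NormalDist(
    mu=MARGIN * demand.mean - FIXED_COST,
    sigma=MARGIN * demand.stdev,
)

p_loss = profit.cdf(0)            # P(Y <= 0)
p_high = 1 - profit.cdf(100_000)  # P(Y >= 100,000)

print(profit.mean, profit.stdev)
print(round(p_loss, 3), round(p_high, 3))
```

These values serve as the benchmark against which the simulation estimates are compared.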
Simulation model: We next build a simulation model to answer the same
questions and compare the results from the simulation model to those obtained
from the exact analysis. We have already described the input and the output random
variables and the relationship between the two. In a simulation model, we first
generate a random sample of size n of the input. So let $X_1, \ldots, X_n$ be a random
sample of size n of the demand drawn from a normal distribution with a mean of
980 and a standard deviation of 300. Given the sample of the input, we then obtain
a sample of the output by using the relation between the input and the output. Let
$Y_1, \ldots, Y_n$ be the sample of size n of the output (profit), where $Y_i = 250 X_i - 150{,}000$.
Finally, we use the sample of the output to estimate the performance measures of
interest. In particular, we use the sample mean
$$\bar{Y} = \frac{Y_1 + \cdots + Y_n}{n}$$
to estimate E[Y]. We use the sample variance
$$S_Y^2 = \frac{(Y_1 - \bar{Y})^2 + \cdots + (Y_n - \bar{Y})^2}{n - 1}$$
to estimate V[Y] (refer to Chap. 6 on basic inferences). We use the fraction of the output sample that is
smaller than zero to estimate the probability P(Y <= 0) and the fraction of the output
sample that is larger than 100,000 to estimate the probability P(Y >= 100,000). That
is, letting $I_i = 1$ if $Y_i \le 0$ and $I_i = 0$ otherwise, we estimate P(Y <= 0) using
$$\frac{I_1 + \cdots + I_n}{n}.$$
We estimate the probability P(Y >= 100,000) in a similar way.
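The estimators just described can be sketched in a few lines of Python using only the standard library. This is an illustration, not the @Risk implementation used in the chapter: the function name and the seed are arbitrary, and the numbers it produces will differ from those reported by @Risk because the underlying random number streams are not the same.

```python
import random
import statistics

def simulate_profit(n, seed=1, mean=980.0, sd=300.0):
    """Draw n demands from Normal(mean, sd) and map each to profit Y = 250 X - 150,000."""
    rng = random.Random(seed)
    demand = [rng.gauss(mean, sd) for _ in range(n)]
    return [250 * x - 150_000 for x in demand]

profits = simulate_profit(1000)

mean_profit = statistics.mean(profits)   # estimates E[Y]
sd_profit = statistics.stdev(profits)    # square root of the sample variance (n - 1 divisor)
p_loss = sum(y <= 0 for y in profits) / len(profits)        # estimates P(Y <= 0)
p_high = sum(y >= 100_000 for y in profits) / len(profits)  # estimates P(Y >= 100,000)

print(round(mean_profit), round(sd_profit), p_loss, p_high)
```

The two probability estimates are simply averages of 0/1 indicators, exactly as in the formula for $I_1, \ldots, I_n$.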
Implementing the simulation model: We describe how to implement the simu-
lation model in Excel using @Risk. @Risk is an Excel add-in that is part of the
Palisade DecisionTools Suite. We note that there are a number of other software
packages and programming languages that can be used to build simulation models.
While some of the implementation details vary, the modeling concepts remain
the same. We also note that Excel has some basic simulation capabilities and we
describe this in the appendix.
[Fig. 10.1 @Risk toolbar; labeled commands: Define Distributions, Simulation settings, Browse results]
Implementation in @Risk
Once the Palisade DecisionTools Suite is installed, @Risk can be launched from
the Start Menu in Windows. The @Risk toolbar is shown in Fig. 10.1. @Risk has
many features and we will only cover the basic ones in this chapter. The more
advanced features can be understood from the @Risk documentation.
• Step 1: We specify the input cell corresponding to demand using the “Define
Distributions” command from the @Risk toolbar (see Fig. 10.2). @Risk has
a number of in-built distributions including the binomial, exponential, normal,
Poisson, and uniform. It also allows for custom distributions through the
“Discrete Distribution” option. In our example, demand is normally distributed.
So we select the normal distribution from the list of options and specify the mean
(980) and the standard deviation (300) of the demand random variable. Note that
alternatively, the demand distribution can be directly specified in the input cell
by using the @Risk function RISKNORMAL(.).
• Step 2: Next, we specify the relation between the input and the output in the
output cell (Y = 250X – 150,000).
• Step 3: We use the “Add Output” command from the @Risk toolbar to indicate
that the cell corresponding to profit is the output cell (see Fig. 10.3).
• Step 4: Before we run the simulation, we click the “Simulation Settings” button.
Under the “General” tab, we set the number of iterations to be 1000 (see
Fig. 10.4). The number of iterations corresponds to the sample size. Here, the
sample size n = 1000. We later describe how we can determine the appropriate
sample size for our simulation model. Under the “Sampling” tab, we set the
sampling type to “Monte Carlo,” we use the default option for the generator
(Mersenne Twister) and fix the initial seed to 1 (see Fig. 10.5). We briefly
comment on the generator as well as the seed. There are different algorithms
to generate random numbers on a computer. @Risk has a number of such built-in
algorithms and the Mersenne Twister is one of them. We provide more
details in the appendix. The seed provides the starting key for the random number
generator. Briefly, the seed controls the random sample of the input that is
generated, and fixing the seed fixes the random sample of size n (here 1000) that
is drawn from the input distribution. It is useful to fix the seed initially so that it
becomes easier to test and debug the simulation model. We provide more details
in the appendix.
• Step 5: We are now ready to run the simulation. We do so by clicking “Start
Simulation.”
• Step 6: After @Risk runs the simulation, the raw data generated by @Risk
can be viewed by clicking on “Simulation data” (see Fig. 10.6). The column
corresponding to the input (demand) shows the 1000 values drawn from the
normal distribution with a mean of 980 and a standard deviation of 300. The
column corresponding to output (profit) shows the output sample, which is
obtained by applying the relation between the input and the output (Y = 250X –
150,000) to the input sample. The simulation data can be exported to an Excel
file for further analysis. @Risk also provides summary statistics and this can be
accessed, for example, through “Browse Results” (see Fig. 10.7).
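The role of the seed described in Step 4 can be seen in a small experiment. The sketch below uses Python's standard `random` module, whose core generator is also the Mersenne Twister; the function name is hypothetical, and the samples will not coincide with those drawn by @Risk for the same seed value.

```python
import random

def draw_demand(n, seed):
    """Draw n demand values from Normal(980, 300) using a generator started at `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(980, 300) for _ in range(n)]

same_a = draw_demand(5, seed=1)
same_b = draw_demand(5, seed=1)
other = draw_demand(5, seed=2)

# Same seed: identical sample. Different seed: a different sample.
print(same_a == same_b, same_a == other)
```

Fixing the seed therefore makes a run repeatable, which is what makes testing and debugging the model easier.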
Simulation Results
The summary statistics indicate that the sample mean of profit is Rs 94,448 and
the sample standard deviation of profit is Rs 75,457. Therefore, our estimate of the
retailer’s expected profit from the simulation model is Rs 94,448 and that of the
standard deviation is Rs 75,457. Interpreting the area under the histogram over
an interval as an estimate of the probability that the random variable falls
in that interval, we estimate P(Y <= 0) as 9.9% and P(Y >= 100,000) as 47.8%.
Therefore, based on the simulation model we think there is roughly a 10% chance
of not breaking even and a 48% chance of the profits exceeding Rs 100,000.
Fig. 10.5 Simulation settings: sampling type, generator, and initial seed
Table 10.1 Comparison of simulation results with the results from the exact analysis

| | Simulation model | Exact analysis |
|---|---|---|
| Expected profit (in Rs) | 94,448 | 95,000 |
| Standard deviation of profit (in Rs) | 75,457 | 75,000 |
| Sample size (no. of iterations) | 1000 | |
| 95% confidence interval | [89771, 99124] | |
| P(Y <= 0) | 9.9% | 10.3% |
| P(Y >= 100,000) | 47.8% | 47.3% |
Table 10.1 summarizes the results obtained from the simulation model and
compares them to the exact analysis. We observe that the simulation results are close but
not exact. The error comes from finite sampling. Moreover, the results are random
in that if we had taken a different random sample of size 1000 of the input, the
results would have been slightly different (see Table 10.2). Therefore, we assess
the accuracy of the simulation results by building a confidence interval. We use the
formula $\bar{x} \pm 1.96\, s/\sqrt{n}$, where $\bar{x}$ is the sample mean, s is the sample
standard deviation, and n is the sample size, to construct a 95% confidence interval for the expected profit
(see, e.g., Stine and Foster (2014)).
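As a quick check, the confidence interval in Table 10.1 can be recomputed from the reported sample mean and sample standard deviation. The helper below is a minimal sketch with a hypothetical name; it simply evaluates the formula above.

```python
import math

def mean_confidence_interval(sample_mean, sample_sd, n, z=1.96):
    """Return (low, high) for the interval sample_mean +/- z * sample_sd / sqrt(n)."""
    half_width = z * sample_sd / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width

# Values reported for the simulation run with 1000 iterations.
low, high = mean_confidence_interval(94_448, 75_457, 1000)
print(low, high)
```

The result agrees with the interval [89771, 99124] in the table up to rounding, and the interval contains the exact expected profit of Rs 95,000.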
Table 10.2 compares the simulation results obtained using different seeds. The
second column shows the simulation results when the initial seed is set to 1, and the
last column shows the simulation results when the initial seed is set to 2. Note that
the estimates change a little when we change the seed since we change the sample
of the input random variable.
Table 10.2 Comparison of simulation results for different values of the initial seed

| | Simulation model, initial seed = 1 | Simulation model, initial seed = 2 |
|---|---|---|
| Expected profit (in Rs) | 94,448 | 92,575 |
| Standard deviation of profit (in Rs) | 75,457 | 73,367 |
| Sample size (no. of iterations) | 1000 | 1000 |
| 95% confidence interval | [89771, 99124] | [88027, 97122] |
| P(Y <= 0) | 9.9% | 10.0% |
| P(Y >= 100,000) | 47.8% | 44.0% |
We note that the question as to which is the “right” seed is not meaningful since
the estimates always have sampling error associated with them. The more relevant
question to ask is regarding the sample size (or the number of iterations in @Risk)
since this determines the accuracy of the simulation results. Table 10.3 shows how
the accuracy of the simulation estimates changes with the sample size. We observe
that as the sample size increases, we obtain progressively more accurate estimates;
the sample mean of profit we obtain from the simulation model is closer to the
population mean (95,000) and the confidence intervals for the expected profit are
also narrower. This is a natural consequence of the law of large numbers and the
central limit theorem: as the sample size n increases, the sample mean gets more and more
concentrated around the population mean, with a standard error that shrinks in
proportion to 1/√n (see, e.g., Ross (2013)). As a result, we
get more accurate estimates as the sample size increases.
The natural question that then arises is: what should be the appropriate sample
size? This in general depends on the nature of the business question being answered
by the simulation model and so tends to be quite subjective. The main ideas come
from statistical sampling theory. We first fix a margin of error, that is, a target
half-width of the confidence interval that we are comfortable with. We then determine
the sample size so that the actual half-width of the confidence interval matches
the target. Therefore, if we let e be the target margin of error (for a 95%
confidence interval), then we determine the sample size n as
$$n = \left(\frac{1.96\, s_n}{e}\right)^2.$$
In the above equation, $s_n$ is the sample standard deviation, which in turn depends
on the sample size n. We break this dependence by running the simulation model for
a small number of iterations (say, 100 or 200) to obtain an estimate of the sample
standard deviation $s_n$ and use this estimate in the above formula to determine the
appropriate value of n. For example, we might use the sample standard deviation
corresponding to 100 iterations ($s_{100}$ = 79,561) from Table 10.3 as an approximation
to $s_n$ in the above equation. So if we have a margin of error e = 1000, the required
sample size n is
$$n = \left(\frac{1.96 \times 79{,}561}{1000}\right)^2 \approx 24{,}317.$$
We have used simulation so far simply as an evaluation tool to obtain the summary
statistics associated with a performance metric. We now build on this and add a
decision component to our model. In particular, we consider business settings where
we have to make decisions under uncertainty and see how we can use simulation to
inform decision making.
Suppose that we have to pick an action “a” from a range of possible alternatives
{a1 , . . . , aK }. In making this decision, we have a performance metric (output) in
mind and we would like to choose the action that optimizes this performance metric.
The output variable is affected not only by the action that we take (decisions) but
also by uncertain events (inputs). That is, if we let Y denote the output variable, we
have Y = f(X, a), where X represents the random input and f(.,.) is a function that
relates the output to our decisions as well as the random inputs. If the function f(.,.)
is nonlinear, then the mathematical analysis quickly becomes challenging. In such
cases, a simulation model can be a useful alternative.
The simulation approach remains quite similar to what we have discussed in
the previous section. The first step is to identify the decision variable, as well as
the input and the output random variables. We specify the range of the decision
variable, the distribution of the input random variable and the relation between the
output and the decision and the input variables. Once we do this, we evaluate the
outcomes associated with each possible action using simulation and then pick the
action that optimizes the output metric. That is, we pick an action $a_k$ from the list
of possible actions. We then generate a random sample of the input $X_1, \ldots, X_n$.
Given the action $a_k$ and the random sample of the input, we obtain a random sample
of the output using $Y_i = f(X_i, a_k)$. We use the random sample of the output to
estimate the performance metric of interest when we choose action $a_k$. We repeat this
process for each action in the list and pick the action that optimizes the performance
metric. Note that since the results of the simulation are random and approximate, the
previous statements regarding the interpretation of the simulation results continue
to hold.
We next build on the fashion retailer example to illustrate these steps.
A fashion retailer purchases a fashion product for Rs 250 and sells it for Rs 500.
The retailer has to place an order for the product before the start of the selling season
when the demand is uncertain. The demand for the product is normally distributed
with a mean of 980 and a standard deviation of 300. If the retailer is left with unsold
items at the end of the selling season, it disposes of them at a salvage value of Rs
100 per unit. The retailer also incurs a fixed cost of Rs 150,000, which is independent of the
sales volume. How many units of the product should the retailer order? Assume that
the retailer’s objective is to maximize its expected profit.
Solution
We first specify the decision variable, as well as the input and the output random
variables. Then we relate the output to the input and the decision variables.
• Step 1: Specify decision—stocking quantity. We let q denote the stocking
decision. The theoretical range for the stocking quantity is [0, ∞). However,
for practical reasons (minimum batch size, budget constraints, etc.) we may want
to impose lower and upper limits on the stocking quantity. Here, given that the
demand is normally distributed, we consider order quantities that are within one
standard deviation above and below the mean. So we will consider q ∈ {680,
730, . . . , 1280} (q increases in step size = 50). We note that we could have
considered an extended range and even a more refined range for the decision
variable. The trade-off is the increased solution time stemming from evaluating
the profit (output) at each possible value of the decision.
• Step 2: Specify input—demand. We denote demand by the random variable X. X
is normally distributed with a mean of 980 and a standard deviation of 300.
• Step 3: Specify output—profit. We denote profit by the random variable Y.
• Step 4: Relate the input and the decision to the output. The unit profit margin
is Rs. 250 (the selling price of Rs 500 less the purchase cost of Rs 250). We have
Sales = min(X, q) since we cannot sell more than what
is demanded (X) or what we have in stock (q). It follows that the number of
unsold items is the difference between the stocking quantity and the sales. That
is, Unsold = q – Sales. We have Revenue = 500 Sales + 100 Unsold, where the
first term captures the revenues from sales at the full price while the second term
captures the salvage value of the unsold items. On the other hand, Cost = 250
q + 150,000. Therefore, we have that the profit Y = Revenue – Cost = 500
Sales + 100 Unsold – 250q – 150,000.
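The relation between demand, stocking quantity, and profit can be written as a small function. This is a sketch under the problem data stated in the example (selling price Rs 500, purchase cost Rs 250, salvage value Rs 100, fixed cost Rs 150,000); the function and parameter names are chosen for illustration.

```python
def newsvendor_profit(demand, q, price=500, unit_cost=250, salvage=100, fixed_cost=150_000):
    """Profit for one demand realization when q units are stocked."""
    sales = min(demand, q)   # cannot sell more than demand or stock
    unsold = q - sales       # leftover units are salvaged
    revenue = price * sales + salvage * unsold
    cost = unit_cost * q + fixed_cost
    return revenue - cost

profit_stockout = newsvendor_profit(demand=1200, q=1000)  # demand exceeds stock
profit_leftover = newsvendor_profit(demand=800, q=1000)   # 200 units left unsold
print(profit_stockout, profit_leftover)
```

The two calls show the asymmetry in the trade-off: when demand exceeds the stock, sales are capped at q, and when demand falls short, each leftover unit recovers only its salvage value.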
Now, we proceed to build a simulation model to determine the optimal stocking
quantity. We note that the problem we are considering is an example of a newsven-
dor problem, which is a very well-studied model in the operations management
literature and it is possible to obtain an analytical expression for the optimal ordering
quantity (see, e.g., Porteus (2002)). We do not dwell on the exact mathematical
analysis here and instead focus on the simulation approach. An advantage of the
simulation model is that it can be easily adapted to the case where the retailer may
have to manage a portfolio of products as well as accommodate a number of other
business constraints.
Simulation Model
We have already described how the decision variable together with the random
input affects the output. We evaluate the expected profit corresponding to each
stocking level in the list {680, 730, . . . , 1280} and pick the stocking level that
achieves the highest expected profit. So we pick a stocking level $q_k$ from the above
list. Then we generate a random sample of size n of the input. So, let $X_1, \ldots, X_n$ be
a random sample of size n of the demand drawn from a normal distribution with a
mean of 980 and a standard deviation of 300. Given the sample of the input, we then
obtain a sample of the output by using the relation between the decision, the input,
and the output. Let $Y_1, \ldots, Y_n$ be the sample of size n of the output (profit), where,
given the selling price of Rs 500, the salvage value of Rs 100, the purchase cost of
Rs 250, and the fixed cost of Rs 150,000,
$$Y_i = 500 \min(X_i, q_k) + 100 \left(q_k - \min(X_i, q_k)\right) - 250\, q_k - 150{,}000.$$
Finally, we use the sample of the output to estimate the expected profit corresponding
to ordering $q_k$ units. We repeat this process for all the possible stocking
levels in the list to determine the optimal decision.
Implementing the simulation model in @Risk: We specify the decision cell
corresponding to the stocking quantity using the RISKSIMTABLE(.) function. The
argument to the RISKSIMTABLE(.) function is the list of stocking quantities that
we are considering (see Fig. 10.8). The remaining steps are similar to the previous
simulation model: we specify the input cell corresponding to demand using the
RISKNORMAL(.) function and specify the output cell corresponding to profit using
the RISKOUTPUT(.) function. We link the decision variable and the input to the
output using the relation described above.
Before we run the simulation model, we specify the “Number of Iterations” in the
“Simulation Settings” tab. We set the “Number of Iterations” to be 1000 as before.
Now we would like to generate an input sample of size 1000 for each possible
value of the decision variable. There are 13 possible stocking decisions that we are
considering ({680, . . . , 1280}) and so we would like an input sample of size 1000
to be generated 13 times, one for each stocking decision in the list. We specify this
by setting the “Number of Simulations” to be 13 (see Fig. 10.9). In the “Sampling”
tab, we set the “Sampling Type,” “Generator,” and “Initial Seed” as before. We also
set “Multiple Simulations” to use the same seed (see Fig. 10.10). This ensures that
the same input sample of size 1000 is used to evaluate the profits corresponding
to all of the stocking decisions. That is, the same underlying uncertainty drives the
outcomes associated with the different actions. As a result, it is more likely that any
differences in the output (expected profit) associated with the different decisions
(stocking quantity) are statistically significant. The benefit of using common random
numbers can be formalized mathematically (see, e.g., Ross (2013)).
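The same experiment can be sketched in Python to make the role of common random numbers concrete. This is an illustration rather than a reproduction of the @Risk run: the function names and seed are assumptions, the demand sample is drawn once and reused for every stocking quantity (mirroring the "same seed" setting), negative normal draws are not truncated, and the estimates will not match the @Risk output exactly.

```python
import random
import statistics

def profit(demand, q):
    """Profit for one demand realization, using the data of the fashion retailer example."""
    sales = min(demand, q)
    return 500 * sales + 100 * (q - sales) - 250 * q - 150_000

def expected_profit_by_quantity(quantities, n=1000, seed=1):
    """Estimate expected profit for each q using one shared demand sample."""
    rng = random.Random(seed)
    demand = [rng.gauss(980, 300) for _ in range(n)]  # drawn once, reused for every q
    return {q: statistics.mean([profit(x, q) for x in demand]) for q in quantities}

quantities = list(range(680, 1281, 50))  # 680, 730, ..., 1280
estimates = expected_profit_by_quantity(quantities)
best_q = max(estimates, key=estimates.get)

for q in quantities:
    print(q, round(estimates[q]))
print("best stocking quantity on the grid:", best_q)
```

Because every stocking quantity is evaluated on the same demand values, differences between the estimated profits reflect the decisions rather than differences between samples.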
Fig. 10.8 Determining the optimal stocking quantity using simulation. The decision cell is
specified using the RISKSIMTABLE(.) function and the argument to the function is the list of
possible stocking quantities described in the list F2:F14 in the Excel spreadsheet
After running the simulation, we view the simulation results by clicking on the
“Summary” button. The table summarizes the simulation statistics for the different
values of the ordering quantity (see Fig. 10.11). It indicates that the expected profit
is maximized by ordering 1080 units.
We make two observations regarding the simulation results. First, the expected
profit for each stocking level is estimated as the sample average of a sample of size
1000. It remains to be verified that the results are statistically significant. That is, are
the differences we see in the samples representative of a corresponding difference in
the populations? This question can be answered using standard statistical tests (see,
e.g., Stine and Foster (2014)). Second, we note that ordering 1080 units is the best
choice from the list {680, . . . , 1280}. It is possible that we could further increase
profits if we were not restricted to these ordering levels. We can evaluate the profits
on a more refined grid of ordering levels to check if this is indeed the case. The
trade-off is the increased computation time stemming from evaluating the profit for
a larger number of stocking levels.
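The grid search with common random numbers can be sketched in a few lines of Python. This is an illustration only, not the @Risk model: the demand distribution (normal, mean 980, standard deviation 300), the fixed cost of Rs 150,000, and the grid of stocking levels come from the text, whereas the unit price, unit cost, and salvage value below are hypothetical placeholders, as are the seed, the truncation of negative demand draws at zero, and the function name `simulate_profits`.

```python
import random

def simulate_profits(stock_levels, n=1000, seed=12345,
                     price=500.0, cost=250.0, salvage=100.0, fixed=150_000.0):
    """Estimate the expected profit of each stocking level from one shared demand sample.

    price, cost and salvage are hypothetical placeholders, not values from the text.
    """
    rng = random.Random(seed)
    # A single demand sample is reused for every stocking level (common random numbers).
    # Negative draws are truncated at zero; this is an assumption of the sketch.
    demand = [max(rng.gauss(980, 300), 0.0) for _ in range(n)]
    results = {}
    for q in stock_levels:
        total = 0.0
        for x in demand:
            sales = min(x, q)      # cannot sell more than what is stocked
            unsold = q - sales     # leftover units are salvaged
            total += price * sales + salvage * unsold - cost * q - fixed
        results[q] = total / n     # sample average estimates the expected profit
    return results

levels = list(range(680, 1281, 50))   # the 13 candidate stocking quantities
profits = simulate_profits(levels)
best_q = max(profits, key=profits.get)
```

Because every stocking level is evaluated on the same demand sample, differences between the entries of `profits` reflect the decisions rather than sampling noise in the inputs. A finer grid can be tried by passing a different `stock_levels` list, at the cost of more computation.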
Our analysis so far was based on the assumption that the retailer was interested
in maximizing its expected profit for a single product. We now extend the model in
a couple of directions.
Suppose that the retailer also cares about lost sales opportunities. That is, if the retailer
runs out of inventory of the product then some customers who visit the store
would be unable to purchase the product. If stockouts occur frequently and a large
number of customers do not find the product available, the customers may take their
business elsewhere and this would impact the profits of the retailer in the long run.
Consequently, the retailer would also like to factor in lost demand when making the
stocking decisions. Suppose that the retailer would like to ensure that the expected
percentage unmet demand is no larger than 5%. How many units of the product
should the retailer order to maximize expected profits while ensuring that
the stockout constraint is satisfied?
Solution
We can easily modify our simulation model to answer this question. The decision
variable and the input random variable remain the same. In addition to the profit
metric, we add the expected percentage unmet demand as a second metric in our
simulation model. We define the percentage unmet demand as max(X – q, 0)/X,
where X is the demand and q is the stocking level. Note that there is unmet demand
only when X > q and so X – q > 0. We add the percentage unmet demand as a second
output in our spreadsheet model (using the RISKOUTPUT(.) function).
Fig. 10.10 Specifying the seed used for the different simulations. We set multiple simulations to
use the same seed
Table 10.4 Expected profit and expected % unmet demand as a function of the ordering quantity
Stocking quantity    Expected profit (Rs)    Exp. % unmet demand
 680                 10,102                  28.94
 730                 19,229                  24.95
 780                 27,351                  21.31
 830                 34,480                  17.98
 880                 40,645                  14.94
 930                 45,632                  12.24
 980                 49,385                   9.87
1030                 51,838                   7.83
1080                 53,012                   6.11
1130                 52,792                   4.70
1180                 51,487                   3.54
1230                 49,082                   2.61
1280                 45,836                   1.85
Table 10.4 describes the summary results we obtain after running the simulation
model again. The first column gives the stocking level, the second column gives
the corresponding expected profit while the last column gives the expected percent
unmet demand. We notice that the previous stocking level of 1080, which maximized
the expected profit, results in an expected percent unmet demand of around 6%,
violating the 5% threshold. The feasible stocking levels, that is, those which satisfy
the 5% stockout constraint, are {1130, 1180, 1230, 1280}. Among these stocking
levels, the one which maximizes the expected profit is 1130. Therefore, based on
our simulation model we should order 1130 units of the product, which results in an
expected profit of around Rs 52,792 and the expected percent unmet demand being
around 4.7%.
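The selection step described above amounts to filtering the rows of Table 10.4 by the stockout constraint and then taking the most profitable of the remaining rows. A minimal sketch follows; the figures are copied from Table 10.4, while the helper name `best_feasible` and the tuple layout are choices made for this illustration.

```python
# (stocking quantity, expected profit in Rs, expected % unmet demand), from Table 10.4
table_10_4 = [
    (680, 10102, 28.94), (730, 19229, 24.95), (780, 27351, 21.31),
    (830, 34480, 17.98), (880, 40645, 14.94), (930, 45632, 12.24),
    (980, 49385, 9.87), (1030, 51838, 7.83), (1080, 53012, 6.11),
    (1130, 52792, 4.70), (1180, 51487, 3.54), (1230, 49082, 2.61),
    (1280, 45836, 1.85),
]

def best_feasible(rows, max_unmet_pct):
    """Return the row with the highest expected profit among those meeting the constraint."""
    feasible = [row for row in rows if row[2] <= max_unmet_pct]
    if not feasible:
        return None  # no stocking level satisfies the service constraint
    return max(feasible, key=lambda row: row[1])

unconstrained = max(table_10_4, key=lambda row: row[1])  # ignores unmet demand
constrained = best_feasible(table_10_4, max_unmet_pct=5.0)
```

Changing `max_unmet_pct` shows how the recommended stocking level moves as the service requirement is tightened or relaxed.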
We now consider the case where the retailer sells multiple products that are
potential substitutes. That is, if a customer does not find the product that she is
looking for on the shelf, she may switch and buy a product that is a close substitute.
To keep things simple, let us assume that the retailer sells two products during the
selling season. The base (primary) demands for the two products are independent,
normally distributed random variables. The cost and demand characteristics of the
two products are described in Table 10.5. In addition, the retailer incurs a fixed
cost of Rs 150,000 that is independent of the sales volumes. The two products
are substitutes in that if one product is stocked out then some of the customers
interested in purchasing that product might switch over and purchase the other
product (provided it is in stock). If product 1 is stocked out, then 70% of the unmet
demand for that product shifts to product 2. On the other hand, if product 2 is stocked
out, then 30% of the unmet demand for that product shifts to product 1. Therefore, there is
a secondary demand stream for each product that is created when the other product
is stocked out. What should be the stocking levels of the two products for the retailer
to maximize its expected profits?
Solution
We determine the decision variable, as well as the input and the output random
variables. We then describe the relation between the output and the decision and the
input variables.
• Step 1: Specify decisions—stocking quantities of the two products. We let q1
denote the stocking level of product 1 and q2 denote the stocking level of product
2. For each product, we consider stocking levels that are within one standard
deviation of the mean demand. Therefore, we consider q1 ∈ {680, 730, . . . , 1280}
and q2 ∈ {1500, 1550, . . . , 2500}. We again note that it is possible to work with
an expanded range of values and also consider a more refined set of grid points,
at the expense of greater computational effort.
• Step 2: Specify input—demands for the two products. We let X1 denote the
primary demand random variable for product 1 and X2 denote the primary
demand random variable for product 2. X1 is normally distributed with a mean of
980 and a standard deviation of 300, while X2 is normally distributed with a mean
of 2000 and a standard deviation of 500. Furthermore, X1 and X2 are independent
random variables.
• Step 3: Specify output—profit. We denote profit by the random variable Y.
• Step 4: Relate the input and the decision to the output. We have Primary
Salesi = min(Xi, qi) for i = 1, 2, where primary sales refer to the sales
generated from the primary demand for that product. The remaining inventory
of product i is therefore Inventoryi = qi − Primary Salesi. On the other hand,
the portion of the demand that cannot be satisfied from the on-hand inventory is
Unmet Demandi = Xi − Primary Salesi. Now, if there is unmet demand for
product 2, then 30% of that is channeled to product 1. Therefore, the secondary
sales of product 1 is the smaller of the remaining inventory of product 1 (which
remains after satisfying the primary demand for product 1) and the secondary
demand for product 1 created by the stockout of product 2. That is,
Secondary Sales1 = min(Inventory1, 0.3 * Unmet Demand2). By following a similar
line of reasoning, the secondary sales of product 2 is
Secondary Sales2 = min(Inventory2, 0.7 * Unmet Demand1). The
number of unsold items of product i is, therefore, Unsoldi = qi − Primary
Salesi − Secondary Salesi.
Tallying up the revenues and costs, we have that the total revenue = 250 * Primary
Sales1 + 250 * Secondary Sales1 + 100 * Unsold1 + 200 * Primary
Sales2 + 200 * Secondary Sales2 + 50 * Unsold2. The total
cost = 250q1 + 200q2 + 150,000. We obtain the profit, Y, as the difference
between the total revenue and the total cost.
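The flow of units in Step 4 can be traced for a single demand scenario with a short Python sketch. It covers only the quantities (primary sales, remaining inventory, unmet demand, secondary sales, and unsold units); the monetary coefficients are left out because they depend on Table 10.5. The function name `substitution_flows`, the dictionary layout, and the example demands are assumptions made for illustration.

```python
def substitution_flows(x1, x2, q1, q2, shift_1_to_2=0.7, shift_2_to_1=0.3):
    """Trace sales and leftovers of two substitutable products for one scenario.

    x1, x2 are the primary demands; q1, q2 are the stocking levels.
    Each tuple in the result lists product 1 first and product 2 second.
    """
    primary = (min(x1, q1), min(x2, q2))            # sales from own demand
    inventory = (q1 - primary[0], q2 - primary[1])  # stock left after primary sales
    unmet = (x1 - primary[0], x2 - primary[1])      # demand not served from own stock
    secondary = (
        min(inventory[0], shift_2_to_1 * unmet[1]),  # product 2 customers switching to 1
        min(inventory[1], shift_1_to_2 * unmet[0]),  # product 1 customers switching to 2
    )
    unsold = (inventory[0] - secondary[0], inventory[1] - secondary[1])
    return {
        "primary": primary,
        "inventory": inventory,
        "unmet": unmet,
        "secondary": secondary,
        "unsold": unsold,
    }

# Hypothetical scenario: product 1 stocks out, product 2 has 100 units to spare.
flows = substitution_flows(x1=1000, x2=1500, q1=800, q2=1600)
```

In this scenario 200 units of product 1 demand go unmet, 70% of them (140 units) look for product 2, and only the 100 spare units of product 2 can be sold to them. Averaging a profit function of these quantities over many sampled scenarios gives the expected profit for a given pair (q1, q2).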
Implementing the Simulation Model in @Risk
Now we have two decision variables q1 and q2 , where q1 takes values in the
range {680, 730, . . . , 1280} and q2 takes values in the range {1500, 1550, . . . , 2500}.
So there are 273 (13 * 21) possible combinations of q1 and q2 that we have to
consider. While it is possible to implement the model using the RISKSIMTABLE(.)
function, this would involve creating a list of all the possible combinations of the
two decision variables and can be cumbersome. An alternative way to implement
the simulation model is using the RISK Optimizer function (see Fig. 10.12). Under
“Model Definition” we specify the optimization goal (maximize) and the metric that
is optimized (mean value of profit). Under “Adjustable Cell Ranges,” we specify
the decision variables and the range of values that they can take (between the
minimum and the maximum stocking levels, in increments of 50, see Fig. 10.13).
We start the optimization routine by clicking on the “Start” button under the RISK
Optimizer tab (see Fig. 10.14) and after @Risk has finished, we obtain the results
from the “Optimization Summary” report (see Figs. 10.14 and 10.15). From the
optimization summary report, we see that the optimal stocking level for product
1 is 1230 units and the optimal stocking level for product 2 is 1600 units. The
corresponding expected profit is Rs 228,724. It is interesting to note that the optimal
stocking levels of the two products when considered together are different from the
Fig. 10.13 Specifying the optimization objective and the decision variables in RISK Optimizer
4 Conclusion
Daniel operates a successful civil work firm, Coloring the World. He has been
running this decades-old family business for the last 10 years. Coloring the
World provides painting and civil work services to commercial buildings and large
apartments. Daniel has a dedicated sales team which is very active in identifying
construction projects that may require the firm’s services. Another sales team
follows up on business from established customers who may
require refurbishment or painting work. Both teams are quite active in generating
leads. He also has a robust customer relationship system in place to engage with
existing clients, which helps in building long-term associations with customers.
An existing customer, Feel Good Fabrics, a maker of cotton and linen cloth,
has sent Coloring the World an RFP (Request for Proposal) to paint its entire
manufacturing plant. Though Daniel has done various small refurbishment jobs
for this client, he knows that this is a big project that can earn good
recognition and lead to a long-term association with the customer. From his earlier
experience, Daniel knows that Brian Painters has also been approached for this
contract and suspects that Whitney-White colors (W&W) is also trying hard to get
empaneled on Feel Good Fabrics’ vendors list. Daniel does not want to lose this
opportunity to create an impactful relationship with Feel Good’s commercial and
operation team.
Daniel has competed with Brian and W&W for many other projects and believes
he can more or less estimate the bidding strategies of these competitors. Assuming
that these competing firms are bidding for this contract, Daniel would like to develop
a bid that offers him a good shot at winning, but also does not result in a loss on the
contract since the firm has many expenses including labor, paints, and materials.
Daniel estimates that the bid from Brian Painters could be anywhere between $450,000
and $575,000. As for Whitney-White colors, Daniel predicts the bid to be as low as
$425,000 or as high as $625,000, but he thinks $550,000 is most likely. If Daniel
bids too high, one of the competitors is likely to win the contract and Daniel’s
company will get nothing. If, on the other hand, he bids too low, he will probably
win the contract but may have to settle for little or no profit, even a possible loss.
Due to the complexity of the plant structure of Feel Good Fabrics, Daniel, in
consultation with his service department, estimates the direct cost to service the
client at $300,000. Realizing that the costs are actually uncertain, Daniel takes this
number to be the expected value of the costs and thinks that the actual cost will be
normally distributed around this mean value with a standard deviation of $25,000.
Preparing the bid and the budget estimate costs Daniel $10,000, since it includes
in-person visits and technical test clearance by the Feel Good Fabrics team. How
much should Daniel bid in order to maximize his expected profit?
Solution
The underlying source of uncertainty is in the bids of the two competitors and
in the direct costs. The decision is how much to bid for the project. Note that
Coloring the World wins the project if they bid the lowest. Given the bids of the
competitors we also observe that Coloring the World will win the project for sure if
they bid lower than $425,000. So it does not make sense to bid anything lower than
$425,000 since it does not further improve their chances of winning the contract. On
the other hand, bidding lower shrinks their profit margins. Coloring the World will
also never win the project if they bid more than $575,000. Therefore, the optimal
bidding strategy must lie between $425,000 and $575,000.
Simulation Model:
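One way to set up such a model outside a spreadsheet is sketched below in Python. It is an illustration under stated assumptions rather than the chapter's own model: Brian Painters' bid is assumed to be uniform between $450,000 and $575,000, the W&W bid is assumed to be triangular with minimum $425,000, mode $550,000, and maximum $625,000, the direct cost is normal with mean $300,000 and standard deviation $25,000, and the $10,000 preparation cost is treated as incurred whether or not the bid wins. The function name, seed, sample size, and $5,000 bid grid are also choices made for this sketch.

```python
import random

def expected_profit(bid, n=10_000, seed=2024):
    """Estimate Daniel's expected profit for a given bid by simulation."""
    rng = random.Random(seed)  # same seed for every bid: common random numbers
    total = 0.0
    for _ in range(n):
        brian = rng.uniform(450_000, 575_000)            # assumed uniform
        ww = rng.triangular(425_000, 625_000, 550_000)   # arguments: low, high, mode
        cost = rng.gauss(300_000, 25_000)                # uncertain direct cost
        wins = bid < min(brian, ww)                      # the lowest bid wins
        margin = bid - cost if wins else 0.0
        total += margin - 10_000                         # preparation cost is sunk
    return total / n

bids = range(425_000, 575_001, 5_000)
estimates = {b: expected_profit(b) for b in bids}
best_bid = max(estimates, key=estimates.get)
```

Using the same seed for every candidate bid means that all bids are compared against the same simulated competitor bids and costs, in the spirit of the common random numbers idea used earlier in the chapter.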
All the datasets, code, and other material referred to in this section are available at
www.allaboutanalytics.net.
• Data 10.1: Priceline_Hotelbids.xlsx
• Data 10.2: Watch_bids.xlsx
Exercise Caselets
and friends than as a famous architect. He has a collection of more than a hundred
antique watches representing different cultures and manufacturing styles. He keeps
looking for a new variety of mechanical watch to add to his collection during his
travel across the globe. He has even subscribed to bidding websites that regularly
add a variety of new watches to their product pages. He has been successful many
a time in winning auctions but unfortunately at very high prices. Chris believes in
paying the right price for the product and an added extra based on his emotional
attachment to the product.
Over time, Chris has developed a bidding strategy that combines bidding at
selective time points and at a specific price ratio considering the existing bids. He
has started winning bids at lower prices in recent times, but he is a little disappointed
as his new strategy did not work out with a few watches that he wanted to win
desperately. Thankfully, though, the new strategy has kept him from paying high prices
on the winning deals.
Chris also notices that he is investing a lot of time and effort following up on
multiple websites and placing bids. Some websites limit the number of bids
per customer, and he runs out of bids very quickly. Other
websites charge customers per bid in order to discourage them from
placing multiple bids with small differences. His experience tells him that the winning bid
ranges between 1.5× and 3× of the first few bid prices and follows a “regular” price
trajectory that can help in estimating the final price. The price trajectories of the 100
products are provided on the book’s website in “Watch_bids.xlsx”. The table contains
the number of bids and the highest bid so far at 36 h before closing, 24 h, 12 h, and
so on.
He wishes to develop a new strategy to place a single bid or a maximum of two
bids at specific time points rather than following up multiple times and wasting time
and effort monitoring the outcome.
• When should Chris bid in order to secure the deal?
• At what price ratio should Chris bid to secure the deal?
• How can he factor the extra value of his emotional attachment into the bid in terms of timing
and price?
turers. They decided that they will contract the cloth and garment manufacturing
to a quality and trusted supplier and sell the product online through e-commerce
websites.
Under the women entrepreneurship scheme, Monika decided to set up her own
plant to manufacture the fabric. Professor Akihiko approached her ex-students
who had expertise in winter-wear manufacturing and designing. A few of them
showed interest and also bid for the contract. Monika and Akihiko decided to
contract the manufacturing of jackets to a third party. They outsourced the product
manufacturing to Vimal Jain, a close friend of Monika and trusted student of
Professor Akihiko, considering his family experience in textile manufacturing.
Vimal proposed two designs—a Sports jacket and a Trendy jacket using the special
fabric. Though he also proposed to mix this specific thread with another material
in order to cater to different geographic needs, Monika and Akihiko rejected the
idea and decided to target the niche segment of woolen clothes. Vimal agreed to
design, manufacture, and supply the two types of jackets in four sizes (small,
medium, large, and extra-large) to the retail locations.
The product was initially sold only through an online channel. However, looking
at the increasing demand, Monika and Akihiko decided to go offline and partnered
with a large retail chain that had store presence across the world. The retailer
negotiated a 20% profit margin on the items sold with a condition to return unsold
products at 80% of the purchase price. Since the fabric and the garments are
manufactured at different locations and by different manufacturers, it is essential
to estimate the demand in advance to optimize inventory at various stages. Also, as
this is a seasonal and/or fashion product, the excess inventory of unsold products
may lead to deep discounts. The product demand also depends on the severity of
weather.
They requested Chris Alfo, head of operations with the partner retailer, to estimate
the demand based on his experience with comparable products. Note: The numbers
in the demand estimate table are in thousands (Table 10.6).
Monika estimated the manufacturing cost of one jacket at $125 and fixed 40%
markup when selling to the retailer. She thought that retailers would add 20% profit
margin. Monika also found that unsold products can be sold at 50% of the cost
in deep-discount outlets. She knew that all the stock has to be cleared within the
same year considering the contemporary nature of fashion. Looking at historical
weather reports, she estimated the probability of mild winter at 0.7 and cold at 0.3.
The customers may switch between the product designs based on availability of
the product. In case of extreme demand and unavailability, if winter is mild, the
probability to switch from one design to another is 0.5 while in case of cold winter
it is 0.9. Monika is planning how much to manufacture every year in order to procure
raw material and finalize the manufacturing contract with Vimal. She has to estimate
demand in such a way that she neither ends up with too much unsold product
nor loses the opportunity to sell more.
Priceline popularized the name-your-own-price (NYOP) model. For example, on its
website,3 it advertises “For Deeper Discounts Name Your Own Price®.” In this
model, called a reverse auction, the buyer specifies the product or service and names
a price at which the buyer is willing to purchase the product. On the other side,
sellers offer products or services at two or more prices and also the number of
products that are available. For example, a room in a three-star hotel on September
21st in downtown Manhattan for a one-night stay could be a product. Several hotels
would offer, say, anywhere from 1 to 5 rooms at rates such as $240 and $325. When
a bid is made, the market-maker (in this case Priceline) picks a seller (a hotel that
has made rooms available) at random and sees if there is a room that is available
at a rate that is just lower than the bidder’s price. If no such price is available, the
market-maker chooses another seller, etc. More details about the model are found in
Anderson and Wilson’s article.4
The NYOP mechanism is fairly complex because it involves buyers, sellers, and
the intermediary who are acting with limited information about the actual market
conditions. For example, even the intermediary is not aware of the exact supply
situation. How does this model benefit everyone? The buyer benefits because it
creates a haggle-free environment: In case the bid fails the buyer cannot bid for
a day, thus forcing the bidder to either reveal the true reservation price or be willing
to forgo an opportunity to purchase in order to learn more about the model and act
with delay. The seller benefits because the model avoids direct price competition.
The intermediary benefits because it gets to keep the difference between the bidder’s
and the seller’s price.
Barsing is a boutique hotel situated in downtown Manhattan. It has 100 rooms,
out of which 65 are more or less indistinguishable with regard to size and amenities.
It often finds some of its rooms remain unsold even during peak seasons due to the
relative newness of the hotel and its small size. The variable cost of a room-night
is around $75. Sometimes this cost may increase or decrease by 10% depending
on the amount of cleaning and preparation necessary. Richard Foster who manages
Barsing started offering rooms on an NYOP program called HotelsManhattan. He
classifies Barsing as a mid-range hotel with a family atmosphere. He feels that the
program is intriguing and requires constant tinkering to get the price right.
Foster’s assistant, Sarah, was tasked with reviewing the data and recommending
an automatic approach to making rooms available on HotelsManhattan. Historical
data on the number of bids made on the past 40 weekdays are available to her; see
“Priceline_Hotelbids.xlsx” on the book’s website. Typically, 4–5 rooms are available to
sell using the NYOP program.
1. Assume that the bidders are fixed in number. If Sarah uses one price, what price
maximizes the expected profit if four rooms are available?
2. Assume that the number of bidders is random. What is the optimal single price?
3. Assume that the number of bidders is random and Sarah can specify two prices.
How should she set those prices to maximize her expected profit from four
rooms?
Fig. 10.16 Histogram of the sequence generated by the linear congruential generator with
a = 16,807, m = 2,147,483,647, c = 0, and x0 = 33,554,432
The LCG algorithm generates a sequence of numbers that have the appearance
of coming from a uniform distribution. It is possible to build on this to generate
pseudo-random numbers from other probability distributions (both discrete as well
as continuous distributions). We refer the reader to Ross (2013) and Law (2014) for
more details on the algorithms and their properties.
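For concreteness, a linear congruential generator with the parameters quoted in the caption of Fig. 10.16 can be written in a few lines of Python. The sketch assumes the standard recursion x_{n+1} = (a * x_n + c) mod m, with each term divided by m to obtain a number between 0 and 1; the function name `lcg` and the ten-bin tally are choices made for this illustration.

```python
def lcg(a=16807, c=0, m=2_147_483_647, seed=33_554_432):
    """Yield pseudo-random numbers in (0, 1) from a linear congruential generator."""
    x = seed
    while True:
        x = (a * x + c) % m   # next integer state
        yield x / m           # scale to the unit interval

gen = lcg()
sample = [next(gen) for _ in range(1000)]

# Tally the sample into ten equal-width bins, as a histogram would.
counts = [0] * 10
for u in sample:
    counts[int(u * 10)] += 1
```

With c = 0 and m prime, the state never reaches zero for a nonzero seed, so every generated value lies strictly between 0 and 1. Roughly equal bin counts are what one would expect if the sequence resembles draws from a uniform distribution.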
nature of the data set (discrete or continuous) as well as additional details regarding
the range of values it can take (minimum, maximum values), we run the Distribution
Fitting tool. This gives us a range of distributions and their corresponding fit values.
We can broadly think of the fit values as measuring the error between the observed
data and the hypothesized distribution and a smaller fit value indicates a better fit in
general. There are different goodness-of-fit tests available. The fit values as well as
the relative ranking of the distributions in terms of their fit can vary depending on
the test that is used. We refer the reader to Ross (2013) and Law (2014) for more
details regarding the different goodness-of-fit tests and when a given test is more
applicable (Fig. 10.18).
Excel has some basic simulation capabilities. The RAND(.) function generates
a (pseudo) random number from the uniform distribution between 0 and 1. The
RANDBETWEEN(.,.) function takes two arguments a and b, and generates an
integer that is uniformly distributed between a and b. There are methods that can use
this sequence of uniform random numbers as an input to generate sequences from
other probability distributions (see Ross (2013) and Law (2014) for more details). A
limitation of using the RAND(.) and RANDBETWEEN(.,.) functions is that it is not
possible to fix the initial seed and so it is not possible to easily replicate the results
of the simulation.
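Both points, transforming uniform random numbers into another distribution and fixing the seed so that results can be replicated, can be illustrated with the Python standard library. The sketch below uses the inverse transform method, one of several such methods: a uniform number u is mapped to the value whose cumulative probability under the target distribution equals u. The function name `normal_sample`, the seed values, and the choice of the normal distribution with mean 980 and standard deviation 300 are assumptions made for illustration.

```python
import random
from statistics import NormalDist

def normal_sample(n, mean, sd, seed):
    """Draw n normal variates by applying the inverse CDF to seeded uniforms."""
    rng = random.Random(seed)     # fixed seed gives a replicable stream of uniforms
    dist = NormalDist(mean, sd)
    draws = []
    while len(draws) < n:
        u = rng.random()          # uniform on [0, 1)
        if u > 0.0:               # inv_cdf is defined only on the open interval (0, 1)
            draws.append(dist.inv_cdf(u))
    return draws

first_run = normal_sample(1000, 980, 300, seed=42)
second_run = normal_sample(1000, 980, 300, seed=42)   # identical to first_run
other_run = normal_sample(1000, 980, 300, seed=43)    # a different sample
```

Rerunning with the same seed reproduces the sample exactly, which is the replicability that the RAND(.) and RANDBETWEEN(.,.) functions do not offer.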
As mentioned, there are a number of other packages and programming languages
that can be used to build simulation models. For example, the code snippet below
implements the fashion retailer simulation described in Example 10.1 in R:
Sales <- rnorm(1000, 980, 300)
Profit <- 250 * Sales - 150000
The first line generates a random input sample of size 1000 from the normal
distribution with a mean of 980 and a standard deviation of 300. The second line
generates a random sample of the output (profit) by using the random sample of the
input and the relation between the profit and the sales. Note that the seed as well as
the random number generator can be specified in R; refer to the R documentation.
References
Anderson, C. K., & Wilson, J. G. (2011). Name-your-own price auction mechanisms – Modeling
and future implications. Journal of Revenue and Pricing Management, 10(1), 32–39.
https://doi.org/10.1057/rpm.2010.46.
Davison, M. (2014). Quantitative finance: A simulation-based approach using Excel. London:
Chapman and Hall.
GE Look ahead. (2015). The digital twin: Could this be the 21st-century approach to productivity
enhancements. Retrieved May 21, 2017, from http://gelookahead.economist.com/digital-twin/.
Holland, C., Levis, J., Nuggehalli, R., Santilli, B., & Winters, J. (2017). UPS optimizes delivery
routes. Interfaces, 47(1), 8–23.
Jian, N., Freund, D., Wiberg, H., & Henderson, S. (2016). Simulation optimization for a large-scale
bike-sharing system. In T. Roeder, P. Frazier, R. Szechtman, E. Zhou, T. Huschka, & S. Chick
(Eds.), Proceedings of the 2016 winter simulation conference.
Law, A. (2014). Simulation modeling and analysis. McGraw-Hill Series in Industrial Engineering
and Management.
Porteus, E. (2002). Foundations of stochastic inventory theory. Stanford, CA: Stanford
Business Books.
Ross, S. (2013). Simulation. Amsterdam: Elsevier.
Stine, R., & Foster, D. (2014). Statistics for business decision making and analysis. London:
Pearson.
Technology Review. (2017). September 2017 edition.
Chapter 11
Introduction to Optimization
Milind G. Sohoni
1 Introduction
M. G. Sohoni
Indian School of Business, Hyderabad, Telangana, India
e-mail: milind_sohoni@isb.edu
[Figure: the modeling cycle. Problems in the real world are formulated as a (quantitative) model; conclusions from the model are interpreted and validated to give conclusions about the real problem.]
that needs to be considered. Once the model is analyzed the output needs to
be interpreted appropriately and implemented in the real world with suitable
modifications.
able solvers, newer mathematical algorithms and implementations have also been
developed that compete with the Simplex algorithm effectively. From a business
analytics standpoint, however, understanding the models being built to address the
optimization problem, the underlying assumptions, and pertinent interpretation of
the obtained analytical solutions are equally important. In this chapter, we discuss
these details of the linear modeling. We will try to build our understanding using
a prototypical LP example in Sect. 2.1 and two-dimensional geometry in Sect. 2.4.
The insights gained are valid for higher-dimensional problems too and also reveal
how the Simplex algorithm works. For a detailed description of the Simplex
algorithm and other solution algorithms the reader is referred to Bazaraa et al.
(2011), Bertsimas and Tsitsiklis (1997), and Chvátal (1983).
Consider the following problem2 for a manufacturer who produces two types of
glasses, P1 and P2 . Suppose that it takes the manufacturer 6 h to produce 100 cases
of P1 and 5 h to produce 100 cases of P2 . The production facility is operational for
60 h per week. The manufacturer stores the week’s production in her own stockroom
where she has an effective capacity of 15,000 ft³. Hundred cases of P1 occupy
1000 ft³ of storage space, while 100 cases of P2 require 2000 ft³ due to special
packaging. The contribution margin of P1 is $5 per case; however, the only customer
available will not accept more than 800 cases per week. The contribution of P2 is
$4.5 per case and there is no limit on the amount that can be sold in the market.
The question we seek to answer is the following: How many cases of each product
should the glass manufacturer produce per week in order to maximize the total
weekly contribution/profit?
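Using these decision variables, where x1 and x2 denote the weekly production of
P1 and P2, respectively, in hundreds of cases, we can now represent the
manufacturer’s objective function analytically as:
max 500x1 + 450x2. (11.1)
Equation (11.1) is called the objective function, and the coefficients 500 and 450
are called the objective function coefficients.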
In our problem description, however, the manufacturer is resource constrained,
i.e., the manufacturer has limited weekly production and storage capacity.
Additionally, the demand for P1 in the market is limited. Hence, we need
to represent these technological constraints in our analytical formulation of
the problem. First, let’s focus on the production constraint, which states
that the manufacturer has 60 h of production capacity available for weekly
production. As mentioned in the problem statement, 100 cases of P1 require
6 h of production time and that of P2 require 5 h of production time. The
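technological constraint imposing this production limitation, namely that our total weekly
production does not exceed the available weekly production capacity, is analytically
expressed by:
6x1 + 5x2 ≤ 60. (11.2)
Similarly, 100 cases of P1 occupy 1000 ft³ of storage space and 100 cases of P2
occupy 2000 ft³, while the stockroom has a capacity of 15,000 ft³. Measuring storage
in hundreds of cubic feet, the storage constraint is expressed by:
10x1 + 20x2 ≤ 150. (11.3)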
From our problem statement, we know that the weekly demand for P1 does not
exceed 800 cases. So we need not produce more than 800 cases of P1 in the week.
Thus, we add a maximum demand constraint as follows:
x1 ≤ 8. (11.4)
Constraints (11.2), (11.3), and (11.4) are known as the technological constraints
of the problem. In particular, the coefficients of the variables xi , i = 1, 2, are known
as the technological coefficients while the values on the right-hand side of the
three inequalities are referred to as the right-hand side (rhs) vector of the constraints.
Finally, we recognize that the permissible value for variables xi , i = 1, 2, must
be nonnegative, i.e.,
xi ≥ 0 ; i = 1, 2, (11.5)
since these values express production levels. These constraints are known as the
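variable sign restrictions. Combining (11.1)–(11.5), the LP formulation of our
problem is as follows:
max 500x1 + 450x2
subject to
6x1 + 5x2 ≤ 60 (production capacity)
10x1 + 20x2 ≤ 150 (storage capacity)
x1 ≤ 8 (maximum demand for P1)
Sign restrictions: x1, x2 ≥ 0.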
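Because the prototype problem has only two variables, its solution can be checked by brute force. The Python sketch below relies on the fact that a linear program with a bounded, nonempty feasible region attains its optimum at a corner point of that region: it intersects every pair of constraint lines, keeps the feasible intersections, and evaluates the objective at each. The storage constraint is written in hundreds of cubic feet (10x1 + 20x2 ≤ 150), and all function and variable names are assumptions made for this illustration; it is not a substitute for the Simplex algorithm.

```python
from itertools import combinations

# Each constraint is (a1, a2, b), meaning a1*x1 + a2*x2 <= b,
# with x1 and x2 measured in hundreds of cases per week.
constraints = [
    (6.0, 5.0, 60.0),      # production hours
    (10.0, 20.0, 150.0),   # storage, in hundreds of cubic feet
    (1.0, 0.0, 8.0),       # demand limit on P1
    (-1.0, 0.0, 0.0),      # x1 >= 0
    (0.0, -1.0, 0.0),      # x2 >= 0
]

def objective(x1, x2):
    return 500.0 * x1 + 450.0 * x2

def is_feasible(x1, x2, tol=1e-9):
    return all(a1 * x1 + a2 * x2 <= b + tol for a1, a2, b in constraints)

def corner_points():
    """Intersect each pair of constraint lines and keep the feasible intersections."""
    points = []
    for (a1, a2, b), (c1, c2, d) in combinations(constraints, 2):
        det = a1 * c2 - a2 * c1
        if abs(det) < 1e-12:
            continue                      # parallel lines have no unique intersection
        x1 = (b * c2 - a2 * d) / det      # Cramer's rule
        x2 = (a1 * d - b * c1) / det
        if is_feasible(x1, x2):
            points.append((x1, x2))
    return points

best = max(corner_points(), key=lambda p: objective(*p))
best_value = objective(*best)
```

The best corner found this way lies where the production and storage constraints are both tight, at roughly 643 cases of P1 and 429 cases of P2 per week. Enumerating corners is practical only for very small problems, which is why dedicated algorithms are needed in general.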
Additivity assumption: The total consumption of each resource and the over-
all objective value are the aggregates of the resource consumptions and the
contributions to the problem objective, resulting by carrying out each activity
independently.
Proportionality assumption: The consumptions and contributions for each
activity are proportional to the actual activity level.
Divisibility assumption: Each variable is allowed to have fractional values (con-
tinuous variables).
Certainty assumption: Each coefficient of the objective vector and constraint
matrix is known with certainty (not a random variable).
It is informative to understand how we implicitly applied this logic when we derived
the technological constraints of the prototype example: (1) Our assumption that
the processing of each case of P1 and P2 required constant amounts of time,
respectively, implies proportionality, and (2) the assumption that the total production
time consumed in the week is the aggregate of the manufacturing times required
for the production of each type of glass, if the corresponding activity took place
independently, implies additivity.
It is important to note how the linearity assumption restricts our modeling
capabilities in the LP framework: For example, we cannot immediately model
effects like economies of scale in the problem structure, and/or situations in which
complementary activities consume resources jointly. In
some cases, one can approach these more complicated problems by applying some
linearization scheme—but that requires additional modeling effort.
Another approximation, implicit in many LPPs, is the so-called divisibility
assumption. This assumption refers to the fact that for LP theory and algorithms
to work, the decision variables must be real valued. However, in many business
problems, we may want to restrict values of the decision variables to be integers.
For example, this may be the case with the production of glass types, P1 and P2 ,
in our prototype example or production of aircraft. On the other hand, continuous
quantities, such as tons of steel to produce and gallons of gasoline to consume,
are divisible. That is, if we solved an LPP whose optimal solution included the
consumption of 3.27 gallons of gasoline, the answer would make sense to us; we
are able to consume fractions of gallons of gasoline. If, on the contrary, the optimal
solution called for the production of 3.27 aircraft, the solution probably
would not make sense to anyone.
Imposing integrality constraints on some, or all, variables of an LPP turns
the problem into a (mixed) integer programming (MIP or IP) problem. Solving an
MIP problem is computationally much harder than solving an LP. In fact, MIP
problems are NP-hard (their decision versions are NP-complete), i.e., no algorithm
is known that is guaranteed to find an optimal solution in time bounded by a
polynomial in the problem size. We will briefly discuss the challenge
of solving MIPs later in Sect. 3.
Finally, before we conclude this discussion, we define the feasible region of the
LP of (11.7)–(11.9), as the entire set of vectors x1 , x2 , . . . , xn (notice that each
system. Remember that the set of constraints determines the feasible region of the
LP. Thus, under the aforementioned correspondence, the feasible region is depicted
by the set of points that satisfy the LP constraints and the sign restrictions
simultaneously. Since all constraints in an LPP are expressed by linear equalities
or inequalities, we must first characterize the set of points that constitute the
solution space of each linear constraint. The intersection of the solution spaces
corresponding to each technological constraint and/or sign restriction will
represent the LP feasible region.
Notice that a constraint in an LPP can be either an equality or an inequality. We
first consider the feasible region corresponding to a single equality constraint.
The Feasible Space of a Single Equality Constraint Consider an equality
constraint of the type
a1 x1 + a2 x2 = b (11.10)
The feasible space of this constraint is the line itself. For an inequality
constraint of the type a1 x1 + a2 x2 ≤ b (or ≥ b), the feasible space is one of
the closed half-planes defined by the line a1 x1 + a2 x2 = b. Recollect that a line
divides a 2-D plane into two halves (half-planes), i.e., the portions of the plane
lying on each side of the line. One simple technique to determine the half-plane
comprising the feasible space of a linear inequality is to test whether the point
(0, 0) satisfies the inequality (the test is conclusive only when the line does not
pass through the origin, i.e., b ≠ 0). In case of a positive answer, the feasible
space is the half-plane containing the origin. Otherwise, it is the half-plane lying
on the other side, which does not contain the origin.
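The origin test can be written in a few lines. The following is a minimal sketch,
not part of the original text; the function name `origin_side` and the second
example constraint are hypothetical and chosen only for illustration.

```python
def origin_side(a1, a2, b, sense="<="):
    """Return True if the origin satisfies a1*x1 + a2*x2 (sense) b."""
    lhs_at_origin = a1 * 0 + a2 * 0   # always 0, written out for clarity
    if sense == "<=":
        return lhs_at_origin <= b
    if sense == ">=":
        return lhs_at_origin >= b
    raise ValueError("sense must be '<=' or '>='")

# Production constraint of the prototype LP: 6 x1 + 5 x2 <= 60
print(origin_side(6, 5, 60, "<="))   # feasible half-plane contains (0, 0)
# A hypothetical constraint x1 >= 5
print(origin_side(1, 0, 5, ">="))    # feasible half-plane excludes (0, 0)
```

When the function returns True, the feasible half-plane is the one on the origin's
side of the line; otherwise it is the opposite one.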
Consider our prototype LP described earlier in Sect. 2.1. Figure 11.2 shows the
feasible regions corresponding to the individual technological and nonnegativity
constraints. In particular, Fig. 11.2c shows the entire feasible region as the inter-
section of the half-spaces of the individual constraints. Note that, for our prototype
problem, the feasible region is bounded on all sides (the region doesn’t extend to
infinity in any direction) and nonempty (has at least one feasible solution).
Infeasibility and Unboundedness Sometimes, the constraint set can lead to an
infeasible or unbounded feasible region.
[Figure 11.2, three panels plotted in the (x1, x2)-plane: the constraint lines
10 x1 + 20 x2 = 150, 6 x1 + 5 x2 = 60, x1 = 8, x1 = 0, and x2 = 0, with the
feasible region marked in panel (c).]
Fig. 11.2 Feasible region of the prototype LP in 2-D. (a) Feasible region of the production and
nonnegativity constraints. (b) Feasible region of the storage and production constraints. (c) The
entire feasible region
An infeasible region implies the constraints are “contradictory” and hence the
intersection of the half-spaces is empty. An unbounded feasible region may
mean that the optimal objective value could go off to −∞ or +∞ if the objective
function “improves” in the direction in which the feasible region is unbounded.
Consider again our original prototype example. Suppose there is no demand
restriction on the number of cases of P1 and the manufacturer requires that at least
1050 cases of P1 are produced every week. Since x1 is measured in hundreds of
cases, these requirements remove the demand constraint x1 ≤ 8 and introduce a new
constraint into the problem formulation, i.e.,
x1 ≥ 10.5.
[Figure 11.3, two panels plotted in the (x1, x2)-plane. Labels recovered from the
panels: x1 ≥ 5, 10 x1 + 20 x2 ≤ 150, x2 ≥ 0, and an arrow marking the direction of
unbounded objective.]
Fig. 11.3 Infeasible and unbounded feasible regions in 2-D. (a) Feasible region is empty
(Infeasible). (b) Feasible region is unbounded
Figure 11.3a shows the feasible region for this new problem which is empty, i.e.,
there are no points on the (x1 , x2 )-plane that satisfy all constraints, and therefore
our problem is infeasible (over-constrained).
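The infeasibility can also be checked algebraically. A short worked check, assuming
the production constraint 6x1 + 5x2 ≤ 60 and the nonnegativity of x2 from the
prototype LP are kept unchanged:

```latex
\[
x_1 \ge 10.5,\quad x_2 \ge 0
\;\Longrightarrow\;
6x_1 + 5x_2 \;\ge\; 6(10.5) + 5(0) \;=\; 63 \;>\; 60 ,
\]
```

so no point can satisfy the new requirement and the production constraint at the
same time.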
To understand unbounded feasible regions visually, consider a situation wherein
we change our prototype LP such that the manufacturer must use at least 60 h of
production, must produce at least 500 cases of P1 , and must use at least 15,000
units of storage capacity. In this case the constraint set changes to
x1 ≥ 5,
6x1 + 5x2 ≥ 60,
10x1 + 20x2 ≥ 150,
and the feasible region looks like the region depicted in Fig. 11.3b. It is easy to
see that the feasible region of this problem is unbounded. Furthermore, in this case
our objective function, 500x1 + 450x2, can take arbitrarily large values, and there
will always be a feasible production decision corresponding to that arbitrarily
large profit. Such an LP is characterized as unbounded. It is noteworthy, however,
that even though an unbounded feasible region is a necessary condition for an LP to
be unbounded, it is not sufficient (e.g., if we were to minimize our objective
function over the same region, we would get a finite value).
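Unboundedness can be illustrated numerically by walking along a direction in which
the region never ends. The sketch below is only an illustration: the starting point
(5, 10) and the direction (1, 0) are hypothetical choices, not taken from the text.

```python
def feasible(x1, x2):
    """Constraint set of the modified prototype LP (all '>=' constraints)."""
    return (x1 >= 5 and 6 * x1 + 5 * x2 >= 60
            and 10 * x1 + 20 * x2 >= 150 and x2 >= 0)

def profit(x1, x2):
    return 500 * x1 + 450 * x2

# Start at a feasible point and move along the direction (1, 0).
start = (5, 10)   # hypothetical starting point, chosen for illustration
for step in (0, 10, 100, 1000):
    point = (start[0] + step, start[1])
    print(point, feasible(*point), profit(*point))
```

Every point visited stays feasible while the profit keeps growing, which is the
behavior the text describes for an unbounded maximization problem.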
Representing the Objective Function A function of two variables f (x1 , x2 ) is
typically represented as a surface in an (orthogonal) three-dimensional space, where
two of the dimensions correspond to the independent variables x1 and x2 , while
the third dimension provides the objective function value for any pair (x1 , x2 ). In
the context of our discussion, however, we will use the concept of contour plots.
Suppose α is some constant value of the objective function, then for any given range
of α’s, a contour plot depicts the objective function by identifying the set of points
(x1 , x2 ) such that f (x1 , x2 ) = α. The plot obtained for any fixed value of α is a
contour of the function. Studying the structure of a contour identifies some patterns
that depict useful properties of the function. In the case of 2-D LPPs, the linearity of
the objective function implies that any contour can be represented as a straight line
of the form:
c1 x1 + c2 x2 = α. (11.12)
Consider the objective function 500x1 + 450x2 in our prototype example. Let
us draw the first isoprofit line as 500x1 + 450x2 = α (the dashed red line in
Fig. 11.4), where α = 1000 and superimpose it over our feasible region. Notice that
the intersection of this line with the feasible region provides all those production
decisions that would result in a profit of exactly $1000.
[Figure 11.4, plotted in the (x1, x2)-plane: the prototype constraint lines
6 x1 + 5 x2 = 60, 10 x1 + 20 x2 = 150, x1 = 8, x1 = 0, and x2 = 0, the isoprofit
line 500 x1 + 450 x2 = α = 1000, and an arrow marking the direction of increasing α.]
[Figure 11.5, plotted in the (x1, x2)-plane: the feasible region with isoprofit
lines 500 x1 + 450 x2 = α for α = 980, 1960, 2940, 3920, and 4900, and the corner
point (6.43, 4.29).]
Fig. 11.5 Sweeping the isoprofit line across the feasible region until it is about to exit. An optimal
solution exists at a corner point (vertex)
As we change the value of α, the resulting isoprofit lines have constant slope
and varying intercept, i.e., they are parallel to each other (distinct
isoprofit/isocost lines cannot intersect, since a point cannot yield two different
objective values). Hence, if we continuously increase α from
some initial value α0 , the corresponding isoprofit lines can be obtained by “sliding”
the isoprofit line corresponding to f (x1 , x2 ) = α0 parallel to itself, in the direction
of increasing (decreasing) intercepts if c2 is positive (negative). This “improving
direction” of the isoprofit line is denoted by the dashed magenta arrow in Fig. 11.4.
Figure 11.5 shows several isoprofit lines, superimposed over the feasible region,
for our prototype problem.
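To see why the slope is constant, one can solve the contour equation (11.12) for x2,
assuming c2 ≠ 0:

```latex
\[
c_1 x_1 + c_2 x_2 = \alpha
\quad\Longleftrightarrow\quad
x_2 = \frac{\alpha}{c_2} - \frac{c_1}{c_2}\,x_1 .
\]
```

The slope −c1/c2 does not depend on α, while the intercept α/c2 does. For the
prototype objective the slope is −500/450 = −10/9, and raising α from 1000 to 2000
moves the x2-intercept from about 2.22 to about 4.44.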
Finding the Optimal Solution It is easy to argue that an optimal solution to an LPP
(with a nonconstant objective) will never lie in the interior of the feasible region.
To understand why this must be true, consider the prototype example and let us
assume that an optimal solution exists in the interior. It is easy to verify that by
simply increasing the value of either x1 or x2 , or both, while remaining feasible,
we can improve the objective value. But this would contradict the fact that the
point in the interior is an optimal solution. Thus, we can rule out the possibility
of finding an optimal solution in the interior of the feasible region. So then, if an
optimal solution exists, it must lie somewhere on the boundary of the feasible
region. The “sliding motion” described earlier suggests a way of finding the optimal
solution to an LPP. The basic idea is to keep sliding the isoprofit line in the
direction of increasing α’s, until we are just about to slide beyond the boundary of
the LP feasible region. For our prototype LPP, this idea is demonstrated in
Fig. 11.5. The dashed red lines are the contour lines, and the solid red line is the
contour line corresponding to that value of α such that any further increase would
result in the objective line leaving the feasible region, i.e., an infinitesimal
increase in α would result in the contour line moving parallel to itself but no
longer intersecting the feasible region. Thus, the objective value is maximized
at that point on the boundary beyond which the objective line leaves the
feasible region. In this case that point happens to be defined by the intersection of
the constraint lines for the production capacity and storage capacity, i.e.,
6x1 + 5x2 = 60 and 10x1 + 20x2 = 150. The coordinates of the optimal point are
x1 = 6.43 and x2 = 4.29 (rounded to two decimals; see the labeled point in Fig. 11.5).
In fact, notice that the optimal point (the green dot) is one of the corner points
(the black dots) of the feasible region depicted in Fig. 11.5 and is unique. The
optimal corner point is also referred to as the optimal vertex.
In summary, if the optimal vertex is uniquely determined by a set of intersecting
constraints and the optimal solution only exists at that unique corner point (vertex),
then we have a unique optimal solution to our problem. See Fig. 11.6.
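Because an optimum, when one exists, can be found at a corner point, a small 2-D
problem can be solved by listing the corner points and comparing objective values.
The following is a brute-force sketch of that idea for the prototype LP, not the
Simplex algorithm; the helper names `intersect` and `is_feasible` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Each constraint is (a1, a2, b), meaning a1*x1 + a2*x2 <= b.
constraints = [
    (6, 5, 60),     # production
    (10, 20, 150),  # storage
    (1, 0, 8),      # demand for P1
    (-1, 0, 0),     # x1 >= 0
    (0, -1, 0),     # x2 >= 0
]

def intersect(c, d):
    """Intersection of the boundary lines of two constraints (Cramer's rule)."""
    det = c[0] * d[1] - c[1] * d[0]
    if det == 0:
        return None  # parallel lines
    x1 = Fraction(c[2] * d[1] - c[1] * d[2], det)
    x2 = Fraction(c[0] * d[2] - c[2] * d[0], det)
    return x1, x2

def is_feasible(p):
    return all(a1 * p[0] + a2 * p[1] <= b for a1, a2, b in constraints)

vertices = set()
for c, d in combinations(constraints, 2):
    p = intersect(c, d)
    if p is not None and is_feasible(p):
        vertices.add(p)

best = max(vertices, key=lambda p: 500 * p[0] + 450 * p[1])
best_value = 500 * best[0] + 450 * best[1]
print(sorted(vertices))
print(best, float(best_value))
```

Exact fractions are used so that the optimal vertex (45/7, 30/7), which the text
reports rounded as (6.43, 4.29), is not blurred by floating-point error.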
LPs with Many Optimal Solutions A natural question to ask is the following:
Is the optimal solution, if one exists, always unique? To analyze this graphically,
suppose the objective function of our prototype problem is changed to
225x1 + 450x2 .
[Figure 11.6, plotted in the (x1, x2)-plane: the feasible region with the constraint
lines 10 x1 + 20 x2 = 150 and 6 x1 + 5 x2 = 60 and the optimal vertex (6.43, 4.29).]
Fig. 11.6 A unique optimal solution exists at a single corner point (vertex)
[Figure 11.7, plotted in the (x1, x2)-plane: isoprofit lines of the objective
225 x1 + 450 x2 and the corner points (0, 7.5) and (6.43, 4.29).]
Fig. 11.7 Multiple optimal solutions along a face (includes corner points)
Notice that any isoprofit line corresponding to the new objective function is parallel
to the line corresponding to the storage constraint, 10x1 + 20x2 = 150.
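The consequence of this parallelism can be checked numerically. The sketch below
assumes the two corner points shown in Fig. 11.7, (0, 7.5) and approximately
(6.43, 4.29), are the endpoints of the face lying on the storage constraint; the
function name `new_objective` is hypothetical.

```python
from fractions import Fraction

def new_objective(x1, x2):
    return 225 * x1 + 450 * x2

# Endpoints of the face that lies on the storage constraint 10 x1 + 20 x2 = 150.
p = (Fraction(0), Fraction(15, 2))        # (0, 7.5)
q = (Fraction(45, 7), Fraction(30, 7))    # approximately (6.43, 4.29)

# Objective and storage coefficients are proportional, so the lines are parallel.
assert Fraction(225, 10) == Fraction(450, 20)

# Every convex combination of p and q gives the same objective value.
for k in range(5):
    t = Fraction(k, 4)
    x1 = (1 - t) * p[0] + t * q[0]
    x2 = (1 - t) * p[1] + t * q[1]
    print(float(x1), float(x2), new_objective(x1, x2))
```

All five sampled points on the segment yield the same objective value, so the
optimum is attained along the whole face rather than at a single vertex.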
[Figure 11.8, two panels plotted in the (x1, x2)-plane: panel (a) shows the
production constraint line 6 x1 + 5 x2 = 60 moved to 6 x1 + 5 x2 = 61; panel (b)
shows it moved to 6 x1 + 5 x2 = 59.]
Fig. 11.8 Relaxing and tightening the feasible region. (a) Relaxing (expanding) the feasible
region. (b) Tightening (shrinking) the feasible region
$$ s_i = b_i - \sum_{j=1}^{n} a_{ij} x_j^{*}. $$
$$ \text{surplus}_i = \sum_{j=1}^{n} a_{ij} x_j^{*} - b_i. $$
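As a small illustration of the slack formula, the sketch below evaluates the slack
of each ≤ constraint of the prototype LP at its optimal vertex. The variable names
are hypothetical, and the optimal vertex is entered as the exact fractions behind
the rounded values (6.43, 4.29).

```python
from fractions import Fraction

# Prototype LP constraints (all of the form sum_j a_ij x_j <= b_i).
A = [[6, 5], [10, 20], [1, 0]]   # production, storage, demand
b = [60, 150, 8]
x_opt = [Fraction(45, 7), Fraction(30, 7)]   # optimal vertex, about (6.43, 4.29)

slacks = [b_i - sum(a * x for a, x in zip(row, x_opt)) for row, b_i in zip(A, b)]
for name, s in zip(["production", "storage", "demand"], slacks):
    status = "binding" if s == 0 else "nonbinding"
    print(f"{name}: slack = {float(s):.3f} ({status})")
```

A zero slack identifies a binding constraint; here the production and storage
constraints are binding and the demand constraint is not.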
with m equality constraints and n variables (where we can assume n > m). Then,
theory tells us that each vertex of the feasible region of this LP can be found by:
choosing m of the n variables (these m variables are collectively known as the basis
and the corresponding variables are called basic variables); setting the remaining
(n − m) variables to zero; and solving a set of simultaneous linear equations to
determine values for the m variables we have selected. Not every selection of m
variables will give a nonnegative solution. Also, enumerating all possible solutions
can be very tedious, although when m is small the enumeration can be done very
quickly. Instead of enumerating, the Simplex algorithm starts from one vertex and
tries to find an “adjacent” vertex that improves the value of the objective function.
There is one issue to settle before doing that: if the values of the m basic
variables are all > 0, then the vertex is nondegenerate. If one or more of these
variables is zero, then the vertex is degenerate. This may sometimes mean that the
vertex is over-defined, i.e., there are more binding constraints at the vertex than
are necessary to define it. An example of a degenerate vertex in three dimensions is
x1 + 4x3 ≤ 4
x2 + 4x3 ≤ 4
x1 , x2 , x3 ≥ 0
The three-dimensional feasible region looks like the region in Fig. 11.9. Notice that
the vertex (0, 0, 1) has four planes defining it.
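Degeneracy also shows up when the basic solutions of this example are enumerated:
several different choices of basic variables lead to the same vertex. The sketch
below is an illustration under the assumption that slack variables s1 and s2
(names introduced here) convert the two inequalities into equalities, so m = 2
and n = 5.

```python
from fractions import Fraction
from itertools import combinations

# Standard form of the example, with slack variables s1 and s2:
#   x1 + 4 x3 + s1 = 4
#   x2 + 4 x3 + s2 = 4
names = ["x1", "x2", "x3", "s1", "s2"]
columns = {"x1": (1, 0), "x2": (0, 1), "x3": (4, 4), "s1": (1, 0), "s2": (0, 1)}
rhs = (4, 4)

bases_at_vertex = {}
for u, v in combinations(names, 2):      # choose m = 2 basic variables
    (a, c), (b, d) = columns[u], columns[v]
    det = a * d - b * c
    if det == 0:
        continue                          # columns are dependent: not a basis
    val_u = Fraction(rhs[0] * d - b * rhs[1], det)   # Cramer's rule
    val_v = Fraction(a * rhs[1] - rhs[0] * c, det)
    if val_u < 0 or val_v < 0:
        continue                          # basic solution is not feasible
    solution = dict.fromkeys(names, Fraction(0))
    solution[u], solution[v] = val_u, val_v
    vertex = (solution["x1"], solution["x2"], solution["x3"])
    bases_at_vertex.setdefault(vertex, []).append((u, v))

for vertex, bases in bases_at_vertex.items():
    print(tuple(map(str, vertex)), bases)
```

The vertex (0, 0, 1) is reached from four different bases, each containing a basic
variable whose value is zero, whereas every other vertex corresponds to a single
basis.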
[Figure 11.9, plotted in three dimensions with axes x1, x2, and x3: the feasible
region of the example, with the degenerate vertex (0, 0, 1) marked.]
$$ \bar{c}_j = c_j - \sum_{i=1}^{m} a_{ij} y_i $$
where cj is the objective coefficient of activity j , yi is the shadow price (dual value)
associated with constraint i, and aij is the amount of resource i (corresponds to con-
straint i) used per unit of activity j . The operation of determining the reduced cost of
an activity, j , from the shadow prices of the constraints and the objective function is
generally referred to as pricing out an activity. To understand these computations,
consider the prototype LP described earlier. Suppose the manufacturer decides to
add another set of glasses, P3 , to his product mix. Let us assume that P3 requires
8 h of production time per 100 cases and occupies 1000 cubic units of storage
space. Further, let the marginal profit from a case of P3 be $6. If x3 represents the
decision of how many hundreds of cases of P3 to produce, then the new LP can be
rewritten as:
Now, suppose we want to compute the shadow price of the production constraint.
Let b1 denote the rhs of the production constraint (C1). Currently, b1 = 60 as stated
in the formulation above. Notice that the current optimal objective value is 5142.86
when b1 = 60. Let us define the optimal value as a function of the rhs of the
production constraint, i.e., b1 , and denote it as Z*(b1 ). Thus, Z*(60) = 5142.86.
Now suppose we keep all other values the same (as mentioned in the formulation) but
change b1 to 61 and recompute the optimal objective value. Upon solving the LP we
get the new optimal objective value of Z*(61) = 5221.43. Then, using the definition
of the shadow price of a constraint, the shadow price of the production constraint is
computed as follows:
$$ \text{Shadow price of C1} = \frac{Z^{*}(61) - Z^{*}(60)}{61 - 60} = \frac{5221.43 - 5142.86}{1} = 78.57. $$
Notice that the shadow price is the rate at which the optimal objective changes
with respect to the rhs of a particular constraint all else remaining equal. It
should not be interpreted as the absolute change in the optimal objective value.
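The finite-difference computation can be reproduced without an LP solver, under the
assumption (valid for b1 = 60 and b1 = 61) that the optimal vertex is the
intersection of the production and storage constraint lines. The function name
`optimal_value` is hypothetical.

```python
from fractions import Fraction

def optimal_value(b1):
    """Objective at the vertex where production (rhs b1) and storage bind.

    Solves 6 x1 + 5 x2 = b1 and 10 x1 + 20 x2 = 150 by Cramer's rule.
    This equals Z*(b1) only while that vertex stays optimal.
    """
    det = 6 * 20 - 5 * 10                      # = 70
    x1 = Fraction(b1 * 20 - 5 * 150, det)
    x2 = Fraction(6 * 150 - b1 * 10, det)
    return 500 * x1 + 450 * x2

z60, z61 = optimal_value(60), optimal_value(61)
shadow_price = (z61 - z60) / (61 - 60)
print(float(z60), float(z61), float(shadow_price))
```

The exact shadow price is 550/7, which the text reports rounded to 78.57.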
Notice two important facts: (1) The reduced cost of basic variables is 0, i.e.,
c̄j equals 0 for all basic xj (see Sect. 2.4.2 for the definition), and (2) since cj
equals zero for slack and surplus variables (see Sect. 2.4.2 for the definition), the
reduced cost of these variables is always the negative of the shadow price
corresponding to the respective constraints. The economic interpretation of a shadow
price, yi (associated with resource i), is the imputed value of resource i. The term
$\sum_{i=1}^{m} a_{ij} y_i$ is interpreted as the total value of the resources used
per unit of activity j . It is thus the marginal resource cost for using that
activity. If we think of the objective coefficients cj as being the marginal
revenues, the reduced costs, c̄j , are simply the net marginal revenues.
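As a hedged numerical sketch of pricing out the new product P3: the code below
assumes that P3's data are expressed per hundred cases in the same units as the
prototype constraints, i.e., an objective coefficient of 600 ($6 per case), a
production coefficient of 8, and a storage coefficient of 10. These coefficients
are an interpretation of the description of P3, not a formulation quoted from the
text. The shadow prices are the exact fractions behind the rounded values 78.571
and 2.857.

```python
from fractions import Fraction

# Shadow prices of the prototype LP (about 78.571 and 2.857; demand is nonbinding).
y = {"production": Fraction(550, 7), "storage": Fraction(20, 7), "demand": Fraction(0)}

# ASSUMED per-hundred-cases data for P3 (not stated in this form in the text):
c3 = 600                                   # $6 per case -> $600 per hundred cases
a3 = {"production": 8, "storage": 10, "demand": 0}

resource_cost = sum(a3[i] * y[i] for i in y)   # sum_i a_i3 * y_i
reduced_cost = c3 - resource_cost
print(float(resource_cost), float(reduced_cost))
```

Under these assumptions the marginal resource cost exceeds the marginal revenue, so
the net marginal revenue of P3 is negative and producing it would not improve the
current optimal profit.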
An intuitive way to think about reduced costs is as follows: If the optimal solution
to a LP indicates that the optimal level of a particular decision variable is zero,
it must be because the objective function coefficient of this variable (e.g., its unit
contribution to profits or unit cost) is not beneficial enough to justify its “inclusion”
in the decision. The reduced cost of that decision variable tells us the amount by
which the objective function coefficients must improve for the decision variable to
become “attractive enough to include” and take on a nonzero value in the optimal
solution. Hence the reduced costs of all decision variables that take nonzero values
in the optimal solution are, by definition, zero: no further enhancement to their
attractiveness is needed to get the LP to use them, since they are already “included.”
In economic terms, the values imputed to the resources (yi ) are such that the
net marginal revenue is zero on those activities operated at a positive level, i.e.,
marginal revenue = marginal cost (MR = MC).
Shadow prices are only locally accurate: a shadow price remains the same only as
long as the set of binding constraints does not change. If we make dramatic changes
in the constraint, naively multiplying the shadow price by the magnitude of the
change may mislead us. In particular, the shadow price holds only within an allowable
range of changes to the constraint’s rhs; outside of this allowable range the shadow
price may change. This allowable range is composed of two components. The
allowable increase is the amount by which the rhs may be increased before the
shadow price can change; similarly, the allowable decrease is the corresponding
reduction that may be applied to the rhs before a change in the shadow price
can take place (whether this increase or decrease corresponds to a tightening or
a relaxation of the constraint depends on the direction of the constraint’s
inequality). A constraint is binding if it passes through the optimal vertex, and
nonbinding if it does not pass through the optimal vertex (constraint C3 in the
example above). For a binding constraint, the geometric intuition behind the
definition of a shadow price is as follows: By changing the rhs of a binding
constraint, we change the optimal solution as it slides along the other binding
constraints. Within the allowable range of changes to the rhs, the optimal vertex
slides in a straight line, and the optimal objective value changes at a constant rate
(which is the shadow price). Once we cross the limits indicated by the allowable
increase or decrease, however, the direction of the optimal vertex’s slide changes
because the set of binding constraints changes. At some point the constraint whose
rhs is being modified may become nonbinding, and a new vertex becomes optimal. For a
nonbinding constraint the shadow price (or dual value) is always zero.
Consider the prototype LP described earlier where the rhs value of the production
constraint is 60. In Fig. 11.10 we show how the feasible region changes, and when
the set of binding constraints changes, as we perturb the rhs value of the production
constraint. Notice that in Fig. 11.10a the storage constraint drops out of the set of
binding constraints and in Fig. 11.10c the demand constraint becomes binding. In
between these two extremes, the set of binding constraints, as shown in Fig. 11.10b,
remains unchanged. The range over which the current optimal shadow price of 78.57
remains unchanged is from 37.5 to 65.5 (allowable increase is 5.5 and allowable
decrease is 22.5). That is, if the rhs of the production constraint were to vary in the
range from 37.5 to 65.5 (values of b1 ∈ [37.5, 65.5]) the shadow price would be
constant at 78.57.
Currently, the value of b1 = 60. In Fig. 11.11 we plot the optimal objective value
Z*(b1 ) as a function of b1 , the rhs of the production constraint, when b1 is in the
range [37.5, 65.5]. All other values are kept the same. Notice, as we vary b1 , the
optimal objective value changes linearly at the rate of the shadow price, i.e., 78.57.
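The range [37.5, 65.5] can be recovered by hand. A sketch, assuming the production
and storage constraints stay binding and the demand constraint is x1 ≤ 8: solving
6x1 + 5x2 = b1 together with 10x1 + 20x2 = 150 gives the vertex as a function of b1,
and the vertex remains feasible exactly while the remaining constraints hold.

```latex
\[
x_1(b_1) = \frac{2b_1 - 75}{7}, \qquad x_2(b_1) = \frac{90 - b_1}{7},
\]
\[
x_1 \ge 0 \iff b_1 \ge 37.5, \qquad
x_1 \le 8 \iff b_1 \le 65.5, \qquad
x_2 \ge 0 \iff b_1 \le 90 ,
\]
\[
Z^{*}(b_1) = 500\,x_1(b_1) + 450\,x_2(b_1) = \frac{550\,b_1 + 3000}{7}
\quad\text{for } 37.5 \le b_1 \le 65.5 .
\]
```

The slope 550/7 ≈ 78.57 is the shadow price, and the endpoints give Z*(37.5) = 3375
and Z*(65.5) = 5575; small differences from values computed with the rounded slope
78.57 are due to rounding.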
When the reduced cost of a decision variable is nonzero (implying that the value
of that decision variable is zero in the optimal solution), the reduced cost is also
reflected in the allowable range of its objective coefficient. In this case, one of the
allowable limits is always infinite (because making the objective coefficient less
attractive will never cause the optimal solution to include the decision variable in
the optimal solution); and the other limit, by definition, is the reduced cost (for it
is the amount by which the objective coefficient must improve before the optimal
solution changes).
[Figure 11.10, three panels plotted in the (x1, x2)-plane. Panel (a): production
binding, storage nonbinding, demand nonbinding. Panel (b): the production constraint
line at 6 x1 + 5 x2 = 37.5, 6 x1 + 5 x2 = 60, and 6 x1 + 5 x2 = 65.5. Panel (c):
storage binding, demand binding, production nonbinding.]
Fig. 11.10 A shadow price is valid as long as the set of binding constraints remains the same. (a)
Decreasing the rhs beyond the range. (b) The range of the rhs for which the shadow price remains
constant. (c) Increasing the rhs beyond the range
[Figure 11.11: plot of Z*(b1) against b1. The line 5142.86 + (b1 − 60) × 78.57,
whose slope is the shadow price, passes through (37.5, 3375.03), (60, 5142.86), and
(65.5, 5575.00); the allowable decrease is 22.5 and the allowable increase is 5.5.]
Fig. 11.11 Plot of Z*(b1 ) vs. b1 for the production constraint, when b1 ∈ [37.5, 65.5]
It is evident from the earlier discussion that any optimal solution to an LPP has a
very specific structure. We reiterate the optimal structure of any LPP below:
1. The shadow price of a nonbinding constraint is always 0. A binding constraint may
have a nonzero shadow price. Together, this implies
$$ y_i^{*} \Big( b_i - \sum_{j=1}^{n} a_{ij} x_j^{*} \Big) = 0 \quad \text{for } i = 1, \ldots, m. $$
2. Every decision variable has a reduced cost associated with it. Basic variables, at
optimality, have a zero reduced cost and nonbasic variables may have a nonzero
reduced cost. This implies that
$$ x_j^{*} \Big( c_j - \sum_{i=1}^{m} a_{ij} y_i^{*} \Big) = 0 \quad \text{for } j = 1, \ldots, n. $$
3. The optimal objective value equals the total imputed value of the resources, i.e.,
$$ \sum_{j=1}^{n} c_j x_j^{*} = \sum_{i=1}^{m} b_i y_i^{*}, $$
where $y_i^{*}$ is the shadow price of the ith constraint at optimality, bi is the
value of the rhs of constraint i, cj is the objective coefficient of the j th decision
variable, and $x_j^{*}$ is the optimal value of the j th decision variable. For the
prototype problem described earlier,
$$ \sum_{j=1}^{n} c_j x_j^{*} = (500 \times 6.429) + (450 \times 4.285) = 5142.8, $$
$$ \sum_{i=1}^{m} b_i y_i^{*} = (60 \times 78.571) + (150 \times 2.857) = 5142.8. $$
Conditions (1) and (2) together are called the complementary slackness conditions
of optimality. All three conditions, (1), (2), and (3), provide an easily verifiable
certificate of optimality for any LPP. This is one of the fascinating features of
any LP optimal solution: the certificate of optimality comes with the solution.
Thus, combining the search from vertex to vertex and examining the solution
for optimality gives an algorithm (the Simplex algorithm) to solve LPs very
efficiently!
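The certificate can be checked mechanically for the prototype problem. The sketch
below is an illustration, not a general-purpose routine: it uses the exact fractions
behind the rounded solution and shadow prices, and it also checks primal and dual
feasibility, which the three conditions presuppose. All variable names are
hypothetical.

```python
from fractions import Fraction as F

A = [[6, 5], [10, 20], [1, 0]]        # production, storage, demand rows
b = [60, 150, 8]
c = [500, 450]
x = [F(45, 7), F(30, 7)]              # candidate primal solution
y = [F(550, 7), F(20, 7), F(0)]       # candidate shadow prices

row_use = [sum(a * xj for a, xj in zip(row, x)) for row in A]
col_cost = [sum(A[i][j] * y[i] for i in range(3)) for j in range(2)]

primal_feasible = all(u <= bi for u, bi in zip(row_use, b)) and min(x) >= 0
dual_feasible = all(v >= cj for v, cj in zip(col_cost, c)) and min(y) >= 0
cond1 = all(yi * (bi - u) == 0 for yi, bi, u in zip(y, b, row_use))
cond2 = all(xj * (cj - v) == 0 for xj, cj, v in zip(x, c, col_cost))
cond3 = sum(cj * xj for cj, xj in zip(c, x)) == sum(bi * yi for bi, yi in zip(b, y))
print(primal_feasible, dual_feasible, cond1, cond2, cond3)
```

All five checks hold, which certifies that the vertex (45/7, 30/7) is optimal
without re-solving the LP.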
[Figure: plot in the (x1, x2)-plane showing the constraint line 5 x1 + 8 x2 = 24.]
difficulty of solving the model and guaranteeing optimality of the solution. Let us
consider a simple example to understand where the computational challenge arises.
Consider the following example:
[Figure 11.13, plotted in the (x1, x2)-plane: the constraint line 5 x1 + 8 x2 = 24
and the label “LP optimal = 4.8”.]
[Figure 11.14, two panels plotted in the (x1, x2)-plane, each showing the constraint
line 5 x1 + 8 x2 = 24.]
Fig. 11.14 Finding an integer solution by truncating or rounding up an LP solution may not work.
(a) Truncating (not optimal). (b) Rounding up (infeasible)
But as Fig. 11.13 illustrates, the corner point at which the LP finds its
optimal solution need not be integer valued. In this example the optimal solution
of the LP relaxation is found at the vertex (4.8, 0). As Fig. 11.14 shows, truncating
or rounding up the optimal LP solution doesn’t provide the integer optimal solution.
[Figure 11.15, plotted in the (x1, x2)-plane: the constraint line 5 x1 + 8 x2 = 24.]
Fig. 11.15 Had the truncated solution been optimal, the LP would have found it at another corner
point! That’s why it is not optimal
While rounding up renders the solution infeasible, had the truncated solution
been optimal, the LP would have found it at another corner point, as shown in
Fig. 11.15.
In this simple example, it turns out that the IP optimal solution is in the interior
of the LP feasible region, as shown in Fig. 11.16.
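The pitfalls of rounding can be reproduced with a few lines of enumeration. The
sketch below is only an illustration: it keeps the constraint 5x1 + 8x2 ≤ 24 from
the figures, but the objective 3x1 + 4x2 is a hypothetical choice made here because
it places the LP optimum at the vertex (4.8, 0); it is not taken from the text.

```python
from fractions import Fraction

# HYPOTHETICAL objective, chosen only for illustration (not from the text):
# maximize 3 x1 + 4 x2  subject to  5 x1 + 8 x2 <= 24,  x1, x2 >= 0.
def value(x1, x2):
    return 3 * x1 + 4 * x2

def feasible(x1, x2):
    return x1 >= 0 and x2 >= 0 and 5 * x1 + 8 * x2 <= 24

lp_vertices = [(Fraction(0), Fraction(0)), (Fraction(24, 5), Fraction(0)),
               (Fraction(0), Fraction(3))]
lp_opt = max(lp_vertices, key=lambda p: value(*p))        # the vertex (4.8, 0)

truncated = (4, 0)       # round x1 = 4.8 down
rounded_up = (5, 0)      # round x1 = 4.8 up

integer_points = [(i, j) for i in range(5) for j in range(4) if feasible(i, j)]
ip_opt = max(integer_points, key=lambda p: value(*p))

print("LP optimum:", lp_opt, float(value(*lp_opt)))
print("truncated:", truncated, feasible(*truncated), value(*truncated))
print("rounded up:", rounded_up, feasible(*rounded_up))
print("IP optimum:", ip_opt, value(*ip_opt))
```

With this assumed objective, the rounded-up point violates the constraint, the
truncated point is feasible but not the best integer point, and the best integer
point satisfies the constraint with slack, i.e., it lies in the interior of the LP
feasible region.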
Thus, finding the IP optimal solution is much harder than looking for the optimal
solution of the LP relaxation (which is guaranteed to be found at a corner point
of the LP polyhedron if an optimal solution exists) because the solution can lie
in the interior of the corresponding LP feasible region. If it were possible to
get an accurate mathematical (polyhedral) description of the convex hull of the
integer feasible points using linear constraints, then one could solve the resulting
problem (after including these additional constraints) as an LP and guarantee that
the LP corner point solution would indeed be optimal to the IP too. However, there
is no known standard technique to develop these constraints systematically for any
IP and get an accurate mathematical description of the convex hull. Developing such
constraints is largely problem specific and tends to exploit the specific
mathematical structure underlying the formulation.
So what carries over from LPs to IPs (or MILPs)? The idea of feasibility is
unchanged. One can define and compute shadow prices in an analogous fashion. The
linear relaxation of an integer problem provides a bound on the attainable objective
value, but a feasible relaxation does not guarantee that the integer problem itself
is feasible. There is no way to verify optimality from the solution alone; instead
one must rely on other methods to verify optimality. Search methods are used, but
they do not simply go from vertex to vertex! That’s why IPs are so hard to solve. Of
course, this is not to imply that problems shouldn’t be modeled and solved as IPs.
Today’s state-of-the-art algorithms are very efficient in solving large instances of
IPs to optimality but, unlike the case of LPs, guaranteeing optimality within a
practical amount of time is not possible in general. A detailed study of the theory
and practice of integer programming can be found in Bertsimas and Tsitsiklis (1997),
Nemhauser and Wolsey (1988), and Schrijver (1998).
[Figure 11.16, plotted in the (x1, x2)-plane: the constraint line 5 x1 + 8 x2 = 24.]
Next we briefly illustrate a basic branch-and-bound solution technique to
solve IPs.
The basic idea behind the naive branch-and-bound (B&B) method is that of divide
and conquer. Notice that the feasible region of the LP relaxation of an IP, i.e., when
we ignore the integrality constraints, always contains the feasible region
of the IP. Consequently, any optimal solution to the LP relaxation provides a bound
on the optimal IP value. In particular, for a minimization problem the LP relaxation
will result in a lower bound and for a maximization problem it will result in an upper
bound. If $Z^{*}_{LP}$ denotes the optimal objective value of the LP relaxation and
$Z^{*}_{IP}$ denotes the optimal objective value of the IP, then
$$ Z^{*}_{LP} \ge Z^{*}_{IP} \quad \text{for a maximization problem, and} $$
$$ Z^{*}_{LP} \le Z^{*}_{IP} \quad \text{for a minimization problem.} $$
The B&B method divides the feasible region (partitions it) and solves for
the optimal solution over each partition separately. Suppose F is the feasible
region of the IP and we wish to solve $\min_{x \in F} c^{\top} x$. Consider a
partition F1 , . . . , Fk of F . Recollect, a partition implies that the subsets are
collectively exhaustive and mutually exclusive, i.e.,
$$ F_i \cap F_j = \emptyset \ \text{ for } i \ne j, \quad \text{and} \quad \bigcup_{i=1}^{k} F_i = F. $$
Then
$$ \min_{x \in F} c^{\top} x = \min_{1 \le i \le k} \Big\{ \min_{x \in F_i} c^{\top} x \Big\}. $$
In other words, we optimize over each subset separately. The idea hinges on the fact
that if we can’t solve the original problem directly, we might be able to solve the
smaller subproblems recursively. Dividing the original problem into subproblems is
the idea of branching. As is readily observable, a naive implementation of B&B
is equivalent to complete enumeration and can take an arbitrarily long time to solve.
To reduce the computational time, most B&B procedures employ an idea called
pruning. Suppose we assume that each of our decision variables has finite upper
and lower bounds (not an unreasonable assumption for most business problems).
Then, any feasible solution to our minimization problem provides an upper bound
u(F ) on the optimal IP objective value (see footnote 3). Now, after branching, we
obtain a lower bound b (Fi ) on the optimal objective value of each of the
subproblems. If b (Fi ) ≥ u (F ), then we don’t need to consider solving subproblem i
any further. This is because we already have a solution at least as good as any that
can be found in partition Fi . One typical way to find the lower bound b (Fi ) is by
solving the LP relaxation. Eliminating a partition from further exploration by means
of an appropriate bound is called pruning. The process of iteratively finding better
values of b (Fi ) and u (F ) is called bounding. Thus, the basic steps in an LP-based
B&B procedure involve:
LP relaxation: First solve the LP relaxation of the original IP problem. The result
is one of the following:
1. The LP is infeasible ⇒ the IP is infeasible.
2. The LP is feasible with an integer solution ⇒ optimal solution to the IP.
3. The LP is feasible but has a fractional solution ⇒ lower bound for the IP.
In the first two cases, we are done. In the third case, we must branch and
recursively solve the resulting subproblems.
Branching: The most common way to branch is as follows: Select a variable
i whose value xi is fractional in the LP solution. Create two subproblems: in
3 Typically, we could employ a heuristic procedure to obtain an upper bound to our problem.
[Figure: branch-and-bound tree. Node labels recovered from the figure:
Root: fractional, Z = 22
  x3 = 0: fractional, Z = 21.65, solution x = {1, 1, 0, 0.667}
  x3 = 1: fractional, Z = 21.85, solution x = {1, 0.714, 1, 0}
    x2 = 0: integer, Z = 18, solution x = {1, 0, 1, 1}
    x2 = 1: fractional, Z = 21.8, solution x = {0.6, 1, 1, 0}
      x1 = 0: integer, Z = 21, optimal solution x = {0, 1, 1, 1}
      x1 = 1: infeasible]
iteratively and always choose the “right-side” child candidate. Thus, we begin by
branching on x3 followed by x2 and then x1 . Notice that by fixing x1 = x2 = x3 = 1
we arrive at an infeasible problem at the rightmost node of the tree (fourth level).
However, the left child candidate at the same level, i.e., when x3 = x2 = 1 and
x1 = 0, gives us an integer feasible solution with objective value ZIP = 21. This is
the best IP lower bound we have so far, and the corresponding solution (0, 1, 1, 1)
is our incumbent IP solution. Now, when we step one level higher to explore the node
where x3 = 1 and x2 = 0, we get an LP solution with an objective value of 18 (which
also happens to be integer valued), which is less than 21, the value of our
incumbent IP solution. Hence, we prune the sub-tree (not shown in the figure) rooted
at that node (where the optimal objective value is 18). Similarly, we don’t need to
explore the sub-tree to the left of the root node, i.e., when we fix x3 = 0, because
its LP bound of 21.65 shows that, as long as integer solutions have integer
objective values, that sub-tree can never give us an integer solution better than
our incumbent.
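The walk through the tree can be mimicked with a short depth-first B&B routine. The
sketch below rests on a loud assumption: the formulation of the example is not
shown in this text, so the code uses a small 0-1 knapsack instance (maximize
8x1 + 11x2 + 6x3 + 4x4 subject to 5x1 + 7x2 + 4x3 + 3x4 ≤ 14) whose LP bounds agree
with the node values reported in the tree. The instance, the function names, and
the branching rule are illustrative choices, not the author's stated procedure.

```python
import math
from fractions import Fraction

# ASSUMED instance (not given in the text): a 0-1 knapsack whose LP bounds
# match the node values reported in the tree.
profit = [8, 11, 6, 4]
weight = [5, 7, 4, 3]
capacity = 14

def lp_relaxation(fixed):
    """Solve the LP relaxation with some variables fixed to 0 or 1.

    Returns (bound, x), or None if the fixed variables already exceed capacity.
    For a single knapsack constraint the LP optimum is the greedy fill by
    profit/weight ratio.
    """
    x = [Fraction(fixed.get(j, 0)) for j in range(4)]
    room = capacity - sum(weight[j] * x[j] for j in range(4))
    if room < 0:
        return None
    free = sorted((j for j in range(4) if j not in fixed),
                  key=lambda j: Fraction(profit[j], weight[j]), reverse=True)
    for j in free:
        x[j] = min(Fraction(1), Fraction(room, weight[j]))
        room -= weight[j] * x[j]
    return sum(profit[j] * x[j] for j in range(4)), x

best_value, best_x = None, None

def branch_and_bound(fixed):
    global best_value, best_x
    result = lp_relaxation(fixed)
    if result is None:
        return                                    # prune: infeasible
    bound, x = result
    # Profits are integers, so integer solutions cannot exceed floor(bound).
    if best_value is not None and math.floor(bound) <= best_value:
        return                                    # prune: cannot beat incumbent
    fractional = [j for j in range(4) if x[j].denominator != 1]
    if not fractional:
        best_value, best_x = bound, [int(v) for v in x]   # new incumbent
        return
    j = fractional[0]
    branch_and_bound({**fixed, j: 1})             # explore the x_j = 1 child first
    branch_and_bound({**fixed, j: 0})

branch_and_bound({})
print(best_value, best_x)
```

On this assumed instance the routine visits the same nodes as the tree: it finds
the incumbent with value 21, prunes the node with bound 18, and prunes the x3 = 0
sub-tree because its bound rounds down to 21.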
In (11.14), εi is the random error (residual) associated with the ith observation
and β̂1 , β̂2 are unbiased estimates of the (true) parameters of the linear function
(β1 , β2 ). Alternately, the relationship can be expressed as Yi = E [Y | xi ] + εi ,
where E [Y | xi ] = β1 + β2 xi is the conditional expectation of all the responses,
Yi , observed when the predictor variable takes a value xi . It is noteworthy that
capital letters indicate random variables and small letters indicate specific values
(instances). For example, suppose we are interested in computing the parameters
of a linear relationship between a family’s weekly income level and its weekly
expenditure. In this case, the weekly income level is the predictor (xi ) and the
weekly expense is the response (yi ). Figure 11.18a shows a sample of such data
collected, i.e., a sample of weekly expenses at various income levels. The scatterplot
in Fig. 11.18b shows the fitted OLS regression line.
In order to construct the unbiased estimates β̂1 , β̂2 , OLS involves minimizing
the sum of squared errors, i.e.,
$$ \min \sum_{i=1}^{n} \varepsilon_i^{2} = \sum_{i=1}^{n} \left( y_i - \hat{\beta}_1 - \hat{\beta}_2 x_i \right)^{2}. \qquad (11.15) $$
X (Income level)   80   100   120   140   160   180   200   220   240   260
Y (Expense)        55    65    79    80   102   110   120   135   137   150
                   60    70    84    93   107   115   136   137   145   152
                   65    74    90    95   110   120   140   140   155   175
                   70    80    94   103   116   130   144   152   165   178
                   75    85    98   108   118   135   145   157   175   180
                    -    88     -   113   125   140     -   160   189   185
                    -     -     -   115     -     -     -   162     -   191
Total             325   462   445   707   678   750   685  1043   966  1211
E[Y|X]             65    77    89   101   113   125   137   149   161   173
(a)
[Figure 11.18b: scatterplot titled “Expenditure vs Income”, with weekly income on the
horizontal axis and weekly expenditure on the vertical axis.]
(b)
Fig. 11.18 Fitting an OLS regression line. (a) Weekly expenditure at various income levels. (b)
Weekly expenditure as a function of weekly income
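For a given sample, notice that $\sum_{i=1}^{n} y_i$, $\sum_{i=1}^{n} x_i$, $\sum_{i=1}^{n} y_i x_i$,
$\sum_{i=1}^{n} x_i^2$, and $\sum_{i=1}^{n} y_i^2$ are
constants. Hence, (11.15) is a simple quadratic function of the parameters, i.e.,
\[ n\hat{\beta}_1^2 + 2\hat{\beta}_1\hat{\beta}_2 \left[ \sum_{i=1}^{n} x_i \right] - 2\hat{\beta}_1 \left[ \sum_{i=1}^{n} y_i \right] - 2\hat{\beta}_2 \left[ \sum_{i=1}^{n} y_i x_i \right] + \hat{\beta}_2^2 \left[ \sum_{i=1}^{n} x_i^2 \right] + \left[ \sum_{i=1}^{n} y_i^2 \right] . \]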
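Because (11.15) is a quadratic in the two parameters, setting its partial derivatives to zero gives two linear equations in the sums listed above. The following is a minimal sketch of that calculation, not a prescribed implementation. The function name `ols_fit` is ours, and for brevity it is applied to the ten conditional means E[Y|X] from Fig. 11.18a rather than to every observation.

```python
# Income levels and conditional mean expenditures E[Y|X] from Fig. 11.18a.
income = [80, 100, 120, 140, 160, 180, 200, 220, 240, 260]
mean_expense = [65, 77, 89, 101, 113, 125, 137, 149, 161, 173]


def ols_fit(x, y):
    """Return (b1, b2) that minimize sum((y_i - b1 - b2 * x_i) ** 2)."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    # Solution of the two first-order conditions (the normal equations).
    b2 = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    b1 = (sum_y - b2 * sum_x) / n
    return b1, b2


b1, b2 = ols_fit(income, mean_expense)
print(f"intercept = {b1:.4f}, slope = {b2:.4f}")
```

Since the conditional means in the table lie exactly on a straight line, the fit recovers that line; a sample of individual observations would give estimates that scatter around it.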
yi = βˆ1 + βˆ2 xi + εi
where Yi is the response variable and xi is the predictor variable. The method of
maximum likelihood estimation (MLE), like the OLS method, helps us estimate the
linear regression parameters $\hat{\beta}_1, \hat{\beta}_2$. In the MLE approach, we assume that the
sample collected is made of independent and identically distributed observations
(yi , xi ) and that the error terms follow a normal distribution with mean zero and
variance $\sigma^2$. This implies that Yi are normally distributed with mean $\beta_1 + \beta_2 x_i$ and
variance $\sigma^2$. Consequently, the joint probability density function of Y1 , . . . , Yn can
be written as
\[ f\left( Y_1, \ldots, Y_n \mid \beta_1 + \beta_2 x_i, \sigma^2 \right) . \]
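But given that the sample points are drawn independently, we express the joint
probability density function as a product of the individual density functions as
\[ f\left( Y_1, \ldots, Y_n \mid \beta_1 + \beta_2 x_i, \sigma^2 \right) = f\left( Y_1 \mid \beta_1 + \beta_2 x_i, \sigma^2 \right) f\left( Y_2 \mid \beta_1 + \beta_2 x_i, \sigma^2 \right) \cdots f\left( Y_n \mid \beta_1 + \beta_2 x_i, \sigma^2 \right) \]
where
\[ f(Y_i) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{ -\frac{1}{2} \frac{\left( Y_i - \beta_1 - \beta_2 x_i \right)^2}{\sigma^2} } . \]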
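The method of MLE computes $\hat{\beta}_1, \hat{\beta}_2$ such that the probability of observing
the given y = (y1 , . . . , yn ) is as high as possible. Notice that this is a
nonlinear optimization problem that maximizes the likelihood function (LF) over $\hat{\beta}_1$ and
$\hat{\beta}_2$. One natural way to solve this problem is to convert the likelihood function into its log
form, i.e.,
\[ \ln LF\left( \hat{\beta}_1, \hat{\beta}_2, \sigma^2 \right) = -n \ln \sigma - \frac{n}{2} \ln (2\pi) - \frac{1}{2} \sum_{i=1}^{n} \frac{\left( Y_i - \hat{\beta}_1 - \hat{\beta}_2 x_i \right)^2}{\sigma^2} , \]
\[ = -\frac{n}{2} \ln \sigma^2 - \frac{n}{2} \ln (2\pi) - \frac{1}{2} \sum_{i=1}^{n} \frac{\left( Y_i - \hat{\beta}_1 - \hat{\beta}_2 x_i \right)^2}{\sigma^2} . \]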
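For a fixed variance, the log-likelihood depends on the two regression parameters only through the sum of squared errors, so the values that maximize it are the OLS estimates, and the maximizing variance is the average squared residual. The sketch below illustrates this numerically under the normal-error assumption stated above. The sample is the first row of expenses in Fig. 11.18a, and the function names are ours.

```python
import math

# One (income, expense) sample: the first row of Y values in Fig. 11.18a.
income = [80, 100, 120, 140, 160, 180, 200, 220, 240, 260]
expense = [55, 65, 79, 80, 102, 110, 120, 135, 137, 150]


def log_likelihood(b1, b2, sigma2, x, y):
    """Normal log-likelihood ln LF(b1, b2, sigma2) for the linear model."""
    n = len(x)
    sse = sum((yi - b1 - b2 * xi) ** 2 for xi, yi in zip(x, y))
    return (-0.5 * n * math.log(sigma2)
            - 0.5 * n * math.log(2 * math.pi)
            - 0.5 * sse / sigma2)


def mle_fit(x, y):
    """Closed-form maximizers: OLS coefficients and sigma2 = SSE / n."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b2 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b1 = y_bar - b2 * x_bar
    sse = sum((yi - b1 - b2 * xi) ** 2 for xi, yi in zip(x, y))
    return b1, b2, sse / n


b1_hat, b2_hat, sigma2_hat = mle_fit(income, expense)
best = log_likelihood(b1_hat, b2_hat, sigma2_hat, income, expense)
print(f"b1 = {b1_hat:.3f}, b2 = {b2_hat:.3f}, sigma2 = {sigma2_hat:.3f}")
print(f"log-likelihood at the estimates = {best:.3f}")
```

Evaluating `log_likelihood` at any other parameter values, for example with the slope shifted by 0.01, gives a smaller number than `best`.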
Unlike the case discussed in Sect. 4.1, sometimes we encounter situations wherein
the response variables take binary outcomes. For example, consider a binary model
in which xi (predictor) is the price of a product and yi (response) is whether a
customer purchased the product. In this case, the response variable yi ∈ {0, 1}. Fitting
an OLS regression model, in this case, may not be appropriate because the fitted response
must be restricted to the interval [0, 1] and there exists no such restriction
in the standard linear regression model. Instead, we use a binary outcome model
that tries to estimate the conditional probability that yi = 1 as a function of the
independent variable, i.e., $\Pr\{Y_i = 1 \mid x_i\} = F(\hat{\beta}_1 + \hat{\beta}_2 x_i)$, where the function
F (·) represents the cumulative distribution function of a probability distribution. One
common model used is the logit model, where F (·) is the logistic distribution
function, i.e.,
\[ F\left( \hat{\beta}_1 + \hat{\beta}_2 x_i \right) = \frac{ e^{\hat{\beta}_1 + \hat{\beta}_2 x_i} }{ 1 + e^{\hat{\beta}_1 + \hat{\beta}_2 x_i} } . \]
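Assuming that the observations in the sample data are independent of each other,
the conditional likelihood of seeing the n outcomes in our sample data is given by
\[ \prod_{i=1}^{n} \Pr\{Y = y_i \mid x_i\} = \prod_{i=1}^{n} F\left( \hat{\beta}_1 + \hat{\beta}_2 x_i \right)^{y_i} \times \left( 1 - F\left( \hat{\beta}_1 + \hat{\beta}_2 x_i \right) \right)^{(1 - y_i)} . \]
Taking logarithms and using $1 - F(z) = 1/(1 + e^{z})$, the log-likelihood function is
\[ \ln LF\left( \hat{\beta}_1, \hat{\beta}_2 \right) = \sum_{i=1}^{n} y_i \left[ \left( \hat{\beta}_1 + \hat{\beta}_2 x_i \right) - \ln\left( 1 + e^{\hat{\beta}_1 + \hat{\beta}_2 x_i} \right) \right] - \sum_{i=1}^{n} (1 - y_i) \ln\left( 1 + e^{\hat{\beta}_1 + \hat{\beta}_2 x_i} \right) , \]
\[ = \sum_{i=1}^{n} y_i \left( \hat{\beta}_1 + \hat{\beta}_2 x_i \right) - \sum_{i=1}^{n} \ln\left( 1 + e^{\hat{\beta}_1 + \hat{\beta}_2 x_i} \right) . \]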
We then choose $\hat{\beta}_1$ and $\hat{\beta}_2$ to maximize the log-likelihood function,
$\ln LF(\hat{\beta}_1, \hat{\beta}_2)$. This is a nonlinear optimization problem that cannot be
solved analytically using standard differential calculus. We may have to resort to
approximately solving it numerically (e.g., see Newton’s method in Bazaraa et al., 2013).
It is noteworthy that this type of formulation can be used for making multi-class
predictions/classifications, where Y can take on more than two values (not just binary).
See Chaps. 15, 16, and 17 on machine learning techniques for discussion on
these types of problems. Several specialized algorithms have been developed
to solve this problem efficiently. Moreover, it is somewhat straightforward to
connect this to a machine learning problem! The multi-class prediction can be
seen to be equivalent to a single-layer neural network using a softmax loss function
(see Chaps. 16 and 17 on Supervised Learning and Deep Learning). The
connection between learning and optimization is an advanced topic well worth
pursuing.
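To make the numerical route concrete, here is a minimal sketch of Newton's method applied to the two-parameter logit log-likelihood. The price and purchase data are hypothetical, invented only for illustration, and the function names are ours. A production solver would add convergence checks and safeguards that are omitted here.

```python
import math

# Hypothetical data: product price and whether the customer bought (1) or not (0).
prices = [1, 2, 3, 4, 5, 6, 7, 8]
bought = [1, 1, 1, 0, 1, 0, 0, 0]


def logistic(z):
    """Logistic distribution function F(z) = e^z / (1 + e^z)."""
    return 1.0 / (1.0 + math.exp(-z))


def score(b1, b2, x, y):
    """Gradient of the log-likelihood with respect to (b1, b2)."""
    residuals = [yi - logistic(b1 + b2 * xi) for xi, yi in zip(x, y)]
    return sum(residuals), sum(r * xi for r, xi in zip(residuals, x))


def fit_logit(x, y, iterations=25):
    """Newton's method on the logit log-likelihood, started at (0, 0)."""
    b1, b2 = 0.0, 0.0
    for _ in range(iterations):
        g1, g2 = score(b1, b2, x, y)
        probs = [logistic(b1 + b2 * xi) for xi in x]
        weights = [p * (1.0 - p) for p in probs]
        # Negative Hessian: sum of p(1 - p) * [[1, x], [x, x^2]].
        h11 = sum(weights)
        h12 = sum(w * xi for w, xi in zip(weights, x))
        h22 = sum(w * xi * xi for w, xi in zip(weights, x))
        det = h11 * h22 - h12 * h12
        # Newton step: solve (negative Hessian) * step = gradient for the 2x2 case.
        b1 += (h22 * g1 - h12 * g2) / det
        b2 += (h11 * g2 - h12 * g1) / det
    return b1, b2


b1_hat, b2_hat = fit_logit(prices, bought)
print(f"b1 = {b1_hat:.3f}, b2 = {b2_hat:.3f}")
print(f"Pr(buy | price = 4) = {logistic(b1_hat + b2_hat * 4):.3f}")
```

At the returned estimates the gradient is numerically zero, which is the first-order condition for a maximum of the concave log-likelihood. The estimated slope is negative, so the purchase probability falls as the price rises.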
As described in this section, the techniques and solution methodologies for solving
nonlinear optimization problems can be varied. For a partial list of algorithmic
procedures to solve nonlinear problems the reader is referred to
https://neos-guide.org/algorithms (accessed on Jul 22, 2018).
5 Discussion
Appendix
There are a few online tutorials that explain how to enter an LP model in
Solver. Two commonly used websites are the Solver5 and Microsoft support6 pages.
This section describes the various fields in the LP reports generated by Microsoft
Solver and how to locate the information related to shadow prices, reduced costs,
and their ranges after the model has been solved. We use the prototype example
referred to earlier to describe these reports.
Figure 11.19 shows the answer report generated by Excel Solver for our prototype
problem. We describe the entries in this report.
Target Cell The initial value of the objective function (to be maximized or
minimized), and its final optimal objective value.
Adjustable Cells The initial and final values of the decision variables.
Constraints Maximum or minimum requirements that must be met, whether they
are met just barely (binding) or easily (not binding), and the values of the slacks
(excesses) left over. Binding constraints have zero slacks and nonbinding ones have
positive slacks.
Adjustable Cells
Cell    Name                                 Original Value   Final Value
$G$33   Objective: max 500 x1 + 450 x2 X1    0.00             6.43
$H$33   Objective: max 500 x1 + 450 x2 X2    0.00             4.29

Constraints
Cell    Name         Cell Value   Formula         Status        Slack
$I$36   Production   60.000       $I$36<=$K$36    Binding       0
$I$37   Storage      150.000      $I$37<=$K$37    Binding       0
$I$38   Demand       6.429        $I$38<=$K$38    Not Binding   1.571428571
Adjustable Cells
                                             Final   Reduced   Objective     Allowable   Allowable
Cell    Name                                 Value   Cost      Coefficient   Increase    Decrease
$G$33   Objective: max 500 x1 + 450 x2 X1    6.43    0.00      500           40          275
$H$33   Objective: max 500 x1 + 450 x2 X2    4.29    0.00      450           550         33.33

Constraints
                     Final     Shadow   Constraint   Allowable   Allowable
Cell    Name         Value     Price    R.H. Side    Increase    Decrease
$I$36   Production   60.000    78.571   60           5.5         22.5
$I$37   Storage      150.000   2.857    150          90          22
$I$38   Demand       6.429     0.000    8            1E+30       1.57
Figure 11.20 shows the sensitivity report generated by Excel Solver for our
prototype problem. Below we describe the entries in this report.
Adjustable Cells The decision variables, their cell addresses, names, and optimal
values.
Reduced Cost This relates to decision variables that are bounded, from below
(such as by zero in the nonnegativity requirement), or from above (such as by a
maximum number of units that can be produced or sold). Recall:
1. A variable’s reduced cost is the amount by which the optimal objective value
will change if that bound were relaxed or tightened.
2. If the optimal value of the decision variable is at its specified upper bound, the
reduced cost is the amount by which optimal objective value will improve (go up
in a maximization problem or go down in a minimization problem) if we relaxed
the upper bound by increasing it by one unit.
3. If the optimal value of the decision variable is at its lower bound, its reduced cost
is the amount by which the optimal objective value will be hurt (go down in a
maximization problem or go up in a minimization problem) if we tightened the
bound by increasing it by one unit.
Objective Coefficient The unit contribution of the decision variable to the objec-
tive function (unit profit or cost).
Allowable Increase and Decrease The amount by which the coefficient of the
decision variable in the objective function can change (increase or decrease) before
the optimal solution (the values of decision variables) changes. As long as an
objective coefficient changes within this range, the current optimal solution (i.e., the
values of decision variables) will remain optimal (although the optimal objective
value will change as the objective coefficient changes, even within the allowable range).
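The entries of the sensitivity report can be checked by hand for a two-variable problem. The sketch below does so under an assumption: the constraint rows 6 x1 + 5 x2 <= 60 (production) and 10 x1 + 20 x2 <= 150 (storage) are not printed in the reports, and are taken here because they reproduce the reported optimal solution and shadow prices. The helper name `solve_2x2` is ours.

```python
def solve_2x2(a11, a12, a21, a22, r1, r2):
    """Solve a11*u + a12*v = r1 and a21*u + a22*v = r2 by Cramer's rule."""
    det = a11 * a22 - a12 * a21
    return (r1 * a22 - a12 * r2) / det, (a11 * r2 - a21 * r1) / det


# Objective coefficients from the report: max 500 x1 + 450 x2.
c1, c2 = 500, 450

# Assumed constraint rows (not shown in the reports):
#   production: 6 x1 + 5 x2 <= 60     storage: 10 x1 + 20 x2 <= 150

# Optimal corner: both binding constraints hold with equality.
x1, x2 = solve_2x2(6, 5, 10, 20, 60, 150)

# Shadow prices: the dual equations of the basic variables x1 and x2
# (the demand constraint is not binding, so its shadow price is zero).
y_production, y_storage = solve_2x2(6, 10, 5, 20, c1, c2)

# The corner stays optimal while the ratio c1 / c2 lies between the
# constraint slopes 10/20 and 6/5, one coefficient changing at a time.
c1_range = (c2 * 10 / 20, c2 * 6 / 5)
c2_range = (c1 * 5 / 6, c1 * 20 / 10)

print(f"x1 = {x1:.2f}, x2 = {x2:.2f}")
print(f"shadow prices: production {y_production:.3f}, storage {y_storage:.3f}")
print(f"c1 may increase by {c1_range[1] - c1:.2f} or decrease by {c1 - c1_range[0]:.2f}")
print(f"c2 may increase by {c2_range[1] - c2:.2f} or decrease by {c2 - c2_range[0]:.2f}")
```

Under these assumed rows, the computed values agree with the Final Value, Shadow Price, and Allowable Increase and Decrease columns for the objective coefficients in the report.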
zero, then the LP is degenerate. One has to be careful while interpreting optimal
solutions for degenerate LPs. For example:
1. When a solution is degenerate, the reduced costs may not be unique. Addition-
ally, the objective function coefficients for the variable cells must change by at
least as much (and possibly more than) their respective reduced costs before the
optimal solution would change.
2. Shadow prices and their ranges can be interpreted in the usual way, but they are
not unique. Different shadow prices and ranges may apply to the problem (even
if the optimal solution is unique).
Exercises
Every employee works five consecutive days and then takes off two days, repeating
this pattern indefinitely. Our goal is to minimize the number of employees that staff
the outlet. Define your variables, constraints, and objective function clearly.
Develop a Solver model and solve for the optimal staffing plan.
Ex. 11.1.2 Managing a Portfolio
We are going to manage an investment portfolio over a 6-year time horizon. We
begin with |1,000,000, and at various times we can invest in one or more of the
following:
(a) Savings account X, annual yield 6%
(b) Security Y , 2-year maturity, total yield 14% if bought now, 12% thereafter
(c) Security Z, 3-year maturity, total yield 17%
(d) Security W , 4-year maturity, total yield 22%
To keep things simple we will assume that each security can be bought in any
denomination. We can make savings deposits or withdrawals anytime. We can buy
Security Y any year but year 3. We can buy Security Z anytime after the first year.
Security W , now available, is a one-time opportunity. Write down an LP model
to maximize the final investment yield. Assume all investments must mature on
or before year 6 and you cannot sell securities in between. Define your decision
variables and constraints clearly.
polyhedron $P = \{ x \in \mathbb{R}^n_+ \mid Ax \le b \}$ and define $S = \mathbb{Z}^n_+ \cap P$, where $\mathbb{Z}^n_+$ is the
n-dimensional set of nonnegative integers. Thus, $S = \{ x \in \mathbb{Z}^n_+ \mid Ax \le b \}$ and
conv(S) is the convex hull of S, i.e., the set of points that are convex combinations
of points in S. Note that conv(S) ⊆ P ; the formulation is “ideal” if conv(S) = P . An inequality
$\pi^T x \le \pi_0$ is called a valid inequality if it is satisfied by all points in S.
2. Solve using the branch-and-bound (B&B) algorithm. Draw the B&B tree, show
your branches, LP solutions, lower and upper bounds. You may simply branch in
sequence x1 followed by x2 and so on.
References
Bazaraa, M. S., Jarvis, J. J., & Sherali, H. D. (2011). Linear programming and network flows.
Hoboken, NJ: Wiley.
Bazaraa, M. S., Sherali, H. D., & Shetty, C. M. (2013). Nonlinear programming: Theory and
algorithms. Hoboken, NJ: Wiley.
Bertsekas, D. P. (1999). Nonlinear programming. Belmont, MA: Athena Scientific.
Bertsimas, D., & Tsitsiklis, J. N. (1997). Introduction to linear optimization (Vol. 6). Belmont,
MA: Athena Scientific.
Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University
Press.
Bradley, S. P., Hax, A. C., & Magnanti, T. L. (1977). Applied mathematical programming.
Chvátal, V. (1983). Linear programming. New York: WH Freeman.
Gujarati, D. N. (2009). Basic econometrics. New York: Tata McGraw-Hill Education.
Luenberger, D. G., & Ye, Y. (1984). Linear and nonlinear programming (Vol. 2). Berlin: Springer.
Nemhauser, G. L., & Wolsey, L. A. (1988). Interscience series in discrete mathematics and
optimization: Integer and combinatorial optimization. Hoboken, NJ: Wiley.
Schrijver, A. (1998). Theory of linear and integer programming. Chichester: Wiley.
Wagner, H. M. (1969). Principles of operations research: With applications to managerial
decisions. Upper Saddle River, NJ: Prentice-Hall.
Wolsey, L. A. (1998). Integer programming. New York: Wiley.
Chapter 12
Forecasting Analytics
Of course, there is no accurate forecast, but at times this shifts the focus for ... If
there is no perfect plan, is there such thing as a good enough plan? . . . 1
1 Introduction
K. I. Nikolopoulos
Bangor Business School, Bangor, Gwynedd, UK
e-mail: k.nikolopoulos@bangor.ac.uk
D. D. Thomakos
University of Peloponnese, Tripoli, Greece
Forecasting analytics is probably the most difficult part of the analytics trio:
descriptive, prescriptive, and predictive analytics. It is more challenging because it is
about the future: although everybody is right once the forecasts are set, only the very
few (and brave!) will be right when the future is realized and the forecast accuracy
is evaluated . . . That is the judgment day for any predictive analytics professional
(PAP).
Forecasting analytics is the key to an effective and efficient applied business and
industrial forecasting process. Applied . . . as the focus is primarily on evidence-based
practical tools and algorithms for business, industrial, and operational forecasting
methods and applications, rather than on problems from economics and
finance. The more advanced techniques required for the latter belong in
a more focused chapter on financial predictive analytics (FPA). The same holds
for the narrower focus of marketing analytics (MA).
Forecasting analytics is also the next big thing on the employment front, with
millions of jobs expected to be in demand in the next few years.2
Forecasting analytics is a crucial function in any twenty-first century company
and is the one that can truly give a competitive advantage to today's managers
and entrepreneurs, as a bad forecast can be translated into:
Either
. . . lost sales, thus poor service and unsatisfied customers!
Or
. . . products left on shelves, thus high inventory and logistics costs!
Wait a minute . . . this sounds like a lose-lose situation! If you do not get it
exactly right, you will lose money—one way or another. What’s more, as you
might have guessed, you will not ever get it exactly right! Even the most advanced
forecasting system, only by pure chance, will give you a perfect forecast . . .
Thus, the angle of this chapter, and our sincere advice to the reader would be to:
“ . . . make sure you do your best to get an as-accurate-forecast-as-possible,3 and learn to
live with the uncertainty that will inevitably come with this forecast . . . 4 ” (exactly as the
introductory quote wisely suggests).
2 Fisher, Anne (May 21, 2013), Big Data could generate millions of new jobs, http://fortune.com/
2013/05/21/big-data-could-generate-millions-of-new-jobs [Accessed on Oct 1, 2017].
3 To take into account all available Information that is relevant to the specific forecasting task—
The first sentence of the quote is the most important one: forecasting is more or
less about estimating in unknown situations; thus your only weapon is the past and
how much the latter resembles the former. This wiki-quote goes on to distinguish
nicely between forecasting and prediction. In this chapter, we will use them
interchangeably.7 The next sentence fully aligns with the beliefs of the authors
of this chapter: Forecasting has evolved into the practice of Demand Planning in
everyday business forecasting for manufacturing companies . . . sometimes referred
to as supply chain forecasting.
The world of everyday business forecasting comes with the assumption that
some kind of regularly observed quantitative information will be available for the
products under consideration. In other words, time-series data will be available.
A time series (Fig. 12.1) is just a series of observations over a long period of
time; those observations are usually taken at equally spaced intervals (weeks,
months, quarters, etc.). That is typically what data look like in business and operational
forecasting; in most cases, observations for more than 3 years per product are
available, and these are recorded quite frequently (at intervals of a month or less).
But this is not always the case as:
• You may have cross-sectional data, data referring to the same point of time but
for different product/services, etc.—for example, sales for ten different car makes
in a given day.
• You may have no data at all, so you end up relying entirely on your judgment to
make forecasts.
In this chapter, focus is basically given to time-series forecasting8 and how this
integrates efficiently with judgmental adjustments. These adjustments are driven
This thick line stands for the history of the specific product you are interested
in forecasting. As explained in the data collection chapter (Chap. 2), data-related
problems such as outliers, missing data, and sudden shifts need to be treated
before forecasting. Also, data transformations, such as taking roots, logarithms, and
differencing, might be required to conform to the requirements of the model.
The next logical step would be to project this thick-line into the future: forecast-
ing the available time series. Time-series forecasting is based on the assumption
that a particular variable will behave in the future in much the same way as it
behaved in the past.10 Thus, the dotted-line should be the “natural” extension of
the thick-black-line. Natural . . . in the sense that history repeats itself. This is the
basic assumption of statistical forecasting; thus Statistics—abbreviated Stats—is the
second fundamental part of the forecasting process.
We will call this a point-forecast. In most of the cases you will usually be
interested in many points of time in the future, so forecasts for the full forecasting
horizon as it is usually termed, and not just a single-point forecast, are of interest.
[Figure labels: HISTORY, FORECASTS; components 1 IS / ICT, 2 Stats, 3 Judgment]
In order to live with the aforementioned risk, we would like to have the
black-lines as well, as shown in Fig. 12.2. Those lines are the forecast/prediction
intervals—in this case symmetric over and under the point forecasts—and their very
reason for existence is to give a sense of the uncertainty around point forecasts. In
essence they tell you:
. . . if it’s not going to be the dotted-line, then with great confidence it would be something
from the lower black-line up to the upper black-line!
We would like to set these confidence levels around 95%, thus being 95% certain
that the future unknown demand will appear somewhere between those solid-lines.
But in real life this results in something that managers totally dislike: solid-lines
being far out from the dotted line . . . And as a result, managers go one step back
and require only the point forecasts to be reported to them. This is the reason that
most advanced forecasting software—FSS (stands for forecasting support systems)
usually do not report the prediction intervals at all.
Another critical part of the forecasting process, as presented in Fig. 12.2, is
Human Judgment! Humans don’t really like machines . . . They’re afraid of them!
They think machines will take their jobs and that eventually they will get fired! As a
result, they dislike ready-made solutions that do not require their intervention. They would
like to have some ownership of the produced forecast. So ... they Adjust!
11 The true birth of the Forecasting discipline dates back to late 1970s, early 1980s at the hands
of Spyros Makridakis (at INSEAD), Robert Fildes (then at Manchester Business School, now in
Lancaster University), and Scott Armstrong (Wharton). Benito Carbone also played a key role
in the early stages. The result was to create two journals International Journal of Forecasting—
IJF (Elsevier) and Journal of Forecasting—JoF (Wiley), a conference ISF (https://isf.forecasters.
org/, accessed on Feb 22, 2018), an Institute IIF (www.forecasters.org, accessed on Feb 22, 2018),
in a word ... a DISCIPLINE! Many have followed since then and are part of the forecasting
community now, including the authors of these texts, but history was written by those 3–4 men
and their close associates. More details can be found in the interview of Spyros for IJF: Fildes, R.
and Nikolopoulos, K. (2006) “Spyros Makridakis: An Interview with the International Journal of
Forecasting”. International Journal of Forecasting, 22(3): 625–636.
In essence, in statistics we try to find the model that best fits the data. And since
we expect history to repeat itself, we project it and we are happy. Thus, the statistical
forecasting recipe is: Find the best fit → Get the job done → Sleep tight!
However, in time-series forecasting, history very rarely repeats itself!
Forecasters instead focus on which model forecasts best rather than which model
fits best! Hold on, we have an oxymoron here? How can we know which model
forecasts best since we do not know the future?
To resolve this, we do our first forecasting trick: we hide a part of the series,
usually the most recent one. Hiding the most recent 20% of the series is usually
enough. Others suggest hiding as much as the forecasting horizon we are
interested in; thus, if we have to forecast 3 months ahead, we should hide the last
3 months of the available data. We call this the holdout data (or sample), and we will
use it to evaluate which model forecasts best. For example, we hide the last year of
our time series and use the previous years to forecast this hidden year with
a variety of models; the model that comes “closest” to the hidden values is the
model that . . . forecasts “best.” And this, of course, is not necessarily the one that
fits the whole available dataset best (the standard technique used in statistics).
Unfortunately, our approach is not bullet-proof either . . . as:
There is no guarantee that the model that forecasts best, will keep on forecasting best . . .
However, it still produces on average better forecasts than the model that fits
best! At least, that is what most empirical investigations suggest. The next section
discusses time-series techniques, process, and other applications.
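The holdout idea can be sketched in a few lines. The demand series and the two candidate models below (a naïve last-value forecast and a simple average) are hypothetical and chosen only to illustrate the mechanics. Accuracy is measured here by the mean absolute error, which is one of several reasonable choices.

```python
def mean_absolute_error(actual, forecast):
    """Average absolute difference between actual and forecast values."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def naive_forecast(history, horizon):
    """Repeat the last observed value over the whole horizon."""
    return [history[-1]] * horizon


def average_forecast(history, horizon):
    """Repeat the mean of the history over the whole horizon."""
    return [sum(history) / len(history)] * horizon


def select_by_holdout(series, models, holdout_size):
    """Fit on the visible part, score each model on the hidden part."""
    train, holdout = series[:-holdout_size], series[-holdout_size:]
    scores = {name: mean_absolute_error(holdout, model(train, holdout_size))
              for name, model in models.items()}
    return min(scores, key=scores.get), scores


# Hypothetical demand history; the last 20% (two periods) is hidden.
demand = [100, 104, 109, 115, 118, 124, 130, 133, 139, 145]
models = {"naive": naive_forecast, "average": average_forecast}
best_model, holdout_errors = select_by_holdout(demand, models, holdout_size=2)
print(best_model, holdout_errors)
```

On this trending series the naïve forecast is closer to the hidden values than the overall average, so it would be the model carried forward.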
Seasonal and Cyclical Adjustment: Seasonal fluctuations can occur and repeat more
or less regularly within a year, which makes the behavior of the data predictable.
Typical drivers of seasonal demand and supply are climate or
festivals that recur every year during a particular month. The most widely used
tool to test for and determine seasonality in a time series is a plot of the
autocorrelation function (ACF).
Analysis of the autocorrelation coefficient, or autocorrelation function (ACF),
which shows the relationship between current and lagged values of a time series,
is a way to decompose the data and investigate repeating patterns and the presence of
a periodic signal obscured by noise. The autocorrelation coefficient can be used
to detect the presence of stationarity, seasonality, trend, and random variability in
the data. Specific aspects of autocorrelated processes, such as unit roots, trend
stationarity, and autoregressive and moving average components, can be examined.
The autocorrelation coefficient (rk ) is computed as:
\[ r_k = \frac{ \sum_{i=k+1}^{n} \left( Y_i - \bar{Y} \right) \left( Y_{i-k} - \bar{Y} \right) }{ \sum_{i=1}^{n} \left( Y_i - \bar{Y} \right)^2 } , \]
where k is the time lag, n is the number of observations, Yi is the observed value in
period i, and $\bar{Y}$ is the mean of the observations. Values of rk close to zero indicate
no autocorrelation: the observations are not related to each
other at any lag k and the variability in the values is random with zero mean and
constant variance (Fig. 12.4a). If there is a trend, the rk value is high initially and then
drops off to zero (Fig. 12.4b). If there is a seasonal pattern in the data series,
high values of rk reappear in cycles, for example, every 4 or 12 lags for quarterly or
monthly series, respectively (Fig. 12.4c).
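The coefficient rk can be computed directly from its definition. The following sketch uses a hypothetical quarterly series with a repeating pattern to show how seasonality appears in the autocorrelations; the function name is ours.

```python
def autocorrelation(series, k):
    """Autocorrelation coefficient r_k of a series at lag k."""
    n = len(series)
    mean = sum(series) / n
    # Index i runs over positions k, ..., n - 1 (periods k + 1, ..., n).
    numerator = sum((series[i] - mean) * (series[i - k] - mean)
                    for i in range(k, n))
    denominator = sum((value - mean) ** 2 for value in series)
    return numerator / denominator


# Hypothetical quarterly series: the same four-quarter pattern for six years.
quarterly = [10, 20, 30, 40] * 6
for lag in range(1, 9):
    print(lag, round(autocorrelation(quarterly, lag), 3))
```

For this series the coefficient is strongly positive at lags 4 and 8 and negative at lag 2, which is the cyclical reappearance described above for seasonal data.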
The autocorrelation between two observations at prior time steps, that is,
correlations between observations at predetermined or specified time lags, in a
data series consists of direct correlations among themselves, as well as indirect
correlations with observation at intervening time steps, that is, correlations with
observations in between specified time lag. The ACF comprises both direct and
indirect correlations among observations and does not control for correlations
of a particular observation with observations at other than the specified lag. An
alternative to ACF is the partial autocorrelation function (PACF) in which indirect
correlations with values at shorter lags are excluded and only direct correlations
with its own lagged values are taken. (Under the assumption of stationarity, the jth
PACF value is obtained by regressing the present values against the past j values
and taking the coefficient of the jth value as the estimate of the PACF coefficient.)
The ACF and the PACF both play important roles in identifying the extent of lags in
autoregressive models such as the ARIMA model discussed in a subsequent section.
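A short sketch of how partial autocorrelations can be obtained is given below. Note that it does not run the lagged regressions described above. Instead it uses the Durbin-Levinson recursion on the autocorrelations, a standard alternative under stationarity whose results are usually close to, but not identical with, the regression estimates in finite samples. The function name and the example autocorrelations are ours.

```python
def pacf_from_acf(r, max_lag):
    """Partial autocorrelations for lags 1..max_lag from r[0..max_lag].

    Uses the Durbin-Levinson recursion; r[0] must equal 1.
    """
    pacf = []
    previous = []  # coefficients phi_{k-1,1}, ..., phi_{k-1,k-1}
    for k in range(1, max_lag + 1):
        numerator = r[k] - sum(previous[j] * r[k - 1 - j] for j in range(k - 1))
        denominator = 1.0 - sum(previous[j] * r[j + 1] for j in range(k - 1))
        phi_kk = numerator / denominator
        # Update the lower-order coefficients for the next step.
        current = [previous[j] - phi_kk * previous[k - 2 - j]
                   for j in range(k - 1)]
        current.append(phi_kk)
        pacf.append(phi_kk)
        previous = current
    return pacf


# Autocorrelations that decay geometrically, as for a first-order autoregression.
acf_values = [1.0, 0.5, 0.25, 0.125]
print(pacf_from_acf(acf_values, 3))
```

For geometrically decaying autocorrelations the partial autocorrelation is nonzero only at lag 1. This cut-off is the signature used to identify the order of an autoregressive model.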
Cyclical fluctuations and data behavior indicate regular changes over a period of
more than 1 year, at least 2 years, and can be analyzed and forecasted. The classical
decomposition method provides ways to separate cyclical movements in the data.
[Fig. 12.4: autocorrelation coefficients (vertical axis, −1.00 to 1.00) at lags 1 to 23 for
(a) a random series, (b) a series with trend, and (c) a series with a seasonal pattern]
Time-series data may consist of seasonal fluctuations, a trend, cyclical change, and
irregular components. A simple divide and conquer approach could be to remove
the seasonal and trend components by directly estimating them with
any number of simple techniques and then to forecast the remainder using smoothing
techniques. After obtaining such a forecast, the trend and seasonal components can be
added back. Below, we describe smoothing methods (different from the divide and conquer
approach) for handling all three types of series: those without trend and seasonality,
those with trend but without seasonality, and those with both.
A naïve method assumes no seasonality and no trend-cycle in the data and simply
sets the latest available actual observation to be the point forecast for periods in
the future. Sometimes seasonally adjusted data is used and the forecasts are
re-seasonalized. The naïve model is a kind of random walk model. The naïve forecast
(NF) is considered a simple benchmark against which the more advanced results may be
compared. In some ways this is reasonable: the naïve method measures the volatility
inherent in the data and in many systems nothing works better than what happened
yesterday.
A simple average is obviously easy to compute but misses trends and recent changes
in the series. Moving averages are a simple way of smoothing out the seasonality
and noise in a series to reveal the underlying signal of trend used for forecasting. In
the simplest version, the forecast for the future periods is set equal to the average
of the observations of the past k periods. One may wish to optimize on the value of
k. Variants of the simple moving average are weighted and exponentially weighted
moving averages. In weighted moving average, different weights are assigned to
various point observations within a seasonal period to be used for averaging, while
in exponentially weighting, higher weight is assigned for the latest point observation
of a season and lesser weights are assigned in a continuously decreasing manner to
the earlier point observations.
One will notice that there will be no forecast for the first k periods unless fewer
periods are used to produce the initial forecast. Also, the prediction after the last
period of data will be the same for every period thereafter.
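As a small illustration, the simple and weighted moving average forecasts can be written as follows. The function names, the example series, and the weights are ours and purely illustrative.

```python
def moving_average_forecast(series, k):
    """Forecast for all future periods: mean of the last k observations."""
    if not 1 <= k <= len(series):
        raise ValueError("k must be between 1 and the length of the series")
    return sum(series[-k:]) / k


def weighted_moving_average_forecast(series, weights):
    """Weighted mean of the last len(weights) observations.

    weights[-1] applies to the most recent observation.
    """
    k = len(weights)
    if not 1 <= k <= len(series):
        raise ValueError("need between 1 and len(series) weights")
    window = series[-k:]
    return sum(w * y for w, y in zip(weights, window)) / sum(weights)


sales = [10, 12, 14, 16]
print(moving_average_forecast(sales, 2))
print(weighted_moving_average_forecast(sales, [1, 3]))
```

Giving the latest observation three times the weight of the one before it pulls the forecast toward the most recent value, which is the same idea that exponential weighting applies over the whole history.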
In this method one assumes the absence of trend and seasonality in the data.
Brown (1956) is credited with the development of the single exponential smoothing
methodology.
The following formula is used to forecast using the SES method:
Ft+1 = α ∗ Yt + (1 − α) ∗ Ft ;
where Yt is the actual observation in period t and Ft is the forecast for period t,
made in period (t − 1).
Also, et = Yt − Ft is the error between the observation and the forecast value.
Shifting the time index back by one period, one may also write:
Ft = α ∗ Yt−1 + (1 − α) ∗ Ft−1
The best α can be found using an optimization approach or simply by trial and
error. The forecast can be started in many ways. A popular method is to use the first
actual value as the first forecast (F1 = Y1 ) or set the average of the first few values
as the value of the first forecast.
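The SES recursion is short enough to state in code. This sketch starts the forecast with F1 = Y1, one of the initializations mentioned above; the function name and the example numbers are ours.

```python
def ses_forecasts(series, alpha):
    """Return [F_1, ..., F_{n+1}] for single exponential smoothing.

    F_1 is set to the first observation Y_1.
    """
    forecasts = [series[0]]
    for observation in series:
        # F_{t+1} = alpha * Y_t + (1 - alpha) * F_t
        forecasts.append(alpha * observation + (1 - alpha) * forecasts[-1])
    return forecasts


print(ses_forecasts([10, 20, 30], alpha=0.5))
print(ses_forecasts([10, 20, 30], alpha=1.0))
```

With α = 1 every forecast simply equals the previous observation, which is the naïve forecast. Smaller values of α react more slowly to recent changes.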
In sheet “SES” of spreadsheet “FA-Excel Template.xlsx” we have provided
sample data (which can be changed) and the value of the smoothing constant, α, that
can be changed. As α changes from zero to 1, the forecast will be seen to follow the
most recent value more closely. In the Appendix in the section “SES Method” we
provide the R command for SES. The data is shown in Table 12.6 and the output in
Table 12.10.
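In the adaptive-response-rate single exponential smoothing (ARRSES), α can be
modified as changes occur:
\[ F_{t+1} = \alpha_t Y_t + (1 - \alpha_t) F_t \]
\[ \alpha_{t+1} = \left| \frac{A_t}{M_t} \right| \]
\[ A_t = \beta e_t + (1 - \beta) A_{t-1} \]
\[ M_t = \beta \left| e_t \right| + (1 - \beta) M_{t-1} \]
\[ e_t = Y_t - F_t . \]
Here At is the smoothed error, Mt is the smoothed absolute error, and β is a
smoothing constant between 0 and 1.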
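A minimal sketch of ARRSES follows. It assumes the usual definition of Mt as the smoothed absolute error, Mt = β|et| + (1 − β)Mt−1, together with starting values that are our own choices: F1 = Y1, A0 = M0 = 0, and a user-supplied initial α. The demand figures are hypothetical.

```python
def arrses_forecasts(series, beta=0.2, initial_alpha=0.2):
    """Adaptive-response-rate SES.

    Returns (forecasts, alphas): forecasts holds F_1..F_{n+1} and
    alphas holds the alpha_t used in each period t = 1..n.
    """
    forecast = series[0]          # assumed start: F_1 = Y_1
    alpha = initial_alpha         # assumed start for alpha_1
    smoothed_error = 0.0          # A_0
    smoothed_abs_error = 0.0      # M_0
    forecasts, alphas = [], []
    for observation in series:
        forecasts.append(forecast)
        alphas.append(alpha)
        error = observation - forecast
        next_forecast = alpha * observation + (1 - alpha) * forecast
        smoothed_error = beta * error + (1 - beta) * smoothed_error
        smoothed_abs_error = beta * abs(error) + (1 - beta) * smoothed_abs_error
        if smoothed_abs_error > 0:
            # alpha_{t+1} = |A_t / M_t|; keep the old alpha if M_t is zero.
            alpha = abs(smoothed_error / smoothed_abs_error)
        forecast = next_forecast
    forecasts.append(forecast)
    return forecasts, alphas


# Hypothetical demand with a jump in level after the third period.
demand = [100, 102, 101, 110, 125, 130, 128, 135]
arrses_f, arrses_alphas = arrses_forecasts(demand)
print([round(a, 3) for a in arrses_alphas])
```

Because |At| can never exceed Mt, the adapted α always stays between 0 and 1. It moves toward 1 when errors keep the same sign, as they do after a shift in level.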
Brown’s SES methodology (1956) was extended by Holt (1957), who added a
parameter for smoothing the short-term trend. The current value is called the level
of the series, or Lt . The change in levels (Lt − Lt−1 ) is used to determine the trend
during the period t. Then, the trend is smoothed using its previous value,
that is, Tt = (1 − β)Tt−1 + β(Lt − Lt−1 ). This method is also called Double
Exponential Smoothing (DES). The formulae are:
Lt = αYt + (1 − α)(Lt−1 + Tt−1 )
Tt = β(Lt − Lt−1 ) + (1 − β)Tt−1
Ft+m = Lt + mTt , m = 1, 2, . . .
In order to start the forecast, one may set L1 = Y1 and the slope can be obtained
by regressing initial values of the series against time. Search methods can be used to
select the “optimal” values of the two smoothing constants, α and β. (The word
optimal is in quotes because the criterion for optimization could be minimizing
different types of errors, including errors one step or two steps ahead.) An example
is shown in sheet “Holt” of spreadsheet “FA-Excel Template.xlsx”. Appendix 1
section “Holt method” lists the R command. The data is shown in Table 12.6 and
the output in Table 12.10.
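The following sketch implements Holt's method with the level equation in its usual form, Lt = αYt + (1 − α)(Lt−1 + Tt−1), which we assume here. For simplicity the initial trend defaults to Y2 − Y1 instead of the regression-based slope suggested above; that default, the function name, and the example series are our own choices.

```python
def holt_forecast(series, alpha, beta, horizon, initial_trend=None):
    """Holt's linear-trend smoothing; returns F_{n+1}, ..., F_{n+horizon}."""
    level = series[0]                                  # L_1 = Y_1
    # Assumed default start for the trend: the first difference Y_2 - Y_1.
    trend = series[1] - series[0] if initial_trend is None else initial_trend
    for observation in series[1:]:
        previous_level = level
        level = alpha * observation + (1 - alpha) * (previous_level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return [level + m * trend for m in range(1, horizon + 1)]


print(holt_forecast([2, 4, 6, 8, 10], alpha=0.3, beta=0.1, horizon=3))
```

On a perfectly linear series the method reproduces the line for any choice of α and β, so the three forecasts above continue the sequence with 12, 14, and 16 (up to rounding).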
Holt–Winters’ method is a smoothing method that takes both trend and seasonality
into account. It is also known as Error, Trend, and Seasonality (ETS) or triple exponential
smoothing, because three components of the data (viz., level, trend, and seasonality)
are smoothed to arrive at forecast values. It is a variant of Holt’s method of
exponential smoothing in which a seasonality index is added alongside the level
and trend to arrive at the forecast. In the multiplicative method:
\[ L_t = \alpha \frac{Y_t}{S_{t-s}} + (1 - \alpha) \left( L_{t-1} + T_{t-1} \right) \]
\[ T_t = \beta \left( L_t - L_{t-1} \right) + (1 - \beta) T_{t-1} \]
\[ S_t = \gamma \frac{Y_t}{L_t} + (1 - \gamma) S_{t-s} \]
\[ F_{t+m} = \left( L_t + T_t m \right) S_{t-s+m} , \]
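where St denotes the seasonal component, s is the length of a season, and γ is the
seasonal smoothing factor. Note that after each step we need to renormalize the
seasonal factors to add up to s, the number of periods in a season (“Periods in Season”).
The initial values for Ls , bs (the initial trend), and S1 , . . . , Ss can be calculated as:
\[ L_s = \frac{1}{s} \left( Y_1 + Y_2 + \cdots + Y_s \right) \]
\[ b_s = \frac{1}{s} \left[ \frac{Y_{s+1} - Y_1}{s} + \frac{Y_{s+2} - Y_2}{s} + \cdots + \frac{Y_{s+s} - Y_s}{s} \right] \]
\[ S_1 = \frac{Y_1}{L_s} , \quad S_2 = \frac{Y_2}{L_s} , \quad \ldots , \quad S_s = \frac{Y_s}{L_s} . \]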
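A compact sketch of the multiplicative method, using the initial values just described, might look as follows. It is illustrative only: the renormalization of the seasonal factors is omitted for brevity, the series is hypothetical, and the function name is ours.

```python
def holt_winters_multiplicative(series, season_length, alpha, beta, gamma, horizon):
    """Multiplicative Holt-Winters; returns forecasts for the next `horizon` periods.

    Seasonal factors are not renormalized in this sketch.
    """
    s = season_length
    if len(series) < 2 * s:
        raise ValueError("need at least two full seasons of data")
    # Initial level, trend, and seasonal factors from the first two seasons.
    level = sum(series[:s]) / s
    trend = sum((series[s + i] - series[i]) / s for i in range(s)) / s
    seasonals = [series[i] / level for i in range(s)]
    for t in range(s, len(series)):
        observation = series[t]
        seasonal = seasonals[t % s]            # S_{t-s} for this position
        previous_level = level
        level = alpha * observation / seasonal + (1 - alpha) * (previous_level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
        seasonals[t % s] = gamma * observation / level + (1 - gamma) * seasonal
    n = len(series)
    # Each forecast uses the latest seasonal factor for its position in the season.
    return [(level + m * trend) * seasonals[(n + m - 1) % s]
            for m in range(1, horizon + 1)]


# Hypothetical quarterly series with a stable seasonal pattern and no trend.
quarterly_sales = [8, 12, 16, 4] * 3
hw_forecasts = holt_winters_multiplicative(quarterly_sales, 4, 0.3, 0.1, 0.2, horizon=4)
print([round(f, 2) for f in hw_forecasts])
```

For this trend-free series the level stays at 10 and the seasonal factors stay at 0.8, 1.2, 1.6, and 0.4, so the forecasts repeat the quarterly pattern.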
St = γ (Yt − Lt ) + (1 − γ ) St−s
Ft+m = Lt + Tt m + St−s+m .
The initial values for level and trend can be chosen like in the multiplicative
method. The seasonality values can be estimated as below to start the forecast:
S1 = Y1 − L1 , S2 = Y2 − L2 , . . . , Ss = Ys − Ls .
The data is shown in Table 12.6 and the forecast output (produced by R) is shown
in Table 12.10 for both methods. The R command is listed in the Appendix in the
section “Holt–Winters Method.”
When the trend in the observations is nonlinear, the damped method of
exponential smoothing can be used. It is a variant of Holt’s method in which only
a fraction of the trend is added to $L_t$ for each period ahead to
arrive at $F_{t+m}$:
$$F_{t+m} = L_t + (\phi + \phi^2 + \cdots + \phi^m) T_t,$$
where $\phi$ is the damping parameter for the trend coefficient $T_t$. The forecast can be
started just as in Holt’s method for FIT. Usually, the damping parameter is set to
be greater than 0.8 but less than 1.
The data is given in Table 12.7 and the output in Table 12.11. The same example
is given in sheet “Damped Holt” of spreadsheet “Forecasting Analytics-Excel
Template”. The R command is listed in the Appendix in “Damped Holt Method”
section.
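As a small illustration of the damped forecast equation, the sketch below evaluates F(t+m) = L(t) + (phi + phi^2 + ... + phi^m) T(t) in base R. The function name `damped_forecast` and the level, trend, and phi values are hypothetical; they are not taken from Table 12.7 or from the chapter's R script.

```r
# Damped-trend forecast m steps ahead from a given level and trend.
# (Hypothetical helper for illustration only.)
damped_forecast <- function(level, trend, phi, m) {
  # seq_len(m) is empty for m = 0, so the forecast is then just the level.
  level + sum(phi^seq_len(m)) * trend
}

# Assumed values: level 850, trend 2 per period, damping parameter 0.9.
horizons <- 1:6
damped <- sapply(horizons, function(m) damped_forecast(850, 2, 0.9, m))

# With phi = 1 the method reduces to Holt's linear trend.
linear <- sapply(horizons, function(m) damped_forecast(850, 2, 1, m))
```

Comparing `damped` with `linear` shows the effect of the damping parameter: the increments shrink by a factor of phi at each step, so the forecast flattens out instead of growing without bound.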
12 Forecasting Analytics 395
[Fig. 12.5: ACF plot. Vertical axis: auto-correlation coefficients (−0.2 to 0.8); horizontal axis: time lags 1 to 14.]
Box and Jenkins developed the ARMA model in 1970. The autoregressive (AR) part
of the ARMA model can be written as $y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + e_t$,
where $e_t$ is white noise and p lagged values are used. This is a
multiple regression with lagged values of $y_t$ as predictors. Because the lagged
explanatory variables are stochastic and can be correlated with the error term,
the estimators can be biased, which reduces the confidence one can place in the
estimates and the resulting forecasts (please see Chap. 7 on
Regression for details). The moving average (MA) part of the model is
$y_t = c + e_t + \theta_1 e_{t-1} + \theta_2 e_{t-2} + \cdots + \theta_q e_{t-q}$, which is a multiple regression
with q past errors as predictors. (A common confusion is with the MA methods
discussed earlier. There the data itself was averaged. Here, the errors are averaged.)
The ACF and PACF can be used to identify the lag structure of an ARMA model.
The ACF is used to identify the order of the MA part and the PACF the order of the
AR part. For example, Figs. 12.5 and 12.6 show the ACF and PACF plots of the
differenced data. Both the ACF and the PACF are decaying; there is a drop-off after
time lag 6 in the ACF (Fig. 12.5) and a spike at time lag 1 in the PACF (Fig. 12.6).
Therefore, the appropriate lag structure could be ARMA (1, 6).
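The identification step can be tried out with the `acf` and `pacf` functions of base R's stats package. The sketch below uses a simulated random walk as a stand-in for real data (an assumption, not the series behind Figs. 12.5 and 12.6) and flags lags whose coefficients fall outside the usual approximate 95% band of ±1.96/sqrt(n).

```r
set.seed(1)

# Hypothetical nonstationary series (a random walk), then its first difference.
y <- cumsum(rnorm(100))
dy <- diff(y)

# Sample autocorrelations (lags 0..14) and partial autocorrelations (lags 1..14).
acf_vals <- as.numeric(acf(dy, lag.max = 14, plot = FALSE)$acf)
pacf_vals <- as.numeric(pacf(dy, lag.max = 14, plot = FALSE)$acf)

# Approximate 95% significance bound for a white-noise series.
bound <- 1.96 / sqrt(length(dy))

# Lags with significant coefficients: candidates for q (ACF) and p (PACF).
ma_candidates <- which(abs(acf_vals[-1]) > bound)  # drop lag 0 before checking
ar_candidates <- which(abs(pacf_vals) > bound)
```

Calling `acf(dy)` and `pacf(dy)` without `plot = FALSE` draws the correlograms, which is how plots such as those in the two figures are usually read.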
The ARIMA model: ARIMA forecasting is used when the conditions of no
autocorrelation and homoscedasticity are violated. The data series must then be
transformed to stabilize both its variance and its mean. A data series
is said to be stationary when its mean and variance are constant over time.
ARIMA was developed to handle nonstationary data by differencing the series d times. Later,
Engle (1982) introduced autoregressive conditional heteroscedastic (ARCH) models,
which “describe the dynamic changes in conditional variance as a deterministic
function of past values.” When “additional dependencies are permitted on lags of
the conditional variance,” the model is called a generalized ARCH (GARCH) model;
it shares many properties of ARMA (Bollerslev et al. 1994; Taylor 1997).
[Fig. 12.6: PACF plot. Vertical axis: partial auto-correlation coefficients (−0.4 to 1); horizontal axis: time lags 1 to 14.]
The data series is plotted against time to identify nonstationarity, that is, changing
means and variances over time. For a nonstationary series, the value of the
autocorrelation coefficient, $r_1$, is often large and positive and the autocorrelation function
(ACF) decreases slowly, while it drops to zero relatively quickly for stationary
data. To stabilize a mean that varies because of seasonality and trend, the data are
differenced, while AR and MA processes are used to incorporate autocorrelation
in lagged values of the time series and linear combinations of current and past
error terms. Combining autoregressive
and moving average models, the ARIMA (p, d, q) model can be written as:
$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + e_t + \theta_1 e_{t-1} + \theta_2 e_{t-2} + \cdots + \theta_q e_{t-q},$$
where $y_t$ denotes the series after differencing d times, AR: p = order of the
autoregressive part, I: d = degree of first differencing
involved, and MA: q = order of the moving average part. While it appears that one
has to search over a large number of values, in practice just the values 0, 1, and 2 for
p, d, and q suffice to generate a large number of models.
Maximum Likelihood Estimation (MLE) of ARIMA Model: Having specified
the model order, after checking for stationarity, the ARIMA parameters are
estimated using the MLE method together with a nonlinear numerical optimization
technique. To get a good model, one can minimize AIC = −2 log(L) + 2(p + q + k + 1) or
BIC = AIC + [log(T) − 2](p + q + k + 1), where L is the likelihood of the data, T is the
number of observations, k = 1 if the constant c ≠ 0, and k = 0 if c = 0. An approximate
estimate of −2 log(L) is given by n(1 + log(2π)) + n log(σ²), where n is the number
of data points and σ² is the variance of the residuals.
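A hedged sketch of how such a search might look with the `arima` function from base R's stats package follows. The data are simulated (an assumption; this is not the sofa series), the grid is limited to p, q in {0, 1, 2} with d = 1 as suggested above, and `tryCatch` guards against candidate models that fail to fit.

```r
set.seed(42)

# Simulated ARIMA(1,1,1) series standing in for real data (assumption).
y <- arima.sim(model = list(order = c(1, 1, 1), ar = 0.6, ma = 0.3), n = 120)

# Check the AIC formula for one model. With d = 1, arima() fits no constant,
# so k = 0 and AIC = -2 log(L) + 2 (p + q + k + 1).
fit11 <- arima(y, order = c(1, 1, 1), method = "ML")
aic_manual <- -2 * fit11$loglik + 2 * (1 + 1 + 0 + 1)

# Search p, q in 0..2 with d = 1 and keep the AIC of each candidate.
grid <- expand.grid(p = 0:2, q = 0:2)
grid$aic <- NA_real_
for (i in seq_len(nrow(grid))) {
  fit <- tryCatch(
    arima(y, order = c(grid$p[i], 1, grid$q[i]), method = "ML"),
    error = function(e) NULL  # skip orders that cannot be estimated
  )
  if (!is.null(fit)) grid$aic[i] <- fit$aic
}

# Candidate with the smallest AIC.
best <- grid[which.min(grid$aic), ]
```

The selected order need not coincide with the order used to simulate the data; AIC rewards fit and penalizes the number of parameters, and on short series several candidates are often close.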
The R command is given in the Appendix in “ARIMA method” section. The
data and summary output of R on an example are given in Table 12.1. The complete
output is in Table 12.13. The same data can be found in sheet “Data - ARIMA” in
csv format.
EXAMPLE: The monthly demand for sofas (in thousands) at a company for the last
50 months is given below in Table 12.1. The problem is to forecast the company’s
sofa demand for the next 3 months using the ARIMA model.
Solution:
Assume that we want to fit the ARIMA model (2, 1, 2) and that the data is
named ARCV. The R command is: fitted <- arima(ARCV, order = c(2, 1, 2)).
Here, fitted is the object in which the output will be placed.
The ARIMA parameters of order p = 2, d = 1, q = 2 are estimated using MLE
(maximum likelihood estimation) methods and automated nonlinear numerical
optimization techniques. The coefficients are obtained by calling fitted. The output
is given in Table 12.2.
The forecast equation is (to be written by the user): Yt = 0.929*Yt-1 -
0.256*Yt-2 - 1.993*et-1 + 0.999*et-2 + Error. The example shows that, after
estimation, the fitted coefficients have to be written in forecast-equation form to
predict future values. The forecast values can be obtained by using the command
forecast(fitted, h = 3) from the forecast package.
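For readers without the forecast package, the same fit-then-forecast workflow can be sketched with base R alone, using `predict` on the fitted object. The series below is simulated and the order (1, 1, 1) is chosen only so that the example runs reliably; both are assumptions, and the sofa data of Table 12.1 with order (2, 1, 2) can be substituted.

```r
set.seed(11)

# Simulated stand-in for a 50-month demand series (assumption).
y <- arima.sim(model = list(order = c(1, 1, 1), ar = 0.5, ma = 0.4), n = 50)

# Fit the model by maximum likelihood.
fit <- arima(y, order = c(1, 1, 1), method = "ML")

# Estimated AR and MA coefficients (the inputs to the forecast equation).
coefs <- coef(fit)

# Point forecasts and standard errors for the next 3 periods.
fc <- predict(fit, n.ahead = 3)

# Approximate 95% prediction intervals around the point forecasts.
lower <- fc$pred - 1.96 * fc$se
upper <- fc$pred + 1.96 * fc$se
```

`predict` returns the point forecasts in `fc$pred` and their standard errors in `fc$se`; the interval construction assumes approximately normal forecast errors.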
[Fig. 12.7: Intermittent demand. Vertical axis: numbers (0 to 8); horizontal axis: year (1 to 9).]
The SES method assumes a constant probability for the occurrence of nonzero
values, an assumption that is often violated for count data or intermittent series (Lindsey
and Pavur 2008). It is found that around 60% of the stock-keeping units in industrial
settings can be characterized as intermittent (Johnston et al. 2003). Intermittent
demand is characterized by infrequent demand arrivals and variable demand sizes
when demand occurs. As Fig. 12.7 shows, there are “periods with demand followed
by periods of no demand at all, and on top of this even the demand volume (when
realized) comes with significant variation. There are two things to forecast: when the
next demand period is going to be realized? And, whenever demand is realized, what
will be the volume of this demand?” The basic technique is to combine different
time blocks, and different methods have been proposed for doing so. The SES method
performs poorly in cases of stochastic intermittent demand.
Croston (1972) developed a methodology for forecasting such cases and suggested
decomposing an intermittent series into the nonzero observations and the time intervals
between successive nonzero values. The two series, namely, the quantities and the
intervals, are extrapolated separately. Both series are updated only after a nonzero
value occurs in the quantity series.
The forecast of demand per period is the ratio of the two extrapolations:
$$F_{t+1} = \frac{\hat{y}_{t+1}}{\hat{\tau}_{t+1}},$$
where $\hat{y}_{t+1}$ and $\hat{\tau}_{t+1}$ are the forecasts of the demand size and the interval. Both are
updated at each time t for which $y_t \neq 0$. An example is provided in sheet “Croston
and SBA” in spreadsheet “Forecasting Analytics-Excel Template”. The R command
is given in the Appendix in the “Croston and SBA method” section. The data is in
Table 12.9 and output is in Table 12.14 in the Appendix.
Syntetos and Boylan (2001) found that Croston’s methodology produces upward-biased
forecasts on stochastic intermittent demand. Subsequently, they proposed an
improved version, known as SBA, in which the final forecasts are multiplied by a
debiasing factor derived from the value of the smoothing parameter of the intervals,
β (Syntetos and Boylan 2005):
$$F_{t+1} = \left(1 - \frac{\beta}{2}\right) \frac{\hat{y}_{t+1}}{\hat{\tau}_{t+1}}.$$
SBA works well for intermittent demand but is biased for non-intermittent
demand. Syntetos and Boylan (2001) avoided this problem by using the forecast:
$$F_{t+1} = \left(1 - \frac{\beta}{2}\right) \frac{\hat{y}_{t+1}}{\hat{\tau}_{t+1} - \frac{\beta}{2}}.$$
This removes the bias but it increases the variance of the forecast. Other variants
include that of Leven and Segerstedt (2004).
None of these variants handles obsolescence well. When obsolescence occurs,
these methods continue to forecast a fixed nonzero demand forever. An example
of SBA is provided in sheet “Croston and SBA” in spreadsheet “Forecasting
Analytics – Excel Template”. The R command is given in the Appendix in the
“Croston and SBA Method” section. The data is shown in Table 12.9 and output
is in Table 12.14 in the Appendix.
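The Croston and SBA forecasts described above can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the chapter's R script: the function name `croston_sba` is hypothetical, simple exponential smoothing is used for both the demand sizes (constant alpha) and the intervals (constant beta), both are initialized from the first nonzero demand, and the short demand vector is made up.

```r
# Croston and SBA forecasts of demand per period (hypothetical helper).
croston_sba <- function(y, alpha = 0.1, beta = 0.1) {
  nz <- which(y != 0)
  if (length(nz) == 0) stop("series has no nonzero demand")

  size <- y[nz[1]]      # smoothed demand size, started at first nonzero demand
  interval <- nz[1]     # smoothed interval, started at the first arrival time
  last <- nz[1]

  # Update both series only in periods where demand is nonzero.
  for (t in nz[-1]) {
    size <- alpha * y[t] + (1 - alpha) * size
    interval <- beta * (t - last) + (1 - beta) * interval
    last <- t
  }

  croston <- size / interval
  c(croston = croston, sba = (1 - beta / 2) * croston)
}

# Made-up intermittent series: demand of 4 in period 2 and 6 in period 5.
demand <- c(0, 4, 0, 0, 6)
result <- croston_sba(demand, alpha = 0.5, beta = 0.5)
```

For this toy series the smoothed size is 5 and the smoothed interval is 2.5, so the Croston forecast is 2 units per period and the SBA forecast is 0.75 times that. Because nothing is updated during runs of zeros, the forecast stays fixed after the last demand, which is the obsolescence weakness noted above.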
Recent developments in the area of forecasting intermittent demand include
the work of Babai et al. (2012), Kourentzes (2014), Kourentzes et al. (2014),
Nikolopoulos et al. (2011a, b), Prestwich et al. (2014), Rostami-Tabar et al. (2013),
Spithourakis et al. (2011), and Teunter et al. (2011).
Often, the forecasting task is to predict demand over a fixed lead time. In this case
bootstrapping might be used. Bootstrapping (Efron 1979) is a statistical method
of inference that uses repeated draws from the sample to create an approximate distribution.
Willemain et al. (2004) report that bootstrapping produced more accurate forecasts of
the demand of nine companies over a fixed lead time than exponential smoothing or
Croston’s method. We illustrate with an example.
EXAMPLE: Demand for an Automobile Part
Suppose we would like to forecast the demand for an automobile part for the next
3 months. The monthly demand for the part over the past 24 months is given in
Table 12.3.
Solution:
Bootstrap scenarios of possible total demand over a 3-month lead time are created
by taking random samples with replacement as follows:
1. Months: 3,17,21; demand: 7 + 0 + 0 = 7.
2. Months: 1,20,8; demand: 0 + 13 + 0 = 13.
3. Months: 6,14,19; demand: 2 + 9 + 5 = 16.
Continuing this process, we can build the demand distribution for the given lead
time.
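The sampling procedure above can be sketched in base R as follows. Table 12.3 is not reproduced here, so the 24-month demand vector is hypothetical: only the values for the months used in the three scenarios above (months 1, 3, 6, 8, 14, 17, 19, 20, 21) follow the example, and the rest are made up. The number of bootstrap replications is likewise an assumed value.

```r
set.seed(123)

# Hypothetical 24 months of demand (see the assumptions stated above).
demand <- c(0, 3, 7, 0, 0, 2, 0, 0, 4, 0, 1, 0,
            0, 9, 0, 0, 0, 6, 5, 13, 0, 0, 2, 0)

lead_time <- 3     # months in the lead time
n_boot <- 5000     # number of bootstrap scenarios (assumed)

# Each scenario: draw 3 months with replacement and add up their demand.
totals <- replicate(n_boot,
                    sum(sample(demand, size = lead_time, replace = TRUE)))

# Summaries of the bootstrap distribution of lead-time demand.
mean_total <- mean(totals)
service_levels <- quantile(totals, probs = c(0.50, 0.90, 0.95))
```

The quantiles are what an inventory planner would typically read off: for instance, the 95% quantile is the stock level that would have covered lead-time demand in 95% of the resampled scenarios. Note that this simple version ignores any autocorrelation between months.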
• At the strategic level we usually look into forecasting horizons that go beyond a
year and involve the impact of rare events (like major international crises, such as the
recent one regarding energy prices and the global credit system), new product
development, product withdrawals, capacity amendments, and scenario planning.
The aforementioned forecasting horizons are only indicative, and often met in
supply chain forecasting. There are many forecasting applications where a strategic
forecast is just for a few months ahead! So, in order to avoid any confusion, we
use the terms: “forecast for x steps ahead” or “forecast for x periods ahead”, without
specifying what steps/periods stand for. These steps could be anything from minutes
to years depending on the application area. Typically short-term forecasting involves
1–3 steps ahead; medium-term or mid-term is for 4–12 steps ahead; and long-term
anything over 12 steps ahead.13
This chapter proposes a simple seven-step forecasting process tailored for oper-
ational forecasting tasks. This process is illustrated in Fig. 12.8, and is abbreviated
as “AFA forecasting process” or just AFA for short.
AFA provides detailed guidance on how to prepare operational forecasts for a
single product. This process should be:
(a) Repeated for every product in your inventory, and
(b) Rerun each time new demand/sales data becomes available.
Thus, if we observe our inventory of 100 products on a monthly basis, we should
run AFA every month for each of the hundred products we manage.
13 For this and other types of forecast classifications, see Hibon and Makridakis (2000).
Let us start decoding what these boxes stand for; there is one box for each of the
seven steps of the AFA process. The upper three form the preprocessing phase, the
one in the middle the main-forecasting-task, and the latter three the post-processing
phase (Fig. 12.9).
It looks like a typical Black-Box approach.14 However, we believe it is more like
a “Grey-Box” approach: a situation where you will be able to understand most of
the things that are happening throughout AFA, but will still rely on automated tools to
deliver for you!
Let us explore the AFA process as illustrated in Fig. 12.8:
• First Box: the BAD things . . .
Each single time series comes with a number of problems. Some of these
are dead-obvious but some are well hidden. To cut the long story short, we
must deal with all these “bad things” and prepare a series with no missing
values, no extremely low/high values (outliers), no level or trend shifts; this
would involve automated detection algorithms for such problems and suggested
solutions in order to adjust the original series into new series, filtered for all the
aforementioned problems.
• Second Box: the GOOD things . . .
In time-series forecasting it is often very difficult to tell good things from bad
ones . . . just like in real life! A ‘good thing’ in a time series is a special event
(SE), often termed in literature as irregular, infrequent or rare event; it could be a
promotion, a production interruption, news, regulation, etc., in general anything
that could make demand deviate substantially from regular levels! But why is
something irregular good? Simply because it is an information-rich period, a
period with special interest where an external/exogenous event has driven excess
or limited demand respectively. So it would look exactly like an outlier, but we
will know what exactly happened. From a mathematics perspective, the way you
detect and subsequently adjust periods with special events is identical to the one
used to treat outliers.
14 A standard engineering expression, for a situation or a solution where something seems to work
fine, but we are not sure why and definitely do not know for how long it will keep on working!!
• Third Box: the REGULAR things . . .
In forecasting, finding regularities and patterns in a series is an essential task;
usually termed periodicity, things that repeat themselves on a regular basis. If the
regularity, the repetition happens within a year then we will call this phenomenon
seasonality; for less than a year mini-cycles, while for more than a year, big
economic/financial-cycles. In any case, removing the effect of these cycles at this
stage of the AFA process (and reintroducing them later on) has been found to work
very well in various empirical investigations.15
After successfully completing these three steps of the preprocessing phase of our
time series, we should by now have a nice filtered16 —smoothed—series, that will
look almost17 like a straight line, either entirely flat or with a constant-ish trend
(upward or downward). Now, it is about time to extend this line into the future . . .
Now we are ready to forecast!
• THE (fourth) BOX: FORECASTING . . .
This is where all the fun is . . . . Let us try forecasting: extrapolate the series
in the future. We will not just choose a method—and that is it! (where would the
fun be after all . . . ?).
We basically employ three fundamental strategies . . . the “three forecasting
tricks” as I fancy calling them:
– “Select”: my mother always says that “Experience is all that matters . . . ”;
and she is probably right. Thus if a method worked well in the past, we
should probably stick to it, and keep on selecting that same one for the task
of extrapolation. Furthermore, some methods may have been proven to work
better for some products while other methods better for other products; so
there are “horses for courses” and once again we are better off sticking to
them. In essence, we could build a nice table—a selection protocol (SP) as we
will call it more formally, where in one column there is a list of our products,
while in the other column the forecasting methods and models that have
worked well in the past for the respective products. An illustrative example
is shown in Table 12.4.
– “Combine”:
When in doubt . . . combine!
a very good piece of advice I dare say. When a method has worked well in the
past for a certain product, but the new Statistician on the block . . . insists that
method X is the new panacea in time-series forecasting, then why not combine
those two? So get a set of forecasts from your trusty chosen method, get another
set of forecasts from the new highly promising method, and then average these
to get a final set of forecasts. If you believe more in the former (or the latter), you
could easily adjust the weights to express your belief, for
example, via a 30% weight to the experience-based method and 70% to the new
one. Rule of thumb: “Combining always works!” (In other words: combining
outperforms, most of the time, the individual methods being
combined.)
– “Compete,” the true reason forecasters exist: (empirical) Forecasting Competitions!
We do not trust anything, and from all the available methods and models,
applied on all the available history, we will find the one that forecasts “best”
according to one or more accuracy criteria.
These criteria typically include average or median error metrics like MAE,
MdAE, MAPE, MdAPE, MASE, and MdASE.
Sometimes, we even apply these tricks simultaneously—for example, (a) we
compete only among methods that have performed well in the past, or (b) we
combine the winner of the competition and the top performing method in the past
as described in a selection protocol, or . . .
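The “combine” and “compete” tricks can be illustrated together with a few lines of base R. Everything in this sketch is hypothetical: the holdout values, the two sets of forecasts, and the method labels are made up, and MAE is used as the single competition criterion for simplicity.

```r
# Mean absolute error of a set of forecasts.
mae <- function(actual, forecast) mean(abs(actual - forecast))

# Hypothetical holdout period and forecasts from two methods.
actual <- c(102, 98, 105, 110)
f_old  <- c(100, 100, 100, 100)   # the trusted, experience-based method
f_new  <- c(104, 101, 108, 107)   # the new, promising method

# Combine: 30% weight on the experience-based method, 70% on the new one.
f_combo <- 0.3 * f_old + 0.7 * f_new

# Compete: score every candidate on the holdout and pick the lowest MAE.
scores <- c(old = mae(actual, f_old),
            new = mae(actual, f_new),
            combo = mae(actual, f_combo))
winner <- names(which.min(scores))
```

On these made-up numbers the combination happens to win the competition; that outcome is a property of this toy data, not a guarantee, which is why the rule of thumb above is qualified with “most of the time.”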
• Fifth Box: Superimpose regular patterns.
By now, our forecasts should look like a straight line, either flat or with a
certain slope. If we have identified regularities in step three, then we need to bring
them back into the game, in other words we superimpose these patterns onto the
extrapolation. Once this step is completed, our forecasts will have ups and downs,
and will look like a natural extension of the cycles and seasonality observed in
the history of the series.
• Sixth Box: Human Judgment.
This is where humans come into the game. No matter how sophisticated the
process so far, people—usually referred to as forecasters or experts—want to
intervene at this stage, primarily to introduce market intelligence. This is usually
performed in two phases: (a) an initial phase, where experts roughly revise all
provided forecasts by changing18 them by a percentage x% (e.g., increase all
monthly forecasts for the full next year by 10%), and (b) a more targeted one,
where some specific forecasts in the future are adjusted for the potential impact
of special events like promotions (e.g., increase by an extra 1000 units the sales
forecast for next September due to an expected advertising campaign).
• Seventh (Last) Box: Density forecasting + SCENARIOS; living with Uncertainty!
In this final step, we try to cope with the uncertainty that comes with the produced
forecasts. Firstly, we usually provide a set of confidence or prediction intervals,
associated with the point forecasts for the full forecasting horizon, as shown in
Fig. 12.2; this is also known as density forecasting. There are theoretical as well
as empirical ways to produce these intervals. The most popular way to deal
with the uncertainty around the provided forecast is by building scenarios. These
practically derive from the produced forecasts, but we will treat them as an
indispensable part of the AFA process.
18 Usually increasing the forecasts, due to an optimism bias (more on this and other types of bias
in Chap. 13).
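To make boxes three, four, five, and seven concrete, the following base R sketch removes a seasonal pattern, extrapolates the smoothed series, superimposes the pattern again, and attaches a rough uncertainty band. It is only one possible reading of the process, under loudly stated assumptions: the quarterly data are simulated, seasonal indices are estimated as ratios to the yearly mean, a straight line stands in for the forecasting method of box four, and the band assumes roughly normal errors with the spread of the in-sample residuals.

```r
set.seed(2024)

# Simulated quarterly series: linear trend times a seasonal pattern plus noise.
s <- 4                                   # periods per season
n_years <- 4
t_idx <- seq_len(s * n_years)
pattern <- c(0.8, 1.1, 1.3, 0.8)         # hypothetical seasonal pattern
y <- (100 + 2 * t_idx) * rep(pattern, n_years) + rnorm(length(t_idx), sd = 2)

# Box 3: estimate seasonal indices (ratio to the yearly mean) and remove them.
quarter <- rep(seq_len(s), n_years)
year <- rep(seq_len(n_years), each = s)
idx <- as.numeric(tapply(y / ave(y, year), quarter, mean))
idx <- idx * s / sum(idx)                # normalize so the indices add up to s
y_adj <- y / idx[quarter]

# Box 4: extrapolate the deseasonalized series (here with a straight line).
trend_fit <- lm(y_adj ~ t_idx)
new_t <- max(t_idx) + seq_len(s)
f_adj <- as.numeric(predict(trend_fit, newdata = data.frame(t_idx = new_t)))

# Box 5: superimpose the seasonal pattern on the extrapolation.
new_q <- ((new_t - 1) %% s) + 1
point <- f_adj * idx[new_q]

# Box 7: rough 95% band based on the in-sample residual spread.
res_sd <- sd(residuals(trend_fit))
lower <- (f_adj - 1.96 * res_sd) * idx[new_q]
upper <- (f_adj + 1.96 * res_sd) * idx[new_q]
```

The `lower` and `upper` values can double as simple pessimistic and optimistic scenarios around the point forecasts; an empirical alternative would be to use quantiles of past out-of-sample errors instead of the normal approximation.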
We have seen the input; we have roughly seen the steps within the “grey-box”; let
us stay a bit more on the output of AFA. When you started reading this chapter, you
probably thought it would all be about a number or a few numbers—if forecasts for
more periods ahead were required. By now, it should have become obvious that far
more output—in numerical and narrative form—will be available. Practically every
step of the AFA process produces some output, which consists by and large
of what is contained in Table 12.5, AFA Output (which is not exhaustive).
19 Noise is a term met in many sciences. I prefer the electrical engineering definition of it where
Noise can block, distort, or change/interfere with the meaning of a message in both human and
electronic communication.
20 Armstrong 2001.
21 Timmerman and Granger 2004.
22 Syntetos et al. 2010.
23 Maris et al. 2007; Bozos et al. 2008.
Measures such as mean absolute error (MAE), mean squared error (MSE), root
mean squared error (RMSE), and mean absolute percent error (MAPE) are used to
evaluate the accuracy of a particular forecast (Hyndman 2014).
Let $y_t$ denote the t-th observation and $\hat{y}_{t|t-1}$ denote its forecast based on all
previous data, where t = 1, 2, . . . , T. Then, the following measures, mean absolute
error, mean squared error, root mean squared error, and mean absolute percentage
error, are useful.
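$$\mathrm{MAE} = T^{-1} \sum_{t=1}^{T} \left| y_t - \hat{y}_{t|t-1} \right|$$
$$\mathrm{MSE} = T^{-1} \sum_{t=1}^{T} \left( y_t - \hat{y}_{t|t-1} \right)^2$$
$$\mathrm{RMSE} = \sqrt{T^{-1} \sum_{t=1}^{T} \left( y_t - \hat{y}_{t|t-1} \right)^2}$$
$$\mathrm{MAPE} = 100\, T^{-1} \sum_{t=1}^{T} \frac{\left| y_t - \hat{y}_{t|t-1} \right|}{\left| y_t \right|}$$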
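The four measures translate directly into base R. The function name `forecast_errors` and the three data points are hypothetical and serve only to show the arithmetic.

```r
# MAE, MSE, RMSE, and MAPE for one-step-ahead forecasts (hypothetical helper).
forecast_errors <- function(actual, forecast) {
  e <- actual - forecast
  c(MAE  = mean(abs(e)),
    MSE  = mean(e^2),
    RMSE = sqrt(mean(e^2)),
    MAPE = 100 * mean(abs(e) / abs(actual)))  # undefined if any actual is 0
}

# Made-up observations and forecasts.
actual   <- c(100, 200, 400)
forecast <- c(110, 190, 380)
acc <- forecast_errors(actual, forecast)
```

Here the errors are −10, 10, and 20, so MAE is 40/3 and MSE is 200. MAPE divides by the observations and therefore breaks down for series containing zeros, such as the intermittent demand series discussed earlier in the chapter.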
6 Conclusion
J. Scott Armstrong25 and the International Institute of Forecasters,26 where you get
a gateway to the amazing world of forecasting free of cost.
Furthermore, this chapter is not about giving you all the underlying theory and
mathematics of the discipline. In fact, mathematics and statistics, theorems, and
axioms are kept to the absolute minimum. “Everything is kept as simple as possible,
. . . but not simpler!”27 So there will be a few formulae, but expressed in a way
that does not require a mathematical background to follow. If you were looking for
the mathematics of forecasting then the leading textbook of the field “Forecasting
Methods and Applications”—by Makridakis et al. (1998)—is your reference point.
For engineers like me, who prefer the “do it yourself” approach, the second edition
of the latter book is particularly useful as most of the forecasting algorithms are
presented in such a way that their implementation is very straightforward in a
standard programming language.
Now, if you need a subjective approach and judgmental forecasting is your
weapon of choice when approaching forecasting tasks, then Goodwin (2006),
along with Wright and Goodwin (1998), is probably the way to go.
A forecasting process, we strongly believe, will significantly enhance the
forecasting performance of your company or private or public organization; it is a process
consisting roughly of two basic elements:
(a) A fairly accurate set of forecasts.
(b) A good estimate of the uncertainty around them.
Of course, it would be up to you, once faced with real-life problems, how to
use these forecasts, and more importantly how to put in place countermeasures and
back-up policies so as to cope with the predicted uncertainty. Living with scenarios built
around this uncertainty is the key to your business success.
All the datasets, code, and other material referred to in this section are available at
www.allaboutanalytics.net.
• Data 12.1: Data - ARIMA.csv
• Data 12.2: Data - Croston and SBA.csv
• Data 12.3: Data - Damped Holt.csv
• Data 12.4: Data - SES, ARRSES, Holt, HoltWinter.csv
• Data 12.5: Data - Theta.csv
• Data 12.6: FA - Excel Template.xlsx
• Code 12.1: Forecasting Analytics.R
• Data 12.7: Forecasting chapter - Consolidated Output.xlsx
25 Professor J. Scott Armstrong, http://www.jscottarmstrong.com/ [Accessed on Oct 1, 2017].
26 International Institute of Forecasters, https://forecasters.org/ [Accessed on Oct 1, 2017].
27 A famous quote attributed to Albert Einstein.
Exercises
Ex. 12.1 Using time-series data on annual production of tractors in India from 1991
to 2016, provide forecast of the production of tractors (in millions) for the year 2017
using Theta model (combining regression and SES methods, α = 0.4).
Year 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003
Production 14.5 14.8 15.1 15.4 15.7 16.0 16.3 16.7 17.0 17.3 17.7 18.0 18.4
Year 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
Production 18.8 19.3 19.8 20.3 20.8 21.3 21.9 22.4 23.0 23.5 24.1 24.7 25.4
Ex. 12.2 Given the monthly production (in millions) of mobiles by a company for
the last 20 months, provide the forecast of mobile production for the company for
the next 3 months using the ARIMA (2,1,2) model.
Months 1 2 3 4 5 6 7 8 9 10
Production 3.40 3.43 3.47 3.50 3.54 3.57 3.61 3.65 3.68 3.72
Months 11 12 13 14 15 16 17 18 19 20
Production 3.77 3.82 3.77 3.82 3.77 3.82 3.77 3.82 3.77 3.82
Ex. 12.3 Monthly demand (in millions) for an automobile spare part is given as
follows:
Months 1 2 3 4 5 6 7 8 9 10 11 12
Demand 3 0 1 0 0 8 0 0 0 2 0 5
Months 13 14 15 16 17 18 19 20 21 22 23 24
Demand 0 0 0 1 4 0 0 0 3 ? ? ?
Based on the above time-series data, provide 3 months ahead forecast for the
spare part using Croston and SBA methods.
Ex. 12.4 Using the provided Excel templates, create:
(a) A version of Holt exponential smoothing where both the level smoothing
parameter and the trend smoothing parameter are equal,
(b) A version of damped Holt exponential smoothing where alpha (α) = a, beta
(β) = a^2, and phi (ϕ) = a^3.
Appendix 1
Example: The monthly sales (in million USD) of Vodka is given for the period
1968–1970. We want to forecast the sales for the year 2016 using various forecasting
methods—SES, ARRSES, Holt, Holt–Winters (Additive/Multiplicative).
Data: The data can be downloaded from the book’s website and the dataset name
is “Data - SES, ARRSES, Holt, HoltWinter.csv”. You can also refer to Table 12.6 for
data.
R Code (to read data)
read.csv("filename.ext", header = TRUE)
SES Method
Install and load forecast package
install.packages("forecast")
library(forecast)
R function
ses(<univariate vector of observations>, h = <number of periods to forecast>)
Note: The ses function in R by default optimizes both the value of alpha and the
initial value.
In case you prefer the output for a specified alpha value, use the parameter
initial = "simple"
and set the alpha value in the parameters.
ses(<univariate vector of observations>, h = <number of periods to forecast>,
alpha = < >, initial = "simple")
The above code sets the first forecast value equal to the first observation. If alpha
is omitted, it will be optimized.
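To see what the "simple" initialization does, SES can also be written out by hand in base R. The sketch below uses the 1968 Vodka values from Table 12.6; the helper name `ses_manual` is hypothetical, and alpha = 0.5 is an assumption inferred from the SES column of Table 12.10 rather than a value stated in the text.

```r
# Vodka sales for Jan-68 to Dec-68 (Table 12.6).
vodka <- c(42, 40, 43, 40, 41, 39, 46, 44, 45, 38, 40, 49)

# Simple exponential smoothing with the first forecast set to the first
# observation (hypothetical helper, not the forecast package's ses()).
ses_manual <- function(y, alpha) {
  f <- numeric(length(y) + 1)
  f[1] <- y[1]
  for (t in seq_along(y)) {
    # Next forecast = alpha * latest observation + (1 - alpha) * latest forecast.
    f[t + 1] <- alpha * y[t] + (1 - alpha) * f[t]
  }
  f
}

# Forecasts for Jan-68 through Jan-69 with the assumed alpha.
f <- ses_manual(vodka, alpha = 0.5)
```

With this alpha the forecasts for the first months are 42, 42, 41, 42, and the last element, the forecast for Jan-69, is 44.78125, in line with the 44.7813 reported in the SES column of Table 12.10.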
Table 12.6 Data for SES, Holt, ARRSES, and Holt–Winters method
Period (t) Vodka (Yt ) Period (t) Vodka (Yt ) Period (t) Vodka (Yt )
Jan-68 42 Jan-69 21 Jan-70 47
Feb-68 40 Feb-69 31 Feb-70 38
Mar-68 43 Mar-69 33 Mar-70 91
Apr-68 40 Apr-69 39 Apr-70 107
May-68 41 May-69 70 May-70 89
Jun-68 39 Jun-69 79 Jun-70 116
Jul-68 46 Jul-69 86 Jul-70 117
Aug-68 44 Aug-69 125 Aug-70 274
Sep-68 45 Sep-69 55 Sep-70 137
Oct-68 38 Oct-69 66 Oct-70 171
Nov-68 40 Nov-69 93 Nov-70 155
Dec-68 49 Dec-69 99 Dec-70 143
Holt Method
Install and load forecast package
install.packages("forecast")
library(forecast)
R function
holt(<univariate vector of observations>, h = <number of periods to forecast>)
Note: The holt function by default optimizes the values of alpha and beta and the
initial values.
In case you prefer the output for specified alpha and beta values, use the parameter
initial = "simple"
and set the alpha and beta values in the parameters.
holt(<univariate vector of observations>, h = <number of periods to forecast>,
alpha = < >, beta = < >, initial = "simple")
The above code sets the first level equal to the first value and the trend to the
difference of the first two values. If alpha or beta is omitted, it will be optimized.
Holt–Winters Method
The HoltWinters function is part of the stats package, which ships with base R, so
no installation is needed.
R function
HoltWinters(<name of dataset>, alpha = NULL, beta = NULL, gamma = NULL,
seasonal = c("additive", "multiplicative"), start.periods = 2,
l.start = NULL, b.start = NULL, s.start = NULL,
optim.start = c(alpha = 0.3, beta = 0.1, gamma = 0.1),
optim.control = list())
The values of alpha, beta, and gamma can either be fixed by specifying
<alpha>, <beta>, and <gamma>, or, if they are left as NULL, they are optimized
starting from the values given in <optim.start>. Seasonality can be considered
additive or multiplicative. The <start.periods> parameter is the number of seasons
of initial data used to start the forecast (minimum 2 seasons of data). Starting
values of level <l.start>, trend <b.start>, and seasonality <s.start> can either be
specified or, if left as NULL, be computed from the data.
For the HoltWinters function, the dataset must be defined as a time-series (ts)
type. A dataset can be converted to time-series type using the code below:
ts(<name of dataset>, frequency = <number of periods in a season>)
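A minimal end-to-end sketch of the two calls above, using the Vodka series of Table 12.6, might look as follows. The smoothing constants are fixed at 0.3, 0.1, and 0.1 purely for illustration (an assumption that also sidesteps the optimizer); they are not claimed to be the values behind Table 12.10, so the fitted numbers will generally differ from that table.

```r
# Vodka sales, Jan-68 to Dec-70 (Table 12.6).
vodka <- c(42, 40, 43, 40, 41, 39, 46, 44, 45, 38, 40, 49,
           21, 31, 33, 39, 70, 79, 86, 125, 55, 66, 93, 99,
           47, 38, 91, 107, 89, 116, 117, 274, 137, 171, 155, 143)

# Convert to a monthly time series: 12 periods in a season.
vodka_ts <- ts(vodka, start = c(1968, 1), frequency = 12)

# Multiplicative Holt-Winters with fixed (assumed) smoothing constants.
fit <- HoltWinters(vodka_ts, alpha = 0.3, beta = 0.1, gamma = 0.1,
                   seasonal = "multiplicative")

# In-sample one-step-ahead forecasts start after the first season.
in_sample <- fit$fitted[, "xhat"]

# Forecasts for the next 3 months.
fc <- predict(fit, n.ahead = 3)
```

Leaving alpha, beta, and gamma as NULL lets the function optimize them instead; changing `seasonal` to "additive" gives the additive variant.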
Table 12.7 Data for damped exponential smoothing using Holt’s method
Period (t) Demand (Yt) Period (t) Demand (Yt)
1 818 8 805
2 833 9 808
3 817 10 817
4 818 11 836
5 805 12 855
6 801 13 853
7 803 14 851
Table 12.8 Data for Theta model
Time (t) Cars (Yt) Time (t) Cars (Yt)
1 13.31 13 18.12
2 13.6 14 18.61
3 13.93 15 19.15
4 14.36 16 19.55
5 14.72 17 20.02
6 15.15 18 20.53
7 15.6 19 20.96
8 15.94 20 21.47
9 16.31 21 22.11
10 16.72 22 22.72
11 17.19 23 23.3
12 17.64 24 23.97
Damped Holt Method
R function
holt(<univariate vector of observations>, h = <number of periods to forecast>,
damped = TRUE)
Note: The holt function by default optimizes the values of alpha, beta, and phi and
the initial values.
In case you prefer the output for specified alpha, beta, and phi values, set them in
the parameters (the call below leaves initial at its default).
holt(<univariate vector of observations>, h = <number of periods to forecast>,
damped = TRUE, alpha = < >, beta = < >, phi = < >)
If alpha, beta, and phi are omitted, they will be optimized.
Theta method
Data: The data can be downloaded from the book’s website and the dataset name
is “Data - Theta.csv”. You can also refer to Table 12.8 for data.
Install and load forecTheta package
install.packages("forecTheta")
library(forecTheta)
R function
stm(ts(<univariate vector of observations>), h = <number of periods to forecast>,
par_ini = c(y[1]/2, 0.5, 2))
Here y is the series being forecast.
Table 12.10 Consolidated output of SES, ARRSES, Holt, and Holt–Winters methods (*R output,
^Excel output)
Period Vodka SES^ ARRSES^ Holt^ Holt–Winters Additive* Holt–Winters Multiplicative*
t Yt Ft Ft Ft Ft Ft
Jan-68 42 42.0000 42.0000 42.0000 – –
Feb-68 40 42.0000 42.0000 42.0000 – –
Mar-68 43 41.0000 41.4000 41.2474 – –
Apr-68 40 42.0000 41.8800 43.7804 – –
May-68 41 41.0000 41.3160 43.2701 – –
Jun-68 39 41.0000 41.1590 43.5794 – –
Jul-68 46 40.0000 39.9062 42.5508 – –
Aug-68 44 43.0000 45.2167 46.1132 – –
Sep-68 45 43.5000 44.4632 46.6752 – –
Oct-68 38 44.2500 44.5858 47.4834 – –
Nov-68 40 41.1250 42.1887 43.7545 – –
Dec-68 49 40.5625 40.5625 43.0670 – –
Jan-69 21 44.7813 47.5915 47.9867 13.3136 17.0694
Feb-69 31 32.8906 34.1680 34.0851 25.4230 24.3483
Mar-69 33 31.9453 31.8512 33.1256 30.4248 25.8316
Apr-69 39 32.4727 32.7419 33.9700 40.4559 31.7168
May-69 70 35.7363 36.4148 38.0008 71.4978 57.2978
Jun-69 79 52.8682 45.2818 58.0939 79.4149 64.2547
Jul-69 86 65.9341 75.3264 73.0838 73.0716 61.8609
Aug-69 125 75.9670 85.5710 84.1672 84.3076 65.0093
Sep-69 55 100.4835 123.8677 111.8418 117.0539 73.0378
Oct-69 66 77.7418 55.6361 84.0182 82.3369 65.9793
Nov-69 93 71.8709 59.8772 76.2076 75.8113 71.2471
Dec-69 99 82.4354 65.2778 87.9222 93.4888 86.3133
Jan-70 47 90.7177 80.4700 97.0133 69.5189 44.0442
Feb-70 38 90.7177 55.8669 97.0133 63.7863 63.7920
Mar-70 91 90.7177 53.2861 97.0133 46.3351 66.0826
Apr-70 107 90.7177 70.0641 97.0133 75.7723 77.0393
May-70 89 90.7177 85.5019 97.0133 126.1014 136.6830
Jun-70 116 90.7177 88.0616 97.0133 114.8522 150.6100
Jul-70 117 90.7177 109.1998 97.0133 109.6536 158.7803
Aug-70 274 90.7177 116.2344 97.0133 119.5669 219.8930
Sep-70 137 90.7177 262.3861 97.0133 208.5555 98.9493
Oct-70 171 90.7177 137.8308 97.0133 185.8300 114.7506
Nov-70 155 90.7177 143.9737 97.0133 195.4645 157.7524
Dec-70 143 90.7177 145.0455 97.0133 – 166.1216
Table 12.13 Forecast using ARIMA method
Month Production of Sofa (in Thousands) Forecast* Ft
1 98 97.9020
2 82 92.7517
3 84 86.6899
4 85 87.2354
5 99 90.2576
6 90 92.3250
7 92 89.9272
8 83 89.0541
9 86 87.5629
10 90 89.2152
11 95 90.6584
12 91 90.7141
13 87 89.0086
14 99 89.5715
15 93 90.8333
16 82 87.9907
17 84 86.9563
18 88 89.0024
19 93 90.6564
20 83 90.5054
21 95 89.8928
22 93 91.6043
23 92 90.1840
24 92 89.4144
25 97 89.2052
26 88 88.5374
27 81 86.6061
28 93 87.6779
29 91 90.0032
30 81 88.9099
31 86 88.8512
32 81 90.9446
33 97 92.2831
34 88 93.9458
35 96 92.0745
36 96 92.2296
37 97 90.4160
38 90 88.5144
39 88 86.7300
40 93 86.9814
41 90 87.4583
42 84 86.7020
43 82 86.9679
(continued)
418 K. I. Nikolopoulos and D. D. Thomakos
References
Andrawis, R. R., Atiya, A. F., & El-Shishiny, H. (2011). Forecast combinations of computational
intelligence and linear models for the NN5 time series forecasting competition. International
Journal of Forecasting, 27, 672–688.
Armstrong, J. S. (2001). Principles of forecasting: A handbook for researchers and practitioners.
Dordrecht: Kluwer Academic Publishers.
Assimakopoulos, V., & Nikolopoulos, K. (2000). The theta model: A decomposition approach to
forecasting. International Journal of Forecasting, 16, 521–530.
Babai, M. Z., Ali, M., & Nikolopoulos, K. (2012). Impact of temporal aggregation on stock
control performance of intermittent demand estimators: Empirical analysis. OMEGA: The
International Journal of Management Science, 40, 713–721.
Bollerslev, T., Engle, R. F., & Nelson, D. B. (1994). ARCH models. In R. F. Engle & D. L.
McFadden (Eds.), Handbook of econometrics (Vol. 4, pp. 2959–3038). Amsterdam: North-
Holland.
Box, G. E. P., & Jenkins, G. M. (1970). Time series analysis: Forecasting and control. San
Francisco, Holden Day (revised ed. 1976).
Bozos, K., Nikolopoulos, K., & Bougioukos, N. (2008). Forecasting the value effect of seasoned
equity offering announcements. In 28th international symposium on forecasting ISF 2008, June
22–25 2008. France: Nice.
12 Forecasting Analytics 419
Brown, R. G. (1956). Exponential smoothing for predicting demand. Cambridge, MA: Arthur D.
Little Inc.
Chatfield, C. (2005). Time-series forecasting. Significance, 2(3), 131–133.
Croston, J. D. (1972). Forecasting and stock control for intermittent demands. Operational
Research Quarterly, 23, 289–303.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7,
126.
Engle, R. F. (1982). Autoregressive conditional heteroscedasticity with estimates of the variance
of the United Kingdom inflation. Econometrica, 50, 987–1008.
Goodwin, P. (2006). Decision Analysis for Management Judgement, 3rd Edition Chichester: Wiley.
Hanke, J. E., & Wichern, D. W. (2005). Business forecasting (8th ed.). Upper Saddle River:
Pearson.
Harrison, P. J., & Stevens, C. F. (1976). Bayesian forecasting. Journal of the Royal Statistical
Society (B), 38, 205–247.
Hibon, M., & Makridakis, S. (2000). The M3 competition: Results, conclusions and implications.
International Journal of Forecasting, 16, 451–476.
Holt, C. C. (1957). Forecasting seasonals and trends by exponentially weighted averages. O. N. R.
Memorandum 52/1957. Pittsburgh: Carnegie Institute of Technology. Reprinted with discussion
in 2004. International Journal of Forecasting, 20, 5–13.
Hyndman, R. J. (2014). Forecasting – Principle and practices. University of Western Australia.
Retrieved July 24, 2017, from robjhyndman.com/uwa.
Johnston, F. R., Boylan, J. E., & Shale, E. A. (2003). An examination of the size of orders from
customers, their characterization and the implications for inventory control of slow moving
items. Journal of the Operational Research Society, 54(8), 833–837.
Jose, V. R. R., & Winkler, R. L. (2008). Simple robust averages of forecasts: Some empirical
results. International Journal of Forecasting, 24(1), 163–169.
Keast, S., & Towler, M. (2009). Rational decision-making for managers: An introduction.
Hoboken, NJ: John Wiley & Sons.
Kourentzes, N. (2014). Improving your forecast using multiple temporal aggregation. Retrieved
August 7, 2017, from http://kourentzes.com/forecasting/2014/05/26/improving-forecasting-
via-multiple-temporal-aggregation.
Kourentzes, N., Petropoulos, F., & Trapero, J. R. (2014). Improving forecasting by estimating time
series structural components across multiple frequencies. International Journal of Forecasting,
30, 291–302.
Leven and Segerstedt. (2004). Referred to in Syntetos and Boylan approximation section.
Lindsey, M., & Pavur, R. (2008). A comparison of methods for forecasting intermittent demand
with increasing or decreasing probability of demand occurrences. In K. D. Lawrence & M.
D. Geurts (Eds.), Advances in business and management forecasting (advances in business
and management forecasting) (Vol. 5, pp. 115–132). Bingley, UK: Emerald Group Publishing
Limited.
Makridakis, S., Hogarth, R., & Gaba, A. (2009). Dance with chance: Making luck work for you.
London, UK: Oneworld Publications.
Makridakis, S., Wheelwright, S. C., & Hyndman, R. J. (1998). Forecasting: Methods and
applications (3rd ed.). New York: John Wiley and Sons.
Maris, K., Nikolopoulos, K., Giannelos, K., & Assimakopoulos, V. (2007). Options trading driven
by volatility directional accuracy. Applied Economics, 39(2), 253–260.
Nikolopoulos, K., Assimakopoulos, V., Bougioukos, N., Litsa, A., & Petropoulos, F. (2011a). The
theta model: An essential forecasting tool for supply chain planning. Advances in Automation
and Robotics, 2, 431–437.
Nikolopoulos, K., Syntetos, A., Boylan, J., Petropoulos, F., & Assimakopoulos, V. (2011b).
ADIDA: An aggregate/disaggregate approach for intermittent demand forecasting. Journal of
the Operational Research Society, 62, 544–554.
Petropoulos, F., Makridakis, S., Assimakopoulos, V., & Nikolopoulos, K. (2014). ‘Horses for
Courses’ in demand forecasting. European Journal of Operational Research, 237, 152–163.
420 K. I. Nikolopoulos and D. D. Thomakos
Prestwich, S. D., Tarim, S. A., Rossi, R., & Hnich, B. (2014). Forecasting intermittent demand by
hyperbolic-exponential smoothing. International Journal of Forecasting, 30(4), 928–933.
Rostami-Tabar, B., Babai, M. Z., Syntetos, A. A., & Ducq, Y. (2013). Demand forecasting by
temporal aggregation. Naval Research Logistics, 60, 479–498.
Spithourakis, G. P., Petropoulos, F., Babai, M. Z., Nikolopoulos, K., & Assimakopoulos, V. (2011).
Improving the performance of popular supply chain forecasting techniques: An empirical
investigation. Supply Chain Forum: An International Journal, 12, 16–25.
Syntetos, A. A., & Boylan, J. E. (2001). On the bias of intermittent demand estimates. International
Journal of Production Economics, 71, 457–466.
Syntetos, A. A., & Boylan, J. E. (2005). The accuracy of intermittent demand estimates.
International Journal of Forecasting, 21, 303–314.
Syntetos, A. A., Nikolopoulos, K., & Boylan, J. E. (2010). Judging the judges through accuracy-
implication metrics: The case of inventory forecasting. International Journal of Forecasting,
26, 134–143.
Taylor, A. R. (1997). On the practical problems of computing seasonal unit root tests. International
Journal of Forecasting, 13(3), 307–318.
Teunter, R. H., Syntetos, A., & Babai, Z. (2011). Intermittent demand: Linking forecasting to
inventory obsolescence. European Journal of Operational Research, 214, 606–615.
Thomakos, D. D., & Nikolopoulos, K. (2014). Fathoming the theta method for a unit root process.
IMA Journal of Management Mathematics, 25, 105–124.
Timmerman, A., & Granger, C. W. J. (2004). Efficient market hypothesis and forecasting.
International Journal of Forecasting, 20, 15–27.
Tseng, F., Yu, H., & Tzeng, G. (2002). Combining neural network model with seasonal time series
ARIMA model. Technological Forecasting and Social Change, 69, 71–87.
Willemain, T. R., Smart, C. N., & Schwarz, H. F. (2004). A new approach to forecasting
intermittent demand for service parts inventories. International Journal of Forecasting, 20,
375–387.
Wright, G., & Goodwin, P. (1998). Forecasting with judgement. Chichester and New York: John
Wiley and Sons.
Chapter 13
Count Data Regression
Thriyambakam Krishnan
1 Introduction
T. Krishnan
Chennai Mathematical Institute, Chennai, India
e-mail: sridhar@illinois.edu
binomial regression model, etc. are more appropriate. These models help unravel
the distributional effects of influencing factors rather than merely mean effects.
Furthermore, extensions of these models, called zero-inflated models, help tackle
high incidence of 0 counts in the data. This chapter covers the following:
• Understanding what a count variable is
• Getting familiar with standard models for count data like Poisson and negative
binomial
• Understanding the difference between a linear regression model and a count data
regression model
• Understanding the formulation of a count data regression model
• Becoming familiar with estimating count data regression parameters
• Learning to predict using a fitted count data regression
• Learning to validate a fitted model
• Learning to fit a model with an offset variable
The following are specific examples of business problems that involve count data
analysis. We list the response and predictor variables below:
• In a study of the number of days of reduced activity in the past 2 weeks due to
illness or injury, the following predictors are considered: gender, age, income, as
well as the type of medical insurance of the patient.
• In an application of Poisson regression, the number of fish (remember “poisson”
means fish in French) caught by visitors to a state park is analyzed in terms of the
number of children in the group, camping one or more nights during stay (binary
variable), and the number of persons in the group.
• In an example like the one above, there is scope for an excessive number of zero
counts and so a zero-inflated model might turn out to be appropriate.
• In an insurance application, the issue is one of predicting the number of claims
that an insured will make in 1 year under third-party automobile insurance. The
predictor variables are the amount insured, the area the insured lives in, the make
of the car, the no-claim bonus received in the last year, the kilometers driven
last year, etc. A zero-inflated model called the hurdle model has been found to
be a reasonable model for the data.
• In another insurance example, we want to understand the determinants of the
number of claims “Claim count”: this is a count variable (discrete; integer values)
where the possible explanatory variables are: the number of vehicles in the policy
(an integer numeric variable) and age of the driver.
• In an experiment at AT&T Bell Laboratories, the number of defects per area
of printed wiring boards, arising from soldering their leads on the board, was
related to five possible influences on solderability.
13 Count Data Regression 423
Let us consider the fifth problem in the list above. A part of the data is given below.
The entire dataset named “numclaims.csv” is available on the book’s website. The
columns correspond to the number of claims, the number of vehicles insured, and
the age of the insured (Table 13.1).
Is ordinary least-squares linear regression an appropriate model and method
for this problem? No, because this method assumes that the errors (and hence the
conditional distributions of the claims count given the predictor values) are normal
and that the variances of the errors are the same (homoskedasticity). These
assumptions are not tenable for the claims count since it is a discrete variable.
Moreover, the claims count is more likely to be Poisson distributed, in which case
the Poisson mean, and hence the Poisson variance, differs across cases with
different values of the predictor variables.
Poisson regression is appropriate when the conditional distributions of Y are
expected to be Poisson distributions. This often happens when you are trying to
regress on count data (for instance, the number of insurance claims in 5 years by
a population of auto insurance holders). Count data is, by its very nature, discrete
as opposed to continuous. When we look at a Poisson distribution, we see a spiked and
stepped histogram at each value of X, as opposed to a smooth continuous curve.
Moreover, the Poisson histogram is often skewed: when the mean of Y is small, the
distribution of Y is not symmetric. In a Poisson regression the conditional
distribution of Y changes shape and spread as the conditional mean of Y changes.
However, a Poisson distribution becomes normal shaped, and wider, as the Poisson
parameter (mean) increases.
The conditional distribution graphs for normal and Poisson are given below,
where the differences in the assumptions are apparent. Note that the Poisson model
implies heteroskedasticity, since for the Poisson distribution the mean equals the
variance (Fig. 13.1).
A Poisson distribution-based regression model could be stated as
log(μ) = β0 + β1 x1 + β2 x2 + . . . + βp xp
Fig. 13.1 Conditional distributional graphs for normal and Poisson distributions. Source: An
Animated Guide: An Introduction To Poisson Regression Russ Lavery, NESUG 2010
This model is an example of what are called generalized linear models. Generally,
maximum likelihood estimates of the regression parameters or their approximations
are used. Once the expected value of the response Poisson variable is worked
out, the probabilities of various possible values of the response are immediately
worked out.
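As a minimal sketch of that last step: once an expected count is available, the probabilities of the possible response values follow directly from the Poisson probability function. The mean used below is an assumed illustrative value, not one taken from the fitted model.

```r
# Illustrative expected count (an assumed value, not from a fitted model)
mu <- 0.3

# P(Y = k) for k = 0, 1, 2, 3 under a Poisson distribution with mean mu
k <- 0:3
probs <- dpois(k, lambda = mu)
names(probs) <- paste0("P(Y=", k, ")")
print(round(probs, 4))

# Probability of more than three events: the remaining upper-tail mass
p_tail <- ppois(3, lambda = mu, lower.tail = FALSE)
print(p_tail)
```

In practice mu would be replaced by the fitted mean for the case of interest.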
The R command and results are as follows. The interpretation is given in the next
section (Table 13.2).
Call:
glm(formula = numclaims ~ numveh + age, family = "poisson",
data=numclaims)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1840 -0.9003 -0.5891 0.3948 2.9539
> logLik(pois)
'log Lik.' -189.753 (df=3)
res.deviance df p
203.4512 197 0.3612841
426 T. Krishnan
The Poisson regression coefficients, their standard errors, z-scores for testing the
hypothesis that each regression coefficient is zero, and the p-values are given. The
regression coefficient for age is 0.086121, which means that the expected log(count)
increases by 0.086121 for a 1-year increase in age, and so the ratio of expected
counts at age x + 1 to age x is exp(0.086121) ≈ 1.09. Similarly, for the number of
vehicles this ratio (often called the incidence rate ratio) is exp(0.123273) = 1.131193.
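The ratios quoted above can be reproduced with a few lines of R. This is only a sketch that re-uses the two coefficient values stated in the text; the vector name b and the 10-year comparison are illustrative additions.

```r
# Coefficients as quoted in the text for the Poisson fit to the claims data
b <- c(numveh = 0.123273, age = 0.086121)

# With a log link, exponentiating a coefficient gives the multiplicative
# change in the expected count for a one-unit increase in that predictor
irr <- exp(b)
print(round(irr, 6))

# Ratio of expected counts for a 10-year age difference (illustrative),
# holding the number of vehicles fixed
irr_10yr <- exp(10 * b["age"])
print(irr_10yr)
```

Because the effects are multiplicative, a 10-year difference corresponds to the one-year ratio raised to the tenth power.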
A saturated model is one which contains a separate indicator parameter for each
observation and so fits the data as closely as possible. Perfect fit means: μi = yi .
This is not useful since there is no data reduction, since the number of parameters
equals the number of observations. This model attains the maximum achievable log
likelihood (equivalently the minimum of −2 log Ls ). This is used as a baseline for
comparison to other model fits.1
The residual deviance is defined as
Dm ≡ 2(log Ls − log Lm )
where Lm is the maximized likelihood under the model in question and Ls is the
maximized likelihood under a saturated model. In this example, log(Lm) can be
obtained using the logLik function in R, which gives −189.753. Thus, the
residual deviance is
−2(log Lm − log Ls)
1 The maximum log likelihood when μi = yi is given by: $\sum_i \left( y_i \log(y_i) - y_i - \log(y_i!) \right)$.
since the median is slightly different from 0. The deviance reduces as the model
fit improves. If the model exactly fits the data, then the deviance is zero. As an
approximation
$D^* \sim \chi^2_{n-\dim(\beta)}$
if the model is correct. The approximation can be good in some cases and is exact
for the strictly linear model.
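The definition of the residual deviance and its chi-square approximation can be checked numerically. The sketch below uses simulated data (the sample size, coefficients, and variable names are assumptions chosen for illustration, not the chapter's dataset) and compares the deviance computed from the two log likelihoods with the value reported by glm.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rpois(n, lambda = exp(0.2 + 0.5 * x))   # simulated counts

fit <- glm(y ~ x, family = poisson)

# Log likelihood of the fitted model and of the saturated model (mu_i = y_i)
logL_m <- as.numeric(logLik(fit))
logL_s <- sum(dpois(y, lambda = y, log = TRUE))

# Residual deviance from its definition; it should agree with deviance(fit)
D <- 2 * (logL_s - logL_m)
print(c(by_definition = D, from_glm = deviance(fit)))

# Approximate chi-square goodness-of-fit p-value on n - dim(beta) df
p_gof <- pchisq(D, df = df.residual(fit), lower.tail = FALSE)
print(p_gof)
```

A large p-value here indicates no evidence of lack of fit, subject to the quality of the chi-square approximation.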
The residual deviance behaves like the residual sum of squares of a linear model;
it can therefore be used in the same way as the residual variance in least squares
and is suitable for maximum likelihood estimates. For example, after the exploratory
data analysis (EDA) identifies important covariates, one can use the partial deviance
test to test for the significance of individual covariates or groups of covariates.
Example: The software reports the null deviance, which is the deviance when only one
parameter, the mean of all observations, is used to explain the number of claims.
The null deviance reported is 287.67 on 199 degrees of freedom. The difference in
deviance between the null model and the model with two explanatory variables (three
parameters) is 287.67 − 203.45 = 84.22. The chi-square test with 2 degrees of
freedom (i.e., 199 − 197) yields a p-value close to zero. The two models can also be
compared using a standard ANOVA method as shown below:
R Code and Output
Coefficients:
(Intercept)
-0.462
Degrees of Freedom: 199 Total (i.e. Null); 199 Residual
Null Deviance: 287.7
Residual Deviance: 287.7 AIC: 465.7
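The deviance difference test described above can be reproduced by hand from the two reported deviances. This is a sketch using only the numbers quoted in the text; the variable names are illustrative.

```r
null_dev  <- 287.67   # null deviance on 199 df (from the text)
resid_dev <- 203.45   # residual deviance on 197 df (from the text)

# Drop in deviance and the corresponding degrees of freedom
lr_stat <- null_dev - resid_dev
lr_df   <- 199 - 197

# Upper-tail chi-square probability for the deviance difference
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(c(statistic = lr_stat, df = lr_df, p_value = p_value))
```

With fitted model objects available, the same comparison is usually obtained with anova(null_fit, full_fit, test = "Chisq"), where null_fit and full_fit are hypothetical names for the two glm fits.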
The error distributions assumed in our models lead to a relationship between mean
and variance. For normal errors the variance is constant and does not depend on the
mean. In Poisson, the mean is equal to the variance, hence the dispersion parameter
is 1. The dispersion parameter is used to calculate standard errors. In other
distributions, it is often treated as a parameter, estimated from the data, and
presented in the output.
We now use the model to predict the number of claims for two cases with the number
of vehicles and age of driver as inputs: Case 1: (2, 48), Case 2: (3, 50).
Predictions of the expected value of the number of claims: Case 1: 0.3019, Case 2:
0.4057, with respective standard errors of 0.0448 and 0.0839. You can also calculate
these by hand. For example, for Case 1 the linear predictor is
−5.578 + 2 × 0.1233 + 48 × 0.0861 = −1.1986, and the expected count is
exp(−1.1986) ≈ 0.302, which matches 0.3019 up to rounding of the coefficients.
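The hand calculation for both cases can be sketched in R as follows. The coefficients are the rounded values quoted in the text, so the results agree with the reported predictions only to about three decimal places; the object names are illustrative.

```r
# Rounded coefficients quoted in the text: intercept, numveh, age
b <- c(intercept = -5.578, numveh = 0.1233, age = 0.0861)

# One row per case: a leading 1 for the intercept, then numveh and age
cases <- rbind(case1 = c(1, 2, 48),
               case2 = c(1, 3, 50))

eta <- drop(cases %*% b)   # linear predictor, on the log scale
mu  <- exp(eta)            # expected number of claims
print(round(cbind(eta, mu), 4))
```

With the fitted model object at hand, predict(fit, newdata, type = "response", se.fit = TRUE) would return these predictions along with their standard errors; fit and newdata are hypothetical names here.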
Some of the basic diagnostic plots are illustrated in Fig. 13.2. These are similar to
the plots in regression, see Chap. 7 (Linear Regression Analysis). However, brief
descriptions are given below for a quick recap.
For the model to be a good fit, residuals should lie around the central line like a
set of random observations without a pattern. This plot has the predicted values of
μi on the x-axis and yi − μi on the y-axis. In this graph (Fig. 13.2), most of the
residuals lie on one side of the central line showing unsatisfactory fit.
In the Poisson model, the scale (spread) should vary as the location (mean). If the
spread is larger than the mean on the whole, it is a sign of overdispersion. The graph
(Fig. 13.2) shows the ID of cases that violate this phenomenon. This graph does not
indicate the expected kind of relationship, showing lack of fit.
The influence plot is meant to find influential cases, that is, those which by
themselves change the regression parameters, as measured by a statistic known as
Cook's distance. The graph (Fig. 13.2) indicates the IDs of such influential points.
One needs to examine the reasons for this and, if justified, these points may be
removed from the dataset.
The counts modeled as a Poisson distribution may depend on another variable which
when used in the denominator may define a rate. For instance, in the insurance
context the sum insured may be an exposure variable in which case one might
like to model the rate: number of claims/sum insured. This situation is handled by
multiplying both sides by the exposure variable and taking the log. This results in a
term log(exposure) as an additive regressor. The term log(exposure) is often called
an offset variable. Ideally the offset regressor should have a coefficient of 1 so that
when moved to the left side a rate is defined.
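A minimal sketch of fitting a rate model with an offset is given below. The data are simulated, and the exposure variable, coefficients, and object names are all assumptions made for illustration. The first fit fixes the coefficient of log(exposure) at 1 through offset(); the second estimates it freely for comparison.

```r
set.seed(2)
n <- 300
exposure <- runif(n, min = 1, max = 10)   # hypothetical exposure variable
x <- rnorm(n)
rate <- exp(-1 + 0.4 * x)                 # true events per unit of exposure
y <- rpois(n, lambda = exposure * rate)

# log(exposure) enters with its coefficient fixed at 1 via offset()
fit_off <- glm(y ~ x + offset(log(exposure)), family = poisson)

# For comparison: log(exposure) as an ordinary regressor, coefficient estimated
fit_free <- glm(y ~ x + log(exposure), family = poisson)

print(coef(fit_off))
print(coef(fit_free))
```

If the estimated coefficient of log(exposure) in the second fit is close to 1, treating it as an offset is reasonable.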
There are two potential problems with Poisson regression: Overdispersion and
excessive zeros. We describe each below along with possible solutions. Poisson dis-
tribution has the property that the mean equals variance. However, not infrequently
data display the phenomenon of overdispersion meaning that the (conditional)
variance is greater than the (conditional) mean. One reason for this is omitted or
unobserved heterogeneity in the data or an incorrect specification of the model not
using the correct functional form of the predictors or not including interaction terms.
The second potential problem is the excess number of 0’s in the counts which is
more than what is expected from a Poisson distribution, called zero inflation. The
implication of this situation is that standard errors of regression estimates and their
p-values are small. There are statistical tests available for checking this. See the
example of the overdispersion test in the next section. One way of dealing with this
heterogeneity of data is to specify an alternative distribution model for the data. One
such alternative distribution more general than the Poisson is the negative binomial
distribution, which can be looked upon as a result of modeling the overdispersion as
gamma distributed across means. Zero inflation is generally dealt with by modeling
separately “true zeros” due to the Poisson process and “excess zeros” by a separate
process.
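$$f(y; \mu, \theta) = \frac{\Gamma(y + \theta)}{\Gamma(\theta)\, y!} \cdot \frac{\mu^{y}\, \theta^{\theta}}{(\mu + \theta)^{y + \theta}},$$
where μ is the mean, θ is the shape parameter, Γ is the Gamma function, and the
variance is $V(\mu) = \mu + \mu^2/\theta$.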
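As a quick numerical check of this density, the sketch below codes the formula directly and compares it with R's built-in dnbinom using the size and mu parameterization (size plays the role of θ). The helper name nb_pmf and the parameter values are illustrative assumptions.

```r
# Negative binomial probability function written from the formula,
# computed on the log scale for numerical stability
nb_pmf <- function(y, mu, theta) {
  exp(lgamma(y + theta) - lgamma(theta) - lgamma(y + 1) +
        y * log(mu) + theta * log(theta) - (y + theta) * log(mu + theta))
}

y <- 0:10
mu <- 2.5      # illustrative mean
theta <- 1.8   # illustrative shape parameter

manual  <- nb_pmf(y, mu, theta)
builtin <- dnbinom(y, size = theta, mu = mu)
print(round(cbind(y, manual, builtin), 6))

# Variance by direct summation over a long range of counts
yy <- 0:2000
p  <- dnbinom(yy, size = theta, mu = mu)
v  <- sum(yy^2 * p) - sum(yy * p)^2
print(c(numerical = v, formula = mu + mu^2 / theta))
```

As θ grows large the extra variance term vanishes and the distribution approaches the Poisson.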
We illustrate the use of this model with a case study. First, we fit a Poisson model
and then the negative binomial model.
The source of this data is: “Poisson regression” by Claudia Czado and TU
München,2 and An Actuarial Note by Bailey and Simon (1960).3
The data is provided for private passenger automobile liability for non-farmers
for all of Canada excluding Saskatchewan. We have to fit the model to estimate the
number of claims using the given data. The raw data “canautoins.csv” is available
on the book’s website.
The variable Merit measures the number of full years since the insured’s most
recent accident or since the insured became licensed. The variable Class is a
concatenation of age, sex, use, and marital status. The variables Insured and
Premium are two measures of the risk exposure of the insurance companies. The
variable premium is the premium in 1000s for protection actually provided during
the experience period. Please refer to Table 13.3 for the detailed description.
We should observe that we are given the count of claims for each Merit-Class
combination. Thus, this data is aggregated over the same Merit-Class combination.
First, observe that such aggregation of data over the same category does not change
the maximum likelihood estimate (MLE). In other words, say we had claim data for
every insured person. If we computed the MLE for the disaggregated dataset, we would
get the same estimates of the coefficients and the same significance levels. Second,
note that the fully saturated model will include all interaction terms between Merit
and Class. Finally, the classification based on Merit and Class, as well as the
definition of these categories, is based on experience and data analysis done in the
past. For further details, see the note by Bailey and Simon (1960).
Deviance Residuals:
Min 1Q Median 3Q Max
-3.526 -1.505 0.196 1.204 4.423
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.081e+00 2.103e-02 -98.953 < 2e-16 ***
merit1 -9.306e-02 1.308e-02 -7.117 1.10e-12 ***
merit2 -1.652e-01 1.549e-02 -10.663 < 2e-16 ***
merit3 -4.067e-01 8.623e-03 -47.162 < 2e-16 ***
class2 2.523e-01 1.897e-02 13.300 < 2e-16 ***
class3 3.965e-01 1.280e-02 30.966 < 2e-16 ***
class4 4.440e-01 9.788e-03 45.356 < 2e-16 ***
class5 1.854e-01 2.414e-02 7.680 1.59e-14 ***
Premium -8.537e-06 1.391e-06 -6.137 8.39e-10 ***
Cost 2.064e-05 3.902e-06 5.289 1.23e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
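The R command for the test that the mean is equal to the variance, and its output,
are given below. The call shown uses the default settings of dispersiontest (AER
package), under which the alternative is that the variance equals the dispersion
times the mean with dispersion greater than 1. The optional argument trafo = 1
expresses the same linear alternative as variance = mean + alpha × mean and
reports alpha instead of the dispersion.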
Test for Overdispersion
> dispersiontest(pois_ofs)
Overdispersion test
data: pois_ofs
z = 2.9394, p-value = 0.001644
alternative hypothesis: true dispersion is greater than 1
sample estimates:
dispersion
4.784538
What does the test say? It rejects the null hypothesis and suggests a scaling of the
variance by the factor of 4.784. Next, we estimate the negative binomial model for
this data. In this model, we use log(Insured) as an independent variable (to match
the use of it as an offset in the Poisson Regression).
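A rough check for overdispersion that needs only base R is the Pearson chi-square statistic divided by the residual degrees of freedom. Note that this is a simpler estimate than the statistic behind the formal test above, and the data below are simulated for illustration (sample size, coefficients, and names are assumptions).

```r
set.seed(3)
n <- 500
x <- rnorm(n)
mu <- exp(1 + 0.3 * x)
y <- rnbinom(n, size = 1, mu = mu)   # overdispersed counts: Var = mu + mu^2

fit <- glm(y ~ x, family = poisson)

# Pearson-based dispersion estimate; values well above 1 suggest overdispersion
pearson_disp <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
print(pearson_disp)

# The quasi-Poisson family reports the same quantity as its dispersion
quasi_disp <- summary(glm(y ~ x, family = quasipoisson))$dispersion
print(quasi_disp)
```

An estimate near 1 is consistent with the Poisson assumption, while a clearly larger value points toward a negative binomial or quasi-Poisson model.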
This model is a much better fit, as shown by the p-value of 0.0246. That it is
better than the Poisson model can also be seen from the considerable reduction of
AIC from 325.83 for the Poisson model to 277.5 for the negative binomial model. We
can compute AIC using the definition −2 log Lm + 2 × (number of parameters). As we
are also estimating theta, a shape parameter, it is included in the number of
parameters when computing AIC. Thus, in the example above:
AIC = 253.493 + 2 × (11 + 1) = 277.49. However, at the 5% level it is still not a
good fit. In other words, we reject the hypothesis that the coefficients in the
saturated model that are not in the negative binomial model are all equal to zero.
Perhaps the poor fit is due to missing variables or to non-linearity in the
underlying variables. These issues can be further explored as detailed in the
exercises.
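The AIC arithmetic above can be written out in R. This sketch only re-uses the numbers quoted in the text (the variable names are illustrative).

```r
# The model output reports 2 x log-likelihood = -253.493,
# so -2 log L_m = 253.493
neg2_loglik <- 253.493

# 11 regression coefficients plus the shape parameter theta
k <- 11 + 1

aic <- neg2_loglik + 2 * k
print(aic)
```

For a fitted model object, the generic AIC() function returns the same quantity directly.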
4.1 Models for Zero Inflation: ZIP and ZINB Models and
Hurdle Models
In ZIP (zero-inflated Poisson) and ZINB (zero-inflated negative binomial) models,
the count variable and the excess zero values are generated by two different
processes, both regressed on the predictors. The two processes are a Poisson or
negative binomial count model (which could produce 0 counts) and a logit model
for excess zeros. In contrast, a hurdle model assumes that all zeros are generated
by one process and the positive counts are generated by a truncated Poisson or
negative binomial process. Which model to use will depend on the structural way
zeros arise and the design of the experiment. The models may lead to quite different
results and interpretations. We do not go into the details of these models here. The
exercises in the chapter illustrate the use of these models (Table 13.5).
We can use the "zeroinfl" function to run zero-inflation models in R. For the ZIP
and ZINB models, use "poisson" and "negbin," respectively, in the dist argument. The
sample R commands are:
zeroinfl(y ~ x1 + x2, data = inputdata, dist = "poisson",
link = "logit")
zeroinfl(y ~ x1 + x2, data = inputdata, dist = "negbin",
link = "logit")
Refer to the pscl package documentation to learn more about running zero-inflation
models (ZIP, ZINB) in R.
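Before fitting a zero-inflated model, a simple informal check is to compare the observed number of zeros with the number a fitted Poisson model would expect. The sketch below uses base R only and simulated data with deliberately added structural zeros; all names and parameter values are assumptions for illustration.

```r
set.seed(4)
n <- 400
x <- rnorm(n)

# About 30% of cases are structural zeros; the rest follow a Poisson model
structural_zero <- rbinom(n, size = 1, prob = 0.3)
y <- ifelse(structural_zero == 1, 0L, rpois(n, lambda = exp(0.8 + 0.3 * x)))

fit <- glm(y ~ x, family = poisson)

# Observed zeros versus zeros expected under the fitted Poisson means
observed_zeros <- sum(y == 0)
expected_zeros <- sum(dpois(0, lambda = fitted(fit)))
print(c(observed = observed_zeros, expected = round(expected_zeros, 1)))
```

A large excess of observed over expected zeros suggests that a ZIP, ZINB, or hurdle model is worth trying.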
>neg_bin<-glm.nb(Claims~Merit+Class+Premium+Cost+log(Insured),
+ data=canautoins,init.theta=3557)
> summary(neg_bin)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8544 -0.7172 0.1949 0.5656 1.4852
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.163e-01 5.104e-01 -1.403 0.16053
merit1 -2.034e-01 4.031e-02 -5.046 4.50e-07 ***
merit2 -2.829e-01 5.075e-02 -5.575 2.48e-08 ***
merit3 -2.142e-01 7.114e-02 -3.010 0.00261 **
class2 -6.703e-02 1.183e-01 -0.567 0.57082
class3 1.851e-01 8.816e-02 2.099 0.03579 *
class4 2.180e-01 8.312e-02 2.623 0.00873 **
class5 -2.310e-01 1.510e-01 -1.529 0.12625
Premium -2.435e-06 3.081e-06 -0.790 0.42941
Cost 4.824e-06 8.250e-06 0.585 0.55870
log(Insured) 8.969e-01 3.895e-02 23.025 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta: 3557
Std. Err.: 1783
2 x log-likelihood: -253.493
> with(neg_bin, cbind(res.deviance = deviance,df = df.residual,
+ p = pchisq(deviance, df.residual,lower.tail=FALSE)))
res.deviance df p
[1,] 19.07203 9 0.02458736
This chapter introduces count data regression where a response variable is a count
(taking values 0, 1, 2, . . .) which is regressed on a set of explanatory variables.
The basic models for such a regression—the Poisson regression and the negative
binomial regression—are introduced and discussed with examples. Methods of
measuring goodness of fit and validating the models are also discussed. The
problems of overdispersion in the Poisson model and of zero inflation are briefly
discussed and solutions to these problems are mentioned. Several excellent texts
are listed in the reference section for further reading, such as Cameron and Trivedi
(2013), Jackman (2006), Winkelmann (2015), Zeileis et al. (2008), and Simonoff
(2003).
All the datasets, code, and other material referred to in this chapter are available
at www.allaboutanalytics.net.
• Data 13.1: numclaim.csv
• Data 13.2: canautoins.csv
• Data 13.3: orsteindata.csv
• Code 13.1: count_data.R
• Data 13.4: Additional datasets are available on Jeff Simonoff’s website.5
Exercises
Ex. 13.1 You are given a sample of subjects randomly selected for an Italian study
on the relation between income and whether one possesses a travel credit card (such
as American Express or Diner’s Club). At each level of annual income in millions
of Lira (the currency in Italy before euro), the table indicates the number of subjects
sampled and the number of these subjects possessing at least one travel credit card.
Please refer to the data “creditcard.csv” available on the book’s website. The dataset
is taken from Pennsylvania State University.6
This example has information on individuals grouped by their income, the
number of individuals (cases) within that income group and number of credit cards.
Notice that the number of individuals is the frequency of the data point and not a
regressor.
(a) What is the estimated average rate of incidence, that is, the usage of credit cards
given the income?
(b) Is income a significant predictor?
(c) Does the overall model fit?
11, 2018.
6 https://onlinecourses.science.psu.edu/stat504/node/170. Accessed on Apr 15, 2018.
(d) How many credit cards do you expect a person with income of 120 million Lira
to have?
(e) Also test for overdispersion and zero inflation.
Ex. 13.2 Ornstein’s dataset (“orsteindata.csv”) is on interlocking directorates
among 248 dominant Canadian firms. The number of “interlocks” for each firm
is the number of ties that a firm maintained by virtue of its board members and top
executives also serving as board members or executives of other firms in the dataset.
This number is to be regressed on the firm’s “assets” (billions of dollars), “nation” of
control (Canada, the United States, the United Kingdom, or another country), and
the principal “sector” of operation of the firm (ten categories, including banking,
other financial institutions, heavy manufacturing, etc.). The asymmetrical nature of
the response and the large number of 0s make the data unsuitable for ordinary
least-squares regression. The response is a count.
To understand the coding of categorical variables, refer to the examples in the
Dummy Variable section in Chapter 7 (Linear Regression Analysis). In this exercise,
you can consider "CAN" (Canada) as the reference category for the "Nation" variable
and "AGR" (Agriculture) for the "Sector" variable.
(a) Fit a Poisson regression model for the number of interlocking director and
executive positions shared with other major firms. Examine its goodness of fit.
(b) Discuss the results from an economic point of view. Which variables are most
important in determining the number of interlocking director and executive
positions shared with other major firms?
(c) Fit a negative binomial and compare with Poisson model.
(d) Examine whether adjusting for zero inflation improves the model by fitting ZIP
and ZINB models.
(e) Compare the outputs of different models. Which metrics should we look at?
(f) Discuss which model is the best and why. Recommend further steps to improve
the model.
Ex. 13.3 Introduce all interaction terms between “Merit” and “Class” in the
Canadian Insurance model of Sect. 4. Run the Poisson regression with log(Insured)
as offset.
(a) Which interaction terms are significant?
(b) Do you see that this is the fully saturated model because there is only one
observation for every unique combination of Merit and Class?
(c) Rerun the model retaining only the significant interaction terms as well as all
the original variables. What would you conclude based on this investigation?
How does it help an insurance rating agency?
438 T. Krishnan
References
Bailey, R. A., & Simon, L. (1960). Two studies in automobile insurance rate-making. ASTIN
Bulletin, 1, 192–217.
Cameron, A. C., & Trivedi, P. K. (2013). Regression analysis of count data. Econometric Society
Monograph No. 53. Cambridge: Cambridge University Press.
Jackman, S. D. (2006). Generalized linear models. Thousand Oaks: Sage Publications.
Simonoff, J. S. (2003). Analyzing categorical data. New York: Springer.
http://people.stern.nyu.edu/jsimonof/AnalCatData/.
Winkelmann, R. (May 2015). Counting on count data models. Bonn: IZA World of Labor.
https://wol.iza.org.
Zeileis, A., Kleiber, C., & Jackman, S. (2008). Regression models for count data in R. Journal of
Statistical Software, 27, 1–25. http://www.jstatsoft.org/.
Chapter 14
Survival Analysis
Thriyambakam Krishnan
1 Introduction
T. Krishnan
Chennai Mathematical Institute, Chennai, India
e-mail: sridhar@illinois.edu
Survival analysis can provide valuable insight into patterns of customer behavior,
depending on customer profiles and key performance indicators, especially with
regard to churn, attrition, product purchase patterns, insurance claims, credit card
default, etc. It can be used to compute customer lifetime values as a function of
customers' past behavior and contributions to a business, which in turn can be used
to fine-tune campaigns. It can also be used to study organizational events such as
bankruptcy. The data required are a set of suitably selected cases for which
“lifetime” information (possibly censored, but with the censoring information
recorded) and information on possible drivers of such lifetimes are available.
Some specific examples of survival analysis are given below:
• Business bankruptcy (time to bankruptcy) analysis on the basis of explanatory
variables such as profitability, liquidity, leverage, efficiency, valuation ratio, etc.
A firm that is not bankrupt when data collection ends yields a censored
observation, which the analysis must treat as a firm that has not yet
failed (Lee 2014).
• Analysis of churn pattern in the telecom industry and impact of explanatory
variables like the kind of plan, usage, subscriber profile like age, gender,
household size, income, etc. on churn pattern. This information may be useful
to reduce churn (Lu and Park 2003).
• Analysis of lifespan of car insurance contracts in terms of car’s age, type of
vehicle, age of primary driver, etc. may be carried out using survival analysis
techniques to measure profitability of such contracts.
• Estimating a customer lifetime value (CLV) to a business on the basis of past
revenue from the customer and an estimate of their survival probabilities based on
their profile is a standard application of survival analysis techniques and results.
This type of analysis is applicable to many types of business and helps plan
different campaign strategies depending on the estimated lifetime value.
Survival times are follow-up times from a defined starting point to the occurrence of
a given event. Some typical examples are the time from the beginning of a
customership to churning; from the issue of a credit card to the first default; and
from the beginning of an insurance policy to the first claim. Standard statistical
techniques do not apply because the underlying distribution is rarely normal and
the data are often “censored.”
A survival time is called “censored” when there is a follow-up time but the
defined event has not yet occurred or is not known to have occurred. In the examples
above, the survival time is censored if, at the end of the study, the customer is
still transacting, the credit card customer has not defaulted, or the insurance
policy holder has not made a claim. The concepts, terminology, and
methodology of survival analysis originate in medical and engineering applications,
where the prototype events are death and failure, respectively. Hence, terms such as
lifetime, survival time, response time, death, and failure are current in the subject
of survival analysis. The scope of applications is wider and includes business
problems such as customer churn and employee attrition. In engineering these methods
are called reliability analysis; in sociology they are known as event-history analysis.
As opposed to survival analysis, regression analysis considers uncensored data
(or simply ignores censoring). Logistic regression models the proportion of events
in groups for various values of predictors or covariates; it ignores time. Survival
analysis accounts for censored observations as well as time to event. Survival
models can also handle time-varying covariates (TVCs).
Censored observations are incorporated into the likelihood (or, for that matter, into
other approaches as well) through the probability $P(T \ge t_c)$ that the lifetime $T$
is at least the censoring time $t_c$, whereas uncensored observations are
incorporated into the likelihood through the density of the survival time. This idea
is illustrated below.
Suppose the lifetime $T$ has an exponential distribution with rate $\lambda$ and density
function $f(t \mid \lambda) = \lambda e^{-\lambda t}$. If an observation $t$ is censored, its
contribution to the likelihood is $P(T \ge t) = e^{-\lambda t}$. If an observation $t$ is
uncensored, its contribution to the likelihood is $\lambda e^{-\lambda t}$. Suppose
$t_1, t_2$ are censored and $u_1, u_2, u_3$ are uncensored; then the likelihood function is
$$L(\lambda) = e^{-\lambda t_1}\, e^{-\lambda t_2}\, \lambda e^{-\lambda u_1}\, \lambda e^{-\lambda u_2}\, \lambda e^{-\lambda u_3} = \lambda^{3} e^{-\lambda (t_1 + t_2 + u_1 + u_2 + u_3)},$$
maximizing which gives the maximum likelihood estimate of the parameter of the
survival density.
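The censored exponential likelihood above can be checked numerically. The following is a minimal Python sketch, not the chapter's own code: the follow-up times are hypothetical, and the function names `exp_log_likelihood` and `exp_mle` are introduced only for illustration. Setting the derivative of the log-likelihood to zero gives the closed form "number of observed events divided by total follow-up time", which the sketch compares against nearby values of the rate.

```python
import math


def exp_log_likelihood(lam, censored, uncensored):
    """Log-likelihood of an exponential(lam) lifetime model with right censoring."""
    # Censored cases contribute log P(T >= t) = -lam * t.
    ll = sum(-lam * t for t in censored)
    # Uncensored cases contribute log f(t) = log(lam) - lam * t.
    ll += sum(math.log(lam) - lam * u for u in uncensored)
    return ll


def exp_mle(censored, uncensored):
    """Closed-form maximizer: observed events / total follow-up time."""
    return len(uncensored) / (sum(censored) + sum(uncensored))


# Hypothetical follow-up times in months (illustrative only, not from the book).
censored = [10.0, 14.0]        # t1, t2
uncensored = [3.0, 5.0, 8.0]   # u1, u2, u3

lam_hat = exp_mle(censored, uncensored)
print(lam_hat)
for lam in (0.06, lam_hat, 0.09):
    print(lam, exp_log_likelihood(lam, censored, uncensored))
```

Note that the censored times enter only the denominator: they add exposure time without adding events, which is exactly how censoring is "accounted for" rather than ignored.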
Increasing hazard: A customer who has continued for 2 years is more likely to
attrite than one that has stayed 1 year
Decreasing hazard: A customer who has continued for 2 years is less likely to
attrite than one that has stayed 1 year
Flat hazard: A customer who has continued for 2 years is no more or less likely
to attrite than one that has stayed 1 year
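The three hazard shapes can be illustrated with a single parametric family. The sketch below, in Python, assumes the Weibull hazard $h(t) = k\lambda^{k} t^{k-1}$ (the form listed with the exercises at the end of this chapter) and compares the hazard after 12 and 24 months of tenure. The rate `lam` and the three shape values are hypothetical choices for illustration only.

```python
def weibull_hazard(t, lam, k):
    """Weibull hazard h(t) = k * lam**k * t**(k - 1), for t > 0."""
    return k * lam**k * t ** (k - 1)


def hazard_shape(lam, k, t_early=12.0, t_late=24.0):
    """Compare the hazard after 1 year and after 2 years of tenure (in months)."""
    early = weibull_hazard(t_early, lam, k)
    late = weibull_hazard(t_late, lam, k)
    if late > early:
        return "increasing"
    if late < early:
        return "decreasing"
    return "flat"


# Hypothetical rate parameter; only the shape parameter k changes the trend.
lam = 0.05
shapes = {k: hazard_shape(lam, k) for k in (0.5, 1.0, 1.5)}
print(shapes)
```

With shape `k = 1` the Weibull reduces to the exponential distribution, which is the flat-hazard case.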
Once we have collected time-to-event data, our first task is to describe it, usually
graphically using a survival curve. Visualization allows us to appreciate
temporal patterns in the data. If the survival curve is sufficiently regular, it can
help us identify an appropriate distributional form for the survival time. If the
data are consistent with a parametric form of the distribution, then its parameters
can be estimated to describe the survival pattern efficiently, and statistical
inference can be based on the chosen distribution by specifying a parametric model
for h(t) derived from a particular density function f(t). Otherwise,
when no such parametric model can be conceived, an empirical estimate of the
survival function can be developed (i.e., nonparametric estimation). Parametric
models usually assume some shape for the hazard rate (e.g., flat or monotonic).
Suppose there are no censored cases in the dataset. Then let t1 , t2 , . . . , tn be the
event-times (uncensored) observed on a random sample. The empirical estimate of
the survival function, Ŝ(t), is the proportion of individuals with event-times greater
than t.
$$\hat{S}(t) = \frac{\text{Number of event-times} > t}{n}. \qquad (14.1)$$
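Equation (14.1) translates directly into code. The Python sketch below uses a small set of hypothetical uncensored event-times (not data from the book) and evaluates the empirical survival function at a few time points; the function name `empirical_survival` is introduced here for illustration.

```python
def empirical_survival(event_times, t):
    """Proportion of observed event-times strictly greater than t (Eq. 14.1)."""
    n = len(event_times)
    return sum(1 for x in event_times if x > t) / n


# Hypothetical uncensored event-times in months (illustrative only).
event_times = [2, 5, 5, 9, 12]
curve = [(t, empirical_survival(event_times, t)) for t in (0, 2, 5, 9, 12)]
print(curve)
```

The estimate starts at 1, drops at each event-time by the fraction of cases failing there, and reaches 0 at the largest event-time, which is only appropriate when no case is censored.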
When there is censoring, Ŝ(t) in Eq. (14.1) is not a good estimate of the true S(t); so other
nonparametric methods must be used to account for censoring. Some of the standard
methods are:
1. Kaplan–Meier method
2. Life table method, and
3. Nelson–Aalen method
We discuss only the Kaplan–Meier method in this chapter. For the life table method,
one can consult Diener-West and Kanchanaraksa, and for the Nelson–Aalen method,
one may consult the notes provided by Ronghui (Lily) Xu.
The Kaplan–Meier method is also known as the Product-Limit formula, as will be
evident when the method is described. It accounts for censoring and generates the
characteristic “staircase” survival curves, which give an intuitive graphical
representation of survival over time. The method is based on individual event-times
and censoring information.
The survival curve is defined as the probability of surviving for a given length
of time while considering time in intervals dictated by the data. The following
assumptions are made in this analysis:
• At any time, cases that are censored have the same survival prospects as those
who continue to be followed.
• Censoring is independent of event-time (i.e., the reason an observation is
censored is unrelated to the time of censoring).
• The survival probabilities are the same for subjects recruited early and late in the
study.
• The event happens at the time specified.
The method involves computing probabilities of occurrence of events at
certain points of time dictated by when events occur in the dataset. These are
conditional probabilities of occurrence of events in certain intervals. We multiply
the successive conditional probabilities of survival to get the final estimate of the
marginal probabilities of survival up to these points of time.
With censored data, Eq. (14.1) needs modification since the number of event-times
> t will not be known exactly. Suppose that out of the n event-times, there are k
distinct times $t_1, t_2, \ldots, t_k$. Let event-time $t_j$ repeat $d_j$ times. Besides
the event-times $t_1, t_2, \ldots, t_k$, there are also censoring times of cases whose
event-times are not observed. The Kaplan–Meier or Product-Limit (PL) estimator of
survival at time t is
$$\hat{S}(t) = \prod_{j:\, t_j \le t} \frac{r_j - d_j}{r_j} \quad \text{for } 0 \le t \le t^{+}, \qquad (14.2)$$
where $r_j$ is the number of cases at risk (neither failed nor censored) just before
time $t_j$.
The aim of the study in this example is to evaluate attrition rates of employees of
a company. Data were collected over 30 years on n = 23 employees. Follow-up times
differ across employees because of different starting dates of employment. The
number of months with the company is given below, where + indicates
still employed (censored):
6, 12, 21, 27, 32, 39, 43, 43, 46+, 89, 115+, 139+, 181+, 211+,
217+, 261, 263, 270, 295+, 311, 335+, 346+, 365+
The same data are available as “employ.csv” on the book’s website. The
following is the data dictionary.
Variable   Description
ID         Unique ID of the employee
att        1 if uncensored (attrition observed) and 0 if censored (still employed)
months     Number of months the employee worked in the company
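The Product-Limit calculation for these 23 employees can be sketched in a few lines. The Python code below is an illustrative implementation, not the book's R code: it takes r_j to be the number of cases still at risk just before each event-time (cases censored at a given time are assumed to remain at risk up to and including that time), and the function name `kaplan_meier` is introduced here for illustration. The `months` and `att` lists transcribe the data listed above.

```python
def kaplan_meier(times, events):
    """Return [(t_j, S(t_j))] at each distinct event-time (Product-Limit estimate)."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = sum(1 for tt, e in data if tt == t and e == 1)  # events at time t
        leaving = sum(1 for tt, _ in data if tt == t)       # events + censorings at t
        if d > 0:
            # Conditional probability of surviving past t, given at risk at t.
            surv *= (n_at_risk - d) / n_at_risk
            curve.append((t, surv))
        n_at_risk -= leaving
        i += leaving
    return curve


# Employee data from the text: months with the company; att = 1 if the employee
# left (uncensored), 0 if still employed (censored).
months = [6, 12, 21, 27, 32, 39, 43, 43, 46, 89, 115, 139, 181, 211,
          217, 261, 263, 270, 295, 311, 335, 346, 365]
att = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0,
       0, 1, 1, 1, 0, 1, 0, 0, 0]

km = kaplan_meier(months, att)
for t, s in km:
    print(t, round(s, 4))
```

The curve drops only at the 12 distinct event-times; censored cases never cause a drop but shrink the risk set, so later events produce larger steps.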
What happens when you have several covariates that you believe contribute to
survival? For example, in job attrition data, gender, age, etc. may be such covariates.
In that case, we can use stratified KM curves, that is, different survival curves for
different levels of a categorical covariate, possibly drawn in the same frame. Another
approach is the Cox proportional hazards model.
Of all survival analysis functions, the hazard function captures the essence of the
time process. Survival analysis imposes a regression-like structure on the hazard
function h(t). Being a rate, h(t) must be positive, with no upper bound. To achieve
this, h(t) is formulated as $h(t) = e^{\beta}$. Covariates (explanatory variables) $x$ (a
vector with components $(1, x_1, x_2, \ldots, x_p)$) are included additively on the log scale.
Formulation:
$$\log[h(t, x)] = \beta^{T} x = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p$$
or
$$h(t, x) = e^{\beta^{T} x}. \qquad (14.3)$$
$$h(t, x, \beta) = h_0(t)\, r(x, \beta)$$
is such a formulation. Here $h_0(t)$ describes how the hazard function changes over
time, and $r(x, \beta)$ describes how the hazard function changes with the covariates.
It is necessary that $h(t, x, \beta) > 0$. Then $h(t, x, \beta) = h_0(t)$ when
$r(x, \beta) = 1$. $h_0(t)$ is called the baseline hazard function, a generalization of
the intercept in regression.
$h_0(t)$ is the baseline hazard rate when $X = \bar{0} = (0, 0, \ldots, 0)$; this
serves as a convenient reference point although an individual with $X = \bar{0}$ may
not be a realistic one. The hazard ratio (HR) between two cases with covariates
$x_1$, $x_2$ is given by
$$HR(t, x_1, x_2) = \frac{r(x_1, \beta)}{r(x_2, \beta)}$$
and does not depend on $h_0(t)$. Cox proposed the form $r(x, \beta) = e^{x^{T}\beta}$ so that
$h(t, x, \beta) = h_0(t)\, e^{x^{T}\beta}$. Then $HR(t, x_1, x_2) = e^{(x_1 - x_2)^{T}\beta}$.
This is called the Cox proportional hazards (PH) model.
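$$\log\left[\frac{h(t \mid X)}{h(t)}\right] = X^{T}\beta = \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p,$$
where $h(t)$ denotes the baseline hazard. The model can also be written as
$h(t \mid X) = h(t)\, e^{X^{T}\beta}$, or in terms of the survival function as
$S(t \mid X) = [S(t \mid X = \bar{0})]^{e^{X^{T}\beta}}$. Predictor effects are the same
for all $t$. No assumptions are made on the forms of $S$, $h$, $f$.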
The hazard rate in PH models increases or decreases as a function of the
covariates associated with each unit. The PH property implies that absolute
differences in x imply proportionate differences in the hazard rate at each t. For
some $t = \bar{t}$, the ratio of hazard rates for two units i and j with vectors of
covariates $x_i$ and $x_j$ is:
$$\frac{h(\bar{t}, x_i)}{h(\bar{t}, x_j)} = e^{(x_i - x_j)\beta}.$$
Because the baseline hazards cancel in this ratio, the hazard rate for unit i is
$e^{(x_i - x_j)\beta}$ times that of unit j. Importantly,
the right-hand side of the equation does not depend on time, i.e., the proportional
difference in the hazard rates of these two units is fixed across time. Put differently,
the effects of the covariates in PH models are assumed to be fixed across time.
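The time-invariance of the hazard ratio can be verified numerically. In the Python sketch below, the coefficient vector, the two covariate profiles, and the baseline hazard are all hypothetical; any positive baseline function would give the same ratio, which is the point of the proportional hazards property.

```python
import math


def linear_predictor(x, beta):
    """Inner product of a covariate vector and the coefficient vector."""
    return sum(b * v for b, v in zip(beta, x))


def hazard(t, x, beta, baseline):
    """PH hazard: baseline hazard at t scaled by exp(x' beta)."""
    return baseline(t) * math.exp(linear_predictor(x, beta))


def hazard_ratio(x_i, x_j, beta):
    """exp((x_i - x_j)' beta); the baseline hazard does not appear."""
    diff = [a - b for a, b in zip(x_i, x_j)]
    return math.exp(linear_predictor(diff, beta))


# Hypothetical coefficients, covariate profiles, and baseline (illustrative only).
beta = [0.4, -0.2]
x_i = [1.0, 3.0]
x_j = [0.0, 5.0]


def baseline(t):
    return 0.01 * t ** 0.5  # any positive function of t would do


ratios = [hazard(t, x_i, beta, baseline) / hazard(t, x_j, beta, baseline)
          for t in (1.0, 12.0, 60.0)]
print(ratios, hazard_ratio(x_i, x_j, beta))
```

The individual hazards change with t, but every entry of `ratios` equals the single number returned by `hazard_ratio`.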
Estimates of the β’s are generally obtained using the method of maximum
partial likelihood, a variation of the maximum likelihood method. Partial likelihood
is based on factoring the likelihood function using the multiplication rule of
probability and discarding certain portions that involve nuisance parameters. If
a particular regression coefficient βj is zero, then the corresponding explanatory
variable, Xj , is not associated with the hazard rate of the response; in that case,
Xj may be omitted from any final model for the observed data. The statistical
significance of explanatory variables is assessed using Wald tests or, preferably,
likelihood ratio tests. The Wald test is an approximation to the likelihood ratio test.
The likelihood is approximated by a quadratic function, an approximation which is
generally quite good when the model fits the data.
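To make the partial likelihood concrete, the Python sketch below evaluates the log partial likelihood for a single covariate under the simplifying assumption that there are no tied event-times. Each observed event contributes the log of its own risk score divided by the sum of risk scores over everyone still at risk at that time. The data and the function name `cox_log_partial_likelihood` are hypothetical; packaged routines such as R's coxph handle ties, multiple covariates, and the maximization itself.

```python
import math


def cox_log_partial_likelihood(beta, times, events, x):
    """Log partial likelihood for one covariate, assuming no tied event-times."""
    total = 0.0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if e_i != 1:
            continue  # censored cases appear only inside risk sets
        # Risk set: every case still under observation at time t_i.
        risk = [math.exp(beta * x[j]) for j, t_j in enumerate(times) if t_j >= t_i]
        total += beta * x[i] - math.log(sum(risk))
    return total


# Hypothetical data: follow-up time, event indicator (1 = event), one covariate.
times = [2, 3, 5, 7, 11]
events = [1, 1, 0, 1, 1]
x = [1, 0, 1, 0, 0]

ll_zero = cox_log_partial_likelihood(0.0, times, events, x)
ll_half = cox_log_partial_likelihood(0.5, times, events, x)
print(ll_zero, ll_half)
```

No baseline hazard appears anywhere in the function: only the ordering of the times and the covariate values matter, which is why the regression coefficients can be estimated without specifying the baseline.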
In PH regression, the baseline hazard component, h(t) vanishes from the partial
likelihood. We only obtain estimates of the regression coefficients associated with
the explanatory variables. Notice that h(t) = h(t|x) = β0 . Take the case of a
A parametric survival model completely specifies h(t) and S(t) and hence is
more consistent with a theoretical S(t). It makes time-quantile prediction possible.
However, specifying the underlying model S(t) correctly can be difficult.
On the other hand, the Cox PH model, a semiparametric one, leaves
the distribution of survival time unspecified and hence may be less consistent with
a theoretical S(t); an advantage of the Cox model is that the baseline hazard is not
needed for estimating hazard ratios.
A semiparametric model has only the regression coefficients as parameters and is
useful if only the study of the role of the explanatory variables is of importance. In a
full parametric model, besides the role of the explanatory variables, survival curves
for each profile of explanatory variables can be obtained.
Fully parameterized models have several advantages: maximum likelihood
estimates (MLEs) can be computed; the estimated coefficients or their transforms
may provide useful business information; the fitted values can provide survival time
estimates; and residual analysis can be done for diagnosis.
Many theoretical specifications are used based on the form of S(t) (or f(t)) in
survival analysis. Some of them are: Weibull, log-normal, log-logistic, generalized
gamma, etc.
The regression outputs of a semiparametric model and a fully parametric model are
not directly comparable, although one may compare the relative and absolute
significance (p-values) of the various regressors. However, using the form of the
parametric model's h(t), it is possible to establish a relationship between the
parametric model’s regression coefficients and the Cox regression coefficients.
A parametric model is often called the accelerated failure time model (AFT
model) because, according to this model, the effect of an explanatory variable is
to accelerate (or decelerate) the lifetime by a constant factor, as opposed to, say,
the Cox proportional hazards model, wherein the effect of an explanatory variable is
to multiply the hazard by a constant.
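For the Weibull distribution the two views coincide, which gives one concrete instance of the relationship between AFT and PH coefficients mentioned above. The Python sketch below assumes a baseline survival function of the form S0(t) = exp(-lam * t**p) and a single covariate; all parameter values are hypothetical. Stretching time by exp(gamma * x) (the AFT view) is then the same as multiplying the hazard by exp(-p * gamma * x) (the PH view).

```python
import math


def weibull_survival(t, lam, p):
    """Baseline Weibull survival S0(t) = exp(-lam * t**p)."""
    return math.exp(-lam * t ** p)


# Hypothetical baseline parameters, AFT coefficient, and covariate value.
lam, p = 0.02, 1.5
gamma = 0.3
x = 1.0


def aft_survival(t):
    """AFT view: the covariate rescales time by the factor exp(gamma * x)."""
    return weibull_survival(t / math.exp(gamma * x), lam, p)


def ph_survival(t):
    """PH view: hazard multiplied by exp(-p * gamma * x), so S0 is raised to that power."""
    return weibull_survival(t, lam, p) ** math.exp(-p * gamma * x)


for t in (1.0, 10.0, 50.0):
    print(t, aft_survival(t), ph_survival(t))
```

Under these assumptions a positive AFT coefficient (longer lifetimes) corresponds to a negative PH coefficient (lower hazard), with the Weibull shape p as the conversion factor. This equivalence is specific to the Weibull family and does not hold for, say, the log-logistic model.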
4 A Case Study
We analyze the churn data to fit a Cox PH model (semiparametric model). The
results are provided in Table 14.3. The output consists of two tables: the first
contains the regression coefficients, the exponentiated coefficients (which are the
estimated hazard ratios), standard errors, z tests, and the corresponding p-values;
the second contains the exponentiated coefficients along with their reciprocals and
their 95% confidence intervals.
> library(survival)  # provides Surv(), coxph(), and survfit()
> # churn is assumed to be a data frame already loaded from churn.csv
> # tenure_month is the follow-up time and dead_flag the event indicator
> churncoxph <- coxph(Surv(tenure_month, dead_flag) ~
    ptp_months + unsub_flag + ce_score + items_Home + items_Kids +
    items_Men + items_women + avg_ip_time + returns + acq_sourcePaid +
    acq_sourceReferral + mobile_site_user + business_name +
    redeemed_exposed + refer_invite + avg_ip_time_sq + revenue_per_month,
    data = churn)
> summary(churncoxph)  # coefficients, hazard ratios, confidence intervals
> # Risk scores of the first six cases
> predict(churncoxph, newdata = churn[1:6, ], type = "risk")
From the output, the estimated hazard ratio for business_nameKids vs
business_nameHome is under the column “exp(coef)”, which is 1.8098 with 95% CI
(1.7618, 1.8591). Similarly, exp(-coef) provides the estimated hazard ratio for
business_nameHome vs business_nameKids, which is 0.5525 (the reciprocal of 1.8098).
For continuous variables, exp(coef) is the estimated hazard ratio for a one-unit
increment in x, “(x+1)” vs “x”, and exp(-coef) gives the ratio for “x” vs “(x+1)”.
From the table, the concordance is 0.814, which is high and thus indicates
a good fit.
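The arithmetic behind these interpretations is short enough to check by hand or in code. The Python sketch below starts from the hazard ratio and confidence interval quoted above; the helper `hazard_ratio_for_change` is a hypothetical name introduced to show how the ratio scales when a continuous covariate changes by more than one unit.

```python
import math

# Values quoted in the text for business_nameKids vs business_nameHome.
hr_kids_vs_home = 1.8098
ci_kids_vs_home = (1.7618, 1.8591)

coef = math.log(hr_kids_vs_home)    # the regression coefficient behind exp(coef)
hr_home_vs_kids = math.exp(-coef)   # reverse comparison, equal to 1 / 1.8098

# Reversing the comparison inverts the interval and swaps its endpoints.
ci_home_vs_kids = (1 / ci_kids_vs_home[1], 1 / ci_kids_vs_home[0])


def hazard_ratio_for_change(coef, delta):
    """Hazard ratio for a change of `delta` units in a covariate."""
    return math.exp(coef * delta)


print(round(hr_home_vs_kids, 4), ci_home_vs_kids)
print(hazard_ratio_for_change(coef, 2))  # two-unit change: ratio is squared
```

Because the model is linear on the log-hazard scale, a change of delta units multiplies the hazard by exp(coef) raised to the power delta, not by delta times exp(coef).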
Besides interpreting the significance or otherwise of the explanatory variables
and their relative use in predicting hazards, the output is useful in computing the
relative risk of two explanatory variable profiles, or the relative risk with respect
to the average profile, i.e., $e^{(X_i - X_j)^{T}\beta}$, where $X_i$ contains a particular
observation and $X_j$ contains the average values. The relative risks of the first six
cases with respect to the average profile are: 3.10e-11, 0.60, 0.0389, 1.15, 0.196,
and 0.182 (refer to Table 14.3 for the β values). We can compute the survival
estimates of the fitted model and obtain the Cox adjusted survival curve.
> summary(survfit(churncoxph))
> plot(survfit(churncoxph), main = "Estimated Survival Function by PH model",
    ylab = "Proportion not churned")
Table 14.3 Cox PH model output
Now we analyze the same data to fit the log-logistic parametric model. A simple
way of stating the log-logistic model is by failure odds:
$$\frac{1 - S(t)}{S(t)} = \lambda t^{p}$$
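Solving the failure-odds equation for S(t) gives S(t) = 1 / (1 + λ t^p), and setting the odds equal to 1 gives the median survival time (1/λ)^(1/p). The Python sketch below checks both relations; the parameter values are hypothetical and are not estimates from the churn data.

```python
def loglogistic_survival(t, lam, p):
    """S(t) implied by failure odds (1 - S) / S = lam * t**p."""
    return 1.0 / (1.0 + lam * t ** p)


def loglogistic_median(lam, p):
    """Time at which the failure odds equal 1, i.e., S(t) = 0.5."""
    return (1.0 / lam) ** (1.0 / p)


# Hypothetical parameter values (illustrative only).
lam, p = 0.001, 2.0

t_med = loglogistic_median(lam, p)
s_med = loglogistic_survival(t_med, lam, p)
s_12 = loglogistic_survival(12.0, lam, p)
odds_12 = (1.0 - s_12) / s_12
print(t_med, s_med, odds_12)
```

On the log scale the failure odds are linear in log t, with slope p and intercept log λ, which is what makes this model convenient to state through odds.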
Next, we fit a Weibull parametric model on the same data. In the Weibull model,
$$S(t) = e^{-\lambda t^{p}}$$
For new data, any number of quantiles (importantly the 0.5 quantile, the median)
of survival times can be predicted for input cases of regressors, effectively predicting
the survival curves. The following is an example of 0.1, 0.5, 0.9 quantiles for the
first ten cases in the dataset from the above model (Table 14.7). From the predicted
values the median time for the first observation is 4180.3 months and for the second
observation it is only 33.0 months. You can similarly interpret other values.
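Quantile prediction follows from inverting the fitted survival function. The Python sketch below assumes the Weibull form S(t) = exp(-λ t^p) given above and takes the q-quantile to be the time by which a fraction q of cases is expected to have failed, so that S(t_q) = 1 - q. The parameter values are hypothetical; in a fitted regression model λ would depend on each case's covariates.

```python
import math


def weibull_survival(t, lam, p):
    """Weibull survival S(t) = exp(-lam * t**p)."""
    return math.exp(-lam * t ** p)


def weibull_quantile(q, lam, p):
    """Time t_q with S(t_q) = 1 - q, obtained by inverting S(t)."""
    return (-math.log(1.0 - q) / lam) ** (1.0 / p)


# Hypothetical parameters for a single covariate profile (illustrative only).
lam, p = 0.01, 1.2

quantiles = {q: weibull_quantile(q, lam, p) for q in (0.1, 0.5, 0.9)}
for q, t_q in quantiles.items():
    print(q, round(t_q, 2), round(weibull_survival(t_q, lam, p), 4))
```

Reading off several quantiles for one profile traces out that profile's predicted survival curve, which is how a table of 0.1, 0.5, and 0.9 quantiles can be interpreted.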
5 Summary
This chapter introduces the concepts and some of the basic techniques of survival
analysis. It covers a nonparametric method of estimating a survival function called
the Kaplan–Meier method, a semiparametric method of relating a hazard function
to covariates in the Cox proportional hazards model, and a fully parametric method
of relating survival time to covariates in terms of a regression as well as estimating
quantiles of survival time distributions for various profiles of the covariate values.
Survival analysis computations can be easily carried out in R with specialized
packages such as survival (whose survreg function fits parametric models) and
KMsurv, among many others; worked examples are also published on RPubs.
Several textbooks provide the theory and explanations of the methods in detail.
These include Gomez et al. (1992), Harrell (2001), Kleinbaum and Klein (2005),
Hosmer et al. (2008), Klein and Moeschberger (2003), Lawless (2003), Sun (2006),
Springate (2014), as well as websites given in the references.
All the datasets, code, and other material referred to in this section are available at
www.allaboutanalytics.net.
• Data 14.1: churn.csv
• Data 14.2: employ.csv
• Data 14.3: nextpurchase.csv
• Code 14.1: Survival_Analysis.R
Exercises
The data file nextpurchase.csv (refer to the website for the dataset) relates to the
purchase of fertilizers from a store by various customers. Each row relates to a
customer. The study is an analysis of “time-to-next-purchase” starting from the
previous purchase of fertilizers. “Censoring” is 0 if the customer has not returned
for another purchase of fertilizer since the previous one, and 1 if the customer has
returned for such a purchase. “Days” is the number of days since the
last purchase (could be a censored observation). “Visits” is the number of visits
to the shop in the year, not necessarily for the purchase of fertilizer. “Purchase”
is the amount of all purchases (in $) during the current year so far. “Age” is the
customer’s age in completed years. “Card” is 1 if the customer used a credit card; else 0.
Ex. 14.1 Without taking into account the covariates, use the Kaplan–Meier method
to draw a survival curve for these customers.
Ex. 14.2 Fit the Weibull parametric model and predict the 0.1, 0.2, ..., 0.9 quantiles
(i.e., from 0.1 to 0.9 in steps of 0.1) for a customer aged 45, who uses a credit card,
who has spent $100 during the year so far, and who has visited the shop four times in
the year so far (not necessarily to purchase fertilizers).
Ex. 14.3 Rework the parametric Weibull exercise using the log-logistic parametric
model.
Ex. 14.4 Rework the parametric Weibull exercise using the Cox PH model.
Useful functions for the Weibull distribution (you need not know these to run
this model):
Density: $f(t) = k\lambda^{k} t^{k-1} e^{-(\lambda t)^{k}}$; Survival: $S(t) = e^{-(\lambda t)^{k}}$; Hazard: $h(t) = \lambda^{k} k\, t^{k-1}$.
References
Gomez, G., Julia, O., Utzet, F., & Moeschberger, M. L. (1992). Survival analysis for left censored
data. In J. P. Klein & P. K. Goel (Eds.), Survival analysis: State of the art (pp. 269–288).
Boston: Kluwer Academic Publishers.
Harrell, F. E. (2001). Regression modeling strategies: With applications to linear models, logistic
regression, and survival analysis (2nd ed.). New York: Springer.
Hosmer, D. W., Jr., Lemeshow, S., & May, S. (2008). Applied survival analysis: Regression
modeling of time to event data (2nd ed.). Hoboken, NJ: Wiley.
Klein, J. P., & Moeschberger, M. L. (2003). Survival analysis: Techniques for censored and
truncated data (2nd ed.). New York: Springer.
Kleinbaum, D. G., & Klein, M. (2005). Survival analysis: A self-learning text (2nd ed.). New York:
Springer.
Lagakos, S. W. (1979). General right censoring and its impact on the analysis of survival data.
Biometrics, 139–156.
Lawless, J. F. (2003). Statistical models and methods for lifetime data (2nd ed.). Hoboken, NJ:
Wiley.
Lee, M.-C. (2014). Business bankruptcy prediction based on survival analysis approach.
International Journal of Computer Science & Information Technology (IJCSIT), 6(2), 103.
https://doi.org/10.5121/ijcsit.2014.6207.
Lu, J., & Park, O. (2003). Modeling customer lifetime value using survival analysis: An
application in the telecommunications industry. Data Mining Techniques, 120–128.
http://www2.sas.com/proceedings/sugi28/120-28.pdf.
Springate, D. (2014). Survival analysis: Modeling the time taken for events to occur. RPubs by
RStudio. https://rpubs.com/daspringate/survival.
Sun, J. (2006). The statistical analysis of interval censored failure time data. New York: Springer.
Chapter 15
Machine Learning (Unsupervised)
Shailesh Kumar
We live in the age of data. This data is emanating from a variety of natural
phenomena, captured by different types of sensors, generated by different business
processes, or resulting from individual or collective behavior of people or systems.
This observed sample data (e.g., the falling of the apple) contains a view of reality
(e.g., the laws of gravity) that generates it. In a way, reality does not know any other
way to reveal itself but through the data we can perceive about it.
The goal of unsupervised learning is essentially to “reverse engineer” as much of
this reality as possible from the data we can sample from it. In this chapter, we will explore
unsupervised learning—an important paradigm in machine learning—that helps
uncover the proverbial needle in the haystack, discover the grammar of the process
that generated the data, and exaggerate the “signal” while ignoring the “noise” in it.
In particular, we will explore methods of projection, clustering, density estimation,
itemset mining, and network analysis—some of the core unsupervised learning
frameworks that help us perceive the data in different ways and hear the stories
it is telling about the reality it is sampled from. The examples, corresponding code,
and exercises for the chapter are given in the online appendices.
1 Introduction
The most elementary and valuable statement in Science, the beginning of Wisdom is—‘I do
not know’
—Star Trek.
S. Kumar
Reliance Jio, Navi Mumbai, Maharashtra, India
e-mail: skumar.0127@gmail.com
Some of these tasks require us to just observe the data in various ways and find
structures and patterns in it. Here there is no “mapping” from some input to some
output. Here we are just given a lot of data and asked to find something “interesting”
in it, to reveal insights from the data that we might not be aware of. For example, one
might find product bundles that “go together” in retail point-of-sale data or the fact
that age, income, and education are correlated in census data. The art and science
of finding such structures in data without any particular end use-case in mind falls
under unsupervised learning. Here we are just “reading the book” of the data and
not “trying to answer a specific question” about the data. It is believed that in early
childhood, most of what our brain does is unsupervised learning. For example:
• Repeated Patterns: when a baby hears the same set of sounds over and over
again (e.g., “no”), it learns that this sound seems important and creates and
stores a pattern in the brain to recognize that sound whenever it comes. It may
not “understand” what the sound means but registers it as important because of
repetition. The interpretation of this pattern might be learnt later as it grows.
• Sequential patterns: a child might register the fact that a certain event (e.g.,
ringing of a doorbell) is typically followed by another event (e.g., someone opens
the door). This sequential pattern learning is key to how we pick up music, art,
and language (mother tongue) even without understanding its grammar but by
simply observing these sequential patterns over and over.
• Co-occurrence patterns: a child might recognize that two things always seem
to co-occur (e.g., whenever she sees eyes, she also sees nose, ear, and
mouth). A repeated co-occurrence of same objects in the same juxtaposition leads
to the recognition of a higher order object (e.g., the face).
In all these patterns, the grammar of the data is being learnt for no specific
purpose except that it is there.
Supervised learning, on the other hand, is a mapping from a set of observed
features to either a class label (classification paradigm) or a real value (regression
paradigm) or a list of items (recommendation or retrieval paradigm), etc. Here we
deliberately learn a mapping between one set of inputs (e.g., a visual pattern on
a paper) and an output (e.g., this is letter “A”). This mapping is used both in
interpreting and assigning names (or classes) to the patterns we have learnt (e.g.,
the sound for “dad” and “mom”) as a baby in early childhood, which now are
interpreted to mean certain people, or to the visual patterns one has picked up in
childhood which are now given names (e.g., “this is a ball,” “chair,” “cat”), etc.
This mapping is also used for learning cause (a disease) and effect (symptoms)
relationships or observation (e.g., customer is not using my services as much as
before) and prediction (e.g., customer is about to churn) relationships. A whole suite
of supervised learning paradigms is discussed in the next chapter. In this chapter we
will focus only on unsupervised learning paradigms.
462 S. Kumar
$$J(m \mid X) = \sum_{n=1}^{N} |m - x_n|$$
Modification: Now the above objective function makes intuitive sense, but it is
not easy to optimize from a mathematical perspective. Hence, we come up with
a more “solvable” or “cleaner” version of the same function. In this case, we want
to make it “differentiable” and “convex.” The following objective function is also
known as the sum of squared errors (SSE).
$$J(m \mid X) = \sum_{n=1}^{N} (m - x_n)^2$$
Now we have derived an objective function that matches our intuition and
is mathematically easy to optimize using traditional approaches, in this case
simple calculus.
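Optimization: The most basic optimization method is to set the derivative of the
objective w.r.t. the parameter to zero:
$$\frac{\partial J(m \mid X)}{\partial m} = \sum_{n=1}^{N} \frac{\partial (m - x_n)^2}{\partial m} = 2 \sum_{n=1}^{N} (m - x_n) = 0 \;\Rightarrow\; m = \frac{1}{N} \sum_{n=1}^{N} x_n$$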
So we see that the formula for the mean of a set of numbers is not arbitrary: it is
the result of an optimization problem. Let us see one more example of this process.
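The claim that the mean falls out of an optimization can be checked by brute force. The Python sketch below evaluates both objective functions on a grid of candidate values for m, using a small hypothetical dataset. It also illustrates a standard fact not derived in the text: the original absolute-error objective is minimized by the median rather than the mean.

```python
import statistics


def sum_abs_error(m, xs):
    """Original objective: sum of absolute deviations from m."""
    return sum(abs(m - x) for x in xs)


def sum_sq_error(m, xs):
    """Modified objective: sum of squared errors (SSE)."""
    return sum((m - x) ** 2 for x in xs)


# Hypothetical data (illustrative only).
xs = [2.0, 3.0, 5.0, 7.0, 13.0]

# Candidate values of m from 0.0 to 15.0 in steps of 0.1.
grid = [i / 10 for i in range(0, 151)]
best_sse = min(grid, key=lambda m: sum_sq_error(m, xs))
best_sae = min(grid, key=lambda m: sum_abs_error(m, xs))

print(best_sse, sum(xs) / len(xs))        # SSE minimizer vs. arithmetic mean
print(best_sae, statistics.median(xs))    # absolute-error minimizer vs. median
```

The choice of objective therefore matters: making the function differentiable and convex also changes which summary of the data is recovered, and the squared version is more sensitive to the outlying value 13.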
Probability of Heads
Let us say we have a two-sided (possibly biased) coin that we tossed a number
of times and we know how many times we got heads (say H) and how many times
we got tails (say T). We want to find the probability p of heads in the next coin toss.
Again, we know the answer, but let us again go through the optimization process to
find the answer. Here the data is (H, T) and parameter is p.
Intuition: The intuition says that we want to find that parameter value p that
explains the data the most. In other words, if we knew p, what would be the
likelihood of seeing the data (H, T)?
Formulation: Now we formulate this as an optimization problem by assuming
that all the coin tosses are independent (i.e., the outcome of previous coin tosses does
not affect the outcome of the next coin toss) and that the process (the probability p) is
constant throughout the exercise. Now if p is the probability of seeing a head, then
the joint probability of seeing H heads is $p^{H}$. Also, since heads and tails are the only
two options, the probability of seeing a tail is $(1 - p)$ and of seeing T tails is
$(1 - p)^{T}$. The final likelihood of seeing the data (H, T) is given by the product of the two:
$$J(p \mid H, T) = p^{H} (1 - p)^{T}$$
Modification: The above objective function captures the intuition well but is
not mathematically easy to solve. We modify this objective function by taking the
log of it. This is typically called the Log Likelihood and is used commonly in both
supervised and unsupervised learning.
J (p|H, T ) = H ln p + T ln (1 − p)
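Setting the derivative of the log likelihood to zero, H/p - T/(1 - p) = 0, gives p = H / (H + T). The Python sketch below confirms this on a grid of candidate probabilities; the counts of heads and tails are hypothetical.

```python
import math


def log_likelihood(p, heads, tails):
    """Log likelihood J(p | H, T) = H ln p + T ln(1 - p), for 0 < p < 1."""
    return heads * math.log(p) + tails * math.log(1.0 - p)


# Hypothetical outcome of ten tosses (illustrative only).
heads, tails = 7, 3

p_hat = heads / (heads + tails)                 # closed-form maximizer
grid = [i / 100 for i in range(1, 100)]         # candidate values 0.01 ... 0.99
best = max(grid, key=lambda p: log_likelihood(p, heads, tails))

print(p_hat, best)
```

Taking the log does not move the maximizer, because the logarithm is strictly increasing; it only turns the product into a sum that is easier to differentiate.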
Fig. 15.1 Histograms of eight different dimensions of Higgs boson dataset (none is normally
distributed)
Fig. 15.2 Histogram of a feature (left) and its log (right). Taking log is a useful transformation
Fig. 15.3 A few scatter plots of IRIS data show that two classes are closer to each other than the
third
each point by another property. Scatter plots reveal the structure of the data in the
projected spaces and develop our intuition about what techniques might be best
suited for this data, what kind of features we might want to extract, which features
are more correlated to each other, which features are able to discriminate the classes
better, etc.
Figure 15.3 shows the scatter plot of the IRIS dataset between a few pairs of
dimensions. The color/shape coding of a point is the class (type of Iris flower) the
point represents. This immediately shows that two of the three classes of flowers are
more similar to each other than to the third class.
While histograms give us a one-dimensional view of the data and scatter plots
give us two- or three-dimensional view of the data, they are limited in what they
can do. We need more sophisticated methods to both visualize the data in lower
dimensions and extract features for next stages. This is where we resort to a variety
of projection methods discussed next.
2 Projections
One of the first things we do when we are faced with a lot of data is to get a grasp
of it from both a domain perspective and a statistical perspective. More often than not,
real-world data comprises a large number of features, either because each
record inherently has a large number of input features (i.e., there are
lots of sensors or the logs contain many aspects of each entry) or because we have
engineered a large number of features on top of the input data. High dimensionality
has its own problems.
• First, it becomes difficult to visualize the data and understand its structure. A
number of methods help optimally project the data into two or three dimensions
to exaggerate the signal and suppress the noise and make data visualization
possible.
• Second, just because there are a large number of features does not mean that the
data is inherently high dimensional. Many of the features might be correlated
with other features (e.g., age, income, and education levels in census data). In
other words, the data might lie in a lower dimensional “manifold” within the
higher dimensional space. Projection methods uncorrelate these dimensions and
discover the lower linear and nonlinear manifolds in the data.
• Finally, the curse of dimensionality starts to kick in with high dimensional
data. The amount of data needed to build a model grows exponentially with
the number of dimensions.
For all these reasons a number of projection techniques have evolved in the past.
The unsupervised projection techniques
are discussed in this chapter. Some of the supervised projection techniques
(e.g., Fisher Discriminant Analysis) are discussed in the Supervised Learning
chapter.
In this section, we will introduce three different types of projection methods:
principal components analysis, self-organizing maps, and multidimensional scaling.
Principal components analysis (PCA) is one of the oldest and most commonly used
projection algorithms in machine learning. It linearly projects high dimensional
multivariate numeric data (with possibly correlated features) onto a smaller set of
orthogonal (uncorrelated) dimensions, where the first dimension captures the largest
share of the variance, the next dimension, while being orthogonal to the first, captures
the largest share of the remaining variance, and so on. Before we go into the mathematical formulation of
PCA, let us take a few examples to convey the basic intuition behind orthogonality
and principalness of dimensions in PCA.
• The Number System: Let us take a number (e.g., 1974) in our base ten
number system. It is represented as a weighted sum of powers of 10 (e.g.,
$1974 = 1 \times 10^3 + 9 \times 10^2 + 7 \times 10^1 + 4 \times 10^0$). Each place in the number is
independent of another (hence orthogonal), and the digit at the ones place is
least important, while the digit at the thousands’ place is the most important.
If we were forced to mask one of the digits to zero by minimizing the loss of
information, it would be the ones place. So here thousands’ place is the first
principal component, hundreds’ is the second, and so on.
• Our Sensory System: Another example of PCA concept is our sensory system.
We have five senses—vision, hearing, smell, taste, and touch. There are two
Fig. 15.4 The idea of a projection and loss of information as a result of projection
properties we want to highlight about our sensory system: First, all the five
senses are orthogonal to each other, that is, they capture a very different
perspective of reality. Second, the amount of information they capture about
reality is not the same. The vision sense perhaps captures most of the information,
followed by auditory, then taste, and so on. We might say that vision is the
first principal component, auditory is the second, and so on. So PCA captures
the notion of orthogonality and different amount of information in each of the
dimensions.
Let us first understand the idea of “projection” and “loss of information.” Figure
15.4 shows the idea of a projection in another way. Consider the stumps in a cricket
game—this is the raw data. Now imagine we hold a torch light (or use the sun) to
“project” this three-dimensional data on to the two-dimensional field. The shadow
is the projection. The nature of the shadow depends on the angle of the light. In Fig.
15.4 we show four options. Among these options, projection A is “closest to reality,”
that is, loses minimal amount of information, while option D is farthest from reality,
that is, loses all the information. Thus, there is a notion of the “optimal” projection
w.r.t a certain objective called “loss of information.”
We will use the above intuition to develop an understanding of principal
components analysis. We first need to define the notion of “loss of information due
to projection.” Let $X = \{\mathbf{x}_n^T\}_{n=1}^{N}$ be the N × D data matrix with N rows where
each row is a D-dimensional data point. Let $\mathbf{w}_{(k)}$ be the kth principal component,
constrained to be a unit vector, such that k ≤ min {D, N − 1}. We will use the same
four-stage process to develop PCA:
Intuition: The real information in the data from a statistical perspective is the
variability in it. So any projection that maximizes the variance in the projected space
is considered the first principal component (direction of projection), w(1) . Note that
since the input data X is (or is transformed to have) zero mean, its linear projection
will also be zero mean (Fig. 15.5).
470 S. Kumar
\[ \mathbf{w}_{(1)} = \arg\max_{\|\mathbf{w}\| = 1} \sum_{n=1}^{N} \left( \mathbf{x}_n^T \mathbf{w} \right)^2 \]
In general, the first k principal components of the data correspond to the first
k eigenvectors of the covariance matrix of the data. These eigenvectors are mutually
orthogonal and are ordered by decreasing variance captured. The fraction of variance
captured by the first d principal components out of D is given by the sum of the first d
eigenvalues of $X^T X$ divided by the sum of all D eigenvalues.
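The relationship between the leading eigenvector, its eigenvalue, and the fraction of variance captured can be sketched on a toy example. The code below is an illustration under stated assumptions, not the chapter's implementation: it uses synthetic two-dimensional data, the covariance matrix $X^T X / N$ of the centered data (which gives the same variance fractions as $X^T X$), and plain power iteration to find the first principal component. The function names are hypothetical.

```python
import math
import random

def covariance(rows):
    """Covariance matrix of the rows after subtracting the column means."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / n for j in range(d)]
            for i in range(d)]

def first_component(cov, iters=200):
    """Leading eigenvector and eigenvalue of cov by power iteration.

    Assumes the starting vector is not orthogonal to the leading eigenvector.
    """
    d = len(cov)
    w = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        v = [sum(cov[i][j] * w[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        w = [x / norm for x in v]
    # Rayleigh quotient w^T C w gives the variance along w (w has unit length).
    eigenvalue = sum(w[i] * sum(cov[i][j] * w[j] for j in range(d))
                     for i in range(d))
    return w, eigenvalue

random.seed(0)
# Synthetic, strongly correlated 2-D data (illustrative only).
rows = []
for _ in range(200):
    t = random.gauss(0.0, 1.0)
    rows.append((t, t + random.gauss(0.0, 0.1)))

cov = covariance(rows)
w1, lam1 = first_component(cov)
# The trace of the covariance matrix equals the sum of all its eigenvalues.
total_variance = sum(cov[i][i] for i in range(len(cov)))
fraction = lam1 / total_variance
print(w1, fraction)
```

Because the two features are almost copies of each other, nearly all of the variance lies along the diagonal direction, so a single component is enough here.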
Figure 15.6 shows the eigenvalues (above) and the fraction of variance captured
(below) as a function of the number of principal components for MNIST data, which
consists of 28 × 28 images of handwritten digits. The top 30 principal components capture more
than 95% of variance in the data. The same can be seen in Figure 15.7 that shows
Fig. 15.6 Eigenvalues of the first 30 principal components for MNIST and fraction of variance
captured
Fig. 15.7 Reconstruction of the digits by projecting them into k dimensions and back
the reconstruction of the ten digits when the data is projected to different number of
principal components and reconstructed back. Again, we can see that although the
original data is 784-dimensional (28 × 28), the top 30–40 dimensions capture the
essence of the data (signal).
A number of other such linear (e.g., independent components analysis) and
nonlinear (e.g., principal surfaces) projection methods with different variants of
the loss of information objective function have been proposed. PCA is a “type”
of projection method. Next we study another “type” of projection method which is
very different in nature compared to the PCA-like methods.
To learn more about principal components analysis, refer to Chap. 3 (Sect. 3.4.3)
in Han et al. (2011), Chap. 12 (Sect. 12.2) in Murphy (2012), and Chap. 14 (Sect.
14.5.1) in Friedman et al. (2001).
Fig. 15.8 Nearby points in the original space (left) map to nearby or same point in the SOM grid
(right)
• Compute the degree of association $\theta_t(n, m)$ between the nth data point and the
mth grid point, such that it decreases with the distance between m and $I_t(n)$,
$\delta(I_t(n), m)$:
\[ \theta_t(n, m) = \exp\left( -\frac{\delta(I_t(n), m)}{\sigma(t)^2} \right), \quad \forall m = 1 \ldots M \]
• Now each of the grid point weights $\mathbf{w}_m$ is updated in the direction of the input $\mathbf{x}_n$
to a degree that depends on $\theta_t(n, m)$.
• Decrease the learning rate η(t) and the variance σ (t) as iterations progress.
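The steps listed above can be sketched in a few lines. This is a minimal illustration under several assumptions that the text here does not state: the grid is one-dimensional, $I_t(n)$ is taken to be the grid point whose weight vector is closest to the input, the grid distance δ is the squared difference of grid indices, and the weight update is the commonly used rule $\mathbf{w}_m \leftarrow \mathbf{w}_m + \eta(t)\,\theta_t(n, m)\,(\mathbf{x}_n - \mathbf{w}_m)$. The function name and decay schedules are hypothetical.

```python
import math
import random

def som_step(weights, x, eta, sigma):
    """One SOM update for input x on a 1-D grid of weight vectors (in place)."""
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # Winner I_t(n): grid point whose weight vector is closest to x (assumed rule).
    winner = min(range(len(weights)), key=lambda m: sqdist(weights[m], x))
    for m, w in enumerate(weights):
        grid_dist = (m - winner) ** 2              # assumed delta(I_t(n), m)
        theta = math.exp(-grid_dist / sigma ** 2)  # degree of association
        # Assumed update: move w_m toward x in proportion to eta * theta.
        weights[m] = [wi + eta * theta * (xi - wi) for wi, xi in zip(w, x)]
    return winner

# A single worked step on three grid points with scalar weights.
weights = [[0.0], [1.0], [2.0]]
winner = som_step(weights, [1.0], eta=0.5, sigma=1.0)
print(winner, weights)

# A short training loop with decaying learning rate and neighborhood width.
random.seed(1)
grid = [[random.random()] for _ in range(5)]
steps = 200
for t in range(steps):
    eta = 0.5 * (1.0 - t / steps)            # learning rate shrinks toward 0
    sigma = 2.0 * (1.0 - t / steps) + 0.1    # neighborhood width shrinks
    som_step(grid, [random.random()], eta, sigma)
print(grid)
```

In the single step, the winning grid point already matches the input and does not move, while its two neighbors are pulled part of the way toward the input.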
Figure 15.9 shows a semantic space of a news corpus comprising millions of
news articles from the last ten years of a country. Here word embeddings (300-
dimensional semantic representation of words such that two words with similar
meaning are nearby in the embedding space) of the top 30K most frequent words
(minus the stop words) are smeared over a 30 × 30 SOM grid. Two words close
to each other in the embedding space are mapped to either the same or nearby grid
points. The grid vectors quantize different parts of the semantic embedding space
representing different meanings. These grid point vectors are further clustered into
macro concepts shown on the map.
SOMs smear the entire data onto a 2D grid. Sometimes, however, we do not
want to put the projected data on a grid. In other cases, we are not given a natural
representation of the data in a Euclidean space. In such cases, we use another class of
projection methods called multidimensional scaling.
The reader can refer to Chap. 14 (Sect. 14.4) in Friedman et al. (2001) to learn
more about self-organizing maps.
PCA and SOM are two different kinds of projection/visualization methods. In PCA,
we project the data linearly to minimize loss of variance, while in SOM we quantize
each data point into a grid point via competitive learning. Another way to map
a high dimensional data into a low dimensional data is to find each data point’s
representatives in the lower dimensions such that the distance between every pair of
points in the high dimension matches the distance between their representatives in
the projected space. This is known as multidimensional scaling (MDS).
Intuition: The idea of “structure” in data manifests in many ways—correlation,
variance, or pairwise distances between data points. In MDS, we find a representa-
tive (not a quantization as in SOM) for each data point in the original space such
that the distance between two points in the original space is preserved in the MDS
projected space. Figure 15.10 shows the basic idea behind MDS.
Formulation: Let D = [δ ij ] be the N × N distance matrix between all pairs
of N points in the original space. Note that techniques like MDS do not require
Fig. 15.10 Multidimensional scaling preserves pairwise distances between all pairs of points
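• Sammon Map: The intuition behind this is that when two points are very far
from each other in the original space, the error between their distances in
the projected space and the original space matters less. The error matters most when
the points are close to each other in the original space.
\[ J_{SPE}(X) = \sum_{1 \le i < j \le N} \frac{\left( \Delta(\mathbf{x}_i, \mathbf{x}_j) - \delta_{ij} \right)^2}{\delta_{ij}} \]
Here $\Delta(\mathbf{x}_i, \mathbf{x}_j)$ denotes the distance between the representatives of points i and j in the projected space.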
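To make the Sammon objective concrete, the sketch below computes it for a candidate projection. It assumes Euclidean distances in both the original and the projected space and that no two original points coincide (so that no $\delta_{ij}$ is zero); the function name and the toy points are illustrative only.

```python
import math

def sammon_stress(original, projected):
    """Sum over pairs of (projected distance - original distance)^2 / original distance."""
    total = 0.0
    n = len(original)
    for i in range(n):
        for j in range(i + 1, n):
            delta = math.dist(original[i], original[j])   # distance in original space
            d = math.dist(projected[i], projected[j])     # distance between representatives
            total += (d - delta) ** 2 / delta
    return total

# Three points that already lie in a plane of the 3-D space.
original = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
faithful = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]    # preserves every pairwise distance
distorted = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.0)]   # stretches one of the points away

print(sammon_stress(original, faithful), sammon_stress(original, distorted))
```

Dividing by the original distance is what makes errors on nearby pairs count more than equally sized errors on far-apart pairs.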
Figure 15.11 shows a 2D map of the various product categories of a grocery store.
This map was created by first learning the strength of co-occurrence consistency
between all pairs of categories from the point-of-sale data. Consistency measures
the degree to which a pair of product categories is purchased together more
often than would happen by chance. Two categories that are consistently purchased together (e.g.,
meat and seafood) end up in close proximity in the 2D space as well. This
visualization reveals not only the structure in the purchase co-occurrence grammar
of the customers but can also be used to change the store layout to match customer
buying patterns or create rules for recommending products, etc.
Fig. 15.11 Store layout based on co-occurrence of products from various categories
3 Clustering
The fundamental hypothesis in finding structure in data is that while a dataset can be
very large, the underlying processes that generated the data have only finite degrees
of freedom. There are only a small number of latent sources of variation
from which the data actually emerged. For example,
• In retail point-of-sale data, we might see a lot of variation from customer to
customer but inherently there are only a finite number of types of customer behavior based
on their lifestyle (e.g., brand savvy, frugal), life-stage (e.g., bachelor, married, has
kids, old age), purchase behavior (e.g., when, where, how much, which channel)
and purchase intents (grocery, birthday, vacation related, home improvement,
etc.). We might not know all these variations or combinations in advance, but
we know we are not dealing with an infinite number of such variations and we
can discover such quantization if we let the data speak for itself.
• Similarly, while it seems that there are billions of videos on YouTube or billions
of pages on the web, or millions of people on LinkedIn and Facebook, the
different types of videos (music albums, home videos, vacation videos, talent
videos, cat videos, etc.), pages (news, spam, blogs, entertainment, etc.), or people
(software engineers, managers, data scientists, artists, musicians, politicians, etc.)
is a reasonably finite set. Whether we know all the types or not is another
question, but what we definitely know is that the number of such types is not
as many as the number of entities.
• Similarly, consider all words in a language. It appears that there are many words
in the dictionary but they can again be grouped by, say, parts of speech, root,
tense, and meaning, into only a small number of types.
• Finally, consider telematics data while someone is driving the car. Again, the data
variation might be very large across all cars but the number of things people do
while driving (soft or hard brake, soft or hard acceleration, sharp or comfortable
left or right turns, etc.) combined with the number of driving scenarios (potholes,
uphill, downhill, highway, inner roads, etc.) is still finite.
When these finite “sources” of variation in the data are already known in advance
and/or when we have to map the data variations into a specific set of known types,
this becomes a classification problem. We deal with classification in the supervised
learning chapter at length. On the other hand, when these variations are not known
in advance and need to be discovered, by grouping similar data points together
(whatever “similar” means for that type of data), then it becomes a clustering
problem.
In many systems of intelligence including our own, we transition from clustering
to classification. For example, in early childhood, babies do not know all the
variations of what they see or hear so they internally do clustering to quantize these
variations. If they see more data of a certain type, the resolution of quantization
on those parts of the data becomes fine-grained. At this stage, we do know that
something is similar to what we have seen before (quantized symbol number 48), but we
do not know yet what to call it. As we grow and language develops, we learn that
those quantizations have been given names (vertical line, sleeping line or “nose,”
“eyes,” “square,” “triangle,” etc.). Now we have some known quantizations that we
call classes, and when a new experience comes, we first try to map it to a known
class (e.g., if a child has never seen a goat before, she might “classify” it as a “dog”),
but if this quantization is not “close enough” to any of the known classes, then she
might ask the mother—it looks like a dog, is it a dog? And when she gets a new
label (no it is called a “goat”), she creates another class in the brain. In this stage,
we rely not only on the known set of named classes but are also open to discovering
beyond the known. This is where we are in the hybrid stage of learning—exploit
the known and explore the unknown simultaneously. As we grow older and we have
“seen enough,” the number of new quantizations reduces as we have a sufficiently
large number of classes to represent all inputs and nothing seems to surprise us
anymore.
If we assume a certain number of clusters and try to partition the data into
that many clusters, then it is called partitional clustering. K-means clustering,
which partitions the data into K clusters, is the most notable
example of partitional clustering. Consider a multivariate dataset where we can
define Euclidean distance between two points meaningfully, that is, we have already
transformed all the features and z-scored them.
Let X = {x1 , x2 , . . . , xN } be the N data points that need to be clustered into K
clusters (1 ≤ K ≤ N). If K = 1, that means the entire data is clustered into one cluster.
In that case, the mean of the entire dataset is the cluster center we are looking for.
In case K = N, then each data point is by itself a cluster center. Both these are valid
but not useful extreme cases. Typically, the value of K is somewhere in between.
We will first formulate this as an optimization problem using the same process as
above—intuition, formulation, modification, and optimization.
Intuition: Clustering is about “grouping similar things (feature vectors representing
them) together.” There are two equivalent ways to represent a “clustering”:
enumeration and representation. In enumeration, we explicitly label each data
point with the cluster id it belongs to. Let $\delta_{n,k} \in \{0, 1\}$ be a set of binary labels such
that $\delta_{n,k}$ is one if the nth data point is associated with the kth cluster and zero otherwise.
In representation, each cluster is represented by the mean of all the data points it
represents.
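\[ J(M, \Delta) = \sum_{n=1}^{N} \sum_{k=1}^{K} \delta_{n,k} \left\| \mathbf{x}_n - \mathbf{m}_k \right\|^2 \]
\[ \delta_{n,k}^{(t)} \leftarrow 1 \text{ if } k = \arg\min_{j = 1 \ldots K} \left\| \mathbf{x}_n - \mathbf{m}_j^{(t)} \right\|^2, \; 0 \text{ otherwise.} \]
(b) Maximization step (the M-step) updates $M_{t+1}$ given the current value of $\Delta_t$ by
minimizing the above objective function with respect to the cluster centers, resulting in:
\[ \frac{\partial J(M, \Delta)}{\partial \mathbf{m}_k} = -2 \sum_{n=1}^{N} \delta_{n,k}^{(t)} \left( \mathbf{x}_n - \mathbf{m}_k \right) = 0 \;\Rightarrow\; \mathbf{m}_k^{(t+1)} \leftarrow \frac{\sum_{n=1}^{N} \delta_{n,k}^{(t)} \mathbf{x}_n}{\sum_{n=1}^{N} \delta_{n,k}^{(t)}} \]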
Figure 15.12 shows these two steps pictorially on how a complete EM iteration
works. Figure 15.12a shows two randomly initialized cluster centers. Figure 15.12b
shows how given those two initial cluster centers, the data points are enumerated
by the cluster they are closest to (orange vs. green). Figure 15.12c shows how with
the new associations the cluster centers are updated using the M-step. Figure 15.12d
shows the final association update leading to the desirable clustering of the data.
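The two alternating steps can be sketched as follows. This is a minimal illustration rather than a production implementation: the points and the (deliberately poor) initial centers are made up, an empty cluster simply keeps its old center, and iteration stops when the assignments no longer change. All function names are hypothetical.

```python
def sqdist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def e_step(points, centers):
    """Assign each point to the index of its nearest center."""
    return [min(range(len(centers)), key=lambda k: sqdist(p, centers[k]))
            for p in points]

def m_step(points, labels, centers):
    """Move each center to the mean of the points assigned to it."""
    new_centers = []
    for k, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            new_centers.append(old)  # design choice: keep an empty cluster's center
            continue
        new_centers.append(tuple(sum(p[d] for p in members) / len(members)
                                 for d in range(len(old))))
    return new_centers

def kmeans(points, centers, max_iters=100):
    """Alternate E- and M-steps until the assignments stop changing."""
    labels = None
    for _ in range(max_iters):
        new_labels = e_step(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
        centers = m_step(points, labels, centers)
    return labels, centers

# Two well separated groups; both initial centers start inside the first group.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(points, [(0.0, 0.0), (1.0, 0.0)])
print(labels, centers)
```

On this toy data the assignments settle after two E/M rounds, with one center at the mean of each group.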
There are several properties of K-means clustering that are likeable and some
that are not:
Sensitivity to Initialization—In case of PCA, the solution to the objective
function was what we call a “closed-form solution” because there is only one
Fig. 15.12 (a) Initial cluster centers, (b) E-step associating data points with one cluster or the
other, (c) M-step updating the cluster centers, (d) next E-step shows convergence
Fig. 15.13 Clustering is sensitive to initialization. Three different possible random initializations
(blue) that might either result in different final clusters or more iterations to convergence
optimal answer there. But clustering does not have such an objective function that
gives us one final answer. Here, the final clusters learnt depend on the way we
have initialized the clustering. Figure 15.13 shows three different initializations of
the same data. Depending on the initialization, the final clusters might either be
suboptimal or take longer to converge to the optimum, if they reach it at all. Because
random initialization could give any of these or other combinations as the initial
clusters, K-means is not guaranteed to give the same clusters on every run. As
a general rule, we do not like “non-determinism” in our algorithms, that is, having no guarantee
that we will get the same results for the same data and the same hyperparameters
(number of clusters).
Smart Initialization: There are a number of algorithms that have been proposed
to make K-means clustering more “optimal” and “deterministic” from an initial-
ization perspective. One such initialization method is the farthest first point (FFP)
initialization where we choose the first cluster center deterministically, that is, pick
the data point farthest from the mean of the entire data. Then we choose the second
cluster center that is farthest from the first. The third is picked such that it is farthest
from the first two, and so on. Figure 15.14 shows one such initialization, where the first
panel shows the first two cluster centers picked. The middle panel shows how the next center
is chosen such that it is farthest from the first two, and the last panel shows how the fourth
is chosen such that it is farthest from the first three. This gives good coverage of the space and leads
to a decent initialization, resulting in a closer to optimal clustering.
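The farthest first point initialization described above can be sketched as below. One detail is an assumption on our part: “farthest from the centers chosen so far” is read as maximizing the distance to the nearest already chosen center. The function name and the toy points are illustrative.

```python
import math

def farthest_first(points, k):
    """Pick k initial centers deterministically, spreading them over the data."""
    dim = len(points[0])
    mean = tuple(sum(p[d] for p in points) / len(points) for d in range(dim))
    # First center: the data point farthest from the mean of the entire data.
    centers = [max(points, key=lambda p: math.dist(p, mean))]
    while len(centers) < k:
        # Next center: the point whose nearest chosen center is farthest away
        # (assumed reading of "farthest from the centers picked so far").
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(nxt)
    return centers

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
          (10.0, 0.0), (10.0, 1.0), (4.0, 8.0)]
print(farthest_first(points, 3))
```

Since there is no randomness involved, repeated runs on the same data return the same initial centers; ties, if any, are broken by the order of the points.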
Fig. 15.14 Farthest first point initialization. The first figure shows two cluster centers initialized.
Middle figure shows how the third is picked such that it is farthest from both the first two and the
fourth is farthest from all three
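\[ \delta_{n,k}^{(t)} \leftarrow \frac{\exp\left( -\frac{\left\| \mathbf{x}_n - \mathbf{m}_k \right\|^2}{\sigma^2(t)} \right)}{\sum_{j=1}^{K} \exp\left( -\frac{\left\| \mathbf{x}_n - \mathbf{m}_j \right\|^2}{\sigma^2(t)} \right)} \]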
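The update above replaces a 0/1 label with a normalized weight for every cluster. A minimal sketch of that computation for a single data point follows; the function name, the sample point, and the values of σ are assumptions chosen for illustration, and the code does not guard against all exponentials underflowing to zero for extremely small σ.

```python
import math

def soft_assignments(x, centers, sigma):
    """Normalized association of point x with each center (values sum to 1)."""
    weights = [math.exp(-sum((a - b) ** 2 for a, b in zip(x, m)) / sigma ** 2)
               for m in centers]
    total = sum(weights)
    return [w / total for w in weights]

centers = [(0.0, 0.0), (3.0, 4.0)]
wide = soft_assignments((0.0, 0.0), centers, sigma=5.0)    # broad sigma: shared membership
narrow = soft_assignments((0.0, 0.0), centers, sigma=1.0)  # small sigma: nearly a hard label
print(wide, narrow)
```

As σ shrinks, the weights approach the hard nearest-center assignment of ordinary K-means.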
To learn more about K-means clustering and expectation maximization, one can
read Chap. 10 (Sect. 10.2.1) and Chap. 11 (Sect. 11.1.3) in Han et al. (2011), Chap.
11 (Sect. 11.4) in Murphy (2012), and Chap. 6.12 (Sect. 6.12) in Michalski et al.
(2013).
Partitional clustering assumes that there is only one “level” in clustering. But
in general, the world is made up of a “hierarchy of objects.” For example, the
biological classification of species has several levels—domain, kingdom, phylum,
class, order, family, genus, and species. All the documents on the web can be
clustered into coarse (sports, news, entertainment, science, academic, etc.) to fine
grained (hockey, football, . . . , or political news, financial news, etc.). To discover
such a “hierarchical organization” from data, we do hierarchical clustering in two
ways: top-down and bottom-up.
Top-down hierarchical clustering, also known as divisive clustering, applies
partitional clustering recursively: first to find, say, K1 clusters at the first level
of the hierarchy, then K2 clusters within each of those, and so on. With the right number
of levels and number of clusters at each level (which may be different in different
parts of the hierarchy), we can discover the overall structure in the data in a
top-down fashion. At each level of the hierarchy, however, this approach still suffers from the
problems of partitional clustering algorithms: initialization issues, choosing the
number of clusters at each level, etc. So it can still give a variety of different answers
and the problem of non-determinism remains.
Bottom-up hierarchical clustering, also known as agglomerative clustering, is
the other approach to building the hierarchy of clusters. Here we start with the raw
data points themselves at the bottom of the hierarchy, and we find the distances
between all pairs of points and merge the two points that are nearest to each other
since they make the most sense to “merge.” Now the merged point replaces the two
points that were merged and we are left with N – 1 data points when we started with
N data points. The process continues as we keep merging two data points or clusters
together until the entire data is merged into a single root node. Figure 15.15 shows
the result of a clustering of ten digits (images) in a bottom-up fashion. The structure
is called a dendrogram that shows how at each stage two points or clusters are
merged together. First, digits 1 and 7 got merged. Then 3 and 8 got merged, then
4 and 9 got merged, then 0 and 5 got merged, then the cluster {3,8} and {0,5} got
merged, and so on. The process leads eventually to a binary tree that we can cut at
any stage to get any number of clusters we want.
The key to agglomerative clustering is the definition of distance between two
“clusters” in general (e.g., clusters {3,8} and {0,5}). Different ways of doing this
define different kinds of agglomerative clustering, resulting in different forms of
clustering shapes. In the following, let X = {x1 , x2 , . . . , xP } be the set of P points
in cluster X and let Y = {y1 , y2 , . . . , yQ } be the set of Q points in cluster Y. Note
that either P or Q or both can be 1. The distance between the set X and set Y can be
defined in many ways:
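• Single linkage: the distance between the two nearest points across X and Y is used as the
distance between the two clusters. This gives elongated clusters, as two clusters
with even one point close to one point of another cluster will be merged.
\[ \Delta(X, Y) = \min_{p = 1 \ldots P} \; \min_{q = 1 \ldots Q} \left\{ \delta(\mathbf{x}_p, \mathbf{y}_q) \right\} \]
• Complete linkage: the distance between the two farthest points across X and Y is used as the
distance between the two clusters.
\[ \Delta(X, Y) = \max_{p = 1 \ldots P} \; \max_{q = 1 \ldots Q} \left\{ \delta(\mathbf{x}_p, \mathbf{y}_q) \right\} \]
• Average linkage: lies between single and complete linkage clustering; the
distance between the two clusters is computed as the average distance between
all pairs of points across them. This makes clustering more robust to noise.
\[ \Delta(X, Y) = \frac{1}{PQ} \sum_{p=1}^{P} \sum_{q=1}^{Q} \left\| \mathbf{x}_p - \mathbf{y}_q \right\| \]
Here $\Delta(X, Y)$ is the distance between the two clusters and $\delta(\cdot, \cdot)$ is the distance between two points.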
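The cluster-to-cluster distances used in agglomerative clustering differ only in how the point-to-point distances are aggregated. A small sketch, assuming Euclidean distance between points (the function names and the two toy clusters are illustrative):

```python
import math
from itertools import product

def single_linkage(X, Y):
    """Distance between the two nearest points across clusters X and Y."""
    return min(math.dist(x, y) for x, y in product(X, Y))

def complete_linkage(X, Y):
    """Distance between the two farthest points across clusters X and Y."""
    return max(math.dist(x, y) for x, y in product(X, Y))

def average_linkage(X, Y):
    """Average distance over all P*Q cross-cluster pairs."""
    return sum(math.dist(x, y) for x, y in product(X, Y)) / (len(X) * len(Y))

X = [(0.0, 0.0), (0.0, 1.0)]
Y = [(3.0, 0.0), (4.0, 0.0)]
print(single_linkage(X, Y), average_linkage(X, Y), complete_linkage(X, Y))
```

For any pair of clusters the average-linkage value lies between the single-linkage (minimum) and complete-linkage (maximum) values.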
Fig. 15.16 A similarity graph with six nodes. Edge weights are similarity between corresponding
pairs of entities. The graph is partitioned into two parts such that total weight of removed edges is
minimum
(1) or the second partition (0). Now the intuition suggests that two nodes (i, j) should
be in the same partition (i.e., $(x_i - x_j)^2$ is 0) if they are very similar (i.e., $w_{ij}$ is high)
and in different partitions (i.e., $(x_i - x_j)^2$ is 1) if they are dissimilar (i.e., $w_{ij}$ is
low). We can therefore capture the intuition by minimizing the following objective
function.
\[ J(X \mid W) = \frac{1}{2} \sum_{1 \le i, j \le N} w_{ij} (x_i - x_j)^2 = \sum_{i=1}^{N} d_i x_i^2 - \sum_{1 \le i, j \le N} x_i w_{ij} x_j = \mathbf{x}^T (D - W) \mathbf{x} \]
where D is the diagonal matrix whose diagonal elements are the row sums of W,
\[ D = \begin{bmatrix} d_1 & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & d_N \end{bmatrix} \]
where $d_n = \sum_{j=1}^{N} w_{nj}$. The matrix L = (D − W) is called the unnormalized
graph Laplacian of the similarity matrix W. It is a positive semi-definite matrix with
smallest eigenvalue 0 and the corresponding eigenvector as all 1’s. If the graph has k
connected components, that is, each connected component has no link across, then
there will be k smallest eigenvalues equal to 0. Assuming the graph has only one
connected component, the eigenvector of the second smallest eigenvalue is used to partition the graph
into two parts. We can take the median value of this eigenvector's components and
partition the graph such that the nodes whose components are
above the median are in one partition and the remaining nodes in the other partition.
This partitioning can be applied recursively now to break the two components
further into two partitions in a top-down fashion.
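The median split on the eigenvector of the second smallest eigenvalue can be sketched on a six-node graph. The similarity matrix below is made up for illustration (two tight triangles joined by one weak edge), and the eigenvector is found with a simple shifted power iteration that removes the all-ones direction at every step; this solver is a choice made here for a dependency-free example, and a library eigen-solver would normally be used instead.

```python
import math
import statistics

def laplacian(W):
    """Unnormalized graph Laplacian L = D - W for a symmetric similarity matrix W."""
    n = len(W)
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]

def second_eigenvector(L, iters=500):
    """Eigenvector of the second smallest eigenvalue of L (connected graph assumed)."""
    n = len(L)
    # Twice the largest degree bounds the eigenvalues of L, so shift*I - L
    # has non-negative eigenvalues and reverses their order.
    shift = 2.0 * max(L[i][i] for i in range(n))
    v = [float(i) for i in range(n)]  # deterministic, non-constant start vector
    for _ in range(iters):
        mean = sum(v) / n
        v = [x - mean for x in v]     # remove the all-ones eigenvector direction
        v = [shift * v[i] - sum(L[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v

# Hypothetical similarity matrix: nodes 0-2 and nodes 3-5 form tight groups.
W = [
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
]
v = second_eigenvector(laplacian(W))
median = statistics.median(v)
part_a = [i for i, x in enumerate(v) if x > median]
part_b = [i for i, x in enumerate(v) if x <= median]
print(part_a, part_b)
```

The split removes only the weak edge between nodes 2 and 3, which is the cut with the smallest total weight for this graph.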
In this section, we have studied a number of clustering algorithms depending on
the nature of the data. One of the open problems in clustering is how to system-
atically define distance functions when data is not a straightforward multivariate
real-valued vector. This is where the critical domain knowledge is required. The
next paradigm—density estimation—extends the idea of clustering by allowing us
to describe each cluster with a “shape” called its density function.
For further reading, the reader can refer to Chap. 25 (Sect. 25.4) from Murphy
(2012) and Chap. 14 (Sect. 14.5.3) from Friedman et al. (2001).
4 Density Estimation
The fundamental hypothesis that data has structure implies that it is not uniformly
distributed across the entire space. If it were, it would not have any structure. In
other words, not all parts of the feature space are equally probable. Consider a
space with two features “age” and “education.” Let us say age takes a value from
0 to 100 years and “education” from, say, 1 to 20. Now probability P(age = 3,
education = PhD) is zero and P(age = 26, education = PhD) is high. Similarly,
P(age = 20, education = grade-1) is low and P(age = 5, education = grade-1) is
high. Estimating this joint probability, given the data, gives us a sense of which
combination of feature values are more likely than others. This is the essence
of structure in the data, and density estimation captures such joint probability
distributions in the data. Density estimation has many applications, for example:
• Imputation: If one or more of the feature values is missing, given the others we
can estimate the missing value as the value that gives the highest joint probability
after substituting it.
• Bayesian classifiers: Another application of density estimation is to build a
“descriptive” classifier for each class where the descriptor is essentially a class
conditional density function P(x|c).
• Outlier detection: Another important application of density estimation is outlier
detection used in many domains such as fraud, cyber security, and when dealing
with noisy data. A data point with low probability after we have learnt the density
function is considered an outlier point.
There are two broad density estimation frameworks. The first is nonparametric
density estimation where we do not learn a model but use the “memory” of all
the known data points to determine the density of the next data point. Parzen
Window is an example of a nonparametric density estimation algorithm. Second
Let us first develop an intuition behind density functions from an example. Imagine
that we scatter a large number of magnets at specific locations on the floor of a room. Each
of these magnets has the same “magnetic field of influence” that diminishes as we go
away from the magnet. Now imagine there is a piece of iron at a certain location in
the room; it will experience a total magnetic field that is the sum of the fields of all the magnets.
The magnets that are closer to this piece of iron will have a higher influence than
the farther ones.
Let X = {x1 , x2 , . . . , xN } be the set of N magnets (data points) scattered in some
high dimensional space. Let x be a new data point (iron) whose density (influence
by all the magnets), P(x), has to be estimated. In a nonparametric kernel density
estimation, we represent this total field of influence as follows:
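\[ P(\mathbf{x}) = \frac{1}{N} \sum_{n=1}^{N} K_\sigma(\mathbf{x}, \mathbf{x}_n) = \frac{1}{N \sigma \sqrt{2\pi}} \sum_{n=1}^{N} \exp\left( -\frac{\left\| \mathbf{x} - \mathbf{x}_n \right\|^2}{2 \sigma^2} \right) \]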
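The kernel density estimate above can be sketched for scalar data, where the normalizing constant $1 / (\sigma \sqrt{2\pi})$ is exactly that of a one-dimensional Gaussian. The function name, the data points, and the value of σ below are assumptions for illustration.

```python
import math

def parzen_density(x, data, sigma):
    """Gaussian-kernel density estimate at scalar x from stored data points."""
    norm = 1.0 / (len(data) * sigma * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-(x - xn) ** 2 / (2.0 * sigma ** 2)) for xn in data)

# Three nearby "magnets" and one far away (illustrative values only).
data = [0.0, 0.1, -0.1, 5.0]
sigma = 0.5

print(parzen_density(0.0, data, sigma))   # close to three data points: high density
print(parzen_density(2.5, data, sigma))   # far from every data point: low density
```

Every query touches all stored points, which is why the cost of scoring grows with the size of the training set.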
about whether a credit card transaction is fraud or not and we are using outlier
detection based on density estimations, we cannot use such nonparametric
density estimators).
• Robustness to noise: Since each training data point has an influence on density
estimation of each point, even noisy points get to have their say. It is therefore
important to identify and remove the noisy points from the training set or use
parametric techniques for highly noisy datasets.
In nonparametric density functions, the data is stored “as is” and is used to compute
density using kernel functions. The parametric density estimation functions, on the
other hand, first define a parametric form and then find the parameters by optimizing
an objective function. We will follow the same four-stage process of intuition,
formulation, modification, and optimization to learn parameters.
Intuition: Let P(x| θ ) be a parametric density function where θ is the set of
parameters to be learnt. For N data points X = {x1 , x2 , . . . , xN }, we have to find
the set of parameters that “fits” the data best. In other words, we need to find the
parameters θ such that the probability of seeing the entire data is maximum.
Formulation: Parametric density function problems are all modeled as optimization
problems where we try to find the set of parameters that maximizes the
likelihood of seeing the data. Since the data points are independent and identically
distributed, the likelihood of seeing the entire data is the product of the likelihoods
of seeing each data point independently.
\[ \theta^* = \arg\max_{\theta} J(\theta \mid X) = \arg\max_{\theta} \prod_{n=1}^{N} P(x_n \mid \theta) \]
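• Exponential distribution: where $P(x \mid \theta) = \theta e^{-\theta x}$, x ≥ 0 and θ > 0. So the log-likelihood is
\[ J(\theta \mid X) = \sum_{n=1}^{N} \ln P(x_n \mid \theta) = \sum_{n=1}^{N} \left[ \ln \theta - \theta x_n \right] = N \ln \theta - \theta \sum_{n=1}^{N} x_n \]
\[ \frac{\partial J(\theta \mid X)}{\partial \theta} = \frac{N}{\theta} - \sum_{n=1}^{N} x_n = 0 \quad \therefore \quad \hat{\theta} = \frac{1}{\frac{1}{N} \sum_{n=1}^{N} x_n} \]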
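• Bernoulli distribution: where $P(x \mid \theta) = \theta^{x} (1 - \theta)^{1 - x}$, x ∈ {0, 1} and 0 < θ < 1. So
\[ J(\theta \mid X) = \sum_{n=1}^{N} \ln P(x_n \mid \theta) = \sum_{n=1}^{N} \left[ x_n \ln \theta + (1 - x_n) \ln (1 - \theta) \right] \]
\[ \frac{\partial J(\theta \mid X)}{\partial \theta} = \frac{1}{\theta} \sum_{n=1}^{N} x_n - \frac{1}{1 - \theta} \left( N - \sum_{n=1}^{N} x_n \right) = 0 \quad \therefore \quad \hat{\theta} = \frac{1}{N} \sum_{n=1}^{N} x_n \]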
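• Poisson distribution: where $P(x|\theta) = \frac{\theta^{x} e^{-\theta}}{x!}$,
  $x = 0, 1, 2, \ldots$ and $\theta > 0$. So
\[
J(\theta|X) = \sum_{n=1}^{N} \ln P(x_n|\theta)
            = \sum_{n=1}^{N} \left[x_n\ln\theta - \theta - \ln x_n!\right]
\]
\[
\frac{\partial J(\theta|X)}{\partial\theta} = \frac{1}{\theta}\sum_{n=1}^{N} x_n - N = 0
\quad\therefore\quad \theta = \frac{1}{N}\sum_{n=1}^{N} x_n
\]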
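\[
J(\mu,\sigma^{2}|X) = \sum_{n=1}^{N} \ln P(x_n|\mu,\sigma^{2})
  = -\frac{1}{2}\sum_{n=1}^{N}\left[\ln\sigma^{2} + \frac{(x_n-\mu)^{2}}{\sigma^{2}} + \ln 2\pi\right]
\]
\[
\frac{\partial J(\mu,\sigma^{2}|X)}{\partial\mu}
  = \frac{1}{\sigma^{2}}\sum_{n=1}^{N}(x_n-\mu) = 0
\quad\therefore\quad \mu = \frac{1}{N}\sum_{n=1}^{N} x_n
\]
\[
\frac{\partial J(\mu,\sigma^{2}|X)}{\partial\sigma^{2}}
  = -\frac{1}{2}\sum_{n=1}^{N}\left[\frac{1}{\sigma^{2}} - \frac{(x_n-\mu)^{2}}{\sigma^{4}}\right] = 0
\quad\therefore\quad \sigma^{2} = \frac{1}{N}\sum_{n=1}^{N}(x_n-\mu)^{2}
\]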
The reader can refer to Criminisi et al. (2012) and Robert (2014) to learn more
about density estimation.
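The closed-form estimates derived above can be checked numerically. The sketch below is
illustrative only: the function names and sample values are assumptions, and the
exponential and Bernoulli labels are inferred from the form of the log-likelihoods
($\ln\theta - \theta x_n$ and $x_n\ln\theta + (1-x_n)\ln(1-\theta)$, respectively).

```python
import math

def mle_exponential(xs):
    """theta = 1 / mean(x), from dJ/dtheta = N/theta - sum(x) = 0."""
    return len(xs) / sum(xs)

def mle_bernoulli(xs):
    """theta = fraction of ones."""
    return sum(xs) / len(xs)

def mle_poisson(xs):
    """theta = sample mean of the counts."""
    return sum(xs) / len(xs)

def mle_normal(xs):
    """(mu, sigma^2) = sample mean and (biased, 1/N) sample variance."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, var

def poisson_loglik(theta, xs):
    """J(theta|X) for the Poisson case; lgamma(x + 1) equals ln(x!)."""
    return sum(x * math.log(theta) - theta - math.lgamma(x + 1) for x in xs)

counts = [2, 0, 3, 1, 4]                 # hypothetical count data
theta_hat = mle_poisson(counts)
# The log-likelihood at the estimate should beat nearby values of theta.
print(theta_hat,
      poisson_loglik(theta_hat, counts) > poisson_loglik(theta_hat - 0.1, counts),
      poisson_loglik(theta_hat, counts) > poisson_loglik(theta_hat + 0.1, counts))
```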
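\[
J(\Theta|X) = \prod_{n=1}^{N}\prod_{k=1}^{K}\left[P(x_n, k)\right]^{\delta_{n,k}},
\qquad
\ln J(\Theta|X) = \sum_{n=1}^{N}\sum_{k=1}^{K} \ln\left[P(x_n, k)\right]^{\delta_{n,k}}
\]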
P(xn , k) = P(xn | k)P(k) and take the log of the likelihood to make the calculus easy
for optimization. Also note that there are constraints on δ n,k that for each n they
must add up to 1. We put all these into the modified objective function:
\[
J(\Theta|X) = \sum_{n=1}^{N}\left[\sum_{k=1}^{K}\delta_{n,k}\left[\ln P(x_n|k) + \ln P(k)\right]
  + \lambda\left(\sum_{k=1}^{K}\delta_{n,k} - 1\right)\right]
\]
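\[
\delta_{n,k}^{(t)} \leftarrow
  \frac{\pi_k^{(t)} P_t(x_n|k)}{\sum_{j=1}^{K}\pi_j^{(t)} P_t(x_n|j)}
  = \frac{\pi_k^{(t)} P_t\left(x_n|\mu_k^{(t)},\Sigma_k^{(t)}\right)}
         {\sum_{j=1}^{K}\pi_j^{(t)} P_t\left(x_n|\mu_j^{(t)},\Sigma_j^{(t)}\right)}
\]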
The maximization step when optimized for mean and covariance results in the
following updates:
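\[
\pi_k^{(t+1)} \leftarrow \frac{1}{N}\sum_{n=1}^{N}\delta_{n,k}^{(t)}
\]
\[
\mu_k^{(t+1)} \leftarrow \frac{\sum_{n=1}^{N}\delta_{n,k}^{(t)}\, x_n}{\sum_{n=1}^{N}\delta_{n,k}^{(t)}}
\]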
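The alternating updates above can be sketched for a one-dimensional mixture of
Gaussians. This is a minimal illustration under stated assumptions, not the chapter's
implementation: the names `em_gmm_1d` and `normal_pdf`, the data, and the starting
values are hypothetical, and the variance update (the responsibility-weighted variance)
is the standard one for this model rather than an equation reproduced above.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(xs, mus, variances, pis, iterations=50):
    """EM for a 1-D Gaussian mixture; returns updated (mus, variances, pis)."""
    mus, variances, pis = list(mus), list(variances), list(pis)
    n, k = len(xs), len(mus)
    for _ in range(iterations):
        # E-step: responsibilities delta[n][j] of component j for point n.
        delta = []
        for x in xs:
            w = [pis[j] * normal_pdf(x, mus[j], variances[j]) for j in range(k)]
            total = sum(w)
            delta.append([wj / total for wj in w])
        # M-step: re-estimate priors, means, and variances from responsibilities.
        for j in range(k):
            nj = sum(delta[i][j] for i in range(n))
            pis[j] = nj / n
            mus[j] = sum(delta[i][j] * xs[i] for i in range(n)) / nj
            var = sum(delta[i][j] * (xs[i] - mus[j]) ** 2 for i in range(n)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsing component
    return mus, variances, pis

# Hypothetical data with two well-separated groups.
xs = [0.9, 1.0, 1.1, 1.2, 4.8, 5.0, 5.1, 5.3]
mus, variances, pis = em_gmm_1d(xs, mus=[0.0, 6.0], variances=[1.0, 1.0], pis=[0.5, 0.5])
print(mus, variances, pis)
```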
Fig. 15.17 Different complexities of a density function: (a) single Gaussian, spherical covariance,
(b) single Gaussian, diagonal covariance, (c) single Gaussian, full covariance, (d) mixture of two
full covariance Gaussians, (e) mixture of three full covariance Gaussians
dimensions. In (c) we continue to use a single Gaussian to model the density, but
we allow a full covariance. In (d) we increase the number of Gaussians to be two as
one Gaussian does not seem to be sufficient to model the density of this data. In (e)
we finally use three Gaussians to model the density—which seems to be sufficient.
Adding more Gaussians would tend to memorize the data rather than generalize. See
Rasmussen (2004) to read more about mixtures of Gaussians.
In this section, we have studied a variety of density estimation paradigms. There
are other density estimation frameworks, for example, hidden Markov models for
sequence of symbols. Overall, density estimations can give deep insights about
“where to look” in the data, which parts of the data matter, and which parts are
“surprising” or “anomalous.”
be found in such data. The itemset data is best described as a dataset where each
record or data point is a “set of items from a large vocabulary of possible items.”
Let us first consider various domains where such data occurs:
• Market basket data: One of the most common examples of itemset data is
  market basket data or point of sale (POS) data, where each basket
  comprises a “set” of products purchased by a customer in either a single visit
  or multiple visits put together (e.g., all products purchased in a week, one
  quarter, or the whole lifetime). Here the list of all products sold by the retailer is
  the vocabulary from which each item in a basket is drawn. We lose
  information by considering only the set of products and not their quantity
  or price, which would make it a “weighted set” instead. The problem is that
  in a typical heterogeneous retail environment, the products are not comparable.
  For example, 1 L of milk, 1 dozen bananas, and 1 fridge are not comparable to
  each other in either physical or monetary units. Hence, we stick with
  unweighted sets (baskets) rather than weighted sets (bags).
• Keyword sets: Another common example of itemset data is a keyword set. Often
entities such as images on Flickr, videos on YouTube, papers in conferences, or
even movies in IMDB are associated with a set of keywords. These keywords
  are used to tag or describe the entity they refer to. Here the vocabulary
  from which the keywords are drawn is either predefined (e.g., keyword lists for
  conference papers) or taken to be the union of all the keywords associated with
  all the entities.
• Token sets: Itemset data is also present in many other contexts, not as keywords or
  products but as arbitrary tokens. For example, the hashtags in tweets, whether per tweet
  or per account, form an itemset. The set of skills in each LinkedIn
  profile is another example of itemset data. On YouTube, all videos watched by
  a user in one session constitute an itemset as well. All WhatsApp groups are
  itemsets of phone numbers. In a payment app or a credit card account, the set of
  merchants where a customer shopped in the last n days is also an itemset. It is up
  to us to decide how to convert transaction data into itemset data in a way that makes sense.
• Derived itemsets: Itemset data can also be derived from other datasets. In
neuroscience experiments, for example, we might want to discover which
neurons fire together in response to different experiences or memories. In such
cases, we can consider a moving (overlapping) window of a certain size and all
neurons that fire in the same window could be considered an itemset. Similarly,
in gene expression experiments, all genes that express themselves from the same
stimuli could also be considered an itemset.
In all these itemset datasets we are interested in finding patterns of “co-
occurrence”—that is, which subsets of items co-occur in the same itemsets. There
are many ways to define co-occurrence. In frequent itemset mining (FISM), we are
interested in finding “large and frequent” itemsets. While the dataset is simple and
the definition of what is a pattern is also very straightforward, what makes this a
complex problem is the combinatorial explosion when the vocabulary of possible
items is very large. One of the key algorithms that we will develop here, called the
apriori algorithm, solves this problem using a very basic insight from set theory.
Intuition: Consider the itemset data shown in Fig. 15.18i. It has a total of 10 data
points over a vocabulary of 6 items {a,b,c,d,e,f}. We will first consider itemsets of
size 1. There are six itemsets of size 1 (Fig. 15.18ii). For each we can compute
the frequency which is the number of itemsets (out of 10) in which that item was
present. Now that we have itemsets of size 1 and their frequencies (also known as
support), we can compute itemsets of size 2 and so on. Now since we only care about
the “frequent” itemsets and not all itemsets, we can define a frequency threshold θ f
(also known as support threshold) such that only those itemsets of size k (= 1 for
now) will be kept whose frequency is above this threshold and others will be deemed
not “supported” (i.e., noisy). This goes with the underlying philosophy of pattern
recognition that anything that is high frequency is a pattern worth remembering.
Now from itemsets of size 1 we can find itemsets of size 2 and their support, again
pruning off those whose support is below the threshold, and so on.
Formulation: The only problem with this brute-force counting is the following.
In order for us to count the frequency of an itemset of size k, we need to maintain
a counter with the itemset as the key and its count as value. As we go through
a dataset, we check whether this itemset is a subset of the data itemset or not. If
so, we increment its counter. Now as the vocabulary size grows and the value of
k grows, the potential number of combinations that we might have to keep in the
counter memory grows to $O\left(\binom{N}{k}\right)$. So we apply the famous “apriori trick” here,
which tames the combinatorial explosion in an intelligent fashion.
Fig. 15.18 The apriori algorithm at work. (i) The dataset where each data point is a set of items
from a dictionary of six possible items {a,b,c,d,e,f}. (ii) Frequent itemsets of size 1. If the support
threshold is 3, then all itemsets with frequency less than 3 are ignored (i.e., {d}). (iii) Using the apriori
trick, all candidate itemsets of size 2 are created from frequent itemsets of size 1. (iv) A pass over the
data gives the frequency of each of the candidates. Note that we did not have to worry about any pair of
items involving item d because its frequency count is less than the threshold (3). (v) Again applying the
apriori trick to create candidates of size 3. (vi) The final frequent itemsets: {a,e,f} of
size 3 and the others of size 2
Modification: The Apriori Trick is based on a simple observation that if f (s| X)
is the frequency of the itemset s of size k in a dataset X, then its frequency cannot
be greater than the frequency of the least frequent subset of size k − 1 of s. In other
words, if s = {a,b,c} and its frequency is 3, then the frequency of each of its
subsets {a,b}, {b,c}, and {a,c} must be at least 3.
Otherwise, it would not be possible for {a,b,c} to have a frequency of 3. More formally:
\[
f(s|X) \le \min_{i \in s} f(s_{\sim i}|X)
\]
where $s_{\sim i}$ is the set obtained by removing item i from set s. Using this “apriori
trick,” the frequent itemset mining algorithm can skip counting many itemsets, as it knows
that they will not be frequent anyway.
Optimization: The frequent itemset mining algorithm essentially grows itemsets
from size k to size k + 1 as follows using a three-step process.
• Candidate Generation Step: The input to this step is the set of frequent itemsets (whose
  support is above a threshold) of size k, $F_k$. From these frequent itemsets we first
  generate a candidate set $C_{k+1}$ of itemsets of size k + 1 that satisfy the apriori
  property; that is, we add to $C_{k+1}$ every itemset of size k + 1 all of whose subsets
  of size k are present in $F_k$ (Fig. 15.18iii, v).
• Frequency Counting Step: The itemsets of size k + 1 in the candidate set $C_{k+1}$
  are the only itemsets that have a chance of having a frequency above the support
  threshold, θf. All other combinations of items of size k + 1 are not counted at
  all. This greatly reduces the number of itemsets on which the counter has to
  run in the next iteration:
\[
f(s|X) = \sum_{n=1}^{N} \delta(s \subseteq x_n), \quad \forall s \in C_{k+1}
\]
  where δ(·) is 1 if its argument is true and 0 otherwise.
• Frequency Pruning Step: Finally, when a pass through the data has been
  made and all frequencies of candidate itemsets are counted, the itemsets whose
  frequency is below the support threshold are removed to obtain $F_{k+1}$, the final
  frequent itemsets of size k + 1.
Figure 15.18 shows the entire process of generating frequent itemsets of size
up to 3 from an itemset data with support threshold of 3. Each iteration alternates
between the above three steps.
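The three steps above (candidate generation, frequency counting, frequency pruning) can
be sketched as follows. This is a minimal, unoptimized illustration: the function name
`apriori` and the ten baskets are hypothetical (they are not the data of Fig. 15.18),
and an itemset is kept when its frequency is at least the threshold.

```python
from itertools import combinations

def apriori(baskets, min_support):
    """Return {itemset: frequency} for every itemset with frequency >= min_support."""
    baskets = [frozenset(b) for b in baskets]
    # Frequent itemsets of size 1.
    counts = {}
    for b in baskets:
        for item in b:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 1
    while frequent:
        # Candidate generation: size k+1 sets whose size-k subsets are all frequent.
        items = sorted(set().union(*frequent))
        candidates = [
            frozenset(c) for c in combinations(items, k + 1)
            if all(frozenset(sub) in frequent for sub in combinations(c, k))
        ]
        # Frequency counting: one pass over the data per candidate.
        counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
        # Frequency pruning: keep only candidates that meet the support threshold.
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result

# Hypothetical dataset of 10 baskets over the vocabulary {a,b,c,d,e,f}.
baskets = [set("aef"), set("abef"), set("acef"), set("ab"), set("bc"),
           set("bcd"), set("ac"), set("ce"), set("bf"), set("de")]
frequent_itemsets = apriori(baskets, min_support=3)
for itemset, freq in sorted(frequent_itemsets.items(), key=lambda p: (len(p[0]), sorted(p[0]))):
    print(sorted(itemset), freq)
```

Enumerating all (k + 1)-combinations of the surviving items is the simplest way to
generate candidates; practical implementations join pairs of frequent k-itemsets instead.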
The purpose of creating frequent itemsets is to find rules of the sort: (If condition
then trigger) with some confidence. For example, once we have discovered through
the above process that {a,e,f} is a frequent itemset, we can now create rules of the
form: {a,e} ➔ {f}, {a,f} ➔ {e}, {e,f} ➔ {a}, {a} ➔ {e,f}, {e} ➔ {a,f}, {f} ➔ {a,e}. Each
rule comes with a confidence score computed from the frequency of the entire
set {a,e,f} and the frequency of the condition set, that is, for the rule {a,e} ➔ {f},
\[
P(\{f\} \mid \{a,e\}) = \frac{f(\{a,e,f\}|X)}{f(\{a,e\}|X)}
\]
In other words, in this example, if someone bought both a and e, then the probability
that they will also buy f is 1, so f can be recommended with very high
confidence. In frequent itemset mining, all such rules are created and a confidence
threshold θc is used to prune out rules with lower confidence. The output of the
frequent itemset algorithm is the set of such rules with high support and confidence.
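Rule generation from a single frequent itemset can be sketched as below. The helper
names `support` and `rules_from_itemset`, the baskets, and the confidence threshold of
0.7 are assumptions for illustration; the confidence of a rule is the frequency of the
whole itemset divided by the frequency of the condition set, as described above.

```python
from itertools import combinations

def support(itemset, baskets):
    """Number of baskets that contain every item of itemset."""
    return sum(1 for b in baskets if itemset <= b)

def rules_from_itemset(itemset, baskets, min_confidence):
    """All rules condition -> consequence from one itemset that pass the threshold."""
    itemset = frozenset(itemset)
    whole = support(itemset, baskets)
    if whole == 0:
        return []
    rules = []
    for r in range(1, len(itemset)):
        for cond in combinations(sorted(itemset), r):
            cond = frozenset(cond)
            confidence = whole / support(cond, baskets)
            if confidence >= min_confidence:
                rules.append((cond, itemset - cond, confidence))
    return rules

# Hypothetical baskets (not the data of Fig. 15.18).
baskets = [set("aef"), set("abef"), set("acef"), set("ab"), set("bc"),
           set("bcd"), set("ac"), set("ce"), set("bf"), set("de")]
rules = rules_from_itemset("aef", baskets, min_confidence=0.7)
for cond, cons, conf in rules:
    print(sorted(cond), "->", sorted(cons), round(conf, 2))
```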
Frequent itemset mining was one of the early algorithms that all but gave
birth to the field of “data mining.” It was the first breakthrough of its kind in mining
such itemset data, and since then there have been a number of improvements in smart
data structures for storing the candidate and frequent itemsets that make it faster and more
scalable. It has also been applied to areas beyond retail data mining, for which it was
originally invented. It has been used to discover “higher order features” of type “sets
of items” in various domains including computer vision where each image region
could be thought of as a collection of symbols from a vocabulary (HoG or SIFT). If
many regions across many images show the same set of items (e.g., face images all
show eye, nose, mouth, etc.), then a new object (face) can be created from a set of
lower order features. Wherever we have a “set of items” dataset, we can use FISM.
See Chap. 6 (Sect. 6.2) in Han et al. (2011) for additional material on frequent
itemset mining.
6 Network Analysis
Now that we understand how to find patterns in sets and multivariate data, we
turn our attention to an even more complex yet commonly available data type—
a graph or network data. These graphs may be weighted (i.e., edges have weights)
or unweighted (i.e., edges are binary, either present or not), directed (i.e., edges
go from one node to another) or undirected (i.e., there is no direction on edges),
or homogeneous (i.e., all nodes and edges are of same types) or heterogeneous (i.e.,
nodes or edges are of different types). Analyzing graphs for patterns presents very
interesting challenges and a lot of opportunities in a wide variety of applications.
There are a number of different kinds of patterns that can be discovered in graphs.
In this section, we will focus on two kinds of network analysis problems: (1)
PageRank, one of the most important algorithms in graph theory, which led to the birth
of companies like Google, and (2) detecting cliques in graphs, another commonly
used algorithm with many applications.
Graphs or networks are present in many domains. The Internet, for example, is a
collection of a very large number of web pages (generated at the rate of more than
1000 pages per minute) with links going from one page to another (directed graph).
This is perhaps one of the largest graphs out there. Social networks are another class
of large graphs: LinkedIn, Facebook, telecom networks (e.g., people calling each
other above a threshold), financial networks (e.g., based on money transfers), etc.
Weighted graphs can also be created from transaction or co-occurrence data. For
example, consider market basket data, where we can quantify the consistency with
which two products (a, b) are co-purchased, P(a, b), relative to what would be expected
at random, P(a)P(b), using, for example, pointwise mutual information:
\[
\phi(a, b) = \log\frac{P(a, b)}{P(a)\,P(b)}
\]
Any co-occurrence data can be converted to such weighted graphs where the
edges can be removed if these weights are below a threshold. Many measures such as
Jaccard coefficient, normalized pointwise mutual information, and cosine similarity
can be used to create these weighted graphs. Next we develop two algorithms for
network analysis.
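As a rough sketch of how co-occurrence data becomes a weighted graph, the code below
estimates P(a), P(b), and P(a, b) as basket fractions, computes the pointwise mutual
information for each co-occurring pair, and drops edges at or below a threshold. The
function name `pmi_graph`, the baskets, and the threshold of 0 are assumptions.

```python
import math
from itertools import combinations

def pmi_graph(baskets, threshold=0.0):
    """Return {(a, b): phi(a, b)} for co-occurring pairs with phi above threshold."""
    n = len(baskets)
    item_count, pair_count = {}, {}
    for b in baskets:
        for a in b:
            item_count[a] = item_count.get(a, 0) + 1
        for a, c in combinations(sorted(b), 2):
            pair_count[(a, c)] = pair_count.get((a, c), 0) + 1
    edges = {}
    for (a, c), n_ac in pair_count.items():
        # phi = log( P(a,c) / (P(a) P(c)) ), with probabilities as basket fractions.
        phi = math.log((n_ac / n) / ((item_count[a] / n) * (item_count[c] / n)))
        if phi > threshold:
            edges[(a, c)] = phi
    return edges

# Hypothetical baskets: bread and milk co-occur more often than chance, eggs and milk less.
baskets = [{"milk", "bread"}, {"milk", "bread"}, {"milk", "eggs"},
           {"eggs"}, {"eggs"}, {"bread", "milk"}]
edges = pmi_graph(baskets, threshold=0.0)
print(edges)
```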
Given a directed graph like the Internet, we are interested in finding out which is
the most important node in the graph. The key motivation behind this problem came
from Google where they wanted to sort all pages that contained a keyword in an
order that “made sense.” They posed this as a “random surfer” problem—if a surfer
randomly picks a page on the Internet and starts following the links, what would
be the probability that he will be at a certain page, and if we average over all such
random surfers, which page on the Internet would have the largest number of people,
the second largest number of people, and so on. This distribution over pages in the
steady state gives the PageRank of each page on the Internet.
Intuition: Every page on the Internet has a set of incoming edges (shown
for node j in Fig. 15.19) and a set of outgoing edges (shown for node i in
Fig. 15.19). When on any page (node i), a random surfer might have some (e.g.,
equal) probability of going to one of the outgoing edges from this page. Thus the
probability of the random surfer to “reach” a page (node j) would be to first “be”
at one of the incoming pages (e.g., node i) of this page with a certain probability
and then reach this page (node j) with a certain transition probability from that page
(node i) in the next iteration as shown in Fig. 15.19.
Formulation: We now formulate this PageRank problem. Let us assume that
there are N pages on the Internet X = {x1, x2, ..., xN}. Let I(xn) be the set of
in-neighbors of xn and let O(xn) be the set of out-neighbors of xn. The link structure is
characterized by the transition probabilities: P = [P(xj | xi )], ∀ xj ∈ O(xi ). This could
be either an equal probability or a weighted probability depending on the nature of
the links going from xi to xj . For example, this transition probability will depend on
whether there are lots of prominent links or a few footnote links going from xi to
xj vs. xi to xk , some other page going out of xi . Let us assume that there is a prior
probability that a user might “start” or “randomly go to” a particular page. This
prior depends on, for example, how many people have this page as their home page
or how often this page is typed directly into the browser compared to the other pages.
Let Q(xi ) be this initial probability of going to this page. Let us say this random
jumping to this page happens with a probability (1 − λ) and with a probability λ
the surfer actually systematically follows the links (browser behavior). Now at any
given iteration t, we can compute the probability that a random surfer will be at a
certain page:
\[
P_{t+1}(x_j) \leftarrow (1-\lambda)\,Q(x_j) + \lambda\sum_{i=1}^{N} P_t(x_i)\,P(x_j|x_i)
  = \frac{1-\lambda}{N} + \lambda\sum_{x_i \in I(x_j)} \frac{P_t(x_i)}{|O(x_i)|}
\]
In the above we made an assumption that Q(xj ) are all equal to 1/N and outgoing
probabilities P(xj | xi ) are all equal to 1/|O(xi )|. Once converged, this gives the most
“central” or important pages based on the link structure of the graph. Such an
analysis can be done not just on the Internet graph but any directed graph. For
example, if we have a gene expression graph that suggests which gene affects which
other genes, we can find the most important genes in the network. Similarly, if
we have an influencer–follower graph on a social network, we can find the most
influential people in the social network and so on.
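The update above can be iterated directly. The sketch below assumes uniform Q and
uniform outgoing probabilities, as in the text; the function name `pagerank`, the value
λ = 0.85, the four-page graph, and the handling of pages without out-links (their mass
is spread uniformly) are assumptions made for this example.

```python
def pagerank(out_links, lam=0.85, iterations=100):
    """Iterate P_{t+1}(x_j) = (1 - lam)/N + lam * sum_{i in I(j)} P_t(x_i)/|O(x_i)|.

    out_links maps every node to the list of nodes it links to; every link
    target must itself be a key of out_links.
    """
    nodes = sorted(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - lam) / n for v in nodes}
        for i in nodes:
            targets = out_links[i]
            if targets:
                for j in targets:
                    new_rank[j] += lam * rank[i] / len(targets)
            else:
                # Assumption: a page with no out-links jumps to a random page.
                for j in nodes:
                    new_rank[j] += lam * rank[i] / n
        rank = new_rank
    return rank

# Hypothetical four-page web: every page eventually leads to C, which links back to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = pagerank(graph)
print(sorted(rank.items(), key=lambda kv: -kv[1]))
```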
Fig. 15.20 A set of cliques found in keyword–keyword co-occurrence graph created from IMDB
dataset
Here we first create a graph between all pairs of keywords based on how often they
co-occur more than random. This graph is then binarized by applying a threshold
and then cliques are sought in this graph.
A “Clique” is a fully connected subgraph of a binary graph. A “Maximal
Clique” is a clique that is not a sub-graph of any other clique. Figure 15.21 shows
a graph with eight nodes and ten edges. It has four maximal cliques. Finding all
maximal cliques in a graph is an NP-hard problem with a known complexity of
$O(3^{n/3})$ for a graph with n nodes. In this section, we will present a MapReduce
algorithm for finding all maximal cliques of a binary graph. Finding such maximal
cliques in graphs could help improve our understanding of the graph, find actionable
insights in the graph, and even discover higher order structures beyond nodes and
edges (e.g., product bundles or communities).
In order to develop the MapReduce algorithm for finding all maximal cliques,
we will first introduce a few concepts:
• Neighborhood of a clique: For any known clique in the graph (e.g., {b,c}) we
  define its neighborhood as the set of nodes that are connected to all nodes in the
  clique. Here, since node f and node g are connected to both b and c, they form the
  neighborhood of the clique {b,c}. In other words:
  N({b, c}) = {f, g}
  N({b, c, f}) = {g}
  N({a}) = {b, e}
  N({a, b, e}) = ∅
  N({b, c, f, g}) = ∅
  N({c, d}) = ∅
  N({h}) = ∅
• Clique map: We define a map between a clique (key) and its neighborhood
(value) as a Clique Map. This is the main data structure that will be used by
the MapReduce algorithm to find all maximal cliques iteratively.
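The two concepts above can be made concrete with a small sketch. The edge list below is
inferred from the neighborhoods listed in the text (eight nodes, ten edges) and should
be read as an assumption about the graph in the figure; the names `adjacency` and
`clique_neighborhood` are hypothetical.

```python
# Edge list inferred from the neighborhoods listed above (an assumption).
NODES = "abcdefgh"
EDGES = [("a", "b"), ("a", "e"), ("b", "e"), ("b", "c"), ("b", "f"),
         ("b", "g"), ("c", "f"), ("c", "g"), ("f", "g"), ("c", "d")]

def adjacency(nodes, edges):
    """Undirected adjacency sets; isolated nodes get an empty set."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def clique_neighborhood(clique, adj):
    """Nodes connected to every member of a (non-empty) clique."""
    return set.intersection(*(adj[v] for v in clique))

adj = adjacency(NODES, EDGES)
# Clique map for cliques of size 1: each node (key) with its neighborhood (value).
clique_map = {frozenset([v]): frozenset(adj[v]) for v in NODES}
print(sorted(clique_neighborhood({"b", "c"}, adj)))
```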
Fig. 15.22 Four MapReduce iterations needed to find all maximal cliques of different sizes for
the graph shown in Fig. 15.21. Iteration 1 is the input to the algorithm; it comprises all cliques
of size 1 and their clique neighbors. Iteration 2 is the set of all cliques of size 2 (edges) and their
neighbors, iteration 3 is the set of all cliques of size 3 and their neighbors, and so on. In each
iteration, a clique whose clique neighbor is empty is deemed a maximal clique and stored
starts with cliques of size 1, that is, each node is a clique. This is stored along with
its adjacency list or clique neighbor (forming a clique map shown in Fig. 15.22,
iteration 1). Figure 15.22 shows the four iterations of the algorithm, where each
iteration is the same MapReduce step that takes us from clique maps of size k
cliques to clique maps of size k + 1 cliques. The crux of this algorithm is
the Map and the Reduce steps that take us from one iteration to the next.
The Map Step
In each iteration of the algorithm, we are given a clique map with a clique and
its neighborhood. We want to grow the clique by adding one neighbor at a time to
the original clique. We make the following observation about a clique map (e.g.,
N({b, c}) = {f, g}): If one element (say f ) is removed from the clique neighbor and
added to the clique itself ({b, c}), the resulting set ({b, c, f }) will also be a clique.
This is true because we know that by definition f is connected to both b and c and
{b, c} is already a clique so {b, c, f } will also be a clique. However, also note that
we cannot guarantee that what remains on the neighborhood side (i.e., {g}) is still a
neighbor of the new clique ({b, c, f }) because for that we need to guarantee that g is
connected to f, information that is not available to this mapper.
The Reduce Step
The output of the mapper is a set of intermediate key-value pairs obtained from clique maps.
While the keys of these pairs are guaranteed to be cliques of size K + 1, the values
are not guaranteed to be the neighbors of the corresponding cliques. In order to
obtain the clique map of size K + 1, the reducer must take an intersection of all the
value sets emitted for the same clique. Figure 15.23 shows the entire process from Map to
Shuffle to Reduce that takes us from clique maps of size 1 to clique maps of size 2. Repeating
this process in each iteration results in cliques of various sizes.
Fig. 15.23 Each MapReduce iteration for finding all maximal cliques in an unweighted graph.
The Map step takes clique maps of size K and generates all possible cliques of size K + 1 by moving
one element at a time to the clique side from the neighborhood side. The Reduce step then takes
the intersection of the remaining neighbors for the same clique of size K + 1, resulting in clique
maps of size K + 1
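One iteration of the Map, Shuffle, and Reduce steps described above can be simulated in
a single process as follows. This is a sketch, not a distributed implementation: the
function names are hypothetical, and the edge list is inferred from the clique
neighborhoods listed earlier (eight nodes, ten edges), so it is an assumption about the
example graph.

```python
from collections import defaultdict

def map_step(clique_map):
    """Move one neighbor at a time to the clique side; emit (new clique, remaining neighbors)."""
    for clique, neighbors in clique_map.items():
        for v in neighbors:
            yield clique | {v}, neighbors - {v}

def reduce_step(pairs):
    """Shuffle by clique, then intersect all value sets emitted for the same clique."""
    grouped = defaultdict(list)
    for clique, remaining in pairs:
        grouped[clique].append(remaining)
    return {clique: frozenset.intersection(*values) for clique, values in grouped.items()}

def maximal_cliques(adj):
    """Iterate from size-1 clique maps; a clique with an empty neighborhood is maximal."""
    clique_map = {frozenset([v]): frozenset(nbrs) for v, nbrs in adj.items()}
    maximal = []
    while clique_map:
        maximal += [c for c, nbrs in clique_map.items() if not nbrs]
        clique_map = reduce_step(map_step(clique_map))
    return maximal

# Example graph (edge list inferred from the text; an assumption).
edges = [("a", "b"), ("a", "e"), ("b", "e"), ("b", "c"), ("b", "f"),
         ("b", "g"), ("c", "f"), ("c", "g"), ("f", "g"), ("c", "d")]
adj = {v: set() for v in "abcdefgh"}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

found = maximal_cliques(adj)
print(sorted(sorted(c) for c in found))
```

Cliques with an empty neighborhood emit nothing in the Map step, so the loop stops on
its own once no clique can be grown further.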
While we explored only two broad ideas, one macro and one micro, in network
analysis, there are a large number of algorithms, especially around community
detection, where softer variants of cliques (communities) are discovered within the
networks. Analysis of networks can find interesting structures like fraud syndicates
in telecommunication or service networks and financial networks. Link prediction,
another important area in network analysis, is used by LinkedIn and other social
networks to suggest more connections to any individual based on their neighborhood
structure and so on. Handcock et al. (2007) can be a helpful resource to learn further.
7 Conclusion
Supervised learning, on the other hand, starts with a question and forces us to
read the book but only with respect to the question. In general, it is always better
to explore the data using these unsupervised learning approaches before building
supervised learning models on it. The insights derived from these algorithms can be
used as is to draw conclusions about the data, make decisions, or serve as features
for the next stages of modeling.
More examples, corresponding code, and exercises for the chapter are given in the
online appendices to the chapter. All the datasets, code, and other material referred
to in this section are available at www.allaboutanalytics.net.
References
Criminisi, A., Shotton, J., & Konukoglu, E. (2012). Decision forests: A unified framework for
classification, regression, density estimation, manifold learning and semi-supervised learning.
Foundations and Trends® in Computer Graphics and Vision, 7(2–3), 81–227.
Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning (Vol. 1, No.
10). Springer series in statistics. New York, NY: Springer.
Han, J., Pei, J., & Kamber, M. (2011). Data mining: Concepts and techniques. Amsterdam:
Elsevier.
Handcock, M. S., Raftery, A. E., & Tantrum, J. M. (2007). Model-based clustering for social
networks. Journal of the Royal Statistical Society: Series A (Statistics in Society), 170(2), 301–
354.
Michalski, R. S., Carbonell, J. G., & Mitchell, T. M. (Eds.). (2013). Machine learning: An artificial
intelligence approach. Berlin: Springer Science & Business Media.
Murphy, K. (2012). Machine learning – A probabilistic perspective. Cambridge, MA: The MIT
Press.
Rasmussen, C. E. (2004). Gaussian processes in machine learning. In O. Bousquet, U. von
Luxburg, & G. Rätsch (Eds.), Advanced lectures on machine learning. ML 2003. Lecture notes
in computer science (Vol. 3176). Berlin: Springer.
Robert, C. (2014). Machine learning, a probabilistic perspective. Chance, 27(2), 62–63.
Chapter 16
Machine Learning (Supervised)
Shailesh Kumar
Every time we search the Web, buy a product online, swipe a credit card, or even
check our e-mail, we are using a sophisticated machine learning system, built on a
massive cloud platform, driving billions of decisions every day. Machine learning
has many paradigms. In this chapter, we explore the philosophical, theoretical,
and practical aspects of one of the most common machine learning paradigms—
supervised learning—that essentially learns a mapping from an observation (e.g.,
symptoms and test results of a patient) to a prediction (e.g., disease or medical
condition), which in turn is used to make decisions (e.g., prescription). This chapter
explores the process, science, and art of building supervised learning models. The
examples, corresponding code, and exercises for the chapter are given in the online
appendices to the chapter.
1 Introduction
The last few decades have seen an unprecedented growth in our ability to collect
and process large volumes of data in a variety of domains—from science to social
media, e-commerce to enterprises, Internet to Internet-of-things, and healthcare
S. Kumar
Reliance Jio, Navi Mumbai, Maharashtra, India
e-mail: skumar.0127@gmail.com
Our technology, our machines, is part of our humanity. We created them to extend ourselves,
and that is what is unique about human beings!—Ray Kurzweil
Ever since the dawn of mankind, we have been trying to extend ourselves in all
our faculties: If we could not lift more, we created levers and cranes; if we could
not move fast and far, we created horse carts and cars; if we could not see far, we
created telescopes; if we could not speak loud enough, we created microphones;
if we could not compute fast enough, we created calculators and computers; if we
could not talk far enough, we created telephones and mobiles; etc. In this journey,
we are also extending one of our most important faculties that make us unique—our
intelligence. Using machine learning and Artificial Intelligence, we are now at the
early stages of building intelligent machines that can see, listen, speak, read, learn,
understand, think, create, plan, and converse like humans.
Before we can build intelligent machines, however, it is essential to understand
the nature of intelligence itself. Intelligence has many facets; for example, it is the
ability to:
• Learn causality or correlation from past data (e.g., Should I approve this loan?)
• Recognize structures in the data (e.g., words in speech, objects in images)
• Understand semantics using context (e.g., apple is healthy, I like apple products)
• Adapt to novel situations (e.g., network routers react to changes in traffic patterns)
• Reason about alternate ways of solving a problem (e.g., playing chess)
• Synthesize data (e.g., next word in a ??, next utterance in a conversation)
In the context of supervised learning, let us explore two of these notions of
intelligence in a little more depth: understanding and generalization.
Does the Google Search Engine actually understand the Web? Do YouTube or
Netflix understand the videos they store? Do Amazon and Zomato understand the
reviews written by their customers? Do our “smart” phones actually understand
what we are speaking into them? It is one thing to collect, store, transfer, or index
a large amount of data, but it is a completely different thing to actually understand
it. One of the first fundamental qualities of an intelligent system is its ability to
interpret the raw data it is receiving at the right level of abstraction (e.g., pixels,
lines, blobs, eyes, face, body). But what does understanding mean?
Our language and sensory systems evolved not only to capture and transmit the
raw data to the brain, but to actually understand it in real time, that is, to identify
structures, objects, and attributes in them. Our visual system—perhaps the most
sophisticated intelligent system so far—looks at pixels in the retina but sees a fresh
red rose or a flying eagle in the brain. Similarly, when we process a sequence of
words (e.g., “Apple filed a suit against Orange”) we interpret or assign meaning
to each part (word)—for example, “Apple” the company, not fruit—so the whole
(sentence) makes sense.
Understanding is a hierarchical process of using context to interpret each part
so the whole—as a juxtaposition of its parts—makes sense.
A large class of Unsupervised and Supervised Learning algorithms today are
understanding algorithms as they try to interpret the raw data, for example, this
word is a noun (part of speech tagger), this document is about hockey (document
classifier), this article mentions Mahatma Gandhi (information extraction), this
video segment shows bungee jumping (activity recognition), this image shows a
cat under a table (object recognition in images), this person is Mr. X (speaker
recognition from voice, face recognition from image, or fingerprint recognition).
Fig. 16.1 We can recognize this letter immediately without having seen any of these renditions
before
“classifier” with this training data that can now recognize new examples that it has
never seen before. The basic principle of generalization is that:
Inputs that belong to the same class must be similar to each other in some way.
A classification model maps a set of input features to a discrete class label (e.g.,
digits (0–9) or characters (A–Z) from images; emotions (sad, happy,
confused, frustrated, . . . ) from face images; land-cover (water, marsh, sand, etc.)
from remote sensing data; e-mail type (spam, promotion, finance, update) from
e-mails (e-mail classifier); words (e.g., words in any language) from speech data
(speech to text); objects (cat, dog, car, tree, etc.) in images (computer vision); activity
(stealing, holding, throwing, etc.) from videos (activity recognition); part-of-speech
of a word in a sentence (POS taggers); sentiment in a tweet or a review about an
entity (movie, product, etc.)).
In all these cases the input could take any form—multivariate, text, image,
speech, video, sequence of transactions, etc. This raw data is further converted to
some meaningful features. The output is a discrete class label from a predefined set
of labels. In two-class classification problems (e.g., spam vs. not-spam, churn vs.
not-churn, conversion probability estimation), the output is typically interpreted as
probability of target class (e.g., spam, churn, click). Appropriate thresholds on this
probability can be used to make a binary decision. In general, the classes themselves
might have hierarchies (e.g., news articles might be labelled as sports, entertainment,
politics, science, etc. at the first level of hierarchy while within sports class, there
might be subclasses for various sports or within science there might be subclasses
such as space, medicine, and technology).
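The role of the threshold in a two-class problem can be illustrated with a tiny sketch.
The function name `classify`, the scores, and the two thresholds are hypothetical; the
scores stand for the predicted probability of the target class (e.g., spam).

```python
def classify(prob_target, threshold=0.5):
    """Binary decision: flag the target class when its probability reaches the threshold."""
    return prob_target >= threshold

# Hypothetical predicted probabilities of the target class for four e-mails.
scores = [0.95, 0.70, 0.40, 0.10]
lenient = [classify(p, threshold=0.3) for p in scores]  # flags more items
strict = [classify(p, threshold=0.9) for p in scores]   # flags only near-certain items
print(lenient, strict)
```

A lower threshold catches more of the target class at the cost of more false alarms; a
higher threshold does the opposite.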
A regression model maps input features to a real or ordinal value (e.g., click-
through-rate prediction in search and advertising, lifetime-value prediction of a
customer, efficiency prediction from device sensors, capacity of a customer to take
a loan, and value of a house/property in a local neighborhood). Regression is used
in many ways: to predict a value, to predict a score in a certain range (e.g.,
in ordinal regression we might want to predict a score from, say, 1 to 5 for ratings
corresponding to bad, poor, fair, good, and excellent), or to forecast a value into
the future (e.g., demand prediction for products in retail).
A recommendation model maps a past behavior into future potential activities (e.g.,
which product a customer might buy given what he/she purchased, browsed, etc. in
the past, which movie a user might like on Netflix or YouTube given his/her past
content consumption, whom a user might like to connect with on Facebook or LinkedIn
given his/her current connections and interactions, which news or tweet a user might
like given his/her past consumptions, which topic the student should study next
given how he/she has fared in past topics). A typical recommendation engine uses a
two-stage process:
Creating user profile: In this first stage, user’s past behavior is used to build
his/her profile (a set of features and their weights). For example, in retail the profile
might be built based on the products the user searched, added to wish list, purchased,
read a review about, wrote a review about, etc. In education, the profile of a student
may be created using the time spent on learning, number of problems solved, and
test scores on problems associated with each topic. In YouTube, a user profile may
be built based on videos previously watched, liked, and commented on by the user.
Note that a user may have different types of interactions with the same entity.
Each interaction type could be given a different weightage (e.g., purchase is more
important than browse, writing a review might be more important than reading one)
while creating the user profile. Once the user profile is built it is used to make the
actual recommendation.
A retrieval model maps a query into a sorted list of entities (e.g., relevant Web
pages on search engines for a given text query; relevant images, videos, or news
stories on a search engine given a text query; relevant images/videos for an image
query (content-based image retrieval); a relevant song for a given humming or audio
snippet; relevant property/car/products on various entity search portals; relevant
flights/hotels on various travel portals (structured queries); and relevant gene
sequences for a gene snippet query).
In both retrieval and recommendation paradigms, the output is a list of entities
sorted by a score. The key difference is that in recommendation, the (recommenda-
tion) score is based on the behavior summarized into a profile of the user, while in
retrieval the (relevance) score is based on the match to a query. For example, in Web
search, one might use URL match, title match, anchor text match (text associated
with all incoming links to this page), header match, body match, etc. Click feedback
data is used to learn the relative importance of various types of matches between
query and entity fields to synthesize the final relevance score.
One of the key skills of a data scientist is to formulate a business problem as one
(or more) of these paradigms, pick the right kind of modelling approach within the
paradigm, and use data and domain knowledge to learn these models. In the rest
of this chapter, we will focus primarily on the classification paradigm. You can refer
to Chaps. 7 and 8 (Linear and Advanced Regression) for the regression paradigm and
Chap. 21 (Social Media and Web Analytics) for examples on the retrieval paradigm.
One of the primary goals of Data Science is to drive business and operational
decisions from data to maximize profitability or efficiency metrics, respectively.
Figure 16.2 shows the overall process typically used to drive decisions from data.
We will explore each of the stages of this process in this section.
Fig. 16.2 The overall data science process for building and improving models
Often the most effective way to describe, explore, and summarize a set of numbers—even
a very large set—is to look at pictures of those numbers.—Edward R. Tufte
The first stage in the data science process is to understand the nuances in the data
itself before we start building models. The insights hidden in the data either confirm
some of our own hypotheses about the underlying process that generated the data or
reveal new aspects of the process that we did not know before. Some of the basic
practices for revealing insights in the data include:
Feature Distributions: One of the most basic sets of insights comes from
individual feature distributions. Most modelling techniques assume normal or well-
behaved distributions, while most real-world features are either exponentially or
log-normally distributed. Looking at feature histograms reveals such nuances and
helps correct for them by, for example, taking the log of those features that
are exponentially distributed. Further, looking at feature distributions of different
classes reveals whether or not a certain feature would be useful for discriminating
various classes. Feature distributions reveal structure in each feature independent of
other features.
Scatter Plots: A powerful yet simple technique in understanding feature inter-
actions is scatter plots between all pairs of features. This visually shows correlation
among features, if any. Further, color coding each data point with class label reveals
combination of features that might help discriminate classes. Figure 16.3 shows
scatter plots of the IRIS dataset with respect to a few pairs of features. Scatter plots
limit us to visualizing the data only two or three features at a time.
Fig. 16.3 Scatter plots of IRIS data—reveals how two classes are more similar to each other
Fig. 16.4 PCA vs. Fisher projection of same data (classes 0, 3, and 8 from MNIST)
One of the most creative parts of the data science process, literally the art of data
science, is the feature engineering stage. There are two types of data scientists,
viz., feature-centric and model-centric:
The feature-centric data scientists believe in systematically and painstakingly
creating meaningful features to make the modelling stage simple. For them “real”
data science happens here. They marry their deep understanding of the data
(acquired from insights stage) with substantial appreciation of the domain knowl-
edge (acquired from domain experts) to build features. These features are highly
interpretable, semantically deeper than the original data, and cover all potential
aspects of input-output mapping. This traditional approach to data science is more
useful when labelled data is scarce relative to domain knowledge and interpretability
of model output is as important as its accuracy.
The model-centric data scientists, on the other hand, believe that throwing
a large amount of labelled data and computational resources (e.g., GPUs) at a deep
learning model will let it automatically learn the right features (in the lower layers)
as well as the mapping between those features and the output (in the higher layers).
Here, the creative process takes the form of designing the right
architecture—nature and type of layers in the deep learning models as opposed
to designing individual features. This model-centric deep learning approach works
well in domains such as text, vision, speech, and time series data where (1) the space
of possible features is very large, which makes it impractical to explore it through
traditional feature engineering; (2) the amount of data is substantial enough to learn
the large number of parameters in deep learning models; and (3) the semantic gap
between the raw input (e.g., pixels in images or words in text) and the final output
(e.g., activity in video or meaning of a document) is so large that we need a hierarchy
of features and not just a single layer of features.
In the rest of this section, we will explore a number of transformations on raw
data that constitute traditional feature engineering:
Feature transformation: In a typical model, the different input features might
have very different distributions and ranges. Combining them into a model such as
logistic regression without first making their distributions “compatible” makes the
life of the model miserable. Taking log of certain features (that are exponentially
or log-normally distributed), binning the values, or applying any domain-specific
transformation (e.g., Fourier transformations or wavelet transformations on time-
series data) might help build better models than just shoving the raw inputs into the
model. For example, in many models using income as a feature, it might be better
to either bin the income or take the log of the income since income is typically
exponentially distributed (lots of people have low income, very few have very high
income). Using percentile scores or cumulative density binning is also an example
of taming the distribution variability in the data.
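As a minimal sketch of the log transformation described above, the following Python snippet compares the skewness of a heavy-tailed feature before and after taking logs. The income values and the helper name `skewness` are illustrative assumptions, not data or code from the chapter.

```python
import math

def skewness(values):
    """Population skewness: third central moment divided by std**3."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    third = sum((v - mean) ** 3 for v in values) / n
    return third / var ** 1.5

# Hypothetical incomes spanning several orders of magnitude.
incomes = [1_000, 10_000, 100_000, 1_000_000, 10_000_000]
log_incomes = [math.log10(v) for v in incomes]

raw_skew = skewness(incomes)      # strongly right-skewed
log_skew = skewness(log_incomes)  # close to symmetric after the log
```

On this toy data the raw feature has a long right tail, while the logged feature is evenly spread, which is the kind of "compatible" distribution a linear model can use more easily.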
Feature normalization: Even after proper transformations, the raw inputs might
be in different ranges and their values in different units. For example, to predict the
value of a house, one might need features such as number of rooms and bathrooms
(count), area of the house (square feet), distance from nearest school or places of
interest (kilometers), prices of nearby houses sold recently (money), and age of the
house (years). While the distribution can be tamed as described above, the values
might still need to be brought into comparable ranges. For this, the features might
need to be transformed to some min-max range (so min is always 0 and max is
always 1) or z-scored values could be used (so the mean of each feature is zero and
standard deviation is 1). Such transformations then let the model do the actual job
of learning the relative importance of these features instead of forcing it to also
compensate for these feature differences. Care must be taken to first remove outliers
in each feature before learning parameters for min-max or z-score normalization.
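The two normalizations mentioned above can be sketched in a few lines of Python. This is a minimal illustration under simple assumptions: the function names and the house areas are hypothetical, and outlier removal is assumed to have been done beforehand.

```python
def min_max_scale(values):
    """Rescale values so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

areas_sqft = [800, 1200, 1600, 2400]  # hypothetical house areas
scaled = min_max_scale(areas_sqft)
standardized = z_score(areas_sqft)
```

In practice the minimum, maximum, mean, and standard deviation would be learned on the training data and then reused unchanged when scoring new examples.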
Creating invariant features: Often the raw data contains variances in it that
are not related to the problem at hand. For example, speech recognition problems
have accent variances; images might have illumination, pose, rotation, and scale
variances; and transaction and time series data might have seasonal variances. In
essence, the final “data” that we see (e.g., sound of a word spoken by a person)
is a “joint” of the actual signal in it (e.g., the actual word spoken) with additional
factors (accent, tonal quality, loudness, etc.). Keeping what is essential for the task
(signal) and ignoring what is not (noise) is the key to good feature engineering.
Understanding and removing these variances is perhaps the most intricate part of
feature engineering and requires deep understanding of the domain, possible sources
of such variances, and the tools to remove these variances. If not removed, the model
will become complex and will try to learn these variances instead of doing actual
classification.
Ratio Features: A lot of features contain variances that can be removed simply
by dividing them by other features. For example, in information retrieval models,
query length bias is removed by dividing the total match between query and
document field (e.g., title) with query length. In credit models, instead of using
total debt it is better to use debt-to-income ratio, instead of using total-payment a
better feature would be the percent of EMI paid, and instead of total-credit-taken,
percent of credit limit reached might be better features. Such ratio features cannot
be “discovered” by the modelling techniques that are only doing linear combination
of features (e.g., logistic regression or linear Support Vector Machines). Infusing
domain knowledge through ratio features helps the model explore the right “space” in
which to discriminate classes.
Output feature ratios: Not only the input features but also the output features
might have biases that must be corrected for before trying to predict them.
For example, instead of predicting click-through-rate of a document for a query, we
might want to first take into account the expected click-through-rate bias at each
position (e.g., people are anyway more likely to click on the first result than second
and so on irrespective of the query and document). In forecasting sales, instead of
predicting the raw sales count, we might want to predict deviation from the expected
sales given the context (city, season, etc.). The ratings data (movie or product rating)
has inherent “consumer bias.” A critical consumer will typically rate most products
say 1–3 out of 5 and hardly give a rating of 5, while a generous customer might
rate most products between 3 and 5. Now a rating of 4 on a certain product does
not mean the same thing for these two customers. It should be “calibrated” correctly
to remove individual customer’s rating biases to make them “comparable” across
customers.
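One simple way to calibrate ratings across customers is to express each rating as a deviation from that customer's own average. The sketch below assumes mean-centering as the calibration (the chapter does not prescribe a specific method), and the user names and ratings are made up for illustration.

```python
def calibrate_ratings(ratings_by_user):
    """Convert each raw rating into a deviation from that user's mean rating."""
    calibrated = {}
    for user, ratings in ratings_by_user.items():
        mean = sum(ratings.values()) / len(ratings)
        calibrated[user] = {item: r - mean for item, r in ratings.items()}
    return calibrated

# Hypothetical ratings: one critical and one generous customer.
ratings = {
    "critical_user": {"p1": 1, "p2": 2, "p3": 3, "p4": 4},
    "generous_user": {"p1": 3, "p2": 5, "p3": 5, "p4": 4},
}
calibrated = calibrate_ratings(ratings)
```

Both customers gave product `p4` a raw rating of 4, but after calibration it is well above average for the critical customer and slightly below average for the generous one.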
Creating new features: Additional features beyond basic transformations, nor-
malizations, ratios, and bias corrections are also needed in many domains. Consider,
for example, four features in a credit card fraud prevention problem: location and
time of the last and the current transaction. These four features by themselves put
into a logistic regression model might not be able to predict whether the current
transaction is fraud or not. But the common-sense domain knowledge that “there
should be sufficient time between two distant transactions” can be used to translate
these four features into, say, a velocity feature, that is, the ratio of the distance
between the current and previous transactions to the time between them. This
single “semantic” feature can help predict fraud. In the speech and vision domains,
biologically motivated semantic features are extracted from raw signals.
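The velocity feature for the fraud example can be sketched as follows. This is an illustrative implementation under stated assumptions: locations are (latitude, longitude) pairs in degrees, times are in hours, distance is the great-circle (haversine) distance, and all function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def velocity_kmph(prev_loc, prev_time_h, curr_loc, curr_time_h):
    """Implied travel speed between the previous and current transaction."""
    distance = haversine_km(*prev_loc, *curr_loc)
    elapsed = curr_time_h - prev_time_h
    if elapsed <= 0:
        # Two distant transactions at the same instant imply infinite speed.
        return math.inf if distance > 0 else 0.0
    return distance / elapsed
```

A very high implied speed (for example, two card swipes in different countries a few minutes apart) is the kind of single semantic signal that a linear model could not derive from the four raw features on its own.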
Defining output variable: In some of the problems the prediction variable might
be very obvious (click through rate in search, spam vs. not-spam in Web page or
e-mail classification, land-cover type in remote sensing, etc.). However, in many
other domains, we might have to first define the output variable itself. For example,
in churn prediction, we might have to define churn in terms of future user behavior
(e.g., did not make any purchase in the next 3 months). In credit modelling, we might
define a high-risk customer as someone who missed his last three EMIs in a row. In
such problems where future is to be predicted based on current and past observation,
defining the future output to be predicted becomes very critical.
Setting the right defaults: A default value is typically associated with a feature
if no meaningful value can be assigned. For numeric features, often such default
values are zero. Assuming such defaults or not setting them thoughtfully is one of
the most common “bugs” in modelling. Consider a feature called first-occurrence
of a query word in a document field. The earlier the word occurs in the field, the
better—so lower the value of first-occurrence, the better. Now if in a field no query
word is present, what should be the default value of this feature? If we pick a default
value of 0, then it will confuse the model because both the best case (when the
query word is the first word (at position 0) of the field) and the worst case (where
the query word is not present in the field at all) take the same feature value. A
better default might be the length of the field plus a constant or a high number. It
is essential to deliberate over the default values of all features to make sure that the
default value in conjunction with the regular values is “consistent” with the goals
of the modelling.
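The first-occurrence example can be made concrete with a short sketch. The function name and the `missing_penalty` constant are assumptions chosen for illustration; the point is only that the "not found" value is worse than any value a real match can produce.

```python
def first_occurrence(query_words, field_words, missing_penalty=10):
    """Position of the earliest query word in the field (0 = first word).

    If no query word occurs, return the field length plus a constant, so the
    default is always larger (worse) than any genuine position.
    """
    query = set(query_words)
    for position, word in enumerate(field_words):
        if word in query:
            return position
    return len(field_words) + missing_penalty

title = ["cheap", "tickets", "online"]
best_case = first_occurrence(["cheap", "flights"], title)  # match at position 0
worst_case = first_occurrence(["hotel"], title)            # no match at all
```

With a default of 0 the two cases would be indistinguishable; with this default the model sees 0 for the best case and 13 for the missing case.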
Imputing missing features: One of the realities of real-world data science is
the absence of features in the collected data. This happens either because the data
was never collected for a period of time and plugins to collect a feature were added
later, or the sensor was down for a while, or there are data corruption issues. In these
cases, either we use one of the many feature imputation techniques or use modelling
techniques (such as decision trees and their variants) that gracefully handle missing
features. Again substituting the wrong defaults or simple average value of a feature
may not always work.
Feature selection: Once a large number of features have been engineered, we
might decide not to use all of them together in the same model because some of
them might be highly correlated with each other. Feature selection methods can
be model agnostic (aka filter methods) or model centric (aka wrapper methods).
In a model-agnostic approach, features are sorted by some measure of “goodness,”
which is computed based on their discriminative power (e.g., Fisher discriminant)
and nonredundancy with other features. The best features are then chosen to build
the models. Filter methods are used when we have a large number of features
(say tens of thousands) and it is not clear which modelling technique we want to
use. In model-centric feature selection, features are added one at a time (forward
feature selection) or removed one at a time (backward feature selection) in a greedy
manner to maximally increase the model performance (e.g., accuracy). Being model
centric, every time a set of features is evaluated, the model has to be trained
and evaluated. This makes model-centric feature selection potentially very time-
consuming. Feature selection is a classic NP-hard “subset selection problem”
where we know how to compute the “goodness” of a “set” of features but there
is no known simple (polynomial-time) algorithm to find an optimal set for a given dataset
and modelling technique. Many other techniques such as genetic algorithms and
simulated annealing have also been explored for feature selection.
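Greedy forward feature selection, as described above, can be sketched as follows. The `score` argument stands in for "train a model on this feature subset and return its validation performance"; the `toy_score` function below is a made-up stand-in so the example runs without any modelling library.

```python
def forward_select(features, score):
    """Greedily add the feature that most improves score(subset); stop when none helps."""
    selected = []
    best_score = score(selected)
    remaining = list(features)
    while remaining:
        candidate, candidate_score = None, best_score
        for f in remaining:
            s = score(selected + [f])  # in practice: retrain and evaluate a model
            if s > candidate_score:
                candidate, candidate_score = f, s
        if candidate is None:
            break  # no remaining feature improves the score
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score

def toy_score(subset):
    """Hypothetical score: features 'a' and 'c' help, every feature has a small cost."""
    useful = {"a", "c"}
    return len(useful & set(subset)) - 0.1 * len(subset)

chosen, final_score = forward_select(["a", "b", "c"], toy_score)
```

Each round calls `score` once per remaining feature, which is why wrapper methods become expensive when every call means training a model. Being greedy, the procedure is not guaranteed to find the optimal subset.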
Over the last several decades, the field of machine learning has given birth to a very
large number of modelling techniques—some of which are described in the next
section. Each technique has its own pros and cons and was developed to specifically
address a set of weaknesses in other modelling techniques or “reformulate” the
classification problem differently. In this section, we will explore the common
guiding principles typically used for choosing the right modelling technique and
using the output of these models correctly to solve the business problems.
Interpretability vs. accuracy: In a number of business problems, it is more
important to interpret the output of the prediction model (i.e., give a reason for
why the score is high or low) and not just to be accurate at it. For example, credit
models are legally required to give top three reasons why a user has been denied
a loan. Similarly, in churn prediction models, it might be useful not just to know
that a certain customer is about to churn but also the reason why the customer is
about to churn. This “reason code” can help address those reasons specifically for
each customer. In such cases, it is better to use modelling techniques that are more
interpretable and can generate a reason code along with a prediction score for each
input. In cases where accuracy is more important than interpretability, another class
of modelling techniques is preferred.
Scoring time vs. training time: Most models are deployed in high-throughput
environments. For example, a search engine must be able to generate the top ten
matches within half a second; a credit card fraud model must approve or disapprove
each transaction within a second; in autonomous vehicles, the car must respond
to the environment in real time; and in taxi hailing services, a cab must be allocated
within a few seconds of a request. In all such cases the scoring throughput of the
model must be high. While part of this is an engineering problem, part of it is also a
data science problem where the right modelling technique makes all the difference.
Similarly, the training time of a model might also matter when the model has to be
updated frequently to compensate for real-time inputs from the data. ETA prediction
models in Google Maps, for example, must update their predictions about expected
arrival times in real time as new data is fed into the model continuously. Traffic
routing models must respond quickly to the changes in the traffic patterns or network
issues in real time. Modelling techniques that have a high training time might not
be useful here.
Matching data complexity with model complexity: Once the modelling tech-
nique is chosen, one of the fine arts in data science is to pick the right complexity
of the model. In other words, we must match the complexity of the model with the
complexity of the data itself. If a more complex model is chosen, it might memorize
the training data and may not generalize well to the unseen data. If a less complex
model is chosen, it might not be able to capture the essential causal structure in the
data. This principle of picking the right complexity of model is known by many
names: bias–variance trade-off, signal-to-noise ratio, or Occam’s razor. In essence,
the model needs to be just complex enough and not any more. In practice, the right
model complexity for a given labelled dataset and modelling technique is arrived at
as follows: We start with a simple model and increase its complexity slowly while
measuring the training and test set accuracy, that is, how well it does on the data
that was used to build the model and how well it does on the unseen data. As model
complexity increases, both the training and test accuracies initially go up. But beyond
the point of peak generalization, the test accuracy starts falling while the training
accuracy keeps rising, as the model starts to learn the noise in the training data. The
point of peak test accuracy is a good indication of the right
model complexity, as shown in Fig. 16.5. Each modelling technique comes with a
set of “knobs” to increase its model complexity.
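The procedure above amounts to sweeping a complexity knob and keeping the setting with the best test accuracy. The sketch below is hypothetical: `evaluate` stands in for "train at this complexity and return (training accuracy, test accuracy)", and the accuracy numbers in `made_up_curve` are invented purely to illustrate the shape of the curve, not measured results.

```python
def pick_complexity(levels, evaluate, patience=1):
    """Return the complexity level with the best test accuracy.

    Stops early once test accuracy has failed to improve for more than
    `patience` consecutive levels, i.e., past the point of peak generalization.
    """
    best_level, best_test = None, float("-inf")
    worse_in_a_row = 0
    for level in levels:
        _train_acc, test_acc = evaluate(level)
        if test_acc > best_test:
            best_level, best_test = level, test_acc
            worse_in_a_row = 0
        else:
            worse_in_a_row += 1
            if worse_in_a_row > patience:
                break
    return best_level

# Invented (train accuracy, test accuracy) pairs per complexity level.
made_up_curve = {
    1: (0.70, 0.68),
    2: (0.80, 0.77),
    3: (0.88, 0.82),
    4: (0.94, 0.80),
    5: (0.99, 0.74),
}
chosen_level = pick_complexity(sorted(made_up_curve), made_up_curve.get)
```

Here training accuracy keeps rising with complexity while test accuracy peaks at level 3, so level 3 is selected. In a real project the "knob" might be tree depth, the number of neighbors, or a regularization strength.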
From predictions to decisions: The output of a model is typically a score—for
example, the credit score, the fraud score, or the predicted demand in a forecasting
model. Machine learning stops where this score is generated. Data science starts
where this score is now used to make decisions. Often a number of business
constraints and metrics determine how the score should be used. For example, a
bank with a higher risk appetite might give loans at a lower score than another. In
recommendation engines, for example, we might not just recommend the product
with the highest recommendation score but might decide to recommend products
that are also highly connected to other products for increasing cross-sell beyond just
the current recommendation. Decisions are often made with conflicting business
metrics in mind, and the model prediction outputs serve as key inputs to the overall
business logic, which solves a complex multiobjective optimization problem to actually
make the final decision.
Feedback and continuous learning: Once the model is deployed, feedback is
collected on how well it is doing. This feedback is critical for monitoring model
performance and continuously updating the models. For example, search engines
continuously update their models based on real-time click feedback data by moving
the search results up or down based on whether they are getting higher or lower than
expected clicks for the result at that position. This feedback data is the real goldmine
in any modelling exercise. It is the cheapest and most consistent source of “ground
truth” that is very critical for building supervised learning models. This feedback
also comes in implicit form. For example, if the model predicted that a customer
is about to churn but he/she did not or vice versa then such implicit feedback can
be used to continuously improve the models. Modelling is therefore never a one-
time exercise. Using this feedback data to automatically and periodically update
the model really completes the “continuous learning” loop in real-time, large-
throughput systems that evolve as the business processes, customer behavior, and
environment evolve.
If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck.
In the rest of the chapter, we will focus primarily on the classification paradigm.
We will assume that the raw data has already been transformed into a meaningful
feature space as discussed above.
Definition of a classifier: Essentially, a classifier partitions the feature space into
pure regions. A region is considered pure if most of the points in that region belong
to the same class. There are two ways of characterizing pure regions: Either we learn
to “describe” each class (the descriptive classifiers) or we learn to “discriminate”
between the classes (the discriminative classifiers). Figure 16.6 shows how a
descriptive vs. a discriminative classifier approaches the same two-class problem.
We seem to be using both classifiers: as we discover new objects in the world
and see one or more examples of them, we build a descriptive classifier that learns the
essence of the class. But when we are confused between two classes (e.g., “dog”
vs. “goat,” letter “o” vs. “c”), that is, their descriptions “overlap” quite a bit, then
Fig. 16.6 A descriptive classifier learns the shape of each class. A discriminative classifier tries to
find the decision boundaries between the classes
As data started to become more and more abundant and the rule-based systems
started to become harder and harder to manage and use, a new opportunity of
learning rules from data emerged. This led to decision trees, one of the earliest
machine learning algorithms. Decision trees combined the
interpretability of rule-based classifiers with the learnability of data-driven systems that
do not need humans to handcraft the rules, enabling the discovery of interactions
among features that are far too complex for a human to encode.
Decision Trees Classifier
One of the earliest use cases of machine learning was to learn rules directly from
data, adapt the rules as data changes, and enable us to even quantify the goodness of
the rules given a dataset. Decision trees are an early attempt to learn rules from data.
Decision trees follow a simple recursive process of greedily partitioning the feature
space, one level at a time, discovering pure regions. A region is a part of the feature
space represented by a node in the decision tree. The root-node represents the entire
feature space.
Purity of a region: In a classification problem, a region is considered “pure” if it
contains points only from one class and “impure” if it contains almost equal number
of examples from each class. There are several measures of purity that have been
used in various decision tree algorithms. Consider a region in the feature space that
contains nc points from class c ∈ {1, 2, . . . , C} for a C class classification problem.
The class distribution $p = \{p_c\}_{c=1}^{C}$ is given by:
$$p_c = \frac{n_c + \lambda}{\sum_{c'=1}^{C} n_{c'} + C\lambda}$$
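The class distribution formula above can be computed directly. In this sketch the function name is hypothetical and λ is read as an additive smoothing constant (with λ > 0, no class gets probability exactly zero, even in a region with very few points); that reading follows from the formula itself rather than from an explicit statement in the text.

```python
def class_distribution(counts, lam=1.0):
    """Smoothed class distribution p_c = (n_c + lam) / (sum of n_c' + C * lam).

    counts: dict mapping every class label (including zero-count ones) to n_c.
    """
    num_classes = len(counts)
    total = sum(counts.values()) + num_classes * lam
    if total == 0:
        raise ValueError("empty region with lam = 0 has no defined distribution")
    return {c: (n + lam) / total for c, n in counts.items()}

p = class_distribution({"red": 3, "green": 1}, lam=1.0)
p_empty = class_distribution({"red": 0, "green": 0}, lam=1.0)
```

For the region with 3 red and 1 green points, λ = 1 gives p(red) = 4/6 and p(green) = 2/6; an empty region falls back to the uniform distribution.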
Gain in purity: A decision tree recursively partitions the entire feature space
into pure subregions using a greedy approach. At each node (starting from the root
node), it finds the best feature with which to partition the region into subregions.
The “best” feature is the one that maximizes the gain in purity of the subregions
resulting from that feature.
Let us say node m is partitioned using feature f (e.g., COLOR) into $K_{m,f}$
children nodes (e.g., RED, GREEN, BLUE): $R^m_{f,1}, R^m_{f,2}, \ldots, R^m_{f,K_{m,f}}$. Let
$\mathrm{Purity}(R^m_{f,k})$ be the purity of subregion $R^m_{f,k}$ and $p(R^m_{f,k})$ be the fraction of
data at m that goes to the subregion $R^m_{f,k}$. Then the purity gain due to feature f at node
m is:
$$\mathrm{PurityGain}_m(f) = \sum_{k=1}^{K_{m,f}} p\left(R^m_{f,k}\right) \times \mathrm{Purity}\left(R^m_{f,k}\right) - \mathrm{Purity}(m)$$
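The purity gain can be sketched in Python as follows. The purity measure used here, the sum of squared class fractions (that is, 1 minus the Gini impurity), is an assumption made for illustration; any purity measure can be plugged in. The data, feature names, and function names are hypothetical.

```python
from collections import Counter

def purity(labels):
    """Assumed purity measure: sum of squared class fractions (1 - Gini impurity)."""
    n = len(labels)
    return sum((count / n) ** 2 for count in Counter(labels).values())

def purity_gain(rows, labels, feature):
    """Weighted purity of the child regions minus the purity of the parent node."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    n = len(labels)
    weighted = sum(len(g) / n * purity(g) for g in groups.values())
    return weighted - purity(labels)

def best_feature(rows, labels, features):
    """Greedy choice at a node: the feature with the highest purity gain."""
    return max(features, key=lambda f: purity_gain(rows, labels, f))

rows = [
    {"COLOR": "RED", "SIZE": "S"},
    {"COLOR": "RED", "SIZE": "L"},
    {"COLOR": "BLUE", "SIZE": "S"},
    {"COLOR": "BLUE", "SIZE": "L"},
]
labels = ["yes", "yes", "no", "no"]
```

On this toy data, splitting on COLOR separates the classes perfectly (gain 0.5), while splitting on SIZE leaves each child as mixed as the parent (gain 0), so COLOR is chosen.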
Decision tree algorithm: Decision tree recursively partitions each region into
subregions by picking that feature at each node that yields the maximum purity
gain. Figure 16.8 shows a decision tree learned on a dataset with five variables {A, B,
C, D, E}. Let us say variable A takes two possible values {A1, A2}, variable B
takes two values {B1, B2}, C takes two values {C1, C2}, D takes two values {D1,
D2}, and E takes two values {E1, E2}. At the root node, the decision tree algorithm
tries all the five variables and picks the one (in this case variable B) that gives the
highest purity gain. The entire region is now partitioned into two parts: B = B1, and
B = B2. Now that variable B has already been used, the remaining four variables are
considered at each of these nodes. In this example, it turns out that under B = B2,
variable A is the best choice; under node A = A2, variable C is the best choice; and
under C = C1, variable E is the best choice among all the other choices within those
regions. Variable D does not increase purity at any node.
The sample data “Decision_Tree_Ex.csv” and R code “Decision_Tree_Ex.R” are
available on the website.
A leaf node at any time in the growing process is considered for growing further if
(1) its depth (distance from the root node) is less than a depth-threshold, (2) its purity
is less than a purity-threshold, and (3) its size (number of data points) is more than a
size-threshold. These thresholds (Depth, Purity, and Size) control the complexity or
size of the decision tree. Different values of these thresholds might yield a different
tree for the same dataset, but the trees will look the same from the root node downward,
differing only in where growth stops.
Sometimes, a tree is overgrown and pruned to a smaller tree as needed.
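The three growth conditions can be written as a single check. This is a minimal sketch; the function name and the default threshold values are hypothetical and would be tuned per dataset.

```python
def should_grow(depth, purity, size, max_depth=5, min_purity=0.95, min_size=20):
    """A leaf is split further only if all three conditions hold:
    it is not too deep, not yet pure enough, and still has enough data."""
    return depth < max_depth and purity < min_purity and size > min_size

shallow_impure_leaf = should_grow(depth=2, purity=0.60, size=500)
already_pure_leaf = should_grow(depth=2, purity=0.99, size=500)
tiny_leaf = should_grow(depth=2, purity=0.60, size=5)
```

Raising `max_depth`, raising `min_purity`, or lowering `min_size` lets the tree grow larger, which is how these thresholds act as complexity knobs.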
Decision trees were created to learn rules from data. A Decision Tree model can
be easily written as a collection of highly interpretable rules. For example, the tree
in Fig. 16.8 learns five rules, one for each leaf node. Each rule is essentially an
AND of the path from the root node to the leaf node.
• B = B1 ➔ Class = Green
• B = B2 and A = A1 ➔ Class = Green
• B = B2 and A = A2 and C = C2 ➔ Class = Red
• B = B2 and A = A2 and C = C1 and E = E1 ➔ Class = Green
• B = B2 and A = A2 and C = C1 and E = E2 ➔ Class = Red
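The five rules above translate directly into nested conditions. The sketch below encodes exactly those rules; the function name and the dictionary representation of an example are assumptions made for illustration.

```python
def classify(example):
    """Apply the five rules read off the tree; each return is one leaf node."""
    if example["B"] == "B1":
        return "Green"
    # From here on B = B2.
    if example["A"] == "A1":
        return "Green"
    # From here on A = A2.
    if example["C"] == "C2":
        return "Red"
    # From here on C = C1; the last split is on E.
    return "Green" if example["E"] == "E1" else "Red"

example = {"A": "A2", "B": "B2", "C": "C1", "D": "D1", "E": "E2"}
predicted = classify(example)
```

Scoring touches at most four conditions, one per level of the tree, and variable D is never consulted because it was not selected at any node.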
Apart from interpretability, decision trees are also deterministic (they
generate the same tree given the same data) thanks to their greedy nature. This
is essential for robustness, stability, and repeatability. The scoring throughput of
decision trees is high: they just have to evaluate at most D conditions, where D is
the depth of the tree. Apart from this, decision trees are also known to handle a
combination of numeric and categorical features together. Numeric features at any
node are partitioned into two by rules like Age < 25. Finally, decision trees handle
missing data gracefully. They either ignore the missing features (so when a feature is
missing the training data is ignored for that feature’s purity computation) or assume
the most likely value of that feature at that node (fine-grained imputation).
One of the key criticisms of decision trees is that they are not guaranteed to yield
an optimal partition of the feature space due to their greedy nature. It is possible
that a bad feature chosen early in the tree can lead to a pretty suboptimal sub-
tree below that as there is no mechanism of “backtracking” and correcting for a
bad-greedy choice made earlier. This is a classic example of the fundamental trade-
16 Machine Learning (Supervised) 527
off in AI between optimality and speed. Decision trees were originally designed
for categorical features only. They handle numeric features using thresholding
type rules—for example, if (temperature <100 degrees). This often limits them
to partitioning the numeric subspace only along the numeric axes. If the required
decision boundary is oblique, then decision trees end up learning staircase functions
that could lead to very large trees. This can be addressed by learning logistic
regression models at each internal node using all the numeric variables available at
that node and using a threshold on that logistic model to partition it into two parts.
This is a classic example of overcoming model limitations by combining them with
other techniques.
Decision tree classifiers have evolved over the last few decades. Ensemble
versions of decision trees, including Random Forest and XGBoost, are commonly
used for complex supervised learning problems. You can read more about decision
trees in Chap. 3 of “Machine Learning” by Carbonell et al. (1983) or Chap. 8 of
“Data Mining: Concepts and Techniques” by Han et al. (2011).
Often when we have to make important decisions, we take the advice of our near
ones. We are more influenced by the opinions (about products, movies, restaurant,
politics, etc.) of our friends, family, social circle, etc. than those of strangers. This
principle that nearby things have a higher influence than far-off things is the essence
of a whole family of algorithms starting with k-nearest neighbor (k-NN) classifier.
In k-NN, the training data is stored as is and there is no modelling. Hence, this
is an example of a nonparametric classifier. During the scoring phase, first the k-
nearest neighbors in the training set (previously seen and labelled examples) are
sought. Then the new example is assigned the majority class among these k-nearest
neighbors as shown in Fig. 16.9. Suppose the training data for the two classes, blue
triangle and red star, is stored as is. The new example (the green square) is classified as a
blue triangle if k = 5 is chosen, because blue triangle is the majority class among its
five nearest neighbors. For k = 10, however, it is classified as a red star.
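The scoring step of k-NN can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and a simple majority vote; the function name `knn_predict` and the toy coordinates are made up for this example and are not the data of Fig. 16.9.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """train: list of (point, label) pairs; query: a point (tuple of floats)."""
    # Sort the stored training examples by distance to the query, keep the k nearest.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy data: three "blue" points close to the query, seven "red" points farther away.
train = [((1, 0), "blue"), ((0, 1), "blue"), ((-1, 0), "blue"),
         ((2, 0), "red"), ((0, 2), "red"), ((-2, 0), "red"), ((0, -2), "red"),
         ((3, 0), "red"), ((0, 3), "red"), ((-3, 0), "red")]

print(knn_predict(train, (0, 0), k=5))
print(knn_predict(train, (0, 0), k=10))
```

With this toy data the predicted class changes between k = 5 and k = 10, which mirrors the sensitivity to k described above.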
$$P(x \mid c) \propto \frac{1}{N_c} \sum_{n=1}^{N} \delta(c = c_n)\, K_\sigma(x - x_n)$$
The key to PW classifier is the density function P(x| c) that essentially quantifies
whether the point x “looks like” previously seen points that belong to class c. PW
aggregates the influence of all training points in class c on x to estimate P(x| c).
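The Parzen window estimate can be sketched as follows. This is a minimal illustration, assuming an unnormalized Gaussian kernel of width `sigma` and equal class priors; the function names are illustrative only.

```python
import math


def parzen_density(train, x, c, sigma=1.0):
    """Average Gaussian-kernel influence on x of all training points of class c."""
    points = [p for p, label in train if label == c]
    total = sum(math.exp(-math.dist(p, x) ** 2 / (2 * sigma ** 2)) for p in points)
    return total / len(points)


def parzen_predict(train, x, sigma=1.0):
    """Assign x to the class whose training points it 'looks like' the most."""
    classes = {label for _, label in train}
    return max(classes, key=lambda c: parzen_density(train, x, c, sigma))


train = [((0.0, 0.0), "blue"), ((1.0, 0.0), "blue"),
         ((5.0, 5.0), "red"), ((6.0, 5.0), "red")]
print(parzen_predict(train, (0.5, 0.0)))
```

Unlike k-NN, every training point of a class contributes to the score, with nearby points contributing more than far-off ones.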
But as humans, we do not classify by first remembering all previous examples of
each class and comparing a new example with them. We, on the other hand, build a
“representation” of each class by summarizing or describing the essence of all the
data per class into a class “model.”
In Bayesian Classifiers, each class c is modelled by (a) its class prior P(c) that
quantifies the probability that an unseen data point would belong to class c and
(b) the class conditional density function P(x| c) that quantifies the probability of
having seen “such” a data point from class c in the training data. Unlike in PW, in
Bayesian Classifiers, P(x| c) is modelled (and not just computed) using a parametric
density function that takes into account the nature of the data (e.g., multivariate, text,
speech) as well as the parametric form used to model it (e.g., Normal distribution).
The class prior and class conditional density functions are learnt from the training
data. They are then used to compute the class posteriori probability P(c| x) over
all the classes c for a new data point x. This is done by using one of the most
celebrated relationships in statistics and probability theory—a relationship between
cause and effect, between learning and scoring, between past observations and
future predictions, and between data and knowledge—the Bayes Theorem:
$$P(c \mid x) = \frac{P(x \mid c)\, P(c)}{\sum_{c'} P(x \mid c')\, P(c')}$$
Unimodal Bayesian classifier and PW are two extreme ways of estimating the
same statistic: P(x| c). In nonparametric PW each training data point is associated
with a Gaussian kernel of a certain width around it. In the parametric unimodal
Bayesian classifier, all the data points associated with a class are “described” using
a single Gaussian—in terms of its mean and covariance parameters.
Multimodal Bayesian classifier (MBC): Clearly there is a continuum of
complexity from PW—that uses one Gaussian per data point—a potential overkill
to UBC—that uses one Gaussian per class—which might not be sufficient to
describe the class. In many domains, a class might be multimodal, that is, it
might have subclasses. For example, the same word might have two very different
pronunciations in different accents. An object in an image might look very different under
different pose, illumination, and scale. A letter in OCR might have different fonts
and emphases (bold, italics, etc.). In handwritten digits, for example, people write
a digit “7” or “1” or “9” in different ways. In all such cases, the entire class cannot
be modelled as a single Gaussian but as a mixture-of-Gaussians (MoG), that is,
two or more Gaussians—one representing each subclass. Figure 16.10 shows a two-
class problem data where using one Gaussian per class (left) might yield a low
accuracy classifier (under trained) but using three Gaussians per class might be the
right level of complexity for this dataset. In general, MoG is a generic and powerful
way of modelling arbitrarily complex density functions to match complexity of data
(number of subclasses per class). An MoG is written as:
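$$P(x \mid c) = \sum_{k=1}^{M_c} \pi_k^{(c)}\, \mathcal{N}\!\left(x \mid \mu_k^{(c)}, \Sigma_k^{(c)}\right)$$
where:
• $M_c$ is the number of mixture components (subclasses) of class c.
• $\pi_k^{(c)}$ is the prior proportion of subclass k in class c.
• $\mu_k^{(c)}$ is the mean of subclass k of class c.
• $\Sigma_k^{(c)}$ is the covariance of subclass k of class c.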
These parameters are learnt using the EM algorithm using the data from each
class independently. The number of mixture components for each class can vary
depending on the number of subclasses it might have. The EM algorithm for
learning MoG is described in the unsupervised learning chapter. The unimodal
Fig. 16.10 Using 3 instead of 1 Gaussian per class: matching data complexity with model
complexity
$$P(x \mid c) = \prod_{d=1}^{D} P(x_d \mid c)$$
very large compared to the size of the labelled corpus. A document is represented
as a bag-of-words where each document x is represented by the number of times
each word wd occurs in the document—the term frequency: tf(wd | x). In the training
phase, class conditional probabilities of each word are computed from labelled data
as follows:
$$P(w_d \mid c) = \frac{n(w_d, c) + \lambda}{\sum_{w} n(w, c) + D\lambda}$$
Here $n(w, c) = \sum_{x \in c} \mathrm{tf}(w \mid x)$ is the number of times word w occurs in class c,
D is the total number of words in the dictionary, and λ is the Laplacian smoothing
constant used to make sure that none of the P(wd | c) becomes 0.
These estimates can be used to compute P(x| c). A new document x is classified
by first computing its class conditional probability density. But since the document
lies in a high-dimensional (number of unique words after preprocessing) sparse
space (each document only contains a very small fraction of the total words in the
dictionary), it is not possible to model the density in the joint space. We therefore
make a naïve assumption that all words are independent given the document belongs
to a certain class:
$$P(x \mid c) = \prod_{d=1}^{D} P(w_d \mid c)^{\mathrm{tf}(w_d \mid x)}.$$
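The smoothed word probabilities and the naive product over words can be sketched as below. This is a minimal illustration on a made-up four-word dictionary; the function names, the toy documents, and the choice of λ = 1 are assumptions for this example. Logarithms are used so that the product of many small probabilities does not underflow.

```python
import math
from collections import Counter


def train_word_probs(docs, vocab, lam=1.0):
    """docs: list of (list_of_words, class). Returns {class: {word: P(word | class)}}."""
    counts = {}
    for words, c in docs:
        counts.setdefault(c, Counter()).update(w for w in words if w in vocab)
    probs = {}
    for c, n in counts.items():
        denom = sum(n.values()) + len(vocab) * lam  # sum_w n(w, c) + D * lambda
        probs[c] = {w: (n[w] + lam) / denom for w in vocab}
    return probs


def log_likelihood(words, word_probs):
    """ln P(x | c) = sum_d tf(w_d | x) * ln P(w_d | c) under the naive assumption."""
    tf = Counter(w for w in words if w in word_probs)
    return sum(count * math.log(word_probs[w]) for w, count in tf.items())


vocab = {"good", "great", "bad", "awful"}
docs = [(["good", "great", "good"], "pos"), (["bad", "awful"], "neg")]
probs = train_word_probs(docs, vocab)
print(log_likelihood(["good", "good"], probs["pos"]) > log_likelihood(["good", "good"], probs["neg"]))
```

A full classifier would add the log prior ln P(c) to each log-likelihood before picking the class with the highest score.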
The decision boundary between two classes (assuming a two-class problem for
simplicity) is the locus of all points x where the two posterior probabilities are the same,
that is, where the two regions intersect or where the points cannot be classified in
one class or the other: P(c1 | x) = P(c2 | x). The discriminant classifiers assign a data
point to the class with the maximum discriminant value: $c^*(x) = \arg\max_c \{g_c(x)\}$. The
decision boundary can be derived by solving: g1 (x) = g2 (x) or g1 (x) − g2 (x) = 0.
Linear discriminant analysis (LDA): In a two-class problem, if we make the
assumption that the covariances (shapes of the Gaussians) of the two classes are
the same, that is, $\Sigma_1 = \Sigma_2 = \Sigma$, then the decision boundary is given by: ln
P(x| c1 ) + ln P(c1 ) − ln P(x| c2 ) − ln P(c2 ) = 0, which when simplified leads
to a linear decision boundary $w^T x + w_0 = 0$, where:
$$w = \Sigma^{-1}(\mu_1 - \mu_2) \quad \text{and} \quad w_0 = \frac{1}{2}(\mu_2 + \mu_1)^T \Sigma^{-1} (\mu_2 - \mu_1) + \ln \frac{P(c_1)}{P(c_2)}$$
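The LDA boundary can be computed directly from the class statistics, without any iterative learning. The sketch below is a minimal two-dimensional illustration using only the standard library; the helper names and the toy means are assumptions for this example. The constant term is obtained by expanding the boundary condition ln P(x| c1 ) + ln P(c1 ) − ln P(x| c2 ) − ln P(c2 ) = 0 for two Gaussians with a shared covariance.

```python
import math


def inv2(m):
    """Inverse of a 2 x 2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def lda_boundary(mu1, mu2, cov, p1=0.5, p2=0.5):
    """Return (w, w0) of the boundary w.x + w0 = 0; positive scores favor class 1."""
    cov_inv = inv2(cov)
    w = matvec(cov_inv, [a - b for a, b in zip(mu1, mu2)])
    # w0 = 1/2 (mu2' S^-1 mu2 - mu1' S^-1 mu1) + ln(P(c1) / P(c2))
    w0 = 0.5 * (dot(mu2, matvec(cov_inv, mu2)) - dot(mu1, matvec(cov_inv, mu1)))
    w0 += math.log(p1 / p2)
    return w, w0


w, w0 = lda_boundary([2.0, 0.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(w, w0)
```

With equal priors and an identity covariance, the boundary is the perpendicular bisector of the segment joining the two means (here the line x1 = 1).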
Fig. 16.11 Linear vs. quadratic discriminant analysis (LDA vs. QDA)
LDA and QDA are a bridge between descriptive and discriminative classifiers.
The decision boundary in LDA and QDA is simply computed—in terms of class
prior, mean, and covariance properties—but not learnt. Hence, they are still funda-
mentally descriptive classifiers yet a bridge between the shape and the boundary of
the class.
You may read more about LDA and QDA in Chap. 4 in “Machine Learning—
A Probabilistic Perspective” by Murphy (2012) or Chap. 4 in “The Elements of
Statistical Learning” by Friedman et al. (2001).
Perceptron: One of the earliest discriminative classifiers is a perceptron—a
simple, biologically inspired, functional model of what we believe the neuron in
the brain does. Our brain contains billions of neurons, each connected to thousands
of other neurons both laterally (within the same layer) and hierarchically (across
layers). Each neuron does, more or less, functionally the same thing—it aggregates
the inputs received from incoming neurons (connected to its dendrites), attenuates
the aggregated activation, and makes it available at its axons to pass on to its
“children” neurons. While a neuron sitting at a lower layer (e.g., on the retina
of the eye) might take raw pixel-level inputs and combine them to detect lines, a
face-detecting neuron much higher up in the visual cortex hierarchy might
take inputs from eye-detecting, nose-detecting, and mouth-detecting
neurons and predict whether it is “seeing” a face. The simplicity of
each neuron combined with the complexity with which they are arranged and work
together makes the brain one of the most mysterious and powerful masterpieces of
evolution. This also forms the basis of the modern deep learning paradigms that use
a variety of neurons and deep layered architecture to replicate some of the most
complex human brain capabilities of vision, speech, and text understanding. All of
this complexity starts with the “transistors of the brain”—the neurons (Fig. 16.12).
The basic perceptron algorithm that captures the early essence of a neuron for
a two-class problem is very simple. Here, let P and N be the set of positive and
negative examples, respectively. Let wt be the weights of the perceptron in iteration
t, initialized randomly (imagine the neurons of a newborn baby that has never seen
any data yet but has this powerful infrastructure to learn a hierarchical representation
of the world he/she is about to interact with). Then the perceptron algorithm updates
these weights iteratively by (1) sampling a data point, (2) classifying it into one of
the two classes based on its current weights, (3) determining whether it has classified
it correctly or not given the class label associated with the data point, and (4) updating
its weights if it made a mistake in the classification:
• Sample a data point x ∈ P ∪ N
• If x ∈ P and wt . x ≤ 0 then: wt+1 ← wt +x; t ← t+1
• If x ∈ N and wt . x ≥ 0 then: wt+1 ← wt − x; t ← t+1
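The update rules above can be sketched as follows. This is a minimal illustration with two assumptions that are not in the text: a constant feature 1 is appended to every point so that the bias is learnt as an ordinary weight, and the data is visited in shuffled passes rather than by drawing single samples. The function names and toy points are illustrative.

```python
import random


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_perceptron(positives, negatives, max_passes=1000, seed=0):
    rng = random.Random(seed)
    w = [rng.uniform(-1, 1) for _ in range(len(positives[0]))]  # random initialization
    data = [(x, +1) for x in positives] + [(x, -1) for x in negatives]
    for _ in range(max_passes):
        rng.shuffle(data)
        mistakes = 0
        for x, y in data:
            # x in P with w.x <= 0, or x in N with w.x >= 0: move w toward (or away from) x.
            if y * dot(w, x) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:  # a full pass with no misclassified example
            break
    return w


# Last coordinate is the constant bias feature.
positives = [(2.0, 1.0, 1.0), (3.0, 2.0, 1.0)]
negatives = [(-2.0, -1.0, 1.0), (-1.0, -3.0, 1.0)]
w = train_perceptron(positives, negatives)
print(w)
```

Changing `seed` changes the initial weights and the order of presentation, and will in general give a different separating hyperplane on the same data.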
The perceptron stops when it reaches a maximum number of iterations or, better
yet, “converges” when no more examples are wrongly classified.
Perceptron-based classifiers are mostly useful for two-class problems; they are not
very robust to noise; they learn in an online fashion and are therefore very sensitive to
the order in which the data is presented; and finally they make a hard decision: a
point on the wrong side of the boundary triggers the same full correction no matter
how far from the boundary it lies, which renders them brittle. In this respect the
perceptron is analogous to the k-NN classifier, which is also a hard classifier and
shares some of the same problems.
Logistic Regression: One of the oldest, time-tested, discriminative classifiers in
machine learning is Logistic Regression. On the one hand, it is the softer version of
the perceptron (pretty much like the Parzen Windows is a softer version of the k-NN
and mixture-of-Gaussians is the softer version of k-Means clustering); on the other
hand, it is the nonlinear version of linear regression. It models the log-odds ratio of
the target class vs. the background class as a linear combination of the inputs.
$$\ln \frac{P(Y = 1 \mid x)}{P(Y = 0 \mid x)} = w^T x \;\Rightarrow\; P(Y = 1 \mid x) = \frac{1}{1 + \exp\left(-w^T x\right)}$$
where $w \in \mathbb{R}^{D+1}$ is the set of D + 1 parameters including the constant bias term.
Most machine learning is optimization. Every parametric modelling technique
optimizes an objective function written in terms of the data (or some statistics on
the data) and some parameters. Clustering, for example, minimizes the distance
between a data point and its cluster center, Fisher discriminant maximizes separation
between classes, decision trees try to split a leaf node into the purest possible
subregions, and perceptron tries to minimize misclassification error, etc. Modelling
is essentially the art of formulating and solving an objective function. Sometimes,
the solution is closed form (PCA, Fisher, LDA, QDA, etc.), sometimes it is greedy
(e.g., decision trees), and sometimes it is iterative (e.g., perceptron). The objective
function too can take multiple forms. Sometimes it is a variant of the sum-squared-
error (e.g., K-means clustering), sometimes it is maximizing (log) likelihood of
seeing the data.
Logistic regression is the solution of a maximum log-likelihood objective
function:
$$J(w) = \ln \prod_{n=1}^{N} P(Y = 1 \mid x_n)^{y_n}\, P(Y = 0 \mid x_n)^{1 - y_n}$$
Substituting the logistic function for P(Y = 1| xn ) and optimizing for w yields the
following update rule:
$$w_t \leftarrow w_{t-1} + \eta \sum_{n=1}^{N} \left(y_n - P(Y = 1 \mid x_n)\right) x_n$$
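The batch update rule can be sketched as below. This is a minimal illustration, assuming a fixed learning rate η, a fixed number of steps, zero initial weights, and a constant feature 1 prepended to each input for the bias term; the function names and the toy data are made up for this example.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def fit_logistic(xs, ys, eta=0.05, steps=2000):
    w = [0.0] * len(xs[0])
    for _ in range(steps):
        # Gradient of the log-likelihood: sum_n (y_n - P(Y=1 | x_n)) x_n
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            err = y - sigmoid(dot(w, x))
            for d, xd in enumerate(x):
                grad[d] += err * xd
        w = [wd + eta * gd for wd, gd in zip(w, grad)]
    return w


# First coordinate is the constant bias input; the labels overlap, so the
# maximum-likelihood weights stay finite.
xs = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 3.0), (1.0, 4.0), (1.0, 5.0)]
ys = [0, 0, 1, 0, 1, 1]
w = fit_logistic(xs, ys)
print(sigmoid(dot(w, (1.0, 0.0))), sigmoid(dot(w, (1.0, 5.0))))
```

Each point contributes to the update in proportion to its prediction error, which is what makes logistic regression the "soft" counterpart of the perceptron's all-or-nothing correction.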
The other limitation of Logistic Regression is that it can only be used with
numeric features. When the dataset contains both numeric and non-numeric fea-
tures, we can use a 1-hot-encoding (e.g., if R, G, B are three colors then R can
be represented by a vector (1 0 0), B with (0 1 0) and G with (0 0 1)) to create a
multivariate representation out of these symbolic data. In spite of all the flexibility
In the Activation Forward Propagation (Fig. 16.14, left), the input layers are
activated by the input data z(0) = x; these activations travel to the subsequent layers
to eventually activate the output layer.
Each hidden unit first aggregates the activations of all previous layer neurons and
applies the bias term:
$$a_j^{(\ell+1)} = w_{0j}^{(\ell)} + \sum_{i=1}^{N_\ell} w_{ij}^{(\ell)} z_i^{(\ell)} = \sum_{i=0}^{N_\ell} w_{ij}^{(\ell)} z_i^{(\ell)}$$
(where $z_0^{(\ell)}$, the bias term, is always set to 1).
In other words, the update in the weight $w_{ij}^{(\ell)}$ is proportional to the activation
on its input neuron $z_i^{(\ell)}$ and the error at its output neuron $\delta_j^{(\ell+1)}$. The negative sign
indicates that we are trying to minimize the error. The error $\delta_j^{(L)}$ for the output layer
is simply:
$$\delta_j^{(L)} = \frac{\partial E(W)}{\partial a_j^{(L)}} = \frac{\partial E(W)}{\partial z_j^{(L)}} \frac{\partial z_j^{(L)}}{\partial a_j^{(L)}} = \left(z_j^{(L)} - y_j\right) g'\!\left(a_j^{(L)}\right)$$
The error $\delta_i^{(\ell)}$ for all the other layers is given by:
$$\delta_i^{(\ell)} = \frac{\partial E(W)}{\partial a_i^{(\ell)}} = g'\!\left(a_i^{(\ell)}\right) \sum_{j=1}^{N_{\ell+1}} w_{ij}^{(\ell)}\, \delta_j^{(\ell+1)}$$
In other words, the total error in each of the nodes in the next layer $\delta_j^{(\ell+1)}$
propagates backward in proportion to the weight $w_{ij}^{(\ell)}$ to form the total error
$\delta_i^{(\ell)}$ at this node.
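The forward and backward passes can be sketched for a small fully connected network. This is a minimal illustration, assuming a sigmoid activation g, the squared error E = ½ Σ (z − y)² at the output, and weights stored as `weights[l][i][j]` with row i = 0 holding the bias weights; all names and the toy numbers are assumptions for this example.

```python
import math


def g(a):
    return 1.0 / (1.0 + math.exp(-a))


def g_prime(a):
    s = g(a)
    return s * (1.0 - s)


def forward(weights, x):
    """Return pre-activations a and activations z per layer (z[l][0] = 1 is the bias input)."""
    zs, as_ = [[1.0] + list(x)], []
    for l, W in enumerate(weights):
        z_in = zs[-1]
        a = [sum(W[i][j] * z_in[i] for i in range(len(z_in))) for j in range(len(W[0]))]
        as_.append(a)
        z_out = [g(aj) for aj in a]
        zs.append(z_out if l == len(weights) - 1 else [1.0] + z_out)
    return as_, zs


def error(weights, x, y):
    _, zs = forward(weights, x)
    return 0.5 * sum((z - t) ** 2 for z, t in zip(zs[-1], y))


def backward(weights, x, y):
    """Return dE/dw for every weight, with the same shape as `weights`."""
    as_, zs = forward(weights, x)
    # Output-layer error: (z - y) * g'(a)
    delta = [(z - t) * g_prime(a) for z, t, a in zip(zs[-1], y, as_[-1])]
    grads = [None] * len(weights)
    for l in reversed(range(len(weights))):
        W, z_in = weights[l], zs[l]
        # Gradient = activation at the input neuron times error at the output neuron.
        grads[l] = [[z_in[i] * dj for dj in delta] for i in range(len(z_in))]
        if l > 0:
            # Propagate the error back through the weights (row 0 is the bias, so skip it).
            delta = [g_prime(as_[l - 1][i - 1]) * sum(W[i][j] * delta[j] for j in range(len(delta)))
                     for i in range(1, len(z_in))]
    return grads


weights = [[[0.1, -0.2], [0.3, 0.4], [-0.5, 0.2]],   # 2 inputs (+ bias) -> 2 hidden units
           [[0.2], [-0.3], [0.5]]]                    # 2 hidden (+ bias) -> 1 output
grads = backward(weights, [1.0, 2.0], [1.0])
print(grads[1])
```

A gradient-descent step would then subtract a small multiple of each entry of `grads` from the corresponding weight.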
A wide variety of applications, architectures, and heuristics have been proposed
in the last couple of decades that have made neural networks one of the most
common and powerful machine learning algorithms especially in two cases: (1)
where there is sufficient data to train large networks and (2) where accuracy
is more important than interpretability. Recent advances in deep learning have
carved a special place for neural networks and their variants (recurrent neural
networks (RNN), auto-encoders, convolutional neural networks (CNN), and generative
adversarial networks) in machine learning. In Chap. 17 on deep learning, a few
examples, sample code, and further details are presented. Other interesting books
to learn more about ANN are “Machine Learning—A Probabilistic Perspective” by
Murphy (2012), “The Elements of Statistical Learning” by Friedman et al. (2001),
and “Machine Learning” by Carbonell et al. (1983).
Support Vector Machines: Machine learning is really an art of formulating an
intuition into an objective function. In classification, the fundamental problem is to
find pure regions. So far we have explored a number of classification paradigms
that greedily, iteratively, or hierarchically try to find such pure regions in the feature
space. A good classifier should be both deterministic and robust to data and label
noise. Perceptron, logistic regression, and neural networks are nondeterministic
as they depend on model initialization and choice of hyperparameters governing
model training. Decision tree, k-NN, Parzen windows, on the other hand, are more
deterministic as they yield the same model for a given dataset and hyperparameters
but k-NN and Parzen windows could be sensitive to data noise.
Consider the two-class classification problem shown in Fig. 16.15 (left). A
perceptron trained on this data could converge to any one of an infinite number of
solutions, depending on its initialization, each of which will have 100% accuracy. The fundamental
question that forms the basis of support vector machines classifier is which
hyperplane is “optimal” among the infinite possibilities. From a robustness point
of view, the “best” hyperplane maximizes the width or margin of the linear decision
boundary. This gives a unique solution for this problem (Fig. 16.15, right).
It is easier to understand this using an analogy. Assume that we want
to build a straight road between two villages. Let the labelled data points
$\{(x_n, y_n)\}_{n=1}^{N}$, where $y_n \in \{-1, +1\}$, denote the houses of the two villages (classes).
The goal is to build the widest possible straight road (i.e., maximum margin) that
can be built without destroying any house in either of the two villages. This can
be done by choosing the center and direction of the road in such a way that, as we
increase its width equally on either side and stop as soon as it touches the first house
on either side, the resulting road is as wide as possible. More formally, let
$x^T w + b = 0$ (the solid green line in Fig. 16.15, right) denote the center of the road.
The dotted lines parallel to it on either side denote the boundaries of the road, obtained
by extending the road on either side and stopping as soon as it hits a house (data point).
$$w^T x_n + b \ge +1 \quad \forall n \text{ where } y_n = +1$$
$$w^T x_n + b \le -1 \quad \forall n \text{ where } y_n = -1$$
$$\Rightarrow\; y_n \left(w^T x_n + b\right) \ge 1, \quad \forall n = 1 \ldots N$$
Note that we can use 1 as a threshold on both sides because any scaling factor
can be subsumed in the linear coefficients w and constant term b. Using geometry
(or examining the distance when there is equality), the width of the road is:
$$J(w) = \left| \frac{(-1 - b)}{\|w\|} - \frac{(1 - b)}{\|w\|} \right| = \frac{2}{\|w\|}$$
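This needs to be maximized. To make the overall function well behaved, we
equivalently minimize $\frac{1}{2}\|w\|^2$ subject to the constraints above. Introducing a
Lagrange multiplier $\alpha_n$ for each constraint, we write it as:
$$L_P(w, b) = \frac{1}{2}\|w\|^2 - \sum_{n=1}^{N} \alpha_n \left[ y_n \left(w^T x_n + b\right) - 1 \right]$$
$$= \frac{1}{2}\|w\|^2 - \sum_{n=1}^{N} \alpha_n y_n \left(w^T x_n + b\right) + \sum_{n=1}^{N} \alpha_n$$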
The points on either side of the road on which the margin “hinges” are called
the Support Vectors—these are highlighted in Fig. 16.15 (right). Further SVM
formulation is built on three “SVM Tricks.”
SVM Trick 1—Primal to Dual: The primal objective function above contains
two types of parameters—the original parameters of the hyperplane (w and b)
as well as the Lagrange multipliers α n . Note that hyperplane parameters can be
used to determine the support vectors and similarly knowing the support vectors
can determine the hyperplane parameters. Hence, the two sets of parameters are
complementary to each other and both need not be present in the same objective
function. To clean this up, let us optimize w.r.t. the hyperplane parameters first:
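$$\frac{\partial L_P(w, b)}{\partial w} = w - \sum_{n=1}^{N} \alpha_n y_n x_n = 0 \;\Rightarrow\; w^* = \sum_{n=1}^{N} \alpha_n y_n x_n$$
$$\frac{\partial L_P(w, b)}{\partial b} = -\sum_{n=1}^{N} \alpha_n y_n = 0 \;\Rightarrow\; \sum_{n=1}^{N} \alpha_n y_n = 0$$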
The solution for w shows how knowing the support vectors and the data can
be used to find the hyperplane. The second equation implies that the total penalty
associated with the positive class is the same as the total penalty associated with the
negative class. Substituting both these back into the primal, and simplifying, we get:
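$$L_D(\alpha) = \sum_{n=1}^{N} \alpha_n - \frac{1}{2} \sum_{m=1}^{N} \sum_{n=1}^{N} \alpha_m \alpha_n y_m y_n x_m^T x_n, \quad \text{s.t. } \sum_{n=1}^{N} \alpha_n y_n = 0 \text{ and } \alpha_n \ge 0, \; \forall n$$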
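As a sanity check on the hard-margin dual, consider a hypothetical two-point dataset (chosen here purely for illustration, not taken from the chapter): $x_1 = (1, 0)$ with $y_1 = +1$ and $x_2 = (-1, 0)$ with $y_2 = -1$. The worked steps below assume the standard form of the dual, in which the quadratic term is summed over all pairs (m, n).

$$\begin{aligned}
&\textstyle\sum_n \alpha_n y_n = 0 \;\Rightarrow\; \alpha_1 = \alpha_2 = \alpha \\
&L_D(\alpha) = 2\alpha - \tfrac{1}{2}\,\alpha^2 \left( x_1^T x_1 + x_2^T x_2 - 2\, x_1^T x_2 \right) = 2\alpha - 2\alpha^2 \\
&\frac{dL_D}{d\alpha} = 2 - 4\alpha = 0 \;\Rightarrow\; \alpha^* = \tfrac{1}{2} \\
&w^* = \alpha^* y_1 x_1 + \alpha^* y_2 x_2 = (1, 0), \qquad b^* = y_1 - w^{*T} x_1 = 0 \\
&\text{margin width} = 2 / \lVert w^* \rVert = 2
\end{aligned}$$

The recovered road is centered on the line $x_1 = 0$ and is exactly as wide as the distance between the two points, and both points are support vectors with equal multipliers, as the constraint $\sum_n \alpha_n y_n = 0$ requires.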
Fig. 16.16 The trade-off between bigger margin and violating the constraints on same dataset
same data if no constraints were allowed to be violated we can only build a smaller
margin classifier (left), but if two constraints are allowed to be violated (two houses
were allowed to be broken) then we can build a wider margin classifier. This ability
to trade-off between maximizing the margin and violating some constraints is the
second kernel trick. It is realized by introducing “slack variables” {ξn ≥ 0}N n that
allow a certain slack on each constraint:
T
The primal objective function with slack variables has two additional terms: first,
a cost C associated with the total slack given, and second, a set of terms to ensure that all the
slack variables are non-negative.
$$L_P(w, b) = \frac{1}{2}\|w\|^2 + C \sum_{n} \xi_n - \sum_{n=1}^{N} \alpha_n \left[ y_n \left(w^T x_n + b\right) - 1 + \xi_n \right] - \sum_{n} \mu_n \xi_n$$
Here, ξ n are slack variables and μn are Lagrange multipliers on these slack
variables. Converting this to the dual, however, gives a very elegant variation of
the original dual.
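$$L_D(\alpha) = \sum_{n=1}^{N} \alpha_n - \frac{1}{2} \sum_{m=1}^{N} \sum_{n=1}^{N} \alpha_m \alpha_n y_m y_n x_m^T x_n, \quad \text{s.t. } \sum_{n=1}^{N} \alpha_n y_n = 0 \text{ and } 0 \le \alpha_n \le C, \; \forall n$$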
Note that the only difference now is that earlier the penalty for violating a
constraint had no upper bound (i.e., α n ≥ 0), which means that violating even a
single constraint could result in an infinite cost. Introducing the slack variables
and a cost C on them changes the dual in only one
way: it puts an upper limit on the penalty that any single violation can cause,
that is, 0 ≤ α n ≤ C. This implies that even if a few constraints are violated, the
penalty is at most C for each such violation, and if that leads
to a wider margin, so be it. The cost parameter C controls the complexity of the
SVM classifier. A low value of C allows more constraints to be violated, so a
simpler classifier with a larger margin is learnt, while a high value of C allows
fewer constraints to be violated, so a more complex classifier with a smaller
margin is learnt.
SVM Trick 3—Kernel Functions: Machine learning is the art of matching data
complexity with model complexity. This is accomplished in two ways: Either we use
linear (simple) models with nonlinear (complex) features or nonlinear (complex)
models with linear (simple) features. For example, in logistic regression if the raw
features are used as-is, we are not able to learn the complex decision boundaries and
so we add nonlinear features (via generalized linear models). The third SVM trick
is along the same lines. The original SVM formulation is only for two-class problems
and learns a linear large margin classifier. To build more complex models than
linear, we can introduce nonlinear features and “warp” the space and learn a linear
classifier in the warped space. Note, however, that in SVM the only way data points
are used is through their dot products $x_m^T x_n$. Let us call this the kernel
or similarity between these two data points: K(xm , xn ). The SVM classifier really needs
(only) these pairwise dot products (the Gram matrix) as input. Now if there were a
class of kernels for which it was possible to compute this pairwise dot product
in the transformed space directly, without first having to transform the data
into that space, then we could use these generalized kernels directly. In other words,
let us say
K (x m , x n ) = φ(x m )T φ (x n )
where φ(x) is the nonlinear map that takes the raw input x into a high-dimensional
space. Some of the common kernels used in SVM are:
• Polynomial Kernels: $K^{\mathrm{poly}}_{c,d}(x, x') = \left(x^T x' + c\right)^d$ with hyperparameters c
and d.
• Radial Basis Function Kernels: $K^{\mathrm{rbf}}_{\sigma}(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$
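The claim that a kernel computes a dot product in a transformed space without ever building that space can be checked numerically for a small case. The sketch below assumes two-dimensional inputs and a polynomial kernel of degree d = 2, for which one explicit feature map φ is known; the function names are illustrative.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def poly_kernel(x, y, c=1.0, d=2):
    return (dot(x, y) + c) ** d


def rbf_kernel(x, y, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def phi_quadratic(x, c=1.0):
    """Explicit 6-dimensional feature map whose dot product equals (x.y + c)^2 in 2-D."""
    x1, x2 = x
    return [x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2,
            math.sqrt(2 * c) * x1, math.sqrt(2 * c) * x2, c]


x, y = (1.0, 2.0), (3.0, -1.0)
print(poly_kernel(x, y), dot(phi_quadratic(x), phi_quadratic(y)))
```

The kernel needs one dot product in the original two dimensions, while the explicit route needs six features per point. For the RBF kernel the corresponding feature space is infinite-dimensional, so the kernel route is the only practical one.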
Using such nonlinear kernels to first warp the space into a hypothetical
high-dimensional space, building a linear large margin classifier in that space,
and thereby realizing a nonlinear large margin classifier in the original space
is the third SVM trick. Together, these three tricks make SVMs one of the
most elegant formulations of an intuition into a powerful machine learning
algorithm.
The class label is the sign of the score. Note that this scoring function is similar
to Parzen window scoring except that in Parzen windows all training data points are
used while in SVM, the weighted sum is taken only w.r.t. the support vectors, hence
the time complexity is much lower.
One of the key drawbacks of SVM methods is that their training cost grows at least
quadratically in the number of training data points (as they need the kernel value for
every pair of points), and hence learning an SVM on a larger dataset can take much
longer and can become infeasible. Sampling the data can mitigate this.
In many domains, it is easier and more natural to quantify similarity between two
data points than to represent a data point in a multidimensional space. For example,
quantifying the similarity between two LinkedIn or Facebook profiles, two gene sequences, two
images or words, or two documents is much more natural than representing them in
a multidimensional feature space. In such cases, kernel-based approaches including
SVM, k-NN, and Parzen windows might be more natural to use than traditional
models such as decision trees or logistic regressions.
SVM in particular and Kernel methods in general have been applied in a variety
of applications. Text classification using TFIDF representation was one of the areas
in which they have shown remarkable success. A lot of research has gone into
discovering new kernel functions for specific datasets and extending the SVM
thinking (large margin) to other domains such as regression and outlier detection
as well.
To learn more about SVM, you can refer to Chap. 14 in “Machine Learning—
A Probabilistic Perspective” by Murphy (2012), Chap. 12 in “The Elements of
Statistical Learning” by Friedman et al. (2001), or Chap. 9 of “Data Mining:
Concepts and Techniques” by Han et al. (2011).
Ensemble learning: We have explored two broad approaches to building models:
first, extracting better features (i.e., semantic, hierarchical, domain knowledge
driven, statistics driven) and second, building more complex models (deeper decision trees, deeper
neural networks, nonlinear SVM vs. linear SVM, neural networks vs. logistic
regression, mixture-of-Gaussians vs. single Gaussian per class, etc.). Instead of
building a single increasingly complex model (both feature and model complexity),
the third approach is to divide-and-conquer, that is, break the problem into simpler
subproblems, solve each subproblem independently, and combine their solutions.
This is called ensemble learning, where the models must be different from each
other in some ways while being similar to each other with respect to the nature
of the modelling technique and complexity. In other words, we neither want to
create an ensemble of, say, a neural network and a decision tree—they should all be
improve accuracy of the model by building a sequence of models such that the next
model focuses on the cumulative weakness of the models built so far. In boosting,
the first model gives equal importance to all data points. The second model tries to
focus more on (increase weight) those data points for which the first model does not
do as well. The third model increases the weights of those data points whose error
according to the cumulative first and second model so far is high. Boosting is not
amenable to parallelism as the next model depends on the previous k − 1 models.
Nevertheless, it is one of the most powerful techniques for building ensemble
models. The key to boosting is again a large number of shallow/weak models. One of
the most famous boosting algorithms is XGBoost that applies boosting to decision
trees.
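The reweighting idea behind boosting can be sketched with a tiny AdaBoost-style loop over one-dimensional threshold rules ("stumps"). AdaBoost is used here only because it makes the data-point weights explicit; it is a different algorithm from XGBoost, which fits each new tree to gradients of a loss function. The function names and the toy dataset are assumptions for this example.

```python
import math


def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best rule 'polarity if x < threshold'."""
    best = None
    for thr in sorted(set(xs)):
        for polarity in (+1, -1):
            preds = [polarity if x < thr else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, thr, polarity)
    return best


def boost(xs, ys, rounds=3):
    n = len(xs)
    weights = [1.0 / n] * n          # the first model treats all points equally
    ensemble, history = [], [weights[:]]
    for _ in range(rounds):
        err, thr, pol = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)   # vote of this weak model
        ensemble.append((alpha, thr, pol))
        preds = [pol if x < thr else -pol for x in xs]
        # Increase the weight of misclassified points, decrease the rest, renormalize.
        weights = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
        history.append(weights[:])
    return ensemble, history


def predict(ensemble, x):
    score = sum(alpha * (pol if x < thr else -pol) for alpha, thr, pol in ensemble)
    return 1 if score >= 0 else -1


xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble, history = boost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

No single threshold rule can separate this toy data, but the weighted vote of three such rules can, because each new rule concentrates on the points that the rules built so far get wrong.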
Region-Based Ensemble (Mixture-of-Experts)
The bagging and boosting algorithms focused on sampling the data or features
randomly. Another class of ensemble learning algorithms is where each model
focuses on a different part of the input space instead of building a single model
for the entire space. For example, if we were to build credit models for all
businesses, one approach would be to build a single complex model for all types
(size × vertical) of businesses. Another approach might be to build a separate model
for small, medium, and large businesses and also businesses in different verticals.
The business size and vertical now become the “partitioning variables” instead of
“input features” to the model. And each of the models becomes an “expert” in
that cohort of businesses. Such a framework is called a mixture-of-experts. Piecewise
(local) linear regression is an example of a mixture-of-experts: to model a complex
regression function, instead of using a high-order polynomial we might use local
linear planes, where each plane is valid only over a small region in the input space,
and near the boundaries the outputs of two planes are interpolated to give the final
output.
Output-Based Ensemble (Binary Hierarchical Classifiers)
Most machine learning algorithms such as logistic regression or support vector
machines are natural at solving two-class problems. But more often than not we are
faced with classification problems with more than two classes (digits recognition,
remote sensing, etc.). In such cases, we can apply these two-class classification
algorithms in creative ways:
1-vs-rest classifier: One approach is to take a C-class problem and break it into C
2-class problems, where each problem takes one of the C classes as a positive class
and all the other C-1 classes as negative class. This approach has a few drawbacks
as the negative class can become large and create an imbalanced two-class problem
each time. If we were to sample the negative class (C-1), choosing the right negative
samples becomes critical to build a good 1-vs-rest classifier. Finally, the decision
boundary where one class has to be discriminated from all the others might be too
complex to learn.
Pairwise classifier: Here, the C-class problem is divided into (C choose 2)
two-class problems where a 2-class classifier is built for each pair of classes. The
advantage here is that each of the pairwise classifiers can select or engineer its
own set of features (e.g., features needed to distinguish digits 1 vs. 7 are very
different from the features needed to distinguish digits 3 vs. 8). With such specific
features that focus on discriminating just the two classes at a time, the accuracy of
these pairwise classifiers can be very high even with simple models. The domain
knowledge that is discovered in terms of which features are needed to discriminate
which pair of classes is an additional outcome. At the time of scoring, a new data
point is first sent through all the pairwise classifiers where each one gives a label
from among its class pair. Note that here each of the C classes has equal votes. The
majority voting is used to then combine the output. The only drawback here is that
the number of classifiers needed to be built is quadratic in number of classes. This
can, however, be parallelized. Similarly, at scoring time each new data point has to
be sent through all the pairwise classifiers. This again can be parallelized. Pairwise
classifiers do not suffer from some of the problems of 1-vs-rest classifiers.
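The pairwise scheme with majority voting can be sketched as follows. This is a minimal illustration in which each two-class classifier is a trivial nearest-class-mean rule on a one-dimensional input; that rule, the function names, and the toy class means are all assumptions for this example, and in practice any two-class model could be used for each pair.

```python
from collections import Counter
from itertools import combinations


def make_pair_classifiers(class_means):
    """Build one two-class classifier per pair of classes (C choose 2 in total)."""
    classifiers = {}
    for p, q in combinations(sorted(class_means), 2):
        # Default arguments bind the current pair to this classifier.
        classifiers[(p, q)] = (
            lambda x, p=p, q=q: p if abs(x - class_means[p]) <= abs(x - class_means[q]) else q
        )
    return classifiers


def pairwise_vote(x, classifiers):
    """Send x through every pairwise classifier and return the majority label."""
    votes = Counter(clf(x) for clf in classifiers.values())
    return votes.most_common(1)[0][0]


class_means = {"a": 0.0, "b": 5.0, "c": 10.0}
classifiers = make_pair_classifiers(class_means)
print(len(classifiers), pairwise_vote(4.0, classifiers))
```

Since the pairwise classifiers are independent of each other, both the construction loop and the voting loop can be distributed across workers.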
Binary hierarchical classifiers: In hierarchical clustering, data is clustered
either in top-down (divisive) or bottom-up (agglomerative) fashion. In the same way,
if we have a large number of classes, then the classes themselves can be clustered
hierarchically. The distance between two classes can be measured by the accuracy
of the pairwise classifier itself: the lower the accuracy, the closer (more confusable)
the two classes. Figure 16.18 shows an example of such a binary tree
discovered from classifying letters of the English alphabet using OCR features. Here
classes G and Q are merged first, then classes (M, W) and (F, P), etc. are merged.
This is based on the training accuracy of the pairwise classifiers. The first two
classes are merged together (bottom up) and a new meta-class {G, Q} is created.
Now we are left with C − 1 classes. The process is repeated, and a whole binary
tree of classes is created.
Each internal node in this tree is a two-class classifier with its own set of features
that best discriminate the two child (meta)classes. When a new data point has to
be classified, it first goes to the root node where the root node classifier decides
whether it looks more like “left meta-class” or “right meta-class.” The data point is
then passed along that path recursively. This can be done both in a “hard” way, that
is, send it either to left or right. Or this can be done in a “soft” way, that is, send it
to both left and right with the posterior probability weight. These posterior weights
are then multiplied across each path leading from the root node to the leaf node to
get the overall posterior probability of each class. Such a classifier needs
only C-1 two-class classifiers (as opposed to C choose 2 for the pairwise
approach). It still uses features that discriminate only the two meta-classes at
each node and, most importantly, it automatically discovers the class hierarchy
as additional domain knowledge that we might not have been aware of before.
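The "hard" and "soft" routing described above can be sketched as follows. This is only an illustration under stated assumptions: the tree is a nested dictionary, each internal node carries a hypothetical function `p_left` returning the posterior of the left meta-class, and the constant posteriors (0.7 and 0.4) are made up for the example.

```python
def leaf_posteriors(node, x, weight=1.0):
    """Soft routing: multiply posteriors along every root-to-leaf path.
    Returns a dict {class_label: overall posterior probability}."""
    if "label" in node:                       # leaf: one original class
        return {node["label"]: weight}
    p_left = node["p_left"](x)                # P(left meta-class | x) at this node
    out = leaf_posteriors(node["left"], x, weight * p_left)
    out.update(leaf_posteriors(node["right"], x, weight * (1.0 - p_left)))
    return out


def hard_label(node, x):
    """Hard routing: follow only the more likely branch at each node."""
    while "label" not in node:
        node = node["left"] if node["p_left"](x) >= 0.5 else node["right"]
    return node["label"]


# Hypothetical tree over C = 3 classes, hence C - 1 = 2 internal classifiers.
tree = {
    "p_left": lambda x: 0.7,                  # root: {G, Q} vs. M
    "left": {
        "p_left": lambda x: 0.4,              # meta-class {G, Q}: G vs. Q
        "left": {"label": "G"},
        "right": {"label": "Q"},
    },
    "right": {"label": "M"},
}
post = leaf_posteriors(tree, x=None)          # roughly G: 0.28, Q: 0.42, M: 0.30
winner = hard_label(tree, x=None)
```

In a real system each `p_left` would be a trained two-class model using the features that best separate its two child meta-classes.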
Ensemble learning is used where individual complex models are not enough and
we need to build robust and accurate models. The mixture-of-experts and binary
hierarchical classifiers not only improve model accuracy but also improve model
interpretation as they focus us on the right features and therefore a simpler yet more
accurate decision boundary. In general, ensembles are more reliable than individual
models as they explore the possible space of input/output mapping more thoroughly.
To learn more about various ensembling techniques, you can refer to Chap. 16 in
“Machine Learning—A Probabilistic Perspective” by Murphy (2012), Chaps. 8, 10,
and 15 in “The Elements of Statistical Learning” by Friedman et al. (2001), Chap.
11 in “Machine Learning” by Carbonell et al. (1983), or Chap. 8 of “Data Mining:
Concepts and Techniques” by Han et al. (2011).
In this chapter, so far, we have explored the classification paradigm in depth.
In this section, we will discuss various aspects of the recommendation engine
paradigm.
• In browse mode, the items are organized in categories (e.g., electronics vs. sports)
and hierarchies (e.g., electronics ➔ cameras ➔ SLR cameras) so the user can
navigate through this organization structure to reach the item he/she is looking
for.
• In filter and sort mode, any list of items (obtained from search or browse) is
further refined by either filtering (including or excluding) or sorting (in ascending
or descending order) the items by various properties (brand, rating, price, etc.).
In all three modes, the onus is on the user to discover what they are looking for
using these modes.
• In recommendation mode, the discovery process becomes proactive where the
system itself suggests or pushes the items to the user that he/she is most likely
to engage with next. This is one use-case of recommendation engines—build
products that enable “intelligent proactive discovery” of items from a large
collection of items for the user.
Personalization—most Apps or home pages of services today have an entry point
that can be personalized for each user. For example, each user sees a different set
of videos when they log into YouTube. They see a different set of suggestions
for potential connections when they log into their LinkedIn, Facebook, or Twitter
accounts. They also see different Netflix, Amazon Prime, Gaana, Saavn home pages
depending on their previous activities on these Apps. The home pages differ not just
from other users but also from their last visit. This degree of personalization of
home pages of websites or apps is also powered by recommendation engines in the
backend.
Serendipity—The epitome of intelligence is not just to do what makes sense but
to do what might surprise us while it makes sense, to exceed our expectations. An
essential element of advanced recommendation systems is this serendipity. Search
is a zeroth order intelligence where the user already knows what he/she is looking
for and the system is just trying to “match the query” with content meta-data;
personalization is the first-order discovery where the user is suggested proactively
what he/she might be looking for next, but serendipity is really a second-order
intelligence where the user is suggested what he/she might not even be looking for
but is pleasantly surprised by it. Serendipity opens up gates to a new dimension
of exploration for the user. Serendipity in recommendations is like mutation in
evolution. It allows for random yet connected exploration.
Optimization—Finally, recommendation engines can also be used to optimize
different utility functions depending on the life stage of a customer. For example,
when a new customer comes on board with a business (e.g., a bank, a retailer), the
business tries to optimize its relationship with the customer over a period of time.
In the beginning, the goal is just to turn a new customer into a loyal customer,
then a loyal customer into a valuable customer, and then a valuable customer into
a retained customer. Each of
these stages of a customer journey vis-à-vis the business can apply a different utility
function when recommending the next set of products. For example, to make a
new customer loyal, a retailer might offer daily use products at a lower price to
the customer (milk, soap, groceries, etc.). Once the customer is loyal, the business
might want to cross-sell the customer into other categories that are more profitable
to the business and relevant to the customer (clothing, shoes, etc.). After that, the
business might want to up-sell the customer even more profitable items such as high-
end electronics or jewelry, etc. This delicate hand-holding of a customer leads to a
long-term lifetime value of a customer. Some of the most advanced recommendation
engines fine-tune their recommendations to each customer based not just on the
customer’s historical and demographic indicators but also on the customer’s
inferred stage in this journey.
Problem statement: Given the past engagement of a user with items in the domain,
predict whether or not a particular user will exhibit high or low engagement with a
particular item that he/she may or may not have been exposed to yet.
Figure 16.19 shows an example of an Engagement Matrix with six users engaging
with eight items. If a user likes an item, the corresponding cell is marked green and
if the user does not like an item, that cell is marked red. So, for example, user 2 likes
items 2, 4, and 7 but does not like the items 3 and 6. These items could be movies or
songs or products, etc. The gray boxes indicate that the corresponding user (row) has
not yet interacted with the item (column). Let us say we have this data collected over
a large number of users and items and we want to predict whether user-5 will like
item-4 and whether user-6 will like item-1. In other words, should we recommend
item-4 to user-5 and item-1 to user-6? How will we compute these “recommendation
scores”?
Fig. 16.19 An engagement matrix with six users and eight items. Red indicates that the user did
not like the item and green indicates that the user liked the item. Gray indicates the user has not
yet interacted with the item
We can remove that bias by a simple z-scoring of each user’s past rating, that is, for
each user, we can create a normalized rating based on his/her mean and standard
deviation of all the ratings given in the past and use this normalized rating instead of
the raw rating. Another example is in media consumption—songs, movies, videos.
If a user clicks on a recommended item but does not finish it, consuming only a few
seconds of it before returning, this also indicates a lack of engagement. So
clicking alone is not engagement; finishing the experience up to a certain
percentage is true engagement. Furthermore, repeated interaction of a user with an
item indicates deeper engagement. Temporally, recent interactions should get a
higher engagement score.
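The per-user z-scoring mentioned above can be sketched as follows. This is a minimal illustration that assumes explicit numeric ratings and uses the population standard deviation; the function name `zscore_ratings` and the two example users are hypothetical.

```python
from statistics import mean, pstdev


def zscore_ratings(ratings):
    """ratings: dict item -> raw rating given by ONE user.
    Returns dict item -> (rating - user mean) / user standard deviation."""
    mu = mean(ratings.values())
    sigma = pstdev(ratings.values())
    if sigma == 0:                  # the user gave the same rating everywhere
        return {item: 0.0 for item in ratings}
    return {item: (r - mu) / sigma for item, r in ratings.items()}


# A generous rater and a strict rater with the same relative preferences.
generous = {"m1": 5, "m2": 4, "m3": 5, "m4": 4}
strict = {"m1": 3, "m2": 2, "m3": 3, "m4": 2}
z_generous = zscore_ratings(generous)
z_strict = zscore_ratings(strict)
```

After normalization the two users have identical scores, so a raw 4 from the generous user and a raw 2 from the strict user are both read as "below this user's average".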
Combining Multiple Feedbacks: Finally, for the same item, the user might
give more than one type of feedback. For example, in e-commerce, the user might
search for a product, spend time browsing it, read reviews of it, add it to a
wish list, remove it from the wish list, purchase it, return it, write a review
of it, or respond to a review of it. In media, the user might search for a piece
of content, consume it partially or fully, download it, like it, share it, etc.
Combining all these interactions, both implicit and explicit, by normalizing
their scales and weighting them appropriately to come up with the final
engagement score is again a fine art.
More formally, let there be K different types of engagements between a user and
an item:
$$e(u_n, i_m) = \sum_{k=1}^{K} w_k \, e_k(u_n, i_m)$$
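As a small illustration of this weighted sum, the sketch below combines four feedback types into one engagement score. The feedback names and the weights are hypothetical placeholders, not values from the chapter; in practice the weights would be tuned or learnt.

```python
def combined_engagement(signals, weights):
    """e(u, i) = sum over k of w_k * e_k(u, i) for the K feedback types."""
    return sum(weights[k] * value for k, value in signals.items())


# Hypothetical weights w_k for K = 4 feedback types (returns count negatively).
weights = {"viewed": 0.1, "wishlisted": 0.3, "purchased": 1.0, "returned": -0.8}

# Normalized engagement e_k(u, i) of one user with one item.
signals = {"viewed": 1.0, "wishlisted": 1.0, "purchased": 1.0, "returned": 0.0}

score = combined_engagement(signals, weights)
```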
where $e(u)$ is the average engagement of user $u$, that is, $e(u) = \frac{1}{|I(u)|} \sum_{i \in I(u)} e(u, i)$
Estimating the Recommendation Score: Once the user–user similarity is computed,
we can use it to compute the rating of an item (e.g., item-4 above) for a user
(e.g., user-5 above), given his/her similarity with all the other users (e.g.,
user-3 and user-4) and whether they liked or did not like the item. This can be
written as a simple weighted sum as follows:
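A weighted sum of this kind can take several forms. The sketch below assumes one common variant, a mean-centered, similarity-weighted average over the other users who engaged with the item; the similarity values and the toy engagement data are hypothetical inputs, and the exact formula used in the chapter may differ.

```python
def mean_engagement(row):
    """Average engagement of one user over the items he/she engaged with."""
    return sum(row.values()) / len(row)


def predict_score(target, item, engagements, similarity):
    """Assumed form: user mean plus the similarity-weighted average of the
    other users' mean-centered engagement with `item`."""
    base = mean_engagement(engagements[target])
    num = den = 0.0
    for other, row in engagements.items():
        if other == target or item not in row:
            continue
        s = similarity[(target, other)]
        num += s * (row[item] - mean_engagement(row))
        den += abs(s)
    return base if den == 0 else base + num / den


# Hypothetical data: 1 = liked, 0 = did not like.
engagements = {
    "user-5": {"item-1": 1, "item-2": 0},
    "user-3": {"item-1": 1, "item-2": 0, "item-4": 1},
    "user-4": {"item-1": 0, "item-2": 1, "item-4": 0},
}
# Hypothetical precomputed user-user similarities.
similarity = {("user-5", "user-3"): 1.0, ("user-5", "user-4"): -1.0}

score = predict_score("user-5", "item-4", engagements, similarity)
```

Here user-3 (similar to user-5) liked item-4 and user-4 (dissimilar) did not, so both push the score for user-5 above that user's own average.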
Here:
• (u, i) ∈ E implies that the summation is over cells where user u has engaged with
item i.
• The first term minimizes error of approximation between residual error and
parameters.
• The second and third are regularization terms that penalize for high values of
parameters.
• Also note that in iteration 1, E0 = E itself.
Solving for the parameters we get:
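$$\frac{\partial J(a_k, b_k \mid E_{k-1})}{\partial a_k(v)} = 2 \sum_{i \in I(v)} \big( a_k(v)\, b_k(i) - e_{k-1}(v, i) \big)\, b_k(i) + 2 \lambda_A\, a_k(v) = 0$$

$$a_k^{(t+1)}(v) \leftarrow (1 - \eta)\, a_k^{(t)}(v) + \eta\, \frac{\sum_{i \in I(v)} e_{k-1}(v, i)\, b_k^{(t)}(i)}{\lambda_A + \sum_{i \in I(v)} b_k^{(t)}(i)^2}$$

$$\frac{\partial J(a_k, b_k \mid E_{k-1})}{\partial b_k(j)} = 2 \sum_{u \in U(j)} \big( a_k(u)\, b_k(j) - e_{k-1}(u, j) \big)\, a_k(u) + 2 \lambda_B\, b_k(j) = 0$$

$$b_k^{(t+1)}(j) \leftarrow (1 - \eta)\, b_k^{(t)}(j) + \eta\, \frac{\sum_{u \in U(j)} e_{k-1}(u, j)\, a_k^{(t)}(u)}{\lambda_B + \sum_{u \in U(j)} a_k^{(t)}(u)^2}$$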
Here, η is the learning rate. Vectors ak and bk are initialized to small values
and learnt via these alternate updates until convergence is achieved. Once we have
reached the maximum dimensionality K, we can compute the “interpolated” or
“smeared” score for the cells with no engagement scores:
$$e(u, i) = \sum_{k=1}^{K} a_k(u) \times b_k(i)$$
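The alternating updates and the final interpolated score can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions rather than a reference implementation: the residual engagements are stored in a dictionary keyed by (user, item), the defaults for `lam_a`, `lam_b`, `eta`, and `iters` are arbitrary, a fixed iteration count stands in for a convergence test, and the toy data is hypothetical.

```python
import random


def fit_rank_one(E, users, items, lam_a=0.1, lam_b=0.1, eta=0.5, iters=200, seed=0):
    """Fit one pair of vectors (a_k, b_k) to the residuals E = E_{k-1}.
    E: dict (user, item) -> residual engagement on observed cells only."""
    rng = random.Random(seed)
    a = {u: rng.uniform(0.01, 0.1) for u in users}   # small initial values
    b = {i: rng.uniform(0.01, 0.1) for i in items}
    for _ in range(iters):
        for v in users:                               # update a_k(v)
            obs = [(i, e) for (u, i), e in E.items() if u == v]
            num = sum(e * b[i] for i, e in obs)
            den = lam_a + sum(b[i] ** 2 for i, _ in obs)
            a[v] = (1 - eta) * a[v] + eta * num / den
        for j in items:                               # update b_k(j)
            obs = [(u, e) for (u, i), e in E.items() if i == j]
            num = sum(e * a[u] for u, e in obs)
            den = lam_b + sum(a[u] ** 2 for u, _ in obs)
            b[j] = (1 - eta) * b[j] + eta * num / den
    return a, b


def factorize(E, users, items, K=2, **kwargs):
    """Fit K rank-one factors, each on the residual left by the previous ones."""
    residual = dict(E)                                # E_0 = E
    factors = []
    for _ in range(K):
        a, b = fit_rank_one(residual, users, items, **kwargs)
        factors.append((a, b))
        residual = {(u, i): e - a[u] * b[i] for (u, i), e in residual.items()}
    return factors


def score(factors, u, i):
    """Interpolated score e(u, i) = sum over k of a_k(u) * b_k(i)."""
    return sum(a[u] * b[i] for a, b in factors)


users, items = ["u1", "u2", "u3"], ["i1", "i2"]
E = {("u1", "i1"): 1.0, ("u1", "i2"): 0.5,
     ("u2", "i1"): 2.0, ("u2", "i2"): 1.0,
     ("u3", "i1"): 3.0}                               # ("u3", "i2") is unobserved
factors = factorize(E, users, items, K=2)
smeared = score(factors, "u3", "i2")
```

The observed cells follow an exact rank-one pattern, so the interpolated score for the unobserved cell lands near 1.5, slightly shrunk toward zero by the regularization terms.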
CF-based recommendation engines answer the point question: “will this user
engage with this item?” This is fine when we do not know anything about a user and
an item and they are just IDs in the two dictionaries. But if we know or infer enough
about a user (e.g., demographics, behavior patterns) and if we know enough about
the items (e.g., meta-data), then we can answer the space question: “will such a user
engage with such an item?” Engines that answer this question are known as
profile-based recommendation engines; they work with not only the engagement
matrix but also the user and item properties beyond just the interaction among
them, as shown in Fig. 16.21.
Fig. 16.21 In profile-based recommendation engines, we do not just know the past engagement
between users and items, but we also know or infer additional user and item features that can be
used to determine which type of users like which type of items
There are four stages in building a profile-based recommendation engine.
1. Characterize the features of users and items: The meta-data (given or inferred)
characterizes the “space” in which users and items live. For example:
• User features include the basic demographics of the user—their age group,
income group, gender, location, device, behavior patterns, preferences, etc.
• Item features depend on the nature of the item, for example:
– Movies—features are actor(s), actress(es), director(s), genre, plot, pro-
ducer(s), etc.
– Songs—features are singer(s), music director(s), album, genre, melody,
length, etc.
– News—features are entities, events, location, actions, sentiment, etc.
– Videos—features are keywords, playlists, source, etc.
– Tweets—features are hash-tags, keywords, source, sentiments, etc.
– Clothes—features are brand, the fabric, the color, fashion type, fitting style,
etc.
2. Profile users (items) in item (user) space: The intuition behind profile-based
recommendation engines is as follows: We hypothesize that a user is engaging
with an item because of certain properties of that item. If we consider all items
that a user is heavily engaging with and find out what is common among them—
then we start to build a user profile in terms of item properties. For example:
• A movie customer likes movies with certain plots and a certain set of directors.
• A song customer likes classic songs sung by particular artists.
• A retail customer likes clothes with bright colors, of a certain fabric, from a
certain brand.
In other words, the explicit engagements of a user with items (e.g., like, buy,
consume, add to list, write comment), that is, the values in the engagement
matrix, can be used to build the “implicit profile” of the user (item),
characterizing what kind of items (users) a user (an item) likes (is liked by)
instead of which items (users) a user (item) likes (is liked by). More formally,
• Let $\pi(u) = \{\pi_1(u), \pi_2(u), \ldots, \pi_L(u)\}$ be the L properties associated with
user u
• Let $\nu(i) = \{\nu_1(i), \nu_2(i), \ldots, \nu_K(i)\}$ be the K properties associated with item i
For now, let us assume these properties are binary indicator functions (e.g.,
“is actor X in movie i”). As in CF, we can take a user-centric approach
(e.g., user–user similarity), an item-centric approach (e.g., item–item
similarity), or a joint user–item approach (e.g., matrix factorization).
• User profiling: Given the engagement matrix E between users and items and the
item properties, we can build a user profile by aggregating the profiles of all
items that the user engaged with. This answers the question: What kind of items
is this user engaging with?
$$\phi_k(u) = \frac{\sum_{i \in I(u)} e(u, i) \times \nu_k(i)}{\sum_{i \in I(u)} e(u, i)}, \quad \forall\, 1 \le k \le K$$
• Item profiling: Again, given the engagement matrix E and the user properties,
we can build an item profile by aggregating the profiles of all users that engaged
with this item. This answers the question: What kind of users engage with this
item?
$$\theta_\ell(i) = \frac{\sum_{u \in U(i)} e(u, i) \times \pi_\ell(u)}{\sum_{u \in U(i)} e(u, i)}, \quad \forall\, 1 \le \ell \le L$$
These user profiles are in item-spaces and the item-profiles are in user-spaces.
In a way, we can think of a user profile as a point in the item-space and all items
are also points in the same space. Similarly, an item-profile is a point in the user-
space and all users are also points in the user-space.
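The two profiling formulas can be sketched as follows. This is a minimal illustration assuming binary property vectors and strictly positive engagement totals; the property meanings and the toy engagement values are hypothetical.

```python
def user_profile(u, E, item_props, K):
    """phi_k(u): engagement-weighted average of item property k over the
    items that user u engaged with (assumes a positive engagement total)."""
    cells = [(i, e) for (v, i), e in E.items() if v == u]
    total = sum(e for _, e in cells)
    return [sum(e * item_props[i][k] for i, e in cells) / total for k in range(K)]


def item_profile(i, E, user_props, L):
    """theta_l(i): engagement-weighted average of user property l over the
    users who engaged with item i (assumes a positive engagement total)."""
    cells = [(u, e) for (u, j), e in E.items() if j == i]
    total = sum(e for _, e in cells)
    return [sum(e * user_props[u][l] for u, e in cells) / total for l in range(L)]


# Hypothetical engagement matrix, stored sparsely as (user, item) -> score.
E = {("u1", "m1"): 2.0, ("u1", "m2"): 1.0, ("u2", "m2"): 1.0}

# Hypothetical binary item properties, K = 2: [is_action, is_comedy].
item_props = {"m1": [1, 0], "m2": [0, 1]}

# Hypothetical binary user properties, L = 2: [age_under_30, age_30_plus].
user_props = {"u1": [1, 0], "u2": [0, 1]}

phi_u1 = user_profile("u1", E, item_props, K=2)      # a point in item-space
theta_m2 = item_profile("m2", E, user_props, L=2)    # a point in user-space
```

In this toy data, two thirds of u1's engagement goes to action movies, and movie m2 is engaged with equally by both age groups.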
$$e(u, i) = \mathrm{Sim}\big(\pi(u), \theta(i)\big)^{\alpha} \; \mathrm{Sim}\big(\nu(i), \phi(u)\big)^{1 - \alpha}$$
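A sketch of this score is given below. The similarity function is not specified here, so cosine similarity is assumed purely for illustration; with non-negative property and profile vectors it stays between 0 and 1, which keeps the fractional powers well defined. The example vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0 or nb == 0 else dot / (na * nb)


def profile_score(pi_u, theta_i, nu_i, phi_u, alpha=0.5, sim=cosine):
    """e(u, i) = Sim(pi(u), theta(i))^alpha * Sim(nu(i), phi(u))^(1 - alpha)."""
    return sim(pi_u, theta_i) ** alpha * sim(nu_i, phi_u) ** (1 - alpha)


pi_u = [1, 0]                 # user properties (user-space)
theta_i = [0.5, 0.5]          # item profile (a point in user-space)
nu_i = [0, 1]                 # item properties (item-space)
phi_u = [2 / 3, 1 / 3]        # user profile (a point in item-space)

balanced = profile_score(pi_u, theta_i, nu_i, phi_u, alpha=0.5)
user_side_only = profile_score(pi_u, theta_i, nu_i, phi_u, alpha=1.0)
```

Setting `alpha` to 1 uses only the match between the user's properties and the item's profile, while setting it to 0 uses only the match between the item's properties and the user's profile.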
items that are rich in content. For example, if we are recommending a product on
an e-commerce portal, we might have access to a number of additional product
features but beyond that we do not have much to go on to accentuate the product’s
profile. But in content-rich items such as songs, videos, movies, news articles,
teaching content, etc. we can extract deeper features from the content itself, add
them to the meta-data features and then build a more holistic profile of the items
to improve their representation. For example:
• Song recommendation can be enhanced by learning its melody and style by
extracting say frequency characteristics of the songs, the instruments playing,
etc. Why a user likes a song is not just because of its meta-data but because of
its content.
• Movie recommendation can be enhanced by extracting activity features, for
example, is there a car chase or an action sequence or a court scene or a
cultural scene in the movie. What is the storytelling style or background music
or nature of language used, etc.
• News recommendation can be enhanced by extracting entities, events, issues,
topics, and sentiments about them in the news article and not just by
representing it as a bag-of-words. The reader is interested in the real-world
stories that the news represents, not just the words.
• Teaching content recommendations can be enhanced by identifying the
different parts of a teaching content—real-world example, definition, detail,
humor, motivation for the topic being taught—and how they are ordered. This
will help extract the “teaching style” of the teaching content that must be
matched with the “learning style” of the student.
Ultimately, algorithms in machine learning can take us only so far. It is the
features that we can extract about our items that make these algorithms come
alive.
2. Strategic recommendations: Often recommendations are done on a trigger and
for a purpose. The trigger could be entering the store (imagine a face recognition
system that recognizes the customer), logging into the online portal, reviewing a
product (people who bought this also bought that), or making a final purchase
of all items in the basket (printing coupons on the back of the receipt). These
recommendations could serve very different purposes, which might be either
tactical or strategic. The choice of the “utility function” to apply for a particular
recommendation instance depends on the stage and context of the customer. For
example:
• Recommendation for Loyalty—Often if the goal is just to maximize the
loyalty of a new customer, the business might just recommend commonly or
repeatedly bought products (e.g., groceries or clothes) at a discounted price to
a certain customer.
• Location bias: Typically, our apps know our location, and if there are events
(e.g., news, shows, or sales) that are relevant to a user’s location and also
relevant with respect to past engagements, then they might be shown to the user.
• Popularity bias: If something is suddenly becoming popular because it is
either important or trending, it might be shown to the user even if it is only
slightly relevant to him/her. Examples include a major news event (e.g., a
terrorist attack, a natural disaster, or a major business announcement), a
best-selling book, a hit movie, or a viral video. This is typically seen
in verticals where users give explicit feedback (likes, shares, buys, etc.)
indicating popularity.
• Social bias: Finally, in a social network setting, items that are explicitly
popular in one’s neighborhood in the social graph might also surface in a
customer’s recommendations as birds of a feather flock together. One might
like what his/her friends on the network like.
What a user finally sees might be a combination of all these aspects together
giving a ranking of what the user might see. The actual feedback by the user
might be used to learn the weights of which biases among the above are more
important to the user than others.
4. Cross-domain recommendations: Earlier, each service was focused only on one
aspect of the user. For example, banks know only about a user’s financial view,
retailers know only about their purchase behavior, and cab hailing services
know only about a user’s travel behaviors within the city and its neighborhood,
while airline services know only about the user’s air travel. They all have
a siloed view of a customer and can only suggest recommendations that are
best suited accordingly. The next-generation businesses might provide different
types of services to the same customer (e.g., Amazon has both a retail business
and a media business) or might have different views on the same customer
via different channels (e.g., firms such as Paytm or banks understand from
the customer’s payment behavior what kind of cross-vertical engagements the
customer is having). The public profile of the customer—their Facebook, Twitter,
and LinkedIn profiles—can also provide additional insights about a customer.
Soon, recommendation engines will be able to combine all these views and
suggest the right products and services with a more holistic view of the customer.
For example,
• Knowing that a customer is booking a flight to a beach resort destination (e.g.,
Florida or Goa) during the summer, one might recommend the right clothes and
beach products to the customer.
• Knowing that a customer just bought sports shoes might lead to a recommendation
of “sporty music” to that customer.
• Knowing that a customer just took a home loan, a bank might recommend
home furnishing products from a partner retailer to him/her.
Fig. 16.22 Students A and B have mastered different sets of concepts well (green), not so well
(orange), and not at all (red). Depending on their mastery levels and the prerequisite or concept
dependence graph, the workflow-based recommendation engine might suggest a different set of
concepts to learn for student A vs. student B
• Next Best Career Move: Career building is a strategic art that requires the right
choices at the right time. Today, most people make some of their most important
career decisions with limited understanding or ad hoc criteria. Each potential
career move might have different prerequisites (e.g., an MBA school
admission might require a minimum of, say, 3 years of job experience; a job
might require a certain set of hard or soft skills; or a profession such as
researcher, doctor, or professor might require an advanced
degree). A career-move recommendation engine that is aware of prerequisite
constraints and of a user’s personality and aspirations might recommend the best
next move, whether it is taking a certain MOOC course, pursuing a degree
from a certain college, gaining internship experience in a certain company, or
doing volunteer work in a certain organization.
6. Contextual Recommendations: So far, we talked about which item the customer
is most likely to engage with, but the success of that engagement might depend
not just on the accuracy of the recommendation score but also on the context in
which the recommendation is made. For example,
• Recommending one’s potential cuisine just before meal times
• Recommending movies just before weekend or holiday starts
• Recommending back-to-school items toward the close of summer vacations
• Recommending a tourist spot when one has just landed in a new city for
vacation
• Recommending a cartridge exactly 2 months after a customer bought a printer
• Recommending different songs in the morning, in the evening, and on weekends
The timing, the triggers, the location, the device, and the channel are other
aspects to be considered when making a relevant recommendation to the user.
Overall, recommendation engines play a very important role for many types
of user interactions with the business, for discovering new items, for keeping the
user engaged and informed, and helping the users make better choices. These rec-
ommendation engines become better with deeper understanding of the user—both
where they have been and where they are heading. To read more on recommendation
engines, you can refer to Singhal et al. (2017), Li et al. (2011), or Chap. 13 of “Data
Mining: Concepts and Techniques” by Han et al. (2011).
5 Conclusion
right modelling algorithm to match the data complexity to the model complexity.
Every decision today has a potential to be driven by data. The real challenge is to
find the right insights, engineer the right features, build the right models, and apply
the right business optimization to convert model predictions into decisions. Doing
all this right will improve our ability to make more accurate, personalized, and real-
time decisions, improving our businesses and processes multifold.
All the datasets, code, and other material referred to in this section are available at
www.allaboutanalytics.net.
• Data 16.1: Decision_Tree_Ex.csv
• Code 16.1: Decision_Tree_Ex.R
More examples, corresponding code, and exercises for the chapter are given in
the online appendices to the chapter.
References
Carbonell, J. G., Michalski, R. S., & Mitchell, T. M. (1983). An overview of machine learning. In
R. S. Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machine learning volume 1. Symbolic
computation (pp. 3–23). Berlin: Springer Science & Business Media.
Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning (Vol. 1, No.
10) Springer series in statistics. New York, NY: Springer.
Han, J., Pei, J., & Kamber, M. (2011). Data mining: Concepts and techniques. Amsterdam:
Elsevier.
Li, L., Chu, W., Langford, J., & Wang, X. (2011). Unbiased offline evaluation of
contextual-bandit-based news article recommendation algorithms. In Proceedings of the Fourth
ACM International Conference on Web Search and Data Mining (WSDM’11) (pp. 297–306).
New York City, NY: ACM.
Murphy, K. (2012). Machine learning – A probabilistic perspective. Cambridge, MA: The MIT
Press.
Singhal, A., Sinha, P., & Pant, R. (2017). Use of deep learning in modern recommendation system:
A summary of recent works. International Journal of Computers and Applications, 180(7),
17–22.
Chapter 17
Deep Learning
Manish Gupta
1 Introduction
Deep learning, a rapidly growing area of machine learning, has gained a great deal
of momentum in the last few years, and research in the field is progressing very
fast. Machine learning (ML) has seen numerous
successes, but applying traditional ML algorithms today often means spending a
long time hand-engineering the domain-specific input feature representation. This
is true for many problems in vision, audio, natural language processing (NLP),
robotics, and other areas. To address this, researchers have developed deep learning
algorithms that automatically learn a good high-level abstract representation for the
input. These algorithms are today enabling many groups to achieve groundbreaking
results in vision recognition, speech recognition, language processing, robotics, and
other areas.
The objective of the chapter is to enable readers to:
• Understand what deep learning is
• Understand various popular deep learning architectures, and know when to use
which architecture for solving their business problem
• Know how to perform image analysis using deep learning
• Know how to perform text analysis using deep learning
M. Gupta
Microsoft Corporation, Hyderabad, India
e-mail: manishg.iitb@gmail.com
Wikipedia defines deep learning as follows. “Deep learning (deep machine learning,
or deep structured learning, or hierarchical learning, or sometimes DL) is a branch
of machine learning based on a set of algorithms that attempt to model high-
level abstractions in data by using model architectures, with complex structures or
otherwise, composed of multiple non-linear transformations.” The concept of deep
learning started becoming very popular around 2012. This was mainly due to at least
two “wins” credited to deep learning architectures. In 2012, Microsoft’s top scientist
Rick Rashid demonstrated a voice recognition program that translated Rick’s
English voice into Mandarin Chinese in Tianjin, China.1 The high accuracy of
the program was supported by deep learning techniques. Similarly, in 2012, a deep
learning architecture won the ImageNet challenge for the image classification task.2
Now deep learning has been embraced by companies in a large number of
domains. After the 2012 success in speech recognition and translation, there
has been across the board deployment of deep neural networks (DNNs) in the
speech industry. All the top companies in machine learning including Microsoft,
Google, and Facebook have been making huge investments in this area in the
past few years. Popular systems like IBM Watson have also been given a deep
learning upgrade. Deep learning is practically everywhere now. It is being used for
image classification, speech recognition, language translation, language processing,
sentiment analysis, recommendation systems, etc. In medicine and biology, it is
being used for cancer cell detection, diabetic grading, drug discovery, etc. In the
media and entertainment domain, it is being used for video captioning, video search,
real-time translation, etc. In the security and defense domain, it is being used for face
detection, video surveillance, satellite imagery, etc. For autonomous machines, deep
learning is being used for pedestrian detection, lane tracking, recognizing traffic
signs, etc. This is just to name a few use cases. The field is growing very rapidly—
not just in terms of new applications for existing deep learning architectures but also
in terms of new architectures.
In this chapter, we primarily focus on three deep supervised learning architec-
tures: multilayered perceptrons (MLPs), convolutional neural networks (CNNs),
and recurrent neural networks (RNNs). This chapter is organized as follows.
In Sect. 2, we discuss the biological inspiration for the artificial neural net-
works (ANN), the artificial neuron model, the perceptron algorithm to learn
the artificial neuron, the MLP architecture and the backpropagation algorithm
to learn the MLPs. MLPs are generic ANN models. In Sect. 3, we discuss
convolutional neural networks, which are an architecture specially designed to
learn from image data. Finally, in Sect. 4, we discuss the recurrent neural network
architecture, which is meant for sequence learning tasks (mainly text and
speech).
1 http://deeplearning.net/2012/12/13/microsofts-richard-rashid-demos-deep-learning-for-speech-
$f = \sum_{j=1}^{m} w_j x_j - b$ or a spherical function $f = \sum_{j=1}^{m} (x_j - w_j)^2 - b$. We
will discuss MLP next.
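The two functions above can be written out directly. The sketch below only evaluates them for one input; the input vector, weights, and threshold are hypothetical values chosen for illustration.

```python
def hyperplane(x, w, b):
    """f = sum_j w_j * x_j - b: signed offset of x from a hyperplane."""
    return sum(wj * xj for wj, xj in zip(w, x)) - b


def spherical(x, w, b):
    """f = sum_j (x_j - w_j)^2 - b: squared distance of x from the center w,
    offset by b."""
    return sum((xj - wj) ** 2 for wj, xj in zip(w, x)) - b


x = [1.0, 2.0]        # hypothetical input with m = 2 features
w = [0.5, -1.0]       # hypothetical weights (or sphere center)
b = 0.25              # hypothetical threshold

f_plane = hyperplane(x, w, b)
f_sphere = spherical(x, w, b)
```

The first function changes sign as the input crosses a flat boundary, while the second changes sign as the input crosses a sphere centered at `w`.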
[Figure: a multilayered network with an input layer, hidden layers, and an output layer]
The fundamental difference between ANNs and other traditional classifiers is the
following. For building traditional classifiers, a data scientist first needs to perform
domain-specific feature engineering and then build models on top of featurized data.
This needs domain knowledge, and a large amount of time is spent in coming up
with innovative features that could help predict the class variable. In case of ANNs,
the data scientist simply supplies the raw data to the ANN classifier. The hope is that
the ANN can itself learn both the representation (features) and the weights too. This
is very useful in hard-to-featurize domains like vision and speech. Multiple layers
of a deep ANN capture different levels of data abstraction.
There are multiple hyper-parameters one has to tune for various deep learning
architectures. The best way to tune them is by using validation data. But here
are a few tips in using MLPs. The initial values for the weights of a hidden
layer i could be uniformly sampled from a symmetric interval that depends on
the activation function. For the tanh activation function, the interval could be
$$\left[ -\sqrt{\frac{6}{fan_{in} + fan_{out}}},\ \sqrt{\frac{6}{fan_{in} + fan_{out}}} \right]$$
where $fan_{in}$ is the number of units in the (i − 1)-th layer and $fan_{out}$ is the
number of units in the i-th layer. For the Sigmoid function, the suggested
interval is
$$\left[ -4\sqrt{\frac{6}{fan_{in} + fan_{out}}},\ 4\sqrt{\frac{6}{fan_{in} + fan_{out}}} \right]$$
This initialization ensures that, early in training, each neuron operates in a
regime of its activation function where information can easily be propagated both
upward (activations flowing from inputs to outputs) and backward (gradients
flowing from outputs to inputs).
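The two intervals can be sketched in plain Python as follows. This is a minimal illustration; the function names and the layer sizes (784 inputs feeding 16 hidden units) are hypothetical, and a deep learning library would normally provide its own initializers.

```python
import math
import random


def init_bound(fan_in, fan_out, activation="tanh"):
    """Half-width of the symmetric sampling interval for one layer."""
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    return 4.0 * bound if activation == "sigmoid" else bound


def init_weights(fan_in, fan_out, activation="tanh", seed=0):
    """Sample a fan_in x fan_out weight matrix uniformly from [-r, r]."""
    rng = random.Random(seed)
    r = init_bound(fan_in, fan_out, activation)
    return [[rng.uniform(-r, r) for _ in range(fan_out)] for _ in range(fan_in)]


# Hypothetical layer: 784 inputs (a 28 x 28 image) feeding 16 hidden units.
r_tanh = init_bound(784, 16, "tanh")          # sqrt(6 / 800), about 0.087
r_sigmoid = init_bound(784, 16, "sigmoid")    # four times wider
W = init_weights(784, 16, activation="tanh")
```

Wider layers get a narrower interval, which keeps the summed input to each neuron in the range where the activation function is not saturated.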
How many hidden layers should one have? How many hidden units per layer?
There is no right answer to this. One should start with one input, one hidden, and one
output layers. Theoretically this can represent any function. Add additional layers
only if the above does not work well. If we train for too long, possible overfitting can
happen—the test/validation error increases. Hence, while training, use validation
error to check for overfitting. Simpler models are better—try them first (Occam’s
razor).
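One common way to use the validation error as an overfitting check is early stopping. The sketch below is an illustration under assumptions: the function name `early_stopping_epoch` and the `patience` parameter are hypothetical and do not come from the chapter's code.

```python
def early_stopping_epoch(val_errors, patience=2):
    """Return (best_epoch, stop_epoch) given per-epoch validation errors.

    Training stops once the validation error has failed to improve for
    `patience` consecutive epochs; the model from `best_epoch` is kept.
    """
    best_error = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, err in enumerate(val_errors):
        if err < best_error:
            best_error, best_epoch, waited = err, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_errors) - 1
```

With validation errors `[0.30, 0.22, 0.18, 0.19, 0.21, 0.25]`, the error stops improving after epoch 2 (counting from 0), so training would be halted at epoch 4 and the epoch-2 weights retained.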
Overview of Deep Learning Architectures
A large number of deep learning architectures have been proposed in the past few
years. We will discuss just a few of these in this chapter. We mention a partial list of
them below for the sake of completeness.
1. Deep supervised learning architectures: multilayered perceptron (MLP) for
classification; DSSM and convolutional NN for similarity/distance measures;
recurrent neural net (RNN)/long short-term memory (LSTM) for sequence-to-sequence
tasks; memory network (MemNN) for question answering and recommendation dialog;
tensor product representation (TPR) for reasoning in vector space.
17 Deep Learning 575
2.4 Summary
ANN is a computational model inspired by the workings of the human brain.
Although a single perceptron can represent only linear functions, multiple layers
of perceptrons can represent arbitrarily complex functions. The backpropagation
algorithm can be used to learn the parameters in a multilayered feed-forward neural
network. The various parameters of a feed-forward ANN such as learning rate,
number of hidden layers, and initial weight vectors need to be carefully chosen.
An ANN allows for learning of deep feature representations from raw training data.
The following section explains how to build a simple MLP using the “mxnet”
package in R for the MNIST handwritten digit recognition task. The MNIST data
comprises handwritten digits (60,000 in the training dataset and 10,000 in the test
dataset) produced by different writers. Each sample is represented by a 28 × 28 pixel
map with each pixel having a value between 0 and 255, both inclusive. You may refer
to the MNIST data website for more details. Here, we provide a sample of only
5000 digits (500 per digit) in the training sample and 1000 digits (100 per digit) in
the test dataset. The task is to recognize the digit.
The main stages of the code below are as follows:
1. Download and perform data cleaning.
2. Visualize a few sample digits.
3. Specify the model.
(a) Fully connected
(b) Number of hidden layers (neurons)
(c) Activation function type
4. Define the parameters and run the model.
(a) “softmax” to normalize the output
(b) X: Pixel data (X values)
(c) Y: Dependent variable (Y values)
(d) ctx: Processing device to be used
5. Predict the model output on test data.
6. Produce the classification (confusion) matrix and calculate accuracy.
Sample code “MLP on MNIST.R” and datasets “MNIST_train_sample.csv” and
“MNIST_test_sample.csv” are available on the website.
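Stage 6 above (confusion matrix and accuracy) is independent of the modeling package. The following Python sketch illustrates it; it is not the chapter's R script, and the helper names `confusion_matrix` and `accuracy` are hypothetical.

```python
def confusion_matrix(actual, predicted, n_classes):
    """Rows index the actual class, columns index the predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total
```

For the digit task, `n_classes` would be 10, and `actual` and `predicted` would hold the true and predicted digits for the 1000 test samples.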
words or word phrases, is called a “synonym set” or “synset.” There are more than
100,000 synsets in WordNet, the majority of which are nouns (80,000+). The ImageNet
project is inspired by a growing sentiment in the image and vision research field:
the need for more data. There are 14,197,122 images labeled with 21,841
categories.
This dataset is used for the ImageNet Large Scale Visual Recognition Challenge
held every year since 2010. The challenge runs for a variety of tasks including image
classification/captioning, object localization, object detection, object detection from
videos, scene classification, and scene parsing. The most popular task is image
classification.
The image classification task is as follows. For each image, competing algorithms
produce a list of at most five object categories in the descending order of confidence.
The quality of a labeling is evaluated based on the label that best matches the ground
truth label for the image. The idea is to allow an algorithm to identify multiple
objects in an image and not be penalized if one of the objects identified was in
fact present but not included in the ground truth (labeled values). For example, for
the image in Fig. 17.2, “red pillow” is a good label, but “flying kite” is a bad label.
Also, “sofa” is a reasonable label, although it may not be present in the hand-curated
ground truth label set.
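The evaluation rule described above, in which an image counts as correct if the ground truth label appears among at most five predicted labels, can be sketched as follows. The function name `top5_error` and the toy labels in the usage note are illustrative assumptions.

```python
def top5_error(predictions, truths):
    """Fraction of images whose ground truth label is not among the first five predictions.

    predictions: list of label lists, each sorted by descending confidence.
    truths: list of ground truth labels, one per image.
    """
    errors = 0
    for labels, truth in zip(predictions, truths):
        if truth not in labels[:5]:
            errors += 1
    return errors / len(truths)
```

For instance, predictions `["sofa", "red pillow", "lamp"]` for an image labeled "red pillow" count as correct even though "red pillow" is not the first guess.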
Table 17.2 shows the winners for the past few years for this task. Notice that
in 2010, the architecture was a typical feature engineering-based model. But since
2012 all the winning models have been deep learning-based models. The depth of
these models has been increasing significantly as the error has been decreasing over
time.
CNNs have been used to solve various kinds of vision-related problems including
the image classification challenge. Such tasks include object detection, action
classification, image captioning, pose estimation, image retrieval, image segmentation
for self-driving cars, traffic sign detection, face recognition, video classification,
whale recognition from ocean satellite images, and building maps automatically
from satellite images.
Hubel and Wiesel (1962) made the following observations about the visual cortex
system. Nearby cells in the cortex represented nearby regions in the visual field.
Visual cortex contains a complex arrangement of cells. These cells are sensitive to
small subregions of the visual field, called a receptive field. The subregions are tiled
to cover the entire visual field and may overlap. These cells act as local filters over
the input space and are well suited to exploit the strong spatially local correlation
present in natural images. Additionally, two basic cell types have been identified.
Simple cells respond maximally to specific edge-like patterns within their receptive
field. Complex cells have larger receptive fields and are locally invariant to the exact
position of the pattern.
The question is how to encode these biological observations into typical MLPs.
Fukushima and Miyake (1982) proposed the neocognitron, which is a hierarchical,
multilayered artificial neural network, and can be considered as the first CNN in
some sense.
Besides the visual cortex system, in general, we tend to think in terms of hierarchy,
for example, the vision hierarchy (pixels, edges, textons, motifs, parts, objects),
the speech hierarchy (samples, spectral bands, formants, motifs, phones, words), and
the text hierarchy (character, word, phrases, clauses, sentences, paragraphs, story).
To encode this hierarchical behavior into a neural framework, we will study CNNs
in this section.
Why can we not rely on MLPs for image classification? Consider a simple task
where you want to learn a classifier to detect images with dogs versus those without.
In the popular CIFAR-10 image dataset, images are of size 32 × 32 × 3 (32 wide, 32
high, 3 color channels) only, so a single fully connected neuron in a first hidden layer
of a regular neural network would have 32 × 32 × 3 = 3072 weights. A 200 × 200
image, however, would lead to neurons that have 200 × 200 × 3 = 120,000 weights.
Such a network architecture does not take into account the spatial structure of data,
treating input pixels which are far apart and close together on exactly the same
footing. Clearly, the full connectivity of neurons is wasteful in the framework of
image recognition, and the huge number of parameters quickly leads to overfitting.
This motivates us to build specific architecture to deal with images, as discussed
below.
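The weight counts quoted above are easy to verify. The short sketch below does the arithmetic; the function names and the 1000-neuron layer size are hypothetical, and bias terms are left out because the text counts weights only.

```python
def fc_neuron_weights(width, height, channels):
    """Weights of one fully connected neuron that sees every input value (bias excluded)."""
    return width * height * channels


def fc_layer_weights(width, height, channels, n_neurons):
    """Weights of a whole fully connected layer with n_neurons such neurons."""
    return fc_neuron_weights(width, height, channels) * n_neurons


cifar_neuron = fc_neuron_weights(32, 32, 3)           # 3072
large_neuron = fc_neuron_weights(200, 200, 3)         # 120000
large_layer = fc_layer_weights(200, 200, 3, 1000)     # 120000000
```

Even a modest hypothetical layer of 1000 neurons on a 200 × 200 color image would need 120 million weights, which is the growth that motivates the sparser connectivity of CNNs.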
Figure 17.3 shows four kinds of layers that a typical CNN has: the convolution
(CONV) layer, the rectified linear units (RELU) layer, the pooling (POOL) layers,
and the fully connected (FC) layers. FC layers are the ones that we have seen so far
in MLPs. In this section, we will discuss the other three layers (CONV, RELU, and
POOL) in detail one by one.
CONV Layer
Let us start by understanding the convolution layer. Given an original image, the
convolution layer applies multiple filters on the image to obtain feature maps. Filters
are rectangular in nature and always extend the full depth of the input volume. For
example, in Fig. 17.4, the input image has a size of 32 × 32 × 3, and a filter of size
5 × 5 × 3 is being applied. To get the entire feature map, the filter is convolved with
the image by sliding over the image spatially and computing the dot products. The
sliding can be done one step or multiple steps at a time; this is controlled using a
parameter called the stride. Filters are like features defined over the input volume.
Rather than just using one filter, we could use multiple filters. The final output
volume depth depends on the number of filters used. For example, if we had six
Next, we discuss the RELU (rectified linear units) layer. This is a layer
of neurons that applies the activation function f(x) = max(0,x). It increases the
nonlinear properties of the decision function and of the overall network without
affecting the receptive fields of the convolution layer. Other functions are also used
to increase nonlinearity, for example, the hyperbolic tangent f(x) = tanh(x) and the
sigmoid function. This layer clearly does not involve any weights to be learned.
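The CONV and RELU operations can be sketched in plain Python as follows. This is an illustrative, unoptimized sketch with hypothetical function names; it applies a single filter without padding, so the output side length is ((image size − filter size)/stride) + 1. As is usual in deep learning, the "convolution" slides the filter without flipping it.

```python
def conv2d_single(image, kernel, stride=1):
    """Slide one F x F x D filter over an H x W x D image (nested lists), no padding.

    Each output value is the dot product between the filter and the image
    patch it currently covers, across the full depth D.
    """
    height, width, depth = len(image), len(image[0]), len(image[0][0])
    f = len(kernel)
    out_h = (height - f) // stride + 1
    out_w = (width - f) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            for a in range(f):
                for b in range(f):
                    for c in range(depth):
                        total += image[i * stride + a][j * stride + b][c] * kernel[a][b][c]
            row.append(total)
        feature_map.append(row)
    return feature_map


def relu(feature_map):
    """Apply f(x) = max(0, x) elementwise; there are no weights to learn."""
    return [[max(0.0, v) for v in row] for row in feature_map]
```

Applied to a 32 × 32 × 3 image with a 5 × 5 × 3 filter and stride 1, `conv2d_single` returns a 28 × 28 feature map; using several filters would simply stack several such maps along the depth dimension.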
5 Activation map size = ((image size − filter size)/stride) + 1. Here, the image size is 32, the
filter size is 5, and the stride is 1, so the activation map size is ((32 − 5)/1) + 1 = 28.
POOL Layer
There are several nonlinear functions to implement pooling, among which max pooling
is the most common. It partitions the input image into a set of nonoverlapping
rectangles and, for each such subregion, outputs the maximum. The intuition is that
the exact location of a feature is less important than its rough location relative to
other features. The pooling layer serves to progressively reduce the spatial size of
the representation, to reduce the number of parameters and amount of computation
in the network, and hence to also control overfitting. Figure 17.5 shows an example
of max pooling with a pool size of 2 × 2.
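A minimal sketch of max pooling on a single two-dimensional feature map is given below. The function name `max_pool` and the 4 × 4 example values are illustrative assumptions, not the contents of Fig. 17.5.

```python
def max_pool(feature_map, size=2, stride=2):
    """Output the maximum of each size x size window; windows do not overlap when stride == size."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            window = [feature_map[i + a][j + b] for a in range(size) for b in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled


example = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
pooled_example = max_pool(example)  # [[6, 8], [3, 4]]
```

The 4 × 4 map shrinks to 2 × 2, and the layer itself has no weights to learn.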
Finally, after several convolutional and max pooling layers, the high-level
reasoning in the neural network is done via fully connected layers. Neurons in a
fully connected layer have connections to all activations in the previous layer, as
seen in regular MLPs.
3.4 Summary
In this section, we will discuss a deep learning architecture to handle sequence data,
RNNs. We will first motivate why sequence learning models are needed. Then we
will talk about technical details of RNNs (recurrent neural networks) and finally
discuss their application to image captioning and machine translation.
P(w1 , . . . , wm ). Language models are very useful for many tasks like the following:
(1) next word prediction: for example, predicting the next word after the user
has typed this part of the sentence. “Stocks plunged this morning, despite a
cut in interest rates by the Federal Reserve, as Wall ...”; (2) spell checkers: for
example, automatically detecting that minutes has been spelled incorrectly in
the following sentence. “They are leaving in about fifteen minuets to go to her
house”; (3) mobile auto-correct: for example, automatically suggesting that the
user should use “find” instead of “fine” in the following sentence. “He is trying
to fine out.”; (4) speech recognition: for example, automatically figuring out that
“popcorn” makes more sense than “unicorn” in the following sentence. “Theatre
owners say unicorn sales have doubled...”; (5) automated essay grading; and (6)
machine translation: for example, identifying the right word order as in p(the cat
is small) > p(small the is cat), or identifying the right word choice as in p(walking
home after school) > p(walking house after school).
Traditional language models are learned by expressing the probability
of an entire sequence using the chain rule. For longer sequences, it helps to
compute the probability by conditioning on a window of only the n − 1 previous words. Thus,
$P(w_1, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})$.
Here, we condition on the previous n − 1 words instead of all previous words. This
approximation is called the Markov assumption. To estimate probabilities, one may
compute unigrams, bigrams, trigrams, etc., as follows, using a large text corpus with
T tokens.
Unigram model: $p(w_1) = \frac{\mathrm{count}(w_1)}{T}$
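Count-based estimates of this kind can be sketched as follows. The unigram estimate is count(w)/T as above; the bigram estimate count(prev, w)/count(prev) is the standard maximum likelihood form and is stated here as an assumption, since it is not spelled out in the text. The function names and the toy corpus are hypothetical.

```python
from collections import Counter


def unigram_probs(tokens):
    """p(w) = count(w) / T for a corpus with T tokens."""
    total = len(tokens)
    return {word: count / total for word, count in Counter(tokens).items()}


def bigram_prob(tokens, prev, word):
    """p(word | prev) = count(prev, word) / count(prev); 0.0 if prev never occurs."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    prev_count = Counter(tokens)[prev]
    if prev_count == 0:
        return 0.0
    return pair_counts[(prev, word)] / prev_count


corpus = "the cat is small the cat sat".split()
p_the = unigram_probs(corpus)["the"]          # 2/7
p_cat_given_the = bigram_prob(corpus, "the", "cat")  # 1.0
```

In practice such raw counts are usually smoothed so that unseen word pairs do not receive zero probability.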
6 Word2vec is an algorithm for learning a word embedding from a text corpus. For further details,
read Mikolov et al. (2013).
Fig. 17.6 MLP for next word prediction task (Source: Bengio et al. 2003)
dimension for the feature vector representation, and |V| is vocabulary size, C is a
|V| × m sized matrix. $C(w_{t-i})$ is the vector representation of the word that came
i words ago. C could also be learned along with the other weights in the network.
Further, the model contains a hidden layer with a nonlinearity. Finally, at the output
layer, a softmax is performed to return the probability distribution of size |V| which
is expected to be as close as possible to the one-hot encoded representation of the
actual next word.
In all conventional language models, the memory requirements of the system
grow exponentially with the window size n making it nearly impossible to model
large word windows without running out of memory. But in this model, the RAM
requirements grow linearly with n. Thus, this model supports a fixed window
of context (i.e., n). There are two drawbacks of this model: (1) the number of
parameters increases linearly with the context size, and (2) it cannot handle contexts
of different lengths. RNNs help address these drawbacks.
An RNN is a deep learning neural architecture that can support next word prediction
with variable n. RNNs tie the weights at each time step. This helps in conditioning
the neural network on all previous words. Thus, the RAM requirement only scales
with the number of words in the vocabulary. Figure 17.7 shows the architecture of
a basic RNN model with three units. U, V, and W are the shared weight matrices
that repeat across multiple time units. Overall the parameters to be learned are U,
V, and W.
RNNs are called recurrent because they perform the same task for every element
of a sequence. The only thing that differs is the input at each time step. Output is
dependent on previous computations. RNNs can be seen as neural networks having
“memory” about what has been calculated so far. The information (or the state)
$h_t$ at any time instance t is this memory. In some sense, $h_t$ captures a thought
that summarizes the words seen so far. RNNs process a sequence of vectors x by
applying a recurrence formula at every time step: $h_t = f_{U,W}(h_{t-1}, x_t)$, where $h_t$ is
the new state, $f_{U,W}$ is some function with parameters U and W, $h_{t-1}$ is the old state,
and $x_t$ is the input vector at the current time step. Notice that the same function and the
same set of parameters are used at every time step.
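The recurrence can be sketched as below. The text leaves the function f unspecified; this sketch assumes the common choice $h_t = \tanh(U x_t + W h_{t-1})$, with U acting on the input and W on the previous state. The function names and the toy matrices in the usage note are hypothetical.

```python
import math


def rnn_step(h_prev, x, U, W):
    """One recurrence step, assuming h_t = tanh(U x_t + W h_{t-1})."""
    h_new = []
    for i in range(len(h_prev)):
        total = sum(U[i][j] * x[j] for j in range(len(x)))
        total += sum(W[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new


def run_rnn(inputs, U, W, hidden_size):
    """Apply the same U and W at every time step, starting from an all-zero state."""
    h = [0.0] * hidden_size
    states = []
    for x in inputs:
        h = rnn_step(h, x, U, W)
        states.append(h)
    return states
```

Calling `run_rnn([[1.0, 0.0], [0.0, 1.0]], U, W, 2)` with any 2 × 2 matrices `U` and `W` returns one hidden state per input vector, and the same call works unchanged for sequences of any length because the weights are shared across time steps.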
The weights for an RNN are learned using the same backpropagation algorithm,
also called backpropagation through time (BPTT) in the context of RNNs.
The training data for BPTT should be an ordered sequence of input-output pairs
$(x_0, y_0), (x_1, y_1), \ldots, (x_{n-1}, y_{n-1})$. An initial value must be specified for the hidden
layer output $h_0$ at time $t_0$. Typically, a vector of all zeros is used for this purpose.
BPTT begins by unfolding a recurrent neural network through time. When the
network is unfolded through time, the unfolded network contains k instances of a
unit, each containing an input, a hidden layer, and an output. Training then proceeds
in a manner similar to training a feed-forward neural network with backpropagation,
except that each epoch must run through the observations, $y_t$, in sequential order.
Each training pattern consists of $h_t, x_t, x_{t+1}, x_{t+2}, \ldots, x_{t+k-1}, y_{t+k}$. Typically,
backpropagation is applied in an online manner to update the weights as each
training pattern is presented. After each pattern is presented, and the weights
have been updated, the weights in each instance of U, V, and W are averaged
together so that they all have the same weights, respectively. Also, $h_{t+1}$ is calculated
as $h_{t+1} = f_{U,W}(h_t, x_{t+1})$, which provides the information necessary so that the
algorithm can move on to the next time step, t + 1. The output $\hat{y}_t$ is computed
as follows: $\hat{y}_t = \mathrm{softmax}(V h_t)$. Usually the cross entropy loss function is used
for the optimization: given an actual output distribution $y_t$ and a predicted output
distribution $\hat{y}_t$, the cross entropy loss is defined as $-\sum_{j=1}^{|V|} y_{t,j} \log \hat{y}_{t,j}$. Note that $y_t$ is
the true vector; it could be a one-hot encoding of the expected word or a word2vec
representation of the expected word at the t-th time instant.
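The output and loss computations can be sketched as follows, assuming a one-hot true vector. The function names are hypothetical; subtracting the maximum score inside the softmax is a standard numerical-stability step and does not change the result.

```python
import math


def softmax(scores):
    """Turn a score vector (for example, V h_t) into a probability distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y_true, y_pred):
    """-sum_j y_true[j] * log(y_pred[j]); terms with y_true[j] == 0 contribute nothing."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)
```

With a one-hot `y_true`, the loss reduces to the negative log of the probability that the model assigns to the correct word, so it is 0 only when that probability is 1.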
The following pseudo-code shows how to build an RNN using the “mxnet” R
package for the next word prediction task. Below are the main code stages:
1. Download the data and perform cleaning.
2. Create Word 2 Vector, dictionary, and lookup dictionary.
3. Create multiple buckets for training data.
4. Create iterators for multiple buckets data.
5. Train the model for multiple bucket data with the following parameters:
(a) Cell_type = “lstm” #Using lstm cell which can hold the results
(b) num_rnn_layer = 1
(c) num_embed = 2
(d) num_hidden = 4 #Number of hidden units
(e) loss_output = “softmax”
(f) num.round = 6
6. Predict the output of the model on “Test” data.
7. Calculate the accuracy of the model.
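Stage 3 above groups training sequences into buckets of similar length so that each bucket can be padded to one fixed size. The sketch below illustrates the general idea only; it is not how the mxnet package implements bucketing, and the function name, the bucket sizes, and the padding id 0 are all assumptions.

```python
def bucket_sequences(sequences, bucket_sizes, pad=0):
    """Place each sequence in the smallest bucket that fits it and pad it to that length.

    Sequences longer than the largest bucket are dropped in this sketch.
    """
    sizes = sorted(bucket_sizes)
    buckets = {size: [] for size in sizes}
    for seq in sequences:
        for size in sizes:
            if len(seq) <= size:
                buckets[size].append(list(seq) + [pad] * (size - len(seq)))
                break
    return buckets


word_ids = [[1, 2], [3, 4, 5], [6], [7, 8, 9, 10, 11]]
bucketed = bucket_sequences(word_ids, [2, 4])
```

Here the sequences of length 1 and 2 land in the bucket of size 2, the sequence of length 3 is padded into the bucket of size 4, and the sequence of length 5 fits no bucket.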
The sample code helps understand how to build an RNN using the
“mxnet” R package. The code “Next_word_RNN.R” and the datasets
“corpus_bucketed_train.rds” and “corpus_bucketed_test.rds” are available on the
website.
The basic RNN architecture can be extended in many ways. Bidirectional RNNs
are RNNs with two hidden vectors per unit. The first hidden vector maintains the
state of the information seen so far in the sequence in the forward direction, while
the other hidden vector maintains the state representing information seen so far in
the sequence in the backward direction. The number of parameters in bidirectional
RNNs is thus twice the number of parameters in the basic RNN.
RNNs could also be deep. Thus, a deep RNN has stacked hidden units, and the
output neurons are connected to the most abstract layer.
Recurrent networks offer a lot of flexibility. Thus, they can be used for a large
variety of sequence learning tasks. Such tasks could be classified as one-to-
many, many-to-one, or many-to-many depending on the number of inputs and the
number of outputs. An example of a one-to-many application is image captioning
(image → sequence of words). An example of many-to-one application is sentiment
classification (sequence of words → sentiment). An example of “delayed” many-to-
many application is machine translation (sequence of words → sequence of words).
Finally, an example of the “synchronized” many-to-many case is video classification
on frame level.
In the following, we will discuss two applications of RNNs: image captioning
and machine translation. Figure 17.8 shows the neural CNN-RNN architecture for
the image captioning task. First a CNN is used to obtain a deep representation for
the image. The representation is then passed on to the RNN to learn captions. Note
that the captions start with a special word START and end with a special word END.
Unlike the image classification task, where the number of possible labels is limited, in
image captioning the number of captions that can be generated is far larger: rather
than selecting one of, say, 1,000 labels, here the task is to generate captions.
As shown in Figure 17.8, a CNN trained on the ImageNet data is first used.
Such a CNN was discussed in Sect. 3. The last fully connected layer of the CNN is
thrown away, and the result from the CNN’s penultimate layer is fed to the first unit
of the RNN. One-hot encoding of the special word START is fed as input to the first
unit of the RNN. At training time, since the actual image captions are known,
the corresponding word representations are actually fed as input for every recurrent
unit. However, at test time, the true caption is unknown. Hence, at test time, the
output of the k-th unit is fed as input to the (k + 1)-th unit. This is done for better
learning of the order of words in the caption. Cross entropy loss is used to compute
error at each of the output neurons. Microsoft COCO is a popular dataset which
can be used for training such a model for image captioning. The dataset has about
120K images each with five sentences of captions (Lin et al. 2014).
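The test-time procedure, in which each unit's output word is fed as input to the next unit until END is produced, can be sketched as a greedy decoding loop. Everything here is illustrative: `next_word` stands in for one step of a trained RNN, and the lookup table in the usage note is a toy substitute for a real model.

```python
def greedy_decode(next_word, state, max_len=20):
    """Generate a caption word by word, starting from the special word START.

    next_word(word, state) must return (predicted_word, new_state); in a real
    model, `state` would be initialized from the CNN's image representation.
    """
    caption = []
    word = "START"
    for _ in range(max_len):
        word, state = next_word(word, state)
        if word == "END":
            break
        caption.append(word)
    return caption


# Toy stand-in for a trained model: a fixed table of next words.
toy_table = {"START": "a", "a": "red", "red": "pillow", "pillow": "END"}


def toy_step(word, state):
    return toy_table[word], state


toy_caption = greedy_decode(toy_step, state=None)  # ["a", "red", "pillow"]
```

The `max_len` cap guards against a model that never emits END.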
Lastly, let us discuss the application of RNNs to machine translation. Figure
17.9 shows a basic encoder–decoder architecture for the machine translation task
using RNNs. The encoder RNN tries to encode all the information from the source
language into a single hidden vector at the end. Let us call this last hidden vector
of the encoder the “thought” vector. The decoder RNN uses information from
this thought vector to generate words in the target language. The architecture tries
to minimize the cross entropy error for all target words conditioned on the source
words.
There are many variants of this architecture as follows. (1) The encoder and
the decoder could use shared weights or different weights. (2) Hidden state in the
decoder always depends on the hidden state of the previous unit, but it could also
optionally depend on the thought vector and predicted output from the previous unit.
(3) Deep bidirectional RNNs could be used for both encoder and decoder.
Beyond these applications, RNNs have been used for many sequence learning
tasks. However, RNNs suffer from the vanishing gradients problem. In theory, an RNN
can memorize in its hidden state, that is, $h_t$, all the information about past inputs.
But, in practice, a standard RNN cannot capture very long-distance dependencies.
The vanishing/exploding gradient problem arises in backpropagation because the
gradient signal can end up being multiplied a large number of times (as many as the
number of time steps) by the weight matrix associated with the connections between
the neurons of the recurrent hidden layer. If the weights in the transition weight matrix are small
(or, more formally, if the leading eigenvalue of the weight matrix is smaller than
1.0), it can lead to vanishing gradients where the gradient signal gets so small that
learning either becomes very slow or stops working altogether. It can also make
the task of learning long-term dependencies in the data more difficult. Conversely,
if the weights in this matrix are large (or, again, more formally, if the leading
eigenvalue of the weight matrix is larger than 1.0), it can lead to a situation
where the gradient signal is so large that it can cause learning to diverge. This
is often referred to as exploding gradients. A solution to this problem is long
short-term memory (LSTM) networks, which are deep learning architectures similar to RNNs
but with explicit memory cells. The main idea is to keep around memories to
capture long-range dependencies and to allow error messages to flow at different
strengths depending on the inputs. The intuition is that memory cells can keep
information intact, unless inputs make them forget it or overwrite it with new
input. The memory cell can decide to output this information or just store it.
The reader may refer to Hochreiter and Schmidhuber (1997) for further details
about LSTMs.
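A scalar analogue makes the vanishing and exploding behavior concrete. The sketch below replaces the weight matrix with a single number, which plays the role of the leading eigenvalue; the function name and the values 0.9, 1.1, and 50 steps are illustrative assumptions.

```python
def gradient_scale(weight, steps):
    """Factor by which a gradient is scaled after being multiplied by `weight` once per time step."""
    scale = 1.0
    for _ in range(steps):
        scale *= weight
    return scale


vanishing = gradient_scale(0.9, 50)  # well below 0.01
exploding = gradient_scale(1.1, 50)  # above 100
```

After only 50 time steps, a factor slightly below 1.0 has shrunk the signal by more than two orders of magnitude, while a factor slightly above 1.0 has amplified it by more than two orders of magnitude.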
4.5 Summary
5 Further Reading
The advances being made in this field are continuous in nature due to the practice
of sharing information as well as cooperating with researchers working in labs and
in the field. Therefore, the most recent information is available on the Web and
through conferences and workshops. The book8 by Goodfellow et al. (2016), the
All the datasets, code, and other material referred to in this section are available at
www.allaboutanalytics.net.
• Data 17.1: MNIST_train_sample.csv
• Data 17.2: MNIST_test_sample.csv
• Data 17.3: corpus_bucketed_test.rds
• Data 17.4: corpus_bucketed_train.rds
• Code 17.1: MLP_MNIST.R
• Code 17.2: MXNET_MNIST_CNN.R
• Code 17.3: Next_word_RNN.R
Exercises
(a) MLPs have fully connected layers, while CNNs have sparse connectivity.
(b) MLPs are supervised, while CNNs are usually used for unsupervised algorithms.
(c) MLPs have more weights, while CNNs have fewer weights to be learned.
(d) MLP is a general modeling architecture, while CNNs specialize for images.
Ex. 17.9 Given an image of 32 × 32 × 3, a single fully connected neuron will have
how many weights to be learned?
(a) 32 × 32 × 3 + 1
(b) 32
(c) 3
(d) 32 × 32
Ex. 17.10 What is the convolution operation closest to?
(a) Jaccard similarity
(b) Cosine similarity
(c) Dot product
(d) Earth mover’s distance
Ex. 17.11 How many weights are needed if the input layer has 32 × 32 inputs and
the hidden layer has 20 × 20 neurons?
(a) (32 × 32 + 1) × 20 × 20
(b) (20 + 1) × 20
(c) (32 + 1) × 20
(d) (32 + 1) × 32
Ex. 17.12 Consider a volume of size 32 × 32 × 3. If max pooling is applied to it
with pool size of 4 and stride of 4, what is the number of weights in the pooling
layer?
(a) (32 × 32 × 3 + 1) × (4 × 4)
(b) 4×4+1
(c) 0
(d) 32 × 32 × 3
Ex. 17.13 Which among the following is false about the differences between MLPs
and RNNs?
(a) MLPs can be used with fixed-sized sequences, while RNNs can handle variable-sized
sequences.
(b) MLPs have more weights, while RNNs have fewer weights to be learned.
(c) MLP is a general modeling architecture, while RNNs specialize for sequences.
(d) MLPs are supervised, while RNNs are usually used for unsupervised algorithms.
Ex. 17.14 We looked at two neural models for next word prediction: an MLP and an
RNN. Given a vocabulary of 1000 words, a hidden layer of size 100, and a context
of size 6 words, what is the number of weights in the MLP?
(a) (6 × 1000 + 1) × 100 + (100 + 1) × 1000
(b) (1000 + 1) × 100 + (100 + 1) × 100 + (100 + 1) × 1000
(c) (6 × 6 + 1) × 100 + (6 × 6 + 1) × 1000
(d) (1000 + 1) × (100 + 1) × 6
Ex. 17.15 How does backpropagation through time differ from typical backpropagation
in MLPs?
(a) Weights on edges that are supposed to be shared must be averaged and
set to the average after every iteration.
(b) Backpropagation in MLPs uses gradient descent, while backpropagation
through time uses time series modeling.
(c) Backpropagation in MLPs has two iterations for every corresponding iteration
in backpropagation through time.
(d) None of the above.
Answer in Length
Ex. 17.16 Define deep learning bringing out its five important aspects.
Ex. 17.17 Describe the backpropagation algorithm.
Ex. 17.18 RNNs need input at each time step. For image captioning, we looked at a
CNN-RNN architecture.
(a) What is the input to the first hidden layer of the RNN?
(b) Where do the other inputs come from?
(c) How is the length of the caption decided?
(d) Does it generate new captions by itself or only select from those that it had seen
in training data?
(e) If vocab size is V, hidden layer size is h, and average sequence size is “s,” how
many weights are involved in an RNN?
Hands-On Exercises
Ex. 17.19 Create a simple logistic regression-based classifier for the popular iris
dataset in mxnet.
Ex. 17.20 Create an MLP classifier using three hidden layers of sizes 5, 10, 5 for
the MNIST digit recognition task using mxnet. (Hint: Modify the code from Sect.
2.5 appropriately).
Ex. 17.21 Create a CNN classifier using two CONV layers each with twenty 5 × 5
filters with padding as 2 and stride as 1. Also use pooling layers with 2 × 2 filters
with stride as 2. Do this for the MNIST digit recognition task using mxnet. (Hint:
Modify the code from Sect. 3.5 appropriately).
Ex. 17.22 Train an RNN model in mxnet for the next word prediction task. Use a
suitable text corpus from https://en.wikipedia.org/wiki/List_of_text_corpora. (Hint:
Modify the code from Sect. 4.2 appropriately).
References
Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model.
Journal of Machine Learning Research, 3, 1137–1155.
Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and
stochastic optimization. Journal of Machine Learning Research, 12, 2121–2159.
Fukushima, K., & Miyake, S. (1982). Neocognitron: A self-organizing neural network model for
a mechanism of visual pattern recognition. In S. Amari & A. Michael (Eds.), Competition and
cooperation in neural nets (pp. 267–285). Berlin: Springer.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning (Vol. 1). Cambridge: MIT Press.
Hinton, G., Srivastava, N., & Swersky, K. (2012). Lecture 6d—A separate, adaptive learning rate
for each connection. Slides of lecture neural networks for machine learning. Retrieved Mar 6,
2019, from https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8),
1735–1780.
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional
architecture in the cat’s visual cortex. The Journal of Physiology, 160(1), 106–154.
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980.
Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., et al. (2014). Microsoft coco:
Common objects in context. In D. Fleet, T. Pajdla, B. Schiele, & T. Tuytelaars (Eds.), European
conference on computer vision (pp. 740–755). Cham: Springer.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations
in vector space. arXiv:1301.3781.
Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM,
38(11), 39–41.
Minsky, M. L., & Papert, S. (1969). Perceptrons. Cambridge, MA: MIT Press.
Rosenblatt, F. (1962). Principles of neurodynamics. Wuhan: Scientific Research Publishing.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by
error propagation. In D. Rumelhart & J. McClelland (Eds.), Parallel distributed processing.
Explorations in the microstructure of cognition (Vol. 1, pp. 318–362). Cambridge, MA:
Bradford Books.
Zeiler, M. D. (2012). ADADELTA: An adaptive learning rate method. arXiv:1212.5701.
Part III
Applications
Chapter 18
Retail Analytics
Ramandeep S. Randhawa
1 Introduction
1.1 Background
Retail is one of the largest sectors in today’s economy. The global retail sector is
estimated to have revenues of USD 28 trillion in 2019 (with approximately USD
5.5 trillion sales in the USA alone). This sector represents 31% of the world’s
GDP and employs billions of people throughout the globe.1 A large and growing
component of this is e-commerce or e-tail, which includes products and services
ordered via the Internet, with sales estimated to be about USD 840 billion in 2014,
and expected to grow at a rate of about 20% over the subsequent years.2 Analytics
is gaining increasing prominence in this sector with the retail analytics market size
being estimated at over USD 3.52 billion in 2017 and is expected to grow at a CAGR
of over 19.7% over the next few years.3
R. S. Randhawa
Marshall School of Business, University of Southern California, Los Angeles, CA, USA
e-mail: Ramandeep.Randhawa@marshall.usc.edu
Retail acts as the last stop in the supply chain by selling products directly
to customers. Given that retailers are focused on this aspect, collecting data on
customer behavior and preferences and incorporating these into business decisions
are quite natural. And so, retail has indeed been an early adopter of analytics
methodologies and focuses heavily on advancing knowledge in this domain.
Retail analytics is an umbrella term that comprises various elements which assist
with decision-making in the retail business. Typically, this includes data collection
and storage (data warehousing), data analysis that involves some statistical or
predictive modeling, and decision-making. Traditionally, the analysis of data was
limited to monitoring and visualizing some key performance indicators (KPIs)
retrospectively.
One may use the term business intelligence to refer to the gamut of activities
that underlie intelligent business decision-making. However, typically this term is
used to refer to the collection and presentation of historical information in an easy-
to-understand manner, via reports, dashboards, scorecards, etc. The term advanced
analytics is typically reserved for when predictive modeling is applied to data via
statistical methods or machine learning. Our focus in this chapter will be on the
latter, that is, on advanced analytics methodologies that can significantly assist in the
decision-making process in retail.
To understand the role analytics plays in retail, it is useful to break down the
business decisions taken in retail into the following categories: consumer, product,
workforce, and advertising.
1. Consumer: Personalization is a key consumer-level decision that retail firms
make. Personalized pricing by offering discounts via coupons to select customers
is one such decision. This approach uses data collection via loyalty cards to better
understand a customer’s purchase patterns and willingness to pay and uses that to
offer personalized pricing. Such personalization can also be used as a customer
retention strategy. Another example is to offer customers a personalized sales
experience: in e-tail settings, this entails offering customers a unique browsing
experience by modifying the products displayed and suggestions made based on
the customer’s historical information.
2. Product: Retail product decisions can be broken down into single-product and
product-group decisions. Single or individual product decisions are mostly
inventory decisions: how much stock of the product to order, and when to
place the order. At the group level, the decisions are typically related to pricing
and assortment planning. That is, what price to set for each product in the
group and how to place the products on the store-shelves, keeping in mind
the variety of products, the number of each type of product, and location. To
make these decisions, predictive modeling is called for to forecast the product
• Analytics has revealed that a great number of customer visits to online stores
fail to convert at the last minute, when the customer has the item in their
shopping basket but does not go on to confirm the purchase. Theorizing that this
was because customers often cannot find their credit or debit cards to confirm
the details, Swedish e-commerce platform Klarna moved its clients (such as
Vistaprint, Spotify, and 45,000 online stores) onto an invoicing model, where
customers can pay after the product is delivered. Sophisticated fraud prevention
analytics are used to make sure that the system cannot be manipulated by those
with devious intent.
• Trend forecasting algorithms comb social media posts and Web browsing habits
to elicit what products may be causing a buzz, and ad-buying data is analyzed to
see what marketing departments will be pushing. Brands and marketers engage
in “sentiment analysis,” using sophisticated machine learning-based algorithms
to determine the context when a product is discussed. This data can be used to
accurately predict what the top selling products in a category are likely to be.
• Russian retailers have found that the demand for books increases exponentially
as the weather gets colder. So retailers such as Ozon.ru increase the number of
book recommendations which appear in their customers’ feeds as the temperature
drops in their local areas.4
• The US department store giant, Macy’s, recently realized that attracting the right
type of customers to its brick-and-mortar stores was essential. After its analytics
revealed a dearth of shoppers from the vital millennial demographic, it
opened its “One Below” basement5 at its flagship New York store, offering “selfie
walls” and while-you-wait customized 3D-printed smartphone cases. The idea is
to attract young customers to the store who will hopefully go on to have enduring
lifetime value to the business.
• Amazon has proposed using predictive shipping analytics6 to ship products to
customers before they even click “add to cart.” According to a recent trend
report by DHL, over the next 5 years, this so-called psychic supply chain will
have far reaching effects in nearly all industries, from automotive to consumer
goods. It uses big data and advanced predictive algorithms to enhance planning
and decision-making.
There are various complications that arise in retail scenarios that need to be
overcome for the successful use of retail analytics. These complications can be
classified into (a) those that affect predictive modeling and (b) those that affect
decision-making.
4 https://www.forbes.com/sites/bernardmarr/2015/11/10/big-data-a-game-changer-in-the-retail-
Some of the most common issues that affect predictive modeling are demand
censoring and inventory inaccuracies (DeHoratius and Raman 2008). Typically,
retail firms only have access to sales information, not demand information, and
therefore need to account for the fact that when inventory runs out, actual demand is
not observed. Ignoring this censoring of information can result in underestimating
demand. There is also a nontrivial issue of inventory record inaccuracies that
exists in retail stores: the actual number of products in an inventory differs
from the number expected as per the firm’s IT systems (DeHoratius 2011). Such
inaccuracy may be caused by theft, software glitches, etc. This inaccuracy needs
to be incorporated into demand estimation because it confounds whether demand
is low or appears low due to product shortage. Inaccuracy also affects decision-making
by impacting the timing of order placement. Some of the other factors
that affect decision-making are constraints on changing prices, physical constraints
on assortments, supplier lead times, supplier contracts, and constraints on
workforce scheduling. In particular, retail firms deal with many constraints on changing
prices. Some of these are manpower constraints: changing assortments requires a
reconfiguration of store shelves, and changing prices may involve physically tagging
products with the new price (a labor-intensive process). To make it easier to change
prices, many stores such as Kohl’s are turning to electronic shelf labels as a means
of making the price-changing process efficient.7 There are additional nonphysical
constraints that a firm may need to deal with. For instance, in fashion, prices are
typically only marked down, and once a price is lowered, it is not increased. There
are also limits to how often prices may be changed, for instance, twice a week. There
are many supplier-based constraints that need to be considered as well, for instance,
lead times on any new orders placed and any terms agreed to in supplier contracts.
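The effect of demand censoring described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the uniform demand distribution, the fixed daily stock of 6 units, and the function name `simulate_censored_sales` are hypothetical choices made for illustration and do not come from the chapter or from the cited papers.

```python
import random


def simulate_censored_sales(num_days, stock_per_day, max_demand, seed=0):
    """Simulate daily demand and the sales a retailer would actually record.

    Demand is drawn uniformly from 0..max_demand (an illustrative assumption).
    Recorded sales are capped by the stock on hand, so any demand above the
    stock level is never observed (it is censored).
    """
    rng = random.Random(seed)
    demands = [rng.randint(0, max_demand) for _ in range(num_days)]
    sales = [min(demand, stock_per_day) for demand in demands]
    return demands, sales


demands, sales = simulate_censored_sales(num_days=1000, stock_per_day=6, max_demand=10)
mean_demand = sum(demands) / len(demands)
mean_sales = sum(sales) / len(sales)
# Days on which sales reached the stock level, so true demand may have been higher.
possibly_censored_days = sum(1 for demand in demands if demand >= 6)

print("mean demand:", mean_demand)
print("mean recorded sales:", mean_sales)
print("days with possibly censored demand:", possibly_censored_days)
```

Averaging the recorded sales as if they were demand gives an estimate that is too low, which is why sales data typically needs to be "uncensored" before it is used for inventory or pricing decisions.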
In this chapter, our focus will be on the use of retail analytics for product-based
decision-making. We continue in Sect. 2 by exploring the various means of data
collection that are in use by retailers and those that are gaining prominence in
recent times. In Sect. 3, we will discuss some key methodologies that are used for
such decision-making. In particular, we will discuss various statistical and machine
learning methodologies for demand estimation, how these may be used for
pricing, and techniques for modeling consumer choice for assortment optimization.
In Sect. 4, we will turn to the many business challenges and opportunities in retail,
focusing on both e-tail settings and the growth in retail analytics startups.
2 Data Collection
Retail data can be both structured (spreadsheets with rows and
columns) and unstructured (images, videos, and other location-based data).
Traditional retail data has been structured and derived mostly from point-of-sale (POS)
devices and data supplied by third parties. POS data typically captures sales
information, number of items sold, prices, and timestamps of transactions. Combined
with inventory record keeping, this data provides a rich trove of information about
products sold and, in particular, product baskets (collection of items in the cart)
sold. Retailers tend to use loyalty programs to attach customer information to these
records, so that customer-level sales data can be analyzed. Third-party data
typically consists of competitor information, such as prices and product assortments.
It also consists of some broad information about the firm’s customers, such as their
demographics and location.
7 https://www.wsj.com/articles/now-prices-can-change-from-minute-to-minute-1450057990
The recent trend is to capture more and more unstructured data. There now exists
technology that can help retailers collect information not only about direct customer
sales but also about product comparisons, that is, what products were compared by
the customer in making decisions. Video cameras coupled with image detection
technology can help collect data on customer routes through a store. This video data
can also be used to collect employee data (e.g., what tasks are employees doing,
how are customers being engaged, and how much time does a customer needing
assistance have to wait for the assistance to be provided). Recently, many firms have
also employed eye-tracking technology in controlled environments to collect data
on how the store appears from a customer’s perspective; a major downside of this
technology is that it requires the customer to wear specialized eyeglasses.
With the advent of Internet of Things (IoT), the potential to collect in-store
data has increased. Walmart began using radio-frequency identification (RFID)
technology about a decade ago. Initially, the main goal of using this technology was
to track inventory in the supply chain. However, increasingly, retailers are finding it
beneficial to track in-store inventory. RFID tags are far easier to read than barcodes
because they do not require direct line-of-sight scanning. This ease of tracking
allows the tags to be used to collect data on the movement of products through
the store. For instance, in fashion retail, the retailer can track the items that make
their way to the fitting rooms; the combination of items tried can also be tracked,
and finally it can easily be detected whether the items were chosen or not. All of
this provides a rich set of data to feed into the system for analytics.
Near-field communication (NFC) chips are also being used by retailers to
simplify the shopping experience. Most of the current NFC usage is targeted at
payments. However, several retailers are also using NFC scanning as a means to
provide customers with additional information about the product. This helps collect
information about the products a customer is considering. Because NFC readers
are not present in all smartphones, some retailers also use Quick Response (QR)
codes for their products that customers can typically scan using an app for similar
functionality.
Another new method of collecting customer data is via Bluetooth beacons.
Beacons use Bluetooth Low Energy, a technology built into recent smartphones. The
beacons are placed throughout the store and can detect the Bluetooth signal from a
customer’s smartphone that is in the vicinity. These devices can send information
to the smartphone via specialized apps. In this sense, the beacons provide a lot of
flexibility for the retailer to engage with and interact with the customer (assuming
that the customer has the specialized app). This can be used to push notifications
about products, coupons, etc. in real time to the customer. Furthermore, because
the customer interacts with the app to utilize this information, the effect of sending
the information to the customer can also be tracked immediately. This technology
seems to have a lot of potential for personalizing the retail experience for customers,
as well as for collecting information from the customer. As per Kline,8 nearly a
quarter of US retailers have implemented such beacons. Macy’s and Rite Aid are
among the prominent retailers that completed a rollout of beacons into most of their
stores in 2015.
Some of the most exciting potentials for data collection can be seen in the
recently launched Amazon Go retail store. The store allows customers to simply
grab items and go, without needing to formally check out at a counter. The customer
only needs to scan an app while entering the store. The use of a large number
of video cameras coupled with deep learning-based algorithms makes this
feasible. Deep learning is an area of machine learning that has gained considerable
attention recently because of its state-of-the-art ability to decipher unstructured
data, especially for image recognition; see Chap. 17 on deep learning. In the
retail context, the video cameras capture customers and their actions, and the deep
learning algorithms decipher what the actions mean: which items customers are grabbing
from the shelves and whether they are putting back any items from their bag. Such an
approach would revolutionize customers’ retail experience. At the same
time, it provides the firm with large amounts of data beyond customer routes. It
allows the firm to pick up on moments of indecision and on products that were compared,
especially when one product is replaced by a similar product.
3 Methodologies
Typical forecasting methods consider the univariate time series of sales data and
use time-series-based methods such as exponential smoothing and ARIMA models;
see Chap. 12 on forecasting analytics. These methods typically focus on forecasting
sales and may require uncensoring to be used for decision-making. Recently, there
have been advances that utilize statistical and machine learning approaches to deal
with greater amounts of data.
8 https://www.huffingtonpost.com/kenny-kline/how-bluetooth-beacons-wil_b_8982720.html
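As a brief illustration of the simplest of the time-series methods mentioned above, the following sketch implements simple exponential smoothing on a univariate sales series. The function name, the initialization with the first observation, and the weekly sales numbers are assumptions made for illustration; Chap. 12 should be consulted for the full treatment.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed level after each observation.

    The level is initialized with the first observation (one common choice)
    and updated as: level = alpha * observation + (1 - alpha) * previous level.
    The last level serves as the one-step-ahead forecast.
    """
    if not series:
        raise ValueError("series must contain at least one observation")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    level = series[0]
    levels = [level]
    for observation in series[1:]:
        level = alpha * observation + (1 - alpha) * level
        levels.append(level)
    return levels


weekly_sales = [10, 20, 30]  # hypothetical weekly unit sales
levels = simple_exponential_smoothing(weekly_sales, alpha=0.5)
next_week_forecast = levels[-1]
print(levels, next_week_forecast)
```

A larger `alpha` makes the forecast react faster to recent sales, while a smaller value smooths out noise. Note that this forecasts sales, not demand, so the censoring issue discussed earlier still applies.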
As the number of predictors grows, estimating demand becomes statistically
complicated because the potential for overfitting increases. Typically, one deals with
such a situation by introducing some “regularization.” L1-penalized regularization
is a common, extremely successful methodology developed to deal with high
dimensionality, as it performs variable selection. L1-penalized regression, called
LASSO (least absolute shrinkage and selection operator), was introduced in
Tibshirani (1996). In the context of the typical least squares linear regression, it can
be understood as follows: suppose the goal is to predict a response variable $y \in \mathbb{R}^n$
using covariates $X \in \mathbb{R}^{n \times p}$; then the LASSO objective is to solve
$$\min_{\beta \in \mathbb{R}^p} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_1,$$
where $\|\cdot\|_x$ represents the $L_x$ norm of the expression inside the bars, $\beta$ is the
vector of regression coefficients, and $\lambda \ge 0$ is the regularization weight. Such a
formulation makes the typical least squares estimator biased because of the regularization
term; however, by selecting the regularizer appropriately, the variance can be
reduced, so that on the whole, the estimator performs better for prediction. The
use of the L1-norm facilitates sparsity and leads to “better” variable selection. This
is especially useful in the case of high-dimensional settings in which the number of
parameters p may even exceed the number of data points n. Prior to the introduction of LASSO,
L2-based regularization, also called ridge regression, was a common way to alleviate
overfitting. More recently, the elastic net has been proposed, which uses both the L1-
and L2-norms as a regularizer. We direct the reader to the open source book9 by
James et al. (2013) for a detailed discussion of these methodologies. A description
is also contained in Chap. 7 on regression analysis.
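To make the variable-selection effect of the L1 penalty concrete, the following is a minimal sketch of LASSO solved by cyclic coordinate descent with soft-thresholding, a standard way of solving this problem. It assumes the objective is the residual sum of squares plus a penalty weight `lam` times the L1 norm of the coefficients; the function names and the tiny dataset are hypothetical, and a production analysis would normally rely on an established statistical package instead.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, num_sweeps=100):
    """Minimize ||y - X b||_2^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(num_sweeps):
        for j in range(p):
            rho = 0.0  # correlation of column j with the partial residual
            z = 0.0    # squared norm of column j
            for i in range(n):
                fitted_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fitted_without_j)
                z += X[i][j] ** 2
            # Exact minimizer in coordinate j for this scaling of the objective.
            beta[j] = soft_threshold(rho, lam / 2.0) / z if z > 0 else 0.0
    return beta


# Hypothetical data: y depends strongly on the first covariate, weakly on the second.
X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y = [2.1, 1.9, -1.9, -2.1]

beta_least_squares = lasso_coordinate_descent(X, y, lam=0.0)
beta_lasso = lasso_coordinate_descent(X, y, lam=2.0)
print(beta_least_squares, beta_lasso)
```

With the penalty switched off the fit keeps both coefficients, whereas with a positive penalty the weak coefficient is set exactly to zero and the strong one is shrunk, which is the bias-for-sparsity trade-off described in the text.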
Recently, Ma et al. (2016) used a LASSO-based approach (along with additional
feature selection) to estimate SKU-based sales in a high-dimensional setting, in
which the covariates included cross-category promotional information, involving
an extremely large number of parameters. They found that including cross-category
information improves forecast accuracy by 12.6%.
Ferreira et al. (2015) recently used an alternate machine learning-based approach
for forecasting demand in an online fashion retail setting using regression trees.
Regression trees are a nonparametric method that involves prediction in a
hierarchical manner by creating a number of “splits” in the data. For instance, Fig. 18.1
(reproduced from Ferreira et al. 2015) displays a regression tree with two such splits.
Demand is then predicted by answering the questions pertaining to each of the
splits: first, whether the price of the product is less than $100. If not, then the
demand is predicted as 30. Otherwise, the following question is asked: whether
the relative price of competing styles is less than 0.8 (i.e., is the price of this style
less than 80% of the average price of competing styles); if the answer is no, then
the demand is predicted as 40. Otherwise, it is predicted as 50. The paper further
uses the variance reduction method of bootstrap aggregation or bagging, in which
an ensemble of trees is “grown,” with each tree trained on a random sampling of
the dataset, that is, if the data has N records, then each tree is trained on m < N
records randomly sampled from the dataset with replacement. This reduces the
interpretability of the model but improves the performance. We refer the reader to
the previously cited book on statistical learning and Chap. 16, Machine Learning
(Supervised), for details on this methodology. A closely related method is that of
random forests, which is similar to bagged trees, except that each split in a tree may
only use a random subset of the predictors; this reduces the correlation between the trees
and lowers the variance further. Random forests are extremely good out-of-the-box
predictors; however, because the prediction is an average over many such randomized
trees, their overall interpretability is quite limited.
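The two-split tree described above, together with the mechanics of bagging, can be sketched as follows. The thresholds and predicted demands (30, 40, 50) are taken from the description of Fig. 18.1 in the text; the helper names `bootstrap_sample` and `bagged_prediction`, and the second "tree" used in the usage example, are hypothetical and only illustrate the idea of averaging an ensemble.

```python
import random


def predict_demand(price, relative_price):
    """Drop an item down the two-split regression tree described for Fig. 18.1."""
    if not price < 100:
        return 30
    if not relative_price < 0.8:
        return 40
    return 50


def bootstrap_sample(records, m, rng):
    """Draw m records at random with replacement, as used to train one bagged tree."""
    return [rng.choice(records) for _ in range(m)]


def bagged_prediction(trees, price, relative_price):
    """Average the predictions of an ensemble of trees."""
    predictions = [tree(price, relative_price) for tree in trees]
    return sum(predictions) / len(predictions)


rng = random.Random(1)
sample = bootstrap_sample(list(range(10)), m=6, rng=rng)

# A second, deliberately crude "tree" that always predicts 40 (illustration only).
ensemble = [predict_demand, lambda price, relative_price: 40]
print(predict_demand(80, 0.7), bagged_prediction(ensemble, 80, 0.7))
```

In practice each tree in the ensemble would be fitted to its own bootstrap sample; fitting is omitted here to keep the sketch short.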
Recently, (artificial) neural networks (NN) have also been employed for demand
forecasting (Au et al. 2008). A neural network is a large group of nodes that are
arranged in a layered manner, and the arcs in the network are associated with
weights (Sect. 2.1 in Chap. 17 on deep learning). The input to the network is
transformed sequentially layer by layer—the input to a layer is used to compute
the output of each node in the layer based on the inputs to the node, and this
serves as an input to the next layer. In this manner the neural network produces
an output for any given input. The weights of the neural network are “trained”
typically by gradient descent methods to minimize a loss function that relates to
the error between the network’s output and the observed target. Neural networks can model highly
nonlinear dependencies and as such work extremely well in detecting patterns
and trends in complex scenarios. Neural networks have been around for a long
time; see Chaps. 16 and 17. The well-known logistic regression function can be
represented by a single-layer neural network. However, more interesting networks
are obtained by creating a large number of layers; hundred-layer neural nets are not
uncommon. Deep neural networks are notoriously difficult to train, as they require
a lot of data and computational power. With recent advances in data collection and
computing, it has become possible to harness the potential of these networks. Initial
research demonstrates that NNs can be used effectively for predicting
fashion demand, which illustrates the potential of such methods for
demand forecasting in the future. The performance of NNs (and their sophistication)
increases as the amount of training data increases. With the spurt in data collection,
especially that of unstructured data, NNs hold exciting potential for demand
forecasting.
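The remark that logistic regression is a single-layer neural network, and the idea of training weights by gradient descent, can be illustrated with one sigmoid node. This is a minimal sketch: the data (a discount fraction as the single input and a sold-out indicator as the target), the learning rate, and the number of steps are hypothetical choices, and real demand-forecasting networks are far larger.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Output of a single sigmoid node: a weighted sum of inputs, then sigmoid."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def log_loss(weights, bias, X, y):
    """Average cross-entropy between the node's outputs and the observed targets."""
    eps = 1e-12
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict(weights, bias, xi)
        total -= yi * math.log(p + eps) + (1 - yi) * math.log(1 - p + eps)
    return total / len(X)


def train(X, y, learning_rate=0.5, num_steps=500):
    """Full-batch gradient descent on the log loss, starting from zero weights."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(num_steps):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            error = predict(weights, bias, xi) - yi
            for j, xij in enumerate(xi):
                grad_w[j] += error * xij
            grad_b += error
        weights = [w - learning_rate * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(X)
    return weights, bias


# Hypothetical data: input is the discount fraction, target is 1 if the item sold out.
X = [[0.0], [0.1], [0.2], [0.6], [0.7], [0.8]]
y = [0, 0, 0, 1, 1, 1]

initial_loss = log_loss([0.0], 0.0, X, y)
weights, bias = train(X, y)
final_loss = log_loss(weights, bias, X, y)
print(initial_loss, final_loss)
```

Stacking many such nodes in layers, with the outputs of one layer feeding the next, gives the deeper networks discussed above; the training principle stays the same.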
Table 18.1 Demand for refurbished hand tools (sample from handtools_reseller.csv)
Product | Department | Category | MSRP | Price | Average price competing | Number of competing styles | Total sales events | Past 12 months sales events | Demand
9728 | 3 | 3 | 417 | 261 | 215.4 | 4 | 4 | 3 | 9
9131 | 3 | 2 | 290 | 124 | 133 | 7 | 5 | 1 | 18
2102 | 3 | 1 | 122 | 21 | 50.2 | 3 | 1 | 1 | 40
1879 | 1 | 2 | 258 | 84 | 135.6 | 3 | 6 | 1 | 38
1515 | 3 | 1 | 133 | 128 | 98.6 | 8 | 3 | 2 | 6
predict the demand except the price of the product. We can then create an Excel
model shown in Table 18.2.
In this simple model, one can set a price for the product and drop it down the
tree to predict the demand using a series of “IF” statements. The demand for price
< 70 and for price > 70 is computed separately to minimize the number of “IF” blocks;
the predicted demand is the sum of the two. Using Solver, an add-in to Microsoft Excel, one
can maximize the revenue by varying the price. The “optimal” price is $158 and the
optimal revenue is $3634. The reader is asked to make several enhancements to the
model shown above in the chapter exercises.
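The same "predict demand from the tree, then search for the revenue-maximizing price" logic can be sketched outside Excel. The tree below is purely hypothetical (its split points and demand values are invented for illustration and are not those of Table 18.2), and a simple grid search over integer prices stands in for Solver.

```python
def predicted_demand(price):
    """Hypothetical price-only regression tree (not the one in Table 18.2)."""
    if price < 50:
        return 60
    if price < 120:
        return 30
    return 8


def revenue(price):
    return price * predicted_demand(price)


def best_price(candidate_prices):
    """Grid search: return the candidate price with the highest predicted revenue."""
    return max(candidate_prices, key=revenue)


candidate_prices = range(1, 201)  # integer prices from 1 to 200
optimal_price = best_price(candidate_prices)
optimal_revenue = revenue(optimal_price)
print(optimal_price, optimal_revenue)
```

Because a tree predicts a constant demand within each price band, revenue rises linearly inside a band and drops at each split, so the best price sits just below one of the split points. This also explains why a gradient-based solver can struggle on such a model while a grid search does not.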
The work by Ferreira et al. (2015) takes the typical approach of first estimating
demand and then optimizing decisions. From the perspective of decision-making
based on available data, the sequential nature of this approach is unnecessary, and
one can conceive of directly optimizing the decision of interest based on all available
data. Indeed, Liyanage and Shanthikumar (2005) prove that formulating the decision
as a function of the data and then directly optimizing it can lead to improved
performance compared with the sequential approach.
The work by Ban and Rudin (2018) studies such an approach in the context of
optimal inventory choice for a single product (at fixed prices). This is the setting of
the classical newsvendor problem in which the goal is to select the stock level to
minimize shortfall and holding costs (this dates back to Edgeworth 1888 and is a
building block for more sophisticated models of stochastic inventory optimization;
see Porteus 2002, for more background). The work by Ban and Rudin (2018)
10 An alternative has been proposed in Alptekinoglu and Semple (2016) that considers
exponentially distributed noise, and the model is called the exponomial choice (EC) model.
For simplicity, assume that the value of no purchase is normalized to zero. The
net value to a customer from a purchase of a slab is equal to (Value – Price). Also,
assume that each customer will purchase exactly one slab. The cost of a slab is
80% of the price. There is a per unit holding cost on account of interest, storage,
breakage, etc. There is a fixed cost of 175 per slab stocked. The average number of
customers per period in all examples is 5.
Consider the case when the product is sold as MTO. We use the MNL choice
model in this example. In the simplest version of the model, the probability that a
customer chooses product i is given by
$$\pi_i = \frac{e^{(v_i - p_i)/s}}{1 + \sum_j e^{(v_j - p_j)/s}},$$
where $p_i$ stands for the price and $v_i$ for the value attached to product i, s is a scale
factor, and the sum is taken over all products which are compared with product i.
Chapter 23, Pricing Analytics, has further details about this model of consumer
choice. Let D be the average demand. Assume the scale factor is 100. The expected
profit is then given by
$$D \times \sum_j \left(0.2\,\pi_j p_j - \text{Hold cost}\right) - \text{number of products stocked} \times \text{stocking cost}.$$
The expected profitability from offering all four products in the assortment is
computed as shown (the profit is computed after subtracting the stocking cost) in
Table 18.4.
The reader is asked to verify these calculations, as well as evaluate whether this
is the optimal MTO assortment to offer in the exercises.
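The MNL probabilities and the expected-profit expression above can be computed directly, which may help when verifying the spreadsheet. The sketch below follows the two formulas as stated; the slab values and prices are hypothetical (they are not the entries of the chapter's tables), and the hold cost is set to zero purely for illustration.

```python
import math


def mnl_probabilities(values, prices, scale):
    """Choice probability of each product; the leading 1 is the no-purchase option."""
    weights = [math.exp((v - p) / scale) for v, p in zip(values, prices)]
    denominator = 1.0 + sum(weights)
    return [w / denominator for w in weights]


def expected_mto_profit(values, prices, scale, mean_demand, margin,
                        hold_cost, stocking_cost):
    """D * sum_j (margin * pi_j * p_j - hold cost) - (number stocked) * stocking cost."""
    probabilities = mnl_probabilities(values, prices, scale)
    per_customer = sum(margin * pi * p - hold_cost
                       for pi, p in zip(probabilities, prices))
    return mean_demand * per_customer - len(prices) * stocking_cost


# Hypothetical values and prices for four slabs (illustration only).
values = [2100, 2000, 1900, 1800]
prices = [2000, 1900, 1800, 1700]

probabilities = mnl_probabilities(values, prices, scale=100)
profit = expected_mto_profit(values, prices, scale=100, mean_demand=5,
                             margin=0.2, hold_cost=0.0, stocking_cost=175)
print(probabilities, profit)
```

Evaluating `expected_mto_profit` on each subset of the products is one straightforward way to check whether offering all four is in fact the best MTO assortment.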
demand framework. In the paper, a nesting feature is observed, so that the products
can be ranked by their value and the optimal assortment only requires considering
nested subsets of the ranked products. Empirically, such optimization has been seen
to improve a firm’s financials significantly, for instance, Kök and Fisher (2007)
estimate a 50% increase in profit to a retailer, Fisher and Vaidyanathan (2014) report
a sales lift of 3.6% and 5.8% for two different product categories, and Farias et al.
(2013) and Jagabathula and Rusmevichientong (2016) estimate about 10% increases
in revenue. We direct the reader to Rusmevichientong et al. (2006), Smith et al.
(2009), Honhon et al. (2010), and Honhon et al. (2012) for optimized
decision-making under variants of rank-based choice models. More recently, there has been
a growing interest in optimizing assortments using nonparametric methods. For
instance, Bertsimas and Mišic (2015) use a nonparametric choice model related to
that of Farias et al. (2013) but forgo the robust approach: they directly estimate the choice
model by efficiently solving a large-scale linear optimization problem using column
generation and then solve the assortment optimization piece based on the solution
to a practically tractable mixed integer optimization problem. The previously
referenced Jagabathula and Rusmevichientong (2016) solves a joint assortment and
pricing problem (which is known to be NP-hard) using an approximation algorithm
with a provable performance guarantee based on a dynamic programming (DP) formulation.
Interestingly, as we move to dynamic assortments, which become more relevant
in the context of e-tail, Bernstein et al. (2015) and Golrezaei et al. (2014) solve this
problem in a limited inventory setting. The latter, in fact, consider a very general
consumer choice model and propose an algorithm that does not require knowledge
of customer arrival information.
Example: Assortment over Internet
Consider the previous example and the case when the assortment is offered over
the Internet. In this case the assortment is changed every selling season. We model
this as an MTS problem with no substitution. Once the product runs out, the customer
who asks for it gets a message that it is out of stock, and s/he walks away with no
purchase. We are also told that disposing of unsold product at the end of a period
(or selling season) recovers 85% of the cost of the product. This salvage value is estimated
as the cost of the item less the cost of shipping to a discount outlet and the holding
cost for selling the product at the end of the season. The demand is assumed to be
Poisson distributed with mean equal to 5. Though we can calculate the expected
profit analytically, to illustrate an alternative method, we will use simulation. For
doing this, we generate demand 1000 times. In each simulation, we draw the number
of customers according to the Poisson distribution and then determine which slab is
preferred by each of the customers. Given a stocking quantity, it is straightforward
but a little tedious to calculate the expected profit. The “Assortment_Examples.xlsx”
sheet contains the simulation. The sample summary results when stocking one slab
of each type are shown below (data and average profit above, and the first few rows of the
simulation below, in Tables 18.5 and 18.6).
For example, in the first simulation, three customers arrived. All wanted Roman
Blue. The actual sale was for one slab of Roman Blue, and one slab each of Niagara,
Forever Black, and Violet Black had to be salvaged at a loss. The reader is asked to
verify the simulation setup in the exercise and then make optimal choices.
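The simulation described in this example can also be sketched in code. The version below is a simplified illustration and not a reproduction of the “Assortment_Examples.xlsx” sheet: the prices and choice probabilities are hypothetical, the fixed stocking cost is left out, and the function names are invented. It keeps the features stated in the text: Poisson arrivals with mean 5, no substitution, a unit cost of 80% of price, and salvage at 85% of cost.

```python
import math
import random


def poisson_draw(rng, mean):
    """Draw a Poisson variate by Knuth's multiplication method (fine for small means)."""
    limit = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def choose_product(rng, probabilities):
    """Return the index of the preferred product, or None for no purchase."""
    u = rng.random()
    cumulative = 0.0
    for index, probability in enumerate(probabilities):
        cumulative += probability
        if u < cumulative:
            return index
    return None


def simulate_season(rng, stock, prices, probabilities, mean_customers,
                    cost_fraction=0.8, salvage_fraction=0.85):
    """Simulate one selling season; return (profit, units sold per product)."""
    remaining = list(stock)
    revenue = 0.0
    for _ in range(poisson_draw(rng, mean_customers)):
        choice = choose_product(rng, probabilities)
        if choice is not None and remaining[choice] > 0:
            remaining[choice] -= 1
            revenue += prices[choice]
        # Otherwise the customer leaves without buying (no substitution).
    purchase_cost = sum(cost_fraction * p * q for p, q in zip(prices, stock))
    salvage = sum(salvage_fraction * cost_fraction * p * r
                  for p, r in zip(prices, remaining))
    sold = [q - r for q, r in zip(stock, remaining)]
    return revenue - purchase_cost + salvage, sold


# Hypothetical inputs: four slabs, one unit of each stocked.
prices = [2000, 1900, 1800, 1700]
probabilities = [0.2, 0.2, 0.2, 0.2]  # the remaining 0.2 is the no-purchase option
stock = [1, 1, 1, 1]

rng = random.Random(42)
results = [simulate_season(rng, stock, prices, probabilities, mean_customers=5)
           for _ in range(1000)]
average_profit = sum(profit for profit, _ in results) / len(results)
print("average profit over 1000 seasons:", average_profit)
```

Re-running the simulation for different stocking vectors, and comparing the average profits, is the brute-force counterpart of the "optimal choices" the reader is asked to make.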
AbD sells to walk-in customers or sells MTS with substitution. We can already
see how this problem becomes vastly more complicated when there is substitution!
In addition to creating arrivals, we have to keep track of the sequence in which the
customers arrive and then see if there is stock. If there is stockout, we need to model
whether there is substitution from the remaining products or the customer leaves the
store empty-handed. Exercise 18.4 gives a simple example to illustrate these ideas.
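The bookkeeping needed for substitution can be sketched as follows. This is only one possible way to model it, under a loudly stated assumption: each customer arrives with a ranked list of acceptable slabs and buys the first one that is still in stock, leaving empty-handed if none is available. The ranked lists and the function name are hypothetical and are not taken from Exercise 18.4.

```python
def serve_in_sequence(preference_lists, stock):
    """Serve customers in arrival order, allowing substitution down each ranked list.

    preference_lists: one ranked list of acceptable products per customer.
    stock: dict mapping product name to units on hand.
    Returns (what each customer bought, or None; remaining stock).
    """
    remaining = dict(stock)
    purchases = []
    for preferences in preference_lists:
        bought = None
        for product in preferences:
            if remaining.get(product, 0) > 0:
                remaining[product] -= 1
                bought = product
                break
        purchases.append(bought)
    return purchases, remaining


stock = {"Roman Blue": 1, "Niagara": 1, "Forever Black": 1, "Violet Black": 1}
arrivals = [
    ["Roman Blue", "Niagara"],    # gets the first choice
    ["Roman Blue"],               # first choice sold out, no substitute: leaves
    ["Roman Blue", "Niagara"],    # substitutes to Niagara
    ["Niagara", "Violet Black"],  # substitutes to Violet Black
]
purchases, remaining = serve_in_sequence(arrivals, stock)
print(purchases, remaining)
```

Swapping the order of the arrivals changes who is served and what is left over, which is why the arrival sequence has to be tracked once substitution is allowed.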
In many settings, customer preferences are not known, and one may need to learn
these while simultaneously optimizing the assortment. Caro and Gallien (2007)
and Rusmevichientong et al. (2010) were among the first to study this problem.
Caro and Gallien (2007) undertook a Bayesian learning approach in which the
underlying primitives have a certain distribution, and the Bayesian approach is
used to learn these parameters. On the other hand, Rusmevichientong et al. (2010)
used an adaptive learning approach in which such priors are not assumed, and an
explore–exploit paradigm is used: in this approach, the decision-maker balances
“exploration” to collect relevant data points, with “exploitation” to generate revenue
based on the data observed thus far. More recently, the notion of personalized
assortments is becoming prevalent, especially in e-tail settings, wherein a customer
could be shown an assortment of items based on customer-specific information, such
The tremendous success of e-commerce has led many retailers to augment their
brick-and-mortar stores with an online presence, leading to the advent of multichannel
retail. In this approach, the retailer has access to multiple channels to engage with
and sell to customers. Typically, each of these channels is managed separately. This
multichannel approach has been overshadowed by what is commonly referred to
as omni-channel retail, in which the firm integrates all the channels to provide a
seamless experience to customers. A good example of such an approach is the “buy
online, pick up in store” (BOPS) approach that has become quite commonplace.
This seamless approach inarguably improves the customer experience and overall
sales; however, it can lead to unintended outcomes. For instance, Bell et al. (2014)
show that such a strategy can reduce online sales and instead lead to an increase
in store sales and traffic. In that context, the authors find that additional sales are
generated by cross-selling in which the customers who use the BOPS functionality
buy additional products in the stores, and further there is a channel effect as well in
which online customers may switch to becoming brick-and-mortar customers.
The benefits of the omni-channel approach are even spurring online retailers to
foray into physical stores. For instance, recent studies show how, by introducing an
offline channel via display showrooms, WarbyParker.com was able to increase both
overall and the online channel’s demand (see Bell et al. 2014).
Thus, there is significant value for a retailer in moving to omni-channel. However,
while doing so, it is crucial for the firm’s retail analytics to evolve into
omni-channel analytics for correct estimation and optimal decision-making.
There has been a spurt in retail analytics startups recently. A majority of these
companies can be classified as those using technology to aid in data collection and
those that are using sophisticated means of analyzing the data itself.
In terms of data collection, there are many startups that cater to the range
of retailers both small and large. One illustrative example is Euclid
Analytics, which uses in-store Wi-Fi to collect information on customers via their
smartphones. The company is able to collect in-store behavior of customers and
also data on repeat visits. Collection of Wi-Fi-based information also allows the
retailer to track what customers do online while in store. This lets the retailer better
understand their customer base, including what products they are researching on
their smartphones (showrooming). Another recent startup is Dor, which is targeted
at smaller retailers and sells a device that counts the foot traffic in the store. It
then provides a dashboard view of the traffic and provides insights to optimize
staffing. Startups like Brickstream and Footmarks produce sensors that monitor foot
traffic, so associates can react to shoppers in real time. Swirl works with brands
like Urban Outfitters, Lord & Taylor, and Timberland to monitor shopper behavior
with beacons. Point Inside provides beacons to Target. Startups like Estimote,
Shelfbucks, and Bfonics leverage beacons for in-store proximity marketing, such
as sending mobile notifications to shoppers about the products they are currently
browsing.
Video analytics is another exciting area that startups are getting into. For
instance, Brickstream uses video analytics to capture in-depth behavior intelligence.
RetailNext is one of the larger startups in this domain and covers a wide gamut
of data collection, starting with Wi-Fi-based data collection to more sophisticated
methods using video camera feeds. RetailNext also delves into an array of analytics
solutions including staffing schedules and some A/B testing. Video analytics helps
in heat mapping customer paths and can also be used for loss prevention.
Oak Labs builds interactive touchscreen mirrors that aim to revolutionize the
fitting room. The mirror allows customers to explore product recommendations and
digitally seek assistance from store associates. The use of technology here enhances
the customer experience and collects valuable data at the same time.
PERCH produces interactive digital displays used by brands like Kate Spade and
Cole Haan. Blippar focuses on integrating digital and physical domains by using an
app that unlocks content upon scanning products. Aila manages interactive in-aisle
tablets for stores that provide customers with detailed product information upon
scanning a product barcode. These startups focus on improving customer experience
by providing more information while also collecting information on how customers
choose products that can help the retailer make smarter decisions.
Turning to analytics, some startups such as Sku IQ and Orchestro focus on
providing a unified view to multichannel firms. Sku IQ provides a unified view
of inventory, sales, and customers from all channels. Orchestro focuses on demand
estimation by combining data from POS systems, internal ERP, and third-party data
618 R. S. Randhawa
This chapter has set out various models and approaches used in the retail industry.
The survey is not exhaustive, simply because of changes that take place every day
in design, manufacture, and delivery of products and services. Retailing and supply
chains are joined together, and any progress in one will lead to changes in the
other. The changes do not occur synchronously—due to constant experimentation,
opening of new markets, new channels, and proliferation of supply sources—the
approach has been opportunistic. The references and the journals that published the
papers cited in this chapter are a good starting point for learning more on the subject
and staying on top of the developments.
All the datasets, code, and other material referred to in this section are available at
https://www.allaboutanalytics.net.
• Data 18.1: handtools_reseller.csv
• Data 18.2: Assortment_Examples.xlsx
• Data 18.3: Tree_Regression_Example.xlsx
Exercises
(c) Enhance the model to not only consider the new product revenue but also the
revenue of competing products. To do so, assume the other products have prices
75, 100, 125, 150, and 175, and their MSRPs are equal to their prices.
(d) The store management would like to ensure that the total sales of the new
product is at least 25 units. How does this change affect your solution?
(e) The store wishes to order inventory based on the forecast demand. The store
manager argues that the demand prediction is just one number! He says, “We
need an interval forecast.” How would you modify the model to predict an
interval (such as (20,26))? What inventory would you order? (Hint: Look at
the error associated with the prediction at a node of the regression tree. Can you
use this?)
Ex. 18.2 AbD Sells MTO:
(a) Compute the expected profit (shown in Table 18.4) using the formulae
provided in the chapter. You can consult the MTO sheet of Assortment_Examples.xlsx.
(b) Is this the optimal assortment to offer?
(c) What would change if the stocking cost were to increase to $230 per product
stocked?
(d) How would AbD evaluate whether to add a new product to this category? Create
your own example and show the analysis.
Ex. 18.3 AbD Sells MTS:
(a) Compute the expected profit (shown in Table 18.5) using the formulae provided
in the chapter. Verify the simulation.
(b) Is this the optimal assortment to offer? Can you optimize the expected profit?
(c) What would change if the stocking cost were to increase to $230 per product
stocked?
(d) How would AbD evaluate the impact of a 5% increase in price of all products?
Ex. 18.4 AbD Sells to Walk-Ins:
AbD decides to offer only two types of slabs to walk-in customers, Roman Blue
and Forever Black. Assume that the choice probabilities are 0.4857 and 0.3256,
respectively. If a customer does not find the desired product, s/he will switch to the
other product with half the original probability (0.2428 and 0.1628). AbD keeps
just one slab of each as inventory. Use the same costs and prices as in the previous
exercise. Calculate the expected profit when exactly one customer arrives and also
when exactly two customers arrive. (Hint: Enumerate all possible sequences in
which a product is demanded.)
References
Alptekinoğlu, A., & Semple, J. H. (2016). The exponomial choice model: A new alternative for
assortment and price optimization. Operations Research, 64(1), 79–93.
Anupindi, R., Dada, M., & Gupta, S. (1998). Estimation of consumer demand with stock-out based
substitution: An application to vending machine products. Marketing Science, 17(4), 406–423.
Au, K. F., Choi, T. M., & Yu, Y. (2008). Fashion retail forecasting by evolutionary neural networks.
International Journal of Production Economics, 114(2), 615–630.
Ban, G.-Y., & Rudin, C. (2018). The big data newsvendor: Practical insights from machine
learning. Operations Research. Available at SSRN: https://ssrn.com/abstract=2559116. doi.org/
10.2139/ssrn.2559116
Bell, D. R., Bonfrer, A., & Chintagunta, P. K. (2005). Recovering stockkeeping-unit-level
preferences and response sensitivities from market share models estimated on item aggregates.
Journal of Marketing Research, 42(2), 169–182.
Bell, D. R., Gallino, S., & Moreno, A. (2014). Offline showrooms and customer migration in
omni-channel retail. Retrieved August 1, 2018, from https://courses.helsinki.fi/sites/default/
files/course-material/4482621/17.3_MIT2014%20Bell.pdf
Bernstein, F., Kök, A. G., & Xie, L. (2015). Dynamic assortment customization with limited
inventories. Manufacturing & Service Operations Management, 17(4), 538–553.
Bertsimas, D., & Mišić, V. V. (2015). Data-driven assortment optimization. Operations Research
Center, MIT.
Blanchet, J., Gallego, G., & Goyal, V. (2016). A Markov chain approximation to choice modeling.
Operations Research, 64(4), 886–905.
Bradlow, E. T., Gangwar, M., Kopalle, P., & Voleti, S. (2017). The role of big data and predictive
analytics in retailing. Journal of Retailing, 93(1), 79–95.
Caro, F., & Gallien, J. (2007). Dynamic assortment with demand learning for seasonal consumer
goods. Management Science, 53(2), 276–292.
Chandukala, S. R., Kim, J., Otter, T., Rossi, P. E., & Allenby, G. M. (2008). Choice models
in marketing: Economic assumptions, challenges and trends. Foundations and Trends® in
Marketing, 2(2), 97–184.
Chong, J. K., Ho, T. H., & Tang, C. S. (2001). A modeling framework for category assortment
planning. Manufacturing & Service Operations Management, 3(3), 191–210.
DeHoratius, N. (2011). Inventory record inaccuracy in retail supply chains. Wiley encyclopedia of
operations research and management science (pp. 1–14).
DeHoratius, N., & Raman, A. (2008). Inventory record inaccuracy: An empirical analysis.
Management Science, 54(4), 627–641.
Edgeworth, F. Y. (1888). The mathematical theory of banking. Journal of the Royal Statistical
Society, 51(1), 113–127.
Farias, V. F., Jagabathula, S., & Shah, D. (2013). A nonparametric approach to modeling choice
with limited data. Management Science, 59(2), 305–322.
Ferreira, K. J., Lee, B. H. A., & Simchi-Levi, D. (2015). Analytics for an online retailer: Demand
forecasting and price optimization. Manufacturing & Service Operations Management, 18(1),
69–88.
Fisher, M., & Vaidyanathan, R. (2014). A demand estimation procedure for retail assortment
optimization with results from implementations. Management Science, 60(10), 2401–2415.
Golrezaei, N., Nazerzadeh, H., & Rusmevichientong, P. (2014). Real-time optimization of
personalized assortments. Management Science, 60(6), 1532–1551.
Gruen, T. W., Corsten, D. S., & Bharadwaj, S. (2002). Retail out-of-stocks: A worldwide exami-
nation of extent, causes and consumer responses. Washington, DC: Grocery Manufacturers of
America.
Guadagni, P. M., & Little, J. D. (1983). A logit model of brand choice calibrated on scanner data.
Marketing Science, 2(3), 203–238.
Haensel, A., & Koole, G. (2011). Estimating unconstrained demand rate functions using customer
choice sets. Journal of Revenue and Pricing Management, 10(5), 438–454.
Honhon, D., Gaur, V., & Seshadri, S. (2010). Assortment planning and inventory decisions under
stockout-based substitution. Operations Research, 58(5), 1364–1379.
Honhon, D., Jonnalagedda, S., & Pan, X. A. (2012). Optimal algorithms for assortment selection
under ranking-based consumer choice models. Manufacturing & Service Operations Manage-
ment, 14(2), 279–289.
Jagabathula, S., & Rusmevichientong, P. (2016). A nonparametric joint assortment and price choice
model. Management Science, 63(9), 3128–3145.
Jagabathula, S., & Vulcano, G. (2017). A partial-order-based model to estimate individual
preferences using panel data. Management Science, 64(4), 1609–1628.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning
(Vol. 112). New York: Springer.
Kallus, N., & Udell, M. (2016). Dynamic assortment personalization in high dimensions. arXiv
preprint arXiv:1610.05604.
Kök, A. G., & Fisher, M. L. (2007). Demand estimation and assortment optimization under
substitution: Methodology and application. Operations Research, 55(6), 1001–1021.
Liyanage, L. H., & Shanthikumar, J. G. (2005). A practical inventory control policy using
operational statistics. Operations Research Letters, 33(4), 341–348.
Ma, S., Fildes, R., & Huang, T. (2016). Demand forecasting with high dimensional data: The
case of SKU retail sales forecasting with intra-and inter-category promotional information.
European Journal of Operational Research, 249(1), 245–257.
Porteus, E. L. (2002). Foundations of stochastic inventory theory. Stanford, CA: Stanford
University Press.
Rusmevichientong, P., Shen, Z. J. M., & Shmoys, D. B. (2010). Dynamic assortment optimization
with a multinomial logit choice model and capacity constraint. Operations Research, 58(6),
1666–1680.
Rusmevichientong, P., Van Roy, B., & Glynn, P. W. (2006). A nonparametric approach to
multiproduct pricing. Operations Research, 54(1), 82–98.
Ryzin, G. V., & Mahajan, S. (1999). On the relationship between inventory costs and variety
benefits in retail assortments. Management Science, 45(11), 1496–1509.
Smith, J. C., Lim, C., & Alptekinoğlu, A. (2009). Optimal mixed-integer programming and
heuristic methods for a Bilevel Stackelberg product introduction game. Naval Research
Logistics, 56(8), 714–729.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society. Series B (Methodological), 267–288.
Train, K. E. (2009). Discrete choice methods with simulation. Cambridge: Cambridge University
Press. https://eml.berkeley.edu/~train/distant.html.
van Ryzin, G., & Vulcano, G. (2017). An expectation-maximization method to estimate a rank-
based choice model of demand. Operations Research, 65(2), 396–407.
Vulcano, G., Van Ryzin, G., & Ratliff, R. (2012). Estimating primary demand for substitutable
products from sales transaction data. Operations Research, 60(2), 313–334.
Wierenga, B., van Bruggen, G. H., & Althuizen, N. A. (2008). Advances in marketing management
support systems. In B. Wierenga (Ed.), Handbook of marketing decision models (pp. 561–592).
Boston, MA: Springer.
Chapter 19
Marketing Analytics
1 Introduction
S. Arunachalam
Indian School of Business, Hyderabad, Telangana, India
e-mail: S_Arunachalam@isb.edu
A. Sharma
Texas A&M University, College Station, TX, USA
that can inform and evaluate marketing decisions right from product/service development
to sales (Farris et al. 2010; Venkatesan et al. 2014). According to the CMO
Survey, spending on analytics will increase from 4.6% to 22% in the next 3 years.1
This shows the increasing importance of analytics in the field of marketing. Top
marketers no longer rely on just intuition or past experience to make decisions. They
want to make decisions based on data. But in the same survey, “Lack of processes
or tools to measure success through analytics” and “Lack of people who can link
to marketing practice” have been cited as the top two factors that prevent marketers
from using advanced marketing analytic tools in the real world.
This chapter is a step toward closing these two gaps. We present tools that both
inform and measure the success of marketing activities and strategies. We also hope
that this will help current and potential marketers get a good grasp of marketing
analytics and how it can be used in practice (Lilien et al. 2013; Kumar et al. 2015).
Since analytics tools are vast in numbers and since it is not feasible to explain the
A–Z of every tool here, we select a few commonly used tools and attempt to give
a comprehensive understanding of these tools. Interested readers can look at the
references in this chapter to get more advanced and in-depth understanding of the
discussed tools.
The processes and tools discussed in this chapter will help in various aspects
of marketing such as target marketing and segmentation, price and promotion,
customer valuation, resource allocation, response analysis, demand assessment, and
new product development. These can be applied at the following levels:
• Firm: At this level, tools are applied to the firm as a whole. Instead of focusing
on a particular product or brand, these can be used to decide and evaluate firm
strategies. For example, data envelopment analysis (DEA) can be used for all the
units (i.e., finance, marketing, HR, operation, etc.) within a firm to find the most
efficient units and allocate resources accordingly.
• Brand/product: At the brand/product level, tools are applied to decide and
evaluate strategies for a particular brand/product. For example, conjoint analysis
can be conducted to find the product features preferred by customers or response
analysis can be conducted to find how a particular brand advertisement will be
received by the market.
• Customer: Tools applied at customer level provide insights that help in seg-
menting and targeting customers. For example, customer lifetime value is
a forward-looking customer metric that helps assess the value provided by
customers to the firm (Fig. 19.1).
Before we move further, let us look at what constitutes marketing analytics.
Though it is an ever-expanding field, for our purpose, we can segment marketing
analytics into the following processes and tools:
1 https://cmosurvey.org/2017/02/cmo-survey-marketers-to-spend-on-analytics-use-remains-
[Fig. 19.1: diagram of the levels at which marketing analytics is applied; recoverable labels: Marketing Analytics, Brand, Customer]
1. Multivariate statistical analysis: It deals with the analysis of more than one
outcome variable. Cluster analysis, factor analysis, perceptual maps, conjoint
analysis, discriminant analysis, and MANOVA are a part of multivariate
statistical analysis. These can help in target marketing and segmentation,
optimizing product features, etc., among other applications.
2. Choice analytics: Choice modeling provides insights on how customers make
decisions. Understanding the customer decision-making process is critical as it can
help to design and optimize various marketing mix strategies such as pricing
and advertising. Largely, Logistic, Probit, and Tobit models are covered in this
section.
3. Regression models: Regression modeling establishes relationships between
dependent variables and independent variables. It can be used to understand
outcomes such as sales and profitability, on the basis of different factors such
as price and promotion. Univariate analysis, multivariate analysis, nonlinear
analysis, and moderation and mediation analysis are covered in this section.
4. Time-series analytics: The models listed so far mainly deal with cross-sectional
data (although choice and regression models can be used for panel data as well).
This section consists of auto-regressive models and vector auto-regressive
models for time-series analysis. These can be used for forecasting sales, market
share, etc.
5. Nonparametric tools: Nonparametric tools are used when the data cannot be assumed
to follow a particular distribution. Data envelopment analysis (DEA) and stochastic
frontier analysis (SFA) are discussed in this section and can be used for
benchmarking, resource allocation, and assessing efficiency.
6. Survival analysis: Survival analysis is used to determine the duration of time
until an event such as purchase, attrition, or conversion happens. Baseline
hazard model, proportional hazard model, and analysis with time varying
covariates are covered in this section.
2.1 Regression
Interaction Effect
This section covers the steps for testing interactions between continuous
variables in multiple regression (MR). We provide an intuitive understanding of this
concept and share a step-by-step method for statistically estimating and probing
an interaction effect. Chapter 7 on linear regression analysis contains a detailed
explanation of the theory underlying MR.
Interaction effects, also called moderator effects, are the combined effects of
multiple independent variables on the dependent variable. They represent situations
where the effect of a variable on a dependent variable is contingent on another
variable (Baron and Kenny 1986). Let us consider a hypothetical example where
Y = i0 + a X + b Z + c XZ + Error (19.1)
where
Y = continuous dependent variable (sales in the above example)
X, Z = continuous independent variables (advertisement and price in the above
example)
XZ = new variable computed as the product of X and Z (product of advertise-
ment and price in the above example)
i0 = intercept
a, b, and c = slopes
An important aspect to remember in specifying an equation with interaction
terms is that the lower order terms should always be present in that equation. That
is, it is incorrect to test for interaction by omitting X and Z and just having the XZ
term alone. Here we assume that the independent variables, X and Z, are
mean-centered. To mean-center a variable, subtract its sample mean from each of
its values (mean-centered variable = variable – mean[variable]). The product
term XZ is computed after mean-centering both X and Z. The dependent variable,
Y, is generally not mean-centered. Mean-centering eases the interpretation of
effects and also removes nonessential multicollinearity. More details about the
benefits of mean-centering (which are very useful to know) can be found in Aiken
and West (1991) and in the other sources listed in the references section
(Cohen et al. 2003).
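The mean-centering step described above can be sketched in a few lines of Python. This is only an illustration: the six data points and the helper name `corr` are hypothetical and do not come from the chapter's datasets. The sketch centers X and Z, forms the product term from the centered variables, and compares how strongly X correlates with the product term before and after centering.

```python
import numpy as np

# Hypothetical raw data: advertisement (X) and price (Z) for six observations.
x_raw = np.array([10.0, 12.0, 15.0, 9.0, 14.0, 12.0])
z_raw = np.array([4.0, 5.0, 3.5, 6.0, 4.5, 5.0])

# Mean-center each predictor: variable minus its sample mean.
x = x_raw - x_raw.mean()
z = z_raw - z_raw.mean()

# Form the product term only after centering both variables.
xz = x * z


def corr(u, v):
    """Pearson correlation between two one-dimensional arrays."""
    return np.corrcoef(u, v)[0, 1]


# Correlation of X with the product term, before and after centering.
corr_raw = corr(x_raw, x_raw * z_raw)
corr_centered = corr(x, xz)
print("raw:", corr_raw, "centered:", corr_centered)
```

In this small made-up sample the correlation between X and the product term is weaker after centering, which is one way to see the reduction in nonessential multicollinearity that the text mentions.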
How does the presence of the interaction term, XZ, help in addressing the
managerial question above? Why is it needed and why cannot interaction effects
be uncovered using the simple MR equation: Y = i0 + a X + b Z that does not
contain the interaction term, XZ? To understand this let us start with the simple MR
equation that does not have the XZ term. Here the slopes “a” and “b” are the effect
of X and Z on Y, respectively. The regression of Y on X (i.e., slope “a”) is constant
across legit (explained below) values of Z and vice versa for regression of Y on
Z (i.e., slope “b”). This means that the regression of Y on X is independent of Z
because for any value of Z, value of the slope would be “a.” Hence, this does not
answer the managerial question asked above as it does not capture the variation in
the effect of X according to Z and vice versa. However, this is not the case if we
consider Eq. (19.1) that has the interaction term XZ. To understand why and how
let us rewrite Eq. (19.1) in this way:
Y = (a + c Z) X + (i0 + b Z) (19.2)
This expanded view of Eq. (19.1) helps answer the two questions we raised
above. It is seen from Eq. (19.2) that the slope of the regression of Y on X is now
the term (a + c Z), unlike just “a” from the simple equation. This term (a + c Z)
is called the simple slope in the moderation literature, as the effect of X on Y is now
conditional on the value of Z. Therefore, the effect is now dependent on Z rather
than being independent as noted above. To go back to our managerial example this
would mean that effect of advertisement (X) on sales (Y) is dependent on price (Z).
This is precisely what we would like to achieve or test. This also means that for
every value of Z, regression of Y on X would have a different line, which is called a
simple regression line. Equation (19.2) could also be rewritten by pulling out Z and
having a simple slope that is (b + c X), which would help understand how the effect
of price (Z) on sales (Y) is dependent on advertisement (X).
Estimation: Estimating Eq. (19.2) using any statistical software such as STATA (or
R) is straightforward. The command regress (lm) can be used to estimate the
equation after mean-centering the independent variables and creating the XZ term.
Interpretation and probing of interaction effects need more understanding (Preacher
et al. 2006). Let us assume that the slope “c” of the XZ term is statistically significant
(at p < 0.05). What does this mean? How should this significant effect be interpreted?
There are two very important takeaways when the slope “c” is significant. First, it
means that simple slopes for any pair of simple regression lines obtained using two
different legit values of Z are statistically different. Second, for any specific value
of Z, the researcher still has to test whether the simple slope of the corresponding
simple regression line is statistically significantly different from zero. These two points are better
understood and observed by plotting graphs of the interaction. We recommend that
the researcher should always plot graphs to thoroughly understand the interaction
effects. Before we probe these effects through graphs let us understand the meaning
of “legit values” of Z. Let us recall that we have mean-centered the variables X and
Z and hence the mean of these two variables would be zero and the range would
be from negative to positive values around zero. It is always a good practice to
choose two values of Z—one low value and another high value as one standard
deviation below and above the mean (which is zero now) respectively, to plot the
simple regression lines. The reason is that these two values would be within the
range of values of Z in our sample. This is important because we should not choose
a value that is not within the sample range. The researcher is free to choose any two
values that are within the range of Z. If we choose an arbitrary value and if that
value is outside of the range of Z, then we are testing the effects for out-of-sample
data, which is incorrect. Therefore, researchers should pay attention in choosing
legit values of Z.
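As a rough illustration of the estimation and probing steps, the sketch below fits the interaction model by ordinary least squares with NumPy and computes the simple slope (a + c Z) at one standard deviation below and above the mean of Z. The data are simulated and the coefficient values used to generate them are assumptions chosen for illustration; the standard error of the simple slope uses the usual variance formula for a linear combination of two estimates. It is not the chapter's STATA or R code.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated (hypothetical) data; the generating coefficients are illustrative.
x_raw = rng.normal(10.0, 2.0, n)
z_raw = rng.normal(5.0, 1.0, n)
x = x_raw - x_raw.mean()  # mean-centered advertisement
z = z_raw - z_raw.mean()  # mean-centered price
y = 1.0 + 0.12 * x + 0.09 * z + 0.46 * x * z + rng.normal(0.0, 0.5, n)

# Design matrix: intercept, X, Z, and the product term XZ.
design = np.column_stack([np.ones(n), x, z, x * z])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
i0, a, b, c = coef

# Conventional OLS covariance matrix of the estimates: sigma^2 * (D'D)^-1.
resid = y - design @ coef
sigma2 = resid @ resid / (n - design.shape[1])
cov = sigma2 * np.linalg.inv(design.T @ design)


def simple_slope(z_value):
    """Slope of Y on X at a given Z, with its standard error."""
    slope = a + c * z_value
    var = cov[1, 1] + z_value**2 * cov[3, 3] + 2 * z_value * cov[1, 3]
    return slope, np.sqrt(var)


# Legit values of Z: one SD below and above the (zero) mean.
z_low, z_high = -z.std(ddof=1), z.std(ddof=1)
for label, z_value in [("low Z", z_low), ("high Z", z_high)]:
    slope, se = simple_slope(z_value)
    print(label, "slope:", round(slope, 3), "t:", round(slope / se, 2))
```

The ratio of each simple slope to its standard error gives the test of whether that simple regression line differs from zero at the chosen value of Z.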
Let us assume we choose two legit values of Z, namely, Z-low (ZL ) and Z-high
(ZH ). So the simple slopes from Eq. (19.2) that determine the simple regression
lines are (a + c ZL ) and (a + c ZH ). Therefore, we can plot two regression lines by
substituting the two values of Z in Eq. (19.2). These two lines would then represent
the effect of advertisement (X) on sales (Y) when the price is low (ZL ) and when
the price is high (ZH ). Deriving from Eq. (19.2) the simple regression equations are:
Y = (a + c ZL) X + (i0 + b ZL)
Y = (a + c ZH) X + (i0 + b ZH) (19.3)
Let us assume we have estimated Eq. (19.2) from a sample dataset that contained
values for Y, X, and Z and that we have this prediction equation (intercept i0 is
omitted for pedagogical reason; X and Z are mean-centered):
Y = 0.12 X + 0.09Z + 0.46XZ (19.4)
The parameters “a,” “b,” and “c” are statistically significant (p < 0.05). Please
note it is not necessary for “a” and “b” to be significant to probe the interaction
effect. It is however important to have “c,” the estimate for the interaction term XZ,
to be statistically significant to probe the interaction effect (refer above for the two
important takeaways). Let us also assume that the standard deviations of X and Z
are both 0.5. The following steps help in plotting the graphs.
We calculate the predicted value of Y (sales) at the legit low and high values of X and Z.
To do so, we plug the low and high values of X and Z into Eq. (19.4). As the standard
deviations of mean-centered X and Z are both 0.5, the legit low and high values are
−0.5 and 0.5, respectively. When these values are used to compute the predicted
sales (Y) for low advertisement (XL ) and low price (ZL ), Eq. (19.4) becomes:
Y = 0.12 * (−0.5) + 0.09 * (−0.5) + 0.46 * (−0.5) * (−0.5)
This leads to predicted sales of 0.01 at low advertisement and low price.
Similarly, predicted values for all other combinations of X and Z can be obtained.
We recommend using an Excel sheet to compute this 2 × 2 table as shown below
(Table 19.1).
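For readers who prefer a script to a spreadsheet, the same 2 × 2 grid of predicted values can be sketched in Python. The coefficients are those of Eq. (19.4) and the low and high levels are the ±0.5 values used in the text; the function and variable names are illustrative only.

```python
# Coefficients from Eq. (19.4); the intercept is omitted, as in the text.
a, b, c = 0.12, 0.09, 0.46


def predicted_sales(x, z):
    """Predicted Y for mean-centered advertisement x and price z."""
    return a * x + b * z + c * x * z


# One standard deviation below and above the centered mean of zero.
levels = {"low": -0.5, "high": 0.5}

# Keys are (price level, advertisement level); values are predicted sales.
table = {
    (z_name, x_name): round(predicted_sales(x_val, z_val), 2)
    for z_name, z_val in levels.items()
    for x_name, x_val in levels.items()
}

for z_name in levels:
    row = [table[(z_name, x_name)] for x_name in levels]
    print(f"price {z_name}: advertisement low/high = {row}")
```

Each printed row corresponds to one simple regression line: the two values are the end points of the line for that price level.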
Once we have the above table, it becomes easy to plot the two regression lines
corresponding to low and high values of price (Z). The two rows in the table are the
two regression lines. The graph can be plotted as a line graph in Excel (Fig. 19.2).
(Excel approach: To plot, click the “Insert” option and select line graph and
choose the two data points for low and high to get the two regression lines as shown
below.)
Interpretation of the graph: The line titled “High Price” shows the effect of
advertisement on sales when the price is high. Its slope is positive; that is,
at a high price, sales increase as advertisement increases. The line titled
“Low Price” shows the effect
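[Fig. 19.2: line graph of predicted sales (Y) against advertisement (X), with one line for high price (ZH) and one for low price (ZL)]
[Fig. 19.3: panels of Y against X showing the expected curve shapes for different signs of a and b; recoverable panel title: “3. a negative, b negative”]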
Y = i0 + a X + b X² (19.5)
(Note: X is mean-centered)
As noted in the section on interaction effect, the lower order term (i.e., X) should
be present when a higher order term, formed as the product X * X, is present. The
magnitude and the sign of slopes “a” and “b” of Eq. (19.5) contain a wealth of
information that helps the researcher in understanding what could be the shape of
this specification. Refer to Fig. 19.3 to understand the expected shapes depending on
the signs.
Going back to the example of the effect of advertisement and price on sales (Y);
let us now consider (for pedagogical reason) only the effect of advertisement (X)
on sales (Y). A manager could argue that as advertisement increases, product sales
tend to increase. Also, beyond a point the effect of advertisement on sales might
not increase at the same rate. The manager is actually talking about a curvilinear
relationship between X and Y, wherein Y increases with X, but beyond a
point it increases at a decreasing rate. This leads to a prediction line as depicted in
panel 2 of Fig. 19.3 above, wherein slope “a” is positive and “b” is negative.
As we learnt in the interaction section, we could try to compute the simple slope
of Eq. (19.5) by restructuring it. However, this method is incorrect and cannot be
used for equations that have curvilinear effects. We have to use simple calculus to
derive the simple slope. As we all know, the first partial derivative with respect to
(w.r.t) X is the simple slope of regression of Y on X. This technique can be applied
to the interaction equation we dealt with earlier as well. So, going back to Eq. (19.5),
the first partial derivative is:
∂ Y/∂ X = a + 2 b X (19.6)
As seen in Eq. (19.6), the simple slope depends or is conditional on the value of
X itself. This along with the sign of “a” and “b” is precisely the reason for the effect
of X on Y being curvilinear and not just linear. Again, going back to our steps in
choosing legit values of X, we can take one standard deviation below and above the
mean (which is zero) and compute the values of the simple slope.
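A small sketch of this calculation follows. The values of “a,” “b,” and the standard deviation of X are hypothetical and were picked only so that “a” is positive and “b” is negative, matching the manager's argument; they are not estimates from any dataset in the chapter.

```python
def simple_slope(a, b, x):
    """Eq. (19.6): first derivative of Y = i0 + a*X + b*X**2 w.r.t. X."""
    return a + 2 * b * x


# Hypothetical estimates with a > 0 and b < 0 (diminishing returns).
a, b = 0.40, -0.30
sd_x = 0.5  # assumed standard deviation of mean-centered X

slopes = {
    "low X": simple_slope(a, b, -sd_x),
    "mean X": simple_slope(a, b, 0.0),
    "high X": simple_slope(a, b, sd_x),
}
for label, value in slopes.items():
    print(label, round(value, 2))
```

With these assumed values the slope stays positive but shrinks as X moves from low to high, which is the pattern of sales increasing at a decreasing rate.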
If a manager would like to investigate how price interacts with the curvilinear
relationship between advertisement and sales, we can follow the same steps we used in the
interaction section. However, to derive the simple slope, we have to use calculus and not
just simple restructuring of the equation for reasons stated above. Interaction effects
on curvilinear relationship are complex and advanced concepts. So we urge the
interested readers to peruse the resources in the reference section for a complete
understanding. We provide some ideas to help understand some basics of interaction
effects in curvilinear relationships below.
The full equation after including price (Z) would be:
Y = i0 + a X + b X² + c Z (19.7)
Adding the interaction term XZ gives:
Y = i0 + a X + b X² + c Z + d XZ (19.8a)
To derive the simple slope, which shows the effect of X on Y, we take the first
partial derivative w.r.t X:
∂ Y/∂ X = a + 2 b X + d Z (19.8b)
It can be seen from the above equation that not only does the effect depend on X
(due to the presence of the term 2 b X) but it also depends on Z (due to the term d
Z). This means that the curvilinear effect is conditional on Z as well.
Furthermore, if a manager hypothesizes that Z alters both the level and the shape
of the curvilinear effect, it can be tested by introducing the X²Z term. This term is
the product of X² and Z. Then Eqs. (19.8a) and (19.8b) will change to the following:
Y = i0 + a X + b X² + c Z + d XZ + e X²Z (19.9a)
Please note the presence of all lower order terms (which are X, X², Z, and XZ) of
the highest order term (X²Z). Neglecting any of the lower order terms leads to incorrect
interpretation of Eqs. (19.9a) and (19.9b). Again, to calculate the simple slope we
do the following:
∂ Y/∂ X = a + 2 b X + d Z + 2 e XZ (19.9b)
This shows that Z affects not just the level but the shape of the curve as well.
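The derivative in Eq. (19.9b) can be checked numerically. The sketch below, with hypothetical coefficient values, compares the analytical simple slope against a central finite difference of Eq. (19.9a), and also reports the curvature 2(b + e Z), obtained by differentiating Eq. (19.9b) once more with respect to X, to show how Z changes the shape of the curve.

```python
# Hypothetical coefficients for Eq. (19.9a); not estimates from real data.
params = {"i0": 0.0, "a": 0.40, "b": -0.30, "c": 0.10, "d": 0.20, "e": 0.15}


def predict(x, z, p=params):
    """Eq. (19.9a): Y as a function of mean-centered X and Z."""
    return (p["i0"] + p["a"] * x + p["b"] * x**2 + p["c"] * z
            + p["d"] * x * z + p["e"] * x**2 * z)


def simple_slope(x, z, p=params):
    """Eq. (19.9b): partial derivative of Y with respect to X."""
    return p["a"] + 2 * p["b"] * x + p["d"] * z + 2 * p["e"] * x * z


def curvature(z, p=params):
    """Second derivative w.r.t. X; it depends on Z through the X^2 Z term."""
    return 2 * (p["b"] + p["e"] * z)


def numeric_slope(x, z, h=1e-6):
    """Central finite-difference approximation of the slope."""
    return (predict(x + h, z) - predict(x - h, z)) / (2 * h)


for z in (-0.5, 0.5):
    for x in (-0.5, 0.5):
        print(f"Z={z:+.1f} X={x:+.1f}",
              "analytic:", round(simple_slope(x, z), 4),
              "numeric:", round(numeric_slope(x, z), 4))
    print(f"curvature at Z={z:+.1f}:", round(curvature(z), 4))
```

Because the curvature differs between the low and high values of Z, the curve of Y against X bends differently at each level of Z, not merely shifts up or down.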
Mediation
Mediation is used to study relationships among a set of variables by estimating a
system of equations. The objective of a mediation analysis is to extract the mech-
anism behind the effect of X on Y. Mediation analysis is useful in understanding
the intervening mechanism that actually causes the effect of X on Y (MacKinnon
2008; Preacher and Hayes 2004). To find this intervening mechanism, an intervening
variable called the mediator (M) is introduced. We can imagine this mediation as a
pathway: X→M→Y. Therefore, the total effect of X on Y is now partitioned into
an indirect effect via M and a direct effect of X on Y.
In mediation analysis, there is a series of regressions or a set of equations that
has to be estimated. First is the effect of X on the mediator M (X→M, slope a),
next is the effect of mediator on the dependent variable Y (M→Y, slope b), and
finally the direct effect of X on Y (X→Y, slope c′). The researcher should note
that now there are two dependent variables, namely, Y and the newly introduced
intervening mediator M. It can be easily understood through the diagram given
below (Fig. 19.4).
The diagram (Fig. 19.4) can be represented by the following set of regression
equations:
Y = i0 + b M + c′ X (19.10)
M = j0 + a X (19.11)
where i0 and j0 are intercepts and are ignored hereafter for pedagogical purpose.
[Fig. 19.4: mediation path diagram with paths X→M (slope a), M→Y (slope b), and X→Y (slope c′)]
Now the mediating effect, also called the indirect effect of X on Y via M, can be
obtained by inserting Eq. (19.11) in Eq. (19.10):
Y = ab X + c′ X (19.12)
This shows that the indirect effect of X on Y is the product of two slopes captured
as “a * b” and the direct effect of X on Y is “c′.”
The slopes “a,” “b,” and “c′” can be estimated using any statistical software
and the product of the slopes “a * b” can be computed to find the strength of the
mediation. However, we recommend using software that supports path analysis or
structural equation modeling (SEM) techniques (STATA or Mplus or R) to estimate
both the equations simultaneously without having to run multiple regressions. This
practice can help the researcher easily progress from the simple model narrated
here to more complex mediational models that are often the case in research. One
important question still remains: how to test the statistical significance of the
product term “a * b”? The statistical test of the simple slopes “a,” “b,” and “c′”
using standard errors is straightforward. However, for the product of the slopes,
recent research in mediation analysis suggests that a standard error derived
analytically (using a technique called the delta method) is inaccurate
and that statistical significance should instead be tested using nonparametric
techniques like bootstrapping. The intuition is that although the individual slope
estimates “a” and “b” are assumed to be normally distributed, their product
is not. Hence, because standard errors derived through parametric methods
(like OLS or maximum likelihood) assume normality, researchers are advised to
use nonparametric resampling procedures like bootstrapping.
The logic behind the bootstrapping procedure is straightforward. The estimate
of the indirect effect “a * b” is repeatedly obtained from samples drawn with
replacement from the original sample. The standard deviation of those estimates is the
standard error, and their empirical distribution is used to build a nonsymmetric
95% confidence interval
to test the significance of the indirect effect. If the 95% confidence interval of the
indirect effect does not contain zero, the effect is statistically significant. In this case,
we can conclude that there is a significant mediation effect.
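The bootstrap test described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the chapter's STATA/R workflow: the data are simulated (the true slopes 0.6, 0.5, and 0.3, the sample size, and the random seed are all hypothetical), the helper functions `slope_a`, `slope_b`, and `indirect_effect` are names introduced here, and the confidence interval uses the simple percentile method.

```python
import random
import statistics

def slope_a(x, m):
    """OLS slope of M on X (Eq. 19.11)."""
    mx, mm = statistics.fmean(x), statistics.fmean(m)
    sxm = sum((xi - mx) * (mi - mm) for xi, mi in zip(x, m))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxm / sxx

def slope_b(x, m, y):
    """OLS slope of Y on M, controlling for X (Eq. 19.10)."""
    mx, mm, my = (statistics.fmean(v) for v in (x, m, y))
    dx = [xi - mx for xi in x]
    dm = [mi - mm for mi in m]
    dy = [yi - my for yi in y]
    sxx = sum(d * d for d in dx)
    smm = sum(d * d for d in dm)
    sxm = sum(p * q for p, q in zip(dx, dm))
    sxy = sum(p * q for p, q in zip(dx, dy))
    smy = sum(p * q for p, q in zip(dm, dy))
    # Solve the two-predictor normal equations for the coefficient on M.
    return (sxx * smy - sxm * sxy) / (sxx * smm - sxm ** 2)

def indirect_effect(x, m, y):
    """Product of slopes a * b."""
    return slope_a(x, m) * slope_b(x, m, y)

# Simulated data (hypothetical): true a = 0.6, b = 0.5, so a * b = 0.3.
rng = random.Random(2019)
n = 300
x = [rng.gauss(0, 1) for _ in range(n)]
m = [0.6 * xi + rng.gauss(0, 1) for xi in x]
y = [0.5 * mi + 0.3 * xi + rng.gauss(0, 1) for xi, mi in zip(x, m)]

ab_hat = indirect_effect(x, m, y)

# Bootstrap: resample rows with replacement and re-estimate a * b each time.
n_boot = 1000
boot = []
for _ in range(n_boot):
    idx = [rng.randrange(n) for _ in range(n)]
    boot.append(indirect_effect([x[i] for i in idx],
                                [m[i] for i in idx],
                                [y[i] for i in idx]))
boot.sort()

boot_se = statistics.stdev(boot)            # bootstrap standard error
ci_low = boot[int(0.025 * n_boot)]          # 2.5th percentile
ci_high = boot[int(0.975 * n_boot) - 1]     # 97.5th percentile
significant = not (ci_low <= 0.0 <= ci_high)

print(f"a*b = {ab_hat:.3f}, bootstrap SE = {boot_se:.3f}, "
      f"95% CI = [{ci_low:.3f}, {ci_high:.3f}], significant = {significant}")
```

Because the interval is read off the sorted bootstrap estimates, it need not be symmetric around the point estimate, which is the property the text relies on.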
Let us go back to the example of advertisement (X) having an effect on product
sales (Y). Now a manager might be interested in knowing the reason behind the
“increase in advertisement having a positive effect on sales.” One could argue
that, as advertisement increases (X), customers’ awareness (M) about the product
also increases, which encourages them to buy more, leading to an increase in
product sales (Y). Therefore, here the mediator is consumer awareness (M). If
somehow the manager can measure or capture this variable, it can be used, along
with the X and Y variables, to test this hypothesized mediation effect. But what
is the managerial implication? What really is the “extra” insight derived by doing
this mediation analysis? An important insight is that the manager is now able to
identify one relevant mechanism that actually causes the effect of X on Y. The
manager can then increase resource allocation to marketing strategies that help
improve consumer awareness levels. Furthermore, this manager could be intrigued
to investigate whether there are other mediators or intervening mechanisms like
awareness. Understanding and finding the reasons behind effects are critical to
decision making. Armed with the knowledge of the mechanism behind the cause-
and-effect relationship, executives can make informed decisions which lead to
greater success.
In this section, we have covered simple mediation analysis. Interested readers are
strongly recommended to consult the resources provided in the reference section
for advanced mediation topics like mediation analysis with multiple mediators and
multiple dependent variables. One interesting and important advance in mediation
analysis in recent times has been the ability to integrate moderation with mediation
analysis. This is called moderated mediation analysis (Preacher et al. 2007). This
is an important advancement because it resonates well with the purpose of theory
testing in marketing or in general any social science domain. Theory is important
primarily for two reasons: to provide arguments to explain the reason behind why
and how a variable affects another and to uncover the boundary condition under
which this effect could change. This is essentially the spirit of any moderated
mediation test as it helps to simultaneously understand the mechanism (mediation)
and the conditional effect (moderation) of that mediation pathway. That is, to
investigate if the indirect effect of ‘X→M→Y’ as a unit is conditional on the
level of a moderator variable, say Z. This would show how the product term “ab”
is conditional on meaningful values of Z (say, low and high levels of Z). For example,
returning to how “consumer awareness” (M) channels the effect
of advertisement (X) on product sales (Y), a manager can also understand how the
strength of this pathway varies depending on whether the product price (Z) is low
or high. As the mediation pathway is conditional on another variable, that is, the
mediation effect is moderated by another variable, this test is termed a moderated
mediation test.
The theory and empirics behind moderated mediation have expanded rapidly in recent times.
Readers are encouraged to consult the references for a deeper understanding of this
technique.
Suppose you are a manager of a big MNC and you want to identify the efficient
units (finance, marketing, operations, HR, international business, etc.) within your
organization. You may want to decide on allocating resources, hiring, increasing
the benefits to the employees of a specific unit, based on the efficiency of the unit.
How are you going to decide on the efficiency? You may use raw observational data
for such decisions. However, such an approach may not be adequate when designing
critical strategies such as resource allocation or hiring.
A prominent analytic tool that can be used in such decision making is
“data envelopment analysis” (DEA) (Cook and Seiford 2009; Cooper et al. 2004). DEA
is designed to help managers measure and improve the performance of their
organizations. As the quest for efficiency is never-ending, DEA can help
capture the efficiency of each unit and suggest the potential factors that may be
responsible for making units efficient.
DEA allows managers to take into account all the important factors that affect
a unit’s performance to provide a complete and comprehensive assessment of
efficiency (Charnes et al. 1978). DEA does this by converting multiple inputs and
outputs into a single measure of efficiency. By doing so, it identifies those units
that are operating efficiently and those that are not. The efficient units, that is, units
making best use of input resources to produce outputs, are rated as being 100%
efficient, while the inefficient ones obtain lower scores.
Let us understand the technical aspects of DEA in a simple way. Suppose, a
manager of ICICI bank wants to measure the efficiency of HR, finance, marketing,
operation, public relation, and accounting departments to allocate resources for the
following year (Sherman and Franklin 1985). Managers generally have information
regarding the number of employees, total issues handled, raises given, and the
contribution of these units toward the operational profits of the firm. Managers can
use DEA with number of employees, total issues handled, and raises given as the
inputs and contribution of these units toward the operational profits as the output.
DEA will assign each unit a score between 0 and 1. A score of
1 means the unit is perfectly efficient, and anything less than 1 means that the unit
has room to grow and become more efficient. Based on the relative importance of the
inputs, units may be asked to work on specific inputs so that they become
efficient. Further, managers can now allocate resources to the efficient
units.
DEA also provides guidance to managers on the reduction/increment required
in the inputs of the units to become efficient, helping managers answer questions,
such as “How well are the units doing?” and “How much could they improve?” It
suggests performance targets, such as marketing unit should be able to produce 15%
more output with their current investment level or HR unit should be able to reduce
churn by 25% and still produce the same level of outputs. It also identifies the best
performing units. One of the most interesting insights from DEA is that one can test
the operating practices of such units and establish a guide to “best practices” for
others to emulate.
\[
\max_{\theta,\lambda}\ \theta \quad \text{such that} \quad -\theta y_i + Y\lambda \ge 0, \quad x_i - X\lambda \ge 0, \quad \lambda \ge 0
\]
not hold in reality, most DMUs in a sample may not work at the optimal scale. In
such a context, CRS may not be an ideal tool. To avoid any potential scale effects,
one may use VRS. One can compute the technical efficiency with VRS that excludes
the scale effects. The CRS linear programming problem can be modified to account
for VRS by adding the convexity constraint e′λ = 1 (where e is an N × 1 vector
of ones). This additional constraint gives the frontier its piecewise linear and
concave shape.
\[
\max_{\theta,\lambda}\ \theta \quad \text{such that} \quad -\theta y_i + Y\lambda \ge 0, \quad x_i - X\lambda \ge 0, \quad e'\lambda = 1, \quad \lambda \ge 0
\]
When the CRS and VRS models give different efficiency scores for a DMU,
this indicates the presence of scale inefficiency in that DMU. One can measure
scale efficiency by taking the ratio of the efficiency score obtained from CRS to the
efficiency score obtained from VRS (Caudill et al. 1995; Gagnepain and Ivaldi 2002;
Greene 2010).
Example
Suppose that we are interested in evaluating the efficiency of the hospital units
(Ouellette and Valérie 2004) of a chain based on a number of characteristics: the
total number of employees, the size of units in square meters, the number of patients
each unit serves, total number of specialists, total revenue, and satisfaction. It
becomes obvious that finding the most efficient units requires us to compare records
with multiple features.
To apply DEA, we must define our inputs (X) and outputs (Y). In the case of a
hospital chain, X can be the total number of employees, the size of units in square
meters, the number of patients each unit serves, total number of specialists; and Y
can be total revenue and satisfaction. If we run DEA, we will estimate the output to
input ratio for every hospital under the ideal weights (ideal weights are weights that
consider the values that each unit puts on inputs and outputs). Once we have their
ratios, we will rank them according to their efficiency (Banker and Morey 1986).
STATA/R Code
DEA can be run in several statistical programming packages. Here, we
provide the syntax required to conduct the analysis in STATA2 or R3. Although
STATA does not have a built-in function, one can use the user-written command (dea)
to do the analysis. If you need help, type “help dea” in the STATA
command window for a step-by-step explanation of the analysis.
2 https://www.cgdev.org/sites/default/files/archive/doc/stata/MO/DEA/dea_in_stata.pdf (accessed
Options:
• rts(crs|vrs|drs|nirs) specifies the returns to scale. The default is rts(crs)
• ort(in|out) specifies the orientation. The default is ort(in)
• stage(1|2) specifies the way to identify all efficiency slacks. The default is stage(2)
To do the analysis in R, you need to use the “Benchmarking” package.
install.packages("Benchmarking")  # install the package (first time only)
library(Benchmarking)             # load the package
# DEA code
eff <- dea(x, y, RTS = "crs")     # x holds the inputs and y the outputs, one row per unit
eff                               # prints the efficiency score of each unit
# The RTS option specifies the returns to scale:
#   "crs" - constant returns to scale
#   "vrs" - variable returns to scale
#   "drs" - decreasing returns to scale
#   "irs" - increasing returns to scale
DEA in Practice
There are eight units of a restaurant chain (largely in Northern India) and the
manager wants to identify the most efficient unit for the best restaurant award. The
manager has information only on the total number of employees and the revenue (in
100,000) of each unit, as follows:
Units      1    2    3    4    5    6    7    8
Employee   5    6    11   15   20   9    12   10
Revenue    4    4.8  8    14   18   5    11   9
To find the efficiency of each unit, we first divide its revenue by its number of
employees:
Units              1    2    3      4     5    6     7     8
Employee           5    6    11     15    20   9     12    10
Revenue            4    4.8  8      14    18   5     11    9
Revenue/employee   0.8  0.8  0.727  0.93  0.9  0.55  0.91  0.9
By this measure, Unit 4 is the most efficient unit and
Unit 6 is the least efficient one.
Now, to compute the relative efficiency, we divide the efficiency of each unit
by the efficiency of the most efficient unit, as shown below. By construction,
0 ≤ (revenue per employee of a unit)/(revenue per employee of the most
efficient unit) ≤ 1.
Units                 1         2         3        4     5         6         7         8
Employee              5         6         11       15    20        9         12        10
Revenue               4         4.8       8        14    18        5         11        9
Revenue/employee      0.8       0.8       0.727    0.93  0.9       0.55      0.91      0.9
Relative efficiency   0.860215  0.860215  0.78172  1     0.967742  0.591398  0.978495  0.967742
Data envelopment analysis shows that Unit 4 is the most efficient relative
to all other units, and Unit 6 is the least efficient. That means Unit 4 will be
on the frontier and the rest will be within the frontier. One can also show the frontier
pictorially.
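With a single input and a single output, the calculation above reduces to simple ratios, which the following Python sketch reproduces from the restaurant data. It is only an illustration of this special case, not a general DEA solver. The code works with unrounded ratios, so its relative efficiencies can differ from the tabulated ones in the third decimal place (the table divides by the rounded value 0.93).

```python
# Restaurant data from the text: one input (employees), one output (revenue).
employees = [5, 6, 11, 15, 20, 9, 12, 10]
revenue = [4, 4.8, 8, 14, 18, 5, 11, 9]

# Efficiency of each unit: output per unit of input.
ratio = [r / e for r, e in zip(revenue, employees)]

# Relative efficiency: each unit's ratio divided by the best ratio.
best = max(ratio)
relative = [q / best for q in ratio]

for unit, (q, rel) in enumerate(zip(ratio, relative), start=1):
    print(f"Unit {unit}: revenue/employee = {q:.3f}, "
          f"relative efficiency = {rel:.3f}")

most_efficient = relative.index(max(relative)) + 1
least_efficient = relative.index(min(relative)) + 1
print("Most efficient unit:", most_efficient)
print("Least efficient unit:", least_efficient)
```

The unit with relative efficiency 1 defines the frontier; every other unit's score measures how far inside the frontier it lies.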
\[
y_i = \beta x_i - u_i + v_i
\]
the outside influences beyond the control of the producer. In sum, the SFA has
two components: a stochastic production frontier serving as a benchmark against
which firm efficiency is measured, and a one-sided error term that is independently
and identically distributed across observations and captures technical inefficiency
across production units.
If a manager allows the inefficiencies to depend on firm-level factors
(employees, experts, size, alliances, acquisitions, etc.), the manager can examine
the determinants of inefficiency. Such understanding helps in implementing different
policy interventions to improve efficiency. Managers can also adjust the inputs
that enter the production function to reduce the inefficiencies.
Now the question is whether one should use SFA over DEA (Wadud and White
2000; Koetter and Poghosyan 2009). Imagine that there are random variations in the
inputs. These variations can make the DEA analysis unstable. A potential advantage
of SFA over DEA is that random variations in inputs can be accommodated.4
Despite these benefits, SFA also suffers from several disadvantages,
including the difficulty of handling multiple outputs: it requires
stochastic multiple-output distance functions, and it raises problems for outputs that
take zero values.
Technical Details of SFA
The technological relationship between a few input variables and the corresponding
output variables is given by a “production function” (Aigner et al. 1977).
Econometrically, if we use data on observed outputs and inputs, the production function
will indicate the average level of outputs that can be produced from a given level of
inputs (Schmidt 1985). One can estimate production functions at either an individual
or an aggregate level.
The literature on production function suggests that the implicit assumption of
production functions is that all firms are producing in a technically efficient manner,
and the representative (average) firm therefore defines the frontier (Førsund et al.
1980; Farrell and Fieldhouse 1962). The estimation of the production frontier
assumes that the boundary of the production function is defined by “best practices”
units. It therefore indicates the maximum potential output for a given set of inputs
along a straight line from the origin point. The error term in the SFA model
represents any other reason firms would be away from (within) the boundary.
Observations within the frontier are deemed “inefficient” (Hjalmarsson et al. 1996;
Reinhard et al. 2000).
The SFA Model and STATA/R Code
Restating the stochastic frontier model:
\[
y_i = \beta x_i - u_i + v_i, \qquad u_i = |U|
\]
In this area of study, estimation of the model parameters is usually not the
primary objective. Estimation and analysis of the inefficiency of individuals in the
sample and of the aggregated sample are usually of greater interest.
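To make the roles of the two error components concrete, the following Python sketch simulates data from the model above. It does not estimate a frontier; it only generates observations under assumed values. The slope, the two standard deviations, the sample size, and the normality of both U and v are hypothetical choices made for this illustration. The half-normal mean used as a check, σ_u·√(2/π), is a standard result.

```python
import math
import random
import statistics

rng = random.Random(0)

# Hypothetical parameter values chosen only for illustration.
beta = 2.0       # frontier slope
sigma_v = 0.3    # std. dev. of the two-sided random error v
sigma_u = 0.5    # std. dev. of U, whose absolute value is the inefficiency u
n = 5000

x = [rng.uniform(1, 10) for _ in range(n)]
v = [rng.gauss(0, sigma_v) for _ in range(n)]       # noise: can be + or -
u = [abs(rng.gauss(0, sigma_u)) for _ in range(n)]  # inefficiency: never negative

# Observed output: y_i = beta * x_i - u_i + v_i
y = [beta * xi - ui + vi for xi, ui, vi in zip(x, u, v)]

# Each unit's own stochastic frontier is beta * x_i + v_i; the shortfall
# of observed output below that frontier is exactly u_i.
mean_inefficiency = statistics.fmean(u)
half_normal_mean = sigma_u * math.sqrt(2 / math.pi)

print(f"Average simulated inefficiency: {mean_inefficiency:.3f}")
print(f"Half-normal mean sigma_u * sqrt(2/pi): {half_normal_mean:.3f}")
```

Because u is one-sided, the composite error v − u has a negative mean, which is what separates a frontier model from an ordinary regression through the middle of the data.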
STATA has a built-in command to estimate SFA.5 This is essentially a regression
analysis where the error term consists of a random error and an inefficiency term,
and it can be estimated for both production and cost functions. The syntax of
the frontier command is:
frontier depvar [indepvars] [if] [in] [weight] [, options]
SFA in Practice
A manager can find the effect of inefficiency and the drivers of inefficiency
by adopting an SFA approach. For example, given the availability of the data
on the improper implementation of the marketing-mix elements, and the within-
unit corruption, a manager can model the inefficiency as a function of improper
implementation and corruption and see their effect on the overall productivity (e.g.,
revenue contribution) of each unit. In marketing, there have been several applications
of SFA; see, for example, Feng and Fay (2016), who apply it to evaluate salesperson
capability; Parsons (2002), who comments upon its application to salespeople and
retail outlets and observes that, rather than explaining just the mean performance, SFA
explains the gap between a unit and the best performer; and Vyt
(2008), who uses SFA for comparing retailers’ geo-marketing.
Attribute   Levels
Cheese      Cheddar, Mozzarella, Parmesan
Toppings    Onion, Mushroom, Pepperoni
Drink       Coke, Pepsi
Price       $5, $7, $10
Now, the approach is to regress the consumer’s ratings of the pizza profiles on
the attribute levels.
We run the regression on categorical dummy variables. We set one level of each
attribute as the baseline and omit the corresponding dummy from the regression. Our
baseline in this example is a pizza with Cheddar cheese, onion toppings, Coke, and a
price of $5.
The utility of this baseline pizza is captured in the intercept from the regression
output. The other coefficients give the partworths for various attribute levels.
Depending on the type of information needed, further insights can be derived from
the partworths calculated above.
Conjoint Analysis Applications
There are many possible applications of conjoint analysis. However, the four
common applications are trade-off analysis, market share forecasting, determining
relative attribute importance, and comparing product alternatives. All other applica-
tions are generally variants of these four applications.
Trade-off analysis: Utilities from conjoint analysis are used to analyze whether
average consumers would be willing to give up on one particular attribute to gain
improvements in another.
For the given example, the partworths are estimated as follows:
Attribute   Level        Partworth   Best bundle
Intercept                3.28
Cheese      Cheddar      0           Parmesan
            Mozzarella   −1.11
            Parmesan     2.78
Toppings    Onion        0           Pepperoni
            Mushroom     0.78
            Pepperoni    0.89
Drink       Coke         0           Coke
            Pepsi        −2
Price       Dollar5      0           $10
            Dollar7      −0.11
            Dollar10     0.44
evaluated. Based on product utilities, the market share for product i can be
calculated as:
\[
\text{Share}_i = \frac{e^{U_i}}{\sum_{j=1}^{n} e^{U_j}}
\]
\[
I_i = \frac{\overline{U}_i - \underline{U}_i}{\sum_{i=1}^{n} \left( \overline{U}_i - \underline{U}_i \right)}
\]
Attribute importances
Attribute Range Importance
Cheese 3.89 0.53
Toppings 0.89 0.12
Drink 2 0.27
Price 0.55 0.08
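The ranges and importances in the table can be reproduced directly from the partworths estimated earlier. The Python sketch below is a minimal illustration: the dictionary layout and variable names are assumptions made here, while the partworth values are the ones reported in the text. The range of an attribute is its highest partworth minus its lowest, and its importance is that range divided by the sum of all ranges.

```python
# Partworths reported in the text (baseline levels have partworth 0).
partworths = {
    "Cheese": {"Cheddar": 0.0, "Mozzarella": -1.11, "Parmesan": 2.78},
    "Toppings": {"Onion": 0.0, "Mushroom": 0.78, "Pepperoni": 0.89},
    "Drink": {"Coke": 0.0, "Pepsi": -2.0},
    "Price": {"$5": 0.0, "$7": -0.11, "$10": 0.44},
}

# Range of each attribute: highest partworth minus lowest partworth.
ranges = {attr: max(levels.values()) - min(levels.values())
          for attr, levels in partworths.items()}

# Importance: each attribute's share of the total range.
total_range = sum(ranges.values())
importance = {attr: rng / total_range for attr, rng in ranges.items()}

# Best bundle: the level with the highest partworth for each attribute.
best_bundle = {attr: max(levels, key=levels.get)
               for attr, levels in partworths.items()}

for attr in partworths:
    print(f"{attr:9s} range = {ranges[attr]:.2f}  "
          f"importance = {importance[attr]:.2f}  best = {best_bundle[attr]}")
```

The importances sum to 1 by construction, so they can be read as each attribute's share of the total variation in utility for this respondent.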
Let us consider two product profile ratings of a single respondent and predict her
choice among these two profiles.
Product Cheddar Mozzarella Parmesan Onion Mushrooms Pepperoni Coke Pepsi Five$ Seven$ Ten$ Rating
A 0 0 1 1 0 0 0 1 0 0 1 4
B 0 1 0 0 0 1 0 1 0 0 1 1
Cheddar Mozzarella Parmesan Onion Mushrooms Pepperoni Coke Pepsi Five$ Seven$ Ten$
0 −1.11 2.78 0 0.78 0.89 0 −2 0 −0.11 0.44
Product Cheddar Mozzarella Parmesan Onion Mushrooms Pepperoni Coke Pepsi Five$ Seven$ Ten$ Utility Choice
A 0 0 1 1 0 0 0 1 0 0 1 4.5 A
B 0 1 0 0 0 1 0 1 0 0 1 1.5
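The utilities in the last table are the intercept plus the partworths of the levels present in each profile. The Python sketch below reproduces them and, purely for illustration, also applies the logit share formula given earlier while treating the two profiles as the entire choice set (an assumption of this sketch, not a claim made in the text). The variable names are hypothetical.

```python
import math

intercept = 3.28
partworth = {
    "Cheddar": 0.0, "Mozzarella": -1.11, "Parmesan": 2.78,
    "Onion": 0.0, "Mushroom": 0.78, "Pepperoni": 0.89,
    "Coke": 0.0, "Pepsi": -2.0,
    "$5": 0.0, "$7": -0.11, "$10": 0.44,
}

# The two product profiles from the table, written as lists of levels.
profiles = {
    "A": ["Parmesan", "Onion", "Pepsi", "$10"],
    "B": ["Mozzarella", "Pepperoni", "Pepsi", "$10"],
}

# Utility of a profile: intercept plus the partworths of its levels.
utility = {name: intercept + sum(partworth[level] for level in levels)
           for name, levels in profiles.items()}

# Predicted choice: the profile with the highest utility.
choice = max(utility, key=utility.get)

# Logit shares, assuming the choice set contains only these two profiles.
denominator = sum(math.exp(u) for u in utility.values())
share = {name: math.exp(u) / denominator for name, u in utility.items()}

for name in profiles:
    print(f"Product {name}: utility = {utility[name]:.2f}, "
          f"share = {share[name]:.3f}")
print("Predicted choice:", choice)
```

The predicted utilities (4.5 and 1.5) are close to the respondent's stated ratings (4 and 1), and both point to Product A as the preferred profile.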
Conjoint analysis has several advantages, such as uncovering hidden drivers that
may not be apparent to the respondents themselves and evaluating the choice at an
individual level. It can be used to obtain brand equity by determining the popularity
of a brand. However, it has certain disadvantages such as added complexity in
the experimental design because of the inclusion of a large number of attributes.
This increases respondent fatigue in taking the survey and thus compromises
the accuracy of the result. Also, the validity of conjoint analysis depends on the
Now that we have established that CLV is an important metric, we will try to
calculate it. CLV is calculated by discounting all the current and future profits
expected from the customer. It deals with profit margins from each customer
instead of revenue. It also takes into account the percentage of customers retained
by the firm. CLV can be calculated as:
\[
\mathrm{CLV} = \sum_{t} \frac{m_t\, r^t}{(1+i)^t} \qquad (19.13)
\]
Here r/(1 + i − r) is the margin multiple and depends on the retention rate and
discount rate. The higher the retention rate and lower the discount rate, the more
valuable is a customer. Table 19.2 provides the margin multiple based on typical
retention and discount rates. Table 19.2 is a very useful “back of the envelope”
calculation of CLV!
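The margin multiple can be checked numerically. The Python sketch below assumes a constant margin m and a constant retention rate r, with the sum in Eq. (19.13) running from t = 1 over a long horizon. These are assumptions of the sketch, and the dollar figures are hypothetical. Under those assumptions the discounted sum is a geometric series that converges to m × r/(1 + i − r).

```python
def margin_multiple(r, i):
    """Margin multiple r / (1 + i - r) for retention rate r and discount rate i."""
    return r / (1 + i - r)

def clv_by_summation(m, r, i, periods):
    """Discounted sum of m * r^t / (1 + i)^t for t = 1..periods."""
    return sum(m * r ** t / (1 + i) ** t for t in range(1, periods + 1))

# Hypothetical inputs: $100 annual margin, 80% retention, 10% discount rate.
m, r, i = 100.0, 0.80, 0.10

multiple = margin_multiple(r, i)
clv_closed_form = m * multiple
clv_long_sum = clv_by_summation(m, r, i, periods=500)

print(f"Margin multiple: {multiple:.2f}")
print(f"CLV (margin x multiple): {clv_closed_form:.2f}")
print(f"CLV (500-period sum):    {clv_long_sum:.2f}")

# A higher retention rate or a lower discount rate raises the multiple.
print(f"Multiple at r = 0.90: {margin_multiple(0.90, i):.2f}")
print(f"Multiple at i = 0.15: {margin_multiple(r, 0.15):.2f}")
```

The last two lines illustrate the sensitivity noted in the text: moving retention from 80% to 90% raises the multiple far more than a comparable change in the discount rate lowers it.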
CLV is the maximum value provided by customers to the firm. Hence, it can be
used as the upper limit of any customer-centric activity. For example, if a potential
customer’s CLV is $5 then a firm should not spend more than $5 on acquiring this
customer. This holds for customer retention and development activities as well. As
mentioned before, the dollar value provided by CLV can help distinguish cohorts of
customers. So it also helps distinguish future prospects that are similar to currently
profitable customers.
As with any metric, CLV also has some limitations. First, it is very difficult to
calculate profit margins at individual level. Sometimes, revenue and/or cost cannot
be attributed to a single customer, making it difficult to calculate profit margins.
Similarly, it is difficult to accurately calculate retention rates because the rate
calculation requires sophisticated analysis. A small increase in retention rate can
substantially increase the margin multiple, which in turn increases the CLV. Hence,
accuracy of retention rate is extremely important for calculating CLV.
Despite the limitations mentioned above, CLV helps in making important
marketing decisions. It is a customer metric that takes into account both the current
and future profitability of a customer. But CLV should not be the only
decision-making criterion. Firms should take into account factors such as reference/influence
of the customer, and brand reputation, along with the CLV to make marketing
decisions.
Customer Referral Value (CRV)
There are many resource-intensive ways to go about acquiring new business.
While most firms rely on traditional ways of attracting customers and
hence gaining profits, there are effective ways to bring in business at little or no
cost. One such way is customer referral. Think about your friends who are loyal
customers of Tata Motors. Tata Motors can use your friends to refer you to a new
model of car, or simply to the firm itself. In such situations, will Tata Motors incur
a cost to acquire you? Probably not, or maybe a little. Depending on the tangible and
intangible aspects of the referral, you may end up buying a new car from Tata Motors.
And that is where the power of metrics such as customer referral value (CRV) lies.
Let us think about a few firms that engage in the referral programs: Dropbox
referral program secured 4 million users in just 15 months. Inspired by PayPal,
who literally gave free money for referrals, Dropbox added double-sided referral
programs, where both referrer and referee get rewarded. Amazon Prime, PayPal,
Airbnb, and Uber have recently seen huge success with referral programs in their
business strategies.6 Whether it is a B2B or a B2C business, customer referral has
seen significant success in recent times. According to LinkedIn, 84% of B2B buying
decisions start with a referral. Further, customer referral has a good conversion rate.
The probability of a referred customer getting converted is 30% higher as compared
to a lead generated through traditional channels.7
Now the question is how to capitalize on the referrals and design business
strategies. Literature on customer management suggests that firms should engage in
measuring the values of each referral and then decide the follow-up strategies (e.g.,
Kumar 2010; Kumar et al. 2007). Accordingly, customer referral value is defined as
an estimate of lifetime values of any type-one referrals—people who would have not
purchased or become customers without referrals (e.g., Kumar et al. 2007). Firms
should also include the value of type-two referrals—people who would have become
customers anyway. This has implications for managing marketing efforts to acquire
new customers.
Computing CRV is more complicated than computing customer lifetime value (CLV).
It requires estimating the average number of successful
referrals a customer makes after the firm provides some incentive. For
that, we need to look at past behavior, which must include enough variance in
the number of referrals for proper empirical modeling and accuracy. It also
requires deciding how much time can go by while still being sure that a
customer’s referrals were actually prompted by the firm’s referral incentives. Further, it
is critical to understand the conversion rate of referrals to actual customers. Finally,
a customer’s referral value is the present value of her type-one referrals plus the present
value of her type-two referrals (Kumar et al. 2007). In other words, the customer referral
value of a customer is the monetary value associated with the future profits given by each
referred prospect, discounted to present value.
CRV can be calculated by summing up the value of the customers who joined
because of the referral and the value of the customers that would have joined anyway
discounted to present value. We can compute the CRV of customer8 i as
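\[
\mathrm{CRV}_i = \sum_{t=1}^{T} \sum_{y=1}^{n_1} \frac{A_{ty} - a_{ty} - M_{ty} + \mathrm{ACQ1}_{ty}}{(1+r)^t} + \sum_{t=1}^{T} \sum_{y=1}^{n_2} \frac{\mathrm{ACQ2}_{ty}}{(1+r)^t}
\]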
where
Aty = contribution margin by customer y who otherwise would not buy the
product
aty = cost of the referral for customer y
ACQ1ty = savings in acquisition cost from customers who would not join w/o
the referral
ACQ2ty = savings in acquisition cost from customers who would have joined
anyway
T = number of periods that will be predicted into the future (e.g., years)
n1 = number of customers who would not join w/o the referral
n2 = number of customers who would have joined anyway
Mty = marketing costs required to retain customers
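As a rough illustration of how these terms combine, the following Python sketch computes CRV for one referring customer. All figures are hypothetical, the function and variable names are introduced here, and the sketch treats both the referral cost a and the retention marketing cost M as deductions from the contribution margin, which is an assumption of this example.

```python
def present_value(amount, rate, t):
    """Discount an amount received in period t back to the present."""
    return amount / (1 + rate) ** t

def customer_referral_value(type_one, type_two, rate):
    """CRV of one referring customer.

    type_one: one list per period t = 1..T; each entry is a tuple
              (A, a, M, ACQ1) for a customer who would not have joined
              without the referral.
    type_two: one list per period; each entry is the acquisition-cost
              saving ACQ2 for a customer who would have joined anyway.
    """
    total = 0.0
    for t, customers in enumerate(type_one, start=1):
        for margin, referral_cost, marketing_cost, acq1 in customers:
            net = margin - referral_cost - marketing_cost + acq1
            total += present_value(net, rate, t)
    for t, savings in enumerate(type_two, start=1):
        for acq2 in savings:
            total += present_value(acq2, rate, t)
    return total

# Hypothetical example: T = 2 periods, 10% discount rate.
# One type-one referral: margin 100 per period, referral cost 10 and
# acquisition saving 20 in period 1 only, retention marketing cost 5.
type_one = [[(100, 10, 5, 20)], [(100, 0, 5, 0)]]
# Two type-two referrals in period 1, each saving 20 in acquisition cost.
type_two = [[20, 20], []]

crv = customer_referral_value(type_one, type_two, rate=0.10)
print(f"CRV = {crv:.2f}")
```

Keeping the two referral types in separate structures makes it easy to see how much of a customer's CRV comes from genuinely new customers versus acquisition savings on customers who would have joined anyway.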
A firm can first compute the CRV of their customers and then categorize them
based on the value of CRV. A firm may find a particular group of customers
to have significantly higher CRV than that of others. The firm can market to or
provide incentives to this set of customers to increase the referrals and hence new
customer acquisitions. Moreover, a firm can look at profiles, similar to that of the
high CRV group, who have not referred yet and induce them to refer by providing
some incentives. Given the available data, CRV can be computed with any standard
statistical software.
Customer Influence Value (CIV)
Imagine you are looking for the best car that is affordable within your budget.
Also, imagine that you do not have much knowledge of the technical aspects of a
car. You will probably ask your friends and colleagues or search online for advice.
What else can you do to reach a decision? With the growing power of social
platforms (online or offline), you may want to post a question on Facebook,
ask an expert auto-blogger, or follow someone on Instagram whose
ideas, comments, and feedback about the auto industry influence you. Just about
everyone, from big firms to kids, has some sort of strategies, tips, experiences, and
attribution that drive others’ decisions, sales, impressions, etc. That said, most firms
cannot fully grasp the value of social connections, which
goes beyond traditional marketing ploys and tactics.9 It is critical to compute this
influence while determining the value of your customers. Social influence can play
a significant role in your decision to buy a new car, to put your kids in
a particular school, or to choose a dating app. Hence, the value of a
customer can go beyond her purchase value, or referral values. In addition to CLV
and CRV, value of a customer can stem from her influence on other customers. Value
of a customer’s influence refers to the monetary value of the profits associated with
the purchases generated by a customer’s social media influence on other acquired
customers and prospects, discounted to present value (Kumar 2013).
Understanding CIV can be of great value for most firms. For an ice cream retailer,
Kumar and Mirchandani (2012) show that a firm can harness the true value of
customer influence. They design a seven-step process to identify the influencers in
online social network, observe their influences over time, and substantially improve
the firm performance. Indeed customers’ influence has significant value for firms
and firms should measure and implement CIV in their business strategies. For a
detailed understanding of CIV computation, refer to Kumar et al. (2013) and Kumar and
Mirchandani (2012).
3 Applications
Marketing Analytics has evolved over the last century of applications, research
and data collection. Some might even say that marketing was the first consumer
of large “business” data! Wedel and Kannan (2016) provide a readable summary
of the evolution of marketing analytics as well as pose several questions for
researchers. They classify the applications into customer relationship management
(CRM), marketing mix analytics, personalization, and privacy & data security. In
this book, too, there are many applications of the tools: the chapter on social
media analytics has applications to online advertising, A/B experiments, and digital
attribution; the chapters on forecasting analytics, retail analytics, pricing analytics,
and supply chain analytics contain applications to their specialized settings, such
as demand forecasting, assortment planning, and distribution planning; and the case
study “InfoMedia Solutions” contains an application to media-mix planning. Other
All the datasets, code, and other material referred to in this section are available at
www.allaboutanalytics.net.
• Data 19.1: exercise_inter.csv
• Data 19.2: exercise_curvilinear.csv
• Data 19.3: exercise_mediation.csv
• Data 19.4: ABC_hospital_group.csv
• Data 19.5: restaurant_chain_data.csv (“DEA in practice” section)
• Data 19.6: pizza.csv (“Conjoint Analysis Interpretation” section)
• Data 19.7: product_profile_ratings.csv (“Comparing product alternatives” section
in conjoint analysis)
Exercises
Ex. 19.1 Use the data file titled “exercise_inter.csv” and answer the following
questions:
a) Why are independent variables mean-centered? Mean-center advertising, dis-
count, and promotion variables.
b) Is the effect of advertising on sales contingent on the level of discount? Plot a
graph to interpret the interaction effect.
c) Is the effect of advertising on sales contingent on the level of promotion? Plot a
graph to interpret the interaction effect.
Ex. 19.2 Use the data file titled “exercise_curvilinear.csv” and answer the following
questions:
a) Is the effect of advertising on profit curvilinear or linear? Plot a graph if the
relationship is curvilinear.
b) Is the effect of sales promotion on profit curvilinear or linear? Plot a graph if the
relationship is curvilinear.
c) Is the effect of rebates and discount on profit curvilinear or linear? Plot a graph
if the relationship is curvilinear.
Ex. 19.3 Use “exercise_mediation.csv” data and answer the following questions:
a) Does recall mediate the effect of advertising on market share?
b) Why is bootstrapping used in mediation analysis?
Ex. 19.4 Find the most efficient hospital unit of ABC hospital group given the
following information:
Unit                    1       2       3       4       5       6       7       8
Profit (crore)          120     160     430     856     200     320     189     253
Number of specialists   150     100     120     180     220     90      140     160
Area (sq feet)          21,000  32,650  40,000  18,780  19,870  50,000  33,000  19,878
References
Aigner, D. J., Lovell, C. A. K., & Schmidt, P. (1977). Formulation and estimation of stochastic
frontier production functions. Journal of Econometrics, 6(1), 21–37.
Aiken, L.S., and West, S.G, 1991. Multiple regression: Testing and interpreting interactions.
Baccouche, R., & Kouki, M. (2003). Stochastic production frontier and technical inefficiency: A
sensitivity analysis. Econometric Reviews, 22(1), 79–91.
Banker, R. D., Cooper, W. W., Seiford, L. M., Thrall, R. M., & Zhu, J. (2004). Returns to scale in
different DEA models. European Journal of Operational Research, 154, 345–362.
Banker, R. D., & Morey, R. (1986). Efficiency analysis for exogenously fixed inputs and outputs.
Operation Research, 34, 513–521.
Banker, R. D., & Thrall, R. M. (1992). Estimation of returns to scale using data envelopment
analysis. European Journal of Operational Research, 62(1), 74–84.
Baron, R. M., & Kenny, D. A. (1986). The moderator–mediator variable distinction in social psy-
chological research: Conceptual, strategic, and statistical considerations. Journal of Personality
and Social Psychology, 51(6), 1173.
Bera, A. K., & Sharma, S. C. (1999). Estimating production uncertainty in stochastic frontier
production function models. Journal of Productivity Analysis, 12(2), 187–210.
Boussofiane, A., Dyson, R. G., & Thanassoulis, E. (1991). Applied data envelopment analysis.
European Journal of Operational Research, 52(1), 1–15.
Caudill, S. B., Ford, J. M., & Gropper, D. M. (1995). Frontier estimation and firm-specific 1076
inefficiency measure in the presence of heteroskedasticity. Journal of Business & Economic
Statistics, 13(1), 105–111.
Charnes, A., Cooper, W. W., & Rhodes, E. (1978). Measuring the efficiency of decision making
units. European Journal of Operational Research, 2, 429–444.
Charnes, A., Clark, T., Cooper, W. W., & Golany, B. (1985). A developmental study of data
envelopment analysis in measuring the efficiency of maintenance units in the U.S. air forces,
in: R. Thompson and R.M. Thrall (eds.). Annals of Operational Research, 2, 95–112.
Chen, C.-F., & Soo, K. T. (2010). Some university students are more equal than others: efficiency
evidence from England. Economics Bulletin, 30(4), 2697–2708.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation
analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum Associates.
19 Marketing Analytics 657
Cook, W. D., & Seiford, L. M. (2009). Data envelopment analysis (DEA)–Thirty years on.
European Journal of Operational Research, 192(1), 1–17.
Cooper, W. W., Seiford, L. M., & Zhu, J. (2004). Data envelopment analysis. Handbook on data
envelopment analysis (pp. 1–39). Boston, MA: Springer.
Cullinane, K., Wang, T. F., Song, D. W., & Ji, P. (2006). The technical efficiency of container ports:
comparing data envelopment analysis and stochastic frontier analysis. Transportation Research
Part A: Policy and Practice, 40(4), 354–374.
Farrell, M. J., & Fieldhouse, M. (1962). Estimating efficient production functions under increasing
returns to scale. Journal of the Royal Statistical Society. Series A (General), 125, 252–267.
Farris, P. W., Bendle, N. T., Pfeifer, P. E., & Reibstein, D. J. (2010). Marketing metrics:
The definitive guide to measuring marketing performance, Introduction (pp. 1–25). London:
Pearson.
Feng, C., & Fay, S. A. (2016). Inferring salesperson capability using stochastic frontier analysis.
Journal of Personal Selling and Sales Management, 36, 294–306.
Fenn, P., Vencappa, D., Diacon, S., Klumpes, P., & O’Brien, C. (2008). Market structure and the
efficiency of European insurance companies: A stochastic frontier analysis. Journal of Banking
& Finance, 32(1), 86–100.
Førsund, F. R., Lovell, C. K., & Schmidt, P. (1980). A survey of frontier production functions and
of their relationship to efficiency measurement. Journal of Econometrics, 13(1), 5–25.
Gagnepain, P., & Ivaldi, M. (2002). Stochastic frontiers and asymmetric information models.
Journal of Productivity Analysis, 18(2), 145–159.
Greene, W. H. (2010). A stochastic frontier model with correction for sample selection. Journal of
Productivity Analysis, 34(1), 15–24.
Hjalmarsson, L., Kumbhakar, S. C., & Heshmati, A. (1996). DEA, DFA and SFA: a comparison.
Journal of Productivity Analysis, 7(2-3), 303–327.
Jacobs, R. (2001). Alternative methods to examine hospital efficiency: Data envelopment analysis
and stochastic frontier analysis. Health Care Management Science, 4(2), 103–115.
Koetter, M., & Poghosyan, T. (2009). The identification of technology regimes in banking:
Implications for the market power-fragility nexus. Journal of Banking & Finance, 33,
1413–1422.
Kumar, V. (2010). Customer relationship management. Hoboken, NJ: Wiley Online Library.
Kumar, V. (2013). Profitable customer engagement: Concept, metrics and strategies. Thousand
Oaks, CA: SAGE Publications India.
Kumar, V., Andrew Petersen, J., & Leone, R. P. (2007). How valuable is word of mouth? Harvard
Business Review, 85(10), 139.
Kumar, V., Bhaskaran, V., Mirchandani, R., & Shah, M. (2013). Practice prize winner-creating a
measurable social media marketing strategy: Increasing the value and ROI of intangibles and
tangibles for Hokey Pokey. Marketing Science, 32(2), 194–212.
Kumar, V., & Mirchandani, R. (2012). Increasing the ROI of social media marketing. MIT Sloan
Management Review, 54(1), 55.
Kumar, V., & Sharma, A. (2017). Leveraging marketing analytics to improve firm performance:
insights from implementation. Applied Marketing Analytics, 3(1), 58–69.
Kumar, V., Sharma, A., Donthu, N., & Rountree, C. (2015). Practice prize paper-implementing
integrated marketing science modeling at a non-profit organization: Balancing multiple busi-
ness objectives at Georgia Aquarium. Marketing Science, 34(6), 804–814.
Kumbhakar, S. C., & Lovell, C. K. (2003). Stochastic frontier analysis. Cambridge: Cambridge
university press.
Lilien, G. L., Rangaswamy, A., & De Bruyn, A. (2013). Principles of marketing engineering. State
College, PA: DecisionPro.
Lilien, G. L. (2011). Bridging the academic–practitioner divide in marketing decision models.
Journal of Marketing, 75(4), 196–210.
MacKinnon, D. P. (2008). Introduction to statistical mediation analysis. Abingdon: Routledge.
Meeusen, W., & van den Broeck, J. (1977). Efficiency estimation from Cobb-Douglas production
functions with composed error. International Economic Review, 18(2), 435–444.
658 S. Arunachalam and A. Sharma
Ofek, E., & Toubia, O. (2014). Conjoint analysis: A do it yourself guide. Harvard Business School,
note, 515024.
Ouellette, P., & Vierstraete, V. (2004). Technological change and efficiency in the presence of
quasi-fixed inputs: A DEA application to the hospital sector. European Journal of Operational
Research, 154(3), 755–763.
Parsons, J. L. (2002). Using stochastic frontier analysis for performance measurement and
benchmarking. Advances in Econometrics, 16, 317–350.
Preacher, K. J., Curran, P. J., & Bauer, D. J. (2006). Computational tools for probing interactions
in multiple linear regression, multilevel modeling, and latent curve analysis. Journal of
Educational and Behavioral Statistics, 31(4), 437–448.
Preacher, K. J., & Hayes, A. F. (2004). SPSS and SAS procedures for estimating indirect effects in
simple mediation models. Behavior Research Methods, 36(4), 717–731.
Preacher, K. J., Rucker, D. D., & Hayes, A. F. (2007). Addressing moderated mediation hypotheses:
Theory, methods, and prescriptions. Multivariate Behavioral Research, 42(1), 185–227.
Reinhard, S., Lovell, C., & Thijssen, G. (2000). Environmental efficiency with multiple envi-
ronmentally detrimental variables; estimated with SFA and DEA. European Journal of
Operational Research, 121(3), 287–303.
Schmidt, P. (1985). Frontier production functions. Econometric Reviews, 4(2), 289–328.
Seiford, L. M., & Zhu, J. (1999). An investigation of returns to scale under data envelopment
analysis. Omega, 27, 1–11.
Sherman, H. D., & Gold, F. (1985). Bank branch operating efficiency: Evaluation with data
envelopment analysis. Journal of Banking & Finance, 9(2), 297–315.
Venkatesan, R., Farris, P., & Wilcox, R. T. (2014). Cutting-edge marketing analytics: Real world
cases and data sets for hands on learning. London: Pearson Education.
Vyt, D. (2008). Retail network performance evaluation: A DEA approach considering retailers’
geomarketing. The International Review of Retail, Distribution and Consumer Research, 235–
253.
Wadud, A., & White, B. (2000). Farm household efficiency in Bangladesh: a comparison of
stochastic frontier and DEA methods. Applied Economics, 32(13), 1665–1673.
Wedel, M., & Kannan, P. K. (2016). Marketing analytics for data-rich environments. Journal of
Marketing, 80, 97–121.
Chapter 20
Financial Analytics
Krishnamurthy Vaidyanathan
1 Part A: Methodology
1.1 Introduction
K. Vaidyanathan
Indian School of Business, Hyderabad, Telangana, India
e-mail: vaidya_nathan@isb.edu
One can paraphrase Rudyard Kipling’s poem The Ballad of East and West and say,
“Oh, the Q-world is the Q-world, the P-world is the P-world, and never the twain
shall meet.” Truth be told, Kipling’s lofty truism is not quite true in the Quant world.
The Q-world and P-world do meet, but they barely talk to each other. In this section,
we introduce the P- and Q-worlds and their respective theoretical edifices.
1.2.1 Q-Quants
In the Q-world, the objective is primarily to determine a fair price for a financial
instrument, especially a derivative security, in terms of its underlying securities.
The price of these underlying securities is determined by the market forces of
demand and supply. The demand and supply forces come from a variety of sources
in the financial markets, but they primarily originate from buy-side and sell-side
financial institutions. The buy-side institutions are asset management companies
(large mutual funds, pension funds, and investment managers such as PIMCO) that
manage other people’s money, both retail and corporate. The sell-side
institutions are market makers who earn a margin by making markets.
That is, they stand ready to buy (bid) a financial
instrument from a market participant who wants to sell and to
sell (offer) a financial instrument to a participant who wants to buy. They provide this
service for a commission called the bid–offer spread, which is how they primarily
make their money from market making. The trading desks of large investment banks
such as Goldman Sachs, JPMorgan, Citibank, and Morgan Stanley comprise the sell-
side. The Q-quants primarily work in the sell-side and are price-makers as opposed
to P-quants who work in the buy-side and are typically price-takers.
The Q-quants borrow many of their models from physics, starting with the
legendary Brownian motion. Brownian motion is one of the most iconic and
influential imports from physics into mathematical finance and is used extensively for pricing
and risk management. The origins of the Q-world can be traced to the influential
work of Merton (Merton 1969), who used the Brownian motion
process as a starting point to model asset prices. Later, in 1973, Black and Scholes
used the geometric Brownian motion (GBM) to price options in another significant
work (Black and Scholes 1973), for which Scholes, together with Merton, was awarded
the Nobel Prize in 1997; Black had died in 1995.
This work by Black and Scholes gave a fillip to the pricing of derivatives as
these financial instruments could be modeled irrespective of the return expectation
of the underlying asset. In simple terms, what that meant was that even if I think that
the price of a security would fall while you may think that the price of that security
would increase, that is, the return expectations are different, yet we can agree on
the price of a derivative instrument on that security. Another important edifice in the
Q-space is the fundamental theorem of asset pricing by Harrison and Pliska (1981).
This theorem posits that the current price of a security is fair only if there exists a
stochastic process such as a GBM with constant expected value for all future points
in time. Any process that satisfies this property is called a martingale. Because the
expected return is the same for all financial instruments, it implies that there is no
extra reward for risk taking. It is as if all the pricing is done in a make-believe world
called the risk-neutral world, where irrespective of the risk of a security, there is no
extra compensation for risk. All financial instruments in this make-believe world
earn the same return regardless of their risk. The risk-free instruments earn the risk-
free rate as do all risky instruments. In contrast, in the P-world, the economic agents
or investors are, for all intents and purposes, risk-averse, as most people are in the
real world.
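The risk-neutral idea above can be checked numerically. The following is a minimal sketch, not the author's own procedure: it simulates terminal prices of a GBM whose drift is set to the risk-free rate and compares the discounted average with the current price. The function name `simulate_gbm_terminal` and all numerical inputs are hypothetical and chosen only for illustration.

```python
import math
import random


def simulate_gbm_terminal(s0, drift, vol, horizon, n_paths, seed=0):
    """Draw terminal prices S_T of a geometric Brownian motion.

    S_T = S_0 * exp((drift - vol^2 / 2) * T + vol * sqrt(T) * Z), with Z ~ N(0, 1).
    """
    rng = random.Random(seed)
    log_mean = (drift - 0.5 * vol ** 2) * horizon
    log_sd = vol * math.sqrt(horizon)
    return [s0 * math.exp(log_mean + log_sd * rng.gauss(0.0, 1.0))
            for _ in range(n_paths)]


# Hypothetical inputs (assumptions, not taken from the text).
S0, RISK_FREE, VOL, T = 100.0, 0.05, 0.20, 1.0

# In the risk-neutral world every asset drifts at the risk-free rate,
# whatever its volatility.
terminal = simulate_gbm_terminal(S0, RISK_FREE, VOL, T, n_paths=200_000)
discounted_mean = math.exp(-RISK_FREE * T) * sum(terminal) / len(terminal)

print(f"current price            : {S0:.2f}")
print(f"discounted expected price: {discounted_mean:.2f}")
```

Under these assumptions the discounted expected price should stay close to the current price for any choice of volatility, which is the martingale property in its simplest form.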
The Q-quants typically have deep knowledge about a specific product. So a Q-
quant who, for instance, trades credit derivatives for a living would have abundant
knowledge about credit derivative products, but her know-how may not be very
useful in, say, a domain like foreign exchange. Similarly, a Q-quant who does
modeling of foreign exchange instruments may not find her skillset very useful if she
were to try modeling interest rates for fixed income instruments. Most of the finance
that is done in the Q-world is in continuous time because as discussed earlier, the
expectation of the price at any future point of time is equal to its current price. Given
that this holds for all times, the processes used in the Q-world are naturally set in
continuous time. In contrast, in the P-world, the probabilities are for a risk-averse
investor. This is because in the real world, people like you and me need extra return
for risk-taking. Moreover, we measure returns over discrete time intervals such as a
day, a week, a month, or a year, so most processes are modeled in discrete time. The
dimensionality of the problem in the P-world is evidently large because the P-quant
is not looking at a specific instrument or just one particular asset class but at multiple
asset classes simultaneously. The tools used to model the P-world are
primarily those of multivariate statistics, which is what concerns data analysts.
1.2.2 P-Quants
We now discuss the P-world, its origins, tools, and techniques, and contrast
it with the Q-world. The P-world started with the mean–variance framework
by Markowitz in 1952 (Markowitz 1952). Harry Markowitz showed that the
conventional investment evaluation criterion of net present value (NPV) needs to
be explicitly segregated in terms of risk and return. He defined risk as the standard
deviation of the return distribution. He argued that the imperfect correlation between
the returns of stocks can be used to reduce the risk of a portfolio of stocks. He
introduced the concept of diversification, which is the finance equivalent of the
watchword: do not put all your eggs in one basket. Building on the Markowitz
model, the next significant edifice of the P-world was the capital asset pricing model
(CAPM) by William Sharpe in 1964 (Sharpe 1964). William Sharpe converted the
Markowitz “business school” framework to an “economics department” model.
Sharpe started with a make-believe world where all investors operate in the
Markowitz framework. Moreover, all investors in the CAPM world have the same
expectation of returns and variance–covariance. Since all the risky assets in the
financial market must be held by somebody, it turns out that in Sharpe’s “economics
department” model, all investors end up holding the market portfolio—a portfolio
where each risky security has the weight proportional to how many of them are
available. At the time when William Sharpe postulated the model, the notion of
market portfolio was new. Shortly thereafter, the financial industry created mutual
funds which hold a diversified portfolio of stocks, mostly in the proportion of stocks
that are available. It was as if nature had imitated art, though we all know it is almost
always the other way around. In 1990, Markowitz and Sharpe won the Nobel Prize
in Economics for developing the theoretical basis for diversification and CAPM.
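Markowitz's diversification argument can be illustrated with the standard two-asset variance formula. The sketch below is only an illustration: the volatilities, weights, and correlations are hypothetical, and `portfolio_volatility` is a helper name introduced here.

```python
import math


def portfolio_volatility(w1, sd1, sd2, rho):
    """Standard deviation of a two-asset portfolio with weights w1 and 1 - w1."""
    w2 = 1.0 - w1
    variance = ((w1 * sd1) ** 2 + (w2 * sd2) ** 2
                + 2.0 * w1 * w2 * rho * sd1 * sd2)
    return math.sqrt(variance)


# Hypothetical annual volatilities of two stocks (assumptions).
SD1, SD2 = 0.20, 0.30

for rho in (1.0, 0.5, 0.0, -0.5):
    vol = portfolio_volatility(0.5, SD1, SD2, rho)
    print(f"correlation {rho:+.1f} -> portfolio volatility {vol:.4f}")
```

With perfect correlation the portfolio volatility is simply the weighted average of the two volatilities. As the correlation falls, the portfolio volatility falls below that average, and for low enough correlation it falls below the volatility of either stock.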
The next significant edifice, the efficient market hypothesis (EMH), came from
Eugene Fama in 1970. The EMH hypothesizes that no trading strategy based
on already available information can generate super-normal returns (Fama 1970).
The EMH offered powerful theoretical insights into the nature of financial markets.
More importantly, it lent itself to empirical investigation which was imperative and
essential for finance—then a relatively nascent field. As a result, the efficient market
hypothesis is probably the most widely and expansively tested hypothesis in all
social sciences. Eugene Fama won the Nobel Prize in 2013 for his powerful insights
on financial markets.
Another important contribution in the mid-1970s was the arbitrage pricing theory
(APT) model by Stephen Ross (1976). The APT is a multifactor model used to
calculate the expected return of a financial asset. Though both CAPM and APT
provided a foundational framework for asset pricing, they did not use data analytics
because their framework assumed that the probability distribution in the P-world is
known. For instance, if financial asset returns follow an elliptical distribution, every
investor would choose to hold a portfolio with the lowest possible variance for her
chosen level of expected returns. Such a portfolio is called a minimum variance
portfolio, and the framework of portfolio analysis is called the mean–variance
framework. From a finance theory viewpoint, this is a convenient assumption to
make, especially the assumption that asset returns are jointly normal, which is a
special case of an elliptical distribution. Once the assumption that the probability
distribution is already known is made, theoretical implications can be derived on
how the asset markets should function. Tobin proposed the separation theorem
which postulates that the optimal choice of investment for an investor is independent
of her wealth (Tobin 1958). Tobin’s separation theorem holds good if the returns
of the financial assets are multinormal. Extending Tobin’s work, Stephen Ross
postulated the two-fund theorem, which states that if investors can borrow and
lend at the risk-free rate, they will hold a combination of the risk-free asset and the
market portfolio (Ross 1978). Ross later generalized it to a more comprehensive
k-fund separation theorem. The Ross separation theorem holds good if the financial
asset returns follow any elliptical distribution. The class of elliptical distribution
includes the multivariate normal, the multivariate exponential, the multivariate
Student t-distribution, and multivariate Cauchy distribution, among others (Owen
and Rabinovitch 1983).
However, in reality, the probability distribution needs to be estimated from
the available financial information. So a very large component of this so-called
information set, that is, the prices and other financial variables, is observed at
discrete time intervals, forming a time series. Analyzing this information set requires
sophisticated multivariate statistics of the kind used in economics,
called econometrics, which is where most of the data analytics tools come into play.
In contrast to this, in the Q-world, quants mostly look at the pricing of a specific
derivative instrument and get the arbitrage-free price of the derivative instrument
based on the underlying fundamental instrument and other sets of instruments.
However, in the P-world, we try to estimate the joint distribution of all the securities
that are there in a portfolio unlike in the Q-world wherein we are typically concerned
with just one security. The dimensionality in the Q world is small, but in the P-world
it is usually a lot larger. So a number of dimension reduction techniques, mostly
linear factor models like principal component analysis (PCA) and factor analysis,
have a central role to play in the P-world. Such techniques achieve parsimony by
reducing the dimensionality of the data, which is a recurring objective in most data
analytic applications in finance. Since some of these data analytic techniques can
be as quantitatively intense or perhaps more intense than the financial engineering
techniques used in the Q-world, there is now a new breed of quants called the P-
quants who are trained in the data analytic methodologies. Prior to the financial
crisis of 2008, the Q-world attracted a lot of quants in finance. Many with
PhDs in physics and math worked on derivative pricing. From the 1980s, when the
derivatives market started to explode, through the first decade of the twenty-first
century, which culminated in the financial crisis, quantitative finance was
identified with the Q-quants. But in recent years, especially post-crisis, the financial
industry has witnessed a surge of interest and attention in the P-world, and there
is a decrease of interest in the Q-world. This is primarily because the derivatives
markets have shrunk. The second-generation and third-generation types of exotic
derivatives that were sold pre-crisis have all but disappeared. In the last decade,
there has been a severe reduction in both the volume and complexity of derivatives
traded. Another reason why the P-quants have a dominant role in finance is that
their skills are extremely valuable in risk management, portfolio management, and
actuarial valuations, while the Q-quants work mostly on valuation. Additionally, a
newfound interest in data analytics in general, and in the mining of big data specifically,
has been a major driver for the surge of interest in the P-world.
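Dimension reduction by PCA, mentioned above, can be sketched in a few lines. The example is a toy under stated assumptions: four simulated return series are driven by one common factor (the betas and volatilities are hypothetical), and the first principal component of their covariance matrix is found by power iteration. In practice a numerical library would be used; plain Python is used here only to keep the sketch self-contained.

```python
import math
import random


def covariance_matrix(series):
    """Sample covariance matrix of a list of equally long return series."""
    n = len(series[0])
    k = len(series)
    means = [sum(s) / n for s in series]
    return [[sum((series[i][t] - means[i]) * (series[j][t] - means[j])
                 for t in range(n)) / (n - 1)
             for j in range(k)] for i in range(k)]


def first_principal_component(cov, iterations=200):
    """Leading eigenvalue and unit eigenvector of cov by power iteration."""
    k = len(cov)
    vec = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    eigenvalue = sum(vec[i] * sum(cov[i][j] * vec[j] for j in range(k))
                     for i in range(k))
    return eigenvalue, vec


# Hypothetical data: one common "market" factor plus idiosyncratic noise.
rng = random.Random(42)
market = [rng.gauss(0.0, 0.01) for _ in range(2000)]
betas = [0.8, 1.0, 1.2, 0.9]
returns = [[b * m + rng.gauss(0.0, 0.003) for m in market] for b in betas]

cov = covariance_matrix(returns)
eigenvalue, loadings = first_principal_component(cov)
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))

print("loadings:", [round(x, 3) for x in loadings])
print(f"share of total variance explained: {explained:.1%}")
```

When a single factor dominates, one component captures most of the total variance, so the four series can be summarized by one. This is the parsimony the text refers to.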
Fig. 20.1 Methodology of the three-stage framework for data analysis in finance
that combines multiple steps is provided in Sect. 20.2. We now discuss the three-
stage methodology starting with variable identification for different asset classes
such as equities, foreign exchange, fixed income, credit, and commodities.
The objective of the first stage is to estimate the price behavior of an asset. It starts
with identification of the financial variable to model.
Step 1: Identification
The first step of modeling in the P-world is to identify the appropriate variable
which is different for distinct asset classes. The basic idea is to find a process for
the financial variable where the residuals are essentially i.i.d. The most common
process used for modeling a financial variable x is the random walk:
$$x_t = x_{t-1} + \varepsilon_t$$
where $\varepsilon_t$ is the error term and is random. The postulate that a financial variable
follows a random walk is called the random walk hypothesis and is consistent with the
efficient market hypothesis (Fama 1965). In simple terms, it means that
financial variables are fundamentally unpredictable. However, if one looks at any
typical stock price, the price changes in such a way that the order of magnitude
of the change is proportional to the value of the stock price. This kind of behavior
conflicts with homogeneity across time that characterizes a financial variable when
it follows a random walk. As a way out, the variable that is actually modeled is the
logarithm (log) of the stock price, and it has been observed that the log of the stock
price behaves as a random walk. The increments of a simple random walk have a constant mean and
standard deviation, and their probability distribution does not change with time. Such
a process is called a stationary process.
Similar to stock prices, the log of foreign exchange rates or the log of commodity
prices behaves approximately as a random walk. The underlying variable itself, that
is, the stock price, currency rate, or commodity price, is not exactly stationary, but
the log of the stock price, currency rate, or commodity price behaves approximately as
a random walk. A stationary process is one whose probability distribution
does not change with time. As a result, its moments such as variance or mean are
not time-varying.
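The argument for modeling the log of the price can be illustrated with a small simulation. The sketch below assumes Gaussian i.i.d. shocks and a hypothetical daily volatility; `simulate_prices` is a name introduced here for illustration.

```python
import math
import random
import statistics


def simulate_prices(p0, daily_sd, n_steps, seed=1):
    """Prices whose logarithm follows a random walk: ln p_t = ln p_{t-1} + eps_t."""
    rng = random.Random(seed)
    log_price = math.log(p0)
    prices = [p0]
    for _ in range(n_steps):
        log_price += rng.gauss(0.0, daily_sd)  # i.i.d. shock (assumed Gaussian)
        prices.append(math.exp(log_price))
    return prices


prices = simulate_prices(p0=100.0, daily_sd=0.01, n_steps=2000)

# Raw price changes scale with the price level; changes in the log do not.
price_changes = [b - a for a, b in zip(prices, prices[1:])]
log_changes = [math.log(b / a) for a, b in zip(prices, prices[1:])]
relative_changes = [d / p for d, p in zip(price_changes, prices)]

sd_log = statistics.stdev(log_changes)
sd_rel = statistics.stdev(relative_changes)
print(f"s.d. of log changes     : {sd_log:.4f}")
print(f"s.d. of relative changes: {sd_rel:.4f}")
```

The log changes recover the homogeneous shocks, while the raw price changes are those shocks multiplied (approximately) by the prevailing price level.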
However, choosing the right financial variable is as important as the modification
made to it. For example, in a fixed income instrument such as a zero-coupon bond,
the price converges to the face value as the bond approaches maturity. Clearly,
neither the price itself nor its log can be modeled as a random walk. Instead, what
is modeled as a random walk is the yield on bonds, called yield to maturity. Simply
put, yield is the internal rate of return on a bond that is calculated on the cash flows
the bond pays till maturity. And this variable fits a random walk model adequately.
The financial variables that are typically modeled in the different asset classes are
shown in Fig. 20.2.
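The pull of a zero-coupon bond price toward its face value, and the corresponding yield to maturity, can be sketched as follows. Annual compounding and the numerical inputs are assumptions made here for illustration; the function names are hypothetical.

```python
def zero_coupon_price(face, ytm, years):
    """Price of a zero-coupon bond under annual compounding (assumed convention)."""
    return face / (1.0 + ytm) ** years


def zero_coupon_ytm(face, price, years):
    """Yield to maturity implied by the price of a zero-coupon bond."""
    return (face / price) ** (1.0 / years) - 1.0


FACE, YIELD = 100.0, 0.06  # hypothetical face value and yield

# With the yield held fixed, the price converges to the face value at maturity.
for years_left in (5.0, 3.0, 1.0, 0.25, 0.0):
    price = zero_coupon_price(FACE, YIELD, years_left)
    print(f"{years_left:4.2f} years to maturity -> price {price:7.3f}")

implied = zero_coupon_ytm(FACE, zero_coupon_price(FACE, YIELD, 5.0), 5.0)
print(f"implied yield from the 5-year price: {implied:.4f}")
```

Because the price is forced toward the face value regardless of market shocks, it cannot wander freely. The yield has no such terminal constraint, which is why it is the natural candidate for a random walk model.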
Step 2: I.I.D.
Once the financial variable that is of interest is identified, the next step in data
preparation is to obtain a time series of the variables that are of interest. These
variables should display a homogeneous behavior across time, and as shown in Fig.
20.2, the variables are different for different asset classes. For instance, in equities,
currencies, or commodities, it is the log of the stock/currency/commodity price. For
fixed income instruments, the variable of interest may not be the price or the log of
price. The variable of interest would be the yield to maturity of the fixed income
security.
Once we have the raw data of the financial variable, we test if the financial
variable follows a random walk using statistical tests such as the Dickey–Fuller
test or the multiple variance ratio test (Dickey and Fuller 1979; Campbell et al.
1997). Often, financial variables follow less random processes and may therefore be
predictable to some degree. This is known as the non-random walk hypothesis. Andrew
Lo and Craig MacKinlay, at MIT Sloan and Wharton, respectively, in their book
A Non-Random Walk Down Wall Street, present a number of tests and studies that
validate that there are trends in financial markets and that the financial variables
identified in Step 1 are somewhat predictable. They are predictable both in cross-
sectional and time series terms. As an example of the ability to predict using
cross-sectional data, the Fama–French three-factor model postulated by Eugene
Fama and Kenneth French uses three factors to describe stock returns (Fama and
French 1993). An example of predictability in time series is that the financial
variable may display some kind of mean reversion tendency. What that means is
that if the value of the financial variable is quite high, it will have a propensity
to decrease in value and vice versa. For example, if yields become very high,
there may be propensity for them to come back to long-run historical average
levels. In such cases, the features that cause deviation from the random walk are
extracted out so that the residuals display i.i.d. behavior. The models used would
depend on the features displayed by the financial variable. For example, volatility
clustering, like mean reversion, is a commonly observed feature in financial variables.
When markets are extremely volatile, the financial variable fluctuates a lot, and
the probability of a large move is higher than otherwise. Techniques
like autoregressive conditional heteroscedasticity model (ARCH) or generalized
autoregressive conditional heteroscedasticity model (GARCH) are used to factor
out volatility clustering (Engle 1982). If the variable displays some kind of mean
reversion, one might want to use autoregressive moving average (ARMA) models
if it is a univariate case or use vector autoregression (VAR) models in multivariate
scenarios (Box et al. 1994; Sims 1980). These are, in essence, econometric models
which can capture linear interdependencies across multiple time series and are
fundamentally a general form of autoregressive models (AR) (Yule 1927). We
could also use stochastic volatility models, which are commonly
used to model volatility clustering. Long memory processes primarily warrant fractional
integration models (Granger and Joyeux 1980). A fractionally integrated process displays
long memory, which principally means that the increments of the financial
variable display autocorrelation. The increments therefore are not i.i.d., and these
autocorrelations persist across multiple lags. For instance, the value of the random
variable at time t + 1 is a function of its values at times t, t − 1, t − 2, and so on. The
autocorrelations decay very gradually across lags, and such processes are therefore
called long memory processes. Such features, be they
long memory, volatility clustering, or mean reversion, are modeled using techniques
such as fractional integration, GARCH, or AR processes, respectively. After such
patterns are accounted for, we are left with i.i.d. shocks with no discernible
pattern.
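As an illustration of extracting a feature so that the residuals look i.i.d., the sketch below simulates a mean-reverting yield as an AR(1) process, fits the AR(1) by ordinary least squares, and compares the lag-1 autocorrelation of the raw series with that of the residuals. The parameter values and helper names are hypothetical, and OLS is only one of several possible estimation choices.

```python
import random


def fit_ar1(series):
    """OLS fit of x_t = c + phi * x_{t-1} + e_t; returns (c, phi, residuals)."""
    x_prev, x_curr = series[:-1], series[1:]
    n = len(x_prev)
    mean_prev = sum(x_prev) / n
    mean_curr = sum(x_curr) / n
    sxy = sum((a - mean_prev) * (b - mean_curr) for a, b in zip(x_prev, x_curr))
    sxx = sum((a - mean_prev) ** 2 for a in x_prev)
    phi = sxy / sxx
    c = mean_curr - phi * mean_prev
    residuals = [b - c - phi * a for a, b in zip(x_prev, x_curr)]
    return c, phi, residuals


def lag1_autocorrelation(xs):
    """Sample autocorrelation of a series at lag 1."""
    m = sum(xs) / len(xs)
    num = sum((a - m) * (b - m) for a, b in zip(xs, xs[1:]))
    den = sum((a - m) ** 2 for a in xs)
    return num / den


# Hypothetical mean-reverting yield: long-run level 6%, persistence 0.95.
rng = random.Random(7)
LONG_RUN, PHI, SHOCK_SD = 0.06, 0.95, 0.002
yields = [LONG_RUN]
for _ in range(5000):
    yields.append(LONG_RUN + PHI * (yields[-1] - LONG_RUN) + rng.gauss(0.0, SHOCK_SD))

c, phi, residuals = fit_ar1(yields)
print(f"estimated phi                 : {phi:.3f}")
print(f"lag-1 autocorr. of yields     : {lag1_autocorrelation(yields):.3f}")
print(f"lag-1 autocorr. of residuals  : {lag1_autocorrelation(residuals):.3f}")
```

The raw series is strongly autocorrelated, whereas the residuals left after the AR(1) fit show almost none. That is the sense in which the pattern has been accounted for.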
Step 3: Inference
The third step in estimation after the financial variable is identified and after we have
gotten to the point of i.i.d. shocks is to infer the joint behavior of i.i.d. shocks. In
the estimation process, we typically determine those parameters in the model that
get us to an i.i.d. distribution. We explain the first three steps using data on S&P
500 for the period from October 25, 2012, to October 25, 2017. The first step is
to identify the financial variable of interest.
Fig. 20.3 S&P500 Index Returns (vertical axis: returns, with ±2 S.D. bands; horizontal axis: 25/Oct/12 to 25/Oct/17)
We work with returns rather than with
absolute index levels of S&P 500 for reasons mentioned in Step 1. From the daily
index levels, the 1-day returns are calculated as follows:
This return rt itself is not distributed in an i.i.d. sense. Neither are the daily returns
identical nor are they independent. We can infer from a quick look at the graph of
the returns data in Fig. 20.3 that the returns are not i.i.d.
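A minimal sketch of computing 1-day returns from index levels is given below. It assumes log returns, $r_t = \ln(P_t / P_{t-1})$, a common convention that is consistent with modeling the log of the price in Step 1. A simple percentage change is an equally plausible convention. The index levels are hypothetical.

```python
import math


def log_returns(levels):
    """1-day log returns r_t = ln(P_t / P_{t-1}) from a list of index levels."""
    return [math.log(curr / prev) for prev, curr in zip(levels, levels[1:])]


# Hypothetical daily closing levels (not actual S&P 500 data).
index_levels = [2500.0, 2525.0, 2499.75, 2512.25]
returns = log_returns(index_levels)

for day, r in enumerate(returns, start=1):
    print(f"day {day}: return {r:+.4%}")
```

One convenience of log returns is that they add up over time: the sum of the daily returns equals the log return over the whole period.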
One may refer to Chap. 5, on data visualization, for a better understanding
of how to interpret the graph. One of the things to observe from the graph is
that if the return on a given day was extreme, whether high or low, the
return on the subsequent day was usually also large in magnitude. That is,
if the return was volatile at time t, it is more likely to be volatile than
stable at time t + 1. So in this case, the data seem
to suggest that the financial variable is conditionally heteroscedastic, which
means that the standard deviation is neither independent nor identical across
time periods. To accommodate conditional heteroscedasticity, we can use the
GARCH(1,1) model (Bollerslev 1986). This model accounts for autocorrelation and
heteroscedasticity, that is, for correlations among errors at different time periods
t and different variances of errors at different times. The way we model variance
$\sigma_t^2$ is:
$$\sigma_t^2 = \omega + \alpha r_{t-1}^2 + \beta \sigma_{t-1}^2$$
average daily returns to be zero. Using the Gaussian distribution, the likelihood or
the probability of $r_t/\sigma_t$ being normally distributed is given by:
$$\frac{1}{\sqrt{2\pi\sigma_t^2}}\, e^{-\frac{1}{2}\left(\frac{r_t}{\sigma_t}\right)^2}$$
This completes Step 2 of reducing the variable to an i.i.d. process. The next
step is to compute the joint distribution. Since the variables $r_t/\sigma_t$ across time are
independent, the joint likelihood $L$ of the sample is calculated as the product of
the above likelihood function using the property of independence across the time
series of $n$ data points. Therefore:
$$L = \prod_{t=1}^{n} \frac{1}{\sqrt{2\pi\sigma_t^2}}\, e^{-\frac{1}{2}\left(\frac{r_t}{\sigma_t}\right)^2}$$
Since the above product would be a very small number in magnitude, the natural
log of the above likelihood is maximized in MLE. This log-likelihood is given by:
$$\ln(L) = -\frac{1}{2}\sum_{t=1}^{n}\left[\ln\left(2\pi\sigma_t^2\right) + \left(\frac{r_t}{\sigma_t}\right)^2\right]$$
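To make the log-likelihood concrete, the following is a minimal sketch of how it could be evaluated for a given set of GARCH(1,1) parameters. The function names, the sample returns, the parameter values, and the choice to seed the first variance with the sample mean of squared returns are all assumptions made for illustration; they are not taken from the book's spreadsheets.

```python
import math


def garch_variances(returns, omega, alpha, beta):
    """Conditional variances sigma_t^2 from the GARCH(1,1) recursion."""
    # Assumption: the first variance is seeded with the sample mean of r^2.
    var0 = sum(r * r for r in returns) / len(returns)
    variances = [var0]
    for r_prev in returns[:-1]:
        variances.append(omega + alpha * r_prev ** 2 + beta * variances[-1])
    return variances


def garch_log_likelihood(returns, omega, alpha, beta):
    """Gaussian log-likelihood ln(L) for zero-mean returns."""
    variances = garch_variances(returns, omega, alpha, beta)
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + r * r / v
        for r, v in zip(returns, variances)
    )


# Hypothetical daily returns and parameter values, for illustration only.
sample_returns = [0.004, -0.012, 0.009, -0.021, 0.015, 0.002, -0.007]
ll = garch_log_likelihood(sample_returns, omega=1e-6, alpha=0.08, beta=0.90)
print(f"log-likelihood: {ll:.4f}")
```

In MLE, an optimizer would search over ω, α, and β for the values that make this quantity as large as possible; that search is not shown here.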
Step 4: Projection
The fourth step is projection. We explain this step using a simple example from
foreign exchange markets. Let us say that the financial variable has been estimated using
a technique such as MLE, GMM, or Bayesian estimation (Hansen 1982). The next
step is to project the variable using the model. Say the horizon is 1 year, and we
want to calculate the expected profit or loss of a certain portfolio. A commonly
used technique for this is Monte Carlo simulation, which is another import from
physics (Fermi and Richtmyer 1948). We project the financial variable in the
Q-space using risk-neutral parameters and processes. This also helps us to understand
how the P- and Q-worlds converge.
Let us say we want to project the value of the exchange rate of the Indian Rupee
against the US Dollar (USD/INR). USD/INR as of end October 2017 is 65. We
assume that the returns follow a normal distribution characterized by its mean and
standard deviation. The projection could be done either in the P-space or in the
Q-space. In the P-space, the projection would be based on the historical average annual
return and the historical annualized standard deviation, and we would use these first
and second moments to project the USD/INR. The equivalent method in the Q-world
would be to project using the arbitrage-free drift.
To estimate the arbitrage-free drift, let us assume that the annualized USD
interest rate for 1 year is 1.7% and that the 1-year INR rate is 6.2%. A dollar
parked in a savings account in the USA should yield the same return as that dollar
being converted to rupees, parked in India, and then reconverted to USD. This
equality is what rules out arbitrage in the foreign exchange market. It
implies that the rupee should depreciate against the dollar by about 4.5% over the
year, the difference between the two interest rates. This can
also be understood from the uncovered interest rate parity criterion for a frictionless
global economy (Frenkel and Levich 1981). The criterion specifies that real interest
rates should be the same all over the world. Let us assume that the real interest rate
globally is 0.5% and that US inflation is 1.2%, implying a nominal interest
rate of 1.7%. Likewise, inflation in India is 5.7%, implying a nominal interest
rate of 6.2%. Therefore,
the Indian currency (Rupee) should depreciate against the US currency
(Dollar) by the differential of their respective inflation rates, 4.5% (= 5.7% − 1.2%).
Let us assume that the standard deviation of USD/INR returns is 10%. Once we
have the mean and standard deviation, we can run a Monte Carlo simulation to
project the financial variable of interest. Such an exercise could be of interest if
revenues are in dollars and a substantial portion of expenditure is in dollars. For a
more detailed reading of the applicability of this exercise to the various components
of earnings such as revenues and cost of goods sold in foreign currency, please
refer to “Appendix C: Components of earning of the CreditMetrics™” document by
JPMorgan (CreditMetrics 1999).
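As a worked check of the arbitrage-free drift, the no-arbitrage argument can be written out with the numbers used in the text. The symbols $S_0$ (today's rate), $F$ (the no-arbitrage rate 1 year ahead), $r_{\mathrm{USD}}$, and $r_{\mathrm{INR}}$ are notation introduced here for illustration, and annual compounding is assumed:

```latex
F = S_0 \,\frac{1 + r_{\mathrm{INR}}}{1 + r_{\mathrm{USD}}}
  = 65 \times \frac{1.062}{1.017} \approx 67.88,
\qquad
\frac{F}{S_0} - 1 \approx 4.42\% \approx r_{\mathrm{INR}} - r_{\mathrm{USD}} = 4.5\%
```

The exact ratio gives a drift of about 4.4%; the 4.5% used in the text is the first-order approximation given by the rate differential.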
Monte Carlo Simulation
We first pick a random number, say x, from the standard normal distribution. We
then scale (multiply) x by the standard deviation and add the average return to get a random
return R drawn from the normal distribution of returns, that is, R = μ + σx, where μ and
σ are the average return and the standard deviation.
Note that the average return and standard deviation are adjusted for the daily horizon
by dividing by 365 and the square root of 365, respectively. After scaling the variable,
we multiply the price of USD/INR at t by (1 + R) to project the value of USD/INR to
the next day. The above steps are explained in the spreadsheet “Financial Analytics
Steps 4 and 5.xlsx” (available on the book’s website). An example of the above
simulation, with USD/INR at 65 on day 1, is run for seven simulations and
10 days in Table 20.1 and Fig. 20.4.
Once we have the projections of the currency rates in forward points in time, it is
an easy task to then evaluate the different components of earnings that are affected
by the exchange rate.
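The simulation steps can be sketched in a few lines of Python. This is an illustrative sketch, not the book's spreadsheet: the function name, the number of paths, and the random seed are assumptions, while the spot rate of 65, the 4.5% drift, and the 10% standard deviation are the values used in the text.

```python
import math
import random


def simulate_fx_paths(spot, annual_drift, annual_vol, n_days, n_paths, seed=0):
    """Simulate daily FX paths with normally distributed one-day returns."""
    rng = random.Random(seed)
    daily_drift = annual_drift / 365.0           # mean scaled by 365
    daily_vol = annual_vol / math.sqrt(365.0)    # s.d. scaled by sqrt(365)
    paths = []
    for _ in range(n_paths):
        path = [spot]
        for _ in range(n_days):
            x = rng.gauss(0.0, 1.0)              # standard normal draw
            daily_return = daily_drift + daily_vol * x
            path.append(path[-1] * (1.0 + daily_return))
        paths.append(path)
    return paths


# Values from the text: spot 65, drift 4.5%, volatility 10%, 1-year horizon.
paths = simulate_fx_paths(65.0, 0.045, 0.10, n_days=365, n_paths=2000, seed=42)
terminal = [p[-1] for p in paths]
print(f"mean USD/INR after 1 year: {sum(terminal) / len(terminal):.2f}")
```

Each inner list is one simulated path of USD/INR; the last element of each path is the projected rate at the 1-year horizon.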
[Fig. 20.4: simulated USD/INR paths; horizontal axis Nov-2017 to Nov-2018, vertical axis 50.00 to 75.00]
Step 5: Pricing
The fifth step is pricing, which logically follows from projection. The example
that we used in Step 4 was the projection of USD/INR for a horizon of 1 year.
Pricing allows us to arrive at the ex-ante expected profit or loss of a specific
instrument based on the projections done in Step 4. In a typical projection technique
like Monte Carlo, each of the simulated paths is equally likely. So the probability of each
path is 1/n, where n is the number of simulations done. The
ex-ante profit or loss of the instrument is given by 1/n times the sum of the
profit or loss of the instrument over all simulated paths. For instance, in the case of a forward contract on
USD/INR that pays off 1 year from now, the payoff would be calculated at the end
of 1 year as the projected value of USD/INR minus the forward rate if it is a long
forward contract, and vice versa for a short forward contract. A forward contract is
a contract between two parties to buy or sell an asset at a specified future point in
time. In this case, the asset is USD/INR, and the specified future point in time is
1 year. The party that buys USD is said to be “long” the forward contract, while
the party selling USD 1 year from now is “short” the forward contract. The
expected ex-ante payoff is the sum of the payoffs in all the scenarios divided
by the number of simulations. After pricing, we move on to the next stage of risk
management.
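The pricing rule for the forward contract can be illustrated with a short sketch. The function name, the three simulated year-end rates, and the forward rate of 67.88 are hypothetical values chosen for the illustration; in practice the rates would come from the projection in Step 4.

```python
def expected_forward_payoff(terminal_rates, forward_rate, long=True):
    """Average forward payoff across equally likely simulated scenarios."""
    sign = 1.0 if long else -1.0
    payoffs = [sign * (rate - forward_rate) for rate in terminal_rates]
    return sum(payoffs) / len(payoffs)   # each scenario has weight 1/n


# Hypothetical simulated USD/INR values after 1 year and a hypothetical
# contracted forward rate; neither comes from the text.
simulated_rates = [66.0, 68.0, 70.0]
forward_rate = 67.88
print(expected_forward_payoff(simulated_rates, forward_rate, long=True))
print(expected_forward_payoff(simulated_rates, forward_rate, long=False))
```

The long and short payoffs are mirror images of each other, so the two expected values differ only in sign.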
The second stage of data analytics in finance concerns risk management. It involves
analysis for risk aggregation, risk assessment, and risk attribution. The framework
can be used for risk analysis of a portfolio or even for an entire financial institution.
Step 1: Aggregation
The first of the three steps in risk management is risk aggregation. The aggregation
step is crucial because all financial institutions need to know the value of the
portfolio of their assets and also the aggregated risk exposures in their balance
sheet. After we have priced the assets at the instrument level, to calculate the value
of the portfolio, we need to aggregate them keeping in view the fact that the risk
drivers are correlated. The correlation of the various risk drivers and the financial
instruments’ risk exposure is thereby aggregated. We exposit aggregation using one
of the commonly used tools for risk aggregation called copula functions (Ali et
al. 1978). Copula functions, especially Gaussian copulas, are used extensively by
the financial industry and the regulators due to their analytical tractability. The
Basel Committee on Banking Supervision relies exclusively on Gaussian copula
to measure risk capital of banks globally. A copula function is a multivariate
probability distribution for which the marginal distributions are known; it
captures the dependence between the correlated random variables.
Copula in Latin means to link or to tie. Copulas are widely used in both the P-world
and the Q-world for risk aggregation and optimization. The underlying edifice
for copula functions is Sklar’s theorem (Sklar 1959). This theorem posits that
any multivariate joint distribution of risk drivers can be described in terms of the
univariate marginal distributions of the individual risk drivers and a copula function
that describes the dependence structure between these correlated random variables.
As usual, we discuss this step using an
example. Let us say that there is a loan portfolio comprising N exposures.
To keep the example computationally simple, we keep N = 5. So we have a bank
which has lent money to five different corporates, which we index i = 1, 2, . . . ,
5. We assume for simplicity that Rs. 100 is lent to each of the five firms. So the
loan portfolio is worth Rs. 500. To aggregate the risk of
this Rs. 500 loan portfolio, we will first describe the marginal distribution of
credit for each of the five corporates. We will then use the Gaussian copula function
to get the joint distribution of the portfolio of these five loans.
Let us assume for simplicity that each corporate has a probability of default of
2%. Therefore, there is a 98% chance of survival of the corporate in a year. The
horizon for the loan is 1 year. Assume that in the event of a default, the bank can
recover 50% of the loan amount. The marginal distributions are identical in our
example for ease of exposition, but the copula models allow for varying distributions
as well. Based on the correlation structure, we want to
calculate the joint distribution of credit of these corporates. We model this
using a one-factor model. The single factor is assumed to be the state of the economy
M, which is assumed to have a Gaussian distribution.
To generate a one-factor model, we define random variables $x_i$ ($1 \le i \le N$):
$$x_i = \rho_i M + \sqrt{1 - \rho_i^2}\, Z_i$$
In the above equation, the single factor M and the idiosyncratic factor $Z_i$ are
independent of each other and are standard normal variables with mean zero and
unit standard deviation. The correlation coefficient $\rho_i$ satisfies $-1 \le \rho_i < 1$. The above
equation defines how the assets of the firm are correlated with the economy M. The
correlation between the assets $x_i$ of firm i and assets $x_j$ of firm j is $\rho_i \rho_j$.
Let H be the cumulative normal distribution function of the idiosyncratic factor
$Z_i$. Therefore:
$$\text{Probability}(x_i < x \mid M) = H\left(\frac{x - \rho_i M}{\sqrt{1 - \rho_i^2}}\right)$$
The assets of each of these corporates are assumed to have a Gaussian distribution.
Note that the probability of default is 2%, corresponding to a standard normal
value of −2.05. If the value of the asset, standardized with its mean and standard
deviation, is more than −2.05, the entity survives; otherwise it defaults. The conditional
probability that the ith entity will survive is therefore:
$$S_i(x_i < x \mid M) = 1 - H\left(\frac{x - \rho_i M}{\sqrt{1 - \rho_i^2}}\right)$$
The marginal distribution of each of the assets is known, but we do not know the
joint distribution of the loan portfolio. So, we model the portfolio distribution using
copulas based on the correlations that each of these corporates has. The performance
of the corporate depends on the state of the economy. There is a correlation between
these two variables. This can be explained by noting that certain industries such
as steel and cement are more correlated with the economy than others like
fast-moving consumer goods. Assume that the correlation of the first corporate with
the economy is 0.2, the second is 0.4, the third is 0.5, the fourth is 0.6, and the
fifth is 0.8. So the pairwise correlation can be calculated as the product of the two
correlations to the single factor, which in our example is the economy. We model
the state of the economy as a standard normal random variable in the range from
−3.975 to 3.975 in intervals of 0.05. We take the mid-point of these intervals. Table
20.2 shows these values for the first ten states of the economy. The probability of
the economy being in those intervals is calculated in column 3 of Table 20.2 using
the Gaussian distribution. This is given by:
$$\text{Prob}\left(m - \frac{\Delta}{2} \le M \le m + \frac{\Delta}{2}\right)$$
where M follows the standard normal distribution, m is the mid-point of the interval,
and Δ is the step size. The way to interpret the state of the economy is that when
it is highly negative such as −2, then the economy is in recession. And if it is high
such as greater than 2, the economy is booming, and if it is close to zero, then the
health of the economy is average. Once we have the probabilities for the state of
the economy (Table 20.2), we calculate the conditional probability of a corporate
defaulting, and this again depends on the correlation between its asset values and
the states of the economy.
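The discretization of the economy and the conditional default probability can be sketched as follows. The constant and function names are assumptions made for illustration; the interval width of 0.05, the range from −3.975 to 3.975, the 2% default probability, and the five correlations are the values given in the text.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()            # standard normal distribution
STEP = 0.05                          # width of each economy interval
N_STATES = 159                       # intervals covering -3.975 .. 3.975


def economy_states():
    """Mid-points m of the intervals and Prob(m - STEP/2 <= M <= m + STEP/2)."""
    states = []
    for k in range(N_STATES):
        m = -3.95 + k * STEP
        prob = STD_NORMAL.cdf(m + STEP / 2) - STD_NORMAL.cdf(m - STEP / 2)
        states.append((m, prob))
    return states


def conditional_default_prob(rho, m, default_prob=0.02):
    """Probability that a firm defaults given the economy is in state m."""
    threshold = STD_NORMAL.inv_cdf(default_prob)     # about -2.05
    return STD_NORMAL.cdf((threshold - rho * m) / (1.0 - rho ** 2) ** 0.5)


states = economy_states()
for rho in (0.2, 0.4, 0.5, 0.6, 0.8):                # correlations from the text
    recession = conditional_default_prob(rho, -2.0)
    boom = conditional_default_prob(rho, 2.0)
    print(f"rho={rho}: P(default|M=-2)={recession:.4f}, P(default|M=2)={boom:.6f}")
```

The more strongly a firm is correlated with the economy, the more its default probability rises in a recession and falls in a boom, while its average over all states stays close to 2%.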
Let π(k) be the probability that exactly k firms default in the N-firm loan
portfolio. Conditional on the state of the economy M, the defaults of the
individual firms are independent. Therefore, the conditional probability that all the N firms will
survive is:
$$\pi(0 \mid M) = \prod_{i=1}^{N} S_i(x_i < x \mid M)$$
Similarly,
$$\pi(1 \mid M) = \pi(0 \mid M) \sum_{i=1}^{N} \frac{1 - S_i(x_i < x \mid M)}{S_i(x_i < x \mid M)}$$
Define
$$w_i = \frac{1 - S_i(x_i < x \mid M)}{S_i(x_i < x \mid M)}$$
Conditioned on the state of the economy, the chance of exactly k firms defaulting
is given by the combinatorial probability
$$\pi(k \mid M) = \pi(0 \mid M) \sum w_{q(1)} w_{q(2)} \cdots w_{q(k)}$$
where $\{q(1), q(2), \ldots, q(k)\}$ is a set of k distinct firms chosen from the N firms, and the sum runs over all
$$\frac{N!}{k!\,(N-k)!}$$
such sets.
Table 20.4 Conditional joint survival probabilities for one-factor Gaussian copula

State of economy   k=0        k=1        k=2        k=3        k=4        k=5
1                  5.37E-07   0.001336   0.035736   0.24323    0.134507   0.220673
2                  9.99E-07   0.001746   0.041595   0.258263   0.136521   0.206578
3                  1.83E-06   0.002267   0.048212   0.273353   0.138099   0.192794
4                  3.3E-06    0.002926   0.055647   0.288374   0.139211   0.179355
5                  5.84E-06   0.003755   0.063955   0.30319    0.139833   0.166295
6                  1.02E-05   0.004791   0.073188   0.317651   0.139941   0.153647
7                  1.75E-05   0.006079   0.08339    0.331596   0.139519   0.141441
8                  2.96E-05   0.007672   0.094599   0.344857   0.138554   0.129705
9                  4.93E-05   0.00963    0.106837   0.357256   0.13704    0.118463
10                 8.08E-05   0.012025   0.120116   0.368611   0.134975   0.107738
Then we calculate the discrete joint distribution of survival of all the firms
together. There are six possibilities: all firms survive, one firm fails, two firms fail,
three firms fail, four firms fail, and all five firms fail. This is tabulated in Table 20.4
for the first ten states.
The above steps are explained in the spreadsheet “Financial Analytics Step
6.xlsx” (available on the book’s website). For each outcome, we have the losses
corresponding to that precise outcome. So, using the copula function, we have
effectively used the information on the marginal distribution of the assets of each firm
and their correlation with the economy to arrive at the joint distribution of the
survival outcomes of the firms. We are thus able to aggregate the risk of the portfolio
even though our starting point was only the marginal probability distributions
of the individual loans.
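The aggregation can be sketched end to end in Python. Instead of enumerating the combinatorial sum directly, this sketch builds the conditional distribution of the number of defaults one firm at a time, which yields the same conditional probabilities under the independence assumption. The function names are assumptions, and the output is not claimed to reproduce Table 20.4; the correlations, the 2% default probability, the Rs. 100 exposure, and the 50% recovery are from the text.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
RHOS = [0.2, 0.4, 0.5, 0.6, 0.8]     # correlations with the economy (from the text)
DEFAULT_PROB = 0.02                  # marginal default probability (from the text)
STEP = 0.05


def default_count_distribution(default_probs):
    """P(exactly k defaults) for independent firms, built one firm at a time."""
    dist = [1.0]                                     # zero firms: k = 0 for sure
    for p in default_probs:
        new = [0.0] * (len(dist) + 1)
        for k, prob_k in enumerate(dist):
            new[k] += prob_k * (1.0 - p)             # this firm survives
            new[k + 1] += prob_k * p                 # this firm defaults
        dist = new
    return dist


def portfolio_default_distribution():
    """Mix the conditional distributions over the discretized economy states."""
    threshold = STD_NORMAL.inv_cdf(DEFAULT_PROB)
    total = [0.0] * (len(RHOS) + 1)
    for s in range(159):
        m = -3.95 + s * STEP
        weight = STD_NORMAL.cdf(m + STEP / 2) - STD_NORMAL.cdf(m - STEP / 2)
        probs = [STD_NORMAL.cdf((threshold - rho * m) / (1 - rho ** 2) ** 0.5)
                 for rho in RHOS]
        for k, prob_k in enumerate(default_count_distribution(probs)):
            total[k] += weight * prob_k
    return total


pi = portfolio_default_distribution()
# Loss per default: Rs. 100 lent, 50% recovered (both from the text).
expected_loss = sum(k * 50.0 * p for k, p in enumerate(pi))
print([round(p, 6) for p in pi], round(expected_loss, 3))
```

The list `pi` holds the unconditional probabilities of 0 to 5 defaults, from which the loss distribution of the Rs. 500 portfolio follows directly.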
Step 2: Assessment
that the risk manager can be confident that 99 times out of 100, the loss from the
portfolio will not exceed the VaR metric. This metric is also used for financial
reporting and for calculating the regulatory capital of financial institutions. VaR is an
ex-ante assessment in the Bayesian sense—the VaR number is a value that is ex-ante
assessed as the loss that can possibly result for the portfolio. It only incorporates
information available at the time of computation. VaR is used for governance in
pension plans, endowments, trusts, and other such risk-averse financial institutions
where the investment mandate often defines the maximum acceptable loss with
given probabilities. A detailed description of how Value at Risk has been used
to calculate capital can be found in Chapter 3, “VAR-Based Regulatory Capital,”
of the book Value at Risk: The New Benchmark for Managing Financial Risk by
Philippe Jorion. This particular measure incorporates the previous steps of portfolio
aggregation. We illustrate the step with an example: a
VaR computation for a simple portfolio comprising 1 USD, 1 EUR, 1 GBP, and
100 JPY. The value of the portfolio in INR terms is Rs. 280 (1 USD = Rs. 64, 1 Euro
(EUR) = Rs. 75, 1 Sterling (GBP) = Rs. 82, 100 Yen (JPY) = Rs. 59). We want to
calculate the possible loss or gain from this
portfolio at the end of 1 year. To aggregate the risk, we make use of the correlation matrix between the
currencies as described in Table 20.5.
We will use the Cholesky decomposition, which decomposes the
correlation matrix into the product of a lower triangular matrix L and its
conjugate transpose L*, an upper triangular matrix
(Press et al. 1992). The only condition is that the correlation matrix should be a
positive definite Hermitian matrix. This decomposition is akin to computing
the square root of a real number.
$$A = LL^{*}$$
For each currency we then simulate a random number drawn from a standard
normal distribution. These are independently drawn. This vector of independent
draws can be converted to a vector of correlated draws by multiplying with the
decomposed matrix.
$$Y = LX$$
where Y is the vector of correlated draws and X is the vector of i.i.d. draws.
This process is repeated multiple times to arrive at a simulation of correlated
draws. Using Step 4 we project the log of the prices of USD/INR, EUR/INR,
GBP/INR, and JPY/INR. We price the exchange rates, aggregate the portfolio,
and subtract the original value to get the portfolio loss or gain. These steps are
repeated for a given number of simulations as shown in Table 20.7.
We then calculate the VaR at the 99% confidence level from the simulated gains or losses.
The above steps are explained in the spreadsheet “Financial Analytics Step 7.xlsx”
(available on the book’s website). For a simulation run 100 times on the above data,
a VaR of −38 INR was obtained at the 99% confidence level (the 1% tail of the simulated gains and losses).
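The VaR computation can be sketched as below. The correlation matrix, the volatilities, the zero drift, the number of simulations, and the seed are hypothetical choices made for this illustration (the matrix is not the one in Table 20.5), so the resulting number will not match the −38 INR reported in the text. Only the four position values are taken from the example.

```python
import math
import random


def cholesky(matrix):
    """Lower triangular L with L L^T equal to a symmetric positive definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def simulate_var(values, vols, corr, n_sims=10000, tail=0.01, seed=1):
    """Tail quantile of simulated 1-year gains/losses (negative = loss)."""
    rng = random.Random(seed)
    lower = cholesky(corr)
    n = len(values)
    pnl = []
    for _ in range(n_sims):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]        # i.i.d. draws X
        # Correlated draws Y = L X (L is lower triangular).
        y = [sum(lower[i][k] * x[k] for k in range(i + 1)) for i in range(n)]
        # Project log prices with zero drift (an assumption) and reprice.
        new_value = sum(v * math.exp(s * yi) for v, s, yi in zip(values, vols, y))
        pnl.append(new_value - sum(values))
    pnl.sort()
    return pnl[int(tail * n_sims)]


values = [64.0, 75.0, 82.0, 59.0]       # INR value of 1 USD, 1 EUR, 1 GBP, 100 JPY
vols = [0.10, 0.10, 0.10, 0.10]         # hypothetical annual volatilities
corr = [[1.0, 0.6, 0.5, 0.3],           # hypothetical correlation matrix,
        [0.6, 1.0, 0.7, 0.2],           # not the one in Table 20.5
        [0.5, 0.7, 1.0, 0.2],
        [0.3, 0.2, 0.2, 1.0]]
print(f"99% VaR: {simulate_var(values, vols, corr):.2f} INR")
```

Sorting the simulated gains and losses and reading off the 1% quantile gives the 99% VaR; a negative value denotes a loss.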
Step 3: Attribution
The third step in risk management analysis is attribution. Once we have assessed
the risk of the portfolio in the previous step, we need to now attribute the risk to
different risk factors. For instance, the combined risk of Rs. 38 of the portfolio in
the previous example can be attributed to each of the individual assets. Like for
a portfolio, this can be done at a firm level as well. What financial institutions
typically do is to attribute risk along a line of business (LoB). This is because
banks and financial institutions are interested in measuring the capital consumed
by various activities. Capital is measured using the Value at Risk metric. VaR has
become an inalienable tool for risk control and an integral part of methodologies
that seek to allocate economic and/or regulatory capital. Its use is being encouraged
by the Reserve Bank of India (RBI), the Federal Reserve Bank (Fed), the Bank for
International Settlements, the Securities and Exchange Board of India (SEBI), and
the Securities and Exchange Commission (SEC). Stakeholders including regulators
and supervisory bodies increasingly seek to assess the worst possible loss (typically
at 99% confidence levels) of portfolios of financial institutions and funds. A detailed
description of how Value at Risk has been used to calculate capital can be found
in Chapter 3, “VAR-Based Regulatory Capital,” of the book Value at Risk: The
New Benchmark for Managing Financial Risk by Philippe Jorion. There are three
commonly employed measures of VaR-based capital—stand-alone, incremental,
and component. It has been found that different banks globally calculate these
capital numbers differently, but they follow similar ideas behind the measures.
Stand-Alone Capital
Stand-alone capital is the amount of capital that the business unit would require, if
it were viewed in isolation. Consequently, stand-alone capital is determined by the
volatility of each LoB’s earnings.
Incremental Capital
Incremental capital measures the amount of capital that the business unit adds to the
entire firm’s capital. Conversely, it measures the amount of capital that would be
released if the business unit were sold.
Component Capital
Component capital, sometimes also referred to as allocated capital, measures
the firm’s total capital that would be associated with a certain line of business.
Attributing capital this way has intuitive appeal and is probably the reason why
it is particularly widespread.
We use a simplified example to understand how attribution is done using metrics
such as stand-alone, incremental, and component capital. Let us assume that there
is a bank that has three business units:
• Line of Business 1 (LoB1)—Corporate Banking
• Line of Business 2 (LoB2)—Retail Banking
• Line of Business 3 (LoB3)—Treasury Operations
For ease of calculation, we assume that the total bank asset is A = Rs. 3000
crores. We also assume for the sake of simplicity that each of the LoBs has assets
worth Ai = Rs. 1000 crores, i = 1, 2, 3. The volatility of the three lines of
businesses is:
$$\sigma = \sqrt{\sigma_1^2 + \sigma_2^2 + \sigma_3^2 + 2\rho_{12}\sigma_1\sigma_2 + 2\rho_{23}\sigma_2\sigma_3 + 2\rho_{31}\sigma_3\sigma_1}$$
where σ i is the volatility of the ith line of business and ρ ij is the correlation between
the ith and jth LoB. The volatility of all three LoBs is calculated in Table 20.8, while
that of each LoB is calculated in Table 20.9.
Table 20.9 Capital calculation attribution for LoB1, LoB2, and LoB3 (amounts in Rs. crores)

               Assets   Volatility   Stand-alone capital   Incremental capital   Component capital
LoB1           1000     σ1 = 5%      116.50                51.25                 69
LoB2           1000     σ2 = 7%      163.10                67.05                 102
LoB3           1000     σ3 = 9%      209.70                89.81                 146
Total          3000     4.53%        489                   208                   317
Unattributed                         (172)                 109                   –
LoB1 is moderately correlated with LoB2 ($\rho_{12}$ = 30%) and less correlated
with LoB3 ($\rho_{31}$ = 10%). LoB2 is uncorrelated with LoB3 ($\rho_{23}$ = 0).
The capital required at 99% (z = 2.33) calculated as Value at Risk is given
by $2.33 A_i \sigma_i$. The stand-alone capital required for the first line of business is
2.33 × Rs. 1000 crores × 5% = Rs. 116.50 crores. The stand-alone capital required
for the second line of business is 2.33 × Rs. 1000 crores × 7% = Rs. 163.10 crores.
The stand-alone capital required for the third line of business is 2.33 × Rs. 1000
crores × 9% = Rs. 209.70 crores.
The total capital is given by:
$$C = 2.33 A\sigma = 2.33\sqrt{A_1^2\sigma_1^2 + A_2^2\sigma_2^2 + A_3^2\sigma_3^2 + 2\rho_{12}A_1A_2\sigma_1\sigma_2 + 2\rho_{23}A_2A_3\sigma_2\sigma_3 + 2\rho_{31}A_3A_1\sigma_3\sigma_1}$$
The incremental capital for LoB1 is calculated as the total capital less the capital
of LoB2 and LoB3 combined. It measures the incremental increase in capital from adding
LoB1 to the firm. The incremental capital for LoB1 is therefore:
$$2.33\left[\sqrt{A_1^2\sigma_1^2 + A_2^2\sigma_2^2 + A_3^2\sigma_3^2 + 2\rho_{12}A_1A_2\sigma_1\sigma_2 + 2\rho_{23}A_2A_3\sigma_2\sigma_3 + 2\rho_{31}A_3A_1\sigma_3\sigma_1} - \sqrt{A_2^2\sigma_2^2 + A_3^2\sigma_3^2 + 2\rho_{23}A_2A_3\sigma_2\sigma_3}\right]$$
This is because
$$\frac{\partial\sigma}{\partial\sigma_1} = \frac{\sigma_1 + \rho_{12}\sigma_2 + \rho_{31}\sigma_3}{\sigma}$$
Similarly, the component capital for LoB2 is calculated as:
The component capitals of the LoBs always sum to the total capital. Please
refer to the spreadsheet “Financial Analytics Step 8.xlsx” (available on the book’s
website) for the specificities of the calculation. Readers interested in total capital
calculation for the entire business may refer to the RiskMetrics™ framework
developed by JPMorgan (RiskMetrics 1996).
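The three attribution measures can be checked numerically with a short sketch. The stand-alone and incremental calculations follow the formulas in the text. The component-capital formula is not legible in this section, so the sketch assumes the usual marginal (Euler) allocation, in which each LoB receives its share of the total variance; under that assumption the rounded results agree with Table 20.9. All names are illustrative.

```python
import math

Z_99 = 2.33
ASSETS = [1000.0, 1000.0, 1000.0]         # Rs. crores per LoB
VOLS = [0.05, 0.07, 0.09]
CORR = [[1.0, 0.3, 0.1],
        [0.3, 1.0, 0.0],
        [0.1, 0.0, 1.0]]


def portfolio_capital(include):
    """2.33 x rupee volatility of the LoBs whose indices are in `include`."""
    variance = sum(
        CORR[i][j] * ASSETS[i] * VOLS[i] * ASSETS[j] * VOLS[j]
        for i in include for j in include
    )
    return Z_99 * math.sqrt(variance)


all_lobs = [0, 1, 2]
total = portfolio_capital(all_lobs)

stand_alone = [portfolio_capital([i]) for i in all_lobs]
incremental = [total - portfolio_capital([j for j in all_lobs if j != i])
               for i in all_lobs]
# Assumed (Euler) allocation: LoB i's share of total variance times total capital.
component = [
    Z_99 ** 2 * ASSETS[i] * VOLS[i]
    * sum(CORR[i][j] * ASSETS[j] * VOLS[j] for j in all_lobs) / total
    for i in all_lobs
]

names = ["LoB1", "LoB2", "LoB3"]
for name, sa, inc, comp in zip(names, stand_alone, incremental, component):
    print(f"{name}: stand-alone {sa:.2f}, incremental {inc:.2f}, component {comp:.1f}")
print(f"total capital {total:.1f}")
```

Stand-alone capital overstates the total because it ignores diversification, incremental capital understates it, and only the component figures add up to the total exactly.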
The third stage of data analytics in finance concerns portfolio risk management. It
involves optimal allocation of risk and return as well as the execution required to
move the portfolio from a suboptimal to an optimal level.
Step 1: Allocation
After having aggregated the portfolio, assessed the risk, and then attributed the
risk to different lines of business, we move on to changing the portfolio of the
entire firm, a division, or an LoB toward the optimal allocation. We continue
with the previous example of three lines of business, where the amount in each is
kept the same at Rs. 1000 crores. If we analyze the results from Step 3
of risk management, we find that risk attribution from all three metrics (stand-alone,
incremental, and component capital) indicates that the lowest attribution
of risk happens along the first line of business. If the Sharpe ratio (excess return
as a proportion of risk) for LoB1 is the highest (followed by that of LoB2 and
LoB3, respectively), then it is optimal for the firm to allocate more capital to the
first line of business and then to the second line of business. LoB3 is perhaps the
most expensive in terms of risk-adjusted return. Step 1 of portfolio analysis involves
optimally allocating the assets such that the overall risk of the firm is optimal.
Readers interested in optimal allocation of assets may refer to the RiskMetrics
framework developed by JPMorgan (RiskMetrics 1996).
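The Sharpe-ratio ranking described in this step can be illustrated with a small sketch. The volatilities are those of Table 20.9, but the excess returns are hypothetical numbers chosen only so that the ranking matches the case discussed in the text; the names are likewise illustrative.

```python
# Volatilities from Table 20.9; the excess returns are hypothetical values
# chosen only to illustrate the ranking.
lobs = {
    "LoB1": {"excess_return": 0.030, "volatility": 0.05},
    "LoB2": {"excess_return": 0.035, "volatility": 0.07},
    "LoB3": {"excess_return": 0.036, "volatility": 0.09},
}


def sharpe_ratio(excess_return, volatility):
    """Excess return per unit of risk."""
    return excess_return / volatility


ranking = sorted(lobs, key=lambda name: sharpe_ratio(**lobs[name]), reverse=True)
for name in ranking:
    print(name, round(sharpe_ratio(**lobs[name]), 2))
```

With these inputs LoB1 ranks first and LoB3 last, so additional capital would go to LoB1 first and then to LoB2.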
Step 2: Execution
The last step is execution. Having decided to change the portfolio from its current
level to a more optimal level, we have to execute the respective trades to
get to the desired portfolio risk levels. Execution happens in two steps. The
first step is order scheduling, which is basically the planning stage of the execution
process. Order scheduling involves deciding how to break down a large trade into
smaller trades and timing each trade for optimal execution. Let us say a financial
institution wants to move a large chunk of its portfolio from one block to another.
This is called a parent order, which is further broken down into child orders.
The timescale of the parent order is of the order of a day, measured in volume time.
In execution, time is measured not so much in calendar time (called
wall-clock time) as in what is called activity time. In this last step, we come
back to Step 1, where we said that
we need to identify the risk drivers. For execution, the variable to be modeled is
the activity time. It behaves approximately as a random walk with drift and
is a risk driver in the execution world, especially in high-frequency
trading.
There are two kinds of activity time: tick time and volume time. Tick time
is the most natural specification for activity time on very short timescales; it
advances by 1 unit whenever a trade happens. The second type, volume time,
can be intuitively understood by noting that volume time elapses faster when more
trading activity happens, that is, when the trading volume is larger. After the first step
of order scheduling, the second step in order execution is order placement, which
looks at the execution of child orders, and this is again addressed using data analytics.
The expected execution time of child orders is of the order of a minute. The child
orders, both limit orders and market orders, are placed based on real-time feedback
using opportunistic signals generated from data analytic techniques. In order
placement, the timescale of limit and market orders is of the order of milliseconds,
and time is measured in tick time, which is discrete. In execution algorithms these
two steps are repeated: scheduling determines the first child order, which is then
executed by placing limit and market orders. Once the child order is fully executed,
we update the parent order with the residual amount to be filled. We again compute
the next child order and execute it. This procedure ends when the parent order is
exhausted. Execution is almost always done programmatically using algorithms and
is known as high-frequency trading (HFT). The last step thus feeds back into the
first step of our framework.
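The scheduling and placement loop can be sketched as follows. This is a deliberately simplified illustration: the function name and order sizes are hypothetical, every child order has the same target size, and each child is assumed to be filled in full, whereas a real algorithm would size and time child orders using market feedback.

```python
def schedule_child_orders(parent_qty, child_qty):
    """Split a parent order into child orders until it is exhausted."""
    residual = parent_qty
    children = []
    while residual > 0:
        # Scheduling: the next child order is capped by the residual amount.
        child = min(child_qty, residual)
        # Placement stand-in: assume the child order is fully executed.
        children.append(child)
        # Update the parent order with the amount still to be filled.
        residual -= child
    return children


children = schedule_child_orders(parent_qty=10_000, child_qty=1_500)
print(len(children), "child orders:", children)
```

The loop stops when the residual reaches zero, which corresponds to the parent order being exhausted.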
1.4 Conclusion
To conclude, the framework consists of three stages to model, assess, and improve
the performance of a financial institution and/or a portfolio. The first five steps
pertain to econometric estimation. The next three steps concern risk management
and help measure the risk profile of the firm and/or the portfolio. The last two
steps are about portfolio management and help in optimizing the risk profile of
the financial institution and/or the portfolio. Following these sequential steps across
three stages helps us avoid common pitfalls and ensure that we are not missing
important features in our use of data analytics in finance. That being said, not every
data analysis in the finance world involves all the steps across three stages. If we
are only interested in estimation, we may just follow the first five steps. Or if we are
only interested in risk attribution, it may only involve Step 3 of risk management.
The framework is all encompassing so as to cover most possible data analysis cases
in finance. Other important aspects outside the purview of the framework like data
cleaning are discussed in Sect. 20.2.
2 Part B: Applications
2.1 Introduction
This chapter intends to demonstrate the kind of data science techniques used for
analysis of financial data. The study presents a real-world application of data
analytic methodologies used to analyze and estimate the risk of a large portfolio
over different horizons for which the portfolio may be held. The portfolio that we
use for this study consists of nearly 250 securities comprising international equities
and convertible bonds. The primary data science methods demonstrated in the case
study are principal component analysis (PCA) and Orthogonal GARCH. We use this
approach to achieve parsimony by reducing the dimensionality of the data, which is
a recurring objective in most data analytic applications in finance. This is because
the dimensionality of the data is usually quite large given the size, diversity, and
complexity of financial markets. We simultaneously demonstrate common ways of
taking into account the time-varying component of the volatility and correlations in
the portfolio, another common goal in portfolio analysis. The larger objective is to
demonstrate how the steps described in the methodology framework in the chapter
are actually implemented in financial data analysis in the real world.
The chapter is organized as follows. The next section describes the finance
aspects of the case study and its application in the financial world. Section 2.3 also
discusses the metrics used in the industry for assessing risk of the portfolio. In Sect.
2.4, the data used and the steps followed to make the data amenable for financial
analysis are described. Section 2.5 explains the principles of principal component
analysis and its application to the dataset. Section 2.6 explains the Orthogonal
GARCH approach. Section 2.7 describes three different types of GARCH modeling
specific to financial data analysis. The results of the analysis are presented
in Sect. 2.8.
per Bloomberg estimates, there are more than 10,000 hedge funds available for
investment. It is humanly impossible to carry out due diligence of more than 10,000
hedge funds by any one asset management company (AMC).
Apart from developments in data science and the vastness of hedge fund universe,
another important driver in the use of data analytics in asset management has
been the advancements in robust risk quantification methodologies. The traditional
measures for risk were volatility-based Value at Risk and threshold persistence
which quantified downside deviation. These risk metrics are described in the
next section. The problem with a simple volatility-based Value at Risk is that
it assumes normality: the distribution of financial market returns is taken to be
symmetrical, and volatility is taken to be constant over
time. It implicitly assumes that extreme returns, either positive or negative, are
highly unlikely. However, history suggests that extreme returns, especially extreme
negative returns, are not as unlikely as implied by the normal distribution. The
problem with downside measures such as threshold persistence is that, although
they consider asymmetry of returns, they do not account for fat tails of distributions.
These criticisms have resulted in the development of robust risk measures that
account for fat tails and leverage, such as GJR and EGARCH (see Sect. 2.7.5).
So, nowadays all major institutional investors who have significant exposure to
hedge funds employ P-quants and use data analytic techniques to measure risk. What
was once the exception, practiced by the likes of the Harvard and Yale endowment
funds, has now become
the norm. Consolidation of market risk at the portfolio level has become a
standard practice in asset management. In this chapter, we present one such analysis
of a large portfolio comprising more than 250 stocks (sample data in file: tsr.txt)
having different portfolio weights (sample data in file: ptsr.txt) and go through the
steps to convert portfolio returns into risk metrics. We use Steps 1–6 of the data
analysis methodology framework. We first identify the financial variable to model as
stock returns. We reduce the dimensionality of the data using principal component
analysis from 250 stock returns to about ten principal components. We then use
GARCH, GJR, and EGARCH (described in Step 3 of “Part A—Methodology”) to
make suitable inference on portfolio returns. We estimate the GARCH, GJR, and
EGARCH parameters using maximum likelihood estimation. We then project the
portfolio returns (Step 4 of the methodology) to forecast performance of the hedge
fund. We finally aggregate the risks using Step 6 of the framework and arrive at
the key risk metrics for the portfolio. We now describe the risk metrics used in the
investment management industry.
Value at Risk has become one of the most important measures of risk in modern-
day finance. As a risk-management technique, Value at Risk describes the loss in
a portfolio that can occur over a given period, at a given confidence level, due to
exposure to market risk. The market risk of a portfolio refers to the possibility of
financial loss due to joint movement of market parameters such as equity indices,
exchange rates, and interest rates. Value at Risk has become an indispensable tool for
risk control and an integral part of methodologies that seek to allocate economic
and/or regulatory capital. Its use is being encouraged by the Reserve Bank of India
(RBI), the Federal Reserve Bank (Fed), the Bank for International Settlements, the
Securities and Exchange Board of India (SEBI), and the Securities and Exchange
Commission (SEC). Stakeholders including regulators and supervisory bodies
increasingly seek to assess the worst possible loss (typically at 99% confidence
levels) of portfolios of financial institutions and funds. Quantifying risk is important
to regulators in assessing solvency and to risk managers in allocating scarce
economic capital in financial institutions.
Given a threshold level of return for a given portfolio, traders and risk managers
want to estimate how frequently the cumulative return on the portfolio goes below
this threshold and stays below this threshold for a certain number of days. Traders
also want to estimate the minimum value of the cumulative portfolio return when
the above event happens. In order to estimate both these metrics, financial market
participants define a metric called threshold persistence.
Threshold persistence is defined as follows: Given the time frame for which
a portfolio would remain constant and unchanged (T), two factors specify a
threshold, namely, cumulative portfolio return (β) and the horizon over which
the cumulative return remains below the threshold β. For the purposes of this
chapter, we label this threshold horizon as T’. The threshold persistence metrics are
defined as:
(a) The fraction of times the net worth of the portfolio declines below the critical
value (β) vis-à-vis the initial net worth of the portfolio and remains there for T’
days beneath this critical value
(b) The mean decline in the portfolio net worth value compared to the initial critical
level conditional on (a) occurring
To clarify the concept, consider the following example. Say T = 10 days,
β = −5%, T’ = 2 days, and the initial net worth of the portfolio is Rs. 100.
We simulate the portfolio net worth (please refer to Step 4 of the methodology
framework to understand how simulation is performed), and, say, we obtain the
following path (Table 20.10):
The pertinent observations for calculating (a) and (b) are the net worth of
the portfolio on Days 3, 4, and 5, since the net worth of the portfolio is lower than
Rs. 95 on all three of these days. Observe that the decline to Rs. 90 on Day 8 would
not be reckoned as an applicable occurrence here since T’ = 2 and the net worth of
the portfolio came back above the critical value on Day 9 (the critical time span is
2 days, and the net worth reverted above the critical level before 2 days had elapsed).
Let us suppose that we simulate ten paths in all and that in none of the remaining
paths does the portfolio value dip below the critical value and stay there over
the 2-day horizon. Therefore, the proportion of times the value of the portfolio goes
below the critical value and stays there is 1/10. Given that such a dip persists over
the critical time period of 2 days or more, the drop would be −10%.
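The two threshold persistence metrics can be sketched in a few lines of Python. This is a minimal illustration, not the chapter's own implementation. The function name, the ten-day path (Table 20.10 is not reproduced here), and the reading of the drop in (b) as the lowest net worth reached during a qualifying spell are all assumptions made for the example.

```python
def threshold_persistence(paths, initial, beta, t_prime):
    """Return (fraction of qualifying paths, mean worst return on those paths).

    A path qualifies if its net worth stays below initial * (1 + beta)
    for at least t_prime consecutive days.
    """
    critical = initial * (1 + beta)
    worst_returns = []
    for path in paths:
        run = []      # current streak of days below the critical value
        worst = None  # lowest net worth seen in any qualifying streak
        for value in path:
            if value < critical:
                run.append(value)
                if len(run) >= t_prime:
                    low = min(run)
                    worst = low if worst is None else min(worst, low)
            else:
                run = []
        if worst is not None:
            worst_returns.append(worst / initial - 1)
    fraction = len(worst_returns) / len(paths)
    mean_drop = sum(worst_returns) / len(worst_returns) if worst_returns else 0.0
    return fraction, mean_drop


# Hypothetical 10-day path: below Rs. 95 on Days 3 to 5, then a one-day dip
# to Rs. 90 on Day 8 that reverts on Day 9 and therefore does not count.
dipping_path = [100, 98, 94, 90, 93, 97, 99, 90, 96, 98]
flat_path = [100] * 10                    # never breaches the threshold
paths = [dipping_path] + [flat_path] * 9  # ten simulated paths in all

fraction, mean_drop = threshold_persistence(paths, initial=100, beta=-0.05, t_prime=2)
print(fraction, round(mean_drop, 4))
```

With these inputs only the first path qualifies, so the fraction is one in ten and the conditional drop is ten percent, in line with the worked example.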
2.4 Data
The data that is normally available from secondary sources are the prices of the
various securities in the sample portfolio. The prices would be in local currencies—
US securities in US dollars, Japanese equity in Japanese yen, and so on. In the case
study, there is data from ten different currencies.
The data that is available from financial information services providers such as
Bloomberg or Thomson Reuters (the two largest providers in the global financial
markets) is, more often than not, not “ready-made” for analysis. The foremost
limitation of the data made available by financial information services providers is
that it requires considerable cleaning before data analytic methodologies
can be applied. The data cleaning process is usually the most time-consuming and
painstaking part of any data analysis, at least with financial data. The portfolio that
we use for this study consists of nearly 250 securities. We use the data in the context
of the study to describe in general the steps taken to make the data amenable for
financial analysis:
• The prices of almost all securities are in their native currencies. This requires
conversion of the prices into a common currency. Globally, the currency of choice
is US dollars, which is used by most financial institutions as their base currency
for reporting purposes. This is a mandatory first step because the prices and
returns converted into the base currency are different from those in their native
currencies.
• Securities are traded in different countries across the world, and the holidays
(when markets are closed) in each of these countries are different. This can
lead to missing data in the time series. If the missing data is not filled, then
this could manifest as spurious volatility in the time series. Hence, the missing
data is normally filled using interpolation techniques between the two nearest
available dates. The most common and simplest interpolation methodology used
in financial data is linear interpolation.
• Some securities may have no price quotes at all because even though they are
listed in the exchange, there is no trading activity. Even when there is some
trading activity, the time periods for which they get traded may be different,
and therefore the prices that are available can vary for different securities. For
instance, in the portfolio that we use for this study, some securities have ten
years of data, while others have less than 50 price data points available. Those
securities which do not have at least a time series of prices spanning a minimum
threshold number of trading days should be excluded from the analysis. For the
purpose of this case study, we use 500 price data points.
• While too few price points are indeed a problem from a data analysis perspective,
a long time series can often be judged to be inappropriate as well. This
is because in a longer time series, older observations get the
same weight as recent observations. Since recent observations carry more
information relevant to the objective of predicting future portfolio risk, a longer
time series can be considered inappropriate. In the case study, the time series
used for analysis starts in May 2015 and ends in May 2017, thus giving us 2 years
of data (in most financial markets, there are approximately 250 trading days in a
year) or about 500 time series observations.
• Prices are customarily converted into continuously compounded returns using
the formula $r_t = \ln(P_t / P_{t-1})$. As explained in Step 1 of the methodology
in the “Financial Analytics: Part A—Methodology,” we work with returns data
rather than price data. Time series analysis of returns dominates that using prices
because prices are considerably non-stationary compared to returns.
• Portfolio returns are computed from the security returns as discussed in Step 6
of the methodology framework. In the case study, two portfolios—an equally
weighted portfolio and a value-weighted portfolio (calculated by keeping the
number of shares in each of the security in the portfolio constant)—are used
for the analysis.
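The cleaning steps listed above can be sketched as follows. This is an illustrative outline only: the yen price series, the exchange rates, and the helper names `fill_linear` and `log_returns` are hypothetical. The portfolio return is treated here as a weighted sum of security returns, which is exact for simple returns and only an approximation for log returns.

```python
import math


def fill_linear(prices):
    """Fill None gaps by linear interpolation between the nearest known prices.

    Gaps at the very start or end of the series are left untouched.
    """
    filled = list(prices)
    known = [i for i, p in enumerate(filled) if p is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


def log_returns(prices):
    """Continuously compounded returns r_t = ln(P_t / P_{t-1})."""
    return [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]


# Hypothetical inputs: a yen-denominated price series with one market holiday
# (None) and the USD value of one yen on each date.
jpy_prices = [1000.0, None, 1020.0, 1010.0]
usd_per_jpy = [0.0090, 0.0091, 0.0092, 0.0091]

# Step 1: convert to the common base currency (missing prices stay missing).
usd_prices = [None if p is None else p * fx for p, fx in zip(jpy_prices, usd_per_jpy)]
# Step 2: fill the holiday gap by linear interpolation.
filled = fill_linear(usd_prices)
# Step 3: convert prices into log returns.
returns = log_returns(filled)

# Equally weighted portfolio return for one day across three securities
# (hypothetical security returns).
day_returns = [0.010, -0.004, 0.006]
weights = [1 / len(day_returns)] * len(day_returns)
portfolio_return = sum(w * r for w, r in zip(weights, day_returns))
print(filled, returns, portfolio_return)
```

Converting to the base currency before interpolating follows the order of the bullets; a production pipeline would also need to handle securities that fail the minimum-history filter.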
For the purposes of the case study, readers need to understand PCA, the way it
is computed, and also the intuition behind the computation process. We explain
the intermediate steps and the concepts therein to make Sect. 2 of the chapter
self-contained. Further discussion on PCA is found in Chap. 15 on unsupervised
learning.
[Figure, plot omitted: two panels on X-Axis and Y-Axis. Left panel, "Transformation: Mirror reflection along the line y = x", showing the points (1, 3) and (3, 1). Right panel, "EigenVector: Transformation along y = x", showing the point (2, 2).]
From basic matrix algebra we know that we can multiply two matrices together,
provided that they are of compatible sizes. Eigenvectors are a special case of this.
Consider the two multiplications between a matrix and a vector below.

$$\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} 1 \\ 3 \end{pmatrix} = \begin{pmatrix} 3 \\ 1 \end{pmatrix}$$

$$\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} 2 \\ 2 \end{pmatrix} = \begin{pmatrix} 2 \\ 2 \end{pmatrix}$$

In the first multiplication, the resulting vector $(3, 1)^\top$ is not a scalar multiple
of the original vector $(1, 3)^\top$. In the second multiplication, the resulting vector is a
multiple (of 1) of the original vector $(2, 2)^\top$. The first vector is not an eigenvector of
the matrix $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$, while the second one is an eigenvector. Why is it so? The reason
is that the eigenvector remains a multiple of itself after the transformation. It does
not change direction after multiplication like the first one.
One can think of $(1, 3)^\top$ as a vector in two dimensions originating from
(0,0) and ending at (1,3) as shown in Fig. 20.5.
For ease of visual imagination, we have employed the matrix $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ in the
discussion above. This matrix can be thought of as the following transformation:
reflection of any vector along the line y = x. For instance, a vector (1,3) after
multiplication by this matrix becomes (3,1), that is, a reflection of the vector itself
along the y = x line. However, the reflection of the vector (2,2) would be the vector
itself. It would be a scalar multiple of the vector (in this case, the scalar multiple is
1). Thus, an eigenvector even after transformation remains a scalar multiple of itself.
The scalar multiple is called the eigenvalue “λ.” In other words, an eigenvector
remains itself when subject to some transformation and hence can capture a basic
source of variation. When more than one eigenvector is put together, they can
constitute a basis to explain complex variations.
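The reflection example can be checked numerically. The short sketch below uses only plain Python lists; the helper names are illustrative and are not part of the chapter's materials.

```python
def mat_vec(matrix, vector):
    """Multiply a square matrix (given as a list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def eigenvalue_if_eigenvector(matrix, vector, tol=1e-12):
    """Return lambda if matrix * vector equals lambda * vector, else None."""
    image = mat_vec(matrix, vector)
    # Use the first non-zero component to get the candidate scalar multiple.
    pivot = next(i for i, x in enumerate(vector) if x != 0)
    lam = image[pivot] / vector[pivot]
    if all(abs(y - lam * x) <= tol for x, y in zip(vector, image)):
        return lam
    return None


reflect = [[0, 1], [1, 0]]  # reflection along the line y = x

reflected = mat_vec(reflect, [1, 3])                     # the vector (1, 3) becomes (3, 1)
lam_first = eigenvalue_if_eigenvector(reflect, [1, 3])   # None: not an eigenvector
lam_second = eigenvalue_if_eigenvector(reflect, [2, 2])  # 1.0: eigenvalue 1
print(reflected, lam_first, lam_second)
```

The same check applied to the vector (1, −1) returns −1, which shows that this matrix has a second eigenvector perpendicular to (2, 2).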
In general, an n × n matrix can have a maximum of n linearly independent eigenvectors.
The eigenvectors of a symmetric matrix, such as the covariance matrix of returns, can
always be chosen to be orthogonal to each other, no matter how
many dimensions they have. This is important because it means that we can
represent the data in terms of these perpendicular eigenvectors, instead of expressing
them in terms of the original assets. This helps to reduce the dimensionality of
the problem at hand considerably, which characteristically for financial data is
large.
[Figure, plot omitted: directions PCA 1 and PCA 2 drawn on axes labelled Factor 1 and Factor 2.]
because convexity changes with duration. However, level of the yield curve and its
slope are orthogonal components which can explain variation in bond prices equally
well.
The results of the PCA for the portfolio of securities are detailed below. The first 21
principal components explain about 60% of the variation, as seen from Table
20.11.
Figure 20.7 shows that the first ten principal components capture the variation
in the portfolio returns quite accurately. These ten principal components explain
close to 50% of the variation in the security returns. As can be seen from Table
20.11, the first ten principal components explain 47.5% of the variation, while the
next thirteen components explain less than 15% of the variation. Adding more
principal components presents a trade-off between additional accuracy and the
added dimensionality of the problem in most financial data analyses.
In the data in the portfolio that we study, the principal components from 11
onward each help explain less than 2% of the additional variation. However, adding
one more principal component adds to the dimensionality by 10% and results in a
commensurate increase in the computational complexity. Hence, we can limit to ten
principal components for the subsequent analysis. This reduces the dimensionality
of the data from 250 to 10.
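The trade-off between explained variation and dimensionality can be illustrated with a small NumPy sketch. The data below are simulated (25 hypothetical securities driven by three common factors), not the case study's tsr.txt file, and PCA is carried out through the eigendecomposition of the sample covariance matrix, which is one common way of computing it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 500 daily returns on 25 securities driven by 3 common
# factors plus idiosyncratic noise (the case study uses about 250 securities).
n_obs, n_assets, n_factors = 500, 25, 3
factors = rng.normal(scale=0.01, size=(n_obs, n_factors))
loadings = rng.normal(size=(n_factors, n_assets))
returns = factors @ loadings + rng.normal(scale=0.005, size=(n_obs, n_assets))

# PCA via the eigendecomposition of the sample covariance matrix.
centered = returns - returns.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # returned in ascending order
order = np.argsort(eigenvalues)[::-1]            # largest variance first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

explained = eigenvalues / eigenvalues.sum()      # share of variation per component
cumulative = np.cumsum(explained)

k = 3                                            # number of components retained
components = centered @ eigenvectors[:, :k]      # principal component time series
recreated = components @ eigenvectors[:, :k].T + returns.mean(axis=0)

weights = np.full(n_assets, 1 / n_assets)        # equally weighted portfolio
error = returns @ weights - recreated @ weights  # daily replication error
print(cumulative[:k], np.abs(error).max())
```

The array `cumulative` plays the role of the cumulative column in a table such as Table 20.11, and `error` corresponds to the replication error that the histogram of differences summarizes.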
As the histogram in Fig. 20.8 shows, the differences between the actual portfolio
returns and the returns replicated using the ten principal components are, for the
most part, small. Hence, we can limit our subsequent analysis to ten principal
components.
Fig. 20.7 Replicated time series using ten principal components vs original portfolio returns
[Plot omitted: original portfolio return and portfolio return recreated with ten principal components, observations 450 to 500.]
[Histogram omitted: count against error in portfolio return (× 10^-3).]
Table 20.12 shows the results of the augmented Dickey–Fuller (ADF) test of
stationarity for all ten principal components. The ADF test is used to test for the
presence of unit roots in the time series. If $y_t = y_{t-1} + e_t$, then the time series
will blow up as the number of observations increases. Further, the variance of the
time series will be unbounded in this case. In order to rule out the presence of
unit roots, the augmented Dickey–Fuller test runs regressions of the following kind:
$y_t - y_{t-1} = \rho\, y_{t-1} + e_t$. If the time series has a unit root, then ρ will be equal to
zero. The ADF essentially tests the null hypothesis that ρ = 0 versus ρ ≠ 0. As the
results of the tests in Table 20.12 indicate, none of the principal components have
a unit root. Also, as we examined earlier, they do not have a time trend either. So,
predictability in the principal components is not spurious. This completes Step 2 of
our framework in the chapter.
[Figure, plots omitted: time series of Principal Components No. 1 to No. 10, each plotted against observation number (obn_no) from 1 to 500.]
Let D.PCi indicate the first difference of the respective principal component. The
absence of a unit root (which is the test of stationarity) is indicated by the coefficient
of the lag of PC(i) being different from zero. Data scientists in the financial domain
at times use the MacKinnon probability value to indicate the probability that the test
statistic is different from the augmented Dickey–Fuller critical values.
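The regression behind the test can be sketched in plain Python. The example below fits the simplest Dickey–Fuller regression (no intercept, no trend, no lagged differences) to two simulated series; it is a teaching sketch under those assumptions, not the augmented test reported in Table 20.12, and the function names are illustrative. The t-ratio it returns must be compared with Dickey–Fuller critical values rather than with the usual Student's t table.

```python
import random


def df_rho(series):
    """OLS slope (no intercept) of (y_t - y_{t-1}) on y_{t-1}, with its t-ratio."""
    lagged = series[:-1]
    diffs = [b - a for a, b in zip(series, series[1:])]
    sxx = sum(x * x for x in lagged)
    rho = sum(x * d for x, d in zip(lagged, diffs)) / sxx
    residuals = [d - rho * x for x, d in zip(lagged, diffs)]
    sigma2 = sum(e * e for e in residuals) / (len(diffs) - 1)
    t_ratio = rho / (sigma2 / sxx) ** 0.5
    return rho, t_ratio


def simulate_ar1(phi, n, seed):
    """Simulate y_t = phi * y_{t-1} + e_t with standard normal shocks."""
    rng = random.Random(seed)
    y, path = 0.0, []
    for _ in range(n):
        y = phi * y + rng.gauss(0.0, 1.0)
        path.append(y)
    return path


stationary = simulate_ar1(phi=0.5, n=500, seed=1)   # no unit root
random_walk = simulate_ar1(phi=1.0, n=500, seed=1)  # unit root

rho_s, t_s = df_rho(stationary)
rho_w, t_w = df_rho(random_walk)
print("stationary series: rho =", rho_s, "t =", t_s)
print("random walk:       rho =", rho_w, "t =", t_w)
```

For the stationary series the estimated ρ is clearly negative, while for the random walk it stays close to zero, which is the pattern the test exploits.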
Given the large number of factors that typically affect the position of a large
portfolio, estimating the risk of the portfolio becomes very complex indeed. At the
heart of most data analytics models for estimating risk is the covariance matrix
which captures the volatilities and the correlations between all the risk factors.
Typically hundreds of risk factors encompassing equity indices, foreign exchange
rates, and yield curves need to be modeled through the dynamics of the large
covariance matrix. In fact, without making assumptions about the dynamics of
these risk factors, implementation of models for estimating risk becomes quite
cumbersome.
Orthogonal GARCH is an approach for estimating risk which is computationally
efficient but captures the richness embedded in the dynamics of the
covariance matrix. Orthogonal GARCH applies the computations to a few key
factors which capture the orthogonal sources of variation in the original data. The
approach is computationally efficient since it allows for an enormous reduction
in the dimensionality of the estimation while retaining a high degree of accuracy.
The method used to identify the orthogonal sources of variation is principal
component analysis (PCA). The principal components identified through PCA
are uncorrelated with each other (by definition they are orthogonal). Hence,
univariate GARCH models can be used to model the time-varying volatility
of the principal components themselves. The principal components along with
their corresponding GARCH processes then capture the time-varying covariance
matrix of the original portfolio. Having described principal component analysis
and Orthogonal GARCH, we now illustrate the different variants of GARCH
modeling.
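The way the principal components and their variances recombine into a covariance matrix for the original assets can be sketched as follows. The loadings, the component variances, and the portfolio weights are hypothetical numbers chosen for illustration; in practice the variances would be the conditional variance forecasts from the univariate GARCH models fitted to each component.

```python
import numpy as np


def ogarch_covariance(loadings, pc_variances):
    """Asset covariance implied by k uncorrelated components.

    loadings: (n_assets, k) matrix whose columns are PCA eigenvectors.
    pc_variances: length-k conditional variances of the components.
    """
    return loadings @ np.diag(pc_variances) @ loadings.T


# Hypothetical example: three assets, two retained (orthonormal) components.
loadings = np.column_stack([
    np.array([1.0, 1.0, 1.0]) / np.sqrt(3),   # a common "market" direction
    np.array([1.0, 0.0, -1.0]) / np.sqrt(2),  # a spread between assets 1 and 3
])
pc_variances = np.array([4.0e-4, 1.0e-4])     # e.g. today's GARCH variance forecasts

cov = ogarch_covariance(loadings, pc_variances)
weights = np.array([0.5, 0.3, 0.2])
portfolio_variance = weights @ cov @ weights
print(cov)
print(portfolio_variance)
```

Because the components are uncorrelated, the portfolio variance is simply the sum of each squared component exposure times that component's variance, so only k univariate models need to be estimated instead of a full covariance matrix.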
After the dimensionality of the time series is reduced using PCA, we now proceed to
Step 3 of our framework in the chapter with modeling the covariance using GARCH
on the principal components.
We first motivate the use of GARCH for measuring risk of a portfolio. Most
common methodologies for estimating risk, through a Value at Risk calculation,
assume that portfolio returns follow a normal distribution as shown in Fig. 20.10.
This methodology of calculating VaR using the normal distribution implicitly assumes
that the mean and standard deviation of the portfolio returns remain constant.
However, ample empirical evidence in finance shows that security returns exhibit
significant deviations from normal distributions, particularly volatility clustering
and fat tail behavior. There are certain other characteristics of equity markets
which are not adequately accounted for in a normal distribution. Data scientists
in finance therefore use GARCH models as they are devised to encapsulate these
characteristics that are commonly observed in equity markets.
[Fig. 20.10, plot omitted: a normal distribution curve; horizontal axis from 60 to 140, vertical axis from 0 to 0.05.]
Equity return series usually exhibit volatility clustering, in which large changes
tend to follow large changes and small changes tend to follow small changes. For
instance, if markets were more volatile than usual today, there is a bias toward
their being more volatile than usual tomorrow. Similarly, if markets
were “quiet” today, there is a higher probability that they will be “quiet” tomorrow
than that they will be unusually volatile. In both cases, it is difficult to predict the
change in market activity from a “quiet” to a “volatile” scenario and vice versa.
In GARCH, significant perturbations, for better or for worse, are an intrinsic
part of the time series we use to predict the volatility for the next time period.
These large perturbations and shocks, both positive and negative, persist in the
GARCH model and are factored into the forecasts of variance for future
time periods. This property is sometimes called persistence; it models a process
in which successive disturbances, although uncorrelated, are nonetheless serially
dependent.
An examination of the time series of principal components reveals that periods of
high volatility are often clustered together. This has to be taken into account using a
GARCH model.
The tails of the distributions of equity returns are typically fatter compared to a normal
distribution. In simple terms, the possibility of extreme fluctuations in returns is
understated in a normal distribution, and these can be captured with GARCH
models.
This lack of normality in our portfolio is tested by analyzing the distribution
of the principal components using quantile plots as shown in Fig. 20.11. Fat tails
are evident in the distribution of principal components as seen from the quantile
plots since the quantiles at both the extremes deviate from the quantiles of a normal
distribution. To further test whether the distributions for the principal components
are normal, the Shapiro–Wilk test of normality is usually performed on all the
principal components. The results of the Shapiro–Wilk test are provided in Table
20.13.
As is evident from Table 20.13, pc1, pc3, pc5, pc6, pc8, and pc9 exhibit
substantial deviations from normality, while the remaining principal components
are closer to being normally distributed. Since six of the ten principal components
exhibit deviations from normality, it is important to model fat tails in the distribution
of principal components. Figure 20.11 depicts quantiles of principal components
plotted against quantiles of the normal distribution (45° line). A look at the plot of the
time series of the principal components reveals that periods of volatility are often
clustered together. Hence, we need to take into account this volatility clustering
using GARCH analysis.
[Fig. 20.11, plots omitted: quantile plots of principal components PC 1 to PC 10, each plotted against the inverse normal.]
Now that we have discussed why we use GARCH in financial data analysis, let
us try to understand it conceptually. GARCH stands for generalized autoregressive
conditional heteroscedasticity. Loosely speaking, one can think of heteroscedasticity
as variance that varies with time. Conditional implies that future variances depend
on past variances. It allows for modeling of serial dependence of volatility. For the
benefit of those readers who are well versed with econometric models and for the
sake of completeness, we provide the various models used in the case study. Readers
may skip this portion without losing much if they find it to be too mathematically
involved.
Fig. 20.12 Simulated returns of portfolio assets with GJR model, Gaussian distribution
[Plot omitted: simulated returns over a 30-day horizon.]
This general ARMAX(R,M,Nx) model for the conditional mean applies to all
variance models.
$$y_t = C + \sum_{i=1}^{R} \phi_i y_{t-i} + \varepsilon_t + \sum_{j=1}^{M} \theta_j \varepsilon_{t-j} + \sum_{k=1}^{N_x} \beta_k X(t, k)$$
moving average components of the conditional mean is desirable. Please see Chap.
12 for a brief description of the ARMA model and determining its parameters.
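GARCH(P,Q) conditional variance:

$$\sigma_t^2 = K + \sum_{i=1}^{P} G_i \sigma_{t-i}^2 + \sum_{j=1}^{Q} A_j \varepsilon_{t-j}^2$$

GJR(P,Q) conditional variance:

$$\sigma_t^2 = K + \sum_{i=1}^{P} G_i \sigma_{t-i}^2 + \sum_{j=1}^{Q} A_j \varepsilon_{t-j}^2 + \sum_{j=1}^{Q} L_j S_{t-j}^{-} \varepsilon_{t-j}^2$$

where

$$S_{t-j}^{-} = \begin{cases} 1 & \varepsilon_{t-j} < 0 \\ 0 & \text{otherwise} \end{cases}$$

EGARCH(P,Q) conditional variance:

$$\log \sigma_t^2 = K + \sum_{i=1}^{P} G_i \log \sigma_{t-i}^2 + \sum_{j=1}^{Q} A_j \left[ \frac{|\varepsilon_{t-j}|}{\sigma_{t-j}} - E\left\{ \frac{|\varepsilon_{t-j}|}{\sigma_{t-j}} \right\} \right] + \sum_{j=1}^{Q} L_j \frac{\varepsilon_{t-j}}{\sigma_{t-j}}$$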
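For P = Q = 1, the three conditional variance equations reduce to one-step updates that are easy to write down. The sketch below uses the same symbols (K, G, A, L) and hypothetical parameter values; for EGARCH it assumes standard normal shocks, for which the expected absolute standardized shock is the square root of 2/π (a Student's t-distribution would need a different constant).

```python
import math

E_ABS_Z = math.sqrt(2 / math.pi)  # E|z| for a standard normal shock (assumption)


def garch_variance(K, G, A, prev_var, prev_eps):
    """GARCH(1,1): next variance from the previous variance and shock."""
    return K + G * prev_var + A * prev_eps ** 2


def gjr_variance(K, G, A, L, prev_var, prev_eps):
    """GJR(1,1): negative shocks add an extra L * eps^2 (leverage effect)."""
    indicator = 1.0 if prev_eps < 0 else 0.0
    return K + G * prev_var + A * prev_eps ** 2 + L * indicator * prev_eps ** 2


def egarch_variance(K, G, A, L, prev_var, prev_eps):
    """EGARCH(1,1): the recursion is written for the log of the variance."""
    z = prev_eps / math.sqrt(prev_var)
    log_var = K + G * math.log(prev_var) + A * (abs(z) - E_ABS_Z) + L * z
    return math.exp(log_var)


# Hypothetical parameters, with a shock of the same size but opposite sign.
for shock in (-1.0, 1.0):
    print(
        shock,
        garch_variance(0.1, 0.8, 0.1, prev_var=1.0, prev_eps=shock),
        gjr_variance(0.1, 0.8, 0.1, 0.05, prev_var=1.0, prev_eps=shock),
        egarch_variance(0.0, 0.9, 0.1, -0.05, prev_var=1.0, prev_eps=shock),
    )
```

GARCH responds identically to positive and negative shocks of the same size, whereas GJR (with positive L) and EGARCH (with negative L) raise next-period variance more after a negative shock, which is the leverage effect.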
2.8 Results
The calculation of Value at Risk for large portfolios presents a trade-off between
speed and accuracy, with the fastest methods relying on rough approximations and
the most realistic approach often too slow to be practical. Financial data scientists
try to use the best features of both approaches, as we try to do in this case study.
Tables 20.14, 20.15, 20.16, 20.17, 20.18, 20.19, 20.20, and 20.21 show the
calculation of Value at Risk and threshold persistence using four different models—
GARCH with Gaussian distribution, GARCH with Student’s t-distribution, GJR
with Student’s t-distribution, and EGARCH with Student’s t-distribution. This is
done for both the value-weighted portfolio and equi-weighted portfolio as below:
• Table 20.14 shows the calculation of Value at Risk and threshold persistence for
market value-weighted GARCH model with a Gaussian distribution.
• Table 20.15 shows the calculation of Value at Risk and threshold persistence for
market value-weighted GARCH model with a Student’s t-distribution.
• Table 20.16 shows the calculation of Value at Risk and threshold persistence for
market value-weighted GJR model with a Student’s t-distribution.
• Table 20.17 shows the calculation of Value at Risk and threshold persistence for
market value-weighted EGARCH model with a Student’s t-distribution.
• Table 20.18 shows the calculation of Value at Risk and threshold persistence for
equi-weighted GARCH model with a Gaussian distribution.
• Table 20.19 shows the calculation of Value at Risk and threshold persistence for
equi-weighted GARCH model with a Student’s t-distribution.
• Table 20.20 shows the calculation of Value at Risk and threshold persistence for
equi-weighted GJR model with a Student’s t-distribution.
• Table 20.21 shows the calculation of Value at Risk and threshold persistence for
equi-weighted EGARCH model with a Student’s t-distribution.
Fig. 20.13 Simulated returns of portfolio assets with GJR model, Student’s t-distribution
[Plot omitted: simulated returns over a 30-day horizon.]
For GARCH models, the difference in dispersion for the two distributions is small.
However, the tables report the results for GJR and EGARCH using the two different
distributions, and, in general, as is to be expected, the Student’s t-distribution tends
to have a higher dispersion than the Gaussian distribution.
Figures 20.12 and 20.13 show the simulated paths for the equal-weighted
portfolio over the 30-day horizon with the GJR model under a Gaussian distribution and a
Student’s t-distribution, respectively. As is clearly visible, the fat-tailed Student’s
t-distribution generates greater variation at the extremes than the normal distribution.
GARCH tends to underestimate the VaR and persistence measure vis-à-vis GJR and
EGARCH. Again this is to be expected given that GJR and EGARCH factor in the
leverage effect which GARCH fails to do. GJR and EGARCH return similar results
which again is to be expected.
Five values are exhibited for each parameter to show the measure of dispersion.
Standard errors and t-statistics are also computed for the estimates, and both are
found to be statistically acceptable.
Each value itself is an average of 10,000 paths. The horizon is mentioned in the
first column. The threshold horizon is taken as 2 days for consistency across horizons.
VaR and persistence measures are also computed for horizons of 2 years, 3 years,
4 years, and 5 years. The range of Value at Risk is between 20 and 22% for these
horizons. The probability of the portfolio remaining below the threshold level β for
2 or more days is about 42–47%, whereas the average drop in the portfolio given
that this happens is about 11–13%.
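The step from simulated paths to a Value at Risk figure can be sketched as follows. This is a simplified stand-in for the case study's procedure: it simulates a single return series from a GARCH(1,1) model with Gaussian shocks and a zero conditional mean, using hypothetical parameters, and reads VaR off the empirical quantile of the cumulative returns.

```python
import math
import random


def simulate_cumulative_returns(n_paths, horizon, K, G, A, seed=0):
    """Cumulative log return of each simulated GARCH(1,1) path."""
    rng = random.Random(seed)
    start_var = K / (1 - G - A)  # unconditional variance of a GARCH(1,1) process
    totals = []
    for _ in range(n_paths):
        var, total = start_var, 0.0
        for _ in range(horizon):
            eps = math.sqrt(var) * rng.gauss(0.0, 1.0)
            total += eps         # zero conditional mean assumed
            var = K + G * var + A * eps ** 2
        totals.append(total)
    return totals


def value_at_risk(returns, confidence):
    """Loss that is not exceeded with the given confidence (empirical quantile)."""
    ordered = sorted(returns)
    index = int((1 - confidence) * len(ordered))
    return -ordered[index]


# Hypothetical parameters: daily volatility of about 1%, 30-day horizon.
totals = simulate_cumulative_returns(n_paths=10_000, horizon=30, K=2e-6, G=0.90, A=0.08)
var_99 = value_at_risk(totals, 0.99)
var_95 = value_at_risk(totals, 0.95)
print("99% VaR:", var_99, "95% VaR:", var_95)
```

Each simulated list of cumulative returns could also be replaced by full daily paths and passed to a threshold persistence calculation, so that both risk metrics come from the same set of simulations.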
• Table 20.22 shows the PCA analysis for GARCH model with a Gaussian
distribution.
• Table 20.23 shows the PCA analysis for GARCH model with a Student’s t-
distribution.
• Table 20.24 shows PCA analysis for GJR model with a Student’s t-distribution.
• Table 20.25 shows the PCA analysis for EGARCH model with a Student’s t-
distribution.
2.9 Conclusion
for 1-Week, 1-Month, and 1-Year horizon. For instance, from Table 20.14 the CRO
can infer that if stock returns follow approximately Gaussian distribution, there
is a 1% chance (Value at Risk at 99% confidence level) that the portfolio might
lose around 12% of its value over a 1-year horizon. Using the metric of threshold
persistence, the CRO can infer that over a 1-year horizon, there is a 22% chance of
the portfolio value falling by more than 5% and staying there. Given that such a dip
persists over the critical time period of 2 days or more, the drop in the portfolio value would be approximately
8%. The other tables quantify risk of portfolio when asset returns have excess
kurtosis or when there are causal mechanisms at play between returns and volatility
such as leverage effects.
Some CROs and investment management boards prefer to receive only summary
risk reports. The summary report is typically short so as to make it less likely that the
risk numbers will be missed by the board members. Most P-quant CROs choose to
receive both the summary and detailed risk reports. It is not unusual for modern-day
CROs to receive daily MIS (management information system) reports that contain
the analyses in Tables 20.14 to 20.21. In the last few years, most
CROs come from the P-world and are quantitatively well equipped to understand
and infer risks from the detailed risk reports.
Apart from the senior management and the board, the other principal audience of
risk reports are regulators. Regulators like the Fed and the RBI mandate all financial
institutions that they regulate to upload their risk reports in a prescribed template
at the end of each business day. Regulators normally prescribe templates for risk
reporting so that they can do an apples-to-apples comparison of risk across financial
institutions. Regulators themselves use systems to monitor the change in risk of a
given financial institution over time. More importantly, it helps them aggregate risk
of all financial institutions that they regulate so as to assess the systemic risk in
the financial industry. With rapid advances in data sciences, it is envisaged that
application of analytics in finance would get increasingly more sophisticated in
times to come.
All the datasets, code, and other material referred to in this section are available at
www.allaboutanalytics.net.
• Data 20.1: Financial Analytics Steps 1, 2 and 3.xlsx
• Data 20.2: Financial Analytics Steps 4 and 5.xlsx
• Data 20.3: Financial Analytics Step 6.xlsx
• Data 20.4: Financial Analytics Step 7.xlsx
• Data 20.5: Financial Analytics Step 8.xlsx
• Data 20.6: nifty50.txt
• Data 20.7: ptsr.txt
• Data 20.8: randomgaussian.txt
• Data 20.9: randomgaussiancurrency.txt
• Data 20.10: tsr.txt
Exercises
Ex. 20.1 Stage I Step 3 Inference: We consider the National Stock Exchange index
NIFTY-50 recorded daily for the period November 1, 2016–October 31, 2017. Let
$p_t$ be the NIFTY-50 index and $r_t$ be the log return, $r_t = \log(p_t) - \log(p_{t-1})$.
Load the data from the file nifty50.txt into Matlab.
• Draw graphs of the stock index, the log returns, and the squared log returns.
• Do the graphs indicate GARCH effects?
• Estimate the GARCH parameters (ω,α,β).
Ex. 20.2 Stage I Step 4 Projection: Load the data on random numbers from
randomgaussian.txt into Matlab. The standard deviation is 20%, while the average
return is 5%. Note that the average return and standard deviation should be adjusted
for a daily horizon by dividing by 365 and the square root of 365, respectively. Project
the value of USD/INR for a horizon of 1 year.
Ex. 20.3 Stage II Step 2 Aggregation: Assume a loan portfolio of Rs. 500 lent
to five different corporates for Rs. 100 each. Aggregate the risk of this Rs. 500
loan portfolio using one-factor Gaussian copulas. Assume each corporate has a
probability of default of 4%. The horizon for the loan is 1 year. Assume that in
the event of a default the bank can recover 75% of the loan amount. Assume the
single factor to be the economy. The correlation of each firm’s asset to the economy
is given in the table below. Calculate the joint distribution of credit of each of these
corporates using a one-factor model.
Ex. 20.4 Stage II Step 2 Assessment: Load the data on random numbers from
randomgaussiancurrency.txt into Matlab. Compute VaR for a portfolio of 1 USD,
1 EUR, 1 GBP, and 100 JPY. The value of the portfolio in INR terms is Rs. 280 (1
USD = Rs. 64, 1 EUR = Rs. 75, 1 GBP = Rs. 82, 100 JPY = Rs. 59). Calculate the
possible loss or gain from this portfolio for a 1-year horizon. To aggregate the risk,
use the correlation matrix below between the currencies:
Ex. 20.5 Stage II Step 3 Attribution: A bank comprises three lines of businesses:
• Line of Business 1 (LoB1)—Corporate Banking
• Line of Business 2 (LoB2) —Retail Banking
• Line of Business 3 (LoB3) —Treasury Operations
LoB1 has a correlation of 0.5 with LoB2 and has a correlation of 0.2 with LoB3.
LoB2 is uncorrelated to LoB3. The total bank assets are Rs. 6000 crores. Each of
the LoBs has assets worth Rs. 2000 crores.
        Assets (Rs. crores)    Volatility
LoB1    2000                   σ1 = 5%
LoB2    2000                   σ2 = 7%
LoB3    2000                   σ3 = 9%
20 Financial Analytics 717
Vishnuprasad Nagadevara
1 Introduction
Social media has created new opportunities for both consumers and companies.
It has become one of the major drivers of the consumer revolution. Companies can
analyze data available from the web and social media to get valuable insights into
what consumers want. Social media and web analytics can help companies measure
the impact of their advertising and the effect of mode of message delivery on the
consumers. Companies can also turn to social media analytics to learn more about
their consumers. This chapter looks into various aspects of social media and web
analytics.
V. Nagadevara
IIM-Bangalore, Bengaluru, Karnataka, India
e-mail: nagadevara_v@isb.edu
Social media analytics involves gathering information from social networking sites
such as Facebook, LinkedIn and Twitter in order to provide businesses with better
understanding of customers. It helps in understanding customer sentiment, creating
customer profiles and evolving appropriate strategies for reaching the right customer
at the right time. It involves four basic activities, namely, listening (aggregating
what is being said on social media), analyzing (such as identifying trends, shifts in
customer preferences and customer sentiments), understanding (gathering insights
into customers, their interests and preferences and sentiments) and strategizing
(creating appropriate strategies and campaigns to connect with customers with a
view to encourage sharing and commenting as well as improving referrals). One
of the major advantages of social media analytics is that it enables businesses to
identify and encourage different activities that drive revenues and profits and make
real-time adjustments in the strategies and campaigns. Social media analytics can
help businesses in targeting advertisements more effectively and thereby reduce the
advertising cost while improving ROI.
On the other hand, web analytics encompasses the process of measuring,
collecting and analyzing web traffic data to understand customer behaviour on a
particular website. Web analytics can help in improving user experience leading
to higher conversion rates. Companies can tweak the design and functionality of
their websites by understanding how users interact with their websites. They can
track user behaviour within the website and how users interact with individual
elements on each page of the website. Web analytics can also help in identifying the
most profitable source of traffic to the website and determining which referrals are
important in terms of investing in marketing efforts. Google Analytics is probably
the best tool available free of cost to any website owner for tracking and analyzing
the traffic to their website. These analytics include sources of traffic, bounce rates,
conversions, landing pages and paid search statistics. It is easy to integrate this with
Google AdWords.
Social media analytics differs from web analytics, most notably in its data sources.
The two complement each other and, when used
together, can provide deep insights into the traffic patterns, users and their
behaviour. For example, one can measure the volume of visitors to the website based
on referrals by different social networks. By ranking these social networks based on
the traffic generated, one can determine how to focus on the right networks. We can
even determine the influencers in the networks and their behaviour on our website.
Many companies have started leveraging the power of social media. A particular
airline keeps the customers informed through social media about the delays and the
causes for such delays. In the process, the airline is able to proactively communicate
the causes for the delays and thereby minimize the negative sentiment arising out of
the delays. In addition, the airline company is also able to save much of the time of
its call centre employees, because many customers already know about the delays
as well as the reasons associated with the delays and hence do not make calls to the
call centre.
21 Social Media and Web Analytics 721
issues as well as public attitude towards political parties. It can help in identifying
pockets of political support. It can also help in creating appropriate campaigns and
policy formulation to take full advantage of the prevailing sentiments.
It is very common to have leaders and followers in social media. By using
network analytics (such as network and influence diagrams), organizations can
identify the most influential persons in the network. Organizations can take the
help of such influential persons to promote a new product or service. They can be
requested to contribute an impartial review or article about the product or service.
Such a review (or a recommendation) can help in promoting the new product or
service by resolving uncertainty or hesitancy in the minds of the followers.
Network and influence diagrams are used very effectively in identifying key
players in various financial frauds or drug laundering cartels. One such study
identified the perpetrators of 9/11 attacks by using a network diagram based on
the information available in the public domain. The network diagram could clearly
identify the key players involved in 9/11 starting from the two Cole bombing
suspects who took up their residences in California as early as 1999.¹
Social media analytics can also be used for public good. Data from mobile
telephones are used to identify and predict traffic conditions. For example, Google
Traffic analyzes data sourced from a large number of mobile users. Cellular
telephone service providers constantly monitor the location of the users by a method
called “trilateration” where the distance of each device to three or more cell phone
towers is measured. They can also get the exact location of the device using GPS.
By calculating the speed of users along a particular length of road, Google generates
a traffic map highlighting the real-time traffic conditions. Using the existing traffic
conditions across different routes, better alternative routes can be suggested.
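The trilateration idea mentioned above can be sketched in a few lines of R. This is only an illustration under simplifying assumptions (flat two-dimensional coordinates, noise-free distances); the function name `trilaterate` and the tower positions are hypothetical and do not describe how any service provider actually implements positioning. Subtracting the circle equation of the first tower from those of the other towers turns the problem into a linear system in the unknown position.

```r
# Estimate a device position from its distances to three or more towers.
# towers: n x 2 matrix of (x, y) tower coordinates (hypothetical units, e.g. km)
# dists:  length-n vector of measured distances to each tower
trilaterate <- function(towers, dists) {
  x1 <- towers[1, 1]
  y1 <- towers[1, 2]
  # Subtracting the first circle equation from the others removes the
  # quadratic terms and leaves a linear system A %*% c(x, y) = rhs.
  A <- 2 * cbind(towers[-1, 1] - x1, towers[-1, 2] - y1)
  rhs <- dists[1]^2 - dists[-1]^2 +
    towers[-1, 1]^2 - x1^2 + towers[-1, 2]^2 - y1^2
  # qr.solve gives the exact solution for three towers and a
  # least-squares solution when more than three towers are available.
  as.vector(qr.solve(A, rhs))
}

# Hypothetical example: three towers and a device whose true position is (3, 4).
towers   <- rbind(c(0, 0), c(10, 0), c(0, 10))
true_pos <- c(3, 4)
dists    <- sqrt(rowSums((towers - matrix(true_pos, nrow(towers), 2, byrow = TRUE))^2))
est      <- trilaterate(towers, dists)
print(est)
```

With real measurements the distances are noisy, so using more than three towers and the least-squares solution is usually preferable to an exact three-tower solve.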
With the advent of smartphones, mobile devices have become the most popular
source for consuming information. This phenomenon has led to the development of
new approaches to reach consumers. One such approach is geo-fencing. It involves
creating a virtual perimeter for a geographical area and letting companies know
exactly when a particular customer (or potential customer) is likely to pass by a
store or location. These virtual perimeters can be dynamically generated or can
be predefined boundaries. This approach enables companies to deliver relevant
information or even pass on online coupons to the potential customer. The concept of
geo-fencing can be combined with other information based on earlier search history,
previous transactions, demographics, etc. in order to better target the customer with
the right message.
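A predefined circular geo-fence can be illustrated with the following R sketch. It assumes a fence described by a centre and a radius and uses the haversine formula for the distance between two latitude/longitude points; the function names (`haversine_km`, `inside_geofence`), the store coordinates and the 0.5 km radius are all hypothetical choices made for illustration.

```r
# Great-circle distance in km between two points given in decimal degrees.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(sqrt(pmin(1, a)))
}

# TRUE when the device location falls inside the circular fence.
inside_geofence <- function(lat, lon, fence) {
  haversine_km(lat, lon, fence$lat, fence$lon) <= fence$radius_km
}

# Hypothetical fence of radius 0.5 km around a store.
store_fence <- list(lat = 17.4350, lon = 78.3400, radius_km = 0.5)

near_store <- inside_geofence(17.4360, 78.3410, store_fence)  # roughly 150 m away
far_away   <- inside_geofence(17.5000, 78.4000, store_fence)  # several km away
print(c(near_store = near_store, far_away = far_away))
```

In practice the result of such a check would be combined with search history, previous transactions or demographics before a message or coupon is sent.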
In the rest of the chapter, we describe some applications in greater detail: display
advertising in real time, A/B experiments for measuring value of digital media and
handling e-retailing challenges, data-driven search engine advertising, analytics of
digital attribution and strategies and analytics for social media and social enterprises
and mobile analytics and open innovation.
¹ Valdis Krebs, “Connecting the Dots; Tracking Two Identified Terrorists”, available at
http://www.orgnet.com/tnet.html (last accessed on Jan 18, 2018).
The Internet provides new scope for creative approaches to advertising. Advertising
on the Internet is also called online advertising, and it encompasses display
advertisements found on various websites and results pages of search queries and
those placed in emails as well as social networks. These display advertisements
can be found wherever you access the web. As in the case of any other mode of
advertising, the objective of display advertisement is to increase sales and enhance
brand visibility. The main advantage of display advertisements is that all the actions
of the user are trackable and quantifiable. We can track various metrics such as the
number of times it was shown, how many clicks it received, post-click-and-view
data and how many unique users were reached. These display advertisements can
be all pervasive and be placed anywhere and on any type of web pages. They can
be in the form of text, images, videos and interactive elements. Another advantage
of display advertisements is that the conversion and sales are instantaneous and
achieved with a single click.
There are different types of display advertisements. The most popular one is the
banner advertisement. This is usually a graphic image, with or without animation,
displayed on a web page. These advertisements are usually GIF or JPEG
images if they are static, but use Flash, JavaScript or video if there are animations
involved. These banner advertisements allow the user to interact and transact. These
can be in different sizes and can be placed anywhere on the web page (usually on
the side bar). You need to carry out appropriate tests or experimentation (refer to the
section on Experimental Designs in this chapter) to know what works best for you.
You can design your banner as a single- or multiple-destination URL(s).
There are banners that appear between pages. As you move from one page to the
next through clicks, these advertisements are shown before the loading of the next
page. These types of advertisements are referred to as interstitial banners.
The display advertisements can be opened in a new, small window over the web
page being viewed. These are usually referred to as pop-ups. Once upon a time, these
pop-up advertisements were very popular, but the advent of “pop-up blockers” in
web browsers has diminished their effectiveness. Users
can be very selective in allowing pop-ups from preselected websites. There are
also similar ads called pop-unders, where an ad opens a new browser window under
the original window.
Some of the local businesses can display an online advertisement over a map (say
Google Maps). The placement of the advertisement can be based on the search string
used to retrieve the map. These are generally referred to as map advertisements.
Occasionally, an advertisement appears as a translucent film over the web page.
These are called floating advertisements. Generally, these advertisements have a
close button, and the user can close the advertisement by clicking on the close
button. Sometimes, these advertisements float over the web page for a few seconds
before disappearing or dissolving. In such a case, there is no click-through involved,
and hence it is difficult to measure the effectiveness of such ads.
724 V. Nagadevara
Websites are generally designed to display at a fixed width and in the centre of
the browser window. Normally, this leaves a considerable amount of space around
the page. Some display advertisements take advantage of these empty spaces. Such
advertisements are called wallpaper advertisements.
There are many options for getting the advertisements displayed online. Some of
these are discussed below.
One of the most popular options is placing the advertisements on social media.
You can get your ads displayed on social media such as Facebook, Twitter
and LinkedIn. In general, Facebook offers standard advertisement space on the
right-hand side bar. These advertisements can be placed based on demographic
information as well as hobbies and interests which can make it easy to target the
right audience. In addition to these ads, Facebook also offers engagement ads which
will facilitate an additional action point such as a Like button or Share button or
a button to participate in a poll or even to add a video. Sponsored stories or posts
can also be used to promote a specific aspect of a brand. These stories or posts can
appear as news feeds. You can even publicize an existing post on Facebook.
Twitter also allows advertisements. Some promotional tweets appear at the top
of the user’s time line. The section “Who to Follow” can also be used to have
your account recommended at a price. Usually, the payment is made when a user
follows a promoted account. The Trends section of Twitter is also available for
advertisements. While this section is meant for the most popular topics at any
particular time, the space is also available for full-service Twitter ads customers.
LinkedIn allows targeted advertisements with respect to job title, job function,
industry, geography, age, gender, company name and company size, etc. These
advertisements can be placed on the user’s home page, search results page or any
other prominent page.
Online advertisements can be booked through a premium media provider, just
like one would book from a traditional advertising agency. A premium media
provider usually has access to high-profile online space and also can advise on
various options available.
Another option is to work with an advertising network. Here, a single sales entity
can provide access to a number of websites. This option works better if the collection
of websites is owned or managed by a common entity, such as HBO or Times
Inc. The Google Display Network is another such entity. Usually, the advertising
network can offer targeting, tracking and preparing analytic reports in addition
to providing a centralized server which is capable of serving ads to a number of
websites. This advertising network can also advise you based on various factors that
influence the response, such as demographics or various topics of interest.
If you are looking for advertising inventory (unsold advertising space), advertising
exchanges can help. The publishers place their unsold space on the exchange,
and it is sold to the highest bidder. The exchange tries to match the supply and
demand through bidding.
One of the fastest-growing forms of display advertising is mobile advertising.
Mobile advertising includes SMS and MMS ads. Considering that mobile is an
intensely interactive mass medium, advertisers can use this medium for viral marketing.
It is easy for a recipient of an advertisement to forward the same to a friend. Through
this process, users also become part of the advertising experience.
There are blind networks such as www.buzzcity.com which will help you to
target a large number of mobile publishers. A special category of blind networks
(which can be called premium blind networks) can be used to target high-traffic sites.
The sites www.millennialmedia.com and www.widespace.com are good examples
of premium blind networks.
Ad servers play a major role in display advertisement. These servers can be
owned by publishers themselves or by a third party and are used to store and serve
advertisements. The advantage of the ad servers is that a simple line of code can
call up the advertisement from the server and display it on the designated web page.
Since the ad is stored at one single place, any modifications are to be carried out
at one place only. In addition, the ad servers can supply all the data with respect
to the number of impressions, clicks, downloads, leads, etc. These statistics can be
obtained from multiple websites. One example of a third-party ad server is Google
DoubleClick.
The entire process described above takes less than half a second. In other words,
the entire process is completed and the display ad is shown while the browser is
loading the requested page on the consumer’s screen.
This process of matching the right display to the right consumer is completely
data driven. Data with respect to the context, who is to see the display, the profile of
the consumer, who is a good target, etc. is part of the process. In order to complete
the process within a very short time span, it is necessary to build all the rules
in advance into the system. The advertiser would have analyzed all the data and
identified the appropriate profile, the possible number of exposures of the ad and the
rules for bidding beforehand, based on their own analytics. The rules try to match the
information about the context, profile of the consumer and space available received
from the publisher with the requirement and trigger automatic bidding.
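The idea of building the bidding rules into the system in advance can be sketched as follows. This is a deliberately simplified, hypothetical example: the rule fields (`contexts`, `age_range`, `max_exposures`, `bid_cpm`), the request fields and the function `decide_bid` are assumptions made for illustration and do not correspond to any particular ad exchange or bidding protocol.

```r
# Rules fixed in advance by the advertiser (all values are hypothetical).
rules <- list(
  contexts      = c("sports", "fitness"),  # page contexts worth bidding on
  age_range     = c(18, 35),               # target consumer profile
  max_exposures = 3,                       # cap on exposures per consumer
  bid_cpm       = 2.50                     # bid placed when every rule matches
)

# Return the bid for an incoming request, or 0 when the rules do not match.
decide_bid <- function(request, rules) {
  matches <- request$context %in% rules$contexts &&
    request$age >= rules$age_range[1] &&
    request$age <= rules$age_range[2] &&
    request$exposures_so_far < rules$max_exposures
  if (matches) rules$bid_cpm else 0
}

# Two hypothetical requests received from a publisher.
req_match    <- list(context = "sports", age = 27, exposures_so_far = 1)
req_no_match <- list(context = "cooking", age = 27, exposures_so_far = 1)

bid_match    <- decide_bid(req_match, rules)
bid_no_match <- decide_bid(req_no_match, rules)
print(c(bid_match = bid_match, bid_no_match = bid_no_match))
```

Because the decision reduces to a handful of precomputed comparisons, it can be evaluated within the very short time available for each request.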
The programmatic display advertisement benefits both the publishers and advertisers.
The advertisers benefit by effectively targeting only those who match the
existing profiles. These profiles can be obtained by analyzing their own data or
from third parties. With the context built into the process, the advertiser can reach
consumers who are browsing content that is most relevant to the product or service
that the advertiser is offering. They can be very selective in terms of the website
and the right context. They do not have to tie themselves to a pre-negotiated price
and quantity. They will be paying only for the relevant exposure that was made to
the most relevant target audience. The advertiser can analyze data with respect to
visits to the website, bounce rates, earlier marketing efforts, etc. in order to improve
the conversions. It also enables the advertiser to quickly review and revise the
advertisement strategies instead of waiting for the entire campaign to be completed.
The publishers gain by maximizing the revenue through auctioning the available
space. Each ad is auctioned to the highest bidder based on the context and consumer
profile. It also allows them to optimize the available advertising space.
Programmatic display advertising opens up yet another opportunity in display
advertising. The advertisers can use dynamic creative optimization (DCO) to
improve the conversion rates. DCO involves breaking an ad into a number of
components and creating different versions of each component. These components
can be with respect to content, visuals, colours, etc. These components are then
dynamically assembled to suit a particular consumer based on the context, profile,
demographics as well as earlier browsing history. These details along with any other
information available (such as time of the day or weather at that particular time) are
fed into the DCO platform. The display ad is assembled based on this information,
using predetermined rules before it is sent to the publisher’s server. Thus, DCO
can take advantage of the targeting parameters received from the ad exchange to
optimally create (assemble) the appropriate ad.
It is expected that programmatic display advertising in the USA alone will reach
more than $45 billion by 2019 (Fig. 21.2).
Web technology companies such as Amazon, Facebook, eBay, Google and LinkedIn
are known to use A/B testing in developing new product strategies and approaches.
A/B testing (also called A/B splits or controlled experimentation) is one of the
widely used data-driven techniques to identify and quantify customer preferences
and priorities. This dates back to Sir Ronald A. Fisher’s experiments at the
Rothamsted Agricultural Experimental Station in England in the 1920s. It was
called A/B splits because the approach was to change only one variable at a time.
The publication of Ronald Fisher’s book The Design of Experiments changed this
approach to one in which the values of many variables are changed simultaneously and the
impact of each of the variables is estimated. These experimental designs use the
concept of ANOVA (discussed in the earlier chapter) extensively, with appropriate
modifications.
$$Y_{ij} = \mu + \beta_i + \varepsilon_{ij}$$
where $Y_{ij}$ is the response corresponding to the $i$th treatment and $j$th replication,
$\mu$ is the overall mean,
$\beta_i$ is the treatment effect and
$\varepsilon_{ij}$ is the random error.
Since the treatments are nominal variables, these are represented as dummy
variables. As there are three treatments, these will be represented by two dummy
variables. Type of Display “A” is represented by D1 = 1 and D2 = 0, “B” is
represented by D1 = 0 and D2 = 1, and “C” is represented by D1 = 0 and D2 = 0.
The data reformatted with dummy variables as required for the regression and the
results of regression analysis are presented in Table 21.2 (a) and (b).
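The dummy-variable coding described above can be checked with a short R sketch. It assumes the visitor counts listed in Table 21.2(a), with Display C as the omitted category; the object names (`visitors`, `display`, `crd`, `fit`) are illustrative and not from the source.

```r
# Visitor counts from Table 21.2(a): five runs for each type of display.
visitors <- c(2565,  864, 1269, 2025, 2241,   # Display A
              2295, 2430, 2133, 1350,  864,   # Display B
              2079, 3051, 2619, 3105, 3348)   # Display C
display <- rep(c("A", "B", "C"), each = 5)

# Dummy coding with Display C as the omitted (reference) category.
crd <- data.frame(
  Y  = visitors,
  D1 = as.numeric(display == "A"),
  D2 = as.numeric(display == "B")
)

# Regression of the response on the two dummy variables.
fit   <- lm(Y ~ D1 + D2, data = crd)
coefs <- coef(fit)
print(round(coefs, 1))

# Group means, for comparison with the coefficients.
group_means <- tapply(visitors, display, mean)
print(group_means)
```

Under this coding the intercept equals the mean for Display C (2840.4), and the coefficients of D1 and D2 (−1047.6 and −1026.0) are the differences between the means of Displays A and B and the mean of Display C, in agreement with Table 21.2(b).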
Table 21.2 (a) Reformatted data with dummy variables for the CRD experiment. (b) Regression
results of the CRD experiment
(a) Reformatted data with dummy variables for the CRD experiment
Yij D1 D2
2565 1 0
864 1 0
1269 1 0
2025 1 0
2241 1 0
2295 0 1
2430 0 1
2133 0 1
1350 0 1
864 0 1
2079 0 0
3051 0 0
2619 0 0
3105 0 0
3348 0 0
(b) Regression results of the CRD experiment
Regression statistics
Multiple R          0.6531
R square            0.4265
Adjusted R square   0.3309
Standard error      633.7240
Observations        15
             df   SS          MS          F
Regression   2    3,584,347   1,792,174   4.4625
Residual     12   4,819,273   401,606.1
Total        14   8,403,620
            Coefficients   Standard error   t stat    P-value
Intercept   2840.4         283.41           10.0222   0.0000
D1          −1047.6        400.8022         −2.6138   0.0226
D2          −1026.0        400.8022         −2.5599   0.0250
It can be seen that the ANOVA table calculated in Table 21.3 is identical to the
ANOVA table obtained from the regression analysis. In addition, the regression
coefficients corresponding to the dummy variables D1 and D2 are negative and
statistically significant. This implies that Display C (the reference category left out in creating
the dummy variables) has resulted in significantly more visitors than Displays
A and B. The intercept is the mean response for Display C, and the coefficients of D1 and D2
are the differences between the mean responses of Displays A and B, respectively, and that of
Display C. The pairwise differences
in the treatment effects can be obtained by post hoc tests.
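One way to obtain such pairwise comparisons in R is Tukey's honest significant difference test. The source does not say which post hoc test was used, so the choice of `TukeyHSD` here is an assumption; the data are the visitor counts of Table 21.2(a), and the object names are illustrative.

```r
# Visitor counts from Table 21.2(a), five runs per type of display.
visitors <- c(2565,  864, 1269, 2025, 2241,   # Display A
              2295, 2430, 2133, 1350,  864,   # Display B
              2079, 3051, 2619, 3105, 3348)   # Display C
display <- factor(rep(c("A", "B", "C"), each = 5))

# One-way ANOVA for the completely randomized design.
fit <- aov(visitors ~ display)

# Tukey's HSD: all pairwise differences in means with adjusted p-values
# and simultaneous confidence intervals.
posthoc <- TukeyHSD(fit)
print(posthoc)

# The matrix of pairwise results has one row per pair (B-A, C-A, C-B).
pairwise <- posthoc$display
```

Because Tukey's procedure adjusts for multiple comparisons, its p-values are generally larger than the unadjusted p-values of the individual regression coefficients, so the two sets of results need not lead to identical conclusions.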
In the above experiment, it is possible that there is an effect of the search engine, in
addition to the effect of the type of display. In other words, there are two sources of
variation, the type of display and the search engine. Since the displays are randomly
assigned to the search engines and weeks, it is possible that all the 3 weeks selected
for the first search engine could have been assigned Display A, while no week
is assigned for Display A for the second search engine. It is necessary to run
the experiment in “blocks” if we need to isolate the effect of display as well as
the effect of the search engine. Such a design is called the randomized complete
block design. In other words, the experiment should be run in blocks such that
the three types of display are tested on each search engine on each of the weeks.
When the visitors come to the landing page, they can “click” on the page to obtain
additional information. The number of clicks is recorded, and the click-through rate
is calculated as Click-through rate (CTR) = Number of clicks/Number of visitors.
The design and the results are shown in Table 21.4.
In a general randomized block design, there are k treatments and b blocks. The
observations are represented by Yij where i represents the treatments and j represents
the blocks.
The above data can be analyzed for differences between means by ANOVA,
except that the sums of squares corresponding to treatments and blocks have to
be estimated separately. The formulae for calculating these sums of squares are
presented as follows:
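$$SS(\text{Treatment}) = b \sum_{i=1}^{k} \left( \bar{Y}_{i.} - \bar{Y}_{..} \right)^2$$
$$SS(\text{Block}) = k \sum_{j=1}^{b} \left( \bar{Y}_{.j} - \bar{Y}_{..} \right)^2$$
$$SS(\text{Total}) = \sum_{i=1}^{k} \sum_{j=1}^{b} \left( Y_{ij} - \bar{Y}_{..} \right)^2$$
SS(Error) can also be calculated as
$$SS(\text{Error}) = \sum_{i=1}^{k} \sum_{j=1}^{b} \left( Y_{ij} - \bar{Y}_{i.} - \bar{Y}_{.j} + \bar{Y}_{..} \right)^2$$
where $\bar{Y}_{i.}$ is the mean of the $i$th treatment, $\bar{Y}_{.j}$ is the mean of the $j$th block
and $\bar{Y}_{..}$ is the overall mean.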
The regression model for the randomized complete block design can be represented by
$$Y_{ij} = \mu + \beta_i + \pi_j + \varepsilon_{ij}$$
where $Y_{ij}$ is the response for the $i$th treatment and $j$th block,
$\mu$ is the mean,
$\beta_i$ is the effect of Treatment $i$,
$\pi_j$ is the effect of Block $j$ and
$\varepsilon_{ij}$ is the random error.
The coding of the above data for regression analysis is shown in Table 21.6. The
dummy variables corresponding to the treatments (types of display) are represented
by Ti , and those corresponding to blocks (search engines) are represented by SEj .
As usual, the dummy variables corresponding to T3 and SE5 are omitted. The
regression output is presented in Table 21.7.
It can be seen that the regression sum of squares is equal to the sum of the
“treatment sum of squares” and the “block sum of squares”. Consequently, the “F”
value is different. It can also be seen that all the treatment effects and block effects
are significant indicating that the effects of T1 and T2 are significantly better than
that of T3. Similarly, the effects of the first four search engines (SE1 to SE4) are
significantly better than that of SE5. The pairwise comparisons can be obtained by
running post hoc tests.
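The relationship between the sums of squares of the randomized complete block design and the regression output can be illustrated with a small R sketch. The click-through rates below are made up for illustration (they are not the data of Table 21.4), and the object names are hypothetical; the layout assumed is three types of display (treatments) tested on five search engines (blocks).

```r
# Hypothetical click-through rates (%): rows are treatments, columns are blocks.
ctr <- matrix(c(2.1, 2.4, 1.9, 2.8, 1.5,
                2.6, 2.9, 2.2, 3.1, 1.8,
                1.7, 2.0, 1.6, 2.3, 1.2),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("T", 1:3), paste0("SE", 1:5)))
k <- nrow(ctr)              # number of treatments
b <- ncol(ctr)              # number of blocks
grand     <- mean(ctr)      # overall mean
trt_means <- rowMeans(ctr)  # treatment means
blk_means <- colMeans(ctr)  # block means

# Sums of squares computed directly from their definitions.
ss_treatment <- b * sum((trt_means - grand)^2)
ss_block     <- k * sum((blk_means - grand)^2)
ss_total     <- sum((ctr - grand)^2)
ss_error     <- sum((ctr - outer(trt_means, blk_means, "+") + grand)^2)

# The same decomposition from a regression with treatment and block factors.
long <- data.frame(
  Y         = as.vector(ctr),                         # column-major order
  Treatment = factor(rep(rownames(ctr), times = b)),
  Block     = factor(rep(colnames(ctr), each = k))
)
tab <- anova(lm(Y ~ Treatment + Block, data = long))
print(tab)

# Regression sum of squares = treatment SS + block SS.
ss_regression <- sum(tab[c("Treatment", "Block"), "Sum Sq"])
```

In this balanced layout the treatment, block and error sums of squares add up to the total sum of squares, and the regression sum of squares is the sum of the first two.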
The real power of experimental designs is felt when we have to estimate the effects
of a number of variables simultaneously. For example, consider a scenario where a
company is contemplating a particular display advertisement. They have identified
three different variables (factors), each having two levels, to test. These are font
(traditional font vs. modern), background colour (white vs. blue) and click button
design (simple “Okay” vs. large “Click Now to Join”). The ad copies are randomly
displayed to each viewer, and the conversion rate (defined as those who click to
reach the website and join as members (free of cost)) is calculated. In the traditional
experimental design, such as CRD, we will first decide which one is likely to
be most important. Let us say the click button is the most important. Then, we
would combine each of the two levels of click button with one of the other two
factors (say, traditional font and white background) and run four replications. This
actually involves eight runs (four each of simple button + traditional font + white
background and large button + traditional font + white background). The resulting
conversion rate can be used to decide which type of click button is more effective
(say, large button). Now, we will select the background colour for experimentation.
Since we already have the combination with white background, we will now select
blue background and combine it with “large button” (since it was more effective)
and traditional font and run four replications. Suppose the results show that blue
background is more effective. Now, we select the combination of large button and
blue background and combine it with traditional font and modern font. We already
have four runs of traditional font, blue colour and large button. Now we have to
carry out four runs of the combination of modern font, blue colour and large button.
Thus, we have a total of 16 runs. These 16 runs will help us to estimate the effects
of font, background colour and type of click button. But it is also possible that
there can be interaction effects between these factors. For example, a combination
of simple button with blue background and modern font could be much more effective
than any other combination. Notice that we did not experiment with this particular
combination at all, and hence, we have no way of estimating this effect. The same is true
of many other interactions.
The factorial design developed by Ronald Fisher is a much better approach. With
three factors and two levels for each factor, there are eight possible combinations.
These eight combinations can be displayed randomly to the viewers and the
conversion rate calculated. It is important that each combination is displayed
with equal probability. This process involves only eight runs instead of 16 runs
required in the earlier approach. Table 21.8 shows the factorial design of the above
experiment with two replications. The levels of each factor are represented by +1
and −1. The coding is as follows:
There are only two levels for each of the factors in our experiment. Hence, these
designs are called two-level factorial designs (since there are three factors, this
design is referred to as a $2^3$ factorial design). In general, there can be many more
levels for each factor. The change in the response (conversion rate) when the level
of the factor is changed from −1 to +1 is called the “main effect”. For example,
the main effect for the factor “font” is the change in the conversion rate when
the font is changed from “traditional (−1)” to “modern (+1)”. When the effect of
one factor is influenced by another factor (a typical example is water and fertilizer
in agricultural experiments), it implies that there is a synergy between these two
factors. Such effects are called interaction effects. In Table 21.9, the coding of
interaction variables is the multiplication of the corresponding columns.
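The coded design matrix for a two-level, three-factor experiment, including the interaction columns, can be generated in R as sketched below. The column names follow the variable names used in the chapter's R model (Font, Background, Click, FB, FC, BC, FBC); the assignment of −1 and +1 to particular levels is left to the experimenter, and the ordering of the runs produced by `expand.grid` is not necessarily that of Table 21.9.

```r
# All 2^3 = 8 combinations of the three factors, coded as -1 and +1.
design <- expand.grid(Font = c(-1, 1), Background = c(-1, 1), Click = c(-1, 1))

# Interaction columns are element-wise products of the corresponding
# main-effect columns.
design$FB  <- design$Font * design$Background
design$FC  <- design$Font * design$Click
design$BC  <- design$Background * design$Click
design$FBC <- design$Font * design$Background * design$Click

print(design)
```

Each column contains four runs at −1 and four at +1, so every column sums to zero.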
In order to isolate the effects of each factor and the interactions, we need to
calculate the sum of squares corresponding to each main effect and interaction
effect. The ANOVA table for the data of the above experiment is presented in
Table 21.10. These results are obtained by running the model in R using the
following code:
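# Fit the full factorial model: three main effects plus the interaction columns
Twoway_anova <- aov(Conversion_Rate ~ Font + Background + Click + FB + FC + BC + FBC,
                    data = factorial_experiment)
# Print the ANOVA table (sums of squares, F values and p-values)
summary(Twoway_anova)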
It can be seen from the above table that the main effects of “font” and “click
button” are significant and the effect of “background colour” is not significant. In
addition, only the interaction between the “font” and “click button” is significant.
All other interactions are not significant.
The mean conversion rates for each level of the factors and the interactions are
presented in Table 21.11.
The mean effects of different factors and the interactions are also presented in
Table 21.11 (overall column). The way these effects need to be interpreted is that
the conversion rate will go up by 2.1038 when we change the font from traditional
to modern. The increase in the conversion rate is only 0.0637 when we change the
background colour from white to blue.
The conclusion is that the company should use the modern font with the large
“Click Now to Join” click button. The background colour does not matter.
More or less similar information could be obtained by carrying out a regression
on the conversion rate with the columns of main effects and interaction effects in
Table 21.9. The results of the regression analysis are presented in Table 21.12.
You can notice that the p-values of each effect in the regression analysis and
the ANOVA in Table 21.10 match exactly. The intercept is nothing but the overall
mean, and the regression coefficients corresponding to each factor or interaction are
the shifts (positive or negative) from the overall mean.
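The link between the regression coefficients and the effects can be seen in a small R sketch. The conversion rates used here are hypothetical (one made-up value per run, not the data of Table 21.9), and the object names are illustrative. With −1/+1 coding, the intercept is the overall mean and each coefficient is the shift from the overall mean, that is, half of the change observed when the factor moves from −1 to +1.

```r
# 2^3 factorial design in -1/+1 coding.
design <- expand.grid(Font = c(-1, 1), Background = c(-1, 1), Click = c(-1, 1))

# Hypothetical conversion rates (%), one per run; not the book's data.
design$Rate <- c(2.0, 4.1, 2.2, 4.0, 3.1, 6.9, 3.0, 7.2)

# Regression on the main effects and all interactions (products of columns).
fit   <- lm(Rate ~ Font * Background * Click, data = design)
coefs <- coef(fit)
print(round(coefs, 4))

# Main effect of font: mean response at +1 minus mean response at -1.
font_effect <- mean(design$Rate[design$Font == 1]) -
  mean(design$Rate[design$Font == -1])
overall_mean <- mean(design$Rate)
print(c(overall_mean = overall_mean, font_effect = font_effect))
```

Since there is only one observation per run in this sketch, the full model fits the data exactly and no error variance can be estimated; replications, as in the chapter's experiment, are needed for significance tests.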
The above example deals with a two-level factorial experimental design. The same
model can be expanded to scenarios where there are more than two levels for
the factors. The real problem will be the number of possible runs needed for
the experiment. If there are six factors with two levels each, then the experiment
will require 64 runs, not counting the replications. In such situations, one can
use “fractional factorial designs”. The discussion on fractional factorial designs
is beyond the scope of this book. Interested students can read any textbook on
experimental designs.
The interaction effects can be gauged better by drawing the interaction graphs.
Figure 21.3 shows the interaction graphs for the three two-factor interactions (FB,
FC and BC). When the two lines in the graph run parallel to each other, it indicates
that there is no interaction between the two factors. A comparison between the two
graphs, FB and FC, indicates that the conversion rate increases significantly when
both font and click button are set at +1 level.
4.4 Orthogonality
An experimental design is said to be orthogonal if for any two design factors, each
factor level combination has the same number of runs. The design specified above
is an orthogonal design. Consider any two factors in the experiment and the effect
on the response variable is studied for four possible combinations. There are exactly
two runs (not counting the replications) for each combination. In addition, if you
take any two columns (other than the response column) in Table 21.9 and multiply
the corresponding elements and total them, the total is always zero. This also implies
that the correlation between any two columns in Table 21.9 (other than the response
column) is zero. This is a characteristic of the two-level factorial design. Because of
this orthogonal nature of the design, all the effects can be estimated independently.
Thus, the main effect of “font” does not depend on the main effect of “click button”.
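The two properties described above can be checked with a short Python sketch. It assumes a 2^3 full factorial with factors F (font), B (background colour) and C (click button) coded −1/+1; the column names and the dictionary layout are illustrative and are not taken from Table 21.9.

```python
from collections import Counter
from itertools import combinations, product
from math import prod

# Eight runs of a 2^3 full factorial, each run a tuple (F, B, C).
runs = list(product([-1, 1], repeat=3))

# Main-effect and interaction columns, built as element-wise products.
terms = {"F": (0,), "B": (1,), "C": (2,),
         "FB": (0, 1), "FC": (0, 2), "BC": (1, 2), "FBC": (0, 1, 2)}
columns = {name: [prod(run[i] for i in idx) for run in runs]
           for name, idx in terms.items()}

# Sum of element-wise products for every pair of columns.
dot = {(a, b): sum(x * y for x, y in zip(columns[a], columns[b]))
       for a, b in combinations(columns, 2)}

# Number of runs at each level combination for every pair of factors.
balance = {(a, b): Counter(zip(columns[a], columns[b]))
           for a, b in combinations(["F", "B", "C"], 2)}

print("all pairwise products sum to zero:",
      all(v == 0 for v in dot.values()))
print("every factor-level combination has two runs:",
      all(set(c.values()) == {2} for c in balance.values()))
```

Since every column also sums to zero, a zero sum of products is the same as a zero correlation, which is why each effect can be estimated without reference to the others.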
Experimental designs are extensively used in many social networks such as
Facebook, Twitter and LinkedIn to make data-driven decisions. LinkedIn actually
created a separate platform called XLNT (pronounced as Excellent) to carry out
experiments on a routine basis. The platform can support more than 400 experiments
per day with more than 1000 metrics. They have been using this platform for
deploying experiments and analyzing them to facilitate product innovation. Their
experiments range from visual changes in the home pages to personalizing the
subject lines in emails³.
3 Ya Xu et al., “From Infrastructure to Culture: A/B Testing Challenges in Large Scale Social
Networks”, KDD’15, 11–14 August 2015, Sydney, NSW, Australia.
[Figure: flow diagram. A customer visits the website; products are added to the cart; “Who is this person like?”; the product database and the ad database are consulted; offer the right product; serve the right ad.]
that the quality score is 10) which is marginally higher than the next best rank. The
addition of one cent ($0.01) in the calculation formula is to ensure that the winning
ad is marginally higher and not tied with the next best ranked ad.
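The pricing rule can be illustrated with a small Python sketch. The formula itself is not visible in this passage, so the sketch assumes the commonly cited simplified form: ad rank = bid × quality score, and the cost per click (CPC) of an ad = (ad rank of the next best ad ÷ own quality score) + $0.01. The advertisers, bids and quality scores are hypothetical.

```python
# Hypothetical advertisers: (name, maximum bid in $, quality score).
ads = [("A", 4.00, 10), ("B", 5.00, 6), ("C", 3.00, 8)]

# Assumed simplified rule: ad rank = bid x quality score.
ranked = sorted(ads, key=lambda ad: ad[1] * ad[2], reverse=True)

cpc = {}
for pos, (name, bid, qs) in enumerate(ranked):
    if pos + 1 < len(ranked):
        _, next_bid, next_qs = ranked[pos + 1]
        # Pay just enough to match the next ad's rank, plus one cent.
        cpc[name] = next_bid * next_qs / qs + 0.01
    else:
        # Lowest-ranked ad: there is no competitor below it, so the
        # price is left unspecified in this sketch.
        cpc[name] = None

order = [name for name, _, _ in ranked]
for name in order:
    price = "n/a" if cpc[name] is None else f"${cpc[name]:.2f}"
    print(name, price)
```

In this illustration advertiser A pays $3.01, which with a quality score of 10 corresponds to a rank of 30.1, marginally above the next best rank of 30.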
Prediction Model for Ad and Product
Similar methods can be employed to display appropriate ads and/or make appropriate
recommendations when customers access e-commerce sites for purchasing
a product or service. The process is briefly presented in Fig. 21.4. Let us consider
a customer who visits a particular e-commerce website looking for a stroller. She
logs in and goes through different models and selects an infant baby stroller and
adds the same to her shopping cart. Her past search history indicates that she had
also been searching for various products for toddlers. In addition, her demographic
profile is obtained from her log-in data as well as her previous purchasing decisions.
This information can be fed into a product database which identifies a “duo stroller
(for an infant and a toddler)” which is marginally higher in cost as compared to the
one she had already selected. An advertisement corresponding to the duo stroller is
picked up from an ad database and displayed to the customer along with a discount
offer. The customer clicks on the ad, gets into the company website of the duo
stroller, checks the details and goes through some reviews. She goes back to the
e-commerce website and buys the duo stroller with the discount offered.
A similar process can be applied to make recommendations such as “those who
purchased this item also bought these” to various customers based on simple market
basket analysis. Other product recommendations can be made based on customer
profiles, past browsing history, or similar items purchased by other customers.
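A recommendation of the “also bought” kind can be sketched with simple co-occurrence counts. The orders below are hypothetical, and `also_bought` is an illustrative helper rather than a function from any library; the score it returns is the share of orders containing the chosen item that also contain the other item (the “confidence” of market basket analysis).

```python
from collections import Counter

# Hypothetical past orders, each a set of purchased items.
orders = [
    {"stroller", "car seat", "diaper bag"},
    {"stroller", "car seat"},
    {"stroller", "rain cover"},
    {"car seat", "diaper bag"},
    {"stroller", "car seat", "rain cover"},
]


def also_bought(item, orders, top=2):
    """Return the items most often bought together with `item`,
    each with its confidence (co-occurrences / orders containing `item`)."""
    with_item = [o for o in orders if item in o]
    counts = Counter(other for o in with_item for other in o if other != item)
    return [(other, n / len(with_item)) for other, n in counts.most_common(top)]


recs = also_bought("stroller", orders)
for other, confidence in recs:
    print(f"customers who bought a stroller also bought: {other} ({confidence:.0%})")
```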
Traditionally, TV and print media advertising were considered the most
effective marketing media. In recent years, however, digital ads have been
outperforming all other media. With technology enabling advertisers to track consumers’
digital footprints and purchasing activities in the digital world, advertisers are
able to gain more insights into the behaviour of the consumers. Nevertheless, the
consumers are simultaneously exposed to many other types of advertising in the
process of making online purchases. Digital media constantly interacts with other
media through multichannel exposures, and they complement each other in making
the final sale. In such a scenario, attribution of the share of various components of
digital media as well as other media is becoming more and more important.
The consumer today is exposed to multiple channels, each providing a different
touch point. Imagine a consumer who, while watching a TV show, comes across an ad
for the Samsung S8 smartphone, searches for “Samsung smartphones” on Google
and comes across a pop-up ad for the Samsung C9. He clicks on the ad and browses
the resulting website for various comments and reviews on C9. He watches a couple
of YouTube videos on C9 by clicking on a link given in one of the reviews. Then
he goes to a couple of e-commerce sites (say, Amazon and Flipkart) and does a
price comparison. A couple of days later, he receives email promotions on C9 from
both the e-commerce sites. He comes across a large POS advertisement for C9 at
a neighbourhood mobile store and stops and visits the store to physically see a C9
that is on display. A couple of days later, he receives a promotional mailer from
Axis Bank which offers a 10% cashback on C9 at a particular e-commerce site (say
Amazon) and finally buys it from the site.
The question here is: How much did each of these ads influence the consumer’s
decision and how should the impact of each of these channels be valued? These
questions are important because the answers guide us in optimizing our advertisement
spend on different channels.
Today, the explosion of data from various sources provides unmatched access
to consumer behaviour. Every online action on each and every website is recorded,
including the amount of time spent on a particular web page. Data from transactions
from retail stores, credit card transactions, call centre logs, set-top boxes, etc. are
available. All this information can be analyzed to not only understand consumer
behaviour but also evaluate the contribution of each marketing channel.
An attribution model is the rule, or set of rules, that determines how credit for
sales and conversions is assigned to touch points in conversion paths. Attribution
is giving credit to different channels that the company employs to promote and
broadcast the advertising message. Digital attribution pertains to attributing credit to
various components for providing the marketing message in the online world using
various forms of media and online platforms. It is true that the digital world offers
phenomenal opportunities for measurability and accountability. Nevertheless, it is
more challenging in the digital world to disentangle the impacts of various forms of
advertising and those of different platforms employed. Some of the actions are not
[Figure: conversion funnel with five touch points: Social Network, Organic Search, Referral, AdWords and Direct.]
the touch point which clinched the conversion. At the same time, it does not ignore
other touch points.
The “Social Network” and the “Direct” touch points get 40% each, whereas the
remaining three will get 6.67% each, in the above funnel.
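The arithmetic behind these shares can be written as a small Python function. This is a sketch under the 40/40 split quoted above; the function name, the handling of very short paths and the order of the three middle touch points are assumptions made for illustration, and touch point names are assumed to be distinct.

```python
def position_based_weights(touch_points, first=0.40, last=0.40):
    """Credit shares for a position-based model: fixed shares for the first
    and last touch points, the remainder split equally over the middle ones."""
    n = len(touch_points)
    if n == 0:
        return {}
    if n == 1:
        return {touch_points[0]: 1.0}
    if n == 2:
        # Assumption: with no middle touch points, credit is split in the
        # ratio first:last.
        total = first + last
        return {touch_points[0]: first / total, touch_points[1]: last / total}
    middle = (1.0 - first - last) / (n - 2)
    weights = {tp: middle for tp in touch_points[1:-1]}
    weights[touch_points[0]] = first
    weights[touch_points[-1]] = last
    return weights


funnel = ["Social Network", "Organic Search", "Referral", "AdWords", "Direct"]
weights = position_based_weights(funnel)
for tp in funnel:
    print(f"{tp:<15} {weights[tp]:.2%}")
```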
(viii) Custom or Algorithmic Attribution Model
Based on the merits and demerits of the models described earlier, the analyst can
build a custom attribution model. These custom models are built based on data on
customer behaviour obtained at different stages. It is being acknowledged that these
“evidence-based” models are better and more realistic. Unfortunately, these models
are not easy to build. The main principle is to estimate which of the touch points
contribute to what extent based on the available customer data so that it represents
a more accurate picture of the customer’s journey from the initiation to conversion.
This is called a custom or algorithmic attribution model. These models not only
make use of customer data but also use statistical techniques which can lead to
continuous optimization of the advertisement budget.
A simple approach for building an algorithmic attribution model starts with
identifying the key metrics that are to be used to measure the relative effectiveness
of each channel. It is important to associate an appropriate time period with this.
This is to be followed with the cost per acquisition for each channel.
With the present technology, the data with respect to various touch points
is obtained without much difficulty. This data can be used to build models to
predict conversion or otherwise. Generally, the target variable is binary, whether
the conversion is successful or not. The predictor variables are the various touch
points. To start with, the metric for effectiveness of the model can be the prediction
accuracy. Based on the contribution of each of the predictor variables, the attribution
to various touch points can be calculated. Consider building a logistic regression
model for predicting the success of conversion. Once we achieve acceptable prediction
accuracy levels for both the training and testing datasets, the coefficients of the
logistic regression can be used to calculate the percentage attribution to various
touch points or channels⁴.
While it is true that any of the predictive models can be employed for this
purpose, models such as logistic regression or discriminant analysis are more
amenable since the coefficients corresponding to different predictor variables are
available with these models. We can also use black box methods such as artificial
neural networks or support vector machines. In such a case, we can assign the
attribution values to different predictor variables based on the “sensitivity index”
obtained after building the model.
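One possible way of turning logistic regression coefficients into attribution percentages is sketched below in plain Python. Everything specific in it is an assumption: the three channels and the grouped counts are hypothetical, the model is fitted by simple gradient ascent rather than by a statistical package, the train/test accuracy check is omitted, and the attribution rule (each positive coefficient as a share of the sum of positive coefficients) is only one simple choice that the text does not prescribe.

```python
from math import exp

channels = ["search", "social", "email"]

# Hypothetical grouped data: (exposure flags per channel, visitors, conversions).
data = [
    ((0, 0, 0), 1000, 20), ((1, 0, 0), 1000, 60),
    ((0, 1, 0), 1000, 40), ((0, 0, 1), 1000, 30),
    ((1, 1, 0), 1000, 110), ((1, 0, 1), 1000, 85),
    ((0, 1, 1), 1000, 60), ((1, 1, 1), 1000, 150),
]


def fit_logistic(data, n_features, lr=1.0, steps=20000):
    """Gradient ascent on the log-likelihood of grouped binary outcomes."""
    w = [0.0] * (n_features + 1)  # w[0] is the intercept
    total = sum(n for _, n, _ in data)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, n, k in data:
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
            p = 1.0 / (1.0 + exp(-z))
            resid = k - n * p  # observed minus expected conversions
            grad[0] += resid
            for j, xj in enumerate(x, start=1):
                grad[j] += resid * xj
        w = [wj + lr * g / total for wj, g in zip(w, grad)]
    return w


w = fit_logistic(data, len(channels))

# Assumed attribution rule: share of each positive coefficient in their sum.
positive = {c: max(b, 0.0) for c, b in zip(channels, w[1:])}
shares = {c: 100.0 * b / sum(positive.values()) for c, b in positive.items()}

for c in channels:
    print(f"{c:<7} coefficient = {positive[c]:.3f}, attribution = {shares[c]:.1f}%")
```

For a black box model, the same normalization could be applied to whatever sensitivity measure the model provides in place of the coefficients.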
4 For more details, see Shao, Xuhui and Lexin Li, “Data-driven Multi-touch Attribution Models”,
KDD’11, 21–24 August 2011, San Diego, California, USA.
Other algorithmic models that can be used for calculating the attribution are as
follows:
• Survival analysis
• Shapley value
• Markov models
• Bayesian models
We can optimize the budget allocation for each of the channels by using the
attribution values (obtained from the predictive models) and the costs estimated
earlier (CPA, CPM, etc.). Even a simple optimization technique such as budget-constrained
maximization can yield significant results.
Let us consider an example of calculating cost of acquisition (CoA) under
different attribution models. Let us consider a customer who purchased an “Amazon
Home” through the following process:
1. Yajvin, the customer, first clicks on AdWords and visits the website.
2. Then he visits his Facebook page and clicks on the ad displayed on Facebook,
visits the website again and checks out the functionality of the device.
3. Afterwards, he visits the website again through his Twitter account and looks at
the technical details.
4. Then, he directly comes to the website and checks on various reviews.
5. Finally, he clicks on an offer with a discount that he received on email and
purchases the device.
Let us assume that the advertisement expenditure is as follows: AdWords, $12;
Facebook, $15; Twitter, $20; direct, $0; email, $8.
The cost of acquisition under each of the attribution models can be calculated
based on the above information. Table 21.14 provides the details about the ad
spend on each channel, the weightages for each channel under each model and the
calculated CoA.
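The calculation can be sketched in Python for a few rule-based models. Table 21.14 is not reproduced in this sketch, so two things are assumptions: that the CoA is the ad spend on each channel weighted by the credit that the model assigns to it, and that the position-based model uses a 40/20/40 split. The model names and function names are illustrative.

```python
# Yajvin's path to purchase and the ad spend per channel (from the text).
path = ["AdWords", "Facebook", "Twitter", "Direct", "Email"]
spend = {"AdWords": 12, "Facebook": 15, "Twitter": 20, "Direct": 0, "Email": 8}


def weights_for(model, path):
    """Credit assigned to each touch point under a rule-based model."""
    n = len(path)
    if model == "first touch":
        w = [1.0] + [0.0] * (n - 1)
    elif model == "last touch":
        w = [0.0] * (n - 1) + [1.0]
    elif model == "linear":
        w = [1.0 / n] * n
    elif model == "position based":  # assumed 40% first, 40% last, rest shared
        w = [0.4] + [0.2 / (n - 2)] * (n - 2) + [0.4]
    else:
        raise ValueError(f"unknown model: {model}")
    return dict(zip(path, w))


def cost_of_acquisition(model, path, spend):
    """Assumed definition: spend on each channel weighted by its credit."""
    return sum(wt * spend[ch] for ch, wt in weights_for(model, path).items())


coa = {m: cost_of_acquisition(m, path, spend)
       for m in ["first touch", "last touch", "linear", "position based"]}
for model, value in coa.items():
    print(f"{model:<15} CoA = ${value:.2f}")
```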
Similarly, data-based attribution models can be used to estimate the contribution
of each channel. This is important to understand which channels are actually driving
the sales. Based on this, advertisers can spend money more effectively and maximize
ROI. The example below demonstrates the attribution across three different channels
using the Shapley value approach.
Let us consider an example where the company uses three channels for promoting
its product, a smartwatch. These channels are AdWords (Channel A), Facebook
(Channel B) and email (Channel C). Based on the data, the number of watches sold
through each channel (and each possible combination of channels) is obtained and
summarized in Table 21.15.
The company managed to sell 256 smartwatches when the customers were exposed to
(used) all three channels, while it could sell only 64 smartwatches when the
customers used only email and nothing else. These numbers are obtained based on
the analysis of purchase data through different channels. Considering that there are
three channels (A, B and C) in this example, there are six possible permutations
for combining these three channels. These permutations are A→B→C, A→C→B,
B→A→C, B→C→A, C→A→B and C→B→A.
In the first permutation, Channel A contributes 180 (contribution of AdWords
alone), Channel B contributes 12 (channels A and B together contribute 192, and
hence the contribution of B is 192 − 180 = 12), and Channel C contributes 64 (all
the three channels together contribute 256, while A and B together contribute 192,
and hence the contribution of C is 256 − 192 = 64). Similarly, the contribution of
each of the channels corresponding to each permutation is calculated and presented
in Table 21.16.
Across all the six permutations, the contribution of AdWords is 180, 180, 72, 128,
126 and 128. The total of these values is 814, and the average is 135.67. Similarly,
the averages for Facebook and email are 74.67 and 45.67, respectively. These values
are converted into percentages which are presented in the table. These values are
referred to as “Shapley values” (named after the Nobel Laureate Lloyd Shapley).
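The permutation-based calculation can be reproduced with a short Python sketch. The sales for A alone (180), C alone (64), A and B together (192) and all three channels (256) are quoted in the text; the values for B alone (120), A and C together (190) and B and C together (128) are inferred here from the per-permutation contributions of AdWords (72, 126 and 128), since Table 21.15 is not reproduced in this sketch. Variable names are illustrative.

```python
from itertools import permutations

# Watches sold for each set of channels used (see the note above on which
# values are quoted in the text and which are inferred from it).
sales = {
    frozenset(): 0,
    frozenset("A"): 180, frozenset("B"): 120, frozenset("C"): 64,
    frozenset("AB"): 192, frozenset("AC"): 190, frozenset("BC"): 128,
    frozenset("ABC"): 256,
}
channels = "ABC"  # A = AdWords, B = Facebook, C = email

orders = list(permutations(channels))
shapley = {ch: 0.0 for ch in channels}
for order in orders:
    seen = frozenset()
    for ch in order:
        # Marginal contribution of `ch` given the channels already counted.
        shapley[ch] += sales[seen | {ch}] - sales[seen]
        seen = seen | {ch}

# Average the marginal contributions over all permutations.
shapley = {ch: total / len(orders) for ch, total in shapley.items()}

grand_total = sales[frozenset(channels)]
for ch, value in shapley.items():
    print(f"Channel {ch}: {value:.2f} ({value / grand_total:.1%})")
```

The three averages add up to the 256 watches sold with all channels, so dividing each by 256 gives the percentage shares.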
Based on the above analysis, the company should invest more in AdWords and
least in email. As a matter of fact, the advertisement budget can be distributed across
the three channels in the same ratio as the percentage contributions.
The above approach requires large amounts of data. The company needs to obtain
data with respect to each and every combination (all possible subsets as well as
individual channels) of the channels employed. If there are n channels, the data has
to be obtained for 2^n − 1 subsets. Implementing experimental designs could be a
possible approach to obtain the required data. Once an optimization strategy for
budget allocation across different channels is evolved and implemented, constant
monitoring of the channels is necessary for further fine-tuning.
With the growing popularity of smartphones, more than half of search
traffic has started to emanate from mobiles. These devices are also a popular medium
for interacting within social networks. In addition, Google’s mobile ranking algorithm
includes mobile-friendly and mobile usability factors as well as availability
of mobile apps in its indexing. Consequently, those with mobile-friendly websites
and/or mobile apps get much higher ranks and appear at the top of the search results.
It is therefore becoming more and more important for businesses to evolve
a mobile-oriented strategy in order to improve effectiveness of their marketing
campaigns. It is becoming necessary to create mobile marketing strategies which
improve customer experience while using mobiles at every stage of the customer
purchase funnel.
Two important strategies that businesses need to adopt are to create mobile-
friendly websites and mobile apps. Users in the initial stages of the purchase funnel
are most likely to be using the website rather than downloading and installing the
app. On the other hand, mobile apps allow for better interaction and facilitate more
creativity in engaging the customer. In other words, it is necessary for businesses to
create their own mobile-friendly websites as well as create specific apps.
A mobile website is a website which is designed and created specifically for
viewing on a smartphone or a tablet. It needs to be optimized so that it responds
or resizes itself to suit the display based on the type of device. We need to
understand that customers use these devices at different stages in the purchase
funnel. Businesses can accelerate the purchase process through sales alerts, display
advertisements, providing QR codes, extending special discounts and issuing
discount coupons. It is easy to integrate the mobile-based campaigns with different
social media sites so that the customers can interact with others regarding the
products and services within their social networks. The websites need to be
optimized so that they load fast, are easy to navigate and have short menu
structures, and click buttons need to be large enough. The websites and apps
should also ensure that there is a minimum amount of typing required. It is also a
good idea to allow for maps showing the location since many customers tend to use
mobiles when they are on the go.
The website or app should allow users to connect to various social media
platforms. This should also include a feature which will make it easy for customers
to share the information with others in the network. The apps have an additional
advantage. The app stays on the mobile screen, whether the customer is using the
app or not. Every time the customer looks at the screen, they see the name of the
app or name of the brand which acts as a constant reminder.
Geolocation is an important aspect of the mobile strategy. It is easy to integrate
this into mobile apps. Businesses will be able to identify the location of the customer
at any particular moment. Data can be collected on places that the customer visits
on a regular basis (such as where the customer generally takes a walk). With
this kind of information, the app can automatically provide various promotions
or exclusive offers that are currently available at a store that is located nearest
to the customer. Many of the mobile devices today come equipped with “near-field
communication (NFC)”. NFC can be useful in locating the customer within
a particular store or facility, and the app can draw the user’s attention to any items
nearby or special discounts based on the past browsing/search/purchase behaviour
of the customer through SMS or notifications. This is especially useful when the
customer is physically close to the product and at a stage where he or she is ready
to make a decision.
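Picking the store nearest to the customer can be sketched with the haversine (great-circle) distance. The store names, coordinates and offers below are hypothetical, and a real app would obtain the customer’s position from the device’s location services rather than from a hard-coded tuple.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km: mean radius of the Earth


# Hypothetical stores with their locations and current offers.
stores = {
    "Store A": {"loc": (17.44, 78.35), "offer": "10% off smartwatches"},
    "Store B": {"loc": (12.97, 77.59), "offer": "free strap with every watch"},
    "Store C": {"loc": (19.07, 72.87), "offer": "cashback on accessories"},
}

customer = (17.385, 78.4867)  # hypothetical current position (lat, lon)


def nearest_store(customer, stores):
    """Return the name of the closest store and its distance in km."""
    name = min(stores, key=lambda s: haversine_km(*customer, *stores[s]["loc"]))
    return name, haversine_km(*customer, *stores[name]["loc"])


name, km = nearest_store(customer, stores)
print(f"Nearest: {name} ({km:.1f} km away), offer: {stores[name]['offer']}")
```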
It is also important for the app to be able to operate offline. For example, the user
could download the catalogue and browse through the offerings without relying on
Wi-Fi or the mobile signal.
Ultimately, full benefit of a mobile strategy can be extracted only when the
mobile channel is integrated with other channels. The customer should be able to
seamlessly move from his or her mobile to any other channel and reach the brand or
product or service.
Thus, the mobile strategy should be such that it provides enough customization
to leverage the advantages of a mobile or tablet device while integrating with other
channels so that the customer has a coherent experience across all channels.
The past decade has seen a phenomenal growth of social media which has changed
personal and professional lives of people all over the world. As networking through
social media grew, businesses started leveraging social media platforms to reach
out to customers directly to attract and retain them. Business organizations found
innovative ways to listen to customers’ voices through social media and better
understand their needs. At the same time, development of technologies provided
All the datasets, code, and other material referred to in this section are available at
www.allaboutanalytics.net.
• Data 21.1: adspends.csv
• Data 21.2: antiques_devices.csv
• Data 21.3: bid_qs_advt.csv
• Data 21.4: bid_qs_gmat_orgs.csv
• Data 21.5: factorial_experiment.csv
• Data 21.6: furnimart.csv
• Data 21.7: global_time.csv
• Data 21.8: indo_american.csv
• Data 21.9: membership_drive_isha.csv
• Data 21.10: modern_arts.csv
• Data 21.11: watches_sales.csv
• Data 21.12: webpage_visitors.csv
• Data 21.13: SMWA_Solutions.csv
• Code 21.1: factorial_experiment.R
Exercises
Ex. 21.1 Modern Arts (India) has initiated a special email campaign with three
different subject lines. These are as follows:
There are two response rates, namely, “open rate” and “click-through rate”. Test
to find out which subject line is the best with respect to each of the response rates.
Ex. 21.2 It was revealed that each replication was sent to a different mail account.
All the 7500 emails of Replication 1 were actually addressed to Gmail accounts.
Similarly, all mails of Replication 2 were sent to Outlook mail accounts. All mails
of Replication 3 were sent to Yahoo mail accounts. All mails of Replication 4 were
sent to AOL mail accounts. Given this information, Modern Arts (India) decided to
consider this as a completely randomized block design in order to look at the effect
of treatments and blocks.
Test if there is a significant block effect. Which is the best subject line? Carry out
the analysis for both the response rates.
Ex. 21.3 FurniMart is a hub for furniture enthusiasts to both sell and buy specially
designed furniture. FurniMart operates through its own website with an online
catalogue. The traffic to the website comes mainly from three sources, those who
type the URL and reach the website (direct), those who come through AdWords
and those who respond to display advertisements from social networks. It had been
their experience that many customers added products to their carts directly from
the catalogue pages. The existing design of the website displays a graphic to
facilitate adding products to the cart. Customers select a particular product and click
on the graphic button to add the item to their shopping carts. It was felt that a bigger
“call-to-action” (CtA) button is likely to lead to better conversion rates.
FurniMart decided to experiment with three different types of buttons. These are
displayed below:
Analyze the above data to identify which CtA button is the best for conversion.
Ex. 21.4 Consider the above data (Ex. 21.3) as a completely randomized block
design in order to look at the effect of treatments and blocks. Test if there is a
significant block effect. Which is the best treatment?
Ex. 21.5 Global Time is a local dealer for the “Ultimate” brand of smartwatches.
Whenever any potential customer searches for smartwatches, Ultimate bids along
with other smartwatch sellers. When any customer who is located within the
geographical area of Global Time clicks on Ultimate’s ad, the visitor is taken to
Global Time’s website using the geolocation feature. Global Time is trying to
revamp its website in order to improve its conversion rates. They have identified
three different aspects (treatments) of the website that they want to tweak. These
aspects are as follows:
(a) Currently, there is no video on the home page. The proposal is to add a 90-second
video showing the features of the “Ultimate” smartwatch.
(b) At present, the “Buy” button is at the right side of the web page, vertically
centred. The proposal is to shift it to the bottom right so that there is more space
for more visuals on the page.
(c) At present, the page displays testimonials in text form. The proposal is to include
a small photo of the customer who had given the testimonial.
Global Time decided to carry out A/B testing on these three aspects. The details
of the treatments and the corresponding conversion rates are given in the table
below:
What should “Global Time” do with respect to the three aspects? Are there any
interaction effects between these three aspects?
Ex. 21.6 Akshita started a search on Google for organizations which provide GMAT
training. A quick analysis of the relevant AdWords by Google found that there are
five advertisements that are available for display on the search results page. The bid
amounts as well as the quality scores are presented in the table below.
Calculate the ranks and identify the first two advertisers whose ads will be
displayed to Akshita. Also, calculate the CPC for each of the advertisers.
Ex. 21.7 Indo-American Consultancy Services (IACS) specializes in placing Indian
graduates with considerable work experience with clients in the USA. They advertise
their services with display ads on Twitter, LinkedIn and AdWords. When the
potential customers click on the display ad, they are taken to the company’s website,
and the customers are encouraged to register on the website and upload their CVs.
Once the potential customer uploads the CV, it is considered as conversion.
Based on the above data, carry out appropriate attribution to each of the three
channels.
Vishnuprasad Nagadevara
service centres and pharmacists. The members of ISHA benefit from steeply
discounted services from these private healthcare providers. The service providers
benefit from economies of scale through increased demand for their services from
the members of ISHA.
As a part of the agreement, members of ISHA receive a certain number of free
consultations from any doctor at major hospitals including dental consultations.
In addition, they also get two free dental cleaning and scaling at certain dental
clinics. The members are also eligible for two free “complete health check-ups” per
year. The participating pharmacists give a minimum discount of 20% on medicines
subject to minimum billed amount. The members also get discounts on various
diagnostic tests including radiology tests.
These benefits are available to the members of ISHA. The membership is
available on an individual as well as family basis. The family membership covers
a maximum of four members. Additional members can be added into the family
membership by paying an additional amount per person.
The economics of the entire model depends on acquiring a critical mass of
members. ISHA decided to take advantage of the increasing web access to push
its membership drive. They have initiated an email campaign to enrol members
with very limited success. They have also realized that campaigns in print media
cannot be targeted in a focussed manner leading to high campaign costs and low
conversion rates. Karthik finally decided to resort to web-based campaign using
display advertising with organic search as well as with AdWords. ISHA hired “Web
Analytics Services India Limited (WASIL)” that has expertise in creating, testing
and running web-based advertising campaigns.
WASIL is willing to work with ISHA on a result-based payment model. The fees
that are payable to WASIL will depend on the success of the campaign in terms of
acquiring the members. WASIL has put together a team of three members to create
and run the campaign. Rajeev is heading the team with Subbu and Krithika as the
other two members. The team decided to first design a set of objects that can be put
together into a display advertisement based on keywords in search strings.
“Blue is always associated with health, like it is with BlueCross”, said Subbu.
“We should have blue colour in the border. That will make it noticeable and will
definitely lead to better click-through rates”. Subbu and other members of his team
are discussing the changes that need to be made in the display advertisement for
ISHA. Currently, ISHA does not use any colour in its advertisement. It is very
simple with plain white background and bold font in black. It does not have a click
button either. The potential visitor can click anywhere on the ad, and it will take the
visitor to ISHA’s home page.
Rajeev agreed with Subbu that there should be a border with a colour. Rajeev
is passionate about green and feels that it gives a soothing feeling which could
be easily associated with health services. Krithika suggested that they can use two
different versions, one with blue and the other with green. Since all of them agreed
that a colour in the border is absolutely necessary, the A/B testing can be done with
the two colours.
“We need to have a CTA (Call-to-Action) button. Since we are already experimenting
with blue and green on the border, the CTA has to be red”, Krithika said.
“Here again, we have ample scope for experimentation. Should we try two different
colours?” asked Rajeev. The rest of the team members did not agree with different
colours. They felt that there are already two colours, one in the border and the other
on the CTA. The text will have a different colour, at least black. They felt that
having more than three colours in a display ad can make it look gaudy and can
also be jarring to the visitor. Rajeev said, “We can’t leave it as a button. We need
to put some ‘call-to-action text’ on the button, like ‘Click Here to Save’. It will
draw more attention. There are enough studies to show that an associated text will
always reinforce and yield better results”. Subbu felt that they should add some more
information into the call-to-action text. “Putting a number such as ‘Save up to 70%’
will get better response”. Rajeev was not convinced. He felt that highlighting such
a large percentage saving might possibly make people suspicious of the benefits.
Many people think that such large discounts are not possible in healthcare, even
though the labs do offer 70% discount on specified lab tests. After a prolonged
discussion, the team decided to try out both the versions as another treatment in
A/B testing.
The present design of ISHA’s ad is that the visitors, when they click on the ad, are
taken to the home page. It is expected that the visitor will first look at the home page
and navigate from there to other relevant pages. The team felt that it is not enough
to make the visitor click on the ad and reach the website. The real conversion is
when the visitor becomes a paid member, at least on trial basis. Any member can
withdraw his or her membership within 15 days from the registration and get full
refund. The team felt that the landing page will have a major impact on conversion.
The team members also believed that the landing page should be dynamic. It should
be based on the search string or the AdWords leading to the display ad. If the visitors
are looking for lab tests, the landing page should accordingly be the one with lab
tests, with a listing of laboratories located in a particular town and the corresponding
discounts. On the other hand, if the visitor was searching for physician consultation,
the landing page should correspond to physicians or respective clinics. Finally, the
team agreed that the landing page will also be treated as a treatment with home page
being the control.
The team decided to use one more treatment in their experimentation. It
was understood that ISHA is a new player with a relatively new concept. The
organization is not yet an established one. The team members as well as Karthik
felt that putting ISHA’s logo on the display ad will increase the visibility. The top
line of the ad will show either ISHA without logo or ISHA with logo on the right
side. The general layout of the ad is shown in Fig. 21.6.
The team concluded that they should go ahead with the four treatments. The team
summarized the final proposal as shown in the table below.
WASIL and Karthik approved the team’s proposal, and WASIL ran the campaign
with the above four treatments over a period of 1 month. Data was collected in
terms of number of potential visitors exposed to each type of advertisement (sample
Treatment details
Treatment       Level 1 (−1)            Level 2 (+1)
Border colour   Blue                    Green
Top line        Without logo            With logo
CTA             “Click Here to Save”    “Save up to 70%”
Landing page    Home page               Relevant page
size), number of visitors who clicked on the ad (clicks) and, finally, the number of
conversions (members). The dataset “membership_drive_isha.csv” is available on
the book’s website.
Questions
(a) Which of the four treatments is more effective? What is the right level for each
of the treatments?
(b) Are there any interaction effects between the treatments?
(c) What would be the final design for the ad in order to maximize the conversions?
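One possible way to approach questions (a) and (b) is to compute main effects and
two-factor interactions directly from the coded (−1/+1) levels of the four
treatments. The sketch below is only an illustration: the conversion rates are
made-up numbers generated inside the code, not values from
“membership_drive_isha.csv”, and the factor labels are assumed names.

```python
from itertools import combinations, product

# Assumed labels for the four treatments, each coded -1 or +1.
FACTORS = ["border", "top_line", "cta", "landing"]

def effect(runs, columns):
    """Mean response where the product of the chosen coded columns is +1,
    minus the mean response where that product is -1."""
    high, low = [], []
    for levels, response in runs:
        sign = 1
        for c in columns:
            sign *= levels[c]
        (high if sign > 0 else low).append(response)
    return sum(high) / len(high) - sum(low) / len(low)

def made_up_rate(levels):
    # Hypothetical conversion rate, for illustration only.
    border, top_line, cta, landing = levels
    return (0.020 + 0.002 * top_line + 0.003 * cta
            + 0.005 * landing + 0.001 * cta * landing)

# Full 2^4 factorial: all 16 combinations of the coded levels.
runs = [(lv, made_up_rate(lv)) for lv in product((-1, 1), repeat=4)]

main_effects = {FACTORS[i]: effect(runs, [i]) for i in range(4)}
interactions = {
    (FACTORS[i], FACTORS[j]): effect(runs, [i, j])
    for i, j in combinations(range(4), 2)
}
```

With real data, the response for each of the 16 ad versions would be an observed
rate (for example, members divided by sample size), so the effects carry sampling
noise and would need a significance test before choosing levels. A large
interaction also means the best level of one treatment may depend on the level of
another.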
Vishnuprasad Nagadevara
The Meeting
“I think it is Wanamaker who said ‘Half the money I spend on advertising is wasted;
the trouble is I don’t know which half’. I can’t accept that today. Our advertising
cost is almost 12% of our revenue, and I need to know where it is going”, said
Yajvin. “I mean we know exactly where we are spending the money, but I need to
know what we are getting back. What is the effect of each of the channels that we
are investing in?”
Yajvin is briefing Swetha on her new assignment. Yajvin is the CEO of the
online company “Antiques in Modernity (AiM)”. Antiques in Modernity specializes
in modern electronic devices which have the look of antiques. Swetha heads
the analytics consulting company “NS Analytics Unlimited”, which provides
consultancy services to various companies. In addition to providing consultancy
services, NS Analytics also provides training to the client companies so that they
can become self-sufficient as much as possible in terms of analytics. Their motto is
“We make ourselves redundant by building expertise”. NS Analytics has been hired
by Antiques in Modernity to analyze their advertising spend and also advise them
on how best to improve the ROI on their advertising investment.
“Our products are very special. Look at this turntable. It looks like an antique,
and you can actually play a Beatles’ gramophone record on this. But it can be used
as a Bluetooth speaker; it can connect to your home assistant such as Amazon Echo
or Google Home; it can even connect to your mobile. You can stream music from
this turntable to any other device even in your backyard!” said Yajvin (Fig. 21.7).
“Similar is the case with our antique wall telephone. It can be used as a wired
telephone or as a cordless phone. We are in the process of redesigning it so that you
can carry the receiver outside your house and use it as a telephone. But let us come
back to our problem. As I said, we invest a lot of money in advertising in different
channels. We need to find out the effect of each of these channels. I do understand
that many of these channels do complement each other in today’s markets. Can we
somehow isolate the effects of each, so that our ad spend can be optimized?” asked
Yajvin.
Swetha responded saying that there are many models that can be used in order
to address the problem, but such models require large amounts of reliable data.
She also said that each of these models can give different results, and one needs to
understand the assumptions involved in each of these models so that the one which
is most applicable to a particular scenario can be picked. Yajvin put her in touch
with his chief information officer, Skanda, so that Swetha can get a feel for the type
of data that is available with the company and also explain her data requirements.
Skanda explained the data available with them. “We mainly depend on online
advertisement. We do invest a small amount in the print media, but most of our
advertising is on the social networking websites, AdWords and the like. We also get
a good amount of direct traffic to our website. Since all our sales are made online
only, it makes sense for us to work this way”, said Skanda. “We do have a system
of tracking our potential customers through different channels. We try to collect as
much data, reliably, as possible.”
Antiques in Modernity
The fact that both of them are comfortable with web-based marketing had a major
role to play in making the decision. They had also decided to use as much of online
advertising as possible.
AiM uses display ads on social networks, especially LinkedIn, Twitter and
Facebook. They also use Google AdWords in order to display their ads based on the
search strings used by potential customers. They also keep track of customers who
reach their site through organic search. Since their products are sold only online,
through their own website, the final conversion is when the customer places the
order. They use various methods to trace the channels from which customers reach
their website.
The Data
During their meeting, Skanda promised Swetha that he could provide details of each
and every funnel, starting from a customer’s (or potential customer’s) first visit
to their website, along with the referrer. Swetha felt that there was no reason to
look at incomplete funnels, which may or may not lead to conversion at some later
date. She requested data only on funnels which resulted in final conversion.
There are also many repeat customers who type the URL directly and reach the
website and make purchases. Similarly, there are some who purchase items on their
very first visit to the website. Skanda told her that he can provide details of each
funnel corresponding to each and every conversion. He felt that such detailed data
could be useful because AiM sells different products and the profit margins are
different for different products. On the other hand, Swetha felt that such detail was
not necessary because the advertisements are not specific to any particular product.
Even the display ads, which AiM puts together on the fly based on AdWords
or search strings, are not product specific. The main theme of these ads is that their
products are the latest in technology, but packaged as antiques. They are not really
antiques either. Hence, she suggested that Skanda summarize the data “funnel-wise”.
She also suggested that all the social network channels can be combined into
one channel for the purpose of initial analysis. “We can drill down into different
social networks separately at a later date. As a matter of fact, you will be able to do
it yourself after we train your people”, she said.
Finally, they agreed to concentrate on four channels: social networks (Channel A),
AdWords (Channel B), organic search (Channel C) and direct (Channel D).
It was also decided to maintain the actual order of channels within each funnel. Each
funnel is to be read as the sequence of the channels. For example, ABCD implies
A→B→C→D. Swetha explained that the order becomes important for estimating
the contribution of each channel under different models. Then there was a question
of the final metric. Should the final metric for conversion be revenue, profit margin
or just the number of items sold? AiM is currently going through a major costing
exercise, especially in terms of assigning the fixed/non-variable costs to different
products. It was felt that the option of profit margin is not appropriate until the
costing exercise is completed. Skanda and Yajvin felt that the initial exercise can be
made based on the sales quantity (number of items sold) and the method can easily
be extended to revenue at a later date. Swetha assured them that they will just have to
change the values in a simple spreadsheet and everything else will get recalculated
automatically!
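To see why the order of channels within a funnel matters, consider three simple
attribution rules: first touch, last touch, and linear (equal credit to every
touch). The sketch below is a minimal illustration under stated assumptions: the
funnels and sales quantities are hypothetical, the layout of
“antiques_devices.csv” is not assumed, and these three rules are common textbook
choices rather than necessarily the models used in the case.

```python
from collections import defaultdict

# Hypothetical funnel -> number of items sold (illustrative values only).
funnels = {"ABCD": 10, "AD": 6, "B": 4, "CBD": 8}

def attribute(funnels, model):
    """Split the items sold in each funnel across channels by the given rule."""
    credit = defaultdict(float)
    for path, sold in funnels.items():
        if model == "first":      # all credit to the first channel
            credit[path[0]] += sold
        elif model == "last":     # all credit to the last channel
            credit[path[-1]] += sold
        elif model == "linear":   # equal credit to every touch in the path
            share = sold / len(path)
            for channel in path:
                credit[channel] += share
        else:
            raise ValueError(f"unknown model: {model}")
    return dict(credit)

first_touch = attribute(funnels, "first")
last_touch = attribute(funnels, "last")
linear = attribute(funnels, "linear")
```

Every rule distributes the same total number of items sold, but the split across
channels differs sharply, which is why the assumptions behind each model need to
be understood before one is chosen. Replacing the quantities with revenue per
funnel would extend the same calculation to a revenue metric.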
Swetha received the summarized data as required by her 2 days after her meeting
with Yajvin and Skanda. The data “antiques_devices.csv” is available on the book’s
website.
Further Readings
Abhishek, V., Despotakis, S., & Ravi, R. (2017). Multi-channel attribution: The blind
spot of online advertising. Retrieved March 16, 2018, from https://papers.ssrn.com/sol3/
papers.cfm?abstract_id=2959778.
Fisher, T. (2018). ROI in social media: A look at the arguments. Database Marketing & Customer
Strategy Management, 16(3), 189–195.
Tuten, T. L., & Solomon, M. R. Social media marketing. Sage Publishing.
Ganis, M., & Kohirkar, A. (2016). Social media analytics. New York, NY: IBM Press.
Gardner, J., & Lehnert, K. (2016). What’s new about new media? How multi-channel networks
work with content creators. Business Horizons, 59, 293–302.
Hawn, C. (2017). Take two aspirin and tweet me in the morning: How twitter, facebook, and other
social media are reshaping health care. Health Affairs, 28(2), 361.
Kannan, P. K., Reinartz, W., & Verhoef, P. C. (2016). The path to purchase and attribution
modeling: Introduction to special section. International Journal of Research in Marketing, 33,
449–456.
Ledolter, J., & Swersey, A. J. (2007). Testing 1 - 2 - 3: Experimental design with applications in
marketing and service operations. Palo Alto, CA: Stanford University Press.
Oh, C., Roumani, Y., Nwankpa, J. K., & Hu, H.-F. (2017). Beyond likes and tweets: Consumer
engagement behavior and movie box office in social media. Information & Management, 54(1),
25–37.
Ribarsky, W., Wang, D. X., & Dou, W. (2014). Social media analytics for competitive
advantage. Computers & Graphics, 38, 328–331.
Zafarani, R., Abbasi, M. A., & Liu, H. (2014). Social media mining. Cambridge: Cambridge
University Press.
Chapter 22
Healthcare Analytics
Ancient understanding of biology, physiology, and medicine was built upon
observations of how the body reacted to external stimuli. This indirect approach of
documenting and studying the body’s reactions was available long before the body’s
internal mechanisms were understood. While medical advances since that time
have been truly astounding, nothing has changed the central fact that the study of
medicine and the related study of healthcare must begin with careful observation,
followed by the collection, consideration, and analysis of the data drawn from those
observations. This age-old approach remains the key to current scientific method
and practice.
and interpretation of data, the new technologies have spawned a number of new
applications.1 For example, data analysis allows earlier detection of epidemics,2
identification of molecules (which will play an unprecedented role in the fight
against cancer3), and new methods to evaluate the efficacy of vaccination
programs.4
While the capacity of these tools to increase efficiency and effectiveness seems
limitless, their applications must account for their limitations as well as their power.
Using modern tools of analytics to improve medicine and care delivery requires a
sound, comprehensive understanding of the tools’ strengths and their constraints.
To highlight the power and issues related to the use of these tools, the authors
of this book describe several applications, including telemedicine, modeling the
physiology of the human body, healthcare operations, epidemiology, and analyzing
patterns to help insurance providers.
One problem area that big data techniques are expected to revolutionize in
the near future involves the geographical separation between the patient and the
caregiver. Historically, diagnosing illness has required medical professionals to
assess the condition of their patients face-to-face. Understanding various aspects
about the body that help doctors diagnose and prescribe a treatment often requires
the transmission of information that is subtle and variable. Hearing the rhythm of a
heart, assessing the degradation in a patient’s sense of balance, or seeing nuances in
a change in the appearance of a wound are thought to require direct human contact.
Whether enough of the pertinent data can be transmitted in other ways is a key
question that many researchers are working to answer.
The situation is rapidly changing due to the practice of telemedicine. Market
research firm Mordor Intelligence expects telemedicine, already a burgeoning
market, to grow to 66.6 billion USD by 2021, growing at a compound annual
growth rate of 18.8% between 2017 and 2022.5 New wearable technologies can
assist caregivers by collecting data over spans of time much greater than an office
visit or hospital stay in a wide variety of settings. Algorithms can use this data to
suggest alternate courses of action while ensuring that new or unreported symptoms
are not missed. Wearable technologies such as a Fitbit or Apple Watch are able to
continuously track various health-related factors like heart rate, body temperature,
and blood pressure with ease. This information can be transmitted to medical
personnel in real time. For example, military institutions use chest-mounted sensors
to determine the points at which soldiers reach fatigue and can suggest tactical
options based on this information.
1 The article in Forbes of October 2016 provided many of the data in this introduction:
https://www.forbes.com/sites/mikemontgomery/2016/10/26/the-future-of-health-care-is-in-data-analytics/#61208ab33ee2
(accessed on Aug 19, 2017).
2 https://malariajournal.biomedcentral.com/articles/10.1186/s12936-017-1728-9 (accessed on Aug
20, 2017).
3 http://cancerres.aacrjournals.org/content/75/15_Supplement/3688.short (accessed on Aug 20,
2017).
4 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4287086 (accessed on Aug 20, 2017).
5 https://www.mordorintelligence.com/industry-reports/global-telemedicine-market-industry
While some wearable technologies are on the cutting edge and are thus often
expensive, telemedicine can use cheap, sturdy hardware to make diagnoses easier.
Electronic kits such as the Swasthya Slate,6 which is used in community clinics in
New Delhi, can be used by doctors to conduct blood sugar tests and
electrocardiograms and monitor a patient’s temperature and blood pressure.7 In Kigali, Rwanda,
digital health company Babylon is testing a service that will allow patients to video-
call doctors rather than wait in lines at hospitals. In economies which suffer from
large burdens on healthcare and a scarcity of trained professionals, interventions
such as these can help save time, money, and lives.
The proper application of such technologies can prevent the wearer’s lack of
expertise from clouding data collection or transmission. Normally, doctors learn
about physical symptoms along with the patient’s experiences via face-to-face
interactions. This adds a gap between the experience and its discussion, as well
as the subjective interpretation of the patient. Its effectiveness also depends on the
patient’s ability to relay the information accurately. Direct recording of data can
bridge these gaps, ensuring that the doctor receives objective information while also
understanding the patient’s specific circumstances. This information transmission
can be combined with additional elements including web-cameras and online voice
calling software. This allows doctors and patients to remain in contact regarding
diagnoses without the need for the patient to physically travel to a hospital or office.
Thus, new metrics become possible, accuracy is increased, and time is saved, while
costs are reduced. Additionally, such solutions may help provide proactive care in
case of a medical emergency.
The benefits of data analytics are not limited to diagnosis. Data analytics
also helps leverage technology to ensure that patients receive diagnoses in a
timely fashion and that treatment and follow-up interactions are scheduled as needed.
Analytics already plays a key role in scheduling appointments, acquiring medicines,
and ensuring that patients do not forget to take their medications.
The advantage of using big data techniques is not limited to the transmission
of data used for diagnosis. Data analysis is key to understanding the fundamental
mechanisms of the body’s functions. Even in cases where the physical function
of the body is well understood, big data can help researchers analyze the myriad
ways in which each individual reacts to stimuli and treatment. This can lead to
more customized treatment and a decrease in side effects. By analyzing specific
interactions between drugs and the body, data analytics can help fine-tune dosages,
reduce side effects, and adjust prescriptions on a case-by-case basis. Geno Germano,
former group president of Pfizer’s Global Innovative Pharma Business, said in 2015
that doctors might (in the near future) use data about patients’ DNA in order to come
up with personalized, specific treatment and health advice that could save time and
ensure better outcomes.8
6 https://www.thebetterindia.com/49931/swasthya-slate-kanav-kahol-delhi-diagnostic-tests/
The ability to analyze multiple streams of related data in real time can be applied
to create simulations of the human body. Using such simulations allows researchers
to conduct experiments and gather information virtually, painlessly, and at low cost.
In constructing artificial models of all or some parts of the body, big data techniques
can harness computational power to analyze different treatments. Initiatives such as
the Virtual Physiological Human Institute9 aim to bring together diverse modeling
techniques and approaches in order to gain a better, more holistic understanding of
the body and in turn drive analysis and innovation.
Analytics is being used in the study and improvement of healthcare operations to
enhance patient welfare, increase access to care, and eliminate waste. For example,
by analyzing patient wait times and behavior, data scientists can suggest policies that
reduce the load on doctors, free up valuable resources, and ensure more patients
get the care they need when and where they need it. Simulation techniques that
predict patient traffic can help emergency rooms prepare for an increased number
of visits,10 while systems that track when and where patients are admitted
make it easier for nurses and administrators to allocate beds to new patients.11
Modern technologies can also help in the provision of follow-up care, that is,
after the patient has left the hospital. Software and hardware that track important
physical symptoms can notice deviation patterns and alert patients and caregivers.
By matching such patterns to patient histories, they can suggest solutions and
identify complications. By reminding patients regarding follow-up appointments,
they can reduce rehospitalization.
Analytics is also needed to guide the use of information technologies related
to updating patient records and coordinating care among providers across time or
locations. Technologies like Dictaphones and digital diaries are aimed at collecting
and preserving patient data in convenient ways. Careful analysis of this data is key
when working to use these technologies to reduce redundant efforts and eliminate
misunderstandings when care is handed from one provider to another.
There are many applications of analytics related to the detection of health hazards
and the spread of disease: Big data methods help insurers isolate trends in illness and
behavior, enabling them to better match risk premiums to an individual buyer’s risk
profile. For instance, a US-based health insurance provider12 offers Nest Protect,13
a smoke alarm and carbon monoxide monitor, to its customers and also provides
a discount on insurance premiums if they install these devices. Insurers use the
data generated from these devices to determine the premium and also in predicting
claims.
8 https://www.forbes.com/sites/matthewherper/2015/02/17/how-pfizer-is-using-big-data-to-
Information provider LexisNexis tracks socioeconomic variables14 in order to
predict how and when populations will fall sick. Ogi Asparouhov, the chief data
scientist at LexisNexis, suggests that socioeconomic lifestyle, consumer employment,
and social media data can add much value to the healthcare industry.
The use of Google Trends data, that is, Internet search history, in healthcare
research increased sevenfold between 2009 and 2013. This research involves a wide
variety of study designs including causal analysis, new descriptive statistics, and
methods of surveillance (Nuti et al. 2014). Google Brain,15 a research project by
Google, is using machine learning techniques to predict health outcomes from a
patient’s medical data.16 The tracking of weather patterns and their connection to
epidemics of flu and cold is well documented. The World Health Organization’s
program Atlas of Health and Climate17 is one such example of collaboration
between the meteorological and public health communities.
By gathering diverse kinds of data and using powerful analytical tools, insurers
can better predict fraud, determine appropriate courses of action, and regulate
payment procedures. A comprehensive case study (Ideal Insurance) is included in
Chap. 25 that describes how analytics can be used to create rules for classifying
claims into those that can be settled immediately, those that need further discussion,
and those that need to be investigated by an external agency.
These techniques, however, are not without their challenges. The heterogeneity
of data in healthcare and privacy concerns have historically been significant
stumbling blocks in the industry. Different doctors and nurses may record identical
data in different ways, making analysis more difficult. Extracting data from sensors
such as X-ray and ultrasound scans and MRI machines remains a continuing
technical challenge, because the quality of these sensors can vary wildly.18
Big data techniques in healthcare also often rely on real-time data, which places
pressure on information technology systems to deliver data quickly and reliably.
Despite these challenges, big data techniques are expected to be a key driver of
technological change and innovation in this sector in the decades to come. The rest
of this chapter will discuss in detail the use of data and simulation techniques in
academic medical centers (AMCs) to improve patient flow.
Demands for increased capacity and reduced costs in outpatient settings create the
need for a coherent strategy on how to collect, analyze, and use data to facilitate
process improvements. Specifically, this note focuses on system performance related to
patient flows in outpatient clinics in academic medical centers that schedule patients
by appointments. We describe ways to map these visits as we map processes,
collect data to formally describe the systems, create discrete event simulations
(DESs) of these systems, use the simulations as a virtual lab to explore possible
system improvements, and identify proposals as candidates for implementation. We
close with a discussion of several projects in which we have used our approach to
understand and improve these complex systems.
2.1 Introduction
As of 2016, the Affordable Care Act (ACA) extended access to health insurance
coverage to roughly 30 million previously uninsured Americans, and that coverage
expansion is linked to between 15 and 26 million additional primary care visits
annually (Glied and Ma 2015; Beronio et al. 2014). In addition, the number of
people 65 and older in the USA is expected to grow from 43.1 million in 2012
to 83.7 million by 2050 (Ortman et al. 2014). This jump in the number of insured
Americans coupled with the anticipated growth in the size of the population above
the age of 65 will correlate with rising demand for healthcare services.
At the same time, Medicare and other payers are moving away from the older
“fee-for-service” model toward “bundled payment” schemes (Cutler and Ghosh
2012). Under these arrangements, providers are paid a lump sum to treat a patient
or population of patients. This fixes patient-related revenue and means that these
payments can only be applied to fixed costs if variable costs are less than the
payment. We expect the continued emergence of bundled payment schemes to
accelerate the gradual move away from inpatient treatment to the delivery of
care through outpatient settings that has been taking place for over 20 years.
Consequently, a disproportionate share of the growth in demand will be processed
through outpatient clinics, as opposed to hospital beds. This evolution is also seen
as one of the key strategies needed to help bring healthcare costs in the USA closer
to the costs experienced in other developed countries (Lorenzoni et al. 2014).
To make the remainder of our discussion more concrete, let us introduce a
representative unit of analysis. Data associated with this unit will be taken from a composite
of clinics that we have studied, but is not meant to be a complete representation of
772 M. (Mac) Dada and C. Chambers
any particular unit. Consider a patient with an appointment to see the attending at
a clinic within an AMC. We will work with a discrete event simulation (DES) of
this process. DES is the approach of creating a mathematical model of the flows and
activities present in a system and using this model to perform virtual experiments
seeking to find ways to improve measurable performance (Benneyan 1997; Clymer
2009; Hamrock et al. 2013; Jun et al. 1999). A screenshot from such a DES is
presented in Fig. 22.1 and will double as a simplified process map. By simplified,
we mean that several of the blocks shown in the figure actually envelop multiple
blocks that handle details of the model. Versions of this and similar models along
with exercises focused on their analysis and use are linked to this chapter.
Note that the figure also contains a sample of model inputs and outputs from the
simulation itself. We will discuss several of these metrics shortly.
In this depiction, a block creates work units (patients) according to an
appointment schedule. The block labeled “Arrival” combines these appointment times with
a random variable reflecting patient unpunctuality to get actual arrival times. Once
created, the patients move to Step 1. Just above Step 1, we show a block serving
as a queue just in case the resources at Step 1 are busy. In Step 1, the patient
interacts with staff at the front desk. We will label this step “Registration” with
the understanding that it may include data collection and perhaps some patient
education. In Step 2, a nurse leads the patient into an examination room, collects
data on vital signs, and asks a few questions about the patient’s condition. We will
label this step “Vitals.” In Step 3, a trainee reviews the patient record and enters
the examination room to interact with the patient. We label this step “Trainee.” In
Step 4, the trainee leaves the exam room and interacts with the attending. We label
this step “Teach.” During this time, the trainee may present case information to
the attending, and the pair discusses next steps, possible issues, and the need for
additional information. In Step 5, the trainee and attending both enter the exam
room and interact with the patient. We label this step “Attending.” Following this
step, the trainee, attending, and room are “released,” meaning that they are free to be
assigned to the next patient. Finally, the patient returns to the front desk for “Check
Out.” This step may include collection of payment and making an appointment for
a future visit.
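The flow just described can be sketched in a few lines of code. The version below
is a deliberately simplified illustration and not the model shown in Fig. 22.1: it
assumes each step has its own single server working first-come, first-served
(whereas the text has the trainee, attending, and room held across several steps
and the front desk used twice), it assumes normally distributed unpunctuality, and
the step durations are hypothetical uniform draws around made-up means.

```python
import random

STEPS = ["Registration", "Vitals", "Trainee", "Teach", "Attending", "Check Out"]

# Hypothetical mean durations in minutes (illustrative values only).
MEAN_MINUTES = {"Registration": 3, "Vitals": 5, "Trainee": 12,
                "Teach": 4, "Attending": 8, "Check Out": 3}

def simulate_clinic(appointments, rng, unpunctuality_sd=5.0):
    """Push patients through the steps in order of actual arrival time.

    Each step is treated as one first-come, first-served server, so a patient
    starts a step when both the patient and that step are free.
    """
    # Actual arrival = appointment time + random unpunctuality (assumed normal).
    arrivals = sorted((appt + rng.gauss(0.0, unpunctuality_sd), appt)
                      for appt in appointments)
    free_at = {step: float("-inf") for step in STEPS}
    records = []
    for arrival, appt in arrivals:
        clock, waited = arrival, 0.0
        for step in STEPS:
            start = max(clock, free_at[step])   # queue if the step is busy
            waited += start - clock
            mean = MEAN_MINUTES[step]
            clock = start + rng.uniform(0.5 * mean, 1.5 * mean)  # placeholder
            free_at[step] = clock
        records.append({"appointment": appt, "arrival": arrival,
                        "departure": clock, "wait": waited})
    return records

# Example run: five appointments, 15 minutes apart, with a fixed seed.
records = simulate_clinic([0, 15, 30, 45, 60], random.Random(1))
cycle_times = [r["departure"] - r["arrival"] for r in records]
```

Because patients cannot overtake one another in this simplified line, processing
them one at a time in arrival order gives the same result as a full event
calendar. A model with shared resources, such as one attending serving several
trainees, would need a proper event-driven simulation.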
In order to manage this system, we need an understanding of its behavior. This
behavior will be reflected in quantifiable metrics such as cycle times, wait times, and
how long it will take to complete the appointment schedule (makespan). Note that
cycle times may be calculated based on appointment times or patient arrival times.
Both of these values are included among the model outputs shown here. While this
model is fairly simple, some important questions may be addressed with its use. For
example, we may make different assumptions regarding the attending’s processing
time and note how this changes the selected output values. This is done by altering
the parameters labeled “Att. Time Parameters” among the model inputs. For this
illustration, we assume that these times are drawn from a log-normal distribution
and the user is free to change the mean and standard deviation of that distribution.
However, one benefit of simulation is that we may use a different distribution or
sample directly from collected activity time data. We will discuss these issues later.
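When activity times are specified by the mean and standard deviation of a
log-normal distribution, as in this illustration, those two numbers must be
converted to the parameters of the underlying normal distribution before sampling.
The sketch below shows that conversion using Python's standard library. The
function names and the example values of 8 and 4 minutes are assumptions made for
illustration, and the resampling helper shows the alternative of drawing directly
from collected activity times.

```python
import math
import random

def lognormal_params(mean, sd):
    """Convert the desired mean and standard deviation of a log-normal
    variable into the (mu, sigma) of the underlying normal distribution."""
    sigma_sq = math.log(1.0 + (sd / mean) ** 2)
    mu = math.log(mean) - sigma_sq / 2.0
    return mu, math.sqrt(sigma_sq)

def sample_lognormal_time(mean, sd, rng):
    """Draw one activity time with the requested mean and standard deviation."""
    mu, sigma = lognormal_params(mean, sd)
    return rng.lognormvariate(mu, sigma)

def sample_empirical_time(observed_times, rng):
    """Alternative: resample directly from collected activity times."""
    return rng.choice(observed_times)

# Example: hypothetical attending times with mean 8 and sd 4 minutes.
rng = random.Random(42)
draws = [sample_lognormal_time(8.0, 4.0, rng) for _ in range(100_000)]
sample_mean = sum(draws) / len(draws)
```

Passing the mean and standard deviation straight into `lognormvariate` without
this conversion is a common mistake, since that function expects the parameters
of the logarithm of the variable rather than of the variable itself.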
This model can also be used as part of a more holistic approach to address more
subtle questions, including how the added educational mission affects output metrics
and what is the best appointment schedule for this system. In the next section, we lay
out a more complete approach to handling more complex questions such as these.
appointment scheduling (Cayirli et al. 2006), nurse rostering problems (Burke et al.
2004), resource allocation problems (Chao et al. 2003), capacity planning (Bowers
and Mould 2005), and routing problems (Mandelbaum et al. 2012).
Given this confluence of approaches and needs, it seems natural for those
working to improve healthcare processes to employ OR techniques such as DES to
conduct controlled, virtual experiments as part of the improvement process.
However, when one looks more closely, one finds that the history of implementations
of results based on OR findings in AMCs is actually quite poor. For example, a
review of over 200 papers that used DES in healthcare settings identified only four
that even claimed that physician behavior was changed as a result (Wilson 1981).
A more recent review found only one instance of a publication which included a
documented change in clinic performance resulting from a simulation-motivated
intervention (van Lent et al. 2012).
This raises a major question: Since there is clearly an active interest in using
DES models to improve patient flow and there is ample talent working to make it
happen, what can we do to make use of this technique in a way that results in real
change in clinic performance? Virtually any operations management textbook will
provide a list of factors needed to succeed in process improvement projects such as
getting all stakeholders involved early, identifying a project champion, setting clear
goals, dedicating necessary resources, etc. (Trusko et al. 2007). However, we want
to focus this discussion on two additional elements that are a bit subtler and, in our
experience, often spell the difference between success and failure when working in
outpatient clinics in the AMC.
First, finding an important problem is not sufficient. It is critically important
to think in terms of finding the right question which also addresses the underlying
problem. As outside agents or consultants, we are not in a position to pay faculty and
staff extra money to implement changes to improve the system. We need a different
form of payment to motivate their participation. One great advantage in the AMC
model is that we can leverage the fact that physicians are also dedicated researchers.
Thus, we can use the promise of publications in lieu of a cash payment to induce
participation.
Second, we need to find the right combination of techniques. Experiments and
data collection resonate with medical researchers. However, the translation from
“lab” to “clinic” is fraught with confounding factors outside of the physician’s
control. On the other hand, OR techniques can isolate a single variable or factor,
but modeling by itself does not improve a system, and mathematical presentations
that feel completely abstract do not resonate with practitioners. The unique aspect
of our approach is to combine OR tools with “clinical” experiments. This allows
clinicians to project themselves into the model in a way that is more salient than the
underlying equations could ever be. The key idea is that value exists in finding a way
to merge the tools of OR with the methodologies of medical research to generate
useful findings that will actually be implemented to improve clinic flow.
We have also used a fourth approach to data collection. Many hospitals and
clinics are equipped with real-time location systems (RTLS). Large AMCs are
often designed to include this capability because tracking devices and equipment
across hundreds of thousands of square feet of floor space is simply not practical
without some technological assistance. Installations of these systems typically
involve placing sensors in the ceilings or floors of the relevant spaces. These sensors
pick up signals from transmitters that can be embedded within “tags” or “badges”
worn by items or people being tracked. Each sensor records when a tag comes
within range and again when it leaves that area. When unique tag numbers are
given to each caregiver, detailed reports can be generated at the end of each day
showing when a person or piece of equipment moved from one location to another.
This approach offers several dramatic advantages. It does not interfere with the care
delivery process, the marginal cost of using it is virtually 0, and since these systems
are always running, the observation periods can begin and end as needed.
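The pairing of entry and exit records described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the export format of any particular RTLS product: the event tuple layout, the tag and room names, and the function name dwell_times are all hypothetical.

```python
def dwell_times(events):
    """Pair each 'enter' with the next 'leave' for the same tag and location.

    events: iterable of (minute, tag_id, location, kind), kind is 'enter' or 'leave'.
    Returns a list of (tag_id, location, enter_minute, minutes_in_range).
    """
    open_visits = {}  # (tag, location) -> minute the tag came within range
    visits = []
    for minute, tag, location, kind in sorted(events):
        key = (tag, location)
        if kind == "enter":
            open_visits[key] = minute
        elif kind == "leave" and key in open_visits:
            start = open_visits.pop(key)
            visits.append((tag, location, start, minute - start))
    return visits

# Hypothetical end-of-day log for one examination room.
log = [
    (0, "RN-07", "exam_1", "enter"),
    (5, "RN-07", "exam_1", "leave"),
    (20, "MD-02", "exam_1", "enter"),
    (40, "MD-02", "exam_1", "leave"),
]
visits = dwell_times(log)
for tag, location, start, minutes in visits:
    print(f"{tag} was in {location} from minute {start} for {minutes} min")
```

A "leave" record with no matching "enter" is silently ignored here; an audit of the kind recommended below would instead flag such records.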
In closing, we should highlight three key factors in the data collection process:
(1) data collection needs to be done in a way that does not interfere with care
delivery; (2) audits of the data collection system are needed to ensure accuracy; and
(3) a sufficient time span must be covered to eliminate any effects of the “novelty” of
the data collection and its subsequent impact on agent behaviors.
Step 3: Create a DES of the System
We have often found it useful to create DES models of the systems under study
as early in the process as possible. This can be a costly process in that a great deal
of data collection is required and model construction can be a nontrivial expense.
Other tools such as process mapping and queueing theory can be applied with much
less effort (Kolker 2010). However, we have repeatedly found that these tools are
insufficient for the analysis that is needed. Because the variances involved in activity
times can be extremely high in healthcare, distributions of the metrics of interest
are important findings. Consequently, basic process analysis is rarely sufficient and
often misleading.
Queueing models do a much better job of conveying the significance of variability.
However, many common assumptions of these models are routinely violated
in clinic settings: some processing times are not exponentially distributed,
processing times often do not come from the same distribution, and,
when arrivals are based on appointments, inter-arrival times are not exponentially
distributed.
However, none of these issues poses the largest challenge to applying simple
process analysis or queueing models in outpatient clinics. Consider two additional
issues. First, the basic results of process analysis or queueing models are only
averages that hold in steady state. A clinic does not start the day in steady
state; it begins empty, and it takes some time to reach steady state. Indeed,
if one plots average wait times for a clinic over time, one quickly sees that it may
take dozens or even hundreds of cases for the system to reach steady state. Clearly,
a clinic with one physician is not going to schedule hundreds of patients for that
resource in a single session. Thus, steady-state results are often not informative.
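The gap between a steady-state average and what a short session experiences can be illustrated with a small simulation. The sketch below is a toy under stated assumptions, not a model of any clinic discussed here. It uses a single server with exponential inter-arrival and service times (the very assumptions noted above as often violated) only because the steady-state mean wait is then known in closed form. It applies the standard Lindley recursion: each patient's wait is the previous patient's wait plus that patient's service time minus the gap until the next arrival, floored at zero. The rates, session counts, and function name are hypothetical.

```python
import random

def mean_wait_by_position(n_patients, n_sessions, arrival_rate, service_rate, seed=0):
    """Average wait of the k-th patient over many sessions that start empty."""
    rng = random.Random(seed)
    totals = [0.0] * n_patients
    for _ in range(n_sessions):
        wait = 0.0  # the first patient of a session finds an empty clinic
        for k in range(n_patients):
            totals[k] += wait
            service = rng.expovariate(service_rate)
            gap = rng.expovariate(arrival_rate)  # time until the next arrival
            wait = max(0.0, wait + service - gap)  # Lindley recursion
    return [t / n_sessions for t in totals]

# Hypothetical rates: utilization = 0.9 / 1.0 = 0.9.
waits = mean_wait_by_position(n_patients=200, n_sessions=2000,
                              arrival_rate=0.9, service_rate=1.0)
for k in (1, 10, 50, 200):
    print(f"patient {k:3d}: mean wait {waits[k - 1]:.2f}")
```

With these hypothetical rates, the closed-form steady-state mean wait is 0.9 / (1.0 × (1.0 − 0.9)) = 9 time units. Patients early in a session that starts empty wait less than that on average, which is the point made in the text.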
778 M. (Mac) Dada and C. Chambers
Second, if activity times and/or the logic defining work flow changes in response
to job type or system status, then the results of simple process analysis or queueing
models become invalid. We have documented such practices in multiple clinics that
we have studied (Chambers et al. 2016; Conley et al. 2016). Consequently, what
is needed is a tool that can account for all of these factors simultaneously, make
predictions about what happens when some element of the system changes, and
give us information about the broader distribution of outcomes, not just the means
for systems in steady state. DES is a tool with the needed capabilities.
A brief comment on the inclusion of activity times in DES models is warranted
here. We have used two distinct approaches. We can select an activity time at random
from a collection of observations. Alternatively, we can fit a distribution to collected
activity time data. We have found both approaches to work satisfactorily. However,
if the data set is sufficiently large, we recommend sampling directly from that set.
This generates results that are both easier to defend to statisticians and more credible
to practitioners.
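The two ways of supplying activity times can be sketched as follows. This is an illustration only: the observed values are hypothetical, and the log-normal family is an assumption made for the example, since the text does not name the distribution that was fitted. The fit simply matches the mean and standard deviation of the logged observations.

```python
import math
import random
import statistics

def empirical_sampler(observations, rng):
    """Approach 1: draw an activity time at random from the observed values."""
    data = list(observations)
    return lambda: rng.choice(data)

def fitted_lognormal_sampler(observations, rng):
    """Approach 2: fit a log-normal via the mean and sd of the logged data."""
    logs = [math.log(x) for x in observations]
    mu, sigma = statistics.fmean(logs), statistics.stdev(logs)
    return lambda: rng.lognormvariate(mu, sigma)

rng = random.Random(0)
observed_minutes = [4.0, 6.5, 5.0, 12.0, 7.5, 9.0, 5.5, 20.0]  # hypothetical data
draw_empirical = empirical_sampler(observed_minutes, rng)
draw_fitted = fitted_lognormal_sampler(observed_minutes, rng)

empirical_draws = [draw_empirical() for _ in range(5)]
fitted_draws = [draw_fitted() for _ in range(5)]
print(empirical_draws)
print([round(x, 1) for x in fitted_draws])
```

The empirical sampler can only return values that were actually observed, which is part of why it is easy to defend. The fitted sampler can produce times outside the observed range.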
Step 4: Field and Virtual Experiments
It is at this point that the use of experiments comes into play, and we merge the
OR methodology of DES with the experimental methods of medical research. The
underlying logic is that we propose an experiment involving some process change
that we believe will alter one or more parameters defining system behavior. We
can use the DES to predict outcomes if our proposal works. In other cases, if we
have evidence that the proposed change works in some settings, we can use the
DES to describe how that change will affect system metrics in other settings. The
construction of these experiments is the “art” of our approach. It is this creation that
leads to publishable results and creates novel insights.
We will provide examples of specific experiments in the next section. However,
at this juncture we wish to raise two critical issues: confounding variables and
unintended consequences. Confounding variables refer to system or behavioral
attributes that are not completely controlled when conducting an experiment but can
alter study results. For example, consider looking at a system before an intervention,
collecting data on its performance, changing something about the system, and
then collecting data on the performance of the modified system. This is the ideal
approach, but it implicitly assumes that nothing changed in the system over the
span of the study other than what you intended to change. If data collection takes
place over a period of months, it is quite possible that the appointment schedule
changed over that span of time due to rising or falling demand. In this example,
the change in demand would be a confounding variable. It is critically important to
eliminate as many confounding variables as you can before concluding that your
process change fully explains system improvement. DES offers many advantages in
this regard because it allows you to fix some parameter levels in a model even if
they may have changed in the field.
It is also critical to account for unintended consequences. For example, adding
examination rooms is often touted as a way to cut wait times. However, this also
makes the relevant space larger, increasing travel times as well as the complexity of
resource flows. This must be accounted for before declaring that the added rooms
actually improved performance. It may improve performance along one dimension
while degrading it in another.
DES modeling has repeatedly proven invaluable at this stage. Once a DES model
is created, it is easy to simulate a large number of clinic sessions and collect data
on a broad range of performance metrics. With a little more effort, it can also be
set up to collect data on the use of overtime or on wait times within examination
rooms. In addition, DES models can be set up to have patients take different paths
or have activity times drawn from different distributions depending on system status.
Finally, we have found it useful to have DES models collect data on subgroups of
patients based on system status because many changes to system parameters affect
different groups differently.
Step 5: Metrics of Interest
A famous adage asserts, “If you can’t measure it, you can’t manage it.” Hence,
focusing on measurements removes ambiguity and limits misunderstandings. If all
parties agree on a metric, then it is easier for them to share ideas on how to improve
it. However, this raises an important question: what metrics do we want to focus on?
In dealing with this question, Steps 4 and 5 of our method become intertwined and
cannot be thought of in a purely sequential fashion. In some settings, we need novel
metrics to fit an experiment, while in other settings unanticipated outcomes from
experiments suggest metrics that we had not considered earlier.
Both patients and providers are concerned with system performance, but their
differing perspectives create complex trade-offs. For example, researchers have
often found that an increase in face time with providers enhances the patient
experience (Thomas et al. 1997; Seals et al. 2005; Lin et al. 2001), but an increase
in wait time degrades that experience (Meza 1998; McCarthy et al. 2000; Lee et
al. 2005). The patient may not fully understand what the care provider is doing,
but they can always understand that more attention is preferable and waiting for it
is not productive. Given a fixed level of resources, increases in face time result in
higher provider utilization, which in turn increases patient wait times. Consequently,
the patient’s desire for increased face time and reduced wait time creates a natural
tension and suggests that the metrics of interest will almost always include both face
time and wait time.
Consider one patient that we observed recently. This patient arrived 30 min early
for an appointment and waited 20 min before being led to the exam room. After
being led to the room, the patient waited for 5 min before being seen by a nurse
for 5 min. The patient then waited 15 min before being seen by the resident. The
trainee then spoke with the patient for 20 min before leaving the room to discuss
the case with the attending. The patient then waited 15 min before being seen by
the resident and the attending together. The attending spoke with the patient for
5 min before being called away to deal with an issue for a different patient. This
took 10 min. The attending then returned to the exam room and spoke with the
patient for another 5 min. After that, the patient left. By summing these durations,
we see that the patient was in the clinic for roughly 100 min. The patient waited
for 20 min in the waiting room. However, the patient also spent 45 min in the exam
room waiting for service. Time in the examination room was roughly 80 min of
which 35 min was spent in the presence of a service provider. Thus, we can say that
the overall face time was only 35 min. However, of this time only 10 min was with
the attending physician. Consideration of this more complete description suggests a
plethora of little-used metrics that may be of interest, such as:
1. Patient punctuality
2. Time spent in the waiting room before the appointment time
3. Time spent in the waiting room after the appointment time
4. Wait time in the examination room
5. Proportion of cycle time spent with a care provider
6. Proportion of cycle time spent with the attending
The key message here is that the metrics of interest may be specific to the
problem that one seeks to address and must reflect the nuances of the process in
place to deliver the services involved.
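The metrics listed above can be computed directly from the timeline of the patient just described. The following sketch encodes that visit as a list of (state, minutes) pairs. The state labels and variable names are hypothetical, and the visit in which the resident and attending saw the patient together is counted as attending time, which is an assumption about how shared face time is attributed.

```python
# Timeline of the observed patient: (state, minutes), in order.
timeline = [
    ("waiting_room", 20),
    ("exam_wait", 5),
    ("nurse", 5),
    ("exam_wait", 15),
    ("resident", 20),
    ("exam_wait", 15),
    ("attending", 5),   # resident and attending together
    ("exam_wait", 10),  # attending called away to another patient
    ("attending", 5),
]
minutes_early = 30  # arrival relative to the appointment time

def minutes_in(states):
    """Total minutes spent in any of the given states."""
    return sum(m for state, m in timeline if state in states)

cycle_time = minutes_in({state for state, _ in timeline})
waiting_room = minutes_in({"waiting_room"})
wait_before_appointment = min(waiting_room, minutes_early)
face_time = minutes_in({"nurse", "resident", "attending"})

metrics = {
    "punctuality_min": -minutes_early,
    "wait_before_appointment_min": wait_before_appointment,
    "wait_after_appointment_min": waiting_room - wait_before_appointment,
    "exam_room_wait_min": minutes_in({"exam_wait"}),
    "share_with_provider": face_time / cycle_time,
    "share_with_attending": minutes_in({"attending"}) / cycle_time,
}
for name, value in metrics.items():
    print(f"{name}: {value}")
```

The totals reproduce the figures in the text: a 100 min cycle time, 45 min of waiting inside the examination room, 35 min of face time, and 10 min with the attending.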
Step 6: Predict Impact of Process Changes
Even after conducting an experiment in one setting, we have found that it is
extremely difficult to predict how changes will affect a different system simply by
looking at the process map. This is another area where DES proves quite valuable.
For example, say that our experiment in Clinic A shows that by changing the process
in some way, the time for the Attending step is cut by 10%. We can then model
this change in a different clinic setting by using a DES of that setting to predict
how implementing our suggested change will be reflected in performance metrics
of that clinic in the future. This approach has proven vital to get the buy-in needed to
facilitate a more formal experiment in the new setting or to motivate implementation
in a unit where no formal experiment takes place.
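The idea of transferring a measured 10% reduction in the Attending step to a different clinic can be sketched with a very small simulation. The code below is a stand-in for a DES under strong simplifying assumptions, not the model used in this work. It assumes two single-server stages in series (trainee, then attending), first-come-first-served service, and hypothetical uniform activity times and appointment spacing. The same random draws are reused in both runs so that only the attending times differ.

```python
import random

def simulate_session(arrivals, trainee_times, attending_times):
    """Two stages in series; returns (mean cycle time, session completion time)."""
    trainee_free = attending_free = 0.0
    cycle_times = []
    for arrive, t_time, a_time in zip(arrivals, trainee_times, attending_times):
        trainee_done = max(arrive, trainee_free) + t_time
        trainee_free = trainee_done
        attending_done = max(trainee_done, attending_free) + a_time
        attending_free = attending_done
        cycle_times.append(attending_done - arrive)
    return sum(cycle_times) / len(cycle_times), attending_free

rng = random.Random(42)
arrivals = [20.0 * i for i in range(12)]             # hypothetical schedule
trainee = [rng.uniform(10, 25) for _ in arrivals]    # hypothetical activity times
attending = [rng.uniform(8, 20) for _ in arrivals]

base_cycle, base_makespan = simulate_session(arrivals, trainee, attending)
faster = [0.9 * a for a in attending]                # Attending step cut by 10%
new_cycle, new_makespan = simulate_session(arrivals, trainee, faster)

print(f"mean cycle time: {base_cycle:.1f} -> {new_cycle:.1f} min")
print(f"completion time: {base_makespan:.1f} -> {new_makespan:.1f} min")
```

Because arrivals and trainee times are held fixed, any difference between the two runs is attributable to the change in attending time alone.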
Our work has included a collection of experiments that have led to system
improvements for settings such as that depicted in Fig. 22.1. We now turn to a
discussion of a few of these efforts to provide context and illustrations of our
approach. Figure 22.1 includes an arrival process under an appointment system. This
is quickly followed by activities involving the trainee and/or nurse and/or attending.
Finally, the search for an optimized schedule must account for all of these
elements. We discuss a few of these issues in turn.
Arrival Process
We are focusing on clinics which set a definite appointment schedule. One
obvious complication is that some patients are no-shows, meaning that they do not
show up for the appointment. No-show rates as high as 40% have been cited in
prior works (McCarthy et al. 2000; Huang 1994). However, there is also a subtler
issue of patients arriving very early or very late, and this is much harder to account
for. Early work in this space referred to this as patient “unpunctuality” (Bandura
1969; White and Pike 1964; Alexopoulos et al. 2008; Fetter and Thompson 1966;
Tai and Williams 2012; Perros and Frier 1996). Our approach has been used
to address two interrelated questions: Does patient unpunctuality affect clinic
performance, and can we affect patient unpunctuality? To address these questions,
we conducted a simple experiment. Data on patient unpunctuality was collected
over a six-month period. We found that most patients arrived early, but patient
unpunctuality ranged from −80 to +20 min. In other words, some patients arrived
as much as 80 min early, while others arrived 20 min late. An intervention was
performed that consisted of three elements. In reminders mailed to each patient
before their visit, it was stated that late patients would be asked to reschedule.
All patients were called in the days before the visit, and the same reminder was
repeated over the phone. Finally, a sign explaining the new policy was posted near
the registration desk. Unpunctuality was then tracked 1, 6, and 12 months later.
Additional metrics of interest were wait times, use of overtime, and the proportion
of patients that were forced to wait to be seen (Williams et al. 2014).
This lengthy follow-up was deemed necessary because some patients only visited
the clinic once per quarter, and thus the full effect of the intervention could not
be measured until after several quarters of implementation. To ensure that changes
in clinic performance were related only to changes in unpunctuality, we needed a
way to control for changes in the appointment schedule that happened over that
time span. Our response to this problem was to create a DES of the clinic, use
actual activity times in the DES, and consider old versus new distributions of patient
unpunctuality, assuming a fixed schedule. This allowed us to isolate the impact of
our intervention.
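The logic of holding the schedule and activity times fixed while swapping the unpunctuality distribution can be sketched as follows. This is a minimal stand-in for the DES, assuming a single server that sees patients in appointment order. The schedule, the service times, and both offset distributions (clipped normal draws with hypothetical parameters) are invented for illustration and are not the study data.

```python
import random

def run_session(appointments, offsets, service_times):
    """Serve patients in appointment order; return (mean wait, completion time).

    The clinic opens at time 0, so a patient who arrives earlier waits until then.
    """
    server_free = 0.0
    waits = []
    for appt, offset, service in zip(appointments, offsets, service_times):
        arrival = appt + offset
        start = max(arrival, server_free)
        waits.append(start - arrival)
        server_free = start + service
    return sum(waits) / len(waits), server_free

def clipped_offsets(n, mean, sd, low, high, rng):
    """Hypothetical unpunctuality draws, clipped to the range [low, high]."""
    return [min(high, max(low, rng.gauss(mean, sd))) for _ in range(n)]

rng = random.Random(7)
appointments = [15.0 * i for i in range(16)]            # fixed schedule
services = [rng.uniform(8, 22) for _ in appointments]   # same draws in both runs
old = clipped_offsets(16, -15, 20, -80, 20, rng)        # wider spread (assumed)
new = clipped_offsets(16, -15, 10, -50, 8, rng)         # narrower spread (assumed)

old_result = run_session(appointments, old, services)
new_result = run_session(appointments, new, services)
print("old unpunctuality: mean wait %.1f, completion %.1f" % old_result)
print("new unpunctuality: mean wait %.1f, completion %.1f" % new_result)
```

A single pair of runs like this is only one draw. In practice many replications per scenario would be compared, with the schedule and activity-time inputs kept identical across the two scenarios.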
Before the intervention, 7.7% of patients were tardy and average tardiness of
those patients was 16.75 min. After 12 months, these figures dropped to 1.5% and
2 min, respectively. The percentage of patients who arrived before their appointment
time rose from 90.4% to 95.4%. The proportion who arrived at least 1 min tardy
dropped from 7.69% to 1.5%. The range of unpunctuality decreased from 100
to 58 min. The average time to complete the session dropped from 250.61 to
244.49 min. Thus, about 6 min of overtime operations was eliminated from each
session. The likelihood of completing the session on time rose from 21.8% to 31.8%.
Our use of DES allowed us to create metrics of performance that had not yet been
explored. For example, we noticed that the benefits from the change were not the
same for all patients. Patients that arrived late saw their average wait time drop from
10.7 to 0.9 min. Those that arrived slightly early saw their average wait time increase
by about 0.9 min. Finally, for those that arrived very early, their wait time was
unaffected. In short, we found that patient unpunctuality can be affected, and it does
alter clinic performance, but this has both intended and unintended consequences.
The clinic session is more likely to finish on time and overtime costs are reduced.
However, much of the benefit in terms of wait times is actually realized by patients
that still insist on arriving late.
Physician Processing Times
Historically, almost all research on outpatient clinics assumed that processing
times were not related to the schedule or whether the clinic was running on time. Is
this indeed the case? To address this question, we analyzed data from three clinic
settings. One was a low-volume clinic that housed a single physician, another was
a medium-volume clinic in an AMC that had one attending working on each shift
along with two or three trainees, and the last one was a high-volume service that had
multiple attendings working simultaneously (Chambers et al. 2016).
We categorized patients into three groups: Group A patients were those who
arrived early and were placed in the examination room before their scheduled
appointment time. Group B patients were those who also arrived early, but were
placed in the examination room after their appointment time, indicating that
the clinic was congested. Group C patients were those who arrived after their
appointment time. The primary question was whether the average processing time
for patients in Group A was the same as that for patients in Group B. We also had
questions about how this affected clinic performance in terms of wait times and
session completion times.
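The grouping rule and the per-group summary can be sketched as follows. The record layout and the sample visits are hypothetical. Ties (a patient arriving or being roomed exactly at the appointment time) are not addressed in the text, so the tie-handling below is an assumption.

```python
import math
import statistics
from collections import defaultdict

def patient_group(appointment, arrival, roomed):
    """Return 'A', 'B', or 'C' from times in minutes on a common clock."""
    if arrival > appointment:
        return "C"                                # arrived after the appointment time
    return "A" if roomed <= appointment else "B"  # roomed before vs. after it

def summarize(visits):
    """Mean processing time and standard error per group (groups with n > 1)."""
    by_group = defaultdict(list)
    for appointment, arrival, roomed, processing in visits:
        by_group[patient_group(appointment, arrival, roomed)].append(processing)
    return {
        group: (statistics.fmean(times),
                statistics.stdev(times) / math.sqrt(len(times)))
        for group, times in by_group.items() if len(times) > 1
    }

# Hypothetical visits: (appointment, arrival, roomed, processing minutes).
visits = [
    (60, 50, 55, 40), (60, 45, 58, 36),   # Group A
    (60, 55, 70, 25), (60, 50, 65, 27),   # Group B
    (60, 65, 66, 20),                     # Group C
]
summary = summarize(visits)
for group in sorted(summary):
    mean, se = summary[group]
    print(f"Group {group}: {mean:.2f} ({se:.2f})")
```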
In the low-volume clinic with a single physician, average processing times in minutes,
with standard errors in parentheses, were 38.31 (3.21) for Group A and 26.23 (2.23) for
Group B. In other words, the physician moved faster when the clinic was behind
schedule. Similar results have been found in other industries, but this was the first
time (to the best of our knowledge) that this had been demonstrated for outpatient
clinics.
In the medium-volume clinic, the relevant values were 65.59 (2.24) and 53.53
(1.97). Again, the system worked faster for Group B than it did for Group A. Note
that the drop in average times is about 12 min in both settings. This suggests that the
finding is robust, meaning that it occurs to a similar extent in similar (but not
identical) settings. Additionally, remember that the medium-volume clinic included
trainees in the process flow. This suggests that the way that the system got this
increase in speed might be different. In fact, our data show that the average amount
of time the attending spent with the patient was no more than 12 min to begin with.
Thus, we know that it was not just the behavior of the attending that made this
happen. The AMC must be using the trainees differently when things fall behind
schedule.
In the high-volume clinic, the parallel values were 47.15 (0.81) and 17.59 (0.16).
Here, we see that the drop in processing times is much more dramatic than we saw
before. Again, the message is that processing times change when the system is under
stress and the magnitude of the change implies that multiple parties are involved in
making this happen. In hindsight, this seems totally reasonable, but the extent of the
difference is still quite startling.
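One way to gauge how large these differences are relative to their sampling error is to divide each difference in means by the combined standard error. The sketch below does this for the three reported pairs. It assumes the two groups are independent samples, and it is offered only as a reader's check, not as the test reported in the cited study.

```python
import math

def z_statistic(mean_a, se_a, mean_b, se_b):
    """z-type statistic for the difference of two independent sample means."""
    return (mean_a - mean_b) / math.sqrt(se_a ** 2 + se_b ** 2)

# (mean, standard error) in minutes for Groups A and B, as reported above.
clinics = {
    "low-volume": ((38.31, 3.21), (26.23, 2.23)),
    "medium-volume": ((65.59, 2.24), (53.53, 1.97)),
    "high-volume": ((47.15, 0.81), (17.59, 0.16)),
}
z_scores = {name: z_statistic(*a, *b) for name, (a, b) in clinics.items()}
for name, z in z_scores.items():
    print(f"{name}: z = {z:.2f}")
```

The three statistics come out at roughly 3.1, 4.0, and 35.8, so under the independence assumption the Group A versus Group B gap is several combined standard errors wide in every setting.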
As we saw in the previous section, there is an unintended consequence of this
system behavior as it relates to patient groups. Patients that show up early should
help the clinic stay on schedule. This may not be so because these patients receive
longer processing times. Thus, their cycle times are longer. Patients that arrive late
have shorter wait times and shorter processing times. Thus, their cycle times are
shorter. If shorter cycle times are perceived as a benefit, this seems like an unfair
reward for patient tardiness and may explain why it will never completely disappear.
trainee review the case after the patient is placed in the examination room and
then having the first conversation about the case with the attending after the trainee
interacts with the patient, we can notify both the trainee and attending in advance
which patient each trainee will see. That way, the trainee can review the file before
the session starts and have a conversation with the attending about what should
happen upon patient arrival. We also created a template to guide the flow and content
of this conversation. We refer to this approach as “preprocessing” (Williams et al.
2015).
We recorded activity times using the original system for 90 days. We then
introduced the new approach and ran it for 30 days. During this time, we continued
collecting data on activity times.
Before the intervention was made, the average teach time was 12.9 min for new
patients and 8.8 min for return patients. The new approach reduced these times by
3.9 min for new patients and 2.9 min for return patients. Holding the schedule as a
constant, we find that average wait times drop from 36.1 to 21.4 min and the session
completion time drops from 275.6 to 247.4 min.
However, in this instance, it was the unintended consequences that proved to be
more important. When the trainees had a more clearly defined plan about how to
handle each case, their interactions with the patients became more efficient. The
trainees also reported that they felt more confident when treating the patients than
they had before. While it is difficult to measure this effect in terms of time, both the
trainees and the attending felt that the patients received better care under the new
protocol.
Cyclic Scheduling
Across the studies described above, one recurring finding was that the way the
trainee was involved in the process had a large impact on system performance, and
that this involvement was often state dependent. Recall that we found
that the system finds ways to move faster when the clinic is behind schedule. When
a physician is working alone, this can be done simply by providing less face time to
patients. When the system includes a trainee, an additional response is available in
that either the attending or the trainee can be dropped from the process for one or
more patients. Our experience is that doctors strongly believe that the first approach
produces huge savings and they strongly oppose the second.
Our direct observation of multiple clinics produced some insights related to these
issues. Omitting the attending does not save as much time as most attendings think
because the trainee is slower than the attending. In addition, the attending gets
involved in more of these cases than they seem to realize. Many attendings feel
compelled to “at least say hi” to the patients even when the patients are not really on
their schedule, and these visits often turn out to be longer than expected. Regarding
the second approach, we have noticed a huge variance in terms of how willing the
attending is to omit the trainee from a case. Some almost never do it, while others do
it quite often. In one clinic we studied, we found that the trainee was omitted from
roughly 30% of the cases on the clinic schedule. If this is done, it might explain
why a medium-volume or high-volume clinic within the AMC could reduce cycle
times after falling behind schedule to a greater extent than the low-volume clinic
can achieve. This can be done by instructing the trainee to handle one case while the
attending handles another and having the attending exclude the trainee from one or
more cases in an effort to catch up to the clinic schedule.
Accounting for these issues when creating an appointment schedule led us to
the notion of cyclic scheduling. The idea is that the appointment schedule can be
split into multiple subsets which repeat. We label these subsets “cycles.” In each
cycle, we include one new patient and one return patient scheduled to arrive at the
same time. A third patient is scheduled to arrive about the middle of the cycle.
If both patients arrive at the start of the cycle, we let the trainee start work on
the new patient, and the attending handles the return patient without the trainee
being involved. This was deemed acceptable because it was argued that most of the
learning comes from visits with new patients. If only one of the two patients arrives,
the standard process is used.
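The cyclic template and the rule applied at the start of each cycle can be sketched as follows. The cycle length, the label given to the third patient (whose type is not specified in the text), and the function names are assumptions made for illustration.

```python
def cyclic_schedule(n_cycles, cycle_minutes):
    """Appointment template: (minute, patient type) for each repeated cycle."""
    slots = []
    for c in range(n_cycles):
        start = c * cycle_minutes
        slots.append((start, "new"))      # one new patient and one return
        slots.append((start, "return"))   # patient booked at the same time
        slots.append((start + cycle_minutes // 2, "mid-cycle"))  # third patient
    return slots

def assign_at_cycle_start(new_present, return_present):
    """Who handles whom when the cycle begins."""
    if new_present and return_present:
        return {"new": "trainee starts", "return": "attending alone"}
    return "standard process"  # only one of the two patients showed up

schedule = cyclic_schedule(n_cycles=2, cycle_minutes=60)  # hypothetical 60 min cycle
for minute, kind in schedule:
    print(f"minute {minute:3d}: {kind} patient")
print(assign_at_cycle_start(True, True))
print(assign_at_cycle_start(True, False))
```

The sketch does not cover the case in which neither patient is present at the start of a cycle, which the text does not describe.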
Process analysis tools produce some results about average cycle times in this
setting, but since wait times are serially correlated, we want a much clearer
depiction of how each patient’s wait time is related to that of the following patients.
Considering the problem using a queuing model is extremely difficult because the
relevant distribution of activity times is state dependent and the number of cycles
is small. Consequently, steady-state results are misleading. Studying this approach
within a DES revealed that average makespan, wait times, and cycle times are
significantly reduced using our cyclic approach and the trainee is involved in a
greater proportion of the cases scheduled.
3 Conclusion
While a great deal of time, effort, and money has been spent to improve healthcare
processes, the problems involved have proven to be very difficult to solve. In this
work, we focused on a small but important sector of the problem space—that of
appointment-based clinics in academic medical centers. One source of difficulty is
that the medical field favors an experimental design-based approach, while many OR
tools are more mathematical and abstract. Consequently, one of our core messages
is that those working to improve these systems need to find ways to bridge this gap
by combining techniques. When this is done, progress can be made and the insights
generated can be spread more broadly. Our use of DES builds on tools of process
mapping that most managers are familiar with and facilitates virtual experiments
that are easier to control and use to generate quantitative metrics amenable to the
kinds of statistical tests that research physicians routinely apply.
However, we would be remiss if we failed to emphasize the fact that data-driven
approaches are rarely sufficient to bring about the desired change. Hospitals in
AMCs are often highly politicized environments with a hierarchical culture. This
fact can generate multiple roadblocks that no amount of “number crunching” will
ever overcome. One not-so-subtle aspect of our method is that it typically involves
embedding ourselves in the process over some period of time and interacting
repeatedly with the parties involved. We have initiated many projects not mentioned
above because they did not result in real action. Every project that has been
successful involved many hours of working with faculty, physicians, staff, and
technicians of various types to collect information and get new perspectives. We
have seen dozens of researchers perform much more impressive data analysis on
huge data sets using tools that were more powerful than those employed in these
examples, only to end up with wonderful analysis not linked to any implementation.
When dealing with healthcare professionals, we are often reminded of the old adage,
“No one cares how much you know. They want to know how much you care.” While
we believe that the methodology outlined in this chapter is useful, our experience
strongly suggests that the secret ingredient to making these projects work is the
attention paid to the physicians, faculty, and especially staff involved who ultimately
make the system work.
All the datasets, code, and other material referred to in this section are available at
www.allaboutanalytics.net.
• Model 22.1: Model1.mox
• Model 22.2: Model1A.mox
• Model 22.3: Model2.mox
• Model 22.4: Model3.mox
Exercises
available. A wide variety of texts and tools are also available to assist the potential
user with details of software capabilities including Strickland (2010) and Laguna
and Marklund (2005). However, the models utilized in this reading are fairly simple
to construct and can be easily adapted to other packages as the reader (or instructor)
sees fit. For ease of exposition and fit with the main body of the reading, we present
exercises corresponding to settings described earlier. Hints are provided in the Hints
for Solution Word file (refer to book’s website) that should help in going through
the exercises. The exercises allow the reader to explore the many ideas given in the
chapter in a step-by-step manner.
A Basic Model with Patient Unpunctuality
Service providers in many settings utilize an appointment system to manage the
arrival of customers/jobs. However, the assumption that the appointment schedule
will be strictly followed is rarely justified. The first model (Model 1; refer to book’s
website) presents a simplified process flow for a hypothetical clinic and embeds an
appointment schedule. The model facilitates changes to the random variable that
defines patient punctuality. In short, patients arrive at some time offset from their
appointment time. By adjusting the parameters which define the distribution of this
variable, we can represent arrival behavior. You may alter this model to address the
following questions:
Ex. 22.1 Describe clinic performance if all patients arrive on time.
Ex. 22.2 Explain how this performance changes if unpunctuality is included.
For this example, this means modeling actual arrival time as the appointment
time plus a log-normally distributed variable with a mean of μ and a standard
deviation of σ minutes. A reasonable base case may include μ = −15 min, and
σ = 10 min. (Negative values of unpunctuality mean that the patient arrives prior
to the appointment time, which is the norm.) Note how changes to μ and σ affect
performance differently.
Ex. 22.3 Explain how you would create an experiment (in an actual clinic) to
uncover how this behavior changes and how it affects clinic performance.
Ex. 22.4 Explain how you would alter Model 1 to report results for groups of
patients such as those with negative unpunctuality (early arrivers), those with
positive unpunctuality (late arrivers), and those with appointment times near the
end of the clinic session.
Ex. 22.5 The DES assumes that the patient with the earlier appointment time is
always seen first, even if they arrived late. How would you modify this model if the
system “waits” for late patients up to some limit, “w” minutes rather than seeing the
next patient as soon as the server is free?
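The arrival model in Ex. 22.2 (appointment time plus a random offset) can be sketched outside the simulation package. The exercise calls the offset log-normal with a possibly negative mean, yet a log-normal variable is strictly positive. The sketch below therefore assumes one possible reading: a log-normal variable shifted left by a fixed amount, so that the shifted offset has mean μ and standard deviation σ. The shift of −80 min (borrowed from the earliest arrival reported in the unpunctuality study above), the function names, and the seed are assumptions of this sketch, not part of Model 1.

```python
import math
import random
import statistics

def lognormal_params(mean, sd):
    """Parameters of a log-normal with the given mean and sd (mean must be > 0)."""
    sigma2 = math.log(1.0 + (sd / mean) ** 2)
    return math.log(mean) - sigma2 / 2.0, math.sqrt(sigma2)

def sample_arrivals(appointments, mu, sd, shift, rng):
    """Arrival = appointment + shift + log-normal draw; requires mu > shift."""
    m, s = lognormal_params(mu - shift, sd)
    return [t + shift + rng.lognormvariate(m, s) for t in appointments]

rng = random.Random(1)
appointments = [0.0] * 20000            # all at time 0, so arrivals equal offsets
offsets = sample_arrivals(appointments, mu=-15.0, sd=10.0, shift=-80.0, rng=rng)

print(f"mean offset {statistics.fmean(offsets):.2f} min")
print(f"sd of offset {statistics.stdev(offsets):.2f} min")
print(f"share arriving late {sum(o > 0 for o in offsets) / len(offsets):.3f}")
```

Changing mu moves the whole distribution earlier or later, while changing sd alters the share of patients in the late tail. This is one way to see why the two parameters affect performance differently.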
An Academic Model with Distributions of Teaching Time
The process flow within the academic medical center (AMC) differs from Model
1 in that it includes additional steps and resources made necessary by the hospital’s
teaching mission. Simple process analysis is useful in these settings to help identify
the bottleneck resource and to use management of that resource to improve system
performance. However, such simple models are unable to fully account for the
impact of system congestion given this more complex flow. For example, idle time
is often added because one resource is forced to wait for the availability of another.
Using a DES of such systems may be particularly valuable in that it facilitates
various forms of sensitivity analysis which can produce novel insights about these
issues. Use Model 2 (refer to book’s website) of the AMC to address the following
questions:
Ex. 22.6 How do the average values of cycle time, wait time, and makespan respond
to changes in teach time?
Ex. 22.7 Describe the linkage between utilization of the trainees in this system and
the amount of time they spend with patients. How much of their busy time is not
explained by value-adding tasks?
Ex. 22.8 Describe the linkage between the number of trainees and the utilization of
other key resources in the system.
Ex. 22.9 Explain how you would create an experiment (in an actual clinic) to
uncover how changing the educational process is linked to resident productivity.
Ex. 22.10 How would you alter Model 2 to reflect a new approach to trainee
education aimed at increasing the share of their time that adds value to the patient?
State-Dependent Processing Times
Experience with many service systems lends support to the notion that the service
provider may be motivated to “speed up” when the system is busy. However,
common sense also suggests that this is not sustainable forever. It is therefore
important to think through how we might measure this behavior and how we might
monitor any unintended consequences of such an approach. To that end, Model 3
(refer to book’s website) includes a reduction in processing times for the
attending when the system is busy. Use this model to address the
following questions:
Ex. 22.11 How do average values of cycle time, wait time, and makespan change
when the attending gets faster in a busy system?
Ex. 22.12 Instead of reducing face time, consider adding examination rooms to the
system. Is there any evidence produced by the DES to suggest that one
approach is better than the other?
Ex. 22.13 Compare decreasing processing times only when the system is busy with
changing processing times in all settings.
Ex. 22.14 Explain how you would create an experiment (in an actual clinic) to
explore how this behavior affects patient flow and service quality. What extra factors
do you need to control for?
Ex. 22.15 How would you alter Model 3 to separate the effects of patient
behavior (including unpunctuality) from the effects of physician behavior (including
changing processing times)?
Cyclic Scheduling
Personnel creating an appointment schedule are likely to favor having a simple
template to refer to when patients request appointment times. Consequently, there is
administrative value in having a logic that is easy to explain and implement. Again,
this is more difficult to do in the AMC since the process flow is more complex.
Return to the use of Model 2 and modify it as needed to address the following
questions:
Ex. 22.16 Study the existing appointment schedule. Develop the “best” schedule if
there is no variability to consider. (You may assume that average activity times are
always realized.)
Ex. 22.17 How does your schedule perform when patient unpunctuality is added,
and how will you adjust your schedule to account for this?
Ex. 22.18 Assuming that patients are always perfectly punctual and only attending
time is variable, look for a schedule that works better than the one developed in
Exercise 22.16.
Ex. 22.19 Explain how you would create an experiment (in an actual clinic) to
explore ways to reduce this variability. What extra factors do you need to control
for?
Ex. 22.20 How would you alter Model 2 to include additional issues such as patient
no-shows, emergencies, work interruptions, and open-access scheduling?
Conclusion
It is important to note that DES models are only one tool that can be applied
to develop a deeper understanding of the behavior of complex systems. However,
adding this approach to the “toolbox” of the clinic manager or consultant should
provide ample benefits and support for ideas on how to make these systems better
meet the needs of all stakeholders.
References
Alexopoulos, C., Goldman, D., Fontanesi, J., Kopald, D., & Wilson, J. R. (2008). Modeling patient
arrivals in community clinics. Omega, 36, 33–43.
Bandura, A. (1969). Principles of behavior modification. New York, NY: Holt, Rinehart, &
Winston.
Benneyan, J. C. (1997). An introduction to using computer simulation in healthcare: Patient wait
case study. Journal of the Society for Health Systems, 5(3), 1–15.
Beronio, K., Glied, S. & Frank, R. (2014) J Behav Health Serv Res. 41, 410. https://doi.org/
10.1007/s11414-014-9412-0
Boex, J. R., Boll, A. A., Franzini, L., Hogan, A., Irby, D., Meservey, P. M., Rubin, R. M., Seifer,
S. D., & Veloski, J. J. (2000). Measuring the costs of primary care education in the ambulatory
setting. Academic Medicine, 75(5), 419–425.
Bowers, J., & Mould, G. (2005). Ambulatory care and orthopaedic capacity planning. Health Care
Management Science, 8(1), 41–47.
Burke, E. K., De Causmaecker, P., Berghe, G. V., & Van Landeghem, H. (2004). The state of the
art of nurse rostering. Journal of Scheduling, 7(6), 441–499.
Cayirli, T., Veral, E., & Rosen, H. (2006). Designing appointment scheduling systems for
ambulatory care services. Health Care Management Science, 9(1), 47–58.
Chambers, C. G., Dada, M., Elnahal, S. M., Terezakis, S. A., DeWeese, T. L., Herman, J. M., &
Williams, K. A. (2016). Changes to physician processing times in response to clinic congestion
and patient punctuality: A retrospective study. BMJ Open, 6(10), e011730.
Chao, X., Liu, L., & Zheng, S. (2003). Resource allocation in multisite service systems with
intersite customer flows. Management Science, 49(12), 1739–1752.
Chesney, A. M. (1943). The Johns Hopkins Hospital and John Hopkins University School of
Medicine: A chronicle. Baltimore, MD: Johns Hopkins University Press.
790 M. (Mac) Dada and C. Chambers
Clymer, J. R. (2009). Simulation-based engineering of complex systems (Vol. 65). New York, NY:
John Wiley & Sons.
Conley, K., Chambers, C., Elnahal, S., Choflet, A., Williams, K., DeWeese, T., Herman, J., & Dada,
M. (2018). Using a real-time location system to measure patient flow in a radiation oncology
outpatient clinic, Practical radiation oncology.
Cutler, D. M., & Ghosh, K. (2012). The potential for cost savings through bundled episode
payments. New England Journal of Medicine, 366(12), 1075–1077.
Fetter, R. B., & Thompson, J. D. (1966). Patients’ wait time and doctors’ idle time in the outpatient
setting. Health Services Research, 1(1), 66.
Franzini, L., & Berry, J. M. (1999). A cost-construction model to assess the total cost of an
anesthesiology residency program. The Journal of the American Society of Anesthesiologists,
90(1), 257–268.
Glied, S., & Ma, S. (2015). How will the Affordable Care Act affect the use of health care services?
New York, NY: Commonwealth Fund.
Hamrock, E., Parks, J., Scheulen, J., & Bradbury, F. J. (2013). Discrete event simulation for
healthcare organizations: A tool for decision making. Journal of Healthcare Management,
58(2), 110.
Hing, E., Hall, M. J., Ashman, J. J., & Xu, J. (2010). National hospital ambulatory medical care
survey: 2007 Outpatient department summary. National Health Statistics Reports, 28, 1–32.
Hosek, J. R., & Palmer, A. R. (1983). Teaching and hospital costs: The case of radiology. Journal
of Health Economics, 2(1), 29–46.
Huang, X. M. (1994). Patient attitude towards waiting in an outpatient clinic and its applications.
Health Services Management Research, 7(1), 2–8.
Hwang, C. S., Wichterman, K. A., & Alfrey, E. J. (2010). The cost of resident education. Journal
of Surgical Research, 163(1), 18–23.
Jun, J. B., Jacobson, S. H., & Swisher, J. R. (1999). Application of discrete-event simulation in
health care clinics: A survey. Journal of the Operational Research Society, 50(2), 109–123.
Kaplan, R. S., & Anderson, S. R. (2003). Time-driven activity-based costing. SSRN 485443.
Kaplan, R. S., & Porter, M. E. (2011). How to solve the cost crisis in health care. Harvard Business
Review, 89(9), 46–52.
King, M., Lapsley, I., Mitchell, F., & Moyes, J. (1994). Costing needs and practices in a changing
environment: The potential for ABC in the NHS. Financial Accountability & Management,
10(2), 143–160.
Kolker, A. (2010). Queuing theory and discrete event simulation for healthcare: From basic
processes to complex systems with interdependencies. In Abu-Taieh, E., & El Sheik, A. (Eds.),
Handbook of research on discrete event simulation technologies and applications (pp. 443–
483). Hershey, PA: IGI Global.
Laguna, M., & Marklund, J. (2005). Business process modeling, simulation and design. Upper
Saddle River, NJ: Pearson Prentice Hall.
Lee, V. J., Earnest, A., Chen, M. I., & Krishnan, B. (2005). Predictors of failed attendances in a
multi-specialty outpatient centre using electronic databases. BMC Health Services Research,
5(1), 1.
van Lent, W. A. M., VanBerkel, P., & van Harten, W. H. (2012). A review on the relation between
simulation and improvement in hospitals. BMC Medical Informatics and Decision Making,
12(1), 1.
Lin, C. T., Albertson, G. A., Schilling, L. M., Cyran, E. M., Anderson, S. N., Ware, L., &
Anderson, R. J. (2001). Is patients’ perception of time spent with the physician a determinant
of ambulatory patient satisfaction? Archives of Internal Medicine, 161(11), 1437–1442.
Lorenzoni, L., Belloni, A., & Sassi, F. (2014). Health-care expenditure and health policy in the
USA versus other high-spending OECD countries. The Lancet, 384(9937), 83–92.
Mandelbaum, A., Momcilovic, P., & Tseytlin, Y. (2012). On fair routing from emergency
departments to hospital wards: QED queues with heterogeneous servers. Management Science,
58(7), 1273–1291.
22 Healthcare Analytics 791
McCarthy, K., McGee, H. M., & O’Boyle, C. A. (2000). Outpatient clinic wait times and non-
attendance as indicators of quality. Psychology, Health & Medicine, 5(3), 287–293.
Meza, J. P. (1998). Patient wait times in a physician’s office. The American Journal of Managed
Care, 4(5), 703–712.
Moses, H., Thier, S. O., & Matheson, D. H. M. (2005). Why have academic medical centers
survived. Journal of the American Medical Association, 293(12), 1495–1500.
Nuti, S. V., Wayda, B., Ranasinghe, I., Wang, S., Dreyer, R. P., Chen, S. I., & Murugiah, K.
(2014). The use of Google trends in health care research: A systematic review. PLoS One,
9(10), e109583.
Ortman, J. M., Velkoff, V. A., & Hogan, H. (2014). An aging nation: The older population in the
United States (pp. 25–1140). Washington, DC: US Census Bureau.
Perros, P., & Frier, B. M. (1996). An audit of wait times in the diabetic outpatient clinic: Role of
patients’ punctuality and level of medical staffing. Diabetic Medicine, 13(7), 669–673.
Sainfort, F., Blake, J., Gupta, D., & Rardin, R. L. (2005). Operations research for health care
delivery systems. WTEC panel report. Baltimore, MD: World Technology Evaluation Center,
Inc..
Seals, B., Feddock, C. A., Griffith, C. H., Wilson, J. F., Jessup, M. L., & Kesavalu, S. R. (2005).
Does more time spent with the physician lessen parent clinic dissatisfaction due to long wait
times. Journal of Investigative Medicine, 53(1), S324–S324.
Sloan, F. A., Feldman, R. D., & Steinwald, A. B. (1983). Effects of teaching on hospital costs.
Journal of Health Economics, 2(1), 1–28.
Strickland, J. S. (2010). Discrete event simulation using ExtendSim 8. Colorado Springs, CO:
Simulation Educators.
Tai, G., & Williams, P. (2012). Optimization of scheduling patient appointments in clinics using a
novel modelling technique of patient arrival. Computer Methods and Programs in Biomedicine,
108(2), 467–476.
Taylor, D. H., Whellan, D. J., & Sloan, F. A. (1999). Effects of admission to a teaching hospital
on the cost and quality of care for Medicare beneficiaries. New England Journal of Medicine,
340(4), 293–299.
Thomas, S., Glynne-Jones, R., & Chait, I. (1997). Is it worth the wait? a survey of patients’
satisfaction with an oncology outpatient clinic. European Journal of Cancer Care, 6(1), 50–
58.
Trebble, T. M., Hansi, J., Hides, T., Smith, M. A., & Baker, M. (2010). Process mapping the patient
journey through health care: An introduction. British Medical Journal, 341(7769), 394–397.
Trusko, B. E., Pexton, C., Harrington, H. J., & Gupta, P. (2007). Improving healthcare quality and
cost with six sigma. Upper Saddle River, NJ: Financial Times Press.
White, M. J. B., & Pike, M. C. (1964). Appointment systems in out-patients’ clinics and the effect
of patients’ unpunctuality. Medical Care, 133–145.
Williams, J. R., Matthews, M. C., & Hassan, M. (2007). Cost differences between academic and
nonacademic hospitals: A case study of surgical procedures. Hospital Topics, 85(1), 3–10.
Williams, K. A., Chambers, C. G., Dada, M., Hough, D., Aron, R., & Ulatowski, J. A. (2012).
Using process analysis to assess the impact of medical education on the delivery of pain
services: A natural experiment. The Journal of the American Society of Anesthesiologists,
116(4), 931–939.
Williams, K. A., Chambers, C. G., Dada, M., McLeod, J. C., & Ulatowski, J. A. (2014). Patient
punctuality and clinic performance: Observations from an academic-based private practice pain
centre: A prospective quality improvement study. BMJ Open, 4(5), e004679.
Williams, K. A., Chambers, C. G., Dada, M., Christo, P. J., Hough, D., Aron, R., & Ulatowski, J.
A. (2015). Applying JIT principles to resident education to reduce patient delays: A pilot study
in an academic medical center pain clinic. Pain Medicine, 16(2), 312–318.
Wilson, J. C. T. (1981). Implementation of computer simulation projects in health care. Journal of
the Operational Research Society, 32(9), 825–832.
Chapter 23
Pricing Analytics
1 Introduction
One of the most important decisions a firm has to take is the pricing of its products.
At its simplest, this amounts to stating a number (the price) for a single product.
But it is often a lot more complicated than that. Various pricing mechanisms
such as dynamic pricing, promotions, bundling, volume discounts, segmentation,
bidding, and name-your-own-price are usually deployed to increase revenues, and
this chapter is devoted to the study of such mechanisms. Pricing and revenue
optimization is known by different names in different domains, such as revenue
management (RM), yield management, and pricing analytics. One formal definition
of revenue management is the study of how a firm should set and update pricing
and product availability decisions across its various selling channels in order to
maximize its profitability. There are several key phrases in this definition: Firms
should not only set but also update prices; thus, price setting should be dynamic and
depend on many factors such as competition, availability of inventory, and updated
demand forecasts. Firms not only set prices but also make product availability
decisions; in other words, firms can stop offering certain products at a given price
(such as the closing of low-fare seats on airlines) or offer only certain assortments
in certain channels. Firms might offer different products at different prices across
selling channels; for example, the online price for certain products might be lower
than the retail price.
K. Talluri
Imperial College Business School, South Kensington, London, UK
e-mail: kalyan.talluri@imperial.ac.uk
S. Seshadri
Gies College of Business, University of Illinois at Urbana-Champaign, Champaign, IL, USA
e-mail: sridhar@illinois.edu
The application of pricing and revenue management analytics in business
management began in the 1970s. Airline operators like British Airways (then
British Overseas Airways Corp.) and American Airlines began to offer differentiated
fares for essentially the same tickets. The pioneer of this technique, called yield
management, was Bob Crandall. Crandall, who eventually became chief executive
of American Airlines, spearheaded a revolution in airline ticket pricing, but its
impact would be felt across industries. Hotel chains, such as Marriott International,
and parcelling services, like United Parcel Service, have used it to great effect.
These techniques have only become more refined in the decades since. The
advent of big data has revolutionized the degree to which analytics can predict
patterns of customer demand, helping companies adapt to trends more quickly than
ever. Retail chains such as Walmart collect petabytes of data daily, while mobile
applications like Uber rely on big data to provide the framework for their business
model.
Yet even in its simplest form (a simple posted-price mechanism), pricing is tricky.
If you set it too low or too high, you are losing out on revenue. On the other
hand, determining the right price, either before or after the sale, may be impossible.
Analytics helps; indeed, there are few other areas where data and analytics come
together as nicely to help out the manager. That is because pricing is inherently
about data and numbers and optimization. There are many unobservable factors
such as a customer’s willingness to pay and needs, so modeling plays a critical
role. We restrict ourselves by and large to monopoly models, folding in,
whenever possible, competitive prices and product features, but we do not explicitly
model strategic reactions and equilibria. We cover the modeling of pricing
optimization, which by necessity involves modeling customer behavior and constrained
optimization.
Moreover, the application of big data techniques to pricing methods raises
privacy concerns. As models become better at understanding customers, companies
may find themselves rapidly entering an uncanny valley-like effect, where
their clients are disoriented and put off by the precision with which they can
be targeted. The European Union’s General Data Protection Regulation is explicitly
aimed at limiting the use and storage of personal data, necessitating a wide set of
reforms by companies across sectors and industries.
The two building blocks of revenue management are developing quantitative
models of customer behavior, that is, price-response curves, demand forecasts,
market segmentation, etc., and tools of constrained optimization. The first building
block is all about capturing details about the consumers at a micro-market level. For
example, one might consider which customers shop at what times for which
products at a given store of a food retailer. Then, one might model their sensitivity
to price, product assortments, and product bundles. This data can be combined
with inventory planning system information to set prices. The second building
block reflects the fact that price should depend on availability. Therefore, capacity
constraints play an important role in price optimization. In addition, there could
be other simple constraints, such as inventory availability, route structure of an
airline, network constraints that equate inflow and inventory to outflows, and
consumption and closing inventory. More esoteric constraints are used to model
customer switching behavior when presented with a choice of products or even the
strategic behavior of customers in anticipation of a discount or price increase.
What sorts of questions does RM help answer? We have provided a partial list as
follows:
• A hotel chain wants guidelines on how to design products for different customer
segments. Price is not the only distinguishing feature. For example, hotels sell
the same room as different products at different prices, with conditions such as no
refund, advance payment required, full refund, breakfast included, or access to the
executive lounge included.
• The owner of a health club wants to know whether the profits will increase if he
sets different prices at different times and for different customers.
• A car manufacturer bidding on supply of a fleet of cars would like to know how
to bid for a contract based on past bid information, current competition, and other
factors to maximize expected profitability.
• A retail chain needs to decide when and how much to discount prices for a fashion
good during a selling season to maximize expected revenue.
• In a downtown hotel, business travelers book closer to the date of stay than leisure
travelers. Leisure travelers are more price sensitive than business travelers. The
hotel manager has to decide how many rooms to save for business travelers.
• A hotel manager has to determine how to price a single-day stay vs. a multiple-
day stay.
• A car rental agency has to decide whether it is profitable to transport cars from
one location to another in anticipation of demand surge.
• A basketball franchise wants to explore differential pricing. It wants to evaluate
whether charging different prices for different days, different teams, and different
times of the day will increase revenue.
• How does the freedom to name your own price (invented by Priceline) work?
The analytics professional will recognize the opportunity to employ almost
every tool in the analytics toolkit to solve these problems. First, data is necessary
at the right granularity and from different sources including points of sales and
reservation systems, surveys, and social media chatter. Information is also required
on competitive offerings and prices. Data has to be gathered not only about sales
but also no-shows and cancellations. Bookings are often made in groups, and
these bookings have their own characteristics to record. Second, these data have to
be organized in a form that reveals patterns and trends, such that revenue managers,
product managers, and operations managers can coordinate their actions in response
to changes in demand and supply. Third, demand has to be forecast well into the future and
at every market level. Some recent systems claim to even predict demand at a
granularity of a single customer. The optimal RM solutions of prices and product
availability have to be made available in an acceptable format to sales persons,
agents, auction houses, etc. Thus, RM requires information, systems, technologies,
and training, as well as disciplined action to succeed.
In the rest of the chapter, we provide a glimpse into the more commonly used
RM techniques. These include capacity control, overbooking, dynamic pricing,
forecasting for RM, processes used in RM, and network RM. We conclude with
suggestions for further reading.
2 Theory
Let p represent price (of a single product) and D(p) the demand at that price
(assuming all other features are held the same). Revenue optimization is to find
the price p that maximizes R(p) = pD(p), and profit optimization is to maximize
(p − c)D(p) when c is the cost of producing one unit.
D(p) is called the demand function, and it is natural to assume that it decreases
as we increase price. It is also customary to assume it has some functional form, say
D(p) = a − bp or D(p) = ap^b, where a and b are the parameters of the model that
we estimate based on observed data.
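As a minimal sketch of how a and b might be calibrated for the linear form, the snippet below fits D(p) = a − bp by ordinary least squares. The price and sales figures are made up for illustration, and the helper name fit_linear_demand is hypothetical; neither comes from the chapter.

```python
def fit_linear_demand(prices, quantities):
    """Least-squares fit of D(p) = a - b*p; returns the pair (a, b)."""
    n = len(prices)
    mean_p = sum(prices) / n
    mean_q = sum(quantities) / n
    sxx = sum((p - mean_p) ** 2 for p in prices)
    sxy = sum((p - mean_p) * (q - mean_q) for p, q in zip(prices, quantities))
    slope = sxy / sxx              # negative when demand falls as price rises
    a = mean_q - slope * mean_p    # intercept of the fitted line
    return a, -slope               # report b as a positive sensitivity


# Hypothetical observations (illustrative only): price charged and units sold.
prices = [600.0, 700.0, 800.0, 900.0, 1000.0]
units_sold = [23.42, 20.92, 19.22, 16.82, 15.22]

a_hat, b_hat = fit_linear_demand(prices, units_sold)
print(f"a = {a_hat:.2f}, b = {b_hat:.4f}")
```

In practice one would also check the fit quality and whether a linear form is reasonable over the observed price range before using the estimates for pricing.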
Example: Say, based on data, we estimate that demand for a certain product is D(p) =
35.12 − 0.02p (i.e., demand is assumed to have a linear form D(p) = a − bp, where we
calibrated a = 35.12 and b = 0.02). The revenue optimization problem is to maximize
p × (35.12 − 0.02p). From calculus (take the derivative of the revenue function and set
it to 0, so 35.12 − 2 × 0.02p = 0, and solve it for p), we obtain the optimal price to be
p* = 35.12/(2 × 0.02) = 878.
Capacity restrictions introduce some complications, but, at least for the single
product case, are still easy to handle. For instance, in the above example, if price is
$878, the demand estimate is 35.12 − 0.02 × 878 = 17.56. If, however, we have
only ten units, it is natural to raise the price so that demand is exactly equal to 10,
which can be found by solving 10 = 35.12 − 0.02p or p = 1256.
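The two calculations above can be sketched in a few lines of Python. This is only an illustration under the linear demand assumption D(p) = a − bp; the function names are hypothetical, and the capacity argument is measured in the same units as demand.

```python
def revenue_maximizing_price(a, b):
    """Unconstrained optimum of p * (a - b*p): set a - 2*b*p = 0."""
    return a / (2 * b)


def capacity_clearing_price(a, b, capacity):
    """Price at which demand a - b*p equals the available capacity."""
    return (a - capacity) / b


def best_price(a, b, capacity=None):
    """Use the unconstrained optimum unless demand there exceeds capacity."""
    p_star = revenue_maximizing_price(a, b)
    if capacity is None or a - b * p_star <= capacity:
        return p_star
    # Capacity binds: raise the price until demand just matches capacity.
    return capacity_clearing_price(a, b, capacity)


unconstrained = best_price(35.12, 0.02)            # the example's a and b
constrained = best_price(35.12, 0.02, capacity=10)
print(round(unconstrained, 2), round(constrained, 2))
```

Raising the price to the clearing level is the right choice here because revenue p × D(p) is concave: below the clearing price all ten units sell anyway, so a higher price only adds revenue.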
In this section, we look at the control of the sale of inventory when customers
belong to different types or, using marketing terminology, segments. The segments
are assumed to have different willingness to pay and also different preferences as
to when and how they purchase. For example, a business customer for an airline
may prefer to purchase close to the departure date, while a leisure customer plans
well ahead and would like a guaranteed flight reservation. The original motivation
of revenue management was an attempt to make sure that we set aside enough
inventory for the late-coming, higher-paying business customer, yet continue selling
at a cheaper price to the price-sensitive leisure segment.
We assume that we created products with sale restrictions (such as advance
purchase required or no cancellations or weekend stay), and we label each one of
these products as booking classes, or simply classes. All the products share the same
physical inventory (such as the rooms of the hotel or seats on a flight). In practice,
multiple RM products may be grouped into classes for operational convenience or
control system limitations. If such is the case, the price attached to a class is some
approximation or average of the products in that class.
From now on, we assume that each booking request is for a single unit of
inventory.
We begin with the simplest customer behavior assumption, the independent class
assumption: Each segment is identified with a single product (that has a fixed price),
and customers purchase only that product. And if that product is not available for
sale, then they do not purchase anything. Since segments are identified one-to-one
with classes, we can label them as class 1 customers, class 2 customers, etc.
The goal of the optimization model is to find booking limits (the maximum
number of units of the shared inventory we are willing to sell to each class) that
maximize revenue.
Let’s first consider the two-class model, where class 1 has a higher price than
class 2, that is, f1 > f2 , and class 2 bookings come first. The problem would
be trivial if the higher-paying customers come first, so the heart of the problem
is to decide a “protection level” for the later higher-paying ones and, alternately, a
“booking limit” on when to stop sales to the lower-paying class 2 customers.
Say we have an inventory of r0 . We first make forecasts of the demand for each
class, say based on historic demand, and represent the demand forecasts by Dj , j =
1, 2.
How many units of inventory should the firm protect for the later-arriving, but
higher-value, class 1 customers? The firm has only a probabilistic idea of the class
1 demand (the problem would once more be trivial if it knew this demand with
certainty).
The firm has to decide whether to protect r units for the late-arriving class 1
customers. It will sell the rth unit to a class 1 customer if and only if D1 ≥ r, so the
expected marginal revenue from the rth unit is f1 P (D1 ≥ r). Intuitively, the firm
ought to accept a class 2 request if and only if f2 exceeds this marginal value or,
equivalently, if and only if
$$f_2 \ge f_1 P(D_1 \ge r).$$
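The marginal comparison above leads to a simple calculation of the protection level, often called Littlewood's rule: keep protecting units for class 1 until f1 P(D1 > r) falls to f2. A minimal sketch follows, assuming class 1 demand is normally distributed (an assumption made here purely for illustration) and using hypothetical fares and forecasts; the function names are likewise illustrative.

```python
from statistics import NormalDist


def littlewood_protection_level(f1, f2, mean1, sd1):
    """Protection level r solving P(D1 > r) = f2 / f1 for normal class 1 demand."""
    if not f1 > f2 > 0:
        raise ValueError("expected f1 > f2 > 0")
    # P(D1 > r) = f2/f1  is the same as  P(D1 <= r) = 1 - f2/f1.
    return NormalDist(mean1, sd1).inv_cdf(1 - f2 / f1)


def class2_booking_limit(capacity, protection_level):
    """Units that may be sold to class 2 before sales to that class are closed."""
    return max(0.0, capacity - protection_level)


# Hypothetical inputs: fares of 200 and 120, class 1 demand ~ Normal(40, 12).
protect = littlewood_protection_level(f1=200.0, f2=120.0, mean1=40.0, sd1=12.0)
limit = class2_booking_limit(capacity=100, protection_level=protect)
print(round(protect, 1), round(limit, 1))
```

The result is generally not an integer, so a rounding convention has to be chosen when it is loaded into a reservation system.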
In practice, there are usually many products and segments, so consider n > 2
classes. We continue with the independent class assumption and assume that demand for
the n classes arrives in n stages, one for each class, in order of revenue, with the
highest-paying segment, class 1, arriving closest to the inventory usage time.
Let the classes be indexed so that f1 > f2 > · · · > fn . Hence, class n (the lowest
price) demand arrives in the first stage (stage n), followed by class n − 1 demand in
stage n − 1, and so on, with the highest-price class (class 1) arriving in the last stage
(stage 1). Since there is a one-to-one correspondence between stages and classes,
we index both by j.
We describe now a heuristic method called the expected marginal seat revenue
(EMSR) method. This heuristic method is used because solving the n class problem
optimally is complicated. The heuristic method works as follows:
Consider stage j + 1, in which the firm wants to determine protection level $r_j$ for
class j. Define the aggregated future demand for classes j, j − 1, . . . , 1 by
$$S_j = \sum_{k=1}^{j} D_k,$$
and let the weighted-average revenue (this is the heuristic part) from classes
1, . . . , j, denoted $\bar{f}_j$, be defined by
$$\bar{f}_j = \frac{\sum_{k=1}^{j} f_k E[D_k]}{\sum_{k=1}^{j} E[D_k]}. \tag{23.3}$$
The protection level $r_j$ is then chosen so that
$$P(S_j > r_j) = \frac{f_{j+1}}{\bar{f}_j}. \tag{23.4}$$
If the class demands are taken to be independent and normally distributed, this gives
$$r_j = \mu + z\sigma,$$
where $\mu = \sum_{k=1}^{j} \mu_k$ is the mean and $\sigma^2 = \sum_{k=1}^{j} \sigma_k^2$ is the variance of the
aggregated demand to come at stage j + 1, $z = \Phi^{-1}(1 - f_{j+1}/\bar{f}_j)$, and $\Phi^{-1}(\cdot)$
is the inverse of the standard normal c.d.f. One repeats this calculation for each j.
The EMSR heuristic method is very popular in practice as it is very simple to
program and is robust with acceptable performance (Belobaba 1989). One can do
the calculation easily enough using Excel, as it has built-in functions for the normal
distribution and its inverse.
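The same calculation is equally short in Python. The sketch below applies (23.3) and (23.4) under the assumption of independent, normally distributed class demands; the function name and the sample fares and forecasts are illustrative assumptions, not values from the chapter.

```python
from math import sqrt
from statistics import NormalDist


def emsr_protection_levels(fares, means, sds):
    """Protection levels [r_1, ..., r_{n-1}] for fares sorted f1 > f2 > ... > fn.

    means[k] and sds[k] are the forecast mean and standard deviation of the
    demand for class k+1 (lists are 0-based, classes are 1-based).
    """
    levels = []
    for j in range(1, len(fares)):
        mu = sum(means[:j])                          # mean of S_j
        sigma = sqrt(sum(s * s for s in sds[:j]))    # std. dev. of S_j
        # Demand-weighted average fare of classes 1..j, as in (23.3).
        f_bar = sum(f * m for f, m in zip(fares[:j], means[:j])) / mu
        # fares[j] is f_{j+1}; solve P(S_j > r_j) = f_{j+1} / f_bar, as in (23.4).
        z = NormalDist().inv_cdf(1 - fares[j] / f_bar)
        levels.append(mu + z * sigma)
    return levels


# Hypothetical three-class example.
levels = emsr_protection_levels(
    fares=[500.0, 300.0, 200.0],
    means=[20.0, 40.0, 60.0],
    sds=[5.0, 10.0, 15.0],
)
print([round(r, 1) for r in levels])
```

Each entry is the number of units to hold back from the next cheaper class; subtracting it from the remaining capacity gives the booking limit for that class.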
2.3 Overbooking
There are many industries where customers first reserve the service and then use it
later. Some examples are hotels, airlines, restaurants, and rental cars. Now, when
a customer reserves something for future use, their plans might change in the
meantime. A cancellation is when the customer explicitly cancels the reservation,
and a no-show is when they do not notify the firm but just do not show up at the
scheduled time for the service. What the customer does depends on the reservation
policies. If there is an incentive like partial or full refund of the amount, customers
tend to cancel. If there is no refund, they will just opt for no-show.
Overbooking is a practice of taking more bookings than capacity, anticipating
that a certain fraction of the customers would cancel or opt for no-show. This leads
to better capacity utilization, especially on high-demand days when the marginal
value of each unit of inventory is very high. The firm, however, has to be prepared to
handle a certain number of customers who are denied service even though they have
paid for the product and have a reservation contract. In many industries, the benefits
of better capacity utilization dominate the risk of denying service, and overbooking
has become a common practice.
Firms control overbooking by setting an upper limit on how much they overbook
(called the overbooking limit). Typically as they come closer to the inventory usage
time (say a flight departing), they have a better picture of demand, and they reduce
the risk of overbooking if there appears to be high demand with few cancellations.
Figure 23.1 shows a typical evolution of the overbooking limit. As
the usage date nears and the firm fears that it might end up denying service to some
customers, it brings the overbooking limit down toward physical capacity faster (the
limit can even be set below the current number of reservations on hand, to prevent new
bookings).
Overbooking represents a trade-off: If the firm sells too many reservations
above its capacity, it risks a scenario where more customers show up than there
is inventory and the resulting costs in customer goodwill and compensation. If
it does not overbook enough, it risks unsold inventory and an opportunity cost.
Overbooking models are used to find the optimal balance between these two factors.
We describe one such calculation below that, while not completely taking all factors
into consideration, highlights this trade-off mathematically. It is reminiscent of the
classical newsvendor model from operations management.
Let CDB denote the cost of a denied boarding, that is, the estimated cost of
denying service to a customer who has a reservation (which, as we mentioned
earlier, includes compensation, loss of goodwill, etc.). Let Cu denote the opportunity
cost of underused capacity, typically taken as the expected revenue for a unit of
inventory. The overbooking limit we have to decide then is θ > C, where C is the
physical capacity.
For simplicity, we assume the worst case, namely that demand will exceed the overbooking
limit; that is, we will be conservative in setting our limit. Let N be the number of
no-shows/cancellations. Since we are not sure of the number of cancellations or
no-shows, we model it as a random variable, say as a binomial random variable with
parameters θ and p, where p is the probability of a cancellation or no-show. Then, the
number of customers who actually show up is given by θ − N (recall that demand is
conservatively assumed to always reach θ).
Next, we pose the problem as the following marginal decision: Should we stay
at the current limit θ or increase the limit to θ + 1, continuing the assumption that
demand is high and will also exceed θ + 1? Two mutually exclusive events can
happen: (1) θ − N < C. In this case, by moving the limit up by 1, we would increase
our profit by Cu or, in terms of cost, incur −Cu. (2) θ − N ≥ C, and we incur a cost of CDB.
So the expected cost per unit increase of θ is
$$-C_u \Pr(\theta - N < C) + C_{DB} \Pr(\theta - N \ge C).$$
We keep raising θ as long as this expected cost is negative. Setting it to zero and using
Pr(θ − N < C) = 1 − Pr(θ − N ≥ C) gives
$$\Pr(\theta - N \ge C) = \frac{C_u}{C_u + C_{DB}}.$$
If we let S(θ) be the number of people who show up, an alternate view is that
we need to set θ such that Pr(S(θ) ≥ C) = Cu/(Cu + CDB). If no-shows happen with
probability p, shows also follow a binomial distribution with probability 1 − p. So
set θ such that (writing in terms of ≤ to suit Excel calculations)
$$\Pr(S(\theta) \le C) = 1 - \frac{C_u}{C_u + C_{DB}}.$$
Example (Fig. 23.2): Say Cu = $150 and CDB = $350. To calculate the
overbooking limit, we first calculate the critical ratio:
\[ \frac{C_u}{C_u + C_{DB}} = \frac{150}{500} = 0.3. \]
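The marginal rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the capacity of 100 seats and the no-show probability of 0.10 are hypothetical values (the inputs of Fig. 23.2 are not reproduced here), and the function names are our own. The sketch searches for the smallest limit θ at which Pr(S(θ) ≥ C) reaches the critical ratio, with S(θ) binomial with parameters θ and 1 − p.

```python
from math import comb


def prob_shows_at_least(theta, capacity, p_no_show):
    """Pr(S(theta) >= capacity), where S(theta) ~ Binomial(theta, 1 - p_no_show)."""
    q = 1.0 - p_no_show  # probability that a booked customer shows up
    return sum(
        comb(theta, k) * q**k * p_no_show ** (theta - k)
        for k in range(capacity, theta + 1)
    )


def overbooking_limit(capacity, p_no_show, c_u, c_db, max_extra=1000):
    """Smallest theta >= capacity with Pr(S(theta) >= capacity) >= Cu / (Cu + CDB)."""
    ratio = c_u / (c_u + c_db)
    for theta in range(capacity, capacity + max_extra + 1):
        if prob_shows_at_least(theta, capacity, p_no_show) >= ratio:
            return theta
    raise ValueError("no limit found within max_extra; check p_no_show")


CAPACITY = 100    # hypothetical physical capacity C
P_NO_SHOW = 0.10  # hypothetical no-show/cancellation probability p

limit = overbooking_limit(CAPACITY, P_NO_SHOW, c_u=150, c_db=350)
print("overbooking limit:", limit)
print("Pr(shows >= C) at the limit:",
      round(prob_shows_at_least(limit, CAPACITY, P_NO_SHOW), 3))
```

Because Pr(S(θ) ≥ C) increases with θ, the first θ that reaches the ratio is the point where the expected marginal cost stops being negative. A spreadsheet version based on Pr(S(θ) ≤ C) may differ by one seat because of the boundary case S(θ) = C.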
Over the last few years, dynamic pricing has taken on three distinct flavors:
surge pricing, as practiced by Uber, Lyft, and utility companies; repricing, or
competition-based pricing, as practiced by sellers on Amazon marketplace; and,
finally, markdown or markup pricing, where prices are gradually decreased (as in
fashion retail) or increased (as in low-cost airlines) as a deadline approaches.
Surge pricing is the newest and perhaps most controversial of dynamic pricing practices. The
underlying economic reason is sound and reasonable. When there is more demand
than supply, the price has to be increased to clear the market—essentially the good or
service is allocated to those who value it the most. When asked to judge the fairness
of this, most consumers do not have a problem with this principle, for instance, few
consider auctions to be unfair.
However, when applied to common daily items or services, many consumers
turn indignant. This is due to many reasons: (1) There is no transparency on how
the prices move to balance demand and supply. (2) As the prices rise when a large
number of people are in great need of it, they are left with a feeling of being
price-gouged when they need the service most. (3) The item or service is essential or
life-saving, such as pharmaceuticals or ambulance service. Uber was a pioneer in
introducing surge pricing into an industry used to a regulated fixed-price system
(precisely to bring transparency and prevent price-gouging and also to avoid the
hassle of bargaining). While initial reactions2 have been predictable, it has, in
a space of a few years, become a fact of life. This shows the importance of a
firm believing in the economic rationale of dynamic pricing and sticking to the
practice despite public resistance. Of course, consumers should find value in the
service itself: because prices are lower than alternatives (such as regular taxi service)
during off-peak times, consumers eventually come to accept the importance and necessity of
dynamic pricing.
A second phenomenon that has recently taken hold is “repricing,” used in
e-commerce marketplaces such as Amazon.com. It is essentially dynamic pricing
driven by competition.
Many e-commerce platforms sell branded goods that are identical to what other
sellers are selling. The seller’s role in the entire supply chain is little more than
stocking and shipping as warranties are handled by the manufacturer. Service does
play a role, but many of the sellers have similar reviews and ratings, and often price
is the main motivation of the customer for choosing one seller over the other, as the
e-commerce platform removes all search costs.
Prices however fluctuate, the reasons often being mysterious. Some possible
explanations are the firms’ beliefs about their own attractiveness (in terms of ratings,
reviews, and trust) compared to others and their inventory positions—a firm with
low inventories may want to slow down sales by pricing higher. Another possible
reason is that firms assess the profiles of customers who shop at different times
of day and days of the week. A person shopping late at night is definitely not
comparison-shopping from within a physical store, so the set of competitors is
reduced.
Note that an e-commerce site would normally have more variables to price on,
such as location and the past customer profile, but a seller in a marketplace such as
Amazon or eBay has only limited information and has to put together competitor
information from scraping the website or from external sources.
Repricing refers to automated tools, often simply rule-based, that change the price
in response to competitor moves or inventory levels. Examples of such rules are given in the
exercise at the end of this chapter.
Markdown pricing is a common tactic in grocery stores and fashion retailing. Here,
the product value itself deteriorates, either physically because of limited shelf-
life or as in the fashion and consumer electronics industry, as fresh collections
or products are introduced. Markdowns in fashion have a somewhat different
motivation from the markdowns of fresh produce. In fashion retail, the products
cannot be replenished within the season because of long sales cycles, while for fresh
groceries, the sell-by date reduces the value of the product because fresher items are
introduced alongside.
Markdown pricing, as the name indicates, starts off with an initial price and then,
gradually at various points during the product life cycle, reduces the price. The
price reductions are often in the form of 10% off, 20% off, etc., and sometimes
coordinated with advertised sales. At the end of the season or at a prescribed date,
the product is taken off the shelf and sold through secondary channels at a steeply
discounted price, sometimes even below cost. This final price is called the salvage
value.
The operational decisions are how much to discount and when. There are various
restrictions and business rules one has to respect in marking down, the most common
one being once discounted, we cannot go back up in price (this is what distinguishes
markdown pricing from promotions). Others limit the amount and quantities of
discounting.
The trade-off involved is similar to what is faced by a newsvendor: Discounting
too late would lead to excess inventory that has to be disposed of, and discounting too
soon would mean we sell off inventory early and potentially face a stock-out.
In contrast to markdown pricing where prices go down, there are some industries
that practice dynamic pricing with prices going up as a deadline approaches. Here,
the value of the product does go down for the firm, but the customer mix may be
changing when customers with higher valuations arrive closer to the deadline (either
the type and mix of customers might be changing or even for the same customer,
their uncertainty about the product may be resolved).
For reasons of practicality, we try to keep models of demand simple. After all,
elaborate behavioral models of demand would be useless if we could not calibrate them
from data and optimize based on them. In any case, more complicated models do
not necessarily mean they predict the demand better, and they often are harder to
manage and control.
In this section, we concentrate on three simple models of how demand is
explained as a function of price. All three are based on the idea of a potential
population that is considering purchase of our product. The size of this population is
M. Note that M can vary by day or day of week or time of day. Out of this potential
market, a certain fraction purchase the product. We model the fraction as a function
of price, and possibly other attributes as well.
Let D(p) represent demand as a function of price p.
• In the additive model of demand,
\[ D(p) = M(a + bp), \]
where a and b are parameters that we estimate from data. If there are multiple
products, demand for one can affect the other. We can model demand in the
presence of multiple products as
\[ D(p_i) = M\Bigl(a_i + b_i p_i + \sum_{j \ne i} b_{ij} p_j\Bigr). \]
That is, demand for product i is a function of not just the price of i but also the
prices of the other products, p_j, j ≠ i. The parameters a_i, b_i, and b_ij are to be
estimated from data.
This model can lead to problems at the extremes as there is no guarantee that
the fraction is between 0 and 1.
• In the multiplicative model of demand,
\[ D(p) = M(a p^{b}). \]
• In the choice model of demand (a multinomial logit form),
\[ D(p_i) = M \, \frac{e^{a_i + b_i p_i}}{1 + \sum_{j} e^{a_j + b_j p_j}}, \]
where e stands for the base of the natural logarithm. Note that this model has far
fewer parameters than either the additive or multiplicative model and naturally
limits the fraction to always lie between 0 and 1! This is the great advantage of
this model.
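The three demand forms can be compared with a short Python sketch. All parameter values below (market size, intercepts, price sensitivities, prices) are hypothetical and chosen only to illustrate the behavior of each form, in particular how the additive model can leave the range between 0 and 1 while the logit form cannot.

```python
from math import exp


def additive_demand(M, a, b, p):
    """Additive model: D(p) = M (a + b p)."""
    return M * (a + b * p)


def multiplicative_demand(M, a, b, p):
    """Multiplicative model: D(p) = M (a p^b)."""
    return M * a * p**b


def mnl_demand(M, a, b, prices):
    """Logit choice model: demand for each product given all prices.

    a, b and prices are equal-length lists, one entry per product.
    The '1' in the denominator represents the no-purchase option.
    """
    weights = [exp(a_j + b_j * p_j) for a_j, b_j, p_j in zip(a, b, prices)]
    denom = 1.0 + sum(weights)
    return [M * w / denom for w in weights]


M = 1000  # hypothetical potential market size

# Additive: fine at p = 100, but the "fraction" turns negative at p = 200.
print(additive_demand(M, a=0.8, b=-0.005, p=100))
print(additive_demand(M, a=0.8, b=-0.005, p=200))

# Multiplicative: the exponent b plays the role of a price elasticity.
print(multiplicative_demand(M, a=50, b=-1.0, p=100))

# Logit: the purchase fractions are always between 0 and 1 and sum to less than 1.
print(mnl_demand(M, a=[1.0, 0.5], b=[-0.01, -0.01], prices=[100, 120]))
```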
We show in the exercises how these models can be used for price optimization.
The case study on airline choice modeling (see Chap. 26), has a detailed exercise on
estimation and use of choice models for price optimization and product design.
Fixing prices for each product aimed at a segment, as outlined in Sect. 2.2.1, and
controlling how much is sold at each price requires that we monitor how many
bookings have been taken for each product and close sales at that price once
we have sold enough.
So the sequence is (1) forecasting the demand for each RM product for a specific
day and then (2) optimizing the controls given the forecasts and (3) controlling real-
time sales for each product so they do not exceed the booking limits for that product.
We list below the main control forms used in RM industries. Because of the
limitations of distribution systems that were designed many years ago, the output of
our optimization step has to conform to these forms.
• Nested allocations or booking limits: All the RM products that share inventory are
first ranked in some order, usually by their price.3 Then, the remaining capacity
is allocated to these classes, but the allocations are “nested,” so the higher class
has access to all the inventory allocated to a lower class. For example, if there
are 100 seats left for sale and there are two classes, Y and B, with Y considered
a higher class, then an example of a nested allocation would be Y100 B54. If
100 Y customers were to arrive, the controls would allow sale to all
of them. If 60 B customers were to show up, only 54 would be able to purchase.
B is said to have an allocation or a booking limit of 54. Another terminology that
is used is (nested) protections: Y is said to have (in this example) a protection of
46 seats.
The allocations are posted on a central reservation system and updated
periodically (usually overnight). After each booking, the reservation system
updates the limits. In the above example, if a B booking comes in, then (as the
firm can sell up to 54 seats to B) it is accepted, so the remaining capacity is 99,
and the new booking limits are Y99 B53. Suppose instead that a Y booking comes in
and is accepted; there are a couple of ways the firm can update the limits: Y99 B54 or
Y99 B53. The former is called standard nesting and the latter theft nesting.
3 As we mentioned earlier, it is common to group different products under one “class” and take an
average price for the class. In the airline industry, for instance, the products, each with a fare basis
code (such as BXP21), are grouped into fare classes (represented by a letter, Y, B, etc.).
• Bid prices: For historic reasons, most airline and hotel RM systems work with
nested allocations, as many global distribution systems (such as Amadeus or
Sabre) were structured this way. Many of these systems allow for a small number
of limits (10–26), so when the number of RM products exceeds this number, they
somehow have to be grouped to conform to the number allowed by the system.
The booking limit control is perfectly adequate when controlling a single
resource (such as a single flight leg) independently (independent of other
connecting flights, for instance), but we encounter its limitations when the
number of products using that resource increases, say to more than the size of
inventory. Consider network RM, where the products are itineraries, and there
could be many itineraries that use a resource (a flight leg)—the grouping of
the products a priori gets complicated and messy (although it has been tried,
sometimes called virtual nesting).
A more natural and appropriate form of control, especially for network RM,
is a threshold-price form of control called bid price control. Every resource has
a non-negative number called a bid price associated with it. A product that uses
a combination of resources is sold if the price of the product exceeds the sum of
the bid prices of the resources that the product uses. The bid prices are frequently
updated as new information comes in or as the inventory is sold off. The next
section illustrates the computation of bid prices.
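The nested booking-limit updates described in the first bullet can be sketched as follows. This is a minimal illustration, not a description of how any reservation system is implemented: the function name `book` and the list-of-pairs representation are our own, and the rule used for more than two classes (lower-class limits are capped at the booked class's new limit under standard nesting) is an assumption that reduces to the Y/B example in the text.

```python
def book(limits, fare_class, nesting="standard"):
    """Try to accept one booking and return (updated_limits, accepted).

    limits: list of (class_name, nested_booking_limit), highest class first.
    nesting: "standard" leaves lower-class limits alone where possible;
             "theft" also reduces every lower-class limit by one.
    """
    names = [name for name, _ in limits]
    idx = names.index(fare_class)
    if limits[idx][1] <= 0:
        return list(limits), False  # class is closed

    booked_limit = limits[idx][1] - 1
    updated = []
    for k, (name, limit) in enumerate(limits):
        if k <= idx:
            new_limit = limit - 1  # booked class and all higher classes lose a seat
        elif nesting == "theft":
            new_limit = max(limit - 1, 0)
        else:
            new_limit = min(limit, booked_limit)  # keep limits nested
        updated.append((name, new_limit))
    return updated, True


start = [("Y", 100), ("B", 54)]
print(book(start, "B"))              # B booking: limits become Y99 B53
print(book(start, "Y", "standard"))  # Y booking, standard nesting: Y99 B54
print(book(start, "Y", "theft"))     # Y booking, theft nesting: Y99 B53

# 60 B customers arrive one after another: only 54 are accepted.
limits, accepted = start, 0
for _ in range(60):
    limits, ok = book(limits, "B")
    accepted += ok
print(accepted, limits)
```

In the last loop, Y is left with 46 seats, which is exactly its protection in the example.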
The Need for Network Revenue Management: In many situations, the firm has to
compute the impact of a pricing, product offering, or capacity management decision
on an entire network of resources. Consider the case of an airline that offers flights
from many origins and to many destinations. In this case, passengers who are flying
to different origin-destination (OD) pairs might use the same flight leg.
Say an airline uses Chicago as a hub. It offers itineraries from the East Coast
of the USA, such as New York and Boston, to cities in the West, such as LA and
San Francisco. Passengers who fly from New York to Chicago include those who
travel directly to Chicago and also those traveling via Chicago to LA, San Francisco,
etc. The firm cannot treat the flight booking on the New York to Chicago flight
independently but has to consider the impact on the rest of the network as it reduces
the capacity for future customers wishing to travel to LA and San Francisco.
Similarly, there are inter-temporal interactions when we consider multi-night-
stay problems, for example, a car or a hotel room when rented over multiple days.
Hence, the Monday car rental problem impacts the Tuesday car rental problem
and so forth. Other examples of network revenue management include cargo
bookings that consume capacity on more than one OD pair or television advertising
campaigns that use advertisement slots over multiple shows and days.
Suboptimality of Managing a Network One Resource at a Time: It is easy to
demonstrate that it is suboptimal to manage each resource separately. Consider a
situation in which the decision maker knows that the flight from city A to B, with
posted fare of 100, will be relatively empty, whereas the connecting flight from B to
C, fare of 200, will be rather full. Some passengers want to just travel from A to B,
and there are others who want to fly from A to C. Both use the AB leg. In this case,
what will be the order of preference for booking a passenger from A to B vis-a-vis
one who wants to travel from A to C and pays 275? Intuitively, we would remove
the 200 from 275 and value the worth of this passenger to the airline on the AB
leg as only 75. Therefore, total revenue might not be a good indicator of value. In
this example, allocation of the 275 according to the distance between A–B and B–C
might also be incorrect if, for example, the distances are equal. Allocations based
on utilization or the price of some median-fare class would also be inappropriate.
Therefore, any formula that allocates the value to different legs of the itinerary has
to consider both the profitability of each leg and the probability of filling the seat.
An Example to Illustrate an Inductive Approach: Consider a simple example in
which, as above, there is a flight from city A to city B and a connecting flight from
B to C. The single-leg fares are 200 and 200, whereas the through fare from A to C
is 360. There is exactly one seat left on each flight leg. Assume, as is done typically
in setting up the optimization problem, time is discrete. It is numbered backward
so that time n indicates that there are n time periods left before the first flight takes
place. Also, the probability of more than one customer arrival in a time period is
assumed to be negligible. Thus, either no customer arrives or one customer arrives.
We are given there are three time periods left to go. In each period, the probability
of a customer who wants to travel from A to B is 0.2, from B to C is 0.2, and from
A to C is 0.45; thus, there is a probability of zero arrivals equal to 0.15. In this
example, the arrival probabilities are the same in each period. It is easy to change
the percentages over time. What should be the airline’s booking policy with one seat
left on each flight?
This problem is best solved through backward induction. Define the state of the
system as (n, i, j ) where n is the period and i and j the numbers of unsold seats on
legs AB and BC.
Consider the state (1, 1, 1). In this state, in the last period, the optimal decision
is to sell to whichever customer arrives. The expected payoff is 0.4 × 200 +
0.45×360 = 242. We write the value in this state as V (1, 1, 1) = 242. The expected
payoff in either state (1, 0, 1) or (1, 1, 0) is 0.2 × 200 = 40. We write V (1, 0, 1) =
V (1, 1, 0) = 40. For completeness, we can write V (n, 0, 0) = 0.
When there are two periods to go, the decision is whether to sell to a customer
or wait. Consider the state (2, 1, 1) and the value of being in this state, V (2, 1, 1).
Obviously, it is optimal to sell to an AC customer. Some calculations are necessary
for whether we should sell to an AB or BC customer:
If an AB customer arrives: If we sell, we get 200 + V (1, 0, 1) (from selling the
seat to a BC customer if they arrive in the last period) = 240. Waiting fetches
V (1, 1, 1) = 242. Therefore, it is best to not sell.
If a BC customer arrives: Similar to the case above, it is better to wait.
If an AC customer arrives: Sell. We get 360.
Thus, V (2, 1, 1) = 0.4 × 242 + 0.45 × 360 + 0.15 × V (1, 1, 1) = 295.1.
We can compute V (2, 1, 0)(= V (2, 0, 1)) = 0.2 × 200 + 0.8 × V (1, 1, 0) = 72.
In period 3, in the state (3, 1, 1), it is optimal to sell if an AC customer arrives.
If an AB (or BC) customer arrives, by selling we get 200 + V (2, 0, 1)(or
V (2, 1, 0)) = 272. This is smaller than V (2, 1, 1). Therefore, it is better to wait.
This completes the analysis.
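The backward induction above can be reproduced with a short recursive sketch in Python. The fares and arrival probabilities are those of the example; the function and variable names are our own, and memoization via `lru_cache` is simply one convenient way to avoid recomputing states.

```python
from functools import lru_cache

FARES = {"AB": 200, "BC": 200, "AC": 360}
ARRIVAL_PROB = {"AB": 0.2, "BC": 0.2, "AC": 0.45}       # remaining 0.15: no arrival
SEATS_USED = {"AB": (1, 0), "BC": (0, 1), "AC": (1, 1)}  # seats used on (AB, BC)


@lru_cache(maxsize=None)
def value(n, i, j):
    """Expected revenue V(n, i, j): n periods to go, i and j unsold seats on AB and BC."""
    if n == 0:
        return 0.0
    wait = value(n - 1, i, j)  # value of not selling in this period
    total = (1.0 - sum(ARRIVAL_PROB.values())) * wait
    for itinerary, prob in ARRIVAL_PROB.items():
        di, dj = SEATS_USED[itinerary]
        best = wait
        if i >= di and j >= dj:  # enough seats to sell this itinerary
            sell = FARES[itinerary] + value(n - 1, i - di, j - dj)
            best = max(wait, sell)
        total += prob * best
    return total


def should_sell(n, i, j, itinerary):
    """True if accepting the request beats waiting in state (n, i, j)."""
    di, dj = SEATS_USED[itinerary]
    if i < di or j < dj:
        return False
    return FARES[itinerary] + value(n - 1, i - di, j - dj) > value(n - 1, i, j)


print(round(value(1, 1, 1), 3))  # 242.0
print(round(value(2, 1, 1), 3))  # 295.1
print(round(value(2, 1, 0), 3))  # 72.0
print(round(value(3, 1, 1), 3))  # 324.305
print(should_sell(3, 1, 1, "AB"), should_sell(3, 1, 1, "AC"))  # False True
```

The same function handles other seat counts and horizons, although the number of states grows quickly with the size of the network.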
The reader can easily generalize to the case when there are different combinations
of unsold seats. For example, having solved entirely for the case when a maximum
of (k, m) seats are left in the last period, one can use backward induction to solve
for the same when there are two periods to go, etc.
The backward induction method is called dynamic programming and can become
quite cumbersome when the network is large and the number of periods left is large.
It is stochastic in nature because of the probabilities. The astute reader might have
noticed that these probabilities can be generated by using an appropriate model of
customer choice that yields the probability of choosing an itinerary when presented
with a set of options.
Bid Price Approach: Recall the bid price control of Sect. 2.6. The operating rule is
to accept an itinerary if its fare exceeds the sum of bid prices on each leg used by
the itinerary, if there is sufficient capacity left. The bid prices can be thought of as
representing the marginal value of the units of capacity remaining.
But how do we calculate these bid prices? Many different heuristic approaches
have been proposed and analyzed, both in the academic and in the practitioner
literature (see, e.g., the references at the end of this chapter). These range from
solving optimization models such as a deterministic linear program (DLP), a
stochastic linear program (SLP), and approximate versions of the dynamic program
(DP) illustrated above to a variety of heuristics. (The usual caveat is that the use
of bid prices in this manner need not result in the optimal expected revenue. Take,
for example, the decision rule that we derived using the dynamic program with three
periods to go and one seat that is available on each flight leg. We need two bid prices
(one per leg) such that each is greater than the fare on the single leg but their sum
is less than the fare on the combined legs. Thus, we need prices b1 and b2 such that
b1 > 200, b2 > 200, b1 + b2 ≤ 360. Such values do not exist.)
In this chapter, we illustrate the DLP approach as it is practical and is used by
hotels and airlines to solve the problem. In order to illustrate the approach, we shall
first use a general notation and then provide a specific example. We are given a set
of products, indexed by i = 1 to I . The set of resources is labeled j = 1 to J . If
product i uses resource j , let aij = 1 else 0. Let the revenue obtained from selling
one unit of product i be Ri . We are given that the demand for product i is Di and
the capacity of resource j is Cj . Here, the demand and revenue are deterministic.
In the context of an airline, the products would be the itineraries, the resources
are the flight legs, the coefficient aij = 1 if itinerary i uses flight leg j else 0, the
capacity would be the unsold number of seats of resource j , and the revenue would
be the fare of product i.
In a hotel that is planning its allocations of rooms for the week ahead, the product
could be a stay that begins on a day and ends on another day, such as check-
in on Monday and checkout on Wednesday. The resource will be a room night.
The capacity will be the number of unsold rooms for each day of the week. The
coefficient aij = 1 if product i requires stay on day j (e.g., a person who stays on
Monday and Tuesday uses one unit of capacity on each room night). The revenue
will be the price charged for the complete stay of product i. For a car rental problem,
replace room night with car rental for each day of the week. Note that it is possible
that two products use the same set of resources but are priced differently. Examples
of these include some combination of room sold with/without breakfast, allowing
or not allowing cancellation, taking payment ahead or at end of stay, etc.
The problem is to decide how many reservations Xi to accept of each product i.
The general optimization problem can be stated as follows (DLP):
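\[
\begin{aligned}
\max_{X} \quad & \sum_{i=1}^{I} R_i X_i \\
\text{s.t.} \quad & \sum_{i} a_{ij} X_i \le C_j, \quad j = 1 \text{ to } J, && (23.5) \\
& X_i \le D_i, \quad i = 1 \text{ to } I, && (23.6) \\
& X_i = 0, 1, 2, \ldots, \quad i = 1 \text{ to } I.
\end{aligned}
\]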
Here, constraints (23.5) make sure we don’t sell more reservations than the capacity
on each flight (on average); constraints (23.6) ensure that the number of reservations
for an itinerary is less than the demand for that itinerary (mean of the demand—
A small hotel is planning its allocation of rooms for the week after the next week.
For the purpose of planning, it assumes that the customers who stay on weekends
belong to a different segment and do not stay over to Monday or check in before
Saturday. It sells several products. Here, we consider the three most popular ones
that are priced on the average at $125, $150, and $200. These rates are applicable,
respectively, if the customer (1) pays up front; (2) provides a credit card and agrees
to a cancellation charge that applies only if the room reservation is cancelled with
less than 1 day to go; or (3) accepts the same terms as (2) and also receives free Internet
and breakfast (which are virtually costless to the hotel). Customers stay for 1, 2, or 3
nights. The demand forecasts and rooms already booked are shown in Table 23.1.
The hotel has a block of 65 rooms to allocate to these products.
In this example, there are 45 products and five resources. Each demand forecast
pertains to a product. The available rooms on each of the 5 days constitute the five
different resources. The Monday 1-night-stay product uses one unit of Monday
capacity. The Monday 2-night-stay product uses one unit each of Monday and
Tuesday room capacity, etc. The rooms available are 65 minus the rooms sold.
There are 45 decision variables in this problem. The screenshots of the data, decision
variables (yellow), and Excel Solver setup are shown in Figs. 23.3 and 23.4.
Solving this problem as a linear program or LP (choose linear and non-negative
in Solver), we obtain the solution shown in Fig. 23.5.
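The structure of the DLP can also be seen on a much smaller instance. The sketch below is not the hotel problem (its demand table is not reproduced here) and does not use an LP solver: it solves a tiny integer instance by plain enumeration, which is only workable for toy sizes. The fares are those of the A–B–C airline example; the capacities and demands are hypothetical.

```python
from itertools import product


def solve_dlp_by_enumeration(revenue, demand, capacity, usage):
    """Enumerate all integer plans X with 0 <= X_i <= D_i and keep the best feasible one.

    usage[i][j] = 1 if product i uses resource j, else 0 (the a_ij of the DLP).
    Returns (best_revenue, best_plan).
    """
    best_value, best_plan = -1, None
    n_products, n_resources = len(revenue), len(capacity)
    for plan in product(*(range(d + 1) for d in demand)):  # demand constraints (23.6)
        feasible = all(
            sum(usage[i][j] * plan[i] for i in range(n_products)) <= capacity[j]
            for j in range(n_resources)
        )  # capacity constraints (23.5)
        if feasible:
            total = sum(r * x for r, x in zip(revenue, plan))
            if total > best_value:
                best_value, best_plan = total, plan
    return best_value, best_plan


# Products: AB, BC, AC itineraries; resources: legs AB and BC.
revenue = [200, 200, 360]          # fares from the example in the text
demand = [2, 2, 2]                 # hypothetical demand forecasts D_i
capacity = [3, 3]                  # hypothetical unsold seats C_j on each leg
usage = [[1, 0], [0, 1], [1, 1]]   # a_ij

print(solve_dlp_by_enumeration(revenue, demand, capacity, usage))
# (1160, (2, 2, 1)): accept both local demands in full and one through passenger
```

For realistic sizes, such as the 45-product hotel problem, enumeration is impractical and an LP solver is the appropriate tool; a solver also reports the shadow prices that the enumeration does not provide.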
The optimal solution is to not accept many bookings in the $125 rate class, except
on Monday and Tuesday. Even some of the demand in the $150 rate class is turned
away on Wednesday and Thursday. One might simply use this solution as a guideline
for the next few days and then re-optimize based on the accepted bookings and
the revised demand forecasts. Two potential opportunities for improvement are as
follows: (1) The solution does not consider the sequence of arrivals, for example,
whether the $125 rate class customer arrives prior to the $150. (2) The solution
does not take into account the stochastic aspect of total demand. These can be
partially remedied by use of the dual prices provided by the sensitivity analysis of
the solution. The sensitivity analysis of the solution to the LP is obtained from any
traditional solver including Excel. The sensitivity analysis of the room capacities is
given in Table 23.2.
There is one shadow price per resource and day of stay. This can be used as the
bid price for a room for that day. For example, if a customer were willing to pay
$225 for a 2-night stay beginning Monday, we would reject that offer because the
price is less than the sum of the bid prices for Monday and Tuesday (100 + 150),
whereas the hotel should accept any customer who is willing to pay for a 1-night
stay on Monday or Friday if the rate exceeds $100. One might publish what rate
classes are open based on this logic as shown in Table 23.3.
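The accept/reject logic can be written as a small Python check. The Monday, Tuesday, and Friday bid prices are the ones quoted in the text; the Wednesday and Thursday values of 150 are an assumption, inferred from the arithmetic of Table 23.5 rather than read from Table 23.2. The helper names are our own.

```python
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri"]

# Mon, Tue, Fri are quoted in the text; Wed and Thu are inferred (assumption).
BID_PRICE = {"Mon": 100, "Tue": 150, "Wed": 150, "Thu": 150, "Fri": 100}


def nights_used(first_night, n_nights):
    """Room nights consumed by a stay starting on first_night."""
    start = DAYS.index(first_night)
    nights = DAYS[start:start + n_nights]
    if len(nights) < n_nights:
        raise ValueError("stay extends beyond the planning week")
    return nights


def accept(first_night, n_nights, offered_total):
    """Bid price control: accept only if the offer exceeds the sum of bid prices."""
    threshold = sum(BID_PRICE[d] for d in nights_used(first_night, n_nights))
    return offered_total > threshold, threshold


print(accept("Mon", 2, 225))  # (False, 250): $225 is below 100 + 150
print(accept("Mon", 1, 125))  # (True, 100)
print(accept("Fri", 1, 150))  # (True, 100)
```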
We can also compute the minimum price for accepting a booking (or a group):
In order to create the minimum price (see Table 23.4), we have rounded the shadow
price manually to an integer value. We emphasize that the bid price is an internal
control mechanism that helps decision makers decide whether to accept a
customer. The bid price need not bear resemblance to the actual price. Also, note
that even though the $150 rate class for a 1-night stay is open on Thursday, the LP
solution does not accept all demand. Thus, the bid price is valid only for small
changes in the available capacity. Moreover, we may need to connect back to the
single-resource problem to determine the booking limits for different rate classes.
To see this, consider just the resource called Tuesday. Several different products use
the Tuesday resource. Subtracting the bid price for the other days from the total
revenue, we arrive at the revenue for Tuesday shown in Table 23.5.

Table 23.5 Tuesday night single-resource analysis

Product                   Total revenue   Revenue for Tuesday
Monday 125 2 nights       250             150
Monday 125 3 nights       375             125
Monday 150 2 nights       300             200
Monday 150 3 nights       450             200
Monday 200 2 nights       400             300
Monday 200 3 nights       600             350
Tuesday 125 1 night       125             125
Tuesday 125 2 nights      250             100
Tuesday 125 3 nights      375             75
Tuesday 150 1 night       150             150
Tuesday 150 2 nights      300             150
Tuesday 150 3 nights      450             150
Tuesday 200 1 night       200             200
Tuesday 200 2 nights      400             250
Tuesday 200 3 nights      600             300
Based on this table, we can infer that the DLP can also provide relative value
of different products. This can be used in the single-resource problem to obtain the
booking limits. We can also group products into different buckets prior to using the
booking limit algorithm. Products with Tuesday revenue greater than or equal to
300 can be the highest bucket; the next bucket can be those with revenue between
200 and 250; the rest go into the lowest bucket.
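The Tuesday revenues and the three buckets can be recomputed with a short Python sketch. As before, the Wednesday and Thursday bid prices of 150 are an assumption inferred from the table's arithmetic (only Monday, Tuesday, and Friday are quoted in the text), and the names `tuesday_revenue` and `bucket` are our own.

```python
from collections import Counter

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri"]
# Mon, Tue, Fri are quoted in the text; Wed and Thu are inferred (assumption).
BID_PRICE = {"Mon": 100, "Tue": 150, "Wed": 150, "Thu": 150, "Fri": 100}


def tuesday_revenue(first_night, rate, n_nights):
    """Total stay revenue minus the bid prices of the nights other than Tuesday."""
    start = DAYS.index(first_night)
    nights = DAYS[start:start + n_nights]
    if "Tue" not in nights:
        raise ValueError("product does not use the Tuesday resource")
    total = rate * n_nights
    return total - sum(BID_PRICE[d] for d in nights if d != "Tue")


def bucket(value):
    """Buckets from the text: >= 300 highest, 200 to 250 next, the rest lowest."""
    if value >= 300:
        return "high"
    if value >= 200:  # no product falls strictly between 250 and 300
        return "middle"
    return "low"


# The 15 products that use Tuesday: Monday 2- and 3-night stays, Tuesday 1- to 3-night stays.
products = [
    (first, rate, n)
    for first, lengths in (("Mon", (2, 3)), ("Tue", (1, 2, 3)))
    for rate in (125, 150, 200)
    for n in lengths
]

for first, rate, n in products:
    value = tuesday_revenue(first, rate, n)
    print(first, rate, n, value, bucket(value))

counts = Counter(bucket(tuesday_revenue(*prod)) for prod in products)
print(counts)  # 3 products in the high bucket, 4 in the middle, 8 in the low
```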
Uses and Limitations of Bid Prices for Network Revenue Management: There
are many practical uses of the bid prices. First and foremost, the approach shifts
the focus of forecasting to the product level and away from the single-resource
level. Thus, the decision maker generates demand forecasts for 1-night and 2-night
stays separately instead of a single forecast for Tuesday-night stays. The bid prices
can help in route planning, shifting capacity if some flexibility is available, running
promotions/shifting demand, identifying bid price trends, etc. For example, the
management might decide not to offer some products on certain days, thereby
shifting demand to other products. If there is some flexibility, a rental car company
might use the bid price as a guideline to move cars from one location with a low price
to another with a high price. The product values might reveal systematic patterns
of under- and over-valuation that can help decide whether to run a promotion for
a special weekend rate or to a particular destination. Bid price trends that show
a sustained increase over several weeks can indicate slackening of competitive
pressure or advance bookings in anticipation of an event.
Several limitations of the approach have been mentioned in the chapter itself.
More advanced material explaining the development of the network revenue
management can be found in the references given in the chapter.
3 Further Reading
There are several texts devoted to revenue optimization. Robert Cross’ book (2011)
is one of the earliest ones devoted to the art and science of revenue management
in a popular style. Many ideas discussed in this chapter and many more find
a place in the book. Robert Phillips’ book (2005) and Talluri and Van Ryzin’s
book (2006) contain a graduate level introduction to the subject. In addition, we
have borrowed ideas from the papers listed at the end of the chapter (Bratu 1998;
Lapp and Weatherford 2014; Talluri and van Ryzin 1998; Williamson 1992). The
INFORMS Revenue Management and Pricing Section website4 contains several
useful references. Finally, there is a Journal of Revenue and Pricing Management5
that is devoted to the topic.
All the datasets, code, and other material referred to in this section are available at
www.allaboutanalytics.net.
• Data 23.1: Opera.xls
Exercises
Ex. 23.1 (Protection Level) An airline offers two fare classes for economy class
seats on its Monday morning flight: one class is sold at $400/ticket and another at
$160/ticket. There are 225 economy seats on the aircraft. The demand for the $400
fare (also called full-fare) seats has a mean of 46, a standard deviation of 16. Assume
it follows a normal distribution. The demand for cheaper seats has an exponential
distribution with mean of 177. A seat can be sold to either class. Further, the demand
for the two fare classes can be assumed to be independent of one another. The main
restriction is that the cheaper tickets must be purchased 3 weeks in advance.
(a) How many seats would you protect for the $400 class customers?
(b) The forecast for cheaper class passengers has changed. It is now assumed to be
less than 190 with probability 1. How many seats would you protect for full-fare
customers given this information?
(c) Go back to the original problem. Suppose that unsold seats may sometimes be
sold at the last minute at $105. What effect will this have on the protection level
(will you protect more seats, fewer seats, or the same number of seats)? Why?
(d) Will your original answer change if the demands for the two classes are not
independent of one another? Explain your answer, if possible using an example.
Ex. 23.2 (Bid Price) Please see the data in the Excel sheet Opera.xls (available on
website). The question is also given in the spreadsheet. It is reproduced below. All
data is available in the spreadsheet.
Please carry out the following analysis based on the opera data. You are provided
the cumulative booking for 1 year for two ticket classes. Assume that the opera
house sells two types of tickets for their floor seats. The first is sold at $145, and the
ticket is nonrefundable. The second is for $215 but refundable. The opera house has
245 floor seats. This data is given in two sheets in the spreadsheet.
You may verify (or assume) that the booking pattern is the same for most days.
This is because we have normalized the data somewhat and got rid of peaks and
valleys. The booking pattern is given 14 days prior to the concert onward. The final
entry shows how many persons actually showed up for the concert on each day.
Here is a sample of the data for $145 seats:
−1 0 1 2 3
11/30/2011 143 143 133 124 116
For example, today is November 30, 2011. For this date, 116 persons had booked
seats with 3 days to go, 124 with 2 days to go, 133 with 1 day to go, and 143 the
evening before the concert. Finally, 143 persons showed up on November 30 which
was the day of the concert.
We have created a forecast for the demand for the two types of seats for the next
ten days, December 1 through December 10. We have used the additive method to
estimate the pickup (PU).
(In this method, we computed the difference between the average seats sold and
seats booked with 1, 2, 3, . . . days to go. That is the PU with 1, 2, 3, . . . days to go).
See rows 40–44 in the first sheet of the spreadsheet for the forecast.
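The additive pickup calculation described above can be sketched as follows. This is a minimal illustration, not the spreadsheet's implementation: the function names are hypothetical, the history contains only the single sample row quoted above, and the 120 bookings used in the forecast are an invented input.

```python
from statistics import mean

def additive_pickup(history):
    """Average pickup (PU) for each days-to-go value.

    history: list of dicts mapping days-to-go -> cumulative bookings,
    where key -1 holds the number of persons who showed up.
    """
    days = sorted(k for k in history[0] if k >= 0)
    return {d: mean(row[-1] - row[d] for row in history) for d in days}

def forecast_final(bookings_so_far, days_to_go, pickup):
    """Additive forecast: current bookings plus the average pickup."""
    return bookings_so_far + pickup[days_to_go]

# The one sample row quoted in the text for the $145 seats (11/30/2011).
sample_history = [{-1: 143, 0: 143, 1: 133, 2: 124, 3: 116}]
pickup = additive_pickup(sample_history)

# Hypothetical input: 120 seats booked with 2 days to go for a future concert.
print(pickup)
print(forecast_final(120, 2, pickup))
```

With the full year of data, each row of the booking sheet would become one dictionary in the history list, and the averages would smooth out day-to-day variation.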
Ex. 23.4 (Dynamic Pricing) Mark is the revenue management manager at Marriott
Hotel on 49th St., Manhattan, New York. He is contemplating how to respond to
last-minute “buy it now” requests from customers. In this sales channel, customers can
bid a price for a room, and Mark can either take it or wait for the next bid. Customers
are nonstrategic (in the sense that they don’t play games with waiting to bid). Mark has
observed that typically he gets at most one request every hour. Analysis indicates
that he gets a request in an hour with probability 0.2. He is looking at the last 3 h
of the decision before the booking for the next day closes. For example, if booking
closes at midnight, then he is looking at requests between 9 and 10 PM, 10 and
11 PM, and 11 and midnight. Customers either ask for a low rate or a high rate.
Typically, half of them ask for a room for $100 and the rest for $235 (which is the
posted rate).
Help Mark structure his thoughts and come up with a decision rule for accepting
or turning down bids. It may help to think that with 3 h to go he can at most sell
three rooms, with 2 h to go he can sell at most two rooms, and with an hour to go
he can sell at most one room. (Thus, he can give away excess rooms at any price
beyond these numbers, etc.) Use the dynamic programming example.
Ex. 23.5 (Overbooking) Ms. Dorothy Parker is the admissions director at Winchester
College, a small liberal arts college. The college has a capacity of
admitting 200 students a year. The classrooms are small and the college wants to
maintain a strict limit. Demand is robust with over 800 applications the previous
year, out of which 340 students were offered a place on a rolling basis and the target
of 200 admissions was met.
However, 17 students who accepted the offer of admission did not show up.
Subsequent enquiries revealed that four of them had a last-minute change of heart
about their college choice, three decided to take a gap year, and there was no
reply from the rest. All 17 had paid the deposit and forfeited it under college
rules. Admissions contacted those on the waiting list, but it was too late as most
had already joined other institutions. As a result, the cohort comprised only 183
students, straining the budget.
Ms. Parker decided that a change of policy was needed, and for the next year, the
college will overbook, that is, admit a cohort larger than the capacity of 200. The
question is how many. The tuition fee for 1 year of study is $34,500.
(a) What data should Ms. Parker be collecting to make a decision on how many
students to admit beyond the limit of 200?
(b) Can we assume that the cost of falling short by a student is the 4 years’ worth
of tuition revenue? Argue why or why not.
(c) What is the cost of taking on a student over the 200 limit? Explain how you
came up with your number.
(d) Ms. Parker decided after some analysis that the lost revenue from a student was
$100,000, and the cost of having more students than capacity is as follows:
Students Cost
201 $10,000
202 $22,000
203 $40,000
204 $70,000
205 $100,000
206 $140,000
Admitted Showed up
200 200
200 195
200 197
200 190
200 192
200 183
If Ms. Parker were to naively admit 217 students based on this year’s
observation of no-shows, what would be the expected cost? Based on the data,
what is the optimal number to overbook?
Ex. 23.6 (Markdown Optimization) Xara is a speciality fashion clothing retail
store focusing on the big-and-tall segment of the market. This year, it is selling
approximately 12,000 SKUs, with each SKU further classified by sizes. The initial
prices for each item are usually set by the headquarters, but once the shipment
reaches the stores, the store managers have the freedom to mark down the items
depending on sales. Store managers are evaluated based on the total revenue they
generate, so the understanding is that they will try to maximize revenue.
The demand for the new line of jeans was estimated based on historical purchases
as follows:
\[ D(p) = 10{,}000 \times (1 - 0.0105\,p) \]
Here, 10,000 stands for the potential market, and the interpretation of (1 − 0.0105p)
is the probability of purchase of each member of the market. That is, demand at price
p is given by the preceding formula, where p is in the range of 0–$95 (i.e., beyond
$95, the demand is estimated to be 0).
The season lasts 3 months, and leftover items have a salvage value of 25% of
their initial price. The headquarters sets the following guidelines: Items once marked
down cannot have higher prices later. Prices can only be marked down by 10, 20,
30, or 40%. It is assumed that demand comes more or less uniformly over the 3-month
season.
(a) Based on the demand forecast, what should be the initial price of the jeans, and
how many should be produced?
(b) The manager of the store on Portal de l’Angel in Barcelona obtained an initial
consignment of 300 jeans, calculated to be the expected demand at that store.
After a while, he noticed that the jeans were selling particularly slowly. He had
a stock of 200 items still, and it was already 2 months into the season, so it
is likely the potential market for the store area was miscalculated. Should he
mark down? If so, by how much? (Hint: Based on the expected demand that
was initially calculated for the store, you need to derive the demand curve for
the store.)
Ex. 23.7 (Repricing) Meanrepricer.com offers a complex rule option where you
can set prices according to the following criteria:
• My Item Condition: the condition of your item
• Competitor Item Condition: the condition of your competitors’ product
• Action: the action that needs to be taken when applying a rule
• Value: the difference in prices which needs to be applied when using a certain
rule
Here are some sample rules. Discuss their rationale (if any) and how effective
they are.
(a) If our price for Product A is $100 and our competitors’ price for Product A is
$100, then the repricer will go ahead and reduce our price by 20% (i.e., from
$100 to $80).
(b) In case your competitors’ average feedback is lower than 3, the chosen condition
will instruct the repricer to increase your price by two units.
(c) Sequential rules, where the first applicable rule is implemented:
(i) Reduce our price by two units if our competitors’ product price is within a
range of 300–800 units.
(ii) Increase our price by two units if our competitors’ product price is within a
range of 500–600 units.
References
Talluri, K. T., & Van Ryzin, G. J. (2006). The theory and practice of revenue management (Vol. 68).
New York: Springer.
Williamson, E. L. (1992). Airline network seat inventory control: Methodology and revenue
impacts. (Doctoral dissertation, Massachusetts Institute of Technology).
Chapter 24
Supply Chain Analytics
Yao Zhao
1 Introduction
Through examples and a case study, we shall learn how to apply data analytics to
supply chain management in order to diagnose and optimize the processes that
generate value from goods and services.
A supply chain consists of all activities that create value in the form of goods
and services by transforming inputs into outputs. From a firm’s perspective,
such activities include buying raw materials from suppliers (buy), converting raw
materials into finished goods (make), and moving and delivering goods and services
to customers (delivery).
The twin goals of supply chain management are to improve cost efficiency and
customer satisfaction. Improved cost efficiency can lead to a lower price (increases
market share) and/or a better margin (improves profitability). Better customer
satisfaction, through improved service levels such as quicker delivery and/or higher
stock availability, improves relationships with customers, which in turn may also
lead to an increase in market share. However, these twin goals can work against
each other. Improving customer satisfaction often requires a
higher cost; likewise, cost reduction may lower customer satisfaction. Thus, it is a
challenge to achieve both goals simultaneously. Despite the challenge, however,
those companies that were able to achieve them successfully (e.g., Walmart,
Amazon, Apple, and Samsung) enjoyed a sustainable and long-term advantage over
their competition (Simchi-Levi et al. 2008; Sanders 2014; Rafique et al. 2014).
Y. Zhao
Rutgers University, Newark, NJ, USA
e-mail: yaozhao@business.rutgers.edu
The twin goals are hard to achieve because supply chains are highly complex
systems. We can attribute some of this complexity to the following:
1. Seasonality and uncertainty in supply and demand and internal processes make
the future unpredictable.
2. Complex network of facilities and numerous product offerings make supply
chains hard to diagnose and optimize.
Fortunately, supply chains are rich in data, such as point-of-sale (POS) data from
sales outlets, inventory and shipping data from logistics and distribution systems,
and production and quality data from factories and suppliers. These real-time, high-
speed, large-volume data sets, if used effectively through supply chain analytics,
can provide abundant opportunities for companies to track material flows, diagnose
supply disruptions, predict market trends, and optimize business processes for
cost reduction and service improvement. For instance, descriptive and diagnostic
analytics can discover problems in current operations and provide insights on the
root causes; predictive analytics can provide foresights on potential problems and
opportunities not yet realized; and finally, prescriptive analytics can optimize the
supply chains to balance the trade-offs between cost efficiency and customer service
requirement.
Supply chain analytics is flourishing in all activities of a supply chain, from
buy to make to delivery. The Deloitte Consulting survey (2014) shows that the
top four supply chain capabilities are all analytics related. They are optimization
tools, demand forecasting, integrated business planning and supplier collaboration,
and risk analytics. The Accenture Global Operations Megatrends Study (2014)
demonstrated the results that companies achieved by using analytics, including an
improvement in customer service and demand fulfillment, faster and more effective
reaction times to supply chain issues, and an increase in supply chain efficiency.
This chapter shall first provide an overview of the applications of analytics in
supply chain management and then showcase the methodology and power of supply
chain analytics in a case study on delivery (viz., integrated distribution and logistics
planning).
Supply chain management involves the planning, scheduling, and control of the
flow of material, information, and funds in an organization. The focus of this
chapter will be on the applications and advances of data-driven decision-making
in the supply chain. Several surveys (e.g., Baljko 2013; Columbus 2016) highlight
the growing emphasis on the use of supply chain analytics in generating business
value for manufacturing, logistics, and retailing companies. Typical gains include
more accurate forecasting, improved inventory management, and better sourcing
and transportation management.
It is relatively easy to see that better prediction, matching supply and demand
at a more granular level, removing waste through assortment planning, and better
category management can reduce inventory without affecting service levels. A
simple thought exercise will show that if a retailer can plan hourly sales and get
deliveries by the hour, then they can minimize their required inventory. One retailer
actually managed to do that—“Rakuten” was featured in a television series on the
most innovative firms in Japan (Ho 2015). The focus on sellers and exceptional
customer service seems to have paid off. In 2017, Forbes listed Rakuten among
the most innovative companies, with sales in excess of $7 billion and a market cap
of more than $15 billion. Data analytics can achieve similar results without the need
for hourly planning and delivery, and it can do so not only in retail but also in
global sourcing by detecting patterns and predicting shifts in commodity markets.
Clearly, supply chain managers have to maintain and update a database for hundreds
of suppliers around the globe on their available capacity, delivery schedule, quality
and operations issues, etc. in order to procure from the best source. In transportation
management, one need look no further than FedEx and UPS for the use
of data and analytics to master supply chain logistics at every stage, from pickup
to cross docking to last-mile delivery (Szwast 2014). In addition, there are vast
movements of commodities to and from countries in Asia, such as China, Japan, and
Korea, that involve long-term planning, sourcing, procurement, logistics, storage,
etc., many involving regulations and compliance that simply cannot be carried out
without the tools provided by supply chain analytics (G20 meeting 2014).
The supply chain is a great place to apply analytics for gaining competitive
advantage because of the uncertainty, complexity, and significant role it plays in the
overall cost structure and profitability for almost any firm. The following examples
highlight some key areas of applications and useful tools.
in the supply chain. These include the obvious ones such as inventory levels,
production schedules, and workforce planning (especially for service industries).
The less obvious ones are setting sales targets, working capital planning, and
supplier capacity planning (Chap. 4, Vollmann et al. 2010). Several techniques used
for forecasting are covered in Chap. 12 on “Forecasting Analytics.”
One notable example of the use of forecasting is provided by Rue La La, a
US-based flash-sales fashion retailer (Ferreira et al. 2015) that has most of its
revenues coming from new items through numerous short-term sales events. One
key observation made by managers at Rue La La was that some of the new items
were sold out before the sales period was over, while others had a surplus of leftover
inventory. One of their biggest challenges was to predict demand for items that were
never sold and to estimate the lost sales due to stock-outs. Analytics came in handy
to overcome these challenges. They developed models to analyze and classify demand
trends and patterns across different styles and price ranges and to identify the
key factors that had an impact on sales. Based on the estimated demand and
lost sales, inventory and pricing were jointly optimized to maximize profit.
Chapter 18 on retail analytics has more details about their approach to forecasting
and inventory management.
More recently, firms have started to predict demand at the individual customer
level. In fact, personalized prediction is becoming increasingly popular in
e-commerce, with Amazon and Netflix as notable examples, both of which predict
future demand and make recommendations for individual customers based on their
purchasing history. Several mobile applications can now help track demand at the
user level (Pontius 2016). An example of the development, deployment, and use
of such an application can be found in remote India (Gopalakrishnan 2016). As
part of the prime minister’s Swachh Bharat (Clean India) program, the Indian
government sanctioned subsidies toward constructing toilets in villages. A volunteer
organization called Samarthan has built a mobile app that helps track the progress
of the demand for toilet construction through various agencies and stages. The
app has helped debottleneck the provision of toilets.
Inventory planning and control in its simplest form involves deciding when and how
much to order to balance the trade-off between inventory investment and service
levels. Service levels can be defined in many ways; for example, the fill rate measures
the percentage of demand satisfied within the promised time window. Inventory
investment is often measured by inventory turnover, which is the ratio between
annual cost of goods sold (COGS) and average inventory investment. Studies
have shown that there is a significant correlation between overall manufacturing
profitability and inventory turnover (Sangam 2010).
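The two metrics just defined can be written compactly. The layout below simply restates the verbal definitions and introduces no notation of its own:

```latex
\[
\text{Fill rate} =
  \frac{\text{demand satisfied within the promised time window}}{\text{total demand}},
\qquad
\text{Inventory turnover} =
  \frac{\text{annual COGS}}{\text{average inventory investment}}
\]
```

A turnover of 12, for example, means that the average inventory investment equals about one month of COGS.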
The price and supply of commodities can fluctuate significantly over time. Because
of this uncertainty, it becomes difficult for many companies that rely on commodities
as raw materials to ensure business continuity and offer a constant price to their
customers. Organizations that use analytics to identify macroeconomic and
internal indicators can do a more effective job of predicting which way prices might
go. Hence, they can insulate themselves through inventory investment and purchases
of futures and long-term contracts. For example, a sugar manufacturer can hedge
against supply and demand shocks through multiple actions, such as contracting out
production on a long-term basis, buying futures on the commodity markets, and
forward buying before prices rise.
Another example is the procurement of ethanol that is used in medicines or
drugs. Ethanol can be produced petrochemically or from sugar or corn. Prices
of ethanol are a function of its demand and supply in the market, which is
subject to a good degree of volatility. The price of ethanol is also affected by the
supply of similar products in the market. As such, there are numerous variables
that can impact the price of ethanol. Data analytics can help uncover these
relationships to plan the procurement of ethanol. The same analytics tools and
models can be extended to other commodity-based raw materials and components
(Chandrasekaran 2014).
The last example is the spike in crop prices due to a changing climate. Climate
change is likely to affect food and hunger in the future through the impact of
temperature, precipitation, and CO2 levels on crop yields (Willenbockel 2012).
Understanding the impact of climate change on food price volatility in the long run
would help countries take the necessary preventive and corrective actions.
Researchers use computable general equilibrium (CGE) models, which can assess
the effects that external factors such as climate have on an economy, to model
the impact of climate change. The baseline estimation of production,
consumption, trade, and prices by region and commodity group takes into account
the temperature and precipitation (climate changes), population growth, labor force
growth, and total factor productivity growth in agricultural and nonagricultural
sectors. The advanced stage simulates various extreme weather conditions and
estimates crop productivity and prices subsequently.
The examples provided barely touch upon the many different possible applica-
tions in supply chain management. The idea of the survey is to provide guidance
regarding main areas of applications. The references at the end of the chapter contain
more examples and descriptions of methods. In the next section, we describe in
detail an example that illustrates inventory optimization and distribution strategies
at a major wireless carrier.
Our case study is set in 2010, when VASTA was one of the largest wireless service
carriers in the USA and well known for its reliable national network and superior
customer service. In the fiscal year of 2009, VASTA suffered a significant inventory
write-off due to the obsolescence of handsets (cell phones). At the time, VASTA
carried about $2 billion worth of handset inventory in its US distribution network
with a majority held at 2000+ retail stores. To address this challenge, the company
was considering changing its current “push” inventory strategy, in which inventory
was primarily held at stores, to a “pull” strategy, in which the handset inventory
would be pulled back from the stores to three distribution centers (DCs) and stores
would instead serve as showrooms. Customers visiting stores would be able
to experience the latest technology and place orders, while their phones would be
delivered to their homes overnight from the DCs free of charge. The pull strategy
had been used in consumer electronics before (e.g., Apple), but it had not been
attempted by VASTA and other US wireless carriers as of yet (Zhao 2014a, b).
As of 2010, the US wireless service market had 280 million subscribers with
a revenue of $194 billion. With a population of about 300 million in the USA,
the growth of the market and revenue were slowing down as the market became
increasingly saturated. As a result, the industry was transitioning from the “growth”
business model that chased revenue growth to an “efficiency” model that maximized
operational efficiency and profitability.
The US wireless service industry was dominated by a few large players. They
offered similar technology and products (handsets) from the same manufacturers,
but competed for new subscribers on the basis of price, quality of services,
reliability, network speeds, and geographic coverage. VASTA was a major player
with the following strengths:
• Comprehensive national coverage
• Superior service quality and reliable network
• High inventory availability and customer satisfaction
These strengths also led to some weaknesses:
• Lower inventory turnover and higher operating cost when compared to competitors
• Services and products priced higher than industry averages due to the higher
operating costs
The main challenge faced by VASTA was its cost efficiency, especially in
inventory costs. VASTA’s inventory turnover was 28.5 per year, which was very
low compared to what Verizon and Sprint Nextel achieved (around 50–60 turns per
year). Handsets have a short life cycle of about 6 months. A $2 billion inventory
investment in its distribution system posed a significant liability and cost for VASTA
due to the risk of obsolescence. In the following sections, we will analyze VASTA’s
proposition for change using sample data and metrics.
To maintain its status as a market leader, VASTA must improve its cost efficiency
without sacrificing customer satisfaction. VASTA had been using the “push”
strategy, which fully stocked its 2000+ retail stores to meet customer demand.
The stores carried about 60% of the $2 billion inventory, while distribution centers
carried about 40%.
The company was considering changing the distribution model from “push”
to “pull,” which would pull inventory back to the DCs. Stores would be converted to
showrooms, and customers’ orders would be transmitted to a DC which then filled
the orders via express overnight shipping. Figures 24.1 and 24.2 depict the two
strategies. In these charts, a circle represents a store and a triangle represents
inventory storage.
The “push” and “pull” strategies represent two extreme solutions to a typical
business problem in integrated distribution and logistics planning, that is, the
strategic positioning of inventory. The key questions are as follows: Where to place
inventory in the distribution system? And how does it affect all aspects of the system,
from inventory to transportation and fulfillment to customer satisfaction?
Clearly, the strategies will have a significant impact not only on inventory
but also on shipping, warehouse fulfillment, new product introduction, and, most
importantly, consumer satisfaction. The trade-off is summarized in Table 24.1.
While the push strategy allowed VASTA to better attract customers, the pull
strategy had the significant advantage of reducing inventory and facilitating the
fast introduction of new handsets, which in turn reduced the cost and risk of
inventory obsolescence. However, the pull strategy did require a higher shipping
and warehouse fulfillment cost than the push strategy. In addition, VASTA had to
convert stores into showrooms and retrain its store workforce to adapt to the change.
Fig. 24.1 VASTA’s old distribution model. Source: Lecture notes, “VASTA Wireless—Push vs.
Pull Distribution Strategies,” by Zhao (2014b)
Fig. 24.2 VASTA’s proposed new distribution model. Source: Lecture notes, “VASTA Wireless—
Push vs. Pull Distribution Strategies,” by Zhao (2014b)
Intuitively, the choice of pull versus push strategies should be product specific.
For instance, the pull strategy may be ideal for low-volume (high uncertainty) and
expensive products because their shipping and fulfillment costs are relatively small
while their inventory cost is high. Conversely, the push strategy may be ideal for high-volume (low
uncertainty) and inexpensive products. However, without a quantitative (supply
chain) analysis, we cannot be sure of which strategy to use for the high-volume
and expensive products and the low-volume and inexpensive products; nor can we
be sure of the resulting financial impact.
We shall evaluate the push and pull strategies for each product at each store to
determine which strategy works better for the product–store combination from a
cost perspective. For this purpose, we shall consider the total landed cost for product
i at store j, \(C_{ij}\), which is the sum of the store inventory cost, \(IC_{ij}\); the shipping cost,
\(SC_{ij}\); and the DC fulfillment cost, \(FC_{ij}\):
\[ C_{ij} = IC_{ij} + SC_{ij} + FC_{ij} \tag{24.1} \]
The store inventory cost is represented by
\[ IC_{ij} = h_i \times I_{ij} \tag{24.2} \]
where \(h_i\) is the inventory holding cost rate for product i (per unit inventory per unit
of time) and \(I_{ij}\) is the average inventory level of product i at store j.
The shipping cost is represented by
\[ SC_{ij} = s_j \times V_{ij} \tag{24.3} \]
where \(s_j\) is the shipping cost rate (per unit) incurred for demand generated by store
j and \(V_{ij}\) is the sales volume per unit of time for product i at store j. Under the
push strategy, \(s_j\) is the unit shipping cost to replenish inventory at store j by the
DCs; under the pull strategy, \(s_j\) is the unit shipping cost to deliver the handsets to
individual customers from the DCs.
Finally, the DC fulfillment cost is represented by
\[ FC_{ij} = f(V_{ij}) \tag{24.4} \]
where \(f(\cdot)\) is the fulfillment cost function.
To calculate the store inventory, shipping, and DC fulfillment (e.g.,
picking and packing) costs for each product–store combination, we need to estimate
the inventory holding cost rate, \(h_i\); the shipping cost rate, \(s_j\); and the fulfillment cost
function, \(f(V_{ij})\). We will use a previously collected data set of sales (or demand,
equivalently) and inventory data at all layers of VASTA’s distribution system for
60 weeks. One period will equal 1 week because inventory at both the stores and
DCs is reviewed on a weekly basis.
Inventory cost rate:
Inventory holding cost per week = Capital cost per week + Depreciation cost per week
Capital cost per week = Annual capital cost / Number of weeks in a year
Depreciation cost per week = (Product value − Salvage value) / Product life cycle in weeks
VASTA carried two types of handsets: smartphones and less expensive feature
phones with parameters and inventory holding cost per week, hi , as in Table 24.3.
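The holding cost formulas above can be sketched in a few lines. The parameter values here are assumptions rather than figures quoted from Table 24.3: a 7% annual capital cost, zero salvage value, and a 26-week (6-month) life cycle, applied to the $500 and $200 average handset costs. They were chosen because they reproduce the weekly rates of about $19.90 and $7.96 used in the cost tables later in the section.

```python
WEEKS_PER_YEAR = 52

def weekly_holding_cost(product_value, annual_capital_rate,
                        salvage_value, life_cycle_weeks):
    """Holding cost per unit per week = capital cost + depreciation cost."""
    capital = product_value * annual_capital_rate / WEEKS_PER_YEAR
    depreciation = (product_value - salvage_value) / life_cycle_weeks
    return capital + depreciation

# Assumed parameters (not taken from Table 24.3): 7% annual capital cost,
# zero salvage value, 26-week product life cycle.
h_smart = weekly_holding_cost(500, 0.07, 0, 26)
h_feature = weekly_holding_cost(200, 0.07, 0, 26)

print(f"smartphone:    ${h_smart:.2f} per unit per week")
print(f"feature phone: ${h_feature:.2f} per unit per week")
```

Under these assumptions depreciation dominates the rate, which is why a short product life cycle makes store inventory so expensive to hold.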
Shipping cost rate: Clearly, the shipping rates are distance and volume dependent.
Here, we provide an average estimate for simplicity. The pull strategy requires
shipping each unit from DCs to individual customers by express overnight freight.
Quotations from multiple carriers returned a lowest flat rate of $12/unit. The push
strategy, however, requires weekly batch shipping from DCs to stores by standard
2-day freight. The overnight express rate is typically 2.5 times the 2-day shipping rate,
which implies a 2-day rate of $12/2.5 = $4.80/unit; with a volume discount of 40%,
we arrive at an average of $4.80 × 0.6 = $2.88/unit. Table 24.4
summarizes the shipping rates.
DC fulfillment cost: Distribution centers incur different costs for batch picking
and packing relative to unit picking and packing due to economies of scale. For
VASTA’s DCs, picking the first unit of a product costs on average $1.50. If more
than one unit of the product is picked at the same time (batch picking), then the
cost of picking each additional unit is $0.10. We shall ignore the packing cost as it is
negligible relative to the picking cost.
Under the push strategy, the stores are replenished on a weekly basis. Let Vij be
the weekly sales volume. Because of batch picking, the weekly fulfillment cost for
product i and store j is
\[ f(V_{ij}) = \$1.50 + (V_{ij} - 1) \times \$0.10 \quad \text{for } V_{ij} > 0. \tag{24.5} \]
Under the pull strategy, each demand generated by a store must be fulfilled
(picked) individually. Thus, the fulfillment cost for product i and store j is
\[ f(V_{ij}) = V_{ij} \times \$1.50 \quad \text{for } V_{ij} > 0. \tag{24.6} \]
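The two fulfillment cost functions in Eqs. (24.5) and (24.6) can be sketched as follows. The function names are hypothetical, and returning zero when there are no sales is an assumption, since the text defines the functions only for positive volume.

```python
FIRST_PICK_COST = 1.50        # $ for the first unit picked
ADDITIONAL_PICK_COST = 0.10   # $ for each further unit in the same batch

def push_fulfillment_cost(weekly_volume):
    """Eq. (24.5): one weekly batch pick per product and store."""
    if weekly_volume <= 0:
        return 0.0  # assumption: no sales means nothing is picked
    return FIRST_PICK_COST + (weekly_volume - 1) * ADDITIONAL_PICK_COST

def pull_fulfillment_cost(weekly_volume):
    """Eq. (24.6): every unit is picked individually."""
    if weekly_volume <= 0:
        return 0.0  # assumption: no sales means nothing is picked
    return weekly_volume * FIRST_PICK_COST

# Weekly volume of 99 units, as for the hot-selling smartphone example.
print(round(push_fulfillment_cost(99), 2))
print(round(pull_fulfillment_cost(99), 2))
```

The gap between the two results grows linearly with volume, which is why batch picking matters most for hot-selling products.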
To simplify the analysis, we shall group products with similar features together
based on their sales volume and cost. There are essentially two types of phones:
smartphones and feature phones. The average cost for a smartphone is $500, and
the average cost of a feature phone is $200. Thus, we shall classify products into
four categories as follows:
• High-volume and expensive products, that is, hot-selling smartphones
• High-volume and inexpensive products, that is, hot-selling feature phones
• Low-volume and expensive products, that is, cold-selling smartphones
• Low-volume and inexpensive products, that is, cold-selling feature phones
Using the data of a representative store and a representative product from each
category (Table 24.5), we shall showcase the solution, analysis, and results.
In the pull model for high-volume products, we assume a per-store inventory
level of five phones—these are used for demonstration and enhancing customer
experience. Table 24.6 compares the total cost and cost breakdown between the
push and pull strategies for the representative high-volume and expensive product.
The calculation shows that we can save 46.51% of the total landed cost for this
high-volume and expensive product if we replace the push strategy by the pull
strategy. This is true because the savings on inventory cost far exceed the additional
cost incurred for shipping and DC fulfillment.
Table 24.6 Savings for “hot-smart” phones between pull and push strategies
High volume and expensive (hot-smart)   Pull                        Push
Inventory level (I_ij)                  5                           120
Inventory cost (I_ij × h_i)             $99.52 (5 × $19.90)         $2388.46 (120 × $19.90)
Weekly sales volume (V_ij)              99                          99
Shipping cost (V_ij × s_j)              $1188.00 (99 × $12/unit)    $285.12 (99 × $2.88/unit)
Fulfillment cost                        $148.50 (99 × $1.50)        $11.30 ($1.50 + 98 × $0.10)
Total cost                              $1436.02                    $2684.88
Savings                                 –                           46.51%
Table 24.7 Savings for “hot-feature” phones between pull and push strategies
High volume and inexpensive (hot-feature)   Pull                         Push
Inventory level (I_ij)                      5                            110
Inventory cost (I_ij × h_i)                 $39.81 (5 × $7.96)           $875.77 (110 × $7.96)
Weekly sales volume (V_ij)                  102                          102
Shipping cost (V_ij × s_j)                  $1224.00 (102 × $12/unit)    $293.76 (102 × $2.88/unit)
Fulfillment cost                            $153.00 (102 × $1.50)        $11.60 ($1.50 + 101 × $0.10)
Total cost                                  $1416.81                     $1181.13
Savings                                     –                            −19.95%
Table 24.8 Savings for “cold-smart” phones between pull and push strategies
Low volume and expensive (cold-smart)   Pull                       Push
Inventory level (I_ij)                  2                          15
Inventory cost (I_ij × h_i)             $39.81 (2 × $19.90)        $298.56 (15 × $19.90)
Weekly sales volume (V_ij)              2.5                        2.5
Shipping cost (V_ij × s_j)              $30.00 (2.5 × $12/unit)    $7.20 (2.5 × $2.88/unit)
Fulfillment cost                        $3.75 (2.5 × $1.50)        $1.65 ($1.50 + 1.5 × $0.10)
Total cost                              $73.56                     $307.41
Savings                                 –                          76.07%
Table 24.9 Savings for “cold-feature” phones between pull and push strategies
Low volume and inexpensive (cold-feature)   Pull                       Push
Inventory level (I_ij)                      2                          25
Inventory cost (I_ij × h_i)                 $15.92 (2 × $7.96)         $199.04 (25 × $7.96)
Weekly sales volume (V_ij)                  7.3                        7.3
Shipping cost (V_ij × s_j)                  $87.60 (7.3 × $12/unit)    $21.02 (7.3 × $2.88/unit)
Fulfillment cost                            $10.95 (7.3 × $1.50)       $2.13 ($1.50 + 6.3 × $0.10)
Total cost                                  $114.47                    $222.19
Savings                                     –                          48.48%
Table 24.10 Savings for all types of phones between pull and push strategies
Cold-smart Cold-feature Hot-smart Hot-feature
% Savings 76.07% 48.48% 46.51% –19.95%
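The per-product comparison in Tables 24.6 to 24.9 can be reproduced with a short Python sketch. This is an illustration under stated assumptions, not the chapter's own code: the function and variable names are hypothetical, and the holding costs are the rounded weekly values printed in the tables ($19.90 and $7.96), so the results agree with the tables only to within rounding.

```python
def pull_cost(inventory, holding_cost, volume, ship_rate=12.00, pick_rate=1.50):
    """Weekly landed cost when every sale ships overnight from the DC."""
    inventory_cost = inventory * holding_cost
    shipping_cost = volume * ship_rate
    fulfillment_cost = volume * pick_rate  # every order is picked individually
    return inventory_cost + shipping_cost + fulfillment_cost


def push_cost(inventory, holding_cost, volume, ship_rate=2.88,
              first_pick=1.50, extra_pick=0.10):
    """Weekly landed cost when the store is replenished in one batch."""
    inventory_cost = inventory * holding_cost
    shipping_cost = volume * ship_rate
    fulfillment_cost = first_pick + (volume - 1) * extra_pick  # batch picking
    return inventory_cost + shipping_cost + fulfillment_cost


def savings(pull, push):
    """Relative saving of pull over push; negative means pull costs more."""
    return 1 - pull / push


# category: (pull inventory, push inventory, weekly holding cost, weekly volume)
PRODUCTS = {
    "hot-smart": (5, 120, 19.90, 99),
    "hot-feature": (5, 110, 7.96, 102),
    "cold-smart": (2, 15, 19.90, 2.5),
    "cold-feature": (2, 25, 7.96, 7.3),
}


def savings_by_category(products=PRODUCTS):
    """Savings of pull relative to push for each product category."""
    result = {}
    for name, (inv_pull, inv_push, holding, volume) in products.items():
        pull = pull_cost(inv_pull, holding, volume)
        push = push_cost(inv_push, holding, volume)
        result[name] = savings(pull, push)
    return result


if __name__ == "__main__":
    for name, value in savings_by_category().items():
        print(f"{name}: {value:.1%}")
```

Only the hot-feature category comes out negative, which is the observation that motivates the hybrid strategy later in the section.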
To assess the impact of the pull strategy on store inventory, we quantify the
reduction of inventory investment per store. For the representative store, Table 24.11
shows the number of products in each category and their corresponding inventory
level reduction. Specifically, there are 22 products in the hot-smart category, 20 in
the hot-feature category, 15 in the cold-smart category, and 11 in the cold-feature
category. The store inventory investment can be calculated for both the pull and push
strategies.
From this table, we can see that inventory investment per store under the pull
strategy is only about 5% of that under the push strategy. Thus, the pull strategy can
reduce the store-level inventory by about 95%. Given that store inventory accounts
for 60% of the $2 billion total inventory investment, the pull strategy will bring
a reduction of at least $1 billion in inventory investment as compared to the push
strategy.
Despite the significant savings in inventory, the pull strategy can increase the
shipping and DC fulfillment costs substantially. To assess the net impact of the
pull strategy, we shall aggregate the costs over all products for each cost type
(inventory, shipping, and fulfillment) in the representative store and present them
in Table 24.12.
The table shows that the inventory cost reduction outweighs the shipping/picking
cost inflation and thus the pull strategy results in a net savings per store of about 31%
relative to the push strategy.
Table 24.11 Store inventory investment for pull and push strategies
                              Pull                                         Push
Category       # of products  Inventory level  Inventory investment       Inventory level  Inventory investment
Hot-smart      22             5                5 · $500 · 22 = $55,000    120              120 · $500 · 22 = $1,320,000
Hot-feature    20             5                5 · $200 · 20 = $20,000    110              110 · $200 · 20 = $440,000
Cold-smart     15             2                2 · $500 · 15 = $15,000    15               15 · $500 · 15 = $112,500
Cold-feature   11             2                2 · $200 · 11 = $4,400     25               25 · $200 · 11 = $55,000
Total          68             –                $94,400                    –                $1,927,500
Table 24.12 Total costs for pull and push strategies
Per store per week     Pull         Push
Total inventory cost   $3,757.85    $76,729.33
Total shipping cost    $52,029.60   $12,487.10
Total picking cost     $6,503.70    $528.78
Total cost             $62,291.15   $89,745.21
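The store-level figures quoted in the text (inventory under pull at about 5% of push, net savings of about 31%, and a network-wide reduction of at least $1 billion) follow directly from Tables 24.11 and 24.12. The sketch below is a minimal check of that arithmetic; the names are illustrative, and the assumption that the 95% store-level reduction applies uniformly to the 60% store share of the $2 billion total is taken from the text's own reasoning.

```python
from collections import namedtuple

Category = namedtuple("Category", "n_products unit_price pull_level push_level")

# Values from Table 24.11 ($500 per smartphone, $200 per feature phone).
STORE = {
    "hot-smart": Category(22, 500, 5, 120),
    "hot-feature": Category(20, 200, 5, 110),
    "cold-smart": Category(15, 500, 2, 15),
    "cold-feature": Category(11, 200, 2, 25),
}

# Weekly cost totals per store from Table 24.12.
WEEKLY_COSTS = {
    "pull": {"inventory": 3757.85, "shipping": 52029.60, "picking": 6503.70},
    "push": {"inventory": 76729.33, "shipping": 12487.10, "picking": 528.78},
}


def inventory_investment(strategy, store=STORE):
    """Dollar value of the inventory a store holds under 'pull' or 'push'."""
    total = 0
    for cat in store.values():
        level = cat.pull_level if strategy == "pull" else cat.push_level
        total += level * cat.unit_price * cat.n_products
    return total


def net_savings(costs=WEEKLY_COSTS):
    """Relative weekly saving of pull over push across all cost types."""
    pull = sum(costs["pull"].values())
    push = sum(costs["push"].values())
    return 1 - pull / push


def network_inventory_reduction(total_inventory=2e9, store_share=0.60):
    """Scale the store-level reduction to the whole network (text's assumption)."""
    ratio = inventory_investment("pull") / inventory_investment("push")
    return total_inventory * store_share * (1 - ratio)


if __name__ == "__main__":
    print(inventory_investment("pull"), inventory_investment("push"))
    print(f"net savings per store: {net_savings():.1%}")
    print(f"network reduction: ${network_inventory_reduction() / 1e9:.2f} billion")
```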
24 Supply Chain Analytics 839
As shown by our prior analysis, the pull strategy does not outperform the push
strategy for all products. In fact, for high-volume and inexpensive products (hot-
feature phones), it is better to satisfy a portion of demand at stores. Thus, the ideal
strategy may be hybrid, that is, the store should carry some inventory so that a
fraction of demand will be met in-store, while the rest will be met by overnight
express shipping from a DC. The question is how to set the store inventory level to
achieve the optimal balance between push and pull.
To answer this question, we shall introduce more advanced inventory models
(Zipkin 2000). Consider the representative store and a representative product. Store
inventory is reviewed and replenished once a week. The following notation is
useful:
• T: the review period
• D(T): the demand during the review period
• E[D(T)] = μ: the mean of the demand during the review period
• STDEV[D(T)] = σ: the standard deviation of the demand during the review period
The store uses a base-stock inventory policy that orders enough units to raise the
inventory (on-hand plus on order) to a target level S at the beginning of each period.
The probability of satisfying all the demand in this period via store inventory alone
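is α (α is called the Type 1 service level). Assuming that D(T) follows a normal
distribution, Normal(μ, σ), and that lead time is negligible (as is true in the VASTA
case), then
\[ S = \mu + z_\alpha \sigma, \tag{24.7} \]
where $z_\alpha$ is the α-quantile of the standard normal distribution, that is, $\Phi(z_\alpha) = \alpha$ with $\Phi$ the standard normal cumulative distribution function, and
$\varphi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ is the standard normal probability density function.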
The expected store demand met by store inventory is
Because the inventory level at the beginning of the period is S, the average on-
hand inventory during the period can be approximated by
\[ I = \frac{S + EI}{2}. \tag{24.11} \]
Using α as the decision variable, we can calculate the total landed cost by Eq.
(24.1),
\[ C_{ij} = IC_{ij} + SC_{ij} + FC_{ij}, \]
\[ \min_{0 \le \alpha_{ij} \le 1} C_{ij}. \]
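The optimization over α can be sketched numerically. The code below is a hedged illustration, not the chapter's model: it uses the textbook normal loss function for the expected demand the store cannot fill, takes the ending inventory EI as the base stock minus expected store sales, and combines the shipping and picking rates from Tables 24.6 to 24.9 in a way that is an assumption here (store-filled units at the batch rates, DC-filled units at the overnight rates). The standard deviation in the usage note is hypothetical, since Table 24.13 is not reproduced in this section, so the output is not meant to match Figs. 24.3 and 24.4.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def expected_shortage(mu, sigma, base_stock):
    """E[(D - S)^+] for D ~ Normal(mu, sigma): demand the store cannot fill."""
    z = (base_stock - mu) / sigma
    return sigma * (STD_NORMAL.pdf(z) - z * (1.0 - STD_NORMAL.cdf(z)))


def hybrid_cost(alpha, mu, sigma, holding_cost,
                store_ship=2.88, dc_ship=12.00,
                first_pick=1.50, extra_pick=0.10, dc_pick=1.50):
    """Weekly landed cost for a store Type 1 service level alpha in (0, 1)."""
    # Eq. (24.7); clamped at zero because a negative base stock is meaningless.
    base_stock = max(mu + STD_NORMAL.inv_cdf(alpha) * sigma, 0.0)
    store_sales = max(mu - expected_shortage(mu, sigma, base_stock), 0.0)
    dc_sales = mu - store_sales  # remainder is shipped overnight from the DC
    ending_inventory = max(base_stock - store_sales, 0.0)  # assumed form of EI
    avg_inventory = (base_stock + ending_inventory) / 2.0  # Eq. (24.11)

    inventory_cost = holding_cost * avg_inventory
    shipping_cost = store_ship * store_sales + dc_ship * dc_sales
    if store_sales > 0:
        batch_pick = first_pick + extra_pick * max(store_sales - 1.0, 0.0)
    else:
        batch_pick = 0.0
    fulfillment_cost = batch_pick + dc_pick * dc_sales
    return inventory_cost + shipping_cost + fulfillment_cost


def best_service_level(mu, sigma, holding_cost, steps=99):
    """Grid search over alpha = 0.01, ..., 0.99 for the lowest landed cost."""
    grid = [k / (steps + 1) for k in range(1, steps + 1)]
    return min(grid, key=lambda a: hybrid_cost(a, mu, sigma, holding_cost))


if __name__ == "__main__":
    # sigma = 30 is a hypothetical value, used only to exercise the functions.
    alpha = best_service_level(mu=102, sigma=30, holding_cost=7.96)
    print(alpha, round(hybrid_cost(alpha, 102, 30, 7.96), 2))
```

A grid search is used instead of a closed-form optimum because the cost is a sum of several nonlinear terms in α; with the demand estimates from the data file, the same functions could be rerun per product.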
To solve this problem, we shall need demand variability (or uncertainty) information
in addition to averages, such as the standard deviation of demand per unit
of time. Table 24.13 provides the estimates of the representative products at the
representative store.
The results for the high-volume products are plotted in Fig. 24.3, which
shows how the total cost varies with α (the store Type 1 service level). Clearly, a hybrid
strategy is best for the representative hot-feature phone (better than both the pull
and push strategies). However, the pull strategy is still the best for the representative
hot-smartphone.
Similarly, the results for the low-volume products are plotted in Fig. 24.4. The
pull strategy works the best for the cold-smartphone, while a hybrid strategy is the
best for the cold-feature phone.
Fig. 24.3 Comparison between push and pull strategies for hot phones
Fig. 24.4 Comparison between push and pull strategies for cold phones
So far, our analysis focuses on the total landed cost, which is smaller under the
pull strategy than the push strategy. Despite this cost efficiency, a fundamental issue
remains: Will customers accept the pull strategy? More specifically, will customers
be willing to wait for their favorite cell phones to be delivered to their doorstep
overnight from a DC?
An analysis of the online sales data shows that in the year 2010, one out of
three customers purchased cell phones online. While this fact implies that a large
portion of customers may be willing to wait for delivery, it is not clear how the
remaining two-thirds of customers may respond to the pull strategy. It is also unclear how to
structure the delivery to minimize shipping cost while still keeping it acceptable to
most customers. The available delivery options include the following:
1. Overnight free of charge
2. Overnight with a fee of $12
3. Free of charge but 2 days
4. Free of charge but store pickup
Different options have significantly different costs and customer satisfaction
implications; they must be tested in different market segments and geographic
regions. To ensure customer satisfaction, VASTA decided to start with option 1
for all customers.
Implementation of the pull strategy requires three major changes in the distribution
system:
1. Converting retail stores to showrooms and retraining sales workforce
2. Negotiating with carriers on the rate and service of the express shipping
3. A massive transformation of the DCs that will transition from handling about
33% individual customer orders to nearly 72% individual customer orders (the
indirect sales, through third-party retail stores such as Walmart, can be fulfilled
by batch picking and account for 28% of total sales)
Despite the renovation costs and training expenses, showrooms may enjoy
multiple advantages over stores from a sales perspective. For instance, removing
inventory can save space for product display and thus enhance customers’ shopping
experiences. Showrooms can increase the breadth of the product assortment and
facilitate faster adoption of newer handsets and thus increase sales. Finally, they
can also help to reduce store-level inventory damage and theft, thereby minimizing
reverse logistics.
Negotiation with carriers needs to balance the shipping rate and the geographic
areas covered as a comprehensive national coverage may require a much higher
shipping rate than a regional coverage. Important issues such as shipping damages
and insurance coverage should also be included in the contract. The hardest part
of implementation is the DC transformation, especially given the unknown market
response to the pull strategy. Thus, a three-phase implementation plan (see Table
24.15) was carried out to roll out the pull strategy gradually in order to maximize
learning and avoid major mistakes.
3.7 Epilogue
In 2011, VASTA implemented the pull strategy in its US distribution system, using FedEx
overnight shipping. System inventory fell from $2 billion to $1 billion. Soon
after, other US wireless carriers followed suit, and the customer experience of shopping
for cell phones in the USA changed completely, from buying in stores to ordering
in stores and receiving delivery at home. In the years after, VASTA continued to
fine-tune the pull strategy into the hybrid strategy and explored multiple options for
express shipping depending on customers’ preferences. VASTA remains one of
the market leaders today.
844 Y. Zhao
All the datasets, code, and other material referred to in this section are available at
www.allaboutanalytics.net.
• Data 24.1: Vasta_data.xls
Exercises
Ex. 24.1 Reproduce the basic model and analysis on the representative store for the
comparison between push and pull strategies.
Ex. 24.2 Reproduce the advanced model and analysis on the hybrid strategy for the
representative store.
Ex. 24.3 For NYC and LA stores, use the basic and advanced models to find out
which strategy to use for each type of product, and calculate the cost impact relative
to the push strategy.
References
Accenture Global Operations Megatrends Study. (2014). Big data analytics in supply chain: Hype
or here to stay? Dublin: Accenture.
Baljko, J. (2013, May 3). Betting on analytics as supply chain’s next big thing. Retrieved
December 30, 2016, from http://www.ebnonline.com/author.asp?section_id=1061&doc_id=262988
&itc=velocity_ticker.
Bisk. (2016). How to manage the bullwhip effect on your supply chain. Retrieved December
30, 2016, from http://www.usanfranonline.com/resources/supply-chain-management/how-to-
manage-the-bullwhip-effect-on-your-supply-chain/.
Bodenstab, J. (2015, January 27). Retrieved December 30, 2016, from http://blog.toolsgroup.com/
en/multi-echelon-inventory-optimization-fast-time-to-benefit.
Chandrasekaran, P. (2014, March 19). How big data is relevant to commodity markets. Retrieved
December 30, 2016, from http://www.thehindubusinessline.com/markets/commodities/how-
big-data-is-relevant-to-commodity-markets/article5805911.ece.
Columbus, L. (2016, December 18). McKinsey’s 2016 analytics study defines the future
of machine learning. Retrieved December 30, 2016, from http://www.forbes.com/sites/
louiscolumbus/2016/12/18/mckinseys-2016-analytics-study-defines-the-future-machine-
learning/#614b73d9d0e8.
Deloitte Consulting. (2014). Supply chain talent of the future findings from the 3rd annual supply
chain survey. Retrieved December 27, 2016, from https://www2.deloitte.com/content/dam/
Deloitte/global/Documents/Process-and-Operations/gx-operations-supply-chain-talent-of-the-
future-042815.pdf.
Ferreira, K. J., Lee, B. H. A., & Simchi-Levi, D. (2015). Analytics for an online retailer: Demand
forecasting and price optimization. Manufacturing & Service Operations Management, 18(1),
69–88.
G20 Trade Ministers Meeting. (2014, July 19). Global value chains: Challenges, opportunities,
and implications for policy. Retrieved December 30, 2016, from https://www.oecd.org/tad/
gvc_report_g20_july_2014.pdf.
Gilmore, D. (2008, August 28). Supply chain news: What is inventory optimization? Retrieved
December 30, 2016, from http://www.scdigest.com/assets/firstthoughts/08-08-28.php.
Gopalakrishnan, S. (2016, July 22). App way to track toilet demand. Retrieved December 30, 2016,
from http://www.indiawaterportal.org/articles/app-way-track-toilet-demand.
Heikell, L. (2015, August 17). Connected cows help farms keep up with the herd. Microsoft News
Center. Retrieved December 30, 2016, from https://news.microsoft.com/features/connected-
cows-help-farms-keep-up-with-the-herd/#sm.00001iwkvt0awzd5ppu5pahjfsks0.
Hendricks, K., & Singhal, V. R. (2015, June). The effect of supply chain disruptions on long-term
shareholder value, profitability, and share price volatility. Retrieved January 7, 2017, from
http://www.supplychainmagazine.fr/TOUTE-INFO/ETUDES/singhal-scm-report.pdf.
Hicks, H. (2012, March). Managing supply chain disruptions. Retrieved December 30, 2016, from
http://www.inboundlogistics.com/cms/article/managing-supply-chain-disruptions/.
Ho, J. (2015, August 19). The ten most innovative companies in Asia 2015. Retrieved
December 30, 2016, from http://www.forbes.com/sites/janeho/2015/08/19/the-ten-most-innovative-
companies-in-asia-2015/#3c1077d6465c.
Lee, H. L., Padmanabhan, V., & Whang, S. (1997, April 15). The bullwhip effect in supply chains.
Retrieved December 30, 2016, from http://sloanreview.mit.edu/article/the-bullwhip-effect-in-
supply-chains/.
O’Marah, K., John, G., Blake, B., & Manent, P. (2014, September). SCM World’s the chief supply
chain officer report. Retrieved December 30, 2016, from https://www.logility.com/Logility/
files/4a/4ae80953-eb43-49f4-97d7-b4bb46f6795e.pdf.
Pontius, N. (2016, September 24). Top 30 inventory management, control and tracking
apps. Retrieved December 30, 2016, from https://www.camcode.com/asset-tags/inventory-
management-apps/.
Rafique, R., Mun, K. G., & Zhao, Y. (2014). Apple vs. Samsung – Supply chain competition. Case
study. Newark, NJ; New Brunswick, NJ: Rutgers Business School.
Sanders, N. R. (2014). Big data driven supply chain management: A framework for implementing
analytics and turning information into intelligence. Upper Saddle River, NJ: Pearson
Education, Inc.
Sangam, V. (2010, September 2). Inventory optimization. Supply Chain World Blog. Retrieved
December 30, 2016.
Simchi-Levi, D., Kaminsky, P., & Simchi-Levi, E. (2008). Designing and managing the supply
chain: Concepts, strategies, and case studies. New York, NY: McGraw-Hill Irwin.
Simchi-Levi, D., Schmidt, W., & Wei, Y. (2014). From superstorms to factory fires - Managing
unpredictable supply-chain disruptions. Harvard Business Review, 92, 96. Retrieved from
https://hbr.org/2014/01/from-superstorms-to-factory-fires-managing-unpredictable-supply-
chain-disruptions.
Szwast, S. (2014). UPS 2014 healthcare white paper series – Supply chain management. Retrieved
December 30, 2016, from https://www.ups.com/media/en/UPS-Supply-Chain-Management-
Whitepaper-2014.pdf.
Vollmann, T., Berry, W., Whybark, D. C., & Jacobs, F. R. (2010). Manufacturing planning and
control systems for supply chain management (6th ed.). Noida: Tata McGraw-Hill. Chapters 3
and 4.
Willenbockel, D. (2012, September). Extreme weather events and crop price spikes in a changing
climate. Retrieved January 7, 2017, from https://www.oxfam.org/sites/www.oxfam.org/files/rr-
extreme-weather-events-crop-price-spikes-05092012-en.pdf.
Winckelmann, L. (2013, January 17). HANA successful in mission to the Eldorado Group in
Moscow. Retrieved January 7, 2017, from https://itelligencegroup.com/in-en/hana-successful-
in-mission-to-the-eldorado-group-in-moscow.
Zhao, Y. (2014a). VASTA wireless – Push vs. pull distribution strategies. Case study. Newark, NJ;
Brunswick, NJ: Rutgers Business School.
Zhao, Y. (2014b). Lecture notes VASTA wireless – Push vs. pull distribution strategies. Newark,
NJ; Brunswick, NJ: Rutgers Business School.
Zipkin, P. (2000). Foundations of inventory management. New York, NY: McGraw-Hill Higher
Education.
Chapter 25
Case Study: Ideal Insurance
1 Introduction
Sebastian Silver, the Chief Finance Officer of Ideal Insurance Inc., was concerned.
The global insurance industry was slowing, and many firms like his were feeling
the pressure of generating returns. With low interest rates and increase in financial
volatility in world markets, Sebastian’s ability to grow the bottom line was being
put to the test.
Sebastian started going through the past few quarters’ financial reports. He was
worried about the downward trend in numbers and was trying to identify the root
causes of the shortfall in order to propose a strategy to the board members in the
upcoming quarterly meeting. To support his reasoning, he started looking through
industry reports to examine whether the trend was common across the industry or
whether there were areas of improvement for his company. Looking at the reports,
he observed that both profit from core operations, that is, profit from insurance
service, and the customer satisfaction rate were surprisingly lower than the industry
standard. The data was at odds with the company’s claim settlement ratio,1 which
Sebastian knew was higher than that of his rivals, and claim repudiation ratio,2
which was lower than the industry average. He also observed that claim settlement was
taking longer than expected.
The head of the claims department at Ideal Insurance, Rachel Morgan, told
Sebastian that there was a tremendous shortage of manpower in the settlement team
and added that there was no focus on innovation and improvement in the
claim investigation process. She also reminded Sebastian that she had proposed
an in-house department of analysts who could help improve the claim settlement
process and support other improvement initiatives. Sebastian promised he would
review the proposal, which had been submitted to Adam Berger’s HR team in the
beginning of the year, and set up a meeting with Rachel in the following week. He
also asked her to contact an expert in claim settlement and investigation to provide
a detailed review of performance and to suggest a road map for changes.
Following Sebastian’s suggestion, Rachel reached out to an independent consultant
to verify and analyze Ideal’s healthcare policy claims. She knew that fraud
prevention, one of the biggest reasons for profit leakage in the sector, would have to
be a key priority area. However, she was also aware that improving the probability of
detecting fraudulent claims could hurt genuine policyholders. The challenge facing
Rachel was how to balance the need to deliver swift responses to customers with
the knowledge that too many fraudulent claims would severely hurt the bottom line.
It was to solve this challenge that the consultant advised Rachel to consider using
advanced analytical techniques. These techniques, such as artificial intelligence and
machine learning, could help to make claims processing more efficient by identifying
fraud, optimizing resource utilization, and uncovering new patterns of fraud. The
consultant added that such applications would improve customer perception of the
company because genuine claims would be identified and processed more quickly.
The global insurance industry annually writes trillions of dollars of policies. The
nature of the insurance industry means that insurers are incredibly sensitive to
fluctuations in local and global financial markets, because many policies involve
coverage for decades. Premiums that are paid to insurance companies are therefore
invested for the long term in various financial assets in order to generate returns
and to use capital efficiently. This necessarily means that understanding the global
1 Claims settlement ratio is calculated as the percentage of claims settled in a period out of
total claims notified in the same period. The higher the settlement ratio, the higher the customer
satisfaction.
2 Claims repudiation ratio is calculated as the percentage of claims rejected (on account of missing
or wrong information) in a period out of total claims notified in the same period.
insurance industry involves not just understanding the nature of the companies that
operate in the sector but also its interconnections with financial markets.
Perhaps the most important distinction regarding insurance companies is the
nature of the policies written. Large (measured by assets and geographical coverage)
firms such as AXA and Prudential usually have an array of policies in every
segment. Smaller companies, however, may restrict themselves to writing only in
the life or non-life segments. Insurers may thus choose only to be involved in motor,
disaster, or injury-linked claims. Companies may also reinsure the policies of other
insurance companies—Swiss Re, Munich Re, and Berkshire Hathaway all engage
in reinsuring the policies of other insurers. The operating results of any specific
insurance company, therefore, will depend not just on the geography in which it
operates but also on the type(s) of policy(ies) that it underwrites.
Insurance to a layperson is nothing but the business of sharing risk. The insurance
provider agrees to share the risk with the policyholder in return for the payment of
an insurance premium. Typically, an underwriter assesses the risk based on age,
existing health conditions, lifestyle, occupation, family history, residential location,
etc. and recommends the premium to be collected from the potential customer.
The policyholder secures his unforeseen risk by paying the premium and expects
financial support in case the risk event takes place.
The insurance business depends for survival and profitability on spreading the
risk over a proper mix and volume and on careful planning over a long horizon. The
collective premium is either used to honor the claims raised by policyholders
or invested in long-term assets in expectation of significant profit. Thus, the
insurance provider has two major avenues to earn profit, namely, profit from the
core operations of risk sharing and profit from investments. It has been observed
that profit from core operations is generally very low or sometimes even negative;
however, overall profits are high due to investment strategies followed by the
firm—such as value investing, wealth management, and global diversification. Most
insurance businesses have an asset management entity that manages the investment
of such collected premium. The insurance providers also protect themselves from
huge losses from the core business by working with reinsurers such as Swiss Re and
Berkshire Hathaway, who share the risks among the insurance providers.
The health and medical insurance industry is a fast-growing segment of the non-life
insurance industry. The compound annual growth rate of the global health insurance
market is expected to be around 11% during 2016–2020.3 The revenue from the
global private health insurance industry, which was around US$1.45 trillion in 2016,
3 https://www.technavio.com/report/global-miscellaneous-global-health-insurance-market-2016-
is likely to double by 2025 (Singhal et al. 2016). While the USA occupies the top
rank in gross written premium revenues, with more than 40% of the market share, an
aging population and growing income are expected to lead to major increases in the
demand for healthcare services and medical insurance in Latin America and Asia-
Pacific regions in the coming years. The major driving forces and disruptive changes
in health and medical insurance markets are due to the increase in health risk from
the rise in noncommunicable diseases and chronic medical conditions, advances
in digital technology and emerging medical sciences, improved underwritings and
changes in regulatory environments, and increased consumerism and rise in aging
population in developing economies.
Health insurance accounts for billions of dollars of spending by insurance
companies. The Centers for Disease Control and Prevention estimates that about
15% of spending by insurers is on healthcare plans.4 Advances in healthcare
technology are likely to be balanced by increased demand. As populations continue
to age in the USA, Europe, and Japan, spending on this sector will remain a
cornerstone of the insurance business for decades to come.
Most healthcare insurance divides the cost of care between the policyholder and
the insurer. Depending on the type of policy and the nature of treatment, this division
can take place in a number of ways. Some policies pay out once spending crosses a
certain threshold—called a “deductible.” Others split the cost with the policyholder
in a defined ratio—called a “copayment.”
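The two cost-sharing arrangements can be written as simple payout rules. This is a simplified sketch under stated assumptions: the function names are hypothetical, the deductible rule assumes the insurer pays everything above the threshold, and real policies typically add caps, exclusions, and combinations of both mechanisms that are not modeled here.

```python
def payout_with_deductible(bill, deductible):
    """Insurer pays only the part of the bill above the deductible."""
    return max(bill - deductible, 0.0)


def payout_with_copayment(bill, policyholder_share):
    """Insurer pays the bill net of the policyholder's agreed share (0 to 1)."""
    if not 0 <= policyholder_share <= 1:
        raise ValueError("policyholder_share must be between 0 and 1")
    return bill * (1 - policyholder_share)


if __name__ == "__main__":
    # A $1,000 bill under a $250 deductible versus a 20% policyholder share.
    print(payout_with_deductible(1000, 250))
    print(payout_with_copayment(1000, 0.20))
```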
4 Claims Processing
Speed is at the heart of designing the processes in a health insurance firm. By its
definition, insurance is required in situations that are unforeseeable. In the case of
health insurance, receiving the money that the policy guarantees as soon as possible
is vital to the policyholder. Yet this constraint means that the timeline of claims
processing must necessarily be as short as possible, which hurts the ability of insurance
companies to verify claims.
The process for claiming health insurance normally proceeds in several stages.
The initial contact between the firm and the policyholder is made via the call center
or local office/agent. Often, this step is undertaken by a person close to the holder
or by the healthcare provider, because the holder may be incapacitated.
The firm’s call center obtains and records basic information from the client,
including details regarding the type of policy that the holder owns, the hospital to
which the holder has been taken, and the injury/ailment with which the holder has
4 Eynde, Mike Van Den, “Health Insurance Market Overview,” State Public Health Leadership
Webinar, Deloitte Consulting LLP, August 15, 2013, URL: https://www.cdc.gov/stltpublichealth/
program/transformation/docs/health-insurance-overview.pdf (accessed on May 25, 2017).
been afflicted. Armed with this information, the call-center employee forwards the
necessary details to the claims processing team.
The claims processing team is responsible for the first line of inquiry into the
claim. They check the policyholder’s coverage and expenses, verify the hospital
and network at which the client has been admitted, and ask for proof such as
bills and prescriptions. Upon doing so, the team classifies the claim as
Genuine, Discuss, or Investigate. Genuine cases are processed without any further
clarification. Discuss cases are forwarded to a data collection team in order to collect
more information and verify details. Investigate cases are forwarded to the claims
investigation team and can take a long time to be settled or rejected.
At present, the claim processing team at Ideal Insurance examines the following
points to create a fraud score:
1. Is the policyholder raising reimbursement claims from a non-network hospital?
2. Are multiple claims raised from a single policy (except group policies) or
policyholder?
3. Are there multiple group claims from the same hospital?
4. Is the claim raised close to the policy expiry date?
5. Is the claim raised close to the policy inception date?
6. Is there no pre- or post-claim to the main claim?
7. Is there misrepresentation of material information identified in the report?
8. Was the claim submitted on a weekend?
9. Are there “costlier” investigations?
10. Are there high doctor fees?
11. Was the claim reported one day before discharge?
12. Did the claim intimation happen after 48 h of admission?
Each indicator carries a weight assigned based on prior research and experience
of the investigation team. The maximum weighted score is 40. If the weighted score
(or fraud score) of a claim is more than 20, then the claim processing team forwards
the claims to the investigation team to investigate potential fraud. If the fraud score
is between 16 and 20, then the claim processing team seeks additional data from
the information collection team and healthcare service provider. The claims with
fraud scores of less than or equal to 15 are considered genuine and forwarded to the
settlement team for payment to the policyholder or service provider.
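The routing rule described above can be sketched as follows. The thresholds (more than 20, 16 to 20, and 15 or less) and the maximum score of 40 come from the text; the individual indicator weights are not given in the case, so the weights below are hypothetical placeholders chosen only so that they sum to 40. Scores are assumed to be whole numbers, which is why "between 16 and 20" is coded as "above 15 and at most 20".

```python
# Hypothetical weights for the 12 indicators, in the order listed in the text.
# They are placeholders that sum to the stated maximum score of 40.
WEIGHTS = [4, 4, 3, 3, 4, 3, 5, 2, 3, 3, 3, 3]

INVESTIGATE_ABOVE = 20  # score > 20: forward to the investigation team
GENUINE_UP_TO = 15      # score <= 15: forward to the settlement team


def fraud_score(flags, weights=WEIGHTS):
    """Weighted sum of the indicators that are raised (True) for a claim."""
    if len(flags) != len(weights):
        raise ValueError("one flag is required per indicator")
    return sum(weight for flag, weight in zip(flags, weights) if flag)


def classify_claim(score):
    """Map a fraud score to the team's three categories."""
    if score > INVESTIGATE_ABOVE:
        return "Investigate"
    if score > GENUINE_UP_TO:
        return "Discuss"
    return "Genuine"


if __name__ == "__main__":
    # Example claim: non-network hospital, raised close to policy inception,
    # and submitted on a weekend (indicators 1, 5, and 8).
    flags = [True, False, False, False, True, False,
             False, True, False, False, False, False]
    score = fraud_score(flags)
    print(score, classify_claim(score))
```

Keeping the thresholds as named constants makes it easy to study how the mix of Genuine, Discuss, and Investigate cases shifts when the cutoffs change.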
Investigating claims requires firms to verify a host of corroborating details. In
order to satisfy themselves of the genuine nature of the claim, investigators check
the types of medications prescribed, years of patient history, and the specific nature
of the ailment. Depending on whether the type of fraud suspected is hard or soft,
investigators could choose to examine different levels of data. While soft fraud
might be identified by conclusively proving that certain kinds of treatment were
not appropriate for the disease diagnosed, hard fraud would need larger and more
complex patterns to be uncovered.
Longer, more complicated, claims processes are a double-edged sword for
insurance firms. As the number of claims that are investigated in-depth rises, the
852 D. Agrawal and S. Mamidipudi
chance of both inclusion and exclusion errors falls. Yet investigating a large number
of claims takes up time and resources, and risks causing delays to genuine customers.
A claim gets further complicated if the policyholder decides to file a lawsuit
over a delay or rejection. Though a suit may create pressure for quick settlement,
providing a strong argument for minimizing delays and rejections, it does not by itself
mean that the claim is genuine. Litigation is a crucial
tool for insurance firms. As health insurance is vital to most clients, the decision to
classify a claim as fraud can potentially open the door to a host of lawsuits. These
lawsuits can be on behalf of either individual customers or a host of clients. It is
usually the company’s responsibility to justify its opinion that a claim is fraudulent.
Because legal standards for fraud may be different from the company’s internal
standards, ensuring that the company can win such cases can become complicated.
In addition, court costs in themselves can be prohibitive. The company may have to
follow different rules as prescribed by the law of the respective land. A large number
of cases or long-pending cases can also potentially damage the firm’s reputation.
Avoiding such challenges is the best bet. It is also important not to back off from
litigation when the occasion demands it, to prevent potential fraudsters from taking
advantage of the firm.
The transformation in the health insurance industry involves and requires influenc-
ing numerous stakeholders and market participants. These include5 the following:
1. Consumers, patients, caregivers, and patient advocacy organizations: These are
people experiencing the health problems who would be the beneficiaries of
various health services and treatments.
2. Clinicians and their professional associations: These are major medical
decision-makers; and their skills, experience, and expertise matter the most.
3. Healthcare institutions, such as hospital systems and medical clinics, and their
associations: Major healthcare decisions are structured and determined by
choices of institutional healthcare providers as they often have a broad view
on what is causing a particular health problem.
4. Purchasers and payers, such as employers and public and private insurers:
Coverage by insurer and purchaser of healthcare plays an important role in
diagnostic and treatment decisions and choices of insured.
5. Healthcare industry, pharmaceutical companies and drug manufacturers, and
industry associations: Manufacturers of drugs and treatment devices and their
suppliers and distributors influence the quality of healthcare services available
in a region.
5 Agency for Healthcare Research and Quality (AHRQ). 2014. Stakeholder Guide. https://www.
ahrq.gov/sites/default/files/publications/files/stakeholdr.pdf (accessed on Aug 17, 2018).
Insurance fraud is one of the largest sources of white-collar crime in the world,
meaning that significant police effort is also devoted to tracking and eliminating it.
However, given limited police resources and a universe of crime that encompasses
far more than just the white-collar variety, hard insurance fraud perpetrated by organized
criminals tends to be the focus of law enforcement. This leaves unorganized
hard fraud and a plethora of soft fraud to remain within the purview of insurance
companies.
854 D. Agrawal and S. Mamidipudi
The health insurance industry is no more immune to fraud than any other
insurance subsector. Experts estimate that about 6% of global healthcare spending
is lost to fraud annually.6 In a world in which trillions of dollars are spent on
healthcare by governments, nongovernmental organizations, and corporations alike,
this amounts to tens of billions lost to criminal enterprises. In the USA alone, fraud
is estimated to cause about US$80 billion in losses to the industry annually, with
property casualty fraud accounting for US$32 billion.7 These figures do not include
fraud perpetrated on Medicare and Medicaid.
Health insurance fraud is the act of providing misleading or false information
to a health insurance company in an attempt to have it pay a policyholder,
another party, or an entity providing services (PAIFPA 2017). An individual subscriber
can commit health insurance fraud by:
• Allowing someone else to use his or her identity and insurance information to
obtain healthcare services
• Using benefits to pay for prescriptions that were not prescribed by his or her
doctor
Healthcare providers can commit fraudulent acts (PAIFPA 2017) by:
• Billing for services, procedures, and/or supplies that were never rendered
• Charging for more expensive services than those actually provided
• Performing unnecessary services for the purpose of financial gain
• Misrepresenting non-covered treatments as a medical necessity
• Falsifying a patient’s diagnosis to justify tests, surgeries, or other procedures
• Billing each step of a single procedure as if it were a separate procedure
• Charging a patient more than the co-pay agreed to under the insurer’s terms
• Paying “kickbacks” for referral of motor vehicle accident victims for treatment
• Patients falsely claiming healthcare costs
• Individuals using false/stolen/borrowed documents to access healthcare
Tackling fraud is critical to the industry, especially with fraud becoming ever
more complex. By its nature, insurance fraud is difficult to detect, as its aim
is to be indistinguishable from genuine insurance claims. In each of the above
cases, identifying the fraud that has been perpetrated can be a laborious process,
consuming time and effort. Given that healthcare spending can be sudden, urgent,
and unexpected, checking for fraud can be a complicated process. Companies must
balance their financial constraints with the reality of healthcare situations.
According to an estimate of the US National Healthcare Anti-Fraud Association
(NHCAA),8 3% of all healthcare spending is lost to healthcare fraud (LexisNexis
2011). Financial fraud, including unlawful billing and false claims, is the most
common type of health insurance fraud and is generally tied to aspects of organization
and health information management (AHIMA Foundation 2010). Data mining tools
and predictive analytics techniques such as neural networks, memory-based
reasoning, and link analysis can be used to detect fraud in insurance claim data
(Bagde and Chaudhary 2016).
6 “The Health Care Fraud Challenge,” Global Health Care Anti-Fraud Network. http://www.ghcan.
Healthcare fraud leads to higher premium rates, increased expenses for consumers,
and reduced coverage. It increases the cost to employers of providing healthcare
insurance to their employees, which raises the cost of doing business. Besides
financial losses, fraudulent activities lead to the exploitation of people and
their exposure to unnecessary and unsafe medical procedures, which can have
devastating health side effects.
Detecting healthcare insurance fraud is a long drawn-out, complicated process
that costs companies time, effort, money, and the goodwill of their customers.
Modern technology and statistical software have helped to reduce this cost, but
it remains a significant burden on the resources of customer service departments
the world over. Healthcare insurance fraud-proofing and management strategies and
activities may include “improving data quality, building a data centric culture and
applying advanced data analytics.”9 These provide opportunity for significant cost
savings by the healthcare industry.
Innovations in insurance products and developments in information and
communication technologies can help insurers design tailor-made insurance products
with improved underwriting, pricing, and coverage options.
Technology and improved information systems can benefit stakeholders and market
participants and lead to improved welfare and consumer satisfaction.
In the past, the primary manner in which insurers detected fraud was to employ
claims agents who investigated suspicious claims. However, as data analytics
software gains prominence and becomes more powerful, firms are becoming more
able to identify patterns of abnormal behavior.
Fraud detection, however, must contend with the possibility of misidentifying
fraud. Allowing false claims to go through the system hurts the company’s profit
and increases premiums. Forcing genuine claimants to go through the fraud
detection process, however, increases costs and hurts customer satisfaction. As these
constraints are diametrically opposed, any attempt to curb one will tend to increase
the other.
7 Ideal’s Business
Ideal Insurance Inc. is one of the largest healthcare insurance providers in the
USA and other developed nations. It has expansion plans to enter into emerging
markets where penetration is much lower. Most of the underwriting work is done
in the US or the UK office. It has a large claim processing team in every country
where it operates, and back offices in other countries such as India, Singapore,
and Thailand. With increasing competition in the market, the company has
had to focus on quick settlement and the claims settlement ratio as well as the
profit margin. The company has been investing a significant amount in automating
the claim settlement process in order to increase the customer satisfaction rate,
reduce the length of the settlement cycle, and reduce the loss associated with claim
leakage due to potentially fraudulent claims. Table 25.1 shows some of the key
performance measures that
Sebastian was tracking. Though Ideal offers competitive premiums and maintains a
high claim settlement ratio and low repudiation ratio, its net promoter score (NPS)—
a metric of customer satisfaction—is significantly lower than the industry average.
The company has an automated system in place that reviews the basic infor-
mation of all the claims based on prespecified rules set by experts. The rules
are used to classify the claims into three categories, namely, Genuine, Discuss,
and Investigate. This information is passed to the claims settlement team to act
further. The claims classified as “Genuine” are processed on high priority with
a settlement turnaround time (TAT) of 7 working days. The claims classified as
“Discuss” are forwarded to the data collection team in order to collect more
granular information about the claims. Such claims usually take up to 30 days
to settle. The claims classified as “Investigate” are forwarded to the claim
investigation team for thorough enquiry of the stakeholders. These claims usually
take between 1 and 2 months or sometimes even more than 3 months for settlement
or closure based on the results of the investigations or if the claims are litigated.
Some customers file lawsuits when their claims are rejected, and the onus is then
on the company to prove that the claim is fraudulent. Anecdotal evidence suggested
that Ideal’s experienced claim settlement personnel did not completely trust the
current system. The feeling was that it was somewhat limited due to a “bureaucratic
approach.” They did their own checks, often uncovering fraud from claims identified
as Genuine by the current system. Rachel discussed this with the more senior
personnel to understand the root cause. The argument given was that “data maturity
is vital in uncovering fraud pattern and therefore we re-analyze the claims (even if
it is identified as Genuine) when more information is populated in the database.”
Also, their experience suggested that it is difficult to uncover professional soft fraud
which is usually identified only by thorough analysis or if there is any lead from an
outside stakeholder or by doing network analysis of other fraud claims.
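The rule-based triage described above can be pictured with a minimal sketch. The case does not disclose Ideal's actual rules, so the three rules, the thresholds, and the field names (claim_amount, days_since_policy_start, prior_claims) below are hypothetical and serve only to show how expert-set rules can map a claim to Genuine, Discuss, or Investigate.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    # Hypothetical fields; the real dataset has up to 33 parameters per claim.
    claim_amount: float
    days_since_policy_start: int
    prior_claims: int


def triage(claim: Claim) -> str:
    """Classify a claim with illustrative, expert-style rules (assumed, not Ideal's)."""
    red_flags = 0
    if claim.claim_amount > 10_000:          # assumed threshold: unusually large claim
        red_flags += 1
    if claim.days_since_policy_start < 30:   # assumed threshold: claim soon after issue
        red_flags += 1
    if claim.prior_claims >= 3:              # assumed threshold: frequent claimant
        red_flags += 1

    if red_flags >= 2:
        return "Investigate"
    if red_flags == 1:
        return "Discuss"
    return "Genuine"


print(triage(Claim(claim_amount=2_000, days_since_policy_start=400, prior_claims=0)))
print(triage(Claim(claim_amount=15_000, days_since_policy_start=10, prior_claims=0)))
```

A fixed rule set of this kind is easy to audit, but it only looks at the information available when the claim is filed, which is consistent with the senior personnel's remark that fraud patterns emerge as more data is populated.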
Rachel wanted feedback on the current processes and hired an independent
consultant to examine 100,000 claims from the historical claims data. The con-
sultant with the help of an investigation team did a thorough examination of
the provided claims and classified those claims as fraud or non-fraud. Out of
100,000 claims investigated by the consultant, they identified 21,692 as potentially
fraudulent claims and 78,308 as genuine claims. Comparing these results with
the previous settlement records of 100,000 claims showed that more than 90%
of these fraudulent claims were not identified by the existing automated system.
Claims of more than 6657 customers were delayed because of the investigation
process suggested by the current system. The average claim cost is around US$5000
while the cost of investigation is approximately US$500. The cost of investigation
was equally divided between the internal manpower cost and the external cost of
obtaining information specific to the claim. Thus, conservatively, Rachel estimated
that investigating a genuine claim leads to a loss of US$500 and increases customer
dissatisfaction due to the delay in settlement. It also reduces the efficiency of the
investigation team. Settling a potential fraudulent claim leads to a loss of US$5000
on average and negatively affects the premium pricing and the effectiveness of the
underwriting team.
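The figures in this paragraph can be turned into a rough cost estimate. The sketch below is an illustration under stated assumptions, not part of the consultant's report: it takes the miss rate as exactly 90% (the text says "more than 90%"), counts only the two losses Rachel names (US$500 per genuine claim investigated and US$5000 per fraudulent claim settled), and ignores any other cost. The break-even probability is the fraud probability above which investigating a claim has a lower expected loss than settling it under those two costs.

```python
# Figures quoted in the case text
N_FRAUD = 21_692            # claims the consultant classified as potentially fraudulent
N_DELAYED_GENUINE = 6_657   # genuine claims delayed by investigation (text: "more than 6657")
COST_FALSE_ALARM = 500      # US$ lost when a genuine claim is investigated
COST_MISSED_FRAUD = 5_000   # US$ lost when a fraudulent claim is settled


def misclassification_cost(n_missed_fraud: int, n_false_alarms: int) -> int:
    """Total loss in US$ from settled fraud plus needlessly investigated genuine claims."""
    return n_missed_fraud * COST_MISSED_FRAUD + n_false_alarms * COST_FALSE_ALARM


def break_even_probability(cost_false_alarm: float, cost_missed_fraud: float) -> float:
    """Fraud probability p at which p * cost_missed_fraud == (1 - p) * cost_false_alarm."""
    return cost_false_alarm / (cost_false_alarm + cost_missed_fraud)


# Assumption: exactly 90% of fraudulent claims were missed (a lower bound per the text).
missed = N_FRAUD * 9 // 10
print("missed fraudulent claims (at least):", missed)
print("estimated loss in US$ (at least):", misclassification_cost(missed, N_DELAYED_GENUINE))
print("break-even fraud probability:", round(break_even_probability(COST_FALSE_ALARM, COST_MISSED_FRAUD), 4))
```

Under these assumptions, a predictive model that outputs a fraud probability would flag a claim for investigation whenever that probability exceeds roughly 1 in 11, because a missed fraud costs ten times as much as a false alarm.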
The management team discussed the report with its claims settlement team
and sought advice from them on how to improve the processes. Several excellent
suggestions were gathered, such as monitoring each stage of the process instead
of the entire process, flagging issues overlooked by the current system, and using
past similar-claims data to verify the claim. The claims settlement team also
suggested hiring an analytics professional to build a fraud predictive model using
advanced data science techniques. They explained that this would not only help
in correctly identifying potential fraud claims but also in optimizing the efforts
of the claims investigation and settlement teams. They also mentioned that their
closest competitor had recently set up an analytics department, which was helping
in various aspects of business such as conduct of fraud analytics, predicting
claims, review of blacklisted stakeholders, effective underwriting, and developing
customized products.
Rachel turned to Raghavan, a recent hire who had graduated with a master’s
degree in business analytics from a famous school in South Central India. Raghavan
had expertise in analytics, specifically in the insurance domain. He was charged with
hiring professionals and supervising the project: building a predictive model to
identify potentially fraudulent claims among reported health insurance claims. This
solution, Rachel and Sebastian felt, would help not only in reducing the losses due to
fraud but also in improving efficiency and customer satisfaction and reducing the claim
settlement cycle.
Raghavan’s initial thought was to deliver a robust analytical solution that would
improve the fraudulent claim identification process at Ideal’s site without investing
much time and effort in the field at the early stage. Potentially fraudulent claims
could then be investigated more rigorously, while genuine claims could be settled
quickly. He co-opted Caroline Gladys, who had also recently graduated
in analytics from one of the premier business schools. She had been working with
the business intelligence team and now wanted to switch to the advanced analytics
team. Raghavan gave her the opportunity to work on this proof of concept and
deliver a solution.
Caroline through her experience within Ideal Insurance quickly created a sample
dataset at the transaction level for 100,000 health insurance claims. Each observa-
tion has up to 33 parameters related to the claim such as policy details, whether
there was a third-party administrator, demographic details, and claim details. The
complete details are shown in Appendix 1. Tables in Appendix 2 provide the coding
of variables such as product type, policy type, and mode of payment.
The data in the tables are collected by the transaction processing system (1) when
the policy is issued, (2) when a claim is recorded, and (3) while its progress is
tracked. The ERP system did a fairly good job of collecting the necessary data.
Custom software helped put together the information into tables and created reports
for further processing. Ideal had invested a great deal in automation of transactions
in the past and was looking to reap dividends from the reporting system.
Caroline also obtained the classification of claims as Fraud/Not Fraud examined
by the expert who had investigated 100,000 claims. The classification is shown
in Table 25.2. Additionally, the data in Table 25.3 provide the classification of
all 100,000 claims as Genuine, Discuss, and Investigate according to the current
automated system.
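One natural first step with these two classifications is to cross-tabulate them. The sketch below uses a handful of made-up labels, not the case data, and treats a fraudulent claim as "missed" when the automated system labelled it Genuine; whether a Discuss label should count as detection is a judgment call that the case leaves open.

```python
from collections import Counter


def cross_tabulate(system_labels, expert_labels):
    """Count claims for each (system label, expert label) pair."""
    return Counter(zip(system_labels, expert_labels))


def missed_fraud_share(table):
    """Share of expert-identified fraud that the system labelled Genuine (assumed definition)."""
    fraud_total = sum(n for (_, expert), n in table.items() if expert == "Fraud")
    missed = table[("Genuine", "Fraud")]
    return missed / fraud_total if fraud_total else 0.0


# Toy labels for illustration only; replace with the columns from the real dataset.
system = ["Genuine", "Genuine", "Discuss", "Investigate", "Genuine", "Investigate"]
expert = ["Not Fraud", "Fraud", "Fraud", "Fraud", "Not Fraud", "Not Fraud"]

table = cross_tabulate(system, expert)
for (system_label, expert_label), count in sorted(table.items()):
    print(f"{system_label:12s} {expert_label:10s} {count}")
print("share of fraud missed:", round(missed_fraud_share(table), 3))
```

The same table also gives the number of genuine claims sent to Investigate, which is the false-alarm count needed for any cost comparison between the current system and a predictive model.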
Caroline put together all the data in a dataset (ideal_insurance.csv; refer to
the website), along with detailed definitions of the available variables and the data
description required to decode categories such as product type, policy type, and
claim payment type.
Having collected all this information, Caroline was wondering how to begin
the analysis. Was predictive analytics superior to the expert system used by Ideal?
Would the experts who created the system as well as the senior settlement officers
readily accept the changes? She was also worried about the ongoing creation of rules
and maintenance of the system. That would cost significant investment in people and
technology, not to mention training, obtaining data, etc. She would have to clearly
convince the management that this was a worthwhile project to pursue!
All the datasets, code, and other material referred to in this section are available at
www.allaboutanalytics.net.
• Data 25.1: ideal_insurance.csv
Appendix 1
Appendix 2
References
AHIMA Foundation. (2010). A study of health care fraud and abuse: Implications for professionals
managing health information. Retrieved September 15, 2018, from
https://www.ahimafoundation.org/downloads/pdfs/Fraud%20and%20Abuse%20-%20final%2011-4-10.pdf.
Bagde, P. R., & Chaudhary, M. S. (2016). Analysis of fraud detection mechanism in health
insurance using statistical data mining techniques. International Journal of Computer Science
and Information Technologies, 7(2), 925–927.
LexisNexis. (2011). Bending the cost curve: Analytics driven enterprise fraud control. Retrieved
September 15, 2018, from
http://lexisnexis.com/risk/downloads/idm/bending-the-cost-curve-analytic-driven-enterprise-fraud-control.pdf.
PA Insurance Fraud Prevention Authority (PAIFPA). (2017). Health insurance fraud. Retrieved
September 15, 2018, from http://www.helpstopfraud.org/Types-of-Insurance-Fraud/Health.
Singhal, S., Finn, P., Schneider, T., Schaudel, F., Bruce, D., & Dash, P. (2016). Global
private payors: A trillion-euro growth industry. New York: McKinsey and Company.
Retrieved September 15, 2018, from
http://healthcare.mckinsey.com/sites/default/files/Global%20private%20payors%20%28updated%29.pdf.
Chapter 26
Case Study: AAA Airline
1 Introduction
Steven Thrush, Chief Revenue Officer of AAA Airline Corp, was concerned about
his company. The airline industry, buoyed by strong demand and low oil prices,
had been on an upswing for the last few years. Rising competition, however, had
begun to pressure AAA’s operations. Shifting market sentiment and an increasingly
complicated market meant that, for most customers, travelling to most destinations
in the USA now depended on a number of competing factors.
Moreover, the rise of low-cost carriers and online ticket comparison websites
had put immense downward pressure on ticket prices, squeezing the margins of
companies and forcing them to investigate new avenues of growth in order to
maintain their profitability.
Thrush had just returned from a conference focused on the application of
data science and analytics in the passenger transport industry. At the conference,
researchers and practitioners talked about the rapid advance of big data and its
power to understand and predict customer behavior. Thrush grasped that big data
could help his company move toward new models that took into account a dizzying
range of factors in order to make better decisions.
The case study is written by Agrawal, Deepak; Kollipara, Hemasri; and Mamidipudi, Soumithri
under the guidance of Professor Sridhar Seshadri. We would like to thank Coldren et al. (2003)
and Garrow (2010), whose work inspired this case study.
When Linda James, the company’s head of route planning, approached him to
ask about the feasibility of launching a New York–Boston flight, Thrush immedi-
ately thought about employing the customer choice models he had heard about in
order to understand the proposition. He asked his data team to use the company’s
database of its customers to understand the question of how well received a new
flight from New York to Boston would be. He knew that to answer such a question,
the team would also have to investigate many more issues such as what manner of
pricing would be most efficient, what type of aircraft would be most efficient, and
how best to reach new customers who might not otherwise fly AAA.
Settling on the correct approach to the problem, Thrush knew, would be the best
way to deliver the best service possible to customers while maximizing the profit of
his company.
AAA Airline Corp was founded in 2005, amid a sea change in the travel industry.
As Internet penetration grew and price comparison websites became increasingly
popular, AAA saw an opportunity for a low-cost carrier to capitalize on the
increased customer focus on prices.
Like many carriers founded in the wake of the online boom, AAA’s philosophy
was to compete purely on price. Instead of focusing on specific regional flights and
needs, AAA sought to identify and fill gaps in the market and in doing
so carve out a niche for itself. While most of its flights operated in a hub-and-spoke
system out of Boston Logan Airport, the company was not averse to operating point-
to-point routes that are the hallmark of low-cost carriers worldwide.
AAA’s initial method to identify which routes were profitable relied on a mix of
market research and intuition. AAA’s original management team consisted mostly
of industry veterans hailing from Massachusetts, and they were all well acquainted
with the needs of local customers. AAA’s in-depth expertise in its initial market
helped it survive where many of its rivals failed, prompting it to expand its offering
and plan for more ambitious growth.
By 2016, the size of AAA’s fleet had risen considerably, prompting Thrush’s
concern regarding its next steps. AAA’s history meant that it had access to a large
database of its own customers, which it had so far been using to forecast future
demand and traffic patterns. Thrush was keen to know, however, what new tools
could be used and datasets found in order to analyze the market and help the
company stride into the new era of commercial air travel.
26 Case Study: AAA Airline 865
The US airline industry had a capacity of more than 1.1 trillion available seat
miles (accounting for both domestic and international flights) in 2016 and is the
largest air travel market worldwide. The sector supplies nearly 3500 available
seat miles, a measure of carrying capacity, per person in North America, more than
double that of the industry in Europe.
The effects of the Airline Deregulation Act of 1978 are still being felt today.
Before the Act, American airline companies were strictly constrained by the Civil
Aeronautics Board, which was responsible for approving new routes and pricing.
The Board could give agreements between carriers anti-trust immunity if it felt it
was in the public interest. This resulted in a situation where airlines competed purely
on in-flight service and flight timings and frequency.
Legacy carriers—airlines founded before deregulation—are able to offer better
service and benefits such as loyalty schemes as a result of the environment in which
they operated at their founding. Airlines such as these tend to have larger planes and
operate in a hub-and-spoke system that means that their flights are largely based out
of a single airport.
After the industry was deregulated, airlines became free to decide what routes
to fly and what prices to offer. New low-cost carriers like AAA entered the market
and shifted the paradigm by which companies in the industry functioned, forcing
full-service airlines to adapt. Since 1978, more than 100 airline carriers have filed
for bankruptcy,1 underscoring the tumultuous nature of the industry.
The proliferation of the Internet was no less disruptive to the airline and travel
industries. Customers were more able than ever to compare flights, and their ability
to discriminate between a multitude of choices at the tap of a key left companies the
world over scrambling to keep up. This meant that companies such as AAA were
forced to use ever more complicated models in their attempts to understand and
predict customer demand while at the same time keeping track of their costs.
Thrush knew that AAA’s hub-and-spoke system helped to keep costs low and
enabled the airline to fly a large number of passengers. However, he was also aware
that hub airports were especially hard-hit by the increase in the number of passengers
using them, meaning that pressure on his staff and his operations was mounting
daily. The industry’s domestic load factor, the fraction of available seats that were
sold, had risen to 85% in 2016 from 66% in 1995.2 Domestic ASMs rose 29% in the
same period to 794,282 million. However, the sizes and capacities of hub airports had not
risen in line with this explosive growth in passengers due to property, environmental,
and financial constraints.
Traffic%20and%20Capacity/Domestic/Domestic%20Load%20Factor%20.htm, accessed on
Jul 15, 2017.
866 D. Agrawal et al.
The airline industry had so far tackled the problem of being able to supply its
customers with the flights they needed by looking to strategic alliances and code
sharing deals. Airlines that were part of the same alliance agreed to pool their
resources by agreeing to be located in the same terminals in hub airports, operating
flights under the banner of more than one carrier, and offering privileges to members
of partner airlines’ loyalty programs. By doing so, companies ensured that they did
not have to operate and fly every route their customers demanded.
“We need to consider whether it makes sense to abandon our spoke-and-hub
system. Our rivals that use point-to-point routes are eating into demand, and I’m
sure passengers are noticing the kind of queues that are building up in the larger
airports,” James told Thrush.
The airline industry uses three main types of data to interpret the environment in
which it operates—demand data, such as booking and ticketing; supply data, such as
schedules; and operational data, such as delays, cancellations, and check-ins. Thrush
found that data scientists used these databases to uncover traveler preferences and
understand their behavior.
The demand data in the industry comes from booking and ticketing databases. It
details a plethora of factors that affect customers while booking flights and takes
into account exactly what information is available to customers at the time of their
purchase. Supply data is usually accessible so that customers are able to identify
flights, but the industry’s main sources are schedules and guides provided by the
Official Airline Guide (OAG). These guides collate information including origin,
destination, trip length, and mileage for individual flights globally.3 Data regarding
the operational status of flights is usually available freely, though it is often not
granular. AAA, like its competitors, kept detailed records of operational data in
order to catch patterns of inefficiency. In addition, the US Department of Transportation
maintains a databank that consists of 10% of all flown tickets in the country.4 The
databank provides detailed ticketing, itinerary, and travel information and is freely
available for research purposes.
%20and%20Destination%20Survey%20%28DB1B%29&DB_Short_Name=Origin%20and
%20Destination%20Survey (accessed on Jun 24, 2018).
Thrush met with John Heavens, a data scientist and airline travel consultant, to
inquire further about the possibility of using advanced data models in order to
understand and forecast customer behavior.
Heavens told Thrush that the industry’s old time-series/probabilistic models had
become outdated. Multinomial logit (MNL) decision-choice models were the industry’s
mainstay tools in understanding consumer demand. These models broke itineraries
down by assigning utility values to each flight and attempting to determine which
factors were most valuable to customers. By observing the factors that affected
customer choices for each origin–destination pair, Thrush would be able to predict
with confidence where customers were looking to travel next.
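The way a multinomial logit model turns utilities into choice shares can be sketched in a few lines. In the standard MNL form, each itinerary i in an origin–destination pair gets a utility V_i that is a weighted sum of its attributes, and its predicted share is exp(V_i) divided by the sum of exp(V_j) over all itineraries j in that pair. The attribute names and coefficient values below are hypothetical and are not estimated from the case data.

```python
import math

# Hypothetical coefficients, chosen only for illustration (not estimated from the case data).
COEFFICIENTS = {"fare_usd": -0.01, "nonstop": 1.0, "morning_departure": 0.3}


def utility(attributes, coefficients=COEFFICIENTS):
    """Linear-in-parameters utility: sum of coefficient * attribute value."""
    return sum(coefficients[name] * value for name, value in attributes.items())


def mnl_shares(utilities):
    """Multinomial logit shares: exp(V_i) / sum_j exp(V_j), computed stably."""
    highest = max(utilities)
    weights = [math.exp(u - highest) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


# Three made-up itineraries competing on the same origin-destination pair.
itineraries = [
    {"fare_usd": 180, "nonstop": 1, "morning_departure": 1},
    {"fare_usd": 150, "nonstop": 0, "morning_departure": 1},
    {"fare_usd": 120, "nonstop": 0, "morning_departure": 0},
]
shares = mnl_shares([utility(it) for it in itineraries])
for itinerary, share in zip(itineraries, shares):
    print(itinerary, round(share, 3))
```

In practice the coefficients are estimated from observed itinerary choices, and their signs and relative sizes indicate which factors customers value most.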
However, Heavens also gave Thrush a third option. “Even the decision-choice
models are becoming old, and we’re moving in new directions now,” he said. The
consultant pointed out that the industry’s MNL models were essentially linear in
nature, and were not able to deal with factors that were correlated. In addition, their
rigid need for data input meant that they could not predict the demand for new routes
and new markets.
Instead, Heavens pointed to groundbreaking artificial intelligence research as
the vanguard of an array of new technological tools that could be used to predict
future demand. Techniques such as random forests, gradient-boosting machines, and
artificial neural networks were able to produce better out-of-sample results without
sacrificing in-sample goodness-of-fit. While these techniques lacked the readability
and simplicity of MNL models, they were ultimately more efficient.
After being presented with the models, Thrush knew he had a difficult decision to
make. Moving to new methods of analysis had clear advantages, yet the significant
investment in time and effort needed to be justified. Training and hiring employees
and conducting ongoing analysis would be a drain on the company’s resources.
Thrush looked to Hari Veerabhadram, the newest member of his team, to explain
to him exactly which models are best suited to understand customer preferences.
Hari knew that he had to explain how the models worked. He started thinking
about which variables in the models he would use and what would be the most
important. He knew that it would be crucial to explain why particular variables were
the most important and which model was better at predicting customer preferences.
Management always likes visual proof of analysis. Thus, he felt that he would need to
explain and compare the models both statistically (mean squared error, percentage
variance explained by model, etc.) and through visualization methods (predicted vs.
actual fit, training vs. validation results, etc.).
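The two statistical measures Hari mentions can be computed directly from actual and predicted values. The sketch below is a minimal illustration with made-up passenger counts; the function names are not from the case, and "percentage variance explained" is taken here to mean the usual R-squared, 1 minus the ratio of the residual sum of squares to the total sum of squares.

```python
def mean_squared_error(actual, predicted):
    """Average of squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def variance_explained(actual, predicted):
    """R-squared: 1 - SSE / SST, where SST is measured around the mean of actual."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - sse / sst


# Made-up passenger counts for four itineraries (illustration only).
actual = [10, 20, 30, 40]
model_a = [12, 18, 33, 37]   # hypothetical predictions from one model
model_b = [25, 25, 25, 25]   # a naive model that always predicts the mean

for name, predicted in [("model A", model_a), ("model B", model_b)]:
    print(name, mean_squared_error(actual, predicted), round(variance_explained(actual, predicted), 3))
```

Computing both measures separately on training and validation data gives the comparison Hari has in mind: a model whose validation error is much worse than its training error is likely overfitting.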
the identity of the airline, how many passengers chose that itinerary, what was
the offered aircraft type, departure day and time, service-level, best service level
available on that route, mileage, average fare, etc. For example, say, the O-D pair
“5” represents the New York to Los Angeles route, and Airline = “A” represents all
the itineraries offered by AAA Airline. Pick Itinerary ID “3.” This itinerary offers a
Small Propeller service on the route as a single connect option departing at 7 a.m.
from New York. Single connect is the best service possible on this route across all
the airlines serving that OD pair.
The basic summary statistics (Table 26.2) helped Veerabhadram to understand
the variability in the data.5 He observed that AAA Airline is one of the top
performing airlines, connecting a significant number of cities through single and
double connect itineraries. It also means that even a minor change in route can have
5 Refer to the website to download the data (csv and excel version).
All the datasets, code, and other material referred to in this section are available at
www.allaboutanalytics.net.
• Text 26.1: Airline Instruction manual.docx
• Data 26.1: AAA_Airline_dummy.csv
• Data 26.2: AAA_Airline_Template.xlsx
References
Coldren, G. M., Koppelman, F. S., Kasturirangan, K., & Mukherjee, A. (2003). Modeling
aggregate air-travel itinerary shares: Logit model development at a major US airline. Journal
of Air Transport Management, 9, 361–369.
Garrow, L. A. (2010). Discrete choice modelling and air travel demand: Theory and applications.
New York: Routledge.
Chapter 27
Case Study: InfoMedia Solutions
Hui Zhang had just returned from a workshop on sports and media analytics. One
of the speakers had described the convergence of media and how it had affected his
broadcast business in a very short time span. Hearing others mention the same set
of possibilities and with his own experience in the rapidly changing industry, Zhang
was convinced that an ever-increasing number of television viewers will, if they
haven’t done so already, “cut the cord” and move away from traditional viewing
platforms. It was this thought that Zhang had at the back of his mind when he
read a report predicting that viewership was splintering—more and more specialized
channels were sniping away at traditional shows and showtimes. On top of it, the
report mentioned that Internet advertising would overtake television and print media
in size and spend in the next 5 years. Zhang was concerned that new technologies
would threaten the position that his firm had built up in the TV advertising segment.
While millions of dollars were still expected to be spent on advertising on tradi-
tional television and cable channels, changes in viewership habits and demographics
would surely shift budgets to targeting different audiences through dedicated
channels. Moreover, the bundling strategy followed by cable TV companies allowed
audiences to quickly surf from one show to another! The change in resource
allocation of ad spends had not yet happened because of the ongoing debate
1 InfoMedia Solutions
glossary/ or https://www.bionic-ads.com/2016/03/reach-frequency-ratings-grps-impressions-cpp-
and-cpm-in-advertising/ (accessed on Aug 22, 2018). Refer to the article for the meaning of terms,
such as Reach, CPM, CPV, Cross-Channel, Daypart, GRP, OTT, PTV, and RTB.
2 http://money.cnn.com/2017/03/02/technology/snapchat-ipo/index.html (accessed on Jun 23,
2018)
27 Case Study: InfoMedia Solutions 875
The first issue Zhang faced was the problem that all advertisers had to tackle:
duplication. Broadcasting an advertisement (ad) ten times would be very efficient
if it were watched by ten different people. If only one person saw it all ten times,
however, the reach of the ad would become much smaller. In order to measure
reach effectively, Zhang would have to correctly account for redundant, duplicate
impressions that were inevitable in his line of business. Reach-1 measured the
number of unique viewers who saw the ad at least once.
Zhang also knew that most customers would not change their minds regarding
a new product the first time they heard about it. In order to create an image
in the customer’s mind about a product, Zhang would have to reach the same
customer multiple times. Thus, Zhang’s duplication issue was akin to a Goldilocks
problem—too few impressions would not result in a successful sale, but too many
would be wasteful and inefficient. Identifying how to deliver the correct number
of impressions in order to maximize reach would be at the core of the solution he
needed from Gershwin.
Depending on the target audience and the product, Zhang normally tracked
Reach-1, -3, or -5, meaning that an impression had been delivered to a potential
customer at least one, three, or five times. Understanding how many impressions
would be necessary for each product was crucial to efficiently use resources.
Gershwin’s next issue would be to identify the inflection point where duplication
was no longer useful to expanding reach.
While traditionally, overlap between media was not high enough to have a large
impact on duplication, the need to track duplicate impressions on digital, print, and
television was growing more and more important. The cross-platform approach,
in which the campaign played out in more than one medium in order to gain
impressions, was a burgeoning part of the sector that would surely grow. Gershwin
felt that the cross-platform duplication could be addressed later.
2 Reach in Advertising
free; a time slot that was able to deliver more reach would be more expensive. Zhang
thus would have to balance the cost of airing slots against the reach that those slots
could give him, by understanding his target audience. Reach-1, -3, -5, and -7 were
defined either in percentage terms or in the total number of unique viewers who saw
the ad at least one, three, five, and seven times. In the above example, these might be
45, 30, 15, and 7 million viewers or 75%, 50%, 25%, and 11.67%. Obviously, reach
will not exceed the size of the population. Reach-1 will be the largest followed by
the rest.
Reach-1, -3, -5, and -7 were known to be concave increasing functions of ad
spots, which is to say that as the number of spots increased, reach increased, but
the gain in reach from each additional spot diminished. In an ideal
world, Zhang would be able to buy exactly the number of spots at which the
marginal gain in reach times “a dollar value for reach” equaled the
“dollar value of a spot,” that is, the cost of one more spot. Gershwin thought one could
uncover the relationship between ad spots and reach using simulation.
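The balancing condition can be illustrated numerically. The sketch below is not part of the case: the exponential reach curve, the dollar value per million viewers, the cost per spot, and the search range are all hypothetical numbers, chosen only to show how one might locate the number of spots at which the marginal value of reach equals the cost of a spot.

```r
# Hypothetical concave reach curve (millions of viewers) as a function of spots.
# The functional form and every number below are assumptions for illustration.
reach <- function(g) 45 * (1 - exp(-0.02 * g))

value_per_million <- 20000   # assumed dollar value of one million viewers reached
cost_per_spot     <- 1500    # assumed dollar cost of one spot

# Net value of buying g spots: value of the reach minus the cost of the spots.
net_value <- function(g) value_per_million * reach(g) - cost_per_spot * g

# Maximize net value over an assumed range of 10 to 250 spots.
best   <- optimize(net_value, interval = c(10, 250), maximum = TRUE)
g_star <- best$maximum

# Marginal value of reach at g_star, approximated by a central finite difference.
h <- 1e-3
marginal_value <- value_per_million *
  (reach(g_star + h) - reach(g_star - h)) / (2 * h)
```

At the optimum, marginal_value should be close to cost_per_spot; buying beyond g_star would add less value in reach than each extra spot costs.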
Adding to the complication, a cross-platform approach would need to estimate
reach across multiple channels or media. This meant that the function used to
calculate reach would have multiple inputs to track.
Guessing the appropriate ad spots target would prove tricky. Gershwin’s solution
would have to help solve this problem more efficiently if it were to improve
InfoMedia’s bottom line.
3 Rooster Biscuits
Zhang already had a customer in mind for the first test of the new approach.
Rooster Confectionery, a biscuit- and cereal-maker, wanted to launch a new brand
of chocolate-coated biscuits. Rooster’s chief operating officer, Joni Fernandez, had
told Zhang she wanted to focus on a younger target audience (age group 20–35)
with the product—exactly the kind of customer who would be shifting to different
avenues of media consumption. Zhang felt that if Gershwin’s approach could help
to optimize Rooster Biscuits’ campaign, it would bode well for the use of novel
techniques.
However, Fernandez had tight programming and budget constraints. She had
already informed Zhang that “Rooster expected that the channels would show at
least 20% of its ads during ‘prime-time’ slots—between 8 pm and 11 pm and at least
30% over weekends.” Moreover, while she expected a long-term ad campaign, she
wanted Zhang to run a short, 1-week version first to test the market’s reaction to the
new product. She informed him that the test campaign should target the two biggest
cable channels aimed at the 18–34 P (P = person) demographic—The Animation
Channel and the Sports 500 Network.
4 Blanc Solutions
(r,g) data points, where r is the reach and g is the number of spots shown. Based on
his experience, Zhang suggested to start with quadratic and cubic fits.
Blanc planned to use the viewership data provided by the third-party aggregator
to simulate reach and thus obtain the data points. Zhang had told him that InfoMedia
intended to air between 10 and 250 ads in the 7-day period (1 week) of the campaign.
The data was collected by a survey of households that viewed the two channels,
conducted during a week that was as similar as possible to the target week. In the
survey, viewers were asked a number of demographic questions. Blanc would be
able to obtain the data regarding what channel they viewed, and for how long, from
the cable companies themselves. These two sources, when combined, would give
Blanc most of the necessary data.
Broadcasters divided airtime on their channels into 6-min slots, that is, Blanc
had 1680 potential slots in the week to air the ad (10 per hour × 24 hours × 7 days per
week). Simulation would involve throwing the ads randomly into these 1680 slots
and computing GRP and reach. Blanc would first choose the number of ads to
air. He would then simulate showing that many ads in randomly chosen slots and use the
viewership data to determine how many people were watching each slot chosen by
the simulation. He could also estimate whether a viewer watched at least once, at
least thrice, etc. Doing this repeatedly for different numbers of ads would give a
“reach versus spots shown” set of data points. Then, he would fit a curve through
these points to obtain the relationship.
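One run of the random placement described above can be sketched as follows. This is a minimal illustration and not the code in Infomedia_simulation.R; the coding of days 6 and 7 as the weekend, the number of ads, and the seed are assumptions.

```r
# Build the weekly grid of 6-minute slots: 10 per hour x 24 hours x 7 days.
slots   <- expand.grid(minute = seq(0, 54, by = 6), hour = 0:23, day = 1:7)
n_slots <- nrow(slots)   # 1680 slots

# Flag prime time (8 pm to 11 pm) and weekends.
# Assumption: days 6 and 7 are the weekend; the case does not state the day coding.
slots$prime   <- slots$hour >= 20 & slots$hour < 23
slots$weekend <- slots$day %in% c(6, 7)

# One simulation run: place n_ads ads in distinct, randomly chosen slots.
set.seed(1)              # fixed seed so the run is reproducible
n_ads  <- 50             # assumed number of ads for this run
chosen <- slots[sample(n_slots, n_ads), ]

# Share of the chosen slots that fall in prime time and on weekends.
share_prime   <- mean(chosen$prime)
share_weekend <- mean(chosen$weekend)
```

The two shares can be compared with the campaign's prime-time and weekend requirements; a run that violates them would be redrawn or sampled by segment instead.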
Yet there was a key element to be added to his dataset. The viewership data he
had was only a sample of the total population. It was necessary to add a unique
weight to each viewer in his set—a measure of the proportion of viewers in the
population that were similar to the selected viewer—to convert the sample numbers
to the population number. For example, if a viewer in the 18–34 P had watched at
least thrice and if this viewer’s weight were 2345, then he would estimate that 2345
viewers had watched at least thrice in the population. Adding these numbers viewer-
by-viewer would give an estimate for the GRP and reach. Blanc would thus be able
to determine the reach of any combination of slots selected by the simulation by
multiplying each viewer who saw the ad by the weight of that viewer. Moreover,
by tracking the number of views by the same consumer, multiplying by the weight,
and adding up across viewers, he could calculate not just Reach-1 but Reach-3 and
Reach-5 as well.
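The weighting step can be sketched on a toy viewing log. The viewer ids and the exposure counts below are made up for illustration; only the idea of summing population weights over viewers with at least k exposures comes from the text.

```r
# Toy viewing log: one row per (viewer, ad exposure). All values are made up.
views <- data.frame(
  cust_id = c(101, 101, 101, 102, 103, 103, 103, 103, 103),
  weight  = c(2345, 2345, 2345, 5361, 3427, 3427, 3427, 3427, 3427)
)

# Exposures per viewer and each viewer's population weight.
# table() and tapply() both order their results by sorted cust_id.
n_views <- as.vector(table(views$cust_id))
weights <- as.vector(tapply(views$weight, views$cust_id, function(w) w[1]))

# Reach-k: sum of the weights of viewers who saw the ad at least k times.
reach_k <- function(k) sum(weights[n_views >= k])

reach_1 <- reach_k(1)   # all three viewers count
reach_3 <- reach_k(3)   # viewers 101 and 103
reach_5 <- reach_k(5)   # viewer 103 only
```

Here viewer 101 saw the ad three times, viewer 102 once, and viewer 103 five times, so the three reach figures are 11133, 5772, and 3427.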
Using multiple simulations,4 he would be able to obtain a robust set of data
that he could use to derive the reach curve. He could then fit a polynomial curve
as explained above. The data science team constructed a simulator that produces the
reach given the number of ads to be shown and the constraints on when they are to
be shown. The help file, interface, and sample outputs are shown in Appendix 3.
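The quadratic and cubic fits Zhang suggested can be sketched with lm(). The data below are synthetic stand-ins for simulator output; the generating curve, the noise level, and the column names g and r are assumptions made for illustration.

```r
# Synthetic (spots, reach) points standing in for simulator output.
# The generating curve and the noise level are assumptions, not case data.
set.seed(42)
g   <- rep(seq(10, 250, by = 10), each = 5)                 # spots shown
r   <- 45 * (1 - exp(-0.015 * g)) + rnorm(length(g), sd = 0.5)  # reach (millions)
sim <- data.frame(g = g, r = r)

# Quadratic and cubic polynomial fits of reach on spots.
fit2 <- lm(r ~ poly(g, 2, raw = TRUE), data = sim)
fit3 <- lm(r ~ poly(g, 3, raw = TRUE), data = sim)

# Compare goodness of fit and predict reach for 100 spots.
r2 <- c(quadratic = summary(fit2)$r.squared,
        cubic     = summary(fit3)$r.squared)
pred_100 <- predict(fit3, newdata = data.frame(g = 100))
```

Because the quadratic model is nested in the cubic one, the cubic R-squared can never be lower; whether the extra term is worth keeping is better judged on held-out simulation runs.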
Task 2: Develop the estimate.
4 Blanc took help from a simulation expert in his data science team who provided him the code
“Infomedia_simulation.R” (refer to Appendix 3) to run the simulation to calculate Reach for each
simulation given the constraints.
Blanc decided to partition the data set by time and day of the week and to use
that information to improve the prediction. He had several options: divide each
day into 3-h buckets, divide each day into regular and prime time, or simply
divide regular/prime time over weekdays and weekends. Starting the 3-h buckets
at 2 am would ensure that the “prime-time” 8 pm–11 pm period fell into a bucket
of its own, so that its effect could be understood separately. Moreover, he also
placed each day into its own bucket, to better understand the difference in
viewership between weekdays and weekends. See Appendix 3 for examples of these
data collection methods and how these are reflected in the output produced by
the simulator. He reviewed the contract terms and legal notes to check whether
there was any “no show”/“blackout” period and found no such restriction for
Rooster’s campaign. A “no show” or “blackout” period bars the broadcaster from
showing a commercial during a particular time window, either because the
customer has asked that its ad not be aired then (on the assumption that its
target customers will not be watching) or because regulations prohibit showing
specific commercials at those times. In the future, the estimate could be
refined by using the variation in viewers’ demographics, such as age and gender,
and other variables such as the average time spent by segment.
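The 3-h bucketing that starts at 2 am can be sketched as a small helper. The function name time_bucket is hypothetical, and the sketch assumes that clock times are stored as HHMM integers on a 24-h clock, as the Ad_time values in the sample data appear to be.

```r
# Map an HHMM clock time (e.g., 1342 or 2106) to a 3-hour bucket starting at 2 am.
# Bucket 1 = 02:00-04:59, ..., bucket 7 = 20:00-22:59 (prime time),
# bucket 8 = 23:00-01:59 (wraps past midnight).
time_bucket <- function(hhmm) {
  hour <- hhmm %/% 100            # drop the minutes
  ((hour - 2) %% 24) %/% 3 + 1    # shift so that 2 am starts bucket 1
}

ad_time <- c(1342, 2100, 2124, 130, 2305, 200)
buckets <- time_bucket(ad_time)

# With this alignment, a slot is prime time exactly when it is in bucket 7.
is_prime <- buckets == 7
```

For the six example times the buckets are 4, 7, 7, 8, 8, and 1, so the two evening slots are the only ones flagged as prime time.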
Task 3: Optimal spending.
Once Zhang had the information at hand and built up confidence in the model, the
decision he had to make was clear. Blanc’s own survey had informed him that the
overlap between the viewers of the two channels seemed negligible. Thus, Zhang
had to determine an optimal spending pattern for Rooster’s campaign. How many
ads would he run on each channel? How many would he schedule at each time of
day and on each day of the week? For demonstration purposes, Blanc thought he could use the previous
campaign whose data is shown in Appendix 1 to demonstrate how the new method
could work.
He knew he had a difficult task ahead explaining which variables were important
in predicting reach. Could he convince his clients about the findings? He thought
there were two ways of going about doing this—(a) explain the model very carefully
to the client and (b) show how it can be used to increase the reach without increasing
the budget.
All the datasets, code, and other material referred to in this section are available at
www.allaboutanalytics.net.
• Data 27.1: infomedia_ch1.csv
• Data 27.2: infomedia_ch2.csv
• Code 27.1: Infomedia_simulation.R
880 D. Agrawal et al.
Ex. 27.1 Review the simulator description in Appendix 3 and pseudo-code provided
in Appendix 4. Generate data using the simulator for both the channels. Report the
model fit as a function of the total number of spots as well as the other explanatory
variables. Visualize the reach curve for Reach-1, -3, -5, for both the channels.
Ex. 27.2 Using the campaign information/constraints and the model obtained in Ex.
27.1, demonstrate that a better allocation across channels, weekdays/weekends, and
time-of-the-day can yield higher reach. (Maximum allocated budget is $300,000 for
1 week.) Show optimal allocations for each channel separately, as well as together,
for Reach-3. Calculate the total spend for each allocation, based on the pricing
details given. Use Reach-3 for your final recommendation.
Ex. 27.3 The advertiser realizes that between 2 am and 5 am, very few of its
target customers watch TV and therefore adds a blackout
window of 3 h every day. Would this change your analysis?
Ex. 27.4 Due to the increasing demand and limited broadcasting slots, the broad-
caster is considering offering dynamic pricing. The broadcaster may redefine
prime-time concept and significantly change its media marketing strategy. Suggest
a new strategy if the broadcaster moves to dynamic pricing.
Ex. 27.5 How can this approach be used when more and more viewers switch to the
Internet?
Ex. 27.6 What would happen if the advertiser demands not to broadcast the
commercial alongside other similar commercials? What other practical constraints
do you see being imposed on a schedule by an advertiser?
Ex. 27.7 What would happen if the broadcaster repeats the ad within the same
slot/commercial break (known as a “pod” in the TV ad industry)? Discuss how it
may impact viewership of the target segment and whether you need to change your
strategy.
Table 27.2 Sample observations from input dataset (top five rows from infomedia_ch1.csv)
Day  Ad_time  Start  End   Cust_id  P_id  Age  Sex  Population_wgt  Channel
2    1342     1308   1355  70,953   1     59   M    3427            1
2    1348     1308   1355  70,953   1     59   M    3427            1
1    2100     2100   2109  79,828   1     23   F    5361            1
1    2106     2100   2109  79,828   1     23   F    5361            1
1    2124     2120   2124  79,828   1     23   F    5361            1
Help File
Please refer to the code “Infomedia_simulation.R” to run the simulation for each
channel separately. The simulation function will ask for the following information:
Enter full path of datasource folder:
< Copy and paste the path name as it is. Please make sure you paste in R console
not the editor.>
Enter dataset name (including .csv):
< Datasets infomedia_ch1.csv and infomedia_ch2.csv correspond to channels
1 and 2 respectively. Separate simulation is needed for each channel to quickly
get results. Enter the source file name (infomedia_ch1 (or 2).csv) including file
extension (.csv). Please note that R is case sensitive—spell the file name correctly.>
Enter the minimum number of slots (typically 5) :
< the number of slots to begin the simulation>
Enter the maximum number of slots (typically 250):
< the number of slots to end the simulation>
Enter the incremental number of slots (typically 5):
< step size, that is, minimum, minimum + stepsize, minimum + 2* stepsize, . . .
will be simulated>
Enter the number of simulation to run for each spot (typically 100):
< the number of replications—too many will slow down the system>
Minimum percentage slots in prime time [0-100]:
< must be an integer, typically between 20 and 30>
Maximum percentage slots in prime time [0-100]:
< must be an integer, typically between 20 and 30>
Minimum percentage slots on weekends [0-100]:
< must be an integer, typically between 20 and 30>
Maximum percentage slots on weekends [0-100]:
< Must be an integer, typically between 20 and 30>
Once you enter all the inputs correctly, the simulation function will run the
simulation for the requested channel given the constraints and share the two output
files: (a) a data file that gives the reach for the number of spots in weekday
non-prime, weekday prime, weekend non-prime, and weekend prime time and (b) a
png file that shows the reach curve against the total number of spots.
Output files (csv and png files) will be saved in the current directory as shown in
the code output.
> simulation()
Enter full path of datasource folder: D:\MyData\InfoMedia_Solutions
*******************************************************************
Note: Please enter all the values below as positive integer only.
*******************************************************************
*****************************************************************************
The simulation is successfully completed.
You can refer to below files (csv and image) in current working directory.
Please use this data to fit the curve and for further analysis.
*****************************************************************************
>
The simulation function developed by the data science team produces the datasets
“simulation_ch1.csv” and “simulation_ch2.csv”, one for each channel. The sample
output is shown in Table 27.3.
Fig. 27.2 Reach (R-1,-3,-5) vs. spots for channel 1 and 2 (sample output of simulation function)
1. First, we identify the constraints we would place such as prime vs. non-prime
time spots, weekday vs. weekend, blackout zone, and number of spots on each
channel.
2. Create a data frame with unique spots and days available.
3. Simulation exercise (Task-1): You can either use the simulation function provided
with the case or develop your custom function using the steps below (3a–3c):
(a) In each run, take a random sample of slots subject to the constraints in step 1,
using the sample() function: sample(x, size), where x is the vector to choose
from and size is the number of items to choose.
(b) Merge this data frame with the actual data set on time and day
using the merge() function. Step 3a generates the samples (to
simulate the runs), and the merge identifies the viewers of the sampled slots.
(c) Now count, at the customer level, how many times each customer
viewed the ad, using the count() function in the plyr library.
4. Fitting the curve (Task-2)
(a) Now we can calculate the total reach of the ad from the distinct viewers
who watched the ad at least once (R1), at least
thrice (R3), or at least five times (R5). The total reach is the sum of the
population weight column over these viewers; each weight represents the number
of similar customers in the population. Plot the total reach against the number
of spots (say, varying total spots between 5 and 250 in steps of 5 or 10).
(b) We fit the curve to estimate the average total reach for a given slot size. You
will need one curve each for R-1, R-3, R-5, etc.
Reach = f (slots) + ε
(c) Ensemble the above models to better estimate the total reach.
6. Optimization (Task-3): Now, we optimize the total reach using nonlinear opti-
mization techniques (refer to Chap. 11 Optimization):
(a) Objective function: Maximize appropriate reach
(b) Constraints
• Number of weekend spots out of total spots (weekend regular + weekend
prime)
• Number of prime-time spots out of total spots (weekday prime + weekend
prime)
• Available budget for the advertisement (total, channel-wise)
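The optimization step can be sketched for one channel. In place of the nonlinear optimization techniques the step refers to, the sketch below simply enumerates a coarse grid of allocations, which is easier to follow but does not scale. The reach model, its coefficients, and the spot prices are hypothetical; only the 20% prime-time share, the 30% weekend share, and the $300,000 weekly budget come from the case.

```r
# Hypothetical fitted Reach-3 model (millions) for one channel, concave in the
# number of spots in each segment. Coefficients and prices are assumptions.
segments <- c("wd_reg", "wd_prime", "we_reg", "we_prime")
coef_a   <- c(wd_reg = 0.8, wd_prime = 1.6, we_reg = 1.0, we_prime = 2.0)
price    <- c(wd_reg = 1000, wd_prime = 3000, we_reg = 1500, we_prime = 4000)
reach3   <- function(n) sum(coef_a * sqrt(n))

budget <- 300000   # weekly budget from Ex. 27.2

# Enumerate allocations in steps of 10 spots per segment.
grid <- expand.grid(wd_reg = seq(0, 100, 10), wd_prime = seq(0, 100, 10),
                    we_reg = seq(0, 100, 10), we_prime = seq(0, 100, 10))
total   <- rowSums(grid)
prime   <- grid$wd_prime + grid$we_prime
weekend <- grid$we_reg + grid$we_prime
cost    <- as.vector(as.matrix(grid) %*% price)

# Keep allocations that satisfy the campaign constraints.
ok <- total > 0 & prime >= 0.20 * total & weekend >= 0.30 * total & cost <= budget
feasible <- grid[ok, ]
feasible$reach <- apply(feasible[, segments], 1, reach3)

# Best feasible allocation on the grid.
best <- feasible[which.max(feasible$reach), ]
```

With a fitted curve from the simulator in place of reach3 and the actual price list in place of price, the same structure gives a baseline against which a proper nonlinear solver can be checked.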
Reference
Goerg, M. (2014). Estimating reach curves from one data point. Google Inc. Retrieved June 23,
2018, from https://ai.google/research/pubs/pub43218.
Chapter 28
Introduction to R
1 Introduction
As the adoption of data science in industry increases, the demand for data
scientists has been increasing at an astonishing pace. Data scientists are a rare breed
of “unicorns” who are required to be omniscient, and, according to popular culture,
a data scientist is someone who knows more statistics than a programmer and more
programming than a statistician. One of the most important tools in a data scientist’s
toolkit is the knowledge of a general-purpose programming language that enables
a data scientist to perform tasks of data cleaning, data manipulation, and statistical
analysis with ease. Such requirements call for programming languages that are easy
enough to learn and yet powerful enough to accomplish complex coding tasks. Two
such de facto programming languages for data science used in the industry and
academia are Python and R.
In this chapter, we focus on one of the most popular programming languages for
data science—R (refer to Chap. 29 for Python). Though we do not aim to cover
comprehensively all topics of R, we aim to provide enough material to provide
a basic introduction to R so that you could start working with it for your daily
programming tasks. A detailed knowledge of R can be gained through an excellent
collection of books and online resources. Although prior programming experience
is helpful, this chapter does not require any prior knowledge of programming.
P. Taori
London Business School, London, UK
e-mail: taori.peeyush@gmail.com
H. K. Dasararaju
Indian School of Business, Hyderabad, Telangana, India
1.1 What Is R?
As stated at the start of this chapter, R is one of the de facto languages when it comes
to data science. There are a number of reasons why R is such a popular language
among data scientists. Some of those reasons are listed below:
• R is a high-level general-purpose programming language that can be used for
varied programming tasks such as web scraping, data gathering, data cleaning
and manipulation, and website development and for statistical analysis and
machine learning purposes.
• R is a language that is designed mostly for non-programmers and hence is easy
to learn and implement.
• R is an open-source programming language. This implies that a large community
of developers contributes continually to the R ecosystem.
• R is easily extensible and enjoys active contribution from thousands of
developers across the world. This implies that most programming tasks can
be handled by simply calling functions in one of the packages that developers
have contributed. This reduces the need for writing hundreds of lines of code
and makes development easier and faster.
• R is an interpreted language that is platform independent. As compared to some
of the other programming languages, you do not have to worry about underlying
hardware on which the code is going to run. Platform independence essentially
ensures that your code will run in the same manner on any platform/hardware
that is supported by R.
28 Introduction to R 891
1.3 Limits of R
2 Chapter Plan
In this section, we describe the R programming language and use the features and
packages present in the language for data science-related purposes. Specifically, we
will be learning the language constructs, programming in R, how to use these basic
constructs to perform data cleaning, processing, and data manipulation tasks, and
use packages developed by the scientific community to perform data analysis. In
addition to working with structured (numerical) data, we will also be learning about
how to work with unstructured (textual) data as R has a lot of features to deal with
both the domains in an efficient manner.
We will start with discussion about the basic constructs of the language such
as operators, data types, conditional statements, and functions, and later we will
discuss specific packages that are relevant for data analysis and research purpose.
In each section, we will discuss a topic along with related code snippets and
exercises.
2.1 Installation
There are multiple ways in which you can work with R. In addition to the basic
R environment (that provides R kernel as well as a GUI-based editor to write
and execute code statements), most people prefer to work with an Integrated
Development Environment (IDE) for R. One such free and popular environment
is RStudio. In this subsection, we will demonstrate how you can install both R and
RStudio.
892 P. Taori and H. K. Dasararaju
When you work in a team environment or if your project grows in size, it is often
recommended to use an IDE. Working with an IDE greatly simplifies the task of
developing, testing, deploying, and managing your project in one place. You can
choose to use any IDE that suits your specific needs.
2.2 R Installation
2.3 R Studio
RStudio is a free and open-source IDE for R programming language. You can install
RStudio by going to the following website:
www.rstudio.com
Once at the website, download the specific installation of RStudio for your
operating system. RStudio is available for Windows-, Mac OSX-, and Linux-based
systems. RStudio requires that you have installed R first so you would first need to
install R before installing RStudio. Most of the RStudio installations come with a
GUI-based installer that makes installation easy. Follow the on-screen instructions
to install RStudio on your operating system.
Once you have installed RStudio, an RStudio icon will be created on the Desktop
of your computer. Simply double-click the icon to launch the RStudio environment.
There are four major components in the RStudio distribution:
1. A text editor at the top left-hand corner. This is where you can write your R code
and execute it using the Run button.
2. Integrated R console at the bottom left-hand corner. You can view the output of
code execution in this pane and can also write individual R commands here.
3. R environment at the top right-hand corner. This pane allows you to have a quick
look at existing datasets and variables in your working R environment.
4. Miscellaneous pane at the bottom right-hand corner. This pane has multiple
tabs and provides a range of functionalities. Two of the most important tabs in
this pane are Plots and Packages. Plots allow you to view the plots from code
execution. In the packages tab, you can view and install R packages by simply
typing the package name (Fig. 28.1).
In addition to core R, most of the time you will need packages in R to get
your work done. Packages are one of the most important components of the R
ecosystem. You will be using packages continuously throughout the course and
in your professional life. A good thing about R packages is that you can find most
of them in a single repository: CRAN. In RStudio, click on the Packages tab
and then click on Install. A new window will open where you can start typing the
name of the R package that you want to install. If the package exists in the CRAN
repository, then you will find the corresponding name. After that, simply click on
Install to install the R package and its dependencies as well. This is one of the easiest
ways to install and manage packages in your R distribution.
Alternatively, you can install a package from the command prompt as well by
using install.packages command. For example, if you type the following command,
it will install “e1071” package in R:
> install.packages("e1071")
A not so good thing about R packages is that there is not a single place where you
will get a list of all packages in R and what they do. In such cases, reading online
documentation of R packages is the best way. You can search for specific packages
and their documentation on the CRAN website. Thankfully, you will need only a
handful of packages to get most of your daily work done.
In order to view contents of the package, type:
> library(help=e1071)
This will give you a description about the package, as well as all available
datasets and functions within that package. For example, the above command will
produce the following output:
Information on package ‘e1071’
Description:
Package: e1071
Version: 1.6-8
Title: Misc Functions of the Department of
Statistics, Probability Theory
Group (Formerly: E1071), TU Wien
Imports: graphics, grDevices, class, stats, methods,
utils
Suggests: cluster, mlbench, nnet, randomForest, rpart,
SparseM, xtable, Matrix,
MASS
Authors@R: c(person(given = "David", family = "Meyer",
role = c("aut", "cre"),
email = "David.Meyer@R-project.org"),
person(given = "Evgenia",
family = "Dimitriadou", role = c("aut",
"cph")), person(given =
"Kurt", family = "Hornik", role = "aut"),
person(given = "Andreas",
family = "Weingessel", role = "aut"), person
(given = "Friedrich",
family = "Leisch", role = "aut"), person
(given = "Chih-Chung", family
= "Chang", role = c("ctb","cph"), comment =
"libsvm C++-code"),
person(given = "Chih-Chen", family = "Lin",
role = c("ctb","cph"),
comment = "libsvm C++-code"))
Description: Functions for latent class analysis, short
time Fourier transform,
fuzzy clustering, support vector machines,
shortest path computation,
bagged clustering, naive Bayes classifier,
...
License: GPL-2
LazyLoad: yes
NeedsCompilation: yes
Packaged: 2017-02-01 16:13:21 UTC; meyer
Author: David Meyer [aut, cre], Evgenia Dimitriadou
[aut, cph], Kurt Hornik
[aut], Andreas Weingessel [aut], Friedrich
Leisch [aut], Chih-Chung
Chang [ctb, cph] (libsvm C++-code),
Chih-Chen Lin [ctb, cph] (libsvm
C++-code)
Maintainer: David Meyer <David.Meyer@R-project.org>
Repository: CRAN
Index:
The simplest way to get help in R is to click on the Help button on the toolbar.
Alternatively, if you know the name of the function you want help with, you just
type a question mark “?” at the command line prompt followed by the name of
the function. For example, the first two commands below give you a description of
the function solve, and the third describes read.table.
> help(solve)
> ?solve
> ?read.table
Sometimes you cannot remember the precise name of the function, but you know
the subject on which you want help (e.g., data input in this case). Use the help.search
function (without a question mark) with your query in double quotes like this:
> help.search("data input")
Other useful functions are “find” and “apropos.” The “find” function tells you
what package something is in:
> find("lowess")
On the other hand, “apropos” returns a character vector giving the names of all
objects in the search list that match your (potentially partial) enquiry:
> apropos("lm")
As of this writing (June 16, 2018), the latest version of R available is version 3.5. However,
in this chapter, we demonstrate all the R code examples using version 3.2 as it is one
of the most widely used versions. While there are no drastic differences in the two
versions, there may be some minor differences that need to be kept in mind while
developing the code.
3.1 Programming in R
Before we get started with coding in R, it is always a good idea to set your working
directory in R. The working directory can be any normal directory on your file
system, and it is in this directory that all of the datasets produced will be saved. The
default working directory depends on how R was started and on your operating
system. You can get the current working directory by typing the following command:
> getwd()
It will produce the output similar to the one below:
[1] "/Users/rdirectory"
In order to change working directory, use setwd() command with the directory
name as the argument:
> setwd("/Users/anotherRDirectory")
This command will make the new directory your working directory.
There are two ways to write code in R: script and interactive. The script mode is
the one that most programmers would be familiar with, that is, all of the R
code is written in one text file and the file is then executed by an R interpreter.
By convention, R code files have a .R extension, which signals that the file
contains R code. In the interactive mode, instead of writing all of the code together
in one file, individual snippets of code are written in a command line shell and executed.
The benefit of the interactive mode is that it gives immediate feedback for each
statement and makes program development more efficient. A typical practice is to
first write snippets of code in the interactive mode to test for functionality and then
bundle all pieces of code in a .R file (script mode). RStudio provides access
to both modes. The top-left pane, the text editor, is where you can type code in
script mode and run all or some part of it. In order to run a file, just click on the
<Run> button in the menu bar, and R will execute the code contained in the file.
The bottom-left pane, the console, acts as the R interactive shell. In interactive
mode, what you type is immediately executed. For example, typing 1 + 1 will
respond with 2.
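Script mode can also be tried without RStudio. The sketch below writes a two-line script to a temporary file and runs it with source(); the file name hello.R and the variable names are hypothetical, and the script is evaluated in its own environment only to keep the example self-contained.

```r
# Write a tiny script to a temporary .R file.
script_path <- file.path(tempdir(), "hello.R")
writeLines(c("x <- 1 + 1", "print(x)"), script_path)

# Run the whole file (script mode); here it is evaluated in a separate environment.
env <- new.env()
source(script_path, local = env)

# Objects created by the script are available afterwards.
result <- get("x", envir = env)

# From a terminal, the same file could be run with:  Rscript hello.R
```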
Let us now get started with understanding the syntax of R. The first thing to note
about R is that it is a case-sensitive language. Thus, variable1 and VARIABLE1 are
two different constructs in R. While we saw in the other languages such as Python
that indentation is one of the biggest changes that users have to grapple with, there is
no such requirement of indentations in R. The code simply flows, and you can either
terminate the code with a semicolon or simply start writing a new code from a new
line, and R will understand that perfectly. We will delve further into these features as
we move to further sections.
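A short illustration of case sensitivity and of the two ways to separate statements (the variable names are arbitrary):

```r
# Names that differ only in case refer to different variables.
variable1 <- 10
VARIABLE1 <- 20
same_value <- variable1 == VARIABLE1   # FALSE: two distinct variables

# Statements may be separated by semicolons or by new lines.
a <- 1; b <- 2

# Indentation is not significant; an incomplete expression simply
# continues on the next line.
total <- a +
      b
```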
3.3 Calculations
[1] 3.912023
> 5+3
[1] 8
Multiple expressions can be placed in single line but have to be separated by
semicolons.
> log(20); 3* 35; 5+2
[1] 2.995732
[1] 105
[1] 7
> floor(5.3)
[1] 5
> ceiling(5.3)
[1] 6
3.4 Comments
Note that in the above code snippet, the first line is the actual code that is
executed, whereas the second line is a comment that is ignored by the interpreter. A
peculiarity of R is that it does not have support for multiline comments.
So if you want to use multiline comments in R, then you have to individually
comment each line. Fortunately, IDEs such as RStudio provide a work-around for this
limitation. For example, in Windows you can use CTRL + SHIFT + C to comment
multiple lines of code in RStudio.
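A minimal illustration of single-line and multi-line comments (the variable names are arbitrary):

```r
radius <- 2            # a comment can follow code on the same line
# A comment can also occupy a whole line.

# R has no block-comment syntax, so a multi-line
# comment is written by starting every line
# with the hash character.
area <- pi * radius^2
```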
3.5 Variables
There are some in-built data types in R for handling different kinds of data: integer,
floating point, string, Boolean values, date, and time. Similar to Python, a neat
feature of R is that you do not need to mention what kind of data a variable holds;
depending on the value assigned, R automatically assigns a data type to the variable.
Think of a variable as a placeholder. It is any name that can hold a value, and that
value can vary over time (hence the name variable). In other terms, variables are
reserved locations in your machine's memory to store different values. Whenever
you specify a variable, you are actually allocating space in memory that will hold
values or objects in future. These variables continue to exist while the program is
running. Depending on the type of data a variable has, the interpreter will assign the
required amount of memory for that variable. This implies that the memory of a variable
can increase or decrease dynamically depending on what type of data the variable
has at the moment. You create a variable by giving it a name and then assigning a
value to it using the equal sign (=) operator.
Code
variable1 = 100 # Variable that holds integer value
distance = 1500.0 # Variable that holds floating point value
institute = "ISB" # Variable that holds a string
Output
100
1500.0
ISB
Code
a = 0
b = 2
c = "0"
print(a + b)
print(c)
Output
2
“0”
Although a variable can be named almost anything, there are certain naming
conventions that should be followed:
• Variable names in R are case-sensitive. This means that Variable and variable are
two different variables.
• A variable name cannot begin with a number.
• The remainder of the variable name can contain any combination of letters,
digits, periods, and underscore characters.
• A variable name cannot contain blank spaces.
The value of a variable can be initialized in two ways:
> x <- 5
> y = 5
> print(x)
[1] 5
> print(y)
[1] 5
[1] indicates that x and y are vectors and 5 is the first element of the vector.
Notice the use of <- as the assignment operator. Assignments in R are conventionally
done using the <- operator (although you can use the = operator as well). In most
cases, there is no difference between the two; however, in some specialized
cases, you can get different results depending on which operator you use. For
example, inside a function call, = is read as naming an argument rather than
assigning to a variable. The <- operator is the conventional choice, and we
encourage readers to use it in their code as well.
In addition to complex data types, R has five atomic (basic) data types: Numeric,
Character, Integer, Complex, and Logical. Let us understand them one by one.
Numbers are used to hold numerical values. There are four types of numbers
that are supported in R: integer, long integer, floating point (decimals), and complex
numbers.
1. Integer: An integer type can hold whole-number values such as 1, 4, 1000, and
−52,534. In R, integers have a bit length of 32 bits. This means that an integer
can hold values in the range of −2,147,483,647 to 2,147,483,647 (the lowest
32-bit value is reserved for the missing value NA). An integer can only contain
digits and cannot have any characters or punctuation such as $. Note that R
stores a number typed without a suffix as a numeric (double) value; append L
(for example, 120L) to create an integer explicitly.
Code
> 120+200
[1] 320
> 180-42
[1] 138
> 15* 8
[1] 120
2. Long Integer: Simple integers have a limit on the value that they can contain.
Sometimes the need arises for holding a value that is outside the range of integer
numbers. Base R has no separate Long Integer data type; such a value is held as
a floating point (double) number instead, as in the example below, where the
result exceeds the integer range. Doubles represent whole numbers exactly only
up to 2^53, so be careful when working with very large values.
Code
> 2** 32
[1] 4294967296
3. Floating Point Numbers: Floating point data types are used to contain decimal
values such as fractions.
4. Complex Numbers: Complex number data types are used to hold complex
numbers. In data science, complex numbers are used rarely, and unless you are
dealing with abstract math, there would be no need to use complex numbers.
3.8 Vector
Whenever you define a variable in R that contains one of the above atomic data
types, that variable is most likely a vector. A vector in R is a variable that
can contain one or more values of the same type (Numeric, Character, Logical, and
so on). A vector in R is analogous to an array in C or Java, with the difference that
we do not have to create the array explicitly and we do not have to worry about
increasing or decreasing the length of the array. A primary reason for having vectors as
the basic variable in R is that, most of the time, the programmer or analyst works
not with a single value but with a bunch of values in a dataset (think of a
column in a spreadsheet). To mimic that behavior, R implements the
variable as a vector. A vector can also contain a single value (in which case it is
a vector of length one). For example, the variable below is a vector of
length one (since it contains only one element):
> a <- 4
> a
[1] 4
If you want to combine multiple values to create a vector, then you can make
use of the c() function in R. c() stands for concatenate, and its job is to take
individual elements and create a vector by putting them together. For example:
> x <- c(1, 0.5, 4)
> x
[1] 1.0 0.5 4.0
Note that in the last statement, we made use of the vector() function to create a
vector. vector() is an inbuilt function in R that creates a vector of a specific type
(specified by the first argument, "numeric" in this case) and size (specified by the
length argument). If we do not specify values for the vector, it is filled with the
default value for the specified type (e.g., the default value for numeric is 0).
You can perform a range of functions on the vector. For example:
#To find the class of a vector, use class function
> class(y)
[1] "character"
Vectors are quite flexible in R, and you can create them in a range of ways.
One very useful operator in R for vectors is the sequence operator (:). The sequence
operator starts with an initial value, increments in steps of 1, and stops at a
terminal value. In doing so, it creates a vector from the initial to the terminal
value. For example:
Note that in the command, we explicitly called the seq() function (it is similar to
the sequence operator). The seq() function takes the initial value and terminal value
as 0 and 8, respectively, and creates a vector of values by incrementing in the steps
of 0.2.
If we want to generate a vector of repetitive values, then we can do so easily by
using the rep() function. For example:
> rep(4,9)
[1] 4 4 4 4 4 4 4 4 4
> rep(1:7,10)
[1] 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5
[34] 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3
[67] 4 5 6 7
> rep(1:7,each=3)
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7 7
In the first case, the rep function repeated the value 4 nine times. In the second
command, rep repeated the sequence 1 to 7 ten times. In the third, we created a
vector where each value from 1 to 7 was repeated three times.
You can perform arithmetic operations on vectors in a manner similar to operations
on single values. Here, the operations are performed element by element:
> x <- c(1, 0.5, 4)
> x
[1] 1.0 0.5 4.0
> y <- c(5,3,2)
> y
[1] 5 3 2
> x+y
[1] 6.0 3.5 6.0
> x
[1] 1.0 0.5 4.0
> y <- c(5,3,2,1)
> y
[1] 5 3 2 1
> x+y
[1] 6.0 3.5 6.0 2.0
Warning message:
In x + y: longer object length is not a multiple of shorter object length
You might expect an error here since the vectors are not of the same length.
However, R only issues a warning and still performs the operation: when the
shorter vector runs out of elements, R starts again from its first element. In
our case, x is the shorter vector with three elements. The first three elements
of x are added to the first three elements of y, and for the fourth element of y,
R reuses the first element of x (1 + 1 = 2). This behavior, known as recycling,
is a peculiarity of R that one needs to be careful about. When the longer length
is an exact multiple of the shorter one, R does not even issue a warning, so if
we are not careful about the lengths of vectors while performing arithmetic
operations, the results can be erroneous and can go undetected (since R does not
produce any errors).
Since a vector can be viewed as an array of individual elements, we can extract
individual elements of a vector and can also access sub-vectors from a vector. The
syntax for doing so is very similar to what we use in Python, that is, specify the
name of the vector followed by the index in square brackets. One point to be careful
about is that indexes in R start from 1 (and not from 0 as in Python). For example:
> a <- c(1,3,2,4,5,2,4,2,6,4,5,3)
> a
[1] 1 3 2 4 5 2 4 2 6 4 5 3
Let us say that you want to select a subset of a vector based on a condition.
> anyvector <- a>3
> a[anyvector]
[1] 4 5 4 6 4 5
> x[x>5]
[1] 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
26 27
[23] 28 29 30
You can also apply set theory operations (in addition to usual arithmetic
operators) on vectors.
> setA <- c("a", "b", "c", "d", "e")
> setB <- c("d", "e", "f", "g")
> setdiff(setA,setB)
[1] "a" "b" "c"
3.10 Lists
While vectors in R are a convenient way of working with a number of values at the
same time, we often need to hold values of different types together. For example,
we might want to have numeric as well as character values in the same variable.
Since we cannot do so with vectors, the data type that comes to our rescue is the
list. A list in R is a special type of vector that can contain different types of
data. We define a list with the list() function in R.
> x <- list(1,"c",FALSE)
> x
[[1]]
[1] 1
[[2]]
[1] "c"
[[3]]
[1] FALSE
> x[3]
[[1]]
[1] FALSE
> x[1:2]
[[1]]
[1] 1
[[2]]
[1] "c"
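We can access individual elements of a list in much the same way as with vectors.
Note that single square brackets (x[3]) return a list containing the selected
elements, whereas double square brackets (x[[3]]) return the element itself. In
addition to containing basic data types, a list can contain complex data types as
well (such as vectors and nested lists). For example: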
> x <- list(col1=1:3, col2 = 4)
> x
$col1
[1] 1 2 3
$col2
[1] 4
> x[1]
$col1
[1] 1 2 3
> x[[1]]
[1] 1 2 3
> x$col1
[1] 1 2 3
> x$col1[2]
[1] 2
> x[[1]][2]
[1] 2
In the above example, we defined a list x that contains two named elements, col1
and col2. Each element is itself a vector: col1 contains the numbers 1, 2, and 3,
and col2 contains the single element 4. You can access individual elements of a
list, or elements within those elements, by using square brackets and the index of
the elements, or by using the $ operator with the element name.
3.11 Matrices
Lists and vectors are unidimensional objects, that is, a vector can contain a number
of values, and we can think of it as a single column in a spreadsheet. If we need
multiple columns, vectors are not convenient. For this, R provides two different
data structures: matrices and data frames. We will first discuss matrices and then
move on to data frames.
A matrix in R is a two-dimensional object (rows and columns) whose elements are all
of the same type. There are multiple ways of creating a matrix in R:
> m1 <- matrix(nrow=4, ncol=5)
> m1
[,1] [,2] [,3] [,4] [,5]
[1,] NA NA NA NA NA
[2,] NA NA NA NA NA
[3,] NA NA NA NA NA
[4,] NA NA NA NA NA
> dim(m1)
[1] 4 5
> dim(m1)
[1] 2 5
> x
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 5 7 9
[2,] 2 4 6 8 10
> x[1,1]
[1] 1
> x[1,]
[1] 1 3 5 7 9
> x[,2]
[1] 3 4
Often we have some vectors at our disposal and want to create a matrix by
combining them. This can be done by making use of the rbind() and cbind()
functions. rbind() joins the vectors as rows, whereas cbind() joins them as
columns. For example:
> x<- 1:6
> x
[1] 1 2 3 4 5 6
> cbind(x,y)
x y
[1,] 1 12
[2,] 2 13
[3,] 3 14
[4,] 4 15
[5,] 5 16
[6,] 6 17
> rbind(x,y)
[,1] [,2] [,3] [,4] [,5] [,6]
x 1 2 3 4 5 6
y 12 13 14 15 16 17
A primary reason why Excel is very useful for us is that everything is laid out in
a neat tabular structure, and this enables us to perform a variety of operations on
the tabular data. Additionally, we can also hold string, logical, and other types
of data. This capability is not lost for us in R and is instead provided by data
frame in R.
Tabular data in R is read into a type of data structure known as data frame. All
variables in a data frame are stored as separate columns, and this is different from
matrix in the sense that each column can be of a different type. Almost always, when
you import data from an external data source, you import it using a data frame. A
data frame in R can be created using the function data.frame().
> nrow(x)
[1] 20
> ncol(x)
[1] 2
In the first code snippet, we specified that we are creating a data frame that
has two columns (col1 and col2). To find the number of rows and columns in a
data frame, we use the functions nrow() and ncol(), respectively. In order to check
the structure of a data frame (number of observations, number and types of columns),
we make use of the function str().
Similar to matrices, we can select individual columns, rows, and values in a data
frame. For example:
> x[1]
col1
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
> x[1,1]
[1] 1
> x[,2]
[1] TRUE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
[12] TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE TRUE
> x[2:5,1]
[1] 2 3 4 5
Additionally, we can also make use of the $ operator to access specific columns
of a data frame. The syntax is dataframe$colname.
x$col1
3.13 R Operators
4 Conditional Statements
After the discussion of variables and data types in R, let us now focus on the
second building block of any programming language, that is, conditional statements.
Conditional statements are branches in code that are executed if a condition
associated with the statement is true. There are many different types of conditional
and looping statements; the most prominent ones are if, while, and for. In the
following sections, we discuss these statements.
Output
[1] "Zero"
Whereas an if statement evaluates its condition once, the while loop evaluates a
condition repeatedly, typically using a counter or variable that keeps track of the
condition being evaluated. Hence, you can execute the associated block of
statements multiple times in a while block.
Code
a <- 10
while (a>0){
print(a)
a<-a-1
}
Output
[1] 10
[1] 9
[1] 8
[1] 7
[1] 6
[1] 5
[1] 4
[1] 3
[1] 2
[1] 1
In many ways, the for loop is similar to a while loop in the sense that it allows
you to run the loop body multiple times. However, the for loop is more convenient
because we do not have to increment or decrement a counter ourselves. In the while
loop, the onus is on the user to increment/decrement the counter; otherwise, the
loop runs forever. In a for loop, the loop itself steps through the values of the
sequence.
Code
for (j in 1:5){
print(j)
}
Output
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
In the code snippet below, we make use of the seq_along() function, which generates
the sequence of indices 1, 2, ..., length(x) for a vector x. The loop iterates over
these indices, and the print() call inside the loop prints the element of x at each
index.
x <- c("a","c","d")
for (i in seq_along(x)){
print(x[i])
}
[1] "a"
[1] "c"
[1] "d"
[1] "a"
[1] "c"
[1] "d"
Most of the time, in addition to using variables and in-built data structures, we
work with external files to read data from and write output to. For this purpose,
R provides a range of functions for reading and writing files. read.table and
read.csv are the most common for tables, readLines for text data, and load for
workspaces. Similarly, for writing, use write.table and writeLines.
read.table is the most versatile and powerful function for reading data from
external sources. You can use it to read any type of delimited text file
(tab-delimited, comma-delimited, and so on). The syntax is as follows:
> inputdata <- read.table("inputdata.txt",header=TRUE)
In the above code snippet, we read the file inputdata.txt into a data frame named
inputdata. By specifying the argument header=TRUE, we indicate that the first line
of the input file is a header.
While you can import data using the read.table function as well, there are specific
functions for csv and Excel files:
> titanicdata <- read.csv("train.csv")
> datafile1 <- read.table("train.csv",header=TRUE,sep=",")
Similar to the functions for reading files into R, there are functions for writing
data frames back to files. Here is one of the most common examples that you would
encounter. There are many more functions available for working with different file
types.
> write.csv(titanicdata,"D://file1.csv")
In the above code snippet, we write the contents of the data frame titanicdata to
an output file (file1.csv) on the D drive.
5 Function
once and can then be called using their names elsewhere in the code. When we need
to create a function, we need to give a name to the function and the associated code
block that would be run whenever the function is called. A function name follows
the same naming conventions as a variable name. Any function is defined using
the keyword function. This tells the interpreter that the following piece of code is
a function. After function, we write the arguments that the function would expect
within the parenthesis. Following this, we would write the code block that would be
executed every time the function is called.
In the code snippet below, we define a function named func1 that takes two
arguments a and b. The function body does a sum of the arguments a and b that
the function receives.
Code
func1 <- function(a,b){
a+b
}
func1(5,10) # call the function by its name and provide the arguments
Output
[1] 15
Code
square.it <- function(x) {
square <- x * x
return(square)
}
Output
square.it(5)
[1] 25
In the abovementioned code snippet, we created a function “square.it” using the
syntax of the function. In this case, the function expects one argument and does a
square of that argument. The return() statement will pass the value computed to the
calling line in the code (the line or variable that called the function). Note that the
names given in the function definition are called parameters, whereas the values
you supply in the function call are called arguments.
6 Further Reading
There are plenty of free resources (books and online websites, including the R
documentation itself) available to learn more about R and the packages used in R.
As mentioned earlier, since R is an open-source platform, many developers keep
building new packages and adding them to the R repository on a frequent basis. The
best way to learn about them is to refer to the documentation submitted by the
respective package authors.
Chapter 29
Introduction to Python
1 Introduction
As data science is increasingly being adopted in the industry, the demand for data
scientists is also growing at an astonishing pace. Data scientists are a rare breed of
“unicorns” who are required to be omniscient; according to popular culture, a
data scientist is someone who knows more statistics than a programmer and more
programming than a statistician. One of the most important tools in a data scientist’s
toolkit is the knowledge of a general-purpose programming language that enables
a data scientist to perform tasks of data cleaning, data manipulation, and statistical
analysis with ease. Such requirements call for programming languages that are easy
enough to learn and yet powerful enough to accomplish complex coding tasks.
Two such de facto programming languages for data science used in industry and
academia are Python and R.
In this chapter, we focus on covering the basics of Python as a programming
language. We aim to cover the important aspects of the language that are most
critical from the perspective of a budding data scientist. A detailed knowledge
of Python can be gained through an excellent collection of books and Internet
resources. Although prior programming experience is helpful, this chapter does not
require any prior knowledge of programming.
P. Taori
London Business School, London, UK
e-mail: taori.peeyush@gmail.com
H. K. Dasararaju
Indian School of Business, Hyderabad, Telangana, India
As stated at the start, Python is one of the de facto languages when it comes to
data science. There are a number of reasons why Python is such a popular language
among data scientists. Some of those reasons are listed below:
• Python is a high-level, general-purpose programming language that can be used
for varied programming tasks such as web scraping, data gathering, data cleaning
and manipulation, website development, and for statistical analysis and machine
learning purposes.
• Unlike some of the other high-level programming languages, Python is extremely
easy to learn and use, and it does not require a degree in computer science
to become an expert in Python programming.
• Python is an object-oriented programming language. It means that everything in
Python is an object. The primary benefit of using an object-oriented programming
language is that it allows us to think of problem solving in a simpler and real-
world manner, and when the code becomes too cumbersome then object-oriented
languages are the best way to go.
• Python is an open-source programming language. This implies that a large
community of developers contribute continually to the Python ecosystem.
• Python has an excellent ecosystem that comprises thousands of modules and
libraries (prepackaged functions), so there is no need to reinvent the wheel:
most programming tasks can be handled by simply calling functions
in one of these packages. This reduces the need for writing hundreds of lines of
code and makes development easier and faster.
There are multiple ways of installing the Python environment and related packages
on your machine. One way is to install Python and then add the required packages
one by one. Another method (the recommended one) is to work with an Integrated
Development Environment (IDE). Working with an IDE greatly simplifies the task
of developing, testing, deploying, and managing your project in one place. There
are a number of such IDEs available for Python, such as Anaconda and Enthought
Canopy; some are paid versions, while others are available for free for
academic purposes. You can choose any IDE that suits your specific needs.
In this section, we demonstrate the installation and usage of one such IDE:
Enthought Canopy. Enthought Canopy is a comprehensive Python distribution that
comes pre-loaded with more than 14,000 packages. Canopy makes it very easy to
install and manage libraries and also provides a neat GUI environment for
developing applications. Below are the guidelines on how to install the
Enthought Canopy distribution.
Launch the Canopy icon from your machine. There are three major components in
Canopy distribution (Fig. 29.2):
1. A text editor and integrated IPython console.
2. A GUI-based package manager.
3. Canopy documentation.
The Package Manager allows you to manage existing packages and install additional
packages as required. There are two major panes in Package Manager (Fig. 29.3):
1. Navigation Pane: It lists all the available packages, installed packages, and the
history of installed packages.
2. Main Pane: This pane gives you more details about each package and allows you
to manage the packages at individual level.
The documentation browser contains help files for Canopy software and some of the
most commonly used Python packages such as Numpy, SciPy, and many more.
In this section, we outline the Python programming language, and use the features
and packages present in the language for data science related purposes. Specifically,
we would be learning the language constructs, programming in Python, how to use
these basic constructs to perform data cleaning, processing, and data manipulation
tasks, and use packages developed by the scientific community to perform data
analysis. In addition to working with structured (numerical) data, we will also be
learning about how to work with unstructured (textual) data because Python has a
lot of features to deal with both domains in an efficient manner.
We discuss the basic constructs of the language such as operators, data types,
conditional statements, and functions, and specific packages that are relevant for
data analysis and research purpose. In each section, we discuss a topic, code
snippets, and exercise related to the sessions.
There are two ways to write code in Python: script and interactive. Script
mode is the one that most programmers would be familiar with, that is, all of
the Python code is written in one text file, and the file is then executed by a Python
interpreter. Python code files conventionally have a “.py” extension, which signals
that the file contains Python code. In interactive mode, instead of
writing all of the code together in one file, individual snippets of code are written
in a command line shell and executed. The benefit of interactive mode is that it
gives immediate feedback for each statement and makes program development
more efficient. A typical practice is to first write snippets of code in interactive
mode to test for functionality and then bundle all pieces of code in a .py file (script
mode). Enthought Canopy provides access to both modes. The top window in the
text editor is where you can type code in script mode and run all or some part of it.
In order to run a file, just click on the Run button in the menu bar, and Python will
execute the code contained in the file.
The bottom window in the text editor acts as the Python interactive shell. In
interactive mode, what you type is immediately executed. For example, typing 1 + 1
will respond with 2.
Let us now get started with understanding the syntax of Python. The Python
community prides itself on writing code that is obvious to understand even
for a beginner; code written this way is described as “Pythonic.” Although it
is true that Python is a very simple and easy language to learn and develop in, it
has some quirks, the biggest one of which is indentation. Let us first understand the
importance of indentation before we start to tackle any other syntax features of the
language. Please note that all the code referred to in the following sections was
tested on Python 2.7 in the Enthought Canopy console.
3.3 Indentation
Code
a = 7
 print('Value of the variable is {}'.format(a))
# Error! Look at the space at the beginning of the line above
print('This is now correct. Value of variable a is {}'.format(a))
Output
(Once you comment second line)
This is now correct. Value of variable a is 7
In the above piece, each line is a Python statement. In the first statement, we
assign a value of 7 to the variable “a.” In the second statement, notice the space
at the beginning. The Python interpreter treats this as indentation. However,
since an indented block is expected only after a statement that opens a block (such
as an if statement or a for loop), the code here gives an error: the interpreter
considers the second statement to have a separate flow from the first statement. The
third statement does not have any indentation (it is in the same block as the first
statement) and thus executes just fine.
It is important to remember that all statements that are expected to execute in the
same block should have the same indentation. Indentation improves the
readability of Python programs tremendously, but it also takes a bit of getting used
to, especially if you are coming from languages such as C and Java, where a semicolon
(;) marks the end of a statement. You should be careful with indentation: at the
very least, mistakes cause errors in the program, and if they go undetected, they
can cause the program to behave in an unpredictable manner. Most IDEs, such as
Canopy and Anaconda, have in-built support for indentation that makes program
development easier.
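As a small illustration of the rule that statements in the same block share the same indentation, consider the sketch below. It is not taken from the chapter: the function name describe and its threshold of 5 are purely illustrative, and the code is written so that it should run on both Python 2.7 and Python 3.

```python
# Hypothetical example: statements with the same indentation form one block.
def describe(value):
    if value > 5:
        # These two lines are indented equally, so both belong to the if block.
        label = "large"
        note = "greater than 5"
    else:
        # These two lines form the else block.
        label = "small"
        note = "5 or less"
    # Back at the function's own indentation level: runs for either branch.
    return label + " (" + note + ")"

result_large = describe(7)
result_small = describe(3)
print(result_large)
print(result_small)
```

Changing the indentation of any single line inside the if or else block would either raise an IndentationError or silently move that line into a different block, which is why consistent indentation matters.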
3.4 Comments
Output
This is code line, not a comment line
Note that in the above code snippet, the first line is the actual code that is executed,
whereas the second line is a comment that is ignored by interpreter.
2. Multiline comment:
Syntax for multiline comments is different from that of single-line comments.
Multiline comments start and end with three single quotes ('''). Everything in
between is ignored by the interpreter.
Code
'''
print("Multi line comment starts from here")
print("Multi line comment continuing. This will not be printed")
'''
print("Multi line comment ended in above line. This line will be printed")
Output
Multi line comment ended in above line. This line will be printed
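One detail worth knowing, sketched below under the assumption of a standard Python interpreter, is that a block wrapped in three quotes is technically a string literal rather than a true comment. When it is not assigned to anything, Python evaluates it and discards it, which is why it behaves like a comment. The variable names block and line_count are illustrative only.

```python
# A triple-quoted block that is not assigned to a name is evaluated and
# then discarded, so the code inside it never runs.
'''
print("not executed")
'''

# The same syntax assigned to a name gives an ordinary multi-line string.
block = '''line one
line two'''
line_count = len(block.splitlines())
print(line_count)
```

For this reason, many programmers prefer to prefix each line with # when they want to disable code, and keep triple quotes for documentation strings.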
With a firm understanding of indentation and comments, let us now look at the
building blocks of the Python programming language. A concept central to Python
is that of the object. Everything in Python, be it a simple variable, a function, or
a custom data structure, is an object. This means that there is data and certain
functions associated with every object. This makes programming very consistent and
flexible. However, it does not imply that we have to think of objects every time we
are coding in Python. Behind the scenes, everything is an object whether or not we
explicitly use objects in our code. Since we are just getting started, we will first
focus on coding without objects and talk about objects later, once we are
comfortable with the Pythonic way of programming.
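The idea that every value carries both data and associated functions can be checked directly. The following is a minimal sketch rather than part of the original text; the variable names are illustrative, and the calls used (isinstance, str.lower, int.bit_length) are standard built-ins available in Python 2.7 and Python 3.

```python
# Every value is an object: it holds data and has functions (methods) attached.
number = 7
text = "ISB"

number_is_object = isinstance(number, object)   # integers are objects
function_is_object = isinstance(len, object)    # so are built-in functions

lower_text = text.lower()      # a method attached to the string object
bits = number.bit_length()     # a method attached to the integer object

print(number_is_object)
print(function_is_object)
print(lower_text)
print(bits)
```

Even though the code never defines a class, it already uses objects and their methods, which is what the text means by objects working behind the scenes.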
3.6 Variables
There are some in-built data types in Python for handling different kinds of data:
integer, floating point, string, Boolean values, date, and time. A neat feature of
Python is that you do not need to mention what kind of data a variable holds;
depending on the value assigned, Python automatically assigns a data type to the
variable.
Think of a variable as a placeholder. It is any name that can hold a value and that
value can vary over time (hence the name variable). In other terms, variables are
reserved locations in your machine’s memory to store different values. Whenever
you specify a variable, you are actually allocating space in memory that will hold
values or objects in future. These variables continue to exist while the program is
running. Depending on the type of data a variable has, the interpreter will assign the
required amount of memory for that variable. This implies that memory of a variable
can increase or decrease dynamically depending on what type of data the variable
has at the moment. You create a variable by specifying a name to the variable, and
then by assigning a value to the variable by using equal sign (=) operator.
Code
variable1 = 100 # Variable that holds integer value
distance = 1500.0 # Variable that holds floating point value
institute = "ISB" # Variable that holds a string
print(variable1)
print(distance)
print(institute)
print institute # print statement has been discontinued from Python3
Output
100
1500.0
ISB
ISB
Code
a = 0
b = 2
c = "0"
d = "2"
print(a + b) # output as integer
print(c + d) # output as string
print(type(a + b))
print(type(c + d))
Output
2
02
<type ‘int’>
<type ‘str’>
Although a variable can be named almost anything, there are certain naming
rules that must be followed:
• A variable can start with either a letter (uppercase or lowercase) or an underscore
(_) character.
• Remainder of the variable can contain any combination of letters, digits, and
underscore characters.
• For example, some of the valid names for variables are _variable, variable1.
5Variable, >Smiley are not correct variable names.
• Variable names in Python are case-sensitive. This means that Variable and
variable are two different variables.
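The naming rules above can be checked programmatically. The following is a minimal sketch, assuming Python 3; the helper name is_valid_name and the list of candidate names are illustrative and not part of the original text.

```python
import keyword

candidates = ["_variable", "variable1", "5Variable", ">Smiley", "Variable", "for"]

def is_valid_name(name):
    """Return True if name can be used as a variable name."""
    # isidentifier() applies the letter/underscore/digit rules;
    # keywords such as "for" pass that check but are still reserved.
    return name.isidentifier() and not keyword.iskeyword(name)

results = {name: is_valid_name(name) for name in candidates}
for name, ok in results.items():
    print(name, "->", "valid" if ok else "invalid")
```

Note that reserved words such as for follow the character rules but still cannot be used as variable names, which is why the sketch also consults the keyword module.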
In addition to complex data types, Python has five basic data types: Number,
String, List, Tuple, and Dictionary. Let us understand
them one by one.
3.8.1 Numbers
Numbers are used to hold numerical values. Python 2 supports four types of numbers:
integer, long integer, floating point (decimals), and complex numbers. Python 3
merges the integer and long integer types into a single int type.
1. Integer: An integer type can hold integer values such as 1, 4, 1000, and -52534.
In Python 2, plain integers have a bit length of at least 32 bits. This means that an
integer data type can always hold values in the range −2,147,483,648 to
2,147,483,647. In Python 3, an int can be arbitrarily large. An integer literal can
only contain digits and cannot have any characters or punctuation such as $ or commas.
Code and output
>>> 120+200
320
>>> 180-42
138
>>> 15*8
120
2. Long Integer: In Python 2, plain integers have a limit on the value that they can
contain. Sometimes the need arises for holding a value that is outside the range of
integer numbers. In such a case, Python 2 makes use of the Long Integer data type.
Long Integer data types do not have a limit on the length of data they can contain. A
downside of such data types is that they consume more memory and are slower during
computations. Use Long Integer data types only when you have an absolute
need for them. Python 2 distinguishes a Long Integer value from an integer value by the
character L or l, that is, a Long Integer value has “L” or “l” at the end. Python 3
has no separate long type and no L suffix; every int behaves like a Long Integer.
Code and output (Python 2)
>>> 2**32
4294967296L
3. Floating Point Numbers: Floating point data types are used to contain decimal
values such as fractions.
4. Complex Numbers: Complex number data types are used to hold complex
numbers. In data science, complex numbers are used rarely and unless you are
dealing with abstract math there would be no need to use complex numbers.
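As a brief illustration of the number types described above, the sketch below assumes Python 3, where there is a single int type; the variable names are chosen only for this example.

```python
big = 2 ** 64              # no overflow and no trailing L in Python 3
ratio = 7 / 2              # true division always returns a float
z = complex(3, 4)          # complex number 3 + 4j

print(type(big).__name__, big)
print(type(ratio).__name__, ratio)
print(type(z).__name__, abs(z))   # abs() of a complex number is its magnitude
```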
3.8.2 Strings
A key feature of Python that makes it one of the de facto languages for text analytics
and data science is its support for strings and string processing. Strings are nothing
but an array of characters. Strings are defined as a sequence of characters enclosed
by quotation marks (they can be single or double quotes). In addition to numerical
data processing, Python has very strong string processing capabilities. Since strings
are represented internally as an array of characters, it implies that it is very easy to
access a particular character or subset of characters within a string. A sub-string of
a string can be accessed by making use of indexes (position of a particular character
in an array) and square brackets []. Indexes start with 0 in Python. This means that
the first character in a string can be accessed by specifying the string name followed
by [0] (e.g., stringname[0]). If we want to join two
strings, then we can make use of the plus (+) operator. While plus (+) operator
adds numbers, it joins strings and hence can work differently depending on what
type of data the variables hold. Let us understand string operations with the help of
a few examples.
Code
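newstring = 'Hi. How are you?'
print(newstring) # It will print entire string
print(newstring[0]) # It will print first character
print(newstring[4:7]) # It will print 5th to 7th character
print(newstring[4:]) # It will print string from 5th character till last
print(newstring * 3) # It will print string three times
print(newstring + 'I am very well, ty.') # It will concatenate the two strings
Output
Hi. How are you?
H
How
How are you?
Hi. How are you?Hi. How are you?Hi. How are you?
Hi. How are you?I am very well, ty.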
Strings in Python are immutable. Unlike other data structures such as lists, you
cannot modify individual characters of a string in place. In order to do so, you have
to take subsets of the string and form a new string. A string can be converted to a
numerical type and vice versa (wherever applicable). Many times, raw data, although
numeric, is coded in string format. This conversion provides a clean way to make sure
all of the data is in numeric form. Strings are a sequence of characters and can be
tokenized. Strings and numbers can also be formatted.
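The operations mentioned in this paragraph (conversion, building a new string from pieces, tokenizing, and formatting) can be sketched as follows. The sample values and variable names are assumptions made for illustration only.

```python
raw_price = "1500.50"            # numeric data that arrived as text
price = float(raw_price)         # string -> number
label = str(price * 2)           # number -> string

sentence = "Hi. How are you?"
tokens = sentence.split()        # tokenize on whitespace
# Strings are immutable, so an "edit" means building a new string from slices.
shout = sentence[:4] + "HOW" + sentence[7:]

report = "Price: {:.1f}, tokens: {}".format(price, len(tokens))
print(label, tokens, shout, report, sep="\n")
```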
Python has a built-in datetime module for working with dates and times. One can
create strings from date objects and vice versa.
Code
import datetime
date1 = datetime.datetime(2014, 5, 16, 14, 45, 5)
print(date1.day)
print(date1)
Output
16
2014-05-16 14:45:05
3.8.4 Lists
Lists in Python are one of the most important and fundamental data structures. At
the very basic level, List is nothing but an ordered collection of data. People with
background in Java and C can think of list as an array that contains a number of
elements. The difference here is that a list can contain elements of different data
types. A list is defined as a collection of elements within square brackets “[ ]”, and
each element in a list is separated by commas. Similar to the individual characters
in a String, if you want to access individual elements in a list then you can do it by
using the same notation as used with strings, that is, using the indexes and the
square brackets.
Code
alist = ['hi', 123, 5.45, 'ISB', 85.4]
anotherlist = [234, 'ISB']
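To show the indexing described above in action, here is a small sketch that reuses the two lists; the derived variable names (first, middle, combined) are illustrative.

```python
alist = ['hi', 123, 5.45, 'ISB', 85.4]
anotherlist = [234, 'ISB']

first = alist[0]                 # first element
middle = alist[1:4]              # 2nd to 4th element
combined = alist + anotherlist   # concatenation creates a new list
alist[2] = 1000                  # lists are mutable, so in-place change is allowed
alist.append('new')              # ...and so is growing the list

print(first, middle, combined, alist, sep="\n")
```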
3.8.5 Tuples
A tuple is a built-in data type that is a close cousin of the list data type. While
in a list you can modify individual elements of the list and can also change
the number of elements in the list, a tuple is immutable in the sense that once it is
defined you cannot change either the individual elements or the number of elements in
it. Tuples are defined in a similar manner as lists with a single exception: while
lists are defined in square brackets “[]”, tuples are defined using parentheses “()”.
You should use tuples whenever you need a list-like collection that
nobody should be able to modify.
Code
tupleone = ('hey', 125, 4.45, 'isb', 84.2)
tupletwo = (456, 'isb')
print(tupleone) # It will print entire tuple
print(tupleone[0]) # It will print first element of tuple
print(tupleone[1:4]) # It will print 2nd to 4th element of tuple
print(tupleone[3:]) # It will print entire tuple from 4th element till last
print(tupletwo * 2) # It will print tuple twice
print(tupleone + tupletwo) # It will concatenate and print the two tuples
Output
('hey', 125, 4.45, 'isb', 84.2)
hey
(125, 4.45, 'isb')
('isb', 84.2)
(456, 'isb', 456, 'isb')
('hey', 125, 4.45, 'isb', 84.2, 456, 'isb')
Code
atuple = ('hey', 234, 4.45, 'Alex', 81.4)
alist = ['hey', 234, 4.45, 'Alex', 81.4]
# atuple[2] = 1000 # Invalid (error: tuple object does not support . . . )
alist[2] = 1000 # Valid (it will change 4.45 to 1000)
print(alist)
Output
['hey', 234, 1000, 'Alex', 81.4]
3.8.6 Dictionary
Dictionaries are perhaps among the most important built-in data structures in Python.
Dictionaries can be thought of as collections of elements where each element is a
key–value pair. If you know the key, then you can quickly look up the
corresponding value. Values can assume any Python data type (be it basic data types
or complex ones), whereas keys must be of an immutable (hashable) type. Generally, as
industry practice, we tend to use numbers or strings as keys. When you
need to define a dictionary, it is done using curly brackets “{}” and each element is
separated by a comma. An important point to note is that you cannot access an
element of a dictionary by using its position (index); rather, you need to use its
key.
Code
firstdict = {}
firstdict['one'] = "This is first value"
firstdict[2] = "This is second value"
seconddict = {'institution': 'isb', 'pincode': 500111, 'department': 'CBA'}
print(firstdict['one'])
print(firstdict[2])
print(seconddict)
print(seconddict.keys()) # It will print all keys in the dictionary
print(seconddict.values()) # It will print all values in the dictionary
Output
This is first value
This is second value
{'institution': 'isb', 'pincode': 500111, 'department': 'CBA'}
dict_keys(['institution', 'pincode', 'department'])
dict_values(['isb', 500111, 'CBA'])
Quite often the need might arise where you need to convert a variable of a specific
data type to another data type. For example, you might want to convert an int
variable to a string, or a string to an int, or an int to a float. In such cases, you use
type conversion functions that return the value converted to the requested type. To convert a variable to
integer type, use int(variable). To convert a variable to a string type, use str(variable).
To convert a variable to a floating point number, use float(variable).
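A short sketch of these conversions follows. The sample values are illustrative; the last part shows what happens when a string cannot be interpreted as a number.

```python
count_text = "42"
count = int(count_text)        # string -> int
ratio = float(count)           # int -> float
label = str(ratio)             # float -> string
truncated = int(3.99)          # float -> int drops the fractional part

try:
    int("forty-two")           # not a numeric string
    conversion_failed = False
except ValueError:
    conversion_failed = True   # int() raises ValueError for such input

print(count, ratio, label, truncated, conversion_failed)
```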
After the discussion of variables and data types in Python, let us now focus on the
second building block of any programming language, that is, conditional statements.
Conditional statements are branches in a code that are executed if a condition
associated with the conditional statements is true. There can be many different types
of such control-flow statements; however, the most prominent ones are if, while, and for.
In the following sections, we discuss these statements.
3.11.1 if Statement
Code
var1 = 45
if var1 >= 43:
    print("inside if block")
elif var1 <= 40:
    print("inside elif block")
else:
    print("inside else block")
Output
inside if block
Whereas an if statement evaluates its condition once, the while loop evaluates
a condition repeatedly, typically depending on a counter or
variable that keeps track of the condition being evaluated. Hence, you can execute
the associated block of statements multiple times in a while block. Similar to an if
statement, a while loop can have an optional else block, which runs once the
condition becomes false (the for loop example below also uses one).
Code
counter = 0
while counter < 5:
    print('Current counter: {}'.format(counter))
    counter = counter + 1
else:
    print('While loop ends!')
Output
Current counter: 0
Current counter: 1
Current counter: 2
Current counter: 3
Current counter: 4
While loop ends!
In many ways, a for loop is similar to a while loop in the sense that it allows you
to repeat a block of statements multiple times.
However, a for loop is more convenient in the sense that we do not have to keep
track of incrementing or decrementing a counter ourselves.
In the while loop, the onus is on the user to increment/decrement the counter; otherwise
the loop runs indefinitely. In a for loop, the loop itself takes care of stepping
through the values.
Code
for a in range(1, 8):
    print(a)
else:
    print('For loop ends')
Output
1
2
3
4
5
6
7
For loop ends
Sometimes you might want to break out of a loop before the loop completes.
In such cases, we make use of the break
statement. The break statement exits the loop as soon as a particular
condition is met.
Code
for a in range(1, 8):
    if a == 4:
        break
    print(a)
print('Loop completed')
Output
1
2
3
Loop completed
Whereas the break statement completely skips out of the loop, the continue
statement skips the rest of the code lines in a current loop and goes to the next
iteration.
Code
while True:
    string = input('Type your input: ')
    if string == 'QUIT':
        break
    if len(string) < 6:
        print('String is small')
        continue
    print('String input is sufficient')
Output
Type your input: Hi
String is small
Whenever you need the user to enter input from the keyboard, you can make use of two
built-in functions: “raw_input” and “input.” They
allow you to read a line of text from standard keyboard input. Both exist in Python 2;
in Python 3, raw_input has been renamed input, and the old expression-evaluating
input has been removed.
raw_input reads one line from keyboard input and gives it to the program as a string.
string = raw_input('Provide the input: ') # use input() in Python 3
print('Input provided is: {}'.format(string))
In the above-mentioned example, the user would get a prompt on the screen with
title “Provide the input.” The second line will then print whatever input the user has
provided.
Provide the input: Welcome to Python
Input provided is: Welcome to Python
For example, Python 2's input evaluates the text that is typed as a Python expression:
Provide the input: [a*3 for a in range(1,6,2)]
Input provided is: [3, 9, 15]
Most of the time, in addition to using variables and built-in data structures, we
would be working with external files to get data input and to write output to. For
this purpose, Python provides functions for file opening and closing. A key concept
in dealing with files is that of a “file” object. Let us understand this in a bit more
detail.
File object is the handle that actually allows you to create a link to the file you want
to read/write to. In order to be able to read/write a file, we first need to use an object
of file type. In order to do so, we make use of open() method. When open() executes,
it will result in a file object that we would then use to read/write data for external
files.
Syntax
file_object = open(file_name [, access_mode][, buffering])
Let us understand this function in slightly more detail. The file_name requires us
to provide the file name that we want to access. You can specify either an existing
file on the filesystem or you can specify a new file name as well. Access_mode tells
Python in which mode the file should be opened. There are a number of modes to do
so; however, the most common ones are read, write, and append. A more detailed
knowledge of each mode type is given in Table 29.1. Finally, buffering tells Python
how to buffer the data. A value of 0 means that there is no
buffering, and a value of 1 means that lines are buffered whenever the file is
being accessed. If the argument is omitted, the system default buffering is used.
Once we have opened the file for reading/writing purposes, we would then need to
close the connection with the file. It is done using the close() method. close() will
flush out any unwritten data to the file and will close the file object that we had
opened earlier using open() function. Once the close() method is called, we cannot
do any more reads/writes on the file. In order to do so, we would again have to open
the file using open() method.
Syntax
file_object.close();
Code
# File Open
file1 = open('transactions.txt', 'wb')
print('File Name: {}'.format(file1.name))
# Close file
file1.close()
Output
File Name: transactions.txt
While the open() and close() methods allow us to open/close a connection to a file,
we need to make use of read() or write() methods to actually read or write data to a
file.
When we call write(), it will write the data as a string to the file that we opened
earlier.
Syntax
file_object.write(string);
Code
# File Open
file1 = open('sample.txt', 'w') # text mode, since we are writing a string
file1.write('This is my first output.\nIt looks good!!\n')
# Close file
file1.close()
When we run the above code, a file sample.txt would be created and the string
mentioned in write() function would be written to the file. The string is given by:
This is my first output.
It looks good!!
Just as write() method writes data to a file, read() would read data from an open file.
Syntax
file_object.read([counter]);
Notice that read() takes an optional argument called counter. When it is given,
it tells the interpreter to read the specified number of bytes (characters, for a
file opened in text mode). If no such argument is provided, read() returns the
entire remaining content of the file.
Code
# File Open
fileopen = open('sample.txt', 'r+')
string = fileopen.read(10)
print('Output is: {}'.format(string))
# Close file
fileopen.close()
Output
Output is: This is my
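The open/write/read/close sequence above can also be written with a with block, which closes the file automatically. The following is a minimal sketch, assuming Python 3; the temporary directory is a hypothetical scratch location chosen so that the example does not touch real data files.

```python
import os
import tempfile

# Hypothetical scratch location so the sketch does not overwrite real files.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# The with block closes the file automatically, even if an error occurs.
with open(path, "w") as outfile:
    outfile.write("This is my first output.\nIt looks good!!\n")

with open(path, "r") as infile:
    first_ten = infile.read(10)     # first 10 characters
    rest = infile.read()            # everything that remains

print("Output is: {}".format(first_ten))
print(infile.closed)                # the file is already closed here
```

This form avoids forgetting the explicit close() call, at the cost of one extra level of indentation.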
Code
def firstfunc():
    # code block that is executed for the function
    print('Hi Welcome to Python programming!')
# Function ends here #
firstfunc() # call the function
firstfunc() # call the function again
Output
Hi Welcome to Python programming!
Hi Welcome to Python programming!
Code
def MaxFunc(a1, a2):
    if a1 > a2:
        print('{} is maximum'.format(a1))
    elif a1 == a2:
        print('{} is equal to {}'.format(a1, a2))
    else:
        print('{} is maximum'.format(a2))
x = 3
y = 1
MaxFunc(x, y) # give variables as arguments
MaxFunc(5, 5) # directly give literal values
Output
8 is maximum
3 is maximum
5 is equal to 5
If you want to make some parameters of a function optional, use default values in
case the user does not want to provide values for them. This is done with the help of
default argument values. You can specify default argument values for parameters by
appending to the parameter name in the function definition the assignment operator
(=) followed by the default value. Note that the default argument value should be a
constant. More precisely, the default argument value should be immutable.
Code
def say(message, times=1):
    print(message * times)
say('Hello')
say('World', 5)
Output
Hello
WorldWorldWorldWorldWorld
The function named “say” is used to print a string as many times as specified. If
we do not supply a value, then by default, the string is printed just once. We achieve
this by specifying a default argument value of 1 to the parameter times. In the first
usage of say, we supply only the string and it prints the string once. In the second
usage of say, we supply both the string and an argument 5 stating that we want to
say the string message five times.
Only those parameters that are at the end of the parameter list can be given default
argument values, that is, you cannot have a parameter with a default argument value
preceding a parameter without a default argument value in the function’s parameter
list. This is because the values are assigned to the parameters by position. For
example, def func(a, b=5) is valid, but def func(a=5, b) is not valid.
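The remark that default values should be immutable can be illustrated with a short sketch. The helper names add_item_shared and add_item_fresh are hypothetical and serve only to contrast a mutable default with the usual None-based alternative; say is rewritten here to return its result so that it can be inspected.

```python
def say(message, times=1):
    """Return message repeated `times` times (default: once)."""
    return message * times

def add_item_shared(item, bucket=[]):      # mutable default: created only once
    bucket.append(item)
    return bucket

def add_item_fresh(item, bucket=None):     # immutable default, new list per call
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket

once = say('Hello')
keyword_call = say(times=2, message='Hi')  # keyword arguments may come in any order

add_item_shared('a')
shared = add_item_shared('b')              # still remembers 'a' from the first call
add_item_fresh('a')
fresh = add_item_fresh('b')                # starts from an empty list each time

print(once, keyword_call, shared, fresh)
```

Because a default value is evaluated once, when the function is defined, a mutable default such as a list is shared across calls; using None and creating the list inside the function avoids that surprise.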
The “return” statement is used to return from a function, that is, break out of the
function. You can optionally return a value from the function as well.
Code
def maximum(x, y):
    if x > y:
        return x
    elif x == y:
        return 'The numbers are equal'
    else:
        return y
print(maximum(2, 3))
Output
3
3.15 Modules
You can reuse code in your program by defining functions once. If you want to reuse
a number of functions in other programs that you write, you can use modules. There
are various methods of writing modules, but the simplest way is to create a file with
a “.py” extension that contains functions and variables.
Another method is to write the modules in the native language in which the
Python interpreter itself was written. For example, you can write modules in the
C programming language and when compiled, they can be used from your Python
code when using the standard Python interpreter.
A module can be imported by another program to make use of its functionality.
This is how we can use the Python standard library as well. The following code
demonstrates how to use the standard library modules.
Code
import os
print(os.getcwd())
Output
<Your current working directory>
Importing a module is a relatively costly affair, so Python does some tricks to make
it faster. One way is to create byte-compiled files with the extension “.pyc”, which
is an intermediate form that Python transforms the program into. This “.pyc” file is
useful when you import the module the next time from a different program: it will
be much faster since a portion of the processing required in importing a module is
already done. Also, these byte-compiled files are platform-independent.
Note that in Python 2 these “.pyc” files are usually created in the same directory as
the corresponding “.py” files, whereas Python 3 places them in a “__pycache__”
subdirectory of that directory. If Python does not have permission to write to files in
that directory, then the “.pyc” files will not be created.
If you want to directly import the “argv” variable into your program (to avoid typing
sys. every time you use it), then you can use the “from sys import argv” statement. In
general, you should avoid using this statement and use the import statement instead,
since your program will then avoid name clashes and will be more readable.
Code
from math import sqrt
print('Square root of 16 is {}'.format(sqrt(16)))
Creating your own modules is easy; this is because every Python program is also a
module. You just have to make sure that it has a “.py” extension. The following is
an example for the same:
The above was a sample module; there is nothing particularly special about it
compared to our usual Python program. Note that the module should be placed
either in the same directory as the program from which we import it, or in one of
the directories listed in sys.path.
Output
Hi, this is mymodule speaking.
Version 0.1
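One way such a module and the program that imports it might look is sketched below. The function name say_hi and the variable __version__ are assumptions chosen to match the output shown above, and the module file is written to a temporary directory only so that the sketch is self-contained; normally mymodule.py would simply sit next to the importing program.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical module source; the names say_hi and __version__ are assumptions.
module_source = (
    "def say_hi():\n"
    "    return 'Hi, this is mymodule speaking.'\n"
    "\n"
    "__version__ = '0.1'\n"
)

# Write mymodule.py into a scratch directory and make that directory importable.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "mymodule.py"), "w") as handle:
    handle.write(module_source)
sys.path.insert(0, module_dir)
importlib.invalidate_caches()       # make sure the new file is noticed

mymodule = importlib.import_module("mymodule")   # same effect as: import mymodule
greeting = mymodule.say_hi()
print(greeting)
print("Version", mymodule.__version__)
```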
3.16 Packages
There are a number of statistical and econometric packages available on the Internet
that can greatly simplify research work. The following is a list of widely used
packages:
1. NumPy: Numerical Python (NumPy) is the foundation package. Other packages
and libraries are built on top of NumPy.
2. pandas: Provides data structures and processing capabilities similar to ones found
in R and Excel. Also provides time series capabilities.
3. SciPy: Collection of packages to tackle a number of computing problems in data
analytics, statistics, and linear algebra.
4. matplotlib: Plotting library. Allows you to plot a number of 2D graphs and serves
as the primary graphics library.
5. IPython: Interactive Python (IPython) shell that allows quick prototyping of code.
6. Statsmodels: Allows for data analysis, statistical model estimation, statistical
tests, and regressions, and function plotting.
7. BeautifulSoup: Python library for trawling the Web. Allows you to pull data from
HTML and XML pages.
8. Scikits: A number of packages for running simulations, machine learning, data
mining, optimization, and time series models.
9. RPy: This package integrates R with Python and allows users to run R code from
Python. This package can be really useful if certain functionality is not available
in Python but is available in R.
Chapter 30
Probability and Statistics
1 Introduction
This chapter is aimed at introducing and explaining some basic concepts of statistics
and probability in order to aid the reader in understanding some of the more
advanced concepts presented in the main text of the book. The main topics that are
discussed are set theory, permutations and combinations, discrete and continuous
probability distributions, descriptive statistics, and bivariate distributions.
While the main aim of this book is largely beyond the scope of these ideas, they
form the basis on which the advanced techniques presented have been developed. A
solid grasp of these fundamentals, therefore, is crucial to understanding the insights
that can be provided by more complex techniques.
However, in explaining these ideas, the chapter briefly sketches out the core
principles on which they are based. For a more comprehensive discussion, see
Complete Business Statistics by Aczel and Sounderpandian (McGraw-Hill, 2009).
P. Taori
London Business School, London, UK
e-mail: taori.peeyush@gmail.com
S. Mamidipudi · D. Agrawal
Indian School of Business, Hyderabad, Telangana, India
2 Foundations of Probability
P (∅) = 0
P (S) = 1
3. If two or more events are mutually exclusive (the subsets that describe their
outcomes are disjoint), then the probability of one of them happening is simply
the sum of the individual probabilities.
Bayes’ theorem is one of the most powerful tools in probability. The theorem allows
us to relate conditional probabilities, or the likelihood of an event occurring given
that some other event has occurred, to each other.
Say that P(A| B) is the probability of an event A given that event B has occurred.
Then, the probability of A and B occurring together is the probability of B occurring
times the probability of A occurring given B has occurred (this is like a chain rule).
P(A ∩ B) = P(B) · P(A|B)
P(B ∩ A) = P(A) · P(B|A)
algorithm. The algorithm postulates the likelihood of an event occurring (the prior),
absorbs and analyzes new data (the likelihood), and then updates its analysis to
reflect its new understanding (the posterior).
We can use Bayes’ theorem to analyze a dataset in order to understand the
likelihood of certain events given other events—for example, the likelihood of
owning a car given a person’s age and yearly salary. As more data is introduced into
the dataset, we can better compute the likelihood of certain characteristics occurring
in conjunction with the event, and thus better predict whether a person with a
random set of characteristics may own a car. For example, say 5% of the population
is known to own a car; call this event A. This can be inferred from your sample data. In
your sample, 12% are between 30 and 40 years of age; call this event B. In the subset
of persons that own a car, 25% are between age 30 and 40; this is P(B|A). Thus,
P(A) = 0.05, P(B) = 0.12, and P(B|A) = 0.25. By Bayes’ theorem,
P(A|B) = P(B|A) × P(A) / P(B) = (0.25 × 0.05) / 0.12 ≈ 0.104. In
other words, about 10.4% of those that are between 30 and 40 years of age own a car.
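The car-ownership calculation can be reproduced with a few lines of Python. This is a minimal sketch of Bayes’ theorem, P(A|B) = P(B|A) × P(A) / P(B); the function name bayes_posterior and the variable names are illustrative.

```python
def bayes_posterior(prior_a, prob_b, prob_b_given_a):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if not 0 < prob_b <= 1:
        raise ValueError("P(B) must be in (0, 1]")
    return prob_b_given_a * prior_a / prob_b

p_car = 0.05                 # P(A): owns a car
p_age = 0.12                 # P(B): aged between 30 and 40
p_age_given_car = 0.25       # P(B|A): aged 30 to 40 among car owners

p_car_given_age = bayes_posterior(p_car, p_age, p_age_given_car)
print(round(p_car_given_age, 3))
```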
Until now we have discussed probability in terms of sample spaces, in which the
likelihood of any single outcome is the same. We will now consider experiments in
which the likelihood of some outcomes is different than others. These experiments
are called random variables. A random variable assigns a numerical value to each
possible outcome of an experiment. These variables can be of two types: discrete or
continuous.
Discrete random variables are experiments in which there are a finite (or countably
infinite) number of outcomes. We might ask, for example, how many songs are on an
album. Continuous random variables, on the other hand, are experiments that might
result in any value in some range. For example, we might model the mileage driven
by a car as a continuous random variable.
Normally, we denote an experiment with a capital letter, such as X, and the
possibility of an outcome with a small letter, such as x. Therefore, in order to find
out the likelihood of X taking the value x (also known as x occurring), we would
write P(X = x). From the axioms of probability, we know that the sum of all P(x)
must be 1. From this property, we can construct a probability mass function (PMF)
P for X that describes the likelihood of each event x occurring.
Consider Table 30.1, which describes the results from rolling a fair die.
The PMF for each outcome P(X = x) (x = 1,2, . . . ,6) is equal to 1/6.
Now consider Table 30.2, which describes a die that has been altered.
In this case, the PMF tells us that the likelihood for some outcomes is greater
than the likelihood for other outcomes. The sum of all the PMFs is still equal to one,
but we can see that the die is no longer equally likely to produce each outcome.
Table 30.1 Probability from rolling a fair die
Outcome (x)   Probability (p)   PMF: P(X = x) = p
1             1/6               P(X = 1) = 1/6
2             1/6               P(X = 2) = 1/6
3             1/6               P(X = 3) = 1/6
4             1/6               P(X = 4) = 1/6
5             1/6               P(X = 5) = 1/6
6             1/6               P(X = 6) = 1/6
Table 30.2 Probability from rolling an altered die
Outcome (x)   Probability (p)   PMF: P(X = x) = p
1             1/12              P(X = 1) = 1/12
2             3/12              P(X = 2) = 1/4
3             1/6               P(X = 3) = 1/6
4             1/6               P(X = 4) = 1/6
5             3/12              P(X = 5) = 1/4
6             1/12              P(X = 6) = 1/12
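The two tables can be checked in code. The sketch below stores each PMF as a dictionary of exact fractions and verifies that the probabilities sum to one; the helper names is_valid_pmf and prob_event are assumptions introduced for this example.

```python
from fractions import Fraction

fair_die = {x: Fraction(1, 6) for x in range(1, 7)}
altered_die = {
    1: Fraction(1, 12), 2: Fraction(3, 12), 3: Fraction(1, 6),
    4: Fraction(1, 6), 5: Fraction(3, 12), 6: Fraction(1, 12),
}

def is_valid_pmf(pmf):
    """A PMF needs non-negative probabilities that sum to exactly 1."""
    return all(p >= 0 for p in pmf.values()) and sum(pmf.values()) == 1

def prob_event(pmf, outcomes):
    """P(X in outcomes): add up the PMF over the chosen outcomes."""
    return sum(pmf[x] for x in outcomes)

print(is_valid_pmf(fair_die), is_valid_pmf(altered_die))
print(prob_event(fair_die, {2, 5}), prob_event(altered_die, {2, 5}))
```

Using Fraction rather than floating point keeps the sum exact, so the check for a total of one does not suffer from rounding error.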
[Figure: area under a probability density curve between a and b, representing P{a ≤ x ≤ b}]
$$E[X] = \int_{-\infty}^{+\infty} x\,f(x)\,dx$$
This tells us that the “expected” value of an experiment may not actually be equal
to a value that the experiment can take. We cannot actually ever roll 3.5 on a die,
but we can expect that on average, the value that any die will take is 3.5.
In the case of a continuous random variable, the mean of the PDF cannot be
computed using discrete arithmetic. However, we can use calculus to derive the
same result.
By using the integral ∫ to replace the summation Σ, we can find:
$$\mu = \int x\,f(x)\,dx$$
$$\operatorname{Var}(X) = E[X^2] - (E[X])^2$$
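For a discrete random variable, the same two quantities are computed with sums instead of integrals. The sketch below applies the formulas E[X] = Σ x·P(X = x) and Var(X) = E[X²] − (E[X])² to a fair die; the function names mean and variance are illustrative.

```python
from fractions import Fraction

def mean(pmf):
    """E[X] = sum of x * P(X = x)."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) = E[X^2] - (E[X])^2."""
    second_moment = sum(x * x * p for x, p in pmf.items())
    return second_moment - mean(pmf) ** 2

fair_die = {x: Fraction(1, 6) for x in range(1, 7)}
die_mean = mean(fair_die)        # 7/2, a value the die itself can never show
die_var = variance(fair_die)     # 35/12
print(float(die_mean), float(die_var))
```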
> plot(cars)
In this case, we can use combinatorics to identify how many ways there are of
picking combinations. Combinatorics deals with the combinations of objects that
belong to a finite set. A permutation is a specific ordering of a set of events. For
example, the coin flipping heads on the first, third, and fourth flip out of five flips
is a permutation: HTHHT. Given “n” objects or events, there are n! (n factorial)
permutations of those events. In this case, given five events: H, H, H, T, and T,
there are 5! ways to order them. 5! = 5*4*3*2*1 = 120. (There may be some
confusion here. Notice that some of these permutations are the same. The number
120 comes up because we are treating the three heads as different heads and the
two tails as different tails. In other words, the five events are each treated as
distinct; it would have been clearer to label the events 1, 2, 3, 4, 5.)
However, sometimes we may want to choose a smaller number of events. Given
five events, we may want a set of three outcomes. In this case, the number of
permutations is given by 5!/(5 − 3)! = 5*4*3 = 60. That is, if we have “n” events,
and we would like to choose “k” of those events, the number of permutations is
n!/(n − k)!. If we had five cards numbered 1–5, the number of ways that we could
draw three cards from them, in order, would be 60. (In another way of seeing this, we can
choose the first event in five ways, the second in four ways, and the third in three
ways, and thus 5 * 4 * 3 = 60.)
A combination is the number of ways in which a set of outcomes can be
drawn, irrespective of the order in which the outcomes are drawn. If the number
of permutations of k events out of a set of n events is n!/(n − k)!, the number
of combinations of those events is the number of permutations, divided by the
number of ways in which those permutations occur: n!/((n − k)!k!). (Having drawn
k items, they themselves can be permuted k! times. Having drawn three items, we
can permute the three 3! times. The number of combinations of drawing three items
out of five equals 5!/((5 − 3)!3!) = 60/6 = 10.)
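The counts derived above can be confirmed in Python. This sketch assumes Python 3.8 or later, where math.perm and math.comb are available, and cross-checks the formulas by listing the draws with itertools; the variable names are illustrative.

```python
import math
from itertools import combinations, permutations

cards = [1, 2, 3, 4, 5]

orderings = math.factorial(5)            # n!
ordered_draws = math.perm(5, 3)          # n!/(n - k)!
unordered_draws = math.comb(5, 3)        # n!/((n - k)! k!)

# Cross-check the formulas by listing the draws explicitly.
listed_ordered = len(list(permutations(cards, 3)))
listed_unordered = len(list(combinations(cards, 3)))

print(orderings, ordered_draws, unordered_draws)
```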
Using the theory of combinations, we can understand the binomial distribution.
When we have repeated trials of the Bernoulli experiment, we obtain the binomial
distribution. Say we are flipping the unfair coin ten times, and we would like to
know the probability of the first four flips being heads.
Consider, however, the probability of four out of the ten flips being heads. There
are many orders (arrangements or sequences) in which the four flips could occur,
which means that the likelihood of P(X = 4) is much greater than 0.0011351. In
For X ∼ B(n,p), the mean E(X) is n*p, and Var(X) is n*p*(1 − p). (One can
verify that these equal n times the mean and variance of the Bernoulli distribution.)
A sample probability distribution for n = 10 and p = 0.25 is shown in Table 30.3.
In Excel, the command is BINOMDIST(x,n,p,cumulative). In R, the command is
dbinom(x, n, p).
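For readers working in Python rather than Excel or R, the binomial PMF can be written directly from the combination formula. This is a minimal sketch assuming Python 3.8 or later (for math.comb); the function name binomial_pmf is illustrative, and the values n = 10 and p = 0.25 follow the sample distribution mentioned above.

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p): C(n, x) * p^x * (1 - p)^(n - x)."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.25
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

total = sum(pmf)                                          # should be 1
mean = sum(x * q for x, q in enumerate(pmf))              # should be n*p
variance = sum((x - mean) ** 2 * q for x, q in enumerate(pmf))  # n*p*(1-p)

print(round(pmf[4], 4), round(mean, 4), round(variance, 4))
```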
The normal distribution is one of the most important continuous distributions, and
can be used to model a number of real-life phenomena. It is visually represented by
a bell curve.
Just as the binomial distribution is defined by two parameters (n and p), the
normal distribution can also be defined in terms of two parameters: μ (mean) and
sigma (standard deviation). Given the mean and standard deviation (or variance) of
the distribution, we can find the shape of the curve. We can denote this by writing
X ∼ N(μ, sigma).
The curve of normal distribution has the following properties:
1. The mean, median, and mode are equal.
2. The curve is symmetric about the mean.
3. The total area beneath the curve is equal to one.
4. The curve never touches the x-axis.
The mean of the normal distribution represents the location of centrality, about
which the curve is symmetric. The standard deviation specifies the width of the
curve.
The shape of the normal distribution has the property that we can know the
likelihood of a value falling within one, two, or three standard deviations
of the mean. Given the parameters of the distribution, we can say that about
68.3% of data points fall within one standard deviation of the mean, about 95.4%
within two standard deviations, and about 99.7% within three standard deviations
(refer to Fig. 30.3). A sample is shown below with mean = 10 and standard
deviation = 1. In Excel, the command to get the distribution is
NORMDIST(x, μ, σ, cumulative (0/1)). In R, the command for the cumulative
probability is pnorm(x, μ, σ), and dnorm(x, μ, σ) gives the density (Fig. 30.4).
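The one, two, and three standard deviation figures can be checked numerically. The following is a small sketch using NormalDist from the Python standard library with the sample parameters mean = 10 and standard deviation = 1; the helper name prob_within is an illustrative assumption.

```python
from statistics import NormalDist

dist = NormalDist(mu=10, sigma=1)

def prob_within(k):
    """Probability that a value lies within k standard deviations of the mean."""
    upper = dist.mean + k * dist.stdev
    lower = dist.mean - k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {prob_within(k):.4f}")
```

Changing mu and sigma leaves the three probabilities unchanged, which is why the rule holds for every normal distribution.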
However, computing probabilities for an arbitrary normal distribution can become
difficult. We can use the properties of the normal distribution to simplify this
process. In order to do this, we define the “standard” normal distribution,
denoted Z, as the normal distribution that has mean 0 and standard deviation 1.
For any variable X described by a normal distribution, z = (X − μ)/σ. The z-score
of a point on the normal distribution denotes how many standard deviations away it
is from the mean. Moreover, the area beneath a normal curve between any two points
is equal to the area beneath the standard normal curve between their corresponding
z-scores. This means that we only need to compute areas for z-scores in order to
find the areas beneath any other normal curve.
The second important property of the normal distribution is that it is
symmetric. Combined with the fact that the total area beneath the curve is one,
this means that:
1. P(Z > z) = 1 − P(Z < z)
2. P(Z < − z) = P(Z > z)
3. P(z1 < Z < z2) = P(Z < z2) − P(Z < z1)
30 Probability and Statistics 957
Standard normal distribution tables typically provide cumulative values P(Z < z)
only for z ≥ 0, where the cumulative probability is at least 0.5. Using symmetry,
we can derive any area beneath the curve from these tables.
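The standardization and the three identities can be illustrated with a short sketch. The values μ = 50, σ = 8, and x = 62 are hypothetical numbers chosen only so that the z-score comes out to 1.5; NormalDist is from the Python standard library.

```python
from statistics import NormalDist

Z = NormalDist()                 # standard normal: mean 0, standard deviation 1
X = NormalDist(mu=50, sigma=8)   # hypothetical normal variable

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

z = z_score(62, 50, 8)           # (62 - 50) / 8 = 1.5

# The area below x under X equals the area below z under Z
area_x = X.cdf(62)
area_z = Z.cdf(z)

# Identities 1 and 2: the upper tail beyond z equals the lower tail below -z
upper_tail = 1 - Z.cdf(z)
lower_tail = Z.cdf(-z)

# Identity 3: area between two z-scores
between = Z.cdf(2.0) - Z.cdf(-1.0)

print(f"z = {z}, P(X < 62) = {area_x:.4f}, P(Z < z) = {area_z:.4f}")
print(f"P(Z > z) = {upper_tail:.4f}, P(Z < -z) = {lower_tail:.4f}")
print(f"P(-1 < Z < 2) = {between:.4f}")
```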
The normal distribution is of utmost importance due to the property that, for a
sufficiently large sample, the mean of a random sample is approximately normally
distributed, with mean equal to the mean of the population and standard deviation
equal to the standard deviation of the population divided by the square root of
the sample size. This holds even when the population itself is not normally
distributed, and is called the central limit theorem. This theorem plays a big
role in the theory of sampling.
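A small simulation can make the theorem concrete. This is only a sketch under stated assumptions: the population is taken to be Uniform(0, 1), which is clearly not normal, and the sample size of 30, the 5000 repetitions, and the fixed seed are arbitrary illustrative choices.

```python
import random
from math import sqrt
from statistics import mean, stdev

def sample_means(sample_size, n_samples, seed=0):
    """Draw n_samples random samples from Uniform(0, 1) and return each sample's mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        sum(rng.random() for _ in range(sample_size)) / sample_size
        for _ in range(n_samples)
    ]

population_mean = 0.5          # mean of Uniform(0, 1)
population_sd = sqrt(1 / 12)   # standard deviation of Uniform(0, 1)
sample_size = 30

means = sample_means(sample_size, n_samples=5000)
observed_mean = mean(means)
observed_sd = stdev(means)
predicted_sd = population_sd / sqrt(sample_size)

print(f"mean of sample means: {observed_mean:.4f} (population mean {population_mean})")
print(f"sd of sample means:   {observed_sd:.4f} (predicted {predicted_sd:.4f})")
```

The observed standard deviation of the sample means should be close to the population standard deviation divided by the square root of the sample size.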
3 Statistical Analysis
and thus about the world that we have recorded in our dataset. These tools of
analysis, despite being very simple, can be incredibly profound and inform the most
advanced computational tools.
The use of data analysis that helps to describe, show, or summarize data in a
way that helps us identify patterns in the dataset is known as descriptive statistics.
The tools we use to make predictions or inferences about a population are called
inferential statistics. There are two main types of statistical analysis. The first is
univariate analysis, which describes a dataset that only records one variable. It
is mainly used to describe various characteristics of the dataset. The second is
multivariate analysis, which examines more than one variable at the same time in
order to determine the empirical relationship between them. Bivariate analysis is a
special case of multivariate analysis in which two variables are examined.
In order to analyze a dataset, we must first summarize the data, and then use the
data to make inferences.
The first type of statistic that we can derive from a variable in a numerical dataset
is measures of “central tendency,” or the tendency of data to cluster around some
value. The arithmetic mean, or the average, is the sum of all the values that the
variable takes in the set, divided by the number of values in the dataset. This mean
corresponds to the expected value we find in many probability distributions.
The median is the value in the dataset above which 50% of the data falls. It
partitions the dataset into two equal halves. Similarly, quartiles divide the data
into four equal quarters, and percentiles divide it into 100 equal parts.
If there is a value in the dataset that occurs more times (more often) than any
other, it is called the mode.
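These measures can be computed with the Python standard library, as in the sketch below. The data values are hypothetical, and the quartiles use the default method of statistics.quantiles; other software may follow a slightly different quartile convention and give slightly different values.

```python
from statistics import mean, median, mode, quantiles

# Hypothetical data, used only for illustration
data = [2, 3, 3, 5, 7, 8, 9, 11, 13]

average = mean(data)               # sum of values divided by their count
middle = median(data)              # half the values lie above, half below
most_common = mode(data)           # value that occurs most often
quartiles = quantiles(data, n=4)   # three cut points: Q1, Q2 (the median), Q3

print(f"mean = {average:.3f}, median = {middle}, mode = {most_common}")
print(f"quartiles = {quartiles}")
```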
The second type of statistic is measures of dispersion. Dispersion is a measure
of how clustered together data in the dataset are about the mean. We have already
encountered the first measure of dispersion—variance. The variance is also known
as the second central moment of the dataset—it is measured by the formula:
$$\frac{\sum (\text{Data value} - \text{Mean})^2}{n}$$

where n is the size of the sample.
In order to find higher measures of dispersion, we measure the expected values
of higher powers of the deviations of the dataset from the mean. In general,
$$r\text{-th central moment} = \mu_r = \frac{\sum (\text{Data value} - \text{Mean})^r}{n} = E\left[(X - \mu)^r\right].$$
Mainly, the third and fourth central moments are useful to understand the shape
of the distribution. The third central moment of a variable is useful to evaluate a
measure called the skewness of the dataset. The skewness is a measure of symmetry,
and usually the mode can indicate whether the dataset is skewed in a certain
direction. The coefficient of skewness is calculated as:
$$\beta_1 = \frac{\mu_3^2}{\mu_2^3}$$

As skewness proceeds from negative to positive (its sign is that of the third
central moment μ3, since β1 itself cannot be negative), the distribution moves
from being left skewed to right skewed. At zero it is a symmetric distribution
(Fig. 30.5).

[Figure: plots of f(x) against x for a symmetric distribution (Mean = Median = Mode), a right-skewed distribution, a left-skewed distribution, and a symmetric distribution with two modes (Mean = Median), each marking the positions of the mean, median, and mode.]
The fourth central moment is used to measure kurtosis, which is a measure of
the “tailedness” of the distribution. We can think of kurtosis as a measure of how
likely extreme values are in the dataset. While variance measures the average
squared distance of the data points from the mean, kurtosis helps us understand
how long and fat the tails of the distribution are. The coefficient of kurtosis is
measured as (Fig. 30.6):

$$\beta_2 = \frac{\mu_4}{\mu_2^2}$$
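The central moments and the two coefficients can be sketched in a few lines of Python. The function names and both datasets are hypothetical and chosen for illustration; the formulas are the ones used in this section, namely the r-th central moment as the average of (value − mean)^r, β1 = μ3²/μ2³, and β2 = μ4/μ2².

```python
def central_moment(data, r):
    """r-th central moment: average of (value - mean)**r over the dataset."""
    n = len(data)
    m = sum(data) / n
    return sum((x - m) ** r for x in data) / n

def skewness_coefficient(data):
    """beta_1 = mu_3**2 / mu_2**3 (never negative; the sign of mu_3 gives the direction)."""
    return central_moment(data, 3) ** 2 / central_moment(data, 2) ** 3

def kurtosis_coefficient(data):
    """beta_2 = mu_4 / mu_2**2."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2

symmetric = [1, 2, 3, 4, 5]         # hypothetical symmetric data
right_skewed = [1, 1, 2, 2, 3, 10]  # hypothetical data with a long right tail

for name, values in (("symmetric", symmetric), ("right skewed", right_skewed)):
    print(
        f"{name}: variance = {central_moment(values, 2):.3f}, "
        f"mu_3 = {central_moment(values, 3):.3f}, "
        f"beta_1 = {skewness_coefficient(values):.3f}, "
        f"beta_2 = {kurtosis_coefficient(values):.3f}"
    )
```

For the symmetric data the third central moment is zero, while the long right tail of the second dataset makes it positive.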
4 Visualizing Data
3. From each side of the box, extend a line to the maximum and minimum values
4. Indicate the median in the box with a solid line
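The quantities that these steps draw can be computed before plotting. The sketch below assumes, as is conventional, that the box spans the first to the third quartile; the data values and the function name five_number_summary are hypothetical, and the quartiles follow the default method of Python's statistics.quantiles.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a box plot."""
    ordered = sorted(values)
    q1, q2, q3 = quantiles(ordered, n=4)
    return ordered[0], q1, q2, q3, ordered[-1]

# Hypothetical measurements, used only for illustration
measurements = [4, 7, 9, 10, 12, 15, 20]

minimum, q1, med, q3, maximum = five_number_summary(measurements)
print(f"whiskers: {minimum} to {maximum}")
print(f"box: {q1} to {q3}, median line at {med}")
```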
In R, the box plot is created using the function boxplot. The syntax is
boxplot(variable name). For example, let us draw a box plot for the distance
variable in the cars dataset (Fig. 30.7):
> boxplot(cars$dist)
5 Bivariate Analysis
Bivariate analysis is among the most basic types of analysis. By using the tools
developed to consider any single variable, we can find correlations between two
variables. Scatter plots, frequency tables, and box plots are frequently used in
bivariate analysis.
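The first step toward understanding bivariate analysis is extending the idea of
variance. Variance is a measure of the dispersion of a variable. Covariance is a
measure of the combined deviation of two variables from their means. It measures
how much one variable changes when another variable changes. It is calculated as:

$$\mathrm{Cov}(X, Y) = E\left[(X - \mu_x)(Y - \mu_y)\right]$$

Dividing the covariance by the product of the two standard deviations gives the
coefficient of correlation:

$$\rho_{xy} = \frac{\mathrm{Cov}(X, Y)}{\sigma_x \, \sigma_y}$$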
The coefficient of correlation always lies between −1 and +1. As it moves from
negative to positive, the variables change from moving perfectly against one another
to perfectly with one another. At 0, the variables do not move with each other in
a linear, average sense. Independent variables are uncorrelated, but uncorrelated
variables are not necessarily independent; an exception is when the two variables
are jointly normally distributed, in which case zero correlation does imply
independence.
For the variables cars$dist and cars$speed, covariance = 109.95 and
correlation = 0.8068.
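The same two quantities can be computed by hand, as in the following sketch. The paired values x and y are hypothetical and are not the cars dataset; the n − 1 denominator is an assumption chosen to match the sample covariance that R's cov() reports.

```python
from math import sqrt

def covariance(x, y):
    """Sample covariance of paired values, with n - 1 in the denominator."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))

# Hypothetical paired observations
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

print(f"covariance = {covariance(x, y):.4f}")
print(f"correlation = {correlation(x, y):.4f}")
print(f"correlation of x with 2x = {correlation(x, [2 * a for a in x]):.4f}")
```

The last line shows the upper limit of the coefficient: a variable that is an exact increasing linear function of another has correlation +1.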