A HANDS-ON INTRODUCTION TO DATA SCIENCE
CHIRAG SHAH
A Hands-On Introduction to Data Science
This book introduces the field of data science in a practical and accessible manner, using
a hands-on approach that assumes no prior knowledge of the subject. The foundational
ideas and techniques of data science are provided independently from technology, allowing
students to easily develop a firm understanding of the subject without a strong technical
background, as well as being presented with material that will have continual relevance
even after tools and technologies change. Using popular data science tools such as Python
and R, the book offers many examples of real-life applications, with practice ranging from
small to big data. A suite of online material for both instructors and students provides
a strong supplement to the book, including datasets, chapter slides, solutions, sample
exams, and curriculum suggestions. This entry-level textbook is ideally suited to readers
from a range of disciplines wishing to build a practical, working knowledge of data science.
CHIRAG SHAH
University of Washington
University Printing House, Cambridge CB2 8BS, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre,
New Delhi – 110025, India
79 Anson Road, #06–04/06, Singapore 079906
www.cambridge.org
Information on this title: www.cambridge.org/9781108472449
DOI: 10.1017/9781108560412
© Chirag Shah 2020
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2020
Printed in Singapore by Markono Print Media Pte Ltd
A catalogue record for this publication is available from the British Library.
ISBN 978-1-108-47244-9 Hardback
Additional resources for this publication at www.cambridge.org/shah.
Cambridge University Press has no responsibility for the persistence or accuracy of
URLs for external or third-party Internet websites referred to in this publication
and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.
To my amazingly smart and sweet daughters –
Sophie, Zoe, and Sarah – for adding colors and
curiosity back to doing science and living life!
Contents
Preface page xv
About the Author xx
Acknowledgments xxii
1 Introduction 3
1.1 What Is Data Science? 3
1.2 Where Do We See Data Science? 5
1.2.1 Finance 6
1.2.2 Public Policy 7
1.2.3 Politics 8
1.2.4 Healthcare 9
1.2.5 Urban Planning 10
1.2.6 Education 10
1.2.7 Libraries 11
1.3 How Does Data Science Relate to Other Fields? 11
1.3.1 Data Science and Statistics 12
1.3.2 Data Science and Computer Science 13
1.3.3 Data Science and Engineering 13
1.3.4 Data Science and Business Analytics 14
1.3.5 Data Science, Social Science, and Computational Social Science 14
1.4 The Relationship between Data Science and Information Science 15
1.4.1 Information vs. Data 16
1.4.2 Users in Information Science 16
1.4.3 Data Science in Information Schools (iSchools) 17
1.5 Computational Thinking 17
1.6 Skills for Data Science 21
1.7 Tools for Data Science 27
1.8 Issues of Ethics, Bias, and Privacy in Data Science 29
Summary 30
Key Terms 31
Conceptual Questions 32
Hands-On Problems 32
2 Data 37
2.1 Introduction 37
2.2 Data Types 37
2.2.1 Structured Data 38
2.2.2 Unstructured Data 38
2.2.3 Challenges with Unstructured Data 39
2.3 Data Collections 39
2.3.1 Open Data 40
2.3.2 Social Media Data 41
2.3.3 Multimodal Data 41
2.3.4 Data Storage and Presentation 42
2.4 Data Pre-processing 47
2.4.1 Data Cleaning 48
2.4.2 Data Integration 50
2.4.3 Data Transformation 51
2.4.4 Data Reduction 51
2.4.5 Data Discretization 52
Summary 59
Key Terms 60
Conceptual Questions 60
Hands-On Problems 61
Further Reading and Resources 65
3 Techniques 66
3.1 Introduction 66
3.2 Data Analysis and Data Analytics 67
3.3 Descriptive Analysis 67
3.3.1 Variables 68
3.3.2 Frequency Distribution 71
3.3.3 Measures of Centrality 75
3.3.4 Dispersion of a Distribution 77
3.4 Diagnostic Analytics 82
3.4.1 Correlations 82
3.5 Predictive Analytics 84
3.6 Prescriptive Analytics 85
3.7 Exploratory Analysis 86
3.8 Mechanistic Analysis 87
3.8.1 Regression 87
Summary 89
Key Terms 91
Conceptual Questions 92
Hands-On Problems 92
Further Reading and Resources 95
4 UNIX 99
4.1 Introduction 99
4.2 Getting Access to UNIX 100
4.3 Connecting to a UNIX Server 102
4.3.1 SSH 102
4.3.2 FTP/SCP/SFTP 104
4.4 Basic Commands 106
4.4.1 File and Directory Manipulation Commands 106
4.4.2 Process-Related Commands 108
4.4.3 Other Useful Commands 109
4.4.4 Shortcuts 109
4.5 Editing on UNIX 110
4.5.1 The vi Editor 110
4.5.2 The Emacs Editor 111
4.6 Redirections and Piping 112
4.7 Solving Small Problems with UNIX 113
Summary 121
Key Terms 121
Conceptual Questions 122
Hands-On Problems 122
Further Reading and Resources 123
5 Python 125
5.1 Introduction 125
5.2 Getting Access to Python 125
5.2.1 Download and Install Python 126
5.2.2 Running Python through Console 126
5.2.3 Using Python through Integrated Development Environment (IDE) 126
5.3 Basic Examples 128
5.4 Control Structures 131
5.5 Statistics Essentials 133
5.5.1 Importing Data 136
5.5.2 Plotting the Data 137
5.5.3 Correlation 138
5.5.4 Linear Regression 138
5.5.5 Multiple Linear Regression 141
5.6 Introduction to Machine Learning 145
5.6.1 What Is Machine Learning? 145
5.6.2 Classification (Supervised Learning) 147
5.6.3 Clustering (Unsupervised Learning) 150
5.6.4 Density Estimation (Unsupervised Learning) 153
Summary 155
Key Terms 156
Conceptual Questions 157
Hands-On Problems 157
Further Reading and Resources 159
6 R 161
6.1 Introduction 161
6.2 Getting Access to R 162
6.3 Getting Started with R 163
6.3.1 Basics 163
6.3.2 Control Structures 165
6.3.3 Functions 167
6.3.4 Importing Data 167
6.4 Graphics and Data Visualization 168
6.4.1 Installing ggplot2 168
6.4.2 Loading the Data 169
6.4.3 Plotting the Data 169
6.5 Statistics and Machine Learning 174
6.5.1 Basic Statistics 174
6.5.2 Regression 176
6.5.3 Classification 178
6.5.4 Clustering 180
Summary 182
Key Terms 183
Conceptual Questions 184
Hands-On Problems 184
Further Reading and Resources 185
7 MySQL 187
7.1 Introduction 187
7.2 Getting Started with MySQL 188
7.2.1 Obtaining MySQL 188
7.2.2 Logging in to MySQL 188
7.3 Creating and Inserting Records 191
7.3.1 Importing Data 191
7.3.2 Creating a Table 192
7.3.3 Inserting Records 192
7.4 Retrieving Records 193
7.4.1 Reading Details about Tables 193
7.4.2 Retrieving Information from Tables 193
7.5 Searching in MySQL 195
7.5.1 Searching within Field Values 195
7.5.2 Full-Text Searching with Indexing 195
Appendices
Appendix A: Useful Formulas from Differential Calculus 379
Further Reading and Resources 380
Appendix B: Useful Formulas from Probability 381
Further Reading and Resources 381
Appendix C: Useful Resources 383
C.1 Tutorials 383
C.2 Tools 383
Appendix D: Installing and Configuring Tools 385
D.1 Anaconda 385
D.2 IPython (Jupyter) Notebook 385
D.3 Spyder 387
D.4 R 387
D.5 RStudio 388
Appendix E: Datasets and Data Challenges 390
E.1 Kaggle 390
E.2 RecSys 391
E.3 WSDM 391
E.4 KDD Cup 392
Appendix F: Using Cloud Services 393
F.1 Google Cloud Platform 394
F.2 Hadoop 398
F.3 Microsoft Azure 400
F.4 Amazon Web Services (AWS) 403
Appendix G: Data Science Jobs 407
G.1 Marketing 408
G.2 Corporate Retail and Sales 409
G.3 Legal 409
G.4 Health and Human Services 410
Preface
Data science is one of the fastest-growing disciplines at the university level. We see more
job postings that require training in data science, more academic appointments in the field,
and more courses offered, both online and in traditional settings. It could be argued that data
science is nothing novel, but just statistics through a different lens. What matters is that we
are living in an era in which the kind of problems that could be solved using data are driving
a huge wave of innovations in various industries – from healthcare to education, and from
finance to policy-making. More importantly, data and data analysis are playing an increas-
ingly large role in our day-to-day life, including in our democracy. Thus, knowing the
basics of data and data analysis has become a fundamental skill that everyone needs, even if
they do not want to pursue a degree in computer science, statistics, or data science.
Recognizing this, many educational institutions have started developing and offering not
just degrees and majors in the field but also minors and certificates in data science that are
geared toward students who may not become data scientists but could still benefit from data
literacy skills in the same way every student learns basic reading, writing, and comprehen-
sion skills.
This book is not just for data science majors but also for those who want to develop their
data literacy. It is organized in a way that provides a very easy entry point for almost anyone to be introduced to data science, but it also has enough fuel to take one from that
beginning stage to a place where they feel comfortable obtaining and processing data for
deriving important insights. In addition to providing basics of data and data processing, the
book teaches standard tools and techniques. It also examines implications of the use of data
in areas such as privacy, ethics, and fairness. Finally, as the name suggests, this text is meant
to provide a hands-on introduction to these topics. Almost everything presented in the book
is accompanied by examples and exercises that one could try – sometimes by hand and
other times using the tools taught here. In teaching these topics myself, I have found this to
be a very effective method.
The remainder of this preface explains how this book is organized, how it could be used
for fulfilling various teaching needs, and what specific requirements a student needs to meet
to make the most out of it.
This book is intended for students from a range of disciplines who are interested in data science. It is not meant to provide in-depth treatment of any
programming language, tool, or platform. Similarly, while the book covers topics such as
machine learning and data mining, it is not structured to give detailed theoretical instruction
on them; rather, these topics are covered in the context of applying them to solving various
data problems with hands-on exercises.
The book assumes very little to no prior exposure to programming or technology. It does,
however, expect the student to be comfortable with computational thinking (see Chapter 1)
and the basics of statistics (covered in Chapter 3). The student should also have general
computer literacy, including skills to download, install, and configure software, do file
operations, and use online resources. Each chapter lists specific requirements and expecta-
tions, many of which can be met by going over some other parts of the book (usually an
earlier chapter or an appendix).
Almost all the tools and software used in this book are free. There is no requirement of
a specific operating system or computer architecture, but it is assumed that the student has
a relatively modern computer with reasonable storage, memory, and processing power. In
addition, a reliable and preferably high-speed Internet connection is required for several
parts of this book.
The book is organized in four parts. Part I includes three chapters that serve as the
foundations of data science. Chapter 1 introduces the field of data science, along with
various applications. It also points out important differences and similarities with related
fields of computer science, statistics, and information science. Chapter 2 describes the
nature and structure of data as we encounter it today. It introduces the student to data formats, storage, and retrieval infrastructures. Chapter 3 introduces several important techniques for data science. These techniques stem primarily from statistics and include correlation analysis, regression, and an introduction to data analytics.
Part II of this book includes chapters to introduce various tools and platforms such as
UNIX (Chapter 4), Python (Chapter 5), R (Chapter 6), and MySQL (Chapter 7). It is
important to keep in mind that, since this is not a programming or database book, the
objective here is not to go systematically into various parts of these tools. Rather, we focus
on learning the basics and the relevant aspects of these tools to be able to solve various data
problems. These chapters therefore are organized around addressing various data-driven
problems. In the chapters covering Python and R, we also introduce basic machine learning.
But machine learning is a crucial topic for data science that cannot be treated just as an
afterthought, which is why Part III of this book is devoted to it. Specifically, Chapter 8
provides a more formal introduction to machine learning and includes a few techniques that
are basic and broadly applicable at the same time. Chapter 9 describes in some depth
supervised learning methods, and Chapter 10 presents unsupervised learning. It should be
noted that, since this book is focused on data science and not core computer science or
mathematics, we skip much of the underlying math and formal structuring while discussing
and applying machine learning techniques. The chapters in Part III, however, do present
machine learning methods and techniques using adequate math in order to discuss the
theories and intuitions behind them in detail.
Finally, Part IV of this book takes the techniques from Part I, as well as the tools from
Parts II and III to start applying them to problems of real-life significance. In Chapter 11, we
take this opportunity by applying various data science techniques to several real-life
problems, including those involving social media, finance, and social good. Finally,
Chapter 12 provides additional coverage into data collection, experimentation, and
evaluation.
The book is full of extra material that either adds more value and knowledge to your
existing data science theories and practices, or provides broader and deeper treatment of
some of the topics. Throughout the book, there are several FYI boxes that provide important
and relevant information without interrupting the flow of the text, allowing the student to be
aware of various issues such as privacy, ethics, and fairness without being overwhelmed by
them. The appendices of this book provide quick reference to various formulations relating
to differential calculus and probability, as well as helpful pointers and instructions for
installing and configuring various tools used in the book. For those interested in using
cloud-based platforms and tools, there is also an appendix that shows how to sign up,
configure, and use them. Another appendix provides a listing of various sources for obtaining small to large datasets for more practice, and even for participating in data challenges to win some cool prizes and recognition. There is also an appendix that provides helpful information related to data science jobs in various fields and what skills one should have to target those openings. Finally, a couple of appendices introduce the ideas of data ethics and data science for
social good to inspire you to be a responsible and socially aware data citizen.
The book also has an online appendix (OA), accessible through the book’s website at
www.cambridge.org/shah, which is regularly updated to reflect any changes in data and
other resources. The primary purpose for this online appendix is to provide you with the
most current and updated datasets or links to datasets that you can download and use in the
dozens of examples and try-it-yourself exercises in the chapters, as well as data problems
at the end of the chapters. Look for the icon at various places indicating that you need to find the needed resource in the OA. In the description of that exercise, you will see
the specific number (e.g., OA 3.2) that tells you where exactly you should go in the online
appendix.
The book is quite deliberately organized around teaching data science to beginner computer
science (CS) students or intermediate to advanced non-CS students. The book is modular,
making it easier for both students and teachers to cover topics to the desired depth. This
makes it quite suitable for the book to be used as a main reference book or textbook for
a data science curriculum. The following is a suggested curriculum path in data science
using this book. It contains five courses, each lasting a semester or a quarter.
• Introduction to data science: Chapters 1 and 2, with some elements from Part II as needed.
• Data analytics: Chapter 3, with some elements from Part II as needed.
• Problem solving with data or programming for data science: Chapters 4–7.
• Machine learning for data science: Chapters 8–10.
• Research methods for data science: Chapter 12, with appropriate elements from Chapter
3 and Part II.
At the website for this book is a Resources tab with a section labeled “For Instructors.”
This section contains sample syllabi for various courses that could be taught using this
book, PowerPoint slides for each chapter, and other useful resources such as sample mid-
term and final exams. These resources make it easier for someone teaching this course for
the first time to adapt the text as needed for his or her own data science curriculum.
Each chapter also has several conceptual questions and hands-on problems. The con-
ceptual questions could be used for in-class discussions, homework, or quizzes.
For each new technique or problem covered in this book, there are at least two hands-on
problems. One of these could be used in class and the other could be given as homework or an exam question. Most hands-on exercises in the chapters are also immediately followed by hands-on homework exercises that a student could try for further practice, or that an instructor could assign as homework or an in-class practice assignment.
Data science has a very visible presence these days, and it is not surprising that there are
currently several available books and much material related to the field. A Hands-On
Introduction to Data Science is different from the other books in several ways.
• It is targeted to students with very basic experience with technology. Students who fit in
that category are majoring in information science, business, psychology, sociology,
education, health, cognitive science, and indeed any area in which data can be applied.
The study of data science should not be limited to those studying computer science or
statistics. This book is intended for those audiences.
• The book starts by introducing the field of data science without any prior expectation of
knowledge on the part of the reader. It then introduces the reader to some foundational
ideas and techniques that are independent of technology. This does two things: (1) it
provides an easier access point for a student without strong technical background; and
(2) it presents material that will continue to be relevant even when tools and technologies
change.
• Based on my own teaching and curriculum development experiences, I have found that
most data science books on the market are divided into two categories: they are either too
technical, making them suitable only for a limited audience, or they are structured to be
simply informative, making it hard for the reader to actually use and apply data science
tools and techniques. A Hands-On Introduction to Data Science is aimed at a nice middle
ground: On the one hand, it is not simply describing data science, but also teaching real
hands-on tools (Python, R) and techniques (from basic regression to various forms of
machine learning). On the other hand, it does not require students to have a strong
technical background to be able to learn and practice data science.
• A Hands-On Introduction to Data Science also examines implications of the use of data in
areas such as privacy, ethics, and fairness. For instance, it discusses how unbalanced data
used without enough care with a machine learning technique could lead to biased (and
often unfair) predictions. There is also an introduction to the newly formulated General
Data Protection Regulation (GDPR) in Europe.
• The book provides many examples of real-life applications, as well as practices ranging
from small to big data. For instance, Chapter 4 has an example of working with housing
data where simple UNIX commands could extract valuable insights. In Chapter 5, we see
how multiple linear regression can be easily implemented using Python to learn how
advertising spending on various media (TV, radio) could influence sales. Chapter 6
includes an example that uses R to analyze data about wines to predict which ones are
of high quality. Chapters 8–10 on machine learning have many real-life and general
interest problems from different fields as the reader is introduced to various techniques.
Chapter 11 has hands-on exercises for collecting and analyzing social media data from
services such as Twitter and YouTube, as well as working with large datasets (Yelp data
with more than a million records). Many of the examples can be worked by hand or with
everyday software, without requiring specialized tools. This makes it easier for a student
to grasp a concept without having to worry about programming structures. This allows
the book to be used for non-majors as well as professional certificate courses.
• Each chapter has plenty of in-chapter exercises where I walk the reader through solving
a data problem using a new technique, homework exercises to do more practice, and more
hands-on problems (often using real-life data) at the end of the chapters. There are 37
hands-on solved exercises, 46 hands-on try-it-yourself exercises, and 55 end-of-chapter
hands-on problems.
• The book is supplemented by a generous set of material for instructors. These instructor
resources include curriculum suggestions (even full-length syllabuses for some courses),
slides for each chapter, datasets, program scripts, answers and solutions to each exercise,
as well as sample mid-term exams and final projects.
About the Author
He has also delivered special courses and tutorials at various international venues, and
created massive open online courses (MOOCs) for platforms such as Coursera. He
has developed several courses and curricula for data science and advised dozens of
undergraduate and graduate students pursuing data science careers. This book is
a result of his many years of teaching, advising, researching, and realizing the need
for such a resource.
chirags@uw.edu
http://chiragshah.org
@chirag_shah
Acknowledgments
A book like this does not happen without a lot of people’s help and it would be rude of me to
not acknowledge at least some of those people here.
As is the case with almost all of my projects, this one would not have been possible
without the love and support of my wife Lori. She not only understands late nights and long
weekends working on a project like this, but also keeps me grounded on what matters the
most in life – my family, my students, and the small difference that I am trying to make in
this world through the knowledge and skills I have.
My sweet and smart daughters – Sophie, Zoe, and Sarah – have also kept me connected to reality while I worked on this book. They have inspired me to look beyond data and
information to appreciate the human values behind them. After all, why bother doing
anything in this book if it is not helping human knowledge and advancement in some
way? I am constantly amazed by my kids’ curiosity and sense of adventure, because those
are the qualities one needs in doing any kind of science, and certainly data science. A lot of
the analyses and problem solving presented in this book fall under this category, where we
are not simply processing some data, but are driven by a sense of curiosity and a quest to
derive new knowledge.
This book, as I have noted in the Preface, happened organically over many years through
developing and teaching various data science classes. And so I need to thank all of those
students who sat in my classes or joined online, went through my material, asked questions,
provided feedback, and helped me learn more. With every iteration of every class I have
taught in data science, things have gotten better. In essence, what you are holding in your
hands is the result of the best iteration so far.
In addition to hundreds (or thousands, in the case of MOOCs) of students over the years,
there are specific students and assistants I need to thank for their direct and substantial
contributions to this book. My InfoSeeking Lab assistants Liz Smith and Catherine
McGowan have been tremendously helpful in not only proofreading, but also helping with the literature review and contributing several pieces of writing. Similarly, Dongho Choi and Soumik Mandal, two of my Ph.D. students, have contributed substantially to some of the writing and many of the examples and exercises presented in this book. If it
was not for the help and dedication of these four people, this book would have been delayed
by at least a year.
I am also thankful to my Ph.D. students Souvick Ghosh, who provided some writeup on
misinformation, and Ruoyuan Gao, for contributing to the topic of fairness and bias.
Finally, I am eternally grateful to the wonderful staff of Cambridge University Press for
guiding me through the development of this book from the beginning. I would specifically
call out Lauren Cowles, Lisa Pinto, and Stefanie Seaton. They have been an amazing team
helping me in almost every aspect of this book, ensuring that it meets the highest standards
of quality and accessibility that one would expect from the Press. Writing a book is often
a painful endeavor, but when you have a support team like this, it becomes possible and
even a fun project!
I am almost certain that I have forgotten many more people to thank here, but they should
know that it was a result of my forgetfulness and not ungratefulness.
PART I
CONCEPTUAL INTRODUCTIONS
This part includes three chapters that serve as the foundations of data science. If you have
never done anything with data science or statistics, I highly recommend going through this
part before proceeding further. If, on the other hand, you have a good background in
statistics and a basic knowledge of data storage, formats, and processing, you can easily
skim through most of the material here.
Chapter 1 introduces the field of data science, along with various applications. It also
points out important differences and similarities with related fields of computer science,
statistics, and information science.
Chapter 2 describes the nature and structure of data as we encounter it today. It introduces the student to data formats, storage, and retrieval infrastructures.
Chapter 3 introduces several important techniques for data science. These techniques stem primarily from statistics and include correlation analysis, regression, and an introduction to data analytics.
No matter where you come from, I would still recommend paying attention to some of the
sections in Chapter 1 that introduce various basic concepts of data science and how they are
related to other disciplines. In my experience, I have also found that various aspects of data
pre-processing are often skipped in many data science curricula, but if you want to develop
a more comprehensive understanding of data science, I suggest you go through Chapter 2 as
well. Finally, even if you have a solid background in statistics, it would not hurt to at least
skim through Chapter 3, as it introduces some of the statistical concepts that we will need
many times in the rest of the book.
1 Introduction
Sherlock Holmes would have loved living in the twenty-first century. We are drenched in
data, and so many of our problems (including a murder mystery) can be solved using large
amounts of data existing at personal and societal levels.
1.1 What Is Data Science?
These days it is fair to assume that most people are familiar with the term “data.” We see it
everywhere. And if you have a cellphone, then chances are this is something you have
encountered frequently. Assuming you are a “connected” person who has a smartphone, you
probably have a data plan from your phone service provider. The most common cellphone plans
in the USA include unlimited talk and text, and a limited amount of data – 5 GB, 20 GB, etc.
And if you have one of these plans, you know well that you are “using data” through your phone
and you get charged per usage of that data. You understand that checking your email and posting
a picture on a social media platform consumes data. And if you are a curious (or thrifty) sort, you
calculate how much data you consume monthly and pick a plan that fits your needs.
You may also have come across terms like “data sharing,” when picking a family plan for
your phone(s). But there are other places where you may have encountered the notion of
data sharing. For instance, if you have concerns about privacy, you may want to know if
your cellphone company “shares” data about you with others (including the government).
And finally, you may have heard about “data warehouses,” as if data is being kept in big
boxes on tall shelves in middle-of-nowhere locations.
In the first case, the individual is consuming data by retrieving email messages and
posting pictures. In the second scenario concerning data sharing, “data” refers to informa-
tion about you. And third, data is used as though it represents a physical object that is being
stored somewhere. The nature and the size of “data” in these scenarios vary enormously –
from personal to institutional, and from a few kilobytes (kB) to several petabytes (PB).
In this book, we will consider these and more scenarios and learn about defining, storing,
cleaning, retrieving, and analyzing data – all for the purpose of deriving meaningful insights
toward making decisions and solving problems. And we will use systematic, verifiable, and
repeatable processes; or in other words, we will apply scientific approaches and techniques.
Finally, we will do almost all of these processes with a hands-on approach. That means we
will look at data and situations that generate or use data, and we will manipulate data using
tools and techniques. But before we begin, let us look at how others describe data science.
Frank Lo, the Director of Data Science at Wayfair, says this on datajobs.com: “Data
science is a multidisciplinary blend of data inference, algorithm development, and technol-
ogy in order to solve analytically complex problems.”1 He goes on to elaborate that data
science, at its core, involves uncovering insights from mining data. This happens through
exploration of the data using various tools and techniques, testing hypotheses, and creating
conclusions with data and analyses as evidence.
In one famous article, Davenport and Patil2 called data science “the sexiest job of the
twenty-first century.” Listing data-driven companies such as (in alphabetical order) Amazon,
eBay, Google, LinkedIn, Microsoft, Twitter, and Walmart, the authors see a data scientist as a
hybrid of data hacker, analyst, communicator, and trusted adviser; a Sherlock Holmes for the
twenty-first century. As data scientists face technical limitations and make discoveries to
address these problems, they communicate what they have learned and suggest implications
for new business directions. They also need to be creative in visually displaying information,
and clearly and compellingly showing the patterns they find. One of the data scientist’s most
important roles in the field is to advise executives and managers on the implications of the
data for their products, services, processes, and decisions.
In this book, we will consider data science as a field of study and practice that involves
the collection, storage, and processing of data in order to derive important insights into a
problem or a phenomenon. Such data may be generated by humans (surveys, logs, etc.) or
machines (weather data, road vision, etc.), and could be in different formats (text, audio,
video, augmented or virtual reality, etc.). We will also treat data science as an independent
field by itself rather than a subset of another domain, such as statistics or computer science.
This will become clearer as we look at how data science relates to and differs from various
fields and disciplines later in this chapter.
Why is data science so important now? Dr. Tara Sinclair, the chief economist at indeed.
com since 2013, said, “the number of job postings for ‘data scientist’ grew 57%” year-over-
year in the first quarter of 2015.3 Why have both industry and academia recently increased
their demand for data science and data scientists? What changed within the past several
years? The answer is not surprising: we have a lot of data, we continue to generate a
staggering amount of data at an unprecedented and ever-increasing speed, analyzing data
wisely necessitates the involvement of competent and well-trained practitioners, and
analyzing such data can provide actionable insights.
The “3V model” attempts to lay this out in a simple (and catchy) way. These are the three Vs:
1. Velocity: The speed at which data is accumulated.
2. Volume: The size and scope of the data.
3. Variety: The massive array of data and types (structured and unstructured).
Each of these three Vs regarding data has dramatically increased in recent years. Specifically,
the increasing volume of heterogeneous and unstructured (text, images, and video) data, as well
as the possibilities emerging from their analysis, renders data science ever more essential. Figure
1.1 shows the expected volume of data reaching 40 zettabytes (ZB) by the end of 2020, which is a 50-fold increase over what was available at the beginning of 2010. How much is that really? If your computer has a 1 terabyte (TB) hard drive (roughly 1000 GB), 40 ZB is 40 billion times that. To provide a different perspective, the world population is projected to be close to 8 billion by the end of 2020, which means, if we think about data per person, each individual in the world (even the newborns) will have 5 TB of data.
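To see where that per-person figure comes from, here is a quick back-of-the-envelope check in Python (the language we pick up in Chapter 5). It simply divides the 40 ZB quoted above by 8 billion people, assuming decimal (SI) units.

# Back-of-the-envelope check of the per-person data estimate quoted above.
# Assumes decimal (SI) units: 1 ZB = 10**21 bytes, 1 TB = 10**12 bytes.
total_data_zb = 40                   # projected global data volume by end of 2020, in ZB
world_population = 8_000_000_000     # roughly 8 billion people

total_bytes = total_data_zb * 10**21
per_person_tb = total_bytes / world_population / 10**12
print(f"Data per person: {per_person_tb:.1f} TB")    # prints: Data per person: 5.0 TB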
1.2 Where Do We See Data Science?
The question should be: Where do we not see data science these days? The great thing about
data science is that it is not limited to one facet of society, one domain, or one department of
a university; it is virtually everywhere. Let us look at a few examples.
Figure 1.1 Increase of data volume in the last 15 years. (Source: IDC’s Digital Universe Study, December 2012.)
1.2.1 Finance
There has been an explosion in the velocity, variety, and volume (that is, the 3Vs) of
financial data, just as there has been an exponential growth of data in almost all fields, as
we saw in the previous section. Social media activity, mobile interactions, server logs, real-
time market feeds, customer service records, transaction details, and information from
existing databases combine to create a rich and complex conglomeration of information that
experts (*cough, cough*, data scientists!) must tackle.
What do financial data scientists do? Through capturing and analyzing new sources of
data, building predictive models and running real-time simulations of market events, they
help the finance industry obtain the information necessary to make accurate predictions.
Data scientists in the financial sector may also partake in fraud detection and risk
reduction. Essentially, banks and other loan sanctioning institutions collect a lot of data
about the borrower in the initial “paperwork” process. Data science practices can mini-
mize the chance of loan defaults via information such as customer profiling, past
expenditures, and other essential variables that can be used to analyze the probabilities
of risk and default. Data science initiatives even help bankers analyze a customer’s
purchasing power to more effectively try to sell additional banking products.6 Still not
convinced about the importance of data science in finance? Look no further than your
credit history, one of the most popular types of risk management services used by banks
and other financial institutions to identify the creditworthiness of potential customers.
Companies use machine learning algorithms in analyzing past spending behavior and
patterns to decide the creditworthiness of customers. The credit score, along with other
factors, including length of credit history and customer’s age, are in turn used to predict
the approximate lending amount that can be safely forwarded to the customer when
applying for a new credit card or bank loan.
Let us look at a more definitive example. Lending Club is one of the world’s largest
online marketplaces that connects borrowers with investors. An inevitable outcome of
lending that every lender would like to avoid is default by borrowers. A potential solution to
this problem is to build a predictive model from the previous loan dataset that can be used to
identify the applicants who are relatively risky for a loan. Lending Club hosts its loan dataset in its data repository (https://www.lendingclub.com/info/download-data.action), and it can also be obtained from other popular third-party data repositories.7 There are various algorithms and approaches that can be applied to create such predictive models. A simple approach to creating such a predictive model from the Lending Club loan dataset is demonstrated at KDnuggets8 if you are interested in learning more.
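To make the idea of such a predictive model concrete, here is a minimal sketch in Python using scikit-learn. It is not the KDnuggets tutorial itself, and the file name and column names (loans.csv, loan_amnt, annual_inc, dti, loan_status) are hypothetical placeholders for whatever slice of the loan data you download; the point is only the overall shape of the workflow: load the data, pick a few features, fit a classifier, and check it on held-out loans.

# A minimal loan-default classifier sketch; file and column names are placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

loans = pd.read_csv("loans.csv")                            # hypothetical extract of the loan data
X = loans[["loan_amnt", "annual_inc", "dti"]]               # a few numeric borrower features
y = (loans["loan_status"] == "Charged Off").astype(int)     # 1 = defaulted, 0 = repaid

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Accuracy on held-out loans:", accuracy_score(y_test, model.predict(X_test)))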
1.2.2 Public Policy
Simply put, public policy is the application of policies, regulations, and laws to the
problems of society through the actions of government and agencies for the good of a
citizenry. Many branches of social sciences (economics, political science, sociology, etc.)
are foundational to the creation of public policy.
Data science helps governments and agencies gain insights into citizen behaviors that
affect the quality of public life, including traffic, public transportation, social welfare,
community wellbeing, etc. This information, or data, can be used to develop plans that
address the betterment of these areas.
It has become easier than ever to obtain useful data about policies and regulations to
analyze and create insights. The following open data repositories are examples:
(1) US government (https://www.data.gov/)
(2) City of Chicago (https://data.cityofchicago.org/)
(3) New York City (https://nycopendata.socrata.com/)
As of this writing, the data.gov site had more than 200,000 data repositories on diverse
topics that anyone can browse, from agriculture to local government, to science and
research. The City of Chicago portal offers a data catalog with equally diverse topics,
organized in 16 categories, including administration and finance, historic preservation, and
sanitation. NYC OpenData encompasses datasets organized into 10 categories. Clicking on
the category City Government, for instance, brings up 495 individual results. NYC
OpenData also organizes its data by city agency, of which 94 are listed, from the
Administration for Children’s Services to the Teachers Retirement System. The data is
available to all interested parties.
A good example of using data to analyze and improve public policy decisions is the Data Science for Social Good project, in which various institutions, including Nova SBE, the Municipality of Cascais, and the University of Chicago, participate in a three-month program. The program brings together 25 data analytics experts from several countries to work on open public policy datasets and find clues to solving relevant problems with an impact on society: for example, how an NGO can use data to estimate the size of a temporary refugee camp in a war zone to organize the provision of help, or how to successfully develop and maintain systems that use data to produce social good and inform public policy. The project usually organizes new events in June of every year.9
1.2.3 Politics
Politics is a broad term for the process of electing officials who exercise the policies that
govern a state. It includes the process of getting policies enacted and the action of the
officials wielding the power to do so. Much of the financial support of government is
derived from taxes.
Recently, the real-time application of data science to politics has skyrocketed. For
instance, data scientists analyzed former US President Obama’s 2008 presidential cam-
paign success with Internet-based campaign efforts.10 In this New York Times article, the
writer quotes Arianna Huffington, editor of The Huffington Post, as saying that, without the
Internet, Obama would not have been president.
Data scientists have been quite successful in constructing the most accurate voter
targeting models and increasing voter participation.11 In 2016, the campaign to elect
Donald Trump was a brilliant example of the use of data science in social media to
tailor individual messages to individual people. As Twitter has emerged as a major
digital PR tool for politics over the last decade, studies12 analyzing the content of
tweets from both candidates’ (Trump and Hillary Clinton) Twitter handles as well as
the content of their websites found significant difference in the emphasis on traits and
issues, main content of tweet, main source of retweet, multimedia use, and the level of
civility. While Clinton emphasized her masculine traits and feminine issues in her
election campaign more than her feminine traits and masculine issues, Trump focused
more on masculine issues, paying no particular attention to his traits.
Trump used user-generated content as sources of his tweets significantly more often
than Clinton. Three-quarters of Clinton’s tweets were original content, in comparison
to half of Trump’s tweets, which were retweets of and replies to citizens. Extracting
such characteristics from data and connecting them to various outcomes (e.g., public
engagement) falls squarely under data science. In fact, later in this book we will have
hands-on exercises for collecting and analyzing data from Twitter, including extracting
sentiments expressed in those tweets.
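As a small preview of what such an analysis can look like, here is a minimal sketch of sentiment scoring in Python using NLTK’s VADER analyzer. The example strings are made up for illustration; actually collecting tweets through the Twitter API is covered in Chapter 11.

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")            # one-time download of the sentiment lexicon
sia = SentimentIntensityAnalyzer()

# Made-up example texts standing in for collected tweets.
sample_tweets = [
    "Loving the new policy proposal - a great step forward!",
    "That campaign event was a complete disaster.",
]
for tweet in sample_tweets:
    scores = sia.polarity_scores(tweet)   # 'compound' ranges from -1 (negative) to +1 (positive)
    print(f"{scores['compound']:+.2f}  {tweet}")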
Of course, we have also seen the dark side of this with the infamous Cambridge
Analytica data scandal that surfaced in March 2018.13 This data analytics firm
obtained data on approximately 87 million Facebook users from an academic
researcher in order to target political ads during the 2016 US presidential campaign.
While this case brought to public attention the issue of privacy in data, it was hardly
the first one. Over the years, we have witnessed many incidents of advertisers,
spammers, and cybercriminals using data, obtained legally or illegally, for pushing
an agenda or a rhetoric. We will have more discussion about this later when we talk
about ethics, bias, and privacy issues.
1.2.4 Healthcare
Healthcare is another area in which data scientists keep changing their research approach
and practices.14 Though the medical industry has always stored data (e.g., clinical studies,
insurance information, hospital records), the healthcare industry is now awash in an
unprecedented amount of information. This includes biological data such as gene expres-
sion, next-generation DNA sequence data, proteomics (study of proteins), and metabolo-
mics (chemical “fingerprints” of cellular processes).
While diagnostics and disease prevention studies may seem limited, we may see data
from or about a much larger population with respect to clinical data and health outcomes
data contained in ever more prevalent electronic health records (EHRs), as well as in
longitudinal drug and medical claims. With the tools and techniques available today, data
scientists can work on massive datasets effectively, combining data from clinical trials
with direct observations by practicing physicians. The combination of raw data with
necessary resources opens the door for healthcare professionals to better focus on
important, patient-centered medical quandaries, such as what treatments work and for
whom.
The role of data science in healthcare does not stop with big health service providers; it
has also revolutionized personal health management in the last decade. Personal wearable
health trackers, such as Fitbit, are prime examples of the application of data science in the
personal health space. Due to advances in miniaturizing technology, we can now collect
most of the data generated by a human body through such trackers, including information
about heart rate, blood glucose, sleep patterns, stress levels and even brain activity.
Equipped with a wealth of health data, doctors and scientists are pushing the boundaries
in health monitoring.
Since the rise of personal wearable devices, there has been an incredible amount of
research that leverages such devices to study personal health management space. Health
trackers and other wearable devices provide the opportunity for investigators to track
adherence to physical activity goals with reasonable accuracy across weeks or even months,
which was almost impossible when relying on a handful of self-reports or a small number of
accelerometry wear periods. A good example of such a study is the use of wearable sensors to measure adherence to a physical activity intervention among overweight or obese, postmenopausal women,15 which was conducted over a period of 16 weeks. The study found that, with activity-measuring trackers such as those by Fitbit, participants sustained high levels of self-monitoring over a long period. Often, even being aware of one’s level of physical activity could be instrumental in supporting or sustaining good behaviors.
Apple has partnered with Stanford Medicine16 to collect and analyze data from Apple
Watch to identify irregular heart rhythms, including those from potentially serious heart
conditions such as atrial fibrillation, which is a leading cause of stroke. Many insurance
companies have started providing free or discounted Apple Watch devices to their clients,
or have reward programs for those who use such devices in their daily life.17 The data
collected through such devices are helping clients, patients, and healthcare providers to better monitor, diagnose, and treat health conditions in ways that were not possible before.
1.2.5 Urban Planning
Many scientists and engineers have come to believe that the field of urban planning is ripe
for a significant – and possibly disruptive – change in approach as a result of the new
methods of data science. This belief is based on the number of new initiatives in “infor-
matics” – the acquisition, integration, and analysis of data to understand and improve urban
systems and quality of life.
The Urban Center for Computation and Data (UrbanCCD), at the University of Chicago,
traffics in such initiatives. The research center is using advanced computational methods to
understand the rapid growth of cities. The center brings together scholars and scientists
from the University of Chicago and Argonne National Laboratory18 with architects, city
planners, and many others.
The UrbanCCD’s director, Charlie Catlett, stresses that global cities are growing quickly
enough to outpace traditional tools and methods of urban design and operation. “The
consequences,” he writes on the center’s website,19 “are seen in inefficient transportation
networks belching greenhouse gases and unplanned city-scale slums with crippling poverty
and health challenges. There is an urgent need to apply advanced computational methods
and resources to both explore and anticipate the impact of urban expansion and find
effective policies and interventions.”
On a smaller scale, chicagoshovels.org provides a “Plow Tracker” so residents can track
the city’s 300 snow plows in real time. The site uses online tools to help organize a “Snow
Corps” – essentially neighbors helping neighbors, like seniors or the disabled – to shovel
sidewalks and walkways. The platform’s app lets travelers know when the next bus is
arriving. Considering Chicago’s frigid winters, this can be an important service. Similarly,
Boston’s Office of New Urban Mechanics created a SnowCOP app to help city managers
respond to requests for help during snowstorms. The Office has more than 20 apps designed
to improve public services, such as apps that mine data from residents’ mobile phones to
address infrastructure projects. But it is not just large cities. Jackson, Michigan, with a
population of about 32,000, tracks water usage to identify potentially abandoned homes.
The list of uses and potential uses is extensive.
1.2.6 Education
According to Joel Klein, former Chancellor of New York Public Schools, “when it comes to
the intersection of education and technology, simply putting a computer in front of a
student, or a child, doesn’t make their lives any easier, or education any better.”20
Technology will definitely have a large part to play in the future of education, but how
exactly that happens is still an open question. There is a growing realization among
educators and technology evangelists that we are heading toward more data-driven and
personalized use of technology in education. And some of that is already happening.
The Brookings Institution’s Darrell M. West opened his 2012 report on big data and
education by comparing present and future “learning environments.” According to West,
today’s students improve their reading skills by reading short stories, taking a test every other
week, and receiving graded papers from teachers. But in the future, West postulates that
students will learn to read through “a computerized software program,” the computer constantly
measuring and collecting data, linking to websites providing further assistance, and giving the
student instant feedback. “At the end of the session,” West says, “his teacher will receive an
automated readout on [students in the class] summarizing their reading time, vocabulary
knowledge, reading comprehension, and use of supplemental electronic resources.”21
So, in essence, teachers of the future will be data scientists!
Big data may be able to provide much-needed resources to various educational struc-
tures. Data collection and analysis have the potential to improve the overall state of
education. West says, “So-called ‘big data’ make it possible to mine learning information
for insights regarding student performance and learning approaches. Rather than rely on
periodic test performance, instructors can analyze what students know and what techniques
are most effective for each pupil. By focusing on data analytics, teachers can study learning
in far more nuanced ways. Online tools enable evaluation of a much wider range of student
actions, such as how long they devote to readings, where they get electronic resources, and
how quickly they master key concepts.”
1.2.7 Libraries
Data science is also frequently applied to libraries. Jeffrey M. Stanton has discussed the
overlap between the task of a data science professional and that of a librarian. In his article,
he concludes, “In the near future, the ability to fulfill the roles of citizenship will require
finding, joining, examining, analyzing, and understanding diverse sources of data […] Who
but a librarian will stand ready to give the assistance needed, to make the resources
accessible, and to provide a venue for knowledge creation when the community advocate
arrives seeking answers?”22
Mark Bieraugel echoes this view in his article on the website of the Association of
College and Research Libraries.23 Here, Bieraugel advocates for librarians to create
taxonomies, design metadata schemes, and systematize retrieval methods to make big
datasets more useful. Even though the role of data science in future libraries as suggested
here seems too rosy to be true, in reality it is nearer than you think. Imagine that Alice, a
scientist conducting research on diabetes, asks Mark, a research librarian, to help her
understand the research gap in previous literature. Armed with digital technologies,
Mark can automate literature reviews for any discipline by reducing ideas and results from
thousands of articles into a cohesive bulleted list and then apply data science algorithms,
such as network analysis, to visualize trends in emerging lines of research on similar topics.
This will make Alice’s job far easier than if she had to painstakingly read all the articles.
1.3 How Does Data Science Relate to Other Fields?
While data science has emerged as a field in its own right, as we saw before, it is often considered a subdiscipline of a field such as statistics. One could certainly study data
science as a part of one of the existing, well-established fields. But, given the nature of data-
driven problems and the momentum at which data science has been able to tackle them, a
separate slot is warranted for data science – one that is different from those well-established
fields, and yet connected to them. Let us look at how data science is similar to and different
from other fields.
1.3.1 Data Science and Statistics
Priceonomics (a San Francisco-based company that claims to “turn data into stories”) notes
that, not long ago, the term “data science” meant nothing to most people, not even to those
who actually worked with data.24 A common response to the term was: “Isn’t that just
statistics?”
Nate Silver does not seem to think data science differs from statistics. The well-known
number cruncher behind the media site FiveThirtyEight – and the guy who famously and
correctly predicted the electoral outcome of 49 of 50 states in the 2008 US Presidential
election, and a perfect 50 for 50 in 2012 – is more than a bit skeptical of the term. However,
his 2016 election prediction model was a dud. The model predicted Democratic nominee Hillary Clinton’s chance of winning the presidency at 71.4% over Republican nominee Donald Trump’s 28.6%.25 The only silver lining in his 2016 prediction was that it gave Trump a higher chance of winning the electoral college than almost anyone else.26
“I think data-scientist is a sexed up term for a statistician,” Silver told an audience of
statisticians in 2013 at the Joint Statistical Meeting.27
The difference between these two closely related fields lies in the invention and advancement of modern computers. Statistics was primarily developed to help people deal with
pre-computer “data problems,” such as testing the impact of fertilizer in agriculture, or
figuring out the accuracy of an estimate from a small sample. Data science emphasizes the
data problems of the twenty-first century, such as accessing information from large data-
bases, writing computer code to manipulate data, and visualizing data.
Andrew Gelman, a statistician at Columbia University, writes that it is “fair to consider
statistics … as a subset of data science” and probably the “least important” aspect.28 He
suggests that the administrative aspects of dealing with data, such as harvesting, processing,
storing, and cleaning, are more central to data science than is hard-core statistics.
So, how does the knowledge of these fields blend together? Statistician and data
visualizer Nathan Yau of Flowing Data suggests that data scientists should have at least
three basic skills:29
1. A strong knowledge of basic statistics (see Chapter 3) and machine learning (see
Chapters 8–10) – or at least enough to avoid misinterpreting correlation for causation
or extrapolating too much from a small sample size.
2. The computer science skills to take an unruly dataset and use a programming language
(like R or Python, see Chapters 5 and 6) to make it easy to analyze.
3. The ability to visualize and express their data and analysis in a way that is meaningful to
somebody less conversant in data (see Chapters 2 and 11).
As you can see, this book that you are holding has you covered for most, if not all, of these
basic skills (and then some) for data science.
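To give a flavor of skill 2 in the list above, here is a minimal, self-contained sketch in Python of tidying a small “unruly” dataset with the pandas library. The tiny table is invented for illustration; Chapter 5 covers this kind of work properly.

import pandas as pd

# A tiny messy dataset typed in directly: inconsistent capitalization,
# a missing value, and a numeric column stored as text.
raw = pd.DataFrame({
    "city":  ["Seattle", "seattle", "Chicago", None],
    "sales": ["120", "95", "210", "87"],
})

clean = (
    raw.dropna(subset=["city"])                             # drop rows with no city
       .assign(city=lambda d: d["city"].str.title(),        # normalize capitalization
               sales=lambda d: d["sales"].astype(int))      # convert text to numbers
)
print(clean.groupby("city")["sales"].mean())                # now the data is easy to summarize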
1.3.2 Data Science and Computer Science
Perhaps this seems like an obvious application of data science, but computer science
involves a number of current and burgeoning initiatives that involve data scientists.
Computer scientists have developed numerous techniques and methods, such as (1) data-
base (DB) systems that can handle the increasing volume of data in both structured and
unstructured formats, expediting data analysis; (2) visualization techniques that help people
make sense of data; and (3) algorithms that make it possible to compute complex and
heterogeneous data in less time.
In truth, data science and computer science overlap and are mutually supportive. Some of
the algorithms and techniques developed in the computer science field – such as machine
learning algorithms, pattern recognition algorithms, and data visualization techniques –
have contributed to the data science discipline.
Machine learning is certainly a very crucial part of data science today, and it is hard to do
meaningful data science in most domains without at least basic knowledge of machine
learning. Fortunately for us, the third part of this book is dedicated to machine learning.
While we will not go into as much theoretical depth as a computer scientist would, we are
going to see many of the popular and useful machine learning algorithms and techniques
applied to various data science problems.
In general, we can say that the main goal of “doing business” is turning a profit – even with
limited resources – through efficient and sustainable manufacturing methods, effective service models, and so on. This demands decision-making based on objective evaluation, for
which data analysis is essential.
Whether it concerns companies or customers, data related to business is increasingly
cheap (easy to obtain, store, and process) and ubiquitous. In addition to the traditional types
of data, which are now being digitized through automated procedures, new types of data
from mobile devices, wearable sensors, and embedded systems are providing businesses
with rich information. New technologies have emerged that seek to help us organize and
understand this increasing volume of data. These technologies are employed in business
analytics.
Business analytics (BA) refers to the skills, technologies, and practices for continuous
iterative exploration and investigation of past and current business performance to gain
insight and be strategic. BA focuses on developing new perspectives and making sense of
performance based on data and statistics. And that is where data science comes in. To fulfill
the requirements of BA, data scientists are needed for statistical analysis, including
explanatory and predictive modeling and fact-based management, to help drive successful
decision-making.
There are four types of analytics, each of which holds opportunities for data scientists in
business analytics:30
1. Decision analytics: supports decision-making with visual analytics that reflect
reasoning.
2. Descriptive analytics: provides insight from historical data with reporting, score cards,
clustering, etc.
3. Predictive analytics: employs predictive modeling using statistical and machine learning
techniques.
4. Prescriptive analytics: recommends decisions using optimization, simulation, etc.
We will revisit these in Chapter 3.
It may sound weird that social science, which began almost four centuries ago and was
primarily concerned with society and relationships among individuals, has anything to do
with data science. Enter the twenty-first century, and not only is data science helping social
science, but it is also shaping it, even creating a new branch called computational social
science.
Since its inception, social science has spread into many branches, including but not
limited to anthropology, archaeology, economics, linguistics, political science, psychology,
public health, and sociology. Each of these branches has established its own standards,
procedures, and modes of collecting data over the years. But connecting theories or results
from one discipline to another has become increasingly difficult. This is where computa-
tional social science has revolutionized social science research in the last few decades. With
the help of data science, computational social science has connected results from multiple
disciplines to explore the key urgent question: how will the information revolution in this
digital age transform society?
Since its inception, computational social science has made tremendous strides in gen-
erating arrays of interdisciplinary projects, often in partnership with computer scientists,
statisticians, mathematicians, and lately with data scientists. Some of these projects include
leveraging tools and algorithms of prediction and machine learning to assist in tackling
stubborn policy problems. Others entail applying recent advances in image, text, and
speech recognition to classic issues in social science. These projects often demand meth-
odological breakthroughs, scaling proven methods to new levels, as well as designing new
metrics and interfaces to make research findings intelligible to scholars, administrators, and
policy-makers who may lack computational skill but have domain expertise.
After reading the above paragraph, if you think computational social science has only
borrowed from data science but has nothing to return, you would be wrong. Computational
social science raises inevitable questions about the politics and ethics often embedded in
data science research, particularly when it is based on sociopolitical problems with real-life
applications that have far-reaching consequences. Government policies, people’s mandates
in elections, and hiring strategies in the private sector are prime examples of such
applications.
While this book is broad enough to be useful for anyone interested in data science, some
aspects are targeted at people interested in or working in information-intensive domains.
These include many contemporary jobs that are known as “knowledge work,” such as those
in healthcare, pharmaceuticals, finance, policy-making, education, and intelligence. The
field of information science, which stems from computing, computational science, informatics, information technology, or library science, often represents and serves such
application areas. The core idea here is to cover people studying, accessing, using, and
producing information in various contexts. Let us think about how data science and
information science are related.
Data is everywhere. Yes, this is the third time I am stating this in this chapter, but this
point is that important. Humans and machines are constantly creating new data. Just as
natural science focuses on understanding the characteristics and laws that govern natural
phenomena, data scientists are interested in investigating the characteristics of data –
looking for patterns that reveal how people and society can benefit from data. That
perspective often misses the processes and people behind the data, as most researchers
and professionals see data from the system side and subsequently focus on quantifying
phenomena; they lack an understanding of the users’ perspective. Information scientists,
who look at data in the context they are generated and used, can play an important role that
bridges the gap between quantitative analysis and an examination of data that tells a story.
In an FYI box earlier, we alluded to some connections and differences between data and
information. Depending on who you consult, you will get different answers – ranging from seemingly clear differences to a completely blurred line between data and information. To make matters
worse, people often use one to mean the other. A traditional view used to be that data is
something raw, meaningless, an object that, when analyzed or converted to a useful form,
becomes information. Information is also defined as “data that are endowed with meaning
and purpose.”31
For example, the number “480,000” is a data point. But when we add an explanation that
it represents the number of deaths per year in the USA from cigarette smoking,32 it becomes
information. But in many real-world scenarios, the distinction between a meaningful and a
meaningless data point is not clear enough for us to differentiate data and information. And
therefore, for the purpose of this book, we will not worry about drawing such a line. At the
same time, since we are introducing various concepts in this chapter, it is useful for us to at
least consider how they are defined in various conceptual frameworks.
Let us take one such example. The Data, Information, Knowledge, and Wisdom (DIKW)
model differentiates the meaning of each concept and suggests a hierarchical system among
them.33 Although various authors and scholars offer several interpretations of this model,
the model defines data as (1) fact, (2) signal, and (3) symbol. Here, information is
differentiated from data in that it is “useful.” Unlike conceptions of data in other dis-
ciplines, information science demands and presumes a thorough understanding of informa-
tion, considering different contexts and circumstances related to the data that is created,
generated, and shared, mostly by human beings.
Studies in information science have focused on the human side of data and information, in
addition to the system perspective. While the system perspective typically supports users’ ability to observe, analyze, and interpret the data, the human perspective allows them to turn the data
into useful information for their purposes. Different users may not agree on a piece of
information’s relevancy depending on various factors that affect judgment, such as
“usefulness.”34 Usefulness is a criterion that captures how useful the interaction between the user and the information object (data) is in accomplishing the user’s task or goal. For example, a general user who wants to figure out if drinking coffee is injurious
to health may find information in the search engine result pages (SERP) to be useful,
whereas a dietitian who needs to decide whether it is OK to recommend that a patient consume coffee may find the same results in the SERP worthless. Therefore, operationalization of the
criterion of usefulness will be specific to the user’s task.
Scholars in information science tend to combine the user side and the system side to
understand how and why data is generated and what information it conveys, given a
context. This is often then connected to studying people’s behaviors. For instance, informa-
tion scientists may collect log data of one’s browser activities to understand one’s search
behaviors (the search terms they use, the results they click, the amount of time they spend
on various sites, etc.). This could allow them to create better methods for personalization
and recommendation.
There are several advantages to studying data science in information schools, or iSchools.
Data science provides students a more nuanced understanding of individual, community,
and society-wide phenomena. Students may, for instance, apply data collected from a
particular community to enhance that locale’s wellbeing through policy change and/or
urban planning. Essentially, an iSchool curriculum helps students acquire diverse perspec-
tives on data and information. This becomes an advantage as students transition into full-
fledged data scientists with a grasp on the big (data) picture. In addition to all the required
data science skills and knowledge (including understanding computer science, statistics,
machine learning, etc.), the focus on the human factor gives students distinct opportunities.
An iSchool curriculum also provides a depth of contextual understanding of information.
Studying data science in an iSchool offers unique chances to understand data in contexts
including communications, information studies, library science, and media research. The
difference between studying data science in an iSchool, as opposed to within a computer
science or statistics program, is that the former tends to focus on analyzing data and
extracting insightful information grounded in context. This is why the study of “where information comes from” is just as important as “what it represents” and “how it can be
turned into a valuable resource in the creation of business and information technology
strategies.” For instance, in the case of analyzing electronic health records, researchers at
iSchools are additionally interested in investigating how corresponding patients perceive
and seek health-related information and support from both professionals and peers. In short,
if you are interested in combining the technical with the practical, as well as the human, you
would be right at home in an iSchool’s data science department.
Many skills are considered “basic” for everyone. These include reading, writing, and
thinking. It does not matter what gender, profession, or discipline one belongs to; one
should have all these abilities. In today’s world, computational thinking is becoming an
essential skill, not reserved for computer scientists only.
What is computational thinking? Typically, it means thinking like a computer scientist.
But that is not very helpful, even to computer scientists! According to Jeannette Wing,35
“Computational thinking is using abstraction and decomposition when attacking a large
complex task or designing a large complex system” (p. 33). It is an iterative process based
on the following three stages:
Figure 1.2 Three-stage process describing computational thinking. From Repenning, A., Basawapatna, A., & Escherle, N.
(2016). Computational thinking tools. In 2016 IEEE Symposium on Visual Languages and Human-Centric Computing
(VL/HCC) (pp. 218–222), September.
What did we just do? We broke down a complex problem (looking through 10 numbers) into a set of
small problems (comparing two numbers at a time). This process is called decomposition, which refers to
identifying small steps to solve a large problem.
More than that, we derived a process that could be applied to not just 10 numbers (which is not that
complex), but to 100 numbers, 1000 numbers, or a billion numbers! This is called abstraction and
generalization. Here, abstraction refers to treating the actual object of interest (10 numbers) as a series of
numbers, and generalization refers to being able to devise a process that is applicable to the abstracted
quantity (a series of numbers) and not just the specific objects (the given 10 numbers).
And there you have an example of computational thinking. We approached a problem to find a solution
using a systematic process that can be expressed using clear, feasible computational steps. And that is all.
You do not need to know any programming language to do this. Sure, you could write a computer program
to carry out this process (an algorithm). But here, our focus is on the thinking behind this.
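If you are curious what such an algorithm might look like, here is a minimal Python sketch – one possible way of expressing the process, using the same ten numbers that appear in this example:

# A minimal sketch of the decomposition described above: compare two numbers
# at a time, keeping track of the larger one, until the list is exhausted.
numbers = [7, 24, 62, 11, 4, 39, 42, 5, 97, 54]

largest = numbers[0]
for n in numbers[1:]:
    if n > largest:      # one small comparison at a time
        largest = n

print(largest)           # 97

The same short loop works unchanged for 100 numbers or a billion numbers, which is exactly the abstraction and generalization described above.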
Let us take one more step with the previous example. Assume you are interested not just in the largest
number, but also the second largest, third largest, and so on. One way to do this is to sort the numbers in
some (increasing or decreasing) order. It looks easy when you have such a small set of numbers. But
imagine you have a huge unsorted shelf of books that you want to alphabetize. Not only is this a tougher
problem than the previous one, but it becomes increasingly challenging as the number of items increases.
So, let us step back and try to think of a systematic approach.
A natural way to solve the problem would be to scan the shelf and look for out-of-order pairs – for instance, Rowling, J. K., followed by Lee, Stan – and flip them around. After flipping an out-of-order pair, continue your scan of the rest of the shelf, and start again at the beginning of the shelf each time you reach the end, until you make a complete pass without finding a single out-of-order pair on the entire
shelf. That will get your job done. But depending on the size of your collection and how unordered the
books are at the beginning of the process, it will take a lot of time. It is not a very efficient tactic.
Here is an alternative approach. Let us pick any book at random, say Lee, Stan, and reorder the shelf so
that all the books that are earlier (letters to the left of “L” in the dictionary, A–K) than Lee, Stan, are on the
left-hand side of it, and the later ones (M–Z) are on the right. At the end of this step, Lee, Stan, is in its final position, probably near the middle. Next, you perform the same steps on the subshelf of books on the left, and separately on the subshelf of books on the right. Continue this effort until every book is in its
final position, and thus the shelf is sorted.
Now you might be wondering, what is the easiest way to sort the subshelves? Let us take the same set
of numbers from the last example and see how it works. Assume that you have picked the first number, 7,
as the chosen one. So, you want all the numbers that are smaller than 7 on the left-hand side of it and the
larger ones on the right. You can start by assuming 7 is the lowest number in the queue and therefore its
final position will be first, in its current position. Now you compare the rest of the numbers with 7 and
adjust its position accordingly. Let us start at the beginning. You have 24 at the beginning of the rest,
which is larger than 7. Therefore, the tentative position of 7 remains at the beginning. Next is 62, which is, again, larger than 7; therefore, there is no change in the tentative position of 7. The same goes for the next number, 11.
Next, the comparison is between 4 and 7. Unlike the previous three numbers, 4 is smaller than 7. Here,
your assumption of 7 as the smallest number in the queue is rendered incorrect. So, you need to readjust
your assumption of 7 from smallest to second smallest.
Here is how to perform the readjustment. First, you have to switch the place of 4 and the number in
second position, 24. As a result the queue becomes 7, 4, 62, 11, 24, 39, 42, 5, 97, 54. And the tentative
position of 7 has shifted to the second position, right after 4, making the queue 4, 7, 62, 11, 24, 39, 42, 5,
97, 54.
Now you might be thinking: why not swap 7 and 4 instead of 24 and 4? The reason is that you
started with the assumption that 7 is the smallest number in the queue. And so far during comparisons you
have found just one violation of the assumption; that is, with 4. Therefore, it is logical that at the end of
the current comparison you will adjust your assumption to 7 as the second smallest element and 4 as the
smallest one, which is reflected by the current queue.
Moving on with the comparisons, the next numbers in the queue are 39 and 42, both of which are larger than 7, so there is no change in our assumption. The next number is 5, which is, again, smaller than 7. So, you follow the same drill as you did with 4: swap the third element of the queue with 5 to readjust your assumption, so that 7 is now the third smallest element in the queue, and continue the process until you reach the end of the queue. At the end of this step, your queue is transformed into 4, 5, 7, 11, 24, 39, 42, 62, 97, 54, and the initial assumption has evolved, as now 7 is the third smallest number in the queue. So now, 7 has been placed in its final position. Notice that all the elements to the left of 7 (4 and 5) are smaller than 7, and the larger ones are on the right.
If you now perform the same set of steps on the numbers to the left of 7 and, separately, on the numbers to its right, every number will fall into the right place and you will have a perfectly ordered list of ascending numbers.
Once again, a nice characteristic that all these approaches share is that the process for finding a solution
is clear, systematic, and repeatable, regardless of the size of the input (number of numbers or books). That
is what makes it computationally feasible.
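If you would like to see how the second (pick-an-item-and-partition) approach can be expressed in code, here is a minimal Python sketch; it is just one way to write down the idea, which computer scientists know as quicksort, and is not meant as a polished implementation:

# Pick an item, put everything "earlier" than it on the left and everything
# "later" on the right, then repeat the same steps on each side.
def sort_items(items):
    if len(items) <= 1:
        return items                                # nothing left to order
    pivot = items[0]                                # e.g., Lee, Stan, or the number 7
    left = [x for x in items[1:] if x < pivot]      # earlier than the pivot
    right = [x for x in items[1:] if x >= pivot]    # later than the pivot
    return sort_items(left) + [pivot] + sort_items(right)

print(sort_items([7, 24, 62, 11, 4, 39, 42, 5, 97, 54]))
# [4, 5, 7, 11, 24, 39, 42, 54, 62, 97]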
Now that you have seen these examples, try finding more problems around you and see if
you can practice your computational thinking by devising solutions in this manner. Below
are some possibilities to get you started.
3. Strategize your meetings with potential employers at a job fair so that you can optimize connecting
with both high-profile companies (long lines) and startups (short lines).
By now, hopefully you are convinced that: (1) data science is a flourishing and fantastic field; (2) it is virtually everywhere; and (3) perhaps you want to pursue it as a career! OK, maybe you are still pondering the last one, but if you are convinced about the first two and still holding this book, you may be at least curious about what you should have in your toolkit to be a data scientist. Let us look carefully at what data scientists are, what they do, and what kinds of skills one may need to make one’s way in and through this field.
One Twitter quip36 about data scientists captures their skill set particularly well: “Data
Scientist (n.): Person who is better at statistics than any software engineer and better at
software engineering than any statistician.”
In her Harvard Business Review article,37 noted academic and business executive
Jeanne Harris listed some skills that employers expect from data scientists: willingness to experiment, proficiency in mathematical reasoning, and data literacy. We will explore
these concepts in relation to what business professionals are seeking in a potential
candidate and why.
1. Willingness to Experiment. A data scientist needs to have the drive, intuition, and curiosity
not only to solve problems as they are presented, but also to identify and articulate
problems on her own. Intellectual curiosity and the ability to experiment require an
amalgamation of analytical and creative thinking. To explain this from a more technical
perspective, employers are seeking applicants who can ask questions to define intelli-
gent hypotheses and to explore the data utilizing basic statistical methods and models.
Harris also notes that employers incorporate questions in their application process to
determine the degree of curiosity and creative thinking of an applicant – the purpose of
these questions is not to elicit a specific correct answer, but to observe the approach and
techniques used to discover a possible answer. “Hence, job applicants are often asked
questions such as ‘How many golf balls would fit in a school bus?’ or ‘How many sewer
covers are there in Manhattan?’.”
2. Proficiency in Mathematical Reasoning. Mathematical and statistical knowledge is the
second critical skill for a potential applicant seeking a job in data science. We are not
suggesting that you need a Ph.D. in mathematics or statistics, but you do need to have a
strong grasp on the basic statistical methods and how to employ them. Employers are
seeking applicants who can demonstrate their ability in reasoning, logic, interpreting
data, and developing strategies to perform analysis. Harris further notes that,
“interpretation and use of numeric data are going to be increasingly critical in business
practices. As a result, an increasing trend in hiring for most companies is to check if
applicants are adept at mathematical reasoning.”
3. Data Literacy. Data literacy is the ability to extract meaningful information from a
dataset, and any modern business has a collection of data that needs to be interpreted. A
skilled data scientist plays an intrinsic role for businesses through the ability to assess a dataset for relevance and suitability for the purpose of interpretation, to perform analysis, and to create meaningful visualizations that tell valuable data stories. Harris
observes that “data literacy training for business users is now a priority. Managers are
being trained to understand which data is suitable, and how to use visualization and
simulation to process and interpret it.” Data-driven decision-making is a driving force
for innovation in business, and data scientists are integral to this process. Data literacy is
an important skill, not just for data scientists, but for all. Scholars and educators have
started arguing that, similar to the abilities of reading and writing that are essential in any
educational program, data literacy is a basic, fundamental skill, and should be taught to
all. More on this can be found in the FYI box that follows.
In another view, Dave Holtz blogs about the specific skill sets desired for various positions to which a data scientist may apply. He lists four basic types of data science jobs:38
1. A Data Scientist Is a Data Analyst Who Lives in San Francisco! Holtz notes that, for
some companies, a data scientist and a data analyst are synonymous. These roles are
typically entry-level and will work with pre-existing tools and applications that require the basic skills to retrieve, wrangle, and visualize data. These digital tools may include
MySQL databases and advanced functions within Excel such as pivot tables and basic
data visualizations (e.g., line and bar charts). Additionally, the data analyst may perform
the analysis of experimental testing results or manage other pre-existing analytical
toolboxes such as Google Analytics or Tableau. Holtz further notes that, “jobs such as
these are excellent entry-level positions, and may even allow a budding data scientist to
try new things and expand their skillset.”
2. Please Wrangle Our Data! Companies will discover that they are drowning in data and
need someone to develop a data management system and infrastructure that will house
the enormous (and growing) dataset, and create access to perform data retrieval and
analysis. “Data engineer” and “data scientist” are the typical job titles you will find
associated with this type of required skill set and experience. In these scenarios, a
candidate will likely be one of the company’s first data hires and thus this person should
be able to do the job without significant statistics or machine-learning expertise. A data
scientist with a software engineering background might excel at a company like this, where it is more important that they make meaningful data-related contributions to the production code and provide basic insights and analyses. Mentorship opportunities for
junior data scientists may be less plentiful at a company like this. As a result, an
associate will have great opportunities to shine and grow via trial by fire, but there
will be less guidance and a greater risk of flopping or stagnating.
3. We Are Data. Data Is Us. There are a number of companies whose data (or data analysis platform) is their product. These environments offer intense data analysis
or machine learning opportunities. Ideal candidates will likely have a formal mathe-
matics, statistics, or physics background and hope to continue down a more academic
path. Data scientists at these types of firms would focus more on producing data-driven
products than answering operational corporate questions. Companies that fall into this
group include consumer-facing organizations with massive amounts of data and com-
panies that offer a data-based service.
4. Reasonably Sized Non-Data Companies Who Are Data-Driven. This category includes many modern businesses. This type of role involves joining an established team of other data scientists. The company evaluates data but is not entirely focused on data. Its data scientists perform analysis, touch production code, visualize data, etc.
These companies are either looking for generalists or they are looking to fill a specific
niche where they feel their team is lacking, such as data visualization or machine
learning. Some of the more important skills when interviewing at these firms are
familiarity with tools designed for “big data” (e.g., Hive or Pig), and experience with
messy, real-life datasets.
These skills are summarized in Figure 1.3.
Figure 1.3 The four types of data science jobs – “A Data Scientist Is a Data Analyst Who Lives in San Francisco,” “Please Wrangle Our Data!,” “We Are Data. Data Is Us.,” and “Reasonably Sized Non-Data Companies Who Are Data-Driven” – compared on the skills they emphasize: basic tools, software engineering, statistics, machine learning, and data munging.39
Since we have not yet covered any programming, statistics, or data science techniques, we are going to follow a very simple process and walk
through an easy example. Eventually, as you develop a stronger technical background and understand the
ins and outs of data science methods, you will be able to tackle problems with bigger datasets and more
complex analyses.
For this example, we will use the dataset of average heights and weights for American women available
from OA 1.1.
This file is in comma-separated values (CSV) format – something that we will revisit in the next
chapter. For now, go ahead and download it. Once downloaded, you can open this file in a spreadsheet
program such as Microsoft Excel or Google Sheets.
For your reference, this data is also provided in Table 1.1. As you can see, the dataset contains a sample
of 15 observations. Let us consider what is present in the dataset. At the first look, it is clear that the data is
already sorted – both the height and weight numbers range from small to large. That makes it easier to
see the boundaries of this dataset – height ranges from 58 to 72, and weight ranges from 115 to 164.
Next, let us consider averages. We can easily compute average height by adding up the numbers in the
“Height” column and dividing by 15 (because that is how many observations we have). That yields a value
of 65. In other words, we can conclude that the average height of an American woman is 65 inches, at least
according to these 15 observations. Similarly, we can compute the average weight – 136 pounds in this
case.
The dataset also reveals that an increase in height correlates with an increase in weight. This may be clearer using a visualization. If you know any kind of spreadsheet program (e.g., Microsoft Excel, Google Sheets), you can easily generate a plot of the values. Figure 1.4 provides an example. Look at the curve. As we move from left to right (Height), the line increases in value (Weight).
Table 1.1 Average heights and weights for American women.
# Height (inches) Weight (pounds)
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
6 63 129
7 64 132
8 65 135
9 66 139
10 67 142
11 68 146
12 69 150
13 70 154
14 71 159
15 72 164
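If you would rather work in Python than in a spreadsheet, here is a minimal sketch – just one way of doing it – that computes the two averages discussed above and draws a plot very much like Figure 1.4 from the Table 1.1 values (it assumes the matplotlib package is installed):

# Using the values from Table 1.1: compute the averages and plot weight vs. height.
import matplotlib.pyplot as plt

heights = [58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72]
weights = [115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164]

print(sum(heights) / len(heights))   # 65.0
print(sum(weights) / len(weights))   # about 136.7, which the text rounds to 136

plt.plot(heights, weights, marker="o")
plt.xlabel("Height")
plt.ylabel("Weight")
plt.show()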
Now, let us ask a question: On average, how much increase can we expect in weight with an increase of
one inch in height?
Think for a moment how you would address this question.
Do not proceed until you have figured out a solution yourself.
A simple method is to compute the differences in height (72 − 58 = 14 inches) and weight (164 − 115 =
49 pounds), then divide the weight difference by the height difference, that is, 49/14, leading to 3.5. In other
words, we see that, on average, one inch of height difference leads to a difference of 3.5 pounds in weight.
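The same back-of-the-envelope calculation, expressed in a couple of lines of Python:

# Total change in weight divided by total change in height across the dataset.
height_range = 72 - 58       # 14 inches
weight_range = 164 - 115     # 49 pounds

print(weight_range / height_range)   # 3.5 pounds per inch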
If you want to dig deeper, you may discover that the weight change with respect to the height change
is not that uniform. On average, an increase of an inch in height results in an increase of less than 3 pounds
in weight for height between 58 and 65 inches (remember that 65 inches is the average). For values of
height greater than 65 inches, weight increases more rapidly (by 4 pounds mostly until 70 inches, and
5 pounds for more than 70 inches).
Here is another question: What would you expect the weight to be of an American woman who is
57 inches tall? To answer this, we will have to extrapolate the data we have. We know from the previous
paragraph that in the lower range of height (less than the average of 65 inches), with each inch of height
change, weight changes by about 3 pounds. Given that we know for someone who is 58 inches in height,
the corresponding weight is 115 pounds; if we deduct an inch from the height, we should deduct 3 pounds
from the weight. This gives us the answer (or at least our guess), 112 pounds.
What about the end of the data with the larger values for weight and height? What would you expect
the weight of someone who is 73 inches tall to be?
Figure 1.4 Visualization of height vs. weight data (Height on the horizontal axis, Weight on the vertical axis).
The correct estimate is 169 pounds. Students should verify this answer.
More than the answer, what is important is the process. Can you explain that to someone? Can you
document it? Can you repeat it for the same problem but with different values, or for similar problems, in the
future? If the answer to these questions is “yes,” then you just practiced some science. Yes, it is important for
us not only to solve data-driven problems, but to be able to explain, verify, and repeat that process.
And that, in short, is what we are going to do in data science.
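To make that last point concrete, here is one way – a sketch, not the only way – to capture the extrapolation process we just followed as a small, repeatable piece of Python, using the per-inch rates observed at the two ends of Table 1.1:

# Start from the nearest observed data point and step outward using the
# per-inch rate seen at that end of the data (about 3 pounds per inch at the
# short end, about 5 pounds per inch at the tall end).
def estimate_weight(height):
    if height < 58:
        return 115 - 3 * (58 - height)     # extrapolate below the observed range
    if height > 72:
        return 164 + 5 * (height - 72)     # extrapolate above the observed range
    raise ValueError("Height is inside the observed range; look it up in Table 1.1.")

print(estimate_weight(57))   # 112
print(estimate_weight(73))   # 169

The process is now documented, explainable, and repeatable for other heights – exactly the kind of verifiability described above.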
A couple of sections ago, we discussed what kind of skills one needs to have to be a
successful data scientist. We also know by now that a lot of what data scientists do involves
processing data and deriving insights. An example was given above, along with a hands-on
practice problem. These things should at least give you an idea of what you may expect to
do in data science. Going forward, it is important that you develop a solid foundation in
statistical techniques (covered in Chapter 3) and computational thinking (covered in an
earlier section). And then you need to pick up a couple of programing and data processing
tools. A whole section of this book is devoted to such tools (Part II) and covers some of the
most used tools in data science – Python, R, and SQL. But let us quickly review these here
so we understand what to expect when we get to those chapters.
Let me start by noting that there are no special tools for doing data science; there just
happen to be some tools that are more suitable for the kind of things one does in data
science. And so, if you already know some programming language (e.g., C, Java, PHP) or a
scientific data processing environment (e.g., Matlab), you could use them to solve many or
most of the problems and tasks in data science. Of course, as you go through this book, you will also find that Python or R can generate a graph with one line of code – something that could take you a lot more effort in C or Java. In other words, while Python and R were not
specifically designed for people to do data science, they provide excellent environments for
quick implementation, visualization, and testing for most of what one would want to do in
data science – at least at the level in which we are interested in this book.
Python is a scripting language. This means that programs written in Python do not need to be compiled as a whole, as you would do with a program in C or Java; instead, a Python program runs line by line. The language (its syntax and structure) also provides a very easy learning curve for the beginner, while giving very powerful tools to advanced programmers.
Let us see this with an example. If you want to write the classic “Hello, World” program
in Java, here is how it goes:
Step 1: Write the code and save it as HelloWorld.java.
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, World");
    }
}
Step 2: Compile the code.
% javac HelloWorld.java
Step 3: Run the program.
% java HelloWorld
This should display “Hello, World” on the console. Do not worry if you have never done
Java (or any) programming before and all this looks confusing. I hope you can at least see
that printing a simple message on the screen is quite complicated (we have not even done
any data processing!).
In contrast, here is how you do the same in Python:
Step 1: Write the code and save as hello.py
print("Hello, World")
Step 2: Run the program.
% python hello.py
Again, do not worry about actually trying this now. We will see detailed instructions in Chapter 5.
For now, at least you can appreciate how easy it is to code in Python. And if you want to accomplish the same in R, you type the same thing – print("Hello, World") – in the R console.
Both Python and R offer a very easy introduction to programming, and even if you have
never done any programming before, it is possible to start solving data problems from day 1 of
using either of these. Both of them also offer plenty of packages that you can import or call into
them to accomplish more complex tasks such as machine learning (see Part III of this book).
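As a small taste of what importing such packages looks like, here is a hedged sketch using pandas and scikit-learn – two packages commonly used from Python for data handling and machine learning – applied to a few of the height–weight observations. Treat it as a preview, not as something you need to understand yet:

# Import two packages and fit a simple model relating height to weight.
import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.DataFrame({"height": [58, 63, 68, 72], "weight": [115, 129, 146, 164]})
model = LinearRegression().fit(data[["height"]], data["weight"])

print(model.predict(pd.DataFrame({"height": [65]})))   # about 138, close to the observed 135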
Most of the time in this book, we will see data available to us in simple text files formatted as CSV (comma-separated values), which we can load into a Python or R environment. However, such an approach has a major limit – the data we can store in a file or load into a computer’s memory cannot exceed a certain size. In such cases (and for some other reasons), we may need better data storage in the form of an SQL (Structured Query Language) database. The database field is very rich, with lots of tools, techniques, and methods for addressing all kinds of data problems. We will, however, limit ourselves to working with SQL databases through Python or R, primarily so that we can work with large and remote datasets.
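As a preview of what “working with an SQL database through Python” means, here is a minimal sketch using Python’s built-in sqlite3 module; the later chapters use MySQL, but the basic idea – sending SQL statements from your code – is the same. The file and table names here are just illustrative.

# Create a small database, insert a record, and query it with SQL.
import sqlite3

conn = sqlite3.connect("example.db")      # creates the file if it does not exist
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS women (height INTEGER, weight INTEGER)")
cur.execute("INSERT INTO women VALUES (?, ?)", (58, 115))
conn.commit()

for row in cur.execute("SELECT * FROM women WHERE height < 60"):
    print(row)                            # (58, 115)

conn.close()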
In addition to these top three most used tools for data science (see Appendix G), we will
also skim through basic UNIX. Why? Because a UNIX environment allows one to solve
many data problems and day-to-day data processing needs without writing any code. After
all, there is no perfect tool that could address all our data science needs or meet all of our
preferences and constraints. And so, we will pick up several of the most popular tools in
data science in this book, while solving data problems using a hands-on approach.
This chapter (and this book) may give an impression that data science is all good, that it is
the ultimate path to solve all of society’s and the world’s problems. First of all, I hope you
do not buy such exaggerations. Second, even at its best, data science – and, in general, anything that deals with data or employs data analysis using statistical or computational techniques – raises several issues that should concern us all: as users or producers of data,
or as data scientists. Each of these issues is big and serious enough to warrant their own
separate books (and such books exist), but lengthy discussions will be beyond the scope of
this book. Instead, we will briefly mention these issues here and call them out at different
places throughout this book when appropriate.
Many of the issues related to privacy, bias, and ethics can be traced back to the origin of
the data. Ask – how, where, and why was the data collected? Who collected it? What did
they intend to use it for? More important, if the data was collected from people, did these
people know that: (1) such data was being collected about them; and (2) how the data would
be used? Often, those collecting data mistake the availability of data for the right to use that data.
For instance, just because data on a social media service such as Twitter is available on the
Web, it does not mean that one could collect and sell it for material gain without the consent
of the users of that service. In April 2018, a case surfaced in which a data analytics firm, Cambridge Analytica, had obtained data about a large number of Facebook users to use for
political campaigning. Those Facebook users did not even know that: (1) such data about
them was collected and shared by Facebook to third parties; and (2) the data was used to
target political ads to them. This incident shed light on something that was not really new;
for many years, various companies such as Facebook and Google have collected enormous
amounts of data about and from their users in order not only to improve and market their
products, but also to share and/or sell it to other entities for profit. Worse, most people don’t
know about these practices. As the old saying goes, “there is no free lunch.” So, when you
are getting an email service or a social media account for “free,” ask why? As it is often
understood, “if you are not paying for it, you are the product.” Sure enough, for Facebook,
each user is worth $158. Equivalent values for other major companies are: $182/user for
Google and $733/user for Amazon.40
There have been many cases throughout our digital history where data about users has been intentionally or unintentionally exposed or shared in ways that caused various levels of harm to those users. And this is just the tip of the iceberg in terms of ethical or privacy violations.
What we are often not aware of is how even ethically collected data could be highly
biased. And if a data scientist is not careful, such inherent bias in the data could show up in
the analysis and the insights developed, often without anyone actively noticing it.
Many data and technology companies are trying to address these issues, often with very
little to no success. But it is admirable that they are trying. And while we, too, may not fully succeed at fending off biases and prejudices or at being completely fair, we need to try. So,
as we proceed in this book with data collection and analysis methods, keep these issues at
the back of your mind. And, wherever appropriate, I will present some pointers in FYI
boxes, such as the one below.
FYI: Fairness
Understanding the gravity of ethics in practicing data analytics, Google, a company that has thrived during
the last two decades guided by machine learning, recently acknowledged the biases in traditional machine
learning approaches in one of its blog posts. You can read more about this announcement here: https://
developers.google.com/machine-learning/fairness-overview/.
In this regard, computational social science has a long way to go to adequately deal with ordinary human
biases. Just as with the field of genomics, to which computational social sciences has often been compared, it
may well take a generation or two before researchers combine high-level competence in data science with
equivalent expertise in anthropology, sociology, political science, and other social science disciplines.
A community called Fairness, Accountability, and Transparency (FAT) has emerged in recent years that is trying to address some of these issues, or at least shed light on them. This community, thankfully, includes scholars from the fields of data science, machine learning, artificial intelligence, education, information science, and several branches of the social sciences.
This is a very important topic in data science and machine learning, and, therefore, we will continue
discussions throughout this book at appropriate places with such FYI boxes.
Summary
Data science is new in some ways and not new in other ways. Many would argue that
statisticians have already been doing a lot of what today we consider data science. On the
other hand, we have an explosion of data in every sector, with data varying a great deal in its
nature, format, size, and other aspects. Such data has also become substantially more
important in our daily lives – from connecting with our friends and family to doing
business. New problems and new opportunities have emerged and we have only scratched
the surface of possibilities. It is not enough to simply solve a data problem; we also need to
create new tools, techniques, and methods that offer verifiability, repeatability, and general-
izability. This is what data science covers, or at least is meant to cover. And that’s how we
are going to present data science in this book.
The present chapter provided several views on how people think and talk about data
science, how it affects or is connected to various fields, and what kinds of skills a data
scientist should have.
Using a small example, we practiced (1) data collection, (2) descriptive statistics,
(3) correlation, (4) data visualization, (5) model building, and (6) extrapolation and
regression analysis. As we progress through various parts of this book, we will dive
into all of these and more in detail, and learn scientific methods, tools, and techniques to
tackle data-driven problems, helping us derive interesting and important insights for
making decisions in various fields – business, education, healthcare, policy-making,
and more.
Finally, we touched on some of the issues in data science, namely, privacy, bias, and
ethics. More discussions on these issues will be considered as we proceed through different
topics in this book.
In the next chapter, we will learn more about data – types, formats, cleaning, and
transforming, among other things. Then, in Chapter 3, we will explore various techniques –
most of them statistical in nature. We will learn about them in theory and practice them by hand using small examples. But of course, if we want to work with real data, we need to develop
some technical skills. For this, we will acquire several tools in Chapters 4–7, including UNIX,
R, Python, and MySQL. By that time, you should be able to build your own models using
various programming tools and statistical techniques to solve data-driven problems. But
today’s world needs more than that. So, we will go a few steps further with three chapters on
machine learning. Then, in Chapter 11, we will take several real-world examples and
applications and see how we can apply all of our data science knowledge to solve problems
in various fields and derive decision-making insights. Finally, we will learn (at least on the
surface) some of the core methodologies for collecting and analyzing data, as well as
evaluating systems and analyses, in Chapter 12. Keep in mind that the appendices discuss
much of the background and basic materials. So, make sure to look at appropriate sections in
the appendices as you move forward.
Key Terms
• Data: Information that is factual, such as measurements or statistics, which can be used
as a basis for reasoning, discussion, or prediction.
• Information: Data that are endowed with meaning and purpose.
• Science: The systematic study of the structure and behavior of the physical and natural
world through observations and experiments.
• Data science: The field of study and practice that involves collection, storage, and
processing of data in order to derive important insights into a problem or a phenomenon.
Conceptual Questions
1. What is data science? How does it relate to and differ from statistics?
2. Identify three areas or domains in which data science is being used and describe how.
3. If you are allocated 1 TB of data to use on your phone and you consume 1 GB per month, how many years will it take until you run out of your quota?
4. We saw an example of bias in predicting future crime potential due to misrepresentation
in the available data. Find at least two such instances where an analysis, a system, or an
algorithm exhibited some sort of bias or prejudice.
Hands-On Problems
Problem 1.1
Imagine you see yourself as the next Harland Sanders (founder of KFC) and want to learn
about the poultry business at a much earlier age than Mr. Sanders did. You want to figure out
what kind of feed can help grow healthier chickens. Below is a dataset that might help. The
dataset is sourced from OA 1.3.
# Weight (lbs) Feed
1 179 Horsebean
2 160 Horsebean
3 136 Horsebean
4 227 Horsebean
5 217 Horsebean
6 168 Horsebean
7 108 Horsebean
8 124 Horsebean
9 143 Horsebean
10 140 Horsebean
11 309 Linseed
12 229 Linseed
13 181 Linseed
14 141 Linseed
15 260 Linseed
16 203 Linseed
17 148 Linseed
18 169 Linseed
19 213 Linseed
20 257 Linseed
21 244 Linseed
22 271 Linseed
23 243 Soybean
24 230 Soybean
25 248 Soybean
26 327 Soybean
27 329 Soybean
28 250 Soybean
29 193 Soybean
30 271 Soybean
31 316 Soybean
32 267 Soybean
33 199 Soybean
34 171 Soybean
35 158 Soybean
36 248 Soybean
37 423 Sunflower
38 340 Sunflower
39 392 Sunflower
40 339 Sunflower
41 341 Sunflower
42 226 Sunflower
43 320 Sunflower
44 295 Sunflower
45 334 Sunflower
46 322 Sunflower
47 297 Sunflower
48 318 Sunflower
49 325 Meatmeal
50 257 Meatmeal
51 303 Meatmeal
52 315 Meatmeal
53 380 Meatmeal
54 153 Meatmeal
55 263 Meatmeal
56 242 Meatmeal
57 206 Meatmeal
58 344 Meatmeal
59 258 Meatmeal
60 368 Casein
61 390 Casein
62 379 Casein
63 260 Casein
64 404 Casein
65 318 Casein
66 352 Casein
67 359 Casein
68 216 Casein
69 222 Casein
70 283 Casein
71 332 Casein
Based on this dataset, which type of chicken food appears the most beneficial for a
thriving poultry business?
Problem 1.2
The following table contains an imaginary dataset of auto insurance providers and their
ratings as provided by the latest three customers. Now if you had to choose an auto
insurance provider based on these ratings, which one would you opt for?
# Insurance provider Rating
1 GEICO 4.7
2 GEICO 8.3
3 GEICO 9.2
4 Progressive 7.4
5 Progressive 6.7
6 Progressive 8.9
7 USAA 3.8
8 USAA 6.3
9 USAA 8.1
Problem 1.3
Imagine you have grown to like Bollywood movies recently and started following some of
the well-known actors from the Hindi film industry. Now you want to predict which of these actors’ movies you should watch when a new one is released. Here is a movie review dataset
from the past that might help. It consists of three attributes: movie name, leading actor in the
movie, and its IMDB rating. [Note: assume that a better rating means a more watchable
movie.]
Leading actor Movie name IMDB rating (out of 10)
Notes
1. What is data science? https://datajobs.com/what-is-data-science
2. Davenport, T. H., & Patil, D. J. (2012). Data scientist: the sexiest job of the 21st century. Harvard
Business Review, October: https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-
century
3. Fortune.com: Data science is still white hot: http://fortune.com/2015/05/21/data-science-
white-hot/
4. Dhar, V. (2013). Data science and prediction. Communications of the ACM, 56(12), 64–73.
5. Computer Weekly: Data to grow more quickly says IDC’s Digital Universe study: https://www
.computerweekly.com/news/2240174381/Data-to-grow-more-quickly-says-IDCs-Digital-
Universe-study
6. Analytics Vidhya Content Team. (2015). 13 amazing applications/uses of data science today,
Sept. 21: https://www.analyticsvidhya.com/blog/2015/09/applications-data-science/
7. Kaggle: Lending Club loan data: https://www.kaggle.com/wendykan/lending-club-loan-data
8. Ahmed, S. Loan eligibility prediction: https://www.kdnuggets.com/2018/09/financial-data-ana
lysis-loan-eligibility-prediction.html
9. Data Science for Social Good: https://dssg.uchicago.edu/event/using-data-for-social-good-and-
public-policy-examples-opportunities-and-challenges/
10. Miller, C. C. (2008). How Obama’s internet campaign changed politics. The New York Times,
Nov. 7: http://bits.blogs.nytimes.com/2008/11/07/how-obamas-internet-campaign-changed-
politics/
11. What you can learn from data science in politics: http://schedule.sxsw.com/2016/events/
event_PP49570
12. Lee, J., & Lim, Y. S. (2016). Gendered campaign tweets: the cases of Hillary Clinton and Donald
Trump. Public Relations Review, 42(5), 849–855.
13. Cambridge Analytica: https://en.wikipedia.org/wiki/Cambridge_Analytica
14. O’Reilly, T., Loukides, M., & Hill, C. (2015). How data science is transforming health care.
O’Reilly. May 4: https://www.oreilly.com/ideas/how-data-science-is-transforming-health-care
15. Cadmus-Bertram, L., Marcus, B. H., Patterson, R. E., Parker, B. A., & Morey, B. L. (2015). Use of the
Fitbit to measure adherence to a physical activity intervention among overweight or obese, post-
menopausal women: self-monitoring trajectory during 16 weeks. JMIR mHealth and uHealth, 3(4).
16. Apple Heart Study: http://med.stanford.edu/appleheartstudy.html
17. Your health insurance might score you an Apple Watch: https://www.engadget.com/2016/09/28/
your-health-insurance-might-score-you-an-apple-watch/
18. Argonne National Laboratory: http://www.anl.gov/about-argonne
19. Urban Center for Computation and Data: http://www.urbanccd.org/#urbanccd
20. Forbes Magazine. Fixing education with big data: http://www.forbes.com/sites/gilpress/2012/09/
12/fixing-education-with-big-data-turning-teachers-into-data-scientists/
21. Brookings Institution. Big data for education: https://www.brookings.edu/research/big-data-for-
education-data-mining-data-analytics-and-web-dashboards/
22. Syracuse University iSchool Blog: https://ischool.syr.edu/infospace/2012/07/16/data-science-
whats-in-it-for-the-new-librarian/
23. ACRL. Keeping up with big data: http://www.ala.org/acrl/publications/keeping_up_with/
big_data
24. Priceonomics. What’s the difference between data science and statistics?: https://priceonomics
.com/whats-the-difference-between-data-science-and/
25. FiveThirtyEight. 2016 election forecast: https://projects.fivethirtyeight.com/2016-election-
forecast/
26. New York Times. 2016 election forecast: https://www.nytimes.com/interactive/2016/upshot/
presidential-polls-forecast.html?_r=0#other-forecasts
27. Mixpanel. This is the difference between statistics and data science: https://blog.mixpanel.com/
2016/03/30/this-is-the-difference-between-statistics-and-data-science/
28. Andrew Gelman. Statistics is the least important part of data science: http://andrewgelman.com/
2013/11/14/statistics-least-important-part-data-science/
29. Flowingdata. Rise of the data scientist: https://flowingdata.com/2009/06/04/rise-of-the-data-
scientist/
30. Wikipedia. Business analytics: https://en.wikipedia.org/wiki/Business_analytics
31. Wallace, D. P. (2007). Knowledge Management: Historical and Cross-Disciplinary Themes.
Libraries Unlimited. pp. 1–14. ISBN 978-1-59158-502-2.
32. CDC. Smoking and tobacco use: https://www.cdc.gov/tobacco/data_statistics/fact_sheets/fas
t_facts/index.htm
33. Rowley, J., & Hartley, R. (2006). Organizing Knowledge: An Introduction to Managing Access
to Information. Ashgate Publishing. pp. 5–6. ISBN 978-0-7546-4431-6: https://en.wikipedia.org/
wiki/DIKW_Pyramid
34. Belkin, N. J., Cole, M., & Liu, J. (2009). A model for evaluation of interactive information
retrieval. In Proceedings of the SIGIR 2009 Workshop on the Future of IR Evaluation (pp. 7–8),
July.
35. Wing, J. M. (2006). Computational thinking. Communications of the ACM, 49(3), 33–35.
36. @Josh_Wills Tweet on Data scientist: https://twitter.com/josh_wills/status/198093512149958656
37. Harris, J. (2012). Data is useless without the skills to analyze it. Harvard Business Review, Sept.
13: https://hbr.org/2012/09/data-is-useless-without-the-skills
38. https://blog.udacity.com/2014/11/data-science-job-skills.html
39. Udacity chart on data scientist skills: http://1onjea25cyhx3uvxgs4vu325.wpengine.netdna-cdn
.com/wp-content/uploads/2014/11/blog_dataChart_white.png
40. You are worth $182 to Google, $158 to Facebook and $733 to Amazon! https://arkenea.com/
blog/big-tech-companies-user-worth/
2 Data
“Data is a precious thing and will last longer than the systems themselves.”
— Tim Berners-Lee
2.1 Introduction
“Just as trees are the raw material from which paper is produced, so too, can data be viewed
as the raw material from which information is obtained.”1 To present and interpret informa-
tion, one must start with a process of gathering and sorting data. And for any kind of data
analysis, one must first identify the right kinds of information sources.
In the previous chapter, we discussed different forms of data. The height–weight data we
saw was numerical and structured. When you post a picture using your smartphone, that is
an example of multimedia data. The datasets mentioned in the section on public policy are
government or open data collections. We also discussed how and where this data is stored –
from as small and local as our personal computers, to as large and remote as data ware-
houses. In this chapter, we will look at these and more variations of data in a more formal
way. Specifically, we will discuss data types, data collection, and data formats. We will also
see and practice how data is cleaned, stored, and processed.
One of the most basic ways to think about data is whether it is structured or not. This is especially important for data science because most of the techniques that we will learn depend on whether or not the data has this inherent structure.
Most commonly, structured data refers to highly organized information that can be
seamlessly included in a database and readily searched via simple search operations; whereas
unstructured data is essentially the opposite, devoid of any underlying structure. In structured
data, different values – whether they are numbers or something else – are labeled, which is not
the case when it comes to unstructured data. Let us look at these two types in more detail.
Structured data is the most important data type for us, as we will be using it for most of the
exercises in this book. Already we have seen it a couple of times. In the previous chapter
we discussed an example that included height and weight data. That example included
structured data because the data has defined fields or labels; we know “60” to be height and
“120” to be weight for a given record (which, in this case, is for one person).
But structured data does not need to be strictly numbers. Table 2.1 contains data about
some customers. This data includes numbers (age, income, num.vehicles), text (housing.
type), Boolean type (is.employed), and categorical data (sex, marital.stat). What matters for
us is that any data we see here – whether it is a number, a category, or a text – is labeled. In
other words, we know what that number, category, or text means.
Pick a data point from the table – say, third row and eighth column. That is “22.” We
know from the structure of the table that that data is a number; specifically, it is the age of
a customer. Which customer? The one with the ID 2848 and who lives in Georgia. You see
how easily we could interpret and use the data since it is in a structured format? Of course,
someone would have to collect, store, and present the data in such a format, but for now we
will not worry about that.
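To see how labeled, structured data can be manipulated programmatically, here is a minimal sketch using Python's pandas library. The column names echo Table 2.1, but apart from the age (22), the ID (2848), and the state (Georgia) mentioned above, every value is made up for illustration.

import pandas as pd

# A tiny stand-in for Table 2.1. Only the age, customer ID, and state come from the
# text; all other values here are invented for illustration.
customers = pd.DataFrame({
    "custid": [1492, 2848],
    "sex": ["F", "M"],
    "is.employed": [True, False],
    "income": [52000, 31000],
    "marital.stat": ["Married", "Never married"],
    "housing.type": ["Homeowner", "Rented"],
    "num.vehicles": [2, 1],
    "age": [49, 22],
    "state.of.res": ["Michigan", "Georgia"],
})

# Because every value is labeled by its column, a lookup is unambiguous.
print(customers.loc[customers["custid"] == 2848, "age"])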
observation if the change in IQ score could be different and, even if it were, it could not possibly be concluded that the change was solely due to the difference in one’s height.”
In this paragraph, we have several data points: 65, 67, 125–130, female. However, they
are not clearly labeled. If we were to do some processing, as we did in the first chapter to try
to associate height and IQ, we would not be able to do that easily. And certainly, if we were
to create a systematic process (an algorithm, a program) to go through such data or
observations, we would be in trouble because that process would not be able to identify
which of these numbers corresponds to which of the quantities.
Of course, humans have no difficulty understanding a paragraph like this that contains
unstructured data. But if we want to do a systematic process for analyzing a large amount of
data and creating insights from it, the more structured it is, the better. As I mentioned, in this
book for the most part we will work with structured data. But at times when such data is not
available, we will look to other ways to convert unstructured data to structured data, or
process unstructured data, such as text, directly.
The lack of structure makes compiling and organizing unstructured data a time- and energy-consuming task. It would be easy to derive insights from unstructured data if it could be instantly transformed into structured data. However, structured data is akin to machine language, in that it makes information much easier for computers to parse.
Unstructured data, on the other hand, is often how humans communicate (“natural
language”); but people do not interact naturally with information in strict, database
format.
For example, email is unstructured data. An individual may arrange their inbox in such
a way that it aligns with their organizational preferences, but that does not mean the data is
structured. If it were truly fully structured, it would also be arranged by exact subject and
content, with no deviation or variability. In practice, this would not work, because even
focused emails tend to cover multiple subjects.
Spreadsheets, which are arranged in a relational database format and can be quickly scanned
for information, are considered structured data. According to Brightplanet®, “The problem that
unstructured data presents is one of volume; most business interactions are of this kind,
requiring a huge investment of resources to sift through and extract the necessary elements,
as in a Web-based search engine.”2 And here is where data science is useful. Because the pool
of information is so large, current data mining techniques often miss a substantial amount of
available content, much of which could be game-changing if efficiently analyzed.
Now, if you want to find datasets like the one presented in the previous section or in the
previous chapter, where would you look? There are many places online to look for sets or
collections of data. Here are some of those sources.
The idea behind open data is that some data should be freely available in a public domain
that can be used by anyone as they wish, without restrictions from copyright, patents, or
other mechanisms of control.
Local and federal governments, non-government organizations (NGOs), and academic
communities all lead open data initiatives. For example, you can visit data repositories
produced by the US Government3 or the City of Chicago.4 To unlock the true potential of
“information as open data,” the White House developed Project Open Data in 2013 –
a collection of code, tools, and case studies – to help agencies and individuals adopt the
Open Data Policy. To this end, the US Government released a policy, M-13-13,5 that
instructs agencies to manage their data, and information more generally, as an asset from
the start, and, wherever possible, release it to the public in a way that makes it open,
discoverable, and usable. Following is the list of principles associated with open data as
observed in the policy document:
• Public. Agencies must adopt a presumption in favor of openness to the extent permitted
by law and subject to privacy, confidentiality, security, or other valid restrictions.
• Accessible. Open data are made available in convenient, modifiable, and open formats
that can be retrieved, downloaded, indexed, and searched. Formats should be machine-
readable (i.e., data are reasonably structured to allow automated processing). Open data
structures do not discriminate against any person or group of persons and should be made
available to the widest range of users for the widest range of purposes, often by providing
the data in multiple formats for consumption. To the extent permitted by law, these
formats should be non-proprietary, publicly available, and no restrictions should be
placed on their use.
• Described. Open data are described fully so that consumers of the data have sufficient
information to understand their strengths, weaknesses, analytical limitations, and security
requirements, as well as how to process them. This involves the use of robust, granular
metadata (i.e., fields or elements that describe data), thorough documentation of data
elements, data dictionaries, and, if applicable, additional descriptions of the purpose of
the collection, the population of interest, the characteristics of the sample, and the method
of data collection.
• Reusable. Open data are made available under an open license6 that places no restrictions
on their use.
• Complete. Open data are published in primary forms (i.e., as collected at the source), with
the finest possible level of granularity that is practicable and permitted by law and other
requirements. Derived or aggregate open data should also be published but must refer-
ence the primary data.
• Timely. Open data are made available as quickly as necessary to preserve the value of
the data. Frequency of release should account for key audiences and downstream
needs.
• Managed Post-Release. A point of contact must be designated to assist with data use and
to respond to complaints about adherence to these open data requirements.
Social media has become a gold mine for collecting data to analyze for research or marketing
purposes. This is facilitated by the Application Programming Interface (API) that social
media companies provide to researchers and developers. Think of the API as a set of rules
and methods for asking and sending data. For various data-related needs (e.g., retrieving
a user’s profile picture), one could send API requests to a particular social media service. This
is typically a programmatic call that results in that service sending a response in a structured
data format, such as XML. We will discuss XML later in this chapter.
The Facebook Graph API is a commonly used example.7 These APIs can be used by any individual or organization to collect data and use it to accomplish a variety of tasks, such as developing new socially impactful applications, conducting research on human information behavior, and monitoring the aftermath of natural calamities. Furthermore, to encourage
research on niche areas, such datasets have often been released by the social media platform
itself. For example, Yelp, a popular crowd-sourced review platform for local businesses,
released datasets that have been used for research in a wide range of topics – from automatic
photo classification to natural language processing of review texts, and from sentiment
analysis to graph mining, etc. If you are interested in learning about and solving such
challenges, you can visit the Yelp.com dataset challenge8 to find out more. We will revisit
this method of collecting data in later chapters.
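To make the idea of an API request concrete, here is a minimal sketch in Python using the third-party requests library. The endpoint, field names, and token below are hypothetical placeholders; a real service such as the Facebook Graph API defines its own URLs, parameters, and authentication flow.

import requests

# Hypothetical endpoint and token, for illustration only.
API_URL = "https://api.example-social-site.com/v1/users/12345"
params = {"fields": "id,name,profile_picture", "access_token": "YOUR_ACCESS_TOKEN"}

response = requests.get(API_URL, params=params)   # the programmatic call
response.raise_for_status()                       # stop if the request failed
profile = response.json()                         # structured response (JSON here)
print(profile.get("name"))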
Depending on its nature, data is stored in various formats. We will start with simple kinds –
data in text form. If such data is structured, it is common to store and present it in some kind
of delimited way. That means various fields and values of the data are separated using
delimiters, such as commas or tabs. And that gives rise to two of the most commonly used
formats that store data as simple text – comma-separated values (CSV) and tab-separated
values (TSV).
1. CSV (Comma-Separated Values) format is the most common import and export
format for spreadsheets and databases. There is no “CSV standard,” so the format is
operationally defined by the many applications that read and write it. For example,
Depression.csv is a dataset that is available at UF Health, UF Biostatistics11 for down-
loading. The dataset represents the effectiveness of different treatment procedures on
separate individuals with clinical depression. A snippet of the file is shown below:
treat,before,after,diff
No Treatment,13,16,3
No Treatment,10,18,8
No Treatment,16,16,0
Placebo,16,13,-3
Placebo,14,12,-2
Placebo,19,12,-7
Seroxat (Paxil),17,15,-2
Seroxat (Paxil),14,19,5
Seroxat (Paxil),20,14,-6
Effexor,17,19,2
Effexor,20,12,-8
Effexor,13,10,-3
In this snippet, the first row mentions the variable names. The remaining rows each
individually represent one data point. It should be noted that, for some data points,
values of all the columns may not be available. The “Data Pre-processing” section later
in this chapter describes how to deal with such missing information.
An advantage of the CSV format is that it is more generic and useful when sharing
with almost anyone. Why? Because specialized tools to read or manipulate it are not
required. Any spreadsheet program such as Microsoft Excel or Google Sheets can
readily open a CSV file and display it correctly most of the time. But there are also
several disadvantages. For instance, since the comma is used to separate fields, a data value that itself contains a comma can be problematic. This is usually addressed by quoting that field or by escaping the comma (typically adding a backslash before it), but these remedies can be frustrating because not everybody follows such conventions consistently.
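As a quick sketch of how generic CSV really is, the snippet below reads the Depression.csv file with nothing more than Python's built-in csv module (assuming the file has been downloaded to the working directory).

import csv

with open("Depression.csv", newline="") as f:
    reader = csv.DictReader(f)                    # first row supplies the field names
    for row in reader:
        # each row is a labeled record, e.g., {'treat': 'Placebo', 'before': '16', ...}
        change = int(row["after"]) - int(row["before"])
        print(row["treat"], change)

Note that the values come in as plain strings and must be converted explicitly; spreadsheet programs make that same guess for you behind the scenes.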
2. TSV (Tab-Separated Values) files are used for raw data and can be imported into and
exported from spreadsheet software. Tab-separated values files are essentially text files,
and the raw data can be viewed by text editors, though such files are often used when
moving raw data between spreadsheets. An example of a TSV file is shown below, along
with the advantages and disadvantages of this format.
Suppose the registration records of all employees in an office are stored as follows:
Name<TAB>Age<TAB>Address
Ryan<TAB>33<TAB>1115 W Franklin
Paul<TAB>25<TAB>Big Farm Way
Jim<TAB>45<TAB>W Main St
Samantha<TAB>32<TAB>28 George St
That means whoever reads this will not be able to readily format or process it. But in
contrast to HTML, the markup data in XML is not meant for direct visualization.
Instead, one could write a program, a script, or an app that specifically parses this
markup and uses it according to the context. For instance, one could develop a website
that runs in a Web browser and uses the above data in XML, whereas someone else could
write a different code and use this same data in a mobile app. In other words, the data
remains the same, but the presentation is different. This is one of the core advantages of
XML and one of the reasons XML is becoming quite important as we deal with multiple
devices, platforms, and services relying on the same data.
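As a small sketch of what such a parsing program could look like, the following Python code uses the standard library's ElementTree module on a made-up XML fragment; the tags here are illustrative, not the chapter's exact example.

import xml.etree.ElementTree as ET

xml_data = """
<employees>
  <employee><name>Ryan</name><age>33</age></employee>
  <employee><name>Paul</name><age>25</age></employee>
</employees>
"""

root = ET.fromstring(xml_data)
for emp in root.findall("employee"):
    # the same parsed data could feed a website, a mobile app, or an analysis script
    print(emp.findtext("name"), emp.findtext("age"))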
4. RSS (Really Simple Syndication) is a format used to share data between services; it is itself written in XML, following the XML 1.0 specification. It facilitates the delivery of information
from various sources on the Web. Information provided by a website in an XML file in
such a way is called an RSS feed. Most current Web browsers can directly read RSS files,
but a special RSS reader or aggregator may also be used.13
The format of RSS follows XML standard usage but in addition defines the names of
specific tags (some required and some optional), and what kind of information should be
stored in them. It was designed to show selected data. So, RSS starts with the XML
standard, and then further defines it so that it is more specific.
Let us look at a practical example of RSS usage. Imagine you have a website that
provides several updates of some information (news, stocks, weather) per day. To keep
up with this, and even to simply check if there are any updates, a user will have to
continuously return to this website throughout the day. This is not only time-consuming,
but also unfruitful as the user may be checking too frequently and encountering no
updates, or, conversely, checking not often enough and missing out on crucial informa-
tion as it becomes available. Users can check your site faster using an RSS aggregator (a
site or program that gathers and sorts out RSS feeds). This aggregator will ensure that it
has the information as soon as the website provides it, and then it pushes that information
out to the user – often as a notification.
Since RSS data is small and fast loading, it can easily be used by services and devices such as mobile phones, personal digital assistants (PDAs), and smart watches.
RSS is useful for websites that are updated frequently, such as:
• News sites – Lists news with title, date and descriptions.
• Companies – Lists news and new products.
• Calendars – Lists upcoming events and important days.
• Site changes – Lists changed pages or new pages.
Do you want to publish your content using RSS? Here is a brief guideline on how to
make it happen.
First, you need to register your content with RSS aggregator(s). To participate, first
create an RSS document and save it with an .xml extension (see example below). Then,
upload the file to your website. Finally, register with an RSS aggregator. Each day (or
with a frequency you specify) the aggregator searches the registered websites for RSS
documents, verifies the link, and displays information about the feed so clients can link
to documents that interest them.14
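As an illustration, a minimal RSS 2.0 document might look like the following; the titles, links, and descriptions are placeholders, not a real feed.

<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
  <channel>
    <title>Example News Site</title>
    <link>https://www.example.com</link>
    <description>Latest headlines from a hypothetical news site</description>
    <item>
      <title>First headline</title>
      <link>https://www.example.com/first-headline</link>
      <description>A short summary of the first story.</description>
    </item>
  </channel>
</rss>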
Another widely used text-based format is JSON (JavaScript Object Notation), which is handled natively by JavaScript. Two typical uses are shown below.
1. Sending data: if data is stored in a JavaScript object, we can convert it into JSON and send it to a server. For example:
<!DOCTYPE html>
<html>
<body>
<p id="demo"></p>
<script>
var obj = {"name":"John", "age":25, "state":"New Jersey"};
var obj_JSON = JSON.stringify(obj);
window.location = "json_Demo.php?x=" + obj_JSON;
</script>
</body>
</html>
2. Receiving data: If the received data is in JSON format, we can convert it into
a JavaScript object. For example:
<!DOCTYPE html>
<html>
<body>
<p id="demo"></p>
<script>
var obj_JSON = '{"name":"John", "age":25, "state":"New Jersey"}';
var obj = JSON.parse(obj_JSON);
document.getElementById("demo").innerHTML = obj.name;
</script>
</body>
</html>
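The same round trip can be done outside the browser. Here is a minimal sketch in Python using the standard json module (a tool choice of ours, not part of the JavaScript examples above):

import json

obj = {"name": "John", "age": 25, "state": "New Jersey"}

obj_json = json.dumps(obj)      # serialize: Python dict -> JSON text
print(obj_json)

parsed = json.loads(obj_json)   # parse: JSON text -> Python dict
print(parsed["name"])           # prints John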
Now that we have seen several formats of data storage and presentation, it is important to
note that these are by no means the only ways to do it, but they are some of the most
preferred and commonly used ways.
Having familiarized ourselves with data formats, we will now move on to manipulating the data.
Data in the real world is often dirty; that is, it is in need of being cleaned up before it can be
used for a desired purpose. This is often called data pre-processing. What makes data
“dirty”? Here are some of the factors that indicate that data is not clean or ready to process:
• Incomplete. When some of the attribute values are lacking, certain attributes of interest
are lacking, or attributes contain only aggregate data.
• Noisy. When data contains errors or outliers. For example, some of the data
points in a dataset may contain extreme values that can severely affect the dataset’s
range.
• Inconsistent. Data contains discrepancies in codes or names. For example, if the
“Name” column for registration records of employees contains values other than
alphabetical letters, or if records do not start with a capital letter, discrepancies are
present.
Figure 2.1 shows the most important tasks involved in data pre-processing.20
In the subsections that follow, we will consider each of them in detail, and then work
through an example to practice these tasks.
Figure 2.1 Forms of data pre-processing (N. H. Son, Data Cleaning and Data Pre-processing21). The figure illustrates data cleaning, data integration, and data transformation (e.g., −17, 25, 39, 128, −39 rescaled to −0.17, 0.25, 0.39, 1.28, −0.39).
Since there are several reasons why data could be “dirty,” there are just as many ways to
“clean” it. For this discussion, we will look at three key methods that describe ways in
which data may be “cleaned,” or better organized, or scrubbed of potentially incorrect,
incomplete, or duplicated information.
For example, a list of recipe ingredients mentioned in free text can be scrubbed into a small structured table:
Ingredient  Quantity  Form
Tomato      2         Diced
Garlic      3         Cloves
Salt        1         Pinch
To be as efficient and effective for various data analyses as possible, data from various
sources commonly needs to be integrated. The following steps describe how to integrate
multiple databases or files.
1. Combine data from multiple sources into a coherent storage place (e.g., a single file or
a database).
2. Engage in schema integration, or the combining of metadata from different sources.
3. Detect and resolve data value conflicts. For example:
a. A conflict may arise; for instance, such as the presence of different attributes and
values from various sources for the same real-world entity.
b. Reasons for this conflict could be different representations or different scales; for
example, metric vs. British units.
4. Address redundant data in data integration. Redundant data is commonly generated in
the process of integrating multiple databases. For example:
a. The same attribute may have different names in different databases.
b. One attribute may be a “derived” attribute in another table; for example, annual revenue.
c. Correlation analysis may detect instances of redundant data.
If this has begun to appear confusing, hang in there – some of these steps will become
clearer as we take an example in the next section.
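For a concrete, if simplified, picture of steps 1–4, here is a sketch using pandas; the two data frames, their column names, and their values are entirely hypothetical.

import pandas as pd

# Two sources describing the same customers with different schemas (made-up data).
billing = pd.DataFrame({"cust_id": [1, 2, 3], "annual_income": [52000, 61000, 47000]})
weblogs = pd.DataFrame({"customer": [2, 3, 4], "page_views": [120, 45, 300]})

# Schema integration: agree on one attribute name before combining.
weblogs = weblogs.rename(columns={"customer": "cust_id"})

# Combine into a single coherent table; unmatched records get missing values.
combined = pd.merge(billing, weblogs, on="cust_id", how="outer")
print(combined)

# Correlation analysis can flag redundant (derived or duplicated) attributes.
print(combined[["annual_income", "page_views"]].corr())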
Data must be transformed so it is consistent and readable (by a system). The following five
processes may be used for data transformation. For the time being, do not worry if these
seem too abstract. We will revisit some of them in the next section as we work through an
example of data pre-processing.
1. Smoothing: Remove noise from data.
2. Aggregation: Summarization, data cube construction.
3. Generalization: Concept hierarchy climbing.
4. Normalization: Scaled to fall within a small, specified range and aggregation. Some of
the techniques that are used for accomplishing normalization (but we will not be
covering them here) are:
a. Min–max normalization.
b. Z-score normalization.
c. Normalization by decimal scaling.
5. Attribute or feature construction.
a. New attributes constructed from the given ones.
Detailed explanations of all of these techniques are out of scope for this book (a brief sketch of normalization follows below), but later in this chapter we will do a hands-on exercise to practice some of them in simpler forms.
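To make two of the normalization techniques less abstract, here is a minimal sketch using the numbers shown in Figure 2.1; the choice of library (NumPy) is ours.

import numpy as np

values = np.array([-17.0, 25.0, 39.0, 128.0, -39.0])

# Min-max normalization: rescale values into the [0, 1] range.
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score normalization: zero mean and unit (sample) standard deviation.
z_scores = (values - values.mean()) / values.std(ddof=1)

print(min_max)
print(z_scores)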
Data reduction involves reducing the number of records and/or attributes in the given data and/or creating composite dimensions or features that could sufficiently
represent a set of raw features. Strategies for reduction include sampling, clustering,
principal component analysis, etc. We will learn about clustering in multiple chapters in
this book as a part of machine learning. The rest are outside the scope of this book.
We are often dealing with data that are collected from processes that are continuous, such as
temperature, ambient light, and a company’s stock price. But sometimes we need to convert
these continuous values into more manageable parts. This mapping is called discretization.
And as you can see, in undertaking discretization, we are also essentially reducing data.
Thus, this process of discretization could also be perceived as a means of data reduction, but
it holds particular importance for numerical data. There are three types of attributes
involved in discretization:
a. Nominal: Values from an unordered set
b. Ordinal: Values from an ordered set
c. Continuous: Real numbers
To achieve discretization, divide the range of continuous attributes into intervals. For
instance, we could decide to split the range of temperature values into cold, moderate, and
hot, or the price of company stock into above or below its market valuation.
• Smooth Noisy Data. We can see that the wine consumption value for Iceland per capita is
−0.800000012. However, wine consumption values per capita cannot be negative. Therefore, it
must be a faulty entry and we should change the alcohol consumption for Iceland to 0.800000012.
Using the same logic, the number of deaths for Israel should be converted from −834 to 834.
• Handling Missing Data. As we can see in the dataset, we have missing values (represented by
NA – not available) of the number of cases of heart disease for Canada and number of cases of heart
and liver disease for Spain. A simple workaround for this is to replace all the NAs with some common
values, such as zero or average of all the values for that attribute. Here, we are going to use the
average of the attribute for handling the missing values. So, for both Canada and Spain, we will use the value of 185 as the number of heart disease cases. Likewise, the number of liver disease cases for Spain is replaced by 20.27 (a code sketch of this kind of replacement appears after this list). It is important to note: depending on the nature of the problem, it may not be
a good idea to replace all of the NAs with the same value. A better solution would be to derive the
value of the missing attribute from the values of other attributes of that data point.
• Data Wrangling. As previously discussed, data wrangling is the process of manually converting or mapping data from one “raw” form into another format. For example, it may happen that, for one country, the number of deaths is reported per 10,000 people rather than per 100,000 as for the other countries. In that case, we need to transform that country’s value to a per-100,000 basis (or convert every other country’s value to per 10,000). Fortunately for us, this dataset does not involve any data
wrangling steps. So, at the end of this stage the dataset would look like what we see in Table 2.4.
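Here is the promised sketch of replacing missing values with the attribute average, using pandas. The frame below is a tiny, made-up stand-in for the wine consumption dataset, not the actual table values.

import pandas as pd
import numpy as np

# Made-up stand-in; NaN plays the role of the NA entries described above.
df = pd.DataFrame({
    "country": ["Canada", "Spain", "France"],
    "heart": [np.nan, np.nan, 200.0],
    "liver": [15.0, np.nan, 25.0],
})

# Replace each missing value with the average of its attribute.
df["heart"] = df["heart"].fillna(df["heart"].mean())
df["liver"] = df["liver"].fillna(df["liver"].mean())
print(df)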
2. Data Integration. Now let us assume we have another dataset (fictitious) collected from a different
source, which is about alcohol consumption and number of related fatalities across various states of
India, as shown in Table 2.5.
Table 2.4 Wine consumption vs. mortality data after data cleaning (columns: #, Country, Alcohol, Deaths, Heart, Liver).
Table 2.5 Data about alcohol consumption and health from various States in India (columns: #, Name of the State, Alcohol consumption, Heart disease, Fatal alcohol-related accidents).
Table 2.6 Wine consumption and associated mortality after data integration (columns: #, Country, Alcohol, Deaths, Heart, Liver).
This integration rests on a few assumptions: (a) that the drinking-related behaviors of the two populations are at least similar; (b) that the sample of these States is similar to the whole population of India; and (c) that the wine consumption is roughly equivalent to the total alcohol consumption value in India, even though in reality the wine consumption per capita should be less than the total alcohol consumption per capita, as there are other kinds of alcoholic beverages in the market.
3. Data Transformation. As previously mentioned, the data transformation process involves one or more
of smoothing, removing noise from data, summarization, generalization, and normalization. For this
example, we will employ smoothing, which is simpler than summarization and normalization. As we can
see, in our data the wine consumption per capita for Italy is unusually high, whereas the same for Norway
is unusually low. So, chances are these are outliers. In this case we will replace the value of wine
consumption for Italy with 7.900000095. Similarly, for Norway we will use the value of 0.800000012 in
place of 0.0800000012. We are treating both of these potential errors as “equipment error” or “entry
error,” which resulted in an extra digit for both of these countries (extra “2” in front for Italy and extra “0”
after the decimal point for Norway). This is a reasonable assumption given the limited context we have
about the dataset. A more practical approach would be to look at the nearest geolocation for which we
have the values and use that value to make predictions about the countries with erroneous entries. So, at
the end of this step the dataset will be transformed into what is shown in Table 2.7.
Table 2.7 Wine consumption and associated mortality dataset after data transformation (columns: #, Country, Alcohol, Deaths, Heart, Liver).
4. Data Reduction. The process of data reduction is aimed at producing a reduced representation of the
dataset that can be used to obtain the same or similar analytical results. For our example, the sample is
relatively small, with only 22 rows. Now imagine that we have values for all 196 countries in the world, with every attribute recorded for each of them. In that case, the number of rows is much larger, and, depending on the limited processing and storage capacity you have at your disposal, it may make more sense to round the alcohol consumption per capita to two decimal
places. Each extra decimal place for every data point in such a large dataset will need a significant
amount of storage capacity. Thus, reducing the liver column to one decimal place and the alcohol
consumption column to two decimal places would result in the dataset shown in Table 2.8.
Note that data reduction does not mean just reducing the size of attributes – it also may involve
removing some attributes, which is known as feature space selection. For example, if we are
interested in the relation between the wine consumed and the number of casualties from heart disease, we may opt to remove the attribute “number of liver diseases” if we assume that there is no relation between the number of heart disease fatalities and the number of liver disease fatalities.
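A sketch of these two reduction steps (lowering precision and dropping an attribute) in pandas; the liver values below are made up, while the alcohol values echo the transformed data above.

import pandas as pd

df = pd.DataFrame({
    "country": ["Italy", "Norway"],
    "alcohol": [7.900000095, 0.800000012],   # values from the transformation step above
    "liver": [23.456, 12.345],               # made-up values, for illustration only
})

df["alcohol"] = df["alcohol"].round(2)       # reduce to two decimal places
df["liver"] = df["liver"].round(1)           # reduce to one decimal place

# Feature space selection: drop an attribute judged irrelevant to the question at hand.
df = df.drop(columns=["liver"])
print(df)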
Table 2.8 Wine consumption and associated mortality dataset after data reduction (columns: #, Country, Alcohol, Deaths, Heart, Liver).
Table 2.9 Wine consumption and mortality dataset at the end of pre-processing (columns: #, Country, Alcohol, Deaths, Heart, Liver).
5. Data Discretization. As we can see, all the attributes involved in our dataset are continuous type
(values in real numbers). However, depending on the model you want to build, you may have to
discretize the attribute values into binary or categorical types. For example, you may want to discretize
the wine consumption per capita into four categories – less than or equal to 1.00 per capita
(represented by 0), more than 1.00 but less than or equal to 2.00 per capita (1), more than 2.00
but less than or equal to 5.00 per capita (2), and more than 5.00 per capita (3). The resultant dataset
should look like that shown in Table 2.9.
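A minimal sketch of this discretization using pandas; the four per-capita values below are invented, while the bin edges are the ones described above.

import pandas as pd

alcohol = pd.Series([0.75, 1.5, 3.2, 7.9])   # made-up wine consumption per capita

# <=1.00 -> 0, (1.00, 2.00] -> 1, (2.00, 5.00] -> 2, >5.00 -> 3
categories = pd.cut(alcohol,
                    bins=[-float("inf"), 1.0, 2.0, 5.0, float("inf")],
                    labels=[0, 1, 2, 3])
print(categories)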
And that is the end result of this exercise. Yes, it may seem that we did not conduct real data processing or analytics. But through our pre-processing techniques, we have managed to prepare a much better and more meaningful dataset. Often, that itself is half the battle. Having said that, for most of the book we will focus on the other half of the battle – processing, visualizing, and analyzing the data for solving problems and making decisions. Nonetheless, I hope the sections on data pre-processing and the hands-on exercise we did here have given you some insight into what needs to occur before you get your hands on nice-looking data for processing.
Summary
Many of the examples of data we have seen so far have been in nice tables, but it should be
clear by now that data appears in many forms, sizes, and formats. Some are stored in
spreadsheets, and others are found in text files. Some are structured, and some are
unstructured. In this book, most data we will deal with are found in text format, but there
are plenty of data out there in image, audio, and video formats.
As we saw, the process of data processing is more complicated if there is missing or
corrupt data, and some data may need cleaning or converting before we can even begin to
do any processing with it. This requires several forms of pre-processing.
Some data cleaning or transformation may be required, and some may depend on our
purpose, context, and availability of analysis tools and skills. For instance, if you know
SQL (a language covered in Chapter 7) and want to take advantage of this effective and
efficient query language, you may want to import your CSV-formatted data into a MySQL
database, even if that CSV data has no “issues.”
Data pre-processing is so important that many organizations have specific job positions
just for this kind of work. These people are expected to have the skills to do all the stages
described in this chapter: from cleaning to transformation, and even finding or approximat-
ing the missing or corrupt values in a dataset. There is some technique, some science, and
much engineering involved in this process. But it is a very important job, because, without
having the right data in the proper format, almost all that follows in this book would be
impossible. To put it differently – before you jump to any of the “fun” analyses here, make
sure you have at least thought about whether your data needs any pre-processing, otherwise
you may be asking the right question of the wrong data!
Key Terms
• Structured data: Structured data is highly organized information that can be seamlessly
included in a database and readily searched via simple search operations.
• Unstructured data: Unstructured data is information devoid of any underlying structure.
• Open data: Data that is freely available in a public domain that can be used by anyone as
they wish, without restrictions from copyright, patents, or other mechanisms of control.
• Application Programming Interface (API): A programmatic way to access data. A set
of rules and methods for asking and sending data.
• Outlier: A data point that is markedly different in value from the other data points of the
sample.
• Noisy data: The dataset has one or more instances of errors or outliers.
• Nominal data: The data type is nominal when there is no natural order between the
possible values, for example, colors.
• Ordinal data: If the possible values of a data type are from an ordered set, then the type is
ordinal. For example, grades in a mark sheet.
• Continuous data: A data type that has an infinite number of possible values. For example, real numbers.
• Data cubes: They are multidimensional sets of data that can be stored in a spreadsheet.
A data cube could be in two, three, or higher dimensions. Each dimension typically
represents an attribute of interest.
• Feature space selection: A method for selecting a subset of features or columns from the
given dataset as a way to do data reduction.
Conceptual Questions
5. You are looking at employee records. Some have no middle name, some have a middle
initial, and others have a complete middle name. How do you explain such inconsis-
tency in the data? Provide at least two explanations.
Hands-On Problems
Problem 2.1
The following dataset, obtained from OA 2.2, contains statistics in arrests per 100,000
residents for assault and murder, in each of the 50 US states, in 1973. Also given is the
percentage of the population living in urban areas.
The dataset has the following columns: Murder, Assault, Urban population (%).
Now, use the pre-processing techniques at your disposal to prepare the dataset for analysis.
a. Address all the missing values.
b. Look for outliers and smooth noisy data.
c. Prepare the dataset to establish a relation between an urban population category and
a crime type. [Hint: Convert the urban population percentage into categories, for
example, small (<50%), medium (<60%), large (<70%), and extra-large (70% and
above) urban population.]
Problem 2.2
The following is a dataset of bridges in Pittsburgh. The original dataset was prepared by
Yoram Reich and Steven J. Fenves, Department of Civil Engineering and Engineering
Design Research Center, Carnegie Mellon University, and is available from OA 2.3.
The dataset has the following columns: ID, Purpose, Length, Lanes, Clear, T or D, Material, Span, Rel-L.
Problem 2.3
The following is a dataset that involves child mortality rate and is inspired by data collected
from UNICEF. The original dataset is available from OA 2.4. According to the report, the
world has achieved substantial success in reducing child mortality during the last few
decades. According to the UNICEF report, globally the under-five age mortality rate has
decreased from 93 deaths per 1000 live births in 1990 to less than 50 in 2016.
The dataset has the following columns: Year, Under-five mortality rate, Infant mortality rate, Neonatal mortality rate.
However, as you can see, the dataset has a number of missing instances, which need to be
fixed before a clear progress on child mortality can be explained from the year of 1990 to
2016. Use this dataset to complete the following tasks:
a. Address all the missing values using the techniques at your disposal.
b. Prepare the dataset to establish the following relations:
i. Under-five mortality rate and neonatal mortality rate.
Further Reading and Resources
• Bellinger, G., Castro, D., & Mills, A. Data, information, knowledge, and wisdom: http://
www.systems-thinking.org/dikw/dikw.htm
• US Government Open Data Policy: https://project-open-data.cio.gov/
• Developing insights from social media data: https://sproutsocial.com/insights/social-
media-data/
• Social Media Data Analytics course on Coursera by the author: https://www.coursera.org
/learn/social-media-data-analytics
Notes
1. Statistics Canada. Definitions: http://www.statcan.gc.ca/edu/power-pouvoir/ch1/definitions/
5214853-eng.htm
2. BrightPlanet®. Structured vs. unstructured data definition: https://brightplanet.com/2012/06/
structured-vs-unstructured-data/
3. US Government data repository: https://www.data.gov/
4. City of Chicago data repository: https://data.cityofchicago.org/
5. US Government policy M-13-13: https://project-open-data.cio.gov/policy-memo/
6. Project Open Data “open license”: https://project-open-data.cio.gov/open-licenses/
7. Facebook Graph API: https://developers.facebook.com/docs/graph-api/
8. Yelp dataset challenge: https://www.yelp.com/dataset/challenge
9. SPM created by Karl Friston: https://en.wikipedia.org/wiki/Karl_Friston
10. UCL SPM website: http://www.fil.ion.ucl.ac.uk/spm/
11. UF Health. UF Biostatistics open learning textbook: http://bolt.mph.ufl.edu/2012/08/02/learn-by
-doing-exploring-a-dataset/
12. An actual tab will appear as simply a space. To aid clarity, in this book we are explicitly spelling out
<TAB>. Therefore, wherever you see in this book <TAB>, in reality an actual tab would appear as
a space.
13. XUL.fr. Really Simple Syndication definition: http://www.xul.fr/en-xml-rss.html
14. w3schools. XML RSS explanation and example: https://www.w3schools.com/xml/xml_rss.asp
15. FEED Validator: http://www.feedvalidator.org
16. Google: submit your content: http://www.google.com/submityourcontent/website-owner
17. Bing submit site: http://www.bing.com/toolbox/submit-site-url
18. JSON: http://www.json.org/
19. w3schools. JSON introduction: http://www.w3schools.com/js/js_json_intro.asp
20. KDnuggets™ introduction to data mining course: http://www.kdnuggets.com/data_mining_course/
21. Data cleaning and pre-processing presentation: http://www.mimuw.edu.pl/~son/datamining/
DM/4-preprocess.pdf
3 Techniques
“Information is the oil of the 21st century, and analytics is the combustion engine.”
— Peter Sondergaard, Senior Vice President, Gartner Research
3.1 Introduction
There are many tools and techniques that a data scientist is expected to know or acquire as
problems arise. Often, it is hard to separate tools and techniques. One whole section of this
book (four chapters) is dedicated to teaching how to use various tools, and, as we learn
about them, we also pick up and practice some essential techniques. This happens for two
reasons. The first one is already mentioned here – it is hard to separate tools from
techniques. Regarding the second reason – since our main purpose is not necessarily to
master any programming tools, we will learn about programming languages and platforms
in the context of solving data problems.
That said, there are aspects of data science-related techniques that are better
studied without worrying about any particular tool or programming language. And
that is the approach we will pursue. In this chapter, we will review some basic
techniques used in data science and see how they are used for performing analytics
and data analyses.
We will begin by considering some differences and similarities between data analysis
and data analytics. Often it is not critical to distinguish between the two, but here we will see how doing so might be important. For the rest of the chapter we will look at
various forms of analyses: descriptive, diagnostic, predictive, prescriptive, exploratory, and
mechanistic. In the process we will be reviewing basic statistics. That should not surprise
you, as data science is often considered just a fancy term for statistics! As we learn about
these tools and techniques, we will also look at some examples and gain experience using
real data analysis (though it will be limited due to our lack of knowledge about any
programming or specialized tools as of this chapter).
These two terms – data analysis and data analytics – are often used interchangeably and
could be confusing. Is a job that calls for data analytics really talking about data analysis
and vice versa? Well, there are some subtle but important differences between analysis and
analytics. A lack of understanding can affect the practitioner’s ability to leverage the data to
their best advantage.1
According to Dave Kasik, Boeing’s Senior Technical Fellow in visualization and inter-
active techniques, “In my terminology, data analysis refers to hands-on data exploration and
evaluation. Data analytics is a broader term and includes data analysis as [a] necessary
subcomponent. Analytics defines the science behind the analysis. The science means
understanding the cognitive processes an analyst uses to understand problems and explore
data in meaningful ways.”2
One way to understand the difference between analysis and analytics is to think in
terms of past and future. Analysis looks backwards, providing marketers with a historical
view of what has happened. Analytics, on the other hand, models the future or predicts
a result.
Analytics makes extensive use of mathematics and statistics and the use of descriptive
techniques and predictive models to gain valuable knowledge from data. These insights
from data are used to recommend action or to guide decision-making in a business context.
Thus, analytics is not so much concerned with individual analysis or analysis steps, but with
the entire methodology.
There is no clear agreeable-to-all classification scheme available in the literature to
categorize all the analysis techniques that are used by data science professionals.
However, based on their application on various stages of data analysis, I have categorized
analysis techniques into six classes of analysis and analytics: descriptive analysis, diag-
nostic analytics, predictive analytics, prescriptive analytics, exploratory analysis, and
mechanistic analysis. Each of these and their applications are described below.
3.3.1 Variables
Before we process or analyze any data, we have to be able to capture and represent it. This is
done with the help of variables. A variable is a label we give to our data. For instance, you can
write down age values of all your cousins in a table or a spreadsheet and label that column
with “age.” Here, “age” is a variable and it is of type numeric (and “ratio” type as we will
soon see). If we then want to identify who is a student or not, we can create another column,
and next to each cousin’s name we can write down “yes” or “no” under a new column called
“student.” Here, “student” is a variable and it is of type categorical (more on that soon).
Figure 3.1 Census data as a way to describe the population.3
Since a lot of what we will do in this book (and perhaps what you will do in a data science job)
will deal with different forms of numerical information, let us look further into such variables.
Numeric information can be separated into distinct categories that can be used to summarize
data. The first stage of summarizing any numeric information is to identify the category to which
it belongs. For example, the above section covered three operations for numbers: counting,
ranking, and placing on a scale. Each of these corresponds to different levels of measurement.
So, if people are classified based on their racial identities, statisticians can name the categories
and count their contents. Such use defines the categorical variable. Think about animal
taxonomy that biologists use – the one with mammals, reptiles, etc. Those represent categorical
levels. If we find it convenient to represent such categories using numbers, this becomes a
nominal variable. Essentially here, we are using numbers to represent categories, but cannot
use those numbers for any meaningful mathematical or statistical operations.
If we can differentiate among individuals within a group, we can use an ordinal variable
to represent those values. For example, we can rank a selection of people in terms of their
apparent communication skill. But this statistic can only go so far; it cannot, for example,
create an equal-unit scale. What this means is that, while we could order the entities, there is
no enhanced meaning to the distances in that order. For instance, we cannot simply subtract rank 3 from rank 5 and claim that the difference represents the same thing as rank 2. For that, we turn to an interval variable.
Let us think about the measurement of temperature. We do it in Fahrenheit or Celsius. If the temperature is measured as 40 degrees Fahrenheit on a given day, that measure is placed on a scale whose zero point (0 degrees Fahrenheit) is arbitrary rather than a true zero. If the next day the temperature is 45 degrees Fahrenheit, we can say that the temperature has risen by 5 degrees (that is the difference). And a difference of 5 degrees Fahrenheit has physical meaning, unlike the difference between two ranks at the ordinal level of measurement. This kind of scenario describes an interval level of measurement. Put another way, an interval level of measurement allows us to do additions and subtractions, but not multiplications or divisions. What does that mean? It means we cannot talk about doubling or halving a temperature. OK, well, we could, but that multiplication or division has no physical meaning: water boils at 100 degrees Celsius, but at 200 degrees Celsius it does not boil twice as much or twice as fast.
For multiplication and division (as well as addition and subtraction), we turn to a ratio
variable. This is common in physical sciences and engineering. Examples include length (feet,
yards, meters), and weight (pounds, kilograms). If a pound of grapes costs $5, two pounds will
cost $10. If you have 4 yards of fabric, you can give 2 yards each to two of your friends.
All of these categories of variables are fine when we are dealing with one variable at
a time and doing descriptive analysis. But when we are trying to connect multiple variables
or using one set of variables to make predictions about another set, we may want to classify
them with some other names. A variable that is thought to be controlled or not affected by
other variables is called an independent variable. A variable that depends on other
variables (most often other independent variables) is called a dependent variable. In the
case of a prediction problem, an independent variable is also called a predictor variable
and a dependent variable is called an outcome variable.
For instance, imagine we have data about tumor size for some patients and whether the
patients have cancer or not. This could be in a table with two columns: “tumor size” and
“cancer,” the former being a ratio type variable (we can talk about one tumor being twice
the size of another), and the latter being a categorical type variable (“yes”, “no” values).
Now imagine we want to use the “tumor size” variable to say something about the “cancer”
variable. Later in this book we will see how something like this could be done under a class
of problems called “classification.” But for now, we can think of “tumor size” as an
independent or a predictor variable and “cancer” as a dependent or an outcome variable.
Of course, data needs to be displayed. Once some data has been collected, it is useful to plot
a graph showing how many times each score occurs. This is known as a frequency
distribution. Frequency distributions come in different shapes and sizes. Therefore, it is
important to have some general descriptions for common types of distribution. The
following are some of the ways in which statisticians can present numerical findings.
Histogram. Histograms plot values of observations on the horizontal axis, with a bar
showing how many times each value occurred in the dataset. Let us take a look at an
example of how a histogram can be crafted out of a dataset. Table 3.1 represents
Productivity measured in terms of output for a group of data science professionals. Some
of them went through extensive statistics training (represented as “Y” in the Training
column) while others did not (N). The dataset also contains the work experience (denoted
as Experience) of each professional in terms of number of working hours.
Table 3.1 Productivity data for a group of data science professionals.
Productivity  Experience  Training
5    1    Y
2    0    N
10   10   Y
4    5    Y
6    5    Y
12   15   Y
5    10   Y
6    2    Y
4    4    Y
3    5    N
9    5    Y
8    10   Y
11   15   Y
13   19   Y
4    5    N
5    7    N
7    12   Y
8    15   N
12   20   Y
3    5    N
15   20   Y
8    16   N
4    9    N
6    17   Y
9    13   Y
7    6    Y
5    8    N
14   18   Y
7    17   N
6    6    Y
Figure 3.2 Histogram of the Productivity values (x-axis: Productivity; y-axis: Frequency).
Spreadsheet tools make it easy to turn such data into visualizations such as charts, plots, line graphs, maps, etc. If you are using a Google Sheet, the procedure to create the histogram is first to select the intended column, then to choose the “Insert chart” option, denoted by an icon in the toolbar, which will present you with the chart editor. In the editor, select the histogram option in the chart type dropdown and it will create a chart like that in Figure 3.2. You can further customize the chart by specifying its color, the X-axis label, the Y-axis label, etc.
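If you would rather stay in Python than a spreadsheet, a histogram like Figure 3.2 can be produced with matplotlib; the library and the bin count are our choices, not the book's.

import matplotlib.pyplot as plt

# Productivity values from Table 3.1.
productivity = [5, 2, 10, 4, 6, 12, 5, 6, 4, 3, 9, 8, 11, 13, 4, 5, 7, 8, 12, 3,
                15, 8, 4, 6, 9, 7, 5, 14, 7, 6]

plt.hist(productivity, bins=7, edgecolor="black")
plt.xlabel("Productivity")
plt.ylabel("Frequency")
plt.title("Histogram of productivity")
plt.show()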
Figure 3.3 Pie chart showing the distribution of “Training” in the Productivity data.
You can follow the same process as for a histogram if you are using a Google Sheet. The key difference is
here you have to select the pie chart as chart type from the chart editor.
We will often be working with data that are numerical and we will need to understand
how those numbers are spread. For that, we can look at the nature of that distribution. It
turns out that, if the data is normally distributed, various forms of analyses become easy and
straightforward. What is a normal distribution?
Figure: Normal distribution (a frequency plot of normally distributed values).
Often, one number can tell us enough about a distribution. This is typically a number that
points to the “center” of a distribution. In other words, we can calculate where the “center”
of a frequency distribution lies, which is also known as the central tendency. We put
“center” in quotes because it depends how it is defined. There are three measures commonly
used: mean, median, and mode.
Figure 3.6 Examples of different kurtosis in a distribution (orange dashed line represents leptokurtic, blue solid line represents
the normal distribution, and red dotted line represents platykurtic).
Mean. You have come across this before even if you have never done statistics. Mean is
commonly known as average, though they are not exactly synonyms. Mean is most often used
to measure the central tendency of continuous data as well as a discrete dataset. If there are
n number of values in a dataset and the values are x1, x2, . . ., xn, then the mean is calculated as
\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}. \qquad (3.1)
Using the above formula, the mean of the Productivity column in Table 3.1 comes out to be
7.267. Go ahead and verify this.
There is a significant drawback to using the mean as a central statistic: it is
susceptible to the influence of outliers. Also, mean is only meaningful if the data is
normally distributed, or at least close to looking like a normal distribution. Take the
distribution of household income in the USA, for instance. Figure 3.7 shows this
distribution, obtained from the US Census Bureau. Does that distribution look normal?
No. A few people make a lot of money and a lot of people make very little money.
This is a highly skewed distribution. If you take the mean or average from this data, it
will not be a good representation of income for this population. So, what can we do?
We can use another measure of central tendency: median.
Median. The median is the middle score for a dataset that has been sorted according to the
values of the data. With an even number of values, the median is calculated as the average of
the middle two data points. For example, for the Productivity dataset, the median of
Experience is 9.5. What about the US household income? The median income in the
USA, as of 2014, is $53,700. That means half the people in the USA are making $53,700
or less and the other half are on the other side of that threshold.
Mode. The mode is the most frequently occurring value in a dataset. On a histogram representation, the highest bar denotes the mode of the data. Normally, the mode is used for categorical data; for example, for the Training column in the Productivity dataset, the mode is the most frequently occurring category.
Figure 3.7 Income distribution in the United States based on the latest census data available.5
Figure 3.8 Bar chart showing the counts of the “Training” values (N and Y) in the Productivity dataset.
As depicted in Figure 3.8, in the Productivity dataset, there are 10 instances of N and 20
instances of Y values in Training. So, in this case, the mode for Training is Y. [Note: If the
number of instances of Y and N are the same, then there would be no mode for Training.]
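The three measures can be checked in a few lines of Python with the standard statistics module, using the columns of Table 3.1:

import statistics

productivity = [5, 2, 10, 4, 6, 12, 5, 6, 4, 3, 9, 8, 11, 13, 4, 5, 7, 8, 12, 3,
                15, 8, 4, 6, 9, 7, 5, 14, 7, 6]
experience = [1, 0, 10, 5, 5, 15, 10, 2, 4, 5, 5, 10, 15, 19, 5, 7, 12, 15, 20, 5,
              20, 16, 9, 17, 13, 6, 8, 18, 17, 6]
training = ["Y", "N", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "N", "Y", "Y", "Y", "Y", "N",
            "N", "Y", "N", "Y", "N", "Y", "N", "N", "Y", "Y", "Y", "N", "Y", "N", "Y"]

print(round(statistics.mean(productivity), 3))   # mean of Productivity: 7.267
print(statistics.median(experience))             # median of Experience: 9.5
print(statistics.mode(training))                 # mode of Training: Y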
We saw in section 3.3.2 that distributions come in all shapes and sizes. Simply looking at a central point (mean, median, or mode) may not help in understanding the actual shape of the distribution; for that, we also need measures of how spread out the values are, such as the interquartile range, the variance, and the standard deviation.
For example, here are the values of the Experience attribute from Table 3.1, sorted in ascending order: 0, 1, 2, 4, 5, 5, 5, 5, 5, 5, 6, 6, 7, 8, 9, 10, 10, 10, 12, 13, 15, 15, 15, 16, 17, 17, 18, 19, 20, 20. Splitting this sorted list into four equal parts gives the quartiles, which are what a boxplot displays.
Figure 3.9 Boxplot for the “Productivity” and “Experience” columns of the Productivity dataset.
As shown in the boxplot for the “Experience” attribute, after removing the top one-fourth values (between
15 and 20) and bottom one-fourth (close to zero to 5), the range of the remaining data can be calculated as
10 (from 5 to 15). Likewise, the interquartile range of the “Productivity” attribute can be calculated as 5.
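The quartiles and the interquartile range can also be computed directly. Here is a sketch with NumPy for the Experience attribute; note that different quartile conventions can give slightly different numbers.

import numpy as np

experience = [1, 0, 10, 5, 5, 15, 10, 2, 4, 5, 5, 10, 15, 19, 5, 7, 12, 15, 20, 5,
              20, 16, 9, 17, 13, 6, 8, 18, 17, 6]

q1, q3 = np.percentile(experience, [25, 75])   # first and third quartiles
print(q1, q3, q3 - q1)                         # 5.0 15.0 10.0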
Variance. The variance is a measure used to indicate how spread out the data points are. To
measure the variance, the common method is to pick a center of the distribution, typically
the mean, then measure how far each data point is from the center. If the individual
observations vary greatly from the group mean, the variance is big; and vice versa. Here,
it is important to distinguish between the variance of a population and the variance of
a sample. They have different notations, and they are computed differently. The variance of
a population is denoted by σ2; and the variance of a sample by s2.
The variance of a population is defined by the following formula:
\sigma^2 = \frac{\sum_i (X_i - \bar{X})^2}{N}, \qquad (3.2)
where $\sigma^2$ is the population variance, $\bar{X}$ is the population mean, $X_i$ is the ith element from the population, and $N$ is the number of elements in the population.
The variance of a sample is defined by a slightly different formula:
s^2 = \frac{\sum_i (x_i - \bar{x})^2}{n - 1}, \qquad (3.3)
where $s^2$ is the sample variance, $\bar{x}$ is the sample mean, $x_i$ is the ith element from the sample,
and n is the number of elements in the sample. Using this formula, the variance of the
sample is an unbiased estimate of the variance of the population.
Example: In the Productivity dataset given in Table 3.1, we find by applying the formula in
Equation 3.3 that the variance of the Productivity attribute can be calculated as 11.93 (approxi-
mated to two decimal places), and the variance of the Experience can be calculated as 36.
Figure 3.10 A snapshot from Google Sheets showing how to compute the standard deviation.
Standard Deviation. There is one issue with the variance as a measure. It gives us the
measure of spread in units squared. So, for example, if we measure the variance of age
(measured in years) of all the students in a class, the measure we will get will be in years2.
However, practically, it would make more sense if we got the measure in years (not years
squared). For this reason, we often take the square root of the variance, which ensures the
measure of average spread is in the same units as the original measure. This measure is
known as the standard deviation (see Figure 3.10).
The formula to compute the standard deviation of a sample is
s = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n - 1}}. \qquad (3.4)
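A quick check of Equations 3.3 and 3.4 on the Table 3.1 data, using Python's statistics module (which applies the sample formulas with n − 1):

import statistics

productivity = [5, 2, 10, 4, 6, 12, 5, 6, 4, 3, 9, 8, 11, 13, 4, 5, 7, 8, 12, 3,
                15, 8, 4, 6, 9, 7, 5, 14, 7, 6]
experience = [1, 0, 10, 5, 5, 15, 10, 2, 4, 5, 5, 10, 15, 19, 5, 7, 12, 15, 20, 5,
              20, 16, 9, 17, 13, 6, 8, 18, 17, 6]

print(round(statistics.variance(productivity), 2))   # sample variance: 11.93
print(statistics.variance(experience))               # sample variance: 36
print(round(statistics.stdev(productivity), 2))      # standard deviation: about 3.45
print(statistics.stdev(experience))                  # standard deviation: 6.0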
Diagnostic analytics are used for discovery, or to determine why something happened.
Sometimes this type of analytics, when done hands-on with a small dataset, is also known as
causal analysis, since it involves at least one cause (usually more than one) and one effect.
This allows a look at past performance to determine what happened and why. The result
of the analysis is often referred to as an analytic dashboard.
For example, for a social media marketing campaign, you can use descriptive analytics to
assess the number of posts, mentions, followers, fans, page views, reviews, or pins, etc.
There can be thousands of online mentions that can be distilled into a single view to see
what worked and what did not work in your past campaigns.
There are various types of techniques available for diagnostic or causal analytics. Among
them, one of the most frequently used is correlation.
3.4.1 Correlations
Correlation is a statistical analysis that is used to measure and describe the strength and
direction of the relationship between two variables. Strength indicates how closely two
variables are related to each other, and direction indicates how one variable would change
its value as the value of the other variable changes.
Correlation is a simple statistical measure that examines how two variables change
together over time. Take, for example, “umbrella” and “rain.” If someone who grew
up in a place where it never rained saw rain for the first time, this person would
observe that, whenever it rains, people use umbrellas. They may also notice that, on
dry days, folks do not carry umbrellas. By definition, “rain” and “umbrella” are said
to be correlated! More specifically, this relationship is strong and positive. Think
about this for a second.
An important statistic, the Pearson’s r correlation, is widely used to measure the degree
of the relationship between linear related variables. When examining the stock market, for
example, the Pearson’s r correlation can measure the degree to which two commodities are
related. The following formula is used to calculate the Pearson’s r correlation:
r = \frac{N \sum xy - \sum x \sum y}{\sqrt{\left[ N \sum x^2 - \left( \sum x \right)^2 \right] \left[ N \sum y^2 - \left( \sum y \right)^2 \right]}}, \qquad (3.5)
where $N$ is the number of pairs of values, $x$ and $y$ are the two variables, $\sum xy$ is the sum of the products of the paired values, $\sum x$ and $\sum y$ are the sums of the $x$ and $y$ values, and $\sum x^2$ and $\sum y^2$ are the sums of the squared $x$ and $y$ values.
For example, consider the following small sample of height and weight measurements for ten individuals:
Height  Weight
64.5    118
73.3    143
68.8    172
65      147
69      146
64.5    138
66      175
66.3    134
68.8    172
64.5    118
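Treating the two columns above as height and weight (an assumption on our part), Pearson's r can be computed in a single call with NumPy:

import numpy as np

height = [64.5, 73.3, 68.8, 65, 69, 64.5, 66, 66.3, 68.8, 64.5]
weight = [118, 143, 172, 147, 146, 138, 175, 134, 172, 118]

r = np.corrcoef(height, weight)[0, 1]   # Pearson's r for the two variables
print(round(r, 3))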
As you may have guessed, predictive analytics has its roots in our ability to predict what
might happen. These analytics are about understanding the future using the data and the
trends we have seen in the past, as well as emerging new contexts and processes. An
example is trying to predict how people will spend their tax refunds based on how
consumers normally behave around a given time of the year (past data and trends), and
how a new tax policy (new context) may affect people’s refunds.
Predictive analytics provides companies with actionable insights based on data. Such
information includes estimates about the likelihood of a future outcome. It is important to
remember that no statistical algorithm can “predict” the future with 100% certainty because
the foundation of predictive analytics is based on probabilities. Companies use these
statistics to forecast what might happen. Software commonly used by data science professionals for predictive analytics includes SAS predictive analytics, IBM predictive analytics, and RapidMiner, among others.
As Figure 3.11 suggests, predictive analytics is done in stages.
1. First, once the data collection is complete, it needs to go through the process of cleaning
(refer to Chapter 2 on data).
2. Cleaned data can help us obtain hindsight in relationships between different variables.
Plotting the data (e.g., on a scatterplot) is a good place to look for hindsight.
3. Next, we need to confirm the existence of such relationships in the data. This is where
regression comes into play. From the regression equation, we can confirm the pattern of
distribution inside the data. In other words, we obtain insight from hindsight.
4. Finally, based on the identified patterns, or insight, we can predict the future, i.e.,
foresight.
The following example illustrates a use for predictive analytics.8 Let us assume that
Salesforce kept campaign data for the last eight quarters. This data comprises total sales
generated by newspaper, TV, and online ad campaigns and associated expenditures, as
provided in Table 3.4.
With this data, we can predict the sales based on the expenditures of ad campaigns in
different media for Salesforce.
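Table 3.4 itself is not reproduced here, so the following minimal Python sketch uses made-up quarterly figures (both the numbers and the single-predictor setup are assumptions for illustration) to show the basic flow: fit a simple linear model of sales against ad expenditure from past quarters, then use it to predict sales for a planned expenditure.

import numpy as np

# Hypothetical data for eight past quarters (all figures made up, in $1000s)
ad_spend = np.array([20, 25, 22, 30, 28, 35, 33, 40])
sales = np.array([110, 130, 125, 160, 150, 185, 178, 210])

# Insight: fit a straight line, sales = b0 + b1 * ad_spend
b1, b0 = np.polyfit(ad_spend, sales, 1)

# Foresight: predict sales for a planned expenditure next quarter
planned_spend = 45
predicted_sales = b0 + b1 * planned_spend
print(round(b0, 1), round(b1, 1), round(predicted_sales, 1))

In practice, one would use all three media (newspaper, TV, and online) as predictors in a multiple regression, but the sequence of cleaning, plotting, fitting, and predicting remains the same.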
Like data analytics, predictive analytics has a number of common applications. For
example, many people turn to predictive analytics to produce their credit scores.
Financial services use such numbers to determine the probability that a customer
will make their credit payments on time. FICO, in particular, has extensively used
predictive analytics to develop the methodology to calculate individual FICO scores.9
Customer relationship management (CRM) is another common area for predic-
tive analytics. Here, the process contributes to objectives such as marketing campaigns,
sales, and customer service. Predictive analytics applications are also used in the healthcare
field. They can determine which patients are at risk for developing certain conditions such
as diabetes, asthma, and other chronic or serious illnesses.
Prescriptive analytics10 is the area of business analytics dedicated to finding the best
course of action for a given situation. This may start by first analyzing the situation (using
descriptive analysis), but then moves toward finding connections among various para-
meters/variables, and their relation to each other, in order to address a specific problem, most likely one of prediction.
A process-intensive task, the prescriptive approach analyzes potential decisions, the
interactions between decisions, the influences that bear upon these decisions, and the
bearing all of this has on an outcome to ultimately prescribe an optimal course of action
in real time.11
Prescriptive analytics can also suggest options for taking advantage of a future opportu-
nity or mitigate a future risk and illustrate the implications of each. In practice, prescriptive
analytics can continually and automatically process new data to improve the accuracy of
predictions and provide advantageous decision options.
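As a toy illustration of this idea (a sketch built entirely on made-up numbers, not a method from this chapter), suppose we already have a predictive model of sales as a function of TV and online ad spend; a simple prescriptive step is to search over the possible budget splits and recommend the one with the highest predicted sales.

def predicted_sales(tv_spend, online_spend):
    # Hypothetical fitted model, with diminishing returns on TV spend
    return 50 + 4.0 * online_spend + 6.0 * tv_spend - 0.05 * tv_spend ** 2

total_budget = 60  # in $1000s
best_split, best_sales = None, float("-inf")
for tv in range(0, total_budget + 1):
    online = total_budget - tv
    s = predicted_sales(tv, online)
    if s > best_sales:
        best_split, best_sales = (tv, online), s

print(best_split, round(best_sales, 1))

Real prescriptive systems evaluate far more decisions and constraints, and refresh the recommendation as new data arrives, but the underlying loop of scoring candidate actions against a predictive model is the same.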
Often when working with data, we may not have a clear understanding of the problem or the
situation. And yet, we may be called on to provide some insights. In other words, we are
asked to provide an answer without knowing the question! This is where we go for an
exploration.
Exploratory analysis is an approach to analyzing datasets to find previously
unknown relationships. Often such analysis involves using various data visualization
approaches. Yes, sometimes seeing is believing! But more important, when we lack
a clear question or a hypothesis, plotting the data in different forms could provide us
with some clues regarding what we may find or want to find in the data. Such insights
can then be useful for defining future studies/questions, leading to other forms of
analysis.
Usually not the definitive answer to the question at hand but only the start, exploratory
analysis should not be used alone for generalizing and/or making predictions from
the data.
Exploratory data analysis is an approach that postpones the usual assumptions about
what kind of model the data follows with the more direct approach of allowing the data
itself to reveal its underlying structure in the form of a model. Thus, exploratory
analysis is not a mere collection of techniques; rather, it offers a philosophy as to
how to dissect a dataset; what to look for; how to look; and how to interpret the
outcomes.
As exploratory analysis consists of a range of techniques, its applications are varied as well.
However, the most common application is looking for patterns in the data, such as finding
groups of similar genes from a collection of samples.15
Let us consider the US census data available from the US census website.16 This data has
dozens of variables; we have already seen some of them in Figures 3.1 and 3.7. If you are
looking for something specific (e.g., which State has the highest population), you could go
with descriptive analysis. If you are trying to predict something (e.g., which city will have
the lowest influx of immigrant population), you could use prescriptive or predictive
analysis. But if someone gives you this data and asks you to find interesting insights,
then what do you do? You could still do descriptive or prescriptive analysis, but given that
there are lots of variables with massive amounts of data, it may be futile to do all possible
combinations of those variables. So, you need to go exploring. That could mean a number of
things. Remember, exploratory analysis is about the methodology or philosophy of doing
the analysis, rather than a specific technique. Here, for instance, you could take a small
sample (data and/or variables) from the entire dataset and plot some of the variables (bar
chart, scatterplot). Perhaps you see something interesting. You could go ahead and organize
some of the data points along one or two dimensions (variables) to see if you find any
patterns. The list goes on. We are not going to see these approaches/techniques right here.
Instead, you will encounter them (e.g., clustering, visualization, classification, etc.) in
various parts of this book.
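For instance, a first exploratory pass over a census-like table might look something like the following minimal pandas sketch; the dataframe and column names here are made up for illustration and are not the actual census fields.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical census-like data (made up for illustration)
df = pd.DataFrame({
    "state": ["WA", "NJ", "TX", "NY", "OH", "WA", "TX", "NY"],
    "population": [120, 95, 300, 410, 150, 130, 280, 395],
    "median_income": [70, 68, 55, 72, 50, 71, 57, 73],
})

# Look at a small sample and the summary statistics
print(df.sample(4, random_state=0))
print(df.describe())

# Plot a couple of variables to look for patterns
df.plot.scatter(x="median_income", y="population")
plt.show()

# Organize the data along one dimension to see if groups emerge
print(df.groupby("state")["population"].mean().sort_values(ascending=False))

None of these steps answers a specific question; they simply surface patterns that may suggest which descriptive, predictive, or other analyses to run next.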
Mechanistic analysis involves understanding the exact changes in variables that lead to
changes in other variables for individual objects. For instance, we may want to know how
the number of free doughnuts per employee per day affects employee productivity. Perhaps
by giving them one extra doughnut we gain a 5% productivity boost, but two extra dough-
nuts could end up making them lazy (and diabetic)!
More seriously, though, think about studying the effects of carbon emissions on bringing
about the Earth’s climate change. Here, we are interested in seeing how the increased
amount of CO2 in the atmosphere is causing the overall temperature to change. We now
know that, in the last 150 years, the CO2 levels have gone from 280 parts per million to 400
parts per million.17 And in that time, the Earth has heated up by 1.53 degrees Fahrenheit
(0.85 degrees Celsius).18 This is a clear sign of climate change, something that we all need
to be concerned about, but I will leave it there for now. What I want to bring you back to
thinking about is the kind of analysis we presented here – that of studying a relationship
between two variables. Such relationships are often explored using regression.
3.8.1 Regression
Regression analysis models the relationship between one or more predictor variables and an outcome variable. Linear regression, the form most commonly used in data analysis, assumes this relationship to be linear. In other words, the relationship
of the predictor variable(s) and outcome variable can be expressed by a straight line. If the
predictor variable is represented by x, and the outcome variable is represented by y, then the
relationship can be expressed by the equation
y = \beta_0 + \beta_1 x,    (3.6)
where β1 represents the slope (the coefficient of x), and β0 is the intercept (also referred to as the constant or error term) of the equation.
What linear regression does is estimate the values of β0 and β1 from a set of observed data
points, where the values of x, and associated values of y, are provided. So, when a new or
previously unobserved data point comes where the value of y is unknown, it can fit the
values of x, β0, and β1 into the above equation to predict the value of y.
From statistical analysis, it has been shown that the slope of the regression β1 can be
expressed by the following equation:
\beta_1 = r \frac{sd_y}{sd_x},    (3.7)
where r is the Pearson’s correlation coefficient, and sd represents the standard deviation of
the respective variable as calculated from the observed set of data points. Next, the value of
the error term can be calculated from the following formula:
\beta_0 = \bar{y} - \beta_1 \bar{x},    (3.8)
where ȳ and x̄ represent the means of the y and x variables, respectively. (More on these
equations can be found in later chapters.) Once you have these values calculated, it is
possible to estimate the value of y from the value of x.
As an example, consider the following dataset of 10 participants, where x is each participant's positive attitude measured before taking an examination and y is the score obtained in that examination:
Participant   Attitude (x)   Score (y)
1             65             129
2             67             126
3             68             143
4             70             156
5             71             161
6             72             158
7             72             168
8             73             166
9             73             182
10            75             201
Here attitude is going to be the predictor variable, and what regression would be able to do is to estimate
the value of score from attitude. As explained above, first let us calculate the value of the slope, β1.
From the data, Pearson’s correlation coefficient r can be calculated as 0.94. The standard deviations of
x (attitude) and y (score) are 3.10 and 22.80, respectively. Therefore, the value of the slope is
\beta_1 = 0.94 \times \frac{22.80}{3.10} = 6.91.
Next, the calculation of the error term β0 requires the mean values of x and y. From the given dataset, ȳ and x̄ are derived as 159 and 70.6, respectively. Therefore, the value of β0 will be
\beta_0 = 159 - (6.91 \times 70.6) = -328.85.
Now, say you have a new participant whose positive attitude before taking the examination is
measured at 78. His score in the examination can be estimated at 210.13:
y = -328.85 + (6.91 \times 78) = 210.13.
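The same worked example can be verified with a short Python sketch (an illustration, not the book's code). Because the hand calculation above uses rounded values of r and the standard deviations, the full-precision results below differ slightly (for example, the slope comes out near 6.93 rather than 6.91).

import math

attitude = [65, 67, 68, 70, 71, 72, 72, 73, 73, 75]         # x, the predictor
score = [129, 126, 143, 156, 161, 158, 168, 166, 182, 201]  # y, the outcome

n = len(attitude)
mean_x = sum(attitude) / n
mean_y = sum(score) / n

# Sample standard deviations, as in Equation (3.4)
sd_x = math.sqrt(sum((v - mean_x) ** 2 for v in attitude) / (n - 1))
sd_y = math.sqrt(sum((v - mean_y) ** 2 for v in score) / (n - 1))

# Pearson's r via the sample covariance
cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(attitude, score)) / (n - 1)
r = cov_xy / (sd_x * sd_y)

# Equations (3.7) and (3.8), then a prediction from Equation (3.6) for attitude = 78
b1 = r * sd_y / sd_x
b0 = mean_y - b1 * mean_x
print(round(r, 2), round(b1, 2), round(b0, 2), round(b0 + b1 * 78, 2))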
Regression analysis has a number of salient applications to data science and other statistical
fields. In the business realm, for example, powerful linear regression can be used to generate
insights on consumer behavior, which helps professionals understand business and factors
related to profitability. It can also help a corporation understand how sensitive its sales are to
advertising expenditures, or it can examine how a stock price is affected by changes in interest
rates. Regression analysis may even be used to look to the future; an equation may forecast
demand for a company’s products or predict stock behaviors.19
Summary
In this chapter, we reviewed some of the techniques and approaches used for data science. As
should be evident, a lot of this revolves around statistics. And there is no way we could even
introduce all of the statistics in one chapter. Therefore, this chapter focused on providing broader
strokes of what these approaches and analyses are, with a few concrete examples and
applications. As we proceed, many of these broad strokes will become more precise. Another
reason for skimping on the details here is that we have not yet introduced (or assumed knowledge of) any
specific programming tool. You will soon see that, while it is possible to have a theoretical
understanding of statistical analysis, for a hands-on data science approach it makes more sense to
actually do stuff and gain an understanding of such analysis. And so, in the next part of the book,
we are going to cover a bunch of tools and, while doing so, we will come back to most of these
techniques. Then, we will have a chance to really understand different kinds of analysis and
analytics as we apply them to solve various data problems.
Almost all the real-life data science-related problems use more than one category of the
analysis techniques described above. The number and types of categories used for analysis
can be an indicator of the quality of the analysis. For example, in social science-related
problems:
• A weak analysis will only tell a story or describe the topic.
• A good analysis will go beyond a mere description by engaging in several of the types of
analysis listed above, but it will be weak on sociological analysis, the future orientation,
and the development of social policy.
• An excellent analysis will engage in many of the types of analyses we have discussed and
will demonstrate an aggressive sociological analysis which develops a clear future orienta-
tion and offers social policy changes to address problems associated with the topic.
There is no clear agreeable-to-all classification scheme available in the literature to
categorize all the analysis techniques that are used by data science professionals.
However, based on their application to various stages of data analysis, we categorized
analysis techniques into certain classes. Each of these categories and their application were
described – some at length and some less so – with an understanding that we will revisit
them later when we are addressing various data problems.
I hope that with this chapter you can see that familiarity with various statistical measures and
techniques is an integral part of being a data scientist. Armed with this arsenal of tools, you can
take your skills and make important discoveries for a number of people in a number of areas.
The stop-and-frisk practice that the New York Police Department used from 2004 to 2012 to temporarily detain, question, and search individuals on the street
whom they deemed suspicious turned out to have been a gross miscalculation based on human bias. The
actual data revealed that 88% of those stopped were not and did not become offenders.
Moral of the story? Do not trust the data or the technique blindly; they may be perpetuating the
inherent biases and prejudices we already have.
Key Terms
• Data analysis: This is a process that refers to hands-on data exploration and evaluation.
Analysis looks backwards, providing marketers with a historical view of what has
happened. Analytics, on the other hand, models the future or predicts a result.
• Data analytics: This defines the science behind the analysis. The science
means understanding the cognitive processes an analyst uses to understand pro-
blems and explore data in meaningful ways. It is used to model the future or
predict a result.
• Nominal variable: The variable type is nominal when there is no natural order between
the possible values that it stores, for example, colors.
• Ordinal variable: If the possible values of a data type are from an ordered set, then the
type is ordinal. For example, grades in a mark sheet.
• Interval variable: A kind of variable that provides numerical storage and allows us to do
additions and subtractions on them but not multiplications or divisions. Example:
temperature.
• Ratio variable: A kind of variable that provides numerical storage and allows us to do
additions and subtractions, as well as multiplications or divisions, on them. Example: weight.
• Independent /predictor variable: A variable that is thought to be controlled or not
affected by other variables.
• Dependent /outcome /response variable: A variable that depends on other variables
(most often other independent variables).
• Mean: Mean is the average of continuous data found by the summation of the given data
and dividing by the number of data entries.
• Median: Median is the middle data point in any ordinal dataset.
• Mode: Mode of a dataset is the value that occurs most frequently.
• Normal distribution: A normal distribution is a type of distribution of data points in
which, when ordered, most values cluster in the middle of the range and the rest of the
values symmetrically taper off toward both extremes.
• Correlation: This indicates how closely two variables are related and ranges from −1
(negatively related) to +1 (positively related). A correlation of 0 indicates no relation
between the variables.
Conceptual Questions
Hands-On Problems
Problem 3.1
Imagine 10 years down the line, in a dark and gloomy world, your data science career has failed
to take off. Instead, you have settled for the much less glamorous job of a community librarian.
Now, to simplify the logistics, the library has decided to limit all future procurement of books
either to hardback or to softback copies. The library also plans to convert all the existing books
to one cover type later. Fortunately, to help you decide, the library has gathered a small sample
of data that gives measurements on the volume, area (only the cover of the book), and weight of
15 existing books, some of which are softback (“Pb”) and the rest are hardback (“Hb”) copies.
The dataset is shown in the table and can be obtained from OA 3.6.
The dataset has 15 instances of the following four attributes:
• Volume: Book volumes in cubic centimeters
• Area: Total area of the book in square centimeters
• Weight: Book weights in grams
• Cover: A factor with two levels: Hb for hardback and Pb for paperback
Now use this dataset to decide which type of book you want to procure in the future. Here
is how you are going to do it. Determine:
a. The median of the book covers.
b. The mean of the book weights.
c. The variance in book volumes.
Use the above values to decide which book cover types the library should opt for in the future.
Problem 3.2
Following is a small dataset of list price vs. best price for a new GMC pickup truck in
$1000s. You can obtain it from OA 3.7. The x represents the list price, whereas the
y represents the best price values.
x y
12.4 11.2
14.3 12.5
14.5 12.7
14.9 13.1
16.1 14.1
16.9 14.8
16.5 14.4
15.4 13.4
17 14.9
17.9 15.6
18.8 16.4
20.3 17.7
22.4 19.6
19.4 16.9
15.5 14
16.7 14.6
17.3 15.1
18.4 16.1
19.2 16.8
17.4 15.2
19.5 17
19.7 17.2
21.2 18.6
Problem 3.3
The following is a fictional dataset on the number of visitors to Asbury Park, NJ (in hundreds a day), the number of tickets issued for parking violations, and the average temperature (in degrees Celsius) for the same day.
Number of visitors (in hundreds a day)   Number of parking tickets   Average temperature (°C)
15.8   8    35
12.3   6    38
19.5   9    32
8.9    4    26
11.4   6    31
17.6   9    36
16.5   10   38
14.7   3    30
3.9    1    21
14.6   9    34
10.0   7    36
10.3   6    32
7.4    2    25
13.4   6    37
11.5   7    34
There are plenty of good (and some mediocre) books on statistics. If you want to develop
your techniques in data science, I suggest you pick up a good statistics book at the level you
need. A few such books are listed below.
• Salkind, N. (2016). Statistics for People Who (Think They) Hate Statistics. Sage.
• Krathwohl, D. R. (2009). Methods of Educational and Social Science Research: The Logic
of Methods. Waveland Press.
• Field, A., Miles, J., & Field, Z. (2012). Discovering Statistics Using R. Sage.
• A video by IBM describing the progression from descriptive analytics, through predictive
analytics to prescriptive analytics: https://www.youtube.com/watch?v=VtETirgVn9c
Notes
1. Analysis vs. analytics: What’s the difference? Blog by Connie Hill: http://www.1to1media.com
/data-analytics/analysis-vs-analytics-whats-difference
2. KDnuggets™: Interview: David Kasik, Boeing, on Data analysis vs. data analytics: http://www
.kdnuggets.com/2015/02/interview-david-kasik-boeing-data-analytics.html
3. Population map showing US census data: https://www.census.gov/2010census/popmap/
4. Of course, we have not covered these yet. But have patience; we are getting there.
5. Income distribution from US Census: https://www.census.gov/library/visualizations/2015/demo/
distribution-of-household-income-2014.html
6. Pearson correlation: http://www.statisticssolutions.com/correlation-pearson-kendall-spearman/
7. Process of predictive analytics: http://www.amadeus.com/blog/07/04/5-examples-predictive-
analytics-travel-industry/
8. Use for predictive analytics: https://www.r-bloggers.com/predicting-marketing-campaign-
with-r/
9. Understanding predictive analytics: http://www.fico.com/en/predictive-analytics
10. A company called Ayata holds the trademark for the term “Prescriptive Analytics”. (Ayata is the
Sanskrit word for future.)
11. Process of prescriptive analytics: http://searchcio.techtarget.com/definition/Prescriptive-
analytics
12. Game theory: http://whatis.techtarget.com/definition/game-theory
13. Use of prescriptive analytics: http://www.ingrammicroadvisor.com/data-center/four-types-of-
big-data-analytics-and-examples-of-their-use
14. Gartner predicts predictive analytics as next big business trend: http://www.enterpriseappstoday
.com/business-intelligence/gartner-taps-predictive-analytics-as-next-big-business-intelligence-
trend.html
15. Six types of analyses: https://datascientistinsights.com/2013/01/29/six-types-of-analyses-every-
data-scientist-should-know/
16. Census data from US government: https://www.census.gov/data.html.
17. Climate change causes: https://climate.nasa.gov/causes/
18. Global temperature in the last 100 years: https://www2.ucar.edu/climate/faq/how-much-has-
global-temperature-risen-last-100-years
19. How businesses use regression analysis statistics: http://www.dummies.com/education/math/
business-statistics/how-businesses-use-regression-analysis-statistics/
PART II
4 UNIX
4.1 Introduction
While there are many powerful programming languages that one could use for solving data
science problems, people forget that one of the most powerful and simplest tools to use is
right under their noses. And that is UNIX. The name may generate images of old-time
hackers hacking away on monochrome terminals. Or it may evoke the idea of UNIX as
a mainframe system, taking up lots of space in some warehouse. But, while UNIX is indeed
one of the oldest computing platforms, it is quite sophisticated and supremely capable of
handling almost any kind of computational and data problem. In fact, in many respects,
UNIX is leaps and bounds ahead of other operating systems; it can do things of which others
can only dream!
Alas, when people think of tools for data science or data analytics, UNIX does not come
to mind. Most books on these topics do not cover UNIX. But I think this is a missed
opportunity, as UNIX allows one to do many data science tasks, including data cleaning,
filtering, organizing (sorting), and even visualization, often using no more than its built-in
commands and utilities. That makes it appealing to people who have not mastered
a programming language or a statistics tool.
So, we are not going to pass up this wonderful opportunity. In this chapter, we will see
some basics of working in the UNIX environment. This involves running commands, piping
and redirecting outputs, and editing files. We will also see several shortcuts that make it easier
and faster to work on UNIX. Ultimately, of course, our goal is not mastering UNIX, but
solving data-driven problems, and so we will see how UNIX is useful in solving many
problems without writing any code.
Another way to get access to a UNIX server is through a cloud service such as Amazon Web Services (AWS) or Google Cloud.
See the FYI box below and Appendix F for more details on them.
No matter what UNIX environment you end up using (whether you have a Linux or Mac
machine, install Cygwin on a Windows PC, or connect to a UNIX server remotely), all that
we are trying to do here (running commands, processing files, etc.) should work just the same.
If you have access to a UNIX, Linux, or Mac computer, you are in luck. Because then all
you need are a couple of freely available tools on your machine. Here, that is what we are
going to do. The two things you need are a way to connect to the server, and a way to
transfer files between your machine and that server.
4.3.1 SSH
Since we are going with the assumption that you are first connecting to a UNIX server
before doing any of the UNIX operations, we need to learn how to connect to such a server.
The plain vanilla method is the Telnet service but, since Telnet is insecure, many UNIX servers do not support that kind of connection.
Instead, we will use SSH, which stands for “secure shell.” This essentially refers to two
parts: the server part and the client part. The former is something we are not going to worry
about because, if we have access to a UNIX server, that server will have the necessary
server part of SSH. What we need to figure out is the client part – a tool or a utility that we
will run on our computers. To connect to a UNIX server using SSH, you need to be running
some kind of shell (a program that provides you a command-line interface to your operating
system) with SSH client service on your own machine.
Again, if you have a Linux or a Mac, all you have to do is open your terminal or console.
On the other hand, if you are using a PC, you need software that has SSH client service.
A couple of (free, of course) software options are WinSCP5 and PuTTY.6 (You can find
instructions for using them at WinSCP and Using PuTTY in Windows,7 respectively.)
Whichever option you choose, you will need three pieces of information: hostname,
username, and password. Figure 4.4 shows what it looks like with PuTTY.
Hostname is the full name or IP (Internet Protocol) address of the server. The name could
be something like example.organization.com and the IP address could be something like
192.168.2.1. The username and password are those related to your account on that server.
You will have to contact the administrator of the server you are hoping to connect to in order
to obtain these pieces of information.
If you are already on a UNIX system like a Linux or a Mac, run (type and hit “enter”) the
following command in your terminal.8
ssh username@hostname
If you are on a PC and are using one of the software options mentioned earlier (PuTTY,
WinSCP), open that tool and enter information about host or server name (or IP address),
username, and password in appropriate boxes and hit “Connect” (or equivalent).
Once successfully connected, you should get a command prompt. You are now (vir-
tually) on the server.
Refer to the screenshot in Figure 4.5 for an example of what you may see. Note that when
you get a prompt to enter your password, you are not going to see anything you type – not
even “*”. So just type your password and hit “enter.”
4.3.2 FTP/SCP/SFTP
Another important reason for connecting to the server is to transfer files between the client
(your machine) and the server. Again, we have two options – non-secure FTP (File Transfer
Protocol), or secure SCP (Secure Copy) or SFTP (Secure FTP).
If you are on a Linux or a Mac, you can use any of these utilities through your command
line or console/shell/terminal. But, unless you are comfortable with UNIX paths and
systems, you could become lost and confused.
So, we will use more intuitive file transfer packages. FileZilla happens to be a good
one (free and easy to use) that is available for all platforms, but I am sure you can
search online and find one that you prefer. In the end, they all offer similar
functionalities.
Whatever tool you have, you will once again need those three pieces of information:
hostname, username, and password. Refer to the screenshot in Figure 4.6 from the
FileZilla project site to give you an idea of what you might see. Here, the connection
information that you need to enter are at the top (“Host,” “Username,” “Password,” and
“Port”). Leave “Port” empty unless you have specific instructions about it from your
system administrator.
Figure 4.7 offers another example – this time from a different FTP tool, but as you can
see, you need to enter the same kind of information: the server name (hostname), your
username, and password.
Go ahead and enter those details and connect to the server. Once connected, you
will be in the home directory on the server. Most file transfer software applications
provide a two-pane view, where one pane shows your local machine and the
other shows the server. Transferring files then becomes an easy drag-and-drop
operation.
In this section, we will see a few common commands. Try out as many of them as possible
and be aware of the rest. They could save you a lot of time and trouble. I will assume that
you are connected to a UNIX server using SSH. Alternatively, you can have a Cygwin
environment installed (on a Windows PC), or you could work on a Linux or a Mac machine.
If you are using either a Linux or a Mac (not connected to a server), go ahead and open
a terminal or a console.
Let us look at some of the basic file- and directory-related commands you can use with
UNIX. Each of these is only briefly explained here and perhaps some of them may not make
much sense until you really need them. But go ahead and try as much as you can for now and
make a note of the others. In the examples below, whenever you see “filename”, you should
enter the actual filename such as “test.txt”.
1. pwd: Present working directory. By default, when you log in to a UNIX server or open
a terminal on your machine, you will be in your home directory. From here, you can
move around to other directories using the “cd” command (listed below). If you ever
get lost, or are not sure where you are, just enter “pwd”. The system will tell you the full
path of where you are.
2. rm: Remove or delete a file (e.g., “rm filename”). Be careful. Deleting a file may get rid
of that file permanently. So, if you are used to having a “Recycle Bin” or “Trash” icon
on your machine from where you can recover deleted files, you might be in for an
unpleasant surprise!
3. rmdir: Remove or delete a directory (e.g., “rmdir myfolder”). You need to make sure
that the directory/folder you are trying to delete is empty. Otherwise the system will not
let you delete it. Alternatively, you can say “rm -r myfolder”, where “-r” recursively deletes the directory along with its contents; add “-f” to force the deletion.
4. cd: Change directory (e.g., “cd data” to move into “data” directory). Simply entering
“cd” will bring you to your home directory. Entering a space and two dots or full points
after “cd” (i.e., “cd ..”) will take you up a level.
5. ls: List the files in the current directory. If you want more details of the files, use the “-l”
option (e.g., “ls -l”).
6. du: Disk usage. To find out how much space a directory is taking up, you can issue
a “du” command, which will display space information in bytes. To see things in MB
and GB, use the “-h” option (e.g., “du -h”).
7. wc: Reports file size in lines, words, and characters (e.g., “wc myfile.txt”).
8. cat: Types the file content on the terminal (e.g., “cat myfile.txt”). Be careful about which
file type you use it with. If it is a binary file (something that contains other than simple
text), you may not only get weird characters filling up your screen, but you may even get
weird sounds and other things that start happening, including freezing up the machine.
9. more: To see more of a file. You can say something like “more filename”, which is like
“cat filename”, but it pauses after displaying a screenful of the file. You can hit Enter or
the space-bar to continue displaying the file. Once again, use this only with text files.
10. head: Print the first few lines of a file (e.g., “head filename”). If you want to see the top
three lines, you can use “head -3 filename”. Should I repeat that this needs to be tried
only with text files?!
Figure 4.8 shows some of these commands running in a terminal window. Note that
“keystone:data chirags$” is my command prompt. For you, it is going to be something
different.
While most operating systems hide what is going on behind the scene of a nice-looking
interface, UNIX gives unprecedented access to not only viewing those background pro-
cesses, but also manipulating them. Here we will list some of the basic commands you may
find useful for understanding and interacting with various processes.
1. Ctrl+c: Stop an ongoing process. If you are ever stuck in running a process or
a command that does not give your command prompt back, this is your best bet. You
may have to press Ctrl+c multiple times.
2. Ctrl+d: Logout. Enter this on a command prompt and you will be kicked out of your
session. This may even close your console window.
3. ps: Lists the processes that run through the current terminal.
4. ps aux: Lists the processes for everyone on the machine. This could be spooky since in
a multiuser environment (multiple users logging into the same server) one could
essentially see what others are doing! Of course, that also means someone else could
spy on you as well.
5. ps aux | grep daffy: List of processes for user “daffy.” Since there are likely to be lots of
processes going on in a server environment, and most of them are of no relevance to you,
you can use this combination to filter out only those processes that are running under
your username. We will soon revisit the “|” (pipe) character you see here.
6. top: Displays a list of top processes in real time (Figure 4.9). This could be useful to see
which processes are consuming considerable resources at the time. Note the PID column
at the left. This is where each process is reported with a unique process ID. You will need
this in case you want to terminate that process. Press “q” to get out of this display.
7. kill: To kill or terminate a process. Usage: “kill -9 1234”. Here, “-9” indicates forced kill
and “1234” is the process ID, which can be obtained from the second column of “ps” or
the first column of “top” command outputs.
1. man: Help (e.g., “man pwd”). Ever wanted to know more about using a command? Just
use “man” (refers to manual pages). You may be surprised (and overwhelmed) to learn
about all the possibilities that go with a command.
2. who: Find out who is logged in to the server. Yes, spooky!
4.4.4 Shortcuts
Those who are intimidated by UNIX probably do not know about the fantastic shortcuts it offers. Here are some of them to make your life easier.
1. Auto-complete: Any time you are typing a command, a filename, or a path on the
terminal, type part of it and hit the “tab” key. The system will either complete the rest of
it or show you options.
2. Recall: UNIX saves a history of the commands you used. Simply pressing the up
and down arrow keys on the terminal will bring them up in the order they were
executed.
3. Search and recall: Do not want to go through pressing the up arrow so many times to
find that command? Hit Ctrl+r and start typing part of that command. The system will
search through your command history. When you see what you were looking for, simply
hit enter (and you may have to hit enter again).
4. Auto-path: Tired of typing the full path to some program or file you use frequently? Add
it to your path by following these steps.
1. Go to your home directory on the server.
2. Open .bash_profile in an editor (notice the dot or full point before “bash_profile”).
3. Let us assume that the program you want to have direct access to is at /home/user/
daffy/MyProgram. Replace the line PATH=$PATH:$HOME/bin with the following
line: PATH=$PATH:$HOME/bin:/home/user/daffy/MyProgram
4. Save the file and exit the editor.
5. On the command line, run “. .bash_profile” (dot space dot bash_profile). This will set
the path environment for your current session. For all future sessions, it will be set the
moment you log in.
Now, whenever you need to run /home/user/daffy/MyProgram, you can simply type
“MyProgram”.
And that is about it in terms of basic commands for UNIX. I know this could be a lot if
you have never used UNIX/Linux before, but keep in mind that there is no need to
memorize any of this. Instead, try practicing some of them and come back to others (or
the same ones) later. In the end, nothing works better than practice. So, do not feel bad
if these things do not sound intuitive – trust me, not all of them are! – or accessible enough
at first.
It may also help to practice these commands with a goal in mind. Look for a section later
in this chapter where we see how these commands, processes, and their combinations can
be used for solving data problems. But for now, let us move on to learn how to edit text files
in a UNIX environment.
One of the most basic and powerful editors on UNIX is called vi, short for “visual display.”
I would not recommend it until you are comfortable with UNIX. But sometimes you may
not have a choice – vi is omnipresent on UNIX. So even if some other editor may not be
available, chances are, on a UNIX system, you will have vi.
To edit or create a file using vi, type:
vi filename
at the command prompt. This will open up that file in the vi editor. If the file already exists,
vi will load its content and now you can edit the file. If the file does not exist, you will have
a blank editor.
Here is the tricky part. You cannot simply start typing to edit the file. You have to first enter
the “insert” (editing) mode. To do so, simply press “i”. You will notice “- - INSERT - -” appear
at the bottom of your screen. Now you can start typing as you would normally in any text editor.
To save the file, you need to enter the command mode. Hit “esc” (the escape key at the
top-left of your keyboard), then “:” (colon), and then “w”. You should see a message at the
bottom of your screen that the file was saved. To exit, once again enter the command mode
by pressing “esc”, then “:”, and finally “q” for quit. Figure 4.10 shows a screenshot of what
editing with vi looks like.
I know all of this may sound daunting if you have never used UNIX. But if you keep up with
it, there are some tremendous benefits that only UNIX can provide. For instance, vi can run
quite an effective search in your file using regular expressions (pattern matching in strings).
I would recommend using Emacs as an easy-to-use alternative to vi. On the terminal, enter
“emacs file.txt” to edit or create the file.txt file. Start typing as you would normally. To save,
press the combination of Ctrl+x and Ctrl+s (hold down Ctrl key and press “x” and then “s”).
To quit, enter the combination Ctrl+x and Ctrl+c. See Figure 4.11 for an example of what
Emacs looks like.
Alternatively, you can create/edit a file on your computer and “FTP it” to the server. If
you decide to do this, make sure you are creating a simple text file and not a Word doc or
some other non-text format. Any time you want to read a file on the server, you can type “cat
filename”.
Many programs and utilities (system-provided or user-created) can produce output, which
is usually displayed on the console. However, if you like, you can redirect that output.
For instance, the “ls” command lists all the files available in a given directory. If you want
this listing stored in a file instead of displayed on the console, you can run “ls > output”.
Here, “>” is a redirection operator and “output” is the name of the file where the output of
the “ls” command will be stored.
Here we assumed that the “output” file does not already exist. If it does, its content will be
overwritten by what “ls” generated. So be careful – check that you are not wiping out an
existing file before redirecting the output to it.
Sometimes you want to append new output to an existing file instead of overwriting it or
creating a new file. For that, you can use operator “>>”. Example: “ls >> output”. Now, the
new output will be added at the end of the “output” file. If the file does not exist, it will be
created just like before.
Redirection also works the other way. Let us take an example. We know that “wc -l xyz.txt” can count the number of lines in xyz.txt and display it on the console; specifically, it lists
the number of lines followed by the filename (here, xyz.txt). What if you want only the
number of lines? You can redirect the file to “wc -l” command like this: “wc -l < xyz.txt”.
Now you should see only a number.
Let us extend this further. Imagine you want to store this number in another file (instead
of displaying on the console). You can accomplish this by combining two redirection
operators, like this: “wc -l < xyz.txt > output”. Now, a number will be calculated and it
will be stored in a file named “output.” Go ahead and do “cat output” to read that file.
Redirection is primarily done with files, but UNIX allows other ways to connect different
commands or utilities. And that is done using pipes. Looking for a pipe symbol “|” on your
keyboard? It is that character above the “\” character.
Let us say you want to read a file. You can run “cat xyz.txt”. But it has too many lines and
you care about only the first five. You can pipe the output of “cat xyz.txt” command to
another command, “head -5”, which shows only the top five lines. And thus, the whole
command becomes “cat xyz.txt | head -5”.
Now imagine you want to see only the fifth line of that file. No problem. Pipe the above
output to another command “tail -1”, which shows only the last (1) line of whatever is
passed to it. So, the whole command becomes “cat xyz.txt | head -5 | tail -1”.
And what if you want to store that one line to a file instead of just seeing it on the console?
You guessed it – “cat xyz.txt | head -5 | tail -1 > output”.
In the next section, we will see more examples of how redirections and pipes can be used
for solving simple problems.
UNIX provides an excellent environment for problem solving. We will not be able to go
into every kind of problem and related details, but we will look at a few examples here.
3. Sorting a File
Let us say we have a file numbers.txt with one number per line and we want to sort them.
Just run:
sort numbers.txt
Want them sorted in descending order (reverse order)? Run:
sort -r numbers.txt
We can do the same with non-numbers. Let us create a file text.txt with “the quick brown
fox jumps over the lazy dog,” as text written one word per line. And now run the sorting
command:
sort text.txt
to get those words sorted alphabetically.
How about sorting multicolumn data? Say your file, test.txt, has three columns and you
want to sort the dataset according to the values in the second column. Just run the following
commands:
sort -k2 test.txt
We can also search within files using the “grep” command. For example, running “grep 'fox' text.txt” looks for the word “fox” in the text.txt file we created above. If the word exists in that file, the matching line will be printed on the console, otherwise the output will be
nothing. But it does not end here. Let us say we want to search for “fox” in all the text files in
the current directory. That can be done using:
grep 'fox' *.txt
Here, “*.txt” indicates all the text files, with “*” being the wildcard. In the output, you
can see all the .txt files that have the word “fox”.
Next, let us see how to extract specific fields from a file. Say we have a file names.txt with the following content:
Bugs Bunny
Daffy Duck
Porky Pig
Now, let us say we want to get everyone’s first name. We use the “cut” command like this:
cut -d ' ' -f1 names.txt
Here, the “-d” option is for specifying a delimiter, which is a space (see the option right
after “-d”) and “-f1” indicates the first field. Let us create another file with phone numbers,
called phones.txt:
123 456 7890
456 789 1230
789 123 4560
How do we get the last four digits of all the phone numbers from this file?
cut -d ' ' -f3 phones.txt
The “fmt” command reformats text so that each output line is at most a given width, where words are sequences of non-white-space characters. By setting that width to 1, it can be used to display the individual words of a line, one per line. For example:
fmt -1 phones.txt
Running the above command on the phones.txt data from earlier will print the following:
123
456
7890
456
789
1230
789
123
4560
If your dataset is too large, you can pipe the output to “head” to print only the first 10 lines. So, the above line of
code can be rewritten as:
fmt -1 phones.txt | head
Alternatively, we can use UNIX. If you are already on a UNIX-based machine or have a UNIX-based
environment, open a console and navigate to where this file is using “cd” commands. Or you can upload
this file to a UNIX server using FTP or SFTP and then log in to the machine using SSH. Either way, I am
assuming that you have a console or a terminal open to the place where this file is stored. Now simply
issue:
wc housing.txt
That should give you an output like this:
64536 3256971 53281106 housing.txt
The first number represents the number of lines, the second one refers to the number of words, and the
third one reports the number of characters in this file. In other words, we can immediately see from this
output that we have 64,535 records (one less because the first line has field names and not actual data).
See how easy this was?!
Next, can you figure out what fields this data has? It is easy using the “head” command on UNIX:
head -1 housing.txt
The output will list a whole bunch of fields, separated by commas. While some of them will not make
sense, some are easy to figure out. For instance, you can see a column listing the age (AGE1), value
(VALUE), number of rooms (ROOMS), and number of bedrooms (BEDRMS). Try some of the UNIX commands
you know to explore this data. For instance, you can use “head” to see the first few lines and “tail” to see
the last few lines.
Now, let us ask: What is the maximum number of rooms in any house in this data? To answer this, we
need to first extract the column that has information about the number of rooms (ROOMS) and then sort it
in descending order.
OK, so where is the ROOMS column? Using “head -1” command shown above, we can find the names of
all the fields or columns. Here, we can see that ROOMS is the nineteenth field. To extract it, we will split
the data into fields and ask for field #19. Here is how to do it:
cut -d ',' -f19 housing.txt
Here, the “cut” command allows us to split the data using delimiter (that is the -d option),
which is a comma here. And finally, we are looking for field #19 (-f19). If you happened to run
this command (oh, did I not tell you not to run it just yet?!), you perhaps saw a long list of
numbers flying across your screen. That is simply the nineteenth field (all 64,535 values) being
printed on your console. But we need to print it in a certain order. That means we can pipe with
a “sort” command, like this:
cut -d ',' -f19 housing.txt | sort -nr
Can you figure out what all those new things mean? Well, the first is the pipe “|” symbol. Here it allows
us to channel the output of one command (here, “cut”) to another one (here, “sort”). Now we are looking
at the “sort” command, where we indicate that we want to sort numerical values in descending or reverse
order (thus, -nr).
Once again, if you had no patience and went ahead to try this, you saw a list of numbers flying by,
but in a different order. To make this output feasible to see, we can do one more piping – this time to
“head”:
cut -d ',' -f19 housing.txt | sort -nr | head
Did you run it? Yes, this time I do want you to run this command. Now you can see only the first few
values. And the largest value is 15, which means in our data the highest number of rooms any house has
is 15.
What if we also want to see how much these houses cost? Sure. For that, we need to extract field #15 as
well. If we do that, here’s what the above command looks like:
cut -d ',' -f15,19 housing.txt | sort -nr | head
Here are the top few lines in the output:
2520000,15
2520000,14
2520000,13
2520000,13
2520000,13
Do you see what is happening here? The “sort” command is taking the whole thing like “2520000,15”
for sorting. That is not what we really want. We want it to sort the data using only the number of rooms.
To make that happen, we need to have “sort” do its own splitting of the data passed to it by the “cut”
command and apply sorting to a specific field. Here is what that looks like:
cut -d ',' -f15,19 housing.txt | sort -t ',' -k2 -nr | head
Here we added the “-t” option to indicate we are going to split the data using a delimiter (“,” in this
case), and once we do that, we will use the second field or key to apply sorting (thus, -k2). And now the
output looks something like the following:
450000,15
2520000,15
990000,14
700000,14
600000,14
Hopefully, by now you have got the hang of how these things work. Go ahead and play
around some more with this data, trying to answer some questions or satisfying your
curiosity about this housing market data. You now know how to wield the awesome
power of UNIX!
Summary
UNIX is one of the most powerful platforms around, and the more you know about it and
how to use it, the more amazing things you can do with your data, without even writing any
code. People do not often think about using UNIX for data science, but as you can see in this
chapter, we could get so much done with so little work. And we have only scratched the
surface.
We learned about how to have a UNIX-like environment on your machine or to
connect to a UNIX server, as well as how to transfer files to/from that server. We
tried a few basic and a few not-so-basic commands. But it is really the use of pipes
and redirections that make these things pop out; this is where UNIX outshines
anything else. Finally, we applied these basic skills to solve a couple of small data
problems. It should be clear by now that those commands or programs on UNIX are
highly efficient and effective. They could crunch through a large amount of data
without ever choking!
Earlier in this chapter I told you that UNIX can also help with data visualization. We
avoided that topic here because creating visualizations (plots, etc.) will require a few
installations and configurations on the UNIX server. Since that is a tall order and I cannot
expect everyone to have access to such a server (and very friendly IT staff!), I decided to
leave that part out of this chapter. Besides, soon we will see much better and easier ways to
do data visualizations.
Going further, I recommend learning about shell scripting or programming. This is
UNIX’s own programming language that allows you to leverage the full potential of
UNIX. Most shell scripts are small and yet very powerful. You will be amazed by the
kinds of things you can have your operating system do for you!
Key Terms
• File: A file is a collection of related data that appears to the user as a single, contiguous
block of information, has a name, and is retained in storage.
• Directory: A directory in an operating system (e.g., UNIX) is a special type of file that
contains a list of objects (i.e., other files, directories, and links) and their corresponding
details (e.g., when the file was created, last modified, file type, etc.) except the actual
content of those objects.
• Protocols: The system of rules governing affairs of a process, for example, the FTP
protocol defines the rules of file transfer process.
• SSH (Secure Shell): This is an application or interface that allows one to either run
UNIX commands or connect with a UNIX server.
• FTP (File Transfer Protocol): This is an Internet protocol that allows one to connect two
remote machines and transfer files between them.
Conceptual Questions
Hands-On Problems
Problem 4.1
You are given a portion of the data from NYC about causes of death for some people in 2010
(available for download from OA 4.2). The data is in CSV format with the following
fields: Year, Ethnicity, Sex, Cause of Death, Count, Percent.
Answer the following questions using this data. Use UNIX commands or utilities.
Show your work. Note that the answers to each of these questions should be the
direct result of running appropriate commands/utilities and not involve any further
processing, including manual work. Answers without the method to achieve them will
not get any grade.
a. How many male record groups and how many female record groups does the data
have?
b. How many white female groups are there? Copy entire records of them to a new file
where the records are organized by death count in descending order.
c. List causes of death by their frequencies in descending order. What are the three most
frequent causes of death for males and females?
Problem 4.2
Over the years, UNICEF has supported countries in collecting data related to children and
women through an international household survey program. The survey responses on
sanitation and hygiene from 70 countries in 2015 are constructed into a dataset on
handwashing. You can download the CSV file from OA 4.3. Use the dataset to answer
the following questions using UNIX commands or utilities:
a. Which region has the lowest percentage of urban population?
b. List the region(s) where the urban population (reported in thousands) is more than a million and yet comprises less than half of the total population.
Problem 4.3
For the following exercise, use the data on availability of essential medicines in 38
countries from the World Health Organization (WHO). You can find the dataset in OA
4.4. Download the filtered data as a CSV table and use it to answer the following
questions.
1. Between 2007 and 2013, which country had the lowest percentage median availability of selected generic medicines in the private sector?
2. List the top five countries which had the highest percentage of public and private median availability of selected medicines in 2007–2013.
3. List the top three countries where it is better to rely on the private availability of selected generic medicines than on the public one. Explain your answer with valid reasons.
As noted earlier in this chapter, UNIX is often ignored as a tool/system/platform for solving
data problems. That being said, there are a few good options for educating yourself about
the potential of UNIX for data science that do a reasonable job of helping a beginner wield
at least part of the awesome power that UNIX has.
Data Science at the Command Line10 provides a nice list of basic commands and
command-line utilities that one could use while working on data science tasks. The author,
Jeroen Janssens, also has a book of the same name, Data Science at the Command Line, published by O’Reilly,
which is worth consideration if you want to go further in UNIX.
Similarly, Dr. Bunsen has a nice blog on explorations in UNIX available from http://
www.drbunsen.org/explorations-in-unix/.
There are several cheat sheets that one could find on the Web containing UNIX
commands and shortcuts. Some of them are here:
• http://cheatsheetworld.com/programming/unix-linux-cheat-sheet/
• http://www.cheat-sheets.org/saved-copy/ubunturef.pdf
• https://www.cheatography.com/davechild/cheat-sheets/regular-expressions/
Notes
1. Picture of console window on a Linux KDE console: https://www.flickr.com/photos/okubax/
29814358851
2. Cygwin project: http://www.cygwin.com
3. Cygwin running on a Windows desktop: https://commons.wikimedia.org/wiki/File:
Cygwin_X11_rootless_WinXP.png
4. Free UNIX: http://sdf.lonestar.org/index.cgi
5. WinSCP: http://winscp.net/eng/index.php
“Most good programmers do programming not because they expect to get paid or get
adulation by the public, but because it is fun to program.”
— Linus Torvalds
5.1 Introduction
Python is a simple-to-use yet powerful scripting language that allows one to solve data
problems of varying scale and complexity. It is also the most used tool in data science and
most frequently listed in data science job postings as a requirement. Python is a very
friendly and easy-to-learn language, making it ideal for the beginner. At the same time, it is
very powerful and extensible, making it suitable for advanced data science needs.
This chapter will start with an introduction to Python and then dive into using the language
for addressing various data problems using statistical processing and machine learning.
One of the appeals of Python is that it is available for almost every platform you can imagine,
and for free. In fact, in many cases – such as working on a UNIX or Linux machine – it is
likely to be already installed for you. If not, it is very easy to obtain and install.
For the purpose of this book, I will assume that you have access to Python. Not sure where it
is? Try logging on to the server (using SSH) and run the command “python --version” at the
terminal. This should print the version of Python installed on the server.
It is also possible that you have Python installed on your machine. If you are on a Mac or
a Linux, open a terminal or a console and run the same command as above to see if you have
Python installed, and, if you do, what version.
Finally, if you would like, you can install an appropriate version of Python for your
system by downloading it directly from Python1 and following the installation and config-
uration instructions. See this chapter’s further reading and resources for a link, and
Appendices C and D for more help and details.
Assuming you have access to Python – either on your own machine or on the server – let us
now try something. On the console, first enter “python” to enter the Python environment.
You should see a message and a prompt like this:
Python 3.5.2 |Anaconda 4.2.0 (x86_64)| (default, Jul 2 2016,
17:52:12)
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
Now, at this prompt (the three “greater than” signs), write ‘print (“Hello, World!”)’
(without the single quotation marks) and hit enter. If things go right, you should see “Hello,
World!” printed on the screen like the following:
>>> print("Hello, World!")
Hello, World!
Let us now try a simple expression: 2+2. You see 4? Great!
Finally, let us exit this prompt by entering exit(). If it is more convenient, you could also
do Ctrl+d to exit the Python prompt.
There are several decent options for Python IDE, including using the Python plug-in for
a general-purpose IDE, such as Eclipse. If you are familiar with and invested in Eclipse, you
might get the Python plug-in for Eclipse and continue using Eclipse for your Python
programming. Look at the footnote for PyDev.2
If you want to try something new, then look up Anaconda, Spyder, and IPython (more in
Appendix D). Note that most beginners waste a lot of time trying to install and configure
packages needed for running various Python programs. So, to make your life easier,
I recommend using Anaconda as the platform and Spyder on top of it.
There are three parts to getting this going. The good news is – you will have to do this only
once.
First, make sure you have Python installed on your machine. Download and install an
appropriate version for your operating system from the Python3 link in the footnote. Next,
download and install Anaconda Navigator4 from the link in the footnote. Once ready, go
ahead and launch it. You will see something like Figure 5.1. Here, find a panel for “spyder.”
In the screenshot, you can see a “Launch” button because I already have Spyder installed.
For you, it may show “Install.” Go ahead and install Spyder through Anaconda Navigator.
Once installed, that “Install” button in the Spyder panel should become “Launch.” Go
ahead and launch Spyder. Figure 5.2 shows how it may look. Well, it is probably not going
to have all the stuff that I am showing here, but you should see three distinct panes: one
occupying the left half of the window and two on the right side. The left panel is where you
will type your code. The top-right panel has tabs at its bottom for Variable explorer, File explorer,
as well as Help. The bottom-right panel is where you will see the output of your code.
That is all for now in terms of setting things up. If you have made it thus far, you are ready
to do real Python programming. The nice thing is, whenever we need extra packages or
libraries for our work, we can go to Anaconda Navigator and install them through its nice
IDE, rather than fidgeting with command-line utilities (many of my students have reported
wasting hours doing that).
Rather than doing any programming theory, we will learn basic Python using hands-on
examples.
In this section, we will practice with a few basic elements of Python. If you have done any
programming before, especially with a scripting language, this should be easy to understand.
The following screenshots are generated from an IPython (Jupyter) notebook. Refer to
Appendix D if you want to learn more about this tool. Here, “In” lines show you what you
enter, and “Out” lines show what you get in return. But it does not matter where you are
typing your Python code – directly at the Python console, in Spyder console, or in some
other Python tool – you should see the same outputs.
In the above segment, we began with a couple of commands that we tried when we first
started up Python earlier in this chapter. Then, we did variable assignments. Entering
“x = 2” defines variable “x” and assigns value “2” to it. In many traditional programming
languages, doing this much could take up two to three steps as you have to declare what
kind of variable you want to define (in this case, an integer – one that could hold whole
numbers) before you could use it to assign values to it. Python makes it much simpler. Most
times you would not have to worry about declaring data types for a variable.
After assigning values to variables “x” and “y”, we performed a mathematical operation
when we entered “z = x + y”. But we do not see the outcome of that operation until we enter “z”.
This also should convey one more thing to you – generally speaking, when you want to know
the value stored in a variable, you can simply enter that variable’s name on the Python prompt.
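The interaction being described would look roughly like this at the Python prompt (the value given to y here is just an example):

>>> x = 2
>>> y = 3
>>> z = x + y
>>> z
5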
Continuing on, let us see how we can use different arithmetic operators, followed by
the use of logical operators, for comparing numerical quantities.
Here, first we entered a series of mathematical operations. As you can see, Python does
not care if you put them all on a single line, separated by commas. It understands that each
of them is a separate operation and provides you answers for each of them.
Most programming languages use logical operators such as “>,” “<,” “>=,” and “<=.”
Each of these should make sense, as they are exact representations of what we would use in
regular math or logic studies. Where you may find a little surprise is how we represent
comparison of two quantities (using “==”) and negation (using “!=”). Use of logical opera-
tions results in Boolean values – “true” or “false,” or 1 or 0. You can see that in the above
output: “2 > 3” is false and “3 >= 3” is true. Go ahead and try other operations like these.
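A session of the kind being described might look like the following (the specific arithmetic examples are illustrative):

>>> 5 + 3, 5 - 3, 5 * 3, 5 / 3
(8, 2, 15, 1.6666666666666667)
>>> 2 > 3
False
>>> 3 >= 3
True
>>> 2 == 2
True
>>> 2 != 3
True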
Python, like most other programming languages, offers a variety of data types. What is
a data type? It is a format for storing data, including numbers and text. But to make things
easier for you, often these data types are hidden. In other words, most times, we do not have
to explicitly state what kind of data type a variable is going to store.
As you can see above, we could use “type” operation or function around a variable name
(e.g., “type(x)”) to find out its data type. Given that Python does not require you to explicitly
define a variable’s data type, it will make an appropriate decision based on what is being
stored in a variable. So, when we tried storing the result of a division operation – x/y – into
variable “z,” Python automatically decided to set z’s data type to be “float,” which is for
storing real numbers such as 1.1, −3.5, and 22/7.
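At the prompt, the gist of it would be something like this (the particular values of x and y are illustrative):

>>> x = 2
>>> type(x)
<class 'int'>
>>> y = 4
>>> z = x / y
>>> type(z)
<class 'float'>
>>> z
0.5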
To make decisions based on meeting a condition (or two), we can use “if” statements. Let us
say we want to find out if 2020 is a leap year. Here is the code:
year = 2020
if (year%4 == 0):
    print("Leap year")
else:
    print("Not a leap year")
Here, the modulus operator (%) divides 2020 by 4 and gives us the remainder. If that
remainder is 0, the script prints “Leap year,” otherwise we get “Not a leap year.”
Now what if we have multiple conditions to check? Easy. Use a sequence of “if” and
“elif” (short for “else if”). Here is the code that checks one variable (collegeYear), and,
based on its value, declares the corresponding label for that year:
collegeYear = 3
if (collegeYear == 1):
    print("Freshman")
elif (collegeYear == 2):
    print("Sophomore")
elif (collegeYear == 3):
    print("Junior")
elif (collegeYear == 4):
    print("Senior")
else:
    print("Super-senior or not in college!")
Another form of control structure is a loop. There are two primary kinds of loops: "while"
and "for". The "while" loop allows us to keep doing something as long as a condition holds. Take
a simple case of printing the first five numbers:
a, b = 1, 5
while (a <= b):
    print(a)
    a += 1
And here is how we could do the same with a “for” loop:
for x in range(1, 6):
    print(x)
Let us take another set of examples and see how these control structures work. As
always, let us start with if–else. You probably have guessed the overall structure of the if–
else block by now from the previous example. In case you have not, here it is:
if condition1:
    statement(s)
elif condition2:
    statement(s)
else:
    statement(s)
In the previous example, you saw a condition that involves numeric variables. Let us try one
that involves character variables. Imagine in a multiple-choice questionnaire you are given four
choices: A, B, C, and D. Among them, A and D are the correct choices and the rest are wrong.
So, if you want to check if the answer chosen is the correct answer, the code can be as follows:
if ans == 'A' or ans == 'D':
    print("Correct answer")
else:
    print("Wrong answer")
Next, let us see if the same problem can be solved with the while loop:
ans = input('Guess the right answer: ')
while (ans != 'A') and (ans != 'D'):
    print("Wrong answer")
    ans = input('Guess the right answer: ')
The above code will prompt the user to provide a new choice until the correct answer is
provided. As evidenced by the two examples, the structure of the while loop can be
viewed as:
while condition:
    statement(s)
The statement(s) within the while loop are going to be executed repeatedly as long as the
condition remains true. The same programming goal can be accomplished with the for loop
as well:
correctAns = ["A", "D"]
for ans in correctAns:
    print(ans)
The above lines of code will print the correct choices for the question.
In this section, we will see how some statistical elements can be measured and manifested in
Python. You are encouraged to learn basic statistics or brush up on those concepts using
external resources (see Chapter 3 and Appendix C for some pointers).
Let us start with a distribution of numbers. We can represent this distribution using an
array, which is a collection of elements (in this case, numbers).
For example, say we are creating our family tree, and, having put some data on the branches
and leaves of this tree, we want to do some statistical analysis. Let us look at everyone's age.
Before doing any processing, we need to represent it as follows:
data1 = [85,62,78,64,25,12,74,96,63,45,78,20,5,30,45,78,45,96,65,45,74,12,78,23,8]
If you like, you can call this a dataset. We will use a very popular Python package or library
called “numpy” to run our analyses. So, let us import and define this library:
import numpy as np
What did we just do? We asked Python to import a library called “numpy” and we said,
internally (for the current session or program), that we will refer to that library as “np”.
This particular library or package is extremely useful for us, as you will see. (Do not be
surprised if many of your Python sessions or programs have this line somewhere in the
beginning.)
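A minimal sketch of the plotting code described in the next paragraph (assuming matplotlib's pyplot, consistent with the plt calls mentioned there) would be:

import matplotlib.pyplot as plt

plt.figure()
# Get the bin counts and bin edges for data1
hist, edges = np.histogram(data1)
# Draw one bar per bin
plt.bar(edges[:-1], hist, width=edges[1:]-edges[:-1])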
Here, plt.figure() creates an environment for plotting a figure. Then, we get the data
for creating a histogram using the second line. This data is passed to plt.bar()
function, along with some parameters for the axes to produce the histogram we see
in Figure 5.3.
Note that if you get an error for plt.figure(), just ignore it and continue with the rest of the
commands. It just might work!
If we are too lazy to type in a whole bunch of values to create a dataset to play with, we
could use the random number initialization function of numpy, like this:
data2 = np.random.randn(1000)
If you did this exercise, you would notice that you get a handful of bars (numpy defaults to
10 bins). But what if you wanted a different number of bars? This may be useful to control the resolution of the figure. Here,
we have 1000 data points. So, on one extreme, we could ask for 1000 bars, but that may be
too much. At the same time, we may not want to let Python decide for us. There is a quick
fix. We can specify how many of these bars, also called “bins,” we would like. For instance,
if we wanted 100 bins or bars, we can write the following code.
plt.figure()
hist2, edges2 = np.histogram(data2, bins=100)
plt.bar(edges2[:-1], hist2, width=edges2[1:]-edges2[:-1])
And the result is shown in Figure 5.4. Note that your plot may look a little different because
your dataset may be different than mine. Why? Because we are getting these data points
using a random number generator. In fact, you may see different plots every time you run
your code starting with initializing data2!
Try it yourself: Use the Daily Demand Forecasting Orders dataset5 (see link in the footnote) to practice calculating the minimum, maximum, range, and average for all the attributes. Plot the
data per attribute in a bar graph to visualize the distribution.
We have gathered a few useful tools and techniques in the previous section. Let us apply
them to a data problem, while also extending our reach with these tools. For this exercise, we
will work with a small dataset available from github6 (see link in footnote). This is
a macroeconomic dataset with seven economic variables observed from the years 1947 to
1962 (n = 16).
First, we need to import that data into our Python environment. For this, we will use Pandas
library. Pandas is an important component of the Python scientific stack. The Pandas
DataFrame is quite handy since it provides useful information, such as column names
read from the data source so that the user can understand and manipulate the imported data
more easily. Let us say that the data is in a file “data.csv” in the current directory. The
following line loads that data in a variable “CSV_data.”
from pandas import read_csv
CSV_data = read_csv('data.csv')
Another way to use Pandas functionalities is the way we have worked with numpy. First,
we import Pandas library and then call its appropriate functions like this:
import pandas as pd
df = pd.read_csv('data.csv')
This is especially useful if we need to use Pandas functionalities multiple times in the code.
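A scatterplot like the one in Figure 5.5 can be produced with matplotlib; here is a minimal sketch, assuming the data has been loaded into df as above (the column names Employed and GNP come from that dataset):

import matplotlib.pyplot as plt

plt.figure()
plt.scatter(df.GNP, df.Employed)
plt.xlabel('Gross National Product')
plt.ylabel('Total Employment')
plt.show()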
Figure 5.5 Scatterplot to visualize the relationship between GNP and Employment.
FYI: Dataframes
If you have used arrays in a programming language before, you should feel well at home with the idea
of a dataframe in Python. A very popular way to implement a dataframe in Python is using Pandas
library.
We saw above how to use Pandas to import structured data in CSV format into a DataFrame kind of
object. Once imported, you could visualize the DataFrame object in Spyder by double-clicking its name in
Variable explorer. You will see that a dataframe is essentially a table or a matrix with rows and columns.
And that is how you can access each of its data points.
For instance, in a dataframe "df", if you want the first row, first column element, you can ask for
df.iat[0,0]. Alternatively, if the rows are labeled as "row-1", "row-2", . . . and the columns are labeled as
"col-1", "col-2", . . ., you can also ask for the same with df.at['row-1','col-1']. You see how such addressing
makes it more readable?
Do you ever need to save your dataframe to a CSV file? That is easy with
df.to_csv('mydata.csv')
There is a lot more that you can do with dataframes, including adding rows and columns, and applying
functions. But those are out of scope for us. If you are still interested, I suggest you consult some of the
pointers at the end of this chapter.
Before proceeding, note that while you can run these commands on your Spyder console
and see immediate results, you may want to write them as a part of a program/script and run
that program. To do this, type the code above in the editor (left panel) in Spyder, save it as a
.py file, and click “Run file” (the “play” button) on the toolbar.
5.5.3 Correlation
One of the most common tests we often need to do while solving data-driven problems is to
see if two variables are related. For this, we can do a statistical test for correlation.
Let us assume we have the previous data ready in dataframe df. And we want to find if the
“Employed” field and “GNP” field are correlated. We could use the “corrcoef” function of
numpy to find the correlation coefficient, which gives us an idea of the strength of
correlation between these two variables. Here is that line of code:
np.corrcoef(df.Employed,df.GNP)[0,1]
The output of this statement tells us that there is very high correlation between these two
variables as represented by the correlation coefficient = 0.9835. Also note that this number
is positive, which means both variables move together in the same direction. If this
correlation coefficient were negative, we would still have a strong correlation, but just in
the opposite direction.
In other words, in this case knowing one variable should give us enough knowledge
about the other. Let us ask: If we know the value of one variable (independent variable or
predictor), can we predict the value of the other variable (dependent variable or response)?
For that, we need to perform regression analysis.
So, we can learn about two variables relating in some way, but if there is a relationship of
some kind, can we figure out if and how one variable could predict the other? Linear
regression allows us to do that. Specifically, we want to see how a variable X affects
a variable y.7 Here, X is called the independent variable or predictor; and y is called the
dependent variable or outcome.
Figuring out this relationship between X and y can be seen as building or fitting a model
with linear regression. There are many methods for doing so, but perhaps the most common
is ordinary least squares (OLS).
For doing linear regression with Python, we can use statsmodels library’s API functions,
as follows:
import statsmodels.api as sm
lr_model = sm.OLS(y, X).fit()
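This snippet assumes y and X have already been prepared from the dataframe. For the data loaded above, that preparation would look something like the following sketch; add_constant gives the model an intercept, which is where the const coefficient mentioned below comes from:

# Prepare response and predictor from the longley dataframe (a sketch)
y = df.Employed
X = sm.add_constant(df.GNP)
lr_model = sm.OLS(y, X).fit()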
Here, lr_model is the model built using linear regression with the OLS fitting approach.
How do we know this worked? Let us check the results of the model by running the
following command:
print(lr_model.summary())
Somewhere in the output here, we can find values of coefficients – one for const (constant)
and the other for GNP. And here is our regression equation:
Employed = coeff*GNP + const
Now just substitute a value for GNP, its coefficient (found from the output above), and
constant. See if the corresponding value in Employed matches with the data. There might be
some difference, but hopefully not much! Let us look at an example. We know from the
dataset that in 1960, GNP value was 502.601. We will use our regression equation to
calculate the value of Employed. Plugging in the values GNP = 502.601, coeff = 0.0348,
and const = 51.8436 in the above equation, we get:
Employed = 0.0348*502.601 + 51.8436 = 69.334
Now let us look up the actual value of “Employed” for 1960. It is 69.564. That means we
were off by only 0.23. That is not bad for our prediction. And, more important, we now have
a model (the line equation) that could allow us also to interpolate and extrapolate. In other
words, we could even plug in some unknown GNP value and find out what the approximate
value for Employed would be.
Well, why not do this more systematically? Specifically, let us come up with all
kinds of values for our independent variable, find the corresponding values for the depen-
dent variable using the above equation, and plot it on the scatterplot. We will see this as
a part of a full example below.
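Here is one way to sketch that idea (assuming df and lr_model from above; the coefficients are the same 0.0348 and 51.8436 found earlier):

import numpy as np
import matplotlib.pyplot as plt

# Generate a range of GNP values and predict Employed for each
gnp_values = np.linspace(df.GNP.min(), df.GNP.max(), 100)
predicted_employed = lr_model.params[0] + lr_model.params[1] * gnp_values
plt.figure()
plt.scatter(df.GNP, df.Employed)
plt.plot(gnp_values, predicted_employed)
plt.xlabel('Gross National Product')
plt.ylabel('Total Employment')
plt.show()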
Figure 5.6 Scatterplot of GNP vs. Employed overlaid with a regression line.
What we have seen so far is one variable (predictor) helping to predict another (response or
dependent variable). But there are many situations in life when there is not a single factor
that contributes to an outcome. And so, we need to look at multiple factors or variables.
That is when we use multiple linear regression.
As the name suggests, this is a method that takes into account multiple predictors in order
to predict one response or outcome variable. Let us take an example.
We will start by getting a small dataset from OA 5.2. This dataset contains information about advertising
budgets for TV and radio and corresponding sales numbers. What we want to learn here is how much those
budgets influence product sales.
Let us first load it up in our Python environment.
# Load the libraries we need – numpy, pandas, pyplot, and statsmodels.api
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
# Load the advertising dataset into a pandas dataframe
df = pd.read_csv('Advertising.csv', index_col=0)
We start our analysis by doing linear regression, as we did before, to see how well we could use the “TV”
variable to predict “Sales”.
y = df.Sales
X = df.TV
X = sm.add_constant(X)
lr_model = sm.OLS(y,X).fit()
print(lr_model.summary())
print(lr_model.params)
In this output, what we are looking for is the R-squared value. It is around 0.61, which means that
about 61% of the variance in this TV–sales relationship can be explained using the model we built. Well,
that is not too bad, but before we move on, let us plot this relationship:
plt.figure()
plt.scatter(df.TV, df.Sales)
plt.xlabel('TV')
plt.ylabel('Sales')
The outcome is shown in Figure 5.7.
Figure 5.7 Scatterplot of TV vs. Sales from the advertising data.
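The analogous scatterplot for the Radio variable (Figure 5.8) can be produced the same way; a quick sketch:

plt.figure()
plt.scatter(df.Radio, df.Sales)
plt.xlabel('Radio')
plt.ylabel('Sales')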
Figure 5.8 Scatterplot of Radio vs. Sales from the advertising data.
Figure 5.9 Three-dimensional scatterplot showing TV, Radio, and Sales variables.
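The model discussed next uses both TV and Radio as predictors; following the same conventions as above, fitting it would be along these lines (a sketch):

# Fit Sales against both TV and Radio advertising budgets
X = df[['TV', 'Radio']]
X = sm.add_constant(X)
lr_model = sm.OLS(y, X).fit()
print(lr_model.summary())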
This comes up with R-squared close to 90%. That is much better. Seems like two are better than one, and
that is our multiple linear regression!
And here is the code for plotting this regression in three dimensions (3D), with the result shown in
Figure 5.9. Consider this as optional or extra stuff.
from mpl_toolkits.mplot3d import Axes3D
# Figure out X and Y axis using ranges from TV and Radio
X_axis, Y_axis = np.meshgrid(np.linspace(X.TV.min(), X.TV.max(), 100),
                             np.linspace(X.Radio.min(), X.Radio.max(), 100))
# Plot the hyperplane by calculating the corresponding Z axis (Sales)
Z_axis = lr_model.params[0] + lr_model.params[1] * X_axis + lr_model.params[2] * Y_axis
# Create matplotlib 3D axes
fig = plt.figure(figsize=(12, 8))  # figsize refers to width and height of the figure
ax = Axes3D(fig, azim=-100)
# Plot hyperplane
ax.plot_surface(X_axis, Y_axis, Z_axis, cmap=plt.cm.coolwarm, alpha=0.5, linewidth=0)
# Plot data points
ax.scatter(X.TV, X.Radio, y)
# Set axis labels
ax.set_xlabel('TV')
ax.set_ylabel('Radio')
ax.set_zlabel('Sales')
In a couple of chapters, we are going to see machine learning at its full glory (or at least the
glory that we could achieve in this book!). But while we are on a roll with Python, it would
be worthwhile to dip our toes in the waters of machine learning and see what sorts of data
problems we could solve.
We will start in the following subsection with a little introduction to machine learning,
and then quickly move to recognizing some of the basic problems, techniques, and solu-
tions. We will revisit most of these with more details and examples in Part III of this book.
Machine learning (ML) is a field of inquiry, an application area, and one of the most
important skills that a data scientist can list on their résumé. It sits at the intersections of
computer science and statistics, among other related areas like engineering. It is used in
pretty much every area that deals with data processing, including business, bioinformatics,
weather forecasting, and intelligence (the NSA and CIA kind!).
Figure 5.10 Decision tree of machine learning approaches: if the goal is to predict or forecast a value, we are doing supervised learning (regression or classification); if not, we are doing unsupervised learning (clustering or density estimation).
Machine learning is about enabling computers and other artificial systems to learn
without explicitly programming them. This is where we want such systems to see some
data, learn from it, and then use that knowledge to infer things from other data.
Why machine learning? Because machine learning can help you turn your data into
information; it can allow you to translate a seemingly boring bunch of data points into
meaningful patterns that could help in critical decision-making; it lets you harness the true
power of having a lot of data.
We have already seen and done a form of machine learning when we tried predicting
values based on learned relationship between a predictor and an outcome. The core of
machine learning can be explained using the decision tree (which also happens to be the
name of one of the ML algorithms!) shown in Figure 5.10.
If we are trying to predict a value by learning (from data) how various predictors and the
response variables relate, we are looking at supervised learning. Within that branch, if the
response variable is continuous, the problem becomes that of regression, which we have
already seen. Think about knowing someone’s age and occupation and predicting their
income. If, on the other hand, the response variable is discrete (having a few possible values
or labels), this becomes a classification problem. For instance, if you are using someone’s
age and occupation to try to learn if they are a high-earner, medium-earner, or low-earner
(three classes), you are doing classification.
These learning problems require us to know the truth first. For example, in order to learn how
age and occupation could tell us about one’s earning class, we need to know the true value of
someone’s class – do they belong to high-earner, medium-earner, or low-earner. But there are
times when the data given to us do not have clear labels or true values. And yet, we are tasked
with exploring and explaining that data. In this case, we are dealing with unsupervised learning
problems. Within that, if we want to organize data into various groups, we encounter a clustering
problem. This is similar to classification, but, unlike classification, we do not know how many
classes there are and what they are called. On the other hand, if we are trying to explain the data
by estimating underlying processes that may be responsible for how the data is distributed, this
becomes a density estimation problem. In the following subsections, we will learn more about
these branches, with, of course, hands-on exercises.
Since we have already worked with regression, in this section we will focus on classi-
fication, clustering, and density estimation branches.
The task of classification is this: Given a set of data points and their corresponding labels,
learn how they are classified, so when a new data point comes, we can put it in the correct
class. There are many methods and algorithms for building classifiers, one of which is
k-nearest neighbor (kNN).
Here is how kNN works:
1. As in the general problem of classification, we have a set of data points for which we
know the correct class labels.
2. When we get a new data point, we compare it to each of our existing data points and find
similarity.
3. Take the most similar k data points (k nearest neighbors).
4. From these k data points, take the majority vote of their labels. The winning label is the
label or class of the new data point.
Usually k is a small number between 2 and 20. As you can imagine, the larger the number of
nearest neighbors (the value of k), the longer it takes us to do the processing.
Finding similarity between data points is also something that is very important, but we
are not going to discuss it here. For the time being, it is easier to visualize our data points on
a plane (in two or three dimensions) and think about distance between them using our
knowledge from linear algebra.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
# In older scikit-learn versions this import was from sklearn.cross_validation
from sklearn.model_selection import train_test_split
df = pd.read_csv("wine.csv")
# Mark about 70% of the data for training and use the rest for testing
# We will use 'density', 'sulfates', and 'residual_sugar' features
# for training a classifier on 'high_quality'
X_train, X_test, y_train, y_test = train_test_split(
    df[['density', 'sulfates', 'residual_sugar']], df['high_quality'], test_size=.3)
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(X_train, y_train)
# Test the classifier by giving it test instances
prediction = classifier.predict(X_test)
# Count how many were correctly classified
correct = np.where(prediction==y_test, 1, 0).sum()
print(correct)
# Calculate the accuracy of this classifier
accuracy = correct/len(y_test)
print(accuracy)
Note that the above example uses k = 3 (checking on three nearest neighbors when doing
comparison). The accuracy is around 76% (you will get a different number every time because
a different set of data is used for training and testing every time you run the program). But what
would happen if the value of k were different? Let us try building and testing the classifier using a range of
values for k and plot the accuracy corresponding to each k.
# Start with an array where the results (k and corresponding
# accuracy) will be stored
results = []
for k in range(1, 51, 2):
    classifier = KNeighborsClassifier(n_neighbors=k)
    classifier.fit(X_train, y_train)
    prediction = classifier.predict(X_test)
    accuracy = np.where(prediction==y_test, 1, 0).sum() / (len(y_test))
    print("k=", k, "Accuracy=", accuracy)
    # Keep k and its accuracy; the plotting code below assumes results ends up
    # as a dataframe with columns "k" and "accuracy" (this bookkeeping step is
    # implied but not shown in the extract)
    results.append([k, accuracy])
results = pd.DataFrame(results, columns=["k", "accuracy"])
plt.plot(results.k, results.accuracy)
plt.title("Value of k and corresponding classification accuracy")
plt.show()
Figure 5.11 Plot showing how different values of k affect the accuracy of the kNN model built here.
The plotting result is shown in Figure 5.11. Note that, again, every time you run this program, you will see
slightly different results (and plot). In the output, you will also notice that after a certain value of
k (typically 15), the improvements in accuracy are hardly noticeable. In other words, we reach the
saturation point.
Now we will look at another branch of machine learning. In the example we just saw, we knew
that we had two classes and the task was to assign a new data point to one of those existing class
labels. But what if we do not know what those labels are or even how many classes there are to
begin with? That is when we apply unsupervised learning with the help of clustering.
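As a minimal sketch of that kind of flat clustering with scikit-learn's KMeans (the six 2D data points below are made up):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Six made-up points in two dimensions
X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
print(kmeans.cluster_centers_)
print(kmeans.labels_)
# Color each point by the cluster it was assigned to
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_)
plt.show()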
Now let us see what happens if we want three clusters. Change the n_clusters argument (input or
parameter) in KMeans function to be 3. And voilà! Figure 5.13 shows how this algorithm can give us three
clusters with the same six data points.
You can even try n_clusters=4. So, you see, this is unsupervised learning because the labels or
colors of the data points are not known ahead of time, so we cannot use them to put the points into classes. In fact, we do not even know
how many labels or classes there should be, and so we could impose almost as many of those as we like.
But things are not always this simple. Because we were dealing with two dimensions and
so few data points, we could even visually identify how many clusters could be appropriate.
Nonetheless, the example above demonstrates how unsupervised clustering typically works.
One way to think about clustering when we do not know how many clusters we should have
is to let a process look for data points that are dense together and use that density
information to form a cluster. One such technique is MeanShift.
Let us first understand what density information or function is. Imagine you are trying to
represent the likelihood of finding a Starbucks in an area. You know that if it is a mall or
a shopping district, there is a good probability that there is a Starbucks (or two or three!), as
opposed to a less populated rural area. In other words, Starbucks has higher density in cities
and shopping areas than in less populated or less visited areas. A density function is
a function (think about a curve on a graph) that represents the relative likelihood of
a variable (e.g., the existence of a Starbucks) taking on a given value.
Now let us get back to MeanShift. This is an algorithm that locates the maxima (max-
imum values) of a density function given a set of data points that fit that function. So,
roughly speaking, if we have data points corresponding to the locations of Starbucks,
MeanShift allows us to figure out where we are likely to find Starbucks; or on the other
hand, given a location, how likely are we to find a Starbucks.
# These imports are needed for the code below (they fall outside this extract)
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs
# ggplot styling
from matplotlib import style
style.use("ggplot")
# Let's create a bunch of points around three centers in a 3D
# space. X has those points and we can ignore y
centers = [[1,1,1],[5,5,5],[3,10,10]]
X, y = make_blobs(n_samples=100, centers=centers, cluster_std=2)
# Perform clustering using the MeanShift algorithm
ms = MeanShift()
ms.fit(X)
# "ms" holds the model; extract information about clusters as
# represented by their centroids, along with their labels
centroids = ms.cluster_centers_
labels = ms.labels_
print(centroids)
print(labels)
# Find out how many clusters we created
n_clusters_ = len(np.unique(labels))
print("Number of estimated clusters:", n_clusters_)
# Define a colors array
colors = ['r','g','b','c','k','y','m']
# Let's do a 3D plot
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
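A plausible way to finish the plotting step (not necessarily the exact code used in the book) is to draw each point in the color of its assigned cluster and mark the centroids:

# Plot each data point, colored by the cluster it was assigned to
for i in range(len(X)):
    ax.scatter(X[i][0], X[i][1], X[i][2], c=colors[labels[i] % len(colors)], marker='o')
# Mark the cluster centroids with crosses
ax.scatter(centroids[:, 0], centroids[:, 1], centroids[:, 2], marker='x', s=150, linewidths=5)
plt.show()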
And this is where we are going to stop our exploration of applying machine learning to
various data problems using Python. If you are interested in more, go to Part III of this book.
But if you do not know enough about the R statistical tool, you should go through the next
chapter first, because later, when we do machine learning, we are going to exclusively
use R.
Summary
Python has recently taken the number one spot for programming languages, according to
the IEEE.8 And that is not a surprise. It is an easy-to-learn, yet very powerful, language. It is
ideal for data scientists because it offers straightforward ways to load and plot data,
provides a ton of packages for everything from data visualization to parallel processing, and allows
easy integration with other tools and platforms. Want to do network programming? Python has
got it. Care about object-oriented programming? Python has you covered. What about GUI?
You bet!
It is hard to imagine any data science book without coverage of Python, but one of the
reasons it makes even more sense for us here is that, unlike some other programming
languages (e.g., Java), Python has a very low barrier to entry. One can start seeing results of various
expressions and programming structures almost immediately without having to worry about
a whole lot of syntax or compilation. There are very few programming environments that are
easier than this.9 Not to mention, Python is free, open-source, and easily available. This may
not mean much in the beginning, but it has implications for its sustainability and support.
Python continues to flourish, be supported, and further enhanced due to a large community
of developers who have created outstanding packages that allow a Python programmer to do
all sorts of data processing with very little work. And such development continues to grow.
Often students ask for a recommendation for a programming language to learn. It is hard
to give a good answer without knowing the context (why do you want to learn program-
ming, where would you use it, how long, etc.). But Python is an easy recommendation for
all the reasons above.
Having said that, I recommend not being obsessed with any programming tools or
languages. Remember what they are – just tools. Our goal, at least in this book, is not to
master these tools, but to use them to solve data problems. In this chapter, we looked at
Python. In the next, we will explore R. In the end, you may develop a preference for one
over the other, but as long as you understand how these tools can be used in solving
problems, that is all that matters.
Key Terms
Conceptual Questions
Hands-On Problems
Problem 5.1
Write a Python script that assigns a value to variable “age” and uses that information about
a person to determine if he/she is in high school. Assume that for a person to be in high
school, their age should be between 14 and 18. You do not have to write complicated
code – simple and logical code is enough.
Problem 5.2
Problem 5.3
You are given a dataset named boston (OA 5.7). This dataset contains information collected
by the US Census Service concerning housing in the area of Boston, Mass. The dataset is
small in size, with only 506 cases. The data was originally published by Harrison, D., &
Rubinfeld, D.L. (1978). Hedonic prices and the demand for clean air. Journal of
Environmental Economics and Management, 5, 81–102.
Here are the variables captured in this dataset:
CRIM – per capita crime rate by town
ZN – proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS – proportion of non-retail business acres per town.
CHAS – Charles River dummy variable (1 if tract bounds river; 0
otherwise)
Using appropriate correlation and regression tests, find which of the variables is the best
predictor of NOX (nitric oxides concentration). For that model, provide the regression plot
and equation.
Using appropriate correlation and regression tests, find which of the variables is the best
predictor of MEDV (median home value). For that model, provide the regression plot and
equation.
Problem 5.4
You have experienced a classification method, the kNN classifier, in this chapter. A classification
method or algorithm is developed aiming to address different types of problems. As a result,
different classifiers show different classification results, or accuracy. The goal of this
assignment is to compare the accuracy across different classifiers.
The dataset to use for this assignment is Iris, which is a classic and very easy multiclass
classification dataset. This dataset consists of three different types of irises’ (setosa, versicolor,
and virginica) petal and sepal length, stored in a 150 × 4 numpy.ndarray. The rows are the
samples and the columns are: Sepal length, Sepal width, Petal length, and Petal width.
You can load the iris dataset through the Python code below:
Classes: 3
Samples per class: 50
Samples total: 150
Dimensionality: 4
Features: real, positive
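A sketch of what that loading code presumably looks like, using scikit-learn's bundled iris data and an SVM classifier with a linear kernel (the name svm_classifier is illustrative):

from sklearn import datasets
from sklearn import svm

iris = datasets.load_iris()                  # data in iris.data, labels in iris.target
svm_classifier = svm.SVC(kernel='linear')    # swap 'linear' for 'rbf' or 'poly' to compare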
The second line will give you a classifier that you can store and process further just like you
do with a kNN-built classifier. Here we are saying that we want to use a linear kernel for our
SVM. Other options are “rbf” (radial basis function) and “poly” (polynomial). Try each of
these and see what accuracies you get. Note that every time you run your program, you may
get slightly different numbers, so try running it a few times:
• For the classification, take only the first two columns from the dataset.
• Split the dataset into 70% for training and 30% for test.
• Show the resulting accuracies from kNN and three variations of SVM.
Problem 5.5
Let us work with Iris data again. For the previous question, we used it for doing classifica-
tion. Now we will do clustering.
First, load the data and extract the first two features.
Now, do flat clustering using k-means. You can decide how many clusters are appro-
priate. For this, you may like to plot the data first and see how it is scattered. Show the plot
with clusters marked.
Having done both classification and clustering on the same dataset, what can you say about
this data and/or the techniques you used? Write your thoughts in one or two paragraphs.
Problem 5.6
For this exercise, you need to work with the breast cancer Coimbra dataset. First download
the dataset from OA 5.8 and load the data. The dataset has 10 features including the class
labels (1 or 2). Next, you need to round off the Leptin feature values to two decimal places.
Having done that, use the first nine attributes (dataset minus the class label) to group the
data points into two clusters. You can use any clustering algorithm of your choice, but the
number of clusters should remain the same. Once the clustering is complete, use the class
labels to evaluate the accuracy of the clustering algorithm that you chose.
If you want to learn more about Python and its versatile applications, here are some useful
resources.
Python tutorials:
• https://www.w3schools.in/python-tutorial/
• https://www.learnpython.org/
• https://www.tutorialspoint.com/python/index.htm
• https://www.coursera.org/learn/python-programming
• https://developers.google.com/edu/python/
• https://wiki.python.org/moin/WebProgramming
Notes
1. Python download: https://www.python.org/downloads/
2. PyDev: http://www.pydev.org
3. Python: https://www.python.org/downloads/
4. Anaconda Navigator: https://anaconda.org/anaconda/anaconda-navigator
5. Daily Demand Forecasting Orders dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/00409/Daily_Demand_Forecasting_Orders.csv
6. GitHub: http://vincentarelbundock.github.io/Rdatasets/csv/datasets/longley.csv
7. Notice that the predictor variable X is in uppercase and the outcome y is in lowercase. This is on
purpose. Often, there are multiple predictor variables, making X a vector (or a matrix), whereas
most times (and perhaps for us, all the time), there will be a single outcome variable.
8. IEEE (Python #1): http://spectrum.ieee.org/computing/software/the-2017-top-programming-languages
9. Yes, there are easier and/or more fun ways to learn/do programming. One popular example is
Scratch: https://scratch.mit.edu/.
6 R
“Not everything that can be counted counts, and not everything that counts can be
counted.”
— Albert Einstein
6.1 Introduction
R is open-source software for statistical computing, and it is available on all the major
platforms for free. If you have ever used or heard about Matlab, SPSS, SAS, etc., well,
R can do those kinds of things: statistics, data manipulation, visualization, and running data
mining and machine learning algorithms. But it costs you nothing and you have an amazing
set of tools at your disposal that is constantly expanding. There is also an active community
building and supporting these tools. R offers you all the power of statistics and graphics that
you can handle. No wonder it is becoming an industry standard for scientific computing.
You can download R from the R site.1 There you can also read up on R and related
projects, join mailing lists, and find all the help you need to get started and do amazing
things. Appendix D has more details for downloading and installing, if you like.
Once downloaded and installed, start the R program. You will be presented with the
R console (Figure 6.1). Here, you can run R commands and programs that we will see in the
next sections of this chapter. To leave the program, enter q().
But, as before, instead of using this program directly, we will take advantage of an
Integrated Development Environment (IDE). In this case, it is RStudio. So, go to
RStudio2 and pick the version that matches your operating system.
Again, once you have downloaded and installed RStudio, go ahead and start it up. You
should see multiple windows or panes (see Figure 6.2), including the familiar R console
where you can run R commands. As with the regular R, you can enter q() on the RStudio
console and exit.
Let us get started with actually using R now, and we will do it the way we have been
learning in this book – using hands-on exercises.
6.3.1 Basics
Let us start with things that are similar to what we did on the Python console. Everything
below that you see with “>” is what you would enter at the command prompt (“>”) in R (or
RStudio). The lines with bracketed numbers (e.g., “[1]”) show outputs from running those
commands.
> 2+2
[1] 4
> x=2
> y=2
> z=x+y
> z
[1] 4
While these may be self-explanatory, let us still go ahead and explain them. We started
by simply entering a mathematical expression (2+2) and R executes it, giving us the answer.
Next, we defined and assigned values to variables x and y. Similar to other scripting
languages such as Python, we do not need to worry about the data type of a variable.
Just as we can assign a value to a variable (e.g., “x=2”), we can also assign a mathematical
expression to a variable (above, “z=x+y”) and, in that case, R will compute that expression
and store the result in that variable. If you want to see what is in a variable, simply enter the
name of that variable on the prompt.
We should note here that you could also assign values to a variable using “<-” like this:
> a<-7
In this book, you will see me using both of these notations for assigning values and writing
expressions.
Next, let us work with logical operations. The most common logical operators should
sound quite familiar: “>”, “<”, “>=”, and “<=”. What may not be so apparent, especially if
you have not done much or any programming before, are operators for comparison (“==”)
and for negation (“!=”). Let us go ahead and practice some of these:
> 2>3
[1] FALSE
> 2==2
[1] TRUE
As you can see, the result of a logical operation is a Boolean value – “TRUE” or “FALSE.”
To try something slightly bigger, open a new script file in the RStudio editor and type in the leap year check we wrote in Python earlier, this time in R:
year = 2020
if (year%%4 == 0) {
    print("Leap year")
} else {
    print("Not a leap year")
}
Save the file. Typically, R scripts have an “.r” extension. You can run one line at a time by putting your
cursor on that line and hitting the “Run” button in the toolbar of the editor. If you want to run the whole
script, just select it all (“Ctrl+A” on a PC or “Cmd+A” on a Mac) and click “Run.” The output will be in the
console. Now that you know how to put R code in a file and run it, you can start taking advantage of
existing R scripts and even creating your own.
Just like Python, R supports some basic control structures. In general, the control structures are
of two types: decision control structure and loop control structure. As the name suggests, if you
want to decide if a statement or a set of statements can be executed or not, based on some
condition, you need a decision control structure, for example, an "if–else" block. But if you need
the same set of statements to be executed iteratively as long as the decision condition remains
true, you would want loop control structures, for example, a "for" loop, a "while" loop, etc.
Let us look at a few examples. Say you would like to decide if the weather is OK for
a bicycle trip based on the percentage of humidity present, and you want to write code for
this. It may look something like:
humidity = 45
if (humidity < 40) {
    print("Perfect for a trip")
} else if (humidity > 70) {
    print("Not suitable for a trip")
} else {
    print("May or may not be suitable for a trip")
}
As shown in the above lines of code, the three conditions on which the decision is based are
"humidity less than 40%," "humidity more than 70%," and everything in between.
The same objective can be achieved with a “for loop” as well. Here is a demonstration:
# This is to store the humidity percentages in a vector
humidity <- c(20, 30, 60, 70, 65, 40, 35)
for (count in 1:7) {
    cat("Weather for day ", count, ":")
    if (humidity[count] < 40) {
        print("Perfect for a trip")
    } else if (humidity[count] > 70) {
        print("Not suitable for a trip!")
    } else {
        print("May or may not be suitable for trip")
    }
}
Stop here for a moment and make sure all of the code above makes sense to you. Of course, the best
way to ensure this is to actually try it all by yourself and make some changes to see if your logic works out
the right way. Recall our discussion on computational thinking from the first chapter in this book. This will
be a good place to practice it.
1. Use a “for loop” to print all the years that are leap years between 2008 and 2020.
2. Use a “while loop” to calculate the number of characters in the following line:
“Today is a good day.”
6.3.3 Functions
Now we come to the most useful parts of R for data science. Almost none of our actual
problem-solving will work without first having some data available to us in R. Let us see
how to import data into R. For now, we will work with CSV data, as we have done before.
R has a function “file.choose()” that allows you to pick out a file from your computer and
a “read.table()” function to read that file as a table, which works great for csv-formatted data.
For this example, we will use the IQ data file (iqsize.csv), available from OA 6.7. Type
the following code line by line, or save it as an R script and run one line at a time:
df = read.table(file.choose(), header=TRUE, sep=",")
brain = df["Brain"]
print(summary(brain))
Running the first line brings up a file selection box. Navigate to the directory where
iqsize.csv is stored and select it. This is the result of the “file.choose()” function, which is
the first argument (parameter, input) in the “read.table()” function on that first line.
Alternatively, you can put the file name (with full path) in quotes for that first argument.
The second argument means we are expecting to see column labels in the first row (header),
and the third argument indicates how the columns are separated.
Once we have the data loaded in our dataframe (df) variable, we can process it. In
the second line, we are selecting the data stored in “Brain” column. And then in the third
line we are using the function “summary()” so we can obtain some basic statistical
characteristics of this dataframe and print them, as seen below:
Brain
Min. : 79.06
1st Qu.: 85.48
Median : 90.54
Mean : 90.68
3rd Qu.: 94.95
Max. :107.95
Do these quantities look familiar? They should if you know your basic statistics or have
reviewed descriptive statistics covered earlier in this book!
One of the core benefits of R is its ability to provide data visualizations with very
little effort, thanks to its built-in support, as well as numerous libraries or packages
and functions available from many developers around the world. Let us explore this.
Before working with graphics and plotting, let us make sure we have the appro-
priate libraries. Open RStudio and select Tools > Install Packages. In the dialog box
that pops up, make sure CRAN repository is selected for the installation source.
Now, type “ggplot2” in the packages box. Make sure “Install dependencies”
is checked. Hit “Install” and the ggplot2 package should be downloaded and
installed.
FYI: Dataframes
We saw the idea of a dataframe in the Python chapter. In a way, it is the same for R, that is, a dataframe is
like an array or a matrix that contains rows and columns. There are a couple of key differences, however.
First, in R, a dataframe is inherently available without having to load any external packages like we did
for Python with Pandas. The second big difference is how elements in a dataframe are addressed. Let us
see this using an example.
We will use one of the in-built dataframes called “mtcars”. This dataframe has car models as rows and
various attributes about a car as columns. If you want to find the mpg for a Fiat 128 model, you can enter
> mtcars['Fiat 128','mpg']
If you want the whole record for that car, you can enter
> mtcars['Fiat 128',]
In other words, you are referring to a specific row and all corresponding columns with the above
addressing. Of course, you can also address an element in a dataframe using an index such as mtcars[12,1],
but, as you can see, addressing rows, columns, or a specific element by name makes things a lot more
readable.
If you are interested in exploring dataframes in R, you may want to look at some of the pointers for
further reading at the end of this chapter.
For the examples in this section, we will work with customer data regarding health
insurance. It is available from OA 6.1. The data is in a file named custdata.tsv. Here,
“tsv” stands for tab-separated values. That means, instead of commas, the fields are
separated using tabs. Therefore, our loading command will become:
custdata = read.table('custdata.tsv', header=T, sep='\t')
Here, ‘\t’ indicates the tab character. The above command assumes that the custdata.tsv
file is in the current directory. If you do not want to take chances with that, you can replace
the file name with the “file.choose()” function, so when the “read.table()” function is run,
a file navigation box will pop up, allowing you to pick out the data file from your computer.
That line will look like:
custdata = read.table(file.choose(), header=T, sep='\t')
Let us start with a simple histogram of our customers’ ages. First, we need to load the
“ggplot2” library and use its histogram function, like this:
library(ggplot2)
ggplot(custdata) + geom_histogram(aes(x=age), binwidth=5, fill="blue")
Figure 6.3 Histogram of customer ages (age on the x-axis, count on the y-axis).
This generates a nice histogram (Figure 6.3). In the code, “binwidth” indicates the range
that each bar covers on the x-axis. Here, we set it to 5, which means it would look at ranges
such as 0–5, 6–10, and so on. Each bar on the plot then represents how many items or
members fall within a given range. So, as you can imagine, if we increase the range, we get
more items per bar and the overall plot gets “flatter.” If we reduce the range, fewer items fit
in one bar and the plot looks more “jagged.”
But did you see how easy creating such a plot was? We typed only one line. Now, if you
have never used such a statistical package before and relied on spreadsheet programs to
create graphs, you might have thought it was easy to create a graph with those programs
using point and click. But was it difficult to type that one line here? Besides, with this one
line, we can control how the graph looks much more easily than we could with point and
click. And, maybe it is just me, but I think the result looks a lot more "professional" than
what one could get from those spreadsheet programs!
Here, you can see that the histogram function has an argument for what the x-axis will
show, and for how wide each bin should be.
Let us now look at a field with categorical values. The data, “marital.stat,” is like that. We
can use a bar chart to plot the data (Figure 6.4).
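With ggplot2, that is again a one-liner along these lines (a sketch, using the marital.stat column named above):

ggplot(custdata) + geom_bar(aes(x=marital.stat), fill="gray")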
Figure 6.4 Bar plot showing distribution of marital status in the customer data.
If you do not like the default color scheme of pie() in R, you can select different
colors from the available palettes, but you may need to specify the number of
colors you need according to the number of pie slices in your chart. Here is how to
do it.
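A sketch of the idea (the housing.type field is an assumption here; rainbow() simply generates one color per slice):

slices <- table(custdata$housing.type)
pie(slices, col=rainbow(length(slices)))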
Hour  Average stock price
1     12.04
2     12.80
3     13.39
4     13.20
5     13.23
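Assuming the averaged prices above sit in a dataframe called stockavg with columns Hour and Price (hypothetical names), a ggplot2 sketch for Figure 6.6 would be:

ggplot(stockavg, aes(x=Hour, y=Price)) + geom_line() + geom_point()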
Figure 6.6 Line graph of average stock prices against the hour of operation.
The above lines should generate the line graph in Figure 6.6.
There are many other types of charts and different variations of the same chart that one
can draw in R. But we will stop at these four here and look at some other things we could do
and plot with the data we have.
We will start by loading a dataset about customer age and income; the custdata we loaded above already has both fields.
Let us find a correlation between age and income, which will tell us how much and in
which ways age and income are related. It is done using a simple command:
cor(custdata$age, custdata$income)
This gives a low correlation of 0.027. That means these two variables do not relate to one
another in any meaningful way. But wait a minute. A careful examination of the data tells us that
there are some null values, meaning some of the ages and incomes are reported to be 0. That
cannot be true, and perhaps this is a reflection of missing values. So, let us redo the correlation,
this time picking the values that are non-zero. Here is how we could create such a subset:
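One way to sketch that subsetting in R (dropping the zero, i.e., missing, values before recomputing the correlation):

custdata2 <- subset(custdata, age > 0 & income > 0)
cor(custdata2$age, custdata2$income)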
We reviewed statistical concepts in Chapter 3 and saw how we could use them (and some
more) using Python in the previous chapter. Now, it is time to use R to do the same or similar
things. So, before you begin this section, make sure that you have at least reviewed
statistical concepts. In addition, we will see how some of the basic machine learning
techniques using R could help us solve data problems. Regarding machine learning, I would
suggest reviewing the introductory part of the previous chapter.
We will start with getting some descriptive statistics. Let us work with “size.csv” data,
which you can download from OA 6.4. This data contains 38 records of different people’s
sizes in terms of height and weight. Here is how we load it:
size = read.table('size.csv', header=T, sep=',')
Once again, this assumes that the data is in the current directory. Alternatively, you can
replace “size.csv” with “file.choose()” to let you pick the file from your hard drive when
you run this line. Also, while you can run one line at a time on your console, you could type
them and save as an “.r” file, so that not only can you run line-by-line, but you can also store
the script for future runs.
Either way, I am assuming at this point that you have the data loaded. Now, we can ask
R to give us some basic statistics about it by running the summary command:
summary(size)
Height Weight
Min. :62.00 Min. :106.0
1st Qu.:66.00 1st Qu.:135.2
Median :68.00 Median :146.5
Mean :68.42 Mean :151.1
3rd Qu.:70.38 3rd Qu.:172.0
Max. :77.00 Max. :192.0
The output, as shown above, shows descriptive statistics for the two variables or columns
we have here: “Height” and “Weight.” We have seen such output before, so I will not bother
with the details.
Let us visualize this data on a scatterplot. In the following line, “ylim” is for specifying
minimum and maximum values for the y-axis:
library(ggplot2)
ggplot(size, aes(x=Height, y=Weight)) + geom_point() + ylim(100, 200)
The outcome is shown in Figure 6.7. Once again, you have got to appreciate how easy it is
with R to produce such professional-looking visualizations.
Figure 6.7 Scatterplot of Weight against Height for the size data.
6.5.2 Regression
Now that we have a scatterplot, we can start asking some questions. One straightforward
question is: What is the relationship between the two variables we just plotted? That is easy.
With R, you can keep the existing plotting information and just add a function to find a line
that captures the relationship:
ggplot(size, aes(x=Height, y=Weight)) + geom_point() +
stat_smooth(method="lm") + ylim(100, 200)
Compare this command to the one we used above for creating the plot in Figure 6.7. You
will notice that we kept all of it and simply added a segment that overlaid a line on top of the
scatterplot. And that is how easy it is to do basic linear regression in R, a form of
supervised learning. Here, “lm” method refers to linear model. The output is in Figure 6.8.
You see that blue line? That is the regression line. It is also a model that shows the
connection between “Height” and “Weight” variables. What it means is that if we know the
value of “Height,” we could figure out the value of “Weight” anywhere on this line.
Want to see the line equation? Use the “lm” command to extract the coefficients:
lm(Weight ~ Height, size)
And here is the output.
Call:
lm(formula = Weight ~ Height, data = size)
Coefficients:
(Intercept) Height
-130.354 4.113
Figure 6.8 Scatterplot of Weight against Height with the linear regression line overlaid.
You can see that the output contains coefficients for the independent or predictor
variable (Height) and the constant or intercept. The line equation becomes:

Weight = 4.113 × Height − 130.354

Try plugging in different values of “Height” in this equation and see what values of
“Weight” you get and how close your predicted or estimated values are to reality.
With linear regression, we managed to fit a straight line through the data. But perhaps the
relationship between “Height” and “Weight” is not all that straight. So, let us remove that
restriction of linear model:
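One way to do this (a sketch) is simply to drop the method="lm" argument, so that stat_smooth() falls back to its default smoother, which produces a curved fit for a small dataset like this:

ggplot(size, aes(x=Height, y=Weight)) + geom_point() + stat_smooth() + ylim(100, 200)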
And here is the output (Figure 6.9). As you can see, our data fits a curved line better than
a straight line.
Yes, the curved line fits the data better, and it may seem like a better idea than
trying to draw a straight line through this data. However, we may end up with the
curse of overfitting and overlearning with a curved shape for doing regression. It
means we were able to model the existing data really well, but in the process, we
compromised so much that we may not do so well for new data. Do not worry
about this problem for now. We will come back to these concepts in the machine
learning chapters. For now, just accept that a line is a good idea for doing regres-
sion, and whenever we talk about regression, we would implicitly mean linear
regression.
Figure 6.9 Scatterplot of Weight against Height with a curved (non-linear) fit.
6.5.3 Classification
We will now see how we could do some of the same things we did using Python. Let us start
with classification using the kNN method. As you may recall, classification with kNN is an
example of supervised learning, where we have some training data with true labels, and
we build a model (classifier) that could then help us classify unseen data.
Before we use classification in R, let us make sure we have a library or package named
“class” available to us. You can find available packages in the “Packages” tab in RStudio
(typically in the bottom-right window where you also see the plots). If you see “class” there,
make sure it is checked. If it is not there, you need to install that package using the same
method you did for the ggplot2 package.
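The data preparation step is not shown here; a minimal sketch under these assumptions: the wine data has already been loaded, X_wine holds its feature columns, y_wine holds the labels, and we hold out 30% of the rows for testing.

set.seed(1234)                         # for a reproducible split
n <- nrow(X_wine)
inTrain <- sample(n, round(0.7 * n))   # indices of the training rows (70%)
X_train <- X_wine[inTrain, ]
X_test  <- X_wine[-inTrain, ]
y_train <- y_wine[inTrain]
y_test  <- y_wine[-inTrain]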
Here, X_wine contains all the rows, but only some of the columns. It is a table or a matrix with [rows,
columns], and when we enter X_wine[inTrain,], we are picking only the rows that are marked in “inTrain,”
with all the columns of X_wine. In other words, we are generating our training data. The remaining data is
in X_wine[-inTrain,], giving us the test data.
On the other hand, y_wine is a vector (many rows, but one column). We can similarly split that vector
into training and testing using y_wine[inTrain] and y_wine[-inTrain].
We are now ready to run kNN on this data. For that, we will need to load the “class” library. Then use
X_train and y_train to build a model, and use X_test to find our predicted values for y:
library(class)
wine_pred <- knn(train=X_train, test=X_test, cl=y_train,k=3)
Finally, we want to see how well we were able to fit the model and how good our predictions were. For
this, let us load “gmodels” library3 and use its “CrossTable()” function:
library(gmodels)
CrossTable(x = y_test, y = wine_pred, prop.chisq=FALSE)
You should see output that looks something like:
Total Observations in Table: 1949
             | wine_pred
      y_test |         0 |         1 | Row Total |
-------------|-----------|-----------|-----------|
           0 |      1363 |       198 |      1561 |
             |     0.873 |     0.127 |     0.801 |
             |     0.836 |     0.623 |           |
             |     0.699 |     0.102 |           |
-------------|-----------|-----------|-----------|
           1 |       268 |       120 |       388 |
             |     0.691 |     0.309 |     0.199 |
             |     0.164 |     0.377 |           |
             |     0.138 |     0.062 |           |
-------------|-----------|-----------|-----------|
Column Total |      1631 |       318 |      1949 |
             |     0.837 |     0.163 |           |
-------------|-----------|-----------|-----------|
Here, you can see, out of the 1949 instances (that is, 30% of the data) held out for testing, how many
times we predicted 0 and it was indeed 0, or predicted 1 and it was indeed 1 (the successes), and how
often we predicted 0 when it was actually 1 (a false negative) or 1 when it was actually 0 (a false positive).
Note that the last argument “prop.chisq” indicates whether or not the chi-square contribution of each
cell is included. The chi-square statistic is the sum of the contributions from each of the individual cells and
is used to decide whether the difference between the observed and the expected values is significant.
6.5.4 Clustering
Now we will switch to the unsupervised learning branch of machine learning. Recall from
Chapter 5 that this covers a class of problems where we do not have labels on our training
data. In other words, we do not have a way to know which data point should go to which
class. Instead, we are interested in somehow characterizing and explaining the data we
encounter. Perhaps there are some classes or patterns in them. Can we identify and explain
these? Such a process is often exploratory in nature. Clustering is the most widely used
method for such exploration and we will learn about it using a hands-on example.
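The example used here is the classic iris dataset that ships with R. The exact commands are not reproduced here, but a minimal sketch of the idea follows: plot the flowers by their petal measurements, let k-means discover three groups, and compare the groups with the true species.

library(ggplot2)
set.seed(20)
# k-means on the two petal measurements, asking for 3 clusters
irisCluster <- kmeans(iris[, c("Petal.Length", "Petal.Width")], centers = 3, nstart = 20)
# True species labels vs. discovered clusters
ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + geom_point()
ggplot(iris, aes(Petal.Length, Petal.Width, color = factor(irisCluster$cluster))) + geom_point()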
[Figure: Scatterplot of Petal.Width against Petal.Length for the iris data, with points colored by Species (setosa, versicolor, virginica).]
[Figure: The same scatterplot with points colored by the k-means cluster assignments (irisCluster$cluster).]
Summary
As you try out more functions (and different parameters for each of those functions), as well as explore other packages, you will realize
how easy and amazing it is to work with R. No wonder many data scientists start (and often
finish) with R for their needs.
It may be tempting to ask at this point which one is better – Python or R? I will not answer
that; I will leave it up to you. Sure, there are structural differences between them, and at
some level it is like comparing apples and oranges. But in the end, the choice often comes
down to personal preference, because you will discover that you could use either of these
tools to solve the problem at hand.
Key Terms
• Dataframe: Dataframe generally refers to “tabular” data, a data structure that represents
cases (represented by the rows), each of which consists of a number of observations or
measurements (represented by the columns). In R it is a special case of list where each
component is of equal length.
• Package: In R, packages are collections of functions and compiled code in a well-defined
format.
• Library: In R, the directory where the packages are stored is called the library. Often,
“package” and “library” are used interchangeably.
• Integrated Development Environment (IDE): This is an application that contains
various tools for writing, compiling, debugging, and running a program. Examples
include Eclipse, Spyder, and Visual Studio.
• Correlation: This indicates how closely two variables are related and ranges from −1
(negatively related) to +1 (positively related). A correlation of 0 indicates no relation
between the variables.
• Linear regression: Linear regression is an approach to model the relationship between
the outcome variable and predictor variable(s) by fitting a linear equation to observed
data.
• Machine learning: This is a field that explores the use of algorithms that can learn from
the data and use that knowledge to make predictions on data they have not seen before.
• Supervised learning: This is a branch of machine learning that includes problems where
a model could be built using the data and true labels or values.
• Unsupervised learning: This is a branch of machine learning that includes problems
where we do not have true labels for the data to train with. Instead, the goal is to somehow
organize the data into some meaningful clusters or densities.
• Predictor: A predictor variable is a variable that is used to predict or explain some other
variable or outcome. In an experiment, predictor variables are often independent vari-
ables, which are manipulated by the researcher rather than just measured.
• Outcome or response: Outcome or response variables are in most cases the dependent
variables which are observed and measured by changing the independent variables.
Conceptual Questions
Hands-On Problems
Problem 6.1
Write a multiplication script using either a “for” loop or a “while” loop. Show your script.
Problem 6.2
Like histogram, you can also plot the density of a variable. We have not seen this in this
chapter, but it is easy to do. Figure out how to plot density of income. Provide a couple of
sentences of description along with the plot.
Problem 6.3
Create a bar chart for housing type using the customers data (custdata.tsv, available from
OA 6.1). Make sure to remove the “NA” type. [Hint: You can use subset function with an
appropriate condition on housing type field.] Provide your commands and the plot.
Problem 6.4
Using the customers data (custdata.tsv, available from OA 6.1), extract a subset of custo-
mers that are married and have an income more than $50,000. What percentage of these
customers have health insurance? How does this percentage differ from that for the whole
dataset?
Problem 6.5
In the customers data (custdata.tsv, available from OA 6.1), do you think there is any
correlation between age, income, and number of vehicles? Report your correlation numbers
and interpretations. [Hint: Make sure to remove invalid data points, otherwise you may get
incorrect answers!]
Problem 6.6
You are given a data file containing observations for dating (see OA 6.6). Someone who
dated 1000 people (!) recorded data about how much that person travels (Miles), plays
games (Games), and eats ice cream (Icecream). With this, the decision about that person
(Like) is also noted. Use this data to answer the following questions using R:
a. Is there a relationship between eating ice cream and playing games? What about
traveling and playing games? Report correlation values for these and comment on them.
b. Let us use Miles to predict Games. Perform regression using Miles as the predictor and
Games as the response variable. Show the regression graph with the regression line.
Write the line equation.
c. Now let us see how well we can cluster the data based on the outcome (Like). Use Miles
and Games to plot the data and color the points using Like. Now cluster the data using
k-means and plot the same data using clustering information. Show the plot and compare
it with the previous plot. Provide your thoughts about how well your clustering worked
in two to four sentences.
If you are interested in learning more about programming in R, or the platform RStudio,
following are a few links that might be useful:
• https://www.r-bloggers.com/how-to-learn-r-2/
• https://www.rstudio.com/online-learning/
• https://cran.r-project.org/doc/manuals/R-intro.pdf
• https://www.tutorialspoint.com/r/
• http://www.cyclismo.org/tutorial/R/
R Tutorial on dataframes:
• http://www.r-tutor.com/r-introduction/data-frame
Notes
1. Site for downloading R: https://www.r-project.org
2. RStudio: https://www.rstudio.com/products/rstudio/download/
3. I hope you are now comfortable with installing and using libraries/packages. If at any time you get
an error regarding a package not found, go ahead and install it first.
4. Reynolds, P. S. (1994). Time-series analyses of beaver body temperatures. In Case Studies in
Biometry (eds. Lange, N., Ryan, L., Billard, L., Brillinger, D., Conquest, L., & Greenhouse, J.).
John Wiley & Sons, New York, ch. 11.
7 MySQL
“If we have data, let us look at data. If all we have are opinions, let us go with mine.”
— Jim Barksdale, former Netscape CEO
7.1 Introduction
So far, we have seen data that comes in a file – whether it is in a table, a CSV, or an XML
format. But text files (including CSV) are not the best way to store or transfer data when we
are dealing with large amounts of it. We need something better – something that allows
us not only to store data more effectively and efficiently, but also provides additional tools
to process that data. That is where databases come in. There are several databases in use
today, but MySQL tops them all in the free, open-source category. It is widely available and
used, and thanks to its powerful Structured Query Language (SQL), it is also a compre-
hensive solution for data storage and processing.
This chapter will introduce MySQL, the most popular open-source database platform in
the world. We will learn how to create and access structured data using MySQL. And by
now, since we already know some Python and R, we will see how we can integrate them
with MySQL. I should emphasize this last part – our goal is not to study SQL for the sake of
studying databases; instead, we are still interested in using Python or R as our main tool of
choice and simply replacing text files with SQL databases. And because of that, we will not
cover certain basic elements of SQL that are otherwise covered in an introduction to
MySQL, including creating databases and records, as well as defining keys and pointers
to indicate relationships among various entities. Instead, we will assume that the data is
already stored in the correct format, with appropriate relationships among different fields
and tables defined, and we will see how to retrieve and process data from such databases.
MySQL is a popular open-source database system, available for free. Most UNIX-based systems
come pre-installed with the server component, but one can install it on almost any system.
There are two primary components of MySQL: server and client. In case you are wonder-
ing, both of these are software. If you are on a UNIX or a Linux system (but not a Mac),
chances are you already have the MySQL server installed. If not, or if you are on a non-
UNIX system like Windows or a Mac without a pre-existing installation, you can download
the community version of the MySQL server from the MySQL community server.1 I will
not go into the details, but if you have ever done installation on your system this should be
no different. My suggestion would be to find an existing MySQL server – perhaps provided
by your school, your organization, or by a third-party website host – rather than trying to
install and configure it by yourself.
What is more important for us is the client software. Once again, on most UNIX or Linux
systems (but not a Mac), you should already have the client, which comes as a program or
utility that you can run straight from your terminal. So, if you are on a UNIX system, just
type “mysql” (later we will see the exact command). If you are on a Mac or a Windows, you
have two options: you can log in to a UNIX server using SSH and use the MySQL client
there, or you can install this client on your machine. In fact, you can download and install
Graphical User Interface (GUI)-based MySQL clients. An example of such a client is
MySQL Workbench,2 which is available for almost all platforms. If you are on a Mac, I
suggest Sequel Pro.3 Both of these are available for free, and on their websites you can see
instructions for installing and using them.
Once you have access to a MySQL server, you are ready to log in to it. Depending on the
kind of MySQL client you have, the way you log in to a MySQL server will vary. But no
matter what method you follow, you will need at least the following information: your
MySQL username, your MySQL password, and the server name or its IP address. This is
similar to connecting using SSH as we saw in Chapter 4.
Figure 7.1 Connecting to a MySQL server with a standard security measure using a client.
Figure 7.2 Connecting to a MySQL server with the SSH tunneling approach using a client.
Once connected, you can see tabs or a dropdown box that lists your databases. Once you
select a database you want to work with, you should see the tables within that database.
Since our focus here is on using MySQL as a storage format that we can query and process
data from, we will not worry about constructing tables or datasets. Instead, we will start
with existing datasets or import some data directly into a MySQL database, and proceed
with retrieving and analyzing that data.
Before we could do any retrieval, let us import some data into our database. If you have rights to
create a database on the server, you could run the following command on the MySQL prompt:
create database world;
This will create a database named “world”.
If you cannot create the database, then work with the one already assigned to you
(perhaps this happened through your school’s IT department or your instructor). It is OK
if that database is named differently, but unless you have at least one database available to
you, you will not be able to proceed further in this chapter.
Now, let us get some data. MySQL provides several example datasets. The one that we
are interested in getting here is called the world dataset and it can be downloaded from
MySQL downloads.4 Once downloaded, unzip the file to get world.sql. This is a text file
with SQL commands. You can, in fact, open it in a text editor to view its content.
We need this file on our server. Use your favorite FTP software (see Chapter 4 for details)
to connect to the server and transfer world.sql from your machine to the server.
Let us assume you copied this file to your home directory. Now, let us log in to the server
using SSH. Once logged in, run the “mysql” command (see the first section of this chapter)
to log into and start MySQL. Once you are at your MySQL prompt, let us first open the
database. Assuming your database is called “world”, issue the following command:
use world;
Alternatively, if you are using a GUI-based client, simply select that database by clicking on
its name in the dropdown box or wherever/however you see the existing databases. Now
you are working within the “world” database. Let us go ahead and import that world.sql file
in this database. Run:
source world.sql;
You will see lots of statements flying by on your console. Hopefully everything runs
smoothly and you get your MySQL prompt back. That’s it. You have just imported a
whole lot of data into your database.
If you are using a GUI-based MySQL client, you could import this data with a few clicks.
First, make sure the correct database is opened or selected in your client. Then find an
option from the File menu that says “Import . . .”. Once you click that, you will be able to
browse your local directories to find world.sql. Once selected, your MySQL client should
be able to import that file.
Just in case you are wondering how to create the same data manually, here are some
instructions. If you were able to do the previous section successfully, you should simply
skip this section (otherwise you would encounter many errors and duplicate data!).
First, make sure you have the right database open. To do so, once you are at your MySQL
prompt, enter:
use world;
If we want to create a table “City” that stores information about cities, here is the full
command:
CREATE TABLE `City` (
  `ID` int(11) NOT NULL auto_increment,
  `Name` char(35) NOT NULL default '',
  `CountryCode` char(3) NOT NULL default '',
  `District` char(20) NOT NULL default '',
  `Population` int(11) NOT NULL default '0',
  PRIMARY KEY (`ID`)
);
Here, we are saying that we want to create a table named “City” with five fields: ID,
Name, CountryCode, District, and Population. Each of these fields has different character-
istics, which include the type of data that will be stored in that field and default value. For
instance, ID field will store numbers (int), will not have a null (non-existence) value, and
will have value automatically incremented as new records are added. We are also declaring
that “ID” is our primary key, which means whenever we want to refer to a record, we could
use the “ID” value; it will be unique and non-empty.
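For completeness, here is a sketch of what inserting a record into this table would look like (the values are made up purely for illustration; ID is filled in automatically thanks to auto_increment):

INSERT INTO City (Name, CountryCode, District, Population)
VALUES ('Springfield', 'USA', 'Illinois', 116250);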
Most of the time in data science, you will be reading records from a database rather than inserting them.
Even if you want to insert a record or two at times, or edit them, you are better off using a
GUI-based MySQL client. With such a client, you could enter a record or edit an existing
record very much like how you would in a spreadsheet program.
As noted above, fetching or reading the records from a database is what you would be doing
most times, and that is what we are going to see in detail now. For the examples here, we
will assume you are using a terminal-based MySQL client. If you are using a GUI, well,
things will be easier and more straightforward, and I will leave it to you to play around and
see if you can do the same kind of things as described below.
To see what tables you have available within a database, you can enter the following at the
MySQL prompt:
show tables;
How do you do the same in a GUI-based MySQL client? By simply selecting the
database. Yes, the client should show you the tables a database has once you select it.
To find out the structure of a table, you can use the “describe” command on your MySQL
command prompt. For instance, to know the structure of the “Country” table in our “world”
database, you can enter:
describe Country;
To extract information from MySQL tables, the primary command you have is “select”. It is
a very versatile and useful command. Let us see some examples.
To retrieve all the records from table “City”:
SELECT * FROM City;
To see how many records “City” has:
SELECT count(*) FROM City;
To get a set of records matching some criteria:
SELECT * FROM City WHERE population>7000000;
This will fetch us records from “City” where the value of “Population” is greater than
7,000,000.
Figure 7.3 Example of running an SQL query in a GUI-based MySQL client (here, Sequel Pro).
7.5 Searching in MySQL
In this section, we will see how to do searching within MySQL. There are two primary
ways: using the “LIKE” expression and using the “MATCH..AGAINST” expression. The
former can be used without doing any extra work but has limitations in terms of the kinds of
fields it works on and the way searching is done. The latter requires us to build a full text
index. If you have a lot of textual data, it is a good idea to go with the latter option.
Even without doing anything extra, our MySQL database is ready to give us text-search
functionalities. Let us give it a spin with our “world” database. Try the queries below, and
similar ones, and see what you get:
SELECT * FROM Country WHERE HeadOfState LIKE '%bush%';
SELECT * FROM Country WHERE HeadOfState LIKE '%elisa%';
SELECT * FROM Country WHERE HeadOfState LIKE '%II%';
You might notice that in the above expressions, “%” acts as a wildcard. Thus, looking for
%elisa% gives all the records that have “elisa” as a substring.
Now let us take this a step further and see how MySQL supports more sophisticated full-
text searching. Add an index to the “Country” table by issuing the following command:
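The exact command is not reproduced here; a minimal sketch that builds a full-text index on the HeadOfState field used in the queries above (a full-text index can cover several text columns if needed):

ALTER TABLE Country ADD FULLTEXT(HeadOfState);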
Table 7.1 Comparison of LIKE and MATCH approaches for database searching.
LIKE MATCH
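With the index in place, a full-text search uses the MATCH ... AGAINST syntax; for example (a sketch):

SELECT * FROM Country WHERE MATCH(HeadOfState) AGAINST('elisa');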
Since the above query does not contain any wildcards, you should get the results with
records where head of state has “elisa” as a full word. Can you obtain the same set of results
using the LIKE expression?
The real question is: Why would we want to create an index if we could do searches using
the LIKE expression? A comparison between the above two approaches is given in
Table 7.1.
Now, to answer the above question, while usage of MATCH requires you to create an
index and induces a slight overhead (in terms of processing power and memory needs)
while writing a record, it helps significantly during searches. Without an index, MySQL
goes record-by-record looking for an expression (serial scanning approach). This is ineffi-
cient and impractical for large datasets. Indexing allows MySQL to organize the informa-
tion in a better data structure that can reduce the search time significantly. On top of that,
MySQL also removes stop words from the text while indexing. Stop words are words that
are not useful for storage or matching. Typically, these words include the most frequent
words used in a language (e.g., in English, articles and forms of “to be,” such as a, an, the,
is, are, etc.). In addition to these words, MySQL also discounts all the words that occur in
more than 50% of the records or are shorter than three characters. Note that the MySQL stop
words list is available at MySQL Full-Text Stopwords.5
1. Search for population in the last table where Name contains “US.”
2. Search for records in the Country table where the head of state’s name ends with “i” and the country
name starts with a “U.”
We will now incorporate MySQL into other data science programming tools or environments
we know. This way, instead of retrieving the data and then analyzing it separately, we can
query the database and process the results from within the same program.
We will now see how to access MySQL using R. As you might have guessed, to use MySQL
through R, we need a package. This time, it is “RMySQL.” First install the package if you
do not already have it, and then load it in the environment:
> install.packages("RMySQL")
> library(RMySQL)
Now, let us connect to our database server and select the database. This is equivalent to
specifying the parameters in your MySQL client:
> mydb = dbConnect(MySQL(), user='bugs', password='bunny',
dbname='world', host='server.com')
In this command, we are connecting to a MySQL server “server.com” with user “bugs” and
password “bunny.” We are also opening “world” database.
If we want to see the tables available in the database we just opened, we can run the
following:
> dbListTables(mydb)
Now, let us run a query to retrieve some results.
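The specific query is not shown here; as a sketch, RMySQL (through the DBI interface) lets you send any SQL string with dbGetQuery(), which returns the result as a data frame:

> bigcities = dbGetQuery(mydb, "SELECT Name, Population FROM City WHERE Population > 7000000")
> head(bigcities)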
As we noted earlier in this chapter, there is a good reason we devoted this chapter to
MySQL; it is the most popular open-source, free database. But there are many other
choices, and it may be possible that you end up working at an organization where one of
these other choices is used. So, let us cover a few of them before we finish this chapter.
7.8.1 NoSQL
NoSQL, which stands for “not only SQL,” is a new approach to database design that goes
beyond the relational database like MySQL and can accommodate a wide variety of data
models, such as key-value, document, columnar, and graph formats. NoSQL databases are
most useful for working with large datasets that are distributed.
The name “NoSQL” sometimes is associated with early database designs that predate the
relational database management system (RDBMS). However, in general NoSQL refers to
the databases built in the early twenty-first century that were purposely designed to create
large-scale database clusters for cloud and Web applications, where performance and
scalability requirements surpassed the need for the rigid data consistency that the
RDBMS provided for transactional applications.
The basic NoSQL database classifications (key-value, document, wide columns, graph)
only serve as guidelines. Vendors, over time, have mixed and matched elements from
different NoSQL database families to create more useful systems. Popular implementations
of NoSQL include MongoDB, Redis, Google Bigtable, etc.
7.8.2 MongoDB
MongoDB is a cross-platform NoSQL database program that supports storage and retrieval
of unstructured data such as documents. To support document-oriented database programs,
MongoDB relies on a JSON-like structure of documents with flexible schemata. Data records stored
in MongoDB are BSON documents (a binary-encoded extension of the JSON format) and hence
map naturally onto JavaScript-style objects. MongoDB documents are composed
of field-and-value pairs and have the following structure:
{
field1: value1 [e.g., name: “Marie”]
field2: value2 [e.g., sex: “Female”]
...
fieldN: valueN [e.g., email: marie@abc.com]
}
One of the significant advantages of MongoDB over MySQL is that, unlike the latter, in
MongoDB there are no restrictions on schema design. The schema-free implementation of a
database in MongoDB eliminates the need for prerequisites of defining a fixed structure like
tables and columns in MySQL. However, schema-less documents in MongoDB, where it is
possible to store any information, may cause problems with data consistency.
has the in-built BI Engine (business intelligence engine) to support the user’s data
analysis requirement.
• The other advantage of having a serverless solution is that such an implementation enables
the data storage to be separated from the computation, which in turn makes seamless
scaling of the data storage possible.
While MySQL has obvious benefits of large userbase, compatibility with all major
platforms, and cost-effectiveness as an open-source platform, it simply cannot support
real-time analytics at scale the way BigQuery can, or at least for now. And the reason behind
this lies in how these two store data internally. Relational databases like MySQL store data
in row form, meaning that all data rows are stored together and the primary key acts as an
index which makes the data easily accessible, whereas BigQuery uses a columnar structure,
meaning the data is stored in columns instead of rows. The row form is great for transac-
tional purposes – like reading rows by ID – but it is inefficient if you wish to get analytical
insights from your data, as the row-form storage requires that you read through the entire
database, along with unused columns, to produce results.
Summary
Databases allow us to store, retrieve, and process data in an effective and efficient manner.
Of all the databases out there, MySQL is the most popular open-source database. It is free, it
is open, and it is widely available on all kinds of platforms. There are two parts to it – server
and client. While it is possible to have the server part installed on your machine, usually you
are going to use an actual server (one that your organization provides) and have the client
installed on your computer to access that server. In this chapter, we saw how to do that.
We also saw how easy it is to take a database file (with .sql extension) and import into a
MySQL database, and to insert records using either the SQL or the client software.
However, our focus here is on retrieving and processing the data, so we spent most of
this chapter doing that.
Once again, we have only scratched the surface with MySQL. The more you learn about
SQL and the more data problems you practice with, the more skilled you will become.
Having said that, what we practiced here should allow you to address many kinds of data
problems. Make sure to do the exercises that follow here to hone your SQL skills.
Key Terms
Conceptual Questions
1. The “world” database example we saw in this chapter is said to contain structured
data. Why?
2. What is SSH tunneling? When is it needed?
3. What are the three pieces of information one needs to connect to a MySQL server? What
additional pieces of information are needed if that connection requires SSH tunneling?
4. How do “LIKE” and “MATCH” expressions for database searching differ?
5. “IP” is not a stop word, and yet you are not able to search for it in a MySQL database
because MySQL does not index it. Why is MySQL not indexing it?
Hands-On Problems
Problem 7.1
Answer the following questions using the “world” database, available for download from
OA 7.2. Present your SQL queries and/or processes that you used to derive the answers.
a. How many countries became independent in the twentieth century?
b. How many people in the world are expected to live for 75 years or more?
c. List the 10 most populated countries in the world with their population as percentage of
the world population. [Hint: You can first find the population of the world and then use it
for percentage for countries, so something like: select Population/5000000000 from
Country . . ..]
d. List the top 10 countries with the highest population density. [Hint: For population
density, you can try something like: select Population/SurfaceArea from Country
where . . ..]
Problem 7.2
Answer the following questions using the “auto” database, available for download from
OA 7.3.
a. Let us use Python to explore the relationship of different variables to miles per gallon
(mpg). Find out which of the variables have high correlation with mpg. Report those
values. Build a regression model using one of those variables to predict mpg. Do the
same using two of those variables. Report your models along with the regression line
equations.
b. Let us use R to understand how horsepower and weights are related to each other. Plot
them using a scatterplot and color the data points using mpg. Do you see anything
interesting/useful here? Report your observations with this plot. Now let us cluster the
data on this plane in a “reasonable” number of groups. Show your plot where the data
points are now colored with the cluster information and provide your interpretations.
Problem 7.3
For the following exercise, first download the AIS Dynamic Data available from OA 7.4.
This data is collected by a Naval Academy receiver and is available from “Heterogeneous
integrated dataset for maritime intelligence, surveillance, and reconnaissance.” Using the
dataset, answer the following questions:
a. How many unique vessels are available in the dataset?
b. List the number of records available for each vessel in the dataset.
c. Find out the spatial (latitude and longitude) and temporal coverage of each vessel in the
dataset.
d. Let us use R to understand the relation between speed over the ground and spatial
coverage of the vessels that have multiple records.
Following are a few resources that may be useful if you are interested in learning more
about MySQL in general:
1. https://www.w3schools.com/sql/default.asp
2. https://dev.mysql.com/doc/mysql-shell-excerpt/5.7/en/
3. https://dev.mysql.com/doc/refman/5.7/en/tutorial.html
4. https://www.tutorialspoint.com/mysql/
5. https://www.javatpoint.com/mysql-tutorial
Notes
1. MySQL community server: http://dev.mysql.com/downloads/mysql/
2. MySQL Workbench: http://www.mysql.com/products/workbench/
3. Sequel Pro download: http://sequelpro.com/
4. MySQL downloads: http://downloads.mysql.com/docs/world.sql.zip
5. MySQL Full-Text Stopwords: http://dev.mysql.com/doc/refman/5.7/en/fulltext-stopwords.html
6. GitHub for PyMySQL download: https://github.com/PyMySQL/PyMySQL
PART III
8 Machine Learning Introduction and Regression
“People worry that computers will get too smart and take over the world, but the real
problem is that they’re too stupid and they’ve already taken over the world.”
— Pedro Domingos
8.1 Introduction
So far, our work on data science problems has primarily involved applying statistical
techniques to analyze the data and derive some conclusions or insights. But there are
times when it is not as simple as that. Sometimes we want to learn something from that
data and use that learning or knowledge to solve not only the current problem but also future
data problems. We might want to look at shopping data at a grocery chain, combined with
farming and poultry data, and learn how supply and demand are related. This would enable
us to make recommendations for investments in both the grocery store and the food
industries. In addition, we want to keep updating the knowledge – often called a model –
derived from analyzing the data so far. Fortunately, there is a systematic way for tackling
such data problems. In fact, we have already seen this in the previous chapters: machine
learning.
In this chapter, we will introduce machine learning with a few definitions and examples.
Then, we will look at a large class of problems in machine learning called regression. This is
not the first time we have encountered regression. The first time we covered it was in
Chapter 3 while discussing various statistical techniques. And then, if you went through
either the Python or R chapter earlier, you would have seen regression in action. Here, we
will approach regression as a learning problem and study linear regression by way of
applying a linear model as well as gradient descent.
In subsequent chapters (Chapters 9 and 10) we will see specific kinds of learning –
supervised and unsupervised. But first, let us start by introducing machine learning.
Machine learning is a spin-off or a subset of artificial intelligence (AI), and in this book it is
an application of data science skills. Here, the goal, according to Arthur Samuel,1 is to give
“computers the ability to learn without being explicitly programmed.” Tom Mitchell2 puts it
more formally: “A computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P if its performance at tasks in T, as
measured by P, improves with experience E.”
Now that we know what machine learning is, in principle, let us see what it does and why.
First, we must consider the following questions:3 What is learning, anyway? What is it that
the machine is trying to learn here?
These are deep philosophical questions. But we will not be too concerned with philoso-
phy, as our emphasis is firmly on the practical side of machine learning. However, it is
worth spending a few moments at the outset on fundamental issues, just to see how tricky
they are, before rolling up our sleeves and looking at machine learning in practice.
For a moment, let us forget about the machine and think about learning in general. The
New Oxford American Dictionary (third edition)4 defines “to learn” as: “To get knowledge
of something by study, experience, or being taught; to become aware by information or
from observation; to commit to memory; to be informed of or to ascertain; to receive
instruction.”
All of these meanings have limitations when they are associated with computers or
machines. With the first two meanings, it is virtually impossible to test whether learning has
been achieved or not. How can you check whether a machine has obtained knowledge of
something? You probably cannot ask it questions; and even if you could, you would not be
testing its ability to learn but its ability to answer questions. How can you tell whether it has
become aware of something? The whole question of whether computers can be aware, or
conscious, is a burning philosophical issue.
As for the last three meanings, although we can see what they denote in human terms,
merely committing to memory and receiving instruction seems to fall short of what we
might mean by machine learning. These are too passive, and we know that these tasks are
trivial for today’s computers. Instead, we are interested in improvements in performance, or
at least in the potential for performance, in new situations. You can commit something to
memory or be informed of something by rote learning without being able to apply the new
knowledge to new situations. In other words, you can receive instruction without benefit-
ting from it at all.
Therefore, it is important to come up with a new operational definition of learning in the
context of the machine, which we can formulate as:
Things learn when they change their behavior in a way that makes them perform better in
the future.
This ties learning to performance rather than knowledge. You can test learning by observing
present behavior and comparing it with past behavior. This is a more objective kind of
definition and is more satisfactory for our purposes. Of course, the more comprehensive and
formal definition based on this idea is what we saw before by Tom Mitchell.
In the context of this definition, machine learning explores the use of algorithms that can
learn from the data and use that knowledge to make predictions on data they have not seen
before – such algorithms are designed to overcome strictly static program instructions by
making data-driven predictions or decisions through building a model from sample inputs.
While quite a few machine learning algorithms have been around for a long time, the ability
to automatically apply complex mathematical calculations to big data in an efficient manner
is a recent development. Following are a few widely publicized examples of machine
learning applications you may be familiar with.
The first is the heavily hyped, self-driving Google car (now rebranded as WAYMO). As
shown in Figure 8.1, this car is taking a real view of the road to recognize objects and
patterns such as sky, road signs, and moving vehicles in a different lane. This process itself
is quite complicated for a machine to do. A lot of things may look like a car (that blue blob
in the bottom image is a car), and it may not be easy to identify where a street sign is. The
self-driving car needs not only to carry out such object recognition, but also to make
decisions about navigation. There is just so much unknown involved here that it is
impossible to come up with an algorithm (a set of instructions) for a car to execute.
Instead, the car needs to know the rules of driving, have the ability to do object and pattern
Figure 8.1 Machine learning technology behind self-driving car. (Source: YouTube: Deep Learning: Technology behind self-
driving car.5)
recognition, and apply these to making decisions in real time. In addition, it needs to keep
improving. That is where machine learning comes into play.
Another classic example of machine learning is optical character recognition (OCR).
Humans are good with recognizing hand-written characters, but computers are not. Why?
Because there are too many variations in any one character that can be written, and there is
no way we could teach a computer all those variations. And then, of course, there may be
noise – an unfinished character, joining with another character, some unrelated stuff in the
background, an angle at which the character is being read, etc. So, once again, what we need
is a basic set of rules that tells the computer what “A,” “a,” “5,” etc., look like, and then have
it make a decision based on pattern recognition. The way this happens is by showing several
versions of a character to the computer so it learns that character, just like a child will do
through repetitions, and then have it go through the recognition process (Figure 8.2).
Let us take an example that is perhaps more relevant to everyday life. If you have used
any online services, chances are you have come across recommendations. Take, for
instance, services such as Amazon and Netflix. How do they know what products to
recommend? We understand that they are monitoring our activities, that they have our
past records, and that is how they are able to give us suggestions. But how exactly? They use
something called collaborative filtering (CF). This is a method that uses your past behavior
and compares its similarities with the behaviors of other users in that community to figure
out what you may like in the future.
Take a look at Table 8.1. Here, there are data about four people’s ratings for different
movies. And the objective for a system here is to figure out if Person 5 will like a movie or
not based on that data as well as her own movie likings from the past. In other words, it is
trying to learn what kinds of things Person 5 likes (and dislikes), what others similar to
Person 5 like, and uses that knowledge to make new recommendations. On top of that, as
Person 5 accepts or rejects its recommendations, the system extends its learning to include
knowledge about how Person 5 responds to its suggestions, and further corrects its models.
Table 8.1 Movie ratings from five people; "?" marks a rating we do not have.

           Movie 1   Movie 2   Movie 3   Movie 4   Movie 5
Person 1      4         5         3         4         2
Person 2      3         2         3         4         4
Person 3      4         3         4         5         3
Person 4      3         4         4         5         2
Person 5      4         ?         4         ?         4
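To make the idea concrete, here is a toy sketch in R (an illustration of the intuition only, not the actual algorithm such services use): encode the ratings from Table 8.1, measure how close Person 5 is to each of the others on the movies they have both rated, and borrow the closest person's ratings for the missing entries.

# Ratings from Table 8.1; NA marks the movies Person 5 has not rated yet
ratings <- rbind(
  person1 = c(4, 5, 3, 4, 2),
  person2 = c(3, 2, 3, 4, 4),
  person3 = c(4, 3, 4, 5, 3),
  person4 = c(3, 4, 4, 5, 2),
  person5 = c(4, NA, 4, NA, 4)
)
# Average absolute difference from Person 5 on co-rated movies (smaller = more similar)
similarity <- apply(ratings[1:4, ], 1,
                    function(r) mean(abs(r - ratings["person5", ]), na.rm = TRUE))
similarity
# Person 3 comes out closest, so Person 3's ratings (3 and 5) could serve as
# predictions for Person 5's two missing movies.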
Here are a couple more examples. Facebook uses machine learning to personalize each
member’s news feed. Most financial institutions use machine learning algorithms to detect
fraud. Intelligence agencies use machine learning to sift through mounds of information to
look for credible threats of terrorism.
There are many other applications that we encounter in daily life when machine learning
is working one way or another. In fact, it is almost impossible to finish our day without
having used something that is driven by machine learning. Did you do any online browsing
or searching today? Did you go to a grocery store? Did you use a social media app on your
phone? Then you have used machine learning applications.
So, are you convinced that machine learning is a very important field of study? If the
answer is “yes” and you are wondering what it takes to create a good machine learning
system, then the following list of criteria from SAS6 may help:
a. Data preparation capabilities.
b. Algorithms – basic and advanced.
c. Automation and iterative processes.
d. Scalability.
e. Ensemble modeling.
In this chapter, we will primarily focus on the second criterion: algorithms. More specifi-
cally, we will see some of the most important techniques and algorithms for developing
machine learning applications.
We will note here that, in most cases, the application of machine learning is entwined
with the application of statistical analysis. Therefore, it is important to remember the
differences in the nomenclature of these two fields.
• In machine learning, a target is called a label.
• In statistics, a target is called a dependent variable.
• A variable in statistics is called a feature in machine learning.
• A transformation in statistics is called feature creation in machine learning.
Machine learning algorithms are organized into a taxonomy, based on the desired out-
come of the algorithm. Common algorithm types include:
a. Supervised learning. When we know the labels on the training examples we are using to
learn.
b. Unsupervised learning. When we do not know the labels (or even the number of labels or
classes) from the training examples we are using for learning.
c. Reinforcement learning. When we want to provide feedback to the system based on how
it performs with training examples.
Let us go through these systematically in the following sections, working with examples
and applying our data science tools and techniques.
However, it has been our experience that most intelligent tasks require some ability to induce new
knowledge from past experience.
This inducing knowledge is achieved through an explicit set of rules or using machine learning that
can extract some form of information automatically (i.e., without any constant human moderation).
Recently, we have seen machine learning becoming so successful that when we see AI mentioned, it
almost invariably refers to some form of machine learning.
In comparison, data mining as a field has taken much of its inspiration and techniques from machine
learning and some from statistics, but with a different goal. Data mining can be performed by a human
expert on a specific dataset, often with a clear end goal in mind. Typically, the goal is to leverage the
power of various algorithms from machine learning and statistics to discover insights to a problem
where knowledge is limited. Thus, data mining can use other techniques besides or on top of machine
learning.
Let us take an example to disentangle these two closely related concepts. Whenever you go to Yelp, a
popular platform for reviewing local businesses, you see a list of recommendations based on your
location, past reviews, time, weather, and other factors. Any such review platform employs machine
learning algorithms in its back end, where the goal is to provide an effective list of recommendations to
cater to the needs of different users. However, at a lower level, the platform is running a set of data
mining applications on the huge dataset it has accumulated from your past interaction with the platform
and leveraging that to predict what might be of interest to you. So, for a game day, it might recommend
a nearby wings and beer place, whereas on a rainy day it might suggest a place where you can get hot
soup delivered.
8.3 Regression
Our first stop is regression. Think about it as a much more sophisticated version of
extrapolation. For example, if you know the relationship between education and income
(the more someone is educated, the more money they make), we could predict someone’s
income based on their education. Simply speaking, learning such a relationship is
regression.
In more technical terms, regression is concerned with modeling the relationship between
variables of interest. These relationships use some measures of error in the predictions to
refine the models iteratively. In other words, regression is a process.8
We can learn that two variables are related in some way (e.g., through correlation), but if there is
such a relationship, can we figure out whether and how one variable could predict the other?
Linear regression allows us to do that. Specifically, we want to see how a variable X affects a
variable y. Here, X is called the independent variable or predictor; y is called the dependent
variable or response. Take a note of the notation here. The X is in uppercase because it could
have multiple feature vectors, making it a feature matrix. If we are dealing with only a
Figure 8.3 An example showing a relationship between annual return and excess return of stock using linear regression from
the stock portfolio dataset.9
single feature for X, we may decide to use the lowercase x. On the other hand, y is in
lowercase because it is a single value or feature being predicted.
As mentioned previously, linear regression fits a line (or plane, or hyperplane) to the
dataset. For example, in Figure 8.3, we want to predict the annual return using excess return
of stock in a stock portfolio. The line represents the relation between these two variables.
Here, it happens to be quite linear (see most of the data points close to the line), but such is
not always the case.
Some of the most popular regression algorithms are:
• Ordinary least squares regression (OLSR)
• Linear regression
• Logistic regression
• Stepwise regression
• Multivariate adaptive regression splines (MARS)
• Locally estimated scatterplot smoothing (LOESS)
Since linear regression is covered in an earlier chapter, here we will move to something
more general and a lot more useful in machine learning. For this, we need to take a step back
and think about how linear regression is solved. Take a look at Figure 8.3. Imagine we only
have those dots (data points) and no line. We can draw a random line and see how well it fits
the data. For this, we can find the distance of each data point from that line and add it all up.
That gives a number, often called cost or error. Now, let us draw another line and repeat the
process. We get another number for cost. If this is lower than the previous one, the new line
is better. If we keep repeating this until we find a line that gives us the lowest cost or error,
we have found our most fitting line, solving the problem of regression.
How about generalizing this process? Imagine we have some function or procedure for
finding the cost, given our data, and then the objective is to keep adjusting how the function
operates by picking different values for its input or parameters and see if we could lower the
cost. Whenever we find the lowest cost, we stop and note the values of the parameters to that
function. And those parameter values construe the most fitting model for the data. This
model could be a line, a plane, or in general a function. This is the essence of a technique
called gradient descent.
Table 8.2 A simple dataset with one predictor (x) and one outcome (y).

 x    y
 1    3
 2    4
 3    8
 4    4
 5    6
 6    9
 7    8
 8   12
 9   15
10   26
11   35
12   40
13   45
14   54
15   49
16   59
17   60
18   62
19   63
20   68
[Figure: Scatterplot of the data from Table 8.2 (y plotted against x).]
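The fit itself comes from a single call to lm(); as a sketch, assuming the data from Table 8.2 is loaded in a data frame named regressionData (the name used with ggplot2 below):

> lm(y ~ x, data = regressionData)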
(Intercept) x
-10.263 3.977
What follows the “lm” command is the output, where we can see the values of b (intercept) and m
(coefficient of x). Well, that was easy. But how did R come up with this solution? Let us dig deeper. Before
we move forward, though, let us see how this model looks. In this case, the model is a line. Let us plot it on
our data:
> ggplot(regressionData, aes(x=x, y=y)) + geom_point() +
stat_smooth(method="lm")
[Figure: The data from Table 8.2 with the fitted line overlaid.]
Obviously, this is a better fit, but it could also be an overfit. That means the model has learned the existing
data so well that it has very little error in explaining that data, but, on the flip side, it may be difficult for it to
adapt to a new kind of data. Do you know what stereotyping is? Well, that is us overfitting the data or
observation so far, and while we may have reasons for developing that stereotypical view of a given
phenomenon, it prevents us from easily accepting data that do not fit our preconceived notions. Think about it.
Now, let us get back to that line. It is possible to fit multiple lines to the same dataset, each
represented by the same equation but with different m and b values. Our job is to find the
best one, which will represent the dataset better than the other lines. In other words, we need
to find the best set of m and b values.
A standard approach to solving this problem is to define an error function (sometimes
also known as a cost function) that measures how good a given line is. This function will
take in a (m, b) pair and return an error value based on how well the line fits our data. To
compute this error for a given line, we will iterate through each (x, y) point in our dataset
and sum the square distances between each point’s y value and the candidate line’s y value
(computed at mx + b).
Formally, this error function looks like:
\epsilon = \frac{1}{n} \sum_{i=1}^{n} \big( (m x_i + b) - y_i \big)^2 .    (8.1)
We have squared the distance to ensure that it is positive and to make our error function
differentiable. Note that normally we will use m to indicate number of data points, but here
we are using that letter to indicate the slope, so we have made an exception and used n. Also
note that often the intercept for a line equation is represented using c instead of b, as we have
done.
The error function is defined in such a way that the lines that fit our data better will result
in lower error values. If we minimize this function, we will get the best line for our data.
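In R, this error function takes only a couple of lines (a sketch, using the same x and y columns of regressionData):

error <- function(m, b, x, y) {
  mean(((m * x + b) - y)^2)    # Equation (8.1)
}
# Two candidate lines: the second has a much lower error, so it fits better
error(1, 0, regressionData$x, regressionData$y)
error(4, -10, regressionData$x, regressionData$y)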
Since our error function consists of two parameters (m and b), we can visualize it as a 3D
surface. Figure 8.7 depicts what it looks like for our dataset.
Each point in this 3D space represents a line. Let that sink in for a bit. Each point in this
3D figure represents a line. Can you see how? We have three dimensions: slope (m), y-
intercept (b), and error. Each point has values for these three, and that is what gives us the
line (technically, just the m and the b). In other words, this 3D figure presents a whole bunch
of possible lines we could have to fit the data shown in Table 8.2, allowing us to see which
line is the best.
The height of the function at each point is the error value for that line. You can see that
some lines yield smaller error values than others (i.e., fit our data better). The darker blue
Figure 8.7 Error surface for various lines created using linear regression (x represents slope, y represents intercept and z is the
error value).
color indicates a lower error function value and a better fit to our data. We can find
the best m and b pair, the one that minimizes the cost function, using gradient descent.
Gradient descent is an approach for looking for minima – points where the error is at its
lowest. When we run a gradient descent search, we start from some location on this surface
and move downhill to find the line with the lowest error.
To run gradient descent on this error function, we first need to compute its gradient or
slope. The gradient will act like a compass and always point us downhill. To compute it, we
will need to differentiate our error function. Since our function is defined by two parameters
(m and b), we will need to compute a partial derivative for each. These derivatives work out
to be:
\frac{\partial \epsilon}{\partial m} = \frac{2}{n} \sum_{i=1}^{n} \left( (m x_i + b) - y_i \right) \frac{\partial}{\partial m}\left( (m x_i + b) - y_i \right)
= \frac{2}{n} \sum_{i=1}^{n} \left( (m x_i + b) - y_i \right) x_i    (8.2)
and
\frac{\partial \epsilon}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} \left( (m x_i + b) - y_i \right) \frac{\partial}{\partial b}\left( (m x_i + b) - y_i \right)
= \frac{2}{n} \sum_{i=1}^{n} \left( (m x_i + b) - y_i \right).    (8.3)
Now we know how to run gradient descent and get the smallest error. We can initialize
our search to start at any pair of m and b values (i.e., any line) and let the gradient descent
algorithm march downhill on our error function toward the best line. Each iteration will
update m and b to a line that yields slightly lower error than the previous iteration. The
direction to move in for each iteration is calculated using the two partial derivatives from
the above two equations.
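As an illustration (a sketch of ours, not the book's code), a single such update of m and b could be written in R as follows, using the partial derivatives from Equations 8.2 and 8.3:

# Data from Table 8.2
x <- 1:20
y <- c(3, 4, 8, 4, 6, 9, 8, 12, 15, 26, 35, 40, 45, 54, 49, 59, 60, 62, 63, 68)

# One gradient descent update for the line y = m*x + b
gradient_step <- function(m, b, x, y, alpha) {
  n <- length(y)
  residual <- (m * x + b) - y
  grad_m <- (2 / n) * sum(residual * x)   # Equation 8.2
  grad_b <- (2 / n) * sum(residual)       # Equation 8.3
  c(m = m - alpha * grad_m, b = b - alpha * grad_b)
}

# One step from m = 0, b = 0; repeating this update a few thousand times with a
# small enough alpha moves (m, b) toward the lm() estimates found earlier
gradient_step(0, 0, x, y, alpha = 0.005)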
Let us now generalize this. In the above example, m and b were the parameters we were
trying to estimate. But there could be many parameters in a problem, depending on the
dimensionality of the data or the number of features available. We will refer to these
parameters as θ values, and it is the job of the learning algorithm to estimate the best
possible values of the θ.
think, since they will not be able to come up with such things on their own, somehow, they would not
be so good at doing data science, or, specifically here, machine learning. And I am telling you (and
them) that it is a big misunderstanding.
Unless I am teaching machine learning to those who are focused on the algorithms themselves (as
opposed to their applications), I present these derivations as a way to convey intuition behind those
algorithms. The actual math is not that important; it is simply a way to present an idea in a compact
form.
So, there – think of all this math as just a shorthand writing to communicate complex, but still
intuitive, ideas. Was there a time in your life when you did not know what “LOL,” “ICYMI,” and “IMHO”
stood for? But now these are standard abbreviations that people use all the time – not intimidating at
all. Think about all this math the same way – you do not have to invent it; you just have to accept this
special language of abbreviations and understand the underlying concepts, ideas, and intuitions.
Earlier we defined an error function using a model built with two parameters (mxi + b).
Now, let us generalize it. Imagine that we have a model that could have any number of
parameters. Since this model is built using training examples, we would call it a
hypothesis function and represent it using h. It can be defined as
h(x) = \sum_{i=0}^{n} \theta_i x_i.    (8.4)
If we consider θ0 = b, θ1 = m, and assign x0 = 1, we can derive our line equation using the
above hypothesis function. In other words, a line equation is a special case of this
function.
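As a quick illustration (our own sketch, not the book's code), the hypothesis function of Equation 8.4 is just a dot product, and choosing theta0 = b and theta1 = m with x0 = 1 recovers the line:

# Hypothesis function h(x) = sum of theta_i * x_i (Equation 8.4)
h <- function(theta, x) sum(theta * x)

# With theta = (b, m) and the input vector (x0 = 1, x), this is the line m*x + b
theta <- c(-10.263, 3.977)   # (b, m) taken from the lm() output shown earlier
h(theta, c(1, 5))            # same as 3.977 * 5 - 10.263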
Now, just as we defined the error function using the line equation, we could define a cost
function using the above hypothesis function as in the following:
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h(x_i) - y_i \right)^2.    (8.5)
Compare this to the error function defined earlier. Yes, we are now back to using m to
represent number of samples or data points. And we have also added a scaling factor
of ½ in the mix, which is purely out of convenience, as you will see soon.
And just as we did before, finding the best values for our parameters means chasing the
slope for each of them and trying to reach as low cost as possible. In other words, we are
trying to minimize J(θ) and we will do that by following its slope along each parameter. Let
us say we are doing this for parameter θj. That means we will take the partial derivative of J
(θ) with respect to θj:
\frac{\partial}{\partial \theta_j} J(\theta)
= \frac{1}{2m} \frac{\partial}{\partial \theta_j} \sum_{i=1}^{m} \left( h(x_i) - y_i \right)^2
= \frac{2}{2m} \sum_{i=1}^{m} \left( h(x_i) - y_i \right) \frac{\partial}{\partial \theta_j}\left( h(x_i) - y_i \right)
= \frac{1}{m} \sum_{i=1}^{m} \left( h(x_i) - y_i \right) \frac{\partial}{\partial \theta_j}\left( \theta_0 x_{i0} + \theta_1 x_{i1} + \ldots + \theta_j x_{ij} + \ldots + \theta_n x_{in} - y_i \right)
= \frac{1}{m} \sum_{i=1}^{m} \left( h(x_i) - y_i \right) x_{ij}.    (8.6)
This gives us our learning algorithm or rule, called gradient descent, as in the following:
\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h(x_i) - y_i \right) x_{ij}.    (8.7)
This means we update θj (overriding its existing value) by subtracting a weighted slope or
gradient from it. In other words, we take a step downhill, in the direction opposite to the slope.
Here, α is the learning rate, with a value between 0 and 1, which controls how large a step we
take downhill during each iteration. If we take too large a step, we may step over the minimum.
However, if we take very small steps, it will require many iterations to arrive at the minimum.
The above algorithm considers all the training examples while calculating the slope, and
therefore it is also called batch gradient descent. At times when the sample size is too large
and computing the cost function is too expensive, we could take one sample at a time to run
the above algorithm. That method is called stochastic or incremental gradient descent.
Figure 8.8 Linear regression plot for the data in Table 8.2.
The above lines should generate the output shown in Figure 8.8.
In other words, we got the answer (the red regression line). But let us do this systematically. After all,
we are all about learning this process and not just getting the answer. For this, we will implement the
gradient descent algorithm using R. Let us first define our cost function.
#cost function (Equation 8.5)
cost <- function(X, y, theta){
  sum((X %*% theta - y)^2) / (2 * length(y))
}
We will call this function repeatedly later as we go through various possibilities for the parameters. For now, let
us go ahead and initialize that parameter vector or matrix with zeros. Here, we have two parameters, m
and b, so we need a 2D vector called θ (theta):
theta <- matrix(c(0,0), nrow = 2)
num_iterations <- 300
alpha <- 0.01
Here, α (alpha) indicates the learning rate, and we have decided to go with a very small value for it. We got
all our initial values to start the gradient descent. But before we run the algorithm, let us create storage
spaces to store values of cost or error and the parameters at every iteration:
cost_history <- double(num_iterations)
theta_history <- list(num_iterations)
To use the generalized cost function, we want our first parameter θ0 to be paired with a constant rather than
a feature, so we set x0 = 1 by adding a column of 1s to our feature matrix:
X <- cbind(1, matrix(x))
And now we can implement our algorithm as a loop that goes through a certain number of iterations:
for(i in 1:num_iterations){
  error <- (X %*% theta - y)            # h(x) - y for every data point
  delta <- t(X) %*% error / length(y)   # gradient, as in Equation 8.6
  theta <- theta - alpha * delta        # gradient descent update (Equation 8.7)
  cost_history[i] <- cost(X, y, theta)
  theta_history[[i]] <- theta
}
print(theta)
This will print out the final values of the parameters. If you are interested in the values of these parameters
as well as the cost that we calculated at every step, you could look into the theta_history and cost_history
variables. For now, let us go ahead and visualize how some of those iterations would look:
plot(x, y, main = "Gradient descent")
abline(coef = theta_history[[1]])
abline(coef = theta_history[[2]])
abline(coef = theta_history[[3]])
abline(coef = theta_history[[4]])
abline(coef = theta_history[[5]])
Figure 8.9 shows how the output looks.
We could put this line drawing part in a loop to see how the whole process evolved with all those
iterations (see Figure 8.10):
plot(x, y, main = "Gradient descent")
# Draw the first few lines and then draw every 10th line
for(i in c(1,2,3,4,5,seq(6,num_iterations, by = 10))){
abline(coef = theta_history[[i]], col=rgb(0.8,0,0,0.3))
}
We can also visualize how the cost function changes in each iteration by doing the following steps:
plot(cost_history, type = 'l', col = 'blue', lwd = 2,
     main = 'Cost function', ylab = 'cost', xlab = 'Iterations')
The output is shown in Figure 8.11.
As we can see, the cost quickly jumps down in just a few iterations, thus giving us a very fast
convergence. Of course, that is to be expected because we have only a couple of parameters and a very
Figure 8.9 Regression lines produced using the gradient descent algorithm.
Figure 8.10 Finding the best regression line using gradient descent.
small sample size here. Try practicing this with another dataset (see a homework exercise below). Play
around with things like number of iterations and learning rate. If you want to have more fun and try your
coding skills, see if you could modify the algorithm to consider the change in the cost function to decide
when to stop rather than running it for a fixed number of steps as we did here.
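As one more variation to experiment with, here is a minimal sketch (ours, not the book's code) of the stochastic (incremental) gradient descent mentioned earlier, which updates the parameters using one randomly chosen example at a time:

# Stochastic (incremental) gradient descent on the data from Table 8.2
x <- 1:20
y <- c(3, 4, 8, 4, 6, 9, 8, 12, 15, 26, 35, 40, 45, 54, 49, 59, 60, 62, 63, 68)
X <- cbind(1, x)     # column of 1s so that theta[1] plays the role of the intercept
theta <- c(0, 0)
alpha <- 0.001       # a smaller learning rate keeps the single-sample updates stable

set.seed(1234)
for (step in 1:50000) {
  i <- sample(length(y), 1)                  # pick one training example at random
  error_i <- sum(X[i, ] * theta) - y[i]      # h(x_i) - y_i for that example
  theta <- theta - alpha * error_i * X[i, ]  # update using only this example
}
print(theta)   # with a fixed learning rate, this hovers near the lm() estimates
               # rather than settling exactly on them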
Summary
In this chapter, we started exploring a host of new tools and techniques collectively parked under
the umbrella of machine learning, which we could use to solve various data science problems.
While it is easy to understand individual tools and methods, it is not always clear how to pick the
best one(s) given a problem. There are multiple factors that need to be considered before
choosing the right algorithm for a problem. Some of these factors are discussed below.
Accuracy
Most of the time, beginners in machine learning incorrectly assume that for each problem
the best algorithm is the most accurate one. However, getting the most accurate answer
possible is not always necessary. Sometimes an approximation is adequate, depending on
the problem. If so, you may be able to cut your processing time dramatically by sticking
with more approximate methods. Another advantage of more approximate methods is that
they naturally tend to avoid overfitting. We will revisit the notion of accuracy and other
metrics for measuring how good a model is in Chapter 12.
Training Time
The number of minutes or hours necessary to train a model varies between algorithms.
Training time is often closely tied to accuracy – one typically accompanies the other. In
addition, some algorithms are more sensitive to the number of data points than others. A
limit on time can drive the choice of algorithm, especially when the dataset is large.
Linearity
Lots of machine learning algorithms make use of linearity. Linear classification algorithms
assume that classes can be separated by a straight line (or its higher-dimensional analog).
These include logistic regression and support vector machines. Linear regression algo-
rithms assume that data trends follow a straight line. These assumptions are not bad for
some problems, but on others they bring accuracy down.
Number of Parameters
Parameters are the knobs a data scientist gets to turn when setting up an algorithm. They are
numbers that affect the algorithm’s behavior, such as error tolerance, number of iterations, or
options between variants of how the algorithm behaves. The training time and accuracy of the
algorithm can sometimes be quite sensitive to getting just the right settings. Typically, algorithms
with a large number of parameters require the most trial and error to find a good combination.
Some off-the-shelf applications or service providers may include extra functionalities for
parameter tuning. For example, Microsoft Azure (see Appendix F) provides a parameter
sweeping module block that automatically tries all parameter combinations at whatever
granularity the user may decide. While this is a great way to make sure you have tried every
possible combination in the parameter space, the time required to train a model increases
exponentially with number of parameters.
The upside is that having many parameters typically indicates that an algorithm has
greater flexibility. It can often achieve high accuracy, provided you find the right combina-
tion of parameter settings.
Number of Features
For certain types of data, the number of features can be very large compared to the number
of data points. This is often the case with genetics or textual data. The large number of
features can bog down some learning algorithms, making training time unfeasibly long.
Support vector machines are particularly well suited to this case (see Chapter 9).
Often the hardest part of solving a machine learning problem can be finding the right
estimator for the job. Different estimators are better suited for different types of data and
different problems. How do we learn about when to use which estimator or technique?
There are two primary ways that I can think of: (1) developing a comprehensive theoretical
understanding of different ways we could develop estimators or build models; and
(2) through lots of hands-on experience. As you may have guessed, in this book, we are
going with the latter.
If you are looking for a comprehensive and theoretical treatment of various machine
learning algorithms, you will have to use other textbooks and resources. You can find some
of those pointers at the end of this chapter. But if you are open to working with different data
problems and trying different techniques in a hands-on manner in order to develop a practical
understanding of this matter, you are holding the right book. In the next two chapters, we
will go through many of the machine learning techniques by applying them to various data
problems.
Key Terms
• Machine learning: This is a field that explores the use of algorithms that can learn from
the data and use that knowledge to make predictions on data they have not seen before.
• Supervised learning: This is a branch of machine learning that includes problems where
a model could be built using the data and true labels or values.
• Unsupervised learning: This is a branch of machine learning that includes problems
where we do not have true labels for the data to train with. Instead, the goal is to somehow
organize the data into some meaningful clusters or densities.
• Collaborative filtering (CF): This is a technique for recommender systems that uses
data from other people’s past behaviors to estimate what to recommend to a given user.
• Model: In machine learning a model refers to an artifact that is created by the training
process on a dataset that is representative of the population, often called a training set.
• Linear model: Linear model describes the relation between a continuous response
variable and one or more predictor variable(s).
• Parameter: A parameter is any numerical quantity that characterizes a given population
or some aspect of it.
• Feature: In machine learning, a feature is an individual measurable property or char-
acteristic of a phenomenon or object being observed.
• Independent/predictor variable: A variable that is thought to be controlled or not
affected by other variables.
• Dependent/outcome/response variable: A variable that depends on other variables
(most often other independent variables).
• Gradient descent: This is a machine learning algorithm that computes a slope down an
error surface in order to find a model that provides the best fit for the given data.
• Batch gradient descent: This is a gradient descent algorithm that considers all the
training examples while calculating the gradient.
• Stochastic or incremental gradient descent: This is a gradient descent algorithm that
considers one data point at a time while calculating the gradient.
Conceptual Questions
1. There is a lot around us that is driven by some form of machine learning (ML), but not
everything is. Give an example of a system or a service that does not use ML, and one
that does. Use this contrast to explain ML in your own words.
2. Many of the ML models are represented using parameters. Use this idea to define ML.
3. How do supervised learning and unsupervised learning differ? Give an example for each.
4. Compare batch gradient descent and stochastic gradient descent using their definitions,
and pros and cons.
Hands-On Problems
the following attributes: ambience, food, service, and overall rating. The first three attri-
butes are predictor variables and the remaining one is the outcome. Use a linear regression
model to predict how the predictor attributes impact the overall rating of the restaurant.
First, express the linear regression in mathematical form. Then, try solving it by hand as
we did in class. Here, you will have four parameters (the constant, and the three attributes),
with one predictor. You do not have to actually solve this with all possible values for these
parameters. Rather, show a couple of possible sets of values for the parameters with the
predicted value calculated. Finally, use R to find the linear regression model and report it in
appropriate terms (do not just dump the output from R).
For the next exercise, you are going to use the Airline Costs dataset available to download
from OA 8.4. The dataset has the following attributes, among others:
i. Airline name
ii. Length of flight in miles
iii. Speed of plane in miles per hour
iv. Daily flight time per plane in hours
v. Customers served in 1000s
vi. Total operating cost in cents per revenue ton-mile
vii. Total assets in $100,000s
viii. Investments and special funds in $100,000s
Use a linear regression model to predict the number of customers each airline serves from
its length of flight and daily flight time per plane. Next, build another regression model to
predict the total assets of an airline from the customers served by the airline. Do you have
any insight about the data from the last two regression models?
Download data from OA 8.5, which was obtained from BP Research (image analysis by
Ronit Katz, University of Oxford). This dataset contains measurements on 48 rock samples
from a petroleum reservoir. Here 12 core samples from petroleum reservoirs were sampled
in four cross-sections. Each core sample was measured for permeability, and each cross-
section has total area of pores, total perimeter of pores, and shape. As a result, each row in
the dataset has the following four columns:
i. Area: area of pore space, in pixels out of 256 by 256
ii. Peri: perimeter in pixels
iii. Shape: perimeter/square-root(area)
iv. Perm: permeability in milli-darcies
First, create a linear model and check if the perm has linear relationship with the remaining
three attributes. Next, use the gradient descent algorithm to find the optimal intercept and
gradient for the dataset.
For this exercise, you are going to work again with a movie review dataset of conventional and
social media movies. In this dataset, the ratings, budgets, and other information of popular
movies released in 2014 and 2015 were collected from websites such as YouTube, Twitter, and
IMDB; the aggregated dataset can be downloaded from
OA 8.6. Use this dataset to complete the following objectives:
a. What can you tell us about the rating of a movie from its budget and aggregated number
of followers in social media channels?
b. If you incorporate the type of interaction the movie has received (number of likes,
dislikes, and comments) in social media channels, does it improve your prediction?
c. Among all the factors you considered in the last two models, which one is the best
predictor of movie rating? With the best predictor feature, use the gradient descent
algorithm to find the optimal intercept and gradient for the dataset.
If you are interested in learning more about the topics discussed in this chapter, following
are a few links that might be useful:
1. http://rstatistics.net/linear-regression-advanced-modelling-algorithm-example-with-r/
2. https://www.analyticsvidhya.com/blog/2017/06/a-comprehensive-guide-for-linear-ridge-and-lasso-regression/
3. https://www.kdnuggets.com/2017/04/simple-understand-gradient-descent-algorithm.html
4. https://machinelearningmastery.com/gradient-descent-for-machine-learning/
5. http://ruder.io/optimizing-gradient-descent/
Notes
1. Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal
of Research and Development, 44, 206–226.
2. Mitchell, T. M. (1997). Machine Learning. WCB/McGraw-Hill, Burr Ridge, IL.
3. Witten, I. H., Frank, E., Hall, M. A., & Pal, C. J. (2016). Data Mining: Practical Machine Learning
Tools and Techniques. Morgan Kaufmann.
4. The New Oxford American Dictionary, defined on Wikipedia: https://en.wikipedia.org/wiki/New_Oxford_American_Dictionary
5. YouTube: Deep Learning: Technology behind self-driving car: https://www.youtube.com/watch?v=kMMbW96nMW8
9 Supervised Learning
“Artificial Intelligence, deep learning, machine learning – whatever you’re doing if you
do not understand it – learn it. Because otherwise you’re going to be a dinosaur within
3 years.”
— Mark Cuban
9.1 Introduction
In the previous chapter we were introduced to the concept of learning – both for humans and
for machines. In either case, a primary way one learns is by first knowing the correct
outcome or label of a given data point or behavior. As it happens, there are many situations
when we have training examples with correct labels. In other words, we have data for which
we know the correct outcome value. This set of data problems collectively fall under
supervised learning.
Supervised learning algorithms use a set of examples from previous records to
make predictions about the future. For instance, existing car prices can be used to
make guesses about the future models. Each example used to train such an algorithm
is labeled with the value of interest – in this case, the car’s price. A supervised
learning algorithm looks for patterns in a training set. It may use any information that
might be relevant – the season, the car’s current sales records, similar offerings from
competitors, the consumers' perception of the manufacturer's brand – and each
algorithm may look for a different set of information and find different types of
patterns. Once the algorithm has found the best pattern it can, it uses that pattern to
make predictions for unlabeled testing data – tomorrow’s values.
There are several types of supervised learning that exist within machine learning.
Among them, the three most commonly used algorithm types are regression, classi-
fication, and anomaly detection. In this chapter, we will focus on regression and
classification. Yes, we covered linear regression in the previous chapter, but that was
for predicting a continuous variable such as age and income. When it comes to
predicting discrete values, we need to use another form of regression – logistic
regression or softmax regression. These are essentially forms of classification. And
then we will see several of the most popular and useful techniques for classification.
You will also find a quick introduction to anomaly detection in an FYI box later in
the chapter.
9.2 Logistic Regression
One thing you should have noticed by now about linear regression is that the outcome
variable is numerical. So, the question is: What happens when the outcome variable is not
numerical? For example, suppose you have a weather dataset with the attributes humidity,
temperature, and wind speed, each describing one aspect of the weather for a day. And
based on these attributes, you want to predict if the weather for the day is suitable for
playing golf. In this case, the outcome variable that you want to predict is categorical (“yes”
or “no”). Fortunately, to deal with this kind of classification problem, we have logistic
regression.
Let us think of this in a formal way. Before, our outcome variable y was continuous. Now,
it can have only two possible values (labels). For simplicity, let us call these labels “1” and
“0” (“yes” and “no”). In other words,
y \in \{0, 1\}.
We are still going to have continuous value(s) for the input, but now we need to have only
two possible values for the output. How do we do this? There is an amazing function called
sigmoid, which is defined as
g(z) = \frac{1}{1 + e^{-z}}    (9.1)
and it looks like Figure 9.1.
As you can see in Figure 9.1, for any input, the output of this function is bound between 0
and 1. In other words, if used as the hypothesis function, we get the output in 0 to 1 range,
with 0 and 1 included:
h_\theta(x) \in [0, 1].    (9.2)
The nice thing about this is that it follows the constraints of a probability distribution that
it should be contained between 0 and 1. And if we could compute a probability that ranges
Figure 9.1 The sigmoid function, sigmoid(z), plotted against z.
from 0 to 1, it would be easy to draw a threshold at 0.5 and say that any time we get an
outcome value from a hypothesis function h greater than that, we put it in class “1,”
otherwise it goes in class “0.” Formally:
P(y = 1 \mid x; \theta) = h_\theta(x),
P(y = 0 \mid x; \theta) = 1 - h_\theta(x),    (9.3)
P(y \mid x; \theta) = (h_\theta(x))^y (1 - h_\theta(x))^{1-y}.
The last formulation is the result of combining the first two lines to form one expression.
See if that makes sense. Try putting y = 1 and y = 0 in that expression and see if you get the
previous two lines.
Now, how do we use this for classification? In essence, we want to input whatever
features from the data we have into the hypothesis function (here, a sigmoid) and find
out the value that comes out between 0 and 1. Based on which side of 0.5 it is, we
can declare an appropriate label or class. But before we can do that (called testing),
we need to train a model. For this we need some data. One way we could build a
model from such data is to assume a model and ask if that model could explain or
classify the training data and how well. In other words, we are asking how good our
model is, given the data.
To understand the goodness of a model (as represented by the parameter vector θ), we can
ask how likely it is that the data we have is generated by the given model. This is called the
likelihood of the model and is represented as L(θ). Let us expand this likelihood function:
L(\theta) = \prod_{i=1}^{m} P(y_i \mid x_i; \theta) = \prod_{i=1}^{m} (h_\theta(x_i))^{y_i} (1 - h_\theta(x_i))^{1 - y_i}.    (9.4)
To achieve a better model than the one we guessed, we need to increase the value of L(θ).
But, look at that function above. It has all those multiplications and exponents. So, to make
it easier for us to work with this function, we will take its log. This is using the property of
log that it is an increasing function (as x goes up, log(x) also goes up). This will give us a log
likelihood function as below:
l(\theta) = \log L(\theta) = \sum_{i=1}^{m} \left[ y_i \log(h_\theta(x_i)) + (1 - y_i)\log(1 - h_\theta(x_i)) \right].    (9.5)
Once again, to achieve the best model, we need to maximize this log likelihood.
For that, we do what we already know – take the partial derivative, one parameter
at a time. In fact, for simplicity, we will even consider just one sample data at
a time:
\frac{\partial}{\partial \theta_j} l(\theta) = \left( y \frac{1}{h_\theta(x)} - (1 - y)\frac{1}{1 - h_\theta(x)} \right) \frac{\partial}{\partial \theta_j} h_\theta(x)
= \left( y(1 - h_\theta(x)) - (1 - y)h_\theta(x) \right) x_j
= (y - h_\theta(x)) x_j.    (9.6)
The second line above follows from the fact that, for a sigmoid function g(z), the derivative can
be expressed as g'(z) = g(z)(1 - g(z)).
Considering all training samples, we get:
\frac{\partial}{\partial \theta_j} l(\theta) = \sum_{i=1}^{m} (y_i - h_\theta(x_i)) x_{ij}.    (9.7)
Following this slope upward, one parameter at a time, gives the update rule (analogous to
Equation 8.7, but with a plus sign): \theta_j := \theta_j + \alpha \sum_{i=1}^{m} (y_i - h_\theta(x_i)) x_{ij}.
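To see this update in action, here is a minimal sketch (ours, not the book's code) of gradient ascent for logistic regression on a small synthetic dataset; the toy data and variable names are our own assumptions.

# Gradient ascent for logistic regression, using the slope from Equation 9.7
sigmoid <- function(z) 1 / (1 + exp(-z))

set.seed(1234)
x <- c(rnorm(50, mean = 1), rnorm(50, mean = 3))   # one feature, two overlapping groups
y <- c(rep(0, 50), rep(1, 50))                     # class labels
X <- cbind(1, x)                                   # add the intercept column (x0 = 1)

theta <- c(0, 0)
alpha <- 0.05
for (step in 1:20000) {
  h <- sigmoid(X %*% theta)                  # h_theta(x) for every example
  gradient <- t(X) %*% (y - h) / length(y)   # Equation 9.7, averaged over the examples
  theta <- theta + alpha * gradient          # move *up* the slope: gradient ascent
}
print(theta)
# For comparison, glm(y ~ x, family = binomial)$coefficients should give very similar values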
Notice how we are updating the θ this time. We are moving up on the gradient instead of
moving down. And that is why this is called gradient ascent. It does look similar to gradient
descent, but the difference is the nature of the hypothesis function. Before, it was a linear
function. Now it is sigmoid or logit function. And because of that, this regression is called
logistic regression.
> sapply(titanic.data,function(x)length(unique(x)))
PassengerId Survived Pclass Name Sex Age SibSp
891 2 3 891 2 89 7
Parch Ticket Fare Cabin Embarked
7 681 248 148 4
Another way to identify the missing values is to use a visualization package: the “Amelia” package has
a plotting function, missmap(), that serves the purpose. It will plot your dataset and highlight missing
values:
> library(Amelia)
Loading required package: Rcpp
##
## Amelia II: Multiple Imputation
## (Version 1.7.4, built: 2015–12-05)
## Copyright (C) 2005–2017 James Honaker, Gary King and
Matthew Blackwell
## Refer to http://gking.harvard.edu/amelia/for more
information
##
> missmap(titanic.data, main = "Missing values vs observed")
Figure 9.3 Missingness map from missmap() for the Titanic dataset: Missing (2%) vs. Observed (98%) across all columns.
As Figure 9.3 suggests, the Age column has multiple missing values. So, we must clean up the
missing values before proceeding further. In Chapter 2, we saw multiple methods for doing such data
cleanup. In this case, we will go with replacing those missing values with the mean age value. This is
how to do that:
> titanic.data$Age[is.na(titanic.data$Age)] <- mean
(titanic.data$Age,na.rm=T)
Here we have replaced the missing values with the mean age of the rest of the population. If any column
has a significant number of missing values, you may want to consider removing the column altogether. For
the purpose of this exercise, we will use only the Age, Embarked, Fare, Ticket, Parch, SibSp, Sex, Pclass, and
Survived columns to simplify our model:
> titanic.data <- subset(titanic.data,select=c
(2,3,5,6,7,8,10,12))
For the categorical variables in the dataset, using the read.table() or read.csv() by default will encode
the categorical variables as factors. A factor is how R deals with categorical variables.
We can check the encoding using the is.factor() function, which should return “true” for all the
categorical variables:
> is.factor(titanic.data$Sex)
[1] TRUE
Now, before building the model you need to separate the dataset into training and test sets. I have used
the first 800 instances for training and the remaining 91 as test instances. You can opt for different
separation strategies:
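The book's exact code for these steps is not reproduced here; a minimal sketch, using the variable names (train, test, and model) that the rest of this walkthrough relies on, might look like the following, with the tail end of the resulting summary output shown below:

> train <- titanic.data[1:800, ]
> test <- titanic.data[801:891, ]
> model <- glm(Survived ~ ., family = binomial(link = 'logit'), data = train)
> summary(model)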
AIC: 728.93.
Number of Fisher scoring iterations: 12.
From the result, it is clear that Fare and Embarked are not statistically significant. That means we do not
have enough confidence that these factors contribute all that much to the overall model. As for the
statistically significant variables, Sex has the lowest p-value, suggesting a strong association of the sex of
the passenger with the probability of having survived. The negative coefficient for this predictor suggests
that, all other variables being equal, the male passenger is less likely to have survived. At this time, we
should pause and think about this insight. As the Titanic started sinking and the lifeboats were being filled
with rescued passengers, priority was given to women and children. And thus, it makes sense that a male
passenger, especially a male adult, would have had less chance of survival.
Now, we will see how good our model is in predicting values for test instances. By setting the
parameter type=‘response’, R will output probabilities in the form of P(y = 1 | X). Our decision
boundary will be 0.5. If P(y = 1 | X) > 0.5 then y = 1, otherwise y = 0.
> fitted.results <- predict(model, newdata=subset(test, select=c(2,3,4,5,6,7,8)), type='response')
> fitted.results <- ifelse(fitted.results > 0.5, 1, 0)
>
> misClasificError <- mean(fitted.results != test$Survived)
> print(paste('Accuracy', 1-misClasificError))
[1] "Accuracy 0.842696629213483"
As we can see from the above result, the accuracy of our model in predicting the labels of the test
instances is at 0.84, which suggests that the model performed decently.
At the final step, we are going to plot the receiver operating curve (ROC) and calculate the area under
curve (AUC), which are typical performance measurements for a binary classifier (see Figure 9.4). For the
details of these measures, refer to Chapter 12.
> library(ROCR)
Loading required package: gplots
Attaching package: ‘gplots’
The following object is masked from ‘package:stats’: lowess
> p <- predict(model, newdata=subset(test, select=c(2,3,4,5,6,7,8)), type="response")
> pr <- prediction(p, test$Survived)
> prf <- performance(pr, measure = "tpr", x.measure = "fpr")
> plot(prf)
> auc <- performance(pr, measure = "auc")
Figure 9.4 Receiver operating curve (ROC) for the classifier built on Titanic data.
So far, we have seen regression for numerical outcome variable as well as regression for
binomial (“yes” or “no”, “1” or “0”) categorical outcome. But what happens if we have
more than two categories? For example, you may want to rate a student’s performance, based on
the scores obtained in individual subjects, as “excellent,” “good,” “average,” or “below
average.” We need multinomial logistic regression for this. In this sense
Before running the multinomial regression, we need to remember our outcome variable is not
ordinal (e.g., “good,” “better,” and “best”). So, to create our model, we need to choose the level of the
outcome that we wish to use as the baseline and specify this in the relevel function. Here is how to do
that:
> hsbdemo$prog2 <- relevel(hsbdemo$prog, ref = "academic")
Here, instead of transforming the original variable “prog,” we have declared another variable “prog2”
using the relevel function, where the level “academic” is declared as baseline:
> library(nnet)
> model1 <- multinom(prog2 ~ ses + write, data = hsbdemo)
# weights: 15 (8 variable)
initial value 219.722458
iter 10 value 179.983731
final value 179.981726
Converged
Ignore all the warning messages. As we can see, we have built a model where the outcome variable is
“prog2.” For demonstration purposes, we used only “ses” and “write” as our predictors and ignored the
remaining variables. As we can see, the model has generated some output itself even though we are
assigning the model to a new R object. This model-running output includes some iteration history and
includes the final negative log-likelihood value, 179.981726.
Next, to explore more details about the model we have built so far, we can issue a summary command
on our model:
> summary(model1)
Call:
multinom(formula = prog2 ~ ses + write, data = hsbdemo)
Coefficients:
(Intercept) seslow sesmiddle write
general 1.689478 1.1628411 0.6295638 -0.05793086
vocation 4.235574 0.9827182 1.2740985 -0.11360389
Std. Errors:
(Intercept) seslow sesmiddle write
general 1.226939 0.5142211 0.4650289 0.02141101
vocation 1.204690 0.5955688 0.5111119 0.02222000
Residual Deviance: 359.9635
AIC: 375.9635
The output summary generated by the model has a block of coefficients and a block of standard errors. Each of
these blocks has one row of values corresponding to a model equation. We are going to focus on the block of
coefficients first. As we can see, the first row is comparing prog=general to our baseline
9.4 Classification with kNN
Sections 9.2 and 9.3 covered two forms of regression that accomplished one task: classi-
fication. We will continue with this now and look at other techniques for performing
classification. The task of classification is: given a set of data points and their corresponding
labels, to learn how they are classified so, when a new data point comes, we can put it in the
correct class.
Classification can be supervised or unsupervised. The former is the case when assigning
a label to a picture as, for example, either “cat” or “dog.” Here the number of possible
choices is predetermined. When there are only two choices, it is called two-class or
binomial classification. When there are more categories, such as when predicting the
winner of the NCAA March Madness tournament, it is known as multiclass or multinomial
classification. There are many methods and algorithms for building classifiers, with k
nearest neighbor (kNN) being one of the most popular ones.
Let us look at how kNN works by listing the major steps of the algorithm.
1. As in the general problem of classification, we have a set of data points for which we
know the correct class labels.
2. When we get a new data point, we compare it to each of our existing data points and find
similarity.
3. Take the most similar k data points (k nearest neighbors).
4. From these k data points, take the majority vote of their labels. The winning label is the
label/class of the new data point.
The number k is usually small, between 2 and 20. As you can imagine, the larger the number
of nearest neighbors (the value of k), the longer it takes us to do the processing.
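Before turning to R's built-in tools, here is a small sketch of ours (not the book's code) that carries out these four steps by hand for a single new data point, using Euclidean distance on the measurements in R's built-in iris data:

# Classify one new flower with kNN "by hand"
new_point <- c(5.0, 3.4, 1.5, 0.2)   # sepal length/width, petal length/width
k <- 5

# Step 2: compare the new point with every existing data point (Euclidean distance)
distances <- apply(iris[, 1:4], 1, function(row) sqrt(sum((row - new_point)^2)))

# Step 3: take the k most similar (nearest) data points
nearest <- order(distances)[1:k]

# Step 4: majority vote of their labels
votes <- table(iris$Species[nearest])
names(which.max(votes))   # "setosa" for this point, since all of its neighbors are setosa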
Figure 9.5 Plotting of IRIS data based on various flowers’ sepal lengths and widths.
# Iris scatterplot
library(ggvis)   # provides ggvis() and the %>% pipe
iris %>% ggvis(~Sepal.Length, ~Sepal.Width, fill = ~Species)
From Figure 9.5, we can see that there is a high correlation between the sepal length and the sepal
width of the Iris setosa flowers, whereas the correlation is somewhat less for the I. virginica and I. versicolor
flowers.
If we map the relation between petal length and the petal width, it tells a similar story, as shown in
Figure 9.6.
The graph indicates a positive correlation between the petal length and the petal width for all different
species that are included in the Iris dataset.
Once we have at least some idea of the nature of the dataset, we will see how to do kNN classification
in R.
But before we proceed into the classification task, we need to have a test set to assess the model’s
performance later. Since we do not have that yet, we will need to divide the dataset into two parts: a
training set and a test set. We will split the entire dataset into two-thirds and one-third. The first part, the
larger chunk of the dataset, will be reserved for training, whereas the rest of the dataset is going to be
used for testing. We can split them any way we want, but we need to remember that the training set must
be sufficiently large to produce a good model. Also, we must make sure that all three classes of species are
present in the training set. Even more important, the numbers of instances of the three species need to
be more or less equal, so that you do not favor one class or another in your predictions.
Figure 9.6 Plotting of Iris data based on various flowers’ petal lengths and widths.
To divide the dataset into training and test sets, we should first set a seed. This is a number used to initialize R's
random number generator. The major advantage of setting a seed is that we get the same sequence of
random numbers whenever we supply the same seed to the random number generator. Here is how to do
that:
set.seed(1234)
You can pick any number other than 1234 in the above line. Next, we want to make sure that our Iris
dataset is shuffled and that we have an equal amount of each species in our training and test sets. One way
to ensure that is to use the sample() function to take a sample with a size that is set as the number of rows
of the Iris dataset (here, 150). We sample with replacement: we choose from a vector of two elements and
assign either “1” or “2” to the 150 rows of the Iris dataset. The assignment of the elements is subject to
probability weights of 0.67 and 0.33. This results in getting about two-thirds of the data labeled as “1”
(training) and the rest as “2” (testing).
ind <- sample(2, nrow(iris), replace=TRUE, prob=c
(0.67, 0.33))
We can then use the sample that is stored in the variable “ind” to define our training and test sets, only
taking the first four columns or attributes from the data.
iris.training <- iris[ind==1, 1:4]
iris.test <- iris[ind==2, 1:4]
Also, we need to remember that “Species,” which is the class label, is our target variable, and the
remaining attributes are predictor attributes. Therefore, we need to store the class labels in factor vectors
and divide them over the training and test sets, which can be done with the following steps:
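The book's own code for these steps is not reproduced here. A sketch of what they might look like, assuming the class package (for knn()) and the gmodels package (for CrossTable()), and taking k = 3 as an illustrative choice, follows; the names iris_pred and iris.testLabels match the output shown below.

# Store the class labels, split over the training and test partitions
iris.trainLabels <- iris[ind==1, 5]
iris.testLabels <- iris[ind==2, 5]

# Build and apply the kNN classifier (class package)
library(class)
iris_pred <- knn(train = iris.training, test = iris.test, cl = iris.trainLabels, k = 3)

# Cross-tabulate the predictions against the true test labels (gmodels package)
library(gmodels)
CrossTable(x = iris_pred, y = iris.testLabels, prop.chisq = FALSE)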
Cell Contents
|–––––––––––––––––––|
| N|
| N/Row Total|
| N/Col Total|
| N/Table Total|
|–––––––––––––––––––|
|iris.testLabels
iris_pred| setosa| versicolor| virginica| Row Total|
–––––––––––|–––––––––|–––––––––––|––––––––––|–––––––––|
setosa| 12| 0| 0| 12|
| 1.000| 0.000| 0.000| 0.300|
| 1.000| 0.000| 0.000| |
| 0.300| 0.000| 0.000| |
–––––––––––|–––––––––|–––––––––––|––––––––––|–––––––––|
versicolor| 0| 12| 1| 13|
| 0.000| 0.923| 0.077| 0.325|
| 0.000| 1.000| 0.062| |
| 0.000| 0.300| 0.025| |
–––––––––––|–––––––––|–––––––––––|––––––––––|–––––––––|
virginica| 0| 0| 15| 15|
| 0.000| 0.000| 1.000| 0.375|
| 0.000| 0.000| 0.938| |
| 0.000| 0.000| 0.375| |
–––––––––––|–––––––––|–––––––––––|––––––––––|–––––––––|
Column Total| 12| 12| 16| 40|
| 0.300| 0.300| 0.400| |
–––––––––––|–––––––––|–––––––––––|––––––––––|–––––––––|
Using this table, you can easily calculate our accuracy. Out of 40 predictions, we were wrong one time.
So that gives us 39/40 = 97.5% accuracy.
9.5 Decision Tree
In machine learning, a decision tree is used for classification problems. In such problems, the
goal is to create a model that predicts the value of a target variable based on several input
variables. A decision tree builds classification or regression models in the form of a tree structure.
It breaks down a dataset into smaller and smaller subsets while at the same time an associated
decision tree is incrementally developed. The final result is a tree with decision nodes and leaf
nodes.
Consider the balloons dataset (download from OA 9.6) presented in Table 9.1. The
dataset has four attributes: color, size, act, and age, and one class label, inflated (T = true or
F = false). We will use this dataset to understand how a decision tree algorithm works.
Several algorithms exist that generate decision trees, such as ID3/4/5, CART, CLS, etc.
Of these, the most popular one is ID3, developed by J. R. Quinlan, which uses a top-down,
greedy search through the space of possible branches with no backtracking. ID3 employs
Entropy and Information Gain to construct a decision tree. Before we go through the
algorithm, let us understand these two terms.
Entropy: Entropy (E) is a measure of disorder, uncertainty, or randomness. If I toss a fair
coin, there is an equal chance of getting a head as well as a tail. In other words, we would be
most uncertain about the outcome, or we would have a high entropy. The formula of this
measurement is:
Entropy(E) = -\sum_{i=1}^{k} p_i \log_2(p_i).    (9.11)
Here, k is the number of possible class values, and p_i is the proportion of instances in the dataset
that belong to class i. So, in the “balloons” dataset the number of possible class values is 2
Figure 9.7 Entropy plotted against the probability p of a two-outcome event.
(T or F). The reason for the minus signs is that the logarithms of fractions p1, p2, . . ., pn are
negative, so the entropy is actually positive. Usually the logarithms are expressed in base 2,
and then the entropy is in units called bits – just the usual kind of bits used with computers.
Figure 9.7 shows the entropy curve with respect to probability values for an event. As you
can see, it is at its highest (1) when the probability of a two-outcome event is 0.5. If we are
holding a fair coin, the probability of getting a head or a tail is 0.5. Entropy for this coin is the
highest at this point, which reflects the highest amount of uncertainty we will have with this
coin’s outcome. If, on the other hand, our coin is completely unfair and flips to “heads” every
time, the probability of getting heads with this coin will be 1 and the corresponding entropy
will be 0, indicating that there is no uncertainty about the outcome of this event.
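As a quick check of this formula in R (a sketch of ours, not the book's code):

# Entropy of a discrete outcome (Equation 9.11), in bits
entropy <- function(p) {
  p <- p[p > 0]          # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

entropy(c(0.5, 0.5))   # fair coin: 1 bit, maximum uncertainty
entropy(c(1))          # completely unfair coin: 0 bits, no uncertainty
entropy(c(0.6, 0.4))   # the Inflated class in the balloons data: about 0.971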
Information Gain: If you thought it was not going to rain today and I tell you it will indeed
rain, would you not say that you gained some information? On the other hand, if you already
knew it was going to rain, then my prediction will not really impact your existing knowl-
edge much. There is a mathematical way to measure such information gain:
IG(A, B) = Entropy(A) - Entropy(A, B).    (9.12)
Here, information gain achieved by knowing B along with A is the difference between the
entropy (uncertainty) of A and both A and B. Keep this in the back of your mind and we will
revisit it as we work through an example next.
But first, let us get back to that decision tree algorithm. A decision tree is a hierarchical,
top-down tree, built from a root node to leaves, and involves partitioning the data into
smaller subsets that contain instances with similar values (homogeneous). The ID3 algo-
rithm uses entropy to calculate the homogeneity of a sample. If the sample is completely
homogeneous, the entropy is zero, and if the sample is equally divided, it has entropy of
one. In other words, entropy is a measurement of disorder in the data.
Now, to build the decision tree, we need to calculate two types of entropy using
frequency tables, as follows:
a. Entropy using the frequency table of one attribute:
        Inflated
  True      False
    8         12
Therefore,
E(Inflated) = E(12, 8) = E(0.6, 0.4) = -(0.6 \log_2 0.6) - (0.4 \log_2 0.4) = 0.4422 + 0.5288 = 0.9710.
b. Entropy using the frequency table of two attributes:

                      Inflated
                  True    False    Total
  Act   Dip         0       8        8
        Stretch     8       4       12
  Total                             20
Figure 9.8 Decision tree for the balloons dataset: the root splits on Act (DIP gives F (8.0)); under STRETCH, a further split on Age gives T (8.0) for ADULT and F (4.0) for CHILD.
Step 1: Calculate entropy of the target or class variable, which is 0.9710, in our case.
Step 2: The dataset is then split on the different attributes into smaller subtables, for
example, Inflated and Act, Inflated and Age, Inflated and Size, and Inflated and
Color. The entropy for each subtable is calculated. Then it is added proportionally,
to get total entropy for the split. The resulting entropy is subtracted from the
entropy before the split. The result is the information gain or decrease in entropy.
Step 3: Choose the attribute with the largest information gain as the decision node, divide
the dataset by its branches, and repeat the same process on every branch.
If you follow the above guidelines step-by-step, you should end up with the decision tree
shown in Figure 9.8.
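If you would rather grow such a tree in R than by hand, here is a sketch of ours (not the book's code) using the rpart package; it assumes the balloons data from OA 9.6 has been read into a data frame called balloons.data with the columns color, size, act, age, and inflated, and it asks rpart to split on information gain (entropy), in the spirit of ID3.

library(rpart)

# The dataset is tiny (20 rows), so relax the default stopping rules
fit <- rpart(inflated ~ color + size + act + age,
             data = balloons.data, method = "class",
             parms = list(split = "information"),
             control = rpart.control(minsplit = 2, cp = 0.001))
print(fit)   # the printed splits should closely mirror the tree in Figure 9.8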
Rules are a popular alternative to decision trees. Rules typically take the form of an {IF:
THEN} expression (e.g., {IF “condition” THEN “result”}). Typically for any dataset, an
individual rule in itself is not a model, as this rule can be applied when the associated
condition is satisfied. Therefore, rule-based machine learning methods typically identify a
set of rules that collectively comprise the prediction model, or the knowledge base.
To fit any dataset, a set of rules can easily be derived from a decision tree by following the
paths from the root node to the leaf nodes, one at a time. For the above decision tree in
Figure 9.8, the corresponding decision rules are shown on the left-hand side of Figure 9.9.
Decision rules yield orthogonal hyperplanes in n-dimensional space. What that means is
that, for each of the decision rules, we are looking at a line or a plane perpendicular to the
axis for the corresponding dimension. This hyperplane (fancy word for a line or a plane in
higher dimension) separates data points around that dimension. You can think about it as a
decision boundary. Anything on one side of it belongs to one class, and those data points on
the other side belong to another class.
It is easy to read a set of classification rules directly off a decision tree. One rule is generated for
each leaf. The antecedent of the rule includes a condition for every node on the path from the
root to that leaf, and the consequent of the rule is the class assigned by the leaf. This procedure
produces rules that are unambiguous in that the order in which they are executed is irrelevant.
However, in general, rules that are read directly off a decision tree are far more complex than
necessary, and rules derived from trees are usually pruned to remove redundant tests. Because
decision trees cannot easily express the disjunction implied among the different rules in a set,
transforming a general set of rules into a tree is not quite so straightforward. A good illustration
of this occurs when the rules have the same structure but different attributes, such as:
If a and b then x
If c and d then x
Then it is necessary to break the symmetry and choose a single test for the root node. If, for
example, “if a” is chosen, the second rule must, in effect, be repeated twice in the tree. This
is known as the replicated subtree problem.
Association rules are no different from classification rules except that they can predict any
attribute, not just the class, and this gives them the freedom to predict combinations of
attributes, too. Also, association rules are not intended to be used together as a set, as
classification rules are. Different association rules express different regularities that under-
lie the dataset, and they generally predict different things.
Because so many different association rules can be derived from even a very small
dataset, interest is restricted to those that apply to a reasonably large number of
instances and have a reasonably high accuracy on the instances to which they apply.
The coverage of an association rule is the number of instances for which it predicts
correctly; this is often called its support. Its accuracy – often called confidence – is
the number of instances that it predicts correctly, expressed as a proportion of all
instances to which it applies.
For example, consider the weather data in Table 9.3, a training dataset of weather and
corresponding target variable “Play” (suggesting possibilities of playing). The decision tree
and the derived decision rules for this dataset are given in Figure 9.10.
Let us consider this rule:
If temperature = cool then humidity = normal.
The coverage is the number of days that are both cool and have normal humidity (four in the
data of Table 9.3), and the accuracy is the proportion of cool days that have normal humidity
(100% in this case). Some other good quality association rules for Figure 9.10 are:
Figure 9.10 Decision rules (left) and decision tree (right) for the weather data.
A decision tree seems like a nice method for doing classification – it typically has a good
accuracy, and, more importantly, it provides human-understandable insights. But one big
problem the decision tree algorithm has is that it could overfit the data. What does that
mean? It means it could try to model the given data so well that, while the classification
accuracy on that dataset would be wonderful, the model may find itself crippled when
looking at any new data; it learned too much from the data!
One way to address this problem is to use not just one, not just two, but many decision
trees, each one created slightly differently. And then take some kind of average from what
these trees decide and predict. Such an approach is so useful and desirable that there is a
whole family of algorithms that apply it in many situations. They are called
ensemble methods.
In machine learning, ensemble methods rely on multiple learning algorithms to obtain
better prediction accuracy than what any of the constituent learning algorithms can achieve.
In general, an ensemble algorithm consists of a concrete and finite set of alternative models
but incorporates a much more flexible structure among those alternatives. One example of
an ensemble method is random forest, which can be used for both regression and classifica-
tion tasks.
Random forest operates by constructing a multitude of decision trees at training time and
selecting the mode of the class as the final class label for classification or mean prediction of
the individual trees when used for regression tasks. The advantage of using random forest
over decision tree is that the former tries to correct the decision tree’s habit of overfitting the
data to their training set. Here is how it works.
For a training set of N, each decision tree is created in the following manner:
1. A sample of the N training cases is taken at random but with replacement from the
original training set. This sample will be used as a training set to grow the tree.
2. If the dataset has M input variables, a number m (m being a lot smaller than M) is
specified such that, at each node, m variables are selected at random out of M. Among
this m, the best split is used to split the node. The value of m is held constant while we
grow the forest.
3. Following the above steps, each tree is grown to its largest possible extent and there is no
pruning.
4. Predict new data by aggregating the predictions of the n trees (i.e., majority votes for
classification, average for regression).
Let us say a training dataset, N, has four observations on three predictor variables,
namely A, B, and C. The training data is provided in Table 9.4.
We will now work through the random forest algorithm on this small dataset.
Step 1: Sample the N training cases at random. So these subsets of N, for example, n1, n2,
n3, . . ., nn (as depicted in Figure 9.12) are used for growing (training) the n number
of decision trees. These samples are drawn as randomly as possible, with or without
                        A     B     C
Training instances     A1    B1    C1
                       A2    B2    C2
                       A3    B3    C3
                       A4    B4    C4
Figure 9.12 Random samples n1, n2, . . ., nn drawn from the training set N, each used to grow one decision tree.
overlap between them. For example, n1 may consist of the training instances 1, 1, 1,
and 4. Similarly, n2 may consist of 2, 3, 3, and 4, and so on.
Step 2: Out of the three predictor variables, a number m≪3 is specified such that, at
each node, m variables are selected at random out of the M. Let us say here m
is 2. So, n1 can be trained on A, B; n2 can be trained on B, C; and so on (see
Table 9.5).
So, the resultant decision trees may look something like what is shown in Figure 9.13.
Random forest uses a bootstrap sampling technique, which involves sampling of the
input data with replacement. Before using the algorithm, a portion of the data
(typically one-third) that is not used for training is set aside for testing. These are
sometimes known as out-of-bag samples. An error estimation on this sample, known
as out-of-bag error, provides evidence that the out-of-bag estimate can be as accurate
as having a test set of equal size as the training set. Thus, use of an out-of-bag error
estimate removes the need for a set-aside test set here.
Figure 9.13 Various decision trees for the random forest data.
So, the big question is why does the random forest as a whole do a better job than the
individual decision trees? Although there is no clear consensus among researchers, there
are two major beliefs behind this:
1. As the saying goes, “Nobody knows everything, but everybody knows something.”
When it comes to a forest of trees, not all of them are perfect or most accurate. Most of
the trees provide correct predictions of class labels for most of the data. So, even if some
of the individual decision trees generate wrong predictions, the majority predict cor-
rectly. And since we are using the mode of output predictions to determine the class, it is
unaffected by those wrong instances. Intuitively, the validity of this belief depends on the
randomness in the sampling method. The more random the samples, the more decorrelated
the trees will be, and the less likely it is that many trees will be affected by the same
wrong predictions.
2. More importantly, different trees are making mistakes at different places and not all of
them are making errors at the same location. Again, intuitively, this belief depends on
how randomly the attributes are selected. The more random they are, the less likely the
trees will make mistakes at the same location.
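A tiny sketch of the voting idea in R, with made-up predictions from three hypothetical trees; even where an individual tree is wrong, the majority carries the final class:
> tree1 <- c("yes", "no",  "yes", "yes")
> tree2 <- c("yes", "no",  "no",  "yes")
> tree3 <- c("no",  "yes", "yes", "yes")
> votes <- rbind(tree1, tree2, tree3)
> # Majority vote (mode) per data point
> apply(votes, 2, function(v) names(which.max(table(v))))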
Figure 9.14 Bar plot depicting data points with “No” and “Yes” labels.
> set.seed(1234)
> population <- sample(nrow(bank), 0.75 * nrow(bank))
> train <- bank[population, ]
> test <- bank[-population, ]
As shown, the dataset is split into two parts: 75% for training purposes and the remaining 25% to evaluate our model. To build the model, the randomForest library is used. In case your system does not have this package, make sure to install it first. Next, use the training instances to build the model:
> install.packages("randomForest")
> library(randomForest)
> model <- randomForest(y ~ ., data = train)
> model
We can use ntree and mtry to specify the total number of trees to build (default = 500), and the number of
predictors to randomly sample at each split, respectively. The above lines of code should generate the following
result:
Call:
randomForest(formula = y ~ ., data = train)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 4
OOB estimate of error rate: 9.94%
Confusion matrix:
no yes class.error
no 2901 97 0.0323549
yes 240 152 0.6122449
We can see that 500 trees were built, and the model randomly sampled four predictors at each split. It
also shows a confusion matrix containing prediction vs. actual, as well as classification error for each class.
Let us test the model to see how it performs on the test dataset:
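A minimal sketch of what that check might look like (not the book’s own code), reusing model, test, and the outcome column y from the code above:
> prediction <- predict(model, newdata = test)      # class labels for the held-out 25%
> table(Predicted = prediction, Actual = test$y)    # confusion matrix on the test set
> mean(prediction == test$y)                        # overall accuracy on unseen data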
Random forest is considered a panacea for all data science problems by many of its practitioners. There is a belief that when you cannot decide which algorithm to use, irrespective of the situation, you should use random forest. This is a bit irrational, since no algorithm strictly dominates in all applications (one size does not fit all). Nonetheless, people have their favorite algorithms, and there are reasons why, for many data scientists, random forest is the favorite:
1. It can solve both types of problems, that is, classification and regression, and does a
decent estimation for both.
2. Random forest requires almost no input preparation. It can handle binary features,
categorical features, and numerical features without any need for scaling.
3. Random forest is not very sensitive to the specific set of parameters used. As a result, it does not require a lot of tweaking and fiddling to get a decent model; just use a large number of trees and things will not go terribly awry (see the sketch after this list).
4. It is an effective method for estimating missing data and maintains accuracy when a
large proportion of the data are missing.
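To illustrate point 3, here is a hedged sketch that refits the earlier model with two arbitrary settings of ntree and mtry; the exact values are illustrative, and the resulting out-of-bag errors typically end up close to each other:
> # Two arbitrary settings of the main knobs, reusing the train data frame from above
> model_a <- randomForest(y ~ ., data = train, ntree = 500,  mtry = 4)
> model_b <- randomForest(y ~ ., data = train, ntree = 1000, mtry = 3)
> model_a$err.rate[model_a$ntree, "OOB"]
> model_b$err.rate[model_b$ntree, "OOB"]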
So, is random forest a silver bullet? Absolutely not. First, it does a good job at classification but not as good a job at regression, since it does not give precise continuous-valued predictions. Second, random forest can feel like a black-box approach to statistical modelers, as you have very little control over what the model does. At best, you can try different parameters and random seeds and hope that changes the output.
We now move on to a very popular and robust approach for classification that uses Bayes’
theorem. The Bayesian classification represents a supervised learning method as well as a
statistical method for classification. In a nutshell, it is a classification technique based on
Bayes’ theorem with an assumption of independence among predictors. Here, all attributes
contribute equally and independently to the decision.
In simple terms, a Naïve Bayes classifier assumes that the presence of a particular feature
in a class is unrelated to the presence of any other feature. For example, a fruit may be
considered to be an apple if it is red, round, and about three inches in diameter. Even if these
features depend on each other or upon the existence of other features, all of these properties
independently contribute to the probability that this fruit is an apple, and that is why it is
known as naïve. It turns out that, even though in most cases such a naïve assumption is not true, the resulting classification models do amazingly well.
Let us first take a look at Bayes’ theorem, which provides a way of calculating posterior
probability P(c| x) from P(c), P(x), and P(x| c). Look at the equation below:
\[ P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)}. \tag{9.14} \]
Here:
• P(c| x) is the posterior probability of class (c, target) given predictor (x, attributes).
• P(c) is the prior probability of class.
• P(x| c) is the likelihood, which is the probability of predictor given class.
• P(x) is the prior probability of predictor.
And here is that naïve assumption: we believe that evidence can be split into parts that are
independent,
\[ P(c \mid x) = \frac{P(x_1 \mid c)\,P(x_2 \mid c)\,P(x_3 \mid c)\,P(x_4 \mid c)\cdots P(x_n \mid c)\,P(c)}{P(x)}, \tag{9.15} \]
where x1, x2, x3, . . ., xn are the individual predictors (features), assumed to be independent of one another given the class.
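For instance, instantiating Equation (9.15) for the earlier apple example, with “red,” “round,” and “about three inches” as the features:
\[ P(\text{apple} \mid \text{red}, \text{round}, \text{3 in}) \propto P(\text{red} \mid \text{apple})\,P(\text{round} \mid \text{apple})\,P(\text{3 in} \mid \text{apple})\,P(\text{apple}). \]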
To understand Naïve Bayes in action, let us revisit the golf dataset from earlier in this
chapter to see how this algorithm works step-by-step. This dataset is repeated in reordered
form in Table 9.6, and can be downloaded from OA 9.10.
As shown in Table 9.6, the dataset has four attributes, namely Outlook, Temperature,
Humidity, and Windy, which are all different aspects of weather conditions. Based on these
four attributes, the goal is to predict the value of the outcome variable, Play (yes or no) –
whether the weather is suitable to play golf. Following are the steps of the algorithm
through which we could accomplish that goal.
• Step 1: First convert the dataset into a frequency table (see Figure 9.15).
• Step 2: Create a likelihood table by finding the probabilities; for example, the probability of being hot is 0.29 and the probability of playing is 0.64, as shown in Figure 9.15.
• Step 3: Now, use the Naïve Bayesian equation to calculate the posterior probability for
each class. The class with the highest posterior probability is the outcome of the
prediction.
To see this in action, let us say that we need to decide, based on the dataset, whether one should go out to play when the temperature is mild. We can solve this using the posterior probability method discussed above. Using Bayes’ theorem:
\[ P(\text{Yes} \mid \text{Mild}) = \frac{P(\text{Mild} \mid \text{Yes})\,P(\text{Yes})}{P(\text{Mild})}. \]
Here we have P(Mild|Yes) = 4/9 ≈ 0.44, P(Yes) = 9/14 ≈ 0.64, and P(Mild) = 6/14 ≈ 0.43, as can be read from the frequency and likelihood tables in Figure 9.15.
Figure 9.15 Conversion of the dataset to a frequency table and to a likelihood table.
Now,
P(Yes|Mild) = (0.44 × 0.64)/0.43 = 0.65.
In other words, we have derived that the probability of playing when the weather is mild is
65%, and if we wanted to turn that into a Yes–No decision, we can see that this probability is
higher than the mid-point, that is, 50%. Thus, we can declare “Yes” for our answer.
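A quick check of this arithmetic in R. The first line uses the rounded probabilities quoted in the text; the second uses the counts those probabilities imply (4/9, 9/14, and 6/14), so the small difference between the two results is only rounding:
> (0.44 * 0.64) / 0.43        # with the rounded probabilities used in the text
[1] 0.6548837
> (4/9) * (9/14) / (6/14)     # with the underlying counts
[1] 0.6666667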
Naïve Bayes uses a similar method to predict the probability of different classes based on
various attributes. This algorithm is mostly used in text classification and with problems
having two or more classes. One prominent example is spam detection. Spam filtering with
Naïve Bayes is a two-class problem, that is, to determine whether a message or an email is spam or not. Here is how it works.
Let us assume that there are certain words (e.g., “viagra,” “rich,” “friend”) that indicate a
given message as being spam. We can apply Bayes’ theorem to calculate the probability that
an email is spam given the email words as:
\[
\begin{aligned}
P(\text{spam} \mid \text{words}) &= \frac{P(\text{words} \mid \text{spam})\,P(\text{spam})}{P(\text{words})} \\
&= \frac{P(\text{spam})\,P(\text{viagra}, \text{rich}, \ldots, \text{friend} \mid \text{spam})}{P(\text{viagra}, \text{rich}, \ldots, \text{friend})} \\
&\propto P(\text{spam})\,P(\text{viagra}, \text{rich}, \ldots, \text{friend} \mid \text{spam}).
\end{aligned}
\]
Here, ∝ is the proportionality symbol. According to Naïve Bayes, the word events are completely independent given the class; therefore, the above formula simplifies to
\[ P(\text{spam} \mid \text{words}) \propto P(\text{spam})\,P(\text{viagra} \mid \text{spam})\,P(\text{rich} \mid \text{spam})\cdots P(\text{friend} \mid \text{spam}). \]
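The likelihood tables that follow come from a Naïve Bayes model fit to the golf data in R. A minimal sketch of how such a model might be built (an assumption, using the e1071 package and the golfTrain/golfTest names that appear in the later code):
> library(e1071)                                        # provides naiveBayes()
> golfModel <- naiveBayes(PlayGolf ~ ., data = golfTrain)
> print(golfModel)                                      # a-priori and conditional probabilities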
Temp
Y Cool Hot Mild
No 0.3333333 0.3333333 0.3333333
Yes 0.3333333 0.3333333 0.3333333
Humidity
Y High Normal
No 0.6666667 0.3333333
Yes 0.3333333 0.6666667
Windy
Y FALSE TRUE
No 0.3333333 0.6666667
Yes 0.6666667 0.3333333
The output contains a likelihood table as well as a-priori probabilities. The a-priori probabilities are
equivalent to the prior probability in Bayes’ theorem. That is, how frequently each level of class occurs in
the training dataset. The rationale underlying the prior probability is that, if a level is rare in the training
set, it is unlikely that such a level will occur in the test dataset. In other words, the prediction of an
outcome is influenced not only by the predictors, but also by the prevalence of the outcome.
Let us move on to an evaluation of this model. For this, you have to use the test set, that is, the data that the algorithm did not see during training.
> prediction <- predict(golfModel, newdata = golfTest)
You can check what class labels your model has predicted for all the data points in the test data:
> print(prediction)
However, it will be easier if we can compare these predicted labels with actual labels side by side, or in
a confusion matrix. Fortunately, in R there is a package for this functionality, named caret. Again, if you do
not have it, make sure you install it first. Keep in mind that installing caret requires some prerequisite
packages such as lattice and ggplot2. Make sure you install them first. Once you have all of them, use the
following command:
> confusionMatrix(prediction, golfTest$PlayGolf)
And you will have a nice confusion matrix along with p-values and other evaluation metrics.
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 2 2
Yes 0 1
Accuracy : 0.6
95% CI : (0.1466, 0.9473)
No Information Rate : 0.6
P-Value [Acc > NIR] : 0.6826
Kappa : 0.2857
Mcnemar’s Test P-Value : 0.4795
Sensitivity : 1.0000
Specificity : 0.3333
Pos Pred Value : 0.5000
Neg Pred Value : 1.0000
Prevalence : 0.4000
Detection Rate : 0.4000
Detection Prevalence : 0.8000
Balanced Accuracy : 0.6667
‘Positive’ Class : No
As you can see, the accuracy is not great, 60%, which can be attributed to the fact that we had only nine examples in our training set. Still, by now, you should have some idea about how to build a Naïve Bayes classifier in R.
Now we come to the last method for classification in this chapter. One thing that has
been common in all the classifier models we have seen so far is that they assume
linear separation of classes. In other words, they try to come up with a decision
boundary that is a line (or a hyperplane in a higher dimension). But many problems
do not have such linear characteristics. Support vector machine (SVM) is a method
for the classification of both linear and nonlinear data. SVMs are considered by many to be the best stock classifier for machine learning tasks; by “stock” we mean the classifier in its basic, unmodified form. This means you can take the basic form of the classifier, run it on the data, and expect low error rates. Support vector machines make good decisions for data points that are outside the training set.
In a nutshell, an SVM is an algorithm that uses nonlinear mapping to transform the
original training data into a higher dimension. Within this new dimension, it searches
for the linear optimal separating hyperplane (i.e., a decision boundary separating the
tuples of one class from another). With an appropriate nonlinear mapping to a
sufficiently high dimension, data from two classes can always be separated by a
hyperplane. The SVM finds this hyperplane using support vectors (“essential” training
tuples) and margins (defined by the support vectors).
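Before working through the geometry, note that an SVM can be fit in R much like the other classifiers in this chapter; a common choice is the svm() function from the e1071 package (an assumption here, not necessarily the book’s tool). A minimal two-class sketch on R’s built-in iris data:
> library(e1071)                                   # provides svm()
> set.seed(1234)
> # Drop one species so that we have a two-class problem
> twoClass <- droplevels(subset(iris, Species != "setosa"))
> svmModel <- svm(Species ~ ., data = twoClass, kernel = "linear")
> table(Predicted = predict(svmModel, twoClass), Actual = twoClass$Species)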
To understand what this means, let us look at an example. Let us start with a simple one, a
two-class problem.
Let the dataset D be given as (X1, y1), (X2, y2), . . ., (X|D|, y|D|), where Xi is the set of training
tuples with associated class labels yi. Each yi can take one of two values, either +1 or −1 (i.e.,
yi∈{+1,−1}), corresponding to the classes represented by the hollow red squares and blue
circles (ignore the data points represented by the solid symbols for now), respectively, in
Figure 9.16. From the graph, we see that the 2-D data are linearly separable (or “linear,” for
short), because a straight line can be drawn to separate all the tuples of class +1 from all the
tuples of class −1.
Note that if our data had three attributes (two independent variables and one dependent),
we would want to find the best separating plane (a demonstration is shown in Figure 9.17
for linearly non-separable data). Generalizing to n dimensions, if we had n number of
attributes, we want to find the best (n−1)-dimensional plane, called a hyperplane. In
general, we will use the term hyperplane to refer to the decision boundary that we are
searching for, regardless of the number of input attributes. So, in other words, our problem
is, how can we find the best hyperplane?
There are an infinite number of separating lines that could be drawn. We want to find the
“best” one, that is, one that (we hope) will have the minimum classification error on
previously unseen tuples. How can we find this best line?
An SVM approaches this problem by searching for the maximum marginal hyperplane.
Consider Figure 9.18, which shows two possible separating hyperplanes and their asso-
ciated margins. Before we get into the definition of margins, let us take an intuitive look at
Figure 9.18. Both hyperplanes can correctly classify all the given data tuples.
Figure 9.17 From line to hyperplane. (Source: Jiawei Han and Micheline Kamber. (2006). Data Mining: Concepts and Techniques.
Morgan Kaufmann.)
Figure 9.18 Possible hyperplanes and their margins. (Source: Jiawei Han and Micheline Kamber. (2006). Data Mining: Concepts
and Techniques. Morgan Kaufmann.)
Intuitively, however, we expect the hyperplane with the larger margin to be more accurate at
classifying future data tuples than the hyperplane with the smaller margin. This is why
(during the learning or training phase) the SVM searches for the hyperplane with the largest
margin, that is, the maximum marginal hyperplane (MMH). The associated margin gives
the largest separation between classes.
Roughly speaking, we would like to find the point closest to the separating hyperplane and
make sure this is as far away from the separating line as possible. This is known as “margin.”
The points closest to the separating hyperplane are known as support vectors. We want to
have the greatest possible margin, because if we made a mistake or trained our classifier on
limited data, we would want it to be as robust as possible. Now that we know that we are
trying to maximize the distance from the separating line to the support vectors, we need to
find a way to optimize this problem; that is, how do we find an SVM with the MMH and the support vectors? Consider this: a separating hyperplane can be written as
\[ f(x) = \beta_0 + \beta^T x, \tag{9.16} \]
where β is a weight vector, namely, β = {β1, β2, . . ., βn}, n is the number of attributes, and β0 is a scalar, often referred to as a bias. The optimal hyperplane can be represented in an infinite number of different ways by scaling β and β0. As a matter of convention, among all the possible representations of the hyperplane, the one chosen is
\[ |\beta_0 + \beta^T x| = 1, \tag{9.17} \]
where x symbolizes the training examples closest to the hyperplane. In general, the training
examples that are closest to the hyperplane are called support vectors. This representation is
known as the canonical hyperplane.
Now, we know from geometry that the distance d between a point (m, n) and a straight
line represented by Ax+ By+ C = 0 is given by
\[ d = \frac{|Am + Bn + C|}{\sqrt{A^2 + B^2}}. \tag{9.18} \]
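As a quick numeric check with made-up values, the distance from the point (1, 2) to the line 3x + 4y − 5 = 0 is
\[ d = \frac{|3(1) + 4(2) - 5|}{\sqrt{3^2 + 4^2}} = \frac{6}{5} = 1.2. \]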
Therefore, extending the same equation to a hyperplane gives the distance between a point x
and a hyperplane:
\[ d = \frac{|\beta_0 + \beta^T x|}{\|\beta\|}. \tag{9.19} \]
In particular, for the canonical hyperplane, the numerator is equal to one and the distance to
the support vectors is
\[ d_{\text{support vectors}} = \frac{|\beta_0 + \beta^T x|}{\|\beta\|} = \frac{1}{\|\beta\|}. \tag{9.20} \]
Now, the margin M is twice the distance to the closest examples. So
\[ M = \frac{2}{\|\beta\|}. \tag{9.21} \]
Finally, the problem of maximizing M is equivalent to the problem of minimizing a function L(β) subject to some constraints. The constraints model the requirement for the hyperplane to classify correctly all the training examples xi. Formally,
\[ \min_{\beta,\,\beta_0} L(\beta) = \frac{1}{2}\|\beta\|^2 \quad \text{subject to} \quad y_i(\beta^T x_i + \beta_0) \ge 1 \;\;\forall i, \tag{9.22} \]