Principles of Data Science WEB 1

Principles of Data Science is an introductory textbook designed for undergraduate students interested in data science, covering essential topics such as data collection, statistical analysis, and machine learning using Python. The book is structured into four units that guide students through the data science cycle and emphasizes ethical practices in data science. OpenStax, a nonprofit initiative of Rice University, provides this textbook for free under a Creative Commons license, allowing for noncommercial distribution and adaptation.


Principles of Data Science

SENIOR CONTRIBUTING AUTHORS


DR. SHAUN V. AULT, VALDOSTA STATE UNIVERSITY
DR. SOOHYUN NAM LIAO, UNIVERSITY OF CALIFORNIA SAN DIEGO
LARRY MUSOLINO, PENNSYLVANIA STATE UNIVERSITY
OpenStax
Rice University
6100 Main Street MS-375
Houston, Texas 77005

To learn more about OpenStax, visit https://openstax.org.


Individual print copies and bulk orders can be purchased through our website.

©2025 Rice University. Textbook content produced by OpenStax is licensed under a Creative Commons
Attribution Non-Commercial ShareAlike 4.0 International License (CC BY-NC-SA 4.0). Under this license, any
user of this textbook or the textbook contents herein can share, remix, and build upon the content for
noncommercial purposes only. Any adaptations must be shared under the same type of license. In any case of
sharing the original or adapted material, whether in whole or in part, the user must provide proper attribution
as follows:

- If you noncommercially redistribute this textbook in a digital format (including but not limited to PDF and
HTML), then you must retain on every page the following attribution:
“Access for free at openstax.org.”
- If you noncommercially redistribute this textbook in a print format, then you must include on every
physical page the following attribution:
“Access for free at openstax.org.”
- If you noncommercially redistribute part of this textbook, then you must retain in every digital format
page view (including but not limited to PDF and HTML) and on every physical printed page the following
attribution:
“Access for free at openstax.org.”
- If you use this textbook as a bibliographic reference, please include
https://openstax.org/details/books/principles-data-science in your citation.

For questions regarding this licensing, please contact support@openstax.org.

Trademarks
The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, OpenStax CNX logo,
OpenStax Tutor name, Openstax Tutor logo, Connexions name, Connexions logo, Rice University name, and
Rice University logo are not subject to the license and may not be reproduced without the prior and express
written consent of Rice University.

Kendall Hunt and the Kendall Hunt Logo are trademarks of Kendall Hunt. The Kendall Hunt mark is registered
in the United States, Canada, and the European Union. These trademarks may not be used without the prior
and express written consent of Kendall Hunt.

COLOR PAPERBACK BOOK ISBN-13 979-8-3851-6185-0


B&W PAPERBACK BOOK ISBN-13 979-8-3851-6186-7
DIGITAL VERSION ISBN-13 978-1-961584-60-0
ORIGINAL PUBLICATION YEAR 2025
1 2 3 4 5 6 7 8 9 10 CJP 25
OPENSTAX

OpenStax provides free, peer-reviewed, openly licensed textbooks for introductory college and Advanced
Placement® courses and low-cost, personalized courseware that helps students learn. A nonprofit ed tech
initiative based at Rice University, we’re committed to helping students access the tools they need to complete
their courses and meet their educational goals.

RICE UNIVERSITY

OpenStax is an initiative of Rice University. As a leading research university with a distinctive commitment to
undergraduate education, Rice University aspires to path-breaking research, unsurpassed teaching, and
contributions to the betterment of our world. It seeks to fulfill this mission by cultivating a diverse community
of learning and discovery that produces leaders across the spectrum of human endeavor.

PHILANTHROPIC SUPPORT

OpenStax is grateful for the generous philanthropic partners who advance our mission to improve educational
access and learning for everyone. To see the impact of our supporter community and our most updated list of
partners, please visit openstax.org/foundation.

Arnold Ventures
Chan Zuckerberg Initiative
Chegg, Inc.
Arthur and Carlyse Ciocca Charitable Foundation
Digital Promise
Ann and John Doerr
Bill & Melinda Gates Foundation
Girard Foundation
Google Inc.
The William and Flora Hewlett Foundation
The Hewlett-Packard Company
Intel Inc.
Rusty and John Jaggers
The Calvin K. Kazanjian Economics Foundation
Charles Koch Foundation
Leon Lowenstein Foundation, Inc.
The Maxfield Foundation
Burt and Deedee McMurtry
Michelson 20MM Foundation
National Science Foundation
The Open Society Foundations
Jumee Yhu and David E. Park III
Brian D. Patterson USA-International Foundation
The Bill and Stephanie Sick Fund
Steven L. Smith & Diana T. Go
Stand Together
Robin and Sandy Stuart Foundation
The Stuart Family Foundation
Tammy and Guillermo Treviño
Valhalla Charitable Foundation
White Star Education Foundation
Schmidt Futures
William Marsh Rice University


CONTENTS

Preface

UNIT 1 INTRODUCING DATA SCIENCE AND DATA COLLECTION

1 What Are Data and Data Science?
Introduction
1.1 What Is Data Science?
1.2 Data Science in Practice
1.3 Data and Datasets
1.4 Using Technology for Data Science
1.5 Data Science with Python
Key Terms
Group Project
Chapter Review
Critical Thinking
Quantitative Problems
References

2 Collecting and Preparing Data
Introduction
2.1 Overview of Data Collection Methods
2.2 Survey Design and Implementation
2.3 Web Scraping and Social Media Data Collection
2.4 Data Cleaning and Preprocessing
2.5 Handling Large Datasets
Key Terms
Group Project
Critical Thinking
References

UNIT 2 ANALYZING DATA USING STATISTICS

3 Descriptive Statistics: Statistical Measurements and Probability Distributions
Introduction
3.1 Measures of Center
3.2 Measures of Variation
3.3 Measures of Position
3.4 Probability Theory
3.5 Discrete and Continuous Probability Distributions
Key Terms
Group Project
Quantitative Problems

4 Inferential Statistics and Regression Analysis
Introduction
4.1 Statistical Inference and Confidence Intervals
4.2 Hypothesis Testing
4.3 Correlation and Linear Regression Analysis
4.4 Analysis of Variance (ANOVA)
Key Terms
Group Project
Quantitative Problems

UNIT 3 PREDICTING AND MODELING USING DATA

5 Time Series and Forecasting
Introduction
5.1 Introduction to Time Series Analysis
5.2 Components of Time Series Analysis
5.3 Time Series Forecasting Methods
5.4 Forecast Evaluation Methods
Key Terms
Group Project
Critical Thinking
Quantitative Problems

6 Decision-Making Using Machine Learning Basics
Introduction
6.1 What Is Machine Learning?
6.2 Classification Using Machine Learning
6.3 Machine Learning in Regression Analysis
6.4 Decision Trees
6.5 Other Machine Learning Techniques
Key Terms
Group Project
Chapter Review
Critical Thinking
Quantitative Problems
References

7 Deep Learning and AI Basics
Introduction
7.1 Introduction to Neural Networks
7.2 Backpropagation
7.3 Introduction to Deep Learning
7.4 Convolutional Neural Networks
7.5 Natural Language Processing
Key Terms

Access for free at openstax.org


Group Project
Chapter Review
Critical Thinking
Quantitative Problems
References

UNIT 4 MAINTAINING A PROFESSIONAL AND ETHICAL DATA SCIENCE PRACTICE

8 Ethics Throughout the Data Science Cycle
Introduction
8.1 Ethics in Data Collection
8.2 Ethics in Data Analysis and Modeling
8.3 Ethics in Visualization and Reporting
Key Terms
Group Project
Chapter Review
Critical Thinking
References

9 Visualizing Data
Introduction
9.1 Encoding Univariate Data
9.2 Encoding Data That Change Over Time
9.3 Graphing Probability Distributions
9.4 Geospatial and Heatmap Data Visualization Using Python
9.5 Multivariate and Network Data Visualization Using Python
Key Terms
Group Project
Critical Thinking

10 Reporting Results
Introduction
10.1 Writing an Informative Report
10.2 Validating Your Model
10.3 Effective Executive Summaries
Key Terms
Group Project
Chapter Review
Critical Thinking
References

Appendix A: Review of Excel for Data Science
Appendix B: Review of R Studio for Data Science
Appendix C: Review of Python Algorithms
Appendix D: Review of Python Functions

Answer Key

Index

Preface
About OpenStax
OpenStax is part of Rice University, which is a 501(c)(3) nonprofit charitable corporation. As an educational
initiative, it's our mission to improve educational access and learning for everyone. Through our partnerships
with philanthropic organizations and our alliance with other educational resource companies, we're breaking
down the most common barriers to learning, because we believe that everyone should and can have access to
knowledge.

About OpenStax Resources


Customization
Principles of Data Science is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0
(CC BY-NC-SA 4.0) license, which means that you can non-commercially distribute, remix, and build upon the
content, as long as you provide attribution to OpenStax and its content contributors, under the same license.

Because our books are openly licensed, you are free to use the entire book or select only the sections that are
most relevant to the needs of your course. Feel free to remix the content by assigning your students certain
chapters and sections in your syllabus, in the order that you prefer. You can even provide a direct link in your
syllabus to the sections in the web view of your book.

Instructors also have the option of creating a customized version of their OpenStax book. Visit the Instructor
Resources section of your book page on OpenStax.org for more information.

Art Attribution
In Principles of Data Science, most art contains attribution to its title, creator or rights holder, host platform,
and license within the caption. Because the art is openly licensed, non-commercial users or organizations may
reuse the art as long as they provide the same attribution to its original source. (Commercial entities should
contact OpenStax to discuss reuse rights and permissions.) To maximize readability and content flow, some art
does not include attribution in the text. If you reuse art from this text that does not have attribution provided,
use the following attribution: Copyright Rice University, OpenStax, under CC BY-NC-SA 4.0 license.

Errata
All OpenStax textbooks undergo a rigorous review process. However, like any professional-grade textbook,
errors sometimes occur. In addition, the topics, data, technologies, and legal circumstances in
data science change frequently, and portions of the textbook may become out of date. Since our books are
web-based, we can make updates periodically when deemed pedagogically necessary. If you have a correction
to suggest, submit it through the link on your book page on OpenStax.org. Subject matter experts review all
errata suggestions. OpenStax is committed to remaining transparent about all updates, so you will also find a
list of past and pending errata changes on your book page on OpenStax.org.

Format
You can access this textbook for free in web view or PDF through OpenStax.org, and for a low cost in print. The
web view is the recommended format because it is the most accessible – including being WCAG 2.2 AA
compliant – and most current. Print versions are available for individual purchase, or they may be ordered
through your campus bookstore.

About Principles of Data Science


Summary
Principles of Data Science is intended as introductory material for a one- or two-semester course on data
science. It is appropriate for undergraduate students interested in the rapidly growing field of data science;
this may include data science majors, data science minors, or students concentrating in business, finance,
health care, engineering, the sciences, or a number of other fields where data science has become critically
important. The material is designed to prepare students for future coursework and career applications in a
data science–related field. It does not assume significant prior coding experience, nor does it assume
completion of more than college algebra. The text provides foundational statistics instruction for students who
may have a limited statistical background.

Coverage and Scope


Principles of Data Science emphasizes the use of Python code in relevant data science applications. Python
provides a versatile programming language with libraries and frameworks for data manipulation, analysis, and
machine learning. The book begins with an introduction to Python and presents Python libraries, algorithms,
and functions as they are needed throughout. In occasional, focused instances, the authors also use Excel to
illustrate the basic manipulation of data using functions, formulas, and tools for calculations, visualization, and
financial analysis. R, a programming language used most often for statistical modeling, is briefly described
and then summarized and applied to relevant examples in a book appendix. Excel and Python summaries are
also provided in appendices at the end of the book.

The table of contents (TOC) is divided into ten chapters, organized in four units, intuitively following the
standard data science cycle. The four units are:

Unit 1: Introducing Data Science and Data Collection
Unit 2: Analyzing Data Using Statistics
Unit 3: Predicting and Modeling Using Data
Unit 4: Maintaining a Professional and Ethical Data Science Practice

The learning objectives and curriculum of introductory data science courses vary, so this textbook aims to
provide broader and more detailed coverage than an average single-semester course. Instructors can choose
which chapters or sections they want to include in their particular course.

To enable this flexibility, chapters in this text can be used in a self-contained manner, although most chapters
do cross-reference sections and chapters that precede or follow. More importantly, the authors have taken
care to build topics gradually, from chapter to chapter, so instructors should bear this in mind when
considering alternate sequence coverage.

Unit 1: Introducing Data Science and Data Collection starts off with Chapter 1’s explanation of the data
science cycle (data collection and preparation, data analysis, and data reporting) and its practical applications
in fields such as medicine, engineering, and business. Chapter 1 also describes various types of datasets and
provides the student with basic data summary tools from the Python pandas library. Chapter 2 describes the
processes of data collection and cleaning and the challenges of managing large datasets. It previews some of
the qualitative ethical considerations that Chapters 7 and 8 later expand on.

Unit 2: Analyzing Data Using Statistics forms a self-contained unit that instructors may assign on a modular,
more optional basis, depending on students’ prior coursework. Chapter 3 focuses on measures of center,
variation, and position, leading up to probability theory and illustrating how to use Python with binomial and
normal distributions. Chapter 4 goes deeper into statistical analysis, demonstrating how to use Python to
calculate confidence intervals, conduct hypothesis tests, and perform correlation and regression analysis.
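
As a rough preview of the kind of calculation Unit 2 covers, the sketch below computes measures of center, variation, and position, a binomial probability, a normal probability, and an approximate confidence interval. It is only an illustration under stated assumptions: the `scores` data and the `binomial_pmf` helper are invented here, and the sketch uses only Python's standard library, which may differ from the libraries the chapters themselves use.

```python
import math
import statistics

# Hypothetical sample data, invented purely for illustration.
scores = [72, 85, 90, 66, 78, 85, 94, 70]

# Measures of center and variation.
mean = statistics.mean(scores)
median = statistics.median(scores)
stdev = statistics.stdev(scores)  # sample standard deviation

# Measures of position: the three quartile cut points.
q1, q2, q3 = statistics.quantiles(scores, n=4)


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials (hypothetical helper)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


# Normal distribution: probability that a standard normal value falls below 1.96.
standard_normal = statistics.NormalDist(mu=0, sigma=1)
p_below = standard_normal.cdf(1.96)

# Approximate 95% confidence interval for the mean using a normal critical
# value. For a sample this small, a t critical value would usually be preferred.
z_crit = standard_normal.inv_cdf(0.975)
margin = z_crit * stdev / math.sqrt(len(scores))
ci = (mean - margin, mean + margin)

print(f"mean={mean}, median={median}, stdev={stdev:.2f}")
print(f"quartiles: {q1}, {q2}, {q3}")
print(f"P(X = 2) for n=4, p=0.5: {binomial_pmf(2, 4, 0.5)}")
print(f"P(Z < 1.96): {p_below:.4f}")
print(f"approximate 95% CI for the mean: ({ci[0]:.2f}, {ci[1]:.2f})")
```

The normal-based interval keeps the sketch dependency-free; it is a simplification rather than the method the textbook prescribes.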

The three chapters in Unit 3: Predicting and Modeling Using Data form the core of the book. Chapter 5
introduces students to the concept and practical applications of time series. Chapter 5 provides focused
examples of both Python and Excel techniques useful in forecasting time series, analyzing seasonality, and
identifying measures of error. Chapter 6 starts with distinguishing supervised vs. unsupervised machine
learning and then develops some common methods of data classification, including logistic regression,
clustering algorithms, and decision trees. Chapter 6 includes Python techniques for more sophisticated
statistical analyses such as regression with bootstrapping and multivariable regression. Finally, Chapter 6
refers back to the topics of data mining and big data introduced in Chapter 2.

Chapter 7 is a pedagogically rich chapter, with a balance of quantitative and qualitative content, covering the
role of neural networks in deep learning and applications in large language models. The first four sections
discuss the topics of neural networks (standard, recurrent, and convolutional), backpropagation, and deep
learning. The real-life application of classifying handwritten numerals is used as an example. The last section
dives into the important and rapidly changing technology of natural language processing (NLP), large
language models (LLMs), and artificial intelligence (AI). While in-depth coverage of these evolving subjects is
beyond the scope of this textbook, the pros/cons, the examples from technical and artistic applications, and
the online resources provided in this section all serve as a good starting point for classroom discussion. This
topic also naturally segues into the broader professional responsibility discussed in Chapter 8.

The final chapters in Unit 4: Maintaining a Professional and Ethical Data Science Practice help the student
apply and adjust the specific techniques learned in the previous chapters to the real-life data analysis,
decision-making, and communication situations they will encounter professionally. Chapter 8 emphasizes the
importance of ethical considerations along each part of the cycle: data collection; data preparation, analysis,
and modeling; and reporting and visualization. Coverage of the issues in this chapter makes students aware of
the subjective and sensitive aspects of privacy and informed consent at every step in the process. At the
professional level, students learn more about the evolving standards for the relatively new field of data
science, which may differ among industries or between the United States and other countries.

Chapter 9 circles back to some of the statistical concepts introduced in Chapters 3 and 4, with an emphasis on
clear visual analysis of data trends. Chapter 9 provides a range of Python techniques to create boxplots and
histograms for univariate data; to create line charts and trend curves for time series; to graph binomial,
Poisson, and normal distributions; to generate heatmaps from geospatial data; and to create correlation
heatmaps from multidimensional data.

Chapter 10 brings the student back to the practical decision-making setting introduced in Chapter 1. Chapter
10 helps the student address how to tailor the data analysis and presentation to the audience and purpose,
how to validate the assumptions of a model, and how to write an effective executive summary.

The four Appendices (A–D) provide a practical set of references for Excel commands and commands for R
statistical software as well as Python commands and algorithms. Appendix A uses a baseball dataset from
Chapter 1 to illustrate basic Microsoft® Excel® software commands for manipulating, analyzing, summarizing,
and graphing data. Appendix B provides a brief overview of data analysis with the open-source statistical
computing package R (https://openstax.org/r/project), using a stock price example. Appendix C lists the
approximately 60 Python algorithms used in the textbook, and Appendix D lists the code and syntax for the
approximately 75 Python functions demonstrated in the textbook. Both Appendices C and D are organized in
a tabular format, in consecutive chapter order, hyperlinked to the first significant use of each Python algorithm
and function. (Instructors may find Appendices C and D especially useful in developing their teaching plan.)

Pedagogical Foundation
Because this is a practical, introductory-level textbook, math equations and code are presented clearly
throughout. Particularly in the core chapters, students are introduced to key mathematical concepts,
equations, and formulas, often followed by numbered Example Problems that encourage students to apply the
concepts just covered in a variety of situations. Technical illustrations and Python code feature boxes build on
and supplement the theory. Students are encouraged to try out the Python code from the feature boxes in the
Google Colaboratory (https://openstax.org/r/colabresearch) (Colab) platform.

The authors have included a diverse mix of data types and sources for analysis, illustration, and discussion
purposes. Some scenarios are fictional and/or internal to standard Python libraries, while other datasets come
from external, real-world sources, both corporate and government (such as Federal Reserve Economic Data
(FRED) (https://openstax.org/r/nasdaq1), Statista (https://openstax.org/r/statista1), and Nasdaq
(https://openstax.org/r/nasdaq)). Most scenarios are either summarized in an in-line table, or have datasets
provided in a downloadable student spreadsheet for import as a .CSV file (Chapter 1 also discusses the .JSON
format) and/or with a hyperlink to the external source. Some examples focus on scientific topics (e.g., the
“classic” Iris flower dataset, annual temperature changes), while other datasets reflect phenomena with more
nuanced socioeconomic issues (gender-based salary differences, cardiac disease markers in patients).
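
To make the two file formats mentioned above concrete, here is a minimal sketch of reading CSV text and round-tripping the same records through JSON. The records are made up for illustration and are not taken from the textbook's spreadsheets; the sketch uses only Python's standard library and parses in-memory text rather than a downloaded file.

```python
import csv
import io
import json

# Hypothetical records, invented for illustration only.
csv_text = "species,petal_length\nsetosa,1.4\nversicolor,4.7\nvirginica,6.0\n"

# Parse the CSV text into a list of dictionaries keyed by the header row.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# CSV values arrive as strings, so numeric fields need explicit conversion.
for row in rows:
    row["petal_length"] = float(row["petal_length"])

# Serialize the same records as JSON text, then read them back.
json_text = json.dumps(rows)
records = json.loads(json_text)

print(records[0])
```

With a real downloaded file, the `io.StringIO(csv_text)` argument would be replaced by a file object opened on that file.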

While the book’s foundational chapters illustrate practical “techniques and tools,” the more process-oriented
chapters iteratively build on and emphasize an underlying framework of professional, responsible, and ethical
data science practice. Chapter 1 refers the student to several national and international data science
organizations that are developing professional standards. Chapter 2 emphasizes avoiding bias in survey and
sample design. Chapter 8 discusses relevant privacy legislation. For further class exploration, Chapters 7 and 8
include online resources on mitigating bias and discrimination in machine learning models, including related
Python libraries such as HolisticAI (https://openstax.org/r/holistic) and Fairlens
(https://openstax.org/r/projectfairlens). Chapter 10 references several executive dashboards that support transparency in
government.

The Group Projects at the end of each chapter encourage students to apply the techniques and considerations
covered in the book using either datasets already provided or new data sources that they might receive from
their instructors or in their own research. For example, project topics include the following: collecting data on
animal extinction due to global warming (Chapter 2), predicting future trends in stock market prices (Chapter
5), diagnosing patients for liver disease (Chapter 7), and analyzing the severity of ransomware attacks (Chapter
8).

Key Features
The key in-chapter features, depending on chapter content and topics, may include the following:

• Learning Outcomes (LOs) to guide the student’s progress through the chapter
• Example Problems, demonstrating calculations and solutions in-line
• Python code boxes, providing sample input code for and output from Google Colab
• Note boxes providing instructional tips to help with the practical aspects of the math and coding
• Data tables from a variety of social science and industry settings
• Technical charts and heatmaps to visually demonstrate code output and variable relationships
• Exploring Further boxes, with additional resources and online examples to extend learning
• Mathematical formulas and equations
• Links to a downloadable spreadsheet containing key datasets referenced in the chapter for easy
manipulation of data

End-of-chapter (EOC) elements, depending on chapter content and topics, may include the following:

• Key Terms
• Group Projects
• Chapter Review Questions
• Critical Thinking Questions
• Quantitative Problems

Answers and Solutions to Questions in the Book


The student-facing Answer Key at the end of the book provides the correct answer letter and text for Chapter
Review questions (multiple-choice). An Instructor Solution Manual (ISM) will be available for verified instructors
and downloadable from the restricted OpenStax Instructor Resources web page, with detailed solutions to
Quantitative Problems, sample answers for Critical Thinking questions, and a brief explanation of the correct
answer for Chapter Review questions. (Sample calculations, tables, code, or figures may be included, as
applicable.) An excerpt of the ISM, consisting of the solutions/sample answers for the odd-numbered
questions only, will also be available as a Student Solution Manual (SSM), downloadable from the public
OpenStax Student Resources web page. (Answers to the Group Projects are not provided, as they are
integrative, exploratory, open-ended assignments.)

About the Authors


Senior Contributing Authors

Senior Contributing Authors (left to right): Shaun V. Ault, Soohyun Nam Liao, Larry Musolino

Dr. Shaun V. Ault, Valdosta State University. Dr. Ault joined the Valdosta State University faculty in 2012,
serving as Department Head of Mathematics from 2017 to 2023 and Professor since 2021. He holds a PhD in
mathematics from The Ohio State University, a BA in mathematics from Oberlin College, and a Bachelor of
Music from the Oberlin Conservatory of Music. He previously taught at Fordham University and The Ohio State
University. He is a Certified Institutional Review Board Professional and holds membership in the Mathematical
Association of America, American Mathematical Society, and Society for Industrial and Applied Mathematics.
He has research interests in algebraic topology and computational mathematics and has published in a
number of peer-reviewed journals. He has authored two textbooks: Understanding Topology: A
Practical Introduction (Johns Hopkins University Press, 2018) and, with Charles Kicey, Counting Lattice Paths
Using Fourier Methods. Applied and Numerical Harmonic Analysis, Springer International (2019).

Dr. Soohyun Nam Liao, University of California San Diego. Dr. Liao joined the UC San Diego faculty in 2015,
serving as Assistant Teaching Professor since 2021. She holds PhD and MS degrees in computer science and
engineering from UC San Diego and a BS in electronics engineering from Seoul University, South Korea. She
previously taught at Princeton University and was an engineer at Qualcomm Inc. She focuses on computer
science (CS) education research as a means to support diversity and equity (DEI) in CS programs. Among her
recent co-authored papers is, with Yunyi She, Korena S. Klimczak, and Michael E. Levin, “ClearMind Workshop:
An ACT-Based Intervention Tailored for Academic Procrastination among Computing Students,” SIGCSE
(1) 2024: 1216-1222. She has received a National Science Foundation grant to develop a toolkit for AI4All (data
science camps for high school students).

Larry Musolino, Pennsylvania State University. Larry Musolino joined the Penn State, Lehigh Valley, faculty
in 2015, serving as Assistant Teaching Professor of Mathematics since 2022. He received an MS in mathematics
from Texas A&M University, an MS in statistics from Rochester Institute of Technology (RIT), and MS degrees in
computer science and in electrical engineering, both from Lehigh University. He received his BS in electrical
engineering from City College of New York (CCNY). He previously was a Distinguished Member of Technical
Staff in semiconductor manufacturing at LSI Corporation. He is a member of the Penn State OER (Open
Educational Resources) Advisory Group and has authored a calculus open-source textbook. In addition, he
co-authored an open-source Calculus for Engineering workbook. He has contributed to several OpenStax
textbooks, authoring the statistics chapters in the Principles of Finance textbook and editing and revising
Introductory Statistics, 2e, and Introductory Business Statistics, 2e.

The authors wish to express their deep gratitude to Developmental Editor Ann West for her skillful editing and
gracious shepherding of this manuscript. The authors also thank Technical Editor Dhawani Shah (PhD
Statistics, Gujarat University) for contributing technical reviews of the chapters throughout the content
development process.

Contributing Authors
Wisam Bukaita, Lawrence Technological University
Aeron Zentner, Coastline Community College

Reviewers
Wisam Bukaita, Lawrence Technological University
Drew Lazar, Ball State University
J. Hathaway, Brigham Young University-Idaho
Salvatore Morgera, University of South Florida
David H. Olsen, Utah Tech University
Thomas Pfaff, Ithaca College
Jian Yang, University of North Texas
Aeron Zentner, Coastline Community College

Additional Resources
Student and Instructor Resources
We’ve compiled additional resources for both students and instructors, including Getting Started Guides.
Instructor resources require a verified instructor account, which you can apply for when you log in or create
your account on OpenStax.org. Take advantage of these resources to supplement your OpenStax book.

Academic Integrity
Academic integrity builds trust, understanding, equity, and genuine learning. While students may encounter
significant challenges in their courses and their lives, doing their own work and maintaining a high degree of
authenticity will result in meaningful outcomes that will extend far beyond their college career. Faculty,
administrators, resource providers, and students should work together to maintain a fair and positive
experience.

We realize that students benefit when academic integrity ground rules are established early in the course. To
that end, OpenStax has created an interactive to aid with academic integrity discussions in your course.

Visit our academic integrity slider (https://view.genial.ly/61e08a7af6db870d591078c1/interactive-image-defining-academic-integrity-interactive-slider).
Click and drag icons along the continuum to align these practices with your institution and course policies. You
may then include the graphic on your syllabus, present it in your first course meeting, or create a handout for students. (attribution:
Copyright Rice University, OpenStax, under CC BY 4.0 license)

Access for free at openstax.org



At OpenStax we are also developing resources supporting authentic learning experiences and assessment.
Please visit this book’s page for updates. For an in-depth review of academic integrity strategies, we highly
recommend visiting the International Center of Academic Integrity (ICAI) website at
https://academicintegrity.org/ (https://openstax.org/r/academicinte).

Community Hubs
OpenStax partners with the Institute for the Study of Knowledge Management in Education (ISKME) to offer
Community Hubs on OER Commons—a platform for instructors to share community-created resources that
support OpenStax books, free of charge. Through our Community Hubs, instructors can upload their own
materials or download resources to use in their own courses, including additional ancillaries, teaching
material, multimedia, and relevant course content. We encourage instructors to join the hubs for the subjects
most relevant to their teaching and research as an opportunity both to enrich their courses and to engage with
other faculty. To reach the Community Hubs, visit www.oercommons.org/hubs/openstax.

Technology Partners
As allies in making high-quality learning materials accessible, our technology partners offer optional low-cost
tools that are integrated with OpenStax books. To access the technology options for your text, visit your book
page on OpenStax.org.


1
What Are Data and Data Science?

Figure 1.1 Petroglyphs are one of the earliest types of data generated by humanity, providing vital information about the daily life of
the people who created them. (credit: modification of work “Indian petroglyphs (~100 B.C. to ~1540 A.D.) (Newspaper Rock,
southeastern Utah, USA) 24” by James St. John/Flickr, CC BY 2.0)

Chapter Outline
1.1 What Is Data Science?
1.2 Data Science in Practice
1.3 Data and Datasets
1.4 Using Technology for Data Science
1.5 Data Science with Python

Introduction
Many of us use the terms “data” and “data science,” but not necessarily with a lot of precision. This chapter will
define data science terminology and apply the terms in multiple fields. The chapter will also briefly introduce
the types of technology (such as statistical software, spreadsheets, and programming languages) that data
scientists use to perform their work and will then take a deeper dive into the use of Python for data analysis.
The chapter should help you build a technical foundation so that you can practice the more advanced data
science concepts covered in future chapters.

1.1 What Is Data Science?


Learning Outcomes
By the end of this section, you should be able to:
• 1.1.1 Describe the goals of data science.
• 1.1.2 Explain the data science cycle and goals of each step in the cycle.
• 1.1.3 Explain the role of data management in the data science process.

Data science is a field of study that investigates how to collect, manage, and analyze data of all types in order
to retrieve meaningful information. Although we will describe data in more detail in Data and Datasets, you
can consider data to be any pieces of evidence or observations that can be analyzed to provide some insights.

In its earliest days, the work of data science was spread across multiple disciplines, including statistics,
mathematics, computer science, and social science. It was commonly believed that the jobs of data collection,
management, and analysis would be carried out by different types of experts, with each job independent of
the others. To be more specific, data collection was considered to be the province of so-called domain
experts (e.g., doctors for medical data, psychologists for psychological data, business analysts for sales,
logistics, and marketing data) as they had the full context of the data; data management was for computer
scientists/engineers as they knew how to store and process data in computing systems (e.g., a single
computer, a server, a data warehouse); and data analysis was for statisticians and mathematicians as they
knew how to derive meaningful insights from data. Technological advancement brought about the
proliferation of data, muddying the boundaries between these jobs, as shown in Figure 1.2. Now, it is expected
that a data scientist or data science team will have some expertise in all three domains.

Figure 1.2 The Field of Data Science

One good example of this is the development of personal cell phones. In the past, households typically had
only one landline telephone, and the only data that was generated with the telephone was the list of phone
numbers called by the household members. Today the majority of consumers own a smartphone, which
contains a tremendous amount of data: photos, social media contacts, videos, locations (usually), and perhaps
health data (with the consumers’ consent), among many other things.

Is the data from a smartphone collected solely by domain experts who specialize in photos, videos, and
such? Probably not. It is automatically logged and collected by the smartphone system itself, which is
designed by computer scientists/engineers. For a health care scientist, collecting data from many individuals in
the “traditional” way, by bringing patients into a laboratory and taking vital signs regularly over a period of time,
takes a lot of time and effort. A smartphone application is a more efficient and productive method, from a data
collection perspective.

Data science tasks are often described as a process, and this section provides an overview for each step of that
process.


The Data Science Cycle


Data science tasks follow a process, called the data science cycle, which includes problem definition, then
data collection, preparation, analysis, and reporting, as illustrated in Figure 1.3. See this animation about data
science (https://openstax.org/r/youtube), which describes the data science cycle in much the same way.

Figure 1.3 The Data Science Cycle

Although data collection and preparation may sound like simple tasks compared to the more important work
of analysis, they are actually the most time- and effort-consuming steps in the data science cycle. According to
a survey conducted by Anaconda (2020), data scientists spend about half of their working time on data
collection and cleaning, while data analysis and communication each take about a quarter to a third of the time,
depending on the job.

Problem Definition, Data Collection, and Data Preparation


The first step in the data science cycle is a precise definition of the problem statement to establish clear
objectives for the goal and scope of the data analysis project. Once the problem has been well defined, the
data must be generated and collected. Data collection is the systematic process of gathering information on
variables of interest. Data is often collected purposefully by domain experts to find answers to predefined
problems. One example is data on customer responses to a product satisfaction survey. These survey
questions will likely be crafted by the product sales and marketing representatives, who likely have a specific
plan for how they wish to use the response data once it is collected.

Not all data is generated this purposefully, though. A lot of data around our daily life is simply a by-product of
our activity. These by-products are kept as data because they could be used by others to find some helpful
insights later. One example is our web search histories. We use a web search engine like Google to search for
information about our interests, and such activity leaves a history of the search text we used on the Google
server. Google employees can utilize the records of numerous Google users in order to analyze common
search patterns, present accurate search results, and potentially, to display relevant advertisements back to
the searchers.

Often the collected data is not in an optimal form for analysis. It needs to be processed somehow so that it can
be analyzed, in the phase called data preparation or data processing. Suppose you work for Google and want
to know what kind of food people around the globe search for the most at night. You have users’
search history from around the globe, but you probably cannot use the search history data as is. The search
keywords will probably be in different languages, and users live all around the Earth, so nighttime will vary by
each user’s time zone. Even then, some search keywords will contain typos, some will simply not make sense, and
some will be blank if the Google server somehow failed to store that specific search history record. Note that
all these scenarios are possible, and therefore data preparation should address these issues so that the actual
analysis can draw more accurate results. There are many different ways to manage these issues, which we will
discuss more fully in Collecting and Preparing Data.
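
The preparation issues described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the UTC offsets, and the definition of "night" (20:00 to 03:59 local time) are all invented here, and the sketch handles only blank entries and time zones, not translation or typo correction.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical raw records: (query text, UTC timestamp, user's UTC offset in hours).
raw_records = [
    ("Pizza near me", datetime(2024, 3, 1, 4, 30, tzinfo=timezone.utc), -8),
    ("", datetime(2024, 3, 1, 13, 0, tzinfo=timezone.utc), 9),  # blank record
    ("  ramen ", datetime(2024, 3, 1, 14, 15, tzinfo=timezone.utc), 9),
    ("tacos", datetime(2024, 3, 1, 18, 0, tzinfo=timezone.utc), -6),
]


def is_night(local_time, start_hour=20, end_hour=4):
    """Assumed definition of night: 20:00 through 03:59 local time."""
    return local_time.hour >= start_hour or local_time.hour < end_hour


def prepare(records):
    """Drop blank queries, normalize text, and keep only nighttime searches."""
    cleaned = []
    for query, utc_time, offset_hours in records:
        query = query.strip().lower()
        if not query:
            continue  # skip records whose search text was never stored
        # Convert the UTC timestamp to the user's local time.
        local_zone = timezone(timedelta(hours=offset_hours))
        local_time = utc_time.astimezone(local_zone)
        if is_night(local_time):
            cleaned.append((query, local_time.hour))
    return cleaned


night_searches = prepare(raw_records)
print(night_searches)
```

Only the first and third records survive: the second is blank, and the fourth was made at noon in the user's local time.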

Data Analysis
Once the data is collected and prepared, it must be analyzed in order to discover meaningful insights, a
process called data analysis. There are a variety of data analysis methods to choose from, ranging from
simple ones like checking minimum and maximum values, to more advanced ones such as modeling a
dependent variable. Most of the time data scientists start with simple methods and then move into more
advanced ones, based on what they want to investigate further. Descriptive Statistics: Statistical Measurements
and Probability Distributions and Inferential Statistics and Regression Analysis discuss when and how to use
different analysis methods. Time Series and Forecasting and Decision-Making Using Machine Learning Basics
discuss forecasting and decision-making.
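
As a small illustration of the "simple methods" mentioned above, the following sketch computes the minimum, maximum, mean, and median of a list of numbers using Python's standard library. The sales figures are hypothetical values made up for this example.

```python
from statistics import mean, median

# Hypothetical daily sales counts, invented for illustration.
daily_sales = [120, 135, 98, 143, 110, 127, 101]

summary = {
    "min": min(daily_sales),
    "max": max(daily_sales),
    "mean": mean(daily_sales),
    "median": median(daily_sales),
}

for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

A quick summary like this is often the first check a data scientist runs before deciding which variables deserve a more advanced analysis.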

Data Reporting
Data reporting involves the presentation of data in a way that will best convey the information learned from
data analysis. The importance of data reporting cannot be overemphasized. Without it, data scientists cannot
effectively communicate to their audience the insights they discovered in the data. Data scientists work with
domain experts from different fields, and it is their responsibility to communicate the results of their analysis
in a way that those domain experts can understand. Data visualization is a graphical way of presenting and
reporting data to point out the patterns, trends, and hidden insights; it involves the use of visual elements
such as charts, graphs, and maps to present data in a way that is easy to comprehend and analyze. The goal of
data visualization is to communicate information effectively and facilitate better decision-making. Data
visualization and basic statistical graphing, including how to create graphical presentations of data using
Python, are explored in depth in Visualizing Data. Further details on reporting results are discussed in
Reporting Results.

Data Management
In the early days of data analysis (when generated data was mostly structured and not quite so “big”), it was
possible to keep data in local storage (e.g., on a single computer or a portable hard drive). With this setup,
data processing and analysis were all done locally as well.

When so much more data began to be collected (much of it unstructured as well as structured), cloud-based
management systems were developed to store all the data on a designated server, outside a local computer. At
the same time, data scientists began to see that most of their time was being spent on data processing rather
than analysis itself. To address this concern, modern data management systems not only store the data itself
but also perform some basic processing in the cloud. These systems, referred to as data warehousing, store
and manage large volumes of data from various sources in a central location, enabling efficient retrieval and
analysis for business intelligence and decision-making. (Data warehousing is covered in more detail in
Handling Large Datasets.)

Today, enterprises simply subscribe to a cloud-warehouse service such as Amazon Redshift (which runs on
Amazon Web Services) or Google BigQuery (which runs on Google Cloud) instead of buying physical
storage and configuring data management/processing systems on their own. These services ensure the data
is safely stored and processed in the cloud, all without spending money on purchasing/maintaining physical
storage.

1.2 Data Science in Practice


Learning Outcomes
By the end of this section, you should be able to:
• 1.2.1 Explain the interdisciplinary nature of data science in various fields.
• 1.2.2 Identify examples of data science applications in various fields.
• 1.2.3 Identify current issues and challenges in the field of data science.

While data science has adopted techniques and theories from fields such as math, statistics, and computer
science, its applications concern an expanding number of fields. In this section we introduce some examples of
how data science is used in business and finance, public policy, health care and medicine, engineering and
sciences, and sports and entertainment.


Data Science in Business


Data science plays a key role in many business operations. A variety of data related to customers, products,
and sales can be collected and generated within a business. These include customer names and lists of
products they have purchased as well as daily revenue. Business analytics investigate these data to launch new
products and to maximize the business revenue/profit.

Retail giant Walmart is known for using business analytics to improve the company’s bottom line. Walmart
collects multiple petabytes (1 petabyte = 1,024 terabytes) of unstructured data, commonly referred to as
“big data,” every hour from millions of customers; as of 2024, Walmart’s customer data included roughly 255
million weekly customer visits (Statista, 2024). Walmart uses this big data to investigate consumer patterns
and adjust its inventories. Such analysis helps the company avoid overstocking or understocking and resulted
in an estimated online sales increase of between 10% and 15%, translating to an extra $1 billion in revenue
(ProjectPro, 2015). One often-reported example includes the predictive technology Walmart used to prepare
for Hurricane Frances in 2004. A week before the hurricane’s arrival, staff were asked to look back at their data
on sales during Hurricane Charley, which hit several weeks earlier, and then come up with some forecasts
about product demand ahead of Frances (Hays, 2004). Among other insights, the executives discovered that
strawberry Pop-Tart sales increased by about sevenfold during that time. As a result, in the days before
Hurricane Frances, Walmart shipped extra supplies of strawberry Pop-Tarts to stores in the storm’s path (Hays,
2004). The analysis also provided insights on how many checkout associates to assign at different times of the
day, where to place popular products, and many other sales and product details. In addition, the company has
launched social media analytics efforts to investigate hot keywords on social media and promptly make related
products available (ProjectPro, 2015).

Amazon provides another good example. Ever since it launched its Prime membership service, Amazon has
focused on how to minimize delivery time and cost. Like Walmart, it started by analyzing consumer patterns
and was able to place products close to customers. To do so, Amazon first divided the United States into eight
geographic regions and ensured that most items were warehoused and shipped within the same region; this
allowed the company to reduce the shipping time and cost. As of 2023, more than 76% of orders were shipped
from within the customer’s region, and items in same-day shipping facilities could be made ready to put on a
delivery truck in just 11 minutes (Herrington, 2023). Amazon also utilizes machine learning algorithms to
predict the demand for items in each region and have the highest-demand items available in advance at the
fulfillment center of the corresponding region. This predictive strategy has helped Amazon reduce the delivery
time for each product and extend the item selections for two-day shipping (Herrington, 2023).

Data science is utilized extensively in finance as well. Detecting and managing fraudulent transactions is now
done by automated machine learning algorithms (IABAC, 2023). Based on the customer data and the patterns
of past fraudulent activities, an algorithm can determine whether a transaction is fraudulent in real time.
Multiple tech companies, such as IBM and Amazon Web Services, offer their own fraud detection solutions to
their corporate clients. (For more information, see this online resource on Fraud Detection through Data
Analytics (https://openstax.org/r/iabac).)

Data Science in Engineering and Science


Various fields of engineering and science also benefit from data science. The Internet of Things (IoT) is a good
example of a new technology paradigm that has benefited from data science. The IoT
describes the network of multiple objects interacting with each other through the Internet. Data science plays
a crucial role in these interactions since behaviors of the objects in a network are often triggered by data
collected by another object in the network. For example, a smart doorbell or camera allows us to see a live
stream on our phone and alerts us to any unusual activity.
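
The idea that one device's data triggers another device's behavior can be sketched as follows. This is a toy illustration, not how any real doorbell works: the motion scores, the 0.7 threshold, and the function name `doorbell_alerts` are all assumptions made for this example.

```python
def doorbell_alerts(motion_readings, threshold=0.7):
    """Return a phone alert for each reading at or above an assumed threshold."""
    alerts = []
    for timestamp, motion_score in motion_readings:
        if motion_score >= threshold:
            alerts.append(
                f"{timestamp}: unusual activity detected (score {motion_score:.2f})"
            )
    return alerts


# Hypothetical (time, motion score) pairs reported by a camera sensor.
readings = [("08:00", 0.10), ("08:05", 0.85), ("08:10", 0.30), ("08:15", 0.72)]

alerts = doorbell_alerts(readings)
for message in alerts:
    print(message)
```

Here the camera collects the data and the phone acts on it, which is the kind of data-triggered interaction the paragraph describes.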

In addition, weather forecasting has always been a data-driven task. Weather analysts collect different
measures of the weather such as temperature and humidity and then make their best estimate for the
weather in the future. Data science has made weather forecasting more reliable by adopting more
sophisticated prediction methods such as time-series analysis, artificial intelligence (AI), and machine learning
(covered in Time Series and Forecasting, Decision-Making Using Machine Learning Basics, and Deep Learning
and AI Basics). Such advancement in weather forecasting has also enabled engineers and scientists to predict
some natural disasters such as floods or wildfires and has enabled precision farming, with which agricultural
engineers can identify an optimal time window to plant, water, and harvest crops. For example, an agronomy
consultant, Ag Automation, has partnered with Hitachi to offer a solution that both automates data collection
and remotely monitors and controls irrigation for the best efficiency (Hitachi, 2023).

EXPLORING FURTHER

Using AI for Irrigation Control


See this Ag Automation video (https://openstax.org/r/youtube128) demonstrating the use of data collection
for controlling irrigation.

Data Science in Public Policy


Smart cities are among the most representative examples of the use of data science in public policy. Multiple
cities around the world, including Masdar City in the United Arab Emirates and Songdo in South Korea, have
installed thousands of data-collecting sensors used to optimize their energy consumption. The technology is
not perfect, and smart cities may not yet live up to their full potential, but many corporations and companies
are pursuing the goal of developing smart cities more broadly (Clewlow, 2024). The notion of a smart city has
also been applied on a smaller scale, such as to a parking lot, a building, or a street of lights. For example, the
city of San Diego installed thousands of sensors on the city streets to control the streetlights using data and
smart technology. The sensors measure traffic, parking occupancy, humidity, and temperature and are able to
turn on the lights only when necessary (Van Bocxlaer, 2020). New York City has adopted smart garbage bins
that monitor the amount of garbage in a bin, allowing garbage collection companies to route their collection
efforts more efficiently (Van Bocxlaer, 2020).

EXPLORING FURTHER

Sensor Networks to Monitor Energy Consumption


See how Songdo (https://openstax.org/r/youtube4) monitors energy consumption and safety with sensor
networks.

Data Science in Education


Data science also influences education. Traditional instruction, especially in higher education, has been
provided in a one-size-fits-all form, such as many students listening to a single instructor’s lecture in a
classroom. This makes it difficult for an instructor to keep track of each individual student’s learning progress.
However, many educational platforms these days are online and can produce an enormous amount of student
data, allowing instructors to investigate everyone's learning based on these collected data. For example, online
learning management systems such as Canvas (https://openstax.org/r/instructure) compile a grade book in
one place, and online textbooks such as zyBooks (https://openstax.org/r/zybooks) collect students’ mastery
level on each topic through their performance on exercises. All these data can be used to capture each
student’s progress and offer personalized learning experiences such as intelligent tutoring systems or
adaptive learning. ALEKS (https://openstax.org/r/aleks), an online adaptive learning application, offers
personalized material for each learner based on their past performance.


Data Science in Health Care and Medicine


The fields of health care and medicine also use data science. Often their goal is to offer more accurate
diagnosis and treatment using predictive analytics—statistical techniques, algorithms, and machine learning
that analyze historical data and make predictions about future events. Medical diagnosis and prescription
practices have traditionally been based on a patient’s verbal description of symptoms and a doctor’s or a
group of doctors’ experience and intuition; this new movement allows health care professionals to make
decisions that are more data-driven. Data-driven decisions became more feasible thanks to all the personal
data collected through personal gadgets—smartphones, smart watches, and smart bands. Such devices collect
daily health/activity records, and this in turn helps health care professionals better capture each patient’s
situation. All this work will enable patients to receive more accurate diagnoses along with more personalized
treatment regimens in the long run.

The Precision Medicine Initiative (https://openstax.org/r/obamawhitehouse) is a long-term research endeavor
carried out by the National Institutes of Health (NIH) and other research centers, with the goal of better
understanding how a person's genetics, environment, and lifestyle can help determine the best approach to
prevent or treat disease. The initiative aims to look at as much data as possible to care for a patient’s health
more proactively. For example, the initiative includes genome sequencing to look for certain mutations that
indicate a higher risk of cancer or other diseases.

Another application of data science in health focuses on lowering the cost of health care services. Using
historical records of patients’ symptoms and prescriptions, a chatbot that is powered by artificial intelligence
can provide automated health care advice. This will reduce the need for patients to see a pharmacist or doctor,
which in turn improves health care accessibility for those who are in greater need.

EXPLORING FURTHER

Big Data in Health Care


The 2015 launch of the National Institutes of Health Precision Medicine Initiative was documented in One
Woman’s Quest to Cure Her Fatal Brain Disease (https://openstax.org/r/youtubevk). “Promise of Precision
Medicine” signaled a new approach to health care in the United States—one heavily reliant on big data.

Data Science in Sports and Entertainment


Data science is prevalent in the sports and entertainment industries as well. Sports naturally produce a great deal of
data about players, positions, teams, seasons, and so on. Therefore, just as there is the concept of
business analytics, the analysis of such data in sports is called sports analytics. For example, the Oakland
Athletics baseball team famously analyzed player recruitment for the 2002 season. The team’s management
adopted a statistical approach referred to as sabermetrics to recruit and position players. With sabermetrics,
the team was able to identify critical yet traditionally overlooked metrics such as on-base percentage and
slugging percentage. The team, with its small budget compared to other teams, recruited a number of
undervalued players with strong scores on these metrics, and in the process, they became one of the most
exciting teams in baseball that year, breaking the American League record with 20 wins in a row. Does this story
sound familiar? This story was so dramatic that Michael Lewis wrote a book about it, which was also made into
a movie: Moneyball.
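
To make the two metrics named above concrete, the sketch below computes on-base percentage and slugging percentage from their standard baseball definitions. The season statistics are hypothetical numbers invented for illustration; they are not data from the 2002 Oakland Athletics.

```python
def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sacrifice_flies):
    """OBP = (H + BB + HBP) / (AB + BB + HBP + SF)."""
    times_on_base = hits + walks + hit_by_pitch
    opportunities = at_bats + walks + hit_by_pitch + sacrifice_flies
    return times_on_base / opportunities


def slugging_percentage(singles, doubles, triples, home_runs, at_bats):
    """SLG = total bases / at bats."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats


# Hypothetical season line for one player.
singles, doubles, triples, home_runs = 100, 30, 5, 15
hits = singles + doubles + triples + home_runs  # 150 hits in total
at_bats, walks, hit_by_pitch, sacrifice_flies = 500, 70, 5, 5

obp = on_base_percentage(hits, walks, hit_by_pitch, at_bats, sacrifice_flies)
slg = slugging_percentage(singles, doubles, triples, home_runs, at_bats)
print(f"OBP: {obp:.3f}, SLG: {slg:.3f}")
```

Unlike batting average, on-base percentage gives a player credit for walks, which is one reason it can reveal value that traditional scouting overlooks.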

EXPLORING FURTHER

The Sabermetrics YouTube Channel


Sabermetrics is so popular that there is a YouTube channel devoted to it: Simple Sabermetrics
(https://openstax.org/r/simplesabermetrics), with baseball animations and tutorials explaining how data
impacts the way today's game is played behind the scenes.

In the entertainment industry, data science is commonly used to build recommendation systems, which make
data-driven, personalized suggestions that satisfy consumers. One example of a recommendation system is on
video streaming services such as Netflix. Netflix Research considers subscribers’ watch histories, satisfaction
with the content, and interaction records (e.g., search history). Their goal is to make perfect personalized
recommendations despite some challenges, including the fact that subscribers themselves often do not
know what they want to see.

EXPLORING FURTHER

Careers in Data Science


As you advance in your data science training, consider the many professional options in this evolving field.
This helpful graphic from edX (https://openstax.org/r/edx) distinguishes data analyst vs. data science paths.
This Coursera article (https://openstax.org/r/coursera) lists typical skill sets by role. Current practitioner
discussions are available on forums such as Reddit’s r/datascience (https://openstax.org/r/reddit).

Trends and Issues in Data Science


Technology has made it possible to collect abundant amounts of data, which has led to challenges in the
processing and analyzing of that data. But technology comes to the rescue again! Data scientists now use
machine learning to better understand the data, and artificial intelligence can make an automated, data-
driven decision on a task. Decision-Making Using Machine Learning Basics and Deep Learning and AI Basics
will cover the details of machine learning and artificial intelligence.

With these advances, many people have raised concerns about ethics and privacy. Who is allowed to collect
these data, and who has access to them? None of us want someone else to use our personal data (e.g., contact
information, health records, location, photos, web search history) without our consent or without knowing the
risk of sharing our data. Machine learning algorithms and artificial intelligence are trained to make a decision
based on the past data, and when the past data itself contains some bias, the trained machine learning
algorithms and artificial intelligence will make biased decisions as well. Thus, carefully attending to the process
of collecting data and evaluating the bias of the trained results is critical. Ethics Throughout the Data Science
Cycle will discuss these and other ethical concerns and privacy issues in more depth.

1.3 Data and Datasets


Learning Outcomes
By the end of this section, you should be able to:
• 1.3.1 Define data and dataset.
• 1.3.2 Differentiate among the various data types used in data science.
• 1.3.3 Identify the type of data used in a dataset.
• 1.3.4 Discuss an item and attribute of a dataset.
• 1.3.5 Identify the different data formats and structures used in data science.


What Is Data Science? and Data Science in Practice introduced the many varieties of and uses for data science
in today’s world. Data science allows us to extract insights and knowledge from data, driving decision-making
and innovation in business, health care, entertainment, and so on. As we’ve seen, the field has roots in math,
statistics, and computer science, but it only began to emerge as its own distinct field in the early 2000s with
the proliferation of digital data and advances in computing power and technology. It gained significant
momentum and recognition around the mid to late 2000s with the rise of big data and the need for
sophisticated techniques to analyze and derive insights from large and complex datasets. Its evolution since
then has been rapid, and as we can see from the previous discussion, it is quickly becoming a cornerstone of
many industries and domains.

Data, however, is not new! Humans have been collecting data and generating datasets since prehistoric
times. This started in the Stone Age when people carved shapes and pictures, called petroglyphs, on rock.
The petroglyphs provide insights on how animals looked and how people carried out their daily lives, which is
valuable “data” for us. Ancient Egyptians invented an early form of paper, papyrus, in order to record their
data. Papyrus also made it easier to store data in bulk, such as listing inventories, noting financial transactions,
and recording a story for future generations.

Data
“Data” is the plural of the Latin word “datum,” which translates as something that is given or used and is often
used to mean a single piece of information or a single point of reference in a dataset. When you hear the word
“data,” you may think of some sort of “numbers.” It is true that numbers are usually considered data, but there
are many other forms of data all around us. Anything that we can analyze to compile information—high-level
insights—is considered data.

Suppose you are debating whether to take a certain course next semester. What process do you go through in
order to make your decision? First, you might check the course evaluations, as shown in Table 1.1.

Semester      Instructor   Class Size   Rating
Fall 2020     A            100          Not recommended at all
Spring 2021   A            50           Highly recommended
Fall 2021     B            120          Not quite recommended
Spring 2022   B            40           Highly recommended
Fall 2022     A            110          Recommended
Spring 2023   B            50           Highly recommended

Table 1.1 Course Evaluation Records

The evaluation record consists of four kinds of data, and they are grouped as columns of the table: Semester,
Instructor, Class Size, and Rating. Within each column there are six pieces of data, one in each
row. For example, there are six pieces of text data under the Semester column: “Fall 2020,” “Spring 2021,” “Fall
2021,” “Spring 2022,” “Fall 2022,” and “Spring 2023.”

The course evaluation ratings themselves do not tell you whether to take the course next semester.
The ratings are just phrases (e.g., “Highly recommended” or “Not quite recommended”) that encode how
recommended the course was in that semester. You need to analyze them to come up with a decision!

Now let’s think about how to derive the information from these ratings. You would probably look at all the
data, including the semester in which the course was offered, who the instructor was, and the class size. These
records would allow you to derive information that would help you decide “whether or not to take the course
next semester.”
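
One possible way to turn the records in Table 1.1 into information is sketched below. The table values are transcribed from the text, but the numeric encoding of the ratings (0 through 3) and the helper name `average_score_by` are assumptions introduced only for this illustration.

```python
from collections import defaultdict

# Table 1.1 transcribed as a list of records.
evaluations = [
    {"semester": "Fall 2020", "instructor": "A", "class_size": 100,
     "rating": "Not recommended at all"},
    {"semester": "Spring 2021", "instructor": "A", "class_size": 50,
     "rating": "Highly recommended"},
    {"semester": "Fall 2021", "instructor": "B", "class_size": 120,
     "rating": "Not quite recommended"},
    {"semester": "Spring 2022", "instructor": "B", "class_size": 40,
     "rating": "Highly recommended"},
    {"semester": "Fall 2022", "instructor": "A", "class_size": 110,
     "rating": "Recommended"},
    {"semester": "Spring 2023", "instructor": "B", "class_size": 50,
     "rating": "Highly recommended"},
]

# Assumed numeric encoding of the rating phrases (not part of the original table).
RATING_SCORE = {
    "Not recommended at all": 0,
    "Not quite recommended": 1,
    "Recommended": 2,
    "Highly recommended": 3,
}


def average_score_by(records, key):
    """Average the encoded rating for each group returned by key(record)."""
    grouped = defaultdict(list)
    for record in records:
        grouped[key(record)].append(RATING_SCORE[record["rating"]])
    return {group: sum(scores) / len(scores) for group, scores in grouped.items()}


by_instructor = average_score_by(evaluations, lambda r: r["instructor"])
by_term = average_score_by(evaluations, lambda r: r["semester"].split()[0])
print(by_instructor)
print(by_term)
```

Under this encoding, spring offerings score higher than fall offerings, and those spring sections are also the small ones (50 students or fewer). With only six records this is a hint to investigate further, not a conclusion.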

EXAMPLE 1.1

Problem

Suppose you want to decide whether or not to put on a jacket today. You research the highest
temperatures in the past five days and determine whether you needed a jacket on each day. In this
scenario, what data are you using? And what information are you trying to derive?

Solution

Temperature readings and whether you needed a jacket on each of the past five days are the two kinds of data
you are using. By themselves, they do not tell you whether to wear a jacket today. They are simply
five pairs of records, a number (temperature) and a yes/no value (whether you needed a jacket), with each pair
representing a day. By analyzing these data, you derive information that helps you decide
whether to wear a jacket today.

Types of Data
The previous sections talked about how much our daily life is surrounded by data, how much our daily life
itself produces new data, and how often we make data-driven decisions without even noticing it. You might
have noticed that data comes in various types. Some data are quantitative, which means that they are
measured and expressed using numbers. Quantitative data deals with quantities and amounts and is usually
analyzed using statistical methods. Examples include numerical measurements like height, weight,
temperature, heart rate, and sales figures. Qualitative data are non-numerical data that generally describe
subjective attributes or characteristics and are analyzed using methods such as thematic analysis or content
analysis. Examples include descriptions, observations, interviews, and open-ended survey responses (as we’ll
see in Survey Design and Implementation) that address unquantifiable details (e.g., photos, posts on Reddit).
The data types often dictate methods for data analysis, so it is important to be able to identify the type of
data you are working with. This section therefore takes a closer look at types of data.

Let’s revisit our previous example about deciding whether to take a certain course next semester. In that
example, we referred to four kinds of data. They are encoded in different forms such as numbers, words, and
symbols.

1. The semester the course was offered—Fall 2020, Spring 2021, …, Fall 2022, Spring 2023
2. The instructor—A and B
3. The class size—100, 50, 120, 40, 110, 50
4. The course rating—“Not recommended at all,” …, “Highly recommended”

There are two primary types of data, numeric and categorical, and each of these can be divided
into a few subtypes. Numeric data is represented in numbers that indicate measurable quantities. It may be
followed by some symbols to indicate units. Numeric data is further divided into continuous data and discrete
data. With continuous data, a value can be any number within a range, including fractional values, so it is
chosen from an infinite set of numbers with no gaps between them. With discrete data, the values are separate,
countable numbers such as whole-number counts, with no possible values in between.

From the previous example, the class sizes 100, 50, etc. are numbers with the implied unit “students.” Also,
they indicate measurable quantities as they are head counts. Therefore, the class size is numeric data. It is
discrete data since a head count can only be a whole number: a class can have 50 or 51 students, but not 50.5.
This holds whether or not the set of possible values is finite. For example, if the campus limits all
classes to 200 seats or fewer, the class size values are chosen from a finite set of 200
numbers: 1, 2, 3, …, 198, 199, 200. Note that how data is treated can also depend on the context: analysts
sometimes handle discrete data with many possible values as if it were continuous.
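
Because class size is numeric data, arithmetic on it is meaningful. The short sketch below uses the class sizes from Table 1.1 and only Python's standard library to compute a few summaries; the variable names are illustrative.

```python
from statistics import mean

# Class sizes from Table 1.1 (implied unit: students).
class_sizes = [100, 50, 120, 40, 110, 50]

total_students = sum(class_sizes)   # total head count across the six semesters
average_size = mean(class_sizes)    # average class size
smallest, largest = min(class_sizes), max(class_sizes)

print(total_students, round(average_size, 1), smallest, largest)
```

The same operations would not make sense for the Instructor or Rating columns, which is one practical reason to identify the type of data before analyzing it.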

Categorical data is represented in different forms such as words, symbols, and even numbers. A categorical
value is chosen from a finite set of values, and the value does not necessarily indicate a measurable quantity.
Categorical data can be divided into nominal data and ordinal data. For nominal data, the set of possible
values does not include any ordering notion, whereas with ordinal data, the set of possible values includes an
ordering notion.

The rest (semester, instructor, and rating) are categorical data. They are represented in symbols (e.g., “Fall
2020,” “A”) or words (e.g., “Highly recommended”), and these values are chosen from a finite set of those
symbols and words (e.g., A vs. B). The semester and instructor are nominal since they do not
have an order to follow, while the rating is ordinal since there is a notion of degree (from “Not recommended
at all” to “Highly recommended”). You may argue that the semester could have chronological ordering: Fall 2020
comes before Spring 2021, and Fall 2021 follows Fall 2020. If that ordering matters for your analysis, you could
consider the semester data to be ordinal as well. Chronological ordering is indeed critical when you are
looking at a time-series dataset. You will learn more about that in Time Series and Forecasting.
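
The difference between nominal and ordinal data shows up as soon as you try to sort the values. The sketch below assumes that the four rating phrases rank from “Not recommended at all” (lowest) to “Highly recommended” (highest), with “Not quite recommended” below “Recommended”; that middle ordering is an interpretation of the labels, and the names RATING_ORDER and rating_rank are hypothetical.

```python
from collections import Counter

# Assumed ranking of the rating labels, lowest to highest.
RATING_ORDER = [
    "Not recommended at all",
    "Not quite recommended",
    "Recommended",
    "Highly recommended",
]

def rating_rank(rating):
    """Return the position of a rating label in the assumed ordering."""
    return RATING_ORDER.index(rating)

# Ratings and instructors from Table 1.1, in row order.
ratings = [
    "Not recommended at all", "Highly recommended", "Not quite recommended",
    "Highly recommended", "Recommended", "Highly recommended",
]
instructors = ["A", "A", "B", "B", "A", "B"]

# Ordinal data: sorting and taking the maximum by rank are meaningful.
sorted_ratings = sorted(ratings, key=rating_rank)
best_rating = max(ratings, key=rating_rank)

# Nominal data: only counting and equality checks are meaningful.
instructor_counts = Counter(instructors)

print(sorted_ratings)
print(best_rating, dict(instructor_counts))
```

Treating the semester as ordinal would follow the same pattern, with a list of semesters in chronological order playing the role of RATING_ORDER.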

EXAMPLE 1.2

Problem

Consider the jacket scenario in Example 1.1. In that example, we referred to two kinds of data:

1. The highest temperature on each of the past five days: 90°F, 85°F, …
2. On each of those days, whether you needed a jacket: Yes, No, ...

What is the type of each kind of data?

Solution

The temperatures are numbers followed by the unit degrees Fahrenheit (°F). Also, they indicate measurable
quantities as they are specific readings from a thermometer. Therefore, the temperature is numeric data.
They are also continuous data since they can be any real number, and the set of real numbers is infinite.

The other kind of data, whether or not you needed a jacket, is categorical data. The values are
represented in symbols (Yes/No) and are chosen from the finite set of those symbols. They are
also nominal since Yes/No does not have an ordering to follow.

Datasets
A dataset is a collection of observations or data entities organized for analysis and interpretation, as shown in
Table 1.1. Many datasets can be represented as a table where each row indicates a unique data entity and each
column defines the structure of the entities.

Notice that the dataset we used in Table 1.1 has six entities (also referred to as items, entries, or instances),
distinguished by semester. Each entity is defined by a combination of four attributes or characteristics (also
known as features or variables)—Semester, Instructor, Class Size, and Rating. A combination of features
characterizes an entry of a dataset.

Although the actual values of the attributes are different across entities, note that all entities have values for
the same four attributes, which makes this a structured dataset. As a structured dataset, these items can be
listed as a table where each item occupies one row.

By contrast, an unstructured dataset is one that lacks a predefined or organized data model. While
structured datasets are organized in a tabular format with clearly defined fields and relationships,
unstructured data lacks a fixed schema. Unstructured data is often in the form of text, images, videos, audio
recordings, or other content where the information doesn't fit neatly into rows and columns.

There are plenty of unstructured datasets. Indeed, some people argue there are more unstructured datasets
than structured ones. A few examples include Amazon reviews on a set of products, Twitter posts from the last
year, public images on Instagram, and popular short videos on TikTok. These unstructured datasets are often
processed into structured ones so that data scientists can analyze the data. We’ll discuss different data
processing techniques in Collecting and Preparing Data.
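
As a rough illustration of turning unstructured data into a structured dataset, the sketch below takes a few made-up review texts (hypothetical, not real Amazon reviews) and computes the same two attributes for each one. The choice of attributes and the function name to_structured_row are assumptions made for this example only.

```python
# Hypothetical free-text reviews, invented purely for illustration.
reviews = [
    "Great value, arrived on time!",
    "Stopped working after two days. Would not recommend.",
    "Okay",
]

def to_structured_row(text):
    """Describe one free-text review with a fixed set of attributes."""
    return {
        "num_characters": len(text),
        "num_words": len(text.split()),
    }

# Every row now has the same attributes, so the result fits a table.
rows = [to_structured_row(review) for review in reviews]

for row in rows:
    print(row)
```

The original texts have no fixed fields, but each resulting row has exactly the same attributes, which is what makes the processed dataset structured.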

EXAMPLE 1.3

Problem

Let’s revisit the jacket example: deciding whether to wear a jacket to class. Suppose the dataset looks as
provided in Table 1.2:

Date      Temperature   Needed a Jacket?
Oct. 10   80°F          No
Oct. 11   60°F          Yes
Oct. 12   65°F          Yes
Oct. 13   75°F          No

Table 1.2 Jacket Dataset

Is this dataset structured or unstructured?

Solution

It is a structured dataset since 1) every individual item is in the same structure with the same three
attributes—Date, Temperature, and Needed a Jacket—and 2) each value strictly fits into a cell of a table.

EXAMPLE 1.4

Problem

How many entries and attributes does the dataset in the previous example have?

Solution

The dataset has four entries, each of which is identified with a specific date (Oct. 10, Oct. 11, Oct. 12, Oct.
13). The dataset has three attributes—Date, Temperature, Needed a Jacket.
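
The checks in Examples 1.3 and 1.4 can also be sketched in Python. The code below stores Table 1.2 as a list of dictionaries, one per entry, and uses only built-in functions; the variable names are illustrative, and the temperatures are stored as plain numbers in degrees Fahrenheit.

```python
# Jacket dataset from Table 1.2: one dictionary per entry (row).
jacket_data = [
    {"Date": "Oct. 10", "Temperature": 80, "Needed a Jacket?": "No"},
    {"Date": "Oct. 11", "Temperature": 60, "Needed a Jacket?": "Yes"},
    {"Date": "Oct. 12", "Temperature": 65, "Needed a Jacket?": "Yes"},
    {"Date": "Oct. 13", "Temperature": 75, "Needed a Jacket?": "No"},
]

# The attributes are the keys shared by the entries.
attributes = list(jacket_data[0].keys())

# The dataset is structured if every entry has exactly the same attributes.
is_structured = all(set(entry) == set(attributes) for entry in jacket_data)

num_entries = len(jacket_data)
num_attributes = len(attributes)

print(num_entries, num_attributes, is_structured)
```

If one entry were missing a key or carried an extra one, is_structured would come out False, signaling that the data would need cleaning before it fits a table.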
