PPOL670 | Introduction to Data Science for Public Policy
Work Flow & Reproducibility
Data Science CS3072
Fall 2023
What is data science?
The Aim of Data Science
Generate
Valid (scrutiny, discussion & limitations)
Unbiased (introspection, diversity & substantive knowledge)
Reproducible (data provenance, code transparency & version control)
Compelling (interpretable, intuitive & clear)
insights using data to influence and inform decision-making.
This course focuses on
Tools: building a fundamental data science toolkit using R
Become a competent user of R
Promote self-learning: don't just pass this course, make this part
of your life.
Master the art of managing errors
Data Management: getting, tidying, and managing data
Cleaning Dirty Data
Structuring unstructured data types (like text)
Scraping
Incorporating a "Never touch the data" philosophy
Analytic Approaches: drawing insights
Exploring, Modeling, and Prediction
Course Calendar
Week Topics
1 Work Flow and Reproducibility
2 Introduction to Programming in R
3 Reproducibility in Practice
4 Data Wrangling in R
5 Data Visualization
6 Web Scraping and APIs
7 Geospatial Data
8 Text as Data
9 Introduction to Statistical Learning
10 Applications in Supervised Learning (Regression)
11 Applications in Supervised Learning (Classification)
12 Interpretable Machine Learning
13 Applications in Unsupervised Learning
14 Project Presentations
Who is the course for?
Anyone who (a) finds this stuff fascinating, (b) wants to understand
how to fold data into their decision-making process, and (c) wants to
build up their data skills.
New R users; no programming experience assumed.
This course is a survey of data science approaches using R. You'll
walk out of this class being able to do a lot, but it is just a starting
point.
What is this course not?
A machine learning course (we dabble, but don't delve into any one
concept in depth).
A "big data" course: we won't delve into database structures, e.g. SQL
and Hadoop, or cloud computing, e.g. Azure and AWS.
Reproducibility
We focus on things like this...
And forget the reality that is this...
Reproducibility is fundamental to the scientific
method, but it is also a practical reality.
juggling multiple versions of the same file
collaboration can create conflicts across versions
projects are picked up and put down → tracing the progression of a
project across a spiderweb of files is not always easy (or possible)
new people enter the fray → getting them up-to-speed means walking
them through the labyrinth, which wastes time and resources.
Generating Reproducible Work
1. Readable
2. Portable
3. Well-Named
4. Repeatable
5. Version Control
Readable
x <- rnorm(100)
y <- 1 + 2*x + rnorm(100)
plot(x, y)
vs.
# Monte Carlo simulation of a bivariate linear regression
sample_size <- 100 # simulated sample size
indep_var <- rnorm(sample_size) # independent variable
error <- rnorm(sample_size) # simulated error
# Generate the dependent variable as a function of the
# independent variable and some error.
dep_var <- 1 + 2*indep_var + error
# Plot the dependent variable (vertical axis) against the
# independent variable (horizontal axis)
plot(indep_var, dep_var)
Portable
Project can easily travel across computers
e.g. R Project (.Rproj), packrat, and renv
Scripts avoid machine-specific designations
Avoid specific file paths: /Users/my-user-name/data-projects/my-project
Retain software and package versions (e.g. R's packrat package)
Use text files
Not software dependent (unlike e.g. .docx, .ia); can be opened on any system
Can be easily searched via the command line
Easy to track changes via version control
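A minimal sketch of the "avoid specific file paths" advice, using only base R. The folder and file names (raw_data, example.csv) are hypothetical, and a temporary directory stands in for the project root so the snippet runs on any machine.

```r
# Stand-in for the project root (hypothetical; in a real project this would
# be the folder that holds the .Rproj file).
project_root <- file.path(tempdir(), "my-project")
dir.create(file.path(project_root, "raw_data"),
           recursive = TRUE, showWarnings = FALSE)

# Build the path from pieces instead of hard-coding "/Users/my-user-name/...".
# file.path() joins the pieces with "/", which R accepts on every platform.
data_path <- file.path(project_root, "raw_data", "example.csv")

# Write and re-read a small data set using the constructed path.
example_data <- data.frame(id = 1:3, value = c(10, 20, 30))
write.csv(example_data, data_path, row.names = FALSE)
reloaded <- read.csv(data_path)
```

Inside an R Project, the working directory is the project folder, so a path such as file.path("raw_data", "example.csv") would resolve the same way on a collaborator's computer.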
Well-named
Maintain designated folders for different aspects of the project.
data-project
├── raw_data/ # Where our input data lives
├── output_data/ # Where our manipulated data lives
├── R/ # Where our R functions live
├── figures/ # Where our generated figures live
├── reports/ # Where our text-based report output lives
└── analysis/ # Where the code for our analyses lives
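One way to set up this layout from code rather than by hand, sketched in base R. The folder names come from the layout above; the project root is a hypothetical location inside a temporary directory so that running the example leaves no trace.

```r
# Folder names taken from the layout above.
project_folders <- c("raw_data", "output_data", "R",
                     "figures", "reports", "analysis")

# Hypothetical project root; a temporary directory keeps the example harmless.
project_root <- file.path(tempdir(), "data-project")

# Create each folder (and the root) if it does not exist yet.
for (folder in project_folders) {
  dir.create(file.path(project_root, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# List the folders that now exist under the project root.
created <- list.dirs(project_root, full.names = FALSE, recursive = FALSE)
print(sort(created))
```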
No spaces!
A space between designations can mean many things
spaces are ambiguous for the computer
data analysis 2.Rmd
↓
data-analysis-2.Rmd
Names that state the purpose of the file (no
matter how long).
data-analysis-2.Rmd
↓
Analysis01_wrangling-census-data-for-visualization_v2.Rmd
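The "no spaces" rule can also be applied in code. The helper below, make_safe_name(), is a hypothetical illustration (not a function from any package) that swaps whitespace for hyphens using base R's gsub().

```r
# Hypothetical helper: replace each run of whitespace in a file name
# with a single hyphen, after trimming leading/trailing whitespace.
make_safe_name <- function(file_name) {
  gsub("\\s+", "-", trimws(file_name))
}

safe_name <- make_safe_name("data analysis 2.Rmd")
print(safe_name)  # "data-analysis-2.Rmd"
```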
Repeatable
Every step of the project expressed as code
Automate what you can
Use functions to repeat common tasks
Clearly state all dependencies (i.e. packages/modules) at the top
of every script
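# Packages at the top
# (library() stops with an error if a package is missing,
# whereas require() only warns and carries on)
library(tidyverse)
library(sf)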
# Then code
...
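A small sketch of "use functions to repeat common tasks", in base R. The function standardize() and the simulated survey data are hypothetical, chosen only for illustration. Fixing the random seed is an added assumption, not something the list above states; it makes the simulated values identical on every run.

```r
# Hypothetical helper: rescale a numeric vector to mean 0 and
# standard deviation 1.
standardize <- function(values) {
  (values - mean(values, na.rm = TRUE)) / sd(values, na.rm = TRUE)
}

# Fix the random seed so the simulated data are the same on every run.
set.seed(123)
survey <- data.frame(
  income = rnorm(50, mean = 50000, sd = 10000),
  age    = rnorm(50, mean = 40, sd = 12)
)

# Apply the same function to every column instead of copy-pasting
# the formula once per variable.
survey_std <- as.data.frame(lapply(survey, standardize))
```

Writing the transformation once means a later correction (say, a different treatment of missing values) is made in a single place.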
Version Control
Retain a record of all changes made throughout the project's
lifespan
Easily handle collaboration:
track who did what
a uniform method for dealing with conflicting changes
Provides room for experimentation and non-linear exploration
No more version numbers in file names!
Interacting with R
R in a Nutshell
R is a statistical and graphical programming language based on a much
older language called S. Its source code is written in C, Fortran, and
R, and it is completely free under a GNU General Public License.
What this means for us:
No Barriers to Entry → easy to acquire, easy to contribute
Active Community → if you can think it, there is likely a
package out there that does it.
Powerful and Adaptive → build an estimator from scratch, scrape a
website, automate the coding of a dataset. All is within one's reach.
Why use R?
R offers a powerful way to
analyze data
clean Excel spreadsheets (and any other data format) systematically
migrate projects across platforms
format and clean text
manage any data source
produce compelling graphics and maps
R Studio
R Studio is a graphical user interface (GUI) for the R programming
language. The software makes R more user-friendly by adding some
point-and-click functionality along with a complete integration of
graphs, the data environment, and the coding script.
Think of it like this...
R is the engine that runs all our commands, and R Studio is the
leather seats and steering wheel. One does the work, the other
eases how that work is done.
Installing R and R Studio
To install R, download R from CRAN via the following:
Windows: https://cran.r-project.org/bin/windows/base/
Mac: https://cran.r-project.org/bin/macosx/
To install R Studio, download from the following:
https://www.rstudio.com/products/rstudio/download/
Useful video tutorials:
Prior TA's walkthrough
Rstudio's Walkthrough:
Install R
Install RStudio
Getting Familiar with R Studio
R Studio is broken up into 4 quadrants that can be arranged and
customized to the user's preference.
The Console
The console is where all the action happens. This is "R".
All commands are processed through the console directly (that is, one can
type commands directly into it) or via a script.
Scripts
A script is a .R text file where we write and run our code.
An R Markdown script is a .Rmd text file where we write prose and
run our code together (more on this in Week 3).
When we write a line of code, we can run it in the console by highlighting
the text and...
clicking run
pressing command + enter (Mac)
pressing control + enter (Windows)
Everything in a script will be treated as code: if you run a line, it
will be processed through the console.
However, we can leave comments and notes to ourselves by
commenting out sections of the script using a #
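A short illustration of how # behaves in a script. The variable names are hypothetical; the point is only that R skips everything after a # on a line.

```r
# Lines that start with # are ignored when the script is run.
hours_per_week <- 40   # a comment can also follow code on the same line

# The next line is "commented out", so R will not run it:
# hours_per_week <- 60

weeks_per_year <- 52
hours_per_year <- hours_per_week * weeks_per_year
print(hours_per_year)  # 2080
```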