STAT121 / AC209 / E-109
CS109 Data Science
Hanspeter Pfister
pfister@seas.harvard.edu
Joe Blitzstein
blitzstein@stat.harvard.edu
Outline
What?
Why?
Who?
How?
Data Science
To gain insights into data through
computation, statistics, and visualization
A Data Scientist Is...
A data scientist is someone who knows more
statistics than a computer scientist and more
computer science than a statistician.
- Josh Blumenstock
Data Scientist = statistician + programmer +
coach + storyteller + artist
- Shlomo Argamon
Nate Silver
Nate Silver won the election
Harvard Business Review
#natesilverfacts
http://techcrunch.com/2012/11/07/nate-silver-as-software/
Nate Silver on Pundits
Silver: Pundits are no
better than a coin toss.
Stewart: Do you foresee a
coin getting its own show?
The coin toss show?
http://www.thedailyshow.com/watch/wed-october-17-2012/nate-silver
Some Key Principles
use many data sources (the plural of anecdote is not data)
understand how the data were collected (sampling is essential)
weight the data thoughtfully (not all polls are equally good)
use statistical models (not just hacking around in Excel)
understand correlations (e.g., states that trend similarly)
think like a Bayesian, check like a frequentist (reconciliation)
have good communication skills (What does a 60%
probability even mean? How can we visualize, validate, and
understand the conclusions?)
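As a rough illustration of "weight the data thoughtfully", the sketch below combines several polls into one estimate, weighting each by sample size times a quality score. The poll numbers, the quality scores, and the weighting rule itself are all invented for illustration; this is not a description of how any real forecaster weights polls.

```python
# Hypothetical polls: (candidate share, sample size, quality score in [0, 1]).
# Every number here is made up for illustration.
polls = [
    (0.52, 800, 1.0),
    (0.49, 400, 0.5),
    (0.51, 1200, 0.8),
]


def unweighted_average(polls):
    """Plain mean of the poll shares, treating all polls as equally good."""
    return sum(share for share, _, _ in polls) / len(polls)


def weighted_average(polls):
    """Mean of the poll shares, weighted by sample size times quality."""
    weights = [n * quality for _, n, quality in polls]
    weighted_sum = sum(w * share for w, (share, _, _) in zip(weights, polls))
    return weighted_sum / sum(weights)


plain = unweighted_average(polls)
weighted = weighted_average(polls)
print(f"unweighted: {plain:.4f}, weighted: {weighted:.4f}")
```

The small, low-quality poll pulls the unweighted mean down more than it pulls the weighted one, which is the point of the principle.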
Human Genome
Microarrays
Affymetrix Chip
[wikipedia]
Sequencing
Sequencing Cost
Genome Data
Genome Visualization
[Krzywinski 2009]
[Thorvaldsdóttir 2013]
[Meyer 2009]
Personalized Therapy
...10 years from now, each cancer
patient is going to want to get a genomic
analysis of their cancer and will expect
customized therapy based on that
information.
Director, The Cancer Genome Atlas
(TCGA), Time Magazine, 6/13/11
Netflix Prize
Some Challenges
massive data (500k users, 20k movies, 100m ratings)
curse of dimensionality (very high-dimensional
problem)
missing data (99% of data missing; not missing at
random)
extremely complicated set of factors that affect people's
ratings of movies (actors, directors, genre, ...)
need to avoid overfitting (test data vs. training data)
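The numbers in the list above can be checked with a few lines of arithmetic, and the test-versus-training idea can be sketched with a simple random holdout. The toy ratings and the 80/20 split are assumptions made for illustration; a purely random split also does nothing about the "not missing at random" problem.

```python
import random

# Figures quoted on the slide.
n_users = 500_000
n_movies = 20_000
n_ratings = 100_000_000

# Every (user, movie) pair is a potential rating.
possible = n_users * n_movies
observed_fraction = n_ratings / possible
missing_fraction = 1 - observed_fraction
print(f"observed: {observed_fraction:.0%}, missing: {missing_fraction:.0%}")


def train_test_split(ratings, test_fraction=0.2, seed=0):
    """Shuffle a copy of the ratings and hold out a fraction for testing."""
    rng = random.Random(seed)
    shuffled = ratings[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy (user, movie, stars) triples, invented for illustration.
toy_ratings = [(u, m, (u + m) % 5 + 1) for u in range(10) for m in range(5)]
train, test = train_test_split(toy_ratings)
print(len(train), "training ratings,", len(test), "held-out ratings")
```

A model is fit on `train` only; its error on `test` estimates how well it generalizes rather than how well it memorized.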
Netflix Prize Progress
http://blogs.hbr.org/cs/2012/10/big_data_hype_and_reality.html
Connectome
What is the connectivity of large brain circuits?
Ramón y Cajal, 1905
Connectome Workflow
Ultra-Thin Section EM
Automatic Reconstruction
2D Segmentation
Combine Multiple 2D Segmentations with Fusion
Globally Consistent 3D Segmentation
[Kaynig et al., CVPR 10]
[Vazquez et al., ICCV 2011]
2012
Data Science
Computer Science
Statistics
Domain Science
Drew Conway
Machine: Data Management, Data Mining, Machine Learning, Business Intelligence
Human: Human Cognition, Perception, Story Telling, Decision Making
Visualization
Theory
Statistics
Data Science
Inspired by Daniel Keim, Visual Analytics: Definition,
Process, and Challenges
Outline
What?
Why?
Who?
How?
The Age of Big Data
BBC, 2013
Crime Prevention
Boston Globe,
Sunday, Aug 4, 2013
Big Data
2.5 exabytes of data daily (2012)
[IBMbigdata]
[Domo]
Between the dawn of civilization and
2003, we only created five exabytes of
information; now we're creating that
amount every two days.
Eric Schmidt, Google (and others)
http://onesecond.designly.com/
Smarter Devices
Michael Franklin, UC Berkeley
Commodity Computing
Michael Franklin, UC Berkeley
Ubiquitous Connectivity
Michael Franklin, UC Berkeley
travers808, Visual.ly
1 Zettabyte = 1 Billion Terabytes
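The size figures quoted in this section (2.5 exabytes per day, five exabytes every two days, one zettabyte as a billion terabytes) can be cross-checked with a little arithmetic. The sketch assumes decimal SI units (1 TB = 10^12 bytes), which the slides do not state explicitly.

```python
# Decimal (SI) byte units; an assumption, since the slides do not say
# whether decimal or binary prefixes are meant.
TB = 10**12
EB = 10**18
ZB = 10**21

# "1 Zettabyte = 1 Billion Terabytes"
terabytes_per_zettabyte = ZB // TB

# "five exabytes ... every two days" versus "2.5 exabytes daily"
exabytes_per_day = 5 / 2

# Days needed to produce one zettabyte at 2.5 exabytes per day.
days_per_zettabyte = ZB / (exabytes_per_day * EB)

print(terabytes_per_zettabyte)  # 1000000000
print(exabytes_per_day)         # 2.5
print(days_per_zettabyte)       # 400.0
```

So the two rate quotes agree with each other, and at that rate a zettabyte accumulates in a little over a year.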
Jim Gray, Microsoft
By 2018, the US could face a shortage
of up to 190,000 workers with analytical
skills
McKinsey Global Institute
The sexy job in the next 10 years will
be statisticians. Data Scientists?
Hal Varian, Prof. Emeritus UC Berkeley
Chief Economist, Google
Hal Varian Explains...
The ability to take data, to be able to
understand it, to process it, to
extract value from it, to visualize it, to
communicate it, that's going to be a hugely
important skill in the next decades, not
only at the professional level but even at
the educational level for elementary school
kids, for high school kids, for college kids.
Because now we really do have essentially
free and ubiquitous data.
Hal Varian
Ask an interesting question.
  What is the scientific goal?
  What would you do if you had all the data?
  What do you want to predict or estimate?
Get the data.
  How were the data sampled?
  Which data are relevant?
  Are there privacy issues?
Explore the data.
  Plot the data.
  Are there anomalies?
  Are there patterns?
Model the data.
  Build a model.
  Fit the model.
  Validate the model.
Communicate and visualize the results.
  What did we learn?
  Do the results make sense?
  Can we tell a story?
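The five steps above can be walked through end to end on a toy problem. The sketch below is only one possible reading of the process: the data are synthetic (a noisy line with made-up slope 2 and intercept 1), the model is an ordinary least-squares line fit by hand, and validation is a single held-out split. All names and numbers are hypothetical.

```python
import random
import statistics

# 1. Ask an interesting question (hypothetical): can we predict y from x?

# 2. Get the data. Here it is synthetic: y = 2x + 1 plus Gaussian noise.
rng = random.Random(109)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1) for x in xs]

# 3. Explore the data: summary statistics and a crude anomaly check.
mean_y = statistics.mean(ys)
sd_y = statistics.stdev(ys)
anomalies = [y for y in ys if abs(y - mean_y) > 3 * sd_y]


# 4. Model the data: build, fit, and validate a straight-line model.
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def rmse(xs, ys, slope, intercept):
    """Root-mean-square prediction error of the fitted line."""
    errors = [(y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)]
    return (sum(errors) / len(errors)) ** 0.5


train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]
slope, intercept = fit_line(train_x, train_y)
test_rmse = rmse(test_x, test_y, slope, intercept)

# 5. Communicate the results.
print(f"{len(anomalies)} points more than 3 standard deviations from the mean")
print(f"fitted y = {slope:.2f} x + {intercept:.2f}, held-out RMSE = {test_rmse:.2f}")
```

In practice the loop is rarely linear: a surprising plot in step 3 or a poor validation score in step 4 often sends you back to step 1 or 2.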
Outline
What?
Why?
Who?
How?
Hanspeter Pfister
An Wang
My Background
Grew up in Switzerland
M.Sc. in EE from ETH Zurich
Ph.D. in CS from SUNY Stony Brook
11 years in industry (MERL)
At Harvard since 2007, Visual Computing Group (4 Ph.D., 7 PD)
Teach CS109 / CS171, taught CS175 / CS264 / CS205
Director of the Institute of Applied Computational Science (IACS)
Two daughters, Lilly (10) and Audrey (7)
Joe Blitzstein
Professor of the Practice in Statistics,
Co-Director of Undergraduate Studies in Statistics
blitz@fas.harvard.edu, twitter @stat110, SC 714
CS109 Staff
Chris Beaumont, Head TF
Johanna Beyer
Nicolas Bonneel
Alex D'Amour
Rahul Dave
Brandon Haynes
Ray Jones
Steffen Kirchhoff
Seymour Knowles-Barley
Alexander Lex
Deqing Sun
Tim Brenner, A/V
About You
Outline
What?
Why?
Who?
How?
CS109 Key Facets
data munging/scraping/sampling/cleaning in order to get an
informative, manageable data set;
data storage and management in order to be able to access
data - especially big data - quickly and reliably during
subsequent analysis;
exploratory data analysis to generate hypotheses and
intuition about the data;
prediction based on statistical tools such as regression,
classification, and clustering; and
communication of results through visualization, stories, and
interpretable summaries.
Act I: Predictions
Data Science Process
Data Types and Data Munging
Probability Review
Classification & Regression
Cross Validation, Clustering
Visualization & Story Telling
Act II: Recommendations
Bayesian Thinking & Computation
Monte Carlo Methods
Machine Learning Methods
MapReduce and Amazon's EC2
Databases (Margo Seltzer)
Act III: Network Analysis
Network Visualization
Network Sampling
Community Detection
Guest Lecture
Abstractions...
...and Tools
xkcd
Homework
Real-World focus
Scrape and wrangle messy data
Apply sophisticated statistical analysis
Visualize and communicate results
Election data, movie reviews, Yelp! data, etc.
Final Project
Pick a project of your choosing
Teams of up to 2 students
Process books, web sites, screencasts
IPython (exceptions possible)
Best project prizes!
cs109.org
Is this course for me?
Prerequisites
Programming experience
C, C++, Java, Python, etc.
Basic statistical knowledge
STAT100, ideally STAT110
Willingness to learn new software & tools
This can be time consuming
You will need to read online documentation
Be Patient
Be Flexible
Be Constructive
http://davidzinger.wordpress.com/2007/05/page/2/
Next Steps
HW 0
Good test of your basic skills
Installation of several Python frameworks
Not graded, do it as soon as possible
Read syllabus carefully
Do readings
Post comments to Piazza using #readings