STAT121 / AC209 / E-109
CS109 Data Science
Hanspeter Pfister pfister@seas.harvard.edu
Joe Blitzstein blitzstein@stat.harvard.edu
Outline
What? Why? Who? How?
Data Science
To gain insights into data through computation, statistics, and visualization
A Data Scientist Is...
"A data scientist is someone who knows more statistics than a computer scientist and more computer science than a statistician." - Josh Blumenstock
"Data Scientist = statistician + programmer + coach + storyteller + artist" - Shlomo Argamon
Nate Silver
"Nate Silver won the election" - Harvard Business Review
#natesilverfacts
http://techcrunch.com/2012/11/07/nate-silver-as-software/
Nate Silver on Pundits
Silver: Pundits are no better than a coin toss.
Stewart: Do you foresee a coin getting its own show? The coin toss show?
http://www.thedailyshow.com/watch/wed-october-17-2012/nate-silver
Some Key Principles
use many data sources (the plural of anecdote is not data)
understand how the data were collected (sampling is essential)
weight the data thoughtfully (not all polls are equally good)
use statistical models (not just hacking around in Excel)
understand correlations (e.g., states that trend similarly)
think like a Bayesian, check like a frequentist (reconciliation)
have good communication skills (What does a 60% probability even mean? How can we visualize, validate, and understand the conclusions?)
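As one illustration of "weight the data thoughtfully", here is a minimal sketch of a weighted poll average. The poll numbers and the weighting rule (sample size times a quality score) are hypothetical and invented for this example; this is not Nate Silver's actual model.

```python
# Hypothetical polls: (candidate share in percent, sample size, quality score in [0, 1]).
# All numbers are invented for illustration.
polls = [
    (52.0, 1000, 0.9),
    (48.0, 400, 0.5),
    (50.0, 800, 0.7),
]

def weighted_average(polls):
    """Average poll shares, weighting each poll by sample size times quality."""
    weights = [n * q for _, n, q in polls]
    total = sum(weights)
    return sum(share * w for (share, _, _), w in zip(polls, weights)) / total

def simple_average(polls):
    """Unweighted mean of the poll shares, for comparison."""
    return sum(share for share, _, _ in polls) / len(polls)

print(f"simple:   {simple_average(polls):.2f}")
print(f"weighted: {weighted_average(polls):.2f}")
```

With these made-up inputs the large, higher-quality poll pulls the weighted estimate above the simple mean; a different weighting rule could reasonably give a different answer.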
Human Genome
Microarrays
Affymetrix Chip
[wikipedia]
Sequencing
Sequencing Cost
Genome Data
Genome Visualization
[Krzywinski 2009]
[Thorvaldsdóttir 2013]
[Meyer 2009]
Personalized Therapy
"...10 years from now, each cancer patient is going to want to get a genomic analysis of their cancer and will expect customized therapy based on that information." - Director, The Cancer Genome Atlas (TCGA), Time Magazine, 6/13/11
Netflix Prize
Some Challenges
massive data (500k users, 20k movies, 100m ratings)
curse of dimensionality (very high-dimensional problem)
missing data (99% of data missing; not missing at random)
extremely complicated set of factors that affect people's ratings of movies (actors, directors, genre, ...)
need to avoid overfitting (test data vs. training data)
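The "99% of data missing" figure follows from the rounded counts on the slide, as the short sketch below checks. The sparse dictionary and its three ratings are hypothetical examples added for illustration, not Netflix data.

```python
# Figures quoted on the slide (rounded).
n_users = 500_000
n_movies = 20_000
n_ratings = 100_000_000

possible = n_users * n_movies          # every user rating every movie
observed_fraction = n_ratings / possible
missing_fraction = 1 - observed_fraction

print(f"possible ratings: {possible:,}")
print(f"observed: {observed_fraction:.1%}, missing: {missing_fraction:.1%}")

# A dense user-by-movie table would be almost entirely empty, so a sparse
# layout stores only the ratings that exist. The entries below are
# hypothetical examples, not Netflix data.
ratings = {
    (1, 10): 4,   # (user_id, movie_id) -> stars
    (1, 42): 5,
    (2, 10): 3,
}

def get_rating(user_id, movie_id):
    """Return the stored rating, or None when the rating is missing."""
    return ratings.get((user_id, movie_id))
```

Returning `None` for a missing pair keeps "not rated" distinct from any real rating value, which matters because the missing entries are not missing at random.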
Netflix Prize Progress
http://blogs.hbr.org/cs/2012/10/big_data_hype_and_reality.html
Connectome
What is the connectivity of large brain circuits?
Ramón y Cajal, 1905
Connectome Workflow
Ultra-Thin Section EM
Automatic Reconstruction
2D Segmentation
Combine Multiple 2D Segmentations with Fusion
Globally Consistent 3D Segmentation
[Kaynig et al., CVPR 10] [Vazquez et al., ICCV 2011]
2012
Data Science
Computer Science Statistics
Domain Science
Drew Conway
Machine: Data Management, Data Mining, Machine Learning, Visualization
Human: Human Cognition, Perception, Story Telling, Decision Making Theory
Business Intelligence, Statistics
Data Science
Inspired by Daniel Keim, "Visual Analytics: Definition, Process, and Challenges"
Outline
What? Why? Who? How?
The Age of Big Data
BBC, 2013
Crime Prevention
Boston Globe, Sunday, Aug 4, 2013
Big Data
2.5 exabytes of data created daily (2012)
[IBMbigdata]
[Domo]
"Between the dawn of civilization and 2003, we only created five exabytes of information; now we're creating that amount every two days." - Eric Schmidt, Google (and others)
http://onesecond.designly.com/
Smarter Devices
Michael Franklin, UC Berkeley
Commodity Computing
Michael Franklin, UC Berkeley
Ubiquitous Connectivity
Michael Franklin, UC Berkeley
travers808, Visual.ly
1 Zettabyte = 1 Billion Terabytes
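The unit arithmetic on these slides can be checked in a few lines. This sketch assumes decimal (SI) units and takes the "five exabytes every two days" rate from the Eric Schmidt quote at face value; the variable names are illustrative.

```python
# Decimal (SI) byte units.
TERABYTE = 10**12
EXABYTE = 10**18
ZETTABYTE = 10**21

# 1 zettabyte expressed in terabytes.
terabytes_per_zettabyte = ZETTABYTE // TERABYTE

# "Five exabytes every two days" as a daily rate, in exabytes.
exabytes_per_day = 5 / 2

# Days needed to accumulate one zettabyte at that rate.
days_per_zettabyte = ZETTABYTE / (exabytes_per_day * EXABYTE)

print(terabytes_per_zettabyte)
print(exabytes_per_day)
print(days_per_zettabyte)
```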
Jim Gray, Microsoft
"By 2018, the US could face a shortage of up to 190,000 workers with analytical skills." - McKinsey Global Institute
"The sexy job in the next 10 years will be statisticians." - Hal Varian, Prof. Emeritus UC Berkeley, Chief Economist, Google
Data Scientists?
Hal Varian Explains...
"The ability to take data - to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it - it's going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data." - Hal Varian
Ask an interesting question.
What is the scientific goal? What would you do if you had all the data? What do you want to predict or estimate?
Get the data.
How were the data sampled? Which data are relevant? Are there privacy issues?
Explore the data.
Plot the data. Are there anomalies? Are there patterns?
Model the data.
Build a model. Fit the model. Validate the model.
Communicate and visualize the results.
What did we learn? Do the results make sense? Can we tell a story?
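The five steps above can be sketched end to end on a toy problem. Everything here is an assumption made for illustration: the data are simulated rather than collected, the question is hypothetical, and the model is a plain least-squares line fitted with the standard library only.

```python
import random

# 1. Ask: how does y change with x? (Hypothetical question on synthetic data.)
# 2. Get: simulate data here; a real project would load or scrape it.
random.seed(0)
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [2.0 * x + 1.0 + random.gauss(0, 1) for x in xs]

# 3. Explore: simple summaries before any modeling.
def mean(values):
    return sum(values) / len(values)

print(f"x range: {min(xs):.2f} to {max(xs):.2f}, mean y: {mean(ys):.2f}")

# 4. Model: fit a line by ordinary least squares on a training split,
#    then validate on held-out data.
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

def fit_line(x, y):
    """Return (slope, intercept) minimizing squared error."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

slope, intercept = fit_line(train_x, train_y)
test_mse = mean([(slope * a + intercept - b) ** 2 for a, b in zip(test_x, test_y)])

# 5. Communicate: report the result in plain language.
print(f"y increases by about {slope:.2f} per unit of x (held-out MSE {test_mse:.2f})")
```

Because the data were simulated with slope 2 and unit noise, the fitted slope should land near 2 and the held-out error near 1; on real data there is no such ground truth, which is why the validation step matters.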
Outline
What? Why? Who? How?
Hanspeter Pfister
An Wang
My Background
Grew up in Switzerland
M.Sc. in EE from ETH Zurich
Ph.D. in CS from SUNY Stony Brook
11 years in industry (MERL)
At Harvard since 2007, Visual Computing Group (4 Ph.D., 7 PD)
Teach CS109 / CS171, taught CS175 / CS264 / CS205
Director of the Institute for Applied Computational Science (IACS)
Two daughters, Lilly (10) and Audrey (7)
Professor of the Practice in Statistics, Co-Director of Undergraduate Studies in Statistics blitz@fas.harvard.edu, twitter @stat110, SC 714
Joe Blitzstein
CS109 Staff
Chris Beaumont, Head TF
Johanna Beyer
Nicolas Bonneel
Alex D'Amour
Rahul Dave
Brandon Haynes
Ray Jones
Steffen Kirchhoff
Seymour Knowles-Barley
Alexander Lex
Deqing Sun
Tim Brenner, A/V
About You
Outline
What? Why? Who? How?
CS109 Key Facets
data munging/scraping/sampling/cleaning in order to get an informative, manageable data set;
data storage and management in order to be able to access data - especially big data - quickly and reliably during subsequent analysis;
exploratory data analysis to generate hypotheses and intuition about the data;
prediction based on statistical tools such as regression, classification, and clustering; and
communication of results through visualization, stories, and interpretable summaries.
Act I: Predictions
Data Science Process
Data Types and Data Munging
Probability Review
Classification & Regression
Cross Validation, Clustering
Visualization & Story Telling
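As a preview of the cross-validation topic in Act I, here is a minimal k-fold sketch using only the standard library. The helper names (`k_fold_indices`, `cross_validate`) and the trivial "predict the training mean" model are illustrative assumptions, not course code.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(xs, ys, fit, predict, k=5):
    """Return the mean squared error of each held-out fold."""
    folds = k_fold_indices(len(xs), k)
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        sq_err = [(predict(model, xs[i]) - ys[i]) ** 2 for i in held_out]
        errors.append(sum(sq_err) / len(sq_err))
    return errors

# A deliberately simple model for illustration: always predict the training mean.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)

def predict_mean(model, x):
    return model

xs = list(range(20))
ys = [3.0] * 20          # hypothetical constant data, so the error should be zero
print(cross_validate(xs, ys, fit_mean, predict_mean, k=5))
```

Each observation is held out exactly once, so every error is measured on data the model did not see during fitting.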
Act II: Recommendations
Bayesian Thinking & Computation
Monte Carlo Methods
Machine Learning Methods
MapReduce and Amazon's EC2
Databases (Margo Seltzer)
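To give a flavor of the MapReduce topic in Act II, the sketch below imitates the map, shuffle, and reduce phases for a word count in a single Python process. It is an illustration of the programming model only; it assumes nothing about Hadoop, EC2, or whichever framework the course uses, and the function names and input documents are hypothetical.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Group emitted values by key, as the framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical input documents.
documents = ["big data big ideas", "data science"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle_phase(pairs))
print(counts)
```

In a real cluster the map calls run on different machines and the shuffle moves data over the network; the per-key independence of the reduce step is what makes that distribution possible.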
Act III: Network Analysis
Network Visualization
Network Sampling
Community Detection
Guest Lecture
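As a small taste of the network analysis in Act III, the sketch below builds an adjacency structure from an edge list, counts degrees, and finds connected components with breadth-first search. Connected components are only a crude stand-in for community detection, which uses more refined methods; the edge list is hypothetical.

```python
from collections import deque

# Hypothetical undirected friendship network as an edge list.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e")]

def build_adjacency(edges):
    """Map each node to the set of its neighbors."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency

def connected_components(adjacency):
    """Find groups of mutually reachable nodes with breadth-first search."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components

adjacency = build_adjacency(edges)
degrees = {node: len(neighbors) for node, neighbors in adjacency.items()}
components = connected_components(adjacency)
print(degrees)
print(components)
```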
Abstractions...
...and Tools
xkcd
Homework
Real-world focus
Scrape and wrangle messy data
Apply sophisticated statistical analysis
Visualize and communicate results
Election data, movie reviews, Yelp! data, etc.
Final Project
Pick a project of your choosing
Teams of up to 2 students
Process books, web sites, screencasts
IPython (exceptions possible)
Best project prizes!
cs109.org
Is this course for me?
Prerequisites
Programming experience
C, C++, Java, Python, etc.
Basic statistical knowledge
STAT100, ideally STAT110
Willingness to learn new software & tools
This can be time consuming
You will need to read online documentation
Be Patient
Be Flexible
Be Constructive
http://davidzinger.wordpress.com/2007/05/page/2/
Next Steps
HW 0
Good test of your basic skills
Installation of several Python frameworks
Not graded, do it as soon as possible
Read syllabus carefully
Do readings
Post comments to Piazza using #readings
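For the HW 0 installation step, a quick way to see which Python packages are importable is sketched below. The package list is an assumption made for illustration; the slides do not name the frameworks, so check the HW 0 handout for the actual requirements.

```python
import importlib.util
import sys

# Assumed package list for illustration; consult the actual HW 0 handout.
PACKAGES = ["numpy", "scipy", "matplotlib", "pandas"]

def check_packages(names):
    """Return {name: True/False} depending on whether each module can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

print("Python", sys.version.split()[0])
for name, found in check_packages(PACKAGES).items():
    print(f"{name:12s} {'ok' if found else 'MISSING'}")
```

`find_spec` only locates a module without importing it, so the check is fast and does not fail partway through when one package is broken.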