STAT121 / AC209 / E-109
CS109 Data Science
        Hanspeter Pfister
   pfister@seas.harvard.edu
           Joe Blitzstein
   blitzstein@stat.harvard.edu
          Outline
 What?
 Why?
 Who?
 How?
          Data Science
  To gain insights into data through
computation, statistics, and visualization
     A Data Scientist Is...
A data scientist is someone who knows more
 statistics than a computer scientist and more
      computer science than a statistician.
                             - Josh Blumenstock
 Data Scientist = statistician + programmer +
         coach + storyteller + artist
                              - Shlomo Argamon
Nate Silver
Nate Silver won the election
  Harvard Business Review
#natesilverfacts
http://techcrunch.com/2012/11/07/nate-silver-as-software/
       Nate Silver on Pundits
                        Silver: Pundits are no
                       better than a coin toss.
                     Stewart: Do you foresee a
                     coin getting its own show?
                        The coin toss show?
http://www.thedailyshow.com/watch/wed-october-17-2012/nate-silver
    Some Key Principles
   use many data sources (the plural of anecdote is not data)
   understand how the data were collected (sampling is essential)
   weight the data thoughtfully (not all polls are equally good)
   use statistical models (not just hacking around in Excel)
   understand correlations (e.g., states that trend similarly)
   think like a Bayesian, check like a frequentist (reconciliation)
   have good communication skills (What does a 60%
    probability even mean? How can we visualize, validate, and
    understand the conclusions?)
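
The principle "weight the data thoughtfully" can be illustrated with a minimal sketch. It assumes the simplest possible weighting, by sample size; the poll numbers are made up for illustration, and this is not a description of any particular forecaster's method.

```python
import math

# Hypothetical polls: (share for candidate A, sample size). Made-up numbers.
polls = [
    (0.52, 1000),
    (0.49, 400),
    (0.51, 600),
]

def weighted_average(polls):
    """Average poll shares, weighting each poll by its sample size."""
    total_n = sum(n for _, n in polls)
    return sum(share * n for share, n in polls) / total_n

def standard_error(share, n):
    """Standard error of a proportion under simple random sampling."""
    return math.sqrt(share * (1 - share) / n)

pooled = weighted_average(polls)
pooled_n = sum(n for _, n in polls)
unweighted = sum(share for share, _ in polls) / len(polls)

print(f"unweighted mean: {unweighted:.4f}")
print(f"weighted mean:   {pooled:.4f}")
print(f"approx. standard error: {standard_error(pooled, pooled_n):.4f}")
```

Sample size is only one possible weight. Poll quality, recency, and correlations between polls would call for a proper statistical model, as the list above suggests.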
Human Genome
Microarrays
              Affymetrix Chip
                            [wikipedia]
Sequencing
Sequencing Cost
Genome Data
   Genome Visualization
                                 [Krzywinski 2009]
[Thorvaldsdóttir 2013]
                        [Meyer 2009]
Personalized Therapy
...10 years from now, each cancer
patient is going to want to get a genomic
analysis of their cancer and will expect
customized therapy based on that
information.
     Director, The Cancer Genome Atlas
        (TCGA), Time Magazine, 6/13/11
Netflix Prize
       Some Challenges
   massive data (500k users, 20k movies, 100m ratings)
   curse of dimensionality (very high-dimensional
    problem)
   missing data (99% of data missing; not missing at
    random)
   extremely complicated set of factors that affect people's
    ratings of movies (actors, directors, genre, ...)
   need to avoid overfitting (test data vs. training data)
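
As a quick check of the "99% of data missing" figure, the following sketch uses only the approximate counts quoted above (500k users, 20k movies, 100m ratings). The variable names are illustrative.

```python
# Approximate figures from the slide.
n_users = 500_000
n_movies = 20_000
n_ratings = 100_000_000

# If every user rated every movie, the ratings matrix would be full.
possible = n_users * n_movies
observed_fraction = n_ratings / possible
missing_fraction = 1 - observed_fraction

print(f"possible ratings: {possible:,}")
print(f"observed: {observed_fraction:.1%}, missing: {missing_fraction:.1%}")
```

Only about 1 in 100 cells of the user-by-movie matrix is filled, which is why the missing-data and dimensionality challenges go hand in hand.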
 Netflix Prize Progress
http://blogs.hbr.org/cs/2012/10/big_data_hype_and_reality.html
           Connectome
What is the connectivity of large brain circuits?
                                     Ramón y Cajal, 1905
Connectome Workflow
Ultra-Thin Section EM
Automatic Reconstruction
2D Segmentation
Combine Multiple 2D Segmentations with Fusion
Globally Consistent 3D Segmentation
                                                [Kaynig et al., CVPR 10]
                                              [Vazquez et al., ICCV 2011]
2012
           Data Science
Computer Science
Statistics
Domain Science     Drew Conway
Machine                                        Human
Data Management                         Human Cognition
     Data Mining                                        Perception
Machine Learning      Visualization                 Story Telling
          Business Intelligence       Decision Making
                                              Theory
 Statistics
                    Data Science
                                          Inspired by Daniel Keim, Visual Analytics: Definition,
                                                       Process, and Challenges
            Outline
   What?
   Why?
   Who?
   How?
The Age of Big Data
                      BBC, 2013
Crime Prevention
  Boston Globe,
Sunday, Aug 4, 2013
                            Big Data
        2.5 exabytes
daily data
                  years        2012
                          [IBMbigdata]
                                         [Domo]
Between the dawn of civilization and
2003, we only created five exabytes of
information; now we're creating that
amount every two days.
       Eric Schmidt, Google (and others)
http://onesecond.designly.com/
Smarter Devices
                  Michael Franklin, UC Berkeley
Commodity Computing
                 Michael Franklin, UC Berkeley
Ubiquitous Connectivity
                     Michael Franklin, UC Berkeley
travers808,Visual.ly
1 Zettabyte = 1 Billion Terabytes
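
The unit arithmetic on these slides can be checked in a few lines. This sketch assumes decimal (SI) prefixes; the constant names are illustrative.

```python
# Decimal (SI) prefixes, expressed in bytes.
TERABYTE = 10**12
EXABYTE = 10**18
ZETTABYTE = 10**21

# One zettabyte expressed in terabytes.
terabytes_per_zettabyte = ZETTABYTE // TERABYTE
print(f"{terabytes_per_zettabyte:,} terabytes per zettabyte")

# 2.5 exabytes of data per day, as quoted on the "Big Data" slide.
daily_bytes = 25 * EXABYTE // 10
two_days_in_exabytes = 2 * daily_bytes // EXABYTE
print(f"{two_days_in_exabytes} exabytes every two days")
```

The second result matches the "five exabytes every two days" figure in the quote attributed to Eric Schmidt.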
Jim Gray, Microsoft
By 2018, the US could face a shortage
of up to 190,000 workers with analytical
skills
               McKinsey Global Institute
The sexy job in the next 10 years will
be statisticians. Data Scientists?
   Hal Varian, Prof. Emeritus UC Berkeley
                 Chief Economist, Google
   Hal Varian Explains...
The ability to take data - to be able to
understand it, to process it, to
extract value from it, to visualize it, to
communicate it - that's going to be a hugely
important skill in the next decades, not
only at the professional level but even at
the educational level for elementary school
kids, for high school kids, for college kids.
Because now we really do have essentially
free and ubiquitous data.
                              - Hal Varian
Ask an interesting question.
    What is the scientific goal?
    What would you do if you had all the data?
    What do you want to predict or estimate?
Get the data.
    How were the data sampled?
    Which data are relevant?
    Are there privacy issues?
Explore the data.
    Plot the data.
    Are there anomalies?
    Are there patterns?
Model the data.
    Build a model.
    Fit the model.
    Validate the model.
Communicate and visualize the results.
    What did we learn?
    Do the results make sense?
    Can we tell a story?
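
The steps above can be sketched end to end on a toy problem. This is a minimal illustration using only the Python standard library. The synthetic data (a line y = 2x + 1 with Gaussian noise), the 30/10 train/test split, and the helper name fit_line are all assumptions made for the example, not course material.

```python
import random

random.seed(0)  # make the synthetic data reproducible

# Get the data: synthetic (x, y) pairs from an assumed line y = 2x + 1 plus noise.
data = [(x, 2 * x + 1 + random.gauss(0, 1)) for x in range(40)]

# Explore the data: basic summaries before any modeling.
xs = [x for x, _ in data]
ys = [y for _, y in data]
print(f"n={len(data)}, mean x={sum(xs) / len(xs):.2f}, mean y={sum(ys) / len(ys):.2f}")

# Model the data: hold out a test set, then fit a least-squares line.
random.shuffle(data)
train, test = data[:30], data[30:]

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(train)

# Validate the model: mean squared error on the held-out points.
mse = sum((y - (slope * x + intercept)) ** 2 for x, y in test) / len(test)

# Communicate the results: report the fit and how well it generalizes.
print(f"fitted: y = {slope:.2f}x + {intercept:.2f}, held-out MSE = {mse:.2f}")
```

In a real analysis the "explore" step would include plots, and the questions listed under each step would be asked about real data rather than a simulation.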
            Outline
   What?
   Why?
   Who?
   How?
Hanspeter Pfister
  An Wang
          My Background
   Grew up in Switzerland
   M.Sc. in EE from ETH Zurich
   Ph.D. in CS from SUNY Stony Brook
   11 years in industry (MERL)
   At Harvard since 2007, Visual Computing Group (4 Ph.D., 7 PD)
   Teach CS109 / CS171, taught CS175 / CS264 / CS205
   Director of the Institute of Applied Computational Science (IACS)
   Two daughters, Lilly (10) and Audrey (7)
             Joe Blitzstein
       Professor of the Practice in Statistics,
Co-Director of Undergraduate Studies in Statistics
 blitz@fas.harvard.edu, twitter @stat110, SC 714
                 CS109 Staff
   Chris Beaumont, Head TF      Ray Jones
   Johanna Beyer                Steffen Kirchhoff
   Nicolas Bonneel              Seymour Knowles-Barley
   Alex D'Amour                 Alexander Lex
   Rahul Dave                   Deqing Sun
   Brandon Haynes               Tim Brenner, A/V
About You
          Outline
 What?
 Why?
 Who?
 How?
      CS109 Key Facets
   data munging/scraping/sampling/cleaning in order to get an
    informative, manageable data set;
   data storage and management in order to be able to access
    data - especially big data - quickly and reliably during
    subsequent analysis;
   exploratory data analysis to generate hypotheses and
    intuition about the data;
   prediction based on statistical tools such as regression,
    classification, and clustering; and
   communication of results through visualization, stories, and
    interpretable summaries.
     Act I: Predictions
 Data Science Process
 Data Types and Data Munging
 Probability Review
 Classification & Regression
 Cross Validation, Clustering
 Visualization & Story Telling
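
As a preview of the cross validation topic listed above, here is a minimal sketch of k-fold cross validation in plain Python. The "model" simply predicts the mean of the training values, which is an assumption chosen to keep the example short. The function names are illustrative.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(ys, k):
    """Average held-out squared error of a 'predict the training mean' model."""
    errors = []
    for held_out in k_fold_indices(len(ys), k):
        held = set(held_out)
        train = [y for i, y in enumerate(ys) if i not in held]
        prediction = sum(train) / len(train)  # fit on the training folds only
        fold_error = sum((ys[i] - prediction) ** 2 for i in held_out) / len(held_out)
        errors.append(fold_error)
    return sum(errors) / len(errors)

ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(k_fold_indices(len(ys), 3))
print(cross_validate(ys, 3))
```

Each observation is held out exactly once, so the averaged error estimates performance on unseen data. In practice the data would usually be shuffled before being split into folds.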
Act II: Recommendations
  Bayesian Thinking & Computation
  Monte Carlo Methods
  Machine Learning Methods
  MapReduce and Amazon's EC2
  Databases (Margo Seltzer)
Act III: Network Analysis
  Network Visualization
  Network Sampling
  Community Detection
  Guest Lecture
Abstractions...
...and Tools
xkcd
           Homework
 Real-World focus
  Scrape and wrangle messy data
  Apply sophisticated statistical analysis
  Visualize and communicate results
 Election data, movie reviews, Yelp! data, etc.
         Final Project
 Pick a project of your choosing
 Teams of up to 2 students
 Process books, web sites, screencasts
 IPython (exceptions possible)
 Best project prizes!
cs109.org
Is this course for me ???
        Prerequisites
Programming experience
 C, C++, Java, Python, etc.
Basic statistical knowledge
 STAT100, ideally STAT110
Willingness to learn new software & tools
 This can be time consuming
 You will need to read online documentation
  Be Patient
  Be Flexible
Be Constructive
                  http://davidzinger.wordpress.com/2007/05/page/2/
                Next Steps
   HW 0
       Good test of your basic skills
       Installation of several Python frameworks
       Not graded, do it as soon as possible
   Read syllabus carefully
   Do readings
       Post comments to Piazza using #readings