Data Science Course Overview

This document provides an outline and overview of the CS109 Data Science course at Harvard University. It discusses what data science is, why it is important, who teaches the course, and how the course is structured. The course covers key data science topics like data munging, exploratory analysis, prediction, and communication of results. It is taught using real-world datasets and Python tools. The course aims to take students through the full data science process on projects from data collection to modeling to visualization.

Uploaded by

Matheus Silva


STAT121 / AC209 / E-109

CS109 Data Science

Hanspeter Pfister
pfister@seas.harvard.edu

Joe Blitzstein
blitzstein@stat.harvard.edu
Outline
What?
Why?
Who?
How?
Data Science
To gain insights into data through
computation, statistics, and visualization
A Data Scientist Is...
A data scientist is someone who knows more
statistics than a computer scientist and more
computer science than a statistician.
- Josh Blumenstock

Data Scientist = statistician + programmer + coach + storyteller + artist
- Shlomo Argamon
Nate Silver
Nate Silver won the election
Harvard Business Review
#natesilverfacts
http://techcrunch.com/2012/11/07/nate-silver-as-software/
Nate Silver on Pundits
Silver: Pundits are no
better than a coin toss.
Stewart: Do you foresee a
coin getting its own show?
The coin toss show?

http://www.thedailyshow.com/watch/wed-october-17-2012/nate-silver
Some Key Principles
use many data sources (the plural of anecdote is not data)

understand how the data were collected (sampling is essential)

weight the data thoughtfully (not all polls are equally good)

use statistical models (not just hacking around in Excel)

understand correlations (e.g., states that trend similarly)

think like a Bayesian, check like a frequentist (reconciliation)

have good communication skills (What does a 60% probability even mean? How can we visualize, validate, and understand the conclusions?)
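
The principle of weighting polls can be illustrated with a minimal sketch. The poll shares, sample sizes, and quality scores below are invented for illustration, and weighting by sample size times a quality score is only one plausible scheme, not the method used by any particular forecaster.

```python
# Hypothetical polls: (share for candidate A, sample size, quality score in [0, 1]).
# All numbers are invented for illustration.
polls = [
    (0.52, 1000, 1.0),
    (0.49, 400, 0.5),
    (0.56, 500, 0.4),
]


def simple_average(polls):
    """Treat every poll as equally informative."""
    return sum(share for share, _, _ in polls) / len(polls)


def weighted_average(polls):
    """Weight each poll by sample size times quality score (an assumed scheme)."""
    weights = [n * quality for _, n, quality in polls]
    weighted_sum = sum(w * share for w, (share, _, _) in zip(weights, polls))
    return weighted_sum / sum(weights)


print(f"simple average:   {simple_average(polls):.4f}")
print(f"weighted average: {weighted_average(polls):.4f}")
```

The two averages differ because the large, high-quality poll dominates the weighted estimate, while the simple average lets small or weak polls pull just as hard.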
Human Genome
Microarrays

Affymetrix Chip

[wikipedia]
Sequencing
Sequencing Cost
Genome Data
Genome Visualization

[Krzywinski et al. 2009]

[Thorvaldsdóttir et al. 2013]

[Meyer et al. 2009]
Personalized Therapy
...10 years from now, each cancer
patient is going to want to get a genomic
analysis of their cancer and will expect
customized therapy based on that
information.
Director, The Cancer Genome Atlas
(TCGA), Time Magazine, 6/13/11
Netflix Prize
Some Challenges
massive data (500k users, 20k movies, 100m ratings)

curse of dimensionality (very high-dimensional problem)

missing data (99% of data missing; not missing at random)

extremely complicated set of factors that affect people's ratings of movies (actors, directors, genre, ...)

need to avoid overfitting (test data vs. training data)
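
The "99% of data missing" figure follows from the rounded counts on the slide. The short check below uses only those three numbers; the variable names are ours.

```python
# Rounded figures quoted on the slide.
n_users = 500_000
n_movies = 20_000
n_ratings = 100_000_000

n_cells = n_users * n_movies          # every possible (user, movie) pair
fill_fraction = n_ratings / n_cells   # share of pairs that actually have a rating
missing_fraction = 1 - fill_fraction

print(f"possible ratings: {n_cells:,}")
print(f"observed: {fill_fraction:.1%}, missing: {missing_fraction:.1%}")
```

With 10 billion possible user-movie pairs and 100 million ratings, only about 1% of the rating matrix is observed.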

Netflix Prize Progress

http://blogs.hbr.org/cs/2012/10/big_data_hype_and_reality.html
Connectome
What is the connectivity of large brain circuits?

Ramón y Cajal, 1905

Connectome Workflow
Ultra-Thin Section EM
Automatic Reconstruction
2D Segmentation
Combine Multiple 2D Segmentations with Fusion
Globally Consistent 3D Segmentation

[Kaynig et al., CVPR 10]

[Vazquez et al., ICCV 2011]
2012
Data Science
[Venn diagram: Computer Science, Statistics, Domain Science. Credit: Drew Conway]

[Diagram relating Machine and Human capabilities to Data Science. Labels: Data Management, Data Mining, Machine Learning, Business Intelligence, Human Cognition, Perception, Visualization, Story Telling, Decision Making, Theory, Statistics]
Inspired by Daniel Keim, Visual Analytics: Definition, Process, and Challenges
Outline

What?
Why?
Who?
How?
The Age of Big Data

BBC, 2013
Crime Prevention
Boston Globe,
Sunday, Aug 4, 2013
Big Data

2.5 exabytes of data daily (2012)
[IBMbigdata]

[Domo]
Between the dawn of civilization and 2003, we only created five exabytes of information; now we're creating that amount every two days.
Eric Schmidt, Google (and others)
http://onesecond.designly.com/
Smarter Devices

Michael Franklin, UC Berkeley

Commodity Computing

Michael Franklin, UC Berkeley

Ubiquitous Connectivity

Michael Franklin, UC Berkeley

travers808, Visual.ly
1 Zettabyte = 1 Billion Terabytes
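
A quick check of the units, assuming decimal (SI) prefixes. It also shows that "five exabytes every two days" matches the 2.5 exabytes per day quoted earlier. The days-per-zettabyte figure is our own extrapolation from that rate, not a number from the slides.

```python
# Decimal (SI) byte units; binary prefixes would give slightly different values.
TERABYTE = 10**12
EXABYTE = 10**18
ZETTABYTE = 10**21

# One zettabyte expressed in terabytes: one billion.
terabytes_per_zettabyte = ZETTABYTE // TERABYTE

# "Five exabytes every two days" as a daily rate.
exabytes_per_day = 5 / 2

# Extrapolation: days needed to produce one zettabyte at that rate.
days_per_zettabyte = ZETTABYTE / (exabytes_per_day * EXABYTE)

print(f"{terabytes_per_zettabyte:,} TB per ZB")
print(f"{exabytes_per_day} EB per day")
print(f"{days_per_zettabyte:.0f} days per ZB")
```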
Jim Gray, Microsoft
By 2018, the US could face a shortage
of up to 190,000 workers with analytical
skills
McKinsey Global Institute

The sexy job in the next 10 years will be statisticians. Data Scientists?
Hal Varian, Prof. Emeritus UC Berkeley
Chief Economist, Google
Hal Varian Explains...
The ability to take data, to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it, that's going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data.
Hal Varian
Ask an interesting question.
  What is the scientific goal?
  What would you do if you had all the data?
  What do you want to predict or estimate?

Get the data.
  How were the data sampled?
  Which data are relevant?
  Are there privacy issues?

Explore the data.
  Plot the data.
  Are there anomalies?
  Are there patterns?

Model the data.
  Build a model.
  Fit the model.
  Validate the model.

Communicate and visualize the results.
  What did we learn?
  Do the results make sense?
  Can we tell a story?
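
The steps of this process can be sketched end to end on a toy problem. Everything here is an assumption made for illustration: the data are synthetic (a noisy line), the model is ordinary least squares written by hand, and validation is a single held-out split. It is not the course's own code.

```python
import random

random.seed(0)

# Get the data: a synthetic sample standing in for real measurements.
# Assumed true relationship for this sketch: y = 2x + 1 plus Gaussian noise.
xs = [random.uniform(0, 10) for _ in range(200)]
data = [(x, 2 * x + 1 + random.gauss(0, 1)) for x in xs]

# Explore the data: basic summaries before any modeling.
ys = [y for _, y in data]
print(f"n = {len(data)}, min y = {min(ys):.2f}, max y = {max(ys):.2f}")

# Hold out a test set so validation uses data the model has not seen.
random.shuffle(data)
train, test = data[:150], data[150:]


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(points, slope, intercept):
    """Average squared prediction error over the given points."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points) / len(points)


# Model the data: fit on the training set, validate on the held-out set.
slope, intercept = fit_line(train)
train_mse = mean_squared_error(train, slope, intercept)
test_mse = mean_squared_error(test, slope, intercept)

# Communicate the results.
print(f"fitted line: y = {slope:.2f}x + {intercept:.2f}")
print(f"train MSE = {train_mse:.2f}, test MSE = {test_mse:.2f}")
```

Comparing the training error with the held-out error is the simplest form of the "test data vs. training data" check: a model that overfits shows a much larger error on the held-out set.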
Outline

What?
Why?
Who?
How?
Hanspeter Pfister

An Wang
My Background
Grew up in Switzerland

M.Sc. in EE from ETH Zurich

Ph.D. in CS from SUNY Stony Brook

11 years in industry (MERL)

At Harvard since 2007, Visual Computing Group (4 Ph.D., 7 PD)

Teach CS109 / CS171, taught CS175 / CS264 / CS205

Director of the Institute of Applied Computational Science (IACS)

Two daughters, Lilly (10) and Audrey (7)

Joe Blitzstein
Professor of the Practice in Statistics,
Co-Director of Undergraduate Studies in Statistics
blitz@fas.harvard.edu, twitter @stat110, SC 714
CS109 Staff
Chris Beaumont, Head TF
Johanna Beyer
Nicolas Bonneel
Alex D'Amour
Rahul Dave
Brandon Haynes
Ray Jones
Steffen Kirchhoff
Seymour Knowles-Barley
Alexander Lex
Deqing Sun
Tim Brenner, A/V

About You
Outline
What?
Why?
Who?
How?
CS109 Key Facets
data munging/scraping/sampling/cleaning in order to get an informative, manageable data set;

data storage and management in order to be able to access data - especially big data - quickly and reliably during subsequent analysis;

exploratory data analysis to generate hypotheses and intuition about the data;

prediction based on statistical tools such as regression, classification, and clustering; and

communication of results through visualization, stories, and interpretable summaries.
Act I: Predictions
Data Science Process
Data Types and Data Munging
Probability Review
Classification & Regression
Cross Validation, Clustering
Visualization & Story Telling
Act II: Recommendations
Bayesian Thinking & Computation
Monte Carlo Methods
Machine Learning Methods
MapReduce and Amazon's EC2
Databases (Margo Seltzer)
Act III: Network Analysis
Network Visualization
Network Sampling
Community Detection
Guest Lecture
Abstractions...
...and Tools
xkcd
Homework
Real-World focus
Scrape and wrangle messy data
Apply sophisticated statistical analysis
Visualize and communicate results
Election data, movie reviews, Yelp! data, etc.
Final Project
Pick a project of your choosing
Teams of up to 2 students
Process books, web sites, screencasts
IPython (exceptions possible)
Best project prizes!
cs109.org
Is this course for me ???
Prerequisites
Programming experience
C, C++, Java, Python, etc.
Basic statistical knowledge
STAT100, ideally STAT110
Willingness to learn new software & tools
This can be time consuming
You will need to read online documentation
Be Patient

Be Flexible

Be Constructive

http://davidzinger.wordpress.com/2007/05/page/2/
Next Steps
HW 0
Good test of your basic skills

Installation of several Python frameworks

Not graded, do it as soon as possible

Read syllabus carefully

Do readings
Post comments to Piazza using #readings
