Introduction To Data Science

This document provides an outline for the CS109 Data Science course, including an overview of key concepts and principles covered in each section. The course aims to teach data munging, exploratory analysis, predictive modeling, and communicating results through three acts focused on predictions, recommendations, and network analysis. Students will apply these skills to real-world datasets in homework and a final project using tools like Python, IPython, and cloud computing platforms. Prerequisites include programming experience and basic statistics knowledge.

Uploaded by

Michael Wee

STAT121 / AC209 / E-109

CS109 Data Science

Hanspeter Pfister, pfister@seas.harvard.edu
Joe Blitzstein, blitzstein@stat.harvard.edu

Outline
What? Why? Who? How?

Data Science
To gain insights into data through computation, statistics, and visualization

A Data Scientist Is...

"A data scientist is someone who knows more statistics than a computer scientist and more computer science than a statistician." - Josh Blumenstock
"Data Scientist = statistician + programmer + coach + storyteller + artist" - Shlomo Argamon

Nate Silver

"Nate Silver won the election" - Harvard Business Review

#natesilverfacts

http://techcrunch.com/2012/11/07/nate-silver-as-software/

Nate Silver on Pundits

Silver: Pundits are no better than a coin toss. Stewart: Do you foresee a coin getting its own show? The coin toss show?
http://www.thedailyshow.com/watch/wed-october-17-2012/nate-silver

Some Key Principles

use many data sources (the plural of anecdote is not data)
understand how the data were collected (sampling is essential)
weight the data thoughtfully (not all polls are equally good)
use statistical models (not just hacking around in Excel)
understand correlations (e.g., states that trend similarly)
think like a Bayesian, check like a frequentist (reconciliation)
have good communication skills (What does a 60% probability even mean? How can we visualize, validate, and understand the conclusions?)
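
As a rough illustration of "weight the data thoughtfully", the sketch below combines a few polls using weights that favor larger and more recent samples. The poll numbers, the square-root sample-size weight, and the 14-day half-life are made-up assumptions for illustration, not the method of any particular forecaster.

```python
import math

# Hypothetical polls: (candidate share in %, sample size, age in days).
polls = [
    (52.0, 1000, 1),
    (48.0, 400, 10),
    (50.0, 800, 3),
]

def poll_weight(sample_size, age_days, half_life_days=14.0):
    """Assumed weighting scheme: larger and more recent polls count more."""
    size_factor = math.sqrt(sample_size)
    recency_factor = 0.5 ** (age_days / half_life_days)
    return size_factor * recency_factor

def weighted_average(polls):
    """Average the poll shares, each scaled by its weight."""
    weights = [poll_weight(n, age) for _, n, age in polls]
    total = sum(weights)
    return sum(w * share for w, (share, _, _) in zip(weights, polls)) / total

simple_avg = sum(share for share, _, _ in polls) / len(polls)
weighted_avg = weighted_average(polls)

print(f"simple average:   {simple_avg:.2f}")
print(f"weighted average: {weighted_avg:.2f}")
```

With these invented numbers the weighted average sits above the simple average, because the largest and freshest poll is also the most favorable one. Different weighting choices would give different answers, which is why the weights deserve thought.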

Human Genome

Microarrays

Affymetrix Chip

[wikipedia]

Sequencing

Sequencing Cost

Genome Data

Genome Visualization

[Krzywinski 2009]

[Thorvaldsdóttir 2013]

[Meyer 2009]

Personalized Therapy
"...10 years from now, each cancer patient is going to want to get a genomic analysis of their cancer and will expect customized therapy based on that information." - Director, The Cancer Genome Atlas (TCGA), Time Magazine, 6/13/11

Netflix Prize

Some Challenges

massive data (500k users, 20k movies, 100m ratings)
curse of dimensionality (very high-dimensional problem)
missing data (99% of data missing; not missing at random)
extremely complicated set of factors that affect people's ratings of movies (actors, directors, genre, ...)
need to avoid overfitting (test data vs. training data)
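
The "99% of data missing" figure follows directly from the rounded counts on this slide. The short calculation below is a sketch of that arithmetic; the variable names are illustrative and the counts are the rounded slide values, not exact dataset sizes.

```python
# Rounded figures quoted on the slide.
n_users = 500_000
n_movies = 20_000
n_ratings = 100_000_000

# If every user rated every movie, the ratings matrix would be full.
possible_pairs = n_users * n_movies
fraction_observed = n_ratings / possible_pairs
fraction_missing = 1 - fraction_observed
ratings_per_user = n_ratings / n_users

print(f"possible (user, movie) pairs: {possible_pairs:,}")
print(f"observed: {fraction_observed:.1%}, missing: {fraction_missing:.1%}")
print(f"average ratings per user: {ratings_per_user:.0f}")
```

So the ratings matrix has ten billion cells, of which only about one in a hundred is filled.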

Netflix Prize Progress

http://blogs.hbr.org/cs/2012/10/big_data_hype_and_reality.html

Connectome
What is the connectivity of large brain circuits?

Ramón y Cajal, 1905

Connectome Workflow

Ultra-Thin Section EM

Automatic Reconstruction

2D Segmentation

Combine Multiple 2D Segmentations with Fusion

Globally Consistent 3D Segmentation

[Kaynig et al., CVPR 10] [Vazquez et al., ICCV 2011]

2012

Data Science
Computer Science Statistics

Domain Science

Drew Conway

Machine: Data Management, Data Mining, Machine Learning, Visualization

H: Human Cognition, Perception, Story Telling, Decision Making Theory

Business Intelligence, Statistics

Data Science

Inspired by Daniel Keim, "Visual Analytics: Definition, Process, and Challenges"

Outline

What? Why? Who? How?

The Age of Big Data

BBC, 2013

Crime Prevention

Boston Globe, Sunday, Aug 4, 2013

Big Data
2.5 exabytes of data created daily, 2012 [IBMbigdata]

[Domo]

"Between the dawn of civilization and 2003, we only created five exabytes of information; now we're creating that amount every two days." - Eric Schmidt, Google (and others)

http://onesecond.designly.com/

Smarter Devices

Michael Franklin, UC Berkeley

Commodity Computing

Michael Franklin, UC Berkeley

Ubiquitous Connectivity

Michael Franklin, UC Berkeley

travers808,Visual.ly

1 Zettabyte = 1 Billion Terabytes
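
The conversion can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) prefixes, so a terabyte is 10^12 bytes and a zettabyte is 10^21 bytes. It also reuses the 2.5 exabytes per day figure quoted earlier in the deck to estimate how long it would take to accumulate one zettabyte at that rate.

```python
# Decimal (SI) prefixes, assumed here; binary prefixes would differ slightly.
BYTES = {
    "terabyte": 10**12,
    "exabyte": 10**18,
    "zettabyte": 10**21,
}

terabytes_per_zettabyte = BYTES["zettabyte"] // BYTES["terabyte"]
exabytes_per_zettabyte = BYTES["zettabyte"] // BYTES["exabyte"]

# Figure quoted on the earlier "Big Data" slide.
daily_exabytes = 2.5
days_to_one_zettabyte = exabytes_per_zettabyte / daily_exabytes

print(f"1 zettabyte = {terabytes_per_zettabyte:,} terabytes")
print(f"1 zettabyte = {exabytes_per_zettabyte:,} exabytes")
print(f"days to reach 1 zettabyte at 2.5 EB/day: {days_to_one_zettabyte:.0f}")
```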

Jim Gray, Microsoft

"By 2018, the US could face a shortage of up to 190,000 workers with analytical skills" - McKinsey Global Institute
"The sexy job in the next 10 years will be statisticians." Data Scientists? - Hal Varian, Prof. Emeritus UC Berkeley, Chief Economist, Google

Hal Varian Explains...

"The ability to take data - to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it - that's going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data." - Hal Varian

Ask an interesting question.

What is the scientific goal? What would you do if you had all the data? What do you want to predict or estimate?

Get the data.

How were the data sampled? Which data are relevant? Are there privacy issues?

Explore the data.

Plot the data. Are there anomalies? Are there patterns?

Model the data.

Build a model. Fit the model. Validate the model.
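
One minimal way to picture the build, fit, validate loop is a straight-line model fitted on part of the data and checked on the rest. The sketch below uses synthetic data and ordinary least squares purely as an assumed example. The data-generating line, the noise level, and the 70/30 split are all illustrative choices, not something prescribed by the course.

```python
import random

random.seed(0)

# Synthetic data (assumption): y = 2x + 1 plus Gaussian noise.
xs = [i / 10 for i in range(100)]
data = [(x, 2.0 * x + 1.0 + random.gauss(0, 0.5)) for x in xs]

# Hold out 30% of the points so validation uses data the fit never saw.
random.shuffle(data)
train, test = data[:70], data[70:]

def fit_line(points):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_squared_error(points, slope, intercept):
    """Average squared gap between observed and predicted y."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points) / len(points)

# Build: a straight line. Fit: on the training set only.
slope, intercept = fit_line(train)

# Validate: compare the error on training data with the error on held-out data.
train_mse = mean_squared_error(train, slope, intercept)
test_mse = mean_squared_error(test, slope, intercept)

print(f"fitted line: y = {slope:.2f} x + {intercept:.2f}")
print(f"train MSE: {train_mse:.3f}, test MSE: {test_mse:.3f}")
```

A test error far above the training error would be a warning sign of overfitting. With a model this simple and this much data, the two should be close.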

Communicate and visualize the results.

What did we learn? Do the results make sense? Can we tell a story?

Outline

What? Why? Who? How?

Hanspeter Pfister

An Wang

My Background

Grew up in Switzerland
M.Sc. in EE from ETH Zurich
Ph.D. in CS from SUNY Stony Brook
11 years in industry (MERL)
At Harvard since 2007, Visual Computing Group (4 Ph.D., 7 PD)
Teach CS109 / CS171, taught CS175 / CS264 / CS205
Director of the Institute of Applied Computational Science (IACS)
Two daughters, Lilly (10) and Audrey (7)

Joe Blitzstein

Professor of the Practice in Statistics, Co-Director of Undergraduate Studies in Statistics
blitz@fas.harvard.edu, twitter @stat110, SC 714

CS109 Staff

Chris Beaumont (Head TF), Johanna Beyer, Nicolas Bonneel, Alex D'Amour, Rahul Dave, Brandon Haynes

Ray Jones, Steffen Kirchhoff, Seymour Knowles-Barley, Alexander Lex, Deqing Sun, Tim Brenner (A/V)

About You

Outline
What? Why? Who? How?

CS109 Key Facets

data munging/scraping/sampling/cleaning in order to get an informative, manageable data set;
data storage and management in order to be able to access data - especially big data - quickly and reliably during subsequent analysis;
exploratory data analysis to generate hypotheses and intuition about the data;
prediction based on statistical tools such as regression, classification, and clustering; and
communication of results through visualization, stories, and interpretable summaries.

Act I: Predictions
Data Science Process
Data Types and Data Munging
Probability Review
Classification & Regression
Cross Validation, Clustering
Visualization & Story Telling

Act II: Recommendations

Bayesian Thinking & Computation
Monte Carlo Methods
Machine Learning Methods
MapReduce and Amazon's EC2
Databases (Margo Seltzer)

Act III: Network Analysis

Network Visualization
Network Sampling
Community Detection
Guest Lecture

Abstractions...

...and Tools

xkcd

Homework
Real-World focus
Scrape and wrangle messy data
Apply sophisticated statistical analysis
Visualize and communicate results
Election data, movie reviews, Yelp! data, etc.

Final Project
Pick a project of your choosing
Teams of up to 2 students
Process books, web sites, screencasts
IPython (exceptions possible)
Best project prizes!

cs109.org

Is this course for me ???

Prerequisites
Programming experience

C, C++, Java, Python, etc.

Basic statistical knowledge

STAT100, ideally STAT110

Willingness to learn new software & tools

This can be time consuming
You will need to read online documentation

Be Patient
Be Flexible
Be Constructive

http://davidzinger.wordpress.com/2007/05/page/2/

Next Steps

HW 0

Good test of your basic skills
Installation of several Python frameworks
Not graded, do it as soon as possible

Read syllabus carefully
Do readings

Post comments to Piazza using #readings
