What is Data Science?
Data, Databases, and the Extraction of Knowledge
Renee T., @becomingdatasci, November 2014
Let's start with: What is Data?
http://upload.wikimedia.org/wikipedia/commons/f/f0/DARPA_Big_Data.jpg
http://fc01.deviantart.net/fs71/i/2012/326/3/4/cute_dog_by_thomasmeadows345-d5lsah9.jpg
https://encryptedtbn2.gstatic.com/images?q=tbn:ANd9GcS9dKu3_Tzi-sWWyAqee5y0EhuvoIZNSya_rAKnuBBd0JYxPX7pw
http://www.freefoto.com/images/1351/06/1351_06_2---Books-Shakespeare-and-Company-Bookstore--The-Latin-Quarter-Paris_web.jpg
http://upload.wikimedia.org/wikipedia/commons/9/96/Bill_Nye,_Barack_Obama_and_Neil_deGrasse_Tyson_selfie_2014.jpg
http://upload.wikimedia.org/wikipedia/commons/e/e4/Green_Bank_100m_diameter_Radio_Telescope.jpg
https://c2.staticflickr.com/4/3273/3017878633_65beb1c7d6.jpg
https://c1.staticflickr.com/1/2/1349370_0703fce74c.jpg
Around 100 hours of video are uploaded to YouTube every minute; it would take about 15 years to watch every video uploaded in one day.
AT&T is thought to hold the world's largest volume of data in one unique database: its phone records database is 312 terabytes in size and contains almost 2 trillion rows.
Every minute we send 204,000,000 emails, generate 1,800,000 Facebook likes, send 278,000 Tweets, and upload 200,000 photos to Facebook.
570 new websites spring into existence every minute of every day.
http://smartdatacollective.com/bernardmarr/277731/big-data-25-facts-everyone-needs-know
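A quick sanity check of the numbers above, as a minimal Python sketch. It assumes 365-day years, decimal terabytes (10^12 bytes), and exactly 2 trillion rows; the variable names are illustrative and do not come from the source article.

```python
# YouTube: 100 hours of video uploaded per minute (figure from the slide).
hours_uploaded_per_minute = 100
hours_uploaded_per_day = hours_uploaded_per_minute * 60 * 24

# Watching around the clock, how many years does one day of uploads take?
years_to_watch = hours_uploaded_per_day / 24 / 365

# AT&T: 312 terabytes spread over roughly 2 trillion rows.
database_bytes = 312 * 10**12   # assumes 1 TB = 10^12 bytes
rows = 2 * 10**12               # "almost 2 trillion", rounded up here
bytes_per_row = database_bytes / rows

# Email: 204,000,000 per minute, scaled up to a full day.
emails_per_day = 204_000_000 * 60 * 24

print(f"{hours_uploaded_per_day:,} hours uploaded per day")
print(f"{years_to_watch:.1f} years of nonstop watching")
print(f"{bytes_per_row:.0f} bytes per row on average")
print(f"{emails_per_day:,} emails per day")
```

Under these assumptions the YouTube figure comes out a little above 16 years, which fits "about 15 years" as a round number.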
http://pixabay.com/static/uploads/photo/2014/03/13/01/12/datacenter-286386_640.jpg
https://c2.staticflickr.com/2/1296/533233247_b6baa30fdb_z.jpg?zz=1
Video clip:
http://youtu.be/PBx7rgqeGG8?t=2m
https://c1.staticflickr.com/3/2300/2596366618_2d6cb01735.jpg
http://upload.wikimedia.org/wikipedia/commons/9/90/Kencf0618FacebookNetwork.jpg
http://upload.wikimedia.org/wikipedia/commons/b/bf/USDA_Hardiness_zone_map.jpg
http://upload.wikimedia.org/wikipedia/commons/1/1c/CMS_Higgs-event.jpg
What is a database?
Database
[dey-tuh-beys]
noun
A comprehensive collection of related data
organized for convenient access, generally in
a computer.
-dictionary.com
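To make the definition concrete, here is a minimal sketch of "related data organized for convenient access" using Python's built-in sqlite3 module. The table names, student names, and course titles are hypothetical and chosen only for illustration.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Two related tables: students, and the courses they registered for.
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE registrations ("
    "student_id INTEGER REFERENCES students(id), course TEXT NOT NULL)"
)

conn.executemany(
    "INSERT INTO students (id, name) VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)
conn.executemany(
    "INSERT INTO registrations (student_id, course) VALUES (?, ?)",
    [(1, "Intro to Databases"), (1, "Statistics"), (2, "Intro to Databases")],
)

# A query: how many courses is each student registered for?
rows = conn.execute(
    "SELECT s.name, COUNT(*) "
    "FROM students s JOIN registrations r ON r.student_id = s.id "
    "GROUP BY s.name ORDER BY s.name"
).fetchall()
conn.close()

print(rows)
```

The JOIN is what makes the data "related": the registrations table stores only a student id, and the query looks up the matching name.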
Types of Databases
http://www.oaddo.org
Databases You Use
Pretty much every website you interact with
Social Media
Banking
File Sharing
Search Engines
Online Shopping
Course Registration/Canvas
Travel
Etc., etc., etc.
You broadcast/generate data everywhere you go
Cell phones
Purchases
Driving (GPS)
Streaming music
Email
Posting status updates
Attending events
Etc., etc., etc.
https://www.google.com/maps/@38.8905569,-77.1721577,13z/data=!5m1!1e1
http://upload.wikimedia.org/wikipedia/commons/6/69/Netflix_logo.svg
https://c2.staticflickr.com/4/3324/3507973704_563846fe14_z.jpg?zz=1
How is data
collected about you
used to help you?
Who builds these systems?
Data Scientist
Computer Scientist
Data collection systems
Machine Learning
Algorithms
Interface Design
Design/Manage/Query Databases
Data Aggregation
Data Mining
Mathematician
Statistical Models
Evaluation Metrics
Predictive Analytics
Data Visualizations
Business Person
Domain Expertise
Knowing what questions to ask
Interpreting results for business decisions
Presenting outcomes
Examples only: not a complete definition, and not all of these skills are necessary simultaneously
Data Science Venn Diagram by Drew Conway
http://static.squarespace.com/static/5150aec6e4b0e340ec52710a/t/51525c33e4b0b3e0d10f77ab/1364352052403/Data_Science_VD.png?format=750w
From Doing Data Science by Cathy O'Neil & Rachel Schutt
http://www.becomingadatascientist.com/wpcontent/uploads/2014/06/DS_profile.png
http://semanticommunity.info/@api/deki/files/27057/Figure14.png?size=bestfit&width=484&height=541&revision=1
No need to be a unicorn, but you do need to know something about all of these areas, and become an expert in some
(Sound familiar, ISAT students?)
Some other names for Data Scientist
Statistician
Pythonista
Data Mining Specialist
Financial Analyst
Biostatistician
Recommendation System
Social Science Researcher
Big Data Analyst
Information Architect
Spatial/GIS Analyst
Artificial Intelligence
Natural Language
Programmer
Computational Physicist
Engineer
Researcher
Neuroscientist
Data Visualization Designer
Data Science jobs pay an
average of $118,000 per year
It is estimated that by 2018, the US could have a shortage of 140,000+ people with advanced analytical skills and need 1.5M managers/analysts who can make decisions based on data analysis
Extraction of Knowledge
Also known as knowledge discovery
Goes beyond queries
Data Mining
Business Understanding
Data Understanding
Data Preparation
Modeling
Clustering
Classification
Regression
Evaluation
From Data Science for
Business by Provost & Fawcett
Images from ODU ECE 607 Lecture Slides by Prof. Jiang Li
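The three modeling tasks listed above (clustering, classification, regression) and the evaluation step can be sketched with scikit-learn. This is a minimal illustration on synthetic data, not an example from the book; the data, model choices, and variable names are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: two well-separated groups of 2-D points.
X, y = make_blobs(
    n_samples=200, centers=[[-5, -5], [5, 5]], cluster_std=1.0, random_state=0
)

# Clustering: find groups WITHOUT looking at the labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Classification: learn to predict a known label.
# Evaluation: score the model on data it did not see during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
classifier = LogisticRegression().fit(X_train, y_train)
accuracy = classifier.score(X_test, y_test)

# Regression: predict a number rather than a category.
rng = np.random.default_rng(0)
x_num = rng.uniform(0, 10, size=(100, 1))
target = 3.0 * x_num[:, 0] + 2.0 + rng.normal(0, 0.5, size=100)
regressor = LinearRegression().fit(x_num, target)
slope = regressor.coef_[0]

print("held-out classification accuracy:", accuracy)
print("fitted regression slope (true value 3.0):", round(slope, 2))
```

Cluster ids are arbitrary (cluster 0 may correspond to either group), so clustering output is compared to known labels only up to relabeling.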
Video clip: Interview with Neha Kothari, LinkedIn Data Scientist
http://youtu.be/8dxKe5cGHdA?t=17s
Data Science Example
Kaggle competition hosted by UPenn and Mayo Clinic to
detect seizures in intracranial EEG recordings
https://www.kaggle.com/c/seizure-detection
Current detection systems have a high false-positive rate, resulting in unnecessary stimulation
Need to rapidly and automatically detect the onset of a seizure
Data provided
Matrix of EEG sample values
Time duration latency (time before seizure)
Sampling frequency
Channels (electrodes)
Human and Canine Data
Latency is only provided in the training data because, with real-life data, you won't know whether or when a seizure occurs; that's what you're trying to predict
This is an important point in predictive analytics!
Competition winner Michael Hills published his method
FFT = Fast Fourier Transform: determines the primary frequencies in an EEG sample
http://en.wikipedia.org/wiki/Fast_Fourier_transform
Correlation coefficient r
Eigenvalues: can think of an eigenvalue as a scaling factor
Put all these values into a Random Forest classifier
Ensemble learning method: combines the results of many weak decision trees, which turns out to be a better classifier than one strong decision tree
Can now train a classifier for each patient
He wrote a computer program to help him experiment and quickly validate the result of each brute-force approach, trying every technique he could find
Used the same evaluation technique the Kaggle competition would use
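The three kinds of features named above (FFT, correlation coefficients, eigenvalues) can be illustrated with NumPy on a made-up signal. This is only a sketch of the ideas, not Michael Hills's actual pipeline: the 400 Hz sampling rate, the three synthetic "channels", and the way the features are combined are all assumptions.

```python
import numpy as np

fs = 400                     # assumed sampling frequency in Hz
t = np.arange(fs) / fs       # one second of samples
rng = np.random.default_rng(0)

# Three synthetic channels: two share a 10 Hz rhythm, one is pure noise.
base = np.sin(2 * np.pi * 10 * t)
channels = np.vstack([
    base + 0.1 * rng.normal(size=t.size),
    0.8 * base + 0.1 * rng.normal(size=t.size),
    rng.normal(size=t.size),
])

# FFT: which frequency carries the most energy in channel 0?
spectrum = np.abs(np.fft.rfft(channels[0]))
freqs = np.fft.rfftfreq(t.size, d=1 / fs)
dominant_hz = freqs[np.argmax(spectrum[1:]) + 1]   # skip the 0 Hz (DC) bin

# Correlation coefficient r between every pair of channels (3 x 3 matrix).
corr = np.corrcoef(channels)

# Eigenvalues of the (symmetric) correlation matrix, in ascending order.
eigenvalues = np.linalg.eigvalsh(corr)

# One flat feature vector per clip, ready to hand to a classifier.
pairwise_r = corr[np.triu_indices(3, k=1)]
features = np.concatenate([[dominant_hz], pairwise_r, eigenvalues])

print("dominant frequency (Hz):", dominant_hz)
print("r between channels 0 and 1:", round(corr[0, 1], 3))
print("eigenvalues:", np.round(eigenvalues, 3))
```

The two channels that share the 10 Hz rhythm are strongly correlated, and that shared structure shows up as one large eigenvalue.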
Line of scikit-learn Python code for training winning submission:
RandomForestClassifier(n_estimators=3000, min_samples_split=1,
bootstrap=False, random_state=0)
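For context, here is a minimal sketch of how a classifier configured like the line above is trained and used. It runs on synthetic data, not the EEG features. Two settings are changed and both are assumptions made for this sketch: n_estimators is reduced to 100 so it runs quickly, and min_samples_split is set to 2 because recent scikit-learn releases reject a value of 1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for per-clip feature vectors (1 = seizure, 0 = not).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, class_sep=2.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Same style of configuration as the slide, with the two changes noted above.
forest = RandomForestClassifier(
    n_estimators=100, min_samples_split=2, bootstrap=False, random_state=0
)
forest.fit(X_train, y_train)

# The competition asked for probabilities, not hard yes/no labels.
seizure_probability = forest.predict_proba(X_test)[:, 1]
accuracy = forest.score(X_test, y_test)

print("first five predicted probabilities:", seizure_probability[:5])
print("held-out accuracy:", accuracy)
```

Each probability is the average vote of the individual trees, which is why an ensemble gives a smoother score than a single tree.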
Kaggle's evaluation method:
Judged on the mean area under the ROC curve (AUC) of two predictions.
ROC (Receiver Operating Characteristic) = true positive rate vs. false positive rate.
1) Predict the probability that a given clip is a seizure.
2) Predict the probability that the clip is within the first 15 seconds of its respective seizure (the technical term for time into the seizure is "latency").
The competition metric is the mean of these two AUCs: (AUC for prediction 1 + AUC for prediction 2) / 2
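Michael Hills's winning submission scored 0.963
AUC is a ranking measure, not an accuracy: averaged over the two predictions, his model scores a randomly chosen positive clip above a randomly chosen negative clip about 963 times out of 1,000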
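To show what the metric measures, here is a small NumPy sketch that computes AUC as the fraction of (positive, negative) pairs the model ranks in the right order, then averages two AUCs as the competition did. The labels and scores are tiny made-up examples, and the function name auc is just a label for this sketch.

```python
import numpy as np

def auc(labels, scores):
    """Area under the ROC curve via pairwise ranking.

    Equals the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one
    (ties count as half).
    """
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]          # every positive vs every negative
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))

# Prediction 1: is the clip a seizure? (toy data)
seizure_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

# Prediction 2: is the clip within the first 15 seconds? (toy data)
early_auc = auc([0, 1, 0, 1], [0.2, 0.9, 0.3, 0.6])

competition_metric = (seizure_auc + early_auc) / 2

print(seizure_auc, early_auc, competition_metric)   # 0.75 1.0 0.875
```

In the first toy task, one of the four positive/negative pairs is ranked the wrong way round (0.35 vs 0.4), so the AUC is 3/4.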
He won $5000 (much less than UPenn/Mayo would have had to pay a
Data Scientist to develop this as an employee or consultant!)
Currently, another similar contest is posted with a $25,000 prize
My Machine Learning project
Using JMU first-time donor (and non-donor) data from two previous years, could
I classify who was likely to become a donor for the first time during the next year?
Correctly classified 67% of first-time donors, got great feedback from my professor, and plan to continue the study for my master's program final project.
You can read all about it on my blog! BecomingADataScientist.com
Code snippet using Random Forest Classifier
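One way to read "correctly classified 67% of first-time donors" is as recall: of the people who really became donors, the share the model flagged. The sketch below computes recall, precision, and accuracy on a handful of invented labels. These are not the JMU data, only a toy example chosen so that recall comes out to two thirds.

```python
# 1 = became a first-time donor, 0 = did not (invented labels).
actual    = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]

pairs = list(zip(actual, predicted))
true_pos  = sum(1 for a, p in pairs if a == 1 and p == 1)
false_neg = sum(1 for a, p in pairs if a == 1 and p == 0)
false_pos = sum(1 for a, p in pairs if a == 0 and p == 1)
true_neg  = sum(1 for a, p in pairs if a == 0 and p == 0)

recall = true_pos / (true_pos + false_neg)      # donors we caught
precision = true_pos / (true_pos + false_pos)   # flagged people who donated
accuracy = (true_pos + true_neg) / len(actual)  # all correct calls

print(f"recall={recall:.2f} precision={precision:.2f} accuracy={accuracy:.2f}")
```

The three numbers answer different questions, so it helps to say which one a reported percentage refers to.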
Other Examples
Galaxy Classification using Convolutional
Neural Networks
http://benanne.github.io/2014/04/05/galaxy-zoo.html
Choosing Facebook Audience for Content
Promotion using Random Forests
http://citizennet.com/blog/2012/11/10/random-forestsensembles-and-performance-metrics/
Predicting Wine Quality with Principal
Component Analysis
http://fastml.com/predicting-wine-quality/
Readmission Risk Score to decide which
patients to give additional follow-up help at
Mt. Sinai hospital
http://www.technologyreview.com/news/518916/ahospital-takes-its-own-big-data-medicine/
Data Visualization Example
http://labs.strava.com/heatmap/#12/-78.90549/38.44669/blue/bike
http://xkcd.com/1425/
How to get started
Recommended skills to pick up while at JMU
Programming
Any language is good to
start with. Gain core
understanding.
Python or R data analysis
experience a plus
Database design, SQL
Math
Calculus
Linear Algebra
Statistics (2 levels)
Advanced: Optimization /
Linear Programming
Research and Analysis
Science involving data
collection and interpretation
Working with messy real
life data
Business Analytics
Data Mining
Others
Business / Communication
Graphic Design
Take classes on campus or online!
Read, read, read
Doing Data Science by Cathy O'Neil* & Rachel Schutt
Data Science for Business by Foster Provost & Tom Fawcett
Data Smart by John Foreman* (uses Excel)
I'll review other books as I read them:
http://www.becomingadatascientist.com/learning/
Blogs & News Feeds (FlowingData.com is a good one to start with)
Twitter: look for curated lists of people to follow
https://twitter.com/BecomingDataSci/lists/women-in-datascience/members
*on Twitter and
willing to chat!
Free Online Courses
Python Fundamentals (Codecademy): http://www.codecademy.com/tracks/python
Machine Learning (Coursera / Stanford): https://www.coursera.org/course/ml
Data Analyst Nanodegree (Udacity): https://www.udacity.com/course/nd002
(includes Hadoop mini-course)
Applied Data Mining and Statistical Learning (Penn State): https://onlinecourses.science.psu.edu/stat857/
Pretty comprehensive list here: http://www.kdnuggets.com/education/online.html
TED talks on Data
http://www.ted.com/search?q=data
Susan Etlinger* http://www.ted.com/talks/susan_etlinger_what_do_we_do_with_all_this_big_data
"Need to spend more time on critical thinking skills ... [because we have the] potential to make bad decisions far more quickly, efficiently, and with far greater impact than we did in the past."
"... we need to be clear about ... the methodologies that we use, because if I don't know what questions you asked, I don't know what questions you didn't ask."
Explore
Volunteer to Analyze Data (DataKind)
Play with public data sets
http://101.datascience.community/2014/10/17/data-sources-for-cool-datascience-projects-part-1-guest-post/
https://www.opensciencedatacloud.org/publicdata/
http://catalog.data.gov/dataset
https://archive.ics.uci.edu/ml/datasets.html?format=&task=clu&att=&area=&numAtt=&numIns=&type=&sort=nameUp&view=table
Data Science Competitions
(Kaggle also has knowledge competitions for learning)
What some of my followers on Twitter wish
they knew about data in college.
Questions?
Renee T.
[contact me via twitter or blog for email]
@becomingdatasci
http://www.becomingadatascientist.com