INF442 — Algorithms for Data Analysis in C++
Introduction to Data Science
— Steve Oudot
The big data era
Key figures:
I size of the ‘global data sphere’ (the sum of all data created, captured, or replicated in the World): 2.75 ZB (2012) → 33 ZB (2018) → 175 ZB (2025, predicted), where 1 ZB = 10^21 bytes
— source: International Data Corporation
I correlated with the World’s total storage capacity
— data centers and cloud (45%-55% in 2025)
I exponential growth (+30% per year)
— expected to be sustained in the long run
I only a small fraction of the data is processed/analyzed
— shortage of trained data scientists
Data production
Data are produced at an unprecedented rate by:
I Industry / Economy
I Sciences
I End users
Challenges
I Complex data (non-linear, sparse, high-dimensional)
I Corrupted data (noise, outliers, missing values)
I Big data (streamed, online, distributed)
Data science’s celebrated successes...
AI for games:
1997: IBM’s Deep Blue wins chess match
against world champion G. Kasparov
2016: DeepMind’s AlphaGo wins Go match
against 18-time world champion Lee Sedol
2019: DeepMind’s AlphaStar beats
Starcraft II professional players
Data science’s celebrated successes...
ImageNet Challenge:
I database of 40 · 10^6+ images, structured into 20 · 10^3+ categories
I images collected on the Internet
I annotation process crowdsourced
to Amazon Mechanical Turk
[J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei: ImageNet: A Large-Scale Hierarchical Image Database, CVPR 2009]
Data science’s celebrated successes...
ImageNet Challenge:
I annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
I until 2011, classification error rates around 25%
I 2012: breakthrough: a deep CNN (AlexNet), with more than 60 million parameters to tune, reduced the error rate to 16%
I by now: error rates typically below 5%, with better-than-human performance on narrow tasks (one-against-all classification, e.g. recognizing cats, cars, etc.)
I unsupervised pre-training (using auto-encoders as feature generators to be plugged into the network) leads to concept learning (e.g. human face, cat face)
[Krizhevsky, Sutskever, Hinton: ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012]
[Le et al.: Building high-level features using large scale unsupervised learning, ICML 2012]
Data science’s celebrated successes...
Healthcare data:
2011: a new subgroup of breast cancers
with excellent survival rates is discovered
using exploratory data analysis techniques
[Nicolau et al.: Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile
and excellent survival, PNAS 2011]
... and notorious failures
Microsoft’s Tay:
I AI-powered chat bot, launched on Twitter (@TayandYou) on March 23, 2016
I learned from its interactions with people
I shut down only 16 hours after launch
I produced inflammatory, offensive (racist, sexually-charged) tweets
I training overrun by trolls
I numerous questions raised (technical, legal, ethical)
... and notorious failures
Some notorious AI failures in 2018:
I Google Photos’ AI feature confuses a skier and a mountain
I Amazon’s AI recruiting tool shown to be gender-biased
I Uber’s self-driving car kills a pedestrian in Arizona
What is data science?
Aim: develop tools to store, manipulate, and analyze / extract knowledge from data
word cloud of paper titles at NIPS 2016
(source: http://www.kaggle.com/benhamner/nips-papers)
What is data science?
Core topics:
I statistical analysis
I machine learning
I pattern recognition
I data mining
I optimization (convex / combinatorial)
I database management and distributed systems
I high-performance computing (streaming, distributed, cloud)
Data?
Datum ≡ observation ≡ ”chunk of information”
(Note: here we discuss the data representations taken as input by most algorithms.)
Vector representation:
I observations x1, . . . , xn stored as the rows of an n × d coordinate matrix, with one column per variable v1, . . . , vd
I categorical variables: 1, 2, . . . , K (arbitrary labels, e.g. ”cat”, ”dog”, ”horse”)
I continuous variables: real or complex values (e.g. temperature, pressure, geographic coordinate, income, amplitude/phase)
Metric representation:
I observations x1, . . . , xn described by an n × n distance / (dis-)similarity matrix
I distances: Euclidean, Hamming, geodesic, diffusion, edit, Jaccard, Wasserstein, etc.
I (dis-)similarity measures: cosine, Kullback-Leibler, Bregman divergences, etc.
I raw data that do not come as points of R^d can be mapped there via feature extraction
Programming languages for data science
I Databases / data manipulation: Structured Query Language (SQL)
(Note: all other modern query languages are built on SQL; e.g. QBE is in fact just a front-end.)
I Data analysis: Python (CS) / R (stats)
(Note: what is taught is (1) the principles of each approach and (2) how to apply it in Python.)
I Effective data processing: C / C++ / CUDA (GPGPU)
[...]
Learning paradigms
Supervised learning
Input: data with labels (examples)
Goal: predict the labels of new data
Typical problems:
I classification (categorical labels, e.g. ”cat” / ”dog” / ”horse”)
I regression (continuous labels)
I forecasting (regression on time series, e.g. predicting energy consumption from weather parameters)
Learning paradigms
Unsupervised learning
Input: data without labels
Goal: identify patterns, correlations
Typical problems:
I clustering
I dimensionality reduction
I anomaly detection / noise removal
Learning paradigms
Unsupervised learning
Semi-supervised learning (only a fraction of the input data has labels)
Supervised learning
Learning paradigms
Reinforcement learning
Input: a Markov decision process:
I agent & environment states, vis. rules, actions, transition probabilities, rewards
Goal: find a policy that minimizes the regret (the expected loss of total reward throughout the process, compared to that of the optimal strategy, hence penalizing every mistake)
Typical problems:
I exploration vs. exploitation (e.g. multi-armed bandit)
(Note: typically, problems where the exploration vs. exploitation dilemma appears can be modelled as reinforcement learning.)
I control
[Géron 2017]
Learning paradigms
(source: NVIDIA)
Course outline
• Lecture 1: introduction to data science / C++ as C (1/2)
• Lecture 2: Nearest-Neighbor search / C++ as C (2/2)
Unsupervised learning:
• Lecture 3: k-means clustering / classes (1/2)
• Lecture 4: hierarchical clustering / classes (2/2)
• Lecture 5: density estimation and noise removal / inheritance
Supervised learning:
• Lecture 6: k-NN classifier / genericity
• Lecture 7: linear models for regression / STL
• Lecture 8: linear models for classification / -
• Lecture 9: artificial neural networks / C++11
• Lecture 10: feature extraction and dimensionality reduction / -