INF442 — Algorithms for Data Analysis in C++
Introduction to Data Science
— Steve Oudot
The big data era
Key figures:
I size of the ‘global data sphere’ (the sum of all data created, captured, or replicated in the World): 2.75 ZB (2012) → 33 ZB (2018) → 175 ZB (2025, predicted), where 1 ZB = 10^21 bytes
— source: International Data Corporation
I correlated with the World’s total storage capacity
— data centers and cloud (45%-55% in 2025)
I exponential growth (+30% per year)
— expected to be sustained in the long run
I only a small fraction of the data is processed/analyzed
— shortage of trained data scientists
Data production
Data are produced at an unprecedented rate by:
I Industry / Economy
I Sciences
I End users
Challenges
I Complex data (non-linear, sparse, high-dimensional)
I Corrupted data (noise, outliers, missing values)
I Big data (streamed, online, distributed)
Data science’s celebrated successes...
AI for games:
1997: IBM’s Deep Blue wins chess match
against world champion G. Kasparov
2016: DeepMind’s AlphaGo wins Go match
against 18-time world champion Lee Sedol
2019: DeepMind’s AlphaStar beats
Starcraft II professional players
Data science’s celebrated successes...
ImageNet Challenge:
I database of 40 · 10^6+ images, structured into 20 · 10^3+ categories
I images collected on the Internet
I annotation process crowdsourced
to Amazon Mechanical Turk
[J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei: ImageNet: A Large-Scale Hierarchical Image Database, CVPR 2009]
Data science’s celebrated successes...
ImageNet Challenge:
I annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
I until 2011, classification error rates around 25%
I 2012: breakthrough: a deep CNN (AlexNet), with more than 60 million parameters to tune, reduced the error rate to 16%
I by now: error rates typically below 5%, with better-than-human performance on narrow tasks (one-against-all classification, e.g. recognizing cats, cars, etc.)
I unsupervised pre-training (using auto-encoders as feature generators to be plugged into the network) leads to concept learning (e.g. human face, cat face)
[Krizhevsky, Sutskever, Hinton: ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012]
[Le et al.: Building high-level features using large scale unsupervised learning, ICML 2012]
Data science’s celebrated successes...
Healthcare data:
2011: a new subgroup of breast cancers
with excellent survival rates is discovered
using exploratory data analysis techniques
[Nicolau et al.: Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile
and excellent survival, PNAS 2011]
... and notorious failures
Microsoft’s Tay:
I AI-powered chat bot, launched on Twitter (@TayandYou) on March 23, 2016
I learned from its interactions with people
I shut down only 16 hours after launch
I produced inflammatory, offensive (racist, sexually-charged) tweets
I training overrun by trolls
I numerous questions raised (technical, legal, ethical)
... and notorious failures
Some notorious AI failures in 2018:
I Google Photos’ AI feature confuses a skier and a mountain
I Amazon’s AI recruiting tool shown to be gender-biased
I Uber’s self-driving car kills a pedestrian in Arizona
What is data science?
Aim: develop tools to store, manipulate, and analyze / extract knowledge from data
word cloud of paper titles at NIPS 2016
(source: http://www.kaggle.com/benhamner/nips-papers)
What is data science?
Core topics:
I statistical analysis
I machine learning
I pattern recognition
I data mining
I optimization (convex / combinatorial)
I database management and distributed systems
I high-performance computing (streaming, distributed, cloud)
Data?
Datum ≡ observation ≡ ”chunk of information”
(Note: here we discuss the data representations taken as input by most algorithms.)
Vector representation:
I observations x1, . . . , xn stored as the rows of an n × d coordinate matrix, with one column per variable v1, . . . , vd
I categorical variables: 1, 2, . . . , K (arbitrary labels, e.g. ”cat”, ”dog”, ”horse”)
I continuous variables: real or complex values (e.g. temperature, pressure, geographic coordinate, income, amplitude/phase)
Metric representation:
I observations x1, . . . , xn described by an n × n distance / (dis-)similarity matrix
I distances: Euclidean, Hamming, geodesic, diffusion, edit, Jaccard, Wasserstein, etc.
I (dis-)similarity measures: cosine, Kullback-Leibler, Bregman divergences, etc.
I raw data that do not come as points of R^d can be mapped there via feature extraction
Programming languages for data science
I Databases / data manipulation: Structured Query Language (SQL)
(Note: all other modern query languages are built on SQL; e.g. QBE is in fact just a front-end.)
I Data analysis: Python (CS) / R (stats)
(Note: what is taught is (1) the principles of each approach and (2) how to apply it in Python.)
I Effective data processing: C / C++ / CUDA (GPGPU)
[...]
Learning paradigms
Supervised learning
Input: data with labels (examples)
Goal: predict the labels of new data
Typical problems:
I classification (categorical labels, e.g. ”cat” / ”dog” / ”horse”)
I regression (continuous labels)
I forecasting (regression on time series, e.g. predicting energy consumption from weather parameters)
Learning paradigms
Unsupervised learning
Input: data without labels
Goal: identify patterns, correlations
Typical problems:
I clustering
I dimensionality reduction
I anomaly detection / noise removal
Learning paradigms
Unsupervised learning
Semi-supervised learning (only a fraction of the input data has labels)
Supervised learning
Learning paradigms
Reinforcement learning
Input: a Markov decision process:
I agent & environment states, vis. rules, actions, transition probabilities, rewards
Goal: find a policy that minimizes the regret (the expected loss of total reward throughout the process, compared to that of the optimal strategy, hence penalizing every mistake)
Typical problems:
I exploration vs. exploitation (e.g. multi-armed bandit)
(Note: typically, problems where the exploration vs. exploitation dilemma appears can be modelled as reinforcement learning.)
I control
[Géron 2017]
Learning paradigms
(source: NVIDIA)
Course outline
• Lecture 1: introduction to data science / C++ as C (1/2)
• Lecture 2: Nearest-Neighbor search / C++ as C (2/2)
Unsupervised learning:
• Lecture 3: k-means clustering / classes (1/2)
• Lecture 4: hierarchical clustering / classes (2/2)
• Lecture 5: density estimation and noise removal / inheritance
Supervised learning:
• Lecture 6: k-NN classifier / genericity
• Lecture 7: linear models for regression / STL
• Lecture 8: linear models for classification / -
• Lecture 9: artificial neural networks / C++11
• Lecture 10: feature extraction and dimensionality reduction / -