A data-view of
the data science
process
Mathieu d’Aquin - @mdaquin
Data Science Institute
Insight Centre for Data Analytics
NUI Galway
Why am I talking to you about
?
Healthcare and medicine
IoT and Smart-cities
FinTech
Education and Learning
Digital humanities
Media and social media
Agritech
Environment and Sustainability
Government and public sector
Customer services
Entertainment / creative sector
As in Biology? Simplifying, the observation of naturally occurring phenomena and principles in relation to data?
As in Physics? Again simplifying, the theorisation and experimental verification of fundamental laws of data?
As in Social Sciences? Really simplifying, the investigation of the social, economic or cultural implications of data on individuals, groups and society?
Hypo. / Question → Plan → Collect data → Analyse data → Extract results → Exploit results
Produced along the process: Data; Models; New info; Whatever was the goal
The study of this process and its characteristics
The study of those things and their characteristics
[Diagram, built up over several slides]
Entities: Dataset, License, Regulation, Source, Dataset Characteristics, Data Science Task, Technique, Parameters, Model, ...
Relation labels: associated with, obtained from, with, derived from, used for, implemented by, using, produced, version of
Example: Describing a data process with ontologies (The Datanode ontology - E. Daga)
A vocabulary to describe the relationships between the input datasets, the intermediary data assets and the outputs of a data process.
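As a rough illustration of what such a description can look like, the sketch below records a small data process as subject-relation-object triples and walks back from an output to the inputs it depends on. The relation name `derived_from` and all node names are illustrative assumptions, not terms taken from the Datanode ontology itself.

```python
from collections import defaultdict

# Illustrative triples (subject, relation, object). All names are
# hypothetical and are not actual Datanode terms.
TRIPLES = [
    ("cleaned_data", "derived_from", "raw_data"),
    ("model", "derived_from", "cleaned_data"),
    ("model", "derived_from", "reference_data"),
    ("report", "derived_from", "model"),
]


def sources_of(node, triples, relation="derived_from"):
    """Return every node that `node` depends on, directly or transitively."""
    parents = defaultdict(set)
    for subj, rel, obj in triples:
        if rel == relation:
            parents[subj].add(obj)

    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in parents[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def inputs_of(node, triples, relation="derived_from"):
    """Return the ancestors of `node` that are not derived from anything."""
    derived = {subj for subj, rel, _ in triples if rel == relation}
    return {n for n in sources_of(node, triples, relation) if n not in derived}
```

Calling `inputs_of("report", TRIPLES)` returns the original inputs of the process, while `sources_of` also includes the intermediary assets.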
[Diagram: example data flow]
Inputs: Smart meter data, Solar panel monitoring, Weather data, Location data, Electricity tariff data
Steps: Anonymisation (twice, each yielding Anon data), analysis, prediction/recommendation
Outputs: Model, Results
[Same diagram, annotated with policies]
Policy labels: Data prot. (three times), Corp lic. 1, Corp lic. 2, Corp lic. 3, User T&C, OGL
?
Example: Machine-readable policies and inference rules for their propagation (E. Daga)
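To make the idea of propagation concrete, here is a minimal sketch, assuming policies are plain labels attached to input datasets and each relation in the data flow has a rule listing the policies it does not pass on. The policy names, the relation names and the rule that anonymisation lifts a data-protection constraint are all assumptions made for illustration, not the rules used in the work cited.

```python
# Hypothetical policies attached to input datasets.
POLICIES = {
    "smart_meter_data": {"data_protection", "no_commercial_use"},
    "weather_data": {"attribution"},
}

# Hypothetical data flow: (output, relation, input).
FLOW = [
    ("anon_data", "anonymised_from", "smart_meter_data"),
    ("model", "derived_from", "anon_data"),
    ("model", "derived_from", "weather_data"),
    ("results", "derived_from", "model"),
]

# Hypothetical inference rules: policies that a relation does NOT pass on.
# Anonymisation is assumed here to lift the data-protection constraint.
BLOCKED = {
    "anonymised_from": {"data_protection"},
    "derived_from": set(),
}


def propagate(policies, flow, blocked):
    """Apply the rules until no dataset gains a new policy (fixed point)."""
    result = {name: set(p) for name, p in policies.items()}
    changed = True
    while changed:
        changed = False
        for output, relation, source in flow:
            inherited = result.get(source, set()) - blocked.get(relation, set())
            current = result.setdefault(output, set())
            if not inherited <= current:
                current |= inherited
                changed = True
    return result
```

With these toy rules, `propagate(POLICIES, FLOW, BLOCKED)["results"]` gives the set of policies that reach the final results.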
Example: Studying large Data Science platforms (ongoing work - M. Adel)
Thousands of datasets used in thousands of data science processes.
Allows us to better understand the tasks of data science, how they occur, in what contexts…
As well as what characteristics of datasets lead to what use in data science processes.
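One simple way to start asking which dataset characteristics lead to which uses is to count how often each task co-occurs with each characteristic. The sketch below does that on a handful of made-up records; the characteristic and task labels are placeholders, not data from the platforms studied.

```python
from collections import Counter

# Made-up (dataset characteristic, task) observations standing in for what
# could be harvested from a platform. These are not real measurements.
OBSERVATIONS = [
    ("tabular", "classification"),
    ("tabular", "regression"),
    ("tabular", "classification"),
    ("image", "classification"),
    ("text", "clustering"),
]


def task_profile(observations):
    """Count, for each dataset characteristic, how often each task occurs."""
    profile = {}
    for characteristic, task in observations:
        profile.setdefault(characteristic, Counter())[task] += 1
    return profile


def most_common_task(observations, characteristic):
    """Return the most frequent task for a characteristic, or None if unseen."""
    counts = task_profile(observations).get(characteristic)
    return counts.most_common(1)[0][0] if counts else None
```

Raw counts like these only show association; any claim that a characteristic leads to a given use would need a more careful analysis.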
Data Ethics
Hypo. / Question → Plan → Collect data → Analyse data → Extract results → Exploit results
Where ethical implications are (might be) considered
Where they are important
Towards a methodology for Ethics by Design in Data Science (with P. Troullinou)
‘Ethics by Design’ for Data Science
Dialectic
The process is based on a conversational approach between data scientists and critical social scientists throughout the project’s life-cycle.
Reflective
Ethical concerns are not fixed in advance; they may emanate from any stage of the project. Constant reflexivity on activities and researchers is therefore needed.
Creative, not disruptive
The objective of this process is to achieve a positive impact on the research and to increase its value by addressing ethics throughout the project’s life-cycle.
All-encompassing
Ethical concerns appear as much in the research activities as in their outcomes, their use and exploitation; the process needs to extend to all stages.
Using science fiction to guide ethical thinking
[Chart with two axes]
Axis: Used/controlled by a small number of individuals ↔ Used/controlled by all
Axis: Used accurately according to intended purpose ↔ Hacked, biased, inaccurate
Episodes placed on the chart: S3E1: Nosedive; S3E5: Men Against Fire; S3E6: Hated in the Nation; S4E2: Arkangel; S4E3: Crocodile; S4E5: Metalhead; S3E2: Playtest; S2E1: Be Right Back; S1E3: The Entire History of You
Using science fiction to guide ethical thinking
Write scenarios or short stories based on the following four premises: in a near future, what I am developing/the results I will obtain will be...
Used as intended by millions/most people/many people
Used as intended by a small group with control/power
Abused, hacked, inaccurate or biased, while used by millions/most people/many people
Abused, hacked, inaccurate or biased, while used by a small group with control/power
What could possibly go wrong? (see Re-coding Black Mirror workshops)
Conclusion
Data Science has grown very quickly as a discipline, reaching huge economic and societal impact. And it is not stopping.
This is leading to the creation of a very large number of datasets, techniques, tools, models, approaches and methods that are driven by practices and applications in various domains.
The study of those artefacts is becoming critical in order to extract the fundamental principles that guide data science as a discipline and a process. Understanding those principles is essential to drive the impact of data science in an informed way.
Data science practice can support data science theory, but this is not a job for the data/computer scientist alone. It needs to be a conversation with social scientists, business experts, legal experts...
Mathieu d’Aquin
@mdaquin
mdaquin.net
mathieu.daquin@nuigalway.ie
