Dr Athanasios Tsanas (‘Thanasis’)
Associate Prof. in Data Science
Usher Institute, Medical School
University of Edinburgh
Day 1 • Introduction and overview; reminder of basic concepts
Day 2 • Data collection and sampling
Day 3 • Data mining: signal/image processing and information extraction
Day 4 • Data visualization: density estimation, statistical descriptors
Day 5 • Exploratory analysis: hypothesis testing and quantifying relationships
Day 6 • Feature selection and feature transformation
Day 7 • Statistical machine learning and model validation
Day 8 • Statistical machine learning and model validation
Day 9 • Practical examples: bringing things together
Day 10 • Revision and exam preparation
Examples of raw data: ECG, EEG, activity, location

Data matrix X (N subjects × M features or characteristics):
Subjects  feature1  feature2  ...  featureM
P1        3.1       1.3       ...  0.9
P2        3.7       1.0       ...  1.3
P3        2.9       2.6       ...  0.6
…
PN        1.7       2.0       ...  0.7
Feature generation from raw data → Feature selection or transformation → Statistical mapping

Data matrix X (N subjects × M features or characteristics) and outcome vector y:
Subjects  feature1  feature2  ...  featureM  result
P1        3.1       1.3       ...  0.9       1
P2        3.7       1.0       ...  1.3       2
P3        2.9       2.6       ...  0.6       1
…         …
PN        1.7       2.0       ...  0.7       3

Depending on the problem, “features” can be demographics, genes, …
y = f(X), where f is the mechanism, X is the feature set, and y is the outcome
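As a minimal sketch (Python/NumPy assumed; not part of the slides), the feature matrix X and outcome vector y can be stored as arrays, using only the values shown above:

```python
import numpy as np

# Feature matrix X: rows are subjects, columns are features (only the values shown on the slide)
X = np.array([[3.1, 1.3, 0.9],
              [3.7, 1.0, 1.3],
              [2.9, 2.6, 0.6],
              [1.7, 2.0, 0.7]])

# Outcome vector y: one result per subject
y = np.array([1, 2, 1, 3])

print(X.shape, y.shape)   # (4, 3) (4,)
```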
Analysis pipeline: Data visualization (density estimation, scatter plots) → Exploratory analysis (hypothesis testing and statistical associations) → Feature selection or transformation (e.g. PCA) → Statistical mapping (regression/classification)
We will focus primarily on studying the properties of a single variable.
You can think of this as focusing on a single feature, i.e. one column in X.
We will subsequently also explore two variables visually using 2D plots.
Discrete variable: finite set of possible values
• Use histograms
Continuous variable: can typically take any value in a range
• Use probability density functions (e.g. kernel density estimation)
20 throws of a die:
3, 4, 4, 4, 1, 3, 4, 5, 1, 6, 6, 4, 5, 5, 3, 6, 5, 4, 4, 1
[Figure: histogram of scores for the 20 die throws; x-axis: score (1–6), y-axis: frequency]
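As a minimal sketch (Python assumed; not part of the slides), the histogram of a discrete variable is simply a count of each possible value:

```python
from collections import Counter

# The 20 die throws listed above
throws = [3, 4, 4, 4, 1, 3, 4, 5, 1, 6, 6, 4, 5, 5, 3, 6, 5, 4, 4, 1]

counts = Counter(throws)
for score in range(1, 7):
    print(score, "#" * counts[score])   # text histogram: frequency of each score
```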
Discretize the possible values into “bins”
[Figure: histogram of 1000 stock returns; x-axis: return, binned from −3 to 3 in steps of 0.5; y-axis: frequency]
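A sketch of the binning step, assuming NumPy and simulated data standing in for the 1000 stock returns (the actual returns are not given on the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
returns = rng.standard_normal(1000)      # hypothetical stand-in for the stock returns

# Discretize the continuous values into bins of width 0.5, as in the figure
bins = np.arange(-3, 3.5, 0.5)
counts, edges = np.histogram(returns, bins=bins)
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"[{left:4.1f}, {right:4.1f}): {count}")
```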
Probability Density Function (PDF)
[Figure: PDF of X ~ N(500, 10²), i.e. mean = 500, variance = 100, standard deviation = 10, computed using kernel density estimation; x-axis: possible values x (0–1000), y-axis: probability density p(x)]
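A sketch of the idea, assuming SciPy is available: draw samples from N(500, 10²), estimate the density with a Gaussian kernel density estimator, and compare with the true normal PDF:

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(0)
samples = rng.normal(loc=500, scale=10, size=5000)   # X ~ N(500, 10^2)

kde = gaussian_kde(samples)                          # Gaussian kernel density estimate
x = np.linspace(470, 530, 7)
print(kde(x))                                        # estimated density p(x)
print(norm.pdf(x, loc=500, scale=10))                # true N(500, 10^2) density for comparison
```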
Mean (average): $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
Median: rank the values and find the middle value
Standard deviation: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}$
Variance: $\mathrm{var}(X) = \sigma^2$
Interquartile range (IQR): 75th percentile − 25th percentile
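These descriptors map directly onto NumPy calls; a minimal sketch using the die-throw scores from the earlier slide:

```python
import numpy as np

x = np.array([3, 4, 4, 4, 1, 3, 4, 5, 1, 6, 6, 4, 5, 5, 3, 6, 5, 4, 4, 1])

print(np.mean(x))                                    # mean
print(np.median(x))                                  # median
print(np.std(x))                                     # standard deviation (1/N convention, as above)
print(np.var(x))                                     # variance
print(np.percentile(x, 75) - np.percentile(x, 25))   # interquartile range
```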
$E[X] = \mu_X = \int_{-\infty}^{\infty} x \, p(x)\, dx$
$\mathrm{Var}(X) = \sigma_X^2 = E\!\left[(X - E[X])^2\right]$
$\mathrm{Moment}_X^{(m)} = E\!\left[(X - E[X])^m\right]$
The expectation operator $E[\cdot]$ is computed from the possible values in $X$ multiplied by their probabilities.
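For a discrete variable the integral becomes a probability-weighted sum; a sketch for a fair six-sided die (my example, not from the slides):

```python
import numpy as np

values = np.arange(1, 7)      # possible values of a fair die
probs = np.full(6, 1 / 6)     # their probabilities

mean = np.sum(values * probs)                    # E[X] = 3.5
var = np.sum((values - mean) ** 2 * probs)       # Var(X) = E[(X - E[X])^2] ≈ 2.92
moment3 = np.sum((values - mean) ** 3 * probs)   # 3rd central moment = 0 (symmetry)
print(mean, var, moment3)
```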
Same information as the PDF, presented differently!
[Figure: cumulative probability distribution for stock returns, P(return < X); x-axis: return X (−1.5 to 2), y-axis: probability (0 to 1)]
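A sketch of an empirical CDF, assuming NumPy and simulated returns (hypothetical data, not the slide's):

```python
import numpy as np

rng = np.random.default_rng(0)
returns = rng.standard_normal(1000)       # hypothetical stand-in for the stock returns

# Empirical CDF: the fraction of observations below each threshold X
for X in (-1.5, -0.5, 0.0, 0.5, 1.5):
    print(X, np.mean(returns < X))        # estimate of P(return < X)
```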
Add noise to each observation (impose a kernel, typically a Gaussian kernel):
$\hat{p}(x_0) = \frac{1}{N\sqrt{2\pi\sigma^2}} \sum_{i=1}^{N} \exp\!\left(-\frac{(x_i - x_0)^2}{2\sigma^2}\right)$
$\sigma$ is the kernel bandwidth
$N$ is the number of samples
$x_0$ refers to the point where we estimate the density
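A direct, minimal implementation of this formula (NumPy assumed; the function name kde is my own):

```python
import numpy as np

def kde(x0, samples, sigma):
    """Gaussian kernel density estimate at point(s) x0 with bandwidth sigma."""
    x0 = np.atleast_1d(np.asarray(x0, dtype=float))[:, None]   # shape (num_points, 1)
    z = (samples[None, :] - x0) / sigma                        # scaled distances to every sample
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (samples.size * np.sqrt(2 * np.pi) * sigma)

rng = np.random.default_rng(0)
samples = rng.normal(500, 10, size=2000)              # samples from N(500, 10^2)
print(kde([480, 500, 520], samples, sigma=3.0))       # estimated densities at three points
```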
[Figure: computing a histogram and applying kernel density estimation to the same data. Image source: Wikipedia]
• There are many different approaches to computing the bandwidth (beyond this course)
• Increasing the kernel bandwidth $\sigma$ leads to a smoother estimated distribution (illustrated in the sketch below)
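To illustrate the smoothing effect, the sketch below reuses the kde() helper from the previous sketch on hypothetical bimodal data; a larger bandwidth fills in the trough between the two modes:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical bimodal data: two clusters centred at -2 and +2
data = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(2, 0.5, 500)])

# Estimated density at x = 0 (the trough between the modes) for increasing bandwidths
for sigma in (0.1, 0.5, 2.0):
    print(f"sigma = {sigma}: p_hat(0) = {kde(0.0, data, sigma)[0]:.4f}")
```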
You will notice that I have placed a lot of
emphasis on densities
These are important in their own right for
visualization, but more importantly…
Subsequent machine learning tools often
depend heavily on the density estimates
Boxplot (“box and whiskers”)
• Easy to understand
• Portrays outliers
[Figure: boxplot annotated with the median, the IQR (the box), and an outlier]
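A minimal boxplot sketch, assuming Matplotlib and hypothetical data with one injected outlier:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = np.append(rng.normal(0, 1, 100), 6.0)   # 100 points plus one clear outlier

plt.boxplot(data)        # box spans the IQR, the line inside marks the median,
plt.ylabel("value")      # points beyond the whiskers are drawn as outliers
plt.show()
```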
A two-dimensional plot to visualize how one variable is related to another.
Often complemented with the ‘best linear fit’ to assess whether there is a positive or negative relationship.
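A scatter plot with a best linear fit, sketched with NumPy/Matplotlib on hypothetical positively related data:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 * x + rng.normal(0, 2, 50)        # hypothetical positive relationship

slope, intercept = np.polyfit(x, y, 1)    # best linear (least-squares) fit
plt.scatter(x, y)
xs = np.sort(x)
plt.plot(xs, slope * xs + intercept, color="red")
plt.xlabel("x"); plt.ylabel("y")
plt.title(f"slope = {slope:.2f} (positive relationship)")
plt.show()
```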
H.J. Seltman: Experimental Design and Analysis (chapter 3: pp. 19-46)
http://www.stat.cmu.edu/~hseltman/309/Book/Book.pdf