BIOM4025 - Statistical Modelling - Q&A session 2
Data and distributions
Erik Postma / Centre for Ecology and Conservation / University of Exeter
Today
Data and distributions
Different types of data
Mean, median and mode
Variance, standard deviation and standard error
The normal distribution
Probability density functions
Probabilistic statements about data and estimates derived from these
data
Standard normal (or 𝑧) distribution
𝑡-distribution
Questions about the lecture
"Should we use R or RStudio?"
Both!
‘R’ does the calculations
‘RStudio’ makes ‘R’ easier to use
Questions about the lecture
"I can’t seem to get the links (https://shiny01.cles.ex.ac.uk/biom4025/app_02_1_1/)
to work. I’ve tried Safari and Google Chrome but no luck unfortunately.
Not sure if it’s just me?"
Sorry, I made a typo in the URL. Should be fixed now!
Next time, post questions like this in the Questions about the module
channel, where I will see them earlier.
Questions about the lecture
"If we were to write a scientific paper, would we do stuff like this in the
analysis or is it just for us to understand the principles of stats?"
See Practicals 2-5 for examples of what to write in a paper.
Means, variances, standard deviations, standard errors and confidence
intervals are all commonly reported.
Degrees of freedom, 𝑧- and 𝑡-values are central to most statistical tests
as they will provide you with the p-value. More in Lecture 3!
Questions about the lecture
"While explaining the n - 1 part of the equation for variation you say ‘This
is because we have first estimated the mean from our data’. You make
reference to it again saying ‘We lose one degree of freedom because we
have estimated the mean from the data’. I didn’t quite understand what that
meant."
"Could you please explain the concept of variance and in particular the
degrees of freedom again?"
"Please could you further explain why we subtract 1 from the sample size
in variation."
Variance
$$\sigma_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
The mean squared deviation from the estimated mean is always smaller
than (or at best equal to) the mean squared deviation from the true mean,
because the estimated mean is itself calculated from the same data
Our estimate of the true mean will explain some of the variance around
the true mean
By dividing by 𝑛 − 1 we account for the fact that we estimate the mean
from our data and we don’t use the (unknown) true mean
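The point above can be checked with a small simulation. The sketch below is written in Python rather than R purely for illustration, and the population mean, standard deviation, sample size and number of repetitions are assumed values that do not come from the lecture. It compares the squared deviations around the estimated mean with those around the true mean, and shows that dividing by 𝑛 gives variance estimates that are too small on average, whereas dividing by 𝑛 − 1 does not.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

TRUE_MEAN = 0.0  # assumed population mean (known here only because we simulate)
TRUE_SD = 3.0    # assumed population standard deviation, i.e. a true variance of 9
N = 5            # a small sample size makes the effect easy to see
REPS = 20000     # number of simulated samples

divide_by_n = []
divide_by_n_minus_1 = []
never_larger = True

for _ in range(REPS):
    x = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    x_bar = sum(x) / N
    # Sum of squared deviations around the estimated and around the true mean
    ss_estimated = sum((xi - x_bar) ** 2 for xi in x)
    ss_true = sum((xi - TRUE_MEAN) ** 2 for xi in x)
    if ss_estimated > ss_true + 1e-9:  # small tolerance for rounding error
        never_larger = False
    divide_by_n.append(ss_estimated / N)
    divide_by_n_minus_1.append(ss_estimated / (N - 1))

mean_var_n = sum(divide_by_n) / REPS
mean_var_n_minus_1 = sum(divide_by_n_minus_1) / REPS

print("Deviations around estimated mean never larger:", never_larger)
print("Average variance estimate, dividing by n:    ", round(mean_var_n, 2))
print("Average variance estimate, dividing by n - 1:", round(mean_var_n_minus_1, 2))
```

With these assumed settings the average of the 𝑛-based estimates should fall clearly below the true variance of 9, while the average of the (𝑛 − 1)-based estimates should be close to it.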
Degrees of freedom
The number of independent values that can vary freely
For example:
5 values: 6, 4, 5, 2, 3
$$\text{Mean} = \frac{6 + 4 + 5 + 2 + 3}{5} = \frac{20}{5} = 4$$
If you know four out of five values and the mean, you know the fifth
value
Every parameter we estimate from our data constrains the value of an
observation
Degrees of freedom (d.f.) is sample size minus number of parameters
estimated from the data
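The five-value example can be written out as a short calculation. This is a minimal sketch in Python (used here only for illustration; the variable names are made up): once the mean is fixed, four known values determine the fifth.

```python
known_values = [6, 4, 5, 2]  # four of the five values from the example
n = 5                        # sample size
mean = 4                     # the mean estimated from all five values

# The mean fixes the total (n * mean), so the last value is not free to vary
missing_value = n * mean - sum(known_values)
print(missing_value)  # 3

# One parameter (the mean) was estimated, so one degree of freedom is lost
degrees_of_freedom = n - 1
print(degrees_of_freedom)  # 4
```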
Questions about the lecture
"Could you explain the variance histogram?"
"When to use standard error?"
"Can you talk more about how and when standard deviation or standard
error should be applied to data and on graphs? Can you go over confidence
interval of the mean again?"
Questions about the lecture
"I am having a hard time understanding the difference between standard
deviation and standard error - do you mind going over them again?"
"Can you go over when you would use standard error when reporting
data and when you would use standard deviation?"
"Why do we use sample size as the denominator for standard error rather
than degrees of freedom?"
Estimating the mean
[Interactive app: sliders for sample size (1 to 100, shown at 30) and number of repetitions (1 to 1,000, shown at 500); options to add samples one at a time or to draw a new sample; error bars: none]
Standard deviation
$$\text{standard deviation (or } \sigma\text{)} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} = \sqrt{\text{variance (or } \sigma^2\text{)}}$$
Standard deviation: Measure of the amount of variation among
individuals in a sample
Standard error
$$\mathrm{SE}_{\bar{x}} = \frac{SD}{\sqrt{n}}$$
Standard error: Measure of the uncertainty around an estimate
If we were to repeat our experiment many times, the standard error
would be the standard deviation of our estimates
In practice we have just a single estimate, so we infer the standard error
from the variation in our data (in the case of the mean, from the
standard deviation)
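The "repeat our experiment many times" idea can be simulated directly. The sketch below is in Python for illustration only, and the population mean (10), population standard deviation (2), sample size (30) and number of repetitions (5,000) are assumptions, not values from the lecture. It compares the standard deviation of many estimated means with the standard error inferred from a single sample as SD / √𝑛.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

POP_MEAN = 10.0  # assumed population mean
POP_SD = 2.0     # assumed population standard deviation
N = 30           # sample size of each "experiment"
REPS = 5000      # number of repeated experiments

# Repeat the experiment many times and keep the estimated mean each time
means = []
for _ in range(REPS):
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(N)]
    means.append(sum(sample) / N)

# The standard deviation of the estimates is what the standard error measures
sd_of_means = statistics.stdev(means)

# In practice we have one sample and infer the standard error from its SD
one_sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(N)]
se_from_one_sample = statistics.stdev(one_sample) / N ** 0.5

print("SD of the repeated estimates:    ", round(sd_of_means, 3))
print("SE inferred from a single sample:", round(se_from_one_sample, 3))
print("Theoretical value, 2 / sqrt(30): ", round(POP_SD / N ** 0.5, 3))
```

Note that `statistics.stdev` already divides by 𝑛 − 1 when computing the standard deviation; the division by √𝑛 is a separate step that reflects how the uncertainty in the mean shrinks as the sample grows.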
Standard deviation vs. standard error
Only report standard deviation if you want to quantify the amount of
variation among observations (e.g. individuals)
Report standard errors whenever you are presenting estimates, e.g. of
the mean, the regression coefficient, or the difference between two
means.
Questions about the lecture
"Can you please explain the bit about confidence interval of the mean
and whether a 0 is included again?"
A 95% confidence interval gives us a range that, if we were to repeat our
experiment many times, would contain the true mean 95% of the time
In 5% of cases the true mean lies outside of this range
If the 95% confidence interval excludes zero, the estimate differs
significantly from zero at the 5% level (p < 0.05)
Testing a mean against zero is usually not very interesting, but the
same logic applies to all estimates (e.g. of a slope or a difference
between two means)
The normal distribution
[Interactive app: sliders for the true mean (-100 to 100, shown at 0), the true variance (1 to 100, shown at 10) and the sample size (10 to 1,000), with an option to add a fitted normal distribution to the plot]
The mean of 𝑥: -0.2988209
The variance of 𝑥: 8.101418
The standard deviation of 𝑥: 2.846299
The standard error of the mean of 𝑥: 0.2860638
95% confidence interval of mean of 𝑥: -0.8595059 to 0.2618642
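The confidence interval reported above follows from the mean and its standard error. A minimal sketch, in Python for illustration only, assuming the app uses the standard normal critical value of 1.96 (which reproduces the reported limits to within rounding):

```python
mean_x = -0.2988209  # mean of x reported above
se_mean = 0.2860638  # standard error of the mean reported above
z_crit = 1.96        # assumed critical value of the standard normal for 95%

lower = mean_x - z_crit * se_mean
upper = mean_x + z_crit * se_mean
print(round(lower, 4), round(upper, 4))  # -0.8595 0.2619

# The interval includes zero, so this mean is not significantly different from 0
contains_zero = lower <= 0 <= upper
print(contains_zero)  # True
```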
Questions about the lecture
"Do we always standardise data?"
No, but you can and it can be useful sometimes
We standardise parameters estimated from our data (e.g. slope,
difference among groups) all the time
Express the slope or difference in standard error units
This allows us to obtain a p-value using the standard normal or 𝑡-distribution
What is the probability of finding a difference equal to or larger than 𝑥
standard errors if the true difference is 0?
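As a rough sketch of this standardisation, the Python snippet below (illustration only) divides an estimate by its standard error and looks up the two-sided probability in the standard normal distribution. The difference of 1.5 and the standard error of 0.6 are hypothetical numbers, not results from the lecture.

```python
from statistics import NormalDist

difference = 1.5      # hypothetical estimated difference between two means
standard_error = 0.6  # hypothetical standard error of that difference

# Express the estimate in standard error units
z = difference / standard_error

# Probability of a difference at least this many standard errors from zero,
# in either direction, if the true difference is 0
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(round(z, 2), round(p_value, 4))
```

For small samples the 𝑡-distribution would be used instead of the standard normal, which gives a somewhat larger p-value for the same standardised estimate.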
Questions about the lecture
"What is the definition of a critical value? How is the critical value
decided?"
The value of 𝑧 (and −𝑧) or 𝑡 (and −𝑡) that marks off the area under the
curve you are interested in
You decide on the critical value you want to use
For significance testing and a significance threshold of 5%, it is the
value for which the area under the curve between −𝑧 and 𝑧 (or −𝑡 and 𝑡)
is 0.95
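For the standard normal distribution this critical value can be computed rather than looked up. A minimal sketch in Python (illustration only), using `NormalDist` from the standard library:

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)
alpha = 0.05  # significance threshold of 5%

# Leave alpha / 2 in each tail, so the middle area is 1 - alpha
z_crit = std_normal.inv_cdf(1 - alpha / 2)

# Check: area under the curve between -z and z
area = std_normal.cdf(z_crit) - std_normal.cdf(-z_crit)

print(round(z_crit, 2), round(area, 2))  # 1.96 0.95
```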
t-distribution
$$t = \frac{\text{estimate}}{\text{s.e.}}$$
[Interactive app: sliders for the critical value (0 to 4, shown at 1.96) and the sample size (3 to 200), with an option to invert the shaded area]
Size of shaded area (i.e. probability): 0.949
Questions about the lecture
"In the normal distribution the confidence interval is mean(x) +/-
1.96xS.E as 95% of data falls between +/- 1.96 S.E, but as each
t-distribution is different and there is no set value for where 95% of the data
fall, how do you work out the 95% confidence interval if the data is instead
from a t distribution?"
Quick and dirty: Mean ± 2 × standard error
Exact confidence interval depends on sample size
In R: use the confint() function
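To see how much the exact interval depends on sample size, the sketch below computes the 95% critical value of the 𝑡-distribution for a few sample sizes. It is written in Python for illustration only and uses numerical integration and bisection from the standard library; the helper functions are made up for this example and assume 𝑛 − 1 degrees of freedom (as for a mean), and the mean and standard error at the end are hypothetical. In R the same quantile is available directly via `qt()`.

```python
import math


def t_density(t, df):
    """Probability density of the t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log(1 + t * t / df))


def area_between(t, df, steps=2000):
    """Area under the t density between -t and t, using Simpson's rule."""
    if t == 0:
        return 0.0
    h = t / steps
    total = t_density(0.0, df) + t_density(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    # Integrate from 0 to t and double it, as the density is symmetric
    return 2 * total * h / 3


def t_critical(df, coverage=0.95):
    """Find t such that the area between -t and t equals coverage (df >= 2)."""
    lo, hi = 0.0, 100.0
    for _ in range(50):  # bisection
        mid = (lo + hi) / 2
        if area_between(mid, df) < coverage:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


mean_x = 5.0   # hypothetical estimated mean
se_mean = 1.0  # hypothetical standard error of the mean

for n in (3, 30, 200):
    t_crit = t_critical(n - 1)  # assumes n - 1 degrees of freedom
    lower = mean_x - t_crit * se_mean
    upper = mean_x + t_crit * se_mean
    print(n, round(t_crit, 2), round(lower, 2), round(upper, 2))
```

For very small samples the critical value is well above 2, so the quick "mean ± 2 × standard error" rule gives an interval that is too narrow; by a sample size of about 30 the two are already close.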
Questions about the lecture
"I understood the math behind the confidence interval and t-distributions
but I didn’t quite understand what it was useful for in real-life. Can we see
an example?"
See Lecture 3.