STA2007 flashcards:
Linear Regression: this is a technique for modelling the relationship
between a continuous response and several explanatory variables.
Statistical model: the statistical model has a stochastic component which
captures variability in the response that cannot be explained by the
deterministic part of the model
Significance of statistical models: they help us find patterns in complex
data sets and thus make predictions
Principle of modelling: since models are never exactly accurate, they should be built to capture the particular aspect of the thing you are trying to model that you are interested in
Types of data: continuous, count, binary, binomial, proportion, categorical; the type of data determines the type of model we will use for the analysis
When are data not considered random or independent: when data are collected over time or in space, because observations taken near each other will often be more similar than those taken further apart
What questions are we interested in when describing relationship between
two variables: How fast does the response variable (Y, the variable of
main interest) change when the explanatory variable (often denoted by X)
is changed by one unit? • Is the apparent relationship real or just by
chance? • How strong is the relationship, i.e. how predictable is the
response when I know the value of the explanatory variable?
What is correlation: Correlation is a measure of the strength of a linear
relationship
Exceptions with correlation:
a correlation close to zero does not imply no relationship; it only implies no linear relationship
correlation does not imply causation
correlation is not always the best way to describe a relationship; r² or regression can be more informative
What technique is used for non-linear (monotonic) relationships: Spearman rank correlation
What is the goal of the simple linear regression model: to fit a line that estimates the true relationship between X and Y and estimates the response at different values of X
How to find the line of best fit: using the method of ordinary least squares (OLS)
What does the ordinary least squares do: it finds the line that minimises
the error or residual sum of squares
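A minimal R sketch of fitting a line by OLS with lm(); the data here are simulated purely for illustration:
set.seed(1)
dat <- data.frame(x = 1:20)
dat$y <- 2 + 0.5 * dat$x + rnorm(20, sd = 1)   # made-up data with a known straight-line trend
fit <- lm(y ~ x, data = dat)   # lm() computes the OLS estimates of intercept and slope
coef(fit)                      # estimated intercept and slope
summary(fit)                   # estimates, SEs, t-tests, residual standard error, R-squared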
What does the hypothesis test/p-value in simple linear regression check: whether there is evidence in the data for a relationship
If a model is a valid description of the data, there are two parts that a model
needs to get right: 1. The form of the relationship between the response
and the explanatory variables is correctly specified. 2. The error structure
(distribution and dependence structure) is correctly specified.
What is the best way to detect model misspecification: looking at the residuals and at scatter plots
The residual standard error gives one an idea of how much the
observations deviate from the fitted line or plane
What are the requirements of the error terms in a linear model: (1)
normally distributed, (2) all with the same variance (homoscedastic), and
(3) that they are independent.
What are the estimates of these error terms: residuals
So how do the requirements on the error terms apply to the residuals: Normally
distributed, equal variance and independent really means that there is no
discernible pattern or structure left in the residuals. If there is, then the
model has failed to pick up an important structure in the data.
What a plot of residuals vs fitted values should look like for a well-specified model: no clear patterns in trend or variance, no outlying observations, and no clumps of observations
What are Outliers: observations where the response deviates substantially
from the rest of the modelled pattern.
High leverage points meaning and significance: extreme in the
explanatory variables, Points with high leverage have a potentially large
influence on the estimates of the regression coefficients if their
corresponding response values do not conform to the general pattern
(outliers)
Influential observation: are individual data points with a large influence on
the regression coefficients and the fitted regression line. Often they are
points with high leverage and outliers
What to do with influential observations: Check for errors in data
recording. • Check if a misspecification of the model structure may be
responsible for a group of influential observations. • If neither of the
above apply, one could fit the model to the data with and without these
observations, and show both sets of results. • When influential
observations are removed from the data, this should be declared, also
how this was done
How are residuals visually checked for normality: histograms and quantile-
quantile plots
when are observations identified as non-influential: when they do not fall beyond the Cook's distance lines in the last diagnostic plot
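In R, the residual diagnostics above can be produced directly from a fitted lm object; a sketch, reusing the hypothetical fit from the earlier lm() example:
par(mfrow = c(2, 2))      # show the four diagnostic panels together
plot(fit)                 # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
cooks.distance(fit)       # Cook's distances, to flag influential observations numerically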
The F-test tests whether the explanatory variables can explain anything,
but is NOT an indication of how much they explain, or how good the model
is
F-test: The F-statistic on the last row of the R output is the test statistic for the hypothesis H0: β1 = ... = βk = 0, in other words, that all regression coefficients in the model are simultaneously equal to zero
when we are happy with the model, what do we check for next: 1.
variance explained 2. graphical comparisons of observed and fitted values
The F-test is just a check to see if the current model is better than the null
model
To understand how much of the variance in the response Y is explained: we check R²
Multiple linear regression:
Multiple regression: refers to having multiple explanatory variables in one
regression model.
The relationship between Y and the explanatory variables in model with
more than one explanatory variable: is described by a surface, or a plane,
since we have added one dimension
What does the MLR model explain: • The expected (average) response is
linearly related to both explanatory variables. • The deviations of the
observed values (response) are normally distributed around the fitted
values. There is a model for the mean response (deterministic), and a
model for the errors (stochastic).
Relationship between βnXn with Y: This model describes the effect
(change in mean response) of each explanatory variable: βn is the change
in mean response, given the value of all the other explanatory variables
Why are confidence intervals better suited for estimating coefficients: they give us a range of plausible values for these parameter estimates
What do we expect these parameter estimates (in particular the
coefficients of x) to show under the null hypothesis: we expect most
parameters to fall within 2 standard errors of zero (cover 95%)
In multiple linear regression models, the t-tests and the coefficients measure what a variable can explain after all the other variables in the model have already explained their part.
What can we conclude when the 95% CI’s don’t overlap zero: we can
conclude that there is enough evidence against the null hypothesis and
that the estimated parameter is non-zero.
The adjusted R 2 is an unbiased estimate of the proportion of variance in
the response explained by the model. The adjustment takes into account
the number of parameters estimated in the model, and gives a better idea
of how useful the model is in explaining the response.
If we were to add lots of useless explanatory variables to a model, the
multiple R 2 would increase, whereas the adjusted R 2 would decrease
and correctly indicate that such a model is worse than the simpler model.
When do we use adjusted r2: It is mostly used to compare models.
When trying to identify the relationship between a particular explanatory variable and the response variable, what should we use: a fitted partial relationship. Fitted values are estimates of y for the OBSERVED values of x; predicted values can be estimates of y for unobserved values of x.
What does this allow us to do: observe the pattern and form of the relationship between that particular variable and Y.
What estimates the error variance (unexplained variability) in linear
models: the variance of the residuals.
Residual variance in MLM: This residual variance is the variability of the
data points around the fitted surface
What do Ŷ, Ȳ and Y mean in statistics:
1. Y:
o Y usually represents the observed values of the response (dependent) variable in a data set; it is the actual data collected from observations or experiments.
2. Ȳ:
o Ȳ (read as "Y-bar") represents the mean (average) of the observed values of Y. It is calculated as Ȳ = (1/n) Σ Yi, where n is the number of observations and the Yi are the individual data points.
3. Ŷ:
o Ŷ (read as "Y-hat") typically represents the predicted values of Y from a statistical model, such as a regression model; it is the value of Y estimated by the model from the input variables. For example, in linear regression Ŷ = β0 + β1X, where β0 and β1 are the regression coefficients and X is the explanatory variable.
Summary:
Y: observed values of the dependent variable.
Ȳ: mean of the observed values.
Ŷ: predicted values from a statistical model.
When are intercepts not considered meaningful: when zero does not fall within the range of observed values of the explanatory variables
In multiple regression, with multiple parameter estimates from the same
fitted model, the t-test tests whether βj = 0 given all other terms in the
model, i.e. can the variable Xj explain anything further, not already
explained by the other variables in the model?
If SSR = SST (i.e. SSE = 0), the regression model perfectly captures all the observed variability, but that is rarely the case.
SST = Σ(Yi − Ȳ)², SSR = Σ(Ŷi − Ȳ)², SSE = Σ(Yi − Ŷi)², SST = SSR + SSE,
R² = SSR/SST.
F = (SSregression/p) / (SSE/(n − p − 1)) = MSregression/MSE
Collinearity: highly correlated explanatory variables.
What does collinearity cause: The result of this is inflated standard errors,
and consequently large p-values, and can also lead to large changes in
the coefficient estimates
How do we deal with collinearity: the solution is either to construct two
separate models, or to choose which of the two correlated variables is
more directly related or meaningful for modelling the response.
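A quick, hedged way to see collinearity and its effect on standard errors in R (all variables simulated; with the car package installed, car::vif() is another common check):
set.seed(2)
x1 <- rnorm(50)
x2 <- x1 + rnorm(50, sd = 0.1)     # x2 is nearly a copy of x1: strong collinearity
x3 <- rnorm(50)
y  <- 1 + x1 + x3 + rnorm(50)
cor(cbind(x1, x2, x3))             # correlation between x1 and x2 is close to 1
summary(lm(y ~ x1 + x2 + x3))      # note the inflated standard errors for x1 and x2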
Assumptions in linear regression models: we assume that the relationship
is well described by a line, we assume that the error variance stays
constant, we assume that the errors (or data points) are independent, and
that the errors are approximately normally distributed.
What if these 4 assumptions are violated: If there are violations of the
assumptions, it is not that the data are wrong, but that the model is
inappropriate
How do we check these assumptions: by looking at the behaviour of the residuals, and by looking at the relationship between the response and each of the explanatory variables (partial fitted relationships)
If a linear regression model is appropriate, we expect to see an equal
spread of the residuals throughout the range of fitted values. We call this
constant error variance or homoscedasticity
What if there is no equal spread of residuals: If the assumption is violated
and we have non-constant error variance the standard errors and the
associated hypothesis tests for the slope coefficients will be invalid.
What do we expect to see under an INDEPENDENT data set: Under
independence we expect model residuals to be independent and
consecutive errors to be unrelated.
What can cause dependent data sets: Autocorrelation: it measures the
degree to which past values in a series influence future values
o Temporal Dependence: In time series data, observations
collected over time may be correlated with previous
observations. For example, today's stock prices might be
influenced by yesterday's prices.
o Spatial Dependence: In spatial data, observations close to
each other in space may be more similar than those further
apart. For example, weather conditions in nearby locations are
often correlated.
How do we check for normality of residuals: through a histogram or a
quantile-quantile plot of the studentized residuals
When can we suspect that residuals do not follow a normal distribution: when the histogram or quantile-quantile plot indicates a poor fit; if the assumption is badly violated and we have skew residuals and outliers, this may be an indication that parts of the data are not well described by the model.
Why do we use confidence intervals for regression coefficients: because the coefficients are estimates of the effects
How do we compute confidence intervals for these estimates: because regression coefficient estimates are (approximately) normally distributed, we can use the symmetric confidence interval formula β̂j ± 2 × SE(β̂j)
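In R, confint() gives the exact t-based intervals; the ± 2 × SE rule above is the same idea with the t-quantile rounded to 2. A sketch for a fitted lm object called fit, as in the earlier example:
confint(fit, level = 0.95)                           # t-based 95% CIs for all coefficients
est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]
cbind(lower = est - 2 * se, upper = est + 2 * se)    # the rough 'estimate +/- 2 SE' version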
Extensions of the Linear Regression Model
Why are straight lines sometimes not a good model for the relationship
between X and Y:
Non-constant variance: as the value of Y increases, so does its
variability
The values of Y are restricted: e.g. can only lie between 0 and 1
The relationship between X and Y may be multiplicative, not additive, Y
does not increase by a constant amount for a unit increase in X, but by
a factor.
There may be thresholds, optimal levels of X, or the effect of X may
level off.
A non-linear relationship may occur if Y depends on the square of X
How are these non-linear relationships dealt with:
Transform the response, so that the relationship becomes linear
Transform the explanatory variable, so that the relationship becomes
linear
Add polynomial terms of the explanatory variable to capture the non-
linear relationship
Use non-linear regression to capture the non-linear relationship
When do we use a log transformation of the response:
Y increases or decreases exponentially with changing X, and the spread
(variability) of the response increases with increasing Y.
A histogram of Y is positively skewed (a long tail to the right).
It is useful for multiplicative response
Why is log transforming this response beneficial: A log-transformation of
CO2 emissions is much more symmetrically distributed, and a linear
model for this response might work much better
Analysing the R output:
For the CO2 data the estimated coefficient for biofuels is −0.037. This means that for every 1% increase in biofuels used in total energy production, log(CO2 emissions) decreases by 0.037 log metric tons. CO2 emissions change by a factor of exp(−0.037) = 0.96 per % increase in biofuels. This is equivalent to a 4% decrease in CO2 emissions per % increase in biofuels used.
The same principle holds for confidence intervals: first calculate confidence intervals from the linear model for log(Y). Then transform the confidence limits (exp(Climit)) to obtain a confidence interval for the median response on the original scale
On the log scale, from the linear model: β̂1 ± 2 SE(β̂1)
Back-transformation to the original scale: exp(β̂1 ± 2 SE(β̂1))
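A hedged R sketch of this workflow; the data frame and variable names (energy, co2, biofuel) are made-up stand-ins for the CO2 example, simulated so the code runs:
set.seed(3)
energy <- data.frame(biofuel = runif(40, 0, 30))
energy$co2 <- exp(5 - 0.037 * energy$biofuel + rnorm(40, sd = 0.3))
mlog <- lm(log(co2) ~ biofuel, data = energy)
coef(mlog)["biofuel"]             # slope on the log scale (about -0.037 here by construction)
exp(coef(mlog)["biofuel"])        # factor change in (median) CO2 per 1% increase in biofuels
exp(confint(mlog)["biofuel", ])   # back-transformed confidence limits on the factor scale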
When do we use a polynomial regression model: If the relationship
between the response and the explanatory variable is not linear, and a
transformation of the response is not the most sensible thing to do, we
can fit a more flexible curve to the relationship
How do we create this more flexible curve: by adding polynomial terms in
the explanatory variables to the model
Why is a polynomial regression still a linear model: it is linear in the
parameters, even though it now describes a non-linear relationship
Why is it dangerous to extrapolate beyond the observed range of X in
polynomial regression: Because polynomial regression models can be so
flexible they very much adapt to the observed data, and the endpoints are
based on very few observations
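A minimal R sketch of adding a quadratic term; the data are simulated, purely for illustration:
set.seed(4)
x <- seq(0, 10, length.out = 60)
y <- 2 + 1.5 * x - 0.15 * x^2 + rnorm(60, sd = 1)
quad <- lm(y ~ x + I(x^2))     # still linear in the parameters, despite the curved relationship
summary(quad)
quad2 <- lm(y ~ poly(x, 2))    # equivalent fit using orthogonal polynomials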
What model do we use while describing categorical variables: we no
longer have a simple linear regression model but we can still use a linear
regression model to describe the data.
Dummy variables: these are not multiplied by the value of the variable, because we have categories rather than numeric values; instead they are multiplied by 1 or 0 (switching them on and off)
The baseline level: estimated as part of the intercept coefficient (all
dummy variables are set to zero), the first level of the categorical
variables is always estimated by the intercept
What do the coefficients of dummy variables represent: The coefficients of
the dummy variables represent the difference in the predicted value of the
dependent variable between the corresponding category and the baseline
category
What does a positive and negative coefficient represent: A positive
coefficient indicates that the category is associated with a higher
predicted value compared to the baseline, while a negative coefficient
indicates a lower predicted value
Difference between fitting a linear model with a continuous variable and with a categorical variable: a continuous variable is represented by a single slope coefficient (the change in mean response per unit increase in X), whereas a categorical variable is represented by dummy variables whose coefficients are differences in mean response from the baseline category
Interactions: when the effects of continuous explanatory variables are not
additive but the effect of X1 depends on the level of X2 , X1 and X2
interact
When do we try to add an interaction term to the model: when the effect
of X1 depends on the level of X2
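In R, one way to fit an interaction is with the * operator, which expands to the main effects plus their interaction; a sketch with simulated, hypothetical variables:
set.seed(5)
x1 <- runif(80)
x2 <- factor(sample(c("A", "B"), 80, replace = TRUE))
y  <- 1 + 2 * x1 + (x2 == "B") * (1 + 3 * x1) + rnorm(80, sd = 0.5)
m_int <- lm(y ~ x1 * x2)    # shorthand for y ~ x1 + x2 + x1:x2
summary(m_int)              # the x1:x2B row estimates how much the slope of x1 changes for level B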
Overfitting vs underfitting:
Overfitting occurs when a model is overly complex, fitting noise in
the data rather than the underlying pattern. This results in:
o Large standard errors due to parameter estimates being based
on limited data.
o Poor performance in predicting new data because of high
uncertainty in the estimates.
Underfitting happens when a model is too simple or rigid to
capture the data's structure. This leads to:
o Bias in the model, as it fails to describe the data adequately.
o Precise but inaccurate parameter estimates, giving a false
sense of confidence.
o Poor predictive performance, as predictions are often "precisely
wrong."
What is the method of maximum likelihood used for: it estimates
parameters, where it is an important alternative to the method of least
squares for estimating parameters
What is likelihood: the likelihood is a function that tells us how ‘likely’ each
parameter value is given the observed data and our assumed model; it is
a function of the unknown parameters. And the likelihood of each
parameter value is judged by how likely it makes the observed data
Maximum likelihood estimate: the maximum likelihood estimate (MLE) is that value of the parameter which maximizes the probability of the observed data
How does parameter estimation differ between least squares and maximum likelihood: for least squares regression we minimize the error sum of squares; for maximum likelihood we write down the likelihood function and maximize it with respect to all parameters.
What does the likelihood of a fitted model represent: The likelihood of a
fitted model is defined as the value of the likelihood function at the
maximum likelihood estimate(s). This quantity can be used to compare
models: models with higher likelihood are better.
Models suitable for data mining and hypothesis generation (automated
model selection): all subsets regression, stepwise regression
All subsets regression: fits all possible models with p explanatory variables (there are 2^p − 1 possibilities)
Problem with this approach: The problem with this approach is that one
quickly ends up fitting many models to a limited data set. This approach is
guaranteed to lead to overfitting
Forward vs backward stepwise regression: forward selection starts with an
empty model and adds variables one by one, while backward selection
starts with a full model and removes variables one by one
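In R, step() carries out stepwise selection by AIC; a hedged sketch on simulated data with no real signal, which also illustrates the overfitting caveat in the next card:
set.seed(6)
dat <- data.frame(matrix(rnorm(100 * 6), ncol = 6))
names(dat) <- c("y", paste0("x", 1:5))      # y is unrelated to the x's by construction
null <- lm(y ~ 1, data = dat)
full <- lm(y ~ ., data = dat)
step(null, scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3 + x4 + x5),
     direction = "forward")                 # forward selection by AIC
step(full, direction = "backward")          # backward elimination by AIC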
Problem with stepwise regression: overfitting, you are almost guaranteed
to find some spurious results: variables with no predictive power appear
statistically important
Automated model use cases: We see two uses for these automated model
selection routines: the first is for generating hypotheses as mentioned
above; the second is after you have conducted a rigorous model selection
analysis and you want to explore further patterns in the data that no-one
expected
What we don't use automated model selection for: these methods should not be used for testing scientific hypotheses because they lead to overfitting and spuriously significant results
What is multiple testing: the situation that arises when numerous null hypotheses are tested simultaneously.
Problem with multiple testing: if many true null hypotheses are tested, some will appear significant (e.g., p < 0.05) purely by chance; without careful planning, researchers risk identifying spurious relationships due to chance
First step to model selection: construct candidate models: The idea is that
each model represents an alternative hypothesis about the processes that
generated the data and you should be able to justify the inclusion of each
model.
Model comparison and selection steps: we aim to choose a parsimonious
model if it is viable
How do we choose a parsimonious model: To choose a parsimonious
model we trade off goodness-of-fit and number of parameters used
Five methods to choose between models:
The adjusted R2
The residual mean square
Mallow’s Cp statistic
Analysis of variance/deviance
Information criteria
The adjusted R²: choose the model with the highest adjusted R² value. This does not necessarily mean that it is the best model for prediction; it just means that this model explains the highest proportion of variance
The residual mean square (or MSE): estimates the residual variance, The
model that minimises the MSE fits the data most closely
What causes MSE to decrease or stabilise: MSE should decrease as more
important variables enter into the regression equation. MSE will tend to
stabilise as the number of variables included in the equation becomes
large
Mallows Cp statistic: used in regression analysis to help select the best
subset of predictors for a linear regression model. The goal is to avoid
overfitting while ensuring the model captures the essential relationships in
the data.
The Cp statistic is calculated as:
Cp = SSEp/MSEfull − (n − 2p), where n = number of observations and p = number of predictors in the subset model
How do we interpret Cp: Cp compares the performance of a subset
model to the full model.
A model with a Cp value close to p is considered good. This indicates
that the subset model has a similar predictive performance to the full
model but with fewer predictors.
If Cp is much larger than p, the subset model may be underfitting (missing important predictors).
If Cp is much smaller than p, the subset model may be overfitting (including unnecessary predictors).
Analysis of variance :ANOVA is used to compare the variability explained
by a model to the residual variability (unexplained variability). It is most
commonly applied in the context of linear regression or experimental
designs
Steps in ANOVA:
Partition the total variability into explained (SSR) and unexplained
(SSE) components.
Compute the F-statistic to test whether the model explains a significant
portion of the variability.
Compare the F-statistic to a critical value or compute a p-value to
determine significance
What does Akaike's information criterion (AIC) allow: it balances the trade-off between bias (underfitting) and variance (overfitting)
Why is it important to consider two models that are close in AIC value: because the comparison is based on a single data set; with a different data set, the relative ranking of the models could change
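A small R sketch comparing two candidate models by AIC and log-likelihood (simulated data, illustrative only):
set.seed(7)
d <- data.frame(x1 = rnorm(60), x2 = rnorm(60))
d$y <- 1 + 0.8 * d$x1 + rnorm(60)
m1 <- lm(y ~ x1, data = d)
m2 <- lm(y ~ x1 + x2, data = d)
AIC(m1, m2)                # lower AIC preferred; a small difference means the ranking is not decisive
logLik(m1); logLik(m2)     # the maximised log-likelihoods behind the AIC values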
MODULE 2 Experimental Design
Difference in interpreting observational and experimental studies: It is
only by experimentation that we can infer causality
Why: In an experiment, if a change in variable A, say, results in a change in the
response Y, then we can be sure that A caused this change, because all
other factors were controlled and held constant. In an observational study
if we note that as variable A changes Y changes, we can say that A is
associated with a change in Y but we cannot be certain that A itself was
the cause of the change.
Why do we need experimental design:
An experiment is almost the only way in which one can control all
factors to such an extent as to eliminate any other possible
explanation for a change in response other than the treatment
factor of concern, allowing us to infer causality
Well-designed experiments are easy to analyse. Estimates of
treatment effects are independent, i.e. no issues of multicollinearity,
with different variables vying for the right to explain variation in the
response
Experiments are frequently used to find optimal levels of settings
(treatment factors) which will maximise (or minimise) the response.
In an experiment we can choose exactly those settings or treatment
levels we are interested in
Experimental unit: this is the entity (material) to which a treatment is
assigned, or that receives the treatment
Observational unit: the entity from which a measurement is taken
The number of observational units determines how many measurements are taken within an experimental unit
Treatment factor: the factor which the experimenter will actively manipulate, in order to measure its effect on the response (an explanatory variable).
What are homogeneous experiments: If there are no distinguishable
differences between the experimental units prior to the experiment the
experimental units are said to be homogeneous
Why do we desire homogeneous experimental units: The more
homogeneous the experimental units are, the smaller the experimental
error variance (natural variation between observations which have
received the same treatment) will be
How do we account for difference between experimental units: If the
experimental units are not homogeneous, but heterogeneous, we can
group sets of homogeneous experimental units and thereby account for
differences between these groups. This is called blocking
What does blocking allow us to do:
Blocking allows us to tell which part of the total variation is due to
differences between treatments and which part is due to differences
between blocks (blocks are variable)
Within one block, the experimental units are similar and we can
compare the treatments more easily
Removes between-block variation from experimental error,
improving precision
Tests treatments across diverse conditions
What is “location” in terms of blocking factors: The experimental units at
one location can be expected to have different characteristics (more
shady) than those at another location (more sunny)
What is a replicated experiment: an experiment in which a treatment is applied independently to more than one experimental unit
What is pseudo replication: Mistaking multiple measurements within the
same experimental unit as independent replicates
What problems does pseudo replication cause: The problem is that
without true replication, we don’t have an estimate of uncertainty, of how
repeatable, or how variable the result is if the same treatment were to be
applied repeatedly
Three fundamental principles of experimental design: Replicate,
randomise and reduce unexplained variation
Aim of replication: This ensures that the variation between two or more
units receiving the same treatment can be estimated and valid
comparisons can be made between the treatments
What ensures proper replication: treatments are set up independently for each experimental unit, to prevent confounding
What is confounding: when it is not possible to separate the effects of two (or more) factors on the response
Aim of randomisation: allocating treatments to experimental units in such a way that all experimental units have exactly the same chance of receiving a specific treatment
What does randomisation ensure:
There is no bias on the part of the experimenter
No experimental unit is favoured to receive a particular treatment
differences between treatment means can be attributed to
differences between treatments, and not to any prior differences
between the treatment groups,
allows us to assume independence between observations
Aim of reduction of experimental error variance: we want to reduce the
experimental error variance between experimental units because larger
unexplained variation makes it harder to detect differences between
treatments
How can we reduce experimental error variance:
Controlling extraneous factors
Blocking
Factors to consider when designing an experiment:
Treatment factors and their levels
The response
Experimental material/ units
Blocking factors
Number of replicates
What are treatment factors: The factors/variables that are investigated,
controlled, manipulated, thought to influence the response, are called
treatment factors
Treatment structure:
Single factor: the treatments are the levels of a single treatment
factor.
Factorial: an experiment with more than one treatment factor in
which the treatments are constructed by crossing the treatment
factors: the treatments are all possible combinations of the a levels
of factor A and the b levels of factor B, resulting in a × b treatments
Nested: If factors are nested, the levels of one factor, B, will not be
identical across all levels of another factor A. Each level of factor A
will contain different levels of factor B. We would say B is nested in A
What is the significance of a control treatment: A control treatment is a
benchmark treatment to evaluate the effectiveness of experimental
treatments
Blinding in experiments function: Prevents bias when humans are involved
as experimental subjects or observers, as expectations can consciously or
unconsciously influence results
Types of blinding:
Single-blind: Either the participant or the observer does not know
which treatment was given.
Double-blind: Both the participant and observer are unaware of
treatment assignments (gold standard for minimizing bias).
When to use blocking: Are there any structures/differences that need to be
blocked? Do I want to include experimental units of different types to
make the results more general? How many experimental units are
available in each block?
Two basic designs:
Completely Randomized Design: This design is used when the
experimental units are all homogeneous (no blocking required). The
treatments are randomly assigned to the experimental units.
Randomized Block Design: This design is used when the experimental
units are not all homogeneous but can be grouped into sets of
homogeneous units called blocks (one blocking factor). The treatments
are randomly assigned to the units within each block.
Methods of randomisation:
For completely randomized designs the experimental units are not
blocked, so the treatments (and their replications) are assigned
completely at random to all experimental units available (hence
completely randomized).
If there are blocks, the randomization of treatments to experimental
units occurs in each block.
Observational vs. Experimental Studies
Observational studies rely on observation rather
than manipulation of variables.
Cannot establish causality, only associations
Randomisation → Random sampling (to avoid selection bias).
Blocking → Stratification (grouping similar units into strata)
Balanced experiment: An experiment with the same number of replicates
for each treatment is balanced
What tests do we use to determine differences between treatment groups:
t-test: two groups
ANOVA: more than two groups
Single-Factor Completely Randomized Design:
Single-factor design: Only one factor is tested (e.g., plant
species), with multiple levels (e.g., 9 species + control).
No blocking: Treatments are randomly assigned to experimental
units (e.g., containers) without restrictions.
One observation per unit: Ensures independence of data points
We can use ANOVA to analyse variance, its requirements are:
There are no outliers.
All groups have equal population variance.
The errors are normally distributed. (in side by side boxplot,
asymmetric boxes can show non-normal distribution)
The errors are independent.
How to find outliers in data: use side-by-side box plots
How do outliers arise: experimental errors, or they indicate additional processes that were not part of the planned experiment
How to deal with outliers: The safest is then to run the analysis with and
without these outliers to see whether the main conclusion depends on
whether these observations are part of the analysis or not.
How to check for equal population variance:
Sample variances won’t be identical due to random variation.
Goal: Verify if differences are small enough to justify the assumption
Visually :Side-by-side boxplots (by treatment group) help
compare variability (sizes of IQR)
Quantitatively: in R, if the ratio of the largest to the smallest sample variance is less than 5, the assumption is considered reasonable
Independence Assumption:
Violations occur when unaccounted factors (e.g., time, space,
hidden variables) introduce correlation. Examples:
o Unmodeled blocking factors, instrument drift, environmental
changes (e.g., temperature), or shared conditions (e.g.,
contaminated resources).
Effects: Autocorrelated residuals can bias estimates or
misrepresent standard errors.
Diagnostic tools:
o Cleveland dot plots (data/residuals vs. observation order) to
spot temporal/spatial trends.
o Plot residuals against spatial/experimental coordinates
Analysing completely randomised designs with two levels: t-test:
To see if a treatment had an effect, compare it to the control with a
t-test
Analysing completely randomised designs with two levels: ANOVA:
Yij = μ + αi + eij
i = 1, ..., a (a = number of treatments)
j = 1, ..., r (r = number of replicates)
Yij = observation on the j-th unit receiving treatment i
μ = overall or general mean
αi = μi − μ = effect of the i-th level of treatment factor A
eij = random error, with eij ∼ N(0, σ²)
The fitted/predicted means for the treatments are:
Ŷi = μ̂ + α̂i = Ȳi, the observed mean for treatment i
How to estimate the necessary parameters:
μ (the overall mean): μ̂ = Ȳ, the grand mean of all observations
αi (treatment effects): α̂i = Ȳi − Ȳ
σ² (the error variance): the residual mean square (MSE)
Standard errors and confidence intervals:
variance of a treatment mean estimate: Var(μ̂i) = σ²/ni
If we assume that two treatment means are independent, the variance of the difference between two means is Var(μ̂i − μ̂j) = Var(μ̂i) + Var(μ̂j) = σ²/ni + σ²/nj
Confidence intervals for the population treatment means and for differences between means are of the form: estimate ± t(ν, α/2) × SE(estimate), where t(ν, α/2) is the α/2 percentile of Student's t distribution with ν degrees of freedom
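A minimal R sketch of fitting and summarising a single-factor CRD; treatments and responses are simulated for illustration:
set.seed(8)
dat <- data.frame(trt = factor(rep(c("control", "A", "B"), each = 6)),
                  y   = rnorm(18, mean = rep(c(10, 12, 11), each = 6)))
fit <- aov(y ~ trt, data = dat)
summary(fit)                        # ANOVA table: treatment and residual SS, MSE, F-test
model.tables(fit, type = "means")   # estimated treatment means
confint(lm(y ~ trt, data = dat))    # t-based CIs for the baseline mean and treatment differences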
ANOVA is a Linear Model:
o ANOVA (Analysis of Variance) is essentially a regression
model but parameterized for categorical
predictors (factors) rather than continuous ones.
o The key difference lies in how variance is partitioned and
interpreted.
Why Focus on Variance?
o In well-designed experiments, the total variance (sum of
squares) can be split into independent components for
each factor (e.g., treatment, blocking).
o This partitioning allows us to:
Quantify how much variation is due to each
factor (unlike observational studies, where factors often
overlap).
Estimate error variance (unexplained variability).
Conduct hypothesis tests (e.g., whether treatment
effects are significant).
One-Way ANOVA vs. t-Test:
o A one-way ANOVA (single categorical factor) generalizes
the two-sample t-test (assuming equal variances) to more
than two groups.
o Both compare means, but ANOVA handles multiple
levels efficiently.
How variance is partitioned:
The basic idea of ANOVA relies on the ratio of the among-treatment-means variation to the within-treatment variation. This is the F-ratio
Large F and small F values : Large ratios imply the signal (difference
among the means) is large relative to the noise (variation within groups)
and so there is evidence of a difference in the means. Small ratios imply
the signal (difference among the means) is small relative to the noise
(variation within groups) and so there is no evidence that the means differ
How degrees of freedom are calculated in ANOVA: treatment df = number of treatments − 1; residual df = number of observations − number of treatments
To check the final model for problems: examine plots of the distribution of the residuals
What is a contrast: A comparison of (groups of) treatments is called a
contrast.
Contrast with a single factor that contains two treatments: a t-test
Two different parameterizations of the ANOVA model for analysing treatment effects:
1. Sum-to-Zero Parameterization:
o Model: Yij = μ + αi + eij
o Parameters:
μ = overall mean
αi = treatment effect (difference between treatment mean and μ)
o Constraint: Σ αi = 0
o Useful for constructing ANOVA tables.
2. Treatment Contrast Parameterization (Default in R):
o Model: Yij = αi + eij
o Parameters:
α1 = mean of the baseline treatment (first treatment alphabetically)
α2, α3, ... = differences between subsequent treatments and the baseline
o Interpretation:
α2 = difference between treatment 2 and treatment 1
α3 = difference between treatment 3 and treatment 1, etc.
o Hypothesis testing: H0: αi = 0 tests whether a treatment differs from the baseline.
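A sketch of how these parameterizations look in R (simulated data; by default lm() uses treatment contrasts):
set.seed(9)
dat <- data.frame(trt = factor(rep(c("A", "B", "C"), each = 5)),
                  y   = rnorm(15, mean = rep(c(10, 12, 11), each = 5)))
summary(lm(y ~ trt, data = dat))    # (Intercept) = mean of baseline level "A";
                                    # trtB, trtC = differences from that baseline
dat$trt <- relevel(dat$trt, ref = "C")                 # pick a different baseline level
summary(lm(y ~ trt, data = dat,
           contrasts = list(trt = "contr.sum")))       # sum-to-zero parameterization instead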
Why Use Contrasts?
ANOVA F-test only confirms if differences exist among treatments.
Contrasts pinpoint which specific groups or pairs differ.
Flexible for complex comparisons (e.g., weighted averages of
multiple treatments).
Type I Error (False Positive)
Definition: Rejecting the null hypothesis (H0) when it is actually true.
Probability: Denoted by α (typically set at 0.05).
Example: Concluding two treatments differ when they don’t.
The Challenge of Hypothesis Testing
Uncertainty: A small p-value doesn't guarantee H0 is false; it could be a rare outcome under H0.
Limitation: We can't distinguish between a true effect and a false positive (Type I error).
Multiple Testing Problem
Issue: Conducting many tests increases the chance of at least
one Type I error.
Bonferroni Inequality:
Worst-case probability of ≥ 1 Type I error ≈ m × α (where m = number of tests).
Example: 10 tests at α = 0.05 → up to 50% chance of at least one false positive.
Experiment-Wise Error Rate: The overall Type I error rate
across all tests in an experiment.
Planned vs. Unplanned (Post-hoc) Contrasts in Statistical Analysis
Key Points
1. Planned (A-Priori) Contrasts
o Definition: Pre-specified comparisons based on
hypotheses defined before seeing the data.
o Advantages:
Controls Type I error inflation because the number of
tests is limited.
Stronger, more reliable conclusions since tests are
theory-driven.
o Use Case: Testing specific research questions the experiment
was designed to address.
2. Unplanned (Post-hoc) Contrasts
o Types:
All possible pairwise comparisons (e.g., Tukey’s
HSD).
Data-driven comparisons (e.g., picking "interesting"
differences after seeing results).
o Problems:
Type I error inflation: More tests → higher chance of
false positives.
Circular reasoning: Selecting extreme differences
biases results.
o Appropriate Use:
Exploratory analysis (hypothesis generation, not
confirmation).
Must be clearly labeled as post-hoc to avoid
misinterpretation.
3. Why Planning Matters
o Multiple Comparisons Issue: Unplanned testing increases
false positives (e.g., 20 tests at α=0.05 → ~64% chance of ≥1
false positive).
o Solution: Pre-register hypotheses or use stricter corrections
(e.g., Bonferroni) for unplanned tests.
4. Best Practices
o Prioritize planned contrasts for confirmatory conclusions.
o If conducting post-hoc tests:
Use adjusted significance thresholds (e.g., Bonferroni,
FDR).
Clearly distinguish exploratory vs. confirmatory findings
in reporting.
o Avoid "p-hacking": Cherry-picking significant results
invalidates inference.
Managing Multiple Comparisons in Statistical Analysis
Core Problem
Multiple comparisons inflate Type I errors (false positives).
Even with planned contrasts, excessive testing increases the risk of
spurious findings.
Solutions to Control Experiment-Wise Error Rate
1. Bonferroni Correction
o Approach: Adjust the significance threshold by dividing α by the number of tests (m), or multiply the raw p-values by m (see the R sketch after this list).
o Example: For 5 tests at experiment-wise αE = 0.05, reject H0 only if p < 0.01 (or use adjusted p-values < 0.05).
o Pros: Universally applicable (not limited to pairwise
comparisons).
o Cons: Overly conservative (reduces power; high false-
negative risk).
o For Confidence Intervals: Use higher confidence levels
(e.g., 99% CIs for 5 tests to maintain 95% experiment-wise
coverage).
2. Planned vs. Unplanned Tests:
o Planned contrasts: Fewer tests → less severe correction needed (e.g., Bonferroni for small m).
o Unplanned/post-hoc tests: Require stricter methods (e.g.,
Tukey, Scheffé).
3. Regression Context: Testing many predictors without hypotheses
also risks false positives; apply corrections similarly.
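The R sketch referred to in the Bonferroni item above: p.adjust() applies the correction to a vector of raw p-values (the p-values here are made up):
p_raw <- c(0.003, 0.020, 0.041, 0.250, 0.700)   # hypothetical raw p-values from m = 5 tests
p.adjust(p_raw, method = "bonferroni")          # multiplies each p-value by m, capped at 1
p.adjust(p_raw, method = "holm")                # a less conservative step-down alternative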
Tukey's HSD for Pairwise Comparisons
Purpose
Controls the experiment-wise Type I error rate (α) when
conducting all possible pairwise comparisons between group
means in ANOVA.
Key Features
1. Balanced Design Recommended:
o Works best when groups have equal sample sizes (n).
o Uses the residual standard error (s) and degrees of
freedom (ν) from ANOVA.
2. Formula:
HSD = q(α, a, ν) × s/√n
o q(α, a, ν): critical value from the studentized range distribution (accounts for multiple comparisons).
o Interpretation: Two means are significantly different if their
difference exceeds HSD.
3. Experiment-Wise Error Control:
o Ensures the probability of at least one false positive across
all comparisons is α (e.g., 5%).
Advantages
Stronger than Bonferroni for pairwise tests (less conservative).
Automatically adjusts for all comparisons.
When to Use
Post-hoc analysis after a significant ANOVA.
No pre-planned hypotheses; exploratory comparison of all
groups.
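In R, Tukey's HSD comes directly from an aov fit; a sketch on a simulated one-way design:
set.seed(10)
dat <- data.frame(trt = factor(rep(c("A", "B", "C"), each = 8)),
                  y   = rnorm(24, mean = rep(c(5, 7, 5.5), each = 8)))
fit <- aov(y ~ trt, data = dat)
TukeyHSD(fit, conf.level = 0.95)   # all pairwise differences with family-wise 95% intervals
plot(TukeyHSD(fit))                # intervals that do not cross zero indicate significant pairs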
Scheffé's Method for Multiple Comparisons
Purpose
Controls the experiment-wise Type I error rate (α) for any
number and type of contrasts, including complex, non-pairwise
comparisons.
Key Features
1. Flexibility:
o Works for all possible contrasts (not just pairwise), making
it ideal for exploratory analyses with many unplanned tests.
o Based on the F-distribution, adjusting p-values and
confidence intervals conservatively.
2. Formula:
o Adjusted p-value: padj = P(F(a−1, ν) ≥ T²/(a−1))
where:
T = test statistic (e.g., L̂ / SE(L̂)).
a = total number of groups.
ν = residual degrees of freedom.
F(a−1, ν) = an F-distributed variable with a−1 and ν degrees of freedom.
3. Comparison with Other Methods:
o More powerful than Tukey’s HSD for general
contrasts (e.g., comparing group averages).
o More conservative for pairwise comparisons (wider CIs,
larger p-values).
Advantages & Limitations
Pros:
o Controls error rate for all possible contrasts, not just
pairwise.
o More powerful than Bonferroni for complex comparisons.
Cons:
o Overly conservative for pairwise tests (Tukey’s HSD is
better).
o Reduces power (increases Type II error risk).
The Importance of Power Analysis in Ecological
Experiments
What is Power Analysis?
Purpose: Determines the sample size needed to detect a
biologically meaningful effect or estimates the probability of
detecting an effect given existing constraints.
Statistical Decisions and Errors
1. Type I Error (False Positive): Rejecting a true null
hypothesis (H₀) at rate α (e.g., 5%).
2. Type II Error (False Negative): Failing to reject a false H₀ at
rate β.
Power = 1−β: Probability of correctly detecting a true
effect.
Implementing Power Analysis
Requires specifying:
o Effect size: the smallest difference considered biologically meaningful.
o Variability: an estimate of the natural variation (e.g., from pilot data or the literature).
o α and power thresholds (e.g., α = 0.05, power = 0.8).
Trade-offs: Higher power requires larger samples or larger effect
sizes.
Understanding Statistical Power in Ecological Studies
Core Concept
Power is the probability of correctly detecting a true effect (e.g., a
1%/year decline in a threatened species). Low power risks Type II
errors (missing real declines), with serious consequences for
conservation.
Key Factors Affecting Power
1. Effect Size:
o Larger effects (e.g., 2.5% vs. 1% decline) are easier to detect
(higher power).
o Example: Power jumps from 17% to 70.5% when detecting a
2.5% vs. 1% decline (same variability).
2. Variability (Noise):
o Lower noise (smaller standard deviation) increases power.
o Example: Reducing SD from 1 to 0.3 boosts power from 17%
to 91.5% for a 1% decline.
3. Sample Size/Replication:
o More data (e.g., longer monitoring periods) shrinks standard
errors, enhancing power.
4. Significance Level (α):
o Higher α (e.g., 0.2 vs. 0.05) increases power but raises Type I
error risk (false alarms).
o Trade-off: α = 0.2 gives 88.9% power for a 2.5% decline but
accepts 20% false positives.
Power Analysis Workflow
1. Define:
o Minimum effect size of interest (e.g., 1% population
decline).
o Acceptable α (typically 0.05) and desired power (e.g.,
80%).
2. Estimate: Natural variability (from pilot data or literature).
3. Calculate: Required sample size or achievable power given
constraints.
Practical Implications
Low power (e.g., 17%) risks missing critical declines → increase
replication or reduce noise (e.g., better monitoring methods).
If power is unattainable: Reconsider study design (e.g., focus on
larger effects) or seek collaborative data.
Takeaway
Power analysis is a non-negotiable step in ecological research design.
Balancing effect size, variability, α, and sample size ensures studies
can reliably detect meaningful changes—vital for protecting species and
ecosystems.
Power Analysis for ANOVA (F-Test) in Experimental Design
Core Concept
Power analysis for an F-test in ANOVA determines the sample size needed
to detect meaningful differences among treatment means, ensuring the
experiment has sufficient sensitivity (e.g., 80% power) to reject the null
hypothesis when true differences exist
Key Steps & Formulas
1. Model & Hypothesis:
o Model: Yij = μi + eij (single-factor CRD with a treatments and r replicates).
o Null Hypothesis (H₀): All treatment means (μi) are equal.
2. Non-Central F-Distribution:
o Under H₁ (H₀ false), the F-statistic follows a non-central F-
distribution with:
Degrees of freedom: (a − 1, N − a).
Non-centrality parameter: λ = r · Σ(μi − μ̄)²/σ² ≈ r·D²/(2σ²),
where D = smallest biologically meaningful difference between any two means.
3. Power Calculation:
o Power = P(F > Fcritical | H₁), where Fcritical is the threshold from the central F-distribution (under H₀).
o Inputs:
Effect size (D), noise (σ), significance level (α), and desired power (e.g., 0.8).
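Base R's power.anova.test() does this calculation for a single-factor design; a hedged sketch with made-up inputs (the assumed true means and error SD are illustrative, not from the notes):
power.anova.test(groups      = 4,
                 between.var = var(c(10, 10, 10, 13)),   # variance of the assumed true group means
                 within.var  = 12^2,                     # assumed error variance sigma^2
                 sig.level   = 0.05,
                 power       = 0.80)                     # solves for n, the replicates per group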
Practical Insights
1. Trade-offs:
o Smaller D or higher σ → more replicates needed.
o Higher power or stricter α → larger sample size.
2. Design Implications:
o Underpowered experiments risk Type II errors (missing real
effects).
o Resource planning: Use power analysis to justify sample size or redesign (e.g., blocking to reduce σ).
3. Validation:
o For two-treatment designs, F-test power matches t-test results (e.g., 24 replicates for D = 10, σ = 12).
Takeaway
Power analysis for ANOVA ensures experiments are designed to detect
meaningful differences efficiently. By quantifying the relationship
between effect size, variability, sample size, and power, researchers
can:
Optimize resources (avoid over/under-sampling).
Communicate limitations (e.g., "This experiment can only detect
differences ≥3 units").
Prevent false negatives in ecological and environmental studies.
The Pitfalls of Retrospective Power Analysis
What is Retrospective Power Analysis?
A post-hoc calculation of statistical power after an experiment fails to
reject the null hypothesis (H₀), using observed effect sizes and variability.
Intended to diagnose whether a non-significant result (large p-value) was
due to:
1. H₀ being true (or the effect being biologically trivial), or
2. Low power (the experiment was incapable of detecting a
meaningful effect).
Key Problems
1. Circular Logic:
o Retrospective power is calculated from the observed effect
size and standard error—the same values that led to the
non-significant result.
o If the observed effect was small relative to noise, power
will always appear low, revealing nothing new.
2. No Additional Insight:
o A non-significant result with wide confidence intervals
(CIs) already indicates low power.
o A narrow CI around zero suggests the true effect is likely
negligible, regardless of power.
3. Misinterpretation Risk:
o Journals sometimes demand retrospective power, but it’s
often misleading. A large p-value cannot prove H₀ is true—
only a lack of evidence against it.
Better Alternatives
1. Pre-Experiment Power Analysis:
o Plan sample sizes before data collection to ensure adequate
power for meaningful effects.
2. Confidence Intervals:
o Report CIs to show the precision of effect estimates.
Example:
CI: [−2, 1] (narrow, includes zero) → Likely no
meaningful effect.
CI: [−10, 15] (wide) → Inconclusive; suggests low power.
3. Replication or Meta-Analysis:
o Combine results with future/past studies to improve precision.
Takeaway
Retrospective power analysis is redundant—it echoes what CIs
and p-values already reveal.
Focus on pre-planning and transparent reporting:
o "We failed to reject H₀, but our CI [−5, 8] cannot rule out a
meaningful effect due to limited samples."
Avoid journal requests for retrospective power; advocate
for prospective power calculations and CI
interpretation instead.
Idea of blocking:
often there is important variation in additional variables that we are not
directly interested in. If we can group our experimental units with respect
to these variables to make them more similar, we have a more powerful
design. This is the idea of blocking.
Why does this work: If we use blocks in a particular way this will allow us
to separate variability due to treatments, blocks and errors, and thereby
reduce the unexplained variability
What differs blocking factors and treatments: whether we can manipulate
the factor and randomly assign experimental units to its levels
Randomisation in block designs: Whatever the choice, the experimental
units within each block again need to be randomised to treatments and
randomisation happens independently in each block.
Blocking and non-demonic intrusions: if an accident causes extra noise in the data, blocking can be a distinct advantage, because experimental units within a block are likely to suffer in the same way from such accidents
Assumption of no interaction in block experiments: we assume that the treatments and blocks do not interact, that is, the effect of a treatment does not depend on which block it is in, or, in other words, the effect of treatment i is the same in every block. (Interaction: if two factors, A and B say, interact, this means that the effects of A depend on the level of factor B, or the effects of B depend on the level of factor A.)
Balanced block design and its significance:
each block is the same with respect to treatments, i.e. the same set
of treatments occurs in every block.
In balanced designs treatment and block effects can be completely
separated (are independent).
Incomplete block design: If the blocks are too small to receive all
treatments, we can only have a subset of the treatments in each block
When is an incomplete block design considered balanced: If all pairs of
treatments occur together within a block an equal number of times, the
design is still balanced.
Generalised randomised complete block design:
If we have more experimental units in a block than treatments, we can
replicate (some) of the treatments in each block, leading to a generalized
randomized block design
Advantage of Generalised randomised complete block design:
The advantage of a generalized randomized complete block design is that
you can estimate the interaction between block and treatment if each
treatment is replicated at least twice in each block
How to check if data meet assumptions fairly well:
The first thing we need to check is that the analysis matches the
design (blocking, randomisation, sample size)
The next thing to check is that there are no outliers
all population variances are equal
errors are normally distributed.
errors are independent.
effects of blocks and treatments are additive, the treatment effects
are similar in all blocks.
How to check if the effects of blocks and treatments are additive:
add a colour-coded line for each babbler to the side-by-side boxplots (Fig. 5.12); each line connects the experimental units of one block (babbler). If the lines don't cross too much, it is reasonable to assume that the different call types had similar effects for all blocks
Why Additivity Matters:
1. Simplifies Interpretation:
o If effects are additive, the treatment effect (difference
between call types) is the same regardless of the block
(babbler).
o This means we can analyze the data using a simple additive
model (e.g., two-way ANOVA without an interaction term).
When do blocks improve the efficiency of the design:
we might be interested in whether blocking increased the efficiency of the design, i.e. reduced the unexplained variation. If F > 1, then blocking did reduce unexplained error variance. If F ∼ 1, the blocks did not improve the power of the experiment and you would have been about equally well off with a completely randomised design
how do we obtain estimates for the treatment and block effects: we
minimize the error sum of squares (method of least squares)
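A sketch of analysing a randomised complete block design in R (blocks, treatments and responses simulated for illustration):
set.seed(11)
d <- expand.grid(block = factor(1:5), trt = factor(c("A", "B", "C")))
d$y <- rnorm(15) + 0.5 * as.numeric(d$block) + ifelse(d$trt == "B", 1, 0)
fit <- aov(y ~ block + trt, data = d)   # additive model: no block:treatment interaction
summary(fit)                            # total SS partitioned into block, treatment and residual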
Contrasts vs. ANOVA:
o ANOVA tests whether any differences exist between
treatments but doesn’t pinpoint specifics.
o Contrasts (e.g., comparing fenced vs. unfenced trees)
directly address hypotheses like H₀: μ₁ − μ₂ = 0.
Regression vs. ANOVA:
o lm(): Used for estimating coefficients (e.g., treatment
difference = 0.96, SE = 0.42) and testing contrasts.
1. Example: lm(diff1m ~ treat + factor(pair)) yielded identical
results to the paired t-test (t = 2.27, p = 0.06).
o aov(): Preferred for partitioning variance across factors (e.g.,
treatment, block).
Interacting factors meaning: the effects of one factor depend on the level or settings of the other factors.
What are factorial experiments: In factorial experiments we have more
than one treatment factor
What are complete factorial experiments: In a complete factorial
experiment every combination of factor levels is studied and the number
of treatments is the product of the number of levels of each factor, i.e.
each treatment is a combination of factor levels: one level from each
factor
Types of effects in an experiment:
Main effect, interaction effect and random effect
Main effect: The main effect of a treatment measures the average change
in response, averaged over all levels of the other factors, relative to the
overall mean. When there is only a single factor in an experiment we only
have main effects
The interaction effect: measures the change in response relative to the
main effects with a particular treatment.
Hidden replication within factorial experiments: the levels of each factor
are replicated
Equation for factorial experiments and how they are parameterised:
Yijk = μ + αi + βj + (αβ)ij + eijk
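In R the factorial model above corresponds to crossing the factors with *; a small simulated sketch:
set.seed(12)
d <- expand.grid(A = factor(c("a1", "a2")),
                 B = factor(c("b1", "b2", "b3")),
                 rep = 1:4)
d$y <- rnorm(nrow(d)) + ifelse(d$A == "a2", 1, 0) + ifelse(d$A == "a2" & d$B == "b3", 2, 0)
fit <- aov(y ~ A * B, data = d)     # main effects of A and B plus the A:B interaction
summary(fit)                        # F-tests for A, B and A:B
interaction.plot(d$A, d$B, d$y)     # non-parallel lines suggest an interaction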
Generalised linear models:
Type of data where the response variable is not continuous:
Binary response: only two possible outcomes
Binomial response: number of successes out of n trials
Count response: number of a certain object or event
Where do generalised linear models differ from linear regression:
In generalized linear models we think in terms of how parameters are
related to explanatory variables: probability of success for binary or
binomial data, rate parameter for count data, mean for normal data
What are the parameters in binomial and Poisson distributions directly
related to:
the parameters of the binomial and Poisson distributions are directly
related to the expected or mean response
Why are GLMS still considered linear models:
GLM’s are linear models because the parameter (or some form of it) is still
linearly related to the explanatory variables. Linear here refers to
the β coefficients appearing in a linear combination, i.e. the terms are just
added together.
What are link functions:
The link function defines the form of the relationship between the
parameter of interest and the explanatory variables: linear (identity link),
exponential (log link) or S-shaped (logit link).
when do we use log links and logit links:
by using a log link we are assuming that there is an exponential
relationship between λ and the explanatory variables, and by using a logit
link we are assuming that there is an S-shaped relationship between the
probability of success pi and the explanatory variables.
Why do we need a different model for the response in non-normal data:
count data, binomial or binary data don’t have a normal or even
symmetrical distribution
the variance (or uncertainty) of observations is not constant, as is
often the case in GLMs
range of the response is limited to integers ≥0 for count data, 0 or 1
for binary data, and limited to integers between 0 and n for binomial
data
what method is used to estimate parameters in generalised linear models:
method of maximum likelihood
why are binary data considered a special case of binomial data:
Binary data are a special case of binomial data, with all ni=1.
binomial distribution with n=1 is also called a Bernoulli distribution,
i.e. Yi∼Bernoulli(pi).
How to construct a GLM:
first, we need to specify the (error) distribution for the response
variable, and
secondly, we need to specify how the parameter of interest
(probability of lawn grass) is related to the explanatory variables,
and to which explanatory variables.
Parameters of the Bernoulli distribution:
Y_i ∼ Bernoulli(p_i), where n_i = 1.
Y_i is the ith observation (0 for failure, 1 for success), p_i is the
probability of success for the ith observation.
Why do we use logit link function:
Probabilities are usually not linearly related to explanatory variables
but the log odds of success often are, so that the logit link
function is a reasonable model.
The link function (here logit) links the parameter to the linear
predictor (linear combination of explanatory variables).
The limits: the probability or proportion is limited to lie between 0
and 1, but the logit can (theoretically) range from −∞ to ∞.
What are the log odds:
The logit transformation or link function leads to the name logistic
regression. logit(p_i) = log(p_i / (1 − p_i)) is called the logit transformation
of p_i or the log-odds (natural logarithm).
Benefits of logit transformation:
The logit transformation of a probability parameter
ensures that, when transforming back to the probability of success,
all predicted probabilities range between 0 and 1.
ensures that logit(p_i) can take on any value between −∞ and ∞. So
no matter how extreme the x-values become, the predicted value
will never be illegal on the log-odds or logit scale.
is often linearly related to the explanatory variables, whereas p_i is
not.
In summary, the logistic regression model is:
Y_i ∼ Bin(n_i, p_i)
logit(p_i) = β_0 + β_1 x_i
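A minimal R sketch of this logistic regression model, using simulated binary data (all object names are hypothetical):
set.seed(1)
x <- runif(200)                                         # hypothetical explanatory variable
y <- rbinom(200, size = 1, prob = plogis(-1 + 3 * x))   # binary response, S-shaped in x
fit <- glm(y ~ x, family = binomial)                    # logit link is the default for binomial
summary(fit)                                            # β0 and β1 estimated by maximum likelihood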
how are estimates for the β parameters found:
Estimates for the β parameters are found by using the method of
maximum likelihood, i.e., by finding those values of β0 and β1 that make
the observed values most likely, given the specified model
Interpreting β coefficients
factors larger than 1 imply a positive effect → exp(β_i) = 1.8: a 1 unit
increase in X is associated with the odds of the event being multiplied
by 1.8, i.e. an increase of 80%.
factors less than 1 imply a negative effect → exp(β_i) = 0.6: a 1 unit
increase in X is associated with the odds of the event being multiplied
by 0.6, i.e. a reduction of 40%.
exp(β_i) = 1 implies no effect
Note that the effects do not say anything about the baseline rate,
just about relative odds!
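Continuing the hypothetical fit from the earlier sketch, the coefficients can be turned into odds ratios like this:
exp(coef(fit))            # odds ratios: factor by which the odds change per 1-unit increase
exp(5 * coef(fit)["x"])   # odds ratio for a 5-unit increase in x (r = 5)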
model for p_i:
p_i = exp(β_0 + β_1 x_i) / (1 + exp(β_0 + β_1 x_i))
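A small illustration of this back-transformation, continuing the hypothetical fit (plogis() is R's built-in inverse logit):
eta <- coef(fit)[1] + coef(fit)[2] * 0.5    # linear predictor at x = 0.5
exp(eta) / (1 + exp(eta))                   # predicted probability, by hand
plogis(eta)                                 # the same value via the inverse-logit function
# predict(fit, newdata = data.frame(x = 0.5), type = "response") gives the same answer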
what are odds:
Odds are ratios of two probabilities. For example, the odds of lawn grass in
a given area is the probability of lawn grass divided by the probability of
no lawn grass.
what do different log-odds mean:
A log-odds of 0 means that success and failure are equally likely. Negative
log-odds imply probability of success less than 0.5, positive log-odds imply
probability of success larger than 0.5
Why is exp(β_i) an odds ratio:
exp(β_i) is an odds-ratio because it compares odds at 2 levels
(at x0+1 vs x0). The odds-ratio can be understood as the factor by which
the odds of success change for every one unit increase in the explanatory
variable
⇒ exp(β_i) gives the odds ratio associated with a one unit increase in X_i.
⇒ exp(r × β_i) gives the odds ratio associated with an r unit increase in X_i.
What is the reason for maximum likelihood estimation:
Maximum likelihood estimation involves choosing those values of the
parameters that maximize the likelihood or, equivalently, the log-
likelihood. Intuitively, this means that we choose the parameter values
which maximize the probability of occurrence of the observations.
What does it mean that maximum likelihood estimates are asymptotically
normally distributed: as the sample size grows their distribution approaches
a normal distribution, so the normal approximation is relatively good in
large data sets
Wald intervals in GLMs:
A 95% CI for a parameter, based on a normally distributed estimate
(e.g., regression coefficient, MLE estimate) can be calculated
as:estimate±1.96×SE(estimate)
In GLMs such intervals are called Wald intervals. 'Wald' implies
that we have assumed a normal distribution for the MLE estimates.
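A minimal sketch of a Wald interval in R, continuing the hypothetical logistic fit from earlier:
est <- coef(summary(fit))["x", "Estimate"]
se  <- coef(summary(fit))["x", "Std. Error"]
est + c(-1.96, 1.96) * se        # Wald 95% CI computed by hand
confint.default(fit, "x")        # the same Wald interval from R
# confint(fit) instead profiles the likelihood, so it need not match exactly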
Assumptions in binomial distribution for a random variable:
there is a constant probability of success
the ni trials are independent. This means that the outcome of one
trial is not influenced by the outcome of the others.
the final number of trials does not depend on the number of
successes
how to check if the relationship between the response, or rather the
modelled parameter (probability of success in logistic regression), and the
explanatory variables is adequately captured by the model in logistic
regression:
Single explanatory variable: plot the observed proportions against
the explanatory variable, with the fitted line superimposed onto this
plot. For binary data we mostly don’t have proportions, even though
it may be possible to calculate proportions for sections of the
explanatory variable (as we did in the lawn grass example).
Several explanatory variables: partial residual plots for every
explanatory variable, with the fitted lines superimposed (we can
again use R’s visreg() function for this).
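One possible way to produce such a check in R for the hypothetical single-predictor fit from earlier (the choice of bins is arbitrary):
brk  <- seq(0, 1, by = 0.1)                       # bins of the explanatory variable
mids <- brk[-1] - 0.05                            # bin midpoints
prop <- tapply(y, cut(x, brk), mean)              # observed proportion of successes per bin
plot(mids, prop, ylim = c(0, 1), xlab = "x", ylab = "proportion")
curve(plogis(coef(fit)[1] + coef(fit)[2] * x), add = TRUE)   # fitted logistic curve
# with several explanatory variables: visreg::visreg(fit, "x", scale = "response"),
# assuming the visreg package is installed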
With large n or large counts, what do we check for in glm:
changes in the mean of the residuals point out that the relationship
between the parameter of interest and the explanatory variables
has been mis-specified
changes in the variance of the residuals are an indication that the
variance is not adequately described by the assumed model
skewed distributions of the residuals point out that the large
observations are not adequately fitted by the model.
check for influential observations
the residuals must be independent, which means that we must
account for spatial, serial or blocking structures in the model
Methods used for comparing GLMs: likelihood ratio tests and Akaike’s
Information Criterion, AIC (or similar criteria)
When are models considered nested: When one model contains terms that
are additional to those in another model, the two models are said to
be nested
What does the difference in deviance between two nested models measure:
the extent to which the additional terms improve the fit
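A minimal sketch of both comparisons in R, using the hypothetical logistic fit from earlier and a smaller model nested within it:
fit0 <- glm(y ~ 1, family = binomial)    # intercept-only model, nested in fit
anova(fit0, fit, test = "Chisq")         # likelihood ratio test on the deviance difference
AIC(fit0, fit)                           # lower AIC indicates the better fit/complexity trade-off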
Residuals in Logistic Regression:
Two types of residuals are commonly used in GLMs:
1. Pearson residuals:
r_i = (observed − fitted) / SE(fitted)
2. Deviance residuals:
The deviance of a model is calculated as
D = −2[ℓ(model) − ℓ(saturated model)]
and the deviance residual for observation i is the signed square root of
that observation's contribution to D.
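In R both types of residuals can be extracted from a fitted GLM; a minimal sketch, continuing the hypothetical fit:
r_pearson  <- residuals(fit, type = "pearson")
r_deviance <- residuals(fit, type = "deviance")   # the default type for glm objects
deviance(fit)                                     # residual deviance of the fitted model
sum(r_deviance^2)                                 # the squared deviance residuals sum to it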
What do ROC and AUC allow us to do: they help us get an idea of how well
the fitted model will be able to predict presence/absence given specific
values of the covariate
How do we predict presence/absence from the fitted model: we choose a
threshold probability p; if the predicted probability exceeds p we predict
presence, otherwise absence
Sensitivity: the proportion of actual positives that are correctly predicted
as positives
Specificity: the proportion of actual negatives that are correctly predicted
as negatives
What is the receiver operating characteristic curve: The ROC curve plots
the True Positive Rate (TPR, sensitivity) against the False Positive Rate
(FPR, 1-specificity) at different classification thresholds
What does the area under the ROC curve represent: it is used as a measure
of how well the classifier works
Characteristics of a good ROC curve: extends far into the top left corner of
the plot.
What probability does the AUC represent: the AUC gives the probability that
a randomly chosen positive is ranked higher (higher predicted probability)
than a randomly chosen negative, or, if you have a negative and a
positive, the proportion of time the model will classify/rank them correctly
Trade off between specificity and sensitivity:
You can’t maximize both simultaneously (unless the model is
perfect).
Adjusting the classification threshold shifts the balance:
o Higher threshold (e.g., 0.9):
Fewer positives predicted → Higher specificity (fewer
FPs), but lower sensitivity (more FNs).
Example: A cancer screening model set to only flag the
clearest cases misses some actual cancers.
o Lower threshold (e.g., 0.1):
More positives predicted → Higher sensitivity (fewer
FNs), but lower specificity (more FPs).
Example: Flagging many patients as "at risk" catches
more true cases but also leads to unnecessary tests.
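A minimal sketch of computing sensitivity and specificity at one threshold for the hypothetical logistic fit from earlier; repeating it over many thresholds traces out the ROC curve:
thr  <- 0.5                                      # chosen classification threshold
pred <- as.numeric(fitted(fit) > thr)            # predicted presence (1) / absence (0)
sens <- sum(pred == 1 & y == 1) / sum(y == 1)    # actual positives predicted as positive
spec <- sum(pred == 0 & y == 0) / sum(y == 0)    # actual negatives predicted as negative
c(sensitivity = sens, specificity = spec)
# packages such as pROC can compute the full ROC curve and the AUC directly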
Best methods for comparing GLMs:
Likelihood ratio tests: The likelihood ratio test can only be used for
nested models, and is valid only asymptotically, i.e., large ni and
large N (number of observations).
AIC: The AIC can be used for comparing non-nested models and is
valid for small ni and small N, therefore we much prefer the AIC.
What does the difference in deviance in nested models explain: the
difference in deviance between the two nested models measures the
extent to which the additional terms improve the fit.
What do overdispersion and underdispersion refer to:
a failure of the error distribution assumed for the response
variable to properly describe the variation in the observed data; this will
often not be picked up in residual plots.
Why do overdispersion and underdispersion occur in non-normal data:
Overdispersion is common in both binomial and count data. Both the
binomial and Poisson distribution have a single parameter: probability of
success, and average rate, respectively. They don't have an extra
variance parameter to separately describe the variability, as the normal
distribution does. Instead, the variance is restricted and directly related to
that one parameter.
Where does overdispersion come from:
clustering, unmeasured variables, or violated independence, which
inflate the variance so that variance >> mean
Where does underdispersion come from: data that are more uniform than
expected, so that variance < mean
Consequences of Ignoring Overdispersion:
Standard errors are too small → false confidence in effects (Type I
errors).
Model fit appears worse (high residual deviance) → may lead to
unnecessary model complexity.
How to fix overdispersion:
Quasi-likelihood models (adjust SEs using dispersion parameter).
Mixed models (account for clustering with random effects).
Alternative distributions (e.g., negative binomial for counts, beta-
binomial for proportions).
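A minimal R sketch of detecting and handling overdispersion; the counts are simulated to be deliberately overdispersed, and all object names are hypothetical:
set.seed(1)
x2     <- runif(200)
counts <- rnbinom(200, mu = exp(1 + 2 * x2), size = 1)   # more variable than Poisson allows
fit_p  <- glm(counts ~ x2, family = poisson)
deviance(fit_p) / df.residual(fit_p)       # near 1 under a well-fitting Poisson model; here >> 1
fit_q  <- glm(counts ~ x2, family = quasipoisson)   # quasi-likelihood: SEs scaled by dispersion
# MASS::glm.nb(counts ~ x2)                          # negative binomial alternative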
Poisson regression:
What is Poisson regression used for: used to model the effect of
explanatory variables on a response that is a count of events or
occurrences.
What does the Poisson distribution determine: the probability of
observing y events per unit time or space
Independently and randomly meanings: independently means that the
individual events don't influence each other in when or where they occur,
and randomly means that they are equally likely to appear at any point in
time (or space), but exactly when or where is unpredictable.
What is the λ parameter in the Poisson distribution: is the average/mean
rate of events (per minute, km or km2, or other constant unit)
(sometimes μ is used instead of λ). Both the mean, or expected, count, as
well as the variance are equal to λ.
What link function is used with the Poisson distribution: the relationship between
counts and explanatory variables is commonly exponential, and the link
function most commonly used with the Poisson distribution is the log link.
Why does the Poisson distribution work best for counts of rare events:
because this is the situation where the two ingredients of the
Poisson process are most likely satisfied: independent and random
events
features of count data:
the variance in the counts increases when the average rate (mean)
increases
the response is strictly an integer ≥0
the counts are not symmetrically distributed around the mean
(average rate)
the mean response often increases exponentially in relation to
explanatory variables, not linearly
How are the model parameters estimated: The model parameters are
estimated using the method of maximum likelihood
How the log link is interpreted:
log(λ̂_i) = β̂_0 + β̂_1 x_i
λ̂_i = exp(β̂_0 + β̂_1 x_i)
β̂_1 is interpreted as the change in the log average rate per unit increase
in the explanatory variable x. To see what happens to the average rate of
events when x increases by one unit, exponentiate both sides: the average
rate of events λ changes by a factor exp(β_1) per unit increase in x.
Confidence intervals for λ and change in λ:
On the log-link scale:
β̂_1 ± 1.96 × SE(β̂_1)
The above is a confidence interval for the change in log(λ), i.e., the
change in the log average rate, per unit increase in x.
exp[β̂_1 ± 1.96 × SE(β̂_1)]
is a confidence interval for the factor (rate ratio) by which the average
rate λ changes per unit increase in x
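A minimal R sketch (simulated counts, hypothetical names) of obtaining the rate ratio and its Wald interval from a fitted Poisson GLM:
set.seed(1)
x3      <- runif(150)
counts3 <- rpois(150, lambda = exp(0.5 + 1.2 * x3))
fit_pois <- glm(counts3 ~ x3, family = poisson)
exp(coef(fit_pois)["x3"])                           # rate ratio per unit increase in x3
est <- coef(summary(fit_pois))["x3", "Estimate"]
se  <- coef(summary(fit_pois))["x3", "Std. Error"]
exp(est + c(-1.96, 1.96) * se)                      # Wald CI for the rate ratio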
questions we ask to check for model specifications:
Is the relationship between response and explanatory variables
correctly modelled?
Is the distribution of the response around the expected value
correctly modelled?
How do we check these questions:
We often use the residuals to check the above, any patterns in the
residuals point out that either or both of the above two parts of the model
are not correctly specified.
What does the goodness of fit test aim to check with Poisson models:
It tests the null hypothesis that the observed counts follow a
Poisson distribution with a constant rate, implying that the counts
are consistent with purely random occurrence
Important property of rate parameter when comparing different spatial
events:
we need to make sure that the rate parameter λ uses the same unit
for all observations
Offset term: account for differently sized units
Fitted value: E(Y_i) = μ_i = n_i λ_i
log(μ_i) = log(n_i) + log(λ_i) = log(n_i) + β_0 + β_1 x_1i + ...
log(n_i) is known as the offset
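A minimal R sketch of a Poisson model with an offset; the areas, counts and names are hypothetical:
set.seed(1)
area    <- runif(100, min = 1, max = 10)            # size of each observation unit
x4      <- runif(100)
counts4 <- rpois(100, lambda = area * exp(0.5 + 1 * x4))
fit_off <- glm(counts4 ~ x4 + offset(log(area)), family = poisson)
summary(fit_off)                                    # coefficients describe the rate per unit area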
In an interaction between a continuous and a categorical variable, what are
the two possibilities and what do we look for:
The groups differ in level, but the rate of change is the same, so no
interaction is needed
The groups differ in level and the rate of change is also
different; this means that there is an interaction between the
explanatory variables
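A minimal R sketch of comparing these two situations for a count response; the data and names are hypothetical:
set.seed(1)
x5      <- runif(120)
group   <- factor(rep(c("g1", "g2"), each = 60))
counts5 <- rpois(120, lambda = exp(0.5 + 1 * x5 + 0.8 * (group == "g2") * x5))
fit_add <- glm(counts5 ~ x5 + group, family = poisson)   # different level, same rate of change
fit_int <- glm(counts5 ~ x5 * group, family = poisson)   # rates of change allowed to differ
anova(fit_add, fit_int, test = "Chisq")                  # does the interaction improve the fit?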