
STA2007 flashcards:

Linear Regression: this is a technique for modelling the relationship


between a continuous response and several explanatory variables.

Statistical model: the statistical model has a stochastic component which


captures variability in the response that cannot be explained by the
deterministic part of the model

Significance of statistical models: they help us find patterns in complex


data sets and thus make predictions

Principle of model use: since models are not exact, they should be built
with a particular aspect of the thing you are trying to model in mind

Types of data: continuous, count, binary, binomial, proportion, categorical.
The type of data determines the type of model we will use to analyse it

When are data not considered random or independent: when data are collected
over time or in space, because observations taken near each other will often
be more similar than those taken further apart

What questions are we interested in when describing the relationship between
two variables: • How fast does the response variable (Y, the variable of
main interest) change when the explanatory variable (often denoted by X)
is changed by one unit? • Is the apparent relationship real or just by
chance? • How strong is the relationship, i.e. how predictable is the
response when I know the value of the explanatory variable?

What is correlation: Correlation is a measure of the strength of a linear
relationship

Exceptions with correlation:

a correlation close to zero does not imply no relationship, it could simply
mean there is no linear relationship

correlation does not imply causation

correlation is not always the best way to describe a relationship; we
could use R² or regression instead

What technique is used for non-linear (monotonic) relationships: Spearman rank correlation

What is the goal of simple linear regression model: it’s a line that fits and
estimates the true relationship between the x and y and estimates
responses at different x

How to find the line of best fit: using the method of ordinary least squares (OLS)

What does the ordinary least squares do: it finds the line that minimises
the error or residual sum of squares
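
A minimal sketch of this in R, using simulated data (the variable names and numbers are invented for illustration, not taken from the course material):

```r
# Fit a simple linear regression by ordinary least squares
set.seed(1)
x <- runif(50, 0, 10)                  # explanatory variable
y <- 2 + 0.5 * x + rnorm(50)           # response with a known linear relationship
fit <- lm(y ~ x)                       # lm() finds the line minimising the residual sum of squares
coef(fit)                              # estimated intercept and slope
sum(resid(fit)^2)                      # the minimised residual (error) sum of squares
```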

What does the hypothesis/p-value test in simple linear regression check:
whether there is evidence in the data for a relationship

If the model is a valid description of the data, there are two parts that a model
needs to get right: 1. The form of the relationship between the response
and the explanatory variables is correctly specified. 2. The error structure
(distribution and dependence structure) is correctly specified.

What is the best way to detect model misspecifications : looking at


residuals, looking at the scatter plot

The residual standard error gives one an idea of how much the
observations deviate from the fitted line or plane

What are the requirements of the error terms in a linear model: (1)
normally distributed, (2) all with the same variance (homoscedastic), and
(3) that they are independent.

What are the estimates of these error terms: residuals

So how does the requirements of error terms apply to residuals: Normally


distributed, equal variance and independent really means that there is no
discernible pattern or structure left in the residuals. If there is, then the
model has failed to pick up an important structure in the data.
what a plot of residuals vs fitted values should look like for a well-specified
model: This has no clear patterns in trend, variance, outlying
observations, or clumps of observations,

What are Outliers: observations where the response deviates substantially


from the rest of the modelled pattern.

High leverage points meaning and significance: extreme in the


explanatory variables, Points with high leverage have a potentially large
influence on the estimates of the regression coefficients if their
corresponding response values do not conform to the general pattern
(outliers)

Influential observation: are individual data points with a large influence on


the regression coefficients and the fitted regression line. Often they are
points with high leverage and outliers

What to do with influential observations: Check for errors in data


recording. • Check if a misspecification of the model structure may be
responsible for a group of influential observations. • If neither of the
above apply, one could fit the model to the data with and without these
observations, and show both sets of results. • When influential
observations are removed from the data, this should be declared, also
how this was done

How are residuals visually checked for normality: histograms and quantile-
quantile plots

when are observations identified as non-influential: when they do not fall
beyond the Cook's distance lines in the last diagnostic plot (residuals vs leverage)
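
A hedged sketch of these residual checks in R (simulated data; the 4/n cut-off for Cook's distance is a common rule of thumb, not something stated in these notes):

```r
# Diagnostics for a fitted lm object
set.seed(1)
x <- runif(50, 0, 10); y <- 2 + 0.5 * x + rnorm(50)
fit <- lm(y ~ x)
par(mfrow = c(2, 2)); plot(fit); par(mfrow = c(1, 1))  # the four standard diagnostic plots
hist(resid(fit))                                       # rough check of normality of residuals
which(cooks.distance(fit) > 4 / length(resid(fit)))    # flag potentially influential points (4/n rule of thumb)
```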

The F-test tests whether the explanatory variables can explain anything,
but is NOT an indication of how much they explain, or how good the model
is

F-test: the F-statistic on the last row of the R output is the test statistic for
the hypothesis H0 : β1 = . . . = βk = 0, in other words, that all regression
coefficients in the model are simultaneously equal to zero.
when we are happy with the model, what do we check for next: 1.
variance explained 2. graphical comparisons of observed and fitted values

The F-test is just a check to see if the current model is better than the null
model

to understand variance in responses of Y: we check for R2

Multiple linear regression:

Multiple regression: refers to having multiple explanatory variables in one


regression model.

The relationship between Y and the explanatory variables in model with


more than one explanatory variable: is described by a surface, or a plane,
since we have added one dimension

What does the MLR model explain: • The expected (average) response is
linearly related to both explanatory variables. • The deviations of the
observed values (response) are normally distributed around the fitted
values. There is a model for the mean response (deterministic), and a
model for the errors (stochastic).

Relationship between βnXn and Y: this model describes the effect
(change in mean response) of each explanatory variable: βn is the change
in mean response per unit change in Xn, given the values of all the other
explanatory variables

Why are confidence intervals better suited to estimate


variables/coefficients: it gives us a range of possible values for these
parameter estimates

What do we expect these parameter estimates (in particular the
coefficients of X) to show under the null hypothesis: we expect most
estimates to fall within 2 standard errors of zero (covering 95%)

In multiple linear regression models, the t-tests and the coefficients measure
the amount that can be explained after all the other variables in the
model have already explained their part.
What can we conclude when the 95% CI’s don’t overlap zero: we can
conclude that there is enough evidence against the null hypothesis and
that the estimated parameter is non-zero.

The adjusted R 2 is an unbiased estimate of the proportion of variance in


the response explained by the model. The adjustment takes into account
the number of parameters estimated in the model, and gives a better idea
of how useful the model is in explaining the response.

If we were to add lots of useless explanatory variables to a model, the


multiple R 2 would increase, whereas the adjusted R 2 would decrease
and correctly indicate that such a model is worse than the simpler model.

When do we use adjusted r2: It is mostly used to compare models.

When trying to identify the relationship between a particular explanatory


variable and the response variable, what should we use: a fitted partial
relationship, fitted values are estimates of y for the OBSERVED values of
x, predicted values can be estimates of y for unobserved values of x.

What does this allow us to do: observe the pattern and the form of the
relationship between the particular variable and Y.

What estimates the error variance (unexplained variability) in linear


models: the variance of the residuals.

Residual variance in MLM: This residual variance is the variability of the


data points around the fitted surface

What do Yˆ, Y¯ and Y mean in statistics:

1. Y:

o Y usually represents the observed values of a dependent
variable in a dataset. It is the actual data collected from
observations or experiments.

2. Y¯:

o Y¯ (read as "Y-bar") represents the mean (average) of the
observed values of Y. It is calculated as:

Y¯ = (1/n) Σ Yi

where n is the number of observations and the Yi are the individual data
points.

3. Yˆ:

o Yˆ (read as "Y-hat") typically represents the predicted
values of Y from a statistical model, such as a regression
model. It is the value of Y estimated by the model based on
the input variables. For example, in linear regression:

Yˆ = β0 + β1X

where β0 and β1 are the regression coefficients, and X is the
independent variable.

Summary:

Y: Observed values of the dependent variable.

Y¯: Mean of the observed values.

Yˆ: Predicted values from a statistical model.

When are intercepts not considered meaningful: when a value of zero for the
explanatory variables does not fall within the range of the observed data

In multiple regression, with multiple parameter estimates from the same


fitted model, the t-test tests whether βj = 0 given all other terms in the
model, i.e. can the variable Xj explain anything further, not already
explained by the other variables in the model?
If SSR equals SST (so that SSE = 0), the regression model perfectly
captures all the observed variability, but that's rarely the case.

SST = Σ(Yi − Y¯)², SSR = Σ(Yˆi − Y¯)², SSE = Σ(Yi − Yˆi)², SST = SSR + SSE,
R² = SSR/SST.

F = (SSregression/p) / (SSE/(n − p − 1)) = MSregression/MSE
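
A small sketch, with simulated data, of how these sums of squares and R² can be computed by hand and compared against R's own output:

```r
# Sums of squares and R^2 for a simple linear regression
set.seed(1)
x <- runif(50, 0, 10); y <- 2 + 0.5 * x + rnorm(50)
fit <- lm(y ~ x)
SST <- sum((y - mean(y))^2)     # total sum of squares
SSE <- sum(resid(fit)^2)        # error (residual) sum of squares
SSR <- SST - SSE                # regression sum of squares
c(R2 = SSR / SST, R2_summary = summary(fit)$r.squared)  # the two should match
anova(fit)                      # ANOVA table with the F-statistic
```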

Collinearity: highly correlated explanatory variables.

What does collinearity cause: The result of this is inflated standard errors,
and consequently large p-values, and can also lead to large changes in
the coefficient estimates

How do we deal with collinearity: the solution is either to construct two


separate models, or to choose which of the two correlated variables is
more directly related or meaningful for modelling the response.
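
A minimal simulated illustration of how collinearity shows up in a fitted model (the variable names x1, x2, y are invented for the example):

```r
# Two highly correlated explanatory variables inflate standard errors
set.seed(2)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)    # x2 is almost a copy of x1 (highly collinear)
y  <- 1 + x1 + rnorm(100)
cor(x1, x2)                        # correlation near 1 signals collinearity
summary(lm(y ~ x1 + x2))           # note the inflated standard errors and large p-values
```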

Assumptions in linear regression models: we assume that the relationship


is well described by a line, we assume that the error variance stays
constant, we assume that the errors (or data points) are independent, and
that the errors are approximately normally distributed.

What if these 4 assumptions are violated: If there are violations of the


assumptions, it is not that the data are wrong, but that the model is
inappropriate

How do we check these assumptions: by looking at the behaviour of the
residuals, and by looking at the relationship between the response and
each of the explanatory variables (partial fitted relationships)

If a linear regression model is appropriate, we expect to see an equal


spread of the residuals throughout the range of fitted values. We call this
constant error variance or homoscedasticity

What if there is no equal spread of residuals: If the assumption is violated


and we have non-constant error variance the standard errors and the
associated hypothesis tests for the slope coefficients will be invalid.

What do we expect to see under a INDEPENDENT data set: Under


independence we expect model residuals to be independent and
consecutive errors to be unrelated.

What can cause dependent data sets: Autocorrelation: it measures the


degree to which past values in a series influence future values
o Temporal Dependence: In time series data, observations
collected over time may be correlated with previous
observations. For example, today's stock prices might be
influenced by yesterday's prices.

o Spatial Dependence: In spatial data, observations close to


each other in space may be more similar than those further
apart. For example, weather conditions in nearby locations are
often correlated.

How do we check for normality of residuals: through a histogram or a


quantile-quantile plot of the studentized residuals

When can we suspect that residuals do not follow a normal distribution: when
the histogram or quantile-quantile plot deviates from what we expect under
normality; if the assumption is badly violated and we have skew
residuals and outliers, this may be an indication that parts of the data are
not well described by the model.

Why do we use confidence intervals for regression coefficients: because they
are estimates of the effects

How do we calculate confidence intervals for these estimates: because
regression coefficient estimates are normally distributed, we can use the
symmetric confidence interval formula βˆj ± 2 × SE(βˆj)

Extensions of the Linear Regression Model

Why are straight lines sometimes not a good model for the relationship
between X and Y:

Non-constant variance: as the value of Y increases, so does its


variability

The values of Y are restricted: e.g. can only lie between 0 and 1

The relationship between X and Y may be multiplicative, not additive, Y


does not increase by a constant amount for a unit increase in X, but by
a factor.

There may be thresholds, optimal levels of X, or the effect of X may


level off.

A non-linear relationship may occur if Y depends on the square of X


How are these non-linear relationships handled (modelled):

Transform the response, so that the relationship becomes linear

Transform the explanatory variable, so that the relationship becomes


linear

Add polynomial terms of the explanatory variable to capture the non-


linear relationship

Use non-linear regression to capture the non-linear relationship

When do we use a log transformation of the response:

Y increases or decreases exponentially with changing X, and the spread


(variability) of the response increases with increasing Y.

A histogram of Y is positively skewed (a long tail to the right).

It is useful for multiplicative relationships

Why is log transforming this response beneficial: A log-transformation of


CO2 emissions is much more symmetrically distributed, and a linear
model for this response might work much better

Analysing the R output:

For the CO2 data the estimated coefficient for biofuels is -0.037. This
means that for every 1% increase in biofuels used in total energy
production, log(CO2 emissions) decreases by 0.037 log metric tons. CO2
emissions change by a factor of exp(−0.037) = 0.96 per % increase in
biofuels. This is equivalent to a 4% decrease in CO2 emissions per %
increase in biofuels used.

The same principle holds for confidence intervals: first calculate the
confidence interval from the linear model for the fitted log(Y). Then transform
the confidence limits (exp(limit)) to obtain a confidence interval for the
median response on the original scale.

On the log scale from the linear model: βˆ1 ± 2SE(βˆ1)

Back-transformation to the original scale: exp(βˆ1 ± 2SE(βˆ1))
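
A sketch of this back-transformation in R, using simulated data with illustrative variable names (biofuel, co2) rather than the actual course data set:

```r
# Log-transformed response: back-transforming the estimate and its confidence interval
set.seed(3)
biofuel <- runif(60, 0, 20)
co2 <- exp(3 - 0.04 * biofuel + rnorm(60, sd = 0.2))
fit_log <- lm(log(co2) ~ biofuel)
coef(fit_log)["biofuel"]             # effect on the log scale
exp(coef(fit_log)["biofuel"])        # multiplicative factor per 1-unit increase in biofuel
exp(confint(fit_log)["biofuel", ])   # CI back-transformed to the original (median) scale
```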


When do we use a polynomial regression model: If the relationship
between the response and the explanatory variable is not linear, and a
transformation of the response is not the most sensible thing to do, we
can fit a more flexible curve to the relationship

How do we create this more flexible curve: by adding polynomial terms in


the explanatory variables to the model

Why is a polynomial regression still a linear model: it is linear in the


parameters, even though it now describes a non-linear relationship

Why is it dangerous to extrapolate beyond the observed range of X in


polynomial regression: Because polynomial regression models can be so
flexible they very much adapt to the observed data, and the endpoints are
based on very few observations
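
A minimal sketch of a quadratic polynomial regression in R (simulated data, chosen only for illustration):

```r
# Polynomial regression: a quadratic term gives a flexible curve,
# but the model is still linear in the parameters
set.seed(4)
x <- runif(80, 0, 10)
y <- 1 + 2 * x - 0.2 * x^2 + rnorm(80)
fit_quad <- lm(y ~ x + I(x^2))
summary(fit_quad)
# Only predict within the observed range of x -- extrapolation is unreliable:
predict(fit_quad, newdata = data.frame(x = seq(min(x), max(x), length.out = 5)))
```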

What model do we use while describing categorical variables: we no


longer have a simple linear regression model but we can still use a linear
regression model to describe the data.

Dummy variables: they are not multiplied by the value of a variable, because we
don't have numeric values, we have categories; instead each dummy variable is
multiplied by 1 or 0 (switching the category on and off)

The baseline level: estimated as part of the intercept coefficient (all


dummy variables are set to zero), the first level of the categorical
variables is always estimated by the intercept

What do the coefficients of dummy variables represent: The coefficients of


the dummy variables represent the difference in the predicted value of the
dependent variable between the corresponding category and the baseline
category

What does a positive and negative coefficient represent: A positive


coefficient indicates that the category is associated with a higher
predicted value compared to the baseline, while a negative coefficient
indicates a lower predicted value
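
A small simulated sketch of how R handles a categorical explanatory variable via dummy variables (the group names A, B, C and their means are invented):

```r
# Categorical explanatory variable: R builds the dummy variables automatically
set.seed(5)
group <- factor(rep(c("A", "B", "C"), each = 20))
y <- c(5, 7, 4)[as.integer(group)] + rnorm(60)
fit_cat <- lm(y ~ group)
summary(fit_cat)   # (Intercept) = estimated mean of the baseline level A;
                   # groupB and groupC = estimated differences from the baseline
```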

Difference between fitting a linear model with a continuous variable and a
categorical variable: a continuous variable is modelled with a single slope
coefficient, whereas a categorical variable is modelled through dummy
variables, giving one coefficient (a difference from the baseline) for each
non-baseline level.
Interactions: when the effects of continuous explanatory variables are not
additive but the effect of X1 depends on the level of X2 , X1 and X2
interact

When do we try to add an interaction term to the model: when the effect
of X1 depends on the level of X2

Overfitting vs underfitting:

Overfitting occurs when a model is overly complex, fitting noise in


the data rather than the underlying pattern. This results in:

o Large standard errors due to parameter estimates being based


on limited data.

o Poor performance in predicting new data because of high


uncertainty in the estimates.

Underfitting happens when a model is too simple or rigid to


capture the data's structure. This leads to:

o Bias in the model, as it fails to describe the data adequately.

o Precise but inaccurate parameter estimates, giving a false


sense of confidence.

o Poor predictive performance, as predictions are often "precisely


wrong."

What is the method of maximum likelihood used for: it estimates


parameters, where it is an important alternative to the method of least
squares for estimating parameters

What is likelihood: the likelihood is a function that tells us how ‘likely’ each
parameter value is given the observed data and our assumed model; it is
a function of the unknown parameters. And the likelihood of each
parameter value is judged by how likely it makes the observed data

Maximum likelihood estimate: the maximum likelihood estimate (MLE) is that
value of the parameter (e.g. p) which maximizes the probability of the observed data
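
A toy sketch of maximum likelihood for a binomial probability p, assuming 7 successes in 10 trials (numbers chosen purely for illustration):

```r
# Log-likelihood of p for 7 successes out of 10 trials
loglik <- function(p) dbinom(7, size = 10, prob = p, log = TRUE)
# The MLE is the value of p that maximises the (log-)likelihood
optimize(loglik, interval = c(0, 1), maximum = TRUE)$maximum   # approximately 0.7
```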

How does parameter estimation differ between least squares regression
and maximum likelihood: for least squares regression we minimize the
error sum of squares; for maximum likelihood we write down the
likelihood function and maximize it with respect to all parameters.
What does the likelihood of a fitted model represent: The likelihood of a
fitted model is defined as the value of the likelihood function at the
maximum likelihood estimate(s). This quantity can be used to compare
models: models with higher likelihood are better.

Model selection:

Models suitable for data mining and hypothesis generation (automated


model selection): all subsets regression, stepwise regression

All subsets regression: fits all possible models with p explanatory variables
(there are 2^p−1 − 1 possibilities)

Problem with this approach: The problem with this approach is that one
quickly ends up fitting many models to a limited data set. This approach is
guaranteed to lead to overfitting

Forward vs backward stepwise regression: forward selection starts with an


empty model and adds variables one by one, while backward selection
starts with a full model and removes variables one by one

Problem with stepwise regression: overfitting, you are almost guaranteed


to find some spurious results: variables with no predictive power appear
statistically important

Automated model use cases: We see two uses for these automated model
selection routines: the first is for generating hypotheses as mentioned
above; the second is after you have conducted a rigorous model selection
analysis and you want to explore further patterns in the data that no-one
expected

What we don’t use automated model uses for: these methods should not
be used for testing scientific hypotheses because they lead to overfitting
and spuriously significant results

What is multiple testing: which arises when numerous null hypotheses are
tested simultaneously.

Problem with multiple testing: If many true null hypotheses are tested,
some will appear significant (e.g., p < 0.05) purely by chance, emphasizes
that without careful planning, researchers risk identifying spurious
relationships due to chance

First step to model selection: construct candidate models: The idea is that
each model represents an alternative hypothesis about the processes that
generated the data and you should be able to justify the inclusion of each
model.

Model comparison and selection steps: we aim to choose a parsimonious


model if it is viable

How do we choose a parsimonious model: To choose a parsimonious


model we trade off goodness-of-fit and number of parameters used

Five methods to choose between models:

The adjusted R2

The residual mean square

Mallow’s Cp statistic

Analysis of variance/deviance

Information criteria

The adjusted R2 : see which model has the highest R2 value, This does
not necessarily mean that it is the best model for prediction. It just means
that this model explains the highest proportion of variance

The residual mean square (or MSE): estimates the residual variance, The
model that minimises the MSE fits the data most closely

What causes MSE to decrease or stabilise: MSE should decrease as more
important variables enter the regression equation, and will tend to
stabilise as the number of variables included in the equation becomes
large

Mallows Cp statistic: used in regression analysis to help select the best


subset of predictors for a linear regression model. The goal is to avoid
overfitting while ensuring the model captures the essential relationships in
the data.

The Cp statistic is calculated as:

Cp=(SSEp/MSEfull)−(n−2p) (n = no. of observations, p = number of


predictors)

How do we interpret Cp: Cp compares the performance of a subset


model to the full model.
A model with a Cp value close to p is considered good. This indicates
that the subset model has a similar predictive performance to the full
model but with fewer predictors.

If Cp is much larger than p, the subset model may be underfitting


(missing important predictors).

If Cp is much smaller than p, the subset model may be overfitting


(including unnecessary predictors).

Analysis of variance :ANOVA is used to compare the variability explained


by a model to the residual variability (unexplained variability). It is most
commonly applied in the context of linear regression or experimental
designs

Steps in ANOVA:

Partition the total variability into explained (SSR) and unexplained


(SSE) components.

Compute the F-statistic to test whether the model explains a significant


portion of the variability.

Compare the F-statistic to a critical value or compute a p-value to


determine significance

What does Akaike's information criterion (AIC) allow: it balances the trade-off
between bias (underfitting) and variance (overfitting)

Why is it important to consider two models that are close to each other in
AIC value: because the AIC values are based on a single data set; with another
data set the relative ranking of the models could differ
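
A hedged sketch of comparing two candidate models by AIC on simulated data (the variable names and the "2 AIC units" guideline are illustrative conventions, not taken from these notes):

```r
# Comparing candidate models by AIC
set.seed(6)
dat <- data.frame(x = rnorm(100), z = rnorm(100))
dat$y <- 1 + 2 * dat$x + rnorm(100)      # z has no real effect on y
m1 <- lm(y ~ x, data = dat)
m2 <- lm(y ~ x + z, data = dat)
AIC(m1, m2)                              # smaller AIC = better fit/complexity trade-off
logLik(m1); logLik(m2)                   # the maximised log-likelihoods AIC is based on
# Models within roughly 2 AIC units are often treated as similarly supported
# by this particular data set; the ranking could change with new data.
```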

MODULE 2 Experimental Design

Difference in interpreting observational and experimental studies: It is


only by experimentation that we can infer causality

Why: in an experiment, if a change in variable A, say, results in a change in the
response Y, then we can be sure that A caused this change, because all
other factors were controlled and held constant. In an observational study,
if we note that Y changes as variable A changes, we can say that A is
associated with a change in Y but we cannot be certain that A itself was
the cause of the change.
Why do we need experimental design:

 An experiment is almost the only way in which one can control all
factors to such an extent as to eliminate any other possible
explanation for a change in response other than the treatment
factor of concern, allowing us to infer causality
 Well-designed experiments are easy to analyse. Estimates of
treatment effects are independent, i.e. no issues of multicollinearity,
with different variables vying for the right to explain variation in the
response
 Experiments are frequently used to find optimal levels of settings
(treatment factors) which will maximise (or minimise) the response.
 In an experiment we can choose exactly those settings or treatment
levels we are interested in

Experimental unit: this is the entity (material) to which a treatment is


assigned, or that receives the treatment

Observational unit: the entity from which a measurement is taken

The number of observational units determines how often a measurement is
taken within an experimental unit

Treatment factors: is the factor which the experimenter will actively


manipulate, in order to measure its effect on the response (explanatory
variable).

What are homogeneous experiments: If there are no distinguishable


differences between the experimental units prior to the experiment the
experimental units are said to be homogeneous

Why do we desire homogeneous experimental units: The more


homogeneous the experimental units are, the smaller the experimental
error variance (natural variation between observations which have
received the same treatment) will be

How do we account for difference between experimental units: If the


experimental units are not homogeneous, but heterogeneous, we can
group sets of homogeneous experimental units and thereby account for
differences between these groups. This is called blocking
What does blocking allow us to do:

 Blocking allows us to tell which part of the total variation is due to


differences between treatments and which part is due to differences
between blocks (blocks are variable)
 Within one block, the experimental units are similar and we can
compare the treatments more easily
 Removes between-block variation from experimental error,
improving precision
 Tests treatments across diverse conditions

What is “location” in terms of blocking factors: The experimental units at


one location can be expected to have different characteristics (more
shady) than those at another location (more sunny)

What is a replicated experiment: one in which a treatment is applied
independently to more than one experimental unit

What is pseudo replication: Mistaking multiple measurements within the


same experimental unit as independent replicates

What problems does pseudo replication cause: The problem is that


without true replication, we don’t have an estimate of uncertainty, of how
repeatable, or how variable the result is if the same treatment were to be
applied repeatedly

Three fundamental principles of experimental design: Replicate,


randomise and reduce unexplained variation

Aim of replication: This ensures that the variation between two or more
units receiving the same treatment can be estimated and valid
comparisons can be made between the treatments

What ensures proper replication: treatments are set up independently for each
experimental unit, to prevent confounding

What is confounding: it is not possible to separate the effects of two (or


more) factors on the response

Aim of randomisation: allocating treatments to experimental units
in such a way that all experimental units have exactly the same
chance of receiving a specific treatment
What does randomisation ensure:

 There is no bias on the part of the experimenter


 No experimental unit is favoured to receive a particular treatment
 differences between treatment means can be attributed to
differences between treatments, and not to any prior differences
between the treatment groups,
 allows us to assume independence between observations

Aim of reduction of experimental error variance: we want to reduce the


experimental error variance between experimental units because larger
unexplained variation makes it harder to detect differences between
treatments

How can we reduced experimental error variance:

 Controlling extraneous factors


 Blocking

Designing an experiment factors:

 Treatment factors and their levels


 The response
 Experimental material/ units
 Blocking factors
 Number of replicates

What is treatment factors: The factors/variables that are investigated,


controlled, manipulated, thought to influence the response, are called
treatment factors

Treatment structure:

 Single factor: the treatments are the levels of a single treatment


factor.
 Factorial: an experiment with more than one treatment factor in
which the treatments are constructed by crossing the treatment
factors: the treatments are all possible combinations of the a levels
of factor A and the b levels of factor B, resulting in a × b treatments
 Nested: If factors are nested, the levels of one factor, B, will not be
identical across all levels of another factor A. Each level of factor A
will contain different levels of factor B. We would say B is nested in A
What is the significance of a control treatment: A control treatment is a
benchmark treatment to evaluate the effectiveness of experimental
treatments

Blinding in experiments function: Prevents bias when humans are involved


as experimental subjects or observers, as expectations can consciously or
unconsciously influence results

Types of blinding:

 Single-blind: Either the participant or the observer does not know


which treatment was given.

 Double-blind: Both the participant and observer are unaware of


treatment assignments (gold standard for minimizing bias).

When to use blocking: Are there any structures/differences that need to be


blocked? Do I want to include experimental units of different types to
make the results more general? How many experimental units are
available in each block?

Two basic designs:

Completely Randomized Design: This design is used when the


experimental units are all homogeneous (no blocking required). The
treatments are randomly assigned to the experimental units.

Randomized Block Design: This design is used when the experimental


units are not all homogeneous but can be grouped into sets of
homogeneous units called blocks (one blocking factor). The treatments
are randomly assigned to the units within each block.

Methods of randomisation:

 For completely randomized designs the experimental units are not


blocked, so the treatments (and their replications) are assigned
completely at random to all experimental units available (hence
completely randomized).
 If there are blocks, the randomization of treatments to experimental
units occurs in each block.
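
A sketch of how such randomisation could be done in R with sample(); the treatment and block names below are placeholders:

```r
# Randomising treatments to experimental units
set.seed(7)
treatments <- c("control", "A", "B", "C")

# Completely randomised design: 3 replicates per treatment, assigned at random
crd <- sample(rep(treatments, each = 3))

# Randomised complete block design: randomise separately within each block
rcbd <- lapply(paste0("block", 1:3),
               function(b) data.frame(block = b, treatment = sample(treatments)))
do.call(rbind, rcbd)
```
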
Observational vs. Experimental Studies

 Observational studies rely on observation rather


than manipulation of variables.

 Cannot establish causality, only associations

 Randomisation → Random sampling (to avoid selection bias).


 Blocking → Stratification (grouping similar units into strata)

Balanced experiment: An experiment with the same number of replicates


for each treatment is balanced

What tests do we use to determine differences between treatments:

 T-test: two groups


 ANOVA: more than two groups

Single-Factor Completely Randomized Design:

 Single-factor design: Only one factor is tested (e.g., plant


species), with multiple levels (e.g., 9 species + control).

 No blocking: Treatments are randomly assigned to experimental


units (e.g., containers) without restrictions.

 One observation per unit: Ensures independence of data points

We can use ANOVA to analyse variance, its requirements are:

 There are no outliers.


 All groups have equal population variance.
 The errors are normally distributed. (in side by side boxplot,
asymmetric boxes can show non-normal distribution)
 The errors are independent.

How to find outliers in data: use side by side box plots

How do outliers arise: experimental errors, or they indicate additional
processes that were not part of the planned experiment

How to deal with outliers: the safest approach is to run the analysis with and
without these outliers to see whether the main conclusion depends on
whether these observations are part of the analysis or not.

How to check for equal population variance:

 Sample variances won’t be identical due to random variation.

 Goal: Verify if differences are small enough to justify the assumption

 Visually :Side-by-side boxplots (by treatment group) help


compare variability (sizes of IQR)

 Quantitatively: in R, check that the ratio of the largest to the smallest
sample variance is less than about 5

Independence Assumption:

 Violations occur when unaccounted factors (e.g., time, space,


hidden variables) introduce correlation. Examples:

o Unmodeled blocking factors, instrument drift, environmental


changes (e.g., temperature), or shared conditions (e.g.,
contaminated resources).

 Effects: Autocorrelated residuals can bias estimates or


misrepresent standard errors.

 Diagnostic tools:

o Cleveland dot plots (data/residuals vs. observation order) to


spot temporal/spatial trends.

o Plot residuals against spatial/experimental coordinates

Analysing completely randomised designs with two levels: t-test:

 To see if a treatment had an effect, compare it to the control with a


t-test

Analysing completely randomised designs with two or more levels: ANOVA:

 Yij = µ + αi + eij, with eij ∼ N(0, σ²)
 i = 1, . . . , a (a = number of treatments)
 j = 1, . . . , r (r = number of replicates)
 Yij = observation on the j-th unit receiving treatment i
 µ = overall or general mean
 αi = effect of the i-th level of treatment factor A, αi = µi − µ
 eij = random error

The fitted / predicted means for the treatments are:

Yˆi = µˆ + αˆi = Y¯i·

How to estimate the necessary parameters:

 µ (the overall mean): µˆ = Y¯··
 αi (treatment effects): αˆi = Y¯i· − Y¯··
 σ² (the error variance): the residual mean square (MSE)

Standard errors and confidence intervals:

 variance of a treatment mean estimate: Var(µˆi) = σ²/ni
 if we assume that two treatment means are independent, the
variance of the difference between the two means is Var(µˆi − µˆj) =
Var(µˆi) + Var(µˆj) = σ²/ni + σ²/nj
 confidence intervals for the population treatment means and for
differences between means are of the form: estimate ± t(α/2, ν) ×
SE(estimate), where t(α/2, ν) is the α/2 percentile of Student's t
distribution with ν degrees of freedom
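
A minimal sketch of fitting and summarising a single-factor completely randomised design in R (simulated data; the treatment names T1–T3 and their means are invented):

```r
# Single-factor CRD analysed with aov()
set.seed(8)
dat <- data.frame(treatment = factor(rep(c("T1", "T2", "T3"), each = 8)))
dat$y <- c(10, 12, 11)[as.integer(dat$treatment)] + rnorm(24, sd = 1.5)

fit_aov <- aov(y ~ treatment, data = dat)
summary(fit_aov)                        # ANOVA table: treatment and residual SS, F-test
model.tables(fit_aov, type = "means")   # estimated treatment means
confint(lm(y ~ treatment, data = dat))  # CIs for the baseline mean and treatment differences
```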

ANOVA is a Linear Model:

o ANOVA (Analysis of Variance) is essentially a regression


model but parameterized for categorical
predictors (factors) rather than continuous ones.

o The key difference lies in how variance is partitioned and


interpreted.

Why Focus on Variance?

o In well-designed experiments, the total variance (sum of


squares) can be split into independent components for
each factor (e.g., treatment, blocking).

o This partitioning allows us to:


 Quantify how much variation is due to each
factor (unlike observational studies, where factors often
overlap).

 Estimate error variance (unexplained variability).

 Conduct hypothesis tests (e.g., whether treatment


effects are significant).

One-Way ANOVA vs. t-Test:

o A one-way ANOVA (single categorical factor) generalizes


the two-sample t-test (assuming equal variances) to more
than two groups.

o Both compare means, but ANOVA handles multiple


levels efficiently.

How variance is partitioned:

The basic idea of ANOVA relies on the ratio of the among-treatment-means
variation to the within-treatment variation. This is the F-ratio

Large F and small F values : Large ratios imply the signal (difference
among the means) is large relative to the noise (variation within groups)
and so there is evidence of a difference in the means. Small ratios imply
the signal (difference among the means) is small relative to the noise
(variation within groups) and so there is no evidence that the means differ

How degrees of freedom are calculated in ANOVA: treatment df = number of
treatments − 1, residual df = number of observations − number of treatments

To check for remaining problems with the model: check distribution plots of the residuals

What is a contrast: A comparison of (groups of) treatments is called a


contrast.

Contrast with a single factor that contains two treatments: a t-test


Two different parameterizations of the ANOVA model for analysing
treatment effects:

1. Sum-to-Zero Parameterization:

o Model: Yij = μ + αi + eij

o Parameters:

 μ = overall mean

 αi = treatment effect (difference between the treatment
mean and μ)

o Constraint: ∑ αi = 0

o Useful for constructing ANOVA tables.

2. Treatment Contrast Parameterization (Default in R):

o Model: Yij = αi + eij

o Parameters:

 α1 = mean of the baseline treatment (first treatment
alphabetically)

 α2, α3, ... = differences between subsequent
treatments and the baseline

o Interpretation:

 α2 = difference between treatment 2 and treatment 1

 α3 = difference between treatment 3 and treatment
1, etc.

o Hypothesis testing: H0: αi = 0 tests whether a treatment
differs from the baseline.
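
A sketch of how the two parameterisations look in R, reusing the simulated single-factor data from the earlier sketch (the contrasts argument switches to sum-to-zero coding):

```r
set.seed(8)
dat <- data.frame(treatment = factor(rep(c("T1", "T2", "T3"), each = 8)))
dat$y <- c(10, 12, 11)[as.integer(dat$treatment)] + rnorm(24, sd = 1.5)

coef(lm(y ~ treatment, data = dat))   # treatment contrasts (R default):
                                      #   intercept = baseline (T1) mean, others = differences from T1
coef(lm(y ~ treatment, data = dat,
        contrasts = list(treatment = "contr.sum")))
                                      # sum-to-zero coding: intercept = overall mean,
                                      #   remaining coefficients = treatment effects (alpha_i)
```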

Why Use Contrasts?

 ANOVA F-test only confirms if differences exist among treatments.

 Contrasts pinpoint which specific groups or pairs differ.

 Flexible for complex comparisons (e.g., weighted averages of


multiple treatments).
Type I Error (False Positive)
Definition: Rejecting the null hypothesis (H0) when it is
actually true.
Probability: Denoted by α (typically set at 0.05).
Example: Concluding two treatments differ when they don't.
The Challenge of Hypothesis Testing
Uncertainty: A small p-value doesn't guarantee H0 is false;
it could be a rare outcome under H0.
Limitation: We can't distinguish between a true effect and a
false positive (Type I error).
Multiple Testing Problem
Issue: Conducting many tests increases the chance of at least
one Type I error.
Bonferroni Inequality:
Worst-case probability of ≥1 Type I error
≈ m × α (where m = number of tests).
Example: 10 tests at α = 0.05 → up to a 50%
chance of at least one false positive.
Experiment-Wise Error Rate: The overall Type I error rate
across all tests in an experiment.

Planned vs. Unplanned (Post-hoc) Contrasts in Statistical Analysis

Key Points

1. Planned (A-Priori) Contrasts


o Definition: Pre-specified comparisons based on
hypotheses defined before seeing the data.
o Advantages:
 Controls Type I error inflation because the number of
tests is limited.
 Stronger, more reliable conclusions since tests are
theory-driven.
o Use Case: Testing specific research questions the experiment
was designed to address.
2. Unplanned (Post-hoc) Contrasts
o Types:
 All possible pairwise comparisons (e.g., Tukey’s
HSD).
 Data-driven comparisons (e.g., picking "interesting"
differences after seeing results).
o Problems:
 Type I error inflation: More tests → higher chance of
false positives.
 Circular reasoning: Selecting extreme differences
biases results.
o Appropriate Use:
 Exploratory analysis (hypothesis generation, not
confirmation).
 Must be clearly labeled as post-hoc to avoid
misinterpretation.
3. Why Planning Matters
o Multiple Comparisons Issue: Unplanned testing increases
false positives (e.g., 20 tests at α=0.05 → ~64% chance of ≥1
false positive).
o Solution: Pre-register hypotheses or use stricter corrections
(e.g., Bonferroni) for unplanned tests.
4. Best Practices
o Prioritize planned contrasts for confirmatory conclusions.
o If conducting post-hoc tests:
 Use adjusted significance thresholds (e.g., Bonferroni,
FDR).
 Clearly distinguish exploratory vs. confirmatory findings
in reporting.
o Avoid "p-hacking": Cherry-picking significant results
invalidates inference.

Managing Multiple Comparisons in Statistical Analysis

Core Problem

 Multiple comparisons inflate Type I errors (false positives).


Even with planned contrasts, excessive testing increases the risk of
spurious findings.

Solutions to Control Experiment-Wise Error Rate

1. Bonferroni Correction

o Approach: Adjust the significance threshold by dividing α by the
number of tests (m), or multiply the raw p-values by m (see the short
R sketch after this list).

o Example: For 5 tests at experiment-wise αE = 0.05, reject H0 only
if p < 0.01 (or use adjusted p-values < 0.05).

o Pros: Universally applicable (not limited to pairwise
comparisons).
o Cons: Overly conservative (reduces power; high false-
negative risk).

o For Confidence Intervals: Use higher confidence levels
(e.g., 99% CIs for 5 tests to maintain 95% experiment-wise
coverage).

2. Planned vs. Unplanned Tests:


o Planned contrasts: Fewer tests → less severe correction
needed (e.g., Bonferroni for small m).
o Unplanned/post-hoc tests: Require stricter methods (e.g.,
Tukey, Scheffé).
3. Regression Context: Testing many predictors without hypotheses
also risks false positives; apply corrections similarly.
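
A small sketch of a Bonferroni adjustment with p.adjust(); the raw p-values below are hypothetical:

```r
# Bonferroni adjustment of p-values from m = 5 planned tests
p_raw <- c(0.003, 0.020, 0.041, 0.180, 0.650)   # hypothetical raw p-values
p.adjust(p_raw, method = "bonferroni")          # multiplies each by m (capped at 1)
# Equivalently, compare each raw p-value to alpha / m = 0.05 / 5 = 0.01
```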

Tukey's HSD for Pairwise Comparisons

Purpose

 Controls the experiment-wise Type I error rate (α) when


conducting all possible pairwise comparisons between group
means in ANOVA.

Key Features

1. Balanced Design Recommended:

o Works best when groups have equal sample sizes (n).

o Uses the residual standard error (s) and degrees of


freedom (ν) from ANOVA.

2. Formula:

HSD = q(α, a, ν) · s / √n   (s = residual standard error, n = replicates per group)

o q(α, a, ν): Critical value from the studentized range
distribution (accounts for multiple comparisons).

o Interpretation: Two means are significantly different if their


difference exceeds HSD.

3. Experiment-Wise Error Control:

o Ensures the probability of at least one false positive across


all comparisons is α (e.g., 5%).
Advantages

 Stronger than Bonferroni for pairwise tests (less conservative).

 Automatically adjusts for all comparisons.

When to Use

 Post-hoc analysis after a significant ANOVA.

 No pre-planned hypotheses; exploratory comparison of all


groups.
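
A brief sketch of Tukey's HSD in R, reusing the simulated single-factor data from the earlier ANOVA sketch:

```r
set.seed(8)
dat <- data.frame(treatment = factor(rep(c("T1", "T2", "T3"), each = 8)))
dat$y <- c(10, 12, 11)[as.integer(dat$treatment)] + rnorm(24, sd = 1.5)
fit_aov <- aov(y ~ treatment, data = dat)

TukeyHSD(fit_aov)        # all pairwise differences with family-wise adjusted CIs and p-values
plot(TukeyHSD(fit_aov))  # intervals that exclude 0 indicate significant pairwise differences
```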

Scheffé's Method for Multiple Comparisons

Purpose

 Controls the experiment-wise Type I error rate (α) for any


number and type of contrasts, including complex, non-pairwise
comparisons.

Key Features

1. Flexibility:

o Works for all possible contrasts (not just pairwise), making


it ideal for exploratory analyses with many unplanned tests.

o Based on the F-distribution, adjusting p-values and


confidence intervals conservatively.

2. Formula:

o Adjusted p-value:

p_adj = P( F(a−1, ν) ≥ T² / (a−1) )

where:

 T = test statistic (e.g., Lˆ / SE(Lˆ)).

 a = total number of groups.

 ν = residual degrees of freedom.

 F(a−1, ν) = the F-distribution (critical F-value) with a−1 and ν degrees of freedom.

3. Comparison with Other Methods:


o More powerful than Tukey’s HSD for general
contrasts (e.g., comparing group averages).
o More conservative for pairwise comparisons (wider CIs,
larger p-values).

Advantages & Limitations

 Pros:

o Controls error rate for all possible contrasts, not just


pairwise.

o More powerful than Bonferroni for complex comparisons.

 Cons:

o Overly conservative for pairwise tests (Tukey’s HSD is


better).

o Reduces power (increases Type II error risk).

The Importance of Power Analysis in Ecological


Experiments
What is Power Analysis?

 Purpose: Determines the sample size needed to detect a


biologically meaningful effect or estimates the probability of
detecting an effect given existing constraints.

Statistical Decisions and Errors

1. Type I Error (False Positive): Rejecting a true null


hypothesis (H₀) at rate α (e.g., 5%).

2. Type II Error (False Negative): Failing to reject a false H₀ at


rate β.

 Power = 1−β: Probability of correctly detecting a true


effect.

Implementing Power Analysis

 Requires specifying:

o Effect size:

o Variability:

o α and power thresholds (e.g., α=0.05, power=0.8).


 Trade-offs: Higher power requires larger samples or larger effect
sizes.

Understanding Statistical Power in Ecological Studies

Core Concept

Power is the probability of correctly detecting a true effect (e.g., a


1%/year decline in a threatened species). Low power risks Type II
errors (missing real declines), with serious consequences for
conservation.

Key Factors Affecting Power

1. Effect Size:

o Larger effects (e.g., 2.5% vs. 1% decline) are easier to detect


(higher power).

o Example: Power jumps from 17% to 70.5% when detecting a


2.5% vs. 1% decline (same variability).

2. Variability (Noise):

o Lower noise (smaller standard deviation) increases power.

o Example: Reducing SD from 1 to 0.3 boosts power from 17%


to 91.5% for a 1% decline.

3. Sample Size/Replication:

o More data (e.g., longer monitoring periods) shrinks standard


errors, enhancing power.

4. Significance Level (α):

o Higher α (e.g., 0.2 vs. 0.05) increases power but raises Type I
error risk (false alarms).

o Trade-off: α = 0.2 gives 88.9% power for a 2.5% decline but


accepts 20% false positives.

Power Analysis Workflow

1. Define:

o Minimum effect size of interest (e.g., 1% population


decline).

o Acceptable α (typically 0.05) and desired power (e.g.,


80%).
2. Estimate: Natural variability (from pilot data or literature).

3. Calculate: Required sample size or achievable power given


constraints.

Practical Implications

 Low power (e.g., 17%) risks missing critical declines → increase


replication or reduce noise (e.g., better monitoring methods).

 If power is unattainable: Reconsider study design (e.g., focus on


larger effects) or seek collaborative data.

Takeaway

Power analysis is a non-negotiable step in ecological research design.


Balancing effect size, variability, α, and sample size ensures studies
can reliably detect meaningful changes—vital for protecting species and
ecosystems.

Power Analysis for ANOVA (F-Test) in Experimental Design

Core Concept

Power analysis for an F-test in ANOVA determines the sample size needed
to detect meaningful differences among treatment means, ensuring the
experiment has sufficient sensitivity (e.g., 80% power) to reject the null
hypothesis when true differences exist

Key Steps & Formulas

1. Model & Hypothesis:

o Model: Yij = μi + eij (single-factor CRD
with a treatments and r replicates).

o Null Hypothesis (H₀): All treatment means (μi) are equal.

2. Non-Central F-Distribution:

o Under H₁ (H₀ false), the F-statistic follows a non-central F-
distribution with:

 Degrees of freedom: (a−1, N−a).

 Non-centrality parameter (λ):

λ = r · ∑(μi − μ̄)² / σ² ≈ r D² / (2σ²)

where D = smallest biologically meaningful difference between any two
means.

3. Power Calculation:

o Power = P(F > Fcritical | H₁),
where Fcritical is the threshold from the central F-
distribution (under H₀).

o Inputs:

 Effect size (D), noise (σ), significance level (α), and
desired power (e.g., 0.8).

Practical Insights

1. Trade-offs:

o Smaller D or higher σ → more replicates needed.

o Higher power or stricter α → larger sample size.

2. Design Implications:

o Underpowered experiments risk Type II errors (missing real
effects).

o Resource planning: Use power analysis to justify sample
size or redesign (e.g., blocking to reduce σ); see the sketch below.

3. Validation:

o For two-treatment designs, F-test power matches t-test
results (e.g., 24 replicates for D = 10, σ = 12).
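
A hedged sketch using base R's power.anova.test(); all of the numbers below (hypothesised group means, σ = 12, power = 0.8) are illustrative, not taken from the notes:

```r
# Replicates per group needed for a one-way ANOVA power analysis
power.anova.test(groups      = 4,
                 between.var = var(c(50, 50, 55, 60)),  # variance of hypothesised group means
                 within.var  = 12^2,                    # error variance (sigma^2)
                 sig.level   = 0.05,
                 power       = 0.8)                     # returns n = replicates per group
```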

Takeaway

Power analysis for ANOVA ensures experiments are designed to detect


meaningful differences efficiently. By quantifying the relationship
between effect size, variability, sample size, and power, researchers
can:

 Optimize resources (avoid over/under-sampling).

 Communicate limitations (e.g., "This experiment can only detect


differences ≥3 units").

 Prevent false negatives in ecological and environmental studies.

The Pitfalls of Retrospective Power Analysis

What is Retrospective Power Analysis?


A post-hoc calculation of statistical power after an experiment fails to
reject the null hypothesis (H₀), using observed effect sizes and variability.
Intended to diagnose whether a non-significant result (large p-value) was
due to:

1. H₀ being true (or the effect being biologically trivial), or

2. Low power (the experiment was incapable of detecting a


meaningful effect).

Key Problems

1. Circular Logic:

o Retrospective power is calculated from the observed effect


size and standard error—the same values that led to the
non-significant result.

o If the observed effect was small relative to noise, power


will always appear low, revealing nothing new.

2. No Additional Insight:

o A non-significant result with wide confidence intervals


(CIs) already indicates low power.

o A narrow CI around zero suggests the true effect is likely


negligible, regardless of power.

3. Misinterpretation Risk:

o Journals sometimes demand retrospective power, but it’s


often misleading. A large p-value cannot prove H₀ is true—
only a lack of evidence against it.

Better Alternatives

1. Pre-Experiment Power Analysis:

o Plan sample sizes before data collection to ensure adequate


power for meaningful effects.

2. Confidence Intervals:

o Report CIs to show the precision of effect estimates.


Example:

 CI: [−2, 1] (narrow, includes zero) → Likely no


meaningful effect.
 CI: [−10, 15] (wide) → Inconclusive; suggests low power.

3. Replication or Meta-Analysis:

o Combine results with future/past studies to improve precision.

Takeaway

 Retrospective power analysis is redundant—it echoes what CIs


and p-values already reveal.

 Focus on pre-planning and transparent reporting:

o "We failed to reject H₀, but our CI [−5, 8] cannot rule out a
meaningful effect due to limited samples."

 Avoid journal requests for retrospective power; advocate


for prospective power calculations and CI
interpretation instead.

Idea of blocking:

often there is important variation in additional variables that we are not


directly interested in. If we can group our experimental units with respect
to these variables to make them more similar, we have a more powerful
design. This is the idea of blocking.

Why does this work: If we use blocks in a particular way this will allow us
to separate variability due to treatments, blocks and errors, and thereby
reduce the unexplained variability

What distinguishes blocking factors from treatment factors: whether we can
manipulate the factor and randomly assign experimental units to its levels

Randomisation in block designs: Whatever the choice, the experimental


units within each block again need to be randomised to treatments and
randomisation happens independently in each block.

Blocking and non-demonic intrusions: if an accident causes extra noise in the
data, blocking can be a distinct advantage, because experimental units
within a block are likely to suffer in the same way from such accidents

Assumption of no interaction in block experiments: we assume that the
treatments and blocks do not interact, that is, the effect of the treatment
does not depend on which block it is in, or in other words, the effect of
treatment i is the same in every block. (Interaction: if two factors, A and B
say, interact, this means that the effects of A depend on the level of factor
B, or the effects of B depend on the level of factor A.)

Balanced block design and its significance:

 each block is the same with respect to treatments, i.e. the same set
of treatments occurs in every block.
 In balanced designs treatment and block effects can be completely
separated (are independent).

Incomplete block design: If the blocks are too small to receive all
treatments, we can only have a subset of the treatments in each block

When is an incomplete block design considered balanced: If all pairs of


treatments occur together within a block an equal number of times, the
design is still balanced.

Generalised randomised complete block design:

If we have more experimental units in a block than treatments, we can


replicate (some) of the treatments in each block, leading to a generalized
randomized block design

Advantage of Generalised randomised complete block design:

The advantage of a generalized randomized complete block design is that


you can estimate the interaction between block and treatment if each
treatment is replicated at least twice in each block

How to check if data meet assumptions fairly well:

 The first thing we need to check is that the analysis matches the
design (blocking, randomisation, sample size)
 The next thing to check is that there are no outliers
 all population variances are equal
 errors are normally distributed.
 errors are independent.
 effects of blocks and treatments are additive, the treatment effects
are similar in all blocks.

How to check if the effects of blocks and treatments are additive:

add a colour-coded line for each block (babbler) to the side-by-side boxplots
(Fig. 5.12). Each line connects the experimental units of one block
(babbler). If the lines don't cross too much, it is reasonable to assume
that the different call types had similar effects for all babblers.

Why Additivity Matters:

1. Simplifies Interpretation:

o If effects are additive, the treatment effect (difference


between call types) is the same regardless of the block
(babbler).

o This means we can analyze the data using a simple additive


model (e.g., two-way ANOVA without an interaction term).

When do blocks improve the efficiency of a design:

we might be interested in whether blocking increased the efficiency of the
design, i.e. reduced the unexplained variation. If F > 1, then blocking did
reduce the unexplained error variance. If F ∼ 1, the blocks did not improve the
power of the experiment and you would have been about equally well off
with a completely randomised design

how do we obtain estimates for the treatment and block effects: we


minimize the error sum of squares (method of least squares)

Contrasts vs. ANOVA:

o ANOVA tests whether any differences exist between


treatments but doesn’t pinpoint specifics.

o Contrasts (e.g., comparing fenced vs. unfenced trees)


directly address hypotheses like H₀: μ₁ − μ₂ = 0.

Regression vs. ANOVA:


o lm(): Used for estimating coefficients (e.g., treatment
difference = 0.96, SE = 0.42) and testing contrasts.
1. Example: lm(diff1m ~ treat + factor(pair)) yielded identical
results to the paired t-test (t = 2.27, p = 0.06).
o aov(): Preferred for partitioning variance across factors (e.g.,
treatment, block).
Interacting factors meaning: the effects of one factor depend on the
level or settings of the other factors.

What are factorial experiments: In factorial experiments we have more


than one treatment factor

What are complete factorial experiments: In a complete factorial


experiment every combination of factor levels is studied and the number
of treatments is the product of the number of levels of each factor, i.e.
each treatment is a combination of factor levels: one level from each
factor

Types of effects in an experiment:

Main effect, interaction effect and random effect

Main effect: The main effect of a treatment measures the average change
in response, averaged over all levels of the other factors, relative to the
overall mean. When there is only a single factor in an experiment we only
have main effects

The interaction effect: measures the change in response relative to the main effects with a particular treatment.

Hidden replication within factorial experiments: the levels of each factor are replicated across the levels of the other factor(s).

Equation for factorial experiments and how they are parameterised:

Yijk = μ + αi + βj + (αβ)ij + eijk

where αi is the main effect of level i of the first factor, βj the main effect of level j of the second factor, (αβ)ij their interaction effect, and eijk the error term.
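As an illustrative sketch (hypothetical data frame dat with treatment factors A and B and response y), this model can be fitted in R with:

fit <- aov(y ~ A * B, data = dat)   # A * B expands to A + B + A:B (main effects + interaction)
summary(fit)                        # tests for the two main effects and the interaction
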
Generalised linear models:

Type of data where the response variable is not continuous:

 Binary response: only two possible outcomes


 Binomial response: number of successes out of n trials
 Count response: number of a certain object or event

Where do generalised models differ from linear regression:

In generalized linear models we think in terms of how parameters are related to explanatory variables: probability of success for binary or binomial data, rate parameter for count data, mean for normal data.

What are the parameters in binomial and Poisson distributions directly related to:

the parameters of the binomial and Poisson distributions are directly related to the expected or mean response.

Why are GLMs still considered linear models:

GLM’s are linear models because the parameter (or some form of it) is still
linearly related to the explanatory variables. Linear here refers to
the β coefficients appearing in a linear combination, i.e. the terms are just
added together.

What are link functions:

The link function defines the form of the relationship between the
parameter of interest and the explanatory variables: linear (identity link),
exponential (log link) or S-shaped (logit link).

when do we use log links and logit links:

by using a log link we are assuming that there is an exponential relationship between λ and the explanatory variables, and by using a logit link we are assuming that there is an S-shaped relationship between the probability of success pi and the explanatory variables.

Why do we need a different model for the response in non-normal data:

 count data, binomial or binary data don’t have a normal or even symmetrical distribution
 the variance (or uncertainty) of observations is not constant, as is often the case in GLMs
 the range of the response is limited to integers ≥ 0 for count data, 0 or 1 for binary data, and integers between 0 and n for binomial data

What method is used to estimate parameters in generalised linear models: the method of maximum likelihood.

Why are binary data considered a special case of binomial data:

Binary data are a special case of binomial data, with all ni = 1.

A binomial distribution with n = 1 is also called a Bernoulli distribution, i.e. Yi ∼ Bernoulli(pi).

How to construct a GLM:

 first, we need to specify the (error) distribution for the response variable, and
 secondly, we need to specify how the parameter of interest (probability of lawn grass) is related to the explanatory variables, and to which explanatory variables.

Parameters of the Bernoulli distribution:

 Yi∼Bernoulli(pi). Where n = 1
 Yi is the ith observation (0 for failure, 1 for success), pi is the
probability of success for the ith observation.

Why do we use logit link function:

 Probabilities are usually not linearly related to explanatory variables, but the log odds of success often are, so that the logit link function is a reasonable model.
 The link function (here logit) links the parameter to the linear predictor (linear combination of explanatory variables).
 The limits: the probability or proportion is limited to lie between 0 and 1, but the logit can (theoretically) range from −∞ to ∞.

What are the log odds:

The logit transformation or link function leads to the name logistic regression. logit(pi) = log(pi / (1 − pi)) is called the logit transformation of pi, or the log-odds (natural logarithm).
Benefits of logit transformation:

The logit transformation of a probability parameter

 ensures that, when transforming back to the probability of success, all predicted probabilities range between 0 and 1.

 ensures that logit(pi) can take on any value between −∞ and ∞. So no matter how extreme the x-values become, the predicted value will never be illegal on the log-odds or logit scale.

 is often linearly related to the explanatory variables, whereas pi is not.

In summary, the logistic regression model is:

 Yi∼Bin(ni,pi)
 logit(pi)=β0+β1xi
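A minimal sketch of fitting this model in R (assuming a hypothetical data frame dat with a 0/1 response y and explanatory variable x; for grouped binomial data the response would be cbind(successes, failures)):

fit <- glm(y ~ x, family = binomial(link = "logit"), data = dat)
summary(fit)                      # beta estimates on the logit (log-odds) scale
predict(fit, type = "response")   # back-transformed fitted probabilities p_i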

How are estimates for the β parameters found:

Estimates for the β parameters are found by using the method of maximum likelihood, i.e., by finding those values of β0 and β1 that make the observed values most likely, given the specified model.

Interpreting β coefficients

 factors larger than 1 imply a positive effect → exp(βi) = 1.8: a 1 unit increase in X is associated with the odds of the event being multiplied by 1.8, i.e. an increase of 80%.
 factors less than 1 imply a negative effect → exp(βi) = 0.6: a 1 unit increase in X is associated with the odds of the event being multiplied by 0.6, i.e. a reduction of 40%.
 exp(βi) = 1 implies no effect.
 Note that the effects do not say anything about the baseline rate, just about relative odds!

model for pi:

pi = exp(β0 + β1xi) / (1 + exp(β0 + β1xi))

what are odds:

Odds are ratios of two probabilities. For example, the odds of lawn grass in a given area is the probability of lawn grass divided by the probability of no lawn grass.

what do different log-odds mean:


A log-odds of 0 means that success and failure are equally likely. Negative
log-odds imply probability of success less than 0.5, positive log-odds imply
probability of success larger than 0.5

Why is exp(βi) an odds ratio:


exp(βi) is an odds-ratio because it compares odds at 2 levels
(at x0+1 vs x0). The odds-ratio can be understood as the factor by which
the odds of success change for every one unit increase in the explanatory
variable

 ⇒ exp(βi) gives the odds ratio associated with a one unit increase in Xi.

 ⇒ exp(r × βi) gives the odds ratio associated with an r unit increase in Xi.
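In R these odds ratios can be obtained by exponentiating the estimated coefficients of a fitted logistic regression (sketch, reusing the hypothetical fit and variable name x from above):

exp(coef(fit))            # odds ratio per one unit increase in each explanatory variable
exp(3 * coef(fit)["x"])   # odds ratio for a 3 unit increase in x (r = 3)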

What is the reason for maximum likelihood estimation:

Maximum likelihood estimation involves choosing those values of the parameters that maximize the likelihood or, equivalently, the log-likelihood. Intuitively, this means that we choose the parameter values which maximize the probability of occurrence of the observations.

What does it mean that maximum likelihood estimates are asymptotically normally distributed: it means that the normal approximation to the distribution of the estimates is relatively good in large data sets.

Wald intervals in GLMS:

 A 95% CI for a parameter, based on a normally distributed estimate (e.g., regression coefficient, MLE estimate) can be calculated as: estimate ± 1.96 × SE(estimate)
 In GLMs such intervals are called Wald intervals. ‘Wald’ implies that we have assumed a normal distribution for the MLE estimates.
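A sketch of how a Wald interval could be computed by hand in R, together with the built-in shortcut (assuming a fitted glm object called fit, as above):

est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]
cbind(lower = est - 1.96 * se, upper = est + 1.96 * se)   # Wald 95% CIs by hand
confint.default(fit)                                      # same Wald intervals from R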

Assumptions in binomial distribution for a random variable:

 there is a constant probability of success


 the ni trials are independent. This means that the outcome of one
trial is not influenced by the outcome of the others.
 the final number of trials does not depend on the number of
successes

How to check whether the relationship between the response, or rather the modelled parameter (probability of success in logistic regression), and the explanatory variables is adequately captured by the model in logistic regression:

 Single explanatory variable: plot the observed proportions against the explanatory variable, with the fitted line superimposed onto this plot. For binary data we mostly don’t have proportions, even though it may be possible to calculate proportions for sections of the explanatory variable (as we did in the lawn grass example).

 Several explanatory variables: partial residual plots for every explanatory variable, with the fitted lines superimposed (we can again use R’s visreg() function for this; see the sketch below).
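A sketch of such a check, assuming the visreg package is installed and fit is the fitted logistic regression from earlier:

library(visreg)
visreg(fit, "x", scale = "response")   # partial residual plot for x, fitted curve on the probability scale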

With large n or large counts, what do we check for in glm:

 changes in the mean of the residuals point out that the relationship
between the parameter of interest and the explanatory variables
has been mis-specified
 changes in the variance of the residuals is an indication that the
variance is not adequately described by the assumed model
 skew distributions of the residuals point out that the large
observations are not adequately fitted by the model.
 check for influential observations
 the residuals must be independent, which means that we must
account for spatial, serial or blocking structures in the model
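A simple residual check along these lines (sketch, assuming a fitted glm object called fit):

plot(fitted(fit), residuals(fit, type = "pearson"),
     xlab = "Fitted values", ylab = "Pearson residuals")
abline(h = 0, lty = 2)   # look for trends in the mean or changes in spread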

Methods used for comparing GLMs: likelihood ratio tests and Akaike’s
Information Criterion, AIC (or similar criteria)

When are models considered nested: When one model contains terms that
are additional to those in another model, the two models are said to
be nested

What does the difference in deviance between two nested models show:

It measures the extent to which the additional terms improve the fit.

Residuals in Logistic Regression:

Two types of residuals are commonly used in GLMs:

1. Pearson residuals:

ri = (observed − fitted) / SE(fitted)

2. Deviance residuals:

The deviance of a model is calculated as

D=−2[ℓ(model)−ℓ(saturated model)]

What do ROC and AUC allow us to do: they help us get an idea of how well the fitted model will be able to predict presence/absence given specific values of the covariate.

How do we predict presence/absence given the covariate values: we choose a threshold probability p; fitted probabilities greater than p are classified as presence and those below p as absence.

Sensitivity: the proportion of true positives predicted correctly as positives.

Specificity: the proportion of true negatives predicted correctly as negatives.

What is the receiver operating characteristic curve: The ROC curve plots
the True Positive Rate (TPR, sensitivity) against the False Positive Rate
(FPR, 1-specificity) at different classification thresholds

What does the area under the ROC curve represent: it is used as a measure of how well the classifier works.

Characteristics of a good ROC curve: extends far into the top left corner of
the plot.

What probability does the AUC represent: The AUC gives the probability that a randomly chosen positive is ranked higher (higher predicted probability) than a randomly chosen negative, or, if you have a negative and a positive, the proportion of the time the model will classify/rank them correctly.

Trade off between specificity and sensitivity:

 You can’t maximize both simultaneously (unless the model is perfect).

 Adjusting the classification threshold shifts the balance:

o Higher threshold (e.g., 0.9):

 Fewer positives predicted → higher specificity (fewer FPs), but lower sensitivity (more FNs).

 Example: A cancer screening model set to only flag the clearest cases misses some actual cancers.

o Lower threshold (e.g., 0.1):

 More positives predicted → higher sensitivity (fewer FNs), but lower specificity (more FPs).

 Example: Flagging many patients as "at risk" catches more true cases but also leads to unnecessary tests.
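A sketch of classifying at a chosen threshold and summarising the ROC/AUC in R (assuming the fitted logistic regression fit with 0/1 response dat$y from earlier, and the pROC package installed):

p_hat <- predict(fit, type = "response")
pred  <- ifelse(p_hat > 0.5, 1, 0)          # classify with threshold 0.5
table(observed = dat$y, predicted = pred)   # confusion matrix -> sensitivity, specificity

library(pROC)
roc_obj <- roc(dat$y, p_hat)                # ROC curve object
plot(roc_obj)
auc(roc_obj)                                # area under the ROC curve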

Best methods for comparing GLMs:

 Likelihood ratio tests: The likelihood ratio test can only be used for
nested models, and is valid only asymptotically, i.e., large ni and
large N (number of observations).
 AIC: The AIC can be used for comparing non-nested models and is
valid for small ni and small N, therefore we much prefer the AIC.
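In R, these comparisons could look like this (sketch, assuming two fitted GLMs where fit_small is nested in fit_large):

anova(fit_small, fit_large, test = "Chisq")   # likelihood ratio test (nested models only)
AIC(fit_small, fit_large)                     # smaller AIC indicates the preferred model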

What does the difference in deviance in nested models explain: the difference in deviance between the two nested models measures the extent to which the additional terms improve the fit.

What do overdispersion and underdispersion refer to:

They refer to a failure of the error distribution assumed for the response variable to properly describe the variation in the observed data, and they will often not be picked up in residual plots.

Why does overdispersion and under dispersion occur in non-normal data:
Overdispersion is common in both binomial and count data. Both the
binomial and Poisson distribution have a single parameter: probability of
success, and average rate, respectively. They don’t have an extra
variance parameter to separately describe the variability, as the normal
distribution. Instead, the variance is restricted and directly related to the
one parameter.

Where does overdispersion come from:

Clustering, unmeasured variables, or violated independence, which inflate the variance so that variance >> mean.

Where does underdispersion come from: the data show more uniformity than expected, so that variance < mean.

Consequences of Ignoring Overdispersion:

 Standard errors are too small → false confidence in effects (Type I errors).

 Model fit appears worse (high residual deviance) → may lead to unnecessary model complexity.

How to fix overdispersion:

 Quasi-likelihood models (adjust SEs using a dispersion parameter).

 Mixed models (account for clustering with random effects).

 Alternative distributions (e.g., negative binomial for counts, beta-binomial for proportions).
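Sketches of these fixes in R (hypothetical response and explanatory variables; glm.nb() requires the MASS package):

glm(count ~ x, family = quasipoisson, data = dat)                      # quasi-likelihood for counts
glm(cbind(success, failure) ~ x, family = quasibinomial, data = dat)   # quasi-likelihood for proportions
MASS::glm.nb(count ~ x, data = dat)                                    # negative binomial for overdispersed counts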

Poisson regression:

What is Poisson regression used for: used to model the effect of explanatory variables on a response that is a count of events or occurrences.

What does the Poisson distribution determine: the probability of observing y events per unit time or space.

Independently and randomly meanings: Independently means that the individual events don’t influence each other in when or where they occur; randomly means that they are equally likely to appear at any point in time (or space), but exactly when is unpredictable.

What is the λ parameter in the Poisson distribution: it is the average/mean rate of events (per minute, km or km², or other constant unit) (sometimes μ is used instead of λ). Both the mean, or expected, count, as well as the variance are equal to λ.

What link function is used with the Poisson distribution: the relationship between counts and explanatory variables is commonly exponential, and the link function most commonly used with the Poisson distribution is the log link.

Why does the Poisson distribution work best for counts of rare events: because this is the situation where the two ingredients of the Poisson process are most likely satisfied: independent and random events.

features of count data:

 the variance in the counts increases when the average rate (mean) increases

 the response is strictly an integer ≥ 0

 the counts are not symmetrically distributed around the mean (average rate)

 the mean response often increases exponentially in relation to explanatory variables, not linearly

How are the model parameters estimated: The model parameters are
estimated using the method of maximum likelihood
How the log link is interpreted:

log(λ^i) = β^0 + β^1 xi

λ^i = exp(β^0 + β^1 xi)

β^1 is interpreted as the change in the log average rate per unit increase in the explanatory variable x. To see what happens to the average rate of events when x increases by one unit: the average rate of events λ changes by a factor exp(β1) per unit increase in x.
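A minimal sketch of a Poisson regression in R (hypothetical data frame dat with a count response and explanatory variable x):

fit_pois <- glm(count ~ x, family = poisson(link = "log"), data = dat)
summary(fit_pois)     # beta estimates on the log scale
exp(coef(fit_pois))   # rate ratios: factor by which lambda changes per unit increase in x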

Confidence intervals for λ and change in λ:

On the log-link scale:

β^1 ± 1.96 × SE(β^1)

The above is a confidence interval for the change in log(λ), i.e., the change in the log average rate, per unit increase in x.

exp[β^1 ± 1.96 × SE(β^1)]

is a confidence interval for the factor (rate ratio) by which the average rate λ changes per unit increase in x.
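In R, a Wald interval for the rate ratio can be obtained by exponentiating the interval on the log scale (sketch, using the hypothetical Poisson fit above):

exp(confint.default(fit_pois))   # back-transforms the Wald CI to the rate-ratio scale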

questions we ask to check for model specifications:

 Is the relationship between response and explanatory variables correctly modelled?

 Is the distribution of the response around the expected value correctly modelled?

How do we check these questions:

We often use the residuals to check the above; any patterns in the residuals point out that either or both of the above two parts of the model are not correctly specified.

What does the goodness-of-fit test aim to show for Poisson models:

It hypothesizes (null hypothesis) that the observed counts match a Poisson distribution with a constant rate, implying that the counts may be random.

Important property of the rate parameter when comparing different spatial events:

we need to make sure that the rate parameter λ uses the same unit for all observations.

Offset term: accounts for differently sized units

Fitted value: E(Yi) = µi = niλi

log(µi) = log(ni) + log(λi) = log(ni) + β0 + β1x1i + . . .

log(ni) is known as the offset
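A sketch of including an offset in R (hypothetical variables: a count response and area giving the size of each unit):

fit_off <- glm(count ~ x + offset(log(area)), family = poisson, data = dat)
# offset(log(area)) enters the model with a fixed coefficient of 1, so the
# fitted model describes the rate per unit area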

In an interaction between a continuous and a categorical variable, what are the two types of difference and what do we look for:

 There is a difference in level between the groups, but the rate of change is equal: no interaction, the fitted lines are parallel.

 There is a difference in level and the rate of change also differs between the groups: this means that there is an interaction between the explanatory variables.
