LECTURE 10
Polynomial regression
Our linear model does not necessarily have to be a straight line: it can also be a curve. Such a curve is fitted by polynomial regression. The relationship between x and y is no longer directly proportional, so it is not a straight-line relationship, even though the model itself is still linear (linear in its parameters).
Polynomial regression is a type of
regression that models the relationship
between the independent variable (X) and the
dependent variable (Y) using a polynomial
equation instead of a straight line. It is
useful when the data shows a curved or
non-linear trend that cannot be captured by
simple linear regression.
● Allows you to model a wide range of
non-linear relationships
● It describes non-linear relationships
but does not provide any mechanistic
explanation for these relationships.
● Even though the fitted model is not a
straight line, it is still a linear model
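A minimal sketch in R of fitting a quadratic polynomial regression; the data here are invented purely for illustration:

set.seed(1)
x <- seq(0, 10, length.out = 50)                  # simulated predictor
y <- 2 + 1.5 * x - 0.3 * x^2 + rnorm(50, sd = 1)  # curved trend plus noise
poly_model <- lm(y ~ poly(x, 2))   # quadratic polynomial, still linear in the parameters
summary(poly_model)
lin_model <- lm(y ~ x)             # straight-line fit for comparison
AIC(lin_model, poly_model)         # the polynomial should fit this curved data better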
Linear Model and Categorical Variables
What is a linear model?
“Linear models are those statistical models in which a series of parameters are
arranged as a linear combination” (Murray, 2010)
● a linear model can also use groups (categorical predictors), as ANOVA does
● an ANOVA is essentially the same thing as a linear model with categorical predictors
What happens if our variable is a categorical variable? For example, if it represents
a type of soil and not a number.
Consider a simple experiment designed to study the effect of soil type on plant
growth (biomass production).
● Response variable: biomass (g)
● Categorical variable: soil type (3 levels: T1, T2, T3)
If we have three types of soil, we can substitute T1, T2, and T3 with "dummy 1,"
"dummy 2," and "dummy 3", which are dummy (fictitious) variables. In the
equation, we replace T1, T2, and T3 with these dummy variables.
In our example, y is a numeric dependent variable, and x is a categorical
variable.
To model this, I create a second table, where:
● The first column contains the numeric values of y.
● The other columns contain the dummy variables.
Under each dummy variable column, I write 1 or 0, depending on whether that observation belongs to the level of x that the dummy represents.
For example, if the observation y = 2 belongs only to dummy 1 (soil type T1), I write 1 under dummy 1 and 0 under dummy 2 and dummy 3, because that observation does not belong to the other two groups.
In the equation for yᵢⱼ, I substitute 0 or 1 for each dummy variable, where:
● i = group,
● j = observation
For example, to write the equation for y₂, I consider group 2 (dummy 2) and
observation j. If j = 2, I take the second observation of group 2, e.g., value 7.
● μ is the intercept,
● β is the coefficient (estimated in R).
If calculated manually, β is computed using the least-squares formula for b̂.
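A sketch in R of the same idea; the biomass values are invented, but chosen so that y = 2 is in group T1 and the second observation of group T2 is 7, as in the example above:

biomass <- c(2, 5, 9, 3, 7, 8, 2, 6, 10)                 # invented biomass values (g)
soil    <- factor(rep(c("T1", "T2", "T3"), times = 3))   # soil type of each observation
model.matrix(~ soil)        # shows the 0/1 dummy coding R builds from the factor
fit <- lm(biomass ~ soil)   # intercept = mean of T1; coefficients = differences from T1
summary(fit)
anova(fit)                  # the same linear model analysed as an ANOVA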
Multiple Linear Regression
More than one predictor in our data → multiple linear regression.
When I have more than one variable, I can create many models, but we usually
aim to write the simplest possible model.
No model is perfect, so we simplify as much as possible.
In the equation, we will have multiple variables and thus multiple coefficients.
We always start with having an intercept and an error.
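Written out in the notation used above, the equation with k predictors is:
yᵢ = β₀ + β₁x₁ᵢ + β₂x₂ᵢ + … + βₖxₖᵢ + εᵢ
where β₀ is the intercept, β₁ … βₖ are the coefficients, and εᵢ is the error term.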
Assumptions to test afterward:
● There has to be a linear relationship between xᵢ and the dependent variable
y.
● Normality of residuals.
● Homoscedasticity of residuals (equal variance).
● No multicollinearity (independent variables should not be highly correlated
— they should not show the same trend).
Minimal Adequate Model (MAM)
We aim to find the minimal adequate model, the simplest model that still
explains something about the variables
Procedure:
1. Start with a saturated model, which includes all variables and interactions
— at this stage, it's hard to see which variables are important.
2. Remove one variable at a time, recalculating the model each time.
3. First, remove the highest interaction (e.g., the interaction between all three
variables if there are three).
4. Then, remove two-way interactions, starting with the least significant.
5. Iterate this process until only significant variables remain (based on
p-value).
6. If no parameters remain significant, you reach the null model, which contains
only the overall mean.
So, in model selection we distinguish 4 main types of model:
1. Saturated Model → Too Complicated
● This model tries too hard to fit the data by using a lot of parameters.
● It explains the data perfectly, but it doesn’t help us understand anything
useful.
● Example: Imagine trying to predict student grades using every possible factor
(sleep, diet, number of pets, favorite color, etc.). The model fits the data well,
but it’s too complex and not generalizable.
2. Maximal Model → Complex but Useful
● This model is still complicated, but it contains only variables that seem
important.
● It’s a starting point before simplifying the model further.
● Example: A grade prediction model using study hours, sleep, and past
performance—it still has many variables, but they all seem relevant.
3. Minimal Model → The Best Simple Model
● This is a simplified version of the maximal model that keeps only the most
important variables.
● It balances simplicity and accuracy, so it’s easy to interpret while still
making good predictions.
● Example: After testing, we find that study hours and past performance are the
only really important factors for grades. We remove sleep from the model,
making it simpler but still effective.
4. Null Model → No Explanation, Just an Average
● This model ignores all predictors and just gives a single average value as
the prediction.
● It doesn’t explain anything—it just assumes the same outcome for everyone.
● Example: Predicting that every student gets the class average grade, no
matter how much they study.
Example
Our dependent variable is ozone concentration (y), and we have three
independent variables (x):
● x₁ = wind,
● x₂ = air temperature,
● x₃ = solar radiation.
Possible interactions:
● Triple interaction: x₁ * x₂ * x₃.
● Double interactions: x₁ * x₂, x₂ * x₃, x₁ * x₃.
Collinearity check (prior hypothesis):
In the correlation plot, 0.29 indicates the correlation between temperature and radiation, which is very small since it is close to 0. The light blue colour marks the band between 0.2 and 0.4. From this graph I can see the strongest correlations, which are the ones to consider in the model. For example, the correlation between ozone and temperature is very high (0.7).
In the accompanying scatterplot matrix, I can also see which relationships clearly increase or decrease (e.g. wind and temperature).
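The variables in this example look like R's built-in airquality dataset (Ozone, Wind, Temp, Solar.R); assuming that, the correlation check can be reproduced roughly like this:

vars <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]
cor(vars, use = "complete.obs")   # correlation matrix; Ozone and Temp come out around 0.7
pairs(vars)                       # scatterplot matrix to eyeball the trends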
In R:
● model1 contains all interactions.
● summary(model1) shows the table of results.
● First step: remove variables one by one, starting with the interactions. I can exclude an interaction or a variable only if it is not statistically significant. The asterisks indicate significance, so if there are asterisks next to a variable or interaction, that term is strongly significant (see the sketch below).
● If there are no asterisks anywhere, check the estimate values (the β coefficients): the term with the smallest estimate in absolute value is the one to remove.
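A sketch of this first step, again assuming the airquality data (the object names are mine):

model1 <- lm(Ozone ~ Wind * Temp * Solar.R, data = airquality)   # all main effects and interactions
summary(model1)                                                  # significance stars and estimates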
To move to the next model:
● To move to the next model, in R I write model2 <- update(model1, ...) and remove the interaction from the formula (see the sketch after this list).
● I go on doing the same thing over and over again. In theory, the number of
asterisks should always increase because what I have becomes more and
more significant.
● In the last model, there are asterisks everywhere, so we have reached the minimal adequate model.
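In R syntax, the update steps look roughly like this; which term gets dropped at each stage depends on the summary output:

model2 <- update(model1, . ~ . - Wind:Temp:Solar.R)   # drop the three-way interaction first
summary(model2)
model3 <- update(model2, . ~ . - Wind:Solar.R)        # then the least significant two-way term, and so on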
Assumption checking (in R):
● Plot 4 diagnostic graphs:
○ Homoscedasticity (values should be scattered).
○ Normality (QQ plot).
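In R, these diagnostic plots come from calling plot() on the fitted model (model3 below stands for whichever model is current):

par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a 2x2 grid
plot(model3)           # residuals vs fitted, QQ plot, scale-location, residuals vs leverage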
There is a weird pattern in the first plot on the left (residuals vs fitted): there is very low variation in the residuals at first, and then the spread increases around 0. This horizontal V shape (increase in the spread of the residuals) is a sign that there is no homoscedasticity.
In the QQ plot, the data points at the extremes diverge from the line, so the normality assumption is not met.
The assumptions are not met.
So what do we do?
We can do a data transformation.
If normality is satisfied but variance is not, we can transform data with
logarithms, e.g.,
log(ozone)
● The step() function can make the process faster and carry out the model simplification automatically (sketch below).
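A sketch of both ideas (log transformation and step()), assuming the same airquality example:

model_log <- lm(log(Ozone) ~ Wind * Temp * Solar.R, data = airquality)   # log-transformed response
model_min <- step(model_log)   # step() drops terms, keeping at each stage the model with the lowest AIC
summary(model_min)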
Check model assumptions again!
We have found the minimal adequate model.
● It does meet the assumptions
● the first plot looks like a sky full of stars at night, while the QQ plot shows that the data points now lie much closer to a straight line → the assumptions are met
What is the variable in our minimal adequate model that has the most influence?
Each variable has a different unit of measurement and I cannot compare the
estimators directly, so I have to standardize coefficients.
● Standard deviation (sd) is used to standardize.
● Alternatively, use lm.beta() in R to get standardized coefficients (see the sketch below).
● Once standardized, I can rewrite the initial equation with standardized β.
● wind seems to have the strongest (negative) effect on ozone, followed by radiation and then temperature
● this tells you which variables drive ozone concentration the most
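One way to get standardized coefficients; the lm.beta package is my assumption, and the manual version uses only the standard deviations:

d   <- na.omit(airquality[, c("Ozone", "Wind", "Temp", "Solar.R")])
fit <- lm(log(Ozone) ~ Wind + Temp + Solar.R, data = d)

# manual standardization: beta_std = beta * sd(x) / sd(y)
coef(fit)[-1] * sapply(d[, c("Wind", "Temp", "Solar.R")], sd) / sd(log(d$Ozone))

# or with a helper package (assumed installed)
library(lm.beta)
lm.beta(fit)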
Akaike Information Criterion (AIC)
● It takes into account how well the model fits your data and how many
parameters the model includes. We try to find the best model with a low
number of parameters
● the more complicated the model, the higher the AIC value. The lower the
AIC value, the better.
● In R, you can calculate it using AIC()
● In R, the stepwise calculation makes a first step where it removes nothing; in the second step it removes a term, and the more variables or interactions are removed, the smaller the AIC becomes. R proceeds by selecting the models with smaller AIC values. AIC values are useful for understanding how well the model fits your data.
➡️ The smaller the AIC, the better the fit.
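Comparing some of the models fitted in the earlier sketches directly (they share the same response, so their AIC values are comparable):

AIC(model1, model2, model3)   # the model with the lowest AIC is preferred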
Multicollinearity Check
To verify correlation assumptions:
● Use Variance Inflation Factor (VIF).
● Calculate VIF for each variable.
● The tolerance is 1 − R², where R² is the non-adjusted R² obtained by regressing that variable on the other explanatory variables; VIF = 1 / tolerance.
● The higher the R² value, the stronger the relationship between the variables you include in the model.
● If the VIF is high, the tolerance value will be very low.
● If the VIF is greater than 5, I delete the variable.
How to calculate VIF for the temperature factor?
Step 1: Fit a linear model using temperature as a response, and wind and rad as
explanatory variables
Step 2: Get the non-adjusted R²
Step 3: calculate VIF
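A sketch of the three steps, plus the shortcut via the car package (assumed installed), for the airquality example:

temp_model <- lm(Temp ~ Wind + Solar.R, data = airquality)   # Step 1: temperature as the response
r2 <- summary(temp_model)$r.squared                          # Step 2: non-adjusted R squared
1 / (1 - r2)                                                 # Step 3: VIF = 1 / (1 - R squared)

# shortcut for all predictors at once
library(car)
vif(lm(Ozone ~ Wind + Temp + Solar.R, data = airquality))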