STATISTICS AND RESEARCH DESIGN
Generalized linear models
Tomasz Burzykowski,a,b Melvin Geubbelmans,a,b Axel-Jan Rousseau,a,b and Dirk Valkenborg,a,b
Hasselt, Belgium, and Bialystok, Poland
a Data Science Institute and Center for Statistics, Hasselt University, Hasselt, Belgium.
b Department of Biostatistics and Medical Informatics, Medical University of Bialystok, Bialystok, Poland.
This research received funding from the Flemish Government under the “Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen” program.
Address correspondence to: Tomasz Burzykowski, Hasselt University - Data Science Institute, Agoralaan 1, Building D, B-3590 Diepenbeek, Belgium; e-mail, tomasz.burzykowski@uhasselt.be.
Submitted and accepted, July 2023.
Am J Orthod Dentofacial Orthop 2023;164:604-6
0889-5406/$36.00
© 2023 by the American Association of Orthodontists. All rights reserved.
https://doi.org/10.1016/j.ajodo.2023.07.005

Machine learning (ML) algorithms use statistical models to find patterns or structures in data. These models can formulate predictions for new observations on the basis of these patterns, which the ML algorithm can translate into decisions.

The simple linear regression model¹ is the most fundamental of all statistical and ML models. It is used to describe an effect of a continuous explanatory variable (covariate) on the mean (expected) value of a continuous dependent variable (continuous response). In particular, the mean value of the dependent variable is expressed as a linear function of the covariate.

To illustrate the idea, let us consider Figure A. It presents a scatterplot of measurements of the physical health and environment quality-of-life domains obtained with the help of the World Health Organization Quality of Life BREF questionnaire² for 290 subjects with and without oral submucous fibrosis (OSMF).³ The plot suggests a positive association between the 2 measurements: the physical health scores seem to increase (ie, become more favorable) for increasing (ie, more favorable) environment scores. To describe this relationship, we can apply a simple linear regression model with the physical health score as the dependent variable and the environment score as the explanatory variable. As a result, we obtain the following estimated form of the model:

$$\text{mean physical health score} = 8.9 + 0.33 \times (\text{environment score})$$

The 95% confidence interval for the coefficient of the environment score is 0.27-0.40. As it does not include 0, we can reject (at the 2-sided significance level of 0.05) the null hypothesis that the true value of the coefficient is equal to 0. On the basis of the estimated coefficient value, we can conclude that the mean of the physical health score increases by about 0.33 for a unit increase in the environment score. Figure B includes a straight line illustrating the estimated regression model. The regression line does seem to reasonably fit the cloud of points.

Simple linear regression can be extended to include more than 1 covariate. The extension leads to a multiple linear regression model.¹

In practice, it may be the case that some (or even all) of the potential explanatory variables may not be continuous but rather discrete (categorical) (ie, they may assume only a limited set of values). Sometimes such variables are referred to as factors, with their values referred to as levels. Examples include sex (with 2 levels: male and female), smoking status (with 3 levels: nonsmoker, light-smoker, heavy-smoker), race, and so on. A linear regression model that includes only factors as explanatory variables is equivalent to an analysis of variance model. A linear regression model that includes a mix of covariates and factors can be seen as equivalent to an analysis of covariance (ANCOVA) model.

We can include a factor in a linear regression model using dummy variables (ie, binary variables coding particular factor levels); for a factor with K levels, we should include the dummy variables corresponding to only K − 1 of the levels in the model.

To illustrate the idea, let us consider Figure C. It presents the scatterplot of measurements of the physical health and environment scores, but with colors indicating the OSMF status of the subjects. For controls (subjects without OSMF) and cases (subjects with OSMF), the plot suggests a positive association between the 2 scores. It seems, however, that the physical health scores for cases are slightly lower than for controls. To quantify this observation, we may apply a linear regression model with the physical health score as the dependent variable, and the environment score and the OSMF status as the explanatory variables. In particular, for the OSMF status (a factor with 2 levels), we use 1 dummy variable for the cases (ie, equal to 0 for controls and 1 for cases). The estimated ANCOVA model is as follows:

$$\text{mean physical health score} = 9.9 + 0.29 \times (\text{environment score}) - 0.72 \times (\text{OSMF case})$$

The 95% confidence interval for the coefficient of the environment score is 0.22-0.36, whereas, for the OSMF case, it is −1.00 to −0.43. Neither of them includes 0. Thus, for both explanatory variables, we can reject the null hypothesis that the true value of the coefficient is equal to 0. In particular, we can conclude that irrespective of the OSMF status, the mean of the physical health score increases by about 0.29 for a unit increase in the environment score. In contrast, the mean score decreases, irrespective of the environment score, by about 0.72 for cases (as compared with controls), confirming the observation on the basis of Figure C.

Figure D includes the 2 straight lines illustrating the estimated form of the ANCOVA model. Note that the regression line for cases lies below the line for the controls, indicating a decrease (0.72) in the mean of the physical health score for the cases.

Fig. Physical health and environment domain scores of the World Health Organization Quality of Life BREF questionnaire for 290 persons with and without oral submucous fibrosis: A, a scatterplot of scores for all subjects; B, the result of a simple linear regression model; C, a scatterplot of scores for the cases and controls; D, the result of a multiple linear regression (analysis of covariance) model.

It is worth noting that linear regression models make 2 fundamental assumptions. One, already mentioned above, postulates that the mean of the response is a linear function of the explanatory variables. The other one specifies that the (continuous) response is normally distributed with a constant variance.

In practice, we often deal with responses that are not normally distributed and/or are noncontinuous. For instance, we may be interested in a binary response (eg, presence or absence of teeth in a child) or a response in the form of a count (eg, number of teeth present in a child).

Applying a linear regression model to a noncontinuous response is, strictly speaking, inappropriate because of the nonnormality of the response distribution. Putting aside this issue, an important challenge is that the mean value of such a
response may be naturally restricted. For instance, the mean value of a binary response is the probability of the response (ie, a value from the 0-1 interval). In contrast, the mean value of a count is a nonnegative number. In those cases, making the mean a linear function of explanatory variables may result in a model that would predict, for instance, negative mean values, which are inadmissible.

Generalized linear models (GLMs) address the issues above. In particular, the models allow a nonnormal response distribution as long as this distribution comes from an exponential family. The family includes, among others, the normal distribution (useful for continuous responses), the binomial distribution (useful for binary responses), and the Poisson distribution (useful for counts). In addition, in GLMs, a transformation of the mean value is assumed to be linearly dependent on explanatory variables. This makes it possible to restrict the model-predicted mean to the required range.

The multiple linear regression model is an example of a GLM. In that case, the mean is assumed to be directly (without any transformation) linearly dependent on the explanatory variables.

For a binary response, the logit transformation is most often used:

$$\ln\left[\frac{\text{probability of response}}{\text{probability of nonresponse}}\right] = \text{linear function of explanatory variables},$$

where the function on the left side of the equality sign is called the logit and ln(x) denotes the natural-logarithm function of x.

Note that this implies the following:

$$\text{probability of response} = \frac{e^{\text{linear function of explanatory variables}}}{1 + e^{\text{linear function of explanatory variables}}},$$

where $e^x$ denotes the exponential function of x; thus, the probability (mean) of the response becomes a logistic function of the linear combination of the explanatory variables. Note that the values of the logistic function are naturally limited to the 0-1 interval, which is the range of admissible values for a probability. The resulting GLM is called a logistic regression model.¹

For a count response, the logarithmic transformation of the mean is usually considered:

$$\ln(\text{mean value of the count}) = \text{linear function of explanatory variables}.$$

This model can be equivalently specified as follows:

$$\text{mean value of the count} = e^{\text{linear function of explanatory variables}}.$$

Note that the values of the exponential function are naturally restricted to be positive, as are the mean count values. The logarithmic transformation underlies the Poisson regression model and, in the context of analysis of contingency tables, the log-linear model.

To conclude, GLMs are the fundamental class of statistical models that serves as a starting point for developing many other, more complex models, including those used in ML algorithms. Even the basic GLMs, however, can be used in ML and appear to be potent and interpretable models. For instance, logistic regression models are often applied to address classification problems.⁴

REFERENCES
1. Petrie A, Bulman JS, Osborn JF. Further statistics in dentistry Part 6: multiple linear regression. Br Dent J 2002;193:675-82.
2. World Health Organization. WHOQOL-BREF: introduction, administration, scoring and generic version of the assessment: field trial version. Available at: https://www.who.int/publications/i/item/WHOQOL-BREF. Accessed August 25, 2023.
3. Patnana A, Dutt K. Evaluating the impact of oral submucous fibrosis on oral health quality of life. Available at: https://data.mendeley.com/datasets/wbzr6xv6xp/1. Accessed August 25, 2023.
4. Valkenborg D, Geubbelmans M, Rousseau A-J, Burzykowski T. Supervised learning. Am J Orthod Dentofacial Orthop 2023;164:146-9.

American Journal of Orthodontics and Dentofacial Orthopedics October 2023 Vol 164 Issue 4
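
As a minimal sketch (not the authors' code), the 2 estimated linear models reported in the text can be written as plain Python functions, which makes the role of the dummy variable explicit. The coefficients 8.9, 0.33, 9.9, 0.29, and -0.72 are the rounded estimates quoted in the text. The function names, the level labels "control" and "case", and the environment score of 12 used in the demonstration are assumptions introduced only for illustration.

```python
# Rounded coefficient estimates as reported in the text.
SIMPLE_INTERCEPT = 8.9
SIMPLE_SLOPE = 0.33

ANCOVA_INTERCEPT = 9.9
ANCOVA_SLOPE = 0.29
ANCOVA_OSMF_EFFECT = -0.72


def dummy_code(level, levels):
    """Code one factor level as K - 1 binary dummy variables.

    The first entry of `levels` acts as the reference level, so it is
    represented by all zeros (an illustrative convention).
    """
    if level not in levels:
        raise ValueError(f"unknown level: {level!r}")
    return [1 if level == other else 0 for other in levels[1:]]


def mean_physical_simple(environment_score):
    """Simple linear regression: mean physical health score."""
    return SIMPLE_INTERCEPT + SIMPLE_SLOPE * environment_score


def mean_physical_ancova(environment_score, osmf_status):
    """ANCOVA model: one covariate plus a 2-level factor (1 dummy)."""
    (is_case,) = dummy_code(osmf_status, ["control", "case"])
    return (
        ANCOVA_INTERCEPT
        + ANCOVA_SLOPE * environment_score
        + ANCOVA_OSMF_EFFECT * is_case
    )


if __name__ == "__main__":
    env = 12.0  # hypothetical environment score, not taken from the data
    print("simple model:", round(mean_physical_simple(env), 2))
    print("ANCOVA, control:", round(mean_physical_ancova(env, "control"), 2))
    print("ANCOVA, case:", round(mean_physical_ancova(env, "case"), 2))
    # A 3-level factor needs only 2 dummy variables.
    smoking = ["nonsmoker", "light-smoker", "heavy-smoker"]
    print("heavy-smoker dummies:", dummy_code("heavy-smoker", smoking))
```

Because the dummy variable only shifts the intercept, the predicted difference between a case and a control with the same environment score is always the OSMF coefficient, which corresponds to the 2 parallel lines described for Figure D.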
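
The range restrictions discussed for binary and count responses can be illustrated with a short sketch of the logit link, its inverse (the logistic function), and the inverse of the log link (the exponential function). The intercept and slope below are hypothetical values chosen only to produce linear predictors of both signs; they are not estimates from the OSMF data, and the function names are likewise illustrative.

```python
import math


def logit(p):
    """Logit link: natural log of the odds for a probability in (0, 1)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def logistic(eta):
    """Inverse logit: e^eta / (1 + e^eta).

    Two algebraically equal forms are used so that exp() is only ever
    called with a nonpositive argument and cannot overflow.
    """
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


def linear_predictor(intercept, slope, x):
    """Linear function of a single explanatory variable."""
    return intercept + slope * x


if __name__ == "__main__":
    intercept, slope = -1.0, 0.5  # hypothetical coefficients
    for x in (-10.0, 0.0, 2.0, 10.0):
        eta = linear_predictor(intercept, slope, x)
        print(
            f"x={x:6.1f}  identity link mean={eta:7.2f}  "
            f"logit link mean={logistic(eta):.4f}  "
            f"log link mean={math.exp(eta):.4f}"
        )
```

With the identity link, the predicted mean is the linear predictor itself and can be negative, whereas the logit and log links map the same linear predictor into the 0-1 interval and onto the positive numbers, respectively.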