An introduction to simple linear regression
Published on February 19, 2020 by Rebecca Bevans. Revised on October 26, 2020.
Regression models describe the relationship between variables by fitting a line to the observed
data. Linear regression models use a straight line, while logistic and nonlinear regression models
use a curved line. Regression allows you to estimate how a dependent variable changes as the
independent variable(s) change.
Simple linear regression is used to estimate the relationship between two quantitative
variables. You can use simple linear regression when you want to know:
   1. How strong the relationship is between two variables (e.g. the relationship between
      rainfall and soil erosion).
   2. The value of the dependent variable at a certain value of the independent variable (e.g.
      the amount of soil erosion at a certain level of rainfall).
Example: You are a social researcher interested in the relationship between income and
happiness. You survey 500 people whose incomes range from $15k to $75k and ask them to rate
their happiness on a scale from 1 to 10.
Your independent variable (income) and dependent variable (happiness) are both quantitative, so
you can do a regression analysis to see if there is a linear relationship between them.
If you have more than one independent variable, use multiple linear regression instead.
Table of contents
      1.   Assumptions of simple linear regression
      2.   How to perform a simple linear regression
      3.   Interpreting the results
      4.   Presenting the results
      5.   Can you predict values outside the range of your data?
      6.   Frequently asked questions about simple linear regression
Assumptions of simple linear regression
Simple linear regression is a parametric test, meaning that it makes certain assumptions about
the data. These assumptions are:
   1. Homogeneity of variance (homoscedasticity): the size of the error in our prediction
      doesn’t change significantly across the values of the independent variable.
   2. Independence of observations: the observations in the dataset were collected
      using statistically valid sampling methods, and there are no hidden relationships among
      observations.
   3. Normality: the data follow a normal distribution (strictly, it is the model's residuals
      that should be normally distributed).
Linear regression makes one additional assumption:
   4. The relationship between the independent and dependent variable is linear: the line of
      best fit through the data points is a straight line (rather than a curve or some sort of
      grouping factor).
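One informal way to look at the homoscedasticity assumption is to fit a straight line, compute the residuals, and compare their spread at low and high values of the independent variable. The Python sketch below does this under stated assumptions: the function names and the eight data points are hypothetical, and the ratio it returns is only a rough diagnostic, not a formal statistical test.

```python
def fit_line(x, y):
    """Return (b0, b1) for the least-squares line through the points."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def residuals(x, y):
    """Observed y minus the y predicted by the fitted line."""
    b0, b1 = fit_line(x, y)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


def residual_spread_ratio(x, y):
    """Mean squared residual in the upper half of x divided by the lower half.

    A value far from 1 hints that the error size changes with x.
    """
    paired = sorted(zip(x, residuals(x, y)))  # order the residuals by x
    half = len(paired) // 2
    low = [r for _, r in paired[:half]]
    high = [r for _, r in paired[-half:]]
    ms_low = sum(r * r for r in low) / len(low)
    ms_high = sum(r * r for r in high) / len(high)
    if ms_low == 0:
        return float("inf")
    return ms_high / ms_low


# Hypothetical data: small scatter at low x, large scatter at high x.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.1, 1.9, 3.1, 3.9, 6.0, 5.0, 8.0, 7.0]
ratio = residual_spread_ratio(x, y)
print(ratio)
```

For this made-up data the ratio comes out well above 1, because the points scatter much more widely around the line at high x than at low x. A plot of residuals against x shows the same pattern visually.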
If your data do not meet the assumptions of homoscedasticity or normality, you may be able to
use a nonparametric test instead, such as the Spearman rank test.
Example: Data that doesn’t meet the assumptions. You think there is a linear relationship between
cured meat consumption and the incidence of colorectal cancer in the U.S. However, you find
that much more data have been collected at high rates of meat consumption than at low rates of
meat consumption, with the result that there is much more variation in the estimate of cancer
rates at the low range than at the high range. Because the data violate the assumption of
homoscedasticity, they are not suitable for regression, so you perform a Spearman rank test instead.
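As a rough illustration of the Spearman rank test mentioned above, the Python sketch below computes Spearman's rank correlation coefficient by ranking both variables (tied values share an average rank) and then correlating the ranks. The function names and toy data are hypothetical, and the sketch returns only the coefficient, not a p-value.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / (var_a * var_b) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the ranks of x and y."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: y rises with x, but along a curve rather than a line.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
rho = spearman_rho(x, y)
print(rho)
```

Because the coefficient depends only on the ordering of the values, it makes no assumption about the size of the errors or the shape of the distribution.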
If your data violate the assumption of independence of observations (e.g. if observations are
repeated over time), you may be able to fit a linear mixed-effects model that accounts for
the additional structure in the data.
How to perform a simple linear regression
Simple linear regression formula
The formula for a simple linear regression is:
      y = B0 + B1*x + e
      y is the predicted value of the dependent variable (y) for any given value of the
       independent variable (x).
      B0 is the intercept, the predicted value of y when x is 0.
      B1 is the regression coefficient: how much we expect y to change as x increases by one unit.
      x is the independent variable (the variable we expect is influencing y).
      e is the error of the estimate, or how much variation there is in our estimate of the
       regression coefficient.
Linear regression finds the line of best fit through your data by searching for the regression
coefficient (B1) that minimizes the total error (e) of the model.
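To make this concrete, here is a minimal Python sketch of how a line of best fit can be found. It assumes that "total error" means the sum of squared differences between the observed and predicted values of y (ordinary least squares), in which case B0 and B1 have a closed-form solution. The function names and the five data points are hypothetical, invented for illustration; they are not the income and happiness data.

```python
def fit_simple_linear_regression(x, y):
    """Return (b0, b1) that minimize the sum of squared errors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x must take at least two distinct values")
    b1 = sxy / sxx               # regression coefficient (slope)
    b0 = mean_y - b1 * mean_x    # intercept
    return b0, b1


def predict(b0, b1, x_new):
    """Predicted value of y at x_new on the fitted line."""
    return b0 + b1 * x_new


# Hypothetical data lying exactly on the line y = 1 + 0.5 * x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.5, 2.0, 2.5, 3.0, 3.5]
b0, b1 = fit_simple_linear_regression(x, y)
print(b0, b1)
```

With real data the points will not fall exactly on a line, so the fitted B0 and B1 describe the line that leaves the smallest squared error rather than a perfect fit.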
While you can perform a linear regression by hand, this is a tedious process, so most people use
statistical programs to help them quickly analyze the data.
Simple linear regression in R
R is a free, powerful, and widely used statistical program. Download the dataset to try it yourself
using our income and happiness example.
                           Dataset for simple linear regression (.csv)
Load the income.data da