Week 3
Data visualization I
Main Ideas
• Data visualization is an extremely effective way to express information and
extract meaning from data.
• We can build up an effective visualization systematically, layer by layer, using a
grammar of graphics (`ggplot2`).
• "The simple graph has brought more information to the data analyst's mind than
any other device." - John Tukey
Let us start
• Load the `tidyverse` package. Recall that a package is just a bundle of shareable code.
We rely on `tidyverse` for visualization.
library(tidyverse)
ggplot
• Exploratory data analysis (EDA) is an approach to analyzing datasets in order to
summarize the main characteristics, often with visual representations of the data.
We can also calculate summary statistics and perform data wrangling,
manipulation, and transformation.
• We will use `ggplot2` to construct visualizations. The gg in `ggplot2` stands for
"grammar of graphics", a system or framework that allows us to describe the
components of a graphic, building up an effective visualization layer by layer.
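The layer-by-layer idea can be sketched in a few lines. This is only an illustration, not part of the original notes: it assumes the `mpg` dataset that ships with `ggplot2` as a stand-in for the course data, and the object name `p` is hypothetical.

```r
library(ggplot2)

# Assumption: `mpg` (bundled with ggplot2) stands in for the course data.
# Step 1: the base layer holds only the data and the aesthetic mapping.
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))

# Step 2: each `+` adds one more component to the same plot object.
p <- p + geom_point()
p <- p + labs(x = "Engine displacement (litres)",
              y = "Highway miles per gallon")

# Printing the object draws the plot.
print(p)
```

Storing the plot in an object is optional; it simply makes each step of the build-up visible.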
Dataset (Minneapolis housing)
• We will introduce visualization using data on single-family homes sold in
Minneapolis, Minnesota between 2005 and 2015.
mn_homes <- read_csv("data/mn_homes.csv")
glimpse(mn_homes)
First visualization
• `ggplot()` creates the initial base coordinate system that we will add layers to. We
first specify the dataset we will use with `data = mn_homes`. The `mapping`
argument is paired with an aesthetic (`aes`), which tells us how the variables in
our dataset should be mapped to the visual properties of the graph.
• Base layer
ggplot(data = mn_homes,
mapping = aes(x = area, y = salesprice))
• Add points
ggplot(data = mn_homes,
mapping = aes(x = area, y = salesprice)) +
geom_point()
Aesthetics
• An aesthetic is a visual property of one of the objects in your plot.
- Shape
- Color
- Size
- Alpha (transparency)
We can map a variable in our dataset to a color, a size, a transparency, and so on.
• Color aesthetic
ggplot(data = mn_homes,
       mapping = aes(x = area, y = salesprice,
                     color = fireplace)) +
  geom_point()
• Shape aesthetic
ggplot(data = mn_homes,
       mapping = aes(x = area, y = salesprice,
                     shape = fireplace)) +
  geom_point()
• Graph Labels
ggplot(data = mn_homes,
       mapping = aes(x = area, y = salesprice,
                     shape = fireplace)) +
  geom_point() +
  labs(title = "Sales price vs. area of homes in Minneapolis, MN",
       x = "Area (square feet)", y = "Sales Price (dollars)")
• Practice:
ggplot(data = mn_homes,
mapping = aes(x = area, y = salesprice,
color = fireplace,
size = lotsize)) +
geom_point() +
labs(title = "Sales price vs. area of homes in Minneapolis, MN",
x = "Area (square feet)", y = "Sales Price (dollars)")
**Question:** Are the above visualizations effective? Why or why not? How might
you improve them?
geom_xxx() aesthetics
• You can also set the aesthetic properties of your geom manually. For example, we
can make all of the points in our plot blue:
ggplot(data = mn_homes) +
geom_point(mapping = aes(x = area, y = salesprice), color = "blue")
Use `aes` to map variables to plot features; use arguments in `geom_xxx` for
customization that is not mapped to a variable.
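The difference between mapping and setting can be checked directly. The sketch below is an illustration only: it assumes the `mpg` dataset bundled with `ggplot2` rather than `mn_homes`, and the names `p_set` and `p_map` are hypothetical.

```r
library(ggplot2)

# Setting: color is a fixed value outside aes(), so every point is blue.
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# Mapping: inside aes(), "blue" is treated as a variable with one level,
# so ggplot2 assigns a default palette color and draws a legend.
p_map <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# layer_data() returns the values ggplot2 actually draws for a layer.
unique(layer_data(p_set)$colour)
unique(layer_data(p_map)$colour)
```

Printing `p_map` should show points that are not blue, a common surprise when a constant ends up inside `aes()`.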
Faceting
• We can use smaller plots to display different subsets of the data using faceting.
This is helpful to examine conditional relationships.
• Let's try a few simple examples of faceting. Note that these plots should be
improved by careful consideration of labels, aesthetics, etc.
• `facet_grid()`
• 2d grid
• rows ~ cols
• use . for no split (no faceting in that dimension)
• `facet_wrap()`
• 1d ribbon wrapped into 2d
• facet_wrap()
ggplot(data = mn_homes,
mapping = aes(x = area, y = salesprice)) +
geom_point() +
labs(title = "Sales price vs. area of homes in Minneapolis, MN",
x = "Area (square feet)", y = "Sales Price (dollars)") +
facet_wrap(~ community, nrow=2)
• facet_grid() (cols)
ggplot(data = mn_homes,
mapping = aes(x = area, y = salesprice)) +
geom_point() +
facet_grid(. ~ beds)
• facet_grid() (rows)
ggplot(data = mn_homes,
mapping = aes(x = area, y = salesprice)) +
geom_point() +
labs(title = "Sales price vs. area of homes in Minneapolis, MN",
x = "Area (square feet)", y = "Sales Price (dollars)") +
facet_grid(beds ~ .)
• facet_grid(2 variables)
ggplot(data = mn_homes,
mapping = aes(x = area, y = salesprice)) +
geom_point() +
labs(title = "Sales price vs. area of homes in Minneapolis, MN",
x = "Area (square feet)", y = "Sales Price (dollars)") +
facet_grid(beds ~ baths)
• Adding a smooth line and labels:
ggplot(data = mn_homes,
mapping = aes(x = area, y = salesprice)) +
geom_point() +
geom_smooth() +
labs(title = "Sales price vs. area of homes in Minneapolis, MN",
x = "Area (square feet)", y = "Sales Price (dollars)")
Geometric Objects
• A geom is the geometrical object that a plot uses to represent data.
• We often describe plots by the type of geom that the plot uses.
• For example, bar charts use bar geoms, line charts use line geoms, boxplots use
boxplot geoms, and so on.
• Scatterplots break the trend; they use the point geom.
Geometric Objects (geom_smooth())
• The smooth geom plots a smooth line fitted to the data.
ggplot(data = mn_homes,
mapping = aes(x = area, y = salesprice)) +
geom_point() +
geom_smooth()
Run `?geom_smooth` in the console. What does this function do?
geom_smooth() linetype
• You can set the linetype of a line.
• geom_smooth() will draw a different line, with a different linetype, for each
unique value of the variable that you map to linetype.
ggplot(data = mn_homes,
mapping = aes(x = area, y = salesprice, linetype=community)) +
geom_point() +
geom_smooth()
Practice
• Create a scatterplot of two variables of your choosing from the
`mn_homes` data.
• Modify your scatterplot by coloring the points according to
community.
What is next?
• Bar plots (geom_bar)
• Colored bar (color, fill)
• Position Adjustments
• Coordinate Systems
Bar Plots (Categorical variables)
• Bar plots allow us to visualize categorical variables.
• Bar charts seem simple, but they reveal something subtle about plots: the bar heights (counts) are not a variable in the dataset. `geom_bar()` computes them from the data before plotting.
• We use geom_bar() for bar plots.
ggplot(data = mn_homes) +
geom_bar(mapping = aes(x = community)) + coord_flip()
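To see that the bar heights are computed rather than read from the data, the counts can be reproduced by hand. This is a sketch under stated assumptions: it uses the `mpg` dataset bundled with `ggplot2` instead of `mn_homes`, and the names `counts`, `p`, and `bar_heights` are hypothetical.

```r
library(ggplot2)
library(dplyr)

# Count the rows in each category by hand.
counts <- count(mpg, class)

# Build the bar plot and pull out the values geom_bar() computed.
p <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class))
bar_heights <- layer_data(p)$count

# The two sets of numbers should agree, category by category.
counts
bar_heights
```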
Bar Plots (Categorical variables)
• The default y-axis is the count.
• We can override that with the proportion, as in the following example:
ggplot(data = mn_homes) +
# after_stat(prop) replaces the deprecated ..prop.. notation (ggplot2 >= 3.3.0)
geom_bar(mapping = aes(x = community, y = after_stat(prop), group = 1)) + coord_flip()
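The role of `group = 1` can be checked with a small experiment. The sketch below assumes the `mpg` dataset bundled with `ggplot2` as a stand-in for `mn_homes`, and uses `after_stat(prop)`, the current spelling of `..prop..`; the object names are hypothetical.

```r
library(ggplot2)

# Without group = 1, each category is its own group, so every
# proportion is computed within a single bar and equals 1.
p_no_group <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class, y = after_stat(prop)))

# With group = 1, all rows form one group, so the proportions
# are shares of the whole dataset and sum to 1.
p_group <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class, y = after_stat(prop), group = 1))

layer_data(p_no_group)$prop
layer_data(p_group)$prop
```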
Colored Bar Charts
• You can color a bar chart using either the color aesthetic, or more
usefully, fill.
ggplot(data = mn_homes) +
geom_bar(mapping = aes(x = community, fill = community)) + coord_flip()

ggplot(data = mn_homes) +
geom_bar(mapping = aes(x = community, fill = fireplace)) + coord_flip()
Position Adjustments for Bar Chart
• Besides the default "stack", there are three other options for the `position`
argument of a bar chart: "identity", "dodge", or "fill".
Try the following:
ggplot(data = mn_homes) +
geom_bar(mapping = aes(x = community, fill = fireplace), position = "dodge") + coord_flip()
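The position options are easiest to compare side by side. This is a minimal sketch, assuming the `mpg` dataset bundled with `ggplot2`, with `class` and `drv` standing in for `community` and `fireplace`; the object names are hypothetical.

```r
library(ggplot2)

# Default ("stack"): segments are stacked, bar height is the total count.
p_stack <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class, fill = drv))

# "dodge": segments are placed next to each other for direct comparison.
p_dodge <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class, fill = drv), position = "dodge")

# "fill": stacked, but every bar is rescaled to the same height, which
# makes proportions comparable across categories.
p_fill <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class, fill = drv), position = "fill")

print(p_stack)
print(p_dodge)
print(p_fill)
```

"identity" is left out here because it draws the bars on top of each other, which usually needs transparency (`alpha`) to be readable.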
### Types of variables
• Numerical variables can be classified as either **continuous** or
**discrete**.
Continuous numeric variables have an infinite number of possible values
between any two values. Discrete numeric variables have a countable
number of values.
- Height (continuous)
- Number of siblings (discrete)
### Numeric variables
• To describe the distribution of a numeric variable, we will use the properties below:
- Shape
  - skewness: right-skewed, left-skewed, symmetric
  - modality: unimodal, bimodal, multimodal, uniform
- Center: mean (`mean()`), median (`median()`)
- Spread: range (`range()`), interquartile range (`IQR()`)
- Outliers: observations outside the pattern of the data
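The measures of center and spread listed above can be computed in one call. The sketch below is illustrative only: it assumes the `hwy` variable of the `mpg` dataset bundled with `ggplot2` instead of a column of `mn_homes`, and the column names of the result are hypothetical.

```r
library(ggplot2)  # used here only for the bundled mpg data
library(dplyr)

hwy_summary <- mpg %>%
  summarise(
    mean_hwy   = mean(hwy),    # center
    median_hwy = median(hwy),  # center, less sensitive to outliers
    min_hwy    = min(hwy),     # range, lower end
    max_hwy    = max(hwy),     # range, upper end
    iqr_hwy    = IQR(hwy)      # spread of the middle 50% of values
  )

hwy_summary
```

Comparing the mean and the median gives a first hint about skewness: a mean well above the median suggests a right-skewed distribution.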
We will continue our investigation of home prices in Minneapolis, Minnesota.
```{r load-data, message = FALSE}
mn_homes <- read_csv("data/mn_homes.csv")
```
Add a `glimpse()` to the code chunk below and identify the following variables as
numeric continuous, numeric discrete, categorical ordinal, or categorical nominal.
- area
- beds
- community
```{r glimpse-data}
```
• We can use a **histogram** to summarize a single numeric variable.
```{r histogram}
ggplot(data = mn_homes,
       mapping = aes(x = salesprice)) +
  geom_histogram(bins = 25)
```
A **density plot** is another option. It can be thought of as a smoothed version
of a histogram: a smooth curve takes the place of the bars.
```{r density-plot}
ggplot(data = mn_homes,
       mapping = aes(x = salesprice)) +
  geom_density()
```
Side-by-side **boxplots** are helpful to visualize the distribution of a numeric
variable across the levels of a categorical variable.
```{r boxplots}
ggplot(data = mn_homes,
       mapping = aes(x = community, y = salesprice)) +
  geom_boxplot() + coord_flip()
```
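A numeric companion to side-by-side boxplots is a grouped summary. This is a sketch under stated assumptions: it uses the `mpg` dataset bundled with `ggplot2`, with `class` and `hwy` standing in for `community` and `salesprice`, and the name `class_summary` is hypothetical. The quartiles computed here should correspond roughly to the box edges and center line of each boxplot.

```r
library(ggplot2)  # used here only for the bundled mpg data
library(dplyr)

class_summary <- mpg %>%
  group_by(class) %>%
  summarise(
    q1     = quantile(hwy, 0.25),  # lower edge of the box
    median = median(hwy),          # line inside the box
    q3     = quantile(hwy, 0.75),  # upper edge of the box
    .groups = "drop"
  )

class_summary
```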
• **Question:** What is `coord_flip()` doing in the code chunk above?
Try removing it to see.
General principles for effective data visualization
- keep it simple
- use color effectively
- tell a story
Take a look at https://github.com/GraphicsPrinciples/CheatSheet/blob/master/NVSCheatSheet.pdf
for how to think through creating an effective visualization.
## Practice
• 1) Modify the code outline to create a faceted histogram examining the
distribution of year built within each community. When you are finished remove
`eval = FALSE` and knit the file to see the changes.
```{r eval = FALSE}
ggplot(data = mn_homes, mapping = aes(x = ______)) +
geom_histogram() +
facet_wrap(~______) +
labs(x = "______",
title = "_______",
subtitle = "Faceted by ______")
```
## Additional Resources
- https://raw.githubusercontent.com/rstudio/cheatsheets/master/data-visualization-2.1.pdf
- https://github.com/GraphicsPrinciples/CheatSheet/blob/master/NVSCheatSheet.pdf
- https://ggplot2.tidyverse.org/
- http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html
- https://medium.com/bbc-visual-and-data-journalism/how-the-bbc-visual-and-data-journalism-team-works-with-graphics-in-r-ed0b35693535
- https://ggplot2-book.org/
- https://ggplot2.tidyverse.org/reference/geom_histogram.html
- https://rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf