Data Analytics in R
[1] Introduction to Data Analytics
Rafael Greminger
The Basics
RStudio
Getting help
Various ways to get help when you are stuck:
– Built-in help files: ?mean opens the help file for the function mean
– The internet is your friend.
– Our textbook https://rgreminger.github.io/files/msin0010book/
– DataCamp exercises
– Me
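As a brief sketch of the built-in help, the calls below show a few related ways to reach documentation from the console. All of them are base R functions; the search phrase is only an illustration.

```r
# Open the help page for the function mean (same as ?mean)
help("mean")

# Search the installed help pages for a phrase (same as ??"standard deviation")
help.search("standard deviation")

# Show the arguments a function accepts without opening the help page
args(mean)

# Run the examples listed at the bottom of the help page
example(mean)
```

The examples at the bottom of a help page are often the quickest way to see how a function is meant to be called.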
Packages
Packages are code provided by other users.
They extend R’s functionality.
They need to be installed once.
They need to be loaded in every session!
# Install tidyverse package (only run once)
install.packages('tidyverse')
# Load tidyverse package (run every session)
library(tidyverse)
# note: anything on a line after # is a comment and will not run
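One possible way to respect the "install once, load every session" rule is to check whether the package is already installed. The sketch below is only an illustration; the variable names pkg and is_installed are made up for this example.

```r
# Check whether a package is already installed (does not attach it)
pkg <- "tidyverse"
is_installed <- requireNamespace(pkg, quietly = TRUE)

if (!is_installed) {
  # Only reached if the package is missing; the install is then run once by hand
  message("Package '", pkg, "' is not installed. Run install.packages('", pkg, "') once.")
} else {
  # Loading still has to happen in every new session
  library(pkg, character.only = TRUE)
}
```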
Types of objects
There are many different types of objects in R:
– Lists
– Vectors
– Matrices
– Data frames
– Tibbles
– ...
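To make the list above concrete, here is a minimal sketch that creates one object of each base R type and checks what it is. The object names and values are invented for illustration.

```r
# A vector: all elements have the same type
v <- c(1, 2, 3)

# A list: elements can have different types
l <- list(name = "Jon", age = 23)

# A matrix: two-dimensional, all elements have the same type
m <- matrix(1:4, nrow = 2)

# A data frame: a table whose columns can have different types
d <- data.frame(Name = c("Jon", "Bill"), Age = c(23, 41))

# Check what kind of object each one is
is.vector(v)      # TRUE
is.list(l)        # TRUE
is.matrix(m)      # TRUE
is.data.frame(d)  # TRUE
```

A tibble is the tidyverse variant of a data frame; with the tidyverse loaded, as_tibble(d) converts the data frame above into one.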
Creating / assigning objects
a <- 1 # assign value 1 to object a
print(a) # print object a
[1] 1
b <- c(1,2) # assign a vector [1,2] to object b
print(b)
[1] 1 2
Creating / assigning objects
Name <- c("Jon", "Bill")
Age <- c(23, 41)
df <- data.frame(Name, Age)
head(df)
  Name Age
1  Jon  23
2 Bill  41
Keep track of your objects!
Functions
We have already used several functions:
– library()
– c()
– print()
Functions are the workhorses of R.
– They take input, do something, and return an output.
Packages provide more functions.
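The input, processing, output pattern can be sketched with one built-in function and one user-defined function. The name value_range is hypothetical and only serves as an illustration.

```r
# A built-in function: the input is a numeric vector, the output is its mean
mean(c(1, 2, 3))   # 2

# A user-defined function (hypothetical name): takes a vector,
# computes maximum minus minimum, and returns the result
value_range <- function(x) {
  max(x) - min(x)
}

value_range(c(3, 7, 5))   # 4
```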
Different ways are valid
There are usually multiple ways of doing the same thing in R.
There is no single correct way of doing things.
– Multiple functions/packages can achieve the same thing.
I show ways that work for the purpose of this course.
– You can always do it differently if you want to.
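As a small illustration of this point, the sketch below computes the same average in two ways; the vector x is made up.

```r
x <- c(1, 2, 3, 4)

# Way 1: the built-in function
mean(x)               # 2.5

# Way 2: by hand, from the sum and the number of elements
sum(x) / length(x)    # 2.5
```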
Working with data sets
Different formats
Data sets come in different storage formats:
– .csv: comma-separated values
– .xls: Excel spreadsheet
– .txt: text file
– Provided by an R package
– ...
Loading data in R
R can read in many different formats.
We use two:
1. Data provided by an R package
2. .csv files
Loading data from a package
library(ISLR) # load the package containing the data, here ISLR
data(OJ) # load the OJ dataset, it will be stored as `OJ`
head(OJ,2)
  Purchase WeekofPurchase StoreID PriceCH PriceMM DiscCH DiscMM SpecialCH
1       CH            237       1    1.75    1.99      0    0.0         0
2       CH            239       1    1.75    1.99      0    0.3         0
  SpecialMM LoyalCH SalePriceMM SalePriceCH PriceDiff Store7 PctDiscMM
1         0     0.5        1.99        1.75      0.24     No  0.000000
2         1     0.6        1.69        1.75     -0.06     No  0.150754
  PctDiscCH ListPriceDiff STORE
1         0          0.24     1
2         0          0.24     1
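The same pattern works for any package that ships data. The sketch below uses the datasets package and its mtcars data set, which come with every R installation, so it runs without installing ISLR; the choice of data set is purely illustrative.

```r
# List the data sets that ship with a package (here: datasets, part of base R)
data(package = "datasets")

# Load one of them; it will be stored as `mtcars`
data(mtcars)

# Inspect what was loaded
dim(mtcars)      # number of rows and columns
names(mtcars)    # column names
head(mtcars, 2)  # first two rows
```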
Loading data from csv file
# Load tidyverse package, which provides read_csv() function
library(tidyverse)
# Load beer dataset
df_beer <- read_csv("beer.csv")
head(df_beer,2)
# A tibble: 2 x 11
  store        upc  week  move price sale  profit brand  packsize itemsize
  <dbl>      <dbl> <dbl> <dbl> <dbl> <chr>  <dbl> <chr>     <dbl>    <dbl>
1    86 1820000016    91    23  3.49 <NA>    19.0 BUDWE~        6       12
2    86 1820000784    91     9  3.79 <NA>    28.2 O'DOU~        6       12
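If beer.csv is not at hand, the round trip can be tried on a small made-up file. The sketch below writes a toy data frame to a temporary .csv file and reads it back with read_csv(); the names toy, path and df_toy and all values are invented for this example.

```r
library(tidyverse)

# Hypothetical toy data, written to a temporary .csv file
toy <- data.frame(store = c(86, 86), price = c(3.49, 3.79))
path <- tempfile(fileext = ".csv")
write_csv(toy, path)

# Read it back with the same function used for beer.csv
df_toy <- read_csv(path)

# Check the column types that read_csv() guessed
glimpse(df_toy)
```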
Errors, errors, errors...
Errors
Computers are precise and powerful, but stupid...
They do exactly what you tell them to do.
If you are not precise, they won’t understand.
– “Computer says no”.
If you tell them to do something they cannot do, they will fail.
– “Computer says no”.
A common error: file locations
R needs to know where to look for the file beer.csv.
– read_csv("beer.csv") will look in the working directory.
# Check working directory, which is where R will look for files
getwd()
[1] "C:/Users/Rafael/Dropbox (UCL)/Teaching/Data Analytics MSIN0010/week 1/
# Change working directory
setwd("new_directory")
# Or specify different path
df_beer <- read_csv("PATH/beer.csv") # will look in PATH
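A defensive variant, sketched here with base R only, checks whether the file exists before reading it. "PATH" is the same placeholder as above, and read.csv() stands in for read_csv() purely to keep the sketch free of package dependencies.

```r
# Build the path from its parts; "PATH" is a placeholder, not a real folder
path <- file.path("PATH", "beer.csv")

# Check whether the file exists before trying to read it
if (file.exists(path)) {
  df_beer <- read.csv(path)
} else {
  message("Cannot find ", path, " from working directory ", getwd())
}

# List the .csv files R can see in the working directory
list.files(pattern = "\\.csv$")
```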
Interpreting errors
If you do not specify the correct path, R will throw an error.
df_beer <- read_csv("wrong_path/beer.csv")
Error: 'wrong_path/beer.csv' does not exist in current working directory ('
Read the error messages!
– They are usually quite helpful.
If not, copy and paste the error message into Google.
– Most likely someone else has had the same problem before.
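To look at a message without the script stopping, the call can be wrapped in tryCatch(). This is a minimal sketch using base R's read.csv(), which reports a missing file as a warning followed by an error, so both are caught; the wording of the message differs from the read_csv() message above.

```r
# Try to read a file that does not exist and keep the message as text
result <- tryCatch(
  read.csv("wrong_path/beer.csv"),
  warning = function(w) conditionMessage(w),
  error = function(e) conditionMessage(e)
)

# The message names the file that R could not open
print(result)
```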
Inspecting datasets
Raw data: Excel
Accessing rows and columns
# first three elements of column price
df_beer$price[1:3]
[1] 3.49 3.79 3.69
# first two elements of column 3
df_beer[1:2,3]
# A tibble: 2 x 1
   week
  <dbl>
1    91
2    91
NOTE: First index is always row, second index is always column.
df_beer[ROW_INDEX,COLUMN_INDEX]
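The row-then-column rule can be tried on a small made-up data frame. The sketch below uses a base R data frame with invented names and values.

```r
# Hypothetical toy data frame
df <- data.frame(Name = c("Jon", "Bill", "Ann"), Age = c(23, 41, 35))

df$Age[1:2]     # first two elements of column Age: 23 41
df[2, ]         # row 2, all columns
df[, 2]         # all rows, column 2: 23 41 35
df[2, "Age"]    # row 2, column Age: 41
```

Leaving an index empty selects everything along that dimension. Note that a base R data frame returns a plain vector for df[, 2], whereas a tibble such as df_beer returns a one-column tibble, as in the output above.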
Summary statistics
# Mean of beer prices
mean(df_beer$price)
[1] 4.282289
# Median of beer prices
median(df_beer$price)
[1] 3.99
# Variance of beer prices
var(df_beer$price)
[1] 1.057594
# Correlation of price and profit
cor(df_beer$price,df_beer$profit)
[1] 0.05824536
Summary statistics
Statistic            R function
Mean                 mean()
Median               median()
Variance             var()
Standard deviation   sd()
Correlation          cor()
Minimum              min()
Maximum              max()
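As a quick check of the functions in the table, here is a sketch on two tiny made-up vectors where the results are easy to verify by hand.

```r
x <- c(2, 4, 4, 6)
y <- 2 * x   # perfectly linearly related to x

mean(x)     # 4
median(x)   # 4
var(x)      # 2.666667 (sample variance: 8 / 3)
sd(x)       # 1.632993 (square root of the variance)
cor(x, y)   # 1
min(x)      # 2
max(x)      # 6
```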
Data visualization
Why visualize data?
The goal is not to produce pretty pictures.
The goal is to clearly communicate insights.
The tidyverse data visualization pipeline
tidyverse: an opinionated collection of R packages.
– Data wrangling
– Data visualization
There are many different graphs that can be used to visualize data.
– https://www.r-graph-gallery.com/ provides a fairly exhaustive list.
We focus on a few main ones:
– Histogram, boxplot, scatter plot, line plot
Histogram
[Figure: histogram "Distribution of Weekly Prices"; x-axis: price, y-axis: count]
Histogram
df_beer %>%
  ggplot(aes(x=price)) +
  geom_histogram() +
  labs(title="Distribution of Weekly Prices")
df_beer %>% passes the data frame df_beer to the next function
1. ggplot() initializes the plot
2. aes() specifies the aesthetics to be plotted, here price on the x-axis
3. geom_histogram() specifies the type of plot
4. labs() specifies labels for the plot (here, title)
Histogram
ggplot(data=df_beer, aes(x=price)) +
  geom_histogram() +
  labs(title="Distribution of Weekly Prices")
Produces the same plot.
The difference is how we specify which data frame to use.
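The equivalence of the two versions can be checked on a small made-up data set. In the sketch below, df_toy and its values are invented, and bins = 5 is set only because the toy data are tiny.

```r
library(tidyverse)

# Hypothetical toy data
df_toy <- tibble(price = c(3.49, 3.79, 3.69, 4.29, 3.99))

# Version 1: pipe the data frame into ggplot()
p1 <- df_toy %>%
  ggplot(aes(x = price)) +
  geom_histogram(bins = 5)

# Version 2: pass the data frame through the data argument
p2 <- ggplot(data = df_toy, aes(x = price)) +
  geom_histogram(bins = 5)

# Compare the data stored in the two plot objects
identical(p1$data, p2$data)
```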
Boxplot
[Figure: boxplot "Distribution of Weekly Prices"; y-axis: price]
Boxplot
df_beer %>%
  ggplot(aes(y=price)) +
  geom_boxplot() +
  labs(title="Distribution of Weekly Prices")
Now we’re using geom_boxplot() instead of geom_histogram().
And we’re mapping price to the y-axis instead of the x-axis.
Scatter plot
[Figure: scatter plot "Price vs. Demand"; x-axis: price, y-axis: move]
Scatter plot
df_beer %>%
  ggplot(aes(x=price, y=move)) +
  geom_point() +
  labs(title="Price vs. Demand")
Now we are using geom_point().
And we’re specifying both x- and y-axis in aes().
Line plot
[Figure: line plot "Budweiser price over time in store 86"; x-axis: week, y-axis: price]
Line plot
df_beer %>%
  filter(brand == "BUDWEISER BEER" & store == 86) %>%
  ggplot(aes(x=week, y=price)) +
  geom_line() +
  labs(title="Budweiser price over time in store 86")
Now we are using geom_line().
We also use filter() to select only one brand and one store.
Data Wrangling
Tidyverse data wrangling pipeline
Many useful functions for data wrangling are provided by the tidyverse package.
Here, I will only discuss two of them that we will be using.
Filtering data
df_subset <- df_beer %>%
  filter(brand == "BUDWEISER BEER" & store == 86)
head(df_subset,2)
# A tibble: 2 x 11
  store        upc  week  move price sale  profit brand  packsize itemsize
  <dbl>      <dbl> <dbl> <dbl> <dbl> <chr>  <dbl> <chr>     <dbl>    <dbl>
1    86 1820000016    91    23  3.49 <NA>    19.0 BUDWE~        6       12
2    86 1820000016    92    46  3.49 <NA>    19.0 BUDWE~        6       12
We already saw the filter() function in the line plot.
It allows us to select only a subset of the data.
Filtering data
df_subset <- df_beer %>%
  filter(brand == "BUDWEISER BEER" & store == 86)
& is the AND operator.
This again uses the pipe operator.
– filter(df_beer, brand == "BUDWEISER BEER" & store == 86) would produce the same result.
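Both forms can be compared on a small made-up tibble. In the sketch below, df_toy reuses the column names of df_beer, but the rows and the brand "OTHER BEER" are invented for illustration.

```r
library(tidyverse)

# Hypothetical toy data with the same column names as df_beer
df_toy <- tibble(
  brand = c("BUDWEISER BEER", "BUDWEISER BEER", "OTHER BEER"),
  store = c(86, 90, 86),
  price = c(3.49, 3.59, 3.79)
)

# With the pipe
with_pipe <- df_toy %>%
  filter(brand == "BUDWEISER BEER" & store == 86)

# Without the pipe: the data frame is the first argument
without_pipe <- filter(df_toy, brand == "BUDWEISER BEER" & store == 86)

identical(with_pipe, without_pipe)   # TRUE
nrow(with_pipe)                      # 1: only one row satisfies both conditions
```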
Grouped data
We often have data for multiple groups.
– Panel data: data for multiple individuals over time
We can use group_by() to group data by a variable and then summarize each
group with another function.
average_price_in_week <- df_beer %>%
  group_by(week) %>%
  summarize(mean(price))
head(average_price_in_week,2)
# A tibble: 2 x 2
   week `mean(price)`
  <dbl>         <dbl>
1    91          3.98
2    92          4.03
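One optional variation, sketched on made-up data, is to name the summary column inside summarize(). This avoids the backticked column name `mean(price)`; the names df_toy, avg_by_week and avg_price are invented for this example.

```r
library(tidyverse)

# Hypothetical toy data: two weeks with two prices each
df_toy <- tibble(
  week  = c(91, 91, 92, 92),
  price = c(3.00, 5.00, 4.00, 4.50)
)

# Name the summary column so it can be referred to without backticks
avg_by_week <- df_toy %>%
  group_by(week) %>%
  summarize(avg_price = mean(price))

avg_by_week$avg_price   # 4.00 4.25
```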
Grouped data
We can also use the grouped summary to create a plot with one value per group.
df_beer %>%
  group_by(week) %>%
  summarize(mean(price)) %>%
  ggplot(aes(x=week, y=`mean(price)`)) +
  geom_line() +
  labs(title="Average price over time (all brands+stores)")
Grouped data
[Figure: line plot "Average beer price over time (all brands+stores)"; x-axis: week, y-axis: mean(price)]
A few tips to remember
Tips to remember
Keep track of variables and what they are
Know your working directory
Use the help function
Understand the different data structures you’re using
Do not just copy and run code, understand it!
Learn about the packages you’re using (e.g. tidyverse, ggplot2, ...)