DOTE2040 Business Analytics
Lecture 1: Course Introduction and R
Agenda
Course overview
Overview of R
Basics of R
COURSE OVERVIEW
Course Overview
• This is a course about data, analytics, statistical/
machine learning and their applications in business.
• The goal is to give you working knowledge and hands-on
experience of business analytics.
• After taking this course, you will know which data and analytical
methods to employ and how to use them.
Course Overview
• Related Courses
Statistical analysis
Data mining
Machine learning
Business intelligence
Artificial intelligence
• Pre-Requisites
Mathematical content (calculus, probability, statistics)
Comfortable with notation
Some experience of computer coding (that is, scripting)
• Course Content
Analyze data
Interpret results
• Final goal: make better business decisions based on data and
analytics
OVERVIEW OF R
R Introduction
S is a high-level, interpreted programming language for statistics,
developed at Bell Laboratories around 1975 by John Chambers. The
commercial implementation of S is called S-PLUS and appeared in 1988.
R is an open-source implementation of S and was created in the early
nineties by Ross Ihaka and Robert Gentleman at the University of
Auckland. These days, R is maintained by the R core team.
R has become very popular in both academia and industry.
Much of R’s success is due to the packages written for R by the R
community.
What is R?
A software package
A programming language
A toolkit for developing statistical and analytical tools
An extensive library of statistical and mathematical software and
algorithms
A scripting language
...
Why R?
R is free!
R is cross-platform and runs on Windows, Mac, and Linux.
R provides a vast number of useful statistical tools, many of
which have been tested.
R produces publication-quality graphics in a variety of formats.
R plays well with FORTRAN, C, and scripts in many languages.
Open-source software (e.g., RStudio) makes R easier to use.
It is NOT Excel.
Install R and RStudio
https://posit.co/downloads/
BASICS OF R
Get Started
R commands:
Assign a value to a variable:
a=5
b<-10
Simple math calculations
a+b-a*b
The “ <- ” and “ = ” are both assignment operators.
The standard R prompt is a “ > ” sign.
Display the names of the objects
ls()
Remove variables: rm()
Note that a line starting with # is a comment: R ignores it, so it is
used for informational purposes only
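As a minimal sketch, the commands above could be typed one after another as follows. The variable name result is only illustrative and does not come from the slides.

```r
# Assign values with either assignment operator
a = 5
b <- 10

# Simple arithmetic: 5 + 10 - 5 * 10
result <- a + b - a * b
print(result)            # -35

# List the objects that currently exist
print(ls())

# Remove one variable and confirm it is no longer listed
rm(a)
print("a" %in% ls())     # FALSE
```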
Rules for Names in R
Any combination of letters, numbers, underscore, and “.”, starting with a
letter (or with a “.” that is not followed by a number).
R is case-sensitive.
Variable names should be short, but descriptive.
Camel caps: MyMathScore =95
Underscore: my_math_score=95
Dot separated: my.math.score=95
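The naming rules can be checked with a short sketch like the one below. The object names are made up for illustration, and make.names() is a base R helper that is not mentioned on the slide.

```r
# R is case-sensitive: these are two different objects
my_math_score <- 95
My_Math_Score <- 80
print(my_math_score == My_Math_Score)   # FALSE

# make.names() shows how R would repair a string that is not a valid name
fixed <- make.names("2nd score")
print(fixed)                            # "X2nd.score"
```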
R Help Functions
If you know the name of the function or object on which you want
help:
help('read.csv')
?'read.csv'
If you do not know the name of the function or object on which you
want help:
help.search('input')
??'input'
Do not forget our friends: search engines, generative AIs
Data Types in R
Vectors
Factors
Matrices
Data frames
Lists
Vectors
Assignment using function c():
x = c(5, 8, 12)
5:7 -> y
z <- c(x, 2, y)
length(z)
Vector arithmetic:
Elementary arithmetic operators: +, -, *, /, ^
Common arithmetic functions: log, exp, sin, cos, tan, sqrt, ...
Other important functions: range(), length(), max(), min(), sum(),
prod(), mean(), var(), sort()
Generating regular sequences via seq() and rep():
seq(-5, 5, by=1) -> x
y <- seq(length=10, from=-5, by=.5)
z <- rep(x, times=5)
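A short sketch of how these functions behave on the vectors defined at the top of this slide may help. The values in the comments follow from the definitions of x, y and z given there.

```r
x <- c(5, 8, 12)
y <- 5:7
z <- c(x, 2, y)          # 5 8 12 2 5 6 7

print(length(z))         # 7
print(range(z))          # 2 12
print(sum(z))            # 45
print(sort(z))           # 2 5 5 6 7 8 12

# Regular sequences
s <- seq(0, 1, by = 0.25)       # 0.00 0.25 0.50 0.75 1.00
r <- rep(c(1, 2), times = 3)    # 1 2 1 2 1 2
print(s)
print(r)
```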
Vector Operations
Operations on a single vector are typically done
element-by-element.
If the operation involves two vectors:
Same length: R simply applies the operation to each pair of
elements.
Different lengths, but one length a multiple of the other: R recycles
(reuses) the shorter vector as needed.
Different lengths, but one length not a multiple of the other: R issues
a warning but still recycles the shorter vector, cutting off the last repetition.
Examples
x=1:6
y=2
x*y
[1] 2 4 6 8 10 12
z=c(1,10)
x*z
[1] 1 20 3 40 5 60
# x is the longer vector (used once): 1 2 3 4 5 6
# z is the shorter vector (used 3 times): 1 10 1 10 1 10
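The remaining case, where the longer length is not a multiple of the shorter one, could be sketched like this. The vector w is a made-up example, and tryCatch() is used here only to capture the warning text.

```r
x <- 1:6
w <- c(1, 10, 100, 1000)   # length 4 does not divide length 6

# R still recycles w as 1 10 100 1000 1 10, but warns about it
res <- suppressWarnings(x * w)
print(res)                 # 1 20 300 4000 5 60

# Capture the warning message instead of just printing it
msg <- tryCatch(x * w, warning = function(cond) conditionMessage(cond))
print(msg)
```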
Character Vectors
Numeric vectors are not the only type.
We can create another type of vector called a character vector.
Example:
s = c("ab", "hello", "this is Tom")
s
[1] "ab" "hello" "this is Tom"
We can use function class() to detect the type.
class(s)
[1] "character"
Logical Vectors
Logical vectors are generated by conditions:
E.g., x<-5>4
Comparison operators: <, <=, >, >=, ==, !=
Logical operators for combining conditions: & (and), | (or), ! (not)
Example
x=2:6
# create a numerical vector
y=(x>3)
y
[1] FALSE FALSE TRUE TRUE TRUE
# test whether x>3, create a logical vector
# we assign the results to a variable called y
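Conditions can also be combined with &, | and !. The sketch below reuses x from the example; the names in_range and outside are illustrative, and the last two lines (indexing and counting with a logical vector) go slightly beyond the slide.

```r
x <- 2:6

in_range <- (x > 3) & (x <= 5)   # FALSE FALSE TRUE TRUE FALSE
outside  <- (x < 3) | (x == 6)   # TRUE FALSE FALSE FALSE TRUE
print(in_range)
print(outside)
print(!in_range)                 # TRUE TRUE FALSE FALSE TRUE

# A logical vector can select elements or be counted (TRUE counts as 1)
print(x[x > 3])                  # 4 5 6
print(sum(x > 3))                # 3
```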
Factors
A factor is a special type of vector, normally used to hold a
categorical variable in many statistical functions.
Such vectors have a class named “factor”.
Factors in R often appear to be character vectors when printed,
but you will notice that they do not have double quotes around
them.
Factors are associated with levels (the distinct categories); internally,
each value is stored as an integer code that points to its level.
Examples
country<-c("US","China","Japan")
countryf<-factor(country)
# create a character vector and then convert it to factor
country
[1] "US" "China" "Japan"
countryf
[1] US China Japan
Levels: China Japan US
as.character(countryf)
[1] "US" "China" "Japan"
# reference the characters within a factor
as.numeric(countryf)
[1] 3 1 2
# reference the numeric values within a factor
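The link between levels and integer codes can be explored with a sketch like the following. The repeated "US" entry and the object countryf2 are additions for illustration only.

```r
country  <- c("US", "China", "Japan", "US")
countryf <- factor(country)

# Levels are the distinct categories, sorted alphabetically by default
print(levels(countryf))        # "China" "Japan" "US"
print(as.integer(countryf))    # 3 1 2 3

# Count the observations in each category
print(table(countryf))

# The level order can be set explicitly, which changes the integer codes
countryf2 <- factor(country, levels = c("US", "Japan", "China"))
print(as.integer(countryf2))   # 1 3 2 1
```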
Matrices and Data Frames
• A matrix is a rectangular array. It can be viewed as a collection of column
vectors all of the same length and the same type (i.e., numeric, character or
logical).
• A data frame is also a rectangular array. All of the columns must be the
same length, but they may be of different types.
• The rows and columns of a matrix or data frame can be given names.
Matrix Operations
Create a Matrix via cbind():
a<-1:5
b<-rep(8, times=5)
c<-cbind(a,b)
# create a matrix by column binding
c[4,2]
b
8
c[1,]
a b
1 8
c[,2]
[1] 8 8 8 8 8
# index an entry, a row, or a column of a matrix
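Besides cbind(), base R also offers matrix() for building a matrix directly from a vector. The sketch below is one possible illustration and is not taken from the slides; note that the values are filled column by column.

```r
# 2 x 3 matrix filled column by column: columns are (1,2), (3,4), (5,6)
m <- matrix(1:6, nrow = 2, ncol = 3)
print(m)

print(dim(m))      # 2 3
print(m[2, 3])     # 6
print(m[1, ])      # 1 3 5

# Transpose swaps rows and columns
print(dim(t(m)))   # 3 2
```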
Matrices Versus Data Frames
x=1:10
y=rep(8,times=10)
matrix1<-cbind(x,y)
class(matrix1)
[1] "matrix" "array"
# R 4.0.0 and later; earlier versions print only "matrix"
class(matrix1[,1])
[1] "numeric"
# combining numeric columns yields a matrix of numeric values
z=paste0('a',1:10)
matrix2<-cbind(x,y,z)
class(matrix2)
[1] "matrix" "array"
class(matrix2[,1])
[1] "character"
# combining numeric and character columns yields a matrix of characters
Matrices Versus Data Frames
tab<-data.frame(x,y,z)
class(tab)
[1] "data.frame"
class(tab[,1])
[1] "integer"
class(tab[,3])
[1] "character"
# a data frame preserves the type (e.g., numeric or character)
# of each column it is built from
Matrices Versus Data Frames
• Data frame columns can be referred to by name using the “dollar sign” operator $,
while this does not work for a matrix
tab$x
[1] 1 2 3 4 5 6 7 8 9 10
Try matrix1$x and see what happens (R reports an error).
• The command length() applied to a data frame returns the number of columns,
while the same command applied to a matrix returns the total number of elements
length(tab)
[1] 3
length(matrix1)
[1] 20
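The differences between the two structures can be tried out in one go with the sketch below. It rebuilds matrix1 and tab as on the earlier slides; the use of try() to keep the script running after the error is an addition for illustration.

```r
x <- 1:10
y <- rep(8, times = 10)
z <- paste0("a", 1:10)
matrix1 <- cbind(x, y)
tab <- data.frame(x, y, z)

# $ works on a data frame
print(tab$x)

# $ fails on a matrix; try() records the error instead of stopping
attempt <- try(matrix1$x, silent = TRUE)
print(inherits(attempt, "try-error"))   # TRUE

# A matrix column is selected by name or position inside [ , ]
print(matrix1[, "x"])

print(length(tab))       # 3  (number of columns)
print(length(matrix1))   # 20 (number of elements: 10 rows x 2 columns)
```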
List
A list is a collection of objects that may be the same or different
types.
A data frame is a list of matched column vectors. Hence, the
commands for lists also apply to a data frame.
List: Examples
Create a list
list1=list(100,"hello",c(2,4,6))
list1
[[1]]
[1] 100

[[2]]
[1] "hello"

[[3]]
[1] 2 4 6
class(list1)
[1] "list"
list1[[2]]
[1] "hello"
list1[[3]]
[1] 2 4 6
is.list(tab)
[1] TRUE
tab[[1]]
[1] 1 2 3 4 5 6 7 8 9 10
# can view data frame as a special case of list
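List elements can also be given names, which makes them accessible with $ in the same way as data frame columns. The list student below is a hypothetical example, not part of the slides.

```r
student <- list(name = "Tom", score = 95, courses = c("stats", "analytics"))

print(names(student))             # "name" "score" "courses"
print(student$score)              # 95
print(student[["courses"]][2])    # "analytics"
print(length(student))            # 3

# Single brackets return a sub-list; double brackets return the element itself
print(class(student["score"]))    # "list"
print(class(student[["score"]]))  # "numeric"
```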
Read Data
Use read.table() or read.csv() to read data into R
The data is stored in a format referred to as a data frame
View(), fix() to view/modify the data in a spreadsheet-like window
header=T/TRUE tells R that the first line contains variable names.
na.strings tells R which character string(s) should be treated as
missing values. (NA is used to represent a missing value in R.)
Read data from a file
Auto=read.csv("Auto.csv", header=T, na.strings="?")
Read data from the Internet
theURL <- "http://www.jaredlander.com/data/Tomato%20First.csv"
tomato <- read.table(file=theURL,header=TRUE, sep=",")
head(tomato)
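To see the effect of header and na.strings without needing Auto.csv or an Internet connection, one could write a small temporary file and read it back. The file contents and the name cars_demo below are invented for this sketch.

```r
# Write a tiny CSV file in which "?" marks a missing value
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,mpg,horsepower",
             "car_a,18,130",
             "car_b,15,?",
             "car_c,?,150"), tmp)

# The first line holds the variable names; "?" is converted to NA
cars_demo <- read.csv(tmp, header = TRUE, na.strings = "?")
print(cars_demo)

print(sum(is.na(cars_demo)))         # 2
print(class(cars_demo$horsepower))   # "integer"

unlink(tmp)   # delete the temporary file
```

Without na.strings="?", the horsepower and mpg columns would be read as character columns, because "?" is not a number.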
Probability Distributions
R provides a set of functions to evaluate
The cumulative distribution function Pr(X ≤ x), e.g.,
pnorm(2,mean=5,sd=10)
The probability density function and the quantile function, e.g.,
dnorm(2,mean=5,sd=10)
qnorm(.38,mean=5,sd=10)
Random draws from the distribution
z=rnorm(n=10,mean=5,sd=100)
Prefix names
‘d’ for the density, computes f(x)
‘p’ for the CDF, computes F(x) = Pr(X ≤ x)
‘q’ for the quantile function, computes x such that Pr(X ≤ x) = p
‘r’ for random generation, returns random draws from the distribution
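The ‘p’ and ‘q’ functions are inverses of each other, which can be checked with a sketch like the one below. The names p and x_back are illustrative.

```r
# CDF: Pr(X <= 2) for X ~ Normal(mean = 5, sd = 10), about 0.382
p <- pnorm(2, mean = 5, sd = 10)
print(p)

# Quantile function: feeding p back in should recover x = 2
x_back <- qnorm(p, mean = 5, sd = 10)
print(x_back)

# Density of the standard normal at 0 equals 1 / sqrt(2 * pi)
print(dnorm(0))                                   # 0.3989423
print(all.equal(dnorm(0), 1 / sqrt(2 * pi)))      # TRUE

# Random generation: ten draws from the same distribution
draws <- rnorm(n = 10, mean = 5, sd = 10)
print(length(draws))                              # 10
```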
Distribution, R Name, Additional Arguments
Distribution    R name    Additional arguments
uniform         unif      min, max
binomial        binom     size, prob
normal          norm      mean, sd
Poisson         pois      lambda
Student’s t     t         df, ncp
F               f         df1, df2, ncp
chi-squared     chisq     df, ncp
...
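Combining a prefix (d, p, q, r) with an R name from the table gives the function to call. A few hedged examples, with parameter values chosen only for illustration:

```r
# Binomial density: Pr(X = 2) with 4 trials and success probability 0.5
b <- dbinom(2, size = 4, prob = 0.5)
print(b)                               # 0.375

# Uniform CDF on [0, 1]: Pr(X <= 0.3)
u <- punif(0.3, min = 0, max = 1)
print(u)                               # 0.3

# Poisson density: Pr(X = 0) with lambda = 2, i.e. exp(-2)
print(dpois(0, lambda = 2))            # 0.1353353

# Student's t CDF at 0 is 0.5 by symmetry
print(pt(0, df = 10))                  # 0.5

# Five random draws from a chi-squared distribution with 3 degrees of freedom
ch <- rchisq(5, df = 3)
print(length(ch))                      # 5
```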
Reproducibility of Random Generation
Role of function set.seed(): Setting a seed ensures reproducible results
from random processes in R
Random generation
> rnorm(3,mean=10,sd=20)
[1] 11.40286 44.22882 -2.05816
Now set a seed for the generation
> set.seed(5)
> rnorm(3,mean=10,sd=20)
[1] -6.81711 37.68719 -15.10984
Reproduce the generation with the same seed
> set.seed(5)
> rnorm(3,mean=10,sd=20)
[1] -6.81711 37.68719 -15.10984
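Rather than comparing printed numbers by eye, the reproducibility can be checked programmatically. The sketch below uses illustrative names first, second and third.

```r
set.seed(5)
first <- rnorm(3, mean = 10, sd = 20)

# Resetting the same seed restarts the random number stream
set.seed(5)
second <- rnorm(3, mean = 10, sd = 20)
print(identical(first, second))   # TRUE

# Without resetting the seed, the stream simply continues
third <- rnorm(3, mean = 10, sd = 20)
print(identical(first, third))    # FALSE
```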
Miscellaneous Issues
Use the hot key Ctrl + L to clear the command window
Use getwd() and setwd() to get and set the working directory
Alternatively, choose “Session” in the menu bar of RStudio and then
select “Set Working Directory”
In the working directory, .RData saves the environment that we
worked on last time, while .Rhistory records the commands used
previously
Another special value in R is NULL, which represents an empty object rather
than a missing value: NULL cannot exist within a vector;
if used, it simply disappears. Try c(1:5,NA) and c(1:5, NULL)
A known issue associated with set.seed() is possible inconsistency
across different versions of R. See the following discussion:
https://stackoverflow.com/questions/47199415/is-set-seed-consistent-over-different-versions-of-r-and-ubuntu
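Returning to the NA versus NULL point on this slide, the difference shows up clearly in the vector lengths. A minimal sketch, with illustrative names with_na and with_null:

```r
with_na   <- c(1:5, NA)     # NA occupies a position in the vector
with_null <- c(1:5, NULL)   # NULL disappears

print(length(with_na))      # 6
print(length(with_null))    # 5

# NA propagates through calculations unless it is explicitly removed
print(mean(with_na))                 # NA
print(mean(with_na, na.rm = TRUE))   # 3

# Test for each value with the dedicated functions
print(is.na(with_na))
print(is.null(NULL))                 # TRUE
```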