Lecture 1: R Introduction
Li Xiaoli
Nanyang Technological University
Outline
1. What is R?
2. Why do we learn R?
3. Programming interface and simple coding
4. R simple data types
5. R complex data types
1. What is R?
• R is a programming language and software environment for data
analytics, statistical computing and graphics supported by the R
Foundation. The R language is widely used among data miners and
statisticians for developing statistical software and data analysis.
• R is freely available under the GNU General Public License (GPL), and
pre-compiled binary versions are provided for various operating
systems. While R has a command line interface, there are several GUIs
available (e.g. RStudio, Jupyter Notebook, Colab); we should use GUIs.
• The GPL is a free software license, which guarantees end users (individuals,
organizations, companies) the freedoms to run, study, share (copy), and
modify the software.
- Adapted from Wikipedia
What are R’s main functions?
R is an integrated suite of software for data manipulation,
calculation, and graphical display
• Effective data handling (read, write, and manipulate data in various formats).
• Rich data types (vectors, matrices, data frames, and lists, making it
versatile for different types of data analysis).
• Well-developed language including conditionals, loops, functions
and I/O capabilities.
• Various operators for calculations on arrays/matrices.
• Rich data analytics (machine learning, optimization) packages
• Graphical facilities for data analysis (R graphics system, ggplot2, heatmap, …).
What are R’s packages? https://www.r-project.org/
Open source, most widely used for statistical
analysis, data analytics and graphics
Extensible via dynamically loadable add-on
packages
Large number (~20k) of packages
2. Why do we learn R?
• Open source: it's free! Fintech companies are likely to turn to R or Python
(they have typically used SAS, which is quite expensive)
• Data Mining – R is widely used by data scientists.
• Statistical functions – R is designed for statistical computing.
• Econometrics
• Genetics
• …..
• Good graphing engine – plot nice graphs with built-in functions and
external packages like ggplot2 and heatmap.
• Easy to Use and powerful – do more with less, so you can spend more
time thinking about the problem you are trying to solve, instead of
focusing too much on implementations.
Why do we learn R?
• Statistics & Data Mining
• Commercial
• Technical computing
• Matrix and vector
formulations
• Commercial
Statistical computing and graphics
http://www.r-project.org
• Data visualization and analysis platform
• Image processing, vector computing
• Not well suited to general programming
• Expanded by community as open source
• Statistically rich
• Rich in data analytics packages
The Programmer’s Dilemma
What programming
language to use & why?
Scripting
(R, Python, MATLAB, IDL)
Object Oriented
(C++, Java)
Procedural languages
(C, Fortran)
Assembly
Why R?
https://www.kdnuggets.com/2020/06/data-science-tools-popularity-animated.html
Outline
1. What is R?
2. Why do we learn R?
3. Programming interfaces and simple coding
4. R simple data types
5. R complex data types
3. Programming Interfaces and Installations
• The following GUIs (programming interfaces) are widely used. Both can run R and Python.
• 1. RStudio: install both R and RStudio
• 2. Google Colab: just use a browser
You can choose either of them (RStudio IDE or Google Colab) based on your own preference.
Other interfaces can also run R, including Jupyter Notebook/Lab, VS Code, and Anaconda.
GUI 1: Install R and RStudio
RStudio IDE software at
https://posit.co/download/rstudio-desktop/
CRAN Task Views (useful for finding packages in a certain domain/vertical)
https://cran.r-project.org/web/views/
• Google Colab: no explicit installation.
Below, we will demo how to use each of the GUIs/software in turn:
A. RStudio
B. Google Colab
A. Running programs/commands using RStudio
• 1. Command line/Console:
– After R or RStudio is started, there is a console waiting for input.
– At the prompt (>), you can enter numbers and operators to perform calculations:
– > 1+2
[1] 3
– > 5*8
[1] 40
– Comments: all text on the same line after the pound sign "#" is ignored
> 1+1 # this is a comment - system will ignore
[1] 2
You can then choose a folder to save this R source file. The default file extension is .R
B. Running programs/commands using Google Colab
First, we suggest using the following link to start Google Colab:
https://colab.research.google.com/notebook#create=true&language=r
You can format text using headings, bold, italics, hyperlinks,
images, indentation, numbered lists, bulleted lists, horizontal rules, etc.
R’s own datasets
R comes with a number of sample datasets that you can experiment with. Type
> data()
to see the available datasets. The results will depend on which packages you
have installed and loaded. You can type the name of a dataset to see its content.
> CO2
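Printing a whole dataset can be long. As a minimal sketch (the choice of functions is only a suggestion, not part of the lecture code), a first look at CO2 could use the base R functions head(), str(), dim() and summary():

```r
# CO2 ships with R in the 'datasets' package, which is attached by default
first_rows <- head(CO2, 3)   # only the first 3 rows
print(first_rows)

str(CO2)                     # column names and their types

co2_dim <- dim(CO2)          # number of rows and number of columns
print(co2_dim)

summary(CO2$uptake)          # min, quartiles, mean and max of one column
```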
Simple Coding
• To facilitate your learning, I provide all the code that we cover
in the lectures.
• You can run it so that you understand the syntax and
semantics behind it.
• However, this is DEFINITELY NOT enough. You should write
your own code; only then is knowledge really gained and
made part of your skill set, which you can use repeatedly in
your future career.
3.1. Arithmetic in R
• You can use R as a calculator
• Typed expressions will be evaluated and printed out
• Main operations: +, -, *, /, ^
• Obeys order of operations
• Use parentheses to group expressions
• More complex operations appear as functions
• sqrt(2)
• sin(pi/4), cos(pi/4), tan(pi/4), asin(1), acos(1), atan(1)
• exp(1), log(2), log10(10), log2(2)
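The order-of-operations rules above can be checked directly at the console. The following is a small sketch with arbitrarily chosen numbers; the variable names r1 to r6 are only for illustration:

```r
r1 <- 2 + 3 * 4         # 14: * is evaluated before +
r2 <- (2 + 3) * 4       # 20: parentheses group the addition first
r3 <- -2^2              # -4: ^ is applied before the unary minus
r4 <- 2^3^2             # 512: ^ is right-associative, i.e. 2^(3^2)
r5 <- log(exp(1))       # 1: log() is the natural logarithm
r6 <- log(8, base = 2)  # 3: log() also accepts a base argument

print(c(r1, r2, r3, r4, r5, r6))
```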
3.2. Some basic functions
# list installed packages
> library()
# install a package: sometimes we need functionality beyond that offered
# by the core R library. To install an extension package, invoke the
# install.packages function at the prompt and follow the instructions.
# For example, https://cran.r-project.org/web/packages/available_packages_by_date.html
# lists the package caret: Classification and Regression Training
• > install.packages("caret")
# load a library
> library(caret)
https://data-flair.training/blogs/r-packages-for-data-science/
R Packages
• One of the strengths of R is that the system can easily be extended. The
system allows you to write new functions and package those functions in
a so-called 'R package' (or 'R library').
• The R package may also contain other R objects, for example data sets or
documentation. There is a lively R user community and many R packages
have been written and made available on CRAN for other users.
• Instructions for Creating Your Own R Package:
http://web.mit.edu/insong/www/pdf/rpackage_instructions.pdf
• To give just a few examples: there are existing packages for portfolio
optimization, drawing maps, exporting objects to html, time series
analysis, spatial statistics and the list goes on and on.
R Packages
• When you download R, a number of packages are downloaded
as well.
• To use a function in an R package, that package has to be attached to
the system.
• When you start R, not all of the downloaded packages are loaded or
attached; only some important packages are attached to the system by
default. You can use the function search() to see a list of packages that
are currently attached to the system. This list is also called the search
path.
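A minimal sketch of inspecting the search path before and after attaching a package. The package tools is used here only because it ships with R and is normally not attached at startup; the pkg::name form and the file name "report.R" are additions for illustration, not from the lecture:

```r
before <- search()             # character vector of attached packages/environments
print(before)

library(tools)                 # attach a package that ships with R
after <- search()
print(setdiff(after, before))  # entries added to the search path, if any

# A function can also be called without attaching its package, using pkg::name
ext <- tools::file_ext("report.R")
print(ext)
```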
Getting help & find useful information
• From the R console (i.e., the command line interface), you can type the
following commands to learn more about different
functions
> help(function_name)
> help(prcomp) # Principal Components Analysis
> ?function_name
> ?prcomp
> help.search("topic")
> ??topic
Q: What is the difference between ? and ??
Getting help & find useful information
# help for a function: Single ? (if you know what function
you are looking for)
? mean
x <- c(0:10, 50)
xm <- mean(x)
c(xm, mean(x, trim = 0.10))
Getting help & find useful information
# search help files: double ?? (free text search across all the help pages)
> ??mean
# you can search for objects by partial name with the apropos function.
> apropos("nova")
apropos: Find Objects by Partial Name
• apropos() returns a character vector giving the names of objects matching the search
query (e.g. "nova").
> V = apropos("GLM") # several results
> V[1]
> V[5]
> apropos("GLM", ignore.case = FALSE) # No result returned
> apropos("lq")
Call a built-in R function
An R function is invoked by its name, followed by parentheses
containing zero or more arguments. The following applies the function c to
combine three numeric values into a vector.
> c(1, 2, 3)
[1] 1 2 3
> factorial(5)
[1] 120
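Arguments can be matched by position or by name. As a brief illustration (the functions are standard base R; the values are arbitrary):

```r
r1 <- round(3.14159, digits = 2)      # one positional and one named argument
r2 <- seq(from = 1, to = 10, by = 3)  # all arguments named
r3 <- seq(1, 10, 3)                   # same call, arguments matched by position
today <- Sys.Date()                   # zero arguments: the parentheses are still required

print(r1)
print(r2)
print(today)
```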
3.3. Variables and assignment
• Use variables to store values
• Just typing the variable by itself at the prompt will print out the value.
• Three ways to assign variables
• a = 6
• a <- 6
• 6 -> a
4. R simple data types
• There are 5 basic R data types that are of frequent
occurrence in routine R calculations:
1. Numeric
2. Integer
3. Complex
4. Logical
5. Character
• We can better understand them by direct experimentation
with some R code.
R simple data type: Numeric
• Decimal values are called numerics. It is the default
computational data type. If we assign a decimal value to a
variable x, x will be of numeric type.
• > x = 10.5 # assign a decimal value
> x # print the value of x
[1] 10.5
• > class(x) # print the class name of x
[1] "numeric"
• Furthermore, even if we assign a whole number to a variable k, it is
still saved as a numeric value.
R simple data type: Numeric
• > k = 1
> k # print the value of k
[1] 1
• > class(k) # print the class name of k
[1] "numeric"
• The fact that k is not an integer can also be confirmed
with the is.integer function.
• > is.integer(k) # is k an integer?
[1] FALSE
R simple data type: Integer
• In order to create an integer variable in R, we invoke the
as.integer function. We can be assured that y is indeed an
integer by applying the is.integer function.
• > y = as.integer(3)
> y # print the value of y
[1] 3
• > class(y) # print the class (type) name of y
[1] "integer"
> is.integer(y) # is y an integer?
[1] TRUE
• We can coerce a numeric value into an integer with the same
as.integer function.
R simple data type: Integer
• > as.integer(3.14) # coerce a numeric value
[1] 3
• And we can parse a string containing a decimal value in the same way.
• > as.integer("5.27") # coerce a decimal string
[1] 5
• In other words, we can convert a numeric value or a decimal string
to an integer.
• > class(as.integer("5.27"))
[1] "integer"
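Two details are worth trying out alongside as.integer: an integer can be written directly with the L suffix, and as.integer drops the fractional part rather than rounding. A short sketch with arbitrary values:

```r
k <- 3L                  # the L suffix creates an integer directly
print(class(k))          # "integer"
print(is.integer(3))     # FALSE: a plain 3 is numeric

t1 <- as.integer(3.99)   # 3: the fractional part is dropped, not rounded
t2 <- as.integer(-3.99)  # -3: truncation is toward zero
r  <- round(3.99)        # 4, but the result is still numeric, not integer
print(c(t1, t2))
print(class(r))          # "numeric"
```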
R simple data type: Integer (Cont.)
• On the other hand, trying to parse a non-decimal string produces NA with a warning.
• > as.integer("Joe") # coerce a non-decimal string
[1] NA
Warning message:
NAs introduced by coercion
• Often, it is useful to perform arithmetic on logical values. TRUE has the
value 1, while FALSE has value 0.
• > as.integer(TRUE) # the numeric value of TRUE
[1] 1
> as.integer(FALSE) # the numeric value of FALSE
[1] 0
• Then, how about as.integer(3<5)?
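Because TRUE counts as 1 and FALSE as 0, sum() and mean() can count and average logical values. A small sketch with made-up scores and an assumed pass mark of 50:

```r
scores <- c(45, 72, 88, 51, 30)  # made-up example values
passed <- scores > 50            # logical vector: FALSE TRUE TRUE TRUE FALSE

n_passed <- sum(passed)          # number of TRUE values: 3
share_passed <- mean(passed)     # proportion of TRUE values: 0.6

print(n_passed)
print(share_passed)
```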
R simple data type: Complex
• A complex value in R is defined via the pure imaginary value i.
• > z = 1 + 2i # create a complex number
> z # print the value of z
[1] 1+2i
> class(z) # print the class name of z
[1] "complex"
• The following returns NaN with a warning, because -1 is not a complex value.
• > sqrt(-1) # square root of -1
[1] NaN
Warning message:
In sqrt(-1) : NaNs produced
• Instead, we have to use the complex value -1 + 0i.
• > sqrt(-1+0i) # square root of -1+0i
[1] 0+1i
• An alternative is to coerce -1 into a complex value.
• > sqrt(as.complex(-1))
[1] 0+1i
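Base R also provides functions for taking a complex value apart. A brief sketch, with the value 3 + 4i chosen only for illustration:

```r
z <- 3 + 4i
re_part <- Re(z)       # real part: 3
im_part <- Im(z)       # imaginary part: 4
modulus <- Mod(z)      # modulus sqrt(3^2 + 4^2): 5
conj_z  <- Conj(z)     # complex conjugate: 3-4i
prod_z  <- z * conj_z  # 25+0i, the squared modulus as a complex value

print(c(re_part, im_part, modulus))
print(conj_z)
print(prod_z)
```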
R simple data type: Logical
• A logical value is often created via comparison between variables.
• > x = 1; y = 2 # sample values
> z = x > y # is x larger than y?
> z # print the logical value
[1] FALSE
> class(z) # print the class name of z
[1] "logical"
• Standard logical operations are "&" (and), "|" (or), and "!" (negation).
• > u = TRUE; v = FALSE
> u & v # u AND v
[1] FALSE
> u | v # u OR v
[1] TRUE
> !u # negation of u
[1] FALSE
• Further details can be found in the R documentation.
> help("&")
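The same operators also work member by member on logical vectors, and any() and all() reduce a logical vector to a single value. A minimal sketch with arbitrary values (the names lu and lv are only for illustration):

```r
lu <- c(TRUE, TRUE, FALSE)
lv <- c(TRUE, FALSE, FALSE)

and_uv <- lu & lv        # TRUE FALSE FALSE
or_uv  <- lu | lv        # TRUE TRUE FALSE
not_u  <- !lu            # FALSE FALSE TRUE

any_and <- any(and_uv)   # TRUE: at least one member is TRUE
all_or  <- all(or_uv)    # FALSE: not every member is TRUE

print(and_uv)
print(or_uv)
print(not_u)
```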
R simple data type: Character
• A character object is used to represent string values in R. We convert objects
into character values with the as.character() function:
• > x = as.character(3.14)
> x # print the character string
[1] "3.14"
> class(x) # print the class name of x
[1] "character"
• Two character values can be concatenated with the paste function.
• > fname = "Joe"; lname ="Biden"
> paste(fname, lname)
[1] "Joe Biden"
• However, it is often more convenient to create a readable string with the
sprintf function, which uses C-style format syntax.
> sprintf("%s has %d dollars", "Sam", 100)
[1] "Sam has 100 dollars"
R simple data type: Character (Cont.)
• To extract a substring, we apply the substr function. Here is an example
showing how to extract the substring between the third and twelfth
positions in a string.
• > substr("Mary has a little lamb.", start=3, stop=12)
[1] "ry has a l"
• And to replace/substitute the first occurrence of the word "little" by another
word "big" in the string, we apply the sub function.
• > sub("little", "big", "Mary has a little lamb.")
[1] "Mary has a big lamb."
• More functions for string manipulation can be found in the R
documentation.
• > help("sub")
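Since sub replaces only the first occurrence, a related base R function, gsub, replaces all of them. The sketch below contrasts the two on a made-up sentence and shows a few other common string helpers:

```r
s <- "Mary has a little lamb and a little dog."

first_only <- sub("little", "big", s)   # "Mary has a big lamb and a little dog."
all_occ    <- gsub("little", "big", s)  # "Mary has a big lamb and a big dog."

n_chars <- nchar("lamb")                # 4
upper   <- toupper("lamb")              # "LAMB"
labels  <- paste0("x", 1:3)             # "x1" "x2" "x3" (paste with no separator)
joined  <- paste("Joe", "Biden", sep = "_")  # "Joe_Biden"

print(first_only)
print(all_occ)
```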
Outline
1. What is R?
2. Why do we learn R?
3. Programming interfaces and simple coding
4. R simple data types
5. R complex data types
5. R: complex data types
Vectors: numerical vector, character vector, logical vector
Matrices: all columns in a matrix must have the same mode (numeric,
character, ...) and the same length.
Arrays: Arrays are similar to matrices but can have more than two dimensions.
Lists: An ordered collection of objects (components). A list allows you to gather
a variety of (possibly unrelated) objects under one name.
Data frames: more general than a matrix, in that different columns can have
different modes (numeric, character, ...), like a database table.
Factors: Tell R that a variable is nominal by making it a factor. The factor stores
the nominal values as an integer vector in the range [ 1... k ] (where k is the
number of unique values in the nominal variable), and an internal vector of
character strings (the original values) mapped to these integers.
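The factor description above can be made concrete with a small sketch. The values in sizes are made up; by default the levels are sorted alphabetically:

```r
sizes <- c("small", "large", "medium", "small", "large")  # made-up nominal values
f <- factor(sizes)

lev   <- levels(f)      # "large" "medium" "small": the internal character vector
codes <- as.integer(f)  # 3 1 2 3 1: integer codes in the range 1..k, here k = 3
k     <- nlevels(f)     # 3

print(lev)
print(codes)
print(table(f))         # counts per level
```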
5.1 R complex data type: Vector
• A vector is a sequence of data elements of the same basic type.
Members in a vector are officially called components.
Nevertheless, we can just call them members.
• We use the c() function to combine values into a vector. We can
construct different types of vectors.
• Here is a vector containing three numeric values 2, 3 and 5.
> c(2, 3, 5)
[1] 2 3 5
R complex data type: Vector (cont.)
• And here is a vector of logical values.
> c(TRUE, FALSE, TRUE, FALSE, FALSE)
[1] TRUE FALSE TRUE FALSE FALSE
• A vector can contain character strings.
> c("aa", "bb", "cc", "dd", "ee")
[1] "aa" "bb" "cc" "dd" "ee"
• The number of members in a vector is given by the length function.
> length(c("aa", "bb", "cc", "dd", "ee"))
[1] 5
Combining Vectors
• Vectors can be combined via the function c. For example, the following
two vectors n and s are combined into a new vector w containing
elements from both vectors.
• > n = c(2, 3, 5)
> s = c("aa", "bb", "cc", "dd", "ee")
> w = c(n, s)
> w
[1] "2" "3" "5" "aa" "bb" "cc" "dd" "ee"
• Value Coercion
In the code snippet above, notice how the numeric values are being coerced into
character strings when the two vectors are combined. This is necessary so as to
maintain the same primitive data type for members in the same vector.
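Coercion always moves toward the more general type: logical, then integer, then numeric, then character. A short sketch that checks the resulting class for a few arbitrary mixes:

```r
cls1 <- class(c(TRUE, 2L))  # "integer": logical is coerced to integer
cls2 <- class(c(2L, 3.5))   # "numeric": integer is coerced to numeric
cls3 <- class(c(3.5, "a"))  # "character": numeric is coerced to character

mixed <- c(TRUE, 2, "a")    # "TRUE" "2" "a": everything becomes a string
print(c(cls1, cls2, cls3))
print(mixed)
```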
Vectors and vector operations
To create a vector:
# c() command to create vector x
x=c(12,32,54,33,21,65)
# c() to add elements to vector x
x=c(x,100,101)
# seq() command to create a sequence of numbers conveniently
years=seq(1990,2003)
# a sequence in steps of .5
a=seq(3,5,.5)
# can use : to step by 1
years=1990:2003
# rep() command to create data that follow a regular pattern
# (rep() replicates values a given number of times)
b=rep(1,5)
c=rep(1:2,4)
To access vector elements:
# 2nd element of x
x[2]
# first five elements of x
x[1:5]
# all but the 3rd element of x
x[-3]
# values of x that are < 40
# (selects all elements with values smaller than 40; might be hard to understand)
x[x<40]
To perform operations:
# mathematical operations on vectors
y=c(3,2,4,3,7,6,1,1)
x+y; 2*y; x*y; x/y; y^2
Example: Vector Arithmetic
• Arithmetic operations of vectors are performed member-by-member, i.e., memberwise, e.g. suppose
we have two vectors a and b.
• > a = c(1, 3, 5, 7)
> b = c(1, 2, 4, 8)
• If we multiply a by 5, we get a vector with each of its members multiplied by 5.
• >5*a
[1] 5 15 25 35
• And if we add a and b together, the sum would be a vector whose members are the sum of the
corresponding members from a and b.
• >a+b
[1] 2 5 9 15
• Similarly for subtraction, multiplication and division, we get new vectors via memberwise operations.
• > a-b
[1] 0 1 1 -1
> a*b
[1] 1 6 20 56
> a/b
[1] 1.000 1.500 1.250 0.875
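When the two vectors have different lengths, R repeats (recycles) the shorter one to match the longer one. This rule is not shown in the examples above; the sketch below illustrates it with the same vector a and an arbitrary shorter vector:

```r
a <- c(1, 3, 5, 7)
short <- c(10, 20)

# short is recycled to 10 20 10 20 before the memberwise addition
recycled_sum <- a + short  # 11 23 15 27

# multiplying by a single number is the same rule with a length-1 vector
scaled <- 5 * a            # 5 15 25 35

print(recycled_sum)
print(scaled)
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning.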
Access vector elements: Vector Index
• We retrieve values in a vector by declaring an index inside a single square
bracket "[]" operator.
• For example, the following shows how to retrieve a vector member. Since
the vector index is 1-based, we use the index position 3 for retrieving the
third member.
• > s = c("aa", "bb", "cc", "dd", "ee")
> s[3]
[1] "cc"
• Unlike other programming languages, the square bracket operator could
return more than just individual members. In fact, the result of the square
bracket operator is another vector, and s[3] is a vector slice (not element)
containing a single member "cc".
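Because the index itself can be a vector, several members can be retrieved at once. A small sketch using the same vector s, showing numeric, range, and logical index vectors:

```r
s <- c("aa", "bb", "cc", "dd", "ee")

by_pos   <- s[c(2, 3)]                            # "bb" "cc"
by_range <- s[2:4]                                # "bb" "cc" "dd"
by_logic <- s[c(TRUE, FALSE, TRUE, FALSE, TRUE)]  # "aa" "cc" "ee"
out_of_range <- s[10]                             # NA: no error for a missing position
slice_len <- length(s[3])                         # 1: s[3] is itself a vector of length 1

print(by_pos)
print(by_range)
print(by_logic)
```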
Vector Index (Cont.)
• Negative Index
• If the index is negative, it would strip the member whose position has
the same absolute value as the negative index. For example, the
following creates a vector slice with the third member removed.
• > s = c("aa", "bb", "cc", "dd", "ee")
> s[-3]
[1] "aa" "bb" "dd" "ee"
> A # print A
col1 col2 col3
row1 2 4 3
row2 1 5 7
Matrices & matrix operations
To create a matrix:
# matrix() command to create matrix A with rows and cols
A=matrix(c(54,49,49,41,26,43,49,50,58,71),nrow=5,ncol=2)
B=matrix(1,nrow=4,ncol=4)
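By default matrix() fills column by column; byrow = TRUE fills row by row. The sketch below contrasts the two and shows element access. The row and column names are labels chosen for illustration, and M2 is a hypothetical name:

```r
vals <- c(2, 4, 3, 1, 5, 7)

A <- matrix(vals, nrow = 2, ncol = 3, byrow = TRUE)  # rows: 2 4 3 and 1 5 7
dimnames(A) <- list(c("row1", "row2"), c("col1", "col2", "col3"))
print(A)

M2 <- matrix(vals, nrow = 2, ncol = 3)  # default: filled column by column
print(M2)                               # rows: 2 3 5 and 4 1 7

elem_pos  <- A[2, 3]                    # 7: row 2, column 3
elem_name <- A["row1", "col2"]          # 4: the same access by name
first_row <- M2[1, ]                    # 2 3 5
```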
More useful functions for vectors and matrices
• Find # of elements or dimensions
• length(v), length(A), dim(A)
• Transpose
• t(v), t(A)
• Matrix inverse
• solve(A)
• solve(A, b): returns the vector x in the equation b = Ax (i.e., x = A^(-1) b)
• Sort vector values
• sort(v)
• Statistics
• min(), max(), mean(), median(), sum(), sd(), quantile()
• These treat a matrix as a single vector (as does sort())
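As a worked example of solve(), the sketch below solves the small system 2x + y = 3, x + 3y = 5. The numbers are chosen only for illustration, and the names S, rhs and sol are hypothetical:

```r
S <- matrix(c(2, 1, 1, 3), nrow = 2)  # filled column-wise: rows are 2 1 and 1 3
rhs <- c(3, 5)

sol <- solve(S, rhs)                  # solves S %*% sol = rhs; here 0.8 and 1.4
print(sol)

check <- as.vector(S %*% sol)         # multiplying back recovers rhs
S_inv <- solve(S)                     # with one argument, solve() returns the inverse

n_elem <- length(S)                   # 4: number of elements
avg    <- mean(S)                     # 1.75: the matrix is treated as one vector
```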
5.1, 5.2 Summary of Vector and Matrix
• Vector (members have same type)
– N =c(12,32,54,33,21,65) #numerical vector,
– C = c(TRUE, FALSE, TRUE, FALSE, FALSE) #logical vector
– L =c("aa", "bb", "cc", "dd", "ee") # character vector
• Matrix (same type, same length)
– A = matrix(c(2, 4, 3, 1, 5, 7), nrow=2, ncol=3, byrow = TRUE)
– A = matrix(c('1','2','3','4'), nrow=2, ncol=2)
– Y = matrix(1:20, nrow=5,ncol=4)
Arrays
• Arrays are similar to matrices but can have more than two
dimensions. An n-dimensional array is a set of stacked matrices of
identical dimensions. For example, we create a 3-d array with 4
matrices (each 2*3 matrix)
• a <- matrix(6, 2, 3) # 2 x 3 matrix
• b <- matrix(7, 2, 3) # 2 x 3 matrix
• c <- matrix(8, 2, 3) # 2 x 3 matrix
• d <- matrix(9, 2, 3) # 2 x 3 matrix
• myarray=array(c(a, b, c, d), c(2, 3, 4))
# Creates a 2 x 3 x 4 array
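One way to check the stacking is to inspect the dimensions and pull a single matrix back out. In this sketch the matrices are named m1 to m4 (hypothetical names) so that the function c() is not shadowed by a variable called c:

```r
m1 <- matrix(6, 2, 3)
m2 <- matrix(7, 2, 3)
m3 <- matrix(8, 2, 3)
m4 <- matrix(9, 2, 3)

arr <- array(c(m1, m2, m3, m4), dim = c(2, 3, 4))

arr_dim  <- dim(arr)                    # 2 3 4
same_mat <- identical(arr[, , 2], m2)   # TRUE: the 2nd slice is the 2nd matrix
print(arr_dim)
print(arr[, , 2])
```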
[Figure: 4 matrices (each 2 x 3) are stacked to create a 2 x 3 x 4 array]
Array example
• myarray1 <- array(1:24, dim=c(3,4,2))
• myarray1
• Here the data is the first argument and a vector with the sizes of the
dimensions is the second argument. Our array has 3 rows, 4 columns,
and 2 "tables":
Access Array elements. Format: Array[row, col, matrix]
# 3rd row of the second matrix:
myarray1[3,,2]
# 1st row and 3rd column of the 1st matrix:
myarray1[1,3,1]
# 2nd Matrix
myarray1[,,2]
# apply() function below calculates the sum of the elements in the rows of an array
# across all the matrices.
# 1 indicates rows; 2 indicates columns; c(1, 2) indicates rows and columns.
result <- apply(myarray1, c(1), sum)
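To see what the margin argument of apply() does, the sketch below runs it over rows, over the third dimension, and over rows and columns together on the same array:

```r
myarray1 <- array(1:24, dim = c(3, 4, 2))

row_sums   <- apply(myarray1, 1, sum)         # one sum per row: 92 100 108
table_sums <- apply(myarray1, 3, sum)         # one sum per matrix: 78 222
cell_sums  <- apply(myarray1, c(1, 2), sum)   # 3 x 4 matrix, summed across the 2 matrices

third_row <- myarray1[3, , 2]                 # 15 18 21 24
one_cell  <- myarray1[1, 3, 1]                # 7

print(row_sums)
print(table_sums)
print(cell_sums)
```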
5.3 List
• A list is a generic vector containing different objects (different types,
different length), e.g., the following variable x is a list containing copies of
3 vectors n, s, b, and a numeric value 3.
• > n = c(2, 3, 5)
> n
[1] 2 3 5
> s = c("aa", "bb", "cc", "dd", "ee")
> b = c(TRUE, FALSE, TRUE, FALSE, FALSE)
> x = list(n, s, b, 3)
• > x
[[1]]
[1] 2 3 5
[[2]]
[1] "aa" "bb" "cc" "dd" "ee"
[[3]]
[1] TRUE FALSE TRUE FALSE FALSE
[[4]]
[1] 3
(Objects are shown using the double square bracket [[ ]].)
List Slicing: Still get a list
• We retrieve a list slice with the single square bracket "[]" operator. [] extracts
a list.
• > x[2]
[[1]]
[1] "aa" "bb" "cc" "dd" "ee"
• > class(x[2]) # list: x[2] is a list, not an actual member
• With an index vector, we can retrieve a slice with multiple objects. Here a
slice containing the second and fourth objects of x.
• > x[c(2, 4)]
[[1]]
[1] "aa" "bb" "cc" "dd" "ee"
[[2]]
[1] 3
Named list members (the list v used below has members Bob and Mary):
$Mary
[1] "aa" "bb"
List Slicing by Name: Still get a list
• We retrieve a list slice with the single square bracket "[]" operator. Here is a list slice
containing a member of v named "Bob".
• > v["Bob"]
$Bob
[1] 2 3 5
> class(v["Bob"]) # list
• With an index vector, we can retrieve a slice with multiple members. Here is a list
slice with both objects of v. Notice how they are reversed from their original
positions in v.
• > v[c("Mary","Bob")]
$Mary
[1] "aa" "bb"
$Bob
[1] 2 3 5
Member Reference: access members
• In order to reference a list member directly, we have to use the
double square bracket "[[]]" operator. The following references a
member of v by name.
• > v[["Bob"]]
[1] 2 3 5
• class(v["Bob"]) #list
• class(v[["Bob"]]) #numeric
• A named list member can also be referenced directly with the "$"
operator in lieu of the double square bracket operator.
• > v$Bob
[1] 2 3 5
• class(v$Bob) #numeric
Q: For a list, what is the difference between [], [[]], [[]][], and $ ?
A: []: list; [[]] and $: vector; [[]][]: member
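The four access forms can be compared side by side. The definition of v below is an assumption reconstructed from the printed output on these slides (members Bob and Mary), not the lecture's own code:

```r
# Assumed definition of v, inferred from the output shown above
v <- list(Bob = c(2, 3, 5), Mary = c("aa", "bb"))

slice_cls  <- class(v["Bob"])    # "list": [] returns a list slice
member_cls <- class(v[["Bob"]])  # "numeric": [[]] returns the member itself
dollar_cls <- class(v$Bob)       # "numeric": $ behaves like [[]] for a named member
second     <- v[["Bob"]][2]      # 3: [[]][] reaches one element inside the member

v[["Bob"]][1] <- 20              # members can also be modified in place
print(v$Bob)                     # 20 3 5
```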
Homework
• 1. Test out the two programming interfaces.
• 2. Read about the machine learning packages at the CRAN package distribution
to get a rough understanding of their functions (note that you may
not understand the meaning of all the topics).
• 3. Run and understand the Basic R program (BasicR_RStudio.zip)
that maps to today's lecture.
• Read through the article https://data-flair.training/blogs/machine-learning-for-r-programming/#google_vignette
Practice makes perfect!
Useful R links
• R Home: http://www.r-project.org/
• R's CRAN package distribution: https://cran.r-project.org/ (click Packages
and CRAN Task Views to study them)
• More comprehensive than our lectures: An Introduction to R. Notes on R:
A Programming Environment for Data Analysis and Graphics, Version 4.2.1
(2022-06-23): https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf
• Writing R extensions: https://cran.r-project.org/doc/manuals/R-exts.html
• Other R documentation:
• http://www.r-tutor.com/r-introduction
• http://www.tutorialspoint.com/r/
Contact: xlli@ntu.edu.sg, xlli@i2r.a-star.edu.sg if you have questions