Module 2 Textbook Content
Data Analytics using R
Seema Acharya
Senior Lead Principal
Infosys Limited
Chapter 1 Introduction to R
1.1 Introduction
1.1.1 What is R?
1.1.2 Why R?
1.1.3 Advantages of R Over Other Programming Languages
1.2 Downloading and Installing R
1.2.1 Downloading R
1.2.2 Installing R
1.2.3 Primary File Types of R
1.3 IDEs and Text Editors
1.3.1 R Studio
1.3.2 Eclipse with StatET
1.4 Handling Packages in R
1.4.1 Installing an R Package
1.4.2 Few Commands to Get Started
Summary
Key Terms
Multiple Choice Questions
Short Questions
LEARNING OUTCOME
At the end of this chapter, you will be able to:
• Install R
• Install any R package
• Work with any R package using functions such as find.package(), install.packages(),
library(), vignette() and packageDescription()
1.1 Introduction
Statistical computing and large-scale data analysis called for a new category of
computer language, alongside the existing procedural and object-oriented programming
languages, that would support these tasks directly instead of requiring new software to be
developed for each of them. Plenty of data is available today, and it can be analysed in
different ways to provide a wide range of useful insights for operations in various
industries. Problems such as the lack of support, tools and techniques for varied data
analysis have been addressed by the introduction of one such language, called R.
1.1.1 What is R?
R is a scripting or programming language which provides an environment for statistical
computing, data science and graphics. It was inspired by, and is mostly compatible with,
the statistical language S developed at Bell Laboratories (formerly AT&T, now Lucent
Technologies). Although there are some very important differences between R and S, much
of the code written for S runs unaltered on R. R has become so popular that it is used as
the single most important tool for computational statistics, visualisation and data science.
1.1.2 Why R?
R has opened tremendous scope for statistical computing and data analysis. It provides
techniques for various statistical analyses like classical tests and classification, time-
series analysis, clustering, linear and non-linear modelling and graphical operations. The
techniques supported by R are highly extensible.
S is the pioneer of statistical computing; however, it is a proprietary solution and is not
readily available to developers. In contrast, R is available freely under the GNU license.
Hence, it helps the developer community in research and development.
Another reason behind the popularity and widespread use of R is its superior support
for graphics. It can provide well-developed and high-quality plots from data analysis.
The plots can contain mathematical formulae and symbols, if necessary, and users have
full control over the selection and use of symbols in the graphics. Hence, other than
robustness, user-experience and user-friendliness are two key aspects of R.
Why Learn R?
The following points describe why R language should be used (Figure 1.1):
• If you need to run statistical calculations in your application, learn and deploy R. It
easily integrates with programming languages such as Java, C++, Python and Ruby.
• If you wish to perform a quick analysis for making sense of data.
• If you are working on an optimisation problem.
• If you need to use re-usable libraries to solve a complex problem, leverage the 2000+
free libraries provided by R.
• If you wish to create compelling charts.
• If you aspire to be a Data Scientist.
• If you want to have fun with statistics.
Figure 1.1 (labels shown: Advanced Statistics; Supportive Open Source Community; Fun with Statistics)
• R is free. It is available under the terms of the Free Software Foundation’s GNU
General Public License in source code form.
• It is available for Windows, Mac and a wide variety of Unix platforms (including
FreeBSD, Linux, etc.).
• In addition to enabling statistical operations, it is a general programming language
so that you can automate your analyses and create new functions.
• R has excellent tools for creating graphics such as bar charts, scatter plots,
multi-panel lattice charts, etc.
• It has an object oriented and functional programming structure along with support
from a robust and vibrant community.
• R has a flexible analysis tool kit, which makes it easy to access data in various
formats, manipulate it (transform, merge, aggregate, etc.), and subject it to traditional
and modern statistical models (such as regression, ANOVA, tree models, etc.).
• R can be extended easily via packages. It relates easily to other programming
languages. Existing software as well as emerging software can be integrated with R
packages to make them more productive.
• R can easily import data from MS Excel, MS Access, MySQL, SQLite, Oracle, etc. It
can easily connect to databases using ODBC (Open Database Connectivity Protocol)
and the ROracle package.
SciPy is used for performing data analysis tasks and NumPy is used for representing the
data or objects.
2. R has the fundamental data type, i.e., a vector that can be organised and aggregated
in different ways even though the core is the same. Vector data type imposes some
limitations on the language as this is a rigid type. However, it gives a strong logical
base to R. Based on the vector data type, R uses the concept of data frames that are
like a matrix with attributes and internal data structure similar to spreadsheets or
relational database. Hence, R follows a column-wise data structure based on the
aggregation of vectors.
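The column-wise idea described above can be illustrated with a minimal sketch. The vector names and values below are made up for illustration only; the point is that each column of a data frame is itself a vector.

```r
# Two atomic vectors of equal length (names and values are illustrative assumptions)
name  <- c("a", "b", "c")
score <- c(90.5, 72, 88)

# A data frame aggregates the vectors column-wise
df <- data.frame(name = name, score = score, stringsAsFactors = FALSE)
str(df)

# Each column can be pulled back out as an ordinary vector
is.vector(df$score)
```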
Just Remember
There are also some disadvantages of R. For example, R cannot scale efficiently for larger data sets.
Hence, the use of R is limited to prototyping and sandboxing. It is rarely used for enterprise-level solutions.
By default, R uses a single-threaded execution approach and works on data stored in RAM, which
leads to scalability issues as well. Developers from open source communities are working on these
issues to make R capable of multi-threaded execution and parallelisation. This will help R to utilise more
than one processor core. There are big data extensions from companies such as Revolution Analytics
(Revolution R), and the issues are expected to be resolved soon. Other languages like S-PLUS can store
objects permanently on disk, hence supporting better memory management and the analysis of massive datasets.
1.2.1 Downloading R
To download R, users need to visit the CRAN mirror page and click on the URL of the
chosen mirror that will redirect them to the respective site (Figure 1.2).
URL of CRAN: https://cran.r-project.org/mirrors.html
Figure 1.2 CRAN website for downloading R
In some Linux OS, R distributions are included by default. Hence, it is a good idea to check the
package management system of a Linux OS platform before installing R on it.
1.2.2 Installing R
After downloading the R distribution binaries for the correct OS platform, install R as described below.
Installing R on Windows
Installing R on Windows is simple. Users need to double click on the downloaded binary,
named R-3.3.1-win.exe, and follow the graphical installer. Command line installation options
are also available for Windows (Figure 1.6).
Two versions are available for 32-bit and 64-bit Windows OS. By default, both the versions are
installed. Hence, users need to select the desired version manually during installation.
Figure 1.3 Downloading R for Windows
Installing Rtools
Rtools is an additional requirement for developing R packages under Windows OS
environment. In addition to installing the R software on Windows, users need to install
Rtools for the installed version of R.
Installing R on Mac
The process for installing R on Mac is similar to that for Windows. Users need to double
click on the binaries downloaded from the CRAN website and follow the prompts.
Installing R on Linux
Users can install R from the source on Linux distributions. This can be done by running the
following commands (in superuser mode if the target directory requires it). The following steps
will install and configure R into a user-specific subdirectory within the home directory:
$ tar xvf R-3.1.1.tar.gz
$ cd R-3.1.1
$ ./configure --prefix=$HOME/R
$ make && make install
Setting the PATH on a Linux machine is critical. Unless the directory containing the installed
binaries is on the PATH, the R and Rscript commands cannot be run by name.
RScript
An RScript is a text file that contains commands for an R program. The same commands
can be executed individually on the CLI of an Integrated Development Environment (IDE)
for R programming, or they can be saved in an RScript and executed together. However, there
is a difference between executing a command directly on the CLI and executing the same
command through an R script: for example, a value typed at the prompt is printed automatically,
whereas a script run with source() shows only what is printed explicitly. An RScript has a .R extension.
Command line interface is needed for quick and small data processing and checking
operations. In large-scale solutions, it integrates multiple programs during prototyping and
subsequent phases. In that case, RScripts are used for managing the integration process.
Markdown Documents
R markdown documents are produced for creating and authoring dynamic documents,
reports and presentations from R. R markdown documents have a set of markdown
syntaxes derived from the core markdown syntaxes. These syntaxes are embedded into
RScripts and codes. When these embedded codes and scripts are executed then the output
is formatted based on the markdown syntaxes and hence becomes easily understandable.
R markdown documents can be regenerated automatically if the underlying RScripts and
codes or data are changed. The output format of an R markdown covers a wide range
of formats including PDF, HTML, HTML5 slides, websites, dashboards, Tufte handouts,
notebooks, books, MS Word, etc. The extension for R markdown document files is .Rmd.
Table 1.1 Some IDEs and text editors for writing and executing R codes
Name | Platform(s) | License | Details and Usage
Notepad and Notepad++ to R | Windows, Linux and Mac | GNU GPL | Notepad++ to R is an editor for R that is simple and robust. It supports extensions like code passing to the Notepad++ editor, the R GUI editor and optionally to a PuTTY window on a remote machine. It supports batch processing using shortcuts, monitoring of execution of RScripts and so on.
Tinn-R | Windows | GNU GPL | Tinn-R is a word processor and text editor that can process generic ASCII and UNICODE on Windows OS. It is well integrated with R and supports GUI and IDE for R.
Revolution Productivity Enhancer (RPE) |  | Commercial | Revolution productivity enhancer is an R productivity or enhanced environment. However, it can work as an IDE for new users. The usability features of RPE are very supportive. It includes features like IntelliSense for word completion, code snippets, and so on. Hence, RPE is an integrated IDE and editor with built-in visual debugging tools.
There are various IDEs used in R language. You will learn about these IDEs in the
following section.
1.3.1 R Studio
RStudio is the most widely used IDE for writing, testing and executing R codes (Figure
1.7). It is a user-friendly and open source solution. A typical RStudio screen has the
following parts:
• Console, where users write a command and see the output
• Workspace tab, where users can see active objects from the code written in the
console
• History tab, which shows a history of commands used in the code
• File tab, where folders and files can be seen in the default workspace
• Plot tab, which shows graphs
• Packages tab, which shows add-ons and packages required for running specific
processes
• Help tab, which contains the information on the IDE, commands, etc.
Example
> .libPaths()
Output
C:/R/R-3.1.3/library
This is the default package library location. The following command will change it
into another path:
Example
> .libPaths("~/R/win-library/3.1-mran-2016-07-02")
Output
C:/Users/User1/Documents/R/win-library/3.1-mran-2016-07-02
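As a minimal sketch of the same idea on any machine, the example below adds a throwaway directory to the library search path and then restores the original setting. The directory name `demo-library` is a hypothetical choice for illustration; `.libPaths()` only keeps directories that actually exist, which is why the directory is created first.

```r
# Remember the current library paths so that they can be restored afterwards
old_paths <- .libPaths()

# Hypothetical extra library location inside the session's temporary directory
extra_lib <- file.path(tempdir(), "demo-library")
dir.create(extra_lib, showWarnings = FALSE)

# Put the new location in front of the existing ones
.libPaths(c(extra_lib, old_paths))
new_paths <- .libPaths()
print(new_paths)

# Restore the original search path
.libPaths(old_paths)
```

Packages installed while the extra location is first in the list would be placed there by default, which is one way to keep experimental packages separate from the main library.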
R can be extended easily with the help of a rich set of packages. There are more than
10,000 packages available for R. These packages are used for different purposes. Tables
1.2 and 1.3 list some commonly used R packages for different purposes.
….
remove.packages() can be used to uninstall a package.
packageDescription()
The “DESCRIPTION” file has the basic information about a package. It has details such as what
the package does, who the author is, the version, the date,
the type of license it uses, the package dependencies, etc. To access the description file
inside R, use the function packageDescription("package"). The same can also be accessed
via the documentation of the package by using help(package = "package").
Let us look at the description for the “stats” package.
> packageDescription("stats")
Package: stats
Version: 3.2.3
Priority: base
Title: The R Stats Package
Author: R Core Team and contributors worldwide
Maintainer: R Core Team <R-core@r-project.org>
Description: R statistical functions.
License: Part of R 3.2.3
Suggests: MASS, Matrix, SuppDists, methods, stats4
Build: R 3.2.3; x86_64-w64-mingw32; 2015-12-10 13:03:29 UTC; windows
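Individual fields of the description can also be read programmatically. The following is a small sketch using the “stats” package, which ships with every R installation; the version printed will depend on the R version in use, so no particular value is assumed here.

```r
# Read the whole DESCRIPTION of the "stats" package
desc <- packageDescription("stats")

# Fields are available as list elements
desc$Package
desc$Version

# Only selected fields can be requested through the fields argument
packageDescription("stats", fields = c("Package", "Priority"))
```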
help(package = "package")
To get an overview of all the functions and datasets in an R package, use the help()
function.
> help(package = "datasets")
The above will provide an overview of all functions and datasets inside the package,
“datasets”. One of the datasets available in the “datasets” package is “AirPassengers”. To
access the dataset “AirPassengers” inside the “datasets” package, use the code given
below:
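A minimal sketch of such access, assuming the `package::name` form is what is intended, is:

```r
# Access a dataset directly from its package without attaching the package
ap <- datasets::AirPassengers

class(ap)   # AirPassengers is stored as a time series object
head(ap)    # first few monthly values
```

The `::` operator is convenient for one-off access because it leaves the search path unchanged.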
If there will be frequent use of this package, it is worthwhile to load it into the memory.
This can be achieved using the library function:
> library (datasets)
Note: the package name can be specified without enclosing it in quotes; a quoted name is also
accepted. The library() function will load the package “datasets” into the memory. Then any
dataset within this package can be accessed by simply typing the name of the dataset at the R prompt.
Example
To install a single package, the command is:
>find.package("ggplot2")
>install.packages("ggplot2")
Output
The first command helps to find whether a package named “ggplot2” is installed
in the system; it returns the path of the package, or raises an error if the package is not
found. Then the install.packages() function will install the package named “ggplot2” from the
CLI (Figure 1.8). It will download and install the package and all the
dependencies of the package.
Example
To install more than one package at a time, the install.packages() command will
have the following format:
>install.packages(c("ggplot2", "tidyr", "dplyr"))
Output
It will install packages ggplot2, tidyr and dplyr.
An ‘if’ condition can be used to check whether a package is installed and to install it only when
it is missing. For the package “ggplot2”, this can be done by using:
>if (!require("ggplot2")){install.packages("ggplot2")}
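The check can also be made without installing or attaching anything. The sketch below wraps `requireNamespace()` in a helper; the helper name `is_installed` and the package name `aPackageThatDoesNotExist123` are hypothetical and used only for illustration.

```r
# Hypothetical helper: TRUE if the package can be loaded, FALSE otherwise
is_installed <- function(pkg) {
  # requireNamespace() reports a missing package by returning FALSE
  # instead of raising an error
  requireNamespace(pkg, quietly = TRUE)
}

is_installed("stats")                        # a package shipped with R
is_installed("aPackageThatDoesNotExist123")  # made-up name

# With quiet = TRUE, find.package() returns an empty character vector
# rather than an error when the package is not found
missing_path <- find.package("aPackageThatDoesNotExist123", quiet = TRUE)
length(missing_path)
```

Unlike require(), this approach does not attach the package to the search path, which can be preferable inside scripts that only need to test for availability.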
library()
library() command loads a package.
Example
>library(ggplot2)
Output
It will load the package “ggplot2”.
vignette()
Vignettes are a very useful source of help with packages. They are provided by the package
authors to demonstrate and highlight a few functionalities of their package in detail. Use
the browseVignettes() function to get a list of all vignettes available with your installed
packages.
> browseVignettes()
To view all vignettes for a specific package, e.g., “ggplot2”, use the vignette() function.
Vignettes in package ‘ggplot2’:
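A minimal sketch of listing the vignettes of one package is given below. It uses the “grid” package, which ships with R, as an assumed stand-in so that the example does not depend on “ggplot2” being installed; the same call with `package = "ggplot2"` applies when that package is available.

```r
# List the vignettes that belong to a single package
v <- vignette(package = "grid")

# The listing is kept in the results component, one row per vignette
v$results[, c("Item", "Title"), drop = FALSE]
```

A single vignette can then be opened by passing its item name, for example `vignette("<item>", package = "grid")`, where the item is taken from the listing.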
Just Remember
Help in RStudio can be accessed from the console and from the CLI (Figure 1.9). The command
is help().
Figure 1.9 Accessing help() command from the console and CLI
Summary
• R is an open source and object-oriented programming language for statistical computing and data
visualisation.
• R is a successor of the proprietary statistical computing programming language S.
• R can be downloaded and installed on different OS platforms like Windows, Linux and Mac.
• R has the fundamental data type of vector.
• Text editors like Notepad++ to R, Tinn-R and Rev R are more than just editors for R. These can
support extended functionalities and IDE features.
• R has several IDEs like RStudio, Eclipse with StatET and so on.
• R has a rich library of more than 10,000 packages.
• R has two fundamental file types called RScripts and R markdown documents.
• R commands can be written in RScripts or through the command line interface.
• R has a rich collection of inbuilt data sets like mtcars, Biochemical Oxygen Demand (BOD), etc.
Key Terms
• BOD: An inbuilt data set in R, which contains data on the Biochemical Oxygen Demand.
• CLI: A console through which a user can interact with a computer. The interaction
happens through successive lines of commands on the console.
• IDE: A special type of software that offers a set of comprehensive facilities to develop
computer software. Usually, an IDE consists of a number of automation tools, a debugger
and an editor for coding.
• R: An open source and object oriented programming language for statistical computing
and data visualisation.
1. What is R?
(a) An object-oriented programming language
(b) An open source project from CRAN
(c) A programming language for statistical computing
(d) All of these
2. R is a dialect of which one of the following programming languages?
(a) Python (b) C
(c) S (d) Q
3. Which one of the following is a text editor of R?
(a) RStudio (b) Microsoft Word
(c) Notepad++ to R (d) Tableau
4. Which of the following are IDEs for R?
(a) RStudio (b) Both a and c
(c) Eclipse with StatET (d) None of these
5. What is the primary file type of R?
(a) Vector (b) Text file
(c) RScripts (d) Statistical file
6. R can be downloaded from:
(a) CRAN website (b) Google PlayStore
(c) None of these (d) All of these
7. Which one of the following R packages is used for data management?
(a) haven (b) igraph
(c) slidify (d) forecast
Short Questions
1. What is R? What are the advantages of R programming language over other general purpose
programming languages?
2. How can we install a package on R?
3. Give examples of two IDEs for R.
4. Give detailed examples of three packages used in R.
5. Give a detailed description of head() command used in R.
6. How can we install multiple R packages with a single command?
7. State the difference(s) between head() and tail() commands used in R.
8. State the difference(s) between ncol() and nrow() commands used in R.
LEARNING OUTCOME
At the end of this chapter, you will be able to:
• Analyse directory content with commands such as dir(), list.files()
• Analyse a dataset using functions such as str(), summary(), ncol(), nrow(),
head(), tail(), edit()
2.1 Introduction
Data exploration in R is an approach to summarise and visualise important characteristics
of a data set. An exploratory data analysis focusses on understanding the underlying
variables and data structures to see how they can help in data analysis through various
formal statistical methods.
Example
>getwd()
Output
[1] "C:/Users/User1/Documents/R"
Note the use of ‘/’ as the file separator on Windows. The file path does not have a trailing
‘/’ unless it is the root directory. The getwd() function can return NULL if the working
directory is not available.
Output
It will change the path to the user specified directory.
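A minimal sketch of changing and restoring the working directory is shown below. It uses the session's temporary directory as the target, which is an assumption made only so that the example runs on any machine.

```r
# Remember where we started
old_dir <- getwd()

# Switch to another directory (here: the session's temporary directory)
new_dir <- tempdir()
setwd(new_dir)
current <- getwd()
print(current)

# Go back to the original working directory
setwd(old_dir)
```

Saving the original directory first is a useful habit, because setwd() affects every relative file path used afterwards in the session.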
>list.files()
character(0)
The above command implies that there are no files or directories in the current directory.
Example 1
To display the files and directories in the current directory, use path = "." as an argument
to dir().
Getting Started with R 27
>dir(path=".")
[1] "att connect" "BI_May_2015.pptx" "BI_MetroMap-Final.png" "BISkillMatrix-Final.xlsx"
[5] "C" "cache" "Custom Office Templates" "Dec2016-Broadband Bill.pdf"
[9] "decision_tree.png" "Default.rdp" "desktop.ini" "DSS.wma"
[13] "ILP-AssociationRuleMining.pptx" "May-Broadband bill.pdf" "My Data Sources" "My Music"
[17] "My Pictures" "My Shapes" "My Tableau Repository" "My Videos"
[21] "Northwind 2007 sample.accdt" "Oct-Broadband bill.pdf" "OneNote Notebooks" "Outlokk Files"
[25] "R" "Remote Assistance Logs" "samplelinearregression.png" "SAP"
[29] "SQL Server Management Studio" "Visual Studio 2005" "Visual Studio 2008" "Visual Studio 2010"
Example 2
To display the list of all files and directories in a specific path, use the command as follows:
> dir (path="C:/Users/Seema_acharya")
[1] "AppData"
[2] "Application Data"
[3] "ATT_Connect_Setup.exe"
[4] "CD95F661A5C444F5A6AAECDD91C2410a.TMP"
[5] "Contacts"
[6] "Cookies"
[7] "Desktop"
[8] "Documents"
[9] "Downloads"
[10] "Favorites"
[11] "Links"
[12] "Local Settings"
[13] "Music"
[14] "My Documents"
[15] "NetHood"
[16] "NTUSER.DAT"
[17] "ntuser.dat.LOG1"
[18] "ntuser.dat.LOG2"
[19] "NTUSER.DAT{6cced2f1-6e01-11de-8bed-001e0bcd1824}.TM.blf"
[20] "NTUSER.DAT{6cced2f1-6e01-11de-8bed-001e0bcd1824}.TMContainer00000000000000000001.regtrans-ms"
[21] "NTUSER.DAT{6cced2f1-6e01-11de-8bed-001e0bcd1824}.TMContainer00000000000000000002.regtrans-ms"
[22] "ntuser.ini"
[23] "ntuser.pol"
[24] "Pictures"
[25] "PrintHood"
[26] "Recent"
[27] "Saved Games"
[28] "Searches"
[29] "SendTo"
[30] "Start Menu"
[31] "Templates"
[32] "Videos"
Example 3
To display the complete or absolute path of all files and directories in the specified path,
use dir() as follows:
Example 4
To look for a specific pattern, e.g. file/directory names beginning with a “D”, use the
dir() command with a pattern = "^D" argument.
> dir(path="C:/Users/Seema_acharya", pattern="^D")
[1] "Desktop" "Documents" "Downloads"
Example 5
To display a recursive list of files or directories in the specified path, use the dir()
command as follows:
> dir(path="d:/data")
[1] "db"
> dir(path="d:/data", recursive=TRUE,include.dirs=TRUE)
[1] "db" "db/Demo.0" "db/Demo.ns" "db/local.0" "db/local.ns"
"db/mongod.lock" "db/MyDB.0" "db/MyDB.ns"
The options or arguments used with dir() can also be used with list.files(). Try
it out and observe the output.
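The dir() options discussed in the examples above can be tried safely on a scratch directory. The sketch below builds a small, made-up directory tree inside the session's temporary directory (all file and folder names are illustrative assumptions) and then lists it in several ways, including the full.names argument for absolute paths.

```r
# Build a small scratch tree: two files and one sub-directory with a file
demo_dir <- file.path(tempdir(), "dir-demo")
dir.create(file.path(demo_dir, "Docs"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_dir, c("Data.csv", "notes.txt")))
file.create(file.path(demo_dir, "Docs", "report.txt"))

# Plain listing and listing with full paths
dir(path = demo_dir)
dir(path = demo_dir, full.names = TRUE)

# Names beginning with "D"
starts_with_d <- dir(path = demo_dir, pattern = "^D")

# Recursive listing of files below demo_dir
all_files <- dir(path = demo_dir, recursive = TRUE)

# list.files() accepts the same arguments and gives the same result
same <- identical(dir(demo_dir), list.files(demo_dir))
```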
locations or size of memory reserved is determined by the data type of the variables. Data
type essentially means the kind of value which can be stored, such as boolean, numbers,
characters, etc. In R, however, variables are not declared as data types. Variables in R are
used to store some R objects and the data type of the R object becomes the data type of
the variable. The most popular (based on usage) R objects are:
• Vector
• List
• Matrix
• Array
• Factor
• Data Frames
A vector is the simplest of all R objects, and its elements can be of various data types. All
other R objects are based on these atomic vectors. The most commonly used data types
supported by R are:
• Logical
• Numeric
  ◦ Integer
• Character
• Double
• Complex
• Raw
class() function can be used to reveal the data type. Other R objects such as list, matrix,
array, factor and data frames are discussed in detail in Chapter 3.
Logical
TRUE / T and FALSE / F are logical values.
> TRUE
[1] TRUE
> class(TRUE)
[1] "logical"
> T
[1] TRUE
> class(T)
[1] "logical"
> FALSE
[1] FALSE
> class(FALSE)
[1] "logical"
> F
[1] FALSE
> class(F)
[1] "logical"
Numeric
> 2
[1] 2
> class (2)
[1] "numeric"
> 76.25
[1] 76.25
> class(76.25)
[1] "numeric"
Integer
Integer data type is a sub class of numeric data type. Notice the use of “L” as a suffix to
a numeric value in order for it to be considered an “integer”.
> 2L
[1] 2
> class(2L)
[1] "integer"
Functions such as is.numeric(), is.integer() can be used to test the data type.
> is.numeric(2)
[1] TRUE
> is.numeric(2L)
[1] TRUE
> is.integer(2)
[1] FALSE
> is.integer(2L)
[1] TRUE
Note: Integers are numeric but NOT all numbers are integers.
Character
> "Data Science"
[1] "Data Science"
> class("Data Science")
[1] "character"
is.character() function can be used to ascertain if a value is a character.
> is.character ("Data Science")
[1] TRUE
Complex
> 5 + 5i
[1] 5+5i
> class(5 + 5i)
[1] "complex"
Raw
> charToRaw("Hi")
[1] 48 69
> class (charToRaw ("Hi"))
[1] "raw"
typeof() function can also be used to check the data type (as shown).
> typeof(5 + 5i)
[1] "complex"
> typeof(charToRaw("Hi"))
[1] "raw"
> typeof ("DataScience")
[1] "character"
> typeof (2L)
[1] "integer"
> typeof (76.25)
[1] "double"
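The two functions agree for most of the values shown above and differ for 76.25, where class() reports "numeric" and typeof() reports "double". A compact sketch that places both results side by side for the same sample values is:

```r
# One sample value per data type used in this section
values <- list(TRUE, 2L, 76.25, "Data Science", 5 + 5i, charToRaw("Hi"))

# Collect class() and typeof() for each value in a small table
types <- data.frame(
  class  = sapply(values, class),
  typeof = sapply(values, typeof),
  stringsAsFactors = FALSE
)
print(types)
```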
2.3.1 Coercion
Coercion helps to convert one data type to another, e.g. logical “TRUE” value when
converted to numeric yields “1”. Likewise, logical “FALSE” value yields “0”.
> as.numeric(TRUE)
[1] 1
> as.numeric(FALSE)
[1] 0
Numeric 5 can be converted to character 5 using as.character().
> as.character(5)
[1] "5"
Converting a non-integer numeric value with as.integer() drops the fractional part.
> as.integer(5.5)
[1] 5
On converting the character value “hi” to numeric data type, as.numeric() returns NA.
> as.numeric("hi")
[1] NA
Warning message:
NAs introduced by coercion
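Coercion also happens implicitly when values of different types are combined into one vector, because a vector can hold only a single type. The sketch below, with illustrative variable names, shows the resulting type in a few mixed cases.

```r
# Logical combined with integer becomes integer
mixed1 <- c(TRUE, 2L)
class(mixed1)

# Integer combined with double becomes numeric (double)
mixed2 <- c(2L, 76.25)
class(mixed2)

# Anything combined with a character string becomes character
mixed3 <- c(76.25, "hi")
class(mixed3)

# Logical values are treated as 1 and 0 in arithmetic
true_count <- sum(c(TRUE, FALSE, TRUE))
true_count
```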
summary() Command
summary() command reports statistics such as min, max, median, mean, etc., for each
variable present in the given data frame.
Example
>summary(mtcars)
Output
The output shows a six-point summary of each column or variable of the dataset
“mtcars”. The summary points are min, 1st quartile, median, mean, 3rd quartile and max
(Figure 2.1).
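The same six points can be inspected for a single column. The following minimal sketch applies summary() to the mpg column of “mtcars” and reads one of the values by name.

```r
# Six-point summary of one numeric column
s <- summary(mtcars$mpg)
print(s)

# The summary points are named, so individual values can be extracted
names(s)
s[["Median"]]
```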
str() Command
str() command displays the internal structure of a data frame. It can be used as an
alternative to summary function. It is a diagnostic function and roughly displays one
line per basic object.
Example 1
>str(str)
function (object, ...)
The above example shows str() function itself serving as an argument. It displays
compactly str() internal structure, stating that it is a function which takes an object
as an argument.
Example 2
>str(ls)
function(name, pos = -1L, envir = as.environment(pos), all.names =
FALSE, pattern, sorted = TRUE)
Here, ls() is used as an argument to str() function. It provides a brief outline of
the ls() function.
Example 3
>str(mtcars)
Output
When a data frame named “mtcars” is supplied, the command shows the internal structure
of the data frame. The CLI is:
>str(mtcars)
'data.frame': 32 obs. of 11 variables:
 $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num 160 160 108 258 360 ...
 $ hp  : num 110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num 2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num 16.5 17 18.6 19.4 17 ...
 $ vs  : num 0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num 1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
It shows the individual datatype of each column or variable of the mtcars dataset.
Example 4
Let us generate a vector of 100 normally distributed random numbers using the function
rnorm(). To learn more about the rnorm() function, use help(rnorm) at the R prompt. The
mean and sd arguments used are 2 and 4, respectively.
When we run the summary() function with “x” as the argument, we get the “Minimum”,
“1st Quartile”, “Median”, “Mean”, “3rd Quartile” and “Maximum” for “x”.
Next, when we run str() on “x”, we get the information that “x” is a numeric vector
consisting of 100 elements, along with the first few elements of the “x” vector.
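The steps described in this example can be sketched as follows. The call to set.seed() and its value are assumptions added only to make the random numbers reproducible; the exact figures printed will differ with another seed.

```r
# Fix the random number generator so that the example is repeatable (assumed seed)
set.seed(42)

# 100 normally distributed random numbers with mean 2 and sd 4
x <- rnorm(100, mean = 2, sd = 4)

summary(x)  # six-point summary of x
str(x)      # type, length and the first few elements of x
```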
Example 5
Let us now take it a step further by creating a 10 by 10 matrix, “m” and calling str() on it.
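A minimal sketch of this step is given below, assuming the matrix is filled with 100 standard normal random numbers; the fill values and the seed are illustrative choices.

```r
set.seed(42)  # assumed seed, for reproducibility only

# A 10 by 10 matrix built from 100 random numbers
m <- matrix(rnorm(100), nrow = 10, ncol = 10)

# str() reports the type and both dimensions of the matrix
str(m)
```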
The command shows the last 5 observations from the data frame.
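The behaviour described here matches tail(). As a minimal sketch, assuming tail() with n = 5 on the “mtcars” data frame is the command meant, together with its counterpart head():

```r
# Last five and first five observations of the data frame
last_five  <- tail(mtcars, n = 5)
first_five <- head(mtcars, n = 5)

rownames(last_five)
rownames(first_five)
```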
ncol() Command
ncol() command returns the number of columns in the given dataset.
Example
>ncol(mtcars)
Output
The output shows the number of columns in the “mtcars” dataset.
>ncol(mtcars)
[1] 11
nrow() Command
nrow() command returns the number of rows in the given dataset.
Example
>nrow(mtcars)
Output
The output shows the number of rows in the “mtcars” dataset.
>nrow(mtcars)
[1] 32
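Both counts can be obtained in one call with dim(), which returns the number of rows followed by the number of columns. A short sketch:

```r
# Rows and columns of the dataset in a single call
dims <- dim(mtcars)
dims

# dim() agrees with nrow() and ncol()
identical(dims, c(nrow(mtcars), ncol(mtcars)))
```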
edit() Command
edit() command helps with the dynamic editing or data manipulation of a dataset. When
this command is invoked, a dynamic data editor window opens with a tabular view of
the dataset. Hereafter, the required changes to the dataset can be made.
Example
>edit(mtcars)
Output
The output shows the changes made in the first row of the “mtcars” dataset.
> edit(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 UPDATED 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
The modified dataset should be stored in a new variable. For example, it is a good practice to
call the edit() method as mtcars_new = edit(mtcars).
fix() Command
fix() command saves the changes in the dataset itself, so there is no need to assign any
variable to it.
Example
> fix(mtcars)
> View(mtcars)
Output
Figure 2.2 Viewing the “mtcars” dataset after the modifications using the View() command
It shows the changes made to the first row of the dataset and the changes saved
automatically rather than being discarded as in the edit() method (Figure 2.2).
To read help on any command in R, the user can type “?” followed by the function name on the
console.
data() Function
The data() function lists the available datasets.
Syntax
> data()
Output
Figure 2.3 Scatter plot between the variables of the trees dataset
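The listing produced by data() can also be inspected programmatically. The sketch below restricts the listing to the “datasets” package and checks whether the “trees” dataset is part of it; the variable names are illustrative.

```r
# List the datasets that ship with the "datasets" package
available <- data(package = "datasets")

# The names are in the Item column of the results component
items <- available$results[, "Item"]
head(items)

# Check whether a particular dataset is available
"trees" %in% items
```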
save.image() Function
save.image() function writes an external representation of R objects to the specified
file. At a later point in time when it is required to read back the objects, one can use the
load or attach function.
Syntax
save.image(file = ".RData", version = NULL, ascii = FALSE, safe = TRUE)
The file is to be given an extension of RData.
Note: The “R” and “D” in “RData” should be in capitals.
If ascii = TRUE, an ASCII representation of the file is saved. The default is ascii = FALSE,
in which case a binary representation of the file is saved.
version is used to specify the current workspace format version. The value of NULL
specifies the current default format.
safe is set to a logical value. A value of TRUE means that a temporary file is used to
create the saved workspace. This temporary file is renamed to file if the save succeeds.
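A minimal round-trip sketch is shown below. It saves the workspace to a temporary file rather than to the default “.RData”, and reads it back into a separate environment so that existing objects are not overwritten. The object name `demo_value` and the use of a temporary file are assumptions made for illustration.

```r
# Create an object in the global workspace (hypothetical name and value)
assign("demo_value", 42, envir = globalenv())

# Save the whole workspace to a temporary .RData file
workspace_file <- tempfile(fileext = ".RData")
save.image(file = workspace_file)

# Read the saved objects back into a fresh environment
restored <- new.env()
load(workspace_file, envir = restored)
get("demo_value", envir = restored)
```

Note that save.image() writes every object in the global workspace, so the file can become large in a long session.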
Summary
• Data type essentially means the kind of value which can be stored, such as boolean, numbers,
characters, etc. In R, however, variables are not declared as data types. Variables in R are used to
store some R objects and the data type of the R object becomes the data type of the variable.
• ls() function lists all the objects in the working environment.
• class() function reveals the data type.
• typeof() function checks the data type.
• data() function lists the available datasets.
Key Terms
• dir(): dir() function returns a character vector of the names of files or directories in
the named directory.
• getwd(): getwd() command returns the absolute file path of the current working
directory. This function has no arguments.
• setwd(): setwd() command resets the current working directory to another location
as per the user’s preference.
• typeof(): typeof() function is used to check the data type.
Practical Exercises
1. BOD is an inbuilt data set in R. The output of the command View(BOD) is given below.
What will be done by the code given below? Explain.
>View(BOD)
>nrow(BOD)
LEARNING OUTCOME
At the end of this chapter, you will be able to:
• Store data of varied data types into vectors, matrices, and lists
• Load data from .csv, spreadsheets, web, JSON documents, and XML
• Deal with missing or invalid values
• Run R functions on the data (sum(), min(), max(), rep(), grep(), substr(),
strsplit(), etc.)
• Use R with databases such as MySQL, PostgreSQL, SQLite, and JasperDB
• Create visualisations to help with deeper understanding of data
3.1 Introduction
Enterprise applications today generate a huge amount of data. This data is analysed to
draw useful insights that can help decision makers make better and faster decisions. This
chapter introduces the different data types such as numbers, text, logical values, dates,
etc., supported in R. It also describes various R objects such as vector, matrix, list, dataset,
etc., and how to manipulate data using R functions such as sum(), min(), max(), rep()
and string functions such as substr(), grep(), strsplit(), etc. It explores import of
data into R from .csv (comma separated values), spreadsheets, XML documents, JSON
(JavaScript Object Notation) documents, web data, etc., and interfacing R with databases
such as MySQL, PostgreSQL, SQLite, etc. There are quite a few challenges in analysing
data. For instance, data is not always homogeneous, i.e. it comes from varied sources and
in different formats. Ensuring data quality can pose several challenges. Stakeholders also
view data from many perspectives and may have different requirements from it.
3.3.1 Expressions
Look at a few arithmetic operations such as addition, subtraction, multiplication, division,
exponentiation, finding the remainder (modulus), integer division and computing the
square root as given in Table 3.1.
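As a quick illustration of the operations listed above, the following sketch evaluates each of them once. The operand values 7 and 2 are arbitrary and are chosen only for this example.

```r
# Operands chosen only for illustration
a <- 7
b <- 2

a + b      # addition: 9
a - b      # subtraction: 5
a * b      # multiplication: 14
a / b      # division: 3.5
a ^ b      # exponentiation: 49
a %% b     # remainder (modulus): 1
a %/% b    # integer division: 3
sqrt(49)   # square root: 7
```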
Guided Activity
Step 1: Create a vector, x consisting of 10 elements with values ranging from 1 to 10. Section
3.5 of this chapter deals with creation, accessing vector elements and vector arithmetic,
etc.
> x <- c(1:10)
Loading and Handling Data in R 49
Explanation
Part (i) Display ‘TRUE’ for elements whose values are more than 7, else display ‘FALSE’.
> x>7
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
Part (ii) Display ‘TRUE’ for elements whose values are less than 5, else display ‘FALSE’.
> x<5
[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Step 4: Print the values of those elements whose values are greater than 7 and less than 10.
‘&’ is the AND operator. Use the AND operator to display elements whose values are
greater than 7 and less than 10.
> x[(x>7) & (x<10)]
[1] 8 9
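A comparison such as x > 7 produces a logical vector, and placing that logical vector inside square brackets keeps only the elements marked TRUE. The sketch below repeats this idea on the vector x from Step 1; which() and the OR operator ‘|’ are not part of the steps above and are added only for illustration.

```r
x <- c(1:10)

keep <- (x > 7) & (x < 10)   # logical vector, TRUE only for the values 8 and 9
x[keep]                      # 8 9
which(keep)                  # positions of the TRUE values: 8 9
x[(x < 3) | (x > 8)]         # OR operator: 1 2 9 10
sum(x > 7)                   # number of elements greater than 7: 3
```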
3.3.3 Dates
The default format of date is YYYY-MM-DD.
(i) Print system’s date.
> Sys.Date()
[1] “2017-01-13”
(ii) Print system’s time.
> Sys.time()
[1] “2017-01-13 10:54:37 IST”
(iii) Print the time zone.
> Sys.timezone()
[1] “Asia/Calcutta”
(iv) Print today’s date.
> today <- Sys.Date()
> today
[1] “2017-01-13”
> format (today, format = “%B %d %Y”)
[1] “January 13 2017”
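Dates can also be created from text and used in arithmetic. The following is a minimal sketch that uses a fixed date instead of Sys.Date() so that the results are reproducible; the variable name d is chosen only for this illustration.

```r
# A fixed date is used so that the results do not depend on the day the code is run
d <- as.Date("2017-01-13")

format(d, "%d/%m/%Y")                   # "13/01/2017"
d + 30                                  # 30 days later: "2017-02-12"
as.numeric(as.Date("2017-03-01") - d)   # days between the two dates: 47
```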
3.3.4 Variables
(i) Assign a value of 50 to the variable called ‘Var’.
> Var <- 50
Or
> Var = 50
(ii) Print the value in the variable, ‘Var’.
> Var
[1] 50
(iii) Perform arithmetic operations on the variable, ‘Var’.
> Var + 10
[1] 60
> Var / 2
[1] 25
Variables can be reassigned values either of the same data type or of a different data
type.
(iv) Reassign a string value to the variable, ‘Var’.
> Var <- “R is a Statistical Programming Language”
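Because a variable takes the data type of the object it holds, reassigning ‘Var’ changes what class() reports. A short sketch of this; the names first_class and second_class are introduced only for illustration.

```r
Var <- 50
first_class <- class(Var)     # "numeric"

Var <- "R is a Statistical Programming Language"
second_class <- class(Var)    # "character"

first_class
second_class
```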
3.3.5 Functions
In this section we will try out a few functions such as sum(), min(), max() and seq().
sum() function
sum() function returns the sum of all the values in its arguments.
Syntax
sum(..., na.rm = FALSE)
where ... implies numeric, complex or logical vectors.
na.rm accepts a logical value that indicates whether missing values (including NaN (Not a Number))
should be removed.
Examples
(i) Sum the values ‘1’, ‘2’ and ‘3’ provided as arguments to sum()
> sum(1, 2, 3)
[1] 6
(ii) What will be the output if NA is used for one of the arguments to sum()?
> sum(1, 5, NA, na.rm=FALSE)
[1] NA
If na.rm is FALSE, an NA or NaN value in any of the arguments will cause NA or
NaN to be returned.
(iii) What will be the output if NaN is used for one of the arguments to sum()?
> sum(1, 5, NaN, na.rm= FALSE)
[1] NaN
(iv) What will be the output if NA and NaN are used as arguments to sum()?
> sum(1, 5, NA, NaN, na.rm=FALSE)
[1] NA
(v) What will be the output if option, na.rm is set to TRUE?
If na.rm is TRUE, an NA or NaN value in any of the arguments will be ignored.
> sum(1, 5, NA, na.rm=TRUE)
[1] 6
> sum(1, 5, NA, NaN, na.rm=TRUE)
[1] 6
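In practice sum() is usually applied to a vector rather than to separate arguments. The sketch below assumes a hypothetical vector, marks, with one missing value; it also shows that summing the logical result of is.na() counts the missing values.

```r
marks <- c(10, 20, NA, 40)                 # hypothetical data with one missing value

total_raw   <- sum(marks)                  # NA, because na.rm defaults to FALSE
total_clean <- sum(marks, na.rm = TRUE)    # 70
n_missing   <- sum(is.na(marks))           # TRUE counts as 1, so this is 1

total_raw
total_clean
n_missing
```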
min() function
min() function returns the minimum of all the values present in its arguments.
Syntax
min(..., na.rm = FALSE)
where ... implies numeric or character arguments and na.rm accepts a logical value
that indicates whether missing values (including NaN) should be removed.
Example
> min(1, 2, 3)
[1] 1
If na.rm is FALSE, an NA or NaN value in any of the arguments will cause NA or NaN
to be returned.
> min(1, 2, 3, NA, na.rm=FALSE)
[1] NA
> min(1, 2, 3, NaN, na.rm=FALSE)
[1] NaN
> min(1, 2, 3, NA, NaN, na.rm=FALSE)
[1] NA
If na.rm is TRUE, an NA or NaN value in any of the arguments will be ignored.
> min(1, 2, 3, NA, NaN, na.rm=TRUE)
[1] 1
max() function
max() function returns the maximum of all the values present in its arguments.
Syntax
max(..., na.rm = FALSE)
where ... implies numeric or character arguments and na.rm accepts a logical value
that indicates whether missing values (including NaN) should be removed.
Example
> max(44, 78, 66)
[1] 78
If na.rm is FALSE, an NA or NaN value in any of the arguments will cause NA or NaN
to be returned.
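The effect of na.rm on max() and min() can be sketched as follows. The vector readings is hypothetical, and range() is included only as a convenient way of getting the minimum and the maximum together.

```r
readings <- c(44, 78, NA, 66)                  # hypothetical data with one missing value

max_raw   <- max(readings)                     # NA
max_clean <- max(readings, na.rm = TRUE)       # 78
min_clean <- min(readings, na.rm = TRUE)       # 44
both      <- range(readings, na.rm = TRUE)     # 44 78

max_raw
max_clean
min_clean
both
```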
seq() function
seq() function generates a regular sequence.
Syntax
seq(from, to, by, length.out)
where,
from: It is the start value of the sequence.
to: It is the maximal or end value of the sequence.
by: It is the increment of the sequence.
length.out: It is the desired length of the sequence.
Example
> seq(1, 10, 2)
[1] 1 3 5 7 9
> seq(1, 10, length.out=10)
[1] 1 2 3 4 5 6 7 8 9 10
> seq(18)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
Or
> seq_len(18)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
> seq(1, 6, by=3)
[1] 1 4
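When length.out is given instead of an increment, seq() works out the step itself, which may be fractional. The following sketch illustrates this, along with seq_along(), which generates one index per element of a vector; the variable names are chosen only for this example.

```r
quarters  <- seq(0, 1, length.out = 5)        # 0.00 0.25 0.50 0.75 1.00
evens     <- seq(2, 10, by = 2)               # 2 4 6 8 10
positions <- seq_along(c("a", "b", "c"))      # 1 2 3

quarters
evens
positions
```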
rep() function
rep() function repeats a given argument for a specified number of times. In the example
below, the string, ‘statistics’ is repeated three times.
Example
> rep(“statistics”, 3)
[1] “statistics” “statistics” “statistics”
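rep() can also repeat a whole vector. The sketch below contrasts the times and each arguments; the two-element vector is chosen only for illustration.

```r
letters_ab <- c("a", "b")

whole_vector <- rep(letters_ab, times = 2)         # "a" "b" "a" "b"
each_element <- rep(letters_ab, each = 2)          # "a" "a" "b" "b"
per_element  <- rep(letters_ab, times = c(3, 1))   # "a" "a" "a" "b"

whole_vector
each_element
per_element
```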
grep() function
In the example below, the function grep() finds the index position at which the string,
‘statistical’ is present.
Example
> grep(“statistical”,c(“R”,“is”,“a”,“statistical”,“language”),
fixed=TRUE)
[1] 4
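By default grep() treats the pattern as a regular expression and returns index positions. The following sketch, using the same five words, shows a few common variations; the variable names are chosen only for this example.

```r
words <- c("R", "is", "a", "statistical", "language")

idx      <- grep("a", words)                       # positions containing "a": 3 4 5
matched  <- grep("a", words, value = TRUE)         # the matching strings themselves
flags    <- grepl("stat", words)                   # one TRUE/FALSE per element
any_case <- grep("LANG", words, ignore.case = TRUE)  # case-insensitive match: 5

idx
matched
flags
any_case
```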
toupper() function
toupper() function converts a given character vector into upper case.
Syntax
toupper(x)
x → a character vector
Example
> toupper(“statistics”)
[1] “STATISTICS”
Or
> casefold (“r programming language”, upper=TRUE)
[1] “R PROGRAMMING LANGUAGE”
tolower() function
tolower() function converts the given character vector into lower case.
Syntax
tolower(x)
x → a character vector
Example
> tolower(“STATISTICS”)
[1] “statistics”
Or
> casefold(“R PROGRAMMING LANGUAGE”, upper=FALSE)
[1] “r programming language”
substr() function
substr() function extracts or replaces substrings in a character vector.
Syntax
substr(x, start, stop)
x → character vector
start → start position of extraction or replacement
stop → stop or end position of extraction or replacement
Example
Extract the string ‘tic’ from ‘statistics’. Begin the extraction at position 7 and continue the
extraction till position 9.
> substr(“statistics”, 7, 9)
[1] “tic”
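The replacement form of substr() mentioned above, together with strsplit() from the chapter’s list of string functions, can be sketched as follows. The variable names word and pieces are chosen only for illustration.

```r
word <- "statistics"

first_four <- substr(word, 1, 4)      # extraction: "stat"
substr(word, 1, 1) <- "S"             # replacement: overwrite the first character
word                                  # "Statistics"

# strsplit() returns a list, so [[1]] picks the pieces for the first string
pieces <- strsplit("R is a statistical language", " ")[[1]]
pieces                                # "R" "is" "a" "statistical" "language"
```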
The following example creates a vector ‘A’ with a missing value [10, 20, NA,
40] (Figure 3.2). The is.na(A) returns TRUE for the missing value. The na.omit(A)
and na.exclude(A) remove the missing value, and the results are stored in vectors ‘B’ and ‘D’,
respectively. The na.fail(A) generates an error if A has any missing value. The
na.pass(A) returns the vector A unchanged.
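The steps described above can be sketched as follows. This is a reconstruction from the description rather than a copy of the figure; the tryCatch() wrapper and the name outcome are additions so that the error from na.fail() does not stop the script.

```r
A <- c(10, 20, NA, 40)

is.na(A)               # FALSE FALSE TRUE FALSE
B <- na.omit(A)        # missing value removed; its position is kept as an attribute
D <- na.exclude(A)     # same values as B for a plain vector
as.numeric(B)          # 10 20 40
na.pass(A)             # A is returned unchanged, NA included

# na.fail() raises an error because A contains a missing value
outcome <- tryCatch(na.fail(A), error = function(e) conditionMessage(e))
outcome
```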
3.6 Vectors
A vector is an ordered collection of values. The values can be numbers, strings or logical. All the
values in a vector should be of the same data type.
A few points to remember about vectors in R are:
• Vectors are stored like arrays in C
• Vector indices begin at 1
• All vector elements must have the same mode such as integer, numeric (floating
point number), character (string), logical (Boolean), complex, object, etc.
Let us create a few vectors.
1. Create a vector of numbers
> c(4, 7, 8)
[1] 4 7 8
The c function (c is short for combine) creates a new vector consisting of three
values, viz. 4, 7 and 8.
2. Create a vector of string values.
> c(“R”, “SAS”, “SPSS”)
[1] “R” “SAS” “SPSS”
3. Create a vector of logical values.
> c(TRUE, FALSE)
[1] TRUE FALSE
A vector cannot hold values of different data types. Consider the example below on
placing integer, string and Boolean values together in a vector.
> c(4, 8, “R”, FALSE)
[1] “4” “8” “R” “FALSE”
All the values are converted into the same data type, i.e. ‘character’.
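The conversion follows a fixed order: logical values are converted to numbers, and numbers are converted to character strings, whenever the more general type is present. A brief sketch; the variable names are chosen only for this example.

```r
logical_to_number <- c(4, TRUE)               # TRUE becomes 1: 4 1
integer_to_double <- class(c(1L, 2.5))        # "numeric"
all_to_character  <- class(c(4, 8, "R", FALSE))   # "character"

# Character values that look like numbers can be converted back explicitly
back_to_number <- as.numeric(c("4", "8"))     # 4 8

logical_to_number
integer_to_double
all_to_character
back_to_number
```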
4. Declare a vector by the name, ‘Project’ of length 3 and store values in it.
> Project <- vector(length = 3)
> Project [1] <- “Finance Project”
> Project [2] <- “Retail Project”
> Project [3] <- “Energy Project”
Outcome
> Project
[1] “Finance Project” “Retail Project” “Energy Project”
> length (Project)
[1] 3
Objective
Create a sequence of numbers between 1 and 5 (both inclusive).
> 1:5
[1] 1 2 3 4 5
Or
> seq(1:5)
[1] 1 2 3 4 5
The default increment with seq is 1. However, it also allows the use of increments
other than 1.
> seq (1, 10, 2)
[1] 1 3 5 7 9
Or
> seq (from=1, to=10, by=2)
[1] 1 3 5 7 9
Or
> seq (1, 10, by=2)
[1] 1 3 5 7 9
Sequences can also be generated in descending order, either with the colon operator or
with seq and a negative increment.
> 10:1
[1] 10 9 8 7 6 5 4 3 2 1
> seq(10, 1, by = -2)
[1] 10 8 6 4 2
Objective
Demonstrate rep function.
Act
> rep (3, 4)
[1] 3 3 3 3
Or
> x <-rep (3, 4)
> x
[1] 3 3 3 3
Objective
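To access values in a vector, specify the indices at which the value is present in the vector.
Indices start at 1. The examples below assume that the vector, ‘VariableSeq’ has been created as follows:
> VariableSeq <- c("R", "is", "a", "programming", "language")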
> VariableSeq[1]
[1] “R”
> VariableSeq[2]
[1] “is”
> VariableSeq[3]
[1] “a”
> VariableSeq[4]
[1] “programming”
> VariableSeq[5]
[1] “language”
Objective
Assign new values in an existing vector. For example, let us assign the value, ‘good
programming’ at index 4 in the existing vector, ‘VariableSeq’.
> VariableSeq[4] <- “good programming”
Outcome
> VariableSeq[4]
[1] “good programming”
Objective
To access more than one value from the vector.
(a) Access the first and the fifth element from the vector, ‘VariableSeq’.
> VariableSeq[c(1, 5)]
[1] “R” “language”
(b) Access first to the fourth element from the vector, ‘VariableSeq’.
> VariableSeq[1:4]
[1] “R” “is” “a” “good programming”
(c) Access the first, fourth and the fifth element from the vector, ‘VariableSeq’.
> VariableSeq[c(1, 4:5)]
[1] “R” “good programming” “language”
(d) Retrieve all the values from the variable, ‘VariableSeq’
> VariableSeq
[1] “R” “is” “a” “good programming”
[5] “language”
Objective
Plot a bar graph using the barplot function. The barplot function uses a vector’s values
to plot a bar chart.
Act
The vector used is called BarVector.
> BarVector <- c(4, 7, 8)
> barplot(BarVector)
Outcome
Let us use the names function to assign names to the vector elements. These names will
be used as labels in the barplot.
> names(BarVector) <- c(“India”, “MiddleEast”, “US”)
> barplot(BarVector)
Objective
Add two vectors wherein one has length, 3 and the other has length, 6.
> c(1, 2, 3) + c(4, 5, 6, 7, 8, 9)
[1] 5 7 9 8 10 12
Objective
Multiply the two vectors wherein one has length, 3 and the other has length, 6.
> c(1, 2, 3) * c(4, 5, 6, 7, 8, 9)
[1] 4 10 18 7 16 27
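In both cases the shorter vector is recycled, i.e. repeated until it is as long as the longer one, before the operation is applied element by element. The sketch below makes the recycling explicit with rep(); the names short, long and recycled are chosen only for illustration.

```r
short <- c(1, 2, 3)
long  <- c(4, 5, 6, 7, 8, 9)

# What R does implicitly: stretch the shorter vector to the longer length
recycled <- rep(short, length.out = length(long))   # 1 2 3 1 2 3

recycled + long                                # 5 7 9 8 10 12
recycled * long                                # 4 10 18 7 16 27
identical(short + long, recycled + long)       # TRUE
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning.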
Objective
Plot a Scatter Plot. The function to plot a scatter plot is ‘plot’. This function uses two
vectors, i.e. one for the x axis and another for the y axis. The objective is to understand the
relationship between numbers and their sines. We will use two vectors. Vector, x which
will have a sequence of values between 1 and 25 at an interval of 0.1 and vector, y which
stores the sines of all values held in vector, x.
> x <-seq(1, 25, 0.1)
> y <-sin(x)
The plot function takes the values in the vector, x and plots it on the horizontal axis. It
then takes the values in the vector, y and places it on the vertical axis (Figure 3.4).
> plot(x, y)
3.7 Matrices
Matrices are nothing but two-dimensional arrays.
Objective
Let us create a matrix which is 3 rows by 4 columns and set all its elements to 1.
> matrix (1, 3, 4)
     [,1] [,2] [,3] [,4]
[1,]    1    1    1    1
[2,]    1    1    1    1
[3,]    1    1    1    1
Objective
Use a vector to create an array, 3 rows high and 3 columns wide.
Step 1: Begin by creating a vector that has elements from 10 to 90 with an interval of 10.
> a <- seq(10, 90, by = 10)
Step 2: Validate by printing the value of vector a.
> a
[1] 10 20 30 40 50 60 70 80 90
Step 3: Call the matrix function with the vector, ‘a’, the number of rows and the number of
columns.
> matrix(a, 3, 3)
     [,1] [,2] [,3]
[1,]   10   40   70
[2,]   20   50   80
[3,]   30   60   90
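As the output shows, matrix() fills the values column by column. The byrow argument of matrix() switches to filling row by row. A small sketch using the same vector; the names by_col and by_row are chosen only for this example.

```r
a <- seq(10, 90, by = 10)

by_col <- matrix(a, nrow = 3, ncol = 3)                 # default: fill column by column
by_row <- matrix(a, nrow = 3, ncol = 3, byrow = TRUE)   # fill row by row

by_col[1, ]                      # first row: 10 40 70
by_row[1, ]                      # first row: 10 20 30
identical(by_row, t(by_col))     # the two layouts are transposes of each other: TRUE
```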
Objective
Re-shape the vector itself into an array using the dim function.
Step 1: Begin by creating a vector that has elements from 10 to 90 with an interval of 10.
> a <- seq (10, 90, by = 10)
Step 2: Validate by printing the value of vector, a.
> a
[1] 10 20 30 40 50 60 70 80 90
Step 3: Assign new dimensions to vector, a by assigning the vector c(3, 3), i.e. 3 rows and 3
columns, to dim(a).
> dim(a) <- c(3, 3)
Step 4: Print the values of vector, a. You will notice that the values have shifted to form 3
rows by 3 columns. The vector is no longer one dimensional. It has been converted into
a two-dimensional matrix that is 3 rows high and 3 columns wide.
> a
     [,1] [,2] [,3]
[1,]   10   40   70
[2,]   20   50   80
[3,]   30   60   90
Objective
Access the third row of an existing matrix.
Step 1: Let us begin by printing the values of an existing matrix, ‘mat’
> mat
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
Step 2: To access the third row of the matrix, simply provide the row number and omit
the column number.
> mat [3, ]
[1] 3 6 9 12
Objective
Access the second column of an existing matrix.
Step 1: Let us begin by printing the values of an existing matrix, ‘mat’
> mat
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
Step 2: To access the second column of the matrix, simply provide the column number
and omit the row number.
> mat[, 2]
[1] 4 5 6
Objective
Access the second and third columns of an existing matrix.
Step 1: Let us begin by printing the values of an existing matrix, ‘mat’.
> mat
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
Step 2: To access the second and third columns of the matrix, simply provide the column
numbers and omit the row number.
> mat[,2:3]
     [,1] [,2]
[1,]    4    7
[2,]    5    8
[3,]    6    9
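The matrix ‘mat’ used in the last three objectives is not created in the text. The sketch below assumes it was built as matrix(1:12, nrow = 3, ncol = 4), which reproduces the printed values, and gathers the different ways of indexing it in one place.

```r
# Assumption: this call reproduces the 3 x 4 matrix printed above
mat <- matrix(1:12, nrow = 3, ncol = 4)

mat[3, ]                 # third row: 3 6 9 12
mat[, 2]                 # second column: 4 5 6
mat[2, 3]                # single element in row 2, column 3: 8
dim(mat[, 2:3])          # columns 2 and 3 form a 3 x 2 matrix
mat[, 2, drop = FALSE]   # keeps the second column as a one-column matrix
```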
Objective
Create a contour plot.
Create a matrix, ‘mat’ which is 9 rows high and 9 columns wide and assign the value
‘1’ to all its elements.
> mat <- matrix(1, 9, 9)
Print all the values of the matrix, ‘mat’.
> mat
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
 [1,]    1    1    1    1    1    1    1    1    1
 [2,]    1    1    1    1    1    1    1    1    1
 [3,]    1    1    1    1    1    1    1    1    1
 [4,]    1    1    1    1    1    1    1    1    1
 [5,]    1    1    1    1    1    1    1    1    1
 [6,]    1    1    1    1    1    1    1    1    1
 [7,]    1    1    1    1    1    1    1    1    1
 [8,]    1    1    1    1    1    1    1    1    1
 [9,]    1    1    1    1    1    1    1    1    1
Assign ‘0’ as the value to the element present in the third row and third column of the
matrix, ‘mat’.
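The assignment described above can be sketched as follows. This is a reconstruction from the description, since the command itself does not appear in the text.

```r
mat <- matrix(1, 9, 9)

mat[3, 3] <- 0     # row 3, column 3
mat[3, 3]          # 0
sum(mat)           # 80, since one of the 81 ones is now 0
```

Passing the modified matrix to contour(mat) then draws contour lines around the single lower value.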
Objective
Create a 3D perspective plot with the persp() function (Figure 3.6). It provides a 3D
wireframe plot most commonly used to display a surface.
>persp(mat)
We can add a title to our plot with the parameter ‘main’. Similarly, ‘xlab’, ‘ylab’ and
‘zlab’ can be used to label the three axes. Coloring of the plot is done with parameter ‘col’.
Similarly, we can add shading with the parameter ‘shade’.
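A minimal sketch of a persp() call that uses these parameters is given below. The title, labels, colour and shade value are arbitrary choices for illustration. The pdf(NULL) line is an assumption made so that the sketch runs without opening a plot window; at an interactive console it can be left out, together with dev.off().

```r
mat <- matrix(1, 9, 9)
mat[3, 3] <- 0

pdf(NULL)   # assumption: send the drawing to a null device (no window, no file)
vt <- persp(mat,
            main  = "Surface of mat",   # plot title
            xlab  = "Row",              # axis labels
            ylab  = "Column",
            zlab  = "Value",
            col   = "lightblue",        # surface colour
            shade = 0.5)                # amount of shading
invisible(dev.off())
```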
Objective
R includes some sample data sets. One of these is ‘volcano’, which is a 3D map of a
dormant New Zealand volcano. Create a contour map of the volcano dataset (Figure 3.7).
> contour(volcano)
Let us create a 3D perspective map of the sample data set, ‘volcano’ (Figure 3.8).
> persp(volcano)
Objective
Create a heat map of the sample dataset, ‘volcano’ (Figure 3.9).
> image(volcano)
3.8 Factors
3.8.1 Creating Factors
School, ‘XYZ’ places students in groups, also called houses. Each group is assigned a
unique color such as ‘red’, ‘green’, ‘blue’ or ‘yellow’. HouseColor is a vector that stores
the house colors of a group of students.
> HouseColor <- c("red", "green", "blue", "yellow", "red", "green", "blue", "blue")
> types <- factor(HouseColor)
> HouseColor
[1] “red” “green” “blue” “yellow” “red” “green” “blue” “blue”
> print(HouseColor)
[1] “red” “green” “blue” “yellow” “red” “green” “blue” “blue”
> print (types)
[1] red green blue yellow red green blue blue
Levels: blue green red yellow
Levels denotes the unique values. The above has four distinct values, namely ‘blue’,
‘green’, ‘red’ and ‘yellow’.
> as.integer(types)
[1] 3 2 1 4 3 2 1 1
The above output is explained as given below.
1 is the number assigned to blue.
2 is the number assigned to green.
3 is the number assigned to red.
4 is the number assigned to yellow.
> levels(types)
[1] “blue” “green” “red” “yellow”
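The integer codes and the levels together determine the factor: indexing the levels by the codes gives back the original values. The sketch below shows this, along with table() and nlevels(), which are added here only for illustration.

```r
HouseColor <- c("red", "green", "blue", "yellow", "red", "green", "blue", "blue")
types <- factor(HouseColor)

counts <- table(types)                          # blue 3, green 2, red 2, yellow 1
nlevels(types)                                  # 4
rebuilt <- levels(types)[as.integer(types)]     # codes mapped back to labels
identical(rebuilt, HouseColor)                  # TRUE

counts
```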
The vector ‘NoofStudents’ stores the number of students in each house/group with
12 students in blue house, 14 students in green house, 12 students in red house and 13
students in yellow house.
> NoofStudents <- c(12, 14, 12, 13)
> NoofStudents
[1] 12 14 12 13
The vector, ‘AverageScore’ stores the average score of the students of each house/
group. 70 is the average score for students of the blue house, 80 is the average score for
students of the green house, 90 is the average score for the students of the red house and
95 is the average score for the students of the yellow house.
> AverageScore <- c(70, 80, 90, 95)
> AverageScore
[1] 70 80 90 95
Objective
Plot the relationship between NoofStudents and AverageScore (Figure 3.10).
> plot(NoofStudents, AverageScore)
Figure 3.11 Relationship between "NoofStudents" and "AverageScore" using different symbols.
To add further meaning to the graph, let us place a legend on the top right corner
(Figure 3.12).
> legend(“topright”, c(“red”, “green”, “blue”, “yellow”), pch=1:4)
3.9 List
A list is similar to a struct in C: it can hold elements of different data types.
Objective
Create a list in R.
Create a list, ‘emp’ having three elements, ‘EmpName’, ‘EmpUnit’ and ‘EmpSal’.
> emp <- list(EmpName = "Alex", EmpUnit = "IT", EmpSal = 55000)
Outcome
To get the elements of the list, ‘emp’ use the command given below.
> emp
$EmpName
[1] “Alex”
$EmpUnit
[1] “IT”
$EmpSal
[1] 55000
Actually, the element names, e.g. ‘EmpName’, ‘EmpUnit’ and ‘EmpSal’ are optional.
We could alternatively do this as shown below.
> EmpList <- list(“Alex”, “IT”, 55000)
> EmpList
[[1]]
[1] “Alex”
[[2]]
[1] “IT”
[[3]]
[1] 55000
Objective
Retrieve the names of the elements in the list ‘emp’.
> names(emp)
[1] “EmpName” “EmpUnit” “EmpSal”
Objective
Retrieve the values of the elements in the list ‘emp’.
> unlist(emp)
EmpName EmpUnit EmpSal
“Alex” “IT” “55000”
The command to retrieve the value of a single element in the list ‘emp’ is given below.
Objective
Retrieve the value of the element ‘EmpName’ in the list ‘emp’.
> unlist(emp[“EmpName”])
EmpName
“Alex”
The value of the other elements in the list can be checked in a similar manner.
> unlist(emp[“EmpUnit”])
EmpUnit
“IT”
> unlist(emp[“EmpSal”])
EmpSal
55000
Yet another way to retrieve the values of the elements in the list ‘emp’ is given as
follows:
Objective
Retrieve the value of the element ‘EmpName’ in the list ‘emp’.
> emp[[“EmpName”]]
[1] “Alex”
Or
> emp[[1]]
[1] “Alex”
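The difference between single and double square brackets is worth noting: single brackets return a smaller list, whereas double brackets and the $ operator return the element itself. A short sketch using the list, ‘emp’ as first created:

```r
emp <- list(EmpName = "Alex", EmpUnit = "IT", EmpSal = 55000)

class(emp["EmpSal"])       # "list": still a list, with one element
class(emp[["EmpSal"]])     # "numeric": the value itself
emp$EmpSal + 5000          # the value can be used directly in arithmetic: 60000
names(emp[c(1, 3)])        # sub-list with two elements: "EmpName" "EmpSal"
```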
Objective
Add an element with the name ‘EmpDesg’ and value ‘Software Engineer’ to the list, ‘emp’.
> emp$EmpDesg = “Software Engineer”
Outcome
> emp
$EmpName
[1] “Alex”
$EmpUnit
[1] “IT”
$EmpSal
[1] 55000
$EmpDesg
[1] “Software Engineer”
Objective
Delete an element with the name ‘EmpUnit’ and value ‘IT’ from the list, ‘emp’.
> emp$EmpUnit <- NULL
Outcome
> emp
$EmpName
[1] “Alex”
$EmpSal
[1] 55000
$EmpDesg
[1] “Software Engineer”
Objective
Determine the number of elements in the list, ‘emp’.
> length(emp)
[1] 3
Recursive List
A recursive list means a list within a list.
Objective
Create a list within a list.
Let us begin with two lists, ‘emp’ and ‘emp1’.
The elements in both the lists are as shown below.
> emp
$EmpName
[1] “Alex”
$EmpSal
[1] 55000
$EmpDesg
[1] “Software Engineer”
> emp1
$EmpUnit
[1] “IT”
$EmpCity
[1] “Los Angeles”
We would like to combine both the lists into a single list called ‘EmpList’.
> EmpList <- list(emp, emp1)
Outcome
> EmpList
[[1]]
[[1]]$EmpName
[1] “Alex”
[[1]]$EmpSal
[1] 55000
[[1]]$EmpDesg
[1] “Software Engineer”
[[2]]
[[2]]$EmpUnit
[1] “IT”
[[2]]$EmpCity
[1] “Los Angeles”
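Elements of the inner lists are reached by indexing twice, first into ‘EmpList’ and then into the inner list. The sketch below recreates the two lists as printed above; the name flat and the use of c() to build a single flat list are additions for contrast.

```r
emp  <- list(EmpName = "Alex", EmpSal = 55000, EmpDesg = "Software Engineer")
emp1 <- list(EmpUnit = "IT", EmpCity = "Los Angeles")

EmpList <- list(emp, emp1)      # a list whose two elements are themselves lists
EmpList[[1]]$EmpName            # "Alex"
EmpList[[2]][["EmpCity"]]       # "Los Angeles"
length(EmpList)                 # 2

flat <- c(emp, emp1)            # c() joins the lists into one flat list instead
length(flat)                    # 5
```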
The following example loads a dataset into the workspace. All the above commands
are executed on the dataset, ‘Orange’ (Figures 3.13–3.15).
Figure 3.13 Exploring a dataset using names(), summary() and str() functions
Figure 3.15 Exploring a dataset using class(), dim() and table() functions
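The functions named in the figure captions can be tried directly on ‘Orange’, which ships with R. The following is a sketch of such a session, not a copy of the figures.

```r
names(Orange)                     # "Tree" "age" "circumference"
dim(Orange)                       # 35 rows and 3 columns
str(Orange)                       # structure: type and first values of each column
summary(Orange$circumference)     # minimum, quartiles, mean and maximum
inherits(Orange, "data.frame")    # TRUE: the dataset is a data frame
table(Orange$Tree)                # number of observations per tree (7 each)
```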
The following example reads a table, ‘Hardware.csv’ into object, ‘TD’ on the R
workspace. The TD[1] command displays the first column as a data frame, and TD[, 1]
displays the first column as a vector (Figure 3.16).
where, x is an object or data frame, y is an object or data frame and by, by.x, by.y arguments
define the common columns or rows for merging. The all argument contains logical values,
‘TRUE’ or ‘FALSE’. If the value is TRUE then it returns the full outer join by adding all
rows of x and y into the result object.
all.x argument contains logical values, ‘TRUE’ or ‘FALSE’. If the value is TRUE then it
returns the dataset as per left outer join, i.e. the rows of x that have no matching row in y
are also added to the result, with NA in the columns that come from y. If the value is FALSE
then only the rows with data from both x and y are included in the result object.
all.y argument contains logical values, ‘TRUE’ or ‘FALSE’. If the value is TRUE then
it returns the dataset as per right outer join, i.e. the rows of y that have no matching row
in x are also added to the result, with NA in the columns that come from x. If the value is
FALSE then only the rows with data from both x and y are included in the result object.
The dots ‘...’ define the other optional arguments.
The following example creates two data frames, ‘S’ and ‘T’. Then both the data frames
are merged into a new data frame, ‘E’ (Figure 3.17).
In this example, two data frames, ‘S’ and ‘T’ are using different values to merge data.
The merge command returns the data frames after merging them using the left and right
outer join (Figure 3.18).
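The different joins can be sketched with two small data frames. The data frames S_df and T_df below are hypothetical stand-ins for ‘S’ and ‘T’, and their columns (ID, Item, Price) are invented for this illustration; they are not the data shown in the figures.

```r
# Hypothetical data: ID 1 appears only in S_df and ID 4 only in T_df
S_df <- data.frame(ID = c(1, 2, 3), Item = c("Pen", "Book", "Bag"))
T_df <- data.frame(ID = c(2, 3, 4), Price = c(50, 400, 900))

inner <- merge(S_df, T_df, by = "ID")                 # only IDs 2 and 3
left  <- merge(S_df, T_df, by = "ID", all.x = TRUE)   # IDs 1, 2, 3 (Price is NA for ID 1)
right <- merge(S_df, T_df, by = "ID", all.y = TRUE)   # IDs 2, 3, 4 (Item is NA for ID 4)
full  <- merge(S_df, T_df, by = "ID", all = TRUE)     # IDs 1, 2, 3, 4

left
full
```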
where, x is an object, by argument defines the list of grouping elements of the specific variable
of the dataset, FUN argument is a function that computes a summary statistic (such as the
mean) for each group and the dots ‘...’ define the other optional arguments.
The following example reads a table, ‘Fruit_data.csv’ into object, ‘S’. The aggregate()
function computes the mean price of each type of fruit. Here the by argument is
list(Fruit.Name = S$Fruit.Name), which groups the rows by the Fruit.Name column (Figure 3.19).
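Since ‘Fruit_data.csv’ is not reproduced here, the sketch below builds a small data frame in its place. The column name Fruit.Price and all the values are assumptions made only for this illustration.

```r
# Hypothetical stand-in for the contents of 'Fruit_data.csv'
S <- data.frame(Fruit.Name  = c("Apple", "Mango", "Apple", "Mango"),
                Fruit.Price = c(100, 60, 120, 80))

# One row per fruit; the aggregated values appear in a column named x
avg <- aggregate(S$Fruit.Price, by = list(Fruit.Name = S$Fruit.Name), FUN = mean)
avg
#   Fruit.Name   x
# 1      Apple 110
# 2      Mango  70
```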
3.12.1 Input
Input is the first step in any processing, including analytical data processing. Here, the
input is dataset, ‘Fruit’. For reading the dataset into R, use read.table() or read.csv()
function. In Figure 3.21, the dataset, ‘Fruit’ is being read into the R workspace using the
read.csv() function.
The read.table() function can also read data from CSV files. The syntax of the
function is
read.table("filename", header = TRUE, sep = ",", ...)
where,
filename argument defines the path of the file to be read, header argument contains
logical values TRUE and FALSE for defining whether the file has header names on the
first line or not, sep argument defines the character used for separating each column of
the file and the dots ‘…’ define the other optional arguments.
The following example reads a CSV file, ‘Hardware.csv’ using read.csv() and read.
table() function (Figure 3.27).
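The equivalence of the two functions can be checked on a small file. The sketch below writes a temporary CSV file first, because ‘Hardware.csv’ is not available here; the file contents and variable names are invented for this illustration.

```r
# Create a small CSV file in a temporary location (hypothetical contents)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Item,Price", "Keyboard,700", "Mouse,300"), csv_path)

via_csv   <- read.csv(csv_path)                                # header and comma are defaults
via_table <- read.table(csv_path, header = TRUE, sep = ",")    # both must be stated explicitly

via_csv
all.equal(via_csv, via_table)    # both calls give the same data frame
```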
Reading Spreadsheets
A spreadsheet is a table that stores data in rows and columns. Many applications are
available for creating a spreadsheet; Microsoft Excel is the most popular of these. An
Excel file uses the .xlsx extension and stores data in a spreadsheet.
In R, different packages are available such as gdata, xlsx, etc., that provide functions
for reading Excel files. A package must be loaded before any of its functions can be
used. The read.xlsx() is a function of the ‘xlsx’ package for
reading Excel files. The syntax of the read.xlsx() function is
read.xlsx("filename", ...)
where,
filename argument defines the path of the file to be read and the dots ‘...’ define the
other optional arguments. In the ‘xlsx’ package, the sheet to be read is given through
the sheetIndex or sheetName argument.
In R, reading or writing (importing and exporting) data using packages may create some
problems like incompatibility of versions, additional packages not loaded and so on. In
order to avoid these problems, it is better to convert files into CSV files. After converting
files into CSV files, the converted file can be read using the read.csv() function.
The following example illustrates creation of an Excel file, ‘Softdrink.xlsx’. The
‘Software.csv’ file is the converted form of the ‘Softdrink.xlsx’ file (Figure 3.28). The
function read.csv() reads this file into R (Figure 3.29).
library() Function
The library() function loads packages into the R workspace. It is compulsory to import
the package before reading the available dataset of that package. The syntax of the
library() function is:
library(packagename)
where,
packagename argument is the name of the package to be read.
Figure 3.30 Subset of the data from “SampleSuperstore.xls”
data() Function
The data() function lists all the available datasets of the loaded packages. To load a
particular dataset into the R workspace, users need to pass the name of that dataset
to the data() function. The syntax of the data() function is:
data(datasetname)
where,
datasetname argument is the name of the dataset to be read.
The following example illustrates the loading of a dataset. The data() function lists
all the available datasets of the loaded package. The ‘> Orange’ command
displays the content of the dataset, ‘Orange’.
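The two uses of data() can be sketched with the ‘datasets’ package, which ships with R. The names listing and available are chosen only for this example.

```r
library(datasets)                           # load the package that contains 'Orange'

listing   <- data(package = "datasets")     # information on the datasets in that package
available <- listing$results[, "Item"]      # just the dataset names
"Orange" %in% available                     # TRUE

data(Orange)                                # load the dataset by name
head(Orange, 3)                             # display its first three rows
```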
The following example illustrates web scraping. Web scraping extracts data from a
webpage of a website. Here package ‘RCurl’ is used for web scraping (Figure 3.32). At
first, the package, ‘RCurl’ is imported into the workspace and then the getURL() function of
the package, ‘RCurl’ fetches the required webpage. The htmlTreeParse() function, which
is provided by the ‘XML’ package, then parses the content of the webpage.
$Name
[1] “Ricky” “Danny” “Mitchelle” “Ryan” “Gerry” “Nonita”
[7] “Simon” “Gallop”
$Dept
[1] “IT” “Operations” “IT” “HR” “Finance”
[6] “IT” “Operations” “Finance”
<EMPLOYEE>
<EMPID>1002</EMPID>
<EMPNAME>Ramya</EMPNAME>
<SKILLS>People Management</SKILLS>
<DEPT>Human Resources</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<EMPID>1003</EMPID>
<EMPNAME>Fedora</EMPNAME>
<SKILLS>Recruitment</SKILLS>
<DEPT>Human Resources</DEPT>
</EMPLOYEE>
</RECORDS>
> print(output)
<?xml version=“1.0”?>
<RECORDS>
<EMPLOYEE>
<EMPID>1001</EMPID>
<EMPNAME>Merrilyn</EMPNAME>
<SKILLS>MongoDB</SKILLS>
<DEPT>ComputerScience</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<EMPID>1002</EMPID>
<EMPNAME>Ramya</EMPNAME>
<SKILLS>PeopleManagement</SKILLS>
<DEPT>HumanResources</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<EMPID>1003</EMPID>
<EMPNAME>Fedora</EMPNAME>
<SKILLS>Recruitment</SKILLS>
<DEPT>HumanResources</DEPT>
</EMPLOYEE>
</RECORDS>
Step 2: Extract the root node from the XML file.
> rootnode <- xmlRoot(output)
5. What is a package?
Ans: A package is a collection of functions and datasets. In R, many packages are available
for doing different types of operations.
6. What is the use of the library() function?
Ans: The library() function loads packages into the R workspace. It is compulsory to
import packages before reading the available dataset of that package.
Figure 3.33 shows the official screenshot of the RCommander (Rcmdr) GUI that is
available in R.