Module 2 Textbook Content

Data Analytics using R

Seema Acharya
Senior Lead Principal
Infosys Limited

McGraw Hill Education (India) Private Limited


CHENNAI

McGraw Hill Education Offices


Chennai New York St Louis San Francisco Auckland Bogotá Caracas
Kuala Lumpur Lisbon London Madrid Mexico City Milan Montreal
San Juan Santiago Singapore Sydney Tokyo Toronto
Contents

About the Author
Preface
Acknowledgements

Chapter 1 Introduction to R
1.1 Introduction
1.1.1 What is R?
1.1.2 Why R?
1.1.3 Advantages of R Over Other Programming Languages
1.2 Downloading and Installing R
1.2.1 Downloading R
1.2.2 Installing R
1.2.3 Primary File Types of R
1.3 IDEs and Text Editors
1.3.1 RStudio
1.3.2 Eclipse with StatET
1.4 Handling Packages in R
1.4.1 Installing an R Package
1.4.2 Few Commands to Get Started
Summary
Key Terms
Multiple Choice Questions
Short Questions

Chapter 2 Getting Started with R
2.1 Introduction
2.2 Working with Directory
2.2.1 getwd() Command
2.2.2 setwd() Command
2.2.3 dir() Function
2.3 Data Types in R
2.3.1 Coercion
2.3.2 Introducing Variables and ls() Function
2.4 Few Commands for Data Exploration
2.4.1 Load Internal Dataset
Key Terms
Summary
Practical Exercises

Chapter 3 Loading and Handling Data in R
3.1 Introduction
3.2 Challenges of Analytical Data Processing
3.2.1 Data Formats
3.2.2 Data Quality
3.2.3 Project Scope
3.2.4 Output Result via Stakeholder Expectation Management
3.3 Expression, Variables and Functions
3.3.1 Expressions
3.3.2 Logical Values
3.3.3 Dates
3.3.4 Variables
3.3.5 Functions
3.3.6 Manipulating Text in Data
3.4 Missing Values Treatment in R
3.5 Using the ‘as’ Operator to Change the Structure of Data
3.6 Vectors
3.6.1 Sequence Vector
3.6.2 rep function
3.6.3 Vector Access
3.6.4 Vector Names
3.6.5 Vector Math
3.6.6 Vector Recycling
3.7 Matrices
3.7.1 Matrix Access
3.8 Factors
3.8.1 Creating Factors
3.9 List
3.9.1 List Tags and Values
3.9.2 Add/Delete Element to or from a List
3.9.3 Size of a List
3.10 Few Common Analytical Tasks
3.10.1 Exploring a Dataset
3.10.2 Conditional Manipulation of a Dataset
3.10.3 Merging Data
3.11 Aggregating and Group Processing of a Variable
3.11.1 aggregate() Function
3.11.2 tapply() Function
3.12 Simple Analysis Using R
3.12.1 Input
3.12.2 Describe Data Structure
3.12.3 Describe Variable Structure
3.12.4 Output
3.13 Methods for Reading Data
3.13.1 CSV and Spreadsheets
3.13.2 Reading Data from Packages
3.13.3 Reading Data from Web/APIs
3.13.4 Reading a JSON (JavaScript Object Notation) Document
3.13.5 Reading an XML File
3.14 Comparison of R GUIs for Data Input
3.15 Using R with Databases and Business Intelligence Systems
3.15.1 RODBC
3.15.2 Using MySQL and R
3.15.3 Using PostgreSQL and R
3.15.4 Using SQLite and R
3.15.5 Using JasperDB and R
3.15.6 Using Pentaho and R
Case Study: Log Analysis
Summary
Key Terms
Multiple Choice Questions
Short Questions
Long Questions

Chapter 4 Exploring Data in R
4.1 Introduction
4.2 Data Frames
4.2.1 Data Frame Access
4.2.2 Ordering the Data Frames
4.3 R Functions for Understanding Data in Data Frames
4.3.1 dim() Function

Chapter 1
Introduction to R

LEARNING OUTCOME
At the end of this chapter, you will be able to:
• Install R
• Install any R package
• Work with any R package using functions such as find.package(), install.packages(),
  library(), vignette() and packageDescription()

1.1 INTRODUCTION
Statistical computing and large-scale data analysis called for a new category of computer
language, beyond the existing procedural and object-oriented programming languages, that
would support these tasks directly rather than requiring new software to be developed for
each of them. Plenty of data is available today, and it can be analysed in different ways to
provide a wide range of useful insights for multiple operations in various industries.
Problems such as the lack of support, tools and techniques for varied data analysis have
been addressed by one such language, called R.

1.1.1 What is R?
R is a scripting or programming language which provides an environment for statistical
computing, data science and graphics. It was inspired by, and is mostly compatible with,
the statistical language S developed at Bell Laboratories (formerly AT&T, now Lucent
Technologies). Although there are some very important differences between R and S, much
of the code written for S runs unaltered on R. R has become one of the most widely used
tools for computational statistics, visualisation and data science.

1.1.2 Why R?
R has opened tremendous scope for statistical computing and data analysis. It provides
techniques for various statistical analyses like classical tests and classification, time-
series analysis, clustering, linear and non-linear modelling and graphical operations. The
techniques supported by R are highly extensible.
S is the pioneer of statistical computing; however, it is a proprietary solution and is not
readily available to developers. In contrast, R is freely available under the GNU General
Public License. Hence, it helps the developer community in research and development.
Another reason behind the popularity and widespread use of R is its superior support
for graphics. It can provide well-developed and high-quality plots from data analysis.
The plots can contain mathematical formulae and symbols, if necessary, and users have
full control over the selection and use of symbols in the graphics. Hence, other than
robustness, user-experience and user-friendliness are two key aspects of R.

Why Learn R?
The following points describe why the R language should be learnt (Figure 1.1):
• If you need to run statistical calculations in your application, learn and deploy R. It
  easily integrates with programming languages such as Java, C++, Python and Ruby.
• If you wish to perform a quick analysis for making sense of data.
• If you are working on an optimisation problem.
• If you need to use re-usable libraries to solve a complex problem, leverage the 10,000+
  free packages available for R.
• If you wish to create compelling charts.
• If you aspire to be a Data Scientist.
• If you want to have fun with statistics.

Figure 1.1 Advantages of learning R language (labels: advanced statistics; supportive open
source community; fun with statistics; integration with other programming languages; free,
open source; easy extensibility; great visualisation; cross-platform compatibility)


• R is free. It is available under the terms of the Free Software Foundation’s GNU
  General Public License in source code form.
• It is available for Windows, Mac and a wide variety of Unix platforms (including
  FreeBSD, Linux, etc.).
• In addition to enabling statistical operations, it is a general programming language,
  so you can automate your analyses and create new functions.
• R has excellent tools for creating graphics such as bar charts, scatter plots,
  multi-panel lattice charts, etc.
• It has an object-oriented and functional programming structure along with support
  from a robust and vibrant community.
• R has a flexible analysis tool kit, which makes it easy to access data in various
  formats, manipulate it (transform, merge, aggregate, etc.), and subject it to traditional
  and modern statistical models (such as regression, ANOVA, tree models, etc.).
• R can be extended easily via packages. It relates easily to other programming
  languages. Existing software as well as emerging software can be integrated with R
  packages to make them more productive.
• R can easily import data from MS Excel, MS Access, MySQL, SQLite, Oracle, etc. It
  can easily connect to databases using ODBC (Open Database Connectivity) and the
  ROracle package.

1.1.3 Advantages of R Over Other Programming Languages


Advanced programming languages like Python also support statistical computing and
data visualisation along with traditional computer programming. However, R has an edge
over Python and similar languages because of the following two advantages:
1. Python needs third-party extensions and support for data visualisation and
   statistical computing, whereas R provides most of this in its base distribution. For
   example, the lm function for linear regression analysis is built into R. Data can be
   passed directly to the function, which returns an object with detailed information
   about the regression, including the coefficients, standard errors, residual values
   and so on. Reproducing the same functionality in Python relies on third-party
   libraries such as SciPy, NumPy and so on. Hence, R can do the same thing with a
   single line of code instead of taking support from third-party libraries.

SciPy is used for performing data analysis tasks and NumPy is used for representing the
data or objects.

2. R has a fundamental data type, the vector, which can be organised and aggregated
   in different ways even though the core is the same. The vector data type imposes
   some limitations on the language as it is a rigid type. However, it gives a strong
   logical base to R. Based on the vector data type, R uses the concept of data frames,
   which are like a matrix with attributes and an internal data structure similar to
   spreadsheets or relational databases. Hence, R follows a column-wise data structure
   based on the aggregation of vectors.
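
The two points above can be illustrated with a short sketch. It assumes only a standard R installation; mtcars is a data set bundled with R, and the student and score values are made up for illustration.

```r
# lm() lives in the stats package, which is attached by default.
# mtcars is one of the data sets bundled with R.
fit <- lm(mpg ~ wt, data = mtcars)

coef(fit)                    # intercept and slope estimates
summary(fit)$coefficients    # estimates, standard errors, t values, p values
head(residuals(fit))         # first few residuals

# A data frame is built from equal-length vectors, one per column.
student <- c("A", "B", "C")  # hypothetical values
score   <- c(72, 85, 90)
marks   <- data.frame(student = student, score = score)

str(marks)                   # column-wise structure
mean(marks$score)            # each column is still an ordinary vector
```

The object returned by lm() carries the whole regression result, so the coefficients, standard errors and residuals are extracted from it rather than computed separately.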

Just Remember
There are also some disadvantages of R. For example, R does not scale efficiently to larger data sets.
Hence, R is used mostly for prototyping and sandboxing and is rarely used for enterprise-level solutions.
By default, R uses single-threaded execution and works on data held in RAM, which leads to
scalability issues. Developers from open source communities are working on multi-threaded execution
and parallelisation, which will help R to utilise more than one processor core. Big data extensions are
also available from companies like Revolution Analytics (Revolution R), and these issues are expected to
ease over time. Other languages like S-PLUS can store objects permanently on disk and hence support
better memory management and the analysis of massive datasets.

Check Your Understanding


1. What is R?
Ans: R is an open source programming language for data science and statistical computing.

2. What is the predecessor of R?


Ans: The statistical computing language S is the predecessor of R.

3. What is the fundamental data type of R?


Ans: The fundamental data type of R is a vector.

4. What is the disadvantage of using R in enterprise-level large-scale solutions?


Ans: R language cannot scale up for large data sets. Hence, it is difficult to use R for large-
scale data analysis tasks for enterprise-level solutions.

1.2 DOWNLOADING AND INSTALLING R

The R distribution can be downloaded from the Comprehensive R Archive Network
(CRAN) [1]. The network includes mirror websites for downloading R from different
countries.

1.2.1 Downloading R
To download R, users need to visit the CRAN mirror page and click on the URL of the
chosen mirror that will redirect them to the respective site (Figure 1.2).

[1] URL of CRAN: https://cran.r-project.org/mirrors.html
Figure 1.2 CRAN website for downloading R

R is offered as a precompiled binary distribution of a base system and contributed
packages. Different distributions of R are available for different operating systems (OS)
like Windows, Mac and Linux.

In some Linux OS, R distributions are included by default. Hence, it is a good idea to check the
package management system of a Linux OS platform before installing R on it.

Downloading R for Windows


Windows users need to first download and install binaries for the base distribution. At the
time of writing, the current version of the base binary distribution is R 3.3.1. Users can
also check and download previous versions of R, contributed packages and Rtools from the
mirror website. Rtools is used for building R and its packages (Figure 1.3).

Downloading R for Mac


R works on Mac OS version 10.6 or later. The downloadable directory contains the base
distribution and packages for downloading and installing R on Mac (Figure 1.4).

Downloading R for Linux


Different distributions of R are available for different distributions of Linux like Ubuntu,
Debian, RedHat and SUSE (Figure 1.5). On the Command Line Interface (CLI), the
following command will download the source archive on a Linux machine:
$ wget http://cran.rstudio.com/src/base/R-3/R-3.1.1.tar.gz

1.2.2 Installing R
After downloading the R distribution for the correct OS platform, the next step is to install R.

Installing R on Windows
Installing R on Windows is simple. Users need to double click on the downloaded binary,
named R-3.3.1-win.exe, on a graphical interface. Command line installation options are
also available for Windows (Figure 1.6).

Two versions are available, for 32-bit and 64-bit Windows OS. By default, both versions are
installed. Hence, users who want only one of them need to select it manually during installation.
Figure 1.3 Downloading R for Windows

Figure 1.4 Downloading R on Mac

Figure 1.5 Downloading R for Linux distributions

Figure 1.6 R console on a 32-bit Windows PC


Installing Rtools
Rtools is an additional requirement for developing R packages under Windows OS
environment. In addition to installing the R software on Windows, users need to install
Rtools for the installed version of R.

Installing R on Mac
The process for installing R on Mac is similar to that for Windows. Users need to double
click on the binaries downloaded from the CRAN website and follow the prompts.

Installing R on Linux
R can be installed from the source on Linux distributions. The following steps build R and
install it into a user-specific subdirectory within the home directory. Because this prefix
lies inside the home directory, the install step itself does not need superuser privileges;
a system-wide prefix would.
$ tar xvf R-3.1.1.tar.gz
$ cd R-3.1.1
$ ./configure --prefix=$HOME/R
$ make && make install

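Setting the path on a Linux machine is critical. The bin directory under the install prefix
($HOME/R/bin in the steps above) has to be added to the PATH environment variable.
Without it, the shell cannot find the R and Rscript executables.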
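
Once R starts, the installation can be inspected from inside R. The following is a minimal sketch using base R functions only; the variable names path_dirs and on_path are illustrative.

```r
R.version.string        # version of the running R
R.home("bin")           # directory holding the R and Rscript executables
Sys.which("Rscript")    # empty string if Rscript is not found on the PATH

# Is R's bin directory itself listed in PATH?
path_dirs <- strsplit(Sys.getenv("PATH"), .Platform$path.sep, fixed = TRUE)[[1]]
on_path <- normalizePath(R.home("bin"), mustWork = FALSE) %in%
  normalizePath(path_dirs, mustWork = FALSE)
on_path
```

A FALSE result does not necessarily indicate a problem: many systems start R through a launcher or link placed in another directory that is on the PATH.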

1.2.3 Primary File Types of R


Working with R involves working on two types of files: RScripts and R markdown
documents.

RScript
An RScript is a text file that contains commands for an R program and has a .R extension.
The same commands can be executed individually on the CLI of an Integrated Development
Environment (IDE) for R programming, or they can be saved in an RScript and executed
together. The difference is that a command typed on the CLI runs once and is not saved,
whereas a script can be saved and re-run.
The command line interface suits quick, small data processing and checking operations.
Large-scale solutions integrate multiple programs during prototyping and subsequent
phases. In that case, RScripts are used for managing the integration process.
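
As a minimal sketch of the difference between the two modes, the commands below are first saved to a script file and then executed together with source(). The file name hello.R and the variable names are hypothetical; the file is written to a temporary directory so nothing is left behind.

```r
# Save two commands into a script file with a .R extension.
script_path <- file.path(tempdir(), "hello.R")   # hypothetical file name
writeLines(c(
  "x <- c(2, 4, 6)",
  "result <- mean(x)"
), script_path)

# Run every command in the file, in order, in the current session.
source(script_path)
result
```

The same file could also be run outside an interactive session by passing its path to the Rscript executable on the operating system's command line.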

Markdown Documents
R markdown documents are used for creating and authoring dynamic documents,
reports and presentations from R. R markdown uses a set of markdown syntaxes derived
from the core markdown syntaxes. An R markdown document combines text formatted
with these syntaxes and embedded chunks of R code. When the document is rendered,
the embedded code is executed and its output is formatted along with the text, which
makes the result easy to understand. R markdown documents can be regenerated
automatically if the underlying code or data change. The output of R markdown covers a
wide range of formats including PDF, HTML, HTML5 slides, websites, dashboards, Tufte
handouts, notebooks, books, MS Word, etc. The extension for R markdown document
files is .Rmd.
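
The structure of such a document can be sketched by writing a tiny one from R. The file name report.Rmd and its contents are made up for illustration. The final rendering step assumes that the rmarkdown package and pandoc are installed, so it is left commented out.

```r
fence <- strrep("`", 3)   # three backticks, the delimiter of a code chunk

rmd_lines <- c(
  "---",
  "title: \"Sample report\"",
  "output: html_document",
  "---",
  "",
  "The mean of the values is computed below.",
  "",
  paste0(fence, "{r}"),   # start of an embedded R chunk
  "mean(c(2, 4, 6))",
  fence                   # end of the chunk
)

rmd_path <- file.path(tempdir(), "report.Rmd")   # hypothetical file name
writeLines(rmd_lines, rmd_path)

# Rendering executes the chunk and inserts its output into the report.
# rmarkdown::render(rmd_path)
```

The header between the two "---" lines holds the document settings, the plain lines are markdown text, and the fenced chunk marked {r} is the code that is executed at render time.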

Check Your Understanding


1. How to locate an RScript file in a typical file system?
Ans: An RScript file can be located in a typical file system by verifying if the extension of the
file is .R.

2. What is R markdown and how is it different from word documentation?


Ans: R markdown documents are dynamic and reproducible. They are used for making
reports and documents with R and can be rendered into formats such as PDF, HTML
and Word. In contrast, a Word document is static: it contains no executable code and
is not regenerated when the underlying code or data change.

1.3 IDEs AND TEXT EDITORS


Various text editors can be used for writing RScripts and codes. Table 1.1 describes some
popular IDEs and text editors for writing and executing R codes.

Table 1.1 Some IDEs and text editors for writing and executing R codes
Name | Platform(s) | License | Details and Usage
Notepad++ and Notepad++ to R | Windows, Linux and Mac | GNU GPL | Notepad++ to R is an editor for R that is simple and robust. It supports extensions like code passing to the Notepad++ editor, the R GUI editor and optionally to a PuTTY window on a remote machine. It supports batch processing using shortcuts, monitoring of execution of RScripts and so on.
Tinn-R | Windows | GNU GPL | Tinn-R is a word processor and text editor that can process generic ASCII and UNICODE on Windows OS. It is well integrated into R and supports GUI and IDE for R.
Revolution Productivity Enhancer (RPE) | (not listed) | Commercial | Revolution Productivity Enhancer is an enhanced productivity environment for R. It can also work as an IDE for new users. The usability features of RPE are very supportive. It includes features like IntelliSense for word completion, code snippets, and so on. Hence, RPE is an integrated IDE and editor with built-in visual debugging tools.


There are various IDEs used in R language. You will learn about these IDEs in the
following section.

1.3.1 RStudio
RStudio is the most widely used IDE for writing, testing and executing R codes (Figure
1.7). It is a user-friendly and open source solution. A typical RStudio screen has the
following parts:
• Console, where users write a command and see the output
• Workspace tab, where users can see active objects from the code written in the
  console
• History tab, which shows a history of commands used in the code
• File tab, where folders and files can be seen in the default workspace
• Plot tab, which shows graphs
• Packages tab, which shows add-ons and packages required for running specific
  process(es)
• Help tab, which contains the information on IDE, commands, etc.

Figure 1.7 R Studio Interface



1.3.2 Eclipse with StatET


Eclipse is a well-known IDE for Java, C++, etc.; however, Eclipse can be used for statistical
programming based on R also. The corresponding IDE is called Eclipse with StatET.
Eclipse with StatET offers a set of tools that can be used for coding in R and building R
packages. It supports one or more local and remote installations of R. Its functionalities
can be expanded by using more add-ons like Sweave and Wikitext. Different parts of the
IDE are given below:
• Console for R
• Object browser
• Package manager
• Debugger
• Data viewer
• R help system.

1.4 HANDLING PACKAGES IN R


A package in R is the fundamental unit of shareable code. It is a collection of the following
elements:
• Functions
• Data sets
• Compiled code
• Documentation for the package and for the functions inside
• Tests: a few tests to check if everything works as it should.
The directory where packages are stored is called a library. R comes with a standard
set of packages. Others are available for download and installation as per requirement.
At the time of writing, there are over 10,000 packages available in CRAN. This is also one
of the reasons behind the huge popularity and success of R.
Packages are used to share code with others. Anyone can develop their own R package.
Any R user can then download, install and learn to use the package. Packages, therefore,
allow for an easy, transparent and cross-platform extension of the R base system.
R is an open source language; thus, new packages are being developed and updated
by developers daily. Some of these packages may not work properly or may have bugs.
Hence, it is not a good idea to use every new and updated package in the R development
environment, as this can affect its stability. A stable environment requires the sandboxing
technique (a security mechanism often used to execute untested or untrusted programs or
code from unverified or untrusted third parties, users, etc., without damaging the host
machine, operating system or production environment) to test a new or updated package
before installing it in the development environment.
In general, each installation of R on a computer has a single package library. Users can
change the path to that library to install packages in a location other than the default
package library. The function .libPaths() can be used to get or set the path of the
package library.

Example
> .libPaths()

Output
C:/R/R-3.1.3/library

This is the default package library location. The following command will change it
to another path:
Example
> .libPaths("~/R/win-library/3.1-mran-2016-07-02")

Output
C:/Users/User1/Documents/R/win-library/3.1-mran-2016-07-02
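
The get and set behaviour can be tried safely with a temporary directory. This is a sketch only: the folder name my-library is hypothetical, and the original paths are restored at the end.

```r
old_paths <- .libPaths()                        # current library locations

new_lib <- file.path(tempdir(), "my-library")   # hypothetical location
dir.create(new_lib, showWarnings = FALSE)       # the directory must exist

.libPaths(c(new_lib, old_paths))                # put the new library first
first_lib <- .libPaths()[1]
first_lib                                       # packages now install here by default

.libPaths(old_paths)                            # restore the earlier setting
```

Calling .libPaths() with an argument only changes the setting; calling it again without arguments displays the result. The change lasts for the current session only.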

R can be extended easily with the help of a rich set of packages. There are more than
10,000 packages available for R. These packages are used for different purposes. Tables
1.2 and 1.3 list some commonly used R packages for different purposes.

Table 1.2 Commonly used R packages for different purposes
Data Management | Data Visualisation | Data Products | Data Modelling and Simulation
dplyr, tidyr, foreign, haven etc. | ggplot2, ggvis, lattice, igraph etc. | shiny, slidify, knitr, markdown etc. | MASS, forecast, bootstrap, broom, nlme, ROCR, party etc.

Table 1.3 Commonly used packages in R
Author(s) | Package Name | Description | Available At
Andrew Gelman, et al. | arm | It is used for hierarchical or multi-level regression models. | http://cran.r-project.org/web/packages/arm/
Douglas Bates, Martin Maechler, and Ben Bolker | lme4 | It contains functions for fitting linear and generalised linear mixed-effects models. | http://cran.r-project.org/web/packages/lme4/
Duncan Temple Lang | RCurl | It provides an R interface to the libcurl library. The interface helps in interacting with the HTTP protocol for importing raw data from the web. | http://www.omegahat.org/RCurl/
Duncan Temple Lang | RJSONIO | It provides a set of functions to read and write JSON for analysing data from different web-based APIs. | http://www.omegahat.org/RJSONIO/
Duncan Temple Lang | XML | It provides functions and facilities for analysing HTML and XML documents to extract structured data from web-based sources. | http://www.omegahat.org/RSXML/
Gabor Csardi | igraph | It contains routines for network analysis and making simple graphs to represent social networks. | http://igraph.sourceforge.net/
Hadley Wickham | ggplot2 | It implements a grammar of graphics in R. The package is used for creating high-quality graphics. | http://had.co.nz/ggplot2/
Hadley Wickham | lubridate | The package provides functions to use dates in R in an easier way. | https://github.com/hadley/lubridate
Hadley Wickham | reshape | It contains a set of tools for manipulation, aggregation and management of data in R. | http://cran.r-project.org/web/packages/reshape/
Ingo Feinerer | tm | It contains functions to perform text mining in R. Text mining helps to work with unstructured data. | http://cran.r-project.org/web/packages/tm/
Jerome Friedman, Trevor Hastie, and Rob Tibshirani | glmnet | It helps to work with the elastic-net and also regularised and generalised linear models. | http://cran.r-project.org/web/packages/glmnet/index.html

1.4.1 Installing an R Package


R comes with some standard packages that are installed when a user first installs R;
additional packages can be installed separately. Users need to navigate through the package
library and install a package in the desired location. The following steps are used for
navigating through the R package library and installing an R package.
1. Start R as described in Step 2. The assumption is that R is already installed on
   your machine.
2. If there is an “R” icon on the desktop of the computer that you are using, double
   click on the “R” icon to start R. If there is no “R” icon on the desktop then click on
   the “Start” button at the bottom left of your computer screen, and then choose “All
   programs”, and start R by selecting “R” (or R X.X.X, where X.X.X gives the version
   of R, e.g. R 2.10.0) from the menu of programs.
3. The R console should show up.
4. Once you have started R, you can install an R package (e.g. the “ggplot2” package)
   by choosing “Install package(s)” from the “Packages” menu at the top of the R
   console. This will ask you for the website (CRAN mirror) that you wish to download
   the package from. You can choose “Iceland” (or another country, if you prefer). It will
   also bring up a list of available packages, from which you can choose the package
   that you want to install (e.g. “ggplot2”).
5. This will install the “ggplot2” package.
6. Whenever you want to use the “ggplot2” package after this, start R and first load the
   package by typing into the R console: library("ggplot2").
7. You can get help on a package by typing the following at the R prompt:
   help(package = "ggplot2")

1.4.2 Few Commands to Get Started


installed.packages()
A user can check for all installed packages on the machine by using the installed.
packages() function.

….
remove.packages() can be used to uninstall a package.
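
As a small sketch of how the result of installed.packages() can be inspected, the snippet below treats it as what it is, a character matrix with one row per package. The name somePackage in the last line is hypothetical, and that line is commented out so that nothing is uninstalled.

```r
pkgs <- installed.packages()            # character matrix, one row per package

head(pkgs[, c("Package", "Version")])   # first few package names and versions
nrow(pkgs)                              # number of installed packages
"stats" %in% rownames(pkgs)             # TRUE, because stats ships with R

# remove.packages("somePackage")        # hypothetical name; uninstalls a package
```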

packageDescription()
The “DESCRIPTION” file has the basic information about a package. It has details such as
what the package does, who the author is, the version, the date, the type of license it uses,
the package dependencies, etc. To access the description file inside R, use the function
packageDescription("package"). The same can also be accessed via the documentation of
the package by using help(package = "package").
Let us look at the description for the “stats” package.

> packageDescription("stats")
Package: stats
Version: 3.2.3
Priority: base
Title: The R Stats Package
Author: R Core Team and contributors worldwide
Maintainer: R Core Team <R-core@r-project.org>
Description: R statistical functions.
License: Part of R 3.2.3
Suggests: MASS, Matrix, SuppDists, methods, stats4
Build: R 3.2.3; x86_64-w64-mingw32; 2015-12-10 13:03:29 UTC; windows

-- File: C:/Program Files/R/R-3.2.3/library/stats/Meta/package.rds


Or
> help(package="stats")
The output shown is partial.
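
Individual fields of the description can also be read programmatically. The following sketch uses the “stats” package because it is part of every R installation; the version compared against is arbitrary.

```r
desc <- packageDescription("stats")

desc$Title                                    # a single field
packageDescription("stats", fields = c("Version", "License"))

packageVersion("stats")                       # version as a comparable object
packageVersion("stats") >= "3.0.0"            # useful for version checks in scripts
```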

help(package = "package")
To get an overview of all the functions and datasets in an R package, use the help()
function.
> help(package = "datasets")
The above will provide an overview of all functions and datasets inside the package
“datasets”. One of the datasets available in the “datasets” package is “AirPassengers”. To
access the “AirPassengers” dataset inside the “datasets” package, qualify its name with
the package name, i.e. datasets::AirPassengers.

If there will be frequent use of this package, it is worthwhile to load it into the memory.
This can be achieved using the library function:
> library(datasets)
Note: the package name can be specified with or without enclosing it in quotes. The
library() function will load the package “datasets” into the memory. Then any dataset
within this package can be accessed by simply typing the name of the dataset at the R prompt.
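
The two ways of reaching a dataset can be sketched as follows. Only the “datasets” package, which ships with R, is assumed.

```r
head(datasets::AirPassengers)   # package::name works without library()

library(datasets)               # usually attached already when R starts
class(AirPassengers)            # a time series ("ts") object
length(AirPassengers)           # 144 monthly observations

# A few of the datasets that the package provides
data(package = "datasets")$results[1:5, c("Item", "Title")]
```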

find.package() and install.packages() Command


find.package() locates an installed R package, and install.packages() downloads and
installs specific R package(s). install.packages() can be used in two ways: with a single
package name, to install one package at a time, or with a vector of names, to install
multiple packages at once using a single command. More details on commands like
find.package() and install.packages() can be retrieved using the help() command. For
example, help(install.packages) shows the arguments that the function accepts.

Example
To find and install a single package, the commands are:
> find.package("ggplot2")
> install.packages("ggplot2")

Output
The first command reports where the package named “ggplot2” is installed on the
system; it raises an error if the package is not installed. The install.packages() function
will then install the package named “ggplot2” from the CLI (Figure 1.8). It will download
and install the package and the packages that it depends on.

Figure 1.8 Example of installing a package

Example
To install more than one package at a time, the install.packages() command will
have the following format:
> install.packages(c("ggplot2", "tidyr", "dplyr"))

Output
It will install the packages ggplot2, tidyr and dplyr.

Whether a package is installed can be checked with an ‘if’ condition. The following command
loads the package “ggplot2” if it is installed and installs it otherwise:
> if (!require("ggplot2")) {install.packages("ggplot2")}

library()
library() command loads a package.
Example
>library(ggplot2)
Output
It will load the package “ggplot2”.
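
The find, install and load commands can be combined into one small routine. This is a sketch under stated assumptions: is_installed is a hypothetical helper name, and the package names used are ones that ship with R, so no download is triggered when it runs.

```r
# TRUE if the package can be located in any library on .libPaths()
is_installed <- function(pkg) {
  length(find.package(pkg, quiet = TRUE)) > 0   # quiet = TRUE avoids the error
}

wanted  <- c("stats", "utils")                  # replace with the packages needed
missing <- Filter(Negate(is_installed), wanted)

# Install only what is absent, then load everything that was requested
if (length(missing) > 0) install.packages(missing)
for (pkg in wanted) library(pkg, character.only = TRUE)

is_installed("stats")
```

Because library() normally takes the package name literally, character.only = TRUE is needed when the name is held in a variable.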

vignette()
Vignettes are a very useful source of help with packages. They are provided by the package
authors to demonstrate and highlight a few functionalities of their package in detail. Use
the browseVignettes() function to get a list of all vignettes available with your installed
packages.
> browseVignettes()

To view all vignettes for a specific package, e.g., “ggplot2”, use the vignette() function.
> vignette(package = "ggplot2")
Vignettes in package ‘ggplot2’:

ggplot2-specs Aesthetic specifications (source, html)
extending-ggplot2 Extending ggplot2 (source, html)
Check Your Understanding


1. Name a few packages used for data management in R.
Ans: dplyr, tidyr, foreign, haven, etc.

2. Name a few packages used for data visualisation in R.


Ans: ggplot2, ggvis, lattice, igraph, etc.

3. Name a few packages used for developing data products in R.


Ans: shiny, slidify, knitr, markdown, etc.

4. Name a few packages used for data modelling and simulation in R.


Ans: MASS, forecast, bootstrap, broom, nlme, ROCR, party, etc.

5. How can the default path to package library be changed in R?


Ans: To change the default package library in R, users need to follow the following steps on
the console of R IDE:
Step 1: Check the current path to the package library
> .libPaths()
Step 2: Change the path using the following command.
> .libPaths("write the desired path here")

6. What is the command to check and install the “dplyr” package?


Ans: if (!require("dplyr")) {install.packages("dplyr")}

7. How can we install multiple packages in R?


Ans: To install multiple packages in R, the command is:
> install.packages(c("ggplot2", "tidyr", "dplyr"))

Just Remember
Help in RStudio can be accessed from the console and from the CLI (Figure 1.9). The command
is help().

Figure 1.9 Accessing help() command from the console and CLI

Summary
• R is an open source and object-oriented programming language for statistical computing and data
  visualisation.
• R is a successor of the proprietary statistical computing programming language S.
• R can be downloaded and installed on different OS platforms like Windows, Linux and Mac.
• R has the fundamental data type of vector.
• Text editors like Notepad++ to R, Tinn-R and Rev R are more than just editors for R. These can
  support extended functionalities and IDE features.
• R has several IDEs like RStudio, Eclipse with StatET and so on.
• R has a rich library of more than 10,000 packages.
• R has two fundamental file types called RScripts and R markdown documents.
• R commands can be written in RScripts or through the command line interface.
• R has a rich collection of inbuilt data sets like mtcars, Biochemical Oxygen Demand (BOD), etc.

Key Terms

• BOD: An inbuilt data set in R, which contains data on the Biochemical Oxygen Demand.
• CLI: A console through which a user can interact with a computer. The interaction happens
  through successive lines of commands on the console.
• IDE: A special type of software that offers a set of comprehensive facilities to develop
  computer software. Usually, an IDE consists of a number of automation tools, a debugger and
  an editor for coding.
• R: An open source and object oriented programming language for statistical computing and
  data visualisation.

Multiple Choice Questions

1. What is R?
(a) An object-oriented programming language
(b) An open source project from CRAN
(c) A programming language for statistical computing
(d) All of these
2. R is a dialect of which one of the following programming languages?
(a) Python (b) C
(c) S (d) Q
3. Which one of the following is a text editor of R?
(a) RStudio (b) Microsoft Word
(c) Notepad++ to R (d) Tableau
4. Which of the following are IDEs for R?
(a) RStudio (b) Both a and c
(c) Eclipse with StatET (d) None of these
5. What is the primary file type of R?
(a) Vector (b) Text file
(c) RScripts (d) Statistical file
6. R can be downloaded from:
(a) CRAN website (b) Google PlayStore
(c) None of these (d) All of these
7. Which one of the following R packages is used for data management?
(a) haven (b) igraph
(c) slidify (d) forecast

8. Which one of the following R packages is used for data visualisation?


(a) haven (b) igraph
(c) slidify (d) forecast
9. Which one of the following R packages is used for data products?
(a) haven (b) igraph
(c) slidify (d) forecast
10. Which one of the following R packages is used for data modelling and simulation?
(a) haven (b) igraph
(c) slidify (d) forecast
11. The functionalities of R are divided among:
(a) Packages (b) Domains
(c) Libraries (d) None of these

Short Questions

1. What is R? What are the advantages of R programming language over other general purpose
programming languages?
2. How can we install a package on R?
3. Give examples of two IDEs for R.
4. Give detailed examples of three packages used in R.
5. Give a detailed description of head() command used in R.
6. How can we install multiple R packages with a single command?
7. State the difference(s) between head() and tail() commands used in R.
8. State the difference(s) between ncol() and nrow() commands used in R.

Answers to MCQs:
1. (d) 2. (c) 3. (c) 4. (b) 5. (c) 6. (a) 7. (a) 8. (b) 9. (c) 10. (d) 11. (a)
Chapter 2
Getting Started with R

LEARNING OUTCOME
At the end of this chapter, you will be able to:
• Analyse directory content with commands such as dir(), list.files()
• Analyse a dataset using functions such as str(), summary(), ncol(), nrow(),
  head(), tail(), edit()

2.1 Introduction
Data exploration in R is an approach to summarise and visualise important characteristics
of a data set. An exploratory data analysis focusses on understanding the underlying
variables and data structures to see how they can help in data analysis through various
formal statistical methods.

2.2 Working with Directory


Before writing a program or code using R, it is important to find out the directory being
used. This can be done using the getwd() function. If the current working directory is
not as per preference, it can be changed using the setwd() function. The dir() or the
list.files() functions give information about the files and directories in the current
working directory or any other directory.

2.2.1 getwd() Command


getwd() command returns the absolute filepath of the current working directory. This
function has no arguments.

Example
>getwd()

Output
[1] "C:/Users/User1/Documents/R"
Note the use of ‘/’ as the file separator on Windows. The file path does not have a trailing
‘/’ unless it is the root directory. The getwd() function can return NULL if the working
directory is not available.

2.2.2 setwd() Command


setwd() command resets the current working directory to another location as per the
user’s preference.
Example
> setwd("C:/path/to/my_directory")

Output
It will change the path to the user specified directory.
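The two commands are often used together. The following is a minimal sketch, assuming the temporary directory returned by tempdir() as the target (any existing directory would do); it records the current directory, switches away and then restores it.

```r
old_dir <- getwd()     # remember the current working directory
target  <- tempdir()   # assumption: a directory that is known to exist

setwd(target)          # switch to the new location
print(getwd())         # now reports the temporary directory

setwd(old_dir)         # restore the original working directory
print(getwd())
```

Restoring the original directory at the end keeps later relative file paths in a script working as expected.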

2.2.3 dir() Function


This is equivalent to list.files() function.
This function returns a character vector of the names of files or directories in the named
directory.
Syntax
dir(path = ".", pattern = NULL, all.files = FALSE,
    full.names = FALSE, recursive = FALSE,
    ignore.case = FALSE, include.dirs = FALSE, no.. = FALSE)
or
list.files(path = ".", pattern = NULL, all.files = FALSE,
           full.names = FALSE, recursive = FALSE,
           ignore.case = FALSE, include.dirs = FALSE, no.. = FALSE)
>dir()
character(0)

>list.files()
character(0)
The above output implies that there are no files or directories in the current directory.
Example 1
To display the files and directories in the current directory, use path = "." as an argument
to dir().

>dir(path=".")
[1] "att connect" "BI_May_2015.pptx" "BI_MetroMap-Final.png" "BISkillMatrix- Final.xlsx"
[5] "C" "cache" "Custom Office Templates" "Dec2016-Broadband Bill.pdf"
[9] "decision_tree.png" "Default.rdp" "desktop.ini" "DSS.wma"
[13] "ILP-AssociationRuleMining.pptx" "May-Broadband bill.pdf" "My Data Sources" "My Music"
[17] "My Pictures" "My Shapes" "My Tableau Repository" "My Videos"
[21] "Northwind 2007 sample.accdt" "Oct-Broadband bill.pdf" "OneNote Notebooks" "Outlokk Files"
[25] "R" "Remote Assistance Logs" "samplelinearregression.png" "SAP"
[29] "SQL Server Management Studio" "Visual Studio 2005" "Visual Studio 2008" "Visual Studio 2010"

Example 2
To display the list of all files and directories in a specific path, use the command as follows:
> dir (path="C:/Users/Seema_acharya")
[1] "AppData"
[2] "Application Data"
[3] "ATT_Connect_Setup.exe"
[4] "CD95F661A5C444F5A6AAECDD91C2410a.TMP"
[5] "Contacts"
[6] "Cookies"
[7] "Desktop"
[8] "Documents"
[9] "Downloads"
[10] "Favorites"
[11] "Links"
[12] "Local Settings"
[13] "Music"
[14] "My Documents"
[15] "NetHood"
[16] "NTUSER.DAT"
[17] "ntuser.dat.LOG1"
[18] "ntuser.dat.LOG2"
[19] "NTUSER.DAT{6cced2f1-6e01-11de-8bed-001e0bcd1824}.TM.blf"
[20] "NTUSER.DAT{6cced2f1-6e01-11de-8bed-001e0bcd1824}.
TMContainer00000000000000000001.regtrans-ms"
[21] "NTUSER.DAT{6cced2f1-6e01-11de-8bed-001e0bcd1824}.
TMContainer00000000000000000002.regtrans-ms"
[22] "ntuser.ini"
[23] "ntuser.pol"
[24] "Pictures"
[25] "PrintHood"
[26] "Recent"
[27] "Saved Games"
[28] "Searches"
[29] "SendTo"
[30] "Start Menu"
[31] "Templates"
[32] "Videos"

Example 3
To display the complete or absolute path of all files and directories in the specified path,
use dir() with the full.names = TRUE argument as follows:
> dir(path="C:/Users/Seema_acharya", full.names=TRUE)

Example 4
To look for a specific pattern, e.g. file/directory names beginning with a “D”, use the
dir() command with a pattern = "^D" argument.
> dir(path="C:/Users/Seema_acharya", pattern="^D")
[1] "Desktop" "Documents" "Downloads"

Example 5
To display a recursive list of files or directories in the specified path, use the dir()
command as follows:
> dir(path="d:/data")
[1] "db"
> dir(path="d:/data", recursive=TRUE,include.dirs=TRUE)
[1] "db" "db/Demo.0" "db/Demo.ns" "db/local.0" "db/local.ns"
"db/mongod.lock" "db/MyDB.0" "db/MyDB.ns"
The options or arguments used with dir() can also be used with list.files(). Try
it out and observe the output.
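The examples above depend on the contents of one particular machine. The following is a minimal sketch that can be run anywhere: it builds a small directory under tempdir() and then lists it. The directory name dir_demo and the file names are assumptions invented for illustration.

```r
# Build a throwaway directory tree (all names here are hypothetical)
demo_dir <- file.path(tempdir(), "dir_demo")
dir.create(file.path(demo_dir, "sub"), recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("Data.csv", "Draft.txt", "notes.txt"))))
invisible(file.create(file.path(demo_dir, "sub", "inner.txt")))

all_names       <- dir(demo_dir)                          # files and directories
starts_with_d   <- dir(demo_dir, pattern = "^D")          # names beginning with "D"
full_paths      <- list.files(demo_dir, full.names = TRUE) # directory path prepended
recursive_files <- list.files(demo_dir, recursive = TRUE) # descends into "sub"

print(all_names)
print(starts_with_d)
print(full_paths)
print(recursive_files)
```

Note that with recursive = TRUE and the default include.dirs = FALSE, only files are returned; the directory "sub" itself is not listed.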

2.3 Data Types in R


R is a programming language. Like other programming languages, R also makes use
of variables to store varied information. This means that when variables are created,
locations are reserved in the computer’s memory to hold the related values. The number of
locations or size of memory reserved is determined by the data type of the variables. Data
type essentially means the kind of value which can be stored, such as boolean, numbers,
characters, etc. In R, however, variables are not declared as data types. Variables in R are
used to store some R objects and the data type of the R object becomes the data type of
the variable. The most popular (based on usage) R objects are:
• Vector
• List
• Matrix
• Array
• Factor
• Data Frames
A vector is the simplest of all R objects, and it can hold values of any one data type. All
other R objects are based on these atomic vectors. The most commonly used data types
supported by R are:
• Logical
• Numeric
  - Integer
• Character
• Double
• Complex
• Raw
class() function can be used to reveal the data type. Other R objects such as list, matrix,
array, factor and data frames are discussed in detail in Chapter 3.

Logical
TRUE / T and FALSE / F are logical values.
> TRUE
[1] TRUE
> class(TRUE)
[1] "logical"
> T
[1] TRUE
> class(T)
[1] "logical"
> FALSE
[1] FALSE
> class(FALSE)
[1] "logical"
> F
[1] FALSE
> class(F)
[1] "logical"

Numeric
> 2
[1] 2
> class (2)
[1] "numeric"
> 76.25
[1] 76.25
> class(76.25)
[1] "numeric"

Integer
Integer data type is a subclass of numeric data type. Notice the use of “L” as a suffix to
a numeric value in order for it to be considered an “integer”.
> 2L
[1] 2
> class(2L)
[1] "integer"
Functions such as is.numeric(), is.integer() can be used to test the data type.
> is.numeric(2)
[1] TRUE
> is.numeric(2L)
[1] TRUE
> is.integer(2)
[1] FALSE
> is.integer(2L)
[1] TRUE
Note: Integers are numeric but NOT all numbers are integers.

Character
> "Data Science"
[1] "Data Science"
> class("Data Science")
[1] "character"
is.character() function can be used to ascertain if a value is a character.
> is.character ("Data Science")
[1] TRUE

Double (for double precision floating point numbers)
By default, numbers are of “double” type unless an L is suffixed to the number, in which
case it is treated as an integer.
> typeof (76.25)
[1] "double"

Complex
> 5 + 5i
[1] 5+5i
> class(5 + 5i)
[1] "complex"

Raw
> charToRaw("Hi")
[1] 48 69
> class (charToRaw ("Hi"))
[1] "raw"
typeof() function can also be used to check the data type (as shown).
> typeof(5 + 5i)
[1] "complex"
> typeof(charToRaw("Hi"))
[1] "raw"
> typeof ("DataScience")
[1] "character"
> typeof (2L)
[1] "integer"
> typeof (76.25)
[1] "double"
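The difference between class() and typeof() is easiest to see side by side. The following is a minimal sketch that applies both functions to one value of each data type discussed above; the variable names values and type_table are assumptions chosen for illustration.

```r
# One sample value per data type discussed in this section
values <- list(TRUE, 2, 2L, "Data Science", 5 + 5i, charToRaw("Hi"))

type_table <- data.frame(
  class  = sapply(values, class),    # "numeric" for a plain number
  typeof = sapply(values, typeof),   # "double" for the same number
  stringsAsFactors = FALSE
)
print(type_table)
```

The two functions agree for every value except the plain number 2, which has class "numeric" but internal type "double".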

2.3.1 Coercion
Coercion converts one data type to another, e.g. the logical value “TRUE” yields “1” when
converted to numeric. Likewise, the logical value “FALSE” yields “0”.
> as.numeric(TRUE)
[1] 1
> as.numeric(FALSE)
[1] 0
Numeric 5 can be converted to character “5” using as.character(). Similarly, as.integer()
converts 5.5 to the integer 5 by dropping the fractional part.
> as.character(5)
[1] "5"
> as.integer(5.5)
[1] 5
On converting the character string “hi” to the numeric data type, as.numeric() returns NA with a warning.
> as.numeric("hi")
[1] NA
Warning message:
NAs introduced by coercion
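Coercion also happens implicitly when values of different types are combined into one vector with c(). The following is a minimal sketch of that behaviour; the variable names are assumptions chosen for illustration.

```r
mixed_chr  <- c(1, TRUE, "a")             # a character element makes everything character
mixed_dbl  <- c(1L, 2.5)                  # integer and double combine to double
mixed_int  <- c(TRUE, 2L)                 # logical and integer combine to integer
true_count <- sum(c(TRUE, FALSE, TRUE))   # logicals are treated as 1 and 0

print(mixed_chr)
print(typeof(mixed_dbl))
print(mixed_int)
print(true_count)
```

In each case R picks the most general type among the elements, in the order logical, integer, double, character.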

2.3.2 Introducing Variables and ls() Function


R, like any other programming language, uses variables to store information. Let us start
by creating a variable “RectangleHeight” and assign the value 2 to it. Note the use of the
operator “<-” to assign a value to the variable. Likewise, the variable “RectangleWidth” is
defined and assigned the value 4. The area of the rectangle is computed using the formula
“RectangleHeight * RectangleWidth”. The computed value for the area of the rectangle is
stored in the variable “RectangleArea”.

> RectangleHeight <- 2
> RectangleWidth <- 4
> RectangleArea <- RectangleHeight * RectangleWidth
> RectangleHeight
[1] 2
> RectangleWidth
[1] 4
> RectangleArea
[1] 8
Note: When a value is assigned to a variable, it does not display anything on the console.
To get the value, type the name of the variable at the prompt.
Use the ls() function to list all the objects in the working environment.
> ls()
[1] "RectangleArea" "RectangleHeight" "RectangleWidth"
ls() is also useful for cleaning the environment before running code. Pass its result to the
rm() function as shown to remove all objects from the environment.
> rm(list=ls())
> ls()
character(0)
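Because rm(list = ls()) removes every object in the workspace, it can be safer to experiment in a separate environment. The following is a minimal sketch of the same rectangle example inside a scratch environment created with new.env(); the names scratch, before and after are assumptions chosen for illustration.

```r
scratch <- new.env()   # a separate environment, so the global workspace is untouched

assign("RectangleHeight", 2, envir = scratch)
assign("RectangleWidth", 4, envir = scratch)
scratch$RectangleArea <- scratch$RectangleHeight * scratch$RectangleWidth

before <- ls(scratch)                      # three object names
rm(list = ls(scratch), envir = scratch)    # remove everything in scratch only
after <- ls(scratch)                       # nothing left

print(before)
print(after)
```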

2.4 Few Commands for Data Exploration


This section will use functions such as summary(), str(), head(), tail(), View(),
edit(), etc., to explore a dataset. The dataset used in this section is “mtcars” from the
“datasets” package.
Background to the mtcars dataset from R documentation:
This data was extracted from the 1974 Motor Trend US magazine. It comprises fuel
consumption and 10 aspects of automobile design and performance for 32 automobiles
(1973–74 models).

2.4.1 Load Internal Dataset


There are various inbuilt datasets in R, e.g. AirPassengers, mtcars, BOD, etc. A list of
datasets is available at https://vincentarelbundock.github.io/Rdatasets/datasets.html
Let us load the mtcars dataset from the datasets package following the steps:
1. Check if the datasets package is already installed.
>installed.packages()
2. If already installed and will be used frequently, load the package.
>library(datasets)
3. Display the observations from the mtcars dataset.


mtcars is a dataset from the datasets package that has 32 observations on 11
variables. The 11 variables are described as follows:
[, 1] mpg Miles/(US) gallon
[, 2] cyl Number of cylinders
[, 3] disp Displacement (cu.in.)
[, 4] hp Gross horsepower
[, 5] drat Rear axle ratio
[, 6] wt Weight (1000 lbs)
[, 7] qsec 1/4 mile time
[, 8] vs V/S
[, 9] am Transmission (0 = automatic, 1 = manual)
[,10] gear Number of forward gears
[,11] carb Number of carburetors
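The steps above can also be checked programmatically rather than by reading the full output of installed.packages(). The following is a minimal sketch; the variable names are assumptions chosen for illustration.

```r
# Step 1: is the datasets package among the installed packages?
datasets_installed <- "datasets" %in% rownames(installed.packages())

# Step 2: load the package
library(datasets)

# Step 3: confirm that mtcars is one of its datasets and check its size
dataset_names <- data(package = "datasets")$results[, "Item"]
has_mtcars <- "mtcars" %in% dataset_names

print(datasets_installed)
print(has_mtcars)
print(dim(mtcars))   # rows (observations), then columns (variables)
```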

A subset of observations is given as follows:



summary() Command
summary() command reports descriptive statistics such as min, max, median, mean, etc., for each
variable present in the given data frame.
Example
>summary(mtcars)
Output
The output shows a six-point summary of each column or variable of the dataset
“mtcars”. The summary points are min, 1st quartile, median, mean, 3rd quartile and max
(Figure 2.1).

Figure 2.1 Example of summary() command
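To see where the six summary points come from, they can be computed one by one and compared with the result of summary(). The following is a minimal sketch for the single column mtcars$mpg; the variable names manual and builtin are assumptions chosen for illustration.

```r
mpg <- mtcars$mpg

# The six summary points, computed individually
manual <- c(
  min(mpg),
  quantile(mpg, 0.25),   # 1st quartile
  median(mpg),
  mean(mpg),
  quantile(mpg, 0.75),   # 3rd quartile
  max(mpg)
)

builtin <- summary(mpg)  # the same six points in one call

print(unname(manual))
print(builtin)
```

Calling summary() on the whole data frame simply repeats this calculation for every column.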

str() Command
str() command compactly displays the internal structure of an R object, such as a data frame.
It can be used as an alternative to the summary() function. It is a diagnostic function and
roughly displays one line per basic object.
Example 1
> str(str)
function (object, ...)

The above example shows the str() function itself serving as an argument. It compactly
displays the internal structure of str(), stating that it is a function which takes an object
as an argument.
Example 2
> str(ls)
function(name, pos = -1L, envir = as.environment(pos), all.names =
FALSE, pattern, sorted = TRUE)
Here, ls() is used as an argument to str() function. It provides a brief outline of
the ls() function.
Example 3
>str(mtcars)
Output
When a data frame named “mtcars” is supplied, the command shows the internal structure
of the data frame. The console output is:

>str(mtcars)
'data.frame':	32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
It shows the individual datatype of each column or variable of the mtcars dataset.
Example 4
Let us generate a vector, “x”, of 100 normally distributed random numbers using the function
rnorm(). To learn more about the rnorm() function, use help(rnorm) at the R prompt. The
mean and sd arguments used are 2 and 4, respectively.

When we run the summary() function with “x” as the argument, we get the “Minimum”,
“1st Quartile”, “Median”, “Mean”, “3rd Quartile” and “Maximum” for “x”.
Next, when we run str() on “x”, we get the information that “x” is a numeric vector
consisting of 100 elements, and it also displays the first 5 elements of the “x” vector.
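The commands for Example 4 can be sketched as follows. This is an illustrative reconstruction rather than the book's own listing; the call to set.seed() and the seed value 42 are assumptions added only so that the random numbers are repeatable.

```r
set.seed(42)                       # assumption: any seed would do
x <- rnorm(100, mean = 2, sd = 4)  # 100 draws with mean 2 and sd 4

print(summary(x))  # six-point summary of the sample
str(x)             # type, length and the first few values
```

Because the numbers are random, the summary values differ from run to run unless a seed is set.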
Example 5
Let us now take it a step further by creating a 10 by 10 matrix, “m” and calling str() on it.
> m <- matrix(rnorm(100),10,10)
> str(m)
 num [1:10, 1:10] -2.231 1.089 0.573 -0.183 0.964 ...
> m[,1]
[1] -2.2310749 1.0885324 0.5730995 -0.1827884 0.9638976 1.2520684
-1.8088454 0.3247033 0.7654839 -0.31007222
The str() function tells us that “m” is a numeric matrix of 10 rows and 10 columns and also
displays its first 5 values. These are the first 5 values of the first column, because a
matrix is stored column by column.
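The column-by-column storage is easier to see with a small matrix of known values. The following is a minimal sketch using the integers 1 to 6 instead of random numbers; the name str_line is an assumption chosen for illustration.

```r
m <- matrix(1:6, nrow = 2)   # filled column by column: 1,2 | 3,4 | 5,6

str_line <- capture.output(str(m))  # the text that str(m) prints
print(str_line)

print(m[, 1])   # first column
print(m[1, ])   # first row
```

The values listed by str() follow the columns (1 2 3 4 ...), not the first row (1 3 5).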
View() Command
View() command displays the given dataset in a spreadsheet-like data frame viewer.
Example
> View(mtcars)
Output
The output shows a tabular view of the content of the mtcars dataset (Figure 2.1).
head() Command
head() command displays the first “n” observations from the given data frame.
The default value for n is 6. However, users can specify the value of “n” as per their
requirement as well.
Example
>head(mtcars, n = 6)
Output
>head(mtcars, n = 6)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
>
The command shows the first 6 observations from mtcars.
tail() Command
tail() command displays the last “n” observations from a given data frame. The default
value for n is 6. However, users can specify the value of “n” as per their requirement as well.
Example
>tail(mtcars, n = 5)
Output
> tail(mtcars, n = 5)
mpg cyl disp hp drat wt qsec vs am gear carb
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2

Figure 2.1 Example of View() command

The command shows the last 5 observations from the data frame.
ncol() Command
ncol() command returns the number of columns in the given dataset.
Example
>ncol(mtcars)
Output
The output shows the number of columns in the “mtcars” dataset.
>ncol(mtcars)
[1] 11

nrow() Command
nrow() command returns the number of rows in the given dataset.
Example
>nrow(mtcars)
Output
The output shows the number of rows in the “mtcars” dataset.
>nrow(mtcars)
[1] 32
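The four commands head(), tail(), ncol() and nrow() are often used together for a first look at a dataset. The following is a minimal sketch on mtcars; the variable names are assumptions chosen for illustration, and the use of a negative n is an extra not covered in the text above.

```r
print(dim(mtcars))   # nrow() and ncol() in a single call

first_three <- rownames(head(mtcars, n = 3))   # names of the first 3 cars
last_two    <- rownames(tail(mtcars, n = 2))   # names of the last 2 cars

# A negative n drops rows from the other end: all rows except the last 30
all_but_last_30 <- head(mtcars, n = -30)

print(first_three)
print(last_two)
print(nrow(all_but_last_30))
```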
edit() Command
edit() command helps with the dynamic editing or data manipulation of a dataset. When
this command is invoked, a dynamic data editor window opens with a tabular view of
the dataset. Hereafter, the required changes to the dataset can be made.
Example
>edit(mtcars)
Output
The output shows the changes made in the first row of the “mtcars” dataset.
> edit(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 UPDATED 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

The modified dataset should be stored in a new variable, because edit() returns the edited copy
and leaves the original unchanged. For example, it is a good practice to call the edit()
function as mtcars_new <- edit(mtcars).

fix() Command
fix() command saves the changes in the dataset itself, so there is no need to assign the
result to a variable.
Example
> fix(mtcars)
> View(mtcars)
Output

Figure 2.2 Viewing the “mtcars” dataset after the modifications using the View() command

It shows the changes made to the first row of the dataset and the changes saved
automatically rather than being discarded as in the edit() method (Figure 2.2).
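Both edit() and fix() open an interactive editor, so they cannot be used inside a script that runs unattended. The following is a minimal sketch of a non-interactive alternative, assuming the same change as in the example (renaming the first row); it is not a replacement for the editor, only a way to make a small change reproducibly.

```r
mtcars_new <- mtcars                             # work on a copy
rownames(mtcars_new)[1] <- "Mazda RX4 UPDATED"   # rename the first row

print(head(rownames(mtcars_new), 3))   # the copy carries the change
print(head(rownames(mtcars), 3))       # the original is untouched
```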

To read help on any command in R, the user can type “?” followed by the function name on the
console.

data() Function
The data() function lists the available datasets.
Syntax
> data()

Output

data(trees) function loads the dataset, “trees”.


Syntax
> data(trees)

Let us look at the data held in the trees dataset.


> trees
Girth Height Volume
1 8.3 70 10.3
2 8.6 65 10.3
3 8.8 63 10.2
4 10.5 72 16.4
5 10.7 81 18.8
6 10.8 83 19.7
7 11.0 66 15.6
8 11.0 75 18.2
9 11.1 80 22.6
10 11.2 75 19.9
11 11.3 79 24.2
12 11.4 76 21.0
13 11.4 76 21.4
14 11.7 69 21.3
15 12.0 75 19.1
16 12.9 74 22.2
17 12.9 85 33.8
18 13.3 86 27.4
19 13.7 71 25.7
20 13.8 64 24.9
21 14.0 78 34.5
22 14.2 80 31.7
23 14.5 74 36.3
24 16.0 72 38.3
25 16.3 77 42.6
26 17.3 81 55.4
27 17.5 82 55.7
28 17.9 80 58.3
29 18.0 80 51.5
30 18.0 80 51.0
31 20.6 87 77.0
This dataset provides measurements of the girth, height and volume of timber in 31
felled black cherry trees.
Let us look at the summary of analysis on this dataset.
> summary(trees)
Girth Height Volume
Min. : 8.30 Min. :63 Min. :10.20
1st Qu. :11.05 1st Qu. :72 1st Qu.:19.40
Median :12.90 Median :76 Median :24.20
Mean :13.25 Mean :76 Mean :30.17
3rd Qu. :15.25 3rd Qu. :80 3rd Qu.:37.30
Max. :20.60 Max. :87 Max. :77.00
Let us visualise this by plotting a scatter plot between the variables of the trees dataset
(Figure 2.3).
> plot(trees, col="red", pch=16,main="scatter plot b/w variables of trees")

Figure 2.3 Scatter plot between the variables of the trees dataset
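When R runs without a graphics window, for example in a batch script, the same plot can be written to a file. The following is a minimal sketch that assumes a PDF file in the temporary directory as the destination; the name plot_file is an assumption chosen for illustration.

```r
data(trees)                               # load the dataset
plot_file <- tempfile(fileext = ".pdf")   # assumption: any writable path would do

pdf(plot_file)                            # open a PDF graphics device
plot(trees, col = "red", pch = 16,
     main = "scatter plot b/w variables of trees")
invisible(dev.off())                      # close the device so the file is written

print(file.exists(plot_file))
```

Because trees has three columns, plot() draws a matrix of scatter plots, one for each pair of variables.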

save.image() Function
save.image() function writes an external representation of R objects to the specified
file. At a later point in time when it is required to read back the objects, one can use the
load or attach function.
Syntax
save.image(file = ".RData", version = NULL, ascii = FALSE, safe = TRUE)
The file is to be given an extension of RData.
Note: The “R” and “D” in “RData” should be in capitals.
If ascii = TRUE, an ASCII representation of the file is saved. The default is ascii = FALSE,
in which case a binary representation of the file is saved.

version is used to specify the current workspace format version. The value of NULL
specifies the current default format.
safe is set to a logical value. A value of TRUE means that a temporary file is used to
create the saved workspace. This temporary file is renamed to file if the save succeeds.
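A round trip with save.image() and load() can be sketched as follows. This is a minimal illustration, assuming a temporary file as the destination and a made-up object named demo_value; the workspace is loaded back into a separate environment so that existing objects are not overwritten.

```r
# Put a hypothetical object into the global workspace
assign("demo_value", 42, envir = .GlobalEnv)

workspace_file <- tempfile(fileext = ".RData")  # assumption: any writable path
save.image(file = workspace_file)               # save the whole workspace

restored <- new.env()                           # load into a separate environment
load(workspace_file, envir = restored)

print(get("demo_value", envir = restored))
```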

Check Your Understanding


1. What are the differences between the head() and tail() commands in R?
Ans: The head() command shows records from the start of the dataset, whereas the tail()
command shows records from the end of the dataset.

2. What does the data() function help with?


Ans: The data() function lists the available datasets.

3. What is nrow() function?


Ans: nrow() command returns the number of rows in a given dataset.

Summary
• Data type essentially means the kind of value which can be stored, such as boolean, numbers,
  characters, etc. In R, however, variables are not declared as data types. Variables in R are used to
  store some R objects and the data type of the R object becomes the data type of the variable.
• ls() function lists all the objects in the working environment.
• class() function reveals the data type.
• typeof() function checks the data type.
• data() function lists the available datasets.

Key Terms

• dir(): dir() function returns a character vector of the names of files or directories in
  the named directory.
• getwd(): getwd() command returns the absolute file path of the current working
  directory. This function has no arguments.
• setwd(): setwd() command resets the current working directory to another location as
  per the user’s preference.
• typeof(): typeof() function is used to check the data type.

Practical Exercises

1. BOD is an inbuilt data set in R. The output of the command View(BOD) is given below.
What will be done by the code given below? Explain.
>View(BOD)

>nrow(BOD)

2. What will be done by the following code?


>head(BOD, n=3)

3. What will be the output of the following codes?


(a) The code is:
> summary(mtcars$mpg)
(b) The code is:
>summary(c(3,2,1,2,4,6))
(c) The code is:
>str(c(1,2,3,4))
(d) The code is:
> str(c("Mon", "Tue", "Wed", "Thurs"))
(e) The code is:
> head(c("Mon", "Tue", "Wed", "Thurs"), 2)
(f) The code is:
> tail(c("Mon", "Tue", "Wed", "Thurs"), 2)
(g) The code is:
> class(76.25L)
Chapter 3
Loading and Handling Data in R

LEARNING OUTCOME
At the end of this chapter, you will be able to:
• Store data of varied data types into vectors, matrices, and lists
• Load data from .csv, spreadsheets, web, JSON documents, and XML
• Deal with missing or invalid values
• Run R functions on the data (sum(), min(), max(), rep(), grep(), substr(),
  strsplit(), etc.)
• Use R with databases such as MySQL, PostgreSQL, SQLite, and JasperDB
• Create visualisations to help with deeper understanding of data

3.1 Introduction
Enterprise applications today generate a huge amount of data. This data is analysed to
draw useful insights that can help decision makers make better and faster decisions. This
chapter introduces the different data types such as numbers, text, logical values, dates,
etc., supported in R. It also describes various R objects such as vector, matrix, list, dataset,
etc., and how to manipulate data using R functions such as sum(), min(), max(), rep()
and string functions such as substr(), grep(), strsplit(), etc. It explores import of
data into R from .csv (comma separated values), spreadsheets, XML documents, JSON
(JavaScript Object Notation) documents, web data, etc., and interfacing R with databases
such as MySQL, PostgreSQL, SQLite, etc. There are quite a few challenges in analysing
data. For instance, data is not always homogeneous, i.e. it comes from varied sources and
in different formats. Ensuring data quality can pose several challenges. Stakeholders also
view data from many perspectives and may have different requirements from it.

3.2 Challenges of Analytical Data Processing


Analytical data processing is a part of business intelligence that includes relational database,
data warehousing, data mining and report mining. It is a computer processing technique
that handles different types of business processing practices like sales, budgeting, financial
reporting, management reporting, etc. All these processing techniques require big data.
Business analytics combines big data with technology. Different challenges occur
during business data analytics. However, most of these challenges are mainly associated
with data and they arise during the early stages of projects. Some of these challenges are
explained ahead.

3.2.1 Data Formats


Data is the main element of business analytics. Business analytics uses sets of data to store
a large amount of data. Selecting a data format is the first challenge in analytical data
processing for researchers or developers. Analytical data processing requires a complete
set of data, in the absence of which, developers can expect problems in further processing.
R is a well-documented programming language that stores data in the form of an
object. It has a very simple syntax that helps in processing any type of data. R provides
many packages and features such as open database connectivity (ODBC), which process
different types of data formats. For example, ODBC supports data formats such as CSV,
MS Excel, SQL, etc.

3.2.2 Data Quality


Maintaining data quality is another challenge in analytical data processing. Business
analysts are required to deliver perfect information, inferences, outliers and output
without any missing or invalid value. Data of inferior quality, whether as input or as
output, is bound to give incorrect results.
With the help of R, business analysts can maintain data quality. Different tools of R
help business analysts in removing invalid data, replacing missing values and removing
outliers in data.
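The following is a minimal sketch of two such clean-up steps on a small made-up vector: replacing a missing value with the median and dropping values outside 1.5 times the interquartile range. The data, the variable names and the choice of these two rules are assumptions for illustration; other rules may suit a real dataset better.

```r
# Hypothetical daily sales figures with one missing value and one extreme value
sales <- c(120, 135, NA, 128, 131, 900, 125)

# Replace the missing value with the median of the observed values
filled <- sales
filled[is.na(filled)] <- median(sales, na.rm = TRUE)

# Flag values outside 1.5 * IQR of the quartiles as outliers and drop them
q <- quantile(filled, c(0.25, 0.75))
fence <- 1.5 * (q[2] - q[1])
is_outlier <- filled < q[1] - fence | filled > q[2] + fence
cleaned <- filled[!is_outlier]

print(filled)
print(cleaned)
```

Here the missing value becomes 129.5 and the value 900 is removed, while the remaining six values are kept.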

3.2.3 Project Scope


Projects based on analytical data processing are costly and time consuming. Hence, before
starting a new project, business analysts should analyse the scope of the project. They
should identify the amount of data required from external sources, time of delivery and
other parameters related to the project.

3.2.4 Output Result via Stakeholder Expectation Management


In analytical data processing, analysts design projects that generate output with different
types of values like p-value, the degree of freedom, etc. However, users or stakeholders
prefer to see only the final output. The stakeholders do not want to see the constraints used
in data processing, assumptions, hypotheses, p-values, chi-square values or any other value.
Hence, an analytical project should try to fulfil all the expectations of the stakeholders.
Business analysts should use transparent methods and processes. They should also
validate the data using cross validation. If business analysts use the standard steps of
analytical data processing that generate the perfect output, they will not encounter any
problems. Data input, processing, descriptive statistics, visualisation of data, report
generation and output form the sequence of analytical data processing that analysts should
follow while conducting business analysis for their project.

Check Your Understanding


1. What is analytical data processing?
Ans: Analytical data processing is a part of business intelligence that includes relational
database, data warehousing, data mining and report mining.

2. List the challenges of analytical data processing.


Ans: Some challenges of analytical data processing are:
• Data formats
• Data quality
• Project scope
• Output results via stakeholder expectation management.

3. What are the common steps of analytical data processing?


Ans: Data input, processing, descriptive statistics, visualisation of data, report generation
and output are the common steps of analytical data processing.

3.3 Expressions, Variables and Functions


Let us get familiar with the R interface. We will start out by practicing expressions,
variables and functions.

3.3.1 Expressions
Look at a few arithmetic operations such as addition, subtraction, multiplication, division,
exponentiation, finding the remainder (modulus), integer division and computing the
square root as given in Table 3.1.

Table 3.1 Arithmetic operations


Operation                  Operator   Description                        Example
Addition                   x + y      y added to x                       > 4 + 8
                                                                         [1] 12
Subtraction                x - y      y subtracted from x                > 10 - 3
                                                                         [1] 7
Multiplication             x * y      x multiplied by y                  > 7 * 8
                                                                         [1] 56
Division                   x / y      x divided by y                     > 8 / 3
                                                                         [1] 2.666667
Exponentiation             x ^ y      x raised to the power y            > 2 ^ 5
                           x ** y                                        [1] 32
                                                                         Or
                                                                         > 2 ** 5
                                                                         [1] 32
Modulus                    x %% y     Remainder of (x divided by y)      > 5 %% 3
                                                                         [1] 2
Integer Division           x %/% y    x divided by y but rounded down    > 5 %/% 2
                                                                         [1] 2
Computing the Square Root  sqrt(x)    Computing the square root of x     > sqrt(25)
                                                                         [1] 5

3.3.2 Logical Values


Logical values are TRUE and FALSE or T and F. Note that these are case sensitive. The
equality operator is ==.
> 8 < 4
[1] FALSE
> 3 * 2 == 5
[1] FALSE
> 3 * 2 == 6
[1] TRUE
> F == FALSE
[1] TRUE
> T == TRUE
[1] TRUE

Guided Activity
Step 1: Create a vector, x, consisting of 10 elements with values ranging from 1 to 10. Section
3.6 of this chapter deals with creating vectors, accessing vector elements, vector arithmetic,
etc.
> x <- c(1:10)

Step 2: Display the contents of the vector, x.


> x
[1] 1 2 3 4 5 6 7 8 9 10
Step 3: Print the values of those elements whose values are either greater than 7 or less
than 5.
‘|’ is the OR operator. Use it to display the elements whose values are either
greater than 7 or less than 5.
> x[(x>7) | (x<5)]
[1] 1 2 3 4 8 9 10

Explanation
Part (i) Display ‘TRUE’ for elements whose values are more than 7, else display ‘FALSE’.
> x>7
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
Part (ii) Display ‘TRUE’ for elements whose values are less than 5, else display ‘FALSE’.
> x<5
[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Step 4: Print the values of those elements whose values are greater than 7 and less than 10.
‘&’ is the AND operator. Use the AND operator to display elements whose values are
greater than 7 and less than 10.
> x[(x>7) & (x<10)]
[1] 8 9

3.3.3 Dates
The default format of date is YYYY-MM-DD.
(i) Print system’s date.
> Sys.Date()
[1] "2017-01-13"
(ii) Print system’s time.
> Sys.time()
[1] "2017-01-13 10:54:37 IST"
(iii) Print the time zone.
> Sys.timezone()
[1] "Asia/Calcutta"
(iv) Print today’s date.
> today <- Sys.Date()
> today
[1] "2017-01-13"
> format(today, format = "%B %d %Y")
[1] "January 13 2017"

(v) Store date as a text data type.
> CustomDate = "2016-01-13"
> CustomDate
[1] "2016-01-13"
> class(CustomDate)
[1] "character"
(vi) Convert the date stored as text data type into a date data type.
> CustDate = as.Date(CustomDate)
> class(CustDate)
[1] "Date"
> CustDate
[1] "2016-01-13"
(vii) Find the difference between two dates that are stored as strings.
> strDates <- c("08/15/1947", "01/26/1950")
(viii) Convert the strings into date format.
> dates = as.Date(strDates, "%m/%d/%Y")
> dates
[1] "1947-08-15" "1950-01-26"
(ix) Compute the difference between the two dates.
> dates[2] - dates[1]
Time difference of 895 days
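The result of subtracting two dates is a ‘difftime’ object rather than a plain number. The following sketch, which reuses the two dates from the steps above, shows one way to turn the difference into a number and to express it in other units. The variable name ‘gap’ is an assumption made for illustration.

```r
# The same two dates as above, parsed from month/day/year strings
dates <- as.Date(c("08/15/1947", "01/26/1950"), format = "%m/%d/%Y")

gap <- dates[2] - dates[1]   # a 'difftime' object measured in days
class(gap)

as.numeric(gap)              # the difference as a plain number: 895

# difftime() lets the unit be chosen explicitly
difftime(dates[2], dates[1], units = "weeks")

# weekdays() reports the day of the week for each date
weekdays(dates)
```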

3.3.4 Variables
(i) Assign a value of 50 to the variable called ‘Var’.
> Var <- 50
Or
> Var = 50
(ii) Print the value in the variable, ‘Var’.
> Var
[1] 50
(iii) Perform arithmetic operations on the variable, ‘Var’.
> Var + 10
[1] 60
> Var / 2
[1] 25
Variables can be reassigned values either of the same data type or of a different data
type.
(iv) Reassign a string value to the variable, ‘Var’.
> Var <- "R is a Statistical Programming Language"
Print the value in the variable, ‘Var’.
> Var
[1] "R is a Statistical Programming Language"
(v) Reassign a logical value to the variable, ‘Var’.
> Var <- TRUE
> Var
[1] TRUE

3.3.5 Functions
In this section we will try out a few functions such as sum(), min(), max() and seq().

sum() function
sum() function returns the sum of all the values in its arguments.

Syntax
sum(..., na.rm = FALSE)
where ... denotes numeric, complex or logical vectors, and
na.rm accepts a logical value that states whether missing values (including NaN (Not a
Number)) should be removed before summing.
Examples
(i) Sum the values ‘1’, ‘2’ and ‘3’ provided as arguments to sum()
> sum(1, 2, 3)
[1] 6
(ii) What will be the output if NA is used for one of the arguments to sum()?
> sum(1, 5, NA, na.rm=FALSE)
[1] NA
If na.rm is FALSE, an NA or NaN value in any of the arguments will cause NA or
NaN to be returned.
(iii) What will be the output if NaN is used for one of the arguments to sum()?
> sum(1, 5, NaN, na.rm= FALSE)
[1] NaN
(iv) What will be the output if NA and NaN are used as arguments to sum()?
> sum(1, 5, NA, NaN, na.rm=FALSE)
[1] NA
(v) What will be the output if option, na.rm is set to TRUE?
If na.rm is TRUE, an NA or NaN value in any of the arguments will be ignored.
> sum(1, 5, NA, na.rm=TRUE)
[1] 6
> sum(1, 5, NA, NaN, na.rm=TRUE)
[1] 6

min() function
min() function returns the minimum of all the values present in its arguments.

Syntax
min(..., na.rm = FALSE)
where ... denotes numeric or character arguments and na.rm accepts a logical value that
states whether missing values (including NaN) should be removed.
Example
> min(1, 2, 3)
[1] 1
If na.rm is FALSE, an NA or NaN value in any of the argument will cause NA or NaN
to be returned.
> min(1, 2, 3, NA, na.rm=FALSE)
[1] NA
> min(1, 2, 3, NaN, na.rm=FALSE)
[1] NaN
> min(1, 2, 3, NA, NaN, na.rm=FALSE)
[1] NA
If na.rm is TRUE, an NA or NaN value in any of the argument will be ignored.
> min(1, 2, 3, NA, NaN, na.rm=TRUE)
[1] 1

max() function
max() function returns the maximum of all the values present in its arguments.

Syntax
max(..., na.rm = FALSE)
where ... denotes numeric or character arguments and na.rm accepts a logical value that
states whether missing values (including NaN) should be removed.
Example
> max(44, 78, 66)
[1] 78
If na.rm is FALSE, an NA or NaN value in any of the arguments will cause NA or NaN
to be returned.
> max(44, 78, 66, NA, na.rm=FALSE)
[1] NA
> max(44, 78, 66, NaN, na.rm=FALSE)
[1] NaN
> max(44, 78, 66, NA, NaN, na.rm=FALSE)
[1] NA
If na.rm is TRUE, an NA or NaN value in any of the argument will be ignored.
> max(44, 78, 66, NA, NaN, na.rm=TRUE)
[1] 78
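In practice these functions are usually applied to a whole vector rather than to individual values. The sketch below uses a hypothetical vector ‘scores’ (an assumption for illustration) that contains one missing value, and shows the effect of na.rm on sum(), min(), max() and the related range() function.

```r
# Hypothetical scores with one missing value
scores <- c(44, 78, NA, 66)

sum(scores)                   # NA, because one value is missing
sum(scores, na.rm = TRUE)     # 188
min(scores, na.rm = TRUE)     # 44
max(scores, na.rm = TRUE)     # 78
range(scores, na.rm = TRUE)   # 44 78 (minimum and maximum together)
```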

seq() function
seq() function generates a regular sequence.

Syntax
seq(from, to, by, length.out)
where,
from: It is the start value of the sequence.
to: It is the maximal or end value of the sequence.
by: It is the increment of the sequence.
length.out: It is the desired length of the sequence.
Example
> seq(1, 10, 2)
[1] 1 3 5 7 9
> seq(1, 10, length.out=10)
[1] 1 2 3 4 5 6 7 8 9 10
> seq(18)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Or
> seq_len(18)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
> seq(1, 6, by=3)
[1] 1 4

3.3.6 Manipulating Text in Data


There are many inbuilt string functions available in R that manipulate text or string.
Finding a part of some text string, searching some string in a text or concatenating strings
and other similar operations come under manipulating text operation. Table 3.2 explains
some useful text manipulation operations.
Let us take a look at how R treats strings.
String values have to be enclosed within single or double quotes.
> "R is a statistical programming language"
[1] "R is a statistical programming language"

Table 3.2 Text manipulation of inbuilt functions of R


Function                   Function Arguments                            Description
substr(a, start, stop)     • a is a character vector.                    Returns the part of the string that begins
                           • start and stop are numeric positions.       at start and ends at stop.
strsplit(a, split, ...)    • a is a character vector.                    Splits the given text string into
                           • split is a character vector that contains   substrings.
                             a regular expression for splitting.
paste(..., sep = " ", ...) • The dots ‘...’ denote R objects.            Concatenates the objects after converting
                           • sep is a character string that separates    them into strings.
                             the objects.
grep(pattern, a)           • pattern is the text pattern to match.       Searches a for the pattern and returns the
                           • a is a character vector.                    indices of the elements that match.
toupper(a)                 • a is a character vector.                    Converts a string into uppercase.
tolower(a)                 • a is a character vector.                    Converts a string into lowercase.

Figure 3.1 describes the strsplit() and grep() in the R workspace

Figure 3.1 Examples of string functions
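The following is a minimal sketch of how strsplit(), grep() and paste() behave at the console. The sentence and the variable names ‘sentence’ and ‘words’ are assumptions chosen for illustration and are not necessarily those used in the figure.

```r
sentence <- "R is a statistical programming language"

# strsplit() returns a list with one element per input string,
# so [[1]] extracts the words of the first (and only) string
words <- strsplit(sentence, split = " ")[[1]]
words

# grep() returns the positions of the elements that match the pattern
grep("a", words)                 # 3 4 5 6

# value = TRUE returns the matching elements instead of their positions
grep("a", words, value = TRUE)

# paste() with collapse joins the words back into a single string
paste(words, collapse = " ")
```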



A few string functions are explained in detail as follows.

rep() function
rep() function repeats a given argument for a specified number of times. In the example
below, the string, ‘statistics’ is repeated three times.
Example
> rep("statistics", 3)
[1] "statistics" "statistics" "statistics"

grep() function
In the example below, the function grep() finds the index position at which the string,
‘statistical’ is present.
Example
> grep("statistical", c("R", "is", "a", "statistical", "language"),
       fixed=TRUE)
[1] 4

toupper() function
toupper() function converts a given character vector into upper case.

Syntax
toupper(x)
x → is a character vector
Example
> toupper("statistics")
[1] "STATISTICS"
Or
> casefold("r programming language", upper=TRUE)
[1] "R PROGRAMMING LANGUAGE"

tolower() function
tolower() function converts the given character vector into lower case.

Syntax
tolower(x)
x → is a character vector
Example
> tolower("STATISTICS")
[1] "statistics"
Or
> casefold("R PROGRAMMING LANGUAGE", upper=FALSE)
[1] "r programming language"

substr() function
substr() function extracts or replaces substrings in a character vector.

Syntax
substr(x, start, stop)
x → character vector
start → start position of extraction or replacement
stop → stop or end position of extraction or replacement
Example
Extract the string ‘tic’ from ‘statistics’. Begin the extraction at position 7 and continue the
extraction till position 9.
> substr("statistics", 7, 9)
[1] "tic"
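substr() can also be used on the left-hand side of an assignment to replace part of a string. The short sketch below, with the hypothetical variable ‘word’, replaces the first four characters.

```r
word <- "statistics"

# Replace the characters at positions 1 to 4 in place
substr(word, 1, 4) <- "STAT"
word   # "STATistics"
```

The replacement keeps the string at its original length, so only as many characters as fit between start and stop are overwritten.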

3.4 Missing Values Treatment in R


During analytical data processing, users come across problems caused by missing and
infinite values. To get an accurate output, users should remove or clean the missing values.
In R, NA (Not Available) represents missing values and Inf (Infinite) represents infinite
values. R provides different functions that identify the missing values during processing
(Table 3.3).

Table 3.3 Functions for handling missing values


Function             Function Arguments                           Description
is.na(x)             x is an R object to be tested.               Checks the object and returns TRUE for
                                                                  each element that is missing.
na.omit(x, ...)      x is an R object from which NA needs to be   Returns the object after removing
                     removed. The dots ‘...’ denote other         missing values from it.
                     optional arguments.
na.exclude(x, ...)   x is an R object from which NA needs to be   Returns the object after removing
                     removed. The dots ‘...’ denote other         missing values from it.
                     optional arguments.
na.fail(x, ...)      x is an R object to be checked for NA.       Raises an error if the object contains
                     The dots ‘...’ denote other optional         any missing value and returns the
                     arguments.                                   object if it does not contain any
                                                                  missing value.
na.pass(x, ...)      x is an R object that may contain NA.        Returns the object unchanged.
                     The dots ‘...’ denote other optional
                     arguments.

The following example creates a vector ‘A’ with a missing value, c(10, 20, NA, 40)
(Figure 3.2). is.na(A) returns TRUE for the missing value. na.omit(A) and
na.exclude(A) remove the missing value, and the results are stored in the vectors ‘B’
and ‘D’, respectively. na.fail(A) generates an error because A has a missing value.
na.pass(A) returns the vector A unchanged.

Figure 3.2 Handling missing values
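The steps described for Figure 3.2 can be sketched at the console as follows. The error message printed for na.fail() is captured with tryCatch() so that the script keeps running. The variable ‘result’ and the wording of the replacement message are assumptions added for illustration.

```r
A <- c(10, 20, NA, 40)

is.na(A)              # FALSE FALSE TRUE FALSE

B <- na.omit(A)       # missing value removed
D <- na.exclude(A)    # missing value removed
as.vector(B)          # 10 20 40 (as.vector drops the bookkeeping attributes)

# na.fail() stops with an error because A contains NA;
# tryCatch() converts that error into a message
result <- tryCatch(na.fail(A),
                   error = function(e) "na.fail() stopped: missing value found")
result

na.pass(A)            # A is returned unchanged, NA included
```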

3.5 Using the ‘as’ Operator to Change the Structure of Data


Sometimes analytical data processing requires data conversion from one data format
into another. Generally, analytical data processing stores data in a table format, wherein
it requires only some part of the table or another structure to store the table’s data. In
this case, R can convert the structure of the table into other structures like factor, list, etc.
The operator ‘as’ provides the facility to convert the structure of one dataset into another
structure in R. The syntax of using this operator is
as.objecttype(objectname)
where,
objecttype is the type of object like data.frame, matrix, list, etc. and objectname is the
name of the object that needs to be converted into another format.
Also, the as.numeric() and as.character() functions convert values into numbers and
character strings, respectively.
The following example creates a data frame D using two vectors a and b (Figure 3.3).
Now the command ‘as.list(D)’ converts the data frame into list B. The command
‘as.matrix(D)’ converts the data frame into a matrix.

Figure 3.3 Use of ‘as’ operator
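A minimal sketch of these conversions is given below. The contents of the vectors ‘a’ and ‘b’ and the matrix name ‘M’ are assumptions made for illustration, since the values used in the figure are not reproduced here.

```r
# Two hypothetical vectors of equal length
a <- c(1, 2, 3)
b <- c("x", "y", "z")

D <- data.frame(a, b)   # data frame with columns a and b

B <- as.list(D)         # list with one element per column
str(B)

M <- as.matrix(D)       # matrix; mixed columns are converted to character
M

as.numeric("3.14")      # character string converted to a number
as.character(42)        # number converted to a character string
```

Because a matrix can hold only one data type, converting a data frame with both numeric and character columns yields a character matrix.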

Check Your Understanding


1. What is the na.omit() function?
Ans: The na.omit() function is an inbuilt function of R that returns the object after
removing missing values from it.

2. What is the na.exclude() function?


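Ans: The na.exclude() function is an inbuilt function of R that returns the object after
removing missing values from it. Unlike na.omit(), it records the removed positions so
that functions such as residuals() and predict() pad their results with NA.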


3. What is na.fail() function?


Ans: The na.fail() function is an inbuilt function of R that shows an error if the object
contains any missing value and returns the object if it does not contain any missing
value.

4. Which function is used for checking missing values in an R object?


Ans: The is.na() is used for checking missing values in an R object. The function checks
the object and returns true if data is missing.

5. What is the use of ‘as’ operator?


Ans: ‘as’ operator converts the structure of one dataset into another structure using R.

3.6 Vectors
A vector can have a list of values. The values can be numbers, strings or logical. All the
values in a vector should be of the same data type.
A few points to remember about vectors in R are:
• Vectors are stored like arrays in C
• Vector indices begin at 1
• All vector elements must have the same mode such as integer, numeric (floating
  point number), character (string), logical (Boolean), complex, object, etc.
Let us create a few vectors.
1. Create a vector of numbers
> c(4, 7, 8)
[1] 4 7 8
The c function (c is short for combine) creates a new vector consisting of three
values, viz. 4, 7 and 8.
2. Create a vector of string values.
> c("R", "SAS", "SPSS")
[1] "R" "SAS" "SPSS"
3. Create a vector of logical values.
> c(TRUE, FALSE)
[1] TRUE FALSE
A vector cannot hold values of different data types. Consider the example below on
placing integer, string and Boolean values together in a vector.
> c(4, 8, "R", FALSE)
[1] "4" "8" "R" "FALSE"

All the values are converted into the same data type, i.e. ‘character’.
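The conversion follows a fixed order: logical values are converted to numbers when mixed with numbers, and everything is converted to character when a string is present. The following sketch checks this with class().

```r
class(c(4L, 7.5))         # integer + double  -> "numeric"
class(c(4, TRUE))         # number + logical  -> "numeric"
class(c(4, "R", FALSE))   # anything + string -> "character"

c(4, TRUE, FALSE)         # TRUE becomes 1 and FALSE becomes 0
```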

4. Declare a vector by the name, ‘Project’ of length 3 and store values in it.
> Project <- vector(length = 3)
> Project[1] <- "Finance Project"
> Project[2] <- "Retail Project"
> Project[3] <- "Energy Project"
Outcome
> Project
[1] "Finance Project" "Retail Project" "Energy Project"
> length (Project)
[1] 3

3.6.1 Sequence Vector


A sequence vector can be created with a start:end notation.

Objective
Create a sequence of numbers between 1 and 5 (both inclusive).
> 1:5
[1] 1 2 3 4 5
Or
> seq(1, 5)
[1] 1 2 3 4 5
The default increment with seq is 1. However, it also allows the use of increments
other than 1.
> seq (1, 10, 2)
[1] 1 3 5 7 9
Or
> seq (from=1, to=10, by=2)
[1] 1 3 5 7 9
Or
> seq (1, 10, by=2)
[1] 1 3 5 7 9
seq can also generate numbers in the descending order.
> 10:1
[1] 10 9 8 7 6 5 4 3 2 1
> seq (10, 1, by=-2)
[1] 10 8 6 4 2

3.6.2 rep function


The rep function is used to place the same constant into long vectors. The syntax is
rep(z, k), which creates a vector of k*length(z) elements by repeating z k times.

Objective
Demonstrate rep function.

Act
> rep (3, 4)
[1] 3 3 3 3
Or
> x <-rep (3, 4)
> x
[1] 3 3 3 3
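rep() also accepts a vector as its first argument. As a short sketch, the named arguments ‘times’ and ‘each’ control whether the whole vector or each individual element is repeated.

```r
rep(c(1, 2), times = 3)             # whole vector repeated: 1 2 1 2 1 2
rep(c(1, 2), each = 3)              # each element repeated: 1 1 1 2 2 2
rep(c("a", "b"), times = c(2, 1))   # per-element counts: "a" "a" "b"

length(rep(c(4, 7, 8), 5))          # k * length(z) = 5 * 3 = 15
```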

3.6.3 Vector Access


Objective
Let us create a variable, ‘VariableSeq’ and assign to it a vector consisting of string values.
> VariableSeq <- c("R", "is", "a", "programming", "language")

Objective
To access values in a vector, specify the indices at which the value is present in the vector.
Indices start at 1.
> VariableSeq[1]
[1] "R"
> VariableSeq[2]
[1] "is"
> VariableSeq[3]
[1] "a"
> VariableSeq[4]
[1] "programming"
> VariableSeq[5]
[1] "language"

Objective
Assign new values in an existing vector. For example, let us assign the value ‘good
programming’ at index 4 of the existing vector, ‘VariableSeq’.
> VariableSeq[4] <- "good programming"

Outcome
> VariableSeq[4]
[1] "good programming"

Objective
To access more than one value from the vector.
(a) Access the first and the fifth element from the vector, ‘VariableSeq’.
> VariableSeq[c(1, 5)]
[1] "R" "language"
(b) Access first to the fourth element from the vector, ‘VariableSeq’.
> VariableSeq[1:4]
[1] "R" "is" "a" "good programming"
(c) Access the first, fourth and the fifth element from the vector, ‘VariableSeq’.
> VariableSeq[c(1, 4:5)]
[1] "R" "good programming" "language"
(d) Retrieve all the values from the variable, ‘VariableSeq’
> VariableSeq
[1] "R" "is" "a" "good programming"
[5] "language"

3.6.4 Vector Names


The names() function helps to assign names to the vector elements.
This is accomplished in two steps as shown:
> placeholder <- 1:5
> names(placeholder) <- c("r", "is", "a", "programming", "language")
The vector elements can then be retrieved using the indices position.
> placeholder
r is a programming language
1 2 3 4 5
> placeholder [3]
a
3
> placeholder [1]
r
1
> placeholder[4:5]
programming language
4 5
Or
> placeholder["programming"]
programming
4

Objective
Plot a bar graph using the barplot function. The barplot function uses a vector’s values
to plot a bar chart.

Act
The vector used is called BarVector.
> BarVector <- c(4, 7, 8)
> barplot(BarVector)

Outcome

Let us use the names() function to assign names to the vector elements. These names will
be used as labels in the barplot.
> names(BarVector) <- c("India", "MiddleEast", "US")
> barplot(BarVector)

3.6.5 Vector Math


Let us define a vector, ‘x’ with three values. Let us add a scalar value (single value) to
the vector. This value will get added to each vector element.

> x <- c(4, 7, 8)


> x +1
[1] 5 8 9
However, the vector will retain its individual elements.
> x
[1] 4 7 8
If the vector needs to be updated with the new values, type the statement given below.
> x <- x + 1
> x
[1] 5 8 9
We can run other arithmetic operations on the vector as given:
> x - 1
[1] 4 7 8
> x * 2
[1] 10 16 18
> x / 2
[1] 2.5 4.0 4.5
Let us practice these arithmetic operations on two vectors.
> x
[1] 5 8 9
> y <- c(1, 2, 3)
> y
[1] 1 2 3
> x + y
[1] 6 10 12
Other arithmetic operations are:
> x - y
[1] 4 6 6
> x * y
[1] 5 16 27
Check if the two vectors are equal. The comparison takes place element by element.
> x
[1] 5 8 9
> y
[1] 1 2 3
> x==y
[1] FALSE FALSE FALSE
> x < y
[1] FALSE FALSE FALSE
> sin(x)
[1] -0.9589243 0.9893582 0.4121185

3.6.6 Vector Recycling


If an operation is performed involving two vectors that requires them to be of the same
length, the shorter one is recycled, i.e. repeated until it is long enough to match the longer
one.

Objective
Add two vectors wherein one has length, 3 and the other has length, 6.
> c(1, 2, 3) + c(4, 5, 6, 7, 8, 9)
[1] 5 7 9 8 10 12

Objective
Multiply the two vectors wherein one has length, 3 and the other has length, 6.
> c(1, 2, 3) * c(4, 5, 6, 7, 8, 9)
[1] 4 10 18 7 16 27
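In the two examples above the longer length (6) is a multiple of the shorter length (3). When it is not, R still recycles the shorter vector but issues a warning. The sketch below uses two hypothetical vectors, ‘short’ and ‘long’, and reports the warning as a message so that the result is still returned.

```r
short <- c(1, 2)
long  <- c(10, 20, 30, 40, 50)

# 5 is not a multiple of 2, so R warns while recycling 'short' as 1 2 1 2 1
res <- withCallingHandlers(
  short + long,
  warning = function(w) {
    message("R warned: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
res   # 11 22 31 42 51
```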

Objective
Plot a Scatter Plot. The function to plot a scatter plot is ‘plot’. This function uses two
vectors, i.e. one for the x axis and another for the y axis. The objective is to understand the
relationship between numbers and their sines. We will use two vectors. Vector, x which
will have a sequence of values between 1 and 25 at an interval of 0.1 and vector, y which
stores the sines of all values held in vector, x.
> x <-seq(1, 25, 0.1)
> y <-sin(x)
The plot function takes the values in the vector, x and plots it on the horizontal axis. It
then takes the values in the vector, y and places it on the vertical axis (Figure 3.4).
> plot(x, y)

Figure 3.4 Scatter plot



3.7 Matrices
Matrices are nothing but two-dimensional arrays.

Objective
Let us create a matrix which is 3 rows by 4 columns and set all its elements to 1.
> matrix (1, 3, 4)
     [,1] [,2] [,3] [,4]
[1,]    1    1    1    1
[2,]    1    1    1    1
[3,]    1    1    1    1

Objective
Use a vector to create an array, 3 rows high and 3 columns wide.
Step 1: Begin by creating a vector that has elements from 10 to 90 with an interval of 10.
> a <- seq(10, 90, by = 10)
Step 2: Validate by printing the value of vector a.
> a
[1] 10 20 30 40 50 60 70 80 90
Step 3: Call the matrix function with the vector ‘a’, the number of rows and the number of
columns.
> matrix (a, 3, 3)
     [,1] [,2] [,3]
[1,]   10   40   70
[2,]   20   50   80
[3,]   30   60   90

Objective
Re-shape the vector itself into an array using the dim function.
Step 1: Begin by creating a vector that has elements from 10 to 90 with an interval of 10.
> a <- seq (10, 90, by = 10)
Step 2: Validate by printing the value of vector, a.
> a
[1] 10 20 30 40 50 60 70 80 90
Step 3: Assign new dimensions to vector, a by assigning the vector c(3, 3), i.e. 3 rows and 3
columns, to dim(a).
> dim(a) <- c(3, 3)
Step 4: Print the values of vector, a. You will notice that the values have shifted to form 3
rows by 3 columns. The vector is no longer one dimensional. It has been converted into
a two-dimensional matrix that is 3 rows high and 3 columns wide.
> a
     [,1] [,2] [,3]
[1,]   10   40   70
[2,]   20   50   80
[3,]   30   60   90
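Both matrix() and dim() fill the matrix column by column, which is why 10, 20 and 30 appear in the first column. If the values should run along the rows instead, matrix() offers the ‘byrow’ argument. The names ‘by_col’ and ‘by_row’ in this sketch are assumptions for illustration.

```r
a <- seq(10, 90, by = 10)

by_col <- matrix(a, nrow = 3, ncol = 3)                 # default: column-wise fill
by_row <- matrix(a, nrow = 3, ncol = 3, byrow = TRUE)   # row-wise fill

by_row
by_row[1, 2]                     # 20, whereas by_col[1, 2] is 40

identical(by_row, t(by_col))     # TRUE: one is the transpose of the other
dim(by_row)                      # 3 3
```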

3.7.1 Matrix Access


Objective
Access the elements of a 3 × 4 matrix.
Step 1: Create a matrix, ‘mat’, 3 rows high and 4 columns wide using a vector.
> x <- 1:12
> x
[1] 1 2 3 4 5 6 7 8 9 10 11 12
> mat <- matrix (x, 3, 4)
> mat
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
Step 2: Access the element present in the second row and third column of the matrix, ‘mat’.
> mat [2, 3]
[1] 8

Objective
Access the third row of an existing matrix.
Step 1: Let us begin by printing the values of an existing matrix, ‘mat’
> mat
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
Step 2: To access the third row of the matrix, simply provide the row number and omit
the column number.
> mat [3, ]
[1] 3 6 9 12

Objective
Access the second column of an existing matrix.
Step 1: Let us begin by printing the values of an existing matrix, ‘mat’
> mat
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

Step 2: To access the second column of the matrix, simply provide the column number
and omit the row number.
> mat[, 2]
[1] 4 5 6

Objective
Access the second and third columns of an existing matrix.
Step 1: Let us begin by printing the values of an existing matrix, ‘mat’.
> mat
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
Step 2: To access the second and third columns of the matrix, simply provide the column
numbers and omit the row number.
> mat[,2:3]
     [,1] [,2]
[1,]    4    7
[2,]    5    8
[3,]    6    9
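A few further indexing forms are often useful alongside the ones above. The sketch below recreates the same 3 × 4 matrix and shows how to keep a single column as a matrix, how to exclude a row, and how to select elements by a condition.

```r
mat <- matrix(1:12, 3, 4)

mat[, 2]                  # a plain vector: 4 5 6
mat[, 2, drop = FALSE]    # the same column kept as a 3 x 1 matrix

mat[-1, ]                 # every row except the first
mat[mat > 10]             # elements greater than 10: 11 12
```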

Objective
Create a contour plot.
Create a matrix, ‘mat’ which is 9 rows high and 9 columns wide and assign the value
‘1’ to all its elements.
> mat <- matrix(1, 9, 9)
Print all the values of the matrix, ‘mat’.
> mat
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,]    1    1    1    1    1    1    1    1    1
[2,]    1    1    1    1    1    1    1    1    1
[3,]    1    1    1    1    1    1    1    1    1
[4,]    1    1    1    1    1    1    1    1    1
[5,]    1    1    1    1    1    1    1    1    1
[6,]    1    1    1    1    1    1    1    1    1
[7,]    1    1    1    1    1    1    1    1    1
[8,]    1    1    1    1    1    1    1    1    1
[9,]    1    1    1    1    1    1    1    1    1
Assign ‘0’ as the value to the element present in the third row and third column of the
matrix, ‘mat’.
> mat[3, 3] <- 0
> mat
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,]    1    1    1    1    1    1    1    1    1
[2,]    1    1    1    1    1    1    1    1    1
[3,]    1    1    0    1    1    1    1    1    1
[4,]    1    1    1    1    1    1    1    1    1
[5,]    1    1    1    1    1    1    1    1    1
[6,]    1    1    1    1    1    1    1    1    1
[7,]    1    1    1    1    1    1    1    1    1
[8,]    1    1    1    1    1    1    1    1    1
[9,]    1    1    1    1    1    1    1    1    1
Plot the contour chart using the contour() function (Figure 3.5). The contour()
function creates a contour plot or adds contour lines to an existing plot. Look up the R
documentation for a complete description of the contour() function.
> contour(mat)

Figure 3.5 Contour plot

Objective
Create a 3D perspective plot with the persp() function (Figure 3.6). It provides a 3D
wireframe plot most commonly used to display a surface.
> persp(mat)
We can add a title to our plot with the parameter ‘main’. Similarly, ‘xlab’, ‘ylab’ and
‘zlab’ can be used to label the three axes. Coloring of the plot is done with parameter ‘col’.
Similarly, we can add shading with the parameter ‘shade’.
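As a sketch of these parameters in use, the call below recreates the 9 × 9 matrix and passes a title, axis labels, a colour and shading to persp(). The title text, label text, colour and the viewing angles ‘theta’ and ‘phi’ are arbitrary choices made for illustration.

```r
# Recreate the matrix of ones with a single dip at row 3, column 3
mat <- matrix(1, 9, 9)
mat[3, 3] <- 0

# persp() invisibly returns the viewing transformation matrix
vt <- persp(mat,
            main  = "Surface with a single dip",   # plot title
            xlab  = "Row", ylab = "Column", zlab = "Value",
            col   = "lightblue",                   # surface colour
            shade = 0.5,                           # amount of shading
            theta = 30, phi = 25)                  # viewing angles in degrees
```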

Figure 3.6 3D perspective plot

Objective
R includes some sample data sets. One of these is ‘volcano’, which is a 3D map of a
dormant New Zealand volcano. Create a contour map of the volcano dataset (Figure 3.7).
> contour(volcano)

Figure 3.7 Contour map



Let us create a 3D perspective map of the sample data set, ‘volcano’ (Figure 3.8).
> persp(volcano)

Figure 3.8 3D perspective map of the sample data set, ‘volcano’

Objective
Create a heat map of the sample dataset, ‘volcano’ (Figure 3.9).
> image(volcano)

Figure 3.9 Heat map of the sample dataset, ‘volcano’



3.8 Factors
3.8.1 Creating Factors
School ‘XYZ’ places students in groups, also called houses. Each group is assigned a
unique color such as ‘red’, ‘green’, ‘blue’ or ‘yellow’. HouseColor is a vector that stores
the house colors of a group of students.
> HouseColor <- c("red", "green", "blue", "yellow", "red", "green", "blue", "blue")
> types <- factor(HouseColor)
> HouseColor
[1] "red" "green" "blue" "yellow" "red" "green" "blue" "blue"
> print(HouseColor)
[1] "red" "green" "blue" "yellow" "red" "green" "blue" "blue"
> print (types)
[1] red green blue yellow red green blue blue
Levels: blue green red yellow
Levels denote the unique values. The factor above has four distinct values, namely ‘blue’,
‘green’, ‘red’ and ‘yellow’. By default, the levels are sorted alphabetically.
> as.integer(types)
[1] 3 2 1 4 3 2 1 1
The above output is explained as given below.
1 is the number assigned to blue.
2 is the number assigned to green.
3 is the number assigned to red.
4 is the number assigned to yellow.
> levels(types)
[1] "blue" "green" "red" "yellow"
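Two further functions are handy when exploring a factor: nlevels() counts the levels and table() counts how many elements fall into each level. A short sketch using the same house colors follows.

```r
HouseColor <- c("red", "green", "blue", "yellow", "red", "green", "blue", "blue")
types <- factor(HouseColor)

nlevels(types)   # number of distinct levels: 4
table(types)     # students per house: blue 3, green 2, red 2, yellow 1
```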
The vector ‘NoofStudents’ stores the number of students in each house/group with
12 students in blue house, 14 students in green house, 12 students in red house and 13
students in yellow house.
> NoofStudents <- c(12, 14, 12, 13)
> NoofStudents
[1] 12 14 12 13
The vector, ‘AverageScore’ stores the average score of the students of each house/
group. 70 is the average score for students of the blue house, 80 is the average score for
students of the green house, 90 is the average score for the students of the red house and
95 is the average score for the students of the yellow house.
> AverageScore <- c(70, 80, 90, 95)
> AverageScore
[1] 70 80 90 95

Objective
Plot the relationship between NoofStudents and AverageScore (Figure 3.10).
> plot(NoofStudents, AverageScore)

Figure 3.10 Relationship between "NoofStudents" and "AverageScore"

The above graph in Figure 3.10 displays 4 dots. Let us improve the graph by
using different symbols to represent each house (Figure 3.11).
> plot(NoofStudents, AverageScore, pch=as.integer(types))

Figure 3.11 Relationship between "NoofStudents" and "AverageScore" using different symbols.

To add further meaning to the graph, let us place a legend on the top right corner
(Figure 3.12).
> legend("topright", c("red", "green", "blue", "yellow"), pch=1:4)

Figure 3.12 Relationship between "NoofStudents" and "AverageScore" (with legends)
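One way to make the link between plotting symbols and houses explicit is to build a factor with one entry per house and derive both the symbols and the legend from it. The sketch below is an alternative arrangement under that assumption, not the code used for the figures. The names ‘houses’ and ‘symbols’ are hypothetical.

```r
# One entry per house, in the same order as the two data vectors
houses       <- factor(c("blue", "green", "red", "yellow"))
NoofStudents <- c(12, 14, 12, 13)
AverageScore <- c(70, 80, 90, 95)

# Integer code of each house, used as the plotting symbol
symbols <- as.integer(houses)

plot(NoofStudents, AverageScore, pch = symbols)

# The legend lists the levels in the same order as their integer codes
legend("topright", legend = levels(houses), pch = seq_along(levels(houses)))
```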

3.9 List
A list is similar to a struct in C: it can hold elements of different data types.

Objective
Create a list in R.
Create a list, ‘emp’, having three elements, ‘EmpName’, ‘EmpUnit’ and ‘EmpSal’.
> emp <- list(EmpName = "Alex", EmpUnit = "IT", EmpSal = 55000)

Outcome
To get the elements of the list, ‘emp’ use the command given below.
> emp
$EmpName
[1] "Alex"

$EmpUnit
[1] "IT"

$EmpSal
[1] 55000

Actually, the element names, e.g. ‘EmpName’, ‘EmpUnit’ and ‘EmpSal’ are optional.
We could alternatively do this as shown below.
> EmpList <- list("Alex", "IT", 55000)
> EmpList
[[1]]
[1] "Alex"

[[2]]
[1] "IT"

[[3]]
[1] 55000

Here the elements of EmpList are referred to as 1, 2 and 3.

3.9.1 List Tags and Values


A list has elements. The elements in a list can have names, which are referred to as tags.
Elements can also have values.
For example, in the ‘emp’ list we have three elements, viz. EmpName, EmpUnit and
EmpSal. The values are as follows. The element ‘EmpName’ has the value ‘Alex’, the
element ‘EmpUnit’ has the value ‘IT’ and the element ‘EmpSal’ has the value 55000.
Let us look at the command to retrieve the names and values of the elements in a list.

Objective
Retrieve the names of the elements in the list ‘emp’.
> names(emp)
[1] "EmpName" "EmpUnit" "EmpSal"

Objective
Retrieve the values of the elements in the list ‘emp’.
> unlist(emp)
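EmpName EmpUnit  EmpSal
 "Alex"    "IT" "55000"
Note that unlist() returns a vector, so the numeric salary is converted into the character
string "55000" to match the other elements.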
The command to retrieve the value of a single element in the list ‘emp’ is given below.

Objective
Retrieve the value of the element ‘EmpName’ in the list ‘emp’.
> unlist(emp["EmpName"])
EmpName
 "Alex"
The value of the other elements in the list can be checked in a similar manner.
> unlist(emp["EmpUnit"])
EmpUnit
   "IT"
> unlist(emp["EmpSal"])
EmpSal
 55000
Yet another way to retrieve the values of the elements in the list ‘emp’ is given as
follows:

Objective
Retrieve the value of the element ‘EmpName’ in the list ‘emp’.
> emp[["EmpName"]]
[1] "Alex"
Or
> emp[[1]]
[1] "Alex"
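The difference between single and double square brackets is worth checking at the console. As a brief sketch, single brackets return a smaller list, whereas double brackets and the ‘$’ operator return the element itself, which can then be used in calculations.

```r
emp <- list(EmpName = "Alex", EmpUnit = "IT", EmpSal = 55000)

class(emp["EmpSal"])     # "list": a list containing one element
class(emp[["EmpSal"]])   # "numeric": the value itself

emp$EmpSal               # same value, accessed by name with $

emp$EmpSal / 12          # the value can be used directly in arithmetic
```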

3.9.2 Add/Delete Element to or from a List


Before adding an element to the list ‘emp’, let us verify what elements exist in the list.
> emp
$EmpName
[1] "Alex"

$EmpUnit
[1] "IT"

$EmpSal
[1] 55000

Objective
Add an element with the name ‘EmpDesg’ and value ‘Software Engineer’ to the list, ‘emp’.
> emp$EmpDesg = "Software Engineer"

Outcome
> emp
$EmpName
[1] "Alex"

$EmpUnit
[1] "IT"

$EmpSal
[1] 55000

$EmpDesg
[1] "Software Engineer"

Objective
Delete an element with the name ‘EmpUnit’ and value ‘IT’ from the list, ‘emp’.
> emp$EmpUnit <- NULL

Outcome
> emp
$EmpName
[1] "Alex"

$EmpSal
[1] 55000

$EmpDesg
[1] "Software Engineer"

3.9.3 Size of a List


length() function can be used to determine the number of elements present in the list.
The list, ‘emp’ has three elements as shown:
> emp
$EmpName
[1] "Alex"

$EmpSal
[1] 55000

$EmpDesg
[1] "Software Engineer"

Objective
Determine the number of elements in the list, ‘emp’.
> length(emp)
[1] 3

Recursive List
A recursive list means a list within a list.

Objective
Create a list within a list.
Let us begin with two lists, ‘emp’ and ‘emp1’.
The elements in both the lists are as shown below.
> emp
$EmpName
[1] "Alex"

$EmpSal
[1] 55000

$EmpDesg
[1] "Software Engineer"

> emp1
$EmpUnit
[1] "IT"

$EmpCity
[1] "Los Angeles"
We would like to combine both the lists into a single list called ‘EmpList’.
> EmpList <- list(emp, emp1)

Outcome
> EmpList
[[1]]
[[1]]$EmpName
[1] "Alex"

[[1]]$EmpSal
[1] 55000

[[1]]$EmpDesg
[1] "Software Engineer"


[[2]]
[[2]]$EmpUnit
[1] "IT"

[[2]]$EmpCity
[1] "Los Angeles"
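
Elements of the inner lists are reached by chaining indices. The sketch below rebuilds ‘emp’ and ‘emp1’ from the values shown above; ‘FlatList’ is a hypothetical name added only to contrast list() with c().

```r
emp  <- list(EmpName = "Alex", EmpSal = 55000, EmpDesg = "Software Engineer")
emp1 <- list(EmpUnit = "IT", EmpCity = "Los Angeles")

# list() nests the two lists: EmpList has two elements, each itself a list.
EmpList <- list(emp, emp1)
length(EmpList)            # 2

# Chain indices to reach an inner element.
EmpList[[1]]$EmpName       # "Alex"
EmpList[[2]][["EmpCity"]]  # "Los Angeles"

# c() instead concatenates the elements into one flat list.
FlatList <- c(emp, emp1)
length(FlatList)           # 5
```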

3.10 Few Common Analytical Tasks


Reading, writing, updating and merging data are common operations in any programming
language, and all of them are used for processing data. Programming languages work with
different types of data, such as numeric, character and logical values. Analytical data
processing relies on the same general operations as building blocks for more complex
processing. In the sections that follow, you will learn about some common tasks in R that
are required during analytical data processing.

3.10.1 Exploring a Dataset


Exploring a dataset means displaying its data in different forms, such as its variable
names, its structure, a statistical summary or a subset of its rows. Datasets are the main
input of analytical data processing, so it is useful to inspect them before any further
work. With the help of R commands, analysts can easily explore a dataset in different
ways. Table 3.4 describes some functions for exploring a dataset.

Table 3.4 Functions for exploring a dataset


Function | Function Arguments | Description
names(dataset) | dataset is the name of the dataset. | Displays the names of the variables of the given dataset.
summary(dataset) | dataset is the name of the dataset. | Displays the summary of the given dataset.
str(dataset) | dataset is the name of the dataset. | Displays the structure of the given dataset.
head(dataset, n) | dataset is the name of the dataset; n is the number of top rows to display. | Displays the top n rows. If n is not provided, the function displays the top 6 rows of the dataset by default.
tail(dataset, n) | dataset is the name of the dataset; n is the number of bottom rows to display. | Displays the bottom n rows. If n is not provided, the function displays the bottom 6 rows of the dataset by default.
class(dataset) | dataset is the name of the dataset. | Displays the class of the dataset.
dim(dataset) | dataset is the name of the dataset. | Returns the dimensions of the dataset, that is, the total number of rows and columns.
table(dataset$variablename) | dataset is the name of the dataset; variablename is the name of a variable in the dataset. | Returns the number of occurrences of each distinct (categorical) value of the variable.

The following example loads a built-in dataset into the workspace. All the above commands
are executed on the dataset, ‘Orange’ (Figures 3.13–3.15).
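
For readers working without the figures, the same exploration can be sketched as a short script. ‘Orange’ ships with the ‘datasets’ package that is part of a standard R installation.

```r
data(Orange)               # built-in dataset from the 'datasets' package

names(Orange)              # "Tree" "age" "circumference"
dim(Orange)                # 35 rows and 3 columns
str(Orange)                # type and first values of each column
summary(Orange)            # summary statistics for each column
head(Orange, 3)            # first 3 rows
tail(Orange, 3)            # last 3 rows
class(Orange)              # class vector of the dataset
inherits(Orange, "data.frame")   # TRUE
table(Orange$Tree)         # number of observations per tree
```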

Figure 3.13 Exploring a dataset using names(), summary() and str() functions

Figure 3.14 Exploring a dataset using head() and tail() functions



Figure 3.15 Exploring a dataset using class(), dim() and table() functions

3.10.2 Conditional Manipulation of a Dataset


Analytical data processing often requires only specific rows and columns of a dataset.
Table 3.5 lists commands that can be used for accessing specific rows and columns of
a dataset.
Table 3.5 Commands for accessing specific rows and columns of a dataset
Command | Command Arguments | Description
Tablename[n, ] | n is a numeric value. | Displays row n of the table.
Tablename[, n] | n is a numeric value. | Displays column n of the table as a vector.
Tablename[n] | n is a numeric value. | For a data frame, displays column n as a one-column data frame. A single index selects columns, not rows.

The following example reads a table, ‘Hardware.csv’, into the object ‘TD’ in the R
workspace. The commands TD[1] and TD[, 1] both select the first column of ‘TD’, whereas TD[1, ] would select its first row (Figure 3.16).
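
Since ‘Hardware.csv’ is not reproduced in the text, the sketch below uses a small hypothetical data frame in its place (the column names Item, Price and Qty are assumptions) to show how rows, columns and conditions are combined inside the square brackets.

```r
# Hypothetical stand-in for the contents of 'Hardware.csv'.
TD <- data.frame(Item  = c("Mouse", "Keyboard", "Monitor"),
                 Price = c(500, 1200, 8500),
                 Qty   = c(10, 5, 2))

TD[2, ]                 # row 2, all columns (a one-row data frame)
TD[, 2]                 # column 2 as a plain vector
TD[2]                   # column 2 as a one-column data frame
TD[TD$Price > 1000, ]   # rows that satisfy a condition
TD[TD$Qty >= 5, c("Item", "Qty")]   # a row condition plus chosen columns
```

A data frame is indexed as [rows, columns]; leaving one side empty selects everything along that dimension.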

3.10.3 Merging Data


Merging datasets or objects is another common task in most processing activities, and
analytical data processing may also require merging two or more data objects. R
provides the merge() function for this purpose. The merge() function combines data
frames by common columns or row names and follows the database join operations.
The syntax of the merge() function is given as follows:
merge(x, y, ...) OR
merge(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all, ...)

Figure 3.16 Conditional manipulation of a dataset

where x is an object or data frame, y is an object or data frame, and the by, by.x and by.y
arguments define the common columns or rows used for merging.
The all argument takes a logical value, TRUE or FALSE. If the value is TRUE, merge() returns
the full outer join, which keeps all rows of both x and y in the result object. With the
default value FALSE, it returns only the rows that match in both x and y (an inner join).
The all.x argument takes a logical value, TRUE or FALSE. If the value is TRUE, merge() returns
the left outer join: rows of x that have no matching row in y are also kept, with NA in the
columns that come from y. If the value is FALSE, only the rows with data from both x and y
are placed in the result object.
The all.y argument takes a logical value, TRUE or FALSE. If the value is TRUE, merge() returns
the right outer join: rows of y that have no matching row in x are also kept, with NA in the
columns that come from x. If the value is FALSE, only the rows with data from both x and y
are placed in the result object.
The dots ‘...’ stand for other optional arguments.
The following example creates two data frames, ‘S’ and ‘T’. Both the data frames
are then merged into a new data frame, ‘E’ (Figure 3.17).
In the next example, the two data frames, ‘S’ and ‘T’, hold different values in the column
used for merging. The merge() function returns the merged data frames using the left and
right outer joins (Figure 3.18).
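
The four join types can be compared side by side in a minimal sketch. The data below is hypothetical, and the second data frame is named ‘Tb’ rather than ‘T’ because T is also R's shorthand for TRUE.

```r
# Hypothetical data frames that share an 'id' column.
S  <- data.frame(id = c(1, 2, 3), name  = c("Asha", "Ben", "Chen"))
Tb <- data.frame(id = c(2, 3, 4), marks = c(71, 88, 64))

inner <- merge(S, Tb, by = "id")                # ids 2, 3
left  <- merge(S, Tb, by = "id", all.x = TRUE)  # ids 1, 2, 3
right <- merge(S, Tb, by = "id", all.y = TRUE)  # ids 2, 3, 4
full  <- merge(S, Tb, by = "id", all = TRUE)    # ids 1, 2, 3, 4

left    # marks is NA for id 1, which has no match in Tb
```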

Figure 3.17 Merging data

Figure 3.18 Merging data using join condition



3.11 Aggregating and Group Processing of a Variable


Aggregate and group operations first group the rows of a dataset by the values of one
variable and then summarise another variable within each group. Like merging, aggregation
and grouping are frequently needed in analytical data processing. R provides several
functions for aggregation. The next sections describe two of them, aggregate() and tapply().

3.11.1 aggregate() Function


The aggregate() function is an inbuilt function of R that aggregates data values. The
function splits the data into groups and then applies the given statistical function to
each group. The syntax of the aggregate() function is
aggregate(x, ...) or
aggregate(x, by, FUN, ...)

where x is an object, the by argument is a list of grouping elements built from the
grouping variable of the dataset, the FUN argument is the statistical function that
computes a summary value for each group and the dots ‘...’ stand for other optional arguments.
The following example reads a table, ‘Fruit_data.csv’, into the object ‘S’. The aggregate()
function computes the mean price of each type of fruit. Here the by argument is
list(Fruit.Name = S$Fruit.Name), which groups the rows by the Fruit.Name column (Figure 3.19).
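
The call described above can be sketched as follows. ‘Fruit_data.csv’ is not reproduced in the text, so the data frame below is a hypothetical stand-in; only the column names Fruit.Name and Fruit.Price come from this chapter.

```r
# Hypothetical stand-in for 'Fruit_data.csv'.
S <- data.frame(Fruit.Name  = c("Apple", "Mango", "Apple", "Mango", "Grapes"),
                Fruit.Price = c(100, 60, 120, 80, 50))

# x is the column to summarise, by is a named list of grouping values.
avg <- aggregate(S["Fruit.Price"], by = list(Fruit.Name = S$Fruit.Name), FUN = mean)
avg
# Mean price per fruit: Apple 110, Grapes 50, Mango 70

# The formula interface expresses the same grouping as value ~ group.
avg2 <- aggregate(Fruit.Price ~ Fruit.Name, data = S, FUN = mean)
```

Both calls return a data frame with one row per group.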

Figure 3.19 Example of aggregate() function



3.11.2 tapply() Function


The tapply() function is also an inbuilt function of R and works in a manner similar
to the function aggregate(). The function splits the data values into groups and then
applies the given statistical function to each group. The syntax of the tapply() function is
tapply(X, ...) or
tapply(X, INDEX, FUN, ...)
where X is an object that defines the summary variable, the INDEX argument defines the
grouping elements (also called the grouping variable), the FUN argument is the statistical
function that computes a summary value for each group and the dots ‘...’ stand for
other optional arguments.
The following example reads the table, ‘Fruit_data.csv’, into the object ‘A’. The tapply()
function computes the sum of the price for each type of fruit. Here Fruit.Price is the summary
variable and Fruit.Name is the grouping variable. The FUN function is applied to the
summary variable, Fruit.Price (Figure 3.20).
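
A minimal sketch of the same computation is shown below, again with hypothetical data in place of ‘Fruit_data.csv’.

```r
# Hypothetical stand-in for 'Fruit_data.csv'.
A <- data.frame(Fruit.Name  = c("Apple", "Mango", "Apple", "Mango", "Grapes"),
                Fruit.Price = c(100, 60, 120, 80, 50))

# Summary variable first, grouping variable second, function third.
totals <- tapply(A$Fruit.Price, A$Fruit.Name, sum)
totals
# Total price per fruit: Apple 220, Grapes 50, Mango 140

totals[["Mango"]]   # 140
```

Unlike aggregate(), which returns a data frame, tapply() returns a named array, so individual group results can be looked up by name.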

Figure 3.20 Example of tapply() function



Check Your Understanding


1. How do you define exploring a dataset?
Ans: Exploring a dataset means displaying the data of a dataset in different forms.

2. Which function is used to display the summary of a dataset?
Ans: The summary() function is used to display the summary of a dataset.

3. What is the head() function?
Ans: The head() function is an inbuilt data exploring function that displays the top rows
according to a given value.

4. What is the tail() function?
Ans: The tail() function is an inbuilt data exploring function that displays the bottom
rows according to a given value.

5. What is the use of merge() function?
Ans: The merge() function is an inbuilt function of R. It combines data frames by common
columns or row names. It also follows the database join operations.

6. What is the use of aggregate() function?
Ans: The aggregate() function is an inbuilt function of R which splits data into groups
and applies the required statistical function to each group.

7. What is the use of tapply() function?
Ans: The tapply() function is an inbuilt function of R which splits the values of a
variable into groups and applies the required statistical function to each group.

8. List the inbuilt functions of R for manipulating text.
Ans: Some inbuilt functions of R for manipulating text are:
• substr()
• strsplit()
• paste()
• grep()

3.12 Simple Analysis Using R


In this section, you will learn how to read data from a dataset, perform a common
operation and see the output.

3.12.1 Input
Input is the first step in any processing, including analytical data processing. Here, the
input is the dataset ‘Fruit’. To read the dataset into R, use the read.table() or read.csv()
function. In Figure 3.21, the dataset ‘Fruit’ is being read into the R workspace using the
read.csv() function.

Check Your Understanding


1. Write the names of the functions used for reading datasets or tables into the R
workspace.
Ans: Functions used for reading datasets or tables into the R workspace are:
• read.csv()
• read.table()

2. List the inbuilt functions used for describing a dataset.
Ans: Some inbuilt functions used for describing a dataset are:
• names()
• str()
• summary()
• head()
• tail()

3. List the functions of R for describing variables.
Ans: Functions for describing variables are:
• table()
• summary(tablename$variablename)
• paste()
• grep()
• hist()
• plot()

3.13 Methods for Reading Data


R supports many data formats. With the help of the import and export functions of R and
its packages, data in most common formats can be read into R and written out of it. In this
section, you will learn about the different methods used for reading data.

3.13.1 CSV and Spreadsheets


Comma separated value (CSV) files and spreadsheets are commonly used for storing small
datasets. R has inbuilt functions for reading CSV files, and packages provide functions for
reading spreadsheets.

Reading CSV Files


A CSV file uses the .csv extension and stores tabular data as plain text.
The following function reads data from a CSV file:
read.csv("filename")
where,
filename is the name of the CSV file that needs to be imported.

The read.table() function can also read data from CSV files. The syntax of the
function is
read.table("filename", header = TRUE, sep = ",", ...)
where,
the filename argument defines the path of the file to be read, the header argument takes
the logical value TRUE or FALSE to state whether the file has header names on the
first line, the sep argument defines the character used for separating the columns of
the file and the dots ‘...’ stand for other optional arguments.
The following example reads a CSV file, ‘Hardware.csv’, using the read.csv() and
read.table() functions (Figure 3.27).
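
The equivalence of the two functions can be checked with a small self-contained sketch. It writes a hypothetical three-line CSV to a temporary file (the contents are assumptions, not those of ‘Hardware.csv’) and reads it back both ways.

```r
# Write a small hypothetical CSV to a temporary file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Item,Price,Qty",
             "Mouse,500,10",
             "Keyboard,1200,5"), csv_path)

# read.csv() assumes a header row and a comma separator.
d1 <- read.csv(csv_path)

# read.table() needs both to be stated explicitly.
d2 <- read.table(csv_path, header = TRUE, sep = ",")

all.equal(d1, d2)   # TRUE for this simple file
str(d1)             # column types inferred from the file
```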

Figure 3.27 Reading CSV file

Reading Spreadsheets
A spreadsheet is a table that stores data in rows and columns. Many applications are
available for creating spreadsheets, of which Microsoft Excel is the most popular.
An Excel file uses the .xlsx extension and stores data in a spreadsheet.
In R, different packages such as gdata and xlsx provide functions
for reading Excel files. A package must be loaded before any of its functions can be
used. The read.xlsx() function of the ‘xlsx’ package reads
Excel files. The syntax of the read.xlsx() function is
read.xlsx("filename", ...)

where,
the filename argument defines the path of the file to be read and the dots ‘...’ stand for
other arguments, such as sheetIndex or sheetName, which select the worksheet to read.
In R, reading or writing (importing and exporting) data using packages may create some
problems, such as incompatible versions or additional packages that are not loaded. To
avoid these problems, it is often simpler to convert the file into a CSV file and then
read the converted file using the read.csv() function.
The following example illustrates creation of an Excel file, ‘Softdrink.xlsx’. The
‘Software.csv’ file is the converted form of the ‘Softdrink.xlsx’ file (Figure 3.28). The
function read.csv() reads this file into R (Figure 3.29).

Figure 3.28 Spreadsheet of Excel file

Figure 3.29 Reading a converted CSV file



Example: Reading the .csv file


The objective is to read the data from a .csv file (D:\SampleSuperstore.csv) into a data
frame, group the data by ‘Category’ and aggregate the ‘Sales’ column using the ‘sum’
function.
Step 1: The data is stored in ‘D:\SampleSuperstore.csv’. It is available under the following
columns:
Row ID, Order ID, Order Date, Ship Date, Ship Mode, Customer ID, Customer Name,
Segment, Country, State, City, Postal Code, Region, Product ID, Category, Sub-Category,
Product Name, Sales, Quantity, Discount, Price.
A subset of the data is shown in Figure 3.30.
With the use of the read.csv() function, the data is read from the ‘D:\SampleSuperstore.csv’
file and stored in the data frame named ‘InputData’.
> InputData <- read.csv("d:/SampleSuperstore.csv")
Step 2: Data is grouped and aggregated on InputData$Sales by InputData$Category. The
aggregation function used is ‘sum’. InputData$Sales refers to the ‘Sales’ column of the
data frame, ‘InputData’. Similarly, InputData$Category refers to the ‘Category’ column
of the data frame, ‘InputData’.
> GroupedInputData <- aggregate(InputData$Sales ~ InputData$Category, InputData, sum)
Display the aggregated data. As evident from the display below, the data is available
in three categories, viz. ‘Furniture’, ‘Office Supplies’ and ‘Technology’.
> GroupedInputData
  InputData$Category InputData$Sales
1          Furniture        156514.4
2    Office Supplies        132600.8
3         Technology        168638.0
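
When the data argument is supplied, the formula can name the columns directly, which also gives shorter column headings in the result. The sketch below illustrates this with a few hypothetical rows in place of ‘SampleSuperstore.csv’; the sales figures are made up for illustration.

```r
# Hypothetical rows standing in for 'SampleSuperstore.csv'.
InputData <- data.frame(
  Category = c("Furniture", "Technology", "Furniture", "Office Supplies"),
  Sales    = c(261.96, 907.15, 731.94, 14.62)
)

# Columns are looked up in 'data', so the InputData$ prefix is not needed.
GroupedInputData <- aggregate(Sales ~ Category, data = InputData, FUN = sum)
GroupedInputData
# The result columns are named 'Category' and 'Sales'.
```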

3.13.2 Reading Data from Packages


A package is a collection of functions and datasets. In R, many packages are available for
doing different types of operations (Figure 2.4). Some functions for loading packages and
reading the datasets they contain are explained next.

library() Function
The library() function loads packages into the R workspace. A package must be loaded
before the datasets it provides can be read. The syntax of the
library() function is:
library(packagename)
where,
the packagename argument is the name of the package to be loaded.
Figure 3.30 Subset of the data from “SampleSuperstore.xls”

data() Function
The data() function, when called without arguments, lists all the available datasets of
the loaded packages. To load a particular dataset into the R workspace, users need to pass
the name of that dataset to the data() function. The syntax of the data() function is:
data(datasetname)
where,
the datasetname argument is the name of the dataset to be read.
The following example illustrates the loading of a dataset. The data() function lists
all the available datasets of the loaded packages. The ‘> Orange’ command then
displays the content of the dataset ‘Orange’ (Figure 3.31).
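
The two uses of data() can be sketched as follows. The ‘datasets’ package is attached by default in a standard R session; the variable name listing is illustrative only.

```r
library(datasets)        # attached by default; shown for completeness

# data(package = ...) returns a listing of the datasets a package provides.
listing <- data(package = "datasets")
head(listing$results[, "Item"])   # first few dataset names

# data(name) loads that dataset into the workspace.
data(Orange)
head(Orange)             # first 6 rows of the dataset
```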

Figure 3.31 Reading data from packages

3.13.3 Reading Data from Web/APIs


Nowadays most business organisations use the Internet and cloud services for
storing data. Such online datasets are directly accessible through packages and application
programming interfaces (APIs). Different packages are available in R for reading
online datasets. Table 3.6 lists some of these packages.

Table 3.6 Packages for reading web data


Package | Description | Download Link
RCurl | Permits download of files from a web server and posting of forms. | https://cran.r-project.org/web/packages/RCurl/index.html
Google Prediction API | Allows uploading of data to Google Storage and then training on it with the Google Prediction API. | http://code.google.com/p/r-google-predictionapi-v121
Infochimps | Provides functions for accessing the Infochimps API. | http://api.infochimps.com
HttpRequest | Reads web data with the help of the HTTP request protocol and implements the GET and POST requests. | https://cran.r-project.org/web/packages/httpRequest/index.html
WDI | Reads World Bank data. | http://cran.r-project.org/web/packages/WDI/index.html
XML | Reads and creates XML and HTML documents with the help of the HTTP or FTP protocol. | http://cran.r-project.org/web/packages/XML/index.html
Quantmod | Reads finance data from Yahoo Finance. | http://cran.r-project.org/web/packages/quantmod/index.html
ScrapeR | Reads online data. | http://cran.r-project.org/web/packages/scrapeR/index.html

The following example illustrates web scraping, that is, extracting data from a
webpage of a website. Here the package ‘RCurl’ is used for web scraping (Figure 3.32).
First, the package ‘RCurl’ is loaded into the workspace, and then the getURL() function of
‘RCurl’ fetches the required webpage. The htmlTreeParse() function, which belongs to the
‘XML’ package, then parses the content of the webpage.

3.13.4 Reading a JSON (JavaScript Object Notation) Document


Step 1: Install the rjson package.
> install.packages("rjson")
Installing package into ‘C:/Users/seema_acharya/Documents/R/win-library/3.2’
(as ‘lib’ is unspecified)
trying URL ‘https://cran.hafro.is/bin/windows/contrib/3.2/rjson_0.2.15.zip’
Content type ‘application/zip’ length 493614 bytes (482 KB)
downloaded 482 KB

package ‘rjson’ successfully unpacked and MD5 sums checked



Figure 3.32 Reading web data using the ‘RCurl’ package



Step 2: Input data.


Store the data given below in a text file (‘D:/Jsondoc.json’). Ensure that the file is saved
with the extension .json.
{
"EMPID": ["1001","2001","3001","4001","5001","6001","7001","8001"],
"Name": ["Ricky","Danny","Mitchelle","Ryan","Gerry","Nonita","Simon","Gallop"],
"Dept": ["IT","Operations","IT","HR","Finance","IT","Operations","Finance"]
}
This JSON document begins with an opening curly brace ({) and ends with a closing curly
brace (}). It is a set of key:value pairs, each separated from the next by a comma (‘,’).
Keys and string values must be enclosed in double quotes.
Step 3: Load the rjson package and read the JSON file, ‘d:/Jsondoc.json’.
> library("rjson")
> output <- fromJSON(file = "d:/Jsondoc.json")
> output
$EMPID
[1] "1001" "2001" "3001" "4001" "5001" "6001" "7001" "8001"

$Name
[1] "Ricky"     "Danny"     "Mitchelle" "Ryan"      "Gerry"     "Nonita"
[7] "Simon"     "Gallop"

$Dept
[1] "IT"         "Operations" "IT"         "HR"         "Finance"
[6] "IT"         "Operations" "Finance"

Step 4: Convert JSON to a data frame.


> JSONDataFrame <- as.data.frame(output)

Display the content of the data frame, ‘JSONDataFrame’.


> JSONDataFrame
  EMPID      Name       Dept
1  1001     Ricky         IT
2  2001     Danny Operations
3  3001 Mitchelle         IT
4  4001      Ryan         HR
5  5001     Gerry    Finance
6  6001    Nonita         IT
7  7001     Simon Operations
8  8001    Gallop    Finance
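
The same steps also work on JSON held in a string, which is convenient for quick experiments. The sketch below assumes the ‘rjson’ package is installed and uses a shortened, hypothetical version of the document above; json_text, parsed and df are illustrative names.

```r
# Assumes the 'rjson' package is installed (install.packages("rjson")).
library(rjson)

# A shortened version of the document above, held in a string.
json_text <- '{
  "EMPID": ["1001", "2001", "3001"],
  "Name":  ["Ricky", "Danny", "Mitchelle"],
  "Dept":  ["IT", "Operations", "IT"]
}'

# fromJSON() accepts either a string (json_str) or a file path (file).
parsed <- fromJSON(json_str = json_text)
df <- as.data.frame(parsed, stringsAsFactors = FALSE)

df[df$Dept == "IT", ]   # rows for the IT department
```

Once the data is in a data frame, the row and column selection techniques of Section 3.10.2 apply unchanged.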

3.13.5 Reading an XML File


Step 1: Install the XML package.
> install.packages("XML")
Installing package into ‘C:/Users/seema_acharya/Documents/R/win-library/3.2’
(as ‘lib’ is unspecified)
trying URL ‘https://cran.hafro.is/bin/windows/contrib/3.2/XML_3.98-1.3.zip’
Content type ‘application/zip’ length 4299803 bytes (4.1 MB)
downloaded 4.1 MB

package ‘XML’ successfully unpacked and MD5 sums checked


Step 2: Input data.
Store the data below in a text file (XMLFile.xml in the D: drive). Ensure that the file is
saved with an extension of .xml.
<RECORDS>
<EMPLOYEE>
<EMPID>1001</EMPID>
<EMPNAME>Merrilyn</EMPNAME>
<SKILLS>MongoDB</SKILLS>
<DEPT>Computer Science</DEPT>
</EMPLOYEE>

<EMPLOYEE>
<EMPID>1002</EMPID>
<EMPNAME>Ramya</EMPNAME>
<SKILLS>People Management</SKILLS>
<DEPT>Human Resources</DEPT>
</EMPLOYEE>

<EMPLOYEE>
<EMPID>1003</EMPID>
<EMPNAME>Fedora</EMPNAME>
<SKILLS>Recruitment</SKILLS>
<DEPT>Human Resources</DEPT>
</EMPLOYEE>
</RECORDS>

Reading an XML File


The XML file is read in R using the function xmlParse(). The result is a parsed XML document whose nodes can be accessed by index, much like the elements of a list.

Step 1: Begin by loading the required packages and parsing the file.


> library("XML")
Warning message:
package ‘XML’ was built under R version 3.2.3
> library("methods")
> output <- xmlParse(file = "d:/XMLFile.xml")

> print(output)
<?xml version="1.0"?>
<RECORDS>
  <EMPLOYEE>
    <EMPID>1001</EMPID>
    <EMPNAME>Merrilyn</EMPNAME>
    <SKILLS>MongoDB</SKILLS>
    <DEPT>Computer Science</DEPT>
  </EMPLOYEE>
  <EMPLOYEE>
    <EMPID>1002</EMPID>
    <EMPNAME>Ramya</EMPNAME>
    <SKILLS>People Management</SKILLS>
    <DEPT>Human Resources</DEPT>
  </EMPLOYEE>
  <EMPLOYEE>
    <EMPID>1003</EMPID>
    <EMPNAME>Fedora</EMPNAME>
    <SKILLS>Recruitment</SKILLS>
    <DEPT>Human Resources</DEPT>
  </EMPLOYEE>
</RECORDS>
Step 2: Extract the root node from the XML file.
> rootnode <- xmlRoot(output)

Find the number of nodes in the root.


> rootsize <- xmlSize(rootnode)
> rootsize
[1] 3

Let us display the details of the first node.


> print(rootnode[1])
$EMPLOYEE
<EMPLOYEE>
  <EMPID>1001</EMPID>
  <EMPNAME>Merrilyn</EMPNAME>
  <SKILLS>MongoDB</SKILLS>
  <DEPT>Computer Science</DEPT>
</EMPLOYEE>

attr(,"class")
[1] "XMLInternalNodeList" "XMLNodeList"
Let us display the details of the first element of the first node.
> print(rootnode[[1]][[1]])
<EMPID>1001</EMPID>
Let us display the details of the third element of the first node.
> print(rootnode[[1]][[3]])
<SKILLS>MongoDB</SKILLS>
Next, display the details of the third element of the second node.
> print(rootnode[[2]][[3]])
<SKILLS>People Management</SKILLS>
We can also display the value of the second element of the first node.
> output <- xmlValue(rootnode[[1]][[2]])
> output
[1] "Merrilyn"
Step 3: Convert the input XML file to a data frame using the xmlToDataFrame() function.
> xmldataframe <- xmlToDataFrame("d:/XMLFile.xml")

Display the output of the data frame.


> xmldataframe
  EMPID  EMPNAME            SKILLS             DEPT
1  1001 Merrilyn           MongoDB Computer Science
2  1002    Ramya People Management  Human Resources
3  1003   Fedora       Recruitment  Human Resources
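
Indexing node by node becomes tedious for larger files. As a sketch of an alternative, the ‘XML’ package also supports XPath queries through xpathSApply(). The example assumes the ‘XML’ package is installed and parses a shortened, hypothetical version of the file from a string; xml_text, doc, emp_names and df are illustrative names.

```r
# Assumes the 'XML' package is installed (install.packages("XML")).
library(XML)

# A shortened version of the file above, held in a string.
xml_text <- "<RECORDS>
  <EMPLOYEE><EMPID>1001</EMPID><EMPNAME>Merrilyn</EMPNAME><DEPT>Computer Science</DEPT></EMPLOYEE>
  <EMPLOYEE><EMPID>1002</EMPID><EMPNAME>Ramya</EMPNAME><DEPT>Human Resources</DEPT></EMPLOYEE>
</RECORDS>"

# asText = TRUE tells xmlParse() that the argument is XML content, not a path.
doc <- xmlParse(xml_text, asText = TRUE)

# An XPath query pulls every EMPNAME value in one call.
emp_names <- xpathSApply(doc, "//EMPNAME", xmlValue)

# xmlToDataFrame() also accepts an already parsed document.
df <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
```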

Check Your Understanding


1. What is a CSV file?
Ans: A CSV file uses the .csv extension and stores tabular data as plain text.

2. What is the use of read.csv() function?
Ans: The read.csv() function reads data from CSV files.

3. What is the use of read.table() function?
Ans: The read.table() function reads data from text files or CSV files.

4. What is the use of read.xlsx() function?
Ans: The read.xlsx() function of the ‘xlsx’ package reads Excel files.

5. What is a package?
Ans: A package is a collection of functions and datasets. In R, many packages are available
for doing different types of operations.

6. What is the use of the library() function?
Ans: The library() function loads packages into the R workspace. A package must be
loaded before the datasets it provides can be read.

7. What is the use of data() function?
Ans: The data() function lists all the available datasets of the loaded packages and, when
given the name of a dataset, loads that dataset into the R workspace.

8. List five R packages for accessing web data.
Ans: Different packages are available in R for reading from an online dataset. These are:
• RCurl
• Google Prediction API
• WDI
• XML
• ScrapeR

9. What is web scraping?
Ans: Web scraping extracts data from a web page of a website.

3.14 Comparison of R GUIs for Data Input


R is mainly used for statistical analysis of data. Such analysis often involves large
datasets stored in tabular form, and it can be difficult to carry out all the required
operations by typing R functions at the R console. Graphical user interfaces (GUIs) have
been developed for R to overcome this problem.
A graphical user interface is a graphical medium through which users interact with the
language or perform operations. Different GUIs are available for data input in R, and each
GUI has its own features. Table 3.7 describes some of the most popular R GUIs.
Table 3.7 Some popular R GUIs
GUI Name | Description | Download Weblink
RCommander (Rcmdr) | RCommander was developed by John Fox and is licensed under the GNU public license. It comes with many plug-ins and has a very simple interface. Users can install it from within R like any other package. | http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/ or https://cran.r-project.org/web/packages/Rcmdr/index.html
Rattle | Dr. Graham Williams developed the Rattle GUI package, which is written in R. Data mining is the main application area of Rattle. It offers statistical analysis, validation, testing and other operations. | http://rattle.togaware.com/ or http://rattle.togaware.com/rattle-install-mswindows.html
RKWard | The RKWard community developed the RKWard package. It provides a transparent front end and supports different features for doing analytical operations in R. It supports different platforms, such as Windows, Linux, BSD and OS X. | https://rkward.kde.org/ or http://download.kde.org/stable/rkward/0.6.5/win32/install_rkward_0.6.5.exe
JGR (Java GUI for R) | Markus Helbig, Simon Urbanek and Ian Fellows developed JGR. JGR is a universal, cross-platform GUI for R. Users can use it as a replacement for the default R GUI on Windows. | http://www.rforge.net/JGR/ or https://cran.r-project.org/web/packages/JGR/
Deducer | Deducer is another simple GUI that has a menu system for common data operations, analytical processing and other operations. It is mainly designed for use with the Java-based R console (JGR). | http://www.deducer.org/pmwiki/pmwiki.php?n=Main.DeducerManual or http://www.deducer.org/pmwiki/index.php?n=Main.DownloadingAndInstallingDeducer

Figure 3.33 shows the official screenshot of the RCommander (Rcmdr) GUI that is
available in R.
