
STATISTICS 2A

PRACTICAL LECTURE NOTES

DEPARTMENT OF STATISTICS
UNIVERSITY OF JOHANNESBURG

Copyright © 2016, University of Johannesburg, South Africa


Printed and published by the University of Johannesburg
© All rights reserved. Apart from any fair dealing for the purpose of research, criticism or review as permitted under the Copyright Act 98 of 1978, no part of this material
may be reproduced, stored in a retrieval system, transmitted or used in any form or be published, redistributed or screened by any means electronic, photocopying,
recording or otherwise without the prior written permission of the University of Johannesburg.
Contents
1. Introduction to R
1.1 R and RStudio
1.2 Structures in R
1.3 Programming in R
2. Data and Built-in Functions
2.1 Basic Built-In Functions
2.2 Data
2.3 Matrices
2.4 Data Frames
3. Tables, Graphs and Basic Statistical Analysis
3.1 Using the R Script Editor
3.2 Datasets
3.3 The Working Directory
3.4 Importing Text Data
3.5 Frequency Tables and Contingency Tables
3.6 Graphs
3.7 Basic Statistical Analysis
4. Random Sampling, Numbers and Variables
4.1 Random Sampling
4.2 Random Number Generation
4.3 Probability Calculation
5. Writing Functions in R
5.1 Functions in R
5.2 Optimising a Function in R
5.3 The For Loop in R
5.4 The If-Else Statement in R
6. Expectation and Simulation
6.1 Expected Value and Variance
6.2 Joint Distributions
6.3 Simulation
1. Introduction to R
1.1 R and RStudio
1.1.1 What is R?
R is a free and open-source system with numerous (free) online tutorials and user manuals. It is increasingly
common to see people who develop new methodology simultaneously producing an R package. With R you can
write functions, do calculations, perform statistical analysis, create a variety of different graphs, and even write
your own code or library. It is essentially an integrated suite of software facilities for data manipulation,
calculation and graphical display.

The R system consists of two major parts: the base system and a collection of user-contributed add-on packages
(or libraries). A package is a collection of functions, examples and documentation. The R language is
implemented in the base system.

R is an interpreted programming language, not a compiled programming language, in other words, all commands
that are typed in the program are directly executed without requiring you to build a complete program like in most
computer languages. The R code or syntax is very simple and intuitive, once you are familiar with the structure
of the language, and how R defines objects and data types.

R is a command line driven program. The user enters commands at the command line prompt (> by default) and
each command is executed one at a time. It is also possible, and more practical, to script the commands in the R
Editor, which can be saved to retain all the code written for a specific purpose. Note that R coding is case sensitive,
e.g., “A” is different from “a”.
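For example, the built-in square root function is sqrt(); typing the name with a capital letter produces an error because no function called Sqrt exists:

> sqrt(2) #correct lower-case name
[1] 1.414214
> Sqrt(2) #wrong capitalisation
Error in Sqrt(2) : could not find function "Sqrt"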

RStudio is the most popular R code editor. RStudio is an integrated development environment (IDE) for R. It
includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting,
history, debugging and workspace management. RStudio interfaces with R for Windows, MacOS, and Linux
platforms. RStudio is more user-friendly than R itself. RStudio will be used for all practical sessions in the
module.

1.1.2 Where to get R and RStudio
To install the base package for R for Windows, go to www.r-project.org and click on “Download R for Windows”.
Note that R is available for Linux and MacOS systems as well.

Click on “Install R for the first time”

Click on the link for the latest version of R to download the program file.

Thereafter, click on the downloaded program and follow the prompts to install the software.
To install RStudio, go to https://posit.co/downloads/, click on “download RStudio”. This will take you to the
following page to install the software. Note, you can also install R from this page.

1.1.3 Starting RStudio


To run RStudio, click on the RStudio icon, either on the desktop, from the list of programs in the Start Menu
or from the taskbar (once it is pinned to the taskbar). The RStudio program consists of different panes. The default
includes the source, console, workspace and plots (which includes files, packages and help). It is possible to
customize the panes you wish to see and the placement of the panes. On the left-hand side you see the Console
with information about the version number of R, the date of your version, a disclaimer stating that there is no
warranty for this free software, and additional information about the licence, contributors and the help function.

The workspace is your current R working environment and includes any user-defined objects (vectors, matrices,
data frames, lists, functions). At the end of an R session, you can save an image of the current workspace that is
automatically reloaded the next time R is started. In the console you will see a blank line with a “>” symbol in
the left-hand margin. This is called the command line prompt and is R’s way of saying ‘What now?’. This is
where you type in your commands. Later in the module I will show you how we use the R script file.

When working, you may see a “+” at the left-hand side of the screen instead of “>”. This means that the last
command you typed is incomplete. This typically happens if the command given is incomplete, for example you
forgot one or more brackets. If you can see what is missing such as a final right-hand bracket, just type the missing
character and press enter, at which point the command will execute. If you have made a mistake, you can also
just press the Esc key and the command line prompt “>” will reappear. You can then re-type the code.

1.2 Structures in R
1.2.1 Data types
In programming we use variables to store information. In R the variable is an object. An object is a data structure
with attributes and methods which are applied to its attributes. Variables can be divided into two types, namely
numerical and character. Character variables are represented as factors and are divided into two types, namely nominal variables (possible values are unordered, e.g., provinces of SA) and ordinal variables (possible values are ordered, e.g., satisfaction level of a user from extremely poor to extremely good). Numeric variables are either interval or
ratio. It is important to understand the difference between the different data types since the possible analyses we
can perform on a variable will depend on the nature of the variable. The data types in R are as follows:

Double
Doubles are used to represent continuous numerical variables, such as the time taken by all the athletes to complete
a single lap in a race, height of adult females in South Africa, etc.

Integer
Integers are natural numbers. They can be used to represent counting variables, for example the number of items
in the trolley at Pick ‘n Pay.

Complex
An object of type complex is used to represent complex numbers.

Logical
An object of data type logical can have the value TRUE or FALSE and is used to indicate if a condition is true or
false. Such objects are usually the result of logical expressions.

Character
A character object is represented by letters, numbers or symbols between double quotes (" ").

Factor
The factor data type is used to represent categorical data. The possible values of a factor variable are referred to
as levels. If a variable has more than two levels, it is called polychotomous.
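As a brief illustration (the values below are arbitrary), the class() function reports the data type of an object:

> class(3.14) #double, reported as "numeric"
[1] "numeric"
> class(7L) #the L suffix creates an integer
[1] "integer"
> class(2+3i) #complex number
[1] "complex"
> class(10 > 3) #logical value resulting from a comparison
[1] "logical"
> class("Gauteng") #character string between double quotes
[1] "character"
> class(factor(c("Poor","Good","Good"))) #factor with levels "Good" and "Poor"
[1] "factor"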

1.2.2 Objects in R
R has a wide variety of objects including vectors, matrices, arrays, data frames, tables and lists.

Vectors
The simplest object in R is the vector. A vector is an object that consists of a number of elements of the same
type.

Matrices
A matrix can be regarded as a generalization of a vector. As with vectors, all the elements of a matrix must be of
the same data type.

Arrays
Arrays are generalizations of vectors and matrices. A vector is a one-dimensional array, and a matrix is a two-
dimensional array. As with vectors and matrices, all the elements of an array must be of the same data type.

Data Frames
Data frames can also be regarded as an extension of matrices, but data frames can have columns of different data
types and are the most convenient data structure for data analysis in R. Most statistical modelling routines in R
require a data frame as input. A data frame can have the attribute names and row names. The attribute names
contain the column names of the data frame, and the row names contain the row names of the data frame. Data frames are always rectangular, so every column must contain the same number of elements.

Tables
Another common way to store information is in a table, which is essentially a crosstabulation format.

Lists
A list is like a vector. However, an element of a list can be an object of any type and structure. Consequently, a
list can contain another list and therefore it can be used to construct arbitrary data structures.
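The following sketch (with made-up names and values) shows how a list can combine objects of different types and structures, which is what distinguishes it from a vector:

> my.list = list(scores = c(66, 40, 70), #numeric vector
+                title = "Statistics 2A", #character string
+                counts = matrix(1:4, nrow = 2)) #2 x 2 matrix
> my.list$title #extract a single element by name using $
[1] "Statistics 2A"
> length(my.list) #the list contains three elements
[1] 3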

1.2.3 Operators in R
R uses a wide range of operators to structure code (arithmetic, logical, etc.) that are typically used in most
programming languages. These operators work on scalars, vectors and matrices. The following table shows a list
of the most commonly used operators in basic coding.

Type Operator Function


Arithmetic + Addition
- Subtraction
* Multiplication
/ Division
^ Exponentiation
Logical < Less than
<= Less than or equal to
> Greater than
>= Greater than or equal to
== Equals to
!= Not equal to
! NOT
& AND
| OR
Other <- Assignment
= Assignment
: Sequence
~ Statistical model formulation
$ List indexing
[] Vector and matrix indexing
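A few of these operators are illustrated below; the comparison operators return logical values, which can be combined with & (AND), | (OR) and ! (NOT):

> 7 > 3 #greater than
[1] TRUE
> 7 == 3 #equals to
[1] FALSE
> 7 != 3 #not equal to
[1] TRUE
> (7 > 3) & (2 > 5) #AND: both conditions must be TRUE
[1] FALSE
> (7 > 3) | (2 > 5) #OR: at least one condition must be TRUE
[1] TRUE
> !(7 > 3) #NOT: negates the condition
[1] FALSE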

1.2.4 Getting help


As you work in R you will often need more information about a function, specifically about the different
arguments that are built into the function. You can get online help through the help function. For a specific
function, say assign, you can type help(assign) or ?assign in the command line and press the enter key. This will
open the R documentation for the specified function, list the description, usage, arguments and examples. RStudio
has easy access to the help files.

1.3 Programming in R
1.3.1 Basic arithmetic
In the command line you can add (+), subtract (-), multiply (*), divide (/), or apply any of a number of different arithmetic functions, thereby using R as a calculator. Without the use of brackets, multiplication and division are done before addition and subtraction, and powers (such as squares or cube roots), denoted with "^", are done before multiplication and division. For very large or very small numbers R uses exponential (scientific) notation:
• 1.2e3 = 1200 because the e3 means ‘move the decimal point 3 places to the right’
• 1.2e-2 = 0.012 because the e-2 means ‘move the decimal point 2 places to the left’

When coding in R, the hash character (#) is used to make user comments, as shown in the first example below.

The command line can be used as a calculator, for example:

> log(42/7.3) #natural log, namely ln


[1] 1.749795

> 7+3-5*2
[1] 0

> 3^2 / 2
[1] 4.5

If an expression is incomplete, the prompt changes from “>” to “+”, and you must complete the expression on the
following line, for example, summing the numbers 5, 6, 3, and 2:
> 5+6+
+ 3+2
[1] 16

Two or more expressions can be placed on a single line as long as they are separated by semicolons, for example:
> 2 + 3; 5*7; 3 - 7
[1] 5
[1] 35
[1] -4

In general, integer arithmetic will be exact between −10^16 and 10^16, but for fractions and other real numbers we
lose accuracy because of round-off error. You need to be careful in programming when you want to test whether
or not two computed numbers are equal. R will assume that you mean 'exactly equal', and what that means
depends upon machine precision. Most numbers are rounded to an accuracy of 53 binary digits. Therefore, two
floating point numbers will not reliably be equal unless they were computed by the same algorithm, and not
always even then. You can see this by squaring the square root of 2. Surely these values are the same?
> sqrt(2)* sqrt(2)==2
[1] FALSE

In fact, they are not the same. We can see by how much the two values differ by subtraction:
> 2 - sqrt(2)* sqrt(2)
[1] -4.440892e-16
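If you do need to test whether two computed values are equal up to normal rounding error, the base R function all.equal() compares numbers within a small tolerance, for example:

> sqrt(2)*sqrt(2) == 2 #exact comparison fails
[1] FALSE
> isTRUE(all.equal(sqrt(2)*sqrt(2), 2)) #comparison within a small numerical tolerance
[1] TRUE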

1.3.2 Rounding
Various sorts of rounding (rounding up, rounding down, rounding to the nearest integer) can be done in R. The
following tables shows the most commonly used functions.

Function Description
floor(x) Largest integer not greater than x (rounds down)
ceiling(x) Smallest integer not less than x (rounds up)
trunc(x) Truncate x towards 0
round(x, digits = 0) Round x to the specified number of decimals (digits)

> floor(5.7)
[1] 5
> ceiling(5.7)
[1] 6
> trunc(5.7)
[1] 5
> trunc(-5.7)
[1] -5
> round(5.7845,3)
[1] 5.785
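Note also that the digits argument of round() may be negative to round to tens, hundreds, and so on, and the related base function signif() rounds to a given number of significant digits:

> round(5784.5, -2) #round to the nearest hundred
[1] 5800
> signif(5.7845, 3) #keep 3 significant digits
[1] 5.78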

1.3.3 Assignment and vectors
R operates on named data structures or objects. To assign data to an object, you first decide on a name of the
object. This name cannot be the name of an object that already exists within R, either as part of the R system or
if you have already created an object with that name. You can check this by just typing the object name you want
to use in the command line and pressing enter. If the name already exists, the content of the object will be shown.
If the name does not exist, an error message will be shown indicating that the object does not exist. Object names
are case sensitive, cannot include spaces or special characters and must start with a letter. We can use the full
stop or an underscore to create object names (not commas), but those cannot be the first character in the name.

Once you have chosen a name, you assign data to that name using the less than sign (<) followed by a dash or
minus sign (-), namely “<-”. The equal sign “=” can also be used. For example, assigning the value 10 to x:
> x <- 10 #assign the value of 10 to the object x using <-
> x = 10 #OR, assign the value of 10 to the object x using =
> x #call object x
[1] 10 #data of object x displayed/printed in the console

The simplest data structure is the numeric vector, which is a single entity consisting of an ordered collection of
numbers. To set up a vector named x consisting of n numbers, we use the combine function c() or the assignment
function assign().

For example, suppose we recorded the number of typos per page on a document consisting of eight pages, and
found the following values: 2 3 0 3 1 0 0 1

To enter this vector of values into R we assign a name to the vector, say, typos, and then enter the values using
the c() function as follows:
> typos = c(2,3,0,3,1,0,0,1)
> typos
[1] 2 3 0 3 1 0 0 1

The assign() function gives the same result as assignment with <-, but it takes two arguments inside the brackets: the name of the object (in quotes) and the vector of values:
> assign("typos", c(2,3,0,3,1,0,0,1))
> typos
[1] 2 3 0 3 1 0 0 1

The data are now stored in R as a vector in the order that we entered the data, i.e., the first element, the second
element, up to a last element. We can access specific elements within the vector by simply indexing the position
where the element is stored. Since vectors are also mathematical objects, we can apply mathematical concepts
such as addition, multiplication, etc. to the elements of the vector.
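For example, arithmetic on vectors is applied element by element, and many built-in functions use all the elements of a vector at once:

> c(1,2,3) + c(10,20,30) #element-wise addition of two vectors
[1] 11 22 33
> c(1,2,3) * 2 #each element is multiplied by 2
[1] 2 4 6
> sum(c(1,2,3)) #functions such as sum() use all the elements
[1] 6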

Let’s see how these apply to our typos example. First, suppose these are the typos for the first draft of the
document. We might want to keep track of our various drafts as we fix the typos. For example, say we fixed the
two typos on page 1 of the document, then the vector of typos of the first two drafts are as follows:
> typos.draft1 = c(2,3,0,3,1,0,0,1)
> typos.draft2 = c(0,3,0,3,1,0,0,1)

It is important that we use different names for the different drafts. If we use the same name for the second draft
as we used for the first draft, R will simply overwrite the original object without warning. It may seem like a lot
of work to type the data a second time, but there are ways to make this easier. One way is to just recall the code
you used to create typos.draft1 in the command line, using the up or down arrows. Then you can use the right/left
arrows to place the cursor at the number for the first page and just change the 2 to a 0.

Another way to do this is to create a copy of typos.draft1 and give it a different name, say, typos.draft2. We can
then access the value in position 1 of typos.draft2 by referencing the first entry in the vector using square brackets
[ ] and change it to a 0:
> typos.draft1 = c(2,3,0,3,1,0,0,1)
> typos.draft2 = typos.draft1 #make a copy
> typos.draft2[1] = 0 #assign the first page 0 typos

Note that parentheses ( ) are for functions and square brackets [ ] are for vectors (and later arrays and lists).
Negative indices give everything except these indices, and we can index more than one value at a time by using
another vector of index numbers, called slicing. We can then access different values or subsets within a vector:
> typos.draft2 #print out the whole vector
[1] 0 3 0 3 1 0 0 1
> typos.draft2[2] #print out the value in position 2, i.e., the 2nd page
[1] 3
> typos.draft2[-4] #print out all values except the value in position 4, i.e., the 4th page
[1] 0 3 0 1 0 0 1
> typos.draft2[c(1,2,3)] #print out the values in positions 1, 2 and 3 of the vector
[1] 0 3 0
In this example we have a very small dataset/vector. Without the need to visually examine the data, we can make
use of programming to, for example, identify the pages with many errors. Say we want to know which pages
contain errors, by inspection we can see that this is the case for pages 2, 4, 5 and 8. To do this using coding, we
can use logical or comparison operators to identify specific entries:

> typos.draft2 #print out the whole vector


[1] 0 3 0 3 1 0 0 1

> typos.draft2 >= 1 #logical indicator to identify which entries are greater or equal to 1
[1] FALSE TRUE FALSE TRUE TRUE FALSE FALSE TRUE

> which(typos.draft2 >= 1) #identify the vector position of entries greater or equal to 1
[1] 2 4 5 8

> which(typos.draft2==max(typos.draft2)) #identify the vector position of entries equal to the maximum value
[1] 2 4

There are several ways to extract data from a vector. Here is a summary of commands for slicing and extraction
from a vector x.

Description Example
Length of the vector / how many elements length(x)
Show the ith element x[i]
Show all elements except the ith element x[-i]
Show the first k elements x[1:k]
Show the last k elements x[(length(x) + 1 - k):length(x)]
Show a specific subset of elements x[c(1,3,5)] #1st, 3rd and 5th
Show all elements greater than some value x[x>3] #greater than 3
Show all elements greater than some value x[ x< -2 | x > 2] #less than -2 or greater than 2
or less than some value
Show the largest element(s) which(x == max(x)) #all elements equal to the maximum value
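A few of these commands, applied to a small example vector, look as follows:

> x = c(5,-3,0,8,2,-6,4)
> length(x) #number of elements
[1] 7
> x[2:4] #elements in positions 2 to 4
[1] -3 0 8
> x[x > 3] #all elements greater than 3
[1] 5 8 4
> x[x < -2 | x > 2] #elements less than -2 or greater than 2
[1] 5 -3 8 -6 4
> which(x == max(x)) #position of the largest element
[1] 4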

1.3.4 Generating sequences
An important way of creating vectors is to generate a sequence of numbers. The simplest sequences are in steps
of 1, and the colon operator is the simplest way of generating such sequences. All you do is to specify the first
and last values separated by a colon.

A sequence from 0 up to 10:


> 0:10
[1] 0 1 2 3 4 5 6 7 8 9 10

A sequence from 15 down to 5:


> 15:5
[1] 15 14 13 12 11 10 9 8 7 6 5

The seq() function is used to generate a sequence in steps other than 1. In its simplest form the function has three
arguments: “from”, “to”, and “by” (the initial value, the final value and the increment). The sign of the increment
value in the function indicates whether the sequence increases or decreases between the initial and final values.

A sequence from 0 to 1.5 by increasing increments of 0.1:


> seq(0, 1.5, 0.1)
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5

A sequence from 6 to 4 by decreasing increments of 0.2:


> seq(6,4,-0.2)
[1] 6.0 5.8 5.6 5.4 5.2 5.0 4.8 4.6 4.4 4.2 4.0

In many cases, you want to generate a sequence to match an existing vector in length. For example, consider the
following vector of, say, population sizes:
> N = c(55,76,92,103,84,88,121,91,65,77,99)

You need to plot N against a sequence that starts at 0.04 in increments of 0.01. This can be done in different ways:
> seq(from=0.04,by=0.01,length=11)
> seq(from=0.04,by=0.01,along=N)
> seq(from=0.04,to=0.14,along=N)
[1] 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14

1.3.5 Generating repeats
You will often want to generate repeats of numbers or characters. This is done using rep() function, which consists
of at least two arguments, and results in a vector of numbers or characters. The first argument is the number,
vector or factor that must be repeated in the vector. The subsequent arguments indicate how many times the
number or sequence must be repeated, or how many times each number is successively repeated.

For example:
> rep(3,5)
[1] 3 3 3 3 3

> rep(1:4, 2)
[1] 1 2 3 4 1 2 3 4

> rep(1:4, times = 2)


[1] 1 2 3 4 1 2 3 4

> rep(1:4, each = 2)


[1] 1 1 2 2 3 3 4 4

> rep(1:4, each = 2, times = 3)


[1] 1 1 2 2 3 3 4 4 1 1 2 2 3 3 4 4 1 1 2 2 3 3 4 4

> rep(1:4,1:4)
[1] 1 2 2 3 3 3 4 4 4 4

> rep(1:4,c(4,1,4,2))
[1] 1 1 1 1 2 3 3 3 3 4 4

> rep(c("cat","dog","gerbil","goldfish","rat"),c(2,3,2,1,3))
[1] "cat" "cat" "dog" "dog" "dog" "gerbil"
[7] "gerbil" "goldfish" "rat" "rat" "rat"

1.3.6 Generating a complete sample space
Consider a game of rolling a 6-sided die once; the complete sample space is Ω = {1,2,3,4,5,6}. To create a list of the sample space we can use the sequence function seq(1:6), or just 1:6. If we roll the die twice, the sample space increases to 36 outcomes (ωi, ωj), where ωi ∈ {1,2,3,4,5,6} is the outcome of the first roll of the die and ωj ∈ {1,2,3,4,5,6} is the outcome of the second roll of the die. In this case the expand.grid() function can be used to create a data frame from all combinations of the given vectors or factors.

1) Ω for a single roll of a 6-sided die


> expand.grid(1:6)
Var1
1 1
2 2
3 3
4 4
5 5
6 6
2) Ω for two rolls of a 4-sided die (or rolling two 4-sided dice simultaneously)
> expand.grid(1:4,1:4)
Var1 Var2
1 1 1
2 2 1
3 3 1
4 4 1
5 1 2
6 2 2
7 3 2
8 4 2
9 1 3
10 2 3
11 3 3
12 4 3
13 1 4
14 2 4
15 3 4
16 4 4
3) Ω for a single roll of a 6-sided die followed by two flips of a fair coin
> expand.grid(1:6,c("H","T"),c("H","T"))
Var1 Var2 Var3
1 1 H H
2 2 H H
3 3 H H
4 4 H H
5 5 H H
6 6 H H
7 1 T H
8 2 T H
9 3 T H
10 4 T H
11 5 T H
12 6 T H
13 1 H T
14 2 H T
15 3 H T
16 4 H T
17 5 H T
18 6 H T
19 1 T T
20 2 T T
21 3 T T
22 4 T T
23 5 T T
24 6 T T

1.3.7 Basic counting rules
The multiplication principle
If there are p experiments and the first has n1 possible outcomes, the second has n2 possible outcomes, …, and the pth has np possible outcomes, then there are n1 × n2 × ⋯ × np possible outcomes of the p experiments.

For example, a developer of a new subdivision offers a prospective home buyer a choice of 4 designs (A,B,C,D),
and a garage or a carport.
1) How many different plans are available to the buyer?
> 4*2
[1] 8
2) List the complete sample space
> expand.grid(c("Design A","Design B","Design C","Design D"),c("Garage","Carport"))
Var1 Var2
1 Design A Garage
2 Design B Garage
3 Design C Garage
4 Design D Garage
5 Design A Carport
6 Design B Carport
7 Design C Carport
8 Design D Carport

Ordered arrangement without replacement: Factorial


The factorial() function is used to calculate n!, the total number of orderings of n objects where duplication is not
allowed.

For example, the number of distinct words that can be formed from all the letters in the word “WORD” is n! = 4!
> factorial(4)
[1] 24

Unordered arrangement without replacement: Combination
To select a sample of size r from n objects where duplication is not allowed, and the samples are not ordered, i.e.,
order is not important, we use combinations. In R, the choose() function provides the total number of
combinations, and the combn() function lists the complete sample space.

For example, 5 students applied to be tutors, and the lecturer randomly selects 3 of the 5.
1) How many different subsets of 3 out of 5 exists?
> choose(5,3)
[1] 10
2) List the complete sample space
> combn(5,3)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1 1 1 1 1 1 2 2 2 3
[2,] 2 2 2 3 3 4 3 3 4 4
[3,] 3 4 5 4 5 5 4 5 5 5

Ordered arrangement without replacement: Permutation


To select a sample of size r from n objects where duplication is not allowed, and the samples are ordered, i.e.,
order is important, we use permutations. There is no readily available function in R to list the complete sample
space, so the user will have to write a function to do this (this is beyond the scope of this module). To determine
the number of permutations, the permn() function in the combinat package can be used. However, since we only use built-in functions of the base package in this second-year module, we can use the relation between a
permutation and a combination to calculate a permutation.
n n!
 =
 r  r !( n − r ) !
n! r ! n! n
nPr = = = r ! 
( n − r )! r ! ( n − r )!  r 

For example, the number of possible 4-digit pin numbers if duplication is not allowed is a permutation, where we
select any 4 digits out of 10 and order the numbers, i.e., P(10,4)
> factorial(4) * choose(10,4)
[1] 5040
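Equivalently, the permutation can be computed directly from its factorial definition, P(10,4) = 10!/(10 − 4)!:

> factorial(10)/factorial(10-4) #n!/(n-r)! with n = 10 and r = 4
[1] 5040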

2. Data and Built-in Functions
2.1 Basic Built-In Functions
One of the great strengths of R is its ability to evaluate functions over entire vectors, thereby avoiding the need
for loops. The following table shows a list of commonly used built-in functions in R.
Function Description
max(x) Maximum value in x
min(x) Minimum value in x
sum(x) Total of all the values in x
mean(x) Arithmetic average of the values in x
median(x) Median value in x
range(x) Vector of min(x) and max(x)
var(x) Sample variance of x
sd(x) Sample standard deviation of x
cor(x,y) Correlation between vectors x and y
sort(x) A sorted version of x
rank(x) Vector of the ranks of the values in x
order(x) An integer vector containing the permutation to sort x into ascending order
quantile(x) Vector containing the minimum, lower quartile, median, upper quartile, and maximum of x
cumsum(x) Vector containing the sum of all of the elements up to that point
cumprod(x) Vector containing the product of all of the elements up to that point
cummax(x) Vector of non-decreasing numbers with the cumulative maxima of x along the vector
cummin(x) Vector of non-increasing numbers with the cumulative minima of x along the vector
pmax(x,y,z) Vector, of length equal to the longest of x, y or z, containing the maximum of x, y and z at the ith position in each
pmin(x,y,z) Vector, of length equal to the longest of x, y or z, containing the minimum of x, y and z at the ith position in each
colMeans(x) Column means of data frame or matrix x
colSums(x) Column totals of data frame or matrix x
rowMeans(x) Row means of data frame or matrix x
rowSums(x) Row totals of data frame or matrix x
summary(x) Returns specific quantiles for numerical columns of data frame or matrix x, and counts for factor columns of data frame x
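Some of the less familiar functions are illustrated below on a short vector of arbitrary marks; note in particular the difference between sort(), rank() and order():

> marks = c(66,40,70,25)
> sort(marks) #the values arranged from smallest to largest
[1] 25 40 66 70
> rank(marks) #the rank of each value within the vector
[1] 3 2 4 1
> order(marks) #the positions that would sort the vector, i.e., marks[order(marks)] equals sort(marks)
[1] 4 2 1 3
> cumsum(marks) #running total of the values
[1] 66 106 176 201
> pmax(c(1,8,3), c(5,2,9)) #element-wise maximum of two vectors
[1] 5 8 9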

2.1.1 Using functions in vectors
The built-in functions can be applied to vectors. For example, enter the following set of values in a vector called
height using the c() function in R, and calculate the range, sum, mean, variance, standard deviation, cumulative
sum, and quantile of variable height.
175 185 199 142 155 180 166 164 172 174

> height = c(175,185,199,142,155,180,166,164,172,174)


> range(height)
[1] 142 199
> sum(height)
[1] 1712
> mean(height)
[1] 171.2
> var(height)
[1] 250.8444
> sd(height)
[1] 15.83807
> sqrt(var(height))
[1] 15.83807
> quantile(height)
0% 25% 50% 75% 100%
142.00 164.50 173.00 178.75 199.00
> cumsum(height)
[1] 175 360 559 701 856 1036 1202 1366 1538 1712

2.1.2 Using functions in matrices and data frames


The built-in functions can be applied to the columns or rows of a matrix or data frame. To do this in a matrix, we
specify the correct row or column using indexing. To do this in a data frame, we specify the correct column
(variable) using the $ notation. We then apply the required function on the selected portion of the matrix or data
frame.
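A minimal sketch (the matrix M and data frame df below are made up for illustration only):

> M = matrix(c(66,60,70,64,72,67), ncol=2) #a small 3 x 2 matrix of marks
> mean(M[,1]) #mean of column 1, selected by indexing
[1] 65.33333
> max(M[2,]) #maximum of row 2
[1] 72
> df = data.frame(height = c(175,185,199))
> mean(df$height) #mean of the variable height, selected with $
[1] 186.3333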

2.2 Data
We have seen how we can create a vector of values in R. But a vector only represents the data of a single variable.
If we collected multiple variables, it makes more sense to represent the information in a complete data structure
(dataset), rather than having to deal with separate vectors.

For statistical analysis a dataset is a two-dimensional rectangular array with n rows and p columns. The p variables are shown in the columns of the array, and the n units are shown in the rows of the array. The two most important
dataset structures in R are the matrix and the data frame. They look the same but are different in nature. The main
difference is that matrices can only contain a single class of data, while data frames can consist of many different
classes of data.

2.3 Matrices
A matrix is a homogeneous collection of data arranged in a two-dimensional rectangular array, i.e., an n × p array
where all elements of the matrix are of the same type. We can perform many arithmetic operations on a matrix
that consists of numerical values, such as addition, subtraction, multiplication, and division. Some statistical
processes require matrices as input to perform matrix algebra, which can only be applied to matrix data structure.
Having said that, R will convert the portion of a data frame needed for such analyses to a matrix data structure,
for example in a regression or factor analysis.

2.3.1 Creating a matrix


There are several ways to create a matrix in R:
• Using the matrix() function.
• Combining vectors using the cbind() or rbind() functions.
• Changing a data frame to a matrix using the as.matrix() function, which will force all the elements into
one type.

Consider the following matrix with 2 rows and 3 columns:


 2 4 1
X = 
1 5 0 

The matrix() function has the following arguments:
1) data: A vector of data that can be created in the function using the c() function.
2) nrow: The number of rows of the matrix.
3) ncol: The number of columns of the matrix.
4) byrow: A logical argument where FALSE (default) = the matrix is created by column, or TRUE = the
matrix is created by row.
5) dimnames: An optional argument where one can include attribute labels for rows and columns.

> X = matrix(c(2,4,1,1,5,0), nrow=2, ncol=3, byrow=TRUE)


>X
[,1] [,2] [,3]
[1,] 2 4 1
[2,] 1 5 0

> X = matrix(c(2,1,4,5,1,0), nrow=2, ncol=3, byrow=FALSE)


>X
[,1] [,2] [,3]
[1,] 2 4 1
[2,] 1 5 0

The cbind() function creates a matrix by combining a series of vectors column by column.
> col1 = c(2,1)
> col2 = c(4,5)
> col3 = c(1,0)
> X = cbind(col1, col2, col3)

The rbind() function creates a matrix by combining a series of vectors row by row.
> row1 = c(2,4,1)
> row2 = c(1,5,0)
> X = rbind(row1, row2)

The rows and columns of a matrix are numbered, but we can label rows and columns of a matrix using dimnames
(part of the matrix() function) or rownames() and colnames().

Consider the following frequency table for a sample of n = 13 respondents, with gender (male/female) in the rows
of the table, and eye colour (blue, green, brown) in the columns of the table. We can represent this frequency
table in R as a 2 × 3 matrix.
Blue Green Brown
M 2 4 1
F 1 5 0

Label rows and columns using the dimnames argument in the matrix() function
Since the dimnames argument forms part of the matrix() function, we can attach row and column labels to a matrix
as part of the code creating the matrix. This argument is a list with two elements: first the row labels, then the
column labels. If the list is empty it is treated as NULL, i.e., no labels are attached. If the list consists of only one
element, that element denotes the row names and must therefore include the same number of names as there are
rows.
> Ftab1 = matrix(c(2,1,4,5,1,0),nrow=2,ncol=3,dimnames=list(c("M", "F"), c("Blue","Green","Brown")))
> Ftab1
Blue Green Brown
M 2 4 1
F 1 5 0

Label rows and columns using the rownames() and colnames() functions
We can attach row and column names after we have created a matrix using rownames() and colnames(). Inside
the brackets we enter the name of the object, and then we assign a vector of names to either the rows or the
columns. Since the names are all text, these have to be in double quotes.
> Ftab2 = matrix(c(2,1,4,5,1,0),nrow=2,ncol=3)
> colnames(Ftab2) = c("Blue","Green","Brown")
> rownames(Ftab2) = c("M", "F")
> Ftab2
Blue Green Brown
M 2 4 1
F 1 5 0

2.3.2 Extracting a subset of a matrix
Once we have a matrix we can extract/show/access any subset of row(s) and column(s) using brackets. Since a
matrix is a two-dimensional array, the matrix element in row i and column j is denoted by element [i, j].

Consider the matrix with the frequency table data:


> Ftab1
Blue Green Brown
M 2 4 1
F 1 5 0

1) Show the value in row 2 and column 3 of Ftab1


> Ftab1[2,3]
[1] 0

2) Show row 2 of Ftab1


> Ftab1[2,]
Blue Green Brown
1 5 0

3) Show column 3 of Ftab1


> Ftab1[,3]
M F
1 0

4) Show columns 1 and 3 of Ftab1


> Ftab1[,c(1,3)]
Blue Brown
M 2 1
F 1 0

5) Show columns 1 and 3 of row 2 of Ftab1


> Ftab1[2,c(1,3)]
Blue Brown
1 0

2.3.3 Row and column calculations
The apply() function is used to apply functions, such as those listed in Section 2.1, to the margins of a matrix, for
example, computing the average score of a test for n students. The function has three arguments:
1. X: the name of the array or matrix
2. MARGIN: 1 = rows, 2 = columns
3. FUN: the functions that must be applied to either rows or columns (min, max, sum, mean, var, sd, etc.)

The following table shows the marks of tests 1 and 2 for 8 students:
Test 1 Test 2
66 64
60 72
70 67
68 65
86 76
54 58
77 77
25 33

1) Create a matrix in R to store this data, with the variables in the columns of the matrix. Note, I only specified
the number of columns of the matrix in the code, and used NULL in the dimnames() argument because I do
not want to attach a whole new vector with row names for this matrix.

test=matrix(c(66,60,70,68,86,54,77,25,64,72,67,65,76,58,77,33),ncol=2,dimnames=list(NULL,c("Test 1","Test 2")))


> test
Test 1 Test 2
[1,] 66 64
[2,] 60 72
[3,] 70 67
[4,] 68 65
[5,] 86 76
[6,] 54 58
[7,] 77 77
[8,] 25 33

2) Find the minimum and maximum marks for both tests and combine the values in matrix format. Note how we
can use functions within functions.
> minmax = rbind(apply(test, 2, min), apply(test, 2, max))
> rownames(minmax) = c("Min", "Max")
> minmax
Test 1 Test 2
Min 25 33
Max 86 77

3) For each student, calculate the average of their two tests and attach these values to the data matrix. Note that
the use of the matrix function allows us to attach a column name to the new variable (column).

test2 = cbind(test, matrix(apply(test, 1, mean), ncol = 1, dimnames = list(NULL, "Average")))


> test2
Test 1 Test 2 Average
[1,] 66 64 65.0
[2,] 60 72 66.0
[3,] 70 67 68.5
[4,] 68 65 66.5
[5,] 86 76 81.0
[6,] 54 58 56.0
[7,] 77 77 77.0
[8,] 25 33 29.0

2.3.4 Matrix arithmetic operations


Consider the following matrices:
2 4 1 5 −1 1 
A = 1 5 0  B = 1 −2 −4 
 0 −1 0  1 2 1 

1) Create matrix A and matrix B in R


> A = matrix(c(2,4,1,1,5,0,0,-1,0),nrow=3,byrow=TRUE)
> B = matrix(c(5,-1,1,1,-2,-4,1,2,1),nrow=3,byrow=TRUE)

2) Multiply matrix A by 3, i.e., scalar multiplication
> 3*A
[,1] [,2] [,3]
[1,] 6 12 3
[2,] 3 15 0
[3,] 0 -3 0

3) Add the elements of matrix B to the corresponding elements of matrix A


> A+B
[,1] [,2] [,3]
[1,] 7 3 2
[2,] 2 3 -4
[3,] 1 1 1

4) Subtract the elements of matrix B from the corresponding elements of matrix A


> A-B
[,1] [,2] [,3]
[1,] -3 5 0
[2,] 0 7 4
[3,] -1 -3 -1

5) Multiply the elements of matrix A with the corresponding elements of matrix B


> A*B
[,1] [,2] [,3]
[1,] 10 -4 1
[2,] 1 -10 0
[3,] 0 -2 0

6) Divide the elements of matrix A by the corresponding elements of matrix B


> A/B
[,1] [,2] [,3]
[1,] 0.4 -4.0 1
[2,] 1.0 -2.5 0
[3,] 0.0 -0.5 0
7) Transpose matrix A
> t(A)
[,1] [,2] [,3]
[1,] 2 1 0
[2,] 4 5 -1
[3,] 1 0 0

8) Multiply matrix A with matrix B, i.e., matrix multiplication


> A%*%B
[,1] [,2] [,3]
[1,] 15 -8 -13
[2,] 10 -11 -19
[3,] -1 2 4

2.3.5 Sets
There are three essential functions for manipulating sets, namely union(), intersect() and setdiff(). These functions
can be applied to any data structure (vectors, matrices, etc.). Consider matrices A and B created in Section 2.3.4.

1) The union of A with B shows all elements that are either in A, or in B, or in both A and B:
> union(A,B)
[1] 2 1 0 4 5 -1 -2 -4

2) The intersection of A and B shows all elements that are common to both A and B:
> intersect(A,B)
[1] 2 1 5 -1

3) The difference between two sets is order dependent. It shows the elements that are in the first named set but not in the second named set. Therefore, setdiff(A,B) is different from setdiff(B,A).
> setdiff(A,B)
[1] 0 4
> setdiff(B,A)
[1] -2 -4

2.4 Data Frames
A data frame is more general than a matrix in that different columns can have different modes (numeric, character,
factor, etc.). In a data frame we can combine variables of equal length, with each row in the data frame containing
observations on the same sampling units. Hence, the data.frame() function is similar to the matrix() and cbind()
functions, but here we can combine numerical and categorical variables into an array.

Consider the following marks obtained by 8 students on their test.


66 40 70 68 86 54 77 25

Compute a new variable to show whether the student failed, passed, or passed with distinction. Combine the two
variables into a dataset using the data.frame() function.

> marks=c(66,40,70,68,86,54,77,25)
> result=marks #you can also use a vector of zeros of the length of object marks
> result[marks < 50]="Fail"
> result[marks>=50 & marks<75]="Pass"
> result[marks>=75]="Pass with distinction"
> result=as.factor(result)
> marks_data=data.frame(marks,result)
> marks_data
marks result
1 66 Pass
2 40 Fail
3 70 Pass
4 68 Pass
5 86 Pass with distinction
6 54 Pass
7 77 Pass with distinction
8 25 Fail

There are different functions used to extract information about the structure of an object. The names() function
lists the variable names; the class() function shows the class of the object; the dim() function shows the dimension
of a matrix or a data frame; the str() function gives detailed information about the structure of the object.

> names(marks_data)
[1] "marks" "result"

> class(marks_data)
[1] "data.frame"

> dim(marks_data)
[1] 8 2

> str(marks_data)
'data.frame': 8 obs. of 2 variables:
$ marks : num 66 40 70 68 86 54 77 25
$ result: Factor w/ 3 levels "Fail","Pass",..: 2 1 2 2 3 2 3 1

Since we have two variables in the two columns of the data frame, we can access a specific variable by specifying
the column value, column name, or using the $ command, and then apply a function to the variable.

> marks_data[,1] #extract all row values for column 1, namely variable marks
[1] 66 40 70 68 86 54 77 25
> marks_data[,"marks"]
[1] 66 40 70 68 86 54 77 25
> marks_data$marks
[1] 66 40 70 68 86 54 77 25
> mean(marks_data[,"marks"])
[1] 60.75

We can also select a subset of units for which a variable value adheres to some logical expression. For example,
if we want to show both variable values for all students who passed the test, we use the code:
> marks_data[marks_data[,2]=="Pass",]
marks result
1 66 Pass
3 70 Pass
4 68 Pass
6 54 Pass
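The same subset can also be obtained with the base R subset() function, which lets us refer to the variable by name:

> subset(marks_data, result == "Pass")
marks result
1 66 Pass
3 70 Pass
4 68 Pass
6 54 Pass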

3. Tables, Graphs and Basic Statistical Analysis
In statistical analysis, data can be stored in many different file formats, such as Excel, SPSS, tab-delimited ASCII
text, comma-delimited ASCII text, etc. The R Base package can import text files using code and Excel files using
the pull-down menu. Special libraries/packages are used to import files from other statistical software packages
such as SPSS and SAS, which is beyond the scope of the course. Please note, if I give you a .csv file with the
data, do not open the file in Excel, export from Excel and import the Excel file into R. We will only use the built-
in functions in R to directly import a .csv file into R.

Once we have collected our data and imported the file into R, the first step in the analysis is to perform data pre-
processing (DP), where we check the distribution of every variable to ensure that there are no errors, investigate
outliers, assess missing data, and prepare the data for statistical analysis. We do this using frequency tables,
summary statistics and graphs. This forms part of the DP process. It is an essential step in analysing data and takes up most of an
analyst’s time. When we fix errors, fill missing values, or construct new variables, we never overwrite our original
data and always work with a copy. We will not address the DP process in this module.

3.1 Using the R Script Editor


When we work on a project it is good practice to record all coding for the project, including the working directory,
in a text document. We can do this in RStudio using the script/text editor, referred to as the source. In the source,
you can open a new script file clicking File → New File → R Script. If the source is not part of the default pane
layout, you can add it by selecting Tools → Global Options → Pane Layout. You have the option to show four
different panes, namely Console (command prompt), Source (script), Environment and Plots (the last two have
the same list of content and you can decide what you want to show under each pane). Once you have selected
your preference, click on Apply.

You can now type commands in a script file in the source. The code is colour coded, automatically includes
brackets for the function, and also allows for text completion. To execute a command, highlight all the code you
want to run, then either click on the run button, or select and hold the control key on the keyboard followed by
the enter key (Ctrl+Enter). To execute one line of code, place your cursor anywhere within the line and click the
run button or select Ctrl+Enter. The commands and results will show in the R console. If you have set your
working directory you can save your script using Ctrl+S and it will automatically save the file in the specified
directory. Script files are saved as .R files.

3.2 Datasets
For this practical session we will use data in .csv format and datasets that are available in the base package of R.
To view and get information about built-in datasets in RStudio, such as the cars dataset, use the code:
> cars #the “cars” dataset will show in the console
> View(cars) #the dataset will open in a window in the source and looks similar to an Excel dataset
> ?cars #information about the content of the “cars” dataset will show in the Help window

Built-in datasets for examples


faithful: Dataset recording the duration of the eruption (eruptions) and the waiting time to the next eruption (waiting), both in minutes, for the Old Faithful geyser in Yellowstone National Park in Wyoming, USA.
InsectSprays: Dataset containing the counts of insects (count) in a total of 72 agricultural experimental units that
were treated with six different insecticides (spray), i.e., 12 units per insecticide type.

Excel datasets for examples


Coffee: Data collected from coffee consumers (more detail in Section 3.4).
API: A researcher collected air quality data, expressed as an air pollution index (api), wind velocity (wind) in
km/h, temperature (temp) in °F and humidity (humid) in percentage, in a large city for 20 days during summer.
Healthcare: Employees of a large company can choose any one of three health care plans. The HR manager
recorded the choices of plan (plan) and job categories (jobcat) of a random sample of 200 employees (employee).
Ice: The latent heat of fusion at a particular temperature is the amount of heat required to convert a unit mass of
solid into liquid. A researcher measured the latent heat (latent_heat) of ice fusion using two different methods
(method). Values are in metric units of specific heat capacity, i.e., calorie per gram per degree Celsius (cal/gm).
Jump: A group of 17 Sports Science students were selected to investigate whether a 12-week specialised training
programme improves their standing long jump performance. Each student’s standing long jump distance was
measured before (jump1) and after (jump2) the programme.

3.3 The Working Directory
Since we import the data from a specific folder, we must either specify the working directory for the R session,
or we can include the file path in the command. In Windows, the folders are separated by “\” in the file path. In
R, we cannot use “\”, we must use either “\\” or “/” to indicate the file path. In the computer labs on campus, you
must always use the local drive on the computers.

If you download a dataset that I provided in Moodle, it will automatically save to the downloads folder. You must
remove those files from the downloads folder and place them in the T-drive. If you work on your own computer,
you will save the downloaded files in the specific folder for that project. You can now specify the working
directory. I will illustrate this using the T-drive that is available in the computer labs on campus:
> setwd("T:/") #OR
> setwd("T:\\")

Once you have set the working directory, you can import the datasets without having to specify the path again.
Although you can import data by specifying the path directly in the function it is better and easier to define the
working directory for an entire project at the start of your program.
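You can confirm the current working directory with getwd() and list the files R sees in that directory with list.files(); the output shown below is only illustrative:

> getwd() #show the current working directory
[1] "T:/"
> list.files() #list the files in the working directory
[1] "Coffee.csv" "Healthcare.csv"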

3.4 Importing Text Data


In the second year Statistics module you will only be given text data to import into R. The most common text file
formats are the tab-delimited text file (.txt) and the comma-delimited text file (.csv). We use the read.table()
function to import a tab-delimited text file and the read.csv() function to import a comma-delimited text file. For
this module we will only import .csv data. Consider a survey that collected demographic and coffee consumption
behaviour and preference information from a random sample of 20 coffee consumers. The survey yielded the
following nine variables:
ID ID number
gender Gender
age Age
qual Highest qualification
size Household size
cups Daily coffee consumption (number of cups)
pref Coffee type preference
import Choice of brand rating from not important (1) to very important (5)
affinity Coffee affinity score

The comma-delimited text file is named Coffee.csv. The character variables are kept as plain text, and all values are separated by commas. The variable names appear on the first row of the file. In the read.csv() function we
indicate that the variable names are included, using the code header=TRUE or header=T.

1) Import the Coffee.csv data after specifying the working directory.


> Coffee=read.csv("Coffee.csv",header=T)

2) View the data in a separate window.


> View(Coffee)

3) Show the class of the object Coffee.


> class(Coffee)
> [1] "data.frame"

4) List the variable names of the object Coffee.


> names(Coffee)
> [1] "ID" "gender" "age" "qual" "size" "cups" "pref" "import" "affinity"

5) Show the complete structure of the object Coffee.


> str(Coffee)
'data.frame': 20 obs. of 9 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ gender : chr "Male" "Male" "Female" "Female" ...
$ age : int 24 26 25 30 35 21 24 19 28 34 ...
$ qual : chr "Tertiary certificate" "Degree/Diploma" "Degree/Diploma" "Less than matric" ...
$ size : int 4 2 3 5 1 1 4 1 2 3 ...
$ cups : int 3 1 2 7 4 1 2 1 3 2 ...
$ pref : chr "Instant" "Instant" "Filter" "Instant" ...
$ import : int 2 1 1 5 3 3 4 4 2 1 ...
$ affinity : num 2.3 1.9 0.8 4.4 3.1 0.4 1.8 0.4 3.1 1.9 ...

6) The variable import is measured on an ordinal scale but is coded with numeric values, so R treats it as a number. To convert it to a factor and attach labels to the factor levels of such a variable, if needed, we use the following command:
> Coffee$import=factor(Coffee$import,labels=c("Not important","2","3","4","Very important"))
> View(Coffee) #View the dataset to see how this variable is represented now
3.5 Frequency Tables and Contingency Tables
For numerical variables we can summarise data with means, medians, variance, standard deviation, range, and
quantiles. We do this using the functions listed in Section 2.1. For factor (categorical/character) variables we
summarise the data in frequency tables (univariate) or contingency tables (bivariate). The table() function is used
for both univariate and bivariate tables. We can execute the table command to get the output, or we can assign
the table command to an object which can then be used in subsequent analyses.

1) Import the Healthcare data, Healthcare.csv.


> Healthcare=read.csv("Healthcare.csv",header=T)

2) Create frequency tables for both the jobcat and plan variables in the Healthcare data.
> table(Healthcare$jobcat)
Admin Analyst Manager
35 129 36
> table(Healthcare$plan)
A B C
63 56 81

3) Assign the contingency table of jobcat × plan from the Healthcare data to an object, say ftab.
> ftab=table(Healthcare$jobcat,Healthcare$plan)
> ftab
A B C
Admin 15 9 11
Analyst 41 37 51
Manager 7 10 19

4) Add row and column totals (margins) to the contingency table ftab.
> addmargins(ftab)
A B C Sum
Admin 15 9 11 35
Analyst 41 37 51 129
Manager 7 10 19 36
Sum 63 56 81 200

5) Convert the counts in the contingency table ftab to table proportions, i.e., all proportions add up to 1.
> prop.table(ftab)
A B C
Admin 0.075 0.045 0.055
Analyst 0.205 0.185 0.255
Manager 0.035 0.050 0.095

6) Convert the counts in the contingency table ftab to row proportions, correct to 3 decimal places, i.e.,
proportions within each row add up to 1. Add the row totals to the output.
> addmargins(round(prop.table(ftab,1),digits = 3),margin = 2)
A B C Sum
Admin 0.429 0.257 0.314 1.000
Analyst 0.318 0.287 0.395 1.000
Manager 0.194 0.278 0.528 1.000

7) Convert the counts in the contingency table ftab to column percentages, correct to the nearest integer, i.e.,
percentages within each column add up to 100. Add the columns totals to the output.
> addmargins(round(100*prop.table(ftab,2),digits = 0),margin = 1)
A B C
Admin 24 16 14
Analyst 65 66 63
Manager 11 18 23
Sum 100 100 100

3.6 Graphs
As part of the descriptive analysis, we investigate our data graphically. There are two kinds of graphical functions
in R: high-level plotting function (creates a new graph) and low-level plotting function (adds elements to an
existing graph). Graphs are created using graphical parameters, which are defined by default and can be modified
using the par() function.
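For example, the mfrow parameter of par() splits the plotting region into a grid of panels so that several graphs can be displayed together; a minimal sketch using the built-in faithful data:

> par(mfrow = c(1,2)) #one row and two columns of plots
> hist(faithful$eruptions) #drawn in the left panel
> hist(faithful$waiting) #drawn in the right panel
> par(mfrow = c(1,1)) #reset to a single plot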

3.6.1 Graphical functions


List of some high-level functions:
Function Description
plot(x) Plot the values of x on the y-axis against their index (position) on the x-axis
plot(x,y) Bivariate plot (scatterplot) of the x-values on the x-axis and the y-values on the y-axis
pie() Pie chart of variable X
barplot() Bar graph of factor variable (X) by frequencies (Y)
boxplot() Box-and-whisker plot of variable Y, which can be split by factor X
hist() Histogram of variable X
qqnorm() Normal quantiles of variable X

List of some low-level functions:


Function Description
points(x,y) Adds points
lines(x,y) Adds lines
segments(x0,y0,x1,y1) Draws a line segment from the point (x0, y0) to the point (x1, y1)
abline(a,b) Draws a line with slope b and intercept a
legend(x,y,legend) Adds the legend at point (x, y) with symbols given in “legend”
title() Adds a title and optional subtitles

3.6.2 Graphical parameters
There are many graphical parameters that enable us to customize graphs in R. I will focus on a few basic
parameters, but you can find the exhaustive list of graphical parameters by typing ?par in the command prompt,
or searching the help files for par.

Parameter Description
cex A number giving the amount by which text and symbols should be magnified relative to the
default of 1, e.g., cex=1.5 increases text size by 50%, cex=0.8 decreases text size by 20%
col Colour of plotting symbols, lines and text, specified by name (e.g., col="red") or by an integer code
lty Line type: 0=blank, 1=solid (default), 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash
lwd Line width indicated as a number, as is done with cex
pch Integer (0 to 25) specifying the symbol/character to be used as the plotting points, e.g., pch=1 is an open circle (the default) and pch=19 is a solid circle
type Type of plot, depending on the nature of the data: "p"=points, "l"=lines, "b"=both, "o"=overplotted points and lines, "h"=histogram-like vertical lines, "s"=stair steps, "n"=no plotting
3.6.3 Graphs
Pie charts
Create a pie chart of the jobcat variable in the Healthcare data. Add plot labels.
> pie(table(Healthcare$jobcat),main="Job category")

Line graphs
Create the following three vectors: x is a sequence from 1 to 10, y is equal to x², and z is equal to 2y.
> x=1:10
> y=x*x
> z=2*y

Create a basic stair step plot of y (note that this plot only makes sense if the data are ordered from smallest to largest).
> plot(y,type = "S")

Create a line graph with both y and z, with different line types and colours, and add the legend in the top-left
corner of the graph. Since z has higher values than y, we must define the range of the y-axis to accommodate all
possible values using the ylim argument in the plot() function.
> plot(y, type="o",ylim=c(0,200),lty=2,pch = 18, col = "blue")
> lines(z, type="o",lty=3, pch = 8, col = "red")
> legend("topleft", legend=c("y", "z"),col=c("blue", "red"), lty = c(2,3), cex=0.8)

Bar graphs
Create a bar graph of the plan variable in the Healthcare data. Add plot and axis labels.
> barplot(table(Healthcare$plan),main="Healthcare Plan",xlab="Plan",ylab="Frequency")

Create a horizontal stacked bar graph of the jobcat × plan contingency table ftab. Note that the ftab object contains
counts, but we could also pass a contingency table of row/column percentages as input so that the bars of the
plot show the distribution within categories.
> barplot(ftab,horiz = TRUE)

Create a bar graph of the ftab contingency table showing the respective job categories per plan type. Use different
colours to the default. Place a legend on the top right-hand side of the graph.
> barplot(ftab,beside = TRUE,col = 10:12)
> legend("topright",legend=rownames(ftab),cex=0.5,fill = 10:12)

Box-and-whisker plots
Create a box-and-whisker plot of the api variable in the API dataset.
> boxplot(API$api)

Create a box-and-whisker plot of the count variable by spray in the InsectSprays dataset.
> boxplot(count~spray, data=InsectSprays)

Histograms
Create a histogram of the eruptions variable in the faithful dataset. Add a plot label.
> hist(faithful$eruptions,main = "Histogram of eruptions")

Create a histogram of the difference between the before and after scores in the Jump data. Add plot labels.
> hist(Jump$jump1-Jump$jump2,main="Difference",xlab="Before - After")

Scatterplots
Create a scatterplot of the faithful dataset, with X = eruptions and Y = waiting. Add the regression line to the
scatterplot, with a = 33.47 and b = 10.73, make the line red, and add plot labels.
> plot(faithful,xlab = "Eruption time (min)",ylab = "Waiting time to next eruption (min)")
> abline(a=33.47,b=10.73,col="red")

Create a scatterplot comparing the before with the after scores of the Jump data. Add the 45° line using segments
to be able to visually inspect the differences.
> plot(Jump$jump1,Jump$jump2,xlab="Jump 1",ylab="Jump 2")
> segments(x0=2,y0=2,x1=3,y1=3)

3.6.4 How to copy graphs to MS Word
For any analysis, as well as your practical tests, you will be asked to copy graphs from the output to a Word
document. The simplest way to do this is to use the export function in the Plots window in RStudio. Before you
export the graph, first maximize the plot window, then click on Export → Copy to clipboard.

This will open your graph in a separate window. At the bottom of the window you can choose either a bitmap or
metafile format. Then click on Copy Plot, go to Word and Paste.

If you choose the bitmap format, the graph will fit the page width in Word, but you can resize this as follows:
Click on the graph → go to Picture Format → change the size at the top right.

If you choose the metafile format, the graph itself in Word will be smaller, but the image will have extra blank
spaces on the right and at the bottom. You can crop out the blank spaces as follows:
Click on the graph → go to Picture Format → click on Crop → crop the graph by clicking on the thick lines on
the borders and moving those to the desired position.

3.7 Basic Statistical Analysis
In this section we will look at the R functions to calculate correlations and perform a Chi-squared test. Other
hypothesis tests and statistical modelling procedures such as t-tests, ANOVA and regression analysis will be dealt
with in the second semester.

3.7.1 Correlation analysis


The cor() function is used to compute a single correlation between two numerical variables, or a p × p correlation
matrix showing all pairwise correlations between the p numerical variables. The default correlation coefficient is
Pearson’s product moment correlation coefficient. For this analysis, import the API.csv dataset.

1) Create a correlation matrix of all four variables in the API dataset.


> cor(API)
api wind temp humid
api 1.0000000 -0.5746979 0.3404103 0.3080594
wind -0.5746979 1.0000000 0.0139061 0.3185819
temp 0.3404103 0.0139061 1.0000000 0.2412395
humid 0.3080594 0.3185819 0.2412395 1.0000000
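
The same function returns a single correlation when two columns are passed directly; the value below corresponds to the (api, wind) entry of the matrix above. The optional method argument switches to other coefficients, e.g., method = "spearman".
> cor(API$api, API$wind)
[1] -0.5746979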

2) Create a scatterplot matrix of all four variables in the API dataset.


> plot(API)

3.7.2 Chi-squared test of independence
The Chi-squared test of independence is used to test whether two factor variables are associated / related /
dependent. It tests the hypothesis H0: the variables are independent vs. H1: the variables are dependent. The test
compares observed frequencies with frequencies we would expect to get under the assumption of the null
hypothesis (independence). A minimum requirement is that the expected frequencies must be at least 5. We can
assign the analysis to an object to store all the output.

For the Healthcare data, test whether there is an association between an employee’s job category and his/her
choice of healthcare plan.

1) Apply the chi-squared test of independence to the contingency table ftab. Assign the analysis to an object,
say chi_out.
> chi_out=chisq.test(ftab) #call the contingency table object
> chi_out=chisq.test(table(Healthcare$jobcat,Healthcare$plan)) #create the contingency table in the function
> chi_out
Pearson's Chi-squared test
data: ftab
X-squared = 5.2656, df = 4, p-value = 0.2611

2) Using the names() function, we can see that the hypothesis test produces a range of outputs within the
object, which we can access using the $ notation.
> names(chi_out)
[1] "statistic" "parameter" "p.value" "method" "data.name" "observed" "expected" "residuals" "stdres"

3) Show the contingency table of the expected frequencies.


> chi_out$expected
A B C
Admin 11.025 9.80 14.175
Analyst 40.635 36.12 52.245
Manager 11.340 10.08 14.580

4) Check the minimum expected frequency.


> min(chi_out$expected)
[1] 9.8
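
As a check (an addition to the notes), the test statistic can be reproduced from the stored observed and expected frequencies, since the statistic is the sum of (O − E)²/E over all cells:
sum((chi_out$observed - chi_out$expected)^2/chi_out$expected)  #reproduces X-squared = 5.2656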

4. Random Sampling, Numbers and Variables
4.1 Random Sampling
To select a random sample in R we use the sample() function. The structure of the function is as follows:
sample(x, size, replace = FALSE, prob = NULL)
where
• x is the vector/matrix of elements to choose from
• size is the sample size to select
• replace indicates whether selection is done with or without replacement (default is FALSE, i.e., without
replacement)
• prob is a vector of probability weights (we will not use this in the first semester)

For example:
1) Select a random sample of size 5 without replacement from a list of values from 1 to 20.
> sample(x=1:20, size=5, replace=FALSE)
[1] 19 7 14 1 4

2) Select a random sample of size 5 with replacement from the following vector:
2 7 8 10 15 17 29
> data=c(2,7,8,10,15,17,29)
> sample(data, size=5, replace=TRUE)
[1] 10 8 10 2 2

3) Create a data frame from the vectors x = [3 5 6 6 8], y = [12 6 4 23 25], and z = [2 7 8 8 15], and select a
random sample of 3 rows from the data frame without replacement.
> data=data.frame(x=c(3,5,6,6,8),y=c(12,6,4,23,25),z=c(2,7,8,8,15))
> data[sample(nrow(data),size=3), ]
x y z
3 6 4 8
1 3 12 2
5 8 25 15

4.2 Random Number Generation
When we generate random numbers in R, we are generating pseudorandom numbers using an algorithm that
requires a seed to initialize. Being pseudorandom instead of pure random means that, if you know the seed and
the generator, you can reproduce the output. We use the set.seed() function to replicate the results of some analysis
based on any random number generation function, such as the sample() function. The set.seed() function takes a
single integer seed value (at most .Machine$integer.max = 2147483647). Once we set the seed
in an R workspace, all subsequent random number generations will use the given seed value. To unset or reset a
seed in R, we set the seed to NULL to re-initialize the generator, i.e., set.seed(NULL).
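
As a minimal sketch (an addition to the notes; the actual numbers depend on your R version's default generator), calling set.seed() with the same value immediately before each draw reproduces the same result:
set.seed(123)
sample(1:20, size = 5)   #some random sample of 5 values
set.seed(123)
sample(1:20, size = 5)   #the same seed gives exactly the same sample
set.seed(NULL)           #re-initialize the generator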

4.2.1 Probability distributions


R has built-in functions to generate random numbers from many standard probability distributions for both
discrete and continuous random variables.

1) Random sample of size n from the Bernoulli(prob) or binomial(size,prob) distribution.


rbinom(n, size, prob)
> rbinom(3,1,0.6) #3 values from Bernoulli(0.6), i.e., binomial(n = 1, p = 0.6)
[1] 0 1 1
> rbinom(3,4,0.6) #3 values from binomial(n = 4, p = 0.6)
[1] 0 1 3

2) Random sample of size n from the Poisson(lambda) distribution.


rpois(n, lambda)
> rpois(5, 4) #5 values from Poisson(λ = 4)
[1] 7 4 6 3 2

3) Random sample of size n from the uniform(min, max) distribution.


runif(n, min = 0, max = 1) #default is uniform(0,1)
> runif(2) #2 values from uniform(0,1)
[1] 0.1872142 0.4256866
> runif(2,1,5) #2 values from uniform(1,5)
[1] 3.803109 3.294261

4) Random sample of size n from the normal(mean,sd) distribution.
rnorm(n, mean = 0, sd = 1) #default is Z ~ N(0,1)
> rnorm(1) #1 number from Z ~ N(0,1)
[1] -0.1783797
> rnorm(4,100,15) #4 numbers from X ~ N(µ = 100, σ² = 15²)
[1] 103.21222 136.59093 114.86349 87.20393

5) Random sample of size n from the exponential(rate) distribution.


rexp(n, rate = 1) #default is exp(λ = 1)
> rexp(3,2) #3 numbers from exp(λ = 2)
[1] 0.055294658 0.006985522 0.583403589

6) Random sample of size n from the gamma(shape,rate) distribution.


rgamma(n, shape, rate = 1, scale = 1/rate) #shape = α, rate = λ, based on our definition of the pdf
> rgamma(3,2,5) #3 numbers from gamma(α = 2, λ = 5)
[1] 0.1769519 0.6465804 0.5808767

4.2.2 Inverse transformation


Recall the probability integral transform (Proposition C) in Chapter 2, which states that if Y = F(X) then
Y ~ uniform(0,1). Consider the following cdf and show the distribution of the cumulative probabilities calculated
for a random sample of 1 million values of x.

Y = F(X) =  0            for x < 200
            0.1x − 20    for 200 ≤ x < 210
            1            for x ≥ 210

x=runif(1000000,200,210)
hist(0.1*x-20,freq = FALSE)

Recall the inverse transformation method (Proposition D) in Chapter 2, which states that for U ~ uniform(0,1)
and X = F⁻¹(U), the cdf of X is FX. This proposition allows us to generate a random variable X from its cdf
FX by generating a U ~ uniform(0,1) random number and setting X = F⁻¹(U). This is called the inverse
transformation method. The steps are as follows:


1) Find the cdf of X .
2) Set it equal to u.
3) Find an expression for x in terms of u.
4) Randomly select a value u from the uniform(0,1) distribution and substitute into the equation in (3).
5) This calculated value of x is a random number from the distribution of X .

For example, use the inverse transformation method to simulate 10000 numbers from an exponential distribution
with a rate of λ = 2. Create a histogram of the simulated values, together with the density curve of the exponential
distribution. Let X ~ exp(2) and U ~ uniform(0,1). Then:

F(x) = 1 − e^(−2x) = u  ⇒  x = −ln(1 − u)/2

#create a histogram of x-values simulated from 10000 uniform(0,1) values


hist(-log(1-runif(10000))/2,breaks = 20,freq = FALSE)

#add a line to show the density of X ~ exp(2)


x=seq(0,5,length=1000)
fx=2*exp(-2*x)
lines(x,fx,lwd=2,col="red")

4.3 Probability Calculation
4.3.1 Calculation from known probability distributions
Just like we can generate a random number from a known probability distribution, we can calculate the exact
probabilities associated with these distributions. The functions are similar to those used to generate random
numbers, but the first letter of the code changes to a “d” for the mass p(x) or density f(x) functions, a “p” for the
cumulative distribution function F(x), and a “q” for quantiles. For a discrete random variable X the quantile is the
smallest value x for which F(x) = P(X ≤ x) ≥ p. For a continuous random variable X the quantile is the value of
x for which F(x) = P(X ≤ x) = p.

For X ~ Binomial(size = 4, prob = 0.2).


1) Calculate P(X ≤ 1) = P(X = 0) + P(X = 1) using dbinom(x, size, prob, log = FALSE)

> dbinom(0,4,0.2) + dbinom(1,4,0.2)


[1] 0.8192
2) Calculate P(X ≤ 1) using pbinom(q, size, prob, lower.tail = TRUE, log.p = FALSE)

> pbinom(1,4,0.2)
[1] 0.8192
3) Determine x for which P(X ≤ x) ≥ 0.6 using qbinom(p, size, prob, lower.tail = TRUE, log.p = FALSE)

> qbinom(0.6,4,0.2)
[1] 1

For X ~ Poisson(lambda = 4).


1) Calculate P ( X = 5) using dpois(x, lambda, log = FALSE)

> dpois(5,4)
[1] 0.1562935
2) Calculate P(X ≥ 4) using ppois(q, lambda, lower.tail = TRUE, log.p = FALSE)

> 1-ppois(3,4)
[1] 0.5665299
3) Determine x for which P(X ≤ x) ≥ 0.9 using qpois(p, lambda, lower.tail = TRUE, log.p = FALSE)

> qpois(0.9,4)
[1] 7

For X ~ Normal(mean = 100, sd = 12).
1) Calculate f (100 ) using dnorm(x, mean = 0, sd = 1, log = FALSE)

> dnorm(100,100,12)
[1] 0.03324519
2) Calculate P(110 ≤ X ≤ 120) using pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)

> pnorm(120,100,12)-pnorm(110,100,12)
[1] 0.154538
3) Determine x for which P(X ≤ x) = 0.1 using qnorm(p, mean=0, sd=1, lower.tail = TRUE, log.p = FALSE)

> qnorm(0.1,100,12)
[1] 84.62138

For X ~ Exponential(lambda = 2).


1) Calculate f ( 3) using dexp(x, rate = 1, log = FALSE)

> dexp(3,2)
[1] 0.004957504
2) Calculate P(X ≤ 3) using pexp(q, rate = 1, lower.tail = TRUE, log.p = FALSE)

> pexp(3,2)
[1] 0.9975212
3) Determine x for which P(X ≤ x) = 0.6 using qexp(p, rate = 1, lower.tail = TRUE, log.p = FALSE)

> qexp(0.6,2)
[1] 0.4581454

4.3.2 Calculation using integration


We can calculate probabilities from density functions using the integrate() function in R. The function has several
different arguments, but we will focus on the density function to integrate, and the lower and upper limits of the
integration. Note, the limits could be infinite, denoted by Inf. Consider the following density function:

f(x) = 3x² for 0 ≤ x ≤ 1, so that

P(0.11 ≤ X ≤ 0.75) = ∫ 3x² dx (from 0.11 to 0.75) = 0.75³ − 0.11³ = 0.420544

We can calculate this probability in R as follows:


fX <- function(x) {3*x^2}
integrate(fX, lower = 0.11, upper = 0.75)
0.420544 with absolute error < 4.7e-15
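
As an additional sketch (not one of the prescribed examples) using an infinite upper limit, integrating the exponential(λ = 2) density from 1 to infinity gives P(X > 1) = e^(−2) ≈ 0.1353:
fX <- function(x) {2*exp(-2*x)}        #density of X ~ exp(2)
integrate(fX, lower = 1, upper = Inf)  #P(X > 1), approximately 0.1353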

5. Writing Functions in R
5.1 Functions in R
One of the great strengths of R is the user’s ability to add functions. The main purpose of creating a user-defined
function is to optimize the program and avoid repetition of the same code used for a specific task, thereby
automating the task.

Functions are created using the function() directive and are stored as R objects. Functions are defined by:
1. A function name with assignment to the function() directive. Function names can be almost anything,
however, using names of existing functions should be avoided.
2. The declaration of the arguments/variables assigned to the function.
3. The operations (the function body) that perform computations on the provided arguments.

The basic structure of a function is:


my.func <- function(arg1, arg2, arg3, ...) {
<commands>
return(output.object)
}

1. Function arguments (arg1, arg2, ...) are the function parameters, placed inside the parentheses and
separated with a comma. These parameters will be set to actual values each time we call the function.
2. The <commands> part describes what the function will do with the function parameters/arguments.
3. After doing these tasks, the return() function produces the output of interest. If this is omitted, output from
the last expression evaluated is returned. We can use print() instead of return() within our function,
although print() can be used as part of the function call.

Example 1
Create a function to calculate the circumference of a circle with a known radius using the formula 2πr. In this
example the function has only one parameter, namely r. After defining the function, we will call it with the radius
equal to 2, i.e., with the argument {2}.

#using return
circumference <- function(r){
result <- 2*pi*r
return(result)
}

circumference(2)
[1] 12.56637

#using print inside the function


circumference <- function(r){
result <- 2*pi*r
print(result)
}

circumference(2)
[1] 12.56637

#using print as part of the function call


circumference <- function(r){
result <- 2*pi*r
}

print(circumference(2))
[1] 12.56637

Example 2

Write a function to calculate the following: f(x, y) = √(x² + y²) for x = 1, 2, …, 10, y = 1, 2, …, 10 and x = y.

myfunction1 <- function(x,y){


return(sqrt(x^2+y^2))
}

myfunction1(1:10,1:10)
[1] 1.414214 2.828427 4.242641 5.656854 7.071068 8.485281 9.899495
[8] 11.313708 12.727922 14.142136

Example 3

Write a function to calculate f(x, y) = √(x² + y²) and f(x, y) = √(xy) for f(2, 4) and f(3, 2). Use the list()

function to combine the output of the two formulae in the return() function. Assign the function to an object,
assess the structure, and call the output for each component of the list.

myfunction2 <- function(x,y){


out1 <- sqrt(x^2+y^2)
out2 <- sqrt(x*y)
return(list("Square root of sum of squares" = out1,"Square root of product" = out2))
}

output = myfunction2(c(2,3),c(4,2))

str(output)
List of 2
$ Square root of sum of squares: num [1:2] 4.47 3.61
$ Square root of product : num [1:2] 2.83 2.45

output$`Square root of sum of squares`


[1] 4.472136 3.605551

output$`Square root of product`


[1] 2.828427 2.449490
5.2 Optimising a Function in R
Often we would like to find the global maximum or minimum of a function in a specific domain. This can easily
be approximated in R using the optimize() function. The function searches the interval from lower to upper for a
minimum or maximum of the function f with respect to its first argument. We will use the following arguments
of the function:
• f = the function to be optimized
• lower = the lower end point of the interval to be searched
• upper = the upper end point of the interval to be searched
• maximum = logical argument to maximize or minimize (default=minimize)

The function produces a list with two components: the minimum (or maximum), which gives the location of the
minimum (or maximum), and the objective, which gives the value of the function at that point.

Example 4
Consider the following equation: f(x) = −x² + 2x + 3. Create a function of this equation, use the curve() function
to plot the equation for −2 ≤ x ≤ 4, and use the optimize() function to find the global maximum.

equation <- function(x){-1*x^2+2*x+3}


curve(equation, from=-2, to=4, xlab="x", ylab="y")

max_eq <- optimize(equation, lower = -2, upper = 4, maximum = TRUE)


max_eq
$maximum
[1] 1
$objective
[1] 4
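
Since the default is minimization, the maximum argument can simply be omitted. As an additional sketch (not one of the prescribed examples), minimizing g(x) = (x − 2)² over [0, 5] gives a minimum located near x = 2 with an objective value near 0:
g <- function(x){(x - 2)^2}
optimize(g, lower = 0, upper = 5)  #maximum argument omitted, so the function is minimized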

5.3 The For Loop in R
A for loop allows you to loop over values in a vector or list of numbers. Although many functions directly apply
code to all elements in a vector, the for loop is still a very powerful programming feature in R. In a for loop we
request that an index i takes on a sequence of values, and that one or more commands are executed as many times
as there are values in the sequence.

Example 5
For the values i = 1 to 5, store the square of i in a vector x and display the result. To build this vector, we first need
to initialize it, either by creating a vector of length 5 containing, say, zeros, or by just creating an empty object.

x = numeric()
for(i in 1:5){
x[i] <- i^2
}
x
[1] 1 4 9 16 25

Example 6
Print a phrase where the year changes for each phrase from 2015 to 2023.

for(year in 2015:2023){
print.noquote(paste("The year is",year))
}

[1] The year is 2015


[1] The year is 2016
[1] The year is 2017
[1] The year is 2018
[1] The year is 2019
[1] The year is 2020
[1] The year is 2021
[1] The year is 2022
[1] The year is 2023

Example 7

Write a function to calculate f(x, y) = √(x² + y²) for all x = 1, 2, …, 5, y = 1, 2, …, 5. This example is similar to
Example 2, but here we will create the function values for a 5 × 5 matrix, looping over all values of x and y.

myfunction <- function(x,y){


out <- matrix(0,nrow=x,ncol=y)
for(i in 1:x){
for(j in 1:y){
out[i,j] <- sqrt(i^2+j^2)
}
}
print(out)
}
myfunction(5,5)
[,1] [,2] [,3] [,4] [,5]
[1,] 1.414214 2.236068 3.162278 4.123106 5.099020
[2,] 2.236068 2.828427 3.605551 4.472136 5.385165
[3,] 3.162278 3.605551 4.242641 5.000000 5.830952
[4,] 4.123106 4.472136 5.000000 5.656854 6.403124
[5,] 5.099020 5.385165 5.830952 6.403124 7.071068

5.4 The If-Else Statement in R


Sometimes we want to do one thing if a condition is true, and a different thing if the condition is false. In R, the
if-else statement tells the program to run a block of code if the conditional statement is true, and a different block
of code if the conditional statement is false.

Example 8
Calculate the square root of a value x. We can only calculate the answer if x ≥ 0.
x = -5
if(x >= 0){
print(sqrt(x))
} else {
print.noquote("The value cannot be negative")
}
Example 9
In a local neighbourhood, two soccer teams, A and B, play a match every Saturday. The following data frame
shows the results for the past 4 weeks:
Date Number of goals scored by Team A Number of goals scored by Team B
4 March 2023 1 1
11 March 2023 1 2
18 March 2023 3 0
25 March 2023 2 1

Write a function to show the results of each match, i.e., if Team A won, if Team B won, or if they tied. Combine
the original data and the result in a data frame.

scores <- function(TeamA,TeamB){


Result <- rep(0,length(TeamA))
for (i in 1:length(TeamA)) {
if (TeamA[i] > TeamB[i]){
Result[i] <- "Team A won"
} else if (TeamA[i] < TeamB[i]){
Result[i] <- "Team B won"
} else {
Result[i] <- "Team A & B tied"
}
}
return(as.data.frame(cbind(TeamA,TeamB,Result)))
}

score_data <- data.frame(matrix(c(1,1,3,2,1,2,0,1),ncol = 2,byrow = FALSE,


dimnames=list(c("4/3/23","11/3/23","18/3/23","25/3/23"), c("TeamA","TeamB"))))
scores(score_data$TeamA,score_data$TeamB)
TeamA TeamB Result
1 1 1 Team A & B tied
2 1 2 Team B won
3 3 0 Team A won
4 2 1 Team A won

6. Expectation and Simulation
6.1 Expected Value and Variance
For discrete random variables, we can write code in R to create the entire probability mass function, which consists
of the possible values of x and its corresponding probability of occurrence. We can then apply the formulae for
expectation to the pmf to calculate the mean and the variance. For example, if a random variable X follows a
binomial distribution with parameters n = 3 and p = 0.4, the expected value and variance are
E(X) = 3 × 0.4 = 1.2 and Var(X) = 3 × 0.4 × 0.6 = 0.72, respectively, and the pmf, p(x) = C(3, x)(0.4)^x(0.6)^(3−x), is as follows:

x      0       1       2       3
p(x)   0.216   0.432   0.288   0.064

1) Use R to calculate the distribution of X.


x=0:3
px=choose(3,x)*0.4^x*0.6^(3-x)
> px
[1] 0.216 0.432 0.288 0.064

2) Use R to calculate the expected value of X using E(X) = Σ x·p(x).

EX=sum(x*px)
> EX
[1] 1.2

3) Use R to calculate the variance of X using E(X²) = Σ x²·p(x) and Var(X) = E(X²) − [E(X)]².

EX2=sum(x^2*px)
> EX2
[1] 2.16
VarX=EX2-EX^2
> VarX
[1] 0.72
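
As a quick check (an addition to the notes), the same pmf can be obtained from the built-in dbinom() function, and the results agree with the formulas E(X) = np and Var(X) = np(1 − p):
px2 = dbinom(0:3, 3, 0.4)            #same probabilities as px above
sum((0:3)*px2)                       #1.2 = 3*0.4
sum((0:3)^2*px2) - sum((0:3)*px2)^2  #0.72 = 3*0.4*0.6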

For continuous random variables, we have already seen how we can use the integrate() function to calculate
probabilities from a pdf. We can also use this function to calculate expected values. What we have not discussed
previously is that the integrate() function returns a list with several entries. If we assign the integrate() function to an
object, say EX, we can see the list by using the str() function, in this case str(EX). The first two entries are the
most important for this practical, namely the value of the integral (e.g., EX$value) and the error (e.g.,
EX$abs.error).

Consider the following pdf of a random variable X: f(x) = 3x² for 0 ≤ x ≤ 1

The expected value and variance are calculated through integration as follows:
E(X) = ∫₀¹ x·f(x) dx = ∫₀¹ x·3x² dx = ∫₀¹ 3x³ dx = (3/4)x⁴ |₀¹ = 3/4 = 0.75

E(X²) = ∫₀¹ x²·f(x) dx = ∫₀¹ x²·3x² dx = ∫₀¹ 3x⁴ dx = (3/5)x⁵ |₀¹ = 3/5 = 0.6

Var(X) = E(X²) − [E(X)]² = 3/5 − (3/4)² = 0.0375

1) Use the integrate() function to calculate the expected value of X.


fX <- function(x) {3*x^3}
EX=integrate(fX, lower = 0, upper = 1)
> EX
0.75 with absolute error < 8.3e-15
> EX$value
[1] 0.75

2) Use the integrate() function to calculate the variance of X.


fX <- function(x) {3*x^4}
EX2=integrate(fX, lower = 0, upper = 1)
> EX2
0.6 with absolute error < 6.7e-15
> EX2$value
[1] 0.6
VarX=EX2$value-EX$value^2
> VarX
[1] 0.0375
6.2 Joint Distributions
Suppose that X has a binomial distribution with 3 trials and probability of success 0.3, and Y has a binomial
distribution with 4 trials and probability of success 0.1. X and Y are independent random variables.

1) Find the joint probability mass function of X and Y.


Since X and Y are independent, we can multiply the mass functions using vector (matrix) multiplication
%*%.
pxy=dbinom(0:3,3,0.3)%*%t(dbinom(0:4,4,0.1))
rownames(pxy)=0:3
colnames(pxy)=0:4
> pxy
0 1 2 3 4
0 0.2250423 0.1000188 0.0166698 0.0012348 3.43e-05
1 0.2893401 0.1285956 0.0214326 0.0015876 4.41e-05
2 0.1240029 0.0551124 0.0091854 0.0006804 1.89e-05
3 0.0177147 0.0078732 0.0013122 0.0000972 2.70e-06

2) Find the marginal distributions of X and Y.


px=rowSums(pxy)
> px
0 1 2 3
0.343 0.441 0.189 0.027
py=colSums(pxy)
> py
0 1 2 3 4
0.6561 0.2916 0.0486 0.0036 0.0001

3) Find the probability that X + Y = 3.


values=expand.grid(x=0:3,y=0:4)
which(values$x+values$y==3)
[1] 4 7 10 13
sum(pxy[which(values$x+values$y==3)])
[1] 0.0954945
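
As an additional check (not part of the original questions), a valid joint pmf must sum to 1, and the marginal distribution of X must agree with its binomial mass function:
sum(pxy)                  #the joint probabilities sum to 1
px - dbinom(0:3, 3, 0.3)  #differences are (numerically) zero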

6.3 Simulation
In practice we often need to create a model that is a realistic replica of a real-world phenomenon. In some cases
it is not possible to analyse the phenomenon mathematically. With the increase in computing power, it is possible
to model real-world phenomena as realistically as possible using simulation, where we simulate data that
represent the structure as well as the restrictions of the phenomenon. We can then use the simulated data to
estimate parameters, compare models, calculate probabilities, and illustrate concepts and the behaviour of a
system or a model.

6.3.1 Estimation through simulation


Example 1
A random variable X ~ bin(3, 0.4), i.e., E(X) = 1.2 and Var(X) = 3 × 0.4 × 0.6 = 0.72. Generate 1000 binomial

random numbers using the built-in functions and estimate the mean and variance.
x=rbinom(1000,3,0.4)
> mean(x)
[1] 1.165
> var(x)
[1] 0.7164915

Example 2
Simulate 10000 values for 3 independent uniform(0,1) random variables, U1, U2 and U3, and estimate the
following:

1) E(U1 + U2 + U3) and E(U1) + E(U2) + E(U3), and compare your answers.

Since E(Σ Xi) = Σ E(Xi) (summing over i = 1, …, k), the two answers should be the same.
U1=runif(10000)
U2=runif(10000)
U3=runif(10000)
> mean(U1+U2+U3)
[1] 1.504518
> mean(U1)+mean(U2)+mean(U3)
[1] 1.504518
Note that the two answers are the same.

2) Var(U1 + U2 + U3) and Var(U1) + Var(U2) + Var(U3), and compare your answers.

For i = 1, 2, …, k independent random variables Xi, Var(Σ Xi) = Σ Var(Xi), so the two answers should
be the same.
> var(U1+U2+U3)
[1] 0.2515744
> var(U1)+var(U2)+var(U3)
[1] 0.2504313

The two answers are the same to 2 decimal places. The reason they are not exactly the same is that the 3
variables, through simulation, still have very small pairwise covariance/correlation coefficients, therefore:
Var(U1 + U2 + U3) = Var(U1) + Var(U2) + Var(U3) + 2Cov(U1, U2) + 2Cov(U1, U3) + 2Cov(U2, U3)

3) E(√(U1 + U2 + U3))
> mean(sqrt(U1+U2+U3))
[1] 1.207451

Example 3
In inference we can derive the parameters of the sampling distribution of the mean. If a random variable
X ~ N(µ, σ²) then X̄ ~ N(µ, σ²/n). If we want to illustrate this concept, we can simulate a large number of
random samples of size n from the normal distribution and calculate the average value of X for each sample. We
can then plot the distribution of all these sample averages and calculate its mean and standard deviation. Let
X ~ N(100, 12²). For a sample of n = 25, X̄ ~ N(100, 12²/25 = 2.4²). We can show this result through simulation
by repeating the sampling, say, 10000 times, and calculating 10000 averages.

output = NULL #initialize the output vector
for (i in 1:10000){
output[i] = mean(rnorm(25,100,12))
}

> mean(output)
[1] 100.0084
> sd(output)
[1] 2.397492
> hist(output)
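
A possible alternative to the explicit for loop (an addition to the notes) is the replicate() function, which evaluates an expression a given number of times and collects the results; the estimates should again be close to 100 and 2.4:
output2 = replicate(10000, mean(rnorm(25, 100, 12)))
mean(output2)  #close to 100
sd(output2)    #close to 2.4
hist(output2)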

6.3.2 Simulate a conditional distribution
Recall Question 73 of Tutorial 4:
A fair coin is tossed n times and the number of heads N counted. The coin is then tossed N more times. Find the
expected total number of heads generated by this process.

We derived this conditional distribution from the distributions of variable N (the number of heads in the first n
tosses) and variable X (the number of heads in the N additional tosses, conditional on N). Therefore, N ~ binomial(n, p)
and X ~ binomial(N, p).
E(N + X) = E[E(N + X | N)] = E[N + E(X | N)] = E(N + Np) = (1 + p)E(N) = np(1 + p)

Write a function to simulate this for n = 10 and p = 0.2, therefore E(N + X) = 10 × 0.2 × (1 + 0.2) = 2.4. Do the

simulation 1000 times.

MyFunction=function(n,p,sim){
output=NULL
for (i in 1:sim) {
N=sum(sample(0:1, size = n, prob = c(1-p,p), replace = TRUE))
X=sum(sample(0:1, size = N, prob = c(1-p,p), replace = TRUE))
output=rbind(output,cbind(N,X,N+X))
}
return(output)
}

EX=MyFunction(n=10,p=0.2,sim=1000)
> dim(EX) #code to see the dimensions of the output from the function
[1] 1000 3
> mean(EX[,3])
[1] 2.445
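
An equivalent sketch (an addition to the notes) replaces the sample() calls with rbinom(), since the number of heads in n tosses with success probability p is binomial(n, p); note that rbinom(1, 0, p) correctly returns 0 when N = 0.
MyFunction2 = function(n, p, sim){
  output = NULL
  for (i in 1:sim) {
    N = rbinom(1, n, p)  #number of heads in the first n tosses
    X = rbinom(1, N, p)  #number of heads in the N additional tosses
    output = rbind(output, cbind(N, X, N + X))
  }
  return(output)
}
mean(MyFunction2(n = 10, p = 0.2, sim = 1000)[,3])  #estimate of E(N + X), close to 2.4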
