EASTERN AFRICA STATISTICAL TRAINING CENTRE
INTRODUCTION TO DATA ANALYSIS WITH R
BACHELOR DEGREE IN
DATA SCIENCE,
OFFICIAL STATISTICS,
BUSINESS STATISTICS AND ECONOMICS
Edwin Tito Magoti
Consultant: Training, Research and Data Analytics
Eastern Africa Statistical Training Centre
P.O. Box 35103, Dar Es Salaam
(+255) 766151460
(+255) 737825292
edwin.magoti@eastc.ac.tz
edwintitomagoti@gmail.com
Data Structures in R and RStudio
Getting Started with R: Presentation Outline
1 Data types
▪ Numeric
▪ Character/Strings
▪ Logical
▪ Complex
2 Objects in R
▪ Vectors
▪ Matrices
▪ Factors
▪ Lists
▪ Data frames
1. Data Types
o There are at least five (05) data types that can be
assigned to Variables in R.
o They are:
• Numeric data type
• Character data type
• Logical data type
• Complex data type
• Raw data type
o For this presentation, we consider the first
three types.
1.1. Data Types: Numeric data
• Numeric data are real numbers.
• The function is.numeric( ) is used to determine
whether data is a real number or not.
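A minimal sketch of checking for numeric data. The object names height and households are hypothetical and used only for illustration.

```r
# Hypothetical objects used only for illustration
height <- 1.75          # a real number
households <- 42        # whole numbers typed this way are also stored as numeric

is.numeric(height)      # TRUE
is.numeric(households)  # TRUE
is.numeric("1.75")      # FALSE: the quotes make this a character value
```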
1.2. Data Types: Character Data
• Character data are text strings.
• They are written by enclosing the text in double quotes " "
(single quotes ' ' are also accepted).
• The command is.character() is used to determine
whether data is a character or not.
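A short sketch of character data, assuming a hypothetical object called city. Note that a number written inside quotes is stored as text.

```r
# Hypothetical object used only for illustration
city <- "Dar Es Salaam"

is.character(city)      # TRUE
is.character(2024)      # FALSE: a number without quotes is numeric
is.character("2024")    # TRUE: digits inside quotes are text
```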
1.3. Data Types: Logical Data
o In the R programming language, logical data refers
to values that take one of the two Boolean values,
either TRUE or FALSE.
o Also, the abbreviation NA, which stands for Not
Available (a missing value), is treated as a logical value by default.
o The command is.logical() is used to determine
whether data is logical or not.
o It should be noted that R is case-sensitive; thus, only
upper case should be used when writing logical
values (TRUE, FALSE, NA).
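A minimal sketch of logical data, assuming a hypothetical object called passed. The last comment describes what happens with lower-case spellings; those lines are deliberately not run.

```r
# Hypothetical object used only for illustration
passed <- TRUE

is.logical(passed)    # TRUE
is.logical(NA)        # TRUE: a bare NA is stored as a logical value
is.logical("TRUE")    # FALSE: the quotes make this a character value
5 > 3                 # a comparison returns a logical value, here TRUE

# Typing true or True (not upper case) stops with an "object not found"
# error, unless an object with that name has been created.
```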
1.4. Data Types: Logical Data
❖ Case sensitive: R is case-sensitive, so the word True is not
a reserved word for a logical value.
❖ Only the upper-case forms TRUE and FALSE should be used
when referring to logical values.
1.5. Data Types: Data Coercion
• Data coercion is the process of changing the data
type of an object.
• R provides room for changing data types.
o If such a need arises, we use a function of the form
as.<datatype>(), for example as.numeric(),
as.character(), or as.logical().
• Note that, if you need to store a new data
type in R while keeping the previous one, a
new object must be created.
o That is, a new assignment must be done.
o Assigning the coerced result to the same
object name will replace the existing
information.
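The points above can be sketched as follows. The object names age_text and age_number are hypothetical; the example assumes ages were read in as text.

```r
# Hypothetical data: ages that arrived as text
age_text <- c("21", "35", "48")
class(age_text)                      # "character"

# Coerce into a NEW object, so the original is kept
age_number <- as.numeric(age_text)
class(age_number)                    # "numeric"
class(age_text)                      # still "character"

# Re-using the same name replaces the existing information
age_text <- as.numeric(age_text)
class(age_text)                      # now "numeric"
```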
1.5. Data Types: Data Coercion
❖ When importing data in R, each variable may take a default data
type.
❖ A function of the form is.<datatype>(), such as is.numeric(), helps in
checking the data type of a particular variable/object.
❖ R provides room for changing data types. If such a
need arises, a function of the form as.<datatype>() is used.
❖ Coercion is therefore the process of changing the data type
of an object.
❖ Coercion is done for predefined objects.
❖ Note that, if you need to store a new data type in R
while keeping the previous one, a new object must be
created (a new assignment must be done).
2. R Objects
▪ R objects are simply data structures that can
be stored in R.
▪ R can store various data structures (objects)
including but not limited to:
o Vectors
o Factors
o Matrices
o Lists
o Data frames
✓ For this presentation, we take a look at Vectors,
Factors, and Data Frames.
2.1. VECTORS
o A vector is a one-dimensional ordered collection of
data of the same type.
o The data may be numeric, character, logical, complex,
or raw.
o To create a vector with more than one element, the
function c( ) is used; it combines (concatenates) the elements
into a vector.
o The function is.vector(<object>) is used to find out
whether an object is a vector or not.
2.1. VECTORS: Creating Vectors
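A minimal sketch of creating vectors with c( ). All object names and values are hypothetical. The last lines show that mixing types in one vector makes R convert every element to a single common type.

```r
# Hypothetical vectors used only for illustration
ages     <- c(23, 35, 41, 29)            # numeric vector
crops    <- c("maize", "rice", "beans")  # character vector
employed <- c(TRUE, FALSE, TRUE)         # logical vector

is.vector(ages)       # TRUE

# A vector holds one data type only, so mixed input is converted
mixed <- c(1, "a", TRUE)
class(mixed)          # "character"
```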
2.1. VECTORS: Length, Class and Structure
o The dimension (number of elements) of a vector is called its length.
• This can be found by using the function
length(<vector_object>)
o You can call for the class (data type) of a vector
by using the class function defined as:
class(<vector_object>).
o The structure of the vector can also be assessed
using the function syntax str(<object>).
▪ This is useful in displaying:
• Data type,
• Dimension and
• Contents of a vector.
2.1. VECTORS: Length, Class and Structure
❖ In examining the length, data type, and
structure of a vector, the following commands
may be used
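A short sketch of these three commands, assuming a hypothetical numeric vector called yield.

```r
# Hypothetical vector used only for illustration
yield <- c(2.5, 3.1, 4.0, 3.6)

length(yield)   # 4
class(yield)    # "numeric"
str(yield)      # prints: num [1:4] 2.5 3.1 4 3.6
```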
2.1. VECTORS: Creating Vectors Using Colon, :
• Alternatively, numeric vectors containing consecutive integers can be
created using a colon, :, between the two integers.
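A minimal sketch of the colon operator. The object names are hypothetical.

```r
one_to_ten <- 1:10    # 1 2 3 4 5 6 7 8 9 10
countdown  <- 5:1     # 5 4 3 2 1 (the sequence runs downwards)

class(one_to_ten)     # "integer"
length(one_to_ten)    # 10
```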
2.1. VECTORS: Creating Vectors Using seq() function
▪ We can also create a vector as a sequence of numbers
using the function seq(from, to, by) or seq(from, by,
length.out),
▪ where:
- from is the value the sequence starts at.
- to is the value it finishes at.
- by is an optional argument that gives the step the
sequence changes by; its default is ±1 unless the
length option is used.
- length.out (which may be abbreviated to length) is an optional
argument that gives the required length of the sequence.
2.1. VECTORS: Creating Vectors Using seq() function
• Illustration on creating Vectors (sequences) in R
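One possible illustration, with hypothetical object names and values chosen only to show each argument of seq().

```r
# from, to and by
evens <- seq(from = 0, to = 10, by = 2)           # 0 2 4 6 8 10

# from, by and a required length
halves <- seq(from = 1, by = 0.5, length.out = 5) # 1.0 1.5 2.0 2.5 3.0

# a negative step gives a decreasing sequence
down <- seq(from = 10, to = 1, by = -3)           # 10 7 4 1

# from, to and a required length (the step is worked out by R)
quarters <- seq(from = 0, to = 1, length.out = 5) # 0.00 0.25 0.50 0.75 1.00
```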
2.1. VECTORS: Creating Vectors Using rep() function
o On the other hand, the function rep( ) is used when
there is a need to repeat an entry several times.
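A minimal sketch of rep(), with hypothetical object names. The times argument repeats the whole input, while each repeats every element in turn.

```r
all_yes  <- rep("Yes", times = 3)          # "Yes" "Yes" "Yes"
by_times <- rep(c("A", "B"), times = 2)    # "A" "B" "A" "B"
by_each  <- rep(c("A", "B"), each = 2)     # "A" "A" "B" "B"
```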
2.1. VECTORS: Naming Vectors
o R provides room for naming the elements in a vector.
o Suppose we create a vector, (say Production), that
contains the production of maize from five maize-
producing regions in Tanzania.
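A sketch of how such a vector might be named. The region names and production figures below are placeholders chosen for illustration, not actual statistics.

```r
# Hypothetical production figures (placeholder values, not real data)
Production <- c(520, 610, 480, 390, 450)

# Attach a name to each element with names()
names(Production) <- c("Mbeya", "Iringa", "Ruvuma", "Rukwa", "Songwe")

Production          # values are now printed under their region names
names(Production)   # returns the five names
```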
2.1. VECTORS: Indexing/extracting Vectors
o When working with vectors, we may be interested
in extracting some of the elements in a vector.
✓ Indexing provides a convenient way to
achieve that.
o We use square brackets, [ ], to tell R that we are indexing;
the number in the square brackets gives the
position of the element.
2.1. VECTORS: Indexing Vectors
o Illustration….
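One possible illustration, using a hypothetical named vector with placeholder values. Positions in R start at 1.

```r
# Hypothetical named vector (placeholder values, not real data)
Production <- c(Mbeya = 520, Iringa = 610, Ruvuma = 480,
                Rukwa = 390, Songwe = 450)

Production[2]                  # the second element (Iringa)
Production[c(1, 3)]            # the first and third elements
Production[-1]                 # everything except the first element
Production["Ruvuma"]           # extraction by name
Production[Production > 500]   # elements that satisfy a condition
```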
2.2. FACTORS
▪ Factors are R objects that are created from a
vector.
▪ A factor is a special vector used for storing
categorical data such as marital status, education
level, occupation, etc.
o A factor stores the vector along with the distinct values of the
elements in the vector as labels.
▪ Factors are created using the function
factor(<vector_object>)
▪ The different categories are called levels and they are
assigned the integer codes 1, 2, 3, …, n.
▪ The function nlevels(<factor_object>) gives the count of levels.
o The levels are (by default) sorted into alphabetical order.
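A minimal sketch of creating a factor, assuming a hypothetical character vector called sex with the categories "M" and "F".

```r
# Hypothetical categorical data
sex <- c("M", "F", "F", "M", "F")

sex_factor <- factor(sex)

is.factor(sex_factor)     # TRUE
levels(sex_factor)        # "F" "M" (alphabetical order by default)
nlevels(sex_factor)       # 2
as.integer(sex_factor)    # 2 1 1 2 1 (the codes behind the labels)
```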
2.2. FACTORS
o Using the str() function with a factor variable
(object) gives information about:
▪ Data type
▪ Number of levels
▪ Labels
▪ Values (the integer codes)
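A short sketch, assuming a hypothetical factor built from the categories "M" and "F".

```r
# Hypothetical factor used only for illustration
sex_factor <- factor(c("M", "F", "F", "M", "F"))

# One line of output: the type, the number of levels,
# the level labels, and the integer codes of the elements
str(sex_factor)
```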
2.2. FACTORS
❖ In the previous example, we can see that the
categories F and M have been assigned the values
(integer codes) 1 and 2.
❖ We can change the order of the levels using the function
factor(<vector object>, levels=<levels vector>)
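A minimal sketch of reordering levels, assuming a hypothetical vector sex with the categories "M" and "F".

```r
# Hypothetical categorical data
sex <- c("M", "F", "F", "M", "F")

# Put M first and F second instead of the alphabetical default
sex_reordered <- factor(sex, levels = c("M", "F"))

levels(sex_reordered)       # "M" "F"
as.integer(sex_reordered)   # 1 2 2 1 2
```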
2.2. FACTORS
❖ We can attach value labels to the categories of a categorical
variable using the function:
factor(<vector object>, levels=<levels vector>,
labels=<labels vector>)
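A sketch of attaching labels, with hypothetical data and label names. The argument is written in lower case (labels) because R is case-sensitive, and the labels are matched to the levels in the order given.

```r
# Hypothetical categorical data
sex <- c("M", "F", "F", "M", "F")

sex_labelled <- factor(sex,
                       levels = c("M", "F"),
                       labels = c("Male", "Female"))

levels(sex_labelled)         # "Male" "Female"
as.character(sex_labelled)   # "Male" "Female" "Female" "Male" "Female"
```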
2.2. FACTORS
• R gives room for specifying the order of the categories in
the case of ordered categorical variables.
• To achieve this, we set the ordered argument to
TRUE:
factor(<vector object>, levels=<levels vector>,
labels=<labels vector>, ordered=TRUE)
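A minimal sketch of an ordered factor, assuming hypothetical education levels. The labels argument is left out here, so the levels themselves serve as labels.

```r
# Hypothetical ordered categories
education <- c("Primary", "Tertiary", "Secondary", "Primary")

education_ord <- factor(education,
                        levels = c("Primary", "Secondary", "Tertiary"),
                        ordered = TRUE)

is.ordered(education_ord)              # TRUE
education_ord[1] < education_ord[2]    # TRUE: Primary is below Tertiary
max(education_ord)                     # Tertiary
```

Comparisons such as < and functions such as max() are meaningful only because the factor is ordered.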
2.2. FACTORS
❑ Differences between nominal and ordered
categorical variables
❖ R gives room for specifying the order of categorical
variables (where necessary), such as Likert scales.
❖ To achieve this, we set the ordered argument to TRUE; leaving it
out, as below, gives a nominal (unordered) factor:
factor(<vector object>, levels=<levels vector>,
labels=<labels vector>)
2.3. MATRICES
❑ Details
2.4. LISTS
❑ Details
2.5. DATA FRAME
o A data frame is a list of vectors and/or factors of the
same (equal) length.
o Objects (variables) in the data frame can be of
different data types, but each column should
contain elements of the same data type.
- For instance:
✓ The first column can be numeric,
✓ The second column can be character
and
✓ The third column can be logical.
o Each row of a data frame represents an observation.
2.5. DATA FRAME: Creating data frame
❖ To create a data frame in R, the function data.frame( ) is used:
data.frame(<vector1>, <vector2>,…, <vectorn>)
o The command is.data.frame(<object>) is used to
determine whether an object is a data frame.
o The dimensions command dim(<object>) is used to
explore the number of rows and columns.
o The number of rows can be found using the
command nrow(<dataframe>).
o The number of columns can be found using the
command ncol(<dataframe>).
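A minimal sketch of these commands. The data frame staff and its contents are hypothetical and used only for illustration.

```r
# Hypothetical vectors of equal length
name     <- c("Asha", "Juma", "Neema", "Baraka")
age      <- c(23, 35, 41, 29)
employed <- c(TRUE, FALSE, TRUE, TRUE)

# Each vector becomes a column; each row is one observation
staff <- data.frame(name, age, employed)

is.data.frame(staff)   # TRUE
dim(staff)             # 4 3 (rows, then columns)
nrow(staff)            # 4
ncol(staff)            # 3
```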
2.5. DATA FRAME: Creating a data frame
• After creating the data frame, the following commands can
be used to explore the data frame
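A sketch of some commonly used exploration commands. The choice of commands here is an assumption, and the data frame staff is hypothetical.

```r
# Hypothetical data frame used only for illustration
staff <- data.frame(name     = c("Asha", "Juma", "Neema", "Baraka"),
                    age      = c(23, 35, 41, 29),
                    employed = c(TRUE, FALSE, TRUE, TRUE))

str(staff)           # type and first values of every column
names(staff)         # column (variable) names
head(staff, n = 2)   # the first two rows
summary(staff)       # a summary of each column
```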
2.5. DATA FRAME: Combining Data Frames
• Data frames can be combined by binding them
together.
• We can bind data frames:
❖ Column-wise (column bind)
- If the aim is to add more variables,
using the function cbind( ).
❖ Row-wise (row bind)
- If the aim is to add observations, using
the function rbind( ).
▪ Binding vectors forms a matrix.
▪ Binding a data frame with another object (such as a factor
or vector) creates a data frame.
▪ Binding data frames creates a data frame.
2.5. DATA FRAME: Combining Data Frames - Column
Added Variables/Objects
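A minimal sketch of column binding. All object names and values are hypothetical.

```r
# Hypothetical data frame
staff <- data.frame(name = c("Asha", "Juma", "Neema", "Baraka"),
                    age  = c(23, 35, 41, 29))

# A new variable with one value per existing row
region <- c("Mbeya", "Iringa", "Ruvuma", "Rukwa")

# Binding a data frame with a vector gives a data frame
staff_more <- cbind(staff, region = region)
names(staff_more)           # "name" "age" "region"
is.data.frame(staff_more)   # TRUE

# Binding plain vectors gives a matrix instead
score_a <- c(1, 2, 3)
score_b <- c(4, 5, 6)
scores  <- cbind(score_a, score_b)
is.matrix(scores)           # TRUE
```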
2.5. DATA FRAME: Combining Data Frames - Row
• Row bind essentially adds more observations to
the existing data frame.
• It is needed when we have data on the same variables
in different data frames (conceptually, different
files).
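A minimal sketch of row binding, assuming two hypothetical data frames that hold the same variables under the same column names.

```r
# Two hypothetical data frames with identical columns
batch1 <- data.frame(name = c("Asha", "Juma"),
                     age  = c(23, 35))
batch2 <- data.frame(name = c("Neema", "Baraka"),
                     age  = c(41, 29))

# Stack the observations of batch2 under those of batch1
all_staff <- rbind(batch1, batch2)

nrow(all_staff)    # 4
names(all_staff)   # "name" "age"
```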
2.5. DATA FRAME: Indexing/sub-setting Data frame
• As noted earlier on, indexing or sub-setting
is the process of selecting some of the
elements of an object.
• Depending on the purpose, this can be
achieved by:
✓ Giving the position of the element in
square brackets after the name of the
object
▪ both the row and the column must be specified,
as <object>[row, column]
✓ Using the dollar ($) sign.
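A minimal sketch of indexing with square brackets, using a hypothetical data frame called staff. Leaving the row or the column position empty selects all rows or all columns.

```r
# Hypothetical data frame used only for illustration
staff <- data.frame(name     = c("Asha", "Juma", "Neema", "Baraka"),
                    age      = c(23, 35, 41, 29),
                    employed = c(TRUE, FALSE, TRUE, TRUE))

staff[3, 2]                    # row 3, column 2: 41
staff[3, "age"]                # the same cell, using the column name
staff[2, ]                     # the whole of row 2
staff[, 2]                     # the whole of column 2: 23 35 41 29
staff[1:2, c("name", "age")]   # rows 1 to 2 of two chosen columns
```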
2.5. DATA FRAME: Indexing/sub-setting Data frame
✓ The $ is useful in indexing/extracting
variables.
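A short sketch of the $ sign, using a hypothetical data frame called staff. The extracted variable is an ordinary vector, so it can be passed to other functions.

```r
# Hypothetical data frame used only for illustration
staff <- data.frame(name = c("Asha", "Juma", "Neema", "Baraka"),
                    age  = c(23, 35, 41, 29))

staff$age           # 23 35 41 29
mean(staff$age)     # 32
length(staff$name)  # 4
```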
End of Presentation II
Next Lesson:
Importing and Exporting Data,
Combining Datasets