Advanced R Data Analysis
Training
Trainer: Dr. Ghazaleh Babanejad
Website: www.tertiarycourses.com.my
Email: malaysiacourses@tertiaryinfotech.com
About the Trainer
Dr. Ghazaleh Babanejad received her PhD from the Faculty of Computer Science and
Information Technology at Universiti Putra Malaysia. Her PhD thesis covers
recommender systems in the field of skyline queries over dynamic and incomplete
databases. She also works in data science as a trainer and data scientist, and
has worked on machine learning and process mining projects. She holds several
international certificates, including Practical Machine Learning (Johns Hopkins
University), Mining Massive Datasets (Stanford University), Process Mining
(Eindhoven University), Hadoop (University of San Diego) and MongoDB for DBAs
(MongoDB Inc). She has more than 5 years of experience.
Agenda
Module 1: R Data Analysis Packages
- Data Analysis Components
- Data Analysis Steps
- R Data Analysis Packages
Module 2: Obtaining Data
- Reading Data from CSV file
- Reading Data from JSON file
- Reading Data from XML file
- Reading Data from Web
- Reading Data from APIs
Module 3: Data Exploration and Cleaning
- Exploring data
- Imputing missing data
- Dealing with Outliers
Module 4: Data Preprocessing
- Selecting columns and rows
- Calculated columns
- Arranging data
- Chain operations
- Joins
- Summarize and group by
Module 5: Data Reshaping
- Splitting and merging columns
- Rearranging and reorienting columns
Module 6: Data Visualization
- ggplot2 syntax and analysis
Module 7: Advanced Analysis (optional)
- Map function
- User defined functions & logical testing
- pmap function
Prerequisite
Basic knowledge of R is assumed
Exercise Files
Download the exercise files from
https://github.com/rkrtiwari/rAdvanced
Module 1
Getting Started
Data Analysis Steps
• Data Collection
• Data Processing
• Data Cleaning
• Data Visualization
• Data Product
R Data Analysis Packages
Data Manipulation
• dplyr: Data manipulation tasks
• tidyr: Reshape data
• mice: Missing data imputation
Data Analysis
• DataExplorer: Visualize variables
Data Visualization
• ggplot2: Powerful visualization
• shiny: Interactive data visualization
• VIM: Missing data visualization
Install Packages
install.packages("tidyverse")
install.packages("DataExplorer")
install.packages("data.table")
install.packages("mice")
install.packages("ggplot2")
# also used in later modules
install.packages(c("jsonlite", "XML", "VIM", "outliers"))
Module 2
Obtaining Data
Read Data from CSV File
data1 <- read.csv("data.csv", header = TRUE)
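As a minimal sketch of the CSV round trip, the block below writes a small data frame to a temporary file and reads it back. The `scores` data and the temporary path are made up for illustration; the `data.csv` in the exercise files may look different.

```r
# Hypothetical data, written to a temporary CSV so the example is self-contained
scores <- data.frame(id = 1:3,
                     name = c("a", "b", "c"),
                     score = c(9.5, 7.0, 8.25))
csv_path <- tempfile(fileext = ".csv")
write.csv(scores, csv_path, row.names = FALSE)

# header = TRUE treats the first line of the file as column names
data1 <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)
str(data1)
```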
Read Data from JSON
library(jsonlite)
data <- fromJSON("data.json")
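A small sketch of what `fromJSON()` from the jsonlite package returns for an array of records. The JSON content and the temporary file are assumptions made for illustration, not the course's `data.json`.

```r
library(jsonlite)

# Hypothetical JSON: an array of two records with the same fields
json_path <- tempfile(fileext = ".json")
writeLines('[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]', json_path)

# An array of records with matching fields is simplified to a data frame
data <- fromJSON(json_path)
str(data)
```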
Read Data from Web
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"
read.csv(url, nrows = 5, header = FALSE)
Read Data from XML
library(XML)
data <- xmlTreeParse("data.xml")
Challenge
Read the housing data from the following webpage
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data
and store it in a data frame named house
Time: 5 min
Module 3
Data Exploration and Cleaning
Exploring our data
# load our library
library(DataExplorer)
library(data.table)
## explore our dataset
names(heart)
head(heart)
str(heart)
summary(heart)
## changing our data type
heartDT=data.table(heart)
Exploring our data
# grouping and frequency analysis
group_category(heartDT, "chest_pain", 0, "chol")
# view frequency based on another measure
group_category(heartDT, "chest_pain", 0, "age")
Plotting
# discrete features (categorical data)
plot_bar(heartDT)
# continuous features (numeric data)
plot_boxplot(heartDT, by="disease")
# disease is the categorical variable
# correlation plot
plot_correlation(heartDT)
Plotting
# density plot (numeric columns only)
plot_density(heartDT)
# histogram (numeric columns only)
plot_histogram(heartDT)
# scatterplot, using age as y axis
plot_scatterplot(heartDT,"age")
Splitting data
# generates 2 data tables, one for continuous and one for discrete data
output=split_columns(heartDT)
output$discrete
output$continuous
Imputing data
library(mice)
library(VIM)
# Visualization of the missing pattern
aggr(miss_mtcars, numbers=TRUE)
# Mean Substitution
mean_sub <- miss_mtcars
mean_sub$qsec[is.na(mean_sub$qsec)] <- mean(mean_sub$qsec, na.rm = TRUE)
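The slides do not show how `miss_mtcars` is built. The sketch below assumes it is simply `mtcars` with a few `qsec` values blanked out (the row positions are arbitrary), then applies the same mean substitution using base R only.

```r
# Assumption: miss_mtcars is mtcars with some qsec values set to NA
miss_mtcars <- mtcars
miss_mtcars$qsec[c(2, 10, 25)] <- NA

# Count the missing values per column before imputing
colSums(is.na(miss_mtcars))

# Replace each missing qsec with the mean of the observed qsec values
mean_sub <- miss_mtcars
mean_sub$qsec[is.na(mean_sub$qsec)] <- mean(mean_sub$qsec, na.rm = TRUE)

anyNA(mean_sub$qsec)
```

Mean substitution keeps the column mean unchanged but shrinks its variance, so it is best treated as a quick baseline.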
Dealing with Outliers
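# ESD method: flag values more than t standard deviations from the mean
t=2
m=mean(x)
s=sd(x)
b1=m - s*t   # lower bound
b2=m + s*t   # upper bound
y=ifelse(x >=b1 & x <=b2, 0, 1)   # 0 = within bounds, 1 = outlier
table(y)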
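The same bounds can be used to drop the flagged values. This is a sketch on `mtcars$hp`, chosen here only as an illustrative vector; the names `threshold`, `lower` and `upper` correspond to `t`, `b1` and `b2` above.

```r
x <- mtcars$hp                 # illustrative vector, not from the slides
threshold <- 2                 # number of standard deviations allowed
m <- mean(x)
s <- sd(x)
lower <- m - s * threshold
upper <- m + s * threshold

# TRUE for values outside [lower, upper]
is_outlier <- x < lower | x > upper
table(is_outlier)

# Keep only the values inside the bounds
x_clean <- x[!is_outlier]
```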
Dealing with Outliers
# boxplot method
boxplot(x)
boxplot.stats(x)   # the $out element lists the outlying values
# outliers package
library(outliers)
dixon.test(x)      # Dixon's Q test; requires 3 to 30 observations
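A short sketch of removing the values that the boxplot rule flags, using base R only. `airquality$Ozone` is used here as an example vector; it contains missing values, which `boxplot.stats()` ignores.

```r
x <- airquality$Ozone

# Values lying more than 1.5 * IQR beyond the hinges
stats <- boxplot.stats(x)
stats$out

# Drop the flagged values; NA entries are left in place
x_clean <- x[!x %in% stats$out]
length(x) - length(x_clean)    # number of values removed
```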
Challenge (10 mins)
Using the airquality dataset in R
1) Explore the dataset
2) Do frequency analysis
3) Plot features and correlation plot
4) View the missing values
5) Substitute the missing values with the mean
6) Remove any outliers
Module 4
Data Preprocessing
Data structure
library(tidyverse)   # loads dplyr, tidyr, tibble, ggplot2 and purrr, among others
glimpse(x)
lst(x)
tbl_sum(x)
Selecting columns
x2=select(x,col1,col2,col3,col4)
# selecting only 4 columns
x2=select(x, -col1, -col2)
# dropping columns 1 and 2
x2=rename(x, col99=col2)
# renaming col2 to col99
Filtering rows
x2=filter(x, disease=="negative")
# keep only the rows where disease is negative
x2=filter(x, disease=="negative" & thalach>160)
# filtering on two conditions
x2=filter(x, chest_pain != "asympt")
# drop the rows where chest_pain is "asympt"
x2=filter(x, chest_pain %in% c("asympt","angina"))
# retain only "asympt" and "angina"
Creating calculated columns
x2= mutate(x, old = age>50)
# this will give a new column with TRUE or FALSE
x2= mutate(x, chol_class=chol/20)
x2= mutate(heart, chol_class=chol/20, trestbps_class=trestbps/5)
# this will give two new columns
Creating calculated columns
# using if_else function in mutate
x2=mutate(x,
  cholLevel=if_else(chol>250,"highrisk","normal"),
  chol_class=chol/20)
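Since the heart data ships with the exercise files, here is a sketch of the same `mutate()` and `if_else()` pattern on the built-in `mtcars` data. The cut-offs (3.5 and 150) and the new column names are arbitrary choices for illustration.

```r
library(dplyr)

cars2 <- mutate(mtcars,
                heavy = wt > 3.5,                                    # logical column
                power_class = if_else(hp > 150, "high", "normal"))   # two-level label

# Check how many cars fall into each class
count(cars2, power_class)
```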
Counting and arranging
count(x, chest_pain, sort = TRUE)
count(x, disease, sort=TRUE)
count(x, chest_pain, disease)
distinct(x, exang) # gives only 2 levels
distinct(x, exang, disease)
# look at 2 variables at same time
Counting and arranging
x2=arrange(x, age)
# arrange all the rows by the value of age
x2=arrange(x, age, thalach)
# arrange by age first, then thalach
x2=arrange(x, desc(age))
# descending order
x2=top_n(x,20)
# top 20 rows ranked by the last column; use head(x, 20) for the first 20 rows
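`top_n()` ranks rows by a column rather than taking rows from the top of the table, which is easy to miss. The sketch below contrasts three ways of taking "the top rows" on `mtcars`; `slice_max()` assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# First 5 rows in their current order
first5 <- head(mtcars, 5)

# The 5 rows with the highest mpg (ties can return extra rows)
top5_mpg <- slice_max(mtcars, mpg, n = 5)

# With no ranking column given, top_n() ranks by the last column (carb here)
top5_carb <- top_n(mtcars, 5)
```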
Chaining
# the "%>%" operator is used in chain operations
# it passes the result of one step to the next
heart %>% select(1:5) %>%
  mutate(chol_class=chol/20, trestbps_class=trestbps/5)
heart %>% select(thalach) %>%
  mutate(thalach_class=thalach/15)
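A longer chain can combine the verbs from this module in one pass. This sketch uses `mtcars` so it runs without the exercise files; the derived column `hp_per_wt` is a made-up name.

```r
library(dplyr)

result <- mtcars %>%
  select(mpg, cyl, hp, wt) %>%        # keep four columns
  filter(cyl == 4) %>%                # four-cylinder cars only
  mutate(hp_per_wt = hp / wt) %>%     # calculated column
  arrange(desc(hp_per_wt))            # highest ratio first

head(result, 3)
```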
Joins
left_join(A,B, by="col1")
# keep all rows of A, add matching rows from B
right_join(A,B, by="col1")
# keep all rows of B, add matching rows from A
inner_join(A,B, by="col1")
# retain only rows with a match in both sets
full_join(A,B, by="col1")
# retain all rows from both sets
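The difference between the four joins is easiest to see on two tiny tables that share only some keys. `A` and `B` below are hypothetical data frames built for this sketch.

```r
library(dplyr)

A <- data.frame(col1 = c(1, 2, 3), a = c("a1", "a2", "a3"))
B <- data.frame(col1 = c(2, 3, 4), b = c("b2", "b3", "b4"))

left  <- left_join(A, B, by = "col1")    # keys 1, 2, 3; b is NA for key 1
right <- right_join(A, B, by = "col1")   # keys 2, 3, 4; a is NA for key 4
inner <- inner_join(A, B, by = "col1")   # keys 2, 3
full  <- full_join(A, B, by = "col1")    # keys 1, 2, 3, 4
```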
Group by
groupDisease=group_by(x, disease)
# disease is the variable used to create the groups ("positive", "negative")
groupDisease2=group_by(x, disease, fbs)
# grouping by two variables gives more groups
Summarize
# you can choose your own summary statistics
summarize(heart,
  count=n(),
  avgAge=mean(age, na.rm=TRUE),
  sdAge=sd(age, na.rm=TRUE),
  medAge=median(age, na.rm=TRUE),
  Q3rdAge=quantile(age, .75, na.rm=TRUE)
)
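`group_by()` and `summarize()` are usually used together: the summary is then computed once per group instead of once for the whole table. A minimal sketch on `mtcars`, grouping by `gear` (an arbitrary choice):

```r
library(dplyr)

by_gear <- mtcars %>%
  group_by(gear) %>%
  summarize(count   = n(),
            avg_mpg = mean(mpg, na.rm = TRUE),
            max_hp  = max(hp, na.rm = TRUE))

by_gear   # one row per value of gear
```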
Challenge (10 mins)
Use the mtcars dataset
1) Select the first 9 columns and 20 rows
2) Create a calculated column for the average of mpg and disp
3) Arrange by qsec descending
4) Group by cyl and vs
5) Compute summary statistics (count, mean, max)
Module 5
Data Reshaping
Separate
# if your data contains 2 sets of information in 1 column, you can split them up
Arguments
# first: dataset name
# second: name of the column to split
# third: names of the new columns
# fourth: the separator (the character to split on)
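The argument order above can be sketched as follows. The `team` data frame and its `Full_Name` column are hypothetical, chosen as the reverse of a unite operation on first and last names.

```r
library(tidyr)

# Hypothetical data: one column holding two pieces of information
team <- data.frame(Full_Name = c("Ada Lovelace", "Alan Turing"),
                   stringsAsFactors = FALSE)

# dataset, column to split, new column names, separator
team2 <- separate(team, Full_Name, c("First_Name", "Last_Name"), sep = " ")
team2
```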
Unite
# opposite of separate: combines columns
Arguments
# first: dataset name
# second: name of the new column to unite the columns into
# third: names of the columns to combine
# fourth: the separator placed between the values in the new column
unite(team, "Full Name", c(First_Name, Last_Name), sep=" ")
Gather
# rearranging and re-orienting the columns by stacking the year columns into a single year column
Arguments
# first: dataset name
# second: name of the new column that holds the stacked column names
# third: name of the new column that holds the values of the stacked columns
# fourth: the columns that we are stacking
homeruns2=gather(homeruns, year, home_runs, YR2015:YR2013)
Spread
# opposite of gather: spreads one column out into multiple columns
Arguments
# first: dataset name
# second: the column whose values become the new column names
# third: the column whose values fill the new columns
spread(homeruns2, year, home_runs)
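A round trip makes the relationship between the two functions concrete. The `homeruns` table below is invented for this sketch (the slides do not show its contents). In current tidyr, `gather()` and `spread()` are superseded by `pivot_longer()` and `pivot_wider()`, so the long form is also built that way for comparison.

```r
library(tidyr)

# Hypothetical wide table: one column per year
homeruns <- data.frame(player = c("A", "B"),
                       YR2015 = c(10, 20),
                       YR2014 = c(12, 18),
                       YR2013 = c(8, 25))

# Wide to long: stack the three year columns
homeruns2 <- gather(homeruns, year, home_runs, YR2015:YR2013)

# Long to wide again
wide_again <- spread(homeruns2, year, home_runs)

# The same wide-to-long step with the newer function
long2 <- pivot_longer(homeruns, cols = YR2015:YR2013,
                      names_to = "year", values_to = "home_runs")
```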
Module 6
Data Visualization
Scatter Plot
ggplot(mtcars) + aes(x=wt, y=mpg) +
  geom_point(size=3, color = "blue")
Scatter Plot (grouped data)
ggplot(mtcars) + aes(x=wt, y=mpg,
color = factor(cyl) ) +
geom_point(size=3)
Scatter Plot (adding a trendline)
ggplot(mtcars) + aes(x=wt, y=mpg) +
geom_point() + stat_smooth(method =
"lm")
Scatter Plot (faceting: I)
ggplot(mtcars) + aes(x=wt, y=mpg) +
geom_point() + facet_grid( am ~ .)
Scatter Plot (faceting: II)
ggplot(mtcars) + aes(x=wt, y=mpg) +
geom_point() + facet_grid( am ~ cyl)
Scatter Plot (faceting: III)
ggplot(mm) + aes(x=value, y = mpg) +
geom_point() + facet_wrap( ~variable,
scales = "free", ncol = 2)
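The slides do not show how `mm` is created. One plausible construction, offered as an assumption, is `mtcars` reshaped to long form with `mpg` kept as an identifier, so that every other column becomes a `variable`/`value` pair:

```r
library(tidyr)
library(ggplot2)

# Assumption: mm is mtcars in long form, one row per (car, variable) pair
mm <- pivot_longer(mtcars, cols = -mpg,
                   names_to = "variable", values_to = "value")

# One panel per variable, each with its own axis scales
p <- ggplot(mm) + aes(x = value, y = mpg) +
  geom_point() +
  facet_wrap(~variable, scales = "free", ncol = 2)
```

Calling `print(p)` draws the plot.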
Bar Plot
ggplot(mtcars, aes(x = factor( cyl))) +
geom_bar()
Multiple Bar Plot
ggplot(mm) + aes(x=factor(month), y=value) +
  geom_col() + facet_grid( . ~ variable)
# geom_col() plots the values in y; geom_bar() counts rows and fails when both x and y are mapped
Histogram
ggplot(mtcars, aes(x = mpg)) +
geom_histogram(binwidth = 3)
Boxplot
ggplot( mtcars, aes(x = factor( cyl), y =
mpg)) + geom_boxplot()
Challenge
Use ggplot to plot the Median value
of owner-occupied homes vs. per
capita crime rate
Module 7
Advanced Analysis (optional)
Map functions
library(purrr)
# map() returns a list
# map_lgl() returns a logical vector
# map_int() returns an integer vector
# map_dbl() returns a double vector
# map_chr() returns a character vector
Map functions
map(x, summary) # find a summary of each column
map_lgl(x, is.numeric) # find columns that are numeric (returns logical)
map_chr(x, typeof) # find the type of each column (returns character)
Apply functions
map_dbl(x, mean) # find column means
map_dbl(x, sd) # find column standard deviations
map_dbl(x, quantile, probs=c(0.05)) # find the 5th percentile of each column
Apply user-defined functions
# group the heart data by chest_pain type
# nest() stores each group's rows as a tibble in a list-column named "data"
n_heart <- heart %>%
  group_by(chest_pain) %>%
  nest()
Apply user-defined functions
# create a model for each chest_pain group
mod_fun=function(x) lm(chol ~ age + trestbps + thalach, data=x)
# apply the model
model_heart=n_heart %>%
  mutate(model=map(data, mod_fun))
# "data" is the list-column created by nest()
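Once each group has a model, further `map_*()` calls can pull a number out of every model. This is a sketch of the same nest-then-model pattern on the built-in `iris` data; the formula and the `slope` column are illustrative choices, not part of the course material.

```r
library(dplyr)
library(tidyr)
library(purrr)

# One row per species, with that species' rows stored in the "data" list-column
n_iris <- iris %>%
  group_by(Species) %>%
  nest()

# Fit the same linear model to each group's data
mod_fun <- function(df) lm(Sepal.Length ~ Sepal.Width, data = df)

model_iris <- n_iris %>%
  mutate(model = map(data, mod_fun),
         slope = map_dbl(model, ~ coef(.x)[["Sepal.Width"]]))

select(model_iris, Species, slope)
```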
Logical testing
pluck(heart,"age") # get the values in "age"
old=function(x){x>50}
keep(heart$age, old) # keep elements that pass a logical test
discard(heart$age, old) # remove elements that pass a logical test
Summarize data
every(heart$age, old)
# do all elements pass the test?
some(heart$age, old)
# do some elements pass the test?
detect(heart$age, old)
# find the first element that passes the test
detect_index(heart$age, old)
# find the position of the first element that passes the test
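These helpers also work column by column on a data frame, because a data frame is a list of columns. A sketch on the built-in `iris` data, where the test function is `is.numeric` or `is.factor` rather than a user-defined one:

```r
library(purrr)

# Keep only the numeric columns (drops the factor column Species)
num_cols <- keep(iris, is.numeric)

# Mean of every remaining column
col_means <- map_dbl(num_cols, mean)

all_numeric  <- every(num_cols, is.numeric)   # TRUE: all kept columns are numeric
has_factor   <- some(iris, is.factor)         # TRUE: Species is a factor
factor_index <- detect_index(iris, is.factor) # 5: Species is the fifth column
```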
pmap
# pmap takes a list of arguments as input
# it is the multiple-argument version of map
n=list(5,10,20)
mu=list(1,5,10)
sd=list(0.1,1,0.1)
pmap(list(n, mu, sd), rnorm)
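With an unnamed list, `pmap()` matches arguments by position. Naming the list elements after the function's parameters (`n`, `mean`, `sd` for `rnorm()`) makes the matching explicit. The sketch below does this with the same values; the seed is arbitrary.

```r
library(purrr)

# Element names match the parameter names of rnorm()
params <- list(n    = c(5, 10, 20),
               mean = c(1, 5, 10),
               sd   = c(0.1, 1, 0.1))

set.seed(1)                       # arbitrary seed, for reproducibility
samples <- pmap(params, rnorm)    # rnorm(5, 1, 0.1), rnorm(10, 5, 1), rnorm(20, 10, 0.1)

lengths(samples)                  # one sample per parameter set
```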
Challenge (10 mins)
Use the mtcars dataset
1) Map summary of each column
2) Find column means
3) Group by cyl and am (nest)
4) Apply a model for each group
Summary
Parting Message
Q&A
Feedback
https://goo.gl/EDezXH
Thank You!
Ghazaleh Babanejad
ghazaleh.babanejad@gmail.com
01123005257