Lecture 1


Msc EDCBA

Marc-Arthur Diaye
Full Professor
University Paris 1 Pantheon-Sorbonne

Data Analytics
Basic notions of R
CLASS 1A
• http://www.r-project.org/

• CRAN (Comprehensive R Archive Network)

• http://cran.r-project.org/

• Rstudio
• https://www.rstudio.com/
Data
• To find which folder R is currently looking in,
type :
• getwd()
• To change folder (directory) :
• setwd("path")
• Important : use / instead of \.
• For instance: setwd("C:/Marc/Master")
• If you use RStudio, you can also change folder directly
from the « Files » pane or the « Session » menu.
Data
• R can read data sets in text format (ascii)
using the following functions :
• read.table
• scan
• read.fwf
Data
• R can also read files in Excel, SAS, SPSS, … formats.
• The functions needed are however not in the
base package.
Data
• The function read.table reads a
dataset.
• It is the main function used to read a dataset.
• Example: A « txt » file called « coi2006 »
• From this dataset, we can create a dataset
called mydata:
• mydata<-read.table("coi2006.txt" ,
header=TRUE)
Data
• sep: sep="\t" tells R that the file is tab-
delimited (use " " for space delimited and ","
for comma delimited; use "," for a .csv file).

• row.names: the names of the rows, given either as a
character vector, or as the number (or the name) of a
variable of the file (by default: 1, 2, 3, ...)
Data
• col.names: a vector that contains the
names of the dataset variables (by default :
V1, V2, V3, . . .).

• as.is: controls the conversion of character

variables into factor (if FALSE) or keeps them
in characters (TRUE); as.is can be a logical
vector, numeric or character specifying the
variables preserved in character.
Data
• The variants of read.table are useful because
they have different default values:
• read.csv(file, header = TRUE, sep = ",",
quote="\"", dec=".", fill = TRUE, ...)
• read.csv2(file, header = TRUE, sep = ";",
quote="\"", dec=",",fill = TRUE, ...)
• read.delim(file, header = TRUE, sep = "\t",
quote="\"", dec=".",fill = TRUE, ...)
• read.delim2(file, header = TRUE, sep = "\t",
quote="\"", dec=",",fill = TRUE, ...)
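The difference between the variants can be seen on a small example. This is only a sketch: the two temporary files and their contents are hypothetical and are not part of the course dataset.

```r
# Hypothetical example data written to temporary files (not from coi2006).
csv_file  <- tempfile(fileext = ".csv")
csv2_file <- tempfile(fileext = ".csv")

# Comma-separated file, "." as decimal mark
writeLines(c("id,wage", "1,1500.5", "2,2300.25"), csv_file)
# Semicolon-separated file, "," as decimal mark
writeLines(c("id;wage", "1;1500,5", "2;2300,25"), csv2_file)

d1 <- read.csv(csv_file)    # defaults: header = TRUE, sep = ",", dec = "."
d2 <- read.csv2(csv2_file)  # defaults: header = TRUE, sep = ";", dec = ","

# Both calls give the same numeric wage column
print(d1)
print(all.equal(d1$wage, d2$wage))
```

Choosing the variant whose defaults match the file avoids having to set sep and dec by hand in read.table.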
Data
• The dataset we will use is already in the .dta
(Stata) format.
• In order to read a dataset in this format, first
load the package called « foreign » (it usually comes with R; if it is
missing, install it with install.packages("foreign")).
• To load it, write in the console :
library("foreign")
• Then create a dataset called « mydata »
from « coi2006.dta ».
• mydata<-read.dta("coi2006.dta")
Data
• If you use Rstudio, you can directly visualize
the data from the « Environment » part.
• If you use the standard R, you can go to
« Edit », then to « Edit data».

• If you want to have the list of objects (datasets, variables, ...) in the
current workspace, you can write in the console
• ls()
• Or
• objects()
Data
• Quit R

• Enter R

• library("foreign")

• mydata<-
read.dta(file="f:/Coi2006/coi2006.dta")
Data
• To see the variable names and the first rows of the dataset:
• head(mydata)

• To make the columns of the dataset directly available
for computation, the data can be attached:
• attach(mydata)
• The variables from the dataset can then be
used by name in computations.
Data
• You can compactly display the structure of all
variables from the dataset:
• str(mydata)

• You can compactly display the structure of a

specific variable z from the dataset:
• str(mydata$z)
• For instance : str(mydata$sexe)
Data
• Statistics:

• summary(mydata)
• Provide a summary of all variables from the dataset.

• summary(salnet)
• Provide a summary of variable salnet.

• For numerical variables, summary provides Min, Q1 (first quartile),
Median (Q2), Mean, Q3 (third quartile) and Max.

• Missing values are denoted NA.

Data
• Example of the NET WAGE variable: salnet.

• mean(salnet)
• var(salnet)
• sd(salnet)
• quantile(salnet)
• median(salnet)
• range(salnet)

• boxplot(salnet)
• boxplot(salnet, horizontal=TRUE)
Data

• Example of the NET WAGE variable: salnet.

• You can first define a list, before performing a

boxplot

• R> b=list("Man"=salnet[sex==1],
"Woman"=salnet[sex==2])
• R> boxplot(b)
Data

• BOXPLOT.

A boxplot is a simple diagram that represents the
distribution of a variable.

This diagram is composed of:

- A rectangle that extends from the first to the third
quartile.
- The rectangle is divided by a line corresponding to the
median.
Data

• BOXPLOT.

- This rectangle is completed by two line segments.

- To draw them, we first calculate the bounds:
$b^- = x_{1/4} - 1.5\,IQ$ and $b^+ = x_{3/4} + 1.5\,IQ$,
with $IQ = x_{3/4} - x_{1/4}$ the interquartile distance (i.e., the difference
between the 3rd quartile $x_{3/4}$ and the 1st quartile $x_{1/4}$).
Data
• BOXPLOT.
- The smallest and the largest observations lying between
these bounds are then identified. These
observations are called "adjacent values".
- We draw the line segments linking these
observations to the rectangle.
- Values that are not between the adjacent values
are represented by dots and are called
"extreme values" (outliers).
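The construction of the bounds, the adjacent values and the extreme values can be sketched by hand. The wage vector below is hypothetical. Note also that boxplot() computes the box from "hinges", which can differ slightly from the quantile() values on some samples.

```r
# Hypothetical wages (not from coi2006)
x <- c(12, 15, 16, 18, 20, 21, 23, 25, 60)

q1 <- quantile(x, 0.25, names = FALSE)  # first quartile
q3 <- quantile(x, 0.75, names = FALSE)  # third quartile
iq <- q3 - q1                           # interquartile distance

lower <- q1 - 1.5 * iq                  # bound b-
upper <- q3 + 1.5 * iq                  # bound b+

# Adjacent values: smallest and largest observations inside the bounds
inside   <- x[x >= lower & x <= upper]
adjacent <- range(inside)

# Extreme values: observations outside the bounds (drawn as dots)
extreme <- x[x < lower | x > upper]

print(c(lower = lower, upper = upper))
print(adjacent)
print(extreme)
```

Here the whiskers stop at 12 and 25, and the observation 60 is drawn as a separate dot.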
Data : More about Boxplot
• boxplot(list(salnet,salbrut))
• By default, whiskers have a maximum length
equal to 1.5 times the size of the box.
• This coefficient can be modified with the range
option.
• You can also change the width of the box with
the width option.
• The names option makes it possible to specify
the labels to be displayed under each box. For
example here one could use names = c("x","y").
Data : More about Boxplot
• Example:
• boxplot(list(salnet,salbrut),names=c("Net wage", "Gross wage"))
Data

• Example of the NET WAGE variable: salnet.

• Compute average net wage by gender:

• R> b=list(sex)
• R> aggregate(salnet,by=b,mean)
• Or :
• R> aggregate(mydata$salnet,by=b,mean)
Data

• Example of the NET WAGE variable: salnet.

• Mean net wage per gender and PCS (3 :
Managers, 4: Middle managers, 5:
Employees, 6: Blue-collars):
• R> b=list(sex,cscor)
• R> aggregate(salnet,by=b,mean)
• Or :
• R> aggregate(mydata$salnet,by=b,mean)
Data

• Example of the NET WAGE variable: salnet.

• Standard-deviation of net wage per gender
and PCS (3 : Managers, 4: Middle managers,
5: Employees, 6: Blue-collars):
• R> b=list(sex,cscor)
• R> aggregate(salnet,by=b,sd)
• Or :
• R> aggregate(mydata$salnet,by=b,sd)
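The same group statistics can also be written with the formula interface of aggregate(), which keeps the variable names in the output. This is a sketch on a small hypothetical data frame `d`; the figures are invented for illustration.

```r
# Hypothetical data frame (values invented, not from coi2006)
d <- data.frame(
  salnet = c(20000, 30000, 18000, 26000, 40000, 22000),
  sex    = c(1, 1, 2, 2, 1, 2),
  cscor  = c(3, 4, 3, 4, 3, 3)
)

# Mean net wage per gender
by_sex <- aggregate(salnet ~ sex, data = d, FUN = mean)

# Mean and standard deviation of net wage per gender and PCS
mean_sex_pcs <- aggregate(salnet ~ sex + cscor, data = d, FUN = mean)
sd_sex_pcs   <- aggregate(salnet ~ sex + cscor, data = d, FUN = sd)

print(by_sex)
print(mean_sex_pcs)
```

With by=list(...), the grouping columns are named Group.1, Group.2, ...; with the formula they keep the names sex and cscor.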
Data

• When a variable includes some missing

values:
• Example of variable r100 (profit).

• R> mean(r100)
• [1] NA
Data
• When a variable includes some missing
values:
• Example of variable r100 (profit).

• How to deal with it ?

• Answer : create a new database without
missing values:
• R> mydata2<-mydata[!is.na(mydata$r100),]
Data

• When a variable includes some missing values:

• Example of variable r100 (profit).

• Then compute the mean value of profit using the new

dataset.
• R> mean(mydata2$r100)
• [1] 17605.16

• A simpler solution is to use the option «na.rm »

directly on the original dataset:
• R> mean (r100, na.rm=TRUE)
Data
• When a variable includes some missing values:
• Suppose that you want to have the number of missing
observations from a a given variable.
• For instance « salnet » and « r100 »
• We know that salnet includes no missing values, while r100
includes some missing values.
• R> sum(is.na(salnet))
• [1] 0
• R> sum(is.na(r100))
• [1] 1072

• You can also use « table »:

• R> table(is.na(r100))
• FALSE TRUE
• 11912 1072
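The missing-value tools above can be tried on a short vector. The profit values here are hypothetical.

```r
# Hypothetical profits with two missing values
r100 <- c(100, NA, 250, 400, NA)

m_raw     <- mean(r100)                # NA: a single missing value propagates
m_clean   <- mean(r100, na.rm = TRUE)  # mean of the three observed values
n_missing <- sum(is.na(r100))          # number of missing observations
share     <- mean(is.na(r100))         # proportion of missing observations

print(m_raw)
print(m_clean)
print(n_missing)
print(table(is.na(r100)))
```

Because is.na() returns TRUE/FALSE, sum() counts the missing values and mean() gives their share.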
Data

• Compute the mean profit (r100) per gender, using the dataset without missing values.

• R> b=list(mydata2$sex)
• R> aggregate(mydata2$r100,by=b,mean)
Data
• Categorical variable :
• R> table(diplome)
• diplome
• 1 10 2 3 4 5 6 7 8 9
• 677 597 903 1864 2961 1066 1228 2350 716 622

• If we want some proportions:

• R> tab<-table(diplome)
• R> prop.table(tab)
• 1 10 2 3 4
• 0.05214110 0.04597967 0.06954713 0.14356131 0.22804991
• 5 6 7 8 9
• 0.08210105 0.09457794 0.18099199 0.05514479 0.04790511
• We can have the same result directly with :
• R> prop.table(table(diplome))
Data
• Categorical variable:

• Graph of frequencies:
• R> barplot(table(diplome))
Data
• Categorical variable:

• R> plot(table(diplome))
Data
• Categorical variable:
• Pie:
• pie(table(diplome))
Data
• Categorical variable :
• Pie:
• pie(table(sex), main="Distribution
Man/Woman", labels=c("Man",
"Woman"),col=c("green", "yellow"))
Data
• Two categorical variables:
• sex, diplome
• R> xtabs(~sex+diplome)
• Provides a contingency table.
• diplome
• sex 1 10 2 3 4 5 6 7 8 9
• 1 421 479 553 1486 1909 551 785 1271 384 320
• 2 256 118 350 378 1052 515 443 1079 332 302
Data
• Two categorical variables:

• We can do also:
• R> table(sex,diplome)

• Or:
• R> x<-table(sex,diplome)
• R> x
Data
• Two categorical variables:
• R> plot(x)
• Provides a mosaic plot (for a two-way table, plot draws a mosaic plot rather than a bar plot).
Data
• Two categorical variables:
• R> summary(x)
• Provides the chi-square (Chisq), the number
of degree of freedom (df) and the p-value:
Independence test of two variables (here: sex
and diploma).
Data
• Two categorical variables:
• R> summary(x)

• Number of cases in table: 12984

• Number of factors: 2
• Test for independence of all factors:
• Chisq = 504.4, df = 9, p-value = 6.532e-103
Data
• Chi-square Independence Test
• H0 (Null hypothesis): The two distributions are
independent / H1: The two distributions are not
independent.

• summary(x) provides the p-value of the test.

• Fix the probability α of the type 1 error (the rejection
of a true null hypothesis): 1%, 5% or 10%. If the
p-value ≤ α, then reject H0.
• In our example, p-value = 6.532e-103 < α = 1%. Then
we reject H0: sex and diploma are not independent.
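The decision rule can be checked with chisq.test(). This is a sketch: the counts are copied from the sex × diplome contingency table shown earlier, and the names `test`, `alpha` and `reject_h0` are ours.

```r
# Counts copied from the sex x diplome contingency table of the slides
x <- matrix(
  c(421, 479, 553, 1486, 1909, 551, 785, 1271, 384, 320,
    256, 118, 350,  378, 1052, 515, 443, 1079, 332, 302),
  nrow = 2, byrow = TRUE,
  dimnames = list(sex = c("1", "2"),
                  diplome = c("1", "10", "2", "3", "4", "5", "6", "7", "8", "9"))
)

test  <- chisq.test(x)   # chi-square test of independence
alpha <- 0.01            # chosen probability of the type 1 error

# Reject H0 (independence) when the p-value is below alpha
reject_h0 <- test$p.value < alpha

print(test)
print(reject_h0)
```

A very small p-value leads to rejecting independence between sex and diploma.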
Data
• Two numerical variables:
• Example : salnet (net wage); salbrut (gross
wage)
• R> cor(salnet,salbrut)
• Provides the correlation coefficient (in the
sense of Pearson) between the two variables.
• R> cor(salnet,salbrut,method="spearman")
provides the correlation coefficient in the
sense of Spearman.
Data
• Two numerical variables :
• Spearman Correlation
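• Spearman compares the ranks of the values taken by
the two variables.
• Let us assume variables x = (x1,…,xn) and y =
(y1,…,yn).
• Each value of x and of y is replaced by its rank in the increasing
ordering of its own variable (1 for the smallest value, n for the largest).
• Spearman correlation coefficient (when there are no ties) =
• $\rho = 1 - \dfrac{6 \sum_{i=1}^{n} d_i^2}{n(n^2-1)}$, where $d_i = \operatorname{rank}(x_i) - \operatorname{rank}(y_i)$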
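The rank-difference formula, 1 − 6·Σd² / (n(n² − 1)), can be compared with the built-in computation. This is a sketch on two short hypothetical vectors without ties.

```r
# Hypothetical data without ties
x <- c(10, 20, 30, 40, 50)
y <- c(12, 25, 18, 47, 33)
n <- length(x)

# Difference between the ranks of x and the ranks of y
d <- rank(x) - rank(y)

# Spearman coefficient from the rank-difference formula
rho_formula <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))

# Built-in computation
rho_builtin <- cor(x, y, method = "spearman")

print(rho_formula)
print(rho_builtin)
```

Both values coincide here. When the data contain ties, cor() works on average ranks and the simple formula no longer applies exactly.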
Data
• A numerical variable described by group :

• R> tapply(salnet,sex,summary)

• $`1`
• Min. 1st Qu. Median Mean 3rd Qu. Max.
• 3210 16930 21010 26330 28490 100000

• $`2`
• Min. 1st Qu. Median Mean 3rd Qu. Max.
• 3444 13580 16930 19940 22780 100000
Data

• Graph:
• R> b=list("Man"=salnet[sex==1],
"Woman"=salnet[sex==2])
• R> boxplot(b)
Data
• Remark concerning attach()
• We said that we need to write
• attach(mydata)
• before doing any computation on the variables
of the dataset.

• Actually, it is not necessary to go through this step,
but then you have to specify the dataset that you use.
• Example:
• mean(mydata$salnet)
• table(mydata$sexe)
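Another way to avoid attach() is the base function with(), which evaluates an expression inside a dataset. The small data frame below is hypothetical and only stands in for the course data.

```r
# Hypothetical stand-in for the course dataset
mydata <- data.frame(salnet = c(20000, 30000, 25000),
                     sexe   = c(1, 2, 2))

# Explicit reference to the dataset
m1 <- mean(mydata$salnet)

# Same computation with with(): variables are looked up inside mydata
m2 <- with(mydata, mean(salnet))

print(m1)
print(m2)
print(with(mydata, table(sexe)))
```

with() leaves the workspace untouched, so there is no risk of confusing an attached variable with another object of the same name.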
Data
• Create a new variable from another:
• We want to create a variable denoted
« lsalnet » defined as the ln of net wage.
• lsalnet<-log(salnet)
• mean(lsalnet)

• We can do also:
• mydata$lsalnet<-log(mydata$salnet)
• mean(mydata$lsalnet)
Data
• Create a new variable from another:

• Recall that the PCS variable (denoted CSCOR) takes four
modalities : 3 : Manager, 4: Middle Manager, 5: Employee,
6: (Blue-collar) Worker
• From CSCOR, we want to create a new variable called PCS2,
with:
• PCS2 = TRUE if cscor = 3 or 4
• PCS2 = FALSE otherwise
• pcs2<-cscor %in% c(3,4)

• We can do the same thing with

• pcs2<-(cscor %in% c(3,4))
Data
• Create a new variable from another:
• If we want to create a binary variable 0/1:
pcs3

• pcs3<-rep(0,length(cscor))
• pcs3[cscor %in% c(3,4)]=1
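The 0/1 variable can also be obtained in one step from the logical test. This is a sketch on a hypothetical cscor vector; the name `pcs3_alt` is ours.

```r
# Hypothetical PCS codes
cscor <- c(3, 5, 4, 6, 3)

# Logical variable: TRUE for managers and middle managers
pcs2 <- cscor %in% c(3, 4)

# 0/1 variable obtained by converting TRUE/FALSE to integers
pcs3 <- as.integer(pcs2)

# Equivalent construction with ifelse()
pcs3_alt <- ifelse(cscor %in% c(3, 4), 1, 0)

print(pcs2)
print(pcs3)
print(table(pcs3))
```

The rep()/assignment approach and these one-line versions give the same values.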
Data
• Create a new variable from another:
• Practice.

• Create a binary variable from the « sex »

variable.
• Recall: sex=1 if man; 2 if woman
Data

• man=rep(0,length(sex))
• table(man)

• man[sex=="1"]=1
• table(man)
Data

• Practice : Impliq variable

• Create from this variable a binary effort
variable.

• Answer:
• Effort (impliq)
• effort<-rep(0,length(impliq))
• effort[impliq =="3"]=1
Data
• More concerning the Plot function
• xlim, ylim : set the lower and upper limits of
the axes (two-element vectors).
• xlab, ylab : allow you to specify the axis
labels (character mode).
• main : puts a title above the graph
(character mode).
• pch : defines the symbol representing the
points; an integer from 0 to 25, or any
single character in quotation marks.
Data
• More concerning the Plot function
• col: specifies the color of symbols
("blue","red" etc.; the exhaustive list of
available colours can be obtained with colors()),

• bty: controls the shape of the frame; default

square ("o"), L ("l"), U ("u"), C ("c"), 7 ("7") or
square brackets ("]").
Data
• More concerning the Plot function

• We can play on the size of the symbols thanks to

the option cex.

• By default cex=1 ; however we can provide to

the software, a positive number that represents
a multiplicative coefficient relative to the default
size (a value between 0 and 1 to reduce the size,
or greater than 1 to increase it).
Data
• More concerning the Plot function

• In the same way the options cex.axis, cex.lab

and cex.main control the size of the graduations
of the axes, labels of the axes and the title.

• To change the style of the text, we use the font

option, which also comes in the forms font.axis,
font.lab and font.main (1 for normal, 2 for bold,
3 for italic and 4 for bold italic).
Data
• More concerning the Hist function

• We saw that to draw a histogram, the basic

command is the hist function :
• hist(x)

• Some options of the hist function:

• breaks : allows you to specify the break points
between the bars of the histogram, either as a vector
or as a number of bars.
• freq : allows to choose the frequency (freq=TRUE,
default option), or the proportion (freq=FALSE).
Data
• More concerning the Hist function

• Some options of the hist function :

• col : indicates the color to fill the bars.
• plot : if plot=FALSE, the histogram is not drawn
and the function returns the list of break points
and numbers.
• right : intervals are of type ]a, b] if
right=TRUE (the default), and of type [a, b[ if right=FALSE.
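The effect of breaks, right and plot=FALSE can be seen on a tiny example. The values are hypothetical and chosen so that some observations fall exactly on a break point.

```r
# Hypothetical values, some of them exactly on a break point
x <- c(5, 10, 15, 20, 25)
b <- c(0, 10, 20, 30)   # explicit break points

# plot = FALSE: nothing is drawn, the breaks and counts are returned
h_right <- hist(x, breaks = b, right = TRUE,  plot = FALSE)  # ]a, b] intervals
h_left  <- hist(x, breaks = b, right = FALSE, plot = FALSE)  # [a, b[ intervals

print(h_right$counts)  # 10 and 20 are counted in the interval they close
print(h_left$counts)   # 10 and 20 are counted in the interval they open
```

Only the observations lying on a break point (10 and 20 here) change class when right is switched.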
Data
• More concerning the Hist function

• Example:
• hist(salnet, freq = FALSE, main = "Histogram
NET Wage")
Data
• More concerning the Hist function

• Example:
• hist(salnet, freq = TRUE, main = "Histogram
Net Wage")
Data
• More concerning the Hist function

• Example:

• hist(salnet, freq = TRUE, col = "blue", main =

"Histogram Net Wage")
Data
• More concerning the Hist function

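• Example: Choose the number of bars

• Suppose that you want about two bars (a single number given to breaks
is only a suggestion: R chooses "pretty" break points close to it):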

• hist(salnet, breaks = 2, freq = FALSE, col =

"blue", main = "Histogram Net Wage")
• hist(salnet, breaks = 2, freq = TRUE, col =
"blue", main = "Histogram Net Wage")
Data
• More concerning the Hist function

• Transform a variable inside the hist function:

• hist(log(salnet), freq = FALSE, main =

"Histogram Net Wage")
Data
• More concerning the Hist function

• Transform a variable inside the hist function +

smooth:

• lines(density(log(salnet)), col= "red")

Data
#Remove some variables from the dataset : example with
#cscor, age and effl_corr
myvars <- names(mydata) %in% c("cscor", "age", "effl_corr")
newdata <- mydata[!myvars]

#We can check that the variable cscor no longer exists in
#newdata
table(newdata$cscor)
#< table of extent 0 >

#Remove the 3rd and 5th variables
newdata <- mydata[c(-3,-5)]
Data

#Remove variables TYPEMPLOI and IMPLIQ from the
#original dataset mydata
mydata$typemploi <- mydata$impliq <- NULL

#We can also do
mydata$typemploi <- NULL
mydata$impliq <- NULL

#Note: even after attach(mydata), writing
#typemploi <- impliq <- NULL
#does not remove the variables from mydata; it only creates two
#NULL objects in the workspace, so keep the mydata$ form.
Data
# Select some variables: Example of cscor, age,
# effl_corr
myvars2 <- c("cscor", "age", "effl_corr")
newdata <- mydata[myvars2]
fix(newdata)

# Select the 1st variable and variables 5 to 10
newdata <- mydata[c(1,5:10)]
Data
# Select the 20 first observations of the dataset
newdata <- mydata[1:20,]
fix(newdata)

#Select observations that fulfill some conditions

#Example: Select executive women aged 40 or under
newdata <- mydata[ which(mydata$sex=="2" &
mydata$cscor=="3" & mydata$age <= 40), ]

# or if we have first performed attach(mydata)

newdata <- mydata[ which(sex=="2" & cscor=="3" & age
<= 40), ]
Data
The best way to select observations is to use
the subset function

Syntax: subset(dataset, condition on the rows, select = variables to keep)
Data
Example:
• Select executive women aged 40 or under.
• Keep variables : sex, cscor, age, salnet, siren,
depnaiss, effl_corr, couple.

#Use function subset

newdata <- subset(mydata, sex=="2" & cscor=="3"
& age <= 40, select=c(sex,cscor,age,salnet,siren,
depnaiss,effl_corr,couple))
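The equivalence between subset() and bracket indexing can be checked on a small hypothetical data frame. The values are invented and only three variables are kept.

```r
# Hypothetical stand-in for the course dataset
mydata <- data.frame(
  sex    = c("1", "2", "2", "2"),
  cscor  = c("3", "3", "5", "3"),
  age    = c(35, 38, 30, 52),
  salnet = c(40000, 38000, 21000, 45000)
)

# With subset(): row condition, then variables to keep
newdata <- subset(mydata, sex == "2" & cscor == "3" & age <= 40,
                  select = c(sex, age, salnet))

# Same selection with bracket indexing
keep <- which(mydata$sex == "2" & mydata$cscor == "3" & mydata$age <= 40)
newdata2 <- mydata[keep, c("sex", "age", "salnet")]

print(newdata)
print(newdata2)
```

Inside subset() the variables are written without mydata$ and without quotation marks, which makes the condition easier to read.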
Data
Practice 1:
• Create from COI2006 two datasets including
respectively:
• Stressed men less than 35.
• Stressed men at least 35.
• Compare the average net wage of the two
groups.
Data
Practice 2:
• COI2006 includes two gender variables : sexe
and sex. However these variables come from
two different statistical sources. As a
consequence, they do not agree all the times.
• Compare both variables.
• Is it possible to have a convincing answer about
the agents’ true gender, when the two variables
disagree ?
Introduction to ggplot2
CLASS 1 B
Comparing densities between groups
• library("foreign")
• mydata=read.dta(file="F:/COI2006/coi2006.dta")
• attach(mydata)

• library (lattice)
• densityplot(~salnet|sex)

#If you need to specify the name of the data set

• densityplot(~salnet|sex, data=mydata)
Practice 1
Comparing (Kernel) densities between groups

• In the same graph, draw the densities of net

wage (salnet variable) for managers and non
managers, for women and for men.
• Reproduce the following graph.
Solution 1
• library(lattice)

• v_manager<-rep("Non manager",length(cscor))
• v_manager[cscor %in% c(3,4)]="Manager"
• v_homme<-rep("Woman",length(sex))
• v_homme[sex=="1"]="Man"

• densityplot(~salnet|v_homme,
groups=v_manager, data=mydata,
auto.key=list(space="right"), main="Density Net
Wage Manager/Non manager, Woman/Man")
Data Visualization
• Visualisation is a fundamentally human
activity.

• A good visualization may show us things that

we did not expect, or raise new questions
about the data.
Datavisualization
• Many data visualization and data manipulation packages, including ggplot2, are part of the
so-called tidyverse.

• The packages in the tidyverse share a common

philosophy of data and R programming, and are
designed to work together naturally.

• We can install the complete tidyverse with a

single line of code:

• install.packages("tidyverse")
Datavisualization
• Then :
• library(tidyverse)

• You will see :

• -- Attaching packages --------------------------------------- tidyverse 1.2.1

• v ggplot2 3.1.0 v purrr 0.3.2
• v tibble 2.1.1 v dplyr 0.8.0.1
• v tidyr 0.8.3 v stringr 1.4.0
• v readr 1.3.1 v forcats 0.4.0
• -- Conflicts ------------------------------------------ tidyverse_conflicts() --
• x dplyr::filter() masks stats::filter()
• x dplyr::lag() masks stats::lag()
Datavisualization
• This tells us that tidyverse is loading the ggplot2,
tibble, tidyr, readr, purrr, dplyr, stringr and forcats packages.

• These are considered to be the core of the tidyverse

because we will use them in almost every analysis.

• Packages in the tidyverse change fairly frequently.

• We can see if updates are available, and optionally

install them, by running :
• tidyverse_update()
Datavisualization
• Creating a ggplot

• With ggplot2, we begin a plot with the function

ggplot()

• ggplot() creates a coordinate system that we can add

layers to.

• The first argument of ggplot() is the dataset to use in

the graph.
• So ggplot(data = mydata) creates an empty graph.
Datavisualization
• Creating a ggplot

• We complete our graph by adding one or more

layers to ggplot().

• For instance, the function geom_point() adds a

layer of points to our plot, which creates a
scatterplot.

• ggplot2 comes with many geom functions that

each add a different type of layer to a plot.
Datavisualization
• Creating a ggplot

• Each geom function in ggplot2 takes a mapping

argument. This defines how variables in our dataset
are mapped to visual properties.

• The mapping argument is always paired with aes(),

and the x and y arguments of aes() specify which
variables to map to the x and y axes.

• ggplot2 looks for the mapped variables in the data

argument.
Datavisualization
• Creating a ggplot

• Example :
• ggplot(data=mydata)+geom_point(mapping =
aes(x = salnet, y = age))
Datavisualization
• Creating a ggplot

• We can convey information about our data by

mapping the aesthetics in our plot to the variables in
our dataset.

• For example, we can map the colors of our points to

the “sex” variable to reveal the “sex” of each individual.

• Example :
• ggplot(data=mydata)+geom_point(mapping = aes(x =
salnet, y = age, color=sex))
Datavisualization
• Creating a ggplot

• Example with shape :

• ggplot(data=mydata)+geom_point(mapping =
aes(x = salnet, y = age, shape=sex))
Datavisualization
• Creating a ggplot

• Example with alpha :

• ggplot(data=mydata)+geom_point(mapping =
aes(x = salnet, y = age, alpha=sex))
Datavisualization
• Creating a ggplot

• We can also set the aesthetic properties of

our geom manually.
• For example, we can make all of the points in
our plot blue:
• ggplot(data=mydata)+geom_point(mapping =
aes(x = salnet, y = age), color="blue")
Datavisualization
• Creating a ggplot

• The color does not convey information about a

variable, but only changes the appearance of the plot.

• To set an aesthetic manually, set the aesthetic by

name as an argument of our geom function;

• We will need to pick a value that makes sense for that

aesthetic:
• The name of a color as a character string;
• The size of a point in mm;
• The shape of a point as a number (see the below figure).
Datavisualization
• Creating a ggplot

• Example:
• ggplot(data=mydata)+geom_point(mapping = aes(x =
salnet, y = age), shape=23)

• ggplot(data=mydata)+geom_point(mapping = aes(x =
salnet, y = age), color="blue", shape=11)

• ggplot(data=mydata)+geom_point(mapping = aes(x =
salnet, y = age), color="blue", shape=19, size=3)
[Figures: the three scatterplots produced by the calls above (shape=23; color="blue", shape=11; color="blue", shape=19)]
Datavisualization
• Creating a ggplot

• Example: Add a title using ggtitle

• ggplot(data=mydata)+geom_point(mapping =
aes(x = salnet, y = age), color="blue",
shape=11) + ggtitle("Shalom to my niece
Rivka")
Datavisualization
• Creating a ggplot

• Example: Add a title using labs

• ggplot(data=mydata)+geom_point(mapping =
aes(x = salnet, y = age), color="blue",
shape=11) + labs(title="Shalom to my niece
Rivka")
Datavisualization
• Creating a ggplot

• Example: Change the name of the axis using

xlab and ylab

• ggplot(data=mydata)+geom_point(mapping =
aes(x = salnet, y = age), color="blue",
shape=11) + ggtitle("Shalom to my niece
Rivka") + xlab("Net wage") + ylab("Ageeeee")
Datavisualization
• Creating a ggplot

• Example: We can have the same result (i.e.,

change the name of the axis) using labs

• ggplot(data=mydata)+geom_point(mapping =
aes(x = salnet, y = age), color="blue",
shape=11) + labs(title="Shalom to my niece
Rivka", x="Net wage", y="Ageeeee")
Datavisualization
• Creating a ggplot

• Example: Change the color, size and type of the main

title and the axis titles, using theme

• ggplot(data=mydata)+geom_point(mapping = aes(x =
salnet, y = age), color="blue", shape=11) +
labs(title="Shalom to my niece Rivka", x="Net wage",
y="Ageeeee")+theme(plot.title =
element_text(color="red", size=14, face="bold.italic"),
axis.title.x = element_text(color="blue", size=14,
face="bold"), axis.title.y =
element_text(color="#993333", size=14, face="bold"))
Hexadecimal color code chart
• Colors can be specified as a hexadecimal RGB (Red, Green,
Blue) triplet, such as "#0066CC".

• The first two digits : are the level of red,

• the next two digits : green,
• and the last two digits : blue.

• The value for each ranges from 00 to FF in hexadecimal

(base-16) notation, which is equivalent to 0 to 255 in
base-10.

• For example, in the table below (slide 30), “#FFFFFF” is

white and “#990000” is a deep red.
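The correspondence between hexadecimal codes and 0 to 255 levels can be checked with base R functions. This is a small sketch; the object names are ours.

```r
# Decompose a hexadecimal code into its red, green and blue levels (0 to 255)
levels_0066CC <- col2rgb("#0066CC")
print(levels_0066CC)

# Build hexadecimal codes from red, green and blue levels
deep_red <- rgb(153, 0, 0, maxColorValue = 255)
white    <- rgb(255, 255, 255, maxColorValue = 255)
print(deep_red)
print(white)

# Convert a two-digit hexadecimal value to base 10
ff <- strtoi("FF", base = 16L)
print(ff)
```

The strings returned by rgb() can be passed directly to any color argument, such as color= in element_text().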
http://www.visibone.com
Datavisualization
• Creating a ggplot

• Example: Remove main title or axis titles

using theme

• theme(plot.title = element_blank(),
axis.title.x = element_blank(), axis.title.y =
element_blank())
Datavisualization
• Creating a ggplot

• One way to add additional variables is with

aesthetics.

• Another way, particularly useful for categorical

variables, is to split our plot into facets, subplots
that each display one subset of the data.

• To facet our plot by a single variable, we can use

facet_wrap().
Datavisualization
• Creating a ggplot

• The first argument of facet_wrap() should be

a formula, which we create with ~ followed
by a variable name.

• The variable that we pass to facet_wrap()

should be discrete.
Datavisualization
• Creating a ggplot

• Example :

• ggplot(data = mydata) + geom_point(mapping =

aes(x = age, y = salnet)) + facet_wrap(~sex, nrow
= 2)

• ggplot(data = mydata) + geom_point(mapping =

aes(x = age, y = salnet)) + facet_wrap(~cscor,
nrow = 2)
Datavisualization
• Creating a ggplot

• To facet our plot on the combination of two

variables, we can add facet_grid() to our plot
call.

• The first argument of facet_grid() is also a

formula.

• This time the formula should contain two

variable names separated by a ~.
Datavisualization
• Creating a ggplot

• Example :
• ggplot(data = mydata) +
geom_point(mapping = aes(x = age, y =
salnet)) + facet_grid(sex ~ cscor)
Datavisualization
• Creating a ggplot

• If we prefer to not facet in the rows or columns

dimension, we can use a “.” instead of a variable
name.
• Example :
• ggplot(data = mydata) + geom_point(mapping =
aes(x = age, y = salnet)) + facet_grid(. ~ cscor)
• ggplot(data = mydata) + geom_point(mapping =
aes(x = age, y = salnet)) + facet_grid(sex ~ .)
Datavisualization
• Creating a ggplot : geometric objects

• So far, we have used geom_point.

• However for bar charts we can use bar geoms;
• For line charts, we can use line geoms;
• For boxplots, we can use boxplot geoms;
• (…)
• We can use the smooth geom, in order to have a smooth
line fitted to the data.
• The data is fitted using the so-called loess method (loess, short
for local regression, is a non-parametric approach that fits
multiple regressions in local neighborhoods);
• or the so-called gam method (gam is short for generalized
additive model).
Datavisualization
• Creating a ggplot : geometric objects

• To change the geom in our plot, we have

simply to change the geom function that we
add to ggplot().

• Example: Smooth the plot

• ggplot(data = mydata) +
geom_smooth(mapping = aes(x = age, y =
salnet))
Datavisualization
• Creating a ggplot : geometric objects

• Every geom function in ggplot2 takes a mapping argument.

• However, not every aesthetic works with every geom.

• We can set the shape of a point, but we cannot set the “shape” of
a line.

• On the other hand, we can set the linetype of a line.

• geom_smooth() will draw a different line, with a different linetype,

for each unique value of the variable that we map to linetype.
Datavisualization
• Creating a ggplot : geometric objects

• Example:
• ggplot(data = mydata) +
geom_smooth(mapping = aes(x = age, y =
salnet, linetype = sex))
Datavisualization
• Creating a ggplot : geometric objects

• geom_smooth() uses a single geometric

object to display multiple rows of data.
• Therefore we can set the group aesthetic to a
categorical variable to draw multiple objects.
• ggplot2 will draw a separate object for each
unique value of the grouping variable.
Datavisualization
• Creating a ggplot : geometric objects

• In practice, ggplot2 will automatically group the

data for these geoms whenever we map an
aesthetic to a discrete variable (as in the linetype
example).

• It is convenient to rely on this feature because

the group aesthetic by itself does not add a
legend or distinguishing features to the geoms.
Datavisualization
• Creating a ggplot : geometric objects

• Example:
• ggplot(data = mydata) + geom_smooth(mapping =
aes(x = age, y = salnet, group = sex))

• ggplot(data = mydata) + geom_smooth(mapping =

aes(x = age, y = salnet, color = sex))

• ggplot(data = mydata) + geom_smooth(mapping =

aes(x = age, y = salnet, color = sex), show.legend =
FALSE)
Datavisualization
• Creating a ggplot : geometric objects

• To display multiple geoms in the same plot, we

can add multiple geom functions to ggplot().

• Example :
• ggplot(data = mydata) + geom_point(mapping =
aes(x = age, y = salnet)) +
geom_smooth(mapping = aes(x = age, y =
salnet))
Datavisualization
• Creating a ggplot : geometric objects

• This, however, introduces some duplication in

our code.

• Imagine if we wanted to change the y-axis to

display salbrut instead of salnet.

• We will need to change the variable in two

places, and we may forget to update one.
Datavisualization
• Creating a ggplot : geometric objects

• We can avoid this kind of repetition by

passing a set of mappings to ggplot().

• ggplot2 will treat these mappings as global

mappings that apply to each geom in the
graph.
Datavisualization
• Creating a ggplot : geometric objects

• Example :
• ggplot(data = mydata, mapping = aes(x = age,
y = salnet)) + geom_point() + geom_smooth()
Datavisualization
• Creating a ggplot : geometric objects

• If we place mappings in a geom function, ggplot2

will treat them as local mappings for the layer.
• It will use these mappings to extend or overwrite
the global mappings for that layer only.

• This makes it possible to display different

aesthetics in different layers.
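Global and local mappings can be illustrated on a small hypothetical data frame. This is only a sketch: the values are invented, and method = "lm" (a straight line) is used instead of the default smoother because the sample is tiny.

```r
library(ggplot2)

# Hypothetical data (not from coi2006)
d <- data.frame(
  age    = c(25, 32, 41, 47, 53, 29, 36, 44, 50, 58),
  salnet = c(18000, 22000, 27000, 30000, 34000,
             17000, 20000, 23000, 26000, 28000),
  sex    = factor(rep(c("1", "2"), each = 5))
)

# Global mapping (x, y) is set in ggplot() and shared by both layers;
# the local mapping color = sex applies to the points only.
p <- ggplot(data = d, mapping = aes(x = age, y = salnet)) +
  geom_point(mapping = aes(color = sex)) +
  geom_smooth(method = "lm", se = FALSE)

print(p)
```

Since color is mapped only inside geom_point(), the points are colored by sex while a single line is fitted to all observations.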
Datavisualization
• Creating a ggplot : geometric objects

• Example:
• ggplot(data = mydata, mapping = aes(x = age,
y = salnet)) + geom_point(mapping =
aes(color = sex)) + geom_smooth()
Datavisualization
• Creating a ggplot : geometric objects

• Example:
• ggplot(data = mydata, mapping = aes(x = age,
y = salnet)) + geom_point(mapping =
aes(color = sex)) + geom_smooth(data =
filter(mydata, sex == "1"))
• #Here the smooth line is the one for men
Datavisualization
• Creating a ggplot : geometric objects

• Example:
• ggplot(data = mydata, mapping = aes(x = age,
y = salnet)) + geom_point(mapping =
aes(color = sex)) + geom_smooth(data =
filter(mydata, sex == "2"))
• #Here the smooth line is the one for women
Datavisualization
• Creating a ggplot : geometric objects

• Example:
• ggplot(data = mydata, mapping = aes(x = age,
y = salnet)) + geom_point(mapping =
aes(color = sex)) + geom_smooth(data =
filter(mydata, sex == "2"), se=FALSE)
• #Here the smooth line is the one for women
and we have removed the standard-error
representation
Datavisualization
• Creating a ggplot : geometric objects

• Practice 2

• Recreate the code necessary to have the

following graph:
Practice 2
Datavisualization
• Creating a ggplot : geometric objects

• A solution to Practice 2:
• ggplot(data = mydata, mapping = aes(x = age,
y = salnet, color=sex)) + geom_point() +
geom_smooth()

7 pages
Brief Introduction To R Kaustav Banerjee: Decision Sciences Area, IIM Lucknow
No ratings yet
Brief Introduction To R Kaustav Banerjee: Decision Sciences Area, IIM Lucknow
7 pages
R Complete
No ratings yet
R Complete
24 pages
DMPA Codes
No ratings yet
DMPA Codes
16 pages
R-Tutorial - Introduction
No ratings yet
R-Tutorial - Introduction
30 pages
Essential R Commands Guide
No ratings yet
Essential R Commands Guide
11 pages
Programming With R: Lecture #4
No ratings yet
Programming With R: Lecture #4
34 pages
R Short Tutorial
No ratings yet
R Short Tutorial
5 pages
R Code
No ratings yet
R Code
9 pages
R Studio Notes
No ratings yet
R Studio Notes
10 pages
R Reference Card
100% (4)
R Reference Card
4 pages
Session Set Working Directory Choose Directlry
No ratings yet
Session Set Working Directory Choose Directlry
17 pages
Arnold 1998
No ratings yet
Arnold 1998
17 pages
Computational Physics Lab: Writing Up: Laboratory Class Attendance
No ratings yet
Computational Physics Lab: Writing Up: Laboratory Class Attendance
3 pages
Chap-3 (Malware Analysis) (Sem-5)
No ratings yet
Chap-3 (Malware Analysis) (Sem-5)
22 pages
Millikan Oil Drop Experiment
No ratings yet
Millikan Oil Drop Experiment
6 pages
Minitab SPC
No ratings yet
Minitab SPC
11 pages
VDSL Tutorial
No ratings yet
VDSL Tutorial
10 pages
GE MR Principles
No ratings yet
GE MR Principles
98 pages
Bms Major Project
No ratings yet
Bms Major Project
11 pages
J Matchar 2021 110911
No ratings yet
J Matchar 2021 110911
14 pages
Physics Project Styrofoam Charge
100% (1)
Physics Project Styrofoam Charge
3 pages
PADS Tutorial
No ratings yet
PADS Tutorial
59 pages
GLC60 70VX 1
No ratings yet
GLC60 70VX 1
8 pages
NRB IT Mix MCQ
No ratings yet
NRB IT Mix MCQ
14 pages
Salt Harbour Part B Analysis-Easterly
No ratings yet
Salt Harbour Part B Analysis-Easterly
4 pages
RCM Program of Study 24-25
No ratings yet
RCM Program of Study 24-25
37 pages
Tutorial 1
No ratings yet
Tutorial 1
18 pages
Frequency Response Analysis: Sinusoidal Forcing of A First-Order Process
No ratings yet
Frequency Response Analysis: Sinusoidal Forcing of A First-Order Process
27 pages
Moisture Content Determination
No ratings yet
Moisture Content Determination
5 pages
Science 10 Second Grading Exam
No ratings yet
Science 10 Second Grading Exam
2 pages
Worksheet 8 Answers
No ratings yet
Worksheet 8 Answers
1 page
Extraction Notes
No ratings yet
Extraction Notes
16 pages
Statistics Summer Course
No ratings yet
Statistics Summer Course
49 pages
Power Plant Engineering
No ratings yet
Power Plant Engineering
20 pages
Types Roof Trusses: Building Technology 3 2012
100% (1)
Types Roof Trusses: Building Technology 3 2012
34 pages
Linear Regression
No ratings yet
Linear Regression
12 pages
St. Joseph's College of Engineering, Chennai-119 Department of Mechanical Engineering Sub. Name: Dynamics of Machinery Sub - Code: ME2302
No ratings yet
St. Joseph's College of Engineering, Chennai-119 Department of Mechanical Engineering Sub. Name: Dynamics of Machinery Sub - Code: ME2302
7 pages
Network Security
No ratings yet
Network Security
7 pages
SATB Cadences and Arrangements
No ratings yet
SATB Cadences and Arrangements
9 pages
Real-Life Applications of Linear Algebra
No ratings yet
Real-Life Applications of Linear Algebra
3 pages
Gamoyeneb. Tox.17.red
No ratings yet
Gamoyeneb. Tox.17.red
82 pages